Poor gradient descent performance in linear regression

I have implemented a simple (single-variable) linear regression example in C++ to help me understand the concepts. I'm fairly sure the core algorithm is right, but my performance is terrible.

This is the method that actually performs the gradient descent:

    void LinearRegression::BatchGradientDescent(std::vector<std::pair<int,int>>& data, float& theta1, float& theta2)
    {
        float weight = (1.0f / static_cast<float>(data.size()));
        float theta1Res = 0.0f;
        float theta2Res = 0.0f;

        for (auto p : data) {
            float cost = Hypothesis(p.first, theta1, theta2) - p.second;
            theta1Res += cost;
            theta2Res += cost * p.first;
        }

        theta1 = theta1 - (m_LearningRate * weight * theta1Res);
        theta2 = theta2 - (m_LearningRate * weight * theta2Res);
    }

With the other key functions given as:

    float LinearRegression::Hypothesis(float x, float theta1, float theta2) const
    {
        return theta1 + x * theta2;
    }

    float LinearRegression::CostFunction(std::vector<std::pair<int,int>>& data, float theta1, float theta2) const
    {
        float error = 0.0f;
        for (auto p : data) {
            float prediction = (Hypothesis(p.first, theta1, theta2) - p.second);
            error += prediction * prediction;
        }
        error *= 1.0f / (data.size() * 2.0f);
        return error;
    }

    void LinearRegression::Regress(std::vector<std::pair<int,int>>& data)
    {
        for (unsigned int itr = 0; itr < MAX_ITERATIONS; ++itr) {
            BatchGradientDescent(data, m_Theta1, m_Theta2);
            // Some visualisation code
        }
    }

Now the problem is that if the learning rate is any larger than about 0.000001, the value of the cost function after gradient descent is higher than before. In other words, the algorithm works in reverse. If the rate is small enough to converge, the fitted line quickly becomes a straight line through the origin, but it then takes millions of iterations to actually reach a well-fitting line.

With a learning rate of 0.01, after six iterations the output is (where the difference is the change in cost between iterations):

    Cost before 102901.945312, cost after 517539430400.000000, difference 517539332096.000000
    Cost before 517539430400.000000, cost after 3131945127824588800.000000, difference 3131944578068774912.000000
    Cost before 3131945127824588800.000000, cost after 18953312418560698826620928.000000, difference 18953308959796185006080000.000000
    Cost before 18953312418560698826620928.000000, cost after 114697949347691988409089177681920.000000, difference 114697930004878874575022382383104.000000
    Cost before 114697949347691988409089177681920.000000, cost after inf, difference inf
    Cost before inf, cost after inf, difference nan

In the example below, the thetas start at zero, the learning rate is 0.000001, and there have been 8,000,000 iterations! The visualization code only updates the graph once every 100,000 iterations.

[Image: plot of the data points and the fitted line produced by the visualisation code]

Function that creates data points:

    static void SetupRegressionData(std::vector<std::pair<int,int>>& data)
    {
        srand(time(NULL));

        for (int x = 50; x < 750; x += 3) {
            data.push_back(std::pair<int,int>(x + (rand() % 100), 400 + (rand() % 100)));
        }
    }

In short, if my learning rate is too high, the gradient descent algorithm effectively runs in reverse and diverges towards infinity, and if it is lowered to the point where it actually converges towards a minimum, the number of iterations required to do so is unacceptably high.

Am I missing something or making a mistake in the core algorithm?

1 answer

It seems that everything is behaving as expected, but you are having trouble choosing a reasonable learning rate. This is not an entirely trivial problem, and there are many approaches, ranging from predefined schedules that progressively reduce the learning rate (see, for example, this paper) to adaptive methods such as AdaGrad or AdaDelta.
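To make the first idea concrete, here is a minimal sketch (not part of the original code) of a step-decay schedule and an AdaGrad-style update for the two parameters. The helper names, the decay constants and the base rate are made up for illustration and would need tuning for the data above; `gradTheta1`/`gradTheta2` stand for the same averaged gradients that the question's `BatchGradientDescent` computes.

    #include <cmath>

    // Step decay: every `decayEvery` iterations the learning rate is
    // multiplied by `decayFactor`.
    float DecayedLearningRate(float initialRate, unsigned int iteration,
                              unsigned int decayEvery = 100000,
                              float decayFactor = 0.5f)
    {
        return initialRate * std::pow(decayFactor,
                                      static_cast<float>(iteration / decayEvery));
    }

    // AdaGrad-style update: each parameter gets its own effective step size,
    // scaled by the square root of its accumulated squared gradients.
    void AdaGradStep(float& theta1, float& theta2,
                     float gradTheta1, float gradTheta2,
                     float& accum1, float& accum2,
                     float baseRate = 0.1f, float epsilon = 1e-8f)
    {
        accum1 += gradTheta1 * gradTheta1;
        accum2 += gradTheta2 * gradTheta2;
        theta1 -= (baseRate / std::sqrt(accum1 + epsilon)) * gradTheta1;
        theta2 -= (baseRate / std::sqrt(accum2 + epsilon)) * gradTheta2;
    }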

For your vanilla implementation with a fixed learning rate, you can make your life easier by normalizing the data to zero mean and unit standard deviation before feeding it to the gradient descent algorithm. That way you can reason about the learning rate much more easily. You can then simply rescale your predictions accordingly.
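As a concrete illustration of the normalization suggestion, here is a minimal sketch (not from the answer itself) that standardizes the x values. It assumes the regression code is changed to accept float pairs, and `NormalizeX` is a made-up helper name.

    #include <cmath>
    #include <utility>
    #include <vector>

    // Standardize the x values to zero mean and unit standard deviation.
    // The mean and std are returned so predictions can later be made on
    // inputs given in the original scale.
    std::vector<std::pair<float, float>> NormalizeX(
        const std::vector<std::pair<int, int>>& data, float& meanX, float& stdX)
    {
        meanX = 0.0f;
        for (const auto& p : data) meanX += p.first;
        meanX /= static_cast<float>(data.size());

        float var = 0.0f;
        for (const auto& p : data) {
            const float d = p.first - meanX;
            var += d * d;
        }
        stdX = std::sqrt(var / static_cast<float>(data.size()));

        std::vector<std::pair<float, float>> normalized;
        normalized.reserve(data.size());
        for (const auto& p : data) {
            normalized.emplace_back((p.first - meanX) / stdX,
                                    static_cast<float>(p.second));
        }
        return normalized;
    }

    // Prediction on the original scale then applies the same transform:
    // prediction = theta1 + ((x - meanX) / stdX) * theta2;

With x on the order of ±1 or ±2 instead of roughly 50 to 850, a much larger learning rate (for example 0.1) becomes usable, and convergence should take far fewer iterations than the millions reported in the question.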
