Is this implementation of stochastic gradient descent correct?

I am trying to implement stochastic gradient descent, but I am not sure it is 100% correct.

  • The cost produced by my stochastic gradient descent algorithm is sometimes very far from the cost produced by fminunc or by batch gradient descent.
  • While batch gradient descent converges when I set the learning rate alpha to 0.2, I have to set alpha to 0.0001 for my stochastic implementation so that it does not diverge. Is this normal?

Here are some results I got with a training set of 10,000 examples and num_iter = 100 or 500:

FMINUNC:
Iteration #100 | Cost: 5.147056e-001

BATCH GRADIENT DESCENT, 500 ITER:
Iteration #500 - Cost = 5.535241e-001

STOCHASTIC GRADIENT DESCENT, 100 ITER:
Iteration #100 - Cost = 5.683117e-001   % First run
Iteration #100 - Cost = 7.047196e-001   % Second run

Implementation of batch gradient descent for logistic regression

J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    % Compute cost and gradient over the whole training set, then take one step
    [J, gradJ] = lrCostFunction(theta, X, y, lambda);
    theta = theta - alpha * gradJ;
    J_history(iter) = J;
    fprintf('Iteration #%d - Cost = %d... \r\n', iter, J_history(iter));
end

Implementation of stochastic gradient descent for logistic regression

% number of training examples
m = length(y);

% STEP 1: we shuffle the data
data = [y, X];
data = data(randperm(size(data,1)), :);
y = data(:,1);
X = data(:,2:end);

for iter = 1:num_iters
    for i = 1:m
        x = X(i,:);  % select one example
        [J, gradJ] = lrCostFunction(theta, x, y(i,:), lambda);
        theta = theta - alpha * gradJ;
    end
    % J here is the cost of the last single example of the pass
    J_history(iter) = J;
    fprintf('Iteration #%d - Cost = %d... \r\n', iter, J);
end
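(Side note on the snippet above: J_history(iter) stores the cost of the last single example in the pass, not the cost over the whole training set, which may be part of why the reported cost varies between runs. A minimal sketch, reusing the same lrCostFunction, that records the full-training-set cost after each pass instead:)

for iter = 1:num_iters
    for i = 1:m
        x = X(i,:);  % select one example
        [~, gradJ] = lrCostFunction(theta, x, y(i,:), lambda);
        theta = theta - alpha * gradJ;
    end
    % Evaluate the cost over the FULL training set once per pass
    J_history(iter) = lrCostFunction(theta, X, y, lambda);
    fprintf('Iteration #%d - Cost = %d... \r\n', iter, J_history(iter));
end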

For reference, here is the logistic regression cost function used in my example

function [J, grad] = lrCostFunction(theta, X, y, lambda)
    m = length(y);  % number of training examples

    % We calculate J
    hypothesis = sigmoid(X*theta);
    costFun = (-y.*log(hypothesis) - (1-y).*log(1-hypothesis));
    J = (1/m) * sum(costFun) + (lambda/(2*m))*sum(theta(2:length(theta)).^2);

    % We calculate grad using the partial derivatives
    beta = (hypothesis - y);
    grad = (1/m)*(X'*beta);
    temp = theta;
    temp(1) = 0;  % because we don't regularize the intercept term (j = 0)
    grad = grad + (lambda/m)*temp;
    grad = grad(:);
end
matlab machine-learning logistic-regression gradient-descent
3 answers

It looks fine. If you are worried about choosing an appropriate learning rate alpha, you should consider using a line search method.

Line search is a method that selects an optimal learning rate for gradient descent at each iteration, which is better than using a fixed learning rate throughout the whole optimization process. The optimal value for the learning rate alpha is the one that locally (starting from the current theta and moving in the direction of the negative gradient) minimizes the cost function.

At each iteration of gradient descent, start with the learning rate alpha = 0 and gradually increase alpha by a fixed step, e.g. deltaAlpha = 0.01. Recompute the theta parameters and evaluate the cost function. Since the cost function is convex, as alpha increases (i.e. as you move further in the direction of the negative gradient) the cost will first decrease and then, at some point, start to increase. At that moment stop the line search and take the last alpha before the cost started to increase. Then update theta with that alpha. If the cost function never starts to increase, stop at alpha = 1. A sketch of this procedure follows.
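Here is a minimal MATLAB sketch of one gradient step with this line search, assuming the lrCostFunction from the question is on the path (the function name gradientStepWithLineSearch is just illustrative):

function [theta, bestAlpha] = gradientStepWithLineSearch(theta, X, y, lambda)
    % One batch gradient step where alpha is chosen by line search
    deltaAlpha = 0.01;                    % fixed search step for alpha
    [bestJ, gradJ] = lrCostFunction(theta, X, y, lambda);  % cost at alpha = 0
    bestAlpha = 0;
    for alpha = deltaAlpha:deltaAlpha:1   % try alpha = 0.01, 0.02, ..., 1
        J = lrCostFunction(theta - alpha * gradJ, X, y, lambda);
        if J > bestJ
            break;                        % cost started to increase: stop searching
        end
        bestJ = J;
        bestAlpha = alpha;
    end
    theta = theta - bestAlpha * gradJ;    % update theta with the chosen alpha
end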

Note: for large regularization coefficients (lambda = 100, lambda = 1000) it is possible that deltaAlpha is too large and gradient descent diverges. If so, decrease deltaAlpha by a factor of 10 (deltaAlpha = 0.001, deltaAlpha = 0.0001) until you reach a deltaAlpha for which gradient descent converges.

In addition, you should consider using some termination condition other than the number of iterations, e.g. stopping when the difference between the costs of two consecutive iterations becomes sufficiently small (less than some epsilon). A sketch follows.
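For example, a minimal sketch of such a stopping rule on top of the batch loop from the question (the tolerance value 1e-6 is only illustrative):

% Assumes theta, X, y, lambda, alpha, num_iters as in the batch loop above
epsilon = 1e-6;                  % convergence tolerance (illustrative)
prevJ = Inf;
for iter = 1:num_iters           % num_iters now only acts as an upper bound
    [J, gradJ] = lrCostFunction(theta, X, y, lambda);
    theta = theta - alpha * gradJ;
    if abs(prevJ - J) < epsilon
        fprintf('Converged after %d iterations.\r\n', iter);
        break;
    end
    prevJ = J;
end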


There is a reason for the low learning rate. In short: when the learning rate decreases at an appropriate rate, and under relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem. A sketch of such a decreasing schedule follows the reference below.

Robbins, Herbert; Siegmund, David O. (1971). "A convergence theorem for non negative almost supermartingales and some applications". In Rustagi, Jagdish S. (ed.), Optimizing Methods in Statistics. Academic Press.
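A minimal sketch of such a schedule applied to the stochastic loop from the question (alpha0 = 0.2 and the 1/t decay are illustrative choices that satisfy the usual conditions sum(alpha_t) = inf and sum(alpha_t^2) < inf, not something prescribed by the theorem):

% Assumes X, y, theta, lambda, num_iters, m and lrCostFunction as in the question
alpha0 = 0.2;                      % initial learning rate (illustrative)
t = 0;                             % global update counter
for iter = 1:num_iters
    for i = 1:m
        t = t + 1;
        alpha = alpha0 / (1 + t);  % decreasing rate: sum(alpha) diverges, sum(alpha.^2) converges
        [~, gradJ] = lrCostFunction(theta, X(i,:), y(i,:), lambda);
        theta = theta - alpha * gradJ;
    end
end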


The learning rate is always between 0 and 1. If you set the learning rate very high, the algorithm follows the objective less closely because it keeps skipping over (overshooting) the minimum. So take a small learning rate, even though it takes more time. The result will be more reliable.

