Why is my Gradient wrong (Coursera, Logistic Regression, Julia)?

I am porting the Logistic Regression exercise from Coursera to Julia, but I cannot get it to work.

Here is my Julia code for calculating the gradient:

    sigmoid(z) = 1 / (1 + e ^ -z)

    hypotesis(theta, x) = sigmoid(scalar(theta' * x))

    function gradient(theta, x, y)
        (m, n) = size(x)
        h = [hypotesis(theta, x[i,:]') for i in 1:m]
        g = Array(Float64, n, 1)
        for j in 1:n
            g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m])
        end
        g
    end

When this gradient is used, it produces incorrect results, and I can't understand why: the code seems correct to me.

Here is the full Julia script. In it, I compute the optimal theta both with my implementation of gradient descent and with the Optim package, and the results are different.
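For reference, the comparison is essentially of this form (a sketch only, not the exact code from the script; it assumes the current Optim.jl API, which may differ from the version I used, and the cost_j cost function from my script):

    using Optim

    theta0 = zeros(3)

    # minimize the same cost function with Optim (derivative-free, for comparison)
    res = optimize(t -> cost_j(t, x, y), theta0, NelderMead())
    theta_optim = Optim.minimizer(res)

    # the theta from my gradient descent and theta_optim should agree, but they do not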

2 answers

I tried to compare gradient() in the OP's code against the numerical derivative of cost_j() (the objective function being minimized), using the following routine:

    function grad_num(theta, x, y)
        g = zeros(3)
        eps = 1.0e-6
        disp = zeros(3)
        for k = 1:3
            disp[:] = theta[:]
            disp[k] = theta[k] + eps
            plus = cost_j(disp, x, y)
            disp[k] = theta[k] - eps
            minus = cost_j(disp, x, y)
            g[k] = (plus - minus) / (2.0 * eps)
        end
        return g
    end
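The check itself is just evaluating both routines at the same point, something like this (an illustrative call only, in current Julia syntax, assuming x, y, and the OP's gradient() are in scope):

    theta_test = [0.1, -0.2, 0.3]
    println(gradient(theta_test, x, y))
    println(grad_num(theta_test, x, y))   # these should match if the analytic gradient is right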

But the gradient values obtained from the two routines did not seem to agree very well (at least in the initial stage of the minimization)... So I derived the gradient of cost_j( theta, x, y ) by hand, and it appears that the division by m is missing:

    # OP code
    g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m])

    # modified code
    g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m]) / m
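For reference, with the usual Coursera definition of the cost (which is what I assume cost_j computes), the derivation shows the 1/m factor explicitly:

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]

    \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}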

Because I'm not completely sure that the code and the expression above are correct, could you check them yourself?

In fact, regardless of whether I use the original or the corrected gradient, the program converges to the same minimum value (0.2034977016, almost the same as the one obtained from Optim), because the two gradients differ only by a multiplicative factor! Since convergence was very slow, I also made the step size alpha adaptive, following Vincent's suggestion (here I used more moderate values for the acceleration/deceleration factors):

    function gradient_descent(x, y, theta, alpha, n_iterations)
        ...
        c = cost_j(theta, x, y)
        for i = 1:n_iterations
            c_prev = c
            c = cost_j(theta, x, y)
            if c - c_prev < 0.0
                alpha *= 1.01
            else
                alpha /= 1.05
            end
            theta[:] = theta - alpha * gradient(theta, x, y)
        end
        ...
    end

and called this routine as follows:

 optimal_theta = gradient_descent( x, y, [0 0 0]', 1.5e-3, 10^7 )[ 1 ] 

Below is a plot of cost_j versus the number of iterations.


The gradient is correct (up to a scalar multiple, as @roygvib points out). The problem is with the gradient descent.

If you look at the values of the cost function during the gradient descent, you will see a lot of NaNs, which probably come from the exponential: reducing the step size (for example, to 1e-5) will avoid the overflow, but you will have to increase the number of iterations a lot (perhaps to 10_000_000).
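Another way to avoid the NaNs, independent of the step size, is to compute the per-sample loss directly from z = theta' * x with the standard log1p trick, instead of taking the log of a sigmoid that has already rounded to 0 or 1. A sketch, not the code from the question (it assumes theta is a plain vector):

    using LinearAlgebra   # for dot

    # log(1 + exp(z)) - y*z is the same quantity as -[y*log(h) + (1-y)*log(1-h)]
    # with h = sigmoid(z), but it never evaluates log(0).
    stable_loss(z, y) = max(z, 0.0) + log1p(exp(-abs(z))) - y * z

    function cost_stable(theta, x, y)
        m = size(x, 1)
        return sum(stable_loss(dot(x[i, :], theta), y[i]) for i in 1:m) / m
    end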

A better (faster) solution would be to adapt the step size. For instance, you could multiply the step size by 1.1 if the cost function improves after a step (the optimum still looks far away in that direction: we can move faster), and divide it by 2 if it does not (we moved too fast and overshot the minimum).
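A minimal sketch of that rule (1.1 and 2 are just the example factors above; cost_j and gradient are assumed from the question, with theta passed as a column vector as in the OP's script; here a worsening step is also rejected, which is one possible variant):

    function adaptive_gd(x, y, theta, alpha, n_iterations)
        c = cost_j(theta, x, y)
        for i in 1:n_iterations
            step = alpha * gradient(theta, x, y)
            c_new = cost_j(theta - step, x, y)
            if c_new < c
                theta -= step        # the step helped: keep it and speed up
                c = c_new
                alpha *= 1.1
            else
                alpha /= 2           # we went too far: shrink the step and retry
            end
        end
        return theta
    end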

You could also do a line search in the direction of the gradient to find the best step size (but this is time-consuming and can be replaced by approximations, e.g., the Armijo rule).
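A backtracking version with the Armijo (sufficient decrease) condition could look roughly like this (again a sketch, assuming the cost_j and gradient routines discussed above):

    function armijo_step(theta, x, y; alpha0 = 1.0, beta = 0.5, c = 1e-4)
        g  = gradient(theta, x, y)
        f0 = cost_j(theta, x, y)
        alpha = alpha0
        # shrink alpha until the decrease is at least c * alpha * ||g||^2
        while cost_j(theta - alpha * g, x, y) > f0 - c * alpha * sum(abs2, g)
            alpha *= beta
        end
        return theta - alpha * g
    end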

Rescaling the predictor variables also helps.
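For example, standardizing each feature column before running gradient descent (a standard preprocessing step, not from the original script; an intercept column of ones, if present, should be left out of this):

    using Statistics   # for mean and std

    mu    = mean(x, dims = 1)
    sigma = std(x, dims = 1)
    x_scaled = (x .- mu) ./ sigma   # each column now has zero mean and unit variance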

