I tried to compare gradient() in the OP code with the numerical derivative of cost_j() (which is the target function being minimized), using the following procedure:
```julia
function grad_num( theta, x, y )
    g = zeros( 3 )
    eps = 1.0e-6
    disp = zeros( 3 )
    for k = 1:3
        disp[:] = theta[:]
        disp[ k ] = theta[ k ] + eps
        plus = cost_j( disp, x, y )
        disp[ k ] = theta[ k ] - eps
        minus = cost_j( disp, x, y )
        g[ k ] = ( plus - minus ) / ( 2.0 * eps )   # central difference
    end
    return g
end
```
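For reference, a quick comparison can be run like this (gradient, cost_j, x, and y are assumed to be defined as in the OP code; the test point theta0 is an arbitrary choice):

```julia
theta0 = [0.0, 0.0, 0.0]              # arbitrary test point (plain Vector)
println( gradient( theta0, x, y ) )   # analytic gradient from the OP code
println( grad_num( theta0, x, y ) )   # central-difference approximation
```

If gradient() were correct, the two printed vectors should agree closely (the central difference has only O(eps^2) truncation error).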
But the gradient values obtained from the two routines do not seem to agree very well (at least for the initial stage of minimization) ... Therefore, I manually derived the gradient of cost_j( theta, x, y ), from which it seems that the division by m is absent in the OP code:

$$\nabla_\theta J(\theta) \;=\; \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} \;=\; \frac{1}{m}\, x^\top \left( \sigma(x\theta) - y \right),$$

whereas gradient() in the OP code returns the same expression without the $1/m$ factor.
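For concreteness, here is a minimal sketch of what the adjusted gradient would look like (the name gradient_adjusted is mine, not the OP's; the sigmoid hypothesis is assumed from the OP's setup, with theta and y as plain vectors):

```julia
# Hypothetical adjusted gradient; the only change vs. the OP's gradient()
# is the division by m.
function gradient_adjusted( theta, x, y )
    m = size( x, 1 )                            # number of training examples
    h = 1.0 ./ ( 1.0 .+ exp.( -x * theta ) )    # sigmoid hypothesis h_theta(x)
    return ( x' * ( h - y ) ) / m               # OP's expression, divided by m
end
```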
Because I'm not very sure that the code and expression above are correct, could you check them yourself ...?
But in fact, regardless of whether I use the original or the adjusted gradient, the program converges to the same minimum value (0.2034977016, almost the same as that obtained from Optim), because the two gradients differ only by a constant multiplicative factor! Since convergence was very slow, I also adapted the stepsize alpha, following Vincent's suggestion (here I used more moderate values for acceleration / deceleration):
```julia
function gradient_descent(x, y, theta, alpha, n_iterations)
    ...
    c = cost_j( theta, x, y )
    for i = 1:n_iterations
        c_prev = c
        c = cost_j( theta, x, y )
        if c - c_prev < 0.0
            alpha *= 1.01   # accelerate while the cost keeps decreasing
        else
            alpha /= 1.05   # decelerate when the cost increases
        end
        theta[:] = theta - alpha * gradient(theta, x, y)
    end
    ...
end
```
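The asymmetry of the two factors is deliberate: alpha grows slowly (×1.01) as long as the cost keeps decreasing and shrinks faster (÷1.05) as soon as it increases, so the stepsize hovers near the largest value that still makes progress; this is essentially the classic "bold driver" heuristic.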
and called this procedure as:

```julia
optimal_theta = gradient_descent( x, y, [0 0 0]', 1.5e-3, 10^7 )[ 1 ]
```
Below is a plot of cost_j versus the number of iterations.