The error is simple: your `delta` declaration must be inside the first `for` loop. Each time you accumulate the weighted differences between the training examples and the output, you should start accumulating from zero.
If you don't do this, you carry over the accumulated errors from the previous iteration, which reflect the previously learned version of `theta`, and that is incorrect. Place the declaration at the beginning of the first `for` loop.
You also seem to have an extraneous call to `computeCost`. I assume it calculates the cost function at each iteration given the current parameters, so I'm going to create a new output array called `costs` that records the cost at each iteration. I'll also call this function and assign the result to the corresponding element of that array:
```
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations, 1); %// New - one cost value per iteration
    %delta = zeros(2,1);          %// Remove
    for iter = 1 : iterations
        delta = zeros(2,1);       %// Place here
        for i = 1 : m
            delta(1,1) = delta(1,1) + (X(i,:)*theta - y(i,1));
            delta(2,1) = delta(2,1) + ((X(i,:)*theta - y(i,1))*X(i,2));
        end
        theta = theta - (delta*(alpha/m));
        costs(iter) = computeCost(X, y, theta); %// New
    end
end
```
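If it helps, here is a rough Python/NumPy translation of the corrected loop. The `compute_cost` helper and the toy data are my own assumptions for illustration, not part of the original assignment:

```python
import numpy as np

def compute_cost(X, y, theta):
    # Mean squared error cost: J(theta) = 1/(2m) * sum((X*theta - y)^2)
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)

def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    costs = np.zeros(iterations)
    for it in range(iterations):
        delta = np.zeros_like(theta)   # reset the accumulator EACH iteration
        for i in range(m):
            err = X[i] @ theta - y[i]  # residual for example i
            delta[0] += err            # gradient term for the intercept
            delta[1] += err * X[i, 1]  # gradient term for the slope
        theta = theta - (alpha / m) * delta
        costs[it] = compute_cost(X, y, theta)
    return theta, costs

# Toy data: y = 1 + 2x exactly, with a bias column prepended to X
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
theta, costs = gradient_descent(X, y, np.zeros(2), alpha=0.1, iterations=2000)
```

With the reset in place, `theta` converges to the true intercept and slope and the recorded costs decrease monotonically; without it, the stale accumulator corrupts every update after the first.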
A Note on Full Vectorization
FWIW, I don't consider this implementation fully vectorized. You can eliminate the second `for` loop with vectorized operations. Before we do that, let me cover some theory so we're on the same page. You are using gradient descent here to solve linear regression. We want to find the best parameters `theta`, our linear regression coefficients, that minimize this cost function:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2$$
Here, `m` is the number of training examples available, `x^{(i)}` is the i-th training example, and `y^{(i)}` is the ground-truth value associated with the i-th training example. `h` is our hypothesis, and it is given as:
$$h_{\theta}(x^{(i)}) = \theta^{T}x^{(i)} = \theta_0 + \theta_1 x^{(i)}$$
Note that in the context of linear regression in 2D, there are only two values in `theta` that we want to compute: the intercept term and the slope.
We can minimize the cost function `J` to determine the best regression coefficients, i.e. those that give predictions minimizing the error over the training set. Specifically, starting from some initial `theta` parameters... usually a vector of zeros, we iterate from 1 up to as many iterations as we deem necessary, and at each iteration we update our `theta` parameters by this relationship:
$$\theta_j := \theta_j - \alpha\,\frac{\partial}{\partial \theta_j} J(\theta)$$
For each parameter we want to update, you need to determine the gradient of the cost function with respect to that variable and evaluate it at the current value of `theta`. Working through the calculus, we get:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
If you're unsure how this derivation comes about, I refer you to this nice Mathematics Stack Exchange post that explains it:
https://math.stackexchange.com/questions/70728/partial-derivative-in-gradient-descent-for-two-variables
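As a sanity check on that derivative, you can compare the analytic per-parameter gradient against a central finite difference of `J`. This little NumPy sketch uses made-up data purely for illustration:

```python
import numpy as np

# Made-up data: m examples, a bias column plus one feature
rng = np.random.default_rng(0)
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = 3.0 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=m)
theta = np.array([0.2, -0.4])

def J(t):
    # Cost function J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2)
    return np.sum((X @ t - y) ** 2) / (2 * m)

# Analytic gradient: (1/m) * sum_i (h(x_i) - y_i) * x_j^(i) for each j
analytic = np.array([np.mean((X @ theta - y) * X[:, j]) for j in range(2)])

# Central finite differences of J in each coordinate direction
eps = 1e-6
numeric = np.zeros(2)
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    numeric[j] = (J(theta + e) - J(theta - e)) / (2 * eps)
```

The two gradients agree to numerical precision, which is a quick way to convince yourself the calculus above is right.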
Now... how do we apply this to our current problem? Specifically, you can compute the `delta` entries by analyzing all of the samples together in one go. That is, you can just do this:
```
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations, 1);
    for iter = 1 : iterations
        delta1 = theta(1) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,1)));
        delta2 = theta(2) - (alpha/m)*(sum((theta(1)*X(:,1) + theta(2)*X(:,2) - y).*X(:,2)));
        theta = [delta1; delta2];
        costs(iter) = computeCost(X, y, theta);
    end
end
```
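The same semi-vectorized update, sketched in Python/NumPy on a tiny made-up data set (y = 2 + 3x), just to show the inner loop collapsing into `sum` calls:

```python
import numpy as np

# Illustrative data only: exact line y = 2 + 3x, bias column in X
x = np.array([0.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x
theta = np.zeros(2)
alpha, m = 0.2, len(y)

for _ in range(3000):
    # h(x_i) - y_i for ALL examples at once - no inner loop over i
    err = theta[0] * X[:, 0] + theta[1] * X[:, 1] - y
    delta1 = theta[0] - (alpha / m) * np.sum(err * X[:, 0])
    delta2 = theta[1] - (alpha / m) * np.sum(err * X[:, 1])
    theta = np.array([delta1, delta2])
```

On this exact data, `theta` converges to the true intercept 2 and slope 3.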
The operations on `delta(1)` and `delta(2)` are each fully vectorized into a single expression. What you are doing is computing `theta^T * x^{(i)}` for every example `i` from 1, 2, ..., m, and that can conveniently be placed inside a single `sum` statement.
We can go even further and replace this with purely matrix operations. First, you can compute `theta^T * x^{(i)}` for every input sample `x^{(i)}` very quickly with matrix multiplication. Suppose that:
$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$
Here, `X` is our data matrix, consisting of `m` rows corresponding to the `m` training examples and `n+1` columns: a column of ones for the intercept term plus one column per feature. Likewise, `theta` is our learned weight vector from gradient descent, of length `n+1` to account for the intercept term.
If we calculate `X*theta`, we get:
$$X\theta = \begin{bmatrix} \theta^{T}x^{(1)} \\ \theta^{T}x^{(2)} \\ \vdots \\ \theta^{T}x^{(m)} \end{bmatrix} = \begin{bmatrix} h_{\theta}(x^{(1)}) \\ h_{\theta}(x^{(2)}) \\ \vdots \\ h_{\theta}(x^{(m)}) \end{bmatrix}$$
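To see this concretely, here is a tiny NumPy check (with made-up values) that `X @ theta` produces the same per-sample hypotheses as an explicit loop:

```python
import numpy as np

# Bias column plus one feature; values are made up for illustration
X = np.array([[1.0, 2.0],
              [1.0, 5.0],
              [1.0, 7.0]])
theta = np.array([1.0, 0.5])  # intercept 1, slope 0.5

h = X @ theta                 # all hypotheses in one matrix multiply
loop_h = np.array([X[i] @ theta for i in range(len(X))])  # one at a time
```

Both give the vector [2.0, 3.5, 4.5]: one hypothesis per training sample.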
As you can see, we've computed the hypothesis for every sample and placed them into a vector. Each element of this vector is the hypothesis for the i-th training sample. Now, recall what the gradient term of each parameter looks like in gradient descent:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
We want to implement this for all of the parameters in your learned vector at once, so putting these into a vector gives us:
$$\frac{\partial}{\partial \theta} J(\theta) = \frac{1}{m}\begin{bmatrix} \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_n^{(i)} \end{bmatrix}$$
Finally:
$$\frac{\partial}{\partial \theta} J(\theta) = \frac{1}{m}X^{T}\left(X\theta - y\right)$$
Therefore, since `y` is already a vector of length `m`, we can compute the gradient descent update at each iteration very compactly with:
```
theta = theta - (alpha/m)*X'*(X*theta - y);
```
... and so your code is now simply:
```
function [theta, costs] = gradientDescent(X, y, theta, alpha, iterations)
    m = length(y);
    costs = zeros(iterations, 1);
    for iter = 1 : iterations
        theta = theta - (alpha/m)*X'*(X*theta - y);
        costs(iter) = computeCost(X, y, theta);
    end
end
```
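For completeness, a NumPy sketch of the fully vectorized version (the data and learning rate are my own illustrative choices). On noisy linear data, the result should agree with the closed-form least-squares solution:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, iterations):
    # Fully vectorized update: theta <- theta - (alpha/m) * X'(X theta - y)
    m = len(y)
    costs = np.zeros(iterations)
    for it in range(iterations):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
        costs[it] = np.sum((X @ theta - y) ** 2) / (2 * m)
    return theta, costs

# Noisy line y = 4 - 1.5x; bias column prepended to X
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 4 - 1.5 * x + rng.normal(scale=0.05, size=100)

theta, costs = gradient_descent(X, y, np.zeros(2), alpha=0.5, iterations=5000)
ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form reference solution
```

Matching `theta` against `np.linalg.lstsq` is a handy end-to-end check that the vectorized update really minimizes the same cost.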