After watching Andrew Ng's Machine Learning lessons on Coursera, I wanted to practise gradient descent by myself.
So I generated some data first.
m = 10; % number of training examples
n = 1; % number of features
alpha = 0.01; % learning rate
X = [ones(m,1),(1:m)']; % input, with a leading column of ones
theta = zeros(n+1,1); % parameters, initialised to zero
correctTheta = [2;1]; % the true parameters to recover
y = (correctTheta'*X')'; % target values
function J = costFunction(m,X,y,theta)
J = (1/(2*m)) * sum(((theta'*X')'-y) .^ 2);
end
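As a quick sanity check on the cost function (a sketch reusing the setup above; the names J_best and J_zero are my own), the cost should be zero at correctTheta and positive at the all-zeros starting point:

```octave
% Sanity check for costFunction, using the same data as above.
m = 10;
X = [ones(m,1), (1:m)'];
correctTheta = [2; 1];
y = X * correctTheta;          % same values as (correctTheta'*X')'

function J = costFunction(m, X, y, theta)
  J = (1/(2*m)) * sum((X*theta - y) .^ 2);
end

J_best = costFunction(m, X, y, correctTheta); % 0 at the true parameters
J_zero = costFunction(m, X, y, zeros(2,1));   % positive anywhere else
printf("J_best = %g, J_zero = %g\n", J_best, J_zero);
```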
% And h(x) turns out to be theta1*x1 + theta2*x2 (with x1 = 1).
There are two ways to implement gradient descent for this h(x) in Octave. The first loops over the training examples:
delta = zeros(n+1,1); % accumulated gradient
for i = 1:m
delta += ((theta'*X(i,:)')'-y(i,:))'*(X(i,:))';
end
theta = theta - alpha/m .* delta;
And the second, fully vectorized:
theta = theta - (alpha / m) .* (((theta'*X')'-y)'*X)'; % it took me some time to work out this expression.
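A quick way to convince yourself the two forms compute the same gradient (a sketch reusing the data above; delta_vec is my own name) is to run both on the same theta and compare:

```octave
m = 10; n = 1;
X = [ones(m,1), (1:m)'];
correctTheta = [2; 1];
y = X * correctTheta;
theta = zeros(n+1, 1);

% Loop version: accumulate the gradient one example at a time.
delta = zeros(n+1, 1);
for i = 1:m
  delta += (X(i,:)*theta - y(i)) * X(i,:)';
end

% Vectorized version: the whole gradient in one expression.
% (((theta'*X')'-y)'*X)' simplifies to X' * (X*theta - y).
delta_vec = X' * (X*theta - y);

printf("max difference: %g\n", max(abs(delta - delta_vec)));
```

All the intermediate values here are integers, so the two results agree exactly, not just up to rounding.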
It's clear that the second is more convenient. But when I practised, I nearly went mad:
theta kept getting larger and eventually became Inf.
While puzzling over it, I recalled Andrew Ng saying that this is exactly what happens when alpha is too large.
So I changed the learning rate from 0.1 to 0.01, and to my relief, it finally worked!
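Here is a small reproduction of that behaviour (a sketch with the same data; the helper names cost and descend are my own): with alpha = 0.01 the cost shrinks toward zero, while with alpha = 0.1 it blows up, ending as Inf or NaN.

```octave
m = 10;
X = [ones(m,1), (1:m)'];
y = X * [2; 1];

function J = cost(m, X, y, theta)
  J = (1/(2*m)) * sum((X*theta - y) .^ 2);
end

% Run vectorized gradient descent from theta = 0 for a fixed number of steps.
function theta = descend(X, y, alpha, iters)
  m = rows(X);
  theta = zeros(columns(X), 1);
  for j = 1:iters
    theta = theta - (alpha/m) * (X' * (X*theta - y));
  end
end

J_small = cost(m, X, y, descend(X, y, 0.01, 1000)); % converges
J_large = cost(m, X, y, descend(X, y, 0.1, 1000));  % diverges: Inf or NaN
printf("alpha=0.01: J=%g   alpha=0.1: J=%g\n", J_small, J_large);
```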
And here is the full version, with n features and m training examples. (Be careful: as n increases, alpha must decrease. For example, with n = 10 and m = 100, alpha needs to be 0.000001; even 0.0001 is too large.)
m = (the number of training examples);
n = (the number of features);
alpha = (learning rate);
X = [ones(m,1),ceil(rand(m,n)*100)]; % random integer features in 1..100
theta = zeros(n+1,1);
correctTheta = ceil(rand(n+1,1)*10); % random true parameters in 1..10
y = (correctTheta'*X')';
function J = costFunction(m,X,y,theta)
J = (1/(2*m)) * sum(((theta'*X')'-y) .^ 2);
end
for j = 1:(times)
theta = theta - (alpha / m) .* (((theta'*X')'-y)'*X)';
end
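To make that concrete, here is the full version with the placeholders filled in as one example run (n = 10, m = 100, 1000 iterations; the tiny alpha matches the warning above, and J_before/J_after are my own names):

```octave
m = 100;          % number of training examples
n = 10;           % number of features
alpha = 0.000001; % learning rate; with these features, 0.0001 is too large
iters = 1000;     % number of iterations

X = [ones(m,1), ceil(rand(m,n)*100)];
correctTheta = ceil(rand(n+1,1)*10);
y = (correctTheta'*X')';
theta = zeros(n+1,1);

function J = costFunction(m,X,y,theta)
  J = (1/(2*m)) * sum(((theta'*X')'-y) .^ 2);
end

J_before = costFunction(m,X,y,theta);
for j = 1:iters
  theta = theta - (alpha/m) .* (((theta'*X')'-y)'*X)';
end
J_after = costFunction(m,X,y,theta);
printf("cost: %g -> %g\n", J_before, J_after);
```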
Expand: if we want to accelerate convergence, we can use feature scaling and mean normalization (I may have the names slightly wrong).
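That idea can be sketched like this (assuming the n = 10, m = 100 setup above; mu, sigma and X_norm are my own names). Each feature gets its mean subtracted and is divided by its standard deviation, skipping the all-ones intercept column:

```octave
% Mean normalization + feature scaling: (x - mean) / std, per feature.
m = 100; n = 10;
X = [ones(m,1), ceil(rand(m,n)*100)];

mu = mean(X(:, 2:end));    % 1 x n row of feature means
sigma = std(X(:, 2:end));  % 1 x n row of feature standard deviations
X_norm = X;
X_norm(:, 2:end) = (X(:, 2:end) - mu) ./ sigma; % broadcast over rows
```

After this every feature has mean about 0 and standard deviation 1, so the features are on comparable scales and a much larger alpha converges quickly.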