m = number of training examples
h (hypothesis) = the function that maps inputs x to predicted outputs
Linear regression with one variable (univariate linear regression).
Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
Cost Function:
squared error function: J(θ0, θ1) = (1/2m) · Σ (hθ(x^(i)) − y^(i))², summed over i = 1..m
minimize J(θ0, θ1): the overall objective for linear regression
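As a concrete sketch (Python with numpy; the function name and toy data are made up for illustration), the squared-error cost can be computed like this:

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)                          # number of training examples
    predictions = theta0 + theta1 * x   # h_theta(x) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# toy data, just for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.0, 8.2])
print(compute_cost(0.0, 2.0, x, y))     # small J: theta1 = 2 already fits this data well
```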
Gradient Descent
outline:
start with some θ0, θ1 (e.g. initialize them to 0)
(in general the result depends on where you start on the graph; however, for linear regression J(θ0, θ1) is a convex function, so gradient descent always reaches the global minimum)
repeat until convergence {
    θj := θj − α · ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1)
}
∂/∂θj J(θ0, θ1) is the derivative term
α is the learning rate
*simultaneous update:
temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)
θ0 := temp0
θ1 := temp1
If you do not update them simultaneously, it may still happen to work, but it is not what we call the gradient descent algorithm.
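A minimal sketch of the loop with the simultaneous update written out (Python with numpy; names and toy data are my own):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=2000):
    """Batch gradient descent for univariate linear regression."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                           # initialize both parameters to 0
    for _ in range(num_iters):
        error = theta0 + theta1 * x - y                 # h_theta(x) - y for every example
        temp0 = theta0 - alpha * np.sum(error) / m      # compute BOTH temps first ...
        temp1 = theta1 - alpha * np.sum(error * x) / m
        theta0, theta1 = temp0, temp1                   # ... then assign: simultaneous update
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.0, 8.2])
print(gradient_descent(x, y))                           # roughly theta0 ≈ -0.05, theta1 ≈ 2.05
```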
choosing the learning rate α:
if α is too small, gradient descent can be slow.
if α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
GD can converge to a local minimum even with the learning rate α fixed, because as we approach a local minimum the derivative term gets smaller, so GD automatically takes smaller steps.
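One way to see both failure modes, using the same toy data as above (the specific α values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.0, 8.2])
m = len(y)

# run a fixed number of steps with different learning rates and compare the final cost
for alpha in (0.001, 0.01, 0.1, 1.0):
    theta0, theta1 = 0.0, 0.0
    for _ in range(100):
        error = theta0 + theta1 * x - y
        theta0, theta1 = (theta0 - alpha * error.sum() / m,
                          theta1 - alpha * (error * x).sum() / m)
    J = ((theta0 + theta1 * x - y) ** 2).sum() / (2 * m)
    print(f"alpha={alpha}: J={J:.4g}")   # tiny alpha: still far from minimum; alpha=1.0: diverges here
```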
Linear Regression with Multiple Features
x_j^(i) is the value of feature j in the i-th training example
hθ(x) = θ0·x0 + θ1·x1 + ... + θn·xn = θ^T x
(θ and x both have n + 1 elements; define x0 = 1)
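With x0 = 1 prepended, the hypothesis is just a matrix-vector product; a sketch in numpy (the example numbers are made up):

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x for every row of X; column 0 of X is the x0 = 1 term."""
    return X @ theta

# n = 2 features plus x0 = 1, so theta has n + 1 = 3 elements
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0]])
theta = np.array([80.0, 0.1, 50.0])
print(hypothesis(theta, X))   # one prediction per training example
```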
feature scaling
if different features take on similar ranges of values, GD will converge more quickly.
It speeds up gradient descent by making it require fewer iterations to get to a good solution.
a range of roughly −3 to 3, or −1/3 to 1/3, is fine;
otherwise (e.g. a feature x1 that ranges up to 10000) rescale it:
get every feature into approximately a −1 ≤ xi ≤ 1 range (divide by the range, or by the standard deviation)
mean normalization
makes features have approximately zero mean
to implement both measures above:
x = (value - average_value)/(max_value - min_value)
feature scaling doesn’t have to be too exact
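A sketch of both measures applied together (dividing by the range here; dividing by the standard deviation also works; apply it to the raw features, not to the x0 = 1 column):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature (column) to roughly [-1, 1] with approximately zero mean."""
    mu = X.mean(axis=0)                            # average value per feature
    feature_range = X.max(axis=0) - X.min(axis=0)  # max_value - min_value per feature
    return (X - mu) / feature_range                # x := (value - average) / (max - min)

X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [ 852.0, 1.0]])
print(mean_normalize(X))                           # every column is now roughly in [-1, 1]
```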
judging convergence
There is an automatic convergence test (e.g. declare convergence when J decreases by less than some small threshold ε in one iteration), but the threshold ε is not easy to choose, so it is better to plot J against the number of iterations.
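One practical way to do this: record J after every iteration, then plot it (or apply a threshold test, keeping in mind that the threshold is hard to choose). A sketch, reusing the toy data above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.0, 8.2])
m, alpha = len(y), 0.1
theta0, theta1 = 0.0, 0.0
history = []                                       # J after every iteration, for plotting

for it in range(2000):
    error = theta0 + theta1 * x - y
    theta0, theta1 = (theta0 - alpha * error.sum() / m,
                      theta1 - alpha * (error * x).sum() / m)
    J = ((theta0 + theta1 * x - y) ** 2).sum() / (2 * m)
    history.append(J)
    # example automatic test: stop when J decreases by less than 1e-6 in one iteration
    if it > 0 and history[-2] - history[-1] < 1e-6:
        break

print(f"stopped after {it + 1} iterations, J = {history[-1]:.4f}")
# plotting `history` against the iteration number shows whether J is still decreasing
```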