2-1 Model representation
Linear regression with one variable
2-2 Cost function
Also called the squared error function.
- Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
- Parameters: $\theta_0, \theta_1$
- Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
- Goal: $\underset{\theta_0, \theta_1}{\text{minimize}}\; J(\theta_0, \theta_1)$
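As a quick sanity check, here is a minimal Python (NumPy) sketch of this cost function; the function name `compute_cost` and the toy data are illustrative assumptions, not part of the course material.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) for linear regression."""
    m = len(y)                           # number of training examples
    predictions = theta0 + theta1 * x    # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data: three points on the line y = 2x, so J(0, 2) should be exactly 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))  # 0.0
print(compute_cost(0.0, 0.0, x, y))  # (4 + 16 + 36) / 6 ≈ 9.33
```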
2-3 Gradient descent
Have some function: $J(\theta_0, \theta_1)$
Want: $\min J(\theta_0, \theta_1)$
Outline:
- Start with some $\theta_0, \theta_1$
- Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum, or maybe just a local minimum
Gradient descent algorithm
repeat until convergence
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad (\text{for } j = 0 \text{ and } j = 1)$$
- “:=” denotes assignment; it is different from the truth assertion “=”.
- “α” is the learning rate: it controls how big a step we take downhill with gradient descent. A very large α corresponds to a very aggressive gradient descent procedure.
Correct: Simultaneous update
- $temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
- $temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
- $\theta_0 := temp0$
- $\theta_1 := temp1$
Incorrect:
- $temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
- $\theta_0 := temp0$
- $temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
- $\theta_1 := temp1$
This version is wrong because $temp1$ is computed after $\theta_0$ has already been overwritten, so the two partial derivatives are not evaluated at the same point.
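To make the difference concrete, here is a minimal Python sketch of one gradient descent step done both ways. The helper `grad(theta0, theta1)`, assumed to return both partial derivatives as a pair, is a hypothetical stand-in for the derivative computation.

```python
def step_simultaneous(theta0, theta1, alpha, grad):
    """Correct: both partial derivatives are evaluated at the same point."""
    g0, g1 = grad(theta0, theta1)    # compute both gradients first
    temp0 = theta0 - alpha * g0
    temp1 = theta1 - alpha * g1
    return temp0, temp1              # assign only after both are computed

def step_sequential(theta0, theta1, alpha, grad):
    """Incorrect: theta0 is overwritten before theta1's partial derivative
    is computed, so the second gradient is taken at the wrong point."""
    theta0 = theta0 - alpha * grad(theta0, theta1)[0]
    theta1 = theta1 - alpha * grad(theta0, theta1)[1]   # sees updated theta0
    return theta0, theta1
```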
- If α is too small, gradient descent can be slow.
- If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
Gradient descent can converge to a local minimum even with the learning rate α fixed.
As we approach a local minimum, the derivative term becomes smaller, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.
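A one-dimensional toy example illustrates why: for $J(\theta) = \theta^2$ the derivative is $2\theta$, which shrinks as $\theta$ approaches the minimum at 0, so the steps shrink even with a fixed α. A minimal sketch (the function and starting values are assumptions chosen for illustration):

```python
# Gradient descent on J(theta) = theta**2, whose derivative is 2*theta.
theta, alpha = 4.0, 0.1
for i in range(5):
    step = alpha * 2 * theta   # step size is proportional to the gradient
    theta = theta - step
    print(f"iteration {i}: step = {step:.4f}, theta = {theta:.4f}")
# Steps shrink automatically (0.8000, 0.6400, 0.5120, ...) with alpha fixed.
```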
Gradient descent for linear regression
Gradient descent algorithm
The partial derivatives of the cost function work out to:
$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$$
Plugging these into the update rule, repeat until convergence:
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$$
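Putting the two update rules together, a batch gradient descent loop for single-variable linear regression might look like the following NumPy sketch; the learning rate, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=2000):
    """Batch gradient descent for single-variable linear regression."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y    # h_theta(x^(i)) - y^(i), all i
        grad0 = np.sum(error) / m            # dJ/dtheta0
        grad1 = np.sum(error * x) / m        # dJ/dtheta1
        theta0 = theta0 - alpha * grad0      # simultaneous update: both
        theta1 = theta1 - alpha * grad1      # gradients use the old thetas
    return theta0, theta1

# Toy data on the line y = 2x + 1; the fit should approach theta0=1, theta1=2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent(x, y))
```

Note that both gradients are computed from `error` before either parameter changes, which is exactly the simultaneous update from section 2-3.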
When gradient descent is run on this type of cost function, which you get whenever you use linear regression, it will always converge to the global optimum, because this cost function is convex and has no local optima other than the global one.
“Batch” Gradient Descent
“Batch”: each step of gradient descent uses all the training examples. When computing the derivatives, we compute sums over all m training examples.