2 Linear regression with one variable
2-1 Model representation
Training set
m = Number of training examples
x’s = “input” variable/features
y’s = “output” variable/“target” variable
(x,y) = one training example
Hypothesis:
$h_\theta(x)=\theta_0+\theta_1 x$
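The hypothesis is just a straight line in one variable. A minimal sketch in pure Python (the function and parameter names are illustrative, not from the course):

```python
def h(theta0, theta1, x):
    """Hypothesis for univariate linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 2, the input x = 3 maps to 1 + 2*3 = 7.
print(h(1.0, 2.0, 3.0))  # 7.0
```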
2-2 Cost Function
Goal: minimize
$\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$
Dividing by m averages the error over the training examples; dividing by 2 is a calculus convenience: it cancels the factor of 2 that appears when taking the partial derivatives.
Cost Function (squared error function)
$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$
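The cost function translates directly into code. A minimal sketch in pure Python, using lists for the training set (names are illustrative):

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) over m training examples."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated from y = 2x
print(compute_cost(0.0, 2.0, xs, ys))  # perfect fit: 0.0
print(compute_cost(0.0, 0.0, xs, ys))  # (4 + 16 + 36) / (2*3) ≈ 9.33
```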
2-3 Cost Function Intuition I
Simplified: fix $\theta_0=0$, so $h_\theta(x)=\theta_1 x$ and $J$ becomes a function of $\theta_1$ alone.
2-4 Cost Function Intuition II
Contour plot/figure
2-5 Gradient descent
Outline
- Start with some $\theta_0,\theta_1$
- Keep changing $\theta_0,\theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum
Gradient descent algorithm
repeat until convergence{
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ (simultaneously for j=0 and j=1)
}
$\alpha$: the learning rate
Correct: simultaneous update
temp0 := $\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)$
temp1 := $\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)$
$\theta_0$ := temp0
$\theta_1$ := temp1
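The point of the temp variables is that both partial derivatives are evaluated at the old parameter values before either parameter is overwritten. A minimal sketch in Python, taking the partial derivatives as callables (all names are illustrative):

```python
def grad_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One simultaneous gradient-descent step: both partials are
    evaluated at the OLD (theta0, theta1) before either is updated."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return temp0, temp1

# Toy cost J = theta0**2 + theta1**2, so the partials are 2*theta0 and 2*theta1.
d0 = lambda t0, t1: 2 * t0
d1 = lambda t0, t1: 2 * t1
print(grad_step(1.0, 1.0, 0.1, d0, d1))  # both parameters move from 1.0 to about 0.8
```

Updating `theta0` in place first and then computing `dJ_dtheta1` would use the new `theta0`, which is the incorrect sequential update the lecture warns against.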
2-6 Gradient descent Intuition
Learning rate $\alpha$
- If $\alpha$ is too small, gradient descent can be slow.
- If $\alpha$ is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
Even with the learning rate $\alpha$ held fixed, gradient descent can converge to a local minimum: as we approach the minimum, the derivative shrinks, so the steps automatically become smaller. There is no need to decrease $\alpha$ over time.
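These three regimes are easy to see on the toy cost $J(\theta)=\theta^2$, whose gradient is $2\theta$, so each step multiplies $\theta$ by $(1-2\alpha)$. A small sketch (the example cost and values are illustrative, not from the course):

```python
def descend(theta, alpha, steps):
    """Run gradient descent on J(theta) = theta**2 (gradient is 2*theta)."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(abs(descend(1.0, 0.01, 10)))  # too small: ≈ 0.82, barely moved (slow)
print(abs(descend(1.0, 0.4, 10)))   # reasonable: ≈ 1e-07, near the minimum
print(abs(descend(1.0, 1.1, 10)))   # too large: ≈ 6.2, diverging
```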
2-7 Gradient descent for linear regression
Gradient descent algorithm
repeat until convergence{
$\begin{cases}\theta_0 := \theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) \\ \theta_1 := \theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)} \end{cases}$
}
update $\theta_0$ and $\theta_1$ simultaneously
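Putting the pieces together, the whole algorithm can be sketched in pure Python (function name, data, and hyperparameter values are illustrative):

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Batch gradient descent for univariate linear regression.
    Each iteration uses all m examples, and theta0/theta1 are
    updated simultaneously via tuple assignment."""
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Data generated from y = 1 + 2x; gradient descent should recover theta0 ≈ 1, theta1 ≈ 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
t0, t1 = batch_gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))  # 1.0 2.0
```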
Convex function: for linear regression, $J(\theta_0,\theta_1)$ is a convex (bowl-shaped) function, so it has a single global minimum and gradient descent cannot get stuck in a local optimum.
"Batch" Gradient Descent
"Batch": each step of gradient descent uses all the training examples.