Linear Regression
For $y = ax + b$, univariate linear regression fits a straight line to the data, as in the sketch below:
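A minimal fit of $a$ and $b$ with NumPy (the data points here are purely illustrative):

```python
import numpy as np

# Illustrative noisy samples from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.05, 0.1, -0.05])

a, b = np.polyfit(x, y, deg=1)  # least-squares fit of y = a*x + b
print(a, b)                     # approximately 2 and 1
```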
Consider the case of multiple variables:
$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \\
h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x
$$

(taking $x_0 = 1$ so the intercept is absorbed into the sum)
We choose a sensible error function (the loss function):
$$
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$
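A direct NumPy transcription of the hypothesis and loss (the function names are illustrative; `X` stacks the samples $x^{(i)T}$ as rows, with a leading column of ones for $x_0$):

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, evaluated for every row of X at once."""
    return X @ theta

def loss(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residual = hypothesis(theta, X) - y
    return 0.5 * np.sum(residual ** 2)
```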
When the loss function attains a minimum, the resulting $\theta$ is a local optimum. (For the quadratic function $J(\theta)$, the $\theta$ obtained at the minimum is in fact the global optimum.)
The closed-form expression for $\theta$ can be derived as follows.
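A standard sketch of the derivation in matrix form, assuming $X$ is the $m \times (n+1)$ design matrix whose rows are the $x^{(i)T}$ and $y$ is the target vector:

$$
\begin{aligned}
J(\theta) &= \frac{1}{2}(X\theta - y)^T(X\theta - y) \\
\nabla_\theta J(\theta) &= X^T(X\theta - y) \\
\nabla_\theta J(\theta) = 0 &\implies X^T X \theta = X^T y
\end{aligned}
$$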
This gives the optimal parameters in the least-squares sense:

$$
\theta = (X^T X)^{-1} X^T y
$$
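A minimal NumPy sketch of this closed-form solution (solving the normal equations directly rather than forming the explicit inverse, which is the numerically preferable route):

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y for the least-squares optimum."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative check: recover theta = [1, 2, 3] from noisy samples.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=100)
print(normal_equation(X, y))  # approximately [1, 2, 3]
```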
In particular, when $X^TX$ has high order, inverting it becomes expensive, and we still need gradient descent to compute a numerical solution.
The Gradient Descent Algorithm
Steps:
1. Initialize $\theta$ (randomly).
2. Iterate to obtain a new $\theta$ that makes $J(\theta)$ smaller.
3. If $J(\theta)$ can still decrease, return to step 2.
The update rule ($\alpha$ is called the learning rate):

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
$$
The gradient direction (essentially the partial derivative with respect to $\theta$), computed here for a single sample:
$$
\begin{aligned}
\frac{\partial}{\partial\theta_j}J(\theta) &= \frac{\partial}{\partial\theta_j}\frac{1}{2}\left(h_\theta(x) - y\right)^2 \\
&= 2 \cdot \frac{1}{2}\left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial\theta_j}\left(h_\theta(x) - y\right) \\
&= \left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{n}\theta_i x_i - y\right) \\
&= \left(h_\theta(x) - y\right)x_j
\end{aligned}
$$
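A quick finite-difference check of this per-sample gradient (the values here are illustrative):

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -2.0])  # x_0 = 1 carries the intercept
y = 1.5

# Analytic gradient from the derivation: (h_theta(x) - y) * x_j
grad = (x @ theta - y) * x

# Central finite differences on J = 0.5 * (h_theta(x) - y)^2
eps = 1e-6
for j in range(len(theta)):
    t_hi, t_lo = theta.copy(), theta.copy()
    t_hi[j] += eps
    t_lo[j] -= eps
    numeric = (0.5 * (x @ t_hi - y) ** 2 - 0.5 * (x @ t_lo - y) ** 2) / (2 * eps)
    assert abs(numeric - grad[j]) < 1e-5
```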
Batch Gradient Descent
$$
\begin{aligned}
&\text{Repeat until convergence \{} \\
&\qquad \theta_j := \theta_j + \alpha\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)} \\
&\text{\}}
\end{aligned}
$$
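A minimal sketch of the batch update in NumPy, applying the rule to all $\theta_j$ at once (the learning rate and iteration count are illustrative and need tuning for the data scale):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Each step uses the error summed over all m samples."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = y - X @ theta           # y^(i) - h_theta(x^(i)) for every i
        theta += alpha * (X.T @ error)  # theta_j += alpha * sum_i error_i * x_j^(i)
    return theta
```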
Illustration of batch gradient descent:
Stochastic Gradient Descent
$$
\begin{aligned}
&\text{Loop \{} \\
&\qquad \text{for } i = 1 \text{ to } m \text{ \{} \\
&\qquad\qquad \theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)} \\
&\qquad \text{\}} \\
&\text{\}}
\end{aligned}
$$
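The corresponding sketch, updating $\theta$ one sample at a time (the per-epoch shuffle is a common practical addition not present in the pseudocode above):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10):
    """One update per sample: theta_j += alpha * (y^(i) - h_theta(x^(i))) * x_j^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in np.random.permutation(X.shape[0]):  # shuffled pass over samples
            error = y[i] - X[i] @ theta
            theta += alpha * error * X[i]
    return theta
```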
Mini-batch Gradient Descent
Each update uses a small batch of $b$ samples, sitting between the two extremes above:

$$
\begin{aligned}
&\text{Repeat until convergence \{} \\
&\qquad \text{for each batch } B \text{ of } b \text{ samples \{} \\
&\qquad\qquad \theta_j := \theta_j + \alpha\sum_{i \in B}\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)} \\
&\qquad \text{\}} \\
&\text{\}}
\end{aligned}
$$
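A matching sketch (the batch size `b = 32` is an illustrative default):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=32, n_epochs=10):
    """Each update sums the gradient over a batch of b samples."""
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, b):
            batch = idx[start:start + b]
            error = y[batch] - X[batch] @ theta
            theta += alpha * (X[batch].T @ error)
    return theta
```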
Summary of Gradient Descent Algorithms
Batch gradient descent has to compute the error over every sample, so it is less efficient than stochastic and mini-batch gradient descent; its advantage is that it converges steadily toward the optimum. Stochastic gradient descent picks samples at random to compute the gradient, so its convergence speed fluctuates and it can even oscillate (an overly large gradient step may overshoot the global optimum), but it is more efficient. Mini-batch gradient descent combines the strengths of both.