Gradient Descent
$$x := x - \alpha \cdot {\rm{d}}x$$
where $\alpha$ is the learning rate. The code is as follows:
while True:
    dx = compute_gradient(x)
    x += -learning_rate * dx  # perform parameter update
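As a quick runnable sketch of this update (the toy objective f(x) = x**2, its gradient, and the constants below are made up for illustration, not from the original post):

def compute_gradient(x):
    # hypothetical toy gradient: derivative of f(x) = x ** 2
    return 2.0 * x

x = 5.0
learning_rate = 0.1
for _ in range(50):               # finite loop instead of `while True`
    dx = compute_gradient(x)
    x += -learning_rate * dx      # x := x - alpha * dx
print(x)                          # close to the minimum at 0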
Gradient Descent with Momentum
$$v := \rho \cdot v + {\rm{d}}x \\ x := x - \alpha \cdot v$$
where $\rho$ is the momentum coefficient, typically 0.9 or 0.99. With momentum, the descent direction is influenced not only by the current gradient but also by the history of past directions, as if the update carried an initial velocity. Plain gradient descent from a given starting point can follow an extremely zigzag path, wasting many steps oscillating back and forth instead of heading straight for the minimum; adding momentum avoids this and speeds up learning. The code is as follows:
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x += -learning_rate * vx
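To see the effect, here is a quick toy run on an ill-conditioned quadratic (the objective, its gradient, and the constants are made up for illustration): the velocity averages out the oscillating component of the gradient and keeps building speed along the flat direction.

import numpy as np

def compute_gradient(x):
    # hypothetical toy gradient of f(x) = 0.5 * (x[0]**2 + 50 * x[1]**2)
    return np.array([x[0], 50.0 * x[1]])

x = np.array([10.0, 1.0])
vx = np.zeros_like(x)
rho, learning_rate = 0.9, 0.02
for _ in range(200):
    dx = compute_gradient(x)
    vx = rho * vx + dx            # velocity accumulates past gradients
    x += -learning_rate * vx      # step along the velocity, not the raw gradient
print(x)                          # approaches the minimum at [0, 0]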
Nesterov Momentum
$$v := \rho \cdot v - \alpha \cdot {\rm{d}}(x + \rho \cdot v) \\ x := x + v$$
Nesterov Momentum is an improvement on standard momentum; it can be understood as adding a correction factor to the standard momentum method. In practice the update is usually rewritten ("transformed") so that the gradient is evaluated at the stored parameters rather than at the lookahead point. The transformed code:
v = 0
while True:
    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx
    x += -rho * old_v + (1 + rho) * v
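For comparison, here is a quick toy run of the untransformed update, taking the gradient at the lookahead point x + ρ·v exactly as in the formula above (the toy gradient and constants are made up for illustration):

def compute_gradient(x):
    # hypothetical toy gradient: derivative of f(x) = x ** 2
    return 2.0 * x

x, v = 5.0, 0.0
rho, learning_rate = 0.9, 0.05
for _ in range(200):
    dx_ahead = compute_gradient(x + rho * v)   # gradient at the lookahead point x + rho * v
    v = rho * v - learning_rate * dx_ahead     # v := rho * v - alpha * d(x + rho * v)
    x += v                                     # x := x + v
print(x)                                       # close to the minimum at 0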
AdaGrad
$$g := g + {\rm{d}}x^{\top} \cdot {\rm{d}}x \\ x := x - \alpha \cdot \frac{{\rm{d}}x}{\sqrt{g}+\epsilon}$$
Advantage: it damps the step size along dimensions with large gradients and boosts it along dimensions with small gradients.
Disadvantage: because the accumulated squared gradient only grows, the step size keeps shrinking as iterations proceed, so on non-convex problems it tends to get stuck at saddle points and local minima.
Code:
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
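A small sketch of the per-dimension scaling (the two-dimensional toy gradient and constants are made up for illustration): even though one dimension's gradient is a thousand times larger than the other's, the accumulated squared gradients normalize the step so both dimensions make comparable progress.

import numpy as np

def compute_gradient(x):
    # hypothetical toy gradient: one dimension is far steeper than the other
    return np.array([100.0 * x[0], 0.1 * x[1]])

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 0.1
for _ in range(100):
    dx = compute_gradient(x)
    grad_squared += dx * dx                                    # per-dimension accumulation
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # steep dimension gets damped
print(x)                                                       # both dimensions have moved toward 0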
RMSProp
$$g := \beta \cdot g + (1-\beta) \cdot {\rm{d}}x^{\top} \cdot {\rm{d}}x \\ x := x - \alpha \cdot \frac{{\rm{d}}x}{\sqrt{g}+\epsilon}$$
RMSProp fixes AdaGrad's drawback: it is like adding momentum to dx*dx, keeping a decaying moving average of the squared gradients instead of an ever-growing sum. The decay rate $\beta$ (decay_rate) is typically 0.9 or 0.99. The code is as follows:
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
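For contrast with AdaGrad, a quick toy run on the same made-up gradient: because the denominator is a decaying average rather than an ever-growing sum, the step size does not shrink toward zero and the iterates keep making progress.

import numpy as np

def compute_gradient(x):
    # same hypothetical toy gradient as in the AdaGrad sketch
    return np.array([100.0 * x[0], 0.1 * x[1]])

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 0.01, 0.9
for _ in range(500):
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx  # leaky average
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
print(x)                          # both dimensions end up close to 0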
Adam
$$g := \beta_1 \cdot g + (1-\beta_1) \cdot {\rm{d}}x \\ gg := \beta_2 \cdot gg + (1-\beta_2) \cdot {\rm{d}}x^{\top} \cdot {\rm{d}}x \\ x := x - \alpha \cdot \frac{g}{\sqrt{gg}+\epsilon}$$
Adam combines the ideas of momentum and AdaGrad/RMSProp, which is quite a slick move. Code:
beta1 = 0.9
beta2 = 0.999
learning_rate = 1e-3  # or 5e-4
first_moment = 0
second_moment = 0
while True:
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # AdaGrad/RMSProp
    x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)
In the Adam above, the step size during the first few iterations can be very large: the moment estimates start at zero and are biased toward zero early on, so the denominator in particular can be tiny. The fix is to add bias-correction terms based on the iteration count t. Note that t grows with the number of iterations and must start at 1 so the corrections are well defined:
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):   # t starts at 1 to avoid dividing by zero
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)    # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
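Putting it all together, a self-contained toy run of the bias-corrected update (the objective, gradient, constants, and iteration count are made up for illustration):

import numpy as np

def compute_gradient(x):
    # hypothetical toy gradient of f(x) = 0.5 * (x[0]**2 + 50 * x[1]**2)
    return np.array([x[0], 50.0 * x[1]])

x = np.array([10.0, 1.0])
beta1, beta2, learning_rate = 0.9, 0.999, 0.1
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
for t in range(1, 501):                                   # t starts at 1, as required above
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
print(x)                                                  # approaches the minimum at [0, 0]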
That is all. Feedback and corrections are welcome!