Optimizers
SGD
$W_{new}=W_{old} - \alpha\frac{\partial{Loss}}{\partial{W_{old}}}$
$\alpha$: learning rate
Drawback: easily gets stuck in local minima.
Adding momentum (Momentum) addresses the local-minimum problem.
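A minimal NumPy sketch of this update rule; the function name `sgd_step`, the toy loss $(w-3)^2$, and the learning rate value are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """Vanilla SGD: W_new = W_old - lr * dLoss/dW."""
    return w - lr * grad

# Toy example: minimize Loss = (w - 3)^2, whose gradient is 2 * (w - 3).
w = np.array([0.0])
for _ in range(100):
    grad = 2 * (w - 3)
    w = sgd_step(w, grad)
print(w)  # close to 3.0
```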
SGD+Momentum
Momentum update: $V_{new}=\eta V_{old} +\alpha \frac{\partial{Loss}}{\partial{W_{old}}}$
Weight update: $W_{new} = W_{old}-V_{new}$
$\alpha$: learning rate
$\eta$: momentum coefficient
Advantages: helps avoid getting trapped in local minima; because the momentum term accumulates past gradients, convergence is faster.
Drawback: prone to oscillation.
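A NumPy sketch of one momentum step following the two formulas above; all names, the toy loss, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, eta=0.9):
    """V_new = eta * V_old + lr * grad;  W_new = W_old - V_new."""
    v = eta * v + lr * grad
    w = w - v
    return w, v

# Toy example: minimize Loss = (w - 3)^2.
w, v = np.array([0.0]), np.zeros(1)
for _ in range(200):
    grad = 2 * (w - 3)
    w, v = momentum_step(w, v, grad)
print(w)  # converges towards 3.0 (with some oscillation along the way)
```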
NAG (Nesterov Accelerated Gradient)
Fully expanding the Momentum update gives:
$W_{new} = W_{old}- \eta V_{old}-\alpha \frac{\partial{Loss}}{\partial{W_{old}}}$
$\alpha \frac{\partial{Loss}}{\partial{W_{old}}}$ is a very small term, so the look-ahead (future) weights are approximately:
$W_{future}=W_{old}-\eta V_{old}$
Nesterov Momentum update: $V_{new}=\eta V_{old}+\alpha \frac{\partial{Loss}}{\partial{W_{future}}}$
Weight update: $W_{new} = W_{old}-V_{new}$
- Gradient update rule (equivalently, in $\theta$ / $v_t$ notation):
$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta-\gamma v_{t-1})$, where $J$ is the loss evaluated at the look-ahead point $\theta-\gamma v_{t-1}$
$\theta=\theta-v_t$
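A NumPy sketch of the Nesterov step, taking the gradient at the look-ahead position; `grad_fn`, the toy loss, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def nag_step(w, v, grad_fn, lr=0.1, eta=0.9):
    """Nesterov momentum: evaluate the gradient at W_future = W_old - eta * V_old."""
    w_future = w - eta * v                  # look-ahead position
    v = eta * v + lr * grad_fn(w_future)    # V_new = eta*V_old + lr * dLoss/dW_future
    return w - v, v                         # W_new = W_old - V_new

# Toy example: minimize Loss = (w - 3)^2.
grad_fn = lambda w: 2 * (w - 3)
w, v = np.array([0.0]), np.zeros(1)
for _ in range(200):
    w, v = nag_step(w, v, grad_fn)
print(w)  # converges towards 3.0, typically with less oscillation than plain momentum
```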
Adagrad
Gradient cache update: $Cache_{new}=Cache_{old}+\left(\frac{\partial{Loss}}{\partial{W_{old}}}\right)^2$
Weight update: $W_{new} = W_{old}-\frac{\alpha}{\sqrt{Cache_{new} + \epsilon}}\frac{\partial{Loss}}{\partial{W_{old}}}$
Drawback: the cache only ever grows, so the effective learning rate keeps shrinking until it is too small for training to make progress, effectively ending training early.
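A NumPy sketch of Adagrad following the two formulas above (written with the usual descent sign); all names, the toy loss, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, cache, grad, lr=0.5, eps=1e-8):
    """Cache_new = Cache_old + grad^2;  W_new = W_old - lr / sqrt(Cache_new + eps) * grad."""
    cache = cache + grad ** 2
    w = w - lr / np.sqrt(cache + eps) * grad
    return w, cache

# Toy example: minimize Loss = (w - 3)^2.
w, cache = np.array([0.0]), np.zeros(1)
for _ in range(500):
    grad = 2 * (w - 3)
    w, cache = adagrad_step(w, cache, grad)
print(w)  # moves towards 3.0, but progress slows as the cache keeps growing
```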
RMSProp
Cache update: $Cache_{new}=\gamma Cache_{old}+(1-\gamma)\left(\frac{\partial{Loss}}{\partial{W_{old}}}\right)^2$
Weight update: $W_{new} = W_{old}-\frac{\alpha}{\sqrt{Cache_{new} + \epsilon}}\frac{\partial{Loss}}{\partial{W_{old}}}$
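A NumPy sketch of RMSProp under the same conventions; the exponentially decaying cache keeps the effective step size from vanishing. All names and values are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(w, cache, grad, lr=0.05, gamma=0.9, eps=1e-8):
    """Cache_new = gamma*Cache_old + (1-gamma)*grad^2;
    W_new = W_old - lr / sqrt(Cache_new + eps) * grad."""
    cache = gamma * cache + (1 - gamma) * grad ** 2
    w = w - lr / np.sqrt(cache + eps) * grad
    return w, cache

# Toy example: minimize Loss = (w - 3)^2.
w, cache = np.array([0.0]), np.zeros(1)
for _ in range(500):
    grad = 2 * (w - 3)
    w, cache = rmsprop_step(w, cache, grad)
print(w)  # hovers near 3.0; unlike Adagrad, the step size does not decay to zero
```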
Adam
Adam momentum update: $V_{new} = \beta_{1}V_{old}+(1-\beta_1)\frac{\partial{Loss}}{\partial{W_{old}}}$
Cache update: $Cache_{new}=\beta_2 Cache_{old}+(1-\beta_2)\left(\frac{\partial{Loss}}{\partial{W_{old}}}\right)^2$
Adam weight update: $W_{new}=W_{old}-\frac{\alpha}{\sqrt{Cache_{new} + \epsilon}}V_{new}$
Typical values: $\beta_1=0.9$, $\beta_2=0.99$, $\epsilon=10^{-8}$
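A NumPy sketch of Adam exactly as written above (i.e. without the bias-correction terms of the full Adam algorithm); all names, the toy loss, and the step count are illustrative assumptions.

```python
import numpy as np

def adam_step(w, v, cache, grad, lr=0.01, beta1=0.9, beta2=0.99, eps=1e-8):
    """V_new = beta1*V_old + (1-beta1)*grad;
    Cache_new = beta2*Cache_old + (1-beta2)*grad^2;
    W_new = W_old - lr / sqrt(Cache_new + eps) * V_new  (no bias correction)."""
    v = beta1 * v + (1 - beta1) * grad
    cache = beta2 * cache + (1 - beta2) * grad ** 2
    w = w - lr / np.sqrt(cache + eps) * v
    return w, v, cache

# Toy example: minimize Loss = (w - 3)^2.
w, v, cache = np.array([0.0]), np.zeros(1), np.zeros(1)
for _ in range(2000):
    grad = 2 * (w - 3)
    w, v, cache = adam_step(w, v, cache, grad)
print(w)  # settles near 3.0
```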