3. Tips for Training: Adaptive Learning Rate
1. Has training converged?
Sometimes the training loss looks as if it has converged while the gradient is still oscillating. Don't worry: the optimizer may simply be bouncing back and forth across the valley of a local minimum.
2. If it is not stuck at a local minimum, why does the training loss plateau?
Your learning rate may be poorly tuned. If it is too large, the updates bounce back and forth; if it is too small, the steps are so tiny that training may never reach the minimum.
3. Adapt your learning rate
(1) Goal: if the gradient along some direction is large, we want the learning rate there to be smaller; conversely, if the gradient along some direction is small, we want the learning rate there to be larger.
(2) Original update rule: $\theta^{t+1}_i \gets \theta^t_i-\eta\, g^t_i$;
Adaptive update rule: $\theta^{t+1}_i \gets \theta^t_i-\frac{\eta}{\sigma^t_i}\, g^t_i$;
(3) Computing $\sigma$, method ① : Root Mean Square, as used in Adagrad (rarely used today)
$$
\begin{aligned}
\theta^1_i &\gets \theta^0_i-\frac{\eta}{\sigma^0_i} g^0_i &\qquad \sigma^0_i&=\sqrt{(g^0_i)^2}=|g^0_i| \\
\theta^2_i &\gets \theta^1_i-\frac{\eta}{\sigma^1_i} g^1_i &\qquad \sigma^1_i&=\sqrt{\tfrac 12\left[(g^0_i)^2+(g^1_i)^2\right]} \\
\theta^3_i &\gets \theta^2_i-\frac{\eta}{\sigma^2_i} g^2_i &\qquad \sigma^2_i&=\sqrt{\tfrac 13\left[(g^0_i)^2+(g^1_i)^2+(g^2_i)^2\right]} \\
&\;\;\vdots \\
\theta^{t+1}_i &\gets \theta^t_i-\frac{\eta}{\sigma^t_i} g^t_i &\qquad \sigma^t_i&=\sqrt{\frac{1}{t+1}\sum_{k=0}^{t} (g^k_i)^2}
\end{aligned}
$$
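The recursion above maps directly onto a few lines of NumPy. This is a minimal illustrative sketch, not code from the notes: the function name, the toy ill-conditioned quadratic objective, and all hyperparameter values are assumptions.

```python
import numpy as np

def adagrad_rms_update(theta, grad, sq_sum, t, eta=0.1, eps=1e-8):
    """One step of the Adagrad-style update above:
    sigma_t = root mean square of all gradients seen so far."""
    sq_sum = sq_sum + grad ** 2                   # accumulate (g^k)^2 over steps 0..t
    sigma = np.sqrt(sq_sum / (t + 1))             # sigma_t = sqrt(mean of squared grads)
    theta = theta - eta / (sigma + eps) * grad    # per-parameter step eta / sigma
    return theta, sq_sum

# Toy quadratic f(theta) = 0.5 * sum(A * theta^2) with very different
# curvature per coordinate (A is a made-up diagonal Hessian).
A = np.array([1.0, 100.0])
theta = np.array([1.0, 1.0])
sq_sum = np.zeros_like(theta)
for t in range(100):
    grad = A * theta                              # gradient of the quadratic
    theta, sq_sum = adagrad_rms_update(theta, grad, sq_sum, t)
```

Note how dividing by $\sigma$ makes the step size nearly identical in both coordinates despite the 100x curvature gap, which is exactly the goal stated in (1).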
(4) Computing $\sigma$, method ② : RMSProp (where $\alpha$ is a hyperparameter, $0<\alpha<1$)

$$
\begin{aligned}
\theta^1_i &\gets \theta^0_i-\frac{\eta}{\sigma^0_i} g^0_i &\qquad \sigma^0_i&=\sqrt{(g^0_i)^2}=|g^0_i| \\
\theta^2_i &\gets \theta^1_i-\frac{\eta}{\sigma^1_i} g^1_i &\qquad \sigma^1_i&=\sqrt{\alpha(\sigma^0_i)^2+(1-\alpha)(g^1_i)^2} \\
\theta^3_i &\gets \theta^2_i-\frac{\eta}{\sigma^2_i} g^2_i &\qquad \sigma^2_i&=\sqrt{\alpha(\sigma^1_i)^2+(1-\alpha)(g^2_i)^2} \\
&\;\;\vdots \\
\theta^{t+1}_i &\gets \theta^t_i-\frac{\eta}{\sigma^t_i} g^t_i &\qquad \sigma^t_i&=\sqrt{\alpha(\sigma^{t-1}_i)^2+(1-\alpha)(g^t_i)^2}
\end{aligned}
$$
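The RMSProp recursion can be sketched the same way. One caveat: the notes start from $\sigma^0_i=|g^0_i|$, while the sketch below, like most library implementations, initializes the running average at zero; hyperparameter values and the toy objective are illustrative assumptions.

```python
import numpy as np

def rmsprop_update(theta, grad, sq_avg, eta=0.01, alpha=0.9, eps=1e-8):
    """One RMSProp step: sigma_t^2 = alpha * sigma_{t-1}^2 + (1-alpha) * g_t^2."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2      # EMA of squared gradients
    theta = theta - eta / (np.sqrt(sq_avg) + eps) * grad   # divide step by sigma_t
    return theta, sq_avg

# Toy ill-conditioned quadratic, as in the Adagrad sketch (illustrative).
A = np.array([1.0, 100.0])                                 # diagonal Hessian
theta = np.array([1.0, 1.0])
sq_avg = np.zeros_like(theta)
for _ in range(500):
    grad = A * theta
    theta, sq_avg = rmsprop_update(theta, grad, sq_avg)
```

Unlike Adagrad's all-history average, the exponential moving average forgets old gradients, so $\sigma$ can shrink again if recent gradients become small.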
(5) The most commonly used optimization strategy: Adam[1] = RMSProp + Momentum!
$$
\theta^{t+1}_i \gets \theta^t_i-\frac {\eta^t}{\sigma^t_i}\,\pmb{m^t_i}
$$
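A minimal NumPy sketch of one Adam step: the momentum term $m^t_i$ sits in the numerator and the RMS term $\sigma^t_i$ in the denominator. The sketch also applies the bias correction from the Adam paper [1], which the notes' equation omits; the learning rate, toy objective, and function name are assumptions.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum m (numerator) over sqrt(v) (= sigma, denominator),
    with bias correction compensating the zero initialization of m and v."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # sigma^2: EMA of squared gradients
    m_hat = m / (1 - beta1 ** (t + 1))          # bias-corrected estimates
    v_hat = v / (1 - beta2 ** (t + 1))
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Same toy ill-conditioned quadratic as above (illustrative).
A = np.array([1.0, 100.0])
theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1000):
    grad = A * theta
    theta, m, v = adam_update(theta, grad, m, v, t)
```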
(6) Learning-rate scheduling: here we let $\eta$ itself change over time rather than remaining constant.
- Method 1 (Learning Rate Decay): let $\eta^t$ shrink toward 0 over time;
- Method 2 (Warm Up): let $\eta^t$ first grow from 0 to a peak, then decrease toward 0 over time.
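Both strategies are just functions from the step count to $\eta^t$. Below is a hypothetical linear warm-up plus linear decay schedule (Method 2); the peak value and step counts are made-up defaults.

```python
def lr_schedule(step, peak_lr=1e-3, warmup_steps=1000, total_steps=10000):
    """Warm up linearly from 0 to peak_lr, then decay linearly back toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                      # warm up: 0 -> peak
    frac = (step - warmup_steps) / (total_steps - warmup_steps)   # progress: 0 -> 1
    return peak_lr * max(0.0, 1.0 - frac)                         # decay: peak -> 0
```

Plain learning-rate decay (Method 1) falls out as the special case `warmup_steps=0`.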
[1] Kingma D P, Ba J. Adam: A method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980, 2014.