The original Adadelta paper is:
"ADADELTA: An Adaptive Learning Rate Method"
The heart of the paper is Section 3, so this post focuses on interpreting Section 3.
Section 3. ADADELTA Method
1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate.
In other words, ADADELTA is designed to address:
1) the problem of learning rates decaying away, and 2) the problem of having to select the learning rate by hand.
In the ADAGRAD method the denominator accumulates the squared gradients from each iteration starting at the beginning of training. Since each term is positive, this accumulated sum continues to grow throughout training, effectively shrinking the learning rate on each dimension. After many iterations, this learning rate will become infinitesimally small.
This passage means that as training proceeds, ADAGRAD gradually drives the learning rate toward zero.
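As a quick sanity check (my own sketch, not from the paper), here is a minimal single-parameter ADAGRAD accumulation; the constant gradient and the values of `eta` and `eps` are illustrative:

```python
import math

# ADAGRAD keeps a running sum of squared gradients in the denominator.
# With a constant gradient of 1.0, that sum grows linearly with t, so the
# effective step eta / sqrt(sum) shrinks toward zero.
eta, eps = 0.1, 1e-8
accum = 0.0
step = 0.0
for t in range(10000):
    g = 1.0                                  # pretend the gradient stays at 1.0
    accum += g * g                           # denominator grows without bound
    step = eta * g / (math.sqrt(accum) + eps)
print(round(step, 6))                        # → 0.001: the step is 100x smaller than eta
```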
3.1. Idea 1: Accumulate Over Window
Instead of accumulating the sum of squared gradients over all time, we restricted the window of past gradients that are accumulated to be some fixed size w (instead of size t where t is the current iteration as in ADAGRAD). With this windowed accumulation the denominator of ADAGRAD cannot accumulate to infinity and instead becomes a local estimate using recent gradients. This ensures that learning continues to make progress even after many iterations of updates have been done.
That is, use a window of size w instead of accumulating the squared gradients of all t previous iterations as ADAGRAD does.
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)\, g_t^2 \tag{8}$$
$$RMS[g]_t = \sqrt{E[g^2]_t + \epsilon} \tag{9}$$
$$\Delta x_t = -\frac{\eta}{RMS[g]_t}\, g_t \tag{10}$$
In the equations above, substituting (8) into (9) and (9) into (10) yields part of the final pseudocode.
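Eqns (8)-(10) can be sketched in a few lines of Python. The function name `rmsprop_like_step`, the state dictionary, and the default hyperparameter values are my own illustrative choices, not from the paper:

```python
import math

def rmsprop_like_step(x, grad, state, eta=0.001, rho=0.9, eps=1e-6):
    """One update combining Eqns (8)-(10): an exponential moving average of
    squared gradients (8), its RMS (9), and the parameter update (10)."""
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2  # Eqn (8)
    rms_g = math.sqrt(state["Eg2"] + eps)                      # Eqn (9)
    dx = -eta / rms_g * grad                                   # Eqn (10)
    return x + dx

# Illustrative usage: a few steps on f(x) = x^2, whose gradient is 2x.
state = {"Eg2": 0.0}
x = 5.0
for _ in range(3):
    x = rmsprop_like_step(x, 2 * x, state)
```

Note that this intermediate form still contains the hand-set global rate `eta`, which is exactly what Section 3.2 goes on to remove.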
Then, because $\eta$ in these equations still has to be set by hand, the paper moves on to Section 3.2.
3.2. Idea 2: Correct Units with Hessian Approximation
Newton's method (a second-order method) can be written as:
$$x_{t+1} = x_t - \frac{f'(x)}{f''(x)}$$
So in Newton's method, we can regard $\frac{1}{f''(x)}$ as the learning rate.
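A one-dimensional sketch (my own example, not from the paper) of why $\frac{1}{f''(x)}$ behaves like a learning rate: for a quadratic, scaling the gradient by exactly $\frac{1}{f''(x)}$ lands on the minimum in a single Newton step.

```python
# For f(x) = (x - 3)^2 we have f'(x) = 2(x - 3) and f''(x) = 2, so
# x_{t+1} = x_t - f'(x_t) / f''(x_t) jumps straight to the minimum at x = 3.
x = 10.0
f_prime = 2 * (x - 3)        # gradient at the current point
f_double_prime = 2.0         # curvature; 1/f'' plays the role of the learning rate
x = x - f_prime / f_double_prime
print(x)                     # → 3.0, the minimum, in one step
```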
In Newton's method, we have:
$$\Delta x = \frac{\frac{\partial f}{\partial x}}{\frac{\partial^2 f}{\partial x^2}}$$
可以推导出:
1
∂
2
f
∂
x
2
=
△
x
∂
f
∂
x
\frac{1}{\frac{\partial ^2 f}{\partial x^2}}=\frac{△x}{\frac{\partial f}{\partial x}}
∂x2∂2f1=∂x∂f△x(这个步骤我认为没啥用,就是在论文里面凑字数逼叨几句)
Since the RMS of the previous gradients is already represented in the denominator in Eqn. 10 we considered a measure of the $\Delta x$ quantity in the numerator.
This says the denominator of Eqn. (10) has already been taken care of (which, frankly, is filler to pad out the paper).
$\Delta x_t$ for the current time step is not known, so we assume the curvature is locally smooth and approximate $\Delta x_t$ by computing the exponentially decaying RMS over a window of size w of previous $\Delta x$ to give the ADADELTA method.
What does this passage mean?
It means: the authors apply the same windowed treatment to $\Delta x$ to compute a reasonable value. In plain terms: they decided, more or less on a whim, to just use the root mean square here as well.
That is where the $RMS[\Delta x]_{t-1}$ in the numerator comes from.
The final algorithm is as follows:
Note:
Substituting steps 4 and 6 of the algorithm into step 5, and then step 5 into step 7, completes one update iteration.
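The full ADADELTA update can be sketched as follows; the function name, state layout, and the quadratic test function are my own illustrative assumptions. Step 4 accumulates $E[g^2]$, step 5 computes $\Delta x = -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t} g_t$, step 6 accumulates $E[\Delta x^2]$, and step 7 applies the update:

```python
import math

def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA iteration on a scalar parameter. Note there is no
    global learning rate: the numerator RMS[dx]_{t-1} replaces eta."""
    # Step 4: accumulate the EMA of squared gradients (Eqn 8).
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2
    # Step 5: compute the update using RMS[dx]_{t-1} / RMS[g]_t.
    dx = -(math.sqrt(state["Edx2"] + eps) /
           math.sqrt(state["Eg2"] + eps)) * grad
    # Step 6: accumulate the EMA of squared updates.
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * dx ** 2
    # Step 7: apply the update.
    return x + dx

# Illustrative usage: minimize f(x) = x^2, gradient 2x. Progress is slow at
# first because RMS[dx] starts near zero, then the step size adapts upward.
state = {"Eg2": 0.0, "Edx2": 0.0}
x = 5.0
for _ in range(200):
    x = adadelta_step(x, 2 * x, state)
```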