A Reading of the Original Adadelta Paper

The original Adadelta paper is:
"ADADELTA: An Adaptive Learning Rate Method" (Zeiler, 2012)

The heart of the paper is Section 3, so that is the section we walk through in detail.

Section 3. ADADELTA Method

The paper says ADADELTA was designed to fix two drawbacks of ADAGRAD: "1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate."
In other words, Adadelta targets: 1) the problem of the learning rate continually decaying, and 2) the problem of having to pick a global learning rate by hand.

In the ADAGRAD method the denominator accumulates the squared gradients from each iteration starting at the beginning of training. Since each term is positive, this accumulated sum continues to grow throughout training, effectively shrinking the learning rate on each dimension. After many iterations, this learning rate will become infinitesimally small.
This passage means that in ADAGRAD, as training goes on, the effective learning rate keeps shrinking toward zero.

3.1. Idea 1: Accumulate Over Window
Instead of accumulating the sum of squared gradients over all time, we restricted the window of past gradients that are accumulated to be some fixed size $w$ (instead of size $t$ where $t$ is the current iteration as in ADAGRAD). With this windowed accumulation the denominator of ADAGRAD cannot accumulate to infinity and instead becomes a local estimate using recent gradients. This ensures that learning continues to make progress even after many iterations of updates have been done.
This means using a window of size $w$ rather than accumulating the squared gradients from all previous $t$ iterations as ADAGRAD does.
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)\,g_t^2 \qquad (8)$$
$$RMS[g]_t = \sqrt{E[g^2]_t + \epsilon} \qquad (9)$$
$$\Delta x_t = -\frac{\eta}{RMS[g]_t}\,g_t \qquad (10)$$

In the equations above, substituting (8) into (9) and (9) into (10) already gives part of the final pseudocode.
However, $\eta$ in Eqn. (10) still has to be set by hand, which is what Section 3.2 addresses.
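To make Idea 1 concrete, here is a minimal NumPy sketch of a single update step built from Eqns. (8)-(10). The function name and the values of eta, rho, and eps are my own illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def idea1_step(x, g, Eg2, eta=0.01, rho=0.9, eps=1e-6):
    """One windowed-accumulation step following Eqns. (8)-(10).

    x: current parameters, g: current gradient, Eg2: running average E[g^2].
    eta, rho, eps are illustrative values; eta still has to be picked by hand,
    which is exactly the problem Idea 2 tries to remove.
    """
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2   # Eqn. (8): decaying average of squared gradients
    rms_g = np.sqrt(Eg2 + eps)             # Eqn. (9): RMS of recent gradients
    dx = -eta / rms_g * g                  # Eqn. (10): per-dimension scaled update
    return x + dx, Eg2
```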

3.2. Idea 2: Correct Units with Hessian Approximation
Newton's method (a second-order method) can be written as:
$$x_{t+1} = x_t - \frac{f'(x_t)}{f''(x_t)}$$
so in Newton's method we can regard $\frac{1}{f''(x)}$ as the learning rate.

In Newton's method we have:
$$\Delta x = \frac{\frac{\partial f}{\partial x}}{\frac{\partial^2 f}{\partial x^2}}$$
from which we can derive:
$$\frac{1}{\frac{\partial^2 f}{\partial x^2}} = \frac{\Delta x}{\frac{\partial f}{\partial x}}$$
(In my view this step is not particularly useful; it mostly reads like the paper padding things out.)
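As a small aside, the "learning rate equals $1/f''(x)$" view can be checked on a one-dimensional toy function; the quadratic below is just an illustration, not something from the paper.

```python
def newton_step(x, f_prime, f_double_prime):
    """One step of 1-D Newton's method; the effective learning rate is 1/f''(x)."""
    return x - f_prime(x) / f_double_prime(x)

# Example: f(x) = (x - 3)^2, so f'(x) = 2(x - 3) and f''(x) = 2.
x_next = newton_step(10.0, lambda x: 2 * (x - 3), lambda x: 2.0)
print(x_next)  # 3.0 -- for a quadratic, a single Newton step lands on the minimum
```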

Since the RMS of the previous gradients is already represented in the denominator in Eqn. 10 we considered a measure of the $\Delta x$ quantity in the numerator.
This just says that the denominator of Eqn. (10) has already been taken care of (which is stating the obvious; it reads like filler).

$\Delta x_t$ for the current time step is not known, so we assume the curvature is locally smooth and approximate $\Delta x_t$ by computing the exponentially decaying RMS over a window of size $w$ of previous $\Delta x$ to give the ADADELTA method.
What does this passage mean?
It means that we also use a window to compute a reasonable value for $\Delta x$; put plainly, it is basically a judgment call: just use the root mean square here as well.
That is where the $RMS[\Delta x]_{t-1}$ in the numerator comes from.
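Putting the two ideas together, the hand-set $\eta$ in Eqn. (10) is replaced by $RMS[\Delta x]_{t-1}$; restating the resulting update (this is the standard ADADELTA formulation):

$$E[\Delta x^2]_t = \rho E[\Delta x^2]_{t-1} + (1-\rho)\,\Delta x_t^2$$
$$\Delta x_t = -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t}\,g_t, \qquad x_{t+1} = x_t + \Delta x_t$$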

The final algorithm is as follows:
(Image: Algorithm 1 from the paper, the full ADADELTA update pseudocode.)
Note:
Substituting steps 4 and 6 of the algorithm into step 5, and then substituting step 5 into step 7, completes one update iteration.
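To see the whole loop end to end, here is a minimal NumPy sketch of the full ADADELTA procedure. The loop structure mirrors the pseudocode; the toy gradient, rho = 0.95, and eps = 1e-6 are illustrative choices on my part.

```python
import numpy as np

def adadelta(grad_fn, x0, rho=0.95, eps=1e-6, num_steps=1000):
    """Minimal ADADELTA loop: note there is no global learning rate to tune.

    grad_fn: function returning the gradient of the loss at x.
    rho, eps: decay rate and numerical constant (illustrative defaults).
    """
    x = np.asarray(x0, dtype=float)
    Eg2 = np.zeros_like(x)    # running average of squared gradients, E[g^2]
    Edx2 = np.zeros_like(x)   # running average of squared updates,  E[dx^2]
    for _ in range(num_steps):
        g = grad_fn(x)                                        # compute gradient g_t
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2                  # accumulate gradient
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g    # compute update (RMS ratio)
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2               # accumulate updates
        x = x + dx                                            # apply update
    return x

# Toy usage: minimize f(x) = sum(x^2), whose gradient is 2x.
x_final = adadelta(lambda x: 2 * x, x0=[3.0, -4.0])
print(x_final)  # x should have moved toward the minimum at [0, 0]
```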
