1. SGD + Momentum for Oscillation and Plateau Problem

  1. SGD has trouble navigating areas where the curvature is much steeper in one dimension than in another.
    • It ends up oscillating across the steep slopes and makes slow progress toward the optimum

   At the same position, the absolute value of the objective function's slope in the vertical direction is much larger than its slope in the horizontal direction. Given a fixed learning rate, SGD therefore moves the variable too far in the vertical direction at each step, possibly even overshooting the optimum. This repeated oscillation slows progress toward the optimum.
 We would like the optimization trajectory to be smoother, with faster movement in the horizontal direction: Momentum.
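
The oscillation described above can be reproduced with a small sketch (not from the original post): plain gradient descent on the ill-conditioned quadratic $f(x, y) = 0.1x^2 + 2y^2$, where the vertical ($y$) slope dominates. The function and step size here are illustrative assumptions.

```python
# Plain gradient descent on f(x, y) = 0.1*x**2 + 2*y**2 (illustrative example).
# A step size large enough to make progress in x causes oscillation in y.

def grad(x, y):
    # Gradient of f(x, y) = 0.1*x**2 + 2*y**2
    return 0.2 * x, 4.0 * y

def sgd(x, y, eta, steps):
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
        path.append((x, y))
    return path

path = sgd(x=-5.0, y=-2.0, eta=0.4, steps=20)
# y flips sign every step (y_{t+1} = (1 - 0.4*4)*y_t = -0.6*y_t),
# while x creeps toward 0 slowly (x_{t+1} = 0.92*x_t).
for (x0, y0), (x1, y1) in zip(path, path[1:]):
    assert y0 * y1 < 0  # trajectory oscillates across the x-axis
```

With this learning rate the $y$-coordinate crosses the axis on every update while the $x$-coordinate shrinks by only 8% per step, which is exactly the zig-zag behavior momentum is meant to dampen.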

  1. Incorporation of Momentum
    $v_t = \beta v_{t-1} + \eta_t g_t$, where $\beta \in [0, 1)$, $v_0 = 0$, and $g_t$ is the gradient at the current $x$
    $x_t = x_{t-1} - v_t$

    • if $\beta = 0$, this reduces to plain SGD
    • $v_t = \beta v_{t-1} + (1-\beta)\frac{\eta_t g_t}{1-\beta} = \beta^t v_0 + (1-\beta)\sum_{i=0}^{t-1}\beta^{i}\frac{\eta_{t-i} g_{t-i}}{1-\beta}$
    • The weight on $\frac{\eta_{t-i} g_{t-i}}{1-\beta}$ is $(1-\beta)\beta^i$, which decreases exponentially as $i$ increases.
    • $v_t$ is the exponentially weighted moving average of the past scaled gradient terms $\frac{\eta_{t-i} g_{t-i}}{1-\beta}$
    • Let $n = \frac{1}{1-\beta}$, so $\beta = 1 - \frac{1}{n}$
      • $\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^{n} = \frac{1}{e} \approx 0.3679$

      • If we treat $\frac{1}{e}$ as negligibly small, then we can ignore all terms involving $\beta^{\frac{1}{1-\beta}}$ and higher powers when $\beta \rightarrow 1$, e.g. $\beta \in [0.9, 1)$.

        For example: if $\beta = 0.95$, then $\frac{1}{1-\beta} = 20$, and $v_t \approx \beta^t v_0 + (1-\beta)\sum_{i=0}^{19}\beta^{i}\frac{\eta_{t-i} g_{t-i}}{1-\beta}$

      • For momentum, the update to $x$ is approximately the exponentially weighted moving average of the previous $\frac{1}{1-\beta}$ update terms ($\eta_{t-i} g_{t-i}$), each divided by $1-\beta$
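
The update rule above can be sketched in a few lines. This is a minimal illustration (not from the original post); the function name `momentum_step` is hypothetical.

```python
# One momentum update for a single parameter, matching
# v_t = beta*v_{t-1} + eta*g_t;  x_t = x_{t-1} - v_t.

def momentum_step(x, v, g, eta, beta):
    """Apply one momentum SGD step and return the new (x, v)."""
    v = beta * v + eta * g   # v_t = beta * v_{t-1} + eta_t * g_t
    x = x - v                # x_t = x_{t-1} - v_t
    return x, v

# With beta = 0 this reduces to plain SGD: x <- x - eta*g.
x_sgd, _ = momentum_step(x=1.0, v=0.0, g=2.0, eta=0.1, beta=0.0)
assert abs(x_sgd - (1.0 - 0.1 * 2.0)) < 1e-12
```

The velocity `v` is carried across steps, so consecutive gradients pointing the same way accumulate, while gradients that flip sign partially cancel.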

  2. Momentum

    • Reduces updates along directions whose gradients change sign frequently
    • Increases updates along directions whose gradients are consistent
    • Dampens oscillations
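
These three effects can be seen on the same ill-conditioned quadratic as before. The sketch below (an illustrative assumption, not from the original post) compares plain SGD with momentum at the same learning rate:

```python
# Compare plain SGD and momentum on f(x, y) = 0.1*x**2 + 2*y**2.
# Momentum averages out the sign-flipping y-gradients and accumulates
# the consistent x-gradients, so it progresses faster along x.

def run(eta, beta, steps=50, x=-5.0, y=-2.0):
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = 0.2 * x, 4.0 * y            # gradient of f
        vx = beta * vx + eta * gx            # velocity per coordinate
        vy = beta * vy + eta * gy
        x, y = x - vx, y - vy
    return x, y

x_sgd, _ = run(eta=0.4, beta=0.0)            # plain SGD
x_mom, _ = run(eta=0.4, beta=0.5)            # momentum
# For the same learning rate, momentum ends up much closer to the optimum in x.
assert abs(x_mom) < abs(x_sgd)
```

Along the flat $x$ direction the gradients all point the same way, so the velocity builds up; along the steep $y$ direction the gradients alternate in sign, so the velocity stays small and the zig-zag is damped.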

Figure adapted from Zhihu: https://zhuanlan.zhihu.com/p/34240246
