- The SGD method has trouble navigating regions where the curvature is much steeper in one dimension than in another.
- It ends up oscillating along the steep slope and makes slow progress toward the optimum.
We can see that, at the same position, the absolute value of the objective function's slope in the vertical direction is much larger than its slope in the horizontal direction. For a given learning rate, SGD therefore moves the variable too far in the vertical direction at each iteration, possibly even overshooting the optimum. This repeated oscillation slows progress toward the optimum.
We would like the path toward the optimum to be smoother, with faster movement in the horizontal direction: Momentum.
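A minimal sketch of this behavior, assuming a toy objective $f(x_1, x_2) = 0.1x_1^2 + 2x_2^2$ (flat along $x_1$, steep along $x_2$) and an illustrative learning rate:

```python
import numpy as np

# Toy objective f(x1, x2) = 0.1*x1**2 + 2*x2**2: flat along x1, steep along x2.
def grad(x):
    return np.array([0.2 * x[0], 4.0 * x[1]])

def sgd(x0, lr, steps):
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(steps):
        x -= lr * grad(x)          # x_t = x_{t-1} - eta * g_t
        path.append(x.copy())
    return np.array(path)

path = sgd(x0=(-5.0, -2.0), lr=0.4, steps=20)
# x2 flips sign every step (oscillation) while x1 shrinks only slowly.
print(path[:5])
print(path[-1])
```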
Incorporation of Momentum
$v_t = \beta v_{t-1} + \eta_t g_t$, where $\beta \in [0, 1)$, $v_0 = 0$, and $g_t$ is the gradient at the current $x$
$x_t = x_{t-1} - v_t$
- if $\beta = 0$, it's the normal SGD method
- $v_t = \beta v_{t-1} + (1-\beta)\frac{\eta_t g_t}{1-\beta} = \beta^t v_0 + (1-\beta)\sum_{i=0}^{t-1}\beta^{i}\frac{\eta_{t-i} g_{t-i}}{1-\beta}$
- The weight on $\frac{\eta_{t-i} g_{t-i}}{1-\beta}$ is $(1-\beta)\beta^i$, which decreases exponentially as $i$ increases.
- $v_t$ is the exponentially weighted moving average of the past terms $\frac{\eta_{t-i} g_{t-i}}{1-\beta}$
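A minimal sketch of these two update rules, using the same toy gradient as above (the learning rate and $\beta$ here are illustrative assumptions); setting $\beta = 0$ recovers plain SGD:

```python
import numpy as np

def grad(x):
    # Same toy objective: f(x1, x2) = 0.1*x1**2 + 2*x2**2
    return np.array([0.2 * x[0], 4.0 * x[1]])

def sgd_momentum(x0, lr, beta, steps):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)               # v_0 = 0
    for _ in range(steps):
        v = beta * v + lr * grad(x)    # v_t = beta * v_{t-1} + eta_t * g_t
        x = x - v                      # x_t = x_{t-1} - v_t
    return x

print(sgd_momentum((-5.0, -2.0), lr=0.4, beta=0.5, steps=20))  # smoother path
print(sgd_momentum((-5.0, -2.0), lr=0.4, beta=0.0, steps=20))  # beta=0: plain SGD
```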
- Let $n = \frac{1}{1-\beta}$, so $\beta = 1 - \frac{1}{n}$
- $\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^{n} = \frac{1}{e} \approx 0.3679$
- If we treat this number as small, then we can ignore all terms containing $\beta^{\frac{1}{1-\beta}}$ and higher powers when $\beta \rightarrow 1$, i.e. $\beta \in [0.9, 1)$.
For example: if $\beta = 0.95$, then $\frac{1}{1-\beta} = 20$ and $v_t \approx (1-\beta)\sum_{i=0}^{19}\beta^{i}\frac{\eta_{t-i} g_{t-i}}{1-\beta}$
- For momentum, the update applied to $x$ is approximately equal to the exponentially weighted moving average of the previous $\frac{1}{1-\beta}$ update terms ($\eta_{t-i} g_{t-i}$), divided by $1-\beta$
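A quick numerical check of this reasoning, showing that $\beta^{\frac{1}{1-\beta}}$ stays close to $\frac{1}{e} \approx 0.3679$ for $\beta \in [0.9, 1)$, so the terms older than the most recent $\frac{1}{1-\beta}$ steps carry little weight:

```python
for beta in (0.9, 0.95, 0.99):
    n = round(1 / (1 - beta))
    # Weight left on terms older than the most recent n steps is of order beta**n.
    print(f"beta={beta}, n={n}, beta**n={beta ** n:.4f}")  # all close to 1/e
```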
Momentum
- Reduces updates along directions where the gradient changes frequently
- Increases updates along directions where the gradients are consistent
- Dampens oscillations
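For reference, deep learning libraries expose momentum as an optimizer option. A small usage sketch with PyTorch's torch.optim.SGD and an illustrative model (note that PyTorch's formulation applies the learning rate to the whole velocity, i.e. $v_t = \beta v_{t-1} + g_t$ and $x_t = x_{t-1} - \eta v_t$, rather than scaling each gradient inside $v_t$ as in the equations above):

```python
import torch

model = torch.nn.Linear(2, 1)   # illustrative model
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x = torch.randn(8, 2)
loss = model(x).pow(2).mean()   # dummy loss
loss.backward()
opt.step()                      # one SGD-with-momentum step
opt.zero_grad()
```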
Figure from Zhihu: https://zhuanlan.zhihu.com/p/34240246