Preface
These are my notes from the September 30 computer vision fundamentals class on optimization algorithms, organized into four sections:
- BGD, SGD, mini-batch GD;
- Momentum, NAG;
- Adagrad, RMSProp, Adadelta;
- Adam.
1. BGD, SGD, Mini-batch GD
- Batch gradient descent:
$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta), \qquad \theta_{t+1} = \theta_t + \Delta\theta_t \quad \text{with} \quad \Delta\theta_t = -\eta \cdot \nabla_\theta J(\theta_t)$$
where $\theta$ denotes the weights and biases, and $J$ is the loss function.
- Stochastic gradient descent (the learning rate is usually decayed over training):
$$g_{t,i} = \nabla_\theta J(\theta, x^{(i)}, y^{(i)}), \qquad \Delta\theta_t = -\eta \cdot g_{t,i}, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t$$
Because each update uses a single sample, SGD's updates fluctuate heavily. On a non-convex loss this can knock the iterate out of a poor local minimum into a better one, but it can also leave the gradient oscillating around a minimum without settling.
- Mini-batch gradient descent:
$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta, x^{(i:i+n)}, y^{(i:i+n)}), \qquad \text{batch size} = n$$
Compared with SGD, this reduces the variance of the parameter updates.
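The three variants differ only in how much data feeds each gradient step. A minimal sketch on a hypothetical least-squares problem (data and learning rate are illustrative, not from the lecture); setting `n = 1` recovers SGD and `n = len(y)` recovers BGD:

```python
import numpy as np

# Toy least-squares problem: J(theta) = ||X @ theta - y||^2 / (2m).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta  # noiseless targets, so theta should recover true_theta

def grad(theta, Xb, yb):
    # Gradient of the mean squared error on a batch (Xb, yb).
    return Xb.T @ (Xb @ theta - yb) / len(yb)

theta = np.zeros(3)
eta, n = 0.1, 10  # learning rate and mini-batch size
for epoch in range(200):
    idx = rng.permutation(len(y))  # reshuffle each epoch
    for i in range(0, len(y), n):  # n=1 -> SGD, n=len(y) -> BGD
        b = idx[i:i + n]
        theta = theta - eta * grad(theta, X[b], y[b])

print(theta)  # approaches true_theta
```

Because the data are noiseless and the model is exactly realizable, every batch gradient vanishes at the solution, so even the stochastic variants converge to it.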
2. Momentum, NAG
- Momentum:
$$v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta), \qquad \theta = \theta - v_t$$
$\gamma$ is usually set to 0.9.
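A minimal momentum sketch on a 1-D quadratic $J(\theta) = \frac{1}{2}\theta^2$, whose gradient is simply $\theta$ (the problem and step size are illustrative assumptions):

```python
# Momentum on J(theta) = 0.5 * theta**2, so grad J = theta.
gamma, eta = 0.9, 0.1   # common illustrative values, not tuned
theta, v = 5.0, 0.0
for t in range(200):
    g = theta                # gradient of J at the current theta
    v = gamma * v + eta * g  # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v        # theta = theta - v_t
print(theta)  # decays toward the minimum at 0
```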
- Nesterov Accelerated Gradient:
$$v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta - \gamma v_{t-1}), \qquad \theta = \theta - v_t$$
Difference from momentum: the gradient is computed at a different point. NAG first takes a tentative step with the current velocity $v$, evaluates the loss at those temporary parameters, and then computes the gradient there.
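The lookahead evaluation is the only code change relative to momentum; a sketch on the same toy quadratic (illustrative hyperparameters):

```python
# NAG on J(theta) = 0.5 * theta**2 (grad J = theta). The gradient is
# taken at the lookahead point theta - gamma * v, not at theta itself.
gamma, eta = 0.9, 0.1
theta, v = 5.0, 0.0
for t in range(200):
    lookahead = theta - gamma * v  # tentative step with current velocity
    g = lookahead                  # gradient of J at the lookahead point
    v = gamma * v + eta * g
    theta = theta - v
print(theta)  # decays toward the minimum at 0
```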
3. Adagrad, RMSProp, Adadelta
- Adaptive gradient (Adagrad):
$$h \leftarrow h + \frac{\partial L}{\partial \mathbf{W}} \odot \frac{\partial L}{\partial \mathbf{W}}, \qquad \mathbf{W} \leftarrow \mathbf{W} - \eta \cdot \frac{1}{\sqrt{h}} \cdot \frac{\partial L}{\partial \mathbf{W}}$$
or, written per parameter,
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}} + \epsilon} \cdot g_{t,i}$$
where $G_{t,ii}$ accumulates the squared gradients of $\theta_i$ up to step $t$.
Drawback: as training proceeds, $h$ keeps growing, so the step size keeps shrinking; the parameters can stop updating before the model has converged.
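A sketch that makes the shrinking step size visible, again on a toy quadratic (hyperparameters are illustrative assumptions):

```python
import numpy as np

# Adagrad on J(theta) = 0.5 * ||theta||^2; the per-coordinate
# accumulator h only ever grows, so the effective step size decays.
eta, eps = 0.5, 1e-8
theta = np.array([5.0, 5.0])
h = np.zeros(2)
steps = []
for t in range(500):
    g = theta                            # grad of J
    h += g * g                           # accumulate squared gradients
    step = eta / (np.sqrt(h) + eps) * g  # per-coordinate scaled step
    steps.append(abs(step[0]))
    theta = theta - step
print(theta, steps[0], steps[-1])  # later steps are much smaller
```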
- Root Mean Square Propagation (RMSProp):
$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)g_t^2, \qquad \Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$
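Replacing Adagrad's running sum with an exponential moving average is the whole idea; a sketch on the same toy quadratic ($\gamma$ and $\eta$ are illustrative):

```python
import numpy as np

# RMSProp on J(theta) = 0.5 * ||theta||^2: an exponential moving average
# of g^2 keeps the step size from decaying to zero like Adagrad's.
eta, gamma, eps = 0.01, 0.9, 1e-8
theta = np.array([5.0, 5.0])
Eg2 = np.zeros(2)
for t in range(2000):
    g = theta
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g  # moving average of g^2
    theta = theta - eta / (np.sqrt(Eg2) + eps) * g
print(theta)  # settles near the minimum at 0
```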
- Adadelta:
$$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma)\Delta\theta_t^2, \qquad RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}, \qquad \Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}\, g_t$$
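Adadelta replaces the global learning rate with the RMS of the previous parameter updates, so no $\eta$ has to be chosen at all. A sketch under the same toy setup ($\gamma$, $\epsilon$, and the iteration count are illustrative assumptions):

```python
import numpy as np

# Adadelta on J(theta) = 0.5 * ||theta||^2; the RMS of past updates
# plays the role of the learning rate.
gamma, eps = 0.9, 1e-6
theta = np.array([5.0, 5.0])
Eg2 = np.zeros(2)   # running average of squared gradients
Edx2 = np.zeros(2)  # running average of squared updates
for t in range(5000):
    g = theta
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g
    # RMS of past updates divided by RMS of gradients scales the step.
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = gamma * Edx2 + (1 - gamma) * dx * dx
    theta = theta + dx
print(theta)
```

The first steps are tiny (since `Edx2` starts at zero, only $\epsilon$ drives them), which is why Adadelta can be slow to get moving.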
4. Adam
- Adaptive Moment Estimation:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\,\hat{m}_t$$
- $m$ stabilizes the gradient direction: inherited from momentum;
- $v$ adapts the step size per parameter: inherited from RMSProp.
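Putting both moments together gives Adam; a sketch on the toy quadratic with the common defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ ($\eta$ and the iteration count are illustrative):

```python
import numpy as np

# Adam on J(theta) = 0.5 * ||theta||^2 (grad J = theta).
eta, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
theta = np.array([5.0, 5.0])
m = np.zeros(2)  # first moment, from momentum
v = np.zeros(2)  # second moment, from RMSProp
for t in range(1, 3001):  # t starts at 1 for the bias correction
    g = theta
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
print(theta)  # settles near the minimum at 0
```

The bias correction matters early on: with $m_0 = v_0 = 0$, the raw moving averages underestimate the true moments until $\beta^t$ decays.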