- Momentum gradient descent (uses an exponentially weighted moving average):
- $g_t = \nabla_\theta J(\theta_t)$ — gradient with respect to $\theta_t$
- $m_t = \beta m_{t-1} + (1-\beta) g_t$ — exponentially weighted moving average of the gradients
- $\theta_{t+1} = \theta_t - \alpha m_t$ — parameter $\theta$ update
- Each $m_t$ carries the previous $m_{t-1}$, and through it the history of past gradients, as momentum, which is why this is called momentum gradient descent (a sketch follows below)
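A minimal NumPy sketch of the momentum update above, assuming a `grad_fn` callback and a toy quadratic objective; the function name, hyperparameter defaults, and the example objective are illustrative choices, not part of the original notes.

```python
import numpy as np

def momentum_sgd(grad_fn, theta, alpha=0.01, beta=0.9, steps=100):
    """Momentum gradient descent: m_t = beta*m_{t-1} + (1-beta)*g_t."""
    m = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)               # g_t: gradient of J at theta_t
        m = beta * m + (1 - beta) * g    # exponentially weighted average of gradients
        theta = theta - alpha * m        # parameter update
    return theta

# Toy usage (hypothetical): minimize J(theta) = ||theta||^2, whose gradient is 2*theta.
theta_opt = momentum_sgd(lambda th: 2 * th, np.array([3.0, -2.0]), alpha=0.1)
```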
- AdaGrad
- $g_t = \nabla_\theta J(\theta_t)$ — gradient with respect to $\theta_t$
- $G_t = \sum_{i=1}^{t} g_i^2$ — accumulated sum of squared gradients
- $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t+\epsilon}} g_t$
- This amounts to using $\frac{g_t}{\sqrt{G_t+\epsilon}}$ as the gradient; since $G_t$ only grows, the effective step keeps shrinking, acting as an adaptive learning rate (see the sketch below)
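A sketch of the AdaGrad update under the same assumptions as above (`grad_fn` callback, illustrative defaults); the accumulator `G` implements the running sum of squared gradients.

```python
import numpy as np

def adagrad(grad_fn, theta, alpha=0.1, eps=1e-8, steps=100):
    """AdaGrad: scale each step by the accumulated squared gradients."""
    G = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        G = G + g ** 2                                # G_t = sum of squared gradients
        theta = theta - alpha / np.sqrt(G + eps) * g  # per-coordinate adaptive step
    return theta
```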
- RMSprop
- $s_t = \beta s_{t-1} + (1-\beta) g_t^2$ — replaces AdaGrad's $G_t$ with an exponentially weighted average $s_t$, so the effective learning rate does not shrink toward zero
- $\theta_{t+1} = \theta_t - \alpha \frac{g_t}{\sqrt{s_t+\epsilon}}$ (see the sketch below)
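A sketch of RMSprop under the same assumptions; the only change from AdaGrad is that the squared-gradient accumulator decays instead of growing without bound.

```python
import numpy as np

def rmsprop(grad_fn, theta, alpha=0.01, beta=0.9, eps=1e-8, steps=100):
    """RMSprop: exponentially weighted average of squared gradients instead of a running sum."""
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        s = beta * s + (1 - beta) * g ** 2        # s_t replaces AdaGrad's G_t
        theta = theta - alpha * g / np.sqrt(s + eps)
    return theta
```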
- Adam
- $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ — momentum (weighted average of gradients, the first moment)
- $s_t = \beta_2 s_{t-1} + (1-\beta_2) g_t^2$ — weighted average of squared gradients (the second moment)
- $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$, $\hat{s}_t = \frac{s_t}{1-\beta_2^t}$ — bias correction for the zero-initialized moments
- $\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{s}_t}+\epsilon}$
- Combines the advantages of momentum and an adaptive learning rate (see the sketch below)
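A sketch of Adam combining the two ideas above, again with illustrative defaults; note the bias-correction step, which compensates for the zero initialization of `m` and `s`.

```python
import numpy as np

def adam(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Adam: momentum (first moment) + RMSprop-style scaling (second moment), with bias correction."""
    m = np.zeros_like(theta)
    s = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        s = beta2 * s + (1 - beta2) * g ** 2     # second moment (squared gradients)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        s_hat = s / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(s_hat) + eps)
    return theta
```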
- AdamW
- $\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{s}_t}+\epsilon} - \alpha \lambda \theta_t$ — adds a weight-decay term with coefficient $\lambda$
- AdamW is similar to Adam, but applies weight decay directly to the parameters, decoupled from the adaptive gradient update rather than added as an L2 penalty inside the loss; this regularization helps prevent overfitting (see the sketch below)
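A sketch of AdamW, identical to the Adam sketch except for the final decoupled weight-decay term; the coefficient name `wd` and its default value are assumptions for illustration.

```python
import numpy as np

def adamw(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01, steps=100):
    """AdamW: Adam update plus decoupled weight decay applied directly to the parameters."""
    m = np.zeros_like(theta)
    s = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        s_hat = s / (1 - beta2 ** t)
        # Adam step plus the decoupled weight-decay term (-alpha * wd * theta)
        theta = theta - alpha * m_hat / (np.sqrt(s_hat) + eps) - alpha * wd * theta
    return theta
```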