# Deep Learning Optimizers and Formula Derivations


# Basic Optimization

$$w = w + \Delta w$$

Finding the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$, which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

## 1. GD — Gradient Descent

### 1.1 BGD — Batch Gradient Descent

$$w_{i+1}=w_i-\eta\,\frac{1}{m}\sum_{j=1}^{m}\frac{\partial C_j}{\partial w_i}$$

where $m$ is the number of training examples and $C$ is the loss function.

### 1.2 SGD — Stochastic Gradient Descent

$$\Delta w=-\eta\,J^{\prime}(w)\quad\text{or}\quad w_{i+1}=w_i-\eta\,\frac{\partial C}{\partial w_i}$$

where $\eta$ is the learning rate and $J^{\prime}$ is the gradient of the loss with respect to the parameters (also written $\nabla_{w}J(w)$); $C$ is the loss function.

### 1.3 MBGD — Mini-Batch Gradient Descent

$$w=w-\eta\cdot\nabla_{w}J\left(w;\,x^{(i:i+n)};\,y^{(i:i+n)}\right)$$
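The three variants above differ only in how many samples the gradient is averaged over per step. A minimal sketch in plain Python, fitting a made-up one-parameter model $y=w\,x$ with mini-batches (the data, batch size, and learning rate are all illustrative choices):

```python
import random

def mbgd_step(w, xs, ys, lr):
    # Gradient of the mean squared error 0.5*(w*x - y)^2 w.r.t. w,
    # averaged over the mini-batch: mean((w*x - y) * x)
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(100))]

w = 0.0
for epoch in range(50):
    random.shuffle(data)              # stochasticity: new batches each epoch
    for i in range(0, len(data), 4):  # mini-batches of size 4
        xs, ys = zip(*data[i:i + 4])
        w = mbgd_step(w, xs, ys, lr=0.1)
# w is now close to the true coefficient 3.0
```
Setting the batch size to 1 recovers SGD, and to `len(data)` recovers BGD.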

## 2. Momentum

1. It accelerates training.
2. It mitigates SGD's trouble in ravines: like a narrow valley, where SGD tends to bounce back and forth between the steep walls instead of descending to the bottom.

### 2.1 Simple momentum update

For comparison, the vanilla SGD update moves directly along the gradient:

$$\theta=\theta-\eta\,\nabla_{\theta}J(\theta)$$

Momentum instead accumulates a velocity $v$:

$$\begin{aligned} v_{i}&=\gamma v_{i-1}+\eta\,\nabla_{w}J(w)\\ w&=w-v_{i} \end{aligned}$$

### 2.2 Nesterov momentum update (Nesterov Accelerated Gradient, NAG)

$$\begin{aligned} v_{i}&=\gamma v_{i-1}+\eta\,\nabla_{w}J\left(w-\gamma v_{i-1}\right)\\ w&=w-v_{i} \end{aligned}$$
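Both updates can be sketched side by side on the toy objective $J(w)=w^2$ (the learning rate and $\gamma$ are arbitrary illustrative values):

```python
def momentum_step(w, v, grad_fn, lr=0.05, gamma=0.9):
    # v_i = gamma * v_{i-1} + eta * grad(w);  w = w - v_i
    v = gamma * v + lr * grad_fn(w)
    return w - v, v

def nag_step(w, v, grad_fn, lr=0.05, gamma=0.9):
    # Same, but the gradient is evaluated at the look-ahead point w - gamma*v
    v = gamma * v + lr * grad_fn(w - gamma * v)
    return w - v, v

grad = lambda w: 2.0 * w    # gradient of J(w) = w^2
w_m = w_n = 5.0
v_m = v_n = 0.0
for _ in range(300):
    w_m, v_m = momentum_step(w_m, v_m, grad)
    w_n, v_n = nag_step(w_n, v_n, grad)
```
Both converge to the minimum at 0; the look-ahead gradient gives NAG better damping on this quadratic.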

## 3. Adaptive learning-rate optimization algorithms

### 3.1 AdaGrad

AdaGrad first computes the mini-batch gradient estimate:

$$\boldsymbol{g}\leftarrow\frac{1}{m}\nabla_{\boldsymbol{\theta}}\sum_{i}L\left(f\left(\boldsymbol{x}^{(i)};\boldsymbol{\theta}\right),\boldsymbol{y}^{(i)}\right)$$

$$\begin{aligned} r_{i}&=r_{i-1}+g_{i}^{2}\\ \Delta w&=\frac{\eta}{\epsilon+\sqrt{r_{i}}}\,g_{i}\\ w&=w-\Delta w \end{aligned}$$

$\odot$ denotes the Hadamard product (element-wise product) of matrices, i.e.

$$\left[\begin{array}{cccc} a_{11}b_{11} & a_{12}b_{12} & \cdots & a_{1n}b_{1n}\\ a_{21}b_{21} & a_{22}b_{22} & \cdots & a_{2n}b_{2n}\\ \vdots & \vdots & & \vdots\\ a_{m1}b_{m1} & a_{m2}b_{m2} & \cdots & a_{mn}b_{mn} \end{array}\right]$$

Expanding $r_{i}$ (note that $i$ starts from 1):

$$\begin{aligned} r_{i}&=r_{i-1}+g_{i}^{2}\\ &=r_{i-2}+g_{i-1}^{2}+g_{i}^{2}\\ &=r_{0}+g_{1}^{2}+g_{2}^{2}+\cdots+g_{i}^{2}\\ &=r_{0}+\sum_{j=1}^{i}g_{j}^{2} \end{aligned}$$

With $r_0=0$, the first step is

$$\begin{aligned} \Delta w&=\frac{\eta}{\epsilon+\sqrt{r_{1}}}\,g_{1}\\ &=\frac{\eta}{\epsilon+\left|g_{1}\right|}\,g_{1} \end{aligned}$$

Alternatively, initializing $r_0=\epsilon$ keeps the accumulator strictly positive:

$$\begin{aligned} r_{i}&=r_{0}+\sum_{j=1}^{i}g_{j}^{2}\\ &=\epsilon+\sum_{j=1}^{i}g_{j}^{2}>0 \end{aligned}$$

1. "The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. The net effect is greater progress in the more gently sloped directions of parameter space." (from *Deep Learning*, 花书)

2. It reduces the need for manual tuning of the learning rate.
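A scalar sketch of the AdaGrad update above, again on $J(w)=w^2$ with an arbitrary learning rate; note how the accumulator $r$ only ever grows, so the effective learning rate only ever shrinks:

```python
import math

def adagrad_step(w, r, g, lr=0.5, eps=1e-8):
    r = r + g * g                          # r_i = r_{i-1} + g_i^2
    w = w - lr / (eps + math.sqrt(r)) * g  # per-parameter scaled step
    return w, r

w, r = 5.0, 0.0
for _ in range(2000):
    w, r = adagrad_step(w, r, 2.0 * w)     # gradient of J(w) = w^2 is 2w
```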

### 3.2 RMSProp

$$\begin{aligned} r_i&=\rho r_{i-1}+(1-\rho)g_{i}^{2}\\ \Delta w&=-\frac{\eta}{\epsilon+\sqrt{r_i}}\,g_i\\ w&=w+\Delta w \end{aligned}$$

$$\begin{aligned} E\left[g^2\right]_t&=\gamma E\left[g^2\right]_{t-1}+\left(1-\gamma\right)g_{t}^{2}\\ \theta_{t+1}&=\theta_t-\frac{\eta}{\sqrt{E\left[g^2\right]_t+\epsilon}}\,g_t \end{aligned}$$

$\rho$ and $\gamma$ are the same newly introduced hyperparameter written differently in different references: the first form follows *Deep Learning*, while the second is probably the one used in the paper.

RMSProp uses an exponentially weighted average and so remembers only recent gradients, which turns the "ocean waves" into a "bowl" and lets the method keep working at full effectiveness late in training.

RMSProp is simpler and easier to use; AdaDelta (next) is computationally more involved, but the underlying idea is similar.
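A matching scalar sketch of RMSProp: compared with AdaGrad, the only change is that the accumulator is an exponentially weighted average, so old gradients are forgotten (the hyperparameter values are arbitrary illustrative choices):

```python
import math

def rmsprop_step(w, r, g, lr=0.01, rho=0.9, eps=1e-8):
    r = rho * r + (1 - rho) * g * g        # forgetful accumulator
    w = w - lr / (eps + math.sqrt(r)) * g
    return w, r

w, r = 5.0, 0.0
for _ in range(2000):
    w, r = rmsprop_step(w, r, 2.0 * w)     # gradient of J(w) = w^2
```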

### 3.3 AdaDelta

$$\begin{aligned} r_i&=\rho r_{i-1}+(1-\rho)g_{i}^{2}\\ s_i&=\rho s_{i-1}+(1-\rho)\Delta w^2\\ \Delta w&=-\frac{\sqrt{\epsilon+s_i}}{\sqrt{\epsilon+r_i}}\,g_i\\ w&=w+\Delta w \end{aligned}$$

$$E\left[g^2\right]_0=0,\qquad E\left[\Delta w^2\right]_0=0$$

$$\begin{aligned} E\left[g^2\right]_t&=\rho E\left[g^2\right]_{t-1}+(1-\rho)g_{t}^{2}\\ \Delta w_t&=-\frac{\eta}{\sqrt{E\left[g^2\right]_t+\epsilon}}\,g_t\;\Rightarrow\;-\frac{\eta}{RMS[g]_t}\,g_t\;\Rightarrow\;-\frac{RMS[\Delta w]_{t-1}}{RMS[g]_t}\,g_t\\ w_{t+1}&=w_t+\Delta w_t \end{aligned}$$

$\rho$ is typically 0.9.

$$RMS[g]_t=\sqrt{E\left[g^2\right]_t+\epsilon}$$

$$RMS[\Delta w]_{t-1}=\sqrt{E\left[\Delta w^2\right]_{t-1}+\epsilon},\qquad E\left[\Delta w^2\right]_{t-1}=\rho E\left[\Delta w^2\right]_{t-2}+(1-\rho)\Delta w_{t-1}^2$$
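AdaDelta can be sketched the same way; note that there is no global learning rate $\eta$, its role being played by the RMS of past updates (the $\rho$ and $\epsilon$ values are arbitrary illustrative choices):

```python
import math

def adadelta_step(w, r, s, g, rho=0.95, eps=1e-6):
    r = rho * r + (1 - rho) * g * g                     # RMS of gradients
    dw = -math.sqrt(s + eps) / math.sqrt(r + eps) * g   # no learning rate
    s = rho * s + (1 - rho) * dw * dw                   # RMS of updates
    return w + dw, r, s

w, r, s = 5.0, 0.0, 0.0
for _ in range(5000):
    w, r, s = adadelta_step(w, r, s, 2.0 * w)           # gradient of w^2
```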

### 3.4 Adam

1. $t$: the update step count (steps)
2. $\alpha$: the learning rate, controlling the step size
3. $\theta$: the parameters to optimize
4. $f(\theta)$: the stochastic objective function with parameters $\theta$, usually the loss function
5. $g_{t}$: the gradient obtained by differentiating $f(\theta)$ with respect to $\theta$
6. $\beta_{1}$: decay rate for the first moment
7. $\beta_{2}$: decay rate for the second moment
8. $m_{t}$: the first moment of the gradient $g_{t}$, i.e. the expectation of $g_{t}$
9. $v_{t}$: the second moment of the gradient, i.e. the expectation of $g_{t}^{2}$
10. $\hat{m}_{t}$: bias-corrected $m_{t}$, accounting for $m_{t}$ being biased toward 0 when initialized at zero
11. $\hat{v}_{t}$: bias-corrected $v_{t}$, accounting for $v_{t}$ being biased toward 0 when initialized at zero

$$\begin{aligned} t&\gets t+1\\ g_t&\gets\nabla_{\theta}f_t\left(\theta_{t-1}\right)\\ m_t&\gets\beta_1\cdot m_{t-1}+\left(1-\beta_1\right)\cdot g_t\\ v_t&\gets\beta_2\cdot v_{t-1}+\left(1-\beta_2\right)\cdot g_{t}^{2}\\ \hat{m}_t&\gets\frac{m_t}{1-\beta_{1}^{t}}\\ \hat{v}_t&\gets\frac{v_t}{1-\beta_{2}^{t}}\\ \theta_t&\gets\theta_{t-1}-\alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \end{aligned}$$

(3) The hyperparameters have intuitive interpretations and typically require no tuning, or only minor fine-tuning.
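The full update loop above can be sketched for a scalar parameter (defaults follow the common $\beta_1=0.9,\ \beta_2=0.999$; the learning rate and toy objective are illustrative):

```python
import math

def adam_step(w, m, v, t, g, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g    # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)        # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):             # t starts at 1, as in the algorithm
    w, m, v = adam_step(w, m, v, t, 2.0 * w)
```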

### 3.5 AdaMax

In the extensions section of the Adam paper, the authors propose a new algorithm. They observe that in Adam, the update rule for an individual weight scales its gradient inversely proportionally to a (scalar) $\ell_{2}$ norm built from its current $\left|g_{t}\right|^{2}$ and past gradients $v_{t-1}$.

$$v_t=\beta_2 v_{t-1}+\left(1-\beta_2\right)\left|g_t\right|^2\;\Longrightarrow\;v_t=\beta_{2}^{p}v_{t-1}+\left(1-\beta_{2}^{p}\right)\left|g_t\right|^p=\left(1-\beta_{2}^{p}\right)\sum_{i=1}^t\beta_{2}^{p(t-i)}\left|g_i\right|^p$$

$$\begin{aligned} u_{t}=\lim_{p\rightarrow\infty}\left(v_{t}\right)^{1/p} &=\lim_{p\rightarrow\infty}\left(\left(1-\beta_{2}^{p}\right)\sum_{i=1}^{t}\beta_{2}^{p(t-i)}\cdot\left|g_{i}\right|^{p}\right)^{1/p}\\ &=\lim_{p\rightarrow\infty}\left(1-\beta_{2}^{p}\right)^{1/p}\left(\sum_{i=1}^{t}\beta_{2}^{p(t-i)}\cdot\left|g_{i}\right|^{p}\right)^{1/p}\\ &=\lim_{p\rightarrow\infty}\left(\sum_{i=1}^{t}\left(\beta_{2}^{(t-i)}\cdot\left|g_{i}\right|\right)^{p}\right)^{1/p}\\ &=\max\left(\beta_{2}^{t-1}\left|g_{1}\right|,\beta_{2}^{t-2}\left|g_{2}\right|,\ldots,\beta_{2}\left|g_{t-1}\right|,\left|g_{t}\right|\right) \end{aligned}$$
(The last step uses $\lim_{p\to\infty}\left(1-\beta_2^p\right)^{1/p}=1$ for $0<\beta_2<1$, together with the fact that the $\ell_p$ norm tends to the $\ell_\infty$ norm: $\lim_{p\to\infty}\left(\sum_i a_i^p\right)^{1/p}=\max_i a_i$ for $a_i\ge 0$.)

$$u_{t}=\max\left(\beta_{2}\cdot u_{t-1},\left|g_{t}\right|\right)$$

$$\begin{aligned} t&\gets t+1\\ g_t&\gets\nabla_w f_t\left(w_{t-1}\right)\\ m_t&\gets\beta_1\cdot m_{t-1}+\left(1-\beta_1\right)\cdot g_t\\ u_t&\gets\max\left(\beta_2\cdot u_{t-1},\left|g_t\right|\right)\\ w_t&\gets w_{t-1}-\left(\alpha/\left(1-\beta_{1}^{t}\right)\right)\cdot m_t/u_t \end{aligned}$$

Good default settings are $\alpha=0.002$, $\beta_1=0.9$, and $\beta_2=0.999$.

Note: the $\max$ compares, in each dimension of the gradient, the current magnitude against the (decayed) historical maximum.
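A sketch of the AdaMax loop above; the $\max$ accumulator replaces the second moment, so no $\epsilon$ is needed in the denominator (the learning rate and toy objective are arbitrary illustrative choices):

```python
def adamax_step(w, m, u, t, g, lr=0.1, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * g
    u = max(b2 * u, abs(g))          # exponentially decayed infinity norm
    return w - (lr / (1 - b1 ** t)) * m / u, m, u

w, m, u = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, u = adamax_step(w, m, u, t, 2.0 * w)   # gradient of J(w) = w^2
```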

For a vector $x=\left[x_{1},x_{2},\cdots,x_{n}\right]^{\mathrm{T}}$, the $\ell_p$ norm is

$$\|x\|_{p}=\left(\left|x_{1}\right|^{p}+\left|x_{2}\right|^{p}+\cdots+\left|x_{n}\right|^{p}\right)^{\frac{1}{p}}$$

$\ell_{1}$ norm:
$$\|x\|_{1}=\left|x_{1}\right|+\left|x_{2}\right|+\left|x_{3}\right|+\cdots+\left|x_{n}\right|$$

$\ell_{2}$ norm:
$$\|x\|_{2}=\left(\left|x_{1}\right|^{2}+\left|x_{2}\right|^{2}+\left|x_{3}\right|^{2}+\cdots+\left|x_{n}\right|^{2}\right)^{1/2}$$

The L0 norm counts the number of nonzero elements of a vector (it is hard to optimize directly).

The L1 norm is the sum of the absolute values of the elements.

The L2 norm is the square root of the sum of squared elements.

The L1 norm can perform feature selection, driving some feature coefficients exactly to 0.

The L2 norm helps prevent overfitting and improves generalization; it also helps with ill-conditioned matrices (a bad condition number, where tiny changes in the data cause large changes in the solution).

(Key point: L2 is more sensitive to large values, i.e. to outliers!)

L1 tends to produce a small number of nonzero features with the rest exactly 0, whereas L2 keeps more features, all close to 0.
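The norms above, and the $\ell_\infty$ limit used by AdaMax, in a few lines of Python (the example vector is arbitrary):

```python
def lp_norm(x, p):
    # ||x||_p = (sum_i |x_i|^p)^(1/p)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
l1 = lp_norm(x, 1)               # |3| + |-4| = 7
l2 = lp_norm(x, 2)               # sqrt(9 + 16) = 5
linf = max(abs(xi) for xi in x)  # the p -> infinity limit: max |x_i| = 4
# Already at p = 50 the lp norm is numerically indistinguishable from linf.
```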

## 4. Nadam (RMSProp + NAG)

NAG:

$$\begin{aligned} g_t&=\nabla_{w_t}J\left(w_t-\gamma m_{t-1}\right)\\ m_t&=\gamma m_{t-1}+\eta g_t\\ w_{t+1}&=w_t-m_t \end{aligned}$$

Momentum:

$$\begin{aligned} g_t&=\nabla_{w_t}J\left(w_t\right)\\ m_t&=\gamma m_{t-1}+\eta g_t\\ w_{t+1}&=w_t-m_t \end{aligned}$$

RMSprop:

$$\begin{aligned} r_i&=\rho r_{i-1}+(1-\rho)g_{i}^{2}\\ \Delta w&=-\frac{\eta}{\epsilon+\sqrt{r_i}}\,g_i\\ w&=w+\Delta w \end{aligned}$$

Adam:

$$\begin{aligned} t&\gets t+1\\ g_t&\gets\nabla_{\theta}f_t\left(\theta_{t-1}\right)\\ m_t&\gets\beta_1\cdot m_{t-1}+\left(1-\beta_1\right)\cdot g_t\\ v_t&\gets\beta_2\cdot v_{t-1}+\left(1-\beta_2\right)\cdot g_{t}^{2}\\ \hat{m}_t&\gets\frac{m_t}{1-\beta_{1}^{t}}\\ \hat{v}_t&\gets\frac{v_t}{1-\beta_{2}^{t}}\\ \theta_t&\gets\theta_{t-1}-\alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \end{aligned}$$

Dozat proposes the following modification of NAG: rather than applying the momentum term $\gamma m_{t-1}$ twice, once to compute the gradient $g_{t}$ and again to update the parameters $w_{t+1}$, apply the look-ahead momentum vector directly to the current parameter update:

$$\begin{aligned} g_{t}&=\nabla_{w_{t}}J\left(w_{t}\right)\\ m_{t}&=\gamma m_{t-1}+\eta g_{t}\\ w_{t+1}&=w_{t}-\left(\gamma m_{t}+\eta g_{t}\right) \end{aligned}$$

Now expand the momentum part of Adam's update rule (note: $\hat{v}_t$ does not need to change):

$$\begin{aligned} m_t&=\beta_1 m_{t-1}+\left(1-\beta_1\right)g_t\\ \hat{m}_t&=\frac{m_t}{1-\beta_{1}^{t}}\\ w_{t+1}&=w_t-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\,\hat{m}_t \end{aligned}$$

$$w_{t+1}=w_t-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\frac{\beta_1 m_{t-1}}{1-\beta_{1}^{t}}+\frac{\left(1-\beta_1\right)g_t}{1-\beta_{1}^{t}}\right)$$

$$\frac{m_{t-1}}{1-\beta_{1}^{t}}\approx\frac{m_{t-1}}{1-\beta_{1}^{t-1}}=\hat{m}_{t-1}$$
(The approximation is harmless; this term is replaced in the next step anyway.)

$$w_{t+1}=w_t-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\beta_1\hat{m}_{t-1}+\frac{\left(1-\beta_1\right)g_t}{1-\beta_{1}^{t}}\right)$$

Applying the same look-ahead idea, replace $\hat{m}_{t-1}$ with $\hat{m}_t$:

$$w_{t+1}=w_t-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\beta_1\hat{m}_t+\frac{\left(1-\beta_1\right)g_t}{1-\beta_{1}^{t}}\right)$$


Recall the AdaMax update:

$$u_{t}=\max\left(\beta_{2}\cdot u_{t-1},\left|g_{t}\right|\right),\qquad \Delta w=\frac{\alpha}{1-\beta_{1}^{t}}\,\frac{m_t}{u_t}$$

Replacing $\sqrt{\hat{v}_t}+\epsilon$ with $u_t$ gives:

$$w_{t+1}=w_t-\frac{\eta}{u_t}\left(\beta_1\hat{m}_t+\frac{\left(1-\beta_1\right)g_t}{1-\beta_{1}^{t}}\right),\qquad u_t=\max\left(\beta_2\cdot u_{t-1},\left|g_t\right|\right)$$

(I need to study the original paper more carefully; I do not fully understand this part yet.)
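Despite the derivation detour, the resulting Nadam step is only a small change to Adam. A scalar sketch of the final update from the derivation above (hyperparameters and toy objective are arbitrary illustrative choices):

```python
import math

def nadam_step(w, m, v, t, g, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Nesterov-style look-ahead: mix the current bias-corrected momentum
    # with the current gradient, instead of using m_hat alone as Adam does.
    m_bar = b1 * m_hat + (1 - b1) * g / (1 - b1 ** t)
    return w - lr * m_bar / (math.sqrt(v_hat) + eps), m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = nadam_step(w, m, v, t, 2.0 * w)   # gradient of J(w) = w^2
```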

## 5. AMSGrad

AMSGrad, the fix proposed by Reddi et al., replaces the exponentially weighted average with a max operation:

$$v_{t}=\beta_{2}v_{t-1}+\left(1-\beta_{2}\right)g_{t}^{2}$$
$$\hat{v}_{t}=\max\left(\hat{v}_{t-1},v_{t}\right)$$

This takes the place of Adam's bias-correction step:

$$\hat{v}_t\gets\frac{v_t}{1-\beta_{2}^{t}}$$

$$\begin{aligned} m_{t}&=\beta_{1}m_{t-1}+\left(1-\beta_{1}\right)g_{t}\\ v_{t}&=\beta_{2}v_{t-1}+\left(1-\beta_{2}\right)g_{t}^{2}\\ \hat{v}_{t}&=\max\left(\hat{v}_{t-1},v_{t}\right)\\ w_{t+1}&=w_{t}-\frac{\eta}{\sqrt{\hat{v}_{t}}+\epsilon}\,m_{t} \end{aligned}$$
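A scalar sketch of AMSGrad exactly as written above (no bias correction, per the update shown; hyperparameters and toy objective are illustrative):

```python
import math

def amsgrad_step(w, m, v, v_max, g, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    v_max = max(v_max, v)            # the denominator can never shrink
    w = w - lr / (math.sqrt(v_max) + eps) * m
    return w, m, v, v_max

w, m, v, v_max = 5.0, 0.0, 0.0, 0.0
for _ in range(2000):
    w, m, v, v_max = amsgrad_step(w, m, v, v_max, 2.0 * w)
```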

# To learn

- https://ruder.io/deep-learning-optimization-2017/
- Optimizers from recent years
- Add code
- Read the papers

# Reference

CSC321 Lecture 7

I. Goodfellow, Y. Bengio, and A. Courville, *Deep Learning*. MIT Press, 2016. (花书)

"L1范数与L2范数的区别" (blog post on the difference between the L1 and L2 norms)

Reddi, Sashank J., Kale, Satyen, and Kumar, Sanjiv. On the Convergence of Adam and Beyond. Proceedings of ICLR 2018.
