Parameter to be optimized $w$, loss function $loss$, learning rate $lr$; each iteration processes one batch, and $t$ denotes the total number of batch iterations so far.
- Compute the gradient of the loss function with respect to the current parameters at time $t$: $g_t=\nabla loss=\frac{\partial loss}{\partial w_t}$
- Compute the first-order momentum $m_t$ and second-order momentum $V_t$ at time $t$
- Compute the descent step at time $t$: $\eta_t=lr\cdot m_t/\sqrt{V_t}$
- Compute the parameters at time $t+1$: $w_{t+1}=w_t-\eta_t=w_t-lr\cdot m_t/\sqrt{V_t}$

First-order momentum: a function of the gradient.
Second-order momentum: a function of the squared gradient.
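To make the four-step framework concrete, here is a minimal NumPy sketch (not from the course notes) that plugs an optimizer-specific momentum rule into the generic update. The toy loss $loss(w)=w^2$, the scalar parameter, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def train(update_moments, w0=1.0, lr=0.1, steps=10):
    """Generic four-step loop: gradient -> momenta -> step -> update."""
    w, state = w0, {}
    for t in range(1, steps + 1):
        g_t = 2.0 * w                             # step 1: gradient of the toy loss w^2
        m_t, V_t = update_moments(g_t, t, state)  # step 2: optimizer-specific m_t, V_t
        eta_t = lr * m_t / np.sqrt(V_t)           # step 3: descent step eta_t
        w = w - eta_t                             # step 4: w_{t+1} = w_t - eta_t
    return w

# With m_t = g_t and V_t = 1 this reduces to plain SGD (optimizer 1 below).
print(train(lambda g_t, t, state: (g_t, 1.0)))
```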
Five optimizers
1. SGD (without momentum): stochastic gradient descent
$$m_t=g_t,\qquad V_t=1$$
$$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t$$
$$w_{t+1}=w_t-\eta_t=w_t-lr\cdot m_t/\sqrt{V_t}=w_t-lr\cdot g_t=w_t-lr\cdot\frac{\partial loss}{\partial w_t}$$
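A minimal standalone sketch of the SGD update, assuming the same toy loss $loss(w)=w^2$ and illustrative hyperparameters:

```python
import numpy as np

w, lr = 1.0, 0.1
for t in range(1, 11):
    g_t = 2.0 * w                  # gradient of the toy loss at w_t
    m_t, V_t = g_t, 1.0            # SGD: m_t = g_t, V_t = 1
    w -= lr * m_t / np.sqrt(V_t)   # w_{t+1} = w_t - lr * g_t
print(w)  # w decays toward the minimum at 0
```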
2. SGDM (SGD with momentum): adds first-order momentum on top of SGD
$$m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot g_t,\qquad V_t=1$$
$$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot m_t=lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)$$
$$w_{t+1}=w_t-\eta_t=w_t-lr\cdot(\beta\cdot m_{t-1}+(1-\beta)\cdot g_t)$$
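A minimal sketch of the SGDM update under the same toy-loss assumption; the momentum coefficient `beta = 0.9` is a common but illustrative choice:

```python
import numpy as np

w, lr, beta, m = 1.0, 0.1, 0.9, 0.0
for t in range(1, 11):
    g_t = 2.0 * w                     # gradient of the toy loss at w_t
    m = beta * m + (1 - beta) * g_t   # m_t = beta * m_{t-1} + (1 - beta) * g_t
    V_t = 1.0                         # second-order momentum stays 1
    w -= lr * m / np.sqrt(V_t)        # eta_t = lr * m_t
print(w)
```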
3. Adagrad: adds second-order momentum on top of SGD
$$m_t=g_t,\qquad V_t=\sum_{\tau=1}^{t}g_\tau^2$$
$$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t/\sqrt{\sum_{\tau=1}^{t}g_\tau^2}$$
$$w_{t+1}=w_t-\eta_t=w_t-lr\cdot g_t/\sqrt{\sum_{\tau=1}^{t}g_\tau^2}$$
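A minimal sketch of Adagrad under the same toy-loss assumption; the small `eps` is a common numerical safeguard that is not part of the formulas above:

```python
import numpy as np

w, lr, V, eps = 1.0, 0.1, 0.0, 1e-8
for t in range(1, 11):
    g_t = 2.0 * w                        # gradient of the toy loss at w_t
    V += g_t ** 2                        # V_t = sum of all squared gradients so far
    w -= lr * g_t / (np.sqrt(V) + eps)   # m_t = g_t; eps avoids division by zero
print(w)
```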
4. RMSProp: adds second-order momentum on top of SGD
$$m_t=g_t,\qquad V_t=\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2$$
$$\eta_t=lr\cdot m_t/\sqrt{V_t}=lr\cdot g_t/\sqrt{\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2}$$
$$w_{t+1}=w_t-\eta_t=w_t-lr\cdot g_t/\sqrt{\beta\cdot V_{t-1}+(1-\beta)\cdot g_t^2}$$
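A minimal sketch of RMSProp under the same assumptions; `beta = 0.9` is illustrative, and `eps` is again a common safeguard not in the formulas above:

```python
import numpy as np

w, lr, beta, V, eps = 1.0, 0.1, 0.9, 0.0, 1e-8
for t in range(1, 11):
    g_t = 2.0 * w                          # gradient of the toy loss at w_t
    V = beta * V + (1 - beta) * g_t ** 2   # exponential moving average of g_t^2
    w -= lr * g_t / (np.sqrt(V) + eps)     # m_t = g_t
print(w)
```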
5. Adam: combines SGDM's first-order momentum with RMSProp's second-order momentum
$$m_t=\beta_1\cdot m_{t-1}+(1-\beta_1)\cdot g_t$$
Bias correction of the first-order momentum:
$$\widehat{m_t}=\frac{m_t}{1-\beta_1^t}$$
$$V_t=\beta_2\cdot V_{t-1}+(1-\beta_2)\cdot g_t^2$$
Bias correction of the second-order momentum:
$$\widehat{V_t}=\frac{V_t}{1-\beta_2^t}$$
$$\eta_t=lr\cdot\widehat{m_t}/\sqrt{\widehat{V_t}}=lr\cdot\frac{m_t}{1-\beta_1^t}\Big/\sqrt{\frac{V_t}{1-\beta_2^t}}$$
$$w_{t+1}=w_t-\eta_t=w_t-lr\cdot\frac{m_t}{1-\beta_1^t}\Big/\sqrt{\frac{V_t}{1-\beta_2^t}}$$
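A minimal sketch of Adam with both bias corrections, under the same toy-loss assumption; `beta1 = 0.9` and `beta2 = 0.999` are commonly used values but are illustrative here, and `eps` is again a safeguard not in the formulas above:

```python
import numpy as np

w, lr = 1.0, 0.1
beta1, beta2, m, V, eps = 0.9, 0.999, 0.0, 0.0, 1e-8
for t in range(1, 11):
    g_t = 2.0 * w                             # gradient of the toy loss at w_t
    m = beta1 * m + (1 - beta1) * g_t         # first-order momentum m_t
    V = beta2 * V + (1 - beta2) * g_t ** 2    # second-order momentum V_t
    m_hat = m / (1 - beta1 ** t)              # bias-corrected m_t
    V_hat = V / (1 - beta2 ** t)              # bias-corrected V_t
    w -= lr * m_hat / (np.sqrt(V_hat) + eps)  # w_{t+1} = w_t - eta_t
print(w)
```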
These notes are based on the video course 人工智能实践:Tensorflow笔记 (Artificial Intelligence Practice: TensorFlow Notes).