Neural Network Parameter Optimizers
Symbols and meanings
- w: the parameters to be optimized
- loss: the loss function
- lr: the learning rate
- batch: the number of samples fed in per iteration (the batch size)
- t: the current iteration count, i.e. the total number of batches processed so far
Iteration steps
(1) Compute the gradient of the loss function with respect to the current parameters at time t:
$g_t = \nabla loss = \frac{d\,loss}{d\,w_t}$
(2) Compute the first-order momentum $m_t$ (a function of the gradients) and the second-order momentum $V_t$ (a function of the squared gradients) at time t.
(3) Compute the descent step at time t:
$\eta_t = lr \cdot m_t / \sqrt{V_t}$
(4) Compute the parameters at time t+1:
$w_{t+1} = w_t - \eta_t = w_t - lr \cdot m_t / \sqrt{V_t}$
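These four steps are a shared skeleton: the optimizers below differ only in how $m_t$ and $V_t$ are defined. A minimal sketch of the skeleton, where `first_moment` and `second_moment` are hypothetical placeholder hooks (not from the original note) that each optimizer would plug in:

```python
import tensorflow as tf

def generic_update(w, g_t, lr, first_moment, second_moment):
    """One generic optimizer step; first_moment/second_moment are
    hypothetical hooks standing in for each optimizer's m_t and V_t."""
    m_t = first_moment(g_t)           # step (2): first-order momentum
    V_t = second_moment(g_t)          # step (2): second-order momentum
    eta_t = lr * m_t / tf.sqrt(V_t)   # step (3): descent step
    w.assign_sub(eta_t)               # step (4): w_{t+1} = w_t - eta_t

# e.g. plain SGD: generic_update(w1, grads[0], 0.1, lambda g: g, lambda g: 1.0)
```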
Common optimizers
- SGD (without momentum):
(1) $m_t = g_t$, $V_t = 1$
(2) $\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t$
(3) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t$, i.e. $w_{t+1} = w_t - lr \cdot \frac{d\,loss}{d\,w_t}$
Implementation:
```python
w1.assign_sub(lr * grads[0])
b1.assign_sub(lr * grads[1])
```
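The snippets in this note all assume `grads`, `w1`, and `b1` already exist. For concreteness, here is a runnable sketch of one full SGD step on a toy linear model (the model, data, and shapes are illustrative assumptions, not from the original note):

```python
import tensorflow as tf

lr = 0.1
w1 = tf.Variable(tf.random.normal([2, 1]))
b1 = tf.Variable(tf.zeros([1]))

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # toy inputs
y = tf.constant([[1.0], [0.0]])             # toy targets

with tf.GradientTape() as tape:
    y_pred = tf.matmul(x, w1) + b1
    loss = tf.reduce_mean(tf.square(y - y_pred))
grads = tape.gradient(loss, [w1, b1])       # g_t for w1 and b1

w1.assign_sub(lr * grads[0])                # w_{t+1} = w_t - lr * g_t
b1.assign_sub(lr * grads[1])
```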
- SGDM (SGD with momentum), which adds first-order momentum to SGD:
(1) $m_t = \beta \cdot m_{t-1} + (1-\beta) \cdot g_t$, $V_t = 1$ ($\beta$ is a hyperparameter close to 1, typically 0.9)
(2) $\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot m_t = lr \cdot (\beta \cdot m_{t-1} + (1-\beta) \cdot g_t)$
(3) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot (\beta \cdot m_{t-1} + (1-\beta) \cdot g_t)$
Implementation:
```python
m_w, m_b = 0, 0
beta = 0.9
m_w = beta * m_w + (1 - beta) * grads[0]
m_b = beta * m_b + (1 - beta) * grads[1]
w1.assign_sub(lr * m_w)
b1.assign_sub(lr * m_b)
```
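Unrolling the recursion (with $m_0 = 0$) shows why this smooths the update direction: $m_t$ is an exponentially weighted moving average of past gradients,

$m_t = (1-\beta)\sum_{\tau=1}^{t} \beta^{\,t-\tau}\, g_\tau$

so recent gradients dominate while older ones decay geometrically.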
- Adagrad, which adds second-order momentum to SGD:
(1) $m_t = g_t$, $V_t = \sum_{\tau=1}^{t} g_\tau^2$
(2) $\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t / \sqrt{\sum_{\tau=1}^{t} g_\tau^2}$
(3) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t / \sqrt{\sum_{\tau=1}^{t} g_\tau^2}$
Implementation:
```python
v_w, v_b = 0, 0
v_w += tf.square(grads[0])
v_b += tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
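Because $V_t$ only ever grows, the effective step size shrinks monotonically. As a quick worked example, if every gradient had the same constant value $g > 0$, then $V_t = t\,g^2$ and

$\eta_t = lr \cdot g / \sqrt{t\,g^2} = \frac{lr}{\sqrt{t}}$

i.e. a built-in $1/\sqrt{t}$ learning-rate decay.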
- RMSProp, which also adds second-order momentum to SGD, but as an exponential moving average rather than a cumulative sum:
(1) $m_t = g_t$, $V_t = \beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2$
(2) $\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t / \sqrt{\beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2}$
(3) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t / \sqrt{\beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2}$
Implementation:
```python
v_w, v_b = 0, 0
beta = 0.9
v_w = beta * v_w + (1 - beta) * tf.square(grads[0])
v_b = beta * v_b + (1 - beta) * tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
```
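One practical caveat: early in training $V_t$ can be close to zero, so real implementations typically add a small $\epsilon$ (e.g. $10^{-7}$) inside the square root, $\eta_t = lr \cdot g_t / \sqrt{V_t + \epsilon}$, to avoid division by zero; the snippet above omits this for clarity.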
- Adam, which combines SGDM's first-order momentum with RMSProp's second-order momentum:
(1) $m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$
(2) Bias correction for the first-order momentum: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$
(3) $V_t = \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2$
(4) Bias correction for the second-order momentum: $\hat{V}_t = \frac{V_t}{1-\beta_2^t}$
(5) $\eta_t = lr \cdot \hat{m}_t / \sqrt{\hat{V}_t} = lr \cdot \frac{m_t}{1-\beta_1^t} / \sqrt{\frac{V_t}{1-\beta_2^t}}$
(6) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot \frac{m_t}{1-\beta_1^t} / \sqrt{\frac{V_t}{1-\beta_2^t}}$
Implementation:
```python
m_w, m_b = 0, 0
v_w, v_b = 0, 0
beta1, beta2 = 0.9, 0.999
global_step = 0

# inside the training loop, once per batch:
global_step += 1   # must be >= 1, otherwise 1 - beta^0 = 0 below
m_w = beta1 * m_w + (1 - beta1) * grads[0]
m_b = beta1 * m_b + (1 - beta1) * grads[1]
v_w = beta2 * v_w + (1 - beta2) * tf.square(grads[0])
v_b = beta2 * v_b + (1 - beta2) * tf.square(grads[1])
m_w_correction = m_w / (1 - tf.pow(beta1, int(global_step)))
m_b_correction = m_b / (1 - tf.pow(beta1, int(global_step)))
v_w_correction = v_w / (1 - tf.pow(beta2, int(global_step)))
v_b_correction = v_b / (1 - tf.pow(beta2, int(global_step)))
w1.assign_sub(lr * m_w_correction / tf.sqrt(v_w_correction))
b1.assign_sub(lr * m_b_correction / tf.sqrt(v_b_correction))
```
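For comparison, each hand-written update above has a built-in counterpart in `tf.keras.optimizers`. A sketch of the Adam case, assuming `grads`, `w1`, and `b1` as in the snippets above:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
optimizer.apply_gradients(zip(grads, [w1, b1]))   # one Adam step

# Counterparts for the other optimizers in this note:
# tf.keras.optimizers.SGD(learning_rate=0.1)                 # SGD
# tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)   # SGDM
# tf.keras.optimizers.Adagrad(learning_rate=0.1)             # Adagrad
# tf.keras.optimizers.RMSprop(learning_rate=0.1, rho=0.9)    # RMSProp
```

Note that the built-in SGD momentum uses the classic formulation (velocity accumulated without the $(1-\beta)$ scaling used above), so its trajectory matches the SGDM snippet only up to a constant factor on the learning rate.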