Neural Network Parameter Optimizers

Symbols and Meanings

  • w: the parameters to be optimized
  • loss: the loss function
  • lr: the learning rate
  • batch: the number of samples fed in per iteration
  • t: the current iteration index (total number of batches processed so far)

Iteration Steps

(1) Compute the gradient of the loss function with respect to the current parameters at time step t:
$g_t = \nabla loss = \frac{\partial loss}{\partial w_t}$
(2) Compute the first-order momentum $m_t$ (a function of the gradients) and the second-order momentum $V_t$ (a function of the squared gradients) at time step t.
(3) Compute the descent step at time step t:
$\eta_t = lr \cdot m_t / \sqrt{V_t}$
(4) Compute the parameters at time step t+1:
$w_{t+1} = w_t - \eta_t = w_t - lr \cdot m_t / \sqrt{V_t}$
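
The optimizers below differ only in how $m_t$ and $V_t$ are defined. A minimal sketch of the generic update, assuming `w` is a `tf.Variable` and `m_t`, `V_t` are float tensors already computed for the current batch (the function name `apply_update` is illustrative, not from the original text):

import tensorflow as tf

def apply_update(w, lr, m_t, V_t):
    # eta_t = lr * m_t / sqrt(V_t), then w_{t+1} = w_t - eta_t
    eta_t = lr * m_t / tf.sqrt(V_t)
    w.assign_sub(eta_t)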

Common Optimizers

  • SGD (without momentum):
    (1) $m_t = g_t$, $V_t = 1$
    (2) $\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t$
    (3) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t$, i.e. $w_{t+1} = w_t - lr \cdot \frac{\partial loss}{\partial w_t}$
    Implementation:

    # vanilla SGD update: parameter <- parameter - lr * gradient
    w1.assign_sub(lr*grads[0])
    b1.assign_sub(lr*grads[1])
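
    For reference, the same rule is available as a built-in Keras optimizer; a minimal sketch reusing the variables above:

    optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
    optimizer.apply_gradients(zip(grads, [w1, b1]))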
    
  • SGDM (SGD with momentum), adds first-order momentum on top of SGD:
    (1) $m_t = \beta \cdot m_{t-1} + (1-\beta) \cdot g_t$, $V_t = 1$ ($\beta$ is a hyperparameter close to 1)
    (2) $\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot m_t = lr \cdot (\beta \cdot m_{t-1} + (1-\beta) \cdot g_t)$
    (3) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot (\beta \cdot m_{t-1} + (1-\beta) \cdot g_t)$

    Implementation:

    # momentum terms, initialized once before the training loop
    m_w, m_b = 0, 0
    beta = 0.9
    # first-order momentum: exponential moving average of the gradients
    m_w = beta*m_w + (1-beta)*grads[0]
    m_b = beta*m_b + (1-beta)*grads[1]
    w1.assign_sub(lr*m_w)
    b1.assign_sub(lr*m_b)
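
    For reference, the built-in Keras optimizer exposes the same idea through its momentum argument; a minimal sketch (it accumulates momentum without the (1-β) factor, so it matches in spirit rather than numerically):

    optimizer = tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9)
    optimizer.apply_gradients(zip(grads, [w1, b1]))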
    
  • Adagrad, adds second-order momentum on top of SGD:
    (1) $m_t = g_t$, $V_t = \sum_{\tau=1}^{t} g_\tau^2$
    (2) $\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t / \sqrt{\sum_{\tau=1}^{t} g_\tau^2}$
    (3) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t / \sqrt{\sum_{\tau=1}^{t} g_\tau^2}$

    Implementation:

    # accumulators for the squared gradients, initialized once before the training loop
    v_w, v_b = 0, 0
    # second-order momentum: sum of all squared gradients so far
    v_w += tf.square(grads[0])
    v_b += tf.square(grads[1])
    w1.assign_sub(lr*grads[0]/tf.sqrt(v_w))
    b1.assign_sub(lr*grads[1]/tf.sqrt(v_b))
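
    For reference, a minimal sketch with the built-in Keras optimizer:

    optimizer = tf.keras.optimizers.Adagrad(learning_rate=lr)
    optimizer.apply_gradients(zip(grads, [w1, b1]))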
    
  • RMSProp, adds second-order momentum on top of SGD:
    (1) $m_t = g_t$, $V_t = \beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2$
    (2) $\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t / \sqrt{\beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2}$
    (3) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t / \sqrt{\beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2}$

    Implementation:

    # second-order momentum terms, initialized once before the training loop
    v_w, v_b = 0, 0
    beta = 0.9
    # exponential moving average of the squared gradients
    v_w = beta*v_w + (1-beta)*tf.square(grads[0])
    v_b = beta*v_b + (1-beta)*tf.square(grads[1])
    w1.assign_sub(lr*grads[0]/tf.sqrt(v_w))
    b1.assign_sub(lr*grads[1]/tf.sqrt(v_b))
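
    For reference, a minimal sketch with the built-in Keras optimizer (which also adds a small epsilon for numerical stability):

    optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr, rho=0.9)
    optimizer.apply_gradients(zip(grads, [w1, b1]))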
    
  • Adam, combines SGDM's first-order momentum with RMSProp's second-order momentum:
    (1) $m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$
    (2) bias-corrected first-order momentum: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$
    (3) $V_t = \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2$
    (4) bias-corrected second-order momentum: $\hat{V}_t = \frac{V_t}{1-\beta_2^t}$
    (5) $\eta_t = lr \cdot \hat{m}_t / \sqrt{\hat{V}_t} = lr \cdot \frac{m_t}{1-\beta_1^t} \Big/ \sqrt{\frac{V_t}{1-\beta_2^t}}$
    (6) $w_{t+1} = w_t - \eta_t = w_t - lr \cdot \frac{m_t}{1-\beta_1^t} \Big/ \sqrt{\frac{V_t}{1-\beta_2^t}}$

Implementation:

# state, initialized once before the training loop
m_w, m_b = 0, 0
v_w, v_b = 0, 0
beta1, beta2 = 0.9, 0.999
global_step = 0

# inside the training loop: count the batches, since the bias correction uses the step index t
global_step += 1

# first-order momentum: exponential moving average of the gradients
m_w = beta1*m_w + (1 - beta1)*grads[0]
m_b = beta1*m_b + (1 - beta1)*grads[1]
# second-order momentum: exponential moving average of the squared gradients
v_w = beta2*v_w + (1 - beta2)*tf.square(grads[0])
v_b = beta2*v_b + (1 - beta2)*tf.square(grads[1])

# bias correction
m_w_correction = m_w/(1 - tf.pow(beta1, int(global_step)))
m_b_correction = m_b/(1 - tf.pow(beta1, int(global_step)))
v_w_correction = v_w/(1 - tf.pow(beta2, int(global_step)))
v_b_correction = v_b/(1 - tf.pow(beta2, int(global_step)))

w1.assign_sub(lr*m_w_correction/tf.sqrt(v_w_correction))
b1.assign_sub(lr*m_b_correction/tf.sqrt(v_b_correction))
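
For reference, a minimal sketch with the built-in Keras optimizer using the same hyperparameters (it applies the bias correction internally and adds a small epsilon for numerical stability):

optimizer = tf.keras.optimizers.Adam(learning_rate=lr, beta_1=0.9, beta_2=0.999)
optimizer.apply_gradients(zip(grads, [w1, b1]))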