Adam (Adaptive Moment Estimation)

Adam (Adaptive Moment Estimation) is an adaptive-learning-rate optimization algorithm that combines the strengths of momentum and RMSProp. It tracks both the first moment of the gradients (momentum) and their second moment (as in RMSProp), and uses them to adapt the effective step size, which makes parameter updates more stable and efficient.

How the Adam Optimization Algorithm Works

Adam updates the parameters through the following steps (a code sketch follows the list):

  1. Compute the exponential moving average of the gradient (first-moment estimate)
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    where:
    - m_t is the first-moment (momentum) estimate of the gradient.
    - g_t is the current gradient.
    - \beta_1 is the momentum hyperparameter, typically 0.9.

  2. Compute the exponential moving average of the squared gradient (second-moment estimate)
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    where:
    - v_t is the second-moment estimate of the gradient.
    - \beta_2 is the RMSProp hyperparameter, typically 0.999.

  3. Bias correction
    Because both moment estimates are initialized at zero, they are biased toward zero in the early iterations, so Adam applies a bias correction:
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}
    \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

  4. Update the parameters
    \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
    where:
    - \theta_t is the parameter vector at iteration t.
    - \alpha is the learning rate.
    - \epsilon is a small constant that prevents division by zero, typically 10^{-8}.
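
The four steps above map directly onto a few lines of code. Below is a minimal NumPy sketch of a single Adam update; the function name adam_step and the way the state (m, v, t) is passed around are illustrative choices, not part of any particular library.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of `theta` given gradient `grad` at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # step 1: first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # step 2: second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # step 3: bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # step 4: parameter update
    return theta, m, v
```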

A Worked Numerical Example

Suppose we have a simple linear regression problem with the following training data:

x  y
1  2
2  3
3  4
4  5

The linear model we want to fit is h(\theta) = \theta_0 + \theta_1 x.

Step 1: Initialize the parameters

Assume \theta_0 = 0, \theta_1 = 0, learning rate \alpha = 0.01, \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}, and initialize the moment estimates to m_0 = 0 and v_0 = 0.

Step 2: Compute the gradient

The loss function J(\theta) is the mean squared error (MSE):
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h(\theta) - y_i)^2
where m is the number of training samples. Since the parameters are updated on one sample at a time (stochastic updates), the per-sample gradients below use m = 1.

For the first sample (x_1, y_1) = (1, 2), the model prediction is:
h(\theta) = \theta_0 + \theta_1 x_1 = 0

Compute the gradient of the loss with respect to the parameters:
\frac{\partial J}{\partial \theta_0} = h(\theta) - y_1 = 0 - 2 = -2
\frac{\partial J}{\partial \theta_1} = (h(\theta) - y_1) x_1 = -2 \cdot 1 = -2
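
For readers who prefer code, here is a minimal sketch of the per-sample prediction and gradient under the single-sample MSE assumption above; the helper names predict and gradients are illustrative.

```python
def predict(theta0, theta1, x):
    # Linear model h(theta) = theta0 + theta1 * x
    return theta0 + theta1 * x

def gradients(theta0, theta1, x, y):
    # Per-sample MSE gradient: dJ/dtheta0 = (h - y), dJ/dtheta1 = (h - y) * x
    error = predict(theta0, theta1, x) - y
    return error, error * x

print(gradients(0.0, 0.0, 1, 2))  # -> (-2.0, -2.0), matching the hand computation
```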

Step 3: Update the moment estimates and the parameters
  1. Update the first-moment estimate:
    m_{t,0} = \beta_1 m_{t-1,0} + (1 - \beta_1) g_{t,0} = 0.9 \times 0 + 0.1 \times (-2) = -0.2
    m_{t,1} = \beta_1 m_{t-1,1} + (1 - \beta_1) g_{t,1} = 0.9 \times 0 + 0.1 \times (-2) = -0.2

  2. Update the second-moment estimate:
    v_{t,0} = \beta_2 v_{t-1,0} + (1 - \beta_2) g_{t,0}^2 = 0.999 \times 0 + 0.001 \times (-2)^2 = 0.004
    v_{t,1} = \beta_2 v_{t-1,1} + (1 - \beta_2) g_{t,1}^2 = 0.999 \times 0 + 0.001 \times (-2)^2 = 0.004

  3. Bias correction:
    \hat{m}_{t,0} = \frac{m_{t,0}}{1 - \beta_1^t} = \frac{-0.2}{1 - 0.9^1} = \frac{-0.2}{0.1} = -2
    \hat{m}_{t,1} = \frac{m_{t,1}}{1 - \beta_1^t} = \frac{-0.2}{1 - 0.9^1} = \frac{-0.2}{0.1} = -2
    \hat{v}_{t,0} = \frac{v_{t,0}}{1 - \beta_2^t} = \frac{0.004}{1 - 0.999^1} = \frac{0.004}{0.001} = 4
    \hat{v}_{t,1} = \frac{v_{t,1}}{1 - \beta_2^t} = \frac{0.004}{1 - 0.999^1} = \frac{0.004}{0.001} = 4

  4. Update the parameters (a code check follows this list):
    \theta_{t,0} = \theta_{t-1,0} - \frac{\alpha}{\sqrt{\hat{v}_{t,0}} + \epsilon} \hat{m}_{t,0} = 0 - \frac{0.01}{\sqrt{4} + 10^{-8}} \times (-2) \approx 0.01
    \theta_{t,1} = \theta_{t-1,1} - \frac{\alpha}{\sqrt{\hat{v}_{t,1}} + \epsilon} \hat{m}_{t,1} = 0 - \frac{0.01}{\sqrt{4} + 10^{-8}} \times (-2) \approx 0.01
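
As a quick sanity check, this first update can be reproduced with the adam_step sketch defined earlier; the array layout theta = [theta0, theta1] is an illustrative choice.

```python
import numpy as np

theta = np.zeros(2)               # [theta0, theta1]
m, v = np.zeros(2), np.zeros(2)   # first- and second-moment estimates

grad = np.array([-2.0, -2.0])     # gradient for the first sample (1, 2)
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)  # -> approximately [0.01, 0.01]
print(m, v)   # -> [-0.2, -0.2] and [0.004, 0.004]
```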

Second iteration

Suppose the next randomly selected sample is (x_2, y_2) = (2, 3).

  1. Compute the new prediction:
    h(\theta) = \theta_0 + \theta_1 x_2 = 0.01 + 0.01 \times 2 = 0.03

  2. Compute the new gradient:
    \frac{\partial J}{\partial \theta_0} = h(\theta) - y_2 = 0.03 - 3 = -2.97
    \frac{\partial J}{\partial \theta_1} = (h(\theta) - y_2) x_2 = -2.97 \times 2 = -5.94

  3. Update the first-moment estimate:
    m_{t,0} = \beta_1 m_{t-1,0} + (1 - \beta_1) g_{t,0} = 0.9 \times (-0.2) + 0.1 \times (-2.97) = -0.477
    m_{t,1} = \beta_1 m_{t-1,1} + (1 - \beta_1) g_{t,1} = 0.9 \times (-0.2) + 0.1 \times (-5.94) = -0.774

  4. Update the second-moment estimate:
    v_{t,0} = \beta_2 v_{t-1,0} + (1 - \beta_2) g_{t,0}^2 = 0.999 \times 0.004 + 0.001 \times (-2.97)^2 \approx 0.0128
    v_{t,1} = \beta_2 v_{t-1,1} + (1 - \beta_2) g_{t,1}^2 = 0.999 \times 0.004 + 0.001 \times (-5.94)^2 \approx 0.0393

  5. Bias correction:
    \hat{m}_{t,0} = \frac{m_{t,0}}{1 - \beta_1^t} = \frac{-0.477}{1 - 0.9^2} \approx -2.51
    \hat{m}_{t,1} = \frac{m_{t,1}}{1 - \beta_1^t} = \frac{-0.774}{1 - 0.9^2} \approx -4.07
    \hat{v}_{t,0} = \frac{v_{t,0}}{1 - \beta_2^t} = \frac{0.0128}{1 - 0.999^2} \approx 6.40
    \hat{v}_{t,1} = \frac{v_{t,1}}{1 - \beta_2^t} = \frac{0.0393}{1 - 0.999^2} \approx 19.7

  6. Update the parameters:
    \theta_{t,0} = \theta_{t-1,0} - \frac{\alpha}{\sqrt{\hat{v}_{t,0}} + \epsilon} \hat{m}_{t,0} = 0.01 - \frac{0.01}{\sqrt{6.40} + 10^{-8}} \times (-2.51) \approx 0.02
    \theta_{t,1} = \theta_{t-1,1} - \frac{\alpha}{\sqrt{\hat{v}_{t,1}} + \epsilon} \hat{m}_{t,1} = 0.01 - \frac{0.01}{\sqrt{19.7} + 10^{-8}} \times (-4.07) \approx 0.02

Third iteration

Suppose the next randomly selected sample is (x_3, y_3) = (3, 4); for simplicity, we carry the rounded parameters \theta_0 \approx \theta_1 \approx 0.02 forward.

  1. Compute the new prediction:
    h(\theta) = \theta_0 + \theta_1 x_3 = 0.02 + 0.02 \times 3 = 0.08

  2. Compute the new gradient:
    \frac{\partial J}{\partial \theta_0} = h(\theta) - y_3 = 0.08 - 4 = -3.92
    \frac{\partial J}{\partial \theta_1} = (h(\theta) - y_3) x_3 = -3.92 \times 3 = -11.76

  3. Update the first-moment estimate:
    m_{t,0} = \beta_1 m_{t-1,0} + (1 - \beta_1) g_{t,0} = 0.9 \times (-0.477) + 0.1 \times (-3.92) \approx -0.821
    m_{t,1} = \beta_1 m_{t-1,1} + (1 - \beta_1) g_{t,1} = 0.9 \times (-0.774) + 0.1 \times (-11.76) \approx -1.873

  4. Update the second-moment estimate:
    v_{t,0} = \beta_2 v_{t-1,0} + (1 - \beta_2) g_{t,0}^2 = 0.999 \times 0.0128 + 0.001 \times (-3.92)^2 \approx 0.0282
    v_{t,1} = \beta_2 v_{t-1,1} + (1 - \beta_2) g_{t,1}^2 = 0.999 \times 0.0393 + 0.001 \times (-11.76)^2 \approx 0.178

  5. Bias correction:
    \hat{m}_{t,0} = \frac{m_{t,0}}{1 - \beta_1^t} = \frac{-0.821}{1 - 0.9^3} \approx -3.03
    \hat{m}_{t,1} = \frac{m_{t,1}}{1 - \beta_1^t} = \frac{-1.873}{1 - 0.9^3} \approx -6.91
    \hat{v}_{t,0} = \frac{v_{t,0}}{1 - \beta_2^t} = \frac{0.0282}{1 - 0.999^3} \approx 9.41
    \hat{v}_{t,1} = \frac{v_{t,1}}{1 - \beta_2^t} = \frac{0.178}{1 - 0.999^3} \approx 59.4

  6. Update the parameters (the script after this list reproduces all three iterations):
    \theta_{t,0} = \theta_{t-1,0} - \frac{\alpha}{\sqrt{\hat{v}_{t,0}} + \epsilon} \hat{m}_{t,0} = 0.02 - \frac{0.01}{\sqrt{9.41} + 10^{-8}} \times (-3.03) \approx 0.03
    \theta_{t,1} = \theta_{t-1,1} - \frac{\alpha}{\sqrt{\hat{v}_{t,1}} + \epsilon} \hat{m}_{t,1} = 0.02 - \frac{0.01}{\sqrt{59.4} + 10^{-8}} \times (-6.91) \approx 0.03
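
The three hand-worked iterations can be reproduced end to end with a short script that reuses the adam_step sketch from above. The exact values differ slightly from the hand computation because the script carries full precision instead of rounding the parameters between iterations.

```python
import numpy as np

samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]  # the three samples used above

theta = np.zeros(2)              # [theta0, theta1], initialized to zero
m, v = np.zeros(2), np.zeros(2)  # first- and second-moment estimates

for t, (x, y) in enumerate(samples, start=1):
    error = (theta[0] + theta[1] * x) - y    # h(theta) - y
    grad = np.array([error, error * x])      # per-sample MSE gradient
    theta, m, v = adam_step(theta, grad, m, v, t)
    print(f"after iteration {t}: theta0={theta[0]:.4f}, theta1={theta[1]:.4f}")

# Expected (up to rounding): about (0.0100, 0.0100), then (0.0199, 0.0192),
# then (0.0298, 0.0282) -- matching the hand-worked values above.
```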

Summary

The Adam algorithm combines the strengths of momentum and RMSProp: by tracking both the first and second moments of the gradients, it adapts the effective step size and makes parameter updates more stable and efficient. The worked example shows how Adam incrementally builds the moment estimates at each iteration and, after bias correction, uses them to update the parameters and speed up convergence.
