Adam (Adaptive Moment Estimation)
Adam (Adaptive Moment Estimation) is an adaptive-learning-rate optimization algorithm that combines the strengths of momentum and RMSProp. It tracks both the first moment of the gradient (as in momentum) and the second moment (as in RMSProp), and uses them to adapt the effective step size per parameter, making updates more stable and efficient.
How the Adam algorithm works
Adam updates the parameters in the following steps:
- Compute the exponential moving average of the gradient (first-moment estimate):
  $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
  where:
  - $m_t$ is the first-moment (momentum) estimate of the gradient.
  - $g_t$ is the current gradient.
  - $\beta_1$ is the momentum hyperparameter, typically 0.9.
- Compute the exponential moving average of the squared gradient (second-moment estimate):
  $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
  where:
  - $v_t$ is the second-moment estimate of the gradient.
  - $\beta_2$ is the RMSProp hyperparameter, typically 0.999.
- Bias correction:
  Because both moment estimates start at zero, they are biased toward zero in the early iterations, so Adam applies a bias correction:
  $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
  $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
- Update the parameters:
  $\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
  where:
  - $\theta_t$ is the parameter vector at iteration $t$.
  - $\alpha$ is the learning rate.
  - $\epsilon$ is a small constant that prevents division by zero, typically $10^{-8}$.
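The four steps above can be sketched as a single NumPy update function (an illustrative sketch; the function name and state layout are not taken from any particular library):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment EMA
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Starting from zero state with gradient $(-2, -2)$ at $t = 1$, this returns $\theta \approx (0.01, 0.01)$, which matches the worked example below.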
A worked numerical example
Suppose we have a simple linear-regression problem with the following training set:

| x | y |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |

The linear model to fit is $h(\theta) = \theta_0 + \theta_1 x$.
Step 1: Initialize the parameters
Let $\theta_0 = 0$, $\theta_1 = 0$, learning rate $\alpha = 0.01$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and initialize the moment estimates $m_0 = 0$, $v_0 = 0$.
Step 2: Compute the gradient
The loss function $J(\theta)$ is the mean squared error (MSE):
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h(\theta) - y_i)^2$
where $m$ is the number of training samples. The iterations below use the stochastic (single-sample) gradient, i.e. the gradient of one term $\frac{1}{2}(h(\theta) - y_i)^2$.
For the first sample $(x_1, y_1) = (1, 2)$, the model prediction is:
$h(\theta) = \theta_0 + \theta_1 x_1 = 0$
Compute the gradients of the loss with respect to the parameters:
$\frac{\partial J}{\partial \theta_0} = h(\theta) - y_1 = 0 - 2 = -2$
$\frac{\partial J}{\partial \theta_1} = (h(\theta) - y_1) x_1 = -2 \cdot 1 = -2$
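These per-sample gradients are easy to check in code (a minimal sketch; `mse_grads` is a hypothetical helper name, not a library function):

```python
def mse_grads(theta0, theta1, x, y):
    """Gradients of the single-sample squared error (1/2) * (h(theta) - y)**2."""
    err = theta0 + theta1 * x - y   # h(theta) - y
    return err, err * x             # d/d(theta0), d/d(theta1)

# First sample (x1, y1) = (1, 2) with theta = (0, 0):
g0, g1 = mse_grads(0.0, 0.0, 1.0, 2.0)   # -> (-2.0, -2.0)
```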
Step 3: Update the moment estimates and the parameters
- Update the first-moment estimates:
  $m_{t,0} = \beta_1 m_{t-1,0} + (1 - \beta_1) g_{t,0} = 0.9 \times 0 + 0.1 \times (-2) = -0.2$
  $m_{t,1} = \beta_1 m_{t-1,1} + (1 - \beta_1) g_{t,1} = 0.9 \times 0 + 0.1 \times (-2) = -0.2$
- Update the second-moment estimates:
  $v_{t,0} = \beta_2 v_{t-1,0} + (1 - \beta_2) g_{t,0}^2 = 0.999 \times 0 + 0.001 \times (-2)^2 = 0.004$
  $v_{t,1} = \beta_2 v_{t-1,1} + (1 - \beta_2) g_{t,1}^2 = 0.999 \times 0 + 0.001 \times (-2)^2 = 0.004$
- Bias correction ($t = 1$):
  $\hat{m}_{t,0} = \frac{m_{t,0}}{1 - \beta_1^t} = \frac{-0.2}{1 - 0.9^1} = \frac{-0.2}{0.1} = -2$
  $\hat{m}_{t,1} = \frac{m_{t,1}}{1 - \beta_1^t} = \frac{-0.2}{1 - 0.9^1} = \frac{-0.2}{0.1} = -2$
  $\hat{v}_{t,0} = \frac{v_{t,0}}{1 - \beta_2^t} = \frac{0.004}{1 - 0.999^1} = \frac{0.004}{0.001} = 4$
  $\hat{v}_{t,1} = \frac{v_{t,1}}{1 - \beta_2^t} = \frac{0.004}{1 - 0.999^1} = \frac{0.004}{0.001} = 4$
- Update the parameters:
  $\theta_{t,0} = \theta_{t-1,0} - \frac{\alpha}{\sqrt{\hat{v}_{t,0}} + \epsilon} \hat{m}_{t,0} = 0 - \frac{0.01}{\sqrt{4} + 10^{-8}} \times (-2) \approx 0.01$
  $\theta_{t,1} = \theta_{t-1,1} - \frac{\alpha}{\sqrt{\hat{v}_{t,1}} + \epsilon} \hat{m}_{t,1} = 0 - \frac{0.01}{\sqrt{4} + 10^{-8}} \times (-2) \approx 0.01$
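The first iteration's numbers can be reproduced directly (a sketch; both parameters see the same gradient here, so one scalar suffices):

```python
import math

alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
g = -2.0                                  # gradient for theta_0 (theta_1 is identical)
m = beta1 * 0.0 + (1 - beta1) * g         # -0.2
v = beta2 * 0.0 + (1 - beta2) * g ** 2    # 0.004
m_hat = m / (1 - beta1 ** 1)              # -2.0
v_hat = v / (1 - beta2 ** 1)              # 4.0
theta = 0.0 - alpha * m_hat / (math.sqrt(v_hat) + eps)   # ~0.01
```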
Second iteration
Suppose the next randomly selected sample is $(x_2, y_2) = (2, 3)$.
- Compute the new prediction:
  $h(\theta) = \theta_0 + \theta_1 x_2 = 0.01 + 0.01 \times 2 = 0.03$
- Compute the new gradients:
  $\frac{\partial J}{\partial \theta_0} = h(\theta) - y_2 = 0.03 - 3 = -2.97$
  $\frac{\partial J}{\partial \theta_1} = (h(\theta) - y_2) x_2 = -2.97 \times 2 = -5.94$
- Update the first-moment estimates:
  $m_{t,0} = \beta_1 m_{t-1,0} + (1 - \beta_1) g_{t,0} = 0.9 \times (-0.2) + 0.1 \times (-2.97) = -0.477$
  $m_{t,1} = \beta_1 m_{t-1,1} + (1 - \beta_1) g_{t,1} = 0.9 \times (-0.2) + 0.1 \times (-5.94) = -0.774$
- Update the second-moment estimates:
  $v_{t,0} = \beta_2 v_{t-1,0} + (1 - \beta_2) g_{t,0}^2 = 0.999 \times 0.004 + 0.001 \times (-2.97)^2 \approx 0.0128$
  $v_{t,1} = \beta_2 v_{t-1,1} + (1 - \beta_2) g_{t,1}^2 = 0.999 \times 0.004 + 0.001 \times (-5.94)^2 \approx 0.0393$
- Bias correction ($t = 2$):
  $\hat{m}_{t,0} = \frac{m_{t,0}}{1 - \beta_1^t} = \frac{-0.477}{1 - 0.9^2} \approx -2.51$
  $\hat{m}_{t,1} = \frac{m_{t,1}}{1 - \beta_1^t} = \frac{-0.774}{1 - 0.9^2} \approx -4.07$
  $\hat{v}_{t,0} = \frac{v_{t,0}}{1 - \beta_2^t} = \frac{0.0128}{1 - 0.999^2} \approx 6.41$
  $\hat{v}_{t,1} = \frac{v_{t,1}}{1 - \beta_2^t} = \frac{0.0393}{1 - 0.999^2} \approx 19.65$
- Update the parameters:
  $\theta_{t,0} = \theta_{t-1,0} - \frac{\alpha}{\sqrt{\hat{v}_{t,0}} + \epsilon} \hat{m}_{t,0} = 0.01 - \frac{0.01}{\sqrt{6.41} + 10^{-8}} \times (-2.51) \approx 0.02$
  $\theta_{t,1} = \theta_{t-1,1} - \frac{\alpha}{\sqrt{\hat{v}_{t,1}} + \epsilon} \hat{m}_{t,1} = 0.01 - \frac{0.01}{\sqrt{19.65} + 10^{-8}} \times (-4.07) \approx 0.02$
Third iteration
Suppose the next randomly selected sample is $(x_3, y_3) = (3, 4)$.
- Compute the new prediction:
  $h(\theta) = \theta_0 + \theta_1 x_3 = 0.02 + 0.02 \times 3 = 0.08$
- Compute the new gradients:
  $\frac{\partial J}{\partial \theta_0} = h(\theta) - y_3 = 0.08 - 4 = -3.92$
  $\frac{\partial J}{\partial \theta_1} = (h(\theta) - y_3) x_3 = -3.92 \times 3 = -11.76$
- Update the first-moment estimates:
  $m_{t,0} = 0.9 \times (-0.477) + 0.1 \times (-3.92) \approx -0.821$
  $m_{t,1} = 0.9 \times (-0.774) + 0.1 \times (-11.76) \approx -1.873$
- Update the second-moment estimates:
  $v_{t,0} = 0.999 \times 0.0128 + 0.001 \times (-3.92)^2 \approx 0.0282$
  $v_{t,1} = 0.999 \times 0.0393 + 0.001 \times (-11.76)^2 \approx 0.1776$
- Bias correction ($t = 3$):
  $\hat{m}_{t,0} = \frac{-0.821}{1 - 0.9^3} \approx -3.03$
  $\hat{m}_{t,1} = \frac{-1.873}{1 - 0.9^3} \approx -6.91$
  $\hat{v}_{t,0} = \frac{0.0282}{1 - 0.999^3} \approx 9.41$
  $\hat{v}_{t,1} = \frac{0.1776}{1 - 0.999^3} \approx 59.3$
- Update the parameters:
  $\theta_{t,0} = 0.02 - \frac{0.01}{\sqrt{9.41} + 10^{-8}} \times (-3.03) \approx 0.03$
  $\theta_{t,1} = 0.02 - \frac{0.01}{\sqrt{59.3} + 10^{-8}} \times (-6.91) \approx 0.03$
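The hand-worked iterations can be replayed in a short loop (a sketch assuming the same sample order $x = 1, 2, 3$; after three steps $\theta$ is roughly $(0.03, 0.03)$ before rounding):

```python
import numpy as np

alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
theta = np.zeros(2)                  # (theta_0, theta_1)
m, v = np.zeros(2), np.zeros(2)

for t, (x, y) in enumerate([(1, 2), (2, 3), (3, 4)], start=1):
    err = theta[0] + theta[1] * x - y        # h(theta) - y
    g = np.array([err, err * x])             # single-sample gradients
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

# theta is now roughly [0.0298, 0.0282]
```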
Summary
Adam combines the strengths of momentum and RMSProp: by tracking both the first and the second moment of the gradient, it adapts the step size per parameter, making updates more stable and efficient. The worked example shows, iteration by iteration, how Adam accumulates the moment estimates, applies the bias correction, and updates the parameters, steadily driving the model toward convergence.