A Deep Dive into Optimizers: How Adam Updates Model Parameters
In machine learning and deep learning, the optimizer is a core component of model training. Its job is to adjust the model parameters step by step, guided by the gradient of the loss function, so that the loss is driven toward a minimum. This post uses the Adam optimizer as an example to walk through its update mechanism and clarify the role optimizers play during training.
I. The Basics of Optimizers
The goal of an optimizer is to iteratively update the model parameters $\theta$ so that the objective (e.g., the loss function) gradually decreases. The basic parameter update rule is:
$$\theta \leftarrow \theta - \eta \cdot \frac{\partial \text{Loss}}{\partial \theta}$$
where:
- $\theta$: the model parameters.
- $\eta$: the learning rate (controls the step size of each update).
- $\frac{\partial \text{Loss}}{\partial \theta}$: the gradient of the loss function with respect to the parameters.
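To make the update rule concrete, here is a minimal sketch of one plain gradient-descent step on a toy one-parameter loss; the quadratic loss and all numeric values are illustrative assumptions, not taken from the original example.

```python
# Minimal sketch: one step of plain gradient descent on a toy loss.
# Loss(theta) = (theta - 3)^2, so dLoss/dtheta = 2 * (theta - 3).
# The loss and every value here are illustrative assumptions.

def grad(theta):
    return 2.0 * (theta - 3.0)  # gradient of the toy loss

theta = 0.5   # current parameter value
eta = 0.1     # learning rate

theta = theta - eta * grad(theta)  # theta <- theta - eta * dLoss/dtheta
print(theta)  # 1.0
```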
The problem:
Using plain gradient descent (SGD) directly is limited by the following issues:
- The learning rate is hard to tune: a rate that is too small makes convergence slow, while one that is too large makes training unstable.
- Gradient oscillation: on complex loss surfaces, the gradient can fluctuate sharply along certain directions.
- Sparse gradients: parameters whose gradients are small are updated only slowly.
To address these issues, the Adam optimizer combines the ideas of momentum and RMSProp, dynamically rescaling the gradient information to provide a more effective parameter update rule.
II. The Core Ideas of Adam
Adam (Adaptive Moment Estimation) is an adaptive learning-rate optimization algorithm built on two estimates:
- A first-moment estimate (an exponentially weighted moving average of the gradients).
- A second-moment estimate (an exponentially weighted moving average of the squared gradients).
1. Adam's Parameter Update Formulas
Assume the current time step is $t$; the update proceeds as follows:
(1) Compute the gradient
$$g_t = \frac{\partial \text{Loss}}{\partial \theta_t}$$
Here $g_t$ is the gradient of the loss function with respect to the current parameters $\theta_t$.
(2) Update the first moment (momentum term)
$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
where:
- $m_t$ is the first moment of the gradient (an exponentially weighted average of past gradients).
- $\beta_1$ is the exponential decay rate of the first moment, typically $0.9$.
(3) Update the second moment (RMS term)
$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
where:
- $v_t$ is an exponentially weighted average of the squared gradients.
- $\beta_2$ is the exponential decay rate of the second moment, typically $0.999$.
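Steps (2) and (3) are the same exponential-moving-average pattern, applied first to the gradient and then to its square. Below is a minimal sketch of both updates; the gradient value and the zero initial moments are illustrative (they happen to match the worked example in section III).

```python
# Minimal sketch of the two moment updates; the gradient value is illustrative.
beta1, beta2 = 0.9, 0.999   # decay rates for the first and second moments
m_prev, v_prev = 0.0, 0.0   # moments from the previous step (zero at t = 0)
g = -0.2                    # current gradient g_t (example value)

m = beta1 * m_prev + (1 - beta1) * g        # first moment: EMA of gradients
v = beta2 * v_prev + (1 - beta2) * g ** 2   # second moment: EMA of squared gradients
print(m, v)  # approximately -0.02 and 4e-05
```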
(4) Bias correction
Because both moments are initialized at zero, they are biased toward zero during the early steps; the bias is removed with the following correction:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
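Continuing the sketch with the same illustrative numbers: at $t = 1$ the correction rescales the raw first and second moments by factors of 10 and 1000 respectively, which shows how strongly the zero initialization biases the first few steps.

```python
# Bias correction at time step t; the values of m and v continue the
# illustrative sketch above.
beta1, beta2 = 0.9, 0.999
m, v = -0.02, 4e-05
t = 1  # first time step

m_hat = m / (1 - beta1 ** t)  # -0.02 / 0.1   = -0.2
v_hat = v / (1 - beta2 ** t)  # 4e-05 / 0.001 = 0.04
print(m_hat, v_hat)
```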
(5) Update the parameters
The corrected first and second moments are combined to update the parameters:
$$\theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where:
- $\eta$: the learning rate.
- $\epsilon$: a small constant (e.g., $10^{-8}$) that prevents division by zero.
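Putting steps (1) through (5) together, the following is a from-scratch sketch of a single Adam update for one scalar parameter. The function name `adam_step` and its defaults are illustrative choices rather than a reference implementation; computing the gradient itself (step (1)) is left to the caller because it depends on the model and loss.

```python
import math

def adam_step(theta, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch).

    theta: current parameter, grad: gradient at theta, m/v: moments from the
    previous step, t: current time step (starting at 1).
    Returns the updated theta, m, and v.
    """
    m = beta1 * m + (1 - beta1) * grad          # (2) first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # (3) second moment
    m_hat = m / (1 - beta1 ** t)                # (4) bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # (4) bias-corrected second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)  # (5) parameter update
    return theta, m, v
```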
2. Advantages of Adam
- Adaptive learning rates: combining the first and second moments dynamically adjusts the effective step size, adapting to the gradient magnitude of each parameter.
- Fast convergence: performs well on problems with sparse gradients or non-stationary objectives.
- Robustness: applicable to a wide range of deep learning models.
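In practice Adam is rarely written by hand; deep learning frameworks provide it directly. As a purely illustrative usage sketch (PyTorch, the toy linear model, and the random data are my assumptions, not part of the original post), the optimizer below is configured with the hyperparameter values used throughout this article.

```python
import torch

# Placeholder model and data; only the optimizer configuration matters here.
model = torch.nn.Linear(1, 1)
x = torch.randn(16, 1)
y = 3.0 * x + 0.5

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-8)
loss_fn = torch.nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()          # clear old gradients
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # compute gradients (step (1))
    optimizer.step()               # Adam performs steps (2)-(5) internally
```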
III. Adam's Parameter Update Step by Step
A worked example
To show Adam's update process concretely, assume the following scenario:
- Initial parameter: $\theta_0 = 0.5$.
- Current gradient: $g_1 = -0.2$.
- Adam hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.01$.
Update steps
(1) Initialize the moments
Initial values: $m_0 = 0$, $v_0 = 0$.
(2) Compute the first moment
$$m_1 = \beta_1 \cdot m_0 + (1 - \beta_1) \cdot g_1 = 0.9 \cdot 0 + 0.1 \cdot (-0.2) = -0.02$$
(3) Compute the second moment
$$v_1 = \beta_2 \cdot v_0 + (1 - \beta_2) \cdot g_1^2 = 0.999 \cdot 0 + 0.001 \cdot (-0.2)^2 = 0.00004$$
(4) Bias correction
$$\hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{-0.02}{1 - 0.9} = -0.2$$
$$\hat{v}_1 = \frac{v_1}{1 - \beta_2^1} = \frac{0.00004}{1 - 0.999} = 0.04$$
(5) Update the parameter
$$\theta_1 = \theta_0 - \eta \cdot \frac{\hat{m}_1}{\sqrt{\hat{v}_1} + \epsilon}$$
$$\theta_1 = 0.5 - 0.01 \cdot \frac{-0.2}{\sqrt{0.04} + 10^{-8}} = 0.5 + 0.01 \cdot 1 = 0.51$$
After this single iteration, the parameter $\theta$ has thus been updated from 0.5 to 0.51.
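The hand calculation above can be reproduced in a few lines; this standalone check uses the same numbers as the example and prints 0.51 (up to floating-point rounding).

```python
import math

# Reproduce the worked example: one Adam step starting from theta_0 = 0.5
# with gradient g_1 = -0.2 and the hyperparameters used above.
theta, g, t = 0.5, -0.2, 1
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m = v = 0.0  # m_0 and v_0

m = beta1 * m + (1 - beta1) * g          # m_1 = -0.02
v = beta2 * v + (1 - beta2) * g ** 2     # v_1 = 0.00004
m_hat = m / (1 - beta1 ** t)             # -0.2
v_hat = v / (1 - beta2 ** t)             # 0.04
theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
print(round(theta, 6))  # 0.51
```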
IV. Summary
The Adam optimizer is an efficient adaptive learning-rate algorithm. By combining the strengths of momentum and RMSProp, it converges quickly and adapts to complex gradient behavior. Thanks to its stability and broad applicability, Adam is widely used for training deep learning models.
Postscript
Written in Shanghai at 12:02 on December 12, 2024, with the assistance of the GPT4o large model.