A Deep Dive into Optimizers: How Adam Updates Model Parameters
In machine learning and deep learning, the optimizer is a core component of model training. Its job is to adjust the model parameters step by step, guided by the gradient of the loss function, so that the loss is driven toward a minimum. This post uses the Adam optimizer as an example to walk through its update mechanism and clarify the role optimizers play during training.
I. The Basics of Optimizers
The goal of an optimizer is to iteratively update the model parameters $\theta$ so that the objective (e.g., the loss function) gradually decreases. The basic parameter update rule is:
$$\theta \leftarrow \theta - \eta \cdot \frac{\partial \text{Loss}}{\partial \theta}$$
where:
- $\theta$: the model parameters.
- $\eta$: the learning rate (controls the step size of each update).
- $\frac{\partial \text{Loss}}{\partial \theta}$: the gradient of the loss function with respect to the parameters.
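To make the update rule concrete, here is a minimal sketch of one plain gradient-descent step on a toy one-parameter loss; the quadratic loss and all numeric values are illustrative assumptions, not taken from the original example.

```python
# Minimal sketch: one step of plain gradient descent on a toy loss.
# Loss(theta) = (theta - 3)^2, so dLoss/dtheta = 2 * (theta - 3).
# The loss and every value here are illustrative assumptions.

def grad(theta):
    return 2.0 * (theta - 3.0)  # gradient of the toy loss

theta = 0.5   # current parameter value
eta = 0.1     # learning rate

theta = theta - eta * grad(theta)  # theta <- theta - eta * dLoss/dtheta
print(theta)  # 1.0
```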
The problem:
Using plain gradient descent (SGD) directly is limited by the following issues:
- The learning rate is hard to tune: a rate that is too small makes convergence slow, while one that is too large makes training unstable.
- Gradient oscillation: on complex loss surfaces, the gradient can fluctuate sharply along certain directions.
- Sparse gradients: parameters whose gradients are small are updated only slowly.
To address these issues, the Adam optimizer combines the ideas of momentum and RMSProp, dynamically rescaling the gradient information to provide a more effective parameter update rule.
II. The Core Ideas of Adam
Adam (Adaptive Moment Estimation) is an adaptive learning-rate optimization algorithm built on two estimates:
- A first-moment estimate (an exponentially weighted moving average of the gradients).
- A second-moment estimate (an exponentially weighted moving average of the squared gradients).
1. Adam's Parameter Update Formulas
Assume the current time step is $t$; the update proceeds as follows:
(1) Compute the gradient
$$g_t = \frac{\partial \text{Loss}}{\partial \theta_t}$$
Here $g_t$ is the gradient of the loss function with respect to the current parameters $\theta_t$.
(2) Update the first moment (momentum term)
$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
where:
- $m_t$ is the first moment of the gradient (an exponentially weighted average of past gradients).
- $\beta_1$ is the exponential decay rate of the first moment, typically $0.9$.
(3) Update the second moment (RMS term)
$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
where:
- $v_t$ is an exponentially weighted average of the squared gradients.
- $\beta_2$ is the exponential decay rate of the second moment, typically $0.999$.
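Steps (2) and (3) are the same exponential-moving-average pattern, applied first to the gradient and then to its square. Below is a minimal sketch of both updates; the gradient value and the zero initial moments are illustrative (they happen to match the worked example in section III).

```python
# Minimal sketch of the two moment updates; the gradient value is illustrative.
beta1, beta2 = 0.9, 0.999   # decay rates for the first and second moments
m_prev, v_prev = 0.0, 0.0   # moments from the previous step (zero at t = 0)
g = -0.2                    # current gradient g_t (example value)

m = beta1 * m_prev + (1 - beta1) * g        # first moment: EMA of gradients
v = beta2 * v_prev + (1 - beta2) * g ** 2   # second moment: EMA of squared gradients
print(m, v)  # approximately -0.02 and 4e-05
```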
(4) Bias correction
Because both moments are initialized at zero, they are biased toward zero during the early steps; the bias is removed with the following correction:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
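Continuing the sketch with the same illustrative numbers: at $t = 1$ the correction rescales the raw first and second moments by factors of 10 and 1000 respectively, which shows how strongly the zero initialization biases the first few steps.

```python
# Bias correction at time step t; the values of m and v continue the
# illustrative sketch above.
beta1, beta2 = 0.9, 0.999
m, v = -0.02, 4e-05
t = 1  # first time step

m_hat = m / (1 - beta1 ** t)  # -0.02 / 0.1   = -0.2
v_hat = v / (1 - beta2 ** t)  # 4e-05 / 0.001 = 0.04
print(m_hat, v_hat)
```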
(5) Update the parameters
The corrected first and second moments are combined to update the parameters:
$$\theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where:
- $\eta$: the learning rate.
- $\epsilon$: a small constant (e.g., $10^{-8}$) that prevents division by zero.
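Putting steps (1) through (5) together, the following is a from-scratch sketch of a single Adam update for one scalar parameter. The function name `adam_step` and its defaults are illustrative choices rather than a reference implementation; computing the gradient itself (step (1)) is left to the caller because it depends on the model and loss.

```python
import math

def adam_step(theta, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch).

    theta: current parameter, grad: gradient at theta, m/v: moments from the
    previous step, t: current time step (starting at 1).
    Returns the updated theta, m, and v.
    """
    m = beta1 * m + (1 - beta1) * grad          # (2) first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # (3) second moment
    m_hat = m / (1 - beta1 ** t)                # (4) bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # (4) bias-corrected second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)  # (5) parameter update
    return theta, m, v
```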
2. Advantages of Adam
- Adaptive learning rates: combining the first and second moments dynamically adjusts the effective step size, adapting to the gradient magnitude of each parameter.
- Fast convergence: performs well on problems with sparse gradients or non-stationary objectives.
- Robustness: applicable to a wide range of deep learning models.
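In practice Adam is rarely written by hand; deep learning frameworks provide it directly. As a purely illustrative usage sketch (PyTorch, the toy linear model, and the random data are my assumptions, not part of the original post), the optimizer below is configured with the hyperparameter values used throughout this article.

```python
import torch

# Placeholder model and data; only the optimizer configuration matters here.
model = torch.nn.Linear(1, 1)
x = torch.randn(16, 1)
y = 3.0 * x + 0.5

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-8)
loss_fn = torch.nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()          # clear old gradients
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # compute gradients (step (1))
    optimizer.step()               # Adam performs steps (2)-(5) internally
```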
III. Adam's Parameter Update Step by Step
A worked example
To show Adam's update process concretely, assume the following scenario:
- Initial parameter: $\theta_0 = 0.5$.
- Current gradient: $g_1 = -0.2$.
- Adam hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.01$.
Update steps
(1) Initialize the moments
Initial values: $m_0 = 0$, $v_0 = 0$.
(2) Compute the first moment
$$m_1 = \beta_1 \cdot m_0 + (1 - \beta_1) \cdot g_1 = 0.9 \cdot 0 + 0.1 \cdot (-0.2) = -0.02$$
(3) Compute the second moment
$$v_1 = \beta_2 \cdot v_0 + (1 - \beta_2) \cdot g_1^2 = 0.999 \cdot 0 + 0.001 \cdot (-0.2)^2 = 0.00004$$
(4) Bias correction
$$\hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{-0.02}{1 - 0.9} = -0.2$$
$$\hat{v}_1 = \frac{v_1}{1 - \beta_2^1} = \frac{0.00004}{1 - 0.999} = 0.04$$
(5) Update the parameter
$$\theta_1 = \theta_0 - \eta \cdot \frac{\hat{m}_1}{\sqrt{\hat{v}_1} + \epsilon}$$
$$\theta_1 = 0.5 - 0.01 \cdot \frac{-0.2}{\sqrt{0.04} + 10^{-8}} = 0.5 + 0.01 \cdot 1 = 0.51$$
After this single iteration, the parameter $\theta$ has thus been updated from 0.5 to 0.51.
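The hand calculation above can be reproduced in a few lines; this standalone check uses the same numbers as the example and prints 0.51 (up to floating-point rounding).

```python
import math

# Reproduce the worked example: one Adam step starting from theta_0 = 0.5
# with gradient g_1 = -0.2 and the hyperparameters used above.
theta, g, t = 0.5, -0.2, 1
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m = v = 0.0  # m_0 and v_0

m = beta1 * m + (1 - beta1) * g          # m_1 = -0.02
v = beta2 * v + (1 - beta2) * g ** 2     # v_1 = 0.00004
m_hat = m / (1 - beta1 ** t)             # -0.2
v_hat = v / (1 - beta2 ** t)             # 0.04
theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
print(round(theta, 6))  # 0.51
```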
IV. Summary
The Adam optimizer is an efficient adaptive learning-rate algorithm. By combining the strengths of momentum and RMSProp, it converges quickly and adapts to complex gradient behavior. Thanks to its stability and broad applicability, Adam is widely used for training deep learning models.
Postscript
Written in Shanghai at 12:02 on December 12, 2024, with the assistance of the GPT4o large model.