Complete Guide to Adam Optimization

In the 1940s, mathematical programming was synonymous with optimization. An optimization problem includes an objective function that is to be maximized or minimized by choosing input values from an allowed set of values [1].

Nowadays, optimization is a very familiar term in AI, especially in Deep Learning. And one of the most recommended optimization algorithms for Deep Learning problems is Adam.

Disclaimer: a basic understanding of neural network optimization, such as Gradient Descent and Stochastic Gradient Descent, is recommended before reading.

In this post, I will highlight the following points:

  1. Definition of Adam Optimization
  2. The Road to Adam
  3. The Adam Algorithm for Stochastic Optimization
  4. Visual Comparison Between Adam and Other Optimizers
  5. Implementation
  6. Advantages and Disadvantages of Adam
  7. Conclusion and Further Reading

1. Definition of Adam Optimization

The Adam algorithm was first introduced in the paper Adam: A Method for Stochastic Optimization [2] by Diederik P. Kingma and Jimmy Ba. Adam is defined as “a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement” [2]. Okay, let’s break down this definition into two parts.

First, stochastic optimization is the process of optimizing an objective function in the presence of randomness. To understand this better, let’s think of Stochastic Gradient Descent (SGD). SGD is a great optimizer when we have a lot of data and parameters, because at each step it calculates an estimate of the gradient from a random subset of that data (a mini-batch), unlike Gradient Descent, which considers the entire dataset at each step.

(Image source: deeplearning.ai)
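To make the distinction concrete, here is a small NumPy sketch contrasting the two gradient computations on a toy least-squares problem (the data, model, and hyperparameter values are made up for illustration and are not from the original article):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # toy dataset of 1000 examples
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)      # noisy targets
w = np.zeros(5)                                   # parameters θ

def full_gradient(w):
    # Gradient Descent: uses the entire dataset at every step
    return 2 * X.T @ (X @ w - y) / len(X)

def minibatch_gradient(w, batch_size=32):
    # SGD: estimates the gradient from a random mini-batch of the data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

alpha = 0.1                                       # learning rate
w = w - alpha * minibatch_gradient(w)             # one SGD step: a cheap, noisy gradient estimate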

Second, Adam only requires first-order gradients. Meaning, Adam only needs the first derivatives of the objective with respect to the parameters, not any second-order (Hessian) information.

Now, the name of the algorithm, Adam, is derived from adaptive moment estimation. This will become apparent as we go through the algorithm.

2. The Road to Adam

Adam builds upon and combines the advantages of previous algorithms. To understand the Adam algorithm we need to have a quick background on those previous algorithms.

I. SGD with Momentum

Momentum in physics is a property of an object in motion, such as a ball accelerating down a slope. So, SGD with Momentum [3] incorporates the gradients from the previous update steps to speed up gradient descent. This is done by taking small but straightforward steps in the relevant direction.

SGD with Momentum is achieved by computing a moving average of the gradient (also known as an exponentially weighted average), then using it to update the parameters θ (weights, biases).

(Image: calculates the exponentially weighted average (moving average) of the gradient, then updates the parameters)
  • The term Beta (𝛽) controls the moving average. The value of Beta is in [0, 1); a common value is 𝛽 = 0.9, meaning we are roughly averaging over the last 10 iterations’ gradients, and older gradients are discarded or forgotten. So, a larger value of beta (say 𝛽 = 0.98) means that we are averaging over more gradients.

  • Alpha (α) is the learning rate, which determines the step size at each iteration.

(Image: Left: SGD; Right: SGD with Momentum. Source: Momentum and Learning Rate Adaptation)
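To make the update concrete, here is a minimal NumPy sketch of SGD with Momentum on a toy quadratic objective (the loss, its gradient, and the hyperparameter values are illustrative, not from the original post):

import numpy as np

def grad(theta):
    # gradient of the toy objective f(θ) = ||θ||² / 2, used as a stand-in for a real loss
    return theta

theta = np.array([1.0, -2.0])   # parameters θ (toy initial values)
v = np.zeros_like(theta)        # moving average of the gradients
beta, alpha = 0.9, 0.1          # 𝛽 (controls the moving average) and α (learning rate)

for t in range(100):
    g = grad(theta)                   # current gradient
    v = beta * v + (1 - beta) * g     # exponentially weighted average of past gradients
    theta = theta - alpha * v         # update θ in the direction of the averaged gradient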

II. Related Work (AdaGrad and RMSProp)

Alright, there are two algorithms to know about before we get to Adam. AdaGrad (adaptive gradient algorithm)[4] and RMSProp (root mean square propagation)[5] are both extensions of SGD. The two algorithms share some similarities with Adam. In fact, Adam combines the advantages of the two algorithms.

III. Adaptive Learning Rate

Both AdaGrad and RMSProp are adaptive gradient descent algorithms. Meaning, the learning rate (α) is adapted for each one of the parameters (w, b). In short, a learning rate is maintained per parameter.

To illustrate this better, here is an explanation of AdaGrad and RMSProp:

  • AdaGrad

AdaGrad’s per-parameter learning rate helps increase the learning rate for sparser parameters. Thus, AdaGrad works well for sparse gradients, such as in natural language processing and image recognition applications [4].

  • RMSProp

RMSProp was introduced by Tieleman and Hinton to speed up mini-batch learning. In RMSProp, the learning rate adapts based on the moving average of the magnitudes of the recent gradients.

Meaning, RMSProp maintains a moving average of the squares of the recent gradients, denoted by (v), thus giving more weight to recent gradients.

Here, the term Beta (𝛽) is introduced as the forgetting factor (just like in SGD with Momentum).

(Image: calculates the moving average of the squares of the recent gradients)

In short, when updating θ (say w or b), divide the gradient with respect to that parameter by the square root of the moving average of the squares of its recent gradients, multiply the result by α, and subtract it from the previous value of θ.

(Image: the update step for θ)

Also, RMSProp works well on big and redundant datasets (e.g. noisy data)[5].

* The term (𝜖) is used for numerical stability (to avoid division by zero).
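As a minimal sketch (same toy objective and illustrative hyperparameters as in the momentum example above), one RMSProp-style update loop looks like this:

import numpy as np

def grad(theta):
    # gradient of the toy objective f(θ) = ||θ||² / 2
    return theta

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)            # moving average of the squared gradients
beta, alpha, eps = 0.9, 0.01, 1e-8  # forgetting factor, learning rate, numerical-stability term

for t in range(100):
    g = grad(theta)
    v = beta * v + (1 - beta) * g**2                  # moving average of the squared gradients
    theta = theta - alpha * g / (np.sqrt(v) + eps)    # per-parameter adaptive update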

Here’s a visual comparison of what we learned so far:

(Animation by Alec Radford, via Imgur)

In the gif above, you can see Momentum exploring around before finding the correct path. As for SGD, AdaGrad, and RMSProp, they all take a similar path, but AdaGrad and RMSProp are clearly faster.

3. The Adam Algorithm for Stochastic Optimization

Okay, now we’ve got all the pieces we need to get to the algorithm.

As explained by Andrew Ng, Adam (adaptive moment estimation) is simply a combination of Momentum and RMSProp.

(Image: The Adam Algorithm. Source: Adam: A Method for Stochastic Optimization [2])

Here’s the algorithm to optimize an objective function f(θ), with parameters θ (weights and biases).

Adam includes the hyperparameters α, 𝛽1 (from Momentum), 𝛽2 (from RMSProp), and 𝜖 (for numerical stability).

Initialize:

  • m = 0, this is the first moment vector, treated as in Momentum
  • v = 0, this is the second moment vector, treated as in RMSProp
  • t = 0

On iteration t:

  • Update t, t := t + 1
  • Get the gradients/derivatives (g) of the objective with respect to the parameters θ at timestep t; here g is equivalent to dw and db respectively
(Image: the gradient g at timestep t)
  • Update the first moment mt
  • Update the second moment vt
(Image: the updates of mt and vt respectively)
  • Compute the bias-corrected mt (bias correction gives a better estimate of the moving averages)
  • Compute the bias-corrected vt
(Image: the bias-corrected mt and vt respectively)
  • Update the parameters θ
(Image: the parameter update)

And that’s it! The loop will continue until Adam converges to a solution.

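For reference, the update rules shown in the images above can be written out explicitly, following the notation of the Adam paper [2]:

$$
\begin{aligned}
g_t &= \nabla_{\theta} f_t(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t \,/\, (1-\beta_1^t) \\
\hat{v}_t &= v_t \,/\, (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t \,/\, \big(\sqrt{\hat{v}_t} + \epsilon\big)
\end{aligned}
$$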

4. Visual Comparison Between Optimizers

A better way to recognize the differences between the previously mentioned optimization algorithms is to see a visual comparison of their performance.

(Image: comparison of the training cost of different optimizers [2])

The figure above is from the Adam paper. It showcases the training cost over 45 epochs, and you can see Adam converging faster than AdaGrad for CNNs. Perhaps it’s good to mention that AdaGrad corresponds to a version of Adam with the hyperparameters (α, 𝛽1, 𝛽2) set to specific values [2]. I decided to leave AdaGrad’s math explanation out of this post to avoid confusion, but here is a simple explanation by mxnet if you want to learn more about it.

(Gif by author, using [7])

In the gif above, you can see Adam and RMSProp converging at a similar speed, while AdaGrad seems to be struggling to converge.

(Gif by author, using [7])

Meanwhile, in this gif, you can see Adam and SGD with Momentum converging to a solution, while SGD, AdaGrad, and RMSProp seem to be stuck in a local minimum.

5. Implementation

Here I’ll show three different ways to incorporate Adam into your model, with TensorFlow, PyTorch, and NumPy implementations.

  • Implementation with TensorFlow (Keras):

import tensorflow as tf

tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
    name='Adam')

  • Implementation with PyTorch:

import torch

torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999),
                 eps=1e-08, weight_decay=0, amsgrad=False)
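For context, here is how the PyTorch optimizer is typically wired into a training loop; this is a minimal, self-contained sketch with a toy model and toy data, not code from the original article:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # toy model (illustrative)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 10)                         # toy batch of inputs
y = torch.randn(32, 1)                          # toy targets

for step in range(100):
    optimizer.zero_grad()                       # clear gradients from the previous step
    loss = loss_fn(model(x), y)                 # forward pass and loss
    loss.backward()                             # backpropagation computes the gradients
    optimizer.step()                            # Adam updates the parameters

The TensorFlow/Keras optimizer is used in the same spirit, typically by passing it to model.compile().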
  • Implementation with just NumPy:

This implementation may not be as practical, but it will give you a much better understanding of the Adam algorithm.

But as you can guess, the code is quite long, so for better viewing, here’s the gist.

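For readers who only want the core loop, here is a minimal NumPy sketch of the Adam update from Section 3, using the paper's default hyperparameters; it is an illustrative re-implementation following the algorithm above, not the code from the gist:

import numpy as np

def adam_update(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step for parameters theta given the gradient g at timestep t
    m = beta1 * m + (1 - beta1) * g                          # first moment estimate
    v = beta2 * v + (1 - beta2) * g**2                       # second moment estimate
    m_hat = m / (1 - beta1**t)                               # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                               # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v

# Toy usage: minimize f(θ) = ||θ||² / 2, whose gradient is simply θ
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):                                     # t starts at 1 for the bias correction
    g = theta                                                # gradient of the toy objective
    theta, m, v = adam_update(theta, g, m, v, t)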

6. Advantages and Disadvantages of Adam

Compared to other algorithms, Adam is one of the best optimizers, but it is not perfect either. So, here are some advantages and disadvantages of Adam.

Advantages:

  1. Can handle sparse gradients on noisy datasets.
  2. Default hyperparameter values do well on most problems.
  3. Computationally efficient.
  4. Requires little memory, thus memory efficient.
  5. Works well on large datasets.

Disadvantages:

  1. Adam does not converge to an optimal solution in some cases (this is the motivation for AMSGrad).
  2. Adam can suffer from a weight decay problem (which is addressed in AdamW); both variants are shown in the snippet after this list.
  3. Recent optimization algorithms have been proven faster and better [6].
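Both of these variants are available in common frameworks. For example, in PyTorch (a brief sketch; the toy model is only there to provide parameters to optimize):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # toy model (illustrative)

# AMSGrad: Adam with the AMSGrad convergence fix enabled
opt_amsgrad = torch.optim.Adam(model.parameters(), lr=0.001, amsgrad=True)

# AdamW: Adam with decoupled weight decay
opt_adamw = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)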

7. Conclusion and Further Reading

That is all for Adam: adaptive moment estimation!

Adam is an extension of SGD, and it combines the advantages of AdaGrad and RMSProp. Adam is also an adaptive gradient descent algorithm, in that it maintains a learning rate per parameter. And it keeps track of moving averages of the first and second moments of the gradient. Thus, using the first and second moments, Adam can give an unscaled, direct estimation of the parameter updates. Finally, although newer optimization algorithms have emerged, Adam (and SGD) is still a stable optimizer to use.

Great resources for further reading (and watching):

8. References

  1. Stephen J. Wright, Optimization (2016), Encyclopædia Britannica
  2. Diederik P. Kingma and Jimmy Ba, Adam: A Method for Stochastic Optimization (2015), arXiv
  3. Rumelhart, Hinton, and Williams, Learning Internal Representations by Error Propagation (1986), ACM
  4. Duchi et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (2011), Stanford
  5. Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, Neural Networks for Machine Learning (Lecture 6) (2012), University of Toronto and Coursera
  6. John Chen, An Updated Overview of Recent Gradient Descent Algorithms (2020), GitHub
  7. kuroitu S, Comparison of Optimization Methods (2020), Qiita

Translated from: https://towardsdatascience.com/complete-guide-to-adam-optimization-1e5f29532c3d
