Complete Guide to Adam Optimization

In the 1940s, mathematical programming was synonymous with optimization. An optimization problem includes an objective function that is to be maximized or minimized by choosing input values from an allowed set of values [1].

Nowadays, optimization is a very familiar term in AI, especially in Deep Learning. And one of the most recommended optimization algorithms for Deep Learning problems is Adam.

Disclaimer: a basic understanding of neural network optimization, such as Gradient Descent and Stochastic Gradient Descent, is recommended before reading.

In this post, I will highlight the following points:

  1. Definition of Adam Optimization
  2. The Road to Adam
  3. The Adam Algorithm for Stochastic Optimization
  4. Visual Comparison Between Adam and Other Optimizers
  5. Implementation
  6. Advantages and Disadvantages of Adam
  7. Conclusion and Further Reading

1. Definition of Adam Optimization

The Adam algorithm was first introduced in the paper Adam: A Method for Stochastic Optimization [2] by Diederik P. Kingma and Jimmy Ba. Adam is defined as “a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement” [2]. Okay, let’s break down this definition into two parts.

First, stochastic optimization is the process of optimizing an objective function in the presence of randomness. To understand this better, let’s think of Stochastic Gradient Descent (SGD). SGD is a great optimizer when we have a lot of data and parameters, because at each step it calculates an estimate of the gradient from a random subset of that data (a mini-batch), unlike Gradient Descent, which considers the entire dataset at each step.

(Image source: deeplearning.ai)
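To make the distinction concrete, here is a small NumPy sketch contrasting the two gradient computations on a toy least-squares problem (the data, model, and hyperparameter values are made up for illustration and are not from the original article):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # toy dataset of 1000 examples
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)      # noisy targets
w = np.zeros(5)                                   # parameters θ

def full_gradient(w):
    # Gradient Descent: uses the entire dataset at every step
    return 2 * X.T @ (X @ w - y) / len(X)

def minibatch_gradient(w, batch_size=32):
    # SGD: estimates the gradient from a random mini-batch of the data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

alpha = 0.1                                       # learning rate
w = w - alpha * minibatch_gradient(w)             # one SGD step: a cheap, noisy gradient estimate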

Second, Adam only requires first-order gradients. Meaning, Adam only needs the first derivatives of the objective with respect to the parameters, not any second-order (Hessian) information.

Now, the name of the algorithm, Adam, is derived from adaptive moment estimation. This will become apparent as we go through the algorithm.

2. The Road to Adam

Adam builds upon and combines the advantages of previous algorithms. To understand the Adam algorithm we need to have a quick background on those previous algorithms.

I. SGD with Momentum

Momentum in physics is a property of an object in motion, such as a ball accelerating down a slope. So, SGD with Momentum [3] incorporates the gradients from the previous update steps to speed up gradient descent. This is done by taking small but straightforward steps in the relevant direction.

SGD with Momentum is achieved by computing a moving average of the gradient (also known as an exponentially weighted average), then using it to update the parameters θ (weights, biases).

(Image: calculates the exponentially weighted average (moving average) of the gradient, then updates the parameters)
  • The term Beta (𝛽) controls the moving average. The value of Beta is in [0, 1); a common value is 𝛽 = 0.9, meaning we are roughly averaging over the last 10 iterations’ gradients, and older gradients are discarded or forgotten. So, a larger value of beta (say 𝛽 = 0.98) means that we are averaging over more gradients.

  • Alpha (α) is the learning rate, which determines the step size at each iteration.

(Image: Left: SGD; Right: SGD with Momentum. Source: Momentum and Learning Rate Adaptation)
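To make the update concrete, here is a minimal NumPy sketch of SGD with Momentum on a toy quadratic objective (the loss, its gradient, and the hyperparameter values are illustrative, not from the original post):

import numpy as np

def grad(theta):
    # gradient of the toy objective f(θ) = ||θ||² / 2, used as a stand-in for a real loss
    return theta

theta = np.array([1.0, -2.0])   # parameters θ (toy initial values)
v = np.zeros_like(theta)        # moving average of the gradients
beta, alpha = 0.9, 0.1          # 𝛽 (controls the moving average) and α (learning rate)

for t in range(100):
    g = grad(theta)                   # current gradient
    v = beta * v + (1 - beta) * g     # exponentially weighted average of past gradients
    theta = theta - alpha * v         # update θ in the direction of the averaged gradient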

II. Related Work (AdaGrad and RMSProp)

Alright, there are two algorithms to know about before we get to Adam. AdaGrad (adaptive gradient algorithm)[4] and RMSProp (root mean square propagation)[5] are both extensions of SGD. The two algorithms share some similarities with Adam. In fact, Adam combines the advantages of the two algorithms.

III. Adaptive Learning Rate

Both AdaGrad and RMSProp are adaptive gradient descent algorithms. Meaning, the learning rate (α) is adapted for each one of the parameters (w, b). In short, a learning rate is maintained per parameter.

To illustrate this better, here is an explanation of AdaGrad and RMSProp:

  • AdaGrad

AdaGrad’s per-parameter learning rate helps increase the learning rate for sparser parameters. Thus, AdaGrad works well for sparse gradients, such as in natural language processing and image recognition applications [4].

  • RMSProp

RMSProp was introduced by Tieleman and Hinton to speed up mini-batch learning. In RMSProp, the learning rate adapts based on the moving average of the magnitudes of the recent gradients.

Meaning, RMSProp maintains a moving average of the squares of the recent gradients, denoted by (v), thus giving more weight to recent gradients.

Here, the term Beta (𝛽) is introduced as the forgetting factor (just like in SGD with Momentum).

(Image: calculates the moving average of the squares of the recent gradients)

In short, when updating θ (say w or b), divide the gradient with respect to that parameter by the square root of the moving average of the squares of its recent gradients, multiply the result by α, and subtract it from the previous value of θ.

(Image: the update step for θ)

Also, RMSProp works well on big and redundant datasets (e.g. noisy data)[5].

* The term (𝜖) is used for numerical stability (to avoid division by zero).
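As a minimal sketch (same toy objective and illustrative hyperparameters as in the momentum example above), one RMSProp-style update loop looks like this:

import numpy as np

def grad(theta):
    # gradient of the toy objective f(θ) = ||θ||² / 2
    return theta

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)            # moving average of the squared gradients
beta, alpha, eps = 0.9, 0.01, 1e-8  # forgetting factor, learning rate, numerical-stability term

for t in range(100):
    g = grad(theta)
    v = beta * v + (1 - beta) * g**2                  # moving average of the squared gradients
    theta = theta - alpha * g / (np.sqrt(v) + eps)    # per-parameter adaptive update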

Here’s a visual comparison of what we learned so far:

(Animation by Alec Radford, via Imgur)

In the gif above, you can see Momentum exploring around before finding the correct path. As for SGD, AdaGrad, and RMSProp, they all take a similar path, but AdaGrad and RMSProp are clearly faster.

3. The Adam Algorithm for Stochastic Optimization

Okay, now we’ve got all the pieces we need to get to the algorithm.

As explained by Andrew Ng, Adam (adaptive moment estimation) is simply a combination of Momentum and RMSProp.

(Image: The Adam Algorithm. Source: Adam: A Method for Stochastic Optimization [2])

Here’s the algorithm to optimize an objective function f(θ), with parameters θ (weights and biases).

Adam includes the hyperparameters α, 𝛽1 (from Momentum), 𝛽2 (from RMSProp), and 𝜖 (for numerical stability).

Initialize:

  • m = 0, this is the first moment vector, treated as in Momentum
  • v = 0, this is the second moment vector, treated as in RMSProp
  • t = 0

On iteration t:

  • Update t, t := t + 1
  • Get the gradients/derivatives (g) of the objective with respect to the parameters θ at timestep t; here g is equivalent to dw and db respectively
(Image: the gradient g at timestep t)
  • Update the first moment mt
  • Update the second moment vt
(Image: the updates of mt and vt respectively)
  • Compute the bias-corrected mt (bias correction gives a better estimate of the moving averages)
  • Compute the bias-corrected vt
(Image: the bias-corrected mt and vt respectively)
  • Update the parameters θ
(Image: the parameter update)

And that’s it! The loop will continue until Adam converges to a solution.

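For reference, the update rules shown in the images above can be written out explicitly, following the notation of the Adam paper [2]:

$$
\begin{aligned}
g_t &= \nabla_{\theta} f_t(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t \,/\, (1-\beta_1^t) \\
\hat{v}_t &= v_t \,/\, (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t \,/\, \big(\sqrt{\hat{v}_t} + \epsilon\big)
\end{aligned}
$$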

4. Visual Comparison Between Optimizers

A better way to recognize the differences between the previously mentioned optimization algorithms is to see a visual comparison of their performance.

(Image: comparison of the training cost of different optimizers [2])

The figure above is from the Adam paper. It showcases the training cost over 45 epochs, and you can see Adam converging faster than AdaGrad for CNNs. Perhaps it’s good to mention that AdaGrad corresponds to a version of Adam with the hyperparameters (α, 𝛽1, 𝛽2) set to specific values [2]. I decided to leave AdaGrad’s math explanation out of this post to avoid confusion, but here is a simple explanation by mxnet if you want to learn more about it.

(Gif by author, using [7])

In the gif above, you can see Adam and RMSProp converging at a similar speed, while AdaGrad seems to be struggling to converge.

(Gif by author, using [7])

Meanwhile, in this gif, you can see Adam and SGD with Momentum converging to a solution, while SGD, AdaGrad, and RMSProp seem to be stuck in a local minimum.

5. Implementation

Here I’ll show three different ways to incorporate Adam into your model, with TensorFlow, PyTorch, and NumPy implementations.

  • Implementation with TensorFlow (Keras):

import tensorflow as tf

tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
    name='Adam')

  • Implementation with PyTorch:

import torch

torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999),
                 eps=1e-08, weight_decay=0, amsgrad=False)
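For context, here is how the PyTorch optimizer is typically wired into a training loop; this is a minimal, self-contained sketch with a toy model and toy data, not code from the original article:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # toy model (illustrative)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 10)                         # toy batch of inputs
y = torch.randn(32, 1)                          # toy targets

for step in range(100):
    optimizer.zero_grad()                       # clear gradients from the previous step
    loss = loss_fn(model(x), y)                 # forward pass and loss
    loss.backward()                             # backpropagation computes the gradients
    optimizer.step()                            # Adam updates the parameters

The TensorFlow/Keras optimizer is used in the same spirit, typically by passing it to model.compile().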
  • Implementation with just NumPy:

This implementation may not be as practical, but it will give you a much better understanding of the Adam algorithm.

But as you can guess, the code is quite long, so for better viewing, here’s the gist.

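For readers who only want the core loop, here is a minimal NumPy sketch of the Adam update from Section 3, using the paper's default hyperparameters; it is an illustrative re-implementation following the algorithm above, not the code from the gist:

import numpy as np

def adam_update(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step for parameters theta given the gradient g at timestep t
    m = beta1 * m + (1 - beta1) * g                          # first moment estimate
    v = beta2 * v + (1 - beta2) * g**2                       # second moment estimate
    m_hat = m / (1 - beta1**t)                               # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                               # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v

# Toy usage: minimize f(θ) = ||θ||² / 2, whose gradient is simply θ
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):                                     # t starts at 1 for the bias correction
    g = theta                                                # gradient of the toy objective
    theta, m, v = adam_update(theta, g, m, v, t)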

6. Advantages and Disadvantages of Adam

Compared to other algorithms, Adam is one of the best optimizers, but it is not perfect either. So, here are some advantages and disadvantages of Adam.

Advantages:

  1. Can handle sparse gradients on noisy datasets.
  2. Default hyperparameter values do well on most problems.
  3. Computationally efficient.
  4. Requires little memory, thus memory efficient.
  5. Works well on large datasets.

Disadvantages:

  1. Adam does not converge to an optimal solution in some cases (this is the motivation for AMSGrad).
  2. Adam can suffer from a weight decay problem (which is addressed in AdamW); both variants are shown in the snippet after this list.
  3. Recent optimization algorithms have been proven faster and better [6].
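Both of these variants are available in common frameworks. For example, in PyTorch (a brief sketch; the toy model is only there to provide parameters to optimize):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # toy model (illustrative)

# AMSGrad: Adam with the AMSGrad convergence fix enabled
opt_amsgrad = torch.optim.Adam(model.parameters(), lr=0.001, amsgrad=True)

# AdamW: Adam with decoupled weight decay
opt_adamw = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)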

7. Conclusion and Further Reading

That is all for Adam: adaptive moment estimation!

Adam is an extension of SGD, and it combines the advantages of AdaGrad and RMSProp. Adam is also an adaptive gradient descent algorithm, in that it maintains a learning rate per parameter. And it keeps track of moving averages of the first and second moments of the gradient. Thus, using the first and second moments, Adam can give an unscaled, direct estimation of the parameter updates. Finally, although newer optimization algorithms have emerged, Adam (and SGD) is still a stable optimizer to use.

Great resources for further reading (and watching):

8. References

  1. Stephen J. Wright, Optimization (2016), Encyclopædia Britannica
  2. Diederik P. Kingma and Jimmy Ba, Adam: A Method for Stochastic Optimization (2015), arXiv
  3. Rumelhart, Hinton, and Williams, Learning Internal Representations by Error Propagation (1986), ACM
  4. Duchi et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (2011), Stanford
  5. Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, Neural Networks for Machine Learning (Lecture 6) (2012), University of Toronto and Coursera
  6. John Chen, An Updated Overview of Recent Gradient Descent Algorithms (2020), GitHub
  7. kuroitu S, Comparison of Optimization Methods (2020), Qiita

Translated from: https://towardsdatascience.com/complete-guide-to-adam-optimization-1e5f29532c3d
