Intro to Optimization in Deep Learning: Momentum, RMSProp and Adam

Reposted from: https://medium.com/paperspace/intro-to-optimization-in-deep-learning-momentum-rmsprop-and-adam-8335f15fdee2

While local minima and saddle points can stall our training, pathological curvature can slow training down to such an extent that the machine learning practitioner might think the search has converged to a sub-optimal minimum. Let us understand in depth what pathological curvature is.

Pathological Curvature

Consider the following loss contour.

You see, we start off randomly before getting into the ravine-like region marked in blue. The colors actually represent how high the value of the loss function is at a particular point, with reds representing the highest values and blues representing the lowest values.

We want to get down to the minima, but for that we have to move through the ravine. This region is what is called pathological curvature. To understand why it is called pathological, let us delve deeper. This is what pathological curvature looks like, zoomed in.

It's not very hard to get the hang of what is going on here. Gradient descent is bouncing along the ridges of the ravine, and moving a lot more slowly towards the minima. This is because the surface at the ridge curves much more steeply in the direction of w1.

Consider a point A on the surface of the ridge. We see that the gradient at this point can be decomposed into two components, one along direction w1 and the other along w2. Because of the curvature of the loss function, the component of the gradient in the direction of w1 is much larger, and hence the gradient points much more towards w1, and not towards w2 (along which the minima lies).

Normally, we could use a slow learning rate to deal with this bouncing-between-the-ridges problem, as we covered in the last post on gradient descent. However, this spells trouble.

It makes sense to slow down when we are nearing a minimum and want to converge into it. But consider the point where gradient descent enters the region of pathological curvature, and the sheer distance still to go until the minima. If we use a slower learning rate, it might take far too much time to get to the minima. In fact, one paper reports that learning rates small enough to prevent bouncing around the ridges might lead the practitioner to believe that the loss isn't improving at all, and to abandon training altogether.

And if the only directions of significant decrease in f are ones of low curvature, the optimization may become too slow to be practical and even appear to halt altogether, creating the false impression of a local minimum

Probably, we want something that first gets us slowly into the flat region at the bottom of the pathological curvature, and then accelerates in the direction of the minima. Second derivatives can help us do that.

Newton’s Method

Gradient descent is a First Order Optimization Method. It only takes the first order derivatives of the loss function into account, and not the higher ones. What this basically means is that it has no clue about the curvature of the loss function. It can tell whether the loss is declining and how fast, but cannot differentiate between whether the curve is a plane, curving upwards or curving downwards.

This happens because gradient descent only cares about the gradient, which is the same at the red point for all three of the curves above. The solution? Take into account the double derivative, or the rate at which the gradient is changing.

A very popular technique that can use second order derivatives to fix our issue is called Newton's Method. For the sake of not straying from the topic of the post, I won't delve much into the math of Newton's Method. What I'll do instead is try to build an intuition of what Newton's Method does.

Newton's Method can give us an ideal step size to move in the direction of the gradient. Since we now have information about the curvature of our loss surface, the step size can be chosen accordingly, so that we do not overshoot the floor of the region with pathological curvature.

Newton's Method does this by computing the Hessian Matrix, which is a matrix of the double derivatives of the loss function with respect to all combinations of the weights. What I mean by a combination of the weights is something like this.


A Hessian Matrix then accumulates all these gradients in one large matrix.
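For concreteness, for a loss L over weights w_1, …, w_n, the Hessian collects every such second derivative:

$$ H(L) = \begin{bmatrix} \frac{\partial^2 L}{\partial w_1^2} & \cdots & \frac{\partial^2 L}{\partial w_1 \partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 L}{\partial w_n \partial w_1} & \cdots & \frac{\partial^2 L}{\partial w_n^2} \end{bmatrix} $$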

The Hessian gives us an estimate of the curvature of the loss surface at a point. A loss surface can have a positive curvature, which means the surface is rapidly getting less steep as we move. If we have a negative curvature, it means that the surface is getting steeper as we move.

Notice that if this step is negative, it means we can use an arbitrary step. In other words, we can just switch back to our original algorithm. This corresponds to the following case, where the gradient is getting steeper.

However, if the gradient is getting less steep, we might be heading towards a region at the bottom of the pathological curvature. Here, Newton's algorithm gives us a revised learning step which, as you can see, is inversely proportional to the curvature, or how quickly the surface is getting less steep.

If the surface is getting less steep, then the learning step is decreased.
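To make the intuition concrete, here is a toy sketch (illustrative code, not from the original post) comparing a plain gradient descent step with the Newton step w ← w − H⁻¹∇L on a quadratic loss that is steep along w1 and shallow along w2:

```python
import numpy as np

# Quadratic loss L(w) = 0.5 * w^T A w with pathological curvature:
# steep along w1 (curvature 10), shallow along w2 (curvature 1).
A = np.diag([10.0, 1.0])            # Hessian of this toy loss
grad = lambda w: A @ w              # gradient of the loss

w = np.array([1.0, -1.5])           # a point on the "ridge"

gd_step = 0.1 * grad(w)                     # gradient descent step: dominated by the steep w1 direction
newton_step = np.linalg.solve(A, grad(w))   # Newton step H^{-1} grad: rescaled by inverse curvature

print("gradient descent step:", gd_step)      # [1.0, -0.15]
print("newton step:          ", newton_step)  # [1.0, -1.5], i.e. straight to the minimum at the origin
```

For this quadratic, the Newton step lands exactly on the minimum in a single update, precisely because the step in each direction is divided by the curvature in that direction.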

So why don’t we use Newton’s algorithm more often?

You see that Hessian Matrix in the formula? That Hessian requires you to compute gradients of the loss function with respect to every combination of weights. If you know your combinatorics, that number of combinations is of the order of the square of the number of weights present in the neural network.

For modern day architectures, the number of parameters may be in the billions, and calculating a billion squared gradients makes it computationally intractable for us to use higher order optimization methods.

However, here's an idea. Second order optimization is about incorporating information about how the gradient itself is changing. Though we cannot precisely compute this information, we can choose to follow heuristics that guide our search for optima based upon the past behavior of the gradient.

Momentum

A very popular technique that is used along with SGD is called Momentum. Instead of using only the gradient of the current step to guide the search, momentum also accumulates the gradients of past steps to determine the direction to go. The equations of gradient descent are revised as follows.
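One common way to write these equations (the exact symbols here are an assumption, chosen to match the description that follows) is:

$$ v_t = \rho \, v_{t-1} + \nabla_w L(w_t) $$
$$ w_{t+1} = w_t - \alpha \, v_t $$

where ρ is the coefficient of momentum and α is the learning rate.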

The first equation has two parts. The first term is the gradient that is retained from previous iterations. This retained gradient is multiplied by a value called the "Coefficient of Momentum", which is the percentage of the gradient retained every iteration.

If we set the initial value of v to 0 and choose our coefficient to be 0.9, the subsequent update equations would look as follows.
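Unrolling the recurrence above with v_0 = 0 and ρ = 0.9 (again in the reconstructed notation, writing ∇L_t for the gradient at step t) gives:

$$ v_1 = \nabla L_1, \qquad v_2 = \nabla L_2 + 0.9\,\nabla L_1, \qquad v_3 = \nabla L_3 + 0.9\,\nabla L_2 + 0.81\,\nabla L_1, \;\ldots $$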

We see that the previous gradients are also included in subsequent updates, but the weightage of the most recent gradients is higher than that of the less recent ones. (For the mathematically inclined, we are taking an exponential average of the gradient steps.)

How does this help our case? Consider the image, and notice that most of the gradient updates are in a zig-zag direction. Also notice that each gradient update has been resolved into components along the w1 and w2 directions. If we individually sum these vectors up, their components along the direction w1 cancel out, while the components along the w2 direction are reinforced.

For an update, this adds to the component along w2, while zeroing out the component in the w1 direction. This helps us move more quickly towards the minima. For this reason, momentum is also referred to as a technique which dampens oscillations in our search.

It also builds up speed and quickens convergence, but you may want to use simulated annealing in case you overshoot the minima.

In practice, the coefficient of momentum is initialized at 0.5, and gradually annealed to 0.9 over multiple epochs.
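A minimal sketch of what this looks like in code, assuming a grad(w) callable that returns the gradient of the loss (the function name and defaults are illustrative, not from the original post):

```python
import numpy as np

def sgd_momentum(w, grad, lr=0.01, momentum=0.9, steps=100):
    """Gradient descent with momentum on a per-step gradient function."""
    v = np.zeros_like(w)               # retained (exponentially averaged) gradient
    for _ in range(steps):
        v = momentum * v + grad(w)     # accumulate past gradients
        w = w - lr * v                 # step along the accumulated direction
    return w

# Example on the ravine-like quadratic from the Newton sketch (minimum at the origin)
A = np.diag([10.0, 1.0])
w_final = sgd_momentum(np.array([1.0, -1.5]), grad=lambda w: A @ w)
```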

RMSProp

RMSprop, or Root Mean Square Propagation, has an interesting history. It was devised by the legendary Geoffrey Hinton while suggesting a random idea during a Coursera class.

RMSProp also tries to dampen the oscillations, but in a different way than momentum. RMSProp also takes away the need to manually adjust the learning rate, and does it automatically. What's more, RMSProp chooses a different learning rate for each parameter.

In RMSProp, each update is done according to the equations described below. This update is done separately for each parameter.
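A standard reconstruction of those equations (the symbols are an assumption: ν is the averaging coefficient, η the learning rate, ε a small constant, and g_t the gradient component for the parameter being updated) is:

$$ E[g^2]_t = \nu \, E[g^2]_{t-1} + (1-\nu)\, g_t^2 \qquad (1) $$
$$ \Delta w_t = -\,\frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\; g_t \qquad (2) $$
$$ w_{t+1} = w_t + \Delta w_t \qquad (3) $$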

So, let's break down what is happening here.

In the first equation, we compute an exponential average of the square of the gradient. Since we do it separately for each parameter, the gradient Gt here corresponds to the projection, or component, of the gradient along the direction represented by the parameter we are updating.

To do that, we multiply the exponential average computed up to the last update by a hyperparameter, represented by the greek symbol nu. We then multiply the square of the current gradient by (1 - nu). We then add them together to get the exponential average up to the current time step.

The reason we use an exponential average is that, as we saw in the momentum example, it helps us weigh the more recent gradient updates more than the less recent ones. In fact, the name "exponential" comes from the fact that the weightage of previous terms falls exponentially (the most recent term is weighted as p, the next one as p squared, then p cubed, and so on).

Notice in our diagram denoting pathological curvature that the components of the gradients along w1 are much larger than the ones along w2. Since we are squaring and adding them, they don't cancel out, and the exponential average is large for w1 updates.

Then, in the second equation, we decide our step size. We move in the direction of the gradient, but our step size is affected by the exponential average. We choose an initial learning rate eta, and then divide it by the average. In our case, since the average for w1 is much larger than for w2, the learning step for w1 is much smaller than that for w2. Hence, this will help us avoid bouncing between the ridges, and move towards the minima.

The third equation is just the update step. The hyperparameter p is generally chosen to be 0.9, but you might have to tune it. The epsilon in equation 2 is there to ensure that we do not end up dividing by zero, and is generally chosen to be 1e-10.

It is also to be noted that RMSProp implicitly performs simulated annealing. Suppose we are heading towards the minima and want to slow down so as not to overshoot them. RMSProp automatically decreases the size of the gradient steps towards the minima when the steps are too large (large steps make us prone to overshooting).
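A minimal per-parameter sketch of RMSProp, under the same assumptions as the earlier momentum sketch (a grad(w) callable; illustrative names and defaults):

```python
import numpy as np

def rmsprop(w, grad, lr=0.001, nu=0.9, eps=1e-10, steps=100):
    """RMSProp: divide each parameter's step by a running RMS of its gradients."""
    avg_sq = np.zeros_like(w)                        # exponential average of squared gradients (eq. 1)
    for _ in range(steps):
        g = grad(w)
        avg_sq = nu * avg_sq + (1.0 - nu) * g ** 2   # per-parameter average of g^2
        w = w - lr * g / (np.sqrt(avg_sq) + eps)     # damped per-parameter step (eqs. 2 and 3)
    return w
```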

Adam

So far, we've seen that RMSProp and Momentum take contrasting approaches. While momentum accelerates our search in the direction of the minima, RMSProp impedes our search in the direction of oscillations.

Adam, or Adaptive Moment Estimation, combines the heuristics of both Momentum and RMSProp. Here are the update equations.
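In the same reconstructed notation as before (one standard way of writing it; g_t is the gradient, η the learning rate, ε a small constant):

$$ v_t = \beta_1 \, v_{t-1} + (1-\beta_1)\, g_t \qquad (1) $$
$$ s_t = \beta_2 \, s_{t-1} + (1-\beta_2)\, g_t^2 \qquad (2) $$
$$ w_{t+1} = w_t - \eta \, \frac{v_t}{\sqrt{s_t} + \epsilon} \qquad (3) $$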


Here, we compute the exponential average of the gradient as well as of the square of the gradient for each parameter (Eq 1 and Eq 2). To decide our learning step, in equation 3 we multiply our learning rate by the average of the gradient (as was the case with momentum) and divide it by the root mean square of the exponential average of the squares of the gradients (as was the case with RMSProp). Then we add the update.

The hyperparameter beta1 is generally kept around 0.9, while beta_2 is kept at 0.99. Epsilon is generally chosen to be 1e-10.
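A minimal sketch of Adam as described above, assuming a grad(w) callable (note that the published Adam algorithm also bias-corrects the two averages; that step is omitted here to stay close to the equations as described):

```python
import numpy as np

def adam(w, grad, lr=0.001, beta1=0.9, beta2=0.99, eps=1e-10, steps=100):
    """Adam without bias correction: momentum-style numerator, RMSProp-style denominator."""
    v = np.zeros_like(w)                          # exponential average of gradients (eq. 1)
    s = np.zeros_like(w)                          # exponential average of squared gradients (eq. 2)
    for _ in range(steps):
        g = grad(w)
        v = beta1 * v + (1.0 - beta1) * g         # momentum-like average
        s = beta2 * s + (1.0 - beta2) * g ** 2    # RMSProp-like average
        w = w - lr * v / (np.sqrt(s) + eps)       # eq. 3
    return w
```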

Conclusion

In this post, we have seen three methods that build upon gradient descent to combat the problem of pathological curvature while speeding up the search at the same time. These methods are often called "Adaptive Methods", since the learning step is adapted according to the topology of the contour.

Out of the above three, you may find momentum to be the most prevalent, despite Adam looking the most promising on paper. Empirical results have shown that all these algorithms can converge to different optimal local minima given the same loss. However, SGD with momentum seems to find flatter minima than Adam, while adaptive methods tend to converge quickly towards sharper minima. Flatter minima generalize better than sharper ones.

Despite the fact that adaptive methods help us tame the unruly contours of a deep net's loss function, they are not enough, especially as networks become deeper and deeper every day. Along with choosing better optimization methods, considerable research is being put into coming up with architectures that produce smoother loss functions to start with. Batch Normalization and Residual Connections are a part of that effort, and we'll try to do a detailed blog post on them very shortly. But that's it for this post. Feel free to ask questions in the comments.

 
