2.6 RMSprop

There’s another algorithm called RMSprop, which stands for root mean square prop, that can also speed up gradient descent. Let’s see how it works.

Recall our example from before, that if you implement gradient descent, you can end up with huge oscillations in the vertical direction, even while it’s trying to make progress in the horizontal direction.

In order to provide intuition for this example, let’s say that the vertical axis is the parameter b and the horizontal axis is the parameter w. It could really be w1 and w2, or some other pair of parameters; we’re just calling them b and w for the sake of intuition. You want to slow down the learning in the b direction, or the vertical direction, and speed up learning, or at least not slow it down, in the horizontal direction. Here’s what the RMSprop algorithm does to accomplish this.

On iteration t, it will compute as usual the derivatives dW and db on the current mini-batch. I’m going to keep an exponentially weighted average, but instead of VdW I’ll use the new notation SdW. So SdW is equal to beta times its previous value plus (1 - beta) times dW squared. For clarity, this squaring is an element-wise operation, so what this is doing is keeping an exponentially weighted average of the squares of the derivatives. Similarly, we also have Sdb equals beta times Sdb plus (1 - beta) times db squared, where again the squaring is element-wise.

$$S_{dW} = \beta S_{dW} + (1-\beta)\,dW^2$$
$$S_{db} = \beta S_{db} + (1-\beta)\,db^2$$
$$W = W - \alpha\,\frac{dW}{\sqrt{S_{dW}}}$$
$$b = b - \alpha\,\frac{db}{\sqrt{S_{db}}}$$

Next, RMSprop updates the parameters as follows. W gets updated as W minus the learning rate alpha times, whereas previously we had just dW, now dW divided by the square root of SdW. And b gets updated as b minus the learning rate times db divided by the square root of Sdb.
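As a minimal sketch of one iteration (not code from the lecture), assuming NumPy arrays for the parameters and gradients; the function name and the default values of alpha and beta are illustrative:

```python
import numpy as np

def rmsprop_update(W, b, dW, db, S_dW, S_db, alpha=0.01, beta=0.9):
    """One RMSprop iteration for parameters W, b given mini-batch gradients dW, db.

    S_dW and S_db carry the exponentially weighted averages of the squared
    gradients across iterations (initialize them to zeros of the same shapes).
    """
    # Element-wise squares of the gradients feed the moving averages
    S_dW = beta * S_dW + (1 - beta) * np.square(dW)
    S_db = beta * S_db + (1 - beta) * np.square(db)

    # Divide each gradient by the root of its moving average before stepping
    # (in practice a small epsilon is added to the denominator; see the end of this section)
    W = W - alpha * dW / np.sqrt(S_dW)
    b = b - alpha * db / np.sqrt(S_db)
    return W, b, S_dW, S_db
```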

So let’s gain some intuition about how this works. Recall that in the horizontal direction, or in this example the w direction, we want learning to go pretty fast, whereas in the vertical direction, or in this example the b direction, we want to slow down the oscillations. So with the terms SdW and Sdb, what we’re hoping is that SdW will be relatively small, so that the w update is divided by a relatively small number, whereas Sdb will be relatively large, so that the b update is divided by a relatively large number, which slows down the updates in the vertical dimension.

And indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction: the slope is very steep in the b direction, so you get a very large db and a relatively small dW, because the function is sloped much more steeply in the vertical (b) direction than in the horizontal (w) direction. So db squared will be relatively large and Sdb will be relatively large, whereas dW, and hence dW squared and SdW, will be smaller. The net effect is that your updates in the vertical direction are divided by a much larger number, which helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number.
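To make that concrete with a made-up numerical example: suppose on some iteration db ≈ 5 while dW ≈ 0.1, and the moving averages have roughly settled at Sdb ≈ 25 and SdW ≈ 0.01. Then db/√Sdb ≈ 5/5 = 1 and dW/√SdW ≈ 0.1/0.1 = 1, so the steep vertical gradient no longer produces a proportionally larger step; both directions move on a comparable scale, which is exactly what damps the vertical oscillation without slowing horizontal progress.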

So the net impact of using RMSprop is that your updates end up looking more like this: the oscillations in the vertical direction are damped out, while in the horizontal direction you can keep going. One effect of this is that you can therefore use a larger learning rate alpha and get faster learning without diverging in the vertical direction.

Now just for the sake of clarity, I’ve been calling the vertical and horizontal directions b and w just to illustrate this. In practice, you’re in a very high-dimensional space of parameters, so maybe the vertical dimensions where you’re trying to damp the oscillations are some set of parameters, say w1, w2, w17, and the horizontal dimensions might be w3, w4, and so on. In practice, dW and db are very high-dimensional parameter vectors, but the intuition is that in the dimensions where you’re getting these oscillations, you end up computing a larger weighted average of the squared derivatives, and so you end up damping out the directions in which there are these oscillations. So that’s RMSprop, and it stands for root mean square prop, because you’re squaring the derivatives and then taking the square root at the end.

So finally, just a couple of last details on this algorithm before we move on. In the next video, we’re actually going to combine RMSprop together with momentum. So rather than using the hyperparameter beta, which we had used for momentum, I’m going to call this hyperparameter beta2, so that we don’t use the same hyperparameter name for both momentum and RMSprop.

Also, to make sure your algorithm doesn’t divide by zero: what if the square root of SdW is very close to zero? Then things could blow up. Just to ensure numerical stability, when you implement this in practice you add a very, very small epsilon to the denominator. It doesn’t matter much which value of epsilon is used; 10^-8 would be a reasonable default. This just ensures slightly greater numerical stability, so that due to numerical round-off or whatever other reason, you don’t end up dividing by a very, very small number.
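Folding both details into the earlier sketch, the update might look like the following; beta2 and epsilon follow the video’s naming, epsilon defaults to 10^-8 as suggested above, and the other defaults are still just illustrative:

```python
import numpy as np

def rmsprop_update_stable(W, b, dW, db, S_dW, S_db,
                          alpha=0.01, beta2=0.9, epsilon=1e-8):
    """RMSprop iteration with the decay rate named beta2 (to avoid clashing with
    momentum's beta) and a small epsilon added to the denominator for stability."""
    S_dW = beta2 * S_dW + (1 - beta2) * np.square(dW)
    S_db = beta2 * S_db + (1 - beta2) * np.square(db)

    # epsilon keeps the denominator from being vanishingly small
    W = W - alpha * dW / (np.sqrt(S_dW) + epsilon)
    b = b - alpha * db / (np.sqrt(S_db) + epsilon)
    return W, b, S_dW, S_db
```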

So that’s RMSprop. Similar to momentum, it has the effect of damping out the oscillations in gradient descent and in mini-batch gradient descent, allowing you to maybe use a larger learning rate alpha, and thereby speeding up the learning of your algorithm.
