2.6 Gradient Descent with Momentum

In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead.

As an example, let’s say that you’re trying to optimize a cost function which has contours like the ones below, where the red dot denotes the position of the minimum.
[Figure: contour plot of the cost function; the red dot marks the minimum]

Maybe you start gradient descent here, and if you take one iteration of gradient descent, you may end up heading there. But now you’re on the other side of this ellipse, and if you take another step of gradient descent, maybe you end up doing that. And then another step, another step, and so on. And you see that gradient descent will take a lot of steps, right? It just slowly oscillates toward the minimum. These up-and-down oscillations slow down gradient descent and prevent you from using a much larger learning rate.

In particular, if you were to use a much larger learning rate, you might end up overshooting and diverging like so. And so the need to prevent the oscillations from getting too big forces you to use a learning rate that’s not itself too large.

[Figure: oscillating gradient descent path; a larger learning rate overshoots and diverges]
Another way of viewing this problem is that on the vertical axis you want your learning to be a bit slower, because you don't want those oscillations. But on the horizontal axis, you want faster learning. Right, because you want it to aggressively move from left to right, toward that minimum, toward that red dot. So here’s what you can do if you implement gradient descent with momentum.

[Figure: gradient descent with momentum damps the vertical oscillations]
On each iteration, or more specifically, during iteration $t$, you would compute the usual derivatives $dW$, $db$. I’ll omit the superscript $[l]$’s, but you compute $dW$, $db$ on the current mini-batch. This works for batch gradient descent as well: if your current mini-batch is your entire training set, everything goes through unchanged.

$$
\begin{aligned}
v_{dW} &= \beta\, v_{dW} + (1-\beta)\, dW \\
v_{db} &= \beta\, v_{db} + (1-\beta)\, db \\
W &= W - \alpha\, v_{dW} \\
b &= b - \alpha\, v_{db}
\end{aligned}
$$

And then what you do is compute $v_{dW} = \beta v_{dW} + (1-\beta)\,dW$. This is similar to when we were previously computing $v_\theta = \beta v_\theta + (1-\beta)\theta_t$: it’s computing a moving average of the derivatives you’re getting for $W$. Then you similarly compute $v_{db} = \beta v_{db} + (1-\beta)\,db$. And then you update your weights: $W$ gets updated as $W$ minus the learning rate times, instead of the raw derivative $dW$, the moving average $v_{dW}$. Similarly, $b$ gets updated as $b - \alpha\, v_{db}$.
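As a minimal NumPy sketch of these update rules (the function and variable names here are my own, not from the lecture), one momentum step for a single layer might look like this:

```python
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One iteration of gradient descent with momentum.

    dW, db are the gradients computed on the current mini-batch;
    v_dW, v_db are the exponentially weighted averages of past gradients.
    """
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - alpha * v_dW  # update with the smoothed gradient v_dW,
    b = b - alpha * v_db  # not with the raw dW, db
    return W, b, v_dW, v_db

# The moving averages start as zeros of the same shape as the parameters.
W, b = np.random.randn(3, 2), np.zeros((3, 1))
v_dW, v_db = np.zeros_like(W), np.zeros_like(b)
```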

So what this does is smooth out the steps of gradient descent. For example, let’s say that the last few derivatives you computed looked like the arrows in the figure. If you average out these gradients, you find that the oscillations in the vertical direction tend to average out to something close to zero. So in the vertical direction, where you want to slow things down, this averages out positive and negative numbers, and the average is close to zero. Whereas in the horizontal direction, all the derivatives point to the right, so the average in the horizontal direction is still pretty big. That’s why with this algorithm, after a few iterations, you find that gradient descent with momentum ends up taking steps with much smaller oscillations in the vertical direction but moves quickly in the horizontal direction. This allows your algorithm to take a more direct path, or to damp out the oscillations on its path to the minimum.
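To see this numerically, here is a toy check (my own illustration, not from the lecture): a two-component "gradient" whose vertical component flips sign every step while the horizontal component stays constant.

```python
import numpy as np

beta = 0.9
v = np.zeros(2)
for t in range(20):
    grad = np.array([1.0, (-1.0) ** t])  # [horizontal, vertical]
    v = beta * v + (1 - beta) * grad

print(v)  # roughly [0.88, -0.05]: the alternating vertical component
          # averages out near zero while the horizontal one stays large
```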

[Figure: momentum takes a more direct path toward the minimum]
Finally, let’s look at some details of how you implement this. You now have two hyperparameters: the learning rate $\alpha$, as well as the parameter $\beta$, which controls your exponentially weighted average. The most common value for $\beta$ is 0.9. Just as we were averaging over the last ten days’ temperature in the earlier example, this is averaging over the gradients of roughly the last ten iterations. In practice, $\beta = 0.9$ works very well. Feel free to try different values and do some hyperparameter search, but 0.9 appears to be a pretty robust value.
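As a quick check, the $\frac{1}{1-\beta}$ rule of thumb from the earlier videos on exponentially weighted averages gives:

$$
\frac{1}{1-\beta} = \frac{1}{1-0.9} = 10 \ \text{iterations}
$$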

Well, and how about bias correction? Do you want to take $v_{dW}$ and $v_{db}$ and divide them by $1 - \beta^t$? In practice, people don’t usually do this, because after just ten iterations your moving average will have warmed up and is no longer a biased estimate. So in practice, I don’t really see people bothering with bias correction when implementing gradient descent with momentum.
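If you did want the bias-corrected version, it would be a small change to the sketch above (continuing its variables; again my own illustration, not the lecture’s code), where `t` is the 1-indexed iteration count:

```python
# After updating the moving averages on iteration t (t = 1, 2, ...):
v_dW_corrected = v_dW / (1 - beta ** t)  # undo the bias toward the
v_db_corrected = v_db / (1 - beta ** t)  # zero initialization
W = W - alpha * v_dW_corrected
b = b - alpha * v_db_corrected
```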

And of course, this process initializes $v_{dW} = 0$. Note that this is a matrix of zeros with the same dimensions as $dW$, which has the same dimensions as $W$. $v_{db}$ is also initialized to a vector of zeros, with the same dimensions as $db$, which in turn has the same dimensions as $b$.

