2.2 Understanding Mini-batch Gradient Descent

(Figure: cost J versus iteration number for batch gradient descent and for mini-batch gradient descent.)

  • Batch gradient descent:
    With batch gradient descent, on every iteration you go through the entire training set, and you'd expect the cost to go down on every single iteration. So if you plot the cost function J as a function of the number of iterations, it should decrease on every single iteration. And if it ever goes up, even on a single iteration, then something is wrong.
  • Mini-batch gradient descent:
    On mini-batch gradient descent, though, if you plot progress on your cost function, it may not decrease on every iteration. In particular, on every iteration you're processing some X{t}, Y{t}, and so if you plot the cost function J{t}, which is computed using just X{t}, Y{t}, then it's as if on every iteration you're training on a different training set, or really on a different mini-batch. So if you plot the cost function J, you're more likely to see something that trends downwards but is also a little bit noisier. It's okay if it doesn't go down on every iteration, as long as it trends downwards. The reason it'll be a little bit noisy is that maybe X{1}, Y{1} is a relatively easy mini-batch, so your cost might be a bit lower, but then maybe just by chance X{2}, Y{2} is a harder mini-batch, perhaps with some mislabeled examples in it, in which case the cost will be a bit higher, and so on. That's why you get these oscillations as you plot the cost when you're running mini-batch gradient descent.
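
As a rough sketch of where this noisy curve comes from (my own illustration, not code from the lecture; the logistic-regression model, the data shapes, and the hyperparameter defaults are assumptions), the loop below records the cost J{t} on each mini-batch X{t}, Y{t} as training proceeds:

```python
import numpy as np

# Minimal sketch: logistic regression trained with mini-batch gradient descent,
# recording the cost J{t} computed on each mini-batch X{t}, Y{t}.
# Shapes follow the course convention: X is (n_x, m), Y is (1, m).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_costs(X, Y, batch_size=64, learning_rate=0.01, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    costs = []                                # one entry per mini-batch, i.e. J{t}
    for _ in range(epochs):
        perm = rng.permutation(m)             # reshuffle the training set each epoch
        X_shuf, Y_shuf = X[:, perm], Y[:, perm]
        for t in range(0, m, batch_size):
            Xt = X_shuf[:, t:t + batch_size]  # X{t}
            Yt = Y_shuf[:, t:t + batch_size]  # Y{t}
            mt = Xt.shape[1]
            A = sigmoid(w.T @ Xt + b)         # forward pass on this mini-batch only
            cost = -np.mean(Yt * np.log(A + 1e-8) + (1 - Yt) * np.log(1 - A + 1e-8))
            costs.append(cost)                # J{t} is noisy: each mini-batch is different
            dZ = A - Yt                       # gradient step using only this mini-batch
            w -= learning_rate * (Xt @ dZ.T) / mt
            b -= learning_rate * np.sum(dZ) / mt
    return costs
```

Plotting `costs` should show the behaviour described above: a downward trend with oscillations, since an easy mini-batch gives a lower J{t} and a harder one a higher J{t}.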

(Figure: choosing the mini-batch size; the two extremes, batch gradient descent with size m and stochastic gradient descent with size 1, and their trajectories on the cost-function contours.)

Recall that m is the training set size.

  • At one extreme, if the mini-batch size = m, then you just end up with batch gradient descent. In this extreme you have only one mini-batch, X{1}, Y{1}, and that mini-batch is equal to your entire training set. So setting the mini-batch size to m just gives you batch gradient descent.

  • The other extreme would be a mini-batch size of 1. This gives you an algorithm called stochastic gradient descent, where every example is its own mini-batch. What you do in this case is look at the first mini-batch, X{1}, Y{1}; when your mini-batch size is one, this is just your first training example, and you take a gradient descent step on that single example. Then you look at your second mini-batch, which is just your second training example, and take a gradient descent step with that, and then you do the same with the third training example, and so on, looking at just one single training example at a time.
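
To make the two extremes concrete, here is a tiny counting sketch (my own illustration, not from the video; the 5,000,000-example training set is the one used in the previous video): with a mini-batch size of m, one epoch is a single gradient step over the whole set, while with a mini-batch size of 1, one epoch is m separate steps, one per example.

```python
import math

# Gradient descent steps per epoch as a function of mini-batch size,
# for a training set of m = 5,000,000 examples.
m = 5_000_000
for batch_size in (m, 1000, 1):
    steps_per_epoch = math.ceil(m / batch_size)
    print(f"mini-batch size {batch_size:>9,} -> {steps_per_epoch:>9,} steps per epoch")
# size = m    -> 1 step per epoch          (batch gradient descent)
# size = 1000 -> 5,000 steps per epoch     (mini-batch gradient descent)
# size = 1    -> 5,000,000 steps per epoch (stochastic gradient descent)
```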

Picture the contours of the cost function you're trying to minimize, with the minimum at the center of the contours.

  • Batch gradient descent might start somewhere and be able to take relatively low-noise, relatively large steps, so you can just keep marching toward the minimum.
  • In contrast, with stochastic gradient descent, if you start somewhere, let's pick a different starting point, then on every iteration you're taking a gradient descent step with just a single training example. Most of the time you head toward the global minimum, but sometimes you head in the wrong direction, if that one example happens to point you in a bad direction. So stochastic gradient descent can be extremely noisy: on average it'll take you in a good direction, but sometimes it'll head in the wrong direction as well. Also, stochastic gradient descent won't ever converge; it'll always just oscillate and wander around the region of the minimum, but it won't ever head to the minimum and stay there.

In practice, the mini-batch size you use will be somewhere between 1 and m; 1 and m are respectively too small and too large. Here's why.

  • If you use batch gradient descent, so that your mini-batch size equals m, then you're processing the entire training set on every iteration. The main disadvantage is that each iteration takes too long when the training set is very large.

  • If you go to the opposite extreme and use stochastic gradient descent, then it's nice that you get to make progress after processing just a single example; that part is actually not a problem, and the noisiness can be ameliorated, or reduced, by just using a smaller learning rate. But a huge disadvantage of stochastic gradient descent is that you lose almost all the speed-up you get from vectorization, because you're processing a single training example at a time, which is a very inefficient way to process each example.

  • So what works best in practice is something in between: a mini-batch size that is not too big and not too small. In practice this gives you the fastest learning, and you'll notice that it has two good things going for it.

    • One is that you do get a lot of vectorization. In the example we used in the previous video, if your mini-batch size is 1000 examples, then you can vectorize across those 1000 examples, which is much faster than processing the examples one at a time (see the short timing sketch after this list).
    • And second, you can also make progress without needing to wait until you've processed the entire training set. Again using the numbers from the previous video, each epoch, that is, each pass through your training set, lets you take 5,000 gradient descent steps. So in practice there will be some in-between mini-batch size that works best.

    And it's not guaranteed to always head exactly toward the minimum, but it tends to head more consistently in the direction of the minimum than stochastic gradient descent does. It also doesn't always exactly converge; it may oscillate in a very small region around the minimum. If that's an issue, you can always reduce the learning rate slowly. We'll talk more about learning rate decay, that is, how to reduce the learning rate, in a later video.
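
As mentioned in the list above, the vectorization point can be seen with a quick timing sketch (my own illustration; the layer width of 100 units and the 1000-feature, 1000-example mini-batch are arbitrary assumptions): one matrix multiply over the whole mini-batch versus a Python loop over single examples.

```python
import time
import numpy as np

# Rough illustration of the vectorization speed-up an in-between mini-batch size buys:
# one forward pass over 1000 examples at once vs. 1000 forward passes over single examples.
n_x, batch = 1000, 1000
rng = np.random.default_rng(0)
W = rng.standard_normal((100, n_x))         # weights of a hypothetical 100-unit layer
X_batch = rng.standard_normal((n_x, batch))

start = time.perf_counter()
Z_vec = W @ X_batch                         # whole mini-batch in one matrix multiply
t_vectorized = time.perf_counter() - start

start = time.perf_counter()
Z_loop = np.hstack([W @ X_batch[:, i:i + 1] for i in range(batch)])  # one example at a time
t_loop = time.perf_counter() - start

assert np.allclose(Z_vec, Z_loop)
print(f"vectorized: {t_vectorized:.4f}s   one-at-a-time: {t_loop:.4f}s")
```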

So if the mini-batch size should not be m and should not be 1 but should be something in between, how do you go about choosing it?

Well, here are some guidelines.

  1. First, if you have a small training set, just use batch gradient descent. By a small training set I mean less than maybe 2,000 examples; in that case it's perfectly fine to just use batch gradient descent.

  2. Otherwise, if you have a bigger training set, typical mini-batch sizes are anything from 64 up to maybe 512. And because of the way computer memory is laid out and accessed, sometimes your code runs faster if your mini-batch size is a power of 2: 64 is 2^6, 128 is 2^7, 256 is 2^8, and 512 is 2^9. So I'll often implement my mini-batch size to be a power of 2; this range of mini-batch sizes is a little more common.

  3. In practice, of course, the mini-batch size is another hyperparameter that you might do a quick search over to try to figure out which one is most efficient at reducing the cost function J. So what I would do is just try several different values, try a few different powers of 2, and then see if you can pick one that makes your gradient descent optimization algorithm as efficient as possible.
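
As a final sketch (the helper name random_mini_batches and the toy shapes are my own, not from the lecture), here is one way to shuffle the training set and split it into mini-batches of a power-of-2 size; the last mini-batch may be smaller if m isn't an exact multiple of the chosen size.

```python
import numpy as np

# Shuffle (X, Y) and split into mini-batches; X is (n_x, m), Y is (1, m).
def random_mini_batches(X, Y, batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, t:t + batch_size], Y_shuf[:, t:t + batch_size])
            for t in range(0, m, batch_size)]

# Typical candidates to try when treating the mini-batch size as a hyperparameter:
for candidate in (64, 128, 256, 512):       # 2^6, 2^7, 2^8, 2^9
    batches = random_mini_batches(np.zeros((10, 1000)), np.zeros((1, 1000)), candidate)
    print(f"mini-batch size {candidate}: {len(batches)} mini-batches per epoch")
```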
