In the previous class, we talked about Stochastic gradient descent and how it can be faster than Batch gradient descent. In this class, let's talk about Mini-batch gradient descent, which can sometimes work even faster than Stochastic gradient descent.
To summarize:
- Batch gradient descent: use all m examples in each iteration
- Stochastic gradient descent: use 1 example in each iteration
Mini-batch gradient descent is somewhere in between. Rather than using 1 example or m examples, we use b examples in each iteration, where b is called the "mini-batch size". A typical value for b is 10, and a typical range is 2 to 100.
Figure 1 above shows the Mini-batch gradient descent algorithm. Here we have a mini-batch size of 10 and 1000 training examples, so we perform each gradient descent update using 10 examples at a time, and we need 100 steps of size 10 to get through all 1000 training examples.
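The loop described above can be sketched in NumPy. This is a minimal, illustrative implementation assuming linear regression with hypothesis h(x) = θᵀx; the function name, learning rate, and toy data are my own choices, not from the lecture:

```python
import numpy as np

def minibatch_gd(X, y, theta, alpha=0.01, b=10, epochs=5):
    """Mini-batch gradient descent for linear regression (illustrative sketch)."""
    m = X.shape[0]
    for _ in range(epochs):
        for start in range(0, m, b):             # m/b parameter updates per pass
            Xb = X[start:start + b]              # next b training examples
            yb = y[start:start + b]
            grad = Xb.T @ (Xb @ theta - yb) / b  # gradient averaged over the mini-batch
            theta = theta - alpha * grad         # update after only b examples
    return theta

# toy data: m = 1000 examples that exactly satisfy y = 0 + 2*x
rng = np.random.default_rng(0)
X = np.c_[np.ones(1000), rng.uniform(0, 1, 1000)]  # bias column plus one feature
y = X @ np.array([0.0, 2.0])
theta = minibatch_gd(X, y, np.zeros(2), alpha=0.5, b=10, epochs=50)
print(theta)  # approaches [0, 2]
```

With b=10 and m=1000, each pass over the data makes 100 parameter updates, matching the "100 steps of size 10" described above.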
Compared with Batch gradient descent, this also allows us to make progress much faster. Again, say we have 300,000,000 examples:
- With Batch gradient descent, we need to scan through the entire 300 million example training set before we can make any progress.
- With Mini-batch gradient descent, after looking at just the first 10 examples we can already start to make progress in improving the parameters θ. Then we can look at the second 10 examples, modify the parameters a little bit again, and so on.
How about Mini-batch gradient descent versus Stochastic gradient descent? Why do we want to look at b examples at a time instead of just 1 example at a time as in Stochastic gradient descent?
The answer is vectorization. Mini-batch gradient descent is likely to outperform Stochastic gradient descent only if you have a good vectorized implementation. In that case, the sum over the b examples can be computed in a vectorized way using good numerical linear algebra libraries, which lets you partially parallelize your computation over the b examples.
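To make the vectorization point concrete, here is a small sketch (assuming the same linear regression setting; the variable names are illustrative) comparing a per-example Python loop, which is how plain Stochastic gradient descent would process these examples one at a time, against a single matrix operation over the whole mini-batch that an optimized linear algebra library can parallelize:

```python
import numpy as np

rng = np.random.default_rng(1)
Xb = rng.normal(size=(10, 3))   # a mini-batch of b = 10 examples, 3 features
yb = rng.normal(size=10)
theta = rng.normal(size=3)

# one example at a time: a Python-level loop, no chance to parallelize
grad_loop = np.zeros(3)
for i in range(10):
    grad_loop += (Xb[i] @ theta - yb[i]) * Xb[i]
grad_loop /= 10

# whole mini-batch at once: one matrix-vector product handled by the BLAS library
grad_vec = Xb.T @ (Xb @ theta - yb) / 10

print(np.allclose(grad_loop, grad_vec))  # True: same gradient, computed in bulk
```

Both compute the same averaged gradient; the vectorized form simply hands all 10 examples to the library in one call, which is where the speedup over Stochastic gradient descent comes from.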
<end>