Large scale machine learning - Stochastic gradient descent

In this class, we'll talk about a modification to the basic gradient descent algorithm called Stochastic gradient descent. It will allow us to scale the algorithm to much bigger training sets.

Note that we'll use linear regression as the example, but the idea of Stochastic gradient descent is fully general and also applies to other learning algorithms such as logistic regression, neural networks, and others.

Batch gradient descent

figure-1

Figure-1 shows the hypothesis h_{\theta}(x) and the cost function J_{train}(\theta ) for linear regression. The cost function, as we've already seen, looks like a bowl-shaped function.

figure-2

Figure-2 shows how gradient descent works. If the parameters are initialized at the point shown there, then successive iterations of gradient descent take the parameters to the global minimum, so the trajectory heads pretty directly to the global minimum.

The problem with gradient descent is that when m is large, computing the derivative term can be very expensive, because the update for each parameter,
    \theta _{j}:=\theta _{j}-\alpha \frac{1}{m}\sum _{i=1}^{m}(h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}
sums over all m training examples. For example, if we have 300 million training examples, then for each little step we need to scan through all 300 million records, and it's going to take a long time for the algorithm to converge.

This particular version of gradient descent is called Batch gradient descent. The term "batch" refers to the fact that we look at the entire batch of training examples at a time.
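
To make this concrete, here is a minimal sketch of one batch update in Python/NumPy. The names are illustrative assumptions rather than anything fixed above: X is taken to be an m-by-(n+1) design matrix with a leading column of ones, y the target vector, and alpha the learning rate. The point is simply that a single step touches all m rows.

    import numpy as np

    def batch_gradient_descent_step(theta, X, y, alpha):
        """One step of batch gradient descent for linear regression.

        Assumes X is an (m, n+1) design matrix whose first column is all ones,
        y is a length-m target vector, and theta is a length-(n+1) parameter
        vector. Note that even a single step scans all m examples.
        """
        m = X.shape[0]
        errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
        gradient = (X.T @ errors) / m   # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x^(i)
        return theta - alpha * gradient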

Stochastic gradient descent

In contrast to Batch gradient descent, we'll come up with a different algorithm called Stochastic gradient descent. It doesn't need to look at all the training examples in every single iteration; instead, it only needs to look at a single training example per iteration.

figure-3

The left of figure-3 shows the cost function / optimization objective and the update steps of Batch gradient descent. On the right side, for the new Stochastic gradient descent:

  1. We define the cost of the parameters \theta with respect to a single training example (x^{(i)},y^{(i)}) as below. It measures how well the hypothesis is doing on that single example (x^{(i)},y^{(i)}).
    cost(\theta , (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_{\theta }(x^{(i)})-y^{(i)})^{2}
  2. Then we write out the overall cost function slightly differently: J_{train}(\theta) is simply the average of this cost over all m training examples (see the short sketch after this list).
    J_{train}(\theta )=\frac{1}{m}\sum _{i=1}^{m}cost(\theta ,(x^{(i)},y^{(i)}))
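
As a small illustration, the two definitions above translate directly into code. This is just a sketch with assumed names (x_i is a single feature vector including the bias term, X and y hold the full training set); it is not meant to be an efficient implementation.

    import numpy as np

    def cost(theta, x_i, y_i):
        """cost(theta, (x^(i), y^(i))): half the squared error on one example."""
        return 0.5 * (x_i @ theta - y_i) ** 2

    def J_train(theta, X, y):
        """J_train(theta): the average of the per-example cost over all m examples."""
        m = X.shape[0]
        return np.mean([cost(theta, X[i], y[i]) for i in range(m)])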

With this view in mind, here is what Stochastic gradient descent does:

  1. Randomly shuffle the data set, that is, randomly reorder the m training examples.
  2. Repeat (usually 1 to 10 times) the scan through the m training examples and, for each example i, perform the following update for every j (a full sketch follows this list):
    \theta _{j}:=\theta _{j}-\alpha (h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}
    Note that a bit of calculus shows this is exactly a gradient step on the per-example cost:
    (h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}=\frac{\partial }{\partial \theta _{j}}cost(\theta , (x^{(i)},y^{(i)}))
  3. See figure-3 for the overall process.
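
Putting these steps together, here is a minimal sketch of the whole procedure in Python/NumPy. The function name, the default learning rate, and the default number of passes are illustrative assumptions; in practice they would be chosen based on the data set.

    import numpy as np

    def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=1):
        """Stochastic gradient descent for linear regression.

        Assumes X is an (m, n+1) design matrix (first column all ones) and
        y is a length-m target vector. alpha and num_passes are illustrative
        defaults, not prescribed values.
        """
        m, n_plus_1 = X.shape
        theta = np.zeros(n_plus_1)

        # 1. Randomly shuffle the data set (reorder the m training examples).
        order = np.random.default_rng().permutation(m)
        X, y = X[order], y[order]

        # 2. Repeat the scan through the m examples, updating theta after
        #    looking at each single example.
        for _ in range(num_passes):
            for i in range(m):
                error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i)
                theta = theta - alpha * error * X[i]  # updates every theta_j at once
        return theta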

So what Stochastic gradient descent is doing is scanning through the training examples. First, it looks at the first example (x^{(1)}, y^{(1)}) and takes a little gradient descent step with respect to the cost of just that first training example. Having done this, inside the inner for loop shown in figure-3, it goes on to the second training example and takes another little step to try to fit the second training example better, and so on, until it gets through the entire training set. The outer repeat loop may then cause it to take multiple passes over the entire training set. This view of Stochastic gradient descent also motivates why we want to start by randomly shuffling the data set.

Stochastic gradient descent intuition

figure-4

When we use Batch gradient descent, it tends to take a reasonably straight-line trajectory to the global minimum, as the red line in figure-4 shows. In contrast, with Stochastic gradient descent every iteration just tries to fit a single training example better. So, as the magenta line in figure-4 shows, it will generally move the parameters in the direction of the global minimum, but not always. In fact, Stochastic gradient descent doesn't converge in the same sense as Batch gradient descent: it wanders around continuously in some region close to the global minimum rather than reaching the global minimum and staying there. In practice, though, as long as the parameters end up pretty close to the global minimum, that will be a pretty good hypothesis.

The last detail is how many times to repeat this outer loop. Depending on the size of the training set, doing the loop just a single time may be enough, or it may take up to maybe 10 passes. If you have a pretty massive data set, like 300 million training examples, it's possible that a single pass through the data already gives you a perfectly good hypothesis.

So, that was Stochastic gradient descent. Hopefully it will allow you to scale up many of your learning algorithms to much bigger data sets and get much better performance.

<end>
