Large scale machine learning - Stochastic gradient descent

In this class, we'll talk about a modification to the basic gradient descent algorithm called Stochastic gradient descent. It will allow us to scale the algorithm to much bigger training sets.

Note that we'll use linear regression as the example, but the idea of Stochastic gradient descent is fully general and also applies to other learning algorithms such as logistic regression, neural networks, and others.

Batch gradient descent

figure-1

Figure-1 shows the hypothesis h_{\theta}(x) and the cost function J_{train}(\theta ) for linear regression. The cost function, as we've already seen, looks like a bowl-shaped function.

figure-2

Figure-2 shows how gradient descent works. If the parameters are initialized at the point shown there, then successive iterations of gradient descent take the parameters to the global minimum, so the trajectory heads pretty directly to the global minimum.

The problem with gradient descent is that when m is large, computing the derivative term can be very expensive, because the update for each parameter,
    \theta _{j}:=\theta _{j}-\alpha \frac{1}{m}\sum _{i=1}^{m}(h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}
sums over all m training examples. For example, if we have 300 million training examples, then for each little step we need to scan through all 300 million records, and it's going to take a long time for the algorithm to converge.

This particular version of gradient descent is called Batch gradient descent. The term "batch" refers to the fact that we look at the entire batch of training examples at a time.
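
To make this concrete, here is a minimal sketch of one batch update in Python/NumPy. The names are illustrative assumptions rather than anything fixed above: X is taken to be an m-by-(n+1) design matrix with a leading column of ones, y the target vector, and alpha the learning rate. The point is simply that a single step touches all m rows.

    import numpy as np

    def batch_gradient_descent_step(theta, X, y, alpha):
        """One step of batch gradient descent for linear regression.

        Assumes X is an (m, n+1) design matrix whose first column is all ones,
        y is a length-m target vector, and theta is a length-(n+1) parameter
        vector. Note that even a single step scans all m examples.
        """
        m = X.shape[0]
        errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
        gradient = (X.T @ errors) / m   # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x^(i)
        return theta - alpha * gradient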

Stochastic gradient descent

In contrast to Batch gradient descent, we'll come up with a different algorithm called Stochastic gradient descent. It doesn't need to look at all the training examples in every single iteration; instead, it only needs to look at a single training example per iteration.

figure-3

The left of figure-3 shows the cost function / optimization objective and the update steps of Batch gradient descent. On the right side, for the new Stochastic gradient descent:

  1. We define the cost of the parameters \theta with respect to a single training example (x^{(i)},y^{(i)}) as below. It measures how well the hypothesis is doing on that single example (x^{(i)},y^{(i)}).
    cost(\theta , (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_{\theta }(x^{(i)})-y^{(i)})^{2}
  2. Then we write out the overall cost function slightly differently: J_{train}(\theta) is simply the average of this cost over all m training examples (see the short sketch after this list).
    J_{train}(\theta )=\frac{1}{m}\sum _{i=1}^{m}cost(\theta ,(x^{(i)},y^{(i)}))
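
As a small illustration, the two definitions above translate directly into code. This is just a sketch with assumed names (x_i is a single feature vector including the bias term, X and y hold the full training set); it is not meant to be an efficient implementation.

    import numpy as np

    def cost(theta, x_i, y_i):
        """cost(theta, (x^(i), y^(i))): half the squared error on one example."""
        return 0.5 * (x_i @ theta - y_i) ** 2

    def J_train(theta, X, y):
        """J_train(theta): the average of the per-example cost over all m examples."""
        m = X.shape[0]
        return np.mean([cost(theta, X[i], y[i]) for i in range(m)])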

With this view in mind, here is what Stochastic gradient descent does:

  1. Randomly shuffle the data set, that is, randomly reorder the m training examples.
  2. Repeat (usually 1 to 10 times) the scan through the m training examples and, for each example i, perform the following update for every j (a full sketch follows this list):
    \theta _{j}:=\theta _{j}-\alpha (h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}
    Note that a bit of calculus shows this is exactly a gradient step on the per-example cost:
    (h_{\theta }(x^{(i)})-y^{(i)})x^{(i)}_{j}=\frac{\partial }{\partial \theta _{j}}cost(\theta , (x^{(i)},y^{(i)}))
  3. See figure-3 for the overall process.
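
Putting these steps together, here is a minimal sketch of the whole procedure in Python/NumPy. The function name, the default learning rate, and the default number of passes are illustrative assumptions; in practice they would be chosen based on the data set.

    import numpy as np

    def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=1):
        """Stochastic gradient descent for linear regression.

        Assumes X is an (m, n+1) design matrix (first column all ones) and
        y is a length-m target vector. alpha and num_passes are illustrative
        defaults, not prescribed values.
        """
        m, n_plus_1 = X.shape
        theta = np.zeros(n_plus_1)

        # 1. Randomly shuffle the data set (reorder the m training examples).
        order = np.random.default_rng().permutation(m)
        X, y = X[order], y[order]

        # 2. Repeat the scan through the m examples, updating theta after
        #    looking at each single example.
        for _ in range(num_passes):
            for i in range(m):
                error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i)
                theta = theta - alpha * error * X[i]  # updates every theta_j at once
        return theta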

So what Stochastic gradient descent is doing is scanning through the training examples. First, it looks at the first example (x^{(1)}, y^{(1)}) and takes a little gradient descent step with respect to the cost of just that first training example. Having done this, inside the inner for loop shown in figure-3, it goes on to the second training example and takes another little step to try to fit the second training example better, and so on, until it gets through the entire training set. The outer repeat loop may then cause it to take multiple passes over the entire training set. This view of Stochastic gradient descent also motivates why we want to start by randomly shuffling the data set.

Stochastic gradient descent intuition

figure-4

When we use Batch gradient descent, it tends to take a reasonably straight-line trajectory to the global minimum, as the red line in figure-4 shows. In contrast, with Stochastic gradient descent every iteration just tries to fit a single training example better. So, as the magenta line in figure-4 shows, it will generally move the parameters in the direction of the global minimum, but not always. In fact, Stochastic gradient descent doesn't converge in the same sense as Batch gradient descent: it wanders around continuously in some region close to the global minimum rather than reaching the global minimum and staying there. In practice, though, as long as the parameters end up pretty close to the global minimum, that will be a pretty good hypothesis.

The last detail is how many times to repeat this outer loop. Depending on the size of the training set, doing the loop just a single time may be enough, or it may take up to maybe 10 passes. If you have a pretty massive data set, like 300 million training examples, it's possible that a single pass through the data already gives you a perfectly good hypothesis.

So, that was Stochastic gradient descent. Hopefully it will allow you to scale up many of your learning algorithms to much bigger data sets and get much better performance.

<end>
