2.1 Mini-batch Gradient Descent


Applying machine learning is a highly empirical, highly iterative process in which you have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly.

It turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire giant training set of, say, 5 million examples. In particular, here’s what you can do.

With $m$ training examples:

$X: (n_x, m), \quad Y: (1, m)$

Let’s say that you split up your training set into smaller, little baby training sets, and these baby training sets are called mini-batches. And let’s say each of your baby training sets has just 1000 examples. So you take $x^{(1)}$ through $x^{(1000)}$ and call that your first little baby training set, also called a mini-batch, denoted $X^{\{1\}}$. Then you take the next 1000 examples, $x^{(1001)}$ through $x^{(2000)}$, and call that $X^{\{2\}}$, then the next 1000 examples, and so on. Altogether, with 5 million training examples, you would have 5,000 of these mini-batches.

You would also split up your training data for Y accordingly.

$X^{\{t\}}: (n_x, 1000), \quad Y^{\{t\}}: (1, 1000)$

Notation:

  • $x^{(i)}$: the $i$-th training example
  • $z^{[l]}$: the hidden units of layer $l$
  • $X^{\{t\}}, Y^{\{t\}}$: the $t$-th mini-batch
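As a concrete illustration of the shapes above, here is a minimal NumPy sketch of slicing the training set into mini-batches. The helper name `make_mini_batches` and the column shuffling are my own choices for the example, not part of the lecture:

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    """Split X of shape (n_x, m) and Y of shape (1, m) into mini-batches.

    Returns a list of (X_t, Y_t) pairs, each with up to `batch_size` columns.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]

    # Shuffle the columns so each mini-batch is a random slice of the data.
    perm = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]

    mini_batches = []
    for start in range(0, m, batch_size):
        X_t = X_shuffled[:, start:start + batch_size]   # (n_x, 1000)
        Y_t = Y_shuffled[:, start:start + batch_size]   # (1, 1000)
        mini_batches.append((X_t, Y_t))
    return mini_batches

# With m = 5,000,000 examples and batch_size = 1000, this yields 5,000 mini-batches.
```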

To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we have been talking about previously, where you process your entire training set all at the same time. The name comes from viewing that as processing your entire batch of training examples at the same time. I know it’s not a great name, but that’s just what it’s called. Mini-batch gradient descent, in contrast, refers to the algorithm we’ll talk about on the next slide, in which you process a single mini-batch $X^{\{t\}}, Y^{\{t\}}$ at a time, rather than processing your entire training set $X, Y$ at the same time.


Pseudocode:

repeat {
for t = 1, …, 5000:

  • Forward prop on $X^{\{t\}}$ (vectorized implementation over the 1000 examples):
    $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
    $A^{[1]} = g^{[1]}(Z^{[1]})$
    $\;\;\vdots$
    $A^{[L]} = g^{[L]}(Z^{[L]})$
  • Compute the cost
    $J^{\{t\}} = \frac{1}{1000} \sum_{i} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \lVert W^{[l]} \rVert_F^2$
  • Back prop to compute gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$)
  • $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}, \quad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$
}
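To make the pseudocode above concrete, here is a minimal NumPy sketch of one epoch, assuming a 2-layer network (ReLU hidden layer, sigmoid output, cross-entropy loss) with parameters stored in a dict; the regularization term is omitted for brevity, and the helper name `one_epoch` is my own:

```python
import numpy as np

def one_epoch(params, mini_batches, alpha=0.01):
    """One pass over all mini-batches (one epoch) of mini-batch gradient descent.

    `params` holds W1 (n_h, n_x), b1 (n_h, 1), W2 (1, n_h), b2 (1, 1).
    """
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    costs = []

    for X_t, Y_t in mini_batches:            # each X_t: (n_x, 1000), Y_t: (1, 1000)
        m_t = X_t.shape[1]

        # Forward prop on the mini-batch (vectorized over its 1000 examples).
        Z1 = W1 @ X_t + b1
        A1 = np.maximum(0, Z1)               # ReLU
        Z2 = W2 @ A1 + b2
        A2 = 1 / (1 + np.exp(-Z2))           # sigmoid, A2 = y_hat

        # Cost J{t} on this mini-batch (cross-entropy, no regularization here).
        J_t = -np.mean(Y_t * np.log(A2 + 1e-8) + (1 - Y_t) * np.log(1 - A2 + 1e-8))
        costs.append(float(J_t))

        # Back prop to compute gradients of J{t}.
        dZ2 = A2 - Y_t
        dW2 = (dZ2 @ A1.T) / m_t
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m_t
        dZ1 = (W2.T @ dZ2) * (Z1 > 0)
        dW1 = (dZ1 @ X_t.T) / m_t
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m_t

        # One gradient descent step per mini-batch (updates params in place).
        W1 -= alpha * dW1; b1 -= alpha * db1
        W2 -= alpha * dW2; b2 -= alpha * db2

    return params, costs
```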

This is one pass through your training set using mini-batch gradient descent. The code written here is also called doing one epoch of training; an epoch is a word that means a single pass through the training set. Whereas with batch gradient descent a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is one epoch, allows you to take 5,000 gradient descent steps. Of course, you usually want to take multiple passes through the training set, so you might put another for loop or while loop around this, and you keep taking passes through the training set until hopefully you converge, or approximately converge. When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and it is pretty much what everyone in deep learning uses when training on a large data set.
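Putting this together, the outer loop over epochs might look like the sketch below, reusing the hypothetical `make_mini_batches` and `one_epoch` helpers sketched above; the data sizes, initialization, and hyperparameters are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 20, 10, 5_000        # small sizes for illustration; the lecture uses m = 5,000,000
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)

params = {
    "W1": rng.standard_normal((n_h, n_x)) * 0.01, "b1": np.zeros((n_h, 1)),
    "W2": rng.standard_normal((1, n_h)) * 0.01,   "b2": np.zeros((1, 1)),
}

mini_batches = make_mini_batches(X, Y, batch_size=1000)   # 5 mini-batches here
for epoch in range(10):            # multiple passes (epochs) through the training set
    params, costs = one_epoch(params, mini_batches, alpha=0.1)
    print(f"epoch {epoch}: last mini-batch cost = {costs[-1]:.4f}")
```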
