Coursera | Andrew Ng (02-week-2-2.1): Mini-batch Gradient Descent

ZJ_Improve

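This series adds personal study notes and supplementary derivations to selected points of the original course. If you find mistakes, corrections are welcome. After studying Andrew Ng's course, I wrote it up as text so that it is easier to look up and review. Because I am also studying English, the series is written mainly in English, and I recommend that readers work primarily from the English as well, using Chinese only as a supplement, in preparation for reading academic papers in the field later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom

When reposting, please credit the author and source: ZJ, WeChat official account "SelfImprovementLab"

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79096566

2.1 Mini-batch Gradient Descent

(Subtitle source: NetEase Cloud Classroom)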

Hello, and welcome back. This week, you will learn about optimization algorithms that will enable you to train your neural network much faster. You've heard me say before that applying machine learning is a highly empirical, highly iterative process, in which you just have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly. One thing that makes this more difficult is that deep learning tends to work best in the regime of big data: we are able to train neural networks on a huge data set, and training on a large data set is just slow. So what you will find is that having fast, good optimization algorithms can really speed up the efficiency of you and your team.

So, let's get started by talking about mini-batch gradient descent. You've learned previously that vectorization allows you to efficiently compute on all m examples; it allows you to process your whole training set without an explicit for loop. That's why we take our training examples and stack them into one huge matrix, capital X: $X = [x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(m)}]$. And similarly for Y: $Y = [y^{(1)}, y^{(2)}, y^{(3)}, \dots, y^{(m)}]$. So the dimension of $X$ is $(n_x, m)$ and the dimension of $Y$ is $(1, m)$.

Vectorization allows you to process all m examples relatively quickly, but if m is very large it can still be slow. For example, what if m were 5 million, or 50 million, or even bigger? With gradient descent implemented on your whole training set, you have to process the entire training set before you take one little step of gradient descent. And then you have to process your entire training set of 5 million examples again before you take another little step. So it turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire, giant training set of 5 million examples.

In particular, here's what you can do. Let's say that you split up your training set into smaller, little baby training sets; these baby training sets are called mini-batches. And let's say each of them has just 1,000 examples. So you take $x^{(1)}$ through $x^{(1000)}$ and call that your first little baby training set, also called a mini-batch. Then you take the next 1,000 examples, $x^{(1001)}$ through $x^{(2000)}$, then the next 1,000 after that, and so on. I'm going to introduce a new notation: I'll call the first one X superscript 1 in curly braces, $X^{\{1\}}$, and the second one $X^{\{2\}}$. Now, if you have 5 million training examples in total and each of these mini-batches has a thousand examples, that means you have 5,000 of them, because 5,000 times 1,000 equals 5 million. So altogether you have 5,000 mini-batches, and the last one is $X^{\{5000\}}$.
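The split described above can be sketched in NumPy as follows. This is a minimal illustration, not code from the course: the helper name `split_into_mini_batches` and the small stand-in sizes (5,000 examples instead of 5 million) are assumptions made for the example, and the examples are taken in their original order, exactly as in the lecture's description.

```python
import numpy as np

def split_into_mini_batches(X, Y, batch_size=1000):
    """Split X of shape (n_x, m) and Y of shape (1, m) column-wise into consecutive mini-batches."""
    m = X.shape[1]
    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size  # slicing past m is safe; the last batch is simply smaller
        mini_batches.append((X[:, start:end], Y[:, start:end]))
    return mini_batches

# Hypothetical small stand-in for the lecture's 5,000,000 examples.
n_x, m = 3, 5000
X = np.arange(n_x * m, dtype=float).reshape(n_x, m)
Y = np.zeros((1, m))
batches = split_into_mini_batches(X, Y, batch_size=1000)

X_1, Y_1 = batches[0]  # X^{1}, Y^{1} in the lecture's 1-based notation
print(len(batches), X_1.shape, Y_1.shape)
```

Note that Python lists are 0-indexed, so `batches[0]` plays the role of $X^{\{1\}}$, $Y^{\{1\}}$ and `batches[t - 1]` plays the role of $X^{\{t\}}$, $Y^{\{t\}}$.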

Then, similarly, you do the same thing for Y: you split up the training data for Y accordingly. So $y^{(1)}$ through $y^{(1000)}$ is called $Y^{\{1\}}$, then $y^{(1001)}$ through $y^{(2000)}$ is called $Y^{\{2\}}$, and so on until you have $Y^{\{5000\}}$. So mini-batch number t is comprised of $X^{\{t\}}$ and $Y^{\{t\}}$; that is a thousand training examples with the corresponding input-output pairs.

Before moving on, just to make sure my notation is clear: we have previously used superscript round brackets i to index into the training set, so $x^{(i)}$ is the i-th training example. We use superscript square brackets l to index into the different layers of the neural network, so $z^{[l]}$ is the z value of layer l of the neural network. And here we are introducing curly brackets t to index into the different mini-batches, so you have $X^{\{t\}}$, $Y^{\{t\}}$. To check your understanding of these: what are the dimensions of $X^{\{t\}}$ and $Y^{\{t\}}$? Well, $X$ is $(n_x, m)$. So if $X^{\{1\}}$ holds the x values for a thousand training examples, its dimension should be $(n_x, 1000)$, and $X^{\{2\}}$ should also be $(n_x, 1000)$, and so on. So all of the $X^{\{t\}}$ have dimension $(n_x, 1000)$, and all of the $Y^{\{t\}}$ have dimension $(1, 1000)$.

To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we have been talking about previously, where you process your entire training set all at the same time. The name comes from viewing that as processing your entire batch of training examples all at once. I know it's not a great name, but that's just what it's called. Mini-batch gradient descent, in contrast, refers to the algorithm we'll talk about on the next slide, in which you process a single mini-batch $X^{\{t\}}$, $Y^{\{t\}}$ at a time, rather than processing your entire training set $X$, $Y$ at the same time.
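As a compact summary of the notation above, the mini-batches can be written out explicitly. The general index expression in terms of $t$ is not stated in the lecture; it is inferred here from the two cases given ($t = 1$ covers examples 1 to 1000, $t = 2$ covers 1001 to 2000), and it assumes a fixed mini-batch size of 1,000.

```latex
X^{\{t\}} = \left[\, x^{(1000(t-1)+1)},\; x^{(1000(t-1)+2)},\; \dots,\; x^{(1000t)} \,\right] \in \mathbb{R}^{n_x \times 1000},
\qquad
Y^{\{t\}} = \left[\, y^{(1000(t-1)+1)},\; \dots,\; y^{(1000t)} \,\right] \in \mathbb{R}^{1 \times 1000},
\qquad t = 1, \dots, 5000.
```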

So, let's see how mini-batch gradient descent works. To run mini-batch gradient descent on your training set, you run for t equals 1 to 5,000, because we had 5,000 mini-batches of size 1,000 each. What you are going to do inside the for loop is basically implement one step of gradient descent using $X^{\{t\}}$, $Y^{\{t\}}$. It is as if you had a training set of 1,000 examples and were to implement the algorithm you are already familiar with, but just on this little training set of size m equals 1,000. Rather than having an explicit for loop over all 1,000 examples, you would use vectorization to process all 1,000 examples sort of all at the same time.

Let us write this out. First you implement forward prop on the inputs, so just on $X^{\{t\}}$, and you do that by implementing $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$. Previously, we would just have $X$ there, right? But now you are not processing the entire training set, you are just processing one mini-batch, so it becomes $X^{\{t\}}$. Then you have $A^{[1]} = g^{[1]}(Z^{[1]})$, with a capital $Z$ since this is actually a vectorized implementation, and so on until you end up with $A^{[L]} = g^{[L]}(Z^{[L]})$, and this is your prediction. Notice that here you should use a vectorized implementation; it's just that this vectorized implementation processes 1,000 examples at a time rather than 5 million.

Next you compute the cost function $J$, which I'm going to write with 1 over 1,000, since 1,000 is the size of your little training set, times the sum from i equals 1 through 1,000 of the loss: $\frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$. For clarity, this notation refers to examples from the mini-batch $X^{\{t\}}$, $Y^{\{t\}}$. And if you're using regularization, you can also have the regularization term, with the 2 moved to the denominator: $\frac{\lambda}{2 \cdot 1000} \sum_{l} \lVert W^{[l]} \rVert_F^2$, that is, the sum over the layers l of the squared Frobenius norm of the weight matrix. Because this is really the cost on just one mini-batch, I'm going to index the cost $J$ with a superscript t in curly braces, $J^{\{t\}}$.
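The mini-batch cost $J^{\{t\}}$ described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions: the lecture does not name the loss at this point, so the binary cross-entropy loss is assumed here, and the function name `mini_batch_cost`, the argument names, and the small `eps` guard inside the logarithms are choices made for this example only.

```python
import numpy as np

def mini_batch_cost(A_L, Y_t, weights, lambd=0.0):
    """Cost J^{t} on one mini-batch: mean loss plus an optional L2 (Frobenius) penalty.

    A_L     : predictions for the mini-batch, shape (1, m_t)
    Y_t     : labels for the mini-batch, shape (1, m_t)
    weights : list of weight matrices W^[l], one per layer
    lambd   : regularization strength (0 disables the penalty)
    """
    m_t = Y_t.shape[1]  # mini-batch size, 1,000 in the lecture
    eps = 1e-12         # assumption: small guard that keeps log() away from zero
    # Assumed loss: binary cross-entropy, evaluated per example.
    losses = -(Y_t * np.log(A_L + eps) + (1 - Y_t) * np.log(1 - A_L + eps))
    data_cost = np.sum(losses) / m_t
    # lambda / (2 * m_t) times the sum of squared Frobenius norms of the weights.
    l2_cost = (lambd / (2 * m_t)) * sum(np.sum(W ** 2) for W in weights)
    return data_cost + l2_cost

# Tiny hypothetical example: 4 examples, every prediction equal to 0.5.
A_L = np.full((1, 4), 0.5)
Y_t = np.array([[1.0, 0.0, 1.0, 0.0]])
cost = mini_batch_cost(A_L, Y_t, weights=[np.ones((2, 2))], lambd=4.0)
print(cost)
```

The only difference from the full-batch cost is the divisor: both terms are normalized by the mini-batch size rather than by the full training-set size m.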

You'll notice that everything we are doing is exactly the same as when we were previously implementing gradient descent, except that instead of doing it on $X$, $Y$, you're now doing it on $X^{\{t\}}$, $Y^{\{t\}}$. Next, you implement backprop to compute the gradients with respect to $J^{\{t\}}$, still using only $X^{\{t\}}$, $Y^{\{t\}}$, and then you update the weights: each $W^{[l]}$ gets updated as $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$, and similarly $b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$.

So this is one pass through your training set using mini-batch gradient descent. The code I have written down here is also called doing one epoch of training, where epoch is a word that means a single pass through the training set. With batch gradient descent, a single pass through the training set allows you to take only one gradient descent step. With mini-batch gradient descent, a single pass through the training set, that is, one epoch, allows you to take 5,000 gradient descent steps. Now, of course, you usually want to take multiple passes through the training set, so you might want another for loop or while loop around this one. You keep taking passes through the training set until, hopefully, you converge, or approximately converge.
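Putting the steps above together, one way to sketch the two nested loops (epochs outside, mini-batches inside) is shown below. To keep the example short it uses a single sigmoid unit rather than a deep network, and it assumes the cross-entropy loss, no regularization, and a fixed number of epochs instead of a convergence test; the function name `train_mini_batch`, the synthetic data, and the hyperparameter values are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mini_batch(X, Y, batch_size=1000, num_epochs=5, alpha=0.1):
    """Mini-batch gradient descent for a single sigmoid unit (illustrative model only)."""
    n_x, m = X.shape
    W = np.zeros((1, n_x))
    b = 0.0
    steps = 0
    for epoch in range(num_epochs):            # outer loop: passes through the training set
        for start in range(0, m, batch_size):  # inner loop: t = 1 ... number of mini-batches
            X_t = X[:, start:start + batch_size]
            Y_t = Y[:, start:start + batch_size]
            m_t = X_t.shape[1]
            A = sigmoid(W @ X_t + b)           # forward prop on X^{t} only
            dZ = A - Y_t                       # backprop for sigmoid with cross-entropy loss
            dW = (dZ @ X_t.T) / m_t
            db = np.sum(dZ) / m_t
            W = W - alpha * dW                 # one gradient descent step per mini-batch
            b = b - alpha * db
            steps += 1
    return W, b, steps

# Hypothetical synthetic data: 5,000 examples, label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5000))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)
W, b, steps = train_mini_batch(X, Y, batch_size=1000, num_epochs=3)
print(steps)  # number of gradient steps = epochs * mini-batches per epoch
```

With 5 mini-batches per epoch and 3 epochs, the loop takes 15 gradient steps, whereas batch gradient descent would take only 3 steps over the same number of passes through the data.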

When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and it is pretty much what everyone in deep learning uses when training on a large data set. In the next video, let's delve deeper into mini-batch gradient descent so you can get a better understanding of what it is doing and why it works so well.