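This series adds my personal study notes and supplementary derivations to selected points of the original course. If you find any errors, corrections are welcome. After studying Andrew Ng's course, I compiled it into text to make review and lookup easier. Because I have been studying English, the series is written mainly in English, and I suggest readers also rely mainly on the English, to lay the groundwork for reading academic papers in related fields later on. - ZJ
Please credit the author and source when reposting: ZJ, WeChat public account 「SelfImprovementLab」
Zhihu: https://zhuanlan.zhihu.com/c_147249273
CSDN: http://blog.csdn.net/junjun_zhao/article/details/79096566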
2.1 Mini-batch Gradient Descent
(Subtitle source: NetEase Cloud Classroom)
Hello, and welcome back. This week, you'll learn about optimization algorithms that will enable you to train your neural networks much faster. You've heard me say before that applying machine learning is a highly empirical, highly iterative process, in which you have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly. One thing that makes this more difficult is that deep learning tends to work best in the regime of big data: we are able to train neural networks on a huge data set, and training on a large data set is just slow. So what you'll find is that having fast, good optimization algorithms can really improve the efficiency of you and your team.
So, let's get started by talking about mini-batch gradient descent. You've learned previously that vectorization allows you to compute efficiently on all m examples; it lets you process your whole training set without an explicit for loop. That's why we take our training examples and stack them into a huge matrix, capital X = [x(1), x(2), x(3), ..., x(m)]. And similarly for Y = [y(1), y(2), y(3), ..., y(m)]. So the dimension of X is n_x by m, and the dimension of Y is 1 by m. Vectorization allows you to process all m examples relatively quickly, but if m is very large it can still be slow. For example, what if m were 5 million, or 50 million, or even bigger? With gradient descent implemented on your whole training set, you have to process the entire training set before you take one little step of gradient descent. And then you have to process your entire training set of 5 million examples again before you take another little step of gradient descent. It turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire, giant training set of 5 million examples.
In particular, here's what you can do. Let's say that you split up your training set into smaller, little baby training sets; these baby training sets are called mini-batches. And let's say each of your baby training sets has just 1,000 examples. So you take x(1) through x(1000) and call that your first little baby training set, also called a mini-batch. Then you take the next 1,000 examples, x(1001) through x(2000), then the next 1,000 examples, and so on. I'm going to introduce a new notation: I'll call the first one X with a superscript 1 in curly braces, X{1}, and the second one X{2}. Now, if you have 5 million training examples in total and each mini-batch has 1,000 examples, that means you have 5,000 of them, because 5,000 times 1,000 equals 5 million. So altogether you have 5,000 mini-batches, ending with X{5000}.
Similarly, you do the same thing for Y, splitting up your training data for Y accordingly. So y(1) through y(1000) is called Y{1}, y(1001) through y(2000) is called Y{2}, and so on until you have Y{5000}. Mini-batch number t is then comprised of X{t} and Y{t}: 1,000 training examples with the corresponding input-output pairs.
Before moving on, just to make sure my notation is clear: we have previously used superscript round brackets (i) to index into the training set, so x(i) is the i-th training example. We use superscript square brackets [l] to index into the different layers of the neural network, so z[l] is the z value of the l-th layer. And here we are introducing curly brackets {t} to index into different mini-batches, so you have X{t}, Y{t}. To check your understanding: what are the dimensions of X{t} and Y{t}? Well, X is n_x by m. So if X{1} holds the x values for 1,000 examples, its dimension should be n_x by 1,000, X{2} should also be n_x by 1,000, and so on. So every X{t} has dimension n_x by 1,000, and every Y{t} has dimension 1 by 1,000.
To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we have been talking about previously, where you process your entire training set all at the same time. The name comes from viewing that as processing your entire batch of training examples all at once. I know it's not a great name, but that's just what it's called. Mini-batch gradient descent, in contrast, refers to the algorithm we'll talk about on the next slide, in which you process a single mini-batch X{t}, Y{t} at a time, rather than processing your entire training set X, Y at the same time.
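
The splitting and the resulting shapes can be checked with a small NumPy sketch. The helper name `partition_mini_batches` and the scaled-down sizes (m = 5,000 instead of 5 million, n_x = 4) are illustrative assumptions, not part of the course material. The sketch also takes consecutive columns without shuffling, as in the description above.

```python
import numpy as np

def partition_mini_batches(X, Y, batch_size=1000):
    """Split X (n_x, m) and Y (1, m) column-wise into consecutive mini-batches.

    Hypothetical helper for illustration: returns a list of (X_t, Y_t) pairs.
    """
    m = X.shape[1]
    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size
        # Columns are examples, so slicing columns selects a block of examples.
        mini_batches.append((X[:, start:end], Y[:, start:end]))
    return mini_batches

# Small stand-in data set (assumed sizes; the lecture uses m = 5,000,000).
rng = np.random.default_rng(0)
n_x, m = 4, 5000
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)

batches = partition_mini_batches(X, Y, batch_size=1000)
print(len(batches))          # 5 mini-batches
print(batches[0][0].shape)   # (4, 1000), i.e. n_x by 1,000
print(batches[0][1].shape)   # (1, 1000)
```

With the lecture's numbers (m = 5,000,000 and a batch size of 1,000), the same slicing would give 5,000 mini-batches.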
So, let's see how mini-batch gradient descent works. To run mini-batch gradient descent on your training set, you run for t = 1 to 5,000, because we had 5,000 mini-batches of size 1,000 each. What you do inside the for loop is basically implement one step of gradient descent using X{t}, Y{t}. It is as if you had a training set of just 1,000 examples and you were to implement the algorithm you are already familiar with, but just on this little training set of size m = 1,000. Rather than having an explicit for loop over all 1,000 examples, you use vectorization to process all 1,000 examples at the same time.
Let us write this out. First you implement forward prop on the inputs, so just on X{t}, and you do that by computing Z[1] = W[1] X{t} + b[1]. Previously, we would just have X there, right? But now you are not processing the entire training set, you are just processing one mini-batch, so it becomes X{t}. Then you have A[1] = g[1](Z[1]), with a capital Z since this is a vectorized implementation, and so on until you end up with A[L] = g[L](Z[L]), which is your prediction. Notice that you should use a vectorized implementation here; it's just that this vectorized implementation processes 1,000 examples at a time rather than 5 million examples.
Next you compute the cost function J, which I'm going to write as 1/1000, since 1,000 is the size of your little training set, times the sum from i = 1 through 1,000 of the loss L(ŷ(i), y(i)), where this notation refers to examples from the mini-batch X{t}, Y{t}. If you're using regularization, you can also add the regularization term: λ/(2 · 1000) times the sum over l of the squared Frobenius norm of W[l]. Because this is really the cost on just one mini-batch, I'm going to index it as J{t}, with a superscript t in curly braces.
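
For reference, the forward pass and the per-mini-batch cost described above can be written out as follows. This follows the usual course conventions: L denotes the last layer, λ is the regularization parameter, and the mini-batch size is 1,000 as in the running example.

```latex
\begin{aligned}
Z^{[1]} &= W^{[1]} X^{\{t\}} + b^{[1]} \\
A^{[1]} &= g^{[1]}\!\left(Z^{[1]}\right) \\
&\;\;\vdots \\
A^{[L]} &= g^{[L]}\!\left(Z^{[L]}\right) \\[4pt]
J^{\{t\}} &= \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\!\left(\hat{y}^{(i)}, y^{(i)}\right)
           + \frac{\lambda}{2 \cdot 1000} \sum_{l} \left\lVert W^{[l]} \right\rVert_F^2
\end{aligned}
```

Here the examples indexed by i are those of mini-batch t, not the first 1,000 examples of the full training set.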
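You'll notice that everything we are doing is exactly the same as when we were previously implementing gradient descent, except that instead of doing it on X, Y, you're now doing it on X{t}, Y{t}. Next, you implement backprop to compute the gradients with respect to J{t}, still using only X{t}, Y{t}, and then you update the weights: W, really W[l], gets updated as W[l] := W[l] - α dW[l], and similarly for b[l]. This is one pass through your training set using mini-batch gradient descent. The code I have written down here is also called doing one epoch of training, where an epoch means a single pass through the training set. With batch gradient descent, a single pass through the training set allows you to take only one gradient descent step. With mini-batch gradient descent, a single pass through the training set, that is, one epoch, allows you to take 5,000 gradient descent steps. Of course, you usually want to take multiple passes through the training set, so you wrap this in another for loop or while loop and keep taking passes through the training set until you converge to a satisfactory accuracy.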
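
The loop structure (an outer loop over epochs, an inner loop over mini-batches, one parameter update per mini-batch) can be sketched in NumPy. To keep it short, this sketch makes several assumptions that are not from the lecture: it uses a single sigmoid unit (logistic regression) instead of an L-layer network, omits the regularization term, and runs on a small synthetic data set. The function name and hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mini_batch_gd(X, Y, batch_size=1000, num_epochs=10, alpha=0.1):
    """Mini-batch gradient descent for a single sigmoid unit (illustrative only)."""
    n_x, m = X.shape
    W = np.zeros((1, n_x))
    b = 0.0
    costs = []  # one entry per gradient step, i.e. per mini-batch
    for epoch in range(num_epochs):              # one epoch = one pass over the data
        for start in range(0, m, batch_size):    # t = 1 .. number of mini-batches
            X_t = X[:, start:start + batch_size]
            Y_t = Y[:, start:start + batch_size]
            m_t = X_t.shape[1]

            # Forward prop on the mini-batch only (vectorized over its examples).
            A = sigmoid(W @ X_t + b)
            eps = 1e-12  # guards against log(0)
            cost = -np.mean(Y_t * np.log(A + eps) + (1 - Y_t) * np.log(1 - A + eps))

            # Backprop: gradients of the mini-batch cost J{t}.
            dZ = A - Y_t
            dW = (dZ @ X_t.T) / m_t
            db = np.sum(dZ) / m_t

            # One gradient descent step per mini-batch.
            W = W - alpha * dW
            b = b - alpha * db
            costs.append(cost)
    return W, b, costs

# Synthetic, linearly separable data (assumed for the demo).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5000))
true_w = np.array([[1.0, -2.0, 0.5, 1.5]])
Y = (true_w @ X > 0).astype(float)

W, b, costs = train_mini_batch_gd(X, Y, batch_size=1000, num_epochs=10, alpha=0.1)
print(len(costs))  # 50 gradient steps: 10 epochs x 5 mini-batches
```

With batch gradient descent, the same 10 passes over the data would produce only 10 gradient steps. This is the contrast the paragraph above draws between the two algorithms.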
When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and that's pretty much what everyone in deep learning uses when training on a large data set. In the next video, let's delve deeper into mini-batch gradient descent so you can get a better understanding of what it is doing and why it works so well.
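Key takeaways:
When running gradient descent on the whole training set, we must process the entire training set before taking a single gradient descent step; that is, every step requires one full pass over the training data. When the training set is very large, for example 5 million or 50 million examples, this is slow.
If we instead take a gradient descent step after processing only part of the training data, the algorithm makes progress much faster. Each such small subset of the training set is called a mini-batch.
With ordinary (batch) gradient descent, one epoch allows only one gradient descent step; with mini-batch gradient descent, one epoch allows as many gradient descent steps as there are mini-batches.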
References:
[1]. 大树先生. Andrew Ng Coursera Deep Learning Course (DeepLearning.ai) Condensed Notes (2-2): Optimization Algorithms