Coursera | Andrew Ng (01-week-2-2.4): Gradient Descent

This series only adds personal study notes and supplementary derivations on top of the original course material; corrections and feedback are welcome. After working through Andrew Ng's course, I organized it into text to make review and lookup easier. Since I have been studying English, the series is primarily in English, and readers are encouraged to rely on the English with the Chinese as support, as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom


Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/JUNJUN_ZHAO/article/details/78864516


Gradient Descent

(Subtitle source: NetEase Cloud Classroom)

You've seen the logistic regression model. You've seen the loss function, which measures how well you're doing on a single training example. You've also seen the cost function, which measures how well your parameters w and b are doing on your entire training set.

Key points:

1. Loss function: measures how well the model does on a single training example.
2. Cost function: measures how well the parameters w and b do on the entire training set.

Now let's talk about how you can use the gradient descent algorithm to train, or learn, the parameters w and b on your training set. To recap, here is the familiar logistic regression algorithm, and on the second line we have the cost function J, which is a function of your parameters w and b. It's defined as an average: 1/m times the sum of the loss function. The loss function measures how well your algorithm's output y-hat^(i) on each training example stacks up against the ground-truth label y^(i) for that example, and the full formula is expanded out on the right. So the cost function measures how well your parameters w and b are doing on the training set.


Key points:

1. Logistic regression:

$$\hat{y} = \sigma(w^T x + b), \quad \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = w^T x + b$$

2. Cost function (sketched in code below):

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]$$
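
To make these two formulas concrete, here is a minimal NumPy sketch; the variable names `X`, `Y`, `w`, `b` and the toy data are illustrative, not from the course:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """J(w, b): average loss over the m training examples.
    X has shape (n_x, m), one column per example; Y has shape (1, m) with labels in {0, 1}."""
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)  # y-hat^(i) for every example
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return np.sum(losses) / m

# toy example: 2 features, 3 training examples
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
b = 0.0
print(cost(w, b, X, Y))  # log(2) ~= 0.693 when w = 0 and b = 0, since every y-hat is 0.5
```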

So in order to learn the set of parameters w and b, it seems natural that we want to find the w and b that make the cost function J(w, b) as small as possible. Here's an illustration of gradient descent. In this diagram the horizontal axes represent your spatial parameters w and b. In practice w can be much higher-dimensional, but for the purposes of plotting, let's illustrate w as a single real number, and b as a single real number.

Goal: find (learn through training) w and b that minimize the cost function J(w, b).


The cost function J(w, b) is then some surface above these horizontal axes w and b, so the height of the surface represents the value of J(w, b) at a certain point. What we want to do is find the values of w and b that correspond to the minimum of the cost function J. It turns out that this cost function J is a convex function: it's just a single big bowl. This is in contrast to functions that look like this, which are non-convex and have lots of different local optima.

Convex function: one global optimum (this is what we want).
Non-convex function: many local optima.
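
As a quick numerical sanity check of the convexity claim (a sketch using randomly generated data, not part of the course), you can verify that J evaluated on the segment between two parameter settings never rises above the straight-line interpolation of its endpoint values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))                 # 2 features, 100 examples (made up)
Y = (rng.uniform(size=(1, 100)) < 0.5) * 1    # random 0/1 labels

def J(theta):
    """Logistic regression cost; theta packs (w1, w2, b)."""
    w, b = theta[:2].reshape(2, 1), theta[2]
    Y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))
    return float(np.mean(-(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))))

theta1, theta2 = rng.normal(size=3), rng.normal(size=3)
for t in np.linspace(0, 1, 11):
    mix = t * theta1 + (1 - t) * theta2
    # convexity: J on the segment never exceeds the chord between the endpoints
    assert J(mix) <= t * J(theta1) + (1 - t) * J(theta2) + 1e-12
```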

The fact that our cost function J(w, b) as defined here is convex is one of the main reasons we use this particular cost function J for logistic regression. To find a good value for the parameters, we initialize w and b to some initial value, maybe denoted by that little red dot. For logistic regression almost any initialization method works; usually you initialize the values to zero. Random initialization also works, but people don't usually do that for logistic regression.
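
In code, the zero initialization mentioned here is simply the following (a small sketch; `n_x` stands for the number of input features):

```python
import numpy as np

n_x = 2                 # number of input features (illustrative value)
w = np.zeros((n_x, 1))  # initialize the weight vector w to zeros
b = 0.0                 # initialize the bias b to zero
```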

Because this function is convex, no matter where you initialize, you should get to the same point or roughly the same point. What gradient descent does is start at that initial point and then take a step in the steepest downhill direction. So after one step of gradient descent you might end up over there, because it's trying to take a step downhill in the direction of steepest descent, or as quickly downhill as possible. That's one iteration of gradient descent.

After two iterations of gradient descent you might step there, then three iterations and so on (I guess this part is now hidden by the back of the plot), until eventually, hopefully, you converge to this global optimum or get to something close to it. So this picture illustrates the gradient descent algorithm. Let's write out a bit more of the details. For the purpose of illustration, say there's some function J(w) that you want to minimize, and maybe it looks like this. To make it easier to draw, I'm going to ignore b for now, so that it's a one-dimensional plot instead of a higher-dimensional one. Gradient descent does the following: we repeatedly carry out this update, taking the value of w and updating it, using := to denote updating w.

Key point:

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$


We set w to w minus α times the derivative dJ(w)/dw, and we repeat that until the algorithm converges. A couple of points on the notation: α here is the learning rate, and it controls how big a step we take on each iteration of gradient descent. We'll talk later about some ways of choosing the learning rate α.

Second, this quantity here is a derivative; it is basically the update, or the change, you want to make to the parameter w. When we start to write code to implement gradient descent, we're going to use the convention that the variable name `dw` in our code represents this derivative term. So when you write code, you write something like `w := w - alpha * dw`.
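
Putting the update rule, the learning rate α, and the `dw` naming convention together, here is a minimal one-dimensional sketch; the quadratic J(w) = (w - 3)^2 is just an illustrative stand-in, chosen because its derivative dJ/dw = 2(w - 3) can be written down by hand:

```python
# minimize J(w) = (w - 3)^2 with gradient descent
w = 10.0         # some initial value for w
alpha = 0.1      # learning rate: controls how big a step each iteration takes
for _ in range(100):
    dw = 2 * (w - 3)    # "dw" is the code name for the derivative dJ(w)/dw
    w = w - alpha * dw  # w := w - alpha * dw
print(w)  # approaches the minimum at w = 3
```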

So we use `dw` as the variable name for this derivative term. Now let's make sure that this gradient descent update makes sense. Say w is over here, so you're at this point on the cost function J(w). Remember that the derivative of a function at a point is the slope of the function there, and the slope is the height divided by the width of the little triangle tangent to J(w) at that point. Here the derivative is positive. w gets updated as w minus the learning rate times the derivative; since the derivative is positive, you end up subtracting from w and taking a step to the left. So if you started off with a large value of w, gradient descent will slowly decrease the parameter.


As another example, if w is over here, then at this point the slope dJ/dw is negative, and the gradient descent update subtracts α times a negative number, so it slowly increases w: w gets bigger and bigger with successive iterations of gradient descent. So hopefully, whether you initialize on the left or on the right, gradient descent will move you towards this global minimum. If you're not familiar with derivatives or calculus and what the term dJ(w)/dw means, don't worry too much about it; we'll talk more about derivatives in the next video. If you have a deep knowledge of calculus, you might be able to develop deeper intuitions about how neural networks work.
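
To see this sign argument with concrete numbers (again using the illustrative stand-in J(w) = (w - 3)^2, whose minimum is at w = 3):

```python
alpha = 0.1

# start to the right of the minimum: the slope is positive, so the update moves w left
w = 5.0
dw = 2 * (w - 3)       # dJ/dw = +4
print(w - alpha * dw)  # 4.6, a step toward the minimum

# start to the left of the minimum: the slope is negative, so the update moves w right
w = 1.0
dw = 2 * (w - 3)       # dJ/dw = -4
print(w - alpha * dw)  # 1.4, again a step toward the minimum
```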

But even if you're not that familiar with calculus, the next few videos will give you enough intuition about derivatives and calculus to use neural networks effectively. The overall intuition for now is that this term represents the slope of the function, and we want to know the slope at the current setting of the parameters, so that we can take a step of steepest descent and know which direction to step in to go downhill on the cost function J.

So far we wrote gradient descent for J(w), as if w were the only parameter. In logistic regression, your cost function is a function of both w and b, so in that case the inner loop of gradient descent, the thing you have to repeat, becomes the following: you update w as w := w - α dJ(w, b)/dw, and you update b as b := b - α dJ(w, b)/db. These two equations at the bottom are the actual update you implement. As an aside, I just want to mention one notational convention in calculus that is a bit confusing to some people.

The update rules for each iteration (sketched in code below):

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$

$$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$
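
A minimal sketch of this inner loop for logistic regression, using the `dw`/`db` naming convention discussed just below. The vectorized gradient formulas used here, dw = (1/m) X (y_hat - y)^T and db = (1/m) sum(y_hat - y), are only derived in later videos of this week, so treat this as a preview rather than something established in this video:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.1, num_iterations=1000):
    """X: (n_x, m) training inputs; Y: (1, m) labels in {0, 1}."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))  # zero initialization, as discussed above
    b = 0.0
    for _ in range(num_iterations):
        Y_hat = sigmoid(np.dot(w.T, X) + b)  # forward pass: y-hat for every example
        dw = np.dot(X, (Y_hat - Y).T) / m    # dJ(w, b)/dw (derivation comes in a later video)
        db = np.sum(Y_hat - Y) / m           # dJ(w, b)/db
        w = w - alpha * dw                   # w := w - alpha * dw
        b = b - alpha * db                   # b := b - alpha * db
    return w, b
```

After training, a prediction for a new input x is made by thresholding sigmoid(w.T @ x + b) at 0.5.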

I don't think it's super important that you understand calculus, but in case you see this I want to make sure you don't think too much of it. In calculus, this term here is actually written with this funny squiggle symbol, which is just a lowercase d in a fancy, stylized font. When you see this expression, all it means is the derivative of J(w, b), or really the slope of the function J(w, b): how much the function slopes in the w direction. The rule in calculus notation, which I think isn't totally logical and makes things more complicated than they need to be, is that if J is a function of two or more variables, then instead of the lowercase d you use this funny symbol, called the partial derivative symbol.

But don't worry about this: if J is a function of only one variable, you use the lowercase d. So the only difference between using this funny partial derivative symbol or the lowercase d, as we did on top, is whether J is a function of two or more variables:

1. If J has two or more variables, use the partial derivative symbol ∂.
2. If J has only one variable, use the lowercase d.

This is one of those funny notational rules in calculus that I think just make things more complicated than they need to be. But if you see this partial derivative symbol, all it means is that you're measuring the slope of the function with respect to one of its variables. Similarly, to use formally correct calculus notation here, because J has two inputs rather than one, the quantity at the bottom should be written with the partial derivative symbol, but it means essentially the same thing as the lowercase-d version. Finally, when you implement this in code, we use the convention that this quantity, the amount by which you update w, is denoted by the variable `dw` in your code.

And this quantity, the amount by which you update b, is denoted by the variable `db` in your code. All right, so that's how you can implement gradient descent. If you haven't seen calculus for a few years, I know this might seem like a lot more derivatives than you're comfortable with so far, but if you're feeling that way, don't worry about it. In the next video we'll give you better intuition about derivatives; even without a deep mathematical understanding of calculus, with just an intuitive understanding, you'll be able to make neural networks work effectively. So let's go on to the next video, where we'll talk a little bit more about derivatives.


Summary:

Gradient descent

Use the gradient descent algorithm to minimize the cost function and compute suitable values for w and b.

The update rules for each iteration:

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$

$$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$

In code, we usually use `dw` to denote $\frac{\partial J(w,b)}{\partial w}$ and `db` to denote $\frac{\partial J(w,b)}{\partial b}$.

Whether to use the partial derivative symbol ∂ or the lowercase d depends on whether the function J has two or more variables:

1. If J has two or more variables, use the partial derivative symbol ∂.
2. If J has only one variable, use the lowercase d.



PS: You are welcome to follow the WeChat official account "SelfImprovementLab", which focuses on deep learning, machine learning, and artificial intelligence, along with occasional group check-in activities for early rising, reading, exercise, English, and more.
