Machine Learning Study Notes - 1.1.1.6.2 Implementing Gradient Descent

Let's take a look at how you can actually implement the gradient descent algorithm. Let me write down the gradient descent algorithm. Here it is. On each step, the parameter w is updated to the old value of w minus Alpha times this term, the derivative d/dw of the cost function J(w, b). What this expression is saying is, update your parameter w by taking the current value of w and adjusting it a small amount, which is this expression on the right, minus Alpha times this term over here. If you feel like there's a lot going on in this equation, it's okay, don't worry about it. We'll unpack it together.
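For reference, here is the update rule just described, written out in math notation; this is only a restatement of the spoken formula, using the d/dw notation as it is said in the video:

```latex
w := w - \alpha \, \frac{d}{dw} J(w, b)
```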

First, this equals notation here. Since we're assigning w a value using this equals sign, in this context the equals sign is the assignment operator. Specifically, in this context, if you write code that says a equals c, it means take the value c and store it in your computer, in the variable a. Or if you write a equals a plus 1, it means set the value of a to be equal to a plus 1, or increment the value of a by one. The assignment operator in coding is different from a truth assertion in mathematics, where if I write a equals c, I'm asserting, that is, I'm claiming, that the values of a and c are equal to each other. Hopefully, I will never write the truth assertion a equals a plus 1, because that just can't possibly be true. In Python and in other programming languages, truth assertions are sometimes written as equals equals, so you may see a equals equals c if you're testing whether a is equal to c. But in math notation, as we conventionally use it, like in these videos, the equals sign can be used either for assignment or for a truth assertion. I'll try to make sure I'm clear, when I write an equals sign, whether we're assigning a value to a variable or asserting the truth of the equality of two values.
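As a small illustration of the difference (a minimal Python sketch, not part of the original lecture):

```python
a = 5        # assignment: store the value 5 in the variable a
c = 5        # assignment: store the value 5 in the variable c
a = c        # assignment: copy the value of c into a
a = a + 1    # assignment: increment a by one; a is now 6

print(a == c)  # equality test (truth assertion): prints False, since 6 != 5
```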

Now, let's dive more deeply into what the symbols in this equation mean. The symbol here is the Greek letter Alpha. In this equation, Alpha is also called the learning rate. The learning rate is usually a small positive number between 0 and 1, and it might be, say, 0.01. What Alpha does is basically control how big of a step you take downhill. If Alpha is very large, then that corresponds to a very aggressive gradient descent procedure where you're trying to take huge steps downhill. If Alpha is very small, then you'd be taking small baby steps downhill. We'll come back later to dive more deeply into how to choose a good learning rate Alpha.
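To get a feel for how the learning rate changes the step size, here is a minimal sketch of my own (a toy example, not from the video) that runs the w-update on the simple cost function J(w) = w², whose derivative is 2w, with a small and a large Alpha:

```python
# Toy cost function J(w) = w**2, whose derivative is dJ/dw = 2*w.
for alpha in (0.01, 0.9):          # small vs. large learning rate
    w = 3.0                        # same starting point for both runs
    for _ in range(5):
        w = w - alpha * (2 * w)    # the gradient descent update for w
    print(f"alpha={alpha}: w after 5 steps = {w:.4f}")
# A small alpha nudges w toward the minimum at w = 0 in tiny steps;
# a large alpha takes much bigger, more aggressive steps.
```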

Finally, this term here, that's the derivative term of the cost function J. Let's not worry about the details of this derivative right now; later on, you'll get to see more about the derivative term. For now, you can think of this derivative term that I drew a magenta box around as telling you in which direction you want to take your baby step. In combination with the learning rate Alpha, it also determines the size of the steps you want to take downhill.

Now, I do want to mention that derivatives come from calculus. Even if you aren't familiar with calculus, don't worry about it. Even without knowing any calculus, you'll be able to figure out all you need to know about this derivative term in this video and the next. One more thing. Remember your model has two parameters, not just w, but also b. You also have an assignment operation to update the parameter b that looks very similar. b is assigned the old value of b minus the learning rate Alpha times this slightly different derivative term, d/db of J(w, b). Remember in the graph of the surface plot where you were taking baby steps until you got to the bottom of the valley? Well, for the gradient descent algorithm, you're going to repeat these two update steps until the algorithm converges. By converges, I mean that you reach the point at a local minimum where the parameters w and b no longer change much with each additional step that you take.
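Putting the two updates together, the loop described here can be summarized as follows (again just the spoken formulas written out; "repeat until convergence" means keep applying both updates until w and b stop changing much):

```latex
\text{repeat until convergence:} \quad
\begin{aligned}
w &:= w - \alpha \, \frac{d}{dw} J(w, b) \\
b &:= b - \alpha \, \frac{d}{db} J(w, b)
\end{aligned}
```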

Now, there's one more subtle detail about how to correctly implement gradient descent. In gradient descent, you're going to update two parameters, w and b. This update takes place for both parameters, w and b. One important detail is that for gradient descent, you want to simultaneously update w and b, meaning you want to update both parameters at the same time. What I mean by that is that in this expression, you're going to update w from the old w to a new w, and you're also updating b from its old value to a new value of b. The way to implement this is to compute the right-hand sides, computing this thing for both w and b, and then simultaneously, at the same time, update w and b to the new values. Let's take a look at what this means. Here's the correct way to implement gradient descent, which does a simultaneous update. This sets a variable temp_w equal to that expression, which is w minus that term here. It also sets another variable temp_b to that, which is b minus that term. You compute both right-hand sides, both updates, and store them into the variables temp_w and temp_b. Then you copy the value of temp_w into w, and you also copy the value of temp_b into b. Now, one thing you may notice is that this value of w is from before w gets updated. Notice that the pre-update w is what goes into the derivative term over here. In contrast, here is an incorrect implementation of gradient descent that does not do a simultaneous update. In this incorrect implementation, we compute temp_w, same as before; so far that's okay. Now here's where things start to differ. We then update w with the value in temp_w before calculating the new value for the other parameter b.
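Here is a minimal Python sketch of the correct simultaneous update described above. The functions dJ_dw and dJ_db are stand-ins I've made up for illustration; in a real model they would compute the derivative terms, which the next video explains:

```python
# Stand-in derivative functions for illustration only; in a real model these
# would be the partial derivatives of the cost J(w, b) with respect to w and b.
def dJ_dw(w, b):
    return 2 * w   # toy example: pretend J(w, b) = w**2 + b**2

def dJ_db(w, b):
    return 2 * b

w, b = 3.0, -2.0   # arbitrary starting values
alpha = 0.01       # learning rate

# Correct: simultaneous update.
# Both right-hand sides are computed from the *pre-update* w and b,
# and only then are w and b overwritten.
temp_w = w - alpha * dJ_dw(w, b)
temp_b = b - alpha * dJ_db(w, b)
w = temp_w
b = temp_b
```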

Next, we calculate temp_b as b minus that term here, and finally, we update b with the value in temp_b. The difference between the right-hand side and the left-hand side implementations is that if you look over here, this w has already been updated to this new value, and this updated w is what actually goes into the cost function J(w, b). It means that this term here on the right is not the same as this term over here that you see on the left. That also means this temp_b term on the right is not quite the same as the temp_b term on the left, and thus this updated value for b on the right is not the same as this updated value for the variable b on the left. When gradient descent is implemented in code, it actually turns out to be more natural to implement it the correct way, with simultaneous updates. When you hear someone talk about gradient descent, they always mean gradient descent where you perform a simultaneous update of the parameters. If, however, you were to implement a non-simultaneous update, it turns out it will probably work more or less anyway. But doing it that way isn't really the correct way to implement it; it's actually some other algorithm with different properties. I would advise you to just stick to the correct simultaneous update and not use the incorrect version on the right.
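For contrast, here is a sketch of the incorrect, non-simultaneous version, continuing from the previous sketch (same hypothetical dJ_dw, dJ_db, w, b, and alpha). The bug is that w is overwritten before temp_b is computed, so the derivative for b sees the new w instead of the old one:

```python
# Continuing from the previous sketch (same dJ_dw, dJ_db, w, b, alpha).
# Incorrect: non-simultaneous update.
temp_w = w - alpha * dJ_dw(w, b)
w = temp_w                         # w is overwritten too early...
temp_b = b - alpha * dJ_db(w, b)   # ...so this derivative sees the *new* w
b = temp_b
```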

That's gradient descent. In the next video, we'll go into the details of the derivative term, which you saw in this video but that we didn't really talk about in detail. Derivatives are part of calculus, and again, if you're not familiar with calculus, don't worry about it. You don't need to know calculus at all in order to complete this course or this specialization, and you have all the information you need in order to implement gradient descent. Coming up in the next video, we'll go over derivatives together, and you'll come away with the intuition and knowledge you need to be able to implement and apply gradient descent yourself. I think that'll be an exciting thing for you to know how to implement. Let's go on to the next video to see how to do that.

