Machine Learning Study Notes 1.1.2.1.5: Gradient descent for multiple linear regression

You've learned about gradient descent, about multiple linear regression, and also about vectorization. Let's put it all together to implement gradient descent for multiple linear regression with vectorization. This would be cool. Let's quickly review what multiple linear regression looks like. Using our previous notation, let's see how you can write it more succinctly using vector notation. We have parameters w_1 to w_n as well as b. But instead of thinking of w_1 to w_n as separate numbers, that is, separate parameters, let's start to collect all of the w's into a vector w, so that now w is a vector of length n.

We're just going to think of the parameters of this model as a vector w, as well as b, where b is still a number, same as before. Whereas before we defined multiple linear regression like this, now, using vector notation, we can write the model as f_w,b of x equals the vector w dot product with the vector x, plus b. Remember that this dot here means dot product. Our cost function used to be defined as J of w_1 through w_n, b. But instead of thinking of J as a function of these n different parameters w_j as well as b, we're going to write J as a function of the parameter vector w and the number b. So w_1 through w_n is replaced by the vector w, and J now takes as input a vector w and a number b and returns a number. Here's what gradient descent looks like. We're going to repeatedly update each parameter w_j to be w_j minus alpha times the derivative of the cost J, where J has parameters w_1 through w_n and b. Once again, we just write this as J of vector w and number b.
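To make this concrete, here is a minimal NumPy sketch of the vectorized model and cost described above. The names predict, compute_cost, X, y, w, and b are illustrative choices, not the exact names used in the course's optional lab.

```python
import numpy as np

def predict(x, w, b):
    """Prediction for one example: f_w,b(x) = w · x + b."""
    return np.dot(w, x) + b

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b): (1 / 2m) times the sum of squared errors."""
    m = X.shape[0]
    errors = X @ w + b - y          # vector of (f_w,b(x^(i)) - y^(i)) for all examples
    return np.sum(errors ** 2) / (2 * m)
```

Note that X @ w computes all m dot products at once, which is the vectorization payoff mentioned above.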

Let's see what this looks like when you implement gradient descent, and in particular, let's take a look at the derivative term. We'll see that gradient descent becomes just a little bit different with multiple features compared to just one feature. Here's what we had for gradient descent with one feature: an update rule for w and a separate update rule for b. Hopefully, these look familiar to you. This term here is the derivative of the cost function J with respect to the parameter w. Similarly, we have an update rule for the parameter b. With univariate regression, we had only one feature, which we called x^(i), without any subscript. Now, here's the new notation for when we have n features, where n is two or more. We get this update rule for gradient descent: update w_1 to be w_1 minus alpha times this expression here, where this formula is actually the derivative of the cost J with respect to w_1. The formula for the derivative of J with respect to w_1 on the right looks very similar to the case of one feature on the left. The error term is still the prediction f of x minus the target y.
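For reference, the update rules being described can be written out as follows, using the usual convention that m is the number of training examples and the superscript (i) indexes the i-th example:

```latex
% One feature (what we had before):
w = w - \alpha \,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
\qquad
b = b - \alpha \,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)

% n features, repeated for j = 1, \dots, n:
w_j = w_j - \alpha \,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}
\qquad
b = b - \alpha \,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\bigr)
```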

One difference is that w and x are now vectors, and just as w on the left has now become w_1 here on the right, x^(i) here on the left is now instead x^(i)_1 here on the right, and this is just for j equals 1. For multiple linear regression, we have j ranging from 1 through n, and so we'll update the parameters w_1, w_2, all the way up to w_n, and then, as before, we'll update b. If you implement this, you get gradient descent for multiple linear regression. That's it for gradient descent for multiple regression. Before moving on from this video, I want to make a quick aside, or a quick side note, on an alternative way of finding w and b for linear regression.
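Here is a minimal NumPy sketch of these simultaneous updates, assuming X is an m-by-n matrix of features, y is a length-m vector of targets, and alpha is the learning rate; the function names are illustrative, not the lab's.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Partial derivatives of the cost J(w, b) with respect to w (length-n vector) and b."""
    m = X.shape[0]
    errors = X @ w + b - y               # (f_w,b(x^(i)) - y^(i)) for every example
    dj_dw = X.T @ errors / m             # one partial derivative per feature j
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

def gradient_descent_step(X, y, w, b, alpha):
    """Simultaneously update all w_j and b, as described above."""
    dj_dw, dj_db = compute_gradient(X, y, w, b)
    return w - alpha * dj_dw, b - alpha * dj_db
```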

This method is called the normal equation. Whereas gradient descent is a great method for minimizing the cost function J to find w and b, there is one other algorithm for solving for w and b that works only for linear regression, and for pretty much none of the other algorithms you'll see in this specialization, and this other method does not need an iterative gradient descent algorithm. Called the normal equation method, it turns out to be possible to use an advanced linear algebra library to just solve for w and b all in one go, without iterations. The normal equation method has some disadvantages: first, unlike gradient descent, it does not generalize to other learning algorithms, such as the logistic regression algorithm that you'll learn about next week, or the neural networks or other algorithms you'll see later in this specialization.
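Purely as an illustration of the idea, here is one way a linear algebra routine could solve for w and b in one shot, using NumPy's least-squares solver. The lecture doesn't say which routine any particular library uses, so treat this as a sketch of the concept rather than what a real library does.

```python
import numpy as np

def normal_equation_fit(X, y):
    """Solve for w and b in one shot (no iterations) via least squares.

    Appends a column of ones to X so the intercept b is learned
    along with the weights w.
    """
    m = X.shape[0]
    X_aug = np.hstack([X, np.ones((m, 1))])            # last column models b
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # least-squares solution
    w, b = theta[:-1], theta[-1]
    return w, b
```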

The normal equation method is also quite slow if the number of features n is large. Almost no machine learning practitioners should implement the normal equation method themselves, but if you're using a mature machine learning library and call linear regression, there is a chance that on the back end it'll be using this to solve for w and b. If you're ever in a job interview and hear the term normal equation, that's what it refers to. Don't worry about the details of how the normal equation works. Just be aware that some machine learning libraries may use this complicated method in the back end to solve for w and b. But for most learning algorithms, including when you implement linear regression yourself, gradient descent offers a better way to get the job done. In the optional lab that follows this video, you'll see how to define a multiple linear regression model in code and also how to calculate the prediction f of x. You'll also see how to calculate the cost and implement gradient descent for a multiple linear regression model. This will be using Python's NumPy library. If any of the code looks very new, that's okay, but you should also feel free to take a look at the previous optional lab that introduces NumPy and vectorization for a refresher on NumPy functions and how to implement them in code.
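As a rough, self-contained preview of what that kind of code can look like, here is a small end-to-end gradient descent loop. The dataset and hyperparameters are made up for illustration and are not from the lab.

```python
import numpy as np

# Tiny made-up dataset (illustrative only): 4 examples, 2 features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])

w, b = np.zeros(X.shape[1]), 0.0
alpha, num_iters = 0.01, 1000          # learning rate and number of iterations
m = X.shape[0]

for step in range(num_iters):
    errors = X @ w + b - y             # prediction minus target for each example
    w -= alpha * (X.T @ errors) / m    # update every w_j simultaneously
    b -= alpha * np.sum(errors) / m
    if step % 200 == 0:
        cost = np.sum(errors ** 2) / (2 * m)
        print(f"step {step:4d}  cost {cost:.4f}")

print("learned w:", w, " b:", b)
```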

That's it. You now know multiple linear regression. This is probably the single most widely used learning algorithm in the world today. But there's more. With just a few tricks, such as picking and scaling features appropriately and also choosing the learning rate alpha appropriately, you can really make this work much better. Just a few more videos to go for this week. Let's go on to the next video to see those little tricks that will help you make multiple linear regression work much better.
