Machine Learning Study Notes 1.1.2.3.1: Feature Scaling (Part 1)

So welcome back. Let's take a look at some techniques that make gradient descent work much better. In this video you'll see a technique called feature scaling that will enable gradient descent to run much faster. Let's start by taking a look at the relationship between the size of a feature, that is, how big the numbers for that feature are, and the size of its associated parameter. As a concrete example, let's predict the price of a house using two features: x1, the size of the house, and x2, the number of bedrooms. Let's say that x1 typically ranges from 300 to 2,000 square feet, and x2 in the dataset ranges from 0 to 5 bedrooms. So for this example, x1 takes on a relatively large range of values and x2 takes on a relatively small range of values.

Now let's take an example of a house that has a size of 2,000 square feet, has five bedrooms, and a price of 500k, or $500,000. For this one training example, what do you think are reasonable values for the parameters w1 and w2? Well, let's look at one possible set of parameters. Say w1 is 50, w2 is 0.1, and b is 50, for the purposes of discussion. In this case the estimated price in thousands of dollars is 50 times 2,000, which is 100,000, plus 0.1 times 5, which is 0.5, plus 50, and that comes out to slightly over $100 million. That's clearly very far from the actual price of $500,000, so this is not a very good set of parameter choices for w1 and w2. Now let's take a look at another possibility: say w1 and w2 were the other way around, so w1 is 0.1, w2 is 50, and b is still 50. In this choice of w1 and w2, w1 is relatively small and w2 is relatively large; 50 is much bigger than 0.1. So here the predicted price is 0.1 times 2,000 plus 50 times 5 plus 50. The first term becomes 200k, the second term becomes 250k, and then plus 50k. So this version of the model predicts a price of $500,000, which is a much more reasonable estimate and happens to be the same as the true price of the house.
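As a quick sanity check on that arithmetic, here is a minimal sketch in Python/NumPy (not from the lecture): predict_price is just an illustrative helper implementing f(x) = w1·x1 + w2·x2 + b, with prices measured in thousands of dollars as in the example above.

```python
import numpy as np

def predict_price(w, b, x):
    """Linear model f(x) = w . x + b, with price in thousands of dollars."""
    return np.dot(w, x) + b

x = np.array([2000.0, 5.0])     # x1 = size in sq. ft., x2 = number of bedrooms

# First guess: large weight on the large-valued feature
w_bad = np.array([50.0, 0.1])
print(predict_price(w_bad, 50.0, x))   # 100050.5 thousand, i.e. slightly over $100 million

# Second guess: small weight on the large-valued feature
w_good = np.array([0.1, 50.0])
print(predict_price(w_good, 50.0, x))  # 500.0 thousand, i.e. $500,000
```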

So hopefully you noticed that when the possible range of values of a feature is large, like the size in square feet, which goes all the way up to 2,000, it's more likely that a good model will learn to choose a relatively small parameter value, like 0.1. Likewise, when the possible values of the feature are small, like the number of bedrooms, a reasonable value for its parameter will be relatively large, like 50. So how does this relate to gradient descent? Well, let's take a look at a scatter plot of the features, where the size in square feet, x1, is on the horizontal axis and the number of bedrooms, x2, is on the vertical axis. If you plot the training data, you notice that the horizontal axis covers a much larger scale, or much larger range of values, than the vertical axis. Next let's look at how the cost function might look in a contour plot of J as a function of w1 and w2. You might see a contour plot where the horizontal axis (w1) has a much narrower range, say between 0 and 1, whereas the vertical axis (w2) takes on much larger values, say between 10 and 100. So the contours form ovals, or ellipses, that are short in one direction and longer in the other. This is because a very small change to w1 can have a very large impact on the estimated price, and thus a very large impact on the cost J, since w1 gets multiplied by a very large number, the size in square feet. In contrast, it takes a much larger change in w2 to change the predictions much, and so small changes to w2 don't change the cost function nearly as much.
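To see that asymmetry numerically, here is a small illustrative sketch (my own, not from the lecture) that nudges w1 and w2 by the same amount around the good fit from above and compares how much a squared-error cost on this single training example moves; the helper name cost and the 0.05 step size are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([2000.0, 5.0])   # size in sq. ft., number of bedrooms
y = 500.0                     # true price in thousands of dollars

def cost(w1, w2, b=50.0):
    """Squared-error cost on the single training example."""
    f = w1 * x[0] + w2 * x[1] + b
    return 0.5 * (f - y) ** 2

base = cost(0.1, 50.0)                 # the good fit, so the cost is 0
print(cost(0.1 + 0.05, 50.0) - base)   # nudging w1 by 0.05 moves the prediction by 100 -> cost jumps by 5000
print(cost(0.1, 50.0 + 0.05) - base)   # nudging w2 by 0.05 moves it by only 0.25 -> cost barely changes
```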

So where does this leave us? This is what might end up happening if you were to run gradient descent using your training data as is: because the contours are so tall and skinny, gradient descent may end up bouncing back and forth for a long time before it can finally find its way to the global minimum. In situations like this, a useful thing to do is to scale the features. This means performing some transformation of your training data so that x1, say, might now range from 0 to 1, and x2 might also range from 0 to 1. The data points now look more like this, and you might notice that the scale of the plot on the bottom is quite different from the one on top. The key point is that the rescaled x1 and x2 now both take on comparable ranges of values. And if you run gradient descent on a cost function defined on these rescaled x1 and x2, then the contours will look more like circles and less tall and skinny, and gradient descent can find a much more direct path to the global minimum.
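The details come in the next video, but as one example of such a transformation, here is a minimal sketch of min-max scaling, assuming the training features are stored in a NumPy array X with one row per house; the toy values are made up for illustration.

```python
import numpy as np

# Toy training set: each row is [size in sq. ft., number of bedrooms]
X = np.array([[2000.0, 5.0],
              [1200.0, 3.0],
              [ 300.0, 0.0],
              [ 850.0, 2.0]])

# Min-max scaling: map each feature column into the range [0, 1]
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_scaled = (X - x_min) / (x_max - x_min)

print(X_scaled)                                   # both columns now lie between 0 and 1
print(X_scaled.min(axis=0), X_scaled.max(axis=0)) # [0. 0.] [1. 1.]
```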

So to recap, when you have different features that take on very different ranges of values, it can cause gradient descent to run slowly, but rescaling the different features so they all take on comparable ranges of values can speed up gradient descent significantly. How do you actually do this? Let's take a look at that in the next video.
