k-Nearest Neighbors and the Curse of Dimensionality


Machine Learning models and the curse of dimensionality

There is always a trade-off between things in life. If you take a certain path, there is always a possibility that you might have to compromise on some other parameter. Machine learning models are no different: in the case of k-Nearest Neighbors, there has always been a problem with a huge impact on classifiers that rely on pairwise distances, and that problem is the "curse of dimensionality". By the end of this article you will be able to create your own k-Nearest Neighbors model and observe the impact of increasing the dimensionality of the data set you fit. Let's dig in!

Creating a k-Nearest Neighbor model:


Right before we get our hands dirty with the technical part, we need to lay the foundation for our analysis, which is nothing but the libraries.

Thanks to the built-in machine learning packages, our job is made quite easy.
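
A minimal setup might look like the sketch below; the exact packages used in the original notebook are not shown, so numpy, matplotlib, and scikit-learn are assumed here.

```python
# Core numerical and plotting libraries, plus scikit-learn for
# ready-made k-NN implementations (assumed stack for this walkthrough).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
```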

Nearest neighbors classifier:

Let's begin with a simple nearest neighbor classifier, where we are posed a binary classification task: we have a set of labeled inputs, where the labels are all either 0 or 1. Our goal is to train a classifier to predict a 0 or 1 label for new, unseen test data. One conceptually simple approach is to find the sample in the training data that is "most similar" to our test sample (a "neighbor" in the feature space), and then give the test sample the same label as that most similar training sample. This is the nearest neighbors classifier.

After running a few lines of code we can visualize our data set, with training data shown in blue (negative class) and red (positive class). A test sample is shown in green. To keep things simple, I have used a simple linear boundary to separate the classes.
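
A sketch along these lines could generate and plot such a data set; the uniform features, the boundary x1 + x2 > 1, and the test point are illustrative assumptions, not the original data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Uniformly distributed 2-D features; the label is 1 above an assumed
# linear boundary (x1 + x2 > 1) and 0 below it.
X_train = rng.uniform(0, 1, size=(50, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)

x_test = np.array([0.4, 0.5])  # a single unseen test point (assumed)

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            c='blue', label='class 0 (negative)')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            c='red', label='class 1 (positive)')
plt.scatter(*x_test, c='green', marker='*', s=200, label='test sample')
plt.legend()
plt.show()
```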

[Figure: training data in blue (negative class) and red (positive class), with the test sample in green]

To find the nearest neighbor, we need a distance metric. For our case, I chose the L2 norm. There are certainly a few perks to using the L2 norm as a distance metric: given that we don't have any outliers, the L2 norm minimizes the mean cost and treats every feature equally.

The nearest neighbor to the test sample is circled, and its label is applied as the prediction for the test sample:


[Figure: nearest neighbor classified]
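
A minimal sketch of this prediction step with the L2 norm, on the same assumed synthetic data as above:

```python
import numpy as np

# Re-create the assumed synthetic data from the earlier sketch.
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 1, size=(50, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)
x_test = np.array([0.4, 0.5])

# L2 (Euclidean) distances from the test point to every training point.
dists = np.linalg.norm(X_train - x_test, axis=1)

nearest = np.argmin(dists)      # index of the single nearest neighbor
prediction = y_train[nearest]   # 1-NN prediction: copy its label
print(f"nearest neighbor index={nearest}, predicted label={prediction}")
```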

Using the nearest neighbor we successfully classified our test sample with the label "0", but again we assumed there were no outliers and we kept the noise moderate.

The nearest neighbor classifier works by “memorizing” the training data. One interesting consequence of this is that it will have zero prediction error (or equivalently, 100% accuracy) on the training data, since each training sample’s nearest neighbor is itself:

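This is easy to check with scikit-learn's KNeighborsClassifier on the assumed synthetic data; with n_neighbors=1 the training accuracy should come out to exactly 1.0 (barring duplicate training points with conflicting labels).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same assumed synthetic data as in the earlier sketches.
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 1, size=(50, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)

# With k=1, every training point is its own nearest neighbor,
# so the training accuracy is 1.0.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.score(X_train, y_train))  # expected: 1.0
```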

Now we look to overcome the shortcomings of the nearest neighbor model, and the answer lies in the k-Nearest Neighbors classifier.

K nearest neighbors classifier:

To make this approach less sensitive to noise, we might choose to look for multiple similar training samples to each new test sample, and classify the new test sample using the mode of the labels of the similar training samples. This is k nearest neighbors, where k is the number of “neighbors” that we search for.


In the following plot, we show the same data as in the previous example. Now, however, the 3 closest neighbors to the test sample are circled, and the mode of their labels is used as the prediction for the new test sample. Feel free to play with the parameter k and observe the changes.


[Figure: k-NN classifier with k=3]
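
A sketch of the k = 3 vote, again on the assumed synthetic data, taking the mode of the neighbor labels with a bincount:

```python
import numpy as np

# Same assumed synthetic data as before.
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 1, size=(50, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)
x_test = np.array([0.4, 0.5])

k = 3
dists = np.linalg.norm(X_train - x_test, axis=1)
nearest_k = np.argsort(dists)[:k]        # indices of the k closest points
votes = np.bincount(y_train[nearest_k])  # count labels among the neighbors
prediction = np.argmax(votes)            # mode of the neighbor labels
print(f"k={k}, neighbor labels={y_train[nearest_k]}, prediction={prediction}")
```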

The following image shows a set of test points plotted on top of the training data. The size of each test point indicates the confidence in its label, which we approximate by the proportion of the k neighbors sharing that label.

[Figure: confidence scores]

The bigger the dot, the higher the confidence score for that point.
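
One way to approximate this confidence is the class fraction returned by scikit-learn's predict_proba, which for k-NN is exactly the proportion of neighbors voting for each label (the data below is the same assumed synthetic set as before):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X_train = rng.uniform(0, 1, size=(50, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)
X_test = rng.uniform(0, 1, size=(5, 2))  # a handful of assumed test points

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # fraction of the k neighbors with each label
conf = proba.max(axis=1)           # confidence = largest fraction
print(conf)                        # e.g. 1.0 if all 3 agree, 0.667 if 2 of 3
```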

Also note that the training error for k nearest neighbors is not necessarily zero (though it can be!), since a training sample may have a different label than its k closest neighbors.


Feature scaling:

One important limitation of k nearest neighbors is that it does not “learn” anything about which features are most important for determining y. Every feature is weighted equally in finding the nearest neighbor.


The first implication of this is:


  • If all features are equally important, but they are not all on the same scale, they must be normalized, i.e. rescaled onto the interval [0,1]. Otherwise, the features with the largest magnitudes will dominate the total distance (see the sketch below).
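
A small sketch of that rescaling with scikit-learn's MinMaxScaler (the toy feature matrix is an assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: the second would dominate
# an L2 distance unless the data is rescaled onto [0, 1].
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

scaler = MinMaxScaler()             # rescales each column onto [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```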

The second implication is:


  • Even if some features are more important than others, they will all be considered equally important in the distance calculation. If uninformative features are included, they may dominate the distance calculation.


Contrast this with our logistic regression classifier. In the logistic regression, the training process involves learning coefficients. The coefficients weight each feature’s effect on the overall output.


Let’s see how our model performs for an image classification problem. Consider the following images from CIFAR10, a dataset of low-resolution images in ten classes:


[Figure: images classified as car]

The images above show a test sample and two training samples with their distances to the test sample.


The background pixels in the test sample "count" just as much as the foreground pixels, so the image of the deer is considered a very close neighbor, while the image of the car is not. As stated before, we used the L2 norm, and our model considers every pixel equally important, which makes it difficult for the nearest neighbor approach to classify real-world images.

[Figure: images classified as car]

We also see here that Euclidean distance is not a good metric of visual similarity — the frog on the right is almost as similar to the car as the deer in the middle!

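A sketch of that distance computation on flattened images; the arrays below are random stand-ins for the CIFAR10 images, since the loading code is not shown here:

```python
import numpy as np

# Hypothetical 32x32x3 images as uint8 arrays (e.g. from CIFAR10);
# random placeholders are used because loading is omitted here.
rng = np.random.default_rng(0)
img_test = rng.integers(0, 256, size=(32, 32, 3))
img_train_a = rng.integers(0, 256, size=(32, 32, 3))
img_train_b = rng.integers(0, 256, size=(32, 32, 3))

def l2_distance(a, b):
    """L2 distance between two images, treating every pixel equally."""
    return np.linalg.norm(a.astype(float).ravel() - b.astype(float).ravel())

print(l2_distance(img_test, img_train_a))
print(l2_distance(img_test, img_train_b))
```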

K nearest neighbors regression:

K nearest neighbors can also be used for regression, with just a small change: instead of using the mode of the nearest neighbors to predict the label of a new sample, we use the mean. Consider the following training data:


[Figure: training data for regression]

We can add a test sample, then use k nearest neighbors to predict its value:


[Figure: k nearest neighbors prediction for the test sample]
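
A sketch of the regression variant using scikit-learn's KNeighborsRegressor; the noisy sine data is an assumption standing in for the training data shown above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Assumed 1-D training data: a noisy sine curve.
rng = np.random.default_rng(1)
X_train = np.sort(rng.uniform(0, 5, size=(30, 1)), axis=0)
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.1, size=30)

# For regression, the prediction is the mean of the k neighbors' targets.
reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(reg.predict([[2.5]]))  # prediction for an assumed test sample x = 2.5
```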

The "curse of dimensionality":

Classifiers that rely on pairwise distance between points, like the k neighbors methods, are heavily impacted by a problem known as the “curse of dimensionality”. In this section, I will illustrate the problem. We will look at a problem with data uniformly distributed in each dimension of the feature space, and two classes separated by a linear boundary.


We will generate a test point and show the k nearest neighbors to that test point. We will also show the length (or area, or volume) that we had to search to find those k neighbors, and observe the radius required to find the nearest neighbors as the dimensionality of the space increases.

Pay special attention to how that length (or area, or volume) changes as we increase the dimensionality of the feature space.


First, let's observe the 1D problem:


[Figure: 1D space radius search]

Now, the 2D equivalent:


[Figure: 2D space radius search]

Finally, the 3D equivalent:


[Figure: 3D space radius search]

We can see that as the dimensionality of the problem grows, the higher-dimensional space is less densely occupied by the training data, and we need to search a large volume of space to find neighbors of the test point. The pair-wise distance between points grows as we add additional dimensions.

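A quick numerical sketch of that effect, with N = 100 uniformly distributed training points and k = 3 (both values are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 100, 3  # assumed sample size and neighbor count

# Distance from a random test point to its k-th nearest neighbor,
# for uniformly distributed data in increasing dimension d.
for d in (1, 2, 3, 10, 100):
    X = rng.uniform(0, 1, size=(N, d))
    x_test = rng.uniform(0, 1, size=d)
    dists = np.sort(np.linalg.norm(X - x_test, axis=1))
    print(f"d={d:3d}  distance to neighbor {k}: {dists[k - 1]:.3f}")
```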

And in that case, the neighbors may be so far away that they don’t actually have much in common with the test point.


In general, the length of the smallest hyper-cube that contains all k-nearest neighbors of a test point is:


(k/N)^(1/d)

for N samples with dimensionality d.

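Plugging assumed values (k = 10, N = 1000) into this expression shows how quickly the search cube grows toward the full extent of the feature space:

```python
# Edge length (k/N)^(1/d) of the smallest hyper-cube expected to contain
# the k nearest neighbors, for assumed values k=10 and N=1000.
k, N = 10, 1000
for d in (1, 2, 10, 100):
    print(f"d={d:3d}  edge length = {(k / N) ** (1 / d):.3f}")
# d=1 gives 0.010, while d=100 gives about 0.955 of the full range.
```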

From the expression above, we can see that as the number of dimensions increases linearly, the number of training samples must increase exponentially to counter the “curse”.


Alternatively, we can reduce d — either by feature selection or by transforming the data into a lower-dimensional space.

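For example, a sketch of the second option using PCA from scikit-learn (the 50-dimensional random data is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# One way to reduce d: project the data onto a lower-dimensional space
# (feature selection would be the other option mentioned above).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))    # 500 samples in an assumed 50-d space

pca = PCA(n_components=5)         # keep the 5 directions of largest variance
X_low = pca.fit_transform(X)
print(X.shape, "->", X_low.shape) # (500, 50) -> (500, 5)
```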

Translated from: https://towardsdatascience.com/k-nearest-neighbors-and-the-curse-of-dimensionality-7d64634015d9
