Algorithms From Scratch: K-Nearest Neighbors


Introduction

K-Nearest Neighbors is a non-parametric algorithm capable of performing both Classification and Regression. Thomas Cover, a professor at Stanford University, first proposed the idea of the K-Nearest Neighbors algorithm in 1967.

Many refer to K-NN as a lazy learner, or a type of instance-based learner, since all computation is deferred until function evaluation. Personally, I believe this places K-Nearest Neighbors towards the less complex end of Machine Learning algorithms when we begin to conceptualize it.

Whether we are solving Classification or Regression problems, the input consists of the k nearest training examples in the original feature space. The output of the algorithm, however, depends on the type of problem; see the Terminology section for more on the different outputs.

Link to the code generated in this article…

Terminology

K-Nearest Neighbors Classification → The output determines class membership, and the prediction is made by a plurality vote of the instance's neighbors. The new instance is therefore assigned to the class most common amongst its k nearest neighbors.

K-Nearest Neighbors Regression → The output determines a property value for the object. The new instance is therefore assigned the average of the values of its k nearest neighbors.
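
To make the two output rules concrete, here is a minimal sketch assuming we have already collected the labels (or target values) of the k nearest neighbors into plain Python lists; the variable names are purely illustrative.

from collections import Counter
import numpy as np

# Hypothetical labels/values of the k = 3 nearest neighbors of a new instance.
neighbor_labels = [0, 2, 0]          # classification: class labels
neighbor_values = [5.1, 4.8, 5.4]    # regression: numeric targets

# Classification: plurality vote over the neighbors' class labels.
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]

# Regression: mean of the neighbors' target values.
predicted_value = np.mean(neighbor_values)

print(predicted_class, predicted_value)  # 0 5.1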

Instance-Based Learning → A family of Machine Learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training, which have been stored in memory. (Source: Wikipedia)

Lazy Learning → A Machine Learning method in which generalization of the training data is, in theory, delayed until a query is made to the system, as opposed to eager learning, where the system tries to generalize the training data before receiving queries. (Source: Wikipedia)

Creating the Model

Creating the K-NN algorithm is quite simple. The training phase literally consists of storing the feature vectors and labels of the training samples; however, we need to determine a positive integer for k. Generally, selecting a larger value of k reduces the effect of noise on the classification, but makes the boundaries between classes less distinct. Ultimately, the choice of k is largely driven by the data, which means we have no way of knowing the best value until we have tried it on the data, although there are many heuristics we can use to select k.

Note: To read more about tuning the k hyperparameter, see the Hyperparameter Optimization Wikipedia page.
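
As one possible heuristic (not used later in this article), here is a minimal sketch of choosing k by cross-validation, assuming scikit-learn's KNeighborsClassifier and cross_val_score and the Iris data used in the Implementation section below.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score a handful of candidate k values with 5-fold cross-validation
# and keep the one with the highest mean accuracy.
candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
               for k in candidate_ks]
best_k = candidate_ks[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))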

Great, we have selected k. In order to make a prediction on a new instance in a classification task, the k records (from the training data) closest to the new observation are identified. A prediction is then made from those k neighbors; see K-Nearest Neighbors Classification in the Terminology section for how this is done.

For us to identify the k records closest to a new instance we must take a measure of distance to all the training instances. This can be done in various ways, although as a guide many practitioners use Euclidean distance for continuous variables and Hamming distance for discrete variables.

[Image: Hamming distance and Euclidean distance formulas]
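
As a quick illustration of the two metrics, here is a minimal sketch with NumPy; euclidean_distance mirrors the function implemented later in the article, while hamming_distance is included only for comparison and is not used again.

import numpy as np

def euclidean_distance(x1, x2):
    # Square root of the summed squared differences between two continuous vectors.
    return np.sqrt(np.sum((np.array(x1) - np.array(x2)) ** 2))

def hamming_distance(x1, x2):
    # Number of positions at which two discrete vectors differ.
    return int(np.sum(np.array(x1) != np.array(x2)))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))    # 5.0
print(hamming_distance([0, 1, 1, 0], [0, 0, 1, 1]))  # 2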

Chunking the Algorithm


  1. Calculate the Euclidean Distance
  2. Locate Neighbors
  3. Predict

Implementation


To implement our K-Nearest Neighbors Classification algorithm, we will use the Iris dataset from Scikit-Learn. In this task we are challenged to predict whether a flower is setosa, versicolor, or virginica given its measurements, making it a multiclass classification task.

Note: In this implementation I did not apply any heuristics to select the optimal k; I simply picked a value for k.

import numpy as np
import pandas as pd
from collections import Counter

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1810)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((120, 4), (120,), (30, 4), (30,))

We now have the data and have used a holdout-based cross-validation scheme to split it. If you are unfamiliar with this terminology, see the link below.

The first step is calculating the Euclidean distance between two rows.


def euclidean(x1, x2):
    # Square root of the sum of squared differences between the two vectors.
    return np.sqrt(np.sum((x1 - x2)**2))

To test this function, I have taken some code from Jason Brownlee that he used to test his distance function. If we have the correct implementation then our outputs should be the same.


# dataset from https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
dataset = [[2.7810836,2.550537003,0],
           [1.465489372,2.362125076,0],
           [3.396561688,4.400293529,0],
           [1.38807019,1.850220317,0],
           [3.06407232,3.005305973,0],
           [7.627531214,2.759262235,1],
           [5.332441248,2.088626775,1],
           [6.922596716,1.77106367,1],
           [8.675418651,-0.242068655,1],
           [7.673756466,3.508563011,1]]

row0 = dataset[0]

for row in dataset:
    print(euclidean(np.array(row0), np.array(row)))

0.0
1.3290173915275787
1.9494646655653247
1.5591439385540549
0.5356280721938492
4.952940611164215
2.7789902674782985
4.3312480380207
6.59862349695304
5.084885603993178

And we get the exact same output; feel free to check against the link provided.

As previously mentioned, the k neighbors of a new observation are the k nearest instances from the training data. Using our distance function, euclidean, we can now calculate the distance between each observation in the training data and the new observation we have been passed, and select the k instances from the training data that are closest to it.

def find_neighbors(X_train, X_test, y_train, n_neighbors):
    # Distance from the new observation (X_test) to every training instance.
    distances = [euclidean(X_test, x) for x in X_train]
    # Indexes of the n_neighbors smallest distances.
    k_nearest = np.argsort(distances)[:n_neighbors]
    # Labels of those nearest training instances.
    k_nearest_label = [y_train[i] for i in k_nearest]

    # Plurality vote: the most common label amongst the neighbors.
    most_common = Counter(k_nearest_label).most_common(1)[0][0]

    return most_common

This function calculates the distance of the new observation to all the rows in the training data and stores them in a list. Next, we find the indexes of the k lowest distances using NumPy's np.argsort() (see the documentation). We then use those indexes to look up the classes of the k instances. After that, we count the labels in the k_nearest_label list using the Counter class from Python's built-in collections module and return the most common one (the label with the highest count). However, we won't see this in action until we make a prediction, so let's build our predict function.
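
As a quick sanity check (a small illustrative example, assuming the train/test split created earlier), we can call find_neighbors on a single test row before wiring up predict:

# Predict the class of the first test observation from its 3 nearest neighbors
# and compare it against the true label.
single_prediction = find_neighbors(X_train, X_test[0], y_train, n_neighbors=3)
print(single_prediction, y_test[0])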

def predict(X_test, X_train, y_train, n_neighbors=3):
    # Find the most common neighbor label for each observation in the test set.
    predictions = [find_neighbors(X_train, x, y_train, n_neighbors) for x in X_test]
    return np.array(predictions)

predict(X_test, X_train, y_train, n_neighbors=3)

array([0, 0, 2, 2, 0, 1, 0, 0, 1, 1, 2, 1, 2, 0, 1, 2, 0, 0, 0, 2, 1, 2, 0, 0, 0, 0, 1, 1, 0, 2])

In the predict function we use a list comprehension to find the nearest neighbors of each new instance in the test set and return an array of predictions. Using 3 neighbors we get 100% accuracy on our task, and we can compare it to scikit-learn's implementation to see whether we get the same results, which we did.

Note: The documentation for the K-Nearest Neighbors Classifier can be found here.

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

sklearn_preds = knn.predict(X_test)
preds = predict(X_test, X_train, y_train, n_neighbors=3)

print(f"My Implementation: {accuracy_score(y_test, preds)}\nScikit-Learn Implementation: {accuracy_score(y_test, sklearn_preds)}")

My Implementation: 1.0
Scikit-Learn Implementation: 1.0

Pros


  • Intuitive and simple
  • No training step
  • Can be used for both Classification and Regression (and unsupervised learning)
  • Easy to implement for multiclass problems

Cons


  • As the data grows, the algorithm quickly becomes slow
  • Sensitive to outliers
  • Imbalanced data causes problems; weighted distances can be used to overcome this (see the sketch after this list)
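
One way to mitigate the imbalance issue, sketched here with scikit-learn rather than with the from-scratch implementation above, is to weight each neighbor's vote by the inverse of its distance via the weights='distance' option of KNeighborsClassifier.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1810)

# Each neighbor's vote is weighted by 1 / distance, so closer points count for more;
# this softens the dominance of a heavily represented class among the k neighbors.
weighted_knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
weighted_knn.fit(X_train, y_train)
print(weighted_knn.score(X_test, y_test))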

Wrap Up

In this story you’ve learned about the K-Nearest Neighbors algorithm, how to implement the K-Nearest Neighbors classification algorithm from scratch in Python, and the pros and cons of using K-Nearest Neighbors.


Let’s continue the conversation on LinkedIn…


Translated from: https://towardsdatascience.com/algorithms-from-scratch-k-nearest-neighbors-fe19b431a57
