WIKI
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
- In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
- In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
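A minimal sketch of both modes (assuming NumPy, Euclidean distance, and brute-force search; the function name and toy data are illustrative, not taken from the text):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, mode="classify"):
    # Euclidean distance from the query point x to every training example.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training examples.
    nearest = np.argsort(dists)[:k]
    if mode == "classify":
        # Classification: majority vote among the k nearest labels.
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression: average of the k nearest target values.
    return float(np.mean(y_train[nearest]))

# Toy data (illustrative only).
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))                                # -> 0
print(knn_predict(X, y.astype(float), np.array([0.2, 0.1]), k=3, mode="regress"))  # -> ~0.33
```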
Algorithm
Model
- Basic elements of the model: the distance metric, the choice of k, and the classification decision rule
Distance metric
Lp distance: when p = 2, it is called the Euclidean distance.
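Written out under the standard definition (an assumption about notation: x_i, x_j are points in an n-dimensional feature space and x_i^{(l)} is the l-th coordinate of x_i):

$$ L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \bigl| x_i^{(l)} - x_j^{(l)} \bigr|^{p} \right)^{1/p}, \qquad p \ge 1. $$

Setting p = 2 recovers the Euclidean distance mentioned above.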
Choice of k
- If k is small: the approximation error decreases but the estimation error increases; the model as a whole becomes more complex, the prediction is very sensitive to the nearby instance points, and overfitting is likely.
- If k is large: the approximation error increases but the estimation error decreases; the model as a whole becomes simpler.
- In practice: k is usually set to a relatively small value, and cross-validation is used to select the optimal k (see the sketch after this list).
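A sketch of that selection procedure, assuming scikit-learn and its built-in Iris data are available; the candidate range 1 to 15 and the 5-fold split are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k by 5-fold cross-validated accuracy and keep the best.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```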
Classification decision rule
Majority voting rule: assign the object to the class most common among its k nearest neighbors.
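A sketch of the rule in isolation, assuming the labels of the k nearest neighbors have already been collected (the function name is illustrative):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    # Return the most frequent label; Counter.most_common breaks ties by the
    # order labels were first seen, so passing labels sorted by distance
    # favors the closer neighbors when counts are equal.
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote([1, 0, 1]))  # -> 1
```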
Implementation
- The main concern is how to search the training data for the k nearest neighbors quickly
- Simplest implementation: linear scan (compute the distance to every training point)
- For better efficiency: a kd-tree (see the sketch after this list)
PS: a kd-tree is a tree data structure that stores instance points in a k-dimensional space so that they can be retrieved quickly (this k refers to the dimensionality of the space, not the k in k-NN).
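For comparison with linear scan, a sketch that delegates the fast search to SciPy's kd-tree (assuming scipy is installed; the data here are random and purely illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 3))            # training points in a 3-D feature space
query = rng.random(3)

tree = cKDTree(X)                      # build the kd-tree once
dists, idx = tree.query(query, k=5)    # each k-NN query then avoids a full scan
print(idx, dists)
```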
Constructing a kd-tree
Searching a kd-tree
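To illustrate both headings, a minimal 2-D sketch (NumPy only, 1-nearest-neighbor search; the node layout, function names, and toy points are my own assumptions, and production code such as scipy.spatial.cKDTree handles far more): construction recursively splits on alternating coordinate axes at the median point, and search descends to the query's region first, then backtracks, pruning any subtree whose splitting plane lies farther away than the best distance found so far.

```python
import numpy as np

class Node:
    """One kd-tree node: a point, its splitting axis, and two subtrees."""
    def __init__(self, point, axis, left, right):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right

def build(points, depth=0):
    # Construction: cycle through the coordinate axes and place the median
    # point along the current axis at this node so the tree stays balanced.
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    # Search: walk toward the leaf region containing the query, then backtrack,
    # visiting the far side of a split only if the splitting plane is closer
    # than the best distance found so far.
    if node is None:
        return best
    dist = np.linalg.norm(query - node.point)
    if best is None or dist < best[1]:
        best = (node.point, dist)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    if abs(diff) < best[1]:          # the far side could still hold a closer point
        best = nearest(far, query, best)
    return best

# Toy usage (illustrative points only).
pts = np.array([[2., 3.], [5., 4.], [9., 6.], [4., 7.], [8., 1.], [7., 2.]])
tree = build(pts)
point, dist = nearest(tree, np.array([9., 2.]))
print(point, dist)  # -> [8. 1.], about 1.41
```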