1. Basic Concepts
KNN (K-Nearest Neighbors) is one of the most widely used supervised learning methods. A KNN classifier typically assigns an observation the class that holds the largest share among its k nearest neighbors, while a KNN regressor typically averages the target values of the k nearest neighbors to produce a prediction.
2. KNN
KNN is commonly used in two settings, classification and regression, described below.
2.1 KNN Classifier
The KNN classifier is one of the most widely used supervised classifiers: it assigns an observation the class that holds the largest share among its k nearest neighbors.
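The majority-vote rule described above can be sketched with scikit-learn's `KNeighborsClassifier`; the toy data here is made up for illustration.

```python
# Minimal sketch of KNN classification: each prediction is the majority
# class among the k nearest training points.
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters, labeled 0 and 1.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Each query is assigned the majority class of its 3 nearest neighbors.
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```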
2.2 KNN Regressor
A KNN regressor finds the k nearest neighbors of a test sample and averages their target values to obtain the prediction for that sample.
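The averaging rule described above can be sketched with scikit-learn's `KNeighborsRegressor`; the one-dimensional toy data is made up for illustration.

```python
# Minimal sketch of KNN regression: the prediction is the mean target
# value of the k nearest neighbors.
from sklearn.neighbors import KNeighborsRegressor

X_train = [[0], [1], [2], [3]]
y_train = [0.0, 1.0, 2.0, 3.0]

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

# The two nearest neighbors of x=1.6 are x=1 and x=2, so the
# prediction is the average (1.0 + 2.0) / 2 = 1.5.
print(reg.predict([[1.6]]))  # -> [1.5]
```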
2.3 Distance
KNN usually takes Euclidean distance as its distance metric, but this is suitable only for continuous variables.
For discrete variables, as in text classification, another metric such as the overlap metric (or Hamming distance) can be used instead.
For gene-expression microarray data, KNN has also been used in combination with correlation coefficients such as Pearson and Spearman.
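The contrast between the two metrics mentioned above can be sketched with `scipy.spatial.distance`, whose metric names are also accepted by scikit-learn's neighbor searches via the `metric` parameter.

```python
# Euclidean distance suits continuous features; Hamming distance suits
# discrete features, measuring the fraction of positions that differ.
from scipy.spatial.distance import euclidean, hamming

# Continuous vectors: straight-line distance.
print(euclidean([0.0, 0.0], [3.0, 4.0]))    # -> 5.0

# Binary vectors: 2 of 4 positions differ, so the distance is 0.5.
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # -> 0.5
```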
3. Discussion
3.1 Performance Analysis
Consider the probability that a nearest-neighbor classifier errs. Let x be the test sample, z its nearest neighbor, and c a class label; the classifier errs exactly when x and z carry different labels, so its error probability is the probability that the labels of x and z differ.
Writing c* for the class chosen by the Bayes optimal classifier, one can conclude that the generalization error of the nearest-neighbor classifier is at most twice that of the Bayes optimal classifier.
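The argument can be written out explicitly. Assuming i.i.d. samples and a training set dense enough that the nearest neighbor z lies arbitrarily close to x (so that P(c | z) ≈ P(c | x)), the 1-NN error probability satisfies:

```latex
\begin{aligned}
P(\mathrm{err}) &= 1 - \sum_{c \in \mathcal{Y}} P(c \mid x)\, P(c \mid z) \\
&\simeq 1 - \sum_{c \in \mathcal{Y}} P(c \mid x)^{2} \\
&\le 1 - P(c^{*} \mid x)^{2} \\
&= \bigl(1 + P(c^{*} \mid x)\bigr)\bigl(1 - P(c^{*} \mid x)\bigr) \\
&\le 2\,\bigl(1 - P(c^{*} \mid x)\bigr)
\end{aligned}
```

Since 1 - P(c* | x) is exactly the error of the Bayes optimal classifier, the last line gives the factor-of-two bound.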
3.2 Choosing K
A smaller K makes predictions more sensitive to noise and thus more error-prone, while a larger K reduces the effect of noise but blurs the boundaries between classes, so choosing an appropriate K matters. In practice, the training data is split into training and validation folds via K-fold cross-validation, and a grid search is used to find the best K.
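The K-fold search described above can be sketched with scikit-learn's `GridSearchCV`; the candidate K values and the Iris dataset are chosen here purely for illustration.

```python
# Sketch of choosing K by cross-validation: GridSearchCV splits the data
# into folds and scores each candidate n_neighbors on held-out folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}

# 5-fold cross-validation over the candidate K values.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

# The K with the highest mean validation accuracy.
print(search.best_params_)
```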
3.3 Weighted Nearest Neighbor Classifier
Plain KNN can be viewed as assigning a weight of 1/k to each of the k nearest neighbors and a weight of 0 to all other samples. A weighted nearest neighbor classifier instead weights the neighbors' labels, for example by distance or by sample importance, which can improve KNN's predictions.
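Distance weighting as described above is exposed in scikit-learn through the `weights` parameter; this sketch on made-up one-dimensional data shows how it changes a prediction.

```python
# With weights="distance", each neighbor votes with weight 1/distance
# instead of the uniform 1/k weighting.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0.0], [0.1], [3.0]]
y_train = [0, 0, 1]

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

# For a query at 2.9, uniform voting over all 3 points picks the
# majority class 0, while distance weighting lets the very close
# point at x=3.0 (weight 1/0.1 = 10) dominate.
print(uniform.predict([[2.9]]))   # -> [0]
print(weighted.predict([[2.9]]))  # -> [1]
```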
4. Code Analysis
```python
class NearestNeighbors(NeighborsBase, KNeighborsMixin,
                       RadiusNeighborsMixin, UnsupervisedMixin):
    """Unsupervised learner for implementing neighbor searches.

    Read more in the :ref:`User Guide <unsupervised_neighbors>`.

    Parameters
    ----------
    n_neighbors : int, optional (default = 5)
        Number of neighbors to use by default for :meth:`kneighbors` queries.

    radius : float, optional (default = 1.0)
        Range of parameter space to use by default for :meth:`radius_neighbors`
        queries.

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
        Algorithm used to compute the nearest neighbors:

        - 'ball_tree' will use :class:`BallTree`
        - 'kd_tree' will use :class:`KDTree`
        - 'brute' will use a brute-force search.
        - 'auto' will attempt to decide the most appropriate algorithm
          based on the values passed to :meth:`fit` method.

        Note: fitting on sparse input will override the setting of
        this parameter, using brute force.

    leaf_size : int, optional (default = 30)
        Leaf size passed to BallTree or KDTree. This can affect the
        speed of the construction and query, as well as the memory
        required to store the tree. The optimal value depends on the
        nature of the problem.

    metric : string or callable, default 'minkowski'
        metric to use for distance computation. Any metric from scikit-learn
        or scipy.spatial.distance can be used.

        If metric is a callable function, it is called on each
        pair of instances (rows) and the resulting value recorded. The callable
        should take two arrays as input and return one value indicating the
        distance between them. This works for Scipy's metrics, but is less
        efficient than passing the metric name as a string.

        Distance matrices are not supported.

        Valid values for metric are:

        - from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
          'manhattan']

        - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
          'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
          'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao',
          'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean',
          'yule']

        See the documentation for scipy.spatial.distance for details on these
        metrics.

    p : integer, optional (default = 2)
        Parameter for the Minkowski metric from
        sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
        equivalent to using manhattan_distance (l1), and euclidean_distance
        (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

    metric_params : dict, optional (default = None)
        Additional keyword arguments for the metric function.

    n_jobs : int or None, optional (default=None)
        The number of parallel jobs to run for neighbors search.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
```
Parameter summary:

Parameter | Meaning | Values |
---|---|---|
n_neighbors | number of neighbors to use | int, default 5 |
radius | search radius | neighbors within distance <= radius are selected |
algorithm | algorithm used to compute the nearest neighbors | {'auto', 'ball_tree', 'kd_tree', 'brute'}; 'auto' picks a suitable method automatically |
leaf_size | leaf size passed to the tree-based algorithms | int, default 30; passed to BallTree or KDTree |
metric | distance metric | from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']; from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'] |
p | parameter of the Minkowski metric | int, default 2 |
n_jobs | number of parallel jobs for the neighbor search | None means 1 unless in a joblib.parallel_backend context; -1 means all processors |
Example (from the docstring; the constructor arguments are passed by keyword here, which also works on current scikit-learn versions where they are keyword-only):

```python
    Examples
    --------
      >>> import numpy as np
      >>> from sklearn.neighbors import NearestNeighbors
      >>> samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]

      >>> neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
      >>> neigh.fit(samples)  #doctest: +ELLIPSIS
      NearestNeighbors(...)

      >>> neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=False)
      ... #doctest: +ELLIPSIS
      array([[2, 0]]...)

      >>> nbrs = neigh.radius_neighbors([[0, 0, 1.3]], 0.4, return_distance=False)
      >>> np.asarray(nbrs[0][0])
      array(2)

    See also
    --------
    KNeighborsClassifier
    RadiusNeighborsClassifier
    KNeighborsRegressor
    RadiusNeighborsRegressor
    BallTree
    """
```
5. References
1. Wikipedia: k-nearest neighbors algorithm
2. https://zhuanlan.zhihu.com/p/23191325
3. https://www.cnblogs.com/pythoner6833/p/9296035.html