KNN: Introduction and Parameter Analysis

1. Basic Concepts

KNN (K-Nearest Neighbors) is one of the most widely used supervised learning methods. A KNN classifier typically assigns an observation to the class that makes up the largest share of its k nearest observations, while a KNN regressor typically averages the label values of the k nearest observations.

2. KNN

KNN is commonly used in two scenarios, classification and regression, described below.

2.1 KNN Classifier

The KNN (K-Nearest Neighbors) classifier is one of the most widely used supervised classifiers: it assigns an observation to the class that accounts for the largest share of its k nearest observations.
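A minimal sketch of this with scikit-learn's KNeighborsClassifier (the toy data is made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: 1-D points with binary labels
X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

# Classify by majority vote among the 3 nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

print(clf.predict([[1.1]]))        # majority class among the 3 neighbors
print(clf.predict_proba([[0.9]]))  # per-class proportions among the neighbors
```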

2.2 KNN Regressor

A KNN regressor finds the k nearest neighbors of a test sample and averages their label values; this average is the predicted value for the test sample.
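A matching sketch with KNeighborsRegressor (again on made-up toy data):

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy training data: 1-D inputs with continuous targets
X_train = [[0], [1], [2], [3]]
y_train = [0.0, 0.0, 1.0, 1.0]

# Predict by averaging the targets of the 2 nearest neighbors
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

print(reg.predict([[1.5]]))  # mean of the targets of its 2 nearest neighbors
```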

2.3 Distance Metrics

In general, KNN uses Euclidean distance as its distance metric, but this is only suitable for continuous variables.
For discrete variables, as in text classification, another metric such as the overlap metric (or Hamming distance) can be used instead.
For gene-expression microarray data, KNN has also been used together with correlation coefficients such as Pearson and Spearman.
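As a sketch, the metric can be swapped via scikit-learn's metric parameter; here Hamming distance is applied to binary feature vectors (the data and labels are invented for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Binary (discrete) feature vectors, e.g. word-presence indicators
X_train = [[1, 0, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1], [0, 1, 0, 1]]
y_train = ['sports', 'sports', 'finance', 'finance']

# Hamming distance measures the fraction of mismatched positions,
# which suits discrete features better than Euclidean distance
clf = KNeighborsClassifier(n_neighbors=3, metric='hamming')
clf.fit(X_train, y_train)

print(clf.predict([[1, 0, 1, 1]]))
```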

3. Discussion of Details
3.1 Performance Analysis

The error probability of the nearest-neighbor classifier is given below, where x is the test sample, z is its nearest neighbor, and c ranges over the classes; the classifier errs exactly when the labels of x and z differ:
$$P(err) = 1 - \sum_{c \in \mathcal{Y}} P(c \mid x)\, P(c \mid z)$$
In the derivation below, c* denotes the prediction of the Bayes optimal classifier, i.e. c* = argmax_{c∈𝒴} P(c|x). Assuming the sample is dense enough that the nearest neighbor z lies arbitrarily close to x, so that P(c|z) ≈ P(c|x), we conclude that the generalization error of the nearest-neighbor classifier is at most twice that of the Bayes optimal classifier:
$$
\begin{aligned}
P(err) &= 1 - \sum_{c \in \mathcal{Y}} P(c \mid x)\, P(c \mid z) \\
&\simeq 1 - \sum_{c \in \mathcal{Y}} P^{2}(c \mid x) \\
&\le 1 - P^{2}(c^{*} \mid x) \\
&= \bigl(1 + P(c^{*} \mid x)\bigr)\bigl(1 - P(c^{*} \mid x)\bigr) \\
&\le 2 \times \bigl(1 - P(c^{*} \mid x)\bigr)
\end{aligned}
$$

3.2 Choosing K

A small K makes predictions more error-prone on individual samples, while a large K reduces the influence of noise but blurs the boundaries between classes, so an appropriate K is particularly important. In practice, the training data is split into training and validation folds via K-fold cross-validation, and a grid search is then run to find the best K, as sketched below.
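A minimal sketch of that procedure with scikit-learn's GridSearchCV (the candidate range for n_neighbors and the use of the iris dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate each candidate K with 5-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 31))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the K with the best mean validation score
print(search.best_score_)   # its mean cross-validated accuracy
```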

3.3 Weighted Nearest-Neighbor Classifiers

Plain KNN can be viewed as assigning each of the k nearest neighbors a weight of 1/k and every other sample a weight of 0. A weighted nearest-neighbor classifier instead weights the neighbors' labels, for example by distance or by sample importance, an improvement that can yield better KNN predictions; see the sketch below.
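In scikit-learn, distance-based weighting is available through the weights parameter; a minimal sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

# weights='distance' weights each neighbor's vote by the inverse of its
# distance, instead of the uniform 1/k weighting of plain KNN
clf = KNeighborsClassifier(n_neighbors=3, weights='distance')
clf.fit(X_train, y_train)

print(clf.predict([[1.4]]))  # closer neighbors dominate the vote
```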

4. Code Analysis
class NearestNeighbors(NeighborsBase, KNeighborsMixin,
                       RadiusNeighborsMixin, UnsupervisedMixin):
    """Unsupervised learner for implementing neighbor searches.

    Read more in the :ref:`User Guide <unsupervised_neighbors>`.

    Parameters
    ----------
    n_neighbors : int, optional (default = 5)
        Number of neighbors to use by default for :meth:`kneighbors` queries.

    radius : float, optional (default = 1.0)
        Range of parameter space to use by default for :meth:`radius_neighbors`
        queries.

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
        Algorithm used to compute the nearest neighbors:

        - 'ball_tree' will use :class:`BallTree`
        - 'kd_tree' will use :class:`KDTree`
        - 'brute' will use a brute-force search.
        - 'auto' will attempt to decide the most appropriate algorithm
          based on the values passed to :meth:`fit` method.

        Note: fitting on sparse input will override the setting of
        this parameter, using brute force.

    leaf_size : int, optional (default = 30)
        Leaf size passed to BallTree or KDTree.  This can affect the
        speed of the construction and query, as well as the memory
        required to store the tree.  The optimal value depends on the
        nature of the problem.

    metric : string or callable, default 'minkowski'
        metric to use for distance computation. Any metric from scikit-learn
        or scipy.spatial.distance can be used.

        If metric is a callable function, it is called on each
        pair of instances (rows) and the resulting value recorded. The callable
        should take two arrays as input and return one value indicating the
        distance between them. This works for Scipy's metrics, but is less
        efficient than passing the metric name as a string.

        Distance matrices are not supported.

        Valid values for metric are:

        - from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
          'manhattan']

        - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
          'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
          'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao',
          'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean',
          'yule']

        See the documentation for scipy.spatial.distance for details on these
        metrics.

    p : integer, optional (default = 2)
        Parameter for the Minkowski metric from
        sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
        equivalent to using manhattan_distance (l1), and euclidean_distance
        (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

    metric_params : dict, optional (default = None)
        Additional keyword arguments for the metric function.

    n_jobs : int or None, optional (default=None)
        The number of parallel jobs to run for neighbors search.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

Parameter notes (a combined usage sketch follows the table):

| Parameter | Meaning | Values |
| --- | --- | --- |
| n_neighbors | number of neighbors to use | integer, default 5 |
| radius | radius; neighbors at distance <= radius are selected | float, default 1.0 |
| algorithm | algorithm used to compute the nearest neighbors | {'auto', 'ball_tree', 'kd_tree', 'brute'}; 'auto' picks the most appropriate method automatically |
| leaf_size | leaf size of the tree-based algorithms | integer, default 30; passed to BallTree or KDTree |
| metric | distance metric | from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']; from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'] |
| p | parameter of the Minkowski metric | integer, default 2 |
| n_jobs | number of parallel jobs for the neighbor search | None means 1 unless in a joblib.parallel_backend context; -1 means all processors |
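A sketch that ties these parameters together in one call (the specific values are arbitrary illustrations):

```python
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(
    n_neighbors=5,          # default K for kneighbors() queries
    radius=1.0,             # default radius for radius_neighbors() queries
    algorithm='ball_tree',  # force BallTree instead of letting 'auto' decide
    leaf_size=30,           # tree leaf size; affects build/query speed and memory
    metric='minkowski',
    p=2,                    # Minkowski with p=2 is Euclidean distance
    n_jobs=-1,              # use all processors for the neighbor search
)
```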

Example

    Examples
    --------
      >>> import numpy as np
      >>> from sklearn.neighbors import NearestNeighbors
      >>> samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]

      >>> neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
      >>> neigh.fit(samples)  #doctest: +ELLIPSIS
      NearestNeighbors(...)

      >>> neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=False)
      ... #doctest: +ELLIPSIS
      array([[2, 0]]...)

      >>> nbrs = neigh.radius_neighbors([[0, 0, 1.3]], 0.4, return_distance=False)
      >>> np.asarray(nbrs[0][0])
      array(2)
      
    See also
    --------
    KNeighborsClassifier
    RadiusNeighborsClassifier
    KNeighborsRegressor
    RadiusNeighborsRegressor
    BallTree
5. References

1. Wikipedia: k-nearest neighbors algorithm
2. https://zhuanlan.zhihu.com/p/23191325
3. https://www.cnblogs.com/pythoner6833/p/9296035.html
