1. Basic Concepts
KNN (K-Nearest Neighbors) is one of the most widely used supervised learning methods. A KNN classifier typically assigns an observation the class that holds the largest share among its k nearest neighbors, while a KNN regressor typically averages the target values of the k nearest neighbors to produce a prediction.
2. KNN
KNN is commonly used in two settings, classification and regression, described below.
2.1 KNN Classifier
The KNN classifier is one of the most widely used supervised classifiers: it assigns an observation the class that holds the largest share among its k nearest neighbors.
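The majority-vote rule described above can be sketched with scikit-learn's `KNeighborsClassifier`; the toy data here is made up for illustration.

```python
# Minimal sketch of KNN classification: each prediction is the majority
# class among the k nearest training points.
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters, labeled 0 and 1.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Each query is assigned the majority class of its 3 nearest neighbors.
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```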
2.2 KNN Regressor
A KNN regressor finds the k nearest neighbors of a test sample and averages their target values to obtain the prediction for that sample.
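The averaging rule described above can be sketched with scikit-learn's `KNeighborsRegressor`; the one-dimensional toy data is made up for illustration.

```python
# Minimal sketch of KNN regression: the prediction is the mean target
# value of the k nearest neighbors.
from sklearn.neighbors import KNeighborsRegressor

X_train = [[0], [1], [2], [3]]
y_train = [0.0, 1.0, 2.0, 3.0]

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

# The two nearest neighbors of x=1.6 are x=1 and x=2, so the
# prediction is the average (1.0 + 2.0) / 2 = 1.5.
print(reg.predict([[1.6]]))  # -> [1.5]
```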
2.3 Distance
KNN usually takes Euclidean distance as its distance metric, but this is suitable only for continuous variables.
For discrete variables, as in text classification, another metric such as the overlap metric (or Hamming distance) can be used instead.
For gene-expression microarray data, KNN has also been used in combination with correlation coefficients such as Pearson and Spearman.
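The contrast between the two metrics mentioned above can be sketched with `scipy.spatial.distance`, whose metric names are also accepted by scikit-learn's neighbor searches via the `metric` parameter.

```python
# Euclidean distance suits continuous features; Hamming distance suits
# discrete features, measuring the fraction of positions that differ.
from scipy.spatial.distance import euclidean, hamming

# Continuous vectors: straight-line distance.
print(euclidean([0.0, 0.0], [3.0, 4.0]))    # -> 5.0

# Binary vectors: 2 of 4 positions differ, so the distance is 0.5.
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # -> 0.5
```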
3. Discussion
3.1 Performance Analysis
Consider the probability that a nearest-neighbor classifier errs. Let x be the test sample, z its nearest neighbor, and c a class label; the classifier errs exactly when x and z carry different labels, so its error probability is the probability that the labels of x and z differ.
Writing c* for the class chosen by the Bayes optimal classifier, one can conclude that the generalization error of the nearest-neighbor classifier is at most twice that of the Bayes optimal classifier.
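The argument can be written out explicitly. Assuming i.i.d. samples and a training set dense enough that the nearest neighbor z lies arbitrarily close to x (so that P(c | z) ≈ P(c | x)), the 1-NN error probability satisfies:

```latex
\begin{aligned}
P(\mathrm{err}) &= 1 - \sum_{c \in \mathcal{Y}} P(c \mid x)\, P(c \mid z) \\
&\simeq 1 - \sum_{c \in \mathcal{Y}} P(c \mid x)^{2} \\
&\le 1 - P(c^{*} \mid x)^{2} \\
&= \bigl(1 + P(c^{*} \mid x)\bigr)\bigl(1 - P(c^{*} \mid x)\bigr) \\
&\le 2\,\bigl(1 - P(c^{*} \mid x)\bigr)
\end{aligned}
```

Since 1 - P(c* | x) is exactly the error of the Bayes optimal classifier, the last line gives the factor-of-two bound.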
3.2 Choosing K
A smaller K makes predictions more sensitive to noise and thus more error-prone, while a larger K reduces the effect of noise but blurs the boundaries between classes, so choosing an appropriate K matters. In practice, the training data is split into training and validation folds via K-fold cross-validation, and a grid search is used to find the best K.
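The K-fold search described above can be sketched with scikit-learn's `GridSearchCV`; the candidate K values and the Iris dataset are chosen here purely for illustration.

```python
# Sketch of choosing K by cross-validation: GridSearchCV splits the data
# into folds and scores each candidate n_neighbors on held-out folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}

# 5-fold cross-validation over the candidate K values.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

# The K with the highest mean validation accuracy.
print(search.best_params_)
```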
3.3 Weighted Nearest Neighbor Classifier
Plain KNN can be viewed as assigning a weight of 1/k to each of the k nearest neighbors and a weight of 0 to all other samples. A weighted nearest neighbor classifier instead weights the neighbors' labels, for example by distance or by sample importance, which can improve KNN's predictions.
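Distance weighting as described above is exposed in scikit-learn through the `weights` parameter; this sketch on made-up one-dimensional data shows how it changes a prediction.

```python
# With weights="distance", each neighbor votes with weight 1/distance
# instead of the uniform 1/k weighting.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0.0], [0.1], [3.0]]
y_train = [0, 0, 1]

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

# For a query at 2.9, uniform voting over all 3 points picks the
# majority class 0, while distance weighting lets the very close
# point at x=3.0 (weight 1/0.1 = 10) dominate.
print(uniform.predict([[2.9]]))   # -> [0]
print(weighted.predict([[2.9]]))  # -> [1]
```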
4. Code Analysis
```python
class NearestNeighbors(NeighborsBase, KNeighborsMixin,
                       RadiusNeighborsMixin, UnsupervisedMixin):
    """Unsupervised learner for implementing neighbor searches.

    Read more in the :ref:`User Guide <unsupervised_neighbors>`.

    Parameters
    ----------
    n_neighbors : int, optional (default = 5)
        Number of neighbors to use by default for :meth:`kneighbors` queries.

    radius : float, optional (default = 1.0)
        Range of parameter space to use by default for :meth:`radius_neighbors`
        queries.

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
        Algorithm used to compute the nearest neighbors:

        - 'ball_tree' will use :class:`BallTree`
        - 'kd_tree' will use :class:`KDTree`
        - 'brute' will use a brute-force search.
        - 'auto' will attempt to decide the most appropriate algorithm
          based on the values passed to :meth:`fit` method.

        Note: fitting on sparse input will override the setting of
        this parameter, using brute force.

    leaf_size : int, optional (default = 30)
        Leaf size passed to BallTree or KDTree. This can affect the
        speed of the construction and query, as well as the memory
        required to store the tree. The optimal value depends on the
        nature of the problem.

    metric : string or callable, default 'minkowski'
        metric to use for distance computation. Any metric from scikit-learn
        or scipy.spatial.distance can be used.

        If metric is a callable function, it is called on each
        pair of instances (rows) and the resulting value recorded. The callable
        should take two arrays as input and return one value indicating the
        distance between them. This works for Scipy's metrics, but is less
        efficient than passing the metric name as a string.

        Distance matrices are not supported.

        Valid values for metric are:

        - from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
          'manhattan']

        - from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
          'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski',
          'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao',
          'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean',
          'yule']

        See the documentation for scipy.spatial.distance for details on these
        metrics.

    p : integer, optional (default = 2)
        Parameter for the Minkowski metric from
        sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
        equivalent to using manhattan_distance (l1), and euclidean_distance
        (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

    metric_params : dict, optional (default = None)
        Additional keyword arguments for the metric function.

    n_jobs : int or None, optional (default=None)
        The number of parallel jobs to run for neighbors search.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
```
Parameter summary:

Parameter | Meaning | Values |
---|---|---|
n_neighbors | number of neighbors to use | int, default 5 |
radius | search radius | neighbors within distance <= radius are selected |
algorithm | algorithm used to compute the nearest neighbors | {'auto', 'ball_tree', 'kd_tree', 'brute'}; 'auto' picks a suitable method automatically |
leaf_size | leaf size passed to the tree-based algorithms | int, default 30; passed to BallTree or KDTree |
metric | distance metric | from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']; from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'] |
p | parameter of the Minkowski metric | int, default 2 |
n_jobs | number of parallel jobs for the neighbor search | None means 1 unless in a joblib.parallel_backend context; -1 means all processors |
Example (from the docstring; the constructor arguments are passed by keyword here, which also works on current scikit-learn versions where they are keyword-only):

```python
    Examples
    --------
      >>> import numpy as np
      >>> from sklearn.neighbors import NearestNeighbors
      >>> samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]

      >>> neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
      >>> neigh.fit(samples)  #doctest: +ELLIPSIS
      NearestNeighbors(...)

      >>> neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=False)
      ... #doctest: +ELLIPSIS
      array([[2, 0]]...)

      >>> nbrs = neigh.radius_neighbors([[0, 0, 1.3]], 0.4, return_distance=False)
      >>> np.asarray(nbrs[0][0])
      array(2)

    See also
    --------
    KNeighborsClassifier
    RadiusNeighborsClassifier
    KNeighborsRegressor
    RadiusNeighborsRegressor
    BallTree
    """
```
5. References
1. Wikipedia: k-nearest neighbors algorithm
2. https://zhuanlan.zhihu.com/p/23191325
3. https://www.cnblogs.com/pythoner6833/p/9296035.html