Nearest Neighbors
sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point and predict the label from them. The number of samples can be a user-defined constant (k-nearest neighbor learning) or can vary based on the local density of points (radius-based neighbor learning). The distance can be any metric measure; the standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply remember all of the training data.
Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.
sklearn.neighbors can handle both NumPy arrays and scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.
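As a quick illustration, here is a minimal sketch (with a small toy matrix assumed purely for the example) of running a neighbor search on a scipy.sparse input with a Minkowski metric:
# coding: utf-8
# sketch: nearest-neighbor search on a sparse input (toy data, illustrative only)
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
X_sparse = csr_matrix(np.array([[0., 1.], [1., 0.], [2., 2.], [3., 3.]]))
# with p=2 the Minkowski metric is equivalent to the Euclidean distance
nn = NearestNeighbors(n_neighbors=2, metric='minkowski', p=2).fit(X_sparse)
print(nn.kneighbors(X_sparse, return_distance=False))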
Unsupervised Nearest Neighbors
NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword algorithm, which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data.
Finding the Nearest Neighbors
To find the nearest neighbors between two sets of data, the unsupervised algorithms within sklearn.neighbors can be used as follows:
# coding: utf-8
# finding the Nearest Neighbors
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1,], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
print(indices)
print(distances)
# produce a sparse graph showing the connections between neighboring points
print(nbrs.kneighbors_graph(X).toarray())
KDTree and BallTree
Alternatively, the KDTree or BallTree classes can be used directly to find nearest neighbors; these are the classes wrapped by NearestNeighbors.
# coding: utf-8
# using KDTree
from sklearn.neighbors import KDTree
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1,], [2, 1], [3, 2]])
kdt = KDTree(X, leaf_size=30, metric='euclidean')
print(kdt.query(X, k=2, return_distance=False))
Nearest Neighbors Classification
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class that has the most representatives among the point's nearest neighbors.
scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, while RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point.
The k-neighbors classification in KNeighborsClassifier is the most commonly used technique. A larger value of k suppresses the effects of noise, but makes the classification boundaries less distinct. When the data is not uniformly sampled, RadiusNeighborsClassifier can be a better choice.
The basic nearest neighbors classification uses uniform weights; the weights keyword can be set to control how much each neighbor contributes. For example:
# coding: utf-8
# Nearest Neighbors Classification
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
n_neighbors = 15
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
h = 0.02
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
for weights in ['uniform', 'distance']:
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
plt.show()
Nearest Neighbors Regression
Neighbors-based regression can be used when the data labels are continuous rather than discrete variables. The label assigned to a query point is computed from the mean of the labels of its nearest neighbors.
scikit-learn implements two different neighbors regressors: KNeighborsRegressor and RadiusNeighborsRegressor. As with nearest neighbors classification, different weights can be assigned to the neighbors by setting the weights parameter.
# coding: utf-8
# Nearest Neighbors regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
# add noise to every fifth target value
y[::5] += 1 * (0.5 - np.random.randn(8))
n_neighbors = 5
for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)
    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, color='darkorange', label='data')
    plt.plot(T, y_, color='navy', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')"
              % (n_neighbors, weights))
plt.tight_layout()
plt.show()
Nearest Neighbor Algorithms
Brute Force
Brute-force search computes the distances between all pairs of points in the dataset.
K-D Tree
Tree-based data structures attempt to reduce the required number of distance computations; the K-D tree recursively partitions the dataset along the Cartesian axes.
Ball Tree
The ball tree partitions the data into a series of nested hyper-spheres, where each hyper-sphere is defined by a centroid and a radius.
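All three strategies are available through the same estimators via the algorithm keyword described above. A minimal sketch on random toy data (illustrative only), showing that the three strategies return the same neighbors:
# coding: utf-8
# sketch: comparing the neighbor-search algorithms on random toy data
import numpy as np
from sklearn.neighbors import NearestNeighbors
rng = np.random.RandomState(0)
X = rng.rand(100, 3)  # 100 random points in 3 dimensions
for algorithm in ['brute', 'kd_tree', 'ball_tree']:
    nn = NearestNeighbors(n_neighbors=3, algorithm=algorithm).fit(X)
    # all three algorithms should agree on the indices of the nearest neighbors
    print(algorithm, nn.kneighbors(X[:1], return_distance=False))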
Nearest Centroid Classifier
The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. It has no parameters to choose, making it a good baseline classifier.
# coding: utf-8
# Nearest Centroid Classification
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import NearestCentroid
n_neighbors = 15
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
h = .02 # step size in the mesh
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
for shrinkage in [None, .2]:
    clf = NearestCentroid(shrink_threshold=shrinkage)
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print(shrinkage, np.mean(y == y_pred))
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.title("3-Class classification (shrink_threshold=%r)"
              % shrinkage)
    plt.axis('tight')
plt.show()
Nearest Neighbors Transformer
Many scikit-learn estimators rely on nearest neighbors: for example, KNeighborsClassifier and KNeighborsRegressor, several clustering methods such as DBSCAN and SpectralClustering, and some manifold embeddings such as TSNE and Isomap.
All of these estimators can compute the nearest neighbors internally, but most of them also accept a precomputed nearest neighbors sparse graph, as given by kneighbors_graph and radius_neighbors_graph.
# coding: utf-8
# Caching nearest neighbors
from tempfile import TemporaryDirectory
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsTransformer, KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
X, y = load_digits(return_X_y=True)
n_neighbors_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# return a distance sparse graph
graph_model = KNeighborsTransformer(n_neighbors=max(n_neighbors_list),
mode='distance')
classifier_model = KNeighborsClassifier(metric='precomputed')
with TemporaryDirectory(prefix='sklearn_graph_cache_') as tempdir:
    full_model = Pipeline(
        steps=[('graph', graph_model), ('classifier', classifier_model)],
        memory=tempdir)
    param_grid = {'classifier__n_neighbors': n_neighbors_list}
    grid_model = GridSearchCV(full_model, param_grid)
    grid_model.fit(X, y)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].errorbar(x=n_neighbors_list, y=grid_model.cv_results_['mean_test_score'],
yerr=grid_model.cv_results_['std_test_score'])
axes[0].set(xlabel='n_neighbors', title='Classification accuracy')
axes[1].errorbar(x=n_neighbors_list, y=grid_model.cv_results_['mean_fit_time'],
yerr=grid_model.cv_results_['std_fit_time'], color='r')
axes[1].set(xlabel='n_neighbors', title='Fit time (with caching)')
fig.tight_layout()
plt.show()
Neighborhood Components Analysis
Neighborhood Components Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning method that aims to improve the accuracy of nearest neighbors classification. The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It can also learn a low-dimensional linear projection of the data that can be used for data visualization and fast classification.
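Used on its own, NCA behaves like a supervised transformer: fit learns a linear transformation from the labeled training data and transform applies it. A minimal sketch on the iris data with default settings (illustrative only):
# coding: utf-8
# sketch: standalone use of NeighborhoodComponentsAnalysis (illustrative only)
from sklearn.datasets import load_iris
from sklearn.neighbors import NeighborhoodComponentsAnalysis
X, y = load_iris(return_X_y=True)
nca = NeighborhoodComponentsAnalysis(random_state=0)
X_transformed = nca.fit_transform(X, y)  # learn the transformation from the labels
print(X_transformed.shape)  # keeps the original number of features by default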
Classification
Combined with a nearest neighbors classifier, NCA is attractive for classification because it can naturally handle multi-class problems without any increase in the model size, and it does not introduce additional parameters that require fine-tuning by the user.
# coding: utf-8
# Comparing Nearest Neighbors with and without Neighborhood Components Analysis
# using the Euclidean distance on the original features, versus
# using the Euclidean distance after the transformation learned
# by Neighborhood Components Analysis
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import (KNeighborsClassifier,
NeighborhoodComponentsAnalysis)
from sklearn.pipeline import Pipeline
n_neighbors = 1
dataset = datasets.load_iris()
X, y = dataset.data, dataset.target
X = X[:, [0, 2]]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.7, random_state=42)
h = .01
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
names = ['KNN', 'NCA, KNN']
classifiers = [Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors))]),
Pipeline([('scaler', StandardScaler()),
('nca', NeighborhoodComponentsAnalysis()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors))])]
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("{} (k = {})".format(name, n_neighbors))
    plt.text(0.9, 0.1, '{:.2f}'.format(score), size=15,
             ha='center', va='center', transform=plt.gca().transAxes)
plt.show()
Dimensionality Reduction
NCA can also be used to perform supervised dimensionality reduction. The desired output dimensionality can be set using the n_components parameter.
# coding: utf-8
# Dimensionality Reduction with Neighborhood Components Analysis
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import (KNeighborsClassifier, NeighborhoodComponentsAnalysis)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
n_neighbors = 3
random_state = 0
X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
stratify=y, random_state=random_state)
dim = len(X[0])
n_classes = len(np.unique(y))
pca = make_pipeline(StandardScaler(),
PCA(n_components=2, random_state=random_state))
lda = make_pipeline(StandardScaler(),
LinearDiscriminantAnalysis(n_components=2))
nca = make_pipeline(StandardScaler(),
NeighborhoodComponentsAnalysis(n_components=2,
random_state=random_state))
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
dim_reduction_methods = [('PCA', pca), ('LDA', lda), ('NCA', nca)]
for i, (name, model) in enumerate(dim_reduction_methods):
    plt.figure()
    model.fit(X_train, y_train)
    knn.fit(model.transform(X_train), y_train)
    acc_knn = knn.score(model.transform(X_test), y_test)
    X_embedded = model.transform(X)
    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=30, cmap='Set1')
    plt.title("{}, KNN (k={})\nTest accuracy = {:.2f}".format(name,
                                                              n_neighbors,
                                                              acc_knn))
plt.show()