Nearest Neighbors
sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point and predict the label from them. The number of samples can be a user-defined constant (k-nearest neighbor learning) or can vary based on the local density of points (radius-based neighbor learning). The distance can be any metric measure; the standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply remember all of the training data.
Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.
sklearn.neighbors can handle both NumPy arrays and scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.
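As a quick illustration, here is a minimal sketch (with a small toy matrix assumed purely for the example) of running a neighbor search on a scipy.sparse input with a Minkowski metric:
# coding: utf-8
# sketch: nearest-neighbor search on a sparse input (toy data, illustrative only)
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
X_sparse = csr_matrix(np.array([[0., 1.], [1., 0.], [2., 2.], [3., 3.]]))
# with p=2 the Minkowski metric is equivalent to the Euclidean distance
nn = NearestNeighbors(n_neighbors=2, metric='minkowski', p=2).fit(X_sparse)
print(nn.kneighbors(X_sparse, return_distance=False))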
Unsupervised Nearest Neighbors
NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword algorithm, which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data.
Finding the Nearest Neighbors
To find the nearest neighbors between two sets of data, the unsupervised algorithms within sklearn.neighbors can be used as follows:
# coding: utf-8
# finding the Nearest Neighbors
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1,], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
print(indices)
print(distances)
# produce a sparse graph showing the connections between neighboring points
print(nbrs.kneighbors_graph(X).toarray())
KDTree and BallTree
Alternatively, the KDTree or BallTree classes can be used directly to find nearest neighbors; these are the classes wrapped by NearestNeighbors.
# coding: utf-8
# using KDTree
from sklearn.neighbors import KDTree
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1,], [2, 1], [3, 2]])
kdt = KDTree(X, leaf_size=30, metric='euclidean')
print(kdt.query(X, k=2, return_distance=False))
Nearest Neighbors Classification
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class that has the most representatives among the point's nearest neighbors.
scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, while RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point.
The k-neighbors classification in KNeighborsClassifier is the most commonly used technique. A larger value of k suppresses the effects of noise, but makes the classification boundaries less distinct. When the data is not uniformly sampled, RadiusNeighborsClassifier can be a better choice.
The basic nearest neighbors classification uses uniform weights; the weights keyword can be set to control how much each neighbor contributes. For example:
# coding: utf-8
# Nearest Neighbors Classification
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
n_neighbors = 15
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
h = 0.02
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
for weights in ['uniform', 'distance']:
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
plt.show()
Nearest Neighbors Regression
Neighbors-based regression can be used when the data labels are continuous rather than discrete variables. The label assigned to a query point is computed from the mean of the labels of its nearest neighbors.
scikit-learn implements two different neighbors regressors: KNeighborsRegressor and RadiusNeighborsRegressor. As with nearest neighbors classification, different weights can be assigned to the neighbors by setting the weights parameter.
# coding: utf-8
# Nearest Neighbors regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
# add noise to every fifth target value
y[::5] += 1 * (0.5 - np.random.randn(8))
n_neighbors = 5
for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)
    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, color='darkorange', label='data')
    plt.plot(T, y_, color='navy', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')"
              % (n_neighbors, weights))
plt.tight_layout()
plt.show()
Nearest Neighbor Algorithms
Brute Force
Brute-force search computes the distances between all pairs of points in the dataset.
K-D Tree
Tree-based data structures attempt to reduce the required number of distance computations; the K-D tree recursively partitions the dataset along the Cartesian axes.
Ball Tree
The ball tree partitions the data into a series of nested hyper-spheres, where each hyper-sphere is defined by a centroid and a radius.
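All three strategies are available through the same estimators via the algorithm keyword described above. A minimal sketch on random toy data (illustrative only), showing that the three strategies return the same neighbors:
# coding: utf-8
# sketch: comparing the neighbor-search algorithms on random toy data
import numpy as np
from sklearn.neighbors import NearestNeighbors
rng = np.random.RandomState(0)
X = rng.rand(100, 3)  # 100 random points in 3 dimensions
for algorithm in ['brute', 'kd_tree', 'ball_tree']:
    nn = NearestNeighbors(n_neighbors=3, algorithm=algorithm).fit(X)
    # all three algorithms should agree on the indices of the nearest neighbors
    print(algorithm, nn.kneighbors(X[:1], return_distance=False))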
Nearest Centroid Classifier
The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. It has no parameters to choose, making it a good baseline classifier.
# coding: utf-8
# Nearest Centroid Classification
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import NearestCentroid
n_neighbors = 15
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
h = .02 # step size in the mesh
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
for shrinkage in [None, .2]:
    clf = NearestCentroid(shrink_threshold=shrinkage)
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print(shrinkage, np.mean(y == y_pred))
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.title("3-Class classification (shrink_threshold=%r)"
              % shrinkage)
    plt.axis('tight')
plt.show()
Nearest Neighbors Transformer
Many scikit-learn estimators rely on nearest neighbors: for example, KNeighborsClassifier and KNeighborsRegressor, several clustering methods such as DBSCAN and SpectralClustering, and some manifold embeddings such as TSNE and Isomap.
All of these estimators can compute the nearest neighbors internally, but most of them also accept a precomputed nearest neighbors sparse graph, as given by kneighbors_graph and radius_neighbors_graph.
# coding: utf-8
# Caching nearest neighbors
from tempfile import TemporaryDirectory
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsTransformer, KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
X, y = load_digits(return_X_y=True)
n_neighbors_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# return a distance sparse graph
graph_model = KNeighborsTransformer(n_neighbors=max(n_neighbors_list),
mode='distance')
classifier_model = KNeighborsClassifier(metric='precomputed')
with TemporaryDirectory(prefix='sklearn_graph_cache_') as tempdir:
    full_model = Pipeline(
        steps=[('graph', graph_model), ('classifier', classifier_model)],
        memory=tempdir)
    param_grid = {'classifier__n_neighbors': n_neighbors_list}
    grid_model = GridSearchCV(full_model, param_grid)
    grid_model.fit(X, y)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].errorbar(x=n_neighbors_list, y=grid_model.cv_results_['mean_test_score'],
yerr=grid_model.cv_results_['std_test_score'])
axes[0].set(xlabel='n_neighbors', title='Classification accuracy')
axes[1].errorbar(x=n_neighbors_list, y=grid_model.cv_results_['mean_fit_time'],
yerr=grid_model.cv_results_['std_fit_time'], color='r')
axes[1].set(xlabel='n_neighbors', title='Fit time (with caching)')
fig.tight_layout()
plt.show()
Neighborhood Components Analysis
Neighborhood Components Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning method that aims to improve the accuracy of nearest neighbors classification. The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It can also learn a low-dimensional linear projection of the data that can be used for data visualization and fast classification.
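Used on its own, NCA behaves like a supervised transformer: fit learns a linear transformation from the labeled training data and transform applies it. A minimal sketch on the iris data with default settings (illustrative only):
# coding: utf-8
# sketch: standalone use of NeighborhoodComponentsAnalysis (illustrative only)
from sklearn.datasets import load_iris
from sklearn.neighbors import NeighborhoodComponentsAnalysis
X, y = load_iris(return_X_y=True)
nca = NeighborhoodComponentsAnalysis(random_state=0)
X_transformed = nca.fit_transform(X, y)  # learn the transformation from the labels
print(X_transformed.shape)  # keeps the original number of features by default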
Classification
Combined with a nearest neighbors classifier, NCA is attractive for classification because it can naturally handle multi-class problems without any increase in the model size, and it does not introduce additional parameters that require fine-tuning by the user.
# coding: utf-8
# Comparing Nearest Neighbors with and without Neighborhood Components Analysis
# using the Euclidean distance on the original features, versus
# using the Euclidean distance after the transformation learned
# by Neighborhood Components Analysis
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import (KNeighborsClassifier,
NeighborhoodComponentsAnalysis)
from sklearn.pipeline import Pipeline
n_neighbors = 1
dataset = datasets.load_iris()
X, y = dataset.data, dataset.target
X = X[:, [0, 2]]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.7, random_state=42)
h = .01
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
names = ['KNN', 'NCA, KNN']
classifiers = [Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors))]),
Pipeline([('scaler', StandardScaler()),
('nca', NeighborhoodComponentsAnalysis()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors))])]
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("{} (k = {})".format(name, n_neighbors))
    plt.text(0.9, 0.1, '{:.2f}'.format(score), size=15,
             ha='center', va='center', transform=plt.gca().transAxes)
plt.show()
Dimensionality Reduction
NCA can also be used to perform supervised dimensionality reduction. The desired output dimensionality can be set using the n_components parameter.
# coding: utf-8
# Dimensionality Reduction with Neighborhood Components Analysis
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import (KNeighborsClassifier, NeighborhoodComponentsAnalysis)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
n_neighbors = 3
random_state = 0
X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
stratify=y, random_state=random_state)
dim = len(X[0])
n_classes = len(np.unique(y))
pca = make_pipeline(StandardScaler(),
PCA(n_components=2, random_state=random_state))
lda = make_pipeline(StandardScaler(),
LinearDiscriminantAnalysis(n_components=2))
nca = make_pipeline(StandardScaler(),
NeighborhoodComponentsAnalysis(n_components=2,
random_state=random_state))
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
dim_reduction_methods = [('PCA', pca), ('LDA', lda), ('NCA', nca)]
for i, (name, model) in enumerate(dim_reduction_methods):
    plt.figure()
    model.fit(X_train, y_train)
    knn.fit(model.transform(X_train), y_train)
    acc_knn = knn.score(model.transform(X_test), y_test)
    X_embedded = model.transform(X)
    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=30, cmap='Set1')
    plt.title("{}, KNN (k={})\nTest accuracy = {:.2f}".format(name,
                                                              n_neighbors,
                                                              acc_knn))
plt.show()