Nearest Neighbors

sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning) or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric: the standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply remember all of their training data.

Despite its simplicity, nearest neighbors has been successfully applied to a large number of classification and regression problems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in situations where the decision boundary is very irregular.

sklearn.neighbors can handle both NumPy arrays and scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.
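
As a quick illustration (a minimal sketch, not from the original post, using the NearestNeighbors interface introduced in the next section), a scipy.sparse matrix can be passed in directly; with sparse input the brute-force search is used:

# coding: utf-8
# a minimal sketch (illustrative data): sparse input falls back to brute force

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

X_sparse = csr_matrix([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
nn = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(X_sparse)
distances, indices = nn.kneighbors(X_sparse)
print(indices)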

Unsupervised Nearest Neighbors

NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword algorithm, which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data.

Finding the Nearest Neighbors

For the simple task of finding the nearest neighbors between two sets of data, the unsupervised algorithms within sklearn.neighbors can be used as follows:

# coding: utf-8
# finding the Nearest Neighbors

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)

print(indices)
print(distances)
# produce a sparse graph showing the connections between neighboring points
print(nbrs.kneighbors_graph(X).toarray())
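
The same fitted estimator also supports radius-based queries, matching the radius-based learning mentioned above (a minimal sketch continuing the example; the radius of 1.5 is an illustrative choice):

# a minimal sketch (continuing the example above): radius-based query
distances, indices = nbrs.radius_neighbors(X, radius=1.5)
print(indices)  # each entry lists all neighbors within radius 1.5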
KDTree and BallTree

Alternatively, the KDTree or BallTree classes can be used directly to find nearest neighbors. This is the functionality wrapped by the NearestNeighbors class used above.

# coding: utf-8
# using KDTree

from sklearn.neighbors import KDTree
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
kdt = KDTree(X, leaf_size=30, metric='euclidean')
print(kdt.query(X, k=2, return_distance=False))
Nearest Neighbors Classification

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the class that has the most representatives within the nearest neighbors of the point.

scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, while RadiusNeighborsClassifier implements learning based on the neighbors within a fixed radius r of each training point.

The k-neighbors classification in KNeighborsClassifier is the most commonly used technique. A larger value of k suppresses the effects of noise but makes the classification boundaries less distinct. When the data is not uniformly sampled, RadiusNeighborsClassifier can be a better choice.
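
As a hedged sketch (the toy data and radius=1.0 are illustrative choices, not from the original post), the radius-based classifier votes among all training points that fall within the given radius of the query point:

# coding: utf-8
# a minimal sketch (illustrative data): radius-based classification

from sklearn.neighbors import RadiusNeighborsClassifier

X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]
clf = RadiusNeighborsClassifier(radius=1.0)
clf.fit(X_train, y_train)
print(clf.predict([[2.5]]))  # both neighbors within radius 1.0 are class 1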

The basic nearest neighbors classification uses uniform weights. The weights keyword can be set to control how much each neighbor contributes, as in the following example:

# coding: utf-8
# Nearest Neighbors Classification

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 15

iris = datasets.load_iris()

X = iris.data[:, :2]
y = iris.target

h = 0.02  # step size in the mesh

cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])

for weights in ['uniform', 'distance']:

    clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    clf.fit(X, y)

    x_min, x_max = X[:, 0].min()-1, X[:, 0].max()+1
    y_min, y_max = X[:, 1].min()-1, X[:, 1].max()+1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-class classification (k = %i, weights = '%s')" % (n_neighbors, weights))

plt.show()
Nearest Neighbors Regression

Neighbors-based regression can be used when the data labels are continuous rather than discrete variables. The label assigned to a query point is computed from the mean of the labels of its nearest neighbors.

scikit-learn implements two different neighbors regressors: KNeighborsRegressor and RadiusNeighborsRegressor. As with nearest neighbors classification, the weights parameter can be set to assign different weights to the neighbors.

# coding: utf-8
# Nearest Neighbors regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors

np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()

# add noise to every 5th target value
y[::5] += 1 * (0.5 - np.random.randn(8))

n_neighbors = 5

for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)

    plt.subplot(2, 1, i+1)
    plt.scatter(X, y, color='darkorange', label='data')
    plt.plot(T, y_, color='navy', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s'" % (n_neighbors, weights))

plt.tight_layout()
plt.show()
Nearest Neighbor Algorithms
Brute Force

Brute force computes the distances between all pairs of points in the dataset: for N samples in D dimensions, this approach scales as O(D N^2).

K-D Tree

Tree-based data structures attempt to reduce the required number of distance calculations. The K-D tree recursively partitions the dataset along the Cartesian axes.

Ball Tree

The ball tree partitions the data in a series of nested hyper-spheres, each defined by a centroid and a radius. It was developed to address the inefficiency of K-D trees in higher dimensions.
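
All three backends can be selected through the algorithm keyword introduced earlier and return the same neighbors (a minimal sketch; the random data is an illustrative choice):

# coding: utf-8
# a minimal sketch (illustrative data): selecting the search algorithm

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(100, 3)

for algorithm in ['brute', 'kd_tree', 'ball_tree']:
    nn = NearestNeighbors(n_neighbors=3, algorithm=algorithm).fit(X)
    distances, indices = nn.kneighbors(X[:1])
    print(algorithm, indices)  # identical results from each backend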

Nearest Centroid Classifier

The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. It has no parameters to choose, making it a good baseline classifier; the optional shrink_threshold parameter, used in the example below, implements the nearest shrunken centroid variant.

# coding: utf-8
# Nearest Centroid Classification

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import NearestCentroid


iris = datasets.load_iris()

X = iris.data[:, :2]
y = iris.target

h = .02  # step size in the mesh

cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])

for shrinkage in [None, .2]:

    clf = NearestCentroid(shrink_threshold=shrinkage)
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print(shrinkage, np.mean(y == y_pred))  # training accuracy

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.title("3-Class classification (shrink_threshold=%r)"
              % shrinkage)
    plt.axis('tight')

plt.show()
Nearest Neighbors Transformer

Many scikit-learn estimators rely on nearest neighbors: several classifiers and regressors such as KNeighborsClassifier and KNeighborsRegressor, some clustering methods such as DBSCAN and SpectralClustering, and some manifold embeddings such as TSNE and Isomap.

All of these estimators can compute the nearest neighbors internally, but most of them also accept a precomputed nearest neighbors sparse graph, as produced by kneighbors_graph and radius_neighbors_graph.

# coding: utf-8
# Caching nearest neighbors

from tempfile import TemporaryDirectory
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsTransformer, KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
n_neighbors_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# return a distance sparse graph
graph_model = KNeighborsTransformer(n_neighbors=max(n_neighbors_list),
                                    mode='distance')
classifier_model = KNeighborsClassifier(metric='precomputed')

with TemporaryDirectory(prefix='sklearn_graph_cache_') as tempdir:
    full_model = Pipeline(
        steps=[('graph', graph_model), ('classifier', classifier_model)],
        memory=tempdir)

    param_grid = {'classifier__n_neighbors': n_neighbors_list}
    grid_model = GridSearchCV(full_model, param_grid)
    grid_model.fit(X, y)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].errorbar(x=n_neighbors_list, y=grid_model.cv_results_['mean_test_score'],
                 yerr=grid_model.cv_results_['std_test_score'])
axes[0].set(xlabel='n_neighbors', title='Classification accuracy')
axes[1].errorbar(x=n_neighbors_list, y=grid_model.cv_results_['mean_fit_time'],
                 yerr=grid_model.cv_results_['std_fit_time'], color='r')
axes[1].set(xlabel='n_neighbors', title='Fit time (with caching)')
fig.tight_layout()
plt.show()
Neighborhood Components Analysis

Neighborhood Components Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning method that aims to improve the accuracy of nearest neighbors classification. The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It can also learn a low-dimensional linear projection of the data, which can then be used for data visualization and fast classification.
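
NeighborhoodComponentsAnalysis follows the usual transformer API, as in this minimal sketch (illustrative, not from the original post):

# coding: utf-8
# a minimal sketch (illustrative): the NCA transformer API on iris

from sklearn.datasets import load_iris
from sklearn.neighbors import NeighborhoodComponentsAnalysis

X, y = load_iris(return_X_y=True)
nca = NeighborhoodComponentsAnalysis(random_state=42)
X_transformed = nca.fit(X, y).transform(X)
print(X_transformed.shape)  # (150, 4): a learned linear transformation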

Classification

Combined with a nearest neighbors classifier, NCA is attractive for classification because it naturally handles multi-class problems without any increase in model size, and it introduces no additional parameters that require fine-tuning by the user.

# coding: utf-8
# Comparing Nearest Neighbors with and without Neighborhood Components Analysis
# using the Euclidean distance on the original features, versus
# using the Euclidean distance after the transformation learned
# by Neighborhood Components Analysis

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import (KNeighborsClassifier,
                               NeighborhoodComponentsAnalysis)
from sklearn.pipeline import Pipeline

n_neighbors = 1

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

X = X[:, [0, 2]]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.7, random_state=42)

h = .01

cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

names = ['KNN', 'NCA, KNN']

classifiers = [Pipeline([('scaler', StandardScaler()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))]),
               Pipeline([('scaler', StandardScaler()),
                         ('nca', NeighborhoodComponentsAnalysis()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))])]

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

for name, clf in zip(names, classifiers):

    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=.8)

    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("{} (k = {})".format(name, n_neighbors))
    plt.text(0.9, 0.1, '{:.2f}'.format(score), size=15,
             ha='center', va='center', transform=plt.gca().transAxes)

plt.show()
Dimensionality Reduction

NCA can be used to perform supervised dimensionality reduction. The desired dimensionality can be set using the parameter n_components.

# coding: utf-8
# Dimensionality Reduction with Neighborhood Components Analysis

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import (KNeighborsClassifier, NeighborhoodComponentsAnalysis)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

n_neighbors = 3
random_state = 0

X, y = datasets.load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    stratify=y, random_state=random_state)

dim = len(X[0])
n_classes = len(np.unique(y))

pca = make_pipeline(StandardScaler(),
                    PCA(n_components=2, random_state=random_state))

lda = make_pipeline(StandardScaler(),
                    LinearDiscriminantAnalysis(n_components=2))

nca = make_pipeline(StandardScaler(),
                    NeighborhoodComponentsAnalysis(n_components=2,
                                                   random_state=random_state))

knn = KNeighborsClassifier(n_neighbors=n_neighbors)

dim_reduction_methods = [('PCA', pca), ('LDA', lda), ('NCA', nca)]

for i, (name, model) in enumerate(dim_reduction_methods):
    plt.figure()

    model.fit(X_train, y_train)

    knn.fit(model.transform(X_train), y_train)

    acc_knn = knn.score(model.transform(X_test), y_test)

    X_embedded = model.transform(X)

    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=30, cmap='Set1')
    plt.title("{}, KNN (k={})\nTest accuracy = {:.2f}".format(name,
                                                              n_neighbors,
                                                              acc_knn))

plt.show()