K-Nearest Neighbors Classification (Python Implementation)

Latest recommended article published on 2024-07-11 21:12:54

羽星_s

Tags: python, classification, machine learning, K-nearest neighbors

Article link: https://blog.csdn.net/qq_20144897/article/details/123486736

Introduction

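Today I studied the k-nearest neighbors (KNN) method from 《统计学习方法》 (Statistical Learning Methods) and, with reference to Liu Jianping's (刘建平) blog, implemented a classifier in Python.
I recommend Liu Jianping's summary of the KNN method; it is very well written.
The theory behind KNN is not repeated here; interested readers can study it on their own.
My academic background is limited, so please point out any mistakes you find.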

Implementation

Importing packages

import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain .py script
%matplotlib inline
from sklearn.datasets import make_blobs
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

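Data preparation

The dataset is generated with make_blobs(). Its parameters are listed in the table below:

Parameter	Meaning
`n_features`	number of features per sample
`n_samples`	number of samples
`centers`	number of cluster centers, i.e. the number of distinct labels
`random_state`	random seed; fixing it makes the generated data reproducible
`cluster_std`	standard deviation of each cluster

Here X holds the sample features and Y the class labels: 1000 samples with 2 features each, drawn from 3 clusters (one cluster per class) with standard deviation 1, and the random seed fixed at 40.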

X, Y = make_blobs(n_samples=1000,
                  n_features=2,
                  cluster_std=1,
                  centers=3,
                  random_state=40)
# Scatter plot of the samples, colored by class label
plt.scatter(X[:, 0], X[:, 1], marker='o', c=Y)
# plt.savefig('random_samples_scatter.png')
plt.show()

Output:
[Figure: scatter plot of the generated samples, colored by class]
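
Since the figure itself is not reproduced here, a quick numeric check of the generated data can stand in for it. The following is a minimal sketch that reuses the same make_blobs() arguments as above and prints the array shapes and the number of samples per class; the variable names labels and counts are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same arguments as in the article
X, Y = make_blobs(n_samples=1000, n_features=2, cluster_std=1,
                  centers=3, random_state=40)

print(X.shape)  # feature matrix: one row per sample, one column per feature
print(Y.shape)  # one integer class label per sample

# Count how many samples fall into each class
labels, counts = np.unique(Y, return_counts=True)
print(labels, counts)
```

With an integer n_samples, make_blobs() splits the samples as evenly as possible across the centers, so the three classes should be nearly balanced.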

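Building and tuning the KNN classifier

For what each KNN parameter does, see Liu Jianping's blog post "scikit-learn K近邻法类库使用小结" (a summary of the scikit-learn KNN classes).
《统计学习方法》 names three elements of the k-nearest neighbors method: the choice of k, the distance metric, and the classification decision rule. GridSearchCV is used here to search a parameter grid for the best combination:
- k (n_neighbors): from 10 to 100 in steps of 10
- neighbor weighting (weights): uniform or distance
- search algorithm (algorithm): brute
- distance metric (metric): euclidean (Euclidean distance), manhattan (Manhattan distance), minkowski (Minkowski distance; with scikit-learn's default p=2 this coincides with euclidean)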
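
To make the three elements concrete, here is a minimal from-scratch sketch of brute-force KNN with Euclidean distance and an unweighted majority vote. It is not how scikit-learn implements KNeighborsClassifier; the function knn_predict and the toy arrays X_toy and y_toy are hypothetical names used only for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Hypothetical helper: brute-force KNN, Euclidean distance, majority vote."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)

    # Distance metric: pairwise Euclidean distances, shape (n_query, n_train)
    diff = X_query[:, None, :] - X_train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Choice of k: indices of the k closest training points for each query
    nearest = np.argsort(dist, axis=1)[:, :k]

    # Decision rule: unweighted majority vote among the k neighbors
    preds = []
    for neighbor_labels in y_train[nearest]:
        values, counts = np.unique(neighbor_labels, return_counts=True)
        preds.append(values[np.argmax(counts)])  # ties go to the smallest label
    return np.array(preds)

# Toy data: two well-separated groups of three points each
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_pred = knn_predict(X_toy, y_toy, [[0.2, 0.1], [5.5, 5.2]], k=3)
print(toy_pred)
```

The unweighted vote corresponds to weights='uniform'; a distance-weighted vote (weights='distance') would instead weight each neighbor by the inverse of its distance.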
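Training uses 10-fold cross-validation. Note the scoring parameter of GridSearchCV, which sets the metric used to score each candidate model. I chose f1_micro, the micro-averaged variant of the F1 score. Other metrics can be used instead, such as accuracy; for a single-label multiclass problem like this one, micro-averaged F1 is in fact equal to accuracy.
Scoring options for classification models (scoring parameter)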

scoring value	Corresponding function
`accuracy`	metrics.accuracy_score
`average_precision`	metrics.average_precision_score
`f1`	metrics.f1_score (binary targets)
`f1_micro`	metrics.f1_score (micro-averaged)
`f1_macro`	metrics.f1_score (macro-averaged)
`f1_weighted`	metrics.f1_score (weighted average)
`f1_samples`	metrics.f1_score (per-sample average, multilabel)
`neg_log_loss`	metrics.log_loss
`precision etc.`	metrics.precision_score
`recall etc.`	metrics.recall_score
`roc_auc`	metrics.roc_auc_score
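
The different f1 variants in this table differ only in how per-class results are averaged. The sketch below, on a small made-up label vector (y_true and y_pred are invented for illustration, not taken from the article's data), compares micro-averaged F1, macro-averaged F1 and accuracy. For single-label multiclass targets, micro-averaged F1 comes out equal to accuracy, while macro-averaged F1 generally does not.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up single-label, three-class example
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 2])

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over classes
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1

print(acc, f1_micro, f1_macro)
```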

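Scoring options for regression models (scoring parameter). The neg_ scorers return the negated error, so a larger score is always better.

scoring value	Corresponding function
`neg_mean_absolute_error`	metrics.mean_absolute_error
`neg_mean_squared_error`	metrics.mean_squared_error
`neg_median_absolute_error`	metrics.median_absolute_error
`r2`	metrics.r2_score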

from sklearn import neighbors
from sklearn.model_selection import GridSearchCV

# Parameter grid: 10 values of k x 2 weightings x 1 algorithm x 3 metrics
param_grid = [{'n_neighbors': np.arange(10, 101, 10),
               'weights': ['uniform', 'distance'],
               'algorithm': ['brute'],
               'metric': ['euclidean', 'manhattan', 'minkowski']}]
KNN_class = neighbors.KNeighborsClassifier()
# 10-fold cross-validated grid search, scored by micro-averaged F1
grid_search = GridSearchCV(KNN_class,
                           param_grid,
                           cv=10,
                           scoring='f1_micro',
                           return_train_score=True)
grid_search.fit(X, Y)

Output:

GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
             param_grid=[{'algorithm': ['brute'],
                          'metric': ['euclidean', 'manhattan', 'minkowski'],
                          'n_neighbors': array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100]),
                          'weights': ['uniform', 'distance']}],
             return_train_score=True, scoring='f1_micro')
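
One thing worth checking about this grid: KNeighborsClassifier leaves p at its default of 2, and Minkowski distance with p=2 is Euclidean distance, so the 'minkowski' entries should duplicate the 'euclidean' ones. The sketch below tests this under stated assumptions: it regenerates the same data, fixes k at 10 (an arbitrary choice), and uses hypothetical uniformly drawn query points.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, Y = make_blobs(n_samples=1000, n_features=2, cluster_std=1,
                  centers=3, random_state=40)

# Hypothetical query points, drawn uniformly over the range of the data
rng = np.random.RandomState(0)
queries = rng.uniform(X.min(axis=0), X.max(axis=0), size=(200, 2))

knn_euclid = KNeighborsClassifier(n_neighbors=10, algorithm="brute",
                                  metric="euclidean").fit(X, Y)
knn_mink = KNeighborsClassifier(n_neighbors=10, algorithm="brute",
                                metric="minkowski", p=2).fit(X, Y)

# Compare neighbor distances and predicted labels between the two models
dist_e, _ = knn_euclid.kneighbors(queries)
dist_m, _ = knn_mink.kneighbors(queries)
same_distances = np.allclose(dist_e, dist_m)
same_predictions = np.array_equal(knn_euclid.predict(queries),
                                  knn_mink.predict(queries))
print(same_distances, same_predictions)
```

If both checks pass, the grid effectively contains 40 distinct configurations rather than 60. To compare genuinely different Minkowski distances, one option would be to add p to the parameter grid.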

Best model parameters

Show the parameters of the best model

final_model = grid_search.best_estimator_
grid_search.best_params_

Output:

{'algorithm': 'brute',
 'metric': 'euclidean',
 'n_neighbors': 80,
 'weights': 'uniform'}

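The best parameters are therefore k = 80, distance metric euclidean, weights uniform, and algorithm brute.

Show the grid search scores

The grid search covers $10 \times 3 \times 2 = 60$ parameter combinations (10 values of k, 3 distance metrics, 2 weighting schemes), so there are 60 scores as well. Only the first few are shown here for brevity.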

cvres = grid_search.cv_results_
for f,params in zip(cvres["mean_test_score"],cvres["params"]):
    print("{:.2}".format(f),params)

Output:

0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 10, 'weights': 'uniform'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 10, 'weights': 'distance'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 20, 'weights': 'uniform'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 20, 'weights': 'distance'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 30, 'weights': 'uniform'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 30, 'weights': 'distance'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 40, 'weights': 'uniform'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 40, 'weights': 'distance'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 50, 'weights': 'uniform'}
0.99 {'algorithm': 'brute', 'metric': 'euclidean', 'n_neighbors': 50, 'weights': 'distance'}
...
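
Because every score above rounds to 0.99, the two-digit printout cannot tell the candidates apart. A possible way to inspect them more closely is sketched below: it reruns the same search (without return_train_score, which is not needed here), confirms the number of combinations, and prints the five best-ranked ones with four decimals and their standard deviation across folds. The variable order is introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, Y = make_blobs(n_samples=1000, n_features=2, cluster_std=1,
                  centers=3, random_state=40)

param_grid = {'n_neighbors': np.arange(10, 101, 10),
              'weights': ['uniform', 'distance'],
              'algorithm': ['brute'],
              'metric': ['euclidean', 'manhattan', 'minkowski']}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid,
                           cv=10, scoring='f1_micro')
grid_search.fit(X, Y)

cvres = grid_search.cv_results_
print(len(cvres["params"]))  # number of evaluated combinations

# Sort candidates by rank (1 = best mean test score) and show the top five
order = np.argsort(cvres["rank_test_score"])
for i in order[:5]:
    print("{:.4f} +/- {:.4f}".format(cvres["mean_test_score"][i],
                                     cvres["std_test_score"][i]),
          cvres["params"][i])
```

When the gap between candidates is smaller than the fold-to-fold standard deviation, the exact winner is best read as one of several roughly equivalent settings.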

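Prediction

First determine the bounds of the training set, then build a regular grid of points covering that region and classify every grid point with the best model.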

from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# Bounds of the training data, padded by 1 on each side
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
# Build a regular grid (step 0.02) over that region to serve as test points
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
# final_model is the best model found by the grid search
Z = final_model.predict(np.c_[xx.ravel(), yy.ravel()])
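
The meshgrid / ravel / np.c_ combination above can be hard to picture at full resolution. The following sketch shows the same steps on a tiny 3 x 2 grid. The line that computes Z_small is only a stand-in for final_model.predict, and all names ending in _small are hypothetical.

```python
import numpy as np

# Tiny grid: x takes the values 0, 1, 2 and y takes the values 0, 1
xx_small, yy_small = np.meshgrid(np.arange(0, 3), np.arange(0, 2))
print(xx_small.shape)  # rows follow y, columns follow x

# Flatten both coordinate arrays and pair them up: one (x, y) row per grid point
points_small = np.c_[xx_small.ravel(), yy_small.ravel()]
print(points_small)

# Stand-in for a classifier: label 1 where x > 0, else label 0
Z_small = (points_small[:, 0] > 0).astype(int)

# Reshape the flat predictions back to the grid layout for plotting
Z_grid = Z_small.reshape(xx_small.shape)
print(Z_grid)
```

Reshaping the predictions back to xx.shape is what lets pcolormesh draw each prediction at the location of its grid point.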

Visualization

Use matplotlib to visualize the predictions.

# Draw the predicted class of every grid point as a colored background
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Overlay all training samples
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = 80, weights = 'uniform')")
# plt.savefig('prediction_result.png')
plt.show()