3. k-Nearest Neighbors
3.1 Implementing the kNN Algorithm
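The k-nearest neighbors algorithm, also called kNN, is one of the simplest and most fundamental machine learning algorithms. The basic idea: split the dataset into a training set and a test set, keeping the two disjoint so that the measured accuracy is meaningful; for each test sample, compute its distance to every training sample and select the k training samples with the smallest distances; count the labels of those k samples and take the most frequent label as the prediction for the test sample; finally, predict every sample in the test set and compute the accuracy.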
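To make the "k nearest, then majority vote" step concrete, here is a minimal sketch on a hypothetical one-dimensional dataset. The data, the query point, and the use of the absolute difference as the distance are assumptions chosen only so the result can be checked by hand.

```python
from collections import Counter

# Hypothetical 1-D training data: (feature value, label)
train = [(1.0, 'A'), (1.5, 'A'), (2.0, 'A'), (6.0, 'B'), (7.0, 'B')]
query = 4.2
k = 3

# Sort the training samples by their distance to the query point and keep the k closest.
nearest = sorted(train, key=lambda sample: abs(sample[0] - query))[:k]

# Majority vote among the labels of the k nearest samples.
votes = Counter(label for _, label in nearest)
prediction = votes.most_common(1)[0][0]

print(nearest)      # [(6.0, 'B'), (2.0, 'A'), (1.5, 'A')]
print(prediction)   # A
```

Note that the single closest sample is labeled 'B', yet the prediction is 'A' because two of the three nearest neighbors vote for it; this is exactly the effect that the choice of k controls.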
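3.1.1 Implementing kNN by Hand
We use the Euclidean distance. In n-dimensional space, the distance between points x and y is:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$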
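As a quick check of the Euclidean distance (the square root of the sum of squared coordinate differences), the sketch below computes it in two ways with NumPy. The two points are hypothetical values picked so that the answer is easy to verify by hand.

```python
import numpy as np

# Two hypothetical points in 3-dimensional space
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Direct translation of the formula: sqrt(sum_i (x_i - y_i)^2)
d_formula = np.sqrt(np.sum((x - y) ** 2))

# The same quantity as the 2-norm of the difference vector
d_norm = np.linalg.norm(x - y)

print(d_formula, d_norm)  # 5.0 5.0
```

The coordinate differences are (-3, -4, 0), so the distance is sqrt(9 + 16 + 0) = 5.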
# Program 3-1
import numpy as np
from sklearn import datasets
from math import sqrt
from collections import Counter
iris = datasets.load_iris()
# Keys of the object returned by load_iris():
# data: feature matrix of the iris dataset, shape (150, 4)
# target: labels of the iris dataset, shape (150,)
# target_names: names of the labels, ['setosa' 'versicolor' 'virginica']
# DESCR: description of the iris dataset
# feature_names: names of the features,
#   ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# filename: location of the dataset file,
#   D:\Program Files (x86)\Python\lib\site-packages\sklearn\datasets\data\iris.csv
print(iris.keys())
np.random.seed(1)
# Shuffle the sample indices, then hold out 20% of the samples as the test set
random_index = np.random.permutation(len(iris.data))
proportion = 0.2
train_size = int((1 - proportion) * len(iris.data))
train_index = random_index[:train_size]
test_index = random_index[train_size:]
# Training set
# Equivalent: X_train = iris.data[train_index] (no column index; fancy indexing on rows only)
X_train = iris.data[train_index, :]
y_train = iris.target[train_index]
# Test set
X_test = iris.data[test_index, :]
y_test = iris.target[test_index]
print('X_train:\n', X_train.shape)
print('y_train:\n', y_train.shape)
def distance(k, X_train, y_train, x):
    # Compute the distance from x to every point in X_train and argsort the distances;
    # use the sorted indices to pick the labels of the k nearest points from y_train;
    # count those k labels and return the most frequent one.
    distances = []
    for x_train in X_train:
        dist = sqrt(sum((x_train - x)**2))
        distances.append(dist)
    distances_arg = np.argsort(distances)
    topK_y = [y_train[top_arg] for top_arg in distances_arg[:k]]
    top_counter = Counter(topK_y)
    top_most = top_counter.most_common(1)
    return top_most[0][0]
y_predict = []
for x_test in X_test:
    # Predict a label for each sample in X_test and store it in y_predict
    y_dis = distance(8, X_train, y_train, x_test)
    y_predict.append(y_dis)
# Convert the list to an array so the comparison with y_test is explicitly element-wise
accuracy_rate = np.sum(np.array(y_predict) == y_test) / len(y_test)
print('Prediction accuracy: ', accuracy_rate)
Output:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
X_train:
(120, 4)
y_train:
(120,)
Prediction accuracy: 0.9666666666666667
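The loop-based implementation above computes one distance at a time. As an alternative sketch, the same idea can be written with NumPy broadcasting so that all test-to-train distances are computed in one step. The function name `knn_predict` and the tiny two-cluster dataset are hypothetical and serve only as an illustration; this is not part of the original program.

```python
import numpy as np
from collections import Counter

def knn_predict(k, X_train, y_train, X_test):
    """Predict a label for every row of X_test by majority vote of its k nearest training rows."""
    # Broadcasting yields an array of shape (n_test, n_train, n_features)
    diff = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distance between every test row and every training row: shape (n_test, n_train)
    dists = np.sqrt(np.sum(diff ** 2, axis=2))
    # Indices of the k nearest training rows for each test row
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over the labels of those neighbors
    return np.array([Counter(y_train[row].tolist()).most_common(1)[0][0] for row in nearest])

# Hypothetical toy data: two well-separated clusters
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])

y_pred = knn_predict(3, X_train, y_train, X_test)
print(y_pred)  # [0 1]
```

The intermediate `diff` array holds n_test × n_train × n_features values, so this form trades memory for speed and suits small datasets such as iris.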
3.1.2 Using the kNN Implementation in sklearn
Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.fit
# Program 3-2
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
# Keys of the object returned by load_iris():
# data: feature matrix of the iris dataset, shape (150, 4)
# target: labels of the iris dataset, shape (150,)
# target_names: names of the labels, ['setosa' 'versicolor' 'virginica']
# DESCR: description of the iris dataset
# feature_names: names of the features,
#   ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# filename: location of the dataset file,
#   D:\Program Files (x86)\Python\lib\site-packages\sklearn\datasets\data\iris.csv
print(iris.keys())
# train_size is the fraction of samples used for the training set
# test_size is the fraction of samples used for the test set
# train_size + test_size = 1; prefer specifying test_size
# random_state is the random seed
# Equivalent to: X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
#                                                     train_size=0.8, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.2, random_state=123)
print('X_train:\n', X_train.shape)
print('X_test:\n', X_test.shape)
print('y_train:\n', y_train.shape)
print('y_test:\n', y_test.shape)
iris_kNNClassifier = KNeighborsClassifier(n_neighbors=8)
# Fit the kNN classifier on the training set to obtain the model
iris_kNNClassifier.fit(X_train, y_train)
# Use the model to predict the test set; the two commented lines are an
# equivalent way to obtain the accuracy
# y_predict = iris_kNNClassifier.predict(X_test)
# accuracy_rate = accuracy_score(y_test, y_predict)
accuracy_rate = iris_kNNClassifier.score(X_test, y_test)
print('Prediction accuracy: ', accuracy_rate)
Output:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
X_train:
(120, 4)
X_test:
(30, 4)
y_train:
(120,)
y_test:
(30,)
Prediction accuracy: 0.9666666666666667
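The program above contains two commented-out lines that compute the accuracy through `predict` and `accuracy_score` instead of `score`. The following sketch runs both routes side by side on the same split to show that they agree; the variable names `clf`, `acc_from_predictions`, and `acc_from_score` are chosen here for illustration only.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=123)

clf = KNeighborsClassifier(n_neighbors=8)
clf.fit(X_train, y_train)

# Route 1: predict explicitly, then compare the predictions with the true labels
y_predict = clf.predict(X_test)
acc_from_predictions = accuracy_score(y_test, y_predict)

# Route 2: let the classifier predict and score in one call
acc_from_score = clf.score(X_test, y_test)

print(acc_from_predictions == acc_from_score)  # True
```

Route 1 is the better fit when the predictions themselves are needed later, for example to inspect which samples were misclassified; route 2 is shorter when only the accuracy matters.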
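3.2 Hyperparameters
For the same learning algorithm, different parameter settings produce different models. How to tune those parameters is therefore an important question when applying a learning algorithm.
Hyperparameters are parameters that must be set before the algorithm runs; for kNN they are k, the distance weighting, and the Minkowski distance parameter p. Model parameters are parameters learned while the algorithm runs; kNN has no model parameters.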
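In sklearn these three hyperparameters correspond to the `n_neighbors`, `weights`, and `p` arguments of the `KNeighborsClassifier` constructor. The short sketch below sets all three and reads them back; the particular values are arbitrary illustrations, not recommendations.

```python
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors: k, the number of neighbors that vote
# weights: 'uniform' (every neighbor counts equally) or 'distance' (closer neighbors count more)
# p: exponent of the Minkowski distance (p=1 Manhattan, p=2 Euclidean)
clf = KNeighborsClassifier(n_neighbors=8, weights='distance', p=1)

params = clf.get_params()
print(params['n_neighbors'], params['weights'], params['p'])  # 8 distance 1
```

All three values are fixed before `fit` is called, which is what makes them hyperparameters rather than model parameters.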
3.2.1 The Hyperparameter k
Which value of k gives the highest accuracy?
# Program 3-3
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
'''
data: feature matrix of the iris dataset, shape (150, 4)
target: labels of the iris dataset, shape (150,)
target_names: names of the labels ['setosa' 'versicolor&#