k-Nearest Neighbor Algorithm - Supervised Learning - Machine Learning
Principle
The core idea of the k-nearest neighbor algorithm (k-Nearest Neighbor, kNN) is to decide the class of an unlabeled sample by a majority vote among its k nearest neighbors.
Let X_test be the sample to be labeled and X_train the labeled data set. The algorithm proceeds as follows:
(1) Iterate over every sample in X_train, compute its distance to X_test, and store the distances in an array D[].
(2) Sort D[] and take the k points with the smallest distances, denoted X_knn.
(3) Count the number of samples of each class in X_knn, i.e., how many samples of class0 and how many of class1 appear in X_knn.
The label assigned to the sample is the class with the most samples in X_knn.
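These three steps condense into a few lines of NumPy. Below is a minimal sketch, separate from the full script that follows; it assumes X_train is a 2-D NumPy array, y_train a 1-D label array, and x_test a single sample:

import numpy as np
from collections import Counter

def knn_predict(x_test, X_train, y_train, k=3):
    # (1) Euclidean distance from x_test to every sample in X_train
    D = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # (2) indices of the k nearest neighbors
    knn_idx = np.argsort(D)[:k]
    # (3) majority vote among the neighbors' labels
    return Counter(y_train[knn_idx]).most_common(1)[0][0]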
The full program, knn鸢尾花分类.py (kNN iris classification):
import math
import operator

import numpy as np
import pandas as pd


# Compute the Euclidean distance between two sample points
def get_euc_dist(ins1, ins2, dim):
    dist = 0
    for i in range(dim):
        dist += pow((ins1[i] - ins2[i]), 2)
    return math.sqrt(dist)


# Collect the labels of the k nearest neighbors of a test sample
def get_neighbors(test_sample, train_set, train_set_y, k=3):
    dist_list = []
    dim = len(test_sample)
    for x in range(len(train_set)):
        dist = get_euc_dist(test_sample, train_set[x], dim)
        dist_list.append((train_set_y[x], dist))
    dist_list.sort(key=operator.itemgetter(1))  # sort (label, dist) pairs by distance
    test_sample_neighbors = []
    for i in range(k):  # keep the k nearest samples
        test_sample_neighbors.append(dist_list[i][0])
    return test_sample_neighbors


# Predict the class of a sample by majority vote among its neighbors
def predict_class_label(neighbors):
    class_labels = {}
    # Count the votes for each label
    for x in range(len(neighbors)):
        neighbor_label = neighbors[x]
        if neighbor_label in class_labels:
            class_labels[neighbor_label] += 1
        else:
            class_labels[neighbor_label] = 1
    label_sorted = sorted(class_labels.items(), key=operator.itemgetter(1), reverse=True)
    return label_sorted[0][0]


# Compute the prediction accuracy as a percentage
def get_accuracy(test_labels, pre_labels):
    correct = 0
    for x in range(len(test_labels)):
        if test_labels[x] == pre_labels[x]:
            correct += 1
    return (correct / float(len(test_labels))) * 100.0


if __name__ == '__main__':
    column_names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
    iris_data = pd.read_csv("iris_training.csv", header=0, names=column_names)
    print(iris_data.head(5))  # show the first 5 rows
    iris_data = np.array(iris_data)
    iris_train, iris_train_y = iris_data[:, 0:4], iris_data[:, 4]
    iris_test = pd.read_csv("iris_test.csv", header=0, names=column_names)
    iris_test = np.array(iris_test)
    iris_test, iris_test_y = iris_test[:, 0:4], iris_test[:, 4]
    print(iris_test_y)
    pre_labels = []
    k = 3
    for x in range(len(iris_test)):
        neighbors = get_neighbors(iris_test[x], iris_train, iris_train_y, k)  # iris_test[x] is one test row
        result = predict_class_label(neighbors)
        pre_labels.append(result)
        print('predicted class: ' + repr(result) + ', actual class=' + repr(iris_test_y[x]))
    print('predicted labels: ' + repr(pre_labels))
    accuracy = get_accuracy(iris_test_y, pre_labels)
    print('Accuracy: ' + repr(accuracy) + '%')
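Incidentally, the manual vote counting in predict_class_label can be replaced by the standard library's collections.Counter. An equivalent, more compact variant (a rewrite, not the original code):

from collections import Counter

def predict_class_label(neighbors):
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(neighbors).most_common(1)[0][0]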
The sklearn library provides the KNeighborsClassifier class, which implements the k-nearest neighbor algorithm:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=5)
knn = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')  # kd-tree speeds up neighbor search
knn.fit(X_train, y_train)
print("mean accuracy on the test set: {:.2f}".format(knn.score(X_test, y_test)))
Summary
kNN is a supervised classification algorithm. Its strengths are a simple idea, easy implementation, no parameters to estimate, and high classification accuracy. Its weaknesses are that the prediction depends on the choice of k, it is easily affected by noisy data, the vote is biased toward the classes with more training samples when the data set is imbalanced, and it has high computational complexity.
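One common mitigation for noise and class imbalance is distance-weighted voting, where closer neighbors carry larger votes. scikit-learn supports this via the weights parameter; a short sketch reusing the variables from the sklearn example above (an option not used in the original):

knn_weighted = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_weighted.fit(X_train, y_train)
print("weighted-vote test accuracy: {:.2f}".format(knn_weighted.score(X_test, y_test)))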
kNN has no training phase; it predicts the class of a new sample purely by voting. Even so, several aspects of the algorithm deserve attention, such as the choice of k, the distance metric, and fast algorithms for retrieving the k nearest neighbors (kd-trees and the like).
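For fast neighbor retrieval, scikit-learn also exposes the kd-tree directly. A minimal sketch of building one and querying the 3 nearest neighbors of the first test sample, reusing the iris arrays from the sklearn example above (illustrative only):

from sklearn.neighbors import KDTree

tree = KDTree(X_train)                   # build the kd-tree once
dist, idx = tree.query(X_test[:1], k=3)  # distances and indices of the 3 nearest neighbors
print(dist, y_train[idx])                # neighbor distances and their class labels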