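I. Overview
The kNN (k-nearest neighbors) algorithm classifies samples by measuring the distance between their feature values. For each point in the dataset whose class is unknown, it performs the following steps:
(1) Compute the distance between the current point and every point in the labeled dataset;
(2) Sort the points in order of increasing distance;
(3) Select the k points closest to the current point;
(4) Determine how often each class occurs among those k points;
(5) Return the most frequent class among those k points as the predicted class of the current point.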
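The five steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used later in this post: the function name `knn_classify`, the toy points and labels, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(query, train_points, train_labels, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # (1) Euclidean distance from every labeled point to the query point.
    distances = [math.dist(query, p) for p in train_points]
    # (2) Indices of the labeled points, sorted by increasing distance.
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    # (3) Keep the k closest points.
    nearest = order[:k]
    # (4) Count how often each class appears among them.
    votes = Counter(train_labels[i] for i in nearest)
    # (5) Return the most frequent class.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points near (1, 1) labeled "A", two near (0, 0) labeled "B".
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.1, 0.2), points, labels, k=3))  # B
```

Sorting all distances costs O(n log n) per query, which is fine for small datasets; library implementations typically use tree-based or other indexed searches instead.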
II. Code Implementation
1. Implementation with the scikit-learn package
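import numpy as np
from sklearn import neighbors

def split_data(data, test_size):
    # Randomly split the rows of `data` into a training set and a test set;
    # `test_size` is the fraction of rows assigned to the test set.
    data_num = data.shape[0]
    train_ind = list(range(data_num))
    test_ind = []
    test_num = int(data_num * test_size)
    for i in range(test_num):
        # Pick a random position among the remaining training indices
        rand_ind = np.random.randint(0, len(train_ind))
        # Move the row index stored at that position (not the position itself) to the test set
        test_ind.append(train_ind[rand_ind])
        del train_ind[rand_ind]
    train_data = data[train_ind]
    test_data = data[test_ind]
    return train_data, test_data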
# load the data and divide the data
mydata = n