Principle
k-Nearest Neighbor (kNN) is a common supervised learning method and one of the simplest, most widely used classification algorithms. It should not be confused with the unsupervised K-means clustering algorithm.
The basic idea: given a test sample, find the k training samples closest to it under some distance metric, then predict from the information of these k "neighbors". For classification, use majority voting; for regression, use averaging.
A distinctive trait of kNN is "lazy learning": there is no explicit training phase. The training samples are simply stored, and computation is deferred until a test sample arrives, at which point the k nearest samples are found.
Factors that affect the prediction:
- The value of k: different k values can produce different predictions
- The distance metric: different metrics can find noticeably different "neighbors"
Choosing k
The choice of k matters, so how do we determine it? By cross-validation.
Split the data into a training set and a validation set at some ratio. Starting from a small k, increase k step by step and compute the error rate on the validation set; the k with the lowest validation error is the final choice.
As k increases from small values, the error rate generally drops first, since voting over more neighbors is more stable; but when k becomes too large, the error rate rises again. For example, with 35 samples, taking k=30 makes kNN nearly meaningless, since the prediction approaches the overall majority class.
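This cross-validation procedure can be sketched with sklearn's off-the-shelf KNeighborsClassifier and cross_val_score. The example below is illustrative only: it scans candidate k values on the iris dataset and keeps the one with the lowest mean validation error; the range 1..15 and cv=5 are arbitrary choices for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

x, y = load_iris(return_X_y=True)
errors = {}
for k in range(1, 16):
    # mean 5-fold cross-validated accuracy for this k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), x, y, cv=5)
    errors[k] = 1.0 - scores.mean()  # validation error rate
best_k = min(errors, key=errors.get)  # k with the lowest validation error
print("best k:", best_k)
```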
Distance metrics
Common distance metrics include the Euclidean, Manhattan, and cosine distances. kNN most commonly uses the Euclidean distance, defined as
d_{xy}=\sqrt{\sum_{k=1}^{n}(x_k-y_k)^2}
where n is the dimensionality; for example, in a two-dimensional space, n=2.
Compute the distance from the test sample to each training sample in turn, and select the k samples with the smallest distances.
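The distance-and-select step above can be vectorized with NumPy broadcasting; the array contents here are made-up toy data:

```python
import numpy as np

train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])  # toy training samples
test_point = np.array([0.0, 0.0])

# Euclidean distance from the test point to every training sample
d = np.sqrt(np.sum((train - test_point) ** 2, axis=1))

k = 2
nearest = np.argsort(d)[:k]  # indices of the k closest training samples
print(d, nearest)
```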
For a classification task, the prediction is the most frequent class among the k samples (majority voting); for a regression task, the prediction is the mean of the k samples' target values:
\hat{y}=\frac{1}{K}\sum_{i=1}^{K}y_i
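The averaging rule translates directly into code. A minimal kNN regression sketch (function and variable names here are illustrative, not from any library):

```python
import numpy as np

def knn_regress(x_query, x_train, y_train, K=3):
    # distances from the query point to every training sample
    d = np.sqrt(np.sum((x_train - x_query) ** 2, axis=1))
    nearest = np.argsort(d)[:K]      # indices of the K closest samples
    return y_train[nearest].mean()   # y_hat = (1/K) * sum of neighbor targets

x_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
y_hat = knn_regress(np.array([0.5]), x_train, y_train, K=2)
print(y_hat)  # → 0.5, the mean target of the two nearest samples
```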
Pros and cons of kNN
Advantages:
- Simple and easy to implement
- High accuracy; relatively insensitive to outliers (for a suitably large k)
Disadvantages:
- High computational cost: the distance to every training sample must be computed, which becomes very expensive on large datasets
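One standard mitigation is a spatial index such as a KD-tree, which avoids a brute-force scan over all training samples. A sketch using sklearn.neighbors.KDTree (the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
x_train = rng.normal(size=(10000, 8))

tree = KDTree(x_train)                    # build the index once
dist, ind = tree.query(x_train[:1], k=3)  # 3 nearest neighbors of the first sample
print(ind)  # the nearest neighbor of a training point is itself (distance 0)
```

Note that KD-trees help most in low to moderate dimensions; in very high dimensions their advantage over brute force shrinks.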
Implementation
Although many library implementations already exist (sklearn, torch, etc.), implementing kNN yourself is a better way to internalize the algorithm; you can then validate your predictions against a library's results.
Version 1
import numpy as np

def KNN(test_data1, train_data_pca, train_label, k, p):
    # Lp-norm distance from the test sample to every training sample
    subMat = train_data_pca - np.tile(test_data1, (train_data_pca.shape[0], 1))
    subMat = np.abs(subMat)
    distance = subMat ** p
    distance = np.sum(distance, axis=1)
    distance = distance ** (1.0 / p)
    distanceIndex = np.argsort(distance)
    classCount = np.zeros(41)  # 41 classes, hardcoded for this dataset
    for i in range(k):
        # vote: count the labels of the k nearest training samples
        label = train_label[distanceIndex[i]]
        classCount[label] = classCount[label] + 1
    return np.argmax(classCount)
def test(k, p, train_data_pca, test_data_pca, train_labels, test_labels):
    train_labels, test_labels = train_labels.squeeze(), test_labels.squeeze()
    print("testing with K=%d and lp norm p=%d" % (k, p))
    m, n = np.shape(test_data_pca)
    correctCount = 0
    M = np.zeros((41, 41), int)  # confusion matrix
    for i in range(m):
        test_data1 = test_data_pca[i, :]
        predict_label = KNN(test_data1, train_data_pca, train_labels, k, p)
        true_label = test_labels[i]
        M[true_label][predict_label] += 1
        # print("predict:%d,true:%d" % (predict_label, true_label))
        if true_label == predict_label:
            correctCount += 1
    print("The accuracy is: %f" % (float(correctCount) / m))
    print("Confusion matrix:", M)
- Usage
test(3, 2, train_data_pca, test_data_pca, train_labels, test_labels)
Data format: (n_samples, dim)
Version 2
import numpy as np
import operator
import matplotlib.pyplot as plt

class KNN(object):
    def __init__(self, k=3):
        self.k = k

    def fit(self, x, y):
        self.x = x
        self.y = y.squeeze()

    def _square_distance(self, v1, v2):
        return np.sum(np.square(v1 - v2))

    def _vote(self, ys):
        # majority vote over the labels of the k nearest neighbors
        vote_dict = {}
        for y in ys:
            if y not in vote_dict.keys():
                vote_dict[y] = 1
            else:
                vote_dict[y] += 1
        sorted_vote_dict = sorted(vote_dict.items(), key=operator.itemgetter(1), reverse=True)
        return sorted_vote_dict[0][0]

    def predict(self, x):
        y_pred = []
        for i in range(len(x)):
            dist_arr = [self._square_distance(x[i], self.x[j]) for j in range(len(self.x))]
            sorted_index = np.argsort(dist_arr)
            top_k_index = sorted_index[0:self.k]
            y_pred.append(self._vote(ys=self.y[top_k_index]))
        return np.array(y_pred)

    def score(self, y_true=None, y_pred=None):
        if y_true is None and y_pred is None:
            y_pred = self.predict(self.x)
            y_true = self.y
        score = 0.0
        y_true = y_true.squeeze()
        for i in range(len(y_true)):
            if y_true[i] == y_pred[i]:
                score += 1
        score /= len(y_true)
        return score

if __name__ == '__main__':
    # data generation: two Gaussian clusters
    np.random.seed(314)
    data_size_1 = 300
    x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
    x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
    y_1 = [0 for _ in range(data_size_1)]
    data_size_2 = 400
    x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
    x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
    y_2 = [1 for _ in range(data_size_2)]
    x1 = np.concatenate((x1_1, x1_2), axis=0)
    x2 = np.concatenate((x2_1, x2_2), axis=0)
    x = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
    y = np.concatenate((y_1, y_2), axis=0)
    # shuffle and split into train/test
    data_size_all = data_size_1 + data_size_2
    shuffled_index = np.random.permutation(data_size_all)
    x = x[shuffled_index]
    y = y[shuffled_index]
    split_index = int(data_size_all * 0.7)
    x_train = x[:split_index]
    y_train = y[:split_index]
    x_test = x[split_index:]
    y_test = y[split_index:]
    # visualize data
    plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, marker='.')
    plt.show()
    plt.scatter(x_test[:, 0], x_test[:, 1], c=y_test, marker='.')
    plt.show()
    # data preprocessing: per-feature min-max normalization
    x_train = (x_train - np.min(x_train, axis=0)) / (np.max(x_train, axis=0) - np.min(x_train, axis=0))
    x_test = (x_test - np.min(x_test, axis=0)) / (np.max(x_test, axis=0) - np.min(x_test, axis=0))
    # knn classifier
    clf = KNN(k=3)
    clf.fit(x_train, y_train)
    print('train accuracy: {:.3}'.format(clf.score()))
    y_test_pred = clf.predict(x_test)
    print('test accuracy: {:.3}'.format(clf.score(y_test, y_test_pred)))
Version 3 (recommended)
import numpy as np
import operator
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split

class KNN(object):
    def __init__(self, k=3):
        self.k = k
        self.result = None

    def fit(self, x, y):
        self.x = x
        self.y = y.squeeze()

    def _square_distance(self, v1, v2):
        return np.sum(np.square(v1 - v2))

    def _vote(self, ys):
        # majority vote over the labels of the k nearest neighbors
        vote_dict = {}
        for y in ys:
            if y not in vote_dict.keys():
                vote_dict[y] = 1
            else:
                vote_dict[y] += 1
        sorted_vote_dict = sorted(vote_dict.items(), key=operator.itemgetter(1), reverse=True)
        return sorted_vote_dict[0][0]

    def predict(self, x):
        y_pred = []
        for i in range(len(x)):
            dist_arr = [self._square_distance(x[i], self.x[j]) for j in range(len(self.x))]
            sorted_index = np.argsort(dist_arr)
            top_k_index = sorted_index[0:self.k]
            y_pred.append(self._vote(ys=self.y[top_k_index]))
        self.result = np.array(y_pred)
        return self.result

    def score(self, y_true=None, y_pred=None):
        if y_true is None and y_pred is None:
            y_pred = self.predict(self.x)
            y_true = self.y
        score = 0.0
        y_true = y_true.squeeze()
        for i in range(len(y_true)):
            if y_true[i] == y_pred[i]:
                score += 1
        score /= len(y_true)
        return score

    # compute accuracy (as a percentage) against the stored predictions
    def get_accuracy(self, y_test):
        assert self.result is not None, "must predict before calculating accuracy!"
        assert y_test.shape[0] == self.result.shape[0], "number of test labels must equal number of predictions!"
        correct = 0
        for i in range(len(self.result)):
            if y_test[i] == self.result[i]:
                correct += 1
        return (correct / float(len(self.result))) * 100.0

if __name__ == '__main__':
    # test knn on the iris dataset
    iris_data = datasets.load_iris()
    all_data = iris_data['data']
    all_label = iris_data['target']
    flower_name = iris_data['target_names']
    train_data, val_data, train_label, val_label = train_test_split(all_data, all_label, test_size=0.2,
                                                                    random_state=1, stratify=all_label)
    knn_clf = KNN(k=1)
    knn_clf.fit(train_data, train_label)
    result = knn_clf.predict(val_data)
    acc = knn_clf.get_accuracy(val_label)
    print(result)
    print(val_label)
    print(acc)