Basic Concepts of the KNN Classification Algorithm
1. To classify a sample, look at the classes of its K nearest neighbors and decide the sample's class by majority vote.
2. How the distance between two samples is computed:
For two samples a and b, the number of features per sample equals the dimensionality of the sample coordinate space.
- Euclidean distance (each sample has two features): $\sqrt{(x^a_1-x^b_1)^2+(x^a_2-x^b_2)^2}$
- Euclidean distance (each sample has $n$ features): $\sqrt{(x^a_1-x^b_1)^2+(x^a_2-x^b_2)^2+\dots+(x^a_n-x^b_n)^2}=\sqrt{\sum^{n}_{i=1}(x^a_i-x^b_i)^2}=\left(\sum^{n}_{i=1}(x^a_i-x^b_i)^2\right)^{\frac{1}{2}}$
- Manhattan distance: $\sum^{n}_{i=1}|x^a_i-x^b_i|$ (note the absolute values, so differences cannot cancel)
- Minkowski distance: $\left(\sum^{n}_{i=1}|x^a_i-x^b_i|^p\right)^{\frac{1}{p}}$, which gives the Manhattan distance at $p=1$ and the Euclidean distance at $p=2$.
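These formulas map directly onto NumPy; a small sketch with made-up feature vectors (the values are illustrative, not from the text):

```python
import numpy as np

# two made-up samples a and b with n = 3 features each
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean distance (p = 2)
euclidean = np.sqrt(np.sum((a - b) ** 2))
# Manhattan distance (p = 1): note the absolute values
manhattan = np.sum(np.abs(a - b))

# general Minkowski distance with parameter p
def minkowski(u, v, p):
    return np.sum(np.abs(u - v) ** p) ** (1 / p)

print(euclidean)  # sqrt(9 + 4 + 0) = sqrt(13)
print(manhattan)  # 3 + 2 + 0 = 5
```

Setting `p=2` in `minkowski` recovers `euclidean`, and `p=1` recovers `manhattan`.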
3. Distance weighting:
Suppose k = 3 and the green sample's three nearest neighbors are 2 blue samples and 1 red sample, but the red sample is far closer to the green sample than either blue one. Which class should the green sample get? A plain vote says blue, which motivates distance weights.
Using the reciprocal of each distance as a weight: if the red neighbor is at distance 1 and the blue neighbors are at distances 3 and 4, the red weight is $1$ while the blue weight is $\frac{1}{3}+\frac{1}{4}=\frac{7}{12}$, so the sample is assigned to red, the class with the larger total weight.
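The weighted vote can be reproduced in a few lines; the labels and distances below are the hypothetical ones from the example (red at distance 1, blues at distances 3 and 4):

```python
from collections import Counter

# hypothetical k = 3 neighbors as (label, distance) pairs
neighbors = [("red", 1.0), ("blue", 3.0), ("blue", 4.0)]

# accumulate inverse-distance weights per label
weights = Counter()
for label, dist in neighbors:
    weights[label] += 1 / dist

# red: 1.0, blue: 1/3 + 1/4 = 7/12, so red wins despite losing the raw vote
winner = weights.most_common(1)[0][0]
print(winner)  # red
```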
4. KNN needs no explicit training phase: the dataset itself serves as the model (it is a "lazy" learner).
KNN in scikit-learn
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# sample set; each sample has two features
raw_data_X = [[3.393533211, 2.331273381], [3.110073483, 1.781539638],
              [1.343808831, 3.368360954], [3.582294042, 4.679179110],
              [2.280362439, 2.866990263], [7.423436942, 4.696522875],
              [5.745051997, 3.533989803], [9.172168622, 2.511101045],
              [7.792783481, 3.424088941], [7.939820817, 0.791637231]]
# sample labels
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
# test samples
x = np.array([[8.093607318, 3.365731514], [5.15198764, 4.5168763]])
# k = 6; weights: 'uniform' ignores distance, 'distance' weights votes by
# inverse distance; p selects the Minkowski metric (p=2 is Euclidean)
kNN_classifier = KNeighborsClassifier(n_neighbors=6, weights='uniform', p=2)
# "training" (kNN merely stores the sample set)
kNN_classifier.fit(X_train, y_train)
# prediction
y_predict = kNN_classifier.predict(x)
```
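After `fit`, `predict` returns one label per test row. A minimal, self-contained sketch (the four training points and the query below are made-up values, not the sample set above) that also peeks at the vote shares via `predict_proba`:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up data: two tight clusters, one per class
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [7.0, 8.0], [8.0, 8.5]])
y_train = np.array([0, 0, 1, 1])

# weights='distance' applies the inverse-distance weighting from section 3
clf = KNeighborsClassifier(n_neighbors=3, weights='distance', p=2)
clf.fit(X_train, y_train)

query = np.array([[1.2, 1.9]])        # sits inside the class-0 cluster
print(clf.predict(query))             # [0]
print(clf.predict_proba(query))       # weighted vote share per class
```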
Implementing KNN from Scratch
```python
import numpy as np
from math import sqrt
from collections import Counter

class KNNClassifier:

    def __init__(self, k):
        assert k >= 1, "k must be a valid positive integer"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        assert X_train.shape[0] == y_train.shape[0], \
            "the sizes of the training set and the label set must be equal"
        assert self.k <= X_train.shape[0], \
            "the training set must contain at least k samples"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "samples to predict must have the same feature count as the training set"
        # predict the samples in X_predict one by one
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        # (Euclidean) distances from x to every training sample
        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        # indices of the training samples sorted by distance
        nearest = np.argsort(distances)
        # labels of the k nearest samples
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        # count votes per label, e.g. votes --> {0: 1, 1: 5}
        votes = Counter(topK_y)
        # most_common(1) returns a list of (label, count) tuples, e.g. [(1, 5)]
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k={})".format(self.k)
```
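The return statement in `_predict` leans on `Counter.most_common`; that voting step is easy to sanity-check in isolation with a hypothetical neighbor list:

```python
from collections import Counter

# hypothetical labels of the k = 6 nearest neighbors
topK_y = [1, 0, 1, 1, 0, 1]

votes = Counter(topK_y)
print(votes)                 # Counter({1: 4, 0: 2})
print(votes.most_common(1))  # [(1, 4)], a list with one (label, count) tuple
print(votes.most_common(1)[0][0])  # 1, the winning label
```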
KNN Summary
- Naturally handles multi-class problems; the underlying idea is simple.
- Can also be used for regression; see KNeighborsRegressor in the scikit-learn API.
- Drawbacks of KNN: prediction is slow (every query is compared against the whole training set), the result offers little interpretability, and accuracy degrades in high dimensions (the curse of dimensionality).
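The curse-of-dimensionality point can be made concrete: with random data, the gap between the nearest and farthest neighbor shrinks relative to the distances themselves as dimensionality grows, so "nearest" becomes less meaningful. The sample counts and seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# relative spread of distances from a query point to 500 random points,
# measured at low and high dimensionality
contrast = {}
for dim in (2, 1000):
    points = rng.random((500, dim))   # 500 random "training" points
    query = rng.random(dim)           # one random query point
    d = np.linalg.norm(points - query, axis=1)
    contrast[dim] = (d.max() - d.min()) / d.min()

print(contrast)  # the relative spread shrinks sharply as dim grows
```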