Data Preparation
Since kNN handles multi-class problems natively, we directly use the original ten-class MNIST handwritten-digit dataset, i.e. mnist.csv from https://github.com/phdsky/xCode/tree/main/机器学习/统计学习方法/data.
The k-Nearest Neighbors Algorithm
The idea behind kNN is simple: for each input $x$, compute the distance from every point in the sample space to $x$, take the $k$ points with the smallest distances, and let them vote on the class of $x$. kNN therefore has no explicit training phase. The algorithm proceeds as follows:
1. Input: a training set $T = \{(x_1, y_1), \dots, (x_N, y_N)\}$, a distance metric, and the neighbor count $k$.
2. Find the $k$ training points nearest to $x$ under the chosen metric.
3. Output the majority class among those $k$ neighbors as the prediction for $x$.
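The notion of "nearest" here is defined by the distance metric. The code below uses the Minkowski ($L_p$) distance, whose standard definition is:

```latex
L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}
```

With $p = 2$ this is the Euclidean distance, with $p = 1$ the Manhattan distance, and as $p \to \infty$ it approaches the maximum coordinate difference (Chebyshev distance).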
For every input sample $x$, kNN has to scan the entire sample space once. Each scan takes $O(n)$ time, so for $N$ input samples the total time complexity is $O(nN)$, i.e. $O(n^2)$ when $N$ is on the order of $n$, which is unacceptable for large amounts of input data. The code implementation is as follows:
```python
# @Author: phd
# @Date: 19-4-17
# @Site: github.com/phdsky
# @Description:
#   KNN has no explicit training process
#   and can deal with multi-class classification

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def calc_accuracy(y_pred, y_truth):
    assert len(y_pred) == len(y_truth)
    n = len(y_pred)

    hit_count = 0
    for i in range(0, n):
        if y_pred[i] == y_truth[i]:
            hit_count += 1

    print("Predicting accuracy %f\n" % (hit_count / n))


def minkowski(xi, xj, p):
    assert len(xi) == len(xj)

    # Generic Minkowski distance (too slow in pure Python):
    # distance = 0
    # for i in range(0, n):
    #     distance += pow(abs(xi[i] - xj[i]), p)
    # distance = pow(distance, 1 / p)

    # Euclidean distance (p = 2) via numpy
    distance = np.linalg.norm(xi - xj)

    return distance


class KNN(object):
    def __init__(self, k, p):
        self.k = k
        self.p = p

    def vote(self, k_vec):
        assert len(k_vec) == self.k

        flag = np.full(10, 0)  # Ten labels
        for i in range(0, self.k):
            flag[k_vec[i][1]] += 1

        return np.argmax(flag)

    def predict(self, X_train, y_train, X_test):
        n = len(X_test)
        m = len(X_train)
        predict_label = np.full(n, -1)

        for i in range(0, n):
            to_predict = X_test[i]

            distances = []
            for j in range(0, m):
                to_compare = X_train[j]
                distances.append(minkowski(to_predict, to_compare, self.p))

            distances_label = list(zip(distances, y_train))
            distances_label.sort(key=lambda kv: kv[0])

            predict_label[i] = self.vote(distances_label[0:self.k])
            # print("Nearest neighbour is %s" % X_train[predict_label[i]])
            print("Sample %d predicted as %d" % (i, predict_label[i]))

        return predict_label


def example():
    print("Start testing on simple dataset...")

    X_train = np.asarray([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
    y_train = np.asarray([0, 1, 2, 3, 4, 5])
    X_test = np.asarray([[3, 5]])

    knn = KNN(k=1, p=2)  # p=2: Euclidean distance
    y_predicted = knn.predict(X_train=X_train, y_train=y_train, X_test=X_test)

    print("Simple testing done...\n")


if __name__ == "__main__":
    # example()

    mnist_data = pd.read_csv("../data/mnist.csv")
    mnist_values = mnist_data.values

    images = mnist_values[::, 1::]
    labels = mnist_values[::, 0]

    X_train, X_test, y_train, y_test = train_test_split(
        images, labels, test_size=100, random_state=42
    )

    knn = KNN(k=10, p=2)  # p=2: Euclidean distance

    # Start predicting, training process omitted
    print("Testing on %d samples..." % len(X_test))
    y_predicted = knn.predict(X_train=X_train, y_train=y_train, X_test=X_test)

    calc_accuracy(y_pred=y_predicted, y_truth=y_test)
```
Code output:
/Users/phd/Softwares/anaconda3/bin/python /Users/phd/Desktop/ML/knn/knn.py
Testing on 100 samples...
Sample 0 predicted as 8
Sample 1 predicted as 1
Sample 2 predicted as 9
Sample 3 predicted as 9
Sample 4 predicted as 8
Sample 5 predicted as 6
Sample 6 predicted as 2
Sample 7 predicted as 2
Sample 8 predicted as 7
Sample 9 predicted as 1
Sample 10 predicted as 6
Sample 11 predicted as 3
Sample 12 predicted as 1
Sample 13 predicted as 2
Sample 14 predicted as 7
Sample 15 predicted as 4
Sample 16 predicted as 3
Sample 17 predicted as 3
Sample 18 predicted as 6
Sample 19 predicted as 4
Sample 20 predicted as 9
Sample 21 predicted as 5
Sample 22 predicted as 2
Sample 23 predicted as 6
Sample 24 predicted as 0
Sample 25 predicted as 0
Sample 26 predicted as 0
Sample 27 predicted as 8
Sample 28 predicted as 6
Sample 29 predicted as 3
Sample 30 predicted as 6
Sample 31 predicted as 6
Sample 32 predicted as 1
Sample 33 predicted as 9
Sample 34 predicted as 8
Sample 35 predicted as 6
Sample 36 predicted as 7
Sample 37 predicted as 3
Sample 38 predicted as 6
Sample 39 predicted as 1
Sample 40 predicted as 9
Sample 41 predicted as 7
Sample 42 predicted as 9
Sample 43 predicted as 6
Sample 44 predicted as 8
Sample 45 predicted as 3
Sample 46 predicted as 4
Sample 47 predicted as 2
Sample 48 predicted as 7
Sample 49 predicted as 8
Sample 50 predicted as 4
Sample 51 predicted as 3
Sample 52 predicted as 3
Sample 53 predicted as 7
Sample 54 predicted as 1
Sample 55 predicted as 2
Sample 56 predicted as 6
Sample 57 predicted as 2
Sample 58 predicted as 9
Sample 59 predicted as 6
Sample 60 predicted as 4
Sample 61 predicted as 0
Sample 62 predicted as 4
Sample 63 predicted as 8
Sample 64 predicted as 5
Sample 65 predicted as 3
Sample 66 predicted as 4
Sample 67 predicted as 3
Sample 68 predicted as 9
Sample 69 predicted as 3
Sample 70 predicted as 9
Sample 71 predicted as 4
Sample 72 predicted as 2
Sample 73 predicted as 8
Sample 74 predicted as 1
Sample 75 predicted as 6
Sample 76 predicted as 3
Sample 77 predicted as 7
Sample 78 predicted as 0
Sample 79 predicted as 3
Sample 80 predicted as 1
Sample 81 predicted as 7
Sample 82 predicted as 6
Sample 83 predicted as 7
Sample 84 predicted as 6
Sample 85 predicted as 1
Sample 86 predicted as 9
Sample 87 predicted as 5
Sample 88 predicted as 3
Sample 89 predicted as 6
Sample 90 predicted as 9
Sample 91 predicted as 3
Sample 92 predicted as 7
Sample 93 predicted as 6
Sample 94 predicted as 6
Sample 95 predicted as 5
Sample 96 predicted as 2
Sample 97 predicted as 9
Sample 98 predicted as 3
Sample 99 predicted as 5
Predicting accuracy 0.970000
Process finished with exit code 0
In the run above, kNN uses k = 10 (ten nearest neighbors) and p = 2 (Euclidean distance). Computing the Euclidean distance through the pure-Python minkowski loop with p = 2 was too slow, so it was replaced with numpy's built-in Euclidean norm. The test accuracy reaches 97%: the algorithm is simple but effective.
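The nested Python loop over test and training samples is the real bottleneck. As a comparison, here is a compact vectorized sketch of the same brute-force kNN (my own minimal version, not the code above; it uses `np.argpartition` for top-k selection and `np.bincount` for voting):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=10, p=2):
    """Brute-force kNN, vectorized: one distance matrix, then a top-k vote."""
    # Pairwise Minkowski distances, shape (n_test, n_train), via broadcasting
    diffs = np.abs(X_test[:, None, :] - X_train[None, :, :]) ** p
    dists = diffs.sum(axis=2) ** (1.0 / p)
    # Indices of the k smallest distances per test row (unordered within top k)
    knn_idx = np.argpartition(dists, k - 1, axis=1)[:, :k]
    # Majority vote over the k neighbor labels
    return np.array([np.bincount(y_train[row]).argmax() for row in knn_idx])

X_train = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
y_train = np.array([0, 1, 2, 3, 4, 5])
print(knn_predict(X_train, y_train, np.array([[9, 5]]), k=1))  # -> [2]
```

The distance matrix costs the same $O(nN)$ arithmetic, but doing it inside numpy instead of a Python loop is typically orders of magnitude faster.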
Nearest-Neighbor Search with a kd-Tree
Plain kNN scans every sample in the space for each query, which makes the algorithm expensive; a tree structure can accelerate the search so that the nearest neighbor is found quickly. A kd-tree is such a structure: it recursively partitions the k-dimensional space with hyperplanes perpendicular to the coordinate axes, storing the instance points in the tree nodes for fast retrieval.
The kd-tree is constructed as follows:
1. Start with the whole dataset and the first coordinate axis.
2. Sort the points along the current axis and split at the median point, storing that point in the current node.
3. Recurse on the left and right halves, cycling to the next axis at each level, until a subset is empty.
The construction above picks the median of the training instances along the chosen axis as the split point and stores the instance in the corresponding node. The resulting kd-tree is balanced, although a balanced tree does not necessarily give optimal search efficiency.
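An alternative split rule, discussed again below for MNIST, is to split on the axis with the largest variance rather than cycling axes by depth. This is a heuristic used by some kd-tree variants, not what the book or the implementation below does; a one-function sketch:

```python
import numpy as np

def pick_split_axis(points):
    """Choose the coordinate axis with the largest variance as the split axis."""
    return int(np.argmax(np.var(points, axis=0)))

pts = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
print(pick_split_axis(pts))  # -> 0 (x spreads more than y here)
```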
Nearest-neighbor search on the constructed kd-tree proceeds as follows:
1. Descend from the root, going left or right depending on which side of each node's splitting plane the query lies, until a leaf is reached; take the leaf as the current nearest point.
2. Backtrack: at each visited node, update the current nearest if the node itself is closer.
3. If a node's splitting plane is closer to the query than the current nearest distance, the other subtree may contain a closer point and must also be searched; otherwise it can be skipped.
4. The search ends once the root has been backtracked through; the current nearest point is the answer.
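These steps can be condensed into a small self-contained recursive sketch (my own illustrative version; the full implementation below uses an explicit stack for the backtracking instead of recursion):

```python
import numpy as np

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    """Median-split kd-tree, cycling the split axis with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Descend toward target, then backtrack, pruning by the splitting plane."""
    if node is None:
        return best
    d = np.linalg.norm(np.array(node.point) - np.array(target))
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    if abs(diff) < best[0]:  # splitting plane intersects the best ball
        best = nearest(far, target, best)
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (3, 4.5)))  # nearest point is (2, 3)
```

Because the pruned subtree is searched whenever the splitting plane cuts the current best ball, this recursive form always returns an exact nearest neighbor.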
Following these steps, the kd-tree construction and the nearest-neighbor search over it are implemented below. The tree is built recursively; the nearest-neighbor search first walks down to the leaf corresponding to the input sample, then uses a stack to revisit the nodes on the path in reverse order, comparing as it goes to find the node closest to the input.
```python
# @Author: phd
# @Date: 2019-07-02
# @Site: github.com/phdsky
# @Description: NULL

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def calc_accuracy(y_pred, y_truth):
    assert len(y_pred) == len(y_truth)
    n = len(y_pred)

    hit_count = 0
    for i in range(0, n):
        if y_pred[i] == y_truth[i]:
            hit_count += 1

    print("Predicting accuracy %f" % (hit_count / n))


def minkowski(xi, xj, p):
    assert len(xi) == len(xj)

    # Generic Minkowski distance (too slow in pure Python):
    # distance = 0
    # for i in range(0, n):
    #     distance += pow(abs(xi[i] - xj[i]), p)
    # distance = pow(distance, 1 / p)

    # Euclidean distance (p = 2) via numpy
    distance = np.linalg.norm(xi - xj)

    return distance


class Node(object):
    def __init__(self, data, label, axis, left, right):
        self.data = data
        self.label = label
        self.axis = axis
        self.left = left
        self.right = right


class KDTree(object):
    def __init__(self, k, p):
        self.k = k
        self.p = p
        self.root = None

    def build(self, X_train, y_train):
        def create(dataset, axis):
            if len(dataset) == 0:
                return None  # Leaf node

            dataset.sort(key=lambda kv: kv[0][axis])  # Sort by axis
            median = len(dataset) // 2

            data = dataset[median][0]
            label = dataset[median][1]

            # This k is the space dimension, not self.k
            # Errata in book?
            sp = (axis + 1) % len(data)

            left = create(dataset[0:median], sp)       # Create left sub-tree
            right = create(dataset[median + 1::], sp)  # Create right sub-tree

            return Node(data, label, axis, left, right)

        dataset = list(zip(X_train, y_train))
        self.root = create(dataset, 0)

    def nearest(self, x):
        nearest_nodes = []
        parent_nodes = []  # Parent nodes visited

        def traverse(x):
            while len(parent_nodes) != 0:
                parent_node = parent_nodes.pop()
                if parent_node is None:
                    continue

                dist = x[parent_node.axis] - parent_node.data[parent_node.axis]

                nearest_nodes.sort(key=lambda kv: kv[0])
                if abs(dist) < nearest_nodes[0][0]:
                    distance = minkowski(x, parent_node.data, self.p)
                    nearest_nodes.append((distance, parent_node))
                    parent_nodes.append(parent_node.right if dist < 0 else parent_node.left)

        # Find leaf node
        node = self.root
        while node is not None:
            parent_nodes.append(node)
            dist = x[node.axis] - node.data[node.axis]
            node = node.left if dist < 0 else node.right

        leaf_node = parent_nodes.pop()
        distance = minkowski(x, leaf_node.data, self.p)
        nearest_nodes.append((distance, leaf_node))

        traverse(x)

        nearest_nodes.sort(key=lambda kv: kv[0])
        print("Nearest neighbour is %s" % nearest_nodes[0][1].data)

        return nearest_nodes[0][1].label

    def predict(self, X_test):
        n = len(X_test)
        predict_label = np.full(n, -1)

        for i in range(0, n):
            predict_label[i] = self.nearest(X_test[i])
            print("Sample %d predicted as %d" % (i, predict_label[i]))

        return predict_label


def example_small():
    print("Start testing on simple dataset...")

    X_train = np.asarray([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
    y_train = np.asarray([0, 1, 2, 3, 4, 5])
    X_test = np.asarray([[3, 5]])
    # y_test = np.asarray([2])

    print("KDTree building...")
    kdtree = KDTree(k=1, p=2)  # Init KDTree
    kdtree.build(X_train=X_train, y_train=y_train)  # Build KDTree
    print("Building complete...")

    # Start predicting, training process omitted
    print("Testing on %d samples..." % len(X_test))
    y_predicted = kdtree.predict(X_test=X_test)
    # calc_accuracy(y_pred=y_predicted, y_truth=y_test)

    print("Simple testing done...\n")


def example_large():
    mnist_data = pd.read_csv("../data/mnist.csv")
    mnist_values = mnist_data.values

    images = mnist_values[::, 1::]
    labels = mnist_values[::, 0]

    X_train, X_test, y_train, y_test = train_test_split(
        images, labels, test_size=100, random_state=42
    )

    print("KDTree building...")
    kdtree = KDTree(k=1, p=2)  # Init KDTree
    kdtree.build(X_train=X_train, y_train=y_train)  # Build KDTree
    print("Building complete...")

    # Start predicting, training process omitted
    print("Testing on %d samples..." % len(X_test))
    y_predicted = kdtree.predict(X_test=X_test)

    calc_accuracy(y_pred=y_predicted, y_truth=y_test)


if __name__ == "__main__":
    example_small()
    # example_large()
```
The output:
/Users/phd/Softwares/anaconda3/bin/python /Users/phd/Desktop/ML/knn/kdtree.py
KDTree building...
Building complete...
Testing on 100 samples...
Sample 0 predicted as 2
Sample 1 predicted as 1
Sample 2 predicted as 9
Sample 3 predicted as 7
Sample 4 predicted as 9
Sample 5 predicted as 6
Sample 6 predicted as 2
Sample 7 predicted as 3
Sample 8 predicted as 7
Sample 9 predicted as 1
Sample 10 predicted as 6
Sample 11 predicted as 3
Sample 12 predicted as 1
Sample 13 predicted as 2
Sample 14 predicted as 7
Sample 15 predicted as 9
Sample 16 predicted as 3
Sample 17 predicted as 3
Sample 18 predicted as 6
Sample 19 predicted as 4
Sample 20 predicted as 8
Sample 21 predicted as 5
Sample 22 predicted as 2
Sample 23 predicted as 6
Sample 24 predicted as 0
Sample 25 predicted as 0
Sample 26 predicted as 0
Sample 27 predicted as 8
Sample 28 predicted as 6
Sample 29 predicted as 3
Sample 30 predicted as 6
Sample 31 predicted as 6
Sample 32 predicted as 1
Sample 33 predicted as 0
Sample 34 predicted as 8
Sample 35 predicted as 6
Sample 36 predicted as 7
Sample 37 predicted as 3
Sample 38 predicted as 6
Sample 39 predicted as 1
Sample 40 predicted as 4
Sample 41 predicted as 7
Sample 42 predicted as 4
Sample 43 predicted as 6
Sample 44 predicted as 8
Sample 45 predicted as 3
Sample 46 predicted as 4
Sample 47 predicted as 7
Sample 48 predicted as 7
Sample 49 predicted as 8
Sample 50 predicted as 4
Sample 51 predicted as 3
Sample 52 predicted as 3
Sample 53 predicted as 7
Sample 54 predicted as 1
Sample 55 predicted as 7
Sample 56 predicted as 1
Sample 57 predicted as 1
Sample 58 predicted as 4
Sample 59 predicted as 6
Sample 60 predicted as 1
Sample 61 predicted as 0
Sample 62 predicted as 1
Sample 63 predicted as 1
Sample 64 predicted as 6
Sample 65 predicted as 9
Sample 66 predicted as 4
Sample 67 predicted as 3
Sample 68 predicted as 9
Sample 69 predicted as 3
Sample 70 predicted as 4
Sample 71 predicted as 4
Sample 72 predicted as 2
Sample 73 predicted as 3
Sample 74 predicted as 1
Sample 75 predicted as 6
Sample 76 predicted as 3
Sample 77 predicted as 7
Sample 78 predicted as 0
Sample 79 predicted as 3
Sample 80 predicted as 1
Sample 81 predicted as 7
Sample 82 predicted as 6
Sample 83 predicted as 7
Sample 84 predicted as 6
Sample 85 predicted as 1
Sample 86 predicted as 4
Sample 87 predicted as 5
Sample 88 predicted as 3
Sample 89 predicted as 6
Sample 90 predicted as 9
Sample 91 predicted as 3
Sample 92 predicted as 7
Sample 93 predicted as 6
Sample 94 predicted as 6
Sample 95 predicted as 5
Sample 96 predicted as 2
Sample 97 predicted as 7
Sample 98 predicted as 3
Sample 99 predicted as 6
Predicting accuracy 0.750000
Process finished with exit code 0
The implementation covers two test sets: the simple example from the book and the MNIST instance.
- When running the MNIST example with the axis-cycling rule $l = j \bmod k + 1$, the first few feature axes of MNIST are almost all zero, so splitting this way is probably not ideal (an alternative is to split on the axis with the largest variance). The final accuracy comes out at 75%, but the search is noticeably faster than the brute-force scan.
- The book's small example was therefore used to verify correctness, with a side-by-side comparison against plain kNN. I tested the kd-tree on many sets of input data and it appears basically correct (I cannot rule out other bugs, since the book's example is very simple). The input left in both the kNN and kd-tree code is [3, 5], and the two programs produce different outputs:
knn:
/Users/phd/Softwares/anaconda3/bin/python /Users/phd/Desktop/ML/knn/knn.py
Start testing on simple dataset...
Nearest neighbour is [2 3]
Sample 0 predicted as 0
Simple testing done...
Process finished with exit code 0
kd-tree:
/Users/phd/Softwares/anaconda3/bin/python /Users/phd/Desktop/ML/knn/kdtree.py
Start testing on simple dataset...
KDTree building...
Building complete...
Testing on 1 samples...
Nearest neighbour is [4 7]
Sample 0 predicted as 3
Simple testing done...
Process finished with exit code 0
A careful look explains the discrepancy: the input point happens to be exactly equidistant from three of the training samples, so the search order determines which one is returned as the nearest neighbor. The figures below show the feature-space partition and the shape of the kd-tree for the simple dataset.
Presumably this tie-breaking randomness in the nearest neighbor also affects the MNIST data above, accounting for some of the prediction errors.
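The three-way tie is easy to check numerically; this snippet computes the distances from [3, 5] to the six training points:

```python
import numpy as np

X_train = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
x = np.array([3, 5])

# Euclidean distance from x to every training point
dists = np.linalg.norm(X_train - x, axis=1)
print(np.round(dists, 4))
# [2, 3], [5, 4] and [4, 7] all sit at distance sqrt(5) ~ 2.2361,
# so which one is "nearest" depends purely on search order.
```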
Summary
- The k-nearest-neighbor algorithm has no explicit learning phase and can be used for both classification and regression.
- kNN partitions the feature space according to the training data; once the training set, distance metric, value of k, and classification decision rule are fixed, the result is uniquely determined (up to the randomness introduced when samples of different classes lie at exactly the same distance).
- A small k makes the kNN model complex; a large k makes it simple. The choice of k reflects a trade-off between approximation error and estimation error, and cross-validation is commonly used to select the optimal k.
- Brute-force kNN has time complexity $O(n^2)$; the average time complexity of a kd-tree search is $O(\log N)$.
- kd-trees suit k-nearest-neighbor search when the number of training instances $N$ is much larger than the space dimension; as the dimension approaches the number of instances, kd-tree search efficiency degrades toward that of a linear scan.
References
- 《统计学习方法》 (Statistical Learning Methods, Li Hang)