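Machine Learning: A Python Implementation of the KNN Algorithm
I. Theoretical Background
1. Distance metric
The distance between two instance points in feature space reflects how similar they are. Euclidean distance is the usual choice, but other metrics such as cosine distance or Manhattan distance can also be used.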
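To make the three metrics concrete, here is a minimal pure-Python sketch. The function names are illustrative and do not come from knn.py, and the cosine version assumes that neither vector is all zeros.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

Note that cosine distance compares directions only, so two vectors that differ just in length are at distance 0 from each other.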
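2. Choosing k
- A larger k means a simpler model: the approximation error grows and the estimation error shrinks, so the model tends to underfit;
- A smaller k means a more complex model: the approximation error shrinks and the estimation error grows, so the model tends to overfit and becomes sensitive to the instance points closest to the query.
In practice, the best k is usually chosen by cross-validation.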
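One way to do this is leave-one-out cross-validation, sketched below in pure Python. This is only an illustration under stated assumptions: the post does not say which form of cross-validation it has in mind, and `knn_predict`, `loo_accuracy` and `choose_k` are hypothetical helpers, not part of knn.py.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Brute-force k-NN with majority voting on squared Euclidean distance."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(xs, ys, k):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)

def choose_k(xs, ys, candidates):
    """Return the candidate k with the best leave-one-out accuracy.
    Ties go to the earliest candidate in the list."""
    return max(candidates, key=lambda k: loo_accuracy(xs, ys, k))

# Toy data: two well-separated 1-D clusters with three samples each
xs = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
ys = [0, 0, 0, 1, 1, 1]
print(loo_accuracy(xs, ys, 1))  # 1.0
print(loo_accuracy(xs, ys, 5))  # 0.0
```

On this toy set, k=5 always misclassifies: once a sample is held out, its own class has only two remaining members and is outvoted by the other three. That is the underfitting behavior of an overly large k in miniature.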
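3. Classification decision rule
Majority voting: the class of an input instance is determined by the majority class among its k nearest neighbors.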
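Majority voting itself is a one-liner with `collections.Counter`. The helper name `majority_vote` is an illustrative assumption, and how ties are broken is a design choice the post does not specify; in this sketch the label seen first wins.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the k nearest neighbors.

    Ties are broken in favor of the label that appears first in the list,
    which is an arbitrary choice made for this sketch.
    """
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(["cat", "dog", "cat"]))  # cat
```

Using an odd k avoids ties in two-class problems, which is one reason small odd values such as 3 or 5 are common defaults.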
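4. kd-tree
A kd-tree is an efficient way to implement k-nearest-neighbor search. It resembles binary search, except that each step halves a multi-dimensional space instead of a sorted one-dimensional list.
A kd-tree is best suited to k-nearest-neighbor search when the number of training instances is much larger than the number of dimensions. When the dimensionality approaches the number of training instances, its efficiency drops quickly, to almost that of a linear scan.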
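The following is a minimal, self-contained sketch of building a kd-tree and running a nearest-neighbor query with pruning. It is not the code from knn.py: for brevity it cycles through the axes by depth (a common simplification) instead of picking the dimension with the largest variance, and `Node`, `build_kd_tree`, `sq_dist` and `nearest` are names invented for this example.

```python
from collections import namedtuple

Node = namedtuple("Node", "point axis left right")

def build_kd_tree(points, depth=0):
    """Split on the median along one axis, cycling axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kd_tree(points[:mid], depth + 1),
                build_kd_tree(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    """Squared Euclidean distance (enough for comparing candidates)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return the stored point closest to query."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, query) < sq_dist(best, query):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # The far side can only help if the splitting plane is closer than the best so far
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
print(nearest(tree, (3, 4.5)))  # (2, 3)
```

The pruning test is what makes the search faster than a linear scan: a whole subtree is skipped whenever the query is farther from the splitting plane than from the best candidate found so far. In high dimensions that test rarely succeeds, which is why the efficiency degrades as described above.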
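II. Python Implementation
This implementation covers both brute-force KNN search and kd-tree search. The kd-tree search can only find the single nearest neighbor (k=1); the case k>1 is not implemented yet. A first idea for it: search the kd-tree k times, deleting the nearest-neighbor node after each search and then searching again, which yields the top-k nearest neighbors. That approach would require implementing deletion and insertion for the kd-tree.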
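As a point of comparison, a different and widely used way to get the top-k neighbors avoids deletion altogether: do a single traversal and keep the k best candidates in a bounded max-heap. The sketch below shows that alternative, not the delete-and-search-again idea described above; it is self-contained, uses nested tuples instead of the post's KDNode, and all function names are hypothetical.

```python
import heapq

def build(points, depth=0):
    """Build a kd-tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_nearest(node, query, k, heap=None):
    """Collect the k nearest points in a max-heap keyed on negated distance."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    point, axis, left, right = node
    d = sq_dist(point, query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, point))
    elif d < -heap[0][0]:
        # Closer than the current worst candidate: replace it
        heapq.heapreplace(heap, (-d, point))
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff <= 0 else (right, left)
    k_nearest(near, query, k, heap)
    # Visit the far side only if it could still hold a closer point
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        k_nearest(far, query, k, heap)
    return heap

def knn_search(tree, query, k):
    """Return the k nearest points, nearest first."""
    return [p for _, p in sorted(k_nearest(tree, query, k), reverse=True)]

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print(knn_search(build(pts), (3, 4.5), 2))  # [(2, 3), (5, 4)]
```

The only change from a k=1 search is the pruning bound: instead of the distance to the single best point, it is the distance to the worst of the k candidates kept so far, and nothing is pruned until the heap is full.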
1. Code
knn.py
#encoding=utf-8
'''
implement the knn algorithm
'''
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy.stats import mode
import matplotlib.pyplot as plt
class KNN:
    def __init__(self):
        pass

    def predict(self, x_train, y_train, x_test, k=3):
        self.k = k
        m_train = x_train.shape[0]
        m_test = x_test.shape[0]
        x_train = np.mat(x_train)
        # note: the tiling in step 4 assumes y_train is a column vector of shape (m_train, 1)
        y_train = np.mat(y_train)
        x_test = np.mat(x_test)
        # 1. compute the distance between every test sample and every training sample;
        # the distance matrix has shape (m_test, m_train).
        dists = self.__distance__(x_train, x_test)
        # 2. sort each row of distances and keep the sorted indices
        sort_idx = np.argsort(dists, axis=1)
        # 3. build the row and column indices of the k nearest training samples
        x_idx = np.tile(np.mat(range(m_test)).T, [1, self.k])
        y_idx = sort_idx[:, 0 : self.k]
        # 4. gather the labels of the k nearest neighbors; the result has shape (m_test, k)
        labels = np.tile(y_train.T, [m_test, 1])
        p_labels = labels[x_idx, y_idx]
        # 5. take the mode of each row, i.e. the most frequent label
        y_predict = np.mat(mode(p_labels, axis=1)[0])
        return y_predict

    def __distance__(self, x_train, x_test):
        '''
        brute-force computation of the squared Euclidean distance between
        every test sample and every training sample
        (no square root is taken, which does not change the neighbor ranking)
        '''
        m_train = x_train.shape[0]
        m_test = x_test.shape[0]
        dists = np.zeros((m_test, m_train))
        count = 0
        for test in x_test:
            # repeat the test sample so it lines up with every training row
            test = np.tile(test, [m_train, 1])
            distance = np.sum(np.multiply(x_train - test, x_train - test), axis=1)
            dists[count] = distance.T
            count += 1
        return dists

    def create_kd_tree(self, datalist):
        '''
        create the KD tree
        Args:
            datalist: the data as an np.matrix, one sample per row
        '''
        root = KDNode()
        self.build_tree(root, datalist)
        self.kd_tree = root
        return root

    def build_tree(self, parent, datalist):
        '''
        recursively build the tree
        Args:
            parent: parent node
            datalist: the samples to place under the parent node
        '''
        m = datalist.shape[0]
        # if only one sample is left, the node is a leaf node
        if m == 1:
            parent.data = datalist
            return
        # choose the split dimension as the one with the largest variance
        demension = np.argmax(np.var(datalist, axis=0))
        # sort the data along the chosen dimension
        sorted_index = np.argsort(datalist[:, demension], axis=0)
        # index of the median sample (integer division, so it is a valid index in Python 3)
        middle = m // 2
        # samples before the median go to the left subtree
        l_data = datalist[np.squeeze(sorted_index[0 : middle].getA()), :]
        # samples after the median go to the right subtree
        r_data = datalist[np.squeeze(sorted_index[middle + 1 : ].getA()), :]
        # store the median sample and the split information in the parent node
        parent.data = datalist[np.squeeze(sorted_index[middle, :].getA())]
        parent.demension = demension
        parent.split_value = datalist[np.squeeze(sorted_index[middle, :].getA()), demension]
        # recursively build each child node if any data remains on that side
        if len(l_data) != 0:
            l_node = KDNode()
            parent.left = l_node
            self.build_tree(l_node, l_data)
        if len(r_data) != 0:
            r_node = KDNode()
            parent.right = r_node
            self.build_tree(r_node, r_data)

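    def __distance_by_kd_tree__(self, x_test):
        '''
        get the matrix of nearest neighbors by kd-tree search
        (one row per test sample, holding that sample's nearest training point)
        '''
        m = x_test.shape[0]
        # each neighbor is a full sample, so a row needs one column per feature
        neighbors = np.zeros((m, x_test.shape[1]))
        count = 0
        for x in x_test:
            neighbors[count] = self.__find_neighbor__(x, self.kd_tree)
            count += 1
        return np.mat(neighbors)
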
    def __find_neighbor__(self, x, node):
        '''
        recursively find the nearest neighbor of x in the kd-tree
        Args:
            x: the query sample, a 1 x n matrix
            node: the root node of the current subtree
        steps:
            1. if the current node is a leaf node, return its data as the nearest neighbor
            2. if the value of x is not greater than the split value, take the nearest
               neighbor from the left subtree, then check whether the other subtree
               holds an even nearer neighbor;
               if the value of x is greater than the split value, do the same with the
               two subtrees swapped;
            3. check whether the current node itself is nearer to x
        '''
        # a node without a split dimension is a leaf
        if node.demension is None:
            return node.data
        if (x[0, node.demension] <= node.split_value) and node.left:
            neighbor = self.__find_neighbor__(x, node.left)
            if node.right \
                    and (np.abs(x[0, node.demension] - node.split_value) < self.__euclidean_distance__(x, neighbor)) \
                    and (self.__euclidean_distance__(self.__find_neighbor__(x, node.right), x) < self.__euclidean_distance__(x, neighbor)):
                neighbor = self.__find_neighbor__(x, node.right)
        elif (x[0, node.demension] > node.split_value) and node.right:
            neighbor = self.__find_neighbor__(x, node.right)
            if node.left \
                    and (np.abs(x[0, node.demension] - node.split_value)