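The k-Nearest Neighbor Method
The k-nearest neighbor (k-NN) algorithm is simple and intuitive: given a training data set and a new input instance, find the k training instances closest to that instance. Whichever class the majority of these k instances belong to is the class assigned to the input instance.
Characteristics of the k-nearest neighbor method:
- It has no explicit learning process. In effect, k-NN uses the training data set to partition the feature space, and that partition serves as the classification model.
- It has three basic elements: the choice of k, the distance metric, and the classification decision rule. Once these three are fixed, the result for any input is uniquely determined.
- An implementation must consider how to find the k nearest points quickly. The kd-tree is a data structure that supports fast search in a k-dimensional space.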
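The idea can be sketched without any index structure at all. The following is a minimal brute-force version, assuming Euclidean distance and majority voting; the function name `knn_predict` and the toy points are hypothetical and are not part of the kd-tree implementation given later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training label with the Euclidean distance of its point to the query.
    dists = [(math.dist(p, query), label)
             for p, label in zip(train_points, train_labels)]
    # Keep the k closest pairs (sorted on distance only).
    nearest = sorted(dists, key=lambda t: t[0])[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (1.1, 1.0), k=3))  # 0
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # 1
```

This scans every training point for each query, which is exactly the cost a kd-tree is meant to reduce.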
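Distance metric
The distance between two instance points in the feature space reflects how similar they are. The feature space of the k-NN model is generally the n-dimensional real vector space $\mathbf{R}^n$. The Euclidean distance is the usual choice, but other distances can also be used, such as the more general $L_p$ distance (Minkowski distance).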
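Let the feature space $\mathcal{X}$ be the n-dimensional real vector space $\mathbf{R}^n$, and let $x_i, x_j \in \mathcal{X}$ with $x_i = (x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(n)})^T$ and $x_j = (x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)})^T$. The $L_p$ distance between $x_i$ and $x_j$ is defined as
$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}$$
Here $p \ge 1$. When $p = 2$ it is the Euclidean distance. When $p = 1$ it is the Manhattan distance. When $p = \infty$ it is the maximum of the coordinate-wise distances, $L_\infty(x_i, x_j) = \max_l \left| x_i^{(l)} - x_j^{(l)} \right|$. The figure below shows, for a two-dimensional space, the set of points whose $L_p$ distance from the origin is 1 for different values of p.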
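As a small illustration of the definition, the sketch below computes the $L_p$ distance for finite p and for $p = \infty$. The function name `lp_distance` is hypothetical and is not used by the implementation later in this article, which works with the Euclidean distance only.

```python
def lp_distance(x, y, p=2.0):
    """L_p (Minkowski) distance between two equal-length sequences, for p >= 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        # Limit case: the largest coordinate-wise difference.
        return max(diffs)
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(d ** p for d in diffs) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(lp_distance(a, b, p=1))             # Manhattan distance: 7.0
print(lp_distance(a, b, p=2))             # Euclidean distance: 5.0
print(lp_distance(a, b, p=float("inf")))  # maximum coordinate difference: 4.0
```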
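Choosing k
The choice of k has a major effect on the result of the k-NN method.
A small k means predicting from the training data in a small neighborhood. The approximation error of the "learning" decreases, but its estimation error increases: the prediction becomes very sensitive to the nearest instance points, and overfitting is likely.
A large k means predicting from the instances in a larger neighborhood. The advantage is a smaller estimation error, but the approximation error increases. Training instances far from the input instance (dissimilar ones) now also influence the prediction and can make it wrong. Increasing k makes the overall model simpler.
In practice k is usually set to a fairly small value, and cross-validation is commonly used to select the best k.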
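The selection procedure can be sketched as follows. This is a minimal example that assumes leave-one-out cross-validation (one particular form of cross-validation) and a brute-force neighbor search; the helper names `knn_predict`, `loo_error` and `select_k` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (point, label); majority vote among the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_error(data, k):
    """Leave-one-out error rate of k-NN for a given k."""
    mistakes = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out the i-th instance
        if knn_predict(rest, point, k) != label:
            mistakes += 1
    return mistakes / len(data)

def select_k(data, candidates):
    """Return the candidate k with the lowest error (smallest k on ties)."""
    errors = {k: loo_error(data, k) for k in candidates}
    best = min(candidates, key=lambda k: (errors[k], k))
    return best, errors

# Hypothetical toy data: two clusters plus one mislabeled point at (0.5, 0.5).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0),
        ((5, 5), 1), ((5, 6), 1), ((6, 5), 1), ((6, 6), 1),
        ((0.5, 0.5), 1)]
best_k, errors = select_k(data, [1, 3, 5])
print(best_k, errors)
```

On this toy set the single mislabeled point misleads k = 1 for every one of its neighbors, while k = 3 outvotes it, which mirrors the sensitivity of small k described above.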
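Classification decision rule
The classification decision rule in k-NN is usually majority voting: the class of the input instance is the majority class among its k nearest training instances.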
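Majority voting can be interpreted as follows. If the loss function for classification is the 0-1 loss and the classification function is
$$f: \mathbf{R}^n \to \{c_1, c_2, \dots, c_K\}$$
then the probability of misclassification is
$$P(Y \ne f(X)) = 1 - P(Y = f(X))$$
For a given instance $x \in \mathcal{X}$, its k nearest training instances form the set $N_k(x)$. If the class assigned to the region covering $N_k(x)$ is $c_j$, the misclassification rate is
$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \ne c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$
To minimize the misclassification rate, that is, the empirical risk, $\sum_{x_i \in N_k(x)} I(y_i = c_j)$ must be maximized. Majority voting therefore corresponds to empirical risk minimization.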
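A quick numerical check of this equivalence, under the assumption that the labels of the k neighbors are already known: the class with the lowest misclassification rate on the neighbor set is the same class that majority voting picks. The function name `misclassification_rate` and the labels are hypothetical.

```python
from collections import Counter

def misclassification_rate(neighbor_labels, c):
    """(1/k) * sum of I(y_i != c) over the k neighbor labels."""
    k = len(neighbor_labels)
    return sum(1 for y in neighbor_labels if y != c) / k

# Hypothetical labels of the k = 5 nearest neighbors of some input.
neighbor_labels = ["a", "b", "a", "c", "a"]
classes = sorted(set(neighbor_labels))

rates = {c: misclassification_rate(neighbor_labels, c) for c in classes}
best_by_risk = min(classes, key=lambda c: rates[c])
majority = Counter(neighbor_labels).most_common(1)[0][0]

print(rates)                     # {'a': 0.4, 'b': 0.8, 'c': 0.8}
print(best_by_risk == majority)  # True
```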
Code implementation
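# -*- coding: utf-8 -*-
'''
Building the kd-tree:
1. Choose the split dimension and the split point.
2. Create the root node of the new subtree and link it to its parent.
3. If any data lies to the left of the split point, split that data recursively.
4. If any data lies to the right of the split point, split that data recursively.

Search backtracks from a leaf toward the root. At each parent node it checks
whether the radius of the hypersphere centred on the target point (the distance
to the current nearest node) is larger than the distance from the target point
to the parent's splitting plane. If so, the hypersphere intersects the
hyperrectangle of the subtree rooted at the sibling node, and that subtree
must be searched as well.

The k-nearest search keeps a container of at most k nodes. The first k nodes
visited are added directly; after that, each new node is compared with the
farthest node in the container and replaces it if it is closer.
'''
import numpy as np
import matplotlib.pyplot as plt
import os
import random

my_dir = os.path.dirname(__file__)
picname = "k_neareast_neighbor"

'''
calculate the Euclidean distance between two points
: @param x_a: point's coordinate 1d-array
: @param x_b: point's coordinate 1d-array
: @return distance
'''
def Euclidistance(x_a,x_b):
    return np.sqrt(np.sum((x_a - x_b) * (x_a - x_b)))

'''
calculate the distance between a point and a plane
: @param x: the point 1d-array
: @param xp: an arbitrary point on the plane 1d-array
: @param nvector: the normal vector of the plane 1d-array
: @return Euclidean distance between the point and the plane float
'''
def distance_point_plane(x, xp, nvector):
    return abs(float(np.dot(x - xp, nvector)))/np.sqrt(np.dot(nvector,nvector))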
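
class kNode(object):
    def __init__(self, point,classLabel, splitDim, parent=None):
        self.point = point            #coordinates of the instance stored in this node
        self.classLabel = classLabel  #class label of the instance
        self.splitDim = splitDim      #dimension along which this node splits the space
        self.lc = None                #left child
        self.rc = None                #right child
        self.parent = parent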

class KNN(object):
    def __init__(self, k=3):
        self.k = k
        self.root = None#tree root node kNode
        self.classLabels = None#all class labels, a set

    '''
    decide the split dimension of the dataset according to variance
    and the split position according to the median.
    : @param dataset: the dataset used to build the child kd tree 2d-array
    : @return splitDim: the next dimension to split the dataset int
    : @return splitPos: the split position (row index after sorting) int
    '''
    def __selectSplitDimAndPos(self, dataset):
        #the last column holds the class label, so it is excluded
        per_dim_var = [np.var(dataset[:,i]) for i in range(dataset.shape[1] - 1)]
        splitDim = np.argmax(per_dim_var)
        splitPos = dataset.shape[0] // 2
        return splitDim, splitPos

    '''
    build the child kd tree
    : @param dataset: the dataset used to build the child kd tree 2d-array
    : @param parent: the child kd tree's parent node
    : @return root: the root node of the child kd tree
    '''
    def __buildKdTree(self, dataset, parent=None):
        splitDim, splitPos = self.__selectSplitDimAndPos(dataset)
        dataset = np.array(sorted(dataset,key=lambda x: x[splitDim]))
        node = kNode(dataset[splitPos,:-1], dataset[splitPos,-1], splitDim, parent)
        if splitPos > 0: node.lc = self.__buildKdTree(dataset[:splitPos,:],node)
        if splitPos + 1 < dataset.shape[0] : node.rc = self.__buildKdTree(dataset[(splitPos+1): ,:],node)
        return node

    '''
    train on the data. Note, the kNN algorithm does not have an explicit learning process.
    The train process is just building a kd-tree.
    '''
    def train(self, dataset):
        if not dataset.shape[0]:
            return
        self.classLabels = set(dataset[:,-1])
        self.root = self.__buildKdTree(dataset)

    '''
    go straight to a child kd-tree's leaf node
    : @param croot: child kd-tree's root node kNode
    : @param data: objective data 1d-array
    : @return the leaf node of the child kd-tree kNode
    '''
    def __toLeaf(self, croot, data):
        while croot.lc or croot.rc:
            if croot.point[croot.splitDim] < data[croot.splitDim] and croot.rc:
                croot = croot.rc
            elif croot.point[croot.splitDim] >= data[croot.splitDim] and croot.lc:
                croot = croot.lc
            else: break
        #the preferred side is empty: step into the only existing child
        if croot.lc: croot = croot.lc
        if croot.rc: croot = croot.rc
        return croot

    '''
    search the single nearest node to the objective data in a child kd-tree
    : @param data: objective data 1d-array
    : @param root: root node of the child kd-tree kNode
    : @return nearest: the nearest node kNode
    '''
    def __searchNearestOne(self, data, root):
        #search the leaf node in data's region
        cur_node = self.__toLeaf(root,data)
        nearest = cur_node#the current nearest node of the child kd-tree
        nearest_dist = Euclidistance(data,nearest.point)
        nDims = cur_node.point.shape
        while cur_node != root:#exit the search of the child kd-tree when the root node is visited
            fromLeft = True
            if cur_node != cur_node.parent.lc: fromLeft = False
            cur_node = cur_node.parent #upward backtracking
            #calculate the distance between the objective point and the
            #separating plane that passes through cur_node
            plane_nvector = np.zeros(nDims)
            plane_nvector[cur_node.splitDim] = 1#the normal vector of the separating plane
            distptpe = distance_point_plane(data,cur_node.point,plane_nvector)
            #the hyperrectangle on the other side of the separating plane intersects the
            #hypersphere defined by the objective point and the current nearest point
            if distptpe < nearest_dist:
                #if the left branch has been visited, then go to the right branch
                if fromLeft and cur_node.rc:
                    #check whether to update the nearest point
                    new_nst = self.__searchNearestOne(data, cur_node.rc)
                    new_nst_dist = Euclidistance(data,new_nst.point)
                    if new_nst_dist < nearest_dist:
                        nearest = new_nst
                        nearest_dist = new_nst_dist
                #if the right branch has been visited, then go to the left branch
                if not fromLeft and cur_node.lc:
                    new_nst = self.__searchNearestOne(data, cur_node.lc)
                    new_nst_dist = Euclidistance(data,new_nst.point)
                    if new_nst_dist < nearest_dist:
                        nearest = new_nst
                        nearest_dist = new_nst_dist
            #visit the current point and check whether to update the nearest point
            distptpt = Euclidistance(data, cur_node.point)
            if distptpt < nearest_dist:
                nearest = cur_node
                nearest_dist = distptpt
        return nearest

    '''
    check whether to update the k nearest nodes to the objective point
    when visiting a new node. Only used when searching the k nearest nodes
    : @param data: the objective point 1d-array
    : @param cur_node: the new node kNode
    : @param nearestK: the container recording the k nearest nodes and their distances
                       to the objective point; its capacity is k 2d-list
    : @return None
    '''
    def __updateNearestK(self, data, cur_node, nearestK):
        dist = Euclidistance(data,cur_node.point)
        if len(nearestK) < self.k:#the container is not full yet
            nearestK.append([cur_node, dist])
        else:
            #the container already holds k nodes: find the farthest one
            #and replace it if cur_node is closer
            maxdis_id = nearestK.index(max(nearestK,key=lambda x:x[1]))
            maxnearest_dist = nearestK[maxdis_id][1]
            if dist < maxnearest_dist:
                nearestK[maxdis_id] = [cur_node, dist]

    '''
    search the k nearest nodes within the child kd-tree
    : @param data: the objective point 1d-array
    : @param root: the root node of the child kd-tree kNode
    : @param nearestK: the container recording the k nearest nodes and their distances
                       to the objective point; its capacity is k 2d-list
    : @return None
    '''
    def __searchNearestK(self, data, root, nearestK):
        cur_node = self.__toLeaf(root,data)
        self.__updateNearestK(data,cur_node,nearestK)
        nDims = cur_node.point.shape
        while cur_node != root:
            fromLeft = True
            if cur_node != cur_node.parent.lc: fromLeft = False
            cur_node = cur_node.parent
            #no need to check whether the hyperrectangle intersects the hypersphere
            #while nearestK is not full
            if len(nearestK) != self.k:
                if fromLeft and cur_node.rc:
                    self.__searchNearestK(data, cur_node.rc, nearestK)
                if not fromLeft and cur_node.lc:
                    self.__searchNearestK(data, cur_node.lc, nearestK)
            else:
                plane_nvector = np.zeros(nDims)
                plane_nvector[cur_node.splitDim] = 1
                maxnearest_dist = max(nearestK, key=lambda x: x[1])[1]
                distptpe = distance_point_plane(data,cur_node.point,plane_nvector)
                if distptpe < maxnearest_dist:
                    if fromLeft and cur_node.rc:
                        self.__searchNearestK(data, cur_node.rc, nearestK)
                    if not fromLeft and cur_node.lc:
                        self.__searchNearestK(data, cur_node.lc, nearestK)
            self.__updateNearestK(data,cur_node,nearestK)

    '''
    predict the class label of the objective point
    : @param data: the objective point 1d-array
    : @return predicted class label
    '''
    def run(self, data):
        if not self.root:
            print("Train the KNN algorithm first!")
            return None
        nearestK = []
        self.__searchNearestK(data, self.root, nearestK)
        #majority vote over the labels of the k nearest nodes
        labelcount = dict()
        for label in self.classLabels:
            labelcount[label] = 0
        for item in nearestK:
            label = item[0].classLabel
            labelcount[label] += 1
        return max(labelcount,key=lambda x:labelcount[x])
        # nearest = self.__searchNearestOne(data, self.root)
        # return nearest.classLabel

'''
print the kd-tree
'''
def printKdTree(kdtree):
    if not kdtree.root: return
    def printKdNode(kdnode, indent):
        if not isinstance(kdnode,kNode): return
        print(kdnode.point, " ", kdnode.splitDim)
        if kdnode.lc:
            print(" "*indent,"left:", end=" ")
            printKdNode(kdnode.lc, indent + 1)
        if kdnode.rc:
            print(" "*indent,"right:", end=" ")
            printKdNode(kdnode.rc, indent + 1)
    printKdNode(kdtree.root,0)

'''
plot the kd-tree. Note, it can only be used when the kd-tree is 2D.
'''
def plotKdTree(kdtree, _xlim, _ylim):
    if not kdtree.root: return
    fig = plt.figure()
    ax = plt.axes(xlim=_xlim,ylim=_ylim)
    def plotChildKdTree(kdnode, xlim, ylim):
        x = kdnode.point[0];y = kdnode.point[1]
        #draw the separating line, clipped to the node's own region
        if kdnode.splitDim == 0:
            plt.plot([x,x],ylim,'k',lw=2)
        else:
            plt.plot(xlim,[y,y],'k',lw=2)
        if kdnode.classLabel == 1:
            plt.plot(x,y,'bo',markersize=5)
        else: plt.plot(x,y,'rx',markersize=5)
        if kdnode.lc:
            if kdnode.splitDim == 0:
                xlim_l = [xlim[0],x]
                ylim_l = ylim
            else:
                xlim_l = xlim
                ylim_l = [ylim[0],y]
            plotChildKdTree(kdnode.lc, xlim_l,ylim_l)
        if kdnode.rc:
            if kdnode.splitDim == 0:
                xlim_r = [x,xlim[1]]
                ylim_r = ylim
            else:
                xlim_r = xlim
                ylim_r = [y,ylim[1]]
            plotChildKdTree(kdnode.rc,xlim_r,ylim_r)
    plotChildKdTree(kdtree.root,_xlim,_ylim)
    #save before showing: with some backends the figure is empty once the window is closed
    fig.savefig(os.path.join(my_dir,picname),dpi=80)
    plt.show()

if __name__ == "__main__":
    filename = "data.txt"
    data = []
    #each line of the file: feature 1, feature 2, class label
    with open(os.path.join(my_dir,filename),'r') as f:
        for line in f.readlines():
            temp1 = line.strip().split(',')
            temp2 = [float(temp1[0]),float(temp1[1]),int(temp1[2])]
            data.append(temp2)
    rndindex = [i for i in range(len(data))]
    random.shuffle(rndindex)
    data = np.array(data)
    #hold out the last 10 shuffled instances for testing
    testdata = data[rndindex[-10:]]
    kdtree = KNN(k=3)
    kdtree.train(data[rndindex[:-10]])
    rst = np.zeros(testdata.shape[0])
    for i,item in enumerate(testdata):
        rst[i]= kdtree.run(item[0:2])
    print(rst)
    print(testdata[:,-1])
    plotKdTree(kdtree,[30,100],[30,100])
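Besides computing the prediction for each input instance, the code also plots the feature space. In the figure above, blue dots are positive-class instances and red crosses are negative-class instances; each black line is the separating plane that splits a sub-region of the feature space in two at the current node. A vertical line means the data was split along the x dimension, and a horizontal line means it was split along the y dimension.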
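The listing above keeps the k candidates in a plain list and looks for the farthest one with `max` on every visit. A common alternative, shown here only as a sketch, is a bounded max-heap built with the standard `heapq` module, which makes the farthest candidate available at the top of the heap. For brevity the example scans all points linearly instead of walking a kd-tree; the function name `k_nearest` and the sample points are hypothetical.

```python
import heapq
import math

def k_nearest(points, query, k):
    """Return the k points closest to `query` as (distance, point), nearest first."""
    heap = []  # max-heap on distance, emulated by storing negated distances
    for idx, p in enumerate(points):
        d = math.dist(p, query)
        if len(heap) < k:
            # Container not full yet: add the candidate directly.
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            # Closer than the current farthest candidate: replace it.
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, points[idx]) for neg_d, idx in heap)

# Hypothetical sample points.
pts = [(0, 0), (1, 1), (2, 2), (3, 3), (10, 10)]
result = k_nearest(pts, (0.9, 0.9), k=2)
print([p for _, p in result])  # [(1, 1), (0, 0)]
```

Each replacement then costs O(log k) rather than the O(k) scan of a list, which matters little for k = 3 but adds up for larger k.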
References
[1] Li Hang, 《统计学习方法》 (Statistical Learning Methods).
[2] The demo data set comes from Andrew Ng's Coursera course "Machine Learning".