k-Nearest Neighbor Method

The k-nearest neighbor (k-NN) algorithm is simple and intuitive: given a training dataset, for a new input instance we find the k instances in the training set that are closest to it. If the majority of these k instances belong to a certain class, the input instance is assigned to that class.


Characteristics of the k-nearest neighbor method:

  1. k-NN has no explicit learning process. It effectively uses the training dataset to partition the feature space, and that partition serves as the classification model.
  2. k-NN has three basic elements: the choice of k, the distance metric, and the classification decision rule. Once these three are fixed, the result is uniquely determined.
  3. Implementing k-NN requires a way to search for the k nearest points quickly. The kd-tree is a data structure that supports fast retrieval in k-dimensional space.

Distance Metric

The distance between two instance points in the feature space reflects how similar they are. The feature space of the k-NN model is usually the n-dimensional real vector space $\mathbb{R}^n$. The distance used is normally the Euclidean distance, but other distances can be used as well, such as the more general $L_p$ distance (Minkowski distance).

Let the feature space $\mathcal{X}$ be the n-dimensional real vector space $\mathbb{R}^n$, and let $x_i, x_j \in \mathcal{X}$. The $L_p$ distance between $x_i$ and $x_j$ is defined as

$$L_p(x_i, x_j) = \left\{ \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right\}^{\frac{1}{p}}$$

Here $p \ge 1$. When $p = 2$ it is the Euclidean distance; when $p = 1$ it is the Manhattan distance; when $p = \infty$ it is the maximum of the coordinate-wise distances. The figure below shows, for several values of p in two dimensions, the set of points whose $L_p$ distance from the origin equals 1.
[Figure: unit "circles" of the $L_p$ distance from the origin in 2D for different values of p]
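
As a quick illustration of the formula (a minimal sketch that is not part of the original post; the helper name lp_distance is made up), the $L_p$ distance for several values of p can be computed directly with NumPy:

import numpy as np

def lp_distance(x_a, x_b, p):
    # Minkowski (L_p) distance between two 1d vectors; p = np.inf gives the
    # maximum coordinate-wise distance (L_infinity)
    diff = np.abs(np.asarray(x_a, dtype=float) - np.asarray(x_b, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x_i, x_j = np.array([1.0, 1.0]), np.array([4.0, 5.0])
print(lp_distance(x_i, x_j, 1))       # 7.0  (Manhattan distance)
print(lp_distance(x_i, x_j, 2))       # 5.0  (Euclidean distance)
print(lp_distance(x_i, x_j, np.inf))  # 4.0  (maximum coordinate distance)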


Choice of k

The choice of k has a major influence on the result of the k-NN method.

If a small k is chosen, prediction is based on training instances in a small neighborhood. The approximation error of "learning" decreases, but its estimation error increases: the prediction becomes very sensitive to the nearby instance points, and overfitting occurs easily.

If a large k is chosen, prediction is based on instances in a larger neighborhood. The advantage is a smaller estimation error, but the approximation error grows, because training instances far from (and dissimilar to) the input instance also influence the prediction and can make it wrong. Increasing k makes the overall model simpler.

In practice, k is usually set to a relatively small value, and cross-validation is typically used to select the optimal k, as sketched below.
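
For example, k could be chosen by cross-validation roughly as follows (a sketch using scikit-learn, which the post's own code does not use, so treat these calls only as an illustration):

# choose k by 5-fold cross-validation; X is an (m, n) feature array, y the labels
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, candidates=range(1, 16)):
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in candidates}
    return max(scores, key=scores.get)   # k with the highest mean validation accuracy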


Classification Decision Rule

The classification decision rule in k-NN is usually majority voting: the class of the input instance is determined by the majority class among its k nearest training instances.

The majority-voting rule can be interpreted as follows. If the loss function for classification is the 0-1 loss and the classification function is

$$f: \mathbb{R}^n \to \{c_1, c_2, \ldots, c_K\}$$

then the probability of misclassification is
$$P(Y \ne f(X)) = 1 - P(Y = f(X))$$

For a given instance $x \in \mathcal{X}$, let $N_k(x)$ denote the set of its k nearest training instances. If the class assigned to the region covering $N_k(x)$ is $c_j$, the misclassification rate is
$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \ne c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

To minimize the misclassification rate, i.e. the empirical risk, we must maximize $\sum_{x_i \in N_k(x)} I(y_i = c_j)$. Hence the majority-voting rule corresponds to empirical risk minimization.
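
Concretely, the majority-vote rule is just an argmax over the class counts of the k neighbors. A tiny illustrative snippet (the label values are made up):

from collections import Counter

neighbor_labels = [1, 0, 1, 1, 0]        # labels y_i of the k = 5 nearest neighbors
votes = Counter(neighbor_labels)         # class counts: {1: 3, 0: 2}
prediction = votes.most_common(1)[0][0]  # the class c_j maximizing the sum of I(y_i == c_j); here 1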


Code Implementation

# -*- coding: utf-8 -*-


'''
Building the kd-tree:
1. choose the splitting dimension and the splitting point
2. create the root node of the new subtree and connect it to its parent
3. if there is still data on the left of the splitting point, split the left part recursively
4. if there is still data on the right of the splitting point, split the right part recursively

Searching: if the radius of the hypersphere (centered at the target point, with radius equal
to the distance to the current nearest node) is greater than the distance from the target
point to the splitting plane of the parent node, the hypersphere intersects the hyperrectangle
represented by the subtree rooted at the sibling of the current subtree root, so that sibling
subtree should also be searched.

For k-nearest-neighbor search a queue is maintained: the first k nodes encountered are put
into it, and every newly visited node is compared with the farthest node in the queue; if the
new node is closer, it replaces that farthest node and the ordering of the queue is updated.
'''
import numpy as np
import matplotlib.pyplot as plt
import os
import random

my_dir = os.path.dirname(__file__)
picname = "k_nearest_neighbor"

'''
calculate the Euclidean distance between two points
: @param    x_a: point's coordinate    1d-array
: @param    x_b: point's coordinate    1d-array
: @return   distance
'''
def Euclidistance(x_a,x_b):
    return np.sqrt(sum((x_a - x_b) * (x_a - x_b)))

'''
calculate the distance between a point and a plane
: @param    x: the point  1d-array
: @param    xp: an arbitrary point on the plane  1d-array
: @param    nvector: the normal vector of the plane  1d-array
: @return   Euclidean distance between the point and the plane   float
'''
def distance_point_plane(x, xp, nvector):
    return abs(float(np.dot(x - xp, nvector)))/np.sqrt(np.dot(nvector,nvector))


class kNode(object):
    def __init__(self, point,classLabel, splitDim, parent=None):
        self.point = point
        self.classLabel = classLabel
        self.splitDim = splitDim
        self.lc = None
        self.rc = None
        self.parent = parent


class KNN(object):
    def __init__(self, k=3):
        self.k = k
        self.root = None#tree root node  kdNode
        self.classLabels = None#all class labels  1d-array or 1d-list

    '''
    decide the splitting dimension of the dataset according to variance
    and the splitting position according to the median.
    : @param    dataset: the dataset used to build the child kd tree    2d-array
    : @return   splitDim:  the next dimension along which to split the dataset    int
    : @return   splitPos:  the split position    int
    '''
    def __selectSplitDimAndPos(self, dataset):
        per_dim_var = [np.var(dataset[:,i]) for i in range(dataset.shape[1] - 1)]
        splitDim = np.argmax(per_dim_var)
        splitPos = dataset.shape[0] // 2

        return splitDim, splitPos


    '''
    build the child kd tree
    : @param    dataset: the dataset used to build the child kd tree    2d-array
    : @param    parent: the child kd tree's parent node
    : @return   root: the root node of the child kd tree
    '''
    def __buildKdTree(self, dataset, parent=None):
        splitDim, splitPos = self.__selectSplitDimAndPos(dataset)
        dataset = np.array(sorted(dataset,key=lambda x: x[splitDim]))
        node = kNode(dataset[splitPos,:-1], dataset[splitPos,-1], splitDim, parent)

        if splitPos > 0: node.lc = self.__buildKdTree(dataset[:splitPos,:],node)
        if splitPos + 1 < dataset.shape[0] : node.rc = self.__buildKdTree(dataset[(splitPos+1): ,:],node)

        return node
    '''
    train on the dataset. Note that the kNN algorithm has no explicit learning process:
    training simply builds the kd-tree.
    '''
    def train(self, dataset):
        if not dataset.shape[0]:
            return
        self.classLabels = set(dataset[:,-1])
        self.root = self.__buildKdTree(dataset)
    '''
    go straight to a child kd-tree's leaf node
    : @param    croot: child kd-tree's root node    kdNode
    : @param    data:  objective data    1d-array
    : @return   the leaf node of the child kd-tree  kdNode
    '''
    def __toLeaf(self, croot, data):
        # descend to a leaf, at each node following the side of the splitting
        # plane that contains the objective data (or the only existing child)
        while croot.lc or croot.rc:
            if data[croot.splitDim] > croot.point[croot.splitDim]:
                croot = croot.rc if croot.rc else croot.lc
            else:
                croot = croot.lc if croot.lc else croot.rc
        return croot
    '''
    search the nearest one node with objective data in child kd-tree
    : @param    data: objective data    1d-array
    : @param    root: root node of the child kd-tree   kdNode
    : @return   nearest: the nearest one node  kdNode
    '''
    def __searchNearestOne(self, data, root):
        #search the leaf node in data's region
        cur_node = self.__toLeaf(root,data)
        nearest = cur_node#the current nearest node of the child kd-tree
        nearest_dist = Euclidistance(data,nearest.point)

        nDims = cur_node.point.shape
        while cur_node != root:#exit search the child kd-tree when the root node is visited
            fromLeft = True
            if cur_node != cur_node.parent.lc: fromLeft = False 
            cur_node = cur_node.parent   #upward backtracking

            #calculate the distance between the objective point and the splitting
            #plane defined by cur_node (the parent node just backtracked to)
            plane_nvector = np.zeros(nDims)
            plane_nvector[cur_node.splitDim] = 1#the normal vector of the splitting plane
            distptpe = distance_point_plane(data,cur_node.point,plane_nvector)
            #if this distance is smaller, the hyperrectangle of the sibling subtree intersects
            #the hypersphere centered at the objective point with radius nearest_dist
            if distptpe < nearest_dist:
                #if the left branch has been visited, then go to right branch
                if fromLeft and cur_node.rc: 
                    #check whether updating the nearest point
                    new_nst = self.__searchNearestOne(data, cur_node.rc)
                    new_nst_dist = Euclidistance(data,new_nst.point)
                    if new_nst_dist < nearest_dist:
                        nearest = new_nst
                        nearest_dist = new_nst_dist
                #if the right branch has been visited, then go to left branch
                if not fromLeft and cur_node.lc:
                    new_nst = self.__searchNearestOne(data, cur_node.lc)
                    new_nst_dist = Euclidistance(data,new_nst.point)
                    if new_nst_dist < nearest_dist: 
                        nearest = new_nst
                        nearest_dist = new_nst_dist
            #visit the current node and check whether to update the nearest point
            distptpt = Euclidistance(data, cur_node.point)
            if distptpt < nearest_dist: 
                nearest = cur_node
                nearest_dist = distptpt

        return nearest
    '''
    check whether the newly visited node should replace one of the current k nearest
    nodes of the objective point. The function is only used when searching for the k nearest nodes.
    : @param    data: the objective point    1d-array
    : @param    cur_node: the newly visited node    kdNode
    : @param    nearestK: the container recording the k nearest nodes and their distances
                          to the objective point; its capacity is k  2d-list
    : @return   None
    '''
    def __updateNearestK(self, data, cur_node, nearestK):
        dist = Euclidistance(data,cur_node.point)
        if len(nearestK) < self.k:#the size of nearestK has not reached k yet
            nearestK.append([cur_node, dist])
        else:
            #if the size has already reached k, find the farthest point in nearestK
            #and compare it with cur_node
            maxdis_id = nearestK.index(max(nearestK,key=lambda x:x[1]))
            maxnearest = nearestK[maxdis_id][0]
            maxnearest_dist = nearestK[maxdis_id][1]
            if dist < maxnearest_dist:
                nearestK[maxdis_id] = [cur_node, dist]
    '''
    search k nearest nodes within the child kd-tree
    : @param    data: the objective point    1d-array
    : @param    root: the root node of the child kd-tree    kdNode
    : @param    nearestK: the container recording the k nearest nodes and their distances
                          to the objective point; its capacity is k   2d-list
    : @return   None
    '''
    def __searchNearestK(self, data, root, nearestK):
        cur_node = self.__toLeaf(root,data)
        self.__updateNearestK(data,cur_node,nearestK)
        nDims = cur_node.point.shape
        while cur_node != root:
            fromLeft = True
            if cur_node != cur_node.parent.lc: fromLeft = False
            cur_node = cur_node.parent
            #no need to check whether the hyperrectangle intersects the hypersphere
            #if nearestK is not full yet
            if len(nearestK) != self.k:
                if fromLeft and cur_node.rc:
                    self.__searchNearestK(data, cur_node.rc, nearestK)
                if not fromLeft and cur_node.lc:
                    self.__searchNearestK(data, cur_node.lc, nearestK)
            else:
                plane_nvector = np.zeros(nDims)
                plane_nvector[cur_node.splitDim] = 1

                maxnearest_dist = max(nearestK, key=lambda x: x[1])[1]

                distptpe = distance_point_plane(data,cur_node.point,plane_nvector)
                if distptpe < maxnearest_dist:
                    if fromLeft and cur_node.rc:
                        self.__searchNearestK(data, cur_node.rc, nearestK)
                    if not fromLeft and cur_node.lc:
                        self.__searchNearestK(data, cur_node.lc, nearestK)
            self.__updateNearestK(data,cur_node,nearestK)

    '''
    predict the class label of the objective point
    : @param    data: the objective point    1d-array
    : @return   predicted class label    int
    '''
    def run(self, data):
        if not self.root:
            print("Train the KNN algorithm firstly!")
            return None

        nearestK = []
        self.__searchNearestK(data, self.root, nearestK)

        labelcount = dict()
        for label in self.classLabels:
            labelcount[label] = 0
        for item in nearestK:
            label = item[0].classLabel
            labelcount[label] += 1

        return max(labelcount,key=lambda x:labelcount[x])
        # nearest = self.__searchNearestOne(data, self.root)
        # return nearest.classLabel

'''
print the kd-tree
'''
def printKdTree(kdtree):
    if not kdtree.root: return
    def printKdNode(kdnode, indent):
        if not isinstance(kdnode,kNode): return

        print(kdnode.point, "  ", kdnode.splitDim)
        if kdnode.lc:
            print(" "*indent,"left:", end=" ")
            printKdNode(kdnode.lc, indent + 1)
        if kdnode.rc:
            print(" "*indent,"right:", end=" ")
            printKdNode(kdnode.rc, indent + 1)

    printKdNode(kdtree.root,0)

'''
plot the kdTree. Note that it can only be used when the kdTree is 2D.
'''
def plotKdTree(kdtree, _xlim, _ylim):
    if not kdtree.root: return
    fig = plt.figure()
    ax = plt.axes(xlim=_xlim,ylim=_ylim)
    def plotChildKdTree(kdnode, xlim, ylim):
        x = kdnode.point[0];y = kdnode.point[1]
        if kdnode.splitDim == 0:
            plt.plot([x,x],ylim,'k',lw=2)
        else:
            plt.plot(xlim,[y,y],'k',lw=2)
        if kdnode.classLabel == 1:
            plt.plot(x,y,'bo',markersize=5)
        else: plt.plot(x,y,'rx',markersize=5)

        if kdnode.lc:
            if kdnode.splitDim == 0:
                xlim_l = [xlim[0],x]
                ylim_l = ylim
            else:
                xlim_l = xlim
                ylim_l = [ylim[0],y]
            plotChildKdTree(kdnode.lc, xlim_l,ylim_l)
        if kdnode.rc:
            if kdnode.splitDim == 0:
                xlim_r = [x,xlim[1]]
                ylim_r = ylim
            else:
                xlim_r = xlim
                ylim_r = [y,ylim[1]]
            plotChildKdTree(kdnode.rc,xlim_r,ylim_r)

    plotChildKdTree(kdtree.root,_xlim,_ylim)
    plt.show()
    global my_dir
    global picname
    fig.savefig(os.path.join(my_dir,picname),dpi=80)


if __name__ == "__main__":

    filename = "data.txt"
    data = []
    with open(os.path.join(my_dir,filename),'r') as f:
        for line in f.readlines():
            temp1 = line.strip().split(',')
            temp2 = [float(temp1[0]),float(temp1[1]),int(temp1[2])]
            data.append(temp2)
    rndindex = [i for i in range(len(data))]
    random.shuffle(rndindex)

    data = np.array(data)
    testdata = data[rndindex[-10:]]

    kdtree = KNN(k=3)
    kdtree.train(data[rndindex[:-10]])

    rst = np.zeros(testdata.shape[0])
    for i,item in enumerate(testdata):
        rst[i]= kdtree.run(item[0:2])


    print(rst)
    print(testdata[:,-1])


    plotKdTree(kdtree,[30,100],[30,100])  

[Figure: partition of the 2D feature space by the learned kd-tree, as drawn by plotKdTree]

Besides predicting the labels of the input instances, the code also plots the feature space. In the figure above, blue dots denote positive instances and red marks denote negative instances, while each black line is the splitting plane that divides the current sub-region in two at the corresponding node: vertical lines split the data along the x dimension, and horizontal lines split it along the y dimension.
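
If data.txt is not at hand, the classifier can also be tried on a tiny hard-coded dataset; a minimal sketch (the points and labels below are made up for illustration):

import numpy as np

toy = np.array([[2.0, 3.0, 0],
                [5.0, 4.0, 0],
                [4.0, 7.0, 0],
                [9.0, 6.0, 1],
                [8.0, 1.0, 1],
                [7.0, 2.0, 1]])

clf = KNN(k=3)
clf.train(toy)                          # last column is the class label
print(clf.run(np.array([3.0, 4.5])))    # majority label among the 3 nearest neighbors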


References

[1] Li Hang, Statistical Learning Methods (《统计学习方法》).
[2] The demo dataset comes from Andrew Ng's Coursera course "Machine Learning".
