K-Nearest Neighbor (kNN) and Its Python Implementation

Many thanks to: http://cs231n.github.io/classification/

  • 1. The Nearest Neighbor Algorithm
    • Given a training dataset, for each new input instance we find the k instances in the training set nearest to it; the new instance is assigned to the class that the majority of those k instances belong to. When k = 1 this is called the nearest neighbor algorithm.
    • The k-nearest neighbor algorithm has three basic elements:
      • Choice of k: decreasing k makes the overall model more complex and prone to overfitting; increasing k makes the model simpler. At the extreme k = N, the model completely ignores the useful information in the training instances: every new input is simply assigned to the class with the most training examples, i.e. classification by label frequency alone, which is unreliable.
        [Figure: decision regions of an NN classifier (k = 1) vs. a 5-NN classifier]
        In the figure above, the NN classifier (k == 1) carves out regions around the few green outliers inside the blue cluster; such a model overfits easily and generalizes poorly. The 5-NN classifier on the far right smooths away those green outliers and generalizes much better.
      • Distance metric
      • Decision rule: majority vote.
    • When the inputs are two images, each can be flattened into a vector, $I_1$ and $I_2$. Using the L1 distance first: $d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$. The process can be visualized as: [Figure: pixel-wise L1 difference between two images]
    • The L2 distance can also be used; geometrically it is the Euclidean distance between the two vectors: $d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$
    • Pros and cons of the nearest neighbor algorithm
      • Pro: easy to implement and understand, and no training is required
      • Con: prediction at test time is very slow
      • Con: when the input is high-dimensional, e.g. large images, distances such as L2 have no direct perceptual meaning.
        [Figure: an original image and three modified images, all at the same L2 distance from it]
        Pixel-based distances in high-dimensional spaces are very unintuitive: in the figure above, the leftmost image is the original, and the three images to its right all have the same L2 distance to it, yet visually and semantically they have little in common. L1 and L2 distances correlate strongly only with an image's background and color distribution.
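To make the two metrics concrete, here is a minimal NumPy sketch computing the L1 and L2 distances between two toy 4-pixel "images" (the pixel values are made up for illustration):

```python
import numpy as np

# Two toy "images" flattened into pixel vectors (hypothetical values)
I1 = np.array([56, 32, 10, 18], dtype=float)
I2 = np.array([12, 24, 10, 2], dtype=float)

# L1 distance: sum of absolute pixel-wise differences
d1 = np.sum(np.abs(I1 - I2))

# L2 distance: Euclidean distance between the two vectors
d2 = np.sqrt(np.sum(np.square(I1 - I2)))

print(d1)  # 68.0
print(d2)  # approximately 47.5
```

Note that for ranking nearest neighbors the square root in L2 can be dropped, since it does not change the ordering of distances.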
    • 2. Code Implementation
    #!/usr/bin/env python2
    # -*- coding: utf-8 -*-
    """
    Created on Thu Aug  2 09:46:44 2018
    @author: rd
    """
    from __future__ import division
    import numpy as np

    class KNearestNeighbor(object):
        """A kNN classifier with (squared) L2 distance."""
        def __init__(self):
            pass

        def train(self, X, Y):
            """For kNN, "training" just means memorizing the training data."""
            self.X_train = X
            self.Y_train = Y

        def predict(self, X, k=1, num_loops=0):
            if num_loops == 0:
                dists = self.compute_distances_no_loops(X)
            elif num_loops == 1:
                dists = self.compute_distances_one_loop(X)
            elif num_loops == 2:
                dists = self.compute_distances_two_loops(X)
            else:
                raise ValueError('Invalid value %d for num_loops' % num_loops)
            return self.predict_labels(dists, k=k)

        def compute_distances_two_loops(self, X):
            """Naive version: one Python loop over test points, one over
            training points. The square root is omitted throughout, since
            squared L2 preserves the nearest-neighbor ordering."""
            num_test = X.shape[0]
            num_train = self.X_train.shape[0]
            dists = np.zeros((num_test, num_train))
            for i in range(num_test):
                for j in range(num_train):
                    dists[i][j] = np.sum(np.square(self.X_train[j, :] - X[i, :]))
            return dists

        def compute_distances_one_loop(self, X):
            """Half-vectorized version: broadcast each test point against
            the whole training matrix."""
            num_test = X.shape[0]
            num_train = self.X_train.shape[0]
            dists = np.zeros((num_test, num_train))
            for i in range(num_test):
                dists[i] = np.sum(np.square(self.X_train - X[i]), axis=1)
            return dists

        def compute_distances_no_loops(self, X):
            """Fully vectorized version, using the expansion
            ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y"""
            squa_sum_X = np.sum(np.square(X), axis=1).reshape(-1, 1)
            squa_sum_Xtr = np.sum(np.square(self.X_train), axis=1)
            inner_prod = np.dot(X, self.X_train.T)
            dists = -2 * inner_prod + squa_sum_X + squa_sum_Xtr
            return dists

        def predict_labels(self, dists, k=1):
            num_test = dists.shape[0]
            y_pred = np.zeros(num_test)
            for i in range(num_test):
                # Indices of the k nearest training points, then majority vote
                pos = np.argsort(dists[i])[:k]
                closest_y = self.Y_train[pos]
                y_pred[i] = np.argmax(np.bincount(closest_y.astype(int)))
            return y_pred

    """
    This dataset is part of the MNIST dataset, but with only 3 classes,
    classes = {0:'0', 1:'1', 2:'2'}. The images are compressed to 14*14
    pixels and stored in a matrix together with the corresponding label,
    so the shape of the data matrix is
    num_of_images x (14*14 pixels + 1 label)
    """
    def load_data(split_ratio):
        tmp = np.load("data216x197.npy")
        data = tmp[:, :-1]
        label = tmp[:, -1]
        # Zero-center the pixel values using the mean image
        mean_data = np.mean(data, axis=0)
        train_data = data[int(split_ratio * data.shape[0]):] - mean_data
        train_label = label[int(split_ratio * data.shape[0]):]
        test_data = data[:int(split_ratio * data.shape[0])] - mean_data
        test_label = label[:int(split_ratio * data.shape[0])]
        return train_data, train_label, test_data, test_label

    def main():
        train_data, train_label, test_data, test_label = load_data(0.4)
        knn = KNearestNeighbor()
        knn.train(train_data, train_label)
        Yte = knn.predict(test_data, k=2)
        print "The accuracy is {}".format(np.mean(Yte == test_label))

    if __name__ == "__main__":
        main()
    >>>python knn.py
    The accuracy is 0.976744186047
    # The dataset is small and the images are single-channel and tiny, so the classification result is quite good
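As a sanity check on the fully vectorized distance computation, the expansion ||x − y||² = ||x||² + ||y||² − 2·x·y used in compute_distances_no_loops can be verified against the naive double loop on random data (a standalone sketch with made-up shapes):

```python
import numpy as np

rng = np.random.RandomState(0)
X_train = rng.rand(5, 8)   # 5 "training" vectors of dimension 8
X_test = rng.rand(3, 8)    # 3 "test" vectors

# Naive: squared L2 distance for every (test, train) pair
naive = np.zeros((3, 5))
for i in range(3):
    for j in range(5):
        naive[i, j] = np.sum(np.square(X_test[i] - X_train[j]))

# Vectorized: ||x||^2 + ||y||^2 - 2*x.y via one matrix multiply
vec = (np.sum(np.square(X_test), axis=1).reshape(-1, 1)
       + np.sum(np.square(X_train), axis=1)
       - 2 * X_test.dot(X_train.T))

print(np.allclose(naive, vec))  # True
```

The vectorized form replaces the O(num_test × num_train) Python loops with a single matrix multiplication plus broadcasting, which is where most of the speedup comes from.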