[AI Notes] KNN

mark__tuwen

Category: AI Notes. Tags: machine learning, neural networks, deep learning

Article link: https://blog.csdn.net/mark__tuwen/article/details/106421953

K-Nearest Neighbors
KNN hyperparameters
- K
- Distance metric
Code implementation
The curse of dimensionality

References:
CS231n course

K-Nearest Neighbors

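K-Nearest Neighbors (KNN) is a classification algorithm. As the name suggests, to classify a sample it finds the K labeled samples nearest to it in the sample space and assigns the class that holds the majority among those K neighbors.
In the figure below, the scattered points are samples with known labels, and the colored regions show how the classifier labels each point of the sample space. K is a hyperparameter of KNN: the number of nearest labeled samples consulted for each sample to be classified.
[Figure: KNN decision regions for labeled sample points]
In the figure, when K=1 a small patch in the middle of the green region is classified as yellow, because a single yellow labeled sample sits there. A decision region like this is likely to generalize poorly.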
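The voting rule described above can be sketched in a few lines. This is a minimal illustration, not the code used later in this post: the function name knn_predict, the toy data, and the choice of L2 distance are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of one sample x by majority vote among its k nearest neighbors."""
    # L2 distance from x to every training sample (an assumed choice of metric)
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors (ties go to the label seen first)
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data: class 0 clustered near the origin, class 1 near (5, 5),
# plus one stray class-1 point sitting inside the class-0 cluster
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],
                    [0.5, 0.5]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

query = np.array([0.6, 0.6])
pred_k1 = knn_predict(X_train, y_train, query, k=1)  # follows the stray point
pred_k5 = knn_predict(X_train, y_train, query, k=5)  # stray point is outvoted
print(pred_k1, pred_k5)
```

With K=1 the query inherits the label of the single stray point next to it, which mirrors the small yellow patch inside the green region; with K=5 the surrounding cluster outvotes it.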

KNN hyperparameters

Besides K, KNN has one more hyperparameter: how the distance between two samples is measured.

K

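As the figure above suggests, a K that is too small can hurt generalization. A K that is too large is no better: once K covers the whole set of labeled samples, every sample to be classified receives the same label (the overall majority class), and the classifier can no longer discriminate.
K should therefore be chosen with care, for example by comparing candidate values on held-out validation data.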
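One common way to pick K is to try several candidates and keep the one that scores best on a validation set. The sketch below assumes a tiny hand-made dataset and hypothetical helper names (knn_predict, validation_accuracy); it only illustrates the procedure, and real data would need a proper train/validation split.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k training samples closest to x (L2 distance)."""
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


def validation_accuracy(X_train, y_train, X_val, y_val, k):
    """Fraction of validation samples whose predicted label matches the true one."""
    preds = np.array([knn_predict(X_train, y_train, x, k) for x in X_val])
    return float(np.mean(preds == y_val))


# toy training set: five class-0 points near the origin, three class-1 points
# near (5, 5), and one stray class-1 point inside the class-0 cluster
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],
                    [0.5, 0.5]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

# toy validation set
X_val = np.array([[0.6, 0.6], [0.4, 0.4], [5.5, 5.5]])
y_val = np.array([0, 0, 1])

# K=9 uses every training sample, so it always predicts the majority class
accuracies = {k: validation_accuracy(X_train, y_train, X_val, y_val, k)
              for k in (1, 3, 9)}
best_k = max(accuracies, key=accuracies.get)
print(accuracies, best_k)
```

On this toy data K=1 is misled by the stray point, K=9 predicts class 0 for everything, and the intermediate K=3 scores best, which is the trade-off described above.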

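Distance metric

Common choices for the distance between samples are the L1 (Manhattan) distance and the L2 (Euclidean) distance, illustrated in the figure below.
Under the L1 distance, the points at a fixed distance from a sample form a diamond (a square rotated by 45°); under the L2 distance they form a circle.
[Figure: points equidistant from a sample under the L1 and L2 distances]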
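For two vectors a and b, the L1 distance is the sum of |a_i - b_i| over all components, and the L2 distance is the square root of the sum of (a_i - b_i)^2. A small sketch, with function names chosen only for this example, shows why the equidistant shapes differ:

```python
import numpy as np


def l1_distance(a, b):
    """Sum of absolute component-wise differences."""
    return float(np.sum(np.abs(a - b)))


def l2_distance(a, b):
    """Square root of the sum of squared component-wise differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


origin = np.array([0.0, 0.0])
on_axis = np.array([1.0, 0.0])
on_diagonal = np.array([0.5, 0.5])

# both points lie on the L1 "diamond" of radius 1 around the origin
l1_axis = l1_distance(origin, on_axis)
l1_diag = l1_distance(origin, on_diagonal)

# only the first lies on the L2 circle of radius 1; the second is closer
l2_axis = l2_distance(origin, on_axis)
l2_diag = l2_distance(origin, on_diagonal)
print(l1_axis, l1_diag, l2_axis, l2_diag)
```

The point on the diagonal is as far away as the point on the axis under L1, but noticeably closer under L2, which is exactly the difference between the diamond and the circle.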
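If the coordinate axes of the sample space are rotated, the L1 distance between two samples changes, and so does the set of points equidistant from a sample.
Under the L2 distance, rotating the axes changes nothing: the distances and the equidistant points stay the same.
The choice of distance metric therefore encodes an assumption about the sample space: L1 depends on the particular coordinate axes, while L2 does not.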
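The effect of rotating the axes is easy to check numerically. The following sketch rotates two arbitrary 2-D points by 45° and compares distances before and after; the rotate helper is defined here for illustration only.

```python
import numpy as np


def rotate(points, angle_rad):
    """Rotate 2-D row vectors counter-clockwise by angle_rad."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rotation = np.array([[c, -s], [s, c]])
    return points @ rotation.T


points = np.array([[0.0, 0.0], [1.0, 0.0]])
rotated = rotate(points, np.pi / 4)

# ord=1 gives the L1 norm of the difference vector, ord=2 the L2 norm
l1_before = np.linalg.norm(points[0] - points[1], ord=1)
l1_after = np.linalg.norm(rotated[0] - rotated[1], ord=1)
l2_before = np.linalg.norm(points[0] - points[1], ord=2)
l2_after = np.linalg.norm(rotated[0] - rotated[1], ord=2)
print(l1_before, l1_after, l2_before, l2_after)
```

The L1 distance grows from 1 to the square root of 2 after the rotation, while the L2 distance stays at 1.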
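Comparing the two results in the figure below, the decision boundaries under the L1 distance are more jagged, while those under the L2 distance are smoother.
[Figure: decision boundaries under the L1 and L2 distances]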

Code implementation

KNN is used here to classify the CIFAR-10 dataset. Only K=1 is implemented.

#NearestNeighbor.py
import numpy as np


class NearestNeighbor(object):
  def __init__(self):
    pass

  def train(self, X, y):
    """ X is N x D where each row is an example. y is 1-dimensional of size N """
    # the nearest neighbor classifier simply remembers all the training data
    self.Xtr = X
    self.ytr = y

  def predict(self, X):
    """ X is N x D where each row is an example we wish to predict a label for.
    X and the training data should have a signed or floating-point dtype. """
    num_test = X.shape[0]
    print('[predict] num_test = {}'.format(num_test))
    # make sure that the output type matches the label type
    Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

    # loop over all test rows
    for i in range(num_test):
      # find the nearest training image to the i'th test image
      # using the L1 distance (sum of absolute value differences)
      distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
      min_index = np.argmin(distances)  # get the index with the smallest distance
      Ypred[i] = self.ytr[min_index]  # predict the label of the nearest example
      print('[predict] done {}/{}'.format(i, num_test))

    return Ypred

#KNN_run.py
import pickle

import numpy as np
from NearestNeighbor import NearestNeighbor


# read one CIFAR-10 batch file
def unpickle(file):
    with open(file, 'rb') as fo:
        return pickle.load(fo, encoding='bytes')


DATA_DIR = "D:/Desktop/论文及算法实现/DATA_SETS/cifar-10-batches-py"
#DATA_DIR = "cifar-10-batches-py"

# get the training data (five batches)
batches = [unpickle("{}/data_batch_{}".format(DATA_DIR, i)) for i in range(1, 6)]
# CIFAR-10 pixels are stored as uint8; cast to a signed type so that the
# subtraction inside predict() cannot wrap around modulo 256
dataTr = np.concatenate([b[b"data"] for b in batches]).astype(np.int16)
labelTr = np.concatenate([b[b"labels"] for b in batches])

# get the test data
test = unpickle(DATA_DIR + "/test_batch")
dataTs = test[b"data"].astype(np.int16)
labelTs = np.asarray(test[b"labels"])
print('dataTr.shape: {}'.format(dataTr.shape))

nn = NearestNeighbor()  # create a Nearest Neighbor classifier class
nn.train(dataTr, labelTr)  # train the classifier on the training images and labels
Yte_predict = nn.predict(dataTs)  # predict labels on the test images
# and now print the classification accuracy, which is the fraction
# of examples that are correctly predicted (i.e. label matches)
print('accuracy: %f' % (np.mean(Yte_predict == labelTs)))

Output:

[predict] done 9994/10000
[predict] done 9995/10000
[predict] done 9996/10000
[predict] done 9997/10000
[predict] done 9998/10000
[predict] done 9999/10000
accuracy: 0.249200

The logged accuracy is 0.2492. That run was most likely made with the pixel arrays left as uint8, the type they have in the pickle files; in that case the subtraction in predict wraps around modulo 256 and the result is not a true L1 distance, which probably explains why the figure falls short of the roughly 38.6% that the CS231n notes report for an L1 nearest-neighbor classifier on CIFAR-10. Either way, nearest neighbor on raw pixels is not very accurate, although it does beat the 10% expected from random guessing over ten classes.
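The pixel arrays in the CIFAR-10 Python batches are stored as uint8, and that matters for the subtraction inside predict. The sketch below uses made-up three-element arrays, not CIFAR-10 data, to show how unsigned subtraction distorts an L1 distance and how a cast to a signed type avoids it.

```python
import numpy as np

# two made-up "pixel" vectors stored as unsigned 8-bit integers
a = np.array([10, 200, 30], dtype=np.uint8)
b = np.array([20, 100, 30], dtype=np.uint8)

# uint8 subtraction wraps around: 10 - 20 becomes 246 instead of -10,
# and np.abs has nothing to undo because the values are already non-negative
wrapped = np.sum(np.abs(a - b))

# casting to a signed type first gives the true L1 distance
correct = np.sum(np.abs(a.astype(np.int16) - b.astype(np.int16)))
print(int(wrapped), int(correct))
```

A single component where the first value is smaller than the second is enough to inflate the distance, so the nearest neighbor found on uint8 data can differ from the true one.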

维度灾难

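To classify one sample, KNN computes its distance to every labeled sample, so a single prediction costs O(N) distance computations for N labeled samples, and each distance takes time proportional to the number of dimensions. The more serious problem is coverage: for the nearest neighbors to be genuinely close, the labeled samples must fill the sample space densely, and the number of samples needed to keep a given density grows exponentially with the dimension, dragging the computation up with it. This is known as the curse of dimensionality.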
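A back-of-the-envelope calculation makes the growth concrete. Assuming we want a fixed number of samples along each axis (4 here, an arbitrary choice for illustration), the number of samples needed to fill a grid is that number raised to the power of the dimension. The helper name samples_needed is hypothetical.

```python
def samples_needed(per_axis, dim):
    """Grid points needed to keep per_axis samples along each of dim axes."""
    return per_axis ** dim


# low-dimensional cases stay manageable
counts = {d: samples_needed(4, d) for d in (1, 2, 3)}
print(counts)

# a CIFAR-10 image has 32 x 32 pixels with 3 color channels
cifar_dim = 32 * 32 * 3
digits = len(str(samples_needed(4, cifar_dim)))
print(cifar_dim, digits)
```

Going from one to three dimensions raises the count from 4 to 64, and at the dimensionality of a CIFAR-10 image the count is a number with well over a thousand digits, far beyond any dataset that could be collected.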
