Machine Learning — k-Nearest Neighbor (KNN) for Handwritten Digit Recognition (Part 3)

    This post builds on the KNN theory covered in the previous two parts, KNN Theory (Part 1) and KNN Theory (Part 2), and turns it into a working KNN implementation. Following the book Machine Learning in Action (《机器学习实战》), we use the classic and fun example of recognizing handwritten digits with the k-nearest-neighbor algorithm. The Python code is given below, with detailed comments so that the purpose of each step is easy to follow. Questions and discussion are welcome.

    The training data lives in the trainingDigits directory. We use 100 samples in total: the ten digits 0 through 9, with ten handwritten samples per digit, which gives us the 100 reference samples used for the distance computation.

 

Each handwritten digit is stored as a plain text file and has already been binarized, i.e. the image is represented using only the characters 0 and 1, roughly as follows:
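Here is a schematic excerpt of the first few rows of such a 32*32 file (purely illustrative; this pattern is made up and is not one of the actual dataset files):

00000000000000000000000000000000
00000000000111111100000000000000
00000000011111111111000000000000
00000000111111111111110000000000
00000001111110001111110000000000
...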


This representation makes it straightforward to flatten each image into a feature row vector for the distance computation, while still keeping enough of the original information to tell the digits apart.

The test data is stored in the testDigits directory. We prepared 50 samples to evaluate the classification accuracy: five test samples for each of the ten digits, 50 in total. (Note the file naming convention, which is used below to recover the true label of each sample.)
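To make the naming convention concrete, here is a minimal sketch of how a label is recovered from a file name (the name 1_120.txt is just an example of the digit_index.txt pattern; the same logic appears in the code below):

fileNameStr = '1_120.txt'                     # a sample of the digit 1 (example name)
fileStr = fileNameStr.split('.')[0]           # '1_120'
classNumStr = int(fileStr.split('_')[0])      # 1, the class label
print(classNumStr)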


The Python implementation is given below. The code consists of three main parts, whose responsibilities are:

Part 1 converts the original 32*32 binary digit matrix into a 1*1024 row vector. Part 2 is the test driver: it builds the hwLabels list of digit labels, loads the training samples into the big matrix trainingMat, then reads each test image in turn, passes it to the classify0 classifier, and prints the predicted result together with the true label and the final error rate. Part 3 is the core of the KNN algorithm, classify0: it computes the Euclidean distance from the test vector to every training sample, picks the k nearest neighbors, and returns the most frequent class among them as the prediction.

# -*- coding: utf-8 -*-
# Each sample is a 32*32 binary image stored as text; flatten it into a 1*1024 feature row vector.
# Part 1
from numpy import *
import operator
from os import listdir

def img2vector(filename):
    returnVect = zeros((1,1024))             # array that will hold the flattened image data
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()              # read one line of 32 binary characters
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])   # fill the 1024 slots, 32 per row
    return returnVect
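A quick way to check img2vector (the file name below is a hypothetical example and assumes the same directory layout used in the rest of the code):

testVector = img2vector('C:\\Anaconda\\trainingDigits\\0_1.txt')   # hypothetical example file
print(testVector.shape)       # (1, 1024)
print(testVector[0, 0:32])    # the first 32 values, i.e. the first row of the 32*32 image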


# Part 2
def handwritingClassTest():
    # load the training set into the big matrix trainingMat
    hwLabels = []
    trainingFileList = listdir('C:\\Anaconda\\trainingDigits')    # os.listdir(path) returns the file names under that directory as a list of strings
    m = len(trainingFileList)                 # m is the total number of training samples
    trainingMat = zeros((m,1024))             # holds the m training vectors
    for i in range(m):
        fileNameStr = trainingFileList[i]                  # training files are named like 1_120.txt; get the file name
        fileStr = fileNameStr.split('.')[0]                # split('.') cuts off the extension, leaving something like 1_120
        classNumStr = int(fileStr.split('_')[0])           # split on '_' to get 1, i.e. the digit class
        hwLabels.append(classNumStr)                       # hwLabels ends up holding the class of each of the m samples
        trainingMat[i,:] = img2vector('C:\\Anaconda\\trainingDigits\\%s' % fileNameStr)    # store each sample as one row of the m*1024 matrix

    # read the test images one by one and classify them
    testFileList = listdir('C:\\Anaconda\\testDigits')     # list of test file names
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]            # name of the test file to classify
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])          # true label parsed from the file name (a fresh classNumStr; the one from the training loop is not reused)
        vectorUnderTest = img2vector('C:\\Anaconda\\testDigits\\%s' % fileNameStr)  # convert the test digit into a row vector
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)   # run the classifier with k = 3
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if (classifierResult != classNumStr): errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))
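Before running the test, it can help to confirm that the two data directories are in place (a small sanity-check sketch; the paths are the same hard-coded ones used above and should be adjusted to your own setup):

from os import listdir
trainingDir = 'C:\\Anaconda\\trainingDigits'
testDir = 'C:\\Anaconda\\testDigits'
print('%d training files, %d test files' % (len(listdir(trainingDir)), len(listdir(testDir))))
# with the data described above this should print: 100 training files, 50 test files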


# Part 3
# The main classification routine: compute Euclidean distances, take the k closest training samples,
# and return the class that occurs most often among them as the prediction.
# inX is the vector to classify
# dataSet is the training set, one sample per row; labels holds the corresponding class labels
# k is the number of nearest neighbors to use
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                       # shape[0] is the number of rows, i.e. the number of training samples
    diffMat = tile(inX, (dataSetSize,1)) - dataSet       # tile(A,(m,n)) builds an array of m by n copies of A; here a 100*1024 array, one copy of inX per training sample
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)                  # sum(axis=1) sums each row; axis=0 would sum each column
    distances = sqDistances**0.5                         # take the square root to get the Euclidean distances
    sortedDistIndicies = distances.argsort()             # argsort() returns the indices of the original array in ascending order of distance
    classCount = {}                                      # sortedDistIndicies[0] is the index (in the original array) of the closest sample
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]       # label of the i-th nearest training sample
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1   # build the vote dictionary on the fly; get(key,0) returns the stored count, or 0 if the key is not present yet
    # classCount looks like e.g. {5:3, 0:6, 1:7, 2:1}
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # sort the (class, count) pairs by the count (the second element), descending
    return sortedClassCount[0][0]                        # sorted() gives a list of (key, value) tuples; return the class with the most votes
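To see classify0 on its own, here is a tiny made-up example with four 2-D points (the data is purely illustrative and has nothing to do with the digit vectors):

group = array([[1.0, 1.1],
               [1.0, 1.0],
               [0.0, 0.0],
               [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0(array([0.1, 0.2]), group, labels, 3))   # prints 'B'
# Euclidean distances from [0.1, 0.2] to the four rows are roughly 1.27, 1.20, 0.22 and 0.14,
# so the three nearest neighbors are the two 'B' points plus one 'A' point,
# and the majority vote among them returns 'B'.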


Test results:

import knn

knn.handwritingClassTest()
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9

the total number of errors is: 0

the total error rate is: 0.000000

The accuracy turns out to be very high: not a single misclassification. This is mainly because the amount of data used here is small, and it happens that none of the test samples were misclassified.




Reference: http://blog.csdn.net/u012162613/article/details/41768407#t2
