Machine Learning — k-Nearest Neighbor (KNN) for Handwritten Digit Recognition (Part 3)

    This post builds on the KNN theory covered in the previous two parts, KNN Theory (Part 1) and KNN Theory (Part 2), and turns it into a working KNN implementation. Following the book Machine Learning in Action (《机器学习实战》), we use the classic and fun example of recognizing handwritten digits with the k-nearest-neighbor algorithm. The Python code is given below, with detailed comments so that the purpose of each step is easy to follow. Questions and discussion are welcome.

    The training data lives in the trainingDigits directory. We use 100 samples in total: the ten digits 0 through 9, with ten handwritten samples per digit, which gives us the 100 reference samples used for the distance computation.

 

Each handwritten digit is stored as a plain text file and has already been binarized, i.e. the image is represented using only the characters 0 and 1, roughly as follows:
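Here is a schematic excerpt of the first few rows of such a 32*32 file (purely illustrative; this pattern is made up and is not one of the actual dataset files):

00000000000000000000000000000000
00000000000111111100000000000000
00000000011111111111000000000000
00000000111111111111110000000000
00000001111110001111110000000000
...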


This representation makes it straightforward to flatten each image into a feature row vector for the distance computation, while still keeping enough of the original information to tell the digits apart.

The test data is stored in the testDigits directory. We prepared 50 samples to evaluate the classification accuracy: five test samples for each of the ten digits, 50 in total. (Note the file naming convention, which is used below to recover the true label of each sample.)
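To make the naming convention concrete, here is a minimal sketch of how a label is recovered from a file name (the name 1_120.txt is just an example of the digit_index.txt pattern; the same logic appears in the code below):

fileNameStr = '1_120.txt'                     # a sample of the digit 1 (example name)
fileStr = fileNameStr.split('.')[0]           # '1_120'
classNumStr = int(fileStr.split('_')[0])      # 1, the class label
print(classNumStr)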


The Python implementation is given below. The code consists of three main parts, whose responsibilities are:

Part 1 converts the original 32*32 binary digit matrix into a 1*1024 row vector. Part 2 is the test driver: it builds the hwLabels list of digit labels, loads the training samples into the big matrix trainingMat, then reads each test image in turn, passes it to the classify0 classifier, and prints the predicted result together with the true label and the final error rate. Part 3 is the core of the KNN algorithm, classify0: it computes the Euclidean distance from the test vector to every training sample, picks the k nearest neighbors, and returns the most frequent class among them as the prediction.

# -*- coding: utf-8 -*-
# Each sample is a 32*32 binary image stored as text; flatten it into a 1*1024 feature row vector.
# Part 1
from numpy import *
import operator
from os import listdir

def img2vector(filename):
    returnVect = zeros((1,1024))             # array that will hold the flattened image data
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()              # read one line of 32 binary characters
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])   # fill the 1024 slots, 32 per row
    return returnVect
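A quick way to check img2vector (the file name below is a hypothetical example and assumes the same directory layout used in the rest of the code):

testVector = img2vector('C:\\Anaconda\\trainingDigits\\0_1.txt')   # hypothetical example file
print(testVector.shape)       # (1, 1024)
print(testVector[0, 0:32])    # the first 32 values, i.e. the first row of the 32*32 image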


# Part 2
def handwritingClassTest():
    # load the training set into the big matrix trainingMat
    hwLabels = []
    trainingFileList = listdir('C:\\Anaconda\\trainingDigits')    # os.listdir(path) returns the file names under that directory as a list of strings
    m = len(trainingFileList)                 # m is the total number of training samples
    trainingMat = zeros((m,1024))             # holds the m training vectors
    for i in range(m):
        fileNameStr = trainingFileList[i]                  # training files are named like 1_120.txt; get the file name
        fileStr = fileNameStr.split('.')[0]                # split('.') cuts off the extension, leaving something like 1_120
        classNumStr = int(fileStr.split('_')[0])           # split on '_' to get 1, i.e. the digit class
        hwLabels.append(classNumStr)                       # hwLabels ends up holding the class of each of the m samples
        trainingMat[i,:] = img2vector('C:\\Anaconda\\trainingDigits\\%s' % fileNameStr)    # store each sample as one row of the m*1024 matrix

    # read the test images one by one and classify them
    testFileList = listdir('C:\\Anaconda\\testDigits')     # list of test file names
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]            # name of the test file to classify
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])          # true label parsed from the file name (a fresh classNumStr; the one from the training loop is not reused)
        vectorUnderTest = img2vector('C:\\Anaconda\\testDigits\\%s' % fileNameStr)  # convert the test digit into a row vector
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)   # run the classifier with k = 3
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if (classifierResult != classNumStr): errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))
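Before running the test, it can help to confirm that the two data directories are in place (a small sanity-check sketch; the paths are the same hard-coded ones used above and should be adjusted to your own setup):

from os import listdir
trainingDir = 'C:\\Anaconda\\trainingDigits'
testDir = 'C:\\Anaconda\\testDigits'
print('%d training files, %d test files' % (len(listdir(trainingDir)), len(listdir(testDir))))
# with the data described above this should print: 100 training files, 50 test files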


# Part 3
# The main classification routine: compute Euclidean distances, take the k closest training samples,
# and return the class that occurs most often among them as the prediction.
# inX is the vector to classify
# dataSet is the training set, one sample per row; labels holds the corresponding class labels
# k is the number of nearest neighbors to use
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                       # shape[0] is the number of rows, i.e. the number of training samples
    diffMat = tile(inX, (dataSetSize,1)) - dataSet       # tile(A,(m,n)) builds an array of m by n copies of A; here a 100*1024 array, one copy of inX per training sample
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)                  # sum(axis=1) sums each row; axis=0 would sum each column
    distances = sqDistances**0.5                         # take the square root to get the Euclidean distances
    sortedDistIndicies = distances.argsort()             # argsort() returns the indices of the original array in ascending order of distance
    classCount = {}                                      # sortedDistIndicies[0] is the index (in the original array) of the closest sample
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]       # label of the i-th nearest training sample
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1   # build the vote dictionary on the fly; get(key,0) returns the stored count, or 0 if the key is not present yet
    # classCount looks like e.g. {5:3, 0:6, 1:7, 2:1}
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # sort the (class, count) pairs by the count (the second element), descending
    return sortedClassCount[0][0]                        # sorted() gives a list of (key, value) tuples; return the class with the most votes
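To see classify0 on its own, here is a tiny made-up example with four 2-D points (the data is purely illustrative and has nothing to do with the digit vectors):

group = array([[1.0, 1.1],
               [1.0, 1.0],
               [0.0, 0.0],
               [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0(array([0.1, 0.2]), group, labels, 3))   # prints 'B'
# Euclidean distances from [0.1, 0.2] to the four rows are roughly 1.27, 1.20, 0.22 and 0.14,
# so the three nearest neighbors are the two 'B' points plus one 'A' point,
# and the majority vote among them returns 'B'.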


Test results:

import knn

knn.handwritingClassTest()
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9

the total number of errors is: 0

the total error rate is: 0.000000

The accuracy turns out to be very high: not a single misclassification. This is mainly because the amount of data used here is small, and it happens that none of the test samples were misclassified.




Reference: http://blog.csdn.net/u012162613/article/details/41768407#t2
