K近邻算法（一） python实现，手写数字识别(from机器学习实战)

最新推荐文章于 2024-07-24 04:40:15 发布

xie_qiuqiu

最新推荐文章于 2024-07-24 04:40:15 发布

阅读量1.4k

点赞数

分类专栏：机器学习实战--笔记文章标签： python 机器学习算法数据

机器学习实战--笔记专栏收录该内容

4 篇文章 0 订阅

订阅专栏

k近邻法（KNN）算法概述

算法概述

优点
精度高、对异常值不敏感、无数据输入假定。
缺点
计算复杂度高、空间复杂度高。
试用数据范围
数值型和标称型

工作原理：将新数据的每个特征与样本集中数据对应特征进行比较，计算之间的距离值，选取样本数据集中前k个最相似的数据。

伪代码：
1. 计算已知类别数据集中的点与当前点之间的距离
2. 按照距离递增次序排序
3. 选取与当前点距离最小的k个点
4. 确定前k个点所在类别的出现频率
5. 返回前k个点出现频率最高的类别作为当前点的预测分类

python实现

from numpy import *
import operator

引入numpy包。

def createDataSet():  
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]]) 
    labels = ['A', 'A', 'B', 'B'] 
    return group, labels  

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]#4l
    diffMat = tile(inX, (dataSetSize,1)) - dataSet#inx按4*1重复排列再与dataset做差
    sqDiffMat = diffMat**2#每个元素平方
    sqDistances = sqDiffMat.sum(axis=1)#一个框里的都加起来
    distances = sqDistances**0.5#加起来之后每个开根号
    sortedDistIndicies = distances.argsort() #返回的是数组值从小到大的索引值，最小为0
    classCount={}          
    for i in range(k):#即0到k-1.最后得到的classCount是距离最近的k个点的类分布，即什么类出现几次如{'A': 1, 'B': 2}
        voteIlabel = labels[sortedDistIndicies[i]]#返回从近到远第i个点所对应的类
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1#字典模式，记录类对应出现次数.这里.get(a,b)就是寻找字典classcount的a对应的值，如果没找到a，就显示b
    sortedClassCount = sorted(classCount.iteritems(),key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

注1：这里的operator.itemgetter(1)是指定义一个函数，这个函数可以获取它括号里的第1个元素
如

a=[3,4,2,6]

b=operator.itemgetter(1)

b(a)
Out[116]: 4

c=operator.itemgetter(0)

c(a)
Out[118]: 3

如果operator.itemgetter(1，2)的话指的是用多级排序，以1为首，2为辅

注2：这里sorted(iterable,cmp=None,key=None,reverse=False)

参数解释：

（1）iterable指定要排序的list或者iterable，不用多说；

（2）cmp为函数，指定排序时进行比较的函数，可以指定一个函数或者lambda函数，如：

students为类对象的list，没个成员有三个域，用sorted进行比较时可以自己定cmp函数，例如这里要通过比较第三个数据成员来排序，代码可以这样写：

students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

sorted(students, key=lambda student : student[2])

Out[121]: [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

（3）key为函数，指定取待排序元素的哪一项进行排序，函数用上面的例子来说明，key指定的lambda函数功能是去元素student的第三个域（即：student[2]）
（4）reverse=False是升序，True是降。默认升序。

故综合起来sorted(classCount.iteritems(),key=operator.itemgetter(1), reverse=True)是以第1个元素为排序标准，将classCount进行降序排列。
如classCount为{‘A’: 1, ‘B’: 2}，则sortedClassCount[(‘B’, 2), (‘A’, 1)]，则最后返回值sortedClassCount[0][0]为“B”

注3：另外代码在使用前应加一句
group,labels=createDataSet()
因为group,labels是内置变量，在外面直接使用的话要再定义一遍。

最后输入如classify0([1,0], group, labels, 2)来使用

手写数字识别

这里写图片描述

问题描述：训练集对应0~9的数字，每个数字有200个这样的txt，要进行识别，这里我们先把矩阵转换为向量，再比较一个新的向量跟已有数据集向量的距离，找最为接近的k个训练实例，对应的最多的类就是我们识别出的数字。

def img2vector(filename):#把32*32矩阵处理成1*1024向量
    returnVect = zeros((1,1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')           #load the training set,list of filename
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))               #形成array,共有m个元素，每个是1*1024向量
    for i in range(m):
        fileNameStr = trainingFileList[i]       #如'0_1.txt'      
        fileStr = fileNameStr.split('.')[0]     #take off “.txt”即用.分割成2个元素，取前一个
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)#厉害了
    testFileList = listdir('testDigits')        #iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]     #take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if (classifierResult != classNumStr): errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount/float(mTest))

出来结果为

......
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9

the total number of errors is: 11

the total error rate is: 0.011628

值得注意的是，这段代码一定要在已经导入os的情况下才能用，即必须最前面有
from os import listdir

xie_qiuqiu

关注

0
点赞
踩
5

收藏

觉得还不错? 一键收藏
0
评论
K近邻算法（一） python实现，手写数字识别(from机器学习实战)

k近邻法（KNN）算法概述算法概述优点精度高、对异常值不敏感、无数据输入假定。缺点计算复杂度高、空间复杂度高。试用数据范围数值型和标称型工作原理：将新数据的每个特征与样本集中数据对应特征进行比较，计算之间的距离值，选取样本数据集中前k个最相似的数据。伪代码： 1. 计算已知类别数据集中的点与当前点之间的距离 2. 按照距离递增次序排序 3. 选取与当前点距离最小的k个点
复制链接

扫一扫

专栏目录