k-Nearest Neighbors Algorithm

Article link: https://blog.csdn.net/Ruanes/article/details/104973885

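The code in this article is taken from the book Machine Learning in Action; I have only made minor corrections and tidied it up. The full code and datasets are on GitHub.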

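Contents

k-Nearest Neighbors overview
Example 1: Using k-nearest neighbors on a dating site
Example 2: A handwriting recognition system
- Getting the data
- Testing the algorithm
Summary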

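k-Nearest Neighbors overview

Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational cost and high memory use.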

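Example 1: Using k-nearest neighbors on a dating site

Collect the data: a text file is provided.
Prepare the data: parse the text file with Python.
Analyze the data: draw 2D scatter plots with Matplotlib.
Train the algorithm: this step does not apply to k-nearest neighbors.
Test the algorithm: use part of the data Helen provided as test samples. Test samples already have known labels, which are withheld from the classifier and used only for checking: if the predicted class differs from the actual class, it is counted as an error.
Use the algorithm: build a simple command-line program into which Helen can enter feature values to determine whether a person is the type she likes.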

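Getting the data

The data is stored in the text file datingTestSet.txt, one sample per line, 1000 lines in total. Each sample has three features and one label. We first need to parse the file and produce a matrix of training samples and a vector of class labels. Note that file2matrix converts the last column with int(), so the file it reads must store the labels as integers.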

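import numpy as np


def file2matrix(filename):
    # Read all lines up front so the matrix can be sized before parsing
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)

    # Training-sample matrix and class-label vector
    returnMat = np.zeros((numberOfLines, 3))
    classLabelVector = []

    for index, line in enumerate(arrayOLines):
        listFromLine = line.strip().split('\t')
        # The first three columns are the features
        returnMat[index, :] = [float(value) for value in listFromLine[0:3]]
        # The last column is the integer class label
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector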
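
To make the expected file layout concrete, here is a minimal sketch that parses the same kind of tab-separated data from an in-memory string with np.loadtxt instead of a hand-written loop. The three rows are made-up illustrative values, not lines from the real dataset, and the variable names are my own.

```python
import io

import numpy as np

# Made-up rows in the assumed layout:
# three tab-separated features followed by an integer class label.
sample = (
    "40000\t8.3\t0.95\t3\n"
    "14000\t7.2\t1.67\t2\n"
    "26000\t1.4\t0.81\t1\n"
)

# np.loadtxt parses every column as float by default
data = np.loadtxt(io.StringIO(sample), delimiter="\t")

features = data[:, :3]                     # training-sample matrix
labels = data[:, -1].astype(int).tolist()  # class-label vector
```

This only works when every column is numeric; a file with text labels in the last column would need a different parser.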

With the data loaded, we use Matplotlib to draw a scatter plot of the raw data:

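import matplotlib.pyplot as plt
import numpy as np

# Load the data
datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
labelArray = np.array(datingLabels)

fig = plt.figure()
ax = fig.add_subplot(111)
# Draw each class in its own scatter call so that every class gets
# its own color, marker size and legend entry
classNames = {1: 'Did not like', 2: 'Somewhat charming', 3: 'Very charming'}
for label, name in classNames.items():
    mask = labelArray == label
    ax.scatter(datingDataMat[mask, 1], datingDataMat[mask, 2],
               s=15.0 * label, label=name)
ax.set_xlabel('Percentage of time spent playing video games')
ax.set_ylabel('Liters of ice cream consumed per week')
ax.legend()
# Save the figure
plt.savefig('Img.png', dpi=600)
plt.show()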

Resulting plot: [image]

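Normalizing the values

Features with different value ranges distort the distance calculation in kNN, because the feature with the largest range dominates the distance. We therefore normalize the data. Here we use min-max normalization, which maps every feature into the interval from 0 to 1.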

def autoNorm(dataSet):
    # Column-wise minimum, maximum and range of each feature
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # newValue = (oldValue - min) / (max - min), applied to every row
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
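
A small worked example may help here. The sketch below applies the same min-max formula to a made-up 3 x 2 matrix using NumPy broadcasting, then normalizes one new sample with the stored minimums and ranges, which is why those two values are worth returning. The helper name min_max_normalize and all numbers are my own illustration, and the sketch assumes no feature is constant (a zero range would divide by zero).

```python
import numpy as np


def min_max_normalize(data_set):
    """Hypothetical broadcasting variant of min-max normalization."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting subtracts and divides row by row without np.tile
    return (data_set - min_vals) / ranges, ranges, min_vals


# Made-up data: the first feature spans 10..40, the second 0.5..1.5
data = np.array([[10.0, 0.5],
                 [20.0, 1.5],
                 [40.0, 1.0]])
norm, ranges, min_vals = min_max_normalize(data)

# A new sample must be scaled with the training minimums and ranges
new_sample = np.array([25.0, 1.0])
new_norm = (new_sample - min_vals) / ranges
```

After normalization both columns lie in [0, 1], so neither feature dominates the Euclidean distance.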

Implementing kNN

import operator


# inX is the sample to classify
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]

    # Euclidean distance between the input sample and every training sample
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5

    # Indices of the training samples, ordered from nearest to farthest
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        votelabel = labels[sortedDistIndicies[i]]
        classCount[votelabel] = classCount.get(votelabel, 0) + 1
    # Sort the vote counts in descending order and return the most frequent label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
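
To see the voting in action, here is a minimal sketch on a toy dataset of four labeled points. The function knn_classify is a compact variant I wrote for illustration (broadcasting plus collections.Counter), not the classify0 above, and the points are made up.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Hypothetical compact kNN: Euclidean distance plus majority vote."""
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Indices of the k training samples closest to in_x
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two points near (1, 1) labeled 'A', two near (0, 0) labeled 'B'
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

result_near_origin = knn_classify(np.array([0.0, 0.0]), group, labels, 3)
result_near_one = knn_classify(np.array([1.0, 1.2]), group, labels, 3)
```

With k = 3, the point at the origin collects two 'B' votes and one 'A' vote, so it is classified as 'B'; the point at (1, 1.2) is classified as 'A' for the mirror-image reason.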

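Testing the algorithm

We use 90% of the data as the training set and the remaining 10% as test samples. The quality of the model is measured by its error rate: a prediction counts as an error whenever the predicted label differs from the actual label.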

def datingClassTest():
    # Fraction of the data held out for testing
    hoRatio = 0.1
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are test samples; the remaining rows form the training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with:%d,the real answer is:%d" % (classifierResult, datingLabels[i]))
        # Count an error whenever the predicted label differs from the actual label
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is:%f" % (errorCount / float(numTestVecs)))

The output is as follows:

the classifier came back with:3,the real answer is:3
the classifier came back with:2,the real answer is:2
the classifier came back with:1,the real answer is:1
.
.
the classifier came back with:1,the real answer is:1
the classifier came back with:1,the real answer is:1
the total error rate is:0.050000

The model's classification error rate is about 5%.
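
The "use the algorithm" step from the list at the start of this example is not shown in the article. A minimal sketch of it might look like the following, assuming the class names follow the legend order used for the scatter plot (1, 2, 3). The function classify_person, the RESULT_NAMES mapping and the training rows are all my own illustration, not real data; an interactive version would read the three feature values with input().

```python
import numpy as np

# Assumed mapping from integer label to a readable answer
RESULT_NAMES = {1: "did not like", 2: "somewhat charming", 3: "very charming"}


def classify_person(features, train_mat, train_labels, k=3):
    """Hypothetical helper: normalize one new sample and classify it with kNN."""
    min_vals = train_mat.min(axis=0)
    ranges = train_mat.max(axis=0) - min_vals
    norm_train = (train_mat - min_vals) / ranges
    # Scale the new sample with the training minimums and ranges
    norm_x = (np.asarray(features, dtype=float) - min_vals) / ranges
    distances = np.sqrt(((norm_train - norm_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    # Majority vote over the integer labels of the k nearest samples
    votes = np.bincount(np.asarray(train_labels)[nearest])
    return RESULT_NAMES[int(votes.argmax())]


# Made-up training rows: three features per person, labels 3 and 1 only
train_mat = np.array([
    [40000.0, 8.0, 1.0],
    [42000.0, 9.0, 0.9],
    [38000.0, 7.5, 1.1],
    [5000.0, 1.0, 0.2],
    [6000.0, 0.5, 0.3],
    [4000.0, 1.5, 0.1],
])
train_labels = [3, 3, 3, 1, 1, 1]

answer = classify_person([41000.0, 8.5, 1.0], train_mat, train_labels)
```

The key design point is that the new sample is normalized with the same minimums and ranges as the training data before distances are computed.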

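Example 2: A handwriting recognition system

Getting the data

The data is split into training data and test data: the directory trainingDigits contains about 2000 examples and the directory testDigits contains about 900 test samples. Each sample is a 32×32 image stored as text. The data looks roughly like this: [image]
We first write the function img2vector, which converts an image into a vector.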

def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    # Each image is 32 x 32: read 32 lines and take the first 32 characters of each
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
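
The index 32 * i + j flattens the image row by row. The sketch below shows the same mapping on a synthetic in-memory image, so it runs without the dataset. The helper text_image_to_vector is a hypothetical counterpart of img2vector that takes a list of strings instead of a file name.

```python
import numpy as np


def text_image_to_vector(lines, size=32):
    """Hypothetical in-memory counterpart of img2vector.

    Converts `size` rows of digit characters into a 1 x (size * size) vector.
    """
    grid = np.array(
        [[int(ch) for ch in line[:size]] for line in lines[:size]],
        dtype=float,
    )
    # reshape flattens in row-major order, so pixel (i, j) lands at size * i + j
    return grid.reshape(1, size * size)


# Synthetic image: all zeros except a single 1 at row 2, column 5
rows = ["0" * 32 for _ in range(32)]
rows[2] = "0" * 5 + "1" + "0" * 26

vec = text_image_to_vector(rows)
flat_index = 32 * 2 + 5  # position of the single 1 in the flattened vector
```

Because every image is flattened the same way, two 1024-element vectors can be compared pixel by pixel with the Euclidean distance used in classify0.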

Testing the algorithm

import os


def handwritingClassTest():
    hwLabels = []
    trainingFileList = os.listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        # The label is encoded in the file name, before the underscore
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumstr = int(fileStr.split('_')[0])
        hwLabels.append(classNumstr)

        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumstr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        # Reuse the classifier written for the previous example
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 4)

        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumstr))
        if classifierResult != classNumstr:
            errorCount += 1.0
    print("the total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

The output is as follows:

the classifier came back with: 9, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
.
.
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the total number of errors is: 11
the total error rate is: 0.011628
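
handwritingClassTest reads each sample's label from its file name rather than from the file contents. The parsing step can be sketched in isolation as below; the helper name label_from_filename and the example file names are assumptions for illustration, based only on the split('.') and split('_') calls in the code.

```python
def label_from_filename(file_name):
    """Hypothetical helper: extract the digit label from a name like '<digit>_<index>.txt'."""
    stem = file_name.split('.')[0]   # drop the extension
    return int(stem.split('_')[0])   # the part before the underscore is the label


# Assumed example names following the '<digit>_<index>.txt' pattern
label_a = label_from_filename('9_45.txt')
label_b = label_from_filename('0_13.txt')
```

A file name without an underscore-separated numeric prefix would raise ValueError, so this scheme relies on every file in the two directories following the same naming pattern.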