Machine Learning in Action: The k-Nearest Neighbors Algorithm

小白终究会黑化

Article link: https://blog.csdn.net/qq_34406071/article/details/108960803

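This article describes the principle, strengths, weaknesses, and applicable scope of the k-nearest neighbors (kNN) algorithm, and walks through two examples: improving matches on a dating site and recognizing handwritten digits. In the dating-site example, kNN predicts how much a user will like a candidate; in the handwriting recognition system, it identifies the digit in an image. With normalized features and a Python implementation, kNN shows good classification accuracy, but its computational cost is high.

Summary generated automatically by CSDN.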

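The k-nearest neighbors classification algorithm
Parsing and importing data from text files
Creating scatter plots with Matplotlib
Normalizing numeric values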

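Overview of the k-nearest neighbors algorithm

In short, the k-nearest neighbors algorithm classifies a sample by measuring the distance between its feature values and those of samples whose classes are already known.

Pros: high accuracy, insensitive to outliers, no assumptions about the input data
Cons: computationally expensive, requires a lot of memory
Works with: numeric and nominal values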

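General approach to k-nearest neighbors

Collect data: any method can be used
Prepare data: numeric values are needed for the distance calculation; a structured data format is best
Analyze data: any method can be used
Train the algorithm: does not apply to k-nearest neighbors
Test the algorithm: calculate the error rate
Use the algorithm: first supply sample data and structured output, then run k-nearest neighbors to determine which class the input belongs to, and finally act on the computed class.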

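Implementing the kNN classification algorithm

Pseudocode and working Python code for the k-nearest neighbors algorithm

Pseudocode

For every point in the dataset of unknown class, do the following in order:

Calculate the distance between each point in the known-class dataset and the current point;
Sort the distances in increasing order;
Take the k points closest to the current point;
Count how often each class occurs among these k points;
Return the most frequent class among these k points as the predicted class of the current point.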

Python code

import operator
from numpy import tile

def classify0(inX,dataSet,labels,k):
    '''
    k-nearest neighbors classifier
    inX - input vector to classify
    dataSet - training set
    labels - label vector (has as many elements as dataSet has rows)
    k - number of nearest neighbors to use
    '''
    # Compute the Euclidean distance from inX to every row of dataSet
    dataSetSize=dataSet.shape[0]
    diffMat=tile(inX,(dataSetSize,1))-dataSet
    sqDiffMat=diffMat**2
    sqDistances=sqDiffMat.sum(axis=1)
    distances=sqDistances**0.5
    sortedDistIndicies=distances.argsort()
    # Count the class votes of the k closest points
    classCount={}
    for i in range(k):
        voteIlabel=labels[sortedDistIndicies[i]]
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1

    # Sort the classes by vote count, highest first
    sortedClassCount=sorted(classCount.items(),
                            key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]
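
To make the voting step concrete, here is a minimal standard-library sketch of the same idea, run on a tiny made-up dataset. The function name knn_classify and the four toy points are assumptions for illustration only; they are not part of the code above, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k):
    """Return the majority label among the k points nearest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [(math.dist(query, p), label) for p, label in zip(points, labels)]
    # Sort by distance and keep the k closest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the label votes among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two points near (1, 1) labeled 'A', two near (0, 0) labeled 'B'.
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']

print(knn_classify((0.0, 0.0), points, labels, 3))  # two of the three nearest are 'B'
print(knn_classify((1.0, 1.2), points, labels, 3))  # two of the three nearest are 'A'
```

With k=3 and only two points per class, each query is decided by a 2-to-1 vote, which is why an odd k is a common choice for two-class toy problems.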

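Example 1: improving matches on a dating site

Using k-nearest neighbors on a dating site

Collect data: a text file is provided
Prepare data: parse the text file with Python
Analyze data: draw 2D scatter plots with Matplotlib
Train the algorithm: does not apply here
Test the algorithm: use part of the user-supplied data as test samples
Use the algorithm: build a simple command-line program in which the user enters a few feature values to find out whether a person is the type they would like

Prepare data: parsing data from a text file

Each sample has the following three features:

Number of frequent flyer miles earned per year
Percentage of time spent playing video games
Liters of ice cream consumed per week

The text file shows that the data has three labels:
didntLike, smallDoses, largeDoses

The file2matrix function takes a filename string as input and returns the training sample matrix [returnMat] and the class label vector [classLabelVector]. Because it converts the last field of each line with int(), it expects the labels to be stored as integers; the plotting code later in this post uses 1 for didntLike, 2 for smallDoses, and 3 for largeDoses.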

# Parser that converts text records into a NumPy matrix
from numpy import zeros

def file2matrix(filename):
    # Read every line of the file (the with block closes it afterwards)
    with open(filename) as fr:
        arrayOLines=fr.readlines()
    numberOfLines=len(arrayOLines)
    # Create the NumPy matrix to return
    returnMat=zeros((numberOfLines,3))
    classLabelVector=[]
    # Parse each line into the matrix and the label list
    index=0
    for line in arrayOLines:
        line=line.strip()
        listFromLine=line.split('\t')
        returnMat[index,:]=listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index+=1
    return returnMat,classLabelVector

returnMat:
[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
[1.4488000e+04 7.1534690e+00 1.6739040e+00]
[2.6052000e+04 1.4418710e+00 8.0512400e-01]
…
[2.6575000e+04 1.0650102e+01 8.6662700e-01]
[4.8111000e+04 9.1345280e+00 7.2804500e-01]
[4.3757000e+04 7.8826010e+00 1.3324460e+00]]
classLabelVector[0:20]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
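
If the label column holds the names rather than integer codes, int() would fail on it. The sketch below shows one possible way to accept both forms. The helper name parse_dating_lines is hypothetical, the name-to-integer mapping is an assumption taken from the colors and legend used in the plotting code, and the two sample records reuse the first two rows shown above.

```python
import io

# Assumed mapping, matching the legend used when plotting (1, 2, 3).
LABEL_TO_INT = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}

def parse_dating_lines(lines):
    """Parse tab-separated records into (feature rows, integer labels)."""
    features, labels = [], []
    for line in lines:
        fields = line.strip().split('\t')
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        features.append([float(value) for value in fields[:3]])
        raw = fields[-1]
        # Accept either a name such as 'largeDoses' or an integer code such as '3'.
        labels.append(LABEL_TO_INT[raw] if raw in LABEL_TO_INT else int(raw))
    return features, labels

# In-memory stand-in for the data file, so the sketch runs without any file on disk.
sample = io.StringIO("40920\t8.326976\t0.953952\tlargeDoses\n"
                     "14488\t7.153469\t1.673904\t2\n")
features, labels = parse_dating_lines(sample)
print(features)
print(labels)
```

Any iterable of lines works here, so an open file object can be passed in place of the StringIO stand-in.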

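Analyze data: creating scatter plots with Matplotlib

Plotting the data gives an intuitive view of it and makes its structure easier to see.
The function below draws one scatter plot for each pair of the three features. The third plot, for example, uses the second and third columns of the datingDataMat matrix:
"percentage of time spent playing video games" and "liters of ice cream consumed per week".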

import matplotlib.pyplot as plt
import matplotlib.lines as mlines

def showdatas(datingDataMat,datingLabels):
    # The labels are in English, so the Windows-only simsun.ttc font setting is no longer needed
    fig,axs=plt.subplots(nrows=2,ncols=2,sharex=False,sharey=False,figsize=(13,8))
    # Map each class label to a point color
    LabelsColors=[]
    for i in datingLabels:
        if i==1:
            LabelsColors.append('black')
        if i==2:
            LabelsColors.append('orange')
        if i==3:
            LabelsColors.append('red')
    # Scatter plot of column 1 (frequent flyer miles) against column 2 (video games); point size 15, alpha 0.5
    axs[0][0].scatter(x=datingDataMat[:,0],y=datingDataMat[:,1],color=LabelsColors,s=15,alpha=.5)
    axs0_title_text=axs[0][0].set_title('Frequent flyer miles per year vs. time spent playing video games')
    axs0_xlabel_text=axs[0][0].set_xlabel('Frequent flyer miles earned per year')
    axs0_ylabel_text=axs[0][0].set_ylabel('Percentage of time spent playing video games')
    plt.setp(axs0_title_text, size=9, weight='bold', color='red')
    plt.setp(axs0_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs0_ylabel_text, size=7, weight='bold', color='black')

    # Scatter plot of column 1 (frequent flyer miles) against column 3 (ice cream); point size 15, alpha 0.5
    axs[0][1].scatter(x=datingDataMat[:, 0], y=datingDataMat[:, 2], color=LabelsColors, s=15, alpha=.5)
    axs1_title_text = axs[0][1].set_title('Frequent flyer miles per year vs. ice cream consumed per week')
    axs1_xlabel_text = axs[0][1].set_xlabel('Frequent flyer miles earned per year')
    axs1_ylabel_text = axs[0][1].set_ylabel('Liters of ice cream consumed per week')
    plt.setp(axs1_title_text, size=9, weight='bold', color='red')
    plt.setp(axs1_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs1_ylabel_text, size=7, weight='bold', color='black')

    # Scatter plot of column 2 (video games) against column 3 (ice cream); point size 15, alpha 0.5
    axs[1][0].scatter(x=datingDataMat[:, 1], y=datingDataMat[:, 2], color=LabelsColors, s=15, alpha=.5)
    # Set the title and the x-axis and y-axis labels
    axs2_title_text = axs[1][0].set_title('Time spent playing video games vs. ice cream consumed per week')
    axs2_xlabel_text = axs[1][0].set_xlabel('Percentage of time spent playing video games')
    axs2_ylabel_text = axs[1][0].set_ylabel('Liters of ice cream consumed per week')
    plt.setp(axs2_title_text, size=9, weight='bold', color='red')
    plt.setp(axs2_xlabel_text, size=7, weight='bold', color='black')
    plt.setp(axs2_ylabel_text, size=7, weight='bold', color='black')

    # Build the legend entries
    didntlike=mlines.Line2D([],[],color='black',marker='.',markersize=6,label='didntLike')
    smallDoses=mlines.Line2D([],[],color='orange',marker='.',markersize=6,label='smallDoses')
    largeDoses=mlines.Line2D([],[],color='red',marker='.',markersize=6,label='largeDoses')

    # Add the legend to each plot
    axs[0][0].legend(handles=[didntlike,smallDoses,largeDoses])
    axs[0][1].legend(handles=[didntlike,smallDoses,largeDoses])
    axs[1][0].legend(handles=[didntlike,smallDoses,largeDoses])

    plt.show()

[Figure: the three scatter plots drawn by showdatas]

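Prepare data: normalizing numeric values

The table below lists four samples. To compute the distance between sample 3 and sample 4, we can use:
$d=\sqrt{(0-67)^2+(20000-32000)^2+(1.1-0.1)^2}$
It is easy to see that the attribute with the largest numeric differences dominates the result: here the frequent flyer miles contribute far more than the other two features simply because their values are larger. When features have such different ranges, the usual remedy is to normalize them, for example to the range 0 to 1 or -1 to 1:
$newValue = (oldValue - min) / (max - min)$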

[Table: four sample records with their three feature values]
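
The arithmetic behind the distance formula above can be checked in a few lines. This is only a sketch: the two tuples are read off the formula itself (video game percentage, frequent flyer miles, ice cream), and the variable names are made up for illustration.

```python
import math

# Samples 3 and 4 as they appear in the distance formula:
# (video game time %, frequent flyer miles, ice cream)
sample3 = (0, 20000, 1.1)
sample4 = (67, 32000, 0.1)

# Squared difference for each feature, in the same order as the formula.
squared_terms = [(a - b) ** 2 for a, b in zip(sample3, sample4)]
distance = math.sqrt(sum(squared_terms))

print(squared_terms)
print(round(distance, 3))
# Fraction of the squared distance that comes from the mileage term alone.
print(squared_terms[1] / sum(squared_terms))
```

The distance comes out at roughly 12000.19, barely above the mileage difference of 12000, which shows how little the other two features influence an unnormalized distance.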

# Prepare data: normalize numeric values
from numpy import tile

def autoNorm(dataSet):
    minvals=dataSet.min(0) # the argument 0 makes min() take the minimum of each column
    maxvals=dataSet.max(0)
    ranges=maxvals-minvals
    m=dataSet.shape[0]
    normDataSet=dataSet-tile(minvals,(m,1))   # tile() repeats minvals into a matrix the same size as the input
    normDataSet=normDataSet/tile(ranges,(m,1))  # element-wise division by each feature's range
    return normDataSet,ranges,minvals

Output of the function:
normDataSet:
[[0.44832535 0.39805139 0.56233353]
[0.15873259 0.34195467 0.98724416]
[0.28542943 0.06892523 0.47449629]
…
[0.29115949 0.50910294 0.51079493]
[0.52711097 0.43665451 0.4290048 ]
[0.47940793 0.3768091 0.78571804]]
ranges:
[9.1273000e+04 2.0919349e+01 1.6943610e+00]
minvals:
[0. 0. 0.001156]

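Test the algorithm: validating the classifier as a complete program

The classifier's performance is measured by its error rate: the number of wrong results it gives divided by the total number of test samples.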

# Test code for the classifier on the dating-site data
def datingClassTest():
    hoRatio=0.10  # fraction of the data held out for testing
    datingDataMat,datingLabels=file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals=autoNorm(datingDataMat)
    m=normMat.shape[0]
    numTestVecs=int(m*hoRatio)
    errorCount=0.0
    # The first numTestVecs rows are the test set; the remaining rows are the training set
    for i in range(numTestVecs):
        classifierResult=classify0(normMat[i,:],normMat[numTestVecs:m,:],
                                   datingLabels[numTestVecs:m],3)
        print('the classifier came back with:{},the real answer is:{}'.format(classifierResult,datingLabels[i]))
        if (classifierResult !=datingLabels[i]):
            errorCount+=1.0
    print('the total error rate is :{}%'.format(errorCount/float(numTestVecs)*100))

Output:
the classifier came back with:2,the real answer is:2
the classifier came back with:1,the real answer is:1
the classifier came back with:3,the real answer is:3
the classifier came back with:3,the real answer is:3
the classifier came back with:2,the real answer is:2
the classifier came back with:1,the real answer is:1
the classifier came back with:3,the real answer is:1
the total error rate is :5.0%
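The classifier's error rate on the dating dataset is 5.0%, which is a good result. A user can now enter the attributes of an unknown person and let the classifier estimate how much she would like that person.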
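
The test above holds out the first 10% of the rows, which is only a fair sample if the file is in random order. As a hedged alternative, the sketch below shuffles the indices before splitting and computes the error rate separately. The names holdout_split and error_rate, the seed, and the sample size of 1000 are all assumptions for illustration.

```python
import random

def holdout_split(n_samples, test_ratio=0.10, seed=0):
    """Return (test_indices, train_indices) after shuffling the sample indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed keeps the split reproducible
    n_test = int(n_samples * test_ratio)
    return indices[:n_test], indices[n_test:]

def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

test_idx, train_idx = holdout_split(1000)
print(len(test_idx), len(train_idx))

# One wrong answer out of four gives an error rate of 0.25.
print(error_rate([1, 2, 3, 3], [1, 2, 3, 1]))
```

The index lists can then be used to pick the test and training rows from the normalized matrix and the label list.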

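Use the algorithm: building a complete, usable program

The user enters the three feature values described above, and the program predicts how much they will like the unknown person.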

from numpy import array

def classifPerson():
    resultList=['not at all','in small doses','in large doses']
    percentTats=float(input('percentage of time spent playing video games?'))
    ffmiles=float(input('frequent flier miles earned per year?'))
    iceCream=float(input('liters of ice cream consumed per year?'))
    datingDataMat,datingLabels=file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals=autoNorm(datingDataMat)
    # Keep the same column order as the training data: miles, game time, ice cream
    inArr=array([ffmiles,percentTats,iceCream])
    # Normalize the input with the training minimums and ranges before classifying
    classifierResult=classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
    print('You will probably like this person:{}'.format(resultList[classifierResult-1]))

Sample run:
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person:in small doses

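Example 2: a handwriting recognition system

Collect data: text files are provided
Prepare data: write the function img2vector() to convert the image format into the vector format used by the classifier
Analyze data: inspect the data at the Python prompt
Train the algorithm: does not apply
Test the algorithm: use part of the data as test samples
Use the algorithm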

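Prepare data: converting images into test vectors

The samples are stored in two directories, trainingDigits and testDigits.
Each 32×32 binary image matrix is converted into a 1×1024 vector so that the classifier from the previous example can process the digit images.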

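# Convert an image file into a single vector
from numpy import zeros

def img2vector(filename):
    returnVect=zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr=fr.readline()
            for j in range(32):
                # Pixel (row i, column j) goes to position 32*i+j
                returnVect[0,32*i+j]=int(lineStr[j])
    return returnVect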

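The function creates a 1×1024 NumPy array, opens the given file, reads its first 32 lines in a loop, stores the first 32 characters of each line in the array, and finally returns the array.
Test result: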
returnVect[0,0:32]:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0.]
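
The row-by-row flattening is easier to verify on a small grid. The sketch below applies the same index rule to a made-up 4×4 image held in memory. The function name text_image_to_vector and the size parameter are assumptions for illustration; with size=32 it follows the same 32*i+j layout as img2vector.

```python
import io

def text_image_to_vector(lines, size=32):
    """Flatten the first size rows of '0'/'1' characters into one list, row by row."""
    vector = [0] * (size * size)
    for i, line in zip(range(size), lines):
        for j in range(size):
            # Pixel (row i, column j) lands at flat index size * i + j.
            vector[size * i + j] = int(line[j])
    return vector

# Hypothetical 4x4 "image" so the index arithmetic is easy to check by eye.
tiny = io.StringIO("0110\n1001\n1001\n0110\n")
vector = text_image_to_vector(tiny, size=4)
print(vector)
```

Reading the printed list four values at a time reproduces the four rows of the grid, confirming that the rows are laid end to end.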

Test the algorithm: recognizing handwritten digits with kNN

from os import listdir
from numpy import zeros

def handwritingClassTest():
    hwLabels=[]
    trainingFileList=listdir('trainingDigits')  # list the files in the training directory
    m=len(trainingFileList)
    trainMat=zeros((m,1024))
    # Parse the class digit from each file name and load the image as one row
    for i in range(m):
        fileNameStr=trainingFileList[i]
        fileStr=fileNameStr.split('.')[0]
        classNumStr=int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainMat[i,:]=img2vector('trainingDigits/{}'.format(fileNameStr))
    testFileList=listdir('testDigits')
    errorCount=0.0
    mTest=len(testFileList)
    # Classify every test image and compare with the digit in its file name
    for i in range(mTest):
        fileNameStr=testFileList[i]
        fileStr=fileNameStr.split('.')[0]
        classNumStr=int(fileStr.split('_')[0])
        vectorUnderTest=img2vector('testDigits/{}'.format(fileNameStr))
        classifierResult=classify0(vectorUnderTest,trainMat,hwLabels,3)
        print('the classifier came back with:{},the real answer is:{}'.format(classifierResult,classNumStr))
        if classifierResult!=classNumStr:
            errorCount+=1
    print('the total number of errors is :{}'.format(errorCount))
    print('the total error rate is :{}'.format(errorCount/float(mTest)))

Test results:
the classifier came back with:5,the real answer is:9
the classifier came back with:1,the real answer is:9
the classifier came back with:5,the real answer is:9
the classifier came back with:5,the real answer is:9
the classifier came back with:3,the real answer is:9
the classifier came back with:3,the real answer is:9
the classifier came back with:4,the real answer is:9
the total number of errors is :322.0
the total error rate is :0.3403805496828753
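
The test code recovers each true label from the file name by splitting on '.' and then on '_', which implies names of the form digit_index.txt. A small sketch of that step in isolation, using hypothetical file names and a made-up helper name label_from_filename:

```python
import os

def label_from_filename(path):
    """Extract the digit label from a name such as '9_45.txt' (digit_index.txt)."""
    stem = os.path.splitext(os.path.basename(path))[0]  # e.g. '9_45'
    return int(stem.split('_')[0])

# Hypothetical file names following the pattern the test code assumes.
print(label_from_filename('trainingDigits/9_45.txt'))
print(label_from_filename('0_13.txt'))
```

Keeping this parsing in one place makes it easy to check that the labels are read correctly before looking for other causes of a high error rate.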

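Chapter summary

k-nearest neighbors is a simple and effective classification algorithm, but it has to keep the entire dataset: a large dataset requires a lot of storage, and computing the distance to every stored sample can take a long time. It also reveals nothing about the underlying structure of the data, so we cannot tell what an average or typical sample of a class looks like.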
