Machine Learning in Action (2): The k-Nearest Neighbors Algorithm

FavoriteStar

Article link: https://blog.csdn.net/StarandTiAmo/article/details/126412896

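I studied this material a while ago, so this post only records some of the ideas and code. If you have questions, please send me a private message.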

1. Overview of the k-nearest neighbors algorithm

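kNN works as follows. There is a training set in which every sample carries a label. When a new, unlabeled sample arrives, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar samples.

In general only the k most similar samples in the training set are considered, which is why the method is called k-nearest neighbors.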

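For example, consider the figure below:

[Figure: known movies with their fight-scene counts, kiss-scene counts, and genres; the last row is the new, unlabeled movie]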

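Treat the number of fight scenes and the number of kiss scenes as features and the movie genre as the label. For a new sample (the last row), we can then compute its distance to every known movie:
[Figure: distances between the new movie and each known movie]

Suppose $k = 3$. We take the three nearest movies; all three are romance movies, so we conclude that the new movie is a romance movie as well.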
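
The distance-and-vote step can be sketched in a few lines of Python. The movie names and scene counts below are illustrative values chosen for this sketch, not data read from the figures, and `nearest_genre` is a hypothetical helper name.

```python
import math
from collections import Counter

# Hypothetical training data: name -> ((fight scenes, kiss scenes), genre)
movies = {
    "Movie A": ((3, 104), "romance"),
    "Movie B": ((2, 100), "romance"),
    "Movie C": ((1, 81), "romance"),
    "Movie D": ((101, 10), "action"),
    "Movie E": ((99, 5), "action"),
    "Movie F": ((98, 2), "action"),
}

def nearest_genre(query, k=3):
    """Return the majority genre among the k movies closest to `query`."""
    # Rank the known movies by Euclidean distance to the query point.
    ranked = sorted(movies.values(), key=lambda item: math.dist(query, item[0]))
    votes = Counter(genre for _, genre in ranked[:k])
    return votes.most_common(1)[0][0]

print(nearest_genre((18, 90)))  # romance
```

With these numbers the three closest movies are all romance movies, so the vote is unanimous.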

1.1 Preparation: importing data with Python

from numpy import array, shape, tile, zeros
import operator
from os import listdir

def createDataSet():
    # Four 2-D sample points and their class labels
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group,labels

1.2 Implementing the kNN algorithm

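Here is the pseudocode for the k-nearest neighbors algorithm:

For every point in the dataset whose class is unknown, do the following:
(1) Compute the distance between the current point and every point in the labeled dataset
(2) Sort the distances in increasing order
(3) Take the k points with the smallest distances to the current point
(4) Count how often each class occurs among these k points
(5) Return the most frequent class among the k points as the predicted class of the current point

The concrete code is as follows: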

def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0]  # shape[0] is the number of rows (samples)
    # tile repeats the vector inX dataSetSize times along the rows and once along
    # the columns, so it has the same shape as dataSet and can be subtracted from it
    diffMat = tile(inX,(dataSetSize,1))-dataSet
    sqDiffMat = diffMat ** 2  # square every element
    sqDistances = sqDiffMat.sum(axis = 1)  # sum along each row
    distances = sqDistances ** 0.5  # Euclidean distance to every sample
    sortedDistIndicies = distances.argsort()  # indices that sort the distances in ascending order
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        # get() returns 0 for a label that has not been seen yet
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(),key = operator.itemgetter(1),reverse = True)
    return sortedClassCount[0][0]  # the label with the most votes
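
As a quick check, a classifier of this kind can be run on the four-point dataset from section 1.1. The sketch below is a compact variant rather than the code above: it relies on NumPy broadcasting instead of `tile` and on `collections.Counter` for the vote, and the name `knn_classify` is made up for this example.

```python
from collections import Counter

import numpy as np

def knn_classify(in_x, data_set, labels, k):
    """Classify `in_x` by majority vote among its k nearest rows of `data_set`."""
    # Broadcasting subtracts in_x from every row, so no tile() call is needed.
    diffs = np.asarray(data_set, dtype=float) - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]  # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], group, labels, 3))  # B
```

The point (0, 0) has two 'B' neighbors and one 'A' neighbor among its three nearest points, so it is assigned to class 'B'.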

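1.3 How to test the classifier

For kNN, we can split the dataset into a training set and a test set and then measure how accurately the algorithm classifies the samples in the test set.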
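
One common way to report this is the error rate: the number of misclassified test samples divided by the size of the test set. A minimal sketch, assuming the predictions have already been collected in a list; `error_rate` is a name introduced here for illustration, and the labels are made up.

```python
def error_rate(predicted, actual):
    """Fraction of test samples whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("the test set is empty")
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)

# One of the four hypothetical predictions is wrong.
print(error_rate(['A', 'B', 'B', 'A'], ['A', 'B', 'A', 'A']))  # 0.25
```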

2. Example: using k-nearest neighbors to improve matches on a dating site

2.1 Preparing the data: parsing data from a text file

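# Use the kNN algorithm to improve matches on the dating site
# Parser that converts the text records into NumPy data
def file2matrix(filename):
    with open(filename) as fr:
        arrayOLines = fr.readlines()  # read all lines; each line is one list element
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines,3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()  # remove the trailing newline and surrounding whitespace
        listFromLine = line.split('\t')  # split the line on tabs
        # fields 0, 1 and 2 are the features; everything except the label goes here
        returnMat[index,:] = [float(x) for x in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))  # the last field is the label
        index += 1
    return returnMat,classLabelVector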
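
For a purely numeric, tab-separated file like this one, NumPy can also do the parsing. The sketch below is an alternative to `file2matrix`, assuming every line holds three feature values followed by an integer label; `load_dating_file` is a hypothetical name, and the three sample rows are made up for the demonstration.

```python
import os
import tempfile

import numpy as np

def load_dating_file(filename):
    """Return (features, labels) from a tab-separated file of 3 features + 1 label."""
    # ndmin=2 keeps the result two-dimensional even for a one-line file.
    data = np.loadtxt(filename, delimiter='\t', ndmin=2)
    features = data[:, 0:3]
    labels = data[:, -1].astype(int).tolist()
    return features, labels

# Write a small made-up sample file, parse it, then clean up.
sample = "40920\t8.3\t0.95\t3\n14488\t7.2\t1.67\t2\n26052\t1.4\t0.81\t1\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
features, labels = load_dating_file(tmp.name)
os.remove(tmp.name)
print(features.shape, labels)  # (3, 3) [3, 2, 1]
```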

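2.3 Normalizing the data

The kNN code above does not preprocess the data, but in practice the features should be normalized before distances are computed. Otherwise the features with the largest numeric range dominate the distance, and differences in the other features are barely taken into account.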

# Normalize the feature values
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)  # per-feature (column-wise) minimum and maximum, each a row vector
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals,(m,1))  # subtract the minimum from every row
    normDataSet = normDataSet / (tile(ranges,(m,1)))  # element-wise division by the range
    return normDataSet,ranges,minVals
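
The rescaling used here is min-max normalization, newValue = (oldValue - min) / (max - min), which maps every feature into the range 0 to 1. A sketch of the same computation with NumPy broadcasting instead of `tile` follows. The guard for constant columns is an addition of this sketch, `min_max_normalize` is a made-up name, and the sample values are invented.

```python
import numpy as np

def min_max_normalize(data_set):
    """Scale every column of `data_set` to [0, 1]; also return ranges and minimums."""
    data_set = np.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Assumption of this sketch: a constant column gets range 1 to avoid dividing by zero.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    return (data_set - min_vals) / safe_ranges, safe_ranges, min_vals

data = np.array([[40000.0, 8.0, 1.0],
                 [10000.0, 2.0, 0.5],
                 [25000.0, 5.0, 0.0]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm[2].tolist())  # [0.5, 0.5, 0.0]
```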

2.4 Testing the algorithm: validating the classifier as a complete program

# Overall test function
def datingClassTest():
    hoRatio = 0.10  # fraction of the data held out for testing
    # raw string so that the backslashes in the Windows path are not treated as escapes
    datingDataMat,datingLabels = file2matrix(r'D:\学习\大四上学习\Python机器学习代码实现\MLiA_SourceCode\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)  # 0.1*m samples are used as the test set
    errorCount = 0.0
    k = 4
    for i in range(numTestVecs):  # only rows 0 to 0.1*m are used for testing
        # 1st argument: the sample to classify; 2nd: the training set (rows 0.1*m to m,
        # i.e. 90% of the data); 3rd: the labels of the training set
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],k)
        print("the classifier came back with: %d, the real answer is: %d" %(classifierResult,datingLabels[i]))
        if(classifierResult != datingLabels[i]):
            errorCount += 1.0

    print("the total error rate is: %f %%" % (errorCount / float(numTestVecs) * 100))

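2.5 Using the algorithm: building a complete, usable system

Next we complete the program: once Helen enters a person's information, it predicts how interested she will be in him.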

# Read the feature values from the user and predict the class of that person
def classifyPerson():
    resultList = ['You will not be interested!','You will be a little interested!','You will like this person a lot!']
    percentTats = float(input("Percentage of time spent playing video games?"))
    ffMiles = float(input("Frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    # all inputs have been read
    # raw string so that the backslashes in the Windows path are not treated as escapes
    datingDataMat,datingLabels = file2matrix(r'D:\学习\大四上学习\Python机器学习代码实现\MLiA_SourceCode\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    k = 3
    inArr = array([ffMiles,percentTats,iceCream])  # build an array; it must be normalized before classification
    classifierResult = classify0(((inArr - minVals)/ranges),normMat,datingLabels,k)
    print("You will probably like this person:",resultList[classifierResult-1])
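
The important detail in this step is that the new sample is scaled with the minimums and ranges of the training data, not with statistics of its own. A small sketch with invented numbers; `scale_like_training` is a hypothetical helper, and the column order (miles, game-time percentage, ice cream) is assumed to match the input array built above.

```python
import numpy as np

# Made-up training features: miles per year, game-time percentage, liters of ice cream
train = np.array([[40000.0, 8.0, 1.0],
                  [10000.0, 2.0, 0.5],
                  [25000.0, 5.0, 0.0]])
min_vals = train.min(axis=0)
ranges = train.max(axis=0) - min_vals

def scale_like_training(sample):
    """Apply the training set's min-max scaling to a new sample."""
    return (np.asarray(sample, dtype=float) - min_vals) / ranges

print(scale_like_training([25000, 8, 0.5]).tolist())  # [0.5, 1.0, 0.5]
```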

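3. Example: a handwriting recognition system

3.1 Preparing the data: converting images into test vectors

[Figure: a handwritten digit stored as a grid of 0 and 1 characters]

As the figure shows, positions where the digit is written are represented by 1 and blank positions by 0. Each image therefore has to be converted into a vector before the kNN algorithm can compute distances.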

# Convert a 32*32 image matrix into a 1*1024 vector so that the classifier above can be used
def img2vector(filename):
    returnvector = zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            linestr = fr.readline()  # one row of the image
            for j in range(32):
                returnvector[0,32*i+j] = int(linestr[j])

    return returnvector
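
The same flattening can be written without explicit index arithmetic. The sketch below assumes a text file with 32 lines of 32 characters, each '0' or '1', as described above; `img_to_vector` is a hypothetical variant, and the demo file is generated on the fly.

```python
import os
import tempfile

import numpy as np

def img_to_vector(filename, size=32):
    """Read a size x size grid of '0'/'1' characters into a (1, size*size) array."""
    with open(filename) as fr:
        rows = [[int(ch) for ch in fr.readline()[:size]] for _ in range(size)]
    return np.array(rows, dtype=float).reshape(1, size * size)

# Demo: a made-up image whose first row is all ones and whose other rows are zeros.
lines = ["1" * 32] + ["0" * 32] * 31
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("\n".join(lines) + "\n")
vec = img_to_vector(tmp.name)
os.remove(tmp.name)
print(vec.shape, int(vec.sum()))  # (1, 1024) 32
```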

3.2 Testing the algorithm: recognizing handwritten digits with k-nearest neighbors

# os.listdir lists the file names in a given directory
# (adjust the directory paths below to where the digit files are stored)
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)  # number of files in the training directory
    trainingMat = zeros((m,1024))

    for i in range(m):
        fileNamestr = trainingFileList[i]  # for example 0_1.txt
        filestr = fileNamestr.split('.')[0]  # gives 0_1
        classNumStr = int(filestr.split('_')[0])  # the digit that was written
        hwLabels.append(classNumStr)  # label of the i-th sample
        # listdir returns bare file names, so the directory has to be prepended
        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNamestr)

    testFileList = listdir('digits/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    k = 3
    for i in range(mTest):
        fileNamestr = testFileList[i]
        filestr = fileNamestr.split('.')[0]  # gives 0_1
        classNumStr = int(filestr.split('_')[0])  # the digit that was written
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNamestr)
        classifyresult = classify0(vectorUnderTest,trainingMat,hwLabels,k)
        if(classifyresult != classNumStr):
            errorCount += 1.0

    print("The total number of errors is: %d" % errorCount)
    print("The total error rate is: %f" % (errorCount/float(mTest)))
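
The label extraction used in both loops can be isolated into a small helper. This is a sketch assuming file names of the form digit, underscore, sample index, such as 0_1.txt; `label_from_filename` is a name introduced here for illustration.

```python
import os

def label_from_filename(file_name):
    """Extract the digit label from names like '0_1.txt'."""
    stem = os.path.splitext(file_name)[0]  # '0_1.txt' -> '0_1'
    return int(stem.split('_')[0])         # '0_1' -> 0

print(label_from_filename('0_1.txt'))   # 0
print(label_from_filename('9_45.txt'))  # 9
```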