The k-Nearest Neighbors Classification Algorithm Explained, with Two Use Cases

晓理紫

Article link: https://blog.csdn.net/u011573853/article/details/102536755

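The k-nearest neighbors classification algorithm

1. Basic idea

We start from a sample data set (the training set) in which every item carries a label, so the class of each item is known. When an unlabeled item arrives, each of its features is compared with the corresponding feature of the items in the sample set, and the algorithm takes the class label of the most similar items (the nearest neighbors) as the label of the test item. The decision rests on the distance between the features of the test item and the features of the sample items: the labels of the closest items are the ones selected.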

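2. General workflow

* 1) Collect data: any collection method can be used
* 2) Prepare data: the distance calculation needs numeric values, preferably in a structured format
* 3) Analyze data: any analysis method can be used
* 4) Train the algorithm: k-nearest neighbors needs no training; supplying the samples and the test data is enough
* 5) Test the algorithm: compute the error rate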

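3. The kNN classification algorithm

1. Steps inside the algorithm

1) Compute the distance between the current point and every point in the labeled data set
2) Sort the distances in ascending order
3) Take the k points closest to the current point
4) Count how often each class occurs among those k points
5) Return the most frequent class among the k points as the predicted class of the current point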

2. Computing the distance between points

The distance between point $A(x_a,y_a)$ and point $B(x_b,y_b)$ is

$d=\sqrt{(x_a-x_b)^2+(y_a-y_b)^2}$
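The two-dimensional formula extends to any number of features by summing the squared differences over every coordinate. The snippet below is a minimal sketch of that generalization, assuming NumPy is installed; the function name `euclidean_distance` is illustrative and does not appear in the original code.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two points given as equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square the per-coordinate differences, add them up, take the square root.
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance([0, 0], [3, 4]))        # 5.0
print(euclidean_distance([3, 2, 1], [4, 5, 6]))  # sqrt(1 + 9 + 25) = sqrt(35)
```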

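3. Functions used in the code

1) NumPy's tile function can be viewed as a copy operation on an array

tile(A,(b,c)) can be read simply as repeating array A b times along the rows (vertically) and c times along the columns (horizontally)

For example:

np.tile([2,3],(1,2)) # repeat [2,3] once along the rows and twice along the columns
>>[[2,3,2,3]]
np.tile([2,3],(2,2)) # repeat [2,3] twice along the rows and twice along the columns
>>[[2,3,2,3]
   [2,3,2,3]]
np.tile([2,3],(2,1)) # repeat [2,3] twice along the rows and once along the columns
>>[[2,3]
   [2,3]]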
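The classifier later in this post uses tile to stretch one query vector to the shape of the sample matrix before subtracting. NumPy broadcasting produces the same differences without the explicit copy. The sketch below compares the two; the array values and variable names are made up for illustration.

```python
import numpy as np

data_set = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
query = np.array([0.0, 0.2])

# Explicit copy: the query is repeated on as many rows as data_set has.
tiled = np.tile(query, (data_set.shape[0], 1))
diff_tile = tiled - data_set

# Broadcasting: NumPy stretches the 1-D query across the rows implicitly.
diff_broadcast = query - data_set

print(tiled.shape)                                 # (4, 2)
print(np.array_equal(diff_tile, diff_broadcast))   # True
```

Either form works; tile makes the shapes explicit, while broadcasting avoids allocating the repeated array.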

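2) Array slicing in NumPy

A slice generally has the form [start:stop:step] $\color{red}{\text{the stop position is not included}}$

For example, [0:4:1] takes the values at positions 0, 1, 2, 3 with a step of 1

arr[::,::] $\color{red}{\text{the part before the comma operates on rows, the part after it on columns}}$

For example:

arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,13]])
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 13]])
arr[1:2,2:3]      # take row index 1 and column index 2 of arr, returned as a 1x1 array
>>array([[7]])
arr[:,1:3]        # take all rows of arr, and the columns at index 1 and 2 (step 1)
>>array([[ 2,  3],
       [ 6,  7],
       [10, 11]])
arr[::2,:]        # take every second row of arr starting from index 0, and all columns
>>array([[ 1,  2,  3,  4],
       [ 9, 10, 11, 13]])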

3) Summing an array with the sum function

arr.sum(axis=1) # add up the values in each row of arr
>>array([10, 26, 43])
arr.sum(axis=0) # add up the values in each column of arr
>>array([15, 18, 21, 25])

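4) The itemgetter function in operator and the built-in sorted function

operator.itemgetter returns a callable that fetches items from whatever object it is applied to; its arguments are the indices (or keys) to fetch. For example, operator.itemgetter((2,3)) creates a callable that looks up the single index (2,3); applied to a two-dimensional NumPy array, it returns the element at row index 2, column index 3 (both counted from 0). $\color{red}{\text{Put simply, it builds a function that reads a given position from an array; to use it, pass the array to that function as the argument}}$

For example:

 b=operator.itemgetter((2,3)) # b is a callable that fetches index (2,3)
 b(arr)                       # use b to read the value at row index 2, column index 3 of arr
>>13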
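It is easy to confuse `itemgetter((2,3))` with `itemgetter(2,3)`. With one tuple argument the callable performs a single lookup `obj[(2, 3)]`; with two separate arguments it performs two lookups and returns them as a tuple. The sketch below illustrates the difference on the same `arr` as above, and shows the `key=` use that the classifier relies on; the variable names are illustrative.

```python
import operator
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 13]])

get_cell = operator.itemgetter((2, 3))  # one tuple argument: arr[(2, 3)], i.e. arr[2, 3]
get_rows = operator.itemgetter(0, 2)    # two arguments: (arr[0], arr[2])

print(get_cell(arr))                    # 13
first, third = get_rows(arr)
print(first, third)                     # [1 2 3 4] [ 9 10 11 13]

# As a key function: pick the (label, count) pair with the largest count.
pairs = [('a', 1), ('b', 34), ('c', 2)]
print(max(pairs, key=operator.itemgetter(1)))  # ('b', 34)
```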

sorted(iterable, /, *, key=None, reverse=False)
Return a new list containing all items from the iterable in ascending order.

A custom key function can be supplied to customize the sort order, and the
reverse flag can be set to request the result in descending order.

iterable: any iterable object, for example a list or a dict
key: a function that returns the value used for comparison
reverse: whether to sort in ascending or descending order; ascending by default

For example:

# using a list
data = [2,40,56,3,54,7,8,14,89]
sorted(data)
>>[2, 3, 7, 8, 14, 40, 54, 56, 89]
sorted(data,reverse=True)
>>[89, 56, 54, 40, 14, 8, 7, 3, 2]

# using a dict
dics={'a':1,'b':34,'c':2,'d':4,'e':6}
sorted(dics)                                      # iterating a dict yields its keys, so only the keys are sorted
>>['a', 'b', 'c', 'd', 'e']
sorted(dics.items(),key=operator.itemgetter(0))   # this also sorts by key
>>[('a', 1), ('b', 34), ('c', 2), ('d', 4), ('e', 6)]
sorted(dics.items(),key=operator.itemgetter(1))  # sort by value, ascending
>>[('a', 1), ('c', 2), ('d', 4), ('e', 6), ('b', 34)]
sorted(dics.items(),key=operator.itemgetter(1),reverse=True) # sort by value, descending
>>[('b', 34), ('e', 6), ('d', 4), ('c', 2), ('a', 1)]

5) NumPy's argsort function

argsort returns the indices that would sort the array values in ascending order

a=[90,45,3,4,5,3,1,6]
np.argsort(a)
>>array([6, 2, 5, 3, 4, 7, 1, 0], dtype=int64)
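In kNN, argsort is what turns a list of distances into "the k nearest samples": sort the indices by distance and keep the first k. The snippet below is a small sketch with made-up distances and labels.

```python
import numpy as np

distances = np.array([0.9, 0.2, 1.5, 0.4, 0.7])
labels = np.array(['A', 'B', 'A', 'B', 'A'])

k = 3
nearest = np.argsort(distances)[:k]  # indices of the k smallest distances
print(nearest)                       # [1 3 4]
print(labels[nearest])               # ['B' 'B' 'A']
```

When several distances are equal, the default sort does not promise which of the tied indices comes first; passing `kind='stable'` to `np.argsort` keeps tied entries in their original order.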

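4. Implementing the kNN classifier

import numpy as np
import operator as op

"""
inX: the data point to classify
dataSet: the sample data set
labels: the labels of the sample data set
k: after the distances are computed, the k shortest ones are used for the vote
Steps of the distance formula
1. Take the coordinate differences  A=[3,2,1] B=[4,5,6]  A-B=[-1,-3,-5]
2. Square the differences           (A-B)**2 = [1,9,25]
3. Sum the squares                  ===> 35
4. Take the square root             sqrt(35)
"""
def classify0(inX,dataSet,labels,k):
    # number of samples in the data set
    dataSetSize=dataSet.shape[0]
    
    # expand the test point to the same shape as the sample set, then subtract the sample set to get the coordinate differences
    diffMat = np.tile(inX,(dataSetSize,1))-dataSet
    # print("diffMat::",diffMat)
    # square the differences
    sqDiffmat = diffMat**2
    # print('sqDiffmat:',sqDiffmat)
    # sum the squares, i.e. add up the values in each row
    sqDistance = sqDiffmat.sum(axis=1)
    # print('sqDistance:',sqDistance)
    
    # take the square root of the sums
    distance = sqDistance**0.5
    # print('distance:',distance)
    
    # indices that sort the distances in ascending order
    sorteDisIndices=np.argsort(distance)
    # print('sorteDisIndices:',sorteDisIndices)
    classCount={}
    # collect the labels of the k closest samples
    for i in range(k):
        # labels and samples are aligned one to one, so the sample index sorteDisIndices[i] also gives the label of that sample
        voteIlabel = labels[sorteDisIndices[i]]
        # print('voteIlabel:',voteIlabel)
        
        # count how many times each label occurs
        tem = classCount.get(voteIlabel,0)+1
        # print('tem:',tem)
        classCount[voteIlabel]=tem
        # print('classCount:',classCount)
        
    # sort the label counts in descending order
    sortedClassCouunt = sorted(classCount.items(),key=op.itemgetter(1),reverse=True)
    # print('sortedClassCouunt:',sortedClassCouunt)
    # return the label that occurs most often
    return sortedClassCouunt[0][0]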
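To see the voting step on concrete numbers, here is a compact, self-contained sketch of the same idea run on a four-point toy data set. It is not the author's `classify0`: the name `knn_classify` is hypothetical, it uses broadcasting instead of tile and `collections.Counter` instead of a sorted dict, and the sample points are made up for illustration.

```python
from collections import Counter
import numpy as np

def knn_classify(query, data_set, labels, k):
    """Majority vote among the k samples closest to query (Euclidean distance)."""
    distances = np.sqrt(((data_set - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]          # indices of the k closest samples
    votes = Counter(labels[i] for i in nearest)  # label -> number of votes
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_classify(np.array([0.0, 0.2]), group, labels, 3))  # B
print(knn_classify(np.array([1.0, 1.2]), group, labels, 3))  # A
```

For the first query the three nearest samples carry the labels B, B and A, so B wins the vote; for the second they carry A, A and B.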

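5. Normalizing the data

Features whose values lie in different ranges are normalized so that, as far as possible, all of them fall in the same range

Normalization formula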

$newValue=\frac{(oldValue-min)}{(max-min)}$

# normalize the data with newValue = (oldValue-min)/(max-min) so that every value lies between 0 and 1
def autoNorm(dataSet):
    # minimum of each column
    minValue = dataSet.min(0)
    print("minValue:",minValue)
    # maximum of each column
    maxValue = dataSet.max(0)
    print('maxValue:',maxValue)
    
    rangs = maxValue-minValue
    print('rangs:',rangs)
    normDataSet = np.zeros((dataSet.shape))
    m = dataSet.shape[0]
    # normalize the data
    normDataSet = dataSet-np.tile(minValue,(m,1))
    normDataSet = normDataSet/np.tile(rangs,(m,1))
    print('normDataSet:',normDataSet)
    return normDataSet,rangs,minValue
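One case the formula does not cover is a feature whose values are all identical: max equals min and the division fails. The sketch below shows one possible way to guard against that while keeping the same return values; the function name `min_max_normalize`, the choice to map constant columns to 0, and the sample numbers are assumptions for illustration, not part of the original code.

```python
import numpy as np

def min_max_normalize(data_set):
    """Scale every column of data_set to [0, 1]; constant columns become 0."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Replace zero ranges by 1 so that constant columns divide cleanly to 0.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    return (data_set - min_vals) / safe_ranges, ranges, min_vals

# Made-up rows: three features on very different scales.
data = np.array([[40920.0, 8.3, 0.95],
                 [14488.0, 7.2, 1.67],
                 [26052.0, 1.4, 0.81]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm.min(axis=0))  # [0. 0. 0.]
print(norm.max(axis=0))  # [1. 1. 1.]
```

Without normalization, the first feature would dominate the Euclidean distance simply because its values are thousands of times larger than the others.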

6. kNN use cases

1. Screening people

# read the file and convert it to a feature matrix and a label list
def file2matrix(filename):
    # the with block closes the file once it has been read
    with open(filename) as fr:
        arrayOlines = fr.readlines()
    numberOfLines = len(arrayOlines)
    print('numberOfLines:',numberOfLines)
    returnMat = np.zeros((numberOfLines,3))
    classLabelVector=[]
    index=0
    for line in arrayOlines:
        line = line.strip()
        listFromLine = line.split('\t')
        # the first three columns are features, the last column is the integer label
        returnMat[index,:]=listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index+=1
    # print('returnMat:',returnMat)
    # print('classLabelVector:',classLabelVector)
    return returnMat,classLabelVector

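# does the same job as the previous function
def readtxt(filename):
    dataset = np.loadtxt(filename)
    data = dataset[:,0:3]
    # loadtxt returns floats, so convert the label column back to integers
    label = dataset[:,3].astype(int)
    # print('dataset:',dataset)
    print('data:',data)
    # print('label:',label)
    return data,label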


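# visualize the data
import matplotlib.pyplot as plt

def showData(datiingDataMat,datiingDatalabes):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # marker size and colour both follow the class label; the factor 15.0 is an assumed scale chosen so the markers are visible
    ax.scatter(datiingDataMat[:,1],datiingDataMat[:,2],15.0*np.array(datiingDatalabes),
               15.0*np.array(datiingDatalabes))
    plt.show()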

# evaluate on a held-out test set
def datingClassTest(filename):
    # fraction of the data held out for testing
    hoRatio = 0.1
    datingDataMat,datingLabels = file2matrix(filename)
    normMat,ranges,minValue = autoNorm(datingDataMat)
    m = normMat.shape[0]
    unmTestVecs = int(m*hoRatio)
    errorCount=0.0
    for i in range(unmTestVecs):
        # the first unmTestVecs rows are test points, the remaining rows are the samples
        classifierRsult = classify0(normMat[i,:],normMat[unmTestVecs:m,:],datingLabels[unmTestVecs:m],3)
        print('the classifier came back with:%d,the real answer is:%d'%(classifierRsult,datingLabels[i]))
        if classifierRsult!=datingLabels[i]:
            errorCount+=1.0
    print('the total error rate is:%f'%(errorCount/float(unmTestVecs)))
# wrap the classifier in an interactive example that reads the test values from the user
def classifyPerson(filename):
    resultList = ['not at all','in small doses','in large doses']
    percentTats=float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream  = float(input("liters of ice cream consumed per year?"))
    datingDataMat,datingLabels = file2matrix(filename)
    normMat,ranges,minVals = autoNorm(datingDataMat)
    # use an array so the input can be normalized in the same way as the samples
    inArr = np.array([percentTats,ffMiles,iceCream])
    classifierResult=classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
    # labels are 1, 2, 3, so subtract 1 to index resultList
    print("you will probably like this person:",resultList[classifierResult-1])

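2. Handwritten digit recognition

# convert a 32x32 image (stored as text) into a 1x1024 vector
def img2vector(filename):
    returnVect = np.zeros((1,1024))
    # the with block closes the file once it has been read
    with open(filename) as fr:
        for i in range(32):
            lineStr=fr.readline()
            for j in range(32):
                returnVect[0,32*i+j]=int(lineStr[j])
    return returnVect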
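The conversion can be tried without the digits data set by writing a synthetic image to a temporary file first. The sketch below does that with a hypothetical `text_image_to_vector` helper that follows the same idea as `img2vector` (one character per pixel, rows concatenated); the diagonal test pattern is made up.

```python
import os
import tempfile
import numpy as np

def text_image_to_vector(path, size=32):
    """Read a size x size grid of '0'/'1' characters into a (1, size*size) array."""
    with open(path) as fr:
        rows = [[int(ch) for ch in fr.readline()[:size]] for _ in range(size)]
    # Row i of the image lands at positions size*i .. size*i + size - 1.
    return np.array(rows, dtype=float).reshape(1, size * size)

# Build a synthetic 32x32 "image": ones on the diagonal, zeros elsewhere.
lines = [''.join('1' if i == j else '0' for j in range(32)) for i in range(32)]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(lines) + '\n')
    tmp_path = tmp.name

vec = text_image_to_vector(tmp_path)
os.remove(tmp_path)
print(vec.shape)       # (1, 1024)
print(int(vec.sum()))  # 32
```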

# handwritten digit recognition system
import os
def handwritingClassTest():
    hwLabels=[]
    finame='M:/机器学习实战源码/MLiA_SourceCode/machinelearninginaction/Ch02/digits/trainingDigits'
    finame2 = 'M:/机器学习实战源码/MLiA_SourceCode/machinelearninginaction/Ch02/digits/testDigits'
    
    # list the files in the training folder
    trainingFileList = os.listdir(finame)
    # number of training files
    m = len(trainingFileList)
    traingMat = np.zeros((m,1024))
    for i in range(m):
        # file name
        fileNameStr=trainingFileList[i]
        # the digit label is the part of the file name before the underscore
        filestr=fileNameStr.split('.')[0]
        classNumStr=int(filestr.split('_')[0])
        hwLabels.append(classNumStr)
        # load the sample vector
        traingMat[i,:]=img2vector(finame+'/'+fileNameStr)
    # do the same for the test data
    testFileList = os.listdir(finame2)
    errorCount=0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        filestr = fileNameStr.split('.')[0]
        classNumStr=int(filestr.split('_')[0])
        vectorUnderTest = img2vector(finame2+'/'+fileNameStr)
        classifierRrst = classify0(vectorUnderTest,traingMat,hwLabels,3)
        print("the classifier came back with:%d,the real answer is:%d"%(classifierRrst,classNumStr))
        # count the wrong predictions
        if classifierRrst!=classNumStr:
            errorCount+=1.0
    print("\nthe total number of error is:%d"%errorCount)
    print("\nthe total error rate is :%f"%(errorCount/float(mTest)))
