K-Nearest Neighbors: Machine Learning in Action

Treasure_zz

Article link: https://blog.csdn.net/zz133110/article/details/89415559

K-Nearest Neighbors Algorithm

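Starting today I'm opening a new series: I'm working through the book Machine Learning in Action and writing up my study notes as I go. For the accompanying code, see:

https://github.com/pbharrin/machinelearninginaction
The repository contains the datasets and the source code, but that code was written for Python 2. My notes are all based on Python 3, so a few function calls differ, though the ideas are the same.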

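k-nearest neighbors (kNN) is about the simplest machine learning method there is. It is a classification method and a form of supervised learning. The principle is to compute the distance from the point to be classified to every labeled sample, take the k closest samples, and return the label that wins the vote among them. Strictly speaking, "training set" is not quite the right term, because the algorithm has no training phase; that is why it is called lazy learning.

Pros: high accuracy, insensitive to outliers, no assumptions about the input data

Cons: high computational and space complexity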

A simple example

Let's go through a simple example. First build a dataset as follows (remember to install the numpy package):

from numpy import *
import operator
def createDataSet():
    # four labeled 2-D points: two 'A' near (1, 1) and two 'B' near (0, 0)
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels
group,labels = createDataSet()

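The dataset has two points labeled A near (1, 1) and two points labeled B near (0, 0). Now take the point (0, 0) and ask: does it belong to A or to B?

The approach is to compute the distance from (0, 0) to each of the four points, pick the k closest ones, and take the label that occurs most often among them.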

If you're wondering what the tile function does, this article explains it very vividly and you'll get it in two seconds: https://www.jianshu.com/p/9519f1984c70
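As a quick illustration of what tile does in the distance computation, here is a minimal sketch. The variable names (in_x, data, tiled) are my own, and the comparison with NumPy broadcasting is an addition for context, not something the book's code relies on.

```python
import numpy as np

in_x = [0, 0]   # the query point from the example
data = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])

# tile repeats in_x 4 times along the rows and once along the columns -> shape (4, 2)
tiled = np.tile(in_x, (data.shape[0], 1))
diff_tile = tiled - data

# broadcasting produces the same differences without building the repeated copy
diff_broadcast = np.asarray(in_x) - data

print(tiled.shape)                                 # (4, 2)
print(np.array_equal(diff_tile, diff_broadcast))   # True
```

So tile simply stacks the query point once per sample, which lets the subtraction happen row by row.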

dataSet = group
k=3
inX = [0,0]
# compute the Euclidean distance from inX to every row of dataSet
dataSetSize = dataSet.shape[0]
diffMat = tile(inX,(dataSetSize,1)) - dataSet
sqDiffMat = diffMat ** 2
sqDistances = sqDiffMat.sum(axis=1)
distances = sqDistances**0.5
sortedDistIndicies = distances.argsort()
classCount = {}
# tally the labels of the k nearest points
for i in range(k):
    voteIlabel = labels[sortedDistIndicies[i]]
    classCount[voteIlabel]=classCount.get(voteIlabel,0)+1
# sort the labels by vote count, highest first
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

print(sortedClassCount[0][0])
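The same distance-then-vote idea can be written more compactly. The sketch below is my own variant, not the book's code: knn_vote is a hypothetical helper name, and it swaps the dictionary tally for collections.Counter and the tile call for broadcasting.

```python
from collections import Counter
import numpy as np

def knn_vote(in_x, data, labels, k):
    """Hypothetical helper: majority vote among the k rows of data nearest to in_x."""
    distances = np.sqrt(((data - np.asarray(in_x)) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]              # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)    # label -> number of votes
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_vote([0, 0], group, labels, 3))   # B
print(knn_vote([1, 1], group, labels, 3))   # A
```

With k=3 the three nearest neighbors of (0, 0) are two B points and one A point, so B wins the vote.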

Applying it to dating data

Next we classify some dating data. The dataset looks like this (download it here: https://github.com/pbharrin/machinelearninginaction/blob/master/Ch02/datingTestSet.txt):

40920	8.326976	0.953952	largeDoses
14488	7.153469	1.673904	smallDoses
26052	1.441871	0.805124	didntLike
...

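The 4 columns are: frequent flyer miles earned per year, percentage of time spent playing games, liters of ice cream consumed per week, and how much she liked the person.

This is the training set. The test task is: given the miles, gaming time and ice cream consumption, predict how much the woman likes this person.

You need to install the sklearn library first. It is used here to turn the liking levels into integers. LabelEncoder numbers the classes in sorted order, so after adding 1, didntLike becomes 1, largeDoses becomes 2 and smallDoses becomes 3.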
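To make the label numbering concrete, here is a small pure-Python sketch of the same mapping. It assumes the classes are numbered from 0 in sorted order and then shifted by 1, which is how I understand LabelEncoder followed by +1 to behave; the names raw_labels and mapping are mine.

```python
# labels of the first three rows of datingTestSet.txt, as shown above
raw_labels = ['largeDoses', 'smallDoses', 'didntLike']

# assumption: classes are numbered in sorted order starting at 0, then shifted by +1
classes = sorted(set(raw_labels))
mapping = {name: idx + 1 for idx, name in enumerate(classes)}
encoded = [mapping[name] for name in raw_labels]

print(mapping)   # {'didntLike': 1, 'largeDoses': 2, 'smallDoses': 3}
print(encoded)   # [2, 3, 1]
```

Note that the numbers are just class identifiers that follow alphabetical order, not a ranking of how much she liked the person.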

The code below reads the file and splits the text data into the feature matrix returnMat and the label vector classLabelVector.

from numpy import *
from sklearn import preprocessing
import operator
def file2matrix(filename):
    # utf-8-sig strips a leading byte-order mark if the file has one
    with open(filename, encoding='UTF-8-sig') as fr:
        arrayOlines = fr.readlines()
    numberOfLines = len(arrayOlines)
    returnMat=zeros((numberOfLines,3))
    classLabelVector=[]
    index=0
    # parse each tab-separated line: the first three fields are features, the last is the label
    for line in arrayOlines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:]=listFromLine[0:3]
        classLabelVector.append(listFromLine[-1])
        index +=1
    # LabelEncoder numbers the label strings from 0 in sorted order; +1 turns them into 1, 2, 3
    le = preprocessing.LabelEncoder()
    classLabelVector =le.fit_transform(classLabelVector)+1
    return returnMat,classLabelVector

Run it and take a look at the result. Note: don't copy the path verbatim; use wherever you saved the file.

file2matrix('data/datingTestSet.txt')

(array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
        [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
        [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
        ...,
        [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
        [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
        [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]]),
 array([2, 3, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1, 3, 2, 3, 1,
        ...
        2, 3, 3, 3, 3, 3, 1, 2, 2, 2], dtype=int64))

This step displays the data, i.e. data visualization. Install matplotlib first. The resulting plot shows gaming time against ice cream consumption, with the liking level as marker color and size.

import matplotlib.pyplot as plt
datingData,datingLable=file2matrix('data/datingTestSet.txt')
fig =plt.figure()
ax = fig.add_subplot(111) # add a subplot; one figure (the canvas) can hold several subplots
# scatter(x, y, size, color): the first two arguments are the coordinates, the third the marker size, the fourth the color
ax.scatter(datingData[:,1],datingData[:,2],8*array(datingLable),array(datingLable))

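[Figure: scatter plot of gaming time against ice cream consumption, colored and sized by label]
Next, flight miles and ice cream consumption against the liking level.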

plt.figure() # start a new figure so this plot is not drawn on top of the previous one
plt.scatter(datingData[:,0],datingData[:,2],8*array(datingLable),array(datingLable))
plt.show()

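[Figure: scatter plot of flight miles against ice cream consumption, colored and sized by label]

Finally, flight miles and gaming time against the liking level.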

plt.scatter(datingData[:,0],datingData[:,1],8*array(datingLable),array(datingLable))
plt.show()

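[Figure: scatter plot of flight miles against gaming time, colored and sized by label] (Off topic: the more you think about this data, the creepier it gets, hahaha)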

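Normalizing the values

First, what is normalization? It rescales each of the features listed above into the range 0 to 1, or -1 to 1.

$\text{newValue} = \frac{\text{oldValue} - \text{min}}{\text{max} - \text{min}}$

Why normalize? The flight miles are on the order of thousands to tens of thousands, while the other two features are in the ones or tens. As a result the computed distance is dominated by the flight miles; in other words, that feature carries far too much weight. To put the three features on an equal footing, we normalize. The autoNorm function below and its output show what the normalized data looks like.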
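To see how strongly the miles dominate, the sketch below takes the first two sample rows and measures what fraction of the squared distance each feature contributes, before and after min-max scaling. The helper share_per_feature is a name I made up; the minimums and ranges are the values autoNorm prints in this post.

```python
import numpy as np

# first two rows of datingTestSet.txt as shown above
a = np.array([40920, 8.326976, 0.953952])
b = np.array([14488, 7.153469, 1.673904])

# column minimums and ranges as printed by autoNorm in this post
min_vals = np.array([0.0, 0.0, 0.001156])
ranges = np.array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])

def share_per_feature(u, v):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (u - v) ** 2
    return sq / sq.sum()

raw_share = share_per_feature(a, b)
norm_share = share_per_feature((a - min_vals) / ranges, (b - min_vals) / ranges)

print(raw_share.round(4))    # miles account for essentially all of the raw distance
print(norm_share.round(4))   # after scaling, all three features contribute
```

On the raw values the gaming and ice cream columns barely register, which is exactly the imbalance normalization removes.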

def autoNorm(dataset):
    minVals=dataset.min(0)   # column-wise minimum
    maxVals=dataset.max(0)   # column-wise maximum
    ranges=maxVals-minVals
    m=dataset.shape[0]
    # (value - min) / (max - min), with min and range tiled to the shape of dataset
    normDataSet = dataset-tile(minVals,(m,1))
    normDataSet = normDataSet/tile(ranges,(m,1))
    return normDataSet,ranges,minVals
print(autoNorm(datingData))

(array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]]), array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00]), array([0.      , 0.      , 0.001156]))
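For comparison, here is a minimal sketch of the same min-max scaling written with NumPy broadcasting instead of tile, on made-up toy data. auto_norm and the toy values are my own, and the sketch assumes no column is constant (a zero range would divide by zero). It also shows why the ranges and minimums are returned: a new query point has to be scaled with the same values.

```python
import numpy as np

def auto_norm(dataset):
    """Min-max scale each column to [0, 1] using broadcasting instead of tile."""
    min_vals = dataset.min(axis=0)
    ranges = dataset.max(axis=0) - min_vals
    return (dataset - min_vals) / ranges, ranges, min_vals

# toy data, made up for illustration: 3 samples, 2 features
toy = np.array([[10.0, 0.5], [20.0, 1.5], [40.0, 1.0]])
norm, ranges, min_vals = auto_norm(toy)
print(norm)

# a new query point must be scaled with the same min_vals and ranges
query = np.array([25.0, 1.0])
print((query - min_vals) / ranges)   # [0.5 0.5]
```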

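With the dataset prepared, we can start classifying. The classify0 function is just the code from the simple example at the beginning, wrapped in a function.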

def classify0(inX, dataSet, labels, k):
    # inX: vector to classify; dataSet: labeled samples, one per row; labels: their classes; k: number of neighbors
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every sample
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount={}
    # tally the labels of the k nearest samples
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # the label with the most votes wins
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]



Next, hold out the first 10% of the samples as test data and classify each of them against the remaining 90%:

hoRatio=0.1
datingData,datingLable=file2matrix('data/datingTestSet.txt')
normMat,ranges,minVals=autoNorm(datingData)
m=normMat.shape[0]
numTestVecs = int(m*hoRatio)
errorCount=0.0
for i in range(numTestVecs):
    # rows numTestVecs..m-1 are the reference samples; the test rows 0..numTestVecs-1 are excluded
    classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLable[numTestVecs:m],3)
    print ("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLable[i]))
    if (classifierResult != datingLable[i]): errorCount += 1.0
print( "the total error rate is: %f" % (errorCount/float(numTestVecs)))

the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
...
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the total error rate is: 0.040000
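A quick sanity check on hold-out indexing, as a sketch: the test rows and the rows used as neighbors should not overlap, otherwise a test sample can find itself among its own neighbors. The sample count m below is an assumed value for illustration only, and the variable names are mine.

```python
m = 1000          # assumed sample count, for illustration only
ho_ratio = 0.1
num_test = int(m * ho_ratio)

test_idx = set(range(num_test))            # rows 0 .. num_test-1
ref_idx = set(range(num_test, m))          # rows num_test .. m-1
leaky_idx = set(range(num_test - 1, m))    # a slice that starts one row early

print(len(test_idx & ref_idx))         # 0
print(sorted(test_idx & leaky_idx))    # [99]
```

A slice starting at num_test - 1 shares exactly one row with the test set, so the last test sample would be compared against itself.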

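Handwritten digit recognition

[Figure: sample of the digit data]
The data looks like this. Download: https://github.com/pbharrin/machinelearninginaction/blob/master/Ch02/digits.zip

Treat 1 as black and 0 as white, with each character being one pixel, and the file forms a handwritten digit. kNN classification here can be pictured as computing the distance between the test sample and each training image of the digits 0-9. That distance effectively measures how much two images overlap: a short distance means a high overlap, and that is how the digit is recognized.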
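The link between distance and overlap can be made precise for 0/1 images: the squared Euclidean distance equals the number of pixels where the two images disagree. The sketch below checks this on two made-up 8-pixel vectors (the real vectors have 1024 entries).

```python
import numpy as np

# two made-up 8-pixel "images", for illustration only
a = np.array([0, 1, 1, 0, 0, 1, 0, 0], dtype=float)
b = np.array([0, 1, 0, 0, 1, 1, 0, 0], dtype=float)

euclidean = np.sqrt(((a - b) ** 2).sum())
mismatches = int((a != b).sum())   # pixels where the two images differ

print(mismatches)                               # 2
print(np.isclose(euclidean ** 2, mismatches))   # True
```

Fewer mismatching pixels means a smaller distance, so the nearest training images are the ones that overlap most with the test image.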

First, read each data file into a vector.

def img2vector(filename):
    # flatten a 32x32 text image of '0'/'1' characters into a 1x1024 row vector
    returnVect = zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0,32*i+j]=int(lineStr[j])
    return returnVect
testVector= img2vector('data/testDigits/0_13.txt')
testVector[0,32:63]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

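Then we classify. Each file is one digit, the files are split into a training set and a test set, and every file name encodes the digit and a serial number. For example, the image above is stored as 0_1.txt. Since all files follow this scheme, we take the number before the "_" as the label.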
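The label extraction can be tried on its own. This is a minimal sketch; label_from_filename is a hypothetical helper name, and 9_45.txt is a made-up file name that merely follows the same pattern.

```python
def label_from_filename(file_name):
    """Hypothetical helper: '0_13.txt' -> 0 (the digit before the underscore)."""
    stem = file_name.split('.')[0]    # drop the extension -> '0_13'
    return int(stem.split('_')[0])    # the text before '_' is the digit label

print(label_from_filename('0_13.txt'))   # 0
print(label_from_filename('9_45.txt'))   # 9
```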

from os import listdir
def handwritingClassTest():
    hwLabels=[]
    trainingFileList = listdir('data/trainingDigits')
    m = len(trainingFileList) # number of training files
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0] # file name without the extension
        classNumStr = int (fileStr.split('_')[0]) # the digit label before the underscore
        hwLabels.append(classNumStr)
        trainingMat[i,:]=img2vector('data/trainingDigits/%s' % fileNameStr)
    testFileList =  listdir('data/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0] # file name without the extension
        classNumStr = int (fileStr.split('_')[0]) # the digit label before the underscore
        vectorUnderTest = img2vector('data/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest,trainingMat,hwLabels,3)
        print ("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if (classifierResult != classNumStr): errorCount += 1.0
    print ("\nthe total number of errors is: %d" % errorCount)
    print ("\nthe total error rate is: %f" % (errorCount/float(mTest)))
    
handwritingClassTest()

the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
...
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9

the total number of errors is: 10

the total error rate is: 0.010571