机器学习实战之KNN实现

最近入坑机器学习。买了本《机器学习实战》配合李航老师的《统计学习方法》方便尽快入门。方便记录,从博客开始,从KNN开始,下面先介绍一下KNN的原理以及思路,并给出实战中的代码搭配食用。

1、算法原理

k-近邻法是一种基本分类和回归方法,不具有显式的学习过程。k-近邻法实际上利用训练数据集对特征向量空间进行划分,并作为其分类”模型“。有三个要点:k值的选择,距离度量及分类决策规则。k-近邻算法的效率低,能耗高,kd树方法可以用来减少计算次数。

算法步骤:
输入: 训练数据集 T = { ( x 1 , y 1 ) , ( x 2 , x 2 ) , . . . , ( x N , y N ) } T = \{ (x_1,y_1), (x_2,x_2),...,(x_N,y_N) \} T={(x1y1),x2,x2,...,(xN,yN)}
x x x为实例的特征向量, y y y为实例的类别
输出:实例 x x x所属的类 y y y
(1)根据距离度量,在训练集 T T T中找出与 x x x最近邻的 k k k个点,涵盖这 k k k个点的 x x x的邻域记作 N k ( x ) N_k(x) Nk(x);
(2)在 N k ( x ) N_k(x) Nk(x)中根据分类决策规则(如多数表决)决定 x x x的类别 y y y

2、K-近邻分类器

导入模块

from numpy import *
import operator  # 运算符操作,为排序准备

2.1、创建数据集

def creatDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  # 创建训练数据集
    label = ['A', 'A', 'B', 'B']  # 训练数据集对应的标签
    return group, label

2.2、创建k-近邻算法

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # 得到数组结构中的4
    diffMat = tile(inX, (dataSetSize, 1))-dataSet  # 得到和group一样的结构并作数组减法,即将inX重复4次
    sqDiffMat = diffMat**2  # 对各个减后的数组做平方
    sqDistances = sqDiffMat.sum(axis=1)  # 求和
    distance = sqDistances**0.5  # 开根号
    sortDisIndicies = distance.argsort()  # 排序并返回下标
    classCount = {}  # 空字典
    for i in range(k):
        votelabel = labels[sortDisIndicies[i]]
        classCount[votelabel] = classCount.get(votelabel, 0) + 1  # 查找字典中的键,若没有返回0
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # 取定数字进行排序,items=(key,value)
    return sortedClassCount[0][0]

2.3、交互体验

In[1]:import KNN1
In[2]:group,labels=KNN1.creatDataSet()
In[3]:KNN1.classify0([0,0], group, labels, 3)

得到结果

Out[3]:'B'

3、改进约会网站的配对效果

3.1、将文本转换为Numpy的解析程序

def file2matrix(file_name):
    fr = open(file_name)  # 打开文件
    arrayOfLines = fr.readlines()  # 按行读取文件
    numberOfLines = len(arrayOfLines)  # 文件的长度
    returnMat = zeros((numberOfLines, 3))  # 创建0矩阵
    classLabelVector = []  # 创建用于存储标签的空列表
    index = 0
    for line in arrayOfLines:
        line = line.strip()  # 截取所有的回车字符
        listFormLine = line.split('\t')  # 将整行数据分割成一个元素列表
        returnMat[index, :] = listFormLine[0:3]  # 选取前三个存储到列表中
        classLabelVector.append(int(listFormLine[-1]))  # 选取最后一列元素存储到标签中
        index += 1
    return returnMat, classLabelVector

之后重新运行程序,继续进行交互体验

In[4]:import KNN1
In[5]:datingDataMat,datingLabels = KNN1.file2matrix('datingTestset2.txt')
In[6]:datingDataMat
Out[6]:array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
       		[1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
       		[2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
      		 ...,
      		[2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
      		[4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
       		[4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])
In[7]:datingLabels[0:20]
Out[7]:[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]

3.2、分析数据

在交互下按顺序输入以下代码,得到下面的图表。

import matplotlib
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:,1], datingDataMat[:,2])
plt.show()
ax.scatter(datingDataMat[:,1], datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
plt.show()

可视化图
可视化图二

3.3、归一化特征

由于数据的量纲不一,需要对数据归一化。

def autoNorm(dataSet):
    minVals = dataSet.min(0)  # 0表示从列选取
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDatsSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDatsSet = dataSet - tile(minVals, (m, 1))  # tile扩展矩阵的大小为m*1
    normDatsSet = normDatsSet/tile(ranges, (m, 1))
    return normDatsSet, ranges, minVals

交互检测:

import KNN1
datingMat, datingLabels = KNN1.file2matrix('datingTestSet2.txt')
normMat,ranges,minVals = KNN1.autoNorm(datingMat)
normMat
Out[6]: 
array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]])

3.4、分类器针对约会网站的测试

def datingClassTest():
    hoRatio = 0.10  # 测试集比例
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is %d" %(classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]) : errorCount += 1
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))

交互测试误差率(重新加载KNN1 python3中需要更改如下列代码所示)

import importlib
importlib.reload(KNN1)
KNN1.datingClassTest()
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 1
the total error rate is: 0.050000

3.5、约会网站预测函数

def classifyPerson():
    resultList = ['not at all', 'in small does', 'in large does']
    percentTats = float(input("percentage of time spent playing video game?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([percentTats, ffMiles, iceCream])
    classifierResult = classify0((inArr-minVals)/ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult-1])

交互测试

importlib.reload(KNN1)
Out[20]: <module 'KNN1' from 'C:\\Users\\xuning\\PycharmProjects\\machine learning\\KNN_machine\\KNN1.py'>
KNN1.classifyPerson()
percentage of time spent playing video game?>? 10
frequent flier miles earned per year?>? 1000
liters of ice cream consumed per year?>? 0.5
You will probably like this person: not at all

4、识别手写数字

4.1、将图像数据转换测试向量

def img2vector(img_file):
    returnVect = zeros((1, 1024))
    fr = open(img_file)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect

交互测试

importlib.reload(KNN1)
Out[22]: <module 'KNN1' from 'C:\\Users\\xuning\\PycharmProjects\\machine learning\\KNN_machine\\KNN1.py'>
testVector = KNN1.img2vector('testDigits/0_13.txt')
testVector
Out[26]: array([[0., 0., 0., ..., 0., 0., 0.]])
testVector[0, 0:31]
Out[27]: 
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
testVector[0, 32:63]
Out[28]: 
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

4.2、使用k-近邻算法识别手写数字

def handwritingClassTest():
    hwLabels = []  # 用于存储类代码的空列表
    trainingFileList = os.listdir('trainingDigits')  # 读取文件夹下的文件名字
    m = len(trainingFileList)  # 文件数量
    trainingMat = zeros((m, 1024))  # 空训练矩阵
    for i in range(m):  # 从文件名中解析分类数字9_45表示数字9的第45个实例
        fileNamestr = trainingFileList[i]
        fileStr = fileNamestr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNamestr)
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNamestr = testFileList[i]
        fileStr = fileNamestr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNamestr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is %d" % (classifierResult, classNumStr))
        if (classifierResult != classNumStr) : errorCount += 1
    print("\n the total number of errors is: %d" % errorCount)
    print("\n the total error rate is: %f" % (errorCount/float(mTest)))

交互测试

importlib.reload(KNN1)
Out[29]: <module 'KNN1' from 'C:\\Users\\xuning\\PycharmProjects\\machine learning\\KNN_machine\\KNN1.py'>
KNN1.handwritingClassTest()
.
.
.
`the classifier came back with: 9, the real answer is 9
 the total number of errors is: 10
 the total error rate is: 0.010571``

  • 1
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

NXU2023

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值