Machine Learning in Action: Implementing KNN

Article link: https://blog.csdn.net/weixin_43448905/article/details/107139854

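I recently started learning machine learning. I bought Machine Learning in Action and am reading it alongside Li Hang's Statistical Learning Methods to get up to speed quickly. To keep notes, I am starting this blog, and I am starting with KNN. Below I first introduce the principle and idea behind KNN, then give the code from Machine Learning in Action to go with it.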

1. Algorithm Principle

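The k-nearest-neighbor (k-NN) method is a basic method for classification and regression that has no explicit learning process. In effect, it uses the training dataset to partition the feature-vector space and uses that partition as its classification "model". It has three key elements: the choice of k, the distance metric, and the classification decision rule. A straightforward implementation is slow and computationally expensive, because it computes the distance from the query to every training sample; a kd-tree can be used to reduce the number of distance computations.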

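Algorithm steps:
Input: the training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$,
where $x_i$ is the feature vector of an instance and $y_i$ is its class, together with the feature vector $x$ of the instance to classify
Output: the class $y$ to which instance $x$ belongs
(1) Using the given distance metric, find the $k$ points in the training set $T$ nearest to $x$; the neighborhood of $x$ covering these $k$ points is denoted $N_k(x)$;
(2) Within $N_k(x)$, determine the class $y$ of $x$ according to the classification decision rule (e.g. majority voting).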
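
Written as formulas, the two steps look like this when the Euclidean distance is the metric (the one classify0 below computes) and majority voting is the decision rule. The symbols $n$ (number of features), $c_j$ (the $j$-th class), $K$ (number of classes) and the indicator function $I$ are notation introduced here for illustration; they are not defined elsewhere in this post.

$$
d(x, x_i) = \sqrt{\sum_{l=1}^{n} \left( x^{(l)} - x_i^{(l)} \right)^2}
$$

$$
y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad j = 1, 2, \ldots, K
$$

$I(y_i = c_j)$ is 1 when neighbor $x_i$ has class $c_j$ and 0 otherwise, so the sum counts the votes for each class. For example, with $k = 3$ and neighbor labels B, B, A, class B gets two votes and is chosen.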

2. k-Nearest-Neighbor Classifier

Import modules

from numpy import *
import operator  # operator module, used when sorting the votes

2.1 Creating the dataset

def creatDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  # training samples
    label = ['A', 'A', 'B', 'B']  # labels of the training samples
    return group, label

2.2 Implementing the k-nearest-neighbor algorithm

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples (4 for the toy dataset above)
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # repeat inX once per training sample, then subtract element-wise
    sqDiffMat = diffMat**2  # square each difference
    sqDistances = sqDiffMat.sum(axis=1)  # sum across each row
    distance = sqDistances**0.5  # square root gives the Euclidean distance
    sortDisIndicies = distance.argsort()  # indices that sort the distances in ascending order
    classCount = {}  # label -> number of votes
    for i in range(k):
        votelabel = labels[sortDisIndicies[i]]
        classCount[votelabel] = classCount.get(votelabel, 0) + 1  # get() returns 0 if the label is not in the dict yet
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # sort the (label, count) pairs by count, descending
    return sortedClassCount[0][0]
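
The same idea can be sketched more compactly with NumPy broadcasting and collections.Counter. This is only an illustrative variant, not the book's code: the name knn_classify is made up here, and it skips the square root because taking it does not change which neighbors are nearest.

```python
from collections import Counter

import numpy as np


def knn_classify(query, data, labels, k):
    """Classify `query` by majority vote among its k nearest rows of `data`."""
    data = np.asarray(data, dtype=float)
    # Broadcasting subtracts `query` from every row, so tile() is not needed.
    diff = data - np.asarray(query, dtype=float)
    # Squared Euclidean distance; the ordering is the same as with the root.
    sq_dist = (diff ** 2).sum(axis=1)
    nearest = np.argsort(sq_dist)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Same toy dataset as creatDataSet() above.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], group, labels, 3))
```

For the query [0, 0] the three nearest samples carry the labels B, B and A, so the vote goes to B.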

2.3 Trying it interactively

In[1]:import KNN1
In[2]:group,labels=KNN1.creatDataSet()
In[3]:KNN1.classify0([0,0], group, labels, 3)

Result:

Out[3]:'B'

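3. Improving Matches on a Dating Site

3.1 Parsing the text file into NumPy arrays

def file2matrix(file_name):
    with open(file_name) as fr:  # open the file; it is closed automatically
        arrayOfLines = fr.readlines()  # read all lines
    numberOfLines = len(arrayOfLines)  # number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # feature matrix, initialized to zeros
    classLabelVector = []  # list that will hold the class labels
    index = 0
    for line in arrayOfLines:
        line = line.strip()  # strip surrounding whitespace, including the newline
        listFormLine = line.split('\t')  # split the line on tabs into a list of fields
        returnMat[index, :] = listFormLine[0:3]  # the first three fields are the features
        classLabelVector.append(int(listFormLine[-1]))  # the last field is the label
        index += 1
    return returnMat, classLabelVector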
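
If every line really is three tab-separated numbers followed by an integer label, as the parser above expects, np.loadtxt can do the parsing. The sketch below rests on that assumption; load_dating_file is a hypothetical name, and the two sample rows are the first two rows shown in the session output below, written to a temporary file so the snippet runs on its own.

```python
import os
import tempfile

import numpy as np


def load_dating_file(file_name):
    """Hypothetical np.loadtxt-based counterpart of file2matrix."""
    raw = np.loadtxt(file_name, delimiter='\t', ndmin=2)
    features = raw[:, 0:3]                    # first three columns
    labels = raw[:, -1].astype(int).tolist()  # last column as a list of ints
    return features, labels


# Two sample rows, only to exercise the function.
sample = "40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

features, labels = load_dating_file(path)
os.remove(path)
print(features.shape, labels)
```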

Then rerun the program and continue in the interactive session.

In[4]:import KNN1
In[5]:datingDataMat,datingLabels = KNN1.file2matrix('datingTestSet2.txt')
In[6]:datingDataMat
Out[6]:array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
              [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
              [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
              ...,
              [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
              [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
              [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])
In[7]:datingLabels[0:20]
Out[7]:[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]

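3.2 Analyzing the data

Enter the following code in order in the interactive session to produce the plots below.

import matplotlib
import matplotlib.pyplot as plt
from numpy import array  # needed for array(datingLabels) below
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:,1], datingDataMat[:,2])
plt.show()
# third and fourth arguments: marker size and color, both scaled by the class label
ax.scatter(datingDataMat[:,1], datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
plt.show()

Figure 1: scatter plot of columns 1 and 2 of datingDataMat
Figure 2: the same scatter plot with marker size and color set by the class label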

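3.3 Normalizing the features

Because the features are measured on different scales, the data must be normalized.

def autoNorm(dataSet):
    minVals = dataSet.min(0)  # axis 0: minimum of each column
    maxVals = dataSet.max(0)  # maximum of each column
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))  # tile repeats minVals m times to match the shape of dataSet
    normDataSet = normDataSet/tile(ranges, (m, 1))  # element-wise division by the range of each column
    return normDataSet, ranges, minVals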
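
autoNorm applies newValue = (oldValue - min) / (max - min) to each column. To see why this matters for a distance-based classifier, the sketch below compares the distance between two rows before and after scaling. It uses three rows copied from the datingDataMat output above; min_max_normalize is a hypothetical broadcasting-based variant, and its handling of constant columns is an added assumption that autoNorm does not make.

```python
import numpy as np


def min_max_normalize(data):
    """Scale each column to [0, 1]; returns scaled data, ranges and minima."""
    data = np.asarray(data, dtype=float)
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Assumption: a constant column has range 0; divide by 1 instead so the
    # column becomes all zeros rather than triggering a division warning.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    return (data - min_vals) / safe_ranges, ranges, min_vals


# Three rows copied from the datingDataMat output above.
rows = np.array([[40920.0, 8.326976, 0.953952],
                 [14488.0, 7.153469, 1.673904],
                 [26052.0, 1.441871, 0.805124]])
raw_dist = np.sqrt(((rows[0] - rows[1]) ** 2).sum())
norm_rows, ranges, min_vals = min_max_normalize(rows)
norm_dist = np.sqrt(((norm_rows[0] - norm_rows[1]) ** 2).sum())
print(raw_dist)   # almost entirely the difference in the first column
print(norm_dist)  # all three columns contribute
```

Before scaling, the distance between the first two rows is practically the difference 40920 - 14488 = 26432 in the first column; the other two features barely register.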

Interactive check:

import KNN1
datingMat, datingLabels = KNN1.file2matrix('datingTestSet2.txt')
normMat,ranges,minVals = KNN1.autoNorm(datingMat)
normMat
Out[6]: 
array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]])

3.4 Testing the classifier on the dating-site data

def datingClassTest():
    hoRatio = 0.10  # fraction of the data held out as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is %d" %(classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]) : errorCount += 1
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))
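
datingClassTest prints the error rate. A variant that returns it makes it easier to compare different values of k or hoRatio. The following is a minimal sketch of that idea, not the book's code: nearest_label and holdout_error_rate are hypothetical names, and the data is a made-up pair of well-separated clusters, used only so the snippet runs without datingTestSet2.txt.

```python
import numpy as np


def nearest_label(query, data, labels, k):
    """Minimal k-NN vote, included only to keep this sketch self-contained."""
    order = np.argsort(((data - query) ** 2).sum(axis=1))[:k]
    votes = [labels[i] for i in order]
    return max(set(votes), key=votes.count)


def holdout_error_rate(data, labels, ho_ratio=0.10, k=3):
    """Test on the first ho_ratio of the rows, train on the remaining rows."""
    m = data.shape[0]
    num_test = int(m * ho_ratio)
    if num_test == 0:
        raise ValueError("ho_ratio is too small for this dataset")
    errors = 0
    for i in range(num_test):
        predicted = nearest_label(data[i], data[num_test:], labels[num_test:], k)
        errors += int(predicted != labels[i])
    return errors / num_test


# Made-up data: two clusters, with rows alternating between the classes.
rng = np.random.default_rng(0)
points = rng.normal(scale=0.1, size=(40, 2))
labels = [1 if i % 2 == 0 else 2 for i in range(40)]
points[1::2] += 5.0  # shift class 2 away from class 1
for k in (1, 3, 5):
    print(k, holdout_error_rate(points, labels, ho_ratio=0.10, k=k))
```

Like datingClassTest, this takes the first rows as the test set, so it assumes the rows are not sorted by class.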

Interactive test of the error rate (in Python 3, reloading KNN1 requires importlib.reload, as shown below):

import importlib
importlib.reload(KNN1)
KNN1.datingClassTest()
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 3
the classifier came back with: 1, the real answer is 1
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 3
the classifier came back with: 3, the real answer is 3
the classifier came back with: 2, the real answer is 2
the classifier came back with: 1, the real answer is 1
the classifier came back with: 3, the real answer is 1
the total error rate is: 0.050000

3.5 Prediction function for the dating site

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video game?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # the order must match the columns of datingTestSet2.txt: miles, game-time percentage, ice cream
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr-minVals)/ranges, normMat, datingLabels, 3)  # normalize the input with the training min and range
    print("You will probably like this person:", resultList[classifierResult-1])

Interactive test (this session was recorded with inArr built as [percentTats, ffMiles, iceCream], so with the column order used above the class printed for the same inputs may differ):

importlib.reload(KNN1)
Out[20]: <module 'KNN1' from 'C:\\Users\\xuning\\PycharmProjects\\machine learning\\KNN_machine\\KNN1.py'>
KNN1.classifyPerson()
percentage of time spent playing video game?>? 10
frequent flier miles earned per year?>? 1000
liters of ice cream consumed per year?>? 0.5
You will probably like this person: not at all

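4. Recognizing Handwritten Digits

4.1 Converting image data into a test vector

def img2vector(img_file):
    returnVect = zeros((1, 1024))  # a 32x32 image flattened into a 1x1024 vector
    with open(img_file) as fr:  # the file is closed automatically
        for i in range(32):
            lineStr = fr.readline()  # one row of the image
            for j in range(32):
                returnVect[0, 32*i+j] = int(lineStr[j])  # pixel (i, j) goes to position 32*i+j
    return returnVect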
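
The double loop can also be written as one list comprehension followed by a reshape. This is a sketch under the same assumption img2vector makes, namely that the file holds 32 lines of 32 digit characters. img2vector_np is a hypothetical name, and the test image (a diagonal of ones) is made up so the snippet runs without the digits dataset.

```python
import os
import tempfile

import numpy as np


def img2vector_np(img_file):
    """Hypothetical variant of img2vector: 32 lines of 32 digits -> (1, 1024)."""
    with open(img_file) as fr:
        rows = [[int(ch) for ch in fr.readline()[:32]] for _ in range(32)]
    return np.array(rows, dtype=float).reshape(1, 1024)


# Made-up test image: ones on the diagonal, zeros elsewhere.
lines = [''.join('1' if j == i else '0' for j in range(32)) for i in range(32)]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(lines) + '\n')
    path = tmp.name

vec = img2vector_np(path)
os.remove(path)
print(vec.shape, vec.sum())
```

Row i, column j of the image ends up at index 32*i+j of the vector, the same layout img2vector produces.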

Interactive test

importlib.reload(KNN1)
Out[22]: <module 'KNN1' from 'C:\\Users\\xuning\\PycharmProjects\\machine learning\\KNN_machine\\KNN1.py'>
testVector = KNN1.img2vector('testDigits/0_13.txt')
testVector
Out[26]: array([[0., 0., 0., ..., 0., 0., 0.]])
testVector[0, 0:31]
Out[27]: 
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
testVector[0, 32:63]
Out[28]: 
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

4.2 Recognizing handwritten digits with the k-nearest-neighbor algorithm

import os  # needed for os.listdir

def handwritingClassTest():
    hwLabels = []  # class labels of the training samples
    trainingFileList = os.listdir('trainingDigits')  # file names in the training directory
    m = len(trainingFileList)  # number of training files
    trainingMat = zeros((m, 1024))  # training matrix, one row per image
    for i in range(m):  # parse the digit from the file name: 9_45 is the 45th instance of digit 9
        fileNamestr = trainingFileList[i]
        fileStr = fileNamestr.split('.')[0]  # drop the .txt extension
        classNumStr = int(fileStr.split('_')[0])  # the part before the underscore is the digit
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNamestr)
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNamestr = testFileList[i]
        fileStr = fileNamestr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNamestr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1
    print("\n the total number of errors is: %d" % errorCount)
    print("\n the total error rate is: %f" % (errorCount/float(mTest)))
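
The label parsing appears twice in handwritingClassTest, so it could be pulled into a small helper. The sketch below assumes the naming convention described above (digit, underscore, instance number, .txt); label_from_filename is a hypothetical name.

```python
import os


def label_from_filename(file_name):
    """Return the digit encoded before the underscore, e.g. '9_45.txt' -> 9."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    return int(stem.split('_')[0])


print(label_from_filename('9_45.txt'))
print(label_from_filename('testDigits/0_13.txt'))
```

Because it takes the base name first, the helper also accepts a path that includes the directory.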

Interactive test

importlib.reload(KNN1)
Out[29]: <module 'KNN1' from 'C:\\Users\\xuning\\PycharmProjects\\machine learning\\KNN_machine\\KNN1.py'>
KNN1.handwritingClassTest()
.
.
.
the classifier came back with: 9, the real answer is 9
 the total number of errors is: 10
 the total error rate is: 0.010571