Machine Learning Experiment 1: The kNN Algorithm and Dating-Site Match Results

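KNN stands for K Nearest Neighbors. The name already hints at how the algorithm works: it relies on the K closest neighbors, so the choice of K is critical. What does "nearest neighbor" mean? To predict the class of a new point x, KNN looks at the classes of the K points closest to x and assigns x to the class that is most common among them, as shown in the figure below.

Figure 1

This example shows how much the choice of K matters. The following sections cover how to choose k and how to compute the distance between points.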

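1.2.1 Choosing k

If k is too small, noise in the data has a large effect on the prediction. With k = 1, for example, the prediction is wrong whenever the single nearest point happens to be noise. Decreasing k makes the overall model more complex and prone to overfitting (very high accuracy on the training set, low accuracy on the test set), because it ignores the true distribution of the data.
If k is too large, the prediction is made from training instances in a larger neighborhood, so the approximation error of learning increases: instances far from the input point also take part in the vote and can make the prediction wrong. Increasing k makes the overall model simpler. In the extreme case k = N (N being the number of training samples), every input is assigned to the most frequent class in the training set, whatever the input is. At that point no model has really been trained; the class counts of the training data are tallied and the largest class is returned.
So k should be neither too large nor too small; in practice it is chosen by experiment. An odd k is preferred because it rules out tied votes when there are two classes. With three or more classes (this experiment has three) a tie can still occur even when k is odd.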
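The effect of k described above can be tried on a tiny example. The sketch below is only an illustration under stated assumptions: the 1-D data, the mislabeled point, and the helper name knn_predict are made up for this purpose and do not come from the dating data.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query (1-D toy case)."""
    # train is a list of (value, label) pairs; order them by distance to the query
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: class "A" clusters near 1 and class "B" near 4,
# and the point at 4.0 is a mislabeled (noisy) "A" inside the "B" cluster.
train = [(1.0, "A"), (1.2, "A"), (1.4, "A"), (1.6, "A"),
         (4.0, "A"), (3.8, "B"), (4.4, "B")]

query = 4.05
pred_k1 = knn_predict(train, query, k=1)           # follows the single noisy neighbor
pred_k3 = knn_predict(train, query, k=3)           # the two "B" neighbors outvote the noise
pred_kN = knn_predict(train, query, k=len(train))  # always the overall majority class
print(pred_k1, pred_k3, pred_kN)                   # A B A
```

With k = 1 the noisy point decides the result, with k = 3 it is outvoted, and with k = N the query no longer matters because the majority class of the whole training set always wins.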

1.2.2 Computing the distance between points

This experiment uses the Euclidean distance. For two points in n-dimensional space, x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n), it is defined as:

$d(x,y)=\sqrt{\sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}}$
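As a quick check of the formula, here is a minimal sketch in plain Python. The function name euclidean_distance and the sample points are chosen for illustration only.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("both points must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d2 = euclidean_distance((0, 0), (3, 4))        # sqrt(9 + 16) = 5.0
d3 = euclidean_distance((1, 2, 3), (4, 6, 3))  # sqrt(9 + 16 + 0) = 5.0
print(d2, d3)
```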

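1.3 General workflow of the k-nearest-neighbors algorithm

1. Compute the distance between the current point and every point in the labeled data set.
2. Sort the points by increasing distance.
3. Select the k points closest to the current point.
4. Count how often each class occurs among these k points.
5. Return the most frequent class among the k points as the predicted class of the current point.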
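The steps above can be sketched in a few lines of plain Python. This is an illustrative version only: the function name classify and the four sample points are assumptions made for the example, and section 2.4 uses a NumPy-based function instead.

```python
import math
from collections import Counter

def classify(query, points, labels, k):
    # Step 1: distance from the query to every labeled point
    distances = [
        math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        for point in points
    ]
    # Step 2: point indices ordered by increasing distance
    order = sorted(range(len(points)), key=lambda i: distances[i])
    # Step 3: keep the k closest points
    nearest = order[:k]
    # Step 4: count how often each class occurs among them
    votes = Counter(labels[i] for i in nearest)
    # Step 5: the most frequent class is the prediction
    return votes.most_common(1)[0][0]

# Made-up sample: two points of class "A" near (1, 1), two of class "B" near (0, 0)
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(classify((0.0, 0.0), points, labels, k=3))   # B
print(classify((1.0, 1.2), points, labels, k=3))   # A
```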

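II. kNN example

2. Code walkthrough

Helen has been using an online dating site to look for a suitable partner. Although the site recommends different candidates, she does not like everyone it suggests. Looking back, she found that the people she has dated fall into three categories: people she did not like, people of average charm, and people of great charm.
The sample data Helen collected contains the following three features:

Frequent flyer miles earned per year
Percentage of time spent playing video games
Liters of ice cream consumed per week

2.1 Preparing the data

2.1.1 Importing the data

Helen has been collecting dating data for some time and stores it in the text file datingTestSet.txt, one sample per line, 1000 lines in total. The link below points to the accompanying script kNN_test02.py in the source repository: https://github.com/Jack-Cherish/Machine-Learning/blob/master/kNN/2.%E6%B5%B7%E4%BC%A6%E7%BA%A6%E4%BC%9A/kNN_test02.py

2.1.2 Parsing the text records into a NumPy matrix and displaying the data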

import numpy as np

def file2matrix(filename):
    # read all lines; the with block closes the file afterwards
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)             # number of lines in the file
    returnMat = np.zeros((numberOfLines, 3))     # NumPy feature matrix to return
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        # strip() removes the trailing newline and surrounding whitespace
        line = line.strip()
        # split the line on tab characters into a list of fields
        listFromLine = line.split('\t')
        # the first three fields are the features
        returnMat[index, :] = listFromLine[0:3]
        # index -1 selects the last field, which holds the label text
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

if __name__ == '__main__':
    # name of the data file
    filename = "datingTestSet.txt"
    # load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)

Output

Figure 2

2.2 Analyzing the data: creating a scatter plot with Matplotlib

import matplotlib.pyplot as plt

# file2matrix() is the parser defined in section 2.1.2 and is reused unchanged.

if __name__ == '__main__':
    # name of the data file
    filename = "datingTestSet.txt"
    # load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)

    # visualize the data
    fig = plt.figure()
    # 111 splits the canvas into 1 row and 1 column and draws in the first cell
    ax = fig.add_subplot(111)
    # column 1: percentage of time spent on video games; column 2: liters of ice cream
    ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
    plt.show()

Output

Figure 3

2.3 Preparing the data: normalizing the values

import numpy as np

# file2matrix() is the parser defined in section 2.1.2 and is reused unchanged.

# normalize the feature values
def autoNorm(dataSet):
    minVals = dataSet.min(0)         # minimum of each column
    maxVals = dataSet.max(0)         # maximum of each column
    ranges = maxVals - minVals       # value range of each column
    m = dataSet.shape[0]
    # subtract the minimum from each value, then divide by the range
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))     # element-wise division
    return normDataSet, ranges, minVals

if __name__ == '__main__':
    # name of the data file
    filename = "datingTestSet.txt"
    # load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)

    # normalize the feature values
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    print(normDataSet)
    print(ranges)
    print(minVals)

Output

Figure 4
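The normalization in autoNorm maps every feature to the range 0 to 1 with (value - min) / (max - min), so that the large mileage values do not dominate the distance. The sketch below shows the same computation with NumPy broadcasting instead of np.tile. The function name min_max_normalize and the three sample rows are made up for illustration; they are not rows from datingTestSet.txt.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column to [0, 1] using (value - min) / (max - min)."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Broadcasting stretches the 1-D min_vals and ranges across all rows,
    # so no explicit np.tile copy is needed.
    return (data - min_vals) / ranges, ranges, min_vals

# Made-up rows in the column order: miles, game-time percentage, liters of ice cream
sample = np.array([[40000.0, 8.0, 1.0],
                   [10000.0, 2.0, 0.5],
                   [70000.0, 5.0, 1.5]])

norm, ranges, min_vals = min_max_normalize(sample)
print(norm)

# Same result as the np.tile formulation used in autoNorm
m = sample.shape[0]
tiled = (sample - np.tile(min_vals, (m, 1))) / np.tile(ranges, (m, 1))
print(np.allclose(norm, tiled))   # True
```

Both versions divide by the column range, so a column whose values are all identical (range 0) would need separate handling.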

2.4 Testing the algorithm: validating the classifier as a complete program

import operator
import numpy as np

# file2matrix() and autoNorm() are the functions defined in sections 2.1.2 and 2.3
# and are reused unchanged.

# kNN classifier: predict the class of inX from its k nearest neighbors in dataSet
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # indices that sort the distances in increasing order
    sortedDistIndices = distances.argsort()
    classCount = {}
    # count the labels of the k closest points
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# test the algorithm: validate the classifier as a complete program
def datingClassTest():
    filename = "datingTestSet.txt"
    datingDataMat, datingLabels = file2matrix(filename)
    hoRatio = 0.10                     # fraction of the data held out for testing
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)     # number of test samples
    errorCount = 0.0
    for i in range(numTestVecs):
        # the first numTestVecs rows are the test set, the remaining rows the training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 4)
        print("the classifier came back with: %d,the real answer is :%d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    # error rate as a fraction of the test samples
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

if __name__ == '__main__':
    # name of the data file
    filename = "datingTestSet.txt"
    # load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)

    # normalize the feature values
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    print(normDataSet)
    print(ranges)
    print(minVals)

    # test the algorithm
    datingClassTest()

Output

the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :1
the total error rate is: 0.040000


The run reports an error rate of 4%. Changing hoRatio and k changes the error rate.
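One way to see how the error rate depends on k is to repeat the hold-out test for several values. The sketch below is a self-contained illustration under stated assumptions: it uses synthetic two-cluster data rather than datingTestSet.txt, and the names squared_distance, knn_predict, and holdout_error are made up for the example. As in datingClassTest, the first share of rows is the test set and the rest is the training set.

```python
import random
from collections import Counter

def squared_distance(p, q):
    # The square root is omitted because it does not change the neighbor order
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(query, points, labels, k):
    order = sorted(range(len(points)), key=lambda i: squared_distance(query, points[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def holdout_error(points, labels, ho_ratio, k):
    """Error rate when the first ho_ratio share of the rows is the test set."""
    n_test = int(len(points) * ho_ratio)
    if n_test == 0:
        raise ValueError("ho_ratio is too small: the test set would be empty")
    train_points, train_labels = points[n_test:], labels[n_test:]
    errors = sum(
        knn_predict(points[i], train_points, train_labels, k) != labels[i]
        for i in range(n_test)
    )
    return errors / n_test

# Synthetic stand-in data (NOT the dating data): two noisy 2-D clusters
random.seed(0)
points, labels = [], []
for _ in range(200):
    label = random.choice([1, 2])
    center = 0.3 if label == 1 else 0.7
    points.append((random.gauss(center, 0.15), random.gauss(center, 0.15)))
    labels.append(label)

error_by_k = {k: holdout_error(points, labels, ho_ratio=0.10, k=k) for k in (1, 3, 5, 7)}
for k, err in error_by_k.items():
    print("k=%d error rate=%f" % (k, err))
```

The numbers printed here describe only the synthetic data. For the dating data, the same loop would call datingClassTest-style code with different values of k and hoRatio.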

2.5 Using the algorithm: building a complete, usable system

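def classifyPerson():
    # result texts, indexed by class number minus 1
    resultList = ['dislike', 'somewhat like', 'like very much']
    precentTats = float(input("Percentage of time spent playing video games: "))
    ffMiles = float(input("Frequent flyer miles earned per year: "))
    iceCream = float(input("Liters of ice cream consumed per week: "))
    filename = "datingTestSet.txt"
    datingDataMat, datingLabels = file2matrix(filename)
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # the order must match the feature columns: miles, game time, ice cream
    inArr = np.array([ffMiles, precentTats, iceCream])
    # normalize the input with the minimum and range of the training data
    norminArr = (inArr - minVals) / ranges
    classifierResult = classify0(norminArr, normMat, datingLabels, 3)
    print("You will probably %s this person" % (resultList[classifierResult - 1]))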

2.6 Program output

Figure 5

III. Summary

3. Experiment summary

3.1 Errors encountered during the experiment

Error 1:

FileNotFoundError: [Errno 2] No such file or directory: 'datingTestSet.txt'

datingDataMat,datingLabels = kNN.file2matrix('datingTestSet.txt')

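Solution: the relative path is resolved against the directory the script is run from, so datingTestSet.txt must be there. Placing it in the same folder as the .py file and running the script from that folder works.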

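Error 2:

ValueError: invalid literal for int() with base 10: 'largeDoses'

Solution: this error appears when the last column of datingTestSet.txt is passed to int(). In datingTestSet.txt the last column is a string, while in datingTestSet2.txt it is a number. When reading datingTestSet.txt, compare the label strings and map them to integers, as file2matrix does above.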
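The difference can be reproduced on a single line of input. In the sketch below, the sample line is an illustrative one in the same tab-separated format (not copied from the data file), and the dictionary label_to_int is an assumed alternative to the if/elif chain in file2matrix.

```python
# Mapping from label text to class number, mirroring the if/elif chain in file2matrix
label_to_int = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}

# Illustrative line in the format of datingTestSet.txt: three numbers and a text label
line = "12345\t6.5\t0.8\tlargeDoses\n"
fields = line.strip().split('\t')

try:
    int(fields[-1])            # what a parser written for numeric labels would do
    int_failed = False
except ValueError as exc:
    int_failed = True
    print("int() failed:", exc)

features = [float(value) for value in fields[:3]]
label = label_to_int[fields[-1]]
print(features, label)         # [12345.0, 6.5, 0.8] 3
```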

Error 3:

AttributeError: 'dict' object has no attribute 'iteritems'

Solution: Python 2.x dictionaries have iteritems, but Python 3.x removed it. Replace it with items.
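A minimal sketch of the Python 3 form, using the same sorting pattern as the vote count in classify0. The dictionary contents are hypothetical vote counts chosen for the example.

```python
import operator

class_count = {1: 2, 3: 5, 2: 1}   # hypothetical vote counts per class label

# Python 3: dict.items() returns a view of (key, value) pairs;
# dict.iteritems() existed only in Python 2.
sorted_counts = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_counts)        # [(3, 5), (1, 2), (2, 1)]
print(sorted_counts[0][0])  # 3

# Equivalent shortcut: pick the key with the largest count directly
best = max(class_count, key=class_count.get)
print(best)                 # 3
```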

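3.2 Reflections

The k-nearest-neighbors (kNN) algorithm is a simple yet effective machine learning algorithm, commonly used for classification and regression. The basic idea is to compute the distance between the test sample and the training samples, take the labels of the k closest samples, and use the most frequent label as the prediction for the test sample.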

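Advantages of kNN:

1. Simple and easy to use. Compared with other algorithms, kNN is concise and transparent, and its principle can be understood without a strong mathematical background.
2. Fast to train. kNN is a lazy learner: it has no explicit training phase and simply stores the training data.
3. Good predictive performance.
4. Insensitive to outliers, provided k is not too small.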

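Disadvantages of kNN:

1. High memory requirements, because the algorithm stores all the training data.
2. The prediction phase can be slow, because each query is compared against every stored sample.
3. Sensitive to irrelevant features and to the scale of the data.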

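In this experiment we studied the kNN algorithm and applied it to the dating-site data. Through data preparation, analysis, and preprocessing we handled several common data-processing problems. The experiment made clear how strongly data quality and feature selection affect the algorithm's performance.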