Machine Learning Experiment 1: Matching Results on a Dating Site with the kNN Algorithm

Contents

I. The kNN algorithm

  1.1 Overview of the kNN algorithm

  1.2 How the kNN algorithm works

    1.2.1 Choosing the value of k

    1.2.2 Computing distances between points

  1.3 General workflow of the k-nearest-neighbor algorithm

II. A kNN example

  2. Code walkthrough

    2.1 Preparing the data

      2.1.1 Importing the data

      2.1.2 Parsing the text file into NumPy arrays and displaying the data

    2.2 Analyzing the data: creating a scatter plot with Matplotlib

    2.3 Preparing the data: normalizing feature values

    2.4 Testing the algorithm: validating the classifier as a complete program

    2.5 Using the algorithm: building a complete usable system

    2.6 Results

III. Summary

  3. Experiment summary

    3.1 Errors encountered during the experiment

    3.2 Reflections on the experiment



I. The kNN algorithm

  1.1 Overview of the kNN algorithm

      The k-nearest-neighbor algorithm (K-Nearest Neighbor, hence kNN for short) is a classic machine learning algorithm. Its principle is very simple: given a new sample, kNN searches the existing data for the K samples most similar to it, that is, the K samples "closest" to it; if the majority of those K samples belong to a certain class, the new sample is assigned to that class as well.

 1.2 How the kNN algorithm works

      KNN stands for K Nearest Neighbors, and the name itself reveals much of how the algorithm works. With K nearest neighbors, the choice of K is obviously crucial; but what exactly is a "nearest neighbor"? The principle of kNN is that, when predicting a new point x, we decide which class x belongs to from the classes of the K points nearest to it, as shown in the figure below.

Figure 1

  This example shows how much the choice of K matters. The next two subsections discuss how to choose k and how to compute the distance between points.

       1.2.1 Choosing the value of k

     If k is too small, noise has a large influence on the prediction. For example, with k = 1, a single noisy nearest neighbor is enough to produce a wrong answer. A smaller k makes the overall model more complex and prone to overfitting (very high accuracy on the training set but low accuracy on the test set), ignoring the true distribution of the data.
     If k is too large, the prediction is made from a large neighborhood of training instances and the approximation error grows: instances far from the query point also influence the prediction and can make it wrong. A larger k makes the overall model simpler; in the extreme, with k = N (where N is the number of training samples), every query is simply assigned to the most frequent class in the training set, regardless of the input. That is equivalent to not training a model at all and just reporting the majority class of the training data.
     So k should be neither too large nor too small (in other words, choosing k comes down to experimentation and tuning). It is also advisable to use an odd k, so that the vote among the k neighbors produces a clear winner; an even k can produce ties, which complicates prediction.
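A toy illustration of the k = N extreme, added here for illustration (the labels are made up and not part of the original experiment): with k equal to the size of the training set, the vote always returns the majority class no matter what the query is.

from collections import Counter

labels = [1, 1, 1, 2, 2, 3]   # hypothetical training labels; class 1 is the majority
k = len(labels)               # k = N: every training point counts as a "neighbor"
# regardless of the query, the vote runs over all labels
prediction = Counter(labels).most_common(1)[0][0]
print(prediction)             # always 1, the majority class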

1.2.2 Computing distances between points

This experiment uses the Euclidean distance, that is, the distance between two points in n-dimensional space, a = (x11, x12, …, x1n) and b = (x21, x22, …, x2n) (two n-dimensional vectors):

d\left ( a,b \right )=\sqrt{\sum_{i=1}^{n}\left ( x_{1i}-x_{2i} \right )^{2}}
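The same formula in NumPy, as a quick sanity check (a minimal sketch with made-up values):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
d = np.sqrt(np.sum((a - b) ** 2))   # Euclidean distance between a and b
print(d)                            # 5.0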

 1.3 General workflow of the k-nearest-neighbor algorithm

  1. Compute the distance between every point in the labeled dataset and the query point;
  2. Sort the points by increasing distance;
  3. Select the k points closest to the query point;
  4. Count how often each class occurs among those k points;
  5. Return the most frequent class among those k points as the predicted class of the query point (see the sketch below)
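A compact NumPy sketch of these five steps (essentially what the classify0 function in Section 2.4 implements; knn_predict is a name chosen here for illustration):

import numpy as np
from collections import Counter

def knn_predict(query, data, labels, k):
    dists = np.sqrt(((data - query) ** 2).sum(axis=1))   # step 1: distances to every point
    nearest = dists.argsort()[:k]                        # steps 2-3: indices of the k closest points
    votes = Counter(labels[i] for i in nearest)          # step 4: class frequencies among the k
    return votes.most_common(1)[0][0]                    # step 5: most frequent class wins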

  

II. A kNN example

  2. Code walkthrough

       Helen has been using an online dating site to look for suitable dates. Although the site recommends various candidates, she does not like every one of them. After some reflection, she found that the people she has dated fall into three categories: people she did not like, people of average charm, and people of great charm.
     Each sample Helen collected has the following three features:

  1. Frequent-flyer miles earned per year
  2. Percentage of time spent playing video games
  3. Liters of ice cream consumed per week
    2.1 Preparing the data
           2.1.1 Importing the data

Helen has been collecting dating data for some time. She stores the data in the text file datingTestSet.txt; each sample occupies one line, and there are 1000 lines in total. Reference code for this experiment: https://github.com/Jack-Cherish/Machine-Learning/blob/master/kNN/2.%E6%B5%B7%E4%BC%A6%E7%BA%A6%E4%BC%9A/kNN_test02.py
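Each line of the file holds three tab-separated feature values followed by a class-label string. The first line looks roughly like this (values quoted from memory of the Machine Learning in Action dataset; your copy may differ):

40920	8.326976	0.953952	largeDoses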

           2.1.2 Parsing the text file into NumPy arrays and displaying the data
import numpy as np
# for data visualization
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import matplotlib as mpl
import operator

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)             # number of lines in the file
    returnMat = np.zeros((numberOfLines,3))      # NumPy matrix to return: one row per sample, 3 feature columns
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        # strip() removes the trailing newline; split('\t') breaks the line into a list of fields
        listFromLine = line.split('\t')
        # store the first three fields in the feature matrix
        returnMat[index,:] = listFromLine[0:3]
        # index -1 refers to the last field of the line: the class label
        if listFromLine[-1] == 'didntLike':      # label strings as they appear in the text file
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

if __name__ == '__main__':
    # name of the file to open
    filename = "datingTestSet.txt"
    # load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)
  • Result

Figure 2

 2.2 Analyzing the data: creating a scatter plot with Matplotlib
import numpy as np
# for data visualization
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import matplotlib as mpl
import operator

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)             # number of lines in the file
    returnMat = np.zeros((numberOfLines,3))      # NumPy matrix to return: one row per sample, 3 feature columns
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        # strip() removes the trailing newline; split('\t') breaks the line into a list of fields
        listFromLine = line.split('\t')
        # store the first three fields in the feature matrix
        returnMat[index,:] = listFromLine[0:3]
        # index -1 refers to the last field of the line: the class label
        if listFromLine[-1] == 'didntLike':      # label strings as they appear in the text file
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector


if __name__ == '__main__':
    # name of the file to open
    filename = "datingTestSet.txt"
    # load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)

    # data visualization
    fig = plt.figure()
    ax = fig.add_subplot(111)  # 111 means: split the canvas into 1 row and 1 column, drawing in cell 1 (left to right, top to bottom)
    ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])   # game-time percentage vs. ice cream consumed
    plt.show()
  • Result

Figure 3
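One optional refinement that is not part of the original listing: coloring each point by its class label makes the three categories visible at a glance. A sketch, reusing file2matrix and the imports from the listing above:

datingDataMat, datingLabels = file2matrix("datingTestSet.txt")
labels = np.array(datingLabels)
fig = plt.figure()
ax = fig.add_subplot(111)
# one color per class: 1 = didntLike, 2 = smallDoses, 3 = largeDoses
for label, color in zip([1, 2, 3], ['black', 'orange', 'red']):
    mask = (labels == label)
    ax.scatter(datingDataMat[mask, 1], datingDataMat[mask, 2], c=color, s=15)
plt.show()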

 2.3 Preparing the data: normalizing feature values
import numpy as np
# for data visualization
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import matplotlib as mpl
import operator

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)             # number of lines in the file
    returnMat = np.zeros((numberOfLines,3))      # NumPy matrix to return: one row per sample, 3 feature columns
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        # strip() removes the trailing newline; split('\t') breaks the line into a list of fields
        listFromLine = line.split('\t')
        # store the first three fields in the feature matrix
        returnMat[index,:] = listFromLine[0:3]
        # index -1 refers to the last field of the line: the class label
        if listFromLine[-1] == 'didntLike':      # label strings as they appear in the text file
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

# normalize feature values to the range [0, 1]
def autoNorm(dataSet):
    minVals = dataSet.min(0)         # column-wise minima
    maxVals = dataSet.max(0)         # column-wise maxima
    # subtract the minimum from each value, then divide by the range
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))     # element-wise division
    return normDataSet, ranges, minVals

if __name__ == '__main__':
    # name of the file to open
    filename = "datingTestSet.txt"
    # load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)

    # normalize the feature values
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    print(normDataSet)
    print(ranges)
    print(minVals)
  • Result

Figure 4
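The transformation implemented by autoNorm is newValue = (oldValue - min) / (max - min), which maps every feature into [0, 1] so that large-valued features such as flight miles no longer dominate the distance. A quick sanity check on a tiny made-up array (an illustrative sketch, not from the original post, reusing autoNorm and np from above):

tiny = np.array([[0.0, 10.0],
                 [5.0, 20.0],
                 [10.0, 30.0]])
normed, ranges, minVals = autoNorm(tiny)
print(normed)     # each column now runs from 0.0 to 1.0
print(ranges)     # [10. 20.]
print(minVals)    # [ 0. 10.]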

 2.4 Testing the algorithm: validating the classifier as a complete program
import numpy as np
# for data visualization
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import matplotlib as mpl
import operator

# kNN classifier: predict the class of inX from its k nearest neighbors in dataSet
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet      # distance computation: replicate inX and subtract
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5                            # Euclidean distance to every training point
    sortedDistIndices = distances.argsort()
    classCount = {}
    # vote among the k nearest points
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # sort the (label, count) pairs by count in descending order
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)             # number of lines in the file
    returnMat = np.zeros((numberOfLines,3))      # NumPy matrix to return: one row per sample, 3 feature columns
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        # strip() removes the trailing newline; split('\t') breaks the line into a list of fields
        listFromLine = line.split('\t')
        # store the first three fields in the feature matrix
        returnMat[index,:] = listFromLine[0:3]
        # index -1 refers to the last field of the line: the class label
        if listFromLine[-1] == 'didntLike':      # label strings as they appear in the text file
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

# normalize feature values to the range [0, 1]
def autoNorm(dataSet):
    minVals = dataSet.min(0)         # column-wise minima
    maxVals = dataSet.max(0)         # column-wise maxima
    # subtract the minimum from each value, then divide by the range
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))     # element-wise division
    return normDataSet, ranges, minVals


# test the algorithm: validate the classifier as a complete program
def datingClassTest():
    filename = "datingTestSet.txt"
    datingDataMat, datingLabels = file2matrix(filename)
    hoRatio = 0.10       # hold out 10% of the data for testing
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 4)
        print("the classifier came back with: %d,the real answer is :%d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

if __name__ == '__main__':
    # name of the file to open
    filename = "datingTestSet.txt"
    # load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)

    # normalize the feature values
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    print(normDataSet)
    print(ranges)
    print(minVals)

    # test the algorithm
    datingClassTest()

  • Result
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :1
the total error rate is: 0.040000

Process finished with exit code 0

     The error rate is 4%. Changing hoRatio and the value of k changes the error rate; a parameter-sweep sketch follows.
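A sketch of such a sweep (sweep_k is a name chosen here for illustration; it assumes the file2matrix, autoNorm, and classify0 functions defined above): it repeats the hold-out test for several odd values of k and prints each error rate.

def sweep_k(filename="datingTestSet.txt", hoRatio=0.10, k_values=(1, 3, 5, 7, 9)):
    datingDataMat, datingLabels = file2matrix(filename)
    normMat, ranges, minVals = autoNorm(datingDataMat)
    numTestVecs = int(normMat.shape[0] * hoRatio)
    for k in k_values:
        # count misclassified hold-out samples for this k
        errors = sum(
            classify0(normMat[i, :], normMat[numTestVecs:, :],
                      datingLabels[numTestVecs:], k) != datingLabels[i]
            for i in range(numTestVecs))
        print("k = %d, error rate: %.2f%%" % (k, 100.0 * errors / numTestVecs))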

2.5 Using the algorithm: building a complete usable system

def classifyPerson():
    resultList = ['dislike', 'somewhat like', 'really like']
    precentTats = float(input("Percentage of time spent playing video games: "))
    ffMiles = float(input("Frequent-flyer miles earned per year: "))
    iceCream = float(input("Liters of ice cream consumed per week: "))
    filename = "datingTestSet.txt"
    datingDataMat, datingLabels = file2matrix(filename)
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # the feature order must match the data file's columns: miles, game-time percentage, ice cream
    inArr = np.array([ffMiles, precentTats, iceCream])
    # normalize the query with the training minima and ranges
    norminArr = (inArr - minVals) / ranges
    classifierResult = classify0(norminArr, normMat, datingLabels, 3)
    print("You will probably %s this person" % (resultList[classifierResult - 1]))
2.6 Results

Figure 5

III. Summary

  3. Experiment summary

 3.1 Errors encountered during the experiment

Error 1:

FileNotFoundError: [Errno 2] No such file or directory: 'datingTestSet.txt'
datingDataMat,datingLabels = kNN.file2matrix('datingTestSet.txt')

Fix: datingTestSet.txt must be in the same folder as the .py file.

Error 2:

ValueError: invalid literal for int() with base 10: 'largeDoses'

Fix: datingTestSet.txt should be used here, because the last column of datingTestSet2.txt holds numeric labels, while the last column of datingTestSet.txt holds label strings, which is what file2matrix above expects.

Error 3:

AttributeError: 'dict' object has no attribute 'iteritems'

Fix: dict.iteritems exists in Python 2.x but was removed in Python 3.x; replace it with items.

 3.2 Reflections on the experiment

    The k-nearest-neighbor (kNN) algorithm is a simple but effective machine learning algorithm, commonly used for classification and regression. The basic idea is to compute the distance between the test sample and every training sample, take the labels of the k closest samples, and return the most frequent label among them as the prediction.

  • Advantages of kNN:

     1. Simple and easy to use: compared with other algorithms, kNN is concise and intuitive, and its principle can be understood without an advanced mathematical background.
     2. Fast model "training": as noted above, kNN is a lazy learner, so there is essentially no training phase to speak of.
     3. Good prediction performance.
     4. Insensitive to outliers.

  • Disadvantages of kNN:

    1. High memory requirements, because the algorithm stores the entire training set.
    2. The prediction phase can be slow.
    3. Sensitive to irrelevant features and to the scale of the data.

   In this experiment, we studied and practiced the application of the kNN algorithm to dating-site data. Through the steps of data preparation, analysis, and preprocessing, we effectively solved several common data-handling problems. During the experiment, I came to appreciate how strongly data quality and feature selection affect algorithm performance.
