Machine learning: kNN (1)

solodom

Category: Personal growth. Tags: machine learning

Article link: https://blog.csdn.net/solodom/article/details/84885582

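This article walks through the k-nearest neighbors (kNN) algorithm: creating a test dataset, the steps of the algorithm, and its implementation. Worked examples show how kNN can improve matching on a dating site and how to build a handwritten digit recognition system. The article also gives the dataset location, the complete code, and the testing and normalization steps.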

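Creating the test dataset
The kNN algorithm
Implementing kNN
Testing the classifier
Improving matches on a dating site with kNN
Handwriting recognition system
- Preparing the data: converting images to test vectors
- Testing the algorithm: recognizing handwritten digits with kNN
Complete kNN code
Dataset location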

Creating the test dataset

import numpy as np  # NumPy, under its conventional alias np
import operator  # provides itemgetter, used later to sort the votes

def createDataSet():
    # a 2-D array needs nested brackets: one inner list per sample
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group,labels=createDataSet()

group

array([[1. , 1.1],
       [1. , 1. ],
       [0. , 0. ],
       [0. , 0.1]])

labels

['A', 'A', 'B', 'B']

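The kNN algorithm

(1) Compute the distance between the query point and every point in the labeled dataset (distance metric: Euclidean distance).
(2) Sort the points by increasing distance.
(3) Take the k points closest to the query point.
(4) Count how often each class occurs among those k points.
(5) Return the most frequent class as the predicted class of the query point (decision rule: majority vote).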
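
The five steps map closely onto a handful of NumPy calls. The sketch below is one possible compact version and not the implementation this article uses, which follows in the next section. The function name `knn_predict` and the use of `collections.Counter` for the vote are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(query, samples, labels, k):
    """Hypothetical helper: majority vote among the k nearest samples."""
    samples = np.asarray(samples, dtype=float)
    query = np.asarray(query, dtype=float)
    # (1) Euclidean distance from the query to every labeled sample
    distances = np.sqrt(((samples - query) ** 2).sum(axis=1))
    # (2) sample indices ordered by increasing distance
    order = np.argsort(distances)
    # (3) labels of the k closest samples
    nearest = [labels[i] for i in order[:k]]
    # (4) frequency of each class among those k labels
    votes = Counter(nearest)
    # (5) the most frequent class is the prediction
    return votes.most_common(1)[0][0]


samples = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ['A', 'A', 'B', 'B']
print(knn_predict([0.0, 0.0], samples, labels, k=3))  # B
```

Subtracting `query` from `samples` relies on NumPy broadcasting, so no explicit copy of the query point is needed.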

Implementing kNN

import numpy as np
import operator


def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # np.tile(A, reps) repeats array A according to reps; here inX is stacked
    # dataSetSize times so that it has the same shape as dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    # sum() with no axis adds up every element; axis=1 sums along each row,
    # giving one squared distance per sample
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # argsort() returns the indices that would sort the array in ascending
    # order (default axis=-1, the last axis; distances is 1-D here)
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        # dict.get returns the value for the key, or the default (0) if the key is absent
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sorted() returns a list; classCount.items() yields (key, value) pairs;
    # operator.itemgetter(1) sorts by the vote count; reverse=True sorts in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
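
The three NumPy calls that `classify0` relies on are easier to follow on tiny inputs. The arrays below are made-up values chosen only for illustration, not data from this article.

```python
import numpy as np

# np.tile repeats the input; reps=(3, 1) stacks three copies as rows
tiled = np.tile([1, 2], (3, 1))
print(tiled.tolist())            # [[1, 2], [1, 2], [1, 2]]

m = np.array([[1, 2], [3, 4]])
print(m.sum())                   # 10: no axis given, every element is added
print(m.sum(axis=0).tolist())    # [4, 6]: one sum per column
print(m.sum(axis=1).tolist())    # [3, 7]: one sum per row

# argsort returns the indices that would sort the values in ascending order
d = np.array([0.3, 0.1, 0.2])
print(d.argsort().tolist())      # [1, 2, 0]
```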

Testing the classifier

classify0([0,0],group,labels,3)

'B'

classify0([1.45,0.11],group,labels,3)

'A'
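
To see where these two answers come from, it helps to print the three nearest neighbors of each query point. The snippet below is a small check written for this purpose; it recomputes the distances directly instead of calling `classify0`, and the `results` dictionary is a name introduced here for illustration.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

results = {}
for query in ([0, 0], [1.45, 0.11]):
    # Euclidean distance from the query to each of the four samples
    distances = np.sqrt(((group - np.array(query)) ** 2).sum(axis=1))
    # indices of the three closest samples, nearest first
    nearest = distances.argsort()[:3]
    results[tuple(query)] = [labels[i] for i in nearest]
    print(query, np.round(distances, 3).tolist(), results[tuple(query)])
```

For [0,0] the three nearest samples carry the labels B, B, A, so the vote goes to 'B'. For [1.45,0.11] they carry A, A, B, so the vote goes to 'A'.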

Improving matches on a dating site with kNN

Preparing the data: building the dataset from a text file

DatingTestSet.txt is structured as follows:
40920	8.326976	0.953952	largeDoses
14488	7.153469	1.673904	smallDoses
26052	1.441871	0.805124	didntLike
75136	13.147394	0.428964	didntLike
...
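The first three fields are the frequent flyer miles earned per year, the percentage of time spent playing video games, and the liters of ice cream consumed per week. The fourth field is the class label.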
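
Before reading the whole file, the format can be checked on the four sample lines shown above. The sketch below feeds them from an in-memory string, assuming the fields are tab-separated; the names `sample`, `features`, and `text_labels` are introduced here for illustration only.

```python
import io

import numpy as np

# the four sample lines shown above, assumed to be tab-separated
sample = io.StringIO(
    "40920\t8.326976\t0.953952\tlargeDoses\n"
    "14488\t7.153469\t1.673904\tsmallDoses\n"
    "26052\t1.441871\t0.805124\tdidntLike\n"
    "75136\t13.147394\t0.428964\tdidntLike\n"
)

features, text_labels = [], []
for line in sample:
    fields = line.strip().split('\t')
    features.append([float(x) for x in fields[:3]])  # three numeric features
    text_labels.append(fields[3])                    # class label as text

features = np.array(features)
print(features.shape)   # (4, 3)
print(text_labels)      # ['largeDoses', 'smallDoses', 'didntLike', 'didntLike']
```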

import numpy as np
def file2matrix(filename):
    # map each text label to an integer class
    love_dictionary = {'largeDoses': 3, 'smallDoses': 2, 'didntLike': 1}
    # readlines() returns the list of lines; the with block closes the file afterwards
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)            # number of lines in the file
    # feature matrix to return: one row per line, three features per row
    returnMat = np.zeros((numberOfLines, 3))
    # label vector to return
    classLabelVector = []
    # index is the current row of the matrix
    index = 0
    for line in arrayOLines:
        # strip leading and trailing whitespace from the line
        line = line.strip()
        # split the line into a list of fields on tabs
        listFromLine = line.split('\t')
        # the first three fields are the features of this row
        returnMat[index, :] = listFromLine[0:3]
        # the last field is the label: either already an integer or a text label
        if listFromLine[-1].isdigit():
            classLabelVector.append(int(listFromLine[-1]))
        else:
            classLabelVector.append(love_dictionary.get(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

datingDataMat, datingLabels=file2matrix('D:/pythoncode/machine learning in action/DatingTestSet.txt')

print(datingDataMat)

[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
 [1.4488000e+04 7.1534690e+00 1.6739040e+00]
 [2.6052000e+04 1.4418710e+00 8.0512400e-01]
 ...
 [2.6575000e+04 1.0650102e+01 8.6662700e-01]
 [4.8111000e+04 9.1345280e+00 7.2804500e-01]
 [4.3757000e+04 7.8826010e+00 1.3324460e+00]]

print(datingLabels)

[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3, 2, 1, 3, 1, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2, 3, 3, 3, 1, 1, 1, 1, 2, 2, 1, 3, 2, 2, 2, 2, 3, 1, 2, 1, 2, 2, 2, 2, 2, 3, 2, 3, 1, 2, 3, 2, 2, 1, 3, 1, 1, 3, 3, 1, 2, 3, 1, 3, 1, 2, 2, 1, 1, 3, 3, 1, 2, 1, 3, 3, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3, 3, 1, 2, 3, 2, 1, 3, 1, 2, 1, 1, 2, 3, 2, 3, 2, 3, 2, 1, 3, 3, 3, 1, 3, 2, 2, 3, 1, 3, 3, 3, 1, 3, 1, 1, 3, 3, 2, 3, 3, 1, 2, 3, 2, 2, 3, 3, 3, 1, 2, 2, 1, 1, 3, 2, 3, 3, 1, 2, 1, 3, 1, 2, 3, 2, 3, 1, 1, 1, 3, 2, 3, 1, 3, 2, 1, 3, 2, 2, 3, 2, 3, 2, 1, 1, 3, 1, 3, 2, 2, 2, 3, 2, 2, 1, 2, 2, 3, 1, 3, 3, 2, 1, 1, 1, 2, 1, 3, 3, 3, 3, 2, 1, 1, 1, 2, 3, 2, 1, 3, 1, 3, 2, 2, 3, 1, 3, 1, 1, 2, 1, 2, 2, 1, 3, 1, 3, 2, 3, 1, 2, 3, 1, 1, 1, 1, 2, 3, 2, 2, 3, 1, 2, 1, 1, 1, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 3, 2, 3, 3, 3, 3, 1, 2, 3, 1, 1, 1, 3, 1, 3, 2, 2, 1, 3, 1, 3, 2, 2, 1, 2, 2, 3, 1, 3, 2, 1, 1, 3, 3, 2, 3, 3, 2, 3, 1, 3, 1, 3, 3, 1, 3, 2, 1, 3, 1, 3, 2, 1, 2, 2, 1, 3, 1, 1, 3, 3, 2, 2, 3, 1, 2, 3, 3, 2, 2, 1, 1, 1, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3, 3, 3, 2, 3, 2, 1, 1, 1, 1, 1, 3, 2, 2, 1, 2, 1, 3, 2, 1, 3, 2, 1, 3, 1, 1, 3, 3, 3, 3, 2, 1, 1, 2, 1, 3, 3, 2, 1, 2, 3, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 2, 3, 1, 1, 2, 3, 1, 3, 1, 1, 2, 2, 1, 2, 2, 2, 3, 1, 1, 1, 3, 1, 3, 1, 3, 3, 1, 1, 1, 3, 2, 3, 3, 2, 2, 1, 1, 1, 2, 1, 2, 2, 3, 3, 3, 1, 1, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 1, 2, 3, 2, 1, 1, 1, 1, 3, 3, 3, 3, 2, 1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 2, 3, 2, 1, 2, 2, 2, 3, 2, 1, 3, 2, 3, 2, 3, 2, 1, 1, 2, 3, 1, 3, 3, 3, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 1, 1, 3, 2, 3, 2, 3, 2, 3, 3, 2, 2, 1, 3, 1, 2, 1, 3, 1, 1, 1, 3, 1, 1, 3, 3, 2, 2, 1, 3, 1, 1, 3, 2, 3, 1, 1, 3, 1, 3, 3, 1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 1, 3, 1, 2, 1, 3, 1, 3, 1, 1, 2, 2, 2, 3, 2, 2, 1, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 1, 3, 2, 3, 2, 1, 2, 1, 1, 1, 2, 3, 2, 2, 1, 2, 2, 1, 3, 1, 3, 3, 3, 2, 2, 3, 3, 1, 2, 2, 2, 3, 1, 2, 1, 3, 1, 2, 
3, 1, 1, 1, 2, 2, 3, 1, 3, 1, 1, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 2, 2, 2, 3, 1, 3, 1, 2, 3, 2, 2, 3, 1, 2, 3, 2, 3, 1, 2, 2, 3, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 3, 2, 1, 3, 3, 3, 1, 1, 3, 1, 2, 3, 3, 2, 2, 2, 1, 2, 3, 2, 2, 3, 2, 2, 2, 3, 3, 2, 1, 3, 2, 1, 3, 3, 1, 2, 3, 2, 1, 3, 3, 3, 1, 2, 2, 2, 3, 2, 3, 3, 1, 2, 1, 1, 2, 1, 3, 1, 2, 2, 1, 3, 2, 1, 3, 3, 2, 2, 2, 1, 2, 2, 1, 3, 1, 3, 1, 3, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 1, 1, 3, 2, 2, 1, 3, 1, 2, 3, 1, 3, 1, 3, 1, 1, 3, 2, 3, 1, 1, 3, 3, 3, 3, 1, 3, 2, 2, 1, 1, 3, 3, 2, 2, 2, 1, 2, 1, 2, 1, 3, 2, 1, 2, 2, 3, 1, 2, 2, 2, 3, 2, 1, 2, 1, 2, 3, 3, 2, 3, 1, 1, 3, 3, 1, 2, 2, 2, 2, 2, 2, 1, 3, 3, 3, 3, 3, 1, 1, 3, 2, 1, 2, 1, 2, 2, 3, 2, 2, 2, 3, 1, 2, 1, 2, 2, 1, 1, 2, 3, 3, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 2, 3, 2, 3, 3, 2, 2, 1, 1, 1, 3, 3, 1, 1, 1, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, 2, 3, 2, 2, 1, 3, 1, 2, 3, 1, 2, 2, 2, 2, 3, 2, 3, 3, 1, 2, 1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 3, 3, 3]

Analyzing the data: creating a scatter plot with Matplotlib

import matplotlib
import matplotlib.pyplot as plt
# create a new figure
fig = plt.figure()
# fig.add_subplot(nrows, ncols, index) creates the axes;
# it does the same job as plt.subplot(nrows, ncols, index) on the current figure
ax = fig.add_subplot(111)
# scatter(x, y)
# plot the percentage of time spent playing video games against the liters of ice cream consumed per week
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
plt.show(