Machine Learning: Dating Match with kNN

Latest recommended article published 2023-04-01 23:13:50

ZZQHLA


Tags: python, algorithms, nearest-neighbor algorithm, machine learning

Article link: https://blog.csdn.net/qq_63128704/article/details/124641250


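Below is part of the sample set. Each row is tab-separated: miles flown per year, percentage of time spent playing games, liters of ice cream consumed per year, and a label (1 = not a match, 2 = average, 3 = a very good match):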

40920	8.326976	0.953952	3
14488	7.153469	1.673904	2
26052	1.441871	0.805124	1
75136	13.147394	0.428964	1
38344	1.669788	0.134296	1
72993	10.141740	1.032955	1
35948	6.830792	1.213192	3
42666	13.276369	0.543880	3
67497	8.631577	0.749278	1
35483	12.273169	1.508053	3
50242	3.723498	0.831917	1
63275	8.385879	1.669485	1
5569	4.875435	0.728658	2
51052	4.680098	0.625224	1
77372	15.299570	0.331351	1
43673	1.889461	0.191283	1
61364	7.516754	1.269164	1
69673	14.239195	0.261333	1
15669	0.000000	1.250185	2
28488	10.528555	1.304844	3
6487	3.540265	0.822483	2
37708	2.991551	0.833920	1
22620	5.297865	0.638306	2
28782	6.593803	0.187108	3
19739	2.816760	1.686209	2
36788	12.458258	0.649617	3
5741	0.000000	1.656418	2
28567	9.968648	0.731232	3
6808	1.364838	0.640103	2
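The first column (miles) is several orders of magnitude larger than the other two, so an unscaled Euclidean distance is driven almost entirely by it. The following is a minimal sketch of that effect using the first three rows shown above. The helper name `euclidean` and the variable names are illustrative, and the min-max scaling here is computed over these three rows only, not over the full data file.

```python
import numpy as np

# First three rows of the sample above: miles, game-time %, ice cream liters.
rows = np.array([
    [40920, 8.326976, 0.953952],
    [14488, 7.153469, 1.673904],
    [26052, 1.441871, 0.805124],
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(((a - b) ** 2).sum()))

# Raw distance between the first two rows, and the gap in miles alone.
raw_dist = euclidean(rows[0], rows[1])
miles_gap = abs(rows[0, 0] - rows[1, 0])

# Min-max scaling over these three rows only (illustrative).
mins = rows.min(axis=0)
ranges = rows.max(axis=0) - mins
scaled = (rows - mins) / ranges
scaled_dist = euclidean(scaled[0], scaled[1])

print(f"raw distance:    {raw_dist:.3f}")
print(f"miles gap alone: {miles_gap:.3f}")
print(f"scaled distance: {scaled_dist:.3f}")
```

Before scaling, the raw distance is practically the same number as the miles gap; after scaling, all three features contribute on a comparable footing.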

Below is the core kNN code (saved as kNN.py, since the later script imports it with from kNN import *):

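from numpy import array, shape, tile, zeros  # array is re-exported for scripts that use "from kNN import *"
import operator

def classify0(inX, dataSet, labels, k):
    # inX: input vector; dataSet: training samples; labels: label vector; k: number of nearest neighbors.
    # The shape comments below assume dataSet is 4x2 and inX has 2 elements.
    # Distance calculation
    dataSetSize = dataSet.shape[0]  # number of rows (4)
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # tile repeats inX into a 4x2 array; subtract coordinates
    sqDiffMat = diffMat ** 2  # square each difference
    sqDistances = sqDiffMat.sum(axis=1)  # sum across each row: a length-4 array of squared distances
    distances = sqDistances ** 0.5  # square root gives the Euclidean distances
    # Select the k closest points
    sortedDistIndicies = distances.argsort()  # indices that sort the distances in ascending order
    classCount = {}
    for i in range(k):  # count how often each label occurs among the k nearest
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort labels by vote count, descending, and return the most frequent one
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]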
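# Parse the text records into a NumPy matrix
def file2matrix(filename):
    with open(filename) as fr:  # open the file and close it automatically
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)  # number of lines, say 1000
    returnMat = zeros((numberOfLines, 3))  # zero-filled 2-D array, 1000x3
    classLabelVector = []  # labels (3, 2, 1 = like, average, dislike); ends up with 1000 entries
    index = 0
    for line in arrayOLines:  # iterate over every line
        line = line.strip()  # strip leading/trailing whitespace, including the newline
        listFromLine = line.split('\t')  # split on tabs into a list of strings
        returnMat[index, :] = [float(x) for x in listFromLine[0:3]]  # first three fields fill row index of the feature matrix
        classLabelVector.append(int(listFromLine[-1]))  # last field, as an int, is the label
        index += 1  # move to the next row
    return returnMat, classLabelVector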
# Normalize feature values
def autoNorm(dataSet):  # assume 1000x3
    minVals = dataSet.min(0)  # axis 0: minimum of each column (axis 1 would be per row); 3 values
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals  # per-column range, 3 values
    m = dataSet.shape[0]  # shape[0] is the number of rows, 1000
    normDataSet = dataSet - tile(minVals, (m, 1))  # tile repeats minVals into a 1000x3 array; subtract the minimums
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise division by the 1000x3 tiled ranges
    return normDataSet, ranges, minVals
# Test the classifier
def datingClassTest():
    hoRatio = 0.10  # hold-out ratio: fraction of rows used for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)  # number of test rows
    errorCount = 0.0
    for i in range(numTestVecs):
        # the first numTestVecs rows are the test set; the remaining rows are the training samples
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0  # count a mismatch as an error
    print("the total error rate is: %.2f" % (errorCount / float(numTestVecs)))
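For comparison, the same distance-then-vote procedure can be written more compactly with NumPy broadcasting and `collections.Counter`. This is an alternative sketch, not the code used in this post: the function name `knn_predict` and the toy data are made up for illustration and do not come from datingTestSet2.txt.

```python
from collections import Counter

import numpy as np

def knn_predict(query, train_x, train_y, k=3):
    """Return the majority label among the k training rows nearest to query."""
    # Broadcasting subtracts query from every row, so tile() is not needed.
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Count the labels of those neighbors and take the most frequent one.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration.
train_x = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = ["A", "A", "B", "B"]

near_origin = knn_predict(np.array([0.1, 0.2]), train_x, train_y, k=3)
near_ones = knn_predict(np.array([0.9, 1.0]), train_x, train_y, k=3)
print(near_origin, near_ones)
```

With k=3, the point near the origin has two "B" neighbors and one "A" neighbor, and the point near (1, 1) has two "A" neighbors and one "B" neighbor, so the majority vote decides each case.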

datingClassTest() is the classifier test function and can be run at any time (partial output):

the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.05
>>>

The matching code is as follows:

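from kNN import *
from numpy import array  # imported explicitly rather than relying on kNN's namespace

# Matching function
def classifyPerson():
    resultList = ['not a match', 'average', 'a very good match']
    percentTats = float(input("Percentage of time spent playing games per year: "))  # input() returns a string, so convert to float
    ffMiles = float(input("Miles flown per year: "))
    iceCream = float(input("Liters of ice cream consumed per year: "))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # sample data
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize
    inArr = array([ffMiles, percentTats, iceCream])  # same column order as the data file
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)  # scale the input the same way as the samples
    print("Result: %s" % resultList[classifierResult - 1])  # map the numeric label to its description

Running it gives:

>>> classifyPerson()
Percentage of time spent playing games per year: 13
Miles flown per year: 123
Liters of ice cream consumed per year: 2
Result: a very good match
>>> classifyPerson()
Percentage of time spent playing games per year: 1
Miles flown per year: 800
Liters of ice cream consumed per year: 5
Result: average
>>> classifyPerson()
Percentage of time spent playing games per year: 99
Miles flown per year: 1
Liters of ice cream consumed per year: 0
Result: not a match
>>>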
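Note that the query vector is scaled with the minimum and range of the sample data, not with statistics of its own. The sketch below illustrates this with five rows taken from the sample excerpt at the top of the post. The names `sample`, `mins`, `ranges`, and `query` are illustrative, and because only five rows are used, the minimum and range here will likely differ from those of the full datingTestSet2.txt.

```python
import numpy as np

# Five rows from the sample excerpt: miles, game-time %, ice cream liters.
sample = np.array([
    [40920, 8.326976, 0.953952],
    [14488, 7.153469, 1.673904],
    [26052, 1.441871, 0.805124],
    [75136, 13.147394, 0.428964],
    [5569, 4.875435, 0.728658],
])

# Per-column minimum and range of the sample data.
mins = sample.min(axis=0)
ranges = sample.max(axis=0) - mins

# Inputs from the first session above, in data-file column order.
query = np.array([123.0, 13.0, 2.0])
scaled_query = (query - mins) / ranges

print(scaled_query)
```

Values that fall outside the range seen in the samples map outside [0, 1]. The classifier still returns a label for such inputs, but it is extrapolating beyond the data it was given.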

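kNN is one of the simplest machine learning algorithms and can be quite accurate, as the 0.05 error rate in the test above suggests. Its main limits are time and space complexity: the whole sample set has to be kept in memory, and every prediction computes the distance from the input to every sample, so computation can take a long time and memory use is relatively high. It is nonetheless counted among the top ten machine learning algorithms.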
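The cost is easy to see in code: a single prediction computes one distance per stored sample. The following sketch uses synthetic random data as a stand-in for a normalized sample set (an assumption for illustration only) and also shows `np.argpartition`, which finds the k smallest distances without fully sorting all of them. This is an optional refinement, not something the code in this post does.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((10_000, 3))  # synthetic stand-in for a normalized sample set
query = rng.random(3)
k = 3

# One prediction touches every stored row: one distance per sample.
dists = np.sqrt(((train - query) ** 2).sum(axis=1))

# A full sort orders all distances even though only k are needed.
top_sorted = np.argsort(dists)[:k]

# argpartition only guarantees that the k smallest come first, in any order.
top_partition = np.argpartition(dists, k)[:k]

same_neighbors = set(top_sorted.tolist()) == set(top_partition.tolist())
print(same_neighbors)
```

For larger data sets, tree-based or approximate neighbor indexes are the usual next step, but those are outside the scope of this post.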
