Machine Learning in Action: Using the k-Nearest Neighbors Algorithm to Improve Matches on a Dating Site

Article link: https://blog.csdn.net/y1535766478/article/details/76286134

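1 Preparing the data: parsing data from a text file

The data used here is datingTestSet2.txt from the book Machine Learning in Action.

The code is as follows: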

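from numpy import *
def file2matrix(filename):
    # Read every line up front so the matrix can be sized in advance
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))  # one row per sample, three features
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()  # drop the trailing newline
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]  # fill row `index` only
        classLabelVector.append(int(listFromLine[-1]))  # last column is the class label
        index += 1
    return returnMat, classLabelVector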

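A few of the functions used here are worth a brief explanation:

1) .readlines() reads the whole file in one go and splits its contents into a list of lines, which Python's for ... in ... construct can then iterate over.

.readline() reads only one line per call and is usually much slower than .readlines(). Use it only when there is not enough memory to read the whole file at once.

.read() reads the whole file in a single call and is typically used to load the contents into one string variable. It gives the most direct string representation of the file, but it is unnecessary for sequential, line-oriented processing, and it cannot be used at all if the file is larger than available memory.

2) .strip(): called with no arguments it removes leading and trailing whitespace; in this code its job is to drop the newline at the end of each line.

First, .strip() removes the trailing newline, and then the line is split on the '\t' character into a list of elements. Next, we take the first 3 elements and store them in the feature matrix. Python accepts the index -1 for the last element of a list, and this negative indexing makes it easy to append the final column to the classLabelVector vector. Note that we must explicitly tell the interpreter that the stored values are integers; otherwise they would be treated as strings.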
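
The parsing steps just described can be tried on a single line. This is only a small sketch: the row below is made up to mimic the tab-separated layout (three features followed by a label) and is not taken from datingTestSet2.txt.

```python
# A made-up row in the same layout: three features, then the class label
line = "1000\t2.5\t0.75\t3\n"

fields = line.strip().split('\t')            # strip the newline, split on tabs
features = [float(x) for x in fields[0:3]]   # first three elements are the features
label_raw = fields[-1]                       # index -1 is the last element, still a string
label = int(label_raw)                       # explicit conversion to an integer

print(fields)
print(features, repr(label_raw), label)
```

Without the int() call the label would stay the string '3', which is why the conversion has to be explicit.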

With that, datingTestSet2.txt has been imported successfully.
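
To see concretely what the three read methods discussed earlier return, here is a small sketch that uses a throwaway temporary file. The two rows are made up for illustration, and the last variant (iterating over the file object directly) is an addition of mine that reads lazily, line by line.

```python
import os
import tempfile

# Two made-up rows in the same tab-separated layout
rows = "1000\t2.5\t0.75\t3\n2000\t0.5\t1.25\t1\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(rows)
    path = tmp.name

with open(path) as fr:
    all_lines = fr.readlines()          # list with one string per line
with open(path) as fr:
    first_line = fr.readline()          # only the first line
with open(path) as fr:
    whole_text = fr.read()              # the entire file as one string
with open(path) as fr:
    streamed = [line for line in fr]    # the file object yields lines one at a time

os.remove(path)                         # clean up the temporary file
print(len(all_lines), repr(first_line), len(whole_text))
```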

2 Analyzing the data: creating scatter plots with Matplotlib

We visualize the data with a scatter plot.

Add the following code to the code above:

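from numpy import *
import matplotlib.pyplot as plt
# Raw string so the backslashes in the Windows path are kept literally
datingDataMat, datingLabels = file2matrix(r'E:\PythonProject\machineL\Ch02\datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)

# Column 0 on the x axis, column 1 on the y axis; marker size and color both scale with the class label
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1], 15.0 * array(datingLabels), 15.0 * array(datingLabels))

# x label: frequent flier miles earned per year; y label: percentage of time spent playing video games
plt.xlabel(u'每年获得的飞行常客里程数', fontproperties='SimHei')
plt.ylabel(u'玩视频游戏所占时间百分比', fontproperties='SimHei')
plt.show()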

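Note: when the x and y axis labels contain Chinese text, be sure to pass fontproperties='SimHei'; otherwise the Chinese characters will not render correctly.

The result is shown in the figure:

The three classes are clearly visible as the yellow, purple, and blue regions.

This plot uses the first and second columns of the data. If we plot the second and third columns instead, the result is far less clear:

In that plot the points are scattered and the classes are hard to tell apart.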

3 Preparing the data: normalizing numeric values

def autoNorm(dataSet):
    # Column-wise minimum and maximum of each feature
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # newValue = (oldValue - min) / (max - min), applied to every row
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals
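
autoNorm rescales every feature to the range 0 to 1 with newValue = (oldValue - min) / (max - min). The sketch below runs that formula on three made-up rows and checks that NumPy broadcasting gives the same result as the explicit tile calls. The data and the variable names are assumptions for illustration only.

```python
import numpy as np

# Made-up rows: miles per year, % of time gaming, liters of ice cream
data = np.array([[10000.0, 2.0, 0.5],
                 [40000.0, 10.0, 1.5],
                 [25000.0, 6.0, 1.0]])

min_vals = data.min(axis=0)              # per-column minimum
ranges = data.max(axis=0) - min_vals     # per-column max - min
m = data.shape[0]

# Same arithmetic as autoNorm, with tile repeating the vectors m times
norm_tile = (data - np.tile(min_vals, (m, 1))) / np.tile(ranges, (m, 1))

# Broadcasting stretches the 1-D vectors across the rows automatically
norm_broadcast = (data - min_vals) / ranges

print(norm_broadcast)
```

The third row sits exactly halfway between the minimum and maximum of every column, so it maps to 0.5 in all three features.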

4 Testing the algorithm

def datingClassTest():
    hoRatio = 0.50  # hold out 50% of the data as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are the test set; the remaining rows are the training set.
        # classify0 is the kNN classifier from the book; it is not defined in this post.
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
5 Using the algorithm

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percent games?"))
    ffMiles = float(input("frequent flier miles?"))
    iceCream = float(input("liters of ice cream per year?"))
    datingDataMat, datingLabels = file2matrix(r'E:\PythonProject\machineL\Ch02\datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])  # same column order as the data file
    # Scale the query with the training set's minVals and ranges before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("you will like this person", resultList[classifierResult - 1])
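
The expression (inArr - minVals) / ranges matters because the query has to live on the same scale as the normalized training data. The sketch below uses three made-up training rows and a hypothetical helper, nearest_row, to show that the nearest neighbor can change once the features are scaled. None of these numbers come from the dating data set.

```python
import numpy as np

# Made-up rows: miles per year, % of time gaming, liters of ice cream
train = np.array([[10000.0, 2.0, 0.5],
                  [11000.0, 10.0, 1.5],
                  [40000.0, 2.5, 0.6]])
query = np.array([11000.0, 2.0, 0.5])

def nearest_row(point, rows):
    """Hypothetical helper: index of the row with the smallest Euclidean distance."""
    return int(np.sqrt(((rows - point) ** 2).sum(axis=1)).argmin())

min_vals = train.min(axis=0)
ranges = train.max(axis=0) - min_vals

raw_nearest = nearest_row(query, train)
scaled_nearest = nearest_row((query - min_vals) / ranges,
                             (train - min_vals) / ranges)
print(raw_nearest, scaled_nearest)
```

On the raw values row 1 wins because its mileage matches exactly, even though its other two features are far off. After scaling, row 0 is the closest, since it differs from the query only by a small fraction of the mileage range.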

Result: