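Getting Started with Machine Learning: The k-Nearest Neighbors Algorithm (2)

Using the k-nearest neighbors algorithm to improve matches on a dating site

Environment: Python 3.6.0 |Anaconda 4.3.1 (64-bit)

First, write a parser that converts the given feature file into a NumPy matrix and a list of labels.
Code: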

from numpy import zeros

def file2matrix(filename):
    # Read every line up front so the matrix can be sized before it is filled
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines,3)) # all-zero matrix: one row per line, three feature columns
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3] # copy the three feature values into returnMat
        classLabelVector.append(int(listFromLine[-1])) # the last column is the class label
        index += 1
    return returnMat,classLabelVector

Output:

In [9]: datingDatMat,datinglabels = kNN.file2matrix('datingTestSet2.txt')

In [10]: datingDatMat
Out[10]:
array([[  4.09200000e+04,   8.32697600e+00,   9.53952000e-01],
       [  1.44880000e+04,   7.15346900e+00,   1.67390400e+00],
       [  2.60520000e+04,   1.44187100e+00,   8.05124000e-01],
       ...,
       [  2.65750000e+04,   1.06501020e+01,   8.66627000e-01],
       [  4.81110000e+04,   9.13452800e+00,   7.28045000e-01],
       [  4.37570000e+04,   7.88260100e+00,   1.33244600e+00]])

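Before running, copy the datingTestSet2.txt data file into the folder that holds the code.
If a function in the code is unclear, add a few print() calls to show the intermediate results. Seeing the values directly makes the code much easier to follow.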
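To make the print() tip concrete, here is a minimal sketch that runs the same parsing steps on a three-line in-memory sample instead of the real file. The sample rows, the parse_lines name, and the verbose flag are assumptions for illustration and are not part of kNN.py.

```python
import io
import numpy as np

# Illustrative sample only: three tab-separated rows in the style of the data file.
SAMPLE = (
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)

def parse_lines(lines, verbose=False):
    """Parse tab-separated lines into a feature matrix and a label list."""
    rows = [ln.strip() for ln in lines if ln.strip()]  # drop blank lines
    mat = np.zeros((len(rows), 3))
    labels = []
    for index, line in enumerate(rows):
        fields = line.split("\t")
        if verbose:
            print(index, fields)  # intermediate result: the raw string fields
        mat[index, :] = [float(x) for x in fields[0:3]]
        labels.append(int(fields[-1]))
    return mat, labels

sample_mat, sample_labels = parse_lines(io.StringIO(SAMPLE), verbose=True)
print(sample_mat.shape, sample_labels)
```

Passing an open file object instead of io.StringIO should work the same way, since both yield one line at a time.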

Creating a scatter plot with Matplotlib

In [21]: import matplotlib

In [22]: import matplotlib.pyplot as plt

In [23]: fig = plt.figure()

In [24]: ax = fig.add_subplot(111)

In [25]: ax.scatter(datingDatMat[:,1],datingDatMat[:,2],15.0*array(datinglabels
    ...: ),15.0*array(datinglabels))
Out[25]: <matplotlib.collections.PathCollection at 0x7622f98>

In [26]: plt.show()

Run from numpy import * first, because array is a NumPy function.
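In the scatter call above, the third and fourth positional arguments are the marker size (s) and the marker color (c), so each class gets its own size and color. The script below is a small sketch of the same idea with named arguments. The three data rows and the axis labels are assumptions for illustration (the labels follow the column order used later in the post).

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative values only: three rows in the style of the dating data.
features = np.array([
    [40920.0, 8.326976, 0.953952],
    [14488.0, 7.153469, 1.673904],
    [26052.0, 1.441871, 0.805124],
])
labels = np.array([3, 2, 1])

fig, ax = plt.subplots()
points = ax.scatter(
    features[:, 1],      # x: second column
    features[:, 2],      # y: third column
    s=15.0 * labels,     # marker size grows with the class label
    c=15.0 * labels,     # color is mapped from the same values
)
ax.set_xlabel("percentage of time spent playing video games")
ax.set_ylabel("liters of ice cream consumed per year")
# plt.show()  # uncomment in an interactive session to display the figure
```

Naming s and c explicitly makes it clear why 15.0*array(datinglabels) appears twice in the original call.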

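Normalization
What is normalization? Imagine a fitness coach who turns up at a kindergarten looking for a fight. To keep things fair and stop the coach from bullying the kids, Doraemon takes out his shrink ray and scales the coach down to the children's size before the contest starts.
Code: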

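from numpy import tile

def autoNorm(dataSet): # min-max normalization
    minVals = dataSet.min(0) # column-wise minimum
    maxVals = dataSet.max(0) # column-wise maximum
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals,(m,1)) # subtract each column's minimum from every row
    normDataSet = normDataSet/tile(ranges,(m,1)) # element-wise division by each column's range
    return normDataSet,ranges,minVals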

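Normalization brings the site's features, which differ by orders of magnitude, onto the same 0-to-1 scale, so that no single feature dominates the distance calculation.
Output: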

In [27]: normMat,ranges,minVals = kNN.autoNorm(datingDatMat)

In [28]: normMat
Out[28]:
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ...,
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])

In [29]: ranges
Out[29]: array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])

In [30]: minVals
Out[30]: array([ 0.      ,  0.      ,  0.001156])
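As a quick check of what autoNorm computes, the formula is newValue = (oldValue - min) / (max - min), applied column by column. The sketch below applies it to the first row of datingDatMat using the ranges and minVals printed above, then to a small made-up matrix. The variable names and the toy matrix are assumptions for illustration, and NumPy broadcasting is swapped in for tile, which gives the same result.

```python
import numpy as np

# Values copied from the outputs shown above.
first_row = np.array([40920.0, 8.326976, 0.953952])
ranges = np.array([9.12730000e+04, 2.09193490e+01, 1.69436100e+00])
min_vals = np.array([0.0, 0.0, 0.001156])

# Broadcasting stretches min_vals and ranges across the row automatically.
normalized = (first_row - min_vals) / ranges
print(normalized)

# The same formula on a whole (made-up) matrix, one column at a time.
toy = np.array([[10.0, 1.0],
                [20.0, 3.0],
                [30.0, 2.0]])
toy_min = toy.min(axis=0)
toy_range = toy.max(axis=0) - toy_min
toy_norm = (toy - toy_min) / toy_range
print(toy_norm)
```

The first result should agree with the first row of normMat to the printed precision. Note that a column whose values are all equal has a range of 0 and would divide by zero; neither this sketch nor autoNorm guards against that.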

Testing the classifier on the dating-site data
Code:

def datingClassTest(): # hold-out test on the dating-site data
    hoRatio = 0.10
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt') # feature matrix and matching labels
    normMat,ranges,minVals = autoNorm(datingDataMat) # normalize the features
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio) # hold out the first 10% of rows for testing
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],\
                                      datingLabels[numTestVecs:m],3) # rows numTestVecs:m (the other 90%) are the training set
        print("The classifier came back with: %d,the real answer is: %d"\
              %(classifierResult,datingLabels[i]))
        if (classifierResult != datingLabels[i]):errorCount += 1.0
    print("The total error rate is: %f" % (errorCount/float(numTestVecs)))

Output:

In [31]: kNN.datingClassTest()
The classifier came back with: 3,the real answer is: 3
The classifier came back with: 2,the real answer is: 2
The classifier came back with: 1,the real answer is: 1
The classifier came back with: 1,the real answer is: 1
The classifier came back with: 1,the real answer is: 1
.
.
The classifier came back with: 2,the real answer is: 3
The classifier came back with: 1,the real answer is: 1
The classifier came back with: 2,the real answer is: 2
The classifier came back with: 3,the real answer is: 3
The classifier came back with: 2,the real answer is: 2
The classifier came back with: 2,the real answer is: 1
The classifier came back with: 1,the real answer is: 1
The total error rate is: 0.080000
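datingClassTest relies on classify0, which is not shown in this post. The sketch below is one common way to write such a classifier (Euclidean distance plus a majority vote among the k nearest rows), together with the same hold-out loop, run on a small made-up data set. The names knn_classify and holdout_error_rate and the toy points are assumptions for illustration; the real classify0 may differ in detail.

```python
from collections import Counter
import numpy as np

def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k rows of data_set closest to in_x."""
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))  # Euclidean distance
    nearest = distances.argsort()[:k]                          # indices of the k closest rows
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

def holdout_error_rate(data, labels, ratio=0.10, k=3):
    """Test on the first `ratio` of rows, using the remaining rows as training data."""
    m = data.shape[0]
    num_test = int(m * ratio)
    errors = 0
    for i in range(num_test):
        predicted = knn_classify(data[i], data[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / num_test

# Made-up data: two well-separated groups, interleaved so the test rows contain both.
toy_points, toy_labels = [], []
for i in range(20):
    toy_points.append([i * 0.1, i * 0.1])
    toy_labels.append(1)
    toy_points.append([5 + i * 0.1, 5 + i * 0.1])
    toy_labels.append(2)
toy_data = np.array(toy_points)

error_rate = holdout_error_rate(toy_data, toy_labels)
print("The total error rate is: %f" % error_rate)
```

Because the test rows are simply the first 10% of the data, this kind of hold-out check only makes sense when the rows are not sorted by class.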

Dating-site prediction function
Code:

def classifyPerson():
    resultList = ['not at all','in small doses','in large doses']
    percentTats = float(input(
                  "percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles,percentTats,iceCream]) # same column order as the data file
    classifierResult = classify0((inArr - minVals)
                                 /ranges,normMat,datingLabels,3) # normalize the query the same way as the training data
    print("you will probably like this person:",
          resultList[classifierResult - 1])

In Python 3, raw_input was renamed to input, so each raw_input call in the Python 2 version of this code is replaced with input.
http://stackoverflow.com/questions/954834/how-do-i-use-raw-input-in-python-3
This code takes the values the user enters and passes them to classify0, which returns the class that best matches them.
Output:

kNN.classifyPerson()

percentage of time spent playing video games?10
frequent flier miles earned per year?1000
liters of ice cream consumed per year?0.5
you will probably like this person: in small doses