

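My friend Helen has been using an online dating site to look for a suitable partner. Although the site recommends different people, she has noticed that they fall into three types:

□ People she didn't like

□ People of average charm

□ People of great charm

Despite spotting this pattern, Helen still cannot sort the matches the site recommends into the right category. She feels...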


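She stores this data in a text file (datingTestSet.txt), one sample per line, 1000 lines in total. Helen's samples have three main features:

□ Frequent flyer miles earned per year

□ Percentage of time spent playing video games

□ Liters of ice cream consumed per week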


40920 8.326976 0.953952 largeDoses

14488 7.153469 1.673904 smallDoses

26052 1.441871 0.805124 didntLike

75136 13.147394 0.428964 didntLike

38344 1.669788 0.134296 didntLike

72993 10.141740 1.032955 didntLike

35948 6.830792 1.213192 largeDoses

42666 13.276369 0.543880 largeDoses

67497 8.631577 0.749278 didntLike

35483 12.273169 1.508053 largeDoses
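
The sample rows above carry text labels, while the label list printed later in this section holds the integers 1 to 3. Below is a minimal sketch of one way to parse such a row. It assumes tab-separated fields and the mapping didntLike = 1, smallDoses = 2, largeDoses = 3, which agrees with the first ten rows and the first ten printed labels. The names `LABEL_TO_INT` and `parse_line` are hypothetical and do not come from the original code.

```python
# Assumed mapping, inferred from the sample rows and the printed label list.
LABEL_TO_INT = {"didntLike": 1, "smallDoses": 2, "largeDoses": 3}

def parse_line(line):
    """Split one tab-separated record into (features, integer label)."""
    fields = line.strip().split("\t")
    # The first three fields are the numeric features.
    features = [float(value) for value in fields[0:3]]
    # The last field is the text label; look up its integer code.
    label = LABEL_TO_INT[fields[-1]]
    return features, label

sample = "40920\t8.326976\t0.953952\tlargeDoses"
features, label = parse_line(sample)
print(features, label)
```

A label outside the assumed mapping raises a `KeyError`, which is probably preferable to silently assigning a wrong class.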

from numpy import *
import operator

# Split the records of a file into a feature matrix and a list of class labels
def file2matrix(filename):
    with open(filename) as fr:
        arrayOfLines = fr.readlines()
    numberOfLines = len(arrayOfLines)
    retMat = zeros((numberOfLines,3))
    classLabelVector = []
    index = 0
    for line in arrayOfLines:
        # e.g. line = '40920\t8.326976\t0.953952\t3'
        line = line.strip()
        # list of fields, e.g. ['40920', '8.326976', '0.953952', '3']
        listFromLine = line.split('\t')
        # half-open slice: columns 0 to 2, i.e. the feature vector
        retMat[index,:] = listFromLine[0:3]
        # the last column is the class label (an integer in datingTestSet2.txt)
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return retMat,classLabelVector
# raw string so the backslashes in the Windows path are not read as escapes
datingDataMat,datingLabels = file2matrix(r'C:\machinelearninginaction\Ch02\datingTestSet2.txt')
array([[  4.09200000e+04,   8.32697600e+00,   9.53952000e-01],
       [  1.44880000e+04,   7.15346900e+00,   1.67390400e+00],
       [  2.60520000e+04,   1.44187100e+00,   8.05124000e-01],
       [  2.65750000e+04,   1.06501020e+01,   8.66627000e-01],
       [  4.81110000e+04,   9.13452800e+00,   7.28045000e-01],
       [  4.37570000e+04,   7.88260100e+00,   1.33244600e+00]])
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]


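import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
# assumed plot: game-time percentage (column 1) against ice cream liters (column 2)
ax.scatter(datingDataMat[:,1], datingDataMat[:,2])
plt.xlabel('play game')
plt.show()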



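Feature values are usually normalized to the range 0 to 1 or -1 to 1. The formula below converts a feature value from any range into a value in the interval 0 to 1:

    newValue = (oldValue - min) / (max - min)

Here min and max are the smallest and largest values of that feature in the data set. Rescaling the values makes the classifier more complex, but we have to do it to get accurate results.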
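
As a worked illustration of the formula, here is a minimal sketch in plain Python, without NumPy, so that the arithmetic stays visible. The function name `min_max_normalize` and the handling of a constant column are assumptions made for this example and are not part of the original code.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in column]
    # newValue = (oldValue - min) / (max - min)
    return [(value - lo) / span for value in column]

# Frequent flyer miles from the first five sample rows.
miles = [40920, 14488, 26052, 75136, 38344]
print(min_max_normalize(miles))
```

With the minimum 0 and the range 91273 that this section reports for the first column, the value 40920 maps to 40920 / 91273, roughly 0.4483, which agrees with the first entry of the normalized matrix.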

def autoNorm(dataSet):
    minVals = dataSet.min(0)    # column-wise minimum
    maxVals = dataSet.max(0)    # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # subtract the column minimum from every row: (oldValue - min)
    normDataSet = dataSet - tile(minVals,(m,1))
    # element-wise division: (oldValue - min) / (max - min)
    normDataSet = normDataSet/tile(ranges,(m,1))
    return normDataSet,ranges,minVals


normMat,ranges,minvalue = autoNorm(datingDataMat)
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])
array([ 0.      ,  0.      ,  0.001156])


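An important task is to evaluate the accuracy of the algorithm. Typically we use 90% of the existing data as training samples for the classifier and the remaining 10% to test it and measure its accuracy.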
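
The 90/10 split can be sketched as follows. This is a minimal illustration and not the author's code: `holdout_split` and `error_rate` are hypothetical helper names, and the data is made up. Taking the first 10% of the records as the test set assumes the file is not sorted by class.

```python
def holdout_split(samples, labels, test_ratio=0.10):
    """Use the first test_ratio of the records for testing, the rest for training."""
    num_test = int(len(samples) * test_ratio)
    test = (samples[:num_test], labels[:num_test])
    train = (samples[num_test:], labels[num_test:])
    return train, test

def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

samples = list(range(1000))              # stand-ins for 1000 feature vectors
labels = [i % 3 + 1 for i in samples]    # made-up labels for illustration
(train_x, train_y), (test_x, test_y) = holdout_split(samples, labels)
print(len(train_x), len(test_x))
```

If the records were grouped by class, shuffling them before the split would be the usual precaution.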

from numpy import *
import operator

def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0]           # number of records
    # tile(X,(a,b)): a is how many times X is stacked as rows, b is how many times it is repeated within each row
    diffMat = tile(inX,(dataSetSize,1))-dataSet
    sqDifMat = diffMat ** 2
    # axis = 1 sums across each row
    sqDistances = sqDifMat.sum(axis = 1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()   # indices that sort the distances in ascending order
    classCount = { }
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # Python 3 uses classCount.items(); Python 2 code used iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
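
To see the same three steps (compute distances, sort, vote among the k nearest) on a small input, here is a minimal sketch in plain Python rather than NumPy. The name `knn_classify` and the toy points are made up for illustration, and this is not the classify0 function itself.

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples nearest to query."""
    # Euclidean distance from the query to every training sample.
    distances = [math.dist(query, sample) for sample in samples]
    # Indices of the k nearest samples, ordered from nearest to farthest.
    nearest = sorted(range(len(samples)), key=distances.__getitem__)[:k]
    # Count the labels of those k samples and pick the most frequent one.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy data: two points near (1, 1) and two near the origin.
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.0, 0.0), samples, labels, k=3))
```

An odd k, such as the 3 used in this section, makes ties less likely when there are two classes, although ties remain possible with three.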
from numpy import *
import operator

def datingClassTest():
    hoRatio = 0.10      # hold out 10% of the data for testing
    datingDataMat,datingLabels = file2matrix(r'C:\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # the first numTestVecs rows are test samples; the remaining rows are the training set
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classifier came back with : %d,the real answer is %d"%(classifierResult,datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is :%f"%(errorCount/float(numTestVecs)))
    print ("errorCount:",errorCount)
datingClassTest()
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 1
the total error rate is :0.050000
errorCount: 5.0






