Machine Learning Notes: k-Nearest Neighbors (Part 2)

Improving the match results of a dating site with the k-nearest neighbors algorithm

My friend Hellen has been using online dating sites to find a suitable date. Although the sites recommend different candidates, she has not found anyone she likes. After some reflection, she realized that the people she has dated fall into three types:

□ People she didn't like

□ People she liked in small doses

□ People she liked in large doses

Despite discovering this pattern, Hellen still couldn't sort the matches the site recommended into the right category. She felt she could date the "small doses" people on weekdays, while on weekends she preferred the company of the "large doses" people. Hellen hoped our classification software could do a better job of placing her matches into the proper categories.

1. Preparing the data

She stored these data in the text file datingTestSet.txt (the code below uses datingTestSet2.txt, a variant in which the labels are encoded as the integers 1, 2, 3 instead of strings). Each sample occupies one line, for 1000 lines in total. Hellen's samples consist mainly of the following three features:

□ Number of frequent flyer miles earned per year

□ Percentage of time spent playing video games

□ Liters of ice cream consumed per week

Data excerpt:

40920 8.326976 0.953952 largeDoses
14488 7.153469 1.673904 smallDoses
26052 1.441871 0.805124 didntLike
75136 13.147394 0.428964 didntLike
38344 1.669788 0.134296 didntLike
72993 10.141740 1.032955 didntLike
35948 6.830792 1.213192 largeDoses
42666 13.276369 0.543880 largeDoses
67497 8.631577 0.749278 didntLike
35483 12.273169 1.508053 largeDoses

from numpy import *
import operator

# Split the records in the file into a feature matrix and a label list
def file2matrix(filename):
    fr = open(filename)
    arrayOfLines = fr.readlines()
    # Number of lines in the file, i.e. the number of records
    numberOfLines = len(arrayOfLines)
    # Zero matrix with one row per record and one column per feature
    # (3 features: flyer miles, game time, ice cream); the goal is to
    # separate the features from the target variable
    retMat = zeros((numberOfLines,3))
    # Empty list to collect the class labels
    classLabelVector = []
    index = 0
    for line in arrayOfLines:
        # strip removes the trailing newline and surrounding whitespace
        # e.g. line = '40920\t8.326976\t0.953952\t3'
        line = line.strip()
        # Split into a list of fields, e.g. ['40920', '8.326976', '0.953952', '3']
        listFromLine = line.split('\t')
        # Columns 0 to 2 (the slice end is exclusive) form the feature vector
        retMat[index,:] = listFromLine[0:3]
        # Convert the label from string to int
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    fr.close()
    return retMat,classLabelVector
datingDataMat,datingLabels = file2matrix(r'C:\machinelearninginaction\Ch02\datingTestSet2.txt')
datingDataMat
array([[  4.09200000e+04,   8.32697600e+00,   9.53952000e-01],
       [  1.44880000e+04,   7.15346900e+00,   1.67390400e+00],
       [  2.60520000e+04,   1.44187100e+00,   8.05124000e-01],
       ..., 
       [  2.65750000e+04,   1.06501020e+01,   8.66627000e-01],
       [  4.81110000e+04,   9.13452800e+00,   7.28045000e-01],
       [  4.37570000e+04,   7.88260100e+00,   1.33244600e+00]])
datingLabels[0:20]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
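The strip/split parsing inside file2matrix can be checked in isolation; here io.StringIO stands in for the real file, and the two records are copied from the excerpt above (with integer labels as in datingTestSet2.txt):

```python
import io
from numpy import zeros

# Two records in the same tab-separated format as datingTestSet2.txt
fr = io.StringIO('40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n')

lines = fr.readlines()
retMat = zeros((len(lines), 3))
labels = []
for index, line in enumerate(lines):
    fields = line.strip().split('\t')   # e.g. ['40920', '8.326976', '0.953952', '3']
    retMat[index, :] = fields[0:3]      # first three columns are the features
    labels.append(int(fields[-1]))      # last column is the class label

print(labels)  # [3, 2]
```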

2. Analyzing the data: scatter plots with Matplotlib

import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
plt.xlabel('play game')
plt.ylabel('ice cream')
plt.show()

(Scatter plot: video game time on the x-axis, ice cream consumption on the y-axis)
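A common refinement is to vary the marker size and color by class label so the three categories can be told apart at a glance. The random points below are only a stand-in for datingDataMat and datingLabels:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')          # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
feats = rng.random((30, 2))            # stand-in for columns 1 and 2 of datingDataMat
labels = rng.integers(1, 4, size=30)   # stand-in for datingLabels (values 1, 2, 3)

fig, ax = plt.subplots()
# Marker size and color both scale with the class label
ax.scatter(feats[:, 0], feats[:, 1], s=15.0 * labels, c=labels)
ax.set_xlabel('play game')
ax.set_ylabel('ice cream')
fig.savefig('dating_scatter.png')
```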

3. Normalizing the data

When features lie in very different ranges, the usual approach is to normalize the values, for example rescaling them to the range 0 to 1 or -1 to 1. The following formula rescales a feature of any range to the interval from 0 to 1:

    newValue = (oldValue - min) / (max - min)

where min and max are the smallest and largest values of that feature in the dataset. Although rescaling adds some complexity to the classifier, we have to do it to get accurate results.
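As a quick sanity check, the same min-max rescaling can be written in one vectorized line with NumPy broadcasting (no tile needed); the small array below is made-up sample data:

```python
import numpy as np

# Made-up sample data: 4 records, 3 features on very different scales
data = np.array([[40920.0,  8.3, 0.95],
                 [14488.0,  7.2, 1.67],
                 [26052.0,  1.4, 0.81],
                 [75136.0, 13.1, 0.43]])

min_vals = data.min(axis=0)           # column-wise minimum
ranges = data.max(axis=0) - min_vals  # column-wise (max - min)
normed = (data - min_vals) / ranges   # broadcasting handles the tiling

print(normed.min(axis=0))  # each column now starts at 0.0
print(normed.max(axis=0))  # and ends at 1.0
```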

#Normalization function
def autoNorm(dataSet):
    #Column-wise minimum
    minVals = dataSet.min(0)
    #Column-wise maximum
    maxVals = dataSet.max(0)
    #Range of each column
    ranges = maxVals - minVals
    #Zero matrix with the same shape as the input
    normDataSet = zeros(shape(dataSet))
    #Number of rows, i.e. the number of records
    m = dataSet.shape[0]
    #Subtract the column minima from every row: (oldValue - min)
    normDataSet = dataSet - tile(minVals,(m,1))
    #Compute (oldValue - min) / (max - min)
    normDataSet = normDataSet/tile(ranges,(m,1))
    return normDataSet,ranges,minVals

Computing the normalized values with this function:

normMat,ranges,minvalue = autoNorm(datingDataMat)
normMat
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ..., 
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
ranges
array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])
minvalue
array([ 0.      ,  0.      ,  0.001156])

4. Testing the algorithm

If the classifier's accuracy is good enough, Hellen can use this software to process the candidate lists the dating site provides. An important task in machine learning is evaluating an algorithm's accuracy: typically we use 90% of the existing data as training samples and hold out the remaining 10% to test the classifier and measure its error rate.

#inX is the input vector to classify, e.g. [0.0, 0.5]; dataSet is the known
#training matrix (group); labels are the corresponding class labels; k is the
#number of nearest neighbors that vote
from numpy import *
import operator

def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0]           #number of training records
    #tile(X,(a,b)): repeat X a times along rows and b times along columns.
    #To subtract the input vector from every training record, replicate it
    #dataSetSize times and subtract the training matrix in one step.
    diffMat = tile(inX,(dataSetSize,1))-dataSet
    sqDiffMat = diffMat ** 2
    #axis = 1 sums across each row
    sqDistances = sqDiffMat.sum(axis = 1)
    distances = sqDistances ** 0.5
    sortedDistIndices = distances.argsort()
    classCount = { }
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    #Python 3 uses classCount.items(); Python 2 used iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
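A quick way to sanity-check classify0 is to call it on a tiny made-up dataset where the nearest neighbors are obvious; the points and labels below are invented for illustration, and the compact copy of classify0 only keeps the example self-contained:

```python
import operator
from numpy import array, tile

def classify0(inX, dataSet, labels, k):
    # Euclidean distance from inX to every training point
    diffMat = tile(inX, (dataSet.shape[0], 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    classCount = {}
    for i in distances.argsort()[:k]:          # indices of the k nearest points
        classCount[labels[i]] = classCount.get(labels[i], 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# Two clusters: class 'A' near (1, 1), class 'B' near (0, 0)
group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(classify0([0.0, 0.2], group, labels, 3))  # 'B' - two of the three nearest are 'B'
```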
#Test harness for the dating-site classifier
from numpy import *
import operator

def datingClassTest():
    #0.10 means 10% of the dataset is held out as the test set
    hoRatio = 0.10
    #datingDataMat holds the feature values, datingLabels the target labels
    datingDataMat,datingLabels = file2matrix(r'C:\machinelearninginaction\Ch02\datingTestSet2.txt')
    #normMat: normalized values; ranges: column ranges; minVals: column minima
    normMat,ranges,minVals = autoNorm(datingDataMat)
    #m is the total number of records
    m = normMat.shape[0]
    #Number of records held out for testing
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classifier came back with : %d,the real answer is %d"%(classifierResult,datingLabels[i]))
        if(classifierResult != datingLabels[i]) : errorCount += 1.0
    print("the total error rate is :%f"%(errorCount/float(numTestVecs)))
    print ("errorCount:",errorCount)
datingClassTest()
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 3
the classifier came back with : 1,the real answer is 1
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 3
the classifier came back with : 3,the real answer is 3
the classifier came back with : 2,the real answer is 2
the classifier came back with : 1,the real answer is 1
the classifier came back with : 3,the real answer is 1
the total error rate is :0.050000
errorCount: 5.0

We have now tested the classifier on the data, and at last we can use it to classify people for Hellen. We will give her a small program: she finds someone on the dating site, enters his information, and the program predicts how much she will like him.
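That small program might look like the sketch below. The function name classifyPerson and the tiny built-in training set are assumptions for illustration; in practice it would load datingTestSet2.txt with file2matrix and normalize with autoNorm as above. The key point is that the new person's features must be rescaled with the same ranges and minVals as the training data:

```python
import operator
from numpy import array, tile

def classify0(inX, dataSet, labels, k):
    # Same kNN voting as in the text, repeated here for self-containment
    diffMat = tile(inX, (dataSet.shape[0], 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    classCount = {}
    for i in distances.argsort()[:k]:
        classCount[labels[i]] = classCount.get(labels[i], 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

def classifyPerson(ffMiles, percentGames, iceCream, normMat, datingLabels, ranges, minVals):
    resultList = ['not at all', 'in small doses', 'in large doses']
    # Rescale the new person with the SAME ranges/minVals as the training data
    inArr = (array([ffMiles, percentGames, iceCream]) - minVals) / ranges
    label = classify0(inArr, normMat, datingLabels, 3)
    return resultList[label - 1]   # labels are the integers 1, 2, 3

# Tiny made-up training set standing in for the real normalized data
normMat = array([[0.10, 0.20, 0.30],
                 [0.15, 0.25, 0.30],
                 [0.90, 0.80, 0.70],
                 [0.50, 0.50, 0.50]])
datingLabels = [1, 1, 3, 2]
ranges = array([90000.0, 20.0, 1.7])
minVals = array([0.0, 0.0, 0.0])

prediction = classifyPerson(10000, 4.0, 0.5, normMat, datingLabels, ranges, minVals)
print(prediction)  # 'not at all' - the two nearest neighbors both have label 1
```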
