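I have recently been studying Machine Learning in Action.
The kNN algorithm finds the k records in the training set that are closest to a new data point (measured here by Euclidean distance) and assigns the new point the majority class among those k records. The algorithm involves three main factors: the training set, the distance or similarity measure, and the value of k.
I. Running the kNN algorithm
kNN can solve problems like the following. Given these samples: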
group = array([[1.0,1.1],[1.1,1.0],[0,0],[0,0.1]])
labels = ['A','A','B','B']
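The task is to decide which class each of [1.1,1.2] and [0.1,0.2] belongs to. First create the file kNN.py and add a function that builds the data set: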
import numpy
import operator  # used later by classify0

def createDataSet():
    group = numpy.array([[1.0,1.1], [1.1,1.0], [0,0], [0,0.1]])
    labels = ['A','A','B','B']
    return group,labels
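As a quick sanity check before running the classifier, the distances from one query point to the four samples can be computed directly. This is only a minimal sketch and not part of kNN.py; the names query and distances are chosen here for illustration.

```python
import numpy

group = numpy.array([[1.0, 1.1], [1.1, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = numpy.array([1.1, 1.2])

# Euclidean distance from the query to every sample (one value per row)
distances = numpy.sqrt(((group - query) ** 2).sum(axis=1))

# Pair each label with its distance and print them nearest first
for label, dist in sorted(zip(labels, distances), key=lambda pair: pair[1]):
    print(label, round(float(dist), 3))
```

The two 'A' samples come out closest to the query, which is why a vote among the nearest neighbors favors class 'A' for this point.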
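Next we apply the kNN algorithm.
For each point to be classified, do the following:
1. Compute the Euclidean distance between the query point and every sample point;
2. Sort the samples by increasing distance;
3. Take the first k points, count how often each label occurs among them, and sort the labels by count in descending order;
4. Return the most frequent label as the kNN prediction.
The code is as follows: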
# k-nearest neighbors classifier
def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0]
    # Compute the Euclidean distance from inX to every row of dataSet
    diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # Sort the distances and keep the sorted indices
    sortedDistIndicies = distances.argsort()
    # Empty dictionary that maps each label to its vote count
    classCount = {}
    # Collect the votes of the k nearest points
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # Sort the labels by vote count, highest first
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]
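The same four steps can also be written more compactly. The following is a sketch under my own assumptions, not the book's listing: it replaces numpy.tile with NumPy broadcasting and the dictionary plus sorted() with collections.Counter, and the name classify_knn is hypothetical.

```python
from collections import Counter

import numpy


def classify_knn(in_x, data_set, labels, k):
    """Return the majority label among the k rows of data_set nearest to in_x."""
    data_set = numpy.asarray(data_set, dtype=float)
    in_x = numpy.asarray(in_x, dtype=float)
    # Broadcasting subtracts in_x from every row, so numpy.tile is not needed
    distances = numpy.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first
    nearest = distances.argsort()[:k]
    # Count the labels of those neighbors and take the most frequent one
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


group = numpy.array([[1.0, 1.1], [1.1, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_knn([1.1, 1.2], group, labels, 3))
print(classify_knn([0.1, 0.2], group, labels, 3))
```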
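Then open a console in the directory that contains the file, start the Python interactive interpreter, and run the following commands (this is a simple classifier):
>>> import kNN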
>>> group,labels=kNN.createDataSet()
>>> kNN.classify0([1.1,1.2],group,labels,3)
'A'
>>> kNN.classify0([0.1,0.2],group,labels,3)
'B'
>>>
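II. Using kNN to improve matches on a dating website
The data set is stored in the text file datingTestSet.txt, one sample per line, 1000 lines in total. Each sample has the following 3 features:
1. Frequent flier miles earned per year
2. Percentage of time spent playing video games
3. Liters of ice cream consumed per week
2.1 Parsing and analyzing data from a text file
Before the data can be passed to the classifier, it has to be converted into a format the classifier accepts. Create a function named file2matrix in kNN.py to handle the input format: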
# Parse the dating data text file into a NumPy matrix and a label list
def file2matrix(filename):
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # Matrix to return: one row per sample, three feature columns
    returnMat = numpy.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # Parse each line into a feature row and an integer class label
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector
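For a file in this tab-separated layout, numpy.loadtxt can do the parsing in one call. The sketch below is an alternative under stated assumptions, not the book's method: the name load_dating_data is hypothetical, and the two rows fed through io.StringIO are made up so the example runs without the real file.

```python
import io

import numpy


def load_dating_data(source):
    """Parse tab-separated rows of three features plus an integer label."""
    # ndmin=2 keeps the result two-dimensional even for a single row
    table = numpy.loadtxt(source, delimiter='\t', ndmin=2)
    features = table[:, 0:3]
    labels = table[:, -1].astype(int).tolist()
    return features, labels


# Two made-up rows standing in for the real file
sample = io.StringIO("1000\t2.5\t0.5\t1\n20000\t8.0\t1.2\t3\n")
features, labels = load_dating_data(sample)
print(features.shape, labels)
```

Passing a file name instead of the StringIO object works the same way, since numpy.loadtxt accepts either.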
Reload kNN.py (in Python 2 simply call reload(kNN); in Python 3.6 use import importlib; importlib.reload(kNN)).
Matplotlib can then be used to draw a scatter plot and inspect the distribution of the data:
>>> import kNN
>>> datingDataMat,datingLabels = kNN.file2matrix('datingTestSet2.txt')
>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig=plt.figure()
>>> ax=fig.add_subplot(111)
>>> import numpy
>>> ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))
>>> plt.show()
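Here the second and third columns of datingDataMat are used, which hold the percentage of time spent playing video games and the liters of ice cream consumed per week. The result is a scatter plot of these two features, with marker size and color determined by the class label.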
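2.2 Normalizing the data
Before testing, the data must be normalized; otherwise features with large numeric values dominate the distance calculation. We therefore rescale the values to the range 0 to 1 or -1 to 1, using the following formula: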
newValue = (oldValue - min) / (max - min)
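A small worked example shows how strongly the miles feature dominates before scaling. All numbers below are made up for illustration (they are not taken from the data set), and the per-feature minima and maxima are hypothetical.

```python
import numpy

# Two made-up people: [miles per year, % time gaming, liters of ice cream]
a = numpy.array([40000.0, 8.0, 1.0])
b = numpy.array([14000.0, 2.0, 1.5])

# Hypothetical per-feature minima and maxima for the whole data set
min_vals = numpy.array([0.0, 0.0, 0.0])
max_vals = numpy.array([90000.0, 20.0, 2.0])

raw_distance = numpy.sqrt(((a - b) ** 2).sum())

# newValue = (oldValue - min) / (max - min), applied to each feature
a_norm = (a - min_vals) / (max_vals - min_vals)
b_norm = (b - min_vals) / (max_vals - min_vals)
norm_distance = numpy.sqrt(((a_norm - b_norm) ** 2).sum())

# Share of the squared distance that comes from the miles feature
raw_share = (a[0] - b[0]) ** 2 / raw_distance ** 2
norm_share = (a_norm[0] - b_norm[0]) ** 2 / norm_distance ** 2
print(raw_distance, norm_distance)
print(raw_share, norm_share)
```

Before scaling, practically all of the squared distance comes from the miles column; after scaling, the three features contribute on a comparable scale.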
We add a function autoNorm() to kNN.py that automatically rescales the values to between 0 and 1:
# Normalize the feature values (dating data)
def autoNorm(dataSet):
    # Per-column minimum, maximum, and their difference
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    # Allocate a zero matrix with the same shape as dataSet (1000 x 3 here)
    normDataSet = numpy.zeros(numpy.shape(dataSet),dtype=float)
    # Number of rows in dataSet (1000 here)
    m = dataSet.shape[0]
    # Apply the formula newVal = (oldVal - min) / ranges
    normDataSet = dataSet - numpy.tile(minVals, (m, 1))
    normDataSet = normDataSet/numpy.tile(ranges,(m, 1))
    return normDataSet,ranges,minVals
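The two numpy.tile calls can be dropped, because NumPy broadcasting already stretches a one-row vector across every row of the matrix. The following sketch assumes that simplification; the name auto_norm and the 3 x 3 demo matrix are made up for illustration.

```python
import numpy


def auto_norm(data_set):
    """Min-max scale each column of data_set to the range 0 to 1."""
    data_set = numpy.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting applies the per-column vectors to every row
    norm_data_set = (data_set - min_vals) / ranges
    return norm_data_set, ranges, min_vals


# Made-up 3 x 3 data set for illustration
demo = numpy.array([[10.0, 1.0, 0.0], [20.0, 3.0, 0.5], [30.0, 2.0, 1.0]])
norm, ranges, min_vals = auto_norm(demo)
print(norm)
```

Like autoNorm, this sketch does not guard against a column whose values are all equal, in which case the range is 0 and the division is undefined.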
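At the Python prompt, reload the kNN.py module and run autoNorm to check the result. Note that the assignment itself prints nothing; the matrices shown in the transcript appear to be intermediate values printed while debugging, and the second one has identical rows, consistent with a per-column vector tiled to one row per sample as in numpy.tile(ranges, (m, 1)).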
>>> import importlib
>>> importlib.reload(kNN)
<module 'kNN' from 'E:\\pyCharm\\workspace\\test1\\learning\\kNN.py'>
>>> norMat,ranges,minVals = kNN.autoNorm(datingDataMat)
[[ 4.09200000e+04 8.32697600e+00 9.52796000e-01]
[ 1.44880000e+04 7.15346900e+00 1.67274800e+00]
[ 2.60520000e+04 1.44187100e+00 8.03968000e-01]
...,
[ 2.65750000e+04 1.06501020e+01 8.65471000e-01]
[ 4.81110000e+04 9.13452800e+00 7.26889000e-01]
[ 4.37570000e+04 7.88260100e+00 1.33129000e+00]]
[[ 9.12730000e+04 2.09193490e+01 1.69436100e+00]
[ 9.12730000e+04 2.09193490e+01 1.69436100e+00]
[ 9.12730000e+04 2.09193490e+01 1.69436100e+00]
...,
[ 9.12730000e+04 2.09193490e+01 1.69436100e+00]
[ 9.12730000e+04 2.09193490e+01 1.69436100e+00]
[ 9.12730000e+04 2.09193490e+01 1.69436100e+00]]
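2.3 Testing the algorithm
What matters most for a machine learning algorithm is its accuracy. This example uses 90% of the data as training samples and 10% as test samples. Because the rows are in random order, we simply take the first 10% of the data for testing. Add the function datingClassTest to the file: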
# Test code for the dating site classifier
def datingClassTest():
    # Hold out 10% of the data for testing
    hoRatio = 0.10
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    # Classify each held-out row against the remaining 90% of the rows
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classfier came back with: %d,the real answer is : %d"%(classifierResult,datingLabels[i]))
        if (classifierResult != datingLabels[i]):
            errorCount += 1.0
    # Compute the error rate over the test rows
    print("the total error rate is: %f"%(errorCount/float(numTestVecs)))
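Taking the first 10% of the rows as the test set is only fair because the file is already in random order. If a data set were sorted (for example by label), shuffling the rows first would be one option. The sketch below assumes numpy.random.default_rng for the permutation; the name shuffle_rows, the seed, and the tiny data set are all made up for illustration.

```python
import numpy


def shuffle_rows(data, labels, seed=0):
    """Return data and labels reordered by the same random permutation."""
    rng = numpy.random.default_rng(seed)
    order = rng.permutation(len(labels))
    shuffled_data = numpy.asarray(data)[order]
    shuffled_labels = [labels[i] for i in order]
    return shuffled_data, shuffled_labels


# Made-up data sorted by label, which a head-of-file split would handle badly
data = numpy.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
labels = [1, 1, 1, 2, 2, 2]
shuffled_data, shuffled_labels = shuffle_rows(data, labels)
print(shuffled_labels)
```

Each row keeps its own label because the same permutation is applied to both the matrix and the label list.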
At the Python prompt, reload the kNN.py module and run the function to check the result:
>>> import importlib
>>> importlib.reload(kNN)
<module 'kNN' from 'E:\\pyCharm\\workspace\\test1\\learning\\kNN.py'>
>>> kNN.datingClassTest()
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 1
the total error rate is: 0.050000
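As shown, the error rate is 5%. You can change the variable hoRatio inside datingClassTest and the value of k to see whether the error rate changes with them. The classifier's output depends on the classification algorithm, the data set, and the program settings.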
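One way to try different values of hoRatio and k is to turn them into parameters and loop over a few combinations. The sketch below is self-contained and runs on synthetic data (two noisy clusters generated with a fixed seed), not on the dating file, so its error rates say nothing about the real data set; the names knn_predict and holdout_error are hypothetical.

```python
from collections import Counter

import numpy


def knn_predict(in_x, data_set, labels, k):
    """Majority label among the k rows of data_set nearest to in_x."""
    distances = numpy.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


def holdout_error(data_set, labels, ho_ratio, k):
    """Error rate when the first ho_ratio share of rows is used for testing."""
    m = data_set.shape[0]
    num_test = int(m * ho_ratio)
    errors = 0
    for i in range(num_test):
        # Test row i is classified against the rows that were not held out
        predicted = knn_predict(data_set[i], data_set[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / num_test


# Synthetic stand-in for normalized data: two noisy clusters in random order
rng = numpy.random.default_rng(0)
points = numpy.vstack([rng.normal(0.3, 0.15, (100, 2)),
                       rng.normal(0.7, 0.15, (100, 2))])
point_labels = [1] * 100 + [2] * 100
order = rng.permutation(200)
points = points[order]
point_labels = [point_labels[i] for i in order]

for ho_ratio in (0.1, 0.2):
    for k in (1, 3, 7):
        rate = holdout_error(points, point_labels, ho_ratio, k)
        print("hoRatio=%.1f k=%d error rate=%.3f" % (ho_ratio, k, rate))
```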
2.4 Using the algorithm: building a complete system
Add the following code to kNN.py and reload the module:
# Prediction function for the dating site
def classifyPerson():
    resultList = ['not at all','in small doses','in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = numpy.array([ffMiles,percentTats,iceCream])
    # Scale the input with the training minVals and ranges before classifying
    classifierResult = classify0((inArr - minVals)/ranges,normMat,datingLabels,3)
    print("you will probably like this person:",resultList[classifierResult - 1])
Run the function and enter the three feature values for a user; it returns the prediction:
>>> import importlib
>>> importlib.reload(kNN)
<module 'kNN' from 'E:\\pyCharm\\workspace\\test1\\learning\\kNN.py'>
>>> kNN.classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
you will probably like this person: in small doses
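One detail in the prediction function is easy to miss: the new input is scaled with the minimum values and ranges computed from the training data, not with statistics of its own. The sketch below isolates that step on a made-up three-row training set; all numbers and variable names are illustrative assumptions.

```python
import numpy

# Made-up training rows: [miles per year, % time gaming, liters of ice cream]
train = numpy.array([[1000.0, 1.0, 0.2],
                     [50000.0, 12.0, 1.0],
                     [90000.0, 20.0, 1.8]])

min_vals = train.min(axis=0)
ranges = train.max(axis=0) - min_vals
norm_train = (train - min_vals) / ranges

# A new person is scaled with the training min_vals and ranges,
# so the query ends up in the same coordinate system as norm_train
query = numpy.array([10000.0, 10.0, 0.5])
norm_query = (query - min_vals) / ranges

# Distance from the scaled query to every scaled training row
distances = numpy.sqrt(((norm_train - norm_query) ** 2).sum(axis=1))
print(norm_query)
print(distances.argmin())
```

A query outside the training range would map to a value below 0 or above 1, which the distance calculation still handles.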