机器学习实战学习笔记系列
机器学习实战——K-近邻算法【1:从文本中解析数据并可视化】
机器学习实战——K-近邻算法【2:改进约会网站配对效果】
1. 首先准备数据,将数据特征归一化
采用下列公式将任意取值范围的特征值转化为0~1区间内的值:
NewValue = (oldValue - min) / (max - min)
其中max 和min 分别是数据集中的最小特征值和最大特征值
归一化代码为:
def autoNorm(dataSet):
minVals = dataSet.min(0)
maxVals = dataSet.max(0)
ranges = maxVals - minVals
normDataSet = zeros(shape(dataSet))
m = dataSet.shape[0]
normDataSet = dataSet - tile(minVals, (m,1))
normDataSet = normDataSet/tile(ranges, (m,1)) #element wise divide
return normDataSet, ranges, minVals
在函数autoNorm()中 ,我们将每列的最小值放在变量minVals中,将最大值放在变量 maxVals中, 其中dataSet.min(0)中的参数0使得函数可以从列中选取最小值,而不是选取当 前行的最小值。然后,函数计算可能的取值范围,并创建新的返回矩阵。正如前面给出的公式,为了归一化特征值,我们必须使用当前值减去最小值,然后除以取值范围。 需要注意的是,特征值矩阵有10003个值,而minVals和ranges的值都为13。为了解决这个冋题,我们使用Numpy 库中tile()函数将变量内容复制成输人矩阵同样大小的矩阵,注意这是具体特征值相除,而对于某些数值处理软件包,“/”可能意味着矩阵除法,但在Numpy库中,矩阵除法需要使用函数linalg.solve(matA, matB)
执行归一化函数:
>>> imp.reload(kNN)
<module 'kNN' from 'D:\\Program Files\\Python35\\machinelearninginaction\\Ch02\\kNN.py'>
>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.44832535, 0.39805139, 0.56233353],
[ 0.15873259, 0.34195467, 0.98724416],
[ 0.28542943, 0.06892523, 0.47449629],
...,
[ 0.29115949, 0.50910294, 0.51079493],
[ 0.52711097, 0.43665451, 0.4290048 ],
[ 0.47940793, 0.3768091 , 0.78571804]])
>>> ranges
array([ 9.12730000e+04, 2.09193490e+01, 1.69436100e+00])
>>> minVals
array([ 0. , 0. , 0.001156])
测试验证分类器
创建函数datingClassTest(),针对约会网站进行测试。
def classify0(inX, dataSet, labels, k):
dataSetSize = dataSet.shape[0]
diffMat = tile(inX, (dataSetSize,1)) - dataSet
sqDiffMat = diffMat**2
sqDistances = sqDiffMat.sum(axis=1)
distances = sqDistances**0.5
sortedDistIndicies = distances.argsort()
classCount={}
for i in range(k):
voteIlabel = labels[sortedDistIndicies[i]]
classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
return sortedClassCount[0][0]
def datingClassTest():
hoRatio = 0.50 #hold out 10%
datingDataMat,datingLabels = file2matrix('datingTestSet2.txt') #load data setfrom file
normMat, ranges, minVals = autoNorm(datingDataMat)
m = normMat.shape[0]
numTestVecs = int(m*hoRatio)
errorCount = 0.0
for i in range(numTestVecs):
classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
print ("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
if (classifierResult != datingLabels[i]): errorCount += 1.0
print ("the total error rate is: %f" % (errorCount/float(numTestVecs)))
print (errorCount)
注意这里原文中代码
sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
中iteritems()为python2.x中的格式,python3.X直接调用的话会报错
AttributeError: ‘dict’ object has no attribute ‘iteritems’
这里将iteritems 改为items即可。
即:
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
重新加载kNN.py就可以调用啦
>>> imp.reload(kNN)
<module 'kNN' from 'D:\\Program Files\\Python35\\machinelearninginaction\\Ch02\\kNN.py'>
>>> kNN.datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
…
the total error rate is: 0.064000
分类器处理约会数据集的错误率是6.4%。
我们可以改变函数 datingClassTest内变量hoRatio和变量k的值,检测错误率是否随着变量值的变化而增加。依赖于分类算法、数据集和程序设置,分类器的输出结果可能有很大的不同。