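k-Nearest Neighbors (kNN) algorithm
The code walkthrough in this post borrows from another author's write-up, but I can no longer find the link; if you come across it, please send it to me and I will add it. I will take this down if it infringes.
Algorithm implementation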
    import operator
    from numpy import tile

    def classsify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]                      # 1
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet     # 2
        sqDiffMat = diffMat**2                              # 3
        sqDistances = sqDiffMat.sum(axis=1)                 # 4
        distances = sqDistances**0.5                        # 5
        sortedDisIndicies = distances.argsort()             # 6
        classCount = {}
        for i in range(k):
            voteIlabel = labels[sortedDisIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
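As a quick sanity check of the function above, the same idea can be sketched more compactly with NumPy broadcasting and `collections.Counter`. This is only a minimal sketch, not the original author's code: the name `knn_classify` and the four sample points are assumptions made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k samples nearest to in_x.

    Hypothetical helper for illustration; it follows the same steps as the
    function above but relies on broadcasting instead of tile().
    """
    diffs = np.asarray(data_set, dtype=float) - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))  # Euclidean distance per row
    nearest = distances.argsort()[:k]              # indices of the k closest rows
    votes = Counter(labels[i] for i in nearest)    # label -> number of votes
    return votes.most_common(1)[0][0]


# Made-up sample data: two points near (1, 1) and two near (0, 0).
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']

print(knn_classify([0.0, 0.0], group, group_labels, 3))   # B
print(knn_classify([1.0, 1.2], group, group_labels, 3))   # A
```

With k = 3 the query at the origin collects two 'B' votes and one 'A' vote, so 'B' wins; an odd k avoids ties when there are only two classes.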
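#####The following steps compute the distance from each point in the data set to the input point
- In step 1, shape[0] is the number of rows in the input sample set. A matrix's shape is a tuple: calling dataSet.shape directly returns (4,2) here, that is, (number of rows, number of columns). shape[0] is therefore the number of rows, which is the number of samples, and shape[1] is the number of columns.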
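- In step 2, to compute the distance between the input vector and every sample, tile repeats the input vector until it has as many rows as the data set, so that the two can be subtracted. The (dataSetSize,1) argument means inX is repeated dataSetSize times along the rows and once along the columns. diffMat is then the element-wise difference of the two arrays.
    # "Mat" here is short for "Matrix": diffMat is the difference of two matrices, and the result is also a matrix
    # For an explanation of tile, see http://www.cnblogs.com/Sabre/p/7976702.html
    # In short, inX (here [1,1]) is repeated dataSetSize times along the row dimension (here dataSetSize==4) and once along the column dimension,
    # which gives the matrix [[1,1],[1,1],[1,1],[1,1]] so that it can be combined with dataSet
    # This is needed because the Euclidean distance formula is used to find the distance from the input point to each existing point
    # Step 1: take the difference between the given point [1,1] and the 4 known points; the output is a matrix
    diffMat = tile(inX,(dataSetSize,1)) - dataSet
    #print(tile(inX,(dataSetSize,1)))
    ################### illustrative code ########################
    #print("diffMat:" + str(diffMat))
    diffMat:[[ 2 1] [ 1 0] [ 2 2] [-1 -2]]
    ###################################################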
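- Step 3 squares each element of diffMat.
- In step 4, sum(axis=1) adds up the entries of each row. In effect, after the data set has been subtracted from the input and the differences squared, this sums the squared differences; each row holds the result for the input vector and one sample of the data set.
    # sum(axis=1) adds the values in each row of the matrix, e.g. [[0 0] [1 1] [0 1] [9 9]] gives [0,2,1,18], the sums of squares
    # sum(axis=0) adds the values in each column of the matrix
    # Step 3: sum
    sqDistances = sqDiffMat.sum(axis=1)
    #print("sqDistances:", end="")
    #print(sqDistances[875])
    ################### illustrative code ########################
    #print("sqDistances:" + str(sqDistances))
    sqDiffMat:[[4 1] [1 0] [4 4] [1 4]]
    sqDistances:[5 1 8 5]
    ###################################################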
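- Step 5 takes the square root to obtain the distances.
    ################### illustrative code ########################
    #print("Distances from the unknown point to each known point:",distances)
    Distances from the unknown point to each known point: [ 2.23606798 1. 2.82842712 2.23606798]
    ###################################################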
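Steps 1 to 5 can be reproduced end to end in a few lines. This is a minimal sketch: the data set below is an assumption, chosen only so that the intermediate values match the diffMat, sqDistances and distances printed above (the real data set is not shown in the text). It also checks that NumPy broadcasting gives the same difference matrix as tile.

```python
import numpy as np

# Assumed data: picked so that tile(inX) - dataSet reproduces the printed diffMat.
inX = np.array([1, 1])
dataSet = np.array([[-1, 0], [0, 1], [-1, -1], [2, 3]])

dataSetSize = dataSet.shape[0]                       # step 1: number of samples (4)
diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet   # step 2: element-wise difference
sqDiffMat = diffMat ** 2                             # step 3: square each element
sqDistances = sqDiffMat.sum(axis=1)                  # step 4: sum across each row
distances = sqDistances ** 0.5                       # step 5: square root

# Broadcasting stretches inX across the rows automatically, so tile is optional.
broadcastDiff = inX - dataSet

print(diffMat.tolist())       # [[2, 1], [1, 0], [2, 2], [-1, -2]]
print(sqDistances.tolist())   # [5, 1, 8, 5]
print(np.array_equal(diffMat, broadcastDiff))   # True
```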
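####The following step sorts by distance in ascending order
In step 6, argsort() sorts the distance array in ascending order and returns the indices of the elements in that order rather than the values themselves. For example, if the original array is [2,5,1], its elements sit at positions 0,1,2; after sorting in ascending order the corresponding indices are [2,0,1].
    ################### illustrative code ########################
    #print("Index positions:", sortedDistIndicies)    # the first k indices can be read off from this
    Index positions: [1 0 3 2]
    ###################################################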
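The [2,5,1] example can be checked directly. The sketch below also runs argsort on the four distances printed earlier; passing `kind="stable"` is my own choice here (not something the original code does) so that the two tied distances keep their original relative order.

```python
import numpy as np

values = np.array([2, 5, 1])
order = values.argsort()          # indices that would sort the array ascending

print(order.tolist())             # [2, 0, 1]
print(values[order].tolist())     # [1, 2, 5]

# The distances from the walkthrough; indices 0 and 3 are tied.
distances = np.array([2.23606798, 1.0, 2.82842712, 2.23606798])
stable_order = distances.argsort(kind="stable")
print(stable_order.tolist())      # [1, 0, 3, 2]
```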
####The loop below finds the majority class among the k nearest samples
    # Create an empty dict
    classCount = {}
    # k is the number of nearest samples that get a vote
    for i in range(k):
        # Look up the label of the sample at index sortedDistIndicies[i]
        # In this example (sorted indices [1 0 3 2], k==4) that gives:
        # sortedDistIndicies[0]==1, so labels[1]=='A', voteIlabel=='A'
        # sortedDistIndicies[1]==0, so labels[0]=='A', voteIlabel=='A'
        # sortedDistIndicies[2]==3, so labels[3]=='B', voteIlabel=='B'
        # sortedDistIndicies[3]==2, so labels[2]=='B', voteIlabel=='B'
        voteIlabel = labels[sortedDistIndicies[i]]
        ################### illustrative code ########################
        # print(voteIlabel)
        # print("label " + str(i) + ": " + str(voteIlabel))
        ###################################################
        # dict.get(key, default=None) returns the value stored for key, or default if the dict has no such key
        # (the default for default is None; here it is set to 0)
        # On the first call to classCount.get, classCount is still empty
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        ################### illustrative code ########################
        # print("Access " + str(i+1) + ": classCount[" + str(voteIlabel) + "] is " + str(classCount[voteIlabel]))
        # print("classCount contents:")
        # print(classCount)
        ###################################################
    Access 1: classCount[A] is 1
    classCount contents:
    {'A': 1}    label A has appeared once
    label 1: A
    Access 2: classCount[A] is 2
    classCount contents:
    {'A': 2}    A has appeared twice
    label 2: B
    Access 3: classCount[B] is 1
    classCount contents:
    {'A': 2, 'B': 1}    A has appeared twice, B once
    label 3: B
    Access 4: classCount[B] is 2
    classCount contents:
    {'A': 2, 'B': 2}
    [('A', 2), ('B', 2)]
    ###################################################
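The counting idiom in the loop can be tried on its own. In this small sketch the labels and sorted indices are assumptions chosen to be consistent with the trace above, and k is set to 3 so that the vote is not tied; `collections.Counter` is shown as an equivalent standard-library shortcut, not as what the original code uses.

```python
from collections import Counter

labels = ['A', 'A', 'B', 'B']      # assumed labels, consistent with the trace
sorted_indices = [1, 0, 3, 2]      # nearest sample first
k = 3

class_count = {}
for i in range(k):
    vote = labels[sorted_indices[i]]
    # get() returns 0 the first time a label is seen, so no KeyError is raised
    class_count[vote] = class_count.get(vote, 0) + 1

print(class_count)                 # {'A': 2, 'B': 1}

# The same tally with Counter.
counter = Counter(labels[i] for i in sorted_indices[:k])
print(dict(counter) == class_count)   # True
```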
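####The following sorts the tuples by their second element in descending order and returns the most frequent class
    # sorted(iterable[, cmp[, key[, reverse]]])    (Python 2 signature)
    # Purpose: Return a new sorted list from the items in iterable.
    # The first argument is an iterable; the return value is a list of the iterable's elements in sorted order.
    # There are three optional arguments: cmp, key and reverse.
    # 1) cmp is a custom comparison function that takes two arguments (elements of the iterable). It returns a negative number if the first is smaller than the second, zero if they are equal, and a positive number if the first is larger. Defaults to None.
    #    Python 3 removed cmp: the signature is sorted(iterable, *, key=None, reverse=False).
    # 2) key is a function of one argument that extracts a comparison key from each element. Defaults to None.
    #    Since Python 2.4, list.sort() and sorted() accept a key argument naming a function that is called on each element before comparisons are made.
    #    The function takes exactly one argument and returns the value to compare on. This technique is fast because the key function is called exactly once for each element.
    #    key=operator.itemgetter(0) or key=operator.itemgetter(1) decides whether the dict items are sorted by key or by value:
    #    0 sorts by key, 1 sorts by value
    # 3) reverse is a boolean. If it is True, the list is sorted in descending order.
    # operator.itemgetter(1) is hard to describe in words; the example below makes it clear
    # a=[11,22,33]
    # b = operator.itemgetter(2)
    # b(a)
    # Output: 33
    # b = operator.itemgetter(2,0,1)
    # b(a)
    # Output: (33,11,22)
    # operator.itemgetter does not return a value but a function; the value is obtained by applying that function to an object
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1), reverse=True)
    #print(sortedClassCount)
    # Return the first entry after the descending sort: the most common label among the k nearest neighbors decides the class of the test sample
    print("Final result, class of the test sample: " , end="")
    print(sortedClassCount[0][0])
    return sortedClassCount[0][0]
    ###################################################
    Final result, class of the test sample: A
    [Finished in 5.3s]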
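The sorting step can be explored in isolation. The vote counts in this sketch are made up; it shows sorting the (label, count) pairs by count and by label, and how a tie such as the {'A': 2, 'B': 2} case above is resolved: `sorted` is stable even with `reverse=True`, so tied entries keep their insertion order and 'A' comes first.

```python
import operator

class_count = {'A': 2, 'B': 3, 'C': 1}   # made-up vote counts

# itemgetter(1) picks the count out of each (label, count) tuple.
by_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(by_count)          # [('B', 3), ('A', 2), ('C', 1)]
print(by_count[0][0])    # B

# itemgetter(0) sorts by label instead.
by_label = sorted(class_count.items(), key=operator.itemgetter(0))
print(by_label)          # [('A', 2), ('B', 3), ('C', 1)]

# A tie: equal counts keep their original (insertion) order.
tied = sorted({'A': 2, 'B': 2}.items(), key=operator.itemgetter(1), reverse=True)
print(tied)              # [('A', 2), ('B', 2)]
```

If only the winner is needed, `max(class_count, key=class_count.get)` avoids sorting the whole list.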
Algorithm application
##Dating example
####Parser that converts a text file into NumPy arrays
    from numpy import zeros

    def file2matrix(filename):
        # filename is the local path of the data file, e.g. datingTestSet2.txt
        # placed in the same directory as this .py file
        with open(filename) as fr:
            arrayOLines = fr.readlines()
        numberOFLines = len(arrayOLines)
        returnMat = zeros((numberOFLines, 3))
        classLabelVector = []
        index = 0
        for line in arrayOLines:
            line = line.strip()
            listFromLine = line.split('\t')    # split on tabs; strip('\t') would leave one string
            returnMat[index, :] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        return returnMat, classLabelVector
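####Annotated version
    def file2matrix(filename):
        print("Reading file " + str(filename))
        # The next two lines open the text file and read its contents into a list
        fr = open(filename)
        arrayOLines = fr.readlines()    # turn the text in the file into a list of lines
        numberOfLines = len(arrayOLines)    # number of lines in the file
        # Create the NumPy matrix to return, filled with zeros (1000 rows for this file).
        # The number of columns is up to you; three features matter here, so there are three columns.
        # Given a single number, zeros(2) returns a 1-D array holding two zeros.
        returnMat = zeros((numberOfLines,3))
        #print(returnMat)
        classLabelVector = []
        index = 0
        for line in arrayOLines:
            line = line.strip()
            listFromLine = line.split('\t')
            print(listFromLine)
            # Ways to access elements of the matrix
            # returnMat[1,0:3]: the row at index 1, columns 0 up to and including 2
            # returnMat[1,0:]: the row at index 1, from column 0 to the last column
            # returnMat[1,:]: the row at index 1, all columns
            # returnMat[2,:3]: the row at index 2, columns 0 up to and including 2
            returnMat[index,0:] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        #print(returnMat[1,0:4])
        return returnMat,classLabelVector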
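To try the parsing logic without the data file, the same steps can be run on an in-memory string. This is a sketch under stated assumptions: `parse_lines` is a hypothetical stand-in for file2matrix, and the two tab-separated rows are made up (three features followed by an integer label).

```python
import io

import numpy as np

# Made-up rows in the assumed layout: feature, feature, feature, label.
sample_text = "1000\t2.5\t0.5\t3\n2000\t7.1\t1.6\t2\n"


def parse_lines(lines):
    """Hypothetical stand-in for file2matrix that accepts any iterable of lines."""
    lines = list(lines)
    features = np.zeros((len(lines), 3))
    class_labels = []
    for index, line in enumerate(lines):
        fields = line.strip().split('\t')                 # split on tabs
        features[index, :] = [float(v) for v in fields[0:3]]
        class_labels.append(int(fields[-1]))              # last field is the label
    return features, class_labels


features, class_labels = parse_lines(io.StringIO(sample_text))
print(features.shape)             # (2, 3)
print(features[1, 0:3].tolist())  # [2000.0, 7.1, 1.6]
print(class_labels)               # [3, 2]
print(np.zeros(2).shape)          # (2,)  a 1-D array, not a 1x2 matrix
```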
####Calling it
    >>> import kNN1
    >>> datingDataMat,datingLabels = kNN1.file2matrix(r'C:/Users/mac/PycharmProjects/mechine learning/KNN/datingTestSet2.txt')    # I hit a lot of problems at this step: spell out the full file path and put r in front of it!
    # datingDataMat holds the three features; datingLabels holds the final rating labels
    >>> datingDataMat
    array([[ 2.6, 2.6, 2.6],
    [ 3.9, 3.9, 3.9],
    [ 9.8, 9.8, 9.8],
    [ 2. , 2. , 2. ],
    [ 3.4, 3.4, 3.4],
    [ 9.9, 9.9, 9.9],
    [10. , 10. , 10. ],
    [ 9.1, 9.1, 9.1],
    [ 7.8, 7.8, 7.8]])
    >>> datingLabels[0:5]
    [2, 2, 2, 2, 2]
####Analyzing the data with a matplotlib scatter plot
    import matplotlib
    import matplotlib.pyplot as plt
    from numpy import array

    def g():
        datingDataMat,datingLabels = file2matrix('C:/Users/mac/PycharmProjects/mechine learning/KNN/datingTestSet2.txt')
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.scatter(datingDataMat[:,1],datingDataMat[:,0],
                   15.0*array(datingLabels),15.0*array(datingLabels))
        plt.show()
The two 15.0*array(datingLabels) arguments in the example above set the marker size and the marker color respectively. Details follow.
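- Key point 1: plt.figure()
To use a drawing-board-and-paper analogy, the figure is the drawing board that holds the paper, while the actual drawing happens on the paper. In pyplot, the paper corresponds to Axes/Subplot.
The figure signature is as follows:
figure(num=None, figsize=None, dpi=None, facecolor=None, edgecolor=None, frameon=True)
num: the figure's number or name; a number is treated as an id, a string as a name
figsize: the width and height of the figure, in inches
dpi: the resolution of the figure in pixels per inch; the default comes from rcParams['figure.dpi'] (80 in older Matplotlib releases, 100 since Matplotlib 2.0)
facecolor: background color
edgecolor: border color
frameon: whether to draw the frame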
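- Key point 2: adding subplots with add_subplot
Subplots: several plots drawn inside a single figure.
Overview of the Matplotlib objects:
FigureCanvas: the canvas
Figure: the figure
Axes: the axes (where the actual drawing happens)
The arguments of add_subplot are similar to those of subplots. The subplots signature is:
subplots(nrows, ncols, sharex, sharey, subplot_kw, **fig_kw)
nrows: number of rows of subplots
ncols: number of columns of subplots
sharex: all subplots use the same x-axis ticks
sharey: all subplots use the same y-axis ticks
subplot_kw: dict of keywords used to create each subplot
**fig_kw: other keywords passed on when the figure is created, e.g. plt.subplots(2,2,figsize=(8,6))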
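- Key point 3: scatter(x, y, marker size, color, marker)
Signature and parameters of scatter in matplotlib:
plt.scatter(x, y, s=20, c=None, marker='o', cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors=None)
(This signature comes from an older Matplotlib release; in current releases s and marker also default to None and are filled in from rcParams.)
x: the x-axis data of the scatter plot
y: the y-axis data of the scatter plot
s: marker size, 20 in the signature above; passing an array of sizes gives a bubble chart
c: marker color, blue by default in older releases
marker: marker shape, a circle by default
cmap: the colormap; only takes effect when c is an array of floats
norm: normalizes the data to the range 0~1 before color mapping; also requires c to be an array of floats
vmin, vmax: limits of the color scale, similar in role to norm; they are not used together with norm
alpha: transparency of the markers
linewidths: width of the marker edges
edgecolors: color of the marker edges
Reference: 从零开始学Python【15】--matplotlib(散点图)
scatter(x, y, marker size, color, marker) covers the most common uses. If s= and c= are not written out, the arguments are matched by position in this default order; if they are written as keywords, the keywords decide and the order no longer matters.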
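The three key points fit together as in the sketch below. It is an illustration under stated assumptions rather than the original plot: the coordinates and labels are made up, the non-interactive "Agg" backend is selected so that no display is needed, and the image is written to an in-memory buffer instead of being shown with plt.show().

```python
import io

import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no window is required
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for two feature columns and the class labels.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
point_labels = np.array([1, 1, 2, 2, 3, 3])

fig = plt.figure(figsize=(6, 4))   # the "drawing board"
ax = fig.add_subplot(111)          # one Axes: 1 row, 1 column, first slot

# Keyword form: s sets the size and c the color, both scaled by the label,
# so each class gets its own marker size and colormap color.
points = ax.scatter(x, y, s=15.0 * point_labels, c=15.0 * point_labels, marker='o')

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # render to memory instead of plt.show()
plt.close(fig)

print(len(points.get_offsets()))   # 6 points were drawn
```

Writing `s=` and `c=` explicitly is equivalent to passing the two arrays as the third and fourth positional arguments, as the original call does.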
####Normalizing the raw values of the dating-site data
    from numpy import zeros, shape, tile

    def autoNorm(dataSet):
        minVals = dataSet.min(0)                          #1
        maxVals = dataSet.max(0)
        ranges = maxVals - minVals                        #2
        normDataSet = zeros(shape(dataSet))               #3
        m = dataSet.shape[0]                              #4
        normDataSet = dataSet - tile(minVals,(m,1))       #5
        normDataSet = normDataSet/tile(ranges,(m,1))      #6
        return normDataSet,ranges,minVals
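The normalization formula:
newValue = (oldValue - min) / (max - min)
This rescales the data to the interval [0, 1].
In detail: oldValue is the original dating data, and min and max are the minimum and maximum of each column.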
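- In #1, min(0) finds the minimum of each column of the matrix; min(1) would find the minimum of each row. max() works the same way.
- In #2, ranges implements the (max-min) term of the normalization formula.
- In #3, shape() returns the size of the matrix, i.e. how many rows and columns it has. zeros then builds an all-zero matrix with the same dimensions as dataSet.
- In #4, shape[0] gives the number of rows; shape[1] gives the number of columns.
- In #5, tile() repeats its first argument in an m×1 layout. Treating minVals as a single unit, it stacks m copies of it into a matrix with m rows, which is subtracted from the original dating data to give the (oldValue - min) term of the formula.
- #6 completes the normalization. The / here divides element by element; for matrix division you would need linalg.solve(matA, matB).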
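The formula and the six steps can be checked on a tiny array. This is a minimal sketch that uses broadcasting in place of tile; the function name `auto_norm` and the 3×2 sample matrix are assumptions for illustration.

```python
import numpy as np


def auto_norm(data_set):
    """Min-max scale each column to [0, 1] (hypothetical broadcasting variant)."""
    min_vals = data_set.min(axis=0)               # per-column minimum
    ranges = data_set.max(axis=0) - min_vals      # per-column (max - min)
    # Broadcasting applies the 1-D min_vals and ranges to every row.
    # Note: a constant column would make its range 0 and divide by zero.
    norm_data_set = (data_set - min_vals) / ranges
    return norm_data_set, ranges, min_vals


data = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
norm, ranges, min_vals = auto_norm(data)

print(norm.tolist())       # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(ranges.tolist())     # [20.0, 4.0]
print(min_vals.tolist())   # [10.0, 1.0]
```

Without this rescaling, the feature with the largest numeric range would dominate the Euclidean distance.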
####Testing the classifier
    def datingClassTest():
        hoRatio = 0.10    # hold out the first 10% of the rows as test data
        datingDataMat,datingLabels = file2matrix('C:/Users/mac/PycharmProjects/mechine learning/datingTestSet2.txt')
        normMat,ranges,minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]
        numTestVecs = int(m*hoRatio)
        errorCount = 0.0
        for i in range(numTestVecs):
            # classify test row i against the remaining 90% of the rows
            classifierResult = classsify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
            print("the classifier came back with:%d,the real answer is:%d" % (classifierResult,datingLabels[i]))
            if (classifierResult != datingLabels[i]):
                errorCount += 1.0
        print("the total error rate is:%f" % (errorCount/float(numTestVecs)))
The output is:
import kNN1 kNN1.datingClassTest() the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the 
classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:3,the real answer is:3 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real 
answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:3,the real answer is:3 the classifier came back with:3,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:3 the classifier came back with:1,the real answer is:1 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:3 the classifier came back with:3,the real answer is:3 the classifier came back with:2,the real answer is:2 the classifier came back with:1,the real answer is:1 the classifier came back with:3,the real answer is:1 the total error rate is:0.050000
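The hold-out procedure itself does not depend on the dating file. The sketch below reruns it on synthetic data under stated assumptions: `nearest_label` and `hold_out_error` are hypothetical helpers, and the two well-separated clusters are generated with a fixed seed, so the numbers say nothing about the dating set.

```python
import numpy as np


def nearest_label(in_x, data_set, labels, k):
    """Hypothetical compact kNN used only for this sketch."""
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[counts.argmax()]       # most frequent label among the k


def hold_out_error(data_set, labels, ho_ratio=0.10, k=3):
    """Classify the first ho_ratio of the rows against the remaining rows."""
    m = data_set.shape[0]
    num_test = int(m * ho_ratio)
    errors = 0
    for i in range(num_test):
        predicted = nearest_label(data_set[i], data_set[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / num_test


# Synthetic, well-separated clusters (assumed data, not the dating set).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
data = np.vstack([cluster_a, cluster_b])
labels = np.array([1] * 50 + [2] * 50)
order = rng.permutation(len(data))       # shuffle so the test slice mixes both classes
data, labels = data[order], labels[order]

error_rate = hold_out_error(data, labels)
print("the total error rate is:%f" % error_rate)
```

Taking the first 10% of the rows as the test set, as datingClassTest does, only gives a fair estimate when the rows are in random order; that is why this sketch shuffles its synthetic data first.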
####Prediction function: how much will you like this person
    def classifyPerson():
        resultList = ['not at all','in small doses','in large doses']
        percentTats = float(input("percentage of time spent playing video games?"))
        ffMiles = float(input("frequent flier miles earned per year?"))
        iceCream = float(input("liters of ice cream consumed per year?"))
        datingDataMat,datingLabels = file2matrix('C:/Users/mac/PycharmProjects/mechine/kNN/datingTestSet2.txt')
        normMat, ranges, minVals = autoNorm(datingDataMat)
        inArr = array([percentTats,ffMiles,iceCream])
        # scale the new sample with the training minimum and range before classifying
        classifierResult = classsify0((inArr-minVals)/ranges,normMat,datingLabels,3)
        print("you will probably like this person:",resultList[classifierResult - 1])
The output is:
    >>> import kNN1
    >>> kNN1.classifyPerson()
    percentage of time spent playing video games?10
    frequent flier miles earned per year?10000
    liters of ice cream consumed per year?0.5
    you will probably like this person: in small doses
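Two details of the prediction function are easy to miss: the new sample must be scaled with the minimum and range of the training data, and the numeric label (1 to 3) is mapped to a description through index `label - 1`. A minimal sketch with made-up training rows (the feature order mirrors the array built in the function above, and the predicted label 2 is simply assumed):

```python
import numpy as np

# Made-up training features in the same order as inArr:
# [game-time percentage, flier miles, liters of ice cream].
training = np.array([[0.0, 1000.0, 0.2],
                     [10.0, 50000.0, 1.0],
                     [20.0, 90000.0, 1.8]])
min_vals = training.min(axis=0)
ranges = training.max(axis=0) - min_vals

in_arr = np.array([10.0, 10000.0, 0.5])
scaled = (in_arr - min_vals) / ranges    # same transform as the training rows

result_list = ['not at all', 'in small doses', 'in large doses']
predicted_label = 2                      # assumed classifier output, for illustration
description = result_list[predicted_label - 1]

print(scaled.round(4).tolist())          # [0.5, 0.1011, 0.1875]
print(description)                       # in small doses
```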