Reference book: Machine Learning in Action
Experiment: predict whether a dating candidate is attractive to the user.
Input data: each candidate is described by three features: frequent-flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week. (My take: these three features roughly reflect a person's wealth, leisure habits, and diet.)
Sample set: data for 1000 dating candidates, each carrying one of three labels: not attractive at all, moderately attractive, very attractive.
Procedure:
1. Use 90% of the samples as the training set and the remaining 10% as the test set, and measure the error rate of classify.py.
2. Let the user enter a candidate's features and output the predicted label as a recommendation.
Code files:
file2Matrix.py: the sample set is stored in a txt file; this function loads it into memory as a NumPy array
plotDataSet.py: plots the sample set (only two of the three features can be shown per plot)
autoNorm.py: normalizes the features, since their value ranges differ widely
datingClassTest.py: measures the error rate
classify.py: the classification (prediction) function
classifyPerson.py: takes a person's features and outputs the prediction
knn.py: the main program
Source files:
file2Matrix.py: the sample set is stored in a txt file; this function loads it into memory as a NumPy array
__author__ = 'root'
import numpy as np

def file2Matrix(filename):
    # read all lines of the sample file; lines is a list of strings
    with open(filename, mode='r') as fileHandle:
        lines = fileHandle.readlines()
    # pre-allocate the feature matrix: one row per sample, three features
    datingDataSet = np.zeros((len(lines), 3))
    labels = []
    # parse each tab-separated line into three features and a label
    for i, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        datingDataSet[i, :] = listFromLine[0:3]
        labels.append(int(listFromLine[-1]))
    return datingDataSet, labels
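For reference, each line of datingTestSet2.txt holds three tab-separated feature values followed by an integer label from 1 to 3. A minimal self-contained sketch of the same parsing logic, using an in-memory string instead of the real file (the two sample lines are made up for illustration):

```python
import numpy as np

# two illustrative lines in the datingTestSet2.txt format:
# flyMiles<TAB>gameTime<TAB>iceCream<TAB>label
raw = "40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n"

lines = raw.splitlines()
dataSet = np.zeros((len(lines), 3))
labels = []
for i, line in enumerate(lines):
    fields = line.strip().split('\t')
    dataSet[i, :] = fields[0:3]        # NumPy converts the strings to floats
    labels.append(int(fields[-1]))

print(dataSet.shape)   # (2, 3)
print(labels)          # [3, 2]
```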
plotDataSet.py: plots the sample set (only two of the three features can be shown per plot)
__author__ = 'root'
import numpy as np
import matplotlib.pyplot as plt

def plotDataSet(datingDataSet, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # scale both marker size and color by the class label so the
    # three classes are visually distinguishable
    ax.scatter(datingDataSet[:, 0], datingDataSet[:, 1],
               15 * np.array(labels), 15 * np.array(labels))
    plt.show()
autoNorm.py: normalizes the features, since their value ranges differ widely
__author__ = 'root'
import numpy as np

def autoNorm(datingDataSet):
    # per-feature (column-wise) minimum and maximum
    dataSetMin = datingDataSet.min(axis=0)
    dataSetMax = datingDataSet.max(axis=0)
    # normalized = (value - min) / (max - min); NumPy broadcasting
    # applies the 1-D min/max row by row, so np.tile is unnecessary
    datingDataSet = (datingDataSet - dataSetMin) / (dataSetMax - dataSetMin)
    return datingDataSet, dataSetMin, dataSetMax
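The (value - min) / (max - min) rescaling can be checked on a toy matrix; the numbers below are invented for illustration, with the three columns deliberately on very different scales, as in the dating data:

```python
import numpy as np

# made-up samples: rows are candidates, columns are
# (fly miles, game-time percentage, liters of ice cream)
data = np.array([[40000.0, 8.0, 0.5],
                 [14000.0, 2.0, 1.5],
                 [75000.0, 5.0, 1.0]])

dataMin = data.min(axis=0)   # per-column minimum: [14000, 2, 0.5]
dataMax = data.max(axis=0)   # per-column maximum: [75000, 8, 1.5]
# broadcasting stretches the 1-D min/max over every row
normed = (data - dataMin) / (dataMax - dataMin)

print(normed.min(axis=0))    # [0. 0. 0.]
print(normed.max(axis=0))    # [1. 1. 1.]
```

After normalization every feature lies in [0, 1], so no single feature dominates the Euclidean distance.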
datingClassTest.py: measures the error rate
__author__ = 'root'
import numpy as np
import classify

def datingClassTest(datingDataSet, labels):
    # hold out the first 10% of the samples as the test set
    ratio = 0.1
    k = 4
    lenOfDataSet = datingDataSet.shape[0]
    numOfTest = int(ratio * lenOfDataSet)
    print(numOfTest)
    numOfError = 0
    # classify every test sample against the remaining 90%
    for i in range(numOfTest):
        inX = datingDataSet[i, :]
        label = labels[i]
        ans = classify.classify(inX, datingDataSet[numOfTest:lenOfDataSet, :],
                                labels[numOfTest:lenOfDataSet], k)
        if ans != label:
            numOfError += 1
            print('predict error')
    return numOfError / numOfTest
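The holdout evaluation can be sketched independently of the other modules. This is a simplified stand-in: 1-nearest-neighbour on made-up, well-separated 1-D data rather than the project's classify function, but the 10% split and error counting are the same:

```python
import numpy as np

# deterministic made-up data: class 1 clusters near 0, class 2 near 10
rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)])
labels = [1] * 50 + [2] * 50
order = rng.permutation(100)            # shuffle so the test slice mixes classes
data, labels = data[order], [labels[i] for i in order]

numOfTest = int(0.1 * len(data))        # first 10% held out for testing
errors = 0
for i in range(numOfTest):
    # 1-nearest-neighbour vote against the remaining 90%
    train = data[numOfTest:]
    nearest = np.abs(train - data[i]).argmin()
    if labels[numOfTest:][nearest] != labels[i]:
        errors += 1
print(errors / numOfTest)
```

With clusters this far apart the nearest neighbour always lands in the right class, so the error rate comes out 0; on the real dating data the book reports a small but nonzero error rate.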
classify.py: the classification (prediction) function
__author__ = 'root'
import numpy as np
import operator

def classify(inX, dataSet, labels, k):
    # Euclidean distance between inX and every sample in dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances ** 0.5
    # indices of the samples sorted by distance, nearest first
    sortedDistIndicies = distance.argsort()
    # count how often each class appears among the k nearest neighbours
    classCount = {}
    for i in range(k):
        className = labels[sortedDistIndicies[i]]
        # get(className, 0) returns 0 when the class has no count yet
        classCount[className] = classCount.get(className, 0) + 1
    # sort the (class, count) pairs by count; reverse=True puts the
    # largest count first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    # the majority class among the k nearest neighbours
    return sortedClassCount[0][0]
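The distance-plus-vote logic is easiest to see on a toy data set. The four 2-D points below are invented (in the spirit of the book's createDataSet example), and the function is the same algorithm condensed:

```python
import numpy as np
import operator

def classify(inX, dataSet, labels, k):
    # Euclidean distance from inX to every row of dataSet
    diff = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
    dist = ((diff ** 2).sum(axis=1)) ** 0.5
    nearest = dist.argsort()[:k]        # indices of the k closest samples
    # majority vote among the k nearest neighbours
    counts = {}
    for idx in nearest:
        counts[labels[idx]] = counts.get(labels[idx], 0) + 1
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# two made-up points per class: class 1 near (1, 1), class 2 near (0, 0)
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = [1, 1, 2, 2]
print(classify(np.array([0.1, 0.1]), group, labels, 3))  # 2
print(classify(np.array([0.9, 1.0]), group, labels, 3))  # 1
```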
classifyPerson.py: takes a person's features and outputs the prediction
__author__ = 'root'
import numpy as np
import classify

def classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels):
    resultList = ['not at all', 'a little like', 'like very much']
    k = 3
    # read the three features from the user
    flyMiles = float(input('please input fly miles per year:'))
    percOfVideoGames = float(input('please input percentage of time you spend playing video games:'))
    iceCream = float(input('please input how much ice cream you eat every week:'))
    # normalize the input with the same min/max as the training data
    inX = np.array([flyMiles, percOfVideoGames, iceCream])
    inX = (inX - dataSetMin) / (dataSetMax - dataSetMin)
    # predict, then map the numeric label (1..3) to a description
    ans = classify.classify(inX, datingDataSet, labels, k)
    print('you may feel this person:', resultList[ans - 1])
knn.py: the main program
__author__ = 'root'
import file2Matrix
import plotDataSet
import autoNorm
import datingClassTest
import classifyPerson

# load the sample set into memory
datingDataSetOri, labels = file2Matrix.file2Matrix('datingTestSet2.txt')
print('datingDataSetOri:\n', datingDataSetOri)
print('labels:\n', labels)
# plot two of the three features
plotDataSet.plotDataSet(datingDataSetOri, labels)
# normalize every feature to [0, 1]
datingDataSet, dataSetMin, dataSetMax = autoNorm.autoNorm(datingDataSetOri)
print('datingDataSet:\n', datingDataSet)
# measure the error rate on the held-out 10%
errorRate = datingClassTest.datingClassTest(datingDataSet, labels)
print('errorRate:', errorRate)
# predict for a person entered by the user
classifyPerson.classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels)
Summary:
Advantages of kNN: the algorithm is simple and easy to implement.
Disadvantages of kNN: 1. prediction cost grows linearly with the number of samples, and also linearly with the number of features; 2. there is no training phase, so the algorithm cannot learn a compact feature representation of the samples.