The k-Nearest Neighbors (KNN) Algorithm
How it works: we have a collection of samples, the training set, in which every sample carries a label. When an unlabeled input arrives, each of its features is compared with the corresponding features of the samples in the training set, and the labels of the most similar samples are extracted. In general only the k most similar samples are considered, where k is usually an integer no larger than 20. Finally, the class that occurs most often among these k most similar samples is output as the class of the new input.
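The source code later in this post implements KNN from scratch with NumPy. Purely as an illustration of the idea above, a minimal sketch using scikit-learn's KNeighborsClassifier (scikit-learn is not used in the original code, and the query point below is made up) might look like this:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])   # labeled training samples
y_train = np.array(['A', 'A', 'B', 'B'])

clf = KNeighborsClassifier(n_neighbors=3)      # k = 3
clf.fit(X_train, y_train)
print(clf.predict([[0.2, 0.1]]))               # the nearest neighbors are mostly 'B', so ['B'] is expected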
The three elements of the KNN model:
1): the distance metric
2): the choice of K
3): the classification decision rule
Distance metric:
Suppose the feature space $\mathcal{X}$ is an $n$-dimensional real vector space $\mathbf{R}^n$, and let $x_i, x_j \in \mathcal{X}$ with $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$ and $x_j = (x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(n)})^T$. The $L_p$ distance between $x_i$ and $x_j$ is defined as

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \bigl| x_i^{(l)} - x_j^{(l)} \bigr|^p \right)^{\frac{1}{p}},$$

where $p \geq 1$.

When $p = 2$, it is called the Euclidean distance:

$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \bigl| x_i^{(l)} - x_j^{(l)} \bigr|^2 \right)^{\frac{1}{2}}.$$

When $p = 1$, it is called the Manhattan distance:

$$L_1(x_i, x_j) = \sum_{l=1}^{n} \bigl| x_i^{(l)} - x_j^{(l)} \bigr|.$$

When $p = \infty$, it is the maximum of the coordinate-wise distances:

$$L_\infty(x_i, x_j) = \max_{l} \bigl| x_i^{(l)} - x_j^{(l)} \bigr|.$$
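As a quick sanity check of the three special cases, here is a small NumPy sketch (the vectors x_i and x_j are arbitrary illustration values, not taken from the text):

import numpy as np

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, 3.0])

def lp_distance(a, b, p):
    # L_p (Minkowski) distance between two feature vectors, p >= 1
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

print(lp_distance(x_i, x_j, 2))        # Euclidean distance: sqrt(9 + 4 + 0) ≈ 3.606
print(lp_distance(x_i, x_j, 1))        # Manhattan distance: 3 + 2 + 0 = 5
print(np.max(np.abs(x_i - x_j)))       # p → ∞, the largest coordinate-wise distance: 3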
Choice of K:
K is usually taken to be a relatively small positive integer, and cross-validation is typically used to select the optimal value: a very small K makes the prediction sensitive to noisy neighbors, while a very large K lets distant, less similar samples influence the result.
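A minimal sketch of selecting K by cross-validation with scikit-learn (the iris dataset here is only a stand-in for whatever labeled data is actually at hand):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 21):                                     # try small positive values of K
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()    # 5-fold cross-validation accuracy
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])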
Classification decision rule:
The decision rule in KNN is usually majority voting: the class of the input instance is determined by the majority class among its K nearest training instances.

Explanation: if the loss function for classification is the 0-1 loss, the classification function is

$$f : \mathbf{R}^n \rightarrow \{c_1, c_2, \ldots, c_K\},$$

and the probability of misclassification is

$$P(Y \neq f(X)) = 1 - P(Y = f(X)).$$

For a given instance $x$, let $N_k(x)$ be the set of its $k$ nearest training instances. If the class assigned to the region covered by $N_k(x)$ is $c_j$, then the misclassification rate is

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j).$$

To minimize the misclassification rate, i.e. to minimize the empirical risk, we must maximize $\sum_{x_i \in N_k(x)} I(y_i = c_j)$. Hence the majority-voting rule is equivalent to empirical risk minimization.
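In code, the majority-voting rule is just a frequency count over the labels of the k nearest neighbors. A minimal sketch (neighbor_labels is a made-up stand-in for the labels y_i of the points in N_k(x)):

from collections import Counter

neighbor_labels = ['B', 'A', 'B']                      # labels of the k = 3 nearest neighbors
votes = Counter(neighbor_labels)
predicted_class = votes.most_common(1)[0][0]           # the class c_j that maximizes sum I(y_i = c_j)
print(predicted_class)                                 # 'B'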
KNN algorithm workflow:
(1): Compute the distance between the point to be classified and every point of known class.
(2): Sort the resulting distances in ascending order.
(3): Take the first k entries of the sorted distances.
(4): Count how often each class appears among these k points.
(5): Output the most frequent class from (4) as the class of the current point.
(The functions classify0 and KNNClassify in the source code below implement exactly these steps.)
Source code:
# -*- coding: utf-8 -*-
"""
Created on Mon Apr  4 14:47:59 2016
@author: Administrator
"""
from numpy import *        # import the NumPy package (array, tile, zeros, shape, ...)
import operator            # operator.itemgetter is used to sort the vote counts
from os import listdir     # used to list the digit files in a directory
def createDataSet():
    # A tiny toy dataset: four 2-D points labeled 'A' or 'B'.
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])   # nested brackets are needed to form a 4x2 array
    labels = ['A', 'A', 'B', 'B']                               # a list, not a set: order and duplicates matter
    return group, labels
def classify0(inX, dataSet, labels, k):
    # Classify the input vector inX by majority vote among its k nearest neighbors in dataSet.
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet      # tile repeats inX so it can be subtracted row-wise
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances**0.5                          # Euclidean distance to every training point
    sortedDistIndicies = distance.argsort()              # indices sorted by increasing distance
    classCount = {}
    for i in range(k):                                   # count the labels of the k nearest points
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]                        # the most frequent label among the k neighbors
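# A hypothetical quick check of classify0 on the toy dataset above (not part of
# the original listing); the two 'B' points are closest to [0, 0], so 'B' is expected:
#     group, labels = createDataSet()
#     print(classify0([0, 0], group, labels, 3))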
def file2matrix(filename):
    # Parse a text file whose lines hold three tab-separated numeric features
    # followed by an integer class label; return the feature matrix and the labels.
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
def AutoNorm(dataSet):
    # Min-max normalization, column by column: newValue = (oldValue - min) / (max - min).
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
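# A hypothetical quick check of AutoNorm (not part of the original listing): each
# column is rescaled into [0, 1], so features with large raw ranges no longer
# dominate the distance computation.
#     normGroup, ranges, minVals = AutoNorm(createDataSet()[0])
#     print(normGroup)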
def KNNClassify(inX, dataSet, labels, k):
    # The same k-nearest-neighbor classifier as classify0 above; kept as a
    # separate copy because the two test drivers below call it by this name.
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2                       # squared difference per feature
    sqDistances = sqDiffMat.sum(axis=1)          # sum over the features
    distance = sqDistances**0.5                  # Euclidean distance
    sortedDistIndicies = distance.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # print("items = %s" % classCount.items())
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def img2Vector(filename):
    # Read a 32x32 text image of '0'/'1' characters and flatten it into a 1x1024 vector.
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32*i + j] = int(lineStr[j])
    return returnVect
def datingClassTest():
    # Hold out the first half of the dating data as a test set and report the error rate.
    hoRatio = 0.5
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = AutoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = KNNClassify(normMat[i, :], normMat[numTestVecs:m, :],
                                       datingLabels[numTestVecs:m], 3)
        print("the classifier result is %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
def HandWritingClassTest():
    # Train on the digit images in trainingDigits/ and report the error rate on testDigits/.
    hwLabels = []
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])   # the digit label is encoded before the '_' in the file name
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2Vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2Vector('testDigits/%s' % fileNameStr)
        classifierResult = KNNClassify(vectorUnderTest, trainingMat, hwLabels, 3)
        print("KNN classifier result is %d, real answer is %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount = errorCount + 1.0
    print("\ntotal number of errors is %d" % errorCount)
    print("error rate is %f" % (errorCount / float(mTest)))
References:
1): Statistical Learning Methods (《统计学习方法》), Li Hang
2): Machine Learning in Action, Peter Harrington