I just finished reading the kNN chapter of "Machine Learning in Action".
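The k-nearest neighbors algorithm (kNN, k-Nearest Neighbor) computes the distance from the point to be classified to every sample, takes the k closest samples, and assigns the class that most of those k samples belong to. Its characteristics:
1. Simple; no training required.
2. Sensitive to imbalanced sample counts: under a "nearest neighbors, majority wins" rule, the class with more samples clearly has the advantage.
3. The distance to every sample has to be computed, so the computational cost is high.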
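To make point 2 concrete, here is a minimal sketch (not from the book) of how a larger class can outvote closer neighbors once k grows. The helper `knn_predict` and the data set are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(query, samples, labels, k):
    """Majority vote among the k samples closest to query (Euclidean distance)."""
    distances = np.linalg.norm(samples - np.asarray(query), axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical imbalanced data: 2 samples of class 'A', 6 samples of class 'B'.
samples = np.array([
    [0.0, 0.0], [0.1, 0.0],                            # class A
    [1.0, 0.0], [1.1, 0.0], [1.0, 0.1],                # class B
    [1.1, 0.1], [1.2, 0.0], [1.2, 0.1],                # class B
])
labels = ['A', 'A'] + ['B'] * 6

query = [0.2, 0.0]  # sits right next to the two A samples

print(knn_predict(query, samples, labels, k=1))  # A: the single nearest sample
print(knn_predict(query, samples, labels, k=5))  # B: 2 A votes lose to 3 B votes
```

With k=5 the two A samples are still the nearest ones, but class A has only two samples in total, so the three B samples that fill the remaining slots win the vote.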
The first example from the book is shown below. The book's version is Python 2; here it is ported to Python 3 (only one line of code was changed):
'''
First example of a kNN classifier
'''
from numpy import *
import operator

def createDataSet():
    # Four 2-D sample points and their class labels
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return (group,labels)

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every sample
    diffMat = tile(inX, (dataSetSize,1))-dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # Sample indices ordered from nearest to farthest
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest samples
    classCount={}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # Python 3 change: dict.iteritems() became dict.items()
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1), reverse=True)
    # Label with the most votes
    return sortedClassCount[0][0]

if __name__=='__main__':
    print('dataset - labels')
    print(createDataSet())
    group,labels = createDataSet()
    label = classify0([1,1.3],group,labels,3)
    print(label)
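To see what happens inside classify0 for the query [1, 1.3], the steps can be traced one at a time. This is only a sketch for checking the intermediate values, not the book's code: it uses `import numpy as np` and NumPy broadcasting instead of `tile`, and the variable names are my own.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = np.array([1.0, 1.3])

# Broadcasting subtracts the query from every row. classify0 gets the same
# differences with the opposite sign via tile(); squaring removes the sign.
diffs = group - query
distances = np.sqrt((diffs ** 2).sum(axis=1))

# Sample indices ordered from nearest to farthest: [0, 1, 3, 2]
order = distances.argsort()

# Vote among the k nearest samples
k = 3
votes = {}
for idx in order[:k]:
    votes[labels[idx]] = votes.get(labels[idx], 0) + 1
winner = max(votes, key=votes.get)

print(np.round(distances, 3))  # roughly 0.2, 0.3, 1.64, 1.562
print(order)
print(votes)                   # two votes for A, one for B
print(winner)                  # A
```

The two A samples are the nearest, the third neighbor is the B sample at [0, 0.1], so the vote is 2 to 1 and the result is A, which matches what the script above prints for `label`.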