Here we explain the classification algorithm:
Pseudocode:
get the input data inX that is to be classified
for every item in our example data set:
    calculate the distance between inX and the current item
sort the distances in increasing order
take the k items with the lowest distances to inX
find the majority class among these items
return the majority class as our prediction for the class of inX
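The steps above can be sketched in plain Python, without NumPy. This is only a minimal sketch for illustration: the name `knn_predict` and the use of `math.sqrt` and `collections.Counter` are assumptions, not part of the original code.

```python
import math
from collections import Counter

def knn_predict(in_x, samples, labels, k):
    """Hypothetical helper: classify in_x by majority vote of its k nearest samples."""
    # calculate the distance between in_x and every item of the data set
    distances = []
    for sample, label in zip(samples, labels):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(in_x, sample)))
        distances.append((dist, label))
    # sort the distances in increasing order
    distances.sort(key=lambda pair: pair[0])
    # take the k items with the lowest distances to in_x
    nearest_labels = [label for _, label in distances[:k]]
    # the majority class among these items is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]

samples = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']
print(knn_predict([0, 0], samples, labels, 3))      # B
print(knn_predict([0.8, 0.8], samples, labels, 3))  # A
```

Each comment in the sketch corresponds to one line of the pseudocode, so the two can be read side by side.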
Python code:
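# inX: the input data to classify, normally an array; it must have as many
#      columns as dataSet, since each column is one feature of an object
# dataSet: the sample data, normally a matrix (2-D array)
# labels: the class labels of the sample data, one label per row of dataSet
# k: the number of nearest records that take part in the vote
def classify0(inX, dataSet, labels, k):
    # shape gives the matrix dimensions (rows, columns); index 0 is the row count
    dataSetSize = dataSet.shape[0]
    # tile() repeats the vector inX once per row of dataSet, so inX becomes a
    # matrix with the same number of rows and columns as dataSet;
    # diffMat is the element-wise difference of the two matrices
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # ** 2 squares every element of the matrix
    sqDiffMat = diffMat ** 2
    # sum(axis=1) adds up the elements of each row, giving a vector with one
    # entry per row of dataSet
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root of every element
    distances = sqDistances ** 0.5
    # argsort() returns the indices that sort the array in increasing order
    sortedDistIndicies = distances.argsort()
    # dict of votes: key is the label name, value is how often the label occurs
    classCount = {}
    # visit the k nearest records
    for i in range(k):
        # look up the label of this record
        voteIlabel = labels[sortedDistIndicies[i]]
        # add one to the count for this label
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sorted() orders the dict's items by value, descending; the result is a
    # list of tuples such as [(key, value), (key, value)], where key is the
    # label name and value is its count
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the label of the first element, i.e. the majority class
    return sortedClassCount[0][0]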
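To see what the NumPy steps in classify0 produce, the sketch below runs the same operations on a four-row sample matrix with the query point [0, 0]. It is an illustration under assumptions: NumPy is imported as `np` instead of `from numpy import *`, and the variable names are made up for this example.

```python
import numpy as np

data_set = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
in_x = np.array([0.0, 0.0])

# tile() stacks in_x once per row so it can be subtracted row by row
tiled = np.tile(in_x, (data_set.shape[0], 1))
print(tiled.shape)   # (4, 2)

# squared differences, summed along each row, then square-rooted
distances = ((tiled - data_set) ** 2).sum(axis=1) ** 0.5

# argsort() gives the row indices ordered from nearest to farthest
order = distances.argsort()
print(order)         # [2 3 1 0]

# broadcasting yields the same distances without an explicit tile()
broadcast = np.sqrt(((in_x - data_set) ** 2).sum(axis=1))
assert np.allclose(distances, broadcast)
```

The last lines show that NumPy broadcasting can subtract a vector from every row directly, so the tile() call is a matter of explicitness rather than necessity.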
How to use it?
from numpy import *
import operator

# create the example data set, including the data and the class labels
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

if __name__ == '__main__':
    # create the example data set
    group, labels = createDataSet()
    # the test data to classify
    testData = [[0, 0], [0.8, 0.8], [0.5, 0.5], [0.6, 0.5]]
    # get the class of each piece of test data
    # (assumes classify0 is defined in this same file; if it lives in a
    # separate kNN module, import kNN and call kNN.classify0 instead)
    for inX in testData:
        print(inX, classify0(inX, group, labels, 3))
The output is:
[0, 0] B
[0.8, 0.8] A
[0.5, 0.5] B
[0.6, 0.5] A