Machine Learning in Action, Listing 2.1: a walkthrough of the k-nearest neighbors code
The source code is as follows:
from numpy import *
import operator
from os import listdir

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                  # number of training samples
    # Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()        # indices, nearest first
    # Tally the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() runs on Python 2 and 3; the Python 2 original used iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]                   # label with the most votes
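To see the classifier in action, here is a minimal usage sketch. The four training points and their labels 'A' and 'B' are made-up toy data, not part of the listing, and the function is repeated in Python 3 form (with explicit `np.` names) only so that the block runs on its own.

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # Same logic as the listing, written with explicit np. names
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Hypothetical toy data: two points near (1, 1), two points near (0, 0)
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

result_near_origin = classify0([0.0, 0.0], group, labels, 3)
result_near_one = classify0([1.0, 1.2], group, labels, 3)
print(result_near_origin, result_near_one)  # B A
```

With k=3, the query at the origin has two 'B' neighbors and one 'A' neighbor, so the majority vote returns 'B'; the query near (1, 1) gets the opposite result.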
Code walkthrough
numpy.ndarray
An ndarray is a (usually fixed-size) multidimensional container of items of the same type and size. The number of dimensions and items in an array is defined by its shape, which is a tuple of N non-negative integers that specify the sizes of each dimension. The type of items in the array is specified by a separate dtype, one of which is associated with each ndarray.
e.g. :
import numpy
x = numpy.array([[1,2,3],[4,5,6]], numpy.int32)
type(x) # <class 'numpy.ndarray'>
x.shape # (2, 3)
x.dtype # dtype('int32')
See the official documentation for details.
numpy.shape
The dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the rank, or number of dimensions, ndim.
In this example, dataSet.shape[0] gives the number of rows in the training data, i.e. the number of samples.
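A quick check of how shape maps to samples and features. The array below is a hypothetical training set used only for illustration.

```python
import numpy as np

# Hypothetical training set: 4 samples (rows), 2 features (columns)
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])

n_samples = dataSet.shape[0]    # number of rows
n_features = dataSet.shape[1]   # number of columns
print(dataSet.shape, dataSet.ndim, n_samples, n_features)
```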
numpy.tile
numpy.tile(A, reps)
Construct an array by repeating A the number of times given by reps.
If reps has length d, the result will have dimension of max(d, A.ndim).
If A.ndim < d, A is promoted to be d-dimensional by prepending new axes. So a shape (3,) array is promoted to (1, 3) for 2-D replication, or shape (1, 1, 3) for 3-D replication. If this is not the desired behavior, promote A to d-dimensions manually before calling this function.
If A.ndim > d, reps is promoted to A.ndim by prepending 1's to it. Thus for an A of shape (2, 3, 4, 5), a reps of (2, 2) is treated as (1, 1, 2, 2).
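The promotion rules are easier to follow with concrete shapes. A small sketch, using throwaway arrays `a` and `b` that are not part of the listing:

```python
import numpy as np

a = np.array([1, 2, 3])       # shape (3,)

t1 = np.tile(a, 2)            # reps has length 1: repeat along the only axis
t2 = np.tile(a, (2, 1))       # a is promoted to shape (1, 3), then stacked twice
t3 = np.tile(a, (2, 2))       # 2 copies down, 2 copies across

b = np.zeros((2, 3, 4, 5))
t4 = np.tile(b, (2, 2))       # reps is treated as (1, 1, 2, 2)

print(t1.shape, t2.shape, t3.shape, t4.shape)
```

The `(dataSetSize, 1)` form used in classify0 corresponds to `t2`: the vector is stacked row-wise and not repeated across columns.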
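Let the input vector inX be (Ax, Ay), and let one training sample be (Bx, By). In this example, inX is repeated dataSetSize times and dataSet is subtracted from the result, which computes (Ax-Bx) and (Ay-By) for every training sample. The following diffMat**2 then squares each of these two terms.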
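sum(axis=1) computes the sum of each row; sum(axis=0) computes the sum of each column. Here, therefore, sum(axis=1) yields $(A_x - B_x)^2 + (A_y - B_y)^2$ for each training sample, and sqDistances**0.5 then takes the square root to give the Euclidean distance.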
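The distance steps can be traced on a tiny example. The query point and the three training rows below are hypothetical values chosen so the distances come out as whole numbers. The sketch also shows that NumPy broadcasting gives the same result without an explicit tile call, which is an alternative to the listing's approach and not what the listing itself does.

```python
import numpy as np

inX = np.array([0.0, 0.0])                                  # (Ax, Ay), hypothetical query
dataSet = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])    # each row is (Bx, By)

diffMat = np.tile(inX, (dataSet.shape[0], 1)) - dataSet     # (Ax-Bx, Ay-By) per row
sqDistances = (diffMat ** 2).sum(axis=1)                    # (Ax-Bx)^2 + (Ay-By)^2 per row
distances = sqDistances ** 0.5                              # Euclidean distances

# Broadcasting computes the same differences without tile
distances_broadcast = (((inX - dataSet) ** 2).sum(axis=1)) ** 0.5

# axis=0 would sum each column instead, which is not a per-sample distance
column_sums = (diffMat ** 2).sum(axis=0)

print(distances)      # [5. 1. 2.]
print(column_sums)    # [10. 20.]
```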
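numpy.argsort
Return the indices that would sort an array. Given an array, it returns the sequence of indices that would put the array in sorted order.
numpy.argsort(a, axis=-1, kind='quicksort', order=None)
a: the array to sort.
axis: the axis along which to sort.
kind: the sorting algorithm to use, one of {'quicksort', 'mergesort', 'heapsort'}.
order: when the array has multiple fields defined, the order argument determines the sequence in which the fields are compared.
See the official documentation.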
e.g. :
x = numpy.array([3,1,2])
numpy.argsort(x) # array([1,2,0])
x = numpy.array([(1,0),(0,1)], dtype=[('x','<i4'),('y','<i4')])
numpy.argsort(x,order=('x','y')) # array([1,0])
numpy.argsort(x,order=('y','x')) # array([0,1])
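In classify0, argsort is what links distances back to labels: the sorted indices are used to look up the labels of the nearest samples. A minimal sketch, assuming made-up distances and labels:

```python
import numpy as np

distances = np.array([1.49, 1.41, 0.0, 0.1])   # hypothetical distances to 4 samples
labels = ['A', 'A', 'B', 'B']
k = 3

sortedDistIndicies = distances.argsort()        # sample indices, nearest first
nearest_labels = [labels[i] for i in sortedDistIndicies[:k]]
print(sortedDistIndicies, nearest_labels)       # [2 3 1 0] ['B', 'B', 'A']
```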
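operator.itemgetter
Returns a callable object that fetches the given item from its operand. If multiple items are specified, it returns a tuple of the fetched items.
e.g. :
from operator import itemgetter
itemgetter(1)('ABCDEFG') # 'B'
itemgetter(1,3,5)('ABCDEFG') # ('B','D','F')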
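sorted()
The sorted() function sorts any iterable and returns a new list.
sorted(iterable[, cmp[, key[, reverse]]])
This is the Python 2 signature. iterable is the iterable to sort. cmp is a comparison function of two arguments, both taken from the iterable; it returns a positive number if the first is greater, a negative number if it is smaller, and 0 if they are equal. key is a function that extracts from each element the value to compare on. reverse sets the sort direction: reverse=True sorts in descending order, reverse=False (the default) in ascending order. Python 3 removed the cmp argument, so its signature is sorted(iterable, *, key=None, reverse=False); functools.cmp_to_key converts an old-style comparison function into a key.
e.g. :
a = [5,7,6,3,4,1,2]
b = sorted(a) # b = [1,2,3,4,5,6,7]
L = [('b',2),('a',1),('c',3),('d',4)]
sorted(L, cmp=lambda x,y: cmp(x[1],y[1])) # Python 2 only
# [('a',1),('b',2),('c',3),('d',4)]
sorted(L, key=lambda x: x[1]) # same result; works on Python 2 and 3
students = [('john','A',15),('dave','B',10),('jane','B',12)]
sorted(students, key=lambda s: s[2], reverse=True) # sort by age, descending
# [('john','A',15),('jane','B',12),('dave','B',10)]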
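Putting sorted() and itemgetter together reproduces the voting step of classify0. The sketch below targets Python 3 and uses a made-up vote tally; it also shows one way to keep a comparison function on Python 3 through functools.cmp_to_key.

```python
import operator
from functools import cmp_to_key

classCount = {'A': 1, 'B': 2}   # hypothetical vote tally

# Python 3: dict.items() replaces the Python 2-only dict.iteritems()
sortedClassCount = sorted(classCount.items(),
                          key=operator.itemgetter(1), reverse=True)
winner = sortedClassCount[0][0]

# Python 3 has no cmp argument; cmp_to_key adapts a comparison function
L = [('b', 2), ('a', 1), ('c', 3), ('d', 4)]
by_count = sorted(L, key=cmp_to_key(lambda x, y: (x[1] > y[1]) - (x[1] < y[1])))

print(sortedClassCount)   # [('B', 2), ('A', 1)]
print(winner)             # B
print(by_count)           # [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
```

itemgetter(1) selects the vote count from each (label, count) pair, and reverse=True puts the label with the most votes first.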