[Machine Learning in Action] k-Nearest Neighbors Code Walkthrough

A walkthrough of Listing 2.1 (the k-nearest neighbors algorithm) from Machine Learning in Action

The source code is as follows:

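    from numpy import *
    import operator
    from os import listdir

    def classify0(inX, dataSet, labels, k):
        # Number of training samples (rows of dataSet)
        dataSetSize = dataSet.shape[0]
        # Euclidean distance from inX to every training sample
        diffMat = tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances**0.5
        # Indices of the training samples, nearest first
        sortedDistIndicies = distances.argsort()
        # Count the labels of the k nearest samples
        classCount = {}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
        # Most frequent label first. The Python 2 listing calls iteritems();
        # items() is used here because it works on both Python 2 and Python 3.
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]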
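
To see the function in action, here is a minimal sketch that calls classify0 on a tiny data set. It assumes Python 3 and `import numpy as np`, so the function is restated with `items()` in place of `iteritems()`. The four points, their labels, and the variable names `result_near_origin` and `result_near_ones` are toy values chosen for illustration only.

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    """Python 3 rendering of the listing above (items() instead of iteritems())."""
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Illustrative toy data: four 2-D points, two per class.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

result_near_origin = classify0([0, 0], group, labels, 3)    # neighbors: B, B, A
result_near_ones = classify0([1.0, 1.2], group, labels, 3)  # neighbors: A, A, B
print(result_near_origin, result_near_ones)  # B A
```

With k = 3, each query point is assigned the label held by two of its three nearest neighbors.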

Code walkthrough

numpy.ndarray

An ndarray is a (usually fixed-size) multidimensional container of items of the same type and size. The number of dimensions and items in an array is defined by its shape, which is a tuple of N positive integers that specify the sizes of each dimension. The type of items in the array is specified by a separate dtype, one of which is associated with each ndarray.

e.g. :

    import numpy

    x = numpy.array([[1,2,3],[4,5,6]], numpy.int32)
    type(x)  # <class 'numpy.ndarray'>
    x.shape  # (2, 3)
    x.dtype  # dtype('int32')

See the official documentation for details.

numpy.shape

The dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the rank, or number of dimensions, ndim.

In this example, dataSet.shape[0] gives the number of rows in the training data, i.e. the number of samples.

numpy.tile

numpy.tile(A,reps)
Construct an array by repeating A the number of times given by reps.

If reps has length d, the result will have dimension of max(d, A.ndim).

If A.ndim < d, A is promoted to be d-dimensional by prepending new axes. So a shape (3,) array is promoted to (1, 3) for 2-D replication, or shape (1, 1, 3) for 3-D replication. If this is not the desired behavior, promote A to d-dimensions manually before calling this function.

If A.ndim > d, reps is promoted to A.ndim by pre-pending 1’s to it. Thus for an A of shape (2, 3, 4, 5), a reps of (2, 2) is treated as (1, 1, 2, 2).
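
The promotion rules quoted above are easier to follow on concrete shapes. The following is a small sketch, assuming `import numpy as np`; the arrays `a` and `b` and the result names are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])      # shape (3,)

# reps has length 1: repeat along the only axis.
once = np.tile(a, 2)
print(once)                  # [1 2 3 1 2 3]

# reps has length 2 and a.ndim is 1: a is promoted to shape (1, 3),
# then repeated 4 times down the rows and once across the columns.
rows = np.tile(a, (4, 1))
print(rows.shape)            # (4, 3)

# 2 copies down, 2 copies across.
both = np.tile(a, (2, 2))
print(both.shape)            # (2, 6)

# A.ndim > d: for a 2-D array, reps=2 is treated as (1, 2).
b = np.zeros((2, 3))
wide = np.tile(b, 2)
print(wide.shape)            # (2, 6)
```

The `(4, 1)` case is the pattern classify0 uses: one copy of the input vector per training sample.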

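Let the input vector inX be $(A_x, A_y)$ and let one training sample be $(B_x, B_y)$. In this example, tile repeats inX dataSetSize times so that it has the same shape as dataSet; subtracting dataSet then gives $(A_x - B_x)$ and $(A_y - B_y)$ for every training sample. The following diffMat**2 squares each of these differences.

sum(axis=1) sums each row; sum(axis=0) sums each column. Here, therefore, sum(axis=1) yields $(A_x - B_x)^2 + (A_y - B_y)^2$ for each training sample, and sqDistances**0.5 takes its square root to give the Euclidean distance.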
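
The distance steps can be traced on numbers small enough to check by hand. This is a sketch, assuming `import numpy as np`; the three training points are invented for illustration, and the final broadcasting line is an alternative that is not part of the book's listing.

```python
import numpy as np

inX = np.array([0.0, 0.0])                                  # query point
dataSet = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])   # 3 samples, 2 features

# Same steps as classify0.
diffMat = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
sqDiffMat = diffMat ** 2

row_sums = sqDiffMat.sum(axis=1)   # one value per sample
col_sums = sqDiffMat.sum(axis=0)   # one value per feature
distances = row_sums ** 0.5

print(row_sums)    # [25.  1.  4.]
print(col_sums)    # [10. 20.]
print(distances)   # [5. 1. 2.]

# Alternative (assumption: not in the listing): NumPy broadcasting
# stretches inX across the rows of dataSet, so tile is not required.
distances_broadcast = np.sqrt(((inX - dataSet) ** 2).sum(axis=1))
print(np.allclose(distances, distances_broadcast))  # True
```

Only the axis=1 sum is meaningful here, because each row holds the squared differences for one training sample.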

numpy.argsort

Return the indices that would sort an array. In other words, for a given array, argsort returns the sequence of indices that would put its elements in sorted order.

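numpy.argsort(a, axis=-1, kind='quicksort', order=None)
a: the array to sort. axis: the axis along which to sort. kind: the sorting algorithm to use, one of {'quicksort', 'mergesort', 'heapsort'}. order: when the array has several fields defined, order determines the sequence in which the fields are compared.
See the official documentation.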
e.g. :

    import numpy

    x = numpy.array([3,1,2])
    numpy.argsort(x)  # array([1, 2, 0])

    # Structured array with two fields, 'x' and 'y'
    x = numpy.array([(1,0),(0,1)], dtype=[('x','<i4'),('y','<i4')])
    numpy.argsort(x,order=('x','y'))  # array([1, 0])
    numpy.argsort(x,order=('y','x'))  # array([0, 1])
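
In classify0, argsort is what turns a vector of distances into the k nearest neighbors. The sketch below shows that step in isolation, assuming `import numpy as np`; the distances and labels are invented for illustration.

```python
import numpy as np

distances = np.array([5.0, 1.0, 2.0, 0.5])   # distance to each training sample
labels = ['A', 'B', 'B', 'A']                # label of each training sample
k = 3

# Indices of the samples ordered from nearest to farthest.
order = distances.argsort()
print(order)             # [3 1 2 0]

# The first k indices identify the k nearest neighbors.
nearest = order[:k]
nearest_labels = [labels[i] for i in nearest]
print(nearest_labels)    # ['A', 'B', 'B']
```

Sorting indices rather than values keeps the link between each distance and the label stored at the same position.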

operator.itemgetter

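Returns a callable object that fetches the given item(s) from its operand. If more than one item is specified, it returns a tuple of the fetched items.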
e.g. :

    from operator import itemgetter

    itemgetter(1)('ABCDEFG')      # 'B'
    itemgetter(1,3,5)('ABCDEFG')  # ('B', 'D', 'F')
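
In the listing, itemgetter(1) is applied to (label, count) pairs rather than to strings. A brief sketch of that use follows; the pairs and the names `get_count` and `top` are illustrative only.

```python
from operator import itemgetter

get_count = itemgetter(1)         # callable that returns element 1 of its argument

pair = ('B', 2)                   # a (label, count) pair
print(get_count(pair))            # 2

# itemgetter(1) behaves like lambda p: p[1], so it can serve as a key function.
votes = [('A', 1), ('B', 2)]
top = max(votes, key=get_count)
print(top)                        # ('B', 2)
```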

sorted()

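The sorted() function sorts any iterable and returns a new list.
sorted(iterable[,cmp[,key[,reverse]]]) is the Python 2 signature. iterable is the object to sort. cmp is a comparison function of two arguments, both taken from the iterable, that returns 1 if the first is greater, -1 if it is smaller, and 0 if they are equal. key is a function that extracts from each element the value to compare on. reverse sets the order: reverse=True sorts in descending order, reverse=False (the default) in ascending order. Python 3 removed the cmp argument, and key and reverse are keyword-only there.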

e.g. :

    a = [5,7,6,3,4,1,2]
    b = sorted(a)  # b = [1,2,3,4,5,6,7]

    # Python 2 only: both the cmp argument and the built-in cmp() were removed in Python 3
    L=[('b',2),('a',1),('c',3),('d',4)]
    sorted(L,cmp=lambda x,y:cmp(x[1],y[1]))
    # [('a',1),('b',2),('c',3),('d',4)]

    students=[('john','A',15),('dave','B',10),('jane','B',12)]
    sorted(students, key=lambda s:s[2],reverse=True)  # sort by age, descending
    # [('john','A',15),('jane','B',12),('dave','B',10)]
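
The cmp example above runs only on Python 2. As a sketch of Python 3 equivalents, the block below sorts the same list with a key function and with functools.cmp_to_key, then repeats the vote-sorting step from classify0. The helper `compare_second` and the sample `classCount` dictionary are hypothetical names and values introduced for illustration.

```python
from functools import cmp_to_key
from operator import itemgetter

L = [('b', 2), ('a', 1), ('c', 3), ('d', 4)]

# Preferred in Python 3: a key function that extracts the value to compare.
by_key = sorted(L, key=itemgetter(1))
print(by_key)    # [('a', 1), ('b', 2), ('c', 3), ('d', 4)]

# If a two-argument comparison is needed, wrap it with cmp_to_key.
def compare_second(x, y):
    # Stand-in for Python 2's cmp(x[1], y[1]): returns -1, 0 or 1.
    return (x[1] > y[1]) - (x[1] < y[1])

by_cmp = sorted(L, key=cmp_to_key(compare_second))
print(by_cmp == by_key)    # True

# The final step of classify0: order (label, count) pairs by count,
# largest first, and take the label of the first pair.
classCount = {'A': 1, 'B': 2}
sortedClassCount = sorted(classCount.items(), key=itemgetter(1), reverse=True)
winner = sortedClassCount[0][0]
print(winner)    # B
```

A key function is called once per element, whereas a comparison function is called once per comparison, so the key form is usually the simpler choice.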