Index by integer array
Let’s look at an example first:
>>> a = np.arange(4)**2 # square each of [0, 1, 2, 3]
>>> a
array([0, 1, 4, 9])
>>> i = [1, 1, 3, 2, 2] # a list of integers
>>> x = a[i] # indexing using integer array
>>> x
array([1, 1, 9, 4, 4])
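In the example above, i is a non-tuple sequence (a list), so a[i] follows the index-by-integer-array rules. As a result, x = a[i] makes x a copy of the selected elements of a. Change x[1] and you will find that a is not affected.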
>>> x[1] = 999
>>> a
array([0, 1, 4, 9])
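One way to double-check the copy behaviour is np.shares_memory, which reports whether two arrays overlap in memory. The following is a minimal sketch that reuses the names a, i and x from the example; the slice s is added here only for contrast and is not part of the original example.

```python
import numpy as np

a = np.arange(4)**2          # array([0, 1, 4, 9])
i = [1, 1, 3, 2, 2]

x = a[i]                     # integer-array indexing: a new array
s = a[1:3]                   # basic slicing (added for contrast): a view

copy_shares = np.shares_memory(a, x)   # x has its own buffer
view_shares = np.shares_memory(a, s)   # s points into a's buffer

x[1] = 999                   # does not touch a
s[0] = -1                    # writes through to a[1]
print(copy_shares, view_shares, a)
```

Writing through the slice changes a, while writing to x does not, which is the practical difference between a view and a copy.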
However, if we assign to a[int_array] directly:
>>> i
[1, 1, 3, 2, 2]
>>> a
array([0, 1, 4, 9])
>>> a[i]
array([1, 1, 9, 4, 4])
>>> a[i] = np.arange(5)*10 # a.__setitem__(i, ...)
>>> a
array([ 0, 10, 40, 20])
The original array’s data is updated IN PLACE, without building any new array.
>>> a = np.arange(4)**2 # reset a to [0, 1, 4, 9]
>>> b = a[i] # b refers to a copy of the selected elements
>>> b[:] = 0 # clear b
>>> a # a is not affected
array([0, 1, 4, 9])
>>> a[i] = 0 # writes directly to a[1], a[2], a[3]
>>> a # a is changed
array([0, 0, 0, 0])
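Why does a[i] = np.arange(5)*10 leave a as [0, 10, 40, 20]? Indices 1 and 2 each appear twice in i, and in this small example the outcome matches an explicit loop that performs the writes one by one, so the last value written to each position is the one that survives. The sketch below checks this; the names values and looped are introduced here purely for illustration.

```python
import numpy as np

i = [1, 1, 3, 2, 2]
values = np.arange(5) * 10        # array([ 0, 10, 20, 30, 40])

a = np.arange(4)**2
a[i] = values                     # one vectorised in-place assignment

looped = np.arange(4)**2
for idx, val in zip(i, values):   # the same writes, spelled out one by one
    looped[idx] = val

print(a)        # [ 0 10 40 20]
print(looped)   # [ 0 10 40 20]
```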
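Copies and views were covered at length in the previous post, so I won't repeat that here. Instead, let's look at an application. Suppose a message is made up of combinations of the four symbols a, b, c, d, and we want to one-hot encode it.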
>>> onehot = np.eye(4, dtype=int)
>>> msg = "abcddcba"
>>> msg_i = [ord(c) - ord('a') for c in msg]
>>> msg_i
[0, 1, 2, 3, 3, 2, 1, 0]
>>> msg_coded = onehot[msg_i]
>>> msg_coded
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 0]])
Clearly, msg_coded has as many rows as msg has characters, while its number of columns is determined by the number of columns of the onehot matrix, because
onehot[msg_i] == onehot[msg_i, :]
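The equivalence above, and the resulting shape, can be checked directly. The sketch below also decodes the message back with argmax; that step is an addition for illustration, not part of the original example, and decoded is a hypothetical name.

```python
import numpy as np

onehot = np.eye(4, dtype=int)
msg = "abcddcba"
msg_i = [ord(c) - ord('a') for c in msg]

msg_coded = onehot[msg_i]

# Omitting the trailing ":" selects whole rows, so both spellings agree.
same = np.array_equal(msg_coded, onehot[msg_i, :])

# One row per character, one column per symbol.
shape = msg_coded.shape

# Each row holds a single 1; its column position recovers the symbol index.
decoded = "".join(chr(int(k) + ord('a')) for k in msg_coded.argmax(axis=1))
print(same, shape, decoded)
```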
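Generalizing: when a has more than two dimensions, a[indices_array_1d] still inherits every dimension of a except the first, and its first dimension (shape[0]) is determined by the length of indices_array_1d.
Generalizing further: when the index array is multi-dimensional, the shape of a[indices_array_xd] has two parts:
the leading dimensions are indices_array_xd.shape
the trailing dimensions are a.shape[1:]
If that makes your head spin, trying it out yourself is a good idea.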
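The shape rule can be verified with a few lines. This is a minimal sketch; the array a and the index arrays idx_1d and idx_2d are made-up values chosen only to give every axis a different length.

```python
import numpy as np

a = np.arange(24).reshape(4, 3, 2)        # a.shape == (4, 3, 2)

idx_1d = np.array([0, 3, 3, 1, 2])        # shape (5,)
idx_2d = np.array([[0, 1, 2],
                   [3, 3, 0]])            # shape (2, 3)

out_1d = a[idx_1d]
out_2d = a[idx_2d]

# Result shape = index array shape + a.shape[1:]
print(out_1d.shape)   # (5, 3, 2)
print(out_2d.shape)   # (2, 3, 3, 2)

# Each entry of the index array picks one whole a[k] block.
print(np.array_equal(out_2d[1, 2], a[idx_2d[1, 2]]))   # True
```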
Index by boolean array
Indexing with a boolean array is a different story.
Let's start with a common case: setting every element of a matrix A that is less than 3 to zero.
>>> A = np.random.randint(0, 10, (4,4))
>>> A
array([[4, 5, 0, 1],
       [5, 0, 2, 5],
       [8, 6, 9, 1],
       [9, 8, 9, 1]])
>>> A[A<3] = 0
>>> A
array([[4, 5, 0, 0],
       [5, 0, 0, 5],
       [8, 6, 9, 0],
       [9, 8, 9, 0]])
Isn't A[A<3] simple and convenient? It uses a boolean array as the index.
Note that "A<3" is itself a boolean array with the same shape as A (conceptually, 3 is broadcast to the shape of A and then compared with each element of A).
>>> A < 3
array([[False, False,  True,  True],
       [False,  True,  True, False],
       [False, False, False,  True],
       [False, False, False,  True]], dtype=bool)
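Note that the dtype here is bool (recent NumPy versions omit dtype=bool from the printed output). What happens if we convert the bool array to an int type?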
>>> A[(A<3).astype(int)]
array([[[4, 5, 0, 0],   # A Row 0 (False)
        [4, 5, 0, 0],   # A Row 0 (False)
        [5, 0, 0, 5],   # A Row 1 (True)
        [5, 0, 0, 5]],  # A Row 1 (True)
       [[4, 5, 0, 0],   # False
        [5, 0, 0, 5],   # True
        [5, 0, 0, 5],   # True
        [4, 5, 0, 0]],  # False
       [[4, 5, 0, 0],   # False
        [4, 5, 0, 0],   # False
        [4, 5, 0, 0],   # False
        [5, 0, 0, 5]],  # True
       [[4, 5, 0, 0],
        [4, 5, 0, 0],
        [4, 5, 0, 0],
        [5, 0, 0, 5]]])
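As you can see, when an int array replaces the bool array as the index, the result follows the "index by integer array" rules: each False becomes index 0 and each True becomes index 1, so every entry selects row 0 or row 1 of A.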
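Comparing the result shapes makes the difference between the two kinds of index concrete. In this sketch A is hard-coded to the values it had after zeroing (the original A was random), and by_bool and by_int are names introduced only for illustration.

```python
import numpy as np

A = np.array([[4, 5, 0, 0],
              [5, 0, 0, 5],
              [8, 6, 9, 0],
              [9, 8, 9, 0]])

mask = A < 3                       # boolean array, shape (4, 4)

by_bool = A[mask]                  # 1-D array holding only the selected elements
by_int = A[mask.astype(int)]       # whole rows A[0] / A[1], one per mask entry

print(by_bool.shape)   # (6,)
print(by_int.shape)    # (4, 4, 4)

# mask[0, 2] is True, i.e. index 1, so that entry of by_int is row 1 of A.
print(np.array_equal(by_int[0, 2], A[1]))   # True
```

The boolean index filters elements, while the integer index with the same 0/1 pattern gathers entire rows.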
Special cases of boolean array indexing
In the examples above, A and (A<3) have exactly the same shape. In fact, a boolean array with a different shape can also be used as an index:
>>> a = np.arange(12).reshape(3,4)
>>> b1 = np.array([False,True,True]) # first dim selection
>>> b2 = np.array([True,False,True,False]) # second dim selection
>>>
>>> a[b1,:] # selecting rows
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[b1] # same thing
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[:,b2] # selecting columns
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
In the example above, the 1-dimensional arrays b1 and b2 are used to index a. Note that
len(b1) == a.shape[0]
len(b2) == a.shape[1]
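The length requirement can be checked by experiment. The sketch below shows that a 1-D mask behaves like the integer positions of its True entries, and that a mask of the wrong length is rejected. It assumes a recent NumPy, where the mismatch raises IndexError; rows, cols and the *_match names are introduced here for illustration.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b1 = np.array([False, True, True])           # len 3 == a.shape[0]
b2 = np.array([True, False, True, False])    # len 4 == a.shape[1]

# A 1-D mask behaves like the integer positions of its True entries.
rows = np.flatnonzero(b1)                    # array([1, 2])
cols = np.flatnonzero(b2)                    # array([0, 2])
rows_match = np.array_equal(a[b1], a[rows])
cols_match = np.array_equal(a[:, b2], a[:, cols])

# A mask of the wrong length is rejected (IndexError in recent NumPy).
try:
    a[b2]                                    # len-4 mask on an axis of size 3
    mismatch_error = False
except IndexError:
    mismatch_error = True

print(rows_match, cols_match, mismatch_error)
```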
Here, a[b1, :] selects the rows where b1 is True, and a[:, b2] selects the columns where b2 is True. Building on this, we can select the rows of a whose sum is greater than 25:
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a.sum(-1)
array([ 6, 22, 38])
>>> a.sum(-1) > 25
array([False, False,  True], dtype=bool)
>>> a[a.sum(-1) > 25]
array([[ 8, 9, 10, 11]])
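The same idea extends to filtering rows and columns together. One option, which is not part of the original example, is np.ix_, which accepts boolean masks and combines them so that every kept row meets every kept column. The thresholds 10 and 16 below are arbitrary values chosen for illustration.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

row_mask = a.sum(axis=1) > 10      # row sums [6, 22, 38] -> [False, True, True]
col_mask = a.sum(axis=0) > 16      # column sums [12, 15, 18, 21] -> [False, False, True, True]

# np.ix_ builds an open mesh, so every kept row meets every kept column.
block = a[np.ix_(row_mask, col_mask)]
print(block)
```

Here block holds rows 1 and 2 restricted to columns 2 and 3.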
As these examples show, indexing with a boolean array has strengths of its own. It may be quirky, but it is far from redundant.