Understanding NumPy in depth (2): indexing with integer arrays and boolean arrays

XiaoPing88Eric

Article link: https://blog.csdn.net/XiaoPing88Eric/article/details/78386535

Index by integer array

Let's look at an example first:

>>> import numpy as np
>>> a = np.arange(4)**2       # square each of [0, 1, 2, 3]
>>> a
array([0, 1, 4, 9])
>>> i = [1, 1, 3, 2, 2]       # a list of integers
>>> x = a[i]                  # indexing using an integer array
>>> x
array([1, 1, 9, 4, 4])

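In the example above, i is a non-tuple sequence (a list), so a[i] follows the rules of indexing by integer array. As a result, x = a[i] makes x a copy of the selected elements of a, not a view. If we change x[1], a is not affected.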

>>> x[1] = 999
>>> a
array([0, 1, 4, 9])
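
One way to confirm that a[i] is a copy is to ask NumPy whether the two arrays share memory. The following is a minimal sketch using np.shares_memory, with a basic slice added for contrast; the names s, copy_shares and view_shares are only illustrative and do not appear in the original example.

```python
import numpy as np

a = np.arange(4) ** 2
i = [1, 1, 3, 2, 2]

x = a[i]        # integer-array indexing: returns a copy
s = a[1:3]      # basic slicing: returns a view

copy_shares = np.shares_memory(a, x)   # does the copy share a's buffer?
view_shares = np.shares_memory(a, s)   # does the slice share a's buffer?

x[1] = 999      # only changes the copy
s[0] = -1       # writes through to a[1]

print(copy_shares, view_shares, a)
```

Writing into the slice changes a, while writing into the integer-indexed result does not.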

However, if we assign to a[int_array] directly:

>>> i
[1, 1, 3, 2, 2]
>>> a
array([0, 1, 4, 9])
>>> a[i]
array([1, 1, 9, 4, 4])
>>> a[i] = np.arange(5)*10  # a.__setitem__(i, ...)
>>> a
array([ 0, 10, 40, 20])

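The original array's data is updated IN PLACE, without building any new array. When an index is repeated, as 1 and 2 are in i, the later assignment wins, which is why a[1] ends up as 10 and a[2] as 40.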

>>> a = np.arange(4)**2     # start again from [0, 1, 4, 9]
>>> b = a[i]                # b refers to a copy of the selected elements
>>> b[:] = 0                # clear b
>>> a                       # a is not affected
array([0, 1, 4, 9])
>>> a[i] = 0                # directly writes a[1], a[2], a[3]
>>> a                       # a is changed
array([0, 0, 0, 0])
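
Repeated indices deserve a little care with in-place arithmetic as well. The sketch below is an illustration added here, not part of the original example: it contrasts a[i] += 1, where each repeated index is incremented only once, with np.add.at, which counts every occurrence. The names buffered and c are illustrative.

```python
import numpy as np

i = [1, 1, 3, 2, 2]

a = np.zeros(4, dtype=int)
a[i] += 1               # read a[i], add 1, write back: repeats collapse
buffered = a.copy()

c = np.zeros(4, dtype=int)
np.add.at(c, i, 1)      # unbuffered: every occurrence of an index is applied

print(buffered)         # each touched position incremented once
print(c)                # positions 1 and 2 incremented twice
```

If the goal is to count how often each index occurs, np.add.at (or np.bincount) is the safer choice.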

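The previous post already covered copies and views in detail, so I won't repeat that here. Instead, let's look at an application. Suppose a message is built from combinations of the four symbols a, b, c, d, and we want to one-hot encode it.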

>>> onehot = np.eye(4, dtype=int)
>>> msg = "abcddcba"
>>> msg_i = [ord(c) - ord('a') for c in msg]
>>> msg_i
[0, 1, 2, 3, 3, 2, 1, 0]
>>> msg_coded = onehot[msg_i]
>>> msg_coded
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 0]])

Clearly, msg_coded has as many rows as msg has characters, and its number of columns is determined by the number of columns of the onehot matrix, because

    onehot[msg_i] == onehot[msg_i, :]
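
The equivalence above can be checked directly, and the encoding can be reversed with argmax. This is a small sketch; the decoding step and the names same, decoded_i and decoded are additions for illustration only.

```python
import numpy as np

onehot = np.eye(4, dtype=int)
msg = "abcddcba"
msg_i = [ord(c) - ord('a') for c in msg]

msg_coded = onehot[msg_i]

# onehot[msg_i] picks whole rows, so it matches onehot[msg_i, :]
same = np.array_equal(msg_coded, onehot[msg_i, :])

# Decoding: the position of the 1 in each row is the symbol index.
decoded_i = msg_coded.argmax(axis=1)
decoded = "".join(chr(int(k) + ord('a')) for k in decoded_i)

print(same, msg_coded.shape, decoded)
```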

More generally, when a has more than one dimension, a[indices_array_1d] inherits all of a's dimensions except the first, while its first dimension (shape[0]) is determined by the length of indices_array_1d.

Generalizing further, when the index array is multi-dimensional, the shape of a[indices_array_xd] has two parts:

The leading dimensions are: indices_array_xd.shape
The trailing dimensions are: a.shape[1:]

If this makes your head spin, verifying it by hand is a good idea.
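
A quick check of the shape rule could look like the following. The array sizes and the names idx_1d and idx_2d are arbitrary choices made for this sketch.

```python
import numpy as np

a = np.arange(24).reshape(4, 3, 2)          # a.shape == (4, 3, 2)

idx_1d = np.array([0, 3, 3])                # shape (3,)
idx_2d = np.array([[0, 1], [2, 3], [3, 0],
                   [1, 1], [2, 2]])         # shape (5, 2)

shape_1d = a[idx_1d].shape                  # (len(idx_1d),) + a.shape[1:]
shape_2d = a[idx_2d].shape                  # idx_2d.shape + a.shape[1:]

# Each entry of the result is the block of a selected by that index.
same_block = np.array_equal(a[idx_2d][4, 1], a[idx_2d[4, 1]])

print(shape_1d, shape_2d, same_block)
```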

Index by boolean array

Indexing with a boolean array is a different story.
Let's start with a common case: setting every element of a matrix A that is less than 3 to zero.

>>> A = np.random.randint(0, 10, (4,4))
>>> A
array([[4, 5, 0, 1],
       [5, 0, 2, 5],
       [8, 6, 9, 1],
       [9, 8, 9, 1]])
>>> A[A<3] = 0
>>> A
array([[4, 5, 0, 0],
       [5, 0, 0, 5],
       [8, 6, 9, 0],
       [9, 8, 9, 0]])

A[A<3] is simple and convenient. It uses a boolean array as the index.
Note that "A<3" is itself a boolean array with the same shape as A (conceptually, 3 is broadcast to the shape of A and then compared with every element of A).

>>> A < 3
array([[False, False,  True,  True],
       [False,  True,  True, False],
       [False, False, False,  True],
       [False, False, False,  True]], dtype=bool)
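
To make the role of the mask more concrete, the sketch below builds it once and uses it three ways: to read the selected elements, to build a modified copy with np.where, and to assign in place. The fixed starting matrix reuses the random values shown earlier; the names mask, picked and B are illustrative.

```python
import numpy as np

A = np.array([[4, 5, 0, 1],
              [5, 0, 2, 5],
              [8, 6, 9, 1],
              [9, 8, 9, 1]])

mask = A < 3                  # boolean array with the same shape as A
picked = A[mask]              # 1-D copy of the selected elements, in row-major order
B = np.where(mask, 0, A)      # new array; A itself is left untouched here
A[mask] = 0                   # in-place version

print(picked)
print(np.array_equal(A, B))
```

np.where is useful when the original array must stay unchanged, while A[mask] = 0 avoids allocating a second array.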

Note that the dtype of A < 3 is bool. What happens if we convert the bool array to int?

>>> A[(A<3).astype(int)]
array([[[4, 5, 0, 0],       # A Row 0  (False)
        [4, 5, 0, 0],       # A Row 0  (False)
        [5, 0, 0, 5],       # A Row 1  (True)
        [5, 0, 0, 5]],      # A Row 1  (True)

       [[4, 5, 0, 0],       # False
        [5, 0, 0, 5],       # True
        [5, 0, 0, 5],       # True
        [4, 5, 0, 0]],      # False

       [[4, 5, 0, 0],       # False
        [4, 5, 0, 0],       # False
        [4, 5, 0, 0],       # False
        [5, 0, 0, 5]],      # True

       [[4, 5, 0, 0],
        [4, 5, 0, 0],
        [4, 5, 0, 0],
        [5, 0, 0, 5]]])

As we can see, when an int array replaces the bool array as the index, the result follows the "index by integer array" rules: every 0 selects row 0 of A and every 1 selects row 1.
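
The difference shows up clearly in the result shapes. In this sketch (variable names are illustrative), the boolean index yields one element per True, the int index yields one full row of A per entry, and np.nonzero gives the integer indices that reproduce the boolean selection.

```python
import numpy as np

A = np.array([[4, 5, 0, 0],
              [5, 0, 0, 5],
              [8, 6, 9, 0],
              [9, 8, 9, 0]])
mask = A < 3

by_bool = A[mask]                   # one element per True in mask
by_int = A[mask.astype(int)]        # one row of A (row 0 or row 1) per entry
by_nonzero = A[np.nonzero(mask)]    # integer indices equivalent to the mask

print(by_bool.shape, by_int.shape)
print(np.array_equal(by_bool, by_nonzero))
```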

Special case of boolean array indexing

In the example above, A and (A<3) have exactly the same shape. In fact, a boolean array with a different shape can also be used as an index:

>>> a = np.arange(12).reshape(3,4)
>>> b1 = np.array([False,True,True])             # first dim selection
>>> b2 = np.array([True,False,True,False])       # second dim selection
>>>
>>> a[b1,:]                                   # selecting rows
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

>>> a[b1]                                     # same thing
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

>>> a[:,b2]                                   # selecting columns
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])

In the example above, the 1-dimensional arrays b1 and b2 are used to index a. The thing to note here is that

len(b1) == a.shape[0]
len(b2) == a.shape[1]
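
The sketch below illustrates why these lengths matter and what happens when both masks are used together. It assumes a recent NumPy release, where a boolean index of the wrong length raises IndexError; the names mismatch_raises, paired and block are illustrative.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])

try:
    a[np.array([True, False])]     # length 2, but a.shape[0] == 3
    mismatch_raises = False
except IndexError:
    mismatch_raises = True

paired = a[b1, b2]                 # pairs up True positions: a[1, 0] and a[2, 2]
block = a[np.ix_(b1, b2)]          # rows from b1 crossed with columns from b2

print(mismatch_raises)
print(paired)
print(block)
```

Note that a[b1, b2] does not select a sub-matrix; np.ix_ is one way to get the rows-by-columns block.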

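As we can see, a[b1, :] selects the rows for which b1 is True, and a[:, b2] selects the columns for which b2 is True. Based on this, we can select the rows of a whose sum is greater than 25: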

>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a.sum(-1)
array([ 6, 22, 38])
>>> a.sum(-1) > 25
array([False, False,  True], dtype=bool)
>>> a[a.sum(-1) > 25]
array([[ 8,  9, 10, 11]])
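
The same idea works along the other axis. As a small sketch (the threshold 14 and the names col_mask, wide and both are arbitrary choices for illustration), we can keep the columns whose sum exceeds a threshold and then combine the row and column filters:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

row_mask = a.sum(-1) > 25          # one boolean per row
col_mask = a.sum(0) > 14           # one boolean per column

wide = a[:, col_mask]              # keep only the selected columns
both = a[row_mask][:, col_mask]    # filter rows first, then columns

print(a.sum(0))
print(wide)
print(both)
```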