Python array slicing: why is looping over slices of a numpy array faster than the fully vectorized operation?

I need to create a boolean mask by thresholding a 3D data array: the mask must be True at locations where the data are smaller than the lower acceptable limit or larger than the upper acceptable limit, and False otherwise. Succinctly:

mask = (data < low) | (data > high)

I have two versions of the code for performing this operation: one works directly on the entire 3D array in numpy, while the other loops over slices of the array. Contrary to my expectations, the second method turns out to be faster than the first. Why?

In [1]: import numpy as np

In [2]: import sys

In [3]: print(sys.version)
3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

In [4]: print(np.__version__)
1.14.0

In [5]: arr = np.random.random((10, 1000, 1000))

In [6]: def method1(arr, low, high):
   ...:     """ Fully vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     np.greater_equal(arr, high, out)
   ...:     np.logical_or(out, arr < low, out)
   ...:     return out
   ...:

In [7]: def method2(arr, low, high):
   ...:     """ Partially vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     for k in range(arr.shape[0]):
   ...:         a = arr[k]
   ...:         o = out[k]
   ...:         np.greater_equal(a, high, o)
   ...:         np.logical_or(o, a < low, o)
   ...:     return out
   ...:

First of all, let's make sure that both methods produce identical results:

In [8]: np.all(method1(arr, 0.2, 0.8) == method2(arr, 0.2, 0.8))
Out[8]: True

And now some timing tests:

In [9]: %timeit method1(arr, 0.2, 0.8)
14.4 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit method2(arr, 0.2, 0.8)
11.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

What is going on here?

EDIT 1: A similar behavior is observed in an older environment:

In [3]: print(sys.version)
2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

In [4]: print(np.__version__)
1.11.3

In [9]: %timeit method1(arr, 0.2, 0.8)
100 loops, best of 3: 14.3 ms per loop

In [10]: %timeit method2(arr, 0.2, 0.8)
100 loops, best of 3: 13 ms per loop

Solution

Outperforming both methods

In method one you access the entire array twice. If it doesn't fit in the cache, the data are read from RAM twice, which lowers performance. Additionally, temporary arrays may be created, as mentioned in the comments.

Method two is more cache-friendly: each iteration touches only one slice of the array twice, and that slice is much more likely to fit in cache. The downsides are the slow Python loop and the additional function calls, which are also quite slow.
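To make the cache argument concrete, here is a minimal sketch (my addition, not part of the original post) that generalizes method two: the vectorized ufuncs are applied block by block along the first axis, so the amount of data touched per iteration can be tuned to the cache size. The helper name method2_blocked and the default block size are illustrative assumptions.

import numpy as np

def method2_blocked(arr, low, high, block=2):
    # Blocked variant of method2: same ufuncs with out=, applied per block.
    out = np.empty(arr.shape, dtype=bool)
    for start in range(0, arr.shape[0], block):
        a = arr[start:start + block]
        o = out[start:start + block]
        np.greater_equal(a, high, out=o)   # o = (a >= high)
        np.logical_or(o, a < low, out=o)   # o |= (a < low)
    return out

With block=1 this is essentially method2; larger blocks trade cache locality for fewer Python-level iterations.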

To get really good performance here, it is recommended to compile the code, which can be done with Cython or Numba. Since the Cython version is more work (annotating the code, needing a separate compiler), I will show how to do this with Numba.

import numba as nb
import numpy as np

@nb.njit(fastmath=True, cache=True)
def method3(arr, low, high):
    # Explicit loops are fine here: Numba compiles them to machine code,
    # and each element is read exactly once.
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out

Using arr = np.random.random((10, 1000, 1000)), this outperforms your method1 by a factor of two and your method2 by 50 percent on my PC (Core i7-4771, Python 3.5, Windows).

This is only a simple example. In more complex code, where you can also make use of SIMD and parallel processing (which is likewise very easy to use), the performance gain can be much bigger. For non-compiled code, vectorization is usually, though not always (as shown here), the best you can do, but it comes with poor cache behaviour, which can lead to suboptimal performance if the chunks of data you are accessing don't fit at least in the L3 cache. For some other problems there is also a performance hit if the data don't fit in the much smaller L1 or L2 caches. Another advantage is the automatic inlining of small njitted functions into an njitted function that calls them.
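As a hedged illustration of the parallel-processing point (my addition, not part of the original answer), the same loop can be distributed across threads by compiling with parallel=True and iterating the outermost axis with nb.prange; the function name method3_parallel is assumed here.

import numba as nb
import numpy as np

@nb.njit(fastmath=True, parallel=True, cache=True)
def method3_parallel(arr, low, high):
    # Same element-wise threshold test as method3, but the outer loop
    # is split across threads via nb.prange.
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in nb.prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out

As with method3, the first call includes compilation time, so benchmark a warmed-up call, e.g. %timeit method3_parallel(arr, 0.2, 0.8).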
