运筹系列58:python使用numba进行加速

vectorize中的参数target一共有三种取值:cpu(默认)、parallel和cuda。关于选择哪个取值,官方文档上有很好的说明:The “cpu” target works well for small data sizes (approx. less than 1KB) and low compute intensity algorithms. It has the least amount of overhead. The “parallel” target works well for medium data sizes (approx. less than 1MB). Threading adds a small delay. The “cuda” target works well for big data sizes (approx. greater than 1MB) and high compute intensity algorithms. Transfering memory to and from the GPU adds significant overhead.
GPU我们在下一篇单独介绍。

1. numba快速入门

1.1 注意事项

nopython指的是完全不用python编译器,推荐使用。可以用@njit代替。
fastmath可以去掉一些数值检查的步骤
使用cache=True将代码进行缓存
使用generated_jit进行数据类型重载

1.2 加速循环

因为numba内置的函数本身是个装饰器,所以只要在自己定义好的函数前面加个@nb.jit()就行,简单上手。下面以一个求和函数为例

# 用numba加速的求和函数
@nb.jit()
def nb_sum(a):
    Sum = 0
    for i in range(len(a)):
        Sum += a[i]
    return Sum

# 没用numba加速的求和函数
def py_sum(a):
    Sum = 0
    for i in range(len(a)):
        Sum += a[i]
    return Sum

来测试一下速度

import numpy as np
a = np.linspace(0,100,100) # 创建一个长度为100的数组

%timeit np.sum(a) # numpy自带的求和函数
%timeit sum(a) # python自带的求和函数
%timeit nb_sum(a) # numba加速的求和函数
%timeit py_sum(a) # 没加速的求和函数

结果如下

# np.sum(a)
7.1 µs ± 537 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# sum(a)
27.7 µs ± 2.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# nb_sum(a)
1.05 µs ± 27.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# py_sum(a)
43.7 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

可以看出,numba甚至比号称最接近C语言速度运行的numpy还要快6倍以上。但大家都知道,numpy往往对大的数组更加友好,那我们来测试一个更长的数组

a = np.linspace(0,100,10**6) # 创建一个长度为100万的数组

测试结果如下

# np.sum(a)
2.51 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# sum(a)
249 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# nb_sum(a)
3.01 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# py_sum(a)
592 ms ± 42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

可见即便是用很长的loop来计算,numba的表现也丝毫不亚于numpy。在这里,我们可以看到numba相对于numpy一个非常明显的优势:numba可以把各种具有很大loop的函数加到很快的速度,但numpy的加速只适用于numpy自带的函数。

1.3 尽量写的像c

我们来看下下面的代码:

import numba as nb
# 普通的 MaxPool
def max_pool_kernel(x, rs, *args):
    n, n_channels, pool_height, pool_width, out_h, out_w = args
    for i in range(n):
        for j in range(n_channels):
            for p in range(out_h):
                for q in range(out_w):
                    window = x[i, j, p:p+pool_height, q:q+pool_width]
                    rs[i, j, p, q] += np.max(window)

# 简单地加了个 jit 后的 MaxPool
@nb.jit(nopython=True)
def jit_max_pool_kernel(x, rs, *args):
    n, n_channels, pool_height, pool_width, out_h, out_w = args
    for i in range(n):
        for j in range(n_channels):
            for p in range(out_h):
                for q in range(out_w):
                    window = x[i, j, p:p+pool_height, q:q+pool_width]
                    rs[i, j, p, q] += np.max(window)

# 不惧用 for 的、“更像 C”的 MaxPool
@nb.jit(nopython=True)
def jit_max_pool_kernel2(x, rs, *args):
    n, n_channels, pool_height, pool_width, out_h, out_w = args
    for i in range(n):
        for j in range(n_channels):
            for p in range(out_h):
                for q in range(out_w):
                    _max = x[i, j, p, q]
                    for r in range(pool_height):
                        for s in range(pool_width):
                            _tmp = x[i, j, p+r, q+s]
                            if _tmp > _max:
                                _max = _tmp
                    rs[i, j, p, q] += _max

# MaxPool 的封装
def max_pool(x, kernel, args):
    n, n_channels = args[:2]
    out_h, out_w = args[-2:]
    rs = np.zeros([n, n_filters, out_h, out_w], dtype=np.float32)
    kernel(x, rs, *args)
    return rs

x = np.random.randn(64, 3, 28, 28).astype(np.float32)
w = np.random.randn(16, x.shape[1], 5, 5).astype(np.float32)
pool_height, pool_width = 2, 2
n, n_channels, height, width = x.shape
out_h = height - pool_height + 1
out_w = width - pool_width + 1
args = (n, n_channels, pool_height, pool_width, out_h, out_w)

assert np.allclose(max_pool(x, max_pool_kernel, args), max_pool(x, jit_max_pool_kernel, args))
assert np.allclose(max_pool(x, jit_max_pool_kernel, args), max_pool(x, jit_max_pool_kernel2, args))
%timeit max_pool(x, max_pool_kernel, args)
%timeit max_pool(x, jit_max_pool_kernel, args)
%timeit max_pool(x, jit_max_pool_kernel2, args)

结果如下:

1.02 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
789 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

好意外,写for循环竟然比调用numpy自带的函数还要快。

1.4 用vectorize和guvectorize创建ufunc

ufunc是universal function的缩写,逐点作用在array上,其主要作用是使用reduce和accumulate:

@vectorize([int32(int32, int32),
            int64(int64, int64),
            float32(float32, float32),
            float64(float64, float64)])
def f(x, y):
    return x + y
>>> a = np.arange(12).reshape(3, 4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> f.reduce(a, axis=0)
array([12, 15, 18, 21])
>>> f.reduce(a, axis=1)
array([ 6, 22, 38])
>>> f.accumulate(a)
array([[ 0,  1,  2,  3],
       [ 4,  6,  8, 10],
       [12, 15, 18, 21]])
>>> f.accumulate(a, axis=1)
array([[ 0,  1,  3,  6],
       [ 4,  9, 15, 22],
       [ 8, 17, 27, 38]])

guvectorize则可以更加复杂:

@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res):
    for i in range(x.shape[0]):
        res[i] = x[i] + y

>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> g(a, 10)
array([[10, 11, 12],
       [13, 14, 15]])
>>> g(a, np.array([10, 20]))
array([[10, 11, 12],
       [23, 24, 25]])

1.5 jitclass

import numpy as np
from numba import int32, float32    # import the types
from numba.experimental import jitclass

spec = [
    ('value', int32),               # a simple scalar field
    ('array', float32[:]),          # an array field
]

@jitclass(spec)
class Bag(object):
    def __init__(self, value):
        self.value = value
        self.array = np.zeros(value, dtype=np.float32)

    @property
    def size(self):
        return self.array.size

    def increment(self, val):
        for i in range(self.size):
            self.array[i] += val
        return self.array

    @staticmethod
    def add(x, y):
        return x + y

n = 21
mybag = Bag(n)

这里spec可以用字典orderDict

2. 并发

支持的运算清单:


unary operators: + - ~
binary operators: + - * / /? % | >> ^ << & ** //
comparison operators: == != < <= > >=
Numpy ufuncs that are supported in nopython mode.
User defined DUFunc through vectorize().
Numpy reduction functions sum, prod, min, max, argmin, and argmax. Also, array math functions mean, var, and std.

Numpy array creation functions zeros, ones, arange, linspace, and several random functions (rand, randn, ranf, random_sample, sample, random, standard_normal, chisquare, weibull, power, geometric, exponential, poisson, rayleigh, normal, uniform, beta, binomial, f, gamma, lognormal, laplace, randint, triangular).

Numpy dot function between a matrix and a vector, or two vectors. In all other cases, Numba’s default implementation is used.

Multi-dimensional arrays are also supported for the above operations when operands have matching dimension and size. The full semantics of Numpy broadcast between arrays with mixed dimensionality or size is not supported, nor is the reduction across a selected dimension.

Array assignment in which the target is an array selection using a slice or a boolean array, and the value being assigned is either a scalar or another selection where the slice range or bitarray are inferred to be compatible.

The reduce operator of functools is supported for specifying parallel reductions on 1D Numpy arrays but the initial value argument is mandatory.

2.1 prange

可以用spark的功能都可以考虑用prange来做。

from numba import njit, prange

@njit(parallel=True)
def prange_test(A):
    s = 0
    # Without "parallel=True" in the jit-decorator
    # the prange statement is equivalent to range
    for i in prange(A.shape[0]):
        s += A[i]
    return s

当数据量比较小时,parallel的速度反而更慢。

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)

    for i in prange(n):
        result1 *= tmp

    return result1

2.2 应用案例

这里是用梯度下降法进行回归。

@numba.jit(nopython=True, parallel=True)
def logistic_regression(Y, X, w, iterations):
    for i in range(iterations):
        w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
    return w

2.3 卷积

使用@stencil装饰符

from numba import stencil

@stencil
def kernel1(a):
    return 0.25 * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])

>>> input_arr = np.arange(100).reshape((10, 10))
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
>>> output_arr = kernel1(input_arr)
array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,   0.],
       [  0.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,   0.],
       [  0.,  31.,  32.,  33.,  34.,  35.,  36.,  37.,  38.,   0.],
       [  0.,  41.,  42.,  43.,  44.,  45.,  46.,  47.,  48.,   0.],
       [  0.,  51.,  52.,  53.,  54.,  55.,  56.,  57.,  58.,   0.],
       [  0.,  61.,  62.,  63.,  64.,  65.,  66.,  67.,  68.,   0.],
       [  0.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,  78.,   0.],
       [  0.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])

2.4 机制

loop fusion:合并loop
loop serialization:内部loop会强制使用range
Loop invariant code motion:将需要重复计算的代码移出loop
Allocation hoisting:将内存分配移出循环

3. AOT模式

ahead of time模式,即提前编译好,部署时不再需要numba。注意事项如下:

  1. 不支持ufuncs
  2. 必须要注明function signatures
  3. AOT只编译出特定CPU系列优化的代码

下面是个示例:

from numba.pycc import CC

cc = CC('my_module')
# Uncomment the following line to print out the compilation steps
#cc.verbose = True

@cc.export('multf', 'f8(f8, f8)')
@cc.export('multi', 'i4(i4, i4)')
def mult(a, b):
    return a * b

@cc.export('square', 'f8(f8)')
def square(a):
    return a ** 2

if __name__ == "__main__":
    cc.compile()

运行这段代码,会生成如下三个文件:
my_module.so, my_module.pyd, my_module.cpython-34m.so
接下来就可以使用了:

>>> import my_module
>>> my_module.multi(3, 4)
12
>>> my_module.square(1.414)
1.9993959999999997

也可以使用setup tools来生成安装包:

from distutils.core import setup

from source_module import cc

setup(...,
      ext_modules=[cc.distutils_extension()])
  • 0
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值