vectorize中的参数target一共有三种取值:cpu(默认)、parallel和cuda。关于选择哪个取值,官方文档上有很好的说明:The “cpu” target works well for small data sizes (approx. less than 1KB) and low compute intensity algorithms. It has the least amount of overhead. The “parallel” target works well for medium data sizes (approx. less than 1MB). Threading adds a small delay. The “cuda” target works well for big data sizes (approx. greater than 1MB) and high compute intensity algorithms. Transfering memory to and from the GPU adds significant overhead.
GPU我们在下一篇单独介绍。
1. numba快速入门
1.1 注意事项
nopython指的是完全不用python编译器,推荐使用。可以用@njit代替。
fastmath可以去掉一些数值检查的步骤
使用cache=True将代码进行缓存
使用generated_jit进行数据类型重载
1.2 加速循环
因为numba内置的函数本身是个装饰器,所以只要在自己定义好的函数前面加个@nb.jit()就行,简单上手。下面以一个求和函数为例
# 用numba加速的求和函数
@nb.jit()
def nb_sum(a):
Sum = 0
for i in range(len(a)):
Sum += a[i]
return Sum
# 没用numba加速的求和函数
def py_sum(a):
Sum = 0
for i in range(len(a)):
Sum += a[i]
return Sum
来测试一下速度
import numpy as np
a = np.linspace(0,100,100) # 创建一个长度为100的数组
%timeit np.sum(a) # numpy自带的求和函数
%timeit sum(a) # python自带的求和函数
%timeit nb_sum(a) # numba加速的求和函数
%timeit py_sum(a) # 没加速的求和函数
结果如下
# np.sum(a)
7.1 µs ± 537 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# sum(a)
27.7 µs ± 2.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# nb_sum(a)
1.05 µs ± 27.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# py_sum(a)
43.7 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
可以看出,numba甚至比号称最接近C语言速度运行的numpy还要快6倍以上。但大家都知道,numpy往往对大的数组更加友好,那我们来测试一个更长的数组
a = np.linspace(0,100,10**6) # 创建一个长度为100万的数组
测试结果如下
# np.sum(a)
2.51 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# sum(a)
249 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# nb_sum(a)
3.01 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# py_sum(a)
592 ms ± 42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
可见即便是用很长的loop来计算,numba的表现也丝毫不亚于numpy。在这里,我们可以看到numba相对于numpy一个非常明显的优势:numba可以把各种具有很大loop的函数加到很快的速度,但numpy的加速只适用于numpy自带的函数。
1.3 尽量写的像c
我们来看下下面的代码:
import numba as nb
# 普通的 MaxPool
def max_pool_kernel(x, rs, *args):
n, n_channels, pool_height, pool_width, out_h, out_w = args
for i in range(n):
for j in range(n_channels):
for p in range(out_h):
for q in range(out_w):
window = x[i, j, p:p+pool_height, q:q+pool_width]
rs[i, j, p, q] += np.max(window)
# 简单地加了个 jit 后的 MaxPool
@nb.jit(nopython=True)
def jit_max_pool_kernel(x, rs, *args):
n, n_channels, pool_height, pool_width, out_h, out_w = args
for i in range(n):
for j in range(n_channels):
for p in range(out_h):
for q in range(out_w):
window = x[i, j, p:p+pool_height, q:q+pool_width]
rs[i, j, p, q] += np.max(window)
# 不惧用 for 的、“更像 C”的 MaxPool
@nb.jit(nopython=True)
def jit_max_pool_kernel2(x, rs, *args):
n, n_channels, pool_height, pool_width, out_h, out_w = args
for i in range(n):
for j in range(n_channels):
for p in range(out_h):
for q in range(out_w):
_max = x[i, j, p, q]
for r in range(pool_height):
for s in range(pool_width):
_tmp = x[i, j, p+r, q+s]
if _tmp > _max:
_max = _tmp
rs[i, j, p, q] += _max
# MaxPool 的封装
def max_pool(x, kernel, args):
n, n_channels = args[:2]
out_h, out_w = args[-2:]
rs = np.zeros([n, n_filters, out_h, out_w], dtype=np.float32)
kernel(x, rs, *args)
return rs
x = np.random.randn(64, 3, 28, 28).astype(np.float32)
w = np.random.randn(16, x.shape[1], 5, 5).astype(np.float32)
pool_height, pool_width = 2, 2
n, n_channels, height, width = x.shape
out_h = height - pool_height + 1
out_w = width - pool_width + 1
args = (n, n_channels, pool_height, pool_width, out_h, out_w)
assert np.allclose(max_pool(x, max_pool_kernel, args), max_pool(x, jit_max_pool_kernel, args))
assert np.allclose(max_pool(x, jit_max_pool_kernel, args), max_pool(x, jit_max_pool_kernel2, args))
%timeit max_pool(x, max_pool_kernel, args)
%timeit max_pool(x, jit_max_pool_kernel, args)
%timeit max_pool(x, jit_max_pool_kernel2, args)
结果如下:
1.02 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
789 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
好意外,写for循环竟然比调用numpy自带的函数还要快。
1.4 用vectorize和guvectorize创建ufunc
ufunc是universal function的缩写,逐点作用在array上,其主要作用是使用reduce和accumulate:
@vectorize([int32(int32, int32),
int64(int64, int64),
float32(float32, float32),
float64(float64, float64)])
def f(x, y):
return x + y
>>> a = np.arange(12).reshape(3, 4)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> f.reduce(a, axis=0)
array([12, 15, 18, 21])
>>> f.reduce(a, axis=1)
array([ 6, 22, 38])
>>> f.accumulate(a)
array([[ 0, 1, 2, 3],
[ 4, 6, 8, 10],
[12, 15, 18, 21]])
>>> f.accumulate(a, axis=1)
array([[ 0, 1, 3, 6],
[ 4, 9, 15, 22],
[ 8, 17, 27, 38]])
guvectorize则可以更加复杂:
@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res):
for i in range(x.shape[0]):
res[i] = x[i] + y
>>> a
array([[0, 1, 2],
[3, 4, 5]])
>>> g(a, 10)
array([[10, 11, 12],
[13, 14, 15]])
>>> g(a, np.array([10, 20]))
array([[10, 11, 12],
[23, 24, 25]])
1.5 jitclass
import numpy as np
from numba import int32, float32 # import the types
from numba.experimental import jitclass
spec = [
('value', int32), # a simple scalar field
('array', float32[:]), # an array field
]
@jitclass(spec)
class Bag(object):
def __init__(self, value):
self.value = value
self.array = np.zeros(value, dtype=np.float32)
@property
def size(self):
return self.array.size
def increment(self, val):
for i in range(self.size):
self.array[i] += val
return self.array
@staticmethod
def add(x, y):
return x + y
n = 21
mybag = Bag(n)
这里spec可以用字典orderDict
2. 并发
支持的运算清单:
unary operators: + - ~
binary operators: + - * / /? % | >> ^ << & ** //
comparison operators: == != < <= > >=
Numpy ufuncs that are supported in nopython mode.
User defined DUFunc through vectorize().
Numpy reduction functions sum, prod, min, max, argmin, and argmax. Also, array math functions mean, var, and std.
Numpy array creation functions zeros, ones, arange, linspace, and several random functions (rand, randn, ranf, random_sample, sample, random, standard_normal, chisquare, weibull, power, geometric, exponential, poisson, rayleigh, normal, uniform, beta, binomial, f, gamma, lognormal, laplace, randint, triangular).
Numpy dot function between a matrix and a vector, or two vectors. In all other cases, Numba’s default implementation is used.
Multi-dimensional arrays are also supported for the above operations when operands have matching dimension and size. The full semantics of Numpy broadcast between arrays with mixed dimensionality or size is not supported, nor is the reduction across a selected dimension.
Array assignment in which the target is an array selection using a slice or a boolean array, and the value being assigned is either a scalar or another selection where the slice range or bitarray are inferred to be compatible.
The reduce operator of functools is supported for specifying parallel reductions on 1D Numpy arrays but the initial value argument is mandatory.
2.1 prange
可以用spark的功能都可以考虑用prange来做。
from numba import njit, prange
@njit(parallel=True)
def prange_test(A):
s = 0
# Without "parallel=True" in the jit-decorator
# the prange statement is equivalent to range
for i in prange(A.shape[0]):
s += A[i]
return s
当数据量比较小时,parallel的速度反而更慢。
from numba import njit, prange
import numpy as np
@njit(parallel=True)
def two_d_array_reduction_prod(n):
shp = (13, 17)
result1 = 2 * np.ones(shp, np.int_)
tmp = 2 * np.ones_like(result1)
for i in prange(n):
result1 *= tmp
return result1
2.2 应用案例
这里是用梯度下降法进行回归。
@numba.jit(nopython=True, parallel=True)
def logistic_regression(Y, X, w, iterations):
for i in range(iterations):
w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
return w
2.3 卷积
使用@stencil装饰符
from numba import stencil
@stencil
def kernel1(a):
return 0.25 * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])
>>> input_arr = np.arange(100).reshape((10, 10))
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
>>> output_arr = kernel1(input_arr)
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 11., 12., 13., 14., 15., 16., 17., 18., 0.],
[ 0., 21., 22., 23., 24., 25., 26., 27., 28., 0.],
[ 0., 31., 32., 33., 34., 35., 36., 37., 38., 0.],
[ 0., 41., 42., 43., 44., 45., 46., 47., 48., 0.],
[ 0., 51., 52., 53., 54., 55., 56., 57., 58., 0.],
[ 0., 61., 62., 63., 64., 65., 66., 67., 68., 0.],
[ 0., 71., 72., 73., 74., 75., 76., 77., 78., 0.],
[ 0., 81., 82., 83., 84., 85., 86., 87., 88., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
2.4 机制
loop fusion:合并loop
loop serialization:内部loop会强制使用range
Loop invariant code motion:将需要重复计算的代码移出loop
Allocation hoisting:将内存分配移出循环
3. AOT模式
ahead of time模式,即提前编译好,部署时不再需要numba。注意事项如下:
- 不支持ufuncs
- 必须要注明function signatures
- AOT只编译出特定CPU系列优化的代码
下面是个示例:
from numba.pycc import CC
cc = CC('my_module')
# Uncomment the following line to print out the compilation steps
#cc.verbose = True
@cc.export('multf', 'f8(f8, f8)')
@cc.export('multi', 'i4(i4, i4)')
def mult(a, b):
return a * b
@cc.export('square', 'f8(f8)')
def square(a):
return a ** 2
if __name__ == "__main__":
cc.compile()
运行这段代码,会生成如下三个文件:
my_module.so, my_module.pyd, my_module.cpython-34m.so
接下来就可以使用了:
>>> import my_module
>>> my_module.multi(3, 4)
12
>>> my_module.square(1.414)
1.9993959999999997
也可以使用setup tools来生成安装包:
from distutils.core import setup
from source_module import cc
setup(...,
ext_modules=[cc.distutils_extension()])