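First, for some background, see: https://github.com/harrism/numba_examples/blob/master/mandelbrot_numba.ipynb
In that example, the author uses the kernel API provided by Numba CUDA to parallelize the computation. When I tested the code, however, I found that
the first execution of the code below takes noticeably longer than the second and third.
Specifically, here is part of my code: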
blockdim = (32, 32)
griddim = (32, 16)
OP = my_kernel[griddim, blockdim]

# First call: the timing includes compiling the kernel
start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()
dt = timer() - start
print("Mandelbrot created in %f s" % dt)

# Second call: same argument types, so no recompilation
start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()
dt = timer() - start
print("Mandelbrot created in %f s" % dt)
Output:
Mandelbrot created in 0.261337 s
Mandelbrot created in 0.065873 s
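The first run is slower than the later ones because it includes compilation time. When 'OP', an AutoJitCUDAKernel object, is called a second time with the same argument types (-2.0, 1.0, -1.0, 1.0, dimage, 20), the kernel does not need to be recompiled, so the call takes less time. A timing gap caused purely by compilation is always annoying.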
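The caching behavior described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Numba's implementation: the class `LazyCompiled`, its `compile_cost` parameter, and the helper `timed` are hypothetical names, and the "compilation" is simulated with a short sleep.

```python
import time


class LazyCompiled:
    """Toy stand-in for a JIT dispatcher (hypothetical, not Numba's class).

    It "compiles" once per tuple of argument types and caches the result.
    """

    def __init__(self, func, compile_cost=0.05):
        self.func = func
        self.compile_cost = compile_cost  # simulated compile time, in seconds
        self.cache = {}                   # argument-type signature -> callable

    def __call__(self, *args):
        signature = tuple(type(a) for a in args)
        if signature not in self.cache:
            time.sleep(self.compile_cost)  # pretend to compile for these types
            self.cache[signature] = self.func
        return self.cache[signature](*args)


def add(a, b):
    return a + b


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


lazy_add = LazyCompiled(add)
_, first = timed(lazy_add, 1.0, 2.0)   # new signature: pays the simulated compile cost
_, second = timed(lazy_add, 3.0, 4.0)  # same types (float, float): cache hit
_, third = timed(lazy_add, 1, 2)       # new types (int, int): "compiles" again
print("first=%f s, second=%f s, third=%f s" % (first, second, third))
```

In this model only the first call for each combination of argument types is slow, which matches the pattern in the timings above.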
How can this compilation-related cost be paid up front? Just add the following line before the call:
OP.specialize(-2.0, 1.0, -1.0, 1.0, dimage, 20)
Output:
Mandelbrot created in 0.065735 s
Mandelbrot created in 0.067235 s
For what the specialize function does, see: http://numba.pydata.org/numba-doc/0.25.0/cuda-reference/kernel.html
Here is the complete code I used for testing:
import numpy as np
from pylab import imshow, show
from timeit import default_timer as timer
from numba import cuda
from numba.cuda.compiler import *


def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i
    return max_iters


# Compile mandel as a device function callable from kernels
mandel_gpu = cuda.jit(device=True)(mandel)


@cuda.jit
def my_kernel(min_x, max_x, min_y, max_y, image, iters):
    for i in range(1):
        height = image.shape[0]
        width = image.shape[1]

        pixel_size_x = (max_x - min_x) / width
        pixel_size_y = (max_y - min_y) / height

        startX, startY = cuda.grid(2)
        gridX = cuda.gridDim.x * cuda.blockDim.x
        gridY = cuda.gridDim.y * cuda.blockDim.y

        # Grid-stride loops: each thread handles several pixels
        for x in range(startX, width, gridX):
            real = min_x + x * pixel_size_x
            for y in range(startY, height, gridY):
                imag = min_y + y * pixel_size_y
                # Inlined version of mandel (disabled):
                # cl = -1
                # c = complex(real, imag)
                # z = 0.0j
                # for i in range(iters):
                #     z = z * z + c
                #     if (z.real * z.real + z.imag * z.imag) >= 4:
                #         cl = i
                #         break
                # if cl == -1:
                #     cl = iters
                # The function version and the inlined version take almost the same time
                image[y, x] = mandel_gpu(real, imag, iters)


# Not used in the rest of this test
my_kernel_gpu = cuda.jit(device=True)(my_kernel)

gimage = np.zeros((5000, 3000), dtype=np.uint8)
dimage = cuda.to_device(gimage)

blockdim = (32, 32)
griddim = (32, 16)
OP = my_kernel[griddim, blockdim]
# Compile ahead of time for these argument types
OP.specialize(-2.0, 1.0, -1.0, 1.0, dimage, 20)

start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()
dt = timer() - start
print("Mandelbrot created in %f s" % dt)

# Note: OP was already configured above, so reassigning these two
# names does not change the launch configuration of the next call.
blockdim = (32, 8)
griddim = (32, 2)

start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()
dt = timer() - start
print("Mandelbrot created in %f s" % dt)

imshow(gimage)
show()