How to Pre-compile CUDA Kernel Functions in Numba

First, for some background, see: https://github.com/harrism/numba_examples/blob/master/mandelbrot_numba.ipynb


In that notebook, the author uses the kernel API provided by Numba CUDA to run the computation in parallel. While testing the code, however, I noticed that the first execution of the snippet below takes noticeably longer than the second and third runs.

Specifically, here is the relevant part of my code:

blockdim = (32, 32)
griddim = (32, 16)

OP = my_kernel[griddim, blockdim]

start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()
dt = timer() - start
print("Mandelbrot created in %f s" % dt)

start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()
dt = timer() - start
print("Mandelbrot created in %f s" % dt)

Output:

Mandelbrot created in 0.261337 s

Mandelbrot created in 0.065873 s

The first call takes longer because it includes compilation time. When OP, an AutoJitCUDAKernel object, is called a second time, the identical argument types (-2.0, 1.0, -1.0, 1.0, dimage, 20) mean the kernel does not need to be recompiled, so the call finishes much faster. This compile-time overhead on the first call is always a nuisance.
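To see why, it helps to picture what the dispatcher is doing. Conceptually (a simplified sketch, not Numba's actual implementation), compiled kernels are cached keyed on the tuple of argument types: the first call with a new type combination pays the compilation cost, and every later call with the same types is a cache hit.

import time

# Simplified sketch of type-keyed dispatch -- NOT Numba's real code,
# just an illustration of why only the first call is slow.
_cache = {}

def _fake_compile(argtypes):
    time.sleep(0.2)                # stand-in for the real JIT compilation cost
    return lambda *args: None      # stand-in for the compiled kernel

def dispatch(*args):
    argtypes = tuple(type(a) for a in args)  # Numba resolves richer types
    if argtypes not in _cache:               # first call with these types
        _cache[argtypes] = _fake_compile(argtypes)
    return _cache[argtypes](*args)           # later calls reuse the kernel

dispatch(1.0, 2)   # slow: triggers _fake_compile
dispatch(3.0, 4)   # fast: same (float, int) types, cache hit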


How can we pay this compilation cost up front? Just add the following line before the timed calls:

OP.specialize(-2.0, 1.0, -1.0, 1.0, dimage, 20)

Output:

Mandelbrot created in 0.065735 s

Mandelbrot created in 0.067235 s

The specialize function is documented here: http://numba.pydata.org/numba-doc/0.25.0/cuda-reference/kernel.html
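As an aside, lazy specialization is not the only option: Numba also supports eager compilation, where passing an explicit signature to cuda.jit compiles the kernel at definition time instead of on the first call. A minimal sketch, assuming the argument types used in this post (four float64 scalars, a 2-D uint8 array, and an int64); the kernel name and placeholder body are mine, not from the original:

from numba import cuda

# Eager compilation: with an explicit signature, Numba compiles the
# kernel when the decorator runs, not on the first launch. The
# signature below is an assumption matching this post's arguments.
@cuda.jit("void(float64, float64, float64, float64, uint8[:,:], int64)")
def my_kernel_eager(min_x, max_x, min_y, max_y, image, iters):
    x, y = cuda.grid(2)
    if x < image.shape[1] and y < image.shape[0]:
        image[y, x] = 0  # placeholder body; the real kernel is shown below

The trade-off is that an eagerly compiled kernel only accepts the declared argument types.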




Here is the full code I used for testing:

import numpy as np
from pylab import imshow, show
from timeit import default_timer as timer
from numba import cuda

def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i

    return max_iters

mandel_gpu = cuda.jit(device=True)(mandel)


@cuda.jit
def my_kernel(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]

    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height

    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y

    for x in range(startX, width, gridX):
        real = min_x + x * pixel_size_x
        for y in range(startY, height, gridY):
            imag = min_y + y * pixel_size_y
            # Inlining the Mandelbrot iteration here takes almost the
            # same time as calling the device function:
            # cl = -1
            # c = complex(real, imag)
            # z = 0.0j
            # for i in range(iters):
            #     z = z * z + c
            #     if (z.real * z.real + z.imag * z.imag) >= 4:
            #         cl = i
            #         break
            # if cl == -1:
            #     cl = iters
            image[y, x] = mandel_gpu(real, imag, iters)


gimage = np.zeros((5000, 3000), dtype=np.uint8)
dimage = cuda.to_device(gimage)

blockdim = (32, 32)
griddim = (32, 16)

OP = my_kernel[griddim, blockdim]
OP.specialize(-2.0, 1.0, -1.0, 1.0, dimage, 20)

start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()  # copy the result back into gimage (newer Numba uses copy_to_host())
dt = timer() - start
print("Mandelbrot created in %f s" % dt)

# Note: re-assigning blockdim/griddim here has no effect on OP,
# which was already configured with the launch parameters above.
blockdim = (32, 8)
griddim = (32, 2)
start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()
dt = timer() - start
print("Mandelbrot created in %f s" % dt)

imshow(gimage)
show()
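Finally, if you prefer not to call specialize, a cruder but equally effective warm-up is a single untimed throwaway launch: the first call compiles the kernel, and the timed calls then hit the cache. A minimal sketch using the objects defined above:

# Warm-up by a throwaway call: the first launch pays the JIT cost.
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
cuda.synchronize()  # wait until compilation and the launch finish

start = timer()
OP(-2.0, 1.0, -1.0, 1.0, dimage, 20)
dimage.to_host()
dt = timer() - start
print("Mandelbrot created in %f s" % dt)  # now excludes compile time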



