Numba

weixin_30663471

Tags: python

Original link: http://www.cnblogs.com/yanxingang/p/10908192.html

Numba for CUDA GPUs

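I. Overview

Numba can compile a restricted subset of Python code into CUDA kernels and device functions. NumPy arrays are transferred between the CPU and the GPU automatically.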

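Terminology:

host: the CPU

device: the GPU

host memory: the system's main memory

device memory: the onboard memory on the GPU card

kernel: a GPU function launched by the host and executed on the device

device function: a GPU function executed on the device that can only be called from the device (that is, from a kernel or another device function)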

Note:

Numba does not implement all CUDA features. The following are not implemented:

dynamic parallelism

texture memory

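II. Writing CUDA kernels

Kernel declaration:

A kernel function is a GPU function that is meant to be called from CPU code. It has two fundamental characteristics:

1. A kernel cannot explicitly return a value; all result data must be written to an array passed to the function.

2. The thread hierarchy (the number of blocks and the number of threads per block) is declared explicitly when the kernel is called.

At first glance, writing a CUDA kernel with Numba looks very much like writing a JIT function for the CPU.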

from numba import cuda

@cuda.jit
def increment_by_one(an_array):
    """
    Increment all array elements by one.
    """
    # code elided here; read further for different implementations

Kernel invocation:

A kernel is typically launched as follows:

threadsperblock = 32
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock
increment_by_one[blockspergrid, threadsperblock](an_array)
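
The `blockspergrid` expression is an integer ceiling division: it rounds up so that a final, partially filled block covers the remainder of the array. A small sketch of the arithmetic in plain Python follows; the helper name `blocks_per_grid` is hypothetical and not part of Numba.

```python
import math


def blocks_per_grid(n_elements, threads_per_block):
    """Hypothetical helper: smallest block count whose threads cover n_elements."""
    return (n_elements + (threads_per_block - 1)) // threads_per_block


# The integer form agrees with rounding up the true quotient.
for n in (1, 32, 33, 100, 1024):
    assert blocks_per_grid(n, 32) == math.ceil(n / 32)

print(blocks_per_grid(100, 32))  # 4 blocks: 128 threads for 100 elements
```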

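Note that there are two steps:

1. Instantiate the kernel by specifying the number of blocks per grid and the number of threads per block. The product of the two is the total number of threads launched.

2. Run the kernel, passing it the input array.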
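
Because the thread count is a product of two integers, it is generally larger than the array, which is why a kernel needs a bounds check. The following CPU-only emulation walks the same grid sequentially to make the index arithmetic visible. It is a sketch of the execution model under a simplifying assumption (a real GPU runs the threads in parallel), and `emulate_launch` is a hypothetical helper, not a Numba API.

```python
def emulate_launch(an_array, blockspergrid, threadsperblock):
    """Hypothetical CPU emulation of a 1D launch of an increment kernel."""
    idle = 0
    for block in range(blockspergrid):
        for thread in range(threadsperblock):
            pos = block * threadsperblock + thread  # absolute thread position
            if pos < len(an_array):
                an_array[pos] += 1
            else:
                idle += 1  # surplus thread: its index falls past the end
    return idle


values = list(range(100))
threadsperblock = 32
blockspergrid = (len(values) + (threadsperblock - 1)) // threadsperblock
idle_threads = emulate_launch(values, blockspergrid, threadsperblock)
print(blockspergrid * threadsperblock, idle_threads)  # 128 28
```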

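In the launch above, the choice of block size (the number of threads per block) is often crucial:

1. On the software side, the block size determines how many threads share a given area of shared memory.

2. On the hardware side, the block size must be large enough to fully occupy the execution units.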

Reposted from: https://www.cnblogs.com/yanxingang/p/10908192.html
