Written by | 柳俊丞
Generally speaking, in code we see a CUDA kernel launched in the following way:
cuda_kernel<<<grid_size, block_size, 0, stream>>>(...)
cuda_kernel is the identifier of the global function, and the (...) holds the arguments passed to cuda_kernel; both of these follow ordinary C++ syntax. The <<<grid_size, block_size, 0, stream>>> part, however, is a CUDA extension to C++ called the Execution Configuration (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration). As described in the CUDA C++ Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#abstract, hereafter the Guide):
The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
Dg is of type dim3 (see dim3) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched;
Db is of type dim3 (see dim3) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is an optional argument which defaults to 0.
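To make the four parameters Dg, Db, Ns, and S concrete, here is a minimal sketch of a launch that uses all of them; the kernel copy_kernel and all variable names are made up for illustration and are not part of the original article. Each thread stages one element through dynamically allocated shared memory, whose size is supplied as Ns at launch time.

#include <cuda_runtime.h>

// Hypothetical kernel: copy one element per thread via dynamic shared memory.
__global__ void copy_kernel(const float* in, float* out, int n) {
    extern __shared__ float buf[];   // size set by Ns in the execution configuration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = in[i];
        out[i] = buf[threadIdx.x];
    }
}

int main() {
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 block(256);                               // Db: threads per block
    dim3 grid((n + block.x - 1) / block.x);        // Dg: number of blocks
    size_t shared_bytes = block.x * sizeof(float); // Ns: dynamic shared memory per block

    // <<<Dg, Db, Ns, S>>> is the Execution Configuration described above.
    copy_kernel<<<grid, block, shared_bytes, stream>>>(in, out, n);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

In the common case shown at the top of this article, Ns is simply 0 (no dynamic shared memory) and only the grid size, block size, and stream vary.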