How to Set grid_size and block_size in a CUDA Kernel?

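This article discusses how to choose grid_size and block_size for a CUDA kernel. block_size should be a multiple of 32 and no larger than 1024, with 128 as the recommended value. grid_size should be large enough to ensure sufficient occupancy, should avoid the tail effect, and must not exceed the GPU's maximum grid dimensions. For a typical elementwise kernel, set block_size to 128 and choose a grid_size large enough to span multiple waves.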

Author | 柳俊丞

Typically, a CUDA kernel is launched in code like this:

cuda_kernel<<<grid_size, block_size, 0, stream>>>(...)

Here cuda_kernel is the identifier of a __global__ function, and (...) holds the arguments passed to cuda_kernel; both follow ordinary C++ syntax. The <<<grid_size, block_size, 0, stream>>> part is a CUDA extension to C++ called the Execution Configuration (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration). The CUDA C++ Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#abstract, hereafter "the Guide") describes it as follows:

The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:

  • Dg is of type dim3 (see dim3) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched;

  • Db is of type dim3 (see dim3) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;

  • Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned i
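To make the execution configuration concrete, here is a minimal sketch of a launch that spells out each of the four arguments. The kernel `ScaleKernel`, the helper `CeilDiv`, and the problem size `n` are hypothetical names chosen for illustration; the block size of 128 follows the value this article recommends, and the grid size shown is just one simple choice (enough blocks to cover every element), not a tuned setting.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical elementwise kernel used only for illustration: y[i] = a * x[i].
__global__ void ScaleKernel(int n, float a, const float* x, float* y) {
  // Grid-stride loop, so any grid_size >= 1 still covers all n elements.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    y[i] = a * x[i];
  }
}

// Smallest number of blocks such that blocks * block_size >= n (hypothetical helper).
constexpr int CeilDiv(int n, int block_size) {
  return (n + block_size - 1) / block_size;
}

int main() {
  const int n = 1 << 20;  // assumed problem size
  float* x = nullptr;
  float* y = nullptr;
  cudaMalloc(&x, n * sizeof(float));
  cudaMalloc(&y, n * sizeof(float));
  cudaMemset(x, 0, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  const dim3 block_size(128);             // Db: threads per block
  const dim3 grid_size(CeilDiv(n, 128));  // Dg: number of blocks
  const size_t dynamic_smem_bytes = 0;    // Ns: no dynamic shared memory
  ScaleKernel<<<grid_size, block_size, dynamic_smem_bytes, stream>>>(
      n, 2.0f, x, y);

  // Launch errors surface via cudaGetLastError; execution errors on sync.
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
  printf("launch status: %s\n", cudaGetErrorString(err));

  cudaStreamDestroy(stream);
  cudaFree(x);
  cudaFree(y);
  return err == cudaSuccess ? 0 : 1;
}
```

Because the kernel uses a grid-stride loop, grid_size can later be reduced or capped without changing the kernel body, which keeps the launch configuration a tuning decision rather than a correctness requirement.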
