Written by | 柳俊丞
Generally speaking, in code we see a CUDA kernel launched in the following way:
cuda_kernel<<<grid_size, block_size, 0, stream>>>(...)
cuda_kernel is the identifier of the global function, and the (...) holds the arguments passed to cuda_kernel; both of these follow ordinary C++ syntax. The <<<grid_size, block_size, 0, stream>>> part, however, is a CUDA extension to C++ called the Execution Configuration (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration). As described in the CUDA C++ Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#abstract, hereafter the Guide):
The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
Dg is of type dim3 (see dim3) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched;
Db is of type dim3 (see dim3) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is an optional argument which defaults to 0.
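To make the four parameters Dg, Db, Ns, and S concrete, here is a minimal sketch of a launch that uses all of them; the kernel copy_kernel and all variable names are made up for illustration and are not part of the original article. Each thread stages one element through dynamically allocated shared memory, whose size is supplied as Ns at launch time.

#include <cuda_runtime.h>

// Hypothetical kernel: copy one element per thread via dynamic shared memory.
__global__ void copy_kernel(const float* in, float* out, int n) {
    extern __shared__ float buf[];   // size set by Ns in the execution configuration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = in[i];
        out[i] = buf[threadIdx.x];
    }
}

int main() {
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 block(256);                               // Db: threads per block
    dim3 grid((n + block.x - 1) / block.x);        // Dg: number of blocks
    size_t shared_bytes = block.x * sizeof(float); // Ns: dynamic shared memory per block

    // <<<Dg, Db, Ns, S>>> is the Execution Configuration described above.
    copy_kernel<<<grid, block, shared_bytes, stream>>>(in, out, n);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

In the common case shown at the top of this article, Ns is simply 0 (no dynamic shared memory) and only the grid size, block size, and stream vary.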