Understanding CUDA grid dimensions, block dimensions and threads organization

Latest recommended article published on 2023-04-19 12:42:49

EnjoyCodingAndGame

Category: CUDA. Tags: CUDA, GPU


Hardware

If a GPU device has, for example, 4 multiprocessors, and each of them can run 768 threads, then at any given moment no more than 4*768 = 3072 threads will really be running in parallel (if you planned more threads, they will wait their turn).
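The two figures used here (4 multiprocessors, 768 threads each) are only an illustration; the real values can be read from the device at run time. A minimal sketch, assuming device 0 is the GPU of interest. The helper name `maxResidentThreads` is made up for this example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on the number of threads that can be
// resident on the device at once. Returns -1 if the properties cannot be read
// (for example, when no CUDA device is present).
int maxResidentThreads(int device = 0)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    std::printf("%s: %d multiprocessors, %d threads per multiprocessor\n",
                prop.name, prop.multiProcessorCount,
                prop.maxThreadsPerMultiProcessor);
    return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
}
```

For the device in the example above, this would return 4*768 = 3072.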

Software

Threads are organized in blocks. A block is executed by a single multiprocessor. The threads of a block can be identified (indexed) using 1D (x), 2D (x,y) or 3D (x,y,z) indexes, but in any case x*y*z <= 768 in our example (on a real device the per-block limit is its own figure, reported as maxThreadsPerBlock, and other restrictions apply to x, y and z; see the CUDA programming guide and your device's compute capability).
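The restrictions on x, y and z can be checked in code before a launch. The following is a rough sketch, not a complete validation: the function name `blockShapeIsValid` is hypothetical, and it only looks at the `maxThreadsPerBlock` and `maxThreadsDim` fields of `cudaDeviceProp`.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true if `block` respects the per-dimension limits and
// the total threads-per-block limit stored in `prop`.
bool blockShapeIsValid(dim3 block, const cudaDeviceProp& prop)
{
    // Total thread count x*y*z, computed in 64 bits to avoid overflow.
    const unsigned long long total = 1ULL * block.x * block.y * block.z;

    return block.x >= 1 && block.y >= 1 && block.z >= 1 &&
           block.x <= static_cast<unsigned int>(prop.maxThreadsDim[0]) &&
           block.y <= static_cast<unsigned int>(prop.maxThreadsDim[1]) &&
           block.z <= static_cast<unsigned int>(prop.maxThreadsDim[2]) &&
           total <= static_cast<unsigned long long>(prop.maxThreadsPerBlock);
}
```

In practice `prop` would be filled by `cudaGetDeviceProperties`; with a limit of 768 threads per block, an 8 x 8 block passes and a 32 x 32 block (1024 threads) does not.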

Obviously, if you need more than those 4*768 threads, you need more than 4 blocks. Blocks can also be indexed in 1D, 2D or 3D. Blocks wait in a queue to enter the GPU (because, in our example, the GPU has 4 multiprocessors and, with blocks of 768 threads, only 4 blocks are executed simultaneously).

Now a simple case: processing a 512x512 image

Suppose we want one thread to process one pixel (i,j).

We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks (so that there are 4096*64 = 512x512 threads in total).

It's common to organize the threads in 2D blocks with blockDim = 8 x 8 (the 64 threads per block), because this makes indexing the image easier. I prefer to call it threadsPerBlock.

dim3 threadsPerBlock(8, 8); // 64 threads

and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it blocksPerGrid.

dim3 blocksPerGrid(imageWidth / threadsPerBlock.x,   /* for instance 512/8 = 64 */
                   imageHeight / threadsPerBlock.y);
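The division above is exact for a 512x512 image and 8 x 8 blocks. For sizes that are not multiples of the block dimensions, integer division truncates and the last pixels of each row and column get no thread. One common way around this is to round the block count up, sketched here with two made-up helpers, `divUp` and `gridFor`:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: integer division rounded up (d must be non-zero).
inline unsigned int divUp(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

// Hypothetical helper: smallest 2D grid whose blocks cover the whole image.
inline dim3 gridFor(unsigned int imageWidth, unsigned int imageHeight,
                    dim3 threadsPerBlock)
{
    return dim3(divUp(imageWidth, threadsPerBlock.x),
                divUp(imageHeight, threadsPerBlock.y));
}
```

With 8 x 8 blocks, `gridFor(512, 512, ...)` gives the same 64 x 64 grid as before, while a 500x300 image gets 63 x 38 blocks. A rounded-up grid contains a few surplus threads, so the kernel then has to skip threads whose (i,j) falls outside the image.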

The kernel is launched like this:

myKernel <<<blocksPerGrid, threadsPerBlock>>>( /* params for the kernel function */ );

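Finally, there will be something like "a queue of 4096 blocks", where each block waits to be assigned to one of the GPU's multiprocessors so that its 64 threads can be executed (a multiprocessor may host several such small blocks at once, as long as its resource limits allow).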
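How many blocks of a given size a multiprocessor can hold at the same time depends on the device and on the kernel. The CUDA runtime can report this number; the sketch below asks it for a placeholder kernel. Both `placeholderKernel` and `blocksResidentPerMultiprocessor` are names invented for this example, and a real kernel would normally give a different answer because of its register and shared memory usage.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-pixel kernel.
__global__ void placeholderKernel() {}

// Hypothetical helper: number of blocks of `threadsPerBlock` threads that can
// be resident on one multiprocessor for placeholderKernel.
// Returns -1 if the query fails (for example, when no CUDA device is present).
int blocksResidentPerMultiprocessor(int threadsPerBlock)
{
    int numBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, placeholderKernel, threadsPerBlock,
        0 /* no dynamic shared memory */);
    return (err == cudaSuccess) ? numBlocks : -1;
}
```

Calling `blocksResidentPerMultiprocessor(64)` shows how many of the 64-thread blocks from the example can share a multiprocessor on the current device.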

In the kernel the pixel (i,j) to be processed by a thread is calculated this way:

unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;

unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;
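Putting the pieces together, here is a small sketch of a complete per-pixel kernel and its launch. Several things in it are assumptions that the text above does not specify: the image is 8-bit grayscale stored row by row, i is the column and j is the row, and the operation (inverting each pixel) is chosen only to have something to compute. The names `invertKernel` and `invertOnGpu` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical example kernel: one thread inverts one pixel (i, j).
__global__ void invertKernel(unsigned char* image,
                             unsigned int width, unsigned int height)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;  // column (assumed)
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;  // row (assumed)

    // Threads that fall outside the image do nothing.
    if (i < width && j < height) {
        unsigned int idx = j * width + i;  // row-major offset of pixel (i, j)
        image[idx] = static_cast<unsigned char>(255 - image[idx]);
    }
}

// Host wrapper: copies the image to the GPU, launches the kernel with
// 8 x 8 blocks, and copies the result back. Returns false on any CUDA error
// or if the buffer size does not match width * height.
bool invertOnGpu(std::vector<unsigned char>& pixels,
                 unsigned int width, unsigned int height)
{
    const size_t bytes = pixels.size();
    if (bytes == 0 || bytes != static_cast<size_t>(width) * height) {
        return false;
    }

    unsigned char* d_image = nullptr;
    if (cudaMalloc(&d_image, bytes) != cudaSuccess) {
        return false;
    }

    bool ok = cudaMemcpy(d_image, pixels.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 threadsPerBlock(8, 8);  // 64 threads
        // Round up so that partial blocks at the image edge are included.
        dim3 blocksPerGrid((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
                           (height + threadsPerBlock.y - 1) / threadsPerBlock.y);

        invertKernel<<<blocksPerGrid, threadsPerBlock>>>(d_image, width, height);

        // cudaGetLastError reports launch failures; the blocking copy back
        // waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(pixels.data(), d_image, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_image);
    return ok;
}
```

For the 512x512 image of the example, `invertOnGpu(pixels, 512, 512)` launches the same 64 x 64 grid of 8 x 8 blocks described above.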
