CUDA Programming: Threads, Blocks, and Grids

All threads of a block execute on the same streaming multiprocessor (SM).

For more on thread blocks, see the CUDA C Programming Guide: section D.1.2 (Glossary) of the CUDA Dynamic Parallelism chapter describes a thread block as follows.

A Thread Block is a group of threads which execute on the same multiprocessor (SMX). Threads within a Thread Block have access to shared memory and can be explicitly synchronized.
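To make those two properties concrete (per-block shared memory and explicit synchronization with __syncthreads()), here is a minimal sketch; the kernel name, the fixed 256-thread block size, and the element-reversal logic are only illustrative assumptions:

```cuda
// Minimal sketch: each block stages its slice of the input in shared memory,
// synchronizes, then every thread reads an element written by another thread
// of the same block (reversing the slice). Block size is assumed to be 256.
__global__ void reverse_within_block(const int *in, int *out, int n)
{
    __shared__ int tile[256];                     // visible to all threads of this block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        tile[threadIdx.x] = in[gid];

    __syncthreads();                              // explicit barrier: the whole block waits here

    int src    = blockDim.x - 1 - threadIdx.x;    // partner thread's slot in the tile
    int srcGid = blockIdx.x * blockDim.x + src;
    if (gid < n && srcGid < n)
        out[gid] = tile[src];
}
```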

A kernel can be executed by multiple equally-shaped thread blocks.

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.

However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
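The two statements above (the per-block thread limit and a grid whose size is dictated by the data) usually show up together in the standard launch pattern below; this is a hedged sketch, and the kernel name, the block size of 256, and the problem size are assumptions of the example:

```cuda
#include <cuda_runtime.h>

// Illustrative element-wise kernel.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last, partially-filled block
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // Threads per block is capped (1024 on current GPUs); the grid size is
    // derived from the data size so that blocks * threadsPerBlock >= n,
    // i.e. total threads = threads per block * number of blocks.
    const int threadsPerBlock = 256;
    const int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;

    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```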


Take the NVIDIA GK110 architecture as an example; the following explanation was given at https://devtalk.nvidia.com/default/topic/897696/relationship-between-threads-and-gpu-core-units/?offset=6:


An SMX consists of 4 subpartitions, each containing a warp scheduler, resources (register file, scheduler slots) and execution units. The SMX also contains shared execution units such as the texture unit, shared memory unit, and double precision units.

The compute work distributor distributes thread blocks to an SMX when the SMX has sufficient available resources for the thread block. The thread block is divided into warps. Each warp is allocated to an SM subpartition and warp resources such as registers are allocated. A warp stays on that specific subpartition until it completes. When it completes, its resources are freed.

Each cycle, each warp scheduler picks an eligible warp (one that is not stalled) and issues 1 or 2 instructions from it. These instructions are dispatched to execution units (single precision/integer unit, double precision unit, special function unit, load/store unit, texture unit, shared memory unit, etc.). Each of the execution units is pipelined, so the warp scheduler can issue instructions from the same warp or a different warp N cycles later. ALU instructions tend to have fixed latency (measurable by microbenchmarks), whereas SMX shared units such as the double precision unit and memory units such as shared memory and the texture unit have variable latency.

The reason the SMX can manage 2048 threads (= 64 warps) is so that each warp scheduler has a sufficient pool of warps to hide long-latency and short-latency instructions without adding the area and power cost of out-of-order execution.
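The limits quoted above (32-thread warps, up to 2048 resident threads = 64 warps per SMX) can be read back from the runtime; the short query below is a sketch, assuming device 0 is the GK110-class GPU being discussed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the per-SM limits discussed above for device 0.
// On a GK110 part this reports a warp size of 32 and
// 2048 threads (64 warps) per multiprocessor.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("device               : %s\n", prop.name);
    printf("multiprocessors      : %d\n", prop.multiProcessorCount);
    printf("warp size            : %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads per SM   : %d (= %d warps)\n",
           prop.maxThreadsPerMultiProcessor,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```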


In addition, Appendix A of NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf mentions the following:

CUDA Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses. 
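As a small follow-up to the last sentence, the sketch below shows how a thread can compute its warp and lane indices, and why indexing with consecutive threadIdx.x values keeps accesses to nearby addresses; the kernel name and the trivial arithmetic are only illustrative:

```cuda
// Consecutive threadIdx.x values belong to the same 32-thread warp, so having
// them read in[gid] at consecutive addresses keeps the accesses coalesced.
__global__ void warp_layout(const float *in, float *out, int n)
{
    int gid    = blockIdx.x * blockDim.x + threadIdx.x;
    int laneId = threadIdx.x % warpSize;   // position within the warp (0..31)
    int warpId = threadIdx.x / warpSize;   // warp index within the block

    if (gid < n)                           // bounds check for the tail of the data
        out[gid] = in[gid] + (float)(warpId * warpSize + laneId);
}
```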
