CUDA Programming: Threads, Blocks, and Grids

All threads of a block execute on the same streaming multiprocessor (SM).

For more on thread blocks, see the CUDA C Programming Guide: section D.1.2 (Glossary) of the CUDA Dynamic Parallelism chapter describes a thread block as follows.

A Thread Block is a group of threads which execute on the same multiprocessor (SMX). Threads within a Thread Block have access to shared memory and can be explicitly synchronized.
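To make those two properties concrete (per-block shared memory and explicit synchronization with __syncthreads()), here is a minimal sketch; the kernel name, the fixed 256-thread block size, and the element-reversal logic are only illustrative assumptions:

```cuda
// Minimal sketch: each block stages its slice of the input in shared memory,
// synchronizes, then every thread reads an element written by another thread
// of the same block (reversing the slice). Block size is assumed to be 256.
__global__ void reverse_within_block(const int *in, int *out, int n)
{
    __shared__ int tile[256];                     // visible to all threads of this block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        tile[threadIdx.x] = in[gid];

    __syncthreads();                              // explicit barrier: the whole block waits here

    int src    = blockDim.x - 1 - threadIdx.x;    // partner thread's slot in the tile
    int srcGid = blockIdx.x * blockDim.x + src;
    if (gid < n && srcGid < n)
        out[gid] = tile[src];
}
```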

A kernel can be executed by multiple equally-shaped thread blocks.

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.

However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
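The two statements above (the per-block thread limit and a grid whose size is dictated by the data) usually show up together in the standard launch pattern below; this is a hedged sketch, and the kernel name, the block size of 256, and the problem size are assumptions of the example:

```cuda
#include <cuda_runtime.h>

// Illustrative element-wise kernel.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last, partially-filled block
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // Threads per block is capped (1024 on current GPUs); the grid size is
    // derived from the data size so that blocks * threadsPerBlock >= n,
    // i.e. total threads = threads per block * number of blocks.
    const int threadsPerBlock = 256;
    const int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;

    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```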


Take the NVIDIA GK110 architecture as an example; the following explanation was given at https://devtalk.nvidia.com/default/topic/897696/relationship-between-threads-and-gpu-core-units/?offset=6:


An SMX consists of 4 subpartitions, each containing a warp scheduler, resources (register file, scheduler slots) and execution units. The SMX also contains shared execution units such as the texture unit, shared memory unit, and double precision units.

The compute work distributor distributes thread blocks to an SMX when the SMX has sufficient available resources for the thread block. The thread block is divided into warps. Each warp is allocated to an SM subpartition and warp resources such as registers are allocated. A warp stays on that specific subpartition until it completes. When it completes, its resources are freed.

Each cycle, each warp scheduler picks an eligible warp (one that is not stalled) and issues 1 or 2 instructions from it. These instructions are dispatched to execution units (single precision/integer unit, double precision unit, special function unit, load/store unit, texture unit, shared memory unit, etc.). Each of the execution units is pipelined, so the warp scheduler can issue instructions from the same warp or a different warp N cycles later. ALU instructions tend to have fixed latency (measurable by microbenchmarks), whereas SMX shared units such as the double precision unit and memory units such as shared memory and the texture unit have variable latency.

The reason the SMX can manage 2048 threads (= 64 warps) is so that each warp scheduler has a sufficient pool of warps to hide long-latency and short-latency instructions without adding the area and power cost of out-of-order execution.
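The limits quoted above (32-thread warps, up to 2048 resident threads = 64 warps per SMX) can be read back from the runtime; the short query below is a sketch, assuming device 0 is the GK110-class GPU being discussed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the per-SM limits discussed above for device 0.
// On a GK110 part this reports a warp size of 32 and
// 2048 threads (64 warps) per multiprocessor.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("device               : %s\n", prop.name);
    printf("multiprocessors      : %d\n", prop.multiProcessorCount);
    printf("warp size            : %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads per SM   : %d (= %d warps)\n",
           prop.maxThreadsPerMultiProcessor,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```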


In addition, Appendix A of NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf mentions the following:

CUDA Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses. 
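As a small follow-up to the last sentence, the sketch below shows how a thread can compute its warp and lane indices, and why indexing with consecutive threadIdx.x values keeps accesses to nearby addresses; the kernel name and the trivial arithmetic are only illustrative:

```cuda
// Consecutive threadIdx.x values belong to the same 32-thread warp, so having
// them read in[gid] at consecutive addresses keeps the accesses coalesced.
__global__ void warp_layout(const float *in, float *out, int n)
{
    int gid    = blockIdx.x * blockDim.x + threadIdx.x;
    int laneId = threadIdx.x % warpSize;   // position within the warp (0..31)
    int warpId = threadIdx.x / warpSize;   // warp index within the block

    if (gid < n)                           // bounds check for the tail of the data
        out[gid] = in[gid] + (float)(warpId * warpSize + laneId);
}
```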
