TURING STREAMING MULTIPROCESSOR (SM) ARCHITECTURE

Latest recommended article published on 2023-10-18 14:52:09

妖怪哪里走

Category: GPU. Tags: GPU, hardware, Nvida

Article link: https://blog.csdn.net/royalfizz/article/details/93301678

The Turing architecture features a new SM design that incorporates many of the features introduced in our Volta GV100 SM architecture.

Two SMs are included per TPC, and each SM has a total of 64 FP32 Cores and 64 INT32 Cores.

In comparison, the Pascal GP10x GPUs have one SM per TPC and 128 FP32 Cores per SM.

The Turing SM supports concurrent execution of FP32 and INT32 operations (more details below), as well as independent thread scheduling similar to the Volta GV100 GPU. Each Turing SM also includes eight mixed-precision Turing Tensor Cores and one RT Core.

The Turing SM is partitioned into four processing blocks, each with:

16 FP32 Cores
16 INT32 Cores
two Tensor Cores
one warp scheduler
one dispatch unit

Each block includes a new L0 instruction cache and a 64 KB register file.
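
As a quick consistency check, the per-block figures can be multiplied out to the per-SM and per-TPC totals quoted earlier. The following is a minimal Python sketch; the names `ProcessingBlock`, `sm_totals`, and `fp32_cores_per_tpc` are hypothetical helpers introduced here for illustration and are not part of any NVIDIA tool or API.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProcessingBlock:
    """Resources of one Turing SM processing block, as listed in the text."""
    fp32_cores: int = 16
    int32_cores: int = 16
    tensor_cores: int = 2
    warp_schedulers: int = 1
    dispatch_units: int = 1
    register_file_kb: int = 64


BLOCKS_PER_SM = 4   # "partitioned into four processing blocks"
SMS_PER_TPC = 2     # "Two SMs are included per TPC"


def sm_totals(block: ProcessingBlock = ProcessingBlock()) -> dict:
    """Scale every per-block resource count up to a whole SM."""
    return {name: count * BLOCKS_PER_SM for name, count in asdict(block).items()}


def fp32_cores_per_tpc(fp32_per_sm: int, sms_per_tpc: int) -> int:
    """FP32 cores available in one TPC."""
    return fp32_per_sm * sms_per_tpc


if __name__ == "__main__":
    totals = sm_totals()
    print(totals)
    print("Turing FP32 cores per TPC:", fp32_cores_per_tpc(totals["fp32_cores"], SMS_PER_TPC))
    print("Pascal GP10x FP32 cores per TPC:", fp32_cores_per_tpc(128, 1))
```

With these numbers, the four blocks add up to the 64 FP32 Cores, 64 INT32 Cores, and eight Tensor Cores stated for the SM. Turing and Pascal GP10x also end up with the same 128 FP32 Cores per TPC, with Turing spreading them across two SMs instead of one.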

The four processing blocks share a combined 96 KB L1 data cache/shared memory.

Traditional graphics workloads partition the 96 KB L1/shared memory as 64 KB of dedicated graphics shader RAM and 32 KB for texture cache and register file spill area.
Compute workloads can divide the 96 KB into 32 KB shared memory and 64 KB L1 cache, or 64 KB shared memory and 32 KB L1 cache.
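
The splits described above can be written down as a small table and checked against the 96 KB total. This is only an illustrative Python sketch: `pick_compute_split` is a hypothetical helper, and its "smallest shared-memory split that fits" policy is an assumption made for the example, not a description of how the driver or runtime actually selects a configuration.

```python
L1_SHARED_TOTAL_KB = 96  # combined L1 data cache / shared memory per SM

# The two compute configurations named in the text.
COMPUTE_SPLITS = [
    {"shared_kb": 32, "l1_kb": 64},
    {"shared_kb": 64, "l1_kb": 32},
]

# The graphics partitioning named in the text.
GRAPHICS_SPLIT = {"shader_ram_kb": 64, "texture_and_spill_kb": 32}


def is_valid_split(split: dict, total_kb: int = L1_SHARED_TOTAL_KB) -> bool:
    """A split is valid when its parts add up to the full 96 KB."""
    return sum(split.values()) == total_kb


def pick_compute_split(shared_needed_kb: int):
    """Hypothetical policy: choose the split with the least shared memory
    that still satisfies the request, leaving as much L1 cache as possible.
    Returns None when no listed split is large enough."""
    candidates = [s for s in COMPUTE_SPLITS if s["shared_kb"] >= shared_needed_kb]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["shared_kb"])


if __name__ == "__main__":
    print(all(is_valid_split(s) for s in COMPUTE_SPLITS + [GRAPHICS_SPLIT]))
    print(pick_compute_split(48))
    print(pick_compute_split(80))
```

Under this assumed policy, a kernel needing 48 KB of shared memory per SM lands on the 64 KB shared / 32 KB L1 configuration, while a request above 64 KB has no matching split among the two listed here.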

Turing implements a major revamping of the core execution datapaths. Modern shader workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler instructions such as integer adds for addressing and fetching data, floating point compare or min/max for processing results, etc. In previous shader architectures, the floating-point math datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second parallel execution unit next to every CUDA core that executes these instructions in parallel with floating point math.

Figure 5 shows that the mix of integer pipe versus floating point instructions varies, but across several modern applications, we typically see about 36 additional integer pipe instructions for every 100 floating point instructions. Moving these instructions to a separate pipe means the floating point datapath no longer has to spend issue slots on them, which translates to up to 36% additional effective throughput for floating point.
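
The 36% figure follows from simple issue-slot arithmetic, sketched below in Python. The model is deliberately idealized and rests on assumptions not stated in the text: every instruction occupies one issue slot, the two pipes overlap perfectly, and there are no dependency or memory stalls. The function names are hypothetical.

```python
def issue_slots(fp_instr: int, int_instr: int, dual_pipe: bool) -> int:
    """Issue slots needed to retire the instruction mix.

    Single pipe: integer-pipe instructions take slots away from FP math,
    so the counts add. Dual pipe (idealized): both streams run side by
    side, so the longer stream determines the total.
    """
    if dual_pipe:
        return max(fp_instr, int_instr)
    return fp_instr + int_instr


def fp_throughput_gain(fp_instr: int = 100, int_instr: int = 36) -> float:
    """Fractional FP throughput gain from moving integer work to its own pipe."""
    shared = issue_slots(fp_instr, int_instr, dual_pipe=False)
    split = issue_slots(fp_instr, int_instr, dual_pipe=True)
    return shared / split - 1.0


if __name__ == "__main__":
    print(f"{fp_throughput_gain():.0%}")
```

For the 100:36 mix, the single-pipe case needs 136 slots and the dual-pipe case needs 100, which is where the roughly 36% gain comes from. Real workloads will see less whenever the two instruction streams cannot be overlapped perfectly.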