【转载】各种 NVIDIA 架构所匹配的 arch 和 gencode

最新推荐文章于 2025-02-10 16:26:37 发布

ShaderJoy

最新推荐文章于 2025-02-10 16:26:37 发布

阅读量1.2w

点赞数 9

分类专栏： CUDA

原文链接：https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

版权

CUDA 专栏收录该内容

47 篇文章

订阅专栏

Matching CUDA arch and CUDA gencode for various NVIDIA architectures

I’ve seen some confusion regarding NVIDIA’s nvcc sm flags and what they’re used for:
When compiling with NVCC, the arch flag (‘-arch‘) specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
Gencodes (‘-gencode‘) allows for more PTX generations and can be repeated many times for different architectures.

Here’s a list of NVIDIA architecture names, and which compute capabilities they have:

Fermi	Kepler	Maxwell	Pascal	Volta	Turing	Ampere	Hopper
sm_20	sm_30	sm_50	sm_60	sm_70	sm_75	sm_80	sm_90*?
	sm_35	sm_52	sm_61	sm_72		sm_86
	sm_37	sm_53	sm_62

* Hopper is NVIDIA’s rumored “tesla-next” series, with a 5nm process.

When should different ‘gencodes’ or ‘cuda arch’ be used?

When you compile CUDA code, you should always compile only one ‘-arch‘ flag that matches your most used GPU cards. This will enable faster runtime, because code generation will occur during compilation.
If you only mention ‘-gencode‘, but omit the ‘-arch‘ flag, the GPU code generation will occur on the JIT compiler by the CUDA driver.

When you want to speed up CUDA compilation, you want to reduce the amount of irrelevant ‘-gencode‘ flags. However, sometimes you may wish to have better CUDA backwards compatibility by adding more comprehensive ‘-gencode‘ flags.

Before you continue, identify which GPU you have and which CUDA version you have installed first.

Supported SM and Gencode variations

Below are the supported sm variations and sample cards from that generation.

I’ve tried to supply representative NVIDIA GPU cards for each architecture name, and CUDA version.

Fermi cards (CUDA 3.2 until CUDA 8)

Deprecated from CUDA 9, support completely dropped from CUDA 10.

SM20 or SM_20, compute_30 –
GeForce 400, 500, 600, GT-630.
Completely dropped from CUDA 10 onwards.

Kepler cards (CUDA 5 until CUDA 10)

Deprecated from CUDA 11.

SM30 or SM_30, compute_30 –
Kepler architecture (e.g. generic Kepler, GeForce 700, GT-730).
Adds support for unified memory programming
Completely dropped from CUDA 11 onwards.
SM35 or SM_35, compute_35 –
Tesla K40.
Adds support for dynamic parallelism.
Deprecated from CUDA 11, will be dropped in future versions.
SM37 or SM_37, compute_37 –
Tesla K80.
Adds a few more registers.
Deprecated from CUDA 11, will be dropped in future versions.

Maxwell cards (CUDA 6 until CUDA 11)

SM50 or SM_50, compute_50 –
Tesla/Quadro M series.
Deprecated from CUDA 11, will be dropped in future versions.
SM52 or SM_52, compute_52 –
Quadro M6000 , GeForce 900, GTX-970, GTX-980, GTX Titan X.
SM53 or SM_53, compute_53 –
Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.

Pascal (CUDA 8 and later)

SM60 or SM_60, compute_60 –
Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
SM61 or SM_61, compute_61–
GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030, Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
SM62 or SM_62, compute_62 –
Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2

Volta (CUDA 9 and later)

SM70 or SM_70, compute_70 –
DGX-1 with Volta, Tesla V100, GTX 1180 (GV104), Titan V, Quadro GV100
SM72 or SM_72, compute_72 –
Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX

Turing (CUDA 10 and later)

SM75 or SM_75, compute_75 –
GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4

Ampere (CUDA 11 and later)

SM80 or SM_80, compute_80 –
NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100
SM86 or SM_86, compute_86 – (from CUDA 11.1 onwards)
Tesla GA10x cards, RTX Ampere – RTX 3080, GA102 – RTX 3090, RTX A6000, RTX A40

“Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.“
https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32

Hopper (CUDA 12 [planned] and later)

SM90 or SM_90, compute_90 –
NVIDIA H100 (GH100)

Sample `nvcc` `gencode` and `arch` Flags

According to NVIDIA:

The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.

Sample flags for generation on CUDA 7.0 for maximum compatibility with all cards from the era:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_52,code=compute_52

Sample flags for generation on CUDA 8.1 for maximum compatibility with cards predating Volta:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_60,code=sm_60 \
 -gencode=arch=compute_61,code=sm_61 \
 -gencode=arch=compute_61,code=compute_61

Sample flags for generation on CUDA 9.2 for maximum compatibility with Volta cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_70,code=compute_70

Sample flags for generation on CUDA 10.1 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_50 \ 
-gencode=arch=compute_50,code=sm_50 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75

Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_52 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80

Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 and T4 Turing cards, but also support newer RTX 3080 and other Ampere cards:

-arch=sm_52 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_86,code=compute_86

Sample flags for generation on CUDA 11.1 for best performance with RTX 3080 cards:

-arch=sm_80 \ 
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_86,code=compute_86

原文地址：https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/