[Deep Learning][Repost] Matching CUDA arch and CUDA gencode for NVIDIA architectures

I’ve seen some confusion regarding NVIDIA’s nvcc sm flags and what they’re used for:
When compiling with NVCC, the arch flag (‘-arch‘) specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
The gencode flag (‘-gencode‘) allows additional code generations (including PTX) and can be repeated many times for different architectures.

Here’s a list of NVIDIA architecture names, and which compute capabilities they have:

Fermi: sm_20
Kepler: sm_30, sm_35, sm_37
Maxwell: sm_50, sm_52, sm_53
Pascal: sm_60, sm_61, sm_62
Volta: sm_70, sm_72
Turing: sm_75
Ampere: sm_80, sm_86, sm_87
Hopper*: sm_90
Lovelace?: sm_100?

 Fermi and Kepler are deprecated from CUDA 9 and CUDA 11 onwards, respectively
 Maxwell is deprecated from CUDA 11.6 onwards
* Hopper is NVIDIA’s “tesla-next” series, with a 5nm process, replacing Ampere.

When should different ‘gencodes’ or ‘cuda arch’ be used?

When you compile CUDA code, you should always include one ‘-arch‘ flag that matches your most-used GPU card. This enables faster startup, because the machine code (cubin) for that architecture is generated at compile time.
If you only embed PTX via ‘-gencode‘ and omit a cubin matching the card you run on, the GPU machine code will be generated at runtime by the CUDA driver’s JIT compiler.

When you want to speed up CUDA compilation, you want to reduce the amount of irrelevant ‘-gencode‘ flags. However, sometimes you may wish to have better CUDA backwards compatibility by adding more comprehensive ‘-gencode‘ flags.
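As a rough sketch of how such a flag set can be assembled, here is a small Python helper (the function name and structure are my own, not part of any NVIDIA tooling) that builds a ‘-gencode‘ list from a set of compute capabilities, emitting a cubin for each and keeping PTX only for the newest so future GPUs can still JIT the code:

```python
def build_gencode_flags(compute_caps):
    """Build nvcc -gencode flags from compute capabilities like ["5.2", "8.6"].

    Emits a cubin (code=sm_XY) for each capability, plus PTX
    (code=compute_XY) for the newest one so future GPUs can JIT it.
    """
    # Sort numerically so "10.0" would sort after "9.0"
    caps = sorted(compute_caps, key=lambda c: tuple(map(int, c.split("."))))
    flags = []
    for cap in caps:
        xy = cap.replace(".", "")
        flags.append(f"-gencode=arch=compute_{xy},code=sm_{xy}")
    newest = caps[-1].replace(".", "")
    flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags
```

For example, `build_gencode_flags(["5.2", "7.0"])` produces the sm_52 and sm_70 cubin flags plus a compute_70 PTX flag, mirroring the hand-written flag lists shown later in this post.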

Before you continue, identify which GPU you have and which CUDA version you have installed first.

Supported SM and Gencode variations

Below are the supported sm variations and sample cards from that generation.

I’ve tried to supply representative NVIDIA GPU cards for each architecture name, and CUDA version.

Fermi cards (CUDA 3.2 until CUDA 8)

Deprecated from CUDA 9, support completely dropped from CUDA 10.

  • SM20 or SM_20, compute_20 –
    GeForce 400, 500, 600, GT-630.
    Completely dropped from CUDA 10 onwards.

Kepler cards (CUDA 5 until CUDA 10)

Deprecated from CUDA 11.

  • SM30 or SM_30, compute_30 –
    Kepler architecture (e.g. generic Kepler, GeForce 700, GT-730).
    Adds support for unified memory programming.
    Completely dropped from CUDA 11 onwards.
  • SM35 or SM_35, compute_35 –
    Tesla K40.
    Adds support for dynamic parallelism.
    Deprecated from CUDA 11, will be dropped in future versions.
  • SM37 or SM_37, compute_37 –
    Tesla K80.
    Adds a few more registers.
    Deprecated from CUDA 11, will be dropped in future versions, strongly suggest replacing with a 32GB PCIe Tesla V100.

Maxwell cards (CUDA 6 until CUDA 11)

  • SM50 or SM_50, compute_50 –
    Tesla/Quadro M series.
    Deprecated from CUDA 11, will be dropped in future versions, strongly suggest replacing with a Quadro RTX 4000 or A6000.
  • SM52 or SM_52, compute_52 –
    Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X.
  • SM53 or SM_53, compute_53 –
    Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.

Pascal (CUDA 8 and later)

  • SM60 or SM_60, compute_60 –
    Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
  • SM61 or SM_61, compute_61 –
    GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, discrete GPU on the NVIDIA Drive PX2
  • SM62 or SM_62, compute_62 – 
    Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2

Volta (CUDA 9 and later)

  • SM70 or SM_70, compute_70 –
    DGX-1 with Volta, Tesla V100, GTX 1180 (GV104), Titan V, Quadro GV100
  • SM72 or SM_72, compute_72 –
    Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX

Turing (CUDA 10 and later)

  • SM75 or SM_75, compute_75 –
    GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4

Ampere (CUDA 11.1 and later)

  • SM80 or SM_80, compute_80 –
    NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100
  • SM86 or SM_86, compute_86 – (from CUDA 11.1 onwards)
    GA10x cards, RTX Ampere – RTX 3090 (GA102), RTX 3080 (GA102), RTX 3070 (GA104), RTX 3060 (GA106), RTX 3050 (GA107), RTX A2000, A3000, A4000, A5000, A6000, NVIDIA A40, A10, A16, A2 Tensor Core GPU

Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.

https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32

Hopper (CUDA 12 and later)

  • SM90 or SM_90, compute_90 –
    NVIDIA H100 (GH100)

Sample nvcc gencode and arch flags

According to NVIDIA:

The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.
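To make the arch=/code= split concrete, here is a small illustrative parser (my own sketch, not NVIDIA tooling) that splits a ‘-gencode‘ flag into its PTX front-end target and the back-end targets that will be retained in the binary:

```python
import re

def parse_gencode(flag):
    """Split an nvcc -gencode flag into its front-end PTX target (arch=)
    and the list of back-end targets retained in the binary (code=)."""
    m = re.match(r'-gencode=arch=(compute_\d+),code=(.+)$', flag)
    if not m:
        raise ValueError(f"unrecognized -gencode flag: {flag}")
    arch = m.group(1)
    # code= may be a single target or a shell-escaped quoted list
    # like \"sm_70,compute_70\"; strip the quoting before splitting.
    codes = m.group(2).strip('\\"').split(",")
    return arch, codes
```

Running it on `-gencode=arch=compute_70,code=\"sm_70,compute_70\"` shows that the binary keeps both an sm_70 cubin and compute_70 PTX, which is what gives forward compatibility.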

Sample flags for generation on CUDA 7.0 for maximum compatibility with all cards from the era:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_52,code=compute_52

Sample flags for generation on CUDA 8.0 for maximum compatibility with cards predating Volta:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_60,code=sm_60 \
 -gencode=arch=compute_61,code=sm_61 \
 -gencode=arch=compute_61,code=compute_61

Sample flags for generation on CUDA 9.2 for maximum compatibility with Volta cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_70,code=compute_70

Sample flags for generation on CUDA 10.1 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_50 \ 
-gencode=arch=compute_50,code=sm_50 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75 

Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_52 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80 

Sample flags for generation on CUDA 11.1 and later for maximum compatibility with V100 and T4 Turing cards, while also supporting the newer RTX 3080 and other Ampere cards (the sm_87 target additionally requires CUDA 11.4 or later):

-arch=sm_52 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_87,code=compute_87

Sample flags for generation on CUDA 11.1 for best performance with RTX 3080 cards (the sm_87 target additionally requires CUDA 11.4 or later):

-arch=sm_80 \ 
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_87,code=compute_87

Sample flags for generation on CUDA 12 for best performance with NVIDIA H100 (Hopper) GPUs, and no backwards compatibility for previous generations:

-arch=sm_90 \ 
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90,code=compute_90

Using TORCH_CUDA_ARCH_LIST for PyTorch

If you’re using PyTorch you can set the architectures using the TORCH_CUDA_ARCH_LIST env variable like this:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6" python3 file.py

Note that while you can specify every single arch in this variable, each one will prolong the build time, as kernels will have to be compiled for every architecture.

You can also tell PyTorch to generate PTX code that is forward compatible with newer cards by adding a +PTX suffix to the most recent architecture you specify:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6+PTX" python3 file.py
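The expansion PyTorch performs can be sketched roughly like this (a simplified, hypothetical re-implementation for illustration only; the real logic lives in torch.utils.cpp_extension and handles more cases):

```python
def arch_list_to_flags(arch_list):
    """Expand a TORCH_CUDA_ARCH_LIST-style string such as "7.0 7.5 8.6+PTX"
    into nvcc -gencode flags (simplified illustration)."""
    flags = []
    for entry in arch_list.split():
        wants_ptx = entry.endswith("+PTX")
        cap = entry[:-4] if wants_ptx else entry   # strip the "+PTX" suffix
        num = cap.replace(".", "")
        # Every entry gets a cubin for its exact architecture
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if wants_ptx:
            # code=compute_XY keeps PTX in the binary for forward compatibility
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags
```

So "7.0 7.5 8.0 8.6+PTX" yields one cubin per listed architecture and PTX only for 8.6, which is exactly the pattern of the hand-written flag lists earlier in this post.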

Using CMake for TensorRT

If you’re compiling TensorRT with CMake, drop the sm_ and compute_ prefixes and refer only to the compute capability numbers instead.

Example for Tesla V100 and Volta cards in general:
cmake <...> -DGPU_ARCHS="70"

Example for NVIDIA RTX 2070 and Tesla T4:
cmake <...> -DGPU_ARCHS="75"

Example for NVIDIA A100:
cmake <...> -DGPU_ARCHS="80"

Example for NVIDIA RTX 3080 and A100 together:
cmake <...> -DGPU_ARCHS="80 86"

Example for NVIDIA H100:
cmake <...> -DGPU_ARCHS="90"

What does "Value 'sm_86' is not defined for option 'gpu-architecture'" mean?

If you get an error that looks like this:

nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'

You probably have an older CUDA toolkit installed: sm_86 requires CUDA 11.1 or later. Upgrade your toolkit, along with a matching driver (at least 450.36.06), to compile for sm_8x cards like the A100 and RTX 3080.
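As a rough rule of thumb, you can check whether a toolkit release can target a given sm like this (the version table below is my own approximate summary, not an official NVIDIA list; `nvcc --list-gpu-arch` on your machine is authoritative):

```python
# Approximate highest sm target per CUDA toolkit release (my own summary;
# run `nvcc --list-gpu-arch` for the authoritative list on your system).
MAX_SM = {
    (9, 0): 70,
    (10, 0): 75,
    (11, 0): 80,
    (11, 1): 86,
    (11, 8): 90,
}

def toolkit_supports(cuda_version, sm):
    """Check whether e.g. CUDA "11.0" can compile for "sm_86"."""
    ver = tuple(int(p) for p in cuda_version.split(".")[:2])
    # Find the newest release in the table that is <= the installed version
    eligible = [v for v in MAX_SM if v <= ver]
    if not eligible:
        return False
    return int(sm.split("_")[1]) <= MAX_SM[max(eligible)]
```

By this table, `toolkit_supports("11.0", "sm_86")` is False, which is exactly the situation that produces the nvcc error above.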




31 replies on “Matching CUDA arch and CUDA gencode for various NVIDIA architectures”

Alexander Stohr says:

02/12/2016 at 09:13

I can’t find any hard information on the term “SM62” on the web.
At least some are speculating that it is meant for Tegra.

What are your sources for your statements on “SM62”?

Arnon Shimoni (post author) says:

02/12/2016 at 11:38

You could be right… I’m not entirely sure.

Arunabh Athreya says:

17/12/2019 at 01:56

SM62 is meant for compute capability 6.2. Tegra X2, Jetson TX2, DRIVE PX 2 and GP10B fall in this category.
You can find more information on the following Wikipedia page:

https://en.wikipedia.org/wiki/CUDA#GPUs_supported

Yan says:

08/03/2017 at 10:00

Hi,

Then what happens if I only use the following at compile time
-gencode arch=compute_20,code=\”sm_20,compute_20\”

but run the compiled code on a 5.0 card? The JIT compiler will generate the GPU code, but is it going to compile with
-gencode arch=compute_50,code=\”sm_50,compute_50\”

I’ve been searching the web, but couldn’t find anything. Please advise.

Thanks,

Ian

Arnon Shimoni (post author) says:

09/03/2017 at 03:22

Hey Ian,
If you’re compiling for a 5.0 card, the second option you suggested is better. If you have to have cross-compatibility, I’d recommend the first.

jg says:

24/05/2017 at 04:14

Thank you, very useful. What about sm_37?

Arnon Shimoni (post author) says:

24/05/2017 at 08:00

`sm_37` is for the Tesla K80 cards, but our experience proves that it’s not effective to compile for it specifically. sm_30 gives the same results and is better if you also have K40s or similar.

LostWorld says:

19/06/2017 at 17:57

Kindly help me find the SM and compute_ values for the GTX 950.

Arnon Shimoni (post author) says:

20/06/2017 at 01:30

-gencode=arch=compute_52,code=sm_52 – make sure you have at least CUDA 6.5.

LostWorld says:

20/06/2017 at 03:26

Thank you. So nice of you.

Mandar Gogate says:

23/07/2017 at 15:46

Thank you.

dee6600durgesh says:

09/05/2018 at 20:56

Can you help me with the GTX 860, please?

ku4eto says:

26/09/2018 at 12:31

Heads up, the Turing SM_80 is incorrect.
Turing uses SM_75 (sm_75/compute_75) according to the NVCC CUDA Toolkit documentation.

Arnon Shimoni (post author) says:

26/09/2018 at 12:46

Fixed, thanks.

Divya Mohan says:

08/11/2018 at 17:52

My laptop has CUDA V9.1.85 and a GeForce MX130. What SM would be suggested? Thank you in advance!

Arunabh Athreya says:

17/12/2019 at 01:58

Hi Divya,
It should be sm61, I think.

Chris Jacobi says:

31/01/2019 at 20:24

Thank you, that was very helpful.

villjoie says:

15/01/2020 at 14:48

CUDA V10 and a GTX 1080. What SM would be suggested? Thank you in advance!

Girish Biswas says:

19/05/2020 at 04:52

In a document, it is said that compute capability 2.0 (Fermi) supports dynamic parallelism, but when I use it in a kernel function I get an error saying: a global function cannot be called from a global function, which is only supported on the 3.5 architecture!
How can I use dynamic parallelism on the 2.0 architecture? Please help! Thanks in advance.

Arnon Shimoni (post author) says:

19/05/2020 at 06:06

From what I recall, the syntax for dynamic parallelism is different between Fermi and subsequent architectures like Kepler.
The error you’re getting is telling you that. Try asking on the NVIDIA developer forums!

Kunal Khosla says:

08/06/2020 at 08:04

I have the NVIDIA compiler 10.2 and an NVIDIA 930MX. Can anyone help me with the architecture (sm_xx)? Thank you in advance.

Arnon Shimoni (post author) says:

08/06/2020 at 10:06

The 930MX is a Maxwell generation card, so `sm_50`.

CUDA says:

22/07/2020 at 03:03

Release Notes :: CUDA Toolkit Documentation

Support for Kepler sm_30 and sm_32 architecture based products is dropped.

Support for the following compute capabilities is deprecated in the CUDA Toolkit:

sm_35 (Kepler)
sm_37 (Kepler)
sm_50 (Maxwell)

NiKo says:

16/09/2020 at 22:09

Both --gpu-architecture and --gpu-code options can be omitted, according to the NVCC CUDA Toolkit documentation. I want to know the difference between setting them and not setting them. Thank you in advance!

Arnon Shimoni (post author) says:

17/09/2020 at 16:01

If you don’t specify them, you’ll only get a compilation for the current “default”.
With CUDA 11, that’s sm_52. It may not be the best option for the GPU you have installed.

svenstaro says:

29/09/2020 at 04:17

CUDA 11.1 adds 8.6: “Added support for NVIDIA Ampere GPU architecture based GA10x GPUs (compute capability 8.6), including the GeForce RTX-30 series.”
Release Notes :: CUDA Toolkit Documentation

Manjunath K N says:

15/12/2020 at 04:53

I have a GeForce 840 GPU, CUDA 11.0 installed, and a CUDA C++ project created in Visual Studio. My task is to compile this .cu file into PTX, as I need the PTX file to create the CudaKernel in my C# code. The C# code is as follows, where CudaContext and CUmodule belong to the ManagedCUDA library (ManagedCUDA is a wrapper written to access CUDA modules from C#):

CudaContext cntxt = new CudaContext();
CUmodule cumodule = cntxt.LoadModule(@”E:\Manjunath K N\Programs\15-12-2020_2\15-12-2020_2\Debug\kernel.ptx”);

kernel.ptx is the kernel file that I need to load into CUmodule. But when LoadModule is called, I get the error “ErrorNoBinaryForGPU: This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration”.

To resolve this I went back to CUDA and checked the CUDA project properties. I made the following changes, but I am still unable to create the PTX file:
1. Keep preprocessed files set to Yes, and
2. NVCC compilation type set to “Generate .ptx file”
3. The device configuration is set to compute_52 and sm_52 by default when the CUDA project was created.
Could you please provide a solution for this?

Arnon Shimoni (post author) says:

15/12/2020 at 05:02

The compute capability for the 840 is `compute_50, sm_50`.

Note that it’s deprecated from CUDA 11 onwards.

Manjunath K N says:

15/12/2020 at 13:25

Thank you. Yes, my project properties show that the compute capability is _50. What solution would you suggest? Should I use an older CUDA version like 10.0 or 7.5 in order to create my CUDA project?

Ioan says:

05/02/2021 at 11:19

I think for the NVIDIA GeForce 840 you can do the following in Visual Studio:

Project properties > Configuration Properties > CUDA C/C++ > Device > Code Generation > drop-down list > Edit

The Code Generation window opens.
Enter

compute_50,sm_50

in the edit field at the top, then click OK.
Back in the previous window, click OK.

Ganesh Rohan says:

22/06/2021 at 20:38

Kindly help me find it for the GTX 1650 Ti with CUDA 11.

Reposted from: Matching CUDA arch and CUDA gencode for various NVIDIA architectures - Arnon Shimoni
