[Repost] Matching CUDA arch and CUDA gencode for various NVIDIA architectures

I’ve seen some confusion regarding NVIDIA’s nvcc sm flags and what they’re used for:
When compiling with NVCC, the arch flag (‘-arch‘) specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
The gencode flag (‘-gencode‘) allows more PTX generations to be targeted, and can be repeated for different architectures so that one binary supports several GPU generations.
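
For example, here is a minimal sketch of a single-architecture build (the file names mykernel.cu and myapp are illustrative, not from the article):

nvcc mykernel.cu -o myapp \
 -gencode=arch=compute_75,code=sm_75

Here arch=compute_75 selects the virtual (PTX) architecture used by the front end, and code=sm_75 makes nvcc emit native machine code for Turing.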

Here’s a list of NVIDIA architecture names, and which compute capabilities they have:

Fermi     sm_20
Kepler    sm_30, sm_35, sm_37
Maxwell   sm_50, sm_52, sm_53
Pascal    sm_60, sm_61, sm_62
Volta     sm_70, sm_72
Turing    sm_75
Ampere    sm_80, sm_86
Hopper    sm_90*

* Hopper is NVIDIA’s rumored “tesla-next” series, with a 5nm process.

When should different ‘gencodes’ or ‘cuda arch’ be used?

When you compile CUDA code, you should use a single ‘-arch‘ flag that matches your most-used GPU card. This enables a faster runtime, because code generation happens during compilation.
If you only specify ‘-gencode‘ and omit the ‘-arch‘ flag, GPU code generation is left to the JIT compiler in the CUDA driver, which runs when the application starts.
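
As a quick sketch of the difference (kernel.cu and the output names are placeholders):

nvcc -gencode=arch=compute_75,code=sm_75 kernel.cu -o app_native
nvcc -gencode=arch=compute_75,code=compute_75 kernel.cu -o app_jit

app_native embeds ready-to-run sm_75 machine code; app_jit embeds only PTX, which the driver JIT-compiles for the actual GPU the first time the application runs (and which therefore also works on newer GPUs).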

When you want to speed up CUDA compilation, reduce the number of irrelevant ‘-gencode‘ flags. However, sometimes you may want better CUDA backwards compatibility, which calls for more comprehensive ‘-gencode‘ coverage.

Before you continue, first identify which GPU you have and which CUDA version you have installed.
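
Assuming a standard CUDA installation, both can be checked from a terminal; the compute_cap query is only available in newer drivers, so treat that line as optional:

nvcc --version    # installed CUDA toolkit version
nvidia-smi        # GPU model and driver version
nvidia-smi --query-gpu=name,compute_cap --format=csv    # newer drivers only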

Supported SM and Gencode variations

Below are the supported sm variations and sample cards from each generation.

I’ve tried to supply representative NVIDIA GPU cards for each architecture name and CUDA version.

Fermi cards (CUDA 3.2 until CUDA 8)

Deprecated from CUDA 9, support completely dropped from CUDA 10.

  • SM20 or SM_20, compute_20 –
    GeForce 400, 500, 600, GT-630.
    Completely dropped from CUDA 10 onwards.

Kepler cards (CUDA 5 until CUDA 10)

Deprecated from CUDA 11.

  • SM30 or SM_30, compute_30 –
    Kepler architecture (e.g. generic Kepler, GeForce 700, GT-730).
    Adds support for unified memory programming.
    Completely dropped from CUDA 11 onwards.
  • SM35 or SM_35, compute_35 –
    Tesla K40.
    Adds support for dynamic parallelism.
    Deprecated from CUDA 11, will be dropped in future versions.
  • SM37 or SM_37, compute_37 –
    Tesla K80.
    Adds a few more registers.
    Deprecated from CUDA 11, will be dropped in future versions.

Maxwell cards (CUDA 6 until CUDA 11)

  • SM50 or SM_50, compute_50 –
    Tesla/Quadro M series.
    Deprecated from CUDA 11, will be dropped in future versions.
  • SM52 or SM_52, compute_52 –
    Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X.
  • SM53 or SM_53, compute_53 –
    Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.

Pascal (CUDA 8 and later)

  • SM60 or SM_60, compute_60 –
    Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
  • SM61 or SM_61, compute_61 –
    GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030, Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
  • SM62 or SM_62, compute_62 – 
    Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2

Volta (CUDA 9 and later)

  • SM70 or SM_70, compute_70 –
    DGX-1 with Volta, Tesla V100, Titan V, Quadro GV100
  • SM72 or SM_72, compute_72 –
    Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX

Turing (CUDA 10 and later)

  • SM75 or SM_75, compute_75 –
    GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4

Ampere (CUDA 11 and later)

  • SM80 or SM_80, compute_80 –
    NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100
  • SM86 or SM_86, compute_86 – (from CUDA 11.1 onwards)
    RTX Ampere cards (GA10x) – RTX 3080, RTX 3090 (GA102), RTX A6000, A40

Devices of compute capability 8.6 have twice the FP32 operations per cycle per SM compared to devices of compute capability 8.0. While a binary compiled for 8.0 will run as-is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.

https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32

Hopper (CUDA 12 [planned] and later)

  • SM90 or SM_90, compute_90 –
    NVIDIA H100 (GH100)

Sample nvcc gencode and arch Flags

According to NVIDIA:

The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.
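
As a usage sketch, nvcc also accepts a bracketed list in the code= clause, so one -gencode can retain both the native cubin and the PTX (the quotes guard the brackets from the shell; mykernel.cu and myapp are placeholder names):

nvcc mykernel.cu -o myapp \
 '-gencode=arch=compute_75,code=[sm_75,compute_75]'

This is equivalent to repeating the clause once with code=sm_75 and once with code=compute_75, which is exactly what the sample flag sets below do on their last two lines.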

Sample flags for generation on CUDA 7.0 for maximum compatibility with all cards from the era:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_52,code=compute_52

Sample flags for generation on CUDA 8.0 for maximum compatibility with cards predating Volta:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_60,code=sm_60 \
 -gencode=arch=compute_61,code=sm_61 \
 -gencode=arch=compute_61,code=compute_61

Sample flags for generation on CUDA 9.2 for maximum compatibility with Volta cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_70,code=compute_70

Sample flags for generation on CUDA 10.1 for maximum compatibility with V100 (Volta) and T4 (Turing) cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75

Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 (Volta) and T4 (Turing) cards:

-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80

Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 (Volta) and T4 (Turing) cards, while also supporting the newer RTX 3080 and other Ampere cards:

-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_86,code=compute_86

Sample flags for generation on CUDA 11.1 for best performance with RTX 3080 cards:

-arch=sm_80 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_86,code=compute_86
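
To double-check what actually ended up in a binary, cuobjdump from the CUDA toolkit can list the embedded native code and PTX (myapp is a placeholder for your executable):

cuobjdump --list-elf myapp    # native cubins in the fat binary
cuobjdump --list-ptx myapp    # PTX versions available for JIT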

Original article: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

 

 
