I. References
II. GPU queries
1. Query the GPU model
Method 1 (recommended)
List the display controller:
lspci | grep -i vga
The output may show only a hexadecimal device ID (here 1f82) rather than a model name:
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1f82 (rev a1)
Look up the hexadecimal device ID:
Lookup site: http://pci-ids.ucw.cz/mods/PC/10de?action=help?help=pci
Result: Name: TU117 [GeForce GTX 1650]
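The lookup needs the numeric vendor and device IDs. As a minimal sketch, `lspci -nn` prints them in square brackets next to the names, and the helper below extracts the `vendor:device` pair. The function name `pci_ids_from_lspci` is hypothetical and not part of pciutils.

```shell
#!/bin/sh
# Extract the vendor:device PCI ID pair from `lspci -nn` style lines.
# pci_ids_from_lspci is a hypothetical helper name, not part of pciutils.
pci_ids_from_lspci() {
    # With -nn, each device description carries a pair such as [10de:1f82].
    sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)].*/\1/p'
}

# Only query the real hardware when lspci is available.
if command -v lspci >/dev/null 2>&1; then
    lspci -nn | grep -i vga | pci_ids_from_lspci
fi
```

The part before the colon is the vendor ID (10de in the lookup URL above) and the part after it is the device ID to search for.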
Method 2
Query the GPU model with the NVIDIA driver tool:
nvidia-smi
2. GPU Specs Database
3. Query the GPU compute capability
4. Query the driver version
cat /proc/driver/nvidia/version
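When only the version number is needed, it can be pulled out of that file. The sketch below assumes the file contains a line starting with `NVRM version:` that carries the dotted driver version as one whitespace-separated field; `driver_version_from_proc` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the driver version found on the "NVRM version:" line.
# driver_version_from_proc is a hypothetical helper name.
driver_version_from_proc() {
    awk '/^NVRM version:/ {
        # The version is the first field that looks like 470.57.02
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9.]+$/) { print $i; exit }
    }'
}

# Only read the file when the NVIDIA kernel module exposes it.
if [ -r /proc/driver/nvidia/version ]; then
    driver_version_from_proc < /proc/driver/nvidia/version
fi
```

`nvidia-smi --query-gpu=driver_version --format=csv,noheader` should report the same value on systems where nvidia-smi is installed.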
III. CUDA/cuDNN queries
1. Query the CUDA version
Method 1
nvcc -V
Output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
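For scripts, the release number can be extracted from the `release` line of this output. A minimal sketch, with `cuda_release_from_nvcc` as a hypothetical helper name:

```shell
#!/bin/sh
# Print the toolkit release (e.g. 11.1) from `nvcc -V` output on stdin.
# cuda_release_from_nvcc is a hypothetical helper name.
cuda_release_from_nvcc() {
    # Matches "Cuda compilation tools, release 11.1, V11.1.105"
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Only call nvcc when it is on the PATH.
if command -v nvcc >/dev/null 2>&1; then
    nvcc -V | cuda_release_from_nvcc
fi
```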
Method 2
Older CUDA releases support this method; newer releases may not ship a version.txt file.
cat /usr/local/cuda/version.txt
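Because version.txt may be missing, a script can fall back to nvcc. The sketch below tries the file first and then `bin/nvcc` under the same CUDA root; the function name `cuda_version_any` and the `CUDA Version` prefix on the fallback output are choices made for this example.

```shell
#!/bin/sh
# Report the CUDA version from version.txt, falling back to nvcc.
# cuda_version_any is a hypothetical helper name.
cuda_version_any() {
    root=${1:-/usr/local/cuda}
    if [ -r "$root/version.txt" ]; then
        cat "$root/version.txt"
    elif [ -x "$root/bin/nvcc" ]; then
        # Reformat the nvcc release line to resemble version.txt
        "$root/bin/nvcc" -V |
            sed -n 's/.*release \([0-9][0-9.]*\),.*/CUDA Version \1/p'
    else
        echo "no version.txt or nvcc under $root" >&2
        return 1
    fi
}

# Do not abort when CUDA is not installed in the default location.
cuda_version_any /usr/local/cuda || true
```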
2. Query the cuDNN version
Method 1
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
Method 2
If Method 1 prints nothing, try Method 2.
Step 1: Inspect cudnn.h.
cat /usr/local/cuda/include/cudnn.h | grep cudnn
/* cudnn : Neural Networks Library
#include "cudnn_version.h"
#include "cudnn_ops_infer.h"
#include "cudnn_ops_train.h"
#include "cudnn_adv_infer.h"
#include "cudnn_adv_train.h"
#include "cudnn_cnn_infer.h"
#include "cudnn_cnn_train.h"
#include "cudnn_backend.h"
Step 2: cudnn.h does not define the cuDNN version itself, so locate cudnn_version.h.
find / -name cudnn_version.h 2>/dev/null
/home/yoyo/Downloads/cuda/include/cudnn_version.h
Step 3: Read the cuDNN version information.
cat /home/yoyo/Downloads/cuda/include/cudnn_version.h
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
So the cuDNN version is 8.0.5.
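The three defines can be combined into a version string automatically. The following sketch reads a header file, prints `MAJOR.MINOR.PATCHLEVEL`, and evaluates the `CUDNN_VERSION` formula from the header (8 * 1000 + 0 * 100 + 5 = 8005 for the values above). The function name `cudnn_version_from_header` is hypothetical.

```shell
#!/bin/sh
# Print the cuDNN version from the defines in a header file.
# $1: path to cudnn_version.h (or cudnn.h when it defines CUDNN_MAJOR itself)
# cudnn_version_from_header is a hypothetical helper name.
cudnn_version_from_header() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END {
            if (major == "") exit 1   # header does not define the version
            printf "%d.%d.%d (CUDNN_VERSION = %d)\n",
                   major, minor, patch, major * 1000 + minor * 100 + patch
        }
    ' "$1"
}

# Try the default include directory when the header is present.
hdr=/usr/local/cuda/include/cudnn_version.h
if [ -r "$hdr" ]; then
    cudnn_version_from_header "$hdr" || echo "no CUDNN_MAJOR in $hdr"
fi
```

Pass the path printed by the `find` command when the header lives outside /usr/local/cuda.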
3. Query the CUDA compute capability and CUDA core count
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
# Remove previously compiled files
sudo make clean
# Rebuild; -j8 runs 8 parallel jobs to speed up the build
sudo make -j8
./deviceQuery
# If the last line reads Result = PASS, CUDA is installed correctly
yoyo@yoyo:/usr/local/cuda/samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce GTX 1650"
CUDA Driver Version / Runtime Version 11.4 / 10.2
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 3904 MBytes (4093444096 bytes)
(14) Multiprocessors, ( 64) CUDA Cores/MP: 896 CUDA Cores
GPU Max Clock rate: 1680 MHz (1.68 GHz)
Memory Clock rate: 4001 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
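The key lines for this query are the compute capability, the global memory size, and the core count:
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 3904 MBytes (4093444096 bytes)
(14) Multiprocessors, ( 64) CUDA Cores/MP: 896 CUDA Cores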
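The core count that deviceQuery reports is the number of multiprocessors times the cores per multiprocessor (14 × 64 = 896 here). As a minimal sketch, the helper below picks that line out of deviceQuery output and redoes the multiplication; `cuda_cores_from_devicequery` is a hypothetical name.

```shell
#!/bin/sh
# Recompute the CUDA core count from deviceQuery output on stdin.
# cuda_cores_from_devicequery is a hypothetical helper name.
cuda_cores_from_devicequery() {
    # Matches "(14) Multiprocessors, ( 64) CUDA Cores/MP: 896 CUDA Cores"
    sed -n 's/^ *(\([0-9]*\)) Multiprocessors, ( *\([0-9]*\)) CUDA Cores\/MP:.*/\1 \2/p' |
    while read -r sms per_sm; do
        echo "$sms SMs x $per_sm cores/SM = $((sms * per_sm)) CUDA cores"
    done
}

# Only run the sample binary when it has been built in this directory.
if [ -x ./deviceQuery ]; then
    ./deviceQuery | cuda_cores_from_devicequery
fi
```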
IV. Testing CUDA
1. Verify the CUDA installation (runfile install)
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
# Remove previously compiled files
sudo make clean
# Rebuild; -j8 runs 8 parallel jobs to speed up the build
sudo make -j8
./deviceQuery
# If the last line reads Result = PASS, CUDA is installed correctly
Output of the first run (CUDA runtime 10.2):
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 3060"
CUDA Driver Version / Runtime Version 11.4 / 10.2
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 12051 MBytes (12636061696 bytes)
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
(28) Multiprocessors, ( 64) CUDA Cores/MP: 1792 CUDA Cores
GPU Max Clock rate: 1777 MHz (1.78 GHz)
Memory Clock rate: 7501 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 2359296 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 10 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
Output of the second run (CUDA runtime 11.1):
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 3060"
CUDA Driver Version / Runtime Version 11.4 / 11.1
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 12051 MBytes (12636061696 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1777 MHz (1.78 GHz)
Memory Clock rate: 7501 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 2359296 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 10 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.1, NumDevs = 1
Result = PASS
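Important note
The RTX 3060 uses the Ampere architecture (compute capability 8.6), which CUDA 11.1 and later support. Toolkits older than 11.1 cannot make full use of the card: with runtime 10.2, deviceQuery does not recognize SM 8.6 and falls back to 64 cores per SM, while runtime 11.1 reports 128.
First run (runtime 10.2):
(28) Multiprocessors, ( 64) CUDA Cores/MP: 1792 CUDA Cores
Second run (runtime 11.1):
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores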
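The two totals differ only in the cores-per-SM value deviceQuery uses for compute capability 8.6. The sketch below illustrates that lookup with a small table; the values for 7.5 and 8.6 come from the outputs above, and the other entries are assumed to mirror the mapping in the CUDA samples' helper_cuda.h. `cores_per_sm` is a hypothetical helper name.

```shell
#!/bin/sh
# Map a compute capability to CUDA cores per SM.
# cores_per_sm is a hypothetical helper; entries other than 7.5 and 8.6
# are assumptions based on the CUDA samples' helper_cuda.h.
cores_per_sm() {
    case "$1" in
        5.0|5.2|5.3) echo 128 ;;
        6.0)         echo 64 ;;
        6.1|6.2)     echo 128 ;;
        7.0|7.2|7.5) echo 64 ;;
        8.0)         echo 64 ;;
        8.6)         echo 128 ;;
        *)           echo "unknown compute capability: $1" >&2; return 1 ;;
    esac
}

# RTX 3060: 28 SMs at compute capability 8.6
echo "RTX 3060: $((28 * $(cores_per_sm 8.6))) CUDA cores"
```

A toolkit whose table has no entry for 8.6 takes a default instead, which is how the first run arrives at 28 × 64 = 1792.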
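2. Verify the cuDNN installation (deb install)
A deb install places the cuDNN samples under /usr/src/cudnn_samples_v7 (the suffix tracks the cuDNN major version, so it may differ on other releases). Build and run mnistCUDNN to verify the installation.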
# Copy the cuDNN samples to the home directory
cp -r /usr/src/cudnn_samples_v7 $HOME
# Enter the mnistCUDNN sample directory
cd $HOME/cudnn_samples_v7/mnistCUDNN/
# Build mnistCUDNN
sudo make clean
sudo make
# Run mnistCUDNN
# "Test passed!" in the output means cuDNN is installed correctly
sudo ./mnistCUDNN
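Both checks in this section come down to finding a success marker in the program output (`Result = PASS` for deviceQuery, `Test passed!` for mnistCUDNN). A minimal sketch of automating that check, with `output_passed` as a hypothetical helper name:

```shell
#!/bin/sh
# Succeed when the success marker appears in the output read from stdin.
# $1: fixed string marking success
# output_passed is a hypothetical helper name.
output_passed() {
    grep -qF -- "$1"
}

# Demonstration on a canned line instead of a real run
printf '%s\n' 'Result = PASS' | output_passed 'Result = PASS' && echo "marker found"
```

In practice the sample output would be piped in, for example `./mnistCUDNN | output_passed 'Test passed!'`, and the exit status used in a setup script.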
V. Matching CUDA and cuDNN versions
cuDNN Support Matrix
cuDNN-Support-Matrix
VI. GPU quantization support
Supported hardware
compute-capabilities
| CUDA Compute Capability | Example Device | TF32 | FP32 | FP16 | INT8 | FP16 Tensor Cores | INT8 Tensor Cores | DLA |
|---|---|---|---|---|---|---|---|---|
| 8.6 | NVIDIA A10 | Yes | Yes | Yes | Yes | Yes | Yes | No |
| 8.0 | NVIDIA A100/GA100 GPU | Yes | Yes | Yes | Yes | Yes | Yes | No |
| 7.5 | Tesla T4 | No | Yes | Yes | Yes | Yes | Yes | No |
| 7.2 | Jetson AGX Xavier | No | Yes | Yes | Yes | Yes | Yes | Yes |
| 7.0 | Tesla V100 | No | Yes | Yes | Yes | Yes | No | No |
| 6.2 | Jetson TX2 | No | Yes | Yes | No | No | No | No |
| 6.1 | Tesla P4 | No | Yes | No | Yes | No | No | No |
| 6.0 | Tesla P100 | No | Yes | Yes | No | No | No | No |
| 5.3 | Jetson TX1 | No | Yes | Yes | No | No | No | No |
| 5.2 | Tesla M4 | No | Yes | No | No | No | No | No |
| 5.0 | Quadro K2200 | No | Yes | No | No | No | No | No |
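To apply the table to a specific card, take the compute capability reported by deviceQuery (7.5 for the GTX 1650 and 8.6 for the RTX 3060 above) and read off the matching row. The sketch below encodes only the TF32, FP32, FP16 and INT8 columns of the table; `precisions_for` is a hypothetical helper name.

```shell
#!/bin/sh
# List the precisions marked "Yes" in the table for a compute capability.
# precisions_for is a hypothetical helper; it covers only the
# TF32/FP32/FP16/INT8 columns, not the Tensor Core or DLA columns.
precisions_for() {
    case "$1" in
        8.6|8.0)     echo "TF32 FP32 FP16 INT8" ;;
        7.5|7.2|7.0) echo "FP32 FP16 INT8" ;;
        6.2|6.0|5.3) echo "FP32 FP16" ;;
        6.1)         echo "FP32 INT8" ;;
        5.2|5.0)     echo "FP32" ;;
        *)           echo "compute capability $1 is not in the table" >&2; return 1 ;;
    esac
}

echo "GTX 1650 (7.5): $(precisions_for 7.5)"
echo "RTX 3060 (8.6): $(precisions_for 8.6)"
```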