Ubuntu: Installing the NVIDIA Driver, CUDA, cuDNN, and gcc

AI工程化

Last modified: 2023-01-31 11:47:56

Categories: Software Tools / Tips, DL (Deep Learning). Tags: ubuntu, cuda, nvidia, gcc

First published: 2022-01-11 14:14:43

Article link: https://blog.csdn.net/lovechris00/article/details/122431331

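About CUDA and cuDNN

Official site: https://developer.nvidia.com

CUDA (Compute Unified Device Architecture) is the general-purpose parallel computing platform and programming model from GPU vendor NVIDIA; it lets the GPU be used to solve complex computational problems.

CUDA consists of three parts: the CUDA Toolkit, the CUDA driver, and the NVIDIA GPU device driver.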

CUDA Toolkit (libraries, CUDA runtime and developer tools) - User-mode SDK used to build CUDA applications
CUDA driver - User-mode driver component used to run CUDA applications (such as libcuda.so on Linux systems)
NVIDIA GPU device driver - Kernel-mode driver component for NVIDIA GPUs.
In other words, the graphics card driver.

On Linux, the CUDA driver and the NVIDIA GPU device driver are both shipped as part of the NVIDIA driver.

CUDA Driver & NVIDIA Driver

Because both driver components come with the NVIDIA driver on Linux, once the NVIDIA driver is installed you only need to add the CUDA Toolkit to run CUDA programs.

cuDNN

Official description: https://developer.nvidia.com/cudnn

The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.

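cuDNN is a library built specifically for deep learning computation; it provides many specialized routines, such as convolution.

NVIDIA driver and CUDA Toolkit version compatibility
https://docs.nvidia.cn/cuda/cuda-toolkit-release-notes/index.html

I. Install the NVIDIA driver

1. Preparation

Check for an existing driver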

$ nvidia-smi

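If GPU information is printed, a driver is already installed.

Uninstall the old driver

sudo apt autoremove

# Quote the patterns so the shell passes them to apt unexpanded
sudo apt-get --purge remove "*nvidia*"

# Remove the old driver packages
sudo apt-get purge "nvidia-cuda*"
sudo apt-get purge "nvidia*"
sudo apt-get purge "libnvidia*"

Reboot after uninstalling.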

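Disable nouveau

Nouveau is an open-source 3D driver for NVIDIA cards developed by third parties; it is not endorsed or supported by NVIDIA.
Although Nouveau is no match for NVIDIA's official proprietary driver, it does make it easier for Linux to cope with the wide variety of NVIDIA hardware: users reach a desktop with decent display quality right after installing the system, which is why many Linux distributions ship Nouveau by default.

Check whether nouveau is loaded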

lsmod | grep nouveau

Any output means nouveau is loaded; no output means it is already disabled.

Disable the bundled nouveau driver

sudo vim /etc/modprobe.d/blacklist.conf

Add the following lines:

blacklist nouveau
options nouveau modeset=0

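# Rebuild the initramfs so the blacklist takes effect at boot (Ubuntu/Debian)
sudo update-initramfs -u

# On dracut-based distributions (e.g. CentOS/RHEL), rebuild with this instead
sudo dracut -f
# Optional: boot to text mode instead of the graphical desktop
sudo systemctl set-default multi-user.target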

sudo reboot

A reboot is required for the change to take effect. Afterwards, confirm that nouveau is really gone:

lsmod | grep nouveau
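
The same check can be scripted. The following is a minimal sketch, assuming module names are read from text in the /proc/modules format (the first whitespace-separated field of each line, which is also the first column lsmod prints). The function names is_module_loaded and nouveau_loaded are made up for this illustration.

```python
from pathlib import Path


def is_module_loaded(modules_text: str, name: str) -> bool:
    """Return True if `name` is the first field of any line in the text."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def nouveau_loaded(proc_modules: str = "/proc/modules") -> bool:
    """Check the running kernel; returns False if the file does not exist."""
    path = Path(proc_modules)
    if not path.exists():
        return False
    return is_module_loaded(path.read_text(), "nouveau")
```

Matching on the first field only avoids a false positive from lines where nouveau merely appears in another module's "used by" column, which a plain grep would also print.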

Leave graphical mode

Stop the X server

sudo service gdm stop

# If you use a different display manager
sudo service lightdm stop
sudo service kdm stop  # this is the one that worked for me, as I use kdm on Linux Mint

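Restore graphical mode

If the resolution or display breaks after running a deep learning framework, or if you no longer need the framework and want to remove CUDA and return to the original graphical setup:
1. Uninstall CUDA (see the uninstall steps above).
2. Re-enable nouveau: comment out the blacklist entries and rebuild the initramfs.
3. Start the graphical session:

sudo service gdm start

4. Reboot.

Install gcc and g++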

$ sudo apt update
# $ sudo apt install gcc-9 g++-9
$ sudo apt install gcc g++

# Check the versions
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Gather system and GPU information

# Kernel and Ubuntu release information
$ uname -a
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
newtranx@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.3 LTS
Release:	16.04   # release number
Codename:	xenial

# Show graphics card information
$ lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
08:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
...
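
If you want to pull the GPU models out of this listing programmatically, something like the sketch below may help. It assumes the line layout shown in the lspci output above; the function name nvidia_gpus is hypothetical, and other lspci versions may format lines differently.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# 05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
_VGA_LINE = re.compile(
    r"^(?P<slot>\S+) VGA compatible controller: NVIDIA Corporation "
    r"(?P<model>.+?)(?: \(rev [0-9a-f]+\))?$"
)


def nvidia_gpus(lspci_text: str) -> List[Tuple[str, str]]:
    """Return (PCI slot, model string) for every NVIDIA VGA controller line."""
    gpus = []
    for line in lspci_text.splitlines():
        match = _VGA_LINE.match(line.strip())
        if match:
            gpus.append((match.group("slot"), match.group("model")))
    return gpus
```

Audio device lines are skipped on purpose, so the length of the returned list is the number of NVIDIA display controllers found.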

Method 1: ubuntu-drivers (recommended)

1. List the available drivers

$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001B06sv00001462sd00003609bc03sc00i00
vendor   : NVIDIA Corporation
model    : GP102 [GeForce GTX 1080 Ti]
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-510 - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-510-server - distro non-free recommended
driver   : nvidia-driver-470 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

The driver marked recommended is the version Ubuntu suggests installing.
If your Ubuntu is a server install, pick one of the -server variants.
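
As a rough illustration, the recommended entry can be picked out of the ubuntu-drivers devices output with a few lines of Python. This sketch assumes the "driver : name - tags" layout shown above; recommended_driver is a hypothetical helper name, not part of any Ubuntu tool.

```python
import re
from typing import Optional

# Matches lines such as:
# driver   : nvidia-driver-510-server - distro non-free recommended
_DRIVER_LINE = re.compile(r"^driver\s*:\s*(?P<name>\S+)\s+-\s+(?P<tags>.*)$")


def recommended_driver(devices_output: str) -> Optional[str]:
    """Return the package name tagged 'recommended', or None if there is none."""
    for line in devices_output.splitlines():
        match = _DRIVER_LINE.match(line.strip())
        if match and "recommended" in match.group("tags").split():
            return match.group("name")
    return None
```

The returned name can then be passed to sudo apt install, which is what sudo ubuntu-drivers autoinstall does for you in one step.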

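2. Install the driver

Install the recommended version

sudo ubuntu-drivers autoinstall

Install a specific version

sudo apt install nvidia-driver-510-server

Reboot after installing; otherwise nvidia-smi may fail with: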

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
Make sure that the latest NVIDIA driver is installed and running.

Error 1: Possible missing firmware

W: Possible missing firmware /lib/firmware/rtl_nic/rtl8125a-3.fw for module r8169
W: Possible missing firmware /lib/firmware/rtl_nic/rtl8168fp-3.fw for module r8169

Reference: https://blog.csdn.net/qq_34213260/article/details/109140996

Go to /lib/firmware/rtl_nic/ and fetch the missing files listed at https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/rtl_nic/.
For example, to download the two files I was missing:

cd /lib/firmware/rtl_nic/
# Use the /plain/ URLs: the /tree/ URLs return an HTML page, not the firmware file
sudo wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/rtl_nic/rtl8125a-3.fw
sudo wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/rtl_nic/rtl8168fp-3.fw

Error 2: Bad return status for module build on kernel

Error! Bad return status for module build on kernel: 5.4.0-136-generic (x86_64)

My gcc was version 6; upgrading to gcc-8 fixed it.
Reference: https://blog.csdn.net/JerryZhang__/article/details/108865176

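Method 2: download from the website

Find the graphics card model

If the information queried above already lets you pick the matching product on the download page, skip this step.
If you only get a device ID such as 1b01 and cannot find the matching product, the references below explain how to map it to a model name.

References: https://blog.csdn.net/zhuguiqian/article/details/104795435
http://pci-ids.ucw.cz/mods/PC/10de?action=help?help=pci

Download the NVIDIA driver

Pick the driver that matches your system and graphics card model: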
https://www.nvidia.com/Download/index.aspx?lang=en-us

This yields a file such as NVIDIA-Linux-x86_64-470.94.run (the commands below use 525.60.11; substitute your own file name).

GeForce drivers
https://www.nvidia.cn/geforce/drivers/

Run the driver installer

sudo chmod +x NVIDIA-Linux-x86_64-525.60.11.run
sudo sh ./NVIDIA-Linux-x86_64-525.60.11.run --no-x-check --no-opengl-files

Install NVIDIA’s 32-bit compatibility libraries?

Choose YES.

Installation errors

1. ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel.

ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.

ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Fix:

Check whether the previous NVIDIA driver was removed completely (see the uninstall steps at the top of this article); uninstall any old version, then reboot the machine.

2. The distribution-provided pre-install script failed! Are you sure you want to continue?

This prompt comes from the NVIDIA installer package itself; you can choose yes or continue and carry on with the installation.

Verify the driver

$ nvidia-smi
Fri Dec  2 13:55:13 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |

Note that Driver Version: 525.60.11 matches the driver that was just installed.
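
To read these two fields from a script rather than by eye, a small parser along these lines could be used. It is only a sketch that assumes the header layout shown above ("Driver Version: ... CUDA Version: ..."); parse_smi_header is a hypothetical name.

```python
import re
from typing import Optional, Tuple

_HEADER = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_smi_header(smi_output: str) -> Optional[Tuple[str, str]]:
    """Return (driver version, CUDA version) from nvidia-smi text, or None."""
    match = _HEADER.search(smi_output)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")
```

Returning None when the header is absent also covers the "couldn't communicate with the NVIDIA driver" failure message, which contains neither field.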

II. Install CUDA

Download and install

CUDA Toolkit Archive download page:
https://developer.nvidia.com/cuda-toolkit-archive

The PyTorch release I plan to install supports CUDA 10.2 and 11.3 (https://pytorch.org/get-started/locally/).

CUDA 10.2 does not support Ubuntu 20.*, so I opened the 11.3 page and selected the matching options step by step to get the install script:
https://developer.nvidia.com/cuda-11.3.0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda

Choosing the runfile (local) installer type gives a script like the one I used:

wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run

sudo sh cuda_11.3.1_465.19.01_linux.run
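
The runfile name appears to encode both the toolkit version and the version of the driver bundled with it. Assuming that naming pattern (cuda_TOOLKIT_DRIVER_linux.run, inferred from the file name above and not a documented contract), the two versions can be split apart as sketched here; parse_runfile_name is a hypothetical helper.

```python
import re
from typing import Tuple

# Assumed pattern, e.g. cuda_11.3.1_465.19.01_linux.run
_RUNFILE = re.compile(
    r"^cuda_(?P<toolkit>\d+\.\d+\.\d+)_(?P<driver>\d+(?:\.\d+)+)_linux\.run$"
)


def parse_runfile_name(filename: str) -> Tuple[str, str]:
    """Return (toolkit version, bundled driver version) from a runfile name."""
    match = _RUNFILE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized runfile name: {filename}")
    return match.group("toolkit"), match.group("driver")
```

Comparing the bundled driver version with the Driver Version reported by nvidia-smi is one quick way to judge whether the driver already installed is recent enough for this toolkit.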

Choices during the run:

accept the above EULA?

Type accept. Then uncheck Driver (it is already installed) and choose Install.

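Configure environment variables

Edit ~/.bashrc, then run source ~/.bashrc to apply the changes

export CUDA_LIB_PATH=/usr/local/cuda-11.3/lib64
export CUDA_BIN_PATH=/usr/local/cuda-11.3/bin
export CUDA_HOME=/usr/local/cuda-11.3
# Also put nvcc and the CUDA libraries on the search paths
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH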
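
A quick sanity check of the resulting environment might look like the sketch below. It assumes you want $CUDA_HOME/bin on PATH and $CUDA_HOME/lib64 on LD_LIBRARY_PATH so that nvcc and the CUDA libraries are found; that requirement is my assumption, and cuda_env_problems is a hypothetical helper name.

```python
import os
from typing import List, Mapping


def cuda_env_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems found in the given environment mapping."""
    home = env.get("CUDA_HOME")
    if not home:
        return ["CUDA_HOME is not set"]
    problems = []
    bin_dir = os.path.join(home, "bin")
    lib_dir = os.path.join(home, "lib64")
    if bin_dir not in env.get("PATH", "").split(os.pathsep):
        problems.append(f"{bin_dir} is not on PATH")
    if lib_dir not in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        problems.append(f"{lib_dir} is not on LD_LIBRARY_PATH")
    return problems
```

Calling cuda_env_problems(os.environ) in a fresh shell shows whether the ~/.bashrc changes were actually picked up; an empty list means nothing was flagged.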

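Managing multiple CUDA versions

This comes down to managing the /usr/local/cuda symlink

sudo rm /usr/local/cuda  # remove the old symlink (plain rm, so a real directory is never deleted)
# Create the new symlink; the first path is the install directory of the CUDA version you want.
sudo ln -s /usr/local/cuda-11.3 /usr/local/cuda  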

sudo ln -s /usr/local/cuda-11.3/bin/nvcc /usr/bin/nvcc
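
Switching the link can also be done in one step, so that /usr/local/cuda never disappears between the remove and the re-create. The sketch below shows one way to do that: create the new link under a temporary name, then rename it over the old one. switch_symlink is a hypothetical helper, and on a real system it would need to run with root privileges.

```python
import os


def switch_symlink(link_path: str, target: str) -> None:
    """Point link_path at target, replacing an existing symlink in one rename."""
    if os.path.lexists(link_path) and not os.path.islink(link_path):
        # Refuse to touch a real file or directory.
        raise ValueError(f"{link_path} exists and is not a symlink")
    tmp_path = f"{link_path}.tmp-{os.getpid()}"
    os.symlink(target, tmp_path)      # create the new link under a temporary name
    os.replace(tmp_path, link_path)   # rename it over the old link
```

For example, switch_symlink("/usr/local/cuda", "/usr/local/cuda-11.3") would select CUDA 11.3, and calling it again with another install directory switches back.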

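III. Install cuDNN

cuDNN download page: https://developer.nvidia.com/rdp/cudnn-archive

# Extract the archive
tar -xvf cudnn-10.2-linux-x64-v8.2.1.32.tgz

Extraction produces a cuda directory.

Configure

# cuDNN 8 ships several headers (cudnn_version.h and others), so copy cudnn*.h; -P keeps the library symlinks
sudo cp cuda/include/cudnn*.h      /usr/local/cuda-10.2/include # use the path of your CUDA version
sudo cp -P cuda/lib64/libcudnn*    /usr/local/cuda-10.2/lib64   # use the path of your CUDA version
sudo chmod a+r /usr/local/cuda-10.2/include/cudnn*.h   /usr/local/cuda-10.2/lib64/libcudnn*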
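
To confirm which cuDNN version ended up in the CUDA include directory, the version macros in the header can be read back. This is a sketch that assumes the CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL defines, which cuDNN 8 keeps in cudnn_version.h (older releases keep them in cudnn.h); cudnn_version is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple


def cudnn_version(header_text: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) parsed from cuDNN header text, or None."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            rf"^\s*#define\s+{key}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return parts[0], parts[1], parts[2]
```

A typical call would pass the contents of /usr/local/cuda-10.2/include/cudnn_version.h, read with open(...).read(), adjusting the path to your CUDA version.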

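IV. Upgrade gcc and g++ (using apt-get)

This section installs gcc with apt-get and then updates the symlinks.
You can also install from a downloaded package.

1. Inspect the local gcc

Check the gcc and g++ versions

gcc --version
g++ --version

You may see something like: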

$ gcc --version
gcc (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ g++ --version
g++ (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

List the gcc versions installed on this machine

sudo dpkg -l | grep gcc

You may see output similar to this:

ii  gcc                                  4:5.3.1-1ubuntu1                           amd64        GNU C compiler
ii  gcc-5                                5.4.0-6ubuntu1~16.04.11                    amd64        GNU C compiler
ii  gcc-5-base:amd64                     5.4.0-6ubuntu1~16.04.11                    amd64        GCC, the GNU Compiler Collection (base package)
ii  gcc-6                                6.5.0-2ubuntu1~16.04                       amd64        GNU C compiler
ii  gcc-6-base:amd64                     6.5.0-2ubuntu1~16.04                       amd64        GCC, the GNU Compiler Collection (base package)
...
ii  gcc-9-base:amd64                     9.4.0-1ubuntu1~16.04                       amd64        GCC, the GNU Compiler Collection (base package)
...
ii  gir1.2-packagekitglib-1.0            0.8.17-4ubuntu6~gcc5.4ubuntu1.4            amd64        GObject introspection data for the PackageKit GLib library
ii  libgcc-5-dev:amd64                   5.4.0-6ubuntu1~16.04.11                    amd64        GCC support library (development files)
ii  libgcc-6-dev:amd64                   6.5.0-2ubuntu1~16.04                       amd64        GCC support library (development files)
ii  libgcc1:amd64                        1:9.4.0-1ubuntu1~16.04                     amd64        GCC support library

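If the gcc version you need is already installed, skip the installation and just repoint the /usr/bin/gcc symlink.

2. Search for installable gcc packages

sudo apt-cache search gcc

This returns many results: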

...
cpp-5 - GNU C preprocessor
cpp-5-aarch64-linux-gnu - GNU C preprocessor
...
gcc - GNU C compiler
gcc-5 - GNU C compiler
gcc-5-aarch64-linux-gnu - GNU C compiler
gcc-5-aarch64-linux-gnu-base - GCC, the GNU Com

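3. Install gcc

sudo apt-get install gcc-6 g++-6

After installation, sudo dpkg -l | grep gcc lists the installed versions.
gcc --version may still report the old version at this point; in that case change the default gcc (the symlink).

4. Set the symlinks

Inspect the symlink

ls -l /usr/bin/gcc

Output like the following means gcc currently points to gcc-5:

lrwxrwxrwx 1 root root 5 Feb 11  2016 /usr/bin/gcc -> gcc-5

Repoint gcc to gcc-6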

cd /usr/bin
sudo rm gcc
sudo ln -s gcc-6 gcc

sudo rm g++
sudo ln -s g++-6 g++

Run gcc --version again to see the newly selected version.
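
When scripting this check, the version can be parsed from the first line of gcc --version. The sketch below assumes the Ubuntu-style first line shown earlier, with the version following the closing parenthesis; gcc_version is a hypothetical helper. The threshold of 8 in the usage note is taken from my own kernel-module build failure and is not a general rule.

```python
import re
from typing import Optional, Tuple


def gcc_version(version_output: str) -> Optional[Tuple[int, int, int]]:
    """Parse (major, minor, patch) from the first line of `gcc --version`."""
    lines = version_output.strip().splitlines()
    if not lines:
        return None
    # e.g. "gcc (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026"
    match = re.search(r"\)\s+(\d+)\.(\d+)\.(\d+)", lines[0])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), int(match.group(3))
```

Tuples compare element by element, so gcc_version(text) < (8, 0, 0) is a direct way to ask whether the active compiler is older than gcc 8.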

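Error: unknown user ‘redis’ in statoverride file

This error tends to show up when running apt-get.
I am not using redis at the moment, so I simply removed the entry:

1. Inspect the overrides by running

dpkg-statoverride --list

The output contains a redis entry.

2. Edit /var/lib/dpkg/statoverride
I opened it with vim:

sudo vim /var/lib/dpkg/statoverride

The last line is redis redis 640 /etc/redis/redis.conf; delete it and save the file.
After that, apt-get no longer reports the error.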
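
The manual edit amounts to dropping every override line whose user no longer exists. A minimal sketch of that filtering step follows, assuming each line has the form "user group mode path" as in the redis entry above; split_statoverride is a hypothetical helper and it deliberately does not write the file back.

```python
from typing import Iterable, List, Tuple


def split_statoverride(
    text: str, known_users: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split override lines into (kept, dropped) by whether the user is known."""
    known = set(known_users)
    kept, dropped = [], []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] not in known:
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped
```

The set of known users could be built with {entry.pw_name for entry in pwd.getpwall()} from the standard pwd module. Reviewing the dropped list before changing /var/lib/dpkg/statoverride is safer than editing blindly.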

V. Miscellaneous

CUDA & macOS

NVIDIA® CUDA Toolkit 11.6 no longer supports development or running applications on macOS.
While there are no tools which use macOS as a target environment, NVIDIA is making macOS host versions of these tools that you can launch profiling and debugging sessions on supported target platforms.
CUDA driver update to support CUDA Toolkit 10.1 Update 1 and macOS 10.13.6

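CUDA no longer supports macOS as a target, although the debugging and profiling tools can still be installed on a macOS host.
According to the archive entry quoted above, the newest CUDA driver for Mac targets macOS 10.13.6 and CUDA Toolkit 10.1 (released May 2019).
As far as I can tell, CUDA cannot be installed on macOS 10.14, 10.15, or later, and CUDA 10.2 and above cannot be installed on macOS.

See the official pages for details: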

NVIDIA CUDA Toolkit - Developer Tools for macOS - CUDA Toolkit 11.6
https://developer.nvidia.com/nvidia-cuda-toolkit-11_6_0-developer-tools-mac-hosts
CUDA DRIVERS FOR MAC ARCHIVE (no new drivers since 2019)
https://www.nvidia.com/en-us/drivers/cuda/mac-driver-archive/

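nvcc and nvidia-smi report different CUDA versions

The two tools can show different CUDA versions:
nvcc -V reports the version of the installed CUDA toolkit (the runtime you build and run against);
the CUDA version in nvidia-smi is the highest CUDA version the current driver supports.

So I believe (unverified) that if the CUDA version shown by nvidia-smi is lower than the toolkit you need, the NVIDIA driver must be upgraded.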
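
That rule of thumb can be written down as a comparison of two version numbers. The sketch below assumes the nvcc -V output contains a "release X.Y" token and that the driver-side version is the string printed by nvidia-smi; nvcc_release and driver_supports are hypothetical names, and the check encodes only the simple "toolkit must not be newer than the driver's CUDA version" reading described above.

```python
import re
from typing import Optional, Tuple


def nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a line like 'Cuda compilation tools, release 7.5, V7.5.17'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(driver_cuda: str, toolkit: Tuple[int, int]) -> bool:
    """True if the toolkit version is not newer than the driver's CUDA version."""
    driver = tuple(int(part) for part in driver_cuda.split("."))[:2]
    return toolkit <= driver
```

For instance, a driver reporting CUDA Version 11.6 passes this check for toolkit 11.3 but not for toolkit 12.0.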

CUDA driver initialization failed, you might not have a CUDA gpu

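This error is raised when the CUDA toolkit and the NVIDIA driver versions do not match.

Check the driver version in the Driver Version field of the nvidia-smi output.

NVIDIA driver and CUDA Toolkit version compatibility
https://docs.nvidia.cn/cuda/cuda-toolkit-release-notes/index.html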

CUDA and PyTorch version mismatches

If the CUDA version is old and the PyTorch version is new, running PyTorch may fail with errors such as:

No module named ‘packaging’

AttributeError: module ‘logging’ has no attribute ‘getLogger’

Different PyTorch builds exist for different CUDA versions; see:
https://pytorch.org/get-started/previous-versions/

If PyTorch was installed against an older CUDA version, installing apex may fail with:

RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 10.2.
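
One way to spot this kind of mismatch early is to compare the CUDA version baked into the PyTorch build with the local toolkit version. The sketch below assumes a version string with a "+cuXYZ" suffix such as 1.9.0+cu102; cuda_from_torch_version is a hypothetical helper, and builds without that suffix simply return None.

```python
import re
from typing import Optional, Tuple


def cuda_from_torch_version(torch_version: str) -> Optional[Tuple[int, int]]:
    """Map a suffix like '+cu102' or '+cu113' to (10, 2) or (11, 3)."""
    match = re.search(r"\+cu(\d{2,3})$", torch_version.strip())
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version; the rest is the major version.
    return int(digits[:-1]), int(digits[-1])


def matches_toolkit(torch_version: str, toolkit: Tuple[int, int]) -> bool:
    """True if the PyTorch build's CUDA version equals the local toolkit's."""
    return cuda_from_torch_version(torch_version) == toolkit
```

When torch is importable, torch.version.cuda gives the same information directly; parsing the string is mainly useful when inspecting pip output or requirement files.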

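Common device information queries

To help troubleshoot installation problems, the information you may need and how to query it are collected here.

Ubuntu

# Show the system release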
$ lsb_release -a
LSB Version:	core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.1 LTS
Release:	16.04
Codename:	xenial

 
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"


# Show kernel and architecture information
$ uname -a 
Linux ubuntu-101 4.4.0-210-generic #242-Ubuntu SMP Fri Apr 16 09:57:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


# List devices and their available drivers (including the graphics card)
$ ubuntu-drivers devices

# Show the graphics card model / NVIDIA GPU information
$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)

CUDA information

$ nvcc -V   # requires the CUDA toolkit (for example the nvidia-cuda-toolkit package)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17


# Show the CUDA version (older toolkits)
$ cat /usr/local/cuda/version.txt
CUDA Version 10.1.105

# Show the NVIDIA kernel module version and the gcc it was built with
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.39  Sat Feb  9 19:19:37 CST 2019
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)

Show graphics card information

$ lshw -c video
# $ lshw -C display 
WARNING: you should run this program as super-user.
  *-display               
       description: VGA compatible controller
       product: NVIDIA Corporation
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:16 memory:ee000000-eeffffff memory:d0000000-dfffffff memory:e0000000-e1ffffff ioport:e000(size=128) memory:ef000000-ef07ffff
WARNING: output may be incomplete or inaccurate, you should run this program as super-user.


# Show full details for all graphics cards (lspci | grep -i nvidia gives the short listing)
$ lspci -vnn | grep VGA -A 12
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b06] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at ee000000 (32-bit, non-prefetchable) [size=16M]
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at e0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	[virtual] Expansion ROM at ef000000 [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_418_drm, nvidia_418

01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10ef] (rev a1)


# Check hardware-accelerated 3D (OpenGL) support
$ glxinfo | grep OpenGL
The program 'glxinfo' is currently not installed. You can install it by typing:
sudo apt install mesa-utils

NVIDIA information

# Check the package log for driver changes
$ grep nvidia /var/log/dpkg.log
2022-01-06 06:02:02 upgrade nvidia-driver-470:amd64 470.86-0ubuntu0.20.04.1 470.86-0ubuntu0.20.04.2
2022-01-06 06:02:02 status half-configured nvidia-driver-470:amd64 470.86-0ubuntu0.20.04.1

# List the installed driver packages
$ dpkg --list | grep nvidia
ii  libnvidia-common-470                       470.86-0ubuntu0.20.04.2               all          Shared files used by the NVIDIA libraries
ii  nvidia-compute-utils-470                   470.86-0ubuntu0.20.04.2               amd64        NVIDIA compute utilities
ii  nvidia-driver-470                          470.86-0ubuntu0.20.04.2               amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-470                   470.86-0ubuntu0.20.04.2               amd64        Shared files used with the kernel module


$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017

# Monitor the GPU continuously (refresh every second)
$ watch -n 1 nvidia-smi

$ nvidia-smi
Sat May 28 18:13:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 32%   44C    P8    12W / 250W |    336MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1051      G   /usr/lib/xorg/Xorg                 24MiB |
|    0   N/A  N/A      1154      G   /usr/bin/gnome-shell               49MiB |
|    0   N/A  N/A      1838      G   /usr/lib/xorg/Xorg                174MiB |
|    0   N/A  N/A      1988      G   /usr/bin/gnome-shell               83MiB |
+-----------------------------------------------------------------------------+

PyTorch CUDA-related APIs

Activate the relevant environment first

$ python3
Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
 
>>> import torch
>>> torch.cuda.is_available() # is CUDA usable?
True  # yes

>>> print(torch.__version__) # torch version
1.9.0+cu102 

# Number of CUDA devices available
>>> print(torch.cuda.device_count())
1
>>> torch.version.cuda  
'10.2' 

# Index of the GPU currently in use:
>>> device = torch.cuda.current_device()
>>> device
0

# Compute capability and name of the given GPU:
>>> torch.cuda.get_device_capability(device)
(6, 1)
>>> torch.cuda.get_device_name(device)
'NVIDIA GeForce GTX 1080 Ti'

# Release cached GPU memory held by PyTorch:
>>> torch.cuda.empty_cache()
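
The interactive calls above can be gathered into one helper for bug reports or setup scripts. This is a minimal sketch using only the torch calls shown in the session; cuda_summary is a hypothetical name, and the function falls back to an "unavailable" result when torch is not installed instead of raising.

```python
from typing import Any, Dict


def cuda_summary() -> Dict[str, Any]:
    """Collect the PyTorch/CUDA facts shown above into a single dictionary."""
    summary: Dict[str, Any] = {
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        return summary  # torch is not installed in this environment

    summary["torch"] = torch.__version__
    summary["cuda_build"] = torch.version.cuda
    summary["cuda_available"] = bool(torch.cuda.is_available())
    if summary["cuda_available"]:
        for index in range(torch.cuda.device_count()):
            summary["devices"].append(
                {
                    "index": index,
                    "name": torch.cuda.get_device_name(index),
                    "capability": torch.cuda.get_device_capability(index),
                }
            )
    return summary
```

Printing cuda_summary() right after an install gives a single place to compare against the nvidia-smi and nvcc -V output.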