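About CUDA and cuDNN
Official site: https://developer.nvidia.com
CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform and programming model from the GPU vendor NVIDIA; it lets GPUs solve complex computational problems.
CUDA consists of three parts: the CUDA Toolkit, the CUDA driver, and the NVIDIA GPU device driver.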
- CUDA Toolkit (libraries, CUDA runtime and developer tools) - User-mode SDK used to build CUDA applications
- CUDA driver - User-mode driver component used to run CUDA applications (such as libcuda.so on Linux systems)
- NVIDIA GPU device driver - Kernel-mode driver component for NVIDIA GPUs.
In other words, the graphics card driver.
On Linux, the CUDA driver and the NVIDIA GPU device driver ship together as the NVIDIA driver.
CUDA Driver & NVIDIA Driver
Because both the CUDA driver and the GPU kernel-mode driver come with the NVIDIA driver on Linux, once the NVIDIA driver is installed you only need to install the CUDA Toolkit to run CUDA programs.
cuDNN
Official description: https://developer.nvidia.com/cudnn
The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
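cuDNN is a software library designed for deep learning workloads; it provides many specialized routines such as convolution.
NVIDIA driver and CUDA Toolkit version compatibility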
https://docs.nvidia.cn/cuda/cuda-toolkit-release-notes/index.html
I. Installing the NVIDIA driver
1. Preparation
Check whether an old driver is present
$ nvidia-smi
If GPU information is printed, an old driver is present.
Uninstall the old driver
sudo apt autoremove
sudo apt-get --purge remove "*nvidia*"
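# Remove the old driver packages (quote the patterns so the shell does not expand them)
sudo apt-get purge "nvidia-cuda*"
sudo apt-get purge "nvidia*"
sudo apt-get purge "libnvidia*"
Reboot after uninstalling.
Disable nouveau
Nouveau is an open-source 3D driver for NVIDIA cards developed by a third party; it is not endorsed or supported by NVIDIA.
Although Nouveau cannot match NVIDIA's official proprietary driver, it does make it easier for Linux to cope with the wide variety of NVIDIA hardware, so users can reach the desktop with decent display quality right after installing the system. For that reason many Linux distributions ship Nouveau by default.
Check whether nouveau is loaded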
lsmod | grep nouveau
If there is output, nouveau is loaded; no output means it is disabled.
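The same check can be scripted. The sketch below is only an illustration: the helper names module_loaded and nouveau_is_loaded are hypothetical, and it assumes the usual lsmod layout in which the module name is the first column. Unlike a plain grep, it ignores lines where nouveau only appears in the "Used by" column.

```python
import subprocess


def module_loaded(lsmod_output, name="nouveau"):
    """Return True if `name` is listed as a module (first column) in lsmod output."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def nouveau_is_loaded():
    """Run lsmod on the local machine and check for the nouveau module."""
    result = subprocess.run(["lsmod"], capture_output=True, text=True, check=True)
    return module_loaded(result.stdout)
```

Call nouveau_is_loaded() before and after the reboot; it should return False once the blacklist has taken effect.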
Disable the bundled nouveau driver
sudo vim /etc/modprobe.d/blacklist.conf
Add the following lines
blacklist nouveau
options nouveau modeset=0
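# Rebuild the initramfs (Debian/Ubuntu)
sudo update-initramfs -u
# On dracut-based distributions (e.g. CentOS/RHEL), use this instead of update-initramfs
sudo dracut -f
# Boot into text mode by default
sudo systemctl set-default multi-user.target
sudo reboot
Reboot after making the change, then confirm that nouveau is no longer loaded: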
lsmod | grep nouveau
Leave graphical mode
Stop the X server
sudo service gdm stop
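# if present
sudo service lightdm stop
sudo service kdm stop # this is the one that worked for me as I use kdm and Linux Mint
Restore graphical mode
If display problems such as a wrong resolution appear after running a deep learning framework, or if you no longer use deep learning frameworks and want to uninstall CUDA and restore the original graphical setup, do the following:
1. Uninstall CUDA (see the uninstall steps above)
2. Re-enable nouveau
Comment out the corresponding blacklist entries, then rebuild the initramfs
3. Enable graphical mode
sudo service gdm start
If the default target was switched to multi-user.target earlier, also run sudo systemctl set-default graphical.target.
4. Reboot
Install gcc and g++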
$ sudo apt update
# $ sudo apt install gcc-9 g++-9
$ sudo apt install gcc g++
# Check the versions
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Get the current system and GPU information
# Get the Ubuntu version information
$ uname -a
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04 # version number
Codename: xenial
# Show GPU information
$ lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
08:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
...
Method 1: ubuntu-drivers (recommended)
1. List the available drivers
$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001B06sv00001462sd00003609bc03sc00i00
vendor : NVIDIA Corporation
model : GP102 [GeForce GTX 1080 Ti]
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-510-server - distro non-free recommended
driver : nvidia-driver-470 - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
- The driver line that ends with recommended is the version suggested for installation.
- If you run Ubuntu Server, install one of the -server variants.
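To pick out the recommended package programmatically, the listing can be parsed line by line. This is a minimal sketch that assumes the "driver : <package> - ... recommended" layout shown above; recommended_driver is a hypothetical helper name, not part of ubuntu-drivers.

```python
def recommended_driver(devices_output):
    """Return the package name on the line marked 'recommended', or None."""
    for line in devices_output.splitlines():
        parts = line.split()
        # Expected shape: driver : <package> - distro non-free recommended
        if len(parts) >= 3 and parts[0] == "driver" and parts[-1] == "recommended":
            return parts[2]
    return None
```

Feeding it the output of ubuntu-drivers devices from this machine would yield nvidia-driver-510-server, which can then be passed to apt install.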
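2. Install the driver
Install the recommended version
sudo ubuntu-drivers autoinstall
Install a specific version
sudo apt install nvidia-driver-510-server
Reboot after installing; otherwise nvidia-smi may fail with: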
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
Error 1: Possible missing firmware
W: Possible missing firmware /lib/firmware/rtl_nic/rtl8125a-3.fw for module r8169
W: Possible missing firmware /lib/firmware/rtl_nic/rtl8168fp-3.fw for module r8169
Reference: https://blog.csdn.net/qq_34213260/article/details/109140996
Go to the /lib/firmware/rtl_nic/ directory and fetch the missing firmware files; they are listed at https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/rtl_nic/
For example, to download the two files missing on my machine (the /plain/ URL returns the raw file, whereas /tree/ returns an HTML page):
cd /lib/firmware/rtl_nic/
sudo wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/rtl_nic/rtl8125a-3.fw
sudo wget https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/rtl_nic/rtl8168fp-3.fw
Error 2: Bad return status for module build on kernel
Error! Bad return status for module build on kernel: 5.4.0-136-generic (x86_64)
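My gcc was version 6; upgrading to gcc-8 resolved this.
Reference blog: https://blog.csdn.net/JerryZhang__/article/details/108865176
Method 2: download from the website
Find the GPU model
If the model reported by the queries above can be found in the download page's selection list, skip this step.
If the query only shows an ID such as 1b01 and no matching entry can be found, the following references help map it to a model name.
See: https://blog.csdn.net/zhuguiqian/article/details/104795435
http://pci-ids.ucw.cz/mods/PC/10de?action=help?help=pci
Download the NVIDIA driver
Pick the driver that matches your system and GPU model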
https://www.nvidia.com/Download/index.aspx?lang=en-us
The download is a .run file, for example NVIDIA-Linux-x86_64-470.94.run; the commands below use version 525.60.11.
GeForce drivers
https://www.nvidia.cn/geforce/drivers/
Run the driver installer
sudo chmod +x NVIDIA-Linux-x86_64-525.60.11.run
sudo sh ./NVIDIA-Linux-x86_64-525.60.11.run --no-x-check --no-opengl-files
Install NVIDIA’s 32-bit compatibility libraries?
Choose Yes.
Installation errors
1. An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel
ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a
CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.
Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running,
you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Fix:
Check whether the previous NVIDIA driver was removed completely (see the uninstall steps near the top of this article); uninstall any old version, then reboot the machine.
2. The distribution-provided pre-install script failed! Are you sure you want to continue?
This prompt is caused by the NVIDIA driver installer package itself; choose yes or continue to proceed with the installation.
Verify the driver
$ nvidia-smi
Fri Dec 2 13:55:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11 Driver Version: 525.60.11 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:05:00.0 Off | N/A |
Note that Driver Version: 525.60.11 is the driver version that was just installed.
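If the versions are needed in a script, the header line can be parsed with a regular expression. The following is a sketch under the assumption that the header has the "Driver Version: ... CUDA Version: ..." layout shown above; parse_smi_header is a hypothetical helper, and it returns None when that layout is not found (for example, if a driver does not print a CUDA Version field).

```python
import re


def parse_smi_header(smi_output):
    """Return (driver_version, cuda_version) from nvidia-smi text, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", smi_output
    )
    if match is None:
        return None
    return match.group(1), match.group(2)
```

The second element is the highest CUDA version the driver supports, which matters when choosing a CUDA Toolkit in the next part.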
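II. Installing CUDA
Download and install
CUDA Toolkit Archive download page:
https://developer.nvidia.com/cuda-toolkit-archive
The PyTorch build I plan to install supports CUDA 10.2 and 11.3 (https://pytorch.org/get-started/locally/)
CUDA 10.2 does not support Ubuntu 20.*, so I opened the 11.3 page and selected the matching options step by step to get the download script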
https://developer.nvidia.com/cuda-11.3.0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network
***
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
The script I ended up with (the runfile installer):
wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run
sudo sh cuda_11.3.1_465.19.01_linux.run
Choices during the run:
accept the above EULA?
- accept
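Uncheck Driver (it was already installed in part I), then select Install
Configure environment variables
Edit ~/.bashrc, add the lines below, then run source ~/.bashrc to apply them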
export CUDA_LIB_PATH=/usr/local/cuda-11.3/lib64
export CUDA_BIN_PATH=/usr/local/cuda-11.3/bin
export CUDA_HOME=/usr/local/cuda-11.3
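The three variables above do not by themselves put nvcc on the PATH or make the CUDA libraries visible to the dynamic loader. As a hedged sketch, the helper below (cuda_exports is a hypothetical name) generates the PATH and LD_LIBRARY_PATH lines that NVIDIA's post-installation instructions describe, for whichever install root you pass in; treating them as necessary on your system is an assumption.

```python
def cuda_exports(cuda_home):
    """Build ~/.bashrc export lines for a CUDA install rooted at `cuda_home`."""
    return [
        "export CUDA_HOME=" + cuda_home,
        # ${VAR:+:${VAR}} appends the old value only when it is non-empty
        "export PATH=" + cuda_home + "/bin${PATH:+:${PATH}}",
        "export LD_LIBRARY_PATH=" + cuda_home
        + "/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}",
    ]


if __name__ == "__main__":
    print("\n".join(cuda_exports("/usr/local/cuda-11.3")))
```

Pointing cuda_home at /usr/local/cuda instead of a versioned directory lets the symlink described in the next section decide which version is active.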
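Managing multiple CUDA versions
This comes down to managing the /usr/local/cuda symlink
sudo rm -f /usr/local/cuda # remove the symlink to the old version
# Create the symlink to the new version; the first path is the install directory of the CUDA version you need.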
sudo ln -s /usr/local/cuda-11.3 /usr/local/cuda
sudo ln -s /usr/local/cuda-11.3/bin/nvcc /usr/bin/nvcc
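Switching versions repeatedly is easier with a small helper. The sketch below is an illustration, not an official tool: switch_cuda is a hypothetical function that repoints a symlink by creating a temporary link and renaming it over the old one, and it refuses to touch a path that is a real directory. Running it against /usr/local/cuda requires root.

```python
import os


def switch_cuda(target, link="/usr/local/cuda"):
    """Point the symlink `link` at the CUDA install directory `target`."""
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    if os.path.exists(link) and not os.path.islink(link):
        raise RuntimeError(link + " exists and is not a symlink")
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # Renaming replaces the old symlink itself; it does not follow it
    os.replace(tmp, link)
```

For example, switch_cuda("/usr/local/cuda-11.3") makes /usr/local/cuda resolve to the 11.3 install.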
III. Installing cuDNN
cuDNN download page: https://developer.nvidia.com/rdp/cudnn-archive
# Extract the archive
tar -xvf cudnn-10.2-linux-x64-v8.2.1.32.tgz
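Extraction produces a cuda directory.
Configure
# cuDNN 8 ships several headers (cudnn_version.h and others), so copy cudnn*.h rather than cudnn.h alone
sudo cp cuda/include/cudnn*.h /usr/local/cuda-10.2/include # use the path of your CUDA version
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-10.2/lib64 # use the path of your CUDA version
sudo chmod a+r /usr/local/cuda-10.2/include/cudnn*.h /usr/local/cuda-10.2/lib64/libcudnn*
IV. Upgrading gcc and g++ (with apt-get)
This section installs gcc with apt-get and then updates the symlinks.
You can also install it from a downloaded package.
1. Check the local gcc
Check the gcc and g++ versions
gcc --version
g++ --version
You may see output like this: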
$ gcc --version
gcc (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ g++ --version
g++ (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
List the gcc versions installed on this machine
sudo dpkg -l | grep gcc
The output may look like this
ii gcc 4:5.3.1-1ubuntu1 amd64 GNU C compiler
ii gcc-5 5.4.0-6ubuntu1~16.04.11 amd64 GNU C compiler
ii gcc-5-base:amd64 5.4.0-6ubuntu1~16.04.11 amd64 GCC, the GNU Compiler Collection (base package)
ii gcc-6 6.5.0-2ubuntu1~16.04 amd64 GNU C compiler
ii gcc-6-base:amd64 6.5.0-2ubuntu1~16.04 amd64 GCC, the GNU Compiler Collection (base package)
...
ii gcc-9-base:amd64 9.4.0-1ubuntu1~16.04 amd64 GCC, the GNU Compiler Collection (base package)
...
ii gir1.2-packagekitglib-1.0 0.8.17-4ubuntu6~gcc5.4ubuntu1.4 amd64 GObject introspection data for the PackageKit GLib library
ii libgcc-5-dev:amd64 5.4.0-6ubuntu1~16.04.11 amd64 GCC support library (development files)
ii libgcc-6-dev:amd64 6.5.0-2ubuntu1~16.04 amd64 GCC support library (development files)
ii libgcc1:amd64 1:9.4.0-1ubuntu1~16.04 amd64 GCC support library
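If the gcc version you need is already installed, there is no need to install it again; just repoint the /usr/bin/gcc symlink.
2. Search for installable gcc packages
sudo apt-cache search gcc
This returns many results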
...
cpp-5 - GNU C preprocessor
cpp-5-aarch64-linux-gnu - GNU C preprocessor
...
gcc - GNU C compiler
gcc-5 - GNU C compiler
gcc-5-aarch64-linux-gnu - GNU C compiler
gcc-5-aarch64-linux-gnu-base - GCC, the GNU Com
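3. Install gcc
sudo apt-get install gcc-6 g++-6
After installation, sudo dpkg -l | grep gcc lists the installed versions.
gcc --version may still report the old version at this point; in that case change the default gcc version (the symlink).
4. Set the symlinks
Inspect the symlink
ls -l /usr/bin/gcc
The output may look like this, meaning gcc currently points to gcc-5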
lrwxrwxrwx 1 root root 5 Feb 11 2016 /usr/bin/gcc -> gcc-5
Repoint the gcc link to gcc-6
cd /usr/bin
sudo rm gcc
sudo ln -s gcc-6 gcc
sudo rm g++
sudo ln -s g++-6 g++
Running gcc --version again now shows the selected version.
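Error: unknown user ‘redis’ in statoverride file
This error can show up when using apt-get.
Since I am not using redis at the moment, I simply removed the entry, as follows:
1. Inspect the overrides by running
dpkg-statoverride --list
The output contains a redis entry.
2. Edit the /var/lib/dpkg/statoverride file
Here I open it with vim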
sudo vim /var/lib/dpkg/statoverride
The last line is redis redis 640 /etc/redis/redis.conf; delete that line and save the file.
After that, apt-get no longer reports this error.
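The manual edit above can also be expressed as a filter over the file contents. This is only a sketch: drop_unknown_users is a hypothetical helper, it assumes each entry has the form "user group mode path" as in the redis line above, and the set of known users is supplied by the caller rather than looked up on the system. Back up /var/lib/dpkg/statoverride before writing any result to it.

```python
def drop_unknown_users(statoverride_text, known_users):
    """Return the statoverride text without entries owned by unknown users."""
    kept = []
    for line in statoverride_text.splitlines():
        fields = line.split()
        # Entry layout assumed: <user> <group> <mode> <path>
        if fields and fields[0] not in known_users:
            continue
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

dpkg-statoverride also has a --remove option that takes the path of the entry, which avoids editing the file by hand.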
V. Miscellaneous
CUDA & macOS
NVIDIA® CUDA Toolkit 11.6 no longer supports development or running applications on macOS.
While there are no tools which use macOS as a target environment, NVIDIA is making macOS host versions of these tools that you can launch profiling and debugging sessions on supported target platforms.
CUDA driver update to support CUDA Toolkit 10.1 Update 1 and macOS 10.13.6
CUDA no longer supports macOS; you can still install the host-side debugging and profiling tools on macOS.
The newest CUDA driver for Mac supports at most macOS 10.13.6 and CUDA Toolkit 10.1 (released in May 2019).
CUDA cannot be installed on macOS 10.14, 10.15, or later, and CUDA 10.2 and later cannot be installed either.
See the official pages for details:
- NVIDIA CUDA Toolkit - Developer Tools for macOS - CUDA Toolkit 11.6
https://developer.nvidia.com/nvidia-cuda-toolkit-11_6_0-developer-tools-mac-hosts
- CUDA DRIVERS FOR MAC ARCHIVE (no new driver since 2019)
https://www.nvidia.com/en-us/drivers/cuda/mac-driver-archive/
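nvcc and nvidia-smi report different CUDA versions
The two tools can show different CUDA versions:
nvcc -V reports the version of the installed CUDA Toolkit (the runtime);
nvidia-smi reports the highest CUDA version that the current driver supports.
So I think (unverified) that if the CUDA version shown by nvidia-smi is too low, the NVIDIA driver needs to be upgraded.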
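The relationship between the two numbers can be checked with a few lines of code. The sketch below uses a deliberately simplified rule, namely that the toolkit version from nvcc should not exceed the version shown by nvidia-smi; NVIDIA's actual compatibility rules are more nuanced, so treat the rule as an assumption. The helper names nvcc_release and driver_supports are hypothetical.

```python
import re


def nvcc_release(nvcc_output):
    """Extract 'X.Y' from the 'release X.Y' part of `nvcc -V` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def _major_minor(version):
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def driver_supports(toolkit_version, smi_cuda_version):
    """Simplified check: toolkit version must not exceed the driver's maximum."""
    # Compare as integer tuples; comparing strings would rank "9.0" above "10.2"
    return _major_minor(toolkit_version) <= _major_minor(smi_cuda_version)
```

With the values from this article, driver_supports("11.3", "12.0") is True, so the 525.60.11 driver is new enough for CUDA 11.3.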
CUDA driver initialization failed, you might not have a CUDA gpu
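This error is reported when the CUDA version and the NVIDIA driver do not match.
Check the driver version in the Driver Version field of the nvidia-smi output.
NVIDIA driver and CUDA Toolkit version compatibility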
https://docs.nvidia.cn/cuda/cuda-toolkit-release-notes/index.html
CUDA and PyTorch version mismatches
If the CUDA version is old and the PyTorch version is new, running PyTorch may produce errors such as:
No module named ‘packaging’
AttributeError: module ‘logging’ has no attribute ‘getLogger’
Different PyTorch builds are available for different CUDA versions; see:
https://pytorch.org/get-started/previous-versions/
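If PyTorch was installed against an older CUDA version, installing apex may fail with:
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 10.2.
Common device information queries
To help troubleshoot problems during installation, the information you may need and how to query it are listed here.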
Ubuntu
# Show the system version
$ lsb_release -a
LSB Version: core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
# Show the Ubuntu kernel and architecture information
$ uname -a
Linux ubuntu-101 4.4.0-210-generic #242-Ubuntu SMP Fri Apr 16 09:57:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
# List devices and their available drivers (including the GPU and its driver)
$ ubuntu-drivers devices
# Show the GPU model / NVIDIA GPU information
$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
CUDA information
$ nvcc -V # requires the CUDA Toolkit (e.g. the nvidia-cuda-toolkit package)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
# Show the CUDA version (older releases)
$ cat /usr/local/cuda/version.txt
CUDA Version 10.1.105
# Show the NVIDIA kernel module version (and the gcc that built it)
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.39 Sat Feb 9 19:19:37 CST 2019
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)
Show GPU information
$ lshw -c video
# $ lshw -C display
WARNING: you should run this program as super-user.
*-display
description: VGA compatible controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:16 memory:ee000000-eeffffff memory:d0000000-dfffffff memory:e0000000-e1ffffff ioport:e000(size=128) memory:ef000000-ef07ffff
WARNING: output may be incomplete or inaccurate, you should run this program as super-user.
# Use lspci | grep -i nvidia to list all NVIDIA devices.
$ lspci -vnn | grep VGA -A 12
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b06] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at ee000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at ef000000 [disabled] [size=512K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_418_drm, nvidia_418
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10ef] (rev a1)
# Check hardware acceleration (whether hardware-based 3D acceleration is enabled)
$ glxinfo | grep OpenGL
The program 'glxinfo' is currently not installed. You can install it by typing:
sudo apt install mesa-utils
NVIDIA information
# Show driver entries in the package log
$ cat /var/log/dpkg.log | grep nvidia
2022-01-06 06:02:02 upgrade nvidia-driver-470:amd64 470.86-0ubuntu0.20.04.1 470.86-0ubuntu0.20.04.2
2022-01-06 06:02:02 status half-configured nvidia-driver-470:amd64 470.86-0ubuntu0.20.04.1
# List the installed driver packages
$ sudo dpkg --list | grep nvidia
ii libnvidia-common-470 470.86-0ubuntu0.20.04.2 all Shared files used by the NVIDIA libraries
ii nvidia-compute-utils-470 470.86-0ubuntu0.20.04.2 amd64 NVIDIA compute utilities
ii nvidia-driver-470 470.86-0ubuntu0.20.04.2 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-470 470.86-0ubuntu0.20.04.2 amd64 Shared files used with the kernel module
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
# Monitor GPU status continuously
$ watch -n 1 nvidia-smi
$ nvidia-smi
Sat May 28 18:13:47 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 32% 44C P8 12W / 250W | 336MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1051 G /usr/lib/xorg/Xorg 24MiB |
| 0 N/A N/A 1154 G /usr/bin/gnome-shell 49MiB |
| 0 N/A N/A 1838 G /usr/lib/xorg/Xorg 174MiB |
| 0 N/A N/A 1988 G /usr/bin/gnome-shell 83MiB |
+-----------------------------------------------------------------------------+
PyTorch CUDA-related APIs
Activate the relevant env first
$ python3
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available() # check whether CUDA is available
True # CUDA is available
>>> print(torch.__version__) # show the torch version
1.9.0+cu102
# Number of CUDA devices available
>>> print(torch.cuda.device_count())
1
>>> torch.version.cuda
'10.2'
# Index of the GPU currently in use:
>>> device = torch.cuda.current_device()
>>> device
0
# Compute capability and name of the given GPU:
>>> torch.cuda.get_device_capability(device)
(6, 1)
>>> torch.cuda.get_device_name(device)
'NVIDIA GeForce GTX 1080 Ti'
# Release the cached GPU memory held by PyTorch:
>>> torch.cuda.empty_cache()
Related resources
- Differences between the GPU, GPU driver, nvcc, CUDA driver, CUDA Toolkit, and cuDNN
https://zhuanlan.zhihu.com/p/394201746
- Fixing the NVIDIA "Failed to initialize NVML: Driver/library version mismatch" error
https://blog.csdn.net/zywvvd/article/details/115500412
- After several all-nighters, I wrote some CUDA beginner code
https://zhuanlan.zhihu.com/p/360441891
- CUDA C++ Best Practices Guide
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
- NVIDIA driver installation: from problems to solutions (Linux/Ubuntu)
https://zhuanlan.zhihu.com/p/115758882
https://blog.csdn.net/github_38060285/article/details/82927362
伊织 2022-12-02 (Fri)