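Several Problems Encountered When Using CUDA
Background: different versions of TensorFlow and PyTorch depend on different CUDA versions, and after moving to a newer framework version the GPU could no longer be loaded.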
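- Installing and using multiple CUDA versions side by side
- Errors caused by a missing cuDNN
- Errors caused by a driver version mismatch
- Checking whether TensorFlow can use the GPU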
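1. Multiple versions side by side
- Choosing how to download and install CUDA
Installing from the .run file is recommended, because installing from the .deb package may replace a newer GPU driver that is already installed.
The examples below use cuda_10.1.105_418.39_linux.run.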
- Installing CUDA
Change to the directory that contains cuda_10.1.105_418.39_linux.run:
sudo chmod +x cuda_10.1.105_418.39_linux.run    # make the installer executable
./cuda_10.1.105_418.39_linux.run    # run the installer
The most important prompts during installation are:
Do you accept the previously read EULA?
accept/decline/quit: accept

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.39?
(y)es/(n)o/(q)uit: n
# If a newer GPU driver is already installed, skip this. If you do want the installer to install the driver, answer yes; the graphical interface must also be shut down first.

Install the CUDA 10.1 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
[ default is /usr/local/cuda-10.1 ]:
# The default is usually fine. You can also install elsewhere and later point to that directory, or symlink it to /usr/local/cuda.

/usr/local/cuda-10.1 is not writable.
Do you wish to run the installation with 'sudo'?
(y)es/(n)o: y

Please enter your password:

Do you want to install a symbolic link at /usr/local/cuda?
# Whether to symlink the install directory to /usr/local/cuda. Either answer works; it depends on whether you use /usr/local/cuda as your default CUDA directory.
(y)es/(n)o/(q)uit: n

Install the CUDA 10.1 Samples?
(y)es/(n)o/(q)uit: n
After installation, /usr/local contains:
lrwxrwxrwx  1 root root   20 10月 20 15:39 cuda -> /usr/local/cuda-10.0    # symlink to cuda-10.0
drwxr-xr-x 19 root root 4.0K 8月 19 2019 cuda-10.0    # the previously installed cuda-10.0
drwxr-xr-x 18 root root 4.0K 10月 20 15:25 cuda-10.1    # the newly installed cuda-10.1
The installer also prints a summary:
===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-10.1/
Samples:  Installed in /home/username/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-10.1/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 418.00 is required for CUDA 10.1 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log
- Checking that CUDA installed successfully
The usual command is:
nvcc --version
Alternatively, build and run the deviceQuery sample:
$ cd /usr/local/cuda-10.1/samples/1_Utilities/deviceQuery
$ ls
deviceQuery.cpp  Makefile  NsightEclipse.xml  readme.txt
$ make
mkdir -p ../../bin/x86_64/linux/release
cp deviceQuery ../../bin/x86_64/linux/release
$ ./deviceQuery
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
...
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 4
Result = PASS
- Switching between CUDA versions
In ~/.bashrc or ~/.zshrc, change every CUDA-related path to /usr/local/cuda/ rather than /usr/local/cuda-10.0/ or /usr/local/cuda-10.1/.
(If you rely on the symlink here, take care when installing cuDNN.)
# Point the symlink at CUDA 10.0
sudo rm -f /usr/local/cuda    # remove the previous symlink if there is one (no trailing slash)
sudo ln -s /usr/local/cuda-10.0/ /usr/local/cuda
nvcc --version    # show the current CUDA version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

# Switch from CUDA 10.0 to CUDA 10.1
sudo rm -f /usr/local/cuda
sudo ln -s /usr/local/cuda-10.1/ /usr/local/cuda
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
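If you switch often, the remove-then-relink steps can be wrapped in a small script. The following is only a sketch: `switch_cuda` is a hypothetical helper name, and the demo runs in a scratch directory rather than /usr/local (where it would need root). It builds the new link under a temporary name and renames it over the old one, so the path never disappears in between.

```python
import os
import tempfile


def switch_cuda(link_path, target_dir):
    """Point the symlink at link_path to target_dir, replacing any old link."""
    if not os.path.isdir(target_dir):
        raise FileNotFoundError(f"CUDA directory not found: {target_dir}")
    if os.path.exists(link_path) and not os.path.islink(link_path):
        # Refuse to touch a real directory; only a symlink may be replaced.
        raise ValueError(f"{link_path} exists and is not a symlink")
    # Create the new link under a temporary name, then rename it over the
    # old link so that link_path is never missing.
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target_dir, tmp_link)
    os.replace(tmp_link, link_path)
    return os.path.realpath(link_path)


if __name__ == "__main__":
    # Demonstration in a scratch directory instead of /usr/local.
    with tempfile.TemporaryDirectory() as root:
        for name in ("cuda-10.0", "cuda-10.1"):
            os.mkdir(os.path.join(root, name))
        link = os.path.join(root, "cuda")
        switch_cuda(link, os.path.join(root, "cuda-10.0"))
        print(switch_cuda(link, os.path.join(root, "cuda-10.1")))
```

On a real machine the call would look like `switch_cuda("/usr/local/cuda", "/usr/local/cuda-10.1")`, run with root privileges.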
- Checking the CUDA version
cat /usr/local/cuda/version.txt
or
nvcc --version    # same as nvcc -V
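To use the version in a script, the release number can be pulled out of the nvcc output. This is a minimal sketch; `parse_nvcc_release` and `current_cuda_release` are hypothetical helper names, and the regular expression assumes the "release X.Y" wording seen in the nvcc output earlier in this post.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Return (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def current_cuda_release(nvcc="nvcc"):
    """Run nvcc and parse its output; None if nvcc is missing or fails."""
    try:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


# Output copied from the CUDA 10.1 example in this post.
SAMPLE = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105"""

if __name__ == "__main__":
    print(parse_nvcc_release(SAMPLE))
```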
- Uninstalling CUDA
CUDA ships with its own uninstaller:
cd /usr/local/cuda-10.1/bin
sudo ./cuda-uninstaller    # root is needed when the toolkit was installed with sudo
Successfully uninstalled
- Configuring .bashrc
CUDAROOT=/usr/local/cuda-10.1    # use /usr/local/cuda if you rely on the symlink
export PATH=$CUDAROOT/bin:$PATH
export LD_LIBRARY_PATH=$CUDAROOT/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=$CUDAROOT
export CUDA_PATH=$CUDAROOT
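Sometimes you only want one command to see a particular CUDA version, without editing .bashrc. A sketch of that idea follows; `cuda_env` is a hypothetical helper that builds the same four variables as above for a child process. One small difference from the shell version: it leaves out the separator when the old value is empty, which avoids an empty path entry.

```python
import os


def cuda_env(cuda_root, base_env=None):
    """Return a copy of base_env with the CUDA paths for cuda_root prepended."""
    env = dict(os.environ if base_env is None else base_env)

    def prepend(name, entry):
        old = env.get(name, "")
        # Only add the separator when there is an old value, so no empty
        # path component is created.
        env[name] = entry + os.pathsep + old if old else entry

    prepend("PATH", os.path.join(cuda_root, "bin"))
    prepend("LD_LIBRARY_PATH", os.path.join(cuda_root, "lib64"))
    env["CUDA_HOME"] = cuda_root
    env["CUDA_PATH"] = cuda_root
    return env


if __name__ == "__main__":
    env = cuda_env("/usr/local/cuda-10.1", {"PATH": "/usr/bin"})
    for key in ("PATH", "LD_LIBRARY_PATH", "CUDA_HOME", "CUDA_PATH"):
        print(key, "=", env[key])
```

It could then be passed to a child process, for example `subprocess.run(["nvcc", "--version"], env=cuda_env("/usr/local/cuda-10.0"))`.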
2. Errors caused by a missing cuDNN when using TensorFlow
Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
Solution:
- Checking the cuDNN version
$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
cat: /usr/local/cuda/include/cudnn.h: No such file or directory
cudnn.h does not exist, so cuDNN is not installed in this CUDA directory.
- Installing cuDNN
Download the cuDNN release that matches your CUDA version.
$ tar -xvf cudnn-10.1-linux-x64-v7.5.0.56.tgz    # extracts into a directory named cuda
$ cd cuda
$ sudo cp include/cudnn.h /usr/local/cuda/include/    # note the destination: /usr/local/cuda may be a symlink
$ sudo cp lib64/* /usr/local/cuda/lib64/
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h
$ sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
$ cd /usr/local/cuda/lib64/
$ sudo rm -rf libcudnn.so libcudnn.so.7
$ sudo ln -s libcudnn.so.7.5.0 libcudnn.so.7
$ sudo ln -s libcudnn.so.7 libcudnn.so
$ sudo ldconfig    # refresh the dynamic linker cache so the libraries can be found system-wide
When several CUDA versions coexist and you switch between them with the symlink, it is best to install cuDNN into the matching versioned directory. In that case, run the following from the directory that contains the extracted cuda folder:
sudo cp cuda/include/cudnn.h /usr/local/cuda-10.1/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.1/lib64/
sudo chmod a+r /usr/local/cuda-10.1/include/cudnn.h
sudo chmod a+r /usr/local/cuda-10.1/lib64/libcudnn*
- Verifying cuDNN
$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
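The same check can be done from a script by reading the three #define lines. This is a rough sketch: `parse_cudnn_version` is a hypothetical helper, and it assumes the macros live in cudnn.h as they do for cuDNN 7 here (as far as I know, newer cuDNN releases keep them in cudnn_version.h instead).

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn.h text, or None if absent."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            rf"^\s*#define\s+{key}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Lines copied from the grep output in this post.
SAMPLE_HEADER = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)"""

if __name__ == "__main__":
    print(parse_cudnn_version(SAMPLE_HEADER))
```

For a real check, read the file first, for example `parse_cudnn_version(open("/usr/local/cuda/include/cudnn.h").read())`.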
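- cuDNN version mismatch
Loaded runtime CuDNN library: 7.5.0 but source was compiled with: 7.6.4
The framework was built against cuDNN 7.6.4 but found 7.5.0 at runtime. Remove version 7.5.0 and install version 7.6.4.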
3. Errors caused by a driver version mismatch
Symptom:
RuntimeError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
Cause: the installed NVIDIA driver is too old for the CUDA runtime in use.
Diagnosis and fix:
- Checking the current driver version
$ ls /usr/src | grep nvidia
nvidia-384-384.130
nvidia-410.48
or
$ nvidia-smi
...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
...
+-----------------------------------------------------------------------------+
The driver in use is 'nvidia-410.48'.
If nvidia-smi fails with the following error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
then run:
sudo apt install dkms
sudo dkms install -m nvidia -v 410.48
If that still fails, work from the specific error message.
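- Checking the official version requirements
Each CUDA version requires a minimum GPU driver version: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
CUDA Toolkit                                        Linux x86_64 driver    Windows x86_64 driver
CUDA 10.1 (10.1.105 general release, and updates)   >= 418.39              >= 418.96
CUDA 10.0.130                                       >= 410.48              >= 411.3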
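The table lookup is easy to script. The sketch below hard-codes only the two Linux rows quoted above; `MIN_LINUX_DRIVER` and `driver_supports` are hypothetical names, and any other CUDA release would have to be added from the release notes.

```python
# Minimum Linux driver per CUDA release, taken from the two rows above.
MIN_LINUX_DRIVER = {
    (10, 1): (418, 39),
    (10, 0): (410, 48),
}


def parse_version(text):
    """Turn a dotted version such as '410.48' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(cuda_release, driver_version):
    """True if driver_version meets the minimum for cuda_release."""
    required = MIN_LINUX_DRIVER.get(parse_version(cuda_release))
    if required is None:
        raise KeyError(f"no minimum driver recorded for CUDA {cuda_release}")
    return parse_version(driver_version) >= required


if __name__ == "__main__":
    # Driver 410.48 with CUDA 10.1 is the failing combination in this post.
    print(driver_supports("10.1", "410.48"))
    print(driver_supports("10.1", "450.57"))
```

Versions are compared as tuples of integers rather than as strings or floats, so that 418.100 would correctly rank above 418.39.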
- Updating the driver
Download a suitable driver installer from the official site: https://www.nvidia.com/zh-tw/geforce/drivers/
# the downloaded file
NVIDIA-Linux-x86_64-450.57.run
Installation
# Before installing, make sure nothing is using (or holding) the GPU.
# Then stop the display manager to avoid the error "You appear to be running an X server".
sudo service lightdm stop
sudo init 3
# The actual install command
sudo ./NVIDIA-Linux-x86_64-450.57.run
# When asked "Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later", choose NO.
# When asked "Would you like to run the nvidia-xconfig utility to automatically update your X configuration so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up", choose NO.
After installation:
nvidia-smi
Mon Dec 14 18:23:04 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
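- Mismatch between the loaded NVIDIA kernel module and the installed driver libraries
After a successful driver install you may see: Failed to initialize NVML: Driver/library version mismatch
# Show the version of the NVIDIA kernel module that is currently loaded
cat /proc/driver/nvidia/version
Update: I later ran into this problem again at work and solved it by following a blog post. The idea is to unload the stale NVIDIA kernel modules (dependent modules first) and then run nvidia-smi again: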
(base) dd@mq1:~/packages$ sudo rmmod nvidia
rmmod: ERROR: Module nvidia is in use by: nvidia_modeset
(base) dd@mq1:~/packages$ lsmod | grep nvidia
nvidia_modeset 860160 0
nvidia 13160448 1 nvidia_modeset
(base) dd@mq1:~/packages$ sudo lsof -n -w /dev/nvidia*
(base) dd@mq1:~/packages$ sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is not currently loaded
(base) dd@mq1:~/packages$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
(base) dd@mq1:~/packages$ sudo rmmod nvidia_modeset
(base) dd@mq1:~/packages$ sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is not currently loaded
(base) dd@mq1:~/packages$ sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is not currently loaded
(base) dd@mq1:~/packages$ sudo rmmod nvidia
(base) dd@mq1:~/packages$ nvidia-smi
Mon Aug 8 16:25:43 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
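In the session above, rmmod nvidia failed until nvidia_modeset had been removed first. The order can be worked out from the lsmod output, as in this sketch. `unload_order` is a hypothetical helper, and it assumes the usual lsmod columns (name, size, use count, comma-separated users). It only computes the order; it does not call rmmod, and it cannot see processes that still hold /dev/nvidia* open.

```python
def unload_order(lsmod_text):
    """Return NVIDIA module names ordered so that users are removed first."""
    users = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("nvidia"):
            continue
        used_by = fields[3].split(",") if len(fields) > 3 else []
        users[fields[0]] = [name for name in used_by if name]

    order = []

    def visit(name):
        if name in order:
            return
        # Every module that uses `name` has to be unloaded before it.
        for user in users.get(name, []):
            if user in users:
                visit(user)
        order.append(name)

    for name in users:
        visit(name)
    return order


# lsmod output copied from the session above.
SAMPLE_LSMOD = """nvidia_modeset 860160 0
nvidia 13160448 1 nvidia_modeset"""

if __name__ == "__main__":
    for module in unload_order(SAMPLE_LSMOD):
        print("sudo rmmod", module)
```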
4. Checking whether TensorFlow can use the GPU
$python
Python 3.7.9 (default, Aug 31 2020, 12:42:55)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-12-14 18:51:40.956835: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.__version__
'2.3.0'
>>> tf.test.is_gpu_available()
True
>>> gpus = tf.config.experimental.list_physical_devices('GPU')
2020-12-14 18:54:02.379270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-12-14 18:54:02.381556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-12-14 18:54:02.383812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:82:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-12-14 18:54:02.386026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties:
pciBusID: 0000:83:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-12-14 18:54:02.386113: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-14 18:54:02.386197: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-14 18:54:02.386240: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-14 18:54:02.386275: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-14 18:54:02.386310: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-14 18:54:02.386345: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-14 18:54:02.386380: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-14 18:54:02.396935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
>>> gpus
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
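The same check works as a small script, which is handy on a machine without an interactive shell. This sketch uses tf.config.list_physical_devices, the non-experimental name for the call used above; as far as I know, tf.test.is_gpu_available is deprecated in TensorFlow 2.x in its favor. `list_gpus` is a hypothetical helper name.

```python
def list_gpus():
    """Return names of GPUs visible to TensorFlow, or None if it is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = list_gpus()
    if gpus is None:
        print("TensorFlow is not installed")
    else:
        print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

An empty list means TensorFlow imported fine but found no usable GPU, and in that case the dso_loader log lines usually name the library that failed to load.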
I hope this helps.
That's all.