Several Problems Encountered When Using CUDA

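Background: different versions of TensorFlow and PyTorch depend on different CUDA versions, and when using the newer framework versions the GPU could not be loaded.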

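  • Installing and using multiple CUDA versions side by side

  • Errors caused by a missing cuDNN

  • Errors caused by a driver version mismatch

  • Checking whether TensorFlow can use the GPU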

I. Multiple CUDA versions side by side

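  1. Downloading CUDA and choosing an installation method

    Download the installer from the CUDA Toolkit Download page.

    Installing from the .run file is recommended, because the .deb package may replace a newer GPU driver that is already installed.

    This post uses cuda_10.1.105_418.39_linux.run as the example.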

  2. Installing CUDA

    Change to the directory that contains cuda_10.1.105_418.39_linux.run:

    sudo chmod +x cuda_10.1.105_418.39_linux.run # make the installer executable
    ./cuda_10.1.105_418.39_linux.run # run the installer
    

    The most important prompts during installation are:

    Do you accept the previously read EULA?
    accept/decline/quit: accept
    
    Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.39?
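    (y)es/(n)o/(q)uit: n # If a newer GPU driver is already installed, there is no need to install this one. If you do want to install it, choose yes; the graphical interface must be shut down first.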
    
    Install the CUDA 10.1 Toolkit?
    (y)es/(n)o/(q)uit: y
    
    Enter Toolkit Location
    [ default is /usr/local/cuda-10.1 ]: # The default is usually fine. You can also install elsewhere and then either point your paths at that directory or symlink it to /usr/local/cuda.
    
    /usr/local/cuda-10.1 is not writable.
    Do you wish to run the installation with 'sudo'?
    (y)es/(n)o: y
    
    Please enter your password: 
    Do you want to install a symbolic link at /usr/local/cuda? # Whether to symlink the install directory to /usr/local/cuda. Either answer works; it depends on whether you want /usr/local/cuda to be your default CUDA directory.
    (y)es/(n)o/(q)uit: n
    
    Install the CUDA 10.1 Samples?
    (y)es/(n)o/(q)uit: n
    

    After installation, /usr/local contains:

    lrwxrwxrwx  1 root root   20 Oct 20 15:39 cuda -> /usr/local/cuda-10.0  # symlink to cuda-10.0
    drwxr-xr-x 19 root root 4.0K Aug 19  2019 cuda-10.0  # the previously installed cuda-10.0
    drwxr-xr-x 18 root root 4.0K Oct 20 15:25 cuda-10.1  # the newly installed cuda-10.1
    
    The installer also prints a summary when it finishes:
    
      ===========
      = Summary =
      ===========
    
      Driver:   Not Selected
      Toolkit:  Installed in /usr/local/cuda-10.1/
      Samples:  Installed in /home/username/, but missing recommended libraries
    
      Please make sure that
      -   PATH includes /usr/local/cuda-10.1/bin
      -   LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root
    
      To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
    
      Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.
      ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 418.00 is required for CUDA 10.1 functionality to work.
      To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
         sudo <CudaInstaller>.run --silent --driver
    
      Logfile is /var/log/cuda-installer.log
    
  3. Checking that CUDA installed correctly

    The usual command is:

    nvcc --version
    

    Alternatively, build and run the deviceQuery sample:

    $cd /usr/local/cuda-10.1/samples/1_Utilities/deviceQuery
    
    $ls 
    deviceQuery.cpp  Makefile  NsightEclipse.xml  readme.txt
    
    $make
    mkdir -p ../../bin/x86_64/linux/release
    cp deviceQuery ../../bin/x86_64/linux/release
    
    $ ./deviceQuery
    ./deviceQuery Starting...
    CUDA Device Query (Runtime API) version (CUDART static linking)
    ...
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 4
    Result = PASS
    
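  4. Switching between CUDA versions

    In ~/.bashrc or ~/.zshrc, point every CUDA-related path at /usr/local/cuda/ rather than at /usr/local/cuda-10.0/ or /usr/local/cuda-10.1/.

    (If you use a symlink here, take care when installing cuDNN.)

    # When switching CUDA versions
    sudo rm -f /usr/local/cuda  # remove the existing symlink first, if there is one
    sudo ln -s /usr/local/cuda-10.0/ /usr/local/cuda
    nvcc --version # show the current CUDA version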
    
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2018 NVIDIA Corporation
    Built on Sat_Aug_25_21:08:01_CDT_2018
    Cuda compilation tools, release 10.0, V10.0.130
    
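    # Switch from CUDA 10.0 to CUDA 10.1
    sudo rm -f /usr/local/cuda  # the link is owned by root, so sudo is required; this removes only the link itself
    sudo ln -s /usr/local/cuda-10.1/ /usr/local/cuda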
    nvcc --version
    
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Fri_Feb__8_19:08:17_PST_2019
    Cuda compilation tools, release 10.1, V10.1.105
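
The remove-then-link sequence above leaves a short window in which /usr/local/cuda does not exist. As a minimal sketch (not part of the original workflow), the same switch can be done in one rename from Python. The function name `switch_cuda` is hypothetical, and the demo runs in a scratch directory so it needs no root privileges; against the real /usr/local it would have to run as root.

```python
import os
import tempfile


def switch_cuda(target: str, link: str = "/usr/local/cuda") -> None:
    """Point `link` at `target`, replacing an existing symlink in one rename."""
    if not os.path.isdir(target):
        raise FileNotFoundError(f"CUDA directory not found: {target}")
    if os.path.lexists(link) and not os.path.islink(link):
        # Never touch a real directory: only an existing symlink may be replaced.
        raise RuntimeError(f"{link} exists and is not a symlink")
    # Create the new link under a temporary name, then rename it over the old one.
    tmp = f"{link}.tmp-{os.getpid()}"
    os.symlink(target, tmp)
    os.replace(tmp, link)


if __name__ == "__main__":
    # Demo with stand-in directories instead of the real /usr/local.
    with tempfile.TemporaryDirectory() as root:
        for version in ("cuda-10.0", "cuda-10.1"):
            os.mkdir(os.path.join(root, version))
        link = os.path.join(root, "cuda")
        switch_cuda(os.path.join(root, "cuda-10.0"), link)
        switch_cuda(os.path.join(root, "cuda-10.1"), link)
        print(os.readlink(link))
```

The check for a non-symlink is a deliberate safety choice: it refuses to act if /usr/local/cuda turns out to be a real installation directory.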
    
  5. Checking the CUDA version

    cat /usr/local/cuda/version.txt
    

    or

    nvcc --version  # nvcc -V
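
When a script needs the version rather than a human, the `release X.Y` part of the nvcc banner can be parsed. This is a small sketch under the assumption that the banner keeps the format shown in this post; the helper names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Banner copied from the nvcc output shown earlier in this post.
SAMPLE_BANNER = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
"""


def parse_nvcc_release(text: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from an nvcc banner, or None if it is not found."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_cuda_release() -> Optional[Tuple[int, int]]:
    """Run `nvcc --version` if nvcc is on PATH and parse its banner."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("sample banner:", parse_nvcc_release(SAMPLE_BANNER))
    print("this machine:", installed_cuda_release())
```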
    
  6. Uninstalling CUDA

    CUDA ships with its own uninstaller:

    cd /usr/local/cuda-10.1/bin
    ./cuda-uninstaller
    
    Successfully uninstalled
    
  7. Remember to configure .bashrc

    CUDAROOT=/usr/local/cuda-10.1  # use /usr/local/cuda instead if you rely on the symlink
    
    export PATH=$CUDAROOT/bin:$PATH
    export LD_LIBRARY_PATH=$CUDAROOT/lib64:$LD_LIBRARY_PATH
    export CUDA_HOME=$CUDAROOT
    export CUDA_PATH=$CUDAROOT
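
To try a particular CUDA version for a single process without editing .bashrc, the same four variables can be built in Python and passed to `subprocess`. This is only a sketch; `cuda_env` is a hypothetical helper. Unlike the shell lines above, it does not leave a trailing colon when LD_LIBRARY_PATH starts out empty (an empty entry makes the loader search the current directory).

```python
import os
from typing import Dict, Mapping


def cuda_env(cuda_root: str, base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of `base` with the CUDA paths from .bashrc prepended."""
    env = dict(base)

    def prepend(var: str, entry: str) -> None:
        # Drop empty entries and an older copy of `entry`, then put it first.
        parts = [p for p in env.get(var, "").split(os.pathsep) if p and p != entry]
        env[var] = os.pathsep.join([entry] + parts)

    prepend("PATH", os.path.join(cuda_root, "bin"))
    prepend("LD_LIBRARY_PATH", os.path.join(cuda_root, "lib64"))
    env["CUDA_HOME"] = cuda_root
    env["CUDA_PATH"] = cuda_root
    return env


if __name__ == "__main__":
    env = cuda_env("/usr/local/cuda-10.1", os.environ)
    for name in ("PATH", "LD_LIBRARY_PATH", "CUDA_HOME", "CUDA_PATH"):
        print(f"{name}={env[name]}")
```

The returned dictionary can be handed to `subprocess.run(..., env=env)` to launch a training script against that CUDA installation.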
    

II. Errors in TensorFlow caused by a missing cuDNN

Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory;LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64

Solution:

  1. Check the cuDNN version

    $cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
    cat: /usr/local/cuda/include/cudnn.h: No such file or directory
    

    cudnn.h does not exist, so cuDNN is not installed for this CUDA version.

  2. Install cuDNN

    Download the cuDNN release that matches your CUDA version.

    $tar -xvf cudnn-10.1-linux-x64-v7.5.0.56.tgz  # extracts into a directory named cuda
    
    $cd cuda  
    $sudo cp include/cudnn.h /usr/local/cuda/include/  # note: /usr/local/cuda is a symlink, so the files land in whichever version it currently points to
    $sudo cp lib64/*  /usr/local/cuda/lib64/
    $sudo chmod a+r /usr/local/cuda/include/cudnn.h
    $sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
     
    $cd /usr/local/cuda/lib64/
    $sudo rm -rf libcudnn.so libcudnn.so.7
    $sudo ln -s libcudnn.so.7.5.0 libcudnn.so.7
    $sudo ln -s libcudnn.so.7 libcudnn.so 
    $sudo ldconfig  # refresh the dynamic linker cache so the new libraries are visible system-wide
    

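    When several CUDA versions are installed and you switch between them with the symlink, it is best to install cuDNN into the specific version's directory: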

    sudo cp cuda/include/cudnn.h /usr/local/cuda-10.1/include/
    sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.1/lib64/
    sudo chmod a+r /usr/local/cuda-10.1/include/cudnn.h
    sudo chmod a+r /usr/local/cuda-10.1/lib64/libcudnn*
    
  3. Verify cuDNN

    $cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
    
    #define CUDNN_MAJOR 7
    #define CUDNN_MINOR 5
    #define CUDNN_PATCHLEVEL 0
    --
    #define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
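
The same check can be scripted. The sketch below reads the three macros that grep prints and applies the CUDNN_VERSION formula from the header; the function name is hypothetical, and it assumes the macros are defined in cudnn.h as they are for cuDNN 7.x.

```python
import re
from typing import Dict

# The lines printed by the grep command in this post.
SAMPLE_HEADER = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""


def parse_cudnn_header(text: str) -> Dict[str, int]:
    """Extract the version macros and compute CUDNN_VERSION from them."""
    values = {}
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in header text")
        values[name] = int(match.group(1))
    # Same formula as the CUDNN_VERSION macro shown above.
    values["CUDNN_VERSION"] = (
        values["CUDNN_MAJOR"] * 1000
        + values["CUDNN_MINOR"] * 100
        + values["CUDNN_PATCHLEVEL"]
    )
    return values


if __name__ == "__main__":
    print(parse_cudnn_header(SAMPLE_HEADER))
```

For a real installation, pass the contents of /usr/local/cuda/include/cudnn.h instead of the sample text.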
    
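  4. cuDNN version mismatch

    Loaded runtime CuDNN library: 7.5.0 but source was compiled with: 7.6.4
    

    The framework was built against cuDNN 7.6.4 but found 7.5.0 at run time. Remove version 7.5.0 and install version 7.6.4.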
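
The header only tells you what was copied into the include directory; the error above is about the library the loader actually picks up. A rough way to see that version is to call `cudnnGetVersion()` through `ctypes`. This is a sketch: the default library name libcudnn.so.7 is taken from the error message in this section, and the decoding assumes the 7.x formula (major * 1000 + minor * 100 + patch).

```python
import ctypes
from typing import Optional, Tuple


def decode_cudnn_version(value: int) -> Tuple[int, int, int]:
    """Split a cuDNN 7.x version number such as 7604 into (7, 6, 4)."""
    return value // 1000, (value % 1000) // 100, value % 100


def runtime_cudnn_version(soname: str = "libcudnn.so.7") -> Optional[Tuple[int, int, int]]:
    """Version of the cuDNN library the dynamic loader resolves, or None."""
    try:
        lib = ctypes.CDLL(soname)
    except OSError:
        # Same situation as "Could not load dynamic library 'libcudnn.so.7'".
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


if __name__ == "__main__":
    version = runtime_cudnn_version()
    if version is None:
        print("libcudnn.so.7 could not be loaded; check LD_LIBRARY_PATH")
    else:
        print("runtime cuDNN version: %d.%d.%d" % version)
```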

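III. Errors caused by a driver version mismatch

Problem description:

RuntimeError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Cause: the installed NVIDIA driver is older than the CUDA runtime in use requires.

Diagnosis and fix:

  1. Check the current driver version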

    $ls /usr/src | grep nvidia
    
    nvidia-384-384.130
    nvidia-410.48
    

    or

    $nvidia-smi
    ...      
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     ...
     +-----------------------------------------------------------------------------+
    

    The driver in use is 'nvidia-410.48'.

    If nvidia-smi reports the following error:

    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
    

    run the following commands to rebuild the driver's kernel module with DKMS:

    sudo apt install dkms
    sudo dkms install -m nvidia -v 410.48
    

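    If it still fails, work from the specific error message.

  2. Check the official version requirements

    Each CUDA version requires a minimum driver version: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

    CUDA Toolkit                                        Linux x86_64 driver   Windows x86_64 driver
    CUDA 10.1 (10.1.105 general release, and updates)   >= 418.39             >= 418.96
    CUDA 10.0.130                                       >= 410.48             >= 411.31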
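
The table lookup is easy to get wrong by eye, since version strings compare numerically per component rather than as text. A minimal sketch of the comparison, using only the two Linux rows quoted above (the dictionary and function names are hypothetical; extend the table from the release notes as needed):

```python
from typing import Dict, Tuple

# Minimum Linux x86_64 driver per CUDA toolkit, from the two rows quoted above.
MIN_LINUX_DRIVER: Dict[str, str] = {
    "10.1": "418.39",
    "10.0": "410.48",
}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '418.39' into (418, 39) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver: str, cuda: str) -> bool:
    """True if `driver` meets the minimum listed for CUDA `cuda`."""
    return version_tuple(driver) >= version_tuple(MIN_LINUX_DRIVER[cuda])


if __name__ == "__main__":
    for cuda in ("10.0", "10.1"):
        print(f"driver 410.48 with CUDA {cuda}: {driver_supports('410.48', cuda)}")
```

With the 410.48 driver from step 1, this reports that CUDA 10.0 is supported and CUDA 10.1 is not, which matches the "driver version is insufficient" error.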
    
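  3. Update the driver

    Download the driver installer from the official site and pick a suitable version. Download page: https://www.nvidia.com/zh-tw/geforce/drivers/

    # The downloaded file looks like this
    NVIDIA-Linux-x86_64-450.57.run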
    

    Install

    # Before installing, make sure nothing is using (or holding) the GPU
    
    # Then stop the display manager first, to avoid the error "You appear to be running an X server"
    sudo service lightdm stop 
    sudo init 3
    
    # The actual install command
    sudo ./NVIDIA-Linux-x86_64-450.57.run
    
    # If asked "Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later"
    # choose No
    
    # If asked "Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up"
    # choose No
    

    After the installation finishes:

    nvidia-smi
    Mon Dec 14 18:23:04 2020       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
     ...
    +-----------------------------------------------------------------------------+
    
  4. The loaded NVIDIA kernel module does not match the installed driver libraries
    After a successful driver installation you may see:

    Failed to initialize NVML: Driver/library version mismatch
    
    # Show the version of the NVIDIA kernel module that is currently loaded
    cat /proc/driver/nvidia/version
    

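    Update: I later ran into this problem again at work and solved it by following a blog post. The session below unloads the stale NVIDIA kernel modules, after which nvidia-smi works again.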

(base) dd@mq1:~/packages$ sudo rmmod nvidia
rmmod: ERROR: Module nvidia is in use by: nvidia_modeset
(base) dd@mq1:~/packages$ lsmod | grep nvidia
nvidia_modeset        860160  0
nvidia              13160448  1 nvidia_modeset
(base) dd@mq1:~/packages$ sudo lsof -n -w  /dev/nvidia*
(base) dd@mq1:~/packages$ sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is not currently loaded
(base) dd@mq1:~/packages$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
(base) dd@mq1:~/packages$ sudo rmmod nvidia_modeset
(base) dd@mq1:~/packages$ sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is not currently loaded
(base) dd@mq1:~/packages$ sudo rmmod nvidia_uvm
rmmod: ERROR: Module nvidia_uvm is not currently loaded
(base) dd@mq1:~/packages$ sudo rmmod nvidia
(base) dd@mq1:~/packages$ nvidia-smi
Mon Aug  8 16:25:43 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
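
In the session above, `rmmod nvidia` fails until `nvidia_modeset` has been removed, because a module cannot be unloaded while another module uses it. The order can be derived from the "Used by" column of `lsmod`. The sketch below only computes the order and runs nothing; the function name is hypothetical, and it ignores users outside the `nvidia*` modules, which would have to be handled by hand.

```python
from typing import Dict, List

# The lsmod lines from the session above.
SAMPLE_LSMOD = """nvidia_modeset        860160  0
nvidia              13160448  1 nvidia_modeset
"""


def unload_order(lsmod_text: str, prefix: str = "nvidia") -> List[str]:
    """Order in which matching modules can be passed to rmmod (users first)."""
    users: Dict[str, List[str]] = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith(prefix):
            continue  # skips the header row and unrelated modules
        used_by = fields[3].split(",") if len(fields) > 3 else []
        users[fields[0]] = [name for name in used_by if name]

    order: List[str] = []
    seen = set()

    def visit(module: str) -> None:
        if module in seen:
            return
        seen.add(module)
        for user in users[module]:
            if user in users:
                visit(user)  # whatever uses this module must go first
        order.append(module)

    for module in sorted(users):
        visit(module)
    return order


if __name__ == "__main__":
    for module in unload_order(SAMPLE_LSMOD):
        print(f"sudo rmmod {module}")
```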

IV. Checking whether TensorFlow can use the GPU

$python
Python 3.7.9 (default, Aug 31 2020, 12:42:55) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-12-14 18:51:40.956835: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.__version__
'2.3.0'
>>> tf.test.is_gpu_available()
True
>>> gpus = tf.config.experimental.list_physical_devices('GPU')
2020-12-14 18:54:02.379270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-12-14 18:54:02.381556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-12-14 18:54:02.383812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties: 
pciBusID: 0000:82:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-12-14 18:54:02.386026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties: 
pciBusID: 0000:83:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-12-14 18:54:02.386113: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-14 18:54:02.386197: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-14 18:54:02.386240: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-14 18:54:02.386275: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-14 18:54:02.386310: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-14 18:54:02.386345: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-14 18:54:02.386380: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-14 18:54:02.396935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2, 3
>>> gpus
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
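
The interactive check can also be kept as a small script. `tf.test.is_gpu_available()` is deprecated in TensorFlow 2.x in favor of `tf.config.list_physical_devices('GPU')`, so this sketch uses the latter; the function name `visible_gpus` is hypothetical, and it returns an empty list when TensorFlow is not installed.

```python
from typing import List


def visible_gpus() -> List[str]:
    """Names of the GPUs TensorFlow can see; empty if none or TF is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = visible_gpus()
    if names:
        print(f"TensorFlow sees {len(names)} GPU(s): {names}")
    else:
        print("No GPU visible to TensorFlow; check the driver, CUDA and cuDNN steps above")
```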

Hope this helps.
