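This morning I wanted to check the CUDA version on the server, but nvcc -V reported command not found. Without thinking much, I followed advice I found online and ran sudo apt install nvidia-cuda-toolkit, and then got this error: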
>>> nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
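As an aside, command not found from nvcc does not necessarily mean the toolkit is missing: NVIDIA's own installers put it under /usr/local/cuda*, which is often simply not on PATH. A minimal sketch of a check to run before installing anything, assuming that default prefix (the function names find_nvcc_under and find_nvcc are made up for this example):

```shell
#!/bin/sh
# Look for an executable nvcc under <prefix>/cuda*/bin.
# Prints the first match and returns 0, or returns 1 if none is found.
find_nvcc_under() {
    prefix=$1
    for candidate in "$prefix"/cuda*/bin/nvcc; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer an nvcc that is already on PATH, then fall back to the
# assumed default install prefix /usr/local.
find_nvcc() {
    if command -v nvcc >/dev/null 2>&1; then
        command -v nvcc
        return 0
    fi
    find_nvcc_under /usr/local
}

# Usage (prints a hint instead of installing anything):
#   find_nvcc || echo "no nvcc found on PATH or under /usr/local/cuda*"
```

If this prints a path, adding its directory to PATH is probably enough and no package installation is needed.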
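I nearly died on the spot. A quick search suggested that CUDA might have been reinstalled, which sent me into a panic, so I scrambled to repair the damage.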
>>> sudo dpkg --list | grep nvidia
ii libnvidia-compute-450-server:amd64 450.51.06-0ubuntu0.18.04.2 amd64 NVIDIA libcompute package
ii nvidia-cuda-dev 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 9.1.85-3ubuntu1 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 9.1.85-3ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development toolkit
ii nvidia-opencl-dev:amd64 9.1.85-3ubuntu1 amd64 NVIDIA OpenCL development files
ii nvidia-prime 0.8.8 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 440.44-0ubuntu0.18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-visual-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
>>> cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.64 Fri Feb 21 01:17:26 UTC 2020
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
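Sure enough, the two driver versions do not match. The kernel module that is loaded is still 440.64, but the installed user-space library (libnvidia-compute-450-server) has been bumped straight to 450.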
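The comparison done by eye above can be sketched as a small script. This is only an illustration: the function names are hypothetical, and it assumes the version string layouts shown in the two outputs above (an NVRM version line, and a dpkg version of the form upstream-revision).

```shell
#!/bin/sh
# Read /proc/driver/nvidia/version style text on stdin and print the
# kernel module version, e.g. 440.64.
kernel_module_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Strip the Debian revision from a package version:
# 450.51.06-0ubuntu0.18.04.2 -> 450.51.06
upstream_version() {
    printf '%s\n' "${1%%-*}"
}

# Succeed only if the kernel module version equals the upstream part
# of the package version.
versions_match() {
    [ "$1" = "$(upstream_version "$2")" ]
}

# Usage on a live system (the package name is taken from the dpkg
# listing above and will differ on other machines):
#   kmod=$(kernel_module_version < /proc/driver/nvidia/version)
#   lib=$(dpkg-query -W -f='${Version}' libnvidia-compute-450-server)
#   versions_match "$kmod" "$lib" || echo "mismatch: $kmod vs $lib"
```

With the values from this post, 440.64 against 450.51.06-0ubuntu0.18.04.2, the check fails, which is consistent with the NVML error.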
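After flailing around for quite a while, what finally worked was:
- Remove the driver:
sudo apt-get purge 'nvidia*'
- Add the graphics-drivers PPA (Personal Package Archive, Ubuntu only):
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
- Reinstall the driver:
sudo apt-get install nvidia-driver-440 nvidia-settings nvidia-prime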
Running nvidia-smi again at this point works normally.
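PS: Even though I specified 440 when reinstalling, the driver still ended up at 450, and nvidia-smi reports CUDA Version: 11.0. I tested that my programs can still use the GPU. I do not know what effect the version bump will have later on, so this pitfall log ends here for now.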
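One possible explanation, which is an assumption and was not verified on this server, is that nvidia-driver-440 in the repository had become a transitional package that pulls in the 450 series. A rough sketch of how to see what apt would install before committing to it, by reading the Candidate line of apt-cache policy (the helper names candidate_version and in_series are made up for this example):

```shell
#!/bin/sh
# Read `apt-cache policy <package>` output on stdin and print the
# candidate version apt would install.
candidate_version() {
    sed -n 's/^[[:space:]]*Candidate:[[:space:]]*//p'
}

# Succeed if version $1 belongs to driver series $2,
# e.g. in_series 440.100-0ubuntu0.18.04.1 440
in_series() {
    case "$1" in
        "$2".*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage on a live system:
#   cand=$(apt-cache policy nvidia-driver-440 | candidate_version)
#   in_series "$cand" 440 || echo "apt would install $cand, not a 440 driver"
#   apt-get install --simulate nvidia-driver-440   # dry run, changes nothing
# Once a working version is installed, it can be kept from upgrading with:
#   sudo apt-mark hold nvidia-driver-440
```

The dry run lists every package that would be installed or upgraded, so a jump to another driver series shows up before anything on the system changes.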
Reference: https://blog.csdn.net/anyang1996/article/details/107937898