$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
Check whether CUDA is usable:
>>> import torch
>>> torch.cuda.is_available()
False
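Check the current drivers. A whole pile of GPU drivers shows up (a driver is the program that operates the graphics card, i.e. the software counterpart of the hardware). Note that ubuntu-drivers devices lists the driver packages available for the card, not only the ones that are installed.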
$ ubuntu-drivers devices
== /sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0 ==
modalias : pci:v000010DEd00001E87sv00001458sd000037A8bc03sc00i00
vendor : NVIDIA Corporation
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-470-server - distro non-free recommended
driver : nvidia-driver-460 - distro non-free
driver : nvidia-driver-460-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
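Picking the recommended entry out of that listing by eye is easy to get wrong when there are many candidates. A minimal sketch of how it could be extracted, assuming the output keeps the "driver : NAME - ... recommended" layout shown above (the function name recommended_driver is my own, not part of any tool):

```python
def recommended_driver(ubuntu_drivers_output):
    """Return the package flagged 'recommended', or None if there is none.

    Assumes each candidate is printed as
    'driver : <package> - <attributes...>', as in the listing above.
    """
    for line in ubuntu_drivers_output.splitlines():
        # Split on the first colon only; device paths contain colons too.
        key, _, value = line.partition(":")
        if key.strip() != "driver":
            continue
        fields = value.split()
        if fields and "recommended" in fields:
            return fields[0]
    return None


if __name__ == "__main__":
    # Two lines copied from the listing above.
    sample = (
        "driver : nvidia-driver-470 - distro non-free\n"
        "driver : nvidia-driver-470-server - distro non-free recommended\n"
    )
    print(recommended_driver(sample))  # prints nvidia-driver-470-server
```

In practice the text would come from capturing the output of ubuntu-drivers devices, for example through subprocess.run with capture_output=True.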
The drivers are a mess, so I decided to remove all the existing ones and reinstall the recommended version:
$ sudo apt-get remove --purge nvidia-\*
Install:
$ sudo apt-get install nvidia-driver-470-server nvidia-settings nvidia-prime
But running nvidia-smi still reports the same error.
Check which driver is currently installed:
$ dpkg -l | grep nvidia-driver
ii nvidia-driver-470-server 470.57.02-0ubuntu0.18.04.2 amd64 NVIDIA Server Driver metapackage
So the driver package was installed successfully.
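The original post says this problem occurs because the NVIDIA driver kernel module has not been updated: the module loaded in the running kernel is still the old version, while the user-space libraries are already the new one. Normally a reboot fixes it. If the machine cannot be rebooted for some reason, there is also a way to reload the kernel module: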
- unload the nvidia kernel module, i.e., (sudo rmmod nvidia)
- reload the nvidia kernel module, i.e., (sudo nvidia-smi)
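Before reloading anything, it can help to confirm that the loaded module really is older than the installed package. The following is only a sketch under two assumptions of mine: that the loaded module exposes its version in /sys/module/nvidia/version, and that the dpkg -l row has the same layout as the one shown earlier. The function names are hypothetical.

```python
import re
from pathlib import Path


def package_version(dpkg_line):
    """Extract the driver version (e.g. 470.57.02) from a `dpkg -l` row.

    Assumes the version column looks like '470.57.02-0ubuntu0...'.
    """
    match = re.search(r"\s(\d+\.\d+(?:\.\d+)?)-", dpkg_line)
    return match.group(1) if match else None


def loaded_module_version(sys_path="/sys/module/nvidia/version"):
    """Return the version of the loaded nvidia module, or None if unreadable."""
    try:
        return Path(sys_path).read_text().strip()
    except OSError:
        return None


def needs_reload(dpkg_line, loaded):
    """True when both versions are known and they differ."""
    installed = package_version(dpkg_line)
    return installed is not None and loaded is not None and installed != loaded


if __name__ == "__main__":
    # Row copied from the dpkg output above.
    row = ("ii nvidia-driver-470-server 470.57.02-0ubuntu0.18.04.2 "
           "amd64 NVIDIA Server Driver metapackage")
    print("installed:", package_version(row))
    print("loaded:", loaded_module_version())
    print("needs reload:", needs_reload(row, loaded_module_version()))
```

If the two versions differ, that matches the mismatch nvidia-smi complains about. If the loaded version cannot be read, the sketch reports nothing rather than guessing.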
When I ran this, the unload failed:
$ sudo rmmod nvidia
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm nvidia_modese
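The error means other loaded modules still depend on nvidia. To see the dependency chain, /proc/modules can be read directly. This is a rough sketch that assumes the usual /proc/modules layout (name, size, reference count, then a comma-terminated list of users, or "-" when there are none). The helper names are my own.

```python
from pathlib import Path


def parse_users(proc_modules_text):
    """Map each loaded module to the list of modules that use it."""
    users = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        name, used_by = fields[0], fields[3]
        # The fourth column is '-' or a list such as 'a,b,'.
        users[name] = [] if used_by == "-" else [m for m in used_by.split(",") if m]
    return users


def unload_order(users, target):
    """List modules so that every user comes before the module it uses."""
    order, seen = [], set()

    def visit(module):
        if module in seen:
            return
        seen.add(module)
        for user in users.get(module, []):
            visit(user)
        order.append(module)

    visit(target)
    return order


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        print(unload_order(parse_users(path.read_text()), "nvidia"))
```

The printed list only covers module-to-module dependencies. A module can also be held by a running process (an X server or a CUDA job, for instance), which this sketch does not detect.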