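After a server reboot, a newer kernel was loaded and `nvidia-smi` stopped working, because the installed NVIDIA kernel module no longer matched the running kernel. Reinstalling the nvidia module with dkms fixed the problem, and no further reboot was needed.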
1. Run `nvidia-smi` and see the error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
2. Check the running kernel with `uname -r`
:~$ uname -r
4.15.0-192-generic
3. Check the nvidia module status with `dkms status nvidia`
:~$ dkms status nvidia
nvidia,515.43.04,4.15.0-162-generic,x86_64: installed
The nvidia module is built for the previous kernel, "4.15.0-162-generic", not for the running "4.15.0-192-generic".
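The comparison in steps 2 and 3 can be sketched as a small shell check. This is a minimal sketch, not part of the original procedure: the function name `nvidia_built_for_kernel` is hypothetical, and it assumes the `dkms status` line mentions the kernel release and the word `installed`, as in the output above.

```shell
# Hypothetical helper: read a `dkms status` listing on stdin and succeed only
# if it contains an installed nvidia entry for the given kernel release.
nvidia_built_for_kernel() {
    # $1 is the kernel release, e.g. the output of `uname -r`.
    # -F matches the release as a literal string, so its dots are not wildcards.
    grep 'nvidia' | grep -F -- "$1" | grep -q 'installed'
}

# Usage sketch, commented out so that sourcing this file runs nothing:
# if dkms status nvidia | nvidia_built_for_kernel "$(uname -r)"; then
#     echo "nvidia module is installed for $(uname -r)"
# else
#     echo "no nvidia module for $(uname -r); reinstall it with dkms"
# fi
```

Reading the listing from stdin keeps the check independent of dkms itself, so it can be tried against a saved status line first.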
4. Reinstall with dkms. This fails with an error saying that "linux-headers-4.15.0-192-generic" must be installed first
:~$ sudo dkms install -m nvidia -v 515.43.04
Error! Your kernel headers for kernel 4.15.0-192-generic cannot be found.
Please install the linux-headers-4.15.0-192-generic package,
or use the --kernelsourcedir option to tell DKMS where it's located
5. Check which kernel headers are already present under /usr/src/
:~$ ll /usr/src/
6. Install "linux-headers-4.15.0-192-generic"
:~$ sudo apt install linux-headers-4.15.0-192-generic
Once the installation succeeds, list /usr/src/ again to confirm that the new headers are there.
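The manual look at /usr/src/ in steps 5 and 6 can also be sketched as a check. This is a minimal sketch under one assumption: that the headers package unpacks into a directory named `linux-headers-<kernel release>` under /usr/src, as Ubuntu packages normally do. The function name `have_kernel_headers` and its optional second argument are hypothetical.

```shell
# Hypothetical helper: succeed if a headers directory for the given kernel
# release exists under the source root (second argument, default /usr/src).
have_kernel_headers() {
    kernel="$1"
    src_root="${2:-/usr/src}"
    # Assumption: the directory is named after the package, linux-headers-<release>.
    [ -d "$src_root/linux-headers-$kernel" ]
}

# Usage sketch, commented out so that sourcing this file runs nothing:
# if have_kernel_headers "$(uname -r)"; then
#     echo "headers for $(uname -r) are present"
# else
#     echo "install linux-headers-$(uname -r) first"
# fi
```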
7. Install nvidia again by rerunning the `dkms install` command from step 4
The installation completes.
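Taken together, the fix comes down to three commands. The sketch below only prints them for review and runs nothing. The function name `print_recovery_commands` is hypothetical, and the final `nvidia-smi` call is added here as a verification step, on the assumption that it works again once the module is installed.

```shell
# Hypothetical helper: print the recovery commands for a kernel release and a
# driver version without executing any of them.
print_recovery_commands() {
    kernel="$1"
    driver_version="$2"
    # Step 6: install the headers that match the running kernel.
    printf 'sudo apt install linux-headers-%s\n' "$kernel"
    # Step 7: reinstall the nvidia module; dkms targets the running kernel by default.
    printf 'sudo dkms install -m nvidia -v %s\n' "$driver_version"
    # Verification (assumption): nvidia-smi should now reach the driver.
    printf 'nvidia-smi\n'
}

# Usage sketch, commented out so that sourcing this file runs nothing:
# print_recovery_commands "$(uname -r)" 515.43.04
```

Printing instead of executing leaves the decision to run `sudo` commands with the operator, which is the safer default on a shared GPU server.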
References:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver
If the kernel is unchanged after a reboot but the driver was upgraded, see "[nvidia-smi] Failed to initialize NVML: Driver/library version mismatch: a fix that needs no reboot" (CSDN blog)