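How it started
I added a hard drive before New Year's Day. When I checked the server after the holiday, the NVIDIA GPUs were no longer usable.
Running nvidia-smi printed: Failed to initialize NVML: Driver/library version mismatch.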
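Diagnosis
- My first thought was that someone had updated the driver or the kernel, but the shell history showed almost no changes.
- Next I checked the driver version as reported by the kernel. That looked fine too: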
:cat /sys/module/nvidia/version
470.182.03
:cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.182.03 Fri Feb 24 03:29:56 UTC 2023
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
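Both files above describe the kernel module only. The user-space side, the libnvidia-ml library that nvidia-smi loads, carries its own version in its file name. The following is a minimal sketch for reading it, not a command from the original investigation. `lib_version` is a hypothetical helper, and the library directory in the usage comment is an assumption based on a typical Ubuntu layout.

```shell
#!/bin/sh
# Extract the driver version from a libnvidia-ml file name,
# e.g. ".../libnvidia-ml.so.470.63.01" -> "470.63.01".
lib_version() {
    basename "$1" | sed 's/^libnvidia-ml\.so\.//'
}

# Usage on a real machine (the directory is an assumption; adjust for your distro):
#   for f in /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*.*; do lib_version "$f"; done
#   cat /sys/module/nvidia/version
```

If the library version and the module version differ, nvidia-smi reports the mismatch seen at the start of this post.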
- A hardware problem, then? Go straight to dmesg:
dmesg
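That revealed the problem: the user-space driver components (the "client", 470.63.01) and the loaded kernel module (470.182.03) have different versions.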
[ 15.320131] NVRM: API mismatch: the client has the version 470.63.01, but
NVRM: this kernel module has the version 470.182.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
[ 17.318103] IPv6: ADDRCONF(NETDEV_CHANGE): eno2: link becomes ready
[ 57.825979] NVRM: API mismatch: the client has the version 470.63.01, but
NVRM: this kernel module has the version 470.182.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
[ 267.855713] NVRM: API mismatch: the client has the version 470.63.01, but
NVRM: this kernel module has the version 470.182.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
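On a busy machine the same message repeats many times, so it can help to boil the log down to the two version numbers. This is a rough sketch that assumes the message wording shown above and a grep and sed that support `-o` and `-E` (GNU and BSD both do). `nvrm_mismatch_versions` is a hypothetical helper name.

```shell
#!/bin/sh
# Read kernel log text on stdin and print each distinct
# "client=<version> module=<version>" pair from NVRM API-mismatch messages.
nvrm_mismatch_versions() {
    # Strip the "[ 15.320131] " timestamp and the "NVRM: " prefix, then join
    # all lines so a message wrapped over several lines becomes one string.
    sed -e 's/^\[[^]]*\] *//' -e 's/^ *NVRM: *//' | tr '\n' ' ' |
        grep -oE 'client has the version [0-9.]+, but +this kernel module has the version [0-9]+(\.[0-9]+)*' |
        sed -E 's/client has the version ([0-9.]+), but +this kernel module has the version ([0-9.]+)/client=\1 module=\2/' |
        sort -u
}

# Usage: sudo dmesg | nvrm_mismatch_versions
```

The "client" is the user-space library and the "module" is what the kernel loaded, so this tells you which side has to change.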
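Solution
With the cause identified, it was time to fix it. There were two options:
- Reinstall the driver
- Downgrade the kernel module
Reinstalling the driver burned me once before (my results dropped by 2 points right afterwards), so this time I chose to downgrade the NVIDIA kernel module.
- First, try sudo reboot. If the GPUs work after the reboot, congratulations, you are done.
- If not, try sudo rmmod. The idea is to unload the NVIDIA kernel modules and let the system load them again. For the exact procedure, see this author's write-up on rmmod.
- If that still does not help, consider my approach.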
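For the rmmod route, the order of unloading matters, because the core nvidia module cannot be removed while the modules that depend on it are loaded. The sketch below is my own illustration, not the procedure from the linked write-up. It assumes the usual module names of the proprietary driver (nvidia_drm, nvidia_modeset, nvidia_uvm, nvidia), and `nvidia_unload_order` is a hypothetical helper.

```shell
#!/bin/sh
# Given `lsmod` output on stdin, print the loaded NVIDIA modules in an order
# that unloads dependants first and the core "nvidia" module last.
nvidia_unload_order() {
    loaded=$(awk 'NR > 1 { print $1 }')   # skip the header, keep module names
    for m in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        echo "$loaded" | grep -qx "$m" && echo "$m"
    done
    return 0
}

# Usage (first stop every process that holds the GPU, e.g. the display manager):
#   for m in $(lsmod | nvidia_unload_order); do sudo rmmod "$m"; done
#   sudo modprobe nvidia
```

This only helps when the module on disk already matches the user-space libraries and an older module is merely still loaded in memory.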
First, list the driver modules that DKMS has built or installed:
(base):~$ dkms status
nvidia, 470.182.03, 5.4.0-97-generic, x86_64: installed
nvidia, 470.182.03, 5.4.0-99-generic, x86_64: installed
nvidia, 470.63.01, 5.4.0-99-generic, x86_64: built
Then check the kernel you are currently running:
~$ uname -r
5.4.0-99-generic
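To line the two outputs up, one could filter the DKMS list down to the running kernel. This is a small sketch that assumes the comma-separated `dkms status` format shown above (newer DKMS releases print `nvidia/470.182.03, ...` instead, which it does not handle). `nvidia_dkms_for_kernel` is a hypothetical helper.

```shell
#!/bin/sh
# Read `dkms status` output on stdin and print "<version> <state>" for every
# nvidia entry that belongs to the kernel release given as the first argument.
nvidia_dkms_for_kernel() {
    # Fields split on "," or ":" plus spaces: name, version, kernel, arch, state.
    awk -F'[,:] *' -v k="$1" '$1 == "nvidia" && $3 == k { print $2, $NF }'
}

# Usage: dkms status | nvidia_dkms_for_kernel "$(uname -r)"
```

With the output above and kernel 5.4.0-99-generic, this prints `470.182.03 installed` and `470.63.01 built`.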
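As you can see, two NVIDIA module versions have been built for the running kernel 5.4.0-99-generic, and 470.182.03 is the one that is installed, while the user-space components are still at 470.63.01. The fix is therefore to remove nvidia 470.182.03 and install nvidia 470.63.01 for 5.4.0-99-generic instead. Note that --all removes 470.182.03 for every kernel, including 5.4.0-97-generic.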
sudo dkms remove nvidia/470.182.03 --all
sudo dkms install nvidia/470.63.01
sudo update-initramfs -u
## update-initramfs may not be available on your system. If it fails, try sudo mkinitcpio -P (the equivalent on Arch-based distributions)
sudo reboot
After the reboot, run nvidia-smi again:
(base):~$ nvidia-smi
Tue Jan 2 11:59:52 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:65:00.0 Off |                  N/A |
| 19%   38C    P0    66W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 18%   34C    P0    60W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
That completes the kernel module downgrade. If your problem is still not solved, reinstalling the driver is also an option.
For more details, see Stack Overflow.