Failed to initialize NVML: Driver/library version mismatch

Background

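Just before New Year's Day I added a hard drive to the server. When I came back after the holiday, the NVIDIA GPUs were no longer usable.
Running nvidia-smi printed: Failed to initialize NVML: Driver/library version mismatch.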

Problem analysis

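  • My first thought was that someone had updated the driver or the kernel, but the shell history showed almost nothing unusual.
  • Next I checked the version of the loaded NVIDIA kernel module. That looked fine as well: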
$ cat /sys/module/nvidia/version
470.182.03
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.182.03  Fri Feb 24 03:29:56 UTC 2023
GCC version:  gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) 
  • A hardware problem, then? Look at dmesg directly:
dmesg

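That shows the actual problem: the user-space driver components (the "client", version 470.63.01) do not match the loaded kernel module (version 470.182.03).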

[   15.320131] NVRM: API mismatch: the client has the version 470.63.01, but
               NVRM: this kernel module has the version 470.182.03.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
[   17.318103] IPv6: ADDRCONF(NETDEV_CHANGE): eno2: link becomes ready
[   57.825979] NVRM: API mismatch: the client has the version 470.63.01, but
               NVRM: this kernel module has the version 470.182.03.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
[  267.855713] NVRM: API mismatch: the client has the version 470.63.01, but
               NVRM: this kernel module has the version 470.182.03.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
               NVRM: components have the same version.
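The two version numbers can also be pulled out of the kernel log automatically. The following is a minimal sketch, not part of the original fix: the function name nvrm_mismatch_versions is made up for illustration, and it assumes the message wording and line wrapping shown in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and print
# "<client version> <kernel module version>" once per distinct mismatch.
# Assumes the NVRM message wording shown in the dmesg output above.
nvrm_mismatch_versions() {
    # Join all lines so the wrapped message can be matched as one string.
    tr '\n' ' ' |
        grep -oE 'client has the version [0-9.]+, but[^A-Za-z]*NVRM: this kernel module has the version [0-9.]+[0-9]' |
        sed -E 's/.*client has the version ([0-9.]+), but.*module has the version ([0-9.]+)$/\1 \2/' |
        sort -u
}

# Usage (reading the kernel log may need root on some systems):
#   sudo dmesg | nvrm_mismatch_versions
```

The first number is what the user-space libraries expect, the second is what the loaded kernel module reports. Whichever side is changed, both have to end up at the same version.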

Solution

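With the cause found, it needs fixing. There are two options:

  1. Reinstall the driver
  2. Downgrade the kernel module to the version the user-space components expect

I had a painful experience reinstalling the driver before (my results dropped by 2 points right after the install), so this time I chose to downgrade the NVIDIA kernel module. Note that it is the module version that gets downgraded here, not the Linux kernel itself.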

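  1. First, try sudo reboot. If everything works after the reboot, congratulations, you are done.
  2. If not, try sudo rmmod. The idea is to unload the NVIDIA kernel modules and let the system load them again. For the exact procedure, see the linked write-up on rmmod.
  3. If that still does not help, consider my approach below.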
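Step 2 could look roughly like the sketch below. It is an assumption on my part, not the linked procedure: the module names (nvidia_drm, nvidia_modeset, nvidia_uvm, nvidia) are the ones the proprietary driver typically loads, and the function name reload_nvidia_modules is made up. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: unload the NVIDIA kernel modules (dependents first), then load
# the base module again. The module list is an assumption; check
# `lsmod | grep nvidia` on your machine first.
# With DRY_RUN=1 (the default) the commands are only printed.
reload_nvidia_modules() {
    local run="echo sudo"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run="sudo"
    fi
    local mod
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # rmmod fails if the module is not loaded or still in use;
        # that is reported but does not stop the loop.
        $run rmmod "$mod"
    done
    $run modprobe nvidia
}

# Usage:
#   reload_nvidia_modules              # print the commands only
#   DRY_RUN=0 reload_nvidia_modules    # actually run them
```

Unloading fails while anything still uses the GPU (a display server, a running job), so those have to be stopped first. Reloading only helps when the module on disk already matches the user-space libraries and a stale copy is merely still loaded in memory.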

First, list the driver modules that DKMS has built or installed:

(base):~$ dkms status


nvidia, 470.182.03, 5.4.0-97-generic, x86_64: installed
nvidia, 470.182.03, 5.4.0-99-generic, x86_64: installed
nvidia, 470.63.01, 5.4.0-99-generic, x86_64: built

Then check which kernel is currently running:

~$ uname -r
5.4.0-99-generic

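As the output shows, two versions of the NVIDIA module exist for the running kernel 5.4.0-99-generic: 470.63.01 is only built, while 470.182.03 is the one actually installed. The fix is therefore to remove nvidia 470.182.03 and install nvidia 470.63.01 for this kernel. Note that the --all flag in the first command removes 470.182.03 for every kernel, including 5.4.0-97-generic.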
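To see at a glance which module versions exist for the running kernel, the dkms status output can be filtered. This is a small sketch with a made-up function name, nvidia_dkms_for_kernel, and it assumes the "module, version, kernel, arch: state" layout shown above (newer DKMS releases print "module/version" instead, which this would not match).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dkms status` output on stdin and print
# "<version> <state>" for every nvidia entry that belongs to the given kernel.
# Assumes the "module, version, kernel, arch: state" layout shown above.
nvidia_dkms_for_kernel() {
    local kernel="$1"
    # Split fields on "," or ":" followed by optional spaces.
    awk -F'[,:] *' -v k="$kernel" '$1 == "nvidia" && $3 == k { print $2, $5 }'
}

# Usage:
#   dkms status | nvidia_dkms_for_kernel "$(uname -r)"
```

For the output above this prints one line per version, so it is easy to spot that the "installed" entry is not the version the user-space libraries ask for.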

sudo dkms remove nvidia/470.182.03 --all
sudo dkms install nvidia/470.63.01
sudo update-initramfs -u
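## update-initramfs is the Debian/Ubuntu tool and may not exist on your system; if it does not (for example on Arch), try sudo mkinitcpio -P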
sudo reboot

After the reboot, try nvidia-smi again:

(base):~$ nvidia-smi
Tue Jan  2 11:59:52 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:65:00.0 Off |                  N/A |
| 19%   38C    P0    66W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 18%   34C    P0    60W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
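The same check can be scripted by comparing the version of the loaded kernel module with the version the user-space tools report. The sketch below is only an illustration: driver_versions_match is a made-up name, and the usage lines assume that /sys/module/nvidia/version exists and that nvidia-smi is on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only when both version strings are
# non-empty and identical; otherwise report the mismatch and return 1.
driver_versions_match() {
    local module_version="$1" client_version="$2"
    if [ -n "$module_version" ] && [ "$module_version" = "$client_version" ]; then
        echo "OK: kernel module and user-space driver are both $module_version"
    else
        echo "MISMATCH: kernel module=$module_version, user space=$client_version"
        return 1
    fi
}

# Usage:
#   driver_versions_match \
#       "$(cat /sys/module/nvidia/version)" \
#       "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```

While the mismatch is still present, nvidia-smi prints its error message instead of a version, so the function reports a mismatch in that case as well.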

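That completes the downgrade of the NVIDIA kernel module. If this still does not solve your problem, reinstalling the driver is also an option.
For more details, see the related discussion on stackoverflow.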
