服务器配置
操作系统:Ubuntu 18.04
显卡:nvidia 470
问题
输入
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
也就是传说中的显卡不匹配问题。吓死宝宝了。
解决
确定显卡
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.xxx Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
从上面输出可以知道,显卡为 470。
抢救方案
删除 nvidia 当前所有驱动
$ sudo dpkg --list | grep nvidia-*
增加源
sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update
安装驱动
sudo apt-get install nvidia-470 nvidia-settings nvidia-prime
重新启动
sudo reboot
验证显卡
$ nvidia-smi
Mon May 9 20:09:31 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 510.60.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1A:00.0 Off | 0 |
| N/A 28C P0 39W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:1B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 31C P0 42W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:3E:00.0 Off | 0 |
| N/A 27C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:88:00.0 Off | 0 |
| N/A 27C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 29C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 29C P0 40W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:B3:00.0 Off | 0 |
| N/A 26C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ lspci | grep -i nvidia
1a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
1b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
3d:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
3e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
88:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
b2:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
b3:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
吓坏了,还以为我吧服务器搞坏了。