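Preface
A colleague told me that a dual-GPU virtual machine was showing only one of its two cards and asked me to take a look.
1. Symptoms
1.1 nvidia-smi lists only one card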
(base) root@XXX:~# nvidia-smi
Wed Feb 19 14:13:33 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.156.00   Driver Version: 450.156.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:07.0 Off |                    0 |
| N/A   42C    P0    27W /  70W |   9224MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2816445      C   .../XXX/bin/python               9221MiB |
+-----------------------------------------------------------------------------+
1.2 dmesg reports RmInitAdapter failed
[14094353.118943] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x24:0x65:1224)
[14094353.120811] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
[14094360.267337] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x24:0x65:1224)
[14094360.269036] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
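To see quickly which card the driver failed to bring up, the PCI address can be pulled out of these messages. The following is a minimal sketch, assuming the log lines have the same shape as the ones above; the function name failed_gpus is made up for illustration.

```shell
#!/bin/sh
# Sketch: read dmesg-style text on stdin and print the PCI address of every
# GPU for which the NVIDIA kernel module logged "RmInitAdapter failed".
# failed_gpus is a hypothetical helper name, not part of any NVIDIA tool.
failed_gpus() {
    grep 'RmInitAdapter failed' \
        | sed -E 's/.*NVRM: GPU ([0-9a-fA-F:.]+): RmInitAdapter failed.*/\1/' \
        | sort -u    # each address once, however often the error repeats
}

# Typical use inside the affected VM (may require root):
#   dmesg | failed_gpus
```

Fed the four lines above, it prints a single address, 0000:00:06.0, which is the one to look for in the lspci output.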
1.3 Output of lspci -v
00:06.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
        Subsystem: NVIDIA Corporation TU104GL [Tesla T4]
        Physical Slot: 6
        Flags: bus master, fast devsel, latency 0, IRQ 11
        Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at f2000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
00:07.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
        Subsystem: NVIDIA Corporation TU104GL [Tesla T4]
        Physical Slot: 7
        Flags: bus master, fast devsel, latency 0, IRQ 10
        Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f4000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
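Comparing the two listings by eye works for two cards, but the comparison can also be scripted. The sketch below assumes the PCI list comes from `lspci -D -d 10de:` (10de is NVIDIA's PCI vendor ID) and the driver's view from `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`. The helper name missing_gpus is hypothetical, and treating only "3D controller" and "VGA compatible controller" entries as GPUs is an assumption.

```shell
#!/bin/sh
# Sketch: print PCI addresses of NVIDIA GPUs that lspci can see but that
# nvidia-smi does not list. missing_gpus is a hypothetical helper.
#   $1: text in the format of `lspci -D -d 10de:`
#   $2: text in the format of
#       `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`
missing_gpus() {
    # nvidia-smi prints an 8-digit PCI domain and may use upper-case hex;
    # lspci -D prints a 4-digit domain. Normalise to the lspci form.
    smi_ids=$(printf '%s\n' "$2" | tr 'A-F' 'a-f' \
        | sed -E 's/^0000([0-9a-f]{4}:)/\1/')
    printf '%s\n' "$1" \
        | awk '/3D controller|VGA compatible controller/ {print tolower($1)}' \
        | while read -r addr; do
            [ -n "$addr" ] || continue
            # Report the address only if nvidia-smi did not list it.
            printf '%s\n' "$smi_ids" | grep -qxF "$addr" \
                || printf '%s\n' "$addr"
        done
}

# Typical use inside the VM:
#   missing_gpus "$(lspci -D -d 10de:)" \
#       "$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader)"
```

On this VM the expected result would be 0000:00:06.0: present on the PCI bus and bound to the nvidia driver, but never initialised.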
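2. Analysis and reasoning
The dmesg errors all name 0000:00:06.0, which is exactly the card missing from nvidia-smi (only 00:07.0 is listed), while lspci still sees both cards bound to the nvidia driver.
Searching for the RmInitAdapter failed message turns up many reports that blame the driver. But this is a dual-card machine with two identical cards on the same driver, and the other card works fine, so that explanation can be ruled out.
Other reports say the physical card has failed. That is fairly plausible, since these cards have been in service for several years.
There is no better option, so the plan is: shut down the VM, reboot the physical host it runs on, bring the VM back up, and see whether anything changes, to find out whether the card is really dead.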
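Since this plan reboots the physical host, it is worth scanning the host's kernel log for existing hardware faults first, so the reboot does not turn one problem into two. The sketch below only filters log text for a few common error strings. The pattern list is an illustrative assumption, not an exhaustive catalogue, and hw_errors is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read kernel-log text on stdin and print lines that look like
# disk or other hardware trouble. The patterns are assumptions chosen for
# illustration; extend them for the hardware actually in use.
hw_errors() {
    grep -iE 'I/O error|medium error|hardware error|machine check|EXT4-fs error'
}

# Typical use on the physical host before rebooting (may require root):
#   dmesg | hw_errors
#   journalctl -k -p err | hw_errors
# No output suggests nothing obvious is wrong; any hit deserves a look
# before the host is taken down.
```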
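3. Hands-on
Step 1: Shut down the virtual machine cleanly.
Step 2: On the physical host, check the system logs and dmesg for bad-sector warnings and other hardware fault messages, so the host does not die on reboot.
The checks passed, so the physical host was rebooted.
Step 3: Once the physical host is back up, wait for the virtualization environment to recover on its own, then start the VM manually.
Step 4: After the VM has booted, check the state of the accelerator cards.
At this point the problem was gone: both cards showed up.
I asked my colleague to run a job that uses both cards, to see whether either card would drop off during the run.
After watching for a while, the job ran normally and neither card dropped. Problem solved.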
root@XXX:~# nvidia-smi
Wed Feb 19 16:43:36 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.156.00   Driver Version: 450.156.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:06.0 Off |                    0 |
| N/A   41C    P0    26W /  70W |   9224MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   42C    P0    27W /  70W |   9458MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      6896      C   .../XXX/bin/python               9221MiB |
|    1   N/A  N/A      7011      C   .../XXX/bin/python               9455MiB |
+-----------------------------------------------------------------------------+
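The "are both cards still there" check during the observation period can be automated instead of eyeballing the table. This is a minimal sketch, assuming the input is the output of `nvidia-smi -L`, which prints one line per GPU starting with "GPU N:". The names gpu_count and check_gpus are hypothetical.

```shell
#!/bin/sh
# Sketch: verify that the expected number of GPUs is visible.
# Input on stdin is text in the format of `nvidia-smi -L`.

# Count the lines that start with "GPU <index>:".
gpu_count() {
    awk '/^GPU [0-9]+:/ {n++} END {print n+0}'
}

# check_gpus EXPECTED: succeed only if exactly EXPECTED GPUs are listed.
check_gpus() {
    expected=$1
    actual=$(gpu_count)
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual GPU(s) visible"
    else
        echo "FAIL: expected $expected GPU(s), found $actual" >&2
        return 1
    fi
}

# Typical use inside the VM, for example from cron while the job runs:
#   nvidia-smi -L | check_gpus 2
```

A non-zero exit status makes it easy to hook into whatever alerting is already in place.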
Summary
When in doubt, reboot! Provided the machine actually comes back up afterwards.