root@node1:/data1/tmp# journalctl -xeu nvidia-fabricmanager.service
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
May 12 14:11:08 node1 nvidia-fabricmanager-start.sh[49006]: "/usr/bin/nv-fabricmanager" failed! Exit code: 1
May 12 14:11:07 node1 systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 171 and the job result is failed.
May 12 14:14:55 node1 systemd[1]: Starting NVIDIA fabric manager service...
░░ Subject: A start job for unit nvidia-fabricmanager.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has begun execution.
░░
░░ The job identifier is 1573.
May 12 14:14:55 node1 nvidia-fabricmanager-start.sh[12212]: Detected Pre-NVL5 system
May 12 14:14:55 node1 nvidia-fabricmanager-start.sh[12215]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING>
May 12 14:14:55 node1 nvidia-fabricmanager-start.sh[12212]: "/usr/bin/nv-fabricmanager" failed! Exit code: 1
May 12 14:14:55 node1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
May 12 14:14:55 node1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
May 12 14:14:55 node1 systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 1573 and the job result is failed.
Solution:
If running the following command produces output like this:
nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0-47,96-143     0               N/A
GPU1    NODE     X      0-47,96-143     0               N/A
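✅ Diagnosis
- The link between GPU0 and GPU1 is NODE, which means traffic between the two cards crosses PCIe and the interconnect between PCIe host bridges within a single NUMA node. NVLink/NVSwitch is not in use.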
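Reading the matrix by eye works for two GPUs but gets tedious on larger nodes. The sketch below lists the distinct GPU-to-GPU link types found in the matrix. The helper name gpu_link_types is hypothetical, and the parsing assumes the row layout shown above (one row per GPU, GPU columns first).

```shell
#!/usr/bin/env bash
# Sketch: print each distinct GPU-to-GPU link type found in the text of
# `nvidia-smi topo -m`, read from stdin. gpu_link_types is a hypothetical name.
# Link codes from the nvidia-smi legend: NV# = NVLink, PIX/PXB/PHB = PCIe paths,
# NODE = within one NUMA node, SYS = across NUMA nodes, X = the GPU itself.
gpu_link_types() {
    awk '
        # Data rows start with GPU<n> and have a link code in column 2;
        # the header row fails this test because its column 2 is a GPU/CPU label.
        $1 ~ /^GPU[0-9]+$/ && $2 ~ /^(X|NV[0-9]+|SYS|NODE|PHB|PXB|PIX)$/ {
            n++
            for (i = 2; i <= NF; i++) cell[n, i] = $i
        }
        END {
            # With n GPU rows, the GPU-to-GPU cells are columns 2..n+1.
            for (r = 1; r <= n; r++)
                for (c = 2; c <= n + 1; c++)
                    if (cell[r, c] != "X" && cell[r, c] != "") seen[cell[r, c]] = 1
            for (t in seen) print t
        }
    ' | sort
}
```

Typical use would be nvidia-smi topo -m | gpu_link_types. For the output shown above this prints NODE only; a result containing entries such as NV12 would point to NVLink instead.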
❗ Fabric Manager therefore does not apply to the current configuration
Item | Status |
---|---|
NVLink / NVSwitch topology | ❌ Not in use |
GPU model | A100-SXM4-80GB ✅ |
Fabric Manager required? | ❌ No (safe to skip) |
✅ Recommended actions
- Disable and stop the Fabric Manager service (so it does not keep failing)
sudo systemctl disable nvidia-fabricmanager
sudo systemctl stop nvidia-fabricmanager
- Remove the Fabric Manager installation logic from your install script, or make it conditional on the nvidia-smi topo -m output
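To confirm that the service really is off after the commands above, a small check along these lines may help. fabricmanager_state is a hypothetical helper name, and the unit name is taken from the log at the top of this post.

```shell
#!/usr/bin/env bash
# Sketch: report the enablement and runtime state of the Fabric Manager unit.
# fabricmanager_state is a hypothetical helper; neither query needs root.
fabricmanager_state() {
    local unit="nvidia-fabricmanager.service"
    local enabled active
    # Both commands exit non-zero for disabled/inactive units, so only the
    # printed state is used here, not the exit status.
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null)
    active=$(systemctl is-active "$unit" 2>/dev/null)
    echo "enabled=${enabled:-unknown} active=${active:-unknown}"
}
```

After the disable and stop commands, calling fabricmanager_state should report enabled=disabled. The active field may still read failed because of the earlier start attempt; sudo systemctl reset-failed nvidia-fabricmanager clears that marker.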
🧠 Bonus: how can the script decide automatically whether to skip Fabric Manager?
Adding this check is enough:
# Check whether the topology matrix reports any NVLink (NV#) connection
if nvidia-smi topo -m | grep -q "NV[0-9]"; then
    echo "✅ NVLink/NVSwitch topology detected, enabling Fabric Manager"
    INSTALL_FABRIC=true
else
    echo "ℹ️ No NVLink/NVSwitch topology detected, skipping Fabric Manager"
    INSTALL_FABRIC=false
fi
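Then, in your script:
# Compare the flag as a string instead of executing its value as a command
if [ "$INSTALL_FABRIC" = true ]; then
    sudo apt-get install -y nvidia-fabricmanager-570
    sudo systemctl enable nvidia-fabricmanager
    sudo systemctl start nvidia-fabricmanager
fi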
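The two fragments above can also be folded into reusable functions. This is only a sketch: all three helper names are hypothetical, and the package naming simply follows the nvidia-fabricmanager-570 pattern used above rather than hard-coding the branch. Note that NV# cells indicate NVLink in general, so the check is a proxy for NVSwitch. GPUs joined directly by NVLink bridges would also match, even though Fabric Manager is only needed on NVSwitch-based systems.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of any NVIDIA tool.

# Succeeds if the topology matrix on stdin contains an NV# link cell (NV1, NV12, ...).
# Whole-field match, so the "NV#" legend line and words like "NVLinks" are ignored.
topo_has_nvlink() {
    grep -Eq '(^|[[:space:]])NV[0-9]+([[:space:]]|$)'
}

# Builds the package name from a driver version string (major.minor.patch).
# Assumption: packages are named nvidia-fabricmanager-<branch>, as in the script above.
fabricmanager_package() {
    echo "nvidia-fabricmanager-${1%%.*}"
}

# Ties the pieces together; call this from the install script.
install_fabricmanager_if_needed() {
    command -v nvidia-smi >/dev/null 2>&1 || { echo "nvidia-smi not found" >&2; return 1; }
    if nvidia-smi topo -m | topo_has_nvlink; then
        local version
        # Fabric Manager has to match the installed driver, so derive the branch from it.
        version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
        sudo apt-get install -y "$(fabricmanager_package "$version")"
        sudo systemctl enable --now nvidia-fabricmanager
    else
        echo "No NVLink/NVSwitch topology detected, skipping Fabric Manager"
    fi
}
```

Nothing runs until install_fabricmanager_if_needed is called, so the file can be sourced safely on machines without a GPU.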