Fixing "nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'"

root@node1:/data1/tmp# journalctl -xeu nvidia-fabricmanager.service
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
May 12 14:11:08 node1 nvidia-fabricmanager-start.sh[49006]: "/usr/bin/nv-fabricmanager" failed! Exit code: 1
May 12 14:11:07 node1 systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░ 
░░ The job identifier is 171 and the job result is failed.
May 12 14:14:55 node1 systemd[1]: Starting NVIDIA fabric manager service...
░░ Subject: A start job for unit nvidia-fabricmanager.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit nvidia-fabricmanager.service has begun execution.
░░ 
░░ The job identifier is 1573.
May 12 14:14:55 node1 nvidia-fabricmanager-start.sh[12212]: Detected Pre-NVL5 system
May 12 14:14:55 node1 nvidia-fabricmanager-start.sh[12215]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING>
May 12 14:14:55 node1 nvidia-fabricmanager-start.sh[12212]: "/usr/bin/nv-fabricmanager" failed! Exit code: 1
May 12 14:14:55 node1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
May 12 14:14:55 node1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
May 12 14:14:55 node1 systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░ 
░░ The job identifier is 1573 and the job result is failed.

Solution:

If running the following command gives output like this:

nvidia-smi topo -m

        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0-47,96-143     0               N/A
GPU1    NODE     X      0-47,96-143     0               N/A

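✅ Diagnosis

  • The link between GPU0 and GPU1 is reported as NODE, which means:

    The two cards communicate over ordinary PCIe, crossing the interconnect between PCIe host bridges within a single NUMA node. NVLink/NVSwitch is not in use.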
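
On machines with more than two GPUs the matrix is harder to read by eye. As a minimal sketch, the helper below (`gpu_pair_links` is a hypothetical name, not part of any NVIDIA tool) lists the link type for each GPU pair. It assumes the plain-text layout shown above, with the GPU columns first in the header row, and reads the matrix from stdin.

```shell
# Hypothetical helper: read `nvidia-smi topo -m` output on stdin and print
# one line per GPU pair in the form "<gpuA> <gpuB> <link type>".
gpu_pair_links() {
    awk '
        # Header row: record the leading GPUn column names.
        !seen && $1 ~ /^GPU[0-9]+$/ {
            for (i = 1; i <= NF && $i ~ /^GPU[0-9]+$/; i++) col[i] = $i
            ngpu = i - 1
            seen = 1
            next
        }
        # Data rows: field 1 is the row label, fields 2..ngpu+1 are matrix cells.
        # Only the upper triangle is printed so each pair appears once.
        seen && $1 ~ /^GPU[0-9]+$/ {
            row++
            for (i = row + 1; i <= ngpu; i++) print $1, col[i], $(i + 1)
        }
    '
}
```

Usage: `nvidia-smi topo -m | gpu_pair_links`. For the output above this prints a single NODE pair; an entry such as NV12 would indicate an NVLink connection instead.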


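❗ Fabric Manager does not apply to this configuration

Item                          Status
NVLink / NVSwitch topology    Not enabled
GPU type                      A100-SXM4-80GB ✅
Fabric Manager needed?        No (safe to skip)

This matches the journal above: the query for NVSwitch device information fails, and nv-fabricmanager exits with code 1.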

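✅ Recommended actions

  1. Disable and stop the Fabric Manager service so that it stops failing repeatedly:

sudo systemctl disable nvidia-fabricmanager
sudo systemctl stop nvidia-fabricmanager

  2. Remove the Fabric Manager installation step from your install script, or make it conditional on the topology.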
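
To confirm that step 1 took effect, the unit state can be queried afterwards. This is a rough sketch: `check_fabricmanager_unit` is a hypothetical helper name, and it relies only on the standard `systemctl is-enabled` and `systemctl is-active` subcommands.

```shell
# Hypothetical helper: print the unit's enabled/active state and return 0
# only if the unit is neither enabled nor running.
check_fabricmanager_unit() {
    unit="nvidia-fabricmanager.service"
    # Both subcommands exit non-zero for "disabled"/"failed"/"inactive",
    # so `|| true` keeps the function usable under `set -e`.
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null || true)
    active=$(systemctl is-active "$unit" 2>/dev/null || true)
    echo "enabled=${enabled:-unknown} active=${active:-unknown}"
    [ "$enabled" != "enabled" ] && [ "$active" != "active" ] && [ "$active" != "activating" ]
}
```

A unit that failed earlier may still report `active=failed` after being stopped; `sudo systemctl reset-failed nvidia-fabricmanager` clears that state.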


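🧠 Bonus: how can a script decide automatically whether to skip Fabric Manager?

Add this check:

# Check whether the topology matrix reports any NVLink links (NV1, NV2, ...).
# This is a heuristic: NV# entries also appear on NVLink-bridge systems that have no NVSwitch.
if nvidia-smi topo -m | grep -q "NV[0-9]"; then
    echo "✅ NVLink/NVSwitch topology detected, enabling Fabric Manager"
    INSTALL_FABRIC=true
else
    echo "ℹ️ No NVLink/NVSwitch topology detected, skipping Fabric Manager"
    INSTALL_FABRIC=false
fi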

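Then, later in your script:

# Compare as a string instead of executing the variable's value as a command.
if [ "$INSTALL_FABRIC" = true ]; then
    sudo apt-get install -y nvidia-fabricmanager-570
    sudo systemctl enable nvidia-fabricmanager
    sudo systemctl start nvidia-fabricmanager
fi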
