Configuration:
8x RTX 3090
Error message:
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800133 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800729 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800516 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800464 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800796 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800860 milliseconds before timing out.
Symptoms:
During 8-GPU distributed training on the RTX 3090s, the job ran normally for some time, then suddenly aborted with the errors above. The messages indicate that an inter-GPU collective (an all-reduce) timed out, i.e. the ranks lost communication. Since 8-GPU distributed training had worked before, and GPU memory was not exhausted, the likely cause is a GPU failing after prolonged use due to poor cooling.
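If the hang is caused by a transient slowdown rather than a dead GPU, it helps to surface the error sooner and log which collective is stuck. A minimal sketch, assuming a torchrun-style launcher; `train.py` is a placeholder for the actual training script:

```shell
# Print per-rank NCCL activity so the stuck collective can be identified.
export NCCL_DEBUG=INFO

# Surface NCCL errors as exceptions instead of a 30-minute silent hang
# (older PyTorch name; recent releases use TORCH_NCCL_ASYNC_ERROR_HANDLING).
export NCCL_ASYNC_ERROR_HANDLING=1

# train.py is a placeholder for the real training entry point.
torchrun --nproc_per_node=8 train.py
```

The 1800000 ms in the log is the default `timeout` of `torch.distributed.init_process_group`; it can be raised by passing `timeout=datetime.timedelta(...)` there, though that only hides the symptom if a GPU has actually failed.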
When I then checked the GPUs with nvidia-smi, it reported:
Unable to determine the device handle for GPU0000:9E:00.0: Unknown Error
Solution: rebooting the server restores normal operation!
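Before rebooting, it may be worth confirming the overheating theory and checking whether the card dropped off the PCIe bus. A sketch of the usual checks (the grep patterns are assumptions about typical NVIDIA driver log messages; dmesg may require root):

```shell
# Show current, slowdown, and shutdown temperature thresholds per GPU.
nvidia-smi -q -d TEMPERATURE

# Look for NVIDIA driver Xid errors or "fell off the bus" messages,
# which usually accompany "Unable to determine the device handle".
dmesg | grep -iE 'xid|nvrm|fell off the bus'
```

If dmesg shows an Xid error or a "GPU has fallen off the bus" message, a reboot (or reseating/cooling the card) is typically the only recovery, which matches the behavior seen here.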