Configuration:
8x RTX 3090
Error message:
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800133 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800729 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800516 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800464 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800796 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=21722, OpType=ALLREDUCE, NumelIn=296899, NumelOut=296899,
Timeout(ms)=1800000) ran for 1800860 milliseconds before timing out.
Symptoms:
During 8-GPU distributed training on the RTX 3090s, the job ran normally for some time, then suddenly aborted with the errors above. The messages indicate that an inter-GPU collective (an all-reduce) timed out, i.e. the ranks lost communication. Since 8-GPU distributed training had worked before, and GPU memory was not exhausted, the likely cause is a GPU failing after prolonged use due to poor cooling.
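If the hang is caused by a transient slowdown rather than a dead GPU, it helps to surface the error sooner and log which collective is stuck. A minimal sketch, assuming a torchrun-style launcher; `train.py` is a placeholder for the actual training script:

```shell
# Print per-rank NCCL activity so the stuck collective can be identified.
export NCCL_DEBUG=INFO

# Surface NCCL errors as exceptions instead of a 30-minute silent hang
# (older PyTorch name; recent releases use TORCH_NCCL_ASYNC_ERROR_HANDLING).
export NCCL_ASYNC_ERROR_HANDLING=1

# train.py is a placeholder for the real training entry point.
torchrun --nproc_per_node=8 train.py
```

The 1800000 ms in the log is the default `timeout` of `torch.distributed.init_process_group`; it can be raised by passing `timeout=datetime.timedelta(...)` there, though that only hides the symptom if a GPU has actually failed.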
When I then checked the GPUs with nvidia-smi, it reported:
Unable to determine the device handle for GPU0000:9E:00.0: Unknown Error
Solution: rebooting the server restores normal operation!
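Before rebooting, it may be worth confirming the overheating theory and checking whether the card dropped off the PCIe bus. A sketch of the usual checks (the grep patterns are assumptions about typical NVIDIA driver log messages; dmesg may require root):

```shell
# Show current, slowdown, and shutdown temperature thresholds per GPU.
nvidia-smi -q -d TEMPERATURE

# Look for NVIDIA driver Xid errors or "fell off the bus" messages,
# which usually accompany "Unable to determine the device handle".
dmesg | grep -iE 'xid|nvrm|fell off the bus'
```

If dmesg shows an Xid error or a "GPU has fallen off the bus" message, a reboot (or reseating/cooling the card) is typically the only recovery, which matches the behavior seen here.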