Netizen-recommended method 1: export NCCL_IB_DISABLE=1
I ran into the following strange problem:
[E ProcessGroupNCCL.cpp:737] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1751615, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800265 milliseconds before timing out.
…
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
…
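This indicates that the system ran into a problem when using the standard P2P (peer-to-peer) communication path.
Temporary workaround: export NCCL_P2P_DISABLE=1
Original link: https://blog.csdn.net/shysea2019/article/details/135657740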
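The two export lines above can also be applied from inside the training script, as long as this happens before NCCL is initialized. The following is only a minimal sketch: the helper name apply_nccl_workarounds and its parameters are hypothetical and not part of the original post, while NCCL_P2P_DISABLE, NCCL_IB_DISABLE and NCCL_DEBUG are existing NCCL environment variables.

```python
import os


def apply_nccl_workarounds(disable_p2p=True, disable_ib=False, debug=False, env=os.environ):
    """Set NCCL environment variables (hypothetical helper, for illustration only).

    Must run before the process group is created, because NCCL reads
    these variables when it initializes.
    """
    if disable_p2p:
        # Disable the direct GPU-to-GPU (P2P) transport.
        env["NCCL_P2P_DISABLE"] = "1"
    if disable_ib:
        # Disable the InfiniBand/RoCE transport so NCCL falls back to IP sockets.
        env["NCCL_IB_DISABLE"] = "1"
    if debug:
        # Ask NCCL to log which transports and interfaces it actually picks.
        env["NCCL_DEBUG"] = "INFO"
    # Return the NCCL-related settings so they can be logged or inspected.
    return {key: value for key, value in env.items() if key.startswith("NCCL_")}


if __name__ == "__main__":
    # Call this first, then create the process group in the training script.
    print(apply_nccl_workarounds(debug=True))
```

Both switches trade speed for stability, so they are best treated as a way to narrow down which transport is failing rather than as a permanent setting.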
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1123, internal erro
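The problem was that the network interface used for distributed training was not specified correctly:
Check the available network interfaces with the ifconfig command, then set the correct one again.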
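The post does not show which setting was changed. A likely candidate, stated here as an assumption, is NCCL_SOCKET_IFNAME, the NCCL environment variable that selects the network interface. The sketch below lists the interfaces from Python instead of ifconfig and picks one; the helper pick_interface and its skip list (lo, docker, veth) are hypothetical heuristics, and the right choice is whichever interface actually connects your nodes.

```python
import os
import socket


def pick_interface(names, skip_prefixes=("lo", "docker", "veth")):
    """Return the first interface name that does not look virtual or loopback.

    Hypothetical heuristic for illustration; verify the result against ifconfig.
    """
    for name in names:
        if not name.startswith(skip_prefixes):
            return name
    return None


if __name__ == "__main__":
    # socket.if_nameindex() returns (index, name) pairs for the local interfaces.
    names = [name for _, name in socket.if_nameindex()]
    chosen = pick_interface(names)
    print("interfaces:", names, "chosen:", chosen)
    if chosen is not None:
        # Assumption: this is the variable the author re-set. It must be set
        # before the process group is created.
        os.environ["NCCL_SOCKET_IFNAME"] = chosen
```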
A colleague traced another cause: reading the images was too slow, which also triggered this error.
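One way to check whether slow image reading is a plausible cause is to time how long the training loop waits for each batch. This is a rough sketch only: time_batches is a hypothetical helper, and it accepts any iterable, so a PyTorch DataLoader could be passed in the same way as the toy generator used here.

```python
import time


def time_batches(loader, max_batches=20):
    """Return the seconds spent waiting for each of the first max_batches items."""
    waits = []
    iterator = iter(loader)
    for _ in range(max_batches):
        start = time.perf_counter()
        try:
            next(iterator)
        except StopIteration:
            # The loader ran out of data before max_batches was reached.
            break
        waits.append(time.perf_counter() - start)
    return waits


if __name__ == "__main__":
    def slow_reader():
        # Stand-in for a dataset whose image reads are slow.
        for index in range(3):
            time.sleep(0.01)
            yield index

    waits = time_batches(slow_reader())
    print("slowest wait (s):", max(waits))
```

If one rank waits far longer than the others, the remaining ranks sit idle in the collective operation until the watchdog timeout shown in the log above is reached.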