Multi-node, multi-GPU training code:
Error message:
RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1248, unhandled system error, NCCL version 2.12.10
First machine:
NNODES=2 NODE_RANK=0 PORT=8888 MASTER_ADDR=192.168.XX.XX sh tools/dist_train.sh ./configs/temp.py 4
Second machine:
NNODES=2 NODE_RANK=1 PORT=8888 MASTER_ADDR=192.168.XX.XX sh tools/dist_train.sh ./configs/temp.py 4
Solution:
export NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1; NCCL_DEBUG=INFO NNODES=2 NODE_RANK=0 PORT=8888 MASTER_ADDR=192.168.XX.XX sh tools/dist_train.sh ./configs/temp.py 4
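
The one-line fix above can be unpacked into a small launch script, sketched here with one comment per variable. NCCL_IB_DISABLE=1 turns off the InfiniBand/RoCE transport so NCCL falls back to TCP sockets, NCCL_P2P_DISABLE=1 turns off direct GPU peer-to-peer transport, and NCCL_DEBUG=INFO only adds logging. Running it on the second machine with NODE_RANK=1 is an assumption that mirrors the original launch commands, and the file-existence guard is added purely for illustration. Both disable flags will likely cost some communication speed.

```shell
#!/bin/sh
# Sketch of the workaround as a script; run it on every node.

export NCCL_IB_DISABLE=1    # skip the InfiniBand/RoCE transport; NCCL falls back to TCP sockets
export NCCL_P2P_DISABLE=1   # skip direct GPU peer-to-peer transport
export NCCL_DEBUG=INFO      # print NCCL initialization and transport details for diagnosis

# Assumption: NODE_RANK is 0 on the first machine and 1 on the second,
# as in the original launch commands. Defaults to 0 when not set.
NODE_RANK="${NODE_RANK:-0}"

# Illustrative guard: only launch when the training script is present.
if [ -f tools/dist_train.sh ]; then
    NNODES=2 NODE_RANK="$NODE_RANK" PORT=8888 MASTER_ADDR=192.168.XX.XX \
        sh tools/dist_train.sh ./configs/temp.py 4
else
    echo "tools/dist_train.sh not found; run this from the repository root"
fi
```

If training starts with these settings, re-enabling one variable at a time while reading the NCCL_DEBUG=INFO output may help narrow down which transport was failing.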