Error message:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
This NCCL error indicates that something went wrong during distributed training with DistributedDataParallel (DDP).
Environment:
cuda: 10.2, python: 3.7, torch: 1.8.1, GPU: 2080Ti
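To compare your own setup with the one listed above, a small script along these lines may help. It is only a sketch: describe_env is a hypothetical helper name, not something from PyTorch, and it reports torch as None if PyTorch is not installed.

```python
import platform


def describe_env():
    """Collect the version details that matter for this error (hypothetical helper)."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["cuda"] = torch.version.cuda
    # Number of GPUs visible to this process (0 if CUDA is unavailable).
    info["gpu_count"] = torch.cuda.device_count()
    return info


for key, value in describe_env().items():
    print(f"{key}: {value}")
```

Run it inside the same environment you use for training, so the versions it reports are the ones DDP actually sees.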
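Solution:
First activate your own environment in the terminal, then run the following command before launching training. It disables NCCL's GPU-to-GPU peer-to-peer (P2P) transport for the current shell session: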
export NCCL_P2P_DISABLE=1
With this set, the problem is solved and the program runs!
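If exporting the variable in every new shell is inconvenient, the same setting can probably be applied from inside the training script, as long as it happens before the NCCL process group is created. The sketch below is an assumption about how that might look, not part of the original fix: disable_nccl_p2p and init_ddp are hypothetical helper names, and init_ddp assumes the launcher has already set RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.

```python
import os


def disable_nccl_p2p(env=os.environ):
    """Set NCCL_P2P_DISABLE=1 in the given environment mapping and return it."""
    env["NCCL_P2P_DISABLE"] = "1"
    return env


def init_ddp(backend="nccl"):
    """Hypothetical helper: disable P2P, then create the default process group."""
    import torch.distributed as dist

    # Must run before init_process_group so NCCL sees the variable
    # when its communicators are created.
    disable_nccl_p2p()
    # Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are already set
    # by the launcher (for example torch.distributed.launch).
    dist.init_process_group(backend=backend, init_method="env://")
```

Call init_ddp() once at the top of each worker process, before wrapping the model in DistributedDataParallel. The variable only affects the processes that set it, so every rank needs to run this.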