work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648
Solution
pip list | grep nccl
Check how many NCCL packages are installed. In my case there were two:
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
Uninstall one of them. I removed the cu11 package:
pip uninstall nvidia-nccl-cu11
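
The same check can be scripted, which may help on machines without grep. The sketch below is only an illustration: the helper name find_nccl_packages is hypothetical, and it assumes the NCCL wheels are named with the prefix nvidia-nccl, as in the pip output above. It uses only the standard library's importlib.metadata.

```python
from importlib import metadata


def find_nccl_packages(names_and_versions=None):
    """Return sorted (name, version) pairs for installed NCCL wheels.

    Hypothetical helper. If no list is passed in, the packages of the
    current Python environment are scanned.
    """
    if names_and_versions is None:
        names_and_versions = [
            (dist.metadata["Name"], dist.version)
            for dist in metadata.distributions()
        ]
    # Assumption: NCCL wheels are named "nvidia-nccl-cuXX".
    return sorted(
        (name, version)
        for name, version in names_and_versions
        if name and name.lower().replace("_", "-").startswith("nvidia-nccl")
    )


if __name__ == "__main__":
    found = find_nccl_packages()
    for name, version in found:
        print(name, version)
    if len(found) > 1:
        print("More than one NCCL wheel is installed; they may conflict.")
```

If this prints more than one package, that matches the situation described above, and removing one of them with pip uninstall is the step that fixed it for me.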