Symptoms:
1. deepspeed --num_gpus 4 --num_nodes 2 --hostfile /data_shared/xxx/config/node_1_2 --master_port 29500 --master_addr XXXXXX /dataxxxxxx/train.py
It hangs at:
[INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
After half an hour it times out with:
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:30:00)
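The barrier message reports world_size=8 but worker_count=4, i.e. only one node's 4 ranks ever registered with the rendezvous store on the master; the other node's ranks most likely never reached master_addr:master_port (firewall, wrong address, or the master not listening). A minimal reachability probe to run from the second node; the function name and the example address/port are illustrative assumptions, not taken from the report:

```python
import socket

def master_reachable(addr: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to (addr, port) succeeds.

    A failure here from the worker node usually means a firewall rule,
    a wrong master_addr, or the master process not yet listening on
    master_port.
    """
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (substitute the real master IP and the --master_port value):
# print(master_reachable("10.0.0.1", 29500))
```

If this returns False from the worker node while the master job is running, fix connectivity before touching DeepSpeed settings; if it returns True, check instead that the hostfile lists both nodes with the expected slot counts.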
2. bin/python -m torch.distributed.run --nproc_per_node=4 --nnode=2 --node_rank=0 --master_addr=1XXXXX --master_port=9901 /data_shared/XXXX/train.py
It errors with:
/torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648
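A hedged reading of this error: in newer nccl.h releases, unset fields of ncclConfig_t are initialized to the sentinel NCCL_CONFIG_UNDEF_INT, which is INT_MIN. The -2147483648 in the log is exactly that sentinel, so a plausible cause (an assumption, not proven by the log alone) is a version mismatch: PyTorch built against newer NCCL headers passes the "unset" sentinel for the blocking attribute, while the runtime libnccl 2.14.3 validates that field and rejects it. A one-line sanity check of the value:

```python
# Assumption: NCCL_CONFIG_UNDEF_INT, the "field not set" sentinel in
# ncclConfig_t, is defined as INT_MIN in newer nccl.h.
INT_MIN = -(2 ** 31)  # 32-bit two's-complement minimum
print(INT_MIN)  # -2147483648, the value shown in the NCCL error above
```

To narrow it down, compare `python -c "import torch; print(torch.__version__, torch.cuda.nccl.version())"` on every node: mismatched torch builds across the two nodes, or a stray system libnccl shadowing the one torch was built with, would fit this failure mode.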