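When using torch.distributed, the log prints: torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=1, worker_count=2, timeout=0:30:00). After that the job hangs (note that worker_count already exceeds world_size). Cancelling the job and resubmitting it gives the same result, and worker_count can even keep increasing.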
The fix is to delete the original output_dir, or to point the job at a new one.
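
The fix can be scripted so that it runs before each resubmission. Below is a minimal sketch using only the Python standard library. The helper name prepare_output_dir and its reuse_name flag are hypothetical, made up for this illustration and not part of torch.distributed. It makes no assumption about which file in output_dir is responsible and simply clears everything, so back up any checkpoints you still need first.

```python
import shutil
import time
from pathlib import Path


def prepare_output_dir(output_dir, reuse_name=True):
    """Return a directory to use as output_dir for a new run.

    reuse_name=True  -> delete the old directory and recreate it empty.
    reuse_name=False -> leave the old directory alone and create a
                        sibling whose name has a timestamp suffix.
    (Hypothetical helper, for illustration only.)
    """
    path = Path(output_dir)
    if reuse_name:
        # Remove everything left over from the previous (hung) run.
        if path.exists():
            shutil.rmtree(path)
    else:
        # Keep the old directory and pick a fresh name instead.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        path = path.with_name(f"{path.name}-{stamp}")
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Call it once, before the workers are launched (for example from the submission script), and pass the returned path to the training job. Calling it from every rank would let one process delete a directory another is already writing to.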