While reproducing StyleGAN3, I hit the following error:
```
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ubuntu/lxd-workplace/landf/face/stylegan3-main/train.py", line 38, in subprocess_fn
    torch.distributed.init_process_group(backend='nccl', init_method=init_method, rank=rank, world_size=c.num_gpus)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 583, in init_process_group
    default_pg = _new_process_group_helper(
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 708, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
```
The failing code is this `nccl` call in train.py:
```python
# Init torch.distributed.
if c.num_gpus > 1:
    init_file = os.path.abspath(os.path.join(temp_dir, '.torch_distributed_init'))
    if os.name == 'nt':
        init_method = 'file:///' + init_file.replace('\\', '/')
        torch.distributed.init_process_group(backend='gloo', init_method=init_method, rank=rank, world_size=c.num_gpus)
    else:
        init_method = f'file://{init_file}'
        torch.distributed.init_process_group(backend='nccl', init_method=init_method, rank=rank, world_size=c.num_gpus)
```
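For context, here is a minimal runnable sketch of how this snippet gets driven. This is not the actual StyleGAN3 train.py; `subprocess_fn` here is a stripped-down stand-in. `torch.multiprocessing.spawn` launches one process per GPU, and every process joins the same group through the shared init file:

```python
import os
import tempfile

import torch

def subprocess_fn(rank, num_gpus, temp_dir):
    # Same init logic as above, minus StyleGAN3's config object `c`.
    init_file = os.path.abspath(os.path.join(temp_dir, '.torch_distributed_init'))
    if os.name == 'nt':
        # Windows builds of PyTorch ship without NCCL, so fall back to gloo.
        init_method = 'file:///' + init_file.replace('\\', '/')
        backend = 'gloo'
    else:
        init_method = f'file://{init_file}'
        backend = 'nccl'
    torch.distributed.init_process_group(backend=backend, init_method=init_method,
                                         rank=rank, world_size=num_gpus)
    print(f'rank {rank}/{num_gpus} joined the {backend} process group')
    torch.distributed.destroy_process_group()

if __name__ == '__main__':
    num_gpus = torch.cuda.device_count()
    assert num_gpus >= 1, 'needs at least one CUDA device'
    with tempfile.TemporaryDirectory() as temp_dir:
        # spawn() prepends the rank to args when calling subprocess_fn.
        torch.multiprocessing.spawn(fn=subprocess_fn, args=(num_gpus, temp_dir),
                                    nprocs=num_gpus)
```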
Everything Baidu turned up was about this error on Windows, saying: pass `backend='gloo'` to the `dist.init_process_group` call, i.e. use GLOO instead of NCCL on Windows. Great, except I'm on a Linux server.
The code itself is correct, so I started to suspect the PyTorch version.
In the console I launched `python`, then ran `import torch`, and then checked the installed build.

[screenshot of the console session]

And there it was: it really was the PyTorch version. Switching to version 1.8 made it work. Problem solved.
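The exact commands from my console session aren't preserved above, but a check along these lines, using torch's standard introspection functions, shows the problem directly:

```python
# Quick console check: does this PyTorch build include distributed + NCCL?
import torch

print(torch.__version__)                      # installed PyTorch version
print(torch.distributed.is_available())       # distributed package compiled in?
print(torch.distributed.is_nccl_available())  # NCCL backend built in?
```

If `is_nccl_available()` returns `False` on a Linux box, the installed wheel simply wasn't built with NCCL (typical of CPU-only builds), which is exactly what the `RuntimeError` above complains about.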