During PyTorch multi-GPU distributed training, the program raised the following error:
AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
0%| | 0/310 [00:01<?, ?it/s]
Traceback (most recent call last):
File "dist_train.py", line 234, in <module>
main()
File "dist_train.py", line 152, in main
train(net, optimizer)
File "dist_train.py", line 187, in train
map_out,num_pre = net(images,Grad_Imgs) # num_pre [B,1] map_out[B,1,H,W]
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/SR/SheetCounting_2/Test_demo/model/TS_Net.py", line 124, in forward
feat_map_r1 = self.r1_conv1(im_data)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 529, in forward
raise AttributeError('SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel')
AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'dist_train.py', '--local_rank=1']' returned non-zero exit status 1.
The key line is: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
Taken literally, this says SyncBatchNorm is only supported when the model is wrapped in torch.nn.parallel.DistributedDataParallel. Yet my program really does use it. The DDP in my code, i.e.
from torch.nn.parallel import DistributedDataParallel as DDP
is exactly the torch.nn.parallel.DistributedDataParallel that the error message refers to.
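For reference, the usual order of operations is to convert BatchNorm layers to SyncBatchNorm first and only then wrap the model in DDP. The sketch below is a minimal illustration of that pattern; the toy model and the commented-out wrapping call are assumptions, not code from dist_train.py:

```python
# Minimal sketch: SyncBatchNorm must be converted BEFORE wrapping in DDP.
# The toy model below is illustrative only.
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# 1) Replace every BatchNorm layer in the model with SyncBatchNorm.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# 2) Only then wrap the converted model in DistributedDataParallel.
#    This requires torch.distributed.init_process_group() to have been
#    called first, so it is left commented out here:
# model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])
```

Calling the model's forward pass without the DDP wrapper (or before the process group is initialized) is what normally triggers this AttributeError.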
Solution:
It turned out to be a CUDA/PyTorch version problem. Since this feature was only introduced in recent releases, upgrading to a newer CUDA (and matching PyTorch) version fixed the error. If you are using Docker, try switching to a newer image.
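Before swapping images, it is worth printing the installed PyTorch and CUDA build versions to confirm the mismatch; a quick check:

```python
# Print the PyTorch version and the CUDA version this build was compiled
# against - the first things to verify when hitting version-related errors.
import torch

print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)       # None for CPU-only builds
print("CUDA available:", torch.cuda.is_available())
```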