Fix: AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel

During PyTorch multi-GPU distributed training, the program crashed with the following error:

AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
  0%|                                                                                                                                                                                                                                                                  | 0/310 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "dist_train.py", line 234, in <module>
    main()
  File "dist_train.py", line 152, in main
    train(net, optimizer)
  File "dist_train.py", line 187, in train
    map_out,num_pre = net(images,Grad_Imgs)  # num_pre [B,1] map_out[B,1,H,W]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/SR/SheetCounting_2/Test_demo/model/TS_Net.py", line 124, in forward
    feat_map_r1 = self.r1_conv1(im_data)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 529, in forward
    raise AttributeError('SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel')
AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'dist_train.py', '--local_rank=1']' returned non-zero exit status 1.
 

The key line is: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel

Taken literally, it says SyncBatchNorm is only supported when the model is wrapped in torch.nn.parallel.DistributedDataParallel. However, my program does use exactly that.

The DDP in my code is

from torch.nn.parallel import DistributedDataParallel as DDP

which is precisely the torch.nn.parallel.DistributedDataParallel that the error message refers to.
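For reference, the usual way to combine the two is a sketch like the following (`net` and `local_rank` are placeholder names, and `torch.distributed.init_process_group` is assumed to have been called already):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def to_sync_bn(net):
    # Recursively replace every nn.BatchNorm* layer with nn.SyncBatchNorm.
    return torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)

# In the training script, after torch.distributed.init_process_group():
#   net = to_sync_bn(net).cuda(local_rank)
#   net = DDP(net, device_ids=[local_rank])  # SyncBatchNorm only runs inside DDP
```

The forward pass of a SyncBatchNorm layer raises exactly this AttributeError when it executes in training mode outside a DDP wrapper, which is why the DDP wrap must come after the conversion.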


Solution:

It finally turned out to be a CUDA/PyTorch version problem: this API was only added in relatively recent releases, and upgrading to a newer CUDA environment resolved the error. If you use Docker, try switching to a newer Docker image.
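To check whether your environment is old enough to be the culprit, a quick diagnostic sketch that only prints the installed versions (no assumptions beyond having torch importable):

```python
import torch

# PyTorch build version and the CUDA toolkit it was compiled against.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)  # None for CPU-only builds
print("CUDA available:", torch.cuda.is_available())
# SyncBatchNorm and convert_sync_batchnorm landed around torch 1.1;
# very old wheels or a mismatched CUDA driver are common causes of this error.
print("has convert_sync_batchnorm:",
      hasattr(torch.nn.SyncBatchNorm, "convert_sync_batchnorm"))
```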
