Error message:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Here are two possible causes:
- Some output of the model's forward function does not participate in computing the loss. For example, when producing images, forward returns both the target image and one extra variable (say an observation variable: you only want to watch how it changes, not have it take part in backpropagation), but only the target image is used to compute the loss and run backward. This raises the error.
- Some operation registered on the model, e.g. a conv layer, is never called in forward, so its parameters receive no gradient (see the sketch after this list).
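
A minimal sketch of the second cause; the module and layer names here are hypothetical:

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 8, 3, padding=1)  # registered but never called

    def forward(self, x):
        # self.conv2 is skipped, so its parameters get no gradient and
        # DDP's reducer never hears from them, triggering the RuntimeError.
        return self.conv1(x)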
Solutions:
If the observation variable really does not affect the result (y_tgt in the example below), you can pass find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel and the error goes away, as shown next. If it does affect the result, your forward function is written incorrectly, and you should track down the mistake instead.
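
A sketch of that workaround, reusing the config.args.local_rank setup from the example below:

model = nn.parallel.DistributedDataParallel(model,
                                            device_ids=[config.args.local_rank],
                                            output_device=config.args.local_rank,
                                            find_unused_parameters=True)  # tolerate params without gradients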
Do not return any variable from forward that does not take part in computing the loss!
For example:
import torch.nn as nn
import torch.nn.functional as F

model = nn.parallel.DistributedDataParallel(model,
                                            device_ids=[config.args.local_rank],
                                            output_device=config.args.local_rank,
                                            broadcast_buffers=True)
y_pred, y_tgt = model(x)                # y_tgt is returned but never used below
loss = F.cross_entropy(y_pred, labels)  # labels assumed defined; only y_pred feeds the loss
Here y_tgt is exactly such a variable that never participates in the loss computation, so do not return it! Otherwise even find_unused_parameters=True cannot save you.
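
A sketch of the fix, with hypothetical module internals (MyModel and the softmax-derived observation are illustrative): keep the observation tensor off the return value, for example by stashing it on the module, detached so it stays out of the autograd graph:

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 10)

    def forward(self, x):
        y_pred = self.net(x)
        # Stash the observation tensor instead of returning it; detach()
        # keeps it out of the autograd graph, and every parameter still
        # contributes to y_pred, so DDP stays happy.
        self.y_tgt = y_pred.softmax(dim=-1).detach()
        return y_pred  # only the loss-relevant output is returned

The training loop then stays loss = F.cross_entropy(model(x), labels), and the observation tensor can be read afterwards via model.module.y_tgt.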
Related GitHub issue: https://github.com/pytorch/pytorch/issues/22436