Error message:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Solution:
Do not return any variable from `forward` that does not participate in the loss computation. For example:

model = nn.parallel.DistributedDataParallel(model, device_ids=[config.args.local_rank], output_device=config.args.local_rank, broadcast_buffers=True)
y_pred, y_tgt = model(x)
loss = cross_entropy_loss(y_pred)

Here `y_tgt` is a variable that never participates in the loss computation, so it must not be returned from `forward` — otherwise DDP waits for gradients on the parameters that produced it, gradients that never arrive, and even `find_unused_parameters=True` cannot save you.
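The pattern above can be sketched with a minimal pair of modules (the module structure and names here are hypothetical, for illustration only; the fix runs fine without any distributed setup): the bad version returns an auxiliary output whose producing parameters never feed the loss, while the good version keeps that output internal, detaching it so DDP never tries to reduce gradients for it.

```python
import torch
import torch.nn as nn

class BadModel(nn.Module):
    """forward returns an extra tensor (like y_tgt above) produced by
    parameters that never contribute to the loss — under
    DistributedDataParallel this triggers the reduction error."""
    def __init__(self):
        super().__init__()
        self.main_head = nn.Linear(8, 4)
        self.aux_head = nn.Linear(8, 4)   # its output never reaches the loss

    def forward(self, x):
        # returning self.aux_head(x) is the mistake
        return self.main_head(x), self.aux_head(x)

class GoodModel(nn.Module):
    """forward returns only tensors that participate in the loss; the
    auxiliary output is stored on the module and detached, so DDP does
    not expect gradients for the parameters that produced it."""
    def __init__(self):
        super().__init__()
        self.main_head = nn.Linear(8, 4)
        self.aux_head = nn.Linear(8, 4)

    def forward(self, x):
        y_pred = self.main_head(x)
        # detach() cuts the autograd link; read it after forward if needed
        self.last_aux = self.aux_head(x).detach()
        return y_pred

model = GoodModel()
x = torch.randn(2, 8)
y_pred = model(x)           # single output, safe to feed the loss
```

If you genuinely need the auxiliary tensor at the call site, detaching it before returning works too; the key point is that every tensor DDP sees in the return value must trace back to the loss.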