Error message:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Here are two possible causes:
- Some output of the model's forward function does not participate in computing the loss. For example, when producing images, forward returns both the target image and one extra variable (say an observation variable: you only want to watch how it changes, not have it take part in backpropagation), but only the target image is used to compute the loss and run backward. This raises the error.
- Some operation registered on the model, e.g. a conv layer, is never called in forward, so its parameters receive no gradient (see the sketch after this list).
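
A minimal sketch of the second cause; the module and layer names here are hypothetical:

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 8, 3, padding=1)  # registered but never called

    def forward(self, x):
        # self.conv2 is skipped, so its parameters get no gradient and
        # DDP's reducer never hears from them, triggering the RuntimeError.
        return self.conv1(x)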
Solutions:
If the observation variable really does not affect the result (y_tgt in the example below), you can pass find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel and the error goes away, as shown next. If it does affect the result, your forward function is written incorrectly, and you should track down the mistake instead.
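
A sketch of that workaround, reusing the config.args.local_rank setup from the example below:

model = nn.parallel.DistributedDataParallel(model,
                                            device_ids=[config.args.local_rank],
                                            output_device=config.args.local_rank,
                                            find_unused_parameters=True)  # tolerate params without gradients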
Do not return any variable from forward that does not take part in computing the loss!
For example:
import torch.nn as nn
import torch.nn.functional as F

model = nn.parallel.DistributedDataParallel(model,
                                            device_ids=[config.args.local_rank],
                                            output_device=config.args.local_rank,
                                            broadcast_buffers=True)
y_pred, y_tgt = model(x)                # y_tgt is returned but never used below
loss = F.cross_entropy(y_pred, labels)  # labels assumed defined; only y_pred feeds the loss
Here y_tgt is exactly such a variable that never participates in the loss computation, so do not return it! Otherwise even find_unused_parameters=True cannot save you.
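
A sketch of the fix, with hypothetical module internals (MyModel and the softmax-derived observation are illustrative): keep the observation tensor off the return value, for example by stashing it on the module, detached so it stays out of the autograd graph:

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 10)

    def forward(self, x):
        y_pred = self.net(x)
        # Stash the observation tensor instead of returning it; detach()
        # keeps it out of the autograd graph, and every parameter still
        # contributes to y_pred, so DDP stays happy.
        self.y_tgt = y_pred.softmax(dim=-1).detach()
        return y_pred  # only the loss-relevant output is returned

The training loop then stays loss = F.cross_entropy(model(x), labels), and the observation tensor can be read afterwards via model.module.y_tgt.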
Related GitHub issue: https://github.com/pytorch/pytorch/issues/22436