The code runs fine on a single GPU, but with multiple GPUs the following error appears:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
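The quickest workaround is option (1) from the error message: pass find_unused_parameters=True when wrapping the model. A minimal sketch of what that looks like, using a single-process CPU gloo group purely so the snippet is self-contained (in real training the process group is set up by your launcher, e.g. torchrun), and a deliberately unused submodule to mimic the failure:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group just to make DDP constructible here.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class TwoBranch(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.unused = torch.nn.Linear(4, 4)  # defined but never called in forward

    def forward(self, x):
        return self.used(x)

# With find_unused_parameters=True, DDP traverses the autograd graph each
# iteration and marks self.unused as ready, so reduction can finish.
model = DDP(TwoBranch(), find_unused_parameters=True)
for _ in range(2):  # a second iteration is what normally triggers the error
    out = model(torch.randn(2, 4))
    out.sum().backward()
dist.destroy_process_group()
```

This flag adds per-iteration graph traversal overhead, so treating it as a band-aid and actually fixing the unused parameters, as described below, is usually better.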
I think the more appropriate fix is to find which part of the code has parameters that receive no backpropagated gradient, and correct that. You can inspect every parameter's gradient after the backward pass and before the parameter update. As a simple example, a typical training loop has three steps: compute the loss, backpropagate, update the parameters:
l = loss(predict, label)
l.backward()
optimizer.step()
After backward() and before optimizer.step(), every parameter of the network should have a gradient. So insert the following between backward() and step():
for name, param in model.named_parameters():
    if param.grad is None:
        print(name)
to see which parameters received no gradient, then analyze where the problem lies.
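The steps above can be put together in a self-contained sketch. The toy model and the layer name dead are made up for illustration; dead is registered but never used in forward, so its parameters keep grad of None after backward():

```python
import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Linear(4, 1)
        self.dead = torch.nn.Linear(4, 4)  # registered but never used in forward

    def forward(self, x):
        return self.head(x)

model = Toy()
loss = model(torch.randn(8, 4)).mean()
loss.backward()

# Between backward() and optimizer.step(): unused parameters still have grad None.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)  # → ['dead.weight', 'dead.bias']
```

Under DDP, exactly these parameters are the ones that make the reducer wait forever and raise the RuntimeError above.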
In practice, complex codebases may not expose these three steps explicitly, since everything is wrapped up, but the way to debug this problem is always the same: find which parameters receive no gradient. For example, suppose you use mmdetection with Faster R-CNN for object detection and hit a similar error. Locate where the losses are collected; here that is in the forward_train method of TwoStageDetector in mmdet/models/detectors/two_stage.py:
roi_losses = self.roi_head.forward_train(x, img_metas, proposal_list,
                                         gt_bboxes, gt_labels,
                                         gt_bboxes_ignore, gt_masks,
                                         **kwargs)
losses.update(roi_losses)
Then add the following right after losses.update(roi_losses):
for name, p in self.roi_head.named_parameters():
    # print(name)
    if p.grad is None:
        print(name)
to pinpoint which part is causing the problem.
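Since the same check works in any codebase, it can be packaged as a small reusable helper. The function name report_unused_params is my own invention, not part of mmdetection or PyTorch; note that if you call it inside a framework's forward path, the gradients you see belong to the previous iteration's backward pass:

```python
import torch

def report_unused_params(module):
    """Print and return the names of parameters whose .grad is still None.

    Call after a backward pass has run; parameters listed here are the
    ones DDP would complain about as 'not used in producing loss'."""
    unused = [name for name, p in module.named_parameters() if p.grad is None]
    for name in unused:
        print(name)
    return unused

# Quick self-check on a toy module with a deliberately unused layer.
net = torch.nn.ModuleDict({
    "used": torch.nn.Linear(3, 1),
    "dead": torch.nn.Linear(3, 3),
})
net["used"](torch.randn(2, 3)).sum().backward()
names = report_unused_params(net)  # prints dead.weight and dead.bias
```

Once the offending parameters are identified, either route them into the loss, remove them from the model, or fall back to find_unused_parameters=True if they are intentionally conditional.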