Error message
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 109 110
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Cause of the error
Some parameters are initialized but never used in the model's forward pass, so they receive no gradients and cannot be updated during backpropagation. DDP's reducer expects every registered parameter to receive a gradient each iteration, so the unused ones make it hang and raise this error.
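A minimal sketch of the cause (the module and layer names here are made up for illustration): a layer that is created in `__init__` but never called in `forward` ends up with `grad is None` after `backward()`, which is exactly the condition DDP complains about.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # initialized, but never called in forward

    def forward(self, x):
        return self.used(x)

model = Net()
loss = model(torch.randn(2, 4)).sum()
loss.backward()
print(model.used.weight.grad is None)    # False: participated in the loss
print(model.unused.weight.grad is None)  # True: no gradient -> DDP reducer errors
```

Under plain single-process training this is harmless; only under DistributedDataParallel does the missing gradient turn into the RuntimeError above.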
Solutions
Method 1: simply ignore the problem
Add the following to your configuration file:
find_unused_parameters = True
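The config flag above is forwarded to the DDP constructor. If you build DDP yourself rather than through a framework config, the equivalent is passing `find_unused_parameters=True` directly. A runnable sketch (it spins up a single-process gloo group just so the example is self-contained; `Net` and its layer names are the same toy module as above, not from the original post):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process process group so the example runs standalone
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never used in forward

    def forward(self, x):
        return self.used(x)

# find_unused_parameters=True makes the reducer traverse the autograd graph
# each iteration and mark parameters outside it as ready, instead of erroring
model = DDP(Net(), find_unused_parameters=True)
for _ in range(2):  # a second iteration would raise without the flag
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()
dist.destroy_process_group()
```

Note this flag has a cost: the extra graph traversal every iteration slows training slightly, which is why finding and removing the unused parameters (Method 2) is preferable when possible.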
Method 2: find the parameters that never go through forward, and fix them directly
- If certain network components are genuinely unneeded, we can simply remove them. But first we need to debug which parameters and structures are unused, and then remove exactly those:
Finding the unused parameters
- Simply prefix your usual distributed training command with TORCH_DISTRIBUTED_DEBUG=DETAIL:
TORCH_DISTRIBUTED_DEBUG=DETAIL bash tools/dist_train.sh config/xxx.py 1
After running, the error output lists the exact parameters that received no gradient.
Then simply comment out those network components.
- Of course, there is one case where this doesn't work: for example, when only one level of the backbone's feature maps is used, but the backbone outputs multiple levels by default. This also triggers the error, yet we cannot simply delete part of the network, so Method 1 is the only option.
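The backbone case above can be sketched as follows (a toy two-stage backbone invented for illustration, not any specific library's model): when the loss only consumes the first feature level, the parameters of the later stage never enter the autograd graph, even though they were used to *compute* an output that forward returned.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Toy backbone returning feature maps from two stages."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv2d(3, 8, 3, padding=1)
        self.stage2 = nn.Conv2d(8, 16, 3, padding=1)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        return f1, f2  # multi-level output, as in typical detection backbones

model = Backbone()
f1, f2 = model(torch.randn(1, 3, 8, 8))
loss = f1.sum()  # only the first level feeds the loss
loss.backward()
print(model.stage1.weight.grad is None)  # False
print(model.stage2.weight.grad is None)  # True -> would trigger the DDP error
```

Since `stage2` is an integral part of the backbone definition, removing it would mean rewriting the backbone; setting `find_unused_parameters = True` is the pragmatic fix here.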