MMDetection multi-GPU training error: "your module has parameters that were not used in producing loss"


小鑫爱学习


Article link: https://blog.csdn.net/weixin_43380510/article/details/127428758


When launching multi-GPU training in mmdetection with the command:

./tools/dist_train.sh config_file num_gpus

the following error is raised:

RuntimeError: Expected to have finished reduction in the prior
iteration before starting a new one. This error indicates that your
module has parameters that were not used in producing loss. You can
enable unused parameter detection by (1) passing the keyword argument
find_unused_parameters=True to
torch.nn.parallel.DistributedDataParallel; (2) making sure all forward
function outputs participate in calculating loss. If you already have
done the above two steps, then the distributed data parallel module
wasn't able to locate the output tensors in the return value of your
module's forward function. Please include the loss function and the
structure of the return value of forward of your module when reporting
this issue (e.g. list, dict, iterable)

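One cause of this error is that the network contains parameters that do not take part in computing the loss. One fix, as the error message suggests, is to set find_unused_parameters to True in the distributed training code; the relevant place is mmdet/apis/train.py, where the model is wrapped for distributed training.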
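In MMDetection 2.x releases, mmdet/apis/train.py typically reads this flag from the config with cfg.get('find_unused_parameters', False), so it may be enough to add one line to the config file instead of editing the library. This is an assumption about the installed version; check your own copy of train.py before relying on it. A minimal sketch of the config line:

```python
# Added at the top level of the training config file (config_file).
# Assumption: the installed mmdet/apis/train.py reads this key via
# cfg.get('find_unused_parameters', False) and forwards it to the
# distributed wrapper around the model.
find_unused_parameters = True
```

Note that with this flag enabled, DistributedDataParallel traverses the autograd graph on every iteration to find the unused parameters, which adds some overhead to each training step.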

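However, extra parameters that are never used still increase the model size, so another fix is to remove them from the network definition. To find them: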

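Locate the installed mmcv package and open optimizer.py (under mmcv/runner/hooks).
In that file, find the after_train_iter function.
In that function, on the line right after runner.outputs['loss'].backward(), add the following code: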

    # Print every trainable parameter that received no gradient
    for name, param in runner.model.named_parameters():
        if param.requires_grad and param.grad is None:
            print(name)

Then run single-GPU training:

python tools/train.py config_file

The names of the parameters that did not take part in the loss computation are now printed.

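Finally, comment out those parameters in the model code (and remember to remove the debug print from optimizer.py afterwards).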
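The same check can be tried outside mmcv with plain PyTorch. The sketch below is only an illustration: ToyNet and list_params_without_grad are hypothetical names made up for this example, not part of mmdetection or mmcv. The model defines a layer that forward never calls, so its parameters have no gradient after backward.

```python
import torch
from torch import nn


class ToyNet(nn.Module):
    """Hypothetical model: `unused` is defined but never called in forward."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # never contributes to the loss

    def forward(self, x):
        return self.used(x)


def list_params_without_grad(model, loss):
    """Backpropagate `loss`, then return names of trainable params whose grad is None.

    Assumes gradients have not been populated before this call
    (for example, on a freshly built model).
    """
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


model = ToyNet()
loss = model(torch.randn(3, 4)).sum()
unused_names = list_params_without_grad(model, loss)
print(unused_names)  # ['unused.weight', 'unused.bias']
```

Removing or commenting out self.unused in ToyNet makes the list empty, which is the state the model should be in before going back to multi-GPU training.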
