Summary of Fixes for mmdetection3d Errors

非晚非晚

Category: mmdetection. Tags: python, pytorch, mmdetection3d, network training, mmdetection

Article link: https://blog.csdn.net/QLeelq/article/details/130404416


1. Problem: torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Problem description

After I added a new network structure to mmdetection3d, training on multiple GPUs failed with the following error.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 333) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

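Running the same code on a single GPU, however, produced no error. This showed that the new network structure itself runs correctly and that the failure is specific to distributed training, most likely a missing setting. Note that ChildFailedError only reports that a worker process exited with a nonzero code; it does not say why. After a long debugging session, I found that find_unused_parameters=True had not been set.

When find_unused_parameters=True is set, DistributedDataParallel traverses the autograd graph from the outputs of the forward pass in each process and marks the parameters that will not receive a gradient as ready for reduction. Gradient averaging across processes can then proceed without waiting for those parameters.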
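To see which parameters are affected, one option is to run a single forward and backward pass in one process and list the parameters whose gradient is still None. The following is a minimal sketch of that check, not code from mmdetection3d: ToyNet, its layer names, and the helper find_params_without_grad are all hypothetical and exist only for illustration.

```python
import torch
from torch import nn


class ToyNet(nn.Module):
    """Hypothetical model: `unused_branch` is registered but never called."""

    def __init__(self):
        super().__init__()
        self.used_branch = nn.Linear(4, 2)
        self.unused_branch = nn.Linear(4, 2)  # never used in forward()

    def forward(self, x):
        return self.used_branch(x)


def find_params_without_grad(model, loss):
    """Return names of trainable parameters that received no gradient from `loss`."""
    model.zero_grad(set_to_none=True)  # clear any stale gradients first
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


model = ToyNet()
loss = model(torch.randn(8, 4)).sum()
unused = find_params_without_grad(model, loss)
print(unused)  # ['unused_branch.weight', 'unused_branch.bias']
```

If this list is non-empty for your own model, those are the parameters that DistributedDataParallel would otherwise keep waiting for. Removing the unused modules is an alternative to enabling find_unused_parameters.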

Solution 1

Modify the source code directly, as follows.

# find_unused_parameters = cfg.get('find_unused_parameters', False)  # original code
find_unused_parameters = True  # modified code

This approach requires changing the source code, which is inconvenient. A better approach is to add a config option, as in Solution 2.

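Solution 2

mmdetection3d has a well-designed config system, so the option can be added directly to the config file. I was training CenterPoint, so I added the following line at the end of the config file centerpoint_02pillar_second_secfpn_4x8_cyclic_20e_nus.py.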

find_unused_parameters = True  # Whether to find unused parameters
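The reason a single config line is enough can be illustrated with the cfg.get('find_unused_parameters', False) lookup quoted earlier: the flag defaults to False only when the key is absent. The sketch below uses plain dicts as stand-ins for the loaded config object, which is an assumption made for illustration; the work_dir values and the helper name resolve_find_unused_parameters are hypothetical.

```python
# Plain dicts stand in for the loaded config object (assumption for illustration).
cfg_without_flag = {"work_dir": "./work_dirs/demo"}
cfg_with_flag = {"work_dir": "./work_dirs/demo", "find_unused_parameters": True}


def resolve_find_unused_parameters(cfg):
    """Mirror the quoted lookup: fall back to False when the key is absent."""
    return cfg.get("find_unused_parameters", False)


print(resolve_find_unused_parameters(cfg_without_flag))  # False
print(resolve_find_unused_parameters(cfg_with_flag))     # True
```

With the key present in the config, the training code picks up True without any change to the source.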
