[Debug] PyTorch distributed training error

cc嘿嘿嘿

Tags: python, deep learning, programming language

Article link: https://blog.csdn.net/ccheiheihei/article/details/126579349

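This post records a problem encountered during distributed training with PyTorch and its fix: the launcher raised a CalledProcessError after a worker process died with SIGABRT, and setting find_unused_parameters=True resolved it.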

The launcher (torch.distributed.launch) exited with the following traceback:

Traceback (most recent call last):
  File "/opt/conda/envs/py39_torch12/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/py39_torch12/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/py39_torch12/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/envs/py39_torch12/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/envs/py39_torch12/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/py39_torch12/bin/python', '-u', 'moby_main.py', '--local_rank=0', '--cfg', 'configs/moby_swin_tiny.yaml', '--data-path', '/cheung/docker/project/int8/qat/classification/data/loadingrate', '--batch-size', '8']' died with <Signals.SIGABRT: 6>.
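
Note that this traceback comes from the launcher, not from the training script: it only reports that the worker was killed by SIGABRT, so the actual cause has to be read from the worker's own output. As a minimal sketch of how that message arises (the helper name `describe_exit` and the shortened command are illustrative assumptions, not part of the original post), on POSIX a negative return code -N means the child was terminated by signal N:

```python
import signal
import subprocess


def describe_exit(returncode: int) -> str:
    """Describe a child's return code (hypothetical helper for illustration)."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Rebuild the same kind of exception the launcher raised.
# The command list is shortened here; it is not the full original command.
err = subprocess.CalledProcessError(
    returncode=-signal.SIGABRT,
    cmd=["python", "-u", "moby_main.py"],
)

print(describe_exit(err.returncode))
print(err)  # the message ends with "died with <Signals.SIGABRT: ...>."
```

Because the launcher's exception carries no detail beyond the signal, scrolling up to the worker's stderr is usually the quickest way to find the underlying error.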


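Fix: pass find_unused_parameters=True when wrapping the model in torch.nn.parallel.DistributedDataParallel. The flag is meant for models in which some parameters receive no gradient in a given forward pass; without it, DDP expects a gradient for every parameter during reduction.

    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[config.LOCAL_RANK], broadcast_buffers=False, find_unused_parameters=True)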
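
To check whether find_unused_parameters=True is really what a model needs, one option is to run a single forward and backward pass without DDP and list the parameters that received no gradient. The following is only a single-process sketch under stated assumptions: `TwoBranch` and `params_without_grad` are hypothetical names for illustration, and the toy model stands in for the real one from the post.

```python
import torch
import torch.nn as nn


class TwoBranch(nn.Module):
    """Hypothetical toy model: `unused` never takes part in forward()."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x)


def params_without_grad(model, loss):
    """Return names of trainable parameters that got no gradient from `loss`."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


model = TwoBranch()
loss = model(torch.randn(8, 4)).sum()
print(params_without_grad(model, loss))  # ['unused.weight', 'unused.bias']
```

If the list is empty for the real model, unused parameters are probably not the cause and the worker's own error output deserves a closer look. If it is not empty, the flag applies, at the cost of an extra traversal of the autograd graph in every iteration.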