torch.distributed.elastic.multiprocessing.errors.ChildFailedError


hhh2080


Article link: https://blog.csdn.net/yyz2080/article/details/135649923


Problem

Traceback (most recent call last):
  File "/ssd1/miniconda3/envs/pytorch2.1.2/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 65322)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 65323)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 65324)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 65325)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 65326)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 65321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
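
Every rank in the log reports `error_file: <N/A>`, so the Python exception that actually killed the workers is hidden. The PyTorch page linked in the log describes the `record` decorator for this. Below is a minimal sketch of how it might be applied to the entry point of finetune.py. The function name `main` and its body are hypothetical placeholders, and the no-op fallback is an assumption added only so the snippet runs on a machine without PyTorch.

```python
try:
    # Real PyTorch API, documented on the error page linked in the log.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption: no-op fallback so this sketch still runs without PyTorch.
    def record(fn):
        return fn


@record
def main() -> int:
    # Hypothetical placeholder for the training logic in finetune.py.
    # An exception raised here is written to the error file that torchrun
    # sets up, so the failure summary can show it instead of "<N/A>".
    return 0


if __name__ == "__main__":
    main()
```

With the decorator in place, rerunning the same torchrun command should report the first failing rank's own traceback, which makes it easier to confirm whether the cause is a GPU count mismatch or something else.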

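Solution

Edit finetune_qlora_ds.sh and set GPUS_PER_NODE to the number of GPUs that are actually available. In the log above, ranks 2 through 7 fail while ranks 0 and 1 are not listed, which is consistent with the script launching more processes than there are usable GPUs. For example, with two usable GPUs: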

GPUS_PER_NODE=2
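
To avoid guessing the value, the usable GPU count can be checked before launching. The following is a rough sketch, not part of the original script: `check_gpus_per_node` and `usable_gpu_count` are hypothetical helper names, while `torch.cuda.device_count()` is the real PyTorch call and respects `CUDA_VISIBLE_DEVICES`.

```python
def check_gpus_per_node(requested: int, available: int) -> int:
    """Return the process count to launch, or raise if it cannot fit."""
    if requested < 1:
        raise ValueError("GPUS_PER_NODE must be at least 1")
    if requested > available:
        raise ValueError(
            f"GPUS_PER_NODE={requested} but only {available} GPU(s) are usable"
        )
    return requested


def usable_gpu_count() -> int:
    """Count GPUs visible to this process; 0 if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    # Print the value that GPUS_PER_NODE should not exceed on this machine.
    print(f"usable GPUs: {usable_gpu_count()}")
```

For instance, `check_gpus_per_node(8, 2)` raises a `ValueError`, whereas `check_gpus_per_node(2, 2)` returns 2, matching the setting shown above.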
