Problem
Traceback (most recent call last):
  File "/ssd1/miniconda3/envs/pytorch2.1.2/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-17_14:12:08
host : aidev02
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 65322)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-01-17_14:12:08
host : aidev02
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 65323)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-01-17_14:12:08
host : aidev02
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 65324)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-01-17_14:12:08
host : aidev02
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 65325)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-01-17_14:12:08
host : aidev02
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 65326)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-17_14:12:08
host : aidev02
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 65321)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
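Solution
Edit finetune_qlora_ds.sh and set GPUS_PER_NODE to the number of GPUs that are actually usable on the machine. In the log above, ranks 2 through 7 fail while ranks 0 and 1 are not listed, which is consistent with the script launching more worker processes than there were usable GPUs (2 in this case). For a machine with 2 usable GPUs: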
GPUS_PER_NODE=2
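
To catch this mismatch before launching, a small pre-flight check can compare GPUS_PER_NODE with the number of usable GPUs. The following is only a minimal sketch, not part of finetune_qlora_ds.sh: the function names visible_gpu_count and check_gpus_per_node are hypothetical, and it assumes that CUDA_VISIBLE_DEVICES, when set, holds a comma-separated list of valid, distinct device indices.

```python
def visible_gpu_count(total_gpus, cuda_visible_devices=None):
    """Estimate how many GPUs a process can see (hypothetical helper).

    total_gpus: number of GPUs installed on the node.
    cuda_visible_devices: value of CUDA_VISIBLE_DEVICES, or None if unset.
    Assumption: when set, the value is a comma-separated list of valid,
    distinct device indices.
    """
    if cuda_visible_devices is None:
        # Variable unset: every installed GPU is visible.
        return total_gpus
    ids = [d for d in cuda_visible_devices.split(",") if d.strip()]
    return min(len(ids), total_gpus)


def check_gpus_per_node(gpus_per_node, available):
    """Fail early if more workers would be launched than GPUs exist."""
    if gpus_per_node > available:
        raise ValueError(
            f"GPUS_PER_NODE={gpus_per_node} exceeds the {available} usable GPU(s)"
        )
    return gpus_per_node


if __name__ == "__main__":
    # Hypothetical node: 8 GPUs installed, only devices 0 and 1 exposed.
    available = visible_gpu_count(8, "0,1")
    print(check_gpus_per_node(2, available))
```

In an environment where PyTorch is installed, torch.cuda.device_count() could supply the usable-GPU count directly instead of the estimate above.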