WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4212 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 4211) of binary: /home/lsw/miniconda3/envs/mi/bin/python
Traceback (most recent call last):
  File "/home/lsw/miniconda3/envs/mi/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lsw/miniconda3/envs/mi/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lsw/miniconda3/envs/mi/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/lsw/miniconda3/envs/mi/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/lsw/miniconda3/envs/mi/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/lsw/miniconda3/envs/mi/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/lsw/miniconda3/envs/mi/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lsw/miniconda3/envs/mi/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
/home/lsw/LSW/hl/mi/tools/train.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-21_19:58:06
host : lswPlus
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 4211)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 4211
This error is raised during distributed training with the torch.distributed module. The details are as follows:
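Error details:
- exitcode: -9: the exit code of the process. A negative value means the process was terminated by a signal, so -9 means it was killed by SIGKILL (signal 9).
- local_rank: 0 (pid: 4211): the worker that was killed, local rank 0, whose PID is 4211.
- Signal 9 (SIGKILL) received by PID 4211: process 4211 received the SIGKILL signal.
Possible causes: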
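- Insufficient memory: distributed training can consume a large amount of memory. If the system runs out of RAM, the operating system (on Linux, the kernel OOM killer) may kill the process to free memory. This is usually the most likely cause of a SIGKILL during training.
- Timeout: in some cases a process is killed because it exceeded a time limit, for example one enforced by a job scheduler.
- Resource limits: other limits, such as CPU or GPU quotas or a memory cap imposed on a container, can also cause the process to be killed.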
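The exit-code convention in the error details can be checked with a few lines of Python. This is a minimal sketch, not part of the original training script; the helper name describe_exit_code is hypothetical, and it assumes a POSIX system where SIGKILL exists.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if exitcode < 0:
        # A negative value means the process was terminated by signal -exitcode.
        sig = signal.Signals(-exitcode)
        return f"killed by {sig.name} (signal {sig.value})"
    if exitcode == 0:
        return "exited normally"
    return f"exited with error code {exitcode}"


# The value reported by the launcher in the log above.
print(describe_exit_code(-9))
```

Because SIGKILL cannot be caught or handled, the training process has no chance to print its own traceback, which is why the report shows error_file: <N/A>.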
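Solutions:
- Check memory usage: make sure enough memory is available while distributed training runs. Commands such as top or htop can be used to monitor memory usage.
- Reduce the training batch size: a smaller batch size lowers memory consumption.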
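- Check the system logs: look in the system logs (such as /var/log/syslog or /var/log/messages) to find out exactly why the process was killed. On Linux, the kernel records an out-of-memory kill there.
- Optimize the code: check whether any part of the code can be optimized to cut unnecessary memory usage.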
- Distributed setup: confirm that the distributed training setup is correct, including the launch arguments and configuration files.
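The log check suggested above can be partly automated. The following is a rough sketch under stated assumptions: the function name find_oom_kills and the pattern are hypothetical, the exact wording of the kernel message differs between kernel versions, and the sample line is illustrative only, not taken from the machine in the log.

```python
import re
from typing import Iterable, List, Tuple

# Matches kernel lines such as "... Killed process 4211 (python) ...".
# Assumption: the wording may differ between kernel versions.
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM-killer record found."""
    kills = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Illustrative line only; a real one would come from the system log.
sample = ["[12345.678] Out of memory: Killed process 4211 (python) total-vm:1000kB"]
print(find_oom_kills(sample))
```

In practice the input would be the lines of /var/log/syslog or /var/log/messages, which may require root access to read. If the PID from the launcher report (4211 here) appears in the result, the kill was almost certainly caused by memory exhaustion.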