[PyTorch] Notes on a Distributed Training Error: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1)

            <div id="content_views" class="htmledit_views">
Recently I have been running pre-training experiments on a server with the PyTorch distributed framework. At first the experiments went smoothly, but after we increased the model's depth and width, training would crash after a few epochs with the following error:

 
 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41495 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41497 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41498 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41500 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41502 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41504 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41506 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 41496) of binary: /home/user/anaconda3/envs/conda-envs/bin/python
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_pretraining.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-30_09:05:52
  host      : ae83085e5bc2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 41496)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
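Note that the failure report above does not contain the worker's actual Python traceback (error_file is <N/A>), which is part of what makes this error hard to diagnose. As the link in the last line explains, the worker's exception can be recorded by wrapping the training script's entry point with the record decorator. A minimal sketch, assuming the entry point is a main() function in run_pretraining.py (the function name is illustrative):

from torch.distributed.elastic.multiprocessing.errors import record

@record  # captures the worker's traceback so the elastic launcher can report it
def main():
    ...  # build the model, data loaders and optimizer, then run the training loop

if __name__ == "__main__":
    main()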

At first I thought the batch size was too large and the GPUs were running out of memory, but after I reduced it the problem still occurred. I then upgraded PyTorch to 2.0, and the error persisted.
 

Later, while going over the experiment logs, I noticed that my gradient norm (grad_norm) fluctuated very erratically during training, so I followed that lead and attributed the problem to the optimization side.
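If your training script does not already log this value, here is a minimal sketch of how one might compute the total gradient norm for logging (the helper name grad_norm is illustrative, not part of my training code):

import torch

def grad_norm(model: torch.nn.Module) -> float:
    # Total L2 norm over all parameter gradients, useful for spotting optimization instability.
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item() if norms else 0.0

# In the training loop, call it after loss.backward() and before optimizer.step(),
# and log it alongside the loss at each step.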

I then realized that for the learning rate I had applied the linear learning-rate scaling rule: my total batch size was 800, far larger than the reference value of 256, so in the actual run my initial learning rate was scaled up from the 3e-4 I had set to roughly 1e-3. The learning rate was therefore too large, and training collapsed. The arithmetic is sketched below.
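A minimal sketch of how the linear scaling rule produces that effective learning rate (variable names are illustrative):

base_lr = 3e-4           # learning rate I set, referenced to a batch size of 256
reference_batch = 256
total_batch = 800        # per-GPU batch size x number of GPUs (x gradient accumulation)

scaled_lr = base_lr * total_batch / reference_batch
print(scaled_lr)         # 9.375e-04, i.e. roughly 1e-3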

Based on this conclusion, I lowered the initial learning rate to 2e-4, and the model resumed training normally.

This error can have many different root causes; I'm sharing mine here for reference only.
