Recently, I have been running pre-training experiments on a server using PyTorch's distributed framework. At first everything went smoothly, but after we increased the model's depth and width, training would crash after a few epochs with the following error:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41495 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41497 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41498 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41500 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41502 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41504 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41506 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 41496) of binary: /home/user/anaconda3/envs/conda-envs/bin/python
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/conda-envs/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_pretraining.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-30_09:05:52
  host      : ae83085e5bc2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 41496)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
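A side note on the report itself: it ends with error_file: <N/A>, so the child's real exception is not shown. Following the link printed in the log, the traceback of the failing worker can be captured by wrapping the training entrypoint with the @record decorator. Below is a minimal sketch; main() here is only a stand-in for the actual entrypoint of run_pretraining.py.

from torch.distributed.elastic.multiprocessing.errors import record

# With @record, torchelastic writes the worker's traceback to an error file,
# so the failure report shows the real exception instead of "error_file: <N/A>".
@record
def main():
    ...  # training code goes here

if __name__ == "__main__":
    main()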
At first I suspected the batch size was too large and the GPU was running out of memory, but the problem persisted after I reduced it. I then upgraded PyTorch to 2.0, and the error still occurred.
Later, while inspecting the training logs, I noticed that the gradient norm (grad_norm) fluctuated wildly during training, so I followed that lead and attributed the failure to an optimization problem.
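For reference, this is roughly how the gradient norm can be logged each step; a minimal sketch with a toy model, not the actual pre-training code.

import torch
import torch.nn as nn

# Hypothetical minimal example: log the global gradient norm every step so
# that instability like the one described above is visible in the logs.
# clip_grad_norm_ returns the total norm; max_norm=inf makes it a pure
# measurement without actually clipping anything.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(3):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
    print(f"step {step}  loss {loss.item():.4f}  grad_norm {grad_norm.item():.4f}")
    optimizer.step()
    optimizer.zero_grad()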
I then realized that my learning-rate setting used the linear scaling rule, and my total batch size was 800, far larger than the reference value of 256. As a result, the actual initial learning rate was scaled up from the 3e-4 I had configured to roughly 1e-3, which was too large and caused training to collapse.
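As a sanity check, the numbers work out as follows, assuming the common linear scaling rule lr = base_lr * total_batch / 256 (the variable names are just for illustration):

base_lr = 3e-4       # learning rate set in the config
base_batch = 256     # reference batch size of the scaling rule
total_batch = 800    # actual global batch size across all GPUs

# Linear scaling rule: scale the learning rate with the global batch size.
scaled_lr = base_lr * total_batch / base_batch
print(scaled_lr)     # ~9.4e-4, i.e. roughly the 1e-3 observed in the logs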
Based on this conclusion, I lowered the initial learning rate to 2e-4, and the model resumed training normally.
The causes behind this error differ from case to case; I am sharing mine here for reference only.