Some NCCL operations have failed or timed out.

Background: distributed model training across two servers, launched with torchrun.

Error: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.


Full error log:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Fatal Python error: Aborted

Thread 0x00007ff9b3476700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b2c75700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b3c77700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b547a700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b4478700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b5c7b700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b4c79700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b67fc700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b7fff700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/queue.py", line 180 in get
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tensorboard/summary/writer/event_file_writer.py", l
ine 227 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007ff9d3ae2700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007ffa68d63180 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2070 i
n all_gather
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 79 in all_gather    
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 152 in all_gather    
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/utils.py", line 1016 in all_gathe
r_dp_groups
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1805
 in step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1814 in _take_mo
del_step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1913 in step    
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1756 in _inner_train
ing_loop
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1498 in train       
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 126 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__
init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 132 in <module>
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Fatal Python error: Aborted

Thread 0x00007f75b7fff700 (most recent call first):
<no Python frame>

Thread 0x00007f75c4ffd700 (most recent call first):
<no Python frame>

Thread 0x00007f75c57fe700 (most recent call first):
<no Python frame>

Thread 0x00007f75c5fff700 (most recent call first):
<no Python frame>

Thread 0x00007f75def7d700 (most recent call first):
<no Python frame>

Thread 0x00007f75df77e700 (most recent call first):
<no Python frame>

Thread 0x00007f75dff7f700 (most recent call first):
<no Python frame>

Thread 0x00007f72f37fe700 (most recent call first):
<no Python frame>

Thread 0x00007f7655632700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007f76ea8b2180 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 496 in synchronize   
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/utils/timer.py", line 189 in stop        
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1915 in step    
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1756 in _inner_train
ing_loop
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1498 in train       
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 126 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__
init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 132 in <module>
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120955 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120956 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120957 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120958 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120959 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120960 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120961 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120962 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/liuzhaofeng/anaconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__
init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main       
    run(args)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run        
    elastic_launch(
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __
call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in la
unch_agent
    result = agent.run()
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125
, in wrapper
    result = f(*args, **kwargs)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", lin
e 709, in run
    result = self._invoke_run(role)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", lin
e 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", 
line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4120889 got signal: 1

Key part of the log:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
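
Because the message explicitly mentions a possible connection closure by a remote peer, it can help to rule out basic cross-node NCCL connectivity before digging into the training code. Below is a minimal smoke test, not part of the original job; the file name `nccl_smoke_test.py` and the environment-variable based setup are illustrative assumptions, and it would be launched with the same torchrun arguments (`--nnodes`, `--node_rank`, `--master_addr`, ...) on both machines.

```python
# nccl_smoke_test.py -- hypothetical helper, run with torchrun on both nodes
# using the same multi-node arguments as the real job.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Same backend as the failing job; init_method="env://" reads the env vars above.
    dist.init_process_group(backend="nccl", init_method="env://")

    # One tiny all_reduce across all ranks: if this hangs or raises the same
    # ncclSystemError, the problem is networking/NCCL rather than the training code.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this test fails the same way, the cause is at the network/NCCL level (firewall, NIC selection, peer dropping the connection) rather than in DeepSpeed or the Trainer.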


It looks like the error is caused by the two machines falling out of sync, and the problem is intermittent, so the first thing to try is restarting the job and seeing whether that resolves it.
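
If the job has to be restarted anyway, it is worth enabling NCCL's own logging at the same time and, if the failures look like slow synchronization rather than a dead link, raising the collective timeout that the watchdog in the log above enforces. The sketch below shows one way to do this before creating the process group; the interface name `eth0` and the 2-hour timeout are placeholders that would need to be adapted to the actual cluster.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Make NCCL print the reason for failures into the worker logs. These must be
# set before the process group is created (they can equally be exported in the
# shell that runs torchrun on each node).
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
# If the servers have several NICs, pin NCCL to the one that can reach the
# other node; "eth0" here is only a placeholder.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# A larger timeout gives slow ranks more headroom before the watchdog reports
# "Some NCCL operations have failed or timed out" (the argument default is 30 minutes).
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=timedelta(hours=2),
)
```

With `NCCL_DEBUG=INFO` enabled, the worker logs usually contain an `NCCL WARN` line that names the concrete failure (for example, which socket or peer dropped), which is far more actionable than the generic `ncclSystemError`.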

### Analysis of the warning caused by an undestroyed ProcessGroupNCCL in PyTorch

In distributed training, PyTorch uses the NCCL backend for efficient GPU-to-GPU communication. If the `ProcessGroupNCCL` object is not destroyed properly when the program ends, the following warning may be triggered:

> "Warning: ProcessGroupNCCL is not destroyed at the end of distributed job."

This warning indicates that some resources have not been released before the process exits, which can lead to memory leaks or other unexpected behavior.

#### Why the warning appears

The root cause is improper resource management in the distributed environment. Specifically:

- Initializing the distributed environment with `torch.distributed.init_process_group(backend='nccl')` creates a `ProcessGroupNCCL` object.
- If `destroy_process_group()` is not called explicitly (or the resources are not cleaned up some other way) before the program exits normally, PyTorch will try to destroy them automatically while the Python interpreter is shutting down.
- This delayed destruction can conflict with other global state (such as singleton instances), especially in multi-threaded or multi-device environments.

---

### Solutions

Below are several common solutions and when to use them.

#### Method 1: Destroy the process group manually

Explicitly calling `dist.destroy_process_group()` ensures that all distributed resources are released promptly. For example:

```python
import torch.distributed as dist

def cleanup():
    """Destroy process group to avoid warnings."""
    dist.destroy_process_group()
```

In the main script, run this as the final step:

```python
if __name__ == "__main__":
    # Initialize and run your training loop here...

    # Cleanup after finishing all tasks.
    cleanup()
```

This approach is simple and effective, works in most cases, and noticeably reduces problems caused by unreleased resources.

---

#### Method 2: Adjust destructor logic

For more complex project structures, especially ones that wrap the setup in custom classes, extra safety measures can be added in the relevant class destructors. Note, however, that (as the referenced material points out) calling the aclFinalize interface inside a destructor is not recommended, because the undefined destruction order of singletons can lead to crashes.

The recommended practice is therefore to trigger the necessary cleanup explicitly at an appropriate point rather than relying on the default destruction mechanism (a minimal `atexit`-based sketch of this idea appears at the end of this post).

---

#### Method 3: Set the correct device context

Another common issue is that sloppy switching between multiple devices can indirectly affect how the final shutdown stage behaves. Before initialization, and before each subsequent change of the target device ID, always confirm that the current device setting is correct:

```cpp
// Example C++ equivalent pseudo-code demonstrating setting a specific device before operations.
aclError err = aclrtSetDevice(deviceId);
if (err != ACL_SUCCESS) {
    std::cerr << "Failed to set device!" << std::endl;
}
```

Although this example is based on the ACL API, the same principle matters in PyTorch applications: throughout the whole lifecycle, work against a single, explicitly specified GPU device unless data genuinely needs to move between cards.

---

### Summary code snippet

Putting the discussion above together, here is a complete example for eliminating this kind of warning:

```python
import os
import torch
import torch.distributed as dist

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=world_size,
        rank=rank)

def cleanup():
    dist.destroy_process_group()

def main():
    rank = ...  # Define or get from environment variables etc.
    size = ...  # Similarly define total number of processes

    setup(rank, size)

    try:
        # Your actual computation goes here
        pass
    finally:
        cleanup()  # Ensure this runs even on exceptions!

if __name__ == '__main__':
    main()
```
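
As an alternative to the destructor-based approach discussed under Method 2 above, the cleanup can also be registered with Python's `atexit` module, so that `destroy_process_group()` runs once at normal interpreter exit without relying on `__del__`. This is only a sketch of that idea, not something from the original post:

```python
import atexit

import torch.distributed as dist


def _cleanup_process_group():
    # Only tear down if init_process_group() actually succeeded.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()


# Registered handlers run at normal interpreter exit, before module globals
# are torn down, which avoids depending on destructor ordering.
atexit.register(_cleanup_process_group)
```

Note that `atexit` handlers are skipped when a worker is killed by an unhandled signal, so the explicit `try`/`finally` in the summary snippet above remains the more robust pattern.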