How to resolve the NVIDIA GPU error: uncorrectable ECC error

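I. How the problem was discovered

Recently at work, the server reported the following error while training a digital human avatar model: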

[2024-03-22 03:13:57] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] [W CUDAGuardImpl.h:112] Warning: CUDA warning: uncorrectable ECC error encountered (function destroyEvent)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] Traceback (most recent call last):
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 206, in <module>
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] main(args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 178, in main
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] torch.multiprocessing.spawn(subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] while not context.join():
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] raise ProcessRaisedException(msg, error_index, failed_process.pid)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] torch.multiprocessing.spawn.ProcessRaisedException:
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] -- Process 2 terminated with the following error:
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] Traceback (most recent call last):
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] fn(i, *args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 165, in subprocess_fn
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] training_loop(rank=rank, args=args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 90, in training_loop
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] input = input.to(device)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] RuntimeError: CUDA error: uncorrectable ECC error encountered
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Analysis of the exception log turned up the key message: RuntimeError: CUDA error: uncorrectable ECC error encountered
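As a rough illustration of that log analysis, the sketch below scans log lines for the ECC message and for the worker index that torch.multiprocessing.spawn reports. The function name and the shortened sample lines are hypothetical; only the matched message texts come from the log above. In this incident the failing process index is 2, which would correspond to GPU index 2 only if the training script maps each rank to the device with the same index (an assumption, not something the log proves).

```python
import re

# Message text taken from the CUDA warning/error lines in the log above.
ECC_PATTERN = re.compile(r"uncorrectable ECC error encountered")
# torch.multiprocessing.spawn names the failing worker as "Process N terminated ...".
RANK_PATTERN = re.compile(r"Process (\d+) terminated with the following error")


def summarize_ecc_failure(log_lines):
    """Return (count of lines mentioning the ECC error, list of failed process indices)."""
    ecc_hits = 0
    failed_processes = []
    for line in log_lines:
        if ECC_PATTERN.search(line):
            ecc_hits += 1
        match = RANK_PATTERN.search(line)
        if match:
            failed_processes.append(int(match.group(1)))
    return ecc_hits, failed_processes


# Shortened copies of three lines from the log (timestamps and prefixes removed).
sample = [
    "[W CUDAGuardImpl.h:112] Warning: CUDA warning: uncorrectable ECC error encountered (function destroyEvent)",
    "-- Process 2 terminated with the following error:",
    "RuntimeError: CUDA error: uncorrectable ECC error encountered",
]
print(summarize_ecc_failure(sample))  # (2, [2])
```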

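II. Impact of the problem

The exception caused digital human avatar model training to fail, so no avatar could be delivered to the customer, which in turn affected delivery.

III. Detailed troubleshooting process

I first searched Baidu for [CUDA warning: uncorrectable ECC error encountered] to learn what ECC is and what it does.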

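Volatile Uncorr. ECC: this nvidia-smi column reports the number of uncorrectable ECC errors detected on the GPU since the driver was last loaded. It is an error counter rather than an on/off switch; GPUs without ECC support show N/A. ECC (Error Correcting Code) memory uses dedicated hardware to monitor data in memory, detect bit errors, and automatically correct the ones it can. "Volatile" means the counter starts over when the driver is reloaded, as opposed to the aggregate counter, which persists. "Uncorrectable" means the error was detected but could not be repaired (typically a double-bit error, whereas single-bit errors are corrected automatically), so the affected data can no longer be trusted.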
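To make the difference between a correctable and an uncorrectable error concrete, here is a toy sketch of a single-error-correcting, double-error-detecting (SECDED) code, using an extended Hamming (8,4) word. It is only an illustration of the principle; it is not NVIDIA's actual ECC scheme, and the function names are made up for this example.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) word."""
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data   # data bits sit at positions 3, 5, 6, 7
    word[1] = word[3] ^ word[5] ^ word[7]       # parity over positions with bit 0 set
    word[2] = word[3] ^ word[6] ^ word[7]       # parity over positions with bit 1 set
    word[4] = word[5] ^ word[6] ^ word[7]       # parity over positions with bit 2 set
    word[0] = sum(word[1:]) % 2                 # overall parity over the whole word
    return word


def decode(word):
    """Return (status, data) where status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                     # XOR of the positions of all set bits
    overall = sum(word) % 2                     # 0 for a valid word
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error at position `syndrome`
        # (position 0 when the syndrome is 0) and flip it back.
        word = list(word)
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # The error is detected, but its location cannot be determined.
        return "uncorrectable", None
    return status, [word[3], word[5], word[6], word[7]]


def flip(word, *positions):
    """Return a copy of `word` with the given bit positions inverted."""
    damaged = list(word)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


data = [1, 0, 1, 1]
stored = encode(data)
print(decode(stored))              # ('ok', [1, 0, 1, 1])
print(decode(flip(stored, 5)))     # ('corrected', [1, 0, 1, 1])
print(decode(flip(stored, 5, 6)))  # ('uncorrectable', None)
```

A single flipped bit is repaired silently, which corresponds to a corrected error; two flipped bits can only be reported, which is the situation the "uncorrectable" counter records.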

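Next I searched for a fix. Baidu did not turn up a good solution, so I switched to Google and found the following search results:

One of them was an official NVIDIA post, so I opened it right away to look for a solution.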

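IV. How the problem was solved

1. Check the GPU status with nvidia-smi. The key field is [Volatile Uncorr. ECC]: of the four GPUs, the third showed a value different from the other three, which identified it as the faulty card.

2. Restore the GPU status with the command nvidia-smi -i 2 -p 0. Here -i 2 selects GPU index 2 (the third card, since indices start at 0) and -p 0 resets its volatile ECC error counters.

The GPU status returned to normal.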
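The two steps above can be sketched in Python as follows. This is a minimal sketch, not the exact procedure used during the incident: the helper names are hypothetical, and the sample CSV is made-up data for a 4-GPU host, because the actual counter values were not recorded. It assumes nvidia-smi is on the PATH and that the GPUs support the ecc.errors.uncorrected.volatile.total query field; running the reset command usually requires root privileges.

```python
import subprocess

# Query one "index, counter" row per GPU, without a header line.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader",
]


def parse_counts(csv_text):
    """Map GPU index -> raw counter string (may be '[N/A]' without ECC support)."""
    counts = {}
    for line in csv_text.strip().splitlines():
        index, value = [field.strip() for field in line.split(",")]
        counts[int(index)] = value
    return counts


def gpus_with_errors(counts):
    """Return indices of GPUs whose volatile uncorrectable counter is above zero."""
    return [idx for idx, value in sorted(counts.items())
            if value.isdigit() and int(value) > 0]


def query_counts():
    """Run nvidia-smi and parse its output (only works on a host with NVIDIA GPUs)."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_counts(result.stdout)


def reset_command(gpu_index):
    """Build the command from step 2: reset volatile ECC counters (-p 0) on one GPU."""
    return ["nvidia-smi", "-i", str(gpu_index), "-p", "0"]


# Hypothetical sample output; on a real host use query_counts() instead.
sample_csv = "0, 0\n1, 0\n2, 1\n3, 0"
faulty = gpus_with_errors(parse_counts(sample_csv))
for idx in faulty:
    # Printed rather than executed, so the sketch is safe to run anywhere.
    print(" ".join(reset_command(idx)))  # nvidia-smi -i 2 -p 0
```

After an actual reset, calling query_counts() again and checking that gpus_with_errors() returns an empty list is one way to confirm the counter was cleared. If the counter climbs again soon after, that may point to a hardware fault rather than a one-off event.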

I then contacted business operations to restart the avatar model training task.

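V. Summary and reflection

When a production issue comes up and a search on Baidu turns up no solution, try Google as well. There are always more solutions than problems.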
