How to Resolve the NVIDIA GPU Error: uncorrectable ECC error

Latest recommended article published on 2024-10-11 20:02:21

卡亦克

Tags: deep learning, artificial intelligence, machine learning

Article link: https://blog.csdn.net/caryeko/article/details/140764046

1. How the problem was discovered

Recently at work, a server reported the following error while training a digital-human avatar model:

[2024-03-22 03:13:57] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] [W CUDAGuardImpl.h:112] Warning: CUDA warning: uncorrectable ECC error encountered (function destroyEvent)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] Traceback (most recent call last):
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 206, in <module>
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] main(args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 178, in main
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] torch.multiprocessing.spawn(subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] while not context.join():
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] raise ProcessRaisedException(msg, error_index, failed_process.pid)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] torch.multiprocessing.spawn.ProcessRaisedException:
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] -- Process 2 terminated with the following error:
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] Traceback (most recent call last):
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/usr/local/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] fn(i, *args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 165, in subprocess_fn
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] training_loop(rank=rank, args=args)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] File "/export/Instance/algorithm/blindrestoration/distribute_training_sr_s.py", line 90, in training_loop
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] input = input.to(device)
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] RuntimeError: CUDA error: uncorrectable ECC error encountered
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
[2024-03-22 03:14:25] [INFO] [auto_train_light_algorithm_executor] [train_algorithm_execute] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Analysis of the exception log turned up the key line: RuntimeError: CUDA error: uncorrectable ECC error encountered
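The same log triage can be sketched in a few lines of Python. This is a minimal sketch, not part of the original training pipeline: the function name `scan_log` is hypothetical, and the two patterns are copied from the log lines shown above. The process index reported by `torch.multiprocessing.spawn` is the worker index; it corresponds to a GPU index only if the training script maps each rank to the device with the same number, which is an assumption here.

```python
import re

# Patterns copied from the log excerpt above.
ECC_PATTERN = re.compile(r"uncorrectable ECC error encountered")
PROCESS_PATTERN = re.compile(r"-- Process (\d+) terminated with the following error")


def scan_log(lines):
    """Return (ecc_hit_count, failed_process_indices) for an iterable of log lines."""
    ecc_hits = 0
    failed = []
    for line in lines:
        # Count every line that mentions the ECC error (warning or RuntimeError).
        if ECC_PATTERN.search(line):
            ecc_hits += 1
        # Record which spawned worker process raised the exception.
        match = PROCESS_PATTERN.search(line)
        if match:
            failed.append(int(match.group(1)))
    return ecc_hits, failed
```

Calling `scan_log(open(path, encoding="utf-8"))` on a log file (where `path` is whatever file the executor writes to) returns the number of ECC-related lines and the list of worker indices that terminated.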

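2. Impact of the problem

This exception caused training of the digital-human avatar model to fail, so no avatar could be produced for the customer, which in turn affected delivery.

3. Detailed troubleshooting process

I first searched Baidu for [CUDA warning: uncorrectable ECC error encountered] to learn what ECC is and what it does.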

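Volatile Uncorr. ECC: the column in the nvidia-smi output that shows the number of uncorrectable ECC errors detected in GPU memory since the driver was last loaded (the counter is "volatile" because it resets when the driver reloads; it shows N/A when the GPU does not support ECC or ECC is disabled). ECC (Error Correcting Code) is a memory-protection technique that uses dedicated hardware to detect bit errors in stored data. Single-bit errors are corrected automatically, whereas double-bit errors can be detected but not corrected, and these are the ones reported as uncorrectable.

Next I searched for a solution. Baidu did not turn up anything particularly useful, so I switched to Google and found the following results: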
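To read that counter per GPU without scanning the nvidia-smi table by eye, the values can be queried in CSV form. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and that the driver supports the query fields `ecc.mode.current` and `ecc.errors.uncorrected.volatile.total` (listed by `nvidia-smi --help-query-gpu`). The helper names `parse_ecc_csv`, `query_ecc`, and `gpus_with_uncorrectable_errors` are hypothetical and do not come from the original article.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,name,ecc.mode.current,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != 4:
            continue  # skip blank or malformed lines
        index, name, mode, uncorrected = (field.strip() for field in record)
        rows.append({
            "index": int(index),
            "name": name,
            "ecc_mode": mode,
            # Non-numeric values such as "[N/A]" mean ECC is unavailable.
            "uncorrected_volatile": int(uncorrected) if uncorrected.isdigit() else None,
        })
    return rows


def query_ecc():
    """Run nvidia-smi and return parsed rows, or [] if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_csv(result.stdout)


def gpus_with_uncorrectable_errors(rows):
    """Return the indices of GPUs whose volatile uncorrectable count is non-zero."""
    return [row["index"] for row in rows if row["uncorrected_volatile"]]
```

On an affected server, `gpus_with_uncorrectable_errors(query_ecc())` would list the GPU indices worth inspecting first. Parsing is kept separate from the subprocess call so the parser can be exercised on saved output from a machine that is not currently reachable.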