Troubleshooting | A Summary of CUDA Code Errors and How to Fix Them

Latest recommended article published on 2025-03-11 15:05:19

夏天｜여름이다

Category: Computer Fundamentals. Tags: deep learning, artificial intelligence, error resolution

Article link: https://blog.csdn.net/weixin_44649780/article/details/128911586

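This article summarizes the RuntimeErrors encountered in CUDA programming, including the CUDA out of memory problem and ways to resolve it, such as adjusting the batch size, freeing GPU memory, and checking version compatibility. For PyTorch users, it covers the cuDNN deterministic and benchmark settings and ways of dealing with GPU memory fragmentation.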

This post collects common CUDA-related code errors and the fixes that worked for them.

1.RuntimeError (runtime errors)

1.1.RuntimeError: CUDA error: out of memory

CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
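The hint in the message can be applied from Python as well as from the shell. A minimal sketch, assuming the variable is set before PyTorch makes its first CUDA call (the helper name `enable_blocking_launches` is made up for illustration):

```python
import os


def enable_blocking_launches() -> str:
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Assumption: this runs before the first CUDA call in the process, ideally
    before `import torch`. Setting it later is generally not picked up.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


# Call this at the very top of the training script, then import torch.
enable_blocking_launches()
```

The shell equivalent is `CUDA_LAUNCH_BLOCKING=1 python train.py`, where `train.py` is a placeholder script name. Synchronous launches slow training down, so this is best kept for debugging runs.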

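Analysis:

The program had been running fine and the code itself was not at fault. GPU memory had barely been touched and there was plenty of it, so the GPU was probably being held by something else,

most likely memory left over from an earlier training run. Because I was running inside a Docker container, stopping the container and then starting it again fixed the problem.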
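Before restarting the container, it can help to see which processes are actually holding GPU memory. The following is a sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_compute_apps` and `list_gpu_processes` are hypothetical, and processes that belong to other containers may not be visible from inside one.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text: str) -> list[dict]:
    """Parse `pid, process_name, used_memory` CSV rows (no header, no units)."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or unexpected lines
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        try:
            used_mib = int(used.strip())
        except ValueError:
            used_mib = None  # nvidia-smi may report a non-numeric placeholder
        rows.append({"pid": int(pid.strip()), "name": name.strip(), "used_mib": used_mib})
    return rows


def list_gpu_processes() -> list[dict]:
    """Return the compute processes reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-compute-apps=pid,process_name,used_memory",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


for proc in list_gpu_processes():
    print(proc)
```

If a stale process from a previous run shows up in the list, ending that process frees its memory without restarting the whole container.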

1.2.RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Possible causes

A mismatch between the PyTorch and CUDA versions

Insufficient GPU memory
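Both suspects can be checked quickly. Below is a minimal diagnostic sketch, assuming PyTorch may or may not be importable in the current environment and that it is recent enough to provide `torch.cuda.mem_get_info`; the function name `describe_torch_build` is hypothetical.

```python
def describe_torch_build():
    """Return the versions PyTorch was built with, or None if it is not installed."""
    try:
        import torch
    except ImportError:
        return None
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,         # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Free and total memory of the current device, in bytes.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        info["free_gib"] = free_bytes / 2**30
        info["total_gib"] = total_bytes / 2**30
    return info


print(describe_torch_build())
```

If `cuda_build` is newer than the CUDA version the driver supports (shown in the header of `nvidia-smi`), installing a matching PyTorch build is the usual remedy.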

Code I tried, based on other blog posts

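import torch
# True: cuDNN returns the same (deterministic) convolution algorithm every time, i.e. the default algorithm.
torch.backends.cudnn.deterministic = True
# True: spend extra time at startup searching for the best convolution
# implementation for every convolution layer, which speeds up the network.
torch.backends.cudnn.benchmark = True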

Final solution

Set num_workers (the DataLoader argument) to 0

1.3.RuntimeError: CUDA out of memory

①RuntimeError: CUDA out of memory. Tried to allocate 152.00 MiB (GPU 0; 23.65 GiB total capacity; 13.81 GiB already allocated; 118.44 MiB free; 14.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

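The allocation exceeded the available GPU memory. The local GPU should have been large enough, but during PyTorch training the forward activations, the gradients computed in backpropagation, and the network parameters all take up a large amount of GPU memory, so the batch size has to be reduced.

Solutions:

Reduce the batch size, i.e. the number of samples used in a single training step

Free cached GPU memory: torch.cuda.empty_cache()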
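The two remedies can be combined into a retry loop that halves the batch size whenever an out-of-memory error is raised. This is a sketch under assumptions, not the author's code: `train_step` stands for any callable that runs one training step at a given batch size, and `free_cache` is a hook where `torch.cuda.empty_cache` would be passed in a real PyTorch script.

```python
from typing import Callable, Optional


def run_with_smaller_batches(
    train_step: Callable[[int], None],
    batch_size: int,
    min_batch_size: int = 1,
    free_cache: Optional[Callable[[], None]] = None,
) -> int:
    """Retry `train_step`, halving the batch size after each out-of-memory error.

    Returns the batch size that finally succeeded. Other errors, and an
    out-of-memory error at the minimum batch size, are re-raised unchanged.
    """
    while True:
        try:
            train_step(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err) or batch_size <= min_batch_size:
                raise
        # Reached only after a recoverable out-of-memory error. Cleaning up
        # outside the except block avoids the traceback keeping objects alive.
        if free_cache is not None:
            free_cache()
        batch_size = max(min_batch_size, batch_size // 2)


def fake_step(batch_size: int) -> None:
    # Stand-in for a real training step: pretend more than 16 samples do not fit.
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory. (simulated)")


print(run_with_smaller_batches(fake_step, batch_size=64))  # prints 16
```

Matching on the message text is a deliberate simplification so that the sketch handles both the plain `RuntimeError` and the newer `torch.cuda.OutOfMemoryError` shown in this section.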

②torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.65 GiB total capacity; 22.73 GiB already allocated; 116.56 MiB free; 22.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

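In other words: CUDA has run out of memory. PyTorch tried to allocate 128.00 MiB on GPU 0, which has 23.65 GiB in total; 22.73 GiB was already allocated, only 116.56 MiB was free, and PyTorch had reserved 22.78 GiB altogether. If reserved memory is much larger than allocated memory, setting max_split_size_mb may avoid fragmentation; see the PyTorch documentation on Memory Management and PYTORCH_CUDA_ALLOC_CONF.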
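Whether fragmentation is worth chasing can be read off the numbers in the message by comparing reserved memory with allocated memory. The sketch below extracts those two figures with a regular expression. The function name `reserved_minus_allocated_gib` is hypothetical, and the pattern assumes the message format quoted in this section, with both figures given in GiB.

```python
import re

_OOM_PATTERN = re.compile(
    r"(?P<allocated>[\d.]+) GiB already allocated;.*?"
    r"(?P<reserved>[\d.]+) GiB reserved in total"
)


def reserved_minus_allocated_gib(message: str) -> float:
    """Return how much memory PyTorch has reserved but not allocated, in GiB."""
    match = _OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like a PyTorch OOM report")
    return float(match["reserved"]) - float(match["allocated"])


message = (
    "CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.65 GiB total "
    "capacity; 22.73 GiB already allocated; 116.56 MiB free; 22.78 GiB reserved "
    "in total by PyTorch)"
)
gap = reserved_minus_allocated_gib(message)
print(f"{gap:.2f} GiB reserved but not allocated")  # prints 0.05 GiB ...
```

For this message the gap is only about 0.05 GiB, so the GPU is simply full and max_split_size_mb is unlikely to help. If the gap were large, one option would be to start the script with `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` in the environment, where 128 is only an illustrative value.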

Cause analysis:

During model training, GPU memory was not released after each training iteration.

Solution:

Check nvidia-smi