Problem: training gets through any given epoch fine, but an OOM error appears as soon as the next epoch starts.
Error message: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 460.00 MiB. GPU has a total capacity of 79.33 GiB of which 248.00 MiB is free.
Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 76.12 GiB is allocated by PyTorch, and 1.08 GiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
Possible solutions:
1. Manually clear the GPU memory cache:
In PyTorch, call torch.cuda.empty_cache() to release the allocator's cached-but-unused blocks. This does not free memory still held by live tensors, but it hands the reserved-yet-unallocated memory back and eases fragmentation pressure.
Add torch.cuda.empty_cache() after each forward(), or, since calling it every step adds overhead, once at the end of each epoch, which is usually enough for this "OOM at the start of the next epoch" pattern; see the sketch below.
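A minimal sketch of where the call fits in a training loop (the toy model, optimizer, and data below are placeholders, not from the original):

import torch
import torch.nn as nn

# toy stand-ins so the sketch runs on its own; substitute your real model/data
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
batches = [torch.randn(32, 128, device=device) for _ in range(10)]

for epoch in range(3):
    for x in batches:
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        # torch.cuda.empty_cache() could go here, after every forward/backward,
        # but calling it on each step noticeably slows training
    # once per epoch is usually enough for this failure pattern
    torch.cuda.empty_cache()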
2. Use max_split_size_mb to avoid memory fragmentation
Set the PYTORCH_CUDA_ALLOC_CONF environment variable in the training script to control the allocator's behavior and avoid fragmentation;
Concretely:
In Python:
import os
# put both options in one comma-separated string; assigning the variable
# twice would leave only the second option in effect
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:128'
where 128 can be raised or lowered to match the amount of GPU memory on your machine
In a shell (export on Linux/macOS; set on Windows cmd):
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
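To confirm that fragmentation (rather than plain over-allocation) is the problem, it helps to watch the allocated-versus-reserved gap at epoch boundaries. A minimal sketch, assuming the option string above; the report_memory helper is hypothetical, not part of PyTorch. Note the variable must be set before torch initializes CUDA, hence before the import:

import os
# the caching allocator reads this once, at CUDA initialization,
# so set it before importing torch or touching the GPU
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:128'

import torch

def report_memory(tag: str) -> None:
    # allocated = bytes held by live tensors; reserved = allocated + cache.
    # a large reserved-minus-allocated gap is the fragmentation symptom the
    # error message above points at ("reserved by PyTorch but unallocated")
    alloc_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc_mib:.0f} MiB, reserved={reserved_mib:.0f} MiB")

Calling report_memory("epoch end") at each epoch boundary shows whether reserved memory keeps growing even while allocated memory stays flat.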
3. Reduce the batch size
A smaller per-step batch directly lowers peak activation memory; if the smaller batch hurts optimization, gradient accumulation can preserve the effective batch size, as sketched below.
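Gradient accumulation is a complementary technique not mentioned in the original: run several small micro-batches, sum their gradients, and step once. A self-contained sketch with toy stand-ins:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

accum_steps = 4  # 4 micro-batches of 8 approximate one batch of 32
micro_batches = [torch.randn(8, 128, device=device) for _ in range(8)]

optimizer.zero_grad()
for step, x in enumerate(micro_batches):
    # scale each loss so the accumulated gradient matches the large batch
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()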
4. Set ddp_timeout and NCCL_TIMEOUT to a larger value such as 30000 (tune within roughly 30-30000 depending on the data size; check your framework's documentation for the unit, e.g. the Hugging Face Trainer's ddp_timeout argument is in seconds), as sketched below.
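At the PyTorch level, the collective timeout is the timeout parameter of torch.distributed.init_process_group; NCCL_TIMEOUT as an environment variable is read by some training frameworks rather than by NCCL itself, so treat it as framework-specific. A hedged sketch (requires a distributed launch such as torchrun, which supplies MASTER_ADDR/MASTER_PORT):

import os
from datetime import timedelta
import torch.distributed as dist

# framework-specific: some launch scripts read this; NCCL itself does not
os.environ["NCCL_TIMEOUT"] = "30000"

# standard PyTorch knob: a collective stalled longer than this is aborted
# instead of hanging the whole job
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=1800))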