Problem: training gets through any given epoch fine, but an OOM error appears as soon as the next epoch starts.
Error message: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 460.00 MiB. GPU has a total capacity of 79.33 GiB of which 248.00 MiB is free.
Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 76.12 GiB is allocated by PyTorch, and 1.08 GiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
Possible solutions:
1. Manually clear the GPU memory cache:
In PyTorch, call torch.cuda.empty_cache() to release the allocator's cached-but-unused blocks. This does not free memory still held by live tensors, but it hands the reserved-yet-unallocated memory back and eases fragmentation pressure.
Add torch.cuda.empty_cache() after each forward(), or, since calling it every step adds overhead, once at the end of each epoch, which is usually enough for this "OOM at the start of the next epoch" pattern; see the sketch below.
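A minimal sketch of where the call fits in a training loop (the toy model, optimizer, and data below are placeholders, not from the original):

import torch
import torch.nn as nn

# toy stand-ins so the sketch runs on its own; substitute your real model/data
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
batches = [torch.randn(32, 128, device=device) for _ in range(10)]

for epoch in range(3):
    for x in batches:
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        # torch.cuda.empty_cache() could go here, after every forward/backward,
        # but calling it on each step noticeably slows training
    # once per epoch is usually enough for this failure pattern
    torch.cuda.empty_cache()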
2. Use max_split_size_mb to avoid memory fragmentation
Set the PYTORCH_CUDA_ALLOC_CONF environment variable in the training script to control the allocator's behavior and avoid fragmentation;
Concretely:
In Python:
import os
# put both options in one comma-separated string; assigning the variable
# twice would leave only the second option in effect
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:128'
where 128 can be raised or lowered to match the amount of GPU memory on your machine
In a shell (export on Linux/macOS; set on Windows cmd):
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
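To confirm that fragmentation (rather than plain over-allocation) is the problem, it helps to watch the allocated-versus-reserved gap at epoch boundaries. A minimal sketch, assuming the option string above; the report_memory helper is hypothetical, not part of PyTorch. Note the variable must be set before torch initializes CUDA, hence before the import:

import os
# the caching allocator reads this once, at CUDA initialization,
# so set it before importing torch or touching the GPU
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True,max_split_size_mb:128'

import torch

def report_memory(tag: str) -> None:
    # allocated = bytes held by live tensors; reserved = allocated + cache.
    # a large reserved-minus-allocated gap is the fragmentation symptom the
    # error message above points at ("reserved by PyTorch but unallocated")
    alloc_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc_mib:.0f} MiB, reserved={reserved_mib:.0f} MiB")

Calling report_memory("epoch end") at each epoch boundary shows whether reserved memory keeps growing even while allocated memory stays flat.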
3. Reduce the batch size
A smaller per-step batch directly lowers peak activation memory; if the smaller batch hurts optimization, gradient accumulation can preserve the effective batch size, as sketched below.
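Gradient accumulation is a complementary technique not mentioned in the original: run several small micro-batches, sum their gradients, and step once. A self-contained sketch with toy stand-ins:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

accum_steps = 4  # 4 micro-batches of 8 approximate one batch of 32
micro_batches = [torch.randn(8, 128, device=device) for _ in range(8)]

optimizer.zero_grad()
for step, x in enumerate(micro_batches):
    # scale each loss so the accumulated gradient matches the large batch
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()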
4. Set ddp_timeout and NCCL_TIMEOUT to a larger value such as 30000 (tune within roughly 30-30000 depending on the data size; check your framework's documentation for the unit, e.g. the Hugging Face Trainer's ddp_timeout argument is in seconds), as sketched below.
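At the PyTorch level, the collective timeout is the timeout parameter of torch.distributed.init_process_group; NCCL_TIMEOUT as an environment variable is read by some training frameworks rather than by NCCL itself, so treat it as framework-specific. A hedged sketch (requires a distributed launch such as torchrun, which supplies MASTER_ADDR/MASTER_PORT):

import os
from datetime import timedelta
import torch.distributed as dist

# framework-specific: some launch scripts read this; NCCL itself does not
os.environ["NCCL_TIMEOUT"] = "30000"

# standard PyTorch knob: a collective stalled longer than this is aborted
# instead of hanging the whole job
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=1800))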