🚀Debug Column
For other debugging notes, see the [Debug Column] linked above.
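🍀🍀Background
An mmseg project had been trained for several dozen epochs on one server and was then resumed on a new server. The resumed run completed 500 iterations, and the next iteration failed with RuntimeError: CUDA error: an illegal memory access was encountered
📸📸Error message
The failure happened at iteration 500 of an epoch. The detailed error output is: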
File "/data/xx/project/mmsegmentation-master/mmseg/models/losses/accuracy.py", line 49, in accuracy
correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
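If this were an environment problem, why did training run fine before and only fail after resuming from a checkpoint? And does "CUDA error: an illegal memory access was encountered" point to a host-memory or GPU-memory problem?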
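The error message above itself suggests passing CUDA_LAUNCH_BLOCKING=1, because CUDA kernels run asynchronously and the reported line may not be the one that actually failed. A minimal sketch of setting it from Python follows. The helper name `enable_blocking_launches` is hypothetical, and the sketch assumes the variable has to be set before CUDA is first used in the process.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    With this set, CUDA kernel launches are synchronous, so the Python
    traceback should point at the call that really failed. Training gets
    slower, so this is meant for debugging runs only.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


# Call this at the very top of the entry script, before importing torch
# or anything else that may touch the GPU (assumption stated above).
enable_blocking_launches()
```

Prefixing the training command with CUDA_LAUNCH_BLOCKING=1 in the shell should have the same effect without touching any code.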
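🙋🙋Solution
1. Analyzing the cause
If GPU memory had simply run out, the error would have been "out of memory", so what is behind this memory error?
As an experiment, I reduced batch_size. The Python config looks like this: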
data = dict(
    samples_per_gpu=16,   # batch size per GPU
    workers_per_gpu=16,   # dataloader worker processes per GPU
    # ......
)
After changing batch_size and the number of workers, the error did indeed go away.
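To try several smaller settings without editing the config file by hand each time, one option is a small helper that copies the data dict and overrides only the two loader fields. This is a sketch on a plain Python dict, not on the real mmcv config object. The name `with_smaller_batch` and the values 8 and 4 are purely illustrative.

```python
import copy


def with_smaller_batch(data_cfg, samples_per_gpu, workers_per_gpu):
    """Return a copy of a data config with the loader settings overridden.

    The original dict is left untouched so the known-good values are
    still available for comparison.
    """
    if samples_per_gpu < 1:
        raise ValueError("samples_per_gpu must be at least 1")
    if workers_per_gpu < 0:
        raise ValueError("workers_per_gpu must not be negative")
    new_cfg = copy.deepcopy(data_cfg)
    new_cfg["samples_per_gpu"] = samples_per_gpu
    new_cfg["workers_per_gpu"] = workers_per_gpu
    return new_cfg


# Illustrative values only; the post does not say which values failed.
data = dict(samples_per_gpu=16, workers_per_gpu=16)
smaller = with_smaller_batch(data, samples_per_gpu=8, workers_per_gpu=4)
print(smaller)
```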
2. Full Python code
The full Python config for loading the training data is:
data = dict(
    samples_per_gpu=16,   # batch size per GPU
    workers_per_gpu=16,   # dataloader worker processes per GPU
    train=dict(
        type='PascalVOCDataset',
        data_root='/data/123',
        img_dir='images',
        ann_dir='labels',
        split='train.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', reduce_zero_label=False),
            dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ]))
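🌷🌷Summary
An error that mentions "memory" is not necessarily an out-of-memory error, but it is still worth suspecting host memory or GPU memory first. The first debugging step is to adjust batch_size: start from a fairly small value, then increase it gradually as long as GPU memory does not overflow.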
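The "start small, then increase" procedure can be sketched as a simple doubling search. Everything here is hypothetical: `try_batch` stands for whatever short trial run you use to check one batch size, and `fake_trial` is only a stand-in so the example runs without a GPU.

```python
def largest_working_batch(try_batch, start=1, limit=64):
    """Double the batch size from `start` while `try_batch` succeeds.

    `try_batch(bs)` should return True if a short training run with
    batch size `bs` finishes without error. Returns the largest size
    that worked, or None if even `start` fails.
    """
    best = None
    bs = start
    while bs <= limit and try_batch(bs):
        best = bs
        bs *= 2
    return best


def fake_trial(bs):
    """Stand-in for a real trial run: pretend anything above 16 fails."""
    return bs <= 16


print(largest_working_batch(fake_trial))
```

In a real setup each trial would probably need to run in its own process, since a CUDA error like the one above tends to leave the current process unable to use the GPU afterwards.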