[Debug] RuntimeError: CUDA error: an illegal memory access was encountered

zy_destiny

Last modified 2023-11-08 11:05:08


Categories: Python, mmSegmentation, Debug. Tags: artificial intelligence, deep learning, machine learning

First published 2023-08-07 10:04:37

Article link: https://blog.csdn.net/qq_38308388/article/details/132140117


🚀 Debug column

For other debugging notes, see the author's Debug column.

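🍀🍀 Background

An mmseg project had been trained for several dozen epochs on one server and was then resumed on a new server. After 500 iterations on the new machine, the next iteration failed with RuntimeError: CUDA error: an illegal memory access was encountered.

📸📸 Error message

The failure occurred at iteration 500 of an epoch. The detailed error output is: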

  File "/data/xx/project/mmsegmentation-master/mmseg/models/losses/accuracy.py", line 49, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered

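If this were an environment problem, why did training run fine before and only fail now, when resuming from a checkpoint? Could "CUDA error: an illegal memory access was encountered" point to a host-memory or GPU-memory problem?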
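The log itself notes that CUDA errors may be reported asynchronously, so the line in accuracy.py is not necessarily where the fault happened, and it suggests passing CUDA_LAUNCH_BLOCKING=1. A minimal sketch of setting that variable from Python follows. The helper name `enable_blocking_cuda_launches` is made up for illustration, and the sketch assumes it runs at the very top of the entry script, before CUDA is initialised.

```python
import os


def enable_blocking_cuda_launches():
    """Ask CUDA to launch kernels synchronously so an error surfaces at the failing call."""
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


# Assumption: this runs before torch initialises CUDA, for example as the
# first statement of the training entry script.
enable_blocking_cuda_launches()
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # prints 1
```

Setting the variable in the shell when launching the training script should have the same effect. Synchronous launches slow training down, so this is a debugging aid to switch off again afterwards.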

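🙋🙋 Solution

1. Analyzing the cause

If GPU memory had simply run out, the error would have been "out of memory". So what causes this memory error?

As an experiment, I reduced batch_size. The Python config is shown below: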

data = dict(
    samples_per_gpu=16,
    workers_per_gpu=16,
    # ...... (remaining settings omitted)
    )

After changing batch_size and the number of workers, the error went away.
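Since an mmseg-style `data` config is a plain Python dict, the change can also be sketched as a small helper that returns a copy with smaller values. The helper name `with_smaller_batch` is hypothetical, and the starting values of 32 are invented for the example; the post does not state what the values were before the change.

```python
import copy


def with_smaller_batch(data_cfg, samples_per_gpu, workers_per_gpu):
    """Return a copy of a `data` dict with new batch-size and worker settings."""
    new_cfg = copy.deepcopy(data_cfg)  # leave the caller's config untouched
    new_cfg["samples_per_gpu"] = samples_per_gpu
    new_cfg["workers_per_gpu"] = workers_per_gpu
    return new_cfg


# Hypothetical starting values, for illustration only.
original = dict(samples_per_gpu=32, workers_per_gpu=32,
                train=dict(type="PascalVOCDataset"))
reduced = with_smaller_batch(original, samples_per_gpu=16, workers_per_gpu=16)
print(reduced["samples_per_gpu"], reduced["workers_per_gpu"])  # prints 16 16
```

Working on a copy keeps the original settings around, which makes it easy to compare runs or revert.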

2. Full Python code

The full Python config for loading the training data is shown below:

data = dict(
    samples_per_gpu=16,
    workers_per_gpu=16,
    train=dict(
        type='PascalVOCDataset',
        data_root='/data/123',
        img_dir='images',
        ann_dir='labels',
        split='train.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', reduce_zero_label=False),
            dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ]))

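🌷🌷 Summary

A memory-related error is not necessarily "out of memory", but whenever you see one, your first reflex should be to suspect host memory or GPU memory. The first debugging step is to adjust batch_size: start with a fairly small value, then increase it gradually as long as GPU memory does not overflow.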
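The "start small, then increase" strategy can be sketched as a simple doubling search. Everything here is an assumption for illustration: `try_step` is a hypothetical callable that runs a short trial at a given batch size and returns True on success, and `fake_try_step` merely pretends that sizes above 16 fail.

```python
def largest_working_batch_size(try_step, start=1, limit=64):
    """Double the batch size from `start` while trials succeed.

    Returns the last size that worked, or None if even `start` failed.
    """
    best = None
    size = start
    while size <= limit:
        if not try_step(size):
            break
        best = size
        size *= 2
    return best


def fake_try_step(batch_size):
    # Stand-in for a real trial run: pretend anything above 16 fails.
    return batch_size <= 16


print(largest_working_batch_size(fake_try_step))  # prints 16
```

In a real setup each trial would probably need to run in a fresh process, because after an illegal memory access the CUDA context of the failing process is generally no longer usable.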
