The Long Road to Mastering mmsegmentation: Bug Notes (1)

Series index
  1. The Long Road to Mastering mmsegmentation: Bug Notes (1)
  2. The Long Road to Mastering mmsegmentation: Bug Notes (2)
  3. The Long Road to Mastering mmsegmentation: Bug Notes (3)

RuntimeError: CUDA error: device-side assert triggered

CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread: 
...
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [11,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [12,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [13,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "F:/research/deeplabv3/train.py", line 71, in <module>
    train_segmentor(model, datasets, cfg, distributed=False, validate=True, meta=dict())
  File "f:\research\openmmlab\mmsegmentation\mmseg\apis\train.py", line 194, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\iter_based_runner.py", line 64, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\parallel\data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\base.py", line 138, in train_step
    losses = self(**data_batch)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\base.py", line 108, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\encoder_decoder.py", line 144, in forward_train
    loss_decode = self._decode_head_forward_train(x, img_metas,
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\encoder_decoder.py", line 87, in _decode_head_forward_train
    loss_decode = self.decode_head.forward_train(x, img_metas,
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\decode_heads\decode_head.py", line 233, in forward_train
    losses = self.losses(seg_logits, gt_semantic_seg)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\fp16_utils.py", line 208, in new_func
    return old_func(*args, **kwargs)
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\decode_heads\decode_head.py", line 270, in losses
    seg_weight = self.sampler.sample(seg_logit, seg_label)
  File "f:\research\openmmlab\mmsegmentation\mmseg\core\seg\sampler\ohem_pixel_sampler.py", line 56, in sample
    sort_prob, sort_indices = seg_prob[valid_mask].sort()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Process finished with exit code -1073740791 (0xC0000409)
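
The log suggests passing CUDA_LAUNCH_BLOCKING=1 so that kernels launch synchronously and the stack trace points at the operation that actually failed. A minimal way to do that from the top of the training script (the environment variable must be set before anything initializes CUDA):

# Make CUDA kernel launches synchronous so the Python traceback points at the
# real failing operation instead of a later, unrelated API call.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # set before CUDA is initialized

import torch  # import torch only after the variable is set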

Cause analysis:
1. The number of classes in the training labels does not match the configured num_classes.
Solution:
Check the label annotations for values outside the expected range. If any are found, delete those samples or regenerate the labels.
Note: binary-class labels should be created with PIL (see the code further below); masks written with OpenCV end up as 3-band images, which makes the model error out when reading them.
Here is the simple label-class checking script I use:

import numpy as np
import cv2
import os.path as osp

dataroot = r'data/Satellite_buildings_data'
imgs = "src"
labels = 'label'

train_data = osp.join(dataroot, labels)
with open(osp.join(dataroot, 'split/val.txt'), 'r') as f:
    lines = f.readlines()

# Count how many pixels of each label value appear across all label images
label_count = dict()
for file in lines:
    # IMREAD_UNCHANGED keeps single-channel label masks single-channel
    # (the default flag expands them to 3-channel BGR and triples the counts)
    dataset = cv2.imread(osp.join(train_data, file.strip() + '.png'), cv2.IMREAD_UNCHANGED)
    class_num = np.unique(dataset)
    for num in class_num:
        temp = np.sum(dataset == num)
        if str(num) in label_count:
            label_count[str(num)] += temp
        else:
            label_count[str(num)] = temp
print(label_count)
# Create binary-class label masks with PIL (single band, values 0/1)
import os
import numpy as np
from PIL import Image

dataroot = r'F:\research\floodext\floodDataset\labels'
out_root = r'F:\research\floodext\floodDataset\label'

for file in os.listdir(dataroot):
    img = Image.open(os.path.join(dataroot, file)).convert('L')  # force single band
    img = img.point(lambda x: 1 if x > 0 else 0)  # binarize: anything > 0 becomes class 1
    print(np.unique(img))  # sanity check: should print [0 1] (or [0]/[1])
    img.save(os.path.join(out_root, file))

Addendum (2023.9.28):

    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: an illegal memory access was encountered

Recently, while working on multi-class semantic segmentation (change detection that includes class labels), I ran into a new problem whose root cause is again a mismatch between the number of label classes and the configured number. The fix is the same as above.
This one also cost me a day, so I am recording the debugging process here. Judging from the error message alone, I first suspected insufficient GPU memory or a problem in the loss function. The loss function was unlikely to be at fault, and if memory were the issue, lowering the batch size should have helped. With batch_size reduced to 2, training did run, but the loss was 0 and the accuracy was 0, and the GPU used only 2 GB, far below its usage at full load. I finally went back to checking the labels and found that some of them were 3-channel color annotations. After converting them to values 0~n, training ran successfully.
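
A minimal sketch of that last conversion step, assuming a hypothetical PALETTE mapping from RGB colors to class indices (the colors and paths below are placeholders, not my actual dataset):

# Convert 3-channel color annotations to single-band index masks (values 0..n-1).
import os
import numpy as np
from PIL import Image

PALETTE = {(0, 0, 0): 0, (255, 0, 0): 1, (0, 255, 0): 2}  # RGB color -> class index (placeholder)
in_root = r'path/to/color_labels'
out_root = r'path/to/index_labels'

for name in os.listdir(in_root):
    rgb = np.array(Image.open(os.path.join(in_root, name)).convert('RGB'))
    mask = np.zeros(rgb.shape[:2], dtype=np.uint8)
    for color, idx in PALETTE.items():
        mask[np.all(rgb == np.array(color), axis=-1)] = idx
    Image.fromarray(mask, mode='L').save(os.path.join(out_root, name))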

2. The predictions contain NaN values
For example, I enabled automatic mixed precision training; some operators in the network may not support AMP, which produces NaN values and eventually makes the loss computation fail.
Solution:
Check the outputs, the intermediate computations, and the network structure; a quick NaN probe is sketched below.
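
A minimal sketch for locating where NaN first appears, assuming a generic PyTorch model (this hook-based probe is my own addition, not part of mmsegmentation):

# Register forward hooks that report the first module whose output contains NaN/Inf.
import torch

def add_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f'NaN/Inf detected in output of module: {name}')
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: call add_nan_hooks(model) before training starts. Alternatively, wrap the
# training step in torch.autograd.detect_anomaly() to trace NaN in the backward pass.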

ValueError: Expected more than 1 value per channel when training, got input size [1, 512, 1, 1]

There are two possible causes:

  1. BatchNorm needs more than one value per channel, otherwise the mean cannot be computed and it errors out.
  2. If the error persists with batch_size greater than 1, the dataset size is probably not divisible by batch_size, leaving a final batch of size 1. Set drop_last=True in the DataLoader to discard the incomplete last batch (see the sketch after this list).
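
A minimal sketch of the second fix with a plain PyTorch DataLoader (in mmsegmentation, check your version's data-loading code or config for where drop_last is forwarded to the DataLoader; the toy dataset below is a placeholder):

# Drop the incomplete final batch so BatchNorm never sees a batch of size 1.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(101, 3, 64, 64))  # 101 samples: not divisible by 4
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)

for (batch,) in loader:
    assert batch.shape[0] == 4  # the leftover single-sample batch is discarded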

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

  File "f:\research\openmmlab\mmsegmentation\mmseg\models\backbones\resnet.py", line 662, in forward
    x = self.stem(x)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Process finished with exit code -1073741510 (0xC000013A: interrupted by Ctrl+C)

Cause analysis:
Stepping through with the debugger shows that in torch\nn\modules\batchnorm.py, lines 725-732 (quoted below), need_sync is True but process_group is None. need_sync indicates whether SyncBatchNorm is active. My config used SyncBN, but I was training on a single GPU, hence the error.

725        # Don't sync batchnorm stats in inference mode (model.eval()).
726        need_sync = (bn_training and self.training)
727        if need_sync:
728            process_group = torch.distributed.group.WORLD
729            if self.process_group:
730                process_group = self.process_group
731            world_size = torch.distributed.get_world_size(process_group)
732            need_sync = world_size > 1

Solution:
For multi-GPU training keep 'SyncBN'; for single-GPU training change the norm type to 'BN', as in the config sketch below.
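
For reference, a sketch of the relevant part of an mmsegmentation (0.x-style) config, assuming the usual norm_cfg convention; adapt it to your own model config:

# Single-GPU training: use plain BatchNorm instead of SyncBN.
norm_cfg = dict(type='BN', requires_grad=True)       # single GPU
# norm_cfg = dict(type='SyncBN', requires_grad=True) # multi-GPU distributed training

model = dict(
    backbone=dict(norm_cfg=norm_cfg),
    decode_head=dict(norm_cfg=norm_cfg),
    auxiliary_head=dict(norm_cfg=norm_cfg))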

RuntimeError: weight tensor should be defined either for all or no classes

  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\losses\cross_entropy_loss.py", line 271, in forward
    loss_cls = self.loss_weight * self.cls_criterion(
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\losses\cross_entropy_loss.py", line 45, in cross_entropy
    loss = F.cross_entropy(
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: weight tensor should be defined either for all or no classes

Process finished with exit code -1

Cause analysis:
This is a class-weight problem: check whether the length of class_weight matches the number of classes. In my case, I had switched to a new binary-class dataset while the config still came from Cityscapes with its 19 classes, so the 19-entry weight list no longer matched the new number of classes and the loss raised this error.
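
A sketch of a matching loss config for a binary-class dataset (the weight values are placeholders; the point is that len(class_weight) must equal num_classes):

# decode_head loss config: class_weight must have exactly num_classes entries.
decode_head = dict(
    num_classes=2,
    loss_decode=dict(
        type='CrossEntropyLoss',
        use_sigmoid=False,
        loss_weight=1.0,
        class_weight=[1.0, 2.0]))  # placeholder weights, one per class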
