The Long Road to Mastering mmsegmentation: Bug Notes (1)

Series index
  1. The Long Road to Mastering mmsegmentation: Bug Notes (1)
  2. The Long Road to Mastering mmsegmentation: Bug Notes (2)
  3. The Long Road to Mastering mmsegmentation: Bug Notes (3)

RuntimeError: CUDA error: device-side assert triggered

CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [82,0,0], thread: 
...
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [11,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [12,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [105,0,0], thread: [13,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "F:/research/deeplabv3/train.py", line 71, in <module>
    train_segmentor(model, datasets, cfg, distributed=False, validate=True, meta=dict())
  File "f:\research\openmmlab\mmsegmentation\mmseg\apis\train.py", line 194, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\iter_based_runner.py", line 64, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\parallel\data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\base.py", line 138, in train_step
    losses = self(**data_batch)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\base.py", line 108, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\encoder_decoder.py", line 144, in forward_train
    loss_decode = self._decode_head_forward_train(x, img_metas,
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\segmentors\encoder_decoder.py", line 87, in _decode_head_forward_train
    loss_decode = self.decode_head.forward_train(x, img_metas,
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\decode_heads\decode_head.py", line 233, in forward_train
    losses = self.losses(seg_logits, gt_semantic_seg)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\mmcv\runner\fp16_utils.py", line 208, in new_func
    return old_func(*args, **kwargs)
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\decode_heads\decode_head.py", line 270, in losses
    seg_weight = self.sampler.sample(seg_logit, seg_label)
  File "f:\research\openmmlab\mmsegmentation\mmseg\core\seg\sampler\ohem_pixel_sampler.py", line 56, in sample
    sort_prob, sort_indices = seg_prob[valid_mask].sort()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Process finished with exit code -1073740791 (0xC0000409)
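
The log suggests passing CUDA_LAUNCH_BLOCKING=1 so that kernels launch synchronously and the stack trace points at the operation that actually failed. A minimal way to do that from the top of the training script (the environment variable must be set before anything initializes CUDA):

# Make CUDA kernel launches synchronous so the Python traceback points at the
# real failing operation instead of a later, unrelated API call.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # set before CUDA is initialized

import torch  # import torch only after the variable is set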

Cause analysis:
1. The number of classes in the training labels does not match the configured num_classes.
Solution:
Check the label annotations for values outside the expected range. If any are found, delete those samples or regenerate the labels.
Note: binary-class labels should be created with PIL (see the code further below); masks written with OpenCV end up as 3-band images, which makes the model error out when reading them.
Here is the simple label-class checking script I use:

import numpy as np
import cv2
import os.path as osp

dataroot = r'data/Satellite_buildings_data'
imgs = "src"
labels = 'label'

train_data = osp.join(dataroot, labels)
with open(osp.join(dataroot, 'split/val.txt'), 'r') as f:
    lines = f.readlines()

# Count how many pixels of each label value appear across all label images
label_count = dict()
for file in lines:
    # IMREAD_UNCHANGED keeps single-channel label masks single-channel
    # (the default flag expands them to 3-channel BGR and triples the counts)
    dataset = cv2.imread(osp.join(train_data, file.strip() + '.png'), cv2.IMREAD_UNCHANGED)
    class_num = np.unique(dataset)
    for num in class_num:
        temp = np.sum(dataset == num)
        if str(num) in label_count:
            label_count[str(num)] += temp
        else:
            label_count[str(num)] = temp
print(label_count)
# Create binary-class label masks with PIL (single band, values 0/1)
import os
import numpy as np
from PIL import Image

dataroot = r'F:\research\floodext\floodDataset\labels'
out_root = r'F:\research\floodext\floodDataset\label'

for file in os.listdir(dataroot):
    img = Image.open(os.path.join(dataroot, file)).convert('L')  # force single band
    img = img.point(lambda x: 1 if x > 0 else 0)  # binarize: anything > 0 becomes class 1
    print(np.unique(img))  # sanity check: should print [0 1] (or [0]/[1])
    img.save(os.path.join(out_root, file))

Addendum (2023.9.28):

    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: an illegal memory access was encountered

Recently, while working on multi-class semantic segmentation (change detection that includes class labels), I ran into a new problem whose root cause is again a mismatch between the number of label classes and the configured number. The fix is the same as above.
This one also cost me a day, so I am recording the debugging process here. Judging from the error message alone, I first suspected insufficient GPU memory or a problem in the loss function. The loss function was unlikely to be at fault, and if memory were the issue, lowering the batch size should have helped. With batch_size reduced to 2, training did run, but the loss was 0 and the accuracy was 0, and the GPU used only 2 GB, far below its usage at full load. I finally went back to checking the labels and found that some of them were 3-channel color annotations. After converting them to values 0~n, training ran successfully.
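
A minimal sketch of that last conversion step, assuming a hypothetical PALETTE mapping from RGB colors to class indices (the colors and paths below are placeholders, not my actual dataset):

# Convert 3-channel color annotations to single-band index masks (values 0..n-1).
import os
import numpy as np
from PIL import Image

PALETTE = {(0, 0, 0): 0, (255, 0, 0): 1, (0, 255, 0): 2}  # RGB color -> class index (placeholder)
in_root = r'path/to/color_labels'
out_root = r'path/to/index_labels'

for name in os.listdir(in_root):
    rgb = np.array(Image.open(os.path.join(in_root, name)).convert('RGB'))
    mask = np.zeros(rgb.shape[:2], dtype=np.uint8)
    for color, idx in PALETTE.items():
        mask[np.all(rgb == np.array(color), axis=-1)] = idx
    Image.fromarray(mask, mode='L').save(os.path.join(out_root, name))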

2. The predictions contain NaN values
For example, I enabled automatic mixed precision training; some operators in the network may not support AMP, which produces NaN values and eventually makes the loss computation fail.
Solution:
Check the outputs, the intermediate computations, and the network structure; a quick NaN probe is sketched below.
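
A minimal sketch for locating where NaN first appears, assuming a generic PyTorch model (this hook-based probe is my own addition, not part of mmsegmentation):

# Register forward hooks that report the first module whose output contains NaN/Inf.
import torch

def add_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f'NaN/Inf detected in output of module: {name}')
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: call add_nan_hooks(model) before training starts. Alternatively, wrap the
# training step in torch.autograd.detect_anomaly() to trace NaN in the backward pass.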

ValueError: Expected more than 1 value per channel when training, got input size [1, 512, 1, 1]

There are two possible causes:

  1. BatchNorm needs more than one value per channel, otherwise the mean cannot be computed and it errors out.
  2. If the error persists with batch_size greater than 1, the dataset size is probably not divisible by batch_size, leaving a final batch of size 1. Set drop_last=True in the DataLoader to discard the incomplete last batch (see the sketch after this list).
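
A minimal sketch of the second fix with a plain PyTorch DataLoader (in mmsegmentation, check your version's data-loading code or config for where drop_last is forwarded to the DataLoader; the toy dataset below is a placeholder):

# Drop the incomplete final batch so BatchNorm never sees a batch of size 1.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(101, 3, 64, 64))  # 101 samples: not divisible by 4
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)

for (batch,) in loader:
    assert batch.shape[0] == 4  # the leftover single-sample batch is discarded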

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

  File "f:\research\openmmlab\mmsegmentation\mmseg\models\backbones\resnet.py", line 662, in forward
    x = self.stem(x)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Process finished with exit code -1073741510 (0xC000013A: interrupted by Ctrl+C)

Cause analysis:
Stepping through with the debugger shows that in torch\nn\modules\batchnorm.py, lines 725-732 (quoted below), need_sync is True but process_group is None. need_sync indicates whether SyncBatchNorm is active. My config used SyncBN, but I was training on a single GPU, hence the error.

725        # Don't sync batchnorm stats in inference mode (model.eval()).
726        need_sync = (bn_training and self.training)
727        if need_sync:
728            process_group = torch.distributed.group.WORLD
729            if self.process_group:
730                process_group = self.process_group
731            world_size = torch.distributed.get_world_size(process_group)
732            need_sync = world_size > 1

Solution:
For multi-GPU training keep 'SyncBN'; for single-GPU training change the norm type to 'BN', as in the config sketch below.
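
For reference, a sketch of the relevant part of an mmsegmentation (0.x-style) config, assuming the usual norm_cfg convention; adapt it to your own model config:

# Single-GPU training: use plain BatchNorm instead of SyncBN.
norm_cfg = dict(type='BN', requires_grad=True)       # single GPU
# norm_cfg = dict(type='SyncBN', requires_grad=True) # multi-GPU distributed training

model = dict(
    backbone=dict(norm_cfg=norm_cfg),
    decode_head=dict(norm_cfg=norm_cfg),
    auxiliary_head=dict(norm_cfg=norm_cfg))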

RuntimeError: weight tensor should be defined either for all or no classes

  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\losses\cross_entropy_loss.py", line 271, in forward
    loss_cls = self.loss_weight * self.cls_criterion(
  File "f:\research\openmmlab\mmsegmentation\mmseg\models\losses\cross_entropy_loss.py", line 45, in cross_entropy
    loss = F.cross_entropy(
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: weight tensor should be defined either for all or no classes

Process finished with exit code -1

Cause analysis:
This is a class-weight problem: check whether the length of class_weight matches the number of classes. In my case, I had switched to a new binary-class dataset while the config still came from Cityscapes with its 19 classes, so the 19-entry weight list no longer matched the new number of classes and the loss raised this error.
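
A sketch of a matching loss config for a binary-class dataset (the weight values are placeholders; the point is that len(class_weight) must equal num_classes):

# decode_head loss config: class_weight must have exactly num_classes entries.
decode_head = dict(
    num_classes=2,
    loss_decode=dict(
        type='CrossEntropyLoss',
        use_sigmoid=False,
        loss_weight=1.0,
        class_weight=[1.0, 2.0]))  # placeholder weights, one per class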
