mmdetection training error

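Training with mmdetection failed with a "device-side assert triggered" error. The cause was that the categories list in the COCO-format annotations.json contained a category id of 0, which is not allowed. After tracking this down and fixing it, training could continue.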


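The conclusion first: in a COCO-format annotations.json, no category may have id 0 (0 is the background class). This covers the id field of every entry in categories, and therefore the category_id values in annotations that refer to those entries.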
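A minimal sketch of how one might check an annotation file for this problem and renumber the ids so they start at 1. The function names (load_coco, find_zero_category_ids, remap_category_ids) are hypothetical helpers, not part of mmdetection, and the sketch assumes the standard COCO layout with top-level "categories" and "annotations" lists. Renumbering is only one possible fix; if the entry with id 0 is a pure background placeholder with no annotations, deleting it may be the better choice.

```python
import json


def load_coco(path):
    """Load a COCO-style annotation file into a plain dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def find_zero_category_ids(coco):
    """Return the names of all categories whose id is 0."""
    return [cat.get("name", "") for cat in coco.get("categories", [])
            if cat["id"] == 0]


def remap_category_ids(coco, start=1):
    """Return a copy of the dict with category ids renumbered from `start`.

    The old ids are sorted and mapped to start, start + 1, ... and the
    category_id of every annotation is rewritten with the same mapping.
    The input dict is not modified.
    """
    old_ids = sorted(cat["id"] for cat in coco["categories"])
    mapping = {old: new for new, old in enumerate(old_ids, start=start)}
    fixed = dict(coco)
    fixed["categories"] = [dict(cat, id=mapping[cat["id"]])
                           for cat in coco["categories"]]
    fixed["annotations"] = [dict(ann, category_id=mapping[ann["category_id"]])
                            for ann in coco["annotations"]]
    return fixed
```

To use it, load the file with load_coco("annotations.json"), check whether find_zero_category_ids returns anything, and if so write the result of remap_category_ids back out with json.dump. The number of classes in the training config may also need to be checked against the category list afterwards.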

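It took several hours of digging, and nearly drove me mad, to reach this conclusion. I am sharing it so that others can avoid the same pitfall.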

 

The full error output is as follows:

2020-01-04 12:18:44,206 - INFO - Distributed training: False
2020-01-04 12:18:45,023 - INFO - load model from: torchvision://resnet50
2020-01-04 12:18:45,213 - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

missing keys in source state_dict: layer3.4.conv2_offset.weight, layer2.3.conv2_offset.bias, layer4.0.conv2_offset.bias, layer3.3.conv2_offset.bias, layer2.1.conv2_offset.weight, layer3.1.conv2_offset.weight, layer3.3.conv2_offset.weight, layer4.2.conv2_offset.weight, layer2.0.conv2_offset.bias, layer2.1.conv2_offset.bias, layer3.0.conv2_offset.weight, layer4.2.conv2_offset.bias, layer2.0.conv2_offset.weight, layer2.2.conv2_offset.bias, layer2.2.conv2_offset.weight, layer3.2.conv2_offset.weight, layer3.2.conv2_offset.bias, layer3.0.conv2_offset.bias, layer4.1.conv2_offset.bias, layer4.1.conv2_offset.weight, layer4.0.conv2_offset.weight, layer2.3.conv2_offset.weight, layer3.5.conv2_offset.bias, layer3.5.conv2_offset.weight, layer3.4.conv2_offset.bias, layer3.1.conv2_offset.bias

loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
2020-01-04 12:18:47,564 - INFO - load checkpoint from ./work_dirs/cascade_rcnn_dconv_c3-c5_r50_fpn_1x/latest.pth
2020-01-04 12:18:47,854 - INFO - Start running, host: root@91b7c01c2149, work_dir: /competition/work_dirs/cascade_rcnn_dconv_c3-c5_r50_fpn_1x
2020-01-04 12:18:47,854 - INFO - workflow: [('train', 1)], max: 12 epochs
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [0,0,0], thread: [3,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [0,0,0], thread: [4,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [0,0,0], thread: [6,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [0,0,0], thread: [24,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
  File "tools/train.py", line 108, in <module>
    main()
  File "tools/train.py", line 104, in main
    logger=logger)
  File "/competition/mmdet/apis/train.py", line 60, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/competition/mmdet/apis/train.py", line 221, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/competition/mmdet/apis/train.py", line 38, in batch_processor
    losses = model(**data)
  File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/competition/mmdet/core/fp16/decorators.py", line 75, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/competition/mmdet/models/detectors/base.py", line 86, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/competition/mmdet/models/detectors/cascade_rcnn.py", line 219, in forward_train
    loss_bbox = bbox_head.loss(cls_score, bbox_pred, *bbox_targets)
  File "/competition/mmdet/core/fp16/decorators.py", line 152, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/competition/mmdet/models/bbox_heads/bbox_head.py", line 120, in loss
    pos_bbox_pred = bbox_pred.view(bbox_pred.size(0), 4)[pos_inds]
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1556653215914/work/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f9374698dc5 in /home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14792 (0x7f93715a7792 in /home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x50 (0x7f9374688640 in /home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x3067fb (0x7f9371cc77fb in /home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x14019b (0x7f939a47919b in /home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3bfc84 (0x7f939a6f8c84 in /home/mmdetection/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function>
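The assertion repeated in the log, cur_target >= 0 && cur_target < n_classes, indicates that at least one class label reaching the loss was outside the valid range. The same condition can be checked in plain Python on the CPU before training, where a failure is much easier to read than a device-side assert. This is only an illustrative sketch: out_of_range_targets is a hypothetical helper, and the label list and class count below are made-up values, not taken from the run above.

```python
def out_of_range_targets(targets, n_classes):
    """Return (index, label) pairs that violate 0 <= label < n_classes."""
    return [(i, t) for i, t in enumerate(targets) if not 0 <= t < n_classes]


# Made-up example: four labels checked against a head with 4 classes.
labels = [0, 3, 1, 4]
bad = out_of_range_targets(labels, n_classes=4)
```

Any pair in the result points at a label that would trip the assertion, which in turn points back to a mismatch between the category ids in the annotation file and the number of classes the model expects.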
