Following up on the previous post [1], the backbone now needs to be swapped for DeiT-tiny [2,3].
MMDetection [4] does not support DeiT out of the box (there is nothing under backbones/), but MMClassification [5] has an implementation. Following [6,7], MMClassification models can be called directly from MMDetection.
Since DeiT is a transformer architecture, and so is DETR [8,9], which MMDetection does support directly, the plan is to start from DETR's config file [10] and modify it.
Configuration
Config files are introduced in [11-16]. Comparing the existing config files against the class definitions under mmdetection/mmdet/models/, the fields configured under `model` correspond to the constructor arguments of the matching model class. So to know which entries `model/backbone` needs when DeiT-tiny is the backbone, check the constructor arguments of deit.py and its parent classes.
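This field-to-constructor correspondence comes from OpenMMLab's registry pattern: the `type` key selects a registered class, and every remaining key is forwarded as a constructor keyword argument. A toy sketch of the idea (illustrative names only, not MMDetection's actual API):

```python
# Toy version of the OpenMMLab config/registry pattern: 'type' picks a
# registered class; all other keys are forwarded to __init__.
REGISTRY = {}

def register(cls):
    REGISTRY[cls.__name__] = cls
    return cls

def build_from_cfg(cfg):
    cfg = dict(cfg)                  # copy so the caller's dict survives
    cls = REGISTRY[cfg.pop('type')]  # look the class up by name
    return cls(**cfg)                # leftover keys -> constructor kwargs

@register
class VisionTransformer:
    # A stand-in with a few DeiT-relevant arguments; the real class in
    # mmclassification takes many more.
    def __init__(self, arch='base', img_size=224, patch_size=16,
                 out_indices=-1):
        self.arch, self.img_size = arch, img_size
        self.patch_size, self.out_indices = patch_size, out_indices

backbone = build_from_cfg(
    dict(type='VisionTransformer', arch='deit-tiny', out_indices=-1))
assert backbone.arch == 'deit-tiny' and backbone.out_indices == -1
```

This is also why a stray key in a config crashes construction: every field must be an actual constructor parameter.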
According to [10], the dataset-related config it inherits is coco_detection.py. As in [1], the two config files are modified separately:
scannet_detection.py
- How to change a dataset's class set is described in [17,18]: modify/add `classes`, `data/train/dataset/classes`, `data/val/classes`, and `data/test/classes`. The class set itself comes from [19].
# Inherited from: mmdetection/configs/_base_/datasets/coco_detection.py
# fit to ScanNet-frames-25k
dataset_type = 'CocoDataset'
classes = (
    "wall", "floor", "cabinet", "bed", "chair",
    "sofa", "table", "door", "window", "bookshelf",
    "picture", "counter", "desk", "curtain", "refrigerator",
    "shower curtain", "toilet", "sink", "bathtub", "otherfurniture"
)
data_root = 'data/scannet-frames/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'scannet_objdet_train.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline,
        classes=classes),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'scannet_objdet_val.json',
        img_prefix=data_root + 'val/',
        pipeline=test_pipeline,
        classes=classes),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'scannet_objdet_val.json',
        img_prefix=data_root + 'val/',
        pipeline=test_pipeline,
        classes=classes))
evaluation = dict(interval=1, metric='bbox')
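The length of the `classes` tuple must equal the head's `num_classes`, otherwise MMDetection fails with the kind of AssertionError reported in [17]. A standalone sanity check:

```python
# ScanNet benchmark class set (from [19]); its length must match
# bbox_head.num_classes in the model config.
classes = (
    "wall", "floor", "cabinet", "bed", "chair",
    "sofa", "table", "door", "window", "bookshelf",
    "picture", "counter", "desk", "curtain", "refrigerator",
    "shower curtain", "toilet", "sink", "bathtub", "otherfurniture",
)
assert len(classes) == 20        # -> num_classes=20
assert len(set(classes)) == 20   # no duplicated names
```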
detr_deit_tiny_8x1_150e_scannet.py
model/backbone
- How to reference an MMClassification backbone is shown in [7]; `type` and `arch` are copied from deit-tiny_pt-4xb256_in1k.py.
- No feature pyramid network is used and only the last layer's output is taken, hence `out_indices=-1`.
- The remaining entries follow (or are copied from) [10] and the constructors of DistilledVisionTransformer and VisionTransformer.

model/bbox_head
- `num_classes` follows [17,18]; the class count is again taken from [19].
- `in_channels` is the feature dimension of the backbone's last layer, i.e. `embed_dims` of deit-tiny in `arch_zoo`.
- Under `transformer`, both the `encoder` and `decoder` need `transformerlayers/attn_cfgs/embed_dims` and `ffn_cfgs/embed_dims` (see `__init__/ffn_cfgs/embed_dims` in BaseTransformerLayer) changed to match `in_channels`.
- `positional_encoding/num_feats` must be exactly half of `in_channels`.
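The half-of-`in_channels` rule exists because the 2-D sine positional encoding concatenates `num_feats` channels for the y axis with `num_feats` channels for the x axis, giving `2 * num_feats` channels per position. A minimal stdlib sketch of that arithmetic (simplified from the DETR-style formulation; not MMDetection's actual SinePositionalEncoding):

```python
import math

def sine_positional_encoding(h, w, num_feats, temperature=10000):
    """Per position: num_feats sin/cos features for y, then num_feats
    for x, concatenated -> 2 * num_feats channels (simplified sketch)."""
    grid = []
    for y in range(h):
        row = []
        for x in range(w):
            feats = []
            for coord in (y + 1, x + 1):        # one block per axis
                for i in range(num_feats):
                    div = temperature ** (2 * (i // 2) / num_feats)
                    f = math.sin(coord / div) if i % 2 == 0 \
                        else math.cos(coord / div)
                    feats.append(f)
            row.append(feats)
        grid.append(row)
    return grid                                  # h x w x (2 * num_feats)

enc = sine_positional_encoding(2, 3, num_feats=96)
assert len(enc[0][0]) == 192  # 2 * 96, matching in_channels=192
```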
data/samples_per_gpu
- Changed to 1, i.e. batch_size = 1 per GPU, since anything larger runs out of GPU memory on my cards. Following the naming convention in [11], that part of the config file name becomes `8x1`.
# Inherited from mmdetection/configs/detr/detr_r50_8x2_150e_coco.py
## Modifications
# 1. use DeiT-tiny as backbone
# 2. use ScanNet-frames-25k
_base_ = [
    '../_base_/datasets/scannet_detection.py',
    '../../mmdetection/configs/_base_/default_runtime.py'
]
custom_imports = dict(imports=['mmcls.models'], allow_failed_imports=False)
model = dict(
    type='DETR',
    backbone=dict(
        # _delete_=True,  # Delete the backbone field in _base_
        # from: mmclassification/configs/deit/deit-tiny_pt-4xb256_in1k.py
        type='mmcls.VisionTransformer',
        arch='deit-tiny',
        img_size=224,
        patch_size=16,
        with_cls_token=False,
        output_cls_token=False,
        out_indices=-1,
        # norm_cfg=dict(type='BN', requires_grad=False),
        # norm_eval=True,
        # style='pytorch',
        init_cfg=dict(
            type='Pretrained',
            # from: mmclassification/configs/deit/README.md -> DeiT-tiny
            checkpoint='https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth',
            prefix='backbone.',
        ),
    ),
    bbox_head=dict(
        type='DETRHead',
        num_classes=20,  # from: convert-scannet-coco-objdet.py
        # from: mmclassification/mmcls/models/backbones/vision_transformer.py
        # -> arch_zoo["deit-tiny"]["embed_dims"]
        in_channels=192,
        transformer=dict(
            type='Transformer',
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=192,  # 256,
                            num_heads=8,
                            dropout=0.1)
                    ],
                    feedforward_channels=2048,
                    ffn_dropout=0.1,
                    # from: mmcv/mmcv/cnn/bricks/transformer.py
                    # -> BaseTransformerLayer/__init__/ffn_cfgs
                    ffn_cfgs=dict(
                        embed_dims=192,
                        # feedforward_channels=2048,
                        # ffn_drop=0.1,
                    ),
                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
            decoder=dict(
                type='DetrTransformerDecoder',
                return_intermediate=True,
                num_layers=6,
                transformerlayers=dict(
                    type='DetrTransformerDecoderLayer',
                    attn_cfgs=dict(
                        type='MultiheadAttention',
                        embed_dims=192,  # 256,
                        num_heads=8,
                        dropout=0.1),
                    feedforward_channels=2048,
                    ffn_dropout=0.1,
                    # from: mmcv/mmcv/cnn/bricks/transformer.py
                    # -> BaseTransformerLayer/__init__/ffn_cfgs
                    ffn_cfgs=dict(
                        embed_dims=192,
                        # feedforward_channels=2048,
                        # ffn_drop=0.1,
                    ),
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')),
            )),
        positional_encoding=dict(
            # type='SinePositionalEncoding', num_feats=128, normalize=True),
            type='SinePositionalEncoding', num_feats=192 // 2, normalize=True),
        loss_cls=dict(
            type='CrossEntropyLoss',
            bg_cls_weight=0.1,
            use_sigmoid=False,
            loss_weight=1.0,
            class_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
    # training and testing settings
    train_cfg=dict(
        assigner=dict(
            type='HungarianAssigner',
            cls_cost=dict(type='ClassificationCost', weight=1.),
            reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
            iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0))),
    test_cfg=dict(max_per_img=100))
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
# train_pipeline, NOTE the img_scale and the Pad's size_divisor is different
# from the default setting in mmdet.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='AutoAugment',
        policies=[[
            dict(
                type='Resize',
                img_scale=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
                           (608, 1333), (640, 1333), (672, 1333), (704, 1333),
                           (736, 1333), (768, 1333), (800, 1333)],
                multiscale_mode='value',
                keep_ratio=True)
        ], [
            dict(
                type='Resize',
                img_scale=[(400, 1333), (500, 1333), (600, 1333)],
                multiscale_mode='value',
                keep_ratio=True),
            dict(
                type='RandomCrop',
                crop_type='absolute_range',
                crop_size=(384, 600),
                allow_negative_crop=True),
            dict(
                type='Resize',
                img_scale=[(480, 1333), (512, 1333), (544, 1333),
                           (576, 1333), (608, 1333), (640, 1333),
                           (672, 1333), (704, 1333), (736, 1333),
                           (768, 1333), (800, 1333)],
                multiscale_mode='value',
                override=True,
                keep_ratio=True)
        ]]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=1),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
# test_pipeline, NOTE the Pad's size_divisor is different from the default
# setting (size_divisor=32). While there is little effect on the performance
# whether we use the default setting or use size_divisor=1.
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=1),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,  # 2,
    workers_per_gpu=2,
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))
# optimizer
optimizer = dict(
    type='AdamW',
    lr=0.0001,
    weight_decay=0.0001,
    paramwise_cfg=dict(
        custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=1.0)}))
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
# learning policy
lr_config = dict(policy='step', step=[100])
runner = dict(type='EpochBasedRunner', max_epochs=150)
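With `policy='step'`, `step=[100]`, and `max_epochs=150`, the learning rate stays at 1e-4 for the first 100 epochs and then drops by the decay factor for the remaining 50 (the 0.1 factor below is MMCV's default gamma, assumed here since the config does not set it). A small helper makes the schedule concrete:

```python
def step_lr(epoch, base_lr=1e-4, steps=(100,), gamma=0.1):
    """LR at a 1-indexed epoch under a step policy.
    gamma=0.1 is assumed (MMCV's default decay factor)."""
    n_decays = sum(epoch > s for s in steps)
    return base_lr * gamma ** n_decays

assert step_lr(100) == 1e-4              # before the step
assert abs(step_lr(101) - 1e-5) < 1e-12  # after the step
```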
Training
The code layout is similar to [1]; only the essential parts are shown here:
my-project/
|- mmdetection/
|- configs/
|  |- _base_/
|  |  `- datasets/
|  |     `- scannet_detection.py
|  `- detr/
|     `- detr_deit_tiny_8x1_150e_scannet.py
`- scripts/
   |- find_gpu.sh
   `- train-scannet-frames.sh
The training script:
#!/bin/bash
# train-scannet-frames.sh
clear
# run `conda activate openmmlab` first
config=configs/detr/detr_deit_tiny_8x1_150e_scannet.py
. scripts/find_gpu.sh -1 14787
echo begin: $(date) > scripts/RUN-`basename $0`.txt
PATH=/usr/local/cuda/bin:$PATH \
    PYTHONPATH=mmdetection/mmdet:$PYTHONPATH \
    CUDA_VISIBLE_DEVICES=${gpu_id} \
    MMDET_DATASETS=`pwd`/data/scannet-frames/ \
    bash mmdetection/tools/dist_train.sh \
        $config ${n_gpu_found}
# python mmdetection/tools/train.py \
#     $config
echo end: $(date) >> scripts/RUN-`basename $0`.txt
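find_gpu.sh itself is not listed here; its contract is just to set `gpu_id` and `n_gpu_found` for the launcher, where the `-1 14787` arguments read as "any number of GPUs with at least 14787 MiB free". A hypothetical Python rendering of that selection logic, parsing the output of `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits` (the function name and thresholds are assumptions, not the actual script):

```python
def pick_gpus(query_output, n_wanted=-1, min_free_mib=14787):
    """Pick GPU indices with at least min_free_mib MiB free memory.
    query_output: text from
      nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    n_wanted=-1 means 'take all qualifying GPUs' (hypothetical
    equivalent of `find_gpu.sh -1 14787` above)."""
    free = [int(line) for line in query_output.strip().splitlines()]
    ok = [i for i, mib in enumerate(free) if mib >= min_free_mib]
    ok.sort(key=lambda i: -free[i])          # prefer the emptiest cards
    return ok if n_wanted < 0 else ok[:n_wanted]

# Example: a 4-GPU machine where GPUs 1 and 3 are mostly free
sample = "2048\n15000\n500\n16000\n"
assert pick_gpus(sample) == [3, 1]
```

The selected indices would then be joined into CUDA_VISIBLE_DEVICES, with `n_gpu_found` being the number of indices returned.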
References
1. Training MMDetection on ScanNet (the previous post)
2. (ICLR 2021) Training data-efficient image transformers & distillation through attention
3. facebookresearch/deit
4. open-mmlab/mmdetection
5. open-mmlab/mmclassification
6. How to change the model of mmclassification to mmdetection? #7761
7. Use backbone network implemented in MMClassification
8. (ECCV 2020) End-to-End Object Detection with Transformers (paper, supplementary)
9. facebookresearch/detr
10. open-mmlab/mmdetection/configs/detr/detr_r50_8x2_150e_coco.py
11. Tutorial 1: Learn about Configs
12. Config
13. An introductory tutorial to the MMDetection framework (2): quick start
14. An introductory tutorial to the MMDetection framework (3): a detailed walkthrough of config files
15. [MMDetection study notes] notes on config files
16. An introduction to mmdetection config file parameters
17. AssertionError: The num_classes (3) in Shared2FCBBoxHead of MMDataParallel does not matches the length of CLASSES 80) in CocoDataset #4828
18. Prepare a config
19. ScanNet/BenchmarkScripts/convert2panoptic.py