Paper-info
title: SSD: Single Shot MultiBox Detector [2016-ECCV]
author: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg
Motivation
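- Faster R-CNN is significantly faster at inference [5 fps] than the earlier R-CNN and Fast R-CNN, but it is still far from real-time detection [>= 25 fps ?]. The main bottlenecks at inference time are the RPN that generates proposals and the RoI-pooling stage: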
feature_maps = process(image)               # shared backbone features
ROIs = RPN(feature_maps)                    # proposal generation: time costly!
for ROI in ROIs:                            # the second stage runs once per proposal
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)
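- Faster R-CNN usually works at a single scale (it runs the subsequent detection on the single-scale feature_map output by the backbone), which makes it hard to detect objects whose scale varies widely.
- SSD addresses both shortcomings of Faster R-CNN: it replaces the RPN with default bboxes and detects on multi-scale feature maps. The former greatly improves detection speed; the latter steadily improves detection accuracy.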
Idea
- multi-scale
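- SSD uses VGG16 as the basenet, with the original FC6 and FC7 replaced by a 3x3 and a 1x1 convolutional layer, respectively;
- It detects on the feature_maps output by Conv4-3, FC-7, Conv8-2, Conv9-2, Conv10-2 and Conv11-2; for input_size = 300 their sizes are, in order: [38x38, 19x19, 10x10, 5x5, 3x3, 1x1];
- Feature maps at different scales use default boxes with different [aspect-ratio x size] settings, so objects of different sizes are detected more reliably;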
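The feature_map sizes listed above can be reproduced with the standard output-size formula for convolution and pooling layers. The sketch below is only an illustration: the kernel, stride, padding and ceil-mode settings are assumptions based on the common VGG16-based SSD300 layout rather than something stated in these notes, and `out_size` and `ssd300_feature_map_sizes` are hypothetical helper names.

```python
def out_size(size, kernel, stride=1, padding=0, dilation=1, ceil_mode=False):
    """Spatial output size of a conv / pooling layer (standard formula)."""
    span = size + 2 * padding - dilation * (kernel - 1) - 1
    if ceil_mode:
        steps = -(-span // stride)   # integer ceiling division
    else:
        steps = span // stride       # integer floor division
    return steps + 1


def ssd300_feature_map_sizes(in_size=300):
    """Trace the spatial size through the size-changing layers only.

    3x3 convolutions with padding 1 keep the size and are skipped.
    All layer settings below are assumptions (see the lead-in).
    """
    s = out_size(in_size, 2, stride=2)                   # pool1
    s = out_size(s, 2, stride=2)                         # pool2
    s = out_size(s, 2, stride=2, ceil_mode=True)         # pool3 (rounds up)
    conv4_3 = s
    s = out_size(s, 2, stride=2)                         # pool4
    s = out_size(s, 3, stride=1, padding=1)              # pool5 (size kept)
    s = out_size(s, 3, padding=6, dilation=6)            # fc6 as dilated 3x3 conv
    fc7 = out_size(s, 1)                                 # fc7 as 1x1 conv
    conv8_2 = out_size(fc7, 3, stride=2, padding=1)
    conv9_2 = out_size(conv8_2, 3, stride=2, padding=1)
    conv10_2 = out_size(conv9_2, 3)                      # no padding
    conv11_2 = out_size(conv10_2, 3)                     # no padding
    return [conv4_3, fc7, conv8_2, conv9_2, conv10_2, conv11_2]


sizes = ssd300_feature_map_sizes(300)
print(sizes)
```

Under these assumptions the trace yields the six sizes quoted above; the only non-obvious step is pool3, which has to round up to turn 75 into 38.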
- default box
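- The bboxes of objects in an image do not take arbitrary aspect ratios; they follow regular patterns;
- Intuitively, the groundtruth bboxes can be grouped into clusters, and the cluster centers serve as the default bboxes;
- Feature_maps of different resolution use different default boxes, following these guidelines:
  a). the scale increases linearly from 0.2 at Conv4-3 to 0.9 at Conv11-2;
  b). aspect ratio = {1/3, 1/2, 1, 2, 3}
- Similar to the anchor mechanism, every pixel of a chosen feature_map produces k diverse default boxes (e.g. k = 4; in the paper's SSD300 setup k is 4 or 6 depending on the layer) and outputs a tensor of k x (num_classes + 1 + 4) values, where the 1 stands for the no-object class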
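The scale and aspect-ratio rules above can be worked through numerically. The following is a minimal sketch, not the project's code: it assumes scales are spaced linearly over the six source layers, that a box with scale s and aspect ratio a has width s·sqrt(a) and height s/sqrt(a) relative to the image (as in the SSD paper), and that the per-layer box counts are the paper's SSD300 values. The function names are hypothetical.

```python
import math


def default_box_scales(num_maps=6, s_min=0.2, s_max=0.9):
    """Scales growing linearly from s_min (lowest layer) to s_max (highest layer)."""
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_box_shapes(scale, aspect_ratios):
    """(width, height) of each default box, relative to the image size."""
    return [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]


scales = default_box_scales()
shapes = default_box_shapes(scales[0], [1 / 3, 1 / 2, 1, 2, 3])

# Assumption: feature_map sizes from the notes, box counts per position
# taken from the paper's SSD300 configuration.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_position = [4, 6, 6, 6, 4, 4]
total_boxes = sum(f * f * k for f, k in zip(feature_map_sizes, boxes_per_position))

print([round(s, 2) for s in scales])
print(total_boxes)
```

Every aspect ratio keeps the box area at scale squared, so only the shape changes within a layer. With the assumed per-layer counts the total comes to 8732 default boxes for a 300x300 input.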
- matching strategy
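- SSD predictions are split into positives and negatives according to the IoU between the corresponding default box and the groundtruth;
- The localization loss is computed only over the positive predictions, so the bboxes the model learns stay close to the groundtruth and training is more stable;
- The classification loss is computed over all positive predictions and a subset of the negative ones, keeping the positive-to-negative ratio at 1:3;
- The overall loss function is a weighted sum of the classification loss and the localization loss, normalized by the number of positive (matched) default boxes.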
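The matching and negative-sampling rules above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the IoU threshold of 0.5 and the "keep the negatives with the highest classification loss" rule follow the SSD paper, the extra step that ties each groundtruth to its single best default box is left out, and `iou`, `match` and `select_negatives` are hypothetical helper names.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes, gt_boxes, threshold=0.5):
    """Per default box: index of the matched groundtruth, or -1 for a negative."""
    matches = []
    for d in default_boxes:
        best_idx, best_iou = -1, 0.0
        for j, g in enumerate(gt_boxes):
            overlap = iou(d, g)
            if overlap > best_iou:
                best_idx, best_iou = j, overlap
        matches.append(best_idx if best_iou > threshold else -1)
    return matches


def select_negatives(matches, conf_losses, neg_pos_ratio=3):
    """Keep the hardest negatives, at most neg_pos_ratio per positive."""
    num_pos = sum(1 for m in matches if m >= 0)
    negatives = [i for i, m in enumerate(matches) if m < 0]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[: neg_pos_ratio * num_pos]


# Toy data (made up for illustration): six default boxes, one groundtruth.
default_boxes = [(0.0, 0.0, 0.5, 0.5), (0.1, 0.1, 0.6, 0.6), (0.5, 0.5, 1.0, 1.0),
                 (0.0, 0.5, 0.5, 1.0), (0.5, 0.0, 1.0, 0.5), (0.25, 0.25, 0.75, 0.75)]
gt_boxes = [(0.0, 0.0, 0.5, 0.5)]
conf_losses = [0.2, 0.9, 0.1, 0.7, 0.3, 0.5]

matches = match(default_boxes, gt_boxes)
negatives = select_negatives(matches, conf_losses)
print(matches, negatives)
```

In the toy data only the first default box overlaps the groundtruth by more than 0.5, so it is the single positive and contributes to both losses; the three negatives with the largest classification loss join it in the classification term, and the remaining negatives are ignored.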
Outline
feat_list = backbone_multiscale(image)
for feature_map in feat_list:
    for position in feature_map:
        conf, offset = detector(position)  # fundamental improvement in speed! No RPN, No RoI-pooling
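The `offset` predicted at each position is interpreted relative to that position's default box. A minimal sketch of this relationship, assuming the center-size parameterization described in the SSD paper and leaving out the variance scaling that some implementations add; `encode` and `decode` are hypothetical helper names, and boxes are given as (cx, cy, w, h).

```python
import math


def encode(gt, default):
    """Regression target: offsets of a groundtruth box relative to a default box."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return ((g_cx - d_cx) / d_w,      # center shift, in units of the default width
            (g_cy - d_cy) / d_h,      # center shift, in units of the default height
            math.log(g_w / d_w),      # log-scale width ratio
            math.log(g_h / d_h))      # log-scale height ratio


def decode(offset, default):
    """Inverse of encode: turn predicted offsets back into a box."""
    t_cx, t_cy, t_w, t_h = offset
    d_cx, d_cy, d_w, d_h = default
    return (d_cx + t_cx * d_w,
            d_cy + t_cy * d_h,
            d_w * math.exp(t_w),
            d_h * math.exp(t_h))


# Toy values, made up for illustration.
default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.55, 0.45, 0.3, 0.1)
target = encode(gt_box, default_box)
recovered = decode(target, default_box)
print(target, recovered)
```

A zero offset decodes to the default box itself, which is why a well-chosen set of default boxes leaves the network with only small corrections to learn.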
Code
- Project link: ssd pytorch
import torch.nn as nn

# VGG16, add_extras and mbox are defined elsewhere in the project.


def multibox(in_size=300, num_classes=21):
    '''
    Construct the building blocks of SSD-VGG.
    step - 1. rebuild vgg16 (vgg16 truncated before the last maxpool layer, then add fc6, fc7)
    step - 2. construct the extra layers
    step - 3. construct the localization and classification heads
    '''
    # step - 1
    backbone = VGG16(in_size=in_size)
    # step - 2
    extra_layers = add_extras(in_size=in_size)
    # step - 3: config[k] is the number of default boxes per position for source layer k
    config, loc_layers, conf_layers = mbox[str(in_size)], [], []
    # heads for conv-4-3 and FC-7 (backbone layers 21 and -2)
    for k, v in enumerate([21, -2]):
        loc_layers += [nn.Conv2d(backbone[v].out_channels, config[k] * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(backbone[v].out_channels, config[k] * num_classes, kernel_size=3, padding=1)]
    # heads for layers [1, 3, 5, 7] of extra_layers; k continues from 2
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, config[k] * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, config[k] * num_classes, kernel_size=3, padding=1)]
    head = (loc_layers, conf_layers)
    return backbone, extra_layers, head
Experiment
Training hyperparameters: lr = 1e-3, momentum = 0.9, weight_decay = 5e-4, batch_size = 32.
- VOC-2007 test:
- Various design choices
- Multi-scale feature_map
- Inference-time
Conclusion
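- SSD is worse than Faster R-CNN at detecting small objects [reason: in SSD, small objects can only be detected on the high-resolution layers, but those layers mostly carry low-level features such as edges and color, and this information tends to interfere with the classification task];
- Accuracy, the number of default bboxes, and FPS trade off against one another;
- Multi-scale feature maps help improve detection accuracy;
- Larger input sizes more readily yield better results;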