Paper Notes: "SSD: Single Shot MultiBox Detector"

罗泽

Category: Object Detection. Tags: ssd

Article link: https://blog.csdn.net/u013698770/article/details/55100158

2017/2/15 first reading

Abstract

SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.
At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape.
It combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.

Introduction

SSD does not resample pixels or features for bounding box hypotheses.
Improvements:
(1) using a small convolutional filter to predict object categories and offsets in bounding box locations
(2) using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales
(3) The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
(4) To achieve high detection accuracy, predictions of different scales are produced from feature maps of different scales, and predictions are explicitly separated by aspect ratio.

2 The Single Shot Detector (SSD)

2.1 Model

Multi-scale feature maps for detection
Convolutional feature layers are added to the end of the truncated base network. They decrease in size progressively and allow predictions of detections at multiple scales.

Convolutional predictors for detection
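Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. For a feature layer of size m × n with p channels, the basic element is a small 3 × 3 × p kernel that produces either a score for a category or a shape offset relative to the default box coordinates.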

Default boxes and aspect ratios

Using different default box shapes in several feature maps discretizes the space of possible output box shapes densely.

A set of default bounding boxes is associated with each feature map cell. The default boxes tile the feature map in a convolutional manner, so the position of each box relative to its corresponding cell is fixed.
At each feature map cell, the network
-> predicts the offsets relative to the default box shapes in the cell,
-> predicts the per-class scores that indicate the presence of a class instance in each of those boxes.
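
The per-cell predictions above can be counted directly: with k default boxes per cell and c classes, each cell needs (c + 4)k outputs, so an m × n feature map yields (c + 4)kmn values. The sketch below is only an illustration; the function names are made up for these notes, and the SSD300 layer layout in the usage lines (map sizes 38, 19, 10, 5, 3, 1 with 4 or 6 boxes per cell) is an assumption taken from the paper's VOC configuration, not something stated in these notes.

```python
def outputs_per_map(m, n, k, c):
    """Number of predicted values for an m x n feature map with k default
    boxes per cell and c classes: c scores plus 4 offsets per box."""
    return (c + 4) * k * m * n


def total_default_boxes(layers):
    """layers: list of (map_size, boxes_per_cell) for square feature maps."""
    return sum(size * size * k for size, k in layers)


# Assumed SSD300 layout (hypothetical here, not stated in these notes).
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

print(total_default_boxes(SSD300_LAYERS))  # 8732
print(outputs_per_map(5, 5, 6, 21))        # 3750
```

Under that assumed layout the six maps together give 8732 default boxes per image, which is why most of them end up as negatives during training.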

2.2 Training

During training, ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs.
Training also involves choosing the set of default boxes and scales,
as well as the hard negative mining (keeping only the hardest negative examples) and data augmentation strategies.
Matching strategy
Each ground truth box has to be matched to default boxes.
The candidates are default boxes that vary over location, aspect ratio, and scale.
Begin by matching each ground truth box to the default box with the best jaccard overlap (as in MultiBox [7]); jaccard overlap is the intersection over union (IoU) of two boxes. Then match default boxes to any ground truth box with jaccard overlap higher than a threshold (0.5).
This allows the network to predict high scores for multiple overlapping default boxes rather than only for the one with maximum overlap.
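
The two matching steps can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the box format `(xmin, ymin, xmax, ymax)`, the function names `jaccard` and `match`, and the convention of returning -1 for unmatched (negative) boxes are all assumptions made for these notes.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the index of its matched ground truth
    box, or -1 if the default box is a negative."""
    if not default_boxes:
        return []
    matches = [-1] * len(default_boxes)
    # Threshold rule: a default box takes the ground truth it overlaps most,
    # provided that overlap is above the threshold.
    for i, d in enumerate(default_boxes):
        best_j, best_iou = -1, threshold
        for j, g in enumerate(gt_boxes):
            iou = jaccard(d, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        matches[i] = best_j
    # Best-overlap rule, applied last so it overrides: every ground truth
    # box keeps its single best default box even below the threshold.
    for j, g in enumerate(gt_boxes):
        best_i = max(range(len(default_boxes)),
                     key=lambda i: jaccard(default_boxes[i], g))
        matches[best_i] = j
    return matches


defaults = [(0, 0, 2, 2), (0, 0, 2, 1.5), (11, 11, 13, 13)]
gts = [(0, 0, 2, 2), (10, 10, 12, 12)]
print(match(defaults, gts))  # [0, 0, 1]
```

In the usage lines, the first ground truth box gets two default boxes (overlaps 1.0 and 0.75), while the second is only matched through the best-overlap rule because its best overlap is below 0.5.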
Training objective
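The SSD training objective is derived from the MultiBox objective but is extended to handle multiple object categories.
$x^p_{ij} \in \{0,1\}$ is an indicator for matching the i-th default box to the j-th ground truth box of category p.
That is, $x^p_{ij} = 1$ when the i-th default box is matched to the j-th ground truth box of category p, and 0 otherwise.

With the matching strategy above, $\sum_i x^p_{ij} \geq 1$: more than one default box can be matched to the same ground truth box.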

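The overall objective loss is a weighted sum of two parts, the localization loss (loc) and the confidence loss (conf):
$L(x,c,l,g)=\frac{1}{N}(L_{conf}(x,c)+ \alpha L_{loc}(x,l,g))$
where N is the number of matched default boxes; if N = 0, the loss is set to 0.
The localization loss (loc) is a smooth L1 loss between the predicted box (l) and the ground truth box (g) parameters.
The confidence loss (conf) is the softmax loss over multiple classes confidences (c).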
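
A small numeric sketch of this objective may help. It assumes the box offsets are already encoded relative to the default boxes, that class 0 stands for background, and that the caller passes only the positives plus the negatives kept for training; the function names and argument layout are hypothetical, chosen for these notes.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_loss(scores, label):
    """Cross-entropy of softmax(scores) against the integer class label."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def ssd_loss(conf_scores, labels, loc_preds, loc_targets, alpha=1.0):
    """conf_scores[i]: class scores of box i (class 0 = background, assumed).
    labels[i]: target class of box i (0 for negatives).
    loc_preds[i], loc_targets[i]: 4 encoded offsets, used for positives only.
    """
    n = sum(1 for y in labels if y > 0)  # N = number of matched boxes
    if n == 0:
        return 0.0
    l_conf = sum(softmax_loss(s, y) for s, y in zip(conf_scores, labels))
    l_loc = sum(
        smooth_l1(p - t)
        for y, pred, tgt in zip(labels, loc_preds, loc_targets) if y > 0
        for p, t in zip(pred, tgt)
    )
    return (l_conf + alpha * l_loc) / n


loss = ssd_loss(
    conf_scores=[[0.0, 0.0], [0.0, 0.0]],
    labels=[1, 0],                      # one positive, one negative
    loc_preds=[[0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]],
    loc_targets=[[0.0] * 4, [0.0] * 4],
)
print(round(loss, 4))
```

The localization term only sums over positive boxes, while the confidence term covers positives and negatives alike; both are divided by the number of positives N.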
Choosing scales and aspect ratios for default boxes
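Both the lower and upper feature maps are used for detection.

The tiling of default boxes is designed so that specific feature maps learn to be responsive to particular scales of the objects.
The size of the default boxes varies with the feature map they belong to: lower (higher-resolution) maps get smaller boxes and upper maps get larger ones.
With aspect ratios {1, 2, 3, 1/2, 1/3} plus one extra, larger box for aspect ratio 1, each feature map location has 6 default boxes.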
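
The notes do not record the scale rule, so the sketch below fills it in from the paper as an assumption: scales are spaced evenly between s_min = 0.2 and s_max = 0.9 over the m feature maps, each aspect ratio a gives a box of width s·sqrt(a) and height s/sqrt(a), and the extra box for aspect ratio 1 uses the scale sqrt(s_k · s_{k+1}). Function names are made up for illustration, and released implementations may use different values.

```python
import math


def scales(m, s_min=0.2, s_max=0.9):
    """s_k for k = 1..m, evenly spaced between s_min and s_max."""
    if m == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def box_shapes(s_k, s_next, ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    """(width, height) pairs for one cell, relative to the image size."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    s_extra = math.sqrt(s_k * s_next)  # extra box for aspect ratio 1
    shapes.append((s_extra, s_extra))
    return shapes


s = scales(6)
print([round(v, 2) for v in s])       # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
print(len(box_shapes(s[0], s[1])))    # 6
```

Every box built from the same s_k has the same area s_k², so within one feature map the aspect ratios change the shape but not the size of the boxes.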

For example, in Fig. 1, the dog is matched to a default box in the 4×4 feature map, but not to any default boxes in the 8×8 feature map. This is because those boxes have different scales and do not match the dog box, and therefore are considered as negatives during training.

Hard negative mining
After matching, negative examples far outnumber positive ones.
Instead of using all the negative examples, we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and a more stable training.
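
The selection step can be sketched as follows, assuming a per-box confidence loss has already been computed. The function name and argument names are hypothetical, and this is an illustration of the sorting rule rather than the authors' code.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return sorted indices of the negatives kept for the confidence loss.

    conf_losses[i]: confidence loss of default box i.
    is_positive[i]: True if box i was matched to a ground truth box.
    """
    num_pos = sum(1 for p in is_positive if p)
    negatives = [i for i, p in enumerate(is_positive) if not p]
    # Hardest negatives first: highest confidence loss.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return sorted(negatives[:neg_pos_ratio * num_pos])


losses = [0.1, 2.0, 0.3, 1.5, 0.05, 0.9, 0.2, 0.7]
positive = [True, False, False, False, False, False, False, False]
print(hard_negative_mining(losses, positive))  # [1, 3, 5]
```

With one positive box, only the three negatives with the highest loss are kept; the remaining four contribute nothing to the confidence loss.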

Data augmentation
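Varying the training data makes the model more robust to different input object sizes and shapes.
After a patch is sampled from the image, a ground truth box is kept, clipped to its overlap with the patch, only if its center lies inside the sampled patch.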
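
The rule for ground truth boxes after patch sampling can be sketched like this. It is one reading of the sentence above, with boxes and patch both given as `(xmin, ymin, xmax, ymax)` in image coordinates; the function name is made up, and how the patch itself is sampled is left out.

```python
def boxes_in_patch(gt_boxes, patch):
    """Keep ground truth boxes whose center lies in the patch, clipped to it.

    Boxes and patch are (xmin, ymin, xmax, ymax) in the same coordinates.
    """
    px0, py0, px1, py1 = patch
    kept = []
    for x0, y0, x1, y1 in gt_boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if px0 <= cx <= px1 and py0 <= cy <= py1:
            # Keep only the part of the box that overlaps the patch.
            kept.append((max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1)))
    return kept


patch = (0, 0, 10, 10)
boxes = [(8, 8, 11, 11), (9, 9, 13, 13), (2, 2, 4, 4)]
print(boxes_in_patch(boxes, patch))  # [(8, 8, 10, 10), (2, 2, 4, 4)]
```

The second box overlaps the patch but its center (11, 11) falls outside, so it is dropped rather than clipped.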