Paper Reading: SSD

贾小树

Article link: https://blog.csdn.net/j879159541/article/details/100421478

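I. Understanding the Network

1. SSD network structure
[Figure: SSD network structure]
The key idea of the network is to detect on feature maps taken from several layers. These feature maps have different spatial sizes, and the scale of the prior (default) boxes grows linearly as the feature maps get deeper, so each layer is responsible for boxes of a different scale. Combined with several aspect ratios at each scale, the six feature maps produce 8732 prior boxes, which cover a very wide range of positions and shapes.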
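The figure of 8732 can be checked with a few lines of arithmetic. The sketch below assumes the SSD300 layer sizes and boxes-per-location (4, 6, 6, 6, 4, 4) as given in the paper; these numbers are not stated in this post, and the names `FEATURE_MAPS` and `count_default_boxes` are made up for illustration.

```python
# (layer name, feature-map side length, prior boxes per location) for SSD300.
# Values are taken from the SSD paper, not from this post.
FEATURE_MAPS = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]


def count_default_boxes(feature_maps):
    """Sum side * side * boxes_per_location over all detection layers."""
    return sum(side * side * k for _, side, k in feature_maps)


if __name__ == "__main__":
    for name, side, k in FEATURE_MAPS:
        print(f"{name}: {side}x{side}x{k} = {side * side * k}")
    print("total:", count_default_boxes(FEATURE_MAPS))
```

Most of the boxes (5776 of them) come from the 38×38 conv4_3 map, which is why the early layer matters so much for small objects.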

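The predictions are produced by sliding small convolution kernels over each feature map, so for every prior box the network predicts the box offsets and the per-class confidences in the same pass. Unlike Faster R-CNN, there is no RoI pooling stage, and the feature maps are not used a second time.

Another point is data augmentation, which raises mAP by nearly 10% (the paper's ablation reports 8.8%). The augmentation mainly feeds crops that contain only part of an object to the network as if they were whole images, which strengthens the network's ability to capture object details.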

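2. The base network of SSD is VGG16. fc6 and fc7 are converted to convolutional layers, dropout and fc8 are removed, and conv6 uses dilated (atrous) convolution to enlarge the receptive field.

Conv4_3 of VGG16 is the first feature map used for detection. Its size is 38×38. Because this layer sits relatively early in the network, its feature norm is large, so an L2 Normalization layer (see ParseNet) is added after it to keep its scale from differing too much from the later detection layers. This is not the same as Batch Normalization: L2 Normalization normalizes each pixel only along the channel dimension, whereas Batch Normalization normalizes over the [batch_size, width, height] dimensions. After normalization, a trainable scale variable gamma is usually applied.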
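As a rough illustration of channel-wise L2 normalization, the sketch below normalizes the channel vector at each pixel and then multiplies by a per-channel gamma. It uses plain nested lists instead of a tensor library, and the initial gamma of 20 is an assumption borrowed from common public SSD implementations, not something stated in this post.

```python
import math


def l2_normalize_channels(feature_map, gamma, eps=1e-10):
    """Normalize feature_map[c][y][x] along the channel axis, then scale by gamma[c].

    Each pixel is treated independently: its channel vector is divided by its
    own L2 norm. No batch or spatial statistics are involved.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            norm = math.sqrt(sum(feature_map[c][y][x] ** 2 for c in range(channels))) + eps
            for c in range(channels):
                out[c][y][x] = gamma[c] * feature_map[c][y][x] / norm
    return out


if __name__ == "__main__":
    # Two channels, one row, two pixels: channel vectors (3, 4) and (0, 2).
    fm = [[[3.0, 0.0]], [[4.0, 2.0]]]
    gamma = [20.0, 20.0]  # assumed initial value; trainable in a real network
    print(l2_normalize_channels(fm, gamma))
```

In a real network gamma is a learned parameter with one value per channel, so the layer can recover whatever overall scale works best for the detection heads.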

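From the convolutional layers that follow, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 are taken as detection feature maps. Together with Conv4_3, six feature maps are extracted in total.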

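3. Network outputs. The detection output has two parts, class confidences and bounding box locations, and each is computed by one 3×3 convolution on each of the six feature maps. Let k be the number of prior boxes per location on a feature map. The confidence branch then needs k×c convolution filters, where c is the number of object classes plus 1 (the extra 1 is the background confidence), and the location branch needs k×4 filters. Since every prior box predicts one bounding box, SSD300 predicts 8732 boxes in total. That is a very large number, so SSD is essentially dense sampling.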
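To make the filter counts concrete, here is a minimal sketch that computes the output channels of the two 3×3 prediction convolutions for one feature map. The function name `head_channels` is hypothetical, and the example of 20 object classes assumes PASCAL VOC.

```python
def head_channels(boxes_per_location, num_object_classes):
    """Output channels of the confidence and location convolutions.

    c = num_object_classes + 1 because one extra score is reserved for background.
    """
    c = num_object_classes + 1
    return {"conf": boxes_per_location * c, "loc": boxes_per_location * 4}


if __name__ == "__main__":
    # Assumed example: PASCAL VOC has 20 object classes.
    for k in (4, 6):
        print(k, head_channels(k, 20))
```

With k = 4 this gives 84 confidence channels and 16 location channels, and with k = 6 it gives 126 and 24.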

4. Speed. SSD300 is somewhat faster than YOLOv1: with multi-scale feature maps, the input can be reduced to 300×300, whereas YOLOv1 takes a 448×448 input. SSD512 is slower than YOLOv1, but its accuracy exceeds that of Faster R-CNN.

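5. SSD detection illustration
[Figure: SSD detection illustration]
In the figure, the cat is matched to two prior boxes on the 8×8 feature map, and the dog is matched to one prior box on the 4×4 feature map. One GT box can match several prior boxes, but each prior box matches only one GT box.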

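6. Matching prior boxes to GT during training. There are two steps. First, each GT box is matched to the prior box with which it has the highest IoU, which guarantees that every GT has a prior box. Second, each remaining prior box is matched to a GT box if their IoU exceeds a threshold (0.5). Although one ground truth can match several prior boxes, positive samples are still scarce. To keep positives and negatives roughly balanced, SSD uses hard negative mining: the negatives are sorted in descending order of confidence loss (the lower the predicted background confidence, the larger the loss), and only the top-k with the largest loss are kept as training negatives, so that the ratio of positives to negatives is close to 1:3.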
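The two matching steps and the negative sampling can be sketched as follows. This is a simplified illustration, not the paper's implementation: boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, the function names are made up, and if two GT boxes share the same best prior, the later one simply wins.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_priors(priors, gts, threshold=0.5):
    """Return, for each prior, the index of its matched GT box, or -1 for negatives."""
    matches = [-1] * len(priors)
    if not gts:
        return matches
    # Threshold rule: a prior takes its best-overlapping GT if IoU > threshold.
    for p, prior in enumerate(priors):
        best_gt = max(range(len(gts)), key=lambda g: iou(prior, gts[g]))
        if iou(prior, gts[best_gt]) > threshold:
            matches[p] = best_gt
    # Best-prior rule, applied last so it overrides: every GT keeps its best prior.
    for g, gt in enumerate(gts):
        best_prior = max(range(len(priors)), key=lambda p: iou(priors[p], gt))
        matches[best_prior] = g
    return matches


def hard_negative_indices(conf_losses, matches, neg_pos_ratio=3):
    """Pick the negatives with the largest confidence loss, at most ratio * positives."""
    num_pos = sum(1 for m in matches if m >= 0)
    negatives = [i for i, m in enumerate(matches) if m < 0]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[: neg_pos_ratio * num_pos]


if __name__ == "__main__":
    priors = [(0, 0, 10, 10), (0, 0, 4, 4), (20, 20, 30, 30), (50, 50, 60, 60)]
    gts = [(0, 0, 9, 9), (21, 21, 40, 40)]
    matches = match_priors(priors, gts)
    print(matches)
    print(hard_negative_indices([0.1, 2.0, 0.3, 0.7], matches))
```

In the example, the third prior overlaps the second GT box with an IoU well below 0.5, yet it is still a positive because it is that GT's best prior. This is exactly what the first matching step is for.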

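7. Scale of the prior boxes:
$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$
Here m = 5, which excludes conv4_3; the s_k of conv4_3 is set separately, usually to 0.2/2 = 0.1. In the paper s_min = 0.2 and s_max = 0.9. Note that s_k is only a ratio relative to the image size: for a 300×300 input, the conv4_3 scale is 300×0.1 = 30 pixels, and the later layers are computed with the formula above (formula (4) in the paper).
This follows a sentence on page 11 of the paper: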

Since objects in COCO tend to be smaller than PASCAL VOC, we use smaller default boxes for all layers. We follow the strategy mentioned in Sec. 2.2, but now our smallest default box has a scale of 0.15 instead of 0.2, and the scale of the default box on conv4_3 is 0.07 (e.g. 21 pixels for a 300×300 image).

Note: the scale is the side length of the square box.
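A short sketch of the scale rule may help. It assumes the linear rule from the paper, s_k = s_min + (s_max - s_min)(k - 1)/(m - 1) with s_min = 0.2 and s_max = 0.9, and treats conv4_3 separately with scale 0.1 as described above. The helper `box_shape`, which turns a scale and an aspect ratio into a width and height, also follows the paper; the function names themselves are made up.

```python
import math


def default_box_scales(m=5, s_min=0.2, s_max=0.9):
    """Scales s_1..s_m, spaced linearly between s_min and s_max (relative to image size)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def box_shape(scale, aspect_ratio):
    """Relative width and height of a prior box: w = s * sqrt(a), h = s / sqrt(a)."""
    root = math.sqrt(aspect_ratio)
    return scale * root, scale / root


if __name__ == "__main__":
    image_size = 300
    conv4_3_scale = 0.1  # set separately, as described in the text
    scales = [conv4_3_scale] + default_box_scales()
    for s in scales:
        print(round(s * image_size, 1), "pixels")
    print(box_shape(0.2, 2.0))
```

For a 300×300 input, the six scales come out at roughly 30, 60, 112.5, 165, 217.5 and 270 pixels. Changing the aspect ratio keeps the box area fixed at the square of the scale and only redistributes it between width and height.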

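8. Inference

The prediction process is fairly simple. For each predicted box, first determine its class (the one with the highest confidence) and its confidence value, and discard boxes predicted as background. Then discard boxes whose confidence is below a threshold (e.g. 0.5). Decode the remaining boxes, using the prior boxes to recover their actual position parameters (after decoding, clipping is usually applied so that boxes do not extend beyond the image). Next, sort the boxes by confidence in descending order and keep only the top-k (e.g. 400). Finally, apply NMS to remove boxes that overlap heavily. The boxes that remain are the detection results.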
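The steps above can be sketched in plain Python as follows. Several details are assumptions made for illustration: priors are (cx, cy, w, h) in relative coordinates, offsets are decoded with the paper's center/size encoding without the variance terms that many implementations add, NMS is run per class, and the NMS threshold of 0.45 comes from the paper rather than from this post.

```python
import math


def decode(prior, offsets):
    """Turn a prior (cx, cy, w, h) and predicted offsets into a clipped corner box."""
    cx = prior[0] + offsets[0] * prior[2]
    cy = prior[1] + offsets[1] * prior[3]
    w = prior[2] * math.exp(offsets[2])
    h = prior[3] * math.exp(offsets[3])
    box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return tuple(min(1.0, max(0.0, v)) for v in box)  # clip to the image


def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold):
    """Greedy NMS over (score, box) pairs already sorted by descending score."""
    kept = []
    for score, box in detections:
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


def detect(priors, offsets, class_scores, conf_threshold=0.5, top_k=400, iou_threshold=0.45):
    """class_scores[i][0] is the background score. Returns (class_id, score, box) tuples."""
    candidates = []
    for prior, off, scores in zip(priors, offsets, class_scores):
        cls = max(range(len(scores)), key=lambda c: scores[c])
        if cls == 0 or scores[cls] < conf_threshold:
            continue  # background or low confidence
        candidates.append((cls, scores[cls], decode(prior, off)))
    candidates.sort(key=lambda d: d[1], reverse=True)
    candidates = candidates[:top_k]
    results = []
    for cls in sorted({c for c, _, _ in candidates}):
        per_class = [(s, b) for c, s, b in candidates if c == cls]
        results.extend((cls, s, b) for s, b in nms(per_class, iou_threshold))
    return results


if __name__ == "__main__":
    priors = [(0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2), (0.2, 0.2, 0.1, 0.1)]
    offsets = [(0, 0, 0, 0), (0.05, 0, 0, 0), (0, 0, 0, 0)]
    scores = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.0], [0.9, 0.05, 0.05]]
    print(detect(priors, offsets, scores))
```

In the example, the second box nearly coincides with the first and is suppressed by NMS, while the third is dropped earlier as background, so a single detection remains.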

II. Data Given in the Paper

1. Loss function
[Figure: loss function]
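For reference, the overall training objective defined in the SSD paper is a weighted sum of a confidence loss and a localization loss. The form below follows the paper's notation and is quoted from the paper, not derived in this post:

```latex
L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)
```

Here N is the number of matched prior boxes (the loss is set to 0 when N = 0), L_conf is a softmax loss over the class confidences c, L_loc is a Smooth L1 loss between the predicted box l and the ground truth box g, and the weight α is set to 1 in the paper.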
2. Model analysis

3. Related Work

III. Excerpts from the Paper

1. Main reasons for the performance improvement
Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.

2. Using different predictors for boxes of different aspect ratios
Our default boxes are similar to the anchor boxes used in Faster R-CNN [2], however we apply them to several feature maps of different resolutions.
Allowing different default box shapes in several feature maps let us efficiently discretize the space of possible output box shapes.

3. Data augmentation: treating part of an object as the object, which improves recognition of details
Data augmentation: To make the model more robust to various input object sizes and shapes, each training image is randomly sampled by one of the following options:
– Use the entire original input image.
– Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
– Randomly sample a patch.
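The three sampling options can be sketched as below. This is only one possible reading, under stated assumptions: the patch size range of [0.1, 1] of the image and the aspect ratio range of 1/2 to 2 come from the paper, "minimum jaccard overlap with the objects" is interpreted here as every object overlapping the patch by at least the chosen value (some implementations only require one object to do so), and the function falls back to the full image when no valid patch is found. The names are made up for illustration.

```python
import random


def iou(a, b):
    """IoU (jaccard overlap) of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_patch(image_w, image_h, gt_boxes, rng, max_trials=50):
    """Return a crop (xmin, ymin, xmax, ymax) chosen by one of the three options."""
    option = rng.choice(["full", 0.1, 0.3, 0.5, 0.7, 0.9, "random"])
    full = (0.0, 0.0, float(image_w), float(image_h))
    if option == "full":
        return full  # option 1: use the entire original image
    for _ in range(max_trials):
        w = rng.uniform(0.1, 1.0) * image_w
        h = rng.uniform(0.1, 1.0) * image_h
        if not 0.5 <= w / h <= 2.0:
            continue  # keep the aspect ratio between 1/2 and 2
        x = rng.uniform(0.0, image_w - w)
        y = rng.uniform(0.0, image_h - h)
        patch = (x, y, x + w, y + h)
        if option == "random":
            return patch  # option 3: any random patch
        # Option 2: require the chosen minimum overlap with the objects.
        if gt_boxes and min(iou(patch, gt) for gt in gt_boxes) >= option:
            return patch
    return full  # assumed fallback when no valid patch is found


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_patch(300, 200, [(50, 50, 150, 150)], rng))
```

A full augmentation pipeline would also keep only the ground truth boxes whose centers fall inside the patch, resize the patch to the network input size, and apply a random horizontal flip, as described in the paper.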

4. Strengths and weaknesses of SSD
Compared to R-CNN [22], SSD has less localization error, indicating that SSD can localize objects better because it directly learns to regress the object shape and classify object categories instead of using two decoupled steps.
However, SSD has more confusions with similar object categories (note: for example, in a drone project, different types of drones were poorly distinguished) (especially for animals), partly because we share locations for multiple categories. Figure 4 shows that SSD is very sensitive to the bounding box size. In other words, it has much worse performance on smaller objects than bigger objects.

5. Possible ways to improve SSD
An alternative way of improving SSD is to design a better tiling of default boxes so that its position and scale are better aligned with the receptive field of each position on a feature map. We leave this for future work.

6. How SSD relates to OverFeat and YOLO
However, our approach is more flexible than the existing methods because we can use default boxes of different aspect ratios on each feature location from multiple feature maps at different scales.
If we only use one default box per location from the top most feature map, our SSD would have similar architecture to OverFeat [4];
if we use the whole top most feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO [5].

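References:
1. 目标检测|SSD原理与实现 (Object Detection | SSD: Principles and Implementation)

2. 论文阅读：SSD: Single Shot MultiBox Detector (Paper Reading: SSD: Single Shot MultiBox Detector)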