[Deep Learning] Object Detection: SSD (Single Shot MultiBox Detector) Explained

Allen Chou

Article link: https://blog.csdn.net/Vermont_/article/details/113732785

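To begin, a brief summary of SSD's characteristics:

•One-stage

•Uniform, dense sampling

•Prior boxes / default boxes (anchor boxes)

•Sampling at different scales

•Sampling from feature maps of different scales

•Fairly good at detecting small objects

•Fast inference

•Hard to train (extreme imbalance between positive and negative samples)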

The figure below compares the efficiency of several network architectures. Frankly, an mAP of 75% to 80% is good enough.

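SSD is an object detection algorithm that directly predicts bounding box locations by regression and object classes by classification. Compared with Faster R-CNN, it removes the bounding box proposal step and the subsequent pixel/feature resampling. It runs much faster than Faster R-CNN and is also more accurate, and the same holds when it is compared with YOLO. At the time it took first place in the Pascal object detection competition, although it has since been surpassed by YOLO9000 and other algorithms.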

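To detect objects of different sizes, the traditional approach is to resize the image to several scales, process each one separately, and then combine the results. SSD achieves the same effect by combining the feature maps of different convolutional layers. The backbone is VGG16, with its two fully connected layers converted into convolutional layers and 4 more convolutional layers added after them. The outputs of 6 different convolutional layers (conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2 in SSD300) are each convolved with two sets of 3*3 kernels. One set outputs the confidence (conf) used for classification, producing 21 confidences per default box (for the VOC dataset, which has 20 object classes plus background). The other set outputs the localization (loc) used for regression, producing 4 coordinate values (x, y, w, h) per default box. These layers also each pass through a priorBox layer that generates the default boxes (as coordinates). The number of default boxes for each of these layers is fixed in advance. Finally, the three kinds of results (conf, loc and default boxes) are each concatenated across layers and passed to the loss layer.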

The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps. Experiments show that the more aspect ratios the default boxes cover, the better the results. The default boxes used here are similar to the anchor boxes in Faster R-CNN. In Faster R-CNN, anchors are used only on the last convolutional layer, whereas in SSD default boxes are applied to the feature maps of several different layers.
To achieve high detection accuracy, they produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio. In other words, feature maps from both lower and upper layers are used for detection.

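A default box works as follows: for every cell of a feature map, several bounding boxes are predefined, all centered on the center of that cell and each with its own width and height. The goal is to match each object's ground truth (gt) box to one or more default boxes of some cell. Suppose each feature map cell has k default boxes. For each default box we need to predict c class scores and 4 offsets (the displacement of the gt box relative to the default box). If a feature map has size m*n, that is, m*n feature map cells, then this feature map has (c+4)*k*m*n outputs in total.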
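The description above can be made concrete with a minimal sketch of how default boxes might be laid out on one feature map. The parameterization w = s*sqrt(a), h = s/sqrt(a) for scale s and aspect ratio a follows the SSD paper; the function names, the 3*3 grid and the scale 0.5 are illustrative assumptions, not taken from the reference code.

```python
import math

def default_boxes_for_cell(row, col, fmap_h, fmap_w, scale, aspect_ratios):
    """Return (cx, cy, w, h) boxes in [0, 1] image coordinates,
    all centered on the center of feature-map cell (row, col)."""
    cx = (col + 0.5) / fmap_w
    cy = (row + 0.5) / fmap_h
    boxes = []
    for ar in aspect_ratios:
        # Same area (scale ** 2) for every aspect ratio, different shape.
        w = scale * math.sqrt(ar)
        h = scale / math.sqrt(ar)
        boxes.append((cx, cy, w, h))
    return boxes

def default_boxes(fmap_h, fmap_w, scale, aspect_ratios):
    """Default boxes for every cell of an fmap_h * fmap_w feature map."""
    boxes = []
    for row in range(fmap_h):
        for col in range(fmap_w):
            boxes.extend(
                default_boxes_for_cell(row, col, fmap_h, fmap_w, scale, aspect_ratios)
            )
    return boxes

# Hypothetical toy setting: 3*3 feature map, k = 3 boxes per cell.
boxes = default_boxes(fmap_h=3, fmap_w=3, scale=0.5, aspect_ratios=(1.0, 2.0, 0.5))
print(len(boxes))  # m * n * k = 3 * 3 * 3 = 27
```

Every cell contributes the same k shapes, so only the center changes from cell to cell. This is what lets a single convolution predict scores and offsets for all of them at once.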

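These output counts come from convolving that layer's feature map with 3*3 kernels: (c+4)*k kernels are used, and each one is applied at all m*n locations. The outputs fall into two parts (the actual code convolves the feature map with two separate sets of 3*3 kernels). The c*k*m*n confidence outputs give the per-class confidence of each default box. The 4*k*m*n localization outputs give the coordinates of each default box.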
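As a quick check of the arithmetic, the sketch below computes the kernel counts and output counts for one feature map. The value c = 21 is the VOC setting mentioned earlier; the 38*38 map with k = 4 is assumed here as an example (it matches conv4_3 in the SSD300 configuration of the paper), and the function names are hypothetical.

```python
def head_kernels(num_classes, boxes_per_cell):
    """Number of 3*3 kernels in the confidence head and the localization head."""
    conf_kernels = num_classes * boxes_per_cell  # c * k
    loc_kernels = 4 * boxes_per_cell             # 4 * k
    return conf_kernels, loc_kernels

def output_counts(num_classes, boxes_per_cell, fmap_h, fmap_w):
    """Number of confidence and localization outputs for one feature map."""
    conf_kernels, loc_kernels = head_kernels(num_classes, boxes_per_cell)
    cells = fmap_h * fmap_w                      # m * n
    return conf_kernels * cells, loc_kernels * cells

conf_out, loc_out = output_counts(num_classes=21, boxes_per_cell=4, fmap_h=38, fmap_w=38)
print(head_kernels(21, 4))          # (84, 16)
print(conf_out, loc_out)            # 121296 23104
print(conf_out + loc_out)           # (c + 4) * k * m * n = 144400
```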

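At prediction time, the network directly predicts the offsets of each default box and its score for every class. The final result is then obtained through NMS (non-maximum suppression), as shown in panel (c) of the figure above.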
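The NMS step can be sketched as a greedy loop for a single class. This is a minimal illustration rather than the reference implementation: boxes are assumed to be (x1, y1, x2, y2) corners, and the IoU threshold of 0.45 is the value used in the SSD paper, assumed here as the default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
    too much, repeat. Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]: box 1 overlaps box 0 with IoU of about 0.68 and is suppressed
```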

Next, let's look at the network structure:

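YOLO takes a 448*448*3 input and produces a 7*7*30 output; its 7*7 grid cells predict 98 bounding boxes in total. SSD adds several convolutional layers after the original VGG16 to predict offsets and confidences (YOLO, by contrast, uses fully connected layers). Its input is 300*300*3, and it uses the outputs of conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2 to predict location and confidence.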
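To put the two designs side by side, here is a small calculation of how many boxes each predicts. The 98 for YOLO follows from 7*7 cells with 2 boxes each. The per-layer feature map sizes and boxes per cell for SSD300 are not given in this post; the values below are recalled from the SSD paper and should be read as assumptions.

```python
# YOLO: 7*7 grid cells, 2 boxes per cell.
yolo_boxes = 7 * 7 * 2

# SSD300: (layer name, feature map side length, default boxes per cell).
# Sizes and per-cell counts are assumed from the SSD paper's configuration.
ssd300_layers = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]

per_layer = {name: side * side * k for name, side, k in ssd300_layers}
total_default_boxes = sum(per_layer.values())

print(yolo_boxes)            # 98
print(per_layer["conv4_3"])  # 5776
print(total_default_boxes)   # 8732
```

Under these assumptions most of the default boxes come from the shallow, high-resolution conv4_3 layer.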

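The algorithm is effective at detecting objects of different aspect ratios, largely because each cell is given several default boxes with different ratios. The authors also note in the paper that the more default boxes there are, the better the results. Default boxes are similar to the anchors in Faster R-CNN.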

Finally, the paper also emphasizes the role of data augmentation, including random cropping, rotation, contrast adjustment and so on.

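The authors mention that the algorithm detects small objects less well than large ones. They attribute this to small objects carrying very little information in the top layers of the network, so increasing the input image size helps with small objects. Data augmentation also helps with small objects: a randomly cropped image is effectively a "zoomed-in" version of the original, so cropping not only increases the number of images but also enlarges them.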

Training

The key difference between training SSD and training a typical detector that uses region proposals, is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs.

Once this assignment is determined, the loss function and back propagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies.

Match Strategy

For each ground truth box we are selecting from default boxes that vary over location, aspect ratio, and scale.

We begin by matching each ground truth box to the default box with the best jaccard overlap, then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5).

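The benefit of this is that it simplifies the learning problem: the network is allowed to give high scores to several overlapping default boxes, rather than having to pick only the one with the largest IoU.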
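The two matching steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: boxes are (x1, y1, x2, y2) corners, the function names are hypothetical, and if two ground truth boxes share the same best default box, the later one simply wins.

```python
def iou(a, b):
    """Jaccard overlap (IoU) of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match(gt_boxes, default_boxes, threshold=0.5):
    """Return a list where entry d is the index of the ground truth box
    matched to default box d, or -1 if d is a negative (background)."""
    matches = [-1] * len(default_boxes)
    # Threshold rule: a default box is matched to the ground truth box it
    # overlaps most, provided that overlap is higher than the threshold.
    for d, db in enumerate(default_boxes):
        best_g, best_overlap = -1, threshold
        for g, gt in enumerate(gt_boxes):
            overlap = iou(gt, db)
            if overlap > best_overlap:
                best_g, best_overlap = g, overlap
        matches[d] = best_g
    # Best-overlap rule: every ground truth box keeps its single best default
    # box, even when that overlap is below the threshold.
    for g, gt in enumerate(gt_boxes):
        best_d = max(range(len(default_boxes)), key=lambda d: iou(gt, default_boxes[d]))
        matches[best_d] = g
    return matches

gt = [(0, 0, 10, 10)]
defaults = [(0, 0, 10, 10), (1, 1, 11, 11), (8, 8, 18, 18), (50, 50, 60, 60)]
print(match(gt, defaults))  # [0, 0, -1, -1]
```

The second default box is matched although it is not the best one, which is exactly the relaxation described above.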

Training Objective

The objective function is, naturally, a weighted sum of the loc loss and the conf loss.

Here, Lloc is the localization loss.

Lconf is the classification loss.
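For reference, the SSD paper writes this weighted sum in the form below. The symbols follow the paper's notation rather than anything defined in this post, so treat them as assumptions: x marks which default boxes are matched to which ground truth boxes, c is the class confidences, l is the predicted boxes, g is the ground truth boxes, N is the number of matched default boxes, and α is the weight on the localization term.

```latex
L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)
```

In the paper, Lloc is a Smooth L1 loss between the predicted and ground truth box parameters, Lconf is a softmax loss over the class confidences, and the loss is set to 0 when N = 0.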

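The remaining issues, such as choosing the scales and aspect ratios of the default boxes and controlling the ratio of positive to negative samples, are relatively simple and are not covered in detail here.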
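For readers who want a concrete picture of the positive/negative balancing step anyway, here is a minimal sketch of hard negative mining. It assumes the rule described in the SSD paper: negatives are sorted by confidence loss and only the hardest ones are kept, so that negatives outnumber positives by at most 3:1. The function name and the toy loss values are hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return the indices of the negative default boxes kept for the loss.

    conf_losses: per-default-box confidence loss
    is_positive: per-default-box flag, True if matched to a ground truth box
    """
    num_pos = sum(1 for pos in is_positive if pos)
    num_neg = neg_pos_ratio * num_pos
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: the ones the classifier gets most wrong.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return sorted(negatives[:num_neg])

# Toy example: 1 positive and 7 negatives, so at most 3 negatives are kept.
conf_losses = [0.1, 2.0, 0.5, 3.0, 0.2, 0.05, 1.0, 0.3]
is_positive = [True, False, False, False, False, False, False, False]
kept_negatives = hard_negative_mining(conf_losses, is_positive)
print(kept_negatives)  # [1, 3, 6]
```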

Interested readers can consult Wei Liu's implementation at https://github.com/weiliu89/caffe/tree/ssd

References:

https://zhuanlan.zhihu.com/p/32881740
