SSD: Single Shot MultiBox Detector

Wei Liu1, Dragomir Anguelov2, Dumitru Erhan3, Christian Szegedy3, Scott Reed4, Cheng-Yang Fu1, Alexander C. Berg1

1UNC Chapel Hill 2Zoox Inc. 3Google Inc. 4University of Michigan, Ann-Arbor 1wliu@cs.unc.edu, 2drago@zoox.com, 3{dumitru,szegedy}@google.com, 4reedscot@umich.edu, 1{cyfu,aberg}@cs.unc.edu

Abstract. We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and the subsequent pixel or feature resampling stage and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size. For 300×300 input, SSD achieves 72.1% mAP on VOC2007 test at 58 FPS on a Nvidia Titan X and for 500×500 input, SSD achieves 75.1% mAP, outperforming a comparable state of the art Faster R-CNN model. Code is available at https://github.com/weiliu89/caffe/tree/ssd .

 

Abstract (translation): We present a method for detecting objects in images using a single deep neural network. Our method, called SSD, outputs a discretized set of bounding boxes, generated from feature maps at different layers and with different aspect ratios.

At prediction time, the network computes, for each default box, a score for each object category, and at the same time adjusts the box shape so that it better fits the object's bounding rectangle.

In addition, to handle objects of various sizes, SSD combines predictions from feature maps with different resolutions.

Compared with detection models that require object proposals, SSD completely removes the proposal generation, pixel resampling, and feature resampling stages. This makes SSD easier to train and easier to integrate as the detection component of a larger system.

Experiments on PASCAL VOC, MS COCO, and ILSVRC show that, at comparable accuracy, SSD is much faster than methods based on region proposals.

Compared with other single-stage models (e.g. YOLO), SSD achieves higher accuracy, even with smaller input images. With 300×300 input on PASCAL VOC 2007 test, SSD reaches 72.1% mAP at 58 FPS on a Titan X.

With 500×500 input, the SSD model reaches 75.1% mAP, better than the current state-of-the-art Faster R-CNN.

Code: https://github.com/weiliu89/caffe/tree/ssd

 

1 Introduction
        Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, MS COCO, and ILSVRC detection, all based on Faster R-CNN [2], albeit with deeper features such as [3]. Although accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time or near real-time applications. Often detection speed for these approaches is measured in seconds per frame, and even the fastest high-accuracy detector, the basic Faster R-CNN, operates at only 7 frames per second (FPS). There have been a wide range of attempts to build faster detectors by attacking each stage of the detection pipeline (see related work in Sec. 4), but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy.

        This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy detection (58 FPS with mAP 72.1% on VOC2007 test, vs Faster R-CNN 7 FPS with mAP 73.2% or YOLO 45 FPS with mAP 63.4%). The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. This is not the first paper to do this (cf [4, 5]) but by adding a series of improvements, we manage to increase the accuracy significantly over previous attempts. Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. With these modifications we can achieve high-accuracy detection using relatively low resolution input, further increasing processing speed. While these contributions may seem small independently, we note that the resulting system improves accuracy on high-speed detection for PASCAL VOC from 63.4% mAP for YOLO to 72.1% mAP for our proposed network. This is a larger relative improvement in detection accuracy than that from the recent, very high-profile work on residual networks [3]. Furthermore, significantly improving the speed of high-quality detection can broaden the range of settings where computer vision is useful.

        We summarize our contributions as follows:

– We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state of the art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).

– The core of the SSD approach is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.

– In order to achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio.

– Together, these design features lead to simple end-to-end training and high accuracy, even with relatively low resolution input images, further improving the speed vs accuracy trade-off.

– Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, MS COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches.

 

1. Introduction (translation)

    Current state-of-the-art detection systems roughly follow these steps: first hypothesize a set of bounding boxes, then extract features from those boxes, and finally apply a classifier to decide whether each box contains an object and which category it belongs to. This pipeline has dominated detection benchmarks since Selective Search (IJCV 2013), up to the current leading results on PASCAL VOC, MS COCO, and ILSVRC, which are based on Faster R-CNN with ResNet features. But for embedded systems such methods need far too much computation and cannot detect in real time. Much work has pushed toward real-time detection, yet so far all of it trades detection accuracy for speed.

    This paper presents the first deep-network-based object detector that does not resample pixels or features for bounding box hypotheses, while remaining as accurate as the approaches that do. This yields a large speed improvement for high-accuracy detection (58 FPS with mAP 72.1% on VOC2007 test, vs Faster R-CNN 7 FPS with mAP 73.2% or YOLO 45 FPS with mAP 63.4%). The speed comes from eliminating the bounding box proposal stage and the subsequent pixel or feature resampling. This paper is not the first to do so, but it adds a series of improvements that preserve speed while also preserving detection accuracy. The improvements include: using a small convolutional filter to predict object categories and bounding box offsets; using separate predictors (filters) for different aspect ratios; and applying these filters to multiple feature maps from the later stages of the network, in order to detect at multiple scales. With these adjustments we achieve high-accuracy detection on low-resolution inputs, with even faster processing. Although these contributions may look small individually, the resulting system improves high-speed detection accuracy on PASCAL VOC from 63.4% mAP (YOLO) to 72.1% mAP, a larger relative improvement in detection accuracy than the recent, very high-profile work on residual networks. Moreover, significantly speeding up high-quality detection broadens the range of settings where computer vision is useful.

    Our contributions are summarized as follows:

  • We propose a new detection method, SSD, which is faster than YOLO (You Only Look Once), previously the fastest method, and more accurate; while keeping its speed, its mAP is comparable to methods based on region proposals (such as Faster R-CNN).
  • The core of SSD is to predict objects and their category scores, using small convolutional kernels on feature maps to predict the box offsets of a set of default bounding boxes.
  • To obtain higher detection accuracy, SSD predicts objects and box offsets on feature maps at different levels, and also obtains separate predictions for different aspect ratios.
  • These design choices keep detection accurate even for low-resolution input images, make the end-to-end design simple to train, and achieve a good trade-off between detection speed and detection accuracy.
  • The proposed model is evaluated on several datasets (PASCAL VOC, MS COCO, ILSVRC), and its detection time and accuracy are compared with current state-of-the-art detectors.

2. The Single Shot Detector (SSD)
    This section describes our proposed SSD framework for detection (Sec. 2.1) and the associated training methodology (Sec. 2.2). Afterwards, Sec. 3 presents dataset-specific model details and experimental results.

Fig. 1: SSD framework. (a) SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. 8×8 and 4×4 in (b) and (c)). For each default box, we predict both the shape offsets and the confidences for all object categories (c1, c2, ..., cp). At training time, we first match these default boxes to the ground truth boxes. For example, we have matched two default boxes with the cat and one with the dog, which are treated as positives and the rest as negatives. The model loss is a weighted sum between localization loss (e.g. Smooth L1 [6]) and confidence loss (e.g. Softmax).

2. The Single Shot Detector (SSD)

    This section describes the proposed SSD framework for detection (Sec. 2.1) and the associated training methodology (Sec. 2.2). Sec. 3 then presents dataset-specific model details and experimental results.

    Before continuing, it helps to be clear about what a default box and a feature map cell are (see Fig. 1):

  • A feature map cell is one of the grid cells obtained by dividing a feature map into, e.g., 8×8 or 4×4 cells.

  • A default box is one of a set of fixed-size boxes anchored at each cell, i.e., the dashed boxes in Fig. 1.

 


Fig. 1 (translation): SSD framework. (a) During training, SSD needs only an input image and the ground truth boxes for each object. In a convolutional fashion, a small set of default boxes with different aspect ratios is evaluated at each location of several feature maps with different scales. For each default box we predict its shape offsets and the confidences for all object categories (c1, c2, ..., cp). At training time, the default boxes are first matched against the ground truth boxes: here two default boxes are matched with the cat and one with the dog; these are treated as positives and the rest as negatives. The model loss is a weighted sum of the localization loss and the confidence loss.

2.1 Model

    The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network. We then add auxiliary structure to the network to produce detections with the following key features:

2.1 Model

    SSD is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes, along with scores for the presence of object class instances in each box; a non-maximum suppression (NMS) step is then applied to obtain the final predictions. The first part of the SSD model, which the paper calls the base network, is a standard architecture used for high-quality image classification (truncated before its classification layers). After the base network, auxiliary structure is added to the network:
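To make that final step concrete, here is a minimal numpy sketch of greedy per-class non-maximum suppression, which turns the fixed set of scored boxes into final detections. The function name and the corner-format box layout are illustrative choices of this sketch; the 0.45 jaccard overlap threshold follows the value the paper reports using at inference.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression for one object class.

    boxes  : (N, 4) array of [xmin, ymin, xmax, ymax]
    scores : (N,) confidences for this class
    Returns the indices of the boxes to keep, best score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # discard boxes that overlap the kept box too strongly
        order = order[1:][iou <= iou_threshold]
    return keep
```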

Multi-scale feature maps for detection We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer (cf Overfeat[4] and YOLO[5] that operate on a single scale feature map).

  •  Multi-scale feature maps for detection

      We add extra convolutional feature layers on top of the base network. These layers decrease in size progressively, which allows predictions at multiple scales. The convolutional model used for predicting detections is different for each feature layer (in contrast to Overfeat [4] and YOLO [5], which operate on a single-scale feature map). A rough sketch of such a size progression follows.
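The snippet below lists a plausible progression of feature map sizes for a 300×300 input, just to illustrate the shrinking layers. The layer names and exact sizes here are assumptions for illustration only; the actual architecture is shown in the paper's Fig. 2.

```python
# Hypothetical feature-map progression for a 300x300 input; the layer
# names and sizes are illustrative assumptions, not taken from the paper.
feature_maps = [
    ("conv4_3",  38),   # from the VGG16 base network
    ("conv7",    19),
    ("conv8_2",  10),
    ("conv9_2",   5),
    ("conv10_2",  3),
    ("pool11",    1),   # a 1x1 map effectively sees the whole image
]
for name, size in feature_maps:
    print(f"{name}: {size}x{size} locations, each predicting default boxes")
```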

Convolutional predictors for detection Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture in Fig. 2. For a feature layer of size m×n with p channels, the basic element for predicting parameters of a potential detection is a 3×3×p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m×n locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf the architecture of YOLO[5] that uses an intermediate fully connected layer instead of a convolutional filter for this step).

  • Convolutional predictors for detection

    Each added feature layer (or, optionally, an existing feature layer of the base network) can produce a fixed set of detection predictions using a set of convolutional filters; see Fig. 2. For an m×n feature layer with p channels, each predicting filter is a 3×3×p kernel, and each prediction is either a score for a category or a shape offset relative to the default box coordinates. Applying a kernel at each of the m×n feature map locations produces one output value there. The bounding box offset values are measured relative to the default box positions at each feature map location (the YOLO architecture instead uses an intermediate fully connected layer rather than a convolutional filter for this step). A minimal sketch of such a predictor follows.
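The numpy sketch below shows what one such predictor computes: a 3×3×p kernel slid over every location of an m×n×p feature map, emitting k·(c+4) values per location (c class scores plus 4 offsets for each of k default boxes). Random weights stand in for the learned filters, and the function name is hypothetical.

```python
import numpy as np

def conv_head(feature_map, k, c, rng=np.random.default_rng(0)):
    """Minimal sketch of SSD's 3x3 convolutional predictor.

    feature_map : (m, n, p) activations from one feature layer
    k, c        : default boxes per location, number of categories
    Returns an (m, n, k*(c+4)) tensor: at every location, c class
    scores plus 4 box offsets for each default box. Weights are random
    here; in the real network they are learned.
    """
    m, n, p = feature_map.shape
    out_ch = k * (c + 4)
    w = rng.standard_normal((3, 3, p, out_ch)) * 0.01
    padded = np.pad(feature_map, ((1, 1), (1, 1), (0, 0)))  # 'same' padding
    out = np.empty((m, n, out_ch))
    for i in range(m):
        for j in range(n):
            patch = padded[i:i + 3, j:j + 3, :]             # 3x3xp window
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

preds = conv_head(np.random.rand(8, 8, 512), k=4, c=21)
print(preds.shape)   # (8, 8, 100)
```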

Default boxes and aspect ratios We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box instance relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c + 4)k filters that are applied around each location in the feature map, yielding (c + 4)×k×m×n outputs for a m×n feature map. For an illustration of default boxes, please refer to Fig. 1. Our default boxes are similar to the anchor boxes used in Faster R-CNN [2], however we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.

  • Default boxes and aspect ratios

    A set of default bounding boxes is associated with each feature map cell, for multiple feature maps at the top of the network. The position of each box relative to its corresponding feature map cell is fixed. At each feature map cell we predict the offsets of the output box relative to each default box, as well as the per-class scores indicating the presence of a class instance in each box (a score is computed for every category). Concretely, for each of the k boxes at a given location we compute c class scores plus 4 offsets relative to the default box shape, giving (c + 4)k filters per location and (c + 4)×k×m×n outputs for an m×n feature map. These default boxes are very similar to the anchor boxes in Faster R-CNN (see that paper for details), but unlike Faster R-CNN, here they are applied to several feature maps of different resolutions, which efficiently discretizes the space of possible output box shapes. The arithmetic is illustrated below.
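A quick sanity check of that output count, using the two example maps from Fig. 1 and a hypothetical VOC-style setup (k = 4 default boxes, c = 21 categories counting background):

```python
# Outputs per feature layer: (c + 4) * k values at each of the m*n
# locations. The 8x8 and 4x4 maps are the examples from Fig. 1;
# k = 4 and c = 21 are assumptions for illustration.
def num_outputs(m, n, k, c):
    return (c + 4) * k * m * n

for m, n in [(8, 8), (4, 4)]:
    print(f"{m}x{n} map -> {num_outputs(m, n, k=4, c=21)} values")
# 8x8 map -> 6400 values
# 4x4 map -> 1600 values
```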

       

 

 

Fig. 2: A comparison between two single shot detection models: SSD and YOLO [5]. Our SSD model adds several feature layers to the end of a base network, which predict the offsets to default boxes of different scales and aspect ratios and their associated confidences. SSD with a 300×300 input size significantly outperforms its 448×448 YOLO counterpart in accuracy on VOC2007 test while also improving the run-time speed, albeit YOLO's customized network is faster than VGG16.

 

 2.2 Training

    The key difference between training SSD and training a typical detector that uses region proposals and pooling before a final classifier, is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO[5] and for the region proposal stages of Faster R-CNN[2] and MultiBox[7]. Once this assignment is determined, the loss function and back propagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as hard negative mining and data augmentation strategies.

 2.2 Training

    During training, the key difference between SSD and detectors that use region proposals + pooling before a final classifier is that the ground truth information must be assigned to specific boxes in SSD's fixed set of detector outputs. As mentioned earlier, SSD outputs a predefined, fixed-size set of bounding boxes.

For example, the ground truth for the dog in Fig. 1 is a red bounding box; during labeling, that ground truth box has to be assigned to one of the fixed output boxes in Fig. 1(c), namely the red dashed box.

In fact, as the paper points out, some version of this assignment is also required in YOLO, in the region proposal stage of Faster R-CNN, and in MultiBox.

Once the ground truth boxes in the training images are matched to the fixed output boxes, the loss function can be computed and back-propagation applied end-to-end.

Training also involves several choices:

  • choosing the set of default boxes

  • choosing the scales mentioned above

  • hard negative mining

  • data augmentation strategies

The following parts describe how the paper handles each of these.

Matching strategy

How are ground truth boxes paired with default boxes to form training labels?

First, as in MultiBox, each ground truth box is matched to the default box with the best jaccard overlap, which guarantees that every ground truth box corresponds to at least one default box.

Unlike MultiBox, however, SSD then also matches a default box to any ground truth box with which its jaccard overlap is higher than a threshold (0.5 here).

In other words, during training we establish the correspondence between ground truths and default boxes: for each ground truth we select, among default boxes at different locations, aspect ratios, and scales, the best-overlapping one, as in the original MultiBox [7]; and we additionally accept every default box whose overlap with some ground truth exceeds 0.5. This simplifies the learning problem, since the network may predict high scores for multiple overlapping default boxes instead of being forced to pick only the single best one. A minimal sketch of this two-step matching follows.
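Below is a minimal numpy sketch of the two-step matching (best-overlap assignment plus the 0.5-threshold relaxation). The function names and the box layout [xmin, ymin, xmax, ymax] are assumptions of this sketch.

```python
import numpy as np

def jaccard(boxes_a, boxes_b):
    """Pairwise jaccard overlap (IoU) between two sets of boxes.
    Boxes are [xmin, ymin, xmax, ymax]; returns an (A, B) matrix."""
    a = boxes_a[:, None, :]   # (A, 1, 4)
    b = boxes_b[None, :, :]   # (1, B, 4)
    iw = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = iw * ih
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(gt_boxes, default_boxes, threshold=0.5):
    """For every default box, return the index of its matched ground
    truth box, or -1 for negatives."""
    overlap = jaccard(gt_boxes, default_boxes)        # (G, D)
    matches = np.full(default_boxes.shape[0], -1)
    # Step 2 first: any default box above the threshold is a positive.
    best_gt = overlap.argmax(axis=0)
    above = overlap.max(axis=0) > threshold
    matches[above] = best_gt[above]
    # Step 1 overrides: every ground truth keeps its best default box.
    best_default = overlap.argmax(axis=1)
    matches[best_default] = np.arange(gt_boxes.shape[0])
    return matches
```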

 


 

 

Choosing scales and aspect ratios for default boxes

In most CNNs, the feature maps of deeper layers become progressively smaller. This not only reduces computation and memory cost, but also gives the final feature maps a degree of translation and scale invariance.

To handle objects of different scales, some papers, e.g. Overfeat (ICLR 2014, "Integrated recognition, localization and detection using convolutional networks") and SPP-net (ECCV 2014, "Spatial pyramid pooling in deep convolutional networks for visual recognition"), convert the image into different scales, process each scaled image independently through a CNN, and then combine the results across scales.

In fact, however, using feature maps from different layers of a single network achieves the same effect, while sharing parameters across all object scales.

Earlier work, such as FCN (CVPR 2015, "Fully convolutional networks for semantic segmentation") and Hypercolumns (CVPR 2015, "Hypercolumns for object segmentation and fine-grained localization"), used feature maps from the earlier layers of a CNN to improve segmentation, because lower layers preserve more fine detail of the input image. ParseNet (ICLR 2016, "ParseNet: Looking wider to see better") likewise demonstrated that this idea is feasible.

Therefore, this paper uses both lower and upper feature maps for detection predictions. Fig. 1 shows the two feature map scales used in the illustration: an 8×8 feature map and a 4×4 feature map.

 

In general, different layers of a CNN have receptive fields of different sizes. The receptive field here refers to the region of the input image that one node of an output feature map corresponds to.

Fortunately, in the SSD design the default boxes do not have to correspond to the actual receptive field of each layer. Instead, the design makes specific feature map locations responsible for specific regions of the image and for specific object scales. Suppose m feature maps are used for prediction; the scale of the default boxes for the k-th feature map is computed as

    s_k = s_min + ((s_max − s_min) / (m − 1)) · (k − 1),   k ∈ [1, m],

where the paper uses s_min = 0.2 and s_max = 0.9, so the lowest layer has a scale of 0.2, the highest has 0.9, and the layers in between are evenly spaced. Different aspect ratios a_r ∈ {1, 2, 3, 1/2, 1/3} are imposed on the default boxes, giving each box a width w_k^a = s_k·√a_r and a height h_k^a = s_k/√a_r.

For aspect ratio 1, the paper also adds one more default box with scale s'_k = √(s_k · s_{k+1}). So in the end there are 6 default boxes per feature map location, computed as in the sketch below.
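A small sketch computing these scales and the six (w, h) pairs per feature map, using the s_min/s_max values above. The treatment of the extra aspect-ratio-1 scale on the last layer (falling back to 1.0) is an implementation assumption of this sketch.

```python
import math

def default_box_shapes(m, s_min=0.2, s_max=0.9):
    """(w, h) pairs, relative to the input image size, for each of
    m >= 2 feature maps, following the linear scale rule above."""
    scales = [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]
    ratios = [1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0]
    per_map = []
    for k, s in enumerate(scales):
        boxes = [(s * math.sqrt(r), s / math.sqrt(r)) for r in ratios]
        # extra aspect-ratio-1 box at scale sqrt(s_k * s_{k+1});
        # using 1.0 beyond the last map is an assumption of this sketch
        s_next = scales[k + 1] if k + 1 < m else 1.0
        boxes.append((math.sqrt(s * s_next),) * 2)
        per_map.append(boxes)          # 6 (w, h) pairs per location
    return per_map

for k, boxes in enumerate(default_box_shapes(m=5)):
    print(f"feature map {k}: base scale {boxes[0][0]:.2f}, {len(boxes)} boxes")
```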

 

 

Hard negative mining

After the matching step, most default boxes are negatives: the prediction (default) boxes that match a ground truth box are far outnumbered by those that do not. This imbalance between negative and positive boxes makes training hard to converge.

Therefore, the paper sorts the negative boxes by their confidence loss, from highest to lowest, and keeps only the top ones, so that the ratio of negatives to positives is at most 3:1; a sketch of this selection appears below.

The paper found experimentally that this ratio leads to faster optimization and more stable training.
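A minimal numpy sketch of that selection. The per-box confidence losses and the positive mask are assumed to come from the matching step above, and the function name is hypothetical.

```python
import numpy as np

def hard_negative_mask(conf_loss, positive_mask, ratio=3):
    """Keep all positives and only the hardest negatives, with at most
    `ratio` negatives per positive.

    conf_loss     : (D,) per-default-box confidence loss
    positive_mask : (D,) bool, True where the box matched a ground truth
    """
    num_pos = int(positive_mask.sum())
    num_neg = min(ratio * num_pos, int((~positive_mask).sum()))
    # positives get -inf so they never rank among the negatives
    neg_loss = np.where(positive_mask, -np.inf, conf_loss)
    # negatives with the largest confidence loss are the "hardest"
    hardest = np.argsort(neg_loss)[::-1][:num_neg]
    mask = positive_mask.copy()
    mask[hardest] = True
    return mask
```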

 

Data augmentation

The training data is also augmented. (For general background on data augmentation, the post "Must Know Tips/Tricks in Deep Neural Networks" is recommended; its section 1 covers the technique.)

Each training image is processed by one randomly chosen option:

  • use the original image;
  • sample a patch whose minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9;
  • randomly sample a patch.

The size of a sampled patch is [0.1, 1] of the original image size, and its aspect ratio is between 1/2 and 2.

When the center of a ground truth box lies inside the sampled patch, the overlapping part of that box is kept.

After these sampling steps, each sampled patch is resized to a fixed size and is horizontally flipped with probability 0.5.
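A rough sketch of the per-image sampling choice and patch geometry described above. Interpreting "size" as an area fraction, the retry budget, and the fallback to the full image are assumptions of this sketch, not details from the paper.

```python
import random

rng = random.Random(0)

def choose_option():
    """Pick one augmentation option per training image: use the whole
    image (None), crop with a minimum jaccard overlap, or crop freely."""
    return rng.choice([None, 0.1, 0.3, 0.5, 0.7, 0.9, "random"])

def sample_patch(img_w, img_h):
    """Sample a patch whose area is in [0.1, 1] of the image (an
    interpretation assumed here) and whose aspect ratio is in [1/2, 2].
    Returns (x, y, w, h) in pixels."""
    for _ in range(50):                       # retry budget (assumption)
        area = rng.uniform(0.1, 1.0) * img_w * img_h
        ratio = rng.uniform(0.5, 2.0)
        w, h = (area * ratio) ** 0.5, (area / ratio) ** 0.5
        if w <= img_w and h <= img_h:
            x = rng.uniform(0, img_w - w)
            y = rng.uniform(0, img_h - h)
            return x, y, w, h
    return 0.0, 0.0, float(img_w), float(img_h)   # fall back to full image
```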
