Paper Reading Notes (36): Single-Shot Refinement Neural Network for Object Detection

For object detection, the two-stage approach (e.g., Faster R-CNN) has achieved the highest accuracy, whereas the one-stage approach (e.g., SSD) has the advantage of high efficiency. To inherit the merits of both while overcoming their disadvantages, in this paper we propose a novel single-shot detector, called RefineDet, that achieves better accuracy than two-stage methods while maintaining efficiency comparable to one-stage methods. RefineDet consists of two inter-connected modules, namely, the anchor refinement module and the object detection module. Specifically, the former aims to (1) filter out negative anchors to reduce the search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor. The latter module takes the refined anchors from the former as input to further improve the regression and predict multi-class labels. Meanwhile, we design a transfer connection block to transfer the features in the anchor refinement module to predict locations, sizes, and class labels of objects in the object detection module. The multi-task loss function enables us to train the whole network in an end-to-end way. Extensive experiments on PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO demonstrate that RefineDet achieves state-of-the-art detection accuracy with high efficiency. Code is available at https://github.com/sfzhang15/RefineDet.

Object detection has advanced significantly in recent years within the framework of deep neural networks (DNNs). Current state-of-the-art DNN detectors can be divided into two categories: (1) the two-stage approach, including [3, 15, 36, 41], and (2) the one-stage approach, including [30, 35]. In the two-stage approach, a sparse set of candidate object boxes is first generated, and these boxes are then further classified and regressed. Two-stage methods have achieved top performance on several challenging benchmarks, including PASCAL VOC [8] and MS COCO [29].

The one-stage approach detects objects by regular, dense sampling over locations, scales, and aspect ratios. Its main advantage is high computational efficiency. However, its detection accuracy usually trails that of the two-stage approach, and one of the main reasons is the class imbalance problem [28].

Some recent one-stage methods aim to address the class imbalance problem in order to improve detection accuracy. Kong et al. [24] use an objectness prior constraint on convolutional feature maps to significantly reduce the search space of objects. Lin et al. [28] address the class imbalance issue by reshaping the standard cross-entropy loss so that training focuses on a sparse set of hard examples and the loss assigned to well-classified examples is down-weighted. Zhang et al. [53] design a max-out labeling mechanism to reduce false positives resulting from class imbalance.

In our opinion, the current state-of-the-art two-stage methods, e.g., Faster R-CNN [36], R-FCN [5], and FPN [27], have three advantages over the one-stage methods: (1) using a two-stage structure with sampling heuristics to handle class imbalance; (2) using a two-step cascade to regress the object box parameters; (3) using two-stage features to describe the objects. In this work, we design a novel object detection framework, called RefineDet, to inherit the merits of the two approaches (i.e., one-stage and two-stage) and overcome their shortcomings. It improves the architecture of the one-stage approach by using two inter-connected modules (see Figure 1), namely, the anchor refinement module (ARM) and the object detection module (ODM). Specifically, the ARM is designed to (1) identify and remove negative anchors to reduce the search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor. The ODM takes the refined anchors from the ARM as input to further improve the regression and predict multi-class labels. As shown in Figure 1, these two inter-connected modules imitate the two-stage structure and thus inherit the three aforementioned advantages to produce accurate detection results with high efficiency. In addition, we design a transfer connection block (TCB) to transfer the features in the ARM to predict locations, sizes, and class labels of objects in the ODM. The multi-task loss function enables us to train the whole network in an end-to-end way.

Extensive experiments on the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO benchmarks demonstrate that RefineDet outperforms the state-of-the-art methods. Specifically, it achieves 85.8% and 86.8% mAP on VOC 2007 and VOC 2012, respectively, with the VGG-16 network. Meanwhile, it outperforms the previously best published results from both one-stage and two-stage approaches by achieving 41.8% AP on MS COCO test-dev with ResNet-101. In addition, RefineDet is time efficient: at inference it runs at 40.2 FPS and 24.1 FPS on an NVIDIA Titan X GPU with input sizes of 320 × 320 and 512 × 512, respectively.

The main contributions of this work are summarized as follows. (1) We introduce a novel one-stage framework for object detection, composed of two inter-connected modules, i.e., the ARM and the ODM. This leads to better performance than the two-stage approach while maintaining the high efficiency of the one-stage approach. (2) To ensure effectiveness, we design the TCB to transfer the features in the ARM to handle the more challenging tasks in the ODM, i.e., predicting accurate object locations, sizes, and class labels. (3) RefineDet achieves new state-of-the-art results on generic object detection (i.e., PASCAL VOC 2007 [10], PASCAL VOC 2012 [11], and MS COCO [29]).

Classical Object Detectors. Early object detection methods are based on the sliding-window paradigm, which applies hand-crafted features and classifiers on dense image grids to find objects. As one of the most successful methods, Viola and Jones [47] use Haar features and AdaBoost to train a series of cascaded classifiers for face detection, achieving satisfactory accuracy with high efficiency. DPM [12] is another popular method that uses mixtures of multiscale deformable part models to represent highly variable object classes, and it maintained top results on PASCAL VOC [8] for many years. However, with the arrival of deep convolutional networks, the object detection task was quickly dominated by CNN-based detectors, which can be roughly divided into two categories, i.e., the two-stage approach and the one-stage approach.

Two-Stage Approach. The two-stage approach consists of two parts: the first (e.g., Selective Search [46], EdgeBoxes [55], DeepMask [32, 33], RPN [36]) generates a sparse set of candidate object proposals, and the second determines the accurate object regions and the corresponding class labels using convolutional networks. Notably, the two-stage approach (e.g., R-CNN [16], SPPnet [18], Fast R-CNN [15], and Faster R-CNN [36]) achieves dominant performance on several challenging datasets (e.g., PASCAL VOC 2012 [11] and MS COCO [29]). Since then, numerous effective techniques have been proposed to further improve performance, such as architecture design [5, 26, 54], training strategy [41, 48], contextual reasoning [1, 14, 40, 50], and exploiting multiple layers [3, 25, 27, 42].

One-Stage Approach. Owing to its high efficiency, the one-stage approach has attracted much more attention recently. Sermanet et al. [38] present the OverFeat method for classification, localization, and detection based on deep ConvNets, which is trained end-to-end, from raw pixels to final categories. Redmon et al. [34] use a single feedforward convolutional network, called YOLO, to directly predict object classes and locations, which is extremely fast. After that, YOLOv2 [35] was proposed to improve YOLO in several aspects, e.g., adding batch normalization to all convolution layers, using a high-resolution classifier, and using convolution layers with anchor boxes instead of fully connected layers to predict bounding boxes. Liu et al. [30] propose the SSD method, which spreads anchors of different scales across multiple layers within a ConvNet and makes each layer focus on predicting objects of a certain scale. DSSD [13] introduces additional context into SSD via deconvolution to improve accuracy. DSOD [39] designs an efficient framework and a set of principles to learn object detectors from scratch, following the network structure of SSD. To improve accuracy, some one-stage methods [24, 28, 53] aim to address the extreme class imbalance problem by redesigning the loss function or classification strategies. Although one-stage detectors have made good progress, their accuracy still trails that of two-stage methods.

Refer to the overall network architecture shown in Figure 1. Similar to SSD [30], RefineDet is based on a feedforward convolutional network that produces a fixed number of bounding boxes, together with scores indicating the presence of different classes of objects in those boxes, followed by non-maximum suppression to produce the final result. RefineDet is formed by two inter-connected modules, i.e., the ARM and the ODM. The ARM aims to remove negative anchors so as to reduce the search space for the classifier and also to coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor, whereas the ODM aims to regress accurate object locations and predict multi-class labels based on the refined anchors. The ARM is constructed by removing the classification layers of two base networks (i.e., VGG-16 [43] and ResNet-101 [19], pretrained on ImageNet [37]) and adding some auxiliary structures to meet our needs. The ODM is composed of the outputs of the TCBs followed by the prediction layers (i.e., convolution layers with 3 × 3 kernels), which generate the scores for object classes and the shape offsets relative to the refined anchor box coordinates. The following explains the three core components of RefineDet: (1) the transfer connection block (TCB), which converts the features from the ARM to the ODM for detection; (2) two-step cascaded regression, which accurately regresses the locations and sizes of objects; (3) negative anchor filtering, which rejects well-classified negative anchors early and mitigates the imbalance issue.
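The final non-maximum suppression step mentioned above can be illustrated with a minimal greedy sketch in plain Python. The corner box format `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the IoU threshold of 0.45 are assumptions made for illustration; the excerpt does not specify them, and this is not the authors' implementation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return indices of kept boxes, highest score first.

    iou_threshold is an assumed value, not taken from the text above.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In a multi-class detector this routine would typically be run once per class on the boxes scored for that class.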

Transfer Connection Block. To link the ARM and the ODM, we introduce the TCBs to convert features of different layers from the ARM into the form required by the ODM, so that the ODM can share features from the ARM. Notably, we only use the TCBs on the ARM feature maps that are associated with anchors. Another function of the TCBs is to integrate large-scale context [13, 27] by adding high-level features to the transferred features to improve detection accuracy. To match the dimensions between them, we use the deconvolution operation to enlarge the high-level feature maps and then sum them element-wise. Then, we add a convolution layer after the summation to ensure the discriminability of features for detection. The architecture of the TCB is shown in Figure 2.
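The dimension matching described above can be checked with simple arithmetic. The sketch below uses the standard transposed-convolution output-size formula (no dilation, no output padding) to confirm that each higher-level map, once enlarged, lines up with the next lower-level ARM map before the element-wise sum. The feature map sizes `[40, 20, 10, 5]` and the deconvolution hyperparameters (kernel 2, stride 2, padding 0) are assumptions for illustration only; the excerpt does not give them.

```python
def deconv_output_size(in_size, kernel, stride, padding=0):
    """Spatial output size of a transposed convolution (dilation 1, no output padding)."""
    return (in_size - 1) * stride - 2 * padding + kernel


def tcb_plan(arm_sizes, kernel=2, stride=2, padding=0):
    """For each adjacent pair of ARM maps (low level, high level), report whether
    the upsampled high-level map matches the low-level map for element-wise sum.

    arm_sizes lists feature map side lengths from the lowest to the highest level.
    """
    plan = []
    for low, high in zip(arm_sizes[:-1], arm_sizes[1:]):
        up = deconv_output_size(high, kernel, stride, padding)
        plan.append((low, high, up, up == low))
    return plan


# Hypothetical feature map sizes for a 320 x 320 input.
plan = tcb_plan([40, 20, 10, 5])
```

Other settings that double the resolution, for example kernel 4 with stride 2 and padding 1, pass the same check, so the sketch does not pin down which variant a given implementation uses.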

Two-Step Cascaded Regression. Current one-stage methods [13, 24, 30] rely on one-step regression based on various feature layers with different scales to predict the locations and sizes of objects, which is rather inaccurate in some challenging scenarios, especially for small objects. To address this, we present a two-step cascaded regression strategy to regress the locations and sizes of objects. That is, we use the ARM to first adjust the locations and sizes of anchors to provide better initialization for the regression in the ODM. Specifically, we associate n anchor boxes with each regularly divided cell on the feature map. The initial position of each anchor box relative to its corresponding cell is fixed. At each feature map cell, we predict four offsets of the refined anchor boxes relative to the original tiled anchors and two confidence scores indicating the presence of foreground objects in those boxes. Thus, we obtain n refined anchor boxes at each feature map cell.
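A minimal sketch of the two-step idea follows: the ARM offsets are applied to the original tiled anchor, and the ODM offsets are then applied to the refined anchor rather than to the original one. The center-size box format `(cx, cy, w, h)` and the offset parameterization (center shifts scaled by anchor size, log-space size changes, no variance scaling) are assumptions borrowed from common anchor-based detectors; the text above only says that four offsets are predicted.

```python
import math


def decode(anchor, offsets):
    """Apply four offsets (dx, dy, dw, dh) to an anchor given as (cx, cy, w, h).

    The parameterization is an assumption, not taken from the text.
    """
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def two_step_decode(anchor, arm_offsets, odm_offsets):
    """Step 1: the ARM refines the tiled anchor.
    Step 2: the ODM regresses relative to the refined anchor."""
    refined = decode(anchor, arm_offsets)
    return decode(refined, odm_offsets)


tiled_anchor = (0.5, 0.5, 0.2, 0.2)
final_box = two_step_decode(
    tiled_anchor,
    arm_offsets=(0.5, 0.0, 0.0, 0.0),  # coarse shift to the right
    odm_offsets=(0.0, 0.5, 0.0, 0.0),  # finer shift downward
)
```

Because the second step starts from the refined anchor, the ODM only has to predict a small residual correction, which is the intuition behind the better initialization described above.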

After obtaining the refined anchor boxes, we pass them to the corresponding feature maps in the ODM to further generate object categories and accurate object locations and sizes, as shown in Figure 1. The corresponding feature maps in the ARM and the ODM have the same dimensions. We calculate c class scores and the four accurate offsets of objects relative to the refined anchor boxes, yielding c + 4 outputs for each refined anchor box to complete the detection task. This process is similar to the default boxes used in SSD [30]. However, in contrast to SSD [30], which directly uses the regularly tiled default boxes for detection, RefineDet uses a two-step strategy, i.e., the ARM generates the refined anchor boxes, and the ODM takes the refined anchor boxes as input for further detection, leading to more accurate detection results, especially for small objects.

Negative Anchor Filtering. To reject well-classified negative anchors early and mitigate the imbalance issue, we design a negative anchor filtering mechanism. Specifically, in the training phase, if the negative confidence of a refined anchor box is larger than a preset threshold θ (set empirically to θ = 0.99), we discard it when training the ODM. That is, we only pass the refined hard negative anchor boxes and the refined positive anchor boxes to train the ODM. Likewise, in the inference phase, if a refined anchor box is assigned a negative confidence larger than θ, it is discarded in the ODM for detection.
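The filtering rule can be sketched in a few lines of plain Python. The threshold θ = 0.99 comes from the text; the function name `filter_negative_anchors` and the representation of the ARM output as a list of per-anchor negative (background) confidences are assumptions for illustration.

```python
THETA = 0.99  # threshold from the text, set empirically by the authors


def filter_negative_anchors(negative_confidences, theta=THETA):
    """Return indices of refined anchors that are passed on to the ODM.

    An anchor is discarded when its negative confidence is larger than theta,
    so anchors with confidence exactly equal to theta are kept.
    """
    return [i for i, p in enumerate(negative_confidences) if p <= theta]


# Hypothetical ARM background confidences for five refined anchors.
arm_negative_conf = [0.995, 0.5, 0.99, 0.999, 0.01]
kept_indices = filter_negative_anchors(arm_negative_conf)
```

The same rule applies in both phases: during training the kept anchors are the only ones used to train the ODM, and during inference the discarded anchors are not considered for detection.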

In this paper, we present a single-shot refinement neural network based detector, which consists of two inter-connected modules, i.e., the ARM and the ODM. The ARM aims to filter out negative anchors to reduce the search space for the classifier and also to coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor, while the ODM takes the refined anchors from the ARM as input to regress accurate object locations and sizes and predict the corresponding multi-class labels. The whole network is trained in an end-to-end fashion with the multi-task loss. We carry out several experiments on the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO datasets to demonstrate that RefineDet achieves state-of-the-art detection accuracy with high efficiency. In the future, we plan to employ RefineDet to detect other specific kinds of objects, e.g., pedestrians, vehicles, and faces, and to introduce the attention mechanism into RefineDet to further improve performance.
Figure 1: Architecture of RefineDet. For better visualization, we only display the layers used for detection. The celadon parallelograms denote the refined anchors associated with different feature layers. The stars represent the centers of the refined anchor boxes, which are not regularly tiled on the image.
