目标检测经典论文——Faster R-CNN论文翻译：Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Net

最新推荐文章于 2021-12-14 19:22:58 发布

bigcindy

最新推荐文章于 2021-12-14 19:22:58 发布

阅读量5.4k

点赞数 4

CC 4.0 BY-SA版权

分类专栏：深度学习经典论文翻译文章标签： Faster R-CNN SPPnet RPN Region Proposal 目标检测

本文链接：https://blog.csdn.net/Jwenxue/article/details/107671999

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN：通过Region Proposal网络实现实时目标检测

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

Abstract

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features——using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

Index Terms

Object Detection, Region Proposal, Convolutional Neural Network.

摘要

最先进的目标检测网络依靠region proposal算法来推理检测目标的位置。SPPnet[1]和Fast R-CNN[2]等类似的研究已经减少了这些检测网络的运行时间，使得region proposal计算成为一个瓶颈。在这项工作中，我们引入了一个region proposal网络（RPN），该网络与检测网络共享整个图像的卷积特征，从而使近乎零成本的region proposal成为可能。RPN是一个全卷积网络，可以同时在每个位置预测目标边界和目标分数。RPN经过端到端的训练，可以生成高质量的region proposal，并使用Fast R-CNN完成检测。我们将RPN和Fast R-CNN通过共享卷积特征进一步合并为一个单一的网络——使用最近流行的具有“注意力”机制的神经网络术语，RPN组件告诉统一网络在哪里寻找。对于非常深的VGG-16模型[3]，我们的检测系统在GPU上的帧率为5fps（包括所有步骤），同时在PASCAL VOC 2007、2012和MS COCO数据集上达到了目前最好的目标检测精度，每个图像只有300个proposals。在ILSVRC和COCO 2015竞赛中，Faster R-CNN和RPN是多个比赛中获得第一名的基础。代码已公开。

关键字

目标检测，Region Proposal，卷积神经网络

1. Introduction

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.

1. 引言

目标检测的最新进展是由region proposal方法（例如[4]）和基于区域的卷积神经网络（R-CNN）[5]的成功驱动的。尽管在[5]中最初开发的基于区域的CNN计算代价很大，但是由于在各种proposals中共享卷积，所以其成本已经大大降低了[1]，[2]。忽略花费在region proposals上的时间，最新版本Fast R-CNN[2]利用非常深的网络[3]实现了接近实时的速率。现在，proposals是最新的检测系统中测试时间的计算瓶颈。

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.

Region proposal方法通常依赖廉价的特征和简练的推断方案。Selective Search [4]是最流行的方法之一，它贪婪地合并基于设计的低级特征的超级像素。然而，与有效的检测网络[2]相比，Selective Search速度慢了一个数量级，在CPU实现中每张图像的时间为2秒。EdgeBoxes[6]目前提出了在proposal质量和速度之间的最佳权衡，每张图像0.2秒。尽管如此，region proposal步骤仍然像检测网络那样消耗同样多的运行时间。

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

有人可能会注意到，基于区域的快速CNN利用GPU，而在研究中使用的region proposal方法在CPU上实现，使得运行时间比较不公平。加速region proposal计算的一个显而易见的方法是将其在GPU上重新实现。这可能是一个有效的工程解决方案，但重新实现忽略了下游检测网络，因此错过了共享计算的重要机会。

In this paper, we show that an algorithmic change——computing proposals with a deep convolutional neural network——leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).

在本文中，我们展示了算法的变化——用深度卷积神经网络计算region proposal——获得了一个优雅和有效的解决方案，其中在给定检测网络计算的情况下region proposal计算接近零成本。为此，我们引入了新的region proposal网络（RPN），它们共享最先进目标检测网络的卷积层[1]，[2]。通过在测试时共享卷积，计算region proposal的边际成本很小（例如，每张图像仅需10ms）。

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task for generating detection proposals.

我们的观察到基于区域的检测器所使用的卷积特征映射，如Fast R-CNN，也可以用于生成region proposal。在这些卷积特征之上，我们通过添加一些额外的卷积层来构建RPN，这些卷积层同时在规则网格上的每个位置上回归区域边界和目标分数。因此RPN是一种全卷积网络（FCN）[7]，可以针对生成检测区域proposals的任务进行端到端的训练。

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.

Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.

RPN旨在有效预测具有广泛尺度和长宽比的region proposal。与使用图像金字塔（图1 a）或滤波器金字塔（图1 b）的流行方法[8]，[9]，[1]，[2]相比，我们引入新的“anchor”框作为多种尺度和长宽比的参考。我们的方案可以被认为是回归参考金字塔（图1 c），它避免了遍历多种比例或长宽比的图像或滤波器。这个模型在使用单尺度图像进行训练和测试时运行良好，从而有利于提升运行速度。

图1：解决多尺度和尺寸的不同方案。（a）构建图像和特征映射金字塔，分类器以各种尺度运行。（b）在特征映射上运行具有多个比例/大小的滤波器的金字塔。（c）我们在回归函数中使用参考边界框金字塔。

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.

为了将RPN与Fast R-CNN [2]目标检测网络相结合，我们提出了一种训练方案，在fine-tune region proposal任务和fine-tune目标检测之间进行交替，同时保持region proposal的固定。该方案快速收敛，并产生两个任务之间共享的具有卷积特征的统一网络。

We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time——the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).

我们在PASCAL VOC检测基准数据集上[11]综合评估了我们的方法，其中具有Fast R-CNN的RPN产生的检测精度优于使用Selective Search的Fast R-CNN的强基准模型。同时，我们的方法在测试时几乎免除了Selective Search的所有计算负担——region proposal的有效运行时间仅为10毫秒。使用[3]的昂贵的非常深的模型，我们的检测方法在GPU上仍然具有5fps的帧率（包括所有步骤），因此在速度和准确性方面是实用的目标检测系统。我们还报告了在MS COCO数据集上[12]的结果，并使用COCO数据研究了在PASCAL VOC上的改进。代码可公开获得https://github.com/shaoqingren/faster_rcnn（MATLAB实现）和https://github.com/rbgirshick/py-faster-rcnn（Python实现）。

A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built in commercial systems such as at Pinterests [17], with user engagement improvements reported.

这篇稿件的初始版本是以前发表的[10]。从那时起，RPN和Faster R-CNN的框架已经被采用并推广到其他方法，如3D目标检测[13]，基于部件的检测[14]，实例分割[15]和图像标题生成[16]。我们快速和有效的目标检测系统也已经在Pinterest[17]的商业系统中进行了部署，并报告了用户参与度的提高。

In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.

在ILSVRC和COCO 2015竞赛中，Faster R-CNN和RPN是ImageNet检测任务、ImageNet定位任务、COCO检测任务和COCO分割任务中几个第一名获胜模型[18]的基础。RPN完全从数据中学习propose regions，因此可以从更深入和更具表达性的特征（例如[18]中采用的101层残差网络）中轻松获益。Faster R-CNN和RPN也被这些比赛中的其他几个主要参赛者所使用。这些结果表明，我们的方法不仅是一个实用合算的解决方案，而且是一个提高目标检测精度的有效方法。

2. RELATED WORK

Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).

2. 相关研究工作

目标Proposals。目标Proposals方法方面有大量的文献。目标Proposals方法的综合调查和比较可以在[19]，[20]，[21]中找到。广泛使用的目标提议方法包括基于超像素分组（例如，Selective Search [4]，CPMC[22]，MCG[23]）和那些基于滑动窗口的方法（例如窗口中的目标[24]，EdgeBoxes[6]）。目标Proposals方法被采用为独立于检测器（例如，Selective Search [4]目标检测器，R-CNN[5]和Fast R-CNN[2]）的外部模块。

Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple classspecific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the “single-box” fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.

用于目标检测的深度网络。R-CNN方法[5]端到端地对CNN进行训练，将proposal regions分类为目标类别或背景。R-CNN主要作为分类器，并不能预测目标边界（除了通过边界框回归进行修正）。其准确度取决于region proposal模块的性能（参见[20]中的比较）。一些论文提出了使用深度网络来预测目标边界框的方法[25]，[9]，[26]，[27]。在OverFeat方法[9]中，训练一个全连接层来预测假定单个目标定位任务的边界框坐标。然后将全连接层变成卷积层，用于检测多个类别的目标。MultiBox方法[26]，[27]从网络中生成region proposal，网络最后的全连接层同时预测多个类别不相关的边界框，并推广到OverFeat的“单边界框”方式。这些类别不可知的边界框框被用作R-CNN的候选区域[5]。与我们的全卷积方案相比，MultiBox提议网络适用于单张裁剪图像或多张大型裁剪图像（例如224×224）。MultiBox在提议区域和检测网络之间不共享特征。稍后在介绍我们的方法时会讨论OverFeat和MultiBox。与我们的工作同时进行的DeepMask方法[28]是为学习分割proposals而开发的。

Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.

卷积[9]，[1]，[29]，[7]，[2]的共享计算已经越来越受到人们的关注，因为它可以有效而准确地进行视觉识别。OverFeat论文[9]计算图像金字塔的卷积特征用于分类、定位和检测。共享卷积特征映射的自适应大小池化（SPP）[1]被开发用于有效的基于区域的目标检测[1]，[30]和语义分割[29]。Fast R-CNN[2]能够对共享卷积特征进行端到端的检测器训练，并显示出令人信服的准确性和速度。

3. FASTER R-CNN

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with attention [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the ‘attention’ of this unified network.

3. FASTER R-CNN

我们的目标检测系统，称为Faster R-CNN，由两个模块组成。第一个模块是产生proposes regions的深度全卷积网络，第二个模块是使用proposes regions的Fast R-CNN检测器[2]。整个系统是一个单个的、统一的目标检测网络（图2）。使用最近流行的“注意力”[31]机制的神经网络术语，RPN模块告诉Fast R-CNN模块在哪里寻找。在第3.1节中，我们介绍了region proposal网络的设计和属性。在第3.2节中，我们开发了用于训练具有共享特征的两个模块算法。