论文阅读笔记(十二):Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features——using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look.

最先进的目标检测网络依靠区域提出算法来假设目标的位置。SPPnet[1]和Fast R-CNN[2]等研究已经减少了这些检测网络的运行时间,使得区域提出计算成为一个瓶颈。在这项工作中,我们引入了一个区域提出网络(RPN),该网络与检测网络共享全图像的卷积特征,从而使近乎零成本的区域提出成为可能。RPN是一个全卷积网络,可以同时在每个位置预测目标边界和目标分数。RPN经过端到端的训练,可以生成高质量的区域提出,由Fast R-CNN用于检测。我们将RPN和Fast R-CNN通过共享卷积特征进一步合并为一个单一的网络——使用最近流行的具有“注意力”机制的神经网络术语,RPN组件告诉统一网络在哪里寻找。

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.

目标检测的最新进展是由区域提出方法(例如[4])和基于区域的卷积神经网络(R-CNN)[5]的成功驱动的。尽管在[5]中最初开发的基于区域的CNN计算成本很高,但是由于在各种提议中共享卷积,所以其成本已经大大降低了[1],[2]。忽略花费在区域提议上的时间,最新版本Fast R-CNN[2]利用非常深的网络[3]实现了接近实时的速率。现在,提议是最新的检测系统中测试时间的计算瓶颈。

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.

区域提议方法通常依赖廉价的特征和简练的推断方案。选择性搜索[4]是最流行的方法之一,它贪婪地合并基于设计的低级特征的超级像素。然而,与有效的检测网络[2]相比,选择性搜索速度慢了一个数量级,在CPU实现中每张图像的时间为2秒。EdgeBoxes[6]目前提供了在提议质量和速度之间的最佳权衡,每张图像0.2秒。尽管如此,区域提议步骤仍然像检测网络那样消耗同样多的运行时间。

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

有人可能会注意到,基于区域的快速CNN利用GPU,而在研究中使用的区域提议方法在CPU上实现,使得运行时间比较不公平。加速区域提议计算的一个显而易见的方法是将其在GPU上重新实现。这可能是一个有效的工程解决方案,但重新实现忽略了下游检测网络,因此错过了共享计算的重要机会。

In this paper, we show that an algorithmic change——computing proposals with a deep convolutional neural network——leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).

在本文中,我们展示了算法的变化——用深度卷积神经网络计算区域提议——导致了一个优雅和有效的解决方案,其中在给定检测网络计算的情况下区域提议计算接近领成本。为此,我们引入了新的区域提议网络(RPN),它们共享最先进目标检测网络的卷积层[1],[2]。通过在测试时共享卷积,计算区域提议的边际成本很小(例如,每张图像10ms)。

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task for generating detection proposals.

我们的观察是,基于区域的检测器所使用的卷积特征映射,如Fast R-CNN,也可以用于生成区域提议。在这些卷积特征之上,我们通过添加一些额外的卷积层来构建RPN,这些卷积层同时在规则网格上的每个位置上回归区域边界和目标分数。因此RPN是一种全卷积网络(FCN)[7],可以针对生成检测区域建议的任务进行端到端的训练。

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.

RPN旨在有效预测具有广泛尺度和长宽比的区域提议。与使用图像金字塔(图1,a)或滤波器金字塔(图1,b)的流行方法[8],[9],[1]相比,我们引入新的“锚”盒作为多种尺度和长宽比的参考。我们的方案可以被认为是回归参考金字塔(图1,c),它避免了枚举多种比例或长宽比的图像或滤波器。这个模型在使用单尺度图像进行训练和测试时运行良好,从而有利于运行速度。

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.

为了将RPN与Fast R-CNN 2]目标检测网络相结合,我们提出了一种训练方案,在微调区域提议任务和微调目标检测之间进行交替,同时保持区域提议的固定。该方案快速收敛,并产生两个任务之间共享的具有卷积特征的统一网络。

We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time——the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy.

我们在PASCAL VOC检测基准数据集上[11]综合评估了我们的方法,其中具有Fast R-CNN的RPN产生的检测精度优于使用选择性搜索的Fast R-CNN的强基准。同时,我们的方法在测试时几乎免除了选择性搜索的所有计算负担——区域提议的有效运行时间仅为10毫秒。使用[3]的昂贵的非常深的模型,我们的检测方法在GPU上仍然具有5fps的帧率(包括所有步骤),因此在速度和准确性方面是实用的目标检测系统。

A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built in commercial systems such as at Pinterests [17], with user engagement improvements reported.

这个手稿的初步版本是以前发表的[10]。从那时起,RPN和Faster R-CNN的框架已经被采用并推广到其他方法,如3D目标检测[13],基于部件的检测[14],实例分割[15]和图像标题[16]。我们快速和有效的目标检测系统也已经在Pinterest[17]的商业系统中建立了,并报告了用户参与度的提高。

In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.

在ILSVRC和COCO 2015竞赛中,Faster R-CNN和RPN是ImageNet检测,ImageNet定位,COCO检测和COCO分割中几个第一名参赛者[18]的基础。RPN完全从数据中学习提议区域,因此可以从更深入和更具表达性的特征(例如[18]中采用的101层残差网络)中轻松获益。Faster R-CNN和RPN也被这些比赛中的其他几个主要参赛者所使用。这些结果表明,我们的方法不仅是一个实用合算的解决方案,而且是一个提高目标检测精度的有效方法。

Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).

目标提议。目标提议方法方面有大量的文献。目标提议方法的综合调查和比较可以在[19],[20],[21]中找到。广泛使用的目标提议方法包括基于超像素分组(例如,选择性搜索[4],CPMC[22],MCG[23])和那些基于滑动窗口的方法(例如窗口中的目标[24],EdgeBoxes[6])。目标提议方法被采用为独立于检测器(例如,选择性搜索[4]目标检测器,R-CNN[5]和Fast R-CNN[2])的外部模块。

Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple classspecific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the “single-box” fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.

用于目标检测的深度网络。R-CNN方法[5]端到端地对CNN进行训练,将提议区域分类为目标类别或背景。R-CNN主要作为分类器,并不能预测目标边界(除了通过边界框回归进行细化)。其准确度取决于区域提议模块的性能(参见[20]中的比较)。一些论文提出了使用深度网络来预测目标边界框的方法[25],[9],[26],[27]。在OverFeat方法[9]中,训练一个全连接层来预测假定单个目标定位任务的边界框坐标。然后将全连接层变成卷积层,用于检测多个类别的目标。MultiBox方法[26],[27]从网络中生成区域提议,网络最后的全连接层同时预测多个类别不相关的边界框,并推广到OverFeat的“单边界框”方式。这些类别不可知的边界框框被用作R-CNN的提议区域[5]。与我们的全卷积方案相比,MultiBox提议网络适用于单张裁剪图像或多张大型裁剪图像(例如224×224)。MultiBox在提议区域和检测网络之间不共享特征。稍后在我们的方法上下文中会讨论OverFeat和MultiBox。与我们的工作同时进行的,DeepMask方法[28]是为学习分割提议区域而开发的。

Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.

卷积[9],[1],[29],[7],[2]的共享计算已经越来越受到人们的关注,因为它可以有效而准确地进行视觉识别。OverFeat论文[9]计算图像金字塔的卷积特征用于分类,定位和检测。共享卷积特征映射的自适应大小池化(SPP)[1]被开发用于有效的基于区域的目标检测[1],[30]和语义分割[29]。Fast R-CNN[2]能够对共享卷积特征进行端到端的检测器训练,并显示出令人信服的准确性和速度。

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with attention [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

我们的目标检测系统,称为Faster R-CNN,由两个模块组成。第一个模块是提议区域的深度全卷积网络,第二个模块是使用提议区域的Fast R-CNN检测器[2]。整个系统是一个单个的,统一的目标检测网络(图2)。使用最近流行的“注意力”[31]机制的神经网络术语,RPN模块告诉Fast R-CNN模块在哪里寻找。在第3.1节中,我们介绍了区域提议网络的设计和属性。在第3.2节中,我们开发了用于训练具有共享特征模块的算法。

Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.

解决多尺度和尺寸的不同方案。(a)构建图像和特征映射金字塔,分类器以各种尺度运行。(b)在特征映射上运行具有多个比例/大小的滤波器的金字塔。(c)我们在回归函数中使用参考边界框金字塔。

Faster R-CNN is a single, unified network for object detection. The RPN module serves as the ‘attention’ of this unified network.

Faster R-CNN是一个单一,统一的目标检测网络。RPN模块作为这个统一网络的“注意力”。

We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.

我们已经提出了RPN来生成高效,准确的区域提议。通过与下游检测网络共享卷积特征,区域提议步骤几乎是零成本的。我们的方法使统一的,基于深度学习的目标检测系统能够以接近实时的帧率运行。学习到的RPN也提高了区域提议的质量,从而提高了整体的目标检测精度。

相关参考

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值