[Translation] [FPN] Feature Pyramid Networks for Object Detection

Feature Pyramid Networks for Object Detection
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie

Paper: https://arxiv.org/pdf/1612.03144.pdf
Code: https://github.com/facebookresearch/Detectron

Abstract

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

1. Introduction

Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.
  Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.
  Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.
  However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multiscale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.
  The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet's pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. But to avoid using low-level features, SSD forgoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.
  The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.
  Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.
  We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. Our method is also easily extended to mask proposals and improves both instance segmentation AR and speed over state-of-the-art methods that heavily depend on image pyramids.
  In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods. Moreover, this improvement is achieved without increasing testing time over the single-scale baseline. We believe these advances will facilitate future research and applications. Our code will be made publicly available.

2. Related Work

Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. There has also been significant interest in computing featurized image pyramids quickly. Dollár et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.
  Deep ConvNet object detectors. With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet. SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale. Recent and more accurate detection methods like Fast R-CNN [11] and Faster R-CNN [29] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed. Multi-scale detection, however, still performs better, especially for small objects.
  Methods using multiple layers. A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations. Hypercolumns [13] uses a similar method for object instance segmentation. Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. SSD [22] and MS-CNN [3] predict objects at multiple layers of the feature hierarchy without combining features or scores.
  There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28].

3. Feature Pyramid Networks

Our goal is to leverage a ConvNet's pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general-purpose, and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPNs to instance segmentation proposals in Sec. 6.
  Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16]. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following.
  Bottom-up pathway. The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size and we say these layers are in the same network stage. For our feature pyramid, we define one pyramid level for each stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create our pyramid. This choice is natural since the deepest layer of each stage should have the strongest features.
  Specifically, for ResNets [16] we use the feature activations output by each stage's last residual block. We denote the output of these last residual blocks as $\{C_2, C_3, C_4, C_5\}$ for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 into the pyramid due to its large memory footprint.
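To make the stride relationship concrete, the following minimal sketch computes the approximate spatial size of each bottom-up map for a given input size. The function name `feature_map_sizes` and the use of ceiling division are assumptions for illustration; the exact rounding depends on how a particular backbone pads its input.

```python
import math

# Strides of C2..C5 relative to the input image, as stated in the text.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def feature_map_sizes(height, width):
    """Return the approximate (height, width) of each bottom-up feature map.

    Ceiling division is an assumption about the backbone's padding behavior.
    """
    return {
        name: (math.ceil(height / stride), math.ceil(width / stride))
        for name, stride in STRIDES.items()
    }


# Hypothetical input: shorter side 800 pixels, longer side 1216 pixels.
sizes = feature_map_sizes(800, 1216)
for name, (h, w) in sizes.items():
    print(name, h, w)
```

Each level halves the resolution of the previous one, which is the scaling step of 2 mentioned in the description of the bottom-up pathway.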
  Top-down pathway and lateral connections. The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.
  Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on $C_5$ to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called $\{P_2, P_3, P_4, P_5\}$, corresponding to $\{C_2, C_3, C_4, C_5\}$ that are respectively of the same spatial sizes.
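The merge step of the building block can be sketched in plain Python on single-channel maps. This is only an illustration under simplifying assumptions: the 1×1 lateral convolution and the 3×3 output convolution are omitted, maps are nested lists, and the helper names `upsample_nearest_2x` and `merge` are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor upsampling of a 2-D map (list of rows) by a factor of 2."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))  # repeat each row
    return out


def merge(top_down, lateral):
    """Upsample the coarser top-down map and add the lateral map element-wise."""
    up = upsample_nearest_2x(top_down)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly twice the top-down size")
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


coarse = [[1, 2],
          [3, 4]]                          # stands in for a coarser top-down map
lateral = [[10] * 4 for _ in range(4)]     # stands in for a projected bottom-up map
merged = merge(coarse, lateral)
for row in merged:
    print(row)
```

Iterating this merge from the coarsest map down to the finest one yields one merged map per level, matching the iteration described above.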
  Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (numbers of channels, denoted as $d$) in all the feature maps. We set $d = 256$ in this paper and thus all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.
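As a rough illustration of what fixing $d = 256$ implies, the sketch below counts the parameters of the extra layers (one 1×1 lateral convolution and one 3×3 output convolution per level). The backbone channel widths are those of the ResNet-50/101 bottleneck design and are an assumption here, since the text does not state them; including bias terms is also an assumption.

```python
# Output channels of C2..C5 for ResNet-50/101 (assumed, not stated in the text).
BACKBONE_CHANNELS = {"C2": 256, "C3": 512, "C4": 1024, "C5": 2048}
D = 256  # fixed pyramid feature dimension d from the text


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of one kernel x kernel convolution."""
    return c_in * c_out * kernel * kernel + c_out


# One 1x1 lateral convolution per backbone stage, projecting to d channels.
lateral = sum(conv_params(c, D, 1) for c in BACKBONE_CHANNELS.values())
# One 3x3 convolution per merged map, d channels in and out.
smoothing = len(BACKBONE_CHANNELS) * conv_params(D, D, 3)
total_extra = lateral + smoothing
print(lateral, smoothing, total_extra)
```

Under these assumptions the extra layers add a few million parameters, most of them in the 3×3 convolutions.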
  Simplicity is central to our design and we have found that our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multilayer residual blocks [16] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.

4. Applications

Our method is a generic solution for building feature pyramids inside deep ConvNets. In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection. To demonstrate the simplicity and effectiveness of our method, we make minimal modifications to the original systems of [29, 11] when adapting them to our feature pyramid.

4.1. Feature Pyramid Networks for RPN

RPN [29] is a sliding-window class-agnostic object detector. In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. This is realized by a 3×3 convolutional layer followed by two sibling 1×1 convolutions for classification and regression, which we refer to as a network head. The object/non-object criterion and bounding box regression target are defined with respect to a set of reference boxes called anchors [29]. The anchors are of multiple pre-defined scales and aspect ratios in order to cover objects of different shapes.
  We adapt RPN by replacing the single-scale feature map with our FPN. We attach a head of the same design (3×3 conv and two sibling 1×1 convs) to each level on our feature pyramid. Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of $\{32^2, 64^2, 128^2, 256^2, 512^2\}$ pixels on $\{P_2, P_3, P_4, P_5, P_6\}$ respectively. (Here we introduce $P_6$ only for covering a larger anchor scale of $512^2$. $P_6$ is simply a stride two subsampling of $P_5$. $P_6$ is not used by the Fast R-CNN detector in the next section.) As in [29] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.
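The anchor layout described above (one scale per level, three aspect ratios each) can be enumerated with a short sketch. The names below are hypothetical, and reading each ratio as height divided by width is an assumption, since the text does not say which side comes first.

```python
import math

# Anchor side length per pyramid level; each anchor area is side ** 2 pixels.
ANCHOR_SIDES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
# Aspect ratios 1:2, 1:1, 2:1, read here as height / width (an assumption).
ASPECT_RATIOS = [0.5, 1.0, 2.0]


def anchor_shapes():
    """Return {level: [(width, height), ...]} with the area preserved per level."""
    shapes = {}
    for level, side in ANCHOR_SIDES.items():
        area = side * side
        per_level = []
        for ratio in ASPECT_RATIOS:
            width = math.sqrt(area / ratio)
            height = width * ratio
            per_level.append((width, height))
        shapes[level] = per_level
    return shapes


shapes = anchor_shapes()
total = sum(len(per_level) for per_level in shapes.values())
print(total)  # 5 levels x 3 aspect ratios
for level, per_level in shapes.items():
    print(level, [(round(w, 1), round(h, 1)) for w, h in per_level])
```

Five levels with three aspect ratios each give the 15 anchors mentioned in the text.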
  We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29]. Formally, an anchor is assigned a positive label if it has the highest IoU for a given ground-truth box or an IoU over 0.7 with any ground-truth box, and a negative label if it has IoU lower than 0.3 for all ground-truth boxes. Note that scales of ground-truth boxes are not explicitly used to assign them to the levels of the pyramid; instead, ground-truth boxes are associated with anchors, which have been assigned to pyramid levels. As such, we introduce no extra rules in addition to those in [29].
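The labeling rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, anchors that are neither positive nor negative get the label `-1`, and ties for the highest IoU are all marked positive. Those conventions, and the helper names, are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    if not anchors:
        return []
    ious = [[iou(anchor, gt) for gt in gt_boxes] for anchor in anchors]
    labels = []
    for row in ious:
        best = max(row) if row else 0.0
        if best > pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    # The anchor(s) with the highest IoU for each ground-truth box are positive.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in ious]
        best = max(column)
        if best > 0:  # skip ground-truth boxes that no anchor overlaps (assumption)
            for i, value in enumerate(column):
                if value == best:
                    labels[i] = 1
    return labels


gt = [(0, 0, 100, 100)]
labels_a = label_anchors([(0, 0, 100, 100), (0, 0, 100, 50), (200, 200, 300, 300)], gt)
labels_b = label_anchors([(0, 0, 100, 50), (200, 200, 300, 300)], gt)
print(labels_a, labels_b)
```

In the second call, the half-overlapping anchor has an IoU of only 0.5 but still becomes positive because it is the best match for the ground-truth box.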
  We note that the parameters of the heads are shared across all feature pyramid levels; we have also evaluated the alternative without sharing parameters and observed similar accuracy. The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels. This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.
  With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29]. We elaborate on the implementation details in the experiments.

4.2. Feature Pyramid Networks for Fast R-CNN

Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.
  We view our feature pyramid as if it were produced from an image pyramid. Thus we can adapt the assignment strategy of region-based detectors [15, 11] in the case when they are run on image pyramids. Formally, we assign an RoI of width $w$ and height $h$ (on the input image to the network) to the level $P_k$ of our feature pyramid by:
$$k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor \tag{1}$$
Here 224 is the canonical ImageNet pre-training size, and $k_0$ is the target level onto which an RoI with $w \times h = 224^2$ should be mapped. Analogous to the ResNet-based Faster R-CNN system [16] that uses $C_4$ as the single-scale feature map, we set $k_0$ to 4. Intuitively, Eqn. (1) means that if the RoI's scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, $k = 3$).
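The level assignment can be sketched in a few lines, assuming the rule $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$. Clamping the result to the levels $P_2$ to $P_5$ is an assumption added here so that very small or very large RoIs still map to an existing level; the function name `roi_to_level` is hypothetical.

```python
import math


def roi_to_level(w, h, k0=4, k_min=2, k_max=5, canonical=224):
    """Map an RoI of size w x h (in input-image pixels) to a pyramid level k.

    The clamp to [k_min, k_max] is an assumption, not part of the formula.
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 448, 1000):
    print(size, roi_to_level(size, size))
```

A 224×224 RoI lands on level 4, and halving its side moves it one level finer, to level 3, as the text describes.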
  We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. Again, the heads all share parameters, regardless of their levels. In [16], a ResNet’s conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. So unlike [16], we simply adopt RoI pooling to extract 7×7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers. These layers are randomly initialized, as there are no pre-trained fc layers available in ResNets. Note that compared to the standard conv5 head, our 2-fc MLP head is lighter weight and faster.
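For a sense of scale, the sketch below traces the shapes through the 2-fc head: a 7×7 RoI-pooled feature with 256 channels is flattened and passed through two 1,024-d hidden layers. Counting bias terms is an assumption, and the final classification and regression layers are left out because their sizes depend on the number of classes.

```python
POOL_SIZE = 7    # RoI pooling output is 7 x 7
CHANNELS = 256   # pyramid feature dimension d
HIDDEN = 1024    # width of each hidden fc layer


def fc_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


flat = POOL_SIZE * POOL_SIZE * CHANNELS  # flattened RoI feature length
fc1 = fc_params(flat, HIDDEN)
fc2 = fc_params(HIDDEN, HIDDEN)
print(flat, fc1, fc2, fc1 + fc2)
```

Most of the head's parameters sit in the first hidden layer, because it consumes the full flattened RoI feature.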
  Based on these adaptations, we can train and test Fast R-CNN on top of the feature pyramid. Implementation details are given in the experimental section.

5. Experiments on Object Detection

We perform experiments on the 80 category COCO detection dataset [21]. We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival). We also report final results on the standard test set (test-std) [21] which has no disclosed labels.
  As is common practice [12], all network backbones are pre-trained on the ImageNet1k classification set [33] and then fine-tuned on the detection dataset. We use the pre-trained ResNet-50 and ResNet-101 models that are publicly available (https://github.com/kaiminghe/deep-residual-networks). Our code is a reimplementation of py-faster-rcnn (https://github.com/rbgirshick/py-faster-rcnn) using Caffe2 (https://github.com/caffe2/caffe2).

5.1. Region Proposal with RPN

We evaluate the COCO-style Average Recall (AR) and AR on small, medium, and large objects ($AR_s$, $AR_m$, and $AR_l$) following the definitions in [21]. We report results for 100 and 1000 proposals per image ($AR^{100}$ and $AR^{1k}$).
  Implementation details. All architectures in Table 1 are trained end-to-end. The input image is resized such that its shorter side has 800 pixels. We adopt synchronized SGD training on 8 GPUs. A mini-batch involves 2 images per GPU and 256 anchors per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 30k mini-batches and 0.002 for the next 10k. For all RPN experiments (including baselines), we include the anchor boxes that are outside the image for training, which is unlike [29] where these anchor boxes are ignored. Other implementation details are as in [29]. Training RPN with FPN on 8 GPUs takes about 8 hours on COCO.
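The training schedule above can be written down as a small sketch. The step counts, learning rates, and batch composition come from the text; the zero-based iteration index and the function name `learning_rate` are assumptions.

```python
NUM_GPUS = 8
IMAGES_PER_GPU = 2
ANCHORS_PER_IMAGE = 256

IMAGES_PER_BATCH = NUM_GPUS * IMAGES_PER_GPU              # images per mini-batch
ANCHORS_PER_BATCH = IMAGES_PER_BATCH * ANCHORS_PER_IMAGE  # sampled anchors per mini-batch


def learning_rate(iteration):
    """Step schedule: 0.02 for the first 30k mini-batches, 0.002 for the next 10k.

    `iteration` is assumed to be zero-based.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if iteration < 30_000:
        return 0.02
    if iteration < 40_000:
        return 0.002
    raise ValueError("schedule ends after 40k mini-batches")


print(IMAGES_PER_BATCH, ANCHORS_PER_BATCH)
print(learning_rate(0), learning_rate(29_999), learning_rate(30_000))
```

With 8 GPUs and 2 images each, every synchronized SGD step sees 16 images and 4,096 sampled anchors.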

5.1.1 Ablation Experiments

Comparisons with baselines. For fair comparisons with original RPNs [29], we run two baselines (Table 1(a, b)) using the single-scale map of $C_4$ (the same as [16]) or $C_5$, both using the same hyper-parameters as ours, including using 5 scale anchors of $\{32^2, 64^2, 128^2, 256^2, 512^2\}$. Table 1(b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics.
  Placing FPN in RPN improves $AR^{1k}$ to 56.3 (Table 1(c)), which is an 8.0-point increase over the single-scale RPN baseline (Table 1(a)). In addition, the performance on small objects ($AR_s^{1k}$) is boosted by a large margin of 12.9 points. Our pyramid representation greatly improves RPN's robustness to object scale variation.
  How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway. With this modification, the 1×1 lateral connections followed by 3×3 convolutions are attached to the bottom-up pyramid. This architecture simulates the effect of reusing the pyramidal feature hierarchy (Fig. 1(b)).
  The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours. We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid (Fig. 1(b)), especially for very deep ResNets. We have also evaluated a variant of Table 1(d) without sharing the parameters of the heads, but observed similarly degraded performance. This issue cannot be simply remedied by level-specific heads.
  表1(d)中的结果只是与RPN基线持平,远远落后于我们的结果。我们猜测这是因为自下而上的金字塔(图1(b))的不同层次之间存在着巨大的语义差距,尤其是对于非常深的ResNets。我们还评估了表1(d)的一个变体,没有共享头的参数,但观察到类似的性能下降。这个问题不能简单地通过特定级别的头来补救。
  How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1×1 lateral connections. This top-down pyramid has strong semantic features and fine resolutions. But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times. More precise locations of features can be directly passed from the finer levels of the bottom-up maps via the lateral connections to the top-down maps. As a result, FPN has an $AR^{1k}$ score 10 points higher than Table 1(e).
  横向连接有多重要?表1(e)显示了一个没有1×1横向连接的自上而下的特征金字塔的消融结果。这个自上而下的金字塔具有强烈的语义特征和精细的分辨率。但我们认为,这些特征的位置并不精确,因为这些特征图已经被下采样和上采样多次了。更精确的特征位置可以通过横向连接,直接从自下而上的特征图的更精细的层次传递到自上而下的特征图。因此,FPN的$AR^{1k}$得分比表1(e)高10分。
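The two ablations above (Table 1(d) without the top-down pathway, Table 1(e) without lateral connections) can be illustrated with a toy sketch. This is not the authors' implementation: the names `conv1x1`, `upsample2x`, `add_maps`, and `build_pyramid` and the two flags are hypothetical, feature maps are plain nested lists indexed `[channel][row][col]`, and the 3×3 output convolutions are omitted for brevity.

```python
def conv1x1(fmap, weight):
    """1x1 convolution: fmap is [C_in][H][W], weight is [C_out][C_in]."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weight[o][i] * fmap[i][y][x] for i in range(c_in))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weight))]


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] map."""
    return [[[row[x // 2] for x in range(2 * len(row))]
             for row in channel for _ in range(2)]
            for channel in fmap]


def add_maps(a, b):
    """Element-wise sum of two maps with identical shapes."""
    return [[[va + vb for va, vb in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def build_pyramid(bottom_up, lateral_weights,
                  use_top_down=True, use_lateral=True):
    """bottom_up lists maps from fine to coarse (e.g. C2..C5).

    Returns the pyramid in the same order (e.g. P2..P5).
    use_top_down=False mimics the Table 1(d) ablation,
    use_lateral=False mimics the Table 1(e) ablation.
    """
    if not (use_top_down or use_lateral):
        raise ValueError("at least one pathway must be enabled")
    laterals = [conv1x1(c, w) for c, w in zip(bottom_up, lateral_weights)]
    outputs = [laterals[-1]]  # the coarsest level starts the pyramid
    for lateral in reversed(laterals[:-1]):
        if use_top_down:
            merged = upsample2x(outputs[0])
            if use_lateral:
                merged = add_maps(merged, lateral)
        else:
            merged = lateral
        outputs.insert(0, merged)
    return outputs


# Tiny two-level example with one channel and identity 1x1 weights.
c4 = [[[1.0, 2.0], [3.0, 4.0]]]
c5 = [[[10.0]]]
identity = [[1.0]]
full = build_pyramid([c4, c5], [identity, identity])
```

In the full variant the finer level receives both the upsampled coarse map and its own lateral input; disabling either flag leaves only one of the two sources, which is the situation the ablations measure.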
  How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of $P_2$ (i.e., the finest level in our pyramids). Similar to the single-scale baselines, we assign all anchors to the $P_2$ feature map. This variant (Table 1(f)) is better than the baseline but inferior to our approach. RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance.
  金字塔表征有多重要?与其求助于金字塔表征,不如将头部附着在$P_2$的最高分辨率、强语义的特征图上(即我们金字塔中最细的一层)。与单尺度基线类似,我们将所有的锚点都分配给$P_2$的特征图。这个变体(表1(f))比基线好,但比我们的方法差。RPN是一个具有固定窗口大小的滑动窗口检测器,因此在金字塔级别上的扫描可以增加其对尺度变化的鲁棒性。
  In addition, we note that using $P_2$ alone leads to more anchors (750k, Table 1(f)) caused by its large spatial resolution. This result suggests that a larger number of anchors is not sufficient in itself to improve accuracy.
  此外,我们注意到,单独使用$P_2$会导致更多的锚点(750k,表1(f)),这是由其大空间分辨率造成的。这一结果表明,更多的锚点本身并不足以提高精确度。
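A rough count shows where a figure like 750k can come from. The sketch below assumes a hypothetical 800×1000 input, feature maps whose sides are the image sides divided by the stride and rounded up, 3 aspect ratios per scale, all 5 scales placed at every location for single-map variants, and one scale per level on strides 4 to 64 for the pyramid. The function name `count_anchors` is illustrative, and the exact numbers depend on the real image sizes.

```python
import math


def count_anchors(height, width, levels):
    """Sum anchors over levels given as (stride, anchors_per_location)."""
    total = 0
    for stride, per_location in levels:
        rows = math.ceil(height / stride)
        cols = math.ceil(width / stride)
        total += rows * cols * per_location
    return total


H, W = 800, 1000          # assumed input size (shorter side 800)
RATIOS, SCALES = 3, 5     # 3 aspect ratios, 5 anchor scales

c4_only = count_anchors(H, W, [(16, RATIOS * SCALES)])   # 47250
c5_only = count_anchors(H, W, [(32, RATIOS * SCALES)])   # 12000
p2_only = count_anchors(H, W, [(4, RATIOS * SCALES)])    # 750000
pyramid = count_anchors(H, W, [(s, RATIOS) for s in (4, 8, 16, 32, 64)])  # 199974
```

Under these assumptions the single $P_2$ map carries almost four times as many anchors as the full pyramid, yet the text reports it as less accurate, which is the point of the paragraph above.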

5.2. Object Detection with Fast/Faster R-CNN(用Fast/Faster R-CNN进行目标检测)

Next we investigate FPN for region-based (non-sliding window) detectors. We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5). We also report COCO AP on objects of small, medium, and large sizes (namely, AP$_s$, AP$_m$, and AP$_l$) following the definitions in [21].
  接下来我们研究基于区域(非滑动窗口)的检测器的FPN。我们通过COCO式的平均精度(AP)和PASCAL式的AP(在0.5的单一IoU阈值下)评估目标检测。我们还按照[21]中的定义,对小、中、大尺寸的目标(即AP$_s$、AP$_m$和AP$_l$)报告了COCO的AP。
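As a reminder of what these metrics rest on, here is a minimal sketch of box IoU and of the COCO size buckets. The helper names `box_iou` and `size_bucket` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the bucket boundaries ($32^2$ and $96^2$ pixels) are the commonly cited COCO definitions; COCO measures object area in pixels, for which a plain area argument stands in here.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def size_bucket(area):
    """COCO-style size class from an object area in pixels."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# PASCAL-style AP uses one threshold; COCO-style AP averages over ten.
PASCAL_THRESHOLD = 0.5
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
is_pascal_match = overlap >= PASCAL_THRESHOLD
```

The two example boxes share half of each box's area, so their IoU is one third and they would not count as a match at the 0.5 threshold.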
  Implementation details. The input image is resized such that its shorter side has 800 pixels. Synchronized SGD is used to train the model on 8 GPUs. Each mini-batch involves 2 images per GPU and 512 RoIs per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 60k mini-batches and 0.002 for the next 20k. We use 2000 RoIs per image for training and 1000 for testing. Training Fast R-CNN with FPN takes about 10 hours on the COCO dataset.
  实施细节。输入的图像被调整大小,使其短边有800像素。同步SGD被用来在8个GPU上训练模型。每个mini-batch涉及每个GPU的2个图像和每个图像的512个RoI。我们使用0.0001的权重衰减和0.9的动量。前6万个小批次的学习率为0.02,接下来的2万个小批次为0.002。我们用每幅图像2000个RoI进行训练,1000个用于测试。在COCO数据集上,用FPN训练快速R-CNN大约需要10个小时。
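The schedule and batch arithmetic described above can be written down as a short sketch. The constant and function names (`SCHEDULE`, `learning_rate`) are hypothetical and only the numbers come from the text; this is not tied to any particular training framework.

```python
# (number of mini-batches, learning rate) for each phase, from the text.
SCHEDULE = [(60_000, 0.02), (20_000, 0.002)]

NUM_GPUS = 8
IMAGES_PER_GPU = 2
ROIS_PER_IMAGE = 512


def learning_rate(iteration, schedule=SCHEDULE):
    """Return the step-wise learning rate for a 0-based iteration index."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    start = 0
    for length, rate in schedule:
        if iteration < start + length:
            return rate
        start += length
    raise ValueError("iteration beyond the end of the schedule")


images_per_batch = NUM_GPUS * IMAGES_PER_GPU          # 16 images
rois_per_batch = images_per_batch * ROIS_PER_IMAGE    # 8192 RoIs
total_iterations = sum(length for length, _ in SCHEDULE)  # 80000
```

With these settings each synchronized step sees 16 images and 8192 sampled RoIs, and the learning rate drops by a factor of ten after three quarters of the run.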

5.2.1 Fast R-CNN (on fixed proposals)(Fast R-CNN(固定建议))

To better investigate FPN’s effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals. We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on small objects that are to be recognized by the detector. For simplicity we do not share features between Fast R-CNN and RPN, except when specified.
  为了更好地研究FPN对基于区域的检测器的单独影响,我们在一组固定的建议上进行Fast R-CNN的消融。我们选择冻结由RPN在FPN上计算的建议(表1(c)),因为它在要被检测器识别的小目标上有良好的性能。为了简单起见,我们不在Fast R-CNN和RPN之间共享特征,除非有特别说明。
  As a ResNet-based Fast R-CNN baseline, following [16], we adopt RoI pooling with an output size of 14×14 and attach all conv5 layers as the hidden layers of the head. This gives an AP of 31.9 in Table 2(a). Table 2(b) is a baseline exploiting an MLP head with 2 hidden fc layers, similar to the head in our architecture. It gets an AP of 28.8, indicating that the 2-fc head does not give us any orthogonal advantage over the baseline in Table 2(a).
  作为基于ResNet的Fast R-CNN基线,按照[16],我们采用输出尺寸为14×14的RoI池,并将所有conv5层作为头部的隐藏层。这使得表2(a)中的AP为31.9。表2(b)是一个基线,利用具有2个隐藏fc层的MLP头,类似于我们架构中的头。它得到的AP为28.8,表明2-fc头没有给我们带来任何比表2(a)中的基线更多的正交优势。
  Table 2(c) shows the results of our FPN in Fast R-CNN. Comparing with the baseline in Table 2(a), our method improves AP by 2.0 points and small object AP by 2.1 points. Comparing with the baseline that also adopts a 2-fc head (Table 2(b)), our method improves AP by 5.1 points. (We expect a stronger architecture of the head [30] will improve upon our results, which is beyond the focus of this paper.) These comparisons indicate that our feature pyramid is superior to single-scale features for a region-based object detector.
  表2(c)显示了我们在Fast R-CNN中的FPN的结果。与表2(a)中的基线相比,我们的方法使AP提高了2.0分,小目标AP提高了2.1分。与同样采用2fc头的基线(表2(b))相比,我们的方法提高了5.1分。(我们期望一个更强大的头部架构[30]将改善我们的结果,这超出了本文的重点。)这些比较表明,对于基于区域的目标检测器,我们的特征金字塔比单尺度特征要好。
  Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above subsection for RPN. It is noteworthy that removing top-down connections (Table 2(d)) significantly degrades the accuracy, suggesting that Fast R-CNN suffers from using the low-level features at the high-resolution maps.
  表2(d)和(e)显示,去除自上而下的连接或去除横向连接会导致较差的结果,这与我们在上述小节中观察到的RPN的情况相似。值得注意的是,去除自上而下的连接(表2(d))大大降低了准确性,这表明Fast R-CNN因在高分辨率特征图上使用低层次特征而受到影响。
  In Table 2(f), we adopt Fast R-CNN on the single finest scale feature map of $P_2$. Its result (33.4 AP) is marginally worse than that of using all pyramid levels (33.9 AP, Table 2(c)). We argue that this is because RoI pooling is a warping-like operation, which is less sensitive to the region’s scales. Despite the good accuracy of this variant, it is based on the RPN proposals of $\{P_k\}$ and has thus already benefited from the pyramid representation.
  在表2(f)中,我们在$P_2$的单一最细尺度特征图上采用了Fast R-CNN。其结果(33.4 AP)比使用所有金字塔级别的结果(33.9 AP,表2(c))略差。我们认为这是因为RoI pooling是一个类似翘曲的操作,对区域的尺度不太敏感。尽管这个变体的精度很好,但它是基于$\{P_k\}$的RPN建议,因此已经从金字塔表示法中获益。
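For contrast with the single-map variant, using all pyramid levels means each RoI is pooled from one level chosen by its size. The sketch below follows the assignment rule given earlier in the paper, $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$. The function name `roi_level` is hypothetical, and clamping the result to levels 2 through 5 is an assumption of this sketch rather than something stated in this section.

```python
import math


def roi_level(width, height, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick the pyramid level P_k from which an RoI should be pooled."""
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical))
    return max(k_min, min(k_max, k))  # clamp: an assumption of this sketch


levels = {
    "32x32": roi_level(32, 32),
    "112x112": roi_level(112, 112),
    "224x224": roi_level(224, 224),
    "448x448": roi_level(448, 448),
}
```

Small RoIs land on the fine, high-resolution levels and large RoIs on the coarse ones, whereas the Table 2(f) variant sends every RoI to $P_2$ regardless of size.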

5.2.2 Faster R-CNN (on consistent proposals)(Faster R-CNN(一致的建议))

In the above we used a fixed set of proposals to investigate the detectors. But in a Faster R-CNN system [29], the RPN and Fast R-CNN must use the same network backbone in order to make feature sharing possible. Table 3 shows the comparisons between our method and two baselines, all using consistent backbone architectures for RPN and Fast R-CNN. Table 3(a) shows our reproduction of the baseline Faster R-CNN system as described in [16]. Under controlled settings, our FPN (Table 3(c)) is better than this strong baseline by 2.3 points AP and 3.8 points AP@0.5.
  在上面的内容中,我们使用了一组固定的建议来调查检测器。但是在Faster R-CNN系统中[29],RPN和Fast R-CNN必须使用相同的网络骨干,以便使特征共享成为可能。表3显示了我们的方法和两个基线之间的比较,所有的RPN和Fast R-CNN都使用一致的骨干架构。表3(a)显示了我们对基线Faster R-CNN系统的再现,如[16]中所述。在受控设置下,我们的FPN(表3(c))比这个强大的基线要好2.3点AP和3.8点AP@0.5。
  Note that Table 3(a) and (b) are baselines that are much stronger than the baseline provided by He et al. [16] in Table 3(*). We find the following implementations contribute to the gap: (i) We use an image scale of 800 pixels instead of 600 in [11, 16]; (ii) We train with 512 RoIs per image, which accelerates convergence, in contrast to 64 RoIs in [11, 16]; (iii) We use 5 scale anchors instead of 4 in [16] (adding $32^2$); (iv) At test time we use 1000 proposals per image instead of 300 in [16]. So comparing with He et al.’s ResNet-50 Faster R-CNN baseline in Table 3(*), our method improves AP by 7.6 points and AP@0.5 by 9.6 points.
  请注意,表3(a)和(b)是基线,比表3(*)中He等人[16]提供的基线强得多。我们发现以下实现方式导致了差距:(i)我们使用800像素的图像比例,而不是[11,16]中的600像素;(ii)我们用每幅图像512个RoIs进行训练,这加速了收敛,而[11,16]中是64个RoIs;(iii)我们使用5个比例锚,而不是[16]中的4个(增加$32^2$);(iv)在测试时,我们每幅图像使用1000个提议,而不是[16]的300。因此,与表3(*)中He等人的ResNet-50 Faster R-CNN基线相比,我们的方法将AP提高了7.6分,AP@0.5提高了9.6分。
  Sharing features. In the above, for simplicity we do not share the features between RPN and Fast R-CNN. In Table 5, we evaluate sharing features following the 4-step training described in [29]. Similar to [29], we find that sharing features improves accuracy by a small margin. Feature sharing also reduces the testing time.
  共享特征。在上文中,为了简单起见,我们没有在RPN和Fast R-CNN之间共享特征。在表5中,我们按照[29]中描述的4步训练法评估了共享特征。与[29]类似,我们发现共享特征能以很小的幅度提高准确性。特征共享也减少了测试时间。
  Running time. With feature sharing, our FPN-based Faster R-CNN system has an inference time of 0.165 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.19 seconds for ResNet-101. As a comparison, the single-scale ResNet-50 baseline in Table 3(a) runs at 0.32 seconds. Our method introduces a small extra cost from the extra layers in the FPN, but has a lighter-weight head. Overall our system is faster than the ResNet-based Faster R-CNN counterpart. We believe the efficiency and simplicity of our method will benefit future research and applications.
  运行时间。在特征共享的情况下,我们基于FPN的Faster R-CNN系统在单个NVIDIA M40 GPU上对ResNet-50的推理时间为0.165秒,对ResNet-101的推理时间为0.19秒。作为比较,表3(a)中的单尺度ResNet-50基线的运行时间为0.32秒。我们的方法由于在FPN中加入了额外的层,所以引入了少量的额外成本,但有一个更轻量的头部。总的来说,我们的系统比基于ResNet的Faster R-CNN对应的系统要快。我们相信我们的方法的效率和简单性将有利于未来的研究和应用。
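The per-image times above translate directly into frames per second, which is how the abstract's figure of roughly 5 FPS can be read. The snippet below only does that arithmetic on the three numbers quoted in the text; the dictionary keys are illustrative labels.

```python
# Seconds per image, as quoted in the running-time paragraph.
SECONDS_PER_IMAGE = {
    "FPN ResNet-50": 0.165,
    "FPN ResNet-101": 0.19,
    "single-scale ResNet-50 baseline": 0.32,
}

# Frames per second is the reciprocal of seconds per image.
fps = {name: 1.0 / seconds for name, seconds in SECONDS_PER_IMAGE.items()}

# How many times faster the FPN ResNet-50 system is than the baseline.
speedup = (SECONDS_PER_IMAGE["single-scale ResNet-50 baseline"]
           / SECONDS_PER_IMAGE["FPN ResNet-50"])

for name, value in fps.items():
    print(f"{name}: {value:.1f} FPS")
print(f"speed-up over baseline: {speedup:.2f}x")
```

This works out to about 6.1 FPS for ResNet-50 and 5.3 FPS for ResNet-101, against about 3.1 FPS for the single-scale baseline, so the FPN ResNet-50 system is a little under twice as fast.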

5.2.3 Comparing with COCO Competition Winners(与COCO竞赛冠军进行比较)

We find that our ResNet-101 model in Table 5 is not sufficiently trained with the default learning rate schedule. So we increase the number of mini-batches by 2× at each learning rate when training the Fast R-CNN step. This increases AP on minival to 35.6, without sharing features. This model is the one we submitted to the COCO detection leaderboard, shown in Table 4. We have not evaluated its feature-sharing version due to limited time, which should be slightly better as implied by Table 5.
  我们发现,表5中的ResNet-101模型在默认的学习率安排下没有得到充分的训练。因此,我们在训练Fast R-CNN步骤时,在每个学习速率下增加2倍的mini-batch数量。这样,在不共享特征的情况下,minival上的AP增加到35.6。这个模型是我们提交给COCO检测排行榜的模型,如表4所示。由于时间有限,我们没有对其特征共享版本进行评估,正如表5所暗示的那样,它应该略胜一筹。
  Table 4 compares our method with the single-model results of the COCO competition winners, including the 2016 winner G-RMI and the 2015 winner Faster R-CNN+++. Without adding bells and whistles, our single-model entry has surpassed these strong, heavily engineered competitors.
  表4将我们的方法与COCO竞赛获奖者的单模型结果进行了比较,包括2016年获奖者G-RMI和2015年获奖者Faster R-CNN+++。在没有添加花哨的东西的情况下,我们的单模型作品已经超过了这些强大的、经过大量设计的竞争对手。
  On the test-dev set, our method improves on the existing best results by 0.5 points of AP (36.2 vs. 35.7) and 3.4 points of AP@0.5 (59.1 vs. 55.7). It is worth noting that our method does not rely on image pyramids and only uses a single input image scale, but still has outstanding AP on small-scale objects. This could only be achieved by high-resolution image inputs with previous methods.
  在测试-开发集上,我们的方法比现有的最佳结果增加了0.5点的AP(36.2对35.7)和3.4点的AP@0.5(59.1对55.7)。值得注意的是,我们的方法不依赖图像金字塔,只使用单一的输入图像比例,但在小尺寸目标上仍有突出的AP。这只有在以前的方法中通过高分辨率的图像输入才能实现。
  Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.
  此外,我们的方法没有利用许多流行的改进方法,如迭代回归[9]、困难负样本挖掘[35]、上下文建模[16]、更强的数据增强[22]等。这些改进是对FPN的补充,应该能进一步提高准确性。
  Recently, FPN has enabled new top results in all tracks of the COCO competition, including detection, instance segmentation, and keypoint estimation. See [14] for details.
  最近,FPN在COCO竞赛的所有赛道上都取得了新的顶尖成绩,包括检测、实例分割和关键点估计。详见[14]。

6. Extensions: Segmentation Proposals( 分割实验有需要请看原论文)

7. Conclusion(结论)

We have presented a clean and simple framework for building feature pyramids inside ConvNets. Our method shows significant improvements over several strong baselines and competition winners. Thus, it provides a practical solution for research and applications of feature pyramids, without the need of computing image pyramids. Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multiscale problems using pyramid representations.
  我们提出了一个简洁的框架,用于在ConvNets内构建特征金字塔。我们的方法比几个强大的基线和竞赛获胜者有明显的改进。因此,它为特征金字塔的研究和应用提供了一个实用的解决方案,而不需要计算图像金字塔。最后,我们的研究表明,尽管深度ConvNets具有很强的表示能力,并且对尺度变化具有隐含的鲁棒性,但使用金字塔表示明确地解决多尺度问题仍然是至关重要的。
