Paper Reading Notes (22): Feature Pyramid Networks for Object Detection (FPN)

__Sunshine__

Category: Notes. Tags: FPN, Feature Pyramid, Object Detection

Article link: https://blog.csdn.net/sunshine_010/article/details/80000844

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection.

Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.

Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.
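
To make "10 scales per octave" concrete, here is a minimal sketch of how the resize factors of a densely sampled image pyramid could be generated, and of how many levels an object shifts when its size changes. The function names and the choice of three octaves are illustrative assumptions, not something taken from the paper or from DPM.

```python
import math


def pyramid_scales(num_octaves, scales_per_octave):
    """Resize factors from full resolution downward; the factor halves once per octave."""
    count = num_octaves * scales_per_octave + 1
    return [2.0 ** (-i / scales_per_octave) for i in range(count)]


def level_shift(size_ratio, scales_per_octave):
    """Number of pyramid levels that offsets an object growing by size_ratio."""
    return round(scales_per_octave * math.log2(size_ratio))


# Hypothetical setting: 3 octaves sampled at 10 scales per octave.
scales = pyramid_scales(num_octaves=3, scales_per_octave=10)
print(len(scales))            # 31 rescaled copies of the image
print(level_shift(2.0, 10))   # 10: an object twice as large matches 10 levels further down
```

Every entry in `scales` is a separate image on which features have to be computed, which is where the inference cost of featurized image pyramids comes from.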

Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.

However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.

The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. But to avoid using low-level features, SSD forgoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.

The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.

Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.

We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. Our method is also easily extended to mask proposals and improves both instance segmentation AR and speed over state-of-the-art methods that heavily depend on image pyramids.

In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods. Moreover, this improvement is achieved without increasing testing time over the single-scale baseline. We believe these advances will facilitate future research and applications.

Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more.

There has also been significant interest in computing featurized image pyramids quickly. Dollár et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.

Deep ConvNet object detectors. With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet.

SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale. Recent and more accurate detection methods like Fast R-CNN [11] and Faster R-CNN [29] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed. Multi-scale detection, however, still performs better, especially for small objects.

Methods using multiple layers. A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations. Hypercolumns [13] uses a similar method for object instance segmentation. Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. SSD [22] and MS-CNN [3] predict objects at multiple layers of the feature hierarchy without combining features or scores.
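
The remark that concatenating features before a prediction "is equivalent to summing transformed features" holds when the predictor applied to the concatenation is linear: splitting its weight matrix by columns gives one transform per layer, and the two partial outputs add up to the same result. A small numeric check of this, with made-up feature values and weights:

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical features from two layers at the same spatial location.
f_shallow = [1.0, 2.0]
f_deep = [3.0, -1.0, 0.5]

# Hypothetical linear predictor on the concatenated feature: 2 outputs, 2 + 3 inputs.
w_concat = [
    [0.5, -1.0, 2.0, 0.0, 4.0],
    [1.0, 1.0, -0.5, 3.0, 2.0],
]

# Split the weight columns into the part that sees each layer.
w_shallow = [row[:2] for row in w_concat]
w_deep = [row[2:] for row in w_concat]

pred_concat = matvec(w_concat, f_shallow + f_deep)
pred_sum = [
    a + b
    for a, b in zip(matvec(w_shallow, f_shallow), matvec(w_deep, f_deep))
]

print(pred_concat)  # [6.5, -0.5]
print(pred_sum)     # [6.5, -0.5]
```

The equivalence only covers the first linear layer after the concatenation; any non-linearity applied before the features are combined would break it.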

There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28].

Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPNs to instance segmentation proposals in Sec. 6.

Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16]. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following.

Bottom-up pathway. The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size and we say these layers are in the same network stage. For our feature pyramid, we define one pyramid level for each stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create our pyramid. This choice is natural since the deepest layer of each stage should have the strongest features.
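
The stage rule described above can be sketched in a few lines: group layers by the stride of their output and keep the last layer of each group as the reference map. The layer names and strides below are a made-up toy backbone, and the sketch assumes the stride never decreases along the forward pass, so that equal stride means same stage.

```python
def last_layer_per_stage(layers):
    """layers: (name, output_stride) pairs in forward order.

    Returns {stride: name of the last layer producing that stride}.
    """
    reference = {}
    for name, stride in layers:
        # A later layer with the same stride overwrites the earlier one,
        # so the deepest layer of each stage is what remains.
        reference[stride] = name
    return reference


# Hypothetical backbone; names and depths are for illustration only.
toy_backbone = [
    ("stem", 2),
    ("block2a", 4), ("block2b", 4),
    ("block3a", 8), ("block3b", 8), ("block3c", 8),
    ("block4a", 16), ("block4b", 16),
    ("block5a", 32), ("block5b", 32),
]

reference_layers = last_layer_per_stage(toy_backbone)
print(reference_layers)
```

Each entry of `reference_layers` corresponds to one candidate pyramid level.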

Specifically, for ResNets [16] we use the feature activations output by each stage’s last residual block. We denote the output of these last residual blocks as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 into the pyramid due to its large memory footprint.
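
As a quick illustration of what these strides mean, the sketch below computes the spatial size of each of {C2, C3, C4, C5} for a given input size. The 800×1216 input is an arbitrary example chosen to be divisible by 32, and rounding up for other sizes is an assumption, since the exact result depends on the padding conventions of the backbone.

```python
import math

# Strides of the reference maps with respect to the input image, as given in the text.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def feature_map_sizes(height, width):
    """Approximate (height, width) of each reference map for a given input size."""
    return {
        name: (math.ceil(height / stride), math.ceil(width / stride))
        for name, stride in STRIDES.items()
    }


sizes = feature_map_sizes(800, 1216)
for name, (h, w) in sizes.items():
    print(name, h, w)
# C2 200 304
# C3 100 152
# C4 50 76
# C5 25 38
```

Each level has a quarter of the spatial positions of the level below it, which is why a stride-2 conv1 map would dominate the memory cost if it were included.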

Top-down pathway and lateral connections. The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.

Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.
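
The merge step described above can be sketched in plain Python on nested lists. This is only an illustration under simplifying assumptions: feature maps are stored as `[channel][row][column]`, the 1×1 convolutions have no bias, every level is exactly twice the size of the next coarser one, and the final 3×3 smoothing convolution is left out, so the outputs are the merged maps rather than the final {P2, ..., P5}. All function names and the toy values are hypothetical.

```python
def conv1x1(fmap, weights):
    """1x1 convolution without bias: a per-pixel linear mix of channels.

    fmap: [C_in][H][W], weights: [C_out][C_in].
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(w * fmap[c][y][x] for c, w in enumerate(row)) for x in range(width)]
            for y in range(height)
        ]
        for row in weights
    ]


def upsample_nearest_2x(fmap):
    """Nearest-neighbor upsampling by 2: every pixel becomes a 2x2 block."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))
        out.append(rows)
    return out


def add(a, b):
    """Element-wise addition of two maps with identical shapes."""
    return [
        [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def top_down(bottom_up, lateral_weights):
    """bottom_up: reference maps ordered finest to coarsest (e.g. C2..C5).

    Returns the merged maps in the same order, before any 3x3 smoothing.
    """
    merged = conv1x1(bottom_up[-1], lateral_weights[-1])  # start from the coarsest map
    outputs = [merged]
    levels = zip(reversed(bottom_up[:-1]), reversed(lateral_weights[:-1]))
    for fmap, weights in levels:
        merged = add(upsample_nearest_2x(merged), conv1x1(fmap, weights))
        outputs.append(merged)
    return outputs[::-1]


# Toy two-level example with d = 1 output channel (made-up values).
c4 = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]   # 2 channels, 2x2
c5 = [[[10]], [[20]], [[30]]]               # 3 channels, 1x1
merged4, merged5 = top_down([c4, c5], [[[1, 1]], [[1, 0, 1]]])
print(merged5)  # [[[40]]]
print(merged4)  # [[[41, 43], [44, 44]]]
```

In the toy example the coarse value 40 is copied to all four positions of the finer level and then added to the channel-reduced bottom-up map, which is the whole mechanism of one building block.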

Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (numbers of channels, denoted as d) in all the feature maps. We set d = 256 in this paper and thus all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.
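
For a rough sense of how small these extra layers are, the following back-of-the-envelope count adds up the parameters of the four lateral 1×1 convolutions and the four 3×3 output convolutions with d = 256. Two things here are assumptions not stated in the text: the input widths 256, 512, 1024 and 2048 are the usual output widths of C2 to C5 in bottleneck ResNets such as ResNet-50, and each convolution is counted with one bias per output channel.

```python
D = 256  # feature dimension d fixed across all pyramid levels (from the text)

# Assumed channel widths of the reference maps for a bottleneck ResNet.
BACKBONE_CHANNELS = {"C2": 256, "C3": 512, "C4": 1024, "C5": 2048}


def conv_params(c_in, c_out, kernel):
    """Weights plus one bias per output channel for a kernel x kernel convolution."""
    return c_in * c_out * kernel * kernel + c_out


lateral = sum(conv_params(c, D, 1) for c in BACKBONE_CHANNELS.values())
smoothing = len(BACKBONE_CHANNELS) * conv_params(D, D, 3)
total = lateral + smoothing

print(lateral)    # 984064
print(smoothing)  # 2360320
print(total)      # 3344384
```

Under these assumptions the pyramid adds roughly 3.3 million parameters on top of the backbone, most of them in the 3×3 convolutions.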

Simplicity is central to our design and we have found that our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multilayer residual blocks [16] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.

We have presented a clean and simple framework for building feature pyramids inside ConvNets. Our method shows significant improvements over several strong baselines and competition winners. Thus, it provides a practical solution for research and applications of feature pyramids, without the need of computing image pyramids. Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multiscale problems using pyramid representations.

Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.

Figure 2. Top: a top-down architecture with skip connections, where predictions are made on the finest level (e.g., [28]). Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels.

Figure 3. A building block illustrating the lateral connection and the top-down pathway, merged by addition.

Figure 4. FPN for object segment proposals. The feature pyramid is constructed with identical structure as for object detection. We apply a small MLP on 5×5 windows to generate dense object segments with output dimension of 14×14. Shown in orange are the sizes of the image regions the mask corresponds to for each pyramid level (levels P₃₋₅ are shown here). Both the corresponding image region size (light orange) and canonical object size (dark orange) are shown. Half octaves are handled by an MLP on 7×7 windows ($7 \approx 5\sqrt{2}$), not shown here. Details are in the appendix.
