[Pure Theory] FPN (Feature Pyramid Network)

Latest recommended article published on 2022-09-26 11:30:18

Le0v1n

Categories: PyTorch, Object Detection, Interview Questions. Tags: deep learning, computer vision, artificial intelligence

Please credit the source when reposting. Thanks.

Article link: https://blog.csdn.net/weixin_44878336/article/details/126004264

Feature Pyramid Networks for Object Detection

Authors: Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie

Affiliations:

Facebook AI Research (FAIR)
Cornell University and Cornell Tech

Paper: https://arxiv.org/abs/1612.03144

CVPR 2017 (arXiv preprint submitted December 2016)

0. Abstract

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

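1. Overview

1.1 Effect of FPN on Faster R-CNN

According to the paper, adding FPN to Faster R-CNN improves the COCO-style AP (AP@[0.5:0.95]) by 2.3 points and the PASCAL-style AP (AP@0.5) by 3.8 points over a single-scale baseline. This shows that FPN is effective at improving object detection performance.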

1.2 Comparison of pyramid schemes

Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.

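1.2.1 (a) Featurized image pyramid

To detect objects at different scales, the input image can be resized to several scales; panel (a) shows four. Each resized image is then passed through the detection network in turn to produce detections.

The network has to run once per scale, so this approach is clearly very inefficient.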

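1.2.2 (b) Single-scale image

This is the strategy used in Faster R-CNN: the network sees the image at one scale only and predicts from a single feature map. The advantage of (b) is speed, since only one scale is processed. The drawback is that, without multiple scales, small objects are predicted poorly.

The image is downsampled as it passes through the backbone, so a small object covers very few cells of the final feature map.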

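1.2.3 (c) Pyramidal feature hierarchy

Scheme (c) is similar to what SSD does. A single image (one scale) is fed to the backbone, whose forward pass produces feature maps at several scales, and predictions are made on each of these feature maps separately. Compared with (a), this is clearly better, since the network runs only once.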

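1.2.4 (d) FPN (Feature Pyramid Network)

Scheme (d) resembles (c), but instead of predicting directly on each feature map as (c) does, it first fuses feature maps of different scales and then predicts on the fused maps.

As the results in 1.1 show, FPN does improve detection performance.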

2. FPN (Feature Pyramid Network)

As mentioned above, feature maps of different scales need to be fused. The question is how.

Figure 3. A building block illustrating the lateral connection and the top-down pathway, merged by addition.

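2.1 How FPN works

In the figure above, the sizes of the feature maps to be fused are fixed by the design of the backbone: in general, each successive backbone feature map is downsampled by a factor of 2 relative to the previous one.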
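To make the factor-of-2 relationship concrete, the short sketch below lists the spatial size of each backbone stage for a given input. The strides 4, 8, 16, 32 for C2 to C5 are the ones the FPN paper reports for ResNet; the 224×224 input and the helper name `stage_sizes` are assumptions made only for this illustration.

```python
# Sketch: spatial size of each backbone stage, assuming ResNet-style strides.
# `stage_sizes` is a hypothetical helper, not part of any library.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}  # stride w.r.t. the input image


def stage_sizes(height, width):
    """Return {stage: (h, w)}, assuming the input size is divisible by 32."""
    return {name: (height // s, width // s) for name, s in STRIDES.items()}


sizes = stage_sizes(224, 224)  # 224x224 is an assumed example input
for name, (h, w) in sizes.items():
    print(name, h, w)
```

Each stage is half the height and width of the one before it, which is why a single 2× upsampling is enough to align neighboring levels.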

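In FPN, every backbone feature map is first passed through a 1×1 convolution, whose purpose is to adjust the number of channels.

Fusion is done by element-wise addition $\oplus$, so the feature maps must have the same shape before they are fused. The 1×1 convolutions make sure their channel counts match.

Next, the spatial sizes have to match as well.

Starting from the top (coarsest) feature map, upsample it by 2× so that its height and width double (upsampling does not change the number of channels) and add it to the map one level below. The fused map at that middle level is then upsampled by 2× in the same way and added to the next level down.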

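2.2 Implementing the 2× upsampling

The upsampling here is very simple. It does not use a transposed convolution (sometimes called deconvolution); it uses nearest-neighbor interpolation, as shown below:

# x: list of backbone feature maps, ordered from finest to coarsest
# F is torch.nn.functional
last_inner = self.get_result_from_inner_blocks(x[-1], -1)  # 1x1 conv on the coarsest map
result = [self.get_result_from_layer_blocks(last_inner, -1)]  # 3x3 conv gives the top output
for idx in range(len(x) - 2, -1, -1):  # walk from coarse to fine
    inner_lateral = self.get_result_from_inner_blocks(x[idx], idx)  # lateral 1x1 conv
    feat_shape = inner_lateral.shape[-2:]
    inner_top_down = F.interpolate(last_inner, size=feat_shape, mode="nearest")  # nearest-neighbor interpolation
    last_inner = inner_lateral + inner_top_down  # fuse by element-wise addition
    result.insert(0, self.get_result_from_layer_blocks(last_inner, idx))  # 3x3 conv on the fused map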
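To see what one step of that loop does numerically, here is a minimal sketch without PyTorch. It works on single-channel feature maps stored as nested lists; `upsample_nearest_2x` and `merge` are hypothetical helper names, and the input values are made up for the illustration.

```python
# Sketch: 2x nearest-neighbor upsampling followed by element-wise addition,
# on single-channel feature maps stored as nested lists.


def upsample_nearest_2x(fmap):
    """Repeat every value twice along both height and width."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))  # duplicate the row
    return out


def merge(lateral, top_down):
    """Element-wise sum of two maps with identical shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, top_down)]


coarse = [[1, 2],
          [3, 4]]                       # 2x2 map coming from the higher level
lateral = [[10] * 4 for _ in range(4)]  # 4x4 map coming from the lateral branch

up = upsample_nearest_2x(coarse)
merged = merge(lateral, up)
for row in merged:
    print(row)
```

Nearest-neighbor upsampling has no learnable parameters: each coarse value is simply copied into a 2×2 block before the addition.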

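2.3 Detailed FPN structure

After the 1×1 convolutions and the upsampling, the feature maps on each branch of the FPN can be fused. What the earlier figure does not show is the 3×3 convolution that follows: each fused feature map goes through its own 3×3 convolution for further feature extraction, which yields the final outputs P2, P3, P4, and P5.

According to the paper, P5 is then downsampled to obtain P6.

The downsampling is also simple to implement: it is a max pooling layer.

In the paper, each 1×1 convolution has 256 kernels, so all the final feature maps have 256 channels.

The max pooling that produces P6 uses a 1×1 kernel (with stride 2, which is what makes it a 2× downsampling), so every window holds a single value and swapping in average pooling would give exactly the same result 😂.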
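The claim that max pooling and average pooling coincide for a 1×1 window can be checked with a small sketch. `pool2d` is a hypothetical helper written for this illustration, the input values are made up, and the stride of 2 is an assumption in line with the paper's description of P6 as a stride-two subsampling of P5.

```python
# Sketch: a pooling layer with a 1x1 window and stride 2 only subsamples,
# so max pooling and average pooling return the same values.


def pool2d(fmap, kernel, stride, reduce_fn):
    """Apply reduce_fn over each kernel x kernel window (no padding)."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for i in range(0, rows - kernel + 1, stride):
        out_row = []
        for j in range(0, cols - kernel + 1, stride):
            window = [fmap[i + di][j + dj]
                      for di in range(kernel) for dj in range(kernel)]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out


def mean(values):
    return sum(values) / len(values)


p5 = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]

p6_max = pool2d(p5, kernel=1, stride=2, reduce_fn=max)
p6_avg = pool2d(p5, kernel=1, stride=2, reduce_fn=mean)
print(p6_max, p6_avg)
```

With a 1×1 window there is only one value to reduce, so the choice of reduction does not matter; the layer keeps every second row and column.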

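2.3.1 Note 1

P6 is used only by the RPN part of Faster R-CNN (the RPN uses all five feature maps, P2 to P6, when generating proposals). The Faster R-CNN predictor, however, predicts only on the four feature maps P2 to P5.

In plain Faster R-CNN, the RPN runs on the prediction feature map and produces a set of proposals; the proposals are projected back onto the feature map, and the features in each projected region are fed to the predictor to get the final predictions.

In Faster R-CNN with FPN, the FPN first produces prediction feature maps at five scales, and the RPN runs on all of them to produce the proposals. When the proposals are projected back onto the prediction feature maps, not all five are used, only the four maps P2 to P5. The predictor then gives the final predictions.

This raises an obvious question: for proposals that the RPN generated on feature maps of different scales, how do we decide which feature map each one should be projected onto? We set this question aside for now and return to it in section 3.2.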

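2.3.2 Note 2

Because the RPN generates proposals at several different scales, the network can predict objects of different sizes on different prediction feature maps.

In the earlier discussion of Faster R-CNN there was only one prediction feature map, so anchors of different areas and aspect ratios were all generated on that one map. With FPN, feature maps of different scales can each be used to predict objects of a different size.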

2.4 Object sizes and settings for each prediction feature map

Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively. As in [29] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.

Prediction feature map	Target object size	Anchor area (pixels)	Aspect ratios
`P2`	Smallest	32×32	1:2, 1:1, 2:1
`P3`	Small	64×64	1:2, 1:1, 2:1
`P4`	Medium	128×128	1:2, 1:1, 2:1
`P5`	Large	256×256	1:2, 1:1, 2:1
`P6`	Largest	512×512	1:2, 1:1, 2:1
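The table can be turned into concrete anchor shapes with a few lines of arithmetic. The sketch below is only an illustration: `anchor_shapes` is a hypothetical helper, and reading a ratio "a:b" as height:width is an assumption, since the quoted passage does not spell out the convention.

```python
import math

# Sketch: enumerate the 15 anchor shapes described above.
# One area per pyramid level, three aspect ratios per level.
AREAS = {"P2": 32 ** 2, "P3": 64 ** 2, "P4": 128 ** 2, "P5": 256 ** 2, "P6": 512 ** 2}
RATIOS = (0.5, 1.0, 2.0)  # 1:2, 1:1, 2:1, read here as height / width (assumption)


def anchor_shapes(area, ratios=RATIOS):
    """Return (width, height) pairs with width * height == area and height / width == ratio."""
    shapes = []
    for r in ratios:
        w = math.sqrt(area / r)
        h = w * r
        shapes.append((w, h))
    return shapes


anchors = {level: anchor_shapes(area) for level, area in AREAS.items()}
total = sum(len(v) for v in anchors.values())
print(total)
for level, shapes in anchors.items():
    print(level, [(round(w, 1), round(h, 1)) for w, h in shapes])
```

Every level keeps the same area across its three ratios; only the shape changes, which gives 5 × 3 = 15 anchors over the pyramid.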

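3. Miscellaneous

3.1 Number of RPN and Faster R-CNN predictor components

From the discussion of Faster R-CNN we know the network has two key parts:

RPN
Faster R-CNN predictor

As shown in the figure below.

[Figure: the RPN and the predictor in Faster R-CNN]

With FPN, however, there are several prediction feature maps. Do we need a separate RPN and predictor for each one?

The authors ran this experiment in the paper. They found that sharing a single RPN and predictor across all prediction feature maps performs about the same as using a separate RPN and predictor for each map.

Since detection accuracy does not differ, sharing the RPN and predictor is the better choice, because it reduces the number of trainable parameters.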

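3.2 Proposal mapping strategy

With FPN, both the proposals and the prediction feature maps come at several scales, and the RPN and the predictor use different sets of prediction feature maps (P2 to P6 versus P2 to P5). How to project proposals onto the prediction feature maps therefore becomes a question.

The authors give their solution in the paper: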

We view our feature pyramid as if it were produced from an image pyramid. Thus we can adapt the assignment strategy of region-based detectors [15, 11] in the case when they are run on image pyramids. Formally, we assign an RoI of width $w$ and height $h$ (on the input image to the network) to the level $P_k$ of our feature pyramid by:

$k = \lfloor k_0 + \log_2 (\sqrt{wh} / 224)\rfloor \qquad (1)$

Here 224 is the canonical ImageNet pre-training size, and $k_0$ is the target level on which an RoI with $w \times h = 224^2$ should be mapped into. Analogous to the ResNet-based Faster R-CNN system [16] that uses $C_4$ as the single-scale feature map, we set $k_0$ to 4. Intuitively, Eqn.(1) means that if the RoI’s scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, $k = 3$ ).

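In the formula above:

$\lfloor \cdot \rfloor$ denotes rounding down (the floor function)
$k \in \{2, 3, 4, 5\}$ is the index of the prediction feature map the proposal is mapped to (P2 to P5)
$k_0$ is set to 4
$w$ and $h$ are the width and height, on the original image, of a proposal generated by the RPN

Suppose a proposal has $w = h = 112$ on the original image, so that $\sqrt{wh} = 112$. Then:

$\begin{aligned} k & = \lfloor k_0 + \log_2 (\sqrt{wh} / 224)\rfloor \\ & = \lfloor 4 + \log_2 (112 / 224)\rfloor \\ & = \lfloor 4 + \log_2 (1 / 2)\rfloor \\ & = \lfloor 4 + (-1) \rfloor \\ & = 3 \end{aligned}$

The value of $k$ corresponds one-to-one with the index of the prediction feature map, so this proposal is mapped to P3.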

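The PyTorch implementation is as follows:

from typing import List

import torch
from torch import Tensor
from torchvision.ops.boxes import box_area


class LevelMapper: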
    """Determine which FPN level each RoI in a set of RoIs should map to based
    on the heuristic in the FPN paper.

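    Args:
        k_min (int): index of the finest pyramid level (e.g. 2 for P2)
        k_max (int): index of the coarsest pyramid level used for RoIs (e.g. 5 for P5)
        canonical_scale (int): RoI scale sqrt(w * h) that maps to canonical_level; 224 in the paper
        canonical_level (int): k_0 in Eqn.(1); 4 in the paper
        eps (float): small offset added before flooring to guard against round-off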
    """

    def __init__(
        self,
        k_min: int,
        k_max: int,
        canonical_scale: int = 224,
        canonical_level: int = 4,
        eps: float = 1e-6,
    ):
        self.k_min = k_min
        self.k_max = k_max
        self.s0 = canonical_scale
        self.lvl0 = canonical_level
        self.eps = eps

    def __call__(self, boxlists: List[Tensor]) -> Tensor:
        """
        Args:
            boxlists (list[BoxList])

        box_area returns width * height for each box; the square root taken
        below turns that area into the RoI scale sqrt(w * h).
        """
        # Compute level ids
        s = torch.sqrt(torch.cat([box_area(boxlist) for boxlist in boxlists]))

        # Eqn.(1) in FPN paper
        target_lvls = torch.floor(self.lvl0 + torch.log2(s / self.s0) + torch.tensor(self.eps, dtype=s.dtype))
        # Clamp to [k_min, k_max], then shift so that level k_min becomes index 0
        target_lvls = torch.clamp(target_lvls, min=self.k_min, max=self.k_max)
        return (target_lvls.to(torch.int64) - self.k_min).to(torch.int64)
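As a quick check of the worked example above, here is a scalar, dependency-free sketch of the same rule. `map_roi_to_level` is a hypothetical helper; unlike the class above it returns the level number k itself (2 to 5) rather than a zero-based index, and its default arguments mirror the values discussed in the text.

```python
import math


def map_roi_to_level(w, h, k_min=2, k_max=5, k0=4, canonical=224, eps=1e-6):
    """Scalar version of Eqn.(1): k = floor(k0 + log2(sqrt(w * h) / canonical)),
    clamped to the range [k_min, k_max]."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical) + eps)
    return max(k_min, min(k_max, k))


# Square RoIs of increasing side length (example values chosen for illustration)
levels = [map_roi_to_level(s, s) for s in (32, 112, 224, 448, 1000)]
print(levels)
```

Very small RoIs are clamped to P2 and very large ones to P5, which matches the point made earlier that the predictor only uses P2 to P5.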