论文笔记(六)【特征金字塔】Feature Pyramid Networks for Object Detection

《Feature Pyramid Networks for Object Detection》这篇论文主要解决的问题是目标检测在处理多尺度变化问题是的不足,现在的很多网络都使用了利用单个高层特征(比如说Faster R-CNN利用下采样四倍的卷积层——Conv4,进行后续的物体的分类和bounding box的回归),但是这样做有一个明显的缺陷,即小物体本身具有的像素信息较少,在下采样的过程中极易被丢失,为了处理这种物体大小差异十分明显的检测问题,经典的方法是利用图像金字塔的方式进行多尺度变化增强,但这样会带来极大的计算量。所以这篇论文提出了特征金字塔的网络结构,能在增加极小的计算量的情况下,处理好物体检测中的多尺度变化问题

Abstract
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A topdown architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art singlemodel results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

特征金字塔是识别系统中用于检测不同尺度目标的基本组件。但最近的深度学习目标检测器已经避免了金字塔表示,部分原因是它们是计算和内存密集型的。在本文中,我们利用深度卷积网络内在的多尺度、金字塔分级来构造具有很少额外成本的特征金字塔。开发了一种具有横向连接的自顶向下架构,用于在所有尺度上构建高级语义特征映射。这种称为特征金字塔网络(FPN)的架构在几个应用程序中作为通用特征提取器表现出了显著的改进。在一个基本的Faster R-CNN系统中使用FPN,没有任何不必要的东西,我们的方法可以在COCO检测基准数据集上取得最先进的单模型结果,结果超过了所有现有的单模型输入,包括COCO 2016挑战赛的获奖者。此外,我们的方法可以在GPU上以6FPS运行,因此是多尺度目标检测的实用和准确的解决方案。代码将公开发布。

  1. Introduction

1,介绍

Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.

识别不同尺度的目标是计算机视觉中的一个基本挑战。建立在图像金字塔之上的特征金字塔(我们简称为特征化图像金字塔)构成了标准解决方案的基础[1](图1(a))。这些金字塔是尺度不变的,因为目标的尺度变化是通过在金字塔中移动它的层级来抵消的。直观地说,该属性使模型能够通过在位置和金字塔等级上扫描模型来检测大范围尺度内的目标。

在这里插入图片描述
Figure 1. (a) Using an image pyramid to build a feature pyramid.Features are computed on each of the image scales independently,which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. © An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and ©, but more accurate. In this figure, feature maps are indicate by blue outlines and thicker
outlines denote semantically stronger features.

图1。(a)使用图像金字塔构建特征金字塔。每个图像尺度上的特征都是独立计算的,速度很慢(b)最近的检测系统选择只使用单一尺度特征进行更快的检测。(c)另一种方法是重用ConvNet计算的金字塔特征层次结构,就好像它是一个特征化的图像金字塔。(d)我们提出的特征金字塔网络(FPN)与(b)和(c)类似,但更准确。在该图中,特征映射用蓝色轮廓表示,较粗的轮廓表示语义上较强的特征

Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principle advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.

特征化图像金字塔在手工设计的时代被大量使用[5,25]。它们非常关键,以至于像DPM[7]这样的目标检测器需要密集的尺度采样才能获得好的结果(例如每组10个尺度,octave参考SIFT特征)。对于识别任务,工程特征大部分已经被深度卷积网络(ConvNets)[19,20]计算的特征所取代。除了能够表示更高级别的语义,ConvNets对于尺度变化也更加鲁棒,从而有助于从单一输入尺度上计算的特征进行识别[15,11,29](图1(b))。但即使有这种鲁棒性,金字塔仍然需要得到最准确的结果。在ImageNet[33]和COCO[21]检测挑战中,最近的所有排名靠前的输入都使用了针对特征化图像金字塔的多尺度测试(例如[16,35])。对图像金字塔的每个层次进行特征化的主要优势在于它产生了多尺度的特征表示,其中所有层次上在语义上都很强,包括高分辨率层。

Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.

尽管如此,特征化图像金字塔的每个层次都具有明显的局限性。推断时间显著增加(例如,四倍[11]),使得这种方法在实际应用中不切实际。此外,在图像金字塔上端对端地训练深度网络在内存方面是不可行的,所以如果被采用,图像金字塔仅在测试时被使用[15,11,16,35],这造成了训练/测试时推断的不一致性。出于这些原因,Fast和Faster R-CNN[11,29]选择在默认设置下不使用特征化图像金字塔。

However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multiscale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.

但是,图像金字塔并不是计算多尺度特征表示的唯一方法。深层ConvNet逐层计算特征层级,而对于下采样层,特征层级具有内在的多尺度金字塔形状。这种网内特征层级产生不同空间分辨率的特征映射,但引入了由不同深度引起的较大的语义差异。高分辨率映射具有损害其目标识别表示能力的低级特征。

The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1©). Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4 3 of VGG nets [36]) and then by adding several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.

单次检测器(SSD)[22]是首先尝试使用ConvNet的金字塔特征层级中的一个,好像它是一个特征化的图像金字塔(图1(c))。理想情况下,SSD风格的金字塔将重用正向传递中从不同层中计算的多尺度特征映射,因此是零成本的。但为了避免使用低级特征,SSD放弃重用已经计算好的图层,而从网络中的最高层开始构建金字塔(例如,VGG网络的conv4_3[36]),然后添加几个新层。因此它错过了重用特征层级的更高分辨率映射的机会。我们证明这些对于检测小目标很重要

The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.

本文的目标是自然地利用ConvNet特征层级的金字塔形状,同时创建一个在所有尺度上都具有强大语义的特征金字塔。为了实现这个目标,我们所依赖的架构将低分辨率、强语义的特征与高分辨率、弱语义的特征通过自顶向下的路径和横向连接相结合。(图1(d))。其结果是一个特征金字塔,在所有级别都具有丰富的语义,并且可以从单个输入图像尺度上进行快速构建。换句话说,我们展示了如何创建网络中的特征金字塔,可以用来代替特征化的图像金字塔,而不牺牲表示能力,速度或内存。

Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.

最近的研究[28,17,8,26]中流行采用自顶向下和跳跃连接的类似架构。他们的目标是生成具有高分辨率的单个高级特征映射,并在其上进行预测(图2顶部)。相反,我们的方法利用这个架构作为特征金字塔,其中预测(例如目标检测)在每个级别上独立进行(图2底部)。我们的模型反映了一个特征化的图像金字塔,这在这些研究中还没有探索过。

在这里插入图片描述
Figure 2. Top: a top-down architecture with skip connections, where predictions are made on the finest level (e.g., [28]). Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels.

图2。顶部:带有跳跃连接的自顶向下的架构,在最好的级别上进行预测(例如,[28])。底部:我们的模型具有类似的结构,但将其用作特征金字塔,并在各个层级上独立进行预测。

We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. Our method is also easily extended to mask proposals and improves both instance segmentation AR and speed over state-of-the-art methods that heavily depend on image pyramids.

我们评估了我们称为特征金字塔网络(FPN)的方法,其在各种系统中用于检测和分割[11,29,27]。没有任何不必要的东西,我们在具有挑战性的COCO检测基准数据集上报告了最新的单模型结果,仅仅基于FPN和基本的Faster R-CNN检测器[29],就超过了竞赛获奖者所有现存的严重工程化的单模型竞赛输入。在消融实验中,我们发现对于边界框提议,FPN将平均召回率(AR)显著增加了8个百分点;对于目标检测,它将COCO型的平均精度(AP)提高了2.3个百分点,PASCAL型AP提高了3.8个百分点,超过了ResNet[16]上Faster R-CNN强大的单尺度基准线。我们的方法也很容易扩展掩模提议,改进实例分隔AR,加速严重依赖图像金字塔的最先进方法。

### 回答1: 特征金字塔网络(Feature Pyramid Networks, FPN)是一种用于目标检测的神经网络架构。它通过在深层特征图上构建金字塔结构来提高空间分辨率,从而更好地检测小目标。FPN具有高效的多尺度特征表示和鲁棒性,在COCO数据集上取得了很好的表现。 ### 回答2: 特征金字塔网络(Feature Pyramid Networks,简称FPN)是一种用于目标检测的深度学习模型。该模型是由FAIR(Facebook AI Research)在2017年提出的,旨在解决单一尺度特征不能有效检测不同大小目标的问题。 传统的目标检测算法通常采用的是滑动窗口法,即在图像上以不同大小和不同位置进行滑动窗口的检测。但是,这种方法对于不同大小的目标可能需要不同的特征区域来进行检测,而使用单一尺度特征可能会导致对小目标的错误检测或漏检。FPN通过利用图像金字塔和多层特征提取,将不同尺度的特征合并起来,从而达到对不同大小目标的有效检测。 FPN主要分为两个部分:上采样路径(Top-Down Pathway)和下采样路径(Bottom-Up Pathway)。下采样路径主要是通过不同层级的卷积神经网络(CNN)来提取特征,每层都采用了非极大值抑制(Non-Maximum Suppression,NMS)方法来选择最具有代表性的特征。上采样路径则主要是将低层特征进行上采样操作,使其与高层特征的尺寸对齐,并与高层特征相加,实现特征融合。 FPN在目标检测中的优势体现在以下几个方面。首先,FPN可以提高模型对小目标的检测能力,同时仍保持对大目标的检测准确度。其次,FPN的特征金字塔结构可以在一次前向传递中完成目标检测,减少了计算时间。最后,FPN对于输入图像的尺寸和分辨率不敏感,可以在不同分辨率的图像上进行目标检测,从而适应多种应用场景。 总之,FPN是一种在目标检测领域中得到广泛应用的模型,其特征金字塔结构能够有效地解决单一尺度特征不足以检测不同大小目标的问题,并在检测准确率和计算效率方面取得了不错的表现。 ### 回答3: 特征金字塔网络是一种用于目标检测的深度学习模型,主要解决的问题是在不同尺度下检测不同大小的物体。在传统的卷积神经网络中,网络的特征图大小会不断减小,因此只能检测较小的物体,对于较大的物体则无法很好地检测。而特征金字塔网络则通过在底部特征图的基础上构建一个金字塔状的上采样结构,使得网络能够在不同尺度下检测不同大小的物体。 具体来说,特征金字塔网络由两个主要部分构成:共享特征提取器和金字塔结构。共享特征提取器是一个常规的卷积神经网络,用于提取输入图像的特征。而金字塔结构包括多个尺度的特征图,通过上采样和融合来获得不同尺度的特征表示。这些特征图之后被输入到后续的目标检测网络中,可以通过这些特征图来检测不同尺度的物体。 特征金字塔网络可以有效地解决目标检测任务中的尺度问题,并且在许多实际应用中表现出了优异的性能。例如,通过使用特征金字塔网络,在COCO数据集上得到的目标检测结果明显优于现有的一些目标检测算法。 总之,特征金字塔网络是一种非常有效的深度学习模型,可以处理目标检测任务中的尺度问题,提高模型在不同大小物体的检测精度。它在实际应用中具有很高的价值和应用前景。
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值