Paper Translation: Feature Pyramid Networks for Object Detection

This post translates "Feature Pyramid Networks for Object Detection" and discusses the role of the Feature Pyramid Network (FPN) in deep-learning object detection. FPN exploits the multi-scale, pyramidal hierarchy of deep convolutional networks to build feature maps with high-level semantics at all scales at low extra cost, improving detection of objects across a wide range of scales. Used within a Faster R-CNN system, the method achieves state-of-the-art single-model results on the COCO detection benchmark, surpassing all existing single-model entries. FPN also runs quickly on a GPU, making it a practical and accurate solution for multi-scale object detection.

Abstract:


Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

1  Introduction

Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.


Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.
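The featurized image pyramid of Fig. 1(a) can be sketched in a few lines of plain numpy. This is a minimal illustration, not any paper's implementation: `extract_features` is a hypothetical stand-in for a hand-engineered extractor (a 2x2 average pool here), and the scale factors are illustrative.

```python
import numpy as np

def extract_features(image):
    """Hypothetical feature extractor: here just a 2x2 average pool."""
    h, w = image.shape[0] // 2 * 2, image.shape[1] // 2 * 2
    img = image[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def resize_nearest(image, scale):
    """Nearest-neighbor resize by a scale factor."""
    h, w = int(image.shape[0] * scale), int(image.shape[1] * scale)
    rows = (np.arange(h) / scale).astype(int).clip(0, image.shape[0] - 1)
    cols = (np.arange(w) / scale).astype(int).clip(0, image.shape[1] - 1)
    return image[rows][:, cols]

def featurized_image_pyramid(image, scales=(1.0, 0.5, 0.25)):
    # Features are computed independently at every image scale -- this
    # repeated computation is exactly why the approach is slow.
    return [extract_features(resize_nearest(image, s)) for s in scales]

image = np.random.rand(64, 64)
pyramid = featurized_image_pyramid(image)
print([f.shape for f in pyramid])  # [(32, 32), (16, 16), (8, 8)]
```

Every level is "semantically strong" because the same full extractor runs at every scale, but the cost grows with the number of scales, which motivates the limitations discussed next.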


Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.


However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multiscale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.
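The contrast with the featurized image pyramid can be made concrete with a small sketch (plain numpy, with strided average pooling standing in for a ConvNet stage's subsampling): a single forward pass already yields maps at strides 2, 4, 8, 16 at no extra cost, but the early, high-resolution maps have passed through few layers and thus carry weak semantics.

```python
import numpy as np

def conv_stage(x, stride=2):
    """Stand-in for one ConvNet stage: strided 2x2 average pooling
    halves the spatial resolution, mimicking subsampling layers."""
    h = x.shape[0] // stride * stride
    w = x.shape[1] // stride * stride
    x = x[:h, :w]
    return x.reshape(h // stride, stride, w // stride, stride).mean(axis=(1, 3))

def feature_hierarchy(image, num_stages=4):
    # One forward pass produces the whole multi-scale hierarchy;
    # maps[0] is high-resolution but semantically shallow,
    # maps[-1] is low-resolution but semantically deep.
    maps, x = [], image
    for _ in range(num_stages):
        x = conv_stage(x)
        maps.append(x)
    return maps

maps = feature_hierarchy(np.random.rand(64, 64))
print([m.shape for m in maps])  # [(32, 32), (16, 16), (8, 8), (4, 4)]
```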


Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.


The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.


The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.
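The top-down pathway with lateral connections can be sketched in plain numpy. This is a minimal sketch, not the authors' implementation: the 1x1 lateral convolutions use random weights, upsampling is nearest-neighbor, the output channel count of 256 is illustrative, and the 3x3 smoothing convolution applied to each merged map in the full architecture is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def lateral_1x1(feature, w):
    """A 1x1 convolution is just a matrix multiply over the channel axis.
    feature: (H, W, C_in), w: (C_in, C_out)."""
    return feature @ w

def upsample2x(feature):
    """Nearest-neighbor 2x upsampling of the top-down signal."""
    return feature.repeat(2, axis=0).repeat(2, axis=1)

def build_fpn(bottom_up, d=256):
    """bottom_up: backbone maps of shape (H, W, C), ordered fine to coarse,
    each half the resolution of the previous. Returns pyramid maps that all
    have d channels, in the same fine-to-coarse order."""
    laterals = [lateral_1x1(f, rng.standard_normal((f.shape[2], d)))
                for f in bottom_up]
    # Start from the coarsest (semantically strongest) map and merge
    # downward: upsample the running map and add the lateral connection.
    merged = [laterals[-1]]
    for lat in reversed(laterals[:-1]):
        merged.append(lat + upsample2x(merged[-1]))
    return merged[::-1]

# Fake backbone maps at strides 4, 8, 16, 32 with growing channel depth.
c2 = rng.standard_normal((32, 32, 64))
c3 = rng.standard_normal((16, 16, 128))
c4 = rng.standard_normal((8, 8, 256))
c5 = rng.standard_normal((4, 4, 512))
pyramid = build_fpn([c2, c3, c4, c5])
print([p.shape for p in pyramid])
# [(32, 32, 256), (16, 16, 256), (8, 8, 256), (4, 4, 256)]
```

Each output level now mixes high-resolution detail from the lateral path with high-level semantics propagated down from the coarsest map, which is the key property the paper contrasts against Fig. 1(c).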


Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.


We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. Our method is also easily extended to mask proposals and improves both instance segmentation AR and speed over state-of-the-art methods that heavily depend on image pyramids.


In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods. Moreover, this improvement is achieved without increasing testing time over the single-scale baseline. We believe these advances will facilitate future research and applications. Our code will be made publicly available.
