[Translation] [PANet] Path Aggregation Network for Instance Segmentation

Path Aggregation Network for Instance Segmentation
Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, Jiaya Jia

Paper: Path Aggregation Network for Instance Segmentation
Code: https://github.com/ShuLiu1993/PANet.git

This translation was first produced with machine-translation software and then roughly revised by myself; there may be shortcomings, and corrections are welcome. If there is any infringement, please contact me for removal.

Abstract

The way that information propagates in neural networks is of great importance. In this paper, we propose Path Aggregation Network (PANet) aiming at boosting information flow in proposal-based instance segmentation framework. Specifically, we enhance the entire feature hierarchy with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and topmost feature. We present adaptive feature pooling, which links feature grid and all feature levels to make useful information in each level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction.
  These improvements are simple to implement, with subtle extra computational overhead. Yet they are useful and make our PANet reach the 1st place in the COCO 2017 Challenge Instance Segmentation task and the 2nd place in Object Detection task without large-batch training. PANet is also state-of-the-art on MVD and Cityscapes.

1. Introduction

Instance segmentation is one of the most important and challenging tasks. It aims to predict class labels and pixelwise instance masks to localize a varying number of instances presented in each image. This task widely benefits autonomous vehicles, robotics, video surveillance, to name a few.
  With the help of deep convolutional neural networks, several frameworks for instance segmentation, e.g., [21, 33, 3, 38], were proposed where performance grows rapidly [12]. Mask R-CNN [21] is a simple and effective system for instance segmentation. Based on Fast/Faster R-CNN [16, 51], a fully convolutional network (FCN) is used for mask prediction, along with box regression and classification. To achieve high performance, feature pyramid network (FPN) [35] is utilized to extract in-network feature hierarchy, where a top-down path with lateral connections is augmented to propagate semantically strong features.
  Several newly released datasets [37, 7, 45] facilitate design of new algorithms. COCO [37] consists of 200k images. Several instances with complex spatial layout are captured in each image. Differently, Cityscapes [7] and MVD [45] provide street scenes with a large number of traffic participants in each image. Blur, heavy occlusion and extremely small instances appear in these datasets.
  There have been several principles proposed for designing networks in image classification that are also effective for object recognition. For example, shortening information path and easing information propagation by clean residual connection [23, 24] and dense connection [26] are useful. Increasing the flexibility and diversity of information paths by creating parallel paths following the split-transform-merge strategy [61, 6] is also beneficial.
  Our Findings. Our research indicates that information propagation in state-of-the-art Mask R-CNN can be further improved. Specifically, features in low levels are helpful for large instance identification. But there is a long path from low-level structure to topmost features, increasing difficulty to access accurate localization information. Further, each proposal is predicted based on feature grids pooled from one feature level, which is assigned heuristically. This process can be updated since information discarded in other levels may be helpful for final prediction. Finally, mask prediction is made on a single view, losing the chance to gather more diverse information.
  Our Contributions. Inspired by these principles and observations, we propose PANet, illustrated in Figure 1, for instance segmentation.
[Figure 1: Illustration of the PANet framework.]
  First, to shorten information path and enhance feature pyramid with accurate localization signals existing in low-levels, bottom-up path augmentation is created. In fact, features in low-layers were utilized in the systems of [44, 42, 13, 46, 35, 5, 31, 14]. But propagating low-level features to enhance entire feature hierarchy for instance recognition was not explored.
  Second, to recover broken information path between each proposal and all feature levels, we develop adaptive feature pooling. It is a simple component to aggregate features from all feature levels for each proposal, avoiding arbitrarily assigned results. With this operation, cleaner paths are created compared with those of [4, 62].
  Finally, to capture different views of each proposal, we augment mask prediction with tiny fully-connected (fc) layers, which possess complementary properties to FCN originally used by Mask R-CNN. By fusing predictions from these two views, information diversity increases and masks with better quality are produced.
  The first two components are shared by both object detection and instance segmentation, leading to much enhanced performance of both tasks.
  Experimental Results. With PANet, we achieve state-of-the-art performance on several datasets. With ResNet-50 [23] as the initial network, our PANet tested with a single scale already outperforms champion of COCO 2016 Challenge in both object detection [27] and instance segmentation [33] tasks. Note that these previous results are achieved by larger models [23, 58] along with multi-scale and horizontal flip testing.
  We achieve the 1st place in COCO 2017 Challenge Instance Segmentation task and the 2nd place in Object Detection task without large-batch training. We also benchmark our system on Cityscapes and MVD, which similarly yields top-ranking results, manifesting that our PANet is a very practical and top-performing framework. Our code and models will be made publicly available.

2. Related Work

Instance Segmentation. There are mainly two streams of methods in instance segmentation. The most popular one is proposal-based. Methods in this stream have a strong connection to object detection. In R-CNN [17], object proposals from [60, 68] were fed into the network to extract features for classification. Fast and Faster R-CNN [16, 51] and SPPNet [22] sped up the process by pooling features from global feature maps. Earlier work [18, 19] took mask proposals from MCG [1] as input to extract features while CFM [9], MNC [10] and Hayder et al. [20] merged feature pooling to network for faster speed. Newer design was to generate instance masks in networks as proposal [48, 49, 8] or final result [10, 34, 41]. Mask R-CNN [21] is an effective framework in this stream. Our work is built on Mask R-CNN and improves it in important aspects.
  Methods in the other stream are mainly segmentation-based. They learned specially designed transformation [3, 33, 38, 59] or instance boundaries [30]. Then instance masks were decoded from predicted transformation. Instance segmentation by other pipelines also exists. DIN [2] fused predictions from object detection and semantic segmentation systems. A graphical model was used in [66, 65] to infer the order of instances. RNN was utilized in [53, 50] to propose one instance in each time step.
  Multi-level Features. Features from different layers were used in image recognition. SharpMask [49], Peng et al. [47] and LRR [14] fused feature maps for segmentation with finer details. FCN [44], U-Net [54] and Noh et al. [46] fused information from lower layers through skip-connections. Both TDM [56] and FPN [35] augmented a top-down path with lateral connections for object detection. Different from TDM, which took the fused feature map with the highest resolution to pool features, SSD [42], DSSD [13], MS-CNN [5] and FPN [35] assigned proposals to appropriate feature levels for inference. We take FPN as a baseline and much enhance it.
  ION [4], Zagoruyko et al. [62], Hypernet [31] and Hypercolumn [19] concatenated feature grids from different layers for better prediction. A sequence of operations, i.e., normalization, concatenation and dimension reduction, is needed to get feasible new features. In comparison, our design is much simpler.
  Fusing feature grids from different sources for each proposal was also utilized in [52]. But this method extracted feature maps on input with different scales and then conducted feature fusion (with the max operation) to improve feature selection from the input image pyramid. In contrast, our method aims at utilizing information from all feature levels in the in-network feature hierarchy with single-scale input. End-to-end training is enabled.
  Larger Context Region. Methods of [15, 64, 62] pooled features for each proposal with a foveal structure to exploit context information from regions with different resolutions. Features pooled from a larger region provide surrounding context. Global pooling was used in PSPNet [67] and ParseNet [43] to greatly improve quality of semantic segmentation. Similar trend was observed by Peng et al. [47] where global convolutions were utilized. Our mask prediction branch also supports accessing global information. But the technique is completely different.

3. Our Framework

Our framework is illustrated in Figure 1. Path augmentation and aggregation are conducted for improving performance. A bottom-up path is augmented to make low-layer information easier to propagate. We design adaptive feature pooling to allow each proposal to access information from all levels for prediction. A complementary path is added to the mask-prediction branch. This new structure leads to decent performance. Similar to FPN, the improvement is independent of the CNN structures, such as those of [57, 32, 23].

3.1. Bottom-up Path Augmentation

Motivation. The insightful point [63] that neurons in high layers strongly respond to entire objects while others are more likely to be activated by local texture and patterns manifests the necessity of augmenting a top-down path to propagate semantically strong features and enhance all features with reasonable classification capability in FPN.
  Our framework further enhances the localization capability of the entire feature hierarchy by propagating strong responses of low-level patterns based on the fact that high response to edges or instance parts is a strong indicator to accurately localize instances. To this end, we build a path with clean lateral connections from the low level to top ones. This process yields a “shortcut” (dashed green line in Figure 1), which consists of less than 10 layers, across these levels. In comparison, the CNN trunk in FPN gives a long path (dashed red line in Figure 1) passing through 100+ layers from low layers to the topmost one.
  Augmented Bottom-up Structure. Our framework first accomplishes bottom-up path augmentation. We follow FPN to define that layers producing feature maps with the same spatial sizes are in the same network stage. Each feature level corresponds to one stage. We also take ResNet [23] as the basic structure and use $\{P_2, P_3, P_4, P_5\}$ to denote feature levels generated by FPN. Our augmented path starts from the lowest level $P_2$ and gradually approaches $P_5$ as shown in Figure 1(b). From $P_2$ to $P_5$, the spatial size is gradually down-sampled with factor 2. We use $\{N_2, N_3, N_4, N_5\}$ to denote newly generated feature maps corresponding to $\{P_2, P_3, P_4, P_5\}$. Note that $N_2$ is simply $P_2$, without any processing.
  As shown in Figure 2, each building block takes a higher resolution feature map $N_i$ and a coarser map $P_{i+1}$ through lateral connection and generates the new feature map $N_{i+1}$. Each feature map $N_i$ first goes through a 3 × 3 convolutional layer with stride 2 to reduce the spatial size. Then each element of feature map $P_{i+1}$ and the down-sampled map are added through lateral connection. The fused feature map is then processed by another 3 × 3 convolutional layer to generate $N_{i+1}$ for following sub-networks. This is an iterative process and terminates after approaching $P_5$. In these building blocks, we consistently use channel 256 of feature maps. All convolutional layers are followed by a ReLU [32]. The feature grid for each proposal is then pooled from new feature maps, i.e., $\{N_2, N_3, N_4, N_5\}$.
[Figure 2: Illustration of the building block of the bottom-up path augmentation.]
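  To make the structure concrete, here is a minimal PyTorch-style sketch of this building block, assuming 256-channel FPN outputs; class and function names are our own and the official implementation may differ:

```python
import torch.nn as nn
import torch.nn.functional as F

class BottomUpBlock(nn.Module):
    """One building block of the augmented path: N_{i+1} = conv(downsample(N_i) + P_{i+1})."""
    def __init__(self, channels=256):
        super().__init__()
        # 3x3 stride-2 conv down-samples N_i to the spatial size of P_{i+1}
        self.down = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        # 3x3 conv processes the fused map to produce N_{i+1}
        self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, n_i, p_next):
        x = F.relu(self.down(n_i))   # each conv layer is followed by a ReLU
        x = x + p_next               # element-wise addition through the lateral connection
        return F.relu(self.out(x))

def augment_bottom_up(p_feats, blocks):
    """p_feats = [P2, P3, P4, P5]; blocks holds three BottomUpBlock instances."""
    n_feats = [p_feats[0]]          # N2 is simply P2, without any processing
    for block, p_next in zip(blocks, p_feats[1:]):
        n_feats.append(block(n_feats[-1], p_next))
    return n_feats                  # [N2, N3, N4, N5]
```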

3.2. Adaptive Feature Pooling

Motivation. In FPN [35], proposals are assigned to different feature levels according to the size of proposals. It makes small proposals assigned to low feature levels and large proposals to higher ones. Albeit simple and effective, it could generate non-optimal results. For example, two proposals with 10-pixel difference can be assigned to different levels. In fact, these two proposals are rather similar.
  Further, importance of features may not be strongly correlated to the levels they belong to. High-level features are generated with large receptive fields and capture richer context information. Allowing small proposals to access these features better exploits useful context information for prediction. Similarly, low-level features are with many fine details and high localization accuracy. Making large proposals access them is obviously beneficial. With these thoughts, we propose pooling features from all levels for each proposal and fusing them for following prediction. We call this process adaptive feature pooling.
  We now analyze the ratio of features pooled from different levels with adaptive feature pooling. We use max operation to fuse features from different levels, which lets network select element-wise useful information. We cluster proposals into four classes based on the levels they were assigned to originally in FPN. For each set of proposals, we calculate the ratio of features selected from different levels. In notation, levels 1 − 4 represent low-to-high levels.
  As shown in Figure 3, the blue line represents small proposals that were assigned to level 1 originally in FPN. Surprisingly, nearly 70% of features are from other higher levels. We also use the yellow line to represent large proposals that were assigned to level 4 in FPN. Again, 50%+ of the features are pooled from other lower levels. This observation clearly indicates that features in multiple levels together are helpful for accurate prediction. It is also a strong support of designing bottom-up path augmentation.
[Figure 3: Ratio of features pooled from different feature levels with adaptive feature pooling.]
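  For illustration, the per-level selection ratio under element-wise max fusion could be measured roughly as follows; this is our own analysis sketch, not code from the paper:

```python
import torch

def level_selection_ratios(grids):
    """grids: pooled feature grids of one proposal, one tensor [C, H, W] per level.
    Returns, for each level, the fraction of elements that max fusion selects from it."""
    stacked = torch.stack(grids)       # [num_levels, C, H, W]
    winner = stacked.argmax(dim=0)     # index of the level winning each element
    return [(winner == l).float().mean().item() for l in range(len(grids))]
```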
  Adaptive Feature Pooling Structure. Adaptive feature pooling is actually simple in implementation and is demonstrated in Figure 1(c). First, for each proposal, we map them to different feature levels, as denoted by dark grey regions in Figure 1(b). Following Mask R-CNN [21], ROIAlign is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse feature grids from different levels.
  In following sub-networks, pooled feature grids go through one parameter layer independently, which is followed by the fusion operation, to enable network to adapt features. For example, there are two fc layers in the box branch in FPN. We apply the fusion operation after the first layer. Since four consecutive convolutional layers are used in mask prediction branch in Mask R-CNN, we place fusion operation between the first and second convolutional layers. Ablation study is given in Section 4.2. The fused feature grid is used for each proposal for further prediction, i.e., classification, box regression and mask prediction.
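  A hedged sketch of this pool-adapt-fuse pattern for the box branch, using torchvision's roi_align in place of ROIAlign; names, strides and shapes are illustrative assumptions rather than values from the released code:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class AdaptivePoolBoxHead(nn.Module):
    """Pool each proposal from every level, run each grid through fc1
    independently, fuse with an element-wise max, then continue with fc2."""
    def __init__(self, channels=256, grid=7, hidden=1024):
        super().__init__()
        self.grid = grid
        self.fc1 = nn.Linear(channels * grid * grid, hidden)  # parameter layer before fusion
        self.fc2 = nn.Linear(hidden, hidden)

    def forward(self, feats, rois, strides):
        # feats: [N2, N3, N4, N5]; rois: [K, 5] rows of (batch_idx, x1, y1, x2, y2)
        fused = None
        for f, s in zip(feats, strides):
            g = roi_align(f, rois, output_size=self.grid, spatial_scale=1.0 / s)
            h = torch.relu(self.fc1(g.flatten(1)))               # adapt features of this level
            fused = h if fused is None else torch.max(fused, h)  # element-wise max fusion
        return torch.relu(self.fc2(fused))  # fed to classification and box regression
```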
  Our design focuses on fusing information from in-network feature hierarchy instead of those from different feature maps of input image pyramid [52]. It is simpler compared with the process of [4, 62, 31], where L2 normalization, concatenation and dimension reduction are needed.

3.3. Fully-connected Fusion

Motivation. Fully-connected layers, or MLP, were widely used in mask prediction in instance segmentation [10, 41, 34] and mask proposal generation [48, 49]. Results of [8, 33] show that FCN is also competent in predicting pixelwise masks for instances. Recently, Mask R-CNN [21] applied a tiny FCN on the pooled feature grid to predict corresponding masks avoiding competition between classes.
  We note fc layers yield different properties compared with FCN where the latter gives prediction at each pixel based on a local receptive field and parameters are shared at different spatial locations. Contrarily, fc layers are location sensitive since predictions at different spatial locations are achieved by varying sets of parameters. So they have the ability to adapt to different spatial locations. Also prediction at each spatial location is made with global information of the entire proposal. It is helpful to differentiate instances [48] and recognize separate parts belonging to the same object. Given different properties of fc and convolutional layers, we fuse predictions from these two types of layers for better mask prediction.
  Mask Prediction Structure. Our component of mask prediction is light-weighted and easy to implement. The mask branch operates on pooled feature grid for each proposal. As shown in Figure 4, the main path is a small FCN, which consists of 4 consecutive convolutional layers and 1 deconvolutional layer. Each convolutional layer consists of 256 3 × 3 filters and the deconvolutional layer up-samples feature with factor 2. It predicts a binary pixel-wise mask for each class independently to decouple segmentation and classification, similar to that of Mask R-CNN. We further create a short path from layer conv3 to a fc layer. There are two 3 × 3 convolutional layers where the second shrinks channels to half to reduce computational overhead.
  A fc layer is used to predict a class-agnostic foreground/background mask. It not only is efficient, but also allows parameters in the fc layer trained with more samples, leading to better generality. The mask size we use is 28 × 28 so that the fc layer produces a 784 × 1 × 1 vector. This vector is reshaped to the same spatial size as the mask predicted by FCN. To obtain the final mask prediction, mask of each class from FCN and foreground/background prediction from fc are added. Using only one fc layer, instead of multiple of them, for final prediction prevents the issue of collapsing the hidden spatial feature map into a short feature vector, which loses spatial information.
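  The mask branch described above might be sketched as follows, a rough PyTorch rendering under the usual 14 × 14 pooled-grid assumption of Mask R-CNN; layer names are ours:

```python
import torch.nn as nn
import torch.nn.functional as F

class FusedMaskHead(nn.Module):
    """Main FCN path (4 convs + deconv) plus a short fc path branching
    from conv3 that predicts one class-agnostic fg/bg mask."""
    def __init__(self, channels=256, num_classes=80, mask_size=28):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)])
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)  # up-sample x2
        self.fcn_pred = nn.Conv2d(channels, num_classes, 1)                # per-class masks
        # fc path: two 3x3 convs (the second halves the channels), then one fc layer
        self.fc_conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.fc_conv2 = nn.Conv2d(channels, channels // 2, 3, padding=1)
        self.fc = nn.Linear((channels // 2) * 14 * 14, mask_size * mask_size)
        self.mask_size = mask_size

    def forward(self, x):                       # x: pooled grid [K, 256, 14, 14]
        for i, conv in enumerate(self.convs):
            x = F.relu(conv(x))
            if i == 2:                          # branch off after conv3
                y = F.relu(self.fc_conv1(x))
                y = F.relu(self.fc_conv2(y))
                fg_bg = self.fc(y.flatten(1))   # 784-d vector per proposal
                fg_bg = fg_bg.view(-1, 1, self.mask_size, self.mask_size)
        masks = self.fcn_pred(F.relu(self.deconv(x)))  # [K, num_classes, 28, 28]
        return masks + fg_bg                    # fuse the two views by addition
```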

4. Experiments

We compare our method with state-of-the-arts on challenging COCO [37], Cityscapes [7] and MVD [45] datasets. Our results are top ranked in all of them. Comprehensive ablation study is conducted on the COCO dataset. We also present our results of COCO 2017 Instance Segmentation and Object Detection Challenges.

4.1. Implementation Details

We re-implement Mask R-CNN and FPN based on Caffe [29]. All pre-trained models we use in experiments are publicly available. We adopt image centric training [16]. For each image, we sample 512 region-of-interests (ROIs) with positive-to-negative ratio 1:3. Weight decay is 0.0001 and momentum is set to 0.9. Other hyper-parameters slightly vary according to datasets and we detail them in respective experiments. Following Mask R-CNN, proposals are from an independently trained RPN [35, 51] for convenient ablation and fair comparison, i.e., the backbone is not shared with object detection and instance segmentation.

4.2. Experiments on COCO

Dataset and Metrics. COCO [37] dataset is among the most challenging ones for instance segmentation and object detection due to the data complexity. It consists of 115k images for training and 5k images for validation (new split of 2017). 20k images are used in test-dev and 20k images are used as test-challenge. Ground-truth labels of both test-challenge and test-dev are not publicly available. There are 80 classes with pixel-wise instance mask annotation. We train our models on train-2017 subset and report results on val-2017 subset for ablation study. We also report results on test-dev for comparison.
  We follow the standard evaluation metrics, i.e., AP, AP$_{50}$, AP$_{75}$, AP$_S$, AP$_M$ and AP$_L$. The last three measure performance with respect to objects with different scales. Since our framework is general to both instance segmentation and object detection, we also train independent object detectors. We report mask AP, box AP$^{bb}$ of an independently trained object detector, and box AP$^{bbM}$ of the object detection branch trained in the multi-task fashion.
  Hyper-parameters. We take 16 images in one image batch for training. The shorter and longer edges of the images are 800 and 1000, if not specially noted. For instance segmentation, we train our model with learning rate 0.02 for 120k iterations and 0.002 for another 40k iterations. For object detection, we train one object detector without the mask prediction branch. Object detector is trained for 60k iterations with learning rate 0.02 and another 20k iterations with learning rate 0.002. These parameters are adopted from Mask R-CNN and FPN without any fine-tuning.
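  Collected as a hedged summary, the schedule above might look like the following; the values come from the text, but the dictionary layout and key names are only illustrative:

```python
# COCO training hyper-parameters as described above (illustrative layout).
coco_hparams = {
    "images_per_batch": 16,
    "shorter_edge": 800, "longer_edge": 1000,
    "rois_per_image": 512, "pos_neg_ratio": (1, 3),
    "weight_decay": 1e-4, "momentum": 0.9,
    "instance_seg_schedule": [(0.02, 120_000), (0.002, 40_000)],  # (lr, iterations)
    "detection_schedule":    [(0.02, 60_000),  (0.002, 20_000)],
}
```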
  Instance Segmentation Results. We report performance of our PANet on test-dev for comparison, with and without multi-scale training. As shown in Table 1, our PANet with ResNet-50 trained on multi-scale images and tested on single-scale images already outperforms Mask R-CNN and the champion in 2016, where the latter used larger model ensembles and testing tricks of [23, 33, 10, 15, 39, 62]. Trained and tested with the same image scale 800, our method outperforms the single-model Mask R-CNN with nearly 3 points under the same initial models.
[Table 1: Instance segmentation results on COCO test-dev.]
  Object Detection Results. Similar to the way adopted in Mask R-CNN, we also report bounding box results inferred from the box branch. Table 2 shows that our method with ResNet-50, trained and tested on single-scale images, outperforms, by a large margin, all other single-model ones that even used much larger ResNeXt-101 [61] as the initial model. With multi-scale training and single-scale testing, our PANet with ResNet-50 outperforms the champion 2016, which used larger model ensemble and testing tricks.
[Table 2: Object detection results on COCO test-dev.]
  Component Ablation Studies. First, we analyze importance of each proposed component. Besides bottom-up path augmentation, adaptive feature pooling and fully-connected fusion, we also analyze multi-scale training, multi-GPU synchronized batch normalization [67, 28] and heavier head. For multi-scale training, we set the longer edge to 1,400 and let the other range from 400 to 1,400. We calculate mean and variance based on all samples in one batch across all GPUs, do not fix any parameters during training, and make all new layers followed by a batch normalization layer, when using multi-GPU synchronized batch normalization. The heavier head uses 4 consecutive 3 × 3 convolutional layers shared by box classification and box regression, instead of two fc layers. It is similar to the head used in [36]. But the convolutional layers for box classification and box regression branches are not shared in [36].
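  A hedged sketch of such a heavier head, with grid size and class count as assumptions of ours:

```python
import torch.nn as nn

class HeavierBoxHead(nn.Module):
    """Four shared 3x3 convs replace the usual two fc layers;
    classification and regression read the same shared features."""
    def __init__(self, channels=256, num_classes=81, grid=7):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        self.shared = nn.Sequential(*layers)
        self.cls = nn.Linear(channels * grid * grid, num_classes)      # box classification
        self.reg = nn.Linear(channels * grid * grid, 4 * num_classes)  # box regression

    def forward(self, x):            # x: pooled grid [K, 256, 7, 7]
        x = self.shared(x).flatten(1)
        return self.cls(x), self.reg(x)
```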
[Table 3: Component ablation study on COCO val-2017.]
  Our ablation study from the baseline gradually to all components incorporated is conducted on val-2017 subset and is shown in Table 3. ResNet-50 [23] is our initial model. We report performance in terms of mask AP, box AP$^{bb}$ of an independently trained object detector and box AP$^{bbM}$ of the box branch trained in the multi-task fashion.
  1) Re-implemented Baseline. Our re-implemented Mask R-CNN performs comparably with the one described in original paper and our object detector performs better.
  2) Multi-scale Training & Multi-GPU Sync. BN. These two techniques help the network to converge better and increase the generalization ability.
  3) Bottom-up Path Augmentation. With or without adaptive feature pooling, bottom-up path augmentation consistently improves mask AP and box AP$^{bb}$ by more than 0.6 and 0.9 respectively. The improvement on big instances is significant, manifesting the usefulness of information sent from lower feature levels.
  4) Adaptive Feature Pooling. With or without bottom-up path augmentation, adaptive feature pooling consistently improves performance in all scales, which is in accordance with our aforementioned observation that features in other layers are also useful in final prediction.
  5) Fully-connected Fusion. Fully-connected fusion predicts masks with better quality. It yields 0.7 improvement in terms of mask AP. It is general for instances at all scales.
  6) Heavier Head. Heavier head is quite effective for box AP$^{bbM}$ of bounding boxes trained in the multi-task fashion. While for mask AP and independently trained object detector, the improvement is smaller.
  With all these components in PANet, improvement on mask AP is 4.4 over baselines. Box AP$^{bb}$ of independently trained object detector increases 4.2. They are significant. Small- and medium-size instances contribute most. Half of the improvement is from multi-scale training and multi-GPU sync. BN. They are effective strategies.
  Ablation Studies on Adaptive Feature Pooling. Ablation studies on adaptive feature pooling are to verify fusion operation type and location. We place it either between ROIAlign and fc1, represented as “fu.fc1fc2” or between fc1 and fc2, represented as “fc1fu.fc2” in Table 4. These settings are also applied to the mask prediction branch. For feature fusing type, max and sum operations are tested.
[Table 4: Ablation study on adaptive feature pooling (fusion type and location).]
  As shown in Table 4, adaptive feature pooling is not sensitive to the fusion operation type. Allowing a parameter layer to adapt feature grids from different levels, however, is of greater importance. In our final system, we use max fusion operation behind the first parameter layer.
  Ablation Studies on Fully-connected Fusion. We investigate performance when instantiating the augmented fc branch differently. We consider two aspects, i.e., the layer to start the new branch and the way to fuse predictions from the new branch and FCN. We experiment with creating new paths from conv2, conv3 and conv4, respectively. “max”, “sum” and “product” operations are used for fusion. We take our re-implemented Mask R-CNN with bottom-up path augmentation and adaptive feature pooling as the baseline. Corresponding results are listed in Table 5. They clearly show that starting from conv3 and taking sum for fusion produce the best results.
[Table 5: Ablation study on fully-connected fusion (starting layer and fusion operation).]
  COCO 2017 Challenge. With PANet, we participated in the COCO 2017 Instance Segmentation and Object Detection Challenges. Our framework reaches the 1st place in Instance Segmentation task and the 2nd place in Object Detection task without large-batch training. As shown in Tables 6 and 7, compared with last year champion, we achieve 9.1% absolute and 24% relative improvement on instance segmentation. While for object detection, 9.4% absolute and 23% relative improvement is yielded.
[Table 6: Instance segmentation results in the COCO 2017 Challenge.]
[Table 7: Object detection results in the COCO 2017 Challenge.]
  There are a few more details for the top performance. First, we use deformable convolution where DCN [11] is adopted. The common testing tricks [23, 33, 10, 15, 39, 62], such as multi-scale testing, horizontal flip testing, mask voting and box voting, are used too. For multi-scale testing, we set the longer edge to 1,400 and let the other range from 600 to 1,200 with step 200. Only 4 scales are used. Second, we use larger initial models from publicly available ones. We use 3 ResNeXt-101 (64 × 4d) [61], 2 SE-ResNeXt-101 (32 × 4d) [25], 1 ResNet-269 [64] and 1 SENet [25] as ensemble for bounding box and mask generation. Performance with different larger initial models are similar. One ResNeXt-101 (64 × 4d) is used as the base model to generate proposals. We train these models with different random seeds, with and without balanced sampling [55] to enhance diversity between models. Detection results are acquired by tightening instance masks. We show a few visual results in Figure 5 – most of our predictions are with high quality.
[Figure 5: Visual results of PANet on COCO.]

4.3. Experiments on Cityscapes (please refer to the original paper for this section)

4.4. Experiments on MVD (please refer to the original paper for this section)

5. Conclusion

We have presented our PANet for instance segmentation. We designed several simple and yet effective components to enhance information propagation in representative pipelines. We pool features from all feature levels and shorten the distance among lower and topmost feature levels for reliable information passing. Complementary path is augmented to enrich feature for each proposal. Impressive results are produced. Our future work will be to extend our method to videos and RGBD data.
