Paper Reading Notes (20): Mask R-CNN

__Sunshine__

Article link: https://blog.csdn.net/sunshine_010/article/details/79952020

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.

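We propose a conceptually simple, flexible, and general framework for instance segmentation. Our method efficiently detects the objects in an image while generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a mask-prediction branch that runs in parallel with the existing bounding-box branch. Mask R-CNN is simple to train, adds only a small overhead to Faster R-CNN, and runs at 5 fps. It also generalizes easily to other tasks; for example, it lets us estimate human poses within the same framework. We achieve top results in all three tracks of the COCO challenges: instance segmentation, bounding-box object detection, and person keypoint detection. Without extra tricks, Mask R-CNN outperforms all existing single-model entries, including the COCO 2016 challenge winners. We hope this simple and effective approach will serve as a solid baseline for future research in instance-level recognition. Code will be made available later.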

The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN [12, 36] and Fully Convolutional Network (FCN) [30] frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation.

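Object detection and semantic segmentation results have improved greatly in a short time. These advances were driven largely by strong baseline systems, namely Fast/Faster R-CNN for object detection and the Fully Convolutional Network (FCN) for semantic segmentation. These methods are conceptually intuitive, flexible, and robust, and they train and run quickly. Our goal in this work is to develop a comparably enabling framework for instance segmentation.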

Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. (Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection.) Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results.

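Instance segmentation is challenging because it requires correctly detecting every object in an image while also segmenting each instance precisely. It therefore combines elements of two classical computer vision tasks: object detection, which classifies individual objects and localizes each with a bounding box, and semantic segmentation, which classifies each pixel into a fixed set of categories without distinguishing instances. (By common usage, object detection means detection with bounding boxes rather than masks, and semantic segmentation means per-pixel classification that does not distinguish instances. Instance segmentation, however, is both semantic and a form of detection.) Given this, one might expect that a complex method is needed for good results. Our work shows, however, that a very simple, flexible, and fast system can surpass the previous state of the art in instance segmentation.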

Our method, called Mask R-CNN, extends Faster R-CNN [36] by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.

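Our method, which we call Mask R-CNN, extends Faster R-CNN by adding a branch that predicts a segmentation mask on each Region of Interest (RoI). This branch runs in parallel with the existing branch for classification and bounding-box regression (Figure 1: the Mask R-CNN framework for instance segmentation). The mask branch is a small FCN applied to each RoI that predicts a segmentation mask pixel to pixel. Mask R-CNN is easy to implement and train because it builds on the flexible Faster R-CNN framework. The mask branch also adds only a small computational overhead.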

In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [18, 12], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network’s RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation.

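In principle, Mask R-CNN is a direct extension of Faster R-CNN, but building the mask branch correctly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in RoIPool, the de facto core operation for attending to instances, which performs coarse spatial quantization when extracting features. To fix the misalignment, we propose a simple, quantization-free layer called RoIAlign that preserves exact spatial locations. Although the change looks minor, RoIAlign has a large effect: it improves mask accuracy by a relative 10% to 50%, with larger gains under stricter localization metrics. Second, we found it essential to decouple mask prediction from class prediction: we predict a binary mask for each class independently, with no competition among classes, and rely on the network's RoI classification branch to predict the category. By contrast, FCNs usually perform per-pixel multi-class classification, which couples segmentation with classification and, in our experiments, works poorly for instance segmentation.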

Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [28], including the heavily-engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.

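Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task, including the entries of the COCO 2016 challenge winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments we evaluate multiple basic instantiations, which lets us demonstrate its robustness and analyze the effects of core factors.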

Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework’s flexibility and accuracy, will benefit and ease future research on instance segmentation.

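Our models run at about 200 ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast training and test speeds, together with the framework's flexibility and accuracy, will benefit future research on instance segmentation.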

Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset [28]. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.

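Finally, we demonstrate the generality of our framework with human pose estimation on the COCO keypoint dataset. By treating each keypoint as a one-hot binary mask, Mask R-CNN can be applied to person keypoint detection with minimal modification. Without extra tricks, Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition while running at 5 fps. Mask R-CNN can therefore be seen more broadly as a flexible framework for instance-level recognition, one that extends readily to more complex tasks.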

R-CNN: The Region-based CNN (R-CNN) approach [13] to bounding-box object detection is to attend to a manageable number of candidate object regions [42, 20] and evaluate convolutional networks [25, 24] independently on each RoI. R-CNN was extended [18, 12] to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN [36] advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [38, 27, 21]), and is the current leading framework in several benchmarks.

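R-CNN: The R-CNN approach to object detection attends to a manageable number of candidate regions and evaluates a convolutional network independently on each RoI. R-CNN was later extended to attend to RoIs on feature maps using RoIPool, which gave faster speed and better accuracy. Faster R-CNN advanced this line of work by learning the attention mechanism with a Region Proposal Network (RPN) that generates the candidate boxes. Faster R-CNN is flexible and robust to many follow-up improvements, and it is currently the leading framework on several benchmarks.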

Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [13, 15, 16, 9] resorted to bottom-up segments [42, 2]. DeepMask [33] and following works [34, 8] learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [10] proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.

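Instance segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods relied on bottom-up segments. DeepMask and subsequent works learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods segmentation precedes recognition, which is both slow and less accurate. Similarly, Dai et al. proposed a complex multi-stage cascade that predicts segment proposals from bounding-box proposals and then classifies them. Our method instead predicts masks and class labels in parallel, which is simpler and more flexible.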

Most recently, Li et al. [26] combined the segment proposal system in [8] and object detection system in [11] for “fully convolutional instance segmentation” (FCIS). The common idea in [8, 11, 26] is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), showing that it is challenged by the fundamental difficulties of segmenting instances.

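Most recently, Li et al. combined the segment proposal system of [8] with the object detection system of [11] into "fully convolutional instance segmentation" (FCIS). The shared idea in [8, 11, 26] is to predict a set of position-sensitive output channels fully convolutionally. These channels handle object classes, boxes, and masks at the same time, which makes the system fast. FCIS, however, exhibits systematic errors on overlapping instances and produces spurious edges, which shows that it struggles with the fundamental difficulties of segmenting instances.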

Another family of solutions [23, 4, 3, 29] to instance segmentation are driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect a deeper incorporation of both strategies will be studied in the future.

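Another family of solutions to instance segmentation is driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods try to cut the pixels of one category into separate instances. In contrast to this segmentation-first strategy, Mask R-CNN follows an instance-first strategy. We expect that a deeper combination of the two strategies will be studied in the future.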

Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.

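Mask R-CNN is conceptually simple: Faster R-CNN outputs a class label and a bounding-box offset for each candidate object, and to these we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. The additional mask output, however, differs from the class and box outputs because it requires extracting a much finer spatial layout of the object. Next we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main piece missing from Fast/Faster R-CNN.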

Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector [36]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [12], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.

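Faster R-CNN: We first briefly review the Faster R-CNN detector. Faster R-CNN consists of two stages. The first stage, called the Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN, extracts features from each candidate box using RoIPool and performs classification and bounding-box regression. The features used by the two stages can be shared for faster inference.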

Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions (e.g. [33, 10, 26]). Our approach follows the spirit of Fast R-CNN [12] that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of original R-CNN [13]).

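Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (the RPN). In the second stage, in parallel with predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This contrasts with most recent systems, in which classification depends on the mask predictions. Our approach follows the spirit of Fast R-CNN, which applies bounding-box classification and regression in parallel (this turned out to greatly simplify the multi-stage pipeline of the original R-CNN).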

Formally, during training, we define a multi-task loss on each sampled RoI as L = L_cls + L_box + L_mask. The classification loss L_cls and bounding-box loss L_box are identical to those defined in [12]. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes. To this we apply a per-pixel sigmoid, and define L_mask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_mask is only defined on the k-th mask (other mask outputs do not contribute to the loss).

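During training, we define the multi-task loss on each sampled RoI as L = L_cls + L_box + L_mask. The classification loss L_cls and the bounding-box loss L_box are the same as those defined in [12]. The mask branch produces a $Km^2$-dimensional output for each RoI, that is, K binary masks of resolution m × m, one per class, where K is the number of classes. We apply a sigmoid to every pixel and define L_mask as the average binary cross-entropy loss. For an RoI whose ground-truth class is k, L_mask is computed only on the k-th mask (the other mask outputs do not contribute to the loss).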
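The definition of L_mask above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `mask_loss`, the nested-list layout of the logits, and the numerically stable rewriting of sigmoid plus binary cross-entropy are all assumptions made for this example.

```python
import math

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average binary cross-entropy over the gt_class-th mask only.

    mask_logits: K lists of m rows of m floats (raw mask-branch outputs, hypothetical layout).
    gt_mask:     m rows of m values in {0, 1} for this RoI.
    gt_class:    index k of the ground-truth class.
    """
    logits_k = mask_logits[gt_class]  # the other K-1 masks are never read
    total, count = 0.0, 0
    for row_logits, row_gt in zip(logits_k, gt_mask):
        for z, y in zip(row_logits, row_gt):
            # Equals -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
            # written in a form that does not overflow for large |z|.
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

# Two classes, 2x2 masks; the ground-truth class is 0.
logits = [[[2.0, -1.0], [0.5, 3.0]],
          [[0.0, 0.0], [0.0, 0.0]]]
target = [[1, 0], [0, 1]]
loss = mask_loss(logits, target, gt_class=0)

# Changing the logits of class 1 leaves the loss unchanged.
logits[1] = [[5.0, -5.0], [5.0, -5.0]]
assert loss == mask_loss(logits, target, gt_class=0)
```

Because only the k-th slice is read, the remaining outputs receive no gradient from this RoI, which is the "other mask outputs do not contribute" condition in the text.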

Our definition of L_mask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [30] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.

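Our definition of L_mask lets the network predict a binary mask for each class independently, so there is no competition among classes. We rely on the dedicated classification branch to predict the class label that selects the output mask. This decouples mask prediction from class prediction. It differs from the usual practice of applying FCNs to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case the masks of different classes compete; in ours, with a per-pixel sigmoid and a binary loss, they do not. Our experiments show that this formulation is key to good instance segmentation results.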
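Written out, the contrast between the two losses looks as follows. This is a sketch in notation that the notes do not define: $z^{c}_{ij}$ is assumed to be the mask-branch logit for class $c$ at pixel $(i, j)$, $y_{ij} \in \{0, 1\}$ the ground-truth mask of the RoI, $k$ its ground-truth class, $\sigma$ the sigmoid, and $c_{ij}$ the per-pixel class label in the softmax case.

$$
L_{mask} = -\frac{1}{m^2} \sum_{i,j} \left[ y_{ij} \log \sigma\left(z^{k}_{ij}\right) + \left(1 - y_{ij}\right) \log\left(1 - \sigma\left(z^{k}_{ij}\right)\right) \right]
$$

$$
L_{softmax} = -\frac{1}{m^2} \sum_{i,j} \log \frac{\exp\left(z^{c_{ij}}_{ij}\right)}{\sum_{c'=1}^{K} \exp\left(z^{c'}_{ij}\right)}
$$

Only the $k$-th mask appears in the first expression, so the other $K - 1$ outputs are untouched by this RoI. In the second, every class logit enters the denominator, which is what makes the classes compete at each pixel.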

Mask Representation: A mask encodes an input object’s spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.

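Mask representation: A mask encodes the spatial layout of an input object. Class labels and box offsets are inevitably collapsed into short output vectors by fully-connected (fc) layers, whereas the spatial structure of a mask can be extracted naturally through the pixel-to-pixel correspondence that convolutions provide.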

Specifically, we predict an m × m mask from each RoI using an FCN [30]. This allows each layer in the mask branch to maintain the explicit m × m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction [33, 34, 10], our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments.

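Specifically, we use an FCN to predict an m × m mask from each RoI. This lets every layer in the mask branch keep the explicit m × m spatial layout of the object instead of collapsing it into a vector representation without spatial dimensions. Unlike earlier methods that use fc layers for mask prediction, our fully convolutional representation needs fewer parameters and, as the experiments show, is more accurate.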

This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.

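This pixel-to-pixel behavior requires the RoI features, which are themselves small feature maps, to be well aligned so that the explicit per-pixel spatial correspondence is preserved faithfully. This motivated us to develop the RoIAlign layer described next, which plays a key role in mask prediction.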

RoIAlign: RoIPool [12] is a standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate x by computing [x/16], where 16 is a feature map stride and [·] is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.

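RoIAlign: RoIPool [12] is the standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-point RoI to the discrete granularity of the feature map, then subdivides the quantized RoI into spatial bins that are themselves quantized, and finally aggregates the feature values covered by each bin (usually by max pooling). For example, quantization is applied to a continuous coordinate x by computing [x/16], where 16 is the feature map stride and [·] denotes rounding. The same kind of quantization is performed when the RoI is divided into bins (e.g., 7×7). These quantizations misalign the RoI and the extracted features. This may not affect classification, which is robust to small translations, but it has a large negative effect on predicting pixel-accurate masks.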
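To make the size of this misalignment concrete, the sketch below compares x/16 with [x/16] for a few coordinates and reports the shift in image pixels. The helper names are hypothetical, and rounding half up is an assumption chosen for clarity (Python's built-in `round` rounds halves to the nearest even integer).

```python
import math

STRIDE = 16  # feature-map stride from the example in the text

def roipool_coord(x, stride=STRIDE):
    """Quantized feature-map coordinate [x / stride], rounding half up."""
    return math.floor(x / stride + 0.5)

def quantization_error_px(x, stride=STRIDE):
    """Gap, in image pixels, between the true position x / stride on the
    feature map and the position that rounding assigns to it."""
    return abs(roipool_coord(x, stride) - x / stride) * stride

for x in (100.0, 200.0, 295.0):
    print(f"x={x}: continuous={x / STRIDE}, quantized={roipool_coord(x)}, "
          f"shift={quantization_error_px(x)} px")
```

With a stride of 16 a single rounding step can move a boundary by up to 8 image pixels, and RoIPool rounds both the RoI boundaries and the bin boundaries.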

To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average), see Figure 3 for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed.

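To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool and aligns the extracted features properly with the input. The change is simple: we avoid any rounding of the RoI boundaries or bins (that is, we use x/16 instead of [x/16]). We take four regularly spaced sampling locations in each bin, compute the exact feature value at each location by bilinear interpolation, and aggregate the results (using max or average pooling). (We sample four regular locations so that either max or average pooling can be evaluated. In fact, interpolating a single value at the center of each bin, without pooling, is nearly as effective. One could also sample more than four locations per bin, which we found to give diminishing returns.)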
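The procedure described above can be sketched for a single-channel feature map in plain Python. This is an illustrative sketch rather than the reference implementation: the function names, the nested-list feature map, the clamping at the borders, and the convention that grid point i sits at continuous coordinate i are assumptions, and real implementations differ in such details.

```python
def bilinear(feature, y, x):
    """Interpolate a 2-D feature map (list of rows) at a continuous (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp so all four neighbours exist
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def mean(values):
    return sum(values) / len(values)

def roi_align(feature, roi, out_size=2, samples=2, stride=16, aggregate=mean):
    """roi = (x1, y1, x2, y2) in image pixels; no coordinate is ever rounded.
    Each output bin aggregates samples x samples interpolated values."""
    x1, y1, x2, y2 = (v / stride for v in roi)  # x / 16, not [x / 16]
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    output = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside the bin.
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feature, y, x))
            row.append(aggregate(values))
        output.append(row)
    return output

# A ramp whose value equals the column index makes the alignment visible.
ramp = [[float(c) for c in range(8)] for _ in range(8)]
aligned = roi_align(ramp, (16.0, 16.0, 80.0, 80.0))
shifted = roi_align(ramp, (20.0, 16.0, 84.0, 80.0))  # moved right by 4 px
```

On the ramp, moving the RoI by 4 image pixels (a quarter of a feature cell) moves every pooled value by a quarter as well, a sub-cell shift that a rounded RoI boundary would discard. Passing `aggregate=max` switches to max pooling.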

RoIAlign leads to large improvements as we show in §4.2. We also compare to the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] as quantizing RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by [22], it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.

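As we show in the ablation experiments, RoIAlign brings large improvements. We also compare with the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] with the same RoI quantization as RoIPool. So although RoIWarp also adopts the bilinear resampling motivated by [22], the experiments show that it performs on par with RoIPool, which demonstrates the crucial role of alignment.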

Network Architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI.

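Network architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with several architectures. For clarity, we distinguish between (i) the convolutional backbone used for feature extraction over the entire image, and (ii) the network head used for bounding-box recognition (classification and regression) and mask prediction, which is applied separately to each RoI.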

We denote the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet [19] and ResNeXt [45] networks of depth 50 or 101 layers. The original implementation of Faster R-CNN with ResNets [19] extracted features from the final convolutional layer of the 4-th stage, which we call C4. This backbone with ResNet-50, for example, is denoted by ResNet-50-C4. This is a common choice used in [19, 10, 21, 39].

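We name the backbone using the pattern network-depth-features. We evaluate ResNet and ResNeXt networks with 50 or 101 layers. The original Faster R-CNN implementation with ResNets extracted features from the final convolutional layer of the fourth stage, which we call C4. A backbone using ResNet-50, for example, is denoted ResNet-50-C4. This is a common choice in [19, 10, 21, 39].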

We also explore another more effective backbone recently proposed by Lin et al. [27], called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed.

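We also explore another, more effective backbone recently proposed by Lin et al., called the Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the pyramid according to the scale of each RoI; otherwise the approach is similar to the plain ResNet version. Using a ResNet-FPN backbone for feature extraction in Mask R-CNN gives excellent gains in both accuracy and speed.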
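The notes do not say how an RoI is mapped to a pyramid level. As a hedged illustration, the sketch below uses the assignment rule from the FPN paper [27], k = floor(k0 + log2(sqrt(w·h) / 224)) with k0 = 4. The function name `fpn_level` and the clamping to levels 2 through 5 are assumptions made for this example, not something stated in this text.

```python
import math

def fpn_level(width, height, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick the pyramid level for an RoI of the given size in image pixels.

    An RoI of canonical size (224 x 224) maps to level k0; each halving of
    the scale moves one level towards the finer, higher-resolution maps.
    """
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical))
    return max(k_min, min(k_max, k))

for size in (32, 112, 224, 448, 1000):
    print(size, fpn_level(size, size))
```

Small RoIs are therefore pooled from fine feature maps and large RoIs from coarse ones, which is one reading of "according to their scale".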

For the network head we closely follow architectures presented in previous work to which we add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet [19] and FPN [27] papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5-th stage of ResNet (namely, the 9-layer ‘res5’ [19]), which is compute-intensive. For FPN, the backbone already includes res5 and thus allows for a more efficient head that uses fewer filters.

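For the network head we closely follow the architectures of previous work and add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet and FPN papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the fifth stage of ResNet (the 9-layer "res5"), which is compute-intensive. With FPN the backbone already contains res5, so the head can use fewer filters and be more efficient.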

We note that our mask branches have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work.

We note that our mask branches have a very simple structure. More complex designs might improve performance, but they are not the focus of this work.

Figure 4. Head Architecture: We extend two existing Faster R-CNN heads [19, 27]. Left/Right panels show the heads for the ResNet C4 and FPN backbones, from [19] and [27], respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial dimension while deconv increases it). All convs are 3×3, except the output conv which is 1×1, deconvs are 2×2 with stride 2, and we use ReLU [31] in hidden layers. Left: ‘res5’ denotes ResNet’s fifth stage, which for simplicity we altered so that the first conv operates on a 7×7 RoI with stride 1 (instead of 14×14 / stride 2 as in [19]). Right: ‘×4’ denotes a stack of four consecutive convs.

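Figure 4. Head architecture: We extend two existing Faster R-CNN heads, adding a mask branch to each. Numbers denote spatial resolution and channels. Arrows denote conv, deconv, or fc layers, as can be inferred from context (conv preserves the spatial dimension, while deconv increases it). All convs are 3×3 except the output conv, which is 1×1. Deconvs are 2×2 with stride 2, and we use ReLU [31] in hidden layers. Left: "res5" denotes the fifth stage of ResNet, which for simplicity we altered so that the first conv operates on a 7×7 RoI with stride 1 (instead of 14×14 with stride 2 as in [19]). Right: "×4" denotes a stack of four consecutive convs.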
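The layer sizes in the caption can be checked with the usual output-size formulas. The sketch below traces the spatial resolution through a mask head built from four 3×3 convs, one 2×2 deconv with stride 2, and a 1×1 output conv. The 14×14 input resolution and the padding of 1 on the 3×3 convs are assumptions (the caption only says that conv preserves the spatial dimension), and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride):
    """Spatial output size of a transposed convolution without padding."""
    return (size - 1) * stride + kernel

def mask_head_resolutions(roi_size=14):
    """Resolution after each layer of the assumed FPN-style mask head."""
    sizes = [roi_size]
    for _ in range(4):                                    # the 'x4' conv stack
        sizes.append(conv_out(sizes[-1], kernel=3, padding=1))
    sizes.append(deconv_out(sizes[-1], kernel=2, stride=2))
    sizes.append(conv_out(sizes[-1], kernel=1))           # 1x1 output conv
    return sizes

print(mask_head_resolutions())
```

Under these assumptions the convs keep the resolution fixed and the single deconv doubles it, so the predicted mask is twice the resolution of the pooled RoI feature.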

Figure 3. RoIAlign: The dashed grid represents a feature map, the solid lines an RoI (with 2×2 bins in this example), and the dots the 4 sampling points in each bin. RoIAlign computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map. No quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points.

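Figure 3. RoIAlign: The dashed grid represents a feature map, the solid lines an RoI (with 2×2 bins in this example), and the dots the 4 sampling points in each bin. RoIAlign computes the value of each sampling point by bilinear interpolation from the neighboring grid points of the feature map. No rounding is applied to any coordinate of the RoI, its bins, or the sampling points.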

Our framework can easily be extended to human pose estimation. We model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow). This task helps demonstrate the flexibility of Mask R-CNN.

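Our framework extends easily to human pose estimation. We model the location of a keypoint as a one-hot mask and use Mask R-CNN to predict K masks, one for each of the K keypoint types (e.g., left shoulder, right elbow). This task helps demonstrate the flexibility of Mask R-CNN.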
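A one-hot keypoint target can be built as in the sketch below. It is only an illustration of the idea: the function names, the choice to floor the keypoint into an m × m grid over the RoI, and returning None for keypoints outside the RoI are assumptions, since the notes do not describe how the target is constructed.

```python
def keypoint_cell(kp_x, kp_y, roi, m):
    """Map an image-space keypoint to (row, col) in an m x m grid over the RoI.

    roi = (x1, y1, x2, y2). Returns None when the keypoint lies outside it.
    """
    x1, y1, x2, y2 = roi
    if not (x1 <= kp_x < x2 and y1 <= kp_y < y2):
        return None
    col = int((kp_x - x1) / (x2 - x1) * m)
    row = int((kp_y - y1) / (y2 - y1) * m)
    return row, col

def one_hot_mask(row, col, m):
    """m x m mask with exactly one foreground pixel."""
    mask = [[0] * m for _ in range(m)]
    mask[row][col] = 1
    return mask

# One target per keypoint type, e.g. a hypothetical left shoulder at (45, 10).
cell = keypoint_cell(45.0, 10.0, roi=(0.0, 0.0, 80.0, 80.0), m=8)
target = one_hot_mask(*cell, m=8)
```

Stacking K such masks, one per keypoint type, gives the K-mask output that the paragraph describes.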

We note that minimal domain knowledge for human pose is exploited by our system, as the experiments are mainly to demonstrate the generality of the Mask R-CNN framework. We expect that domain knowledge (e.g., modeling structures [6]) will be complementary to our simple approach.

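We note that our system exploits minimal domain knowledge about human pose, because the experiments are mainly meant to demonstrate the generality of the Mask R-CNN framework. We expect that domain knowledge (e.g., modeling structures [6]) will complement our simple approach, but that is beyond the scope of this paper.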