Mask R-CNN

Abstract

He et al. propose a general framework for object instance segmentation, called Mask R-CNN. It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition, and runs at 5 fps. Mask R-CNN is easy to generalize to other tasks, such as bounding-box object detection and person keypoint detection.

Introduction

What is instance segmentation?

Differences among related tasks:

  • (a) “Image classification” only needs to assign a categorical class label to the image;
  • (b) “Object detection” not only predicts categorical labels but also localizes each object instance via a bounding box;
  • (c) “Semantic segmentation” predicts a categorical label for each pixel, without differentiating object instances; labels are class-aware;
  • (d) “Instance segmentation”, a special setting of object detection, differentiates object instances via pixel-level segmentation masks; labels are object-aware.

Proposed Method

[Figure 1: The Mask R-CNN framework for instance segmentation]
Mask R-CNN extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small Fully Convolutional Network (FCN) applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.

This method decouples the mask and class prediction: they predict a binary mask for each class independently, without competition among classes, and rely on the network’s RoI classification branch to predict the category.
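
As a toy illustration of this decoupling (shapes and names are illustrative, not from the official code): a per-pixel softmax would force the K classes to compete at every pixel, while Mask R-CNN's per-class sigmoid keeps each mask independent.

```python
import torch

# One RoI, K = 80 classes, 28x28 mask logits (illustrative shapes).
logits = torch.randn(80, 28, 28)

# Multinomial alternative: softmax across classes makes masks compete
# per pixel (the per-pixel probabilities sum to 1 across the K classes).
softmax_masks = logits.softmax(dim=0)
assert torch.allclose(softmax_masks.sum(dim=0), torch.ones(28, 28))

# Mask R-CNN style: independent per-class sigmoids, no inter-class
# competition; the RoI classification branch decides the category.
sigmoid_masks = logits.sigmoid()
```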

RoIAlign

In principle Mask R-CNN is an intuitive extension of Faster R-CNN, but Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, He et al. propose a quantization-free layer, called RoIAlign, which faithfully preserves exact spatial locations.

Performance

“Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task.”

Generalization

"By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint compe- tition, and at the same time runs at 5 fps. "

Related Work

Feature Map

In each convolutional layer, the data exist in three dimensions. You can think of them as a stack of two-dimensional images, each of which is called a feature map. At the input layer, a grayscale image has a single feature map, while a color image typically has three (red, green, and blue). Between layers there are a number of convolution kernels; convolving the previous layer's feature maps with each kernel produces one feature map of the next layer.
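
As a small runnable example (shapes chosen arbitrarily), a color image enters as 3 feature maps, and a 16-kernel convolutional layer emits 16 feature maps:

```python
import torch
from torch import nn

image = torch.randn(1, 3, 32, 32)   # (batch, feature maps, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
out = conv(image)                   # 16 kernels -> 16 output feature maps
print(out.shape)                    # torch.Size([1, 16, 32, 32])
```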

R-CNN (Region-based CNN)

Selective Search + CNN + SVM

  1. Use selective search to extract roughly 1000-2000 region proposals from the original image.
  2. Crop/warp the image patch inside each proposal to the same fixed size and feed each patch separately into a CNN for feature extraction (a sketch follows this list).
  3. For the features extracted from each proposal, use an SVM classifier to decide whether the proposal belongs to a particular class.
  4. For proposals assigned to a class, use bounding-box regression to further refine their positions.
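
A minimal sketch of step 2, the per-proposal crop/warp and CNN forward pass. The boxes are made up, and a small ResNet stands in for the AlexNet-style network the original paper used:

```python
import torch
import torch.nn.functional as F
import torchvision

image = torch.randn(3, 480, 640)                        # toy input image
proposals = [(50, 60, 200, 220), (300, 100, 460, 300)]  # (x1, y1, x2, y2)

backbone = torchvision.models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()    # keep pooled features, drop classifier
backbone.eval()

feats = []
with torch.no_grad():
    for x1, y1, x2, y2 in proposals:
        crop = image[:, y1:y2, x1:x2]                      # crop the proposal
        warp = F.interpolate(crop[None], size=(224, 224),  # warp to fixed size
                             mode="bilinear", align_corners=False)
        feats.append(backbone(warp))   # one full CNN pass per proposal (slow!)
feats = torch.cat(feats)  # these features then go to per-class SVMs + box reg.
```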

R-CNN 论文解读及个人理解

SPP-Net (Spatial Pyramid Pooling)

CNN + Selective Search + SPP + SVM

  1. Use selective search to generate candidate windows from the image to be detected. This step is the same as in R-CNN.
  2. Feature extraction. This step differs substantially from R-CNN: the entire image is fed through the CNN once, yielding the image's feature map.
  3. The position of each candidate window in the original image is passed to the spatial pyramid pooling (SPP) layer, which extracts a fixed-size feature for that window from the image's feature map (a sketch of the layer follows this list). In R-CNN every proposal is fed through the CNN separately, whereas SPP-Net extracts features from the whole image only once, so it is much faster.
  4. For the features extracted from each proposal, use an SVM classifier to decide whether the proposal belongs to a particular class.
  5. For proposals assigned to a class, use bounding-box regression to further refine their positions.
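
A minimal sketch of the SPP layer itself, using pyramid levels 4×4, 2×2, and 1×1 (one of the configurations in the paper): adaptive max pooling at each level produces a fixed-length vector for a window of any size.

```python
import torch
import torch.nn.functional as F

def spp(window, levels=(4, 2, 1)):
    """Max-pool a (C, H, W) window into an n x n grid per pyramid level and
    concatenate: the output length is fixed regardless of H and W."""
    pooled = [F.adaptive_max_pool2d(window[None], n).flatten(1) for n in levels]
    return torch.cat(pooled, dim=1)          # (1, C * (16 + 4 + 1))

# Two windows of different sizes yield features of identical length.
a = spp(torch.randn(256, 13, 9))
b = spp(torch.randn(256, 30, 21))
assert a.shape == b.shape == (1, 256 * 21)
```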

SPP-Net论文详解

Fast R-CNN

CNN + Selective Search + ROI_Pooling + Neural Network

  1. Use selective search to generate candidate windows from the image to be detected.
  2. Feed the entire image through the CNN to obtain the image's feature map.
  3. The position of each candidate window in the original image is passed to the RoI pooling layer, which extracts a fixed-size feature map for that window from the image's feature map. RoI pooling adapts its pooling window and stride to the input size so that the output feature dimensionality stays constant. (RoI pooling is a simplified, single-level SPP; in essence it outputs features of the same length for input windows of different sizes. A sketch follows this list.)
  4. The feature of each proposal is fed into two parallel fully connected branches, one for object classification and one for position regression.
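
A minimal RoIPool sketch (simplified: adaptive max pooling stands in for the exact bin arithmetic, and a feature stride of 16 is assumed). Note the rounding to integer coordinates, which is precisely the quantization RoIAlign later removes:

```python
import torch
import torch.nn.functional as F

def roi_pool(feature_map, box, output_size=7, spatial_scale=1 / 16):
    # Quantize the float box to integer feature-map coordinates (lossy!).
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in box]
    crop = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    # Max-pool the crop into a fixed output_size x output_size grid.
    return F.adaptive_max_pool2d(crop[None], output_size)[0]

fm = torch.randn(256, 38, 50)      # stride-16 feature map of a 608x800 image
feat = roi_pool(fm, box=(130.4, 82.7, 310.9, 251.2))
assert feat.shape == (256, 7, 7)   # fixed size, whatever the box size
```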

Fast RCNN算法详解

Faster R-CNN

CNN + RPN (Region Proposal Network) + ROI_Pooling + Neural Network

  1. Feed the entire image through the CNN to obtain the feature map.
  2. The convolutional features are fed into the RPN, which outputs the positions of candidate boxes (a minimal RPN head is sketched after this list).
  3. The position of each candidate box in the original image is passed to the RoI pooling layer, which extracts a fixed-size feature map for that box from the image's feature map.
  4. The feature of each proposal is fed into two parallel fully connected branches, one for object classification and one for position regression.
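
A minimal sketch of an RPN head, assuming k anchors per feature-map location: a shared 3×3 conv followed by two sibling 1×1 convs for objectness and box deltas (the paper uses 2k softmax scores; the single-logit-per-anchor variant here is a common simplification).

```python
import torch
from torch import nn

class RPNHead(nn.Module):
    def __init__(self, channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # shared 3x3
        self.obj = nn.Conv2d(channels, k, 1)      # objectness score per anchor
        self.reg = nn.Conv2d(channels, 4 * k, 1)  # (dx, dy, dw, dh) per anchor

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.obj(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 256, 38, 50))
# scores: (1, 9, 38, 50); deltas: (1, 36, 38, 50). Applying the deltas to the
# anchors and keeping the top-scoring boxes after NMS yields the proposals.
```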

Faster R-CNN文章详细解读
Faster RCNN算法详解

FPN (Feature Pyramid Networks)

"The proposed FPN (Feature Pyramid Network) simultaneously exploits the high resolution of low-level features and the rich semantics of high-level features, and makes its predictions by fusing features from these different levels. Moreover, prediction is performed independently on each fused feature level, which differs from the usual feature-fusion approaches."

FPN详解
FPN(feature pyramid networks)算法讲解
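
A minimal sketch of FPN's top-down fusion with lateral connections, assuming ResNet stage outputs C2-C5 with channels (256, 512, 1024, 2048); the extra coarsest level and other details are omitted.

```python
import torch
import torch.nn.functional as F
from torch import nn

laterals = nn.ModuleList(nn.Conv2d(c, 256, 1) for c in (256, 512, 1024, 2048))
smooth = nn.ModuleList(nn.Conv2d(256, 256, 3, padding=1) for _ in range(4))

def fpn(c2, c3, c4, c5):
    feats = [lat(c) for lat, c in zip(laterals, (c2, c3, c4, c5))]
    # Top-down pathway: upsample the coarser level, add the lateral feature.
    for i in range(2, -1, -1):
        feats[i] = feats[i] + F.interpolate(feats[i + 1], scale_factor=2)
    # One 3x3 smoothing conv per fused level; prediction runs per level.
    return [s(f) for s, f in zip(smooth, feats)]

shapes = [(256, 200, 304), (512, 100, 152), (1024, 50, 76), (2048, 25, 38)]
p2, p3, p4, p5 = fpn(*[torch.randn(1, *s) for s in shapes])
```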

Mask R-CNN

Pipeline

[Mask R-CNN pipeline diagram]

Mask R-CNN adopts the two-stage procedure, with an identical first stage to Faster R-CNN. In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI.

Faster R-CNN

See the Faster R-CNN section above.

RoIAlign

RoIPool is a standard operation for extracting a fixed-size feature map (e.g., 7×7) from each RoI.

The RoI coordinates produced by the RPN, which are usually floating-point numbers, are first quantized by RoIPool to the discrete granularity of the feature map; the quantized RoI is then subdivided into spatial bins which are themselves quantized; finally, the feature values covered by each bin are aggregated (usually by max pooling).

The mask output is distinct from the class and box outputs: it requires extracting a much finer spatial layout of an object, including pixel-to-pixel alignment, which is the main missing piece of RoIPool.

RoIAlign keeps floating-point coordinates instead of integer ones, avoiding any quantization of the RoI, and uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each bin. They note that the results are not sensitive to the exact sampling locations, or to how many points are sampled, as long as no quantization is performed. See Figure 3 for details.
[Figure 3: RoIAlign with bilinear sampling]
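
At its core, RoIAlign is plain bilinear interpolation of the feature map at floating-point sampling locations. A minimal sketch (function name and shapes are hypothetical):

```python
import torch

def bilinear_sample(feat, y, x):
    """Interpolate feat (C, H, W) at the float location (y, x) from its four
    integer neighbors -- no rounding of coordinates, unlike RoIPool."""
    C, H, W = feat.shape
    y0, x0 = int(y), int(x)                       # top-left neighbor
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    ly, lx = y - y0, x - x0                       # fractional offsets
    return ((1 - ly) * (1 - lx) * feat[:, y0, x0] +
            (1 - ly) * lx * feat[:, y0, x1] +
            ly * (1 - lx) * feat[:, y1, x0] +
            ly * lx * feat[:, y1, x1])

feat = torch.randn(256, 25, 25)
v = bilinear_sample(feat, 3.7, 10.2)   # (256,) values at a non-integer point
# RoIAlign averages (or maxes) four such samples per output bin, with the
# sample locations spaced regularly inside each un-quantized bin.
```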

See more details from ROIAlign 原理理解
Or 详解 ROI Align 的基本原理

FCN (Fully Convolutional Network) Head

The Fully Convolutional Network (FCN) was proposed in CVPR 2015 for semantic segmentation. Unlike previous methods that resort to fc layers for mask prediction, the spatial structure of masks can be extracted naturally through the pixel-to-pixel correspondence provided by convolutions.
Here, an FCN head predicts the instance mask of each RoI, where each RoI already corresponds to a single object. See more details on FCN.
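A minimal sketch of such a mask head in the spirit of the paper's FPN variant: 3×3 convs on the 14×14 RoI feature, a stride-2 deconv up to 28×28, and a 1×1 conv emitting one mask logit map per class (layer counts and sizes here are illustrative, not the exact configuration).

```python
import torch
from torch import nn

K = 80                                   # number of classes (COCO)
mask_head = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),  # 14x14 -> 28x28
    nn.Conv2d(256, K, 1),                # K independent mask logit maps
)
logits = mask_head(torch.randn(8, 256, 14, 14))   # (8, K, 28, 28)
```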

Loss

Formally, they define a multi-task loss on each sampled RoI as

$L = L_{cls} + L_{box} + L_{mask}$

The classification loss $L_{cls}$ and bounding-box loss $L_{box}$ are identical to those defined in Fast R-CNN. The mask branch has a $K m^2$-dimensional output for each RoI, which encodes K binary masks of resolution m×m, one for each of the K classes. To this they apply a per-pixel sigmoid, and define $L_{mask}$ as the average binary cross-entropy loss over all pixels. For an RoI associated with ground-truth class k, $L_{mask}$ is only defined on the k-th mask (other mask outputs do not contribute to the loss).
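
A minimal sketch of $L_{mask}$ (tensor names are hypothetical): the per-pixel sigmoid BCE is applied only to the ground-truth class's mask, so the other K-1 masks receive no gradient.

```python
import torch
import torch.nn.functional as F

N, K, m = 8, 80, 28                          # RoIs, classes, mask resolution
mask_logits = torch.randn(N, K, m, m)        # mask head output
gt_classes = torch.randint(0, K, (N,))       # ground-truth class per RoI
gt_masks = torch.randint(0, 2, (N, m, m)).float()   # binary m x m targets

# Select only the k-th mask per RoI; other classes do not enter the loss.
picked = mask_logits[torch.arange(N), gt_classes]   # (N, m, m)
L_mask = F.binary_cross_entropy_with_logits(picked, gt_masks)
```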

Implementation Details

They set hyper-parameters following existing Fast/Faster R-CNN work and find that Mask R-CNN is robust to them.

Training

As in Fast R-CNN, the mask loss $L_{mask}$ is defined only on positive RoIs, i.e., those whose IoU with a ground-truth box is at least 0.5. See more details from the Mask R-CNN paper.

Inference

“At test time, the proposal number is 300 for the C4 backbone and 1000 for FPN. We run the box prediction branch on these proposals, followed by non-maximum suppression. The mask branch is then applied to the highest scoring 100 detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs).”
The mask branch can predict K masks per RoI, but only the k-th mask is used, where k is the class predicted by the classification branch. The m×m floating-point mask output is then resized to the RoI size and binarized at a threshold of 0.5.
See more details from Mask R-CNN paper
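
A minimal sketch of that mask post-processing for a single detection (names and sizes are hypothetical):

```python
import torch
import torch.nn.functional as F

K, m = 80, 28
mask_probs = torch.sigmoid(torch.randn(K, m, m))    # mask branch output
pred_class = 17                                     # from the box branch
roi_h, roi_w = 120, 90                              # detected box size (px)

mask = mask_probs[pred_class]                       # keep only the k-th mask
mask = F.interpolate(mask[None, None], size=(roi_h, roi_w),
                     mode="bilinear", align_corners=False)[0, 0]
binary_mask = mask >= 0.5                           # threshold at 0.5
```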

Experiments: Instance Segmentation

They perform a thorough comparison of Mask R-CNN to the state of the art along with comprehensive ablations on the COCO dataset.

Main Results

[Table 1: instance segmentation mask AP on COCO test-dev]
"They compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1. All instantiations of theirs model outperform baseline variants of previous state-of-the-art models. This includes MNC and FCIS, the winners of the COCO 2015 and 2016 segmentation challenges, respectively. "

Ablation Experiments

Architecture

The deeper, the better.

Multinomial vs. Independent Masks

This shows that Mask R-CNN benefits from decoupling mask and class prediction:
“once the instance has been classified as a whole (by the box branch), it is sufficient to predict a binary mask without concern for the categories, which makes the model easier to train.”

Class-Specific vs. Class-Agnostic Masks

They find that predicting class-agnostic masks (AP 29.7), i.e., one m×m output regardless of class, is nearly as effective as predicting class-specific masks (AP 30.3), i.e., one m×m mask per class. This further highlights the division of labor in the approach, which largely decouples classification and segmentation.

RoIAlign

The larger the feature stride, the bigger the improvement RoIAlign brings.

Mask Branch


Bounding Box Detection Results


Timing

Inference

~195 ms per image on an Nvidia Tesla M40 GPU (plus 15 ms CPU time for resizing the outputs to the original resolution)

Training

“Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72s per 16 image mini-batch), and 44 hours with ResNet-101-FPN.”

Mask R-CNN for Human Pose Estimation

Implementation Details

“For each of the K keypoints of an instance, the training target is a one-hot m×m binary mask where only a single pixel is labeled as foreground. During training, for each visible ground-truth keypoint, they minimize the cross-entropy loss over an m2-way softmax output (which encourages a single point to be detected).”
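
A minimal sketch of that keypoint loss (names hypothetical; the masking of invisible keypoints is omitted for brevity): each keypoint's target is a single pixel index, so the loss is an m²-way softmax cross-entropy.

```python
import torch
import torch.nn.functional as F

N, K, m = 4, 17, 56                       # instances, keypoint types, resolution
kp_logits = torch.randn(N, K, m, m)       # keypoint head output
gt_pos = torch.randint(0, m * m, (N, K))  # flattened pixel index per keypoint

# One foreground pixel per one-hot m x m target -> m^2-way softmax CE,
# which encourages the head to detect a single point per keypoint.
L_kp = F.cross_entropy(kp_logits.view(N * K, m * m), gt_pos.view(N * K))
```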

Main Results and Ablations

Adding the mask branch to the box-only (i.e., Faster R-CNN) or keypoint-only versions consistently improves these tasks. However, adding the keypoint branch reduces the box/mask AP slightly, suggesting that while keypoint detection benefits from multi-task training, it does not in turn help the other tasks.

Conclusion

  • Mask R-CNN is obtained by adding a mask branch to the existing Faster R-CNN, and it decouples classification from mask prediction; the experiments confirm that this design is sound.
  • The mask output requires extracting a much finer spatial layout of an object, including pixel-to-pixel alignment between inputs and outputs. The authors fix this misalignment by replacing RoIPool with RoIAlign.
  • Mask R-CNN is fast, accurate, and general-purpose.