Mask RCNN

Latest recommended article published 2023-01-05 16:02:03

好运来2333

Article link: https://blog.csdn.net/qq_33254870/article/details/89639289

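Paper: https://arxiv.org/abs/1703.06870
Code: https://github.com/facebookresearch/maskrcnn-benchmark
or https://github.com/matterport/Mask_RCNN
My slides and walkthrough video: https://github.com/DHUB721/Object-Detection (Note: this is only my personal understanding. If anything is wrong, corrections are very welcome, and please go easy on me. Thanks!)
The video already explains the paper in detail, so I will not repeat that here. I only raise a few points that puzzled me, for discussion.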

1. Background

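（1）ROIPool in Faster RCNN applies coarse spatial quantization when extracting features. To fix the resulting misalignment, the paper proposes ROIAlign, which preserves exact spatial locations. This is the core of the paper.
（2）Mask and class prediction are decoupled. This is done through a separate mask branch, which adds only a small overhead.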
The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.
Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN.

2. Highlights

（1） Our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.
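（2）ROIAlign improves localization accuracy. (By comparison, the paper notes: FCIS exhibits systematic errors on overlapping instances and creates spurious edges.)
（3）It follows an instance-first strategy, as opposed to a segmentation-first one.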

3. Details

3.1 Mask RCNN Architecture

Mask RCNN consists of two parts: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI.
[Figure]
The backbones above are ResNet and FPN, respectively. Now for the head architecture:
It is essentially the same as in Faster RCNN. If that is unclear, recall the Faster RCNN structure:

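So the left part is the backbone and the right part is the head architecture. This is clearly a two-stage pipeline. Note: res5 in the head architecture refers to the fifth stage of ResNet, because the backbone ends with the features from the final convolutional layer of the fourth stage. A stage here is a group of residual blocks (each with a shortcut connection) that share the same feature-map resolution; the purple, green, red, and blue groups in the ResNet backbone are one stage each.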

3.2 L_mask: the average binary cross-entropy loss

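Our definition of L_mask allows the network to generate masks for every class without competition among classes; that is, each class gets its own binary cross-entropy loss.
The mask branch has a Km² dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes. If the loss is class-independent, why is the output Km²-dimensional? Here K is the total number of classes: the branch predicts one m × m binary mask per class, using a per-pixel sigmoid. For an RoI whose ground-truth class is k, L_mask is defined only on the k-th mask; the other K − 1 masks do not contribute to the loss. Because each mask is scored independently with a sigmoid rather than with a softmax across classes, the classes do not compete.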
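To make the Km² output and the per-class loss concrete, here is a minimal pure-Python sketch. It assumes, following the paper's description, a per-pixel sigmoid and a loss averaged over the m × m pixels of the ground-truth class's mask only. The function names, the nested-list layout, and the toy numbers are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_loss(logits, gt_mask, gt_class):
    """Average binary cross-entropy over the m*m pixels of one mask.

    logits:   K masks, each an m x m nested list of raw scores (assumed layout)
    gt_mask:  m x m nested list of 0/1 targets for this RoI
    gt_class: index k of the RoI's ground-truth class
    """
    pred = logits[gt_class]  # only the k-th mask contributes to the loss
    total, count = 0.0, 0
    for row_pred, row_target in zip(pred, gt_mask):
        for z, t in zip(row_pred, row_target):
            p = sigmoid(z)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count

# Toy case: K = 2 classes, m = 2
logits = [
    [[2.0, -2.0], [-2.0, 2.0]],  # mask for class 0
    [[9.0, 9.0], [9.0, 9.0]],    # mask for class 1, ignored when gt_class == 0
]
gt_mask = [[1, 0], [0, 1]]
loss = mask_loss(logits, gt_mask, gt_class=0)
```

Changing the class 1 scores leaves `loss` unchanged, which is the "no competition among classes" property in miniature.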

3.3 Mask Representation

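Class labels and box offsets can be collapsed into short vectors by fully connected layers, but a mask has to preserve spatial structure, so it is predicted with a fully convolutional network (FCN) rather than with fully connected layers.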

3.4 ROIAlign

（1）Why introduce ROIAlign?
I. RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map.
II. this quantized RoI is then subdivided into spatial bins which are themselves quantized.
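These two quantizations introduce misalignments between the RoI and the extracted features. (This may not hurt classification, which is robust to small translations, but it does hurt mask prediction, which needs pixel-level accuracy.)
A concrete example:
[Figure] This is the first quantization. After several downsampling steps, the RoI loses information to rounding. For example, 665/32 ≈ 20.78, but ROIPool cannot address a fractional position on the feature map, so the value is truncated to 20 and the remainder is lost.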
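The size of that rounding error is easy to check. The sketch below reuses the 665 and stride 32 from the example; the box itself and the function name are hypothetical, and truncation is used to match the example (an actual ROIPool implementation may round to nearest instead).

```python
def quantize_roi(box, stride):
    """Map an image-space box (x1, y1, x2, y2) onto the feature map.

    Returns the exact (fractional) coordinates and the truncated ones.
    Truncation follows the example in the text; this is an assumption,
    not a statement about any particular library.
    """
    exact = tuple(c / stride for c in box)
    quantized = tuple(int(c // stride) for c in box)
    return exact, quantized

exact, quantized = quantize_roi((0, 0, 665, 665), stride=32)

# Error of the right edge, measured back in image pixels
error_px = (exact[2] - quantized[2]) * 32
print(exact[2], quantized[2], error_px)
```

Here the right edge moves from 20.78125 to 20 on the feature map, which corresponds to 25 pixels in the input image.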

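This is the second quantization. The pooling bins are clearly not all the same size, which also distorts the ROI pooling result. The paper fixes this with bilinear interpolation.
[Figure]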
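The uneven bins of the second quantization can be reproduced with a few lines. This sketch assumes a quantized RoI that is 20 feature-map cells wide (as in the example above) and a hypothetical 7-bin output; the flooring rule for the bin edges is one possible scheme, used here only for illustration.

```python
def quantized_bin_edges(size, n_bins):
    """Integer bin edges obtained by flooring i * size / n_bins."""
    return [(i * size) // n_bins for i in range(n_bins + 1)]

edges = quantized_bin_edges(20, 7)
widths = [b - a for a, b in zip(edges, edges[1:])]

# The exact bin width would be 20 / 7, which no integer grid can represent
exact_width = 20 / 7
print(edges, widths, exact_width)
```

The bins come out as widths 2, 3, 3, 3, 3, 3, 3 instead of seven equal bins of about 2.86 cells each.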

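ROIAlign divides each bin (the red box) into four equal sub-regions, takes the center of each sub-region as a sampling point, computes the feature value at each sampling point by bilinear interpolation, and then aggregates the four values (max or average) to get the value of the bin. You might ask: why not simply max- or average-pool the pixels inside each bin? With average pooling, how would you decide whether the middle column of pixels belongs to the left bin or the right one? Bilinear interpolation avoids the question: each sampling point sits at an exact continuous location, and its value is interpolated from the four nearest feature-map cells, so no pixel has to be assigned to a single bin.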
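The procedure for a single bin can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes that feat[y][x] holds the feature value at the integer point (y, x), uses 2 × 2 sampling points per bin, and all function names and the toy feature map are hypothetical.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly interpolate the 2D nested list `feat` at continuous (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align_bin(feat, y_start, x_start, bin_h, bin_w, reduce=max):
    """Value of one bin: sample the centers of its 2x2 sub-regions, then aggregate."""
    samples = []
    for iy in range(2):
        for ix in range(2):
            sy = y_start + (iy + 0.5) * bin_h / 2
            sx = x_start + (ix + 0.5) * bin_w / 2
            samples.append(bilinear(feat, sy, sx))
    return reduce(samples), samples

def mean(values):
    return sum(values) / len(values)

# Toy feature map with feat[y][x] = 3*y + x, so interpolation is easy to check
feat = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
max_value, samples = roi_align_bin(feat, 0.5, 0.5, 1.0, 1.0, reduce=max)
avg_value, _ = roi_align_bin(feat, 0.5, 0.5, 1.0, 1.0, reduce=mean)
```

Note that the bin boundaries (0.5 to 1.5 here) never get rounded; only the interpolation touches the integer grid.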

3.5 Training

（1）The mask loss L_mask is defined only on positive RoIs.
（2）We adopt image-centric training.
（3）The RPN anchors span 5 scales and 3 aspect ratios, following FPN.
（4）At test time, the mask branch is applied to the highest scoring 100 detection boxes.
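To illustrate item (3), the sketch below enumerates 5 scales × 3 aspect ratios. The concrete values (side lengths 32 to 512 and ratios 1:2, 1:1, 2:1) are the ones I recall from the FPN paper and should be treated as assumptions; in FPN each scale is assigned to its own pyramid level, which this sketch ignores. The function name is hypothetical.

```python
import math

def make_anchor_shapes(scales=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (scale, ratio, width, height) for every scale/ratio pair.

    Each anchor keeps the area scale**2; ratio is height / width.
    The default values are assumptions taken from memory of the FPN paper.
    """
    shapes = []
    for s in scales:
        area = float(s * s)
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            shapes.append((s, r, w, h))
    return shapes

anchors = make_anchor_shapes()
for s, r, w, h in anchors[:3]:
    print(f"scale={s} ratio={r} -> {w:.1f} x {h:.1f}")
```

This yields 15 anchor shapes in total, three per scale.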