Mask R-CNN Paper Translation

Mask R-CNN paper (English original): https://arxiv.org/pdf/1703.06870.pdf

Mask R-CNN project (Caffe2): https://github.com/facebookresearch/Detectron

Number of training instances per category: person 17.9k; rider 1.8k; car 26.9k; truck 0.5k; bus 0.4k; train 0.2k; motorcycle 0.7k; bicycle 3.7k.

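Instance segmentation performance on this task is measured by the COCO-style mask AP (averaged over IoU thresholds); AP50 (i.e., mask AP at an IoU of 0.5) is also reported.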
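To make the metric concrete, the sketch below computes the IoU of two binary masks and checks it against the COCO-style threshold grid (0.50 to 0.95 in steps of 0.05). It only illustrates the matching criterion; a full AP computation also ranks detections by score and integrates a precision-recall curve, which is omitted here. The function names and the toy masks are illustrative assumptions, not part of the paper or of any evaluation toolkit.

```python
import numpy as np

# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95 (10 values).
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: treat as no overlap
    return float(np.logical_and(pred, gt).sum() / union)


def matches_at_thresholds(iou: float) -> np.ndarray:
    """For each threshold, whether a prediction with this IoU counts as a match."""
    return iou >= IOU_THRESHOLDS


# Toy example (hypothetical masks): ground truth covers 8 pixels.
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:4] = True

# Prediction covers 6 of those pixels plus 1 pixel outside the ground truth.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:3] = True
pred[2, 0] = True

iou = mask_iou(pred, gt)              # intersection 6, union 9
matched = matches_at_thresholds(iou)  # one boolean per threshold
```

A prediction like this one counts as correct for AP50 but fails the stricter thresholds, which is why the threshold-averaged AP is lower than AP50 for imprecise masks.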

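Implementation: Our Mask R-CNN models use a ResNet-FPN-50 backbone. We also tested the 101-layer counterpart, but because the dataset is small, it performs similarly. We train with the image scale (shorter side) randomly sampled from [800, 1024] pixels, which reduces overfitting; at test time, images are resized to a single scale of 1024 pixels. We use a mini-batch size of 1 image per GPU (so 8 images on 8 GPUs) and train for 24k iterations, starting from a learning rate of 0.01 and reducing it to 0.001 at 18k iterations. Other implementation details are the same as those described earlier in the paper.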
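The training recipe can be sketched in a few lines. The learning rates and iteration counts are the ones stated in this section; the [800, 1024] sampling range follows the original English paper, and the helper names and the 1024×2048 example image size are assumptions made for illustration only, not the Detectron implementation.

```python
import random


def learning_rate(iteration: int) -> float:
    """Step schedule: 0.01 up to 18k iterations, then 0.001 until 24k."""
    return 0.01 if iteration < 18_000 else 0.001


def sample_train_scale(rng: random.Random, lo: int = 800, hi: int = 1024) -> int:
    """Randomly pick the target length of the shorter image side (inclusive range)."""
    return rng.randint(lo, hi)


def resized_shape(height: int, width: int, target_short: int) -> tuple:
    """Scale (height, width) so that the shorter side equals target_short."""
    scale = target_short / min(height, width)
    return round(height * scale), round(width * scale)


rng = random.Random(0)
train_scale = sample_train_scale(rng)                 # somewhere in [800, 1024]
train_shape = resized_shape(1024, 2048, train_scale)  # varies per iteration
test_shape = resized_shape(1024, 2048, 1024)          # fixed scale at test time
```

Sampling the scale per iteration acts as a simple form of data augmentation, which matters here because the training set is small.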

Results: We compare our results with other state-of-the-art methods on the val and test sets, as shown in Table 7.

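Without using the coarse training set, our method achieves 26.2 AP on test, a relative improvement of over 30% over the previous best result (which used all of the training data). Compared with the previous best result that used only the fine training set (17.4 AP), the relative improvement is about 50%. Obtaining this result takes about 4 hours of training on a single 8-GPU machine.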

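For the person and car categories, the Cityscapes dataset contains a large number of within-category overlapping instances (on average 6 people and 9 cars per image). We argue that within-category overlap is a core difficulty of instance segmentation. Our method improves substantially on these two categories over the previous best results (a relative improvement of about 85% on person, from 16.5 to 30.5, and about 30% on car, from 35.7 to 46.9).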
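The relative improvements quoted above follow directly from the AP values. As a quick check, the snippet below recomputes them; the helper name is an illustrative assumption, and the numbers are the ones given in the text.

```python
def relative_improvement(old_ap: float, new_ap: float) -> float:
    """Relative improvement of new_ap over old_ap, in percent."""
    return (new_ap - old_ap) / old_ap * 100.0


# AP pairs (previous best, Mask R-CNN) quoted in the text.
overall_fine_only = relative_improvement(17.4, 26.2)  # roughly 50%
person = relative_improvement(16.5, 30.5)             # roughly 85%
car = relative_improvement(35.7, 46.9)                # roughly 30%
```

Note that these are relative gains, not absolute AP points: the absolute gains are 8.8, 14.0, and 11.2 points respectively.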

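A main challenge of the Cityscapes dataset is the small amount of training data, particularly for the truck, bus, and train categories, each of which has only about 200-500 training samples. To partially remedy this issue, we further report a result using COCO pre-training. To do this, we initialize the model from a pre-trained COCO Mask R-CNN model (the rider category is randomly initialized). We then fine-tune this model on Cityscapes for 4k iterations, with the learning rate reduced at 3k iterations; fine-tuning takes about 1 hour.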
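One way to picture this initialization is as a per-category copy of the output-layer weights, with a random fallback for categories that have no COCO counterpart. The sketch below is a minimal illustration under stated assumptions: the dictionary layout, the function name, the feature dimension, and the standard deviation of 0.01 for random initialization are all hypothetical and are not taken from the paper or from Detectron.

```python
import numpy as np

# The eight Cityscapes instance categories; all except "rider" also exist in COCO.
CITYSCAPES_CLASSES = [
    "person", "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle",
]


def init_head_from_coco(coco_weights: dict, feat_dim: int,
                        rng: np.random.Generator) -> np.ndarray:
    """Build a (num_classes, feat_dim) weight matrix for the Cityscapes head.

    coco_weights is a hypothetical mapping from COCO category name to the
    weight vector of that category in the pre-trained output layer.
    """
    rows = []
    for name in CITYSCAPES_CLASSES:
        if name in coco_weights:
            # Matching category: reuse the pre-trained weights.
            rows.append(np.array(coco_weights[name], dtype=float))
        else:
            # No COCO counterpart (rider): random initialization (assumed std).
            rows.append(rng.normal(0.0, 0.01, size=feat_dim))
    return np.stack(rows)


# Usage with dummy "pre-trained" weights (constant vectors, for illustration).
feat_dim = 4
dummy_coco = {
    name: np.full(feat_dim, float(i))
    for i, name in enumerate(CITYSCAPES_CLASSES) if name != "rider"
}
head = init_head_from_coco(dummy_coco, feat_dim, np.random.default_rng(0))
```

After such an initialization, the whole network would be fine-tuned on Cityscapes with the short schedule described above.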

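The COCO pre-trained Mask R-CNN model achieves 32.0 AP on test, almost 6 points higher than the model without pre-training. This indicates how important a sufficient amount of training data is. It also suggests that instance segmentation on Cityscapes is affected by low-shot learning performance. We find that COCO pre-training is an effective strategy for mitigating the limited-data problem of this dataset.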

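Finally, we observe a bias between the val and test AP, which can also be seen in previously reported results. We find that this bias is mainly caused by the truck, bus, and train categories, for which the model trained only on fine annotations has val/test AP of 28.8/22.8, 53.5/32.2, and 33.0/18.6, respectively. This suggests that there is a domain shift on these categories, which also have little training data. COCO pre-training helps to improve the results on these categories; however, the domain shift persists, with val/test AP of 38.0/30.1, 57.5/40.9, and 41.2/30.9, respectively. For the person and car categories, we do not see any such bias (the val and test AP differ by no more than ±1).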
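The size of the val/test gap is easier to see when the quoted AP pairs are subtracted. The short snippet below does this for the three affected categories; the variable and function names are illustrative assumptions, and the numbers are the ones given in the paragraph above.

```python
# (val AP, test AP) pairs quoted in the text, per category.
FINE_ONLY = {"truck": (28.8, 22.8), "bus": (53.5, 32.2), "train": (33.0, 18.6)}
COCO_PRETRAINED = {"truck": (38.0, 30.1), "bus": (57.5, 40.9), "train": (41.2, 30.9)}


def val_test_gap(results: dict) -> dict:
    """Per-category difference val AP minus test AP, rounded to one decimal."""
    return {name: round(val - test, 1) for name, (val, test) in results.items()}


gap_fine = val_test_gap(FINE_ONLY)        # gaps of 6.0, 21.3, 14.4 points
gap_coco = val_test_gap(COCO_PRETRAINED)  # gaps of 7.9, 16.6, 10.3 points
```

With pre-training, both val and test AP rise for all three categories, but a gap of several points remains in each, which is consistent with the statement that the domain shift persists.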

Example results on Cityscapes are shown in Figure 7 (Mask R-CNN results on Cityscapes test, 32.0 AP; the bottom-right image shows a failure case).

References

R. Girshick. Fast R-CNN. In ICCV, 2015.

S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.

J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? PAMI, 2015.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.

A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.

T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.

B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.

B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.

J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.

P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.

P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.

P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.

J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.

J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.

Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.

J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016.

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.

S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.

Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.

S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.

M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.

A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: from edges to instances with multicut. In CVPR, 2017.

M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.