[ICCV 2019] YOLACT Real-time Instance Segmentation

1. Authors

Daniel Bolya, Chong Zhou, Fanyi Xiao, Yong Jae Lee
University of California, Davis

2. Abstract

We present a simple, fully-convolutional model for real-time instance segmentation that achieves 29.8 mAP on MS COCO at 33.5 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach.

Moreover, we obtain this result after training on only one GPU.

We accomplish this by breaking instance segmentation into two parallel subtasks:
(1) generating a set of prototype masks and
(2) predicting per-instance mask coefficients.
Then we produce instance masks by linearly combining the prototypes with the mask coefficients.

Finally, we also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty.
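
Fast NMS is only name-checked in this note; the sketch below is a rough, assumed reconstruction of the idea (a pairwise IoU matrix, upper-triangulated so each detection is only compared against higher-scoring ones, then suppressed by column-wise maximum), for a single class with boxes already sorted by score. It is not the authors' exact implementation.

```python
import torch

def pairwise_iou(a, b):
    """IoU between every box in `a` (N, 4) and every box in `b` (M, 4),
    boxes given as (x1, y1, x2, y2)."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])  # top-left of intersection
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])  # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def fast_nms_sketch(boxes, scores, iou_threshold=0.5):
    """Fast-NMS-style suppression for one class, assuming `boxes` (N, 4)
    are already sorted by descending `scores` (N,)."""
    iou = pairwise_iou(boxes, boxes)
    # Keep only the upper triangle: each box is compared only against
    # boxes with a higher score than itself.
    iou = iou.triu(diagonal=1)
    # Unlike standard NMS, already-suppressed boxes can still suppress
    # others here, which is what makes the operation fully parallel.
    max_iou, _ = iou.max(dim=0)
    keep = max_iou <= iou_threshold
    return boxes[keep], scores[keep]
```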

3. Introduction

In this work, our goal is to fill the gap left by slow instance segmentation methods with a fast, one-stage instance segmentation model, in the same way that SSD and YOLO fill that gap for object detection.

One-stage object detectors like SSD and YOLO are able to speed up existing two-stage detectors like Faster R-CNN by simply removing the second stage and making up for the lost performance in other ways.

State-of-the-art two-stage instance segmentation methods depend heavily on feature localization to produce masks.
These methods “repool” features in some bounding box region (e.g., via RoIpool/align), and then feed these now localized features to their mask predictor.

One-stage methods that perform these steps in parallel, like FCIS, do exist, but they require significant amounts of post-processing after localization and are thus still far from real-time.

This approach also has several practical advantages.

  1. First and foremost, it’s fast: because of its parallel structure and extremely lightweight assembly process, YOLACT adds only a marginal amount of computational overhead to a one-stage backbone detector, making it easy to reach 30 fps even when using ResNet-101; in fact, the entire mask branch takes only ∼5 ms to evaluate.
  2. Second, masks are high-quality: since the masks use the full extent of the image space without any loss of quality from repooling, our masks for large objects are significantly higher quality than those of other methods.
  3. Finally, it’s general: the idea of generating prototypes and mask coefficients could be added to almost any modern object detector.

4. YOLACT

Our goal is to add a mask branch to an existing one-stage object detection model in the same vein as Mask R-CNN does to Faster R-CNN, but without an explicit feature localization step (e.g., feature repooling).

To do this, we break up the complex task of instance segmentation into two simpler, parallel tasks that can be assembled to form the final masks.

The first branch uses an FCN to produce a set of image-sized “prototype masks” that do not depend on any one instance.
The second adds an extra head to the object detection branch to predict a vector of “mask coefficients” for each anchor that encode an instance’s representation in the prototype space.
Finally, for each instance that survives NMS, we construct a mask for that instance by linearly combining the work of these two branches.

4.1 Rationale

Thus, we break the problem into two parallel parts, making use of fc layers, which are good at producing semantic vectors, and conv layers, which are good at producing spatially coherent masks, to produce the “mask coefficients” and “prototype masks”, respectively.

Because prototypes and mask coefficients can be computed independently, the computational overhead over that of the backbone detector comes mostly from the assembly step, which can be implemented as a single matrix multiplication.

In this way, we can maintain spatial coherence in the feature space while still being one-stage and fast.

4.2 Prototype Generation

All supervision for these prototypes comes from the final mask loss after assembly.

We note two important design choices: taking protonet from deeper backbone features produces more robust masks, and higher resolution prototypes result in both higher quality masks and better performance on smaller objects.
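
A minimal sketch of what such a prototype branch might look like: a small FCN over a single backbone feature map that is upsampled and ends in k channels. The depth, channel widths, and upsampling factor below are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProtoNetSketch(nn.Module):
    """Maps one backbone feature map to k image-sized prototype masks."""

    def __init__(self, in_channels=256, k=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.proto_out = nn.Conv2d(256, k, 1)

    def forward(self, feat):
        x = self.convs(feat)
        # Upsample so prototypes have higher resolution than the feature map,
        # which helps mask quality on small objects.
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        # ReLU keeps prototype activations non-negative.
        return F.relu(self.proto_out(x))  # (B, k, H, W)
```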

4.3 Mask Coefficients

Typical anchor-based object detectors have two branches in their prediction heads: one branch to predict c class confidences, and the other to predict 4 bounding box regressors. For mask coefficient prediction, we simply add a third branch in parallel that predicts k mask coefficients, one corresponding to each prototype.

We apply tanh to the k mask coefficients, which produces more stable outputs than using no nonlinearity.
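
A minimal sketch of such a prediction head, producing c class confidences, 4 box regressors, and k tanh-activated mask coefficients for each of `a` anchors per spatial location; the kernel size and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHeadSketch(nn.Module):
    """Three parallel branches per feature-map location: class scores,
    box regressors, and mask coefficients (one per prototype)."""

    def __init__(self, in_channels=256, num_classes=81, num_anchors=3, k=32):
        super().__init__()
        a = num_anchors
        self.cls_layer = nn.Conv2d(in_channels, a * num_classes, 3, padding=1)
        self.box_layer = nn.Conv2d(in_channels, a * 4, 3, padding=1)
        self.coef_layer = nn.Conv2d(in_channels, a * k, 3, padding=1)
        self.num_classes, self.k = num_classes, k

    def forward(self, feat):
        b = feat.size(0)
        cls = self.cls_layer(feat).permute(0, 2, 3, 1).reshape(b, -1, self.num_classes)
        box = self.box_layer(feat).permute(0, 2, 3, 1).reshape(b, -1, 4)
        # tanh bounds coefficients to [-1, 1], so prototypes can be both
        # added and subtracted when masks are assembled.
        coef = torch.tanh(self.coef_layer(feat)).permute(0, 2, 3, 1).reshape(b, -1, self.k)
        return cls, box, coef
```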

4.4 Mask Assembly

To produce instance masks, we combine the work of the prototype branch and mask coefficient branch, using a linear combination of the former with the latter as coefficients.
These operations can be implemented efficiently using a single matrix multiplication and sigmoid:
$M = \sigma\left(P C^{T}\right)$
Using a linear combination keeps this step simple and fast.
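
With P the h×w×k tensor of prototypes and C the n×k matrix of coefficients for the n instances that survive NMS, the assembly is one matrix multiplication plus a sigmoid. A minimal sketch:

```python
import torch

def assemble_masks(prototypes, coefficients):
    """M = sigma(P C^T).

    prototypes:   (h, w, k) prototype masks P
    coefficients: (n, k) mask coefficients C, one row per surviving instance
    returns:      (h, w, n) soft instance masks
    """
    h, w, k = prototypes.shape
    masks = prototypes.reshape(h * w, k) @ coefficients.t()  # (h*w, n)
    return torch.sigmoid(masks).reshape(h, w, -1)
```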

4.5 Emergent Behavior

We observe that many prototypes activate on certain “partitions” of the image; that is, they only activate on objects on one side of an implicitly learned boundary.

Increasing k is ineffective most likely because predicting coefficients is difficult.

4.6 Backbone Detector

The design of our backbone detector closely follows RetinaNet with an emphasis on speed.

We apply smooth-L1 loss to train box regressors and encode box regression coordinates in the same way as SSD.
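
For reference, a short sketch of the standard smooth-L1 loss on encoded box targets (the SSD-style target encoding itself is not reproduced here):

```python
import torch

def smooth_l1(pred, target, beta=1.0):
    """Quadratic for errors smaller than `beta`, linear beyond it."""
    diff = (pred - target).abs()
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()
```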

Unlike RetinaNet we do not use focal loss, which we found not to be viable in our situation.

5. Results

(Result figures omitted.)
