[paper] 00037-Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Author: Liang-Chieh Chen et al., Google Inc.

Keywords:

DeepLabv3+: extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries.

Atrous Spatial Pyramid Pooling (ASPP):

1. Introduction

In this work, we consider two types of neural networks that use a spatial pyramid pooling module [18,19,20] or an encoder-decoder structure [21,22] for semantic segmentation, where the former captures rich contextual information by pooling features at different resolutions while the latter is able to obtain sharp object boundaries.

Contributions:

  1. We propose a novel encoder-decoder structure which employs DeepLabv3 as a powerful encoder module and a simple yet effective decoder module.
  2. In our structure, one can arbitrarily control the resolution of extracted encoder features by atrous convolution to trade off precision and runtime, which is not possible with existing encoder-decoder models.
  3. We adapt the Xception model for the segmentation task and apply depthwise separable convolution to both the ASPP module and the decoder module, resulting in a faster and stronger encoder-decoder network.
  4. Our proposed model attains a new state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes datasets. We also provide detailed analysis of design choices and model variants.
  5. We make our TensorFlow-based implementation of the proposed model publicly available at https://github.com/tensorflow/models/tree/master/research/deeplab.

2 Related Work

Spatial pyramid pooling:

PASS

Encoder-decoder:

Use DeepLabv3 as the encoder module and add a simple yet effective decoder module to obtain sharper segmentations.

Depthwise separable convolution:

3 Methods

3.1 Encoder-Decoder with Atrous Convolution

Atrous convolution:
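A minimal 1-D sketch of what atrous (dilated) convolution does: with rate r, the filter taps are applied r samples apart, so the receptive field grows without adding parameters. The function name and the averaging filter here are illustrative, not from the paper's code.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution with 'valid' padding.

    With rate r, filter taps are spaced r samples apart, enlarging
    the receptive field while keeping the parameter count fixed.
    """
    k = len(w)
    span = (k - 1) * rate + 1          # effective receptive field
    out_len = len(x) - span + 1
    return np.array([
        sum(x[i + j * rate] * w[j] for j in range(k))
        for i in range(out_len)
    ])

x = np.arange(8, dtype=float)          # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])          # 3-tap filter

print(atrous_conv1d(x, w, rate=1))     # standard convolution
print(atrous_conv1d(x, w, rate=2))     # same filter, receptive field of 5
```

Setting rate = 1 recovers the standard convolution; larger rates let the encoder extract denser or coarser features at will, which is the mechanism behind controlling output stride.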

Depthwise separable convolution:

Factoring a standard convolution into a depthwise convolution (one spatial filter per input channel) followed by a pointwise 1×1 convolution drastically reduces computational complexity while maintaining similar performance.
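The savings can be checked with a quick parameter count; the 3×3 kernel and 256-channel widths below are typical values chosen for illustration.

```python
# Parameter counts for a 3x3 convolution mapping 256 -> 256 channels.
k, c_in, c_out = 3, 256, 256

standard = k * k * c_in * c_out         # full 3x3 convolution
separable = k * k * c_in + c_in * c_out # depthwise 3x3 + pointwise 1x1

print(standard, separable, round(standard / separable, 2))
```

For these widths the separable factorization uses roughly 8.7× fewer parameters (and proportionally fewer multiply-adds), which is why DeepLabv3+ applies it in both the ASPP and decoder modules.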

DeepLabv3 as encoder:

We use the last feature map before logits in the original DeepLabv3 as the encoder output in our proposed encoder-decoder structure.

Proposed decoder:

We apply another 1×1 convolution on the low-level features to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512) which may outweigh the importance of the rich encoder features (only 256 channels in our model) and make the training harder. After the concatenation, we apply a few 3×3 convolutions to refine the features, followed by another simple bilinear upsampling by a factor of 4.
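The shape flow of the decoder can be sketched as follows. This is a simplified trace under assumed values: a 512×512 input, encoder output at output stride 16, 48 channels after the low-level 1×1 convolution, and nearest-neighbor upsampling standing in for bilinear; the refining 3×3 convolutions are elided.

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbor upsampling (stand-in for bilinear, for brevity)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

H = W = 512                                     # assumed input size
enc = np.zeros((H // 16, W // 16, 256))         # DeepLabv3 encoder output
low = np.zeros((H // 4, W // 4, 48))            # low-level features after 1x1 conv

enc_up = upsample(enc, 4)                       # 32x32 -> 128x128, stride 16 -> 4
fused = np.concatenate([enc_up, low], axis=-1)  # concat at stride 4
# ... a few 3x3 convolutions would refine `fused` here ...
out = upsample(fused, 4)                        # stride 4 -> full resolution

print(enc_up.shape, fused.shape, out.shape)
```

The key point the shapes make visible: the encoder features are only upsampled 4×, fused with boundary-preserving low-level features, refined, and then upsampled the remaining 4×, instead of a single naive 16× bilinear upsampling.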

3.2 Modified Aligned Xception

4. Experimental Evaluation

4.1 Decoder Design Choices

We define "DeepLabv3 feature map" as the last feature map computed by DeepLabv3 (i.e., the features containing ASPP features and image-level features), and [k × k; f] as a convolution operation with kernel size k × k and f filters.

In the decoder module, we consider three places for different design choices, namely (1) the 1×1 convolution used to reduce the channels of the low-level feature map from the encoder module, (2) the 3×3 convolution used to obtain sharper segmentation results, and (3) what encoder low-level features should be used.

We do not pursue an even denser output feature map (i.e., output stride < 4) given the limited GPU resources.
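The resource constraint behind that choice is easy to quantify: feature-map area grows quadratically as output stride shrinks. The 512 input size below is an assumed value for illustration.

```python
# output_stride = input resolution / final feature-map resolution.
# Halving the stride quadruples the number of spatial positions (and
# roughly the activation memory), which is why stride < 4 is not pursued.
input_size = 512                        # assumed crop size
for stride in (16, 8, 4, 2):
    side = input_size // stride
    print(stride, side, side * side)    # stride, feature side, positions
```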

4.2 ResNet-101 as Network Backbone

4.3 Xception as Network Backbone

ImageNet pretraining:

The result is

Baseline:

Table 5 second row

Adding decoder:

Table 5 third row

Pretraining on COCO:

Pretraining on JFT:

Test set results:

Qualitative results:

4.4 Improvement along Object Boundaries

4.5 Experimental Results on Cityscapes

5 Conclusion
