Reading notes on "DenseASPP for Semantic Segmentation in Street Scenes"

DenseNet can be viewed as a special case of DenseASPP by setting the dilation rate to 1. DenseASPP therefore inherits DenseNet's advantages: it alleviates the gradient-vanishing problem and uses substantially fewer parameters.

Abstract

Objects in autonomous driving exhibit very large scale changes, which poses great challenges for high-level feature representation, in the sense that multi-scale information must be correctly encoded.

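As Figure 1 shows, the apparent size of a person varies; in Figure 2, a bus is very close to the camera while a small car is very far away.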

Atrous convolution was proposed to address this problem. Atrous Spatial Pyramid Pooling (ASPP), introduced in DeepLab V2 and reused in DeepLab V3, concatenates multiple atrous-convolved features computed with different dilation rates into a final feature representation.

But feature resolution in the scale-axis is not dense enough for the autonomous driving scenario.

So we propose Densely connected Atrous Spatial Pyramid Pooling (DenseASPP), which connects a set of atrous convolutional layers in a dense way, such that it generates multi-scale features that not only cover a larger scale range, but also cover that scale range densely, without significantly increasing the model size.

1. Introduction

High-level features are very useful for segmentation. To extract high-level information, FCN uses multiple pooling layers to increase the receptive field size of an output neuron. However, downsampling and pooling lower the resolution: an increased number of pooling layers leads to a reduced feature map size, which poses serious challenges when up-sampling the segmentation output back to full resolution. On the other hand, we cannot give up enlarging the receptive field: if we output the segmentation from an early layer with larger resolution, we are not able to make use of higher-level semantics for better reasoning.

This is where atrous convolution comes in. A feature map produced by an atrous convolution can be the same size as the input, but with each output neuron possessing a larger receptive field, and therefore encoding higher-level semantics.

Atrous convolution still has drawbacks, though. 1. It produces feature maps at a single scale: all neurons in the atrous-convolved feature map share the same receptive field size, which means the process of semantic mask generation only makes use of features from a single scale. ASPP addresses this by concatenating feature maps generated by atrous convolution with different dilation rates. 2. For high-resolution input images, a larger receptive field requires a larger dilation rate, but experiments show that once the rate exceeds 24, atrous convolution becomes ineffective and gradually loses its modeling power. We therefore need a new network structure that is able to encode multi-scale information and simultaneously achieves a large enough receptive field size.

2. DenseASPP

A backbone network first produces a feature map, which is then fed into the proposed DenseASPP, a cascade of atrous convolution layers with dilation rates d ≤ 24, chosen to avoid the kernel degradation issue.

Panel (b) looks almost identical to the ResNet shown below, except that the ordinary convolutions are replaced with atrous convolutions.

With this design: 1. Neurons in each intermediate feature map encode semantic information from multiple scales. 2. The scale ranges (in terms of receptive field sizes) of the semantic information fed into the neurons differ from one intermediate feature map to another.

DenseASPP therefore makes two main contributions:

The generated features cover a very large scale range.
The generated features cover that scale range in a very dense manner.

3. Details of DenseASPP

3.1 Atrous convolution and pyramid pooling

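For a one-dimensional input signal x, filter w, and output y, atrous convolution can be written as:

$$y[i] = \sum_{k=1}^{K} x[i + d \cdot k] \cdot w[k]$$

where K is the filter size and d is the dilation rate.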
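As a minimal sketch of how a filter of size K and dilation rate d samples its input, the function below implements a 1-D atrous convolution in plain Python. The function name `atrous_conv1d`, the centered tap layout, the zero padding, and the odd-K restriction are all assumptions made for illustration; they are not taken from the paper.

```python
def atrous_conv1d(x, w, d):
    """1-D atrous convolution with K = len(w) taps spaced d samples apart.

    Assumptions (illustrative only): K is odd, the taps are centered on the
    output position, and out-of-range inputs are treated as zero, so the
    output has the same length as the input.
    """
    K = len(w)
    half = K // 2
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(K):
            j = i + d * (k - half)      # input index read by tap k
            if 0 <= j < len(x):         # zero padding outside the signal
                acc += x[j] * w[k]
        y.append(acc)
    return y


x = [float(v) for v in range(10)]
w = [1.0, 1.0, 1.0]
y1 = atrous_conv1d(x, w, d=1)   # ordinary convolution: neighbors 1 apart
y3 = atrous_conv1d(x, w, d=3)   # atrous convolution: neighbors 3 apart
```

With d = 1 this reduces to an ordinary convolution; with d = 3 the same three weights span seven input positions while the output keeps the input's length.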

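Writing $H_{K,d}(x)$ for an atrous convolution with filter size K and dilation rate d, ASPP can be expressed as:

$$y = H_{3,6}(x) + H_{3,12}(x) + H_{3,18}(x) + H_{3,24}(x)$$

As can be seen, every atrous convolution used here is 3×3, with dilation rates 6, 12, 18, and 24.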

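In DenseASPP, the l-th atrous layer computes:

$$y_{l} = H_{K,d_{l}}([y_{l-1}, y_{l-2}, \cdots, y_{0}])$$

where $H_{K,d_{l}}$ is an atrous convolution with filter size K and dilation rate $d_{l}$, $y_{0}$ is the input feature map, and $[\cdots]$ denotes concatenating the outputs of all previous layers to form the input of the l-th atrous layer.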

3.2 Denser feature pyramid and larger receptive field

3.2.1 Denser feature pyramid

Denser scale sampling:

For an atrous convolutional layer with dilation rate d and kernel size K, the equivalent receptive field size is:

$$R = (d - 1) \times (K - 1) + K$$

For example, a 3×3 atrous convolution with dilation rate d = 3 has receptive field size R = 7.

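Stacking two atrous convolutions yields a larger receptive field. Given two convolutional layers with filter sizes $K_{1}$ and $K_{2}$, the receptive field of the stack is:

$$K = K_{1} + K_{2} - 1$$

For example, stacking a convolution with filter size 7 and one with filter size 13 gives a receptive field of 19.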
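The two worked examples above can be checked with a few lines of Python. This is a sketch that assumes the single-layer rule R = (d − 1)(K − 1) + K and the stacking rule K1 + K2 − 1, which are the rules that reproduce the values 7 and 19; the helper names `receptive_field` and `stacked_receptive_field` are made up for illustration.

```python
def receptive_field(K, d):
    """Equivalent receptive field of one atrous layer (kernel size K, dilation d)."""
    return (d - 1) * (K - 1) + K


def stacked_receptive_field(*sizes):
    """Receptive field of layers stacked in sequence: each extra layer adds size - 1."""
    return sum(sizes) - (len(sizes) - 1)


r_single = receptive_field(K=3, d=3)          # 3x3 kernel, d = 3
r_stack = stacked_receptive_field(7, 13)      # filter sizes 7 and 13 stacked

print(r_single, r_stack)   # prints "7 19"
```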

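As the figure shows, the scale pyramid of DenseASPP is built by stacking a set of atrous convolutional layers with dilation rates 3, 6, 12, and 18. The numbers in each strip give the combination of dilation rates, the length of the strip represents the equivalent kernel size, and k is the actual receptive field, as shown below: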
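The scale pyramid can be sketched numerically. The snippet below assumes that, because of the dense connections, every subset of the atrous layers (taken in order) forms a path from input to output, and that each path's receptive field follows the single-layer rule (d − 1)(K − 1) + K and the stacking rule K1 + K2 − 1. The function name `scale_pyramid` is hypothetical.

```python
from itertools import combinations


def receptive_field(K, d):
    """Equivalent receptive field of one atrous layer."""
    return (d - 1) * (K - 1) + K


def scale_pyramid(rates, K=3):
    """Map each combination of dilation rates to its stacked receptive field."""
    pyramid = {}
    for r in range(1, len(rates) + 1):
        for combo in combinations(rates, r):
            size = sum(receptive_field(K, d) for d in combo) - (len(combo) - 1)
            pyramid[combo] = size
    return pyramid


pyramid = scale_pyramid((3, 6, 12, 18))
distinct_scales = sorted(set(pyramid.values()))
parallel_scales = sorted(receptive_field(3, d) for d in (3, 6, 12, 18))

for combo, size in sorted(pyramid.items(), key=lambda item: item[1]):
    print(combo, size)
```

Under these assumptions the dense cascade covers every scale from 7 to 79 in steps of 6, whereas the same four layers used in parallel would cover only 7, 13, 25, and 37.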

Denser pixel sampling:

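(a) A single atrous convolution with d = 3 uses only 3 pixels in its computation. (b) Stacking a d = 6 convolution on top of the d = 3 convolution lets 7 pixels contribute. (c) is the two-dimensional version of (b), where 49 pixels contribute in total.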
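The pixel counts above can be reproduced by enumerating which input offsets reach one output position. This is a sketch assuming 3-tap kernels centered on the output pixel; `sampled_offsets` is a hypothetical helper name.

```python
def sampled_offsets(rates, K=3):
    """Input offsets (relative to one output pixel) reached through a stack
    of 1-D atrous layers with the given dilation rates."""
    half = K // 2
    offsets = {0}
    for d in rates:
        taps = [d * (k - half) for k in range(K)]
        # Every offset reached so far fans out to K taps of the next layer.
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)


single = sampled_offsets([3])        # one layer, d = 3
stacked = sampled_offsets([3, 6])    # d = 6 stacked on d = 3
pixels_2d = len(stacked) ** 2        # a 2-D 3x3 kernel samples a square grid

print(single)      # [-3, 0, 3]
print(stacked)     # [-9, -6, -3, 0, 3, 6, 9]
print(pixels_2d)   # 49
```

In two dimensions a single 3×3 layer samples only 9 pixels, so stacking raises the number of contributing pixels from 9 to 49.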

3.2.2 Larger receptive field

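In ASPP(6, 12, 18, 24) from DeepLab V2 and V3, the atrous convolutions run in parallel,

so the maximum receptive field is that of the largest branch. Writing $R_{K,d} = (d - 1) \times (K - 1) + K$:

$$R_{max} = \max[R_{3,6}, R_{3,12}, R_{3,18}, R_{3,24}] = R_{3,24} = 49$$

DenseASPP, in contrast, uses skip connections to link all the layers, so that convolutions with large and small dilation rates depend on each other. Applying the stacking rule $K_{1} + K_{2} - 1$ across the four layers, its maximum receptive field is:

$$R_{max} = R_{3,6} + R_{3,12} + R_{3,18} + R_{3,24} - 3 = 13 + 25 + 37 + 49 - 3 = 121$$

The receptive field is clearly much larger.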
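A short numerical comparison of the two layouts, assuming the single-layer rule R = (d − 1)(K − 1) + K and the stacking rule K1 + K2 − 1 from Section 3.2.1. The variable names are illustrative, and the values are simply what those two rules give.

```python
def receptive_field(K, d):
    """Equivalent receptive field of one atrous layer."""
    return (d - 1) * (K - 1) + K


rates = (6, 12, 18, 24)
branch_sizes = [receptive_field(3, d) for d in rates]

# Parallel branches (ASPP): the largest single branch sets the limit.
aspp_max = max(branch_sizes)

# Cascaded layers (DenseASPP): the longest path passes through every layer.
dense_aspp_max = sum(branch_sizes) - (len(rates) - 1)

print(branch_sizes)     # [13, 25, 37, 49]
print(aspp_max)         # 49
print(dense_aspp_max)   # 121
```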

3.3 Model size control

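Similar to the bottleneck in DenseNet, DenseASPP places a 1×1 convolution before each atrous convolution to reduce the number of feature-map channels and thus the parameter count.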

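Suppose each atrous layer outputs n feature maps and DenseASPP takes $c_{0}$ feature maps as input. The l-th atrous layer then receives $c_{l}$ input feature maps:

$$c_{l} = c_{0} + n \times (l - 1)$$

In other words, the input grows by n channels from one layer to the next.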

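In the paper's setting, the 1×1 convolution before each atrous layer reduces the channel count to $c_{0}/2$ (so the input to every atrous layer has half as many channels as the DenseASPP input $c_{0}$), and the growth rate is set to $n=\frac{c_{0}}{8}$. With L atrous layers of kernel size K, the total number of parameters in DenseASPP is:

$$S = \sum_{l=1}^{L}\left[c_{l} \times 1^{2} \times \frac{c_{0}}{2} + \frac{c_{0}}{2} \times K^{2} \times n\right] = \frac{c_{0}}{32}\left(15 + L + 2K^{2}\right)c_{0}L$$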

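To unpack the first step of the formula: for each l, parameters = (input channels $c_{l}$ of the current layer) × (1×1 kernel) × $c_{0}$ /2 + $c_{0}$ /2 × (K^2 of the current atrous layer) × (growth rate n). That is, each 1×1 convolution takes $c_{l}$ channels as input and outputs $c_{0}$ /2. The atrous layer that follows takes $c_{0}$ /2 channels as input and outputs n. Its output is then concatenated with the outputs of the earlier layers and fed to the next layer's 1×1 convolution.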

In this setting, the parameter count is on the order of 1×10^6, nearly 10 times smaller than DenseNet121.
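The per-layer count described above (a 1×1 convolution from $c_{l}$ to $c_{0}/2$ channels, followed by a K×K atrous convolution from $c_{0}/2$ to n channels) can be summed directly. In the sketch below, the values c0 = 512 and L = 5 are assumptions chosen for illustration and are not stated in these notes. Biases and normalization parameters are ignored, and the function names are hypothetical.

```python
def dense_aspp_params(c0, L, K=3):
    """Sum convolution weights layer by layer (biases and norm layers ignored)."""
    n = c0 // 8        # growth rate: channels produced by each atrous layer
    mid = c0 // 2      # channels after the 1x1 bottleneck
    total = 0
    for l in range(1, L + 1):
        c_l = c0 + n * (l - 1)               # input channels of layer l
        total += c_l * 1 * 1 * mid           # 1x1 conv: c_l -> c0/2
        total += mid * K * K * n             # KxK atrous conv: c0/2 -> n
    return total


def dense_aspp_params_closed_form(c0, L, K=3):
    """Closed form obtained by carrying out the sum over l = 1..L."""
    return c0 * c0 * L * (15 + L + 2 * K * K) // 32


# Assumed example values: 512 input channels, 5 atrous layers, 3x3 kernels.
total = dense_aspp_params(c0=512, L=5, K=3)
closed = dense_aspp_params_closed_form(c0=512, L=5, K=3)

print(total)    # 1556480
print(closed)   # 1556480
```

Under these assumed values the total is about 1.56×10^6, which agrees with the order of magnitude quoted above.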