【Dilated Conv】《Multi-Scale Context Aggregation by Dilated Convolutions》

bryant_meng

Article link: https://blog.csdn.net/bryant_meng/article/details/81263591

ICLR-2016

Contents

1 Background and Motivation
2 Advantages / Contributions
3 Method
4 Experiments
5 Conclusion (own)

1 Background and Motivation

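Many computer vision problems are dense prediction problems: every pixel is assigned a continuous or discrete label.

The most prominent example is semantic segmentation. The FCN paper showed that CNNs designed for classification can also be applied successfully to dense prediction.

Because classification and dense prediction are different tasks, the authors ask two questions:

When a classification CNN is repurposed for dense prediction, which aspects of the architecture are truly necessary, and which ones reduce dense prediction accuracy?
Can a module designed specifically for dense prediction, added on top of the CNN, improve accuracy further?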

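Current classification networks integrate multi-scale contextual information through successive downsampling and finally output a global prediction (in the extreme, the feature map shrinks to a single pixel). Dense prediction, however, needs both multi-scale contextual reasoning and full-resolution output, and these two requirements conflict. Existing work resolves this conflict in two ways:

Repeated upsampling to recover the lost resolution, carrying the global perspective from the downsampled layers back up. Open question: is severe intermediate downsampling really necessary? Must the network throw the resolution away only to rebuild it afterwards?
Feeding multiple rescaled versions of the image into the network and combining the predictions. Open question: it is unclear whether analyzing the rescaled input images separately is truly necessary.

The authors propose dilated convolution to ease this conflict: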

aggregates multi-scale contextual information without losing resolution or analyzing rescaled images
support exponential expansion of the receptive field without loss of resolution or coverage.

2 Advantages / Contributions

develop a new convolutional network module that is specifically designed for dense prediction. (Dilated Conv)
presented context module increases the accuracy of state-of-the-art semantic segmentation systems. (Multi-scale Context Aggregation)
examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy (two max-pooling layers of VGG-16 are removed)

3 Method

3.1 Dilated Convolution

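Conventional convolution

The definition is not very intuitive in words, but the formula makes it clear (this is true convolution, not correlation; see the post 别怕，"卷积"其实很简单):

$(F * k)(\mathbf{p}) = \sum_{\mathbf{s}+\mathbf{t}=\mathbf{p}} F(\mathbf{s})\,k(\mathbf{t})$

The symbols in the conventional convolution are:
- $F$ is the feature map, a discrete function $F:\mathbb{Z}^2 \to \mathbb{R}$
- $k$ is the convolution kernel, defined on $\Omega_r = [-r, r]^2 \cap \mathbb{Z}^2$, so it has $(2r+1)^2$ taps
- $\mathbf{p}$ is the output position, $\mathbf{s}$ ranges over input positions and $\mathbf{t}$ over kernel offsets; $r$ is the kernel radius, so one layer has a receptive field of $(2r+1)\times(2r+1)$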

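Dilated convolution

The only addition is a dilation factor $l$:

$(F *_l k)(\mathbf{p}) = \sum_{\mathbf{s}+l\mathbf{t}=\mathbf{p}} F(\mathbf{s})\,k(\mathbf{t})$

With $l = 1$ this reduces to the conventional convolution.

[image]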
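
To make the role of the dilation factor concrete, here is a minimal 1-D sketch of the definition $(F *_l k)(p) = \sum_{s + lt = p} F(s)\,k(t)$ in plain Python. The function name `dilated_conv1d` and the zero treatment outside the signal are assumptions made for illustration; this is not the authors' implementation.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1-D dilated convolution: out(p) = sum over s + l*t = p of F(s) * k(t).

    `kernel` has odd length 2r + 1 and is indexed by t in [-r, r].
    Values outside the signal are treated as zero (an assumption).
    """
    r = len(kernel) // 2
    out = []
    for p in range(len(signal)):
        total = 0.0
        for t in range(-r, r + 1):
            s = p - dilation * t              # enforces s + l*t = p
            if 0 <= s < len(signal):
                total += signal[s] * kernel[t + r]
        out.append(total)
    return out


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
kernel = [1, 2, 3]                            # k(-1), k(0), k(1)
print(dilated_conv1d(impulse, kernel, dilation=1))
# [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
print(dilated_conv1d(impulse, kernel, dilation=2))
# [0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0]
```

The impulse response shows the effect directly: with dilation 2 the same three taps are spread over five positions, with no extra parameters and no change in output length.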

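With conventional convolution, stacking 3×3 layers gives $F_1$ a receptive field of 3×3, $F_2$ 5×5, and $F_3$ 7×7: the receptive field grows linearly.
With dilated convolution, as the figure illustrates, each stacked 3×3 layer uses an exponentially growing dilation factor ($F_{i+1} = F_i *_{2^i} k_i$, i.e. 1, 2, 4, ...), so the receptive field grows exponentially with depth. The receptive field of $F_{i+1}$ is $(2^{i+2}-1)\times(2^{i+2}-1)$:
- A 3×3 convolution is a 1-dilated convolution; stacking one raises the receptive field by 1 level (its side length grows by 2)
- $F_2$ applies a 2-dilated convolution to $F_1$, which raises the receptive field by 2 levels: $F_1$ is 3×3, so $F_2$ is 7×7
- $F_3$ applies a 4-dilated convolution to $F_2$, which raises the receptive field by 4 levels: $F_2$ is 7×7, so $F_3$ is 15×15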
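
The "levels" bookkeeping above can be checked with a few lines of Python. This is a small sketch under the assumption of stride-1 layers; the helper name `receptive_fields` is made up for this example. Each layer with kernel size $k$ and dilation $d$ adds $(k-1)\cdot d$ to the side length of the receptive field.

```python
def receptive_fields(dilations, kernel_size=3):
    """Side length of the receptive field after each stacked stride-1 layer."""
    rf = 1                                    # a single pixel before any layer
    sizes = []
    for d in dilations:
        rf += (kernel_size - 1) * d           # a d-dilated 3x3 adds 2*d
        sizes.append(rf)
    return sizes


print(receptive_fields([1, 1, 1]))            # linear growth: [3, 5, 7]
print(receptive_fields([1, 2, 4]))            # exponential growth: [3, 7, 15]

# Check the closed form (2^(i+2) - 1) for F_{i+1} with dilation 2^i.
exponential = receptive_fields([2 ** i for i in range(6)])
assert exponential == [2 ** (i + 2) - 1 for i in range(6)]
```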

3.2 Multi-scale Context Aggregation

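This module captures multi-scale context. It takes a feature map with C channels as input and produces a feature map with C channels as output, so input and output have the same form. It stacks 3×3 dilated convolutions to enlarge the receptive field, and the resolution never drops.
[image: Table 1 of the paper, the context module architecture]
Given the analysis in Section 3.1, the receptive fields in this table are easy to follow:

$F_4$ applies a 4-dilated convolution to $F_3$, raising the receptive field by 4 levels: $F_3$ is 9×9, so $F_4$ is 17×17
$F_5$ applies an 8-dilated convolution to $F_4$, raising the receptive field by 8 levels: $F_4$ is 17×17, so $F_5$ is 33×33
$F_6$ applies a 16-dilated convolution to $F_5$, raising the receptive field by 16 levels: $F_5$ is 33×33, so $F_6$ is 65×65

Pointwise truncation is max(·, 0), i.e. a ReLU applied independently to every element of the feature map

In the experiments this module is attached to feature maps of resolution 64×64, so the dilation stops growing after layer 6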
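
The following sketch traces the basic context module layer by layer. The dilations 1, 1, 2, 4, 8, 16 match the receptive fields discussed above; layers 7 and 8 (an undilated 3×3 and a final 1×1) are taken from the paper's Table 1 rather than from the text here. The "same" padding of `dilation * (kernel - 1) // 2` per side is an assumption used to show how the 64×64 resolution can be preserved; the released code may handle padding differently.

```python
# (kernel size, dilation) for each layer of the basic context module.
CONTEXT_LAYERS = [(3, 1), (3, 1), (3, 2), (3, 4), (3, 8), (3, 16), (3, 1), (1, 1)]


def trace_context_module(layers, size=64):
    """Return (layer index, dilation, receptive field, output size) per layer."""
    rows = []
    rf = 1
    for index, (kernel, dilation) in enumerate(layers, start=1):
        span = dilation * (kernel - 1)        # extra extent of the dilated kernel
        rf += span
        pad = span // 2                       # assumed "same" padding per side
        size = size + 2 * pad - span          # stride-1 output size
        rows.append((index, dilation, rf, size))
    return rows


for index, dilation, rf, size in trace_context_module(CONTEXT_LAYERS):
    print(f"layer {index}: dilation {dilation:2d}, "
          f"receptive field {rf}x{rf}, output {size}x{size}")
```

Running it lists receptive fields 3, 5, 9, 17, 33, 65, 67, 67 with the output staying at 64×64 throughout, which is why further dilation beyond layer 6 would add little on a 64×64 input.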

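Table 1: with random initialization this module does not work well, so the authors use identity initialization. (In a CNN we sometimes want the initial weights to pass the previous layer's feature map to the next layer unchanged: for the convolution $F_2 = F_1 * w$, we initialize the weights $w$ so that $F_2 = F_1$. Initializing $w$ this way is called identity initialization; see the post tensorflow参数初始化–identity initializtion.) The basic variant in Table 1 is initialized as follows:
$k^b(\mathbf{t}, a) = 1_{[\mathbf{t}=0]}\,1_{[a=b]}$

$a$ and $b$ are the indices of the input and output feature maps
$1_{[a=b]}$ is an indicator; over all pairs $(a, b)$ it forms an identity matrix, equal to 1 where a = b
$\mathbf{t}$ is the spatial offset inside the kernel, the same $\mathbf{t}$ as in the convolution definition, so $1_{[\mathbf{t}=0]}$ keeps only the center tap. The scheme is adapted from the paper 《A simple way to initialize recurrent networks of rectified linear units》, whose core idea is the use of the identity matrix or its scaled version to initialize the recurrent weight matrix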
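
A small pure-Python sketch can confirm that the rule $k^b(\mathbf{t}, a) = 1_{[\mathbf{t}=0]}\,1_{[a=b]}$ passes a feature map through unchanged, whatever the dilation. The helpers `identity_kernel` and `conv2d_same` are hypothetical names written for this example, with zero padding assumed at the borders.

```python
def identity_kernel(channels, size=3):
    """kernel[b][a][y][x] = 1 only at the center tap and only when a == b."""
    c = size // 2
    return [[[[1.0 if (a == b and y == c and x == c) else 0.0
               for x in range(size)]
              for y in range(size)]
             for a in range(channels)]
            for b in range(channels)]


def conv2d_same(features, kernel, dilation=1):
    """Multi-channel dilated convolution with zero padding (output keeps H x W).

    features[a][y][x] is the input, kernel[b][a][ky][kx] the weights.
    """
    n_out, n_in = len(kernel), len(features)
    h, w = len(features[0]), len(features[0][0])
    size = len(kernel[0][0])
    r = size // 2
    out = [[[0.0] * w for _ in range(h)] for _ in range(n_out)]
    for b in range(n_out):
        for a in range(n_in):
            for y in range(h):
                for x in range(w):
                    for ky in range(size):
                        for kx in range(size):
                            sy = y - dilation * (ky - r)
                            sx = x - dilation * (kx - r)
                            if 0 <= sy < h and 0 <= sx < w:
                                out[b][y][x] += features[a][sy][sx] * kernel[b][a][ky][kx]
    return out


features = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]],
]
for dilation in (1, 2, 4):
    assert conv2d_same(features, identity_kernel(2), dilation) == features
print("identity initialization reproduces the input for every dilation tested")
```

Because only the center tap is nonzero, the dilation factor has no influence at initialization: the module starts out as a no-op and training then learns to use the surrounding context.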

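The large variant in Table 1 is initialized as follows (We generalize the initialization scheme to account for the difference in the number of feature maps in different layers.), because the numbers of input and output channels now differ and plain identity initialization no longer applies:
$k^b(\mathbf{t}, a) = \begin{cases} \dfrac{C}{c_{i+1}} & \mathbf{t} = 0 \text{ and } \left\lfloor \dfrac{aC}{c_i} \right\rfloor = \left\lfloor \dfrac{bC}{c_{i+1}} \right\rfloor \\ \varepsilon & \text{otherwise} \end{cases}$

$c_i$ and $c_{i+1}$ are the numbers of feature channels in two consecutive layers ($C$ is assumed to divide both)
$\varepsilon \sim N(0,\sigma^2)$, where the standard deviation satisfies $\sigma \ll C/c_{i+1}$

Concretely, as in the figure below, take the input and output to be column vectors of dimension $c_{i}$ and $c_{i+1}$. Ignoring the floor operations, the condition in the first case simplifies to $ac_{i+1} = bc_{i}$. If this reasoning is right, only the last corner in the figure below satisfies that simplified condition. With the floor operations kept, the condition holds for every pair $(a, b)$ that falls into the same one of the $C$ channel groups.

[image]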
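
The grouped rule can be explored with a short sketch. It assumes the initialization reads "weight $C/c_{i+1}$ at the center tap when $\lfloor aC/c_i \rfloor = \lfloor bC/c_{i+1} \rfloor$, small Gaussian noise otherwise", and it builds only the center-tap matrix (all off-center taps would be noise). The function names and the channel widths `[C, 2C, 4C, C]` are hypothetical, chosen for illustration and not the widths used in the paper. Noise is switched off (`sigma=0`) so the result is exact.

```python
import random


def large_init_center(c_in, c_out, C, sigma=0.0, seed=0):
    """Center-tap weights w[b][a] for a layer with c_in inputs and c_out outputs."""
    if c_in % C or c_out % C:
        raise ValueError("C must divide both channel counts")
    rng = random.Random(seed)
    weights = []
    for b in range(c_out):
        row = []
        for a in range(c_in):
            same_group = (a * C) // c_in == (b * C) // c_out
            if same_group:
                row.append(C / c_out)
            else:
                # epsilon ~ N(0, sigma^2); exactly zero when noise is disabled
                row.append(rng.gauss(0.0, sigma) if sigma > 0 else 0.0)
        weights.append(row)
    return weights


def matvec(weights, x):
    """Apply the center-tap weights to one pixel's channel vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


C = 2
widths = [C, 2 * C, 4 * C, C]                 # hypothetical channel widths
x = [3.0, 5.0]                                # one pixel with C channels
for c_in, c_out in zip(widths, widths[1:]):
    x = matvec(large_init_center(c_in, c_out, C), x)
print(x)                                      # [3.0, 5.0]
```

Each layer scales a group's value by $c_i / c_{i+1}$, so the factors telescope and a stack that starts and ends with $C$ channels reproduces its input, which is the identity behaviour the basic variant gets directly.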

3.3 Front End

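The front end uses VGG-16 and removes the last two pooling and striding layers entirely

Let's revisit VGGNet from ICLR-2015, 《Very Deep Convolutional Networks for Large Scale Image Recognition》
[image: VGGNet architecture, with a red box around the removed part]
The authors' front-end model presumably removes the part in the red box above and dilates the convolutions that follow (by a factor of 2 for each removed pooling layer, so up to a factor of 4), which keeps the resolution while leaving the receptive field unchanged!

The output resolution is 64×64, with C = 21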
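
The trade of stride for dilation can be written down in a few lines. This sketch assumes a VGG-16-like layout with five stride-2 pooling stages; `dilate_instead_of_stride` is a hypothetical helper, and the returned list gives the dilation for the convolutions that follow each pooling stage.

```python
def dilate_instead_of_stride(pool_strides, n_removed):
    """Drop the last `n_removed` pooling stages and compensate with dilation.

    Returns (output stride, dilation of the layers following each stage).
    """
    keep = len(pool_strides) - n_removed
    output_stride = 1
    dilation = 1
    dilations = []
    for index, stride in enumerate(pool_strides):
        if index < keep:
            output_stride *= stride           # stage kept: resolution shrinks
        else:
            dilation *= stride                # stage removed: dilate what follows
        dilations.append(dilation)
    return output_stride, dilations


vgg_pools = [2, 2, 2, 2, 2]                   # five 2x poolings in VGG-16
print(dilate_instead_of_stride(vgg_pools, 0)) # (32, [1, 1, 1, 1, 1])
print(dilate_instead_of_stride(vgg_pools, 2)) # (8, [1, 1, 1, 2, 4])
```

Removing two poolings reduces the output stride from 32 to 8, and the layers after the second removed pooling need dilation 4 to see the same image extent as before.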

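Now the comparison with FCN and DeepLab-v1 on the VOC-2012 dataset
[image]
Qualitative results:

For reference, FCN-8s as shown in the paper 《Fully Convolutional Networks for Semantic Segmentation》:
[image]
Note that the authors' method beats even the DeepLab+CRF entry on the leaderboard (67.6% vs. 66.4%), although it uses no CRF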

4 Experiments

4.1 Datasets

PASCAL VOC 2012
Microsoft COCO (used for training; VOC categories are treated as foreground, everything else as background)
urban scene understanding
- CamVid dataset: 367 training images, 100 validation images, and 233 test images. 11 semantic classes are used.
- KITTI dataset: 100 training images and 46 test images
- Cityscapes dataset: 2975 training images, 500 validation images, and 1525 test images

Training has two stages: first train on VOC 12 and COCO together, then fine-tune on VOC 12

The front-end module alone reaches 69.8% mIoU on the VOC-2012 validation set and 71.3% mIoU on the test set
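
For readers unfamiliar with the metric, mIoU (mean intersection over union) can be computed from a confusion matrix as below. This is a sketch of the standard definition, not the paper's or the VOC server's evaluation code; `mean_iou` and the toy matrix are made up for illustration.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix (rows: ground truth, cols: prediction)."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                           # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp      # pixels wrongly labeled c
        denom = tp + fp + fn
        if denom > 0:                                         # skip absent classes
            ious.append(tp / denom)
    return sum(ious) / len(ious)


toy = [[3, 1],
       [1, 3]]                                # two classes, 8 pixels in total
print(mean_iou(toy))                          # each class: 3 / (3 + 1 + 1) = 0.6
```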

4.2 Controlled evaluation of context aggregation

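The Multi-scale Context Aggregation module is plugged into the front-end module

[image]

Adding a CRF or an RNN means adding a structured prediction module; for an introduction to structured prediction see the post 闲聊结构化预测（structured learning）

Table 3 shows that the front end and the context aggregation module work better together!

Now the results on the VOC 12 test set

[image]
CRF+RNN performs better still. The figures give a feel for the results

The authors also show some failure cases
[image]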

4.3 Urban Scene Understanding

1) CamVid dataset

front-end + context with 8 layers, abbreviated as Dilation8
[image]

2) KITTI dataset

front-end + context with 7 layers, abbreviated as Dilation7
[image]

3) Cityscapes dataset

front-end + context with 10 layers, abbreviated as Dilation10

[image]

Training has three stages: the first trains the front end, the second trains the context module, and the third trains both jointly
[image]

5 Conclusion (own)

dense prediction
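we have also shown that the accuracy of existing convolutional networks for semantic segmentation can be increased by removing vestigial components that had been developed for image classification. (This closes the loop with the opening, which noted that some components of classification networks may hurt segmentation accuracy.)
aggregates multi-scale contextual information without losing resolution or analyzing rescaled images, support exponential expansion of the receptive field without loss of resolution or coverage. These two sentences are the key takeaways
Things to read up on: structured learning and identity initialization. After looking at Figure 4, segmentation feels really hard, haha
The authors also look ahead: with better data sources, ImageNet pre-training might no longer be needed, so input and output could keep the same resolution and a better-suited architecture could be designed!
I have not read the code, so I am not sure where the context module is inserted into the front end. At the end? It does not seem so, since the paper says "plug". Code: https://github.com/fyu/dilation/blob/master/network.py (worth a look when there is time)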