Paper: https://arxiv.org/abs/1511.07122
Code: https://github.com/ndrplz/dilation-tensorflow
https://github.com/fyu/dilation (the authors' original Caffe implementation)
introduction
motivation
The pooling operations in CNNs hurt the accuracy of semantic segmentation.
This prompts new questions motivated by the structural differences between image classification and dense prediction. Which aspects of the repurposed networks are truly necessary and which reduce accuracy when operated densely? Can dedicated modules designed specifically for dense prediction improve accuracy further?
Although FCN adapted CNNs originally designed for image classification to perform well on semantic segmentation, image classification and dense prediction are still structurally different tasks.
Modern image classification networks integrate multi-scale contextual information via successive pooling and subsampling layers that reduce resolution until a global prediction is obtained.
In contrast, dense prediction calls for multiscale contextual reasoning in combination with full-resolution output.
The pooling operations in a CNN reduce resolution and thereby lose spatial information, which conflicts with the goal of semantic segmentation: dense prediction requires multi-scale contextual reasoning combined with full-resolution output.
Prior solutions to the accuracy loss caused by pooling
Recovering the lost resolution with deconvolution (up-convolution)
Before this paper, two approaches had been proposed to address this problem.
One approach involves repeated up-convolutions that aim to recover lost resolution while carrying over the global perspective from downsampled layers (Noh et al., 2015; Fischer et al., 2015).
The main idea is to add up-convolutions to recover the lost resolution.
Both 《Learning deconvolution network for semantic segmentation.》 and
《Learning optical flow with convolutional neural networks.》 use this idea.
《Learning deconvolution network for semantic segmentation.》
See https://cloud.tencent.com/developer/article/1008415 for detailed notes on this paper.
This leaves open the question of whether severe intermediate downsampling was truly necessary.
This raises the question of whether the intermediate downsampling operations are really necessary.
Another approach involves providing multiple rescaled versions of the image as input to the network and combining the predictions obtained for these multiple inputs.
The main idea is to feed the network input images at multiple scales and combine the resulting predictions.
《Learning hierarchical features for scene labeling.》, 《Efficient piecewise training of deep structured models for semantic segmentation.》, and 《Scale-aware semantic image segmentation.》 all use this idea.
Again, it is not clear whether separate analysis of rescaled input images is truly necessary.
Again, there is a question here: is it really necessary to analyze the rescaled input images separately?
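To make the second approach concrete, here is a minimal NumPy sketch of multi-scale prediction fusion; the thresholding "predictor", the nearest-neighbour rescaling, and the averaging combiner are all toy stand-ins, not the pipeline used by any of the cited papers:

```python
import numpy as np

# Toy stand-in for a per-pixel predictor (a real segmentation network
# would go here).
def predict(img):
    return (img > img.mean()).astype(float)

def downscale(img, f):
    return img[::f, ::f]                      # nearest-neighbour subsample

def upscale(pred, f):
    return np.repeat(np.repeat(pred, f, axis=0), f, axis=1)

rng = np.random.default_rng(0)
img = rng.random((32, 32))

# Run the same predictor on rescaled copies, resize each prediction back
# to full resolution, and average the results.
preds = [upscale(predict(downscale(img, f)), f) for f in (1, 2, 4)]
fused = np.mean(preds, axis=0)
print(fused.shape)                            # (32, 32)
```

The point of the sketch is only the data flow: each scale is analyzed independently and the outputs are merged afterwards, which is exactly the redundancy the paper questions.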
Therefore, the authors want to use dedicated modules designed specifically for dense prediction to further improve semantic segmentation accuracy.
contribution
In this work, we develop a convolutional network module that aggregates multi-scale contextual information without losing resolution or analyzing rescaled images. The module can be plugged into existing architectures at any resolution. Unlike pyramid-shaped architectures carried over from image classification, the presented context module is designed specifically for dense prediction. It is a rectangular prism of convolutional layers, with no pooling or subsampling. The module is based on dilated convolutions, which support exponential expansion of the receptive field without loss of resolution or coverage.
The authors propose a convolutional network module that aggregates multi-scale contextual information without losing resolution. The module can be plugged into existing architectures at any resolution (it is pluggable because its input and output are both C feature maps, i.e., the same form). Unlike the pyramid-shaped architectures carried over from image classification, the presented context module is designed specifically for dense prediction. It has no pooling or subsampling operations. The network is based mainly on dilated convolutions, which support exponential expansion of the receptive field without loss of resolution or coverage. [In other words, a large receptive field can be obtained with dilated convolutions alone, without any downsampling.]
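The exponential receptive-field growth can be checked with a minimal 1-D NumPy sketch; the 3-tap all-ones kernel and the impulse input are illustrative assumptions, used only to trace which input positions each output can see:

```python
import numpy as np

# A 1-D dilated ("atrous") convolution: the 3-tap kernel is applied with
# gaps of `dilation` between taps, with 'same' padding so the output keeps
# the input length (no pooling or subsampling).
def dilated_conv1d(x, w, dilation):
    pad = dilation * (len(w) - 1) // 2
    xp = np.pad(x, pad)
    return np.array([
        sum(w[k] * xp[i + k * dilation] for k in range(len(w)))
        for i in range(len(x))
    ])

x = np.zeros(31)
x[15] = 1.0                      # unit impulse
w = np.ones(3)                   # all-ones kernel to trace the receptive field
for d in (1, 2, 4):              # dilations double layer by layer
    x = dilated_conv1d(x, w, d)

# The nonzero span of the output is the receptive field of the 3-layer stack.
print(np.count_nonzero(x))       # 15 taps: RF grows 3 -> 7 -> 15
```

Three stacked 3-tap layers with dilations 1, 2, 4 reach a receptive field of 15 while the output length stays 31, which is the "exponential expansion without loss of resolution or coverage" claimed above.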
DILATED CONVOLUTIONS
related work
In recent work on convolutional networks for semantic segmentation,
- Long et al. (2015) analyzed filter dilation but chose not to use it. 《Fully convolutional networks for semantic segmentation.》
  Long analyzed dilated filters but chose not to use them.
- Chen et al. (2015a) used dilation to simplify the architecture of Long et al. (2015). 《Semantic image segmentation with deep convolutional nets and fully connected CRFs.》
  Chen used dilation to simplify Long's network architecture.
In contrast, we develop a new convolutional network architecture that systematically uses dilated convolutions for multi-scale context aggregation.
The authors develop a new convolutional network architecture that systematically uses dilated convolutions for multi-scale context aggregation.
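The receptive-field schedule of such a stack is easy to verify by hand. A sketch, assuming the dilation sequence 1, 1, 2, 4, 8, 16 of the basic context module described in the paper (all 3×3, stride 1; the module's final 1×1 layer adds nothing to the receptive field):

```python
# With stride 1 throughout, a 3x3 layer with dilation d adds
# (3 - 1) * d to the receptive field.
dilations = [1, 1, 2, 4, 8, 16]
rf = 1
rfs = []
for d in dilations:
    rf += 2 * d            # (kernel_size 3 - 1) * dilation
    rfs.append(rf)
print(rfs)                 # [3, 5, 9, 17, 33, 65]
```

Doubling the dilation each layer grows the receptive field exponentially in depth, whereas stacking ordinary 3×3 convolutions would grow it only linearly (3, 5, 7, 9, ...).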
Preliminaries (not part of the paper)
Receptive field
What is a receptive field?
The receptive field describes the region of the original input image that a neuron at a given position in the network can "see". The larger a neuron's receptive field, the larger the region of the original image it has access to, and the more global and semantically higher-level its features tend to be; the smaller the receptive field, the more local and detail-oriented its features are. The receptive field size can therefore be used as a rough indicator of each layer's level of abstraction.
In the illustrating example, each unit of Conv1 can see a 3×3 region of the original image, and each unit of Conv2 is built from a 2×2 region of Conv1; since Conv1 is produced with stride 2, tracing back to the original image a Conv2 unit can actually see a 5×5 region. So we say the receptive field of Conv1 is 3 and that of Conv2 is 5. The receptive field of each unit of the input image is defined as 1, which is easy to understand: each pixel can only see itself.
How to compute the receptive field
RF_{l+1} = RF_l + (kernel_size_{l+1} − 1) × feature_stride_l
where feature_stride_l is the product of the strides of all layers up to and including layer l.
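The recurrence above can be sketched in a few lines of Python; the layer list passed in at the end reproduces the Conv1/Conv2 example from the preliminaries (3×3 stride 2, then 2×2 stride 1 as a hypothetical second layer):

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride) pairs, one per layer.

    Applies RF_{l+1} = RF_l + (kernel_size_{l+1} - 1) * feature_stride_l,
    where feature_stride_l is the product of the strides of layers 1..l.
    """
    rf = 1                 # each input pixel sees only itself
    feature_stride = 1     # cumulative stride of all previous layers
    rfs = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * feature_stride
        feature_stride *= stride
        rfs.append(rf)
    return rfs

# Conv1: 3x3 kernel, stride 2; Conv2: 2x2 kernel, stride 1.
print(receptive_fields([(3, 2), (2, 1)]))  # [3, 5]
```

Note that the stride enters the formula one layer late: a layer's own stride only affects the receptive fields of the layers after it.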