A Brief Look at the Design Details in DeepLab v1-v3

江南綿雨

Last modified: 2022-09-08 10:25:15

Category: CNN Segmentation Series. Tag: deep learning

First published: 2022-03-26 19:43:42

Article link: https://blog.csdn.net/weixin_43702653/article/details/123760841

Contents

1: Introduction to FCN
2: DeepLab-VGG, an elegant improvement over FCN
3: The hole algorithm
4: Atrous Spatial Pyramid Pooling (ASPP)
5: Fully-Connected CRFs
6: DeepLab v3+, the complete form

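1: Introduction to FCN

FCN classifies an image at the pixel level. Every pixel is treated as a training sample: the network predicts its class and computes a softmax classification loss for it. This advance made semantic-level image segmentation workable.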

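FCN is a milestone in image segmentation. It replaces the final FC layers of a CNN with convolutional layers, so the network accepts input images of any size.
A deconvolution layer upsamples the feature map (heatmap) of the last convolutional layer back to the size of the input image, which preserves the spatial information of the original image. Classification is then done pixel by pixel on the upsampled feature map, so every pixel of the input gets a prediction, and the softmax classification loss is computed per pixel.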
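The per-pixel loss described above can be sketched as follows. This is a minimal illustration, not FCN's actual Caffe layer: the function name `pixelwise_softmax_loss` and the flat row-major layout of `scores` are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean softmax cross-entropy over all pixels.
// scores: row-major [num_pixels x num_classes] class scores (assumed layout).
// labels: ground-truth class index for each pixel.
double pixelwise_softmax_loss(const std::vector<double>& scores,
                              const std::vector<int>& labels,
                              int num_classes) {
  assert(num_classes > 0);
  assert(scores.size() == labels.size() * static_cast<std::size_t>(num_classes));
  double total = 0.0;
  for (std::size_t p = 0; p < labels.size(); ++p) {
    assert(labels[p] >= 0 && labels[p] < num_classes);
    const double* s = &scores[p * num_classes];
    // Subtract the maximum score for numerical stability.
    const double max_s = *std::max_element(s, s + num_classes);
    double sum_exp = 0.0;
    for (int c = 0; c < num_classes; ++c) sum_exp += std::exp(s[c] - max_s);
    // -log softmax(s)[label] = log(sum_exp) - (s[label] - max_s)
    total += std::log(sum_exp) - (s[labels[p]] - max_s);
  }
  return labels.empty() ? 0.0 : total / static_cast<double>(labels.size());
}
```

When all class scores of a pixel are equal, the loss for that pixel is the logarithm of the number of classes, which is a handy sanity check.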

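FCN also uses skip layers: predictions taken from shallower layers, which need a smaller upsampling stride, form a fine layer that is fused with the coarse layer from the deeper part of the network, and the fused result is upsampled to produce the output. This balances local and global information, which the paper calls combining what and where, and it gives a solid improvement: FCN-32s reaches 59.4 mean IU, FCN-16s 62.4, and FCN-8s 62.7. The gain is clear, particularly from 32s to 16s.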
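The fusion step can be sketched on a single row of scores. This is a simplified illustration: FCN upsamples with a learned deconvolution, whereas the sketch below swaps in nearest-neighbour upsampling, and the names `upsample2x` and `fuse_skip` are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nearest-neighbour 2x upsampling of one row of scores
// (a stand-in for FCN's learned deconvolution).
std::vector<float> upsample2x(const std::vector<float>& coarse) {
  std::vector<float> out(coarse.size() * 2);
  for (std::size_t i = 0; i < out.size(); ++i) out[i] = coarse[i / 2];
  return out;
}

// Upsample the coarse scores by 2 and add the finer scores elementwise.
std::vector<float> fuse_skip(const std::vector<float>& coarse,
                             const std::vector<float>& fine) {
  std::vector<float> fused = upsample2x(coarse);
  assert(fused.size() == fine.size());  // the two layers must align after upsampling
  for (std::size_t i = 0; i < fused.size(); ++i) fused[i] += fine[i];
  return fused;
}
```

Applying `fuse_skip` once corresponds to the FCN-16s idea (stride-32 scores fused with stride-16 scores); applying it again with stride-8 scores corresponds to FCN-8s.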

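The advantages of FCN are:

It accepts input images of any size (there are no fully connected layers).
It is more efficient, avoiding the repeated computation and wasted memory of neighborhood (patch-based) approaches.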

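Its shortcomings are just as clear:

The results are not fine enough. 8x upsampling is much better than 32x, but the upsampled output is still blurry and insensitive to image detail.
It classifies each pixel independently and does not fully consider the relationships between pixels, so it lacks spatial consistency.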

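2: DeepLab-VGG, an elegant improvement over FCN

DeepLab was developed by Google on top of FCN. To obtain a denser score map, FCN takes a 500x500 input image and adds a padding of 100 directly at the first convolutional layer, conv1_1, and even so it ends up with only a 16x16 score map at fc7.
DeepLab takes a much more elegant approach: it changes the stride of VGG's pool4 and pool5 layers from 2 to 1 and adds a padding of 1. This single change reduces the total stride of the VGG network from 32 to 8, so that for a 514x514 input, fc7 yields a 67x67 score map, far denser than FCN's. The feature map is now 1/8 of the input size, but the receptive field of the subsequent nodes changes as a result.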
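The change in total stride is simple arithmetic: VGG16 has five pooling layers, and the total stride is the product of their strides. A minimal sketch, with `total_stride` as a hypothetical helper name:

```cpp
#include <cassert>
#include <vector>

// Overall downsampling factor of a network = product of its per-layer strides.
int total_stride(const std::vector<int>& pool_strides) {
  int total = 1;
  for (int s : pool_strides) {
    assert(s >= 1);
    total *= s;
  }
  return total;
}
```

With the original strides {2, 2, 2, 2, 2} this gives 32; with pool4 and pool5 set to 1, i.e. {2, 2, 2, 1, 1}, it gives 8.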

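3: The hole algorithm

The authors therefore came up with a trick that reconciles two seemingly conflicting goals: reusing an already trained model for fine-tuning, and changing the network structure to obtain a denser score map. The solution is the hole algorithm. As panels (a) and (b) of the figure show, in ordinary convolution or pooling the adjacent weights of a filter act on physically contiguous positions of the feature map. As panel (c) shows, to keep the receptive field unchanged after a layer's stride goes from 2 to 1, the layers that follow must use the hole algorithm: the contiguous connections become skip connections spaced by the hole size (for ease of display, panel (c) draws this directly on the same layer). Do not be put off by the padding of 2 in panel (c): the two padding positions are never connected to the same filter at once.

When the stride of pool4 changes from 2 to 1, the layers right after it, conv5_1, conv5_2 and conv5_3, use a hole size of 2. When pool5 then changes from 2 to 1 as well, the following fc6 uses a hole size of 4.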
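The rule behind these numbers is that every stride removed from a layer multiplies the hole size of all later layers by the removed factor. The sketch below illustrates that bookkeeping; the `Layer` struct and `hole_sizes` function are hypothetical names, not part of the DeepLab code.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Layer {
  std::string name;
  int original_stride;  // stride in the pretrained network
  int new_stride;       // stride after the modification
};

// Accumulated hole size seen by each layer, in order. The text only states the
// values for the convolutional layers; the entries for the pooling layers
// themselves are nominal.
std::vector<int> hole_sizes(const std::vector<Layer>& layers) {
  std::vector<int> holes;
  int hole = 1;
  for (const Layer& l : layers) {
    assert(l.new_stride >= 1 && l.original_stride % l.new_stride == 0);
    holes.push_back(hole);                     // hole size in effect at this layer
    hole *= l.original_stride / l.new_stride;  // later layers inherit the removed stride
  }
  return holes;
}
```

For the sequence pool4 (2 to 1), conv5_1, conv5_2, conv5_3, pool5 (2 to 1), fc6, the function returns 2 for the three conv5 layers and 4 for fc6, matching the values above.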

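The implementation code follows:

The main changes are in im2col (forward pass) and col2im (backward pass), which gain the hole_w and hole_h parameters. Only the CPU version is shown here, for understanding: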

// forward: unroll image patches into columns; kernel taps are hole_h / hole_w apart
template <typename Dtype>
void im2col_cpu(const Dtype* data_im, 
    const int num, const int channels, const int height, const int width,
    const int kernel_h, const int kernel_w, const int pad_h, const int pad_w,
    const int stride_h, const int stride_w, const int hole_h, const int hole_w,
    Dtype* data_col) {
  // effective kernel if we expand the holes (trous)
  const int kernel_h_eff = kernel_h + (kernel_h - 1) * (hole_h - 1);
  const int kernel_w_eff = kernel_w + (kernel_w - 1) * (hole_w - 1);
  int height_col = (height + 2 * pad_h - kernel_h_eff) / stride_h + 1;
  int width_col = (width + 2 * pad_w - kernel_w_eff) / stride_w + 1;
  int channels_col = channels * kernel_h * kernel_w;
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels_col; ++c) {
      int w_offset = (c % kernel_w)  * hole_w;
      int h_offset = ((c / kernel_w) % kernel_h) * hole_h;
      int c_im = c / kernel_w / kernel_h;
      for (int h = 0; h < height_col; ++h) {
        const int h_im = h * stride_h + h_offset - pad_h;
        for (int w = 0; w < width_col; ++w) {
          const int w_im = w * stride_w + w_offset - pad_w;
          data_col[((n * channels_col + c) * height_col + h) * width_col + w] =
            (h_im >= 0 && h_im < height && w_im >= 0 && w_im < width) ?
            data_im[((n * channels + c_im) * height + h_im) * width + w_im] : 
            0.; // zero-pad
        } //width_col
      } //height_col
    } //channels_col
  } //num
}
 
// backward: accumulate column values back into the image positions they came from
template <typename Dtype>
void col2im_cpu(const Dtype* data_col,
    const int num, const int channels, const int height, const int width,
    const int kernel_h, const int kernel_w, const int pad_h, const int pad_w,
    const int stride_h, const int stride_w, const int hole_h, const int hole_w,
    Dtype* data_im) {
  caffe_set(num * channels * height * width, Dtype(0), data_im);
  const int kernel_h_eff = kernel_h + (kernel_h - 1) * (hole_h - 1);
  const int kernel_w_eff = kernel_w + (kernel_w - 1) * (hole_w - 1);
  int height_col = (height + 2 * pad_h - kernel_h_eff) / stride_h + 1;
  int width_col = (width + 2 * pad_w - kernel_w_eff) / stride_w + 1;
  int channels_col = channels * kernel_h * kernel_w;
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels_col; ++c) {
      int w_offset = (c % kernel_w)  * hole_w;
      int h_offset = ((c / kernel_w) % kernel_h) * hole_h;
      int c_im = c / kernel_w / kernel_h;
      for (int h = 0; h < height_col; ++h) {
        const int h_im = h * stride_h + h_offset - pad_h;
        for (int w = 0; w < width_col; ++w) {
          const int w_im = w * stride_w + w_offset - pad_w;
          if (h_im >= 0 && h_im < height && w_im >= 0 && w_im < width) {
            data_im[((n * channels + c_im) * height + h_im) * width + w_im] += 
              data_col[((n * channels_col + c) * height_col + h) * width_col + w];
          }
        }
      }
    }
  }
}
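To make the indexing easier to follow, here is a one-dimensional sketch of the same idea. It is an illustration only, not part of Caffe: `effective_kernel` mirrors the `kernel_w_eff` formula above, and `atrous_conv1d` is a hypothetical helper that applies the same index rule as `w_im` in `im2col_cpu`.

```cpp
#include <cassert>
#include <vector>

// Extent covered by a kernel of size k whose taps are `hole` samples apart.
int effective_kernel(int k, int hole) { return k + (k - 1) * (hole - 1); }

// 1-D atrous convolution (cross-correlation, as in Caffe) with zero padding.
std::vector<float> atrous_conv1d(const std::vector<float>& in,
                                 const std::vector<float>& w,
                                 int pad, int stride, int hole) {
  assert(stride >= 1 && hole >= 1 && pad >= 0 && !w.empty());
  const int k = static_cast<int>(w.size());
  const int n = static_cast<int>(in.size());
  const int span = n + 2 * pad - effective_kernel(k, hole);
  assert(span >= 0);  // the dilated kernel must fit inside the padded input
  const int out_len = span / stride + 1;
  std::vector<float> out(out_len, 0.0f);
  for (int o = 0; o < out_len; ++o) {
    for (int t = 0; t < k; ++t) {
      const int i = o * stride + t * hole - pad;    // same rule as w_im above
      if (i >= 0 && i < n) out[o] += w[t] * in[i];  // positions outside are zero padding
    }
  }
  return out;
}
```

With a 3-tap kernel and a hole size of 2, each output sums inputs that are two samples apart, so the kernel covers 5 input positions while still using only 3 weights.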

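4: Atrous Spatial Pyramid Pooling (ASPP)

Experiments show that DCNNs are not accurate enough for semantic segmentation. The root cause is the translation invariance of their high-level features, which stems from repeated pooling and downsampling.

To counter the loss of resolution caused by downsampling and pooling, DeepLab uses the atrous (hole) algorithm to enlarge the receptive field, replacing conventional pooling with atrous convolution to capture more context and to handle objects that appear at different scales. Inspired by SPP, the authors then adopted atrous spatial pyramid pooling.
[Figure]
Taking VGG16 as an example, the figure shows how ASPP works. The feature map output by pool5 (28×28×512 before processing) goes through four convolutions with different rates, and the four results are then concatenated, in the same way as in SPP.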
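A short shape calculation shows why the four branches can be combined: with a 3x3 kernel, stride 1 and padding equal to the rate, every branch keeps the 28×28 spatial size. The sketch below is an assumption-laden illustration: the rates 6, 12, 18, 24 and the 256 output channels per branch are example values that do not come from this article, and `Shape`, `aspp_branch` and `concat_channels` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct Shape {
  int height, width, channels;
};

// Output shape of one 3x3 atrous branch with stride 1 and padding equal to the rate.
Shape aspp_branch(const Shape& in, int rate, int out_channels) {
  assert(rate >= 1);
  const int k_eff = 3 + 2 * (rate - 1);  // effective extent of a 3x3 kernel
  const int pad = rate;                  // assumed: pad = rate keeps the spatial size
  return {in.height + 2 * pad - k_eff + 1, in.width + 2 * pad - k_eff + 1, out_channels};
}

// Channel-wise concatenation of branches that share one spatial size.
Shape concat_channels(const std::vector<Shape>& branches) {
  assert(!branches.empty());
  Shape out = branches.front();
  out.channels = 0;
  for (const Shape& b : branches) {
    assert(b.height == out.height && b.width == out.width);
    out.channels += b.channels;
  }
  return out;
}
```

Under these assumptions, four branches on a 28×28×512 input each produce 28×28×256, and their concatenation is 28×28×1024.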

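5: Fully-Connected CRFs

Segmentation maps produced by weak classifiers are usually rough and noisy, and CRFs are commonly used to smooth and denoise them. Maps produced by deep networks are already smooth and continuous, so what is needed is to recover the detail of local structures rather than to smooth further. Short-range CRFs still cannot recover the thin parts of a structure, so the authors use a fully connected conditional random field to recover detail, and it works well.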
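For orientation, the pairwise term commonly used in fully connected CRFs combines an appearance kernel (nearby pixels with similar colour) and a smoothness kernel (nearby pixels). The sketch below only evaluates that kernel for one pixel pair; it does not perform inference. The struct names and all parameter values are hypothetical placeholders, since the article does not give them.

```cpp
#include <cassert>
#include <cmath>

struct Pixel {
  double x, y;              // position
  double red, green, blue;  // colour
};

struct KernelParams {  // hypothetical values; tuned per dataset in practice
  double w_appearance = 1.0;
  double w_smoothness = 1.0;
  double sigma_alpha = 50.0;  // position scale of the appearance kernel
  double sigma_beta = 10.0;   // colour scale of the appearance kernel
  double sigma_gamma = 3.0;   // position scale of the smoothness kernel
};

// Affinity between two pixels: appearance (bilateral) term plus smoothness term.
double pairwise_kernel(const Pixel& a, const Pixel& b, const KernelParams& p) {
  assert(p.sigma_alpha > 0 && p.sigma_beta > 0 && p.sigma_gamma > 0);
  const double d_pos = (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y);
  const double d_col = (a.red - b.red) * (a.red - b.red) +
                       (a.green - b.green) * (a.green - b.green) +
                       (a.blue - b.blue) * (a.blue - b.blue);
  const double appearance = std::exp(-d_pos / (2 * p.sigma_alpha * p.sigma_alpha) -
                                     d_col / (2 * p.sigma_beta * p.sigma_beta));
  const double smoothness = std::exp(-d_pos / (2 * p.sigma_gamma * p.sigma_gamma));
  return p.w_appearance * appearance + p.w_smoothness * smoothness;
}
```

Because the kernel is defined for every pair of pixels rather than only for neighbours, distant pixels with similar colour can still influence each other, which is what lets the model sharpen thin structures.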

For a detailed explanation of CRFs, see: https://blog.csdn.net/qq_31347869/article/details/91128646
[Figure]
The figure compares results before and after CRF post-processing. With the CRF, the details improve noticeably:

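6: DeepLab v3+, the complete form

[Figure]
Compared with v3, v3+ introduces a decoder and an Xception backbone to improve the network's accuracy and reduce its computational cost.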

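1) Decoder

In DeepLab v3, the feature map is bilinearly upsampled by a factor of 16 in one step to the size of the input image, which cannot recover the details of the segmented objects. The paper therefore proposes a simple and effective decoder, shown in the figure. The encoder features come from DeepLab v3 (output_stride=16). They are first bilinearly upsampled by a factor of 4 and then concatenated with low-level features from the network that have the same spatial resolution. Before the concatenation, the low-level features pass through a 1×1 convolution that reduces their number of channels (the paper settles on 48, while the encoder output has 256 channels). After the concatenation, a few 3×3 convolutions refine the features, followed by another bilinear upsampling by a factor of 4.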
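The spatial sizes in this pipeline can be checked with a small calculation. The sketch assumes an input side length that divides evenly by 16 and low-level features at 1/4 of the input resolution, which is what the "same spatial resolution" requirement implies; `DecoderSizes` and `decoder_sizes` are hypothetical names.

```cpp
#include <cassert>

struct DecoderSizes {
  int encoder;         // encoder output side length, input / 16
  int after_first_up;  // after the first 4x bilinear upsampling
  int low_level;       // low-level feature map, assumed at input / 4
  int output;          // after the final 4x upsampling
};

// Side lengths through the decoder for output_stride = 16.
DecoderSizes decoder_sizes(int input) {
  const int output_stride = 16;
  assert(input > 0 && input % output_stride == 0);  // simplification: exact division
  DecoderSizes s;
  s.encoder = input / output_stride;
  s.after_first_up = s.encoder * 4;
  s.low_level = input / 4;
  assert(s.after_first_up == s.low_level);  // required for the concatenation
  s.output = s.after_first_up * 4;
  return s;
}
```

For a 512-pixel input this gives 32 at the encoder output, 128 after the first upsampling (matching the low-level features), and 512 at the output.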

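2) Xception

The authors modified the Xception model to suit image segmentation. The changes are: more layers; all max-pooling operations replaced by depthwise separable convolutions with striding; and batch normalization and ReLU added after each 3×3 depthwise convolution. Depthwise separable convolutions are also applied in the ASPP and decoder modules, mainly to make the network faster.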
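The speed benefit comes from the much smaller number of weights (and multiplications) in a depthwise separable convolution. A minimal comparison, ignoring biases, with hypothetical helper names:

```cpp
#include <cassert>

// Weights in a standard k x k convolution (bias ignored).
long long standard_conv_params(int k, int c_in, int c_out) {
  assert(k > 0 && c_in > 0 && c_out > 0);
  return 1LL * k * k * c_in * c_out;
}

// Weights in a depthwise separable convolution: one k x k filter per input
// channel (depthwise), followed by a 1x1 pointwise convolution.
long long separable_conv_params(int k, int c_in, int c_out) {
  assert(k > 0 && c_in > 0 && c_out > 0);
  return 1LL * k * k * c_in + 1LL * c_in * c_out;
}
```

For a 3×3 layer with 256 input and 256 output channels, the standard convolution has 589824 weights and the separable version has 67840, roughly 8.7 times fewer.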

[Figure]