【Semantic Segmentation】A Survey of Semantic Segmentation -- Atrous Convolution

Atrous Convolution

To address the loss of fine image detail in encoder+decoder architectures, atrous convolution (also called dilated convolution) is used; a minimal illustration follows the list below.
The main problems of earlier CNNs can be summarized as:
   (1) Up-sampling / pooling layers;
   (2) loss of internal data structure and of spatially hierarchical information;
   (3) small-object information cannot be reconstructed (with four pooling layers, any object smaller than 2^4 = 16 pixels theoretically cannot be reconstructed).
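
To see why atrous convolution helps, here is a minimal sketch (not from the original post): a dilated 3×3 convolution enlarges the receptive field without any pooling, so the feature map keeps its full resolution.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)
# dilation=2 covers a 5x5 window with only 3x3 weights;
# padding=dilation keeps the spatial size unchanged (see the O = I analysis below)
conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
print(conv(x).shape)  # torch.Size([1, 64, 65, 65])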

[DeepLabV2] Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs 2016

https://arxiv.org/abs/1606.00915

DeepLabV2 introduces atrous convolution to enlarge the receptive field without reducing resolution, proposes ASPP for multi-scale context, and refines object boundaries with a fully connected CRF.

[ASPP] Atrous Spatial Pyramid Pooling

ASPP applies atrous convolutions with several dilation rates to the same feature map in parallel and fuses the results, capturing context at multiple scales.

[DeepLabV3] Rethinking Atrous Convolution for Semantic Image Segmentation 2017-06

https://arxiv.org/abs/1706.05587

DeepLabV3 augments ASPP with image-level features and batch normalization, cascades atrous convolutions with increasing rates, and drops the CRF post-processing.

[DeepLabV3+] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation 2018-02

https://arxiv.org/abs/1802.02611v1

Define output stride as the factor by which a feature map is downsampled relative to the input image (e.g., with output stride = 16, a 512×512 input yields a 32×32 feature map).
Encoder

  1. The DCNN backbone can be ResNet-101 or Xception;
    the low-level features come out at output stride = 4 with channel = 256/128 (resnet101/xception);
    the main output tensor has output stride = 8/16 and channel = 2048;
  2. ASPP passes this tensor in parallel through 4 convolution layers and one global average pooling layer;
    each branch keeps the output size unchanged and produces 256 channels;
    after concatenation, a 1×1 convolution fuses the features, yielding a 256-channel tensor at output stride = 8/16.

Decoder

  1. The low-level features pass through a 1×1 convolution for feature fusion; the size is unchanged and the channels are reduced to 48 (chosen experimentally);
  2. the ASPP output tensor is at output stride = 8/16, so it must be upsampled 2×/4×;
  3. the sizes now match (output stride = 4), and a concat yields a new tensor;
  4. this tensor goes through convolutions and upsampling to produce the final result (a shape walkthrough follows this list).
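
To make these steps concrete, here is a shape walkthrough (a sketch assuming output stride = 16, a ResNet-101 backbone, a 512×512 input, and batch size N; the shapes are illustrative, not from the original post):

# input image:           (N, 3,    512, 512)
# low-level features:    (N, 256,  128, 128)   # output stride 4
# backbone output:       (N, 2048, 32,  32)    # output stride 16
# ASPP (concat + 1x1):   (N, 256,  32,  32)
# upsampled 4x:          (N, 256,  128, 128)
# low-level after 1x1:   (N, 48,   128, 128)
# concat:                (N, 304,  128, 128)   # 256 + 48
# 3x3 convs + 1x1:       (N, num_classes, 128, 128)
# upsampled 4x:          (N, num_classes, 512, 512)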

code

ASPP
The ASPP module uses atrous convolutions with different dilation rates to extract features from the feature map at multiple scales.

The output size of a convolution is

$O = \frac{I + 2p - d(k-1) - 1}{s} + 1$

where $I$ is the input feature map size, $s$ the stride, $p$ the padding, $d$ the dilation, and $O$ the output feature map size.

Setting $O = I$ and $k = 3$:
$s(I-1) = I + 2p - 2d - 1$
$(s-1)(I-1) = 2p - 2d$
To guarantee $O = I$, an obvious choice is $s = 1$ and $p = d$ (any value).

Setting $O = I$ and $k = 1$:
$s(I-1) = I + 2p - 1$
$(s-1)(I-1) = 2p$
To guarantee $O = I$, an obvious choice is $s = 1$ and $p = 0$, with $d$ arbitrary (dilation has no effect on a $1\times1$ kernel).
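
A quick numeric check of the analysis above (a sketch, not from the original post): with k = 3, s = 1, and p = d, the output size equals the input size for every dilation rate.

import torch
import torch.nn as nn

x = torch.randn(1, 8, 33, 33)
for d in (1, 6, 12, 18):
    conv = nn.Conv2d(8, 8, kernel_size=3, stride=1, padding=d, dilation=d)
    assert conv(x).shape[-2:] == x.shape[-2:]  # O == I holds for every dilation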

import torch
import torch.nn as nn
import torch.nn.functional as F
from .utils import initialize_weights  # weight-init helper from the linked repo


def assp_branch(in_channels, out_channels, kernel_size, dilation):
    # padding follows the O = I analysis above:
    # 1x1 kernels need no padding, 3x3 kernels use padding = dilation
    padding = 0 if kernel_size == 1 else dilation
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True))


class ImagePooling(nn.Module):
    def __init__(self, in_channels):
        super(ImagePooling, self).__init__()

        self.avg_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Conv2d(in_channels, 256, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return F.interpolate(self.avg_pool(x), size=(x.size(2), x.size(3)), mode='bilinear', align_corners=True)


class ASSP(nn.Module):
    def __init__(self, in_channels, output_stride):
        super(ASSP, self).__init__()

        assert output_stride in [8, 16], 'Only output strides of 8 or 16 are supported'
        if output_stride == 16:
            dilations = [1, 6, 12, 18]
        else:  # output_stride == 8
            dilations = [1, 12, 24, 36]

        self.aspp1 = assp_branch(in_channels, 256, 1, dilation=dilations[0])
        self.aspp2 = assp_branch(in_channels, 256, 3, dilation=dilations[1])
        self.aspp3 = assp_branch(in_channels, 256, 3, dilation=dilations[2])
        self.aspp4 = assp_branch(in_channels, 256, 3, dilation=dilations[3])
        self.aspp5 = ImagePooling(in_channels)

        self.conv1 = nn.Conv2d(256 * 5, 256, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(256)
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(0.5)

        initialize_weights(self)

    def forward(self, x):
        x1 = self.aspp1(x)
        x2 = self.aspp2(x)
        x3 = self.aspp3(x)
        x4 = self.aspp4(x)
        x5 = self.aspp5(x)

        x = self.conv1(torch.cat((x1, x2, x3, x4, x5), dim=1))
        x = self.bn1(x)
        x = self.dropout(self.relu(x))

        return x
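
A hypothetical usage check for the module above (assuming initialize_weights from the repo's utils is importable): every branch preserves the spatial size, so the fused output is (N, 256, H, W).

x = torch.randn(2, 2048, 32, 32)   # backbone output at output_stride = 16
aspp = ASSP(in_channels=2048, output_stride=16)
print(aspp(x).shape)               # torch.Size([2, 256, 32, 32])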

Decoder code

class Decoder(nn.Module):
    def __init__(self, low_level_channels, num_classes):
        super(Decoder, self).__init__()

        self.low_level_conv = nn.Sequential(
            nn.Conv2d(low_level_channels, 48, 1, bias=False),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True),
        )

        # Table 2, best performance with two 3x3 convs
        self.output = nn.Sequential(
            nn.Conv2d(48 + 256, 256, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(inplace=True),
            nn.Dropout(0.1),
            nn.Conv2d(256, num_classes, 1, stride=1),
        )
        initialize_weights(self)

    def forward(self, x, low_level_features, H, W):
        # reduce low_level_features to 48 channels
        low_level_features = self.low_level_conv(low_level_features)
        # upsample 2x/4x to match the low-level feature size
        x = F.interpolate(x, size=(low_level_features.size(2), low_level_features.size(3)),
                          mode='bilinear',
                          align_corners=True)
        # concat
        x = torch.cat((x, low_level_features), dim=1)
        # convolutions producing the segmentation logits
        x = self.output(x)
        # upsample 4x back to the original image size; passing H and W ensures the
        # original size is recovered even when the input size is not a multiple of 2
        x = F.interpolate(x, size=(H, W), mode='bilinear', align_corners=True)

        return x
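
A hypothetical usage check for the Decoder above, with shapes matching the walkthrough in the Decoder section:

low_level = torch.randn(2, 256, 128, 128)  # output stride 4 (resnet101)
x = torch.randn(2, 256, 32, 32)            # ASPP output at output stride 16
decoder = Decoder(low_level_channels=256, num_classes=21)
print(decoder(x, low_level, 512, 512).shape)  # torch.Size([2, 21, 512, 512])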

The complete DeepLabV3+ network

from itertools import chain

import torch
import torch.nn as nn
import torch.nn.functional as F

from .models import getBackBone
from .models.aspp import ASSP
from .models.decoder import Decoder


class DeepLabV3Plus(nn.Module):
    def __init__(self, num_classes, in_channels=3, backbone='xception', pretrained=True,
                 output_stride=16, freeze_bn=False, **_):
        super(DeepLabV3Plus, self).__init__()
        assert 'xception' in backbone or 'resnet' in backbone
        self.backbone, low_level_channels = getBackBone(backbone, in_channels=in_channels, output_stride=output_stride,
                                                        pretrained=pretrained)

        self.assp = ASSP(in_channels=2048, output_stride=output_stride)

        self.decoder = Decoder(low_level_channels, num_classes)

        if freeze_bn:
            self.freeze_bn()

    def forward(self, x):
        H, W = x.size(2), x.size(3)
        x, low_level_features = self.backbone(x)
        x = self.assp(x)
        x = self.decoder(x, low_level_features, H, W)
        return x

    # Two functions to yield the parameters of the backbone
    # & Decoder / ASSP, to allow different learning rates
    # FIXME: in xception, we use the parameters from xception and not aligned xception
    # better to have higher lr for this backbone

    def get_backbone_params(self):
        return self.backbone.parameters()

    def get_decoder_params(self):
        return chain(self.assp.parameters(), self.decoder.parameters())

    def freeze_bn(self):
        for module in self.modules():
            if isinstance(module, nn.BatchNorm2d): module.eval()
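
A hypothetical smoke test (it assumes the getBackBone, ASSP, and Decoder imports above resolve, per the [Code] link in the references):

model = DeepLabV3Plus(num_classes=21, backbone='resnet101', pretrained=False)
x = torch.randn(2, 3, 513, 513)
print(model(x).shape)  # expected: torch.Size([2, 21, 513, 513])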

References
[Detailed walkthrough] https://blog.csdn.net/u011974639/article/details/79518175
[Code] https://github.com/ArronHZG/segmentation-pytorch/tree/master/torch_model/net/DeepLabV3Plus

[DenseASPP] DenseASPP for Semantic Segmentation in Street Scenes (CVPR 2018)

http://openaccess.thecvf.com/content_cvpr_2018/papers/Yang_DenseASPP_for_Semantic_CVPR_2018_paper.pdf

DenseASPP combines the strengths of ASPP and DenseNet.
ASPP is a parallel structure: atrous convolutions with different rates provide different receptive fields, covering both coarse and fine semantic information.
A cascading structure such as an encoder-decoder, by contrast, naturally obtains a larger receptive field, but is less accurate on object boundaries.
DenseASPP combines the advantages of both structures; on the Cityscapes dataset it achieves 90.7 mIoU on the car class. A minimal sketch of the dense connection pattern follows.
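
The dense connection pattern can be sketched as follows (an illustration of the idea only, not the paper's exact architecture; the paper additionally uses 1×1 bottleneck layers before each atrous convolution):

import torch
import torch.nn as nn

class DenseASPPSketch(nn.Module):
    def __init__(self, in_channels, branch_channels=64, dilations=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList()
        channels = in_channels
        for d in dilations:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, branch_channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True)))
            channels += branch_channels  # each layer also sees all earlier outputs

    def forward(self, x):
        features = [x]
        for branch in self.branches:
            # dense connection: concatenate the input and every previous output
            features.append(branch(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

# e.g. DenseASPPSketch(in_channels=512)(torch.randn(1, 512, 32, 32)).shape
#   -> torch.Size([1, 832, 32, 32])   (512 + 5 * 64)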
