A Plain-Language Guide to the Spatial Pyramid Pooling Family: SPP, ASPP, DSPP, and MDSPP (Overview)

花花少年

Last modified: 2024-07-22 19:54:19

Category: Deep Learning. Tags: SPP, ASPP, DSPP, MDSPP, spatial pyramid pooling

First published: 2024-01-11 19:18:45

Article link: https://blog.csdn.net/m0_37605642/article/details/135536588

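I. References

Object detection: SPP-net
SPP: principles and implementation
Understanding the pyramid pooling family: SPP and ASPP
Understanding and summarizing the SPP, PPM, ASPP, and FPN structures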

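II. Spatial Pyramid Pooling (SPP)

Original paper: [1]

1. Introduction

In a conventional CNN, pooling layers use a fixed window size and stride, so the size of the pooled feature map depends on the size of the input. Because the fully connected layers that follow expect a fixed-length vector, input images have to be cropped or warped to one fixed size, which loses or distorts information and can hurt accuracy. SPP instead adapts its pooling windows to the input size, so it produces a fixed-length output for any input and retains more of the image information.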

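2. Overview of SPP

2.1 The concept of SPP

Spatial Pyramid Pooling (SPP) is a pooling method for convolutional neural networks that have to handle inputs of different sizes. It combines several pooling levels of different granularity, so it can pool an input of any size into a fixed-length representation, which makes the network more flexible and helps it generalize.

The basic idea of a spatial pyramid is to build parallel branches with different parameters inside the network. Each branch extracts features at a different spatial scale according to its own receptive field, and the feature maps of all branches are fused at the end.

2.2 How SPP works

The main idea of SPP is to divide the input into several pyramid levels, pool each level with a pooling window of a different size, and then concatenate the pooled results of all levels into the network's feature representation. Because the number of bins at each level is fixed while the window size scales with the input, SPP can pool inputs of different sizes and retains more of the image information.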

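2.3 What SPP is for

In networks that contain fully connected (FC) layers, SPP removes the requirement that the input have a fixed size. An FC layer has a fixed number of input features, so without SPP such a network only accepts input images of one fixed size.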
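To make the fixed-size constraint concrete, here is a small sketch. The toy feature extractor (one convolution plus one fixed-window pooling layer) and its sizes are assumptions chosen for illustration, not taken from the SPP paper. It shows that the flattened feature length changes with the input size, so a single FC layer cannot accept both inputs.

```python
import torch
import torch.nn as nn

# Toy feature extractor; the channel count and kernel sizes are illustrative assumptions.
features = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.MaxPool2d(2),
)


def flattened_length(height, width):
    """Return the per-image feature length after conv + fixed-window pooling."""
    x = torch.zeros(1, 3, height, width)
    return features(x).flatten(1).shape[1]


small = flattened_length(32, 32)   # 8 channels * 16 * 16
large = flattened_length(48, 48)   # 8 channels * 24 * 24
print(small, large)
```

An `nn.Linear` sized for the first length would reject the second, which is the gap that SPP closes by pooling to a fixed number of bins.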

2.4 Applications of SPP

The neck of YOLOv4 uses an SPP module.

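3. SPP structure

[Figure: SPP layer structure]

As shown in the figure above, the leftmost block is the 256-channel feature map produced by the convolutional layers, that is, every spatial position has a depth of 256. It is pooled in three ways:

Pool over the whole feature map, which yields one value per channel and therefore a single 1x256 vector;
Divide the feature map into 2x2 = 4 bins and pool each bin separately; each bin yields a 1x256 vector, giving 4 vectors of size 1x256;
Divide the feature map into 4x4 = 16 bins and pool each bin separately; each bin yields a 1x256 vector, giving 16 vectors of size 1x256;

The results of the three partitions are concatenated into a (1+4+16)x256 = 21x256 feature. As the figure shows, the output size does not depend on the input size at all, so the layer can handle candidate boxes of any size.

The spatial pyramid pooling layer is in effect an adaptive layer: whatever the input size, the output size is fixed.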
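The three-level pooling described above can be sketched with `torch.nn.functional.adaptive_max_pool2d`. This is a convenient stand-in that picks the window sizes automatically; it is not the implementation from the paper, and the function name `spp_adaptive` and the input sizes are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def spp_adaptive(feature_map, levels=(1, 2, 4)):
    """Pool an (N, C, H, W) tensor into N x (sum(l * l) * C) features."""
    pooled = []
    for level in levels:
        # adaptive_max_pool2d chooses the windows so the output is level x level
        out = F.adaptive_max_pool2d(feature_map, output_size=level)
        pooled.append(out.flatten(1))          # (N, C * level * level)
    return torch.cat(pooled, dim=1)


a = spp_adaptive(torch.randn(1, 256, 13, 13))
b = spp_adaptive(torch.randn(1, 256, 20, 31))
print(a.shape, b.shape)
```

Both calls return 21 x 256 = 5376 features per image even though the two feature maps differ in size and aspect ratio.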

4. (PyTorch) Code implementation

GitHub code: sppnet-pytorch

4.1 `SPP`

Purpose: build the SPP structure.

import math

import torch
import torch.nn as nn


def spatial_pyramid_pool(previous_conv, num_sample, previous_conv_size, out_pool_size):
    '''
    previous_conv: a tensor, the output of the previous convolution layer
    num_sample: an int, the number of images in the batch
    previous_conv_size: an int vector [height, width], the feature-map size of the previous convolution layer
    out_pool_size: an int vector, the expected output size of each max pooling level

    returns: a tensor of shape [num_sample x n], the concatenation of the multi-level pooling results
    '''
    for i in range(len(out_pool_size)):  # e.g. out_pool_size=[4, 2, 1]
        # Size of the pooling window, pooling_size=(h_wid, w_wid)
        h_wid = int(math.ceil(previous_conv_size[0] / out_pool_size[i]))
        w_wid = int(math.ceil(previous_conv_size[1] / out_pool_size[i]))
        # Padding, padding=(h_pad, w_pad); integer division because MaxPool2d requires int padding
        h_pad = (h_wid * out_pool_size[i] - previous_conv_size[0] + 1) // 2
        w_pad = (w_wid * out_pool_size[i] - previous_conv_size[1] + 1) // 2
        # Instantiate MaxPool2d
        maxpool = nn.MaxPool2d((h_wid, w_wid), stride=(h_wid, w_wid), padding=(h_pad, w_pad))
        # Max pooling keeps the channel count; with 256 channels the sizes become (4, 4, 256), (2, 2, 256), (1, 1, 256)
        x = maxpool(previous_conv)
        if i == 0:
            # Flatten to 2D, e.g. (4, 4, 256) -> (1, 4*4*256)
            spp = x.view(num_sample, -1)
        else:
            # Append each pooling result to the end of the first one to form a higher-dimensional feature
            # Inputs: (1, 4*4*256) concat (1, 2*2*256) concat (1, 1*1*256)
            # Output: (1, (4*4+2*2+1*1)*256), i.e. (1, 5376)
            spp = torch.cat((spp, x.view(num_sample, -1)), 1)
    return spp
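The window and padding arithmetic used in the function above can be checked in isolation. The sketch below works on one axis; the helper name `spp_window` and the feature-map size 13 are assumptions chosen for illustration.

```python
import math


def spp_window(size, bins):
    """Return (kernel, padding, output) for one axis of one pyramid level."""
    kernel = math.ceil(size / bins)
    padding = (kernel * bins - size + 1) // 2
    # Standard pooling output size when stride == kernel
    output = (size + 2 * padding - kernel) // kernel + 1
    return kernel, padding, output


for bins in (4, 2, 1):
    print(bins, spp_window(13, bins))
```

For a 13-wide axis the three levels give 4, 2, and 1 output bins as intended. Note that PyTorch requires the padding to be at most half the kernel size, so this scheme can fail on very small feature maps (for example, size 5 with 4 bins gives kernel 2 and padding 2).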

4.2 `CNN with SPP`

Purpose: build a CNN model that contains the SPP structure.

import torch
import torch.nn as nn
import torch.nn.functional as F
from spp_layer import spatial_pyramid_pool


class SPP_NET(nn.Module):
    '''
    A CNN model which adds an SPP layer so that it accepts input tensors of different sizes
    '''
    def __init__(self, opt, input_nc, ndf=64, gpu_ids=[]):
        super(SPP_NET, self).__init__()
        self.gpu_ids = gpu_ids
        self.output_num = [4, 2, 1]

        self.conv1 = nn.Conv2d(input_nc, ndf, 4, 2, 1, bias=False)
        # forward() uses self.LReLU1, so it has to be defined; the slope 0.2 is an assumed value
        self.LReLU1 = nn.LeakyReLU(0.2, inplace=True)

        self.conv2 = nn.Conv2d(ndf, ndf * 2, 4, 1, 1, bias=False)
        self.BN1 = nn.BatchNorm2d(ndf * 2)

        self.conv3 = nn.Conv2d(ndf * 2, ndf * 4, 4, 1, 1, bias=False)
        self.BN2 = nn.BatchNorm2d(ndf * 4)

        self.conv4 = nn.Conv2d(ndf * 4, ndf * 8, 4, 1, 1, bias=False)
        self.BN3 = nn.BatchNorm2d(ndf * 8)

        self.conv5 = nn.Conv2d(ndf * 8, 64, 4, 1, 0, bias=False)
        # 10752 = (4*4 + 2*2 + 1*1) bins * 512 channels (ndf * 8 with ndf=64)
        self.fc1 = nn.Linear(10752, 4096)
        self.fc2 = nn.Linear(4096, 1000)

    def forward(self, x):
        x = self.conv1(x)
        x = self.LReLU1(x)

        x = self.conv2(x)
        x = F.leaky_relu(self.BN1(x))

        x = self.conv3(x)
        x = F.leaky_relu(self.BN2(x))

        x = self.conv4(x)
        # x = F.leaky_relu(self.BN3(x))
        # x = self.conv5(x)
        # Pass the actual batch size instead of a hard-coded 1
        spp = spatial_pyramid_pool(x, x.size(0), [int(x.size(2)), int(x.size(3))], self.output_num)
        # print(spp.size())
        fc1 = self.fc1(spp)
        fc2 = self.fc2(fc1)
        s = nn.Sigmoid()
        output = s(fc2)
        return output

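III. Atrous Spatial Pyramid Pooling (ASPP)

1. Introduction

In semantic segmentation, ASPP combines semantic information from receptive fields of different sizes without discarding information, which improves segmentation accuracy.

In image segmentation (taking FCN as an example), the image is fed to a CNN, and FCN, like a conventional CNN, applies convolution and then pooling, which shrinks the feature map while enlarging the receptive field. Because segmentation produces a pixel-wise output, the smaller pooled feature map must then be upsampled back to the original image size for prediction. Put simply, FCN shrinks the feature map to enlarge the receptive field in the pooling stage and enlarges it again in the upsampling stage, and image information is lost in this shrink-then-enlarge process. To address this, DeepLab v2 proposed the ASPP module, which uses four parallel dilated convolution layers to capture multi-scale information. It combines semantic information from receptive fields of different sizes without reducing resolution (no downsampling), which improves segmentation accuracy.

2. The concept of ASPP

Put simply, ASPP = SPP + dilated convolution.

Atrous Spatial Pyramid Pooling (ASPP) combines the ideas of SPP and dilated (atrous) convolution. Dilated convolution enlarges the receptive field of a kernel without reducing resolution (no downsampling). ASPP can be regarded as the application of SPP to semantic segmentation.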
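As a rough illustration of "a larger receptive field at the same resolution", the sketch below computes the span of a dilated 3×3 kernel, k + (k - 1)(d - 1), and checks that a convolution with padding equal to the dilation rate keeps the spatial size. The 33×33 input and the 4 channels are assumptions chosen for this example.

```python
import torch
import torch.nn as nn


def effective_kernel(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


x = torch.zeros(1, 4, 33, 33)
shapes = {}
for rate in (1, 6, 12, 18):
    conv = nn.Conv2d(4, 4, kernel_size=3, padding=rate, dilation=rate, bias=False)
    shapes[rate] = tuple(conv(x).shape[-2:])
    print(rate, effective_kernel(3, rate), shapes[rate])
```

The kernel span grows from 3 to 37 pixels while the number of weights stays at 9 per channel pair and the output stays 33×33.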

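3. The ASPP structure in DeepLab v2

DeepLab v2: [2]

The ASPP structure from the DeepLab v2 paper is shown in the figure below:

[Figure: ASPP structure in DeepLab v2]

Four parallel branches are attached to the input feature map. The first layer of each branch is a dilated convolution with a different dilation rate, so each branch has a different receptive field, which lets the module cope with objects at multiple scales. Dilated convolutions with different sampling rates capture multi-scale information, but a larger rate is not always better: as the dilation rate grows, more of the filter taps fall on the padding instead of on valid features, and those weights contribute nothing useful. The rates therefore have to be chosen with care.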
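The effect of an overly large rate can be quantified with a small counting sketch. It measures, averaged over all positions of a square feature map, what share of the 9 taps of a 3×3 dilated filter land on the feature map instead of on the zero padding. The function name and the 65×65 map size are assumptions made for illustration.

```python
def valid_tap_fraction(size, rate):
    """Average share of the 9 taps of a 3x3 dilated filter that hit the
    feature map (not the padding), over all positions of a size x size map."""
    # Per-axis count of valid taps at each position; tap offsets are -rate, 0, +rate
    per_axis = [
        sum(0 <= i + off < size for off in (-rate, 0, rate))
        for i in range(size)
    ]
    # The 2D count at (i, j) is per_axis[i] * per_axis[j], so the sum factorizes
    total = sum(per_axis) ** 2
    return total / (9 * size * size)


for rate in (1, 6, 12, 18, 65):
    print(rate, round(valid_tap_fraction(65, rate), 3))
```

Once the rate reaches the feature-map size, only the center tap is ever valid and the fraction drops to 1/9, meaning the 3×3 filter behaves like a 1×1 filter.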

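4. The ASPP structure in DeepLab v3

DeepLab v3: [3]

The ASPP structure from the DeepLab v3 paper is shown in the figure below:

[Figure: ASPP structure in DeepLab v3]

The detailed structure of ASPP is shown in the figure below:

[Figure: detailed ASPP structure]

For the input:

Conv1x1: a 1×1 convolution that reduces the channel dimension of the input. The paper explains that when the rate approaches the feature map size, a dilated 3×3 convolution degenerates into a 1×1 convolution, because only the center weight still lands on valid features. In other words, Conv1x1 corresponds to a dilated convolution with a very large dilation rate;
Conv3x3, rate=6: a 3×3 convolution with padding 6 and dilation rate 6. Note that padding = dilation rate; see the code implementation below.
Conv3x3, rate=12: a 3×3 convolution with padding 12 and dilation rate 12;
Conv3x3, rate=18: a 3×3 convolution with padding 18 and dilation rate 18;
Pool(1x1) -> Conv1x1 -> upsample: the ASPPPooling operation. First, global average pooling (GAP) produces a 1×1 feature map; then a 1×1 convolution reduces its channel dimension; finally, bilinear interpolation upsamples it back to the feature map size. For details on global average pooling, see another post: An Easy-to-Understand Introduction to Global Average Pooling (GAP)

Finally, the outputs of the 5 branches are concatenated, and a 1×1 convolution reduces the result to the specified number of channels, giving the final feature map.

As you can see, ASPP consists of a 1×1 convolution (the green block on the far left), a pyramid of dilated convolutions (the three blue blocks in the middle), and ASPPPooling (the three layers on the far right). The dilation rates of the ASPPConv layers are configurable, which allows flexible multi-scale feature extraction. Why rate = [6, 12, 18]? These values come from the paper's experiments, where this combination gave the highest mIoU.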

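5. (PyTorch) Code implementation

SOURCE CODE FOR TORCHVISION.MODELS.SEGMENTATION.DEEPLABV3
PyTorch torchvision source code walkthrough: ASPP

The ASPP implementation is introduced below using the torchvision DeepLab v3 source code as an example.

5.1 `ASPPConv`

Purpose: compute the dilated convolution, i.e. the Conv3x3, rate=6/12/18 step.

Input: input feature map, (N, in_channels, H, W)

Output: output feature map, (N, out_channels, H, W), with the same spatial size as the input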

class ASPPConv(nn.Sequential):
    def __init__(self, in_channels: int, out_channels: int, dilation: int) -> None:
        """
        in_channels: number of input channels
        out_channels: number of output channels
        dilation: dilation rate
        padding=dilation
        """
        modules = [
            nn.Conv2d(in_channels, out_channels, 3, padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        ]
        super().__init__(*modules)

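The three dilated convolutions Conv3x3, rate=6/12/18 produce output feature maps of equal size, all equal to the input feature map size. For the output-size formula of dilated convolution, see another post: An Accessible Guide to Dilated Convolution (Atrous Convolution)

5.2 `ASPPPooling`

Purpose: compute the pooling branch, i.e. the Pool(1x1) -> Conv1x1 -> upsample step.

Input: input feature map, (N, in_channels, H, W)

Output: output feature map, (N, out_channels, H, W), with the same spatial size as the input

First, adaptive average pooling (AdaptiveAvgPool2d) compresses the feature map of each channel to 1×1, which summarizes each channel and yields a global feature.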
```
nn.AdaptiveAvgPool2d(1)
```
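Adaptive average pooling is "adaptive" in that you do not specify the kernel size or stride; you only specify the desired output size (1×1 here).
Then a 1×1 convolution reduces the channel dimension of the feature obtained in the previous step: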
```
nn.Conv2d(in_channels, out_channels, 1, bias=False)
```

Finally, upsampling restores the original input size:

F.interpolate(x, size=size, mode="bilinear", align_corners=False)
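The three steps can be traced on a concrete tensor. The sketch below uses assumed sizes (a batch of 2, 16 input channels, 4 output channels, a 14×14 map) and omits the BatchNorm and ReLU layers for brevity. It also shows that upsampling a 1×1 map simply broadcasts one value per channel to every position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 16, 14, 14)                      # (N, in_channels, H, W); sizes are illustrative
pooled = nn.AdaptiveAvgPool2d(1)(x)                 # (2, 16, 1, 1): one mean per channel
reduced = nn.Conv2d(16, 4, 1, bias=False)(pooled)   # (2, 4, 1, 1)
up = F.interpolate(reduced, size=x.shape[-2:], mode="bilinear", align_corners=False)

print(pooled.shape, reduced.shape, up.shape)
# Upsampling a 1x1 map copies the single value to every spatial position
is_constant = torch.allclose(up, reduced.expand_as(up))
print(is_constant)
```

This is why the branch acts as an image-level context feature: every pixel receives the same global summary vector.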

The complete source code is as follows:

class ASPPPooling(nn.Sequential):
    def __init__(self, in_channels: int, out_channels: int) -> None:
        super().__init__(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        for mod in self:
            x = mod(x)
        return F.interpolate(x, size=size, mode="bilinear", align_corners=False)

5.3 `ASPP`

Purpose: build the overall ASPP structure and run the ASPP operation.

A 1×1 convolution layer reduces the channel dimension:

super(ASPP, self).__init__()
modules = []
modules.append(nn.Sequential(
   nn.Conv2d(in_channels, out_channels, 1, bias=False),
   nn.BatchNorm2d(out_channels),
   nn.ReLU()
   )
)

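ASPPConv builds the pyramid of dilated convolutions. For the given dilation rates atrous_rates, the corresponding dilated convolution layers are added to extract features at different scales:
```
rates = tuple(atrous_rates)
for rate in rates:
    modules.append(ASPPConv(in_channels, out_channels, rate))
```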

Add the ASPPPooling layer:

modules.append(ASPPPooling(in_channels, out_channels))

The output layer applies a convolution to the concatenated outputs of the ASPP branches to produce the final result:

self.project = nn.Sequential(
  nn.Conv2d(len(self.convs) * out_channels, out_channels, 1, bias=False),
  nn.BatchNorm2d(out_channels),
  nn.ReLU(),
  nn.Dropout(0.5)
  )

Complete code:

class ASPP(nn.Module):
    def __init__(self, in_channels: int, atrous_rates: List[int], out_channels: int = 256) -> None:
        """
        in_channels: number of input channels
        atrous_rates: dilation rates
        out_channels: number of output channels, 256 by default
        """
        super().__init__()
        modules = []
        modules.append(
            nn.Sequential(nn.Conv2d(in_channels, out_channels, 1, bias=False), nn.BatchNorm2d(out_channels), nn.ReLU())
        )

        rates = tuple(atrous_rates)
        for rate in rates:
            modules.append(ASPPConv(in_channels, out_channels, rate))

        modules.append(ASPPPooling(in_channels, out_channels))

        self.convs = nn.ModuleList(modules)

        self.project = nn.Sequential(
            nn.Conv2d(len(self.convs) * out_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Dropout(0.5),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _res = []
        for conv in self.convs:
            _res.append(conv(x))
        
        # (B, C, H, W), dim = 1: concatenate along the channel dimension
        res = torch.cat(_res, dim=1)
        return self.project(res)

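Note: the forward method applies each ASPP branch to the input in turn, concatenates the branch outputs along the channel dimension, and passes the result through the output layer (conv -> bn -> relu -> dropout), which reduces it to the given number of channels to produce the final result.

5.4 Full code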

from typing import List

import torch
import torch.nn as nn
import torch.nn.functional as F


# Dilated Convolution
class ASPPConv(nn.Sequential):
    def __init__(self, in_channels: int, out_channels: int, dilation: int) -> None:
        modules = [
            nn.Conv2d(in_channels, out_channels, 3, padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        ]
        super().__init__(*modules)


# Pool(1x1) -> 1x1 convolution -> upsample
class ASPPPooling(nn.Sequential):
    def __init__(self, in_channels: int, out_channels: int) -> None:
        super().__init__(
            nn.AdaptiveAvgPool2d(1),  # adaptive average pooling
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (N, C, H, W)
        size = x.shape[-2:]  # (H, W)
        for mod in self:
            x = mod(x)

        # Upsample back to (H, W)
        return F.interpolate(x, size=size, mode="bilinear", align_corners=False)


# The whole ASPP structure
class ASPP(nn.Module):
    def __init__(self, in_channels: int, atrous_rates: List[int], out_channels: int = 256) -> None:
        super().__init__()
        modules = []

        # 1x1 convolution
        modules.append(
            nn.Sequential(nn.Conv2d(in_channels, out_channels, 1, bias=False), nn.BatchNorm2d(out_channels), nn.ReLU())
        )

        # Multi-scale dilated convolutions
        rates = tuple(atrous_rates)
        for rate in rates:
            modules.append(ASPPConv(in_channels, out_channels, rate))

        # Add ASPPPooling
        modules.append(ASPPPooling(in_channels, out_channels))

        self.convs = nn.ModuleList(modules)

        # Output layer
        self.project = nn.Sequential(
            nn.Conv2d(len(self.convs) * out_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Dropout(0.5),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _res = []
        for conv in self.convs:
            _res.append(conv(x))

        # Concatenate the branch outputs along the channel dimension
        res = torch.cat(_res, dim=1)
        return self.project(res)

5.5 Parameter count and computation

from torchstat import stat


if __name__ == '__main__':
    model = ASPP(in_channels=512, atrous_rates=[6, 12, 18], out_channels=512)
    stat(model, [512, 14, 14])

Output:

=====================================================================================================================================================
Total params: 8,919,040
-----------------------------------------------------------------------------------------------------------------------------------------------------
Total memory: 6.13MB
Total MAdd: 3.39GMAdd
Total Flops: 1.7GFlops
Total MemR+W: 47.05MB
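The reported parameter total can be reproduced by hand. The sketch below counts the bias-free convolution weights and the BatchNorm scale and shift parameters of each branch for in_channels = out_channels = 512 and rates (6, 12, 18); the helper names are assumptions made for this example.

```python
def conv_params(c_in, c_out, k):
    """Weights of a bias-free k x k convolution."""
    return c_in * c_out * k * k


def bn_params(c):
    """Learnable scale and shift of BatchNorm2d."""
    return 2 * c


c_in, c_out, rates = 512, 512, (6, 12, 18)
branch_1x1 = conv_params(c_in, c_out, 1) + bn_params(c_out)
branch_3x3 = conv_params(c_in, c_out, 3) + bn_params(c_out)   # identical for every rate
branch_pool = conv_params(c_in, c_out, 1) + bn_params(c_out)
n_branches = 1 + len(rates) + 1
project = conv_params(n_branches * c_out, c_out, 1) + bn_params(c_out)

total = branch_1x1 + len(rates) * branch_3x3 + branch_pool + project
print(f"{total:,}")   # 8,919,040
```

The three 3×3 dilated branches account for about 7.08M of the 8.92M parameters; the dilation rate itself adds no parameters.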

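6. Limitations of ASPP

ASPP applies dilated convolutions with different dilation rates to the input feature map to capture multi-scale context with receptive fields of different sizes. Stacking dilated convolutions is unfriendly to dense prediction tasks because many pixel-level details are lost. In addition, the ASPP module can only gather context from a small number of sampled points around each pixel and cannot form dense global context. To obtain dense global context and model the dependencies between pixels, the self-attention mechanism is introduced.

For details on self-attention, see another post: [Transformer Series] An Accessible Guide to Attention and Self-Attention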

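IV. Depthwise Separable Pyramid Pooling (DSPP)

Paper: [4]

Depthwise separable pyramidal pooling (DSPP) is introduced here using the SPEED paper as an example.

1. SPEED structure

SPEED (Separable Pyramidal pooling EncodEr-Decoder) is a monocular depth estimation network based on an encoder-decoder architecture.

The SPEED structure is shown in the figure below:
[Figure: SPEED structure]

2. SPEED encoder structure

The structure of the SPEED encoder, which contains the DSPP module, is shown in the figure below:

[Figure: SPEED encoder structure]

3. DSPP structure

DSPP consists of 4 branches, each made up of one average pooling layer and one depthwise separable convolution layer (Separable Conv2D). The feature maps output by the 4 branches are concatenated with the original input feature map. The DSPP structure is shown in the figure below:

[Figure: DSPP structure]

As shown in the figure above, the original input feature map has size (12, 16, 512), and the concatenated output feature map has size (12, 16, 1024).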

4. (TensorFlow) Code implementation

GitHub code: SPEED

4.1 `SPEED_Encoder`

Purpose: build the SPEED_Encoder structure.

def SPEED_Encoder(input_shape, alpha=1.0, depth_multiplier=1):
    img_input = layers.Input(shape=input_shape)
	
    # (192, 256, 3) -> (96, 128, 32)
    x = _conv_block(img_input, 32, alpha, strides=(2, 2))
    # (96, 128, 32) -> (96, 128, 64)
    x = _depthwise_conv_block(x, 64, alpha, depth_multiplier, block_id=1)
    
    # (96, 128, 64) -> (48, 64, 128)
    x = _depthwise_conv_block(x, 128, alpha, depth_multiplier, strides=(2, 2), block_id=2)
    # (48, 64, 128) -> (48, 64, 128)
    x = _depthwise_conv_block(x, 128, alpha, depth_multiplier, block_id=3)
	
    # (48, 64, 128) -> (24, 32, 256)
    x = _depthwise_conv_block(x, 256, alpha, depth_multiplier, strides=(2, 2), block_id=4)
    # (24, 32, 256) -> (24, 32, 256)
    x = _depthwise_conv_block(x, 256, alpha, depth_multiplier, block_id=5)

    # (24, 32, 256) -> (12, 16, 512)
    x = _depthwise_conv_block(x, 512, alpha, depth_multiplier, strides=(2, 2), block_id=6)
    # (12, 16, 512) -> (12, 16, 512)
    x = _depthwise_conv_block(x, 512, alpha, depth_multiplier, block_id=7)

    # (12, 16, 512) -> (12, 16, 1024)
    x = depthwise_separable_pyramid_pooling(x, [2, 4, 6, 8], x.shape[1], x.shape[2], filters=128)
    # x = _depthwise_conv_block(x, 512, alpha, depth_multiplier, block_id=8)
    # x = _depthwise_conv_block(x, 512, alpha, depth_multiplier, block_id=9)
    # x = _depthwise_conv_block(x, 512, alpha, depth_multiplier, block_id=10)
    # x = _depthwise_conv_block(x, 512, alpha, depth_multiplier, block_id=11)
	
    # (12, 16, 1024) -> (6, 8, 1024)
    x = _depthwise_conv_block(x, 1024, alpha, depth_multiplier, strides=(2, 2), block_id=12)
    
    # (6, 8, 1024) -> (6, 8, 256)
    x = _depthwise_conv_block(x, 256, alpha, depth_multiplier, block_id=13)  # 1024

    model = Model(img_input, x, name='SPP_encoder')

    return model

4.2 `DSPP`

Purpose: build the overall DSPP structure.

def depthwise_separable_pyramid_pooling(input_tensor, bin_sizes, w, h, filters):
    concat_list = [input_tensor]

    for bin_size in bin_sizes:
        x = tf.keras.layers.AveragePooling2D(pool_size=(w // bin_size, h // bin_size), strides=(w // bin_size, h // bin_size))(input_tensor)
        x = tf.keras.layers.SeparableConv2D(filters, 3, strides=1, padding='same')(x)
        x = tf.keras.layers.Lambda(lambda x: tf.image.resize(x, (w, h)))(x)

        concat_list.append(x)

    return tf.keras.layers.concatenate(concat_list)

4.3 `AveragePooling2D`

tf.keras.layers.AveragePooling2D

Purpose: perform average pooling.

for bin_size in bin_sizes:
    x = tf.keras.layers.AveragePooling2D(pool_size=(w // bin_size, h // bin_size), strides=(w // bin_size, h // bin_size))(input_tensor)

From the AveragePooling2D arguments, pool_size=(w // bin_size, h // bin_size) and strides=pool_size. With bin_sizes=[2, 4, 6, 8] and an input feature map of size (12, 16, 512), the output sizes are:

Branch 1: bin_size=2, pool_size = (12 // 2, 16 // 2) = (6, 8), output size after average pooling (2, 2, 512)

Branch 2: bin_size=4, pool_size = (12 // 4, 16 // 4) = (3, 4), output size after average pooling (4, 4, 512)

Branch 3: bin_size=6, pool_size = (12 // 6, 16 // 6) = (2, 2), output size after average pooling (6, 8, 512)

Branch 4: bin_size=8, pool_size = (12 // 8, 16 // 8) = (1, 2), output size after average pooling (12, 8, 512)
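The branch sizes above follow from the output-size formula for pooling without padding, floor((n - pool) / stride) + 1. The sketch below recomputes them for the (12, 16) input; the helper name `avg_pool_output` is an assumption made for this example.

```python
def avg_pool_output(size, pool, stride):
    """Output length along one axis for 'valid' pooling (the Keras default)."""
    return (size - pool) // stride + 1


h, w = 12, 16   # spatial size of the DSPP input in the text
sizes = {}
for bin_size in (2, 4, 6, 8):
    pool = (h // bin_size, w // bin_size)
    sizes[bin_size] = (
        avg_pool_output(h, pool[0], pool[0]),
        avg_pool_output(w, pool[1], pool[1]),
    )
    print(bin_size, pool, sizes[bin_size])
```

Only the bin sizes that divide both dimensions (2 and 4 here) yield an exact bin_size x bin_size grid; for 6 and 8 the integer division makes the pooled map larger than the nominal bin count.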

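4.4 `Depthwise separable convolution`

Purpose: perform depthwise separable convolution.

In TensorFlow, the depthwise separable convolution layer is tf.keras.layers.SeparableConv2D

For details on depthwise separable convolution, see another post: An Accessible Guide to Depthwise Separable Convolution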

for bin_size in bin_sizes:
    x = tf.keras.layers.SeparableConv2D(filters, 3, strides=1, padding='same')(x)

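From the SeparableConv2D arguments, strides=1, padding='same', and filters=128. With strides=1 and padding='same' the spatial size is unchanged and only the channel count changes to filters, so the output sizes are:

Branch 1: a (2, 2, 512) feature map becomes (2, 2, 128) after the depthwise separable convolution

Branch 2: a (4, 4, 512) feature map becomes (4, 4, 128) after the depthwise separable convolution

Branch 3: a (6, 8, 512) feature map becomes (6, 8, 128) after the depthwise separable convolution

Branch 4: a (12, 8, 512) feature map becomes (12, 8, 128) after the depthwise separable convolution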
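Two facts about this layer can be checked with plain arithmetic: with padding='same' the output length along an axis is ceil(n / stride), and a separable convolution needs far fewer weights than a standard one. The sketch below is a rough count that ignores biases and assumes depth_multiplier=1; the helper names are assumptions made for this example.

```python
import math


def same_padding_output(size, stride=1):
    """Output length along one axis for padding='same'."""
    return math.ceil(size / stride)


def standard_conv_weights(c_in, c_out, k):
    return k * k * c_in * c_out


def separable_conv_weights(c_in, c_out, k):
    # depthwise k x k filter per input channel + pointwise 1 x 1 channel mixing
    return k * k * c_in + c_in * c_out


std = standard_conv_weights(512, 128, 3)
sep = separable_conv_weights(512, 128, 3)
print(same_padding_output(2), same_padding_output(12), std, sep, round(std / sep, 1))
```

For 512 input channels and 128 filters, the separable layer uses roughly 8 times fewer weights than a standard 3×3 convolution.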

4.5 `resize`

tf.image.resize

Purpose: resize the output feature map back to the size of the original input feature map.

The default interpolation method of resize is ResizeMethod.BILINEAR. After the resize, the output sizes are:

Branch 1: after the resize, the output feature map size is (12, 16, 128)

Branch 2: after the resize, the output feature map size is (12, 16, 128)

Branch 3: after the resize, the output feature map size is (12, 16, 128)

Branch 4: after the resize, the output feature map size is (12, 16, 128)

for bin_size in bin_sizes:
    x = tf.keras.layers.Lambda(lambda x: tf.image.resize(x, (w, h)))(x)

    concat_list.append(x)

4.6 `concatenate`

Purpose: concatenate the feature maps output by the 4 branches with the original input feature map.

Input feature map sizes: (12, 16, 128)*4 + (12, 16, 512)

Output feature map size: (12, 16, 1024)

tf.keras.layers.concatenate(concat_list)
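For readers who work in PyTorch, the whole DSPP pipeline (pool, separable convolution, resize, concatenate) can be re-sketched as below. This is an illustrative port under assumptions, not the SPEED authors' code: it uses the NCHW layout, models the separable convolution as a depthwise 3×3 followed by a pointwise 1×1 convolution, and the class name `DSPPSketch` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DSPPSketch(nn.Module):
    """PyTorch re-sketch of DSPP (NCHW layout); not the SPEED authors' code."""

    def __init__(self, in_channels=512, branch_channels=128, bin_sizes=(2, 4, 6, 8)):
        super().__init__()
        self.bin_sizes = bin_sizes
        self.branches = nn.ModuleList([
            nn.Sequential(
                # depthwise 3x3 followed by pointwise 1x1 = separable convolution
                nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels, bias=False),
                nn.Conv2d(in_channels, branch_channels, 1),
            )
            for _ in bin_sizes
        ])

    def forward(self, x):
        h, w = x.shape[-2:]
        outputs = [x]                       # the input itself is part of the concatenation
        for bin_size, conv in zip(self.bin_sizes, self.branches):
            pool = (h // bin_size, w // bin_size)
            y = F.avg_pool2d(x, kernel_size=pool, stride=pool)
            y = conv(y)
            y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
            outputs.append(y)
        return torch.cat(outputs, dim=1)


out = DSPPSketch()(torch.randn(1, 512, 12, 16))
print(out.shape)
```

The output has 512 + 4 x 128 = 1024 channels at the input's 12×16 resolution, matching the sizes listed above in channel-first order.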

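V. Mixed Depthwise Separable Pyramid Pooling (MDSPP)

Paper: [4]

Mixed depthwise separable pyramidal pooling (MDSPP) is introduced here using the SPEED paper as an example.

1. MDSPP structure

The first half of MDSPP (Average Pooling and Separable Conv2D) is the same as in DSPP. The second half consists of 2 different branches, which are concatenated at the end. The MDSPP structure is shown in the figure below:

[Figure: MDSPP structure]

2. `upsample` structure

[Figure: upsample structure]

3. (TensorFlow) Code implementation

GitHub code: SPEED

3.1 `SPEED`

Build the SPEED model according to the SPEED structure diagram.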

def create_SPEED_model(input_shape, existing=''):
    if len(existing) == 0:
        # Encoder stage, (192, 256, 3) -> (6, 8, 256)
        encoder = SPEED_Encoder(input_shape=input_shape)
        # encoder.summary()
        print('Number of layers in the encoder: {}'.format(len(encoder.layers)))

        # Starting point for decoder
        base_model_output_shape = encoder.layers[-1].output.shape
        decode_filters = 256

        # Decoder Layers
        # Initialize decoder_0, no upsampling, (6, 8, 256) -> (6, 8, 256)
        decoder_0 = Conv2D(filters=decode_filters,
                           kernel_size=1,
                           padding='same',
                           input_shape=base_model_output_shape,
                           name='conv_Init_decoder')(encoder.output)
		
        # decoder_1, upsample, (6, 8, 256) -> (12, 16, 128)
        decoder_1 = upsample_layer(decoder_0, int(decode_filters / 2), 'up1', concat_with='conv_dw_6', base_model=encoder)
        # decoder_2, upsample, (12, 16, 128) -> (24, 32, 64)
        decoder_2 = upsample_layer(decoder_1, int(decode_filters / 4), 'up2', concat_with='conv_dw_4', base_model=encoder)
        # decoder_3, upsample, (24, 32, 64) -> (48, 64, 32)
        decoder_3 = upsample_layer(decoder_2, int(decode_filters / 8), 'up3', concat_with='conv_dw_2', base_model=encoder)

        # Channel projection: spatial size unchanged, channels become 1, (48, 64, 32) -> (48, 64, 1)
        convDepthF = Conv2D(filters=1,
                            kernel_size=3,
                            padding='same',
                            name='convDepthF')(decoder_3)

        model = Model(inputs=encoder.input, outputs=convDepthF)
        print('Number of layers in the SPEED model: {}'.format(len(model.layers)))
        model.summary()

    else:
        if not existing.endswith('.h5'):
            sys.exit('Please provide a correct model file when using [existing] argument.')

        custom_objects = {'accurate_obj_boundaries_loss': accurate_obj_boundaries_loss}
        model = models.load_model(existing, custom_objects=custom_objects)

        for layer in model.layers:
            layer.trainable = True

        print('Number of layers in the SPEED model: {}'.format(len(model.layers)))
        print('Existing model loaded.\n')

    return model

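Code analysis

The SPEED decoder consists of 4 decoder sub-layers. decoder_0 initializes the decoder: it does not upsample, and the output feature map keeps its size and channel count. decoder_1, decoder_2, and decoder_3 each upsample: the feature map size doubles and the channel count halves. After 3 upsampling steps, the final feature map has 32 channels. See the source code for details.

3.2 `MDSPP`

Purpose: perform the MDSPP operation. The spatial size of the feature map is unchanged, and the channel count grows by filters, which doubles it when the input has exactly filters channels.

Input feature map: (h, w, C)

Output feature map: (h, w, C + filters), i.e. (h, w, filters*2) when C = filters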

def depthwise_mixed_separable_pyramid_pooling(input_tensor, bin_sizes, w, h, filters, name):
    """
    (h, w, C) -> (h, w, C + filters); this equals (h, w, filters*2) when C == filters
    """
    concat_list = []

    for bin_size in bin_sizes:  # bin_sizes=[2, 4, 6, 8]
        # After AveragePooling2D + SeparableConv2D + resize:
        # (h, w, C) -> 4 tensors of shape (h, w, filters//4)
        x = AveragePooling2D(pool_size=(w // bin_size, h // bin_size), strides=(w // bin_size, h // bin_size), name=name + '_avgpool_' + str(bin_size))(input_tensor)
        x = SeparableConv2D(filters=filters // 4, kernel_size=3, padding='same', name=name + '_upconv_1_' + str(bin_size), use_bias=False)(x)
        x = tf.keras.layers.Lambda(lambda x: tf.image.resize(x, (w, h)))(x)

        concat_list.append(x)

    # Concatenate the 1st and 3rd branches (list indices 0 and 2)
    # Input: (h, w, filters//4) concat (h, w, filters//4)
    # Output: (h, w, filters//2)
    x_even = Concatenate()([concat_list[0], concat_list[2]])
    x_even = ReLU()(x_even)

    # (h, w, filters//2) -> (h, w, filters//2)
    x_even = SeparableConv2D(filters=filters // 2, kernel_size=3, padding='same', name=name + '_upconv_2_odd', use_bias=False)(x_even)
    x_even = ReLU()(x_even)

    # Concatenate the 2nd and 4th branches (list indices 1 and 3)
    # Input: (h, w, filters//4) concat (h, w, filters//4)
    # Output: (h, w, filters//2)
    x_odd = Concatenate()([concat_list[1], concat_list[3]])
    x_odd = ReLU()(x_odd)

    # Input channels: filters//2
    # Output channels: filters//2
    # (h, w, filters//2) -> (h, w, filters//2)
    x_odd = SeparableConv2D(filters=filters // 2, kernel_size=3, padding='same', name=name + '_upconv_2_even', use_bias=False)(x_odd)
    x_odd = ReLU()(x_odd)

    # Input: (h, w, filters//2) concat (h, w, filters//2) concat the input tensor (h, w, C)
    # Output: (h, w, C + filters)
    x = Concatenate()([x_even, x_odd, input_tensor])

    return x
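The output channel count of the function above can be worked out from its layers alone. The sketch below follows the two second-stage SeparableConv2D layers and the final Concatenate; the helper name and the example channel counts are assumptions chosen for illustration.

```python
def mdspp_out_channels(in_channels, filters):
    """Channel count after the final Concatenate in the function above."""
    second_stage = filters // 2          # output of each second-stage SeparableConv2D
    # x_even + x_odd + the untouched input tensor
    return second_stage + second_stage + in_channels


same = mdspp_out_channels(128, 128)   # filters equal to the input channels
half = mdspp_out_channels(128, 64)    # filters equal to half the input channels
print(same, half)
```

The channel count doubles only when filters equals the number of input channels; in general the output has in_channels + filters channels (for even filters).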

3.3 `upsample`

Purpose: upsample with a transposed convolution (Conv2DTranspose); the feature map size doubles and the channel count halves.

def upsample_layer(tensor, filters, name, concat_with, base_model):
    """
    Upsampling is performed three times
    decoder_1, filters=256//2=128, tensor=(6, 8, 256) -> (12, 16, 128)
    decoder_2, filters=256//4=64, tensor=(12, 16, 128) -> (24, 32, 64)
    decoder_3, filters=256//8=32, tensor=(24, 32, 64) -> (48, 64, 32)
    """
    def HPO2(filters_value):
        # Highest power of 2 that is <= filters_value
        for i in range(filters_value, 0, -1):
            if (i & (i - 1)) == 0:
                return i

    if name == 'up1':
        # Upsample with a transposed convolution: spatial size doubles, channels become 128
        # decoder_1, tensor=(6, 8, 256): (6, 8, 256) -> (12, 16, 128)
        up_i = Conv2DTranspose(filters=filters, kernel_size=3, strides=2, padding='same', dilation_rate=1, name=name + '_upconv', use_bias=False)(tensor)
    else:
        # Apply MDSPP: spatial size unchanged, channels = input channels + filters
        # decoder_2, filters=64: (12, 16, 128) -> (12, 16, 192)
        # decoder_3, filters=32: (24, 32, 64) -> (24, 32, 96)
        up_i = depthwise_mixed_separable_pyramid_pooling(input_tensor=tensor, bin_sizes=[2, 4, 6, 8], w=tensor.shape[1], h=tensor.shape[2], filters=filters, name=name)
        # Upsample with a transposed convolution: spatial size doubles, channels become 64 and 32
        # decoder_2, (12, 16, 192) -> (24, 32, 64)
        # decoder_3, (24, 32, 96) -> (48, 64, 32)
        up_i = Conv2DTranspose(filters=filters, kernel_size=3, strides=2, padding='same', dilation_rate=1, name=name + '_upconv_final', use_bias=False)(up_i)

    # decoder_1, conv_dw_6=(12, 16, 512): (12, 16, 128) concat (12, 16, 512) -> (12, 16, 640)
    # decoder_2, conv_dw_4=(24, 32, 256): (24, 32, 64) concat (24, 32, 256) -> (24, 32, 320)
    # decoder_3, conv_dw_2=(48, 64, 128): (48, 64, 32) concat (48, 64, 128) -> (48, 64, 160)
    up_i = Concatenate(name=name + '_concat')([up_i, base_model.get_layer(concat_with).output])  # Skip connection
    up_i = ReLU()(up_i)

    # The SeparableConv2D filter count is HPO2(channels // 4)
    # decoder_1, 640 // 4 = 160, HPO2(160) = 128: (12, 16, 640) -> (12, 16, 128)
    # decoder_2, 320 // 4 = 80, HPO2(80) = 64: (24, 32, 320) -> (24, 32, 64)
    # decoder_3, 160 // 4 = 40, HPO2(40) = 32: (48, 64, 160) -> (48, 64, 32)
    up_i = SeparableConv2D(filters=HPO2((up_i.shape[-1]) // 4),
                           kernel_size=3,
                           padding='same',
                           use_bias=False,
                           name=name + '_sep_conv')(up_i)
    up_i = ReLU()(up_i)

    return up_i

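Explanation

decoder_1: no MDSPP. It first upsamples with a transposed convolution (Conv2DTranspose), then concatenates with the skip connection, and finally applies a depthwise separable convolution (SeparableConv2D).
decoder_2: first applies MDSPP, then upsamples with a transposed convolution (Conv2DTranspose), then concatenates with the skip connection, and finally applies a depthwise separable convolution (SeparableConv2D).
decoder_3: first applies MDSPP, then upsamples with a transposed convolution (Conv2DTranspose), then concatenates with the skip connection, and finally applies a depthwise separable convolution (SeparableConv2D).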
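The final SeparableConv2D in each decoder step picks its filter count with the HPO2 helper from the code above, which returns the highest power of 2 not exceeding its argument. The sketch below reproduces that logic and compares it with an equivalent bit-length form; the function names are assumptions made for this example, and the channel counts 640, 320, and 160 are the concatenated sizes given in the code comments.

```python
def highest_power_of_two(n):
    """Largest power of 2 that is <= n (same loop as HPO2 above)."""
    for i in range(n, 0, -1):
        if (i & (i - 1)) == 0:
            return i


def highest_power_of_two_fast(n):
    """Equivalent closed form using the bit length of n (n >= 1)."""
    return 1 << (n.bit_length() - 1)


results = []
for channels in (640, 320, 160):
    results.append(highest_power_of_two(channels // 4))
    print(channels, channels // 4, results[-1])
```

The three decoder steps therefore end with 128, 64, and 32 channels, which matches the halving described in the text.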

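VI. Related notes

PSPNet

Drawing on the pooling module of PSPNet (from the CUHK and SenseTime group), ASPP uses different dilation rates in the decoder to capture information at multiple scales. Each scale is an independent branch; the branches are merged at the end of the network, followed by a convolution layer that outputs the predicted labels. This design avoids extracting redundant information in the encoder and focuses directly on the relationships between objects.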