Squeeze-and-Excitation Blocks, CVPR 2018


Let $\mathbf{V}=\left[\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_C\right]$ denote the learned set of filter kernels, where $\mathbf{v}_c$ refers to the parameters of the $c$-th filter. We can then write the outputs of $\mathbf{F}_{tr}$ as $\mathbf{U}=\left[\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_C\right]$, where
$$\mathbf{u}_c=\mathbf{v}_c * \mathbf{X}=\sum_{s=1}^{C^{\prime}} \mathbf{v}_c^s * \mathbf{x}^s .$$

Here $*$ denotes convolution, $\mathbf{v}_c=\left[\mathbf{v}_c^1, \mathbf{v}_c^2, \ldots, \mathbf{v}_c^{C^{\prime}}\right]$ and $\mathbf{X}=\left[\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^{C^{\prime}}\right]$ (to simplify the notation, bias terms are omitted), while $\mathbf{v}_c^s$ is a 2D spatial kernel, and therefore represents a single channel of $\mathbf{v}_c$ which acts on the corresponding channel of $\mathbf{X}$.
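To make the sum over input channels concrete, here is a minimal sketch in plain Python (lists instead of tensors, so every term is visible). The helper names `correlate2d_valid` and `output_channel` and the toy values are illustrative assumptions, not part of the paper. As in most deep learning frameworks, the "convolution" is computed as a cross-correlation, without flipping the kernel.

```python
# Sketch: one output channel u_c as a sum of per-input-channel 2D correlations.

def correlate2d_valid(x, k):
    """'Valid' 2D cross-correlation of a single-channel map x with kernel k."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(k[a][b] * x[i + a][j + b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def output_channel(X, v_c):
    """u_c = sum over s of (v_c^s * x^s); X and v_c are lists over input channels."""
    per_channel = [correlate2d_valid(x_s, v_s) for x_s, v_s in zip(X, v_c)]
    out_h, out_w = len(per_channel[0]), len(per_channel[0][0])
    return [
        [sum(m[i][j] for m in per_channel) for j in range(out_w)]
        for i in range(out_h)
    ]

# Hypothetical toy input: C' = 2 channels of size 3x3, one 2x2 kernel per channel.
X = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
v_c = [
    [[1, 0], [0, 1]],  # v_c^1: adds the main diagonal of each 2x2 window
    [[1, 1], [1, 1]],  # v_c^2: adds all four entries of each 2x2 window
]
u_c = output_channel(X, v_c)
print(u_c)
```

Because the per-channel results are simply added, any relationship between input channels is folded into the same weights that capture spatial structure, which is the entanglement the SE block is meant to address.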

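In other words, in an ordinary convolutional network each output channel is tied to the spatial convolution kernels (one output channel is the sum of the convolutions over all input channels), so the channel dependencies are entangled with the spatial correlations captured by the filters. The authors state their goal as follows: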

Our goal is to ensure that the network is able to increase its sensitivity to informative features so that they can be exploited by subsequent transformations, and to suppress less useful ones. We propose to achieve this by explicitly modelling channel interdependencies to recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation.

Squeeze: Global Information Embedding


Formally, a statistic $\mathbf{z} \in \mathbb{R}^C$ is generated by shrinking $\mathbf{U}$ through spatial dimensions $H \times W$, where the $c$-th element of $\mathbf{z}$ is calculated by:
$$z_c=\mathbf{F}_{sq}\left(\mathbf{u}_c\right)=\frac{1}{H \times W} \sum_{i=1}^H \sum_{j=1}^W u_c(i, j)$$
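The squeeze step is just global average pooling, one scalar per channel. A minimal plain-Python sketch, where the function name `squeeze` and the toy feature maps are assumptions made for illustration:

```python
def squeeze(U):
    """Global average pooling: U is a list of C feature maps, each H x W."""
    z = []
    for u_c in U:
        h, w = len(u_c), len(u_c[0])
        # z_c = (1 / (H * W)) * sum over all positions (i, j) of u_c(i, j)
        z.append(sum(sum(row) for row in u_c) / (h * w))
    return z

# Hypothetical toy input: C = 3 channels, H = W = 2.
U = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[-1.0, 1.0], [-1.0, 1.0]],
]
z = squeeze(U)
print(z)
```

Note that the second channel has a single strong response and the first has a uniform moderate one, yet their descriptors are close: the squeeze keeps only the mean and discards where the activation occurred.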

Excitation: Adaptive Recalibration

To make use of the information aggregated in the squeeze operation, we follow it with a second operation whose aim is to fully capture channel-wise dependencies.
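For reference, the gating function that the SE paper uses for this step (and that the `SqEx` module later in this post mirrors with two linear layers, a ReLU and a sigmoid) can be written as below. Here $\delta$ is the ReLU, $\sigma$ is the sigmoid, and $r$ is the reduction ratio, which corresponds to the `reduction` argument in the code.

```latex
\mathbf{s} = \mathbf{F}_{ex}(\mathbf{z}, \mathbf{W})
           = \sigma\left(\mathbf{W}_2\, \delta\left(\mathbf{W}_1 \mathbf{z}\right)\right),
\qquad
\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}, \quad
\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}
```

The sigmoid keeps every gate $s_c$ in $(0, 1)$ and, unlike a softmax, allows several channels to be emphasised at the same time.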

The final output of the block is obtained by rescaling the transformation output $\mathbf{U}$ with the activations:
$$\widetilde{\mathbf{x}}_c=\mathbf{F}_{scale}\left(\mathbf{u}_c, s_c\right)=s_c \cdot \mathbf{u}_c,$$
where $\widetilde{\mathbf{X}}=\left[\widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \ldots, \widetilde{\mathbf{x}}_C\right]$ and $\mathbf{F}_{scale}\left(\mathbf{u}_c, s_c\right)$ refers to channel-wise multiplication between the feature map $\mathbf{u}_c \in \mathbb{R}^{H \times W}$ and the scalar $s_c$.
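A small worked example of the excitation and rescaling steps, in plain Python. It assumes the gates are produced by a two-layer bottleneck with a ReLU and a sigmoid (as in the `SqEx` module later in this post), with biases omitted. The weights `W1` and `W2` and all numeric values are made up for illustration.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def excite(z, W1, W2):
    """s = sigmoid(W2 . relu(W1 . z)); biases omitted for brevity."""
    return sigmoid(matvec(W2, relu(matvec(W1, z))))

def scale(U, s):
    """x~_c = s_c * u_c: multiply every entry of channel c by the scalar s_c."""
    return [[[s_c * val for val in row] for row in u_c] for u_c, s_c in zip(U, s)]

# Hypothetical toy values: C = 2 channels, bottleneck width C / r = 1.
z = [2.0, 0.0]        # squeezed channel descriptor
W1 = [[1.0, -1.0]]    # shape (C / r) x C = 1 x 2
W2 = [[0.0], [-1.0]]  # shape C x (C / r) = 2 x 1
U = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 10.0], [10.0, 10.0]],
]

s = excite(z, W1, W2)   # one gate per channel, each in (0, 1)
X_tilde = scale(U, s)   # same shape as U
print(s)
print(X_tilde)
```

With these weights the first channel is halved (its pre-activation is 0, so its gate is exactly 0.5), while the second channel is suppressed more strongly because its pre-activation is negative.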




  1. First implementation [1]
import torch.nn as nn

class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # Bottleneck gate: FC -> ReLU -> FC -> Sigmoid
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)  # squeeze: global average pooling to (b, c)
        y = self.fc(y).view(b, c, 1, 1)  # excitation: per-channel gates in (0, 1)
        return x * y.expand_as(x)        # scale: reweight each channel of x
  2. Second implementation [2]
import torch
import torch.nn as nn
import torch.nn.functional as F

# Squeeze and Excitation module

class SqEx(nn.Module):

    def __init__(self, n_features, reduction=16):
        super(SqEx, self).__init__()

        if n_features % reduction != 0:
            raise ValueError('n_features must be divisible by reduction (default = 16)')

        self.linear1 = nn.Linear(n_features, n_features // reduction, bias=True)
        self.nonlin1 = nn.ReLU(inplace=True)
        self.linear2 = nn.Linear(n_features // reduction, n_features, bias=True)
        self.nonlin2 = nn.Sigmoid()

    def forward(self, x):

        y = F.avg_pool2d(x, kernel_size=x.size()[2:4])
        y = y.permute(0, 2, 3, 1)
        y = self.nonlin1(self.linear1(y))
        y = self.nonlin2(self.linear2(y))
        y = y.permute(0, 3, 1, 2)
        y = x * y
        return y

# Residual block using Squeeze and Excitation

class ResBlockSqEx(nn.Module):

    def __init__(self, n_features):
        super(ResBlockSqEx, self).__init__()

        # convolutions

        self.norm1 = nn.BatchNorm2d(n_features)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(n_features, n_features, kernel_size=3, stride=1, padding=1, bias=False)

        self.norm2 = nn.BatchNorm2d(n_features)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(n_features, n_features, kernel_size=3, stride=1, padding=1, bias=False)

        # squeeze and excitation

        self.sqex  = SqEx(n_features)

    def forward(self, x):
        # convolutions

        y = self.conv1(self.relu1(self.norm1(x)))
        y = self.conv2(self.relu2(self.norm2(y)))

        # squeeze and excitation

        y = self.sqex(y)

        # add residuals
        y = torch.add(x, y)

        return y

