[Shuffle Attention] SA-Net: Shuffle Attention for Deep Convolutional Neural Networks

bryant_meng

Last modified 2024-01-09 16:04:59


First published 2023-12-19 14:53:11

Article link: https://blog.csdn.net/bryant_meng/article/details/135058651


ICASSP-2021

1 Background and Motivation

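Attention is an effective way to improve accuracy in CNNs. It generally falls into channel attention and spatial attention. Some modules combine the two, but they often suffer from either convergence difficulty or heavy computation burdens.

"Can one fuse different attention modules in a lighter but more efficient way?"

Based on this question, the authors propose Shuffle Attention (SA), a lightweight and efficient module.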


2 Related Work

Multi-branch architectures: Inception / ResNet / SKNet / ShuffleNets
Grouped features: AlexNet / MobileNets / ShuffleNets / CapsuleNets / SGE
Attention mechanisms: SE / SGE / ECA-Net / GCNet / Non-Local / CBAM / DA

3 Advantages / Contributions

The paper proposes Shuffle Attention, which adds almost no parameters and gives clear gains on ImageNet-1k classification and on MS COCO object detection and instance segmentation.

4 Method


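(1) Feature Grouping

The input feature map is $X \in \mathbb{R}^{C \times H \times W}$.

It is divided into $G$ groups along the channel dimension, $X = [X_1, X_2, ..., X_G]$, with $X_k \in \mathbb{R}^{C/G \times H \times W}$.

Each group is then split into two branches,

$X_{k1}, X_{k2} \in \mathbb{R}^{C/2G \times H \times W}$,

which go through channel attention and spatial attention respectively.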
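The grouping and the two-branch split can be sketched with plain tensor reshapes. This is a minimal sketch, assuming the same layout as the implementation in (5), where the groups are folded into the batch dimension; the sizes (N=1, C=256, G=8) are illustrative only.

```python
import torch

N, C, H, W, G = 1, 256, 4, 4, 8  # illustrative sizes, not prescribed by the paper
x = torch.randn(N, C, H, W)

# Fold the G groups into the batch dimension: each X_k has C/G channels.
x_groups = x.reshape(N * G, C // G, H, W)

# Split every group into two branches of C/(2G) channels each.
x_k1, x_k2 = x_groups.chunk(2, dim=1)

print(x_groups.shape)          # torch.Size([8, 32, 4, 4])
print(x_k1.shape, x_k2.shape)  # torch.Size([8, 16, 4, 4]) for both branches
```

Folding the groups into the batch dimension lets one set of per-channel parameters of size C/(2G) be broadcast over every group.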

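(2) Channel Attention

Global average pooling over the spatial dimensions of $X_{k1}$ gives the channel statistic $s \in \mathbb{R} ^ {C/2G \times 1 \times 1}$

$W_1 \in \mathbb{R} ^ {C/2G \times 1 \times 1}$

$b_1 \in \mathbb{R} ^ {C/2G \times 1 \times 1}$

$W_1$ and $b_1$ are shared across all $G$ groups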
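Read off the implementation in (5) (global average pooling, a per-channel scale and shift, then a sigmoid gate), the channel branch can be written as below. Treat the notation ($s$ for the pooled statistic, $F_{gp}$ for global average pooling, $\sigma$ for the sigmoid) as a reconstruction for this note rather than a quotation of the paper.

```latex
s = F_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j),
\qquad
X_{k1}' = \sigma(W_1 \cdot s + b_1) \cdot X_{k1}
```

Here $W_1 \cdot s$ is an element-wise (per-channel) product, so the gate has shape $C/2G \times 1 \times 1$ and is broadcast over $H \times W$.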

(3) Spatial Attention

$W_2 \in \mathbb{R} ^ {C/2G \times 1 \times 1}$

$b_2 \in \mathbb{R} ^ {C/2G \times 1 \times 1}$

$W_2$ and $b_2$ are shared across all $G$ groups
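Following the implementation in (5) (Group Norm, a per-channel scale and shift, then a sigmoid gate), the spatial branch can be written as below. The notation is a reconstruction for this note, with $GN$ denoting Group Norm and $\sigma$ the sigmoid.

```latex
X_{k2}' = \sigma\bigl(W_2 \cdot GN(X_{k2}) + b_2\bigr) \cdot X_{k2}
```

Unlike the channel branch, the gate here keeps the full $C/2G \times H \times W$ shape, because Group Norm does not pool over the spatial dimensions.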

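Concatenating the two branches restores the channel count of the group:

$X_k^{'} = [X_{k1}^{'}, X_{k2}^{'}] \in \mathbb{R}^{C/G \times H \times W}$

The spatial branch uses Group Norm (GN) to obtain spatial attention. Some details of GN: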

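The extreme cases of GN are IN (Instance Norm) and LN (Layer Norm),

which correspond to G = C and G = 1 respectively, where G here is the number of GN groups.

torch.nn.GroupNorm(num_groups, num_channels) splits the channels into groups and normalizes within each group.

num_groups: number of groups
num_channels: number of channels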
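The two extreme cases can be checked numerically. This is a minimal sketch, assuming freshly constructed layers (GroupNorm's affine weight is 1 and bias is 0, InstanceNorm2d has no affine parameters by default) and an arbitrary random input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
C = 16
x = torch.randn(8, C, 4, 4)

# G = C: every channel is its own group, which matches Instance Norm.
gn_as_in = nn.GroupNorm(num_groups=C, num_channels=C)
same_as_in = torch.allclose(gn_as_in(x), nn.InstanceNorm2d(C)(x), atol=1e-4)

# G = 1: one group spans (C, H, W), which matches Layer Norm over those axes.
gn_as_ln = nn.GroupNorm(num_groups=1, num_channels=C)
ln_out = F.layer_norm(x, tuple(x.shape[1:]))
same_as_ln = torch.allclose(gn_as_ln(x), ln_out, atol=1e-4)

print(same_as_in, same_as_ln)
```

The equivalence holds for the normalization itself. Once the affine parameters are trained, GN scales per channel while nn.LayerNorm scales per element, so the two are no longer interchangeable.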

Further reading: the blog post "PyTorch学习之归一化层（BatchNorm、LayerNorm、InstanceNorm、GroupNorm）" (Learning PyTorch: normalization layers)

Why can GN act as spatial attention? The authors explain it as follows:

[Image: the authors' explanation from the paper]

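In the code, the authors' Group Norm is effectively used as Instance Norm (the number of groups equals the number of channels). This step has the feel of spatial attention, but multiplying by a weight w and adding a bias b feels more like channel attention, and the final sigmoid is the standard gate. The result mixes spatial and channel attention, and the learning of the spatial attention seems to be concentrated in the Instance Norm layer.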

(4) Aggregation

Concatenate all sub-features, then apply channel shuffle.
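The effect of channel shuffle is easiest to see on a toy tensor whose values equal their channel indices. This sketch uses the same reshape, permute, reshape pattern as the implementation in (5); the 8-channel input is only for illustration.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    b, c, h, w = x.shape
    x = x.reshape(b, groups, c // groups, h, w)  # split channels into groups
    x = x.permute(0, 2, 1, 3, 4)                 # swap the group and per-group axes
    return x.reshape(b, c, h, w)                 # flatten back to (b, c, h, w)

# Toy tensor: channel i holds the value i.
x = torch.arange(8).reshape(1, 8, 1, 1)
shuffled = channel_shuffle(x, groups=2)
print(shuffled.flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```

With groups=2 the first and second halves of the channels are interleaved, so information from different sub-features ends up in neighbouring channels.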

The total number of parameters in one SA module is $3 C / G$
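The $3C/G$ figure can be reproduced by counting the learnable tensors in the implementation in (5): each of the six tensors (channel weight and bias, spatial weight and bias, Group Norm affine weight and bias) has $C/2G$ entries. The helper name below is hypothetical and exists only for this illustration.

```python
def sa_param_count(channels: int, groups: int) -> int:
    """Learnable parameters of one SA module (hypothetical helper for illustration)."""
    half = channels // (2 * groups)  # channels per branch, C/(2G)
    channel_branch = 2 * half        # W1 and b1
    spatial_branch = 2 * half        # W2 and b2
    group_norm = 2 * half            # GroupNorm affine weight and bias
    return channel_branch + spatial_branch + group_norm

print(sa_param_count(256, 8))   # 96, i.e. 3 * 256 / 8
print(sa_param_count(512, 64))  # 24, i.e. 3 * 512 / 64
```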

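(5) Implementation

Official code: https://github.com/wofmanaf/SA-Net

Taking an input of shape (1, 256, 4, 4) with G = 8 as an example, the comments below trace how the feature-map shape changes at each step. Note that the constructor default is groups=64; the shape comments assume channel=256 and groups=8.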

import torch
import torch.nn as nn
from torch.nn.parameter import Parameter


class sa_layer(nn.Module):
    """Constructs a Channel Spatial Group module.

    Args:
        channel: number of input channels C
        groups: number of feature groups G
    """

    def __init__(self, channel, groups=64):
        super(sa_layer, self).__init__()
        self.groups = groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # Per-channel scale and shift for the two branches, shared by all groups
        self.cweight = Parameter(torch.zeros(1, channel // (2 * groups), 1, 1))  # (1, 16, 1, 1)
        self.cbias = Parameter(torch.ones(1, channel // (2 * groups), 1, 1))  # (1, 16, 1, 1)
        self.sweight = Parameter(torch.zeros(1, channel // (2 * groups), 1, 1))  # (1, 16, 1, 1)
        self.sbias = Parameter(torch.ones(1, channel // (2 * groups), 1, 1))  # (1, 16, 1, 1)

        self.sigmoid = nn.Sigmoid()
        # num_groups == num_channels, so this GroupNorm behaves like Instance Norm
        self.gn = nn.GroupNorm(channel // (2 * groups), channel // (2 * groups))  # 16, 16

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape  # (1, 256, 4, 4)

        x = x.reshape(b, groups, -1, h, w)  # (1, 2, 128, 4, 4)
        x = x.permute(0, 2, 1, 3, 4)  # (1, 128, 2, 4, 4)

        # flatten
        x = x.reshape(b, -1, h, w)  # (1, 256, 4, 4)

        return x

    def forward(self, x):
        b, c, h, w = x.shape  # (1, 256, 4, 4)

        # fold the groups into the batch dimension, then split into two branches
        x = x.reshape(b * self.groups, -1, h, w)  # (8, 32, 4, 4)
        x_0, x_1 = x.chunk(2, dim=1)  # (8, 16, 4, 4), (8, 16, 4, 4)

        # channel attention
        xn = self.avg_pool(x_0)  # (8, 16, 1, 1)
        xn = self.cweight * xn + self.cbias  # (8, 16, 1, 1)
        xn = x_0 * self.sigmoid(xn)  # (8, 16, 4, 4)

        # spatial attention
        xs = self.gn(x_1)  # (8, 16, 4, 4)
        xs = self.sweight * xs + self.sbias  # (8, 16, 4, 4)
        xs = x_1 * self.sigmoid(xs)  # (8, 16, 4, 4)

        # concatenate along channel axis
        out = torch.cat([xn, xs], dim=1)  # (8, 32, 4, 4)
        out = out.reshape(b, -1, h, w)  # (1, 256, 4, 4)

        out = self.channel_shuffle(out, 2)
        return out
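One consequence of the initial values in the listing (weights 0, biases 1) is that both gates start at sigmoid(1) ≈ 0.731, so a freshly built module only rescales and shuffles its input. The sketch below checks this with a functional restatement of the forward pass; sa_forward is a hypothetical helper written for this note, and it leaves the Group Norm affine parameters at identity.

```python
import math
import torch
import torch.nn.functional as F

def channel_shuffle(x, groups):
    b, c, h, w = x.shape
    return x.reshape(b, groups, -1, h, w).permute(0, 2, 1, 3, 4).reshape(b, -1, h, w)

def sa_forward(x, groups, cweight, cbias, sweight, sbias):
    """Functional restatement of the forward pass (hypothetical helper)."""
    b, c, h, w = x.shape
    x = x.reshape(b * groups, -1, h, w)
    x_0, x_1 = x.chunk(2, dim=1)
    half = x_0.shape[1]
    # channel branch: global average pooling, scale and shift, sigmoid gate
    xn = x_0 * torch.sigmoid(cweight * F.adaptive_avg_pool2d(x_0, 1) + cbias)
    # spatial branch: Group Norm with one group per channel, scale and shift, gate
    xs = x_1 * torch.sigmoid(sweight * F.group_norm(x_1, half) + sbias)
    out = torch.cat([xn, xs], dim=1).reshape(b, -1, h, w)
    return channel_shuffle(out, 2)

torch.manual_seed(0)
x = torch.randn(1, 256, 4, 4)
half = 256 // (2 * 8)
zeros, ones = torch.zeros(1, half, 1, 1), torch.ones(1, half, 1, 1)

out = sa_forward(x, 8, zeros, ones, zeros, ones)
gate = 1 / (1 + math.exp(-1))            # sigmoid(1)
expected = channel_shuffle(x * gate, 2)  # input scaled by the gate, then shuffled

print(out.shape)
print(torch.allclose(out, expected, atol=1e-6))
```

The output keeps the input shape, so the module can be inserted after any convolutional block without changing the surrounding layers.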