[Attention] CBAM: Convolutional Block Attention Module

Link to this post: https://blog.csdn.net/bryant_meng/article/details/88889004

ECCV-2018

Contents

1 Background and motivation
2 Innovations / Advantages / Contributions
3 Related work
4 Methods
- 4.1 Channel attention module
- 4.2 Spatial attention module
5 Experiment
6 Conclusion (own)

1 Background and motivation

To improve CNN performance, much prior work has explored three factors: depth, width, and cardinality (groups).

On cardinality, the paper says of Xception and ResNeXt: "They empirically show that cardinality not only saves the total number of parameters but also results in stronger representation power than the other two factors: depth and width."

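The authors instead approach the problem through attention, aiming to increase representation power by focusing on important features and suppressing unnecessary ones.

Squeeze-and-Excitation Networks (SENet) was also published in 2018 (at CVPR, whereas CBAM appeared at ECCV), but its preprint came out considerably earlier. This paper improves on SENet's attention module, which uses attention to reweight channels: it adds attention along the spatial dimension as well as the channel dimension, achieving adaptive feature refinement, and it obtains good results on public datasets.

channel: "what" to attend to
spatial: "where" to attend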

2 Innovations / Advantages / Contributions

- A lightweight and general module (plug-and-play)
- Better performance and better interpretability

Contributions

- Proposes the CBAM module
- Validates its effectiveness through ablation experiments
- Greatly improves various networks on several benchmarks (ImageNet-1K, MS COCO, and VOC 2007)

3 Related work

Network engineering (width, depth, cardinality)
- Inception-v1
- ResNet
- WideResNet
- Inception-ResNet
- ResNeXt
- PyramidNet
- DenseNet
Attention mechanism (One important property of a human visual system is that one does not attempt to process a whole scene at once.)
- Residual Attention Network
- SENet
- Bottleneck attention module

4 Methods

The module focuses on important features and suppresses unnecessary ones (emphasize or suppress).

$F \in \mathbb{R}^{C \times H \times W}$: input feature map
$M_c \in \mathbb{R}^{C \times 1 \times 1}$: channel attention map
$M_s \in \mathbb{R}^{1 \times H \times W}$: spatial attention map
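
Written out with the symbols above, the overall refinement in the paper is two sequential steps. Here $\otimes$ is element-wise multiplication, with each attention map broadcast along the dimensions it lacks; $F'$ and $F''$ are the intermediate and final refined feature maps:

$$
F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'
$$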

4.1 Channel attention module

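Note that the MLP weights $W_0$ and $W_1$ are shared between the two inputs: the same MLP is applied to both the average-pooled and the max-pooled descriptor.

$W_0 \in \mathbb{R}^{\frac{C}{r} \times C}$: first MLP layer (reduction by ratio $r$, followed by ReLU)
$W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$: second MLP layer (back to $C$ channels)
$F_{avg}^c \in \mathbb{R}^{C \times 1 \times 1}$: feature map average-pooled over the spatial dimensions
$F_{max}^c \in \mathbb{R}^{C \times 1 \times 1}$: feature map max-pooled over the spatial dimensions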
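
To make the shapes concrete, here is a minimal NumPy sketch of the channel attention formula from the paper, $M_c(F) = \sigma(W_1(W_0(F_{avg}^c)) + W_1(W_0(F_{max}^c)))$. It is only an illustration, not the authors' implementation: the function names, the random weights, and the toy sizes (C = 32, r = 16) are my own assumptions, and the MLP is bias-free as in the formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """F: (C, H, W); W0: (C//r, C); W1: (C, C//r). Returns M_c with shape (C, 1, 1)."""
    f_avg = F.mean(axis=(1, 2))   # F_avg^c, shape (C,)
    f_max = F.max(axis=(1, 2))    # F_max^c, shape (C,)

    def mlp(v):
        # Shared two-layer MLP: the same W0 and W1 serve both descriptors.
        return W1 @ np.maximum(W0 @ v, 0.0)   # ReLU after W0

    m = sigmoid(mlp(f_avg) + mlp(f_max))
    return m.reshape(-1, 1, 1)

# Toy example with hypothetical sizes and random (untrained) weights.
rng = np.random.default_rng(0)
C, H, W, r = 32, 8, 8, 16
F = rng.standard_normal((C, H, W))
W0 = rng.standard_normal((C // r, C)) * 0.1
W1 = rng.standard_normal((C, C // r)) * 0.1

M_c = channel_attention(F, W0, W1)
F_refined = M_c * F   # broadcast over H and W
```

Each channel of `F` is scaled by one number in (0, 1), which is the "what" part of the attention.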

4.2 Spatial attention module

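$F_{avg}^s \in \mathbb{R}^{1 \times H \times W}$: feature map average-pooled along the channel axis
$F_{max}^s \in \mathbb{R}^{1 \times H \times W}$: feature map max-pooled along the channel axis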
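
The spatial module in the paper is $M_s(F) = \sigma(f^{7 \times 7}([F_{avg}^s; F_{max}^s]))$: the two pooled maps are concatenated and passed through a single 7x7 convolution. Below is a rough NumPy sketch under my own assumptions (zero "same" padding, no bias, a naive loop instead of an optimized convolution, random untrained weights); the names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(F, kernel):
    """F: (C, H, W); kernel: (2, k, k) with odd k. Returns M_s with shape (1, H, W)."""
    f_avg = F.mean(axis=0)            # F_avg^s, shape (H, W)
    f_max = F.max(axis=0)             # F_max^s, shape (H, W)
    desc = np.stack([f_avg, f_max])   # concatenated descriptor, shape (2, H, W)

    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(desc, ((0, 0), (p, p), (p, p)))   # zero padding keeps H x W
    H, W = f_avg.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Cross-correlation, which is what deep-learning "conv" layers compute.
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None, :, :]

# Toy example with hypothetical sizes and random (untrained) weights.
rng = np.random.default_rng(0)
F = rng.standard_normal((32, 8, 8))
kernel = rng.standard_normal((2, 7, 7)) * 0.1

M_s = spatial_attention(F, kernel)
F_refined = M_s * F   # broadcast over channels
```

Every spatial location gets one weight in (0, 1) shared by all channels, which is the "where" part.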

5 Experiment

CBAM is added after the last feature map of each block.

5.1 Ablation studies

5.1.1 Channel attention

Baseline: ResNet-50
Reduction ratio r = 16
The authors argue that max-pooled features, which encode the degree of the most salient part, can compensate for the average-pooled features, which encode global statistics softly.

5.1.2 Spatial attention

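Row 1: SE, which uses only average pooling for channel attention
Row 2: "channel" means channel attention with both max and average pooling
Channel-wise avg & max pooling beats a 1x1 conv, and a larger kernel size k works better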

It implies that a broad view (i.e. large receptive field) is needed for deciding spatially important regions.

5.1.3 Arrangement of the channel and spatial attention

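How should the channel and spatial modules be arranged: sequentially (and if so, which comes first) or in parallel? The paper's ablation favors the sequential, channel-first arrangement.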

5.2 Image Classification on ImageNet-1K

1.2 million images for training and 50,000 for validation with 1,000 object classes.

CBAM beats SENet, the winning approach of the ILSVRC 2017 classification task.

The overall overhead of CBAM is quite small in terms of both parameters and computation, so the authors also tried it on lightweight networks.
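
To get a feel for the parameter overhead, here is a back-of-the-envelope count for ResNet-50 with r = 16 and a 7x7 spatial kernel. This is my own estimate, not a figure from the paper: it assumes bias-free layers, one CBAM per residual block, and the usual ResNet-50 stage layout (3, 4, 6, 3 blocks with 256, 512, 1024, 2048 output channels). The helper name is hypothetical.

```python
def cbam_extra_params(channels, r=16, k=7):
    """Extra weights of one CBAM module, assuming bias-free layers."""
    channel_mlp = 2 * channels * (channels // r)   # W0 and W1
    spatial_conv = 2 * k * k                       # one k x k conv over 2 input maps
    return channel_mlp, spatial_conv

# Assumed ResNet-50 layout: (output channels, number of blocks) per stage.
RESNET50_STAGES = [(256, 3), (512, 4), (1024, 6), (2048, 3)]

total_channel = 0
total_spatial = 0
for channels, blocks in RESNET50_STAGES:
    mlp, conv = cbam_extra_params(channels)
    total_channel += mlp * blocks
    total_spatial += conv * blocks

print(f"channel MLPs:  {total_channel:,} weights")   # 2,514,944
print(f"spatial convs: {total_spatial:,} weights")   # 1,568
```

Under these assumptions the channel MLPs add about 2.5M weights (roughly a tenth of ResNet-50's parameters), while the spatial convolutions add only 1,568 in total, so the spatial module costs almost nothing on top of an SE-style channel module.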

5.3 Network Visualization with Grad-CAM

Grad-CAM calculates the importance of the spatial locations in convolutional layers.

P denotes the softmax score of each network for the ground-truth class.

The visualizations show that CBAM covers the target object regions better than the other methods, and its predicted scores are also higher.
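
Grad-CAM itself is not spelled out in this post. As a rough sketch of the standard formulation (channel weights from spatially averaged gradients, then a ReLU over the weighted sum of activations), assuming the activations and gradients of one conv layer have already been extracted as arrays; the final division by the maximum is just my display convenience, and the function name is hypothetical:

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (K, H, W) arrays for one conv layer and one class.
    gradients holds d(class score)/d(activations). Returns an (H, W) map in [0, 1]."""
    weights = gradients.mean(axis=(1, 2))              # one importance weight per channel
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0.0)                         # keep positive evidence only (ReLU)
    if cam.max() > 0:
        cam = cam / cam.max()                          # scale to [0, 1] for display
    return cam

# Toy example: two channels that fire on different patches.
A = np.zeros((2, 4, 4))
A[0, :2, :2] = 1.0   # channel 0 fires on the top-left patch
A[1, 2:, 2:] = 1.0   # channel 1 fires on the bottom-right patch
# The class score rises with channel 0 and falls with channel 1.
G = np.stack([np.ones((4, 4)), -np.ones((4, 4))])

cam = grad_cam(A, G)   # highlights only the top-left patch
```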

5.4 Quantitative evaluation of improved interpretability

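This resembles a psychology experiment with human subjects, or a user study from the visualization field.
50 images, with 25 respondents per image.
Question: "Given class label, which region seems more class-discriminative?" (comparing the last two methods in Figure 5; the regions are cropped as follows: "For the visualizations, regions of the image with Grad-CAM values of 0.6 or greater are shown.")
There are three possible answers; the results are tallied in Table 6.

CBAM clearly does better than the baseline.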
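
The cropping rule quoted above (show only regions with Grad-CAM values of 0.6 or greater) can be sketched as a simple mask. I am assuming the Grad-CAM map has already been scaled to [0, 1] and resized to the image size; the function name is hypothetical.

```python
import numpy as np

def show_salient_regions(image, cam, threshold=0.6):
    """image: (H, W, 3); cam: (H, W) in [0, 1], same spatial size as the image.
    Returns the image with low-importance pixels blacked out, plus the mask."""
    mask = cam >= threshold
    return image * mask[..., None], mask

# Toy example: a white image and a Grad-CAM map that increases left to right.
image = np.ones((4, 5, 3))
cam = np.tile(np.array([0.0, 0.3, 0.6, 0.8, 1.0]), (4, 1))

visible, mask = show_salient_regions(image, cam)
```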

5.5 MS COCO Object Detection

80k training images (“2014 train”)
40k validation images (“2014 val”)

The results demonstrate the generalization performance of CBAM on other recognition tasks.

5.6 VOC 2007 Object Detection

6 Conclusion (own)

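I find the authors' line of exposition very good (worth learning from). Reading ECCV papers always puts me in a good mood, perhaps because many of them are written by Chinese authors.

- Proposes the CBAM module
- Validates its effectiveness through ablation experiments
- Greatly improves various networks on several benchmarks (ImageNet-1K, MS COCO, and VOC 2007)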

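The post 计算机视觉中attention机制的理解 ("Understanding attention mechanisms in computer vision") is a good summary; some excerpts follow.

1) Categories
Attention mechanisms can be divided into four categories: item-wise soft attention, item-wise hard attention, location-wise soft attention, and location-wise hard attention.

Soft attention focuses more on regions or channels; it is deterministic and differentiable.
Hard attention differs from soft attention in that it focuses more on points, i.e. every point in the image may give rise to attention. It is also a stochastic prediction process that emphasizes dynamic change. Most importantly, hard attention is non-differentiable, so training is usually done with reinforcement learning.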
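
The soft/hard distinction can be illustrated with a small sketch. This is my own toy illustration of the idea, not code from the excerpted article: soft attention takes a softmax-weighted average of all items, while hard attention samples a single item from the same distribution. The function names and sizes are hypothetical.

```python
import numpy as np

def soft_attention(features, scores):
    """features: (N, D); scores: (N,). Deterministic, differentiable weighted average."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over the N items
    return weights @ features, weights

def hard_attention(features, scores, rng):
    """Stochastic: sample a single item according to the same softmax distribution."""
    _, weights = soft_attention(features, scores)
    index = rng.choice(len(features), p=weights)   # discrete choice, no gradient flows through it
    return features[index], index

rng = np.random.default_rng(0)
features = rng.standard_normal((5, 3))
scores = rng.standard_normal(5)

soft_out, weights = soft_attention(features, scores)    # same result on every call
hard_out, index = hard_attention(features, scores, rng) # depends on the random draw
```

Because the sampling step is discrete, gradients cannot pass through it, which is why hard attention is typically trained with reinforcement learning.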

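2) Model structure

Spatial domain
Channel domain
The principle of channel-domain attention is simple and can be understood from the viewpoint of basic signal transforms. In signals and systems analysis, any (well-behaved) signal can be written as a linear combination of sinusoids; after a time-frequency transform, a sinusoid that is continuous in the time domain can be represented by a single frequency value.
In a CNN, an image is initially represented by three channels (R, G, B). After passing through different convolution kernels, each channel generates new signals. For example, applying a convolution with 64 kernels to the image features yields a tensor with 64 new channels (H, W, 64), where H and W are the height and width of the feature map. Each channel's feature represents the component of the image on a different kernel, which is analogous to a time-frequency transform: convolving with the kernels plays a role similar to the Fourier transform, decomposing the information in a feature channel into signal components on the 64 kernels. Since every signal can be decomposed into components on kernel functions, the 64 new channels surely contribute unequally to the key information. If we give each channel a weight that represents its relevance to the key information, then the larger the weight, the higher the relevance, and the more attention that channel deserves.
Mixed domain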

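CNN visualization methods (from CNN可视化技术总结(一)-特征图可视化, "A summary of CNN visualization techniques (1): feature map visualization")

Feature map visualization. There are two kinds of methods. One directly maps a layer's feature map to the 0-255 range and displays it as an image. The other uses a deconvolution network (deconvolution, unpooling) to turn the feature map into an image.
Kernel visualization.
Class activation visualization. This is mainly used to determine which image regions matter most for recognizing a given class. A common example is the heat map: when recognizing a cat, the heat map shows directly how much each image region contributes. The main methods currently in use are the CAM family (CAM, Grad-CAM, Grad-CAM++).
Tools. Tools open-sourced by researchers for visualizing a given layer of a CNN model.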
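
The first kind of feature map visualization (mapping a feature map straight to the 0-255 range) can be sketched with min-max scaling. This is a minimal illustration with a hypothetical helper name; the small `eps` is my own addition to avoid division by zero on constant maps.

```python
import numpy as np

def feature_map_to_image(fmap, eps=1e-8):
    """fmap: (H, W) float array. Returns a uint8 grayscale image via min-max scaling."""
    lo, hi = fmap.min(), fmap.max()
    scaled = (fmap - lo) / (hi - lo + eps)   # values in [0, 1)
    return np.round(scaled * 255.0).astype(np.uint8)

# Toy example: the smallest activation becomes black, the largest white.
fmap = np.array([[-1.0, 0.0],
                 [ 1.0, 3.0]])
img = feature_map_to_image(fmap)
```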

Recommended reading: CNN可视化技术总结（三）–类可视化 ("A summary of CNN visualization techniques (3): class visualization")