Daily Paper 230928 -- Paying More Attention to Attention

Venue: ICLR 2017

Code: GitHub - szagoruyko/attention-transfer: Improving Convolutional Networks via Attention Transfer (ICLR 2017)

Overview:

By learning from the teacher model's attention maps, we can also improve the performance of a CNN-based student model.

The authors' idea is as follows:

Can we use the attention mechanism to improve the performance of CNNs?

The teacher model can improve the student model by passing on where its own attention is focused.

Answering that question first requires answering another one:

How should we define attention for a CNN?

The authors give their own definition:

we consider attention as a set of spatial maps that essentially try to encode on which spatial areas of the input the network focuses most for taking its output decision

They propose two kinds of spatial attention maps: activation-based and gradient-based.

Attention Transfer

1. ACTIVATION-BASED ATTENTION TRANSFER

The authors offer the following choices of spatial attention mapping function.
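
The three candidates all collapse the absolute values of an activation tensor A (shape C×H×W) over the channel dimension. A minimal PyTorch sketch, where the (C, H, W) layout and the p = 2 default are illustrative assumptions rather than the authors' exact code:

```python
import torch

# A: activation tensor of shape (C, H, W); each mapping returns an (H, W) spatial map.
def f_sum(A):
    # sum of absolute values across the channel dimension
    return A.abs().sum(dim=0)

def f_sum_p(A, p=2):
    # sum of absolute values raised to the power p
    return A.abs().pow(p).sum(dim=0)

def f_max_p(A, p=2):
    # channel-wise maximum of absolute values raised to the power p
    return A.abs().pow(p).max(dim=0).values
```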

They also examined deep networks of differing accuracy and found that stronger networks show peaks in their attention maps over the salient regions of the feature maps, whereas weaker networks do not.

We found that the above statistics of hidden activations not only have spatial correlation with predicted objects on image level, but these correlations also tend to be higher in networks with higher accuracy, and stronger networks have peaks in attention where weak networks don’t.

 

Based on these observations, the authors state their goal:

the goal is to train a student network that will not only make correct predictions but will also have attention maps that are similar to those of the teacher. In general, one can place transfer losses w.r.t. attention maps computed across several layers

The authors then give the corresponding loss function, and note that it can also be combined with Hinton's distillation temperature (though that part is not their own contribution). The meaning of the formula is simple: the student's cross-entropy loss plus a penalty on the norm of the difference between the normalized student and teacher attention maps.
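
Below is a minimal PyTorch sketch of that objective, assuming the squared-activation mapping and L2-normalized, flattened attention maps; the helper names and the β value are illustrative, not the authors' exact code:

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # feat: (N, C, H, W) -> L2-normalized, flattened spatial attention map of shape (N, H*W)
    am = feat.pow(2).mean(dim=1)
    return F.normalize(am.view(am.size(0), -1), dim=1)

def at_loss(student_feat, teacher_feat):
    # distance between the normalized student and teacher attention maps
    return (attention_map(student_feat) - attention_map(teacher_feat)).pow(2).mean()

def total_loss(student_logits, labels, student_feats, teacher_feats, beta=1e3):
    # student cross-entropy plus the attention-transfer penalty over matched layer pairs
    ce = F.cross_entropy(student_logits, labels)
    at = sum(at_loss(fs, ft.detach()) for fs, ft in zip(student_feats, teacher_feats))
    return ce + beta / 2 * at
```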

2. GRADIENT-BASED ATTENTION TRANSFER

Attention can also be defined via gradients with respect to the input: the more concentrated the network's attention is on a region, the more a small change in its pixels will change the output, i.e. the larger the gradient there.

The corresponding idea is therefore to have the student model mimic the teacher model's input gradients; the loss aims to minimize the distance between the two networks' gradient maps.
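
A rough sketch of what this could look like in PyTorch, using create_graph=True so the gradient-matching term can itself be backpropagated through the student; the function names and β are assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def gradient_at_loss(student, teacher, x, y, beta=1e3):
    # student side: gradient of the student loss w.r.t. the input, with create_graph=True
    # so the matching term remains differentiable w.r.t. the student weights
    xs = x.clone().requires_grad_(True)
    s_loss = F.cross_entropy(student(xs), y)
    (j_s,) = torch.autograd.grad(s_loss, xs, create_graph=True)

    # teacher side: a plain first-order gradient, treated as a fixed target
    xt = x.clone().requires_grad_(True)
    t_loss = F.cross_entropy(teacher(xt), y)
    (j_t,) = torch.autograd.grad(t_loss, xt)

    # classification loss plus the distance between the two gradient "attention" maps
    return s_loss + beta / 2 * (j_s - j_t.detach()).pow(2).sum()
```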

Experimental Results

1. Activation-based attention transfer

Judging from the results, AT alone does not do as well as KD, but the combination of AT and KD achieves better results.
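
As a rough illustration of how the two objectives can be mixed, the sketch below combines a temperature-softened KD term with the activation-based AT term from the earlier sketch; the values of T, α and β are placeholders, not numbers reported in the paper:

```python
import torch
import torch.nn.functional as F

def kd_at_loss(s_logits, t_logits, y, s_feats, t_feats, T=4.0, alpha=0.9, beta=1e3):
    # Hinton-style KD term: KL divergence between temperature-softened distributions
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction='batchmean') * (T * T)
    # hard-label cross-entropy on the student
    ce = F.cross_entropy(s_logits, y)
    # attention-transfer term, reusing at_loss() from the activation-based sketch above
    at = sum(at_loss(fs, ft.detach()) for fs, ft in zip(s_feats, t_feats))
    return alpha * kd + (1 - alpha) * ce + beta / 2 * at
```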

2. Gradient-based attention transfer

In this set of experiments, gradient-based AT also brings some improvement, but activation-based AT outperforms it. The authors give an explanation:

We also trained a network with activation-based AT in the same training conditions, which resulted in the best performance among all methods. We should note that the architecture of student NIN without batch normalization is slightly different from teacher network, it doesn’t have ReLU activations before pooling layers, which leads to better performance without batch normalization, and worse with. So to achieve the best performance with activation-based AT we had to train a new teacher, with batch normalization and without ReLU activations before pooling layers, and have AT losses on outputs of convolutional layers.

Summary

The authors propose a strategy for transferring attention from teacher to student, in two variants, gradient-based and activation-based, and show that both are effective.
