Daily Paper 230928 -- Paying More Attention to Attention

Venue: ICLR 2017

Code: GitHub - szagoruyko/attention-transfer: Improving Convolutional Networks via Attention Transfer (ICLR 2017)

Overview:

By learning from the teacher model's attention maps, we can also improve the performance of a CNN-based student model.

The authors' idea is as follows:

Can we use the attention mechanism to improve the performance of CNNs?

The teacher model can improve the student model by passing on where its own attention is focused.

Answering that question first requires answering another one:

How should we define attention for a CNN?

The authors give their own definition:

we consider attention as a set of spatial maps that essentially try to encode on which spatial areas of the input the network focuses most for taking its output decision

They propose two kinds of spatial attention maps: activation-based and gradient-based.

Attention Transfer

1. ACTIVATION-BASED ATTENTION TRANSFER

The authors offer the following choices of spatial attention mapping function.
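
The three candidates all collapse the absolute values of an activation tensor A (shape C×H×W) over the channel dimension. A minimal PyTorch sketch, where the (C, H, W) layout and the p = 2 default are illustrative assumptions rather than the authors' exact code:

```python
import torch

# A: activation tensor of shape (C, H, W); each mapping returns an (H, W) spatial map.
def f_sum(A):
    # sum of absolute values across the channel dimension
    return A.abs().sum(dim=0)

def f_sum_p(A, p=2):
    # sum of absolute values raised to the power p
    return A.abs().pow(p).sum(dim=0)

def f_max_p(A, p=2):
    # channel-wise maximum of absolute values raised to the power p
    return A.abs().pow(p).max(dim=0).values
```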

They also examined deep networks of differing accuracy and found that stronger networks show peaks in their attention maps over the salient regions of the feature maps, whereas weaker networks do not.

We found that the above statistics of hidden activations not only have spatial correlation with predicted objects on image level, but these correlations also tend to be higher in networks with higher accuracy, and stronger networks have peaks in attention where weak networks don’t.

 

Based on these observations, the authors state their goal:

the goal is to train a student network that will not only make correct predictions but will also have attention maps that are similar to those of the teacher. In general, one can place transfer losses w.r.t. attention maps computed across several layers

The authors then give the corresponding loss function, and note that it can also be combined with Hinton's distillation temperature (though that part is not their own contribution). The meaning of the formula is simple: the student's cross-entropy loss plus a penalty on the norm of the difference between the normalized student and teacher attention maps.
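
Below is a minimal PyTorch sketch of that objective, assuming the squared-activation mapping and L2-normalized, flattened attention maps; the helper names and the β value are illustrative, not the authors' exact code:

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # feat: (N, C, H, W) -> L2-normalized, flattened spatial attention map of shape (N, H*W)
    am = feat.pow(2).mean(dim=1)
    return F.normalize(am.view(am.size(0), -1), dim=1)

def at_loss(student_feat, teacher_feat):
    # distance between the normalized student and teacher attention maps
    return (attention_map(student_feat) - attention_map(teacher_feat)).pow(2).mean()

def total_loss(student_logits, labels, student_feats, teacher_feats, beta=1e3):
    # student cross-entropy plus the attention-transfer penalty over matched layer pairs
    ce = F.cross_entropy(student_logits, labels)
    at = sum(at_loss(fs, ft.detach()) for fs, ft in zip(student_feats, teacher_feats))
    return ce + beta / 2 * at
```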

2. GRADIENT-BASED ATTENTION TRANSFER

Attention can also be defined via gradients with respect to the input: the more concentrated the network's attention is on a region, the more a small change in its pixels will change the output, i.e. the larger the gradient there.

The corresponding idea is therefore to have the student model mimic the teacher model's input gradients; the loss aims to minimize the distance between the two networks' gradient maps.
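
A rough sketch of what this could look like in PyTorch, using create_graph=True so the gradient-matching term can itself be backpropagated through the student; the function names and β are assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def gradient_at_loss(student, teacher, x, y, beta=1e3):
    # student side: gradient of the student loss w.r.t. the input, with create_graph=True
    # so the matching term remains differentiable w.r.t. the student weights
    xs = x.clone().requires_grad_(True)
    s_loss = F.cross_entropy(student(xs), y)
    (j_s,) = torch.autograd.grad(s_loss, xs, create_graph=True)

    # teacher side: a plain first-order gradient, treated as a fixed target
    xt = x.clone().requires_grad_(True)
    t_loss = F.cross_entropy(teacher(xt), y)
    (j_t,) = torch.autograd.grad(t_loss, xt)

    # classification loss plus the distance between the two gradient "attention" maps
    return s_loss + beta / 2 * (j_s - j_t.detach()).pow(2).sum()
```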

Experimental Results

1. Activation-based attention transfer

Judging from the results, AT alone does not do as well as KD, but the combination of AT and KD achieves better results.
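
As a rough illustration of how the two objectives can be mixed, the sketch below combines a temperature-softened KD term with the activation-based AT term from the earlier sketch; the values of T, α and β are placeholders, not numbers reported in the paper:

```python
import torch
import torch.nn.functional as F

def kd_at_loss(s_logits, t_logits, y, s_feats, t_feats, T=4.0, alpha=0.9, beta=1e3):
    # Hinton-style KD term: KL divergence between temperature-softened distributions
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction='batchmean') * (T * T)
    # hard-label cross-entropy on the student
    ce = F.cross_entropy(s_logits, y)
    # attention-transfer term, reusing at_loss() from the activation-based sketch above
    at = sum(at_loss(fs, ft.detach()) for fs, ft in zip(s_feats, t_feats))
    return alpha * kd + (1 - alpha) * ce + beta / 2 * at
```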

2. Gradient-based attention transfer

In this set of experiments, gradient-based AT also brings some improvement, but activation-based AT outperforms it. The authors give an explanation:

We also trained a network with activation-based AT in the same training conditions, which resulted in the best performance among all methods. We should note that the architecture of student NIN without batch normalization is slightly different from teacher network, it doesn’t have ReLU activations before pooling layers, which leads to better performance without batch normalization, and worse with. So to achieve the best performance with activation-based AT we had to train a new teacher, with batch normalization and without ReLU activations before pooling layers, and have AT losses on outputs of convolutional layers.

Summary

The authors propose a strategy for transferring attention from teacher to student, in two variants, gradient-based and activation-based, and show that both are effective.
