Attention Transfer

爆米花好美啊

Article link: https://blog.csdn.net/u013010889/article/details/102962574

Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

Motivation

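Many papers have shown that attention plays a major role in both CV and NLP, so this paper uses attention for knowledge distillation (KD): the student is trained to mimic the teacher's attention maps.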

Activation-based attention transfer

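If attention is defined as a spatial attention map, a C×H×W activation tensor $A$ (with channel slices $A_i$) can be collapsed across channels in three ways:

1. Sum of absolute values across channels at each spatial position: $F_{\text{sum}}(A) = \sum_{i=1}^{C} |A_i|$
2. Sum of p-th powers across channels at each spatial position: $F_{\text{sum}}^{p}(A) = \sum_{i=1}^{C} |A_i|^{p}$. Compared with option 1, this puts more weight on positions with high responses.
3. Maximum of p-th powers across channels at each spatial position: $F_{\text{max}}^{p}(A) = \max_{i} |A_i|^{p}$

The three resulting attention maps emphasize different things; the last two put more weight on the positions with the most salient responses.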
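As a minimal NumPy sketch of the three mappings (the function names are illustrative and not taken from the paper's code; activations are assumed to have shape (C, H, W)):

```python
import numpy as np

def attention_sum_abs(A):
    """Sum of absolute values over the channel axis. A has shape (C, H, W)."""
    return np.abs(A).sum(axis=0)

def attention_sum_pow(A, p=2):
    """Sum over channels of |A|**p; larger p favors strongly activated positions."""
    return (np.abs(A) ** p).sum(axis=0)

def attention_max_pow(A, p=2):
    """Maximum over channels of |A|**p at each spatial position."""
    return (np.abs(A) ** p).max(axis=0)

# Toy activation: 2 channels on a 1x2 spatial grid.
A = np.array([[[1.0, 3.0]],
              [[-1.0, 0.0]]])

m_abs = attention_sum_abs(A)      # [[2., 3.]]
m_pow = attention_sum_pow(A, 2)   # [[2., 9.]]
m_max = attention_max_pow(A, 2)   # [[1., 9.]]
```

In this toy case the second position has one strong response, and its weight relative to the first position grows from 1.5x (absolute sum) to 4.5x (squared sum) to 9x (squared max).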

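Final loss:

$$
\mathcal{L}_{AT} = \mathcal{L}(W_S, x) + \frac{\beta}{2} \sum_{j \in \mathcal{I}} \left\| \frac{Q_S^j}{\|Q_S^j\|_2} - \frac{Q_T^j}{\|Q_T^j\|_2} \right\|_p
$$

Here $Q_S^j$ and $Q_T^j$ are the vectorized attention maps of the j-th student-teacher layer pair, and $\mathcal{L}(W_S, x)$ is the student's ordinary classification loss.

$\beta$ is set to 1000. Because the second term is averaged over all positions, its effective weight is about 0.1 overall.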
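The attention term of this loss can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the squared-sum attention map, takes the mean of squared differences over positions (matching the note's remark about averaging), and assumes each student-teacher pair shares the same spatial size. The names `normalized_attention` and `at_term` are hypothetical.

```python
import numpy as np

def normalized_attention(A, p=2, eps=1e-12):
    """Collapse A (C, H, W) over channels, flatten, and L2-normalize."""
    q = (np.abs(A) ** p).sum(axis=0).ravel()
    return q / (np.linalg.norm(q) + eps)

def at_term(student_acts, teacher_acts, beta=1e3, p=2):
    """(beta / 2) * sum over layer pairs of the mean squared difference
    between normalized student and teacher attention maps.

    Assumption: averaging over positions instead of taking a plain norm.
    """
    total = 0.0
    for A_s, A_t in zip(student_acts, teacher_acts):
        q_s = normalized_attention(A_s, p)
        q_t = normalized_attention(A_t, p)
        total += np.mean((q_s - q_t) ** 2)
    return beta / 2 * total
```

Because the maps are collapsed over channels before comparison, the student and teacher layers may have different channel counts. The normalization also makes the term insensitive to the overall scale of the activations.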

Gradient-based attention transfer

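Gradient-based attention measures how sensitive the network is to the input at each position. For example, perturb the pixels at some position and observe how the network output changes: if the output changes a lot, the network is paying more attention to that position.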
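The perturb-and-observe idea can be illustrated with a finite-difference sensitivity map. This is only a sketch: `toy_loss` and its weights `w` are a hypothetical stand-in for a network, and a real implementation would obtain the input gradient by backpropagation rather than by perturbing pixels one at a time.

```python
import numpy as np

def sensitivity_map(loss_fn, x, eps=1e-4):
    """Central-difference estimate of |d loss / d x| at every input position."""
    grad = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(x.shape):
        x_plus = x.copy()
        x_plus[idx] += eps
        x_minus = x.copy()
        x_minus[idx] -= eps
        grad[idx] = (loss_fn(x_plus) - loss_fn(x_minus)) / (2 * eps)
    return np.abs(grad)

# Hypothetical toy "network": its output depends mostly on the center pixel.
w = np.array([[0.0, 0.1, 0.0],
              [0.1, 2.0, 0.1],
              [0.0, 0.1, 0.0]])

def toy_loss(x):
    return float(np.sum(w * x) ** 2)

x = np.ones((3, 3))
J = sensitivity_map(toy_loss, x)  # largest value at the center pixel
```

For this toy loss the analytic gradient is `2 * sum(w * x) * w`, so the map is proportional to `w` and peaks at the center, which is the position the "network" attends to.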

Experiments

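The experiments compare activation-based AT with F-ActT (full-activation transfer, similar to FitNets: a 1x1 convolution performs feature adaptation, followed by an L2 loss).
Among the attention mapping functions, the sum of squares works best.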
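For contrast with AT, the F-ActT baseline described above can be sketched like this. It is a rough illustration based only on the description in this note; the function names and the adapter weight matrix `W_adapt` are hypothetical, and in practice the adapter would be a trainable layer.

```python
import numpy as np

def conv1x1(A, W):
    """1x1 convolution: mixes channels independently at each spatial position.
    A has shape (C_in, H, W); W has shape (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', W, A)

def full_activation_loss(A_student, A_teacher, W_adapt):
    """Mean squared error between the adapted student tensor and the teacher tensor."""
    adapted = conv1x1(A_student, W_adapt)
    return float(np.mean((adapted - A_teacher) ** 2))

rng = np.random.default_rng(0)
A_student = rng.normal(size=(4, 8, 8))    # 4 student channels
W_adapt = rng.normal(size=(16, 4))        # maps 4 channels to 16
A_teacher = rng.normal(size=(16, 8, 8))   # 16 teacher channels
loss = full_activation_loss(A_student, A_teacher, W_adapt)
```

Unlike AT, which compares one H×W map per layer, this loss matches every channel of the full activation tensor, so it needs the extra adaptation layer whenever channel counts differ.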

Activation-based AT outperforms gradient-based AT.
In addition, on the Scenes dataset AT does much better than conventional KD. The authors' explanation: "we speculate is due to importance of intermediate attention for fine-grained recognition".

The authors seem to have made a mistake here: CUB, not Scenes, is clearly the fine-grained dataset.

Important

KD struggles to work if teacher and student have different architecture/depth (we observe the same on CIFAR), so we tried using the same architecture and depth for attention transfer.

We also could not find applications of FitNets, KD or similar methods on ImageNet in the literature. Given that, we can assume that proposed activation-based AT is the first knowledge transfer method to be successfully applied on ImageNet.