Contents
Introduction
- RAMS-Trans uses the transformer’s self-attention weights to measure the importance of the patch tokens corresponding to the raw images and recursively learns discriminative region attention in a multi-scale manner.
Recurrent Attention Multi-Scale Transformer (RAMS-Trans)
Proposed Network
- RAMS-Trans uses a two-stage strategy. The image is first passed through a ViT at the global scale, yielding a CE loss $Loss_{s2}$ and the attention matrices. DPPM then uses the attention matrices to locate the region worth attending to; that region is cropped and bilinearly upsampled into a local-scale image, which is passed through the ViT again (local scale) to yield a CE loss $Loss_{s1}$. The total loss is
$$Loss = Loss_{s1} + \lambda \, Loss_{s2}$$
where $\lambda = 1.0$. At inference time, only the local-scale ViT is used.
- Note that in the figure above, the left and right ViTs share all parameters except the [CLS] token.
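The two-stage objective above can be sketched as follows. All callables here (`vit_forward`, `dppm_crop`) are hypothetical stand-ins for the shared-weight ViT and the DPPM crop, not the authors' code; only the loss combination mirrors the formula.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example; logits is a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def rams_trans_loss(image, label, vit_forward, dppm_crop, lam=1.0):
    """Two-stage RAMS-Trans objective (hypothetical callables:
    vit_forward(image, scale) -> (logits, attn); dppm_crop(image, attn) -> image)."""
    # Stage 1 (global scale): full image through the shared-weight ViT.
    logits_s2, attn = vit_forward(image, scale="global")
    loss_s2 = cross_entropy(logits_s2, label)
    # DPPM locates the informative region from the attention matrices;
    # the crop is bilinearly upsampled back to the input resolution.
    local_image = dppm_crop(image, attn)
    # Stage 2 (local scale): same weights, scale-specific [CLS] token.
    logits_s1, _ = vit_forward(local_image, scale="local")
    loss_s1 = cross_entropy(logits_s1, label)
    return loss_s1 + lam * loss_s2
```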
Dynamic Patch Proposal Module (DPPM)
- DPPM uses the attention matrices inside the ViT to locate the image region that deserves focused attention.
- Let $W_h^l \in \mathbb{R}^{N \times N}$ denote the attention matrix of head $h$ in layer $l$ (the softmax-normalized self-attention weights over the $N$ tokens). From the $H$ heads' attention matrices, the normalized attention weights matrix $G_l \in \mathbb{R}^{N \times N}$ of layer $l$ is computed as
$$G_l = \mathrm{norm}_c\Big(\frac{1}{H}\sum_{h=1}^{H} W_h^l + E\Big)$$
where $E$ is a diagonal matrix (accounting for the residual connection) and $\mathrm{norm}_c(\cdot)$ normalizes each column of its input into a probability vector (dividing by the sum of the column's elements). Recursive matrix multiplications then fuse the normalized attention weights matrices of all $L$ layers; a column-wise arithmetic average followed by normalization and a reshape yields the patch attention $g \in \mathbb{R}^{N^{1/2} \times N^{1/2}}$, which reflects the importance of each patch:
$$g = \Gamma^N\Big(\mathrm{norm}_m\big(\mathrm{avg}_c\big(G_L G_{L-1} \cdots G_1\big)\big)\Big)$$
where $\mathrm{norm}_m(\cdot)$ divides every element of the vector by the maximum element in that vector, $\mathrm{avg}_c(\cdot)$ is the column-wise arithmetic average, and $\Gamma^N$ reshapes the attention vector ($\mathbb{R}^N$) into the patch attention ($\mathbb{R}^{N^{1/2} \times N^{1/2}}$). Finally, the authors compute the mean $\bar g$ of $g$ and derive from it a binary patch mask $\tilde M \in \mathbb{R}^{N^{1/2} \times N^{1/2}}$:
$$\tilde M_{ij} = \begin{cases} 1, & g_{ij} \ge \alpha \, \bar g \\ 0, & \text{otherwise} \end{cases}$$
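The attention-fusion computation can be sketched with NumPy. Assumptions not in the text: `attn` covers only the $N$ patch tokens (any [CLS] handling is done beforehand), and the diagonal matrix $E$ is taken to be the identity.

```python
import math
import numpy as np

def patch_attention(attn):
    """attn: array of shape (L, H, N, N), per-layer per-head attention.
    Returns the patch attention g of shape (sqrt(N), sqrt(N))."""
    L, H, N, _ = attn.shape
    rollout = np.eye(N)
    for l in range(L):
        # Head average plus diagonal matrix E (identity here, modeling the
        # residual path), then norm_c: normalize each column to sum to 1.
        G = attn[l].mean(axis=0) + np.eye(N)
        G = G / G.sum(axis=0, keepdims=True)
        rollout = G @ rollout          # recursive matrix multiplication
    a = rollout.mean(axis=0)           # avg_c: column-wise arithmetic average
    a = a / a.max()                    # norm_m: divide by the maximum element
    side = int(round(math.sqrt(N)))
    return a.reshape(side, side)       # Gamma^N: reshape R^N -> patch grid
```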
where $\alpha > 1$ is a hyperparameter.
- Finally, Algorithm 1 extracts the largest connected component of $\tilde M$; this component is the image region worth attending to.
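A sketch of the mask-and-largest-component step. The 4-connectivity and the BFS flood fill are assumptions for illustration; the paper's Algorithm 1 may differ in details.

```python
import numpy as np
from collections import deque

def largest_connected_region(g, alpha=1.1):
    """Threshold g at alpha * mean(g), then keep only the largest
    4-connected component of the resulting binary patch mask."""
    mask = (g >= alpha * g.mean()).astype(int)
    H, W = mask.shape
    seen = np.zeros((H, W), dtype=bool)
    best = []
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                # BFS flood fill over 4-connected neighbors.
                comp, q = [], deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = np.zeros((H, W), dtype=int)
    for y, x in best:
        out[y, x] = 1
    return out
```

The bounding box of the returned component is what gets cropped and bilinearly upsampled for the local-scale forward pass.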
Scale-wise Class Token
- The [CLS] token exists mainly to exchange information with the other patch tokens and is ultimately used for classification. In RAMS-Trans, the patch tokens of the left and right ViTs correspond to different scales, so it is necessary to use a different [CLS] token at each scale.
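One minimal way to realize this (a sketch, not the authors' code): keep a separate learnable [CLS] embedding per scale while every other parameter stays shared.

```python
import numpy as np

class ScaleWiseTokens:
    """Separate [CLS] embedding per scale; all other ViT weights are shared.
    Hypothetical minimal container; D is the embedding dimension."""
    def __init__(self, D, seed=0):
        rng = np.random.default_rng(seed)
        self.cls = {"global": rng.standard_normal(D),
                    "local": rng.standard_normal(D)}

    def prepend(self, patch_tokens, scale):
        """patch_tokens: (N, D) -> (N + 1, D), scale-specific [CLS] first."""
        return np.vstack([self.cls[scale][None, :], patch_tokens])
```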
Experiments
- Results on CUB-200-2011
- Results on iNaturalist2017
- Results on Stanford Dogs
- Attention localization at the local scale
- Ablation Experiments
As shown, TransFG's token-selection module PSM even loses accuracy at some resolutions. The authors' explanation: this kind of hard attention filtering easily fails in two cases, one being small image resolution, the other being high dataset complexity. In the former case, a lot of important local information is not easily available, and if most of the token information is filtered out at that point, classification performance is likely to suffer. In the latter case, the model can easily make wrong judgments based on improper token information when the attention mechanism fails.