Contents
Introduction
- RAMS-Trans uses the transformer’s self-attention weights to measure the importance of the patch tokens corresponding to the raw images and recursively learns discriminative region attention in a multi-scale manner.
Recurrent Attention Multi-Scale Transformer (RAMS-Trans)
Proposed Network
- RAMS-Trans uses a two-stage strategy. The image is first passed through a ViT at the global scale, yielding a CE loss $Loss_{s2}$ and the attention matrices. DPPM then uses the attention matrices to locate the region worth attending to; that region is cropped and bilinearly upsampled into a local-scale image, which is passed through the ViT again (local scale) to yield a CE loss $Loss_{s1}$. The total loss is
$$Loss = Loss_{s1} + \lambda \, Loss_{s2}$$
where $\lambda = 1.0$. At inference time, only the local-scale ViT is used.
- Note that in the figure above, the left and right ViTs share all parameters except the [CLS] token.
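The two-stage objective above can be sketched as follows. All callables here (`vit_forward`, `dppm_crop`) are hypothetical stand-ins for the shared-weight ViT and the DPPM crop, not the authors' code; only the loss combination mirrors the formula.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example; logits is a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def rams_trans_loss(image, label, vit_forward, dppm_crop, lam=1.0):
    """Two-stage RAMS-Trans objective (hypothetical callables:
    vit_forward(image, scale) -> (logits, attn); dppm_crop(image, attn) -> image)."""
    # Stage 1 (global scale): full image through the shared-weight ViT.
    logits_s2, attn = vit_forward(image, scale="global")
    loss_s2 = cross_entropy(logits_s2, label)
    # DPPM locates the informative region from the attention matrices;
    # the crop is bilinearly upsampled back to the input resolution.
    local_image = dppm_crop(image, attn)
    # Stage 2 (local scale): same weights, scale-specific [CLS] token.
    logits_s1, _ = vit_forward(local_image, scale="local")
    loss_s1 = cross_entropy(logits_s1, label)
    return loss_s1 + lam * loss_s2
```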
Dynamic Patch Proposal Module (DPPM)
- DPPM uses the attention matrices inside the ViT to locate the image region that deserves focused attention.
- Let $W_h^l \in \mathbb{R}^{N \times N}$ denote the attention matrix of head $h$ in layer $l$ (the softmax-normalized self-attention weights over the $N$ tokens). From the $H$ heads' attention matrices, the normalized attention weights matrix $G_l \in \mathbb{R}^{N \times N}$ of layer $l$ is computed as
$$G_l = \mathrm{norm}_c\Big(\frac{1}{H}\sum_{h=1}^{H} W_h^l + E\Big)$$
where $E$ is a diagonal matrix (accounting for the residual connection) and $\mathrm{norm}_c(\cdot)$ normalizes each column of its input into a probability vector (dividing by the sum of the column's elements). Recursive matrix multiplications then fuse the normalized attention weights matrices of all $L$ layers; a column-wise arithmetic average followed by normalization and a reshape yields the patch attention $g \in \mathbb{R}^{N^{1/2} \times N^{1/2}}$, which reflects the importance of each patch:
$$g = \Gamma^N\Big(\mathrm{norm}_m\big(\mathrm{avg}_c\big(G_L G_{L-1} \cdots G_1\big)\big)\Big)$$
where $\mathrm{norm}_m(\cdot)$ divides every element of the vector by the maximum element in that vector, $\mathrm{avg}_c(\cdot)$ is the column-wise arithmetic average, and $\Gamma^N$ reshapes the attention vector ($\mathbb{R}^N$) into the patch attention ($\mathbb{R}^{N^{1/2} \times N^{1/2}}$). Finally, the authors compute the mean $\bar g$ of $g$ and derive from it a binary patch mask $\tilde M \in \mathbb{R}^{N^{1/2} \times N^{1/2}}$:
$$\tilde M_{ij} = \begin{cases} 1, & g_{ij} \ge \alpha \, \bar g \\ 0, & \text{otherwise} \end{cases}$$
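The attention-fusion computation can be sketched with NumPy. Assumptions not in the text: `attn` covers only the $N$ patch tokens (any [CLS] handling is done beforehand), and the diagonal matrix $E$ is taken to be the identity.

```python
import math
import numpy as np

def patch_attention(attn):
    """attn: array of shape (L, H, N, N), per-layer per-head attention.
    Returns the patch attention g of shape (sqrt(N), sqrt(N))."""
    L, H, N, _ = attn.shape
    rollout = np.eye(N)
    for l in range(L):
        # Head average plus diagonal matrix E (identity here, modeling the
        # residual path), then norm_c: normalize each column to sum to 1.
        G = attn[l].mean(axis=0) + np.eye(N)
        G = G / G.sum(axis=0, keepdims=True)
        rollout = G @ rollout          # recursive matrix multiplication
    a = rollout.mean(axis=0)           # avg_c: column-wise arithmetic average
    a = a / a.max()                    # norm_m: divide by the maximum element
    side = int(round(math.sqrt(N)))
    return a.reshape(side, side)       # Gamma^N: reshape R^N -> patch grid
```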
where $\alpha > 1$ is a hyperparameter.
- Finally, Algorithm 1 extracts the largest connected component of $\tilde M$; this component is the image region worth attending to.
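A sketch of the mask-and-largest-component step. The 4-connectivity and the BFS flood fill are assumptions for illustration; the paper's Algorithm 1 may differ in details.

```python
import numpy as np
from collections import deque

def largest_connected_region(g, alpha=1.1):
    """Threshold g at alpha * mean(g), then keep only the largest
    4-connected component of the resulting binary patch mask."""
    mask = (g >= alpha * g.mean()).astype(int)
    H, W = mask.shape
    seen = np.zeros((H, W), dtype=bool)
    best = []
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                # BFS flood fill over 4-connected neighbors.
                comp, q = [], deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = np.zeros((H, W), dtype=int)
    for y, x in best:
        out[y, x] = 1
    return out
```

The bounding box of the returned component is what gets cropped and bilinearly upsampled for the local-scale forward pass.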
Scale-wise Class Token
- The [CLS] token exists mainly to exchange information with the other patch tokens and is ultimately used for classification. In RAMS-Trans, the patch tokens of the left and right ViTs correspond to different scales, so it is necessary to use a different [CLS] token at each scale.
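One minimal way to realize this (a sketch, not the authors' code): keep a separate learnable [CLS] embedding per scale while every other parameter stays shared.

```python
import numpy as np

class ScaleWiseTokens:
    """Separate [CLS] embedding per scale; all other ViT weights are shared.
    Hypothetical minimal container; D is the embedding dimension."""
    def __init__(self, D, seed=0):
        rng = np.random.default_rng(seed)
        self.cls = {"global": rng.standard_normal(D),
                    "local": rng.standard_normal(D)}

    def prepend(self, patch_tokens, scale):
        """patch_tokens: (N, D) -> (N + 1, D), scale-specific [CLS] first."""
        return np.vstack([self.cls[scale][None, :], patch_tokens])
```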
Experiments
- Results on CUB-200-2011
- Results on iNaturalist2017
- Results on Stanford Dogs
- Attention localization at the local scale
- Ablation Experiments
As shown, TransFG's token-selection module PSM even loses accuracy at some resolutions. The authors' explanation: this kind of hard attention filtering easily fails in two cases, one being small image resolution, the other being high dataset complexity. In the former case, a lot of important local information is not easily available, and if most of the token information is filtered out at that point, classification performance is likely to suffer. In the latter case, the model can easily make wrong judgments based on improper token information when the attention mechanism fails.