[ACM MM 2022] SIM-Trans: Structure Information Modeling Transformer for FGVC

连理o

Last modified: 2023-03-08 21:25:05

First published: 2023-02-11 14:52:09

Article link: https://blog.csdn.net/weixin_42437114/article/details/128981004

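The paper proposes SIM-Trans, which introduces structure information into the transformer to strengthen feature learning, particularly for fine-grained visual categorization. Images are split with a sliding window, a graph convolutional network (GCN) learns the object structure information, and contrastive learning further improves the representation. Experiments on CUB-200-2011 and iNaturalist 2017 validate the method.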

Introduction

In this paper, we propose the structure information modeling transformer (SIM-Trans) that introduces the object structure information into vision transformer for boosting the discriminative feature learning to contain both the appearance and structure information.
The authors also explain why ViT has an advantage over CNNs on FGVC tasks by pointing to a weakness of CNNs: “the stacked convolution and pooling operations bring both the expansion of the receptive field and the degradation of spatial discrimination. Large continuous image areas are focused on and discriminative details are generally overlooked, which are essential for distinguishing subtle difference in fine-grained visual categorization.”

Approach

Multi-level Feature Boosting

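Non-overlapping splitting of the input image leads to incomplete neighboring information, so the authors split the image with the sliding window splitting method (He, Shuting, et al.). The number of patches $N$ is
$$N = N_H \times N_W = \left\lfloor \frac{H - P + S}{S} \right\rfloor \times \left\lfloor \frac{W - P + S}{S} \right\rfloor$$
where $H, W$ are the image height and width, $S$ is the window’s sliding step, and $P$ is the patch size.
The [CLS] token outputs of the last 3 transformer layers are concatenated to form the image representation, which is fed to an FC classification head to produce the final prediction.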
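
The patch count under sliding-window splitting can be checked with a few lines of Python. This is a minimal sketch that assumes no padding, so each axis yields floor((L - P) / S) + 1 windows; the function name `num_patches` and the example sizes (a 448×448 input, patch size 16, stride 12) are illustrative assumptions, not values taken from the post.

```python
def num_patches(height, width, patch_size, stride):
    """Count patches from sliding-window splitting (assumes no padding)."""
    n_h = (height - patch_size) // stride + 1  # windows along the height
    n_w = (width - patch_size) // stride + 1   # windows along the width
    return n_h, n_w, n_h * n_w

# Non-overlapping split: the stride equals the patch size.
print(num_patches(448, 448, 16, 16))  # (28, 28, 784)

# Overlapping split: a smaller stride produces more patches.
print(num_patches(448, 448, 16, 12))  # (37, 37, 1369)
```

With a stride smaller than the patch size, neighboring patches share pixels, which is what restores the neighboring information, at the cost of a longer token sequence.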

Structure Information Learning

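The authors propose a structure information learning (SIL) module. It models the $N$ patches as a graph with $N$ nodes and uses a GCN to introduce the object's spatial context information into ViT.
(1) Node features. The authors first locate the object using the attention weights between the patch tokens and the [CLS] token. For a self-attention layer with $H$ heads (here $H$ is the number of heads, not the image height), the total attention weights are
$$A = \sum_{h=1}^{H} Att_h^{cls}$$
where $Att_h^{cls}$ holds the attention weights between each patch token and the [CLS] token in the $h$-th head. $A\in\R^{N}$ therefore measures how relevant each of the $N$ patch tokens is to the [CLS] token; the higher the relevance, the more likely the patch contains the object. The patch with the largest attention weight is the reference patch, and the GCN output that is finally used is the node feature of the reference patch. Inspired by Zhou, Mohan, et al., the authors use each patch's polar coordinates relative to the reference patch as its node feature $X$:
$$\rho_{x,y} = \frac{1}{N_H}\sqrt{(x-x_0)^2+(y-y_0)^2}, \qquad \theta_{x,y} = \frac{\arctan2(y-y_0,\ x-x_0)+\pi}{2\pi}$$
where $(x_0,y_0)$ are the coordinates of the reference patch on the $N_H\times N_W$ patch grid, $(x, y)$ are the coordinates of a patch on the same grid, and $\arctan2(\cdot)$ returns the polar angle in the range $(-\pi, \pi]$.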
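(2) Edge weights. Compute the mean $\bar A$ of the total attention weights, then filter out the background patches whose attention weight does not exceed the mean. The new total attention weights $A^{new}$ are
$$A_{i}^{new}=\left\{\begin{array}{cc} A_{i} & \text{if } A_{i}>\bar{A} \\ 0 & \text{otherwise} \end{array}\right.$$
and the edge weights of the graph are
$$Adj=A^{new}(A^{new})^T\in\R^{N\times N}$$
Because the GCN output that is finally used is the node feature of the reference patch, feature aggregation effectively only uses the node features of patches whose attention weight exceeds the mean.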
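(3) GCN. The structure features $S$ are obtained by two-layer graph convolution
$$S = \sigma\left(Adj \times \sigma\left(Adj \times X \times W^{1}\right) \times W^{2}\right)$$
where $W^{1}, W^{2}$ are learnable parameters and $\sigma$ is the activation function. The output feature of the reference patch node is the object structure feature, which is added directly to the output feature of the [CLS] token. Through the end-to-end training, the composition of the object can be modeled and the significant image patch can be highlighted, which improves the model’s classification performance.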
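
The three steps of the SIL module can be traced on a toy input. The sketch below is plain Python written for readability, not the authors' implementation. Several things in it are assumptions: ReLU stands in for $\sigma$, the weights `W1` and `W2` and their sizes (2 to 3 to 2) are made up, patch indices are mapped to grid coordinates in row-major order with $x$ as the row, and the attention lists already exclude the [CLS] token itself. All function names are hypothetical.

```python
import math

def total_attention(att_per_head):
    """A = sum over heads of the [CLS]-to-patch attention weights."""
    n = len(att_per_head[0])
    return [sum(head[i] for head in att_per_head) for i in range(n)]

def polar_features(ref_index, n_h, n_w):
    """Normalized polar coordinates (rho, theta) of each patch w.r.t. the reference patch."""
    x0, y0 = divmod(ref_index, n_w)  # row-major index -> (row, col), an assumption
    feats = []
    for idx in range(n_h * n_w):
        x, y = divmod(idx, n_w)
        rho = math.hypot(x - x0, y - y0) / n_h
        theta = (math.atan2(y - y0, x - x0) + math.pi) / (2 * math.pi)
        feats.append([rho, theta])
    return feats

def adjacency(att):
    """Zero out weights not above the mean, then form A_new A_new^T."""
    mean = sum(att) / len(att)
    a_new = [a if a > mean else 0.0 for a in att]
    return [[ai * aj for aj in a_new] for ai in a_new]

def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def gcn_two_layer(adj, x, w1, w2):
    """S = relu(Adj X W1) fed through a second layer: relu(Adj H W2)."""
    hidden = relu(matmul(matmul(adj, x), w1))
    return relu(matmul(matmul(adj, hidden), w2))

# Toy input: 2 heads over a 2x2 patch grid (N = 4).
att_heads = [[0.05, 0.5, 0.35, 0.1],
             [0.05, 0.6, 0.25, 0.1]]
A = total_attention(att_heads)
ref = max(range(len(A)), key=A.__getitem__)  # reference patch: index 1
X = polar_features(ref, n_h=2, n_w=2)        # node features, shape N x 2
Adj = adjacency(A)                           # patches 0 and 3 fall below the mean
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]    # made-up weights, 2 -> 3
W2 = [[0.2, 0.1], [0.4, -0.3], [-0.1, 0.6]]  # made-up weights, 3 -> 2
S = gcn_two_layer(Adj, X, W1, W2)
structure_feature = S[ref]                   # the row that would be added to [CLS]
print(ref, structure_feature)
```

In this toy run the rows and columns of `Adj` for patches 0 and 3 are all zero, so those background patches contribute nothing when features are aggregated into the reference node, which matches the remark in step (2).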

Contrastive Loss

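The authors further introduce contrastive learning to improve the model's representation ability, applying it to the [CLS] token of the last layer.
In the contrastive loss, $(z_i,z_j^+)$ is a positive pair of samples from the same category, $(z_i,z_j^-)$ is a negative pair, $\Gamma_{y_i=y_j, i \neq j}$ is the number of positive pairs, $sim(\cdot)$ is the cosine similarity, and $N$ is the batch size (not the patch count used earlier). $Indicator_{i,j}$ performs hard negative mining: it filters out negative samples whose similarity is lower than the average similarity of the positive pairs by $\alpha$. The authors set $\alpha=0.3$ in their experiments.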