Contents
Introduction
- Challenges of fine-grained bird image classification: (1) Bird molting: some birds undergo an annual molt (replacing their feathers) as the seasons change (Figs. 1(a) and 1(b)). (2) Complex backgrounds. (3) Arbitrary postures.
- Observation and motivation. Finding I: invariant cues of specific birds, i.e., core features and the long-range semantic relationships among bird parts. Finding II: subtle discrepancies between different birds.
Proposed TransIFC Model
Feature map generation
- TransIFC uses a Swin Transformer as its backbone (pre-trained on ImageNet-22k) to extract fine-grained, multiscale information; the output features are the per-stage outputs (i.e., the average pooling of each stage's output token features).
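The per-stage feature extraction can be sketched as follows. The token counts and channel widths below assume a Swin-B backbone at 224x224 input; they are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Token features from each Swin stage for one image (shapes assume a
# Swin-B backbone at 224x224 input; the exact sizes are an assumption).
rng = np.random.default_rng(0)
stage_tokens = [
    rng.standard_normal((3136, 128)),   # stage 1: 56x56 tokens
    rng.standard_normal((784, 256)),    # stage 2: 28x28 tokens
    rng.standard_normal((196, 512)),    # stage 3: 14x14 tokens
    rng.standard_normal((49, 1024)),    # stage 4: 7x7 tokens
]

# Per-stage feature = average pooling over the token dimension.
stage_features = [t.mean(axis=0) for t in stage_tokens]
print([f.shape for f in stage_features])  # [(128,), (256,), (512,), (1024,)]
```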
FFA module
- FFA extracts the salient feature regions of the image (the invariant core features).
- Suppose $q_i$ ($i\in[1,2,...,n]$) are the $n$ 1D patch vectors output by the patch merging layer. The similarity matrix $S_{n\times n}$ among these $n$ vectors can be computed, where $S_{ij}=\mathrm{Sim}(q_i,q_j)$; the similarity can be cosine similarity or the reciprocal of the L2 distance. From the similarity matrix, a discrimination score is obtained for each patch vector.
- FFA selects the Hits@$k$ ($k$ highest-scored) patch vectors as the input to the next layer. (The paper is not very clear about which stages FFA is applied to: the description of FFA reads as if every stage has one, but the paper's diagram, together with the ablation study's statement that $k$ is a constant, suggests FFA is only used in the last stage (TransIFC), with the salient patch features fed to the subsequent classification. Yet in the experiments section the authors say FFA was applied in every stage to replace the max pooling in HSFA (TransIFC+); since each Swin stage has a different number of patches, can $k$ really remain a constant?)
- The authors also provide a visualization: the five light-green patch features in the middle are the Hits@$k$ patch vectors of the last stage. In lower layers, the Hits@$k$ features differ from one another while the low-scored patch features are almost identical; in higher layers, the Hits@$k$ features are similar to each other with high activation values, while the low-scored features look rather noisy.
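The similarity-matrix scoring and Hits@$k$ selection can be sketched as below. The scoring rule used here (cosine similarity matrix plus a row-sum score) is a plausible reading of the idea, not necessarily the paper's exact discrimination score.

```python
import numpy as np

def ffa_select(q, k):
    """Select the Hits@k most discriminative patch vectors.

    q: (n, d) array of 1D patch vectors q_1..q_n.  Scoring here is a
    sketch: cosine similarity matrix, then a row-sum discrimination score.
    """
    qn = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-12)
    S = qn @ qn.T                       # S[i, j] = Sim(q_i, q_j) (cosine)
    scores = S.sum(axis=1)              # discrimination score per patch
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return q[top], top
```

For the last Swin stage at 224x224 input this would be called with the 49 patch vectors of that stage, e.g. `ffa_select(q, k)` for a small constant `k`.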
HSFA module
- HSFA fuses multiscale information from the different stages. It first reduces the feature maps $M_i$ ($i\in[1,2,3,...,N]$, where $N$ is the number of stages) with max pooling, then flattens and concatenates them into the aggregated feature map $A$.
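The max-pool / flatten / concatenate aggregation can be sketched as follows; the pooling window of 2 is an illustrative assumption.

```python
import numpy as np

def hsfa_aggregate(feature_maps, pool=2):
    """Fuse multiscale stage outputs: max-pool each map M_i, flatten,
    then concatenate into the aggregated feature A.

    feature_maps: list of (C, H, W) arrays.  `pool` is the size of the
    non-overlapping max-pooling window (an assumption for illustration).
    """
    flat = []
    for m in feature_maps:
        c, h, w = m.shape
        p = m[:, :h - h % pool, :w - w % pool]
        # Non-overlapping 2D max pooling over pool x pool windows.
        p = p.reshape(c, h // pool, pool, w // pool, pool).max(axis=(2, 4))
        flat.append(p.ravel())
    return np.concatenate(flat)   # aggregated feature map A
```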
Classification head
- The outputs of FFA and HSFA are concatenated and passed through two fully connected layers to obtain the final prediction $\hat y$ (dropout is added to prevent overfitting).
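A minimal sketch of the head: two fully connected layers with optional dropout in between. The ReLU nonlinearity and layer sizes are illustrative assumptions, not details stated in the notes.

```python
import numpy as np

def classification_head(x, w1, b1, w2, b2, drop_mask=None):
    """Map the concatenated FFA/HSFA feature x to the final prediction
    y_hat via two fully connected layers.  The ReLU and the dropout
    placement are assumptions for illustration.
    """
    h = np.maximum(x @ w1 + b1, 0.0)  # FC layer 1 + ReLU
    if drop_mask is not None:
        h = h * drop_mask             # dropout (pre-scaled inverted mask)
    return h @ w2 + b2                # FC layer 2 -> prediction y_hat
```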
MAP-based model
- MAP (Maximum A Posteriori) estimation
$$\begin{aligned} \theta^* &= \operatorname{argmax}_\theta \prod_{i=1}^r p\left(\theta \mid x_i, y_i\right) \\ &= \operatorname{argmax}_\theta \frac{\prod_{i=1}^r p\left(x_i, y_i \mid \theta\right) p(\theta)}{\prod_{i=1}^r p\left(x_i, y_i\right)} \\ &= \operatorname{argmax}_\theta \prod_{i=1}^r p\left(x_i, y_i \mid \theta\right) p(\theta) \\ &= \operatorname{argmax}_\theta \left(\log \prod_{i=1}^r p\left(x_i, y_i \mid \theta\right) + \log p(\theta)\right) \end{aligned}$$
- Taking the likelihood as a Gaussian,
$$p\left(x_i, y_i \mid \theta\right) \propto \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left\|y_i-\hat{y}_i\right\|^2}{2 \sigma^2}\right)$$
and the prior as a zero-mean Gaussian, $p(\theta) \propto \exp\left(-\eta\|\theta\|^2\right)$, the resulting loss function is
$$L(\theta)=\frac{1}{2} \sum_{i=1}^r\left\|y_i-\hat{y}_i\right\|^2+\eta\|\theta\|^2$$
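The loss is then just a sum-of-squares data term plus an L2 penalty on the parameters; a minimal sketch (the value of $\eta$ below is an illustrative assumption):

```python
import numpy as np

def map_loss(y, y_hat, theta, eta=1e-4):
    """MAP loss: L(theta) = 1/2 * sum_i ||y_i - y_hat_i||^2 + eta * ||theta||^2.

    y, y_hat: arrays of targets and predictions; theta: flattened model
    parameters.  eta's default value is an assumption for illustration.
    """
    data_term = 0.5 * np.sum((y - y_hat) ** 2)   # Gaussian likelihood term
    prior_term = eta * np.sum(theta ** 2)        # Gaussian (L2) prior term
    return data_term + prior_term
```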
Experiments
Results on the CUB-200-2011 dataset
- The biggest issue with the experiments is that there is no direct comparison with Swin itself (Swin's performance on the CUB dataset is only mentioned in the ablation study).
Results on the NABirds dataset
Results on the Stanford Cars dataset
Visualization (ScoreCAM)
Ablation study
- Effect of $k$ on the FFA module, and positional embeddings
- Effect of head number in self-attention operation, and positional embeddings
- Effects of HSFA and FFA modules
- Effect of image resolution