Contents
Introduction
- Challenges of fine-grained bird image classification: (1) Bird molting: some birds undergo an annual molt (replacing their feathers) as the seasons change (Figs. 1(a) and 1(b)). (2) Complex backgrounds. (3) Arbitrary postures.
- Observation and motivation. Finding I: invariant cues of specific birds, i.e., core features and the long-range semantic relationships among bird parts. Finding II: subtle discrepancies between different birds.
Proposed TransIFC Model
Feature map generation
- TransIFC uses a Swin Transformer as its backbone (pre-trained on ImageNet-22k) to extract fine-grained, multiscale information; the output features are the per-stage outputs (i.e., the average pooling of each stage's output token features).
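The per-stage feature extraction can be sketched as follows. The token counts and channel widths below assume a Swin-B backbone at 224x224 input; they are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Token features from each Swin stage for one image (shapes assume a
# Swin-B backbone at 224x224 input; the exact sizes are an assumption).
rng = np.random.default_rng(0)
stage_tokens = [
    rng.standard_normal((3136, 128)),   # stage 1: 56x56 tokens
    rng.standard_normal((784, 256)),    # stage 2: 28x28 tokens
    rng.standard_normal((196, 512)),    # stage 3: 14x14 tokens
    rng.standard_normal((49, 1024)),    # stage 4: 7x7 tokens
]

# Per-stage feature = average pooling over the token dimension.
stage_features = [t.mean(axis=0) for t in stage_tokens]
print([f.shape for f in stage_features])  # [(128,), (256,), (512,), (1024,)]
```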
FFA module
- FFA extracts the salient feature regions of the image (the invariant core features).
- Suppose $q_i$ ($i\in[1,2,...,n]$) are the $n$ 1D patch vectors output by the patch merging layer. The similarity matrix $S_{n\times n}$ among these $n$ vectors can be computed, where $S_{ij}=\mathrm{Sim}(q_i,q_j)$; the similarity can be cosine similarity or the reciprocal of the L2 distance. From the similarity matrix, a discrimination score is obtained for each patch vector.
- FFA selects the Hits@$k$ ($k$ highest-scored) patch vectors as the input to the next layer. (The paper is not very clear about which stages FFA is applied to: the description of FFA reads as if every stage has one, but the paper's diagram, together with the ablation study's statement that $k$ is a constant, suggests FFA is only used in the last stage (TransIFC), with the salient patch features fed to the subsequent classification. Yet in the experiments section the authors say FFA was applied in every stage to replace the max pooling in HSFA (TransIFC+); since each Swin stage has a different number of patches, can $k$ really remain a constant?)
- The authors also provide a visualization: the five light-green patch features in the middle are the Hits@$k$ patch vectors of the last stage. In lower layers, the Hits@$k$ features differ from one another while the low-scored patch features are almost identical; in higher layers, the Hits@$k$ features are similar to each other with high activation values, while the low-scored features look rather noisy.
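The similarity-matrix scoring and Hits@$k$ selection can be sketched as below. The scoring rule used here (cosine similarity matrix plus a row-sum score) is a plausible reading of the idea, not necessarily the paper's exact discrimination score.

```python
import numpy as np

def ffa_select(q, k):
    """Select the Hits@k most discriminative patch vectors.

    q: (n, d) array of 1D patch vectors q_1..q_n.  Scoring here is a
    sketch: cosine similarity matrix, then a row-sum discrimination score.
    """
    qn = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-12)
    S = qn @ qn.T                       # S[i, j] = Sim(q_i, q_j) (cosine)
    scores = S.sum(axis=1)              # discrimination score per patch
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return q[top], top
```

For the last Swin stage at 224x224 input this would be called with the 49 patch vectors of that stage, e.g. `ffa_select(q, k)` for a small constant `k`.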
HSFA module
- HSFA fuses multiscale information from the different stages. It first reduces the feature maps $M_i$ ($i\in[1,2,3,...,N]$, where $N$ is the number of stages) with max pooling, then flattens and concatenates them into the aggregated feature map $A$.
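The max-pool / flatten / concatenate aggregation can be sketched as follows; the pooling window of 2 is an illustrative assumption.

```python
import numpy as np

def hsfa_aggregate(feature_maps, pool=2):
    """Fuse multiscale stage outputs: max-pool each map M_i, flatten,
    then concatenate into the aggregated feature A.

    feature_maps: list of (C, H, W) arrays.  `pool` is the size of the
    non-overlapping max-pooling window (an assumption for illustration).
    """
    flat = []
    for m in feature_maps:
        c, h, w = m.shape
        p = m[:, :h - h % pool, :w - w % pool]
        # Non-overlapping 2D max pooling over pool x pool windows.
        p = p.reshape(c, h // pool, pool, w // pool, pool).max(axis=(2, 4))
        flat.append(p.ravel())
    return np.concatenate(flat)   # aggregated feature map A
```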
Classification head
- The outputs of FFA and HSFA are concatenated and passed through two fully connected layers to obtain the final prediction $\hat y$ (dropout is added to prevent overfitting).
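A minimal sketch of the head: two fully connected layers with optional dropout in between. The ReLU nonlinearity and layer sizes are illustrative assumptions, not details stated in the notes.

```python
import numpy as np

def classification_head(x, w1, b1, w2, b2, drop_mask=None):
    """Map the concatenated FFA/HSFA feature x to the final prediction
    y_hat via two fully connected layers.  The ReLU and the dropout
    placement are assumptions for illustration.
    """
    h = np.maximum(x @ w1 + b1, 0.0)  # FC layer 1 + ReLU
    if drop_mask is not None:
        h = h * drop_mask             # dropout (pre-scaled inverted mask)
    return h @ w2 + b2                # FC layer 2 -> prediction y_hat
```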
MAP-based model
- MAP (Maximum A Posteriori) estimation
$$\begin{aligned} \theta^* &= \operatorname{argmax}_\theta \prod_{i=1}^r p\left(\theta \mid x_i, y_i\right) \\ &= \operatorname{argmax}_\theta \frac{\prod_{i=1}^r p\left(x_i, y_i \mid \theta\right) p(\theta)}{\prod_{i=1}^r p\left(x_i, y_i\right)} \\ &= \operatorname{argmax}_\theta \prod_{i=1}^r p\left(x_i, y_i \mid \theta\right) p(\theta) \\ &= \operatorname{argmax}_\theta \left(\log \prod_{i=1}^r p\left(x_i, y_i \mid \theta\right) + \log p(\theta)\right) \end{aligned}$$
- Taking the likelihood as a Gaussian,
$$p\left(x_i, y_i \mid \theta\right) \propto \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left\|y_i-\hat{y}_i\right\|^2}{2 \sigma^2}\right)$$
and the prior as a zero-mean Gaussian, $p(\theta) \propto \exp\left(-\eta\|\theta\|^2\right)$, the resulting loss function is
$$L(\theta)=\frac{1}{2} \sum_{i=1}^r\left\|y_i-\hat{y}_i\right\|^2+\eta\|\theta\|^2$$
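The loss is then just a sum-of-squares data term plus an L2 penalty on the parameters; a minimal sketch (the value of $\eta$ below is an illustrative assumption):

```python
import numpy as np

def map_loss(y, y_hat, theta, eta=1e-4):
    """MAP loss: L(theta) = 1/2 * sum_i ||y_i - y_hat_i||^2 + eta * ||theta||^2.

    y, y_hat: arrays of targets and predictions; theta: flattened model
    parameters.  eta's default value is an assumption for illustration.
    """
    data_term = 0.5 * np.sum((y - y_hat) ** 2)   # Gaussian likelihood term
    prior_term = eta * np.sum(theta ** 2)        # Gaussian (L2) prior term
    return data_term + prior_term
```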
Experiments
Results on the CUB-200-2011 dataset
- The biggest issue with the experiments is that there is no direct comparison with Swin itself (Swin's performance on the CUB dataset is only mentioned in the ablation study).
Results on the NABirds dataset
Results on the Stanford Cars dataset
Visualization (ScoreCAM)
Ablation study
- Effect of $k$ on the FFA module, and positional embeddings
- Effect of head number in self-attention operation, and positional embeddings
- Effects of HSFA and FFA modules
- Effect of image resolution