Contents
- Introduction
- Method
- Cross-Hierarchical Region Feature (CHRF) learning framework
- Region Feature Mining Module (RFM): learn different granularity-wise attention regions with multi-grained classification tasks
- Cross-Hierarchical Orthogonal Fusion Module (COF): explore how human attention shifts from one hierarchy to another
- Experiments
- References
Introduction
- The authors use a Region Feature Mining Module (RFM) to mine salient region features at each granularity, then use a Cross-Hierarchical Orthogonal Fusion Module (COF) to enhance the components of the fine-grained salient region features that are orthogonal to the coarse-grained ones, improving the discriminability of the fine-grained features. Notably, instead of directly selecting salient regions from the image (e.g. anchor-based), the authors set up learnable region prototypes to pick out salient region features for classification — an implicit way of finding discriminative regions
Method
Cross-Hierarchical Region Feature (CHRF) learning framework
- Trunk: Given an image $x$ with labels $\{y^1,y^2,...,y^L\}$, the trunk extracts the image feature $f(x)\in\mathbb{R}^{W_1\times H_1\times C_1}$ with a CNN $f(\cdot)$
- Branches: The branches use $L$ region feature mining (RFM) modules to extract region features of different granularities from $f(x)$; the granularity-wise attention region representation at level $l$ is $B_l(x)\in\mathbb{R}^{M_l\times C_2}$, where $M_l$ is the number of regions at level $l$
- Leaves: The leaves integrate the region representations of two adjacent levels, $B_{l-1}(x)$ and $B_l(x)$, with a cross-hierarchical orthogonal fusion (COF) module to obtain the discriminative region orthogonal feature $O_l(x)\in\mathbb{R}^{M_l\times C_2}$. The loss at each level is a cross-entropy loss
Through back-propagation, coarse-grained classification receives guidance from fine-grained classification; through the forward pass, fine-grained classification compares the difference between fine-grained and coarse-grained observations to improve the discriminability of the hierarchical representations. Performance on both coarse- and fine-grained classification therefore improves
Region Feature Mining Module (RFM): learn different granularity-wise attention regions with multi-grained classification tasks
- The main purpose of RFM is to extract image features of different granularities. For level $l$, RFM uses a CNN $\phi_l(\cdot)$ (exclusive to that hierarchy) to extract the granularity-wise semantic feature $\phi_l(x)\in\mathbb{R}^{W_2\times H_2\times C_2}$. RFM also maintains $M_l$ learnable region prototypes $R_l=\{r_{l,m}\in\mathbb{R}^{C_2}\}_{m=1}^{M_l}$ to mine $M_l$ regions from $\phi_l(x)$. Concretely, taking the dot product between each of the $W_2\times H_2$ feature vectors of $\phi_l(x)$ and $r_{l,m}$ yields a similarity map $\in\mathbb{R}^{W_2\times H_2}$, giving $M_l$ similarity maps in total. Applying batch normalization + ReLU to them produces the region masks $A_l(x)=\{a_{l,m}(x)\in\mathbb{R}^{W_2\times H_2}\}_{m=1}^{M_l}$. Weighting $\phi_l(x)$ with $a_{l,m}(x)$ gives the $m$-th region representation $b_{l,m}(x)$
Concatenating the $M_l$ region representations yields the observation of level $l$: $B_l(x)=[b_{l,1}(x),b_{l,2}(x),...,b_{l,M_l}(x)]$
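The prototype-to-mask-to-region pipeline above can be sketched as follows. This is an illustrative single-image NumPy version, not the authors' code: batch normalization is replaced by a simple per-map standardization, and all shapes and names are assumptions.

```python
import numpy as np

def rfm_forward(phi, prototypes, eps=1e-6):
    """Mine M_l region representations from phi (H2, W2, C2) with
    learnable prototypes (M_l, C2). Per-map standardization stands in
    for BatchNorm in this single-image sketch (an assumption)."""
    h, w, c = phi.shape
    feats = phi.reshape(-1, c)                    # (H2*W2, C2)
    sim = feats @ prototypes.T                    # similarity maps, (H2*W2, M_l)
    sim = (sim - sim.mean(0)) / (sim.std(0) + eps)  # stand-in for BN
    masks = np.maximum(sim, 0.0)                  # ReLU -> region masks a_{l,m}
    # mask-weighted average pooling -> one C2-dim vector per region
    weights = masks / (masks.sum(0, keepdims=True) + eps)
    regions = weights.T @ feats                   # B_l(x): (M_l, C2)
    return regions

rng = np.random.default_rng(0)
phi = rng.standard_normal((7, 7, 64))             # phi_l(x)
prototypes = rng.standard_normal((4, 64))         # M_l = 4 prototypes
B_l = rfm_forward(phi, prototypes)
print(B_l.shape)  # (4, 64)
```

Because the masks are normalized per region, each row of `B_l` is a convex combination of spatial feature vectors, i.e. an attention-pooled region representation.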
Cross-Hierarchical Orthogonal Fusion Module (COF): explore how human attention shifts from one hierarchy to another
- When classifying at level $l$, humans generally ignore common coarse-grained characteristics and focus on a few discriminative fine-grained regions. Inspired by Wu, Aming, et al. "Vector-decomposed disentanglement for domain-invariant object detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021 and Yang, Min, et al. "DOLG: Single-stage image retrieval with deep orthogonal fusion of local and global features." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, the authors argue that the discriminative features of level $l$ should be decoupled from the finer-grained region representation $B_l(x)$ via feature vector decomposition, improving the discriminability of the region representations across hierarchies
- COF first obtains a global observation $G_{l-1}(x)\in\mathbb{R}^{1\times C_2}$ from $B_{l-1}(x)$ via average pooling (i.e. the salient regions for coarse-grained classification, which, from the fine-grained perspective, can be viewed as a coarse global observation of the image)
It then computes the projection $b_{l,m}^{proj}(x)$ of each of the $M_l$ fine-grained features $b_{l,m}(x)$ ($1\leq m\leq M_l$) onto the global observation
This projection can be regarded as the common coarse-grained component contained in the fine-grained feature; removing it yields the discriminative region observation $b_{l,m}^{orth}(x)$
Finally, the fine-grained-exclusive feature $b_{l,m}^{orth}(x)$ is added back onto the fine-grained feature $b_{l,m}(x)$ to enhance its discriminability (i.e. fusion), giving the region orthogonal feature $o_{l,m}(x)$
Concatenating all region orthogonal features gives the level-$l$ region orthogonal feature $O_l(x)=[o_{l,1}(x), o_{l,2}(x), ..., o_{l,M_l}(x)]\in\mathbb{R}^{M_l\times C_2}$
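The project-subtract-fuse steps can be sketched in a few lines. This is a minimal NumPy illustration of the decomposition, not the authors' implementation; the additive fusion `b + b_orth` matches the description above, and all names are assumptions.

```python
import numpy as np

def cof_fuse(B_prev, B_l, eps=1e-12):
    """Project each fine-grained region feature onto the coarse global
    observation, subtract the projection to keep the orthogonal
    (level-exclusive) part, and fuse it back additively."""
    g = B_prev.mean(axis=0)                       # G_{l-1}(x): (C2,)
    g_unit = g / (np.linalg.norm(g) + eps)
    proj = (B_l @ g_unit)[:, None] * g_unit       # b^{proj}_{l,m}
    orth = B_l - proj                             # b^{orth}_{l,m}
    return B_l + orth                             # o_{l,m} = b_{l,m} + b^{orth}_{l,m}

rng = np.random.default_rng(1)
B_prev = rng.standard_normal((3, 8))              # M_{l-1} coarse regions
B_l = rng.standard_normal((5, 8))                 # M_l fine regions
O_l = cof_fuse(B_prev, B_l)
# the removed component is exactly perpendicular to the global observation
assert np.allclose((O_l - B_l) @ B_prev.mean(axis=0), 0.0, atol=1e-8)
print(O_l.shape)  # (5, 8)
```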
- To reduce the correlation between different region features at the same level and push the model to genuinely attend to $M_l$ distinct regions, the authors introduce an orthogonal region regularization, inspired by Ranasinghe, Kanchana, et al. "Orthogonal projection loss." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. Concretely, an orthogonal region bank stores the center representation $c_m^{y^l}$ of each of the $M_l$ regions for every class at level $l$ ($|y^l|\times M_l$ centers in total). The regularization comprises two terms:
The first term encourages the $m$-th ($1\leq m\leq M_l$) region features extracted from different samples of the same class $y^l$ at level $l$ to be similar to each other (i.e. close to their center $c_m^{y^l}$); the second term encourages different region features extracted from samples of the same class $y^l$ at level $l$ to be dissimilar to one another. This enforces orthogonality between region orthogonal features, reduces their correlation, and helps RFM discover more distinct feature regions. The center representations are initialized to 0 and updated during training
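A pull-toward-center / push-apart regularizer of this kind can be sketched as follows. The cosine-similarity form, the absolute-value push term, and the moving-average coefficient `gamma` are all assumptions for illustration — the paper's exact equations are not reproduced in these notes.

```python
import numpy as np

def orthogonal_region_reg(B_l, centers, gamma=0.99):
    """Illustrative orthogonal-region-style regularizer. B_l holds the
    M_l region features of one sample of class y^l; centers holds that
    class's M_l center representations c_m^{y^l}."""
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)
    b, c = unit(B_l), unit(centers)
    # term 1: pull each region feature toward its own class center
    pull = 1.0 - np.sum(b * c, axis=-1).mean()
    # term 2: push different regions of the same class apart
    sim = b @ b.T
    m = len(B_l)
    push = np.abs(sim[~np.eye(m, dtype=bool)]).mean()
    # moving-average update of the center bank (assumed update rule)
    new_centers = gamma * centers + (1 - gamma) * B_l
    return pull + push, new_centers

rng = np.random.default_rng(2)
B_l = rng.standard_normal((4, 16))
centers = np.zeros((4, 16))       # centers start at 0, as in the notes
loss, centers = orthogonal_region_reg(B_l, centers)
print(round(float(loss), 3))
```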
The total orthogonal region regularization is the sum of the per-level regularization terms over all $L$ hierarchies
The total loss combines the cross-entropy losses of all levels with this regularization
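Based on the description above, a plausible form of the overall objective is sketched below; the balance weight $\beta$ is an assumption, since the paper's exact equation is not reproduced in these notes:

```latex
\mathcal{L}_{total} = \sum_{l=1}^{L} \mathcal{L}_{ce}^{l}
                    + \beta \sum_{l=1}^{L} \mathcal{L}_{or}^{l}
```

where $\mathcal{L}_{ce}^{l}$ is the cross-entropy loss of level $l$ and $\mathcal{L}_{or}^{l}$ is the orthogonal region regularization of level $l$.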
Experiments
The implementation details are provided in Appendix C
Note that this paper uses the CUB class hierarchy from Chen, Tianshui, et al. "Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding." Proceedings of the 26th ACM international conference on Multimedia. 2018, rather than the one from Chang, Dongliang, et al. "Your 'Flamingo' is My 'Bird': Fine-Grained, or Not." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021
- wAP (weighted AP): a weighted average of the per-level precisions $P_l$, in which finer-grained levels receive larger weights
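A metric of this shape can be sketched as follows. The weighting scheme `w_l = l` (weight proportional to hierarchy depth, so finer levels count more) is an assumption for illustration — the paper's exact weights are not reproduced in these notes.

```python
def weighted_ap(precisions):
    """Weighted-AP-style metric over per-level precisions P_l, ordered
    from coarsest to finest; finer levels get larger (assumed) weights."""
    weights = range(1, len(precisions) + 1)   # assumed: w_l = l
    return sum(w * p for w, p in zip(weights, precisions)) / sum(weights)

# three-level example: order, family, species precision
print(round(weighted_ap([0.98, 0.95, 0.88]), 4))  # 0.92
```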
- Baseline (ResNet-50) consists of the backbone $f(\cdot)$ (the first three convolution groups) and the hierarchical feature extraction network $\phi(\cdot)$, with the backbone frozen; Baseline++ has the same structure but trains the backbone; HSE is from Chen, Tianshui, et al. "Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding." Proceedings of the 26th ACM international conference on Multimedia. 2018; FGN is from Chang, Dongliang, et al. "Your 'Flamingo' is My 'Bird': Fine-Grained, or Not." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021; Ours-RF is the baseline + RFM module; Ours-CHRF is the authors' final model. Since Baseline++ performs about the same as Baseline, CHRF also freezes the parameters of $f(\cdot)$ during training
- Evaluation on Traditional FGVC Setting
- Further Analysis.
The authors also tried different fusion strategies in COF. They found that when $\lambda$ is set as a learnable parameter, the region orthogonal feature overfits the training set; the concat fusion strategy also performs poorly
- Where to Focus? We visualize the attention maps of humans, Ours-RF, and Ours-CHRF in Fig. 5.