Contents
- Introduction
- Method
- Cross-Hierarchical Region Feature (CHRF) learning framework
- Region Feature Mining Module (RFM): learn different granularity-wise attention regions with multi-grained classification tasks
- Cross-Hierarchical Orthogonal Fusion Module (COF): explore how human attention shifts from one hierarchy to another
- Experiments
- References
Introduction
- The authors use a Region Feature Mining Module (RFM) to mine salient region features at each granularity, then use a Cross-Hierarchical Orthogonal Fusion Module (COF) to enhance the components of the fine-grained salient region features that are orthogonal to the coarse-grained ones, improving the discriminability of the fine-grained features. Notably, instead of directly selecting salient regions from the image (e.g. anchor-based), the authors set up learnable region prototypes to pick out salient region features for classification — an implicit way of finding discriminative regions
Method
Cross-Hierarchical Region Feature (CHRF) learning framework
- Trunk: Given an image $x$ with labels $\{y^1,y^2,...,y^L\}$, the trunk extracts the image feature $f(x)\in\mathbb{R}^{W_1\times H_1\times C_1}$ with a CNN $f(\cdot)$
- Branches: The branches use $L$ region feature mining (RFM) modules to extract region features of different granularities from $f(x)$; the granularity-wise attention region representation at level $l$ is $B_l(x)\in\mathbb{R}^{M_l\times C_2}$, where $M_l$ is the number of regions at level $l$
- Leaves: The leaves integrate the region representations of two adjacent levels, $B_{l-1}(x)$ and $B_l(x)$, with a cross-hierarchical orthogonal fusion (COF) module to obtain the discriminative region orthogonal feature $O_l(x)\in\mathbb{R}^{M_l\times C_2}$. The loss at each level is a cross-entropy loss
Through back-propagation, coarse-grained classification receives guidance from fine-grained classification; through the forward pass, fine-grained classification compares the difference between fine-grained and coarse-grained observations to improve the discriminability of the hierarchical representations. Performance on both coarse- and fine-grained classification therefore improves
Region Feature Mining Module (RFM): learn different granularity-wise attention regions with multi-grained classification tasks
- The main purpose of RFM is to extract image features of different granularities. For level $l$, RFM uses a CNN $\phi_l(\cdot)$ (exclusive to that hierarchy) to extract the granularity-wise semantic feature $\phi_l(x)\in\mathbb{R}^{W_2\times H_2\times C_2}$. RFM also maintains $M_l$ learnable region prototypes $R_l=\{r_{l,m}\in\mathbb{R}^{C_2}\}_{m=1}^{M_l}$ to mine $M_l$ regions from $\phi_l(x)$. Concretely, taking the dot product between each of the $W_2\times H_2$ feature vectors of $\phi_l(x)$ and $r_{l,m}$ yields a similarity map $\in\mathbb{R}^{W_2\times H_2}$, giving $M_l$ similarity maps in total. Applying batch normalization + ReLU to them produces the region masks $A_l(x)=\{a_{l,m}(x)\in\mathbb{R}^{W_2\times H_2}\}_{m=1}^{M_l}$. Weighting $\phi_l(x)$ with $a_{l,m}(x)$ gives the $m$-th region representation $b_{l,m}(x)$
Concatenating the $M_l$ region representations yields the observation of level $l$: $B_l(x)=[b_{l,1}(x),b_{l,2}(x),...,b_{l,M_l}(x)]$
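The prototype-to-mask-to-region pipeline above can be sketched as follows. This is an illustrative single-image NumPy version, not the authors' code: batch normalization is replaced by a simple per-map standardization, and all shapes and names are assumptions.

```python
import numpy as np

def rfm_forward(phi, prototypes, eps=1e-6):
    """Mine M_l region representations from phi (H2, W2, C2) with
    learnable prototypes (M_l, C2). Per-map standardization stands in
    for BatchNorm in this single-image sketch (an assumption)."""
    h, w, c = phi.shape
    feats = phi.reshape(-1, c)                    # (H2*W2, C2)
    sim = feats @ prototypes.T                    # similarity maps, (H2*W2, M_l)
    sim = (sim - sim.mean(0)) / (sim.std(0) + eps)  # stand-in for BN
    masks = np.maximum(sim, 0.0)                  # ReLU -> region masks a_{l,m}
    # mask-weighted average pooling -> one C2-dim vector per region
    weights = masks / (masks.sum(0, keepdims=True) + eps)
    regions = weights.T @ feats                   # B_l(x): (M_l, C2)
    return regions

rng = np.random.default_rng(0)
phi = rng.standard_normal((7, 7, 64))             # phi_l(x)
prototypes = rng.standard_normal((4, 64))         # M_l = 4 prototypes
B_l = rfm_forward(phi, prototypes)
print(B_l.shape)  # (4, 64)
```

Because the masks are normalized per region, each row of `B_l` is a convex combination of spatial feature vectors, i.e. an attention-pooled region representation.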
Cross-Hierarchical Orthogonal Fusion Module (COF): explore how human attention shifts from one hierarchy to another
- When classifying at level $l$, humans generally ignore common coarse-grained characteristics and focus on a few discriminative fine-grained regions. Inspired by Wu, Aming, et al. "Vector-decomposed disentanglement for domain-invariant object detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021 and Yang, Min, et al. "DOLG: Single-stage image retrieval with deep orthogonal fusion of local and global features." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, the authors argue that the discriminative features of level $l$ should be decoupled from the finer-grained region representation $B_l(x)$ via feature vector decomposition, improving the discriminability of the region representations across hierarchies
- COF first obtains a global observation $G_{l-1}(x)\in\mathbb{R}^{1\times C_2}$ from $B_{l-1}(x)$ via average pooling (i.e. the salient regions for coarse-grained classification, which, from the fine-grained perspective, can be viewed as a coarse global observation of the image)
It then computes the projection $b_{l,m}^{proj}(x)$ of each of the $M_l$ fine-grained features $b_{l,m}(x)$ ($1\leq m\leq M_l$) onto the global observation
This projection can be regarded as the common coarse-grained component contained in the fine-grained feature; removing it yields the discriminative region observation $b_{l,m}^{orth}(x)$
Finally, the fine-grained-exclusive feature $b_{l,m}^{orth}(x)$ is added back onto the fine-grained feature $b_{l,m}(x)$ to enhance its discriminability (i.e. fusion), giving the region orthogonal feature $o_{l,m}(x)$
Concatenating all region orthogonal features gives the level-$l$ region orthogonal feature $O_l(x)=[o_{l,1}(x), o_{l,2}(x), ..., o_{l,M_l}(x)]\in\mathbb{R}^{M_l\times C_2}$
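The project-subtract-fuse steps can be sketched in a few lines. This is a minimal NumPy illustration of the decomposition, not the authors' implementation; the additive fusion `b + b_orth` matches the description above, and all names are assumptions.

```python
import numpy as np

def cof_fuse(B_prev, B_l, eps=1e-12):
    """Project each fine-grained region feature onto the coarse global
    observation, subtract the projection to keep the orthogonal
    (level-exclusive) part, and fuse it back additively."""
    g = B_prev.mean(axis=0)                       # G_{l-1}(x): (C2,)
    g_unit = g / (np.linalg.norm(g) + eps)
    proj = (B_l @ g_unit)[:, None] * g_unit       # b^{proj}_{l,m}
    orth = B_l - proj                             # b^{orth}_{l,m}
    return B_l + orth                             # o_{l,m} = b_{l,m} + b^{orth}_{l,m}

rng = np.random.default_rng(1)
B_prev = rng.standard_normal((3, 8))              # M_{l-1} coarse regions
B_l = rng.standard_normal((5, 8))                 # M_l fine regions
O_l = cof_fuse(B_prev, B_l)
# the removed component is exactly perpendicular to the global observation
assert np.allclose((O_l - B_l) @ B_prev.mean(axis=0), 0.0, atol=1e-8)
print(O_l.shape)  # (5, 8)
```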
- To reduce the correlation between different region features at the same level and push the model to genuinely attend to $M_l$ distinct regions, the authors introduce an orthogonal region regularization, inspired by Ranasinghe, Kanchana, et al. "Orthogonal projection loss." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. Concretely, an orthogonal region bank stores the center representation $c_m^{y^l}$ of each of the $M_l$ regions for every class at level $l$ ($|y^l|\times M_l$ centers in total). The regularization comprises two terms:
The first term encourages the $m$-th ($1\leq m\leq M_l$) region features extracted from different samples of the same class $y^l$ at level $l$ to be similar to each other (i.e. close to their center $c_m^{y^l}$); the second term encourages different region features extracted from samples of the same class $y^l$ at level $l$ to be dissimilar to one another. This enforces orthogonality between region orthogonal features, reduces their correlation, and helps RFM discover more distinct feature regions. The center representations are initialized to 0 and updated during training
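A pull-toward-center / push-apart regularizer of this kind can be sketched as follows. The cosine-similarity form, the absolute-value push term, and the moving-average coefficient `gamma` are all assumptions for illustration — the paper's exact equations are not reproduced in these notes.

```python
import numpy as np

def orthogonal_region_reg(B_l, centers, gamma=0.99):
    """Illustrative orthogonal-region-style regularizer. B_l holds the
    M_l region features of one sample of class y^l; centers holds that
    class's M_l center representations c_m^{y^l}."""
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)
    b, c = unit(B_l), unit(centers)
    # term 1: pull each region feature toward its own class center
    pull = 1.0 - np.sum(b * c, axis=-1).mean()
    # term 2: push different regions of the same class apart
    sim = b @ b.T
    m = len(B_l)
    push = np.abs(sim[~np.eye(m, dtype=bool)]).mean()
    # moving-average update of the center bank (assumed update rule)
    new_centers = gamma * centers + (1 - gamma) * B_l
    return pull + push, new_centers

rng = np.random.default_rng(2)
B_l = rng.standard_normal((4, 16))
centers = np.zeros((4, 16))       # centers start at 0, as in the notes
loss, centers = orthogonal_region_reg(B_l, centers)
print(round(float(loss), 3))
```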
The total orthogonal region regularization is the sum of the per-level regularization terms over all $L$ hierarchies
The total loss combines the cross-entropy losses of all levels with this regularization
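Based on the description above, a plausible form of the overall objective is sketched below; the balance weight $\beta$ is an assumption, since the paper's exact equation is not reproduced in these notes:

```latex
\mathcal{L}_{total} = \sum_{l=1}^{L} \mathcal{L}_{ce}^{l}
                    + \beta \sum_{l=1}^{L} \mathcal{L}_{or}^{l}
```

where $\mathcal{L}_{ce}^{l}$ is the cross-entropy loss of level $l$ and $\mathcal{L}_{or}^{l}$ is the orthogonal region regularization of level $l$.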
Experiments
The implementation details are provided in Appendix C
Note that this paper uses the CUB class hierarchy from Chen, Tianshui, et al. "Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding." Proceedings of the 26th ACM international conference on Multimedia. 2018, rather than the one from Chang, Dongliang, et al. "Your 'Flamingo' is My 'Bird': Fine-Grained, or Not." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021
- wAP (weighted AP): a weighted average of the per-level precisions $P_l$, in which finer-grained levels receive larger weights
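A metric of this shape can be sketched as follows. The weighting scheme `w_l = l` (weight proportional to hierarchy depth, so finer levels count more) is an assumption for illustration — the paper's exact weights are not reproduced in these notes.

```python
def weighted_ap(precisions):
    """Weighted-AP-style metric over per-level precisions P_l, ordered
    from coarsest to finest; finer levels get larger (assumed) weights."""
    weights = range(1, len(precisions) + 1)   # assumed: w_l = l
    return sum(w * p for w, p in zip(weights, precisions)) / sum(weights)

# three-level example: order, family, species precision
print(round(weighted_ap([0.98, 0.95, 0.88]), 4))  # 0.92
```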
- Baseline (ResNet-50) consists of the backbone $f(\cdot)$ (the first three convolution groups) and the hierarchical feature extraction network $\phi(\cdot)$, with the backbone frozen; Baseline++ has the same structure but trains the backbone; HSE is from Chen, Tianshui, et al. "Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding." Proceedings of the 26th ACM international conference on Multimedia. 2018; FGN is from Chang, Dongliang, et al. "Your 'Flamingo' is My 'Bird': Fine-Grained, or Not." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021; Ours-RF is the baseline + RFM module; Ours-CHRF is the authors' final model. Since Baseline++ performs about the same as Baseline, CHRF also freezes the parameters of $f(\cdot)$ during training
- Evaluation on Traditional FGVC Setting
- Further Analysis.
The authors also tried different fusion strategies in COF. They found that when $\lambda$ is set as a learnable parameter, the region orthogonal feature overfits the training set; the concat fusion strategy also performs poorly
- Where to Focus? We visualize the attention maps of humans, Ours-RF, and Ours-CHRF in Fig. 5.