[CVPR 2022] Fine-Grained Object Classification via Self-Supervised Pose Alignment

连理o

Published 2023-02-10 19:28:11

Tags: CVPR 2022

Article link: https://blog.csdn.net/weixin_42437114/article/details/128975994

Introduction
Method
Experiments
References

Introduction

This paper proposes a novel feature regularization scheme for learning pose-insensitive discriminative representations for fine-grained classification.

Method

P2P-Net (Part-to-Pose network)

Curriculum Supervision on Backbone Network

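P2P-Net uses ResNet as its backbone. ResNet has $S$ stages, whose intermediate outputs are $F^{(s)}$ ( $s\in\{1,...,S\}$ ). P2P-Net feeds each $F^{(s)}$ into its own convolutional layers followed by a global max pooling (GMP) layer, which gives $S$ image features $\{r_{im}^{(s)}\}$ . Finally, $S + 1$ MLPs produce $S + 1$ predictions, where the $(S + 1)$ -th prediction comes from an MLP applied to the concatenation of the $S$ image features.
The authors also argue that deeper networks have stronger recognition ability, so the $S + 1$ predictions, which come from sub-networks of different depths, can each be supervised according to that ability (curriculum training scheme). Specifically, label smoothing is used to generate a soft label for each prediction according to the depth of the network that produced it. The soft label is determined by a smoothing factor $\alpha\in(0,1]$ :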
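$y_\alpha[t]=\left\{\begin{array}{ll} \alpha, & t=y \\ \frac{1-\alpha}{K-1}, & t \neq y \end{array}\right.$ where $K$ is the number of classes. As $s$ increases, the authors raise $\alpha^{(s)}$ from a value slightly above $\frac{1}{K}$ up to 1, which builds easy-to-hard curriculum targets. The smooth cross-entropy loss is the cross-entropy between the prediction and the soft label:
$L_{sce}(\hat y^{(s)}, y, \alpha^{(s)}) = -\sum_{t=1}^{K} y_{\alpha^{(s)}}[t] \log \hat y^{(s)}[t]$
where $\hat y^{(s)}$ is the $s$ -th prediction and $y_{\alpha^{(s)}}$ is the $s$ -th soft label.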
The authors argue that this design has the following benefits:
- (1) by making predictions at different depths on the network, complementary information of multi-granularity can be fused to capture discriminative object features;
- (2) by connecting more layers to the output, parameters in shallower layers become easier to optimize.
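The curriculum targets described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are made up for this post, and the linear schedule for $\alpha^{(s)}$ (including the `margin` that places the first value slightly above $\frac{1}{K}$ ) is an assumption, since the notes only say that $\alpha^{(s)}$ grows toward 1 with depth.

```python
import numpy as np

def smooth_label(y, num_classes, alpha):
    """Soft target: alpha on the true class, (1 - alpha) / (K - 1) elsewhere."""
    target = np.full(num_classes, (1.0 - alpha) / (num_classes - 1))
    target[y] = alpha
    return target

def smooth_cross_entropy(logits, y, alpha):
    """Cross-entropy between softmax(logits) and the smoothed label."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                     # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    target = smooth_label(y, logits.shape[0], alpha)
    return float(-(target * log_probs).sum())

def curriculum_alphas(num_heads, num_classes, margin=0.05):
    """Hypothetical schedule: linear from slightly above 1/K up to 1 (an assumption)."""
    start = 1.0 / num_classes + margin
    return np.linspace(start, 1.0, num_heads)

# Example: K = 200 classes, S = 4 stages, so S + 1 = 5 prediction heads.
K, S = 200, 4
alphas = curriculum_alphas(S + 1, K)
logits = np.random.default_rng(0).normal(size=K)       # stand-in for one head's output
losses = [smooth_cross_entropy(logits, y=3, alpha=a) for a in alphas]
```

With $\alpha = 1$ the soft label is one-hot, so the deepest head is trained with the ordinary cross-entropy, while shallower heads get softer, easier targets.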

Contrastive Feature Regularization

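After localizing discriminative parts, the usual practice is to concat or fuse their features into a new feature for classification. The authors instead minimize the KL divergence between the salient-region features and the global image feature, which injects the fine-grained information of the salient regions into the global image feature (a feature regularization on representations between local parts and global object to enforce incorporating fine-grained details from distinctive parts into image representation).
Weakly-Supervised Part Localization: the authors attach an FPN after the last feature block of the backbone to generate feature maps with 3 different receptive fields ( $14 \times 14, 7 \times 7, 4 \times 4$ ). Each location on a feature map corresponds to an image patch of a specific size. Borrowing the idea of RPN, each location is associated with several anchors, and the different channels at that location give the scores of those anchors, so the feature maps can be treated as score maps.
Given the score maps, the authors apply NMS on the scores to discard redundant anchors and then select the top- $N$ parts.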
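The NMS and top- $N$ selection step can be sketched as follows. This is a generic greedy NMS under assumed conventions: boxes are `(x1, y1, x2, y2)` arrays, and `iou_threshold=0.25` is a placeholder value, not a setting taken from the paper. The helper names are hypothetical.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_top_parts(boxes, scores, top_n, iou_threshold=0.25):
    """Greedy NMS on anchor scores; returns indices of at most top_n kept anchors."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)            # highest score first
    keep = []
    while order.size > 0 and len(keep) < top_n:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop anchors that overlap the kept one too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

anchors = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [50, 50, 60, 60]]
anchor_scores = [0.9, 0.8, 0.7, 0.1]
parts = select_top_parts(anchors, anchor_scores, top_n=2)
```

In the example, the second anchor overlaps the first heavily and is suppressed, so the two selected parts are anchors 0 and 2.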
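Classification and Ranking Loss on Detected Parts: the steps above describe how a forward pass yields $N$ salient regions, but what supervision ensures that the model really outputs salient regions? The idea here is similar to NTS-Net: a ranking loss supervises the generation of salient regions, and what follows closely resembles the Navigator and Teacher in NTS-Net. After obtaining the top- $N$ parts, each salient region is cropped and resized ( $224 \times 224$ (half the spatial size of the original image)). The curriculum supervision described earlier is then repeated for every part, that is, the feature of each part $r_{p_n} = [r^{(1)}_{p_n} ; r^{(2)}_{p_n}; . . . ; r^{(S)}_{p_n}]$ (covering the features of all $S$ stages) is used to compute that part's smooth cross-entropy classification loss.
These smooth cross-entropy terms together make up the total classification loss.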
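We now have the classification losses $\{L_{p_n}\}^N$ of the top- $N$ parts and their corresponding scores $\{sc_{p_n}\}^N$ . As in NTS-Net, the authors argue that if $L_{p_n}<L_{p_n'}$ , then $p_n$ should receive the larger score, that is, $sc_{p_n}>sc_{p_n'}$ . This ordering constraint yields the ranking loss, which uses a margin of $\delta=1$ .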
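One way to turn the ordering constraint above into a loss is a pairwise hinge with margin $\delta$ , in the style of NTS-Net. The exact form used in the paper is not shown in these notes, so the sketch below is an assumption, and `pairwise_ranking_loss` is a hypothetical name.

```python
import numpy as np

def pairwise_ranking_loss(part_losses, part_scores, delta=1.0):
    """Hinge penalty for every pair whose score order disagrees with its loss order."""
    losses = np.asarray(part_losses, dtype=float)
    scores = np.asarray(part_scores, dtype=float)
    total = 0.0
    for n in range(len(losses)):
        for m in range(len(losses)):
            if losses[n] < losses[m]:
                # Part n classifies better, so its score should beat part m's by delta.
                total += max(0.0, scores[m] - scores[n] + delta)
    return total

# Scores agree with the loss ranking (by more than the margin): no penalty.
good = pairwise_ranking_loss([0.1, 0.5], [3.0, 1.0])
# Scores contradict the loss ranking: the pair is penalized.
bad = pairwise_ranking_loss([0.1, 0.5], [1.0, 3.0])
```

The classification losses act as fixed ranking targets here; only the scores would receive gradients from this term in a real training setup.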
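Contrastive Loss for Feature Regularization: as noted earlier, the authors do not simply concat or fuse the salient-region features into a new feature for classification. Instead, they minimize the KL divergence between the salient-region features and the global image feature to inject the fine-grained information of the salient regions into the global image feature. This saves model parameters at inference time, keeps the global information, and filters out redundant information in the part features. The contrastive loss is defined in terms of the image feature $r_{im}$ , the feature $r_{p_i}$ of the $i$ -th part, the stage index $s$ , the KL divergence $l_{kl}$ , and a 2-layer MLP $\phi$ .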
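A rough sketch of such a KL-based regularizer for a single stage is given below. Several details are assumptions made only for illustration: features are turned into distributions with a softmax, $\phi$ is applied to the concatenated part features, and the divergence is taken from the part distribution to the image distribution. The notes do not pin down any of these choices, and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def mlp2(x, w1, b1, w2, b2):
    """2-layer MLP with a ReLU in between (stand-in for phi)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def contrastive_regularizer(r_im, part_feats, params):
    """KL between projected part features and the image feature (assumed form)."""
    r_parts = mlp2(np.concatenate(part_feats), *params)
    return kl_divergence(softmax(r_parts), softmax(r_im))

rng = np.random.default_rng(0)
D, N, H = 8, 3, 16                                   # feature dim, parts, hidden units
r_im = rng.normal(size=D)
part_feats = [rng.normal(size=D) for _ in range(N)]
params = (rng.normal(size=(N * D, H)), np.zeros(H),
          rng.normal(size=(H, D)), np.zeros(D))
reg = contrastive_regularizer(r_im, part_feats, params)
```

Because the part branch only appears in this loss, it can be dropped at test time, which matches the point above about saving parameters during inference.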

Graph Matching for Part Alignment

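We can now obtain $N$ salient regions, but the authors point out that these regions are not aligned. For a bird, say, the salient regions are the head, wings, feet, and tail, yet the first selected region may be the wings in one image and the tail in another. After the features of all salient regions are concatenated, each segment of the resulting vector may therefore carry different semantics for different samples, which can hurt classification performance (feature inconsistency problem). What we want is a fixed order, so that the regions extracted from every sample are always "head, wings, feet, tail". Aligning the salient-region features can narrow the intra-class variance caused by pose changes.
The authors assume that the salient regions of each category are basically fixed, for example head, wings, feet, and tail for birds. Unfortunately, weak labels do not let us directly identify which part each region is, but the network can learn the relations between different parts. The authors therefore define a correlation matrix $M$ (reference matrix) that describes the relations between parts.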
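For the $N$ parts extracted from an image, we can compute the correlation matrix $M^{'}$ of every permutation (since $N\leq5$ , the cost of $N!$ is small) and take the permutation whose matrix $\hat M$ is most similar to $M$ as the order of the parts:
$\hat M = \arg\max_{M^{'}} \mathrm{vec}(M^{'})^\top \mathrm{vec}(M)$
where the similarity between two matrices is the inner product of the matrices reshaped into vectors. The reference matrix is updated with an online-updating scheme. The update mainly maintains the parts centers used to build the reference matrix: the sorted features of a new sample are combined with the existing parts centers by a weighted sum, so that older samples carry less and less weight (presumably an exponential moving average).
The procedure above is in effect a graph matching problem that considers only the similarity of edges (relations) and ignores the similarity of nodes (parts). Node similarity is ignored because even the same kind of part can have very different features in different images, whereas the relations between parts usually stay the same.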
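The matching procedure can be sketched with a brute-force search over permutations. Two things are assumptions here: the correlation matrix is taken to be the cosine similarity between part features (the notes do not give its exact definition), and the centers are updated with an exponential moving average using a placeholder `momentum`. All function names are hypothetical.

```python
import itertools
import numpy as np

def correlation_matrix(feats):
    """Assumed form: cosine similarity between every pair of part features."""
    feats = np.asarray(feats, dtype=float)
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return unit @ unit.T

def align_parts(part_feats, reference):
    """Return the permutation whose correlation matrix best matches the reference."""
    part_feats = np.asarray(part_feats, dtype=float)
    best_perm, best_sim = None, -np.inf
    for perm in itertools.permutations(range(len(part_feats))):
        m = correlation_matrix(part_feats[list(perm)])
        sim = float(m.ravel() @ reference.ravel())   # inner product of flattened matrices
        if sim > best_sim:
            best_perm, best_sim = perm, sim
    return list(best_perm)

def update_centers(centers, aligned_feats, momentum=0.9):
    """EMA update of the part centers (assumed update rule)."""
    return momentum * centers + (1.0 - momentum) * aligned_feats

centers = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
reference = correlation_matrix(centers)
sample = centers[[2, 0, 1]]                          # same parts, shuffled order
order = align_parts(sample, reference)
centers = update_centers(centers, sample[order])
```

Only pairwise relations enter the similarity, so the search recovers the order from the edge structure alone, without comparing a part to its center directly.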

Training and Inference

Training
Inference
At test time, only the test sub-network is needed.

Experiments

Comparison with State-of-the-arts
Ablation Studies (the baseline is ResNet + MLP): note that (e) performs much better than (c) on the AIR dataset. The authors' explanation is that global contour information of aircraft classes is well preserved when using feature concatenation, instead of regularizing image feature with relatively small parts.
Visualizations：Specifically, for aircraft samples, salient regions tend to locate in the body and tail; for bird breeds, they usually concern on birds’ head and body; and for cars, the front and body of vehicles contain discriminative details. This phenomenon is an important prerequisite for the proposed graph matching method.
Class Activation Map (Grad-CAM): Compared to the baseline, our P2P-Net has less activation on the background and is more concentrated on the discriminative regions of objects
Feature Visualization (t-SNE scatter plot): after applying parts’ feature regularization, image representations exhibit higher intra-classes variations and favor for enlarging inter-class margins, especially on the CUB and AIR datasets