[NeurIPS 2022] Relational Proxies: Emergent Relationships as Fine-Grained Discriminators

Article link: https://blog.csdn.net/weixin_42437114/article/details/130575534

Introduction
Relational Proxies
Experiments
References

Introduction

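The authors argue that for fine-grained visual categorization (FGVC), a model needs not only global/local view features but also cross-view relational information, i.e., "the way the local parts of an object combine to form its global view", and they give a theoretical framework for this claim. Building on that analysis, they propose Relational Proxies for FGVC.
Limitations. The authors' method for obtaining local views could be improved further.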

Relational Proxies

Notation. Given an image $\mathbf x \in \mathbb X$ with a label $\mathbf y \in \mathbb Y$, $\mathbf g = c_g(\mathbf x)$ and $\mathbb L = \{\mathbf l_1,\mathbf l_2,\ldots,\mathbf l_k\} = c_l(\mathbf x)$ are the global view and the set of local views of $\mathbf x$, respectively, where $c_g,c_l$ are cropping functions. $f$ is an encoder that maps the global / local views to a latent space ( $\mathbf z_g=f(\mathbf g)$ , $\mathbb Z_{\mathbb L}=\{f(\mathbf l):\mathbf l\in\mathbb L\}=\{\mathbf z_{\mathbf l_1},\mathbf z_{\mathbf l_2},\ldots,\mathbf z_{\mathbf l_k}\}$ ). $\mathbf r$ is the information relating the global view ( $\mathbf g$ ) to the set of local views ( $\mathbb L$ ), i.e., the cross-view relational information; it represents "the way the local parts of an object combine to form its global view".

Problem Definition

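The authors first define $k$-distinguishability and then use it to define the FGVC problem: an FGVC problem must contain two categories that a model can tell apart only if it is given at least $k$ local features.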

Relation-Agnostic Representations and Information Gap

Existing FGVC methods consider the global view and the local views only in isolation and ignore cross-view relational information; the feature space of an encoder learned this way consists of relation-agnostic representations.
Next, the authors prove Proposition 1: relation-agnostic encoders cannot fully capture the label information encoded in the image (i.e., $I(\mathbf x;\mathbf y)>I(\mathbf z;\mathbf y)$ ). The amount of information they miss is the Information Gap (i.e., $I(\mathbf x;\mathbf y)-I(\mathbf z;\mathbf y)$ ), and this gap equals $I(\mathbf x;\mathbf r|\mathbf z)$.

Proof.

Chain rule for mutual information.
(1) Since $I(\mathbf x; \mathbf y) + I(\mathbf x; \mathbf z|\mathbf y) =I(\mathbf x; \mathbf z) + I(\mathbf x; \mathbf y|\mathbf z)$ and $I(\mathbf x; \mathbf z|\mathbf y)=0$ (since $\mathbf z$ does not encode any more information about $\mathbf x$ than $\mathbf y$ ), we have $I(\mathbf x; \mathbf y) =I(\mathbf x; \mathbf z) + I(\mathbf x; \mathbf y|\mathbf z)$.
(2) Since $I(\mathbf y; \mathbf z) + I(\mathbf x; \mathbf z|\mathbf y) =I(\mathbf x; \mathbf z) + I(\mathbf y; \mathbf z|\mathbf x)$ and $I(\mathbf x; \mathbf z|\mathbf y)=0$, we have $I(\mathbf y; \mathbf z)=I(\mathbf x; \mathbf z) + I(\mathbf y; \mathbf z|\mathbf x)$.
(3) Since $I(\mathbf x; \mathbf z) + I(\mathbf y; \mathbf z|\mathbf x) =I(\mathbf y; \mathbf z) + I(\mathbf x; \mathbf z|\mathbf y)$ and $I(\mathbf y; \mathbf z|\mathbf x)=0$ (since $\mathbf z$ cannot encode any more information about $\mathbf y$ than $\mathbf x$ ), we have $I(\mathbf x; \mathbf z)=I(\mathbf y; \mathbf z) + I(\mathbf x; \mathbf z|\mathbf y)$.
(4) Since $I(\mathbf x; \mathbf z) + I(\mathbf r; \mathbf z|\mathbf x) =I(\mathbf r; \mathbf z) + I(\mathbf x; \mathbf z|\mathbf r)$ and $I(\mathbf r; \mathbf z|\mathbf x)=0$ (since $\mathbf z$ is relation-agnostic), we have $I(\mathbf x; \mathbf z) =I(\mathbf r; \mathbf z) + I(\mathbf x; \mathbf z|\mathbf r)$.
(5) Since $I(\mathbf x; \mathbf r) + I(\mathbf r; \mathbf z|\mathbf x) =I(\mathbf r; \mathbf z) + I(\mathbf x; \mathbf r|\mathbf z)$ and $I(\mathbf r; \mathbf z|\mathbf x)=0$, we have $I(\mathbf x; \mathbf r) =I(\mathbf r; \mathbf z) + I(\mathbf x; \mathbf r|\mathbf z)$.
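The identities above all come from the same chain rule, so they can be checked numerically. The sketch below is not from the paper: it builds an arbitrary random joint distribution over three small discrete variables standing in for $\mathbf x, \mathbf y, \mathbf z$ and verifies identity (1) before any zero-term assumption is applied. The helper names `random_joint`, `marginal`, `entropy` and `cond_mi` are made up for illustration.

```python
import itertools
import math
import random


def random_joint(shape, seed=0):
    """Random joint pmf over len(shape) discrete variables, as {outcome: prob}."""
    rng = random.Random(seed)
    outcomes = list(itertools.product(*[range(n) for n in shape]))
    weights = [rng.random() + 1e-3 for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}


def marginal(p, axes):
    """Marginalise the joint pmf p onto the variables at the given indices."""
    out = {}
    for outcome, prob in p.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + prob
    return out


def entropy(p, axes):
    """Shannon entropy (in nats) of the variables at the given indices."""
    return -sum(q * math.log(q) for q in marginal(p, axes).values() if q > 0)


def cond_mi(p, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); C may be empty."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (entropy(p, a + c) + entropy(p, b + c)
            - entropy(p, a + b + c) - entropy(p, c))


# Variable indices in the joint: 0 -> x, 1 -> y, 2 -> z (toy stand-ins).
X, Y, Z = (0,), (1,), (2,)
p = random_joint((3, 2, 4))

# Identity (1): I(x;y) + I(x;z|y) = I(x;z) + I(x;y|z)
lhs = cond_mi(p, X, Y) + cond_mi(p, X, Z, Y)
rhs = cond_mi(p, X, Z) + cond_mi(p, X, Y, Z)
print(abs(lhs - rhs) < 1e-9)
```

Both sides equal $I(\mathbf x;\mathbf y,\mathbf z)$, which is why the check holds for any joint distribution. The simplifications in (1) to (5) then depend only on which conditional term is assumed to be zero.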
Lemma 1.
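Proof. $I(\mathbf x; \mathbf y|\mathbf z)$ is the part of the information in $\mathbf x$ that is predictive of $\mathbf y$ but that $\mathbf z$ cannot capture. The authors take this to be $I(\mathbf x;\mathbf r)$ (i.e., the predictive information that $\mathbf r$ does capture from $\mathbf x$ ). This equality rests on the authors' conjecture that FGVC needs global view + local views + cross-view relationship. Expanding $I(\mathbf x;\mathbf r)$ with the chain rule, as in identity (5), gives $I(\mathbf x; \mathbf r) =I(\mathbf r; \mathbf z) + I(\mathbf x; \mathbf r|\mathbf z)$.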
Lemma 2.
Proof.
Proposition 1.
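Proof. First expand $I(\mathbf x;\mathbf y)$ with the chain rule, then substitute the results of Lemmas 1 and 2.
In addition, from the chain rule $I(\mathbf y; \mathbf z)=I(\mathbf x; \mathbf z) + I(\mathbf y; \mathbf z|\mathbf x)$ together with $I(\mathbf y; \mathbf z|\mathbf x)=0$, we get $I(\mathbf y; \mathbf z)=I(\mathbf x; \mathbf z)$. Substituting this yields the Information Gap stated earlier, $I(\mathbf x;\mathbf y)-I(\mathbf z;\mathbf y)=I(\mathbf x;\mathbf r|\mathbf z)$.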

Sufficient Learner

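The authors first define the geometric property of relation-agnostic representations: the global-view feature lies away from the local-view features, and the distance between them is proportional to the Information Gap $I(\mathbf x;\mathbf r|\mathbf z)$. (Personal interpretation: when there is no Information Gap, the global-view feature coincides with some local-view feature, which means that this local view is already the most important local feature and the relational information no longer matters.)
In Lemma 3 the authors then show that if the downstream objective of the encoder $f$ is cross-entropic, $f$ naturally produces relation-agnostic representations that satisfy the geometric property in Definition 4.
Proof. For FGVC, the downstream objective of $f$ is cross-entropic. Suppose the features produced by $f$ are not relation-agnostic. Then by Definition 4 the global view $\mathbf z_g$ must lie in the neighbourhood of some local view $\mathbf z_l$, so the classifier effectively receives only $\{\mathbf z_g\}\cup\mathbb Z_{\mathbb L}\setminus\{\mathbf z_l\}$ rather than $\{\mathbf z_g\}\cup\mathbb Z_{\mathbb L}$. Because FGVC is $k$-distinguishable, losing the information of one local view makes the classifier misclassify, which contradicts Axiom 1. Hence $f$ learns only relation-agnostic representations.
The authors then show that the encoder $f$ used to learn the relation-agnostic representations $\mathbf z$ cannot also model the relationship $\mathbf r$ (assuming the local and global views are passed through the encoder separately to obtain their features), so the learner must use a separate sub-module to model $\mathbf r$.
Proof. To model cross-view relationships, $f$ would have to output the same $\mathbf r$ whether its input is the global view or a local view. By Lemma 3, however, the global-view and local-view features all differ, so $f$ cannot model the relationship.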

Learning Relation-Agnostic and Relation-Aware Representations

From the above, capturing the full label information $I(\mathbf x;\mathbf y)$ requires the model to account for both (1) the relation-agnostic information $\mathbf z$ and (2) the cross-view relational information $\mathbf r$. The model therefore has two main parts: (1) the relation-agnostic encoder $f$; (2) the cross-view relational function $\xi$.


Relation-Agnostic Representations. The authors first locate the region of interest by thresholding the final layer activations of a CNN encoder $f$ and detecting the largest connected component in the thresholded feature map. This gives the global view $\mathbf g$ and the local views $\{\mathbf l_1,\mathbf l_2,\ldots,\mathbf l_k\}$ (i.e., sub-crops of $\mathbf g$ ). (For initial training stability, we consider five disjoint locations (four corners and the centre) of $\mathbf g$ to be the set of local views. As training progresses, we also allow the model to learn from an increased number of views obtained via random cropping. In the same way at inference time, the local views constitute a combination of the five disjoint crops along with some random crops.) Each view is then passed through the CNN encoder $f$ to obtain the relation-agnostic representations $\mathbf z_g,\mathbb Z_{\mathbb L}$.
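The view-extraction step can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the authors' code: the activation map is a small list of lists, connectivity is 4-neighbour, the threshold value is arbitrary, and each local crop is one third of the global box per side so that the five crops stay disjoint. The names `largest_component_box` and `five_local_boxes` are hypothetical.

```python
from collections import deque


def largest_component_box(act, thresh):
    """Bounding box (top, left, bottom, right), end-exclusive, of the largest
    4-connected region where act > thresh. Returns None if nothing passes."""
    h, w = len(act), len(act[0])
    seen = [[False] * w for _ in range(h)]
    best, best_size = None, 0
    for i in range(h):
        for j in range(w):
            if seen[i][j] or act[i][j] <= thresh:
                continue
            # Breadth-first flood fill of one connected component.
            queue, cells = deque([(i, j)]), []
            seen[i][j] = True
            while queue:
                r, c = queue.popleft()
                cells.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and not seen[nr][nc] and act[nr][nc] > thresh):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(cells) > best_size:
                best_size = len(cells)
                rows = [r for r, _ in cells]
                cols = [c for _, c in cells]
                best = (min(rows), min(cols), max(rows) + 1, max(cols) + 1)
    return best


def five_local_boxes(box, parts=3):
    """Four corner crops plus a centre crop inside box.
    Each crop is 1/parts of the box per side (parts=3 is an assumption)."""
    top, left, bottom, right = box
    ch = max(1, (bottom - top) // parts)
    cw = max(1, (right - left) // parts)
    cy = top + (bottom - top - ch) // 2
    cx = left + (right - left - cw) // 2
    return [
        (top, left, top + ch, left + cw),          # top-left
        (top, right - cw, top + ch, right),        # top-right
        (bottom - ch, left, bottom, left + cw),    # bottom-left
        (bottom - ch, right - cw, bottom, right),  # bottom-right
        (cy, cx, cy + ch, cx + cw),                # centre
    ]


# Toy 8x8 activation map: one 6x6 active block plus an isolated active pixel.
act = [[0.0] * 8 for _ in range(8)]
for r in range(1, 7):
    for c in range(2, 8):
        act[r][c] = 1.0
act[0][0] = 1.0

global_box = largest_component_box(act, thresh=0.5)
local_boxes = five_local_boxes(global_box)
print(global_box, local_boxes)
```

In this toy case the isolated pixel is ignored because its component is smaller, and the five boxes tile the corners and centre of the global box without overlapping. The additional random crops described above would be appended to `local_boxes`.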
Relational Embeddings. The authors argue that the relationship modelling function $\xi:(\mathbf g,\mathbb L)\rightarrow \mathbf r$ should satisfy (1) View-Unification: Maps the set of all views $\{\mathbf g,\mathbb L\}$ of an image $\mathbf x$ to a single output $\mathbf r$ ; (2) Permutation Invariance (robust to changes in pose and relative orientation of local object parts): Produces the same output irrespective of the order of the local views. For Permutation Invariance, the authors use an Attribute Summarization Transformer (AST) whose input is $\mathbf Z'_{\mathbb L}=[\mathbf z_{\mathbb L},\mathbf z_{\mathbf l_1},\ldots,\mathbf z_{\mathbf l_k}]$ (no position embedding is added), where $\mathbf z_{\mathbb L}$ is an added summary embedding. The model consists of several Transformer layers, and the output at $\mathbf z_{\mathbb L}$ is the summarised local feature. For View-Unification, the authors use an MLP $\rho:(\mathbf z_g,\mathbf z_{\mathbb L})\rightarrow\mathbf r$
- Motivation: one can view the local-to-global relationship modelling function as an enumerative search algorithm - given a set of local views, it first enumerates all possible ways in which they can combine to form a meaningful global view (i.e., AST). Given that enumeration, it then finds the target solution by learning to identify the correct combination that matches with the global-view representation (i.e., $\rho$ ). Thus, the enumerate operation needs to be permutation invariant, as it has to consider all possible combinations of the inputs, and the find operation needs to be a view-unifier by construction. Behind our specific design choice was the motivation to keep the enumerate and find steps separate. This allows the model to have dedicated representation spaces for the two distinct subtasks, which in turn facilitates better convergence. Our AST thus produces the candidate enumerations of local-view aggregations, and the view-unification MLP ( $\rho$ ) finds the correct aggregation that matches with the global view.
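To make the two required properties concrete, here is a heavily simplified sketch. It is not the AST from the paper: it uses a single softmax-attention read-out with identity projections in place of several Transformer layers, and a tiny two-layer MLP with random fixed weights in place of $\rho$. All names (`attention_summary`, `rho`) and sizes are hypothetical. The point is only that, without a position embedding, the summary output does not depend on the order of the local views.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_summary(summary, local_views):
    """One attention read-out: the summary token attends over itself and all
    local embeddings. No positional embedding is added to any token."""
    tokens = [summary] + list(local_views)
    scale = math.sqrt(len(summary))
    scores = [dot(summary, t) / scale for t in tokens]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(summary)
    return [sum(w * t[d] for w, t in zip(weights, tokens)) / total
            for d in range(dim)]


def rho(z_g, z_L, w1, w2):
    """Toy view-unification MLP: concat(z_g, z_L) -> ReLU hidden -> r."""
    x = list(z_g) + list(z_L)
    hidden = [max(0.0, dot(row, x)) for row in w1]
    return [dot(row, hidden) for row in w2]


rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]  # 6 -> 4
w2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 4 -> 2

z_g = [0.2, -0.1, 0.4]                      # global-view embedding (toy)
local = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 0.3]]
summary_token = [0.1, 0.1, 0.1]             # stands in for the learnable z_L

z_L = attention_summary(summary_token, local)
z_L_perm = attention_summary(summary_token, local[::-1])  # reordered views
r = rho(z_g, z_L, w1, w2)
print(max(abs(a - b) for a, b in zip(z_L, z_L_perm)) < 1e-9)
```

Reordering the local views only reorders the terms of the attention sum, so `z_L` is unchanged up to floating-point rounding, and `rho` then maps the pair $(\mathbf z_g,\mathbf z_{\mathbb L})$ to a single vector, which is the view-unification property.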
Learning Relational Proxies. For FGVC, the global/local views and the cross-view relationships are all important for separating categories. The authors therefore treat the global ( $\mathbf z_g$ ), summary of local ( $\mathbf z_{\mathbb L}$ ) and relational ( $\mathbf r$ ) representations as learnable metric space embeddings, set up $c$ proxies, and perform metric learning with the proxy-anchor loss.
Here $\omega=\{\mathbf z_g,\mathbf z_{\mathbb L},\mathbf r\}$ , $\Omega$ is the set of positive samples of proxy $\mathbf p$ in the mini-batch, $\bar\Omega$ is the set of negative samples, and $s$ is the cosine similarity.
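For reference, a minimal sketch of a proxy-anchor style loss over one kind of embedding is given below. It assumes the commonly used Proxy-Anchor form with a scale `alpha` and a margin `delta`; those two hyperparameters and their default values are assumptions, as is the function name `proxy_anchor_loss`, and the exact way the paper combines the three embeddings in $\omega$ is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity s(u, v)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Proxy-anchor style loss for one embedding type (z_g, z_L or r).
    proxies[c] is the proxy of class c; alpha and delta are assumed values."""
    pos_term, neg_term, n_with_pos = 0.0, 0.0, 0
    for c, p in enumerate(proxies):
        pos = [e for e, y in zip(embeddings, labels) if y == c]  # Omega
        neg = [e for e, y in zip(embeddings, labels) if y != c]  # Omega-bar
        if pos:
            # Pull positives towards their proxy.
            n_with_pos += 1
            pos_term += math.log1p(
                sum(math.exp(-alpha * (cosine(e, p) - delta)) for e in pos))
        # Push negatives away from the proxy.
        neg_term += math.log1p(
            sum(math.exp(alpha * (cosine(e, p) + delta)) for e in neg))
    return pos_term / max(1, n_with_pos) + neg_term / len(proxies)


proxies = [[1.0, 0.0], [0.0, 1.0]]          # one proxy per class (toy, c = 2)
batch = [[0.9, 0.1], [0.1, 0.9]]
loss_good = proxy_anchor_loss(batch, [0, 1], proxies)  # aligned with proxies
loss_bad = proxy_anchor_loss(batch, [1, 0], proxies)   # labels swapped
print(loss_good < loss_bad)
```

Embeddings that sit near the proxy of their own class give a much smaller loss than the same embeddings with swapped labels, which is the behaviour the metric-learning objective relies on.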
Inference.


Experiments

Comparison with State of the Art. The authors use a ResNet50 pretrained on ImageNet as the backbone of the relation-agnostic encoder $f$
Ablation Studies
(1) Key components of the sufficient learner. The Row 1 model is the relation-agnostic encoder plus a simple classification head. The Row 2 model replaces RelationNet (i.e., the view-unification MLP $\rho$ ) with a fixed relationship function, namely the distance between the local and global representations, and then minimises the Huber loss between the relational distance values of different samples from the same class. The Row 3 model computes a pairwise contrastive loss on the relational vectors. The models in Rows 4 - 7 switch from the pair-based loss to a proxy-based loss; the variants without AST/RelationNet replace the missing module with a linear layer whose input is the concatenation of all input vectors.
(2) Conditioning of the relational proxies. The relational proxies are built from three parts: the summary of the local attributes $\mathbf z_{\mathbb L}$ , the representation of the global view $\mathbf z_g$ , and the relational vector $\mathbf r$ . The authors ablate each of the three.
(3) Results on ImageNet subsets
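(4) The optimal value of $k$ for the $k$ -distinguishability criterion. For $\mathcal P_{\text{FGVC}}$ , model performance depends on the number of local views $|\mathbb L|$ given to the model: when $|\mathbb L|$ falls below the minimum $k$ that FGVC requires, classification performance drops sharply. With the local-crop size fixed, the authors validate the idea of $k$ -distinguishability experimentally on 3 datasets. Performance peaks when $|\mathbb L|$ reaches $k = 7$ or $8$; as $|\mathbb L|$ grows further, the added information is mostly redundant or irrelevant and performance does not improve noticeably.
(5) Correlation between $|\mathbb L|$ and local patch size. The authors study how the number and the size of local views affect performance. In the results table, $N / t$ is the side length of a local view, $N$ is the side length of the image, and each entry is the change in performance relative to the reported accuracy for FGVC Aircraft (95.25%).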
(6) Permutation invariance of AST
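Visual Representations of Cross-View Local Relationships. The authors pick a subset of the local views in an image and visualise the Cross-View Local Relationships as a graph: each node is a local view, and the thickness of an edge corresponds to the mutual attention score between two local views in the last layer of AST. Only edges whose attention score exceeds the mean are drawn. Visually similar images can be told apart through the relationships among their local views (When two categories share a large number of local attributes, this cross-view relational information becomes the only discriminator.). The authors also point out that some images cannot be distinguished even with the relationships among local views.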
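The edge-selection rule for this graph can be sketched as below. The details are assumptions made for illustration: the mutual score of a pair is taken as the average of the two directed attention values, and the mean is computed over all distinct pairs of local views. The function name `relation_edges` and the attention matrix are hypothetical.

```python
def relation_edges(attn):
    """Return (i, j, weight) for pairs of local views whose symmetrised
    attention exceeds the mean over all distinct pairs."""
    n = len(attn)
    pairs = {(i, j): 0.5 * (attn[i][j] + attn[j][i])
             for i in range(n) for j in range(i + 1, n)}
    mean = sum(pairs.values()) / len(pairs)
    return [(i, j, w) for (i, j), w in pairs.items() if w > mean]


# Toy last-layer attention among three local views (rows sum to 1).
attn = [
    [0.6, 0.3, 0.1],
    [0.1, 0.5, 0.4],
    [0.1, 0.5, 0.4],
]
edges = relation_edges(attn)
print([(i, j) for i, j, _ in edges])
```

Here only the pair of views 1 and 2 is kept, since its mutual score is the only one above the average. The returned weight can then be used as the edge thickness when drawing.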
Cross-view relationships for intra-class variations. FGVC typically has to cope with large intra-class variation, and taking fine-grained geometric relationships into account helps to mitigate this problem.
Importance of Relational Information
Relation-Agnosticity of Relational Proxies. The authors verify Lemma 3 experimentally, i.e., $f$ will produce relation-agnostic representations if the downstream objective is cross-entropic in nature. Note that some global embeddings coincide with local embeddings. This happens because the local views of those images already provide enough information, so the global view becomes redundant and merges with a local view (this does not violate $k$ -distinguishability).