
Identifiable Contrastive Learning with Automatic Feature Importance Discovery

Qi Zhang1∗   Yifei Wang2   Yisen Wang1,3
1 National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University
2 School of Mathematical Sciences, Peking University
3 Institute for Artificial Intelligence, Peking University
∗ Equal Contribution. Corresponding Author: Yisen Wang (yisen.wang@pku.edu.cn).
Abstract

arXiv: https://arxiv.org/abs/2310.18904

Existing contrastive learning methods rely on a pairwise sample contrast $z_x^\top z_{x'}$ to learn data representations, but the learned features often lack clear interpretability from a human perspective. Theoretically, they lack feature identifiability, and different initializations may lead to totally different features. In this paper, we study a new method named tri-factor contrastive learning (triCL) that involves a 3-factor contrast in the form of $z_x^\top S z_{x'}$, where $S = \mathrm{diag}(s_1, \dots, s_k)$ is a learnable diagonal matrix that automatically captures the importance of each feature. We show that with this simple extension, triCL can not only obtain identifiable features that eliminate randomness, but also obtain more interpretable features that are ordered according to the importance matrix $S$. We show that features with high importance have good interpretability by capturing common classwise features, and obtain superior performance when evaluated for image retrieval using a few features. The proposed triCL objective is general and can be applied to different contrastive learning methods such as SimCLR and CLIP. We believe it is a better alternative to existing 2-factor contrastive learning, improving identifiability and interpretability with minimal overhead. Code is available at GitHub (PKU-ML/Tri-factor-Contrastive-Learning).

1 Introduction

Figure 1: Visualization of the samples on ImageNet-100 that have the largest values in 10 selected dimensions of representations learned by tri-factor contrastive learning (each row represents a dimension). (a) The 10 most important dimensions; (b) the 10 least important dimensions.

As a representative self-supervised paradigm, contrastive learning obtains meaningful representations and achieves state-of-the-art performance in various tasks by maximizing the feature similarity $z_x^\top z_{x^+}$ between samples augmented from the same images while minimizing the similarity $z_x^\top z_{x^-}$ between independent samples [4, 20, 15, 5, 18]. Besides the empirical success, recent works also discuss the theoretical properties and the generalization performance of contrastive learning [32, 30, 18].

However, many properties of contrastive learning are still not guaranteed. In this paper, we focus on a significant one: feature identifiability. Feature identifiability in representation learning refers to the property that there exists a single, globally optimal solution to the learning objective. Consequently, the learned representations are reproducible regardless of the initialization and the optimization procedure. As a well-studied topic, identifiability is a desirable property for various tasks, including but not limited to transfer learning [10], fair classification [26], and causal inference [24]. Previous works show that contrastive learning obtains linear feature identifiability while lacking exact feature identifiability, i.e., the optimal solutions retain a freedom of linear transformations [29]. As a result, different features are coupled, which hurts the interpretability and performance of the learned representations.

In this paper, we propose a new contrastive learning model: tri-factor contrastive learning (triCL), which introduces a 3-factor contrastive loss, i.e., we replace $z_x^\top z_{x'}$ with $z_x^\top S z_{x'}$ when calculating the similarity between two features, where $S$ is a learnable diagonal matrix called the importance matrix. We theoretically prove that triCL absorbs the freedom of linear transformations and enables exact feature identifiability. Besides, we observe that triCL exhibits other satisfying properties. For example, the generalization performance of triCL is theoretically guaranteed. Moreover, we find that the diagonal values of the importance matrix $S$ in triCL indicate the degrees of feature importance. In Figure 1, we visualize the samples that have the largest values in the most and least important dimensions ordered by the importance matrix. We find that the samples activated in the more important dimensions are more semantically similar, which verifies that the order of feature importance in triCL is quite close to the ground truth. Theoretically, we prove that the dimensions related to larger values in the importance matrix contribute more to decreasing the triCL loss. With the automatic discovery of feature importance in triCL, downstream tasks conducted on the representations can be accelerated by selecting the important features and discarding the meaningless ones.

As triCL is a quite simple and general extension, we apply it to different contrastive learning frameworks, such as SimCLR [4] and CLIP [28]. Empirically, we first verify the identifiability of triCL and further evaluate its performance on real-world datasets including CIFAR-10, CIFAR-100, and ImageNet-100. In particular, with the automatic discovery of important features, triCL demonstrates significantly better performance on downstream tasks when using only a few feature dimensions. We summarize our contributions as follows:

• We propose tri-factor contrastive learning (triCL), the first contrastive learning algorithm that enables exact feature identifiability. Additionally, we extend triCL to different contrastive learning methods, such as spectral contrastive learning, SimCLR, and CLIP.

• Besides feature identifiability, we analyze several theoretical properties of triCL. Specifically, we establish a generalization guarantee for triCL and provide theoretical evidence that triCL can automatically discover feature importance.

• Empirically, we verify that triCL enables exact feature identifiability and can discover feature importance on synthetic and real-world datasets. Moreover, we investigate whether triCL obtains superior performance in downstream tasks with selected representations.

2 Related Work

Self-supervised Learning. Recently, to get rid of the expensive cost of labeled data, self-supervised learning has risen to be a promising paradigm for learning meaningful representations by designing various pretraining tasks, including context-based tasks [14], contrastive learning [20], and masked image modeling [21]. Among them, contrastive learning is a popular algorithm that achieves impressive success and largely closes the gap between self-supervised learning and supervised learning [27, 4, 20, 33]. The idea of contrastive learning is quite simple, i.e., pulling semantically similar samples (positive samples) together while pushing dissimilar samples (negative samples) away in the feature space. To better achieve this objective, recent works propose different variants of contrastive learning, such as introducing different training objectives [36, 18, 35], different structures [4, 6, 16, 39], different sampling processes [11, 25, 38, 7], and different empirical tricks [20, 3].

Theoretical Understandings of Contrastive Learning. Despite the empirical success of contrastive learning, the theoretical understanding of it is still limited. [30] establish the first theoretical guarantee on the downstream classification performance of contrastive learning by connecting the InfoNCE loss and the cross-entropy loss. As some impractical assumptions exist in the theoretical analysis of [30], recent works improve the theoretical framework and propose new bounds [1, 2, 34]. Moreover, [18] analyze the downstream performance of contrastive learning from a graph perspective and construct the connection between contrastive learning and spectral decomposition. Recently, researchers have taken the inductive bias of contrastive learning into the theoretical framework and shown the influence of different network architectures [17, 31]. Beyond the downstream classification performance of contrastive learning, some works focus on other properties of the representations learned by contrastive learning. [23] discuss feature diversity by analyzing the dimensional collapse in contrastive learning. [29] prove that contrastive models are identifiable up to linear transformations under certain assumptions.

3 Preliminaries

Contrastive Pretraining Process. We begin by introducing the basic notations of contrastive learning. The set of all natural data is denoted as $\mathcal{D}_u = \{\bar{x}_i\}_{i=1}^{N_u}$ with distribution $\mathcal{P}_u$, and each natural datum $\bar{x} \in \mathcal{D}_u$ has a ground-truth label $y(\bar{x})$. An augmented sample $x$ is generated by transforming a natural sample $\bar{x}$ with augmentations distributed as $\mathcal{A}(\cdot \mid \bar{x})$. The set of all augmented samples is denoted as $\mathcal{D} = \{x_i\}_{i=1}^{N}$. We assume both the set of natural samples and the set of augmented samples to be finite but exponentially large to avoid non-essential nuances in the theoretical analysis; the analysis can be easily extended to the case where they are infinite [18]. During the pretraining process, we first draw a natural sample $\bar{x} \sim \mathcal{P}_u$ and independently generate two augmented samples $x \sim \mathcal{A}(\cdot \mid \bar{x})$, $x^+ \sim \mathcal{A}(\cdot \mid \bar{x})$ to construct a positive pair $(x, x^+)$. For the negative samples, we independently draw another natural sample $\bar{x}^- \sim \mathcal{P}_u$ and generate $x^- \sim \mathcal{A}(\cdot \mid \bar{x}^-)$. With positive and negative pairs, we learn the encoder $f: \mathbb{R}^d \to \mathbb{R}^k$ with the contrastive loss. For the ease of our analysis, we take the spectral contrastive loss [18] as an example:

$$\mathcal{L}_{\mathrm{SCL}}(f) = -2\,\mathbb{E}_{x,x^+}\, f(x)^\top f(x^+) + \mathbb{E}_{x}\,\mathbb{E}_{x^-}\, \big(f(x)^\top f(x^-)\big)^2. \tag{1}$$

We denote $z_x = f(x)$ as the features encoded by the encoder. By optimizing the spectral loss, the features of positive pairs ($z_x^\top z_{x^+}$) are pulled together while those of negative pairs ($z_x^\top z_{x^-}$) are pushed apart.
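To make Eq. 1 concrete, below is a minimal PyTorch sketch of the spectral contrastive loss under the common in-batch approximation, where the two expectations are estimated with aligned positive pairs and cross-sample negatives from the same mini-batch; the function name and this batching scheme are our illustration, not the paper's reference implementation.

```python
import torch

def spectral_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Sketch of the spectral contrastive loss (Eq. 1).

    z1, z2: (n, k) features of two augmented views; row i of z1 and z2
    form a positive pair, while cross-sample pairs act as negatives.
    """
    n = z1.shape[0]
    pos = -2 * (z1 * z2).sum(dim=1).mean()              # -2 E[f(x)^T f(x+)]
    logits = z1 @ z2.T                                  # all pairwise similarities
    mask = ~torch.eye(n, dtype=torch.bool, device=z1.device)
    neg = (logits[mask] ** 2).mean()                    # E[(f(x)^T f(x-))^2]
    return pos + neg
```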

Augmentation Graph. A useful theoretical framework for describing the properties of contrastive learning is to model the learning process from the augmentation graph perspective [18]. The augmentation graph is defined over the set of augmented samples $\mathcal{D}$, with its adjacency matrix denoted by $A$. In the augmentation graph, each node corresponds to an augmented sample, and the weight of the edge connecting two nodes $x$ and $x^+$ is equal to the probability that they are selected as a positive pair, i.e., $A_{x x^+} = \mathbb{E}_{\bar{x} \sim \mathcal{P}_u} [\mathcal{A}(x \mid \bar{x})\, \mathcal{A}(x^+ \mid \bar{x})]$. We denote by $\bar{A}$ the normalized adjacency matrix of the augmentation graph, i.e., $\bar{A} = D^{-1/2} A D^{-1/2}$, where $D$ is a diagonal matrix with $D_{xx} = \sum_{x' \in \mathcal{D}} A_{x x'}$. To analyze the properties of $\bar{A}$, we let $\bar{A} = U \Sigma V^\top$ be the singular value decomposition (SVD) of the normalized adjacency matrix $\bar{A}$, where $U \in \mathbb{R}^{N \times N}$, $V \in \mathbb{R}^{N \times N}$ are unitary matrices and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_N)$ contains descending singular values $\sigma_1 \ge \dots \ge \sigma_N \ge 0$.
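The normalization and spectral decomposition above are easy to state in code. The following sketch (names ours) builds $\bar{A} = D^{-1/2} A D^{-1/2}$ from a small dense adjacency matrix and returns its singular values, which is how we reference $U$ and $\sigma_i$ in later toy checks.

```python
import torch

def normalized_adjacency(A: torch.Tensor):
    """Compute A_bar = D^{-1/2} A D^{-1/2} and its SVD for a dense
    augmentation-graph adjacency matrix A (toy-scale illustration)."""
    d = A.sum(dim=1)                                  # D_xx = sum_{x'} A_{x x'}
    D_inv_sqrt = torch.diag(d.rsqrt())
    A_bar = D_inv_sqrt @ A @ D_inv_sqrt
    U, sigma, Vh = torch.linalg.svd(A_bar)            # sigma is descending
    return A_bar, U, sigma
```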

4 Exact Feature Identifiability with Tri-factor Contrastive Learning

In this section, we propose a new representation learning paradigm called tri-factor contrastive learning (triCL) that enables exact feature identifiability in contrastive learning. In Section 4.1, we prove that contrastive learning only obtains linear identifiability, i.e., the freedom in the optimal solutions is a linear transformation. In Section 4.2, we introduce the learning process of triCL and theoretically verify that it enables exact feature identifiability.

4.1 Feature Identifiability of Contrastive Learning

When using a pretrained encoder for downstream tasks, it is useful if the learned features are reproducible, in the sense that when the neural network learns the representation function on the same data distribution multiple times, the resulting features should be approximately the same. For example, reproducibility can enhance the interpretability and the robustness of learned representations [22]. One rigorous way to ensure reproducibility is to select a model whose representation function is identifiable in function space [29]. To explore the feature identifiability of contrastive learning, we first characterize its general solution.

Lemma 4.1 ([18]). Let $\bar{A} = U \Sigma V^\top$ be the SVD of the normalized adjacency matrix $\bar{A}$. Assume the neural networks are expressive enough to represent any features. The spectral contrastive loss (Eq. 1) attains its optimum when, for all $x \in \mathcal{D}$,

$$f^\star(x) = \frac{1}{\sqrt{D_{xx}}} \big( U_x^k \,\mathrm{diag}(\sqrt{\sigma_1}, \dots, \sqrt{\sigma_k})\, R \big)^\top, \tag{2}$$

where $U_x$ takes the $x$-th row of $U$, $U^k$ denotes the submatrix containing the first $k$ columns of $U$, and $R \in \mathbb{R}^{k \times k}$ is an arbitrary unitary matrix.

From Lemma 4.1, we know that the optimal representations are not unique, due to the freedom of linear transformations. This is also regarded as a relaxed notion of feature identifiability, named linear feature identifiability $\sim_L$ and defined below [29].

Definition 4.2 (Linear feature identifiability). Let $\sim_L$ be a pairwise relation in the encoder function space $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}^k\}$ defined as:

$$f' \sim_L f^\star \iff f'(x) = A f^\star(x), \quad \forall x \in \mathcal{X}, \tag{3}$$

where $A$ is an invertible $k \times k$ matrix.

It is apparent that the optimal encoder $f$ (Eq. 2) obtained from the spectral contrastive loss (Eq. 1) is linearly identifiable. Nevertheless, there are still some ambiguities w.r.t. linear transformations in the model. Although the freedom of linear transformations can be absorbed in the linear probing task [18], these representations may show varied results in many other downstream tasks, e.g., in the k-NN evaluation process. We therefore ask whether exact feature identifiability can be achieved. We first define two stricter notions of feature identifiability below.

Definition 4.3 (Sign feature identifiability). Let $\sim_S$ be a pairwise relation in the encoder function space $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}^k\}$ defined as:

$$f' \sim_S f^\star \iff f'_j(x) = \pm f^\star_j(x), \quad \forall x \in \mathcal{X},\; j \in [k], \tag{4}$$

where $f_j(x)$ is the $j$-th dimension of $f(x)$.
Definition 4.4 (Exact feature identifiability). Let $\sim$ be a pairwise relation in the encoder function space $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}^k\}$ defined as:

$$f' \sim f^\star \iff f'(x) = f^\star(x), \quad \forall x \in \mathcal{X}. \tag{5}$$

4.2 Tri-factor Contrastive Learning with Exact Feature Identifiability

Motivated by the trifactorization technique in matrix decomposition problems [9] and the equivalence between the spectral contrastive loss and the matrix decomposition objective [18], we consider adding a learnable diagonal matrix when calculating the feature similarity in the contrastive loss to absorb the freedom of linear transformations. To be specific, we introduce a contrastive learning model that enables exact feature identifiability, named tri-factor contrastive learning (triCL), which adopts a tri-term contrastive loss:

$$\mathcal{L}_{\mathrm{tri}}(f, S) = -2\,\mathbb{E}_{x,x^+}\, f(x)^\top S f(x^+) + \mathbb{E}_{x}\,\mathbb{E}_{x^-}\, \big(f(x)^\top S f(x^-)\big)^2, \tag{6}$$

where we name $S = \mathrm{diag}(s_1, \dots, s_k)$ the importance matrix; it is a diagonal matrix with $k$ non-negative learnable parameters satisfying $s_1 \ge \dots \ge s_k \ge 0$.¹

¹ In practice, we enforce the non-negativity condition by applying the softplus activation function to the diagonal values of $S$. We only enforce the monotonicity at the end of training, by sorting the rows of $S$ and the corresponding dimensions of $f(x)$ in descending order of the diagonal values of $S$.

Additionally, the encoder $f$ is constrained to be decorrelated, i.e.,

$$\mathbb{E}_x\, f_i(x)^\top f_j(x) = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases}, \qquad i, j \in [k], \tag{7}$$

for an encoder $f: \mathbb{R}^d \to \mathbb{R}^k$. One way to ensure feature decorrelation is the following penalty loss,

$$\mathcal{L}_{\mathrm{dec}}(f) = \big\| \mathbb{E}_x\, f(x) f(x)^\top - I \big\|^2, \tag{8}$$

leading to a combined triCL objective,

$$\mathcal{L}_{\mathrm{triCL}}(f, S) = \mathcal{L}_{\mathrm{tri}}(f, S) + \mathcal{L}_{\mathrm{dec}}(f). \tag{9}$$

Similar feature decorrelation objectives have been proposed in non-contrastive visual learning methods with slightly different forms, e.g., Barlow Twins [36].
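Putting Eqs. 6-9 together, the following is a minimal PyTorch sketch of the combined triCL objective under the same in-batch approximation as before; the function name, the batch-based estimate of $\mathbb{E}_x f(x) f(x)^\top$, and the explicit weight on the decorrelation term are our illustrative choices.

```python
import torch
import torch.nn.functional as F

def tricl_loss(z1: torch.Tensor, z2: torch.Tensor, s_raw: torch.Tensor,
               dec_weight: float = 1.0) -> torch.Tensor:
    """Sketch of the triCL objective (Eq. 9) = tri-factor loss + decorrelation.

    z1, z2: (n, k) features of two augmented views (aligned positive pairs).
    s_raw:  (k,) learnable parameters; softplus keeps the diagonal of the
            importance matrix S non-negative, as in the paper's footnote.
    """
    n, k = z1.shape
    s = F.softplus(s_raw)                               # diagonal of S

    # Tri-factor spectral loss (Eq. 6) with in-batch negatives.
    pos = -2 * (z1 * s * z2).sum(dim=1).mean()          # -2 E[f(x)^T S f(x+)]
    logits = (z1 * s) @ z2.T                            # f(x)^T S f(x-)
    mask = ~torch.eye(n, dtype=torch.bool, device=z1.device)
    neg = (logits[mask] ** 2).mean()                    # E[(f(x)^T S f(x-))^2]

    # Decorrelation penalty (Eq. 8): E[f(x) f(x)^T] should equal I.
    z = torch.cat([z1, z2], dim=0)
    cov = z.T @ z / z.shape[0]
    dec = ((cov - torch.eye(k, device=z.device)) ** 2).sum()

    return pos + neg + dec_weight * dec                 # Eq. 9
```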

Since triCL automatically learns the feature importance $S$ during training, it admits a straightforward feature selection approach. Specifically, if we need to select $m$ out of $k$ feature dimensions for downstream tasks (e.g., in-time image retrieval), we can sort the feature dimensions according to their importance $s_i$'s (after training) and simply use the top $m$ features as the most important ones. Without loss of generality, we assume $s_1 \ge \dots \ge s_k$, and the top $m$ features are denoted as $f^{(m)}$.
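This selection step is a one-liner once training is done; the sketch below (helper name ours) sorts dimensions by the learned importance values and keeps the top $m$, yielding $f^{(m)}$.

```python
import torch
import torch.nn.functional as F

def select_top_features(z: torch.Tensor, s_raw: torch.Tensor, m: int) -> torch.Tensor:
    """Keep the m most important feature dimensions, ordered by S."""
    s = F.softplus(s_raw)                           # non-negative importances
    order = torch.argsort(s, descending=True)       # sort dimensions by importance
    return z[:, order[:m]]                          # f^(m): top-m features
```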

Identifiability of TriCL. In the following theorem, we show that by incorporating the diagonal importance matrix $S$, which regularizes features along each dimension, triCL can resolve the linear ambiguity of contrastive learning and become sign-identifiable.

Theorem 4.5. Assume the normalized adjacency matrix $\bar{A}$ has distinct largest $k$ singular values ($\sigma_i \neq \sigma_j$ for all $i, j \in [k]$ with $i \neq j$) and the neural networks are expressive enough. Then tri-factor contrastive learning (triCL, Eq. 9) attains its optimum when, for all $x \in \mathcal{D}$ and $j \in [k]$,

$$f^\star_j(x) = \pm \frac{1}{\sqrt{D_{xx}}} (U_x^k)_j, \qquad S^* = \mathrm{diag}(\sigma_1, \dots, \sigma_k), \tag{10}$$

which states that tri-factor contrastive learning enables sign feature identifiability.
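Under the expressivity assumption, Theorem 4.5 gives a closed form that can be checked numerically on a toy graph; the sketch below (names ours) reuses normalized_adjacency from the Section 3 sketch and returns the optimum of Eq. 10 up to per-dimension signs.

```python
import torch

def tricl_closed_form(A: torch.Tensor, k: int):
    """Closed-form triCL optimum (Eq. 10) for a toy dense adjacency matrix A:
    f_j*(x) = ±(U_x^k)_j / sqrt(D_xx) and S* = diag(sigma_1, ..., sigma_k)."""
    d = A.sum(dim=1)                                  # D_xx
    A_bar, U, sigma = normalized_adjacency(A)         # from the Section 3 sketch
    f_star = U[:, :k] / d.sqrt().unsqueeze(1)         # rows are f*(x), up to sign
    S_star = torch.diag(sigma[:k])                    # optimal importance matrix
    return f_star, S_star
```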

As shown in Theorem 4.5, the only difference remaining in the solutions (Eq. 10) is the sign. To remove this ambiguity, for each dimension $j$, we randomly select a natural sample $\bar{x} \sim \mathcal{P}_u$, encode it with the optimal solution $f^\star$ of triCL, and observe the sign of its $j$-th dimension $f^\star_j(\bar{x})$. If $f^\star_j(\bar{x}) = 0$, we draw another sample and repeat the process until we obtain a non-zero feature $f^\star_j(\bar{x})$. We then store the sample as an original point $x_{0j}$ and adjust the sign of the learned representations as follows:

$$\bar{f}^\star_j(x) = (-1)^{\mathbb{1}(f^\star_j(x_{0j}) > 0)} \cdot f^\star_j(x), \qquad \forall x \in \mathcal{D},\; j \in [k]. \tag{11}$$

By removing the freedom of sign, the solution becomes unique and triCL enables exact feature identifiability:

Corollary 4.6. Set $\bar{f}^\star$ as the final learned encoder of tri-factor contrastive learning; then triCL obtains exact feature identifiability.
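A direct implementation of Eq. 11 is a per-dimension sign flip keyed on the stored anchor samples; in the sketch below (names ours), z_anchor holds $f^\star_j(x_{0j})$ for each dimension and is assumed non-zero, as guaranteed by the resampling procedure above.

```python
import torch

def fix_signs(z: torch.Tensor, z_anchor: torch.Tensor) -> torch.Tensor:
    """Remove the sign ambiguity of Eq. 10 via Eq. 11.

    z:        (n, k) learned features to adjust.
    z_anchor: (k,) anchor features f*_j(x_{0j}), one per dimension, non-zero.
    """
    # (-1)^{1[f*_j(x_{0j}) > 0]}: flip dimension j iff the anchor's j-th
    # feature is positive, exactly as written in Eq. 11.
    flip = torch.ones_like(z_anchor)
    flip[z_anchor > 0] = -1.0
    return z * flip
```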

5 Theoretical Properties of Tri-factor Contrastive Learning

Besides feature identifiability, we analyze other theoretical properties of triCL in this section. Specifically, in Section 5.1, we provide the generalization guarantee of triCL and present another advantage of triCL: it can automatically discover the importance of different features. In Section 5.2, we extend triCL to other contrastive learning frameworks.
