Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions (Challenge 1: Representation)

This post is the second in a series of reading notes on the multimodal machine learning survey below.

It covers Challenge 1 of multimodal learning: representation.

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PAUL PU LIANG, AMIR ZADEH, and LOUIS-PHILIPPE MORENCY,
Machine Learning Department and Language Technologies Institute, Carnegie Mellon University, USA

Paper link: https://arxiv.org/pdf/2209.03430.pdf

3 Challenge 1: REPRESENTATION

The first fundamental challenge is to learn representations that reflect cross-modal interactions between individual elements across different modalities. This challenge can be seen as learning a ‘local’ representation between elements, or a representation using holistic features. This section covers (1) representation fusion: integrating information from two or more modalities, effectively reducing the number of separate representations, (2) representation coordination: interchanging cross-modal information with the goal of keeping the same number of representations but improving multimodal contextualization, and (3) representation fission: creating a new decoupled set of representations, usually a larger number than the input set, that reflects knowledge about internal structure such as data clustering or factorization (Figure 5).


Fig. 5. Challenge 1 aims to learn representations that reflect cross-modal interactions between individual modality elements, through (1) fusion: integrating information to reduce the number of separate representations, (2) coordination: interchanging cross-modal information with the goal of keeping the same number of representations but improving multimodal contextualization, and (3) fission: creating a larger set of decoupled representations that reflects knowledge about internal structure.

3.1 Subchallenge 1a: Representation Fusion

Representation fusion aims to learn a joint representation that models cross-modal interactions between individual elements of different modalities, effectively reducing the number of separate representations. We categorize these approaches into fusion with abstract modalities and fusion with raw modalities (Figure 6). In fusion with abstract modalities, suitable unimodal encoders are first applied to capture a holistic representation of each element (or modality entirely), after which several building blocks for representation fusion are used to learn a joint representation. As a result, fusion happens at the abstract representation level. On the other hand, fusion with raw modalities entails representation fusion at very early stages with minimal preprocessing, perhaps even involving raw modalities themselves.


Fig. 6. We categorize representation fusion approaches into (1) fusion with abstract modalities, where unimodal encoders first capture a holistic representation of each element before fusion at relatively homogeneous representations, and (2) fusion with raw modalities which entails representation fusion at very early stages, perhaps directly involving heterogeneous raw modalities.
(Note: in fusion with abstract modalities, each modality's raw data is first passed through its own feature extractor to obtain a feature vector, and these vectors are then fused into a single representation.)

🚩 Fusion with abstract modalities
Fusion with abstract modalities: We begin our treatment of representation fusion of abstract representations with additive and multiplicative interactions. These operators can be seen as differentiable building blocks combining information from two streams of data that can be flexibly inserted into almost any unimodal machine learning pipeline. Given unimodal data or features $\mathbf{x}_1$ and $\mathbf{x}_2$, additive fusion can be seen as learning a new joint representation $\mathbf{z} = w_1 \mathbf{x}_1 + w_2 \mathbf{x}_2 + b + \epsilon$, where $w_1$ and $w_2$ are the weights learned for additive fusion of $\mathbf{x}_1$ and $\mathbf{x}_2$, $b$ the bias term, and $\epsilon$ the error term. If the joint representation $\mathbf{z}$ is directly taken as a prediction $\hat{y}$, then additive fusion resembles late or ensemble fusion $\hat{y} = f_1(\mathbf{x}_1) + f_2(\mathbf{x}_2)$ with unimodal predictors $f_1$ and $f_2$ [74]. Otherwise, the additive representation can also undergo subsequent unimodal or multimodal processing [23]. Multiplicative interactions extend additive interactions to include a cross term $w_3 (\mathbf{x}_1 \times \mathbf{x}_2)$. These models have been used extensively in statistics, where the cross term can be interpreted as a moderation effect of $\mathbf{x}_2$ affecting the linear relationship between $\mathbf{x}_1$ and $y$ [25]. Overall, purely additive interactions can be seen as a first-order polynomial between input modalities $\mathbf{x}_1$ and $\mathbf{x}_2$; combining additive and multiplicative interactions captures a second-order polynomial.

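To make these operators concrete, here is a minimal PyTorch sketch of additive fusion with a multiplicative cross term (the module and dimension names are my own; the paper does not prescribe an implementation):

```python
import torch
import torch.nn as nn

class AdditiveMultiplicativeFusion(nn.Module):
    """Joint representation z = W1 x1 + W2 x2 + W3 (x1 (x) x2) + b,
    i.e. a second-order polynomial of the two unimodal inputs."""
    def __init__(self, d1, d2, d_out):
        super().__init__()
        self.w1 = nn.Linear(d1, d_out, bias=False)        # additive term for x1
        self.w2 = nn.Linear(d2, d_out, bias=False)        # additive term for x2
        self.w3 = nn.Bilinear(d1, d2, d_out, bias=False)  # multiplicative cross term
        self.b = nn.Parameter(torch.zeros(d_out))         # bias

    def forward(self, x1, x2):
        return self.w1(x1) + self.w2(x2) + self.w3(x1, x2) + self.b

fuse = AdditiveMultiplicativeFusion(d1=16, d2=8, d_out=32)
z = fuse(torch.randn(4, 16), torch.randn(4, 8))  # (batch, 32)
```

Dropping `w3` recovers purely additive (first-order) fusion; taking `z` directly as the prediction recovers late/ensemble fusion.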

To further go beyond first- and second-order interactions, tensors are specifically designed to explicitly capture higher-order interactions across modalities [310]. Given unimodal data $\mathbf{x}_1$, $\mathbf{x}_2$, tensors are defined as $\mathbf{Z} = [\mathbf{x}_1; 1] \otimes [\mathbf{x}_2; 1]$, where $\otimes$ denotes an outer product and appending a constant $1$ retains the unimodal terms [28, 76]. Tensor products of higher order represent polynomial interactions of higher order between elements [98]. However, computing tensor products is expensive since their dimension scales exponentially with the number of modalities, so several efficient approximations based on low-rank decomposition have been proposed [98, 158]. Finally, Multiplicative Interactions (MI) generalize additive and multiplicative operators to include learnable parameters that capture second-order interactions [117]. In its most general form, MI defines a bilinear product $\mathbf{z} = \mathbf{x}_1^\top \mathbf{W} \mathbf{x}_2 + \mathbf{x}_1^\top \mathbf{U} + \mathbf{V} \mathbf{x}_2 + \mathbf{b}$, where $\mathbf{W}$, $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{b}$ are trainable parameters.

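A sketch of outer-product tensor fusion under the appended-constant convention above (the function name and shapes are illustrative):

```python
import torch

def tensor_fusion(x1, x2):
    """Batched outer-product fusion Z = [x1; 1] (x) [x2; 1].

    Appending a constant 1 keeps the unimodal (first-order) terms
    alongside the bimodal cross terms; the flattened tensor can be fed
    to a downstream head. Dimension grows multiplicatively per modality.
    """
    ones = torch.ones(x1.size(0), 1)
    x1 = torch.cat([x1, ones], dim=1)       # (B, d1 + 1)
    x2 = torch.cat([x2, ones], dim=1)       # (B, d2 + 1)
    Z = torch.einsum('bi,bj->bij', x1, x2)  # (B, d1 + 1, d2 + 1)
    return Z.flatten(start_dim=1)

z = tensor_fusion(torch.randn(4, 16), torch.randn(4, 8))  # shape (4, 17 * 9)
```

The multiplicative blow-up with each added modality is what motivates the low-rank approximations cited above.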

Multimodal gated units/attention units learn representations that dynamically change for every input [47, 284]. Their general form can be written as $\mathbf{z} = \mathbf{x}_1 \odot h(\mathbf{x}_2)$, where $h$ represents a function with sigmoid activation and $\odot$ denotes element-wise product. $h(\mathbf{x}_2)$ is commonly referred to as ‘attention weights’ learned from $\mathbf{x}_2$ to attend on $\mathbf{x}_1$. Recent work has explored more expressive forms of learning attention weights, such as Query-Key-Value mechanisms [261], fully-connected neural network layers [18, 47], or even hard gated units for sharper attention [55].

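A minimal sketch of a gated unit of the form $\mathbf{z} = \mathbf{x}_1 \odot h(\mathbf{x}_2)$ (the layer sizes are my own choices):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """z = x1 (.) sigmoid(h(x2)): x2 yields attention weights over x1."""
    def __init__(self, d1, d2):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(d2, d1), nn.Sigmoid())

    def forward(self, x1, x2):
        return x1 * self.h(x2)  # element-wise gating of x1 by weights from x2
```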

🚩 Fusion with raw modalities
Fusion with raw modalities entails representation fusion at very early stages, perhaps even involving raw modalities themselves. These approaches typically bear resemblance to early fusion [23], which performs concatenation of input data before applying a prediction model (i.e., $\mathbf{z} = [\mathbf{x}_1, \mathbf{x}_2]$). Fusing at the raw modality level is more challenging since raw modalities are likely to exhibit more dimensions of heterogeneity. Nevertheless, Barnum et al. [24] demonstrated robustness benefits of fusion at early stages, while Gadzicki et al. [77] also found that complex early fusion can outperform abstract fusion. To account for the greater heterogeneity during complex early fusion, many approaches rely on generic encoders that are applicable to both modalities, such as convolutional layers [24, 77] and Transformers [150, 153]. However, do these complex non-additive fusion models actually learn non-additive interactions between modality elements? Not necessarily, according to Hessel and Lee [94]. We cover these fundamental analysis questions and more in the quantification challenge (§8).

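A sketch of early fusion by concatenation, $\mathbf{z} = [\mathbf{x}_1, \mathbf{x}_2]$, followed by a generic prediction head (the MLP is an illustrative stand-in for the convolutional or Transformer encoders cited above):

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate minimally processed inputs, then apply one shared model."""
    def __init__(self, d1, d2, d_out, d_hidden=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d1 + d2, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x1, x2):
        z = torch.cat([x1, x2], dim=-1)  # z = [x1, x2]
        return self.head(z)
```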

3.2 Subchallenge 1b: Representation Coordination

Representation coordination aims to learn multimodal contextualized representations that are coordinated through their interconnections (Figure 7). In contrast to representation fusion, coordination keeps the same number of representations but improves multimodal contextualization. We start our discussion with strong coordination that enforces strong equivalence between modality elements, before moving on to partial coordination that captures more general connections such as correlation, order, hierarchies, or relationships beyond similarity.


Fig. 7. There is a spectrum of representation coordination functions: strong coordination aims to enforce strong equivalence in all dimensions, whereas in partial coordination only certain dimensions may be coordinated to capture more general connections such as correlation, order, hierarchies, or relationships.
🚩 Strong coordination
Strong coordination aims to bring semantically corresponding modalities close together in a coordinated space, thereby enforcing strong equivalence between modality elements. For example, these models would encourage the representation of the word ‘dog’ and an image of a dog to be close (i.e., semantically positive pairs), while pushing the representations of the word ‘dog’ and an image of a car far apart (i.e., semantically negative pairs) [75]. The coordination distance is typically cosine distance [174, 287] or a max-margin loss [102]. Recent work has explored large-scale representation coordination by scaling up contrastive learning of image and text pairs [206], and also found that contrastive learning provably captures redundant information across the two views [256, 258] (but not non-redundant information). In addition to contrastive learning, several approaches instead learn a coordinated space by mapping corresponding data from one modality to another [69]. For example, Socher et al. [236] map image embeddings into a word embedding space for zero-shot image classification. Similar ideas were used to learn coordinated representations between text, video, and audio [202], as well as between pretrained language models and image features [249].

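As an illustration, here is a sketch of a max-margin coordination loss over cosine similarities (the exact hinge formulation varies across the cited papers; this is one common variant):

```python
import torch
import torch.nn.functional as F

def max_margin_coordination(z_text, z_image, margin=0.2):
    """Matched (row i, row i) text-image pairs should out-score every
    mismatched pair in their row by at least `margin`."""
    z_text = F.normalize(z_text, dim=-1)
    z_image = F.normalize(z_image, dim=-1)
    sim = z_text @ z_image.T                   # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)              # positive-pair scores
    cost = (margin + sim - pos).clamp(min=0)   # hinge for each negative
    cost.fill_diagonal_(0.0)                   # do not penalize positives
    return cost.mean()
```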

🚩 Partial coordination
Partial coordination: Instead of strictly capturing equivalence via strong coordination, partial coordination instead captures more general modality connections such as correlation, order, hierarchies, or relationships. To achieve these goals, partially coordinated models enforce different types of constraints on the representation space beyond semantic similarity, and perhaps only on certain dimensions of the representation.


Canonical correlation analysis (CCA) computes a linear projection that maximizes the correlation between two random variables while enforcing the dimensions of the new representation to be orthogonal to each other [254]. CCA models have been used extensively for cross-modal retrieval [211], audio-visual signal analysis [221], and emotion recognition [186]. To increase the expressiveness of CCA, several nonlinear extensions have been proposed, including Kernel CCA [134], Deep CCA [16], and CCA Autoencoders [283].
Ordered and hierarchical spaces: Another example of representation coordination comes from order-embeddings of images and language [276], which aim to capture a partial order on the language and image embeddings so as to enforce a hierarchy in the coordinated space. A similar model was proposed by Young et al. [306], where denotation graphs are used to induce such a partial-ordering hierarchy.
Relationship coordination: In order to learn a coordinated space that captures semantic relationships between elements beyond correspondences, Zhang et al. [319] use structured representations of text and images to create multimodal concept taxonomies. Delaherche and Chetouani [61] learn coordinated representations capturing hierarchical relationships, while Alviar et al. [12] apply multiscale coordination of speech and music using partial correlation measures. Finally, Xu et al. [298] learn coordinated representations using a Cauchy loss to strengthen robustness to outliers.

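A toy example of linear CCA using scikit-learn (the data is synthetic and the shapes are illustrative):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(500, 20))
# Text features share a signal with the first 10 audio dimensions.
X_text = 0.5 * X_audio[:, :10] + rng.normal(size=(500, 10))

cca = CCA(n_components=2)
Zx, Zy = cca.fit_transform(X_audio, X_text)  # coordinated projections

# Each coordinated dimension is maximally correlated across modalities.
for k in range(2):
    print(f"dim {k}: corr = {np.corrcoef(Zx[:, k], Zy[:, k])[0, 1]:.3f}")
```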

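The order-violation penalty of [276] is compact enough to state directly; a minimal sketch, assuming the standard form $E(\mathbf{x}, \mathbf{y}) = \lVert \max(0, \mathbf{y} - \mathbf{x}) \rVert^2$, which is zero exactly when $\mathbf{y} \le \mathbf{x}$ coordinate-wise:

```python
import torch

def order_violation(x, y):
    """Penalty E(x, y) = ||max(0, y - x)||^2 for the pair (x, y).

    Zero iff y <= x in every coordinate, i.e. the pair respects the
    partial order; embeddings are usually kept in the non-negative
    orthant (e.g. with abs or ReLU) before applying this penalty.
    """
    return torch.clamp(y - x, min=0).pow(2).sum(dim=-1)
```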


3.3 Subchallenge 1c: Representation Fission

Finally, representation fission aims to create a new decoupled set of representations (usually a larger number than the input representation set) that reflects knowledge about internal multimodal structure such as data clustering, independent factors of variation, or modality-specific information. In comparison with joint and coordinated representations, representation fission enables careful interpretation and fine-grained controllability. Depending on the granularity of decoupled factors, methods can be categorized into modality-level and fine-grained fission (Figure 8).


Fig. 8. Representation fission creates a larger set of decoupled representations that reflects knowledge about internal structure. (1) Modality-level fission factorizes into modality-specific information primarily in each modality, and multimodal information redundant in both modalities, while (2) fine-grained fission attempts to further break multimodal data down into individual subspaces.
🚩 Modality-level fission
Modality-level fission aims to factorize into modality-specific information primarily in each modality and multimodal information redundant in both modalities [101, 262]. Disentangled representation learning aims to learn mutually independent latent variables that each explain a particular variation of the data [30, 95], and has been useful for modality-level fission by enforcing independence constraints on modality-specific and multimodal latent variables [101, 262]. Tsai et al. [262] and Hsu and Glass [101] study factorized multimodal representations and demonstrate the importance of modality-specific and multimodal factors towards generation and prediction. Shi et al. [231] study modality-level fission in multimodal variational autoencoders using a mixture-of-experts layer, while Wu and Goodman [292] instead use a product-of-experts layer.

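A structural sketch of modality-level fission with private (modality-specific) and shared (multimodal) factors; the independence constraints of [101, 262] would be added as auxiliary losses, which are omitted here (all names are my own):

```python
import torch
import torch.nn as nn

class FactorizedEncoders(nn.Module):
    """Encode each modality into a modality-specific factor plus a
    shared multimodal factor; downstream losses would encourage the
    factors to be mutually independent."""
    def __init__(self, d1, d2, d_private, d_shared):
        super().__init__()
        self.priv1 = nn.Linear(d1, d_private)   # specific to modality 1
        self.priv2 = nn.Linear(d2, d_private)   # specific to modality 2
        self.shared1 = nn.Linear(d1, d_shared)  # modality 1 -> shared space
        self.shared2 = nn.Linear(d2, d_shared)  # modality 2 -> shared space

    def forward(self, x1, x2):
        shared = 0.5 * (self.shared1(x1) + self.shared2(x2))
        return self.priv1(x1), self.priv2(x2), shared
```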

Post-hoc representation disentanglement is suitable when it is difficult to retrain a disentangled model, especially for large pretrained multimodal models. Empirical multimodally-additive function projection (EMAP) [94] is an approach for post-hoc disentanglement of the effects of unimodal (additive) contributions from cross-modal interactions in multimodal tasks, which works for arbitrary multimodal models and tasks. EMAP is also closely related to the use of Shapley values for feature disentanglement and interpretation [176], which can also be used for post-hoc representation disentanglement in general models.

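A small NumPy sketch of the EMAP projection as I understand the formula in [94]: evaluate the black-box predictor f on all cross-pairings of an evaluation set and keep only the additive part (row mean + column mean − grand mean):

```python
import numpy as np

def emap(f, X1, X2):
    """Project a multimodal predictor f onto its best additive approximation.

    f_hat(i) = mean_j f(x1_i, x2_j) + mean_j f(x1_j, x2_i)
               - mean_jk f(x1_j, x2_k)
    Comparing f_hat with f on the original pairs shows how much of the
    model's behavior is explained without cross-modal interactions.
    """
    n = len(X1)
    # grid[i, j] = f(x1_i, x2_j): O(n^2) black-box evaluations
    grid = np.array([[f(X1[i], X2[j]) for j in range(n)] for i in range(n)])
    return grid.mean(axis=1) + grid.mean(axis=0) - grid.mean()
```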

🚩 Fine-grained fission
Fine-grained fission: Beyond factorizing only into individual modality representations, fine-grained fission attempts to further break multimodal data down into the individual subspaces covered by the modalities [277]. Clustering approaches that group data based on semantic similarity [165] have been integrated with multimodal networks for end-to-end representation fission and prediction. For example, Hu et al. [102] combine k-means clustering in representations with unsupervised audiovisual learning. Chen et al. [48] combine k-means clustering with self-supervised contrastive learning on videos. Subspace clustering [1], approximate graph Laplacians [125], conjugate mixture models [124], and dictionary learning [126] have also been integrated with multimodal models. Motivated by similar goals of representation fission, matrix factorization techniques have also seen several applications in multimodal prediction [10] and image retrieval [41].

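A toy sketch of clustering-based fission: run k-means on concatenated unimodal embeddings to expose fine-grained groups (in the cited works the assignments additionally feed back into training, e.g. as pseudo-labels for contrastive learning; the embeddings here are random placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
Z_audio = rng.normal(size=(1000, 64))    # placeholder unimodal embeddings
Z_visual = rng.normal(size=(1000, 64))
Z = np.concatenate([Z_audio, Z_visual], axis=1)

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)
print(np.bincount(labels))  # cluster sizes over the multimodal space
```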
