Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions (Challenge 2: Alignment)

iracer

Last modified 2023-12-19 15:05:14

Tags: machine learning, artificial intelligence, multimodal, survey

First published 2023-12-19 12:06:22

Original link: https://arxiv.org/pdf/2209.03430.pdf

This post is one of a series of reading notes on the multimodal machine learning survey cited below.

This post covers Challenge 2 of multimodal machine learning: alignment.

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PAUL PU LIANG, AMIR ZADEH, and LOUIS-PHILIPPE MORENCY,
Machine Learning Department and Language Technologies Institute, Carnegie Mellon University, USA

Paper link: https://arxiv.org/pdf/2209.03430.pdf

4 Challenge 2: Alignment

A second challenge is to identify cross-modal connections and interactions between elements of multiple modalities. For example, when analyzing the speech and gestures of a human subject, how can we align specific gestures with spoken words or utterances? Alignment between modalities is challenging since it may depend on long-range dependencies, involves ambiguous segmentation (e.g., words or utterances), and could be either one-to-one, many-to-many, or not exist at all. This section covers recent work in multimodal alignment involving (1) discrete alignment: identifying connections between discrete elements across modalities, (2) continuous alignment: modeling alignment between continuous modality signals with ambiguous segmentation, and (3) contextualized representations: learning better multimodal representations by capturing cross-modal interactions between elements (Figure 9).

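Notes: The second challenge is to find which elements of one modality connect to, and interact with, which elements of another, for example deciding which gesture goes with which spoken word or utterance. It is hard for three reasons: alignment can hinge on long-range dependencies, segmentation is ambiguous (words or utterances), and the correspondence may be one-to-one, many-to-many, or absent altogether. The section is organized into (1) discrete alignment, (2) continuous alignment, and (3) contextualized representations (Figure 9).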

Fig. 9. Alignment aims to identify cross-modal connections and interactions between modality elements. Recent work has involved (1) discrete alignment to identify connections among discrete elements, (2) continuous alignment of continuous signals with ambiguous segmentation, and (3) contextualized representation learning to capture these cross-modal interactions between connected elements.

4.1 Subchallenge 2a: Discrete Alignment

The first subchallenge aims to identify connections between discrete elements of multiple modalities. We describe recent work in (1) local alignment to discover connections between a given matching pair of modality elements, and (2) global alignment where alignment must be performed globally to learn both the connections and matchings (Figure 10).

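Notes: Discrete alignment connects discrete elements across modalities. Local alignment discovers connections when the matching pairs are already given. Global alignment has no such pairs, so it must learn both the connections and the matchings over all elements at once (Figure 10).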

Fig. 10. Discrete alignment identifies connections between discrete elements, spanning (1) local alignment to discover connections given matching pairs, and (2) global alignment where alignment must be performed globally to learn both the connections and matchings between modality elements.

🚩Local alignment

Local alignment between connected elements is particularly suitable for multimodal tasks where there is clear segmentation into discrete elements such as words in text or object bounding boxes in images or videos (e.g., tasks such as visual coreference resolution [131], visual referring expression recognition [58, 59], and cross-modal retrieval [75, 203]). When we have supervised data in the form of connected modality pairs, contrastive learning is a popular approach where the goal is to match representations of the same concept expressed in different modalities [23]. Several objective functions for learning aligned spaces from varying quantities of paired [43, 107] and unpaired [85] data have been proposed. Many of the ideas that enforce strong [75, 152] or partial [16, 276, 319] representation coordination (§3.2) are also applicable for local alignment. Several examples include aligning books with their corresponding movies/scripts [323], matching referring expressions to visual objects [169], and finding similarities between image regions and their descriptions [105]. Methods for local alignment have also enabled the learning of shared semantic concepts not purely based on language but also on additional modalities such as vision [107], sound [60, 236], and multimedia [323] that are useful for downstream tasks.

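Notes: Local alignment suits tasks where the data segments cleanly into discrete elements, such as words in text or object bounding boxes in images and videos (visual coreference resolution [131], visual referring expression recognition [58, 59], cross-modal retrieval [75, 203]). When supervision comes as connected modality pairs (for example, an image with its caption forms a bimodal pair that can serve as supervised training data), contrastive learning is the popular choice: it matches representations of the same concept expressed in different modalities [23]. Objective functions for learning aligned spaces have been proposed for varying amounts of paired [43, 107] and unpaired [85] data. Ideas that enforce strong [75, 152] or partial [16, 276, 319] representation coordination (§3.2) carry over to local alignment. Examples include aligning books with their movies or scripts [323], matching referring expressions to visual objects [169], and finding similarities between image regions and their descriptions [105]. Local alignment also lets models learn shared semantic concepts grounded not only in language but also in vision [107], sound [60, 236], and multimedia [323], which helps downstream tasks.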
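The contrastive objective mentioned above can be sketched in a few lines. This is a minimal, dependency-free illustration of a symmetric InfoNCE-style loss, not the formulation of any specific cited paper; the function name `info_nce`, the temperature value, and the toy embeddings are assumptions made for the example. In practice the embeddings would come from trained encoders and the loss would be computed with a tensor library.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(image_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss; index i in both lists is a matched pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
           for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        # Treat each row as logits whose correct class is the matched index.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy_on_diagonal(sim)
                  + cross_entropy_on_diagonal(sim_t))

# Toy data (hypothetical): pairs that agree vs. pairs that are swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(images, [[0.9, 0.1], [0.1, 0.9]])
swapped_loss = info_nce(images, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each image is closest to its own caption and large when the pairing is swapped, which is the signal that pulls matched pairs together in the shared space.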

🚩Global alignment

Global alignment: When the ground-truth modality pairings are not available, alignment must be performed globally between all elements across both modalities. Optimal transport (OT)-based approaches [278] (which belong to a broader set of matching algorithms) are a potential solution since they jointly optimize the coordination function and optimal coupling between modality elements by posing alignment as a divergence minimization problem. These approaches are useful for aligning multimodal representation spaces [142, 205]. To alleviate computational issues, several recent advances have integrated them with neural networks [54], approximated optimal transport with entropy regularization [288], and formulated convex relaxations for efficient learning [85].

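Notes: When ground-truth pairings between modalities are unavailable, alignment has to be solved globally over all elements of both modalities. Optimal transport (OT) methods [278], which belong to the broader family of matching algorithms, are one candidate: they pose alignment as divergence minimization and jointly optimize the coordination function and the optimal coupling between modality elements. They are useful for aligning multimodal representation spaces [142, 205]. To reduce the computational cost, recent work integrates OT with neural networks [54], approximates it with entropy regularization [288], and formulates convex relaxations for efficient learning [85].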
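To make the entropy-regularized variant concrete, here is a small sketch of Sinkhorn iterations, the standard way to approximate optimal transport under entropy regularization. It is an illustration under simplifying assumptions (uniform marginals, a hand-written cost matrix, the hypothetical function name `sinkhorn`), not the method of any cited paper. In a real pipeline the cost would be a distance between learned embeddings.

```python
import math

def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each element of modality A
    b = [1.0 / m] * m  # mass on each element of modality B
    # Gibbs kernel: low cost -> high affinity.
    K = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost (hypothetical): element i of A is cheapest to match with
# elements 2, 0, 1 of B respectively; no pairing labels are given.
cost = [[1.0, 1.0, 0.0],
        [0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0]]
plan = sinkhorn(cost)
matches = [max(range(3), key=lambda j: plan[i][j]) for i in range(3)]
```

The returned `plan` is a soft coupling whose rows and columns sum to the prescribed marginals; taking the largest entry per row recovers a hard matching for this toy case.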

4.2 Subchallenge 2b: Continuous Alignment

So far, one important assumption we have made is that modality elements are already segmented and discretized. While certain modalities display clear segmentation (e.g., words/phrases in a sentence or object regions in an image), there are many cases where the segmentation is not readily provided, such as in continuous signals (e.g., financial or medical time-series), spatio-temporal data (e.g., satellite or weather images), or data without clear semantic boundaries (e.g., MRI images). In these settings, methods based on warping and segmentation have been recently proposed:

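Notes: Everything so far assumed that modality elements are already segmented and discretized. Some modalities do segment cleanly (words or phrases in a sentence, object regions in an image), but many do not: continuous signals (financial or medical time series), spatio-temporal data (satellite or weather images), and data without clear semantic boundaries (MRI images). For these settings, methods based on warping and on segmentation have been proposed.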

Fig. 11. Continuous alignment tackles the difficulty of aligning continuous signals where element segmentation is not readily available. We cover related work in (1) continuous warping of representation spaces and (2) modality segmentation of continuous signals into discrete elements at an appropriate granularity.

🚩Continuous warping

👋 For the meaning of "warping" here, see the Dynamic Time Warping (DTW) algorithm.

Continuous warping aims to align two sets of modality elements by representing them as continuous representation spaces and forming a bridge between these representation spaces.
Adversarial training is a popular approach to warp one representation space into another. Initially used in domain adaptation [27], adversarial training learns a domain-invariant representation across domains where a domain classifier is unable to identify which domain a feature came from [8]. These ideas have been extended to align multimodal spaces [100, 103, 181]. Hsu et al. [100] use adversarial training to align images and medical reports, Hu et al. [103] design an adversarial network for cross-modal retrieval, and Munro and Damen [181] design both self-supervised alignment and adversarial alignment objectives for multimodal action recognition. Dynamic time warping (DTW) [133] is a related approach to segment and align multi-view time series data. DTW measures the similarity between two sequences and finds an optimal match between them by time warping (inserting frames) such that they are aligned across segmented time boundaries. For multimodal tasks, it is necessary to design similarity metrics between modalities [17, 251]. DTW was extended using CCA to map the modalities to a coordinated space, allowing for both alignment (through DTW) and coordination (through CCA) between different modality streams jointly [260].

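Notes: Continuous warping aligns two sets of modality elements by treating them as continuous representation spaces and building a bridge between those spaces. Adversarial training is a popular way to warp one space into another. It was first used for domain adaptation [27], where it learns a domain-invariant representation such that a domain classifier cannot tell which domain a feature came from [8]. The idea has been extended to multimodal spaces [100, 103, 181]: Hsu et al. [100] align images and medical reports, Hu et al. [103] build an adversarial network for cross-modal retrieval, and Munro and Damen [181] combine self-supervised and adversarial alignment objectives for multimodal action recognition. Dynamic time warping (DTW) [133] is a related method for segmenting and aligning multi-view time series. It measures the similarity between two sequences and finds the best match by warping time (inserting frames) so that the sequences line up across segment boundaries. Multimodal use requires a similarity metric between modalities [17, 251]. DTW has also been combined with CCA, which maps the modalities into a coordinated space so that alignment (via DTW) and coordination (via CCA) are learned jointly [260].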
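Since DTW comes up repeatedly, a compact sketch of the classic dynamic-programming version may help. This is the textbook unimodal algorithm with an absolute-difference distance, written here as an assumption-laden illustration: the function name `dtw`, the default `dist`, and the toy sequences are not from the paper, and a multimodal setting would swap `dist` for a learned cross-modal similarity.

```python
def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Return the DTW cost and the warping path between sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # D[i][j]: minimal cost of aligning x[:i] with y[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(x[i - 1], y[j - 1]) + min(
                D[i - 1][j],      # x advances, y frame is repeated
                D[i][j - 1],      # y advances, x frame is repeated
                D[i - 1][j - 1],  # both advance
            )
    # Backtrack from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return D[n][m], path

# Toy sequences (hypothetical): y is x with some frames held longer.
x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 2, 1, 0]
total_cost, path = dtw(x, y)
```

Each pair `(i, j)` in `path` says that frame `i` of `x` is aligned with frame `j` of `y`; repeated indices correspond to the "inserted frames" described above.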

🚩Modality segmentation

Modality segmentation involves dividing high-dimensional data into elements with semantically meaningful boundaries. A common problem involves temporal segmentation, where the goal is to discover the temporal boundaries across sequential data. Several approaches for temporal segmentation include forced alignment, a popular approach to align discrete speech units with individual words in a transcript [309]. Malmaud et al. [167] explore multimodal alignment using a factored hidden Markov model to align ASR transcripts to the ground truth. Clustering approaches have also been used to group continuous data based on semantic similarity [165]. Clustering-based discretization has recently emerged as an important preprocessing step for generalizing language-based pretraining (with clear word/bytepair segmentation boundaries and discrete elements) to video or audio-based pretraining (without clear segmentation boundaries and continuous elements). By clustering raw video or audio features into a discrete set, approaches such as VideoBERT [243] perform masked pretraining on raw video and audio data. Similarly, approaches such as DALL.E [210], VQ-VAE [271], and CMCM [156] also utilize discretized intermediate layers obtained via vector quantization and showed benefits in modality alignment.

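Notes: Modality segmentation divides high-dimensional data into elements with semantically meaningful boundaries. A common case is temporal segmentation, which looks for temporal boundaries in sequential data. Forced alignment is a popular technique that aligns discrete speech units with the individual words of a transcript [309]. Malmaud et al. [167] use a factored hidden Markov model to align ASR transcripts to the ground truth. Clustering has also been used to group continuous data by semantic similarity [165].

Clustering-based discretization has recently become an important preprocessing step for carrying language-style pretraining (clear word or byte-pair boundaries, discrete elements) over to video or audio pretraining (no clear boundaries, continuous elements). By clustering raw video or audio features into a discrete set, methods such as VideoBERT [243] can run masked pretraining on raw video and audio. Similarly, DALL.E [210], VQ-VAE [271], and CMCM [156] use discretized intermediate layers obtained through vector quantization and report benefits for modality alignment.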
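The clustering-based discretization step can be pictured as follows. This is a deliberately tiny k-means sketch that turns continuous frame features into discrete token ids; it only illustrates the general idea and is not the actual preprocessing used by VideoBERT or VQ-VAE. The function names, the fixed initial centroids, and the 2-D toy features are all assumptions for the example.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(point, centroids):
    """Index of the closest centroid (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))

def kmeans(points, init_centroids, n_iters=10):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(n_iters):
        assignment = [nearest(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def tokenize(points, centroids):
    """Map each continuous feature vector to a discrete codebook index."""
    return [nearest(p, centroids) for p in points]

# Toy per-frame features (hypothetical 2-D vectors from a video encoder).
frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9], [0.05, 0.05], [9.0, 0.0]]
codebook = kmeans(frames, init_centroids=[frames[0], frames[2], frames[5]])
tokens = tokenize(frames, codebook)
```

Once frames are replaced by ids from a finite codebook, they can be masked and predicted in the same way word tokens are in language pretraining.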

4.3 Subchallenge 2c: Contextualized Representations

Finally, contextualized representation learning aims to model all modality connections and interactions to learn better representations. Contextualized representations have been used as an intermediate (often latent) step enabling better performance on a number of downstream tasks including speech recognition, machine translation, media description, and visual question-answering. We categorize work in contextualized representations into (1) joint undirected alignment, (2) cross-modal directed alignment, and (3) alignment with graph networks (Figure 12).

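Notes: Contextualized representation learning models all modality connections and interactions in order to learn better representations. It serves as an intermediate (often latent) step that improves downstream tasks such as speech recognition, machine translation, media description, and visual question answering. The work is grouped into (1) joint undirected alignment, (2) cross-modal directed alignment, and (3) alignment with graph networks (Figure 12).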

Fig. 12. Contextualized representation learning aims to model modality connections to learn better representations. Recent directions include (1) joint undirected alignment that captures undirected symmetric connections, (2) cross-modal directed alignment that models asymmetric connections in a directed manner, and (3) graphical alignment that generalizes the sequential pattern into arbitrary graph structures.

🚩Joint undirected alignment

Joint undirected alignment aims to capture undirected connections across pairs of modalities, where the connections are symmetric in either direction. This is commonly referred to in the literature as unimodal, bimodal, trimodal interactions, and so on [164]. Joint undirected alignment is typically captured by parameterizing models with alignment layers and training end-to-end for a multimodal task. These alignment layers can include attention weights [47], tensor products [158, 310], and multiplicative interactions [117]. More recently, transformer models [273] have emerged as powerful encoders for sequential data by automatically aligning and capturing complementary features at different time steps. Building upon the initial text-based transformer model, multimodal transformers have been proposed that perform joint alignment using a full self-attention over modality elements concatenated across the sequence dimension (i.e., early fusion) [140, 243]. As a result, all modality elements become jointly connected to all other modality elements similarly (i.e., modeling all connections using dot-product similarity kernels).

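Notes: Joint undirected alignment captures connections across pairs of modalities that are symmetric in both directions. The literature usually calls these unimodal, bimodal, trimodal interactions, and so on [164]. The typical recipe is to parameterize the model with alignment layers and train end-to-end on a multimodal task. Alignment layers include attention weights [47], tensor products [158, 310], and multiplicative interactions [117]. Transformer models [273] have since become strong encoders for sequential data because they automatically align and capture complementary features at different time steps. Building on the original text transformer, multimodal transformers concatenate modality elements along the sequence dimension (early fusion) and apply full self-attention over them [140, 243]. Every modality element is therefore connected to every other element in the same way, with all connections modeled by dot-product similarity kernels.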
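The early-fusion pattern described above (concatenate along the sequence dimension, then apply full self-attention) can be sketched without any library. This is a simplified illustration that assumes identity query, key, and value projections and a single head; real multimodal transformers use learned projections, multiple heads, and positional or modality embeddings. The function names and toy vectors are made up for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(elements):
    """Single-head self-attention with identity projections (a simplification)."""
    d = len(elements[0])
    outputs, weights = [], []
    for query in elements:
        # Scaled dot-product similarity of this element to every element.
        scores = [dot(query, key) / math.sqrt(d) for key in elements]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * elements[j][t] for j in range(len(elements)))
                        for t in range(d)])
    return outputs, weights

# Toy features (hypothetical): 2 text tokens and 3 video frames, same width.
text = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.1], [0.2, 1.0], [0.5, 0.5]]
joint = text + video  # early fusion: concatenate along the sequence axis
contextualized, weights = self_attention(joint)
```

Because text and video sit in one sequence, each row of `weights` spreads attention over elements of both modalities, so every element is connected to every other through the same dot-product kernel.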

🚩Cross-modal directed alignment

Cross-modal directed alignment relates elements from a source modality in a directed manner to a target modality, which can model asymmetric connections. For example, temporal attention models use alignment as a latent step to improve many sequence-based tasks [297, 318]. These attention mechanisms are typically directed from the output to the input so that the resulting weights reflect a soft alignment distribution over the input. Multimodal transformers perform directed alignment using query-key-value attention mechanisms to attend from one modality's sequence to another, before repeating in a bidirectional manner. This results in two sets of asymmetric contextualized representations to account for the possibly asymmetric connections between modalities [159, 248, 261]. These methods are useful for sequential data by automatically aligning and capturing complementary features at different time-steps [261]. Self-supervised multimodal pretraining has also emerged as an effective way to train these architectures, with the aim of learning general-purpose representations from larger-scale unlabeled multimodal data before transferring to specific downstream tasks via supervised fine-tuning [140]. These pretraining objectives typically consist of unimodal masked prediction, crossmodal masked prediction, and multimodal alignment prediction [93].

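Notes: Cross-modal directed alignment relates elements of a source modality to a target modality in one direction, which makes it possible to model asymmetric connections. Temporal attention models, for instance, use alignment as a latent step to improve many sequence tasks [297, 318]. The attention usually points from the output to the input, so the weights form a soft alignment distribution over the input. Multimodal transformers use query-key-value attention to attend from one modality's sequence to the other, then repeat the procedure in the opposite direction.

The result is two sets of asymmetric contextualized representations, which account for possibly asymmetric connections between modalities [159, 248, 261]. These methods work well on sequential data because they automatically align and capture complementary features at different time steps [261]. Self-supervised multimodal pretraining has become an effective way to train such architectures: learn general-purpose representations from large unlabeled multimodal data, then transfer to downstream tasks through supervised fine-tuning [140]. Typical pretraining objectives are unimodal masked prediction, cross-modal masked prediction, and multimodal alignment prediction [93].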
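The directed, bidirectionally repeated attention can be sketched in the same dependency-free style. The example assumes identity projections and a single head, and the function name `cross_attention` and the toy vectors are hypothetical; it is meant to show the asymmetry of the two directions, not to reproduce any cited architecture.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Attend from one modality (queries) to another (keys and values)."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        weights.append(w)
        outputs.append([sum(w[j] * values[j][t] for j in range(len(values)))
                        for t in range(len(values[0]))])
    return outputs, weights

# Toy features (hypothetical): 2 text tokens and 3 video frames.
text = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.1], [0.2, 1.0], [0.5, 0.5]]

# Direction 1: each text token gathers information from the video frames.
text_ctx, text_to_video = cross_attention(text, video, video)
# Direction 2: each video frame gathers information from the text tokens.
video_ctx, video_to_text = cross_attention(video, text, text)
```

The two weight matrices have different shapes (2x3 and 3x2) and are not transposes of each other after the softmax, which is one simple way to see why the two directions yield asymmetric representations.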

🚩Graphical alignment

Graphical alignment generalizes the sequential pattern seen in undirected or directed alignment into arbitrary graph structures between elements. This has several benefits since it does not require all elements to be connected, and allows the user to choose different edge functions for different connections. Solutions in this subcategory typically make use of graph neural networks [275] to recursively learn element representations contextualized with the elements in locally connected neighborhoods [223, 275]. These approaches have been applied for multimodal sequential data through MTAG [301] that captures connections in human videos, and F2F-CL [289] that additionally factorizes nodes along speaker turns.

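Notes: Graphical alignment generalizes the sequential pattern of undirected and directed alignment to arbitrary graph structures over elements. The benefits are that not every element has to be connected and that different connections can use different edge functions. Solutions here typically use graph neural networks [275] to recursively learn element representations contextualized by the elements in their locally connected neighborhoods [223, 275]. For multimodal sequential data, MTAG [301] captures connections in human videos, and F2F-CL [289] additionally factorizes nodes along speaker turns.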
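One round of neighborhood aggregation with per-edge-type functions can be sketched as below. This is a generic message-passing illustration, not the MTAG or F2F-CL architecture; the node names, the edge types `"temporal"` and `"crossmodal"`, and the edge functions are all hypothetical choices for the example.

```python
def propagate(features, edges, edge_fns):
    """One message-passing round: average own feature with incoming messages."""
    incoming = {node: [] for node in features}
    for src, dst, kind in edges:
        # Each edge type transforms the source feature with its own function.
        incoming[dst].append(edge_fns[kind](features[src]))
    updated = {}
    for node, feat in features.items():
        messages = [feat] + incoming[node]
        updated[node] = [sum(vals) / len(messages) for vals in zip(*messages)]
    return updated

# Toy graph (hypothetical): words and video frames as nodes.
features = {
    "word_0": [1.0, 0.0],
    "word_1": [0.0, 1.0],
    "frame_0": [0.5, 0.5],
    "frame_1": [2.0, 2.0],  # deliberately left unconnected
}
edges = [
    ("word_0", "word_1", "temporal"),     # within-modality, directed in time
    ("frame_0", "word_0", "crossmodal"),  # cross-modal, both directions
    ("word_0", "frame_0", "crossmodal"),
]
edge_fns = {
    "temporal": lambda h: [0.5 * x for x in h],  # down-weight temporal messages
    "crossmodal": lambda h: list(h),             # pass cross-modal messages as-is
}
updated = propagate(features, edges, edge_fns)
```

Nodes without incoming edges, such as `frame_1`, keep their features unchanged, which reflects the point that a graph formulation does not force every element to be connected. Stacking several rounds lets information travel beyond immediate neighbors.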

To be continued...