Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions (Paper Walkthrough 1: Principles)

iracer

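Last modified 2023-12-19 12:06:30

Tags: machine learning, artificial intelligence, paper reading, multimodal

First published 2023-12-18 13:31:19

Source: https://arxiv.org/pdf/2209.03430.pdf

This post is one part of my reading notes on the multimodal machine learning survey below. It covers the abstract, the introduction, and the foundational principles.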

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PAUL PU LIANG, AMIR ZADEH, and LOUIS-PHILIPPE MORENCY,
Machine Learning Department and Language Technologies Institute, Carnegie Mellon University, USA

Abstract

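Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification, covering historical and recent trends. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.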

1 INTRODUCTION

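It has always been a grand goal of artificial intelligence to develop computer agents with intelligent capabilities such as understanding, reasoning, and learning through multimodal experiences and data, similar to how humans perceive and interact with the world using multiple sensory modalities. With recent advances in embodied autonomous agents [37, 222], self-driving cars [295], image and video understanding [11, 243], image and video generation [210, 234], and multisensor fusion in application domains such as robotics [136, 170] and healthcare [119, 151], we are now closer than ever to intelligent agents that can integrate and learn from many sensory modalities. Given the heterogeneity of the data and the interconnections often found between modalities, this vibrant multi-disciplinary research field of multimodal machine learning brings unique challenges, and it has widespread applications in multimedia [184], affective computing [204], robotics [127, 136], human-computer interaction [190, 228], and healthcare [40, 180].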

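Embodiment: having a physical body that supports sensing and movement (sensorimotor capabilities).
Embodied AI: agents that have a body and support physical interaction, such as household service robots and driverless vehicles.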

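Purpose of the paper:

However, the rate of progress in multimodal research has made it difficult to identify the common themes underlying historical and recent work, as well as the key open questions in the field. By synthesizing a broad range of multimodal research, this paper is designed to provide an overview of the methodological, computational, and theoretical foundations of multimodal machine learning, which complements recent application-oriented surveys in vision and language [269], language and reinforcement learning [161], multimedia analysis [19], and human-computer interaction [114].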

To better understand the foundations of multimodal machine learning, we begin by defining (in §2) three key principles that have driven subsequent technical challenges and innovations: (1) modalities are heterogeneous because the information present often shows diverse qualities, structures, and representations, (2) modalities are connected since they are often related and share commonalities, and (3) modalities interact to give rise to new information when used for task inference. Building upon these definitions, we propose a new taxonomy of six core challenges in multimodal learning: representation, alignment, reasoning, generation, transference, and quantification (see Figure 1). These constitute core multimodal technical challenges that are understudied in conventional unimodal machine learning, and need to be tackled in order to progress the field forward:

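🚩 Three key principles in multimodal learning

In list form, the three principles defined in §2 are:

Modalities are heterogeneous: the information they carry often shows diverse qualities, structures, and representations.
Modalities are connected: they are often related and share commonalities.
Modalities interact: when used together for task inference, they give rise to new information.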

Fig. 1. Core research challenges in multimodal learning: (1) Representation studies how to represent and summarize multimodal data to reflect the heterogeneity and interconnections between individual modality elements. (2) Alignment aims to identify the connections and interactions across all elements. (3) Reasoning aims to compose knowledge from multimodal evidence usually through multiple inferential steps for a task. (4) Generation involves learning a generative process to produce raw modalities that reflect cross-modal interactions, structure, and coherence. (5) Transference aims to transfer knowledge between modalities and their representations. (6) Quantification involves empirical and theoretical studies to better understand the multimodal learning process.

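Building on these definitions, the authors propose a new taxonomy of six core challenges in multimodal learning: representation, alignment, reasoning, generation, transference, and quantification (see Figure 1). These core multimodal technical challenges are understudied in conventional unimodal machine learning and need to be tackled to move the field forward: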

Representation (§3): Can we learn representations that reflect heterogeneity and interconnections between modality elements? We will cover approaches for (1) representation fusion: integrating information from two or more modalities to capture cross-modal interactions, (2) representation coordination: interchanging cross-modal information with the goal of keeping the same number of representations but improving multimodal contextualization, and (3) representation fission: creating a larger set of disjoint representations that reflects knowledge about internal structure such as data clustering or factorization.
Alignment (§4): How can we identify the connections and interactions between modality elements? Alignment is challenging since it may depend on long-range dependencies, involves ambiguous segmentation (e.g., words or utterances), and could be either one-to-one, many-to-many, or not exist at all. We cover (1) discrete alignment: identifying connections between discrete elements across modalities, (2) continuous alignment: modeling alignment between continuous modality signals with ambiguous segmentation, and (3) contextualized representations: learning better representations by capturing cross-modal interactions between elements.
Reasoning (§5) is defined as composing knowledge, usually through multiple inferential steps, that exploits the problem structure for a specific task. Reasoning involves (1) modeling the structure over which composition occurs, (2) the intermediate concepts in the composition process, (3) understanding the inference paradigm of more abstract concepts, and (4) leveraging large-scale external knowledge in the study of structure, concepts, and inference.
Generation (§6) involves learning a generative process to produce raw modalities. We categorize its subchallenges into (1) summarization: summarizing multimodal data to reduce information content while highlighting the most salient parts of the input, (2) translation: translating from one modality to another and keeping information content while being consistent with cross-modal connections, and (3) creation: simultaneously generating multiple modalities to increase information content while maintaining coherence within and across modalities.
Transference (§7) aims to transfer knowledge between modalities, usually to help the target modality, which may be noisy or with limited resources. Transference is exemplified by (1) cross-modal transfer: adapting models to tasks involving the primary modality, (2) co-learning: transferring information from secondary to primary modalities by sharing representation spaces between both modalities, and (3) model induction: keeping individual unimodal models separate but transferring information across these models.
Quantification (§8): The sixth and final challenge involves empirical and theoretical studies to better understand (1) the dimensions of heterogeneity in multimodal datasets and how they subsequently influence modeling and learning, (2) the presence and type of modality connections and interactions in multimodal datasets and captured by trained models, and (3) the learning and optimization challenges involved with heterogeneous data.
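To make the first two representation sub-challenges more concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from the survey: `fuse_concat` uses concatenation as the simplest possible form of fusion, and `cosine_similarity` stands in for one possible coordination score between two representations that are kept separate. The function names and the toy vectors are hypothetical, and the similarity function assumes equal-length, non-zero vectors.

```python
import math

def fuse_concat(x_a, x_b):
    """Fusion (simplest form): join two modality vectors into one representation."""
    return list(x_a) + list(x_b)

def cosine_similarity(u, v):
    """Coordination score: how close two separately kept representations are."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy features, made up for illustration only.
text_vec = [1.0, 0.0]
image_vec = [1.0, 0.0]

fused = fuse_concat(text_vec, image_vec)        # two inputs, one joint representation
score = cosine_similarity(text_vec, image_vec)  # two representations, compared
```

The contrast to notice is in the counts: fusion goes from two representations to one, while coordination keeps two and only relates them. Fission, which goes the other way and produces more representations than inputs, is not sketched here.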

🚩 Six core challenges in multimodal learning

In short, the six challenges and their sub-challenges are:

Representation (§3): fusion, coordination, and fission. Here "interconnections" covers both connections and interactions between modality elements.
Alignment (§4): discrete alignment, continuous alignment, and contextualized representations.
Reasoning (§5): structure modeling, intermediate concepts, inference paradigm, and external knowledge.
Generation (§6): summarization, translation, and creation.
Transference (§7): cross-modal transfer, co-learning, and model induction. "Limited resources" refers to limited data for the target modality.
Quantification (§8): heterogeneity, connections and interactions, and learning and optimization.

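Finally, we conclude with a long-term perspective on multimodal learning by motivating open research questions identified by this taxonomy. The authors have also presented this survey in tutorials at CVPR 2022 and NAACL 2022, as well as in CMU's Multimodal Machine Learning course and the 11-877 Advanced Topics course. Readers are encouraged to check out the publicly available video recordings, additional reading materials, and discussion probes that motivate open research questions in multimodal learning.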

2 FOUNDATIONAL PRINCIPLES IN MULTIMODAL RESEARCH

🚩 Definition of modality

A modality refers to a way in which a natural phenomenon is perceived or expressed. For example, modalities include speech and audio recorded through microphones, images and videos captured via cameras, and force and vibrations captured via haptic sensors. Modalities can be placed along a spectrum from raw to abstract: raw modalities are those more closely detected from a sensor, such as speech recordings from a microphone or images captured by a camera. Abstract modalities are those farther away from sensors, such as language extracted from speech recordings, objects detected from images, or even abstract concepts like sentiment intensity and object categories.

🚩 Definition of multimodal

Multimodal refers to situations where multiple modalities are involved. From a research perspective, multimodal entails the computational study of heterogeneous and interconnected (connections + interactions) modalities. Firstly, modalities are heterogeneous because the information present in different modalities will often show diverse qualities, structures, and representations. Secondly, these modalities are not independent entities but rather share connections due to complementary information. Thirdly, modalities interact in different ways when they are integrated for a task. We expand on these three foundational principles of multimodal research in the following subsections.

2.1 Principle 1: Modalities are Heterogeneous

Fig. 2. The information present in different modalities will often show diverse qualities, structures, and representations. Dimensions of heterogeneity can be measured via differences in individual elements and their distribution, the structure of elements, as well as modality information, noise, and task relevance.
Note: each column in Figure 2 corresponds to one dimension of modality heterogeneity.

The principle of heterogeneity reflects the observation that the information present in different modalities will often show diverse qualities, structures, and representations. Heterogeneity should be seen as a spectrum: two images from the same camera which capture the same view modulo camera wear and tear are closer to homogeneous, two different languages which capture the same meaning but are different depending on language families are slightly heterogeneous, language and vision are even more heterogeneous, and so on. In this section, we present a non-exhaustive list of dimensions of heterogeneity (see Figure 2 for an illustration). These dimensions are complementary and may overlap; each multimodal problem likely involves heterogeneity in multiple dimensions.

(1) Element representation: Each modality is typically comprised of a set of elements - the most basic unit of data which cannot (or rather, the user chooses to not) be broken down into further units [26, 147]. For example, typed text is recorded via a set of characters, videos are recorded via a set of frames, and graphs are recorded via a set of nodes and edges. What are the basic elements present in each modality, and how can we represent them? Formally, this dimension measures heterogeneity in the sample space or representation space of modality elements.

(2) Distribution refers to the frequency and likelihood of elements in modalities. Elements typically follow a unique distribution, with words in a linguistic corpus following Zipf’s Law as a classic example. Distribution heterogeneity then refers to the differences in frequencies and likelihoods of elements, such as different frequencies in recorded signals and the density of elements.

(3) Structure: Natural data exhibits structure in the way individual elements are composed to form entire modalities [38]. For example, images exhibit spatial structure across individual object elements, language is hierarchically composed of individual words, and signals exhibit temporal structure across time. Structure heterogeneity refers to differences in this underlying structure.

(4) Information measures the total information content present in each modality. Subsequently, information heterogeneity measures the differences in information content across modalities, which could be formally measured by information theoretic metrics [227].

(5) Noise: Noise can be introduced at several levels across naturally occurring data and also during the data recording process. Natural data noise includes occlusions, imperfections in human-generated data (e.g., imperfect keyboard typing or unclear speech), or data ambiguity due to sensor failures [151]. Noise heterogeneity measures differences in noise distributions across modalities, as well as differences in signal-to-noise ratio.

(6) Relevance: Finally, each modality shows different relevance toward specific tasks and contexts: certain modalities may be more useful for certain tasks than others [78]. Task relevance describes how modalities can be used for inference, while context relevance describes how modalities are contextualized with other modalities.

Note: the six items above correspond to the six dimensions of heterogeneity shown in Figure 2.

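These dimensions of heterogeneity are useful to consider when studying both unimodal and multimodal data. In the unimodal case, specialized encoders are typically designed to capture these unique characteristics in each modality [38]. In the multimodal case, modeling heterogeneity is useful when learning representations and capturing alignment [314], and it is a key subchallenge in quantifying multimodal models [150].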
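Three of the dimensions above (distribution, information, and noise) can be estimated directly from data. The sketch below is my own illustration in plain Python, assuming discrete elements and simple plug-in estimators; the survey does not prescribe these particular formulas, and the function names are hypothetical.

```python
import math
from collections import Counter

def rank_frequency(elements):
    """Distribution: element counts sorted from most to least frequent."""
    return [count for _, count in Counter(elements).most_common()]

def shannon_entropy(elements):
    """Information: plug-in entropy estimate (in bits) of the element distribution."""
    n = len(elements)
    counts = Counter(elements)
    # p * log2(1/p) with p = c / n, summed over distinct elements.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def snr_db(signal, noise):
    """Noise: signal-to-noise ratio in decibels from mean signal and noise power."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

# Toy data, made up for illustration only.
tokens = "the cat saw the dog and the bird".split()
counts_by_rank = rank_frequency(tokens)
entropy_bits = shannon_entropy(list("aabb"))
```

Computing the same quantity for each modality and comparing the results gives a rough, first-pass view of how heterogeneous two modalities are along that one dimension. For a Zipf-like distribution, `counts_by_rank` would fall off roughly in proportion to 1/rank on a large corpus; the toy sentence here is far too small to show that.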

2.2 Principle 2: Modalities are Connected

Fig. 3. Modality connections describe how modalities are related and share commonalities, such as correspondences between the same concept in language and images or dependencies across spatial and temporal dimensions. Connections can be studied through both statistical and semantic perspectives.

Although modalities are heterogeneous, they are often connected due to shared complementary information. The presence of shared information is often in contrast to unique information that exists solely in a single modality [290]. Modality connections describe the extent and dimensions in which information can be shared across modalities. When reasoning about the connections in multimodal data, it is helpful to think about both bottom-up (statistical) and top-down (semantic) approaches (see Figure 3). From a statistical data-driven perspective, connections are identified from distributional patterns in multimodal data, while semantic approaches define connections based on our domain knowledge about how modalities share and contain unique information.

(1) Statistical association exists when the values of one variable relate to the values of another. For example, two elements may co-occur with each other, resulting in a higher frequency of both occurring at the same time. Statistically, this could lead to correlation (the degree to which elements are linearly related) or to other non-linear associations. From a data-driven perspective, discovering which elements are associated with each other is important for modeling the joint distributions across modalities during multimodal representation and alignment [257].

(2) Statistical dependence goes deeper than association and requires an understanding of the exact type of statistical dependency between two elements. For example, is there a causal dependency from one element to another, or an underlying confounder causing both elements to be present at the same time? Other forms of dependencies could be spatial or temporal: one element occurring above the other, or after the other. Typically, while statistical association can be estimated purely from data, understanding the nature of statistical dependence requires some knowledge of the elements and their underlying relationships [188, 267].

(3) Semantic correspondence can be seen as the problem of ascertaining which elements in one modality share the same semantic meaning as elements in another modality [192]. Identifying correspondences is fundamental in many problems related to language grounding [46], translation and retrieval [203], and cross-modal alignment [248].

(4) Semantic relations: Finally, semantic relations generalize semantic correspondences: instead of modality elements sharing the same exact meaning, semantic relations include an attribute describing the exact nature of the relationship between two modality elements, such as semantic, logical, causal, or functional relations. Identifying these semantically related connections is important for higher-order reasoning [26, 172].
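Of the four kinds of connection, statistical association is the one that can be estimated purely from data. As a rough sketch (my own, not from the survey), the code below computes the Pearson correlation for linear association between two numeric signals, and pointwise mutual information (PMI) for the co-occurrence of two discrete elements. PMI is one common choice that I am assuming here, and the function names and toy pairs are hypothetical.

```python
import math

def pearson(xs, ys):
    """Linear association between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pmi(pairs, a, b):
    """Co-occurrence strength of element a (modality 1) and b (modality 2), in bits."""
    n = len(pairs)
    p_a = sum(1 for x, _ in pairs if x == a) / n
    p_b = sum(1 for _, y in pairs if y == b) / n
    p_ab = sum(1 for x, y in pairs if x == a and y == b) / n
    if p_ab == 0:
        return float("-inf")  # a and b never co-occur in the sample
    return math.log2(p_ab / (p_a * p_b))

# Toy (image label, sound label) pairs, made up for illustration only.
pairs = [("dog", "bark"), ("dog", "bark"), ("cat", "meow"), ("cat", "meow")]
dog_bark = pmi(pairs, "dog", "bark")
```

A positive PMI means the two elements co-occur more often than they would if they were independent. Neither number says anything about why they co-occur: telling a causal link from a shared confounder is the statistical dependence question in item (2), which needs knowledge beyond the data.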

2.3 Principle 3: Modalities Interact

Modality interactions study how modality elements interact to give rise to new information when integrated together for task inference. We note an important difference between modality connections and interactions: connections exist within multimodal data itself, whereas interactions only arise when modalities are integrated and processed together to bring a new response. In Figure 4, we provide a high-level illustration of some dimensions of interactions that can exist.

Fig. 4. Several dimensions of modality interactions: (1) Interaction information studies whether common redundant information or unique non-redundant information is involved in interactions; (2) interaction mechanics study the manner in which interaction occurs, and (3) interaction response studies how the inferred task changes in the presence of multiple modalities.

(1) Interaction information investigates the type of connected information that is involved in an interaction. When an interaction involves shared information common to both modalities, the interaction is redundant, while a non-redundant interaction is one that does not solely rely on shared information, and instead relies on different ratios of shared, unique, or possibly even synergistic information [290].

(2) Interaction mechanics are the functional operators involved when integrating modality elements for task inference. For example, interactions can be expressed as statistically additive, non-additive, and non-linear forms [117], as well as from a semantic perspective where two elements interact through a logical, causal, or temporal operation [268].

(3) Interaction response studies how the inferred response changes in the presence of multiple modalities. For example, through sub-dividing redundant interactions, we can say that two modalities create an equivalence response if the multimodal response is the same as responses from either modality, or enhancement if the multimodal response displays higher confidence. On the other hand, non-redundant interactions such as modulation or emergence happen when there exist different multimodal versus unimodal responses [197].
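The interaction response categories in item (3) can be read as a small decision rule. The sketch below is my own reading, not the authors' procedure. It assumes each response is a hypothetical (label, confidence) pair, treats an interaction as redundant only when both unimodal labels and the multimodal label agree, and lumps modulation and emergence together as "non-redundant" because the text above does not say how to tell them apart.

```python
def interaction_response(resp_a, resp_b, resp_ab, tol=1e-9):
    """Label a multimodal response relative to its two unimodal responses.

    Each response is a (label, confidence) pair. The decision rule is an
    illustrative assumption, not a definition taken from the survey.
    """
    label_a, conf_a = resp_a
    label_b, conf_b = resp_b
    label_ab, conf_ab = resp_ab

    if label_a == label_b == label_ab:
        # Same answer from every view: a redundant interaction.
        if conf_ab > max(conf_a, conf_b) + tol:
            return "enhancement"   # same answer, higher confidence
        return "equivalence"       # same answer, no gain in confidence
    # The multimodal answer differs from at least one unimodal answer.
    return "non-redundant"

# Toy responses, made up for illustration only.
same = interaction_response(("happy", 0.6), ("happy", 0.6), ("happy", 0.6))
stronger = interaction_response(("happy", 0.6), ("happy", 0.7), ("happy", 0.9))
different = interaction_response(("happy", 0.6), ("happy", 0.7), ("sarcastic", 0.8))
```

In the last toy case both modalities say "happy" on their own but the combined reading is "sarcastic", which is the kind of new information that only appears once the modalities are processed together.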

2.4 Core Technical Challenges

Building on these three core principles and on our detailed review of recent work, we propose a new taxonomy to characterize the core technical challenges in multimodal research: representation, alignment, reasoning, generation, transference, and quantification. In Table 1 we summarize our full taxonomy of these six core challenges, their subchallenges, categories of corresponding approaches, and recent examples in each category. In the following sections, we describe our new taxonomy in detail and also revisit the principles of heterogeneity, connections, and interactions to see how they pose research questions and inspire research in each of these six challenges.

To be continued...