Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions (Challenge 2: Alignment)

This post is one of a series of reading notes on the multimodal machine learning survey cited below.

This post: Challenge 2 of multimodal machine learning, alignment.

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PAUL PU LIANG, AMIR ZADEH, and LOUIS-PHILIPPE MORENCY,
Machine Learning Department and Language Technologies Institute, Carnegie Mellon University, USA

Paper link: https://arxiv.org/pdf/2209.03430.pdf

4 Challenge 2: Alignment

A second challenge is to identify cross-modal connections and interactions between elements of multiple modalities. For example, when analyzing the speech and gestures of a human subject, how can we align specific gestures with spoken words or utterances? Alignment between modalities is challenging since it may depend on long-range dependencies, involves ambiguous segmentation (e.g., words or utterances), and could be either one-to-one, many-to-many, or not exist at all. This section covers recent work in multimodal alignment involving (1) discrete alignment: identifying connections between discrete elements across modalities, (2) continuous alignment: modeling alignment between continuous modality signals with ambiguous segmentation, and (3) contextualized representations: learning better multimodal representations by capturing cross-modal interactions between elements (Figure 9).

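The second challenge is to identify cross-modal connections and interactions between the elements of multiple modalities. For example, when analyzing a person's speech and gestures, how do we align specific gestures with spoken words or utterances? Alignment is hard because it may depend on long-range dependencies, involves ambiguous segmentation (e.g., words or utterances), and may be one-to-one, many-to-many, or absent altogether. This section covers work on (1) discrete alignment, which identifies connections between discrete elements across modalities; (2) continuous alignment, which models alignment between continuous modality signals whose segmentation is ambiguous; and (3) contextualized representations, which learn better multimodal representations by capturing cross-modal interactions between elements (Figure 9).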

Fig. 9. Alignment aims to identify cross-modal connections and interactions between modality elements. Recent work has involved (1) discrete alignment to identify connections among discrete elements, (2) continuous alignment of continuous signals with ambiguous segmentation, and (3) contextualized representation learning to capture these cross-modal interactions between connected elements.
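Figure 9. Alignment identifies cross-modal connections and interactions between modality elements. Recent work covers (1) discrete alignment, which connects discrete elements; (2) continuous alignment of continuous signals whose segmentation is ambiguous; and (3) contextualized representation learning, which captures the cross-modal interactions between connected elements.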

4.1 Subchallenge 2a: Discrete Alignment

The first subchallenge aims to identify connections between discrete elements of multiple modalities. We describe recent work in (1) local alignment to discover connections between a given matching pair of modality elements, and (2) global alignment where alignment must be performed globally to learn both the connections and matchings (Figure 10).

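The first subchallenge is to identify connections between discrete elements of multiple modalities. Two lines of work are described: (1) local alignment, which discovers connections within a given matched pair of modality elements, and (2) global alignment, where both the connections and the matching itself must be learned globally (Figure 10).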

Fig. 10. Discrete alignment identifies connections between discrete elements, spanning (1) local alignment to discover connections given matching pairs, and (2) global alignment where alignment must be performed globally to learn both the connections and matchings between modality elements.
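Figure 10. Discrete alignment identifies connections between discrete elements. It spans (1) local alignment, which discovers connections given matched pairs, and (2) global alignment, which must learn both the connections and the matching between modality elements.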
🚩Local alignment
Local alignment between connected elements is particularly suitable for multimodal tasks where there is clear segmentation into discrete elements such as words in text or object bounding boxes in images or videos (e.g., tasks such as visual coreference resolution [131], visual referring expression recognition [58, 59], and cross-modal retrieval [75, 203]). When we have supervised data in the form of connected modality pairs, contrastive learning is a popular approach where the goal is to match representations of the same concept expressed in different modalities [23]. Several objective functions for learning aligned spaces from varying quantities of paired [43, 107] and unpaired [85] data have been proposed. Many of the ideas that enforce strong [75, 152] or partial [16, 276, 319] representation coordination (§3.2) are also applicable for local alignment. Several examples include aligning books with their corresponding movies/scripts [323], matching referring expressions to visual objects [169], and finding similarities between image regions and their descriptions [105]. Methods for local alignment have also enabled the learning of shared semantic concepts not purely based on language but also on additional modalities such as vision [107], sound [60, 236], and multimedia [323] that are useful for downstream tasks.

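Local alignment between connected elements suits multimodal tasks whose data can be cleanly segmented into discrete elements, such as words in text or object bounding boxes in images and videos (e.g., visual coreference resolution [131], visual referring expression recognition [58, 59], and cross-modal retrieval [75, 203]). When supervision comes as connected modality pairs (for example, an image and its caption form a bimodal pair that can serve as supervised training data), contrastive learning is a popular approach: its goal is to match the representations of the same concept expressed in different modalities [23]. Several objective functions have been proposed for learning aligned spaces from varying amounts of paired [43, 107] and unpaired [85] data. Many ideas that enforce strong [75, 152] or partial [16, 276, 319] representation coordination (Section 3.2) also apply to local alignment. Examples include aligning books with their corresponding movies or scripts [323], matching referring expressions to visual objects [169], and finding similarities between image regions and their descriptions [105]. Local alignment methods have also made it possible to learn shared semantic concepts grounded not only in language but also in other modalities such as vision [107], sound [60, 236], and multimedia [323], which helps downstream tasks.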
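To make the contrastive idea concrete, here is a minimal sketch of one common contrastive objective (a symmetric InfoNCE-style loss). It is an illustration only and not necessarily the objective used in the cited works; the function name `info_nce`, the `temperature` value, and the toy embeddings are assumptions. It is written in plain Python so it runs without dependencies.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss; image_embs[i] is paired with text_embs[i]."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j
    sim = [[dot(imgs[i], txts[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(logits):
        # Average cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(logits):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(logits)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))

# Toy batch (hypothetical embeddings): matched pairs point the same way.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image embedding is closest to its own caption and large when the pairs are mismatched, which is the sense in which the objective "matches representations of the same concept".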

🚩Global alignment
Global alignment: When the ground-truth modality pairings are not available, alignment must be performed globally between all elements across both modalities. Optimal transport (OT)-based approaches [278] (which belong to a broader set of matching algorithms) are a potential solution since they jointly optimize the coordination function and optimal coupling between modality elements by posing alignment as a divergence minimization problem. These approaches are useful for aligning multimodal representation spaces [142, 205]. To alleviate computational issues, several recent advances have integrated them with neural networks [54], approximated optimal transport with entropy regularization [288], and formulated convex relaxations for efficient learning [85].

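Global alignment: when ground-truth modality pairings are unavailable, alignment has to be performed globally across all elements of both modalities. Optimal transport (OT)-based approaches [278], which belong to a broader family of matching algorithms, are one potential solution: by posing alignment as a divergence minimization problem, they jointly optimize the coordination function and the optimal coupling between modality elements. These approaches are useful for aligning multimodal representation spaces [142, 205]. To ease the computational cost, recent work has integrated them with neural networks [54], approximated optimal transport with entropy regularization [288], and formulated convex relaxations for efficient learning [85].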
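As a rough illustration of entropy-regularized optimal transport, the sketch below runs Sinkhorn iterations on a fixed cost matrix and returns a soft coupling between two sets of elements. It covers only the coupling step; the joint learning of a coordination function described above is not shown. The function name `sinkhorn`, the `epsilon` value, and the toy cost matrix are assumptions, and the code is plain Python for self-containment.

```python
import math

def sinkhorn(cost, a, b, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT on a fixed cost matrix.

    cost[i][j]: cost of matching element i of modality A to element j of modality B.
    a, b: marginal weights of the two element sets (each sums to 1).
    Returns a coupling P whose row sums approximate a and column sums approximate b.
    """
    n, m = len(a), len(b)
    # Gibbs kernel; a smaller epsilon gives a sharper, closer-to-exact coupling.
    K = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: element 0 is cheap to match with 0, element 1 with 1.
P = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

In this toy case almost all of the mass in `P` sits on the diagonal, so the coupling recovers the pairing without being told which elements correspond.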

4.2 Subchallenge 2b: Continuous Alignment

So far, one important assumption we have made is that modality elements are already segmented and discretized. While certain modalities display clear segmentation (e.g., words/phrases in a sentence or object regions in an image), there are many cases where the segmentation is not readily provided, such as in continuous signals (e.g., financial or medical time-series), spatio-temporal data (e.g., satellite or weather images), or data without clear semantic boundaries (e.g., MRI images). In these settings, methods based on warping and segmentation have been recently proposed:

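So far we have assumed that modality elements are already segmented and discretized. Some modalities do have clear segmentation (e.g., words or phrases in a sentence, object regions in an image), but in many cases segmented elements are not readily available: continuous signals (e.g., financial or medical time series), spatio-temporal data (e.g., satellite or weather images), or data without clear semantic boundaries (e.g., MRI images). For these settings, methods based on warping and segmentation have recently been proposed: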

Fig. 11. Continuous alignment tackles the difficulty of aligning continuous signals where element segmentation is not readily available. We cover related work in (1) continuous warping of representation spaces and (2) modality segmentation of continuous signals into discrete elements at an appropriate granularity.
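Figure 11. Continuous alignment addresses the difficulty of aligning continuous signals for which element segmentation is not readily available. The related work covers (1) continuous warping of representation spaces and (2) modality segmentation of continuous signals into discrete elements at an appropriate granularity.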
🚩Continuous warping
👋 Note: for the meaning of "warping", see the Dynamic Time Warping (DTW) algorithm.

Continuous warping aims to align two sets of modality elements by representing them as continuous representation spaces and forming a bridge between these representation spaces.
Adversarial training is a popular approach to warp one representation space into another. Initially used in domain adaptation [27], adversarial training learns a domain-invariant representation across domains where a domain classifier is unable to identify which domain a feature came from [8]. These ideas have been extended to align multimodal spaces [100, 103, 181]. Hsu et al. [100] use adversarial training to align images and medical reports, Hu et al. [103] design an adversarial network for cross-modal retrieval, and Munro and Damen [181] design both self-supervised alignment and adversarial alignment objectives for multimodal action recognition. Dynamic time warping (DTW) [133] is a related approach to segment and align multi-view time series data. DTW measures the similarity between two sequences and finds an optimal match between them by time warping (inserting frames) such that they are aligned across segmented time boundaries. For multimodal tasks, it is necessary to design similarity metrics between modalities [17, 251]. DTW was extended using CCA to map the modalities to a coordinated space, allowing for both alignment (through DTW) and coordination (through CCA) between different modality streams jointly [260].

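Continuous warping aims to align two sets of modality elements by representing them as continuous representation spaces and building a bridge between those spaces. Adversarial training is a popular way to warp one representation space into another. First used in domain adaptation [27], it learns a domain-invariant representation for which a domain classifier cannot tell which domain a feature came from [8]. These ideas have been extended to aligning multimodal spaces [100, 103, 181]: Hsu et al. [100] use adversarial training to align images and medical reports, Hu et al. [103] design an adversarial network for cross-modal retrieval, and Munro and Damen [181] design both self-supervised and adversarial alignment objectives for multimodal action recognition. Dynamic time warping (DTW) [133] is a related approach for segmenting and aligning multi-view time series. DTW measures the similarity between two sequences and finds an optimal match between them by time warping (inserting frames), so that they are aligned across segmented time boundaries. Multimodal tasks require designing similarity metrics between modalities [17, 251]. DTW has also been extended with CCA, which maps the modalities into a coordinated space and thereby allows alignment (through DTW) and coordination (through CCA) of different modality streams to be done jointly [260].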
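The DTW step can be sketched with the classic dynamic-programming recurrence. This is a minimal single-modality version under stated assumptions: the function name `dtw` and the default absolute-difference distance are illustrative, and a real multimodal use would swap in a cross-modal similarity metric as noted above.

```python
def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Return (total cost, warping path) for aligning sequences x and y."""
    n, m = len(x), len(y)
    INF = float("inf")
    # acc[i][j]: minimal accumulated cost of aligning x[:i] with y[:j]
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = dist(x[i - 1], y[j - 1]) + step
    # Backtrack from the end to recover which frames were matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return acc[n][m], path

# The second sequence holds the value 2 for one extra frame.
cost, path = dtw([1, 2, 3], [1, 2, 2, 3])
# cost == 0.0 and path == [(0, 0), (1, 1), (1, 2), (2, 3)]
```

The path shows frame 1 of the first sequence being matched to two frames of the second, which is the "inserting frames" behavior described in the text.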

🚩Modality segmentation
Modality segmentation involves dividing high-dimensional data into elements with semantically meaningful boundaries. A common problem involves temporal segmentation, where the goal is to discover the temporal boundaries across sequential data. Several approaches for temporal segmentation include forced alignment, a popular approach to align discrete speech units with individual words in a transcript [309]. Malmaud et al. [167] explore multimodal alignment using a factored hidden Markov model to align ASR transcripts to the ground truth. Clustering approaches have also been used to group continuous data based on semantic similarity [165]. Clustering-based discretization has recently emerged as an important preprocessing step for generalizing language-based pretraining (with clear word/bytepair segmentation boundaries and discrete elements) to video or audio-based pretraining (without clear segmentation boundaries and continuous elements). By clustering raw video or audio features into a discrete set, approaches such as VideoBERT [243] perform masked pretraining on raw video and audio data. Similarly, approaches such as DALL·E [210], VQ-VAE [271], and CMCM [156] also utilize discretized intermediate layers obtained via vector quantization and showed benefits in modality alignment.

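Modality segmentation divides high-dimensional data into elements with semantically meaningful boundaries. A common problem is temporal segmentation, where the goal is to discover temporal boundaries in sequential data. Approaches to temporal segmentation include forced alignment, a popular way to align discrete speech units with the individual words of a transcript [309]. Malmaud et al. [167] explore multimodal alignment with a factored hidden Markov model to align ASR transcripts to the ground truth. Clustering approaches have also been used to group continuous data by semantic similarity [165].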

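Clustering-based discretization has recently become an important preprocessing step for extending language-based pretraining (clear word or byte-pair segmentation boundaries, discrete elements) to video- or audio-based pretraining (no clear segmentation boundaries, continuous elements). By clustering raw video or audio features into a discrete set, approaches such as VideoBERT [243] perform masked pretraining on raw video and audio data. Similarly, DALL·E [210], VQ-VAE [271], and CMCM [156] use discretized intermediate layers obtained through vector quantization and show benefits for modality alignment.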
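A minimal sketch of clustering-based discretization follows: a small k-means codebook is fitted to continuous feature vectors, and each vector is then replaced by the index of its nearest centroid, giving token-like ids. This is generic k-means plus nearest-neighbor lookup, not the exact procedure of VideoBERT, VQ-VAE, or any other cited method; the names `fit_codebook` and `quantize` and the toy features are assumptions.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

def fit_codebook(features, k, n_iters=10):
    """Plain k-means (Lloyd's algorithm), initialized from the first k features."""
    codebook = [list(f) for f in features[:k]]
    for _ in range(n_iters):
        tokens = quantize(features, codebook)
        for c in range(k):
            members = [f for f, t in zip(features, tokens) if t == c]
            if members:  # keep the old centroid if a cluster is empty
                codebook[c] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

# Hypothetical per-frame features forming two obvious groups.
frames = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 4.9], [0.0, 0.2]]
codebook = fit_codebook(frames, k=2)
tokens = quantize(frames, codebook)  # [0, 1, 0, 1, 0]
```

Once frames are turned into discrete ids like these, masked-prediction objectives designed for word tokens can be applied to them in the same way.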

4.3 Subchallenge 2c: Contextualized Representations

Finally, contextualized representation learning aims to model all modality connections and interactions to learn better representations. Contextualized representations have been used as an intermediate (often latent) step enabling better performance on a number of downstream tasks including speech recognition, machine translation, media description, and visual question-answering. We categorize work in contextualized representations into (1) joint undirected alignment, (2) cross-modal directed alignment, and (3) alignment with graph networks (Figure 12).

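Finally, contextualized representation learning aims to model all modality connections and interactions in order to learn better representations. Contextualized representations serve as an intermediate (often latent) step that improves performance on many downstream tasks, including speech recognition, machine translation, media description, and visual question answering. Work on contextualized representations is grouped into (1) joint undirected alignment, (2) cross-modal directed alignment, and (3) alignment with graph networks (Figure 12).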

Fig. 12. Contextualized representation learning aims to model modality connections to learn better representations. Recent directions include (1) joint undirected alignment that captures undirected symmetric connections, (2) cross-modal directed alignment that models asymmetric connections in a directed manner, and (3) graphical alignment that generalizes the sequential pattern into arbitrary graph structures.
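Figure 12. Contextualized representation learning models modality connections to learn better representations. Recent directions include (1) joint undirected alignment, which captures undirected, symmetric connections; (2) cross-modal directed alignment, which models asymmetric connections in a directed manner; and (3) graphical alignment, which generalizes the sequential pattern to arbitrary graph structures.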
🚩Joint undirected alignment
Joint undirected alignment aims to capture undirected connections across pairs of modalities, where the connections are symmetric in either direction. This is commonly referred to in the literature as unimodal, bimodal, trimodal interactions, and so on [164]. Joint undirected alignment is typically captured by parameterizing models with alignment layers and training end-to-end for a multimodal task. These alignment layers can include attention weights [47], tensor products [158, 310], and multiplicative interactions [117]. More recently, transformer models [273] have emerged as powerful encoders for sequential data by automatically aligning and capturing complementary features at different time steps. Building upon the initial text-based transformer model, multimodal transformers have been proposed that perform joint alignment using a full self-attention over modality elements concatenated across the sequence dimension (i.e., early fusion) [140, 243]. As a result, all modality elements become jointly connected to all other modality elements similarly (i.e., modeling all connections using dot-product similarity kernels).

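Joint undirected alignment aims to capture undirected connections across pairs of modalities, where the connections are symmetric in either direction. The literature commonly refers to these as unimodal, bimodal, trimodal interactions, and so on [164]. It is typically captured by parameterizing models with alignment layers and training end-to-end on a multimodal task. Such alignment layers include attention weights [47], tensor products [158, 310], and multiplicative interactions [117]. More recently, transformer models [273] have emerged as powerful encoders for sequential data because they automatically align and capture complementary features at different time steps. Building on the original text-based transformer, multimodal transformers perform joint alignment with full self-attention over modality elements concatenated along the sequence dimension (i.e., early fusion) [140, 243]. As a result, every modality element is connected to every other modality element in the same way (i.e., all connections are modeled with dot-product similarity kernels).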
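The early-fusion pattern can be sketched as scaled dot-product self-attention over the concatenation of two modality sequences. This is a bare-bones illustration under stated assumptions: there are no learned query, key, or value projections and no multiple heads, and the names `attend` and `joint_self_attention` and the toy vectors are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention; returns (contextualized outputs, weights)."""
    d = len(keys[0])
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
               for q in queries]
    outputs = [[sum(w * v[c] for w, v in zip(row, values))
                for c in range(len(values[0]))]
               for row in weights]
    return outputs, weights

def joint_self_attention(text_elems, image_elems):
    # Early fusion: concatenate along the sequence dimension, then let every
    # element attend to every other element with the same dot-product kernel.
    seq = text_elems + image_elems
    return attend(seq, seq, seq)

# Two text elements and one image element (toy 2-d features).
outputs, weights = joint_self_attention([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

Each row of `weights` spans both text and image positions, which is what it means for all modality elements to be jointly connected.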

🚩Cross-modal directed alignment
Cross-modal directed alignment relates elements from a source modality in a directed manner to a target modality, which can model asymmetric connections. For example, temporal attention models use alignment as a latent step to improve many sequence-based tasks [297, 318]. These attention mechanisms are typically directed from the output to the input so that the resulting weights reflect a soft alignment distribution over the input. Multimodal transformers perform directed alignment using query-key-value attention mechanisms to attend from one modality's sequence to another, before repeating in a bidirectional manner. This results in two sets of asymmetric contextualized representations to account for the possibly asymmetric connections between modalities [159, 248, 261]. These methods are useful for sequential data by automatically aligning and capturing complementary features at different time-steps [261]. Self-supervised multimodal pretraining has also emerged as an effective way to train these architectures, with the aim of learning general-purpose representations from larger-scale unlabeled multimodal data before transferring to specific downstream tasks via supervised fine-tuning [140]. These pretraining objectives typically consist of unimodal masked prediction, crossmodal masked prediction, and multimodal alignment prediction [93].

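Cross-modal directed alignment relates elements of a source modality to a target modality in a directed manner, which makes it possible to model asymmetric connections. For example, temporal attention models use alignment as a latent step to improve many sequence-based tasks [297, 318]. These attention mechanisms are typically directed from the output to the input, so the resulting weights reflect a soft alignment distribution over the input. Multimodal transformers perform directed alignment with query-key-value attention, attending from one modality's sequence to another's and then repeating in the opposite direction.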

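This yields two sets of asymmetric contextualized representations that account for the possibly asymmetric connections between modalities [159, 248, 261]. These methods are useful for sequential data because they automatically align and capture complementary features at different time steps [261]. Self-supervised multimodal pretraining has also emerged as an effective way to train these architectures: general-purpose representations are learned from larger-scale unlabeled multimodal data and then transferred to specific downstream tasks through supervised fine-tuning [140]. The pretraining objectives typically consist of unimodal masked prediction, cross-modal masked prediction, and multimodal alignment prediction [93].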
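Directed alignment can be sketched as cross-attention in which queries come from one modality and keys and values come from the other; running it in both directions gives two different weight matrices. This is a simplified illustration without learned projections, and the function name `cross_attend` and the toy `text` and `audio` sequences are assumptions.

```python
import math

def cross_attend(target, source):
    """Attend from each target element (query) over all source elements (keys/values)."""
    d = len(source[0])
    contextualized, weights = [], []
    for q in target:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in source]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        row = [e / z for e in exps]  # soft alignment distribution over the source
        weights.append(row)
        contextualized.append(
            [sum(w * s[c] for w, s in zip(row, source)) for c in range(d)]
        )
    return contextualized, weights

text = [[1.0, 0.0], [0.0, 1.0]]                 # 2 text elements
audio = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]    # 3 audio elements
text_ctx, text_to_audio = cross_attend(text, audio)    # weights are 2 x 3
audio_ctx, audio_to_text = cross_attend(audio, text)   # weights are 3 x 2
```

The two weight matrices differ in shape and are not transposes of each other, because each row is normalized over a different source sequence; this is the asymmetry the text refers to.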

🚩Graphical alignment
Graphical alignment generalizes the sequential pattern seen in undirected or directed alignment into arbitrary graph structures between elements. This has several benefits since it does not require all elements to be connected, and allows the user to choose different edge functions for different connections. Solutions in this subcategory typically make use of graph neural networks [275] to recursively learn element representations contextualized with the elements in locally connected neighborhoods [223, 275]. These approaches have been applied for multimodal sequential data through MTAG [301] that captures connections in human videos, and F2F-CL [289] that additionally factorizes nodes along speaker turns.

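Graphical alignment generalizes the sequential pattern seen in undirected or directed alignment to arbitrary graph structures between elements. This has several benefits: not all elements need to be connected, and the user can choose different edge functions for different connections. Solutions in this subcategory typically use graph neural networks [275] to recursively learn element representations contextualized with the elements in locally connected neighborhoods [223, 275]. These approaches have been applied to multimodal sequential data through MTAG [301], which captures connections in human videos, and F2F-CL [289], which additionally factorizes nodes along speaker turns.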
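The neighborhood-based contextualization can be sketched as one round of mean-aggregation message passing over a graph whose nodes are modality elements. This is a generic illustration, not MTAG or F2F-CL: it has no learned weights and uses a single edge function for all connections, and the name `message_passing` and the node labels are hypothetical.

```python
def message_passing(node_feats, edges, n_rounds=1):
    """Mean-aggregation message passing over an undirected graph.

    node_feats: dict mapping node name -> feature list.
    edges: list of (u, v) pairs; only connected elements exchange information.
    """
    neighbors = {n: [] for n in node_feats}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    feats = {n: list(f) for n, f in node_feats.items()}
    for _ in range(n_rounds):
        updated = {}
        for n, f in feats.items():
            group = [f] + [feats[m] for m in neighbors[n]]  # self + neighbors
            updated[n] = [sum(col) / len(group) for col in zip(*group)]
        feats = updated
    return feats

# A word node linked to one video frame; a second frame is left unconnected.
nodes = {"word0": [1.0, 0.0], "frame0": [0.0, 1.0], "frame1": [4.0, 4.0]}
out = message_passing(nodes, [("word0", "frame0")])
# out["word0"] == [0.5, 0.5]; out["frame1"] stays [4.0, 4.0]
```

The unconnected frame keeps its original features, which illustrates the point that a graph formulation does not require every element to be connected to every other.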


To be continued...
