UniME (MLLMs): Translation and Interpretation of "Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs"
Overview: The paper proposes the UniME framework to address key limitations of CLIP in multimodal representation learning: text truncation, isolated image-text encoding, and deficient compositionality. Leveraging a Multimodal Large Language Model (MLLM), UniME learns more discriminative and compositional multimodal representations through a two-stage approach of textual discriminative knowledge distillation and hard negative enhanced instruction tuning, achieving significant performance gains on multiple retrieval tasks.
>> Background and pain points: The existing Contrastive Language-Image Pre-training (CLIP) framework has three main limitations in multimodal representation learning:
● Text token truncation: CLIP cannot process long text because its text encoder truncates overly long token sequences, causing information loss.
● Isolated image-text encoding: CLIP encodes images and text separately and therefore cannot model the complex interactions between them.
● Deficient compositionality: Owing to its bag-of-words behavior, CLIP struggles to understand compositional semantics in text, limiting its use in complex scenarios.
● Although recent Multimodal Large Language Models (MLLMs) have made significant progress in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored.
>> Proposed solution: The paper presents UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks.
>> Core steps: The UniME framework consists of two stages:
● Stage 1, textual discriminative knowledge distillation: Distill textual discriminative knowledge from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. This step strengthens the model's ability to understand and represent text.
● Stage 2, hard negative enhanced instruction tuning: Apply hard negative enhanced instruction tuning to further advance discriminative representation learning. Concretely, the method first mitigates false negative contamination and then samples multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples (see the sketch after this list). This not only improves discriminative power but also enhances instruction-following ability on downstream tasks.
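A minimal PyTorch sketch of one way such a hard negative enhanced contrastive step could be implemented; the temperature, false-negative margin, and number of hard negatives are illustrative assumptions rather than the paper's actual hyperparameters:

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(query_emb, cand_emb,
                                   temperature=0.05,
                                   false_neg_margin=0.1,
                                   num_hard_negs=8):
    """InfoNCE-style loss in which, for each query, the diagonal candidate is
    the positive, off-diagonal candidates scoring within `false_neg_margin` of
    the positive are discarded as likely false negatives, and only the hardest
    `num_hard_negs` remaining negatives enter the denominator.
    All hyperparameters here are illustrative."""
    q = F.normalize(query_emb, dim=-1)                  # (B, D)
    c = F.normalize(cand_emb, dim=-1)                   # (B, D)
    sim = q @ c.t()                                     # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                       # (B, 1) positive similarity

    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

    # 1) Mask suspected false negatives: candidates nearly as similar as the positive.
    false_neg = off_diag & (sim >= pos - false_neg_margin)
    neg_sim = sim.masked_fill(~off_diag | false_neg, float('-inf'))

    # 2) Keep only the top-k hardest (most similar) remaining negatives per query.
    k = min(num_hard_negs, sim.size(0) - 1)
    hard_negs, _ = neg_sim.topk(k, dim=1)               # (B, k)

    # 3) InfoNCE over [positive, hard negatives] with temperature scaling.
    logits = torch.cat([pos, hard_negs], dim=1) / temperature
    labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)
```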
>> Advantages: Through these two stages, UniME effectively overcomes CLIP's limitations and offers the following benefits:
● Handles long text: Knowledge distillation from a powerful LLM teacher model enables the model to process longer text sequences.
● Models image-text interaction: Building on an MLLM allows the model to better capture the complex interactions between images and text.
● Stronger compositionality: Hard negative enhanced instruction tuning improves the model's understanding of compositional semantics.
● Better discrimination and instruction following: The model exhibits superior discriminative and compositional capabilities across multiple downstream tasks.
>> Conclusions and takeaways:
● Extensive experiments on the MMEB benchmark and multiple retrieval tasks (including short and long caption retrieval as well as compositional retrieval) show that UniME achieves consistent performance gains across all tasks, demonstrating superior discriminative and compositional capabilities.
● This suggests that textual knowledge distillation combined with hard negative enhanced instruction tuning on MLLMs is an effective way to learn transferable multimodal representations, offering a new route to breaking the modality barrier.
● UniME shows clear advantages in handling long text, modeling image-text interaction, and understanding compositional semantics, pointing to a new direction for multimodal representation learning.
Translation and Interpretation of "Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs"
Link | Paper: [2504.17432] Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs |
Date | April 24, 2025 |
Authors | University of Sydney, DeepGlint, Tongyi Lab (Alibaba Group), Imperial College London |
Abstract
The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
Figure 1. The UniME framework incorporates textual discriminative knowledge distillation and hard negative enhanced instruction tuning stages to learn discriminative representations for diverse downstream tasks. Our framework achieves state-of-the-art performance on both the MMEB benchmark and multiple retrieval tasks.
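The excerpt above does not spell out the stage-one distillation loss, so the snippet below is only a minimal sketch of one plausible formulation: matching the student's in-batch text-to-text similarity distribution to that of a frozen LLM-based teacher via KL divergence. The temperature and the KL form are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def text_distillation_loss(student_text_emb, teacher_text_emb, temperature=0.02):
    """Align the student's in-batch text similarity distribution with that of a
    frozen LLM-based teacher. This KL-over-similarities formulation is assumed
    for illustration; the paper's actual distillation objective may differ."""
    s = F.normalize(student_text_emb, dim=-1)    # (B, D_student)
    t = F.normalize(teacher_text_emb, dim=-1)    # (B, D_teacher), teacher frozen
    s_logits = s @ s.t() / temperature           # student text-to-text similarities
    t_logits = t @ t.t() / temperature           # teacher text-to-text similarities
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
```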
1. Introduction
Modern AI applications increasingly rely on multimodal embeddings to process diverse data types, powering essential tasks like image-text retrieval (Baldrati et al., 2023; Tang et al., 2025), Retrieval Augmented Generation (RAG) (Jiang et al., 2023; Cong et al., 2023), and Visual Question Answering (VQA) (Gardères et al., 2020; Faysse et al., 2024; Chun et al., 2021). As a seminal model, CLIP (Radford et al., 2021) demonstrates notable text-image retrieval performance via cross-modal contrastive supervision using large-scale web-collected image-text pairs. However, despite its widespread use, CLIP presents notable limitations. Firstly, it restricts text token length to 77, hindering its ability to process detailed descriptions and limiting its utility in cross-modal retrieval tasks that require extensive contextual information (Zhang et al., 2024c; Cao et al., 2024; Huang et al., 2024). Moreover, CLIP employs a dual-encoder architecture that processes images and text separately, which compromises its effectiveness in complex tasks such as instruction-following multimodal retrieval (Jiang et al., 2025; Liu et al., 2024a; Wei et al., 2024). Additionally, CLIP exhibits limited advanced language understanding, struggles with compositionality, and tends to display bag-of-words behavior (Yuksekgonul et al., 2022; Tschannen et al., 2023).

The success of Large Language Models (LLMs) (Touvron et al., 2023a, b; Grattafiori et al., 2024; Bai et al., 2023; Yang et al., 2024) has motivated researchers to adapt LLMs to understand multimodal inputs. Multimodal Large Language Models (MLLMs), as a key component in the construction of general-purpose AI assistants, have demonstrated remarkable progress (Liu et al., 2023, 2024b). For example, Qwen2-VL (Wang et al., 2024a) innovates beyond fixed-resolution visual processing, achieving robust performance across diverse image resolutions and aspect ratios. Similarly, LLaVA-OneVision (Li et al., 2024) introduces a unified modeling approach that enables effective task transfer across scenarios while maintaining architectural simplicity. While these MLLMs show impressive vision-language reasoning capabilities, they are inherently constrained by their autoregressive next-token prediction objective, which limits their effectiveness in learning multimodal representations compared to contrastive methods such as CLIP (Jiang et al., 2024b, 2025).
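For context, the cross-modal contrastive supervision used by CLIP-style dual encoders reduces to a symmetric InfoNCE loss over in-batch image-text pairs. The plain-PyTorch sketch below (with an illustrative temperature) also notes where the 77-token truncation discussed above enters the pipeline, namely before the text embeddings are produced.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch image-text pairs, as used by CLIP-style
    dual encoders. Note that CLIP's text branch truncates inputs to 77 tokens
    *before* producing text_emb, which is where long captions lose information."""
    img = F.normalize(image_emb, dim=-1)     # (B, D)
    txt = F.normalize(text_emb, dim=-1)      # (B, D)
    logits = img @ txt.t() / temperature     # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```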
Recent advances in LLM-based models have demonstrated substantial progress on the MTEB benchmark (Muennighoff et al., 2022). Inspired by these developments (Lee et al., 2024; BehnamGhader et al., 2024), researchers are now actively investigating MLLMs for unified multimodal representation learning. E5-V (Jiang et al., 2024b) proposes an unimodal contrastive learning approach that trains the language component of the MLLM on sentence pairs to bridge cross-modal representation disparities. However, this method encounters two primary constraints: (1) constraints arising from the limited scale and diversity of training data (Ouali et al., 2024); (2) inherent challenges caused by the MLLM's causal attention mechanism, which fundamentally restricts its ability to learn complex contextual representations (Man et al., 2024; Xie et al., 2024b, a). These factors collectively constrain the model's full embedding potential. VLM2Vec (Jiang et al., 2025) introduces the Massive Multimodal Embedding Benchmark (MMEB), comprising 36 datasets across 4 meta-tasks, and develops a contrastive framework that converts state-of-the-art vision-language models into embedding models through MMEB training. Nevertheless, the existence of false negative samples in the batch significantly complicates the discrimination of hard negative pairs when using the standard InfoNCE loss.
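Methods in this line typically obtain an embedding from a generative (M)LLM by prompting it and pooling a hidden state instead of sampling tokens. The sketch below illustrates last-token pooling with Hugging Face transformers, using a small text-only causal LM as a stand-in; the checkpoint and prompt are placeholders, and the exact prompting and pooling used by E5-V and VLM2Vec may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a lightweight stand-in for the language component of an MLLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def embed_text(text: str) -> torch.Tensor:
    """Last-token pooling: run a prompt through the causal LM and use the final
    hidden state of the last token as the embedding. A common recipe for turning
    causal LMs into embedding models; details vary across methods."""
    prompt = f"{text}\nSummarize the above text in one word:"  # illustrative prompt
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1]     # (1, seq_len, hidden_dim)
    return last_hidden[0, -1]                   # embedding of the final token
```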
To overcome these challenges, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that empowers multimodal large language models (as shown in Figure 1) to learn universal representations for diverse downstream vision-language tasks. In the first textual discriminative knowledge distillation stage, we leverage a powerful LLM-based teacher model to enhance the embedding capabilities of the MLLM's language component. In the second stage of hard negative enhanced instruction tuning, we first eliminate false negative contamination, then implement a hard negative sampling strategy that selects multiple challenging negatives per instance within each batch. This approach forces the model to focus on challenging negative samples, thereby learning more discriminative multimodal representations while also improving instruction-following ability in downstream tasks. We evaluate our approach comprehensively on the MMEB benchmark and multiple retrieval tasks, including both short and long caption retrieval and compositional retrieval. Experimental results demonstrate that UniME achieves significant performance improvement across all tasks, exhibiting both robust discriminative power and superior compositional understanding. The main contributions of this paper are summarized as follows:
• We present UniME (Universal Multimodal Embedding), a novel two-stage framework that empowers Multimodal Large Language Models (MLLMs) to learn universal representations for diverse downstream tasks.
• We propose textual discriminative knowledge distillation, leveraging a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component.
• We introduce hard negative enhanced instruction tuning to further advance discriminative representation learning through false negative filtering and hard negative sampling.
• We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including both short and long caption retrieval and compositional retrieval. Results show that UniME demonstrates robust discriminative and compositional capabilities, achieving notable performance improvements across all evaluated tasks.
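With embeddings of this kind, the retrieval evaluations mentioned above come down to cosine-similarity ranking. Below is a small illustrative sketch of Recall@K over precomputed query and candidate embeddings; it is not the MMEB evaluation code.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, cand_emb, gt_index, k=1):
    """Fraction of queries whose ground-truth candidate appears among the top-k
    cosine-similarity matches. `gt_index[i]` is the row of `cand_emb` that is
    the correct match for query i. Illustrative only."""
    q = F.normalize(query_emb, dim=-1)          # (Nq, D)
    c = F.normalize(cand_emb, dim=-1)           # (Nc, D)
    sim = q @ c.t()                             # (Nq, Nc)
    topk = sim.topk(k, dim=1).indices           # (Nq, k)
    hits = (topk == gt_index.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```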
Conclusion
In this paper, we introduce UniME (Universal Multimodal Embedding), a novel two-stage framework that equips multimodal large language models with the capacity to learn discriminative representations applicable to a variety of downstream tasks. In the first textual discriminative knowledge distillation stage, we leverage a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second hard negative enhanced instruction tuning stage, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities. We hope that our work provides insights into multimodal representation learning.