BGE-M3论文《BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings 》翻译

利用GPT-Academic Report生成的,效果还不错。

## # Title:
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

## # Abstract:



In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model that realizes such strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.

## # Meta Translation

题目:BGE M3-Embedding: 通过自我知识蒸馏实现多语言、多功能、多层次的文本嵌入模型

摘要:本文提出了一种新的嵌入模型,称为M3-Embedding,其在多语言性、多功能性和多层次性方面具有多样性。它可以支持100多种工作语言,在多语言和跨语言检索任务上取得了新的最先进性能。它可以同时执行嵌入模型的三种常见检索功能:密集检索、多向量检索和稀疏检索,为真实世界的信息检索应用提供了统一的模型基础。它能够处理不同粒度的输入,从短句到最多8192个标记的长文档。M3-Embedding的有效训练包括以下技术贡献。我们提出了一种新颖的自我知识蒸馏方法,其中来自不同检索功能的相关性分数可以被集成为教师信号,以提高训练质量。我们还优化了批处理策略,实现了大批量和高训练吞吐量,以确保嵌入的区分度。据我们所知,M3-Embedding是第一个实现如此强大多样性的嵌入模型。该模型和代码将在https://github.com/FlagOpen/FlagEmbedding上公开。

## # Introduction

Embedding models are a critical form of DNN application in natural language processing. They encode the textual data in the latent space, where the underlying semantics of the data can be expressed by the output embeddings (Reimers and Gurevych, 2019; Ni et al., 2021a). With the advent of pre-trained language models, the quality of text embeddings has been substantially improved, making them imperative components for information retrieval (IR). One common form of embedding-based IR is dense retrieval, where relevant answers to the query can be retrieved based on the embedding similarity (Karpukhin et al., 2020a; Xiong et al., 2020; Neelakantan et al., 2022; Wang et al., 2022a; Xiao et al., 2023). Besides, the embedding model can also be applied to other IR tasks, such as multi-vector retrieval, where the fine-grained relevance between query and document is computed based on the interaction score of multiple embeddings (Khattab and Zaharia, 2020a), and sparse or lexical retrieval, where the importance of each term is estimated by its output embedding (Gao et al., 2021a; Lin and Ma, 2021a; Dai and Callan, 2020). Despite the widespread popularity of text embeddings, the existing methods are still limited in versatility. First of all, most of the embedding models are tailored only for English, leaving few viable options for the other languages. Secondly, the existing embedding models are usually trained for one single retrieval functionality. However, typical IR systems call for the compound workflow of multiple retrieval methods. Thirdly, it is challenging to train a competitive long-document retriever due to the overwhelming training cost, where most embedding models can only support short inputs.

嵌入模型是自然语言处理中关键的DNN应用形式。它们将文本数据编码到潜在空间中,其中数据的底层语义可以通过输出的嵌入进行表达(Reimers和Gurevych,2019;Ni等,2021a)。随着预训练语言模型的出现,文本嵌入的质量大大提高,使其成为信息检索(IR)中必不可少的组成部分。基于嵌入的IR的一种常见形式是密集检索,其中可以根据嵌入相似性检索与查询相关的答案(Karpukhin等,2020a;Xiong等,2020;Neelakantan等,2022;Wang等,2022a;Xiao等,2023)。此外,嵌入模型还可以应用于其他IR任务,例如多向量检索,其中根据多个嵌入的交互得分计算查询和文档之间的细粒度相关性(Khattab和Zaharia,2020a),以及稀疏或词汇检索,其中根据其输出嵌入估计每个词项的重要性(Gao等,2021a;Lin和Ma,2021a;Dai和Callan,2020)。尽管文本嵌入广受欢迎,但现有方法在多样性方面仍存在限制。首先,大多数嵌入模型仅适用于英语,对其他语言来说可选项较少。其次,现有的嵌入模型通常仅针对一种检索功能进行训练。然而,典型的IR系统需要多个检索方法的复合工作流程。第三,由于巨大的训练成本,训练具有竞争力的长文档检索器是具有挑战性的,其中大部分嵌入模型仅支持短输入。


To address the above challenges, we introduce M3-Embedding, which is distinguished by its breakthrough in versatility across working languages, retrieval functionalities, and input granularities. Particularly, M3-Embedding is proficient in multi-linguality, being able to support more than 100 world languages. By learning a common semantic space for different languages, it enables both multi-lingual retrieval within each language and cross-lingual retrieval between different languages. Besides, it is able to generate versatile embeddings to support different retrieval functionalities, not just dense retrieval, but also sparse retrieval and multi-vector retrieval. Finally, M3-Embedding is trained to process different input granularities, spanning from short inputs like sentences and passages to long documents of up to 8,192 input tokens.
The effective training of such a versatile embedding model poses a significant challenge. In our work, the following technical contributions are made to optimize the training quality. Firstly, we propose a novel self-knowledge distillation framework, where the multiple retrieval functionalities can be jointly learned and mutually reinforced. In M3-Embedding, the [CLS] embedding is used for dense retrieval, while embeddings from other tokens are used for sparse retrieval and multi-vector retrieval. Based on the principle of ensemble learning (Bühlmann, 2012), such heterogeneous predictors can be combined as a stronger predictor. Thus, we integrate the relevance scores from different retrieval functions as the teacher signal, which is used to enhance the learning process via knowledge distillation. Secondly, we optimize the batching strategy to achieve a large batch size and high training throughput, which substantially contributes to the discriminativeness of embeddings. Last but not least, we perform comprehensive and high-quality data curation. Our dataset consists of three sources: 1) the extraction of weakly supervised data from massive multi-lingual corpora, 2) the integration of closely related supervised data, 3) the synthesization of scarce training data. The three sources of data are complementary to each other and applied to different stages of the training process, which lays the foundation for the versatile text embeddings.

为了应对上述挑战,我们引入了M3-Embedding,这个模型以其在工作语言、检索功能和输入粒度的全面性突破而闻名。特别是,M3-Embedding在多语言能力方面表现出色,能够支持100多种世界语言。通过为不同语言学习一个共同的语义空间,它实现了在每种语言内的多语言检索以及不同语言之间的跨语言检索。此外,M3-Embedding能够生成多功能的嵌入向量,不仅用于密集检索,还可用于稀疏检索和多向量检索。最后,M3-Embedding能够处理不同的输入粒度,从短句子和段落到最多8,192个输入标记的长文档。

这样一个多功能嵌入模型的有效训练提出了重大挑战。在我们的工作中,为了优化训练质量,我们做出了以下技术贡献。首先,我们提出了一种全新的自我知识蒸馏框架,可以联合学习和相互增强多个检索功能。在M3-Embedding中,[CLS]嵌入用于密集检索,而其他标记的嵌入则用于稀疏检索和多向量检索。基于集成学习的原理(Bühlmann,2012),这样的异构预测器可以组合成更强的预测器。因此,我们将来自不同检索功能的相关性得分作为教师信号进行整合,用于通过知识蒸馏来增强学习过程。其次,我们优化了批处理策略,实现了大批次大小和高训练吞吐量,这对嵌入的区分度产生了重大影响。最后但并非最不重要的是,我们进行了全面且高质量的数据整理。我们的数据集包括三个来源:1)从大规模多语言语料库中提取的弱监督数据,2)相关监督数据的整合,3)稀缺训练数据的合成。这三个数据源相互补充,并应用于训练过程的不同阶段,为多功能文本嵌入奠定了基础。


M3-Embedding exhibits a remarkable versatility in our experiments. It achieves superior retrieval quality for a variety of languages, leading to state-of-the-art performances on popular multi-lingual and cross-lingual benchmarks like MIRACL (Zhang et al., 2023b) and MKQA (Longpre et al., 2021). It effectively learns the three retrieval functionalities, which can not only work individually but also work together for an even stronger retrieval quality. It also well preserves its superior capability across different input granularities within 8192 tokens, outperforming the existing methods by a notable advantage.
Our work makes the following contributions. 1) We present M3-Embedding, which notably advances the versatility of embedding models in multi-linguality, multi-functionality, and multi-granularity. 2) We propose the novel training framework of self-knowledge distillation, improve the training quality with the optimized batching strategy, and perform high-quality curation of training data. 3) Our model, code, and data will all be publicly available, which provides critical resources for both direct usage and future development of text embeddings.

在我们的实验中,M3-Embedding展现出了令人瞩目的多功能性。它在各种语言上实现了卓越的检索质量,从而在流行的多语言和跨语言基准测试(如MIR-ACL(Zhang et al.,2023b)和MKQA(Longpre et al.,2021))上取得了最先进的性能。它有效地学习了三种检索功能,不仅可以单独工作,还可以共同作用以获得更强大的检索质量。它还在8192个标记内良好保持了其超凡的能力,相比现有方法具有显著优势。

我们的工作做出了以下贡献。1)我们提出了M3-Embedding,显著提升了嵌入模型在多语言性、多功能性和多粒度上的适应性。2)我们提出了自我知识蒸馏的新型训练框架,通过优化的批处理策略改善了训练质量,并对训练数据进行高质量的策划。3)我们的模型、代码和数据将都公开提供,为文本嵌入的直接使用和未来的发展提供了重要资源。

## # Related Work

The related works are reviewed from three aspects: general text embeddings, embedding models for neural retrieval, and multi-lingual embeddings.
In the past few years, substantial progress has been achieved in the field of text embedding. One major driving force is the popularity of pre-trained language models, where the underlying semantics of the data can be effectively encoded by such powerful text encoders (Reimers and Gurevych, 2019; Karpukhin et al., 2020a; Ni et al., 2021a). In addition, the progress of contrastive learning is another critical factor, especially the improvement of negative sampling (Xiong et al., 2020; Qu et al., 2020) and the exploitation of knowledge distillation (Hofstätter et al., 2021; Ren et al., 2021; Zhang et al., 2021a). On top of these well-established techniques, it becomes increasingly popular to learn versatile embedding models, which are able to uniformly support a variety of application scenarios. So far, there have been many impactful methods in this direction, like Contriever (Izacard et al., 2022), GTR (Ni et al., 2021b), E5 (Wang et al., 2022a), BGE (Xiao et al., 2023), SGPT (Muennighoff, 2022), and Open Text Embedding (Neelakantan et al., 2022), which significantly advance the usage of text embeddings for general tasks.
One major application of embedding models is neural retrieval (Lin et al., 2022). By measuring the semantic relationship with the text embeddings, the relevant answers to the input query can be retrieved based on the embedding similarity. The most common form of embedding-based retrieval method is dense retrieval (Karpukhin et al., 2020a), where the text encoder's outputs are aggregated (e.g., via [CLS] or mean-pooling) to compute the embedding similarity. Another common alternative is known as multi-vector retrieval (Khattab and Zaharia, 2020b; Humeau et al., 2020), which applies fine-grained interactions over the text encoder's outputs to compute the embedding similarity. Finally, the text embeddings can also be transformed into term weights, which facilitates sparse or lexical retrieval (Luan et al., 2021; Dai and Callan, 2020; Lin and Ma, 2021b). Typically, the above retrieval methods are realized by different embedding models. To the best of our knowledge, no existing method is able to unify all these functionalities.
Despite the substantial technical advancement, most of the existing text embeddings are developed only for English, and other languages are lagging behind. To mitigate this problem, continual efforts have been made in multiple directions. One is the development of pre-trained multi-lingual text encoders, such as mBERT (Pires et al., 2019), mT5 (Xue et al., 2020), and XLM-R (Conneau et al., 2019). Another one is the curation of training and evaluation data for multi-lingual text embeddings, e.g., MIRACL (Zhang et al., 2023b), mMARCO (Bonifacio et al., 2021), Mr. TyDi (Zhang et al., 2021b), and MKQA (Longpre et al., 2021). At the same time, multi-lingual text embeddings are continually developed by the community, e.g., mDPR (Zhang et al., 2023a), mContriever (Izacard et al., 2022), mE5 (Wang et al., 2022b), etc. However, the current progress is still far from enough given the notable gap with English models and the huge imbalance between different languages.

相关工作从三个方面进行了综述: 通用文本嵌入、用于神经检索的嵌入模型、多语言嵌入。在过去几年中,文本嵌入领域取得了显著进展。其中一个主要推动力是预训练语言模型的普及,这些强大的文本编码器能够有效地编码数据的语义信息(Reimers and Gurevych, 2019; Karpukhin et al., 2020a; Ni et al., 2021a)。此外,对比学习的进展也是另一个关键因素,特别是负采样的改进(Xiong et al., 2020; Qu et al., 2020)和知识蒸馏的应用(Hofstätter et al., 2021; Ren et al., 2021; Zhang et al., 2021a)。在这些已有的技术基础上,学习多功能嵌入模型变得越来越受欢迎,这些模型能够统一支持各种应用场景。到目前为止,在这个方向上已经有了许多有影响力的方法,如Contriever (Izacard et al., 2022), GTR (Ni et al., 2021b), E5 (Wang et al., 2022a), BGE (Xiao et al., 2023), SGPT (Muennighoff, 2022), and Open Text Embedding (Neelakantan et al., 2022),这些方法显著推动了文本嵌入在通用任务中的应用。

嵌入模型的主要应用之一是神经检索(Lin et al., 2022)。通过衡量文本嵌入的语义关系,可以基于嵌入的相似性检索与输入查询相关的答案。基于嵌入的检索方法最常见的形式是密集检索(Karpukhin et al., 2020a),在该方法中,文本编码器的输出被聚合(例如通过[CLS]或平均池化)来计算嵌入的相似性。另一种常见的替代方法被称为多向量检索(Khattab and Zaharia, 2020b; Humeau et al., 2020),它将文本编码器的输出进行细粒度的交互来计算嵌入的相似性。最后,文本嵌入也可以转换为词项权重,以便进行稀疏或词汇检索(Luan et al., 2021; Dai and Callan, 2020; Lin and Ma, 2021b)。通常,以上的检索方法由不同的嵌入模型实现。据我所知,目前还没有存在的方法能够统一实现所有这些功能。

尽管在技术上取得了实质性的进展,但现有的大多数文本嵌入仅用于英文,其他语言的发展滞后。为了解决这个问题,从多个方向不断努力。其中一个是开发预训练的多语言文本编码器,如mBERT (Pires et al., 2019), mT5 (Xue et al., 2020), XLM-R (Conneau et al., 2019)。另一个是针对多语言文本嵌入的培训和评估数据的整理,例如MIRACL (Zhang et al., 2023b), mMARCO (Bonifacio et al., 2021), Mr. TyDi (Zhang et al., 2021b), MKQA (Longpre et al., 2021)。与此同时,社区不断发展多语言文本嵌入,如mDPR (Zhang et al., 2023a), mContriever (Izacard et al., 2022), mE5 (Wang et al., 2022b)等。然而,当前的进展仍远远不足,英文模型与其他语言模型之间存在显著差距,并且不同语言之间的差距巨大。

## # M3-Embedding

M3-Embedding realizes three-fold versatility. It supports a wide variety of languages and handles input data of different granularities. Besides, it unifies the common retrieval functionalities of text embeddings. Formally, given a query $q$ in an arbitrary language $x$, it is able to retrieve document $d$ in language $y$ from the corpus $D^y$: $d^y \leftarrow \mathrm{fn}^*(q^x, D^y)$. Here, $\mathrm{fn}^*(\cdot)$ belongs to any of the functions: dense, sparse/lexical, or multi-vector retrieval; $y$ can be another language or the same language as $x$.

M3-Embedding实现了三重的多功能性。它支持各种语言,并处理不同粒度的输入数据。此外,它统一了文本嵌入的常见检索功能。形式上,给定一个在任意语言x中的查询q,它能够从语料库Dy中检索语言y中的文档d:dy ← fn*(qx, Dy)。在这里,fn*(•)可以是密集检索、稀疏/词汇检索或多向量检索等其中之一;y可以是另一种语言或与x相同的语言。

## # Data Curation

The training of M3-Embedding calls for a large-scale and diverse multi-lingual dataset. In this work, we perform comprehensive data collection from three sources: the weak supervision data from unlabeled corpora, the fine-tuning data from labeled corpora, and the fine-tuning data via synthesization (shown in Table 1). The three data sources complement each other and are applied to different stages of the training process. Particularly, the weak supervision data is curated by extracting the rich-semantic structures, e.g., title-body, title-abstract, instruction-output, etc., within a wide variety of multi-lingual corpora, including Wikipedia, S2ORC (Lo et al., 2020), xP3 (Muennighoff et al., 2022), mC4 (Raffel et al., 2019), and CC-News (Hamborg et al., 2017). Besides, the well-curated weak supervision data from MTP (Xiao et al., 2023) is directly incorporated. To learn the unified embedding space for cross-lingual semantic matching, parallel sentences are introduced from two translation datasets, NLLB (NLLB Team et al., 2022) and CCMatrix (Schwenk et al., 2021). The raw data is filtered to remove potentially bad content and low-relevance samples. In total, it brings in 1.2 billion text pairs covering 194 languages and 2655 cross-lingual correspondences.
In addition, we collect relatively small but diverse and high-quality fine-tuning data from labeled corpora. For English, we incorporate eight datasets, including HotpotQA (Yang et al., 2018), TriviaQA (Joshi et al., 2017), NQ (Kwiatkowski et al., 2019), MS MARCO (Nguyen et al., 2016), COLIEE (Kim et al., 2022), PubMedQA (Jin et al., 2019), and the NLI data collected by SimCSE (Gao et al., 2021b). For Chinese, we incorporate seven datasets, including DuReader (He et al., 2017), mMARCO-ZH (Bonifacio et al., 2021), T²-Ranking (Xie et al., 2023), LawGPT (https://github.com/LiuHC0428/LAW-GPT), CMedQAv2 (Zhang et al., 2018), and LeCaRDv2 (Li et al., 2023). For other languages, we leverage the training data from Mr. TyDi (Zhang et al., 2021b) and MIRACL (Zhang et al., 2023b).
Finally, we generate synthetic data to mitigate the shortage of long-document retrieval tasks and introduce extra multi-lingual fine-tuning data (denoted as MultiLongDoc). Specifically, we sample lengthy articles from Wikipedia and mC4 and randomly choose paragraphs from them. Then we use GPT-3.5 to generate questions based on these paragraphs. The generated question and the sampled article constitute a new text pair for the fine-tuning data. Detailed specifications are presented in Appendix A.1.

M3-Embedding的训练需要一个大规模且多样化的多语言数据集。在这项工作中,我们从三个来源进行了全面的数据收集:未标记语料库的弱监督数据、标记语料库的微调数据,以及通过合成生成的微调数据(如表1所示)。这三个数据源相互补充,在训练过程的不同阶段应用。特别是,弱监督数据是通过从各种多语言语料库(包括维基百科、S2ORC、xP3、mC4和CC-News等)中提取丰富的语义结构(如标题-正文、标题-摘要、指令-输出等)来策划的。此外,还直接将来自MTP的经过精心策划的弱监督数据纳入其中。为了学习用于跨语言语义匹配的统一嵌入空间,我们从两个翻译数据集(NLLB和CCMatrix)中引入平行句子。原始数据经过过滤,以消除潜在的低相关性样本和不良内容。总共,我们带入了194种语言和2655种跨语言对应关系的12亿个文本对。

此外,我们从标记语料库中收集了规模相对较小但多样化且高质量的微调数据。对于英语,我们纳入了八个数据集,包括HotpotQA、TriviaQA、NQ、MS MARCO、COLIEE、PubMedQA和SimCSE收集的NLI数据。对于中文,我们纳入了七个数据集,包括DuReader、mMARCO-ZH、T²-Ranking、LawGPT、CMedQAv2和LeCaRDv2。对于其他语言,我们利用了Mr. TyDi和MIRACL的训练数据。

最后,我们生成了合成数据来缓解长文档检索任务的不足,并引入额外的多语言微调数据(称为MultiLongDoc)。具体来说,我们从维基百科和MC4数据集中抽样出长篇文章,并随机选择其中的段落。然后,我们使用GPT-3.5根据这些段落生成问题。生成的问题和抽样的文章构成了新的文本对,用于微调数据。详细的规范见附录A.1。

## # Hybrid Retrieval

M3-Embedding unifies all three common retrieval functionalities of the embedding model, i.e., dense retrieval, lexical (sparse) retrieval, and multi-vector retrieval. The formulations are presented as follows.
• Dense retrieval. The input query $q$ is transformed into the hidden states $\mathbf{H}_q$ based on a text encoder. We use the normalized hidden state of the special token "[CLS]" for the representation of the query: $e_q = \mathrm{norm}(\mathbf{H}_q[0])$. Similarly, we can get the embedding of passage $p$ as $e_p = \mathrm{norm}(\mathbf{H}_p[0])$. Thus, the relevance score between query and passage is measured by the inner product between the two embeddings $e_q$ and $e_p$: $s_{dense} \leftarrow \langle e_p, e_q \rangle$.
• Lexical Retrieval. The output embeddings are also used to estimate the importance of each term to facilitate lexical retrieval. For each term $t$ within the query (a term corresponds to a token in our work), the term weight is computed as $w_{q_t} \leftarrow \mathrm{Relu}(\mathbf{W}_{lex}^T \mathbf{H}_q[i])$, where $\mathbf{W}_{lex} \in \mathbb{R}^{d \times 1}$ is the matrix mapping the hidden state to a float number. If a term $t$ appears multiple times in the query, we only retain its max weight. We use the same way to compute the weight of each term in the passage. Based on the estimated term weights, the relevance score between query and passage is computed by the joint importance of the co-existing terms (denoted as $q \cap p$) within the query and passage: $s_{lex} \leftarrow \sum_{t \in q \cap p} (w_{q_t} \cdot w_{p_t})$.
• Multi-Vec Retrieval. As an extension of dense retrieval, the multi-vector method makes use of the entire output embeddings for the representation of query and passage: $E_q = \mathrm{norm}(\mathbf{W}_{mul}^T \mathbf{H}_q)$, $E_p = \mathrm{norm}(\mathbf{W}_{mul}^T \mathbf{H}_p)$, where $\mathbf{W}_{mul} \in \mathbb{R}^{d \times d}$ is the learnable projection matrix. Following ColBERT (Khattab and Zaharia, 2020b), we use late interaction to compute the fine-grained relevance score: $s_{mul} \leftarrow \frac{1}{N} \sum_{i=1}^{N} \max_{j=1}^{M} E_q[i] \cdot E_p^T[j]$, where $N$ and $M$ are the lengths of query and passage.
Thanks to the multi-functionality of the embedding model, the retrieval process can be conducted as a hybrid process. First of all, the candidate results can be individually retrieved by each of the methods (the multi-vector method can be exempted from this step due to its heavy cost). Then, the final retrieval result is re-ranked based on the integrated relevance score: $s_{rank} \leftarrow s_{dense} + s_{lex} + s_{mul}$.

M3-Embedding统一了嵌入模型的三种常见检索功能,即密集检索(dense retrieval),词汇检索(lexical retrieval)和多向量检索(multi-vec retrieval)。具体的公式如下所示:
- 密集检索:将输入的查询q通过文本编码器转化为隐藏状态Hq。我们使用特殊标记"[CLS]"的标准化隐藏状态来表示查询的嵌入向量:eq = norm(Hq[0])。类似地,我们可以得到文章p的嵌入向量为ep = norm(Hp[0])。因此,查询和文章之间的相关性得分通过两个嵌入向量eq和ep的内积来衡量:sdense ← 〈ep, eq〉。
- 词汇检索:也使用输出的嵌入向量来估计每个词汇的重要性,以方便词汇检索。对于查询中的每个词汇t(一个词汇对应我们工作中的一个标记),词汇权重计算如下:$w_{q_t} \leftarrow \mathrm{Relu}(\mathbf{W}_{lex}^T \mathbf{H}_q[i])$,其中$\mathbf{W}_{lex} \in \mathbb{R}^{d \times 1}$是将隐藏状态映射到浮点数的矩阵。如果一个词汇t在查询中出现多次,我们只保留它的最大权重。我们使用相同的方法计算文章中每个词汇的权重。基于估计的词汇权重,查询和文章之间的相关性得分通过共同存在的词汇(表示为$q \cap p$)的联合权重来计算:$s_{lex} \leftarrow \sum_{t \in q \cap p} (w_{q_t} \cdot w_{p_t})$。
- 多向量检索:作为密集检索的扩展,多向量方法利用全部输出的嵌入向量来表示查询和文章:$E_q = \mathrm{norm}(\mathbf{W}_{mul}^T \mathbf{H}_q)$,$E_p = \mathrm{norm}(\mathbf{W}_{mul}^T \mathbf{H}_p)$,其中$\mathbf{W}_{mul} \in \mathbb{R}^{d \times d}$是可学习的投影矩阵。我们采用ColBERT(Khattab and Zaharia, 2020b)的late interaction方法计算细粒度的相关性得分:$s_{mul} \leftarrow \frac{1}{N} \sum_{i=1}^{N} \max_{j=1}^{M} E_q[i] \cdot E_p^T[j]$,其中N和M分别是查询和文章的长度。
由于嵌入模型的多功能性,检索过程可以采用混合流程进行。首先,可以分别使用每种方法(由于其开销较大,多向量方法可以免去此步骤)检索候选结果。然后,根据整合的相关性得分重新对检索结果进行排名:srank ← sdense + slex + smul。
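
To make the hybrid formulation above concrete, here is a minimal PyTorch sketch of how the dense, lexical, and multi-vector scores could be computed from an encoder's last hidden states and then summed for re-ranking. It is an illustrative sketch rather than the released FlagEmbedding implementation: the hidden size, the `W_lex`/`W_mul` projection heads, and the unweighted fusion are assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

d = 1024  # hidden size of the text encoder (illustrative)

# Hypothetical projection heads mirroring W_lex (d x 1) and W_mul (d x d) in the text.
W_lex = torch.nn.Linear(d, 1, bias=False)
W_mul = torch.nn.Linear(d, d, bias=False)

def dense_score(Hq, Hp):
    # [CLS] hidden state (position 0), L2-normalized; relevance = inner product.
    e_q = F.normalize(Hq[0], dim=-1)
    e_p = F.normalize(Hp[0], dim=-1)
    return torch.dot(e_q, e_p)

def term_weights(H, token_ids):
    # w_t = ReLU(W_lex^T H[i]); if a token appears multiple times, keep its max weight.
    w = torch.relu(W_lex(H)).squeeze(-1)  # [seq_len]
    weights = {}
    for tid, wi in zip(token_ids, w.tolist()):
        weights[tid] = max(weights.get(tid, 0.0), wi)
    return weights

def lexical_score(Hq, q_ids, Hp, p_ids):
    wq, wp = term_weights(Hq, q_ids), term_weights(Hp, p_ids)
    shared = set(wq) & set(wp)  # co-existing terms, q ∩ p
    return sum(wq[t] * wp[t] for t in shared)

def multivec_score(Hq, Hp):
    # Late interaction: every query vector is matched with its best passage vector.
    Eq = F.normalize(W_mul(Hq), dim=-1)  # [N, d]
    Ep = F.normalize(W_mul(Hp), dim=-1)  # [M, d]
    sim = Eq @ Ep.T                      # [N, M]
    return sim.max(dim=1).values.mean()

def hybrid_score(Hq, q_ids, Hp, p_ids):
    # s_rank = s_dense + s_lex + s_mul (the simple unweighted sum described above).
    return (dense_score(Hq, Hp)
            + lexical_score(Hq, q_ids, Hp, p_ids)
            + multivec_score(Hq, Hp))
```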

## # Self-Knowledge Distillation

The embedding model is trained to discriminate the positive samples from the negative ones. For each of the retrieval methods, it is expected to assign a higher score to the query's positive samples compared with the negative ones. Therefore, the training process is conducted to minimize the InfoNCE loss, whose general form is presented by the following loss function:

$$\mathcal{L} = -\log \frac{\exp(s(q, p^*)/\tau)}{\sum_{p \in \{p^*, P'\}} \exp(s(q, p)/\tau)}. \quad (1)$$
Here, $p^*$ and $P'$ stand for the positive and negative samples to the query $q$; $s(\cdot)$ is any of the functions within $\{s_{dense}(\cdot), s_{lex}(\cdot), s_{mul}(\cdot)\}$.
The training objectives of different retrieval methods can be mutually conflicting with each other. Therefore, the naive multi-objective training can be unfavorable to the embedding's quality. To facilitate the optimization of multiple retrieval functions, we propose to unify the training process on top of self-knowledge distillation. Particularly, based on the principle of ensemble learning (Bühlmann, 2012), the predictions from different retrieval methods can be integrated as a more accurate relevance score given their heterogeneous nature. In the simplest form, the integration can just be the sum of the different prediction scores:
$$s_{inter} \leftarrow s_{dense} + s_{lex} + s_{mul}. \quad (2)$$
In previous studies, the training quality of the embedding model can benefit from knowledge distillation, which takes advantage of fine-grained soft labels from another ranking model (Hofstätter et al., 2021). In this place, we simply employ the integration score $s_{inter}$ as the teacher, where the loss function of each retrieval method is modified as:
$$\mathcal{L}'_* \leftarrow -p(s_{inter}) \cdot \log p(s_*). \quad (3)$$
Here, $p(\cdot)$ is the softmax activation; $s_*$ is any of the members within $s_{dense}$, $s_{lex}$, and $s_{mul}$. We further integrate and normalize the modified loss function:
$$\mathcal{L}' \leftarrow (\mathcal{L}'_{dense} + \mathcal{L}'_{lex} + \mathcal{L}'_{mul})/3. \quad (4)$$
Finally, we derive the final loss function for self-knowledge distillation as the linear combination of $\mathcal{L}$ and $\mathcal{L}'$:
$$\mathcal{L}_{final} \leftarrow \mathcal{L} + \mathcal{L}'.$$
The overall training process is a multi-stage workflow (Figure 2). Firstly, the text encoder is pre-trained with the massive weak supervision data, where only dense retrieval is trained in the basic form of contrastive learning. The self-knowledge distillation is applied to the second stage, where the embedding model is fine-tuned to establish the three retrieval functionalities. Both the labeled data and the synthetic data are used at this stage.

自我知识蒸馏

该嵌入模型的训练目标是通过区分正样本和负样本来提高特征表达的区分能力。对于每种检索方法,期望将查询的正样本得分相对于负样本得分更高。因此,训练过程通过最小化InfoNCE损失来实现,其一般形式由以下损失函数表示:

$$\mathcal{L} = -\log \frac{\exp(s(q, p^*)/\tau)}{\sum_{p \in \{p^*, P'\}} \exp(s(q, p)/\tau)}. \quad (1)$$
这里,p * 和 P ′ 分别表示查询 q 的正样本和负样本;s(•) 是 {s dense (•), s lex (•), s mul (•)} 中的任意一个函数。

不同检索方法的训练目标可能会相互冲突,因此原生的多目标训练可能不利于嵌入质量的提高。为了便于优化多个检索功能,我们提出在自我知识蒸馏基础上统一训练过程。特别是,基于集成学习原理(Bühlmann, 2012),来自不同检索方法的预测可以作为更准确的相关性分数进行整合,考虑到它们的异构性。在最简单的形式中,整合可以简单地是不同预测分数的总和:
$$s_{inter} \leftarrow s_{dense} + s_{lex} + s_{mul}. \quad (2)$$

先前的研究表明,嵌入模型的训练质量可以从知识蒸馏中受益,即利用来自另一个排序模型的精细软标签(Hofstätter et al., 2021)。在这里,我们简单地将整合分数 s inter 作为教师,修改每种检索方法的损失函数如下:
$$\mathcal{L}'_* \leftarrow -p(s_{inter}) \cdot \log p(s_*). \quad (3)$$
这里,p(•) 是softmax激活函数;s * 是 s dense ,s lex 和 s mul 中的任何一个成员。我们进一步整合和归一化修改后的损失函数:
$$\mathcal{L}' \leftarrow (\mathcal{L}'_{dense} + \mathcal{L}'_{lex} + \mathcal{L}'_{mul})/3. \quad (4)$$
最后,通过线性组合 L 和 L ′ 来得到最终的损失函数,用于自我知识蒸馏:
$$\mathcal{L}_{final} \leftarrow \mathcal{L} + \mathcal{L}'.$$

整个训练过程是一个多阶段的工作流程(图2)。首先,文本编码器使用大量的弱监督数据进行预训练,其中仅以对比学习的基本形式训练了稠密检索。自我知识蒸馏应用于第二阶段,在该阶段通过微调嵌入模型来建立三种检索功能。该阶段同时使用了标注数据和合成数据进行训练。
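
The self-knowledge distillation objective can be sketched in a few lines of PyTorch. The snippet below assumes that, for each query, the three scoring heads have already produced score matrices of shape `[batch, 1 + num_negatives]` with the positive passage in column 0; the temperature value, detaching the teacher score, and the 1/3 averaging of the basic losses are illustrative assumptions consistent with Eq. (1)-(4) above, not the exact released training code.

```python
import torch
import torch.nn.functional as F

def info_nce(scores, tau=0.05):
    # Eq. (1): scores has shape [batch, 1 + num_negatives]; column 0 is the positive.
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores / tau, labels)

def self_kd_loss(s_dense, s_lex, s_mul, tau=0.05):
    # Basic InfoNCE loss for each retrieval head (averaging the three is an assumption).
    L = (info_nce(s_dense, tau) + info_nce(s_lex, tau) + info_nce(s_mul, tau)) / 3

    # Teacher signal: the integrated score of Eq. (2), detached from the graph.
    s_inter = (s_dense + s_lex + s_mul).detach()
    p_teacher = F.softmax(s_inter / tau, dim=-1)

    def distill(s_student):
        # Eq. (3): soft cross-entropy between teacher and student distributions.
        log_p = F.log_softmax(s_student / tau, dim=-1)
        return -(p_teacher * log_p).sum(dim=-1).mean()

    # Eq. (4): average the three distilled losses, then combine with L.
    L_prime = (distill(s_dense) + distill(s_lex) + distill(s_mul)) / 3
    return L + L_prime  # L_final = L + L'
```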

## # Efficient Batching

The embedding model needs to learn from diverse and massive multi-lingual data to fully capture the general semantic of different languages. It also needs to keep the batch size as large as possible (where a huge amount of in-batch negatives can be leveraged) so as to ensure the discriminativeness of text embeddings. Given the limitations on GPU's memory and computation power, people usually truncate the input data into short sequences for high throughput of training and a large batch size. However, the common practice is not a feasible option for M3-Embedding because it needs to learn from both short and long-sequence data to effectively handle the input of different granularities.
In our work, we improve the training efficiency by optimizing the batching strategy, which enables high training throughput and large batch sizes. Particularly, the training data is pre-processed by being grouped by sequence length. When producing a mini-batch, the training instances are sampled from the same group. Due to the similar sequence lengths, this significantly reduces sequence padding and facilitates a more effective utilization of the GPUs. Besides, when sampling the training data for different GPUs, the random seed is always fixed, which ensures the load balance and minimizes the waiting time in each training step. Besides, when handling long-sequence training data, the mini-batch is further divided into sub-batches, which take a smaller memory footprint. We iteratively encode each sub-batch using gradient checkpointing (Chen et al., 2016) and gather all the generated embeddings in the last step. Finally, the embeddings from different GPUs are broadcast, allowing each device to obtain all embeddings in the distributed environment, which notably expands the scale of in-batch negative samples.

我们通过优化批处理策略来改善训练效率,从而实现高训练吞吐量和大批量大小。具体而言,训练数据根据序列长度进行分组预处理。在生成小批量时,训练实例从同一组中进行采样。由于序列长度相似,这显著减少了序列填充,并更有效地利用了GPU。此外,在为不同GPU采样训练数据时,始终固定随机种子,以确保负载平衡,并尽量减少每个训练步骤中的等待时间。此外,当处理长序列训练数据时,小批量进一步分为子批次,从而占用更少的内存。我们使用梯度检查点(Chen等人,2016)迭代地编码每个子批次,并在最后一步收集所有生成的嵌入。最后,不同GPU的嵌入被广播,使得每个设备都可以在分布式环境中获取所有嵌入,从而显著扩大了批内负样本的规模。
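
As a rough illustration of the batching strategy, the sketch below groups training examples into length buckets and draws every mini-batch from a single bucket, with a fixed random seed shared across data-parallel ranks. The bucket boundaries, field names, and per-bucket batch sizes are assumptions for illustration; the actual grouping used for Table 8 is not reproduced here.

```python
import random
from collections import defaultdict

def build_length_buckets(examples, boundaries=(500, 1000, 2000, 4000, 8192)):
    # Group examples by the length range of their longest field (illustrative boundaries).
    buckets = defaultdict(list)
    for ex in examples:
        length = max(len(ex["query_tokens"]), len(ex["passage_tokens"]))
        for bound in boundaries:
            if length <= bound:
                buckets[bound].append(ex)
                break  # examples longer than the last boundary are simply skipped here
    return buckets

def iterate_batches(buckets, batch_size_per_bucket, seed=42):
    # A fixed seed keeps the sampling identical across GPUs, balancing per-step load;
    # batches drawn from one bucket share similar lengths, so padding is minimal.
    rng = random.Random(seed)
    for bound, items in buckets.items():
        rng.shuffle(items)
        bs = batch_size_per_bucket[bound]
        for i in range(0, len(items), bs):
            yield items[i:i + bs]
```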

## # Experiment

We investigate the following critical issues regarding the effectiveness of M3-Embedding in our experiments: 1) the retrieval performance on different languages; 2) the retrieval performance with different retrieval functionalities; 3) the retrieval performance with different input granularities; 4) the impact of each technical factor. In the following parts, we analyze the above issues with our discussions on multi-lingual retrieval, cross-lingual retrieval, long-doc retrieval, and ablation studies.

我们在实验中研究了M3-Embedding的有效性方面的以下关键问题:1)在不同语言上的检索性能;2)在不同功能性上的检索性能;3)在不同输入粒度上的检索性能;4)每个技术因素的影响。在接下来的部分,我们通过对多语言检索、跨语言检索、长文档检索以及消融研究的探讨,分析了以上问题。

## # Multi-Lingual Retrieval

We evaluate the multi-lingual retrieval performance with MIRACL (Zhang et al., 2023b), which consists of ad-hoc retrieval tasks in 18 languages. Each task is made up of queries and passages presented in the same language. Following the official benchmark, we evaluate our method using Pyserini (Lin et al., 2021), and use nDCG@10 as the primary evaluation metric (Recall@100 is also measured and reported in Appendix B). We incorporate the following baselines in our experiment: the lexical retrieval method BM25 (Robertson and Zaragoza, 2009), and the dense retrieval methods mDPR, mContriever, mE5-large, and E5-mistral-7b. We can make the following observations according to the experiment results in Table 2. Firstly, M3-Embedding already achieves a superior retrieval performance with only its dense retrieval functionality (denoted as Dense). It not only outperforms other baseline methods in the average performance, but also maintains a consistent empirical advantage in most of the individual languages. Even compared with E5-mistral-7b, which leverages a much larger Mistral-7B model as the text encoder and is specifically trained with English data, our method is able to produce a similar result in English and notably higher results in the other languages. Besides, the sparse retrieval functionality (denoted as Sparse) is also effectively trained by M3-Embedding, as it outperforms the typical BM25 method in all languages. We can also observe the additional improvement from multi-vector retrieval (denoted as Multi-vec), which relies on fine-grained interactions between the query's and passage's embeddings to compute the relevance score. Finally, the collaboration of the dense and sparse methods, e.g., Dense+Sparse, leads to a further improvement over each individual method; and the collaboration of all three methods (denoted as All) brings forth the best performance.

我们使用MIRACL(Zhang等,2023b)评估了M3-Embedding在多语言检索性能方面的表现,其中包括18种语言的特定检索任务。每个任务由查询和同一语言的文本段落组成。根据官方基准,我们使用Pyserini(Lin等,2021)评估我们的方法,并以nDCG@10作为主要评估指标(在附录B中还测量和报告了Recall@100)。我们在实验中引入了以下基准方法:基于词汇的检索方法BM25(Robertson和Zaragoza,2009);基于密集表示的检索方法mDPR、mContriever、mE5-large和E5-mistral-7b。根据表2中的实验结果,我们可以得出以下观察结果。首先,只使用其密集检索功能(表示为Dense),M3-Embedding已经取得了卓越的检索性能。它不仅在平均性能上优于其他基准方法,而且在大多数单个语言上都保持了一致的实证优势。即使与E5-mistral-7b相比,后者利用了更大的Mistral-7B模型作为文本编码器并专门针对英文数据进行了训练,我们的方法在英语中也能产生类似的结果,并在其他语言中得到显著更高的结果。此外,M3-Embedding还有效地训练了稀疏检索功能(表示为Sparse),因为它在所有语言中均优于典型的BM25方法。我们还观察到多向量检索(表示为Multi-vec)的额外改进,它依赖于查询和段落嵌入之间的细粒度交互来计算相关性得分。最后,密集方法和稀疏方法的协作(例如Dense+Sparse)相较于每个单独的方法进一步提高;而所有三种方法的协作(表示为All)带来了最佳性能。
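
For readers unfamiliar with the metric, the nDCG@10 score used above can be computed directly from the relevance labels of a ranked list; a minimal implementation with linear gains is sketched below (in the experiments the metric is produced by Pyserini's evaluation tooling, so this is only a reference).

```python
import math

def dcg(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked results (linear gain variant).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ordering of the same labels.
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: binary relevance of the top-10 retrieved passages for one query.
print(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # ≈ 0.92
```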

## # Cross-Lingual Retrieval

We evaluate the cross-lingual retrieval performance with the MKQA benchmark (Longpre et al., 2021), which includes queries in 25 non-English languages. For each query, it needs to retrieve the ground-truth passage from the English Wikipedia corpus. In our experiment, we make use of the well-processed corpus offered by BEIR (Thakur et al., 2021), available at https://huggingface.co/datasets/BeIR/nq. Following the previous study (Karpukhin et al., 2020b), we report Recall@100 as the primary metric (Recall@20 is reported as an auxiliary metric in the Appendix).
The experiment result is shown in Table 3. Similar to our observation in multi-lingual retrieval, M3-Embedding continues to produce a superior performance, where it notably outperforms other baseline methods purely with its dense retrieval functionality (Dense). The collaboration of different retrieval methods brings in further improvements, leading to the best empirical performance of cross-lingual retrieval. Besides, we can also observe the following interesting results which are unique to this benchmark. Firstly, the performance gaps are not as significant as on MIRACL, where competitive baselines like E5-mistral-7b are able to produce similar or even better results on some of the testing languages. However, the baselines are prone to bad performances on many other languages, especially the low-resource languages, such as ar, km, he, etc. In contrast, M3-Embedding maintains relatively stable performances in all languages, which can largely be attributed to its pre-training over comprehensive weak supervision data. Secondly, although M3-Embedding (Sparse) is still better than BM25, it performs badly compared with the other methods. This is because there are only very limited co-existing terms for cross-lingual retrieval, as the query and passage are presented in different languages.

我们使用MKQA基准测试(Longpre等,2021)对跨语言检索性能进行评估,该基准测试包括25种非英语语言的查询。对于每个查询,需要从英文维基百科语料库中检索出与之相匹配的段落。在我们的实验中,我们使用由BEIR 12(Thakur等,2021)提供的经过良好处理的语料库。按照先前的研究(Karpukhin等,2020b)的做法,我们报告Recall@100作为主要指标(Recall@20作为附加指标在附录中报告)。

实验结果见表3。与我们在多语言检索中的观察类似,M3-Embedding继续表现出卓越的性能,它以其稠密检索功能(Dense)明显优于其他基线方法。不同检索方法的协同作用带来进一步的改进,导致了最佳的跨语言检索实证性能。此外,我们还可以观察到以下基准测试独有的有趣结果。首先,性能差距不像MIRACL那样显著,竞争性基线方法如E5 Mistral-7b在某些测试语言上能够产生类似甚至更好的结果。然而,对于许多其他语言,特别是低资源语言(如ar,km,he等),基线方法容易表现糟糕。相比之下,M3-Embedding在所有语言中保持相对稳定的性能,这在很大程度上归因于其在全面的弱监督数据上的预训练。其次,尽管M3-Embedding(Sparse)仍然优于BM25,但与其他方法相比表现不佳。这是因为对于跨语言检索,查询和段落以不同语言呈现,存在非常有限的共现术语。

## # Multilingual Long-Doc Retrieval

We evaluate the retrieval performance on longer sequences with two benchmarks: MLDR (Multilingual Long-Doc Retrieval), which is curated from multilingual articles in Wikipedia and mC4 (see Table 7), and NarrativeQA (Kočiský et al., 2018; Günther et al., 2024), which is English-only; for NarrativeQA we use the evaluation pipeline from Günther et al. (2024). In addition to the previous baselines, we further introduce JinaEmbeddingv2 (https://huggingface.co/jinaai/jina-embeddings-v2-base-en), text-embedding-ada-002, and text-embedding-3-large from OpenAI, given their outstanding long-doc retrieval capability.
Table 5: Evaluation on NarrativeQA (nDCG@10).
The evaluation result on MLDR is presented in Table 4. It can be observed that M3 (Dense) is able to outperform all baselines with notable advantages. Interestingly, M3 (Sparse) turns out to be an even more effective method for long document retrieval, which achieves a further improvement of about 12 points over the dense method. Besides, the multi-vector retrieval is also impressive, which brings a 5.1+ point improvement over M3 (Dense). Finally, the combination of different retrieval methods leads to a remarkable average performance of 65.0.
To explore the reason for M3-Embedding's competitiveness in long-document retrieval, we perform an ablation study by removing the long-sequence data from the fine-tuning stage (denoted as w.o. long). After this modification, the dense method, i.e., Dense-w.o.long, can still outperform the majority of baselines, which indicates that its empirical advantage has been well established during the pre-training stage. We also propose a simple strategy, MCLS, to address this situation (no long-text retrieval data for fine-tuning). In MCLS, we insert a CLS token for every fixed number of tokens, and the final text embedding is obtained by averaging the last hidden states of all CLS tokens. Experimental results indicate that MCLS can significantly improve the effectiveness of document retrieval.
We make further analysis with NarrativeQA (Table 5), where we have similar observations as MLDR. Besides, with the growing of sequence length, our method gradually expands its advantage over baseline (Figure 4), which reflects its proficiency in handling long inputs.

我们使用两个基准对较长序列的检索性能进行评估:由维基百科和mC4的多语言文章构建的MLDR(多语言长文档检索,见表7),以及仅适用于英语的NarrativeQA。除了之前的基准方法,我们还引入了JinaEmbeddingv2、text-embedding-ada-002和text-embedding-3-large这几个模型,因为它们在长文档检索能力方面表现突出。

在MLDR上的评估结果见表4。可以观察到,M3(Dense)以显著优势优于所有的基准模型。有趣的是,M3(Sparse)方法在长文档检索方面效果更好,相比密集方法又提升了约12个点。此外,多向量检索方法也表现出色,相比M3(Dense)提升了5.1个点以上。最终,不同检索方法的组合使得平均性能达到了65.0。

为了探索M3-Embedding在长文档检索方面具有竞争力的原因,我们进行了去除细调阶段长序列数据的剔除研究(标记为"去掉长文本")。经过此修改后,密集方法(即Dense-w.o.long)仍然能够优于大多数基准模型,这表明其在预训练阶段的经验优势已经确立。我们还提出了一种简单的策略,即MCLS,用于应对没有用于细调的长文本检索数据的情况。在MCLS中,我们在每个固定令牌数的位置插入一个CLS令牌,并通过对所有CLS令牌的最后隐藏状态进行平均,得到最终的文本嵌入。实验结果表明,MCLS能够显著提高文档检索的效果。

我们通过对NarrativeQA的进一步分析(见表5),得出与MLDR相似的观察结论。此外,随着序列长度的增加,我们的方法逐渐扩大了其对基准模型的优势(见图4),反映了它在处理长输入方面的熟练程度。
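
The MCLS strategy described above is simple enough to sketch directly: insert a [CLS] token before every fixed-size chunk of the long input, then average the last hidden states at all [CLS] positions. The chunk interval of 256 tokens and the final normalization are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def insert_multiple_cls(token_ids, cls_id, interval=256):
    # Insert a [CLS] token before every `interval`-token chunk of the input.
    out, cls_positions = [], []
    for start in range(0, len(token_ids), interval):
        cls_positions.append(len(out))
        out.append(cls_id)
        out.extend(token_ids[start:start + interval])
    return out, cls_positions

def mcls_embedding(last_hidden_state, cls_positions):
    # last_hidden_state: [seq_len, hidden]; average the hidden states of all [CLS] tokens.
    cls_states = last_hidden_state[cls_positions]  # [num_cls, hidden]
    return F.normalize(cls_states.mean(dim=0), dim=-1)
```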

## # Ablation study

The ablation study is performed to analyze the impact of self-knowledge distillation (skd). Particularly, we disable the distillation processing and have each retrieval method trained independently (denoted as M3-w.o.skd). According to our evaluation on MIRACL (Table 6), the original method, i.e., M3 w.skd, brings in better performances than the ablation method in all settings, i.e., Dense, Sparse, and Multi-vec. Notably, the impact is more pronounced for sparse retrieval, which indicates the incompatibility between dense and sparse retrieval methods.

进行消融研究旨在分析自我知识蒸馏(skd)对模型的影响。特别地,我们禁用了蒸馏过程,将每个检索方法独立训练(标记为M3-w.o.skd)。根据我们在MIRACL数据集上的评估结果(表6),原始方法,即M3 w.skd,在所有设置下都比消融方法表现更好,即Dense、Sparse、Multi-vec。值得注意的是,对于稀疏检索方法,影响更为明显,这表明稠密和稀疏检索方法之间存在不兼容性。

## # Conclusion

In this paper, we present M3-Embedding, which achieves notable versatility in supporting multi-lingual retrieval, handling input of diverse granularities, and unifying different retrieval functionalities. We perform comprehensive and high-quality curation of training data, optimize the learning process with self-knowledge distillation, and improve the training throughput and batch size with efficient batching. The effectiveness of M3-Embedding is verified by our experimental studies, where it leads to superior performances on multi-lingual retrieval, cross-lingual retrieval, and multi-lingual long-doc retrieval tasks.

在本文中,我们提出了M3-Embedding模型,该模型在支持多语言检索、处理不同粒度输入和统一不同检索功能方面具有显著的多功能性。我们对训练数据进行了全面且高质量的筛选,通过自我知识蒸馏优化了学习过程,并通过有效的批处理提升了训练速度和批处理大小以确保嵌入的区分性。我们的实验研究验证了M3-Embedding的有效性,证明其在多语言检索、跨语言检索和多语言长文档检索任务上表现出优越的性能。

## # A.1 Synthetic Data

The prompt for GPT3.5 is "You are a curious AI assistant, please generate one specific and valuable question based on the following text. The generated question should revolve around the core content of this text, and avoid using pronouns (e.g., "this"). Note that you should generate only one question, without including additional content:". The details of generated datasets are shown in Table 7.

GPT3.5的提示是“你是一个好奇的AI助手,请根据以下文本生成一个具体而有价值的问题。所生成的问题应围绕着文本的核心内容,并避免使用代词(例如,“此”)。需要注意的是,你应该只生成一个问题,不包括其他内容。”生成的数据集的详细信息如表7所示。
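
A minimal sketch of this synthetic-data step with the OpenAI Python client is given below; the model name, sampling temperature, and paragraph-splitting logic are assumptions rather than the authors' exact pipeline, and the prompt is the one quoted above.

```python
import random
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "You are a curious AI assistant, please generate one specific and valuable "
    "question based on the following text. The generated question should revolve "
    "around the core content of this text, and avoid using pronouns (e.g., \"this\"). "
    "Note that you should generate only one question, without including additional content:"
)

def make_text_pair(article: str) -> dict:
    # Randomly pick one paragraph from a lengthy article and ask GPT-3.5 for a question.
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    paragraph = random.choice(paragraphs)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{paragraph}"}],
        temperature=0.7,
    )
    question = resp.choices[0].message.content.strip()
    # The generated question and the sampled article form one fine-tuning text pair.
    return {"query": question, "passage": article}
```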

## # A.2 Experimental Hyperparameters

We adopt a further pre-trained XLM-RoBERTa (https://huggingface.co/FacebookAI/xlm-roberta-large) as the foundational model. We extend the max position to 8192 and update the model via the RetroMAE (Xiao et al., 2022) method. The data comprises the Pile (Gao et al., 2020), Wudao (Yuan et al., 2021), and mC4 (Raffel et al., 2019) datasets. We sampled a total of 184 million text samples from these sources, covering 105 languages. The maximum sequence length is 8192 and the learning rate is 7 × 10⁻⁵. The batch size is set to 32 and we accumulate the gradient over 16 steps. Pre-training is conducted on 32 A100 (40GB) GPUs for 20,000 steps.
For the pre-training with the massive weak supervision data, the max lengths of the query and passage are set to 512 and 8192, respectively. The learning rate is 5 × 10⁻⁵, the warmup ratio is 0.1, and the weight decay is 0.01. For training data with different sequence length ranges (e.g., 0-500, 500-1000, etc.), we use different batch sizes. The details are presented in Table 8. The second stage is conducted on 96 A800 (80GB) GPUs. The language and length distributions of the weakly supervised data are illustrated in Figure 5.
In the fine-tuning stage, we sample 7 negatives for each query. Refer to Table 8 for the batch size. In the initial phase, we employed approximately 6000 steps to perform warm-up on the dense embedding, sparse embedding, and multi-vectors. Subsequently, we conducted unified training with self-knowledge distillation. These experiments were carried out on 24 A800 (80GB) GPUs.

我们采用了进一步预训练的XLM-RoBERTa 15作为基础模型。我们将最大位置扩展到8192,并通过Retro-MAE (Xiao et al., 2022)方法更新模型。数据包括Pile (Gao et al., 2020)、Wudao (Yuan et al., 2021)和mC4 (Raffel et al., 2019)数据集。我们从这些来源中采样了1.84亿个文本样本,覆盖了105种语言。最大序列长度为8192,学习率为7 × 10 -5。批量大小设置为32,并在16个步骤中累积梯度。预训练在32个A100(40GB) GPU上进行了20,000个步骤。

对于使用大规模弱监督数据进行预训练的情况,查询和段落的最大长度分别设置为512和8192。学习率为5 × 10 -5,预热比例为0.1,权重衰减为0.01。对于不同序列长度范围的训练数据(例如,0-500,500-1000等),我们使用不同的批量大小。具体细节请参见表8。第二阶段在96个A800(80GB) GPU上进行。弱监督数据的语言和长度分布如图5所示。

在微调阶段,我们对每个查询采样了7个负样本。批量大小请参考表8。在初始阶段,我们使用了大约6000个步骤对密集嵌入、稀疏嵌入和多向量进行预热。随后,我们进行了包括自知识蒸馏在内的统一训练。这些实验在24个A800(80GB) GPU上进行。

## # A.3 Split-batch Method

Algorithm 1 provides the pseudo-code of the split-batch strategy. For the current batch, we partition it into multiple smaller sub-batches. For each sub-batch, we utilize the model to generate embeddings, discarding all intermediate activations via gradient checkpointing during the forward pass. Finally, we gather the encoded results from all sub-batches and obtain the embeddings for the current batch. It is crucial to enable the gradient-checkpointing strategy; otherwise, the intermediate activations for each sub-batch will continuously accumulate, ultimately occupying the same amount of GPU memory as traditional methods.

以下是Split-batch方法的伪代码,详见算法1。对于当前的批次,我们将其分割成多个较小的子批次。对于每个子批次,我们利用模型生成嵌入,通过前向传递时使用梯度检查点来丢弃所有的中间激活结果。最后,我们从所有子批次中收集编码结果,并得到当前批次的嵌入。启用梯度检查点策略非常关键,否则每个子批次的中间激活结果将持续累积,在最终占用与传统方法相同的GPU内存量。
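
Since Algorithm 1 itself is not reproduced in this post, here is a hedged PyTorch sketch of the split-batch idea: the large mini-batch is encoded sub-batch by sub-batch under gradient checkpointing, and the per-sub-batch embeddings are concatenated at the end. The `encoder` interface and the sub-batch size are assumptions; in practice the same effect can be obtained by enabling the encoder's built-in gradient checkpointing.

```python
import torch
from torch.utils.checkpoint import checkpoint

def encode_with_split_batch(encoder, input_ids, attention_mask, sub_batch_size=4):
    """Encode a large batch as several smaller sub-batches.

    Gradient checkpointing discards intermediate activations during the forward
    pass and recomputes them in the backward pass, so peak memory is bounded by
    one sub-batch rather than by the whole batch.
    """
    embeddings = []
    for start in range(0, input_ids.size(0), sub_batch_size):
        ids = input_ids[start:start + sub_batch_size]
        mask = attention_mask[start:start + sub_batch_size]
        emb = checkpoint(encoder, ids, mask, use_reentrant=False)
        embeddings.append(emb)
    # Gather the encoded results from all sub-batches into the full-batch embeddings.
    return torch.cat(embeddings, dim=0)
```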

## # B Additional Results

In this section, we present additional evaluation results on the MIRACL and MKQA benchmarks, as shown in Tables 9 and 10.

在本节中,我们展示了在MIRACL和MKQA基准测试上的额外评估结果,如表9和表10所示。