Are Instruction-Based Embeddings a Better Fit for RAG?

Article link: https://blog.csdn.net/python1222_/article/details/140009182


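Instruction-based embedding models combine large-scale pretraining with instruction fine-tuning, so a short text instruction is enough to adapt them to many tasks and domains, such as text classification, similarity computation, and information retrieval. Their main advantage is that no task-specific fine-tuning is needed, which makes them considerably more efficient and flexible to use. Users only have to write an instruction that describes the task to obtain embeddings suited to it, which makes these models useful across many application scenarios, especially for users with limited resources or little machine-learning background. This article introduces three current embedding models that support instructions: Instructor, BGE, and LLM Embedder.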

1 Instructor Embedding

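Instructor was released by the natural language processing group at the University of Hong Kong (HKU NLP). Given a simple instruction, the model produces text embeddings tailored to a task (classification, retrieval, clustering, text evaluation, and so on) and a domain (science, finance, and so on) without any fine-tuning. Its authors report state-of-the-art performance across 70 different embedding tasks. Its main features are:

No fine-tuning: supplying a task instruction is enough to generate embeddings for a given task and domain.
Multiple use cases: embedding custom text, computing similarity between texts, information retrieval, clustering, and more.
Easy to install and use: the project provides detailed installation and usage guides, including a notebook that can be run directly in Colab.
Multiple model sizes: Instructor is available in several sizes.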

https://github.com/xlang-ai/instructor-embedding

1.1 Computing Embeddings for Custom Text

To compute embeddings for text from a specific domain, proceed as follows:

from InstructorEmbedding import INSTRUCTOR

# Load the model
model = INSTRUCTOR('hkunlp/instructor-large')

# Prepare the texts, each paired with its instruction
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# Compute the embeddings; encode expects [instruction, text] pairs
customized_embeddings = model.encode([[pair["instruction"], pair["text"]] for pair in text_instruction_pairs])

# Print each instruction, text, and embedding
for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("Text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")
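When many texts share one task, the [instruction, text] pairs passed to model.encode above can be produced by a small helper. The following is a minimal sketch that assumes only the pair layout shown in this section; `build_pairs` is a hypothetical name chosen for illustration and is not part of the InstructorEmbedding package.

```python
from typing import List


def build_pairs(instruction: str, texts: List[str]) -> List[List[str]]:
    """Pair one shared instruction with every text, using the
    [instruction, text] layout that model.encode receives above.
    (Hypothetical helper, not part of InstructorEmbedding.)"""
    return [[instruction, text] for text in texts]


titles = [
    "3D ActionSLAM: wearable person tracking in multi-floor environments",
    "Parton energy loss in QCD matter",
]
pairs = build_pairs("Represent the Science title:", titles)
for pair in pairs:
    print(pair)
```

The resulting list could then be handed to model.encode in place of the list comprehension used above.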

1.2 Computing Similarity Between Texts

Use the INSTRUCTOR model to compute the similarity between two groups of texts:

from sklearn.metrics.pairwise import cosine_similarity

# Define the two groups of texts, each as [instruction, text] pairs
sentences_a = [['Represent the Science sentence: ','Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ','The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ','The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ','The funds rose less than 0.5 per cent on Friday']]

# Compute the embeddings (reuses the model loaded in section 1.1)
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)

# Compute the pairwise cosine similarities; entry [i][j] compares sentences_a[i] with sentences_b[j]
similarities = cosine_similarity(embeddings_a, embeddings_b)

# Print the similarity matrix
print(similarities)
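To make clear what the similarity matrix holds, here is a small pure-Python sketch of pairwise cosine similarity. It is an illustration on toy 2-dimensional vectors, not the scikit-learn implementation; `cosine` and `cosine_matrix` are hypothetical helper names, and zero-length vectors are not handled.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(rows_a: Sequence[Sequence[float]],
                  rows_b: Sequence[Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] compares rows_a[i] with rows_b[j]."""
    return [[cosine(u, v) for v in rows_b] for u in rows_a]


# Toy vectors standing in for real embeddings.
toy_a = [[1.0, 0.0], [1.0, 1.0]]
toy_b = [[2.0, 0.0], [0.0, 3.0]]
sims = cosine_matrix(toy_a, toy_b)
print(sims)
```

The result has one row per vector in the first group and one column per vector in the second, and it depends only on direction: scaling a vector leaves its scores unchanged.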

1.3 Information Retrieval

An example of information retrieval with the INSTRUCTOR model:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Define the query and the documents; queries and documents use different instructions
query = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism...'],
          ['Represent the Wikipedia document for retrieval: ',"The disparate impact theory is especially controversial..."],
          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices...']]

# Compute the embeddings (reuses the model loaded in section 1.1)
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)

# Compute the similarities and find the most relevant document
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
# With a single query the matrix has one row, so argmax gives the document index directly
retrieved_doc_id = np.argmax(similarities)

# Print the ID of the most relevant document
print(retrieved_doc_id)
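The argmax step above returns a single best document for a single query. A RAG pipeline usually wants the top few documents per query, which can be sketched as follows. This is a minimal pure-Python illustration on made-up similarity scores; `top_k` is a hypothetical helper name, and the numbers are not model output.

```python
from typing import List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy similarity rows: one row per query, one column per document.
toy_similarities = [
    [0.12, 0.83, 0.40],
    [0.70, 0.05, 0.65],
]
ranked = [top_k(row, k=2) for row in toy_similarities]
print(ranked)
```

Ranking each row separately also avoids a pitfall of calling argmax on a matrix with several query rows, where the returned index refers to the flattened matrix and not to a document.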

2 BGE

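BGE (BAAI General Embedding) is a text embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). It is a general-purpose, deep-learning-based embedding model for processing and understanding text. It supports multiple languages, including Chinese, and performs well on a range of tasks such as text retrieval and semantic similarity computation.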

https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding

2.1 Using FlagEmbedding


2.2 Using Sentence-Transformers

The bge models can also be used with sentence-transformers.


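For short-query-to-long-passage (s2p) retrieval, each short query should start with an instruction (the instructions are given in the model list). Documents do not need an instruction.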
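The asymmetry described here (instruction on queries, none on documents) can be sketched in plain Python before any model is involved. The instruction string below is the one I understand the English bge models to use, but treat it as an assumption and confirm it against the model list; `prepare_queries` and `prepare_passages` are hypothetical helper names.

```python
from typing import List

# Assumed query instruction for the English bge models; verify against the model list.
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def prepare_queries(queries: List[str], instruction: str = QUERY_INSTRUCTION) -> List[str]:
    """Prefix every short query with the retrieval instruction."""
    return [instruction + query for query in queries]


def prepare_passages(passages: List[str]) -> List[str]:
    """Passages are embedded unchanged, with no instruction."""
    return list(passages)


queries = prepare_queries(["where is the food stored in a yam plant"])
passages = prepare_passages(
    ["Capitalism has been dominant in the Western world since the end of feudalism..."]
)
print(queries[0])
print(passages[0])
```

The prepared strings would then be passed to whichever encoder is in use; only the query side changes.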


3 LLM Embedder

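LLM Embedder is another embedding model from BAAI, designed for LLM scenarios and fine-tuned on feedback from LLMs. It supports the retrieval-augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval. It is fine-tuned on six tasks: question answering, conversational search, long conversation, long-range language modeling, in-context learning, and tool learning.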

https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/llm_embedder/README.md
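Because one model serves several retrieval tasks, the instruction is what tells it which task a text belongs to, and queries and retrieval targets (keys) each get their own instruction. The sketch below shows that lookup in plain Python. The instruction strings and the `TASK_INSTRUCTIONS` table are illustrative assumptions and cover only two tasks; the README linked above is the place to find the exact wording for all six.

```python
from typing import Dict, List

# Illustrative table only: the strings are assumptions, check the README for the real ones.
TASK_INSTRUCTIONS: Dict[str, Dict[str, str]] = {
    "qa": {
        "query": "Represent this query for retrieving relevant documents: ",
        "key": "Represent this document for retrieval: ",
    },
    "tool": {
        "query": "Transform this user request for fetching helpful tool descriptions: ",
        "key": "Transform this tool description for retrieval: ",
    },
}


def with_instruction(task: str, role: str, texts: List[str]) -> List[str]:
    """Prefix texts with the instruction for a task and a role ('query' or 'key')."""
    instruction = TASK_INSTRUCTIONS[task][role]
    return [instruction + text for text in texts]


qa_queries = with_instruction("qa", "query", ["where is the food stored in a yam plant"])
qa_keys = with_instruction(
    "qa", "key", ["Capitalism has been dominant in the Western world since the end of feudalism..."]
)
print(qa_queries[0])
print(qa_keys[0])
```

Keeping the instructions in one table makes it hard to embed queries with one task's instruction and keys with another's by accident.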

3.1 Using `FlagEmbedding`


3.2 Using `transformers`


3.3 Using `sentence-transformers`

