(17-6-04) Retrieval-Augmented Generation (RAG): Long-Context Retriever + Multi-Vector Retriever

码农三叔

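Column: 大模型从入门到实战 (Large Models: From Beginner to Practice). Tags: python, artificial intelligence, bert, langchain, natural language, programming language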

Article link: https://blog.csdn.net/asd343442/article/details/138338389

5.6.7 Long-Context Retriever

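In LangChain, the LongContextReorder class addresses the drop in answer quality that occurs when a model is given a long retrieved context. The drop typically arises because models struggle to use all of the information in a long context, especially information located in the middle of it. LongContextReorder improves the model's use of the context by rearranging the order of the retrieved documents.

LongContextReorder reorders the list of retrieved documents: it places the most relevant documents at the beginning and the end of the list and moves the less relevant ones to the middle, so that the model can attend to the most important information more easily. By optimizing document order in this way, LongContextReorder helps maintain model performance with long contexts, particularly when specific information has to be found among a large number of documents.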
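The reordering idea can be sketched in a few lines of plain Python. This is only an illustrative sketch and not LangChain's own implementation, which may differ in detail: the function name `reorder_for_long_context` is hypothetical, and integers stand in for documents that are already sorted from most to least relevant.

```python
def reorder_for_long_context(docs_by_relevance):
    """Put the most relevant items at both ends and the least relevant in the middle.

    `docs_by_relevance` must be sorted from most to least relevant.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        # Alternate between the two ends: even ranks go toward the tail,
        # odd ranks go toward the head.
        if rank % 2 == 0:
            back.append(doc)
        else:
            front.append(doc)
    # `back` was filled from most to least relevant, so reverse it to make
    # the single most relevant item the very last element.
    return front + back[::-1]


ranked = list(range(10))  # 0 = most relevant, 9 = least relevant
reordered = reorder_for_long_context(ranked)
print(reordered)  # [1, 3, 5, 7, 9, 8, 6, 4, 2, 0]
```

The two most relevant items (0 and 1) end up at the two ends of the list, while the least relevant ones (8 and 9) sit in the middle.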

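In practice, LongContextReorder can be combined with a LangChain retriever (for example, one built on a Chroma vector store) so that the retrieved results are presented to the model in a more useful order. The following example uses LangChain to retrieve documents about a specific topic, reorder them for a long context, and then answer a question based on them.

Example 5-7: Optimizing retrieved information in a long context (source path: codes\5\jian07.py)

The implementation of the example file jian07.py is shown below.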

from langchain.chains import LLMChain, StuffDocumentsChain
from langchain.prompts import PromptTemplate
from langchain_chroma import Chroma
from langchain_community.document_transformers import (
    LongContextReorder,
)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import OpenAI

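# Load the embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Sample texts
texts = [
    "Basketball is a great sport.",
    "Fly me to the moon is one of my favorite songs.",
    "The Celtics are my favorite team.",
    "This is a document about the Boston Celtics.",
    "I simply love going to the movies.",
    "The Boston Celtics won the game by 20 points.",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird is an iconic NBA player.",
]

# Create the retriever
retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)
query = "What can you tell me about the Celtics?"

# Retrieve the relevant documents, sorted by relevance score
docs = retriever.get_relevant_documents(query)
print(docs)

# Reorder the documents:
# less relevant documents end up in the middle of the list,
# more relevant documents at the beginning and the end.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)
# Confirm that the four relevant documents are at the beginning and end of the list.
print(reordered_docs)

# Prepare and run a custom "stuff" chain that uses the reordered documents as context.

# Override the prompts
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)
document_variable_name = "context"
llm = OpenAI()
stuff_prompt_override = """Given these text excerpts:
-----
{context}
-----
Please answer the following question:
{query}"""
prompt = PromptTemplate(
    template=stuff_prompt_override, input_variables=["context", "query"]
)

# Instantiate the chains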
llm_chain = LLMChain(llm=llm, prompt=prompt)
chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
)
chain.run(input_documents=reordered_docs, query=query)

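The code above works as follows:

(1) Load the embedding model: the HuggingFaceEmbeddings class loads a pretrained text embedding model; this example uses "all-MiniLM-L6-v2".

(2) Define the text list: a list of short text snippets on different topics, such as basketball, music, movies, and video games.

(3) Create the retriever: Chroma.from_texts builds a vector store from the texts and their embeddings, and as_retriever turns it into a retriever that returns the documents relevant to a user query. The search_kwargs parameter is set to {"k": 10}, so each query returns at most 10 documents.

(4) Run the retrieval: the retriever's get_relevant_documents method fetches the documents relevant to the user query ("What can you tell me about the Celtics?").

(5) Reorder the documents: the LongContextReorder class reorders the retrieved documents, placing highly relevant documents at the beginning and end of the list and less relevant ones in the middle. This helps with long contexts because models usually make better use of information at the start and end of the context than in the middle.

(6) Prepare the prompts: PromptTemplate objects define the prompts used by the chain. The overriding prompt takes the document content (context) and the query as input variables.

(7) Instantiate the chains: LLMChain and StuffDocumentsChain are created; they process the reordered documents and generate the answer.

(8) Run the chain: the run method of StuffDocumentsChain takes the reordered documents and the query as input and produces the final answer.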
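To make steps (6) to (8) more concrete, the sketch below shows roughly what a "stuff" chain does before calling the model: it renders every document with the per-document template, joins the results, and fills the outer prompt. It is a dependency-free illustration under stated assumptions, not LangChain code: the `Doc` class, the `build_stuff_prompt` function, and the blank-line separator are choices made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def build_stuff_prompt(docs, query, separator="\n\n"):
    # Render each document with the per-document template "{page_content}",
    # then join everything into a single context string.
    context = separator.join("{page_content}".format(page_content=d.page_content) for d in docs)
    template = (
        "Given these text excerpts:\n-----\n{context}\n-----\n"
        "Please answer the following question:\n{query}"
    )
    return template.format(context=context, query=query)


docs = [Doc("The Celtics are my favorite team."), Doc("This is just a random text.")]
prompt_text = build_stuff_prompt(docs, "What can you tell me about the Celtics?")
print(prompt_text)
```

Because all documents are placed into one prompt in list order, the order produced by the reordering step directly determines where each document appears in the model's context.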

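Running the code produces output similar to the following:

# Retrieved documents in their original order
[
  Document(page_content='This is a document about the Boston Celtics'),
  Document(page_content='L. Kornet is one of the best Celtics players.'),
  Document(page_content='Larry Bird is an iconic NBA player.'),
  Document(page_content='The Boston Celtics won the game by 20 points.'),
  Document(page_content='I simply love going to the movies.'),
  Document(page_content='Fly me to the moon is one of my favorite songs.'),
  Document(page_content='This is just a random text.'),
  Document(page_content='Elden Ring is one of the best games in the last 15 years.'),
  Document(page_content='Basketball is a great sport.')
]

# Documents after reordering
[
  Document(page_content='The Celtics are my favorite team.'),
  Document(page_content='The Boston Celtics won the game by 20 points.'),
  Document(page_content='This is just a random text.'),
  Document(page_content='I simply love going to the movies.'),
  Document(page_content='Fly me to the moon is one of my favorite songs.'),
  Document(page_content='Elden Ring is one of the best games in the last 15 years.'),
  Document(page_content='Basketball is a great sport.'),
  Document(page_content='L. Kornet is one of the best Celtics players.'),
  Document(page_content='This is a document about the Boston Celtics.')
]

# Final prompt and generated answer
'''
Given these text excerpts:
-----
The Celtics are my favorite team.
The Boston Celtics won the game by 20 points.
This is just a random text.
I simply love going to the movies.
Fly me to the moon is one of my favorite songs.
Elden Ring is one of the best games in the last 15 years.
Basketball is a great sport.
L. Kornet is one of the best Celtics players.
This is a document about the Boston Celtics.
-----
Please answer the following question:
What can you tell me about the Celtics?
'''

The Celtics are a very well-known NBA basketball team with a long history. According to the provided excerpts, the Celtics won a game by a clear margin. In addition, L. Kornet is mentioned as one of the team's best players, and Larry Bird is honored as an iconic NBA player. This information suggests that the Celtics have a rich record of success and an outstanding roster of players.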

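The purpose of this example is to help the model attend to and process the relevant information in a long context more effectively, which improves the accuracy of retrieval-based answers. Reordering the documents reduces the performance degradation that models can show when processing long texts.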

5.6.8 Multi-Vector Retriever

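In LangChain, the MultiVectorRetriever class is a multi-vector retriever: it lets you store and search multiple vectors for each document. This approach is particularly useful when information has to be retrieved from several angles or through different representations of the same document. MultiVectorRetriever provides a flexible mechanism for improving the performance and accuracy of a retrieval system.

The main features of the multi-vector retriever are as follows.

Multiple vectors per document: unlike a conventional retriever, MultiVectorRetriever can store several vectors for each document. These vectors can represent different parts of the document, a summary, related questions, or other information associated with the document.
Flexible retrieval strategies: MultiVectorRetriever supports several search types, including similarity search (based on the similarity between vectors) and maximal marginal relevance (MMR) search (which selects a set of documents that are relevant to the query while being dissimilar to one another).
Custom vector generation: you decide how the vectors for a document are produced, for example by splitting the document, generating a summary, or creating hypothetical questions.
Precise control: you control which representations are indexed for retrieval and therefore which vectors can lead back to each document.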
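The difference between plain similarity search and MMR can be illustrated with a small, dependency-free sketch. It is not the code used by LangChain or Chroma: the function names, the `lambda_mult` parameter name, and the two-dimensional toy vectors are assumptions made for illustration. Each step greedily picks the candidate with the best trade-off between relevance to the query and similarity to what has already been selected.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
vectors = [
    [1.0, 0.1],     # very relevant
    [1.0, 0.12],    # near-duplicate of the first vector
    [0.5, -0.866],  # less relevant, but points in a different direction
]
print(mmr_select(query, vectors, k=2))                   # [0, 2]
print(mmr_select(query, vectors, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the selection reduces to ordinary similarity ranking and returns the two near-duplicates; with `lambda_mult=0.5` the near-duplicate is skipped in favor of a more diverse result.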

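In practice, MultiVectorRetriever is used together with a vector store (such as Chroma), on which it relies to save and search the vectors. An implementation based on MultiVectorRetriever needs the following key components:

Vector store: the system that stores the vectors of the derived representations (chunks, summaries, questions).
Byte store: the storage layer that holds the original documents, including their content and metadata.
ID key: the metadata key whose value is the unique identifier linking entries in the vector store to documents in the byte store.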
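How these three components cooperate can be sketched without any LangChain dependency. The sketch below is a simplified illustration under stated assumptions: a dictionary stands in for the byte store, a list stands in for the vector store, a word-overlap count replaces real embedding similarity, and all function names are hypothetical.

```python
import uuid

ID_KEY = "doc_id"

# Byte store stand-in: maps a parent ID to the full parent document.
docstore = {}
# Vector store stand-in: small searchable entries that point back to a parent.
index = []


def add_parent(text, children):
    parent_id = str(uuid.uuid4())
    docstore[parent_id] = text
    for child in children:
        index.append({"text": child, "metadata": {ID_KEY: parent_id}})
    return parent_id


def search_children(query, k=4):
    # Toy relevance score: number of query words that occur in the entry.
    words = query.lower().split()
    scored = [(sum(w in e["text"].lower() for w in words), e) for e in index]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


def retrieve_parents(query, k=4):
    ids = []
    for entry in search_children(query, k):
        parent_id = entry["metadata"][ID_KEY]
        if parent_id not in ids:  # several children may share one parent
            ids.append(parent_id)
    return [docstore[i] for i in ids]


speech = "Full text of a long speech that thanks Justice Breyer and covers the economy."
essay = "Full text of a long essay about startups."
add_parent(speech, ["thanks to justice breyer", "justice breyer is retiring", "the economy is growing"])
add_parent(essay, ["startups drive innovation"])

print(retrieve_parents("justice breyer"))
```

Two child entries match the query, but both carry the same parent ID, so the parent document is returned only once. Searching the index directly would return the small child entries instead.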

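Working with MultiVectorRetriever typically involves the following calls, some of which belong to the retriever and some to its underlying stores:

retriever.vectorstore.add_documents: adds the derived documents (chunks, summaries, questions) and their vectors to the vector store.
retriever.docstore.mset: stores the original documents under their IDs.
retriever.get_relevant_documents: returns the list of original documents relevant to the user query.
retriever.vectorstore.similarity_search: runs a vector-similarity search directly on the vector store and returns the derived documents.

In practice, MultiVectorRetriever is mainly used in the following ways.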

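1. Splitting documents

A long document is split into smaller parts, and a vector is generated for each part. The following example uses MultiVectorRetriever to split documents and generate a vector for each part.

Example 5-8: Splitting documents and generating a vector for each part (source path: codes\5\jian08.py)

The implementation of the example file jian08.py is shown below.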

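import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("example1.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# Split the loaded text into large parent documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

# Create the vector store and the byte store
vectorstore = Chroma(collection_name="full_documents", embedding_function=OpenAIEmbeddings())
store = InMemoryByteStore()
id_key = "doc_id"

# Initialize the MultiVectorRetriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Generate a unique ID for each parent document
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Split each parent document into smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        # Tag every chunk with the ID of its parent document
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

# Add the chunks (and their vectors) to the vector store,
# and the parent documents to the document store
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Query the underlying vector store directly: this returns the small chunks
# that are most similar to the query "justice breyer"
retriever.vectorstore.similarity_search("justice breyer")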

In the code above, we first load the document and split it into smaller chunks with RecursiveCharacterTextSplitter. We generate a unique ID for each parent document, tag every sub-document with the ID of its parent, and add the sub-documents and their vectors to the Chroma vector store, while the parent documents are saved in the byte store. Finally, we run a similarity search on the retriever's underlying vector store to find the sub-documents closest to a query (such as "justice breyer"). Calling retriever.get_relevant_documents with the same query would instead return the larger parent documents.

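2. Generating summaries

A summary is created for each document, and a vector is generated for the summary text. The following example uses MultiVectorRetriever to generate document summaries and index the vectors of the summary texts.

Example 5-9: Generating document summaries and vectors with MultiVectorRetriever (source path: codes\5\jian09.py)

The implementation of the example file jian09.py is shown below.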

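import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("paul_graham_essay.txt"),
    TextLoader("state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

# Build a LangChain chain that summarizes a document
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})  # summarize in parallel for efficiency

# Vector store that indexes the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# Storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"  # metadata key linking vector-store entries to stored parent documents

# Initialize the MultiVectorRetriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Generate a unique ID per document and wrap each summary in a Document
doc_ids = [str(uuid.uuid4()) for _ in range(len(summaries))]
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_id})
    for summary, doc_id in zip(summaries, doc_ids)
]

# Add the summaries (and their vectors) to the vector store,
# and the original documents to the document store
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
# The retriever matches the query against the summaries and returns
# the original documents those summaries belong to
retriever.get_relevant_documents("justice breyer")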

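In the code above, a LangChain chain first generates a summary for each document. A Chroma vector store is then created to index the summaries, together with an InMemoryByteStore that holds the original documents. Next, MultiVectorRetriever is initialized, and each summary is wrapped in a Document object and added to the vector store. Finally, retriever.get_relevant_documents runs the retrieval for a query (such as "justice breyer"): the query is matched against the summaries, and the method returns the original documents whose summaries matched. Running the example produces output similar to the following:

# Output of the summary-generation chain
[
    "Paul Graham's essay highlights the critical role of startups in driving technological progress and innovation.",
    "The State of the Union address emphasizes the nation's economic growth, job creation, and commitment to tackling future challenges together."
    # ... summaries of the other documents ...
]

# Summaries related to "justice breyer"
[
    Document(page_content="The State of the Union address emphasizes the importance of nominating a Supreme Court justice and introduces Judge Ketanji Brown Jackson as the nominee.", metadata={'doc_id': '56345bff-3ead-418c-a4ff-dff203f77474'})
    # ... possibly more matching summaries ...
]

# Length of the content of the first retrieved document
90  # text length of the first matching summary

In the output above, chain.batch generated a summary for every document, and the summaries were indexed in the vector store used by MultiVectorRetriever. The second listing shows what vectorstore.similarity_search returns when the vector store is queried directly (for example with "justice breyer"): the matching summary documents, each carrying the doc_id of its original document. The last value is the printed text length of the first match.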

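3. Hypothetical questions

A set of hypothetical questions is generated for each document, and vectors are generated for these questions. The following example uses MultiVectorRetriever to generate hypothetical questions and index their vectors.

Example 5-10: Generating hypothetical questions and vectors with MultiVectorRetriever (source path: codes\5\jian10.py)

The implementation of the example file jian10.py is shown below.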

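import uuid

from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
from langchain.retrievers.multi_vector import MultiVectorRetriever, SearchType
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# `docs` is the list of split documents, prepared as in jian09.py

# Function schema used to make the model return a list of questions
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

# Chain that generates hypothetical questions for a document
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

# Generate hypothetical questions for every document
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

# Next, index the hypothetical questions in a vector store and retrieve with MultiVectorRetriever
# Vector store that indexes the questions
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# Storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# Initialize the MultiVectorRetriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in range(len(docs))]

# Wrap every question in a Document that points to its parent document
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    for question in question_list:
        question_doc = Document(page_content=question, metadata={id_key: doc_ids[i]})
        question_docs.append(question_doc)

# Add the questions (and their vectors) to the vector store,
# and the original documents to the document store
vectorstore.add_documents(question_docs)
store.mset(list(zip(doc_ids, docs)))

# The retriever matches the query against the questions and returns
# the original documents those questions were generated from
search_query = "justice"
retriever.search_type = SearchType.similarity  # or SearchType.mmr for MMR search
relevant_docs = retriever.get_relevant_documents(search_query)

# Print the content of the retrieved documents
for doc in relevant_docs:
    print(doc.page_content)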

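The code above works as follows:

First, ChatOpenAI and JsonKeyOutputFunctionsParser generate hypothetical questions for each document.
Then, a Chroma vector store is created to index the questions, together with an InMemoryByteStore that holds the original documents.
Next, MultiVectorRetriever is initialized, and each question is wrapped in a Document object and added to the vector store.
Finally, retriever.get_relevant_documents runs the retrieval for a query (such as "justice"). The query is matched against the stored questions, and the method returns the original documents from which the matching questions were generated.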
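The data handling between the model call and the vector store can be sketched without calling any API. This is an illustration under stated assumptions, not the behavior of JsonKeyOutputFunctionsParser itself: the `extract_key` helper is hypothetical, and the raw argument strings are made-up sample data in the shape defined by the function schema (a JSON object with a "questions" array).

```python
import json
import uuid

ID_KEY = "doc_id"

# Made-up function-call arguments, one JSON string per parent document.
raw_arguments = [
    '{"questions": ["Who was nominated to the Supreme Court?", "What was said about jobs?"]}',
    '{"questions": ["Why do startups matter?"]}',
]


def extract_key(arguments_json, key_name):
    """Parse the JSON arguments and return the value stored under key_name."""
    return json.loads(arguments_json)[key_name]


hypothetical_questions = [extract_key(raw, "questions") for raw in raw_arguments]
doc_ids = [str(uuid.uuid4()) for _ in hypothetical_questions]

# One index entry per question; every entry carries the ID of its parent document.
question_entries = [
    {"page_content": question, "metadata": {ID_KEY: doc_id}}
    for doc_id, questions in zip(doc_ids, hypothetical_questions)
    for question in questions
]

for entry in question_entries:
    print(entry["metadata"][ID_KEY][:8], entry["page_content"])
```

Each parent document contributes several index entries, which is what allows a single document to be found through differently phrased queries.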

To be continued
