The Ultimate Guide to Mastering Retrieval-Augmented Generation (RAG) with LangChain: 6. Indexing

Article link: https://blog.csdn.net/wangjiansui/article/details/141136843

6. Indexing

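In RAG, the first thing we do is create a vector store that holds "chunks" of the provided documents. The chunks are stored in the vector database in a form that can be retrieved easily and efficiently for a given query. This is called indexing. In this section, we look at the different indexing techniques available in LangChain for optimizing RAG.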
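To make the idea of indexing concrete, here is a minimal sketch in plain Python. It is not LangChain code: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `chunk`, `build_index`, and `search` are hypothetical helper names chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, size: int = 8) -> list[str]:
    """Split a document into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(documents: list[str]) -> list[tuple[Counter, str]]:
    """Indexing: store every chunk together with its embedding."""
    return [(embed(c), c) for doc in documents for c in chunk(doc)]


def search(index: list[tuple[Counter, str]], query: str, k: int = 1) -> list[str]:
    """Retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]


documents = [
    "Agents use planning to break a task into steps. "
    "Agents use memory to recall earlier observations.",
    "Human annotators label data. Label quality affects model training.",
]
index = build_index(documents)
top = search(index, "memory of agents", k=1)
print(top[0])
```

A real vector store replaces the list with an approximate nearest-neighbour structure and the word counts with dense embeddings, but the store-then-search flow is the same.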

%load_ext dotenv
%dotenv secrets/secrets.env

6.1. Multi-representation indexing

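In multi-representation indexing, instead of chunking and embedding whole documents, we first generate a summary of each document. The embeddings of the summaries are stored in the vector store, while the full documents, linked to their summaries by an id, are stored in a separate in-memory store (the document store). When the user asks a question, the multi-vector retriever first fetches the most similar summaries from the vector store and then fetches the corresponding documents from the document store. The similarity search therefore runs over a smaller set of embeddings, and the LLM can use the entire original document (rather than a chunk) as context to answer the question accurately.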
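The two-store layout described above can be sketched without any library. In this illustration the summaries, the documents, and the word-overlap score are all made up for the example; `overlap_score` is a toy stand-in for embedding similarity, and `retrieve` is a hypothetical helper, not a LangChain function.

```python
import re
import uuid


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query: str, text: str) -> float:
    """Toy similarity: Jaccard overlap of word sets (stand-in for embeddings)."""
    q, t = words(query), words(text)
    return len(q & t) / len(q | t) if (q | t) else 0.0


# Full documents (imagine each one being many thousands of words long).
full_documents = [
    "Full article A: a long post about LLM-powered autonomous agents.",
    "Full article B: a long post about collecting human-labelled data.",
]
# One short summary per document; only these are searched.
summaries = [
    "Autonomous agents built on LLMs: planning, memory, tool use.",
    "Collecting high-quality human-labelled data for model training.",
]

# A shared id links each summary to its full document.
doc_ids = [str(uuid.uuid4()) for _ in full_documents]
summary_index = [{"text": s, "doc_id": i} for s, i in zip(summaries, doc_ids)]
docstore = dict(zip(doc_ids, full_documents))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Search the summaries, then return the linked full documents."""
    ranked = sorted(
        summary_index,
        key=lambda entry: overlap_score(query, entry["text"]),
        reverse=True,
    )
    return [docstore[entry["doc_id"]] for entry in ranked[:k]]


result = retrieve("memory in agents")
print(result[0])
```

The search only ever touches the short summaries, while the caller receives the full parent document.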

from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import WebBaseLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain import hub

First, we load two documents that will be used to answer the user's questions.

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

Next, we create a chain that generates a summary from each document's page content.

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

summaries[0]

'The document discusses the concept of building autonomous agents
powered by Large Language Models (LLMs). It explains the key
components of such agents, including Planning, Memory, and Tool Use.
Several proof-of-concept examples are provided, such as AutoGPT and
GPT-Engineer, showcasing the potential of LLMs in various tasks.
Challenges related to finite context length, planning, and reliability
of natural language interfaces are also addressed. Finally, the
document includes citations and references for further reading.'
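The `max_concurrency` setting passed to `batch` caps how many summaries are requested at the same time. The sketch below illustrates that idea with the standard library only. It is not LangChain's implementation: `summarize` is a hypothetical stand-in for the LLM call, and `batch` is a local helper written for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(text: str) -> str:
    """Hypothetical stand-in for the LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def batch(inputs: list[str], max_concurrency: int = 5) -> list[str]:
    """Run summarize over all inputs with at most max_concurrency workers.

    pool.map returns results in input order, so summaries line up with inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(summarize, inputs))


pages = [
    "Agents plan and act. They also use tools.",
    "Labels come from humans. Quality varies.",
]
toy_summaries = batch(pages, max_concurrency=5)
print(toy_summaries)
```

Keeping results in input order matters here, because the summaries are later zipped with the document ids.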

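After generating the summaries, we create the docstore as an InMemoryByteStore to hold the documents, keyed by UUID, and a Chroma vectorstore to hold the embeddings of the summaries, which are first converted to Document objects. The UUID is added to each summary's metadata, and that is what links a summary to its document. Finally, we create the MultiVectorRetriever from the vectorstore and the docstore, with doc_id as the key that links them.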

from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
import uuid

docstore = InMemoryByteStore()  # To store the documents
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())  # To store the embeddings of the document summaries

# IDs that map the summaries to the documents
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Create documents from the summaries
summary_docs = [Document(page_content=s, metadata={"doc_id": doc_id}) for s, doc_id in zip(summaries, doc_ids)]

# Create the retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=docstore,
    id_key="doc_id"
)

# Add summaries to the vectorstore
retriever.vectorstore.add_documents(summary_docs)

# Add documents to the docstore
retriever.docstore.mset(list(zip(doc_ids, docs)))

We can then query the vector store to get the summary most relevant to the user's query.

query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query,k=1)
sub_docs[0]

Document(page_content='The document discusses the concept of building
autonomous agents powered by Large Language Models (LLMs). It explains
the key components of such agents, including Planning, Memory, and
Tool Use. Several proof-of-concept examples are provided, such as
AutoGPT and GPT-Engineer, showcasing the potential of LLMs in various
tasks. Challenges related to finite context length, planning, and
reliability of natural language interfaces are also addressed.
Finally, the document includes citations and references for further
reading.', metadata={'doc_id':
'997d7f6e-3911-49e8-b23f-3dca97361902'})

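We can also retrieve the full document relevant to the user's query directly. That document can then be used as context for the LLM to answer the question.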

Warning

You must make sure the LLM's context length is large enough to hold the entire document plus the question.
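One rough way to check this before calling the model is to estimate token counts from character counts. The sketch below assumes about four characters per token, which is only a heuristic for English text, and the `context_window` values are illustrative placeholders; take the real limit from your model's documentation and use a proper tokenizer when accuracy matters. The stand-in document uses 43,902 characters, the length reported for the retrieved article later in this section.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumption: ~4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    document: str,
    question: str,
    context_window: int,
    reserved_for_answer: int = 1_000,
) -> bool:
    """Check that document + question + room for the answer fit the window."""
    needed = (
        estimate_tokens(document)
        + estimate_tokens(question)
        + reserved_for_answer
    )
    return needed <= context_window


document = "x" * 43_902  # stand-in for a retrieved full document
question = "What role does memory play in agents?"

fits_large = fits_in_context(document, question, context_window=16_000)
fits_small = fits_in_context(document, question, context_window=4_000)
print(fits_large, fits_small)
```

If the check fails, the usual options are a model with a larger window or falling back to chunk-level retrieval.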

retrieved_docs = retriever.invoke(query)
len(retrieved_docs[0].page_content)

Number of requested results 4 is greater than number of elements in
index 2, updating n_results = 2

43902
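The notice in the output above says that the similarity search asked for 4 results while the index holds only 2 summaries, so the number of results was reduced to 2. Conceptually, the retriever call then maps the matching summaries back to their parent documents. The following is a simplified sketch of that second step, not the library's code: `multi_vector_lookup` is a hypothetical helper, and the hits and docstore contents are invented for the example.

```python
def multi_vector_lookup(hits: list[dict], docstore: dict[str, str]) -> list[str]:
    """Map ranked summary hits to parent documents.

    Ids are collected in ranking order and de-duplicated, so a document
    matched by several summaries is returned only once.
    """
    ids: list[str] = []
    for hit in hits:
        doc_id = hit["metadata"].get("doc_id")
        if doc_id is not None and doc_id not in ids:
            ids.append(doc_id)
    # Skip ids that are missing from the docstore.
    return [docstore[i] for i in ids if i in docstore]


docstore = {"a": "full text of article A", "b": "full text of article B"}
hits = [
    {"page_content": "summary of A", "metadata": {"doc_id": "a"}},
    {"page_content": "another view of A", "metadata": {"doc_id": "a"}},
    {"page_content": "summary of B", "metadata": {"doc_id": "b"}},
]
parents = multi_vector_lookup(hits, docstore)
print(parents)
```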

6.2. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)

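Although multi-representation indexing lets us index large documents and retrieve them as context, feeding entire original documents to the LLM is expensive and slow. Moreover, when several documents are needed to answer a question, that is hard to achieve with multi-representation indexing. RAPTOR was introduced as a solution: it uses hierarchical indexing, recursively embedding, clustering, and summarizing chunks of text to build a tree from the bottom up with different levels of summarization.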

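In this tree, the leaf nodes are text chunks (as in the paper) or, in this example, full documents. RAPTOR embeds the leaf nodes and clusters them. Each cluster is then summarized into a higher-level (more abstract) consolidation of information across similar documents. This process is repeated recursively until only one cluster remains.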
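The recursion can be illustrated with a small toy in plain Python. This is a sketch under simplifying assumptions, not the RAPTOR implementation: the embedding is a bag of words, the clustering is a greedy pairing by cosine similarity (the paper uses Gaussian mixture models on reduced embeddings), and `summarize` is a hypothetical stand-in for an LLM summary.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(texts: list[str]) -> str:
    """Hypothetical stand-in for an LLM: first four words of each member."""
    return " / ".join(" ".join(t.split()[:4]) for t in texts)


def cluster(nodes: list[str]) -> list[list[str]]:
    """Greedy pairing: each seed is grouped with its most similar peer."""
    remaining = list(nodes)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        if remaining:
            mate = max(remaining, key=lambda n: cosine(embed(seed), embed(n)))
            remaining.remove(mate)
            clusters.append([seed, mate])
        else:
            clusters.append([seed])
    return clusters


def build_tree(leaves: list[str]) -> list[list[str]]:
    """Return the tree level by level, leaves first, root last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        levels.append([summarize(c) for c in cluster(levels[-1])])
    return levels


leaves = [
    "low-rank adapters fine-tune large models cheaply",
    "adapters add few trainable parameters to models",
    "vector stores index embeddings for retrieval",
    "retrieval finds embeddings close to the query",
]
levels = build_tree(leaves)
for depth, nodes in enumerate(levels):
    print(depth, nodes)
```

Each pass at least halves the number of nodes, so the loop always ends with a single root summary.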

Let's see how to implement this with LangChain!

6.2.1. Loading the documents

First, we create 2 documents from 2 papers and merge them. We also initialize the LLM and the embedding model.

lora_doc = PyPDFLoader("data/LORA.pdf"