LangChain official documentation:
MultiVector Retriever | 🦜️🔗 LangChain
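It is often useful to store multiple vectors per document, and several use cases benefit from it. LangChain provides a base MultiVectorRetriever that makes querying this kind of setup easy. Most of the complexity lies in how to create the multiple vectors for each document. This notebook covers some common ways to create those vectors and use the MultiVectorRetriever.
Methods for creating multiple vectors per document include:
- Smaller chunks: split a document into smaller chunks and embed those (this is what ParentDocumentRetriever does).
- Summary: create a summary for each document and embed it along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be well suited to answer, and embed those along with (or instead of) the document.
Note that this also enables another way of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control.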
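As a rough illustration of the manual route, the sketch below uses a plain Python list in place of a vector store and word overlap in place of embedding similarity. The names `index`, `add_manual_query`, and `lookup` are hypothetical and not part of LangChain; the only idea carried over is that every indexed entry stores the ID of the document it should lead back to.

```python
# Toy stand-in for a vector store: each entry is (text, doc_id).
# Word overlap replaces embedding similarity (an assumption for illustration).
index = []


def add_manual_query(text, doc_id):
    """Register a hand-written query that should lead back to doc_id."""
    index.append((text, doc_id))


def lookup(query):
    """Return the doc_id of the entry sharing the most words with the query."""
    query_words = set(query.lower().split())
    best = max(
        index,
        key=lambda entry: len(query_words & set(entry[0].lower().split())),
    )
    return best[1]


add_manual_query("who is retiring from the supreme court", "doc-speech")
add_manual_query("how did the author start programming", "doc-essay")

print(lookup("supreme court retiring justice"))
```

With a real vector store, the hand-written query would be embedded like any other entry, so it competes with chunks, summaries, and generated questions on equal terms.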
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
loaders = [
    TextLoader("../../paul_graham_essay.txt"),
    TextLoader("../../state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)
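Smaller chunks
It is often useful to retrieve larger chunks of information while embedding smaller ones. This lets each embedding capture semantic meaning as precisely as possible while still passing as much context as possible downstream. Note that this is what ParentDocumentRetriever does; here we show what is going on under the hood.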
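Before turning to the LangChain code, the idea can be sketched in plain Python. This is a simplified illustration under loud assumptions: substring matching stands in for embedding similarity, a dict stands in for the document store, and `split_words`, `child_index`, and `retrieve_parent` are hypothetical names, not LangChain APIs.

```python
def split_words(text, words_per_chunk):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# Parent documents, keyed by ID (the role of the document store).
parents = {
    "parent-a": "alpha beta gamma delta epsilon zeta eta theta",
    "parent-b": "one two three four five six seven eight",
}

# Small chunks, each tagged with the ID of its parent (the role of the vector store).
child_index = [
    {"text": chunk, "doc_id": parent_id}
    for parent_id, text in parents.items()
    for chunk in split_words(text, words_per_chunk=3)
]


def retrieve_parent(query):
    """Match against the small chunks, but return the full parent document."""
    for child in child_index:
        if query in child["text"]:  # stand-in for a similarity search
            return parents[child["doc_id"]]
    return None


print(retrieve_parent("epsilon"))
```

The search touches only the small chunks, while the caller receives the larger parent, which is the trade-off described above.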
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
import uuid
doc_ids = [str(uuid.uuid4()) for _ in docs]
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
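uuid is a Python standard-library module for generating universally unique identifiers (UUIDs). A UUID is a 128-bit identifier, commonly used to identify objects uniquely in distributed systems.
[str(uuid.uuid4()) for _ in docs]: generates one unique UUID, in string form, for each document in the docs list.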
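A small standalone check of that expression, using placeholder strings instead of real documents (the `docs` list here is an assumption for illustration only):

```python
import uuid

docs = ["first document", "second document", "third document"]  # placeholders
doc_ids = [str(uuid.uuid4()) for _ in docs]

for doc_id in doc_ids:
    parsed = uuid.UUID(doc_id)  # parse the string back into a UUID object
    print(doc_id, "version:", parsed.version)
```

Each ID is a 36-character string (32 hexadecimal digits plus 4 hyphens), and `uuid4` produces random, version-4 UUIDs, so the values differ on every run.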
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
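This line adds the original documents docs, keyed by their unique identifiers doc_ids, to the retriever's document store.
- doc_ids is the list of unique document identifiers generated earlier.
- docs is the list of original documents.
- zip(doc_ids, docs) pairs each identifier with its document, lazily yielding (id, document) tuples.
- list(zip(doc_ids, docs)) materializes those pairs into a list.
- retriever.docstore.mset sets the documents and their identifiers in the document store in one batch.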
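To make the pairing concrete, here is a minimal sketch with a hypothetical `DictStore` class that mimics batch set and get methods on top of a dict. It is an assumption-laden stand-in, not LangChain's store implementation, and the IDs and texts are placeholders.

```python
class DictStore:
    """Hypothetical minimal key-value store with batch set/get methods."""

    def __init__(self):
        self._data = {}

    def mset(self, key_value_pairs):
        # Store every (key, value) pair in one call.
        for key, value in key_value_pairs:
            self._data[key] = value

    def mget(self, keys):
        # Return values in the order requested; missing keys yield None.
        return [self._data.get(key) for key in keys]


doc_ids = ["id-1", "id-2"]
docs = ["full text of document one", "full text of document two"]

pairs = list(zip(doc_ids, docs))
print(pairs)

store = DictStore()
store.mset(pairs)
print(store.mget(["id-2", "unknown"]))
```

Because `zip` pairs items by position, `doc_ids` and `docs` must be in the same order, otherwise documents end up stored under the wrong IDs.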
# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("justice breyer")[0]
Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '2fd77862-9ed5-4fad-bf76-e487b747b333', 'source': '../../state_of_the_union.txt'})
# Retriever returns larger chunks
len(retriever.invoke("justice breyer")[0].page_content)
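By default, the retriever performs a similarity search on the vector database. LangChain vector stores also support searching via Maximal Marginal Relevance (MMR), so if you want that instead, set the search_type attribute as follows: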
from langchain.retrievers.multi_vector import SearchType
retriever.search_type = SearchType.mmr
len(retriever.invoke("justice breyer")[0].page_content)
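Common SearchType options
- mmr (Maximal Marginal Relevance):
  - Reranks similarity-search results to balance relevance and diversity. Each document is chosen by weighing its relevance to the query against its similarity to documents already selected, which keeps the results from being near-duplicates of one another.
- similarity:
  - Returns the most relevant documents according to a similarity score between the query and each document, typically computed with a metric such as cosine similarity or Euclidean distance.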
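The difference between the two options can be sketched with a small greedy MMR selection over toy 2-D vectors. This is an illustrative implementation of the general MMR idea, not LangChain's code; `cosine`, `mmr`, and the vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: pick k candidate indices, trading relevance for diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],    # 0: very relevant
    [1.0, 0.12],   # 1: near-duplicate of 0
    [0.8, -0.6],   # 2: less relevant, but different from 0
]

print(mmr(query, candidates, k=2))                   # balances relevance and diversity
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking, which picks the two near-duplicates; with the default of 0.5 the second pick is the dissimilar candidate instead.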
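Summary
Often a summary distills the content of a chunk more accurately, which can lead to better retrieval. Here we show how to create summaries and then embed them.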
import uuid
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)
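The | symbol is the pipe operator. In LangChain it composes the steps of a chain: the output of each step becomes the input of the next.
In this chain, that means:
- Extract the document content: pull page_content out of the input object.
- Build the chat prompt: fill the extracted content into the chat prompt template.
- Run the chat model: send the resulting prompt to the OpenAI chat model to be summarized.
- Parse the output: pass the chat model's output to the string output parser, which returns it as a plain string.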
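How a `|` operator can chain steps is easy to demonstrate in plain Python by overloading `__or__`. The `Step` class below is a hypothetical toy, not LangChain's implementation, and `fake_model` is a stub that stands in for a real chat model so the example runs without an API key.

```python
class Step:
    """Hypothetical wrapper that lets plain functions be chained with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose: run this step first, then feed its output to the next one.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


extract = Step(lambda doc: doc["page_content"])
build_prompt = Step(lambda text: f"Summarize the following document:\n\n{text}")
# Stub model: echoes the last line of the prompt instead of calling an LLM.
fake_model = Step(lambda prompt: {"content": "SUMMARY: " + prompt.splitlines()[-1]})
parse = Step(lambda message: message["content"])

chain = extract | build_prompt | fake_model | parse
print(chain.invoke({"page_content": "hello world"}))
```

The four steps mirror the four bullets above: extract, build the prompt, run the model, parse the output.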
summaries = chain.batch(docs, {"max_concurrency": 5})
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
# We can also add the original chunks to the vectorstore if we want to:
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)
sub_docs = vectorstore.similarity_search("justice breyer")
sub_docs[0]
Document(page_content="The document is a speech given by President Biden addressing various issues and outlining his agenda for the nation. He highlights the importance of nominating a Supreme Court justice and introduces his nominee, Judge Ketanji Brown Jackson. He emphasizes the need to secure the border and reform the immigration system, including providing a pathway to citizenship for Dreamers and essential workers. The President also discusses the protection of women's rights, including access to healthcare and the right to choose. He calls for the passage of the Equality Act to protect LGBTQ+ rights. Additionally, President Biden discusses the need to address the opioid epidemic, improve mental health services, support veterans, and fight against cancer. He expresses optimism for the future of America and the strength of the American people.", metadata={'doc_id': '56345bff-3ead-418c-a4ff-dff203f77474'})
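Hypothetical queries
An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded.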
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
chain.invoke(docs[0])
["What was the author's first experience with programming like?",
'Why did the author switch their focus from AI to Lisp during their graduate studies?',
'What led the author to contemplate a career in art instead of computer science?']
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
sub_docs = vectorstore.similarity_search("justice breyer")
sub_docs
[Document(page_content='Who has been nominated to serve on the United States Supreme Court?', metadata={'doc_id': '0b3a349e-c936-4e77-9c40-0a39fc3e07f0'}),
Document(page_content="What was the context and content of Robert Morris' advice to the document's author in 2010?", metadata={'doc_id': 'b2b2cdca-988a-4af1-ba47-46170770bc8c'}),
Document(page_content='How did personal circumstances influence the decision to pass on the leadership of Y Combinator?', metadata={'doc_id': 'b2b2cdca-988a-4af1-ba47-46170770bc8c'}),
Document(page_content='What were the reasons for the author leaving Yahoo in the summer of 1999?', metadata={'doc_id': 'ce4f4981-ca60-4f56-86f0-89466de62325'})]
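In the output above, two of the four matching questions carry the same doc_id, so a step that maps hits back to parent documents needs to return each parent only once. The sketch below shows one way to do that while keeping first-seen order. It illustrates the general idea under stated assumptions and is not a copy of MultiVectorRetriever's code; `resolve_parents`, `hits`, and `docstore` are hypothetical names, with plain dicts standing in for Document objects and the document store.

```python
def resolve_parents(hits, docstore, id_key="doc_id"):
    """Map vector-store hits to parent documents, keeping first-seen order."""
    ordered_ids = []
    for hit in hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Skip IDs that have no entry in the document store.
    return [docstore[i] for i in ordered_ids if i in docstore]


hits = [
    {"page_content": "question 1", "metadata": {"doc_id": "a"}},
    {"page_content": "question 2", "metadata": {"doc_id": "b"}},
    {"page_content": "question 3", "metadata": {"doc_id": "b"}},
    {"page_content": "question 4", "metadata": {"doc_id": "c"}},
]
docstore = {"a": "parent A", "b": "parent B", "c": "parent C"}

print(resolve_parents(hits, docstore))
```

Four hits resolve to three parents here, which is why the number of documents a retriever returns can be smaller than the number of vector-store matches.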