Quick Start: MultiVector Retriever

LangChain official documentation:

MultiVector Retriever | 🦜️🔗 LangChain

It is often useful to store multiple vectors per document, and there are several use cases where this pays off. LangChain provides a base MultiVectorRetriever that makes querying this kind of setup straightforward. Much of the complexity lies in how to create the multiple vectors per document. This notebook covers some common ways to create those vectors and how to use the MultiVectorRetriever.

Methods for creating multiple vectors per document include:

  • Smaller chunks: split a document into smaller chunks and embed those (this is what the ParentDocumentRetriever does).
  • Summaries: create a summary for each document and embed it along with (or instead of) the document.
  • Hypothetical questions: create hypothetical questions that each document would be well suited to answer, and embed those along with (or instead of) the document.

Note that this also enables another way of adding embeddings: doing so manually. This is powerful because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control.

from langchain.retrievers.multi_vector import MultiVectorRetriever

from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("../../paul_graham_essay.txt"),
    TextLoader("../../state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

Smaller Chunks

Often it is useful to retrieve larger chunks of information while embedding smaller chunks. This lets the embeddings capture semantic meaning as precisely as possible, while still passing as much context as possible downstream. Note that this is exactly what the ParentDocumentRetriever does; here we show what happens under the hood.

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
import uuid

doc_ids = [str(uuid.uuid4()) for _ in docs]

# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

uuid is a module for generating universally unique identifiers (UUIDs). A UUID is a 128-bit identifier, commonly used to uniquely identify objects in distributed systems.

[str(uuid.uuid4()) for _ in docs]: generates a unique string-form UUID for each document in the docs list.
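For illustration, a quick look at what uuid.uuid4() produces:

```python
import uuid

# Generate one string-form UUID per item, exactly as the list
# comprehension above does for each document in docs.
ids = [str(uuid.uuid4()) for _ in range(3)]

# uuid4() is random-based, so the ids are (with overwhelming
# probability) all distinct, and each renders as a 36-character
# string in 8-4-4-4-12 hex format, e.g.
# '2fd77862-9ed5-4fad-bf76-e487b747b333'.
print(ids[0])
```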

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

This line adds the original documents (docs) and their unique identifiers (doc_ids) to the retriever's document store:

  • doc_ids is the list of unique document identifiers generated earlier.
  • docs is the list of original documents.
  • zip(doc_ids, docs) pairs each identifier with its document.
  • list(zip(doc_ids, docs)) materializes those pairs into a list of tuples.
  • retriever.docstore.mset batch-writes these (id, document) pairs into the document store.
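The mset/mget semantics can be sketched with a plain dict-backed key-value store (SimpleDocstore below is a hypothetical stand-in for illustration, not LangChain's actual docstore class):

```python
class SimpleDocstore:
    """Minimal key-value docstore sketch mirroring mset/mget semantics."""

    def __init__(self):
        self._store = {}

    def mset(self, pairs):
        # Batch-write a list of (key, value) tuples.
        for key, value in pairs:
            self._store[key] = value

    def mget(self, keys):
        # Batch-read; missing keys come back as None.
        return [self._store.get(k) for k in keys]


doc_ids = ["id-1", "id-2"]
docs = ["full text of doc 1", "full text of doc 2"]

store = SimpleDocstore()
store.mset(list(zip(doc_ids, docs)))
store.mget(["id-2"])  # ['full text of doc 2']
```

This is the shape of the lookup the retriever performs: search the vectorstore for small chunks, collect their doc_id metadata, then mget the full parent documents from the docstore.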

# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("justice breyer")[0]

Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '2fd77862-9ed5-4fad-bf76-e487b747b333', 'source': '../../state_of_the_union.txt'})

# Retriever returns larger chunks
len(retriever.invoke("justice breyer")[0].page_content)

By default, the retriever performs a similarity search against the vector database. LangChain vector stores also support searching via Maximal Marginal Relevance; if you want that instead, just set the search_type attribute as follows:

from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr

len(retriever.invoke("justice breyer")[0].page_content)

Common SearchType options

  1. mmr (Maximal Marginal Relevance):
    1. MMR re-ranks similarity-search results for maximal marginal relevance. It is an information-retrieval method that maximizes both relevance and diversity: documents are selected by balancing their relevance to the query against their similarity to already-selected results, which avoids returning near-duplicates and increases the diversity of the result set.
  2. similarity:
    1. Similarity-based retrieval, which scores the similarity between the query and each document and returns the most relevant ones, typically using a metric such as cosine similarity or Euclidean distance.
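The MMR idea can be sketched in plain Python (a simplified illustration, not LangChain's actual implementation): each pick balances relevance to the query against redundancy with the documents already chosen, weighted by lambda_mult.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return indices of k docs balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -float("inf")
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy = highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


# Doc 1 is nearly a duplicate of doc 0; doc 2 is different but less relevant.
docs = [[1, 0], [0.9, 0.4], [0.1, 1]]
mmr([1, 0], docs, k=2, lambda_mult=0.3)  # diversity-weighted: picks [0, 2]
mmr([1, 0], docs, k=2, lambda_mult=1.0)  # pure relevance: picks [0, 1]
```

With a low lambda_mult the near-duplicate is skipped in favor of the more diverse document; with lambda_mult=1.0 the method degenerates to plain similarity ranking.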

Summaries

Oftentimes a summary can distill more precisely what a chunk is about, leading to better retrieval. Here we show how to create summaries and then embed them.

import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

The | symbol is a pipe operator: the output of each step becomes the input of the next. In LangChain, this composition syntax is provided by the LangChain Expression Language (LCEL) and expresses each stage of a data-processing chain.

This means:

  1. Extract document content: pull the page_content out of the input document.
  2. Build the chat prompt: feed the extracted content into the chat prompt template.
  3. Run the chat model: send the formatted prompt to the OpenAI chat model to produce a summary.
  4. Parse the output: run the chat model's output through the string output parser.
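The four steps above can be illustrated with a toy implementation of | composition (a simplified stand-in, not LangChain's actual Runnable machinery):

```python
class Pipe:
    """Toy composable step: `a | b` runs a, then feeds its output into b."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose: the new step applies self first, then other.
        return Pipe(lambda x: other.func(self.func(x)))

    def invoke(self, x):
        return self.func(x)


# Mirror the first two steps of the summarization chain above.
extract = Pipe(lambda doc: doc["page_content"])
prompt = Pipe(lambda text: f"Summarize the following document:\n\n{text}")

chain = extract | prompt
chain.invoke({"page_content": "LangChain supports multi-vector retrieval."})
```

Invoking the composed chain extracts the content and formats the prompt in one call, which is exactly the flow the real chain follows before handing off to the chat model and output parser.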
summaries = chain.batch(docs, {"max_concurrency": 5})
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# We could also add the original chunks to the vectorstore if we wanted:
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)
sub_docs = vectorstore.similarity_search("justice breyer")
sub_docs[0]

Document(page_content="The document is a speech given by President Biden addressing various issues and outlining his agenda for the nation. He highlights the importance of nominating a Supreme Court justice and introduces his nominee, Judge Ketanji Brown Jackson. He emphasizes the need to secure the border and reform the immigration system, including providing a pathway to citizenship for Dreamers and essential workers. The President also discusses the protection of women's rights, including access to healthcare and the right to choose. He calls for the passage of the Equality Act to protect LGBTQ+ rights. Additionally, President Biden discusses the need to address the opioid epidemic, improve mental health services, support veterans, and fight against cancer. He expresses optimism for the future of America and the strength of the American people.", metadata={'doc_id': '56345bff-3ead-418c-a4ff-dff203f77474'})

Hypothetical Queries

An LLM can also be used to generate a list of hypothetical questions that a particular document could answer. These questions can then be embedded.

functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

chain.invoke(docs[0])

["What was the author's first experience with programming like?",

'Why did the author switch their focus from AI to Lisp during their graduate studies?',

'What led the author to contemplate a career in art instead of computer science?']

hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )

retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
sub_docs = vectorstore.similarity_search("justice breyer")
sub_docs

[Document(page_content='Who has been nominated to serve on the United States Supreme Court?', metadata={'doc_id': '0b3a349e-c936-4e77-9c40-0a39fc3e07f0'}),

Document(page_content="What was the context and content of Robert Morris' advice to the document's author in 2010?", metadata={'doc_id': 'b2b2cdca-988a-4af1-ba47-46170770bc8c'}),

Document(page_content='How did personal circumstances influence the decision to pass on the leadership of Y Combinator?', metadata={'doc_id': 'b2b2cdca-988a-4af1-ba47-46170770bc8c'}),

Document(page_content='What were the reasons for the author leaving Yahoo in the summer of 1999?', metadata={'doc_id': 'ce4f4981-ca60-4f56-86f0-89466de62325'})]
