Recursive Retriever + Query Engine: A Powerful Tool for Exploring Hierarchical Data

In programming, working with hierarchical data is a common but complex task. In this post, we take a deep dive into a module called the RecursiveRetriever, which helps us process and query hierarchical data more efficiently.

Motivation

The concept of recursive retrieval goes beyond directly retrieving the most relevant nodes: it also explores relationships between nodes in order to reach other retrievers or query engines and execute them. For example, a node may hold a concise summary of a structured table and link to a SQL/Pandas query engine over that table. If that node is retrieved, we also want to run the underlying query engine to get the answer.

This technique is especially useful for documents with hierarchical relationships. In this example we walk through a Wikipedia article about billionaires (in PDF form), which contains both text and several embedded structured tables. We first create a Pandas query engine over each table, but also represent each table with an IndexNode (which stores a link to the query engine); this node is stored in the vector store alongside the other nodes.

At query time, if an IndexNode is retrieved, the underlying query engine/retriever is queried in turn.
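
As a minimal sketch of that link (a preview of the full code later in this post): an IndexNode is ordinary text, here a table summary, plus an index_id that the RecursiveRetriever later resolves against a mapping of query engines.

from llama_index.core.schema import IndexNode

# The index_id acts as a key: at query time, the RecursiveRetriever looks
# it up in the query-engine mapping and forwards the query to that engine.
node = IndexNode(
    text="This node summarizes a structured table of billionaires.",
    index_id="pandas0",  # must match a key in the mapping defined later
)
print(node.index_id)  # -> pandas0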

Setup

We use camelot to extract text-based tables from the PDF.

%pip install llama-index-embeddings-openai
%pip install llama-index-readers-file pymupdf
%pip install llama-index-llms-openai
%pip install llama-index-experimental
%pip install camelot-py  # table extraction; camelot's default "lattice" mode also needs Ghostscript installed
import camelot

Default Settings

import os

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
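
Settings defines global defaults: every index and query engine built later in this post uses this LLM and embedding model unless it is explicitly handed its own, as the Pandas query engines below are (they receive a separate gpt-4 llm).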

Load Documents (and Tables)

We use PyMuPDFReader to read the main text of the document.

We also use camelot to extract some structured tables from the document.

from typing import List

from llama_index.readers.file import PyMuPDFReader

file_path = "billionaires_page.pdf"
# Initialize the PDF reader
reader = PyMuPDFReader()
docs = reader.load(file_path)

# Parse tables with camelot
def get_tables(path: str, pages: List[int]):
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        table_df = table_list[0].df
        # Promote the first row to the header and drop it from the data rows
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs

table_dfs = get_tables(file_path, pages=[3, 25])
# Display the list of top billionaires in 2023
table_dfs[0]
# Display the number of billionaires (and combined net worth) by year
table_dfs[1]
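
Because camelot's output depends heavily on the PDF layout, a quick sanity check of the parsed DataFrames (an optional addition, not in the original notebook) can save debugging time later:

# Confirm both tables parsed with plausible shapes and headers
# before wiring them into query engines.
for i, df in enumerate(table_dfs):
    print(f"table {i}: shape={df.shape}, columns={list(df.columns)}")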

Create Pandas Query Engines

We create a Pandas query engine for each structured table.

These can be run on their own to answer queries about each table.

from llama_index.experimental.query_engine import PandasQueryEngine

# Define a query engine over each table
llm = OpenAI(model="gpt-4")

df_query_engines = [
    PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs
]
response = df_query_engines[0].query(
    "What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))
# Output: $180 billion
response = df_query_engines[1].query(
    "How many billionaires were there in 2009?"
)
print(str(response))
# Output: 793
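
Under the hood, PandasQueryEngine asks the LLM to write pandas code over the DataFrame and then executes it. Recent versions of llama-index-experimental expose the generated code in the response metadata; the exact key can vary by version, so treat this as a sketch:

# Inspect the pandas code generated for the last query
# (.get avoids a KeyError if the metadata key differs in your version)
print(response.metadata.get("pandas_instruction_str"))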

Build Vector Index

Build a vector index over the chunked document, together with the additional IndexNode objects that link to the tables.

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.schema import IndexNode

doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)
# Define index nodes that link to the table query engines
summaries = [
    (
        "This node provides information about the world's richest billionaires"
        " in 2023"
    ),
    (
        "This node provides information on the number of billionaires and"
        " their combined net worth from 2000 to 2023."
    ),
]

df_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(summaries)
]

df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}
# Build the top-level vector index + query engine
vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
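
Here similarity_top_k=1 means the top-level retriever returns exactly one node per query; when that node is one of the table IndexNodes, recursion into the corresponding Pandas engine kicks in. As a variation not in the original post, raising top_k lets several branches, say a text chunk plus a table link, be followed for a single query:

# Optional variant (assumption, not from the original post): retrieve more
# top-level candidates so multiple IndexNode links can be followed per query.
# vector_retriever = vector_index.as_retriever(similarity_top_k=2)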

Using RecursiveRetriever in a RetrieverQueryEngine

We define a RecursiveRetriever object to recursively retrieve/query nodes, then wrap it in a RetrieverQueryEngine together with a ResponseSynthesizer to synthesize the final response.

We pass in mappings from id to retriever and from id to query engine, plus a root id identifying the retriever to query first.

# Baseline vector index (without the extra df nodes),
# used for benchmarking
vector_index0 = VectorStoreIndex(doc_nodes)
vector_query_engine0 = vector_index0.as_query_engine()
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)
response = query_engine.query(
    "What's the net worth of the second richest billionaire in 2023?"
)
# Output: Retrieving with query id None: What's the net worth of the second richest billionaire in 2023?
# Output: Retrieved node with id, entering: pandas0
# Output: Retrieving with query id pandas0: What's the net worth of the second richest billionaire in 2023?
# Output: Got response: $180 billion
response.source_nodes[0].node.get_content()
# Output: "Query: What's the net worth of the second richest billionaire in 2023?\nResponse: $180\xa0billion"
str(response)
# Output: '$180 billion.'
response = query_engine.query("How many billionaires were there in 2009?")
# Output: Retrieving with query id None: How many billionaires were there in 2009?
# Output: Retrieved node with id, entering: pandas1
# Output: Retrieving with query id pandas1: How many billionaires were there in 2009?
# Output: Got response: 793
str(response)
# Output: '793'
response = vector_query_engine0.query(
    "How many billionaires were there in 2009?"
)
print(response.source_nodes[0].node.get_content())
print(str(response))
# Output: Based on the context information, it is not possible to determine the exact number of billionaires in 2009. The provided information only mentions the number of billionaires in 2013 and 2014.
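
Note the contrast with the recursive setup: without the table links, the baseline index cannot answer the 2009 question at all, whereas the recursive engine routed the same query to the right Pandas engine and returned 793.
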
response = query_engine.query(
    "Which billionaires are excluded from this list?"
)
print(str(response))
# Output: Royal families and dictators whose wealth is contingent on a position are excluded from this list.

Let's now walk through the whole process in detail, step by step, from setup to practical use.

1. Install the Required Libraries

First, we install the libraries we'll use to extract data from the PDF and query it with the recursive retriever.

%pip install llama-index-embeddings-openai
%pip install llama-index-readers-file pymupdf
%pip install llama-index-llms-openai
%pip install llama-index-experimental
%pip install camelot-py  # table extraction; camelot's default "lattice" mode also needs Ghostscript installed
import camelot

2. Set the OpenAI API Key

To use OpenAI's models, we need to set an API key.

import os

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

3. Configure Default Settings

Next, we configure some defaults, including the language model and the embedding model to use.

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

4. Load the Document and Tables

We use PyMuPDFReader to read the PDF's main text, and camelot to extract the structured tables from it.

from typing import List

from llama_index.readers.file import PyMuPDFReader

file_path = "billionaires_page.pdf"
# Initialize the PDF reader
reader = PyMuPDFReader()
docs = reader.load(file_path)

# Parse tables with camelot
def get_tables(path: str, pages: List[int]):
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        table_df = table_list[0].df
        # Promote the first row to the header and drop it from the data rows
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs

table_dfs = get_tables(file_path, pages=[3, 25])

5. Create Pandas Query Engines

We create a Pandas query engine for each structured table; each engine can be run on its own to answer questions about its table.

from llama_index.experimental.query_engine import PandasQueryEngine

llm = OpenAI(model="gpt-4")

df_query_engines = [
    PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs
]

# Example queries
response = df_query_engines[0].query(
    "What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))
# Output: $180 billion

response = df_query_engines[1].query(
    "How many billionaires were there in 2009?"
)
print(str(response))
# Output: 793

6. Build the Vector Index

Build a vector index over the chunked document together with the additional IndexNode objects that link to the tables.

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.schema import IndexNode

doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)

# Define index nodes that link to the table query engines
summaries = [
    "This node provides information about the world's richest billionaires in 2023",
    "This node provides information on the number of billionaires and their combined net worth from 2000 to 2023.",
]

df_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(summaries)
]

df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}

# Build the top-level vector index + query engine
vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
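
If you plan to reuse the index across sessions, LlamaIndex can also persist it to disk and load it back without re-embedding. A minimal sketch; the "./storage" directory name is an arbitrary choice, not from the original post:

from llama_index.core import StorageContext, load_index_from_storage

# Persist the index to disk (directory name is arbitrary)
vector_index.storage_context.persist(persist_dir="./storage")

# ...later, rebuild the index from disk instead of re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
vector_index = load_index_from_storage(storage_context)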

7. Use the RecursiveRetriever

We define a RecursiveRetriever object to recursively retrieve/query nodes, then wrap it in a RetrieverQueryEngine with a ResponseSynthesizer to synthesize the final response.

from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)

# Example query
response = query_engine.query(
    "What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))
# Output: Retrieving with query id None: What's the net worth of the second richest billionaire in 2023?
# Output: Retrieved node with id, entering: pandas0
# Output: Retrieving with query id pandas0: What's the net worth of the second richest billionaire in 2023?
# Output: Got response: $180 billion

response = query_engine.query("How many billionaires were there in 2009?")
print(str(response))
# Output: Retrieving with query id None: How many billionaires were there in 2009?
# Output: Retrieved node with id, entering: pandas1
# Output: Retrieving with query id pandas1: How many billionaires were there in 2009?
# Output: Got response: 793

8. Benchmarking

As a benchmark, we can build a baseline vector index that excludes the extra df nodes and query it.

vector_index0 = VectorStoreIndex(doc_nodes)
vector_query_engine0 = vector_index0.as_query_engine()

response = vector_query_engine0.query(
    "How many billionaires were there in 2009?"
)
print(response.source_nodes[0].node.get_content())
print(str(response))
# Output: Based on the context information, it is not possible to determine the exact number of billionaires in 2009. The provided information only mentions the number of billionaires in 2013 and 2014.

9. Query the Excluded Billionaires

Finally, we can ask which billionaires are excluded from this list.

response = query_engine.query(
    "Which billionaires are excluded from this list?"
)
print(str(response))
# Output: Royal families and dictators whose wealth is contingent on a position are excluded from this list.

Summary

The recursive retriever is a powerful tool that lets us process and query hierarchical data more efficiently. By recursively retrieving nodes and querying the underlying query engines, we can get more accurate and detailed answers from complex documents. I hope this post gives you some inspiration and helps you work with data and information more effectively.
