Custom Router Retriever: Choosing the Right Retrieval Tool in LlamaIndex
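In this article, we show how to define a custom Router Retriever, a retriever that selects one or more candidate retrievers to run a given query against. The router module (a BaseSelector) uses an LLM to decide dynamically which underlying retrieval tools to use, which is useful when you need to pick one or more sources from a diverse set of data sources.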
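Setting Up the Environment
First, make sure the required libraries are installed and your OpenAI API key is set (for example, through the OPENAI_API_KEY environment variable):
%pip install llama-index-llms-openai
%pip install llama-index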
import nest_asyncio
nest_asyncio.apply()
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
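The setup above assumes an OpenAI API key is available, but no cell sets or checks it. Below is a minimal sketch of one way to fail early with a clear message, assuming the key is supplied through the OPENAI_API_KEY environment variable (which the OpenAI client reads by default). The helper name require_openai_key is hypothetical and not part of LlamaIndex.

```python
import os


def require_openai_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `require_openai_key` is a hypothetical helper, not part of LlamaIndex.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in your shell or set "
            "os.environ['OPENAI_API_KEY'] before creating the LLM."
        )
    return key
```

Calling require_openai_key() once before constructing the LLM surfaces a missing key immediately instead of at the first API request.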
Downloading the Data
Download the sample data:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Loading the Data
Load the documents, split them into nodes, and insert the nodes into the document store:
from llama_index.core import (
    SimpleDirectoryReader,
    SimpleKeywordTableIndex,
    StorageContext,
    SummaryIndex,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
# Load the documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# Initialize the LLM and the sentence splitter (node parser)
llm = OpenAI(model="gpt-4")
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
# Initialize the storage context (in memory by default)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
# Define the indexes over the same nodes, then get a retriever from each
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
keyword_index = SimpleKeywordTableIndex(nodes, storage_context=storage_context)
list_retriever = summary_index.as_retriever()
vector_retriever = vector_index.as_retriever()
keyword_retriever = keyword_index.as_retriever()
Defining the Retriever Tools
Wrap each retriever in a tool whose description tells the selector when to use it:
from llama_index.core.tools import RetrieverTool
list_tool = RetrieverTool.from_defaults(
    retriever=list_retriever,
    description=(
        "Will retrieve all context from Paul Graham's essay on What I Worked"
        " On. Don't use if the question only requires more specific context."
    ),
)
vector_tool = RetrieverTool.from_defaults(
    retriever=vector_retriever,
    description=(
        "Useful for retrieving specific context from Paul Graham essay on What"
        " I Worked On."
    ),
)
keyword_tool = RetrieverTool.from_defaults(
    retriever=keyword_retriever,
    description=(
        "Useful for retrieving specific context from Paul Graham essay on What"
        " I Worked On (using entities mentioned in query)"
    ),
)
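The description strings are what the selector has to go on when choosing between tools, so each one should say when the tool applies. As a rough illustration only, the choices could be presented to an LLM as a numbered list. This is not the prompt LlamaIndex builds internally, and format_choices is a hypothetical helper.

```python
def format_choices(descriptions):
    """Render tool descriptions as a numbered list for a selection prompt."""
    lines = []
    for i, desc in enumerate(descriptions, start=1):
        # Collapse internal whitespace so multi-line descriptions stay on one line.
        lines.append(f"({i}) {' '.join(desc.split())}")
    return "\n".join(lines)


descriptions = [
    "Will retrieve all context from Paul Graham's essay on What I Worked On. "
    "Don't use if the question only requires more specific context.",
    "Useful for retrieving specific context from Paul Graham essay on What "
    "I Worked On.",
]
print(format_choices(descriptions))
```

Reading the descriptions side by side like this makes it easier to spot tools that overlap, which is where a selector is most likely to pick inconsistently.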
Defining the Selector Module
Define the selector module, which routes each query to the appropriate retriever tool or tools:
from llama_index.core.selectors import (
    PydanticMultiSelector,
    PydanticSingleSelector,
)
from llama_index.core.retrievers import RouterRetriever
from llama_index.core.response.notebook_utils import display_source_node
# PydanticSingleSelector: the router picks exactly one retriever tool per query
retriever = RouterRetriever(
    selector=PydanticSingleSelector.from_defaults(llm=llm),
    retriever_tools=[list_tool, vector_tool],
)
nodes = retriever.retrieve(
    "Can you give me all the context regarding the author's life?"
)
for node in nodes:
    display_source_node(node)
nodes = retriever.retrieve("What did Paul Graham do after RISD?")
for node in nodes:
    display_source_node(node)
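Conceptually, a single-selector router picks exactly one tool and forwards the query to it. The sketch below shows that control flow in plain Python, with a keyword heuristic standing in for the LLM-based PydanticSingleSelector. ToyTool, toy_select, and route are hypothetical names, and this is not how LlamaIndex implements routing.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyTool:
    name: str
    description: str
    retrieve: Callable[[str], List[str]]


def toy_select(query: str, tools: List[ToyTool]) -> int:
    """Stand-in for an LLM selector: return the index of one tool.

    Assumption: queries containing the word "all" go to the first tool whose
    description also contains "all"; everything else goes to the last tool.
    """
    if "all" in query.lower().split():
        for i, tool in enumerate(tools):
            if "all" in tool.description.lower().split():
                return i
    return len(tools) - 1


def route(query: str, tools: List[ToyTool]) -> List[str]:
    """Select one tool and delegate retrieval to it."""
    chosen = tools[toy_select(query, tools)]
    return chosen.retrieve(query)


tools = [
    ToyTool(
        "list",
        "Retrieve all context from the essay.",
        lambda q: ["chunk-1", "chunk-2", "chunk-3"],
    ),
    ToyTool(
        "vector",
        "Retrieve specific context from the essay.",
        lambda q: ["chunk-2"],
    ),
]
print(route("Can you give me all the context regarding the author's life?", tools))
print(route("What did Paul Graham do after RISD?", tools))
```

The real selector replaces the keyword rule with an LLM call over the tool descriptions, but the shape is the same: choose an index, then delegate.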
# PydanticMultiSelector: the router may pick several retriever tools per query
retriever = RouterRetriever(
    selector=PydanticMultiSelector.from_defaults(llm=llm),
    retriever_tools=[list_tool, vector_tool, keyword_tool],
)
nodes = retriever.retrieve(
    "What were notable events from the author's time at Interleaf and YC?"
)
for node in nodes:
    display_source_node(node)
# Async variant; top-level await works in a notebook (in a script, wrap this
# in an async function and run it with asyncio.run)
nodes = await retriever.aretrieve(
    "What were notable events from the author's time at Interleaf and YC?"
)
for node in nodes:
    display_source_node(node)
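When a multi-selector picks several tools, their results have to be combined, and the async path can query the selected retrievers concurrently. Below is a minimal sketch of that idea, assuming results are deduplicated by node ID and first-seen order is kept. The actual RouterRetriever internals may differ, and merge_by_id, fake_retriever, and retrieve_from_selected are hypothetical names.

```python
import asyncio
from typing import Dict, List, Tuple

Node = Tuple[str, str]  # (node_id, text)


def merge_by_id(result_lists: List[List[Node]]) -> List[Node]:
    """Combine results from several retrievers, keeping one entry per node ID."""
    merged: Dict[str, Node] = {}
    for results in result_lists:
        for node in results:
            merged.setdefault(node[0], node)  # keep the first occurrence
    return list(merged.values())


async def fake_retriever(nodes: List[Node]) -> List[Node]:
    """Stand-in for an async retriever call such as aretrieve."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return nodes


async def retrieve_from_selected() -> List[Node]:
    # Run the selected retrievers concurrently, then merge their results.
    results = await asyncio.gather(
        fake_retriever([("n1", "Interleaf"), ("n2", "YC")]),
        fake_retriever([("n2", "YC"), ("n3", "Viaweb")]),
    )
    return merge_by_id(list(results))


merged_nodes = asyncio.run(retrieve_from_selected())
print(merged_nodes)
```

In a notebook, await retrieve_from_selected() can be used in place of asyncio.run. Without deduplication, a chunk returned by both the vector and keyword retrievers would appear twice in the final list.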
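With a router retriever, you can choose a suitable retrieval tool dynamically for each query, which can improve the accuracy and relevance of the retrieved results. I hope you find this helpful!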