Composable Objects in LlamaIndex: Building a Top-Level Index
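This article shows how to combine multiple objects into a single top-level index. The approach works by setting up IndexNode objects, where the obj field of each node can point to one of the following:
- A query engine
- A retriever
- A query pipeline
- Another node!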
object = IndexNode(index_id="my_object", obj=query_engine, text="some text about this object")
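To make the role of the obj field concrete, here is a toy sketch in plain Python. It is not LlamaIndex's implementation: ToyNode, resolve, and keyword_retriever are hypothetical names, and a "retriever" is modeled as any callable that takes a query string. The only point it illustrates is that a node can point either at something runnable or at another node.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class ToyNode:
    """Hypothetical stand-in for IndexNode: an id, a description, and a target."""

    index_id: str
    text: str
    obj: Union[Callable[[str], List[str]], "ToyNode"]


def resolve(node: ToyNode, query: str) -> List[str]:
    """Follow obj until something callable is reached, then run the query."""
    target = node.obj
    while isinstance(target, ToyNode):  # a node may point at another node
        target = target.obj
    return target(query)


def keyword_retriever(query: str) -> List[str]:
    """Toy retriever that just echoes the query."""
    return [f"keyword hit for: {query}"]


inner = ToyNode("inner", "keyword retriever", keyword_retriever)
outer = ToyNode("outer", "wrapper around inner", inner)
print(resolve(outer, "attention"))
```

In the real library the top-level index decides which objects to call and how to combine their output; the toy only shows the indirection through obj.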
Data Preparation
First, install the required libraries and download the sample data:
%pip install llama-index-storage-docstore-mongodb
%pip install llama-index-vector-stores-qdrant
%pip install llama-index-storage-docstore-firestore
%pip install llama-index-retrievers-bm25
%pip install llama-index-storage-docstore-redis
%pip install llama-index-storage-docstore-dynamodb
%pip install llama-index-readers-file pymupdf
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "./llama2.pdf"
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/1706.03762.pdf" -O "./attention.pdf"
Retriever Setup
Set the OpenAI API key, load the two PDFs, and split them into nodes:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.readers.file import PyMuPDFReader
llama2_docs = PyMuPDFReader().load_data(file_path="./llama2.pdf", metadata=True)
attention_docs = PyMuPDFReader().load_data(file_path="./attention.pdf", metadata=True)
nodes = TokenTextSplitter(chunk_size=1024, chunk_overlap=128).get_nodes_from_documents(llama2_docs + attention_docs)
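The chunk_size and chunk_overlap arguments control how long each chunk is and how much text consecutive chunks share. The sketch below illustrates that idea on a list of whitespace-separated words. It is a simplification under stated assumptions: TokenTextSplitter counts tokens with a real tokenizer and handles separators, whereas split_with_overlap is a hypothetical helper written only for this example.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


words = "a b c d e f g h i j".split()
for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With chunk_size=4 and chunk_overlap=1, each chunk starts with the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk more often than with no overlap.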
Document Store Setup
Store the nodes in a SimpleDocumentStore:
from llama_index.core.storage.docstore import SimpleDocumentStore
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
Vector Store and Retriever Setup
Build a vector retriever backed by QdrantVectorStore and a BM25Retriever backed by the document store:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
client = QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore("composable", client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
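# Pass storage_context so the embeddings are written to Qdrant rather than the default in-memory store
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)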
vector_retriever = index.as_retriever(similarity_top_k=2)
bm25_retriever = BM25Retriever.from_defaults(docstore=docstore, similarity_top_k=2)
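BM25Retriever ranks nodes by keyword relevance rather than by embedding similarity. As a rough illustration of the kind of scoring involved, the sketch below implements one common form of the Okapi BM25 formula in plain Python. It is a toy under stated assumptions, not the code used by llama-index-retrievers-bm25: the whitespace tokenizer, the k1 and b defaults, and the names bm25_scores and top_k are all chosen here for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with a basic BM25 formula."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_k(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Return the k highest-scoring documents (similar in spirit to similarity_top_k)."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]


docs = [
    "attention is all you need",
    "llama 2 is a language model",
    "bm25 ranks documents by term overlap",
]
print(top_k("attention", docs, k=1))
```

Because keyword scoring and embedding similarity fail in different ways, pairing the two retrievers gives the top-level index two independent views of the same nodes.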
Composing Objects
Wrap each retriever in an IndexNode, then pass the nodes to a SummaryIndex:
from llama_index.core.schema import IndexNode
vector_obj = IndexNode(index_id="vector", obj=vector_retriever, text="Vector Retriever")
bm25_obj = IndexNode(index_id="bm25", obj=bm25_retriever, text="BM25 Retriever")
from llama_index.core import SummaryIndex
summary_index = SummaryIndex(objects=[vector_obj, bm25_obj])
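Querying
Query with the tree_summarize response mode, using the async aquery method so that the work can run concurrently and responses come back faster. Top-level await works in a notebook; in a plain script, wrap the calls in asyncio.run: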
query_engine = summary_index.as_query_engine(response_mode="tree_summarize", verbose=True)
response = await query_engine.aquery("How does attention work in transformers?")
print(str(response))
response = await query_engine.aquery("What is the architecture of Llama2 based on?")
print(str(response))
response = await query_engine.aquery("What was used before attention in transformers?")
print(str(response))
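The benefit of the async path is that independent calls can wait at the same time instead of one after another. The sketch below shows that general idea with asyncio.gather from the standard library. It makes no claim about how LlamaIndex schedules work internally: fake_retrieve and fan_out are hypothetical, and asyncio.sleep stands in for network or model latency.

```python
import asyncio
from typing import List


async def fake_retrieve(name: str, query: str, delay: float) -> List[str]:
    """Pretend retriever: waits for `delay` seconds, then returns one hit."""
    await asyncio.sleep(delay)
    return [f"{name}: {query}"]


async def fan_out(query: str) -> List[str]:
    """Run both pretend retrievers concurrently and flatten their results."""
    results = await asyncio.gather(
        fake_retrieve("vector", query, 0.05),
        fake_retrieve("bm25", query, 0.05),
    )
    # gather returns results in the order the awaitables were passed in
    return [hit for hits in results for hit in hits]


merged = asyncio.run(fan_out("How does attention work?"))
print(merged)
```

Both waits overlap, so the total time is close to the slower of the two calls rather than their sum. Note that asyncio.run is meant for scripts; inside a notebook, which already has a running event loop, use await fan_out(...) directly.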
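Saving and Loading
Because the objects themselves are not serializable, only the underlying data is saved, and the objects must be supplied again at load time:
Saving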
docstore.persist("./docstore.json")
Loading
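from llama_index.core import VectorStoreIndex
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient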
docstore = SimpleDocumentStore.from_persist_path("./docstore.json")
client = QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore("composable", client=client)
index = VectorStoreIndex.from_vector_store(vector_store)
vector_retriever = index.as_retriever(similarity_top_k=2)
bm25_retriever = BM25Retriever.from_defaults(docstore=docstore, similarity_top_k=2)
from llama_index.core.schema import IndexNode
vector_obj = IndexNode(index_id="vector", obj=vector_retriever, text="Vector Retriever")
bm25_obj = IndexNode(index_id="bm25", obj=bm25_retriever, text="BM25 Retriever")
from llama_index.core import SummaryIndex
summary_index = SummaryIndex(objects=[vector_obj, bm25_obj])
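The pattern above (persist the data, rebuild the objects) can be shown in miniature with the standard library. In this sketch, json stands in for the persistence layer and a closure stands in for a retriever. None of it reflects LlamaIndex's storage format: make_keyword_retriever and the toy_docstore.json file name are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def make_keyword_retriever(texts: List[str]) -> Callable[[str], List[str]]:
    """Build a toy retriever that returns texts containing the query string."""

    def retrieve(query: str) -> List[str]:
        return [t for t in texts if query.lower() in t.lower()]

    return retrieve


texts = ["Attention is all you need", "Llama 2 is a family of language models"]
retriever = make_keyword_retriever(texts)

# A live retriever object cannot be written to JSON...
try:
    json.dumps({"obj": retriever})
    serializable = True
except TypeError:
    serializable = False

# ...so persist only the data, and rebuild the object after loading.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_docstore.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"texts": texts}, f)
    with open(path, encoding="utf-8") as f:
        loaded: Dict[str, List[str]] = json.load(f)

rebuilt = make_keyword_retriever(loaded["texts"])
print(serializable, rebuilt("llama"))
```

The rebuilt retriever behaves like the original because everything it needs lives in the persisted data, which is the same reason the retrievers and IndexNode objects are recreated from the docstore and vector store at load time.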
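In this way, composable objects in LlamaIndex let you build and query a knowledge base through several retrievers at once, which can improve the accuracy and relevance of the retrieved results.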