Time-Based Document Indexing and Querying with LlamaIndex

Latest recommended article published on 2024-08-07 08:06:57

qq_37836323

Article link: https://blog.csdn.net/qq_29929123/article/details/140971075

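When working with several versions of the same document, time matters: we want answers drawn from the most recent information, not merely from whichever text is most similar to the query. This article shows how to use LlamaIndex for time-aware document indexing and querying, and walks through the API calls involved.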

Preparation

First, set up the environment and import the required libraries. We use the LlamaIndex library to process the documents.

import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import (
    FixedRecencyPostprocessor,
    EmbeddingRecencyPostprocessor,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core import Settings
from llama_index.core import StorageContext

# Set the API key (placeholder value; substitute your own key)
os.environ["OPENAI_API_KEY"] = "sk-..."

# Return the metadata (a date) for a file, based on the version tag in its name
def get_file_metadata(file_name: str):
    if "v1" in file_name:
        return {"date": "2020-01-01"}
    elif "v2" in file_name:
        return {"date": "2020-02-03"}
    elif "v3" in file_name:
        return {"date": "2022-04-12"}
    else:
        raise ValueError("invalid file")

# Load the three versions of the document, attaching the date metadata to each
documents = SimpleDirectoryReader(
    input_files=[
        "test_versioned_data/paul_graham_essay_v1.txt",
        "test_versioned_data/paul_graham_essay_v2.txt",
        "test_versioned_data/paul_graham_essay_v3.txt",
    ],
    file_metadata=get_file_metadata,
).load_data()

# Configure the sentence splitter
Settings.text_splitter = SentenceSplitter(chunk_size=512)

# Split the documents into nodes
nodes = Settings.text_splitter.get_nodes_from_documents(documents)

# Add the nodes to the document store
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

storage_context = StorageContext.from_defaults(docstore=docstore)

print(documents[2].get_text())

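In the code above, we first set the API key and define a function that returns the metadata for each file. We then load the three versions of the document, split them into nodes with the sentence splitter, and add those nodes to the document store.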
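
The `get_file_metadata` callback relies on a chain of substring checks. As a minimal sketch of the same idea, the version-to-date mapping could live in a dictionary, with the date string validated as an ISO date so that a typo surfaces at load time. The names `VERSION_DATES` and `metadata_for` are hypothetical and not part of LlamaIndex, and the sketch assumes file names end in a `_v<N>` tag before the extension.

```python
import re
from datetime import date

# Hypothetical lookup table mirroring the dates used in get_file_metadata.
VERSION_DATES = {
    "v1": "2020-01-01",
    "v2": "2020-02-03",
    "v3": "2022-04-12",
}

def metadata_for(file_name: str) -> dict:
    """Return {"date": ...} for a file whose name contains a _v<N> version tag."""
    match = re.search(r"_(v\d+)\.", file_name)
    if match is None or match.group(1) not in VERSION_DATES:
        raise ValueError(f"invalid file: {file_name}")
    iso_date = VERSION_DATES[match.group(1)]
    date.fromisoformat(iso_date)  # raises ValueError if the string is not a valid date
    return {"date": iso_date}

print(metadata_for("test_versioned_data/paul_graham_essay_v2.txt"))
```

A function with this shape could be passed as `file_metadata` in place of `get_file_metadata`, since both take a file name and return a dictionary.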

Building the Index

Next, we build a vector-based index.

# Build the index
index = VectorStoreIndex(nodes, storage_context=storage_context)
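
Conceptually, a vector index stores one embedding per node and, at query time, returns the nodes whose embeddings lie closest to the query embedding. The following toy sketch shows that retrieval step with cosine similarity over hand-written 2-D vectors. It is not how `VectorStoreIndex` is implemented; `top_k_by_similarity` and the sample vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_by_similarity(query_vec, node_vecs, k):
    """Return the ids of the k nodes most similar to the query, best first."""
    ranked = sorted(
        node_vecs,
        key=lambda node_id: cosine(query_vec, node_vecs[node_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings: three versions of the same passage plus one unrelated chunk.
node_vecs = {
    "v1_chunk": [0.9, 0.1],
    "v2_chunk": [0.8, 0.2],
    "v3_chunk": [0.7, 0.3],
    "unrelated": [0.0, 1.0],
}
print(top_k_by_similarity([1.0, 0.0], node_vecs, k=3))
```

In this toy data the oldest version happens to rank first. Similarity alone has no notion of which version is newest, which is the gap a recency-aware step is meant to close.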

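Defining the Recency Postprocessors

We define two recency postprocessors: a fixed recency postprocessor and an embedding-based recency postprocessor. Both are created with their default settings here.

# Define the recency postprocessors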
node_postprocessor = FixedRecencyPostprocessor()
node_postprocessor_emb = EmbeddingRecencyPostprocessor()
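
As far as I understand it, `FixedRecencyPostprocessor` orders the retrieved nodes by their date metadata and keeps only the most recent ones (a single node by default). The sketch below imitates that behaviour on plain dictionaries. It is an approximation under that assumption, not the library's code, and `keep_most_recent` and the sample nodes are hypothetical.

```python
from datetime import date

def keep_most_recent(nodes, top_k=1, date_key="date"):
    """Sort nodes newest-first by an ISO date in their metadata and keep top_k."""
    ordered = sorted(
        nodes,
        key=lambda n: date.fromisoformat(n["metadata"][date_key]),
        reverse=True,
    )
    return ordered[:top_k]

# Stand-ins for the three nodes a similarity search might return.
retrieved = [
    {"id": "v1_chunk", "metadata": {"date": "2020-01-01"}},
    {"id": "v3_chunk", "metadata": {"date": "2022-04-12"}},
    {"id": "v2_chunk", "metadata": {"date": "2020-02-03"}},
]
print([n["id"] for n in keep_most_recent(retrieved)])
```

With a rule like this, only the node from the newest version reaches the answer-synthesis step, regardless of how the similarity search ranked it.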

Querying the Index

We run a plain query first, then the same query with each recency postprocessor applied.

# Plain query
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("How much did the author raise in seed funding from Idelle's husband (Julian) for Viaweb?")
print(response)

# Query with the fixed recency postprocessor
query_engine = index.as_query_engine(similarity_top_k=3, node_postprocessors=[node_postprocessor])
response = query_engine.query("How much did the author raise in seed funding from Idelle's husband (Julian) for Viaweb?")
print(response)

# Query with the embedding-based recency postprocessor
query_engine = index.as_query_engine(similarity_top_k=3, node_postprocessors=[node_postprocessor_emb])
response = query_engine.query("How much did the author raise in seed funding from Idelle's husband (Julian) for Viaweb?")
print(response)
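
The embedding-based variant can behave differently from a fixed cutoff. My understanding is that `EmbeddingRecencyPostprocessor` also orders nodes newest-first, but then drops an older node only when its embedding is very similar to that of a newer one, so older nodes on other topics survive. The sketch below illustrates that idea with toy vectors. The function `drop_stale_duplicates`, the 0.7 threshold, and the data are assumptions for illustration and are not taken from the library's source.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def drop_stale_duplicates(nodes, similarity_cutoff=0.7):
    """Walk nodes newest-first; drop an older node that is too similar to a kept newer one."""
    # ISO-formatted date strings sort chronologically when compared as text.
    ordered = sorted(nodes, key=lambda n: n["date"], reverse=True)
    kept = []
    for node in ordered:
        is_stale = any(
            cosine(node["embedding"], newer["embedding"]) > similarity_cutoff
            for newer in kept
        )
        if not is_stale:
            kept.append(node)
    return kept

nodes = [
    {"id": "v1_chunk", "date": "2020-01-01", "embedding": [0.9, 0.1]},
    {"id": "v3_chunk", "date": "2022-04-12", "embedding": [0.7, 0.3]},
    {"id": "other_topic", "date": "2020-02-03", "embedding": [0.0, 1.0]},
]
print([n["id"] for n in drop_stale_duplicates(nodes)])
```

Here the old `v1_chunk` is removed because it closely matches the newer `v3_chunk`, while `other_topic` is kept even though it is older, because nothing newer covers the same content.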