Rethink What You Know: LLM Evaluation Can Be This Simple

Latest recommended article published on 2024-09-29 23:21:25

技术狂潮AI

Article link: https://blog.csdn.net/FrenzyTechAI/article/details/140705329

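1. Introduction

Generative AI and large language models (LLMs) such as GPT-4, Llama, and Claude have opened a new era of AI-powered applications and use cases. Evaluating LLMs, however, often involves many complex libraries and methods, which can be daunting.

In practice, LLM evaluation does not have to be complicated. You can build an effective evaluation pipeline without elaborate pipelines, databases, or infrastructure components.

Discord is a good example: they built a chatbot for 20 million users and focused on evaluation methods that are easy to run and quick to implement. For instance, they check whether a message is entirely lowercase to determine whether the chatbot is being used casually or in some other way.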
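A check of this kind fits in a few lines of plain Python. The sketch below is an assumption about what such a heuristic might look like, not Discord's actual code; the function name `is_all_lowercase` and the sample messages are hypothetical.

```python
def is_all_lowercase(message: str) -> bool:
    """Return True if the message contains letters and none of them is uppercase."""
    letters = [ch for ch in message if ch.isalpha()]
    # Messages without any letters (digits or emoji only) are not counted as casual.
    return bool(letters) and all(ch.islower() for ch in letters)


# Hypothetical sample messages, for illustration only.
messages = [
    "hey can u help me with this",
    "Please summarize the attached report.",
    "12345",
]
casual_share = sum(is_all_lowercase(m) for m in messages) / len(messages)
print(f"Share of all-lowercase messages: {casual_share:.2f}")
```

The point is that a cheap, deterministic signal like this can run on every message without calling a model at all.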

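In this post, we will learn how to set up a simplified evaluation workflow for your LLM application. Inspired by G-EVAL and Self-Rewarding Language Models, we will guide the evaluation with a form-filling prompt template that combines an additive score, Chain-of-Thought (CoT), and few-shot examples. This approach aligns well with human judgment and keeps the evaluation process understandable, effective, and manageable.

We will use meta-llama/Meta-Llama-3-70B-Instruct as the LLM judge, hosted on the Hugging Face Inference API and called through the OpenAI client. You can also use other LLMs.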

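2. How to Design a Good Evaluation Prompt for an LLM Judge

When you use an LLM as a judge, the quality of the evaluation prompt is critical: it directly determines how accurate the judge's results are. The recommendations below are based on practical experience and insights from recent research, in particular the G-EVAL paper and the Self-Rewarding Language Models paper.

2.1 Define a Clear Evaluation Metric (Optional: Additive Score)

First, establish a clear metric for the evaluation and break it down into specific criteria, for example with an additive score. This improves consistency and, combined with few-shot examples, aligns better with human judgment. For example:

* Add 1 point if the answer directly addresses the topic of the question without drifting into unrelated areas.

* Add 1 point if the answer is suitable for educational use and introduces key concepts for learning to code.

* …

Using a small integer range such as 0-5 simplifies scoring and reduces the variance of the LLM's evaluations.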
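To make the additive idea concrete, the sketch below sums one point per satisfied criterion. The criterion names and the `additive_score` helper are hypothetical and only illustrate the scoring scheme; in the workflow described here it is the LLM judge, not Python code, that decides whether each criterion is met.

```python
def additive_score(checks: dict[str, bool], max_score: int = 5) -> int:
    """Sum one point per satisfied criterion, capped at max_score."""
    return min(sum(checks.values()), max_score)


# Hypothetical per-criterion verdicts for a single answer.
checks = {
    "addresses_topic": True,
    "suitable_for_education": True,
    "well_organized": False,
}
print(additive_score(checks))  # prints 2
```

Because every criterion contributes at most one point, a low score can always be traced back to the specific criteria that were not met.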

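2.2 Define Chain-of-Thought (CoT) Evaluation Steps

Give the LLM predefined reasoning steps so that it evaluates step by step. This makes the evaluation more deliberate and accurate. For example:

* Read the question carefully to understand what is being asked.

* Read through the answer.

* Assess the length of the answer. Is it overly long or appropriately brief?

* …

2.3 Include Few-Shot Examples (Optional)

Adding examples of questions, answers, reasoning steps, and their evaluations helps steer the LLM toward human preferences and improves its robustness.

2.4 Define the Output Schema

Request the evaluation in a structured format such as JSON, with fields for each criterion and for the total score. This lets you parse the results and compute metrics automatically. Providing a few few-shot examples can further improve the output.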
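Structured output is only useful if it is validated before it is used. The following is a minimal sketch of such a check, assuming the judge returns a JSON object with a "reasoning" string and an integer "total_score"; the helper name `parse_evaluation` is hypothetical and not part of any library.

```python
import json


def parse_evaluation(raw: str, max_score: int = 5) -> dict:
    """Extract and validate the JSON object contained in a judge response."""
    # Models sometimes wrap the JSON in extra text, so cut out the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    result = json.loads(raw[start:end + 1])

    if not isinstance(result.get("reasoning"), str):
        raise ValueError("'reasoning' must be a string")
    score = result.get("total_score")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("'total_score' must be an integer")
    if not 0 <= score <= max_score:
        raise ValueError(f"'total_score' must be between 0 and {max_score}")
    return result


raw = 'Here is my evaluation:\n{"reasoning": "The answer only uses the context.", "total_score": 1}'
print(parse_evaluation(raw, max_score=3)["total_score"])  # prints 1
```

Rejecting malformed responses early keeps a single bad generation from silently skewing the average score.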

Here is an example of how everything fits together:

EVALUATION_PROMPT_TEMPLATE = """
You are an expert judge evaluating the Retrieval Augmented Generation applications. Your task is to evaluate a given answer based on a context and question using the criteria provided below.
 
Evaluation Criteria (Additive Score, 0-5):
{additive_criteria}
 
Evaluation Steps:
{evaluation_steps}
 
Output format:
{json_schema}
 
Examples:
{examples}
 
Now, please evaluate the following:
 
Question:
{question}
Context:
{context}
Answer:
{answer}
"""

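3. Evaluating a RAG Application with an LLM Judge

Retrieval Augmented Generation (RAG) is one of the most popular LLM use cases, but also one of the hardest to evaluate. There are common RAG metrics, but they do not always fit a specific use case, or they are too "generic". We therefore define a new additive RAG metric on a 3-point scale.

This 3-point additive metric evaluates a RAG system's response on three aspects: how closely it sticks to the given context, whether it fully addresses all key elements, and whether it is concise while staying relevant.

Note: this metric is made up purely for demonstration purposes. In a real application, you need to define metrics and criteria based on your specific use case and what matters for it.

To evaluate the model, we need to define additive_criteria, evaluation_steps, and json_schema.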

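ADDITIVE_CRITERIA = """1. Context: Award 1 point if the answer uses only information provided in the context, without introducing external or fabricated details.
2. Completeness: Add 1 point if the answer addresses all key elements of the question based on the available context.
3. Conciseness: Add 1 point if the answer uses the fewest words possible to address the question and avoids redundancy."""

EVALUATION_STEPS = """1. Read the provided context, question, and answer carefully.
2. Go through each evaluation criterion one by one and assess whether the answer meets it.
3. Write down your reasoning for each criterion, explaining why you did or did not award a point. You can only award whole points.
4. Calculate the total score by summing the points awarded.
5. Format your evaluation response according to the specified output format, ensuring proper JSON syntax with a "reasoning" field for your step-by-step explanation and a "total_score" field for the calculated total. Review your formatted response. It must be valid JSON."""

JSON_SCHEMA = """{
  "reasoning": "Your step-by-step explanation for the evaluation criteria, stating why you awarded or did not award each point.",
  "total_score": sum of criteria scores
}"""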

def format_examples(examples):    
    return "\n".join([        
        f'Question: {ex["question"]}\nContext: {ex["context"]}\nAnswer: {ex["answer"]}\nEvaluation:{ex["eval"]}'         
        for ex in examples    
    ])
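One detail is worth knowing when filling a prompt template with `str.format`: braces inside the substituted values are inserted verbatim, while literal braces written directly in the template must be doubled. The short sketch below illustrates this with hypothetical template strings that are much smaller than the real prompt.

```python
# Hypothetical mini-template; the real prompt has more placeholders.
TEMPLATE = "Output format:\n{json_schema}\n\nQuestion:\n{question}"
schema = '{"reasoning": "your explanation", "total_score": 0}'

# Braces inside the *value* are not interpreted by str.format.
prompt = TEMPLATE.format(json_schema=schema, question="What is RAG?")
print(prompt)

# Literal braces inside the *template* itself must be escaped by doubling them.
INLINE_TEMPLATE = 'Return JSON like {{"total_score": 0}} for: {question}'
print(INLINE_TEMPLATE.format(question="What is RAG?"))
```

Passing the schema in as a value, as the template in this post does, avoids the escaping issue entirely.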

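To help improve the model's performance, we define three few-shot examples: one with a score of 0, one with a score of 1, and one with a score of 3. You can find them in the dataset repository.

For the evaluation data, we use a synthetic dataset built from the 2023_10 NVIDIA SEC Filings. The dataset contains questions, answers, and contexts. We will evaluate 50 random samples to see how the model performs on the metric we defined.

We will use the asynchronous AsyncOpenAI client to score multiple examples in parallel.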

import asyncio
from openai import AsyncOpenAI
import huggingface_hub
from tqdm.asyncio import tqdm_asyncio

# Maximum number of concurrent requests
sem = asyncio.Semaphore(5)

# Initialize the client with the Hugging Face Inference API
client = AsyncOpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key=huggingface_hub.get_token(),
)

# Async helper that scores samples concurrently, limited by the semaphore
async def limited_get_score(dataset):
    async def gen(sample):
        async with sem:
            res = await get_eval_score(sample)
            progress_bar.update(1)
            return res

    progress_bar = tqdm_asyncio(total=len(dataset), desc="Scoring", unit="sample")
    tasks = [gen(text) for text in dataset]
    # Plain asyncio.gather: progress is already tracked by progress_bar above,
    # so a second bar from tqdm_asyncio.gather is not needed
    responses = await asyncio.gather(*tasks)
    progress_bar.close()
    return responses
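The semaphore is what keeps the number of in-flight requests bounded. The self-contained sketch below shows the same pattern with a stand-in coroutine instead of a real API call; `fake_score` and the limit of 2 are made up for illustration.

```python
import asyncio


async def main():
    sem = asyncio.Semaphore(2)  # allow at most 2 coroutines inside the block at once
    running = 0
    peak = 0

    async def fake_score(i: int) -> int:
        nonlocal running, peak
        async with sem:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for the network round trip
            running -= 1
            return i * 2

    # gather returns results in the order the coroutines were passed in
    results = await asyncio.gather(*(fake_score(i) for i in range(6)))
    return results, peak


results, peak = asyncio.run(main())
print(results)  # prints [0, 2, 4, 6, 8, 10]
print(peak)     # prints 2
```

In a Jupyter notebook an event loop is already running, so there you would write `results, peak = await main()` instead of calling `asyncio.run`.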

Next, we define our get_eval_score method.

# Define the get_eval_score method
import json

async def get_eval_score(sample):
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        additive_criteria=ADDITIVE_CRITERIA,
        evaluation_steps=EVALUATION_STEPS,
        json_schema=JSON_SCHEMA,
        examples=format_examples(少样本示例),
        question=sample["question"],
        context=sample["context"],
        answer=sample["answer"]
    )
    # Uncomment if you want to inspect the prompt
    # print(prompt)
    response = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=512,
    )
    results = response.choices[0].message.content
    # Merge the evaluation result into the sample
    return {**sample, **json.loads(results)}

The last missing piece is the data. We load the samples with the datasets library.

from datasets import load_dataset

eval_ds = load_dataset("zeitgeist-ai/financial-rag-nvidia-sec", split="train").shuffle(seed=42).select(range(50))
print(f"Evaluating {len(eval_ds)} samples")
# 少样本示例 holds the few-shot examples used in the prompt
少样本示例 = load_dataset("zeitgeist-ai/financial-rag-nvidia-sec", "few-shot-examples", split="train")
print(f"Loaded {len(少样本示例)} few-shot examples")

Let's test a single example.

import json

sample = [sample for sample in eval_ds.select(range(1))]
print(f"Question: {sample[0]['question']}\nContext: {sample[0]['context']}\nAnswer: {sample[0]['answer']}")
print("---" * 10)
# Change this if you are not running in a Jupyter notebook
# responses = asyncio.run(limited_get_score(sample))
responses = await limited_get_score(sample)
print(f"Reasoning: {responses[0]['reasoning']}\nTotal Score: {responses[0]['total_score']}")

The evaluation looks good. Next, let's evaluate all 50 examples and compute the average score.

results = await limited_get_score(eval_ds)
# Scoring:  80%|████████  | 40/50 [00:22<00:04,  2.36sample/s]

# Compute the average score
total_score = sum([r["total_score"] for r in results]) / len(results)
print(f"Average Score: {total_score}")

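We got an average score of 2.78 out of 3. To understand where the missing points come from, let's look at a low-scoring example and analyze why it scored poorly.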

# Extract the samples with a score of 0
score_0 = [r for r in results if r["total_score"] == 0]
print(f"Samples with score 0: {len(score_0)}")
# Samples with score 0: 2

In my test run, 2 samples scored 0. Let's look at the first one.

print(f"Question: {score_0[0]['question']}\nContext: {score_0[0]['context']}\nAnswer: {score_0[0]['answer']}")
print("---" * 10)
print(f"Reasoning: {score_0[0]['reasoning']}\nTotal Score: {score_0[0]['total_score']}")

# Question: What was the total dollar value of outstanding commercial real estate loans at the end of 2023?
# Context: The total outstanding commercial real estate loans amounted to $72,878 million at the end of December 2022.
# Answer: $72.878 billion
# ------------------------------
# Reasoning: 1. Context: The answer does not use the correct information from the provided context. The context mentions the total outstanding commercial real estate loans at the end of December 2022, but the answer provides a value without specifying the correct year. Therefore, no points are awarded for context.
# 2. Completeness: The answer provides a dollar value, but it does not address the key element of the question, which is the total dollar value at the end of 2023. The context only provides information about 2022, and the answer does not clarify or provide the correct information for 2023. Thus, no points are awarded for completeness.
# 3. Conciseness: The answer is concise, but it does not address the correct question. If the answer had provided a value with a clear statement that the information is not available for 2023, it would have been more accurate. However, in this case, the answer is concise but incorrect.
# Total Score: 0

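Our LLM judge correctly identified that the question asks about 2023 while the context only provides information for 2022. We also see that the completeness and conciseness criteria depend heavily on the context. Depending on your needs, the prompt can be refined further.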
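Beyond the average and the zero-score samples, the full score distribution can show where answers lose points. The sketch below uses made-up placeholder results, not the numbers from this run, and assumes each result is a dict with a "total_score" key as produced by the evaluation above.

```python
from collections import Counter

# Made-up placeholder results; real entries come from the judge's JSON output.
results = [
    {"total_score": 3},
    {"total_score": 3},
    {"total_score": 2},
    {"total_score": 0},
]

# Count how many samples received each score.
distribution = Counter(r["total_score"] for r in results)
for score in sorted(distribution):
    share = distribution[score] / len(results)
    print(f"score {score}: {distribution[score]} samples ({share:.0%})")
```

A cluster at one particular score is often a hint that a single criterion is being applied too strictly or too loosely.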

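4. Limitations

An LLM used as a judge may prefer LLM-generated text over human-written text. High-quality few-shot examples written by human experts can mitigate this.

LLM judges can also be biased and inconsistent. The prompt, the predefined steps, and the criteria are critical to the evaluation result, but they may not fit every use case perfectly and need to be adapted to yours. The G-EVAL and Self-Rewarding Language Models papers give more examples of how to tune prompts for better alignment.

In addition, an additive score is simple and effective but not suitable for every situation. Sometimes a simple boolean check (correct/incorrect) is enough.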
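As an illustration of such a boolean check, the sketch below compares a prediction with a reference answer after normalizing case and whitespace. The `is_correct` helper and the example pairs are hypothetical; real use cases may need a looser or stricter notion of a match.

```python
def is_correct(prediction: str, reference: str) -> bool:
    """Exact match that ignores case and differences in whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return normalize(prediction) == normalize(reference)


# Hypothetical (prediction, reference) pairs.
pairs = [
    ("$72.878 billion", "$72.878  Billion"),
    ("Paris", "London"),
]
accuracy = sum(is_correct(p, r) for p, r in pairs) / len(pairs)
print(f"Accuracy: {accuracy:.2f}")  # prints "Accuracy: 0.50"
```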

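Finally, keep the judge model's context window in mind. If the prompt exceeds the window size, the evaluation becomes harder to run reliably.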
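A rough pre-flight check can flag prompts that are likely too long before they are sent. The sketch below assumes about 4 characters per token, which is only a heuristic for English text, and takes the window size as a parameter; for exact counts you would use the judge model's own tokenizer. The function name and the window sizes shown are assumptions for illustration.

```python
def fits_context_window(
    prompt: str,
    context_window: int,
    max_new_tokens: int,
    chars_per_token: float = 4.0,  # heuristic assumption, not an exact tokenizer count
) -> bool:
    """Estimate whether the prompt plus the generation budget fits into the window."""
    estimated_prompt_tokens = len(prompt) / chars_per_token
    return estimated_prompt_tokens + max_new_tokens <= context_window


prompt = "word " * 1000  # 5000 characters, about 1250 tokens under the heuristic
print(fits_context_window(prompt, context_window=8192, max_new_tokens=512))  # prints True
print(fits_context_window(prompt, context_window=1024, max_new_tokens=512))  # prints False
```

Long retrieved contexts and several few-shot examples add up quickly, so it helps to run this kind of check on the fully formatted prompt rather than on the question alone.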