Survey of LLM Evaluation Frameworks
Techniques for model building (modeling), scaling, and generalization are currently advancing faster than the methods for evaluating them, which leads to under-evaluation of models and inflated or exaggerated estimates of their capabilities. We also do not yet have good solutions to the evaluation problems posed by small generative models and long-form generations.
I. Evaluation Frameworks
1. llmuses
Pros:
- Ships with many common benchmark datasets, including MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, HumanEval, etc.
- Implementations of common evaluation metrics
- Unified model access, compatible with the generate and chat interfaces of several model families
- Automatic evaluation (evaluator):
  - Automatic scoring of objective questions
  - Automatic evaluation of complex tasks using an expert model
- Evaluation report generation
- Arena mode
- Visualization tools
- Model performance evaluation
Cons:
The built-in benchmark datasets do not cover the computer-science/code domain, so for the code evaluation in this project the only practically usable approach is expert-model-based evaluation.
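The expert-model (LLM-as-judge) approach mentioned above can be sketched as follows. This is a hedged illustration, not llmuses's actual API: `ask_judge` is a hypothetical callable standing in for whatever judge model and API client you use, and the prompt template is an assumption.

```python
# Hypothetical sketch of expert-model (LLM-as-judge) scoring for code answers.
# `ask_judge` takes a prompt string and returns the judge model's text reply;
# plug in your own API client (OpenAI, local model, etc.).
JUDGE_PROMPT = """You are a senior software engineer. Rate the following answer
to a coding question on a 1-5 scale for correctness. Reply with the digit only.
Question: {question}
Answer: {answer}"""


def judge_score(question, answer, ask_judge):
    """Ask the expert model for a 1-5 rating; return None if no digit found."""
    reply = ask_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else None
```

Parsing only the first digit keeps the scorer robust to chatty judge replies like "Score: 4".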
2. FlagEval
Code ability:
Tests a model's ability to write, understand, and optimize computer program code.
3. CMMLU
Chinese multi-task language understanding evaluation.
The datasets consist entirely of objective multiple-choice questions, so it offers no advantage for open-ended, subjective code evaluation.
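Scoring such objective multiple-choice benchmarks reduces to extracting the predicted option letter from the model's output and computing accuracy. A minimal sketch (the regex-based extractor is a common heuristic, not CMMLU's official scorer):

```python
import re


def extract_choice(output):
    """Return the first standalone A-D option letter in the model output, or None."""
    m = re.search(r"\b([ABCD])\b", output)
    return m.group(1) if m else None


def accuracy(predictions, golds):
    """Fraction of outputs whose extracted choice matches the gold letter."""
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, golds))
    return correct / len(golds)
```

In practice, extraction rules need tuning per model, since free-form outputs may embed the letter in varied phrasings.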
4. evals
OpenAI's widely used evaluation framework.
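A typical invocation, assuming the openai/evals package is installed and an OpenAI API key is configured (`test-match` is one of the repo's built-in example evals):

```shell
# Install the evals package, then run a built-in eval against a model.
# Requires OPENAI_API_KEY in the environment.
pip install evals
oaieval gpt-3.5-turbo test-match
```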
5. deepeval
A simple, easy-to-use open-source LLM evaluation framework. Similar to Pytest, but specialized for unit-testing LLM outputs; highly modular.
Example:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."],
)
evaluate([test_case], [answer_relevancy_metric])
```
6. ragas
explodinggradients/ragas: an evaluation framework for Retrieval-Augmented Generation (RAG) pipelines (github.com)
The tool evaluates RAG pipelines, scoring model answers against a custom dataset used as the reference standard.
```python
import os

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

os.environ["OPENAI_API_KEY"] = "your-openai-key"

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967',
               'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
                 ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967',
                     'The New England Patriots have won the Super Bowl a record six times'],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()
```
II. Code Q&A Datasets
Find Open Datasets and Machine Learning Projects | Kaggle (contains only LeetCode problems and links; the solution code must be scraped separately)
https://huggingface.co/datasets/greengerong/leetcode?row=0
https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k?row=0
https://huggingface.co/datasets/nuprl/leetcode-js?row=4
https://huggingface.co/datasets/RayBernard/leetcode1000?row=0
https://huggingface.co/datasets/google/code_x_glue_cc_code_to_code_trans?row=4 (Java-to-C++ translation dataset)
https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench (code correctness evaluation)
https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all?row=1 (code cloze test)
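Datasets like these are usually scored by executing generated code against test cases and reporting pass@k, the functional-correctness metric popularized by HumanEval (one of the benchmarks listed above). A sketch of the standard unbiased estimator, where n samples were generated per problem and c of them passed:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total generated samples for a problem
    c: number of samples that passed the tests
    k: sample budget being evaluated
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem scores are then averaged over the whole dataset; drawing n much larger than k reduces the variance of the estimate.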