Survey of LLM Evaluation Frameworks
Techniques for model building (modeling), scaling, and generalization are currently advancing faster than the methods for evaluating them, which leads to under-evaluation of models and inflated or exaggerated estimates of their capabilities. We also do not yet have good solutions to the evaluation problems posed by small generative models and long-form generations.
I. Evaluation Frameworks
1. llmuses
Pros:
- Ships with many common benchmark datasets, including MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, HumanEval, etc.
- Implementations of common evaluation metrics
- Unified model access, compatible with the generate and chat interfaces of several model families
- Automatic evaluation (evaluator):
  - Automatic scoring of objective questions
  - Automatic evaluation of complex tasks using an expert model
- Evaluation report generation
- Arena mode
- Visualization tools
- Model performance evaluation
Cons:
The built-in benchmark datasets do not cover the computer-science/code domain, so for the code evaluation in this project the only practically usable approach is expert-model-based evaluation.
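The expert-model (LLM-as-judge) approach mentioned above can be sketched as follows. This is a hedged illustration, not llmuses's actual API: `ask_judge` is a hypothetical callable standing in for whatever judge model and API client you use, and the prompt template is an assumption.

```python
# Hypothetical sketch of expert-model (LLM-as-judge) scoring for code answers.
# `ask_judge` takes a prompt string and returns the judge model's text reply;
# plug in your own API client (OpenAI, local model, etc.).
JUDGE_PROMPT = """You are a senior software engineer. Rate the following answer
to a coding question on a 1-5 scale for correctness. Reply with the digit only.
Question: {question}
Answer: {answer}"""


def judge_score(question, answer, ask_judge):
    """Ask the expert model for a 1-5 rating; return None if no digit found."""
    reply = ask_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else None
```

Parsing only the first digit keeps the scorer robust to chatty judge replies like "Score: 4".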
2. FlagEval
Code ability:
Tests a model's ability to write, understand, and optimize computer program code.
3. CMMLU
Chinese multi-task language understanding evaluation.
The datasets consist entirely of objective multiple-choice questions, so it offers no advantage for open-ended, subjective code evaluation.
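Scoring such objective multiple-choice benchmarks reduces to extracting the predicted option letter from the model's output and computing accuracy. A minimal sketch (the regex-based extractor is a common heuristic, not CMMLU's official scorer):

```python
import re


def extract_choice(output):
    """Return the first standalone A-D option letter in the model output, or None."""
    m = re.search(r"\b([ABCD])\b", output)
    return m.group(1) if m else None


def accuracy(predictions, golds):
    """Fraction of outputs whose extracted choice matches the gold letter."""
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, golds))
    return correct / len(golds)
```

In practice, extraction rules need tuning per model, since free-form outputs may embed the letter in varied phrasings.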
4. evals
OpenAI's widely used evaluation framework.
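A typical invocation, assuming the openai/evals package is installed and an OpenAI API key is configured (`test-match` is one of the repo's built-in example evals):

```shell
# Install the evals package, then run a built-in eval against a model.
# Requires OPENAI_API_KEY in the environment.
pip install evals
oaieval gpt-3.5-turbo test-match
```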
5. deepeval
A simple, easy-to-use open-source LLM evaluation framework. Similar to Pytest, but specialized for unit-testing LLM outputs; highly modular.
Example:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."],
)
evaluate([test_case], [answer_relevancy_metric])
```
6. ragas
explodinggradients/ragas: an evaluation framework for Retrieval-Augmented Generation (RAG) pipelines (github.com)
The tool evaluates RAG pipelines, scoring model answers against a custom dataset used as the reference standard.
```python
import os

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

os.environ["OPENAI_API_KEY"] = "your-openai-key"

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967',
               'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
                 ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967',
                     'The New England Patriots have won the Super Bowl a record six times'],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()
```
II. Code Q&A Datasets
Find Open Datasets and Machine Learning Projects | Kaggle (contains only LeetCode problems and links; the solution code must be scraped separately)
https://huggingface.co/datasets/greengerong/leetcode?row=0
https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k?row=0
https://huggingface.co/datasets/nuprl/leetcode-js?row=4
https://huggingface.co/datasets/RayBernard/leetcode1000?row=0
https://huggingface.co/datasets/google/code_x_glue_cc_code_to_code_trans?row=4 (Java-to-C++ translation dataset)
https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench (code correctness evaluation)
https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all?row=1 (code cloze test)
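Datasets like these are usually scored by executing generated code against test cases and reporting pass@k, the functional-correctness metric popularized by HumanEval (one of the benchmarks listed above). A sketch of the standard unbiased estimator, where n samples were generated per problem and c of them passed:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total generated samples for a problem
    c: number of samples that passed the tests
    k: sample budget being evaluated
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem scores are then averaged over the whole dataset; drawing n much larger than k reduces the variance of the estimate.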