A Survey of Evaluation Frameworks for Large Language Models

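At present, techniques for model building (modeling), scaling, and generalization are advancing faster than the methods used to evaluate and test them. As a result, models are under-evaluated and their capabilities are overestimated or exaggerated. We have not yet found a way to solve the evaluation problems of small generative models and long-form generation.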

I. Evaluation frameworks

1.llmuses

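Pros:
  • Ships with several commonly used benchmark datasets, including MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, and HumanEval
  • Implementations of common evaluation metrics
  • A unified model interface, compatible with the generate and chat interfaces of several model families
  • Automatic evaluation (evaluator):
    • Automatic evaluation of objective questions
    • Automatic evaluation of complex tasks using an expert model
  • Evaluation report generation
  • Arena mode
  • Visualization tools
  • Model performance evaluation
Cons:

The test datasets used at training time do not cover the computer domain, so for code evaluation in this project the only evaluation method that can actually be used is expert-model-based evaluation.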
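
To make the expert-model approach concrete, here is a minimal sketch of judge-based scoring. It is not llmuses code: the prompt template, the 1 to 10 scale, the `Score:` reply format, and the names `build_judge_prompt`, `parse_judge_score`, `judge_answer`, and `fake_judge` are all assumptions made for illustration. The judge model is passed in as a plain callable so that any real client could be substituted.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the wording and the 1-10 scale are assumptions.
JUDGE_TEMPLATE = (
    "You are an expert code reviewer. Rate the answer to the question below "
    "on a scale of 1 to 10 for correctness and clarity.\n"
    "Question:\n{question}\n\n"
    "Answer:\n{answer}\n\n"
    "Reply with a line of the form 'Score: <integer>'."
)


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the template with the question and the candidate answer."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_judge_score(reply: str) -> Optional[int]:
    """Extract the integer after 'Score:'; return None if absent or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def judge_answer(
    question: str, answer: str, judge_fn: Callable[[str], str]
) -> Optional[int]:
    """Send the prompt to the judge callable and parse its reply."""
    return parse_judge_score(judge_fn(build_judge_prompt(question, answer)))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real expert model, used only to keep the sketch runnable."""
    return "The solution is correct but terse.\nScore: 7"


score = judge_answer("Reverse a string in Python.", "s[::-1]", fake_judge)
print(score)
```

Returning `None` for an unparseable reply keeps malformed judge output from being counted as a score; a real pipeline would probably retry or log such cases.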

2.FlagEval

Coding ability

The model's ability to write, understand, and optimize computer program code.

3.CMMLU

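Chinese Massive Multitask Language Understanding evaluation.

The dataset consists entirely of objective multiple-choice questions, so it offers no advantage for open-ended, subjective code evaluation.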
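
Objective multiple-choice questions are easy to score automatically, which is also why they say little about open-ended code answers. The following is a minimal sketch of such scoring, assuming four options labeled A to D; the helper names `extract_choice` and `accuracy` and the extraction rule are hypothetical, not part of CMMLU or any framework listed here.

```python
import re
from typing import List, Optional


def extract_choice(model_output: str) -> Optional[str]:
    """Return the first uppercase A-D that is not part of an ASCII word.

    This is a deliberately simple rule: an output that starts with the
    English article "A" would be misread as option A.
    """
    match = re.search(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])", model_output)
    return match.group(1) if match else None


def accuracy(outputs: List[str], gold: List[str]) -> float:
    """Fraction of outputs whose extracted choice equals the gold label."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


outputs = ["答案是B", "C", "I am not sure"]
gold = ["B", "C", "A"]
acc = accuracy(outputs, gold)
print(f"accuracy = {acc:.3f}")
```

An output with no recognizable option counts as wrong rather than raising an error, so a single malformed reply does not abort the whole run.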

4.evals

openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. (github.com)

OpenAI's mainstream evaluation framework.

5.deepeval

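A simple, easy-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs, and it is highly modular.

Example: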

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])

6.ragas

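explodinggradients/ragas: an evaluation framework for retrieval-augmented generation (RAG) pipelines (github.com)

This tool evaluates retrieval-augmented generation (RAG) pipelines and can score a model's answers against a custom dataset used as the reference standard.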

from datasets import Dataset 
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

os.environ["OPENAI_API_KEY"] = "your-openai-key"

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

dataset = Dataset.from_dict(data_samples)

score = evaluate(dataset,metrics=[faithfulness,answer_correctness])
score.to_pandas()
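
When building a custom dataset like the one above, every column must contain the same number of rows, and each `contexts` entry is a list of strings. A small pre-check, sketched below in plain Python, can catch layout mistakes before any evaluation call is made. The helper `validate_samples` is hypothetical and not part of ragas; the column names are simply those used in the example above.

```python
from typing import Dict, List


def validate_samples(samples: Dict[str, List]) -> int:
    """Check the column layout used in the example above; return the row count."""
    required = ("question", "answer", "contexts", "ground_truth")
    missing = [name for name in required if name not in samples]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # All columns must describe the same number of samples.
    lengths = {name: len(samples[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")

    # Each sample carries a list of retrieved context strings.
    for i, ctx in enumerate(samples["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] must be a list of strings")

    return lengths["question"]


example = {
    "question": ["When was the first super bowl?"],
    "answer": ["The first superbowl was held on Jan 15, 1967"],
    "contexts": [["The First AFL-NFL World Championship Game was played on January 15, 1967."]],
    "ground_truth": ["The first superbowl was held on January 15, 1967"],
}
print(validate_samples(example))
```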

II. Code Q&A datasets

Find Open Datasets and Machine Learning Projects | Kaggle (contains only LeetCode problems and links; the code answers would have to be scraped)

https://huggingface.co/datasets/greengerong/leetcode?row=0

https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k?row=0

https://huggingface.co/datasets/nuprl/leetcode-js?row=4

https://huggingface.co/datasets/RayBernard/leetcode1000?row=0

https://huggingface.co/datasets/google/code_x_glue_cc_code_to_code_trans?row=4 (Java to C++ translation dataset)

https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench (clone detection dataset; listed here for code correctness evaluation)

https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all?row=1 (code cloze test)
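
The Hugging Face links above are dataset viewer URLs, while loaders such as `datasets.load_dataset` expect the `owner/name` repository id. The sketch below converts one into the other using only the standard library; the helper name `dataset_repo_id` is hypothetical. The loading call itself appears only in a comment, because it needs network access and the right configuration and split names for each dataset.

```python
from urllib.parse import urlparse


def dataset_repo_id(url: str) -> str:
    """Turn a Hugging Face dataset page URL into an 'owner/name' repo id."""
    path = urlparse(url).path  # the '?row=N' viewer query is not part of the path
    prefix = "/datasets/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    parts = [p for p in path[len(prefix):].split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected owner and dataset name in: {url}")
    return "/".join(parts[:2])


urls = [
    "https://huggingface.co/datasets/greengerong/leetcode?row=0",
    "https://huggingface.co/datasets/nuprl/leetcode-js?row=4",
    "https://huggingface.co/datasets/google/code_x_glue_cc_cloze_testing_all?row=1",
]
repo_ids = [dataset_repo_id(u) for u in urls]
for repo_id in repo_ids:
    # With the `datasets` package installed, each id could then be passed to
    # datasets.load_dataset(repo_id, ...); config and split names vary per dataset.
    print(repo_id)
```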
