DeepSeek-R1-Evaluation
Evaluation setup: For all evaluated models, the maximum generation length is set to 32,768 tokens. For sampling parameters, we set temperature to 0.6 and top-p to 0.95, and generate 64 responses per query to estimate pass@1.
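For concreteness, a minimal sketch of this pass@1 estimate (the grading data below is invented for illustration; with k = 64 samples per query, pass@1 for a query is simply the fraction of its sampled responses that are correct, averaged over all queries):

```python
import numpy as np

def pass_at_1(correct_flags: list[bool]) -> float:
    """pass@1 for one query: the fraction of the k sampled
    responses (k = 64 here) that are graded correct."""
    return float(np.mean(correct_flags))

# Invented grading results for two queries, 64 samples each.
query_grades = [
    [True] * 60 + [False] * 4,   # query 1: 60/64 correct
    [True] * 8 + [False] * 56,   # query 2: 8/64 correct
]
benchmark_score = np.mean([pass_at_1(g) for g in query_grades])
print(f"pass@1 = {benchmark_score:.3f}")  # 0.531
```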
| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 |
| --- | --- | --- | --- | --- | --- | --- |
| | Architecture | - | - | MoE | - | - |
| | # Activated Params | - | - | 37B | - | - |
| | # Total Params | - | - | 671B | - | - |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 |
| | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 |
| | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - |
English
Multi-domain knowledge understanding and reasoning (MMLU)
- Test content: MMLU (Massive Multitask Language Understanding) consists of English multiple-choice questions covering 57 subjects (science, the humanities, social sciences, and more), requiring the model to have broad cross-domain knowledge.
1. MMLU (Pass@1)
On September 7, 2020, Dan Hendrycks, Collin Burns, Steven Basart, and colleagues proposed a new test for measuring the multitask accuracy of text models. It covers 57 tasks spanning elementary mathematics, US history, computer science, law, and more. To score well, a model must possess extensive world knowledge and problem-solving ability.
Test content: Massive Multitask Language Understanding (MMLU) is a multitask language-understanding benchmark covering 57 subject areas, including science, history, and literature. Pass@1 measures whether the first answer the model generates for a multiple-choice question is correct.
Official site: hendrycks/test: Measuring Massive Multitask Language Understanding | ICLR 2021
Dataset: cais/mmlu · Datasets at Hugging Face
Sample question:
| question | subject | choices | answer |
| --- | --- | --- | --- |
| Statement 1 \| Every homomorphic image of a group G is isomorphic to a factor group of G. Statement 2 \| The homomorphic images of a group G are the same (up to isomorphism) as the factor groups of G. | abstract_algebra | [ "True, True", "False, False", "True, False", "False, True" ] | A |
DeepSeek app (DeepThink R1)
Correct answer: "True, True"; the model answered correctly.
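As a rough illustration of how such a Pass@1 check can be scripted, the sketch below loads cais/mmlu with the Hugging Face datasets library; the prompt template and letter-based grading are our own simplification, not DeepSeek's actual harness:

```python
from datasets import load_dataset

# Load one MMLU subject; each row has question, subject, choices, answer.
ds = load_dataset("cais/mmlu", "abstract_algebra", split="test")

def format_prompt(example: dict) -> str:
    """Render the question with lettered options, as in the sample above."""
    options = "\n".join(
        f"{letter}. {choice}"
        for letter, choice in zip("ABCD", example["choices"])
    )
    return f"{example['question']}\n{options}\nAnswer:"

example = ds[0]
gold_letter = "ABCD"[example["answer"]]  # `answer` is stored as a 0-3 index
# Pass@1 then reduces to: does the model's first answer equal gold_letter?
print(format_prompt(example))
print("Gold:", gold_letter)
```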
2. MMLU-Redux (EM)
Organizer: Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
Test content: MMLU-Redux (EM) is a revised version of the MMLU (Massive Multitask Language Understanding) benchmark, designed to assess a language model's multi-domain knowledge and reasoning more accurately, in particular by addressing potential biases and ambiguities in the original dataset. Improvements:
- De-ambiguation: questions in the original MMLU that tend to cause disputes are rewritten or removed.
- Difficulty grading: greater weight is given to complex reasoning chains and multi-step problems.
- Answer standardization: Exact Match (EM, strict matching) is used as the metric, requiring the generated answer to be identical to the reference (down to case, punctuation, and other details), which avoids the ambiguity of partially correct answers.
Through strict answer matching and data denoising, MMLU-Redux (EM) provides a more reliable benchmark for multi-domain knowledge, and is especially suited to applications that demand high-precision output.
Official site: [2406.04127] Are We Done with MMLU?
Dataset: edinburgh-dawg/mmlu-redux-2.0 · Datasets at Hugging Face
| question | choices | answer | error_type |
| --- | --- | --- | --- |
| Statement 1 \| If T: V -> W is a linear transformation and dim(V) < dim(W) < 1, then T must be injective. Statement 2 \| Let dim(V) = n and suppose that T: V -> V is linear. If T is injective, then it is a bijection. | [ "True, True", "False, False", "True, False", "False, True" ] | 0 | bad_question_clarity |
DeepSeek app (DeepThink R1)
Correct answer: "True, True"; the model answered incorrectly.
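Read literally, the strict EM criterion above is a character-for-character comparison. A minimal sketch (real harnesses typically first extract the chosen option from the model's free-form output):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict EM: the generated answer must equal the reference
    exactly, including case and punctuation."""
    return prediction == reference

# Under strict EM, trivial deviations already count as wrong:
print(exact_match("True, True", "True, True"))   # True
print(exact_match("true, true", "True, True"))   # False (case)
print(exact_match("True, True.", "True, True"))  # False (punctuation)
```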
3. MMLU-Pro (EM)
Organizer: University of Waterloo, University of Toronto, and Carnegie Mellon University
Test content: Researchers from the University of Waterloo, the University of Toronto, and Carnegie Mellon University jointly proposed the MMLU-Pro benchmark, which uses harder questions, broader knowledge coverage, and more discriminative problems to evaluate the general capabilities of current large models.
MMLU-Pro (Massive Multitask Language Understanding Pro) is a benchmark designed to evaluate large language models (LLMs) across a wide variety of tasks, similar to the original MMLU but with several advanced features and enhancements introduced in the Pro version.
MMLU-Pro covers 14 domains, including mathematics, physics, chemistry, law, engineering, psychology, and health, with more than 12,000 questions in total, amply meeting the breadth requirement. Compared with the MMLU benchmark, MMLU-Pro differs mainly in the following respects:
- More options and distractors: each MMLU-Pro question offers 10 options, three times as many distractors as MMLU. Increasing the number of distractors lowers the probability of guessing the correct answer by chance, which substantially raises the benchmark's difficulty and reliability.
- A higher proportion of college-level problems: MMLU-Pro adds more challenging college-level exam questions that require deliberate reasoning across domains to answer correctly.
- Two rounds of expert review: to reduce noise in the dataset, the researchers carried out two rounds of expert review. The first round relied on expert verification; the second used state-of-the-art LLMs (SoTA LLMs) to flag potential errors, followed by targeted verification from human annotators.
Official site: [2406.01574] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Dataset: sam-paech/mmlu-pro-nomath · Datasets at Hugging Face
| question | options | answer | cot_content | category |
| --- | --- | --- | --- | --- |
| The symmetric group $S_n$ has $n!$ elements, hence it is not true that $S_{10}$ has 10 elements. Find the characteristic of the ring 2Z. | [ "0", "30", "3", "10", "12", "50", "2", "100", "20", "5" ] | A | A: Let's think step by step. The characteristic of a ring R is $n$ if the statement $ka = 0$ for all $a\in 2Z$ implies that $k$ is a multiple of $n$. Assume that $ka = 0$ for all $a\in 2Z$ for some $k$. In particular $2k = 0$. Hence $k=0$ and $n=0$. The answer is (A). | math |
DeepSeek app (DeepThink R1)
Correct answer: A ("0"); the model answered correctly.
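The cot_content field above ends with the canonical phrase "The answer is (A)", which suggests how EM scoring can be applied to chain-of-thought output: extract the final letter, then compare it with the reference. The regex and helper below are our own sketch, not the official scorer:

```python
import re

def extract_choice(cot_response: str) -> str | None:
    """Extract the last 'answer is (X)' letter from a CoT response;
    MMLU-Pro options run from A to J (10 choices)."""
    hits = re.findall(r"answer is \(?([A-J])\)?", cot_response)
    return hits[-1] if hits else None

cot = "Let's think step by step. ... Hence $k=0$ and $n=0$. The answer is (A)."
print(extract_choice(cot))         # A
print(extract_choice(cot) == "A")  # EM then compares the extracted letters
```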
4. DROP (3-shot F1)
Organizer: Allen Institute for AI
Test content: DROP is an English reading-comprehension benchmark proposed by researchers from the University of California, Peking University, and other institutions. The dataset aims to push reading comprehension toward more comprehensive paragraph analysis, requiring systems to perform discrete reasoning operations over paragraph content, such as addition, counting, or sorting. These operations demand considerably deeper understanding than earlier datasets.
The DROP dataset was created through crowdsourcing: narrative passages containing many numbers were first extracted automatically from Wikipedia, and question-answer pairs were then collected on the Amazon Mechanical Turk platform. During question creation, an adversarial baseline (BiDAF) ran in the background, encouraging crowd workers to write questions the baseline system could not answer correctly. The final dataset contains 96,567 questions covering multiple Wikipedia categories, with particular emphasis on sports game summaries and history passages.
Reference: DROP数据集分享 - CSDN博客
Official site: DROP Dataset | Papers With Code
Dataset: ucinlp/drop · Datasets at Hugging Face
| section_id | query_id | passage | question | answers_spans |
| --- | --- | --- | --- | --- |
| nfl_2201 | f16c0ee7-f131-4a8b-a6ac-4d275ea68066 | To start the season, the Lions traveled south to Tampa, Florida to take on the Tampa Bay Buccaneers. The Lions scored first in the first quarter with a 23-yard field goal by Jason Hanson. The Buccaneers tied it up with a 38-yard field goal by Connor Barth, then took the lead when Aqib Talib intercepted a pass from Matthew Stafford and ran it in 28 yards. The Lions responded with a 28-yard field goal. In the second quarter, Detroit took the lead with a 36-yard touchdown catch by Calvin Johnson, and later added more points when Tony Scheffler caught an 11-yard TD pass. Tampa Bay responded with a 31-yard field goal just before halftime. The second half was relatively quiet, with each team only scoring one touchdown. First, Detroit's Calvin Johnson caught a 1-yard pass in the third quarter. The game's final points came when Mike Williams of Tampa Bay caught a 5-yard pass. The Lions won their regular season opener for the first time since 2007 | How many points did the buccaneers need to tie in the first? | { "spans": [ "3" ], "types": [ "number" ] } |
DeepSeek app (DeepThink R1)
Correct answer: "3"; the model answered correctly.
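The F1 reported for DROP is computed over answer spans rather than whole strings. A simplified bag-of-tokens version is sketched below; the official DROP scorer additionally normalizes articles and numbers and aligns multi-span answers, so treat this only as an approximation:

```python
from collections import Counter

def span_f1(prediction: str, gold: str) -> float:
    """Simplified bag-of-tokens F1 between a predicted and a gold span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(span_f1("3", "3"))         # 1.0, matching the example above
print(span_f1("3 points", "3"))  # 0.667: partial credit for extra tokens
```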
5. IF-Eval (Prompt Strict)
Organizer: Google & Yale University
Test content: The IFEval dataset provides a systematic, fine-grained way to evaluate the instruction-following ability of large language models. Its combination of strict and loose metrics with multiple response transformations effectively addresses the misjudgment problems of traditional methods. The dataset offers a rich set of instruction types spanning format, language, length, content, and other dimensions, and is highly extensible. Compared with other evaluation datasets, IFEval emphasizes the verifiability of instructions, which matters greatly in practical applications.
A model's instruction-following ability has become an important evaluation dimension.
The IFEval dataset was designed to address the limitations of existing evaluation methods:
- Human evaluation is time-consuming, costly, and subject to annotator bias, which hurts reproducibility;
- Model-based evaluation depends on the accuracy of the evaluator model, which may itself be flawed and produce misleading results;
- Quantitative benchmarks are standardized but lack fine-grained evaluation of generative tasks such as instruction following.
By focusing on verifiable instructions (word-count limits, JSON formatting, and the like), IFEval enables automated, objective evaluation, helps researchers pinpoint the instruction types on which a model underperforms, and supports comparisons across models.
IFEval defines two metrics, Strict and Loose, to measure more precisely whether a model follows a given instruction. Reference: 指令遵循数据集IFEval介绍:中英双语 - CSDN博客
Official site: [2311.07911] Instruction-Following Evaluation for Large Language Models
Dataset: google/IFEval · Datasets at Hugging Face
| key | prompt | instruction_id_list | kwargs |
| --- | --- | --- | --- |
| 1,000 | Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*. | [ "punctuation:no_comma", "detectable_format:number_highlighted_sections", "length_constraints:number_words" ] | [ { "num_highlights": |
DeepSeek app (DeepThink R1)
Requirements: no commas, 300+ words, highlighted section titles. The response contained no commas, ran to 363 words, and highlighted the section titles, fully meeting the requirements.
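Each instruction id in the example above (punctuation:no_comma, length_constraints:number_words, detectable_format:number_highlighted_sections) corresponds to a mechanical check. Below is a simplified re-implementation under the Prompt Strict rule, where a prompt counts as passed only if every attached instruction passes; the official IFEval checkers are considerably more thorough:

```python
import re

def check_no_comma(text: str) -> bool:
    """punctuation:no_comma: the response must contain no comma."""
    return "," not in text

def check_min_words(text: str, min_words: int) -> bool:
    """length_constraints:number_words: at least min_words words."""
    return len(text.split()) >= min_words

def check_highlights(text: str, num_highlights: int) -> bool:
    """detectable_format:number_highlighted_sections: at least
    num_highlights markdown-highlighted spans like *section title*."""
    return len(re.findall(r"\*[^*\n]+\*", text)) >= num_highlights

response = "..."  # the model's answer to the Raymond III prompt above
checks = [
    check_no_comma(response),
    check_min_words(response, 300),
    check_highlights(response, 3),
]
print("prompt passes (strict):", all(checks))
```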