5分钟看看DeepSeek-R1做过的那些基准测试题(上)

DeepSeek-R1-Evaluation

评估方法:我们所有的参评模型,最大生成长度设置为 32,768 个tokens。对于模型的采样参数,我们设置Tempreture值为 0.6 , top-p 值为0.95,并为每个查询生成 64 个响应来估计pass@1。

CategoryBenchmark (Metric)Claude-3.5-Sonnet-1022GPT-4o 0513DeepSeek V3OpenAI o1-miniOpenAI o1-1217
  • DeepSeek R1
Architecture--MoE--
  • MoE
# Activated Params--37B--
  • 37B
# Total Params--671B--
  • 671B
EnglishMMLU (Pass@1)88.387.288.585.291.8
  • 90.8
MMLU-Redux (EM)88.988.089.186.7-
  • 92.9
MMLU-Pro (EM)78.072.675.980.3-
  • 84.0
DROP (3-shot F1)88.383.791.683.990.2
  • 92.2
IF-Eval (Prompt Strict)86.584.386.184.8-
  • 83.3
GPQA-Diamond (Pass@1)65.049.959.160.075.7
  • 71.5
SimpleQA (Correct)28.438.224.97.047.0
  • 30.1
FRAMES (Acc.)72.580.573.376.9-
  • 82.5
AlpacaEval2.0 (LC-winrate)52.051.170.057.8-
  • 87.6
ArenaHard (GPT-4-1106)85.280.485.592.0-
  • 92.3
CodeLiveCodeBench (Pass@1-COT)33.834.2-53.863.4
  • 65.9
Codeforces (Percentile)20.323.658.793.496.6
  • 96.3
Codeforces (Rating)717759113418202061
  • 2029
SWE Verified (Resolved)50.838.842.041.648.9
  • 49.2
Aider-Polyglot (Acc.)45.316.049.632.961.7
  • 53.3
MathAIME 2024 (Pass@1)16.09.339.263.679.2
  • 79.8
MATH-500 (Pass@1)78.374.690.290.096.4
  • 97.3
CNMO 2024 (Pass@1)13.110.843.267.6-
  • 78.8
ChineseCLUEWSC (EM)85.487.990.989.9-
  • 92.8
C-Eval (EM)76.776.086.568.9-
  • 91.8
C-SimpleQA (Correct)55.458.768.040.3-
  • 63.7

English

多领域知识理解与推理(MMLU)412

  • 测试内容:MMLU(Massive Multitask Language Understanding)覆盖57个学科(如科学、人文、社科等)的英文选择题,要求模型具备广泛的跨领域知识。

1. MMLU (Pass@1)

         2020年9月7日,Dan Hendrycks, Collin Burns, Steven Basart等人提出了一种新的测试方法,用于衡量文本模型的多任务准确性。该测试涵盖了57个任务,包括基本数学、美国历史、计算机科学、法律等多个领域。为了在这个测试中获得高准确性,模型必须具备广泛的世界知识和问题解决能力。
        测试内容: Massive Multitask Language Understanding (MMLU) 是一个多任务语言理解基准,涵盖57个学科领域,包括科学、历史、文学等。Pass@1 衡量模型在多项选择题中生成的第一个答案是否正确。
官方网站: hendrycks/test: Measuring Massive Multitask Language Understanding | ICLR 2021

数据集地址:cais/mmlu · Datasets at Hugging Face

题目示例:

question

stringlengths

188209

8%

subject

stringclasses

abstract_algebra

100%

choices

sequencelengths

4

100%

answer

class label

A

22%

Statement 1 | Every homomorphic image of a group G is isomorphic to a factor group of G. Statement 2 | The homomorphic images of a group G are the same (up to isomorphism) as the factor groups of G.

abstract_algebra

[ "True, True", "False, False", "True, False", "False, True" ]

A

Deepseek APP(深度思考 R1) 

正确答案:"True, True",回答正确



2. MMLU-Redux (EM)

主办方: Aryo Pradipta GemaJoshua Ong Jun LeangGiwon HongAlessio DevotoAlberto Carlo Maria MancinoRohit SaxenaXuanli HeYu ZhaoXiaotang DuMohammad Reza Ghasemi MadaniClaire BaraleRobert McHardyJoshua HarrisJean KaddourEmile van KriekenPasquale Minervini
测试内容: MMLU-Redux (EM) 是 MMLU(Massive Multitask Language Understanding)基准的改进版本,旨在更精准地评估语言模型的多领域知识理解和推理能力,尤其针对原数据集中存在的潜在偏差或模糊性问题进行优化。改进点

  • 去歧义化:对原MMLU中易引发争议的题目进行重写或剔除。

  • 难度分级:增加对复杂推理链和多步问题的权重。

  • 答案标准化:采用 Exact Match(EM,严格匹配) 作为指标,要求生成答案与标准答案完全一致(区分大小写、标点等细节),避免部分正确带来的模糊性。

MMLU-Redux (EM) 通过 严格答案匹配 和 数据去噪,为模型的多领域知识能力提供了更可靠的评测基准,尤其适合需要高精度输出的实际应用场景。

官方网站: [2406.04127] Are We Done with MMLU?

数据集:edinburgh-dawg/mmlu-redux-2.0 · Datasets at Hugging Face

question

stringlengths

104125

10%

choices

sequencelengths

4

100%

answer

int64

3

21%

error_type

stringclasses

wrong_groundtruth

6%

Statement 1 | If T: V -> W is a linear transformation and dim(V ) < dim(W) < 1, then T must be injective.

Statement 2 | Let dim(V) = n and suppose that T: V -> V is linear. If T is injective, then it is a bijection.

[ "True, True", "False, False", "True, False", "False, True" ]

0

bad_question_clarity

 Deepseek APP(深度思考 R1) 

正确答案:"True, True",回答错误



3. MMLU-Pro (EM)

主办方:
测试内容: 滑铁卢大学、多伦多大学和卡耐基梅隆大学的研究人员一起提出了MMLU Pro版本评测基准。该评测基准采用了更高难度、更广泛的知识内容、更好区分性问题来对当前大模型进行通用能力评测。

MMLU Pro(Massive Multitask Language Understanding Pro)是一个基准测试,旨在评估大语言模型(LLMs)在各种多样化任务上的表现,类似于原始的 MMLU(Massive Multitask Language Understanding)基准测试,但在 Pro 版本中引入了一些高级特性或增强功能。

MMLU-Pro 涵盖了数学、物理、化学、法律、工程学、心理学和健康等 14 个领域,共包括超过 12,000 道问题,充分满足了广度要求。与 MMLU 基准相比,MMLU-Pro 主要在以下几个方面有所不同:

  1. 更多选项和干扰项:MMLU-Pro 每道题目提供 10 个选项,包含比 MMLU 多出 3 倍的干扰项。通过增加干扰项数量,降低了通过偶然猜测得到正确答案的概率,从而显著提高了基准的难度和可靠性。

  2. 增加大学级别难题的比例:MMLU-Pro 增加了更多挑战性的大学水平考试题目,这些问题要求大语言模型在不同领域进行深思熟虑的推理,才能得出正确答案。

  3. 两轮专家审查:为了减少数据集中的噪音,研究人员进行了两轮专家审查。第一轮通过专家验证,第二轮则利用最先进的大语言模型(SoTA LLMs)识别潜在的错误,并由人工注释员进行更有针对性的验证。

官方网站: [2406.01574] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

数据集:sam-paech/mmlu-pro-nomath · Datasets at Hugging Face

question

stringlengths

25226

81.4%

options

sequencelengths

10

100%

answer

stringclasses

A

15.7%

cot_content

stringlengths

236342

40%

category

stringclasses

math

7.1%

The symmetric group $S_n$ has $ \factorial{n}$ elements, hence it is not true that $S_{10}$ has 10 elements. Find the characteristic of the ring 2Z.

[ "0", "30", "3", "10", "12", "50", "2", "100", "20", "5" ]

A

A: Let's think step by step. A characteristic of a ring is R is $n$ if the statement $ka = 0$ for all $a\in 2Z$ implies that $k$ is a multiple of $n$. Assume that $ka = 0$ for all $a\in 2Z$ for some $k$. In particular $2k = 0$. Hence $k=0$ and $n=0$. The answer is (A).

math

 Deepseek APP(深度思考 R1) 

正确答案A"0",回答正确


4. DROP (3-shot F1)

主办方: Allen Institute for AI
测试内容: 该数据集由加州大学、北大等研究者提出的英文阅读理解基准测试集。该数据集旨在推动阅读理解技术向更全面的文本段落分析发展,要求系统对段落内容执行离散推理操作,如加法、计数或排序。这些操作比以往数据集所需的理解更为深入。

DROP数据集通过众包方式创建,首先从Wikipedia中自动提取包含大量数字的叙事性段落,然后通过Amazon Mechanical Turk平台收集问案对。在问题创建过程中,采用了对抗性基线(BiDAF)作为背景,鼓励众包工作者提出基线系统无法正确回答的问题。最终,该数据集包含了96,567个问题,这些问题覆盖了Wikipedia中的多个类别,尤其强调体育比赛摘要和历史段落。

DROP数据集分享-CSDN博客
官方网站: DROP Dataset | Papers With Code

数据集:ucinlp/drop · Datasets at Hugging Face

section_id

string

query_id

string

passage

string

question

string

answers_spans

sequence

nfl_2201

f16c0ee7-f131-4a8b-a6ac-4d275ea68066

To start the season, the Lions traveled south to Tampa, Florida to take on the Tampa Bay Buccaneers. The Lions scored first in the first quarter with a 23-yard field goal by Jason Hanson. The Buccaneers tied it up with a 38-yard field goal by Connor Barth, then took the lead when Aqib Talib intercepted a pass from Matthew Stafford and ran it in 28 yards. The Lions responded with a 28-yard field goal. In the second quarter, Detroit took the lead with a 36-yard touchdown catch by Calvin Johnson, and later added more points when Tony Scheffler caught an 11-yard TD pass. Tampa Bay responded with a 31-yard field goal just before halftime. The second half was relatively quiet, with each team only scoring one touchdown. First, Detroit's Calvin Johnson caught a 1-yard pass in the third quarter. The game's final points came when Mike Williams of Tampa Bay caught a 5-yard pass. The Lions won their regular season opener for the first time since 2007

How many points did the buccaneers need to tie in the first?

{ "spans": [ "3" ], "types": [ "number" ] }

 Deepseek APP(深度思考 R1) 

正确答案 “ 3 ”,回答正确


5. IF-Eval (Prompt Strict)


主办方: Google &Yale University
测试内容: IFEval数据集为评估大语言模型的指令遵循能力提供了系统化、精细化的方法。其严格与宽松指标结合多种变换,有效解决了传统方法中的误判问题。数据集提供了丰富的指令类型,涵盖格式、语言、长度、内容等多个维度,具有高度可扩展性。相比其他评估数据集,IFEval更加侧重指令的可验证性,在实际应用中具有重要意义。

模型的指令遵循能力(Instruction Following)成为一个重要评估指标。
IFEval数据集旨在解决现有评估方法的局限性:

人工评估耗时高、成本大且存在主观偏差,影响可复现性;
基于模型的评估依赖评估器模型的准确性,但评估器自身可能存在缺陷,导致误导性结果;
量化基准虽然标准化,但缺乏对生成任务(如指令遵循)的精细评估。
IFEval通过聚焦可验证指令(如字数限制、JSON格式等),实现自动化、客观的评估,帮助研究者明确模型在哪些类型指令上表现不足,并支持不同模型的对比分析。

IFEval数据集通过设计严格(Strict) 和 宽松(Loose)两种评估指标,更精准地衡量模型是否遵循给定指令。指令遵循数据集IFEval介绍:中英双语-CSDN博客

官方网站: 2311.07911

数据集:google/IFEval · Datasets at Hugging Face

key

int64

13

3.76k

prompt

stringlengths

53

1.86k

instruction_id_list

sequencelengths

1

3

kwargs

listlengths

1

3

1,000

Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.

[ "punctuation:no_comma", "detectable_format:number_highlighted_sections", "length_constraints:number_words" ]

[ { "num_highlights":

Deepseek APP(深度思考 R1) 

答案要求 “ 无逗号,300字以上,标题加粗”,回答“无逗号,363字,标题加粗”完整符合要求


评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值