DeepSeek-R1 Reinforcement Learning

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via

Reinforcement Learning

DeepSeek-AI

research@deepseek.com

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Figure 1 | Benchmark performance of DeepSeek-R1. The bar chart reports accuracy / percentile (%) on six benchmarks for five models; the underlying values are:

| Model | AIME 2024 (Pass@1) | Codeforces (Percentile) | GPQA Diamond (Pass@1) | MATH-500 (Pass@1) | MMLU (Pass@1) | SWE-bench Verified (Resolved) |
| DeepSeek-R1 | 79.8 | 96.3 | 71.5 | 97.3 | 90.8 | 49.2 |
| OpenAI-o1-1217 | 79.2 | 96.6 | 75.7 | 96.4 | 91.8 | 48.9 |
| DeepSeek-R1-32B | 72.6 | 90.6 | 62.1 | 94.3 | 87.4 | 36.8 |
| OpenAI-o1-mini | 63.6 | 93.4 | 60.0 | 90.0 | 85.2 | 41.6 |
| DeepSeek-V3 | 39.2 | 58.7 | 59.1 | 90.2 | 88.5 | 42.0 |

Contents

1 Introduction
  1.1 Contributions
  1.2 Summary of Evaluation Results

2 Approach
  2.1 Overview
  2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
    2.2.1 Reinforcement Learning Algorithm
    2.2.2 Reward Modeling
    2.2.3 Training Template
    2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
  2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
    2.3.1 Cold Start
    2.3.2 Reasoning-oriented Reinforcement Learning
    2.3.3 Rejection Sampling and Supervised Fine-Tuning
    2.3.4 Reinforcement Learning for all Scenarios
  2.4 Distillation: Empower Small Models with Reasoning Capability

3 Experiment
  3.1 DeepSeek-R1 Evaluation
  3.2 Distilled Model Evaluation

4 Discussion
  4.1 Distillation vs. Reinforcement Learning
  4.2 Unsuccessful Attempts

5 Conclusion, Limitations, and Future Work

A Contributions and Acknowledgments

1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and

evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap

towards Artificial General Intelligence (AGI).

Recently, post-training has emerged as an important component of the full training pipeline.

It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt

to user preferences, all while requiring relatively minimal computational resources compared with pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge

of effective test-time scaling remains an open question for the research community. Several prior

works have explored various approaches, including process-based reward models (Lightman

et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024),

and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh

et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning

performance comparable to OpenAI’s o1 series models.

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits superb performance

on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to

71.0%, and with majority voting, the score further improves to 86.7%, matching the performance

of OpenAI-o1-0912.

However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to

as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey

et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source

QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a

new record on the reasoning benchmarks among dense models.

1.1. Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

• We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.

• We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

Distillation: Smaller Models Can Be Powerful Too

• We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open-source DeepSeek-R1, as well as its API, will benefit the research community in distilling better, smaller models in the future.

• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on the Qwen2.5 and Llama3 series to the community.

1.2. Summary of Evaluation Results

• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert-level performance in code competition tasks, achieving a 2,029 Elo rating on Codeforces and outperforming 96.3% of human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real-world tasks.

• Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark.

• Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing its strong ability to intelligently handle non-exam-oriented queries.

Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring

long-context understanding, substantially outperforming DeepSeek-V3 on long-context

benchmarks.

2. Approach

2.1. Overview

Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data; (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples; and (3) the distillation of reasoning capability from DeepSeek-R1 into small dense models.

2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Shao et al., 2024; Wang et al., 2023). However, these works heavily depended on supervised data, which are time-intensive to gather. In this section, we

explore the potential of LLMs to develop reasoning capabilities without any supervised data,

focusing on their self-evolution through a pure reinforcement learning process. We start with a

brief overview of our RL algorithm, followed by the presentation of some exciting results, and

hope this provides the community with valuable insights.

2.2.1. Reinforcement Learning Algorithm

Group Relative Policy Optimization   In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) \right), \tag{1}
$$

$$
\mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log\frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1, \tag{2}
$$

where $\varepsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \cdots, r_G\})}{\operatorname{std}(\{r_1, r_2, \cdots, r_G\})}. \tag{3}
$$
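To make Equations (1)-(3) concrete, here is a minimal PyTorch-style sketch of the group-relative advantage and the clipped, KL-regularized objective for a single question. This is not the authors' implementation: log-probabilities are treated at the sequence level rather than per token, and the default values of `eps` and `beta` are illustrative only.

```python
import torch

def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Sketch of the GRPO objective for one question with G sampled outputs.

    logp_new, logp_old, logp_ref: tensors of shape (G,) with the log-probabilities
    of each output under the current, old, and reference policies.
    rewards: tensor of shape (G,) with the rule-based rewards r_1..r_G.
    eps and beta play the roles of the epsilon and beta hyper-parameters.
    """
    # Eq. (3): group-relative advantage, normalized within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Eq. (1): clipped importance-weighted policy term.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_term = torch.min(ratio * adv, clipped * adv)

    # Eq. (2): estimator of the KL divergence to the reference policy.
    ref_ratio = torch.exp(logp_ref - logp_new)
    kl = ref_ratio - torch.log(ref_ratio) - 1.0

    # The paper maximizes this objective; negate it to use a standard optimizer.
    return (policy_term - beta * kl).mean()
```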

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:

Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning question during training.

2.2.2. Reward Modeling

The reward is the source of the training signal, which decides the optimization direction of RL.

To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two

types of rewards:

• Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.

• Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between '<think>' and '</think>' tags.

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero,

because we find that the neural reward model may suffer from reward hacking in the large-scale

reinforcement learning process, and retraining the reward model needs additional training

resources and it complicates the whole training pipeline.
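For illustration, a rule-based reward of this kind can be sketched as follows. The tag check follows the Table 1 format and the boxed-answer convention mentioned above for math problems; the exact scoring rules and how the two signals are weighted are not specified in the paper, so the plain sum below is an assumption.

```python
import re

# Full-response structure required by the Table 1 template.
THINK_ANSWER_RE = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.S)

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning and answer in the required tags."""
    return 1.0 if THINK_ANSWER_RE.fullmatch(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the boxed final answer matches the reference (math-style check)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

def rule_based_reward(response: str, ground_truth: str) -> float:
    # Accuracy and format signals are combined; a plain sum is assumed here.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```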

2.2.3. Training Template

To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As depicted in Table 1, this template

requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer.

We intentionally limit our constraints to this structural format, avoiding any content-specific

biases—such as mandating reflective reasoning or promoting particular problem-solving strategies—to ensure that we can accurately observe the model's natural progression during the RL

process.
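For reference, instantiating the Table 1 template is a one-line string substitution; the template text restates Table 1, while the constant and helper names below are ours.

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {prompt}. Assistant:"
)

def build_prompt(question: str) -> str:
    # "prompt" is replaced with the specific reasoning question during training.
    return R1_ZERO_TEMPLATE.format(prompt=question)
```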

2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Performance of DeepSeek-R1-Zero   Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated,

DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the

RL training advances. Notably, the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels

comparable to OpenAI-o1-0912. This significant improvement highlights the efficacy of our RL

algorithm in optimizing the model’s performance over time.

| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| OpenAI-o1-0912 | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843 |
| DeepSeek-R1-Zero | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444 |

Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.

Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.

Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI's o1-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised

fine-tuning data. This is a noteworthy achievement, as it underscores the model’s ability to

learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example,

when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero’s performance

escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The

ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without

majority voting, highlights its strong foundational capabilities and its potential for further

advancements in reasoning tasks.

Self-evolution Process of DeepSeek-R1-Zero The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously. By initiating RL directly from the base model, we can closely monitor the model’s

progression without the influence of the supervised fine-tuning stage. This approach provides

a clear view of how the model evolves over time, particularly in terms of its ability to handle

complex reasoning tasks.

Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.

As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments

but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the

ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens,

allowing the model to explore and refine its thought processes in greater depth.

One of the most remarkable aspects of this self-evolution is the emergence of sophisticated

behaviors as the test-time computation increases. Behaviors such as reflection—where the model

revisits and reevaluates its previous steps—and the exploration of alternative approaches to

problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead

emerge as a result of the model’s interaction with the reinforcement learning environment. This

spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities,

enabling it to tackle more challenging tasks with greater efficiency and accuracy.

Aha Moment of DeepSeek-R1-Zero A particularly intriguing phenomenon observed during

the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase,

DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial

approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and

sophisticated outcomes.

This moment is not only an “aha moment” for the model but also for the researchers

observing its behavior. It underscores the power and beauty of reinforcement learning: rather

than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of

intelligence in artificial systems, paving the way for more autonomous and adaptive models in

the future.

Question: If $a > 1$, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to

Response: To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both ...

$\left(\sqrt{a - \sqrt{a + x}}\right)^2 = x^2 \implies a - \sqrt{a + x} = x^2.$

Rearrange to isolate the inner square root term:

$(a - x^2)^2 = a + x \implies a^2 - 2ax^2 + (x^2)^2 = a + x \implies x^4 - 2ax^2 - x + (a^2 - a) = 0$

...

Wait, wait. Wait. That's an aha moment I can flag here. Let's reevaluate this step-by-step to identify if the correct sum can be ...

We started with the equation:

$\sqrt{a - \sqrt{a + x}} = x$

First, let's square both sides:

$a - \sqrt{a + x} = x^2 \implies \sqrt{a + x} = a - x^2$

Next, I could square both sides again, treating the equation: ...

...

Table 3 | An interesting “aha moment” of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.

Drawback of DeepSeek-R1-Zero Although DeepSeek-R1-Zero exhibits strong reasoning

capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces

several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly

cold-start data.

2.3. DeepSeek-R1: Reinforcement Learning with Cold Start

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can

reasoning performance be further improved or convergence accelerated by incorporating a small

amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong

general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The

pipeline consists of four stages, outlined as follows.

2.3.1. Cold Start

Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting

models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-

Zero outputs in a readable format, and refining the results through post-processing by human

annotators.

In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as

the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data

include:

• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results.

• Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance compared to DeepSeek-R1-Zero. We believe iterative training is a better way for reasoning models.

2.3.2. Reasoning-oriented Reinforcement Learning

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale

reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such

as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions. During the training process, we observe that CoT often exhibits language mixing,

particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward

aligns with human preferences, making it more readable. Finally, we combine the accuracy of

reasoning tasks and the reward for language consistency by directly summing them to form the

final reward. We then apply RL training on the fine-tuned model until it achieves convergence

on reasoning tasks.
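A simple way to realize the language-consistency reward described above is sketched below, assuming English as the target language and a crude ASCII-alphabetic heuristic for detecting target-language words; the paper does not state how target-language words are actually identified, nor the weighting beyond the direct sum.

```python
def language_consistency_reward(cot: str) -> float:
    """Proportion of CoT tokens that look like target-language (here: English) words."""
    tokens = cot.split()
    if not tokens:
        return 0.0

    def is_target_word(tok: str) -> bool:
        stripped = tok.strip(".,;:!?()[]{}\"'")
        return bool(stripped) and stripped.isascii() and stripped.isalpha()

    return sum(is_target_word(t) for t in tokens) / len(tokens)

def final_reward(task_accuracy: float, cot: str) -> float:
    # The paper forms the final reward by directly summing the two signals.
    return task_accuracy + language_consistency_reward(cot)
```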

2.3.3. Rejection Sampling and Supervised Fine-Tuning

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT

(Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which

primarily focuses on reasoning, this stage incorporates data from other domains to enhance the

model’s capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we

generate the data and fine-tune the model as described below.

Reasoning data   We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage,

we only included data that could be evaluated using rule-based rewards. However, in this stage,

we expand the dataset by incorporating additional data, some of which use a generative reward

model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chains of thought with mixed languages, long paragraphs, and code blocks. For

each prompt, we sample multiple responses and retain only the correct ones. In total, we collect

about 600k reasoning-related training samples.
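The rejection-sampling step amounts to the loop sketched below, where `generate` and `is_correct` are hypothetical stand-ins for the RL checkpoint's sampler and the rule-based or DeepSeek-V3-based judge.

```python
def rejection_sample(prompts, generate, is_correct, n_samples=16):
    """Keep only correct responses sampled from the RL checkpoint for SFT."""
    sft_data = []
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(n_samples)]
        correct = [r for r in responses if is_correct(prompt, r)]
        # Readability filtering (mixed languages, long paragraphs, code blocks
        # in the CoT) would be applied here as well.
        sft_data.extend({"prompt": prompt, "response": r} for r in correct)
    return sft_data
```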

Non-Reasoning data For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential

chain-of-thought before answering the question by prompting. However, for simpler queries,

such as “hello”, we do not provide a CoT in response. In the end, we collected a total of

approximately 200k training samples that are unrelated to reasoning.

We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about

800k samples.

2.3.4. Reinforcement Learning for all Scenarios

To further align the model with human preferences, we implement a secondary reinforcement

learning stage aimed at improving the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination

of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the

learning process in math, code, and logical reasoning domains. For general data, we resort to

reward models to capture human preferences in complex and nuanced scenarios. We build

upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the

assessment emphasizes the utility and relevance of the response to the user while minimizing

interference with the underlying reasoning process. For harmlessness, we evaluate the entire

response of the model, including both the reasoning process and the summary, to identify and

mitigate any potential risks, biases, or harmful content that may arise during the generation

process. Ultimately, the integration of reward signals and diverse data distributions enables us

to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.

2.4. Distillation: Empower Small Models with Reasoning Capability

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly

fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that

this straightforward distillation method significantly enhances the reasoning abilities of smaller

models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-

14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its

reasoning capability is slightly better than that of Llama-3.1.

For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL

stage to the broader research community.
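A minimal sketch of this distillation recipe, i.e., plain SFT of a student base model on teacher-generated traces, is shown below. The dataset rows, learning rate, batch size, and sequence length are illustrative assumptions; only the two-epoch schedule and the student model family come from the paper.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative rows; the real corpus contains the ~800k curated samples from Section 2.3.3.
distill_samples = [
    {"prompt": "Question: ...\n", "response": "<think> ... </think> <answer> ... </answer>"},
]

model_name = "Qwen/Qwen2.5-14B"  # one of the student base models listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def collate(batch):
    texts = [ex["prompt"] + ex["response"] for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=4096)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100   # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(distill_samples, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(2):                           # the paper fine-tunes for two epochs
    for batch in loader:
        loss = model(**batch).loss               # standard next-token (SFT) objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```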

3. Experiment

Benchmarks We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema

et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al.,

2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al.,

2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI,


2024d), Aider^1, LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces^2, Chinese National High School Mathematics Olympiad (CNMO 2024)^3, and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we

also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we

adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li

et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and

LiveCodeBench.

Evaluation Prompts Following the setup in DeepSeek-V3, standard benchmarks such as

MMLU, DROP, GPQA Diamond, and SimpleQA are evaluated using prompts from the simple-evals framework. For MMLU-Redux, we adopt the Zero-Eval prompt format (Lin, 2024) in a

zero-shot setting. In terms of MMLU-Pro, C-Eval and CLUE-WSC, since the original prompts

are few-shot, we slightly modify the prompt to the zero-shot setting. The CoT in few-shot may hurt the performance of DeepSeek-R1. Other datasets follow their original evaluation

protocols with default prompts provided by their creators. For code and math benchmarks, the HumanEval-Mul dataset covers eight mainstream programming languages (Python, Java, C++,

C#, JavaScript, TypeScript, PHP, and Bash). Model performance on LiveCodeBench is evaluated

using CoT format, with data collected between August 2024 and January 2025. The Codeforces

dataset is evaluated using problems from 10 Div.2 contests along with expert-crafted test cases,

after which the expected ratings and percentages of competitors are calculated. SWE-Bench verified results are obtained via the agentless framework (Xia et al., 2024). AIDER-related

benchmarks are measured using a "diff" format. DeepSeek-R1 outputs are capped at a maximum

of 32,768 tokens for each benchmark.

Baselines We conduct comprehensive evaluations against several strong baselines, including

DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217.

Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare the open-source model

QwQ-32B-Preview (Qwen, 2024a).

Evaluation Setup We set the maximum generation length to 32,768 tokens for the models. We found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints. Therefore, we

default to pass@$k$ evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature. Specifically, we use a sampling temperature of 0.6 and a top-$p$ value of 0.95 to generate $k$ responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as

$$
\text{pass@1} = \frac{1}{k}\sum_{i=1}^{k} p_i,
$$

where $p_i$ denotes the correctness of the $i$-th response. This method provides more reliable

performance estimates. For AIME 2024, we also report consensus (majority vote) results (Wang

et al., 2022) using 64 samples, denoted as cons@64.
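The pass@1 and cons@64 estimates above reduce to a few lines of code; the helper names are ours.

```python
from collections import Counter

def pass_at_1(correct_flags):
    """Average correctness over the k sampled responses for one question."""
    return sum(correct_flags) / len(correct_flags)

def cons_at_k(final_answers, reference):
    """Majority-vote (consensus) correctness over k sampled final answers."""
    majority_answer, _ = Counter(final_answers).most_common(1)[0]
    return float(majority_answer == reference)

# Example with k = 4 responses to one question (1 = correct, 0 = incorrect).
print(pass_at_1([1, 0, 1, 1]))                         # 0.75
print(cons_at_k(["336", "112", "336", "336"], "336"))  # 1.0
```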

^1 https://aider.chat
^2 https://codeforces.com
^3 https://www.cms.org.cn/Home/comp/comp/cid/12.html


3.1. DeepSeek-R1 Evaluation

| Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 | OpenAI-o1-mini | OpenAI-o1-1217 | DeepSeek-R1 |
| Architecture | - | - | MoE | - | - | MoE |
| # Activated Params | - | - | 37B | - | - | 37B |
| # Total Params | - | - | 671B | - | - | 671B |
| English | | | | | | |
| MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| GPQA Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
| Code | | | | | | |
| LiveCodeBench (Pass@1-COT) | 38.9 | 32.9 | 36.2 | 53.8 | 63.4 | 65.9 |
| Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
| Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
| SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
| Math | | | | | | |
| AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
| MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
| CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
| Chinese | | | | | | |
| CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
| C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
| C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |

Table 4 | Comparison between DeepSeek-R1 and other representative models.

For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA

Diamond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are achieved through large-scale reinforcement learning. Additionally, DeepSeek-R1

excels on FRAMES, a long-context-dependent QA task, showcasing its strong document analysis

capabilities. This highlights the potential of reasoning models in AI-driven search and data

analysis tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3,

demonstrating its capability in handling fact-based queries. A similar trend is observed where

OpenAI-o1 surpasses GPT-4o on this benchmark. However, DeepSeek-R1 performs worse than

DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an

accuracy of over 70%. DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a

model’s ability to follow format instructions. These improvements can be linked to the inclusion

of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL

training. Furthermore, remarkable performance is observed on AlpacaEval2.0 and ArenaHard,

indicating DeepSeek-R1’s strengths in writing tasks and open-domain question answering. Its

significant outperformance of DeepSeek-V3 underscores the generalization benefits of large-scale

RL, which not only boosts reasoning capabilities but also improves performance across diverse

domains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an

average of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that


DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying

its robustness across multiple tasks.

On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217, surpassing other models by a large margin. A similar trend is observed on coding algorithm

tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these

benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1

on Aider but achieves comparable performance on SWE Verified. We believe the engineering

performance of DeepSeek-R1 will improve in the next version, as the amount of related RL

training data currently remains very limited.

3.2. Distilled Model Evaluation

| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| QwQ-32B-Preview | 50.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |

Table 5 | Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks.

As shown in Table 5, simply distilling DeepSeek-R1's outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further

gains. We believe this warrants further exploration and therefore present only the results of the

simple SFT-distilled models here.

4. Discussion

4.1. Distillation vs. Reinforcement Learning

In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive results. However, there is still one question left: can the model achieve comparable performance

through the large-scale RL training discussed in the paper without distillation?

To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math,

code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The

experimental results, shown in Table 6, demonstrate that the 32B base model, after large-scale


| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) |
| QwQ-32B-Preview | 50.0 | 60.0 | 90.6 | 54.5 | 41.9 |
| DeepSeek-R1-Zero-Qwen-32B | 47.0 | 60.0 | 91.6 | 55.0 | 40.2 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 |

Table 6 | Comparison of distilled and RL models on reasoning-related benchmarks.

RL training, achieves performance on par with QwQ-32B-Preview. However, DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than

DeepSeek-R1-Zero-Qwen-32B across all benchmarks.

Therefore, we can draw two conclusions: First, distilling more powerful models into smaller

ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in

this paper require enormous computational power and may not even achieve the performance

of distillation. Second, while distillation strategies are both economical and effective, advancing

beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.

4.2. Unsuccessful Attempts

In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along

the way. We share our failure experiences here to provide insights, but this does not imply that

these approaches are incapable of developing effective reasoning models.

Process Reward Model (PRM) PRM is a reasonable method to guide the model toward better

approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al.,

2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grained step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward

hacking (Gao et al., 2022), and retraining the reward model needs additional training resources

and it complicates the whole training pipeline. In conclusion, while PRM demonstrates a good

ability to rerank the top-N responses generated by the model or assist in guided search (Snell

et al., 2024), its advantages are limited compared to the additional computational overhead it

introduces during the large-scale reinforcement learning process in our experiments.

Monte Carlo Tree Search (MCTS)   Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time

compute scalability. This approach involves breaking answers into smaller parts to allow the

model to explore the solution space systematically. To facilitate this, we prompt the model to

generate multiple tags that correspond to specific reasoning steps necessary for the search. For

training, we first use collected prompts to find answers via MCTS guided by a pre-trained value

model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.

However, this approach encounters several challenges when scaling up the training. First,

unlike chess, where the search space is relatively well-defined, token generation presents an


exponentially larger search space. To address this, we set a maximum extension limit for each

node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process.

Training a fine-grained value model is inherently difficult, which makes it challenging for the

model to iteratively improve. While AlphaGo’s core success relied on training a value model to

progressively enhance its performance, this principle proves difficult to replicate in our setup

due to the complexities of token generation.

In conclusion, while MCTS can improve performance during inference when paired with a

pre-trained value model, iteratively boosting model performance through self-search remains a

significant challenge.

5. Conclusion, Limitations, and Future Work

In this work, we share our journey in enhancing model reasoning abilities through reinforcement

learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful,

leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves

performance comparable to OpenAI-o1-1217 on a range of tasks.

We further explore distilling the reasoning capability into small dense models. We use

DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small

dense models. The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o

and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH. Other

dense models also achieve impressive results, significantly outperforming other instruction

tuned models based on the same underlying checkpoints.

In the future, we plan to invest in research across the following directions for DeepSeek-R1.

• General Capability: Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn conversation, complex role-playing, and JSON output. Moving forward, we plan to explore how long CoT can be leveraged to enhance tasks in these fields.

• Language Mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses, even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates.

• Prompting Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results.

• Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.

References

AI@Meta. Llama 3.1 model card, 2024. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md.

Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,

N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin,

B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss,

A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,

A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,

M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and

W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.

URL https://arxiv.org/abs/2107.03374.

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten,

A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple

way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024. URL https://arxiv.org/abs/2309.17179.

L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization, 2022. URL

https://arxiv.org/abs/2210.10760.

A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and

P. Minervini. Are we done with mmlu? CoRR, abs/2406.04127, 2024. URL https://doi.org/10.48550/arXiv.2406.04127.

Google. Our next-generation model: Gemini 1.5, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024.

Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, et al. Chinese simpleqa: A chinese factuality evaluation for large language models. arXiv preprint arXiv:2411.07140, 2024.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring

massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A

multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.

N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica.

Livecodebench: Holistic and contamination free evaluation of large language models for code.

CoRR, abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv.2403.07974.

S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui.

Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. CoRR,

abs/2409.12941, 2024. doi: 10.48550/ARXIV.2409.12941. URL https://doi.org/10.48550/arXiv.2409.12941.

A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop,

R. Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024.

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measur

ing massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212,

2023.

T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From

crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman,

I. Sutskever, and K. Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.

B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL

https://github.com/WildEval/ZeroEval.

MAA. American invitational mathematics examination - aime. In American Invitational Mathematics Examination - AIME 2024, February 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.

OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.

OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/.

OpenAI. Introducing SimpleQA, 2024c. URL https://openai.com/index/introducing-simpleqa/.

OpenAI. Introducing SWE-bench Verified, 2024d. URL https://openai.com/index/introducing-swe-bench-verified/.

Qwen. Qwq: Reflect deeply on the boundaries of the unknown, 2024a. URL https://qwenlm.github.io/blog/qwq-32b-preview/.

Qwen. Qwen2.5: A party of foundation models, 2024b. URL https://qwenlm.github.io/blog/qwen2.5.

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman.

GPQA: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre,

D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and

shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815,

2017a. URL http://arxiv.org/abs/1712.01815.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker,

M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and

D. Hassabis. Mastering the game of go without human knowledge. Nat., 550(7676):354–359,

2017b. doi: 10.1038/NATURE24270. URL https://doi.org/10.1038/nature24270.

C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more

effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.

T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong. Solving olympiad geometry without human

demonstrations. Nature, 2024. doi: 10.1038/s41586-023-06747-5.

J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and

I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: A label

free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935,

2023.

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li,

M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024.

URL https://doi.org/10.48550/arXiv.2406.01574.

C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software

engineering agents. arXiv preprint, 2024.

H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao,

Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. Deepseek-prover-v1.5: Harnessing

proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL

https://arxiv.org/abs/2408.08152.

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following

evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

Appendix A. Contributions and Acknowledgments

Core Contributors

Daya Guo Dejian Yang Haowei Zhang Junxiao Song Ruoyu Zhang Runxin Xu Qihao Zhu Shirong Ma Peiyi Wang Xiao Bi Xiaokang Zhang Xingkai Yu Yu Wu Z.F. Wu Zhibin Gou Zhihong Shao Zhuoshu Li Ziyi Gao

Contributors

Aixin Liu Bing Xue Bingxuan Wang Bochao Wu Bei Feng Chengda Lu Chenggang Zhao Chengqi Deng Chong Ruan Damai Dai Deli Chen Dongjie Ji Erhang Li Fangyun Lin Fucong Dai Fuli Luo* Guangbo Hao Guanting Chen Guowei Li H. Zhang Hanwei Xu Honghui Ding Huazuo Gao Hui Qu Hui Li Jianzhong Guo Jiashi Li Jingchang Chen Jingyang Yuan Jinhao Tu Junjie Qiu Junlong Li J.L. Cai Jiaqi Ni Jian Liang Jin Chen Kai Dong Kai Hu* Kaichao You Kaige Gao Kang Guan Kexin Huang Kuai Yu Lean Wang Lecong Zhang Liang Zhao Litong Wang Liyue Zhang Lei Xu Leyi Xia Mingchuan Zhang Minghua Zhang Minghui Tang Mingxu Zhou Meng Li Miaojun Wang Mingming Li Ning Tian Panpan Huang Peng Zhang Qiancheng Wang Qinyu Chen Qiushi Du Ruiqi Ge* Ruisong Zhang Ruizhe Pan Runji Wang R.J. Chen R.L. Jin


Ruyi Chen Shanghao Lu Shangyan Zhou Shanhuang Chen Shengfeng Ye Shiyu Wang Shuiping Yu Shunfeng Zhou Shuting Pan S.S. Li Shuang Zhou Shaoqing Wu Shengfeng Ye Tao Yun Tian Pei Tianyu Sun T. Wang Wangding Zeng Wen Liu Wenfeng Liang Wenjun Gao Wenqin Yu* Wentao Zhang W.L. Xiao Wei An Xiaodong Liu Xiaohan Wang Xiaokang Chen Xiaotao Nie Xin Cheng Xin Liu Xin Xie Xingchao Liu Xinyu Yang Xinyuan Li Xuecheng Su Xuheng Lin X.Q. Li Xiangyue Jin Xiaojin Shen Xiaosha Chen Xiaowen Sun Xiaoxiang Wang Xinnan Song Xinyi Zhou Xianzu Wang Xinxia Shan Y.K. Li Y.Q. Wang Y.X. Wei Yang Zhang Yanhong Xu Yao Li Yao Zhao Yaofeng Sun Yaohui Wang Yi Yu Yichao Zhang Yifan Shi Yiliang Xiong Ying He Yishi Piao Yisong Wang Yixuan Tan Yiyang Ma* Yiyuan Liu Yongqiang Guo Yuan Ou Yuduan Wang Yue Gong Yuheng Zou Yujia He Yunfan Xiong Yuxiang Luo Yuxiang You Yuxuan Liu Yuyang Zhou Y.X. Zhu Yanping Huang Yaohui Li Yi Zheng Yuchen Zhu Yunxian Ma Ying Tang Yukun Zha Yuting Yan Z.Z. Ren Zehui Ren Zhangli Sha Zhe Fu Zhean Xu Zhenda Xie Zhengyan Zhang Zhewen Hao Zhicheng Ma Zhigang Yan Zhiyu Wu Zihui Gu


Zijia Zhu Zijun Liu* Zilin Li Ziwei Xie Ziyang Song Zizheng Pan Zhen Huang Zhipeng Xu Zhongyu Zhang Zhen Zhang

Within each role, authors are listed alphabetically by the first name. Names marked with *

denote individuals who have departed from our team.
