LLMs之DeepSeek-R1:《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning》翻译与解读

导读:该论文于2025年1月22日发布,核心是利用强化学习提升大型语言模型(LLM)的推理能力,并通过实验验证了其有效性。它不仅提供了一种通过强化学习激励LLM推理能力的新训练范式,也为LLM推理能力的研究提供了宝贵的经验和开源资源。

>> 背景痛点:

当前LLM推理能力的局限性:现有LLM在推理任务上的表现仍有很大提升空间,难以达到人类水平。虽然一些方法(如OpenAI的o1系列模型)通过增加CoT推理过程的长度来提高推理能力,但有效地进行测试时间缩放仍然是一个挑战。

依赖监督数据:之前的研究很大程度上依赖于大量的监督数据来提高模型性能,这费时费力。

>> 具体的解决方案:论文提出了两个LLM推理模型:DeepSeek-R1-Zero 和 DeepSeek-R1,两者都基于强化学习,但训练策略有所不同,并通过知识蒸馏技术将其能力扩展到更小的模型。

● DeepSeek-R1-Zero:这是一个完全基于大规模强化学习训练的模型,无需任何监督微调 (SFT) 作为预备步骤。它直接在基础模型上应用强化学习,以观察模型在纯强化学习过程中的自我进化。

● DeepSeek-R1:为了解决DeepSeek-R1-Zero的可读性和语言混合等问题,并进一步提高推理性能,DeepSeek-R1引入了多阶段训练和冷启动数据。它包含四个阶段:

1) 使用少量高质量的CoT数据微调基础模型;

2) 进行与DeepSeek-R1-Zero类似的、以推理为导向的强化学习;

3) 通过拒绝采样生成新的SFT数据,并结合其他领域的监督数据进行再训练;

4) 对所有场景的提示进行强化学习,进一步提升模型的帮助性和无害性。

● 知识蒸馏:论文还探索了将DeepSeek-R1的推理能力蒸馏到更小的稠密模型中,结果表明,这种方法比在小型模型上直接应用强化学习效果更好。

>> 核心思路步骤:

● DeepSeek-R1-Zero:选择基础模型 (DeepSeek-V3-Base),使用GRPO (Group Relative Policy Optimization)强化学习算法,设计简单的训练模板,只关注答案的准确性和格式,观察模型的自我进化过程。

● DeepSeek-R1:多阶段训练,包括冷启动数据微调、以推理为导向的强化学习、拒绝采样与监督微调、针对所有场景的强化学习。每个阶段都针对特定目标进行优化,例如可读性、准确性、帮助性和无害性。

● 知识蒸馏:使用DeepSeek-R1生成的800k样本,对基于Qwen和Llama的多个小型稠密模型进行微调。

>> 优势:

● 纯强化学习训练:DeepSeek-R1-Zero证明了LLM的推理能力可以通过纯强化学习进行提升,无需SFT。

● 高效的训练策略:DeepSeek-R1的多阶段训练策略,结合冷启动数据和拒绝采样,有效地提升了模型的推理能力和可读性。

● 知识蒸馏的有效性:将大型模型的推理能力蒸馏到小型模型中,可以获得比在小型模型上直接应用强化学习更好的效果,降低了计算成本。

● 开源模型:论文开源了DeepSeek-R1-Zero、DeepSeek-R1以及多个蒸馏后的模型,方便研究社区进行进一步的研究。

>> 结论和观点:

● 强化学习是提升LLM推理能力的一种有效方法,即使在没有监督数据的情况下也能取得显著成果。

● 多阶段训练和冷启动数据可以进一步提升模型性能和可读性。

● 大型模型的推理能力可以有效地蒸馏到小型模型中。

● 虽然强化学习在提升推理能力方面具有潜力,但仍存在一些挑战,例如语言混合、提示工程以及在软件工程任务上的应用。 未来的研究方向包括提升模型的通用能力,解决语言混合问题,改进提示策略以及提高在软件工程任务上的效率。

#################20250128更新解读######################

>> OpenAI的o1模型优势:DeepSeek指出,OpenAI的o1模型之所以领先,是因为它采用了不同的策略来提升性能。与传统的预训练大模型不同,o1模型专注于增加推理阶段的计算量,即思考时间,而不是仅仅依赖于预训练阶段的算力投入。这种方法类似于教育中的思路,即重视思考过程而非死记硬背。通俗地解释,就是没有让小孩把精力放在背题上,而是把精力放在思考上。但问题是,o1作为业界最强闭源模型,是怎么做到的呢?

>> DeepSeek的r1-zero模型实现方法:DeepSeek的r1-zero模型是通过纯强化学习实现的,这种方法允许模型在没有监督数据的情况下自行发展推理能力。通俗地解释,强化学习的过程就像是在教导孩子独立完成作业。每当孩子答对了问题,我们就给予鼓励;如果答错了,就让他们重新尝试,直到他们能够正确解答。这种方式鼓励孩子不断尝试,直到掌握正确的解题方法。同时,r1-zero模型结合了GRPO算法与规则化奖励来实现。

● GRPO算法通过小组竞争学习法减少了计算资源的消耗并提高了训练速度。通俗地解释,当多个孩子同时尝试解决同一份作业时,如果某个孩子找到了正确答案,其他孩子就会参考他的答案来修正自己的作业。这就像是一种集体学习的过程,通过模仿和学习他人的成功经验。

● 规则化奖励则确保了解题过程的规范性,有助于模型的后期优化。这种训练方式在成本上具有优势,但可能导致解题步骤的语言混杂,影响可读性。通俗地解释,在教育孩子时不仅关注他们是否得到了正确答案,还强调解题过程的规范性。这样做是为了避免孩子只是猜测答案或者随意书写解题步骤。保持解题过程的规范性对于后续模型的改进和优化至关重要。

>> DeepSeek的r1模型优化:为了解决r1-zero模型的可读性问题,DeepSeek在r1模型中引入了SFT监督微调技术。这一技术通过提供标准的解题思路模板,规范了答题格式,然后结合强化学习,弥补了r1-zero模型的不足。经过二次强化训练,r1模型实现了与o1模型相匹敌的性能。r1模型可以类比为经过初步指导后自学成才的孩子,其能力自然优于完全自学的r1-zero模型。

>> r1模型的蒸馏技术:r1模型还采用了先进的蒸馏技术,通过让r1模型生成80万条高质量推理相关样本,用于训练其他小模型,如Qwen和LLaMA。这种方法使得小模型也能达到大模型的性能水平。例如,在AIME 2024中,蒸馏后的Qwen-7B模型的表现超过了参数更多的QwQ-32B-Preview版本。

#################20250201更新解读######################

r1-zero证明无监督强化学习可独立发展推理能力;r1通过混合训练实现性能飞跃

o1模型:预训练 → 强化推理计算(闭源优化)

r1-zero模型:纯强化学习 → GRPO竞争 + 规则化奖励 → 输出“顿悟”但可读性差

r1模型:SFT监督微调 → 强化学习补充 → 二次强化训练 → 高可读性+高性能

● GRPO算法(小组竞争学习):多个模型同时解题,优胜者方案被其他模型模仿。降低50%计算资源消耗,加速训练过程(提升2倍);

● 规则化奖励:不仅要求答案正确,还需解题步骤符合规范(防止模型“瞎蒙”)。提升模型逻辑严谨性,为后续优化提供可解释性基础;

● SFT监督微调:预先给模型展示标准解题模板(如数学题分步推导)。解决r1-zero输出混乱问题,规范结果格式;

● 模型蒸馏技术:用大模型(r1)生成高质量样本,训练小模型(如Qwen-7B)。小模型性能匹敌大模型(7B参数超越32B版本);

● 顿悟时刻:模型在没有被显式编程预设的情况下突然展现出推理能力,验证了无监督强化学习可自主发展复杂能力;

目录

相关文章

2024年1月5日,LLMs之DeepSeek-V1:《DeepSeek LLM: Scaling Open-Source Language Models with Longtermism》翻译与解读

2024年1月11日,LLMs之DeepSeek-V1之MoE:《DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models》翻译与解读

2024年1月25日,LLMs之DeepSeek-V1:《DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence》翻译与解读

2024年2月5日,LLMs之DeepSeek-V1:《DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models》翻译与解读

2024年5月7日,LLMs之DeepSeek-V2:《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model》翻译与解读

2024年12月26日,LLMs之MoE之DeepSeek-V3:DeepSeek-V3的简介、安装和使用方法、案例应用之详细攻略

2024年12月27日,LLMs之MoE之DeepSeek-V3:《DeepSeek-V3 Technical Report》翻译与解读(DeepSeek-V3的最详细解读)

2025年1月20日,LLMs之DeepSeek-V3:DeepSeek-R1的简介、安装和使用方法、案例应用之详细攻略

2025年1月22日,LLMs之DeepSeek-R1:《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning》翻译与解读

《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning》翻译与解读

Abstract

Figure 1: Benchmark performance of DeepSeek-R1.图 1:DeepSeek-R1 的基准性能。

1、Introduction

DeepSeek-R1-Zero:基础模型(采用DeepSeek-V3-Base)+RL框架(采用GRPO+K级RL步数)→可读性差和语言混杂

DeepSeek-R1:收集冷启动数据(K级条数)→多阶段训练(以推理为导向的强化学习)→重训练微调DeepSeek-V3-Base模型(通过拒绝采样生成新的 SFT 数据+结合DeepSeek-V3在多领域的监督数据)→额外的强化学习

1.1 Contributions贡献

后训练:两个强化学习阶段+两个 SFT 阶段

蒸馏:证明了大型模型的推理模式可以被提炼到小型模型中

1.2 Summary of Evaluation Results评估结果总结

2 Approach方法

2.1 Overview概述

2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model基于基础模型的强化学习

2.2.1 Reinforcement Learning Algorithm强化学习算法

Group Relative Policy Optimization组相对策略优化

Table 1:Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning question during training.表 1:DeepSeek-R1-Zero的模板。在训练期间,提示将被具体的推理问题所替换。

2.2.2 Reward Modeling奖励建模

2.2.3 Training Template训练模板

2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero,DeepSeek-R1-Zero 的性能、自我进化过程和顿悟时刻

Table 2:Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.表 2:DeepSeek-R1-Zero 与 OpenAI o1 模型在推理相关基准测试上的比较。

Figure 2:AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.图 2:DeepSeek-R1-Zero 在训练期间的 AIME 准确率。对于每个问题,我们抽取 16 个回答并计算总体平均准确率,以确保评估的稳定性。

Figure 3:The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.图 3:在强化学习过程中,DeepSeek-R1-Zero 在训练集上的平均响应长度。DeepSeek-R1-Zero 自然地学会了用更多的思考时间来解决推理任务。

Self-evolution Process of DeepSeek-R1-Zero自我进化过程

Aha Moment of DeepSeek-R1-Zero顿悟时刻

Drawback of DeepSeek-R1-Zero缺陷

Table 3:An interesting “aha moment” of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.表 3:DeepSeek-R1-Zero 中间版本的一个有趣“顿悟时刻”。该模型学会了以拟人化的口吻进行反思。这对我们来说也是一个顿悟时刻,让我们见证了强化学习的力量与美妙。

2.3 DeepSeek-R1: Reinforcement Learning with Cold Start冷启动强化学习

2.3.1 Cold Start冷启动

2.3.2 Reasoning-oriented Reinforcement Learning以推理为导向的强化学习

2.3.3 Rejection Sampling and Supervised Fine-Tuning拒绝采样与监督微调

Reasoning data推理数据

Non-Reasoning data非推理数据

2.3.4 Reinforcement Learning for all Scenarios适用于所有场景的强化学习

2.4 Distillation: Empower Small Models with Reasoning Capability知识蒸馏:赋予小型模型推理能力

3 Experiment实验

Benchmarks基准测试

Evaluation Prompts评估提示

Baselines基准线

Evaluation Setup评估设置

3.1 DeepSeek-R1 Evaluation评估

Table 4: Comparison between DeepSeek-R1 and other representative models.表 4:DeepSeek-R1 与其他代表性模型的比较。

3.2 Distilled Model Evaluation蒸馏模型评估

Table 5:Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks.表 5:DeepSeek-R1 蒸馏模型与其他可比模型在推理相关基准上的比较。

4 Discussion讨论

4.1 Distillation v.s. Reinforcement Learning蒸馏与强化学习

Table 6:Comparison of distilled and RL Models on Reasoning-Related Benchmarks.表 6:蒸馏模型与强化学习模型在推理相关基准测试上的比较。

4.2 Unsuccessful Attempts失败的尝试

Process Reward Model (PRM)过程奖励模型

Monte Carlo Tree Search (MCTS)蒙特卡罗树搜索

5 Conclusion, Limitations, and Future Work结论、局限性与未来工作


相关文章

2024年1月5日,LLMs之DeepSeek-V1:《DeepSeek LLM: Scaling Open-Source Language Models with Longtermism》翻译与解读

LLMs之DeepSeek-V1:《DeepSeek LLM: Scaling Open-Source Language Models with Longtermism》翻译与解读-CSDN博客

2024年1月11日,LLMs之DeepSeek-V1之MoE:《DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models》翻译与解读

LLMs之DeepSeek-V1之MoE:《DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Lang-CSDN博客

2024年1月25日,LLMs之DeepSeek-V1:《DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence》翻译与解读

LLMs之DeepSeek-V1:《DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Cod-CSDN博客

2024年2月5日,LLMs之DeepSeek-V1:《DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models》翻译与解读

LLMs之DeepSeek-V1:《DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models-CSDN博客

2024年5月7日,LLMs之DeepSeek-V2:《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model》翻译与解读

LLMs之DeepSeek-V2:《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model-CSDN博客

2024年12月26日,LLMs之MoE之DeepSeek-V3:DeepSeek-V3的简介、安装和使用方法、案例应用之详细攻略

LLMs之MoE之DeepSeek-V3:DeepSeek-V3的简介、安装和使用方法、案例应用之详细攻略-CSDN博客

2024年12月27日,LLMs之MoE之DeepSeek-V3:《DeepSeek-V3 Technical Report》翻译与解读(DeepSeek-V3的最详细解读)

LLMs之MoE之DeepSeek-V3:《DeepSeek-V3 Technical Report》翻译与解读(DeepSeek-V3的最详细解读)_in order to achieve efficient training, we support-CSDN博客

2025年1月20日,LLMs之DeepSeek-V3:DeepSeek-R1的简介、安装和使用方法、案例应用之详细攻略

LLMs之DeepSeek-V3:DeepSeek-R1的简介、安装和使用方法、案例应用之详细攻略_怎样使用deepseek r1-CSDN博客

2025年1月22日,LLMs之DeepSeek-R1:《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning》翻译与解读

LLMs之DeepSeek-R1:《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning-CSDN博客

《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning》翻译与解读

地址

论文地址:[2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

时间

2025年1月22日

作者

DeepSeek-AI团队

Abstract

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

我们推出了第一代推理模型 DeepSeek-R1-Zero 和 DeepSeek-R1。DeepSeek-R1-Zero 是通过大规模强化学习(RL)训练而成、且未经过监督微调(SFT)这一初步步骤的模型,展现出了卓越的推理能力。通过强化学习,DeepSeek-R1-Zero 自然地产生了众多强大而有趣的推理行为。然而,它也面临着诸如可读性差和语言混杂等问题。为了解决这些问题并进一步提升推理性能,我们推出了 DeepSeek-R1,该模型在强化学习之前加入了多阶段训练和冷启动数据。DeepSeek-R1 在推理任务上的表现可与 OpenAI-o1-1217 相媲美。为了支持研究社区,我们开源了 DeepSeek-R1-Zero、DeepSeek-R1,以及基于 Qwen 和 Llama、从 DeepSeek-R1 蒸馏得到的六个密集模型(1.5B、7B、8B、14B、32B、70B)。

Figure 1: Benchmark performance of DeepSeek-R1.图 1:DeepSeek-R1 的基准性能。

1、Introduction

引言部分指出近年来LLM快速迭代,朝着AGI发展,后训练阶段成为提升模型性能的重要环节。OpenAI的o1系列模型通过延长CoT推理过程取得了显著进展,但测试时间缩放仍然是挑战。本文旨在探索纯强化学习(RL)提升LLM推理能力的潜力,并介绍DeepSeek-R1-Zero和DeepSeek-R1模型。

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively diminishing the gap towards Artificial General Intelligence (AGI).

Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources against pre-training. In the context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Xin et al., 2024; Trinh et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI’s o1 series models.

近年来,大型语言模型(LLMs)经历了快速的迭代和演进(OpenAI,2024a;Anthropic,2024;Google,2024),逐渐缩小了与通用人工智能(AGI)之间的差距。

最近,后训练已成为完整训练流程中的一个重要组成部分。研究表明,它能够提高推理任务的准确性,与社会价值观保持一致,并适应用户偏好,同时相较于预训练所需的计算资源相对较少。在推理能力方面,OpenAI 的 o1(OpenAI,2024b)系列模型率先引入了推理时间缩放,通过增加链式思维推理过程的长度来实现。这种方法在数学、编程和科学推理等各种推理任务中取得了显著的改进。然而,如何有效地进行测试时间缩放仍是研究界的一个开放性问题。此前的一些研究探索了多种方法,包括基于过程的奖励模型(Uesato 等人,2022;Lightman 等人,2023;Wang 等人,2023)、强化学习(Kumar 等人,2024),以及诸如蒙特卡罗树搜索和束搜索之类的搜索算法(Feng 等人,2024;Xin 等人,2024;Trinh 等人,2024)。然而,这些方法中没有一种在通用推理性能方面能与 OpenAI 的 o1 系列模型相媲美。

DeepSeek-R1-Zero:基础模型(采用DeepSeek-V3-Base)+RL框架(采用GRPO+K级RL步数)→可读性差和语言混杂

DeepSeek-R1:收集冷启动数据(K级条数)→多阶段训练(以推理为导向的强化学习)→重训练微调DeepSeek-V3-Base模型(通过拒绝采样生成新的 SFT 数据+结合DeepSeek-V3在多领域的监督数据)→额外的强化学习

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

在本文中,我们首次尝试使用纯强化学习(RL)来提升语言模型的推理能力。我们的目标是探索大型语言模型在没有任何监督数据的情况下发展推理能力的潜力,重点关注它们通过纯 RL 过程实现的自我进化。具体而言,我们以 DeepSeek-V3-Base 作为基础模型,并采用 GRPO(Shao 等人,2024)作为 RL 框架来提升模型的推理性能。在训练过程中,DeepSeek-R1-Zero 自然地展现出众多强大且有趣的推理行为。经过数千步的 RL 训练,DeepSeek-R1-Zero 在推理基准测试中表现出色。例如,在 AIME 2024 上的 pass@1 分数从 15.6% 提升至 71.0%,而采用多数投票后,分数进一步提高到 86.7%,与 OpenAI-o1-0912 的表现相当。然而,DeepSeek-R1-Zero 面临着诸如可读性差和语言混杂等问题。为了解决这些问题并进一步提升推理性能,我们推出了 DeepSeek-R1,它引入了一小部分冷启动数据和多阶段训练流程。具体来说,我们首先收集数千条冷启动数据来微调 DeepSeek-V3-Base 模型。接下来,我们像 DeepSeek-R1-Zero 那样进行以推理为导向的强化学习。在强化学习过程接近收敛时,我们通过拒绝采样在强化学习检查点上生成新的 SFT 数据,并结合来自 DeepSeek-V3 在写作、事实问答和自我认知等领域的监督数据,然后重新训练 DeepSeek-V3-Base 模型。在使用新数据进行微调后,该检查点会经历一个额外的强化学习过程,同时考虑所有场景的提示。经过这些步骤,我们得到了一个被称为 DeepSeek-R1 的检查点,其性能与 OpenAI-o1-1217 相当。

We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the reasoning benchmarks among dense models.

我们进一步探索了从 DeepSeek-R1 向更小的密集模型进行知识蒸馏。以 Qwen2.5-32B(Qwen,2024b)作为基础模型,直接从 DeepSeek-R1 进行蒸馏的表现优于在该模型上应用强化学习。这表明大型基础模型所发现的推理模式对于提升推理能力至关重要。我们开源了精简版的 Qwen 和 Llama(Dubey 等人,2024 年)系列。值得注意的是,我们的 14B参数精简模型大幅超越了最先进的开源 QwQ-32B-Preview(Qwen,2024a),而 32B参数和 70B参数的精简模型在密集模型的推理基准测试中创造了新的纪录

1.1 Contributions贡献

后训练:两个强化学习阶段+两个 SFT 阶段

Post-Training: Large-Scale Reinforcement Learning on the Base Model

• We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.

• We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model’s reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

后训练:在基础模型上进行大规模强化学习

• 我们直接将RL应用于基础模型,而不依赖监督微调(SFT)作为初步步骤。这种方法使模型能够探索解决复杂问题的链式思维(CoT),从而开发出 DeepSeek-R1-Zero。DeepSeek-R1-Zero 展示了诸如自我验证、反思和生成长链式思维等能力,为研究界树立了一个重要的里程碑。值得注意的是,这是首次公开研究证明,通过纯强化学习即可激励大型语言模型的推理能力,无需进行 SFT。这一突破为该领域的未来发展铺平了道路。

• 我们介绍了开发 DeepSeek-R1 的流程。该流程包含两个强化学习阶段,旨在发现改进的推理模式并使其与人类偏好保持一致,以及两个 SFT 阶段,作为模型推理和非推理能力的种子。我们相信该流程将通过创建更出色的模型为行业带来益处。

蒸馏:证明了大型模型的推理模式可以被提炼到小型模型中

Distillation: Smaller Models Can Be Powerful Too

• We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.

• Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

蒸馏:小模型也能大有作为

• 我们证明了大型模型的推理模式可以被提炼到小型模型中,其性能优于通过强化学习在小型模型中发现的推理模式。开源的 DeepSeek-R1 及其 API 将有助于研究社区在未来提炼出更出色的更小型模型。

• 利用 DeepSeek-R1 生成的推理数据,我们对研究社区广泛使用的几个密集模型进行了微调。评估结果表明,提炼出的更小型密集模型在基准测试中表现卓越。DeepSeek-R1-Distill-Qwen-7B 在 AIME 2024 中达到了 55.5%,超过了 QwQ-32B-Preview。此外,DeepSeek-R1-Distill-Qwen-32B 在 AIME 2024 中得分 72.6%,在 MATH-500 中得分 94.3%,在 LiveCodeBench 中得分 57.2%。这些结果显著优于之前的开源模型,并且与 o1-mini 相当。我们向社区开源了基于 Qwen2.5 和 Llama3 系列的 1.5B、7B、8B、14B、32B 和 70B 的检查点。

1.2 Summary of Evaluation Results评估结果总结

• Reasoning tasks: (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming 96.3% human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks.

• Knowledge: On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses 4o on this benchmark.

• Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing its strong ability to intelligently handle non-exam-oriented queries. Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks.

• 推理任务:(1)在 AIME 2024 上,DeepSeek-R1 获得了 79.8% 的 Pass@1 分数,略高于 OpenAI-o1-1217。在 MATH-500 上,它取得了 97.3% 的出色成绩,与 OpenAI-o1-1217 相当,并显著优于其他模型。(2)在与编码相关的任务中,DeepSeek-R1 在代码竞赛任务中表现出专家水平,在 Codeforces 上获得了 2029 的 ELO 分数,超过了 96.3% 的参赛选手。对于工程相关任务,DeepSeek-R1 的表现略优于 DeepSeek-V3,这有助于开发者完成实际任务。

• 知识:在 MMLU、MMLU-Pro 和 GPQA Diamond 等基准测试中,DeepSeek-R1 取得了出色的成绩,显著优于 DeepSeek-V3,在 MMLU 上得分 90.8%,在 MMLU-Pro 上得分 84.0%,在 GPQA Diamond 上得分 71.5%。虽然在这些基准测试中的表现略低于 OpenAI-o1-1217,但 DeepSeek-R1 超过了其他闭源模型,在教育任务中展现出竞争优势。在事实基准 SimpleQA 上,DeepSeek-R1 表现优于 DeepSeek-V3,证明了其处理基于事实的查询的能力。在这一基准测试中,OpenAI-o1 也呈现出类似的趋势,超越了 4o。

• 其他:DeepSeek-R1 在包括创意写作、通用问答、编辑、摘要等众多任务中表现出色。它在 AlpacaEval 2.0 上实现了令人瞩目的长度控制胜率 87.6%,在 ArenaHard 上的胜率为 92.3%,展示了其处理非考试导向型查询的智能能力。此外,DeepSeek-R1 在需要长上下文理解的任务中表现出色,在长上下文基准测试中大幅超越了 DeepSeek-V3。

2 Approach方法

本节详细介绍了DeepSeek-R1-Zero和DeepSeek-R1两种模型的训练方法。DeepSeek-R1-Zero采用纯强化学习,无需SFT,展现出强大的推理能力和自我进化特性,但存在可读性和语言混合问题。DeepSeek-R1则在此基础上,通过多阶段训练(冷启动数据微调、以推理为导向的强化学习、拒绝采样和监督微调、针对所有场景的强化学习)和冷启动数据,解决了这些问题,并达到了与OpenAI-o1-1217相当的性能。最后,还介绍了知识蒸馏技术,将DeepSeek-R1的能力迁移到更小的模型上。

本节是论文的核心部分,详细阐述了两种模型的训练方法,并解释了设计选择的理由。 DeepSeek-R1的设计体现了对RL和SFT的结合应用,以及对模型可读性和鲁棒性的考虑。

2.1 Overview概述

Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples. 3) Distill the reasoning capability from DeepSeek-R1 to small dense models.

以往的研究工作严重依赖大量有监督数据来提升模型性能。在本研究中,我们证明了通过大规模强化学习(RL)可以显著提升推理能力,即使不使用有监督微调(SFT)作为冷启动。此外,加入少量冷启动数据还能进一步提升性能。在接下来的部分中,我们将介绍:(1)DeepSeek-R1-Zero,它直接将 RL 应用于基础模型,不使用任何 SFT 数据;(2)DeepSeek-R1,它从使用数千个长链式思维(CoT)示例微调的检查点开始应用 RL;(3)将 DeepSeek-R1 的推理能力迁移到小型密集模型中。

2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model基于基础模型的强化学习

Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Wang et al., 2023; Shao et al., 2024). However, these works heavily depended on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights.

强化学习在推理任务中已展现出显著的有效性,这一点在我们的先前研究(Wang 等人,2023 年;Shao 等人,2024 年)中得到了证实。然而,这些研究严重依赖于监督数据,而监督数据的收集耗时费力。在本节中,我们探索了大型语言模型在无需任何监督数据的情况下发展推理能力的潜力,重点关注它们通过纯粹的强化学习过程实现自我进化。我们首先简要概述我们的强化学习算法,然后展示一些令人兴奋的结果,希望这能为社区提供有价值的见解。

2.2.1 Reinforcement Learning Algorithm强化学习算法

Group Relative Policy Optimization组相对策略优化

In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\Big[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\Big]\ \frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_i\right) - \beta\, \mathbb{D}_{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big)\right), \quad (1)$$

$$\mathbb{D}_{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big) = \frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - \log\frac{\pi_{ref}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1, \quad (2)$$

where $\varepsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using a group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \cdots, r_G\})}{\mathrm{std}(\{r_1, r_2, \cdots, r_G\})}. \quad (3)$$

为了节省强化学习(RL)的训练成本,我们采用了组相对策略优化(GRPO)(Shao 等人,2024),该方法放弃了通常与策略模型大小相同的评论家模型,而是从组得分中估计基线。具体来说,对于每个问题 $q$,GRPO 从旧策略 $\pi_{\theta_{old}}$ 中采样一组输出 $\{o_1, o_2, \cdots, o_G\}$,然后通过最大化上述目标(公式 (1))来优化策略模型 $\pi_\theta$;

其中 $\varepsilon$ 和 $\beta$ 是超参数,$A_i$ 是优势,按公式 (3) 由每组输出对应的奖励 $\{r_1, r_2, \ldots, r_G\}$ 计算得出。
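为便于理解,下面给出公式 (1)-(3) 的一个极简 PyTorch 示意(非论文官方实现;eps=0.2、beta=0.04 等超参数取值以及按序列级对数概率计算的方式均为笔者假设):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """按公式 (3) 计算组相对优势:对同一问题的 G 个输出的奖励做组内标准化。"""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new: torch.Tensor,   # 当前策略 pi_theta 下各输出的对数概率, 形状 (G,)
              logp_old: torch.Tensor,   # 采样用旧策略 pi_theta_old 下的对数概率(假设已 detach)
              logp_ref: torch.Tensor,   # 参考策略 pi_ref 下的对数概率(假设已 detach)
              rewards:  torch.Tensor,   # 各输出的奖励 {r_1, ..., r_G}
              eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    adv = grpo_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_theta_old
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # 公式 (2) 的 KL 估计: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    # 公式 (1) 是最大化目标, 这里取负号作为损失以便梯度下降
    return -(surrogate - beta * kl).mean()
```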

A conversation between User and Assistant. The user asks a question, and the Assistant solves it.

The assistant first thinks about the reasoning process in the mind and then provides the user

with the answer. The reasoning process and answer are enclosed within <think> </think> and

<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>

<answer> answer here </answer>. User: prompt. Assistant: 

用户与助手之间的对话。用户提出一个问题,由助手来解决。助手先在脑海中思考推理过程,然后向用户给出答案。推理过程和答案分别包含在 <think> </think> 和 <answer> </answer> 标签中,即 <think> 此处为推理过程 </think> <answer> 此处为答案 </answer>。用户:prompt。助手:

Table 1:Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning question during training.表 1:DeepSeek-R1-Zero的模板。在训练期间,提示将被具体的推理问题所替换。

2.2.2 Reward Modeling奖励建模

The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:

• Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.

• Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.

奖励是训练信号的来源,它决定了强化学习的优化方向。为了训练 DeepSeek-R1-Zero,我们采用了一个基于规则的奖励系统,该系统主要包含两种类型的奖励:

• 准确性奖励:准确性奖励模型评估响应是否正确。例如,在具有确定性结果的数学问题中,模型需要以指定格式(例如,在方框内)提供最终答案,从而能够进行可靠的基于规则的正确性验证。同样,对于 LeetCode 问题,可以使用编译器根据预定义的测试用例生成反馈。

• 格式奖励:除了准确性奖励模型外,我们还采用了一个格式奖励模型,该模型要求模型将其思考过程置于 '<think>' 和 '</think>' 标签之间。

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

在开发 DeepSeek-R1-Zero 时,我们没有应用结果或过程神经奖励模型,因为我们发现神经奖励模型在大规模强化学习过程中可能会遭受奖励劫持,并且重新训练奖励模型需要额外的训练资源,还会使整个训练流程变得复杂。
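作为参考,下面用几行 Python 粗略示意这种基于规则的奖励(准确率奖励 + 格式奖励)。其中从 \boxed{...} 中提取最终答案、奖励取值为 0/1 以及两者直接相加等细节均为笔者假设,论文并未给出具体实现:

```python
import re

def format_reward(response: str) -> float:
    """格式奖励:检查模型是否把思考过程放在 <think> ... </think> 标签之间。"""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.S) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """准确率奖励:从指定格式(此处假设为 \\boxed{...})中提取最终答案并与标准答案比对。"""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    prediction = match.group(1).strip() if match else ""
    return 1.0 if prediction == ground_truth.strip() else 0.0

def rule_based_reward(response: str, ground_truth: str) -> float:
    """总奖励 = 准确率奖励 + 格式奖励(相加方式为假设)。"""
    return accuracy_reward(response, ground_truth) + format_reward(response)
```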

2.2.3 Training Template训练模板

To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As depicted in Table 1, this template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases—such as mandating reflective reasoning or promoting particular problem-solving strategies—to ensure that we can accurately observe the model’s natural progression during the RL process.

为了训练 DeepSeek-R1-Zero,我们首先设计了一个简单的模板,引导基础模型遵循我们指定的指令。如表 1 所示,此模板要求 DeepSeek-R1-Zero 首先生成推理过程,然后给出最终答案。我们有意将约束限制在这一结构格式上,避免任何特定内容的偏见——例如强制进行反思性推理或推广特定的问题解决策略——以确保我们能够准确观察模型在强化学习过程中的自然发展。
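结合表 1,下面给出一个把具体推理问题填入该训练模板的极简示意(模板文本取自表 1,函数封装仅为示意):

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {prompt}. Assistant: "
)

def build_prompt(question: str) -> str:
    """训练时,模板中的 prompt 会被替换为具体的推理问题。"""
    return R1_ZERO_TEMPLATE.format(prompt=question)

# 示例:build_prompt("What is 17 * 24?")
```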

2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero,DeepSeek-R1-Zero 的性能、自我进化过程和顿悟时刻

Table 2:Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.表 2:DeepSeek-R1-Zero 与 OpenAI o1 模型在推理相关基准测试上的比较。

Figure 2:AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.图 2:DeepSeek-R1-Zero 在训练期间的 AIME 准确率。对于每个问题,我们抽取 16 个回答并计算总体平均准确率,以确保评估的稳定性。

Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances. Notably, the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI-o1-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model’s performance over time.

图 2 展示了 DeepSeek-R1-Zero 在 2024 年 AIME 基准测试中,整个强化学习训练过程中的性能轨迹。如图所示,随着强化学习训练的推进,DeepSeek-R1-Zero 的性能呈现出稳定且持续的提升。值得注意的是,在 AIME 2024 上的平均 pass@1 分数显著提高,从最初的 15.6% 跃升至令人瞩目的 71.0%,达到了与 OpenAI-o1-0912 相当的水平。这一显著的进步突显了我们的强化学习算法在优化模型性能方面随时间推移的有效性。

Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI’s o1-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model’s ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero’s performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks.

表 2 对 DeepSeek-R1-Zero 和 OpenAI 的 o1-0912 模型在多种推理相关基准测试中的表现进行了比较分析。研究结果表明,强化学习使 DeepSeek-R1-Zero 能够在无需任何监督微调数据的情况下获得强大的推理能力。这是一个值得注意的成就,因为它突显了该模型仅通过强化学习就能有效学习和泛化的能力。此外,DeepSeek-R1-Zero 的性能还可以通过应用多数投票进一步提升。例如,在 AIME 基准测试中使用多数投票时,DeepSeek-R1-Zero 的性能从 71.0% 提升至 86.7%,从而超过了 OpenAI-o1-0912 的表现。DeepSeek-R1-Zero 无论是否使用多数投票都能取得如此具有竞争力的性能,这凸显了其强大的基础能力以及在推理任务中进一步发展的潜力。

Figure 3:The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.图 3:在强化学习过程中,DeepSeek-R1-Zero 在训练集上的平均响应长度。DeepSeek-R1-Zero 自然地学会了用更多的思考时间来解决推理任务。

Self-evolution Process of DeepSeek-R1-Zero自我进化过程

The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously. By initiating RL directly from the base model, we can closely monitor the model’s progression without the influence of the supervised fine-tuning stage. This approach provides a clear view of how the model evolves over time, particularly in terms of its ability to handle complex reasoning tasks.

As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.

One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.

DeepSeek-R1-Zero 的自我进化过程是强化学习(RL)如何驱动模型自主提升推理能力的一个引人入胜的展示。通过直接从基础模型开始强化学习,我们能够密切观察模型的发展进程,不受监督微调阶段的影响。这种方法清晰地展现了模型随时间推移的演变情况,尤其是在处理复杂推理任务方面的能力。

如图 3 所示,DeepSeek-R1-Zero 的思考时间在整个训练过程中持续改善。这种改善并非源于外部调整,而是模型内部的自然发展。DeepSeek-R1-Zero 通过利用延长的测试时间计算,自然获得了解决越来越复杂推理任务的能力。这种计算从生成数百个推理标记到数千个,使模型能够更深入地探索和优化其思维过程。

随着测试时间计算的增加,这种自我进化过程中最引人注目的方面之一是复杂行为的出现。诸如反思(即模型重新审视和重新评估其先前步骤)以及探索解决问题的替代方法等行为会自发产生。这些行为并非明确编程设定,而是模型与强化学习环境互动的结果。这种自发的发展显著增强了 DeepSeek-R1-Zero 的推理能力,使其能够更高效、更准确地应对更具挑战性的任务。

Aha Moment of DeepSeek-R1-Zero顿悟时刻

A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.

在训练 DeepSeek-R1-Zero 的过程中观察到的一个特别引人入胜的现象是“顿悟时刻”的出现。如表 3 所示,这一时刻发生在模型的一个中间版本中。在此阶段,DeepSeek-R1-Zero 学会通过重新评估其初始方法来为问题分配更多思考时间。这种行为不仅证明了模型推理能力的不断增强,也是强化学习能够带来意想不到的复杂结果的一个迷人例证。

这一时刻不仅是模型的“顿悟时刻”,也是观察其行为的研究人员的“顿悟时刻”。它突显了强化学习的力量和魅力:我们并非明确地教导模型如何解决问题,而是仅仅为其提供正确的激励,它便能自主开发出先进的问题解决策略。“顿悟时刻”有力地提醒我们,强化学习具有在人工系统中解锁新层次智能的潜力,为未来更自主、更适应性的模型铺平了道路。

Drawback of DeepSeek-R1-Zero缺陷

Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.

尽管 DeepSeek-R1-Zero 展现出强大的推理能力,并能自主开发出意想不到且强大的推理行为,但它也面临一些问题。例如,DeepSeek-R1-Zero 在诸如可读性差和语言混杂等方面存在挑战。为了使推理过程更易于理解,并与开放社区分享,我们探索了 DeepSeek-R1,这是一种利用强化学习和对人类友好的冷启动数据的方法。

Table 3:An interesting “aha moment” of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.表 3:DeepSeek-R1-Zero 中间版本的一个有趣“顿悟时刻”。该模型学会了以拟人化的口吻进行反思。这对我们来说也是一个顿悟时刻,让我们见证了强化学习的力量与美妙。

2.3 DeepSeek-R1: Reinforcement Learning with Cold Start冷启动强化学习

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The pipeline consists of four stages, outlined as follows.

受 DeepSeek-R1-Zero 令人鼓舞的结果启发,两个自然的问题随之产生:1)通过引入少量高质量数据作为冷启动,能否进一步提高推理性能或加快收敛速度?2)如何训练一个用户友好型模型,使其不仅能生成清晰连贯的思维链(CoT),还能展现出强大的通用能力?为了解决这些问题,我们设计了一个用于训练 DeepSeek-R1 的流程。该流程由四个阶段组成,概述如下。

2.3.1 Cold Start冷启动

Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.

与 DeepSeek-R1-Zero 不同,为防止基于基础模型的强化学习训练早期出现不稳定冷启动阶段,对于 DeepSeek-R1,我们构建并收集了一小部分长 CoT 数据来微调模型作为初始的强化学习执行者。为了收集此类数据,我们探索了几种方法:使用长 CoT 作为示例进行少样本提示,直接提示模型生成包含反思和验证的详细答案,收集 DeepSeek-R1-Zero 输出并以可读格式呈现,以及通过人工标注员的后期处理来优化结果。

In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data include:

• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results.

• Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance against DeepSeek-R1-Zero. We believe the iterative training is a better way for reasoning models.

在本研究中,我们收集了数千条冷启动数据来微调 DeepSeek-V3-Base 作为强化学习的起点。与 DeepSeek-R1-Zero 相比,冷启动数据的优势包括:

• 可读性:DeepSeek-R1-Zero 的一个关键局限在于其内容往往不适合阅读:回答可能混杂多种语言,或缺乏用于为用户突出答案的 Markdown 格式。相比之下,在为 DeepSeek-R1 创建冷启动数据时,我们设计了一种可读模式,在每个回答的末尾附上总结,并过滤掉不便于阅读的回答。在此,我们将输出格式定义为 |special_token|<reasoning_process>|special_token|<summary>,其中推理过程是针对查询的链式思维(CoT),而总结用于概括推理结果(格式的拼接与解析示意见本小节末尾)。

• 潜力:通过利用人类先验知识精心设计冷启动数据的模式,我们观察到其性能优于 DeepSeek-R1-Zero。我们认为迭代训练是推理模型的更好方式。
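针对上文“可读性”一条中定义的输出格式,下面给出其拼接与解析的极简示意(其中分隔符 |special_token| 的具体取值论文未给出,此处仅作占位假设):

```python
SPECIAL_TOKEN = "|special_token|"   # 实际使用的特殊分隔符为假设的占位符

def format_cold_start(reasoning_process: str, summary: str) -> str:
    """按 |special_token|<reasoning_process>|special_token|<summary> 拼接冷启动样本输出。"""
    return f"{SPECIAL_TOKEN}{reasoning_process}{SPECIAL_TOKEN}{summary}"

def parse_cold_start(text: str) -> tuple[str, str]:
    """反向解析:取出推理过程(CoT)与末尾的总结。"""
    _, reasoning_process, summary = text.split(SPECIAL_TOKEN, 2)
    return reasoning_process, summary
```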


2.3.2 Reasoning-oriented Reinforcement Learning以推理为导向的强化学习

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions. During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable. Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward. We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks.

在冷启动数据上对 DeepSeek-V3-Base 进行微调后,我们采用与 DeepSeek-R1-Zero 相同的大规模强化学习训练流程。此阶段重点在于提升模型的推理能力,尤其是在编码、数学、科学和逻辑推理等推理密集型任务中,这些任务涉及具有明确解决方案的定义清晰的问题。在训练过程中,我们注意到 CoT 经常出现语言混杂现象,特别是在强化学习提示涉及多种语言时。为缓解语言混杂问题,我们在强化学习训练中引入语言一致性奖励,其计算方式为 CoT 中目标语言单词的比例。尽管消融实验表明这种对齐会导致模型性能略有下降,但该奖励符合人类偏好,使其更具可读性。最后,我们将推理任务的准确性与语言一致性奖励直接相加,形成最终奖励。然后,我们对微调后的模型进行强化学习训练,直至其在推理任务上达到收敛。
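上述语言一致性奖励(CoT 中目标语言词所占比例)及其与准确率奖励直接相加的组合方式,可以用下面的 Python 片段粗略示意(分词与语言判断的启发式规则为笔者假设,并非官方实现):

```python
import re

def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """CoT 中属于目标语言的“词”所占比例;分词规则为简化假设。"""
    tokens = re.findall(r"[A-Za-z]+|[\u4e00-\u9fff]", cot)   # 英文单词或单个汉字
    if not tokens:
        return 0.0
    if target_lang == "en":
        hits = sum(t.isascii() for t in tokens)
    else:                                                    # 例如 target_lang == "zh"
        hits = sum(not t.isascii() for t in tokens)
    return hits / len(tokens)

def final_reward(accuracy: float, cot: str, target_lang: str = "en") -> float:
    """最终奖励 = 推理任务准确率奖励 + 语言一致性奖励(直接相加)。"""
    return accuracy + language_consistency_reward(cot, target_lang)
```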

2.3.3 Rejection Sampling and Supervised Fine-Tuning拒绝采样与监督微调

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below.

当以推理为导向的强化学习收敛时,我们利用所得的检查点来收集后续轮次的监督微调(SFT)数据。与主要侧重于推理的初始冷启动数据不同,此阶段纳入了来自其他领域的数据,以增强模型在写作、角色扮演和其他通用任务方面的能力。具体而言,我们按照以下方式生成数据并微调模型。

Reasoning data推理数据

We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chain-of-thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning related training samples.

我们整理推理提示,并通过从上述强化学习训练的检查点中进行拒绝采样来生成推理轨迹。在前一阶段,我们仅包含能够使用基于规则的奖励进行评估的数据。然而,在此阶段,我们通过纳入使用生成奖励模型的数据来扩展数据集,该模型将真实值和模型预测输入 DeepSeek-V3 进行判断。此外,由于模型输出有时混乱且难以阅读,我们过滤掉了包含混合语言、长段落和代码块的链式思维。对于每个提示,我们采样多个响应,并仅保留正确的响应。总的来说,我们收集了约 60 万条与推理相关的训练样本。
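下面用几行 Python 粗略示意这一拒绝采样流程:对每个推理提示采样多个回答,仅保留被判定为正确且可读的回答。其中 sample_fn、judge_fn、n_samples 等名称与取值均为示意性假设:

```python
def collect_reasoning_sft_data(prompts, sample_fn, judge_fn, n_samples=16):
    """拒绝采样构造推理 SFT 数据:
    sample_fn(prompt) -> str        从 RL 检查点采样一个回答(假设的接口)
    judge_fn(prompt, resp) -> bool  基于规则或 DeepSeek-V3 判定回答是否正确且可读(假设的接口)
    """
    dataset = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = sample_fn(prompt)
            if judge_fn(prompt, response):   # 只保留正确、无语言混杂或超长段落的回答
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```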

Non-Reasoning data非推理数据

For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting. However, for simpler queries, such as “hello” we do not provide a CoT in response. In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning.

对于非推理数据,例如写作、事实问答、自我认知和翻译,我们采用 DeepSeek-V3 管道,并复用 DeepSeek-V3 的 SFT 数据集的部分内容。对于某些非推理任务,我们通过提示调用 DeepSeek-V3 在回答问题前生成潜在的思维链。然而,对于更简单的查询,例如“你好”,我们不会提供思维链作为回应。最终,我们收集了约 20 万条与推理无关的训练样本。

We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples.

我们使用上述约 80 万条样本的精选数据集对 DeepSeek-V3-Base 进行了两轮微调。

2.3.4 Reinforcement Learning for all Scenarios适用于所有场景的强化学习

To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.

为了进一步使模型与人类偏好保持一致,我们实施了一个次级强化学习阶段,旨在提升模型的有用性和无害性,同时优化其推理能力。具体而言,我们使用奖励信号和多样化的提示分布来训练模型。对于推理数据,我们遵循 DeepSeek-R1-Zero 中概述的方法,该方法利用基于规则的奖励来引导数学、代码和逻辑推理领域的学习过程。对于通用数据,我们采用奖励模型来捕捉复杂和微妙场景中的人类偏好。我们基于 DeepSeek-V3 管道,并采用类似的偏好对和训练提示分布。对于有用性,我们仅关注最终总结,确保评估强调响应对用户的实用性和相关性,同时尽量减少对底层推理过程的干扰。对于无害性,我们评估模型的整个响应,包括推理过程和总结,以识别并减轻生成过程中可能出现的任何潜在风险、偏见或有害内容。最终,奖励信号与多样化的数据分布的整合使我们能够训练出一个在推理方面表现出色,同时优先考虑有益性和无害性的模型。
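该阶段把规则奖励与奖励模型结合使用,其组合方式可用如下极简示意表达。其中 helpfulness_rm、harmlessness_rm 为假设的奖励模型接口,样本字段名与直接相加的组合方式也仅为示意,论文未给出实现细节:

```python
def scenario_reward(sample,
                    rule_reward,       # 可调用对象: (response, answer) -> float, 规则奖励
                    helpfulness_rm,    # 可调用对象: (prompt, summary) -> float, 假设的有用性奖励模型
                    harmlessness_rm):  # 可调用对象: (prompt, response) -> float, 假设的无害性奖励模型
    """推理类提示沿用规则奖励;通用提示使用奖励模型:
    有用性只看最终总结,无害性则评估包含推理过程在内的完整回答。"""
    if sample["type"] == "reasoning":
        return rule_reward(sample["response"], sample["answer"])
    helpful = helpfulness_rm(sample["prompt"], sample["summary"])
    harmless = harmlessness_rm(sample["prompt"], sample["response"])
    return helpful + harmless
```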

2.4 Distillation: Empower Small Models with Reasoning Capability知识蒸馏:赋予小型模型推理能力

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its reasoning capability is slightly better than that of Llama-3.1.

For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.

为了让更高效的小型模型具备像 DeepSeek-R1 那样的推理能力,我们直接使用由 DeepSeek-R1 整理的 80 万条样本(详见第 2.3.3 节)对 Qwen(Qwen,2024b)和 Llama(AI@Meta,2024)等开源模型进行微调。我们的研究结果表明,这种直接的知识蒸馏方法显著提升了小型模型的推理能力。这里我们使用的基础模型包括 Qwen2.5-Math-1.5B、Qwen2.5-Math-7B、Qwen2.5-14B、Qwen2.5-32B、Llama-3.1-8B 和 Llama-3.3-70B-Instruct。我们选择 Llama-3.3 是因为其推理能力略优于 Llama-3.1。

对于蒸馏模型,我们仅应用 SFT,而不包含 RL 阶段,尽管加入 RL 可能会大幅提高模型性能。我们在此的主要目标是展示知识蒸馏技术的有效性,将 RL 阶段的探索留给更广泛的研究社区。
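下面给出蒸馏阶段(仅 SFT、两个 epoch)的一个极简训练循环示意。假设 model 采用 HuggingFace 风格接口(前向传入 labels 时返回 .loss),batch_size、学习率等取值均为假设:

```python
import torch
from torch.utils.data import DataLoader

def distill_with_sft(model, dataset, collate_fn, epochs=2, lr=1e-5, device="cuda"):
    """对由 DeepSeek-R1 生成的约 80 万条样本只做监督微调,不包含 RL 阶段。"""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
    for _ in range(epochs):                        # 论文中微调两个 epoch
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss             # 标准的下一个词预测(SFT)损失
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```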

3 Experiment实验

本节介绍了实验设置,包括使用的基准测试集(MMLU, MATH-500, Codeforces等多种类型的任务)、评估指标 (Pass@1, cons@64, Percentile等)、基线模型 (DeepSeek-V3, OpenAI-o1系列等)以及评估方法。 实验结果表明,DeepSeek-R1在多个基准测试中取得了优异的成绩,与OpenAI-o1-1217的性能相当,甚至在某些任务上超越了它。 蒸馏后的模型也取得了令人印象深刻的结果,在某些基准测试中甚至超过了部分大型开源模型。

实验部分提供了充分的证据来支持论文的结论:实验设计比较全面,涵盖了多种类型的任务和评估指标;实验结果令人信服,证明了DeepSeek-R1模型的有效性;多种基准测试的对比使结果更具说服力;对蒸馏模型的评估也显示了该方法的实用价值。

Benchmarks基准测试

We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider, LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench.

我们在 MMLU(Hendrycks 等人,2020 年)、MMLU-Redux(Gema 等人,2024 年)、MMLU-Pro(Wang 等人,2024 年)、C-Eval(Huang 等人,2023 年)、CMMLU(Li 等人,2023 年)、IFEval(Zhou 等人,2023 年)、FRAMES(Krishna 等人,2024 年)、GPQA Diamond(Rein 等人,2023 年)、SimpleQA(OpenAI,2024c)、C-SimpleQA(He 等人,2024 年)、SWE-Bench Verified(OpenAI,2024d)、Aider、LiveCodeBench(Jain 等人,2024 年)(2024 年 8 月 - 2025 年 1 月)、Codeforces、中国高中数学奥林匹克竞赛(CNMO 2024)以及 2024 年美国数学邀请赛(AIME 2024)(MAA,2024 年)上对模型进行评估。除了标准基准测试外,我们还使用 LLM 作为评判者对模型在开放式生成任务上的表现进行评估。具体而言,我们遵循 AlpacaEval 2.0(Dubois 等人,2024 年)和 Arena-Hard(Li 等人,2024 年)的原始配置,利用 GPT-4-Turbo-1106 进行两两比较。在此,我们仅将最终总结提供给评判以避免长度偏差。对于精简模型,我们报告在 AIME 2024、MATH-500、GPQA Diamond、Codeforces 和 LiveCodeBench 上的代表性结果。

Evaluation Prompts评估提示

Following the setup in DeepSeek-V3, standard benchmarks such as MMLU, DROP, GPQA Diamond, and SimpleQA are evaluated using prompts from the simple-evals framework. For MMLU-Redux, we adopt the Zero-Eval prompt format (Lin, 2024) in a zero-shot setting. In terms of MMLU-Pro, C-Eval and CLUE-WSC, since the original prompts are few-shot, we slightly modify the prompt to the zero-shot setting. The CoT in few-shot may hurt the performance of DeepSeek-R1. Other datasets follow their original evaluation protocols with default prompts provided by their creators. For code and math benchmarks, the HumanEval-Mul dataset covers eight mainstream programming languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash). Model performance on LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January 2025. The Codeforces dataset is evaluated using problems from 10 Div.2 contests along with expert-crafted test cases, after which the expected ratings and percentages of competitors are calculated. SWE-Bench verified results are obtained via the agentless framework (Xia et al., 2024). AIDER-related benchmarks are measured using a "diff" format. DeepSeek-R1 outputs are capped at a maximum of 32,768 tokens for each benchmark.

在 DeepSeek-V3 的设置下,使用 simple-evals 框架中的提示对 MMLU、DROP、GPQA Diamond 和 SimpleQA 等标准基准进行评估。对于 MMLU-Redux,我们在零样本设置中采用 Zero-Eval 提示格式(Lin,2024)。对于 MMLU-Pro、C-Eval 和 CLUE-WSC,由于原始提示是少样本的,我们将其稍作修改以适应零样本设置。少样本中的 CoT 可能会损害 DeepSeek-R1 的性能。其他数据集遵循其创建者提供的原始评估协议和默认提示。对于代码和数学基准,HumanEval-Mul 数据集涵盖了八种主流编程语言(Python、Java、C++、C#、JavaScript、TypeScript、PHP 和 Bash)。LiveCodeBench 上的模型性能评估使用 CoT 格式,数据收集时间为 2024 年 8 月至 2025 年 1 月。Codeforces 数据集使用 10 场 Div.2 比赛中的问题以及专家设计的测试用例进行评估,之后计算预期评级和参赛者的百分比。SWE-Bench 的验证结果通过无代理框架(Xia 等人,2024)获得。与 AIDER 相关的基准测试采用“差异”格式进行衡量。DeepSeek-R1 的输出结果在每个基准测试中均被限制在最多 32,768 个标记。

Baselines基准线

We conduct comprehensive evaluations against several strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare the open-source model QwQ-32B-Preview (Qwen, 2024a).

我们针对多个强大的基准模型进行了全面评估,包括 DeepSeek-V3、Claude-Sonnet-3.5-1022、GPT-4o-0513、OpenAI-o1-mini 和 OpenAI-o1-1217。由于在中国大陆访问 OpenAI-o1-1217 的 API 存在困难,我们基于官方报告来呈现其性能。对于精简模型,我们还与开源模型 QwQ-32B-Preview(Qwen,2024a)进行了比较。

Evaluation Setup评估设置

We set the maximum generation length to 32,768 tokens for the models. We found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints. Therefore, we default to pass@k evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature. Specifically, we use a sampling temperature of 0.6 and a top-p value of 0.95 to generate k responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as

$$\text{pass@1} = \frac{1}{k}\sum_{i=1}^{k} p_i,$$

where $p_i$ denotes the correctness of the $i$-th response. This method provides more reliable performance estimates. For AIME 2024, we also report consensus (majority vote) results (Wang et al., 2022) using 64 samples, denoted as cons@64.

我们将模型的最大生成长度设置为 32,768 个标记。我们发现,使用贪婪解码来评估长输出推理模型会导致更高的重复率,并且在不同检查点之间存在显著的差异性。因此,我们默认采用 pass@k 评估(Chen 等人,2021 年),并使用非零温度报告 pass@1。具体来说,我们使用 0.6 的采样温度和 0.95 的 top-p 值为每个问题生成 k 个响应(通常在 4 到 64 之间,取决于测试集的大小)。然后,pass@1 计算公式为

$$\text{pass@1} = \frac{1}{k}\sum_{i=1}^{k} p_i,$$

其中 $p_i$ 表示第 $i$ 个响应的正确性。这种方法能提供更可靠的性能估计。对于 AIME 2024,我们还报告使用 64 个样本的共识(多数投票)结果(Wang 等人,2022 年),记为 cons@64。
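按上式,pass@1 与 cons@64 的计算可以写成如下几行 Python(仅为示意,答案归一化与判分过程从略):

```python
from collections import Counter

def pass_at_1(correct_flags):
    """correct_flags: 同一问题 k 个采样回答的正误标记(0/1),pass@1 为其均值。"""
    return sum(correct_flags) / len(correct_flags)

def cons_at_k(final_answers, ground_truth):
    """cons@k(多数投票):对 k 个采样得到的最终答案取众数,再与标准答案比对。"""
    majority_answer, _ = Counter(final_answers).most_common(1)[0]
    return 1.0 if majority_answer == ground_truth else 0.0
```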

3.1 DeepSeek-R1 Evaluation评估

Table 4: Comparison between DeepSeek-R1 and other representative models.表 4:DeepSeek-R1 与其他代表性模型的比较。

For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are achieved through large-scale reinforcement learning. Additionally, DeepSeek-R1 excels on FRAMES, a long-context-dependent QA task, showcasing its strong document analysis capabilities. This highlights the potential of reasoning models in AI-driven search and data analysis tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. However, DeepSeek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an accuracy of over 70%.

对于以教育为导向的知识基准,如 MMLU、MMLU-Pro 和 GPQA Diamond,DeepSeek-R1 的表现优于 DeepSeek-V3。这种改进主要归因于在 STEM 相关问题上的准确性提高,通过大规模强化学习取得了显著进步。此外,DeepSeek-R1 在 FRAMES(一个依赖于长上下文的问答任务)上表现出色,展示了其强大的文档分析能力。这突显了推理模型在 AI 驱动的搜索和数据分析任务中的潜力。在事实基准 SimpleQA 上,DeepSeek-R1 的表现优于 DeepSeek-V3,表明其在处理基于事实的查询方面的能力。类似的趋势也出现在 OpenAI-o1 在此基准上超越 GPT-4o 的情况。然而,在中文 SimpleQA 基准上,DeepSeek-R1 的表现不如 DeepSeek-V3,主要是由于其在安全强化学习后倾向于拒绝回答某些查询。如果没有安全强化学习,DeepSeek-R1 的准确率可以超过 70%。

DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a model’s ability to follow format instructions. These improvements can be linked to the inclusion of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL training. Furthermore, remarkable performance is observed on AlpacaEval2.0 and ArenaHard, indicating DeepSeek-R1’s strengths in writing tasks and open-domain question answering. Its significant outperformance of DeepSeek-V3 underscores the generalization benefits of large-scale RL, which not only boosts reasoning capabilities but also improves performance across diverse domains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an average of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying its robustness across multiple tasks.

On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217, surpassing other models by a large margin. A similar trend is observed on coding algorithm tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1 on Aider but achieves comparable performance on SWE Verified. We believe the engineering performance of DeepSeek-R1 will improve in the next version, as the amount of related RL training data currently remains very limited.

DeepSeek-R1 在 IF-Eval 上也取得了令人瞩目的成绩,IF-Eval 是一个用于评估模型遵循格式指令能力的基准测试。这些改进与在监督微调(SFT)和强化学习训练的最后阶段纳入遵循指令的数据有关。此外,在 AlpacaEval2.0 和 ArenaHard 上也表现出色,这表明 DeepSeek-R1 在写作任务和开放领域问答方面具有优势。它对 DeepSeek-V3 的显著超越突显了大规模强化学习的泛化优势,这不仅增强了推理能力,还提升了在不同领域的表现。此外,DeepSeek-R1 生成的摘要长度简洁,在 ArenaHard 上平均为 689 个标记,在 AlpacaEval 2.0 上平均为 2218 个字符。这表明 DeepSeek-R1 在基于 GPT 的评估中避免了引入长度偏差,进一步巩固了其在多项任务中的稳健性。

在数学任务方面,DeepSeek-R1 的表现与 OpenAI-o1-1217 相当,大幅超越了其他模型。在诸如 LiveCodeBench 和 Codeforces 这类编码算法任务中也观察到了类似的趋势,以推理为重点的模型在这些基准测试中占据主导地位。在面向工程的编码任务中,OpenAI-o1-1217 在 Aider 上的表现优于 DeepSeek-R1,但在 SWE Verified 上两者表现相当。我们认为 DeepSeek-R1 的工程性能在下一版本中会有所提升,因为目前相关的强化学习训练数据量仍然非常有限。

3.2 Distilled Model Evaluation蒸馏模型评估

Table 5:Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks.表 5:DeepSeek-R1 蒸馏模型与其他可比模型在推理相关基准上的比较。

As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.

如表 5 所示,仅对 DeepSeek-R1 的输出进行精炼,就能使高效的 DeepSeek-R1-7B(即 DeepSeek-R1-Distill-Qwen-7B,以下简写类似)在所有基准上都优于 GPT-4o-0513 等非推理模型。DeepSeek-R1-14B 在所有评估指标上都超过了 QwQ-32B-Preview,而 DeepSeek-R1-32B 和 DeepSeek-R1-70B 在大多数基准上都显著优于 o1-mini。这些结果表明了精炼的强大潜力。此外,我们发现对这些蒸馏模型应用强化学习能带来显著的进一步提升。我们认为这值得进一步探索,因此这里仅展示简单 SFT 蒸馏模型的结果。
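上文所述的蒸馏本质上就是用教师模型(DeepSeek-R1)生成的推理样本,对小型稠密模型做标准的监督微调。下面给出这一流程的最简化示意(模型名称、样本格式与超参数均为示意性假设,并非论文的实际配置;为简洁起见对提示部分也计算损失,实际实现中通常会对提示 token 做掩码):

```python
# 极简示意:SFT 式蒸馏——用教师模型生成的 (提示, 推理轨迹+答案) 样本微调小型稠密模型。
# 模型名、数据与超参数均为示意性假设。
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "Qwen/Qwen2.5-Math-7B"      # 假设的学生模型检查点
tokenizer = AutoTokenizer.from_pretrained(STUDENT)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(STUDENT, torch_dtype=torch.bfloat16)
model.train()

# 教师生成的样本(占位数据):提示 + 完整推理过程与最终答案
samples = [
    {"prompt": "Solve: 2x + 3 = 11.",
     "response": "<think>Subtract 3 from both sides, then divide by 2, so x = 4.</think>\nThe answer is x = 4."},
]

def collate(batch):
    texts = [s["prompt"] + "\n" + s["response"] for s in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=2048)
    enc["labels"] = enc["input_ids"].clone()   # 标准因果语言模型目标
    return enc

loader = DataLoader(samples, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(2):                 # 训练轮数仅为示意
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```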

4 Discussion讨论

本节讨论了知识蒸馏与强化学习的比较,以及一些失败尝试(过程奖励模型PRM和蒙特卡洛树搜索MCTS)。实验结果表明,知识蒸馏比在小型模型上直接应用强化学习更有效;PRM和MCTS由于存在一些局限性,未能取得理想的效果。

本节对实验结果进行了深入分析。对知识蒸馏和强化学习的比较分析非常重要,能够帮助读者更好地理解两种方法的优缺点;对失败尝试的分析也具有参考价值,能够为未来的研究提供借鉴。

4.1 Distillation vs. Reinforcement Learning蒸馏与强化学习

Table 6: Comparison of Distilled and RL Models on Reasoning-Related Benchmarks.表 6:蒸馏模型与强化学习模型在推理相关基准测试上的比较。

In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive results. However, there is still one question left: can the model achieve comparable performance through the large-scale RL training discussed in the paper without distillation?

To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math, code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The experimental results, shown in Table 6, demonstrate that the 32B base model, after large-scale RL training, achieves performance on par with QwQ-32B-Preview. However, DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks.

Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.

在 3.2 节中,我们可以看到通过蒸馏 DeepSeek-R1,小型模型能够取得令人瞩目的成果。然而,还有一个问题尚未解决:在不进行蒸馏的情况下,通过本文所讨论的大规模强化学习训练,模型能否达到可比的性能?

为了解答这个问题,我们在 Qwen-32B-Base 上使用数学、代码和 STEM 数据进行了大规模强化学习训练,训练步数超过 10K 步,从而得到了 DeepSeek-R1-Zero-Qwen-32B。实验结果如表 6 所示,表明经过大规模强化学习训练的 32B 基础模型,其性能与 QwQ-32B-Preview 相当。然而,从 DeepSeek-R1 蒸馏而来的 DeepSeek-R1-Distill-Qwen-32B 在所有基准测试中都显著优于 DeepSeek-R1-Zero-Qwen-32B。

因此,我们可以得出两个结论:首先,将更强大的模型蒸馏到较小的模型中能够取得出色的结果,而较小的模型依靠本文所提及的大规模强化学习可能需要巨大的计算能力,甚至可能无法达到蒸馏的效果。其次,尽管蒸馏策略既经济又有效,但要突破智能的界限,可能仍需要更强大的基础模型和更大规模的强化学习。
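上述在 Qwen-32B-Base 上的大规模强化学习采用的正是前文提到的 GRPO:对每个提示采样一组回答,用组内奖励的均值和标准差做归一化,从而免去独立的价值(critic)模型。下面是该优势(advantage)计算的一个极简示意(奖励数值为占位假设):

```python
# 极简示意:GRPO 的组相对优势计算——同一提示采样 G 个回答,
# 各自的规则化奖励在组内做均值/标准差归一化,替代独立的价值模型。
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: 同一提示下 G 个回答的标量奖励;返回各回答的组相对优势。"""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# 假设性用例:1.0 表示答案正确且格式合规,0.0 表示答错或格式不合规
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
print(group_relative_advantages(rewards))
# 正确回答得到正优势、错误回答得到负优势;
# 在完整目标函数中,该优势会加权对应回答中每个 token 的裁剪策略梯度项。
```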

4.2 Unsuccessful Attempts失败的尝试

In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights, but this does not imply that these approaches are incapable of developing effective reasoning models.

在开发 DeepSeek-R1 的早期阶段,我们也遭遇了失败和挫折。在此分享我们的失败经历以供参考,但这并不意味着这些方法无法开发出有效的推理模型。

Process Reward Model (PRM)过程奖励模型

PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Uesato et al., 2022; Lightman et al., 2023; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grained step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model requires additional training resources and complicates the whole training pipeline. In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

PRM 是一种合理的方法,能够引导模型采用更好的方法来解决推理任务(Uesato 等人,2022 年;Lightman 等人,2023 年;Wang 等人,2023 年)。然而,在实际应用中,PRM 存在三个主要局限性,可能会阻碍其最终成功。首先,明确界定一般推理中的细粒度步骤颇具挑战性。其次,确定当前的中间步骤是否正确是一项艰巨的任务。使用模型进行自动标注可能无法取得令人满意的结果,而人工标注又不利于大规模推广。第三,一旦引入基于模型的 PRM,就不可避免地会导致奖励作弊(Gao 等人,2022 年),重新训练奖励模型需要额外的训练资源,并且会使整个训练流程变得复杂。总之,尽管 PRM 在对模型生成的前 N 个响应进行重新排序或辅助引导搜索方面表现出良好的能力(Snell 等人,2024 年),但在我们的实验中,与在大规模强化学习过程中引入的额外计算开销相比,其优势有限。
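如上所述,PRM 目前比较划算的用法是对模型生成的前 N 个候选解重排序。下面给出这一用法的极简示意(其中 prm_score_step 是假设的步骤级打分接口;聚合方式取各步最小分,也可换成乘积或平均等其他常见选择):

```python
# 极简示意:用过程奖励模型(PRM)对 best-of-N 候选解重排序——
# 把每个候选解拆成步骤,用(假设的)PRM 给每步打分,按聚合分数取最高者。
from typing import Callable, List

def rerank_with_prm(
    candidates: List[str],
    prm_score_step: Callable[[str, str], float],  # (题目, 单步) -> [0,1] 分数;假设的接口
    problem: str,
) -> str:
    def aggregate(candidate: str) -> float:
        steps = [s.strip() for s in candidate.split("\n") if s.strip()]
        # 最小值聚合:推理链的强度取决于最薄弱的一步
        return min(prm_score_step(problem, step) for step in steps)
    return max(candidates, key=aggregate)

# 假设性用例:用一个偏好含等式步骤的哑打分器演示
dummy_scorer = lambda problem, step: 0.9 if "=" in step else 0.4
best = rerank_with_prm(
    candidates=[
        "Subtract 3: 2x = 8\nDivide by 2: x = 4",
        "Guess that x is probably 5",
    ],
    prm_score_step=dummy_scorer,
    problem="Solve 2x + 3 = 11.",
)
print(best)  # 输出分步正确的候选解
```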

Monte Carlo Tree Search (MCTS)蒙特卡罗树搜索

Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability. This approach involves breaking answers into smaller parts to allow the model to explore the solution space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search. For training, we first use collected prompts to find answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.

However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo’s core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation.

受 AlphaGo(Silver 等人,2017b)和 AlphaZero(Silver 等人,2017a)的启发,我们探索了使用蒙特卡罗树搜索(MCTS)来增强测试时计算的可扩展性。这种方法涉及将答案分解成更小的部分,以使模型能够系统地探索解决方案空间。为了实现这一点,我们提示模型生成多个标签,这些标签对应于搜索所需的特定推理步骤。在训练过程中,我们首先使用收集到的提示通过由预训练的价值模型引导的 MCTS 来寻找答案。随后,我们使用生成的问题 - 答案对来训练行为模型和价值模型,不断优化这一过程。

然而,当扩大训练规模时,这种方法会遇到几个挑战。首先,与国际象棋不同,国际象棋的搜索空间相对明确,而标记生成的搜索空间呈指数级增长。为了解决这个问题,我们为每个节点设置了最大扩展限制,但这可能导致模型陷入局部最优。其次,价值模型直接影响生成的质量,因为它引导着搜索过程的每一步。训练一个细粒度的价值模型本身就很困难,这使得模型难以逐步改进。虽然 AlphaGo 的核心成功在于训练价值模型以逐步提升其表现,但由于标记生成的复杂性,这一原则在我们的设置中难以复制。

In conclusion, while MCTS can improve performance during inference when paired with a pre-trained value model, iteratively boosting model performance through self-search remains a significant challenge.

总之,虽然蒙特卡罗树搜索(MCTS)在与预训练的价值模型结合使用时可以提高推理期间的表现,但通过自我搜索来逐步提升模型性能仍然是一个重大挑战。
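下面用一个极简的代码骨架示意上述思路:每个节点对应一段部分推理链,节点的扩展数量受上限约束(即文中的"最大扩展限制"),叶子由价值模型直接打分而非完整 rollout。其中 propose_steps 与 value_model 均为假设的接口,并非论文的实际实现:

```python
# 极简示意:在推理步骤上做 MCTS——选择(UCB)、受限扩展、价值模型评估、回传。
# propose_steps(部分推理链) -> 下一步文本或 None;value_model(部分推理链) -> 标量分数。
import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # 到目前为止的推理步骤列表
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value_sum / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def mcts(root_state, propose_steps, value_model, n_simulations=100, max_children=4):
    root = Node(root_state)
    for _ in range(n_simulations):
        # 选择:沿 UCB 最大的子节点下降,直到遇到尚未扩展满的节点
        node = root
        while node.children and len(node.children) >= max_children:
            node = max(node.children, key=Node.ucb)
        # 扩展:限制每个节点的最大扩展数,以控制 token 生成带来的指数级搜索空间
        if len(node.children) < max_children:
            step = propose_steps(node.state)
            if step is not None:
                child = Node(node.state + [step], parent=node)
                node.children.append(child)
                node = child
        # 评估:用价值模型打分代替完整 rollout
        value = value_model(node.state)
        # 回传
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # 返回访问次数最多的首步所对应的推理链
    return max(root.children, key=lambda n: n.visits).state if root.children else root_state
```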

5 Conclusion, Limitations, and Future Work结论、局限性与未来工作

总结了DeepSeek-R1和DeepSeek-R1-Zero模型的优势和局限性。 DeepSeek-R1在推理任务上达到了与OpenAI-o1-1217相当的性能,知识蒸馏技术也取得了成功。 但模型在通用能力、语言混合、提示工程和软件工程任务方面仍存在不足。 未来工作将集中在这些方面进行改进。

结论部分对全文进行了总结,并指出了未来的研究方向;对模型局限性的分析较为客观,为后续改进指明了方向,所提出的研究方向也具有针对性和较高的研究价值。

In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

We further explore distilling the reasoning capability to small dense models. We use DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH. Other dense models also achieve impressive results, significantly outperforming other instruction-tuned models based on the same underlying checkpoints.

在这项工作中,我们分享了通过强化学习提升模型推理能力的历程。DeepSeek-R1-Zero 代表了一种纯强化学习方法,无需依赖冷启动数据,在各种任务中均表现出色。DeepSeek-R1 更加强大,它结合了冷启动数据和迭代强化学习微调。最终,DeepSeek-R1 在一系列任务中的表现可与 OpenAI-o1-1217 相媲美。

我们进一步探索将推理能力蒸馏到小型密集模型中。我们使用 DeepSeek-R1 作为教师模型生成 80 万个训练样本,并对几个小型密集模型进行微调。结果令人鼓舞:DeepSeek-R1-Distill-Qwen-1.5B 在数学基准测试中超越了 GPT-4o 和 Claude-3.5-Sonnet,在 AIME 上得分 28.9%,在 MATH 上得分 83.9%。其他密集模型也取得了令人瞩目的成绩,显著优于基于相同底层检查点的其他指令调优模型。

In the future, we plan to invest in research across the following directions for DeepSeek-R1.

• General Capability: Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn conversation, complex role-playing, and JSON output. Moving forward, we plan to explore how long CoT can be leveraged to enhance tasks in these fields.

• Language Mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses, even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates.

• Prompt Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results (a minimal zero-shot example is sketched after this list).

• Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.

未来,我们计划在以下方向对 DeepSeek-R1 进行研究投入。

• 通用能力:目前,DeepSeek-R1 在函数调用、多轮对话、复杂角色扮演和 JSON 输出等任务上的能力不如 DeepSeek-V3。未来,我们计划探索如何利用长链思维(CoT)来提升这些领域的任务表现。

• 语言混合:DeepSeek-R1 目前针对中文和英文进行了优化,这可能导致在处理其他语言的查询时出现语言混合问题。例如,即使查询语言不是英语或中文,DeepSeek-R1 也可能使用英语进行推理和回复。我们将在未来的更新中解决这一限制。

• 提示工程:在评估 DeepSeek-R1 时,我们发现它对提示较为敏感。少样本提示会持续降低其性能。因此,我们建议用户直接描述问题,并在零样本设置中指定输出格式,以获得最佳效果(本列表之后给出了一个零样本提示的极简示意)。

• 软件工程任务:由于评估时间较长,影响了强化学习过程的效率,大规模强化学习在软件工程任务中的应用并不广泛。因此,DeepSeek-R1 在软件工程基准测试方面相较于 DeepSeek-V3 并未展现出巨大的提升。未来版本将通过在软件工程数据上实施拒绝采样或在强化学习过程中引入异步评估来提高效率,从而解决这一问题。
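针对上面"提示工程"一条的建议,这里给出零样本提示的一个极简示意(仅为字符串构造,不绑定任何具体的推理 API;其中要求把最终答案放入 \boxed{} 的写法是数学类评测中常见的格式约定,属于示意性假设):

```python
# 极简示意:按建议的零样本方式构造提示——直接描述问题并指明输出格式,
# 不添加任何少样本示例(few-shot 示例反而会降低 DeepSeek-R1 的表现)。
problem = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

prompt = (
    problem
    + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."
)

# prompt 随后作为单条用户消息发送给模型即可;此处不假设任何具体的 API 调用方式。
print(prompt)
```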
