LLMs之MoE之DeepSeek-V3:《DeepSeek-V3 Technical Report》翻译与解读(DeepSeek-V3的最详细解读)


导读:这篇论文介绍了DeepSeek-V3大型语言模型,其核心目标是构建一个性能强大、训练成本低廉的模型。这是一个拥有671B参数的大型混合专家模型(MoE),其中每个token激活37B参数,训练语料高达14.8T token。论文的核心在于高效且经济地训练出性能强大、推理高效的大型语言模型。该论文提出的训练方法在性能、成本和稳定性方面取得了显著进展,为开源大型语言模型的发展做出了重要贡献。论文中提出的许多技术细节,如FP8训练的优化策略和高效的训练框架设计,都具有很高的参考价值。

>> 背景痛点:

● 大型语言模型训练成本高昂:训练大型语言模型需要巨大的计算资源和时间,导致成本极高。

● 现有MoE模型负载不平衡:传统的MoE模型在训练过程中容易出现专家负载不平衡的问题,导致效率低下甚至训练崩溃(routing collapse)。

● 低精度训练的挑战:低精度训练(如FP8)虽然能提高效率,但容易出现数值不稳定性,尤其是在大型模型中。

● 长文本处理能力不足:许多模型的长文本处理能力有限,难以处理超长上下文。

● 推理效率低:大型语言模型的推理速度通常较慢,难以满足在线服务的实时性要求。

>> 具体的解决方案:

● 高效的模型架构:采用DeepSeek-V2中验证有效的Multi-head Latent Attention (MLA)用于高效推理,以及DeepSeekMoE用于经济高效的训练。

● 辅助损失函数的改进:提出了一种无辅助损失的负载平衡策略(auxiliary-loss-free strategy),避免了传统方法中辅助损失对模型性能的负面影响。

● 多token预测训练目标:采用多token预测(Multi-Token Prediction, MTP)训练目标,提高模型性能,并可用于加速推理。

● FP8混合精度训练框架:设计了一个细粒度的FP8混合精度训练框架,通过tile-wise和block-wise分组量化以及提高累加精度等方法,在保证训练稳定性的前提下,显著提高训练速度并降低内存占用。

● 高效的训练框架:设计了DualPipe算法用于高效的流水线并行,通过计算与通信重叠(computation-communication overlap)来隐藏通信开销。并开发了高效的跨节点全对全通信内核,充分利用InfiniBand和NVLink带宽。

● 长上下文扩展:采用YaRN方法进行上下文长度扩展,将最大上下文长度扩展到128K。

● 知识蒸馏:从DeepSeek-R1系列模型中蒸馏推理能力,提高DeepSeek-V3的推理性能,尤其是在数学和代码方面。

● 强化学习:采用Group Relative Policy Optimization (GRPO)进行强化学习,并结合基于规则和基于模型的两种奖励模型进行微调。

● 高效的部署策略:设计了分离预填充解码阶段的部署策略,并采用冗余专家部署来实现负载平衡,提高推理效率。

>> 核心思路步骤:

● 高效架构设计:选择MLA和DeepSeekMoE架构

● 改进负载平衡:采用无辅助损失的负载平衡策略

● 多token预测:引入MTP训练目标。

● FP8混合精度训练:设计高效的FP8训练框架,包括量化策略和高精度累加。

● 高效训练框架:采用DualPipe算法和高效的跨节点全对全通信内核。

● 长上下文扩展:利用YaRN方法扩展上下文长度。

● 知识蒸馏和强化学习:从DeepSeek-R1蒸馏推理能力,并用GRPO进行强化学习

● 高效部署:设计高效的推理部署策略,包括预填充和解码阶段的分离以及冗余专家部署。

>> 优势:

● 性能强大:在多个基准测试中,DeepSeek-V3的性能优于其他开源模型(超越Qwen2.5-72B、Llama-3.1-405B),并与GPT-4o和Claude-3.5-Sonnet等领先的闭源模型相当,尤其在代码和数学方面表现突出。

● 训练成本低:总训练成本仅为278.8万H800 GPU小时,约合557.6万美元。

● 训练稳定:整个训练过程没有出现不可恢复的损失峰值,也未进行任何回滚。

● 高效推理:采用MLA和高效的部署策略,提高了推理效率。 通过算法和工程上的创新,DeepSeek-V3 的生成吐字速度从 20 TPS 大幅提高至 60 TPS,相比 V2.5 模型实现了 3 倍的提升

>> 结论和观点:

● DeepSeek-V3是目前最强大的开源大型语言模型之一,在多个基准测试中取得了优异的成绩。

● DeepSeek-V3的训练成本低,这归功于高效的算法、框架和硬件协同设计

● 无辅助损失的负载平衡策略和多token预测训练目标有效提高了模型性能。

● FP8混合精度训练框架在大型模型训练中是可行且有效的。

● 从DeepSeek-R1系列模型中蒸馏推理能力是一种有效的提升模型性能的方法。

● DeepSeek-V3在长文本处理和推理方面表现出色。

● 论文也指出了DeepSeek-V3的一些局限性,例如部署单元较大、推理速度仍有提升空间,并提出了未来研究方向,包括模型架构改进、数据质量提升、推理能力增强、评估方法改进等。

总结来说,DeepSeek-V3论文系统地介绍了一个大型混合专家语言模型,该模型在性能与训练成本之间取得了良好的平衡。论文的核心贡献在于:

1) 提出了无辅助损失的负载平衡策略和多token预测训练目标;

2) 设计了高效的FP8混合精度训练框架和DualPipe流水线并行算法;

3) 从DeepSeek-R1模型中成功蒸馏了推理能力;

4) 设计了高效的推理部署策略。

DeepSeek-V3在多个基准测试中取得了领先的成绩,尤其是在数学和代码任务方面。虽然论文也指出了模型的一些局限性,但其提出的各种创新技术和工程优化方法,为大型语言模型的训练和部署提供了宝贵的经验和参考。

目录

相关文章

2024年1月5日,LLMs之DeepSeek-V1:《DeepSeek LLM: Scaling Open-Source Language Models with Longtermism》翻译与解读

2024年1月11日,LLMs之DeepSeek-V1之MoE:《DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models》翻译与解读

2024年1月25日,LLMs之DeepSeek-V1:《DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence》翻译与解读

2024年2月5日,LLMs之DeepSeek-V1:《DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models》翻译与解读

2024年5月7日,LLMs之DeepSeek-V2:《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model》翻译与解读

2024年12月26日,LLMs之MoE之DeepSeek-V3:DeepSeek-V3的简介、安装和使用方法、案例应用之详细攻略

2024年12月27日,LLMs之MoE之DeepSeek-V3:《DeepSeek-V3 Technical Report》翻译与解读(DeepSeek-V3的最详细解读)

2025年1月20日,LLMs之DeepSeek-V3:DeepSeek-R1的简介、安装和使用方法、案例应用之详细攻略

2025年1月22日,LLMs之DeepSeek-R1:《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning》翻译与解读

《DeepSeek-V3 Technical Report》翻译与解读

Abstract

Figure 1: Benchmark performance of DeepSeek-V3 and its counterparts.图 1:DeepSeek-V3 及其同类模型的基准性能。

1、Introduction

Table 1: Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.表 1:假设 H800 的租赁价格为每 GPU 小时 2 美元,DeepSeek-V3 的训练成本。

Architecture: Innovative Load Balancing Strategy and Training Objective架构:创新的负载均衡策略与训练目标

Pre-Training: Towards Ultimate Training Efficiency预训练:迈向极致训练效率

Post-Training: Knowledge Distillation from DeepSeek-R1后训练:来自 DeepSeek-R1 的知识蒸馏

Summary of Core Evaluation Results核心评估结果摘要

2、Architecture

Figure 2: Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.图 2:DeepSeek-V3 基本架构示意图。继 DeepSeek-V2 之后,我们采用 MLA 和 DeepSeekMoE 来实现高效推理和经济训练。

2.1、Basic Architecture基本架构:基于Transformer框架+MLA高效推理+DeepSeekMoE高效训练+ALFLB实现负载均衡+CSWAL补充的序列级辅助损失+NLR降低训练过程中的通信成本+NTD策略

2.1.1 Multi-Head Latent Attention多头潜在注意力:采用MLA提高推理效率,采用低秩联合压缩注意力键和值来减少推理过程中的KV缓存,保持了MHA相当的性能

2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing具有无辅助损失负载均衡的 DeepSeekMoE:采用DeepSeekMoE以降低训练成本

Basic Architecture of DeepSeekMoE—DeepSeek-V3 中的 DeepSeekMoE 基本架构:对FFN采用DeepSeekMoE架构+DeepSeek-V3使用sigmoid函数计算亲和力分数+并归一化生成门控值

Auxiliary-Loss-Free Load Balancing.无辅助损失的负载均衡——解决MoE模型中专家负载不平衡:为亲和度加上偏差项以确定top-K路由→训练中动态调整偏差项

Complementary Sequence-Wise Auxiliary Loss.互补序列级辅助损失—防止单个序列内出现极度不平衡:添加了一个小的序列级平衡损失

Node-Limited Routing节点受限路由—限制通信成本:每个token最多发送到M个节点

No Token-Dropping.无标记舍弃—保证训练和推理过程中不丢弃任何token:

Figure 3: Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth.图 3:我们多标记预测(MTP)实现的示意图。在每个深度,我们为每个标记的预测保留完整的因果链。

2.2、Multi-Token Prediction多标记预测:引入MTP训练目标提升模型性能——扩展预测范围+MTP模块保持每个预测深度的完整因果链+对每个预测深度计算交叉熵损失+推理中丢弃MTP

目标:扩展预测范围到多个未来token,提高数据效率,并使模型更好地预先规划表示。

MTP Modules模块实现:使用多个顺序模块来预测多个额外token,保持每个预测深度的完整因果链

MTP Training Objective训练目标:对每个预测深度计算交叉熵损失,并将其平均值作为额外的训练目标


MTP in Inference推理中的 MTP:推理过程中可以丢弃MTP模块,或将其用于推测性解码以提高生成速度

3、Infrastructures基础设施

3.1 Compute Clusters计算集群:硬件配置(采用2048个H800 GPU)、节点内部互联(每个节点包含8个通过NVLink和NVSwitch互连的GPU)、节点间互联(节点间使用InfiniBand (IB) 互连)

3.2 Training Framework训练框架:

框架和并行策略:使用高效轻量级的HAI-LLM训练框架,采用16路流水线并行 (PP)、跨8个节点的64路专家并行 (EP) 和ZeRO-1数据并行 (DP)。

工程优化:DualPipe算法实现高效PP算法+高效的跨节点全对全通信内核(充分利用InfiniBand和NVLink带宽)+极度节省内存(重新计算RMSNorm和MLA上投影+在CPU中保存EMA参数+共享MTP模块和主模型的嵌入层和输出头等)

3.2.1 DualPipe and Computation-Communication Overlap双管道与计算通信重叠

Figure 4: Overlapping strategy for a pair of individual forward and backward chunks (the boundaries of the transformer blocks are not aligned). Orange denotes forward, green denotes "backward for input", blue denotes "backward for weights", purple denotes PP communication, and red denotes barriers. Both all-to-all and PP communication can be fully hidden.图 4:一对单独的前向和后向块的重叠策略(变压器块的边界未对齐)。橙色表示前向,绿色表示“输入的后向”,蓝色表示“权重的后向”,紫色表示 PP 通信,红色表示屏障。全对全和 PP 通信都可以完全隐藏。

Figure 5: Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.图 5:8 个 PP 级别和 20 个微批次在两个方向上的 DualPipe 调度示例。反向的微批次与正向的微批次对称,为简化说明,我们省略了它们的批次 ID。由共享黑色边框包围的两个单元格具有相互重叠的计算和通信。

Table 2: Comparison of pipeline bubbles and memory usage across different pipeline parallel methods. F denotes the execution time of a forward chunk, B denotes the execution time of a full backward chunk, W denotes the execution time of a "backward for weights" chunk, and F&B denotes the execution time of two mutually overlapped forward and backward chunks.表 2:不同流水线并行方法的流水线气泡和内存使用情况比较。F 表示前向块的执行时间,B 表示完整后向块的执行时间,W 表示“权重后向”块的执行时间,F&B 表示两个相互重叠的前向和后向块的执行时间。

3.2.2 Efficient Implementation of Cross-Node All-to-All Communication跨节点全对全通信的高效实现

3.2.3 Extremely Memory Saving with Minimal Overhead极大节省内存且开销极小

Recomputation of RMSNorm and MLA Up-Projection重新计算 RMSNorm 和 MLA 上投影

Exponential Moving Average in CPU在 CPU 中使用指数移动平均

Shared Embedding and Output Head for Multi-Token Prediction.多标记预测的共享嵌入和输出头

3.3 FP8 Training训练:基于FP8的混合精度框架+细粒度量化+提高累加精度+低精度存储和通信

Figure 6: The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated.图 6:采用 FP8 数据格式的整体混合精度框架。为便于说明,仅展示了线性运算符

3.3.1 Mixed Precision Framework混合精度框架

Figure 7: (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of NC=128 elements MMA for the high-precision accumulation.图 7:(a)我们提出了一种细粒度量化方法来减轻由特征异常值引起的量化误差;为便于说明,仅展示了前向传播(Fprop)。(b)结合我们的量化策略,我们通过以 NC=128 个元素为间隔将 FP8 GEMM 精度提升至 CUDA 核心的 MMA 来提高高精度累加。

3.3.2 Improved Precision from Quantization and Multiplication量化与乘法运算提升精度

Fine-Grained Quantization精细量化

Increasing Accumulation Precision提高累加精度

Mantissa over Exponents尾数与指数

Online Quantization在线量化

3.3.3 Low-Precision Storage and Communication低精度存储与通信

Low-Precision Optimizer States低精度优化器状态

Low-Precision Activation低精度激活

Low-Precision Communication低精度通信

3.4 Inference and Deployment推理与部署:将预填充和解码阶段分开部署

3.4.1 Prefilling预填充:并行策略+采用冗余专家实现负载均衡策略

3.4.2 Decoding解码:并行策略+IB点对点传输+IBGDA技术

3.5 Suggestions on Hardware Design关于硬件设计的建议:硬件厂商(开发卸载通信任务的协处理器+提高FP8 GEMM累加精度+支持tile和block级量化+支持转置GEMM操作)

3.5.1 Communication Hardware通信硬件:建议开发卸载通信任务的GPU协处理器或网络协处理器,并统一IB和NVLink网络接口

3.5.2 Compute Hardware计算硬件:建议提高Tensor Core中FP8 GEMM累加精度,支持tile和block级量化以及在线量化,并支持转置GEMM操作

Higher FP8 GEMM Accumulation Precision in Tensor Cores.张量核心中更高的 FP8 GEMM 累加精度

Support for Tile- and Block-Wise Quantization支持分块和分组量化

Support for Online Quantization对在线量化提供支持

Support for Transposed GEMM Operations对转置 GEMM 操作的支持

4 Pre-Training

4.1 Data Construction数据构建:优化预训练语料库=提高数学和编程样本比例+扩展多语言+文档打包+FIM策略

>> 语料库优化:提高数学和编程样本的比例+扩展多语言

>> 文档打包(数据完整性)→14.8T(高质量且多样化)

>> Fill-in-Middle (FIM) 策略:沿用DeepSeekCoder-V2中的FIM策略的PSM框架+文档级别

>> 分词器:采用BPE +词汇表(128K)+随机拆分

4.2 Hyper-Parameters超参数:模型超参数(Transformer层数、隐藏维度、注意力头数等)和训练超参数(优化器、学习率调度、批量大小等)

Model Hyper-Parameters模型超参数:Transformer(61层),隐藏维度(7168),MLA参数(nh和dh都为128/dc=512/dc′=1536/dhR=64),MoE参数(除前三层外其余FFN都替换为MoE层+每个MoE层包含1个共享专家和256个路由专家(专家中间隐藏维度2048),每个token激活8个路由专家,每个token最多发送到4个节点),多token预测深度D=1,在压缩潜在向量后附加额外的RMSNorm层,每个token激活37B参数、总参数量671B

Training Hyper-Parameters训练超参数:优化器(AdamW),max_length=4K,预训练14.8T,学习率调度,并行策略(PP=8),M=4,辅助损失免费负载均衡,MTP损失权重

Figure 8: Evaluation results on the ”Needle In A Haystack” (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.图 8:“大海捞针”(NIAH)测试的评估结果。DeepSeek-V3 在高达 128K 的所有上下文窗口长度下均表现出色。

4.3 Long Context Extension长上下文扩展:沿用YaRN方法(仅应用于解耦共享键)+2个额外的训练阶段(4K→32K→128K,每个阶段包含1000步),NIAH测试良好

4.4 Evaluations评估:多个英语、中文和多语言基准上评估

4.4.1 Evaluation Benchmarks评估基准:多项选择数据集、语言理解和推理数据集、闭卷问答数据集、阅读理解数据集、指代消歧数据集、语言建模数据集、中文理解和文化数据集、数学数据集、代码数据集、标准化考试

评估方法和指标:困惑度+生成,BPB度量指标

Table 3: Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.表 3:DeepSeek-V3-Base 与其他代表性开源基础模型的比较。所有模型均在我们的内部框架中进行评估,并采用相同的评估设置。分差不超过 0.3 的分数被视为处于同一水平。DeepSeek-V3-Base 在大多数基准测试中表现最佳,尤其是在数学和代码任务方面。

4.4.2 Evaluation Results评估结果:最强大的开源模型(尤其是在数学和代码任务上),超便宜(每万亿token的训练仅需180K H800 GPU小时)

Table 4: Ablation results for the MTP strategy. The MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.表 4:MTP 策略的消融实验结果。MTP 策略在大多数评估基准上始终能提升模型性能。

4.5 Discussion讨论

4.5.1 Ablation Studies for Multi-Token Prediction多标记预测的消融研究:在不同规模的基线模型上验证了MTP策略的有效性

4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy辅助损失无损平衡策略的消融研究:在不同规模的基线模型上验证了辅助损失免费负载均衡策略的有效性

Table 5: Ablation results for the auxiliary-loss-free balancing strategy. Compared with the purely auxiliary-loss-based method, the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.表 5:无辅助损失平衡策略的消融实验结果。与纯辅助损失方法相比,无辅助损失策略在大多数评估基准上始终能取得更好的模型性能。

4.5.3 Batch-Wise Load Balance VS. Sequence-Wise Load Balance批量负载均衡与序列负载均衡

Figure 9: Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set. The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.图 9:在 Pile 测试集的三个领域中,无辅助损失模型和基于辅助损失模型的专家负载情况。无辅助损失模型显示出比基于辅助损失模型更明显的专家专业化模式。相对专家负载表示实际专家负载与理论平衡专家负载之间的比率。由于篇幅限制,我们仅展示两层的结果作为示例,所有层的结果见附录 C。

5 Post-Training

5.1 Supervised Fine-Tuning监督微调:采用150万个指令微调数据集

(1)、数据集构建:构建包含150万个样本的指令微调数据集,涵盖多个领域,每个领域采用不同的数据创建方法。

Reasoning Data推理数据:采用DeepSeek-R1生成+两阶段方法(基于SFT和RL训练领域专家模型→采用专家模型生成两种类型的SFT样本→采用拒绝采样筛选高质量SFT数据)

Non-Reasoning Data非推理数据:采用DeepSeek-V2.5生成答案→人工标注者验证

(2)、SFT设置:2轮迭代微调+余弦衰减策略+每个序列由多个样本打包而成+采用样本掩码策略(确保样本之间相互隔离)

SFT Settings设置

5.2 Reinforcement Learning强化学习:基于规则的奖励模型和基于模型的奖励模型+采用GRPO算法

5.2.1 Reward Model奖励模型

Rule-Based RM基于规则的奖励机制—确定性/可靠性:适用于特定规则验证的问题(例如某些数学题、LeetCode题)

Model-Based RM基于模型的奖励机制:适用于自由格式答案的问题(例如创意写作),奖励模型采用DeepSeek-V3 SFT+构建包含思维链的偏好数据+提高可靠性

5.2.2 Group Relative Policy Optimization组相对策略优化:采用GRPO算法从组分数估计基线+最大化奖励+控制KL散度+整合多域提示

5.3 Evaluations评估:标准评估、开放式评估

5.3.1 Evaluation Settings评估设置

Evaluation Benchmarks评估基准:基础模型基准+指令模型基准

Compared Baselines对比基线模型:对比DeepSeek-V2系列、Qwen2.5 72B Instruct、LLaMA-3.1 405B Instruct、Claude-Sonnet-3.5-1022和GPT-4o-0513

Detailed Evaluation Configurations详细评估配置

Table 6: Comparison between DeepSeek-V3 and other representative chat models. All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.表 6:DeepSeek-V3 与其他代表性聊天模型的比较。所有模型均在限制输出长度为 8K 的配置下进行评估。样本数量少于 1000 个的基准测试会使用不同的温度设置多次进行测试,以得出可靠的最终结果。DeepSeek-V3 是表现最佳的开源模型,并且在与前沿的闭源模型的对比中也展现出具有竞争力的性能。

5.3.2 Standard Evaluation标准评估:在大多数基准测试中表现最佳

English Benchmarks英语基准测试

Code and Math Benchmarks代码和数学基准测试

Chinese Benchmarks中文基准测试

Table 7: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.表 7:英语开放式对话评估。对于 AlpacaEval 2.0,我们使用长度控制下的胜率作为衡量指标。

5.3.3 Open-Ended Evaluation开放式评估:使用LLM作为评判者

5.3.4 DeepSeek-V3 as a Generative Reward Model作为生成奖励模型的 DeepSeek-V3:性能与GPT-4o和Claude-3.5-Sonnet相当甚至更优

Table 8:Performances of GPT-4o, Claude-3.5-sonnet and DeepSeek-V3 on RewardBench.表 8:GPT-4o、Claude-3.5-sonnet 和 DeepSeek-V3 在 RewardBench 上的表现

5.4 Discussion讨论

5.4.1 Distillation from DeepSeek-R1从 DeepSeek-R1 蒸馏的效果

Table 9: The contribution of distillation from DeepSeek-R1. The evaluation settings of LiveCodeBench and MATH-500 are the same as in Table 6.表 9:DeepSeek-R1 蒸馏的贡献。LiveCodeBench 和 MATH-500 的评估设置与表 6 相同。

5.4.2 Self-Rewarding自我奖励:采用宪法AI方法+利用DeepSeek-V3自身的投票评估结果作为反馈来源

5.4.3 Multi-Token Prediction Evaluation多标记预测评估:第二个token的接受率在85%到90%之间

6、Conclusion, Limitations, and Future Directions结论、局限性与未来方向


相关文章

2024年1月5日,LLMs之DeepSeek-V1:《DeepSeek LLM: Scaling Open-Source Language Models with Longtermism》翻译与解读

LLMs之DeepSeek-V1:《DeepSeek LLM: Scaling Open-Source Language Models with Longtermism》翻译与解读-CSDN博客

2024年1月11日,LLMs之DeepSeek-V1之MoE:《DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models》翻译与解读

LLMs之DeepSeek-V1之MoE:《DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Lang-CSDN博客

2024年1月25日,LLMs之DeepSeek-V1:《DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence》翻译与解读

LLMs之DeepSeek-V1:《DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Cod-CSDN博客

2024年2月5日,LLMs之DeepSeek-V1:《DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models》翻译与解读

LLMs之DeepSeek-V1:《DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models-CSDN博客

2024年5月7日,LLMs之DeepSeek-V2:《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model》翻译与解读

LLMs之DeepSeek-V2:《DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model-CSDN博客

2024年12月26日,LLMs之MoE之DeepSeek-V3:DeepSeek-V3的简介、安装和使用方法、案例应用之详细攻略

LLMs之MoE之DeepSeek-V3:DeepSeek-V3的简介、安装和使用方法、案例应用之详细攻略-CSDN博客

2024年12月27日,LLMs之MoE之DeepSeek-V3:《DeepSeek-V3 Technical Report》翻译与解读(DeepSeek-V3的最详细解读)

LLMs之MoE之DeepSeek-V3:《DeepSeek-V3 Technical Report》翻译与解读(DeepSeek-V3的最详细解读)_in order to achieve efficient training, we support-CSDN博客

2025年1月20日,LLMs之DeepSeek-V3:DeepSeek-R1的简介、安装和使用方法、案例应用之详细攻略

LLMs之DeepSeek-V3:DeepSeek-R1的简介、安装和使用方法、案例应用之详细攻略_怎样使用deepseek r1-CSDN博客

2025年1月22日,LLMs之DeepSeek-R1:《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning》翻译与解读

LLMs之DeepSeek-R1:《DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning-CSDN博客

《DeepSeek-V3 Technical Report》翻译与解读

地址

论文地址:[2412.19437] DeepSeek-V3 Technical Report

时间

2024年12月27日

作者

DeepSeek团队

Abstract

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at this https URL.

我们推出 DeepSeek-V3,这是一款强大的专家混合(MoE)语言模型,总参数量达 6710 亿,每个标记激活 370 亿参数。为了实现高效的推理和经济高效的训练,DeepSeek-V3 采用了多头潜在注意力(MLA)和 DeepSeekMoE 架构,这些架构在 DeepSeek-V2 中得到了充分验证。此外,DeepSeek-V3 还率先采用无辅助损失的策略来实现负载均衡,并设定了多标记预测训练目标以增强性能。我们使用 14.8 万亿个多样且高质量的标记对 DeepSeek-V3 进行预训练,随后进行监督微调和强化学习阶段,以充分发挥其能力。全面评估表明,DeepSeek-V3 超过了其他开源模型,并达到了与领先闭源模型相当的性能。尽管性能出色,但 DeepSeek-V3 的完整训练仅需 278.8 万 H800 GPU 小时。此外,其训练过程非常稳定。在整个训练过程中,我们没有遇到任何不可恢复的损失峰值,也未进行任何回滚操作。模型检查点可在以下 https 网址获取。

Figure 1: Benchmark performance of DeepSeek-V3 and its counterparts.图 1:DeepSeek-V3 及其同类模型的基准性能。

1、Introduction

介绍了LLM的快速发展以及向通用人工智能(AGI)迈进的趋势。 指出开源模型(如DeepSeek系列、LLaMA系列、Qwen系列和Mistral系列)正在努力缩小与闭源模型的差距。DeepSeek-V3作为DeepSeek系列的最新模型,旨在通过规模化和创新技术来进一步提升开源模型的能力。 强调了DeepSeek-V3在追求强大性能的同时,也注重经济成本

>> DeepSeek-V3的目标:在保持强大模型性能的同时,降低训练成本

>> DeepSeek-V3的创新点
● 采用DeepSeek-V2中验证有效的Multi-head Latent Attention (MLA) 和 DeepSeekMoE 架构,以提高推理效率和降低训练成本。
● 首次提出辅助损失免费的负载均衡策略 (Auxiliary-loss-free strategy),最大程度减少负载均衡对模型性能的负面影响。
● 采用多token预测训练目标 (Multi-token prediction training objective),提升模型在评估基准上的整体性能。

>> DeepSeek-V3的训练过程:包含预训练、监督微调和强化学习三个阶段,在14.8万亿高质量和多样化的token上进行预训练。整个训练过程稳定,没有出现不可恢复的损失峰值或回滚。

>> DeepSeek-V3的性能:优于其他开源模型,与领先的闭源模型性能相当,且训练成本低廉(278.8万H800 GPU小时,约合557.6万美元)。

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including DeepSeek series (DeepSeek-AI, 2024b, c; Guo et al., 2024; DeepSeek-AI, 2024a), LLaMA series (Touvron et al., 2023a, b; AI@Meta, 2024a, b), Qwen series (Qwen, 2023, 2024a, 2024b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.

近年来,大型语言模型(LLMs)经历了快速的迭代和演进(OpenAI,2024a;Anthropic,2024;Google,2024),逐渐缩小了与通用人工智能(AGI)之间的差距。除了闭源模型之外,开源模型,包括 DeepSeek 系列(DeepSeek-AI,2024b,c;Guo 等人,2024;DeepSeek-AI,2024a)、LLaMA 系列(Touvron 等人,2023a,b;AI@Meta,2024a,b)、Qwen 系列(Qwen,2023,2024a,2024b)和 Mistral 系列(Jiang 等人,2023;Mistral,2024),也在不断取得重大进展,努力缩小与闭源模型之间的差距。为了进一步拓展开源模型的能力边界,我们扩大了模型规模,并推出了 DeepSeek-V3,这是一个拥有 6710 亿参数的大型专家混合(MoE)模型,其中每个标记激活 370 亿参数。

从长远来看,我们始终致力于实现强大的模型性能和经济的成本。因此,在架构方面,DeepSeek-V3 仍然采用多头潜在注意力(MLA)(DeepSeek-AI,2024c)以实现高效的推理,并采用 DeepSeekMoE(Dai 等人,2024)以实现经济高效的训练。这两种架构已在 DeepSeek-V2(DeepSeek-AI,2024c)中得到验证,证明了它们在保持模型性能稳健的同时能够实现高效训练和推理的能力。除了基本架构之外,我们还实施了两种额外策略以进一步增强模型能力。首先,DeepSeek-V3 开创了一种无辅助损失的负载均衡策略(Wang 等人,2024a),旨在将为促进负载均衡所付出的努力对模型性能产生的不利影响降至最低。其次,DeepSeek-V3 采用多标记预测训练目标,我们观察到这能提升在评估基准上的整体性能。

In order to achieve efficient training, we support the FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b; Dettmers et al., 2022), its evolution being closely tied to advancements in hardware capabilities (Micikevicius et al., 2022; Luo et al., 2024; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.

为了实现高效训练,我们支持 FP8 混合精度训练,并对训练框架进行了全面优化。低精度训练已成为高效训练的一种有前景的解决方案(Kalamkar 等人,2019;Narang 等人,2017;Peng 等人,2023b;Dettmers 等人,2022),其发展与硬件能力的进步紧密相关(Micikevicius 等人,2022;Luo 等人,2024;Rouhani 等人,2023a)。在这项工作中,我们引入了一个 FP8 混合精度训练框架,并首次验证了其在超大规模模型上的有效性。通过支持 FP8 计算和存储,我们实现了训练加速和 GPU 内存使用的减少。至于训练框架,我们设计了 DualPipe 算法以实现高效的流水线并行,该算法减少了流水线中的空泡,并通过计算与通信重叠隐藏了训练期间的大部分通信。这种重叠确保了随着模型规模的进一步扩大,只要保持恒定的计算与通信比例,我们仍能在节点间使用细粒度专家,同时实现近乎零的全对全通信开销。此外,我们还开发了高效的跨节点全对全通信内核,以充分利用 InfiniBand(IB)和 NVLink 带宽。而且,我们还精心优化了内存占用,使得无需使用昂贵的张量并行就能训练 DeepSeek-V3。通过这些努力,我们实现了高训练效率。

在预训练阶段,我们在 14.8T 高质量且多样化的标记上训练 DeepSeek-V3。预训练过程非常稳定。在整个训练过程中,我们没有遇到任何不可恢复的损失峰值,也不需要回滚。接下来,我们对 DeepSeek-V3 进行两阶段的上下文长度扩展。在第一阶段,最大上下文长度扩展到 32K,在第二阶段,进一步扩展到 128K。随后,我们对 DeepSeek-V3 的基础模型进行后训练,包括监督微调(SFT)和强化学习(RL),以使其与人类偏好保持一致,并进一步释放其潜力。在后训练阶段,我们从 DeepSeek-R1 系列模型中提炼推理能力,同时精心保持模型准确性和生成长度之间的平衡。

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.

我们在一系列全面的基准测试中对 DeepSeek-V3 进行了评估。尽管其训练成本经济实惠,但全面评估表明,DeepSeek-V3-Base 已成为目前最强的开源基础模型,尤其是在代码和数学方面。其聊天版本也优于其他开源模型,并在一系列标准和开放式基准测试中达到了与 GPT-4o 和 Claude-3.5-Sonnet 等领先闭源模型相当的性能。

最后,我们再次强调 DeepSeek-V3 的经济训练成本,如表 1 所示,这是通过我们对算法、框架和硬件的优化协同设计实现的。在预训练阶段,每训练一万亿个标记仅需 18 万 H800 GPU 小时,即在我们拥有 2048 个 H800 GPU 的集群上仅需 3.7 天。因此,我们的预训练阶段在不到两个月的时间内完成,耗时 266.4 万 GPU 小时。加上 11.9 万 GPU 小时用于上下文长度扩展和 5000 个 GPU 小时用于后期训练,DeepSeek-V3 的完整训练仅耗时 278.8 万 GPU 小时。假设 H800 GPU 的租赁价格为每 GPU 小时 2 美元,我们的总训练成本仅为 557.6 万美元。请注意,上述成本仅包括 DeepSeek-V3 的官方训练费用,不包含在架构、算法或数据方面的前期研究和消融实验所产生的费用。
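下面用一小段 Python 对上述成本数字做个验算(数字均取自原文,每 GPU 小时 2 美元为论文假设的租赁价格,代码仅为演示性的算术草图):

```python
# 按原文数字验算 DeepSeek-V3 的训练成本(单位:千 GPU 小时)
pretrain = 2664        # 预训练:14.8T token,每万亿 token 约 180K H800 GPU 小时
context_ext = 119      # 两阶段上下文扩展(32K -> 128K)
post_train = 5         # 后训练(SFT + RL)

total_k_gpu_hours = pretrain + context_ext + post_train      # 2788,即 2.788M GPU 小时
price_per_gpu_hour = 2.0                                     # 假设的 H800 租赁价格(美元)

total_cost = total_k_gpu_hours * 1_000 * price_per_gpu_hour  # 5,576,000 美元,约合 557.6 万美元
print(total_k_gpu_hours, total_cost)
```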

Table 1: Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.表 1:假设 H800 的租赁价格为每 GPU 小时 2 美元,DeepSeek-V3 的训练成本。

Our main contribution includes:

我们的主要贡献包括:

Architecture: Innovative Load Balancing Strategy and Training Objective架构:创新的负载均衡策略与训练目标

• On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.

• We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

• 在 DeepSeek-V2 高效架构的基础上,我们开创了一种无需辅助损失的负载均衡策略,该策略将因鼓励负载均衡而产生的性能下降降至最低。

• 我们研究了一种多标记预测(MTP)目标,并证明其对模型性能有益。它还可用于推测性解码以加速推理。

Pre-Training: Towards Ultimate Training Efficiency预训练:迈向极致训练效率

Pre-Training: Towards Ultimate Training Efficiency

• We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.

• Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.

• At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

• 我们设计了一个 FP8 混合精度训练框架,并首次验证了在超大规模模型上进行 FP8 训练的可行性和有效性。

• 通过算法、框架和硬件的协同设计,我们克服了跨节点 MoE 训练中的通信瓶颈,实现了近乎完全的计算-通信重叠。这极大地提高了我们的训练效率,降低了训练成本,使我们能够在不增加额外开销的情况下进一步扩大模型规模。

• 仅花费 266.4 万 H800 GPU 小时的经济成本,我们就在 14.8 万亿个标记上完成了 DeepSeek-V3 的预训练,生成了目前最强的开源基础模型。预训练之后的后续训练阶段仅需 10 万 GPU 小时。

Post-Training: Knowledge Distillation from DeepSeek-R1后训练:来自 DeepSeek-R1 的知识蒸馏

• We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

• 我们引入了一种创新的方法,将长链思维(CoT)模型(特别是 DeepSeek R1 系列模型之一)的推理能力提炼到标准的大型语言模型(LLM)中,尤其是 DeepSeek-V3 中。我们的流程巧妙地将 R1 的验证和反思模式融入到 DeepSeek-V3 中,并显著提升了其推理性能。同时,我们还能够控制 DeepSeek-V3 的输出风格和长度。

Summary of Core Evaluation Results核心评估结果摘要

• Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.

• Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.

• 知识:(1)在 MMLU、MMLU-Pro 和 GPQA 等教育基准测试中,DeepSeek-V3 超过了所有其他开源模型,在 MMLU 上达到 88.5,在 MMLU-Pro 上达到 75.9,在 GPQA 上达到 59.1。其表现可与 GPT-4o 和 Claude-Sonnet-3.5 等领先闭源模型相媲美,缩小了开源与闭源模型在该领域的差距。(2)在事实性基准测试方面,DeepSeek-V3 在 SimpleQA 和中文 SimpleQA 中均展现出开源模型中的卓越表现。尽管在英语事实知识(SimpleQA)方面落后于 GPT-4o 和 Claude-Sonnet-3.5,但在中文事实知识(中文 SimpleQA)方面却超越了这些模型,突显了其在中文事实知识方面的优势。

• 代码、数学和推理:(1)在所有非长链推理(CoT)的开源和闭源模型中,DeepSeek-V3 在数学相关基准测试中表现卓越。值得注意的是,在 MATH-500 等特定基准测试中,它甚至超过了 o1-preview,展示了其强大的数学推理能力。(2)在与编码相关的任务中,DeepSeek-V3 在诸如 LiveCodeBench 等编码竞赛基准测试中脱颖而出,成为表现最佳的模型,巩固了其在该领域的领先地位。对于工程相关的任务,尽管 DeepSeek-V3 的表现略逊于 Claude-Sonnet-3.5,但仍大幅领先于其他所有模型,展示了其在各种技术基准测试中的竞争力。

In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design. Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).

在本文的其余部分,我们首先详细介绍 DeepSeek-V3 模型架构(第 2 节)。随后,我们将介绍我们的基础设施,包括计算集群、训练框架、对 FP8 训练的支持、推理部署策略以及对未来硬件设计的建议。接下来,我们将描述预训练过程,包括训练数据的构建、超参数设置、长上下文扩展技术、相关评估以及一些讨论(第 4 节)。之后,我们将讨论后训练方面的工作,包括监督微调(SFT)、强化学习(RL)以及相应的评估与讨论(第 5 节)。最后,我们总结这项工作,讨论 DeepSeek-V3 现有的局限性,并提出未来研究的潜在方向(第 6 节)。

2、Architecture

本节详细阐述了DeepSeek-V3模型的架构,包括基本架构(多头潜在注意力机制(MLA)与DeepSeekMoE)以及多token预测(MTP)训练目标。DeepSeek-V3的架构设计在高效推理和经济高效的训练之间取得了良好的平衡,并通过引入辅助损失免费负载均衡策略和多token预测训练目标,进一步提升了模型的性能。DeepSeek-V3模型架构的核心特点是:采用Multi-head Latent Attention (MLA) 以提高推理效率,采用DeepSeekMoE以降低训练成本。此外,还引入了多token预测(MTP)训练目标,以提升模型性能。其他未明确提及的细节与DeepSeek-V2保持一致。

DeepSeek-V3的基本架构基于Transformer框架,并采用了DeepSeek-V2中验证有效的MLA (Multi-head Latent Attention) 和DeepSeekMoE架构,分别用于高效推理和经济高效的训练。 创新之处在于引入了无辅助损失负载平衡策略和多token预测训练目标。

● MLA通过低秩压缩减少了推理过程中的KV缓存。DeepSeekMoE通过细粒度的专家和共享专家来降低训练成本,并通过动态调整每个专家的偏差项来实现无辅助损失的负载平衡。

● 多token预测(MTP)扩展了模型的预测范围,提高了数据效率和预测准确性。

We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

我们首先介绍 DeepSeek-V3 的基本架构,其特色在于采用多头潜在注意力(MLA)(DeepSeek-AI,2024c)以实现高效推理,以及采用 DeepSeekMoE(Dai 等人,2024)以实现经济训练。然后,我们提出了一种多标记预测(MTP)训练目标,我们观察到该目标能提升在评估基准上的整体性能。对于未明确提及的其他细节,DeepSeek-V3 遵循 DeepSeek-V2(DeepSeek-AI,2024c)的设置。

Figure 2: Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.图 2:DeepSeek-V3 基本架构示意图。继 DeepSeek-V2 之后,我们采用 MLA 和 DeepSeekMoE 来实现高效推理和经济训练。

2.1、Basic Architecture基本架构:基于Transformer框架+MLA高效推理+DeepSeekMoE高效训练+ALFLB实现负载均衡+CSWAL补充的序列级辅助损失+NLR降低训练过程中的通信成本+NTD策略

>> 基于Transformer框架

>> 采用MLA (Multi-head Latent Attention) 进行高效推理:通过低秩联合压缩注意力键和值来减少推理过程中的键值缓存。

>> 采用DeepSeekMoE (Mixture-of-Experts) 进行经济高效的训练:使用更细粒度的专家,并分离一些专家作为共享专家。

>> 辅助损失免费的负载均衡 (Auxiliary-Loss-Free Load Balancing):通过动态调整每个专家的偏差项来实现负载均衡,避免了传统辅助损失方法对模型性能的负面影响。

>> 补充的序列级辅助损失 (Complementary Sequence-Wise Auxiliary Loss):为了防止单个序列内出现极度不平衡,添加一个小的序列级平衡损失。

>> 节点限制路由 (Node-Limited Routing):限制每个token最多发送到M个节点,以降低训练过程中的通信成本。

>> 不丢弃token (No Token-Dropping):有效的负载均衡策略保证了训练和推理过程中不丢弃任何token。

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.

DeepSeek-V3 的基本架构仍在 Transformer(Vaswani 等人,2017 年)框架内。为了实现高效的推理和经济的训练,DeepSeek-V3 还采用了 MLA 和 DeepSeekMoE,这两者已在 DeepSeek-V2 中得到了充分验证。与 DeepSeek-V2 相比,唯一的例外是,我们为 DeepSeekMoE 额外引入了一种无辅助损失的负载均衡策略(Wang 等人,2024a),以缓解为确保负载均衡而付出的努力所导致的性能下降。图 2 展示了 DeepSeek-V3 的基本架构,在本节中我们将简要回顾 MLA 和 DeepSeekMoE 的细节。

2.1.1 Multi-Head Latent Attention多头潜在注意力:采用MLA提高推理效率,采用低秩联合压缩注意力键和值来减少推理过程中的KV缓存,保持了MHA相当的性能

>> 核心思想:通过低秩联合压缩注意力键和值来减少推理过程中的键值缓存 (KV cache)。

>> 具体方法:对注意力键和值进行低秩联合压缩,生成压缩的潜在向量 (c_t^KV);对注意力查询也进行低秩压缩,生成压缩的潜在向量 (c_t^Q);利用旋转位置编码 (RoPE) 生成解耦的键和查询 (k_t^R, q_t^R);最终将查询、键和值结合起来生成最终的注意力输出 (u_t)。

>> 优势:仅需缓存压缩的潜在向量和解耦的键/查询,显著减少了KV缓存,同时保持了与标准多头注意力 (MHA) 相当的性能。

For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_h$ denote the number of attention heads, $d_h$ denote the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^d$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference:

对于注意力机制,DeepSeek-V3 采用了 MLA 架构。设 $d$ 表示嵌入维度,$n_h$ 表示注意力头的数量,$d_h$ 表示每个头的维度,$\mathbf{h}_t \in \mathbb{R}^d$ 表示给定注意力层中第 $t$ 个标记的注意力输入。MLA 的核心在于对注意力键和值进行低秩联合压缩,以减少推理过程中的键值(KV)缓存:

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c (\ll d_h n_h)$ indicates the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ denotes the down-projection matrix; $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively; $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes concatenation. Note that for MLA, only the blue-boxed vectors (i.e., $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^R$) need to be cached during generation, which results in significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).

其中,$\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ 是用于键和值的压缩潜在向量;$d_c$(远小于 $d_h n_h$)表示 KV 压缩维度;$W^{DKV} \in \mathbb{R}^{d_c \times d}$ 是降维投影矩阵;$W^{UK}$、$W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ 分别是用于键和值的升维投影矩阵;$W^{KR} \in \mathbb{R}^{d_h^R \times d}$ 是用于生成携带旋转位置嵌入(RoPE)(Su 等人,2024)的解耦键的矩阵;$\operatorname{RoPE}(\cdot)$ 表示应用 RoPE 矩阵的操作;$[\cdot;\cdot]$ 表示拼接。请注意,对于 MLA,仅需在生成过程中缓存蓝色框中的向量(即 $\mathbf{c}_t^{KV}$ 和 $\mathbf{k}_t^R$),这可显著减少 KV 缓存,同时保持与标准多头注意力(MHA)(Vaswani 等人,2017)相当的性能。

For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:

对于注意力查询,我们也执行低秩压缩,这可以在训练期间减少激活内存:

where $\mathbf{c}_t^Q \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c' (\ll d_h n_h)$ denotes the query compression dimension; $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively; and $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ is the matrix to produce the decoupled queries that carry RoPE.

Ultimately, the attention queries ($\mathbf{q}_{t,i}$), keys ($\mathbf{k}_{j,i}$), and values ($\mathbf{v}_{j,i}^C$) are combined to yield the final attention output $\mathbf{u}_t$:

where $W^O \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix.

其中,压缩后的查询潜在向量为 $\mathbf{c}_t^Q \in \mathbb{R}^{d_c'}$;$d_c'$(远小于 $d_h n_h$)表示查询压缩维度;$W^{DQ} \in \mathbb{R}^{d_c' \times d}$ 和 $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ 分别为查询的降维投影矩阵和升维投影矩阵;$W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ 是用于生成携带 RoPE 的解耦查询的矩阵。

最终,注意力查询($\mathbf{q}_{t,i}$)、键($\mathbf{k}_{j,i}$)和值($\mathbf{v}_{j,i}^C$)被组合起来,以生成最终的注意力输出 $\mathbf{u}_t$:

其中,$W^O \in \mathbb{R}^{d \times d_h n_h}$ 表示输出投影矩阵。

2.1.2 DeepSeekMoE with Auxiliary-Loss-Free Load Balancing具有无辅助损失负载均衡的 DeepSeekMoE:采用DeepSeekMoE以降低训练成本

Basic Architecture of DeepSeekMoE—DeepSeek-V3 中的 DeepSeekMoE 基本架构:对FFN采用DeepSeekMoE架构+DeepSeek-V3使用sigmoid函数计算亲和力分数+并归一化生成门控值

>> DeepSeekMoE基本架构:对于前馈网络 (FFN),采用DeepSeekMoE架构。该架构使用更细粒度的专家,并分离一些专家作为共享专家。每个token的FFN输出 (h't) 是共享专家和路由专家的输出之和。路由专家选择基于token-to-expert亲和力 (si,t) 和每个专家的中心向量 (ei)。与DeepSeek-V2略有不同,DeepSeek-V3使用sigmoid函数计算亲和力分数,并对所有选定的亲和力分数进行归一化以生成门控值。

For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token, we compute the FFN output $\mathbf{h}_t'$ as follows:

对于前馈网络(FFN),DeepSeek-V3 采用了 DeepSeekMoE 架构(Dai 等人,2024)。与 GShard(Lepikhin 等人,2021)等传统 MoE 架构相比,DeepSeekMoE 使用更细粒度的专家,并将部分专家隔离为共享专家。设 $\mathbf{u}_t$ 表示第 $t$ 个标记的 FFN 输入,我们按如下方式计算 FFN 输出 $\mathbf{h}_t'$:

where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\operatorname{FFN}_i^{(s)}(\cdot)$ and $\operatorname{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gating value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_i$ is the centroid vector of the $i$-th routed expert; and $\operatorname{Topk}(\cdot, K)$ denotes the set comprising $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.

其中 $N_s$ 和 $N_r$ 分别表示共享专家和路由专家的数量;$\operatorname{FFN}_i^{(s)}(\cdot)$ 和 $\operatorname{FFN}_i^{(r)}(\cdot)$ 分别表示第 $i$ 个共享专家和第 $i$ 个路由专家;$K_r$ 表示激活的路由专家数量;$g_{i,t}$ 是第 $i$ 个专家的门控值;$s_{i,t}$ 是标记到专家的亲和度;$\mathbf{e}_i$ 是第 $i$ 个路由专家的质心向量;$\operatorname{Topk}(\cdot, K)$ 表示由为第 $t$ 个标记和所有路由专家计算的亲和度得分中最高的 $K$ 个得分组成的集合。与 DeepSeek-V2 略有不同,DeepSeek-V3 使用 Sigmoid 函数计算亲和度得分,并对所有选定的亲和度得分进行归一化以生成门控值。
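结合上面的定义,下面用一个示意性的 PyTorch 草图演示 DeepSeek-V3 的门控计算流程:sigmoid 亲和度 → top-K 选择 → 仅对被选中的分数归一化得到门控值(张量均为随机示例,省略了共享专家与路由专家 FFN 的具体实现):

```python
import torch

N_r, K_r, d = 256, 8, 7168                 # 路由专家数、每 token 激活的专家数、隐藏维度
u_t = torch.randn(d)                       # 第 t 个 token 的 FFN 输入 u_t
e = torch.randn(N_r, d)                    # 各路由专家的质心向量 e_i

s = torch.sigmoid(e @ u_t)                 # DeepSeek-V3 用 sigmoid 计算亲和度 s_{i,t}
topk_s, topk_idx = torch.topk(s, K_r)      # Topk(·, K_r):选出亲和度最高的 8 个专家
g = topk_s / topk_s.sum()                  # 仅在被选中的分数之间归一化,得到门控值 g_{i,t}

# 最终 FFN 输出 = u_t + Σ 共享专家输出 + Σ g_i · 被选路由专家输出(此处省略具体 FFN)
print(topk_idx.tolist(), round(g.sum().item(), 4))   # 8 个专家索引;门控值之和为 1.0
```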

Auxiliary-Loss-Free Load Balancing.无辅助损失的负载均衡——解决MoE模型中专家负载不平衡:为亲和度加上偏差项以确定top-K路由→训练中动态调整偏差项

>> 辅助损失免费负载均衡:为了解决MoE模型中专家负载不平衡的问题,DeepSeek-V3提出了一种辅助损失免费的负载均衡策略。通过为每个专家引入一个偏差项 (bi),并将其添加到相应的亲和力分数中来确定top-K路由。偏差项仅用于路由,门控值仍由原始亲和力分数 (si,t) 导出。在训练过程中,动态调整偏差项 (bi),以保持均衡的专家负载。

For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term bi for each expert and add it to the corresponding affinity scores si,t to determine the top-K routing:

对于 MoE 模型,专家负载的不平衡会导致路由崩溃(Shazeer 等人,2017 年),并且在具有专家并行性的场景中降低计算效率。传统的解决方案通常依赖于辅助损失(Fedus 等人,2021 年;Lepikhin 等人,2021 年)来避免负载不平衡。然而,过大的辅助损失会损害模型性能(Wang 等人,2024a)。为了在负载平衡和模型性能之间实现更好的权衡,我们开创了一种无辅助损失的负载平衡策略(Wang 等人,2024a),以确保负载平衡。具体而言,我们为每个专家引入一个偏差项 bi,并将其添加到相应的亲和度分数 si,t 中,以确定前 K 个路由。

Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score si,t. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we will decrease the bias term by γ if its corresponding expert is overloaded, and increase it by γ if its corresponding expert is underloaded, where γ is a hyper-parameter called bias update speed. Through the dynamic adjustment, DeepSeek-V3 keeps balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.

请注意,偏差项仅用于路由。将与前馈网络(FFN)输出相乘的门控值仍源自原始亲和度得分 si,t。在训练期间,我们持续监控每个训练步骤中整个批次的专家负载。在每一步结束时,如果其对应的专家负载过重,则将偏差项减少γ;如果其对应的专家负载过轻,则将偏差项增加γ,其中γ是一个称为偏差更新速度的超参数。通过这种动态调整,DeepSeek-V3 在训练期间保持了专家负载的平衡,并且比通过纯辅助损失鼓励负载平衡的模型表现更优。
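这一策略可以概括为两步:路由时在亲和度上加偏置 b_i 再取 top-K(门控值仍由原始 s_{i,t} 计算),每个训练步结束后按专家过载/欠载情况把 b_i 减/加 γ。下面是一个示意性草图(变量名、γ 的取值与“过载”的判定方式均为本文假设):

```python
import torch

N_r, K_r, gamma = 256, 8, 0.001              # gamma 即偏置更新速度(取值仅为示意)
bias = torch.zeros(N_r)                      # 每个专家的偏置项 b_i

def route(s):
    """s: (num_tokens, N_r) 的亲和度 s_{i,t};返回选中的专家与门控值"""
    _, topk_idx = torch.topk(s + bias, K_r, dim=-1)        # 偏置只参与 top-K 选择
    topk_s = torch.gather(s, -1, topk_idx)                 # 门控值仍由原始亲和度导出
    return topk_idx, topk_s / topk_s.sum(-1, keepdim=True)

def update_bias(topk_idx, num_tokens):
    """每个训练步结束后,根据整个 batch 的专家负载动态调整偏置"""
    load = torch.bincount(topk_idx.reshape(-1), minlength=N_r).float()
    balanced = num_tokens * K_r / N_r                      # 理论均衡负载
    bias[load > balanced] -= gamma                         # 过载专家:降低偏置
    bias[load < balanced] += gamma                         # 欠载专家:提高偏置

s = torch.sigmoid(torch.randn(1024, N_r))
idx, gates = route(s)
update_bias(idx, num_tokens=1024)
```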

Complementary Sequence-Wise Auxiliary Loss.互补序列级辅助损失—防止单个序列内出现极度不平衡:添加了一个小的序列级平衡损失

>> 补充的序列级辅助损失:为了防止单个序列内出现极度不平衡,添加了一个小的序列级平衡损失 (LBal)。

Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:

尽管 DeepSeek-V3 主要依靠无辅助损失策略来实现负载均衡,但为了防止任何单个序列内部出现极端不平衡的情况,我们还采用了互补的序列级平衡损失:

where the balance factor $\alpha$ is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; $\mathbb{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.

其中平衡因子 $\alpha$ 是一个超参数,在 DeepSeek-V3 中会被赋予一个极小的值;$\mathbb{1}(\cdot)$ 表示指示函数;$T$ 表示序列中的标记数量。序列级平衡损失鼓励每个序列上的专家负载保持平衡。
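下面给出序列级平衡损失的一个示意性实现(采用 MoE 中常见的 f_i·P_i 求和形式,与上文描述一致;具体归一化细节以论文公式为准):

```python
import torch

def sequence_balance_loss(s, topk_idx, alpha=1e-4):
    """s: (T, N_r) 单个序列内各 token 对路由专家的亲和度;topk_idx: (T, K_r) 实际选中的专家。"""
    T, N_r = s.shape
    K_r = topk_idx.shape[-1]

    # f_i:专家 i 在该序列中被选中的频率(按理论均衡负载归一化)
    selected = torch.zeros_like(s).scatter_(1, topk_idx, 1.0)
    f = selected.sum(dim=0) * N_r / (K_r * T)

    # P_i:专家 i 在该序列上的平均归一化亲和度
    P = (s / s.sum(dim=-1, keepdim=True)).mean(dim=0)

    # 平衡因子 alpha 取极小值,只作为无辅助损失策略的补充约束
    return alpha * (f * P).sum()

s = torch.sigmoid(torch.randn(4096, 256))
_, idx = torch.topk(s, 8, dim=-1)
print(sequence_balance_loss(s, idx))
```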

Node-Limited Routing节点受限路由—限制通信成本:每个token最多发送到M个节点

>> 节点限制路由:为了限制通信成本,每个token最多发送到M个节点。

Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training. In short, we ensure that each token will be sent to at most $M$ nodes, which are selected according to the sum of the highest $\frac{K_r}{M}$ affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.

与 DeepSeek-V2 所采用的设备受限路由类似,DeepSeek-V3 也使用了一种受限路由机制来限制训练期间的通信成本。简而言之,我们确保每个标记最多发送到 $M$ 个节点,这些节点是根据每个节点上分布的专家的前 $\frac{K_r}{M}$ 个最高亲和度得分之和来选择的。在这一约束条件下,我们的 MoE 训练框架几乎可以实现计算与通信的完全重叠。
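下面的草图演示节点受限路由的选择过程:先按“每个节点上前 K_r/M 个亲和度之和”给节点打分,只保留得分最高的 M 个节点,再在这些节点的专家中做常规 top-K(专家到节点的均匀映射方式为本文假设):

```python
import torch

N_r, num_nodes, M, K_r = 256, 8, 4, 8
experts_per_node = N_r // num_nodes                    # 假设专家均匀分布:每节点 32 个

def node_limited_topk(s):
    """s: (N_r,) 单个 token 的亲和度;返回限制在 M 个节点内的 top-K 专家索引"""
    per_node = s.view(num_nodes, experts_per_node)
    node_score = per_node.topk(K_r // M, dim=-1).values.sum(-1)   # 每节点前 K_r/M 个分数之和
    top_nodes = node_score.topk(M).indices                        # 只保留得分最高的 M 个节点

    mask = torch.full_like(s, float('-inf')).view(num_nodes, experts_per_node)
    mask[top_nodes] = 0.0                                         # 其余节点上的专家被屏蔽
    return (s + mask.view(-1)).topk(K_r).indices                  # 在保留节点内做 top-K

s = torch.sigmoid(torch.randn(N_r))
idx = node_limited_topk(s)
print(idx.tolist())   # 8 个专家索引,至多分布在 4 个节点上
```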

No Token-Dropping.无标记舍弃—保证训练和推理过程中不丢弃任何token:

>> 不丢弃token:有效的负载均衡策略保证了训练和推理过程中不丢弃任何token。

Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.

由于有效的负载均衡策略,DeepSeek-V3 在整个训练过程中保持良好的负载均衡。因此,DeepSeek-V3 在训练期间不会舍弃任何标记。此外,我们还实施了特定的部署策略以确保推理负载均衡,所以 DeepSeek-V3 在推理期间也不会舍弃标记。

Figure 3: Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth.图 3:我们多标记预测(MTP)实现的示意图。在每个深度,我们为每个标记的预测保留完整的因果链。

2.2、Multi-Token Prediction多标记预测:引入MTP训练目标提升模型性能——扩展预测范围+MTP模块保持每个预测深度的完整因果链+对每个预测深度计算交叉熵损失+推理中丢弃MTP

目标:扩展预测范围到多个未来token,提高数据效率,并使模型更好地预先规划表示。

扩展预测范围到多个未来token:在每个位置预测多个未来的token,增加训练信号密度,提高数据效率,并可能使模型更好地预先规划表示以更好地预测未来的token。

Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts D additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.

受 Gloeckle 等人(2024 年)的启发,我们为 DeepSeek-V3 设定了一个多标记预测(MTP)目标,该目标将预测范围扩展到每个位置的多个未来标记。一方面,MTP 目标增加了训练信号的密度,可能提高数据效率。另一方面,MTP 可能使模型能够预先规划其表示,从而更好地预测未来标记。图 3 展示了我们对 MTP 的实现。与 Gloeckle 等人(2024 年)不同,他们使用独立的输出头并行预测 D 个额外标记,我们则依次预测额外标记,并在每个预测深度保持完整的因果链。本节将介绍我们 MTP 实现的细节。

MTP Modules模块实现:使用多个顺序模块预测多个额外token,保持每个预测深度的完整因果链

MTP模块 (MTP Modules):使用D个顺序模块来预测D个额外token,保持每个预测深度的完整因果链。每个MTP模块包含共享的嵌入层、输出头和Transformer块以及一个投影矩阵。

To be specific, our MTP implementation uses $D$ sequential modules to predict $D$ additional tokens. The $k$-th MTP module consists of a shared embedding layer $\operatorname{Emb}(\cdot)$, a shared output head $\operatorname{OutHead}(\cdot)$, a Transformer block $\operatorname{TRM}_k(\cdot)$, and a projection matrix $M_k \in \mathbb{R}^{d \times 2d}$. For the $i$-th input token $t_i$, at the $k$-th prediction depth, we first combine the representation of the $i$-th token at the $(k-1)$-th depth $\mathbf{h}_i^{k-1} \in \mathbb{R}^d$ and the embedding of the $(i+k)$-th token $\operatorname{Emb}(t_{i+k}) \in \mathbb{R}^d$ with the linear projection:

where $[\cdot;\cdot]$ denotes concatenation. Especially, when $k=1$, $\mathbf{h}_i^{k-1}$ refers to the representation given by the main model. Note that for each MTP module, its embedding layer is shared with the main model. The combined $\mathbf{h}_i'^{k}$ serves as the input of the Transformer block at the $k$-th depth to produce the output representation at the current depth $\mathbf{h}_i^k$:

where $T$ represents the input sequence length and $i\!:\!j$ denotes the slicing operation (inclusive of both the left and right boundaries). Finally, taking $\mathbf{h}_i^k$ as the input, the shared output head will compute the probability distribution for the $k$-th additional prediction token $P_{i+1+k}^k \in \mathbb{R}^V$, where $V$ is the vocabulary size:

The output head $\operatorname{OutHead}(\cdot)$ linearly maps the representation to logits and subsequently applies the $\operatorname{Softmax}(\cdot)$ function to compute the prediction probabilities of the $k$-th additional token. Also, for each MTP module, its output head is shared with the main model. Our principle of maintaining the causal chain of predictions is similar to that of EAGLE (Li et al., 2024b), but its primary objective is speculative decoding (Xia et al., 2023; Leviathan et al., 2023), whereas we utilize MTP to improve training.

具体来说,我们的 MTP 实现使用 $D$ 个顺序模块来预测 $D$ 个额外的标记。第 $k$ 个 MTP 模块由共享嵌入层 $\operatorname{Emb}(\cdot)$、共享输出头 $\operatorname{OutHead}(\cdot)$、Transformer 块 $\operatorname{TRM}_k(\cdot)$ 和投影矩阵 $M_k \in \mathbb{R}^{d \times 2d}$ 组成。对于第 $i$ 个输入标记 $t_i$,在第 $k$ 个预测深度,我们首先将第 $i$ 个标记在第 $(k-1)$ 个深度的表示 $\mathbf{h}_i^{k-1} \in \mathbb{R}^d$ 与第 $(i+k)$ 个标记的嵌入 $\operatorname{Emb}(t_{i+k}) \in \mathbb{R}^d$ 通过线性投影进行组合:

其中 $[\cdot;\cdot]$ 表示拼接。特别地,当 $k=1$ 时,$\mathbf{h}_i^{k-1}$ 指的是主模型给出的表示。请注意,对于每个 MTP 模块,其嵌入层与主模型共享。组合后的 $\mathbf{h}_i'^{k}$ 作为第 $k$ 个深度的 Transformer 块的输入,以生成当前深度的输出表示 $\mathbf{h}_i^k$:

其中 $T$ 表示输入序列的长度,$i\!:\!j$ 表示切片操作(包含左右边界)。最后,以 $\mathbf{h}_i^k$ 作为输入,共享输出头将计算第 $k$ 个额外预测标记的概率分布 $P_{i+1+k}^k \in \mathbb{R}^V$,其中 $V$ 是词汇表大小。输出头 $\operatorname{OutHead}(\cdot)$ 将表示线性映射为 logits,然后应用 $\operatorname{Softmax}(\cdot)$ 函数来计算第 $k$ 个额外标记的预测概率。此外,对于每个 MTP 模块,其输出头与主模型共享。我们保持预测因果链的原则与 EAGLE(Li 等人,2024b)类似,但其主要目标是推测性解码(Xia 等人,2023;Leviathan 等人,2023),而我们利用 MTP 来改进训练。

MTP Training Objective训练目标:对每个预测深度计算交叉熵损失,并将其平均值作为额外的训练目标

MTP训练目标 (MTP Training Objective):对每个预测深度计算交叉熵损失,并将其平均值作为额外的训练目标。

For each prediction depth, we compute a cross-entropy loss $\mathcal{L}_{\text{MTP}}^k$:

where $T$ denotes the input sequence length, $t_i$ denotes the ground-truth token at the $i$-th position, and $P_i^k[t_i]$ denotes the corresponding prediction probability of $t_i$, given by the $k$-th MTP module. Finally, we compute the average of the MTP losses across all depths and multiply it by a weighting factor $\lambda$ to obtain the overall MTP loss $\mathcal{L}_{\text{MTP}}$, which serves as an additional training objective for DeepSeek-V3:

对于每个预测深度,我们计算一个交叉熵损失 $\mathcal{L}_{\text{MTP}}^k$:

其中 $T$ 表示输入序列的长度,$t_i$ 表示第 $i$ 个位置的真实标记,而 $P_i^k[t_i]$ 表示由第 $k$ 个 MTP 模块给出的 $t_i$ 的相应预测概率。最后,我们计算所有深度的 MTP 损失的平均值,并乘以一个加权因子 $\lambda$,以获得总体 MTP 损失 $\mathcal{L}_{\text{MTP}}$,它作为 DeepSeek-V3 的附加训练目标:
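下面用一个玩具规模的 PyTorch 草图串起 MTP 模块的前向过程与训练目标(深度 k=1):拼接上一深度的表示与右移的 token 嵌入、线性投影、过 Transformer 块、用共享输出头对第 i+1+k 个 token 计算交叉熵。其中的维度、TransformerEncoderLayer(未加因果掩码)等均为简化假设,仅用于说明数据流:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, vocab, D, lam = 512, 1000, 1, 0.3          # 玩具维度;DeepSeek-V3 实际 MTP 深度 D=1
embed = nn.Embedding(vocab, d)                # 与主模型共享的嵌入层 Emb(·)
out_head = nn.Linear(d, vocab, bias=False)    # 与主模型共享的输出头 OutHead(·)

def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class MTPModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2 * d, d, bias=False)                     # 投影矩阵 M_k
        self.trm = nn.TransformerEncoderLayer(d, 8, batch_first=True)   # 代替论文中的 Transformer 块(简化)

    def forward(self, h_prev, next_tok_emb):
        # 将上一深度的表示与第 (i+k) 个 token 的嵌入拼接后线性投影(拼接前各自归一化)
        x = self.proj(torch.cat([rms_norm(h_prev), rms_norm(next_tok_emb)], dim=-1))
        return self.trm(x)

tokens = torch.randint(0, vocab, (1, 16))      # (batch, T)
h_main = torch.randn(1, 16, d)                 # 主模型各位置的输出表示(此处用随机张量代替)

k, mtp = 1, MTPModule()
h_k = mtp(h_main[:, :-k], embed(tokens[:, k:]))        # 保持完整因果链:位置 i 的输入只依赖 t_{<=i+k}
logits = out_head(h_k)[:, :-1]                         # 位置 i 预测第 i+1+k 个 token
target = tokens[:, k + 1:]
loss_mtp_k = F.cross_entropy(logits.reshape(-1, vocab), target.reshape(-1))

loss_mtp = lam * loss_mtp_k / D     # 各深度损失取平均再乘以权重 λ,作为主损失之外的附加目标
print(loss_mtp.item())
```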

MTP in Inference推理中的 MTP:推理过程中可以丢弃MTP模块,或将其用于推测性解码以提高生成速度

MTP在推理中的应用:推理过程中可以丢弃MTP模块,或将其用于推测性解码以进一步提高生成速度。

Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. Additionally, we can also repurpose these MTP modules for speculative decoding to further improve the generation latency.

我们的 MTP 策略主要旨在提升主模型的性能,因此在推理过程中,我们可以直接舍弃 MTP 模块,主模型能够独立且正常地运行。此外,我们还可以将这些 MTP 模块重新用于推测性解码,以进一步降低生成延迟。

3、Infrastructures基础设施

本节内容主要介绍DeepSeek-V3模型的训练基础设施和训练框架,包括计算集群、训练框架的并行策略和优化、FP8训练以及推理部署和对未来硬件设计的建议。DeepSeek-V3的训练基础设施和框架通过采用多种并行策略细致的工程优化以及FP8训练等技术,实现了高效且经济的模型训练和推理部署。 同时,文章也对未来AI硬件设计提出了有益的建议

DeepSeek-V3使用了包含2048个NVIDIA H800 GPU的集群进行训练。 训练框架使用了16路流水线并行(PP)、64路专家并行(EP)和ZeRO-1数据并行(DP)。 为了提高训练效率,论文介绍了DualPipe算法,该算法通过计算与通信重叠来减少流水线气泡并隐藏通信开销。 此外,还开发了高效的跨节点全对全通信内核,并对内存占用进行了优化,避免了代价高昂的张量并行(TP)。 论文还详细介绍了FP8混合精度训练框架,包括混合精度策略、细粒度量化方法和高精度累加策略,以及低精度存储和通信策略。 最后,论文还介绍了DeepSeek-V3的推理部署策略,包括预填充和解码阶段的分离,以及冗余专家部署以确保负载平衡。

基础设施部分详细描述了DeepSeek-V3的训练和部署环境,以及为了提高效率所做的各种工程优化。 DualPipe算法和FP8混合精度训练框架是该部分的亮点,它们有效地解决了大型MoE模型训练中的通信瓶颈和内存问题。 对硬件设计的建议也体现了论文的实用性和前瞻性。

3.1 Compute Clusters计算集群硬件配置(采用2048个H800 GPU)、节点内部互联(每个节点包含8个通过NVLink和NVSwitch互连的GPU)、节点间互联(节点间使用InfiniBand (IB) 互连)

计算集群 (Compute Clusters): 使用2048个NVIDIA H800 GPU,每个节点包含8个通过NVLink和NVSwitch互连的GPU,节点间使用InfiniBand (IB) 进行互连。

>> 硬件配置:DeepSeek-V3使用一个配备了2048个NVIDIA H800 GPU的集群进行训练。

>> 节点内部互联:每个节点包含8个GPU,通过NVLink和NVSwitch进行内部互联。

>> 节点间互联:不同节点之间使用InfiniBand (IB) 进行通信。

DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected by NVLink and NVSwitch within nodes. Across different nodes, InfiniBand (IB) interconnects are utilized to facilitate communications.

DeepSeek-V3 在配备 2048 块 NVIDIA H800 GPU 的集群上进行训练。H800 集群中的每个节点包含 8 块通过 NVLink 和 NVSwitch 相互连接的 GPU。不同节点之间通过 InfiniBand(IB)互连来实现通信。

3.2 Training Framework训练框架:

框架和并行策略:使用高效轻量级的HAI-LLM训练框架,采用16路流水线并行 (PP)、跨8个节点的64路专家并行 (EP) 和ZeRO-1数据并行 (DP)。

The training of DeepSeek-V3 is supported by the HAI-LLM framework, an efficient and lightweight training framework crafted by our engineers from the ground up. On the whole, DeepSeek-V3 applies 16-way Pipeline Parallelism (PP) (Qi et al., 2023a), 64-way Expert Parallelism (EP) (Lepikhin et al., 2021) spanning 8 nodes, and ZeRO-1 Data Parallelism (DP) (Rajbhandari et al., 2020).

In order to facilitate efficient training of DeepSeek-V3, we implement meticulous engineering optimizations. Firstly, we design the DualPipe algorithm for efficient pipeline parallelism. Compared with existing PP methods, DualPipe has fewer pipeline bubbles. More importantly, it overlaps the computation and communication phases across forward and backward processes, thereby addressing the challenge of heavy communication overhead introduced by cross-node expert parallelism. Secondly, we develop efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors (SMs) dedicated to communication. Finally, we meticulously optimize the memory footprint during training, thereby enabling us to train DeepSeek-V3 without using costly Tensor Parallelism (TP).

DeepSeek-V3 的训练由 HAI-LLM 框架提供支持,这是由我们的工程师从头开始打造的一款高效且轻量级的训练框架。总体而言,DeepSeek-V3 应用了 16 路流水线并行(PP)(Qi 等人,2023a)、64 路专家并行(EP)(Lepikhin 等人,2021)跨越 8 个节点,以及 ZeRO-1 数据并行(DP)(Rajbhandari 等人,2020)。

为了促进 DeepSeek-V3 的高效训练,我们实施了细致的工程优化。首先,我们设计了 DualPipe 算法以实现高效的流水线并行。与现有的 PP 方法相比,DualPipe 的流水线气泡更少。更重要的是,它在前向和后向过程中重叠了计算和通信阶段,从而解决了跨节点专家并行引入的大量通信开销问题。其次,我们开发了高效的跨节点全对全通信内核,以充分利用 IB 和 NVLink 带宽,并节省专门用于通信的流式多处理器(SM)。最后,我们精心优化了训练期间的内存占用,从而能够在不使用昂贵的张量并行(TP)的情况下训练 DeepSeek-V3。

工程优化:DualPipe算法实现高效PP算法+高效的跨节点全对全通信内核(充分利用InfiniBand和NVLink带宽)+极度节省内存(重新计算RMSNorm和MLA上投影+在CPU中保存EMA参数+共享MTP模块和主模型的嵌入层和输出头等)

为了提高训练效率,进行了细致的工程优化。

DualPipe算法

>> DualPipe算法:高效的流水线并行算法,减少流水线气泡,并通过计算-通信重叠隐藏大部分通信。

一种创新的流水线并行算法,减少流水线气泡,并通过计算与通信的重叠来解决跨节点专家并行带来的高通信开销问题。 它将每个chunk分成四个部分:attention、all-to-all dispatch、MLP和all-to-all combine。反向传播中,attention和MLP进一步细分为backward for input和backward for weights。通过重新排列这些组件并手动调整GPU SMs用于通信与计算的比例,实现了all-to-all和PP通信的完全隐藏。DualPipe采用双向流水线调度,同时从流水线的两端馈送微批次,显著提高了效率,并且在模型进一步扩展时,只要保持恒定的计算与通信比,就能在节点间使用细粒度的专家,同时实现接近于零的全对全通信开销。与其他流水线并行方法相比,DualPipe显著减少了流水线气泡,同时仅增加了峰值激活内存。

高效的跨节点全对全通信内核

>> 高效的跨节点全对全通信内核 (Efficient Implementation of Cross-Node All-to-All Communication):充分利用InfiniBand和NVLink带宽,节省用于通信的流多处理器 (SMs)。

高效的跨节点全对全通信内核:充分利用IB和NVLink带宽,节省用于通信的SMs。通过限制每个token最多发送到4个节点,减少了IB流量;利用NVLink进行节点内通信,实现了IB和NVLink通信的完全重叠。每个token平均可以高效地选择每个节点3.2个专家,而不会产生额外的NVLink开销。采用warp specialization技术,将20个SMs划分为10个通信通道,动态调整分配给每个通信任务的warp数量,并通过自定义PTX指令和自动调整通信块大小来减少L2缓存的使用和对其他SMs的干扰。

极度节省内存

>> 极度节省内存 (Extremely Memory Saving with Minimal Overhead):通过重新计算RMSNorm和MLA上投影、在CPU中保存EMA参数,以及共享MTP模块和主模型的嵌入层和输出头等技术来减少内存占用,避免使用代价高昂的张量并行 (Tensor Parallelism)。

通过重新计算RMSNorm和MLA上投影、在CPU中保存EMA参数以及共享MTP模块和主模型的嵌入层和输出头等技术来减少内存占用,避免使用代价高昂的张量并行 (TP)。

3.2.1 DualPipe and Computation-Communication Overlap双管道与计算通信重叠

For DeepSeek-V3, the communication overhead introduced by cross-node expert parallelism results in an inefficient computation-to-communication ratio of approximately 1:1. To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces the pipeline bubbles.

The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. As illustrated in Figure 4, for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation. In this overlapping strategy, we can ensure that both all-to-all and PP communication can be fully hidden during execution. Given the efficient overlapping strategy, the full DualPipe scheduling is illustrated in Figure 5. It employs a bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously and a significant portion of communications can be fully overlapped. This overlap also ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.

对于 DeepSeek-V3,跨节点专家并行引入的通信开销导致计算与通信的比例约为 1:1,效率低下。为解决这一挑战,我们设计了一种创新的流水线并行算法,称为双管道(DualPipe),它不仅通过有效重叠前向和后向计算通信阶段来加速模型训练,还减少了流水线气泡。

双管道的关键思想是在一对单独的前向和后向块内重叠计算和通信。具体而言,我们将每个块分为四个部分:注意力、全对全分发、多层感知机(MLP)和全对全合并。特别地,对于后向块,注意力和 MLP 进一步分为两部分,即输入的后向和权重的后向,就像在 ZeroBubble(Qi 等人,2023b)中那样。此外,我们还有一个 PP 通信组件。如图 4 所示,对于一对前向和后向块,我们重新排列这些组件,并手动调整用于通信与计算的 GPU SM 比例。在这种重叠策略中,我们可以确保在执行期间全对全通信和 PP 通信都能被完全隐藏。鉴于这种高效的重叠策略,完整的双管道调度如图 5 所示。它采用双向流水线调度,同时从流水线的两端输入微批次,并且大部分通信都能完全重叠。这种重叠还确保了,随着模型进一步扩大规模,只要我们保持恒定的计算与通信比例,我们仍能在节点间使用细粒度专家,同时实现近乎零的全对全通信开销。
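下面给出一个极简的 Python 示意(并非 DeepSeek 的实际实现),用于说明上文的拆分与重叠思路:把一对前向/后向块的组件交错排布,使其中一个块的全对全/PP 通信隐藏在另一个块的计算之后。其中后向组件的排列顺序是我们对图 4 的一种合理推测,仅作示意。

```python
from collections import deque

# 一对前向/后向块的组件(后向块的 attention/MLP 进一步拆成对输入和对权重的反向)
FORWARD  = ["attn_fwd", "all2all_dispatch_fwd", "mlp_fwd", "all2all_combine_fwd"]
BACKWARD = ["all2all_combine_bwd", "mlp_bwd_input", "mlp_bwd_weight",
            "all2all_dispatch_bwd", "attn_bwd_input", "attn_bwd_weight"]
COMM = {s for s in FORWARD + BACKWARD if s.startswith("all2all")}

def overlapped_schedule(fwd, bwd):
    """若两个块的下一步恰好一个是计算、一个是通信,则在同一时隙并发执行。"""
    fwd, bwd = deque(fwd), deque(bwd)
    slots = []
    while fwd or bwd:
        a = fwd[0] if fwd else None
        b = bwd[0] if bwd else None
        if a and b and ((a in COMM) != (b in COMM)):
            slots.append((a, b)); fwd.popleft(); bwd.popleft()
        elif a:
            slots.append((a, None)); fwd.popleft()
        else:
            slots.append((None, b)); bwd.popleft()
    return slots

for fwd_step, bwd_step in overlapped_schedule(FORWARD, BACKWARD):
    print(f"{str(fwd_step):24s} || {bwd_step}")
```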

Figure 4: Overlapping strategy for a pair of individual forward and backward chunks (the boundaries of the transformer blocks are not aligned). Orange denotes forward, green denotes "backward for input", blue denotes "backward for weights", purple denotes PP communication, and red denotes barriers. Both all-to-all and PP communication can be fully hidden.图 4:一对单独的前向和后向块的重叠策略(Transformer 块的边界未对齐)。橙色表示前向,绿色表示“输入的后向”,蓝色表示“权重的后向”,紫色表示 PP 通信,红色表示屏障。全对全和 PP 通信都可以完全隐藏。
Figure 5: Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.图 5:8 个 PP 级别和 20 个微批次在两个方向上的 DualPipe 调度示例。反向的微批次与正向的微批次对称,为简化说明,我们省略了它们的批次 ID。由共享黑色边框包围的两个单元格具有相互重叠的计算和通信。

In addition, even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages. In Table 2, we summarize the pipeline bubbles and memory usage across different PP methods. As shown in the table, compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by 1/PP times. Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.

此外,即使在没有沉重通信负担的更一般场景中,DualPipe 仍表现出效率优势。在表 2 中,我们总结了不同 PP 方法的流水线气泡和内存使用情况。如表所示,与 ZB1P(Qi 等人,2023b)和 1F1B(Harlap 等人,2018)相比,DualPipe 显著减少了流水线气泡,同时仅将峰值激活内存增加了 1/PP 倍。尽管 DualPipe 需要保存模型参数的两份副本,但由于我们在训练期间使用了较大的 EP 大小,因此这并不会显著增加内存消耗。与 Chimera(Li 和 Hoefler,2021)相比,DualPipe 只要求流水线阶段数和微批次数能被 2 整除,而不要求微批次数能被流水线阶段数整除。此外,对于 DualPipe 而言,无论微批次数量如何增加,气泡和激活内存都不会增加。

Table 2: Comparison of pipeline bubbles and memory usage across different pipeline parallel methods. F denotes the execution time of a forward chunk, B denotes the execution time of a full backward chunk, W denotes the execution time of a "backward for weights" chunk, and F&B denotes the execution time of two mutually overlapped forward and backward chunks.表 2:不同流水线并行方法的流水线气泡和内存使用情况比较。F 表示前向块的执行时间,B 表示完整后向块的执行时间,W 表示“权重后向”块的执行时间,F&B 表示两个相互重叠的前向和后向块的执行时间。

3.2.2 Efficient Implementation of Cross-Node All-to-All Communication跨节点全对全通信的高效实现

In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to specific GPUs that host their target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes × 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.

为了确保 DualPipe 具备足够的计算性能,我们定制了高效的跨节点全对全通信内核(包括分发和合并),以减少专门用于通信的流式多处理器(SM)数量。内核的实现与 MoE 门控算法以及我们集群的网络拓扑结构协同设计。具体而言,在我们的集群中,跨节点 GPU 通过 IB 全互连,而节点内的通信则通过 NVLink 处理。NVLink 提供 160GB/s 的带宽,约为 IB(50GB/s)的 3.2 倍。为了有效利用 IB 和 NVLink 不同的带宽,我们将每个标记最多分发到 4 个节点,从而减少 IB 流量。对于每个标记,在其路由决策确定后,首先通过 IB 传输到目标节点上具有相同节点内索引的 GPU。一旦到达目标节点,我们将努力确保其通过 NVLink 瞬时转发到承载其目标专家的特定 GPU,而不会被随后到达的标记阻塞。通过这种方式,IB 和 NVLink 之间的通信实现了完全重叠,每个标记能够平均在每个节点上高效选择 3.2 个专家,且不会因 NVLink 而产生额外开销。这意味着,尽管 DeepSeek-V3 实际上仅选择 8 个路由专家,但它能够将此数量扩展到最多 13 个专家(4 个节点×每个节点 3.2 个专家),同时保持相同的通信成本。总体而言,在这种通信策略下,仅 20 个 SM 就足以充分利用 IB 和 NVLink 的带宽。
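下面是一个玩具级 Python 草图(并非实际的通信内核),演示上文描述的两跳路由规则:token 先经 IB 发送到目标节点上与源 GPU 节点内索引相同的 GPU,再经 NVLink 转发到承载目标专家的 GPU,且目标节点数不超过 4。其中“每 GPU 4 个路由专家、8 节点×8 GPU”的布局取自训练部署的描述,专家编号到 GPU 的映射方式是我们为演示而作的假设。

```python
GPUS_PER_NODE = 8          # 每节点 8 块 GPU
EXPERTS_PER_GPU = 4        # 假设:256 个路由专家均匀放在 8 节点 x 8 GPU = 64 块 GPU 上

def dispatch_plan(src_node, src_gpu, expert_ids, max_nodes=4):
    """返回每个目标专家的 (IB 跳, NVLink 跳);IB 跳先落到目标节点上同节点内索引的 GPU。"""
    plan, target_nodes = [], set()
    for e in expert_ids:
        gpu_global = e // EXPERTS_PER_GPU
        dst_node, dst_gpu = divmod(gpu_global, GPUS_PER_NODE)
        target_nodes.add(dst_node)
        assert len(target_nodes) <= max_nodes, "一个 token 最多只应发往 4 个节点"
        ib_hop = None if dst_node == src_node else (dst_node, src_gpu)
        plan.append((e, ib_hop, (dst_node, dst_gpu)))
    return plan

# 示例:位于节点 0、GPU 3 的一个 token,被路由到分布在 4 个节点上的 8 个专家
for expert, ib, nv in dispatch_plan(0, 3, [5, 12, 40, 47, 70, 77, 100, 107]):
    print(f"expert {expert:3d}: IB -> {ib}, NVLink -> {nv}")
```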

In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.

具体来说,我们采用 warp 专业化技术(Bauer 等人,2014 年),将 20 个 SM 分为 10 个通信通道。在调度过程中,(1)IB 发送、(2)IB 到 NVLink 转发以及(3)NVLink 接收分别由各自的 warp 处理。分配给每个通信任务的 warp 数量会根据所有 SM 上的实际工作负载动态调整。同样,在组合过程中,(1)NVLink 发送、(2)NVLink 到 IB 转发和累加以及(3)IB 接收和累加也由动态调整的 warp 处理。此外,调度和合并内核与计算流存在重叠,因此我们还考虑了它们对其他流式多处理器(SM)计算内核的影响。具体而言,我们采用了定制的 PTX(并行线程执行)指令,并自动调整了通信块的大小,这显著减少了 L2 缓存的使用,并降低了对其他 SM 的干扰。

3.2.3 Extremely Memory Saving with Minimal Overhead极大节省内存且开销极小

In order to reduce the memory footprint during training, we employ the following techniques.

为了在训练期间减少内存占用,我们采用了以下技术。

Recomputation of RMSNorm and MLA Up-Projection重新计算 RMSNorm 和 MLA 上投影

We recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces memory requirements for storing activations.

在反向传播期间,我们重新计算所有 RMSNorm 操作和 MLA 上投影,从而无需持久存储其输出激活。虽然开销较小,但此策略显著减少了存储激活所需的内存。
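下面用 PyTorch 的通用重计算接口 torch.utils.checkpoint 给出一个最小示意(假设性示例,非 DeepSeek 的实际代码;RMSNorm 为参考实现,维度仅作演示),说明“反向传播时重算、前向不保存输出激活”这一做法。

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class RMSNorm(nn.Module):
    """参考实现,仅用于演示。"""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

hidden = 1024                                        # 演示用维度,非真实配置
norm = RMSNorm(hidden)
up_proj = nn.Linear(hidden, 4 * hidden, bias=False)  # 充当某个上投影的占位

x = torch.randn(2, hidden, requires_grad=True)

# use_reentrant=False:前向不保存 norm/up_proj 的输出激活,反向时重新计算
y = checkpoint(lambda t: up_proj(norm(t)), x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)        # torch.Size([2, 1024])
```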

Exponential Moving Average in CPU在 CPU 中使用指数移动平均

During training, we preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead.

在训练期间,我们保存模型参数的指数移动平均值(EMA),以便在学习率衰减后早期估计模型性能。EMA 参数存储在 CPU 内存中,并在每次训练步骤后异步更新。这种方法使我们能够维护 EMA 参数,而不会产生额外的内存或时间开销。
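以下是一个单进程 PyTorch 草图(假设性示例),演示“EMA 参数保存在 CPU 内存、每个训练步之后更新”的思路;衰减系数 0.999 以及这里的同步写法均为演示假设,论文只说明更新是异步进行的。

```python
import torch
import torch.nn as nn

@torch.no_grad()
def init_ema(model):
    # EMA 副本放在 CPU,不占用 GPU 显存
    return {n: p.detach().to("cpu").clone() for n, p in model.named_parameters()}

@torch.no_grad()
def update_ema(ema, model, decay=0.999):
    # 论文中该更新在每个训练步之后异步执行;此处为清晰起见写成同步
    for n, p in model.named_parameters():
        ema[n].mul_(decay).add_(p.detach().to("cpu"), alpha=1.0 - decay)

model = nn.Linear(16, 16)
ema = init_ema(model)
# ...... 一次训练步(forward / backward / optimizer.step())......
update_ema(ema, model)
print(next(iter(ema.values())).device)   # cpu
```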

Shared Embedding and Output Head for Multi-Token Prediction.多标记预测的共享嵌入和输出头

With the DualPipe strategy, we deploy the shallowest layers (including the embedding layer) and deepest layers (including the output head) of the model on the same PP rank. This arrangement enables the physical sharing of parameters and gradients, of the shared embedding and output head, between the MTP module and the main model. This physical sharing mechanism further enhances our memory efficiency.

借助 DualPipe 策略,我们将模型的最浅层(包括嵌入层)和最深层(包括输出头)部署在相同的 PP 等级上。这种安排使得 MTP 模块和主模型之间能够共享嵌入层和输出层的参数及梯度。这种物理共享机制进一步提高了我们的内存效率。

3.3 FP8 Training训练:基于FP8混合精度框架+细粒度量化+提高累加精度+低精度存储和通信

混合精度框架

>> 混合精度框架 (Mixed Precision Framework):大多数计算密集型操作使用FP8,一些关键操作保持原始精度,平衡训练效率和数值稳定性。

提出了一种细粒度的混合精度框架,利用FP8数据格式进行训练,平衡训练效率和数值稳定性。大多数计算密集型操作使用FP8,一些关键操作保持原始精度。

细粒度量化

>> 细粒度量化 (Fine-Grained Quantization):针对激活和权重采用基于tile和block的细粒度量化策略,以减轻异常值的影响。

为了扩展FP8格式的动态范围并减轻异常值的影响,引入了细粒度的量化策略:tile-wise分组 (1×Nc元素) 或block-wise分组 (Nc×Nc元素)。

提高累加精度

>> 提高累加精度 (Increasing Accumulation Precision):通过将部分结果提升到CUDA核心进行FP32累加来提高FP8 GEMM的精度。

为了解决FP8 GEMM的累加精度受限问题,采用将部分结果提升到CUDA核心进行FP32累加的策略。设置NC=128元素,可以在不引入额外开销的情况下显著提高精度。

低精度存储和通信

>> 低精度存储和通信 (Low-Precision Storage and Communication):将缓存的激活和优化器状态压缩为低精度格式,以减少内存和通信开销。

使用BF16格式存储优化器状态,使用FP8格式缓存激活,并对部分操作采用更高精度,以减少内存和通信开销。

Inspired by recent advances in low-precision training (Peng et al., 2023b; Dettmers et al., 2022; Noune et al., 2022), we propose a fine-grained mixed precision framework utilizing the FP8 data format for training DeepSeek-V3. While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients (Sun et al., 2024; He et al.,; Fishman et al., 2024). Although significant progress has been made in inference quantization (Xiao et al., 2023; Frantar et al., 2022), there are relatively few studies demonstrating successful application of low-precision techniques in large-scale language model pre-training (Fishman et al., 2024). To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with 1×Nc elements or block-wise grouping with Nc×Nc elements. The associated dequantization overhead is largely mitigated under our increased-precision accumulation process, a critical aspect for achieving accurate FP8 General Matrix Multiplication (GEMM). Moreover, to further reduce memory and communication overhead in MoE training, we cache and dispatch activations in FP8, while storing low-precision optimizer states in BF16. We validate the proposed FP8 mixed precision framework on two model scales similar to DeepSeek-V2-Lite and DeepSeek-V2, training for approximately 1 trillion tokens (see more details in Appendix B.1). Notably, compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below 0.25%, a level well within the acceptable range of training randomness.

受近期低精度训练进展(Peng 等人,2023b;Dettmers 等人,2022;Noune 等人,2022)的启发,我们提出了一种利用 FP8 数据格式对 DeepSeek-V3 进行训练的细粒度混合精度框架。尽管低精度训练前景广阔,但其往往受限于激活值、权重和梯度中异常值的存在(Sun 等人,2024;He 等人;Fishman 等人,2024)。尽管在推理量化方面已取得显著进展(Xiao 等人,2023;Frantar 等人,2022),但在大规模语言模型预训练中成功应用低精度技术的研究相对较少(Fishman 等人,2024)。为应对这一挑战并有效扩展 FP8 格式的动态范围,我们引入了一种细粒度量化策略:采用 1×Nc 元素的分块分组或 Nc×Nc 元素的块分组。在我们提高精度的累加机制下,相关的去量化开销得到了很大程度的缓解,这是实现准确的 FP8 通用矩阵乘法(GEMM)的关键方面。此外,为了进一步降低 MoE 训练中的内存和通信开销,我们在 FP8 中缓存和分发激活值,同时以 BF16 格式存储低精度优化器状态。我们在两个类似于 DeepSeek-V2-Lite 和 DeepSeek-V2 的模型规模上验证了所提出的 FP8 混合精度框架,训练了大约 1 万亿个标记(更多细节见附录 B.1)。值得注意的是,与 BF16 基线相比,我们的 FP8 训练模型的相对损失误差始终低于 0.25%,这一水平完全在训练随机性的可接受范围内。

Figure 6: The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated.图 6:采用 FP8 数据格式的整体混合精度框架。为便于说明,仅展示了线性(Linear)运算符。

3.3.1 Mixed Precision Framework混合精度框架

Building upon widely adopted techniques in low-precision training (Kalamkar et al., 2019; Narang et al., 2017), we propose a mixed precision framework for FP8 training. In this framework, most compute-density operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability. The overall framework is illustrated in Figure 6.

Firstly, in order to accelerate model training, the majority of core computation kernels, i.e., GEMM operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. This design theoretically doubles the computational speed compared with the original BF16 method. Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass. This significantly reduces memory consumption.

基于低精度训练中广泛采用的技术(Kalamkar 等人,2019 年;Narang 等人,2017 年),我们提出了一种用于 FP8 训练的混合精度框架。在该框架中,大多数计算密集型操作以 FP8 进行,而少数关键操作则策略性地保持其原始数据格式,以平衡训练效率和数值稳定性。整体框架如图 6 所示。

首先,为了加速模型训练,大多数核心计算内核,即 GEMM 操作,均以 FP8 精度实现。这些 GEMM 操作接受 FP8 张量作为输入,并产生 BF16 或 FP32 格式的输出。如图 6 所示,与线性算子相关的三个 GEMM 操作,即 Fprop(前向传播)、Dgrad(激活反向传播)和 Wgrad(权重反向传播),均在 FP8 中执行。与原始的 BF16 方法相比,这种设计理论上将计算速度提高了一倍。此外,FP8 的 Wgrad GEMM 允许激活值以 FP8 格式存储,以便在反向传播中使用。这显著减少了内存消耗。

Despite the efficiency advantage of the FP8 format, certain operators still require a higher precision due to their sensitivity to low-precision computations. Besides, some low-cost operators can also utilize a higher precision with a negligible overhead to the overall training cost. For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. These targeted retentions of high precision ensure stable training dynamics for DeepSeek-V3. To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. While these high-precision components incur some memory overheads, their impact can be minimized through efficient sharding across multiple DP ranks in our distributed training system.

尽管 FP8 格式具有效率优势,但某些运算符由于对低精度计算敏感,仍需要更高的精度。此外,一些低成本运算符在整体训练成本中增加的开销可以忽略不计的情况下,也可以使用更高的精度。因此,经过仔细研究,我们对以下组件保持了原有的精度(例如 BF16 或 FP32):嵌入模块、输出头、MoE 门控模块、归一化运算符和注意力运算符。这些有针对性地保留高精度的操作确保了 DeepSeek-V3 训练过程的稳定性。为了进一步保证数值稳定性,我们将主权重、权重梯度和优化器状态以更高的精度存储。虽然这些高精度组件会带来一些内存开销,但通过在我们的分布式训练系统中跨多个数据并行(DP)等级进行高效分片,可以将其影响降至最低。

Figure 7: (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of NC=128 elements MMA for the high-precision accumulation.图 7:(a)我们提出了一种细粒度量化方法来减轻由特征异常值引起的量化误差;为便于说明,仅展示了前向传播(Fprop)。(b)结合我们的量化策略,我们以 NC=128 个元素的 MMA 为间隔,将部分结果提升到 CUDA 核心进行高精度累加,从而提高 FP8 GEMM 的精度。

3.3.2 Improved Precision from Quantization and Multiplication量化与乘法运算提升精度

Based on our mixed precision FP8 framework, we introduce several strategies to enhance low-precision training accuracy, focusing on both the quantization method and the multiplication process.

基于我们的混合精度 FP8 框架,我们引入了若干策略来提高低精度训练的准确性,重点在于量化方法和乘法运算过程。

Fine-Grained Quantization精细量化

In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. As a standard practice, the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8 (Narang et al., 2017). This method makes low-precision training highly sensitive to activation outliers, which can heavily degrade quantization accuracy. To solve this, we propose a fine-grained quantization method that applies scaling at a more granular level. As illustrated in Figure 7 (a), (1) for activations, we group and scale elements on a 1x128 tile basis (i.e., per token per 128 channels); and (2) for weights, we group and scale elements on a 128x128 block basis (i.e., per 128 input channels per 128 output channels). This approach ensures that the quantization process can better accommodate outliers by adapting the scale according to smaller groups of elements. In Appendix B.2, we further discuss the training instability when we group and scale activations on a block basis in the same way as weights quantization.

在低精度训练框架中,由于 FP8 格式的动态范围有限(受其指数位数减少的限制),溢出和下溢是常见的挑战。作为标准做法,通过将输入张量的最大绝对值缩放到 FP8 格式可表示的最大值来对齐输入分布(Narang 等人,2017 年)。这种方法使得低精度训练对激活异常值高度敏感,这会严重降低量化精度。为了解决这个问题,我们提出了一种精细量化方法,在更细粒度的层面上应用缩放。如图 7(a)所示,(1)对于激活值,我们在 1x128 块的基础上分组和缩放元素(即每个标记每 128 个通道);(2)对于权重,我们在 128x128 块的基础上分组和缩放元素(即每 128 个输入通道每 128 个输出通道)。这种方法通过根据更小的元素组调整缩放比例,确保量化过程能够更好地适应异常值。在附录 B.2 中,我们进一步探讨了在与权重量化相同的方式下,以块为单位对激活进行分组和缩放时出现的训练不稳定问题。
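下面用 numpy 给出细粒度量化分组的示意(假设性示例):激活按 1×128 tile、权重按 128×128 block 各自计算缩放因子;真实的 FP8 转换在此用“缩放到 E4M3 可表示范围(最大值 448)”来代替,仅演示分组与缩放的粒度。

```python
import numpy as np

FP8_E4M3_MAX = 448.0    # E4M3 可表示的最大值

def quantize_activation_tiles(x, tile=128):
    """x: [token 数, 通道数];每个 token 的每 128 个通道共用一个缩放因子。"""
    t, c = x.shape
    xr = x.reshape(t, c // tile, tile)
    scale = np.abs(xr).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)
    return (xr / scale).reshape(t, c), scale.squeeze(-1)

def quantize_weight_blocks(w, block=128):
    """w: [输出通道, 输入通道];每个 128x128 块共用一个缩放因子。"""
    o, i = w.shape
    wr = w.reshape(o // block, block, i // block, block)
    scale = np.abs(wr).max(axis=(1, 3), keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)
    return (wr / scale).reshape(o, i), scale.squeeze(axis=(1, 3))

x = np.random.randn(4, 256).astype(np.float32)
w = np.random.randn(256, 256).astype(np.float32)
xq, x_scales = quantize_activation_tiles(x)
wq, w_scales = quantize_weight_blocks(w)
print(x_scales.shape, w_scales.shape)    # (4, 2) (2, 2)
```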

One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM. However, combined with our precise FP32 accumulation strategy, it can be efficiently implemented.

Notably, our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while the Tensor Cores of NVIDIA next-generation GPUs (Blackwell series) have announced the support for microscaling formats with smaller quantization granularity (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures.

我们方法的一个关键改进是引入了沿 GEMM 操作内部维度的每组缩放因子。此功能在标准 FP8 GEMM 中未直接得到支持。然而,结合我们的精确 FP32 累积策略,它可以高效实现。

值得注意的是,我们的细粒度量化策略与微缩放格式的理念高度一致(Rouhani 等人,2023b),而 NVIDIA 下一代 GPU(Blackwell 系列)的 Tensor Cores 已宣布支持具有更小量化粒度的微缩放格式(NVIDIA,2024a)。我们希望我们的设计能够为未来的工作提供参考,以跟上最新的 GPU 架构。

Increasing Accumulation Precision提高累加精度

Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is commonly performed in an FP32 precision (Kalamkar et al., 2019; Narang et al., 2017). However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits, which is significantly lower than FP32 accumulation precision. This problem will become more pronounced when the inner dimension K is large (Wortsman et al., 2023), a typical scenario in large-scale model training where the batch size and model width are increased. Taking GEMM operations of two random matrices with K = 4096 for example, in our preliminary test, the limited accumulation precision in Tensor Cores results in a maximum relative error of nearly 2%. Despite these problems, the limited accumulation precision is still the default option in a few FP8 frameworks (NVIDIA, 2024b), severely constraining the training accuracy.

低精度的 GEMM 运算常常会遇到下溢问题,其精度在很大程度上取决于高精度累加,通常以 FP32 精度进行(Kalamkar 等人,2019 年;Narang 等人,2017 年)。然而,我们观察到在 NVIDIA H800 GPU 上,FP8 GEMM 的累加精度仅能保留约 14 位,这明显低于 FP32 的累加精度。当内维 K 较大时(Wortsman 等人,2023 年),这一问题会更加突出,这是大规模模型训练中常见的场景,此时批量大小和模型宽度都会增加。以两个随机矩阵的 GEMM 运算为例,K = 4096,在我们的初步测试中,Tensor Cores 中有限的累加精度导致最大相对误差接近 2%。尽管存在这些问题,但在少数 FP8 框架中(NVIDIA,2024 年 b),有限的累加精度仍是默认选项,这严重限制了训练精度。

In order to address this issue, we adopt the strategy of promotion to CUDA Cores for higher precision (Thakkar et al., 2023). The process is illustrated in Figure 7 (b). To be specific, during MMA (Matrix Multiply-Accumulate) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of NC is reached, these partial results will be copied to FP32 registers on CUDA Cores, where full-precision FP32 accumulation is performed. As mentioned before, our fine-grained quantization applies per-group scaling factors along the inner dimension K. These scaling factors can be efficiently multiplied on the CUDA Cores as the dequantization process with minimal additional computational cost.

It is worth noting that this modification reduces the WGMMA (Warpgroup-level Matrix Multiply-Accumulate) instruction issue rate for a single warpgroup. However, on the H800 architecture, it is typical for two WGMMA to persist concurrently: while one warpgroup performs the promotion operation, the other is able to execute the MMA operation. This design enables overlapping of the two operations, maintaining high utilization of Tensor Cores. Based on our experiments, setting NC=128 elements, equivalent to 4 WGMMAs, represents the minimal accumulation interval that can significantly improve precision without introducing substantial overhead.

为了解决这个问题,我们采用了提升到 CUDA Cores 以获得更高精度的策略(Thakkar 等人,2023 年)。该过程在图 7(b)中有所展示。具体来说,在 Tensor 核心上执行矩阵乘累加(MMA)操作期间,中间结果会使用有限的位宽进行累加。一旦达到 NC 的间隔,这些部分结果将被复制到 CUDA 核心上的 FP32 寄存器中,在那里执行全精度的 FP32 累加。如前所述,我们的细粒度量化在内部维度 K 上应用每组缩放因子。这些缩放因子可以在 CUDA 核心上高效地进行乘法运算,作为去量化的过程,且几乎不会增加额外的计算成本。

值得注意的是,这种修改降低了单个线程束组(warpgroup)的 WGMMA(线程束组级矩阵乘累加)指令发出率。然而,在 H800 架构中,通常会有两个 WGMMA 同时存在:当一个线程束组执行提升操作时,另一个线程束组能够执行 MMA 操作。这种设计使得这两种操作能够重叠,从而保持 Tensor 核心的高利用率。根据我们的实验,将 NC 设置为 128 个元素(相当于 4 个 WGMMA)是能够显著提高精度且不会引入过多开销的最小累加间隔。
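下面是一个说明性的 numpy 玩具示例(并非 WGMMA 层面的真实机制):用 float16 模拟 Tensor Core 的有限精度累加,每累加 NC=128 个元素就把部分和提升到 FP32 累加器,并与全程低精度累加的相对误差作对比。

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4096
a = rng.standard_normal(K).astype(np.float32)
b = rng.standard_normal(K).astype(np.float32)
reference = np.dot(a.astype(np.float64), b.astype(np.float64))   # 高精度参考值

def low_precision_dot(a, b):
    """全程用 float16 累加(模拟受限的累加精度)。"""
    acc = np.float16(0.0)
    for x, y in zip(a, b):
        acc = np.float16(acc + np.float16(x) * np.float16(y))
    return float(acc)

def promoted_dot(a, b, nc=128):
    """每 NC 个元素把低精度部分和提升到 FP32 累加器。"""
    acc32 = np.float32(0.0)
    for start in range(0, len(a), nc):
        partial = np.float16(0.0)
        for x, y in zip(a[start:start + nc], b[start:start + nc]):
            partial = np.float16(partial + np.float16(x) * np.float16(y))
        acc32 += np.float32(partial)
    return float(acc32)

for name, value in [("fp16 全程累加", low_precision_dot(a, b)),
                    ("每 128 元素提升", promoted_dot(a, b))]:
    print(f"{name}: 相对误差 = {abs(value - reference) / abs(reference):.2e}")
```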

Mantissa over Exponents尾数与指数

In contrast to the hybrid FP8 format adopted by prior work (NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b), which uses E4M3 (4-bit exponent and 3-bit mantissa) in Fprop and E5M2 (5-bit exponent and 2-bit mantissa) in Dgrad and Wgrad, we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile and block-wise scaling. By operating on smaller element groups, our methodology effectively shares exponent bits among these grouped elements, mitigating the impact of the limited dynamic range.

与先前工作(NVIDIA,2024b;Peng 等人,2023b;Sun 等人,2019b)所采用的混合 FP8 格式(在 Fprop 中使用 E4M3(4 位指数和 3 位尾数),在 Dgrad 和 Wgrad 中使用 E5M2(5 位指数和 2 位尾数))不同,我们在所有张量上采用 E4M3 格式以获得更高的精度。我们认为这种方法可行的原因在于我们的细粒度量化策略,即按 tile 和按 block 进行缩放。通过在较小的元素组上操作,我们的方法能够有效地在这些分组元素之间共享指数位,从而减轻有限动态范围的影响。

Online Quantization在线量化

Delayed quantization is employed in tensor-wise quantization frameworks (NVIDIA, 2024b; Peng et al., 2023b), which maintains a history of the maximum absolute values across prior iterations to infer the current value. In order to ensure accurate scales and simplify the framework, we calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block. Based on it, we derive the scaling factor and then quantize the activation or weight online into the FP8 format.

在张量级量化框架中(NVIDIA,2024b;Peng 等人,2023b)采用了延迟量化,它会保存先前迭代中最大绝对值的历史记录以推断当前值。为了确保准确的比例因子并简化框架,我们针对每个 1×128 激活块或 128×128 权重块在线计算最大绝对值。基于此,我们推导出缩放因子,然后在线将激活或权重量化为 FP8 格式。

3.3.3 Low-Precision Storage and Communication低精度存储与通信

In conjunction with our FP8 training framework, we further reduce the memory consumption and communication overhead by compressing cached activations and optimizer states into lower-precision formats.

结合我们的 FP8 训练框架,我们通过将缓存的激活值和优化器状态压缩为更低精度的格式,进一步降低了内存消耗和通信开销。

Low-Precision Optimizer States低精度优化器状态

We adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW (Loshchilov and Hutter, 2017) optimizer, without incurring observable performance degradation. However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training.

我们采用 BF16 数据格式而非 FP32 来追踪 AdamW(Loshchilov 和 Hutter,2017)优化器中的第一和第二矩,而不会造成可观察到的性能下降。不过,主权重(由优化器存储)和梯度(用于批量大小累积)仍以 FP32 格式保留,以确保整个训练过程中的数值稳定性。
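以下是对上述存储划分的一个假设性 PyTorch 草图:AdamW 的一阶/二阶矩以 BF16 保存,主权重与梯度保持 FP32;β1=0.9、β2=0.95、weight_decay=0.1 取自 4.2 节,其余写法均为演示用的简化实现,并非 DeepSeek 的优化器代码。

```python
import torch

class LowPrecisionAdamW:
    """演示用:BF16 保存一阶/二阶矩,FP32 保存主权重与梯度。"""
    def __init__(self, params, lr=2.2e-4, betas=(0.9, 0.95), eps=1e-8, wd=0.1):
        self.params = list(params)                     # FP32 主权重
        self.lr, self.betas, self.eps, self.wd = lr, betas, eps, wd
        self.m = [torch.zeros_like(p, dtype=torch.bfloat16) for p in self.params]
        self.v = [torch.zeros_like(p, dtype=torch.bfloat16) for p in self.params]
        self.t = 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = self.betas
        for p, m, v in zip(self.params, self.m, self.v):
            g = p.grad.float()                          # FP32 梯度
            m.mul_(b1).add_(g.to(torch.bfloat16), alpha=1 - b1)
            v.mul_(b2).add_((g * g).to(torch.bfloat16), alpha=1 - b2)
            m_hat = m.float() / (1 - b1 ** self.t)      # 参与更新时转回 FP32
            v_hat = v.float() / (1 - b2 ** self.t)
            p.mul_(1 - self.lr * self.wd)               # 解耦权重衰减
            p.add_(-self.lr * m_hat / (v_hat.sqrt() + self.eps))

w = torch.nn.Parameter(torch.randn(8, 8))
opt = LowPrecisionAdamW([w])
(w ** 2).sum().backward()
opt.step()
print(opt.m[0].dtype, w.dtype)    # torch.bfloat16 torch.float32
```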

Low-Precision Activation低精度激活

As illustrated in Figure 6, the Wgrad operation is performed in FP8. To reduce the memory consumption, it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator. However, special considerations are taken on several operators for low-cost high-precision training:

(1) Inputs of the Linear after the attention operator. These activations are also used in the backward pass of the attention operator, which makes it sensitive to precision. We adopt a customized E5M6 data format exclusively for these activations. Additionally, these activations will be converted from an 1x128 quantization tile to an 128x1 tile in the backward pass. To avoid introducing extra quantization error, all the scaling factors are round scaled, i.e., integral power of 2.

(2) Inputs of the SwiGLU operator in MoE. To further reduce the memory cost, we cache the inputs of the SwiGLU operator and recompute its output in the backward pass. These activations are also stored in FP8 with our fine-grained quantization method, striking a balance between memory efficiency and computational accuracy.

如图 6 所示,Wgrad 操作在 FP8 中执行。为了减少内存消耗,在线性算子的反向传播中缓存激活值采用 FP8 格式是一个自然的选择。然而,对于低成本高精度训练,对几个算子采取了特殊考虑:

(1)注意力算子之后的线性算子的输入。这些激活值也会在注意力算子的反向传播中使用,因此对精度较为敏感。我们专门为这类激活值采用定制的 E5M6 数据格式。此外,在反向传播中,这些激活值会从 1x128 的量化 tile 转换为 128x1 的 tile。为避免引入额外的量化误差,所有缩放因子均取整为 2 的整数次幂。

(2)MoE 中 SwiGLU 算子的输入。为了进一步降低内存成本,我们缓存 SwiGLU 算子的输入,并在反向传播时重新计算其输出。这些激活值同样通过我们的细粒度量化方法以 FP8 格式存储,在内存效率和计算精度之间取得平衡。
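针对“缩放因子取 2 的整数次幂”这一点,下面给出一个极小的 Python 示意(假设性示例),把按 tile 统计的最大绝对值向上取整到最近的 2 的幂作为缩放因子:

```python
import math

def power_of_two_scale(amax, fp8_max=448.0):
    """取不小于 amax / fp8_max 的最小 2 的整数次幂作为缩放因子(E4M3 最大值为 448)。"""
    raw = max(amax / fp8_max, 1e-30)
    return 2.0 ** math.ceil(math.log2(raw))

print(power_of_two_scale(3.7))    # 0.015625,即 2 的 -6 次方
```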

Low-Precision Communication低精度通信

Communication bandwidth is a critical bottleneck in the training of MoE models. To alleviate this challenge, we quantize the activation before MoE up-projections into FP8 and then apply dispatch components, which is compatible with FP8 Fprop in MoE up-projections. Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral power of 2. A similar strategy is applied to the activation gradient before MoE down-projections. For both the forward and backward combine components, we retain them in BF16 to preserve training precision in critical parts of the training pipeline.

在训练 MoE 模型时,通信带宽是一个关键瓶颈。为了解决这一挑战,我们在 MoE 上投影前将激活量化为 FP8,然后应用分发组件,这与 MoE 上投影中的 FP8 前向传播兼容。与注意力运算符后的线性层输入类似,此激活的缩放因子为 2 的整数次幂。对于 MoE 下投影前的激活梯度,也采用了类似的策略。对于前向和反向组合组件,我们将其保留为 BF16,以在训练管道的关键部分保持训练精度。

3.4 Inference and Deployment推理与部署:将预填充和解码阶段分开部署

推理和部署 (Inference and Deployment): 将预填充和解码阶段分开部署,以同时保证在线服务的SLA和高吞吐量。详细描述了预填充和解码阶段的并行策略和负载均衡策略,包括冗余专家部署。

>> 策略:将预填充和解码阶段分开部署,以同时保证在线服务的SLA和高吞吐量。

We deploy DeepSeek-V3 on the H800 cluster, where GPUs within each node are interconnected using NVLink, and all GPUs across the cluster are fully interconnected via IB. To simultaneously ensure both the Service-Level Objective (SLO) for online services and high throughput, we employ the following deployment strategy that separates the prefilling and decoding stages.

我们在 H800 集群上部署 DeepSeek-V3,集群中每个节点内的 GPU 通过 NVLink 相互连接,而集群中的所有 GPU 则通过 IB 实现全互联。为了同时满足在线服务的服务级别目标(SLO)和高吞吐量的要求,我们采用了以下部署策略,将预填充和解码阶段分开。

3.4.1 Prefilling预填充:并行策略+采用冗余专家实现负载均衡策略

>> 预填充 (Prefilling):最小部署单元为4个节点32个GPU,采用TP4+SP+DP8的并行策略,并使用冗余专家策略来实现负载均衡。同时处理两个微批次,重叠attention和MoE与dispatch和combine。正在探索动态冗余策略。

The minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8). Its small TP size of 4 limits the overhead of TP communication. For the MoE part, we use 32-way Expert Parallelism (EP32), which ensures that each expert processes a sufficiently large batch size, thereby enhancing computational efficiency. For the MoE all-to-all communication, we use the same method as in training: first transferring tokens across nodes via IB, and then forwarding among the intra-node GPUs via NVLink. In particular, we use 1-way Tensor Parallelism for the dense MLPs in shallow layers to save TP communication.

预填充阶段的最小部署单元由 4 个节点和 32 个 GPU 组成。注意力部分采用 4 路张量并行(TP4)结合序列并行(SP),并辅以 8 路数据并行(DP8)。其较小的 TP 尺寸 4 限制了 TP 通信的开销。对于 MoE 部分,我们使用 32 路专家并行(EP32),这确保了每个专家处理足够大的批处理大小,从而提高计算效率。对于 MoE 的全对全通信,我们采用与训练相同的方法:首先通过 IB 在节点间传输标记,然后通过 NVLink 在节点内的 GPU 之间转发。特别是,对于浅层密集 MLP,我们使用 1 路张量并行以节省 TP 通信。

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.

Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.

为了在 MoE 部分实现不同专家之间的负载均衡,我们需要确保每个 GPU 处理的标记数量大致相同。为此,我们引入了冗余专家的部署策略,即复制高负载专家并进行冗余部署。高负载专家是根据在线部署期间收集的统计数据检测出来的,并定期(例如每 10 分钟)进行调整。确定冗余专家的集合后,我们会根据观察到的负载在节点内的 GPU 之间仔细重新安排专家,力求在不增加跨节点全对全通信开销的情况下尽可能平衡 GPU 之间的负载。对于 DeepSeek-V3 的部署,我们在预填充阶段设置了 32 个冗余专家。对于每个 GPU,除了其原本承载的 8 个专家外,还将额外承载一个冗余专家。

此外,在预填充阶段,为了提高吞吐量并隐藏全对全和 TP 通信的开销,我们同时处理两个计算工作量相似的微批次,将一个微批次的注意力和 MoE 与另一个微批次的分发和合并进行重叠处理。

最后,我们正在探索一种专家的动态冗余策略,其中每个 GPU 会承载更多的专家(例如 16 个专家),但在每次推理步骤中仅激活 9 个。在每一层的全对全操作开始之前,我们会实时计算全局最优的路由方案。鉴于预填充阶段涉及大量的计算,计算此路由方案的开销几乎可以忽略不计。
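下面是一个说明冗余专家思路的简化 Python 草图(非线上调度器实现):依据统计到的专家负载选出最重的 32 个专家进行复制,并为每个 GPU 各放置一个冗余专家;其中负载分布、“副本平分流量”以及忽略节点内放置约束等都是演示用的假设。

```python
import numpy as np

NUM_EXPERTS, NUM_GPUS, REDUNDANT = 256, 32, 32
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS          # 原始部署:每 GPU 8 个专家
rng = np.random.default_rng(0)
load = rng.pareto(2.0, NUM_EXPERTS)                # 假设的长尾专家负载统计

redundant_experts = np.argsort(load)[::-1][:REDUNDANT]   # 选出最重的 32 个专家

base_gpu_load = load.reshape(NUM_GPUS, EXPERTS_PER_GPU).sum(axis=1)
gpu_load = base_gpu_load.copy()
placement, available = {}, list(range(NUM_GPUS))
for e in redundant_experts:                        # 按负载从重到轻逐个放置
    host = int(e) // EXPERTS_PER_GPU               # 原本承载该专家的 GPU
    g = min(available, key=lambda i: gpu_load[i])  # 放到当前最空闲且还没拿到冗余专家的 GPU
    available.remove(g)
    placement[int(e)] = g
    gpu_load[host] -= load[e] / 2                  # 假设:两份副本平分流量
    gpu_load[g] += load[e] / 2

print(f"复制前 最大/平均 GPU 负载: {base_gpu_load.max() / base_gpu_load.mean():.2f}")
print(f"复制后 最大/平均 GPU 负载: {gpu_load.max() / gpu_load.mean():.2f}")
```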

3.4.2 Decoding解码:并行策略+IB点对点传输+IBGDA技术

>> 解码 (Decoding):最小部署单元为40个节点320个GPU,采用TP4+SP+DP80的并行策略,每个GPU只负责一个专家,并使用64个GPU负责冗余专家和共享专家。使用IB进行点对点传输,并利用IBGDA技术。同样正在探索动态冗余策略,以及同时处理两个微批次,重叠attention与dispatch+MoE+combine。

During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.

在解码过程中,我们将共享专家视为路由专家。从这个角度来看,每个标记在路由时将选择 9 个专家,其中共享专家被视为高负载专家,始终会被选中。解码阶段的最小部署单元由 40 个节点和 320 个 GPU 组成。注意力部分采用 TP4 结合 SP,再加上 DP80,而 MoE 部分使用 EP320。对于 MoE 部分,每个 GPU 只承载一个专家,64 个 GPU 负责承载冗余专家和共享专家。调度和组合部分的全对全通信通过 IB 上的直接点对点传输来实现,以达到低延迟。此外,我们还利用 IBGDA(NVIDIA,2022)技术进一步降低延迟并提高通信效率。

与预填充类似,我们基于在线服务中的专家负载统计,在一定间隔内定期确定冗余专家的集合。不过,由于每个 GPU 只承载一个专家,因此无需重新安排专家。我们还在探索解码的动态冗余策略。然而,这需要对计算全局最优路由方案的算法进行更精细的优化,并与调度内核融合以降低开销。

Additionally, to enhance throughput and hide the overhead of all-to-all communication, we are also exploring processing two micro-batches with similar computational workloads simultaneously in the decoding stage. Unlike prefilling, attention consumes a larger portion of time in the decoding stage. Therefore, we overlap the attention of one micro-batch with the dispatch+MoE+combine of another. In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation. Since the MoE part only needs to load the parameters of one expert, the memory access overhead is minimal, so using fewer SMs will not significantly affect the overall performance. Therefore, to avoid impacting the computation speed of the attention part, we can allocate only a small portion of SMs to dispatch+MoE+combine.

此外,为了提高吞吐量并隐藏全对全通信的开销,我们还在探索在解码阶段同时处理两个计算工作量相似的微批次。与预填充不同,在解码阶段,注意力机制消耗的时间更多。因此,我们将一个微批次的注意力机制与另一个微批次的调度+MoE+合并重叠。在解码阶段,每个专家的批次大小相对较小(通常在 256 个标记以内),瓶颈在于内存访问而非计算。由于 MoE 部分只需加载一个专家的参数,内存访问开销极小,因此使用较少的流式多处理器(SM)不会显著影响整体性能。因此,为了不影响注意力部分的计算速度,我们可以仅分配少量的流式多处理器(SM)用于调度+MoE+合并。

3.5 Suggestions on Hardware Design关于硬件设计的建议:建议硬件厂商(开发卸载通信任务的协处理器+提高FP8 GEMM累加精度+支持tile和block级量化+支持转置GEMM操作)

对硬件设计的建议 (Suggestions on Hardware Design): 建议未来AI硬件厂商在通信硬件方面开发卸载通信任务的协处理器,并在计算硬件方面提高FP8 GEMM累加精度、支持tile和block级量化以及在线量化,并支持转置GEMM操作。

Based on our implementation of the all-to-all communication and FP8 training scheme, we propose the following suggestions on chip design to AI hardware vendors.

基于我们对全对全通信和 FP8 训练方案的实现,我们向 AI 硬件供应商提出以下芯片设计方面的建议。

3.5.1 Communication Hardware通信硬件:建议开发卸载通信任务的GPU协处理器或网络协处理器,并统一IB和NVLink网络接口

In DeepSeek-V3, we implement the overlap between computation and communication to hide the communication latency during computation. This significantly reduces the dependency on communication bandwidth compared to serial computation and communication. However, the current communication implementation relies on expensive SMs (e.g., we allocate 20 out of the 132 SMs available in the H800 GPU for this purpose), which will limit the computational throughput. Moreover, using SMs for communication results in significant inefficiencies, as tensor cores remain entirely unutilized.

在 DeepSeek-V3 中,我们实现了计算与通信的重叠,以在计算过程中隐藏通信延迟。与串行执行计算和通信的方式相比,这显著降低了对通信带宽的依赖。然而,当前的通信实现依赖于昂贵的 SM(例如,我们在 H800 GPU 的 132 个可用 SM 中为此分配了 20 个),这会限制计算吞吐量。此外,使用 SM 进行通信会导致显著的效率低下,因为张量核心完全没有被利用。

Currently, the SMs primarily perform the following tasks for all-to-all communication:

• Forwarding data between the IB (InfiniBand) and NVLink domain while aggregating IB traffic destined for multiple GPUs within the same node from a single GPU.

• Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.

• Executing reduce operations for all-to-all combine.

• Managing fine-grained memory layout during chunked data transferring to multiple experts across the IB and NVLink domain.

目前,流式多处理器(SM)主要执行以下全对全通信任务:

• 在 InfiniBand(IB)和 NVLink 域之间转发数据,同时将同一节点内单个 GPU 发往多个 GPU 的 IB 流量进行聚合。

• 在 RDMA 缓冲区(已注册的 GPU 内存区域)和输入/输出缓冲区之间传输数据。

• 执行全对全组合的归约操作。

• 在通过 IB 和 NVLink 域向多个专家分块传输数据期间管理细粒度的内存布局。

We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network co-processor like NVIDIA SHARP Graham et al. (2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink (scale-up) networks from the perspective of the computation units. With this unified interface, computation units can easily accomplish operations such as read, write, multicast, and reduce across the entire IB-NVLink-unified domain via submitting communication requests based on simple primitives.

我们期望未来供应商开发出能将这些通信任务从宝贵的计算单元 SM 中卸载出来的硬件,充当类似 NVIDIA SHARP Graham 等人(2016 年)所提出的 GPU 协处理器或网络协处理器的角色。此外,为了降低应用程序编程的复杂性,我们希望这种硬件能从计算单元的角度统一 IB(横向扩展)和 NVLink(纵向扩展)网络。通过这种统一的接口,计算单元能够基于简单的原语提交通信请求,从而轻松地在整个 IB-NVLink 统一域内完成读取、写入、多播和归约等操作。

3.5.2 Compute Hardware计算硬件:建议提高Tensor Core中FP8 GEMM累加精度、支持tile和block级量化以及在线量化,并支持转置GEMM操作

Higher FP8 GEMM Accumulation Precision in Tensor Cores.张量核心中更高的 FP8 GEMM 累加精度

In the current Tensor Core implementation of the NVIDIA Hopper architecture, FP8 GEMM (General Matrix Multiply) employs fixed-point accumulation, aligning the mantissa products by right-shifting based on the maximum exponent before addition. Our experiments reveal that it only uses the highest 14 bits of each mantissa product after sign-fill right shifting, and truncates bits exceeding this range. However, for example, to achieve precise FP32 results from the accumulation of 32 FP8×FP8 multiplications, at least 34-bit precision is required. Thus, we recommend that future chip designs increase accumulation precision in Tensor Cores to support full-precision accumulation, or select an appropriate accumulation bit-width according to the accuracy requirements of training and inference algorithms. This approach ensures that errors remain within acceptable bounds while maintaining computational efficiency.

在 NVIDIA Hopper 架构当前的张量核心实现中,FP8 GEMM(通用矩阵乘法)采用定点累加,通过基于最大指数的右移来对齐尾数乘积,然后进行加法运算。我们的实验表明,在符号填充右移后,它仅使用每个尾数乘积的最高 14 位,并截断超出此范围的位。然而,例如,要从 32 次 FP8×FP8 乘法的累加中获得精确的 FP32 结果,至少需要 34 位精度。因此,我们建议未来的芯片设计提高张量核心中的累加精度以支持全精度累加,或者根据训练和推理算法的精度要求选择适当的累加位宽。这种方法确保误差保持在可接受的范围内,同时保持计算效率。

Support for Tile- and Block-Wise Quantization支持分块和分组量化

Current GPUs only support per-tensor quantization, lacking the native support for fine-grained quantization like our tile- and block-wise quantization. In the current implementation, when the NC interval is reached, the partial results will be copied from Tensor Cores to CUDA cores, multiplied by the scaling factors, and added to FP32 registers on CUDA cores. Although the dequantization overhead is significantly mitigated combined with our precise FP32 accumulation strategy, the frequent data movements between Tensor Cores and CUDA cores still limit the computational efficiency. Therefore, we recommend future chips to support fine-grained quantization by enabling Tensor Cores to receive scaling factors and implement MMA with group scaling. In this way, the whole partial sum accumulation and dequantization can be completed directly inside Tensor Cores until the final result is produced, avoiding frequent data movements.

当前的 GPU 仅支持张量级量化,缺乏对诸如我们的分块和分组量化这种细粒度量化的原生支持。在当前的实现中,当达到 NC 区间时,部分结果将从 Tensor Core 复制到 CUDA 核心,乘以缩放因子,并添加到 CUDA 核心上的 FP32 寄存器中。尽管结合我们的精确 FP32 累加策略,去量化的开销已显著降低,但 Tensor Core 和 CUDA 核心之间的频繁数据移动仍然限制了计算效率。因此,我们建议未来的芯片支持细粒度量化,使 Tensor Core 能够接收缩放因子,并实现具有组缩放的矩阵乘法累加(MMA)。这样,整个部分和累加和去量化都可以直接在 Tensor Core 内部完成,直到产生最终结果,从而避免频繁的数据移动。

Support for Online Quantization对在线量化提供支持

The current implementations struggle to effectively support online quantization, despite its effectiveness demonstrated in our research. In the existing process, we need to read 128 BF16 activation values (the output of the previous computation) from HBM (High Bandwidth Memory) for quantization, and the quantized FP8 values are then written back to HBM, only to be read again for MMA. To address this inefficiency, we recommend that future chips integrate FP8 cast and TMA (Tensor Memory Accelerator) access into a single fused operation, so quantization can be completed during the transfer of activations from global memory to shared memory, avoiding frequent memory reads and writes. We also recommend supporting a warp-level cast instruction for speedup, which further facilitates the better fusion of layer normalization and FP8 cast. Alternatively, a near-memory computing approach can be adopted, where compute logic is placed near the HBM. In this case, BF16 elements can be cast to FP8 directly as they are read from HBM into the GPU, reducing off-chip memory access by roughly 50%.

尽管我们的研究已证明在线量化十分有效,但当前的实现方式却难以有效支持这一功能。在现有流程中,我们需要从 HBM(高带宽内存)读取 128 个 BF16 激活值(即前一次计算的输出)来进行量化,然后将量化后的 FP8 值写回 HBM,之后再读取用于 MMA(矩阵乘法累加)。为解决这种低效问题,我们建议未来的芯片将 FP8 转换和 TMA(张量内存加速器)访问集成到一个融合操作中,这样量化就可以在激活值从全局内存传输到共享内存的过程中完成,从而避免频繁的内存读写。我们还建议支持用于加速的 warp 级别转换指令,这进一步促进了层归一化和 FP8 转换的更好融合。或者,可以采用近内存计算方法,即将计算逻辑置于 HBM 附近。在这种情况下,BF16 元素在从 HBM 读入 GPU 时可直接转换为 FP8,从而将片外内存访问减少约 50%。

Support for Transposed GEMM Operations对转置 GEMM 操作的支持

The current architecture makes it cumbersome to fuse matrix transposition with GEMM operations. In our workflow, activations during the forward pass are quantized into 1x128 FP8 tiles and stored. During the backward pass, the matrix needs to be read out, dequantized, transposed, re-quantized into 128x1 tiles, and stored in HBM. To reduce memory operations, we recommend future chips to enable direct transposed reads of matrices from shared memory before MMA operation, for those precisions required in both training and inference. Combined with the fusion of FP8 format conversion and TMA access, this enhancement will significantly streamline the quantization workflow.

当前架构使得将矩阵转置与 GEMM 操作融合变得繁琐。在我们的工作流程中,前向传播期间的激活值被量化为 1×128 的 FP8 块并存储。在反向传播期间,矩阵需要被读出、去量化、转置、重新量化为 128×1 的块,并存储在 HBM 中。为了减少内存操作,我们建议未来的芯片在 MMA 操作之前能够直接从共享内存中读取转置矩阵,对于训练和推理都需要的精度。结合 FP8 格式转换和 TMA 访问的融合,这一改进将显著简化量化工作流程。

4 Pre-Training

本节内容详细介绍了DeepSeek-V3模型的预训练阶段,包括数据构建超参数设置长上下文扩展以及评估和讨论。

DeepSeek-V3在14.8万亿高质量和多样化的token上进行预训练。 数据构建方面,优化了数学和编程样本的比例,并扩展了多语言覆盖范围。 使用了Prefix-Suffix-Middle (PSM)框架和Fill-in-Middle (FIM)策略。 使用了Byte-level BPE分词器,并对预分词器进行了改进以优化多语言压缩效率,并解决token边界偏差问题。

详细介绍了模型和训练超参数设置,包括学习率调度、批量大小调度等。 进行了长上下文扩展,将最大上下文长度扩展到128K。 最后,对多token预测策略和无辅助损失的负载平衡策略进行了消融实验。

本预训练部分详细描述了DeepSeek-V3的训练数据、超参数设置和训练过程。 数据处理和预训练策略的优化,以及长上下文扩展的成功,都为模型的最终性能奠定了坚实的基础。 消融实验结果有力地证明了MTP和无辅助损失负载平衡策略的有效性

4.1 Data Construction数据构建:优化预训练语料库=提高数学和编程样本比例+扩展多语言+文档打包+FIM策略

数据构建 (Data Construction):优化预训练语料库,提高数学和编程样本比例,扩展多语言覆盖范围,并采用文档打包方法。使用了Fill-in-Middle (FIM) 策略。

>> 语料库优化:提高数学和编程样本的比例+扩展多语言

相较于DeepSeek-V2,DeepSeek-V3的预训练语料库进行了优化:

提高了数学和编程样本的比例。

扩展了多语言覆盖范围,超越了英语和中文。

优化了数据处理流程,最大限度地减少冗余,同时保持语料库的多样性。

Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Also, our data processing pipeline is refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. (2024), we implement the document packing method for data integrity but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer.

与 DeepSeek-V2 相比,我们通过提高数学和编程样本的比例来优化预训练语料库,同时将多语言覆盖范围扩展到英语和中文之外。此外,我们还改进了数据处理流程,以减少冗余并保持语料库的多样性。

>> 文档打包(数据完整性)→14.8T(高质量且多样化)

>> 文档打包:受Ding et al. (2024) 的启发,采用了文档打包方法来保证数据完整性,但在训练过程中没有使用跨样本注意力掩码。DeepSeek-V3的训练语料库包含14.8万亿高质量和多样化的token。

受 Ding 等人(2024 年)的启发,我们采用了文档打包方法以保证数据完整性,但在训练过程中并未引入跨样本注意力掩码。最终,DeepSeek-V3 的训练语料库包含 14.8T 个高质量且多样化的标记,这些标记由我们的分词器处理。

>> Fill-in-Middle (FIM) 策略:沿用DeepSeekCoder-V2中的FIM策略的PSM框架+文档级别

沿用DeepSeekCoder-V2中的FIM策略,该策略在不影响下一个token预测能力的同时,能够根据上下文线索准确预测中间文本。具体来说,使用了Prefix-Suffix-Middle (PSM) 框架,数据结构如下:<|fim_begin|>f_pre<|fim_hole|>f_suf<|fim_end|>f_middle<|eos_token|>。此结构应用于文档级别,作为预打包过程的一部分。FIM策略的应用率为0.1。

In the training process of DeepSeekCoder-V2 (DeepSeek-AI, 2024a), we observe that the Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues. In alignment with DeepSeekCoder-V2, we also incorporate the FIM strategy in the pre-training of DeepSeek-V3. To be specific, we employ the Prefix-Suffix-Middle (PSM) framework to structure data as follows:

<|fim_begin|>f_pre<|fim_hole|>f_suf<|fim_end|>f_middle<|eos_token|>

This structure is applied at the document level as a part of the pre-packing process. The FIM strategy is applied at a rate of 0.1, consistent with the PSM framework.

在 DeepSeekCoder-V2(DeepSeek-AI,2024a)的训练过程中,我们发现 Fill-in-Middle(FIM)策略在不影响下一个标记预测能力的同时,使模型能够根据上下文线索准确预测中间文本。与 DeepSeekCoder-V2 保持一致,我们在 DeepSeek-V3 的预训练中也采用了 FIM 策略。具体而言,我们采用前缀-后缀-中间(PSM)框架来构建数据,结构如下:

<|fim_begin|>f_pre<|fim_hole|>f_suf<|fim_end|>f_middle<|eos_token|>

这种结构在文档级别应用,作为预打包过程的一部分。FIM 策略以 0.1 的比率应用,与 PSM 框架保持一致。

>> 分词器:采用BPE +词汇表(128K)+随机拆分

DeepSeek-V3的分词器采用字节级BPE (Byte-level BPE),扩展词汇量为128K个token。预分词器和训练数据进行了修改,以优化多语言压缩效率。新的预分词器引入了组合标点符号和换行符的token,但这可能会导致token边界偏差。为了解决这个问题,在训练过程中随机拆分一部分这样的组合token,使模型接触更广泛的特殊情况,从而减轻偏差。

The tokenizer for DeepSeek-V3 employs Byte-level BPE (Shibata et al., 1999) with an extended vocabulary of 128K tokens. The pretokenizer and training data for our tokenizer are modified to optimize multilingual compression efficiency. In addition, compared with DeepSeek-V2, the new pretokenizer introduces tokens that combine punctuations and line breaks. However, this trick may introduce the token boundary bias (Lundberg, 2023) when the model processes multi-line prompts without terminal line breaks, particularly for few-shot evaluation prompts. To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.

DeepSeek-V3 的分词器采用字节级 BPE(Shibata 等人,1999 年),扩展词汇量为 128K 个标记。我们的分词器的预分词器和训练数据经过修改,以优化多语言压缩效率。此外,与 DeepSeek-V2 相比,新的预分词器引入了结合标点符号和换行符的标记。然而,当模型处理没有终端换行符的多行提示时,特别是对于少样本评估提示,这种技巧可能会引入标记边界偏差(Lundberg,2023 年)。为了解决这个问题,我们在训练期间随机拆分一定比例的此类组合标记,这使模型接触到更广泛的特殊情况,并减轻了这种偏差。

4.2 Hyper-Parameters超参数:模型超参数(Transformer层数、隐藏维度、注意力头数等)和训练超参数(优化器、学习率调度、批量大小等)

超参数 (Hyper-Parameters): 详细列出了模型超参数(Transformer层数、隐藏维度、注意力头数等)和训练超参数(优化器、学习率调度、批量大小等)。

Model Hyper-Parameters模型超参数:Transformer(61层),隐藏维度(7168),MLA参数(nh和dh都为128/dc=512/dc′=1536/dhR=64),MoE参数(除前三层外其余FFN都替换为MoE层+每个MoE层包含1个共享专家和256个路由专家(专家中间隐藏维度2048),每个token激活8个路由专家,每个token最多发送到4个节点),多token预测深度(1),压缩潜在向量后的额外RMSNorm层,每个token激活37B、共计671B

Transformer层数:61层

隐藏维度:7168

所有可学习参数的随机初始化标准差:0.006

MLA参数:注意力头数 (nh) 128,每个头维度 (dh) 128,KV压缩维度 (dc) 512,查询压缩维度 (dc′) 1536,解耦查询和键的每个头维度 (dhR) 64。

MoE参数:除了前三层外,所有FFN都替换为MoE层。每个MoE层包含1个共享专家和256个路由专家,每个专家的中间隐藏维度为2048。每个token激活8个路由专家,每个token最多发送到4个节点。

多token预测深度 (D):1 (每个token预测一个额外token)。

其他:DeepSeek-V3还使用了压缩潜在向量后的额外RMSNorm层,并在宽度瓶颈处乘以额外的缩放因子。

总参数量:6710亿,每个token激活370亿参数。

We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads nh to 128 and the per-head dimension dh to 128. The KV compression dimension dc is set to 512, and the query compression dimension dc′ is set to 1536. For the decoupled queries and key, we set the per-head dimension dhR to 64. We substitute all FFNs except for the first three layers with MoE layers. Each MoE layer consists of 1 shared expert and 256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the routed experts, 8 experts will be activated for each token, and each token will be ensured to be sent to at most 4 nodes. The multi-token prediction depth D is set to 1, i.e., besides the exact next token, each token will predict one additional token. As DeepSeek-V2, DeepSeek-V3 also employs additional RMSNorm layers after the compressed latent vectors, and multiplies additional scaling factors at the width bottlenecks. Under this configuration, DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token.

我们将 Transformer 层的数量设置为 61 层,隐藏维度设置为 7168。所有可学习参数均以标准差为 0.006 的随机方式初始化。在 MLA 中,我们将注意力头数 nh 设置为 128,每个头的维度 dh 设置为 128。KV 压缩维度 dc 设置为 512,查询压缩维度 dc' 设置为 1536。对于解耦的查询和键,我们将每个头的维度 dhR 设置为 64。除了前三个层之外,我们将所有全连接层(FFN)替换为专家混合(MoE)层。每个 MoE 层由 1 个共享专家和 256 个路由专家组成,每个专家的中间隐藏维度为 2048。在路由专家中,每个标记将激活 8 个专家,并且每个标记将确保发送到最多 4 个节点。多标记预测深度 D 设置为 1,即除了确切的下一个标记外,每个标记还将预测一个额外的标记。与 DeepSeek-V2 一样,DeepSeek-V3 在压缩的潜在向量之后也采用了额外的 RMSNorm 层,并在宽度瓶颈处乘以额外的缩放因子。在此配置下,DeepSeek-V3 总共有 6710 亿个参数,其中每个标记激活 370 亿个参数。

Training Hyper-Parameters训练超参数:优化器(AdamW)、max_length=4K、预训练14.8T token、学习率与批大小调度、并行策略(路由专家均匀部署在8个节点的64个GPU上)、节点限制路由(M=4)、无辅助损失负载均衡、MTP损失权重

优化器:AdamW (β1=0.9, β2=0.95, weight_decay=0.1)

最大序列长度:4K

训练token总数:14.8万亿

学习率调度:前2K步线性增加到2.2×10⁻⁴,然后保持不变直到消耗10万亿训练token,之后按照余弦衰减曲线逐渐衰减到2.2×10⁻⁵,最后阶段学习率为7.3×10⁻⁶。

梯度裁剪范数:1.0

批大小调度:前4690亿token逐渐增加到15360,之后保持不变。

并行策略:利用流水线并行将模型的不同层部署在不同的GPU上,每个层的路由专家均匀部署在8个节点的64个GPU上。

节点限制路由:每个token最多发送到4个节点 (M=4)。

无辅助损失负载均衡:偏差更新速度 (γ) 前14.3万亿token为0.001,之后为0.0。

平衡损失:α = 0.0001

MTP损失权重 (λ):前10万亿token为0.3,之后为0.1。

We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to β1=0.9, β2=0.95, and weight⁢_⁢decay=0.1. We set the maximum sequence length to 4K during pre-training, and pre-train DeepSeek-V3 on 14.8T tokens. As for the learning rate scheduling, we first linearly increase it from 0 to 2.2×10−4 during the first 2K steps. Then, we keep a constant learning rate of 2.2×10−4 until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to 2.2×10−5 in 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of 2.2×10−5 in the first 333B tokens, and switch to another constant learning rate of 7.3×10−6 in the remaining 167B tokens. The gradient clipping norm is set to 1.0. We employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then keeps 15360 in the remaining training. We leverage pipeline parallelism to deploy different layers of a model on different GPUs, and for each layer, the routed experts will be uniformly deployed on 64 GPUs belonging to 8 nodes. As for the node-limited routing, each token will be sent to at most 4 nodes (i.e., M=4). For auxiliary-loss-free load balancing, we set the bias update speed γ to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens. For the balance loss, we set α to 0.0001, just to avoid extreme imbalance within any single sequence. The MTP loss weight λ is set to 0.3 for the first 10T tokens, and to 0.1 for the remaining 4.8T tokens.

我们采用 AdamW 优化器(Loshchilov 和 Hutter,2017),其超参数设置为 β1=0.9、β2=0.95 以及 weight_decay=0.1。在预训练期间,我们将最大序列长度设为 4K,并对 DeepSeek-V3 进行 14.8T 个标记的预训练。至于学习率调度,我们首先在前 2K 步中将其从 0 线性增加到 2.2×10−4。然后,在模型消耗 10T 训练标记之前,保持 2.2×10−4 的恒定学习率。随后,在 4.3T 标记内,我们按照余弦衰减曲线将学习率逐渐衰减至 2.2×10−5。在最后 500B 标记的训练中,我们在前 333B 标记内保持 2.2×10−5 的恒定学习率,然后在剩余的 167B 标记内切换到另一个恒定学习率 7.3×10−6。梯度裁剪范数设为 1.0。我们采用批量大小调度策略,在前 469B 标记的训练中,批量大小从 3072 逐渐增加到 15360,然后在剩余的训练中保持 15360。我们利用流水线并行性将模型的不同层部署在不同的 GPU 上,对于每一层,路由专家将均匀部署在 8 个节点的 64 个 GPU 上。至于节点受限路由,每个标记最多会被发送到 4 个节点(即 M = 4)。对于无辅助损失的负载均衡,我们为前 14.3T 个标记将偏差更新速度 γ 设为 0.001,对于剩余的 500B 个标记设为 0.0。对于平衡损失,我们将 α 设为 0.0001,以避免任何单个序列内的极端不平衡。MTP 损失权重 λ 对前 10T 个标记设为 0.3,对剩余的 4.8T 个标记设为 0.1。
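下面把上述学习率调度写成一个随训练进度变化的函数(假设性草图):各个拐点与数值均取自原文,预热阶段按步数表示,其余阶段按累计 token 数(单位:万亿)表示。

```python
import math

def learning_rate(step, tokens_T):
    """step 为训练步数,tokens_T 为累计训练 token 数(单位:万亿)。"""
    peak, low, final = 2.2e-4, 2.2e-5, 7.3e-6
    if step < 2000:                                    # 前 2K 步线性预热
        return peak * step / 2000
    if tokens_T <= 10.0:                               # 10T token 之前保持峰值
        return peak
    if tokens_T <= 14.3:                               # 在 4.3T token 内按余弦衰减到 2.2e-5
        progress = (tokens_T - 10.0) / 4.3
        return low + 0.5 * (peak - low) * (1 + math.cos(math.pi * progress))
    if tokens_T <= 14.3 + 0.333:                       # 最后 500B 中的前 333B 保持 2.2e-5
        return low
    return final                                       # 余下 167B 使用 7.3e-6

for t in (0.5, 5.0, 12.0, 14.4, 14.7):
    print(f"{t:5.1f}T tokens -> lr = {learning_rate(10_000, t):.2e}")
```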

Figure 8: Evaluation results on the ”Needle In A Haystack” (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.图 8:“大海捞针”(NIAH)测试的评估结果。DeepSeek-V3 在长达 128K 的所有上下文窗口长度上均表现出色。

4.3 Long Context Extension长上下文扩展:沿用YaRN方法(仅应用于解耦共享键)+2个额外的训练阶段(4K→32K→128K,每个阶段包含1000步),NIAH测试良好

长上下文扩展 (Long Context Extension): 采用YaRN方法将上下文长度扩展到128K。

>> 方法:采用与DeepSeek-V2类似的方法,使用YaRN (Peng et al., 2023a) 进行上下文扩展。

>> 训练阶段:进行两个额外的训练阶段,每个阶段包含1000步,将上下文窗口从4K逐步扩展到32K,然后到128K。YaRN仅应用于解耦共享键 (k_t^R)。超参数保持不变 (s=40, α=1, β=32, t=0.1·ln(s)+1)。第一阶段序列长度为32K,批大小为1920;第二阶段序列长度为128K,批大小为480。学习率为7.3×10⁻⁶。

>> 效果:DeepSeek-V3能够处理长度高达128K的输入,同时保持强大的性能。在“Needle In A Haystack” (NIAH) 测试中表现良好。

We adopt a similar approach to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key k_t^R. The hyper-parameters remain identical across both phases, with the scale s=40, α=1, β=32, and the scaling factor t=0.1·ln(s)+1. In the first phase, the sequence length is set to 32K, and the batch size is 1920. During the second phase, the sequence length is increased to 128K, and the batch size is reduced to 480. The learning rate for both phases is set to 7.3×10−6, matching the final learning rate from the pre-training stage.

Through this two-phase extension training, DeepSeek-V3 is capable of handling inputs up to 128K in length while maintaining strong performance. Figure 8 illustrates that DeepSeek-V3, following supervised fine-tuning, achieves notable performance on the "Needle In A Haystack" (NIAH) test, demonstrating consistent robustness across context window lengths up to 128K.

我们采用与 DeepSeek-V2(DeepSeek-AI,2024c)类似的方法,为 DeepSeek-V3 实现长上下文能力。在预训练阶段之后,我们应用 YaRN(Peng 等人,2023a)进行上下文扩展,并执行两个额外的训练阶段,每个阶段包含 1000 步,逐步将上下文窗口从 4K 扩展到 32K,然后再扩展到 128K。YaRN 的配置与 DeepSeek-V2 中使用的配置一致,仅应用于解耦的共享键 k_t^R。两个阶段的超参数保持不变,其中比例 s=40,α=1,β=32,缩放因子 t=0.1·ln(s)+1。在第一阶段,序列长度设置为 32K,批处理大小为 1920。在第二阶段,序列长度增加到 128K,批处理大小减少到 480。两个阶段的学习率均设置为 7.3×10−6,与预训练阶段的最终学习率相同。

通过这种两阶段的扩展训练,DeepSeek-V3 能够处理长达 128K 的输入,同时保持强大的性能。图 8 表明,经过监督微调后,DeepSeek-V3 在“大海捞针”(NIAH)测试中表现出色,在长达 128K 的上下文窗口长度范围内始终保持着强大的稳定性。

4.4 Evaluations评估:多个英语、中文和多语言基准上评估

评估 (Evaluations): 在多个英语、中文和多语言基准上对DeepSeek-V3进行评估,包括知识、代码、数学和推理等方面。

4.4.1 Evaluation Benchmarks评估基准:多项选择数据集、语言理解和推理数据集、闭卷问答数据集、阅读理解数据集、指代消歧数据集、语言建模数据集、中文理解和文化数据集、数学数据集、代码数据集、标准化考试

对DeepSeek-V3基础模型进行了全面的评估,基准涵盖多个领域,包括多学科多项选择数据集 (MMLU, MMLU-Redux, MMLU-Pro, MMMLU, C-Eval, CMMLU)、语言理解和推理数据集 (HellaSwag, PIQA, ARC, BBH)、闭卷问答数据集 (TriviaQA, NaturalQuestions)、阅读理解数据集 (RACE, DROP, C3, CMRC)、指代消歧数据集 (CLUEWSC, WinoGrande)、语言建模数据集 (Pile)、中文理解和文化数据集 (CCPM)、数学数据集 (GSM8K, MATH, MGSM, CMath) 和代码数据集 (HumanEval, LiveCodeBench-Base, MBPP, CRUXEval)以及标准化考试 (AGIEval)。

The base model of DeepSeek-V3 is pretrained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Considered benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese and double-underlined benchmarks are multilingual ones:

Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024b), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).

Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).

Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).

Reading comprehension datasets include RACE Lai et al. (2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019a), and CMRC (Cui et al., 2019).

Reference disambiguation datasets include CLUEWSC (Xu et al., 2020) and WinoGrande Sakaguchi et al. (2019).

Language modeling datasets include Pile (Gao et al., 2020).

Chinese understanding and culture datasets include CCPM (Li et al., 2021).

Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023).

Code datasets include HumanEval (Chen et al., 2021), LiveCodeBench-Base (0801-1101) (Jain et al., 2024), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).

Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.

DeepSeek-V3 的基础模型是在以英语和中文为主的多语言语料库上进行预训练的,因此我们主要在英语和中文的系列基准测试以及一个多语言基准测试上对其性能进行评估。我们的评估基于集成在 HAI-LLM 框架中的内部评估框架。所考虑的基准测试按类别列出如下,其中带下划线的基准测试为中文基准测试,双下划线的基准测试为多语言基准测试:

多学科多项选择数据集包括 MMLU(Hendrycks 等人,2020 年)、MMLU-Redux(Gema 等人,2024 年)、MMLU-Pro(Wang 等人,2024 年 b)、MMMLU(OpenAI,2024 年 b)、C-Eval(Huang 等人,2023 年)和 CMMLU(Li 等人,2023 年)。

语言理解和推理数据集包括 HellaSwag(Zellers 等人,2019 年)、PIQA(Bisk 等人,2020 年)、ARC(Clark 等人,2018 年)和 BigBench Hard(BBH)(Suzgun 等人,2022 年)。

封闭式问答数据集包括 TriviaQA(Joshi 等人,2017 年)和 NaturalQuestions(Kwiatkowski 等人,2019 年)。

阅读理解数据集包括 RACE(Lai 等人,2017 年)、DROP(Dua 等人,2019 年)、C3(Sun 等人,2019 年 a)和 CMRC(Cui 等人,2019 年)。

参考消歧数据集包括 CLUEWSC(Xu 等人,2020 年)和 WinoGrande(Sakaguchi 等人,2019 年)。

语言建模数据集包括 Pile(Gao 等人,2020 年)。

中文理解和文化数据集包括 CCPM(Li 等人,2021 年)。

数学数据集包括 GSM8K(Cobbe 等人,2021 年)、MATH(Hendrycks 等人,2021 年)、MGSM(Shi 等人,2023 年)和 CMath(Wei 等人,2023 年)。

代码数据集包括 HumanEval(Chen 等人,2021 年)、LiveCodeBench-Base(0801-1101)(Jain 等人,2024 年)、MBPP(Austin 等人,2021 年)和 CRUXEval(Gu 等人,2024 年)。

标准化考试包括 AGIEval(Zhong 等人,2023 年)。请注意,AGIEval 包含英语和中文子集。

评估方法和指标:困惑度+生成,BPB度量指标

使用了困惑度 (perplexity) 和生成 (generation) 两种评估方法,并使用Bits-Per-Byte (BPB) 作为度量指标。

Following our previous work (DeepSeek-AI, 2024b, c), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, NaturalQuestions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.

继我们之前的工作(DeepSeek-AI,2024 年 b、c)之后,我们对 HellaSwag、PIQA、WinoGrande、RACE-Middle、RACE-High、MMLU、MMLU-Redux、MMLU-Pro、MMMLU、ARC-Easy、ARC-Challenge、C-Eval、CMMLU、C3 和 CCPM 数据集采用基于困惑度的评估,对 TriviaQA、NaturalQuestions、DROP、MATH、GSM8K、MGSM、HumanEval、MBPP、LiveCodeBench-Base、CRUXEval、BBH、AGIEval、CLUEWSC、CMRC 和 CMath 采用基于生成的评估。此外,我们对 Pile-test 进行基于语言建模的评估,并使用每字节比特数(BPB)作为指标,以确保使用不同分词器的模型之间进行公平比较。
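
BPB(Bits-Per-Byte)把模型的逐 token 损失折算为每个原始字节对应的比特数,从而消除不同分词器切分粒度带来的不可比性。下面是一个最小计算示意(非论文官方实现,数值为虚构示例):

```python
import math

def bits_per_byte(sum_nll_nats: float, num_bytes: int) -> float:
    """由逐 token 负对数似然之和(自然对数, 单位 nats)与文本的 UTF-8 字节数计算 BPB。
    该指标与分词器无关, 便于在使用不同 tokenizer 的模型之间公平比较。"""
    return sum_nll_nats / (num_bytes * math.log(2))

# 用法示意(数值为虚构): 假设模型在某段测试文本上的累计负对数似然为 1.2e6 nats,
# 该文本共 1.5e6 个 UTF-8 字节
print(bits_per_byte(1.2e6, 1_500_000))  # ≈ 1.15 bits/byte
```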

Table 3: Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.表 3:DeepSeek-V3-Base 与其他代表性开源基础模型的比较。所有模型均在我们的内部框架中进行评估,并采用相同的评估设置。分差不超过 0.3 的分数被视为处于同一水平。DeepSeek-V3-Base 在大多数基准测试中表现最佳,尤其是在数学和代码任务方面。

4.4.2 Evaluation Results评估结果:最强大的开源模型(尤其是在数学和代码任务上),超便宜(每万亿token的训练仅需180K H800 GPU小时)

评估结果 (Evaluation Results):DeepSeek-V3-Base在大多数基准测试中都取得了最佳性能,尤其是在数学和代码任务上。它全面优于DeepSeek-V2-Base和Qwen2.5 72B Base,并在大多数基准测试中超越LLaMA-3.1 405B Base,成为最强大的开源模型。其高效的架构和全面的工程优化使其训练效率极高,每万亿token的训练仅需180K H800 GPU小时。

In Table 3, we compare the base model of DeepSeek-V3 with the state-of-the-art open-source base models, including DeepSeek-V2-Base (DeepSeek-AI, 2024c) (our previous release), Qwen2.5 72B Base (Qwen, 2024b), and LLaMA-3.1 405B Base (AI@Meta, 2024b). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Note that due to the changes in our evaluation framework over the past months, the performance of DeepSeek-V2-Base exhibits a slight difference from our previously reported results. Overall, DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base and Qwen2.5 72B Base, and surpasses LLaMA-3.1 405B Base in the majority of benchmarks, essentially becoming the strongest open-source model.

在表 3 中,我们将 DeepSeek-V3 的基础模型与最先进的开源基础模型进行了比较,包括 DeepSeek-V2-Base(DeepSeek-AI,2024c)(我们之前的版本)、Qwen2.5 72B 基础模型(Qwen,2024b)以及 LLaMA-3.1 405B 基础模型(AI@Meta,2024b)。我们使用内部评估框架对所有这些模型进行了评估,并确保它们处于相同的评估设置下。请注意,由于过去几个月我们评估框架的变化,DeepSeek-V2-Base 的性能与我们之前报告的结果略有不同。总体而言,DeepSeek-V3-Base 在各个方面都优于 DeepSeek-V2-Base 和 Qwen2.5 72B 基础模型,并且在大多数基准测试中超越了 LLaMA-3.1 405B 基础模型,基本上成为了最强的开源模型。

From a more detailed perspective, we compare DeepSeek-V3-Base with the other open-source base models individually. (1) Compared with DeepSeek-V2-Base, due to the improvements in our model architecture, the scale-up of the model size and training tokens, and the enhancement of data quality, DeepSeek-V3-Base achieves significantly better performance as expected. (2) Compared with Qwen2.5 72B Base, the state-of-the-art Chinese open-source model, with only half of the activated parameters, DeepSeek-V3-Base also demonstrates remarkable advantages, especially on English, multilingual, code, and math benchmarks. As for Chinese benchmarks, except for CMMLU, a Chinese multi-subject multiple-choice task, DeepSeek-V3-Base also shows better performance than Qwen2.5 72B. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base also exhibits much better performance on multilingual, code, and math benchmarks. As for English and Chinese language benchmarks, DeepSeek-V3-Base shows competitive or better performance, and is especially good on BBH, MMLU-series, DROP, C-Eval, CMMLU, and CCPM.

Due to our efficient architectures and comprehensive engineering optimizations, DeepSeek-V3 achieves extremely high training efficiency. Under our training framework and infrastructures, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, which is much cheaper than training 72B or 405B dense models.

从更细致的角度来看,我们将 DeepSeek-V3-Base 与其他开源基础模型分别进行了比较。(1)与 DeepSeek-V2-Base 相比,由于模型架构的改进、模型规模和训练标记量的扩大以及数据质量的提升,DeepSeek-V3-Base 如预期般取得了显著更好的性能。(2)与最先进的中文开源模型 Qwen2.5 72B Base 相比,尽管激活参数只有其一半,DeepSeek-V3-Base 仍展现出显著优势,尤其是在英语、多语言、代码和数学基准测试方面。至于中文基准测试,除了 CMMLU(一个中文多学科选择题任务)外,DeepSeek-V3-Base 的表现也优于 Qwen2.5 72B。(3)与拥有 11 倍激活参数的最大的开源模型 LLaMA-3.1 405B Base 相比,DeepSeek-V3-Base 在多语言、代码和数学基准测试方面也表现出色得多。在英语和中文语言基准测试方面,DeepSeek-V3-Base 表现出了具有竞争力或更优的性能,尤其在 BBH、MMLU 系列、DROP、C-Eval、CMMLU 和 CCPM 上表现出色。

由于我们高效的架构和全面的工程优化,DeepSeek-V3 实现了极高的训练效率。在我们的训练框架和基础设施下,训练 DeepSeek-V3 每万亿个标记仅需 18 万 H800 GPU 小时,这比训练 720 亿或 4050 亿参数的密集模型要便宜得多。

Table 4: Ablation results for the MTP strategy. The MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.表 4:MTP 策略的消融实验结果。MTP 策略在大多数评估基准上始终能提升模型性能。

4.5 Discussion讨论

讨论 (Discussion): 进行了多token预测策略和无辅助损失负载均衡策略的消融实验,并分析了批量级负载均衡与序列级负载均衡的区别。

4.5.1 Ablation Studies for Multi-Token Prediction多标记预测的消融研究:在不同规模的基线模型上验证了MTP策略有效性

>> 多token预测策略消融实验 (Ablation Studies for Multi-Token Prediction):在不同规模的基线模型上验证了MTP策略的有效性,结果表明MTP策略一致地提高了模型性能。

In Table 4, we show the ablation results for the MTP strategy. To be specific, we validate the MTP strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 540B tokens. On top of them, keeping the training data and the other architectures the same, we append a 1-depth MTP module onto them and train two models with the MTP strategy for comparison. Note that during inference, we directly discard the MTP module, so the inference costs of the compared models are exactly the same. From the table, we can observe that the MTP strategy consistently enhances the model performance on most of the evaluation benchmarks.

在表 4 中,我们展示了多标记预测(MTP)策略的消融结果。具体而言,我们在两个不同规模的基线模型之上验证了 MTP 策略。在小规模上,我们在 1.33T 个标记上训练了一个包含 15.7B 总参数的基线 MoE 模型。在大规模上,我们在 540B 个标记上训练了一个包含 228.7B 总参数的基线 MoE 模型。在它们之上,保持训练数据和其他架构不变,我们为它们附加了一个 1 层深度的 MTP 模块,并使用 MTP 策略训练了两个模型以作比较。请注意,在推理过程中,我们直接丢弃 MTP 模块,因此所比较模型的推理成本完全相同。从表中可以看出,MTP 策略在大多数评估基准上始终能提升模型性能。
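
为便于理解"在主干之后附加 1 层深度的 MTP 模块、推理时直接丢弃"这一做法,下面给出一个概念性示意(非官方实现:RMSNorm 以 LayerNorm 代替,Transformer block 以 nn.TransformerEncoderLayer 代替,维度与 nhead 等均为示例取值):

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """1 层深度 MTP 模块的概念示意: 将第 i 个位置的主干隐状态与第 i+1 个 token 的嵌入
    拼接, 经线性投影和一个 Transformer block 后预测第 i+2 个 token。
    训练时与主损失一同优化, 推理时整个模块可直接丢弃, 因此不增加推理成本。"""
    def __init__(self, hidden_size: int = 1024, nhead: int = 8):
        super().__init__()
        self.norm_h = nn.LayerNorm(hidden_size)   # 论文中为 RMSNorm, 此处以 LayerNorm 简化
        self.norm_e = nn.LayerNorm(hidden_size)
        self.proj = nn.Linear(2 * hidden_size, hidden_size)
        self.block = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=nhead,
                                                batch_first=True)  # 代替与主干同构的 block

    def forward(self, hidden: torch.Tensor, next_token_emb: torch.Tensor) -> torch.Tensor:
        # hidden / next_token_emb: [batch, seq, hidden]; 输出再接共享输出头得到额外一个 token 的 logits
        x = self.proj(torch.cat([self.norm_h(hidden), self.norm_e(next_token_emb)], dim=-1))
        return self.block(x)
```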

4.5.2 Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy无辅助损失平衡策略的消融研究:在不同规模的基线模型上验证了无辅助损失负载均衡策略的有效性

>> 无辅助损失负载均衡策略消融实验 (Ablation Studies for the Auxiliary-Loss-Free Balancing Strategy):在不同规模的基线模型上验证了无辅助损失负载均衡策略的有效性,结果表明该策略一致地取得了更好的模型性能。

In Table 5, we show the ablation results for the auxiliary-loss-free balancing strategy. We validate this strategy on top of two baseline models across different scales. At the small scale, we train a baseline MoE model comprising 15.7B total parameters on 1.33T tokens. At the large scale, we train a baseline MoE model comprising 228.7B total parameters on 578B tokens. Both of the baseline models purely use auxiliary losses to encourage load balance, and use the sigmoid gating function with top-K affinity normalization. Their hyper-parameters to control the strength of auxiliary losses are the same as DeepSeek-V2-Lite and DeepSeek-V2, respectively. On top of these two baseline models, keeping the training data and the other architectures the same, we remove all auxiliary losses and introduce the auxiliary-loss-free balancing strategy for comparison. From the table, we can observe that the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.

表 5 展示了无辅助损失平衡策略的消融结果。我们在两个不同规模的基线模型之上验证了该策略。在小规模上,我们在 1.33T 个标记上训练了一个包含 15.7B 总参数的基线 MoE 模型。在大规模上,我们在 578B 个标记上训练了一个包含 228.7B 总参数的基线 MoE 模型。这两个基线模型均仅使用辅助损失来鼓励负载平衡,并使用带有 top-K 亲和度归一化的 sigmoid 门控函数。它们控制辅助损失强度的超参数分别与 DeepSeek-V2-Lite 和 DeepSeek-V2 相同。在这两个基线模型之上,保持训练数据和其他架构不变,我们移除所有辅助损失并引入无辅助损失平衡策略进行对比。从表中可以看出,无辅助损失平衡策略在大多数评估基准上始终能取得更好的模型性能。
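
无辅助损失平衡策略的基本思路(据论文前文的架构描述)是为每个专家维护一个偏置项:该偏置只参与 top-K 路由选择、不参与门控权重的计算,并在每个训练步后根据各专家的实际负载上调或下调。下面是一个概念性示意(非官方实现,门控归一化方式做了简化,γ 为示例取值):

```python
import torch

def route_with_bias(scores: torch.Tensor, bias: torch.Tensor, k: int):
    """scores: [num_tokens, num_experts] 的亲和度; bias: [num_experts]。
    偏置只影响选哪 K 个专家, 门控权重仍由原始亲和度归一化得到(此处简化为 softmax)。"""
    topk_idx = torch.topk(scores + bias, k, dim=-1).indices
    gate = torch.gather(scores, -1, topk_idx).softmax(dim=-1)
    return topk_idx, gate

def update_bias(bias: torch.Tensor, expert_load: torch.Tensor, gamma: float = 1e-3):
    """每个训练步后调用: 过载专家的偏置减小 gamma, 欠载专家的偏置增大 gamma, 实现动态均衡。"""
    return bias - gamma * torch.sign(expert_load.float() - expert_load.float().mean())
```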

Table 5: Ablation results for the auxiliary-loss-free balancing strategy. Compared with the purely auxiliary-loss-based method, the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks.表 5:无辅助损失平衡策略的消融实验结果。与纯辅助损失方法相比,无辅助损失策略在大多数评估基准上始终能取得更好的模型性能。

4.5.3 Batch-Wise Load Balance VS. Sequence-Wise Load Balance批量负载均衡与序列负载均衡

>> 批量级负载均衡与序列级负载均衡的比较 (Batch-Wise Load Balance VS. Sequence-Wise Load Balance):比较了无辅助损失平衡与序列级辅助损失的区别,二者的关键差异在于平衡范围是批量级还是序列级。批量级平衡的约束更灵活,允许专家更好地专门化于不同领域。实验结果表明,在达到相近的批量级负载均衡水平时,批量级辅助损失也能达到与无辅助损失方法相似的模型性能。此外,还讨论了批量级负载均衡方法在效率方面面临的挑战,并介绍了相应的解决方案。

The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.

To further investigate the correlation between this flexibility and the advantage in model performance, we additionally design and validate a batch-wise auxiliary loss that encourages load balance on each training batch instead of on each sequence. The experimental results show that, when achieving a similar level of batch-wise load balance, the batch-wise auxiliary loss can also achieve similar model performance to the auxiliary-loss-free method. To be specific, in our experiments with 1B MoE models, the validation losses are: 2.258 (using a sequence-wise auxiliary loss), 2.253 (using the auxiliary-loss-free method), and 2.253 (using a batch-wise auxiliary loss). We also observe similar results on 3B MoE models: the model using a sequence-wise auxiliary loss achieves a validation loss of 2.085, and the models using the auxiliary-loss-free method or a batch-wise auxiliary loss achieve the same validation loss of 2.080.

无辅助损失的均衡与序列辅助损失之间的关键区别在于其均衡范围:批量与序列。与序列辅助损失相比,批量均衡施加了更灵活的约束,因为它不会强制每个序列在域内保持平衡。这种灵活性使专家能够更好地专注于不同的领域。为了验证这一点,我们在 Pile 测试集的不同领域中记录并分析了 16B 辅助损失基线模型和 16B 无辅助损失模型的专家负载情况。如图 9 所示,我们观察到无辅助损失模型如预期般表现出更显著的专家专业化模式。

为了进一步探究这种灵活性与模型性能优势之间的关联,我们还设计并验证了一种批量辅助损失,该损失鼓励在每个训练批次而非每个序列上实现负载均衡。实验结果表明,在实现相似水平的批量负载均衡时,批量辅助损失也能达到与无辅助损失方法相似的模型性能。具体而言,在我们对 10 亿参数的 MoE 模型进行的实验中,验证损失分别为:使用序列辅助损失为 2.258,使用无辅助损失方法为 2.253,使用批量辅助损失为 2.253。在 30 亿参数的 MoE 模型上我们也观察到了类似的结果:使用序列辅助损失的模型验证损失为 2.085,而使用无辅助损失方法或批量辅助损失的模型验证损失均为 2.080。
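
两者的区别仅在于统计负载的范围:序列级在每条序列内部各算一次均衡损失再取平均,批量级则把整个批次的 token 合在一起只算一次。下面用一个简化的均衡损失来展示这一差别(示意写法,并非论文公式的精确实现):

```python
import torch

def balance_loss(probs: torch.Tensor, topk_idx: torch.Tensor, num_experts: int, alpha: float):
    """简化的均衡损失: 各专家被选频率 f 与平均路由概率 P 的点积。
    probs: [N, E] 路由概率; topk_idx: [N, K] 每个 token 选中的专家编号。"""
    f = torch.zeros(num_experts).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(topk_idx.numel()))
    f = f * num_experts / topk_idx.numel()
    P = probs.mean(dim=0)
    return alpha * (f * P).sum()

def sequence_wise(probs, topk_idx, num_experts, alpha):
    # probs: [B, T, E], topk_idx: [B, T, K]; 对每条序列各算一次, 强制序列内均衡
    return torch.stack([balance_loss(p, i, num_experts, alpha)
                        for p, i in zip(probs, topk_idx)]).mean()

def batch_wise(probs, topk_idx, num_experts, alpha):
    # 整个批次合并统计只算一次, 约束更宽松, 允许专家按领域分化
    return balance_loss(probs.reshape(-1, probs.shape[-1]),
                        topk_idx.reshape(-1, topk_idx.shape[-1]), num_experts, alpha)
```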

In addition, although the batch-wise load balancing methods show consistent performance advantages, they also face two potential challenges in efficiency: (1) load imbalance within certain sequences or small batches, and (2) domain-shift-induced load imbalance during inference. The first challenge is naturally addressed by our training framework that uses large-scale expert parallelism and data parallelism, which guarantees a large size of each micro-batch. For the second challenge, we also design and implement an efficient inference framework with redundant expert deployment, as described in Section 3.4, to overcome it.

此外,尽管批量负载均衡方法表现出一致的性能优势,但它们在效率方面也面临两个潜在挑战:(1)某些序列或小批量内的负载不均衡;(2)推理过程中由领域偏移导致的负载不均衡。第一个挑战通过我们使用大规模专家并行和数据并行的训练框架自然得到解决,该框架保证了每个微批量的规模较大。对于第二个挑战,我们在第 3.4 节中描述并设计实现了一个具有冗余专家部署的高效推理框架来克服它。

Figure 9: Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set. The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.图 9:在 Pile 测试集的三个领域中,无辅助损失模型和基于辅助损失模型的专家负载情况。无辅助损失模型显示出比基于辅助损失模型更明显的专家专业化模式。相对专家负载表示实际专家负载与理论平衡专家负载之间的比率。由于篇幅限制,我们仅展示两层的结果作为示例,所有层的结果见附录 C。

5 Post-Training

本节内容主要描述DeepSeek-V3模型的后训练阶段,包括监督微调、强化学习以及相应的评估和讨论。

DeepSeek-V3进行了监督微调(SFT)和强化学习(RL)两个阶段的后训练,以使其与人类偏好对齐并进一步释放其潜力。SFT数据包含150万个跨多个领域的实例,并根据不同领域的需求采用不同的数据创建方法。RL过程使用了基于规则和基于模型的两种奖励模型,并采用了Group Relative Policy Optimization (GRPO)算法。从DeepSeek-R1系列模型中蒸馏推理能力,显著提高了模型在数学和代码方面的推理性能。

本后训练部分重点介绍了如何通过SFT和RL来提升模型的性能和对齐程度。 从DeepSeek-R1模型中蒸馏知识,以及采用GRPO算法,都是有效的策略。 对奖励模型的详细描述,以及对蒸馏策略和自奖励机制的讨论,都体现了论文的深度和严谨性。

5.1 Supervised Fine-Tuning监督微调:采用150万个指令微调数据集

监督微调 (Supervised Fine-Tuning): 使用150万个跨多个领域的指令微调数据集进行微调,其中推理数据利用内部DeepSeek-R1模型生成

We curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains, with each domain employing distinct data creation methods tailored to its specific requirements.

我们精心策划了指令微调数据集,其中包含 150 万个实例,涵盖多个领域,每个领域都采用了针对其特定需求定制的不同数据创建方法。

(1)、数据集构建:构建包含150万个样本的指令微调数据集,涵盖多个领域,每个领域采用不同的数据创建方法

>> 推理数据 (Reasoning Data):利用内部DeepSeek-R1模型生成与数学、代码竞赛和逻辑难题相关的数据。R1生成的数据精度高,但存在过度思考、格式不佳和长度过长的问题。目标是平衡R1生成数据的准确性和常规格式数据的清晰简洁性。 为此,采用了一种两阶段方法:首先训练领域专家模型(结合SFT和RL),然后用该专家模型生成两种类型的SFT样本:<问题,原始答案> 和 <系统提示,问题,R1答案>。

系统提示旨在引导模型生成包含反思和验证机制的答案。RL阶段使用高温采样,即使没有明确的系统提示,也能整合R1生成的数据和原始数据的模式。最后,使用拒绝采样筛选高质量的SFT数据

>> 非推理数据 (Non-Reasoning Data):对于创意写作、角色扮演和简单的问答等非推理数据,使用DeepSeek-V2.5生成答案,并由人工标注者验证数据的准确性和正确性。

Reasoning Data推理数据:采用DeepSeek-R1生成+两阶段方法(基于SFT和RL训练领域专家模型→采用专家模型生成两种类型的SFT样本→采用拒绝采样筛选高质量SFT数据)

For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.

To establish our methodology, we begin by developing an expert model tailored to a specific domain, such as code, mathematics, or general reasoning, using a combined Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline. This expert model serves as a data generator for the final model. The training process involves generating two distinct types of SFT samples for each instance: the first couples the problem with its original response in the format of <problem, original response>, while the second incorporates a system prompt alongside the problem and the R1 response in the format of <system prompt, problem, R1 response>.

对于与推理相关的数据集,包括专注于数学、代码竞赛问题和逻辑谜题的数据集,我们通过利用内部的 DeepSeek-R1 模型来生成数据。具体而言,虽然 R1 生成的数据具有很强的准确性,但它存在过度思考、格式不佳和篇幅过长等问题。我们的目标是在 R1 生成的推理数据的高准确性与常规格式化推理数据的清晰简洁之间取得平衡。

为了建立我们的方法,我们首先使用监督微调(SFT)和强化学习(RL)相结合的训练流程开发针对特定领域的专家模型,例如代码、数学或通用推理。该专家模型作为最终模型的数据生成器。训练过程包括为每个实例生成两种不同类型的 SFT 样本:第一种将问题与其原始响应以 <问题,原始响应> 的格式配对,第二种则将系统提示与问题和 R1 响应一起以 <系统提示,问题,R1 响应> 的格式组合。

The system prompt is meticulously designed to include instructions that guide the model toward producing responses enriched with mechanisms for reflection and verification. During the RL phase, the model leverages high-temperature sampling to generate responses that integrate patterns from both the R1-generated and original data, even in the absence of explicit system prompts. After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically.

Upon completing the RL training phase, we implement rejection sampling to curate high-quality SFT data for the final model, where the expert models are used as data generation sources. This method ensures that the final training data retains the strengths of DeepSeek-R1 while producing responses that are concise and effective.

系统提示经过精心设计,包含引导模型生成富含反思和验证机制的回复的指令。在强化学习阶段,模型利用高温采样生成融合了 R1 生成数据和原始数据模式的回复,即使没有明确的系统提示也是如此。经过数百次强化学习步骤,中间强化学习模型学会了融入 R1 模式,从而从战略上提升了整体性能。

在完成强化学习训练阶段后,我们采用拒绝采样来为最终模型精选高质量的 SFT 数据,其中专家模型作为数据生成源。这种方法确保最终训练数据保留了 DeepSeek-R1 的优势,同时生成的回复简洁有效。
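
上述"专家模型生成候选 → 拒绝采样筛选"的流程可以用下面的伪代码式示意来概括(非官方实现,expert_model、reward_fn、is_well_formatted 等接口名均为假设):

```python
def build_reasoning_sft_data(problems, expert_model, reward_fn, is_well_formatted,
                             n_candidates: int = 8, max_len: int = 4096):
    """概念示意: 用经 SFT+RL 训练的领域专家模型为每个问题采样多个候选回复,
    再通过拒绝采样只保留答案正确、格式规范且长度可控的样本, 作为最终 SFT 数据。"""
    sft_data = []
    for q in problems:
        candidates = [expert_model.generate(q) for _ in range(n_candidates)]
        kept = [c for c in candidates
                if reward_fn(q, c) > 0 and is_well_formatted(c) and len(c) < max_len]
        if kept:
            # 在保留 R1 推理优势的同时偏向简洁的回复
            sft_data.append({"prompt": q, "response": min(kept, key=len)})
    return sft_data
```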

Non-Reasoning Data非推理数据:采用DeepSeek-V2.5生成答案→人工标注者验证

For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.

对于非推理数据,例如创意写作、角色扮演和简单问答,我们使用 DeepSeek-V2.5 生成响应,并安排人工标注员来验证数据的准确性和正确性。

(2)、SFT设置:2轮迭代微调+余弦衰减策略+每个序列由多个样本打包而成+采用样本掩码策略(确保样本之间相互隔离)

使用SFT数据集对DeepSeek-V3-Base进行两轮迭代的微调,采用余弦衰减学习率调度,起始学习率为5×10⁻⁶,逐渐减小到1×10⁻⁶。训练过程中,每个序列由多个样本打包而成,但采用样本掩码策略,确保样本之间相互隔离。

SFT Settings设置

We fine-tune DeepSeek-V3-Base for two epochs using the SFT dataset, using the cosine decay learning rate scheduling that starts at 5×10−6 and gradually decreases to 1×10−6. During training, each single sequence is packed from multiple samples. However, we adopt a sample masking strategy to ensure that these examples remain isolated and mutually invisible.

我们使用 SFT 数据集对 DeepSeek-V3-Base 进行两轮微调,采用余弦衰减学习率调度,从 5×10−6 开始逐渐降低至 1×10−6。在训练过程中,每个单一序列由多个样本打包而成。然而,我们采用样本掩码策略以确保这些示例保持独立且彼此不可见。
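
下面用一小段代码示意这两个设置(非官方实现):一是从 5×10⁻⁶ 余弦衰减到 1×10⁻⁶ 的学习率调度;二是样本打包时的块状因果注意力掩码,保证打包进同一序列的不同样本彼此不可见:

```python
import math
import torch

def cosine_lr(step: int, total_steps: int, lr_max: float = 5e-6, lr_min: float = 1e-6):
    """余弦衰减学习率: 从 lr_max 逐渐降到 lr_min(示意, 未包含 warmup 等细节)。"""
    t = step / max(total_steps, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

def packing_attention_mask(sample_ids: torch.Tensor) -> torch.Tensor:
    """sample_ids: [seq_len], 标记打包序列中每个 token 属于哪个原始样本。
    返回 [seq_len, seq_len] 布尔掩码: 只允许关注同一样本内且位置不晚于自己的 token,
    从而使打包在同一序列中的样本相互隔离、互不可见。"""
    n = sample_ids.shape[0]
    same_sample = sample_ids.unsqueeze(0) == sample_ids.unsqueeze(1)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return same_sample & causal

# 用法示意: 一条序列由 3 个样本打包而成
mask = packing_attention_mask(torch.tensor([0, 0, 0, 1, 1, 2, 2, 2]))
```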

5.2 Reinforcement Learning强化学习:基于规则的奖励模型和基于模型的奖励模型+采用GRPO算法

强化学习 (Reinforcement Learning): 采用基于规则的奖励模型和基于模型的奖励模型,并使用Group Relative Policy Optimization (GRPO) 算法。

5.2.1 Reward Model奖励模型

We employ a rule-based Reward Model (RM) and a model-based RM in our RL process.

我们在强化学习过程中同时采用基于规则的奖励模型(RM)和基于模型的奖励模型。

Rule-Based RM基于规则的奖励机制确定性/可靠性:适用于特定规则验证的问题(例如某些数学题、LeetCode题)

基于规则的奖励模型 (Rule-Based RM):对于可以使用特定规则验证的问题(例如某些数学题、LeetCode题),使用基于规则的奖励系统确定反馈。这种方法具有较高的可靠性,不易被操纵或利用。

For questions that can be validated using specific rules, we adopt a rule-based reward system to determine the feedback. For instance, certain math problems have deterministic results, and we require the model to provide the final answer within a designated format (e.g., in a box), allowing us to apply rules to verify the correctness. Similarly, for LeetCode problems, we can utilize a compiler to generate feedback based on test cases. By leveraging rule-based validation wherever possible, we ensure a higher level of reliability, as this approach is resistant to manipulation or exploitation.

对于能够通过特定规则进行验证的问题,我们采用基于规则的奖励系统来确定反馈。例如,某些数学问题有确定的结果,我们要求模型以指定格式(例如在方框内)提供最终答案,这样我们就可以应用规则来验证其正确性。同样,对于 LeetCode 问题,我们可以利用编译器根据测试用例生成反馈。通过在可能的情况下利用基于规则的验证,我们确保了更高的可靠性,因为这种方法不易被操纵或利用。
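
以"要求把最终答案写在 \boxed{} 中,再用规则核对"为例,下面是一个最小的基于规则的奖励函数示意(非官方实现;对于代码题,则可改为用编译器运行测试用例后给分):

```python
import re

def rule_based_math_reward(response: str, ground_truth: str) -> float:
    """示意: 从回复中提取最后一个 \\boxed{...} 作为最终答案, 与标准答案做字符串比对。"""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0                      # 未按指定格式给出最终答案
    answer = matches[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(rule_based_math_reward(r"推理过程…… 所以最终答案是 \boxed{42}", "42"))  # 1.0
```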

Model-Based RM基于模型的奖励机制:适用于自由格式答案的问题(例如创意写作),奖励模型采用DeepSeek-V3 SFT+构建包含思维链的偏好数据+提高可靠性

基于模型的奖励模型 (Model-Based RM):对于具有自由格式答案的问题(例如创意写作),使用奖励模型来判断答案是否与预期答案匹配。奖励模型使用DeepSeek-V3 SFT检查点进行训练,并构建包含思维链的偏好数据以提高可靠性,降低奖励黑客攻击的风险。

For questions with free-form ground-truth answers, we rely on the reward model to determine whether the response matches the expected ground-truth. Conversely, for questions without a definitive ground-truth, such as those involving creative writing, the reward model is tasked with providing feedback based on the question and the corresponding answer as inputs. The reward model is trained from the DeepSeek-V3 SFT checkpoints. To enhance its reliability, we construct preference data that not only provides the final reward but also includes the chain-of-thought leading to the reward. This approach helps mitigate the risk of reward hacking in specific tasks.

对于具有自由形式真实答案的问题,我们依靠奖励模型来判断回答是否符合预期的真实答案。相反,对于没有明确真实答案的问题,例如涉及创意写作的问题,奖励模型的任务是根据问题和相应的回答作为输入提供反馈。奖励模型是从 DeepSeek-V3 SFT 检查点训练出来的。为了提高其可靠性,我们构建了偏好数据,该数据不仅提供最终奖励,还包括得出奖励的推理过程。这种方法有助于降低特定任务中奖励操纵的风险。

5.2.2 Group Relative Policy Optimization组相对策略优化:采用GRPO算法从组分数估计基线+最大化奖励+控制KL散度+整合多域提示

组相对策略优化 (Group Relative Policy Optimization):采用GRPO算法,避免使用与策略模型大小相同的评论家模型,而是从组分数估计基线。目标函数旨在最大化奖励,同时控制策略模型与参考模型之间的KL散度。在RL过程中,整合了来自不同领域(编码、数学、写作、角色扮演和问答)的提示,使模型更符合人类偏好,并在基准测试中提升性能,尤其是在SFT数据有限的情况下。

Similar to DeepSeek-V2 (DeepSeek-AI, 2024c), we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically with the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question q, GRPO samples a group of outputs {o_1, o_2, …, o_G} from the old policy model π_θ_old and then optimizes the policy model π_θ by maximizing the following objective:

与 DeepSeek-V2(DeepSeek-AI,2024c)类似,我们采用组相对策略优化(GRPO)(Shao 等人,2024),该方法摒弃了通常与策略模型大小相同的评估模型,而是从组得分中估计基线。具体而言,对于每个问题 q,GRPO 从旧策略模型 π_θ_old 中采样一组输出 {o_1, o_2, …, o_G},然后通过最大化以下目标来优化策略模型 π_θ:

where ε and β are hyper-parameters; π_ref is the reference model; and A_i is the advantage, derived from the rewards {r_1, r_2, …, r_G} corresponding to the outputs within each group:

We incorporate prompts from diverse domains, such as coding, math, writing, role-playing, and question answering, during the RL process. This approach not only aligns the model more closely with human preferences but also enhances performance on benchmarks, especially in scenarios where available SFT data are limited.

其中 ε 和 β 是超参数;π_ref 是参考模型;A_i 是优势,由每个组内输出对应的奖励 {r_1, r_2, …, r_G} 推导得出:
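
原文此处的目标函数与优势公式在网页转载中缺失。按 GRPO 的原始定义(Shao et al., 2024)补记如下,仅供对照参考:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \mathbb{E}\!\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O\,|\,q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left(
\min\!\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\mathrm{old}}}(o_i|q)}A_i,\;
\operatorname{clip}\!\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\mathrm{old}}}(o_i|q)},\,1-\varepsilon,\,1+\varepsilon\right)A_i\right)
- \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right)

A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \ldots, r_G\})}{\operatorname{std}(\{r_1, r_2, \ldots, r_G\})}
```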

在强化学习过程中,我们纳入了来自不同领域的提示,例如编程、数学、写作、角色扮演和问答。这种方法不仅使模型更贴近人类偏好,而且在可用的微调数据有限的情况下,还能提升在基准测试中的表现。

5.3 Evaluations评估:标准评估、开放式评估

评估 (Evaluations): 在更多基准上对DeepSeek-V3进行评估,包括标准评估和开放式评估,并比较了DeepSeek-V3作为生成式奖励模型的性能。

5.3.1 Evaluation Settings评估设置

>> 评估基准 (Evaluation Benchmarks):除了基础模型测试中使用的基准外,还在多个基准上对指令模型进行评估,包括IFEval、FRAMES、LongBench v2、GPQA、SimpleQA、C-SimpleQA、SWE-Bench Verified、Aider、LiveCodeBench、Codeforces、CNMO 2024和AIME 2024等。

>> 对比基线 (Compared Baselines):将DeepSeek-V3与多个强大的基线模型进行比较,包括DeepSeek-V2系列、Qwen2.5 72B Instruct、LLaMA-3.1 405B Instruct、Claude-Sonnet-3.5-1022和GPT-4o-0513。

Evaluation Benchmarks评估基准:基础模型基准+指令模型基准

Apart from the benchmark we used for base model testing, we further evaluate instructed models on IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), LongBench v2 (Bai et al., 2024), GPQA (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider, LiveCodeBench (Jain et al., 2024) (questions from August 2024 to November 2024), Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024).

除了用于基础模型测试的基准之外,我们还对指令模型在 IFEval(Zhou 等人,2023 年)、FRAMES(Krishna 等人,2024 年)、LongBench v2(Bai 等人,2024 年)、GPQA(Rein 等人,2023 年)、SimpleQA(OpenAI,2024 年 c)、C-SimpleQA(He 等人,2024 年)、SWE-Bench Verified(OpenAI,2024 年 d)、Aider、LiveCodeBench(Jain 等人,2024 年)(2024 年 8 月至 2024 年 11 月的问题)、Codeforces、2024 年中国高中数学奥林匹克竞赛(CNMO 2024)以及 2024 年美国数学邀请赛(AIME 2024)(MAA,2024 年)上进行了评估。

Compared Baselines对比基线模型:对比DeepSeek-V2系列、Qwen2.5 72B Instruct、LLaMA-3.1 405B Instruct、Claude-Sonnet-3.5-1022和GPT-4o-0513

We conduct comprehensive evaluations of our chat model against several strong baselines, including DeepSeek-V2-0506, DeepSeek-V2.5-0905, Qwen2.5 72B Instruct, LLaMA-3.1 405B Instruct, Claude-Sonnet-3.5-1022, and GPT-4o-0513. For the DeepSeek-V2 model series, we select the most representative variants for comparison. For closed-source models, evaluations are performed through their respective APIs.

我们对聊天模型进行了全面评估,将其与多个强大的基线模型进行了比较,包括 DeepSeek-V2-0506、DeepSeek-V2.5-0905、Qwen2.5 72B Instruct、LLaMA-3.1 405B Instruct、Claude-Sonnet-3.5-1022 以及 GPT-4o-0513。对于 DeepSeek-V2 系列模型,我们选取了最具代表性的变体进行对比。对于闭源模型,我们通过各自的 API 进行评估。

Detailed Evaluation Configurations详细评估配置

>> 对于MMLU、DROP、GPQA和SimpleQA等标准基准,采用simple-evals框架的评估提示。

>> 对其他数据集,遵循其原始评估协议。

>> 对代码和数学基准,使用CoT和非CoT方法评估LiveCodeBench的性能。Codeforces数据集使用参赛者百分比进行衡量。SWE-Bench Verified使用agentless框架进行评估。Aider相关基准使用“diff”格式进行评估。

>> 数学评估使用温度为0.7,结果取16次运行的平均值。所有模型的每个基准最大输出长度为8192个token。

For standard benchmarks including MMLU, DROP, GPQA, and SimpleQA, we adopt the evaluation prompts from the simple-evals framework. We utilize the Zero-Eval prompt format (Lin, 2024) for MMLU-Redux in a zero-shot setting. For other datasets, we follow their original evaluation protocols with default prompts as provided by the dataset creators. For code and math benchmarks, the HumanEval-Mul dataset includes 8 mainstream programming languages (Python, Java, Cpp, C#, JavaScript, TypeScript, PHP, and Bash) in total. We use CoT and non-CoT methods to evaluate model performance on LiveCodeBench, where the data are collected from August 2024 to November 2024. The Codeforces dataset is measured using the percentage of competitors. SWE-Bench verified is evaluated using the agentless framework (Xia et al., 2024). We use the “diff” format to evaluate the Aider-related benchmarks. For mathematical assessments, AIME and CNMO 2024 are evaluated with a temperature of 0.7, and the results are averaged over 16 runs, while MATH-500 employs greedy decoding. We allow all models to output a maximum of 8192 tokens for each benchmark.

对于包括 MMLU、DROP、GPQA 和 SimpleQA 在内的标准基准测试,我们采用 simple-evals 框架中的评估提示。对于 MMLU-Redux,我们在零样本设置中使用 Zero-Eval 提示格式(Lin,2024)。对于其他数据集,我们遵循数据集创建者提供的原始评估协议和默认提示。对于代码和数学基准测试,HumanEval-Mul 数据集总共包含 8 种主流编程语言(Python、Java、Cpp、C#、JavaScript、TypeScript、PHP 和 Bash)。我们在 LiveCodeBench 上使用 CoT 和非 CoT 方法评估模型性能,其中数据收集时间为 2024 年 8 月至 2024 年 11 月。Codeforces 数据集使用参赛者的百分比进行衡量。SWE-Bench verified 使用无代理框架(Xia 等人,2024)进行评估。我们使用“diff”格式评估与 Aider 相关的基准测试。对于数学评估,AIME 和 CNMO 2024 以 0.7 的温度进行评估,结果取 16 次运行的平均值,而 MATH-500 则采用贪婪解码。我们允许所有模型在每个基准测试中输出最多 8192 个标记。

Table 6: Comparison between DeepSeek-V3 and other representative chat models. All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.表 6:DeepSeek-V3 与其他代表性聊天模型的比较。所有模型均在限制输出长度为 8K 的配置下进行评估。样本数量少于 1000 个的基准测试会使用不同的温度设置多次进行测试,以得出可靠的最终结果。DeepSeek-V3 是表现最佳的开源模型,并且在与前沿的闭源模型的对比中也展现出具有竞争力的性能。

5.3.2 Standard Evaluation标准评估:在大多数基准测试中表现最佳

>> 标准评估 (Standard Evaluation):DeepSeek-V3在大多数基准测试中表现最佳,尤其是在英语和中文基准测试中,在代码和数学基准测试中也表现出色,与GPT-4o和Claude-3.5-Sonnet等顶级模型相比具有竞争力。详细分析了其在MMLU系列、DROP、FRAMES、LongBench v2、SimpleQA、C-SimpleQA、HumanEval-Mul、LiveCodeBench、AIME、MATH-500、CNMO 2024、C-Eval和CLUEWSC等基准上的表现。

Table 6 presents the evaluation results, showcasing that DeepSeek-V3 stands as the best-performing open-source model. Additionally, it is competitive against frontier closed-source models like GPT-4o and Claude-3.5-Sonnet.

表 6 展示了评估结果,表明 DeepSeek-V3 是表现最佳的开源模型。此外,它在与前沿的闭源模型(如 GPT-4o 和 Claude-3.5-Sonnet)的竞争中也颇具竞争力。

English Benchmarks英语基准测试

MMLU is a widely recognized benchmark designed to assess the performance of large language models, across diverse knowledge domains and tasks. DeepSeek-V3 demonstrates competitive performance, standing on par with top-tier models such as LLaMA-3.1-405B, GPT-4o, and Claude-Sonnet 3.5, while significantly outperforming Qwen2.5 72B. Moreover, DeepSeek-V3 excels in MMLU-Pro, a more challenging educational knowledge benchmark, where it closely trails Claude-Sonnet 3.5. On MMLU-Redux, a refined version of MMLU with corrected labels, DeepSeek-V3 surpasses its peers. In addition, on GPQA-Diamond, a PhD-level evaluation testbed, DeepSeek-V3 achieves remarkable results, ranking just behind Claude 3.5 Sonnet and outperforming all other competitors by a substantial margin.

MMLU 是一项广受认可的基准测试,旨在评估大型语言模型在各种知识领域和任务中的表现。DeepSeek-V3 展现出具有竞争力的性能,与 LLaMA-3.1-405B、GPT-4o 和 Claude-Sonnet 3.5 等顶级模型不相上下,同时大幅超越了 Qwen2.5 72B。此外,在更具挑战性的教育知识基准 MMLU-Pro 中,DeepSeek-V3 表现优异,紧随 Claude-Sonnet 3.5 之后。在经过标签修正的 MMLU 改进版 MMLU-Redux 中,DeepSeek-V3 超越了所有竞争对手。另外,在博士水平的评估测试平台 GPQA-Diamond 上,DeepSeek-V3 取得了显著成果,仅次于 Claude 3.5 Sonnet,大幅领先于其他所有参赛者。

In long-context understanding benchmarks such as DROP, LongBench v2, and FRAMES, DeepSeek-V3 continues to demonstrate its position as a top-tier model. It achieves an impressive 91.6 F1 score in the 3-shot setting on DROP, outperforming all other models in this category. On FRAMES, a benchmark requiring question-answering over 100k token contexts, DeepSeek-V3 closely trails GPT-4o while outperforming all other models by a significant margin. This demonstrates the strong capability of DeepSeek-V3 in handling extremely long-context tasks. The long-context capability of DeepSeek-V3 is further validated by its best-in-class performance on LongBench v2, a dataset that was released just a few weeks before the launch of DeepSeek V3. On the factual knowledge benchmark, SimpleQA, DeepSeek-V3 falls behind GPT-4o and Claude-Sonnet, primarily due to its design focus and resource allocation. DeepSeek-V3 assigns more training tokens to learn Chinese knowledge, leading to exceptional performance on the C-SimpleQA. On the instruction-following benchmark, DeepSeek-V3 significantly outperforms its predecessor, DeepSeek-V2-series, highlighting its improved ability to understand and adhere to user-defined format constraints.

在 DROP、LongBench v2 和 FRAMES 等长上下文理解基准测试中,DeepSeek-V3 继续展现其顶级模型的地位。在 DROP 的 3-shot 设置中,它取得了令人瞩目的 91.6 F1 分数,超越了该类别中的所有其他模型。在需要在超过 10 万标记的上下文中进行问答的 FRAMES 基准测试中,DeepSeek-V3 紧随 GPT-4o 之后,大幅领先于所有其他模型。这表明 DeepSeek-V3 在处理极长上下文的任务方面具有强大的能力。DeepSeek-V3 的长上下文能力还通过其在 LongBench v2 上的卓越表现得到了进一步验证,该数据集在 DeepSeek V3 发布前几周才发布。在事实知识基准测试 SimpleQA 上,DeepSeek-V3 落后于 GPT-4o 和 Claude-Sonnet,这主要是由于其设计重点和资源分配。DeepSeek-V3 分配了更多的训练标记来学习中文知识,从而在 C-SimpleQA 上表现出色。在指令遵循基准测试中,DeepSeek-V3 明显优于其前身 DeepSeek-V2 系列,突显了其在理解并遵循用户定义的格式约束方面的能力提升。

Code and Math Benchmarks代码和数学基准测试

Coding is a challenging and practical task for LLMs, encompassing engineering-focused tasks like SWE-Bench-Verified and Aider, as well as algorithmic tasks such as HumanEval and LiveCodeBench. In engineering tasks, DeepSeek-V3 trails behind Claude-Sonnet-3.5-1022 but significantly outperforms open-source models. The open-source DeepSeek-V3 is expected to foster advancements in coding-related engineering tasks. By providing access to its robust capabilities, DeepSeek-V3 can drive innovation and improvement in areas such as software engineering and algorithm development, empowering developers and researchers to push the boundaries of what open-source models can achieve in coding tasks. In algorithmic tasks, DeepSeek-V3 demonstrates superior performance, outperforming all baselines on benchmarks like HumanEval-Mul and LiveCodeBench. This success can be attributed to its advanced knowledge distillation technique, which effectively enhances its code generation and problem-solving capabilities in algorithm-focused tasks.

对于大型语言模型(LLM)来说,编程是一项具有挑战性且实用的任务,涵盖了以工程为重点的任务,如 SWE-Bench-Verified 和 Aider,以及算法任务,如 HumanEval 和 LiveCodeBench。在工程任务方面,DeepSeek-V3 落后于 Claude-Sonnet-3.5-1022,但显著优于开源模型。开源的 DeepSeek-V3 有望推动与编程相关的工程任务的发展。通过提供其强大的功能,DeepSeek-V3 可以推动软件工程和算法开发等领域的创新和改进,使开发人员和研究人员能够突破开源模型在编程任务中的能力界限。在算法任务方面,DeepSeek-V3 表现出色,在 HumanEval-Mul 和 LiveCodeBench 等基准测试中超越了所有基线。这一成功可归因于其先进的知识蒸馏技术,该技术有效地增强了其在算法任务中的代码生成和问题解决能力。

On math benchmarks, DeepSeek-V3 demonstrates exceptional performance, significantly surpassing baselines and setting a new state-of-the-art for non-o1-like models. Specifically, on AIME, MATH-500, and CNMO 2024, DeepSeek-V3 outperforms the second-best model, Qwen2.5 72B, by approximately 10% in absolute scores, which is a substantial margin for such challenging benchmarks. This remarkable capability highlights the effectiveness of the distillation technique from DeepSeek-R1, which has been proven highly beneficial for non-o1-like models.

在数学基准测试中,DeepSeek-V3 展现出了卓越的性能,大幅超越了基准模型,并为非 o1 类型的模型树立了新的标杆。具体而言,在 AIME、MATH-500 和 CNMO 2024 这些测试中,DeepSeek-V3 的绝对得分比排名第二的 Qwen2.5 72B 模型高出约 10%,对于如此具有挑战性的基准测试来说,这是一个相当大的优势。这一出色的能力凸显了从 DeepSeek-R1 中提炼出的技术对于非 o1 类型模型的显著益处。

Chinese Benchmarks中文基准测试

Qwen and DeepSeek are two representative model series with robust support for both Chinese and English. On the factual benchmark Chinese SimpleQA, DeepSeek-V3 surpasses Qwen2.5-72B by 16.4 points, despite Qwen2.5 being trained on a larger corpus comprising 18T tokens, which are 20% more than the 14.8T tokens that DeepSeek-V3 is pre-trained on.

On C-Eval, a representative benchmark for Chinese educational knowledge evaluation, and CLUEWSC (Chinese Winograd Schema Challenge), DeepSeek-V3 and Qwen2.5-72B exhibit similar performance levels, indicating that both models are well-optimized for challenging Chinese-language reasoning and educational tasks.

Qwen 和 DeepSeek 是两个具有强大中英文支持能力的代表性模型系列。在事实基准测试 Chinese SimpleQA 中,尽管 Qwen2.5 是基于 18 万亿个标记的更大语料库训练的,比 DeepSeek-V3 预训练所用的 14.8 万亿个标记多出 20%,但 DeepSeek-V3 仍以 16.4 分的优势领先于 Qwen2.5-72B。

在代表性的中文教育知识评估基准 C-Eval 以及中文 Winograd Schema Challenge(CLUEWSC)测试中,DeepSeek-V3 和 Qwen2.5-72B 表现相当,这表明这两个模型在应对具有挑战性的中文推理和教育任务方面都进行了良好的优化。

Table 7: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.表 7:英语开放式对话评估。对于 AlpacaEval 2.0,我们使用长度控制下的胜率作为衡量指标。

5.3.3 Open-Ended Evaluation开放式评估:使用LLM作为评判者

>> 开放式评估 (Open-Ended Evaluation):使用LLM作为评判者,在AlpacaEval 2.0和Arena-Hard等开放式生成任务上进行评估。DeepSeek-V3在Arena-Hard上取得了超过85%的胜率,成为首个在该基准上超过85%的开源模型。在AlpacaEval 2.0上也表现出色。

In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges, with the results shown in Table 7. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024a), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. On Arena-Hard, DeepSeek-V3 achieves an impressive win rate of over 86% against the baseline GPT-4-0314, performing on par with top-tier models like Claude-Sonnet-3.5-1022. This underscores the robust capabilities of DeepSeek-V3, especially in dealing with complex prompts, including coding and debugging tasks. Furthermore, DeepSeek-V3 achieves a groundbreaking milestone as the first open-source model to surpass 85% on the Arena-Hard benchmark. This achievement significantly bridges the performance gap between open-source and closed-source models, setting a new standard for what open-source models can accomplish in challenging domains.

除了标准基准测试外,我们还使用 LLM 作为评判者对我们的模型在开放式生成任务上的表现进行了评估,结果如表 7 所示。具体而言,我们遵循了 AlpacaEval 2.0(Dubois 等人,2024 年)和 Arena-Hard(Li 等人,2024a)的原始配置,这两项任务均采用 GPT-4-Turbo-1106 作为评判者进行两两比较。在 Arena-Hard 任务中,DeepSeek-V3 对比基准模型 GPT-4-0314 获得了超过 86% 的胜率,表现与顶级模型如 Claude-Sonnet-3.5-1022 相当。这突显了 DeepSeek-V3 在处理复杂提示(包括编码和调试任务)方面的强大能力。此外,DeepSeek-V3 达到了一个开创性的里程碑,成为首个在 Arena-Hard 基准测试中胜率超过 85% 的开源模型。这一成就显著缩小了开源模型与闭源模型之间的性能差距,为开源模型在具有挑战性的领域所能达成的成就树立了新的标准。

Similarly, DeepSeek-V3 showcases exceptional performance on AlpacaEval 2.0, outperforming both closed-source and open-source models. This demonstrates its outstanding proficiency in writing tasks and handling straightforward question-answering scenarios. Notably, it surpasses DeepSeek-V2.5-0905 by a significant margin of 20%, highlighting substantial improvements in tackling simple tasks and showcasing the effectiveness of its advancements.

同样,在 AlpacaEval 2.0 任务中,DeepSeek-V3 表现卓越,超越了闭源和开源模型。这表明其在写作任务和处理简单问答场景方面具有卓越的能力。值得注意的是,它比 DeepSeek-V2.5-0905 高出 20% 的显著优势,突显了其在处理简单任务方面的重大改进,并展示了其进步的有效性。

5.3.4 DeepSeek-V3 as a Generative Reward Model作为生成奖励模型的 DeepSeek-V3:判断能力与GPT-4o-0806、Claude-3.5-Sonnet-1022的最佳版本相当,并优于其他版本

>> DeepSeek-V3作为生成式奖励模型 (DeepSeek-V3 as a Generative Reward Model):将DeepSeek-V3与GPT-4o和Claude-3.5在RewardBench上进行比较,其性能与GPT-4o-0806和Claude-3.5-Sonnet-1022的最佳版本相当,甚至优于其他版本。

We compare the judgment ability of DeepSeek-V3 with state-of-the-art models, namely GPT-4o and Claude-3.5. Table 8 presents the performance of these models in RewardBench (Lambert et al., 2024). DeepSeek-V3 achieves performance on par with the best versions of GPT-4o-0806 and Claude-3.5-Sonnet-1022, while surpassing other versions. Additionally, the judgment ability of DeepSeek-V3 can also be enhanced by the voting technique. Therefore, we employ DeepSeek-V3 along with voting to offer self-feedback on open-ended questions, thereby improving the effectiveness and robustness of the alignment process.

我们将 DeepSeek-V3 的判断能力与最先进的模型(即 GPT-4o 和 Claude-3.5)进行了比较。表 8 展示了这些模型在 RewardBench(Lambert 等人,2024 年)中的表现。DeepSeek-V3 的性能与 GPT-4o-0806 和 Claude-3.5-Sonnet-1022 的最佳版本相当,同时超过了其他版本。此外,DeepSeek-V3 的判断能力还可以通过投票技术得到增强。因此,我们采用 DeepSeek-V3 并结合投票技术,为开放式问题提供自我反馈,从而提高对齐过程的有效性和稳健性。

Table 8:Performances of GPT-4o, Claude-3.5-sonnet and DeepSeek-V3 on RewardBench.表 8:GPT-4o、Claude-3.5-sonnet 和 DeepSeek-V3 在 RewardBench 上的表现

5.4 Discussion讨论

讨论 (Discussion): 讨论了从DeepSeek-R1模型蒸馏知识、自奖励机制以及多token预测的评估结果。

5.4.1 Distillation from DeepSeek-R1从DeepSeek-R1蒸馏的效果

>> 从DeepSeek-R1蒸馏知识 (Distillation from DeepSeek-R1): 基于DeepSeek-V2.5对从DeepSeek-R1蒸馏知识的贡献进行了消融实验,结果表明蒸馏数据提高了LiveCodeBench和MATH-500基准的性能,但也增加了平均响应长度。

We ablate the contribution of distillation from DeepSeek-R1 based on DeepSeek-V2.5. The baseline is trained on short CoT data, whereas its competitor uses data generated by the expert checkpoints described above.

Table 9 demonstrates the effectiveness of the distillation data, showing significant improvements in both LiveCodeBench and MATH-500 benchmarks. Our experiments reveal an interesting trade-off: the distillation leads to better performance but also substantially increases the average response length. To maintain a balance between model accuracy and computational efficiency, we carefully selected optimal settings for DeepSeek-V3 in distillation.

我们基于 DeepSeek-V2.5 来探究 DeepSeek-R1 蒸馏数据的贡献。基准模型是基于短 CoT 数据训练的,而其竞争对手则使用了上述专家检查点生成的数据。

表 9 展示了蒸馏数据的有效性,在 LiveCodeBench 和 MATH-500 基准测试中均有显著提升。我们的实验揭示了一个有趣的权衡:蒸馏能带来更好的性能,但也会大幅增加平均响应长度。为了在模型准确性和计算效率之间保持平衡,我们在 DeepSeek-V3 的蒸馏过程中精心选择了最优设置。

Our research suggests that knowledge distillation from reasoning models presents a promising direction for post-training optimization. While our current work focuses on distilling data from mathematics and coding domains, this approach shows potential for broader applications across various task domains. The effectiveness demonstrated in these specific areas indicates that long-CoT distillation could be valuable for enhancing model performance in other cognitive tasks requiring complex reasoning. Further exploration of this approach across different domains remains an important direction for future research.

我们的研究表明,从推理模型中进行知识蒸馏是训练后优化的一个有前景的方向。虽然我们目前的工作主要集中在从数学和编程领域蒸馏数据,但这种方法在各种任务领域都有潜在的应用价值。在这些特定领域所展现的有效性表明,长思维链(long-CoT)蒸馏对于提升其他需要复杂推理的认知任务中的模型性能可能具有重要价值。在不同领域进一步探索这种方法仍是未来研究的一个重要方向。

Table 9: The contribution of distillation from DeepSeek-R1. The evaluation settings of LiveCodeBench and MATH-500 are the same as in Table 6.表 9:DeepSeek-R1 蒸馏的贡献。LiveCodeBench 和 MATH-500 的评估设置与表 6 相同。

5.4.2 Self-Rewarding自我奖励:采用宪法AI方法+利用DeepSeek-V3自身的投票评估结果作为反馈来源

>> 自奖励 (Self-Rewarding): 在难以使用硬编码构建反馈机制的领域,采用宪法AI方法,利用DeepSeek-V3自身的投票评估结果作为反馈来源,显著增强了模型的性能。

Rewards play a pivotal role in RL, steering the optimization process. In domains where verification through external tools is straightforward, such as some coding or mathematics scenarios, RL demonstrates exceptional efficacy. However, in more general scenarios, constructing a feedback mechanism through hard coding is impractical. During the development of DeepSeek-V3, for these broader contexts, we employ the constitutional AI approach (Bai et al., 2022), leveraging the voting evaluation results of DeepSeek-V3 itself as a feedback source. This method has produced notable alignment effects, significantly enhancing the performance of DeepSeek-V3 in subjective evaluations. By integrating additional constitutional inputs, DeepSeek-V3 can optimize towards the constitutional direction. We believe that this paradigm, which combines supplementary information with LLMs as a feedback source, is of paramount importance. The LLM serves as a versatile processor capable of transforming unstructured information from diverse scenarios into rewards, ultimately facilitating the self-improvement of LLMs. Beyond self-rewarding, we are also dedicated to uncovering other general and scalable rewarding methods to consistently advance the model capabilities in general scenarios.

奖励在强化学习中起着关键作用,引导优化过程。在通过外部工具进行验证较为直接的领域,比如某些编程或数学场景中,强化学习表现出色。然而,在更普遍的场景中,通过硬编码构建反馈机制是不切实际的。在开发 DeepSeek-V3 的过程中,针对这些更广泛的场景,我们采用了宪法人工智能方法(Bai 等人,2022 年),利用 DeepSeek-V3 自身的投票评估结果作为反馈来源。这种方法产生了显著的对齐效果,极大地提升了 DeepSeek-V3 在主观评估中的表现。通过整合额外的宪法输入,DeepSeek-V3 可以朝着宪法方向进行优化。我们认为,这种将补充信息与大语言模型结合作为反馈来源的范式至关重要。大型语言模型(LLM)充当着一种多功能处理器,能够将来自各种场景的非结构化信息转化为奖励,最终促进 LLM 的自我提升。除了自我奖励之外,我们还致力于探索其他通用且可扩展的奖励方法,以持续提升模型在一般场景中的能力。
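
作为对"利用 DeepSeek-V3 自身的投票评估结果作为反馈来源"的直观说明,下面给出一个极简的多次评审取多数票的示意(judge_fn 为假设接口,返回 'A' 或 'B'):

```python
from collections import Counter

def vote_judge(judge_fn, question: str, answer_a: str, answer_b: str, n_votes: int = 5):
    """让同一个 LLM 评审对同一对回答多次采样评判, 取多数票作为最终偏好,
    并返回票数一致率, 作为更稳健的反馈信号(示意写法)。"""
    votes = [judge_fn(question, answer_a, answer_b) for _ in range(n_votes)]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / n_votes
```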

5.4.3 Multi-Token Prediction Evaluation多标记预测评估:第二个token的接受率在85%到90%之间

>> 多token预测评估 (Multi-Token Prediction Evaluation): 多token预测技术提高了模型的解码速度,第二个token的接受率在85%到90%之间。

Instead of predicting just the next single token, DeepSeek-V3 predicts the next 2 tokens through the MTP technique. Combined with the framework of speculative decoding (Leviathan et al., 2023; Xia et al., 2023), it can significantly accelerate the decoding speed of the model. A natural question arises concerning the acceptance rate of the additionally predicted token. Based on our evaluation, the acceptance rate of the second token prediction ranges between 85% and 90% across various generation topics, demonstrating consistent reliability. This high acceptance rate enables DeepSeek-V3 to achieve a significantly improved decoding speed, delivering 1.8 times TPS (Tokens Per Second).

DeepSeek-V3 不再只是预测下一个单个标记,而是通过多标记预测(MTP)技术预测接下来的两个标记。结合推测性解码框架(Leviathan 等人,2023 年;Xia 等人,2023 年),它能够显著加快模型的解码速度。一个自然的问题随之而来,即额外预测的标记的接受率如何。根据我们的评估,在各种生成主题中,第二个标记预测的接受率在 85% 至 90% 之间,表现出一致的可靠性。这种高接受率使 DeepSeek-V3 的解码速度显著提升,达到原来 1.8 倍的 TPS(每秒标记数)。
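
这一提升幅度可以用一个粗略估算来理解:若第二个 token 的接受率为 p,则每次前向平均产出约 1 + p 个 token(忽略验证与额外计算开销的简化假设):

```python
def expected_decoding_speedup(acceptance_rate: float) -> float:
    """MTP 每步额外猜测 1 个 token: 接受率为 p 时, 每次前向平均产出约 1 + p 个 token。
    这是忽略验证开销的粗略估计, 量级与论文报告的约 1.8 倍 TPS 提升一致。"""
    return 1.0 + acceptance_rate

for p in (0.85, 0.90):
    print(f"接受率 {p:.0%} -> 约 {expected_decoding_speedup(p):.2f}x 解码吞吐")
```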

6、Conclusion, Limitations, and Future Directions结论、局限性与未来方向

总结了DeepSeek-V3模型的各项优势,包括强大的性能、低廉的训练成本和训练的稳定性。 指出了模型的局限性,例如部署单元较大以及推理速度仍有提升空间。 提出了未来研究方向,包括模型架构改进、数据质量提升、推理能力增强和评估方法改进等。

本结论部分对全文进行了总结,并客观地指出了DeepSeek-V3的局限性,这体现了研究的严谨性。 未来研究方向的提出,为后续研究提供了明确的方向,也体现了研究团队的长远目标。
>> DeepSeek-V3的结论:性能强大,成本效益高,是目前最强大的开源模型之一。
>> DeepSeek-V3的局限性:部署单元较大,推理速度仍有提升空间。
>> 未来研究方向:改进模型架构、优化训练数据、增强模型深度思考能力、改进模型评估方法。

In this paper, we introduce DeepSeek-V3, a large MoE language model with 671B total parameters and 37B activated parameters, trained on 14.8T tokens. In addition to the MLA and DeepSeekMoE architectures, it also pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. The training of DeepSeek-V3 is cost-effective due to the support of FP8 training and meticulous engineering optimizations. The post-training also makes a success in distilling the reasoning capability from the DeepSeek-R1 series of models. Comprehensive evaluations demonstrate that DeepSeek-V3 has emerged as the strongest open-source model currently available, and achieves performance comparable to leading closed-source models like GPT-4o and Claude-3.5-Sonnet. Despite its strong performance, it also maintains economical training costs. It requires only 2.788M H800 GPU hours for its full training, including pre-training, context length extension, and post-training.

While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially on the deployment. Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.

在本文中,我们介绍了 DeepSeek-V3,这是一个拥有 6710 亿总参数和 370 亿激活参数的大型 MoE 语言模型,基于 14.8 万亿个标记进行训练。除了 MLA 和 DeepSeekMoE 架构外,它还开创了一种无需辅助损失的负载均衡策略,并设定了多标记预测训练目标以增强性能。由于 FP8 训练的支持和细致的工程优化,DeepSeek-V3 的训练成本效益显著。在训练后阶段,它还成功地从 DeepSeek-R1 系列模型中提炼出了推理能力。全面的评估表明,DeepSeek-V3 已成为目前最强的开源模型,其性能可与 GPT-4o 和 Claude-3.5-Sonnet 等领先的闭源模型相媲美。尽管性能强劲,但它仍保持了经济的训练成本。其完整训练(包括预训练、上下文长度扩展和训练后处理)仅需 278.8 万 H800 GPU 小时。

尽管我们认可 DeepSeek-V3 出色的性能和成本效益,但也意识到它存在一些局限性,尤其是在部署方面。首先,为了确保高效的推理,DeepSeek-V3 推荐的部署单元相对较大,这可能会给小型团队带来负担。其次,尽管我们的部署策略使 DeepSeek-V3 的端到端生成速度比 DeepSeek-V2 快了两倍多,但仍存在进一步提升的空间。幸运的是,随着更先进硬件的发展,这些局限性有望自然得到解决。

DeepSeek consistently adheres to the route of open-source models with longtermism, aiming to steadily approach the ultimate goal of AGI (Artificial General Intelligence). In the future, we plan to strategically invest in research across the following directions.

• We will consistently study and refine our model architectures, aiming to further improve both the training and inference efficiency, striving to approach efficient support for infinite context length. Additionally, we will try to break through the architectural limitations of Transformer, thereby pushing the boundaries of its modeling capabilities.

• We will continuously iterate on the quantity and quality of our training data, and explore the incorporation of additional training signal sources, aiming to drive data scaling across a more comprehensive range of dimensions.

• We will consistently explore and iterate on the deep thinking capabilities of our models, aiming to enhance their intelligence and problem-solving abilities by expanding their reasoning length and depth.

• We will explore more comprehensive and multi-dimensional model evaluation methods to prevent the tendency towards optimizing a fixed set of benchmarks during research, which may create a misleading impression of the model capabilities and affect our foundational assessment.

DeepSeek 始终坚持长期主义的开源模型路线,致力于稳步迈向通用人工智能(AGI)的终极目标。未来,我们计划在以下方向进行战略性的研究投入。

• 我们将持续研究并优化模型架构,旨在进一步提高训练和推理效率,努力实现对无限上下文长度的有效支持。此外,我们将努力突破 Transformer 的架构限制,从而拓展其建模能力的边界。

• 我们将持续优化训练数据的数量和质量,并探索引入更多的训练信号源,旨在推动数据在更广泛的维度上实现规模扩展。

• 我们将持续探索并迭代模型的深度思考能力,通过扩展推理的长度和深度来提升其智能水平和问题解决能力。

• 我们将探索更全面、多维度的模型评估方法,以防止在研究过程中出现一味优化一组固定基准指标的倾向,这种倾向可能会造成对模型能力的误导性印象,并影响我们对其基础能力的评估。

### DeepSeek-V3 架构概述

DeepSeek-V3 展现了一种基于混合专家(MoE)系统的架构设计,这种设计允许模型根据不同输入动态调整计算资源分配。通过这种方式,不仅提高了处理效率,还增强了对复杂任务的支持能力。

#### 动态路由机制

核心在于引入了一个新颖的动态路由算法,用来决定哪些子网络应该被激活用于特定的任务实例。这一特性使得模型即使面对多样化的查询请求也能保持高效运作,并提供高质量的结果输出。

#### 多样化专家组件

该框架内部集成了多个具有不同专长领域的小型神经网络,即所谓的"专家"。这些专家各自专注于解决某一类问题,在接收到数据流之后,由上述路由机制挑选适合当前情况的一组或多组专家来进行联合决策。

```python
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """一个最简单的专家模块: 线性变换 + ReLU 激活完成特征映射。"""
    def __init__(self, input_size, output_size):
        super(Expert, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        return F.relu(self.fc(x))
```

此代码片段展示了如何定义一个简单的专家模块:它接收一定维度的数据作为输入,并通过线性变换加 ReLU 激活函数完成特征映射操作。

#### 权重共享优化策略

为了进一步提升参数利用率并降低过拟合风险,部分层间采用了权重共享方案;与此同时,针对大规模分布式环境下的训练难题,采取了一系列先进的梯度下降变体及其他加速收敛技巧,以确保整个系统能够稳定、快速地学习到有效的模式表示。