[GRPO] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Paper link: https://arxiv.org/abs/2402.03300


Pre-training base model

Collect large-scale high-quality domain-related dataset


  • First iteration
    Seed corpus: a small but high-quality collection of domain-related data.
    A simple embedding/classification model (fastText) is trained on the seed corpus (OpenWebMath, 500K positive samples). In the first iteration, the trained fastText model is used to recall samples similar to the seed corpus from a much larger dataset (Common Crawl, 40B HTML web pages). All recalled samples are ranked by fastText score, and only the top-ranked data up to 40B tokens is kept (a minimal sketch of this recall-and-rank step is given after this list).
  • Manual annotation, embedding-model optimization, and collection iteration
    Because Common Crawl is a web-page dataset where each sample corresponds to a URL, the authors group the data by base URL (e.g., mathoverflow.net) and then manually annotate the URL paths associated with mathematical content (e.g., mathoverflow.net/questions) within groups where more than 10% of the content was already collected in the first iteration.
    This way, more samples are recalled and added to the high-quality dataset. The enhanced dataset is then used to train an improved fastText model, which performs better recall in the next iteration.
    After four iterations of data collection, the authors end up with 35.5M mathematical web pages, totaling 120B tokens.
  • Decontamination
    Any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from the training corpus.
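The recall-and-rank step and the decontamination rule can be made concrete with a short sketch. The snippet below is an illustration only: it assumes the open-source `fasttext` package, placeholder file names and labels, and word-level 10-grams, and it is not the authors' pipeline.

```python
# Illustrative sketch of the fastText recall step and 10-gram decontamination.
# File names, labels, and word-level 10-grams are assumptions for this example.
import fasttext

# Toy stand-ins for the real seed corpus / random Common Crawl negatives.
with open("seed_positive.txt", "w", encoding="utf-8") as f:
    f.write("Let p be a prime number and consider the ring Z/pZ\n")
with open("random_negative.txt", "w", encoding="utf-8") as f:
    f.write("Top 10 travel destinations for the summer holidays\n")

def build_training_file(pos_path, neg_path, out_path):
    """Write fastText supervised-format lines: '__label__math <text>' or '__label__other <text>'."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path, label in [(pos_path, "__label__math"), (neg_path, "__label__other")]:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    text = " ".join(line.split())  # one example per line, no newlines
                    if text:
                        out.write(f"{label} {text}\n")

build_training_file("seed_positive.txt", "random_negative.txt", "fasttext_train.txt")
model = fasttext.train_supervised("fasttext_train.txt")

def math_score(page_text: str) -> float:
    """Probability that a page is mathematical content, used to rank candidates."""
    labels, probs = model.predict(" ".join(page_text.split()), k=2)
    return dict(zip(labels, probs)).get("__label__math", 0.0)

def contains_benchmark_ngram(text: str, benchmark_ngrams: set, n: int = 10) -> bool:
    """Decontamination: True if any n-gram of `text` also appears in an evaluation benchmark."""
    tokens = text.split()
    return any(" ".join(tokens[i:i + n]) in benchmark_ngrams
               for i in range(len(tokens) - n + 1))

# Rank candidate pages by score and keep the top ones until the token budget
# (40B tokens in the paper) is reached, skipping contaminated pages.
candidates = ["Let p be a prime number and consider ...", "Top 10 travel destinations ..."]
kept = sorted((c for c in candidates if not contains_benchmark_ngram(c, set())),
              key=math_score, reverse=True)
```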

To evaluate the quality of the collected data, the authors trained a 1.3B LLM on different corpora and compared performance. The DeepSeekMath Corpus yields a significant improvement over its seed corpus OpenWebMath, since it incorporates a large amount of high-quality, multilingual data from Common Crawl.

Train base model

DeepSeekMath-Base 7B is initialized with DeepSeek-Coder-Base-v1.5 7B, since the authors observe that starting from a code-trained model is a better choice than starting from a general LLM.
The distribution of the training data is as follows: 56% from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% from GitHub code, and the remaining 10% natural language data from Common Crawl in both English and Chinese (see the mixture sketch below).
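As a small illustration of what this mixture means in practice, the sketch below samples pre-training documents by source according to these weights. The source names and the sampling mechanism are placeholders, not the authors' training code.

```python
# Illustrative weighted sampling over the stated pre-training data mixture.
import random

MIXTURE = {
    "deepseekmath_corpus": 0.56,
    "algebraic_stack":     0.04,
    "arxiv":               0.10,
    "github_code":         0.20,
    "common_crawl_nl":     0.10,   # English + Chinese natural language
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its mixture weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1   # empirical frequencies approximate the mixture
```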

DeepSeekMath-Base 7B shows outstanding mathematical performance, outperforming Llemma 34B on mathematical problem solving with step-by-step reasoning, with tool use, and on formal mathematics.
On general natural language tasks, DeepSeekMath-Base 7B maintains performance comparable to its initialization, DeepSeek-Coder-Base-v1.5, and it outperforms Mistral 7B on natural language reasoning and coding.

Supervised fine-tuning instruct model

Data collection

A manually annotated mathematical instruction-tuning dataset covering English and Chinese problems, where each problem is paired with solutions in chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated reasoning formats. The total number of training examples is 776K.

Fine-tune instruct model

Based on DeepSeekMath-Base 7B, DeepSeekMath-Instruct 7B is fine-tuned on the data collected above. Training examples are randomly concatenated until reaching a maximum context length of 4K tokens (a packing sketch is given below).
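A rough sketch of this packing step is shown below, assuming examples are already tokenized into lists of token ids; the shuffling and truncation details are guesses for illustration, not the authors' implementation.

```python
# Illustrative packing of tokenized examples into 4K-token training sequences.
import random

def pack_examples(examples, max_len=4096, seed=0):
    """Shuffle examples and greedily concatenate them into sequences of at most max_len tokens."""
    rng = random.Random(seed)
    examples = examples[:]
    rng.shuffle(examples)

    packed, current = [], []
    for ex in examples:
        if current and len(current) + len(ex) > max_len:
            packed.append(current)       # current sequence is full, start a new one
            current = []
        current.extend(ex[:max_len])     # truncate any single example longer than the context
    if current:
        packed.append(current)
    return packed

# Toy usage: three fake tokenized examples of lengths 1000, 3500, and 200.
sequences = pack_examples([[1] * 1000, [2] * 3500, [3] * 200])
```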
Under the evaluation setting where tool use is disallowed, DeepSeekMath-Instruct 7B demonstrates strong step-by-step reasoning performance. Under the setting where models may combine natural language reasoning with program-based tool use, DeepSeekMath-Instruct 7B approaches 60% accuracy on MATH, surpassing all existing open-source models.

Reinforcement Learning

GRPO

In PPO there are two trainable models, the policy model and the value model, and the value model is usually of a similar size to the policy model, so training both of them together is very expensive.

Another challenge is that the reward model trained in RLHF can only assign a reward to an entire response. This conflicts with PPO's assumption of per-token rewards and value estimates, which makes the value model difficult to train.

For more about PPO, see https://blog.csdn.net/ShadyPi/article/details/145379220.

To overcome the challenges above, GRPO is proposed. GRPO does not need a value model to estimate future rewards; instead, it computes the advantage from the rewards of multiple outputs sampled for the same question:
$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q\sim P(Q),\,\{o_i\}_{i=1}^G\sim\pi_{\theta_{old}}(O|q)}\left[ \frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}, 1-\epsilon, 1+\epsilon \right)\hat{A}_{i,t} \right] - \beta\,\mathbb{D}_{KL}[\pi_\theta\Vert\pi_{ref}] \right\} \right]
$$
It looks very complicated! However, few things are new compared to PPO. The main $\min(\cdots)$ term is the same as in PPO; only the way the advantage $\hat{A}_{i,t}$ is computed changes. For each output $o_i$ in the sampled group, a KL divergence term is added explicitly to the objective (in PPO, a KL penalty is typically folded into the rewards instead). The overall objective is then the average over the samples in the group.

The KL divergence $\mathbb{D}_{KL}[\pi_\theta\Vert\pi_{ref}]$ is estimated as
$$
\mathbb{D}_{KL}[\pi_\theta\Vert\pi_{ref}] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-1
$$
which is guaranteed to be non-negative, since $x - \log x - 1 \ge 0$ for all $x > 0$, with equality only at $x = 1$.

The advantage $\hat{A}_{i,t}$ is the group-normalized reward. Under outcome supervision, each output $o_i$ receives a single reward $r_i$, and every token of $o_i$ shares the same advantage:
$$
\hat{A}_{i,t}=\frac{r_i-\text{mean}(\{r_1,r_2,\cdots,r_G\})}{\text{std}(\{r_1,r_2,\cdots,r_G\})}
$$
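Putting the clipped surrogate, the group-normalized advantage, and the KL estimator together, the following PyTorch sketch shows how a per-question GRPO loss could be computed. Tensor names, shapes, and hyperparameter values are assumptions for illustration; this is not the paper's implementation.

```python
# Minimal GRPO loss sketch (outcome supervision) in PyTorch; illustrative only.
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, mask,
              clip_eps=0.2, kl_beta=0.04):
    """logp_*: (G, T) per-token log-probs of the G sampled outputs for one question.
    rewards: (G,) scalar reward per output; mask: (G, T) 1 for real tokens, 0 for padding.
    Returns a scalar loss (negative of the GRPO objective)."""
    # Group-normalized advantage, broadcast to every token of output i.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                      # (G, 1)

    # Clipped surrogate, as in PPO (logp_old should carry no gradient).
    ratio = torch.exp(logp_new - logp_old)                      # (G, T)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Estimator of KL(pi_theta || pi_ref) per token: x - log x - 1 with x = pi_ref / pi_theta.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    per_token = (surrogate - kl_beta * kl) * mask
    # Average over the tokens of each output, then over the group; negate to minimize.
    per_output = per_token.sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -per_output.mean()

# Toy usage with random numbers, just to show the expected shapes.
G, T = 4, 8
logp = torch.randn(G, T)
loss = grpo_loss(logp, logp.detach(), logp.detach(), torch.rand(G), torch.ones(G, T))
```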

In the paper, the authors define two more variants of GRPO (process supervision and iterative RL). However, in DeepSeek-R1, it turns out that this outcome-supervision RL with GRPO works best.

Train & Test

The RL training data consists of chain-of-thought-format questions related to GSM8K and MATH from the SFT data, around 144K questions in total. The authors exclude the other SFT questions in order to investigate the impact of RL on benchmarks that lack data throughout the RL phase.

The resulting model's performance surpasses that of all open-source models in the 7B to 70B range, as well as the majority of closed-source models.

Despite the constrained scope of its training data, it outperforms DeepSeekMath-Instruct 7B across all evaluation metrics, showcasing the effectiveness of reinforcement learning.
