Reproducing DeepSeek-R1's GRPO Training with trl

1. Introduction

Hugging Face developed trl (Reference 3), a library for reinforcement-learning training of Transformer models, and it can be used to run GRPO training. An article from the ModelScope community (Reference 2) describes how to reproduce the DeepSeek-R1 training recipe on the Qwen base model Qwen2.5-0.5B-Instruct, and Reference 1 provides the source code.
This post reproduces that code and records the key steps and intermediate results, with the goal of better understanding the GRPO training procedure, the data format, and how the model behaves before and after training.

2. Key Steps

This post only covers the key steps; see Reference 2 and Reference 1 for the complete procedure and code.

2.1 Data Processing

(1) Raw data
The dataset is OpenAI's gsm8k (Reference 4), which consists of question/answer pairs.
Each answer can be seen as two parts, the reasoning process and the reference answer (a number), separated by ####.
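For illustration, a raw record looks roughly like the following (paraphrased from the gsm8k format rather than quoted verbatim; the <<...>> markers are gsm8k's calculator annotations):

{
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. #### 72"
}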

(2) Processing: prepend the prompt

# System prompt: specifies the reply format
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
# Final format of each record
d = { # type: ignore
	'prompt': [
		{'role': 'system', 'content': SYSTEM_PROMPT},
		{'role': 'user', 'content': x['question']}
	],
	'answer': extract_answer(x['answer']) # reference answer (a number)
}

The final processed record has the format of d in the code above; after processing there are 7,473 examples in total. Here is one processed example:

{
    "prompt": [
        {
            "role": "system",
            "content": "\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n"
        },
        {
            "role": "user",
            "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
        }
    ],
    "answer": "72"
}

As the prompt shows, the model is asked to produce two fields in its reply: the reasoning process (the <reasoning> block) and the final answer (the <answer> block).
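The extract_answer helper and the dataset mapping are not shown above. The following is a minimal sketch of how they might look, assuming the Hugging Face hub copy openai/gsm8k (the ModelScope copy in Reference 4 would work the same way) and reusing the SYSTEM_PROMPT defined earlier:

from datasets import load_dataset

def extract_answer(text: str) -> str:
    # gsm8k puts the final numeric answer after the "####" delimiter
    return text.split("####")[-1].strip()

def get_gsm8k_questions(split: str = "train"):
    data = load_dataset("openai/gsm8k", "main", split=split)  # assumption: HF hub dataset id
    return data.map(lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": extract_answer(x["answer"]),
    })

dataset = get_gsm8k_questions()  # 7473 training examples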

2.2 Custom Reward Functions

Reward functions are a central piece of reinforcement learning. Five reward functions are defined here (a sketch of the four whose code is not reproduced in this post appears after the xmlcount_reward_func listing below):

(1) correctness_reward_func
Scores 2.0 if the number extracted from the model's <answer> block equals the reference answer, otherwise 0.

(2) int_reward_func
Scores 0.5 if the content of the model's <answer> block is a pure integer, otherwise 0.

(3) strict_format_reward_func
Scores 0.5 if the reply contains both a <reasoning> block and an <answer> block, otherwise 0.
"Strict" means the regex match is demanding: the reply must begin with <reasoning> and end with </answer>, so the completion consists of exactly these two blocks. The regex is pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$".

(4) soft_format_reward_func
"Soft" means the format requirement is looser: the two tag pairs only have to appear in order somewhere in the reply. The regex is pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>".

(5) xmlcount_reward_func
Scores the reply by counting the <reasoning> and <answer> tags according to the logic below; this is another way of rewarding the output format.

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1: # exactly one opening <reasoning> tag
        count += 0.125
    if text.count("\n</reasoning>\n") == 1: # exactly one closing </reasoning> tag
        count += 0.125
    if text.count("\n<answer>\n") == 1: # exactly one opening <answer> tag
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001 # small penalty for trailing text after </answer>
    if text.count("\n</answer>") == 1: # exactly one closing </answer> tag
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001 # small penalty for trailing text after </answer>
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
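As mentioned above, here is a hedged sketch of the four reward functions whose code is not reproduced in this post. It assumes the trl convention that reward functions receive the sampled completions plus the dataset's extra columns (here answer) as keyword arguments; extract_xml_answer is a hypothetical helper name:

import re

def extract_xml_answer(text: str) -> str:
    # hypothetical helper: pull out the text between <answer> and </answer>
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def correctness_reward_func(completions, answer, **kwargs) -> list[float]:
    # 2.0 if the extracted <answer> equals the reference answer, else 0.0
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    # 0.5 if the extracted <answer> is a pure integer, else 0.0
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    # 0.5 if the whole reply matches the strict <reasoning>/<answer> template
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.match(pattern, r, re.DOTALL) else 0.0 for r in responses]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    # 0.5 if the tag pairs appear in order anywhere in the reply
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]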

2.3 GRPO Training

For training, define the hyperparameters and configure the trainer as follows (a sketch of the model, tokenizer and training_args setup follows the snippet):

trainer = GRPOTrainer(
    model=model, # base model
    processing_class=tokenizer, # tokenizer
    reward_funcs=[ # reward functions
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func],
    args=training_args, # hyperparameters
    train_dataset=dataset, # data
)
trainer.train()
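The model, tokenizer and training_args referenced above are not shown in the snippet. A minimal sketch of how they might be constructed with trl's GRPOConfig is given below; the hyperparameter values are illustrative guesses, not the exact settings from Reference 1:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

training_args = GRPOConfig(
    output_dir="outputs/Qwen2.5-0.5B-GRPO",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_generations=8,            # completions sampled per prompt (the GRPO "group")
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    logging_steps=1,
    bf16=True,
)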

Below is part of the intermediate training output (it shows that one full training run on a single A800 took 1:41:31):

-------------------- Question:
Billy wants to watch something fun on YouTube but doesn't know what to watch.  He has the website generate 15 suggestions but, after watching each in one, he doesn't like any of them.  Billy's very picky so he does this a total of 5 times before he finally finds a video he thinks is worth watching.  He then picks the 5th show suggested on the final suggestion list.  What number of videos does Billy watch?
Answer:
65
Response:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Extracted:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-------------------- Question:
Xander read 20% of his 500-page book in one hour.  The next night he read another 20% of the book.  On the third night, he read 30% of his book.  How many pages does he have left to read?
Answer:
150
Response:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Extracted:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.918502993675464e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': nan, 'epoch': 0.17}
 17%|███████████████████                                                                                           | 324/1868 [17:38<1:22:29,  3.21s/it]-------------------- Question:
Jordan gave Danielle two dozen roses and a box of chocolates as a birthday day gift.  Later that day, after Jordan left, Danielle traded the box of chocolates for another dozen roses.  Overnight, half of the roses wilted, and  Danielle decided to throw the wilted flowers away.  On the second day, another half of the remaining flowers wilted, and she threw the wilted ones away.  How many unwilted flowers remained?
Answer:
9
Response:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Extracted:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
...
...
-------------------- Question:
It's payday but Jebb has to pay 10% for the tax. If his pay is $650, how much is his take-home pay?
Answer:
585
Response:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Extracted:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-------------------- Question:
In a car racing competition, Skye drove a 6-kilometer track. For the first 3 kilometers, his speed was 150 kilometers per hour. For the next 2 kilometers, his speed was 50 kilometers per hour more. For the remaining 1 kilometer, his speed was twice as fast as his speed on the first 3 kilometers. What is Skye's average speed for the entire race?
Answer:
180
Response:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Extracted:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': nan, 'epoch': 1.0}
{'train_runtime': 6091.2875, 'train_samples_per_second': 1.227, 'train_steps_per_second': 0.307, 'train_loss': 0.0, 'epoch': 1.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1868/1868 [1:41:31<00:00,  3.26s/it]

2.4 Basic Verification

Feed one question to the model before and after GRPO training:

prompt = "Xiao Ming bought 4 apples, ate 1, and gave 1 to his sister. How many apples were left?"

(1) Output of the model before GRPO training:

To determine how many kilometers Natalia rode in total, we need to calculate her distances for each day and then sum them up.

1. **Monday**: Natalia rode 40 kilometers.
2. **Tuesday**: Natalia rode 50 kilometers.
3. **Wednesday**: She rode 50% fewer kilometers than on Tuesday. This means she rode \(50 - (0.5 \times 50) = 50 - 25 = 25\) kilometers.
4. **Thursday**: She rode as many kilometers as the sum of the kilometers from Monday and Wednesday. So, she rode \(40 + 25 = 65\) kilometers.

Now, let's add up the distances from all four days:

- Monday: 40 kilometers
- Tuesday: 50 kilometers
- Wednesday: 25 kilometers
- Thursday: 65 kilometers

Total distance = 40 + 50 + 25 + 65 = 180 kilometers

Therefore, the total number of kilometers Natalia rode is \(\boxed{180}\).

(2) Output of the model after GRPO training:

<reasoning> 
Initially, Xiao Ming had 4 apples. After eating 1 apple, he was left with 4 - 1 = 3 apples. Then, he gave 1 apple to his sister, leaving him with 3 - 1 = 2 apples. 
</reasoning>
<answer> 
2 
</answer>

The content, format and final result show how differently the model behaves before and after GRPO training: the pre-training reply ignores the required <reasoning>/<answer> template (and in this sample even drifts to a different problem), while the post-training reply follows the template and gives the correct answer, 2.
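To reproduce this comparison, inference can be run roughly as follows (a sketch assuming the SYSTEM_PROMPT defined earlier and a standard transformers generate call; the checkpoint paths are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def chat(model_path: str, question: str) -> str:
    # load a checkpoint (base or GRPO-trained) and answer a single question
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    # decode only the newly generated tokens
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

prompt = "Xiao Ming bought 4 apples, ate 1, and gave 1 to his sister. How many apples were left?"
print(chat("Qwen/Qwen2.5-0.5B-Instruct", prompt))  # before GRPO
print(chat("outputs/Qwen2.5-0.5B-GRPO", prompt))   # after GRPO (placeholder path)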

3. Summary

This post walked through GRPO reinforcement-learning training of Qwen2.5-0.5B-Instruct with trl, covering the data, the data processing, the reward function definitions, and the difference in model output before and after training.

4. References

  1. https://modelscope.cn/notebook/share/ipynb/c4d8363a/Qwen-GRPO.ipynb
  2. https://mp.weixin.qq.com/s/EkFRLMwHMdLvyra-ql-1QQ
  3. https://github.com/huggingface/trl
  4. https://modelscope.cn/datasets/modelscope/gsm8k/dataPeview