【LLM-RL】Reinforcement Alignment: The GRPO Algorithm and Fine-Tuning in Practice

Note

  • The core idea of GRPO is to sample a group of outputs from the model for each prompt and use the relative rewards within the group to estimate the baseline, which removes the need for a critic (value) model of the same size as the policy model that traditional policy-optimization algorithms require.
  • GRPO also adds a few extra optimizations (reward scaling and policy clipping) to improve training stability.

I. The GRPO Algorithm

Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (https://arxiv.org/pdf/2402.03300). GRPO was later adopted in DeepSeek V2. Because GRPO does not need a value model during training, it also reduces the resource consumption of RL training.

1. Objective Function

The core idea of GRPO is to sample a group of outputs from the model and compute relative rewards within the group to estimate the baseline, avoiding the critic model of the same size as the policy model that traditional policy-optimization algorithms require.

  • This greatly reduces the computational cost of RL training while still letting the model learn an effective policy.
  • Concretely, in traditional RL training the critic model must be as large as the policy model, which increases resource consumption. GRPO instead uses relative information within each group to estimate the baseline, eliminating the need for a critic model.
  • In addition, GRPO introduces extra optimizations (reward scaling and policy clipping) that improve training stability.

From PPO to GRPO:

  • PPO, an actor-critic algorithm, is widely used in post-training; its core goal is to maximize the objective below.
  • Here $\pi_\theta$ and $\pi_{\theta_{old}}$ denote the current and old policy models, $q, o$ are inputs and outputs sampled from the question dataset and the old policy $\pi_{\theta_{old}}$, and $A_t$ is the advantage computed with Generalized Advantage Estimation (GAE), which depends on the reward sequence $\{r_{\geq t}\}$ and a learned value function $V_\psi$. PPO therefore has to train a value function alongside the policy model. To avoid over-optimizing against the reward model, the standard practice is to add a per-token KL penalty with respect to a reference model.

$$\mathcal{J}_{PPO}(\theta)=\mathbb{E}\left[q \sim P(Q),\; o \sim \pi_{\theta_{old}}(O \mid q)\right] \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left[\frac{\pi_\theta\left(o_t \mid q, o_{<t}\right)}{\pi_{\theta_{old}}\left(o_t \mid q, o_{<t}\right)} A_t,\; \operatorname{clip}\left(\frac{\pi_\theta\left(o_t \mid q, o_{<t}\right)}{\pi_{\theta_{old}}\left(o_t \mid q, o_{<t}\right)}, 1-\varepsilon, 1+\varepsilon\right) A_t\right]$$

$$r_t=r_{\varphi}\left(q, o_{\leq t}\right)-\beta \log \frac{\pi_\theta\left(o_t \mid q, o_{<t}\right)}{\pi_{ref}\left(o_t \mid q, o_{<t}\right)}$$

GRPO drops the critic model (usually the same size as the policy model) and instead estimates the baseline from group scores. Specifically, for each question $q$, GRPO samples a group of outputs from the old policy and then optimizes the policy with the following objective:
$$\begin{gathered} \mathcal{J}_{GRPO}(\theta)=\mathbb{E}\left[q \sim P(Q),\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{old}}(O \mid q)\right] \\ \frac{1}{G} \sum_{i=1}^G\left(\min \left(\frac{\pi_\theta\left(o_i \mid q\right)}{\pi_{\theta_{old}}\left(o_i \mid q\right)} A_i,\; \operatorname{clip}\left(\frac{\pi_\theta\left(o_i \mid q\right)}{\pi_{\theta_{old}}\left(o_i \mid q\right)}, 1-\varepsilon, 1+\varepsilon\right) A_i\right)-\beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)\right), \\ \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)=\frac{\pi_{ref}\left(o_i \mid q\right)}{\pi_\theta\left(o_i \mid q\right)}-\log \frac{\pi_{ref}\left(o_i \mid q\right)}{\pi_\theta\left(o_i \mid q\right)}-1 \end{gathered}$$

where $\varepsilon$ and $\beta$ are hyperparameters, and $A_i$ is the advantage, computed as follows.

$$A_i=\frac{r_i-\operatorname{mean}\left(\left\{r_1, r_2, \ldots, r_G\right\}\right)}{\operatorname{std}\left(\left\{r_1, r_2, \ldots, r_G\right\}\right)}$$
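As a quick illustration (a minimal sketch, not part of the original post's code), the advantage is just each sampled answer's reward standardized within its group:

```python
import torch

# Group-relative advantage for a single prompt: sample G answers, score them,
# then standardize the rewards within the group.
rewards = torch.tensor([1.0, 0.0, 2.0])                            # rewards r_1..r_G for G = 3 sampled answers
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)   # small epsilon avoids division by zero
print(advantages)   # answers above the group mean get a positive advantage
```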

Workflow: for each question, sample a group of outputs from the old policy, score them with the reward model, standardize the rewards within the group to obtain advantages, and update the policy with the clipped objective plus the KL penalty.

2. Reward Model (RM)

The reward is purely rule-based; there is no ORM or PRM. Two kinds of rules are used: an accuracy reward and a format reward (the thinking process must be wrapped in <think></think> tags). An illustrative sketch follows.
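For illustration only (a hypothetical sketch, not the actual reward code used by DeepSeek), such a rule-based format check can be a simple regex test:

```python
import re

def think_format_reward(completion: str) -> float:
    """Toy rule-based format reward: 1.0 if the reasoning is wrapped in <think></think>, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

print(think_format_reward("<think>2 + 3 = 5</think> The answer is 5."))  # 1.0
```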

II. Understanding GRPO Through an Example

Here is an example to make DeepSeek's GRPO (Group Relative Policy Optimization) easier to understand:

Question: "What is 2 + 3?"
🌟 Step 1: the LLM generates three answers.

  1. "5"
  2. "6"
  3. "2 + 3 = 5"

🌟 Step 2: score each answer.
"5" → 1 point (correct, but no reasoning)
"6" → 0 points (wrong)
"2 + 3 = 5" → 2 points (correct, with reasoning)

🌟 Step 3: compute the group's average score.
Average = (1 + 0 + 2) / 3 = 1

🌟 Step 4: compare each answer's score with the average.
"5" → 0 (equal to the average)
"6" → -1 (below the average)
"2 + 3 = 5" → 1 (above the average)

🌟 Step 5: reinforce the LLM toward higher scores.
Prefer answers like #3 (positive advantage)
Keep answers like #1 (neutral)
Avoid answers like #2 (negative advantage)

This process repeats, letting the model keep learning and improving over time.

Reference: https://superb-makemake-3a4.notion.site/group-relative-policy-optimization-GRPO-18c41736f0fd806eb39dc35031758885

III. GRPO Experiments

Andriy Burkov's tutorial "GRPO From Scratch":
https://github.com/aburkov/theLMbook/blob/main/GRPO_From_Scratch_Multi_GPU_DataParallel_Qwen_2_5_1_5B_Instruct.ipynb

Contents:

  • A distributed implementation based on Qwen2.5-1.5B-Instruct that fine-tunes the language model for math, logic, and programming tasks, built on:
    • PyTorch: tensor operations and distributed training.
    • Hugging Face Transformers: loading the pretrained language model and tokenizer.
    • FlashAttention2: an optimized attention implementation that reduces memory usage and speeds up training.
    • Weights & Biases (wandb): experiment tracking, visualization, and model versioning.

1. Basic Setup and Imports

!pip install tf-keras # for some reason, Hugging Face cannot work without it
!pip install flash-attn # FlashAttention2
!pip install wandb # Weights and Biases
!pip install 'accelerate>=0.26.0'
!pip install transformers # Hugging Face Transformers API
!pip install datasets # Hugging Face Datasets API

# Import necessary libraries
# Basic Python libraries for various operations
import random
import copy
import re
import os
import numpy as np
import wandb

# PyTorch and related libraries for deep learning
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Hugging Face libraries for transformer models
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

def set_random_seed(seed: int = 42):
    """
    Set the random seed for reproducibility across Python, NumPy, and PyTorch.

    Args:
        seed (int): The seed value to use for random number generation.

    Returns:
        None

    Explanation:
        1. Sets seed for Python's built-in random module for basic random operations.
        2. Sets seed for NumPy, ensuring consistent random number generation in array operations.
        3. Sets seed for PyTorch CPU operations.
        4. If CUDA is available, sets seed for all GPU devices.
        5. Configures cuDNN to ensure deterministic behavior:
           - Sets deterministic flag to True, ensuring reproducible results.
           - Disables benchmarking to prevent algorithm selection based on hardware.

    Note:
        Setting deterministic behavior may impact performance but ensures consistent results
        across multiple runs, which is crucial for debugging and research.
    """
    # Set the seed for Python's built-in random module
    random.seed(seed)
    # Set the seed for NumPy
    np.random.seed(seed)
    # Set the seed for PyTorch
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Ensure deterministic behavior in cuDNN (may impact performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Call the function to set random seed for reproducibility
set_random_seed(42)

# Set environment variables for Weights & Biases (wandb) logging
os.environ["WANDB_API_KEY"] = "USE YOUR KEY"
os.environ["WANDB_PROJECT"] = "GRPO-Qwen-1.5-Instruct-Multi-GPU"

  • Random seed: set_random_seed seeds Python's random module, NumPy, and PyTorch to ensure reproducibility.
  • Environment variables: WANDB_API_KEY and WANDB_PROJECT are set to enable experiment tracking with Weights & Biases.
  • Required libraries are imported, including random, copy, re, torch, and so on.

2. Data Format and Answer Extraction

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def extract_answer_from_model_output(text):
   """
   Extracts the value from the last <answer> tag in the text.

   Args:
       text (str): The model-generated text containing XML-style <answer> tags.

   Returns:
       str or None: The content inside the <answer> tags, or None if no valid answer is found.

   Explanation:
       1. Splits the text on the <answer> tag to isolate content after the tag.
       2. Checks if at least one <answer> tag exists in the text.
       3. For the last <answer> segment:
          - Verifies it contains a closing </answer> tag.
          - Extracts only the content between the tags.
       4. Returns None if the answer is empty (just "...") or if tags are missing.
   """
   # Split on <answer> and take everything after the last occurrence
   parts = text.split("<answer>")
   if len(parts) < 2:  # No <answer> tag found
       return None
   last_part = parts[-1]

   # Extract content up to </answer>
   if "</answer>" not in last_part:
       return None
   answer = last_part.split("</answer>")[0].strip()
   return None if answer == "..." else answer

def extract_answer_from_dataset(text):
   """
   Extracts the answer from the GSM8K dataset examples.

   Args:
       text (str): The dataset example text containing a question and answer.

   Returns:
       str or None: The extracted answer part after the '####' delimiter, or None if not found.

   Explanation:
       1. Checks if the text contains the '####' delimiter that separates question from answer.
       2. If found, splits the text at this delimiter and returns the second part (the answer).
       3. The answer is stripped of leading/trailing whitespace.
       4. Returns None if no delimiter is present.
   """
   if "####" not in text:
       return None
   return text.split("####")[1].strip()

To keep the model's output format consistent, the project defines a system prompt that instructs the model to produce output wrapped in <reasoning> and <answer> tags. Answer extraction is handled by two functions:

  • extract_answer_from_model_output: takes the model's output text and extracts the content inside the <answer> tags;
  • extract_answer_from_dataset: extracts the expected answer from the GSM8K dataset, which uses the "####" delimiter to separate the answer from the question.

3. Data Preparation

The project trains on the GSM8K dataset. Under the RL training paradigm, the model generates several candidate solutions for each problem; these are compared with the ground-truth answers from the GSM8K examples, and a match yields a high reward for the RL algorithm (GRPO), which then updates the model weights to make high-reward outputs more likely next time.

The experiment proceeds as follows: first load the dataset from Hugging Face, then format each example with the system prompt and the user prompt. Two helper functions are defined for this: prepare_dataset and build_prompt.

def prepare_dataset(split="train"):
   """
   Load and prepare the GSM8K dataset for training with string prompts.

   Args:
       split (str): The dataset split to load ("train" or "test"). Defaults to "train".

   Returns:
       list: A list of formatted examples, each containing a prompt string and answer.

   Explanation:
       1. Loads the GSM8K dataset from the Hugging Face datasets hub.
       2. For each example in the dataset:
          - Creates a list of messages with system prompt and the question.
          - Converts this list into a single string prompt using build_prompt().
          - Extracts the answer from the dataset example.
          - Creates a formatted example dictionary with prompt and answer.
       3. Returns the list of formatted examples ready for model training or evaluation.
   """
   data = load_dataset('openai/gsm8k', 'main')[split]
   formatted_data = []
   for example in data:
       # Convert list of messages to a single string prompt.
       prompt_str = build_prompt([
           {"role": "system", "content": SYSTEM_PROMPT},
           {"role": "user", "content": example["question"]}
       ])
       formatted_example = {
           "prompt": prompt_str,  # Now a string rather than a list.
           "answer": extract_answer_from_dataset(example["answer"])
       }
       formatted_data.append(formatted_example)
   return formatted_data

def build_prompt(messages):
   """
   Build a single prompt string from a list of messages.

   Args:
       messages (list): A list of message dictionaries, each with 'role' and 'content' keys.

   Returns:
       str: A concatenated string of all message contents.

   Explanation:
       1. Takes a list of message dictionaries in the typical chat format.
       2. Extracts the 'content' field from each message and strips whitespace.
       3. Joins all content strings with newlines to create a single prompt.
       4. This preserves the training format while converting from structured messages to a string.
   """
   return "\n".join([msg["content"].strip() for msg in messages])

4. Evaluation Functions

Evaluation is essential for tracking the model's progress, so the author defines functions to evaluate the model on a set of examples. The evaluation code does the following:

  • Tokenizes the prompt and generates a response: the model's output is generated from the tokenized prompt.
  • Extracts the predicted answer from the generated response.
  • Compares the predicted answer with the expected answer, using exact matching as well as numeric-equivalence checks.

Two helper functions, extract_single_number and extract_last_number, extract numbers from text; the evaluation function evaluate_model uses them to decide whether a predicted answer is correct:

def evaluate_model(model, tokenizer, eval_examples, device):
   """
   Evaluates the model on a set of examples and prints detailed results.

   Args:
       model: The language model to evaluate.
       tokenizer: The tokenizer for encoding inputs and decoding outputs.
       eval_examples (list): List of evaluation examples, each containing "prompt" and "answer".
       device: The device (CPU or GPU) to run evaluation on.

   Returns:
       float: The accuracy percentage (correct predictions / total examples * 100).

   Explanation:
       1. Sets the model to evaluation mode.
       2. For each example in the evaluation set:
          - Encodes the prompt and generates a response using the model.
          - Extracts the predicted answer from the generated response.
          - Compares the predicted answer with the expected answer using multiple methods:
            a. Exact string matching
            b. Single number extraction and comparison
            c. Last number extraction and comparison
          - Prints detailed information about each example.
       3. Calculates and returns the overall accuracy.
       4. Returns the model to training mode.
   """
   model.eval()
   correct = 0
   total = len(eval_examples)
   print("\n" + "="*50)
   print("EVALUATION ON", total, "EXAMPLES")
   print("="*50)

   for example in eval_examples:
       # Get the prompt and expected answer
       full_prompt = example["prompt"]
       expected = example["answer"]

       # Tokenize and generate response
       inputs = tokenizer.encode(full_prompt, return_tensors="pt").to(device)
       with torch.no_grad():
           outputs = model.generate(
               inputs,
               max_new_tokens=512,
               temperature=0.7,
               num_return_sequences=1,
               pad_token_id=tokenizer.pad_token_id,
               eos_token_id=tokenizer.eos_token_id,
               forced_eos_token_id=tokenizer.eos_token_id,
               early_stopping=False,
           )
       response = tokenizer.decode(outputs[0], skip_special_tokens=True)

       try:
           # Extract answer and check correctness
           predicted = extract_answer_from_model_output(response)

           # Try different matching methods
           if predicted == expected:  # Exact match
               is_correct = True
           else:
               # Try single number matching
               pred_num = extract_single_number(str(predicted))
               exp_num = extract_single_number(str(expected))
               if pred_num is not None and exp_num is not None and pred_num == exp_num:
                   is_correct = True
               else:
                   # Try last number matching
                   pred_num = extract_last_number(str(predicted))
                   exp_num = extract_last_number(str(expected))
                   is_correct = (pred_num is not None and exp_num is not None and
                               pred_num == exp_num)

           # Update counter for correct answers
           if is_correct:
               correct += 1

           # Print evaluation details
           print("\nPrompt:")
           print(full_prompt)
           print("\nExpected Answer:")
           print(expected)
           print("\nExtracted Answer:")
           print(predicted)
           print("\nFull Generated Response:")
           print(response)
           print("\nCorrect:", "✓" if is_correct else "✗")
           print("-"*50)

       except Exception as e:
           print("\nFailed to parse model output for prompt:")
           print(full_prompt)
           print("Error:", e)
           print("-"*50)

   # Calculate and print final accuracy
   accuracy = (correct / total) * 100
   print(f"\nAccuracy: {accuracy:.2f}% ({correct}/{total})")
   print("="*50)

   # Return model to training mode
   model.train()
   return accuracy
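
evaluate_model (and the reward functions in the next section) also rely on the two numeric-extraction helpers, extract_single_number and extract_last_number, whose definitions are not reproduced above. A minimal sketch consistent with how they are called might look like this (the notebook's own versions may differ in detail):

```python
def extract_single_number(text):
    """Return the only number in text as a float, or None if there is not exactly one number."""
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return float(numbers[0]) if len(numbers) == 1 else None

def extract_last_number(text):
    """Return the last number appearing in text as a float, or None if no number is found."""
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return float(numbers[-1]) if numbers else None
```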

5. Reward Functions

The author defines two reward functions:

  • correctness_reward: assigns a reward based on whether the generated answer is correct, comparing the model's answer with the expected answer in two ways: exact string matching and numeric-equivalence checking. An exact match earns a higher reward (2.0), while a match based only on numeric equivalence earns a smaller reward (1.5).
  • format_reward: encourages the model to follow the desired XML-like output format by giving a small reward for each of the <reasoning></reasoning> and <answer></answer> tags present in the generated text.

def correctness_reward(prompts, completions, answer, **kwargs):
   """
   Assigns a reward based on the correctness of the model's answer.

   Args:
       prompts (list): List of input prompts.
       completions (list): List of model completions, each containing content.
       answer (list): List of expected answers.
       **kwargs: Additional keyword arguments.

   Returns:
       list: List of numerical rewards for each completion.

   Explanation:
       1. Extracts the content from each completion.
       2. Extracts the answer portion from each response using extract_answer_from_model_output.
       3. Assigns rewards based on matching criteria:
          - 2.0 points for an exact match
          - 1.5 points for numeric equivalence (when values match but format differs)
          - 0.0 points for incorrect answers
       4. Tracks completion lengths for analysis.
   """
   responses = [completion[0]['content'] for completion in completions]
   extracted = [extract_answer_from_model_output(r) for r in responses]
   rewards = []
   for r, a in zip(extracted, answer):
       if r == a:  # Exact match case
           rewards.append(2.0)
       else:
           # Try numeric equivalence
           r_num = extract_single_number(str(r))
           a_num = extract_single_number(str(a))
           if r_num is not None and a_num is not None and r_num == a_num:
               rewards.append(1.5)
           else:
               rewards.append(0.0)
   # Log completion lengths
   completion_lengths = [len(response.split()) for response in responses]
   return rewards

def format_reward(completions, **kwargs):
   """
   Assigns a reward for adhering to the desired XML format.

   Args:
       completions (list): List of model completions, each containing content.
       **kwargs: Additional keyword arguments.

   Returns:
       list: List of format compliance scores for each completion.

   Explanation:
       1. Extracts the content from each completion.
       2. Evaluates format compliance by checking for required XML tags:
          - 0.2 points for each tag present (<reasoning>, </reasoning>, <answer>, </answer>)
          - Maximum score of 0.8 for perfect format compliance
       3. Stores and returns the format compliance scores.
   """
   responses = [completion[0]['content'] for completion in completions]
   rewards = []
   format_scores = []
   for response in responses:
       score = 0.0
       if "<reasoning>" in response: score += 0.2
       if "</reasoning>" in response: score += 0.2
       if "<answer>" in response: score += 0.2
       if "</answer>" in response: score += 0.2
       rewards.append(score)
       format_scores.append(score)
   return rewards

def combined_reward(prompts, completions, answer):
   """
   Combines correctness and format rewards.

   Args:
       prompts (list[str]): List of prompt texts
       completions (list[list[dict]]): List of completion dictionaries
       answer (list[str]): List of expected answers

   Returns:
       list[float]: Combined rewards for each prompt-completion pair

   Explanation:
       1. Calculates separate rewards for correctness and format compliance.
       2. Combines the rewards with the following weights:
          - Correctness score range: 0.0 to 2.0
          - Format score range: 0.0 to 0.8
          - Total possible range: 0.0 to 2.8
       3. Returns the combined reward for each example.
   """
   # Get individual rewards
   correctness_scores = correctness_reward(prompts=prompts, completions=completions, answer=answer)
   format_scores = format_reward(completions=completions)

   # Combine rewards - correctness is weighted more heavily
   combined_rewards = []
   for c_score, f_score in zip(correctness_scores, format_scores):
       # Correctness score range: 0.0 to 2.0
       # Format score range: 0.0 to 0.8
       # Total range: 0.0 to 2.8
       combined_rewards.append(c_score + f_score)

   return combined_rewards
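
A toy call illustrates the combined 0.0–2.8 range (the nested completion structure matches what the reward functions above expect):

```python
# One prompt with one completion in the [[{"content": ...}]] structure used above
toy_completions = [[{"content": "<reasoning>2 + 3 = 5</reasoning>\n<answer>5</answer>"}]]
print(combined_reward(prompts=["What is 2 + 3?"], completions=toy_completions, answer=["5"]))
# -> [2.8]: 2.0 for the exact answer match + 4 * 0.2 for the four format tags
```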

6. Implementing DataParallel GRPO from Scratch

This part assumes the machine running the code has at least 2 GPUs. PyTorch's DataParallel API is used to replicate the policy model across multiple GPUs, with one copy of the model per GPU, and batches are scattered across those GPUs for processing.

Key steps:

  • Model and tokenizer initialization: load Qwen/Qwen2.5-1.5B-Instruct with optimized settings (torch.bfloat16 and FlashAttention2), and load the tokenizer with its padding token set to the end-of-sequence token. Loading the model in torch.bfloat16 stores each parameter in 16 bits instead of 32, roughly halving the model's memory footprint and speeding up training on modern GPUs (see the loading sketch after this list).
  • Initial evaluation: before fine-tuning, evaluate the model on a few examples to establish a baseline.
  • RL fine-tuning: configure the from-scratch GRPO training function train_with_grpo with the appropriate training parameters and reward function, then run RL training on the remaining training data.
  • Final evaluation and saving: after RL fine-tuning, evaluate the model again and save the final model.
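
The initialization step might look roughly like the following sketch (assumptions: bfloat16 weights, the flash_attention_2 attention implementation, and an nn.DataParallel wrapper; the notebook's exact code may differ):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"

# Load the policy model in bfloat16 with FlashAttention2 to cut memory use and speed up training
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # pad with the end-of-sequence token

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)          # one replica of the policy per GPU
```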

7. Training Setup and Execution

def grpo_loss(model, ref_model, rollout_data, tokenizer, reward_function, beta=0.01, epsilon=0.2):
    """
    Computes the GRPO loss for updating the policy model.

    Args:
        model: The policy model being trained.
        ref_model: The reference model for KL divergence calculation.
        rollout_data (dict): Data generated by generate_rollout_data.
        tokenizer: The tokenizer for encoding and decoding text.
        reward_function: Function that calculates rewards for completions.
        beta (float): KL penalty coefficient.
        epsilon (float): Clipping parameter for PPO.

    Returns:
        torch.Tensor: The GRPO loss to be minimized.

    Explanation:
        1. Computes current token log probabilities using the policy model.
        2. Calculates the probability ratio between current and old policies.
        3. Computes rewards using the provided reward_function.
        4. Calculates advantages by standardizing rewards within each prompt.
        5. Computes the PPO surrogate objective with clipping.
        6. Calculates the KL divergence between reference and policy models.
        7. Combines surrogate loss and KL penalty.
        8. Averages the loss across all tokens and batches.
    """
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    input_ids = rollout_data["input_ids"]
    attention_mask = rollout_data["attention_mask"]
    completion_mask = rollout_data["completion_mask"]
    logits_to_keep = rollout_data["logits_to_keep"]
    old_log_probs = rollout_data["old_log_probs"]
    ref_log_probs = rollout_data["ref_log_probs"]
    token_log_probs = compute_log_probs(model, input_ids, attention_mask, logits_to_keep)
    ratio = torch.exp(token_log_probs - old_log_probs)
    rewards = torch.tensor(
        reward_function(prompts=rollout_data["repeated_prompts"], completions=rollout_data["formatted_completions"], answer=rollout_data["repeated_answers"]),
        dtype=torch.float32,
        device=device
    )
    #print(f"Rewards: {rewards}")  # Debug rewards
    batch_size = rollout_data["batch_size"]
    num_generations = rollout_data["num_generations"]
    rewards = rewards.view(batch_size, num_generations)
    avg_reward = rewards.mean().item()
    print("Average Reward:", avg_reward)
    mean_rewards = rewards.mean(dim=1).repeat_interleave(num_generations)
    std_rewards = rewards.std(dim=1).repeat_interleave(num_generations)
    advantages = ((rewards.view(-1) - mean_rewards) / (std_rewards + 1e-4)).unsqueeze(1)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    surrogate_loss = torch.min(surr1, surr2)
    kl = torch.exp(ref_log_probs - token_log_probs) - (ref_log_probs - token_log_probs) - 1
    per_token_loss = surrogate_loss - beta * kl
    loss = -((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
    return loss, avg_reward

The training configuration passed to train_with_grpo uses the following hyperparameters (see the call sketch after this list):

  • num_iterations=1: number of outer iterations in which a new reference model is created from the current policy model. One iteration is one pass over the dataset.
  • num_steps=500: the training loop runs at most 500 steps, each processing one batch of samples.
  • batch_size=7: with 8 GPUs, each step processes a batch of 7 samples, one sample per GPU; GPU 0 is used by DataParallel as the master device to aggregate gradients and gather outputs.
  • num_generations=14: for each prompt in the training data, the trainer generates 14 different completions, which are used to compute the relative advantages (reward signals) that drive the RL update. Reduce this number if your GPUs have less VRAM.
  • max_completion_length=400: generation of the completion (the response part of the sequence) is capped at 400 tokens, limiting how long the outputs produced during the RL phase can be. Reduce this number if your GPUs have less VRAM.
  • beta=0.04: coefficient of the KL-divergence penalty in the GRPO loss, controlling how far the model may drift from the reference model.
  • learning_rate=5e-6: learning rate for RL fine-tuning; a relatively low rate is used for stable policy updates.
  • mu=1: number of policy updates performed per batch of rollout data; here only one update per batch.
  • epsilon=0.1: clipping parameter of GRPO's PPO-style component, preventing the policy from changing too much in a single update.
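
Putting these hyperparameters together, the training call could look like the following sketch (train_with_grpo is defined in the notebook; its exact signature here is an assumption):

```python
# Assumed call pattern; argument names mirror the hyperparameters listed above
training_config = {
    "num_iterations": 1,
    "num_steps": 500,
    "batch_size": 7,
    "num_generations": 14,
    "max_completion_length": 400,
    "beta": 0.04,
    "learning_rate": 5e-6,
    "mu": 1,
    "epsilon": 0.1,
}
model = train_with_grpo(
    model=model,
    tokenizer=tokenizer,
    train_data=train_data,
    reward_function=combined_reward,
    **training_config,
)
```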

Results:

  • After one round of GRPO, the Qwen2.5-1.5B-Instruct model answered 27 out of 30 questions correctly, reaching 90% accuracy, a large jump from the 23.33% it scored before GRPO.
  • Repeated tests also show that the model has not learned to emit the end-of-sequence (EOS) token, so generation keeps going even after the </answer> tag. This is expected: the reward function contains no reward for stopping generation, and no supervised fine-tuning step was run that would teach the model to emit EOS right after </answer>.

Reference: "GRPO, DeepSeek's Key RL Algorithm"

Reference

https://github.com/huggingface/alignment-handbook
https://github.com/OpenRLHF/OpenRLHF
https://github.com/hiyouga/LLaMA-Factory
AI LLM paper reading: notes on DeepSeekMath and the GRPO algorithm
A quick read of DeepSeek V2 (part 3): understanding GRPO (DeepSeekMath and DeepSeek Coder)
Decoding the RL strategy in DeepSeekMath: GRPO, improving PPO to strengthen reasoning
LLM reinforcement learning special topic 7: GRPO explained (Xiaohongshu)
Why is everyone praising DeepSeek's GRPO? (Zhihu)
Adding a few tricks to GRPO and speeding up training
A comprehensive recent survey of DeepSeek R1: two months of deep thinking
https://github.com/aburkov/theLMbook/blob/main/GRPO_From_Scratch_Multi_GPU_DataParallel_Qwen_2_5_1_5B_Instruct.ipynb

Fine-Tuning the DeepSeek-llm-7B-Chat Model

For fine-tuning DeepSeek-llm-7B-Chat, the main goal is to improve performance on a specific application scenario while preserving the original conversational fluency and context understanding. The main steps are:

1. Prepare the dataset

Effective fine-tuning requires a high-quality dataset containing a large number of dialogue samples closely related to the target scenario. These samples not only improve the model's language understanding and generation, but also help adjust its style and tone to better match real-world needs.

2. Use the Hugging Face Transformers library

Hugging Face provides a powerful toolkit that simplifies this process: load the pretrained DeepSeek-llm-7B-Chat weights as the starting point and continue training from there, which significantly reduces compute cost and speeds up convergence.

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model_name_or_path = "deepseek-ai/deepseek-llm-7b-chat"  # Hugging Face Hub ID of the chat model
tokenizer_name = model_name_or_path

# Load the pretrained model
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
)
```

3. Custom loss function

Since the chat version has already been specially optimized, in some cases a more elaborate loss function may be needed to guide the model toward more natural human-like dialogue, for example by adding extra reward terms or penalties that encourage concise answers or discourage repetition.

4. Data augmentation

Appropriate data augmentation improves generalization and robustness, for example randomly inserting noise tokens or swapping adjacent words in the input sequence to increase diversity, or simulating multi-turn dialogues to reflect complex real-world interactions.

5. Monitor the training process

Continuously tracking evaluation metrics matters throughout training. Besides perplexity, metrics such as BLEU and ROUGE can be used to measure generation quality, and GPU memory usage should be watched to keep the run stable.

After these steps, you obtain a DeepSeek-llm-7B-Chat variant strengthened for the target scenario, providing more accurate and effective support for the specific business use case.
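To launch the run, these pieces would then be handed to a Trainer together with a tokenized dataset (a sketch; tokenized_train_dataset and tokenized_eval_dataset are assumed to have been prepared from the dialogue data described above):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,   # assumed: tokenized dialogue samples for training
    eval_dataset=tokenized_eval_dataset,     # assumed: held-out split for the per-epoch evaluation
)
trainer.train()
trainer.save_model("./results/final")
```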