RLHF (PPO) Workflow Explained: Proximal Policy Optimization

Latest recommended article published on 2025-04-25 17:06:38

阿正的梦工坊

Article link: https://blog.csdn.net/shizheng_Li/article/details/144445993

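RLHF (Reinforcement Learning from Human Feedback) uses reinforcement learning to bring human preferences into model training, and PPO (Proximal Policy Optimization) is a commonly used algorithm for it. The RLHF workflow can usually be divided into the following steps:

1. Workflow Overview

Step 1: Prepare a batch of prompts

Sample a set of prompts from user input or a dataset to serve as input to the generation model (the Actor model).
Introduce a Reference model (Ref model): a frozen copy of the initial policy whose log-probabilities are used later for the KL penalty.

Step 2: Generate responses

Use the Actor model to generate responses from the prompts.
Record the log-probabilities of the generated tokens under the sampling-time Actor, $\text{log\_probs}_{\text{old}}$, which form the denominator of the PPO policy ratio, and under the Reference model, $\text{log\_probs}_{\text{ref}}$, which feed the KL penalty.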
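Step 2 relies on the log-probability of a whole response. The following is a minimal sketch in plain Python of how that number can be obtained from per-step logits: apply a log-softmax at each step and sum the entries of the tokens that were actually sampled. The function names, the three-token vocabulary, and all numbers are made up for illustration; a real pipeline would do the same thing on tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def response_log_prob(step_logits, token_ids):
    """Sum the log-probabilities of the tokens that were actually sampled."""
    return sum(log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids))

# Toy example: vocabulary of 3 tokens, response of 2 tokens (hypothetical values).
step_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
token_ids = [0, 2]
total = response_log_prob(step_logits, token_ids)
```

Running the same function with logits from two different models on the same `token_ids` gives the two quantities whose difference appears in the policy ratio and in the KL penalty.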

Step 3: Compute rewards

Use the Reward model to score $\text{prompts} + \text{responses}$, producing the reward signal $\text{rewards}$.
Use the Critic model to compute the value function $V(s)$ on the prompts, which serves as the baseline.
Compute the advantage $A(s)$ from $\text{rewards}$ and $V(s)$.
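As a small illustration of Step 3, the sketch below computes one advantage per response as reward minus the Critic's baseline, which is the simplified sequence-level form $A(s) = R - V(s)$ used in this article. The function name and the numbers are hypothetical.

```python
def compute_advantages(rewards, values):
    """Advantage = reward minus the Critic's baseline, one entry per response."""
    return [r - v for r, v in zip(rewards, values)]

# Hypothetical scores for two responses.
rewards = [1.2, -0.3]   # from the Reward model
values = [0.5, 0.1]     # V(s) from the Critic
advantages = compute_advantages(rewards, values)
```

A positive advantage means the response scored better than the Critic expected, so its probability should be pushed up; a negative one means the opposite.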

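Step 4: Compute losses and update the models

Actor Loss:
- Compute the policy ratio $r_t(\theta)$ between the current Actor and the sampling-time (old) Actor policy.
- Apply PPO's clipping mechanism to obtain the Actor's policy loss.
Critic Loss:
- Compute the mean squared error (MSE) between the Critic's predicted values and the target returns $\text{returns}$.
Model update:
- Backpropagate through the Actor and the Critic separately and update their parameters.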

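2. Roles and Initialization of the Four Models

(1) Actor model

Role: Generates responses from the input prompts. It is the policy being optimized and the core of the pipeline.
Initialization: Usually loaded from a pretrained language model (e.g. GPT, LLaMA) as the starting point for generation.

(2) Critic model

Role: Evaluates the Actor's generations by estimating the value function $V(s)$, which provides the baseline for the advantage.
Initialization: Can be randomly initialized, or initialized from the Reward model's weights to reduce training time.

(3) Reward model

Role: Models human preferences by scoring the quality of $\text{prompts} + \text{responses}$, steering the direction in which the Actor is optimized.
Initialization: Starts from a pretrained language model and is fine-tuned on human feedback data (e.g. an RLHF dataset).

(4) Reference model

Role: A fixed comparison model. Its log-probabilities enter the KL penalty, which keeps the Actor's outputs from drifting away from the initial distribution of the pretrained model and guards against policy collapse.
Initialization: Uses the same pretrained weights as the Actor and stays frozen throughout the process; it is not trained.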

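3. Mathematical Details

Reward computation

Given a prompt $p$ and a response $r$, the reward takes the form:
$f_{\text{Reward}}(p, r) - \beta \cdot D_{\text{KL}}\left(\pi_{\text{Actor}}(r|p) \,\|\, \pi_{\text{Reference}}(r|p)\right)$

$f_{\text{Reward}}$: the score computed by the Reward model.
$\beta$: the coefficient that sets the strength of the KL penalty.
$D_{\text{KL}}$: the KL divergence between the Actor and Reference distributions, which keeps the Actor from drifting away from the initial distribution.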
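The reward formula above can be sketched in plain Python. This is an illustration under assumptions, not a definitive implementation: the KL term is approximated by the summed difference of the Actor's and Reference model's log-probabilities on the sampled tokens (a common single-sample estimate, which can be negative for an individual response), and the function name, `beta=0.1`, and all numbers are hypothetical.

```python
def kl_penalized_reward(score, actor_log_probs, ref_log_probs, beta=0.1):
    """Reward-model score minus beta times a sample-based KL estimate."""
    kl_estimate = sum(a - r for a, r in zip(actor_log_probs, ref_log_probs))
    return score - beta * kl_estimate

# Hypothetical per-token log-probabilities for a two-token response.
actor_log_probs = [-0.5, -1.0]
ref_log_probs = [-0.7, -1.5]
reward = kl_penalized_reward(1.0, actor_log_probs, ref_log_probs)
```

Here the Actor assigns higher probability to its own response than the Reference model does, so the estimate is positive and the final reward is slightly below the raw score of 1.0.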

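Actor Loss

Based on the PPO objective, the Actor's policy-gradient loss is:
$L_{\text{Actor}} = -\mathbb{E}_t \left[ \min\left(r_t(\theta) \cdot A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot A_t\right) \right]$

$r_t(\theta) = \frac{\pi_{\text{Actor}}(r_t|p_t)}{\pi_{\text{Old}}(r_t|p_t)}$: the policy ratio between the current Actor and the old policy $\pi_{\text{Old}}$ that generated the samples.
$A_t$: the advantage, which measures how much better the current action is than the baseline.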
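To make the clipped objective concrete, here is a minimal plain-Python sketch of the Actor loss over a small batch. It takes log-probabilities rather than probabilities, so the ratio is `exp(new - old)`. The function name and the example values are assumptions chosen so that both samples hit the clip boundary.

```python
import math

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

# Hypothetical batch: ratios of 1.5 and 0.5 with advantages +1 and -1.
old_log_probs = [-1.0, -1.0]
new_log_probs = [-1.0 + math.log(1.5), -1.0 + math.log(0.5)]
advantages = [1.0, -1.0]
loss = ppo_clip_loss(new_log_probs, old_log_probs, advantages)
```

For the first sample the ratio 1.5 is clipped to 1.2; for the second the ratio 0.5 is clipped to 0.8, and the more pessimistic term (-0.8) is kept. The mean of 1.2 and -0.8 is 0.2, so the loss is -0.2.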

Critic Loss

The Critic network's value loss is:
$L_{\text{Critic}} = \mathbb{E}_t \left[ \left(V_{\text{Critic}}(p_t) - R_t\right)^2 \right]$

$R_t$: the target return, computed via temporal differences.
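The text says $R_t$ is based on temporal differences without naming a specific scheme. One common choice in PPO implementations is Generalized Advantage Estimation (GAE), where the return target is the GAE advantage plus the Critic's value. The sketch below assumes GAE; `gamma`, `lam`, the function names, and the numbers are illustrative and not taken from the article.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Return (advantages, returns). `values` has one extra bootstrap entry at the end."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns

def value_loss(predicted, targets):
    """Mean squared error between Critic predictions and return targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)

# Hypothetical 3-step episode with a single terminal reward of 1.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.6, 0.0]  # last entry is the bootstrap value after the end
advantages, returns = gae(rewards, values, gamma=1.0, lam=1.0)
critic_loss = value_loss(values[:-1], returns)
```

With `gamma=1.0` and `lam=1.0` the targets reduce to the plain sum of future rewards, so every return in this example equals 1.0.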

4. Code Implementation

The following is a simplified PPO implementation that includes the Reference model:

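import torch
import torch.nn.functional as F
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"

# 1. Load the tokenizer and the four models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
tokenizer.padding_side = "left"            # left-pad prompts for decoder-only generation

actor_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
reference_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)  # frozen reference model
# Reward and Critic need a scalar output, so they use a sequence-classification head
# with num_labels=1. Loaded from the base checkpoint, that head is randomly initialized;
# in practice, load a trained reward model here instead.
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
critic_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
critic_model.config.pad_token_id = tokenizer.pad_token_id  # required for batched inputs
reference_model.requires_grad_(False)
reward_model.requires_grad_(False)

optimizer = torch.optim.Adam(
    list(actor_model.parameters()) + list(critic_model.parameters()),
    lr=1e-6,  # illustrative value
)


def sequence_log_probs(model, sequences, attention_mask, response_mask):
    """Sum of the log-probabilities the model assigns to the response tokens."""
    logits = model(input_ids=sequences, attention_mask=attention_mask).logits[:, :-1]
    log_probs = logits.log_softmax(dim=-1)
    # Logits at position t predict token t + 1, so gather with the shifted sequence.
    token_log_probs = log_probs.gather(-1, sequences[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (token_log_probs * response_mask[:, 1:]).sum(dim=-1)


# 2. Prepare the data
prompts = [
    "What are the key benefits of renewable energy?",
    "Explain the theory of relativity in simple terms."
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
prompt_len = inputs["input_ids"].shape[1]

# The Actor generates responses; generate() returns prompt + response tokens
with torch.no_grad():
    sequences = actor_model.generate(
        **inputs, max_new_tokens=50, do_sample=True, pad_token_id=tokenizer.pad_token_id
    )
responses = tokenizer.batch_decode(sequences[:, prompt_len:], skip_special_tokens=True)

# Mask that is 1 on response tokens only. Simplification: because pad == eos here,
# the final EOS token of each response is masked out as well.
response_mask = torch.zeros_like(sequences)
response_mask[:, prompt_len:] = (sequences[:, prompt_len:] != tokenizer.pad_token_id).long()
attention_mask = torch.cat([inputs["attention_mask"], response_mask[:, prompt_len:]], dim=1)

# Old-policy log_probs (sampling-time Actor) and Reference log_probs, both without gradients
with torch.no_grad():
    old_log_probs = sequence_log_probs(actor_model, sequences, attention_mask, response_mask)
    ref_log_probs = sequence_log_probs(reference_model, sequences, attention_mask, response_mask)

# 3. Reward computation: Reward-model score minus a KL penalty toward the Reference model
beta = 0.1  # KL penalty coefficient (illustrative value)
scores = []
for prompt, response in zip(prompts, responses):
    scored_inputs = tokenizer(prompt + response, return_tensors="pt")
    with torch.no_grad():
        scores.append(reward_model(**scored_inputs).logits.squeeze())
rewards = torch.stack(scores) - beta * (old_log_probs - ref_log_probs)

# The Critic estimates V(s) on the prompts, one scalar per prompt
critic_values = critic_model(**inputs).logits.squeeze(-1)
returns = rewards  # assume the returns equal the rewards
advantages = (returns - critic_values).detach()

# 4. Actor and Critic losses
# New-policy log_probs from the Actor, this time with gradients
actor_log_probs = sequence_log_probs(actor_model, sequences, attention_mask, response_mask)

# Policy ratio r_t(θ) against the old policy (equal to 1 on the first update)
ratios = torch.exp(actor_log_probs - old_log_probs)

# Actor loss: PPO clipping mechanism
epsilon = 0.2
clipped_ratios = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
actor_loss = -torch.min(ratios * advantages, clipped_ratios * advantages).mean()

# Critic loss: value-function loss
critic_loss = F.mse_loss(critic_values, returns)

# Total loss
total_loss = actor_loss + critic_loss

# 5. Backpropagate and update
optimizer.zero_grad()
total_loss.backward()
optimizer.step()

print("PPO update complete!")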

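Analysis

Introducing the Reference model:
- The Reference model is frozen and serves as the policy baseline for the KL penalty, while the policy ratio $r_t(\theta)$ is taken against the sampling-time Actor.
Cooperation between the Reward and Critic models:
- The Reward model scores $\text{prompts} + \text{responses}$ directly, yielding the reward signal.
- The Critic model outputs the value function $V(s)$, which provides the baseline for the advantage $A(s) = R - V(s)$.
PPO clipping mechanism:
- Prevents a single update from moving the policy too far from the old policy, which improves stability.
- Uses the clipped objective:
  $\text{Actor Loss} = -\mathbb{E}\left[\min\left(r_t(\theta) \cdot A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot A_t\right)\right]$
Separate optimization objectives for the Actor and the Critic:
- The Actor optimizes the policy-gradient objective and the Critic optimizes value prediction, so the two do not interfere with each other.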
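One way to see why clipping adds stability is to look at the slope of the per-sample objective with respect to the ratio. The sketch below, with hypothetical helper names, estimates that slope by finite differences: once the ratio has already moved past the clip boundary in the direction the advantage favors, the slope is zero and the sample stops pushing the policy further.

```python
def clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped * advantage)

def slope(ratio, advantage, h=1e-6):
    """Central finite-difference estimate of d(objective) / d(ratio)."""
    upper = clipped_objective(ratio + h, advantage)
    lower = clipped_objective(ratio - h, advantage)
    return (upper - lower) / (2 * h)

inside = slope(1.0, 1.0)            # ratio within [0.8, 1.2]: gradient flows
too_high = slope(1.5, 1.0)          # A > 0, ratio already above 1.2: flat
too_low = slope(0.5, -1.0)          # A < 0, ratio already below 0.8: flat
recoverable = slope(0.5, 1.0)       # A > 0 but ratio too low: gradient still flows
```

The last case shows that clipping is one-sided: a sample whose ratio moved in the wrong direction can still be corrected.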

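5. Validating the Workflow

Using the example prompts:
- “What are the key benefits of renewable energy?”
- “Explain the theory of relativity in simple terms.”
Check:
1. Whether the responses generated by the Actor model are reasonable.
2. Whether the log-probabilities computed by the Reference model are stable.
3. Whether the Reward model's scores reflect the quality of the responses.
4. Whether the Actor model's responses improve step by step after PPO updates.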
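These checks can be supported by a few numbers logged after each update. The sketch below is one possible set of diagnostics, not something the article prescribes: the mean reward (to see whether responses improve), a sample-based estimate of the KL between old and new policy, and the fraction of samples whose ratio left the clip range. The function name, dictionary keys, and inputs are all hypothetical.

```python
import math

def ppo_diagnostics(rewards, new_log_probs, old_log_probs, epsilon=0.2):
    """Summary statistics for one PPO update, computed from per-sample values."""
    n = len(rewards)
    log_ratios = [new - old for new, old in zip(new_log_probs, old_log_probs)]
    ratios = [math.exp(lr) for lr in log_ratios]
    return {
        "mean_reward": sum(rewards) / n,
        # Mean of (log old - log new) on samples drawn from the old policy
        "approx_kl": sum(-lr for lr in log_ratios) / n,
        "clip_fraction": sum(abs(r - 1.0) > epsilon for r in ratios) / n,
    }

stats = ppo_diagnostics(
    rewards=[1.0, 0.0],
    new_log_probs=[-1.0, -2.0],
    old_log_probs=[-1.0, -1.5],
)
```

A rising `mean_reward` together with a small `approx_kl` suggests the Actor is improving without drifting far per update, while a large `clip_fraction` hints that the step size may be too aggressive.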