LLM Preference Alignment (PPO, DPO, SimPO, GRPO)

连理o

Last modified: 2025-05-09 00:50:22

First published: 2024-08-01 11:18:05

Article link: https://blog.csdn.net/weixin_42437114/article/details/140449020

Preference Alignment
More Alignment Algorithms
- Step-wise Reward
References

Preference Alignment

Reinforcement Learning from Human Feedback (RLHF) – Proximal Policy Optimization (PPO)

Overview

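For an introduction to the PPO algorithm itself, see Introduction to Deep Reinforcement Learning (Policy Gradient, Actor-Critic, PPO). This section focuses on how PPO is applied to preference alignment when training LLMs.
4 models in PPO. RLHF continues training from the SFT model. First, the SFT model generates several different responses to the same prompt, and annotators rank or score them. The resulting human preference data is used to train a reward model, which is itself initialized from the SFT model; the trained reward model then supplies the reward during RL training, judging how good the LLM's responses are. PPO training additionally needs an actor model and a critic model. The actor is the LLM whose preferences we want to align and can be initialized from the SFT model. The critic estimates the value function in PPO and can be initialized from the reward model (it may also share some weights with the actor). Finally, to keep the aligned model's outputs from drifting far from the SFT model, a reference model initialized from the SFT model is usually introduced, and the KL divergence between the actor and the reference model is used as a regularizer. In total, PPO involves 4 different LLMs: two are trained (actor and critic) and two are frozen (reward and reference), so the overall memory cost is very large.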
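To give a rough sense of that memory cost, here is a back-of-the-envelope sketch of the model-state memory for the four models. It rests on assumptions that are not in the text above: all four models have the same parameter count, frozen models are held as 16-bit weights (2 bytes per parameter), and trained models use mixed-precision Adam (roughly 16 bytes per parameter for weights, gradients, and optimizer states). Activations, KV caches, and any sharding or offloading are ignored, and the 7B size is purely illustrative.

```python
from dataclasses import dataclass

BYTES_FROZEN = 2    # assumption: 16-bit weights only
BYTES_TRAINED = 16  # assumption: 16-bit weights and grads, fp32 master weights and Adam moments


@dataclass
class ModelRole:
    name: str
    init_from: str
    trainable: bool


# The four roles described in the paragraph above.
ROLES = [
    ModelRole("actor", "SFT model", True),
    ModelRole("critic", "reward model", True),
    ModelRole("reward", "SFT model", False),
    ModelRole("reference", "SFT model", False),
]


def model_state_gb(n_params: float, roles=ROLES) -> float:
    """Rough model-state memory in GB (1e9 bytes) under the assumptions above."""
    total_bytes = sum(
        n_params * (BYTES_TRAINED if role.trainable else BYTES_FROZEN)
        for role in roles
    )
    return total_bytes / 1e9


if __name__ == "__main__":
    print(model_state_gb(7e9))  # hypothetical 7B-parameter models
```

Under these assumptions the two trained models dominate the total, which is why removing the critic (and the separately trained reward model) is attractive.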

PPO in RLHF

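(1) Sampling data. In NLP, generating one token can be viewed as an action and generating a complete response as an episode. The actor therefore produces interaction data by drawing prompts from the dataset and sampling responses; the log-probs of the sampled tokens must be saved for the later PPO update.
(2) Computing the reward. The reward model assigns a reward $R(s_T)$ to the actor's finished output $s_T$, so the reward arrives only at the end of the episode and reflects only the quality of the final result. How, then, do we judge whether the intermediate generation steps are reasonable? A simple, blunt approach is to reward conformity: whenever the actor's token agrees with the reference model, it earns a small extra reward. This amounts to adding $\log P_{\mathrm{ref}}(a_t \mid s_t)-\log P(a_t \mid s_t)$ to the reward. The higher $\log P_{\mathrm{ref}}(a_t \mid s_t)$, the more the actor agrees with the reference model and the higher the reward, while $-\log P(a_t \mid s_t)$ acts as a regularizer that preserves the diversity of the output distribution, so the actor's outputs are not identical to the reference model's. This term can be viewed as the negative of a single-sample estimate of the KL divergence $\text{KL}(P\|P_{\mathrm{ref}})$
$\left\{\begin{array}{ll} r_t=-k \cdot\log \frac{P(a_t \mid s_t)}{P_{\mathrm{ref}}(a_t \mid s_t)}, & t \neq T \\ r_t=-k \cdot\log \frac{P(a_t \mid s_t)}{P_{\mathrm{ref}}(a_t \mid s_t)}+R(s_T), & t=T \end{array}\right.$ where $k = 0.1$ is a hyperparameter
(3) Computing the advantage. The critic predicts $V^{\theta}(s_t)$, which combined with $r_t$ gives the advantage
$A^{\theta}(s_t,a_t)=r_t+\gamma V^{\theta}(s_{t+1})-V^{\theta}(s_t)$ Generalized Advantage Estimation (GAE). The advantage above captures only the immediate (one-step) advantage at time $t$. We can also fold in future advantages and redefine the advantage as
$A^{\theta}(s_t,a_t)=(r_t+\gamma V^\theta(s_{t+1})-V^\theta(s_t))+\lambda\cdot \gamma A^{\theta}(s_{t+1},a_{t+1})$ where $A^{\theta}(s_{t},a_{t})$ can be computed by dynamic programming, working backward from time $T$ (with the value and advantage after the final step taken as 0)
(4) PPO training. The sampled data $(s_t,a_t)\sim\pi_{\theta_{\text{old}}}$ is used to train the actor and critic for several epochs. The actor loss is
$\begin{aligned} L_{\text{actor}}=-\min(&\frac{p_{\theta}(a_t|s_t)}{p_{\theta_{\text{old}}}(a_t|s_t)}A^{\theta_{\text{old}}}(s_t,a_t), \\&\text{clip}(\frac{p_{\theta}(a_t|s_t)}{p_{\theta_{\text{old}}}(a_t|s_t)},1-\varepsilon,1+\varepsilon)A^{\theta_{\text{old}}}(s_t,a_t)) \end{aligned}$ The critic loss is the MSE between the "actual return" and the "predicted return". The one-step actual return is $r_t+\gamma V^{\theta_{\text{old}}}(s_{t+1})$; since we also fold in future advantages, it can be refined to $A^{\theta_{\text{old}}}(s_t,a_t)+V^{\theta_{\text{old}}}(s_{t})$, which is computed once with the old critic at sampling time and then held fixed. The predicted return is $V^{\theta}(s_t)$, the current critic's value estimate. The critic loss is therefore
$\begin{aligned} L_{\text{critic}}=\left(V^{\theta}(s_t)-\left(A^{\theta_{\text{old}}}(s_t,a_t)+V^{\theta_{\text{old}}}(s_{t})\right)\right)^2 \end{aligned}$ The total loss is
$\begin{aligned} L=L_{\text{actor}}+0.1\cdot L_{\text{critic}} \end{aligned}$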
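Steps (2) and (3) can be sketched in plain Python for a single response. This is a minimal illustration rather than a reference implementation: the function names `shaped_rewards` and `gae` are hypothetical, the defaults `gamma=1.0` and `lam=0.95` are assumed values not given in the text, and the value after the final token is taken to be 0.

```python
def shaped_rewards(score, log_probs, ref_log_probs, k=0.1):
    """Per-token rewards: -k * log(P / P_ref) at every step,
    plus the reward-model score R(s_T) on the final token."""
    rewards = [-k * (lp - ref_lp) for lp, ref_lp in zip(log_probs, ref_log_probs)]
    rewards[-1] += score
    return rewards


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    The value and advantage after the last step are assumed to be 0."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        advantages[t] = delta + gamma * lam * next_adv
        next_value, next_adv = values[t], advantages[t]
    return advantages


if __name__ == "__main__":
    # Hypothetical 3-token response with made-up log-probs and critic values.
    rewards = shaped_rewards(2.0, [-1.0, -2.0, -0.5], [-1.5, -2.0, -1.0])
    print(rewards)
    print(gae(rewards, values=[0.5, 0.8, 1.5]))
```

With `lam=0` the recursion reduces to the one-step advantage, and with `lam=1` (and `gamma=1`) each advantage becomes the sum of remaining rewards minus the current value estimate.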

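import copy

# Schematic only: load_policy_model, load_reward_model, sample_prompt, reward_func,
# advantage_func, actor_loss_func, critic_loss_func and train are placeholders.
policy_model = load_policy_model()              # actor, initialized from the SFT model
ref_policy_model = copy.deepcopy(policy_model)  # frozen reference model
reward_model = load_reward_model()              # frozen reward model
critic_model = copy.deepcopy(reward_model)      # critic, initialized from the reward model

for k in range(20000):
    # (1) sample data: generate responses and keep the sampling-time log-probs
    prompts = sample_prompt()
    responses, old_log_probs = policy_model(prompts)

    # (2) compute rewards: reward-model score plus the per-token KL penalty
    scores = reward_model(prompts, responses)
    _, ref_log_probs = ref_policy_model(prompts, responses)
    rewards = reward_func(scores, old_log_probs, ref_log_probs)

    # (3) compute advantages and fixed return targets with the old critic
    old_values = critic_model(prompts, responses)
    old_advantages = advantage_func(rewards, old_values)
    returns = old_advantages + old_values

    # (4) run several PPO epochs on the same sampled batch
    for epoch in range(4):
        _, log_probs = policy_model(prompts, responses)
        values = critic_model(prompts, responses)
        actor_loss = actor_loss_func(old_advantages, old_log_probs, log_probs)
        critic_loss = critic_loss_func(returns, values)
        loss = actor_loss + 0.1 * critic_loss
        train(loss, policy_model.parameters(), critic_model.parameters())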
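The two loss terms in the training loop can be sketched as plain-Python functions over per-token lists. The names `clipped_actor_loss` and `value_loss` are hypothetical, `eps=0.2` is an assumed clipping range, and averaging over tokens is one possible reduction. The critic here is regressed onto a fixed return target (advantage plus value, both from sampling time), which is the common PPO formulation and is an assumption about the intended target.

```python
import math


def clipped_actor_loss(old_advantages, old_log_probs, log_probs, eps=0.2):
    """Mean PPO clipped surrogate loss over tokens."""
    losses = []
    for adv, old_lp, lp in zip(old_advantages, old_log_probs, log_probs):
        ratio = math.exp(lp - old_lp)                    # p_theta / p_theta_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(ratio, 1-eps, 1+eps)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


def value_loss(returns, values):
    """Mean squared error between the critic's current predictions and fixed returns."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)


if __name__ == "__main__":
    # Made-up numbers for a 2-token response.
    print(clipped_actor_loss([1.0, -0.5], [-1.0, -2.0], [-0.9, -2.3]))
    print(value_loss(returns=[1.0, 2.0], values=[1.5, 2.0]))
```

The `min` makes the clip one-sided: a positive advantage stops being rewarded once the ratio exceeds `1 + eps`, while a negative advantage is still fully penalized when the ratio grows.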

Direct Preference Optimization (DPO)

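As the previous section shows, PPO requires loading 4 LLMs at once and training two of them simultaneously, so both the optimization difficulty and the training cost are substantial. DPO improves on this by training the LLM directly on human preference data. It skips the RL stage and does not need to train a separate reward model or critic model, yet it optimizes the same objective as RLHF, so in theory it reaches the same optimal model while being easier to train. In addition, DPO only needs preference data expressed as pairwise orderings: annotators only have to say which of two responses is better rather than assign an absolute score, which also reduces labeling cost.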

The DPO loss. In RLHF, the objective is
$\max_{\pi_\theta}\mathbb{E}_{x\sim\mathcal{D},y\sim\pi_\theta(y|x)}\bigl[r_\phi(x,y)\bigr]-\beta\mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(y\mid x)\parallel\pi_{\mathrm{ref}}(y\mid x)\bigr]$ Expanding the KL divergence gives
$\begin{aligned} &\max_{\pi}\mathbb{E}_{x\sim\mathcal{D},y\sim\pi(y|x)}\bigl[r(x,y)\bigr]-\beta\mathbb{D}_{\mathrm{KL}}\bigl[\pi(y\mid x)\parallel\pi_{\mathrm{ref}}(y\mid x)\bigr] \\ =&\max_\pi\mathbb{E}_{x\sim\mathcal{D}}\mathbb{E}_{y\sim\pi(y|x)}\left[r(x,y)-\beta\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right] \\ =&\min_{\pi}\mathbb{E}_{x\sim\mathcal{D}}\mathbb{E}_{y\sim\pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}-\frac1\beta r(x,y)\right] \\ =&\min_\pi\mathbb{E}_{x\sim\mathcal{D}}\mathbb{E}_{y\sim\pi(y|x)}\left[\log\frac{\pi(y|x)}{\frac1{Z(x)}\pi_{\mathrm{ref}}(y|x)\exp\left(\frac1\beta r(x,y)\right)}-\log Z(x)\right] \end{aligned}$ This suggests defining a new probability distribution
$\pi^*(y|x)=\frac1{Z(x)}\pi_{\mathrm{ref}}(y|x)\exp\left(\frac1\beta r(x,y)\right)$ where $Z(x)=\sum_y\pi_{\mathrm{ref}}(y|x)\exp\left(\frac1\beta r(x,y)\right)$ is the normalizing factor. Substituting back into the objective gives
$\begin{aligned}&\min_{\pi}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim\pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi^*(y|x)}\right]-\log Z(x)\right]\\=&\min_{\pi}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{D}_{\mathrm{KL}}(\pi(y|x)\parallel\pi^*(y|x))-\log Z(x)\right]\end{aligned}$ Since $Z(x)$ does not depend on $\pi$, minimizing this expression only requires minimizing the KL term, so the optimal $\pi$ is $\pi^*$. However, computing $\pi^*$ still depends on both the reference model and the reward model, and we would like to get rid of the reward model as well. To do so, we invert the expression for $\pi^*$ and write $r(x, y)$ in terms of it:
$r(x,y)=\beta\log\frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}+\beta\log Z(x)$ In RLHF, the reward is expected to predict human preferences through the Bradley–Terry model, i.e.
$p^*(y_w\succ y_l\mid x)=\frac{1}{1+\exp\bigl(-\bigl(r_\phi(x,y_w)-r_\phi(x,y_l)\bigr)\bigr)}$ Maximum likelihood estimation then gives the reward model loss
$\mathcal{L}_R(r_\phi,\mathcal{D})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma(r_\phi(x,y_w)-r_\phi(x,y_l))\right]$ Substituting the expression for $r(x, y)$ into this loss, with $\pi^*$ parameterized by $\pi_\theta$, the $\beta\log Z(x)$ terms cancel in the difference and we obtain the DPO loss
$\mathcal{L}_{\mathrm{DPO}}(\pi_{\theta};\pi_{\mathrm{ref}})=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}\mid x)}{\pi_{\mathrm{ref}}(y_{w}\mid x)}-\beta\log\frac{\pi_{\theta}(y_{l}\mid x)}{\pi_{\mathrm{ref}}(y_{l}\mid x)}\right)\right]$ where $\hat r(x,y)=\beta\log\frac{\pi_{\theta}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ is the reward model implicit in the LLM, which is what the paper title "Your language model is secretly a reward model" refers to
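The DPO loss for a single preference triple can be sketched in a few lines of plain Python. The inputs are assumed to be sequence-level log-probabilities (the sum of token log-probs of each response under the policy and the reference model); the function names and the default `beta=0.1` are illustrative choices rather than values from the text.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple, given sequence log-probs."""
    reward_w = beta * (pi_logp_w - ref_logp_w)  # implicit reward of the preferred response
    reward_l = beta * (pi_logp_l - ref_logp_l)  # implicit reward of the dispreferred response
    return -log_sigmoid(reward_w - reward_l)


if __name__ == "__main__":
    # Made-up log-probs: the policy has shifted mass toward y_w relative to the reference.
    print(dpo_loss(pi_logp_w=-8.0, pi_logp_l=-14.0, ref_logp_w=-10.0, ref_logp_l=-12.0))
```

When the policy equals the reference model both implicit rewards are 0 and the loss is $\log 2$; the loss falls as the policy raises the log-ratio of $y_w$ relative to that of $y_l$.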

Simple Preference Optimization (SimPO)

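Although DPO is much simpler than PPO, it still requires loading both the actor and the reference model, so training remains fairly expensive. In addition, DPO's implicit reward $\hat r(x,y)=\beta\log\frac{\pi_{\theta}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ is not normalized by the length of the generated text, so during optimization the model tends to produce longer responses to obtain a larger reward, especially when the amount of data is small. DPO also has a mismatch between its training and inference objectives: at inference time we want $\pi_{\theta}(y_w\mid x)>\pi_{\theta}(y_l\mid x)$, but DPO trains for $\hat r(x,y_w)>\hat r(x,y_l)$. Even when $\hat r(x,y_w)>\hat r(x,y_l)$ holds, there is no guarantee that the LLM will actually generate $y_w$ at inference time.
SimPO addresses these issues. It trains without a reference model, and through length normalization and consistent training and inference objectives it lets the LLM maintain generation quality while producing shorter outputs. At this point the SimPO loss looks much like a contrastive learning loss, and the RLHF pipeline has been progressively simplified (how well it works in practice, of course, still has to be verified empirically).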
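A small numeric example makes the training and inference mismatch concrete. The log-probabilities below are made up for illustration, and `beta = 0.1` is an assumed value: the DPO implicit reward ranks $y_w$ above $y_l$ because the policy improved on the reference by a larger amount for $y_w$, yet the policy still assigns $y_l$ the higher likelihood.

```python
beta = 0.1  # assumed value for illustration

# Hypothetical sequence log-probabilities under the policy and the reference model.
pi_logp_w, ref_logp_w = -40.0, -50.0  # preferred response y_w
pi_logp_l, ref_logp_l = -30.0, -32.0  # dispreferred response y_l

# DPO implicit rewards: beta * log(pi / pi_ref)
r_hat_w = beta * (pi_logp_w - ref_logp_w)
r_hat_l = beta * (pi_logp_l - ref_logp_l)

reward_prefers_winner = r_hat_w > r_hat_l          # the training objective is satisfied
likelihood_prefers_winner = pi_logp_w > pi_logp_l  # but the policy still favors y_l

print(reward_prefers_winner, likelihood_prefers_winner)
```

The implicit reward measures improvement relative to the reference model, not absolute likelihood, which is why the two rankings can disagree.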

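Concretely, to align the training and inference objectives, SimPO removes the reference model from DPO's implicit reward and adds length normalization by $|y|$. Experiments show that this normalization is crucial: because SimPO has no reference model to provide implicit length normalization, removing it causes the model to produce long, low-quality responses.
$r_{\mathrm{SimPO}}(x,y)=\frac\beta{|y|}\log\pi_\theta(y\mid x)=\frac\beta{|y|}\sum_{i=1}^{|y|}\log\pi_\theta(y_i\mid x,y_{<i})$ In addition, SimPO introduces a reward margin $\gamma$ (unrelated to the discount factor $\gamma$ in PPO), requiring the reward gap between the preferred and dispreferred responses to exceed $\gamma$
$p(y_w\succ y_l\mid x)=\sigma\left(r_{\mathrm{SimPO}}(x,y_w)-r_{\mathrm{SimPO}}(x,y_l)-\gamma\right)$
Finally, the SimPO loss is
$\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta)=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\frac\beta{|y_w|}\log\pi_\theta(y_w|x)-\frac\beta{|y_l|}\log\pi_\theta(y_l|x)-\gamma\right)\right]$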
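The SimPO loss can be sketched in the same style, this time taking per-token log-probabilities so that the length normalization is explicit. The function names and the defaults `beta=2.0` and `gamma=1.0` are illustrative assumptions, not values taken from the text.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_reward(token_logps, beta):
    """Length-normalized reward: (beta / |y|) * sum of token log-probs."""
    return beta * sum(token_logps) / len(token_logps)


def simpo_loss(token_logps_w, token_logps_l, beta=2.0, gamma=1.0):
    """SimPO loss for one (x, y_w, y_l) triple; no reference model is needed."""
    gap = simpo_reward(token_logps_w, beta) - simpo_reward(token_logps_l, beta)
    return -log_sigmoid(gap - gamma)


if __name__ == "__main__":
    # Made-up token log-probs: y_w is shorter but more likely per token.
    print(simpo_loss([-0.5, -0.4, -0.6], [-1.2, -1.0, -1.5, -1.1, -0.9]))
```

Because the reward is an average over tokens, two responses with the same per-token log-probability receive the same reward regardless of length, so simply writing more does not raise the reward.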