DPO (Direct Preference Optimization) Explained: Chinese and English Versions

Chinese Version

DPO paper: https://arxiv.org/pdf/2305.18290

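DPO in Detail: From Theory to Implementation

1. What Is DPO?

DPO (Direct Preference Optimization) is an algorithm that optimizes a language model directly on human preference data, with the goal of training a better-performing model from that data. Unlike the conventional pipeline, which fits a reward model and then optimizes against it with reinforcement learning (for example PPO), DPO uses a reparameterization based on the Bradley-Terry model to tie the human preference probability directly to the language model's output probabilities, so no explicit reward model has to be trained.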


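2. What Problem Does DPO Solve?

In the RLHF (Reinforcement Learning from Human Feedback) framework, a reward model is usually trained to score the language model's generations. Training that reward model and then optimizing the policy with reinforcement learning (for example PPO) tends to introduce complexity and instability:

  • The reward model may overfit or drift away from true human preferences.
  • Optimizing the policy with reinforcement learning has to balance exploration against convergence, and can run into problems such as a KL-divergence blow-up.

DPO offers a more direct route: through a reparameterization it folds preference modeling into the language model's own training objective, which bypasses reward modeling and simplifies the training pipeline.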


3. The Core Formulas of DPO

The core idea of DPO is to use the Bradley-Terry preference model to express the preference probability through log-ratios between the policy's and the reference model's output probabilities, with a temperature parameter $\beta$ that controls the strength of the KL penalty.

Core formulas

The human preference probability is modeled as:

$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)}$$
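
As a minimal sketch of the formula above (not code from the paper), the preference probability can be evaluated from four sequence log-probabilities. The function name `preference_prob` and the default `beta=0.1` are illustrative assumptions.

```python
import math

def preference_prob(logp_y1, logp_y2, ref_logp_y1, ref_logp_y2, beta=0.1):
    """Bradley-Terry probability that y1 is preferred over y2.

    Inputs are log-probabilities log pi(y | x) of each whole response.
    The function name and the default beta are illustrative assumptions.
    """
    # beta * log(pi / pi_ref) for each response
    r1 = beta * (logp_y1 - ref_logp_y1)
    r2 = beta * (logp_y2 - ref_logp_y2)
    # 1 / (1 + exp(r2 - r1)), exactly the form of the formula above
    return 1.0 / (1.0 + math.exp(r2 - r1))

# When the policy equals the reference model, both log-ratios are zero
# and neither response is favored.
print(preference_prob(-1.0, -2.0, -1.0, -2.0))
```

Swapping the two responses gives the complementary probability, which is a quick sanity check on any implementation of this formula.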

In practice, the parameterized policy $\pi_\theta$ is trained by minimizing the following loss:

$$L_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = - \mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right) \right]$$

where:

  • $\sigma$ is the sigmoid function.
  • $y_w$ and $y_l$ are the responses that human annotators marked as preferred and dispreferred, respectively.

Minimizing this loss makes the policy more likely to generate the outputs humans prefer and less likely to generate the ones they reject.
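
The loss for a single preference pair can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the helper names `log_sigmoid` and `dpo_pair_loss` and the default `beta=0.1` are made up for this example, and the inputs are assumed to be sequence-level log-probabilities.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    # log sigma(z) = min(z, 0) - log(1 + exp(-|z|))
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair.

    logp_*     : log pi_theta(y | x) for the preferred (w) / dispreferred (l) response
    ref_logp_* : the same quantities under the frozen reference model
    beta       : assumed default, chosen only for illustration
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_pair_loss(-1.0, -2.0, -1.0, -2.0))
```

The loss falls below log(2) once the policy favors the preferred response more strongly than the reference does, and rises above it in the opposite case.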


4. How to Understand DPO

DPO's optimization can be understood from the following angles:

  1. Reward reparameterization
    The reward is expressed through the log-ratio of the policy's output probability to the reference model's, so no reward model has to be trained explicitly.
    The implicit reward is defined as:
    $$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

  2. Gradient
    The gradient of the DPO loss is:
    $$\nabla_\theta L_{\text{DPO}} = -\beta \, \mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \sigma\left(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\right) \cdot \left(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\right) \right]$$

    Intuitively, a gradient-descent step:

    • raises the probability of generating $y_w$;
    • lowers the probability of generating $y_l$;
    • gives more weight to pairs the implicit reward ranks badly, that is, pairs where $\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)$ is large.
  3. The temperature parameter $\beta$
    $\beta$ sets the strength of the KL penalty and so balances how far the policy's distribution may move from the reference model's; a larger $\beta$ keeps the policy closer to the reference.
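
To make the weighting in the gradient concrete, the sketch below evaluates the factor sigma(r_l - r_w) for three hypothetical pairs of implicit rewards. The reward values and the helper name `gradient_weight` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_weight(reward_w, reward_l):
    """Weight sigma(r_l - r_w) that the DPO gradient puts on one pair."""
    return sigmoid(reward_l - reward_w)

# (implicit reward of preferred, implicit reward of dispreferred), made-up values
cases = [
    (2.0, -2.0),   # pair already ranked correctly by a wide margin
    (0.0, 0.0),    # policy has no opinion yet
    (-2.0, 2.0),   # pair ranked the wrong way round
]
for r_w, r_l in cases:
    print(r_w, r_l, round(gradient_weight(r_w, r_l), 3))
```

The weight is close to 0 for the first pair, exactly 0.5 for the second, and close to 1 for the third, so mis-ranked pairs dominate the update.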


5. A Worked Example

Suppose that for one prompt the model produced two candidate replies, $y_1$ and $y_2$, and human preference labeling gives the following:

  • $y_1$ is preferred ($y_w = y_1$) and $y_2$ is dispreferred ($y_l = y_2$).
  • The output probabilities are:
    $$\pi_\theta(y_1 \mid x) = 0.6, \quad \pi_\theta(y_2 \mid x) = 0.4, \quad \pi_{\text{ref}}(y_1 \mid x) = 0.5, \quad \pi_{\text{ref}}(y_2 \mid x) = 0.5$$

Compute the implicit rewards (natural logarithm):
$$\hat{r}_\theta(x, y_1) = \beta \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} = \beta \log \frac{0.6}{0.5} \approx 0.182\beta$$
$$\hat{r}_\theta(x, y_2) = \beta \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} = \beta \log \frac{0.4}{0.5} \approx -0.223\beta$$

The probability given by the preference model:
$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\hat{r}_\theta(x, y_2) - \hat{r}_\theta(x, y_1)\right)} = \sigma(\beta \log 1.5) \approx \sigma(0.405\beta)$$

The optimization objective then pushes the model to raise the probability of $y_1$ further and lower that of $y_2$.
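
The example can be checked numerically with a few lines of Python. The text leaves $\beta$ symbolic, so `beta = 0.1` below is an assumed value chosen only to produce concrete numbers.

```python
import math

beta = 0.1  # assumed value; the example above leaves beta symbolic

pi_theta = {"y1": 0.6, "y2": 0.4}   # current policy
pi_ref = {"y1": 0.5, "y2": 0.5}     # reference model

# Implicit rewards: beta * log(pi_theta / pi_ref)
rewards = {y: beta * math.log(pi_theta[y] / pi_ref[y]) for y in pi_theta}

# Bradley-Terry probability that y1 is preferred over y2
p_y1_over_y2 = 1.0 / (1.0 + math.exp(rewards["y2"] - rewards["y1"]))

# DPO loss for this single pair (y_w = y1, y_l = y2)
loss = -math.log(p_y1_over_y2)

print(rewards, p_y1_over_y2, loss)
```

With this small $\beta$ the preference probability sits only slightly above 0.5, which illustrates how $\beta$ scales the reward margin before it enters the sigmoid.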


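6. Differences Between DPO and PPO

| Property | DPO | PPO |
| --- | --- | --- |
| Core idea | Optimizes the language model directly on human preferences | Optimizes the policy with reinforcement learning driven by a reward signal |
| Needs a reward model | No | Yes |
| Objective | Maximize the preference probability | Maximize cumulative reward |
| Implementation complexity | Lower | Higher |
| Stability | Higher | May run into problems such as KL blow-up |

For more on the KL blow-up problem, see the author's other blog post: PPO 可能出现 KL 爆炸等问题的详细分析(KL Explosions in PPO): 中英双语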


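7. Summary

DPO offers an efficient and stable way to optimize language models and is well suited to training better models on large-scale human preference data. Compared with the traditional RLHF pipeline, it is simpler to implement, more consistent in its theoretical grounding, and more reliable in practice.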

Direct Preference Optimization (DPO): A Comprehensive Overview

What Problem Does DPO Solve?

Direct Preference Optimization (DPO) addresses limitations of Reinforcement Learning from Human Feedback (RLHF) by offering a simpler and more direct optimization method. RLHF traditionally fits a reward model and then uses Proximal Policy Optimization (PPO) to align the language model with human preferences. This adds complexity: a separate reward model has to be trained, and the reinforcement learning updates involve policy rollouts and value function estimation.

DPO simplifies this process by directly optimizing the likelihood of human-preferred responses relative to dispreferred ones without requiring an explicit reward model or reinforcement learning steps. Instead, it reformulates the optimization as a maximum likelihood estimation (MLE) problem.

Core Formula of DPO

The central idea of DPO is to use a Bradley-Terry preference model to define probabilities for human preferences based on the log-probabilities output by the model.

Given:

  • $\pi_\theta$: The policy (current model being optimized)
  • $\pi_{ref}$: The reference policy (pre-trained model used as a baseline)
  • $y_w$: Preferred response
  • $y_l$: Dispreferred response
  • $\beta$: Temperature hyperparameter controlling regularization strength

DPO models human preferences through the difference between two log-ratios: the policy-to-reference log-probability ratio of the preferred output and that of the dispreferred output.

The loss function is:
$$L_{DPO}(\pi_\theta; \pi_{ref}) = -E_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \beta \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right) \right]$$

Key Points in the Formula:
  1. The loss directly optimizes the relative log-probabilities of preferred ($y_w$) versus dispreferred ($y_l$) responses.
  2. $\beta$ controls the strength of KL-regularization between the policy and the reference model.
  3. $\sigma(\cdot)$ is the sigmoid function, which maps the scaled log-ratio difference to a preference probability between 0 and 1.
  4. It eliminates the need for explicit reward modeling: the policy-to-reference log-ratio acts as an implicit reward.
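
One way to read the loss as maximum likelihood, written out as a short identity. The symbol $p_\theta$ is notation introduced here for illustration: it denotes the Bradley-Terry preference probability implied by the policy.

$$p_\theta(y_w \succ y_l \mid x) = \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right)$$

$$L_{DPO}(\pi_\theta; \pi_{ref}) = -E_{(x, y_w, y_l) \sim D} \left[ \log p_\theta(y_w \succ y_l \mid x) \right]$$

Minimizing $L_{DPO}$ is therefore the same as maximizing the log-likelihood of the observed preference labels, which is the sense in which DPO is a binary classification problem over preference pairs.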

Understanding the Formula

1. Implicit Reward Calculation

DPO implicitly defines a reward function based on the policy and reference model:

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}$$

This means the reward is proportional to the log-likelihood ratio between the current and reference models.
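
For context, here is a sketch of where this form of the reward comes from, following the derivation in the DPO paper. The symbols $r$, $\pi_r$ and $Z(x)$ are introduced here: $r$ is an arbitrary reward function, $\pi_r$ is the policy that is optimal for it, and $Z(x)$ is a normalizing constant that depends only on the prompt. The KL-constrained objective used in RLHF is

$$\max_{\pi} \; E_{x \sim D,\, y \sim \pi(\cdot \mid x)}\left[ r(x, y) \right] - \beta \, D_{KL}\left[ \pi(\cdot \mid x) \,\|\, \pi_{ref}(\cdot \mid x) \right]$$

and its optimum has the closed form

$$\pi_r(y \mid x) = \frac{1}{Z(x)} \, \pi_{ref}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)$$

Taking logarithms and rearranging gives

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)$$

The Bradley-Terry model only depends on the difference of two rewards for the same prompt, so the $\beta \log Z(x)$ term cancels and only the log-ratio remains.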

2. Optimization Objective

DPO optimizes the probability of preferred completions being ranked higher than dispreferred completions.

Specifically, it increases the likelihood of preferred completions ($y_w$) while decreasing the likelihood of dispreferred ones ($y_l$).

The gradient of the loss is:
$$\nabla_\theta L_{DPO} = -\beta E_{(x, y_w, y_l) \sim D}\left[ \sigma\left(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\right) \left( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \right) \right]$$

3. Weighting by Confidence

The weighting term $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ is close to 1 when the implicit reward ranks the dispreferred completion above the preferred one, and close to 0 when the pair is already ranked correctly by a wide margin. Updates therefore concentrate on the examples where the model is wrong or uncertain, which makes training more effective.
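
The gradient formula follows from the loss in two steps. As a sketch, write $u$ (a symbol introduced here) for the reward margin inside the sigmoid:

$$u = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l), \qquad \frac{d}{du}\left[ -\log \sigma(u) \right] = -\left(1 - \sigma(u)\right) = -\sigma(-u)$$

Because $\pi_{ref}$ does not depend on $\theta$, only the policy terms contribute to the gradient of $u$:

$$\nabla_\theta u = \beta \left( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \right)$$

Multiplying the two by the chain rule and using $\sigma(-u) = \sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ gives the gradient stated above.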


Example Analysis

Suppose we have the following preference pair for a prompt:

Input Prompt:
“What is the capital of France?”

Completions:

  • $y_w$: “The capital of France is Paris.” (Preferred)
  • $y_l$: “The capital of France is London.” (Dispreferred)

The log-probabilities from the current model ($\pi_\theta$) and reference model ($\pi_{ref}$) are:

  • $\log \pi_\theta(y_w \mid x) = -0.2$, $\log \pi_\theta(y_l \mid x) = -0.8$
  • $\log \pi_{ref}(y_w \mid x) = -0.3$, $\log \pi_{ref}(y_l \mid x) = -0.7$

Using the DPO loss formula:

  1. Calculate the log-probability ratios. The values are already log-probabilities, so each log-ratio is a plain difference:
    $$r_w = \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} = -0.2 - (-0.3) = 0.1$$
    $$r_l = \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} = -0.8 - (-0.7) = -0.1$$

  2. Compute the preference difference:
    $$\Delta r = \beta (r_w - r_l) = \beta(0.1 - (-0.1)) = 0.2\beta$$

  3. Final loss:
    $$L = -\log \sigma(\Delta r) = -\log \sigma(0.2\beta)$$

The optimization encourages increasing the likelihood of $y_w$ while reducing that of $y_l$.
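
The arithmetic of this example can be reproduced in a few lines. Since the four listed values are log-probabilities, each log-ratio is a plain difference of the listed numbers. The example keeps $\beta$ symbolic, so `beta = 0.1` below is an assumed value used only to get a concrete loss.

```python
import math

beta = 0.1  # assumed value; the example leaves beta symbolic

# Sequence log-probabilities from the example
logp = {"w": -0.2, "l": -0.8}       # current policy
ref_logp = {"w": -0.3, "l": -0.7}   # reference model

# Log-ratios: differences of log-probabilities
r_w = logp["w"] - ref_logp["w"]     # about 0.1
r_l = logp["l"] - ref_logp["l"]     # about -0.1

# Scaled margin and loss; log(1 + exp(-d)) equals -log(sigmoid(d))
delta = beta * (r_w - r_l)
loss = math.log1p(math.exp(-delta))

print(r_w, r_l, delta, loss)
```

The margin is positive, so the loss comes out slightly below log(2), the value it would take if the policy and the reference model agreed exactly.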


DPO vs PPO: Key Differences

| Aspect | DPO | PPO |
| --- | --- | --- |
| Reward Model | Implicitly modeled via log-probabilities. | Requires an explicit, learned reward model. |
| Algorithm Type | Maximum Likelihood Estimation (MLE). | Reinforcement Learning with Policy Gradients. |
| Training Complexity | Simpler and requires fewer hyperparameters. | More complex with value function updates and clipping mechanisms. |
| Stability | More stable due to direct optimization. | Requires careful tuning to avoid divergence. |
| Data Requirement | Relies on preference data directly. | Requires preference data and rollout data for updates. |
| KL Regularization | Controlled by parameter $\beta$. | Typically an explicit KL penalty added to the reward; PPO clipping separately limits the size of each policy update. |

Why is DPO Effective?

  1. Simplified Training Process: No need for reward model training or complex PPO pipelines.
  2. Implicit Reward Modeling: Avoids a separate reward model; the reward is read off the log-probabilities of the policy and the reference model.
  3. Theoretical Guarantees: Based on Bradley-Terry models, ensuring consistency under reasonable assumptions.
  4. Practical Applicability: Compatible with public preference datasets without requiring new data collection.

Implementation Example

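import torch
import torch.nn.functional as F

def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
    """
    pi_logps: policy log-probabilities, one per completion, shape (B,)
    ref_logps: reference-model log-probabilities, shape (B,)
    yw_idxs, yl_idxs: indices of the preferred / dispreferred completions
    beta: temperature controlling the strength of the KL penalty
    """
    pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]
    # log pi(y_w) - log pi(y_l) under the policy and under the reference
    pi_logratios = pi_yw_logps - pi_yl_logps
    ref_logratios = ref_yw_logps - ref_yl_logps
    # Per-pair loss: -log sigmoid(beta * reward margin)
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Implicit rewards, detached because they are only used for logging
    rewards = beta * (pi_logps - ref_logps).detach()
    return losses, rewards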
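
The function above expects one log-probability per completion. The post does not say how those values are obtained; a common choice, assumed here, is to sum the per-token log-probabilities over the response tokens and skip the prompt. The helper `sequence_logp` and all numbers below are hypothetical and only illustrate the input layout.

```python
def sequence_logp(token_logps, response_mask):
    """Sum per-token log-probabilities over the response tokens.

    token_logps   : log-probability the model assigned to each target token
    response_mask : 1 for response tokens, 0 for prompt or padding tokens
    Hypothetical helper, not part of the dpo_loss function above.
    """
    return sum(lp for lp, keep in zip(token_logps, response_mask) if keep)

# One prompt token followed by two response tokens (made-up numbers).
chosen = sequence_logp([-1.2, -0.1, -0.1], [0, 1, 1])
rejected = sequence_logp([-1.2, -0.3, -0.5], [0, 1, 1])

# Layout expected by dpo_loss: one entry per completion, plus index lists
# saying which entries are preferred and which are dispreferred.
pi_logps = [chosen, rejected]
yw_idxs, yl_idxs = [0], [1]
print(pi_logps, yw_idxs, yl_idxs)
```

To call `dpo_loss`, these lists would be converted to tensors (for example with `torch.tensor`), and `ref_logps` would be built the same way from the reference model's token log-probabilities.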

Conclusion

DPO offers a lightweight alternative to PPO for preference optimization by directly leveraging preference data without relying on complex reinforcement learning frameworks. It is particularly effective for aligning language models with human preferences and offers theoretical guarantees grounded in Bradley-Terry models. Given its simplicity and effectiveness, DPO is increasingly used for tasks requiring preference-based fine-tuning of large language models.

Postscript

Completed in Shanghai at 20:52 on December 26, 2024, with the assistance of the GPT-4o model.
