[RLHF-DPO from Scratch (2)] Overfitting Analysis and the IPO Improvement

Latest recommended article published on 2025-05-22 21:17:46

大模型产品经理


Article link: https://blog.csdn.net/bagell/article/details/140494833


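6. DPO Analysis

The DPO derivation above gives the optimal policy under the Bradley-Terry (BT) model. The result looks clean, but in practice it has the following problems, which leave a gap between DPO and PPO-based optimization.

6.1 The IPO perspective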

A General Theoretical Paradigm to Understand Learning from Human Preferences

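Following the IPO paper, we briefly discuss why standard RLHF (PPO) is more robust.

Data: true preference probabilities are noisy rather than exactly 0 or 1, but the empirical preferences in a finite dataset often are exactly 0 or 1. The optimal reward for such labels is infinite. A reward model trained on this data is underfit and never reaches that optimum, which in turn benefits the downstream RL training.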

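Why may standard RLHF be more robust to this problem in practice? While a purported advantage of DPO is that it avoids the need to fit a reward function, we observe that in practice when empirical preference probabilities are in the set {0, 1}, the reward function ends up being underfit. The optimal rewards in the presence of {0, 1} preference probabilities are infinite, but these values are avoided, and indeed regularisation of the reward function has been observed to be an important aspect of RLHF training in practice.

Fitting: the DPO training objective leads to overfitting. In the expression below, driving the policy probability of the rejected response to zero is enough to push the preference probability to 1:

$$
p_\theta(y_w \succ y_l \mid x) = \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
$$

Let $\pi_\theta(y_l \mid x) \to 0$; the loss $-\log p_\theta(y_w \succ y_l \mid x)$ then goes down. Intuitively, the model can lower the loss merely by moving away from the distribution of rejected samples.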
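A minimal numeric sketch of this effect, written in plain Python rather than PyTorch so that it runs without any dependencies. The function name `dpo_loss` and the log-probability values are illustrative assumptions, not taken from the training script: the chosen log-probability is held fixed and only the rejected log-probability under the policy is pushed down.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    margin = beta * ((logp_w - logp_l) - (ref_logp_w - ref_logp_l))
    # -log(sigmoid(m)) is the same as log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities: the reference model rates both responses equally.
ref_logp_w, ref_logp_l = -3.0, -3.0
logp_w = -3.0  # the chosen response is never made more likely

losses = []
for logp_l in (-3.0, -30.0, -300.0):  # only the rejected response is pushed away
    loss = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)
    losses.append(loss)
    print(f"log pi(y_l) = {logp_l:7.1f}  ->  DPO loss = {loss:.3e}")
```

The loss keeps shrinking even though the policy has learned nothing about the chosen response, which is the overfitting route described above.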

Consider the simple example where we have two actions $y$ and $y'$ such that $p^*(y \succ y') = 1$, i.e., $y$ is always preferred to $y'$. Then the Bradley-Terry model would require that $(r(y) - r(y')) \to +\infty$ to satisfy (1). If we plug this into the optimal policy (7) then we would get that $\pi^*(y') / \pi^*(y) = 0$ (i.e., $\pi^*(y') = 0$) irrespective of what constant $\tau$ is used for the KL-regularisation. Thus the strength of the KL-regularisation becomes weaker and weaker the more deterministic the preferences.
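The two-action example in the quote can be checked numerically. The sketch below assumes the standard closed form of the KL-regularised optimum, $\pi^*(y) \propto \pi_{\mathrm{ref}}(y)\exp(r(y)/\tau)$; the helper name `optimal_policy`, the uniform reference policy, and the reward gaps are illustrative choices.

```python
import math

def optimal_policy(rewards, ref_probs, tau):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / tau), normalised over the actions."""
    scaled = [r / tau for r in rewards]
    shift = max(scaled)  # subtract the max before exponentiating, for numerical stability
    weights = [p * math.exp(s - shift) for p, s in zip(ref_probs, scaled)]
    total = sum(weights)
    return [w / total for w in weights]

ref = [0.5, 0.5]  # index 0 is y (always preferred), index 1 is y'
results = {}
for tau in (0.1, 1.0, 10.0):
    for gap in (1.0, 10.0, 100.0, 1000.0):  # r(y) - r(y') growing towards infinity
        pi = optimal_policy([gap, 0.0], ref, tau)
        results[(tau, gap)] = pi[1]
        print(f"tau={tau:5.1f}  reward gap={gap:7.1f}  pi*(y')={pi[1]:.3e}")
```

A larger `tau` only delays the collapse: for every value tried, the probability of `y'` still heads to zero as the reward gap grows.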

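The weakness of the KL-regularisation becomes even more pronounced in the finite data regime, where we only have access to a sample estimate of the preference $\hat{p}(y \succ y')$. Even if the true preference is, e.g., $p^*(y \succ y') = 0.8$, empirically it can be very possible when we only have a few data points to estimate $\hat{p}(y \succ y') = 1$, in which case the empirical optimal policy would make $\pi(y') = 0$ for any $\tau$. This means that overfitting can be a substantial empirical issue, especially when the context and action spaces are extremely large as it is for large language models.

Optimality: in the DPO derivation, the optimal policy is the one that obtains the maximal reward under the BT-model form. The argument quoted below is by contradiction: if that policy were not optimal for the DPO objective, some other policy would attain a strictly lower DPO loss.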

Now, suppose $r$ is optimal for the Bradley-Terry reward objective, meaning that $\pi_r$ is optimal for the RLHF objective. If $\pi_r$ is not optimal for the DPO objective, then there exists another policy $\pi'$ that obtains a strictly lower value for the DPO loss. But then there exists a reward function $r'$ such that $\pi' = \pi_{r'}$, such as $r'(x, y) = \tau \log \frac{\pi'(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$, and this $r'$ therefore obtains a lower Bradley-Terry loss than $r$, a contradiction.
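As a numeric footnote to the finite-data argument earlier in this section: if each comparison independently prefers $y$ with true probability 0.8, the chance that all $n$ collected comparisons prefer $y$ (so the empirical preference is exactly 1) is $0.8^n$. The helper name `prob_all_prefer` and the independence assumption are illustrative.

```python
def prob_all_prefer(p_true, n):
    """Probability that n independent comparisons all come out in favour of y."""
    return p_true ** n

table = {n: prob_all_prefer(0.8, n) for n in (1, 3, 5, 10)}
for n, prob in table.items():
    print(f"n={n:2d}  P(empirical preference == 1) = {prob:.3f}")
```

With three comparisons per pair, roughly half of such pairs would look fully deterministic in the dataset even though the true preference is only 0.8.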

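6.2 DPO overfitting experiment

In the experiment below we track:

The policy probability of one token of the rejected sample: neg_poilicy_prob
The DPO sigmoid preference probability: logistic_prob
The DPO loss: loss_record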

```python
# Hand-rolled DPO training.
# `config`, `ref_model`, `x_chosen`, `x_rejected`, `prompt_chosen`, `prompt_rejected`,
# `attention_mask` and `get_probs` are assumed to be defined earlier in this series.
import torch
import torch.nn.functional as F
import torch.optim as optim
from transformers import LlamaForCausalLM

model = LlamaForCausalLM(config)
optimizer = optim.SGD(model.parameters(), lr=0.1)
epochs = 1000
epochs_print = epochs // 10

neg_poilicy_prob = []  # pi_l
logistic_prob = []     # sigmoid(beta * (log(pi_w/pi_ref_w) - log(pi_l/pi_ref_l)))
loss_record = []       # DPO loss
for i in range(epochs):
    optimizer.zero_grad()

    # Forward pass to get logits; the reference model needs no gradients.
    with torch.no_grad():
        logits_chosen_ref = ref_model(**x_chosen).logits
        logits_rejected_ref = ref_model(**x_rejected).logits
    logits_chosen = model(**x_chosen).logits
    logits_rejected = model(**x_rejected).logits

    # Logits to per-token log-probabilities.
    probs_chosen_ref = get_probs(logits_chosen_ref, prompt_chosen)
    probs_chosen = get_probs(logits_chosen, prompt_chosen)
    probs_rejected_ref = get_probs(logits_rejected_ref, prompt_rejected)
    probs_rejected = get_probs(logits_rejected, prompt_rejected)

    # DPO loss.
    beta = 0.1
    pi_logratios = probs_chosen - probs_rejected
    ref_logratios = probs_chosen_ref - probs_rejected_ref
    logits = pi_logratios - ref_logratios
    losses = -F.logsigmoid(beta * logits) * attention_mask
    loss = losses.sum(-1) / attention_mask.sum()
    loss_record.append(loss.item())

    # Backward pass and parameter update.
    loss.backward()
    optimizer.step()

    # Record the last-token statistics.
    neg_poilicy_prob.append(torch.exp(probs_rejected[:, -1]).item())
    logistic_prob.append(torch.sigmoid(beta * logits)[:, -1].item())

    if i % epochs_print == 0:
        print(f'step {i}, loss:{loss.item()}, pi_rej:{neg_poilicy_prob[-1]}, log_prob:{logistic_prob[-1]}')
```

The training output is:

```text
step 0, loss:0.6998180150985718, pi_rej:0.038977719843387604, log_prob:0.4686339199542999
step 100, loss:0.07304652035236359, pi_rej:5.0733341139252985e-12, log_prob:0.9286277890205383
step 200, loss:0.0369233712553978, pi_rej:4.546971963715913e-15, log_prob:0.9633069634437561
step 300, loss:0.024222105741500854, pi_rej:6.391225561269749e-17, log_prob:0.9757383465766907
step 400, loss:0.01783704198896885, pi_rej:2.9681833589622987e-18, log_prob:0.9820361733436584
step 500, loss:0.01402188278734684, pi_rej:2.6912437551686396e-19, log_prob:0.9858156442642212
step 600, loss:0.01149554643779993, pi_rej:3.742525367120755e-20, log_prob:0.9883255958557129
step 700, loss:0.009704340249300003, pi_rej:7.004394982656413e-21, log_prob:0.9901090860366821
step 800, loss:0.008370832540094852, pi_rej:1.6307020622463523e-21, log_prob:0.9914391040802002
step 900, loss:0.007341053802520037, pi_rej:4.487044362993971e-22, log_prob:0.9924677014350891
```

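Plotting these records after training gives the same picture as the log: the policy probability of the rejected token quickly converges to 0, and the DPO sigmoid probability approaches 1.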

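6.3 The IPO improvement and experimental analysis

IPO defines an MSE-style loss. Instead of pushing a preference probability toward 1, it regresses the gap between log-likelihood ratios to a fixed constant:

$$
\mathcal{L}_{\mathrm{IPO}}(\pi) = \mathbb{E}_{(y_w, y_l, x) \sim \mathcal{D}}\left[\left(h_\pi(y_w, y_l, x) - \frac{\tau^{-1}}{2}\right)^2\right]
$$

This simplified form of the loss provides some valuable insights on the way in which ipo optimizes the policy $\pi$: ipo learns from preferences dataset simply by regressing the gap between log-likelihood ratios $\log(\pi(y_w)/\pi(y_l))$ and $\log(\pi_{\mathrm{ref}}(y_w)/\pi_{\mathrm{ref}}(y_l))$ to $\frac{\tau^{-1}}{2}$.

where $h_\pi$ is:

$$
h_\pi(y, y', x) = \log\left(\frac{\pi(y \mid x)\,\pi_{\mathrm{ref}}(y' \mid x)}{\pi(y' \mid x)\,\pi_{\mathrm{ref}}(y \mid x)}\right)
$$

In the code, $\tau$ corresponds to `beta` and $h_\pi$ to `logits`.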

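Changing a few lines of code is then enough to implement IPO:

```python
if loss_type == 'DPO':
    losses = -F.logsigmoid(beta * logits) * attention_mask
elif loss_type == 'IPO':
    # Regress the log-ratio gap to 1 / (2 * beta) instead of pushing it to infinity.
    constant = 1.0 / (beta * 2.0)
    losses = torch.square(logits - constant) * attention_mask
```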
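To see why the squared loss behaves differently, it helps to compare the two gradients with respect to the log-ratio gap. In the worked derivation below, $h$ stands for `logits` in the code and $\beta$ for `beta`; this is a per-sample sketch that ignores the attention mask and the dependence of $h$ on the model parameters.

```latex
\frac{\partial}{\partial h}\Bigl[-\log \sigma(\beta h)\Bigr]
  = -\beta\,\bigl(1 - \sigma(\beta h)\bigr)
  = -\beta\,\sigma(-\beta h) \;<\; 0
  \quad \text{for every finite } h,

\frac{\partial}{\partial h}\Bigl(h - \tfrac{1}{2\beta}\Bigr)^{2}
  = 2\Bigl(h - \tfrac{1}{2\beta}\Bigr)
  = 0 \iff h = \tfrac{1}{2\beta}.
```

The DPO gradient never vanishes, so gradient descent keeps increasing $h$ without bound. The IPO gradient vanishes at the finite target $1/(2\beta)$ and changes sign beyond it, which pulls $h$ back.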

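6.3.1 Convergence of DPO versus IPO

First we check whether the policy optimized by IPO converges to {0,1}.

With IPO, choosing different values of β (τ in the IPO paper) controls where the policy ends up and keeps it from converging to 0. With DPO, every β eventually converges to 0.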

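```python
import copy

# `train_XPO(model, beta, loss_type, epochs, lr)` is the author's helper wrapping the
# training loop above (its definition is not shown here); its first return value is
# taken to be the recorded policy probability of the rejected token.
epochs = 100
IPO_lr = 0.0001  # IPO oscillates easily if the learning rate is too large
DPO_lr = 0.01

model_tmp = copy.deepcopy(model)
pi_1, _, _ = train_XPO(model_tmp, 0.1, 'IPO', epochs, IPO_lr)
model_tmp = copy.deepcopy(model)
pi_2, _, _ = train_XPO(model_tmp, 0.5, 'IPO', epochs, IPO_lr)
model_tmp = copy.deepcopy(model)
pi_3, _, _ = train_XPO(model_tmp, 1.0, 'IPO', epochs, IPO_lr)
model_tmp = copy.deepcopy(model)
pi_4, _, _ = train_XPO(model_tmp, 0.1, 'DPO', epochs, DPO_lr)
model_tmp = copy.deepcopy(model)
pi_5, _, _ = train_XPO(model_tmp, 0.5, 'DPO', epochs, DPO_lr)
```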
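The same comparison can be reproduced on a toy problem without a language model. The sketch below is not the author's `train_XPO`; it assumes a hypothetical two-action policy with a uniform reference, where a single logit `theta` gives `pi(chosen) = sigmoid(theta)`, so the log-ratio gap equals `theta` itself. The names `train_toy` and `sigmoid` and all step counts and learning rates are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_toy(loss_type, beta, steps, lr):
    """Gradient descent on a two-action policy; returns the final pi(rejected)."""
    theta = 0.0  # start at the uniform reference policy
    for _ in range(steps):
        if loss_type == 'DPO':
            # d/dtheta of -log(sigmoid(beta * theta))
            grad = -beta * sigmoid(-beta * theta)
        elif loss_type == 'IPO':
            # d/dtheta of (theta - 1 / (2 * beta)) ** 2
            grad = 2.0 * (theta - 1.0 / (2.0 * beta))
        else:
            raise ValueError(f"unknown loss type: {loss_type}")
        theta -= lr * grad
    return sigmoid(-theta)  # pi(rejected) = 1 - sigmoid(theta)

for beta in (0.1, 0.5, 1.0):
    print('IPO', beta, train_toy('IPO', beta, steps=500, lr=0.1))
for beta in (0.1, 0.5):
    for steps in (1000, 100000):
        print('DPO', beta, steps, train_toy('DPO', beta, steps=steps, lr=1.0))
```

In this toy setting IPO settles at `pi(rejected) = sigmoid(-1 / (2 * beta))`, a different finite value for each `beta`, while the DPO runs keep shrinking `pi(rejected)` the longer they train.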

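6.3.2 DPO and IPO under different learning rates

Varying the learning rate shows that the IPO policy oscillates more easily as the learning rate grows, so IPO is better suited to small learning rates.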

```python
epochs = 100
IPO_lr = 0.0001
DPO_lr = 0.01

model_tmp = copy.deepcopy(model)
pi_1, _, _ = train_XPO(model_tmp, 0.5, 'IPO', epochs, 0.001)
model_tmp = copy.deepcopy(model)
pi_2, _, _ = train_XPO(model_tmp, 0.5, 'IPO', epochs, 0.0001)
model_tmp = copy.deepcopy(model)
pi_3, _, _ = train_XPO(model_tmp, 0.5, 'DPO', epochs, 0.0001)
model_tmp = copy.deepcopy(model)
pi_4, _, _ = train_XPO(model_tmp, 0.5, 'IPO', epochs, 0.001, 'ADAM')  # the original paper uses Adam
```
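One plausible reason for the oscillation is that the IPO loss is quadratic in the log-ratio gap, so a large step overshoots the target. The toy sketch below assumes the gap itself is the trainable parameter `theta`; the helper name `ipo_errors` is hypothetical, and the learning rates are not comparable to those used with the real model, where the threshold depends on how `logits` responds to the parameters.

```python
def ipo_errors(lr, beta=0.5, steps=6, theta0=0.0):
    """Track theta - 1/(2*beta) under plain gradient descent on (theta - 1/(2*beta))**2."""
    target = 1.0 / (2.0 * beta)
    theta = theta0
    errors = [theta - target]
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)  # each step scales the error by (1 - 2*lr)
        errors.append(theta - target)
    return errors

for lr in (0.1, 0.9, 1.1):
    print(lr, [round(e, 3) for e in ipo_errors(lr)])
```

With a small step the error shrinks monotonically; once `2 * lr` exceeds 1 the error flips sign on every step, and once it exceeds 2 the oscillation grows instead of dying out.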

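6.3.3 Experimental results from the IPO paper

Under DPO all three policies converge to {0,1}, and the regularisation constant has no control over where they converge.

Under IPO the control is clear: different values of the constant converge to noticeably different policy values. The IPO paper does not discuss how to set this hyperparameter adaptively.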

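7. Summary

Deriving DPO requires understanding the BT model, the reward model, and the RL optimization algorithm. Its so-called optimal policy is optimal only under the BT-model form.
DPO is an efficient substitute for RLHF: it optimizes the policy without training a separate reward model. DPO still differs substantially from PPO, though. In PPO the underfit reward model makes the RL optimization more robust, and part of this is determined by the preference data.
There are many DPO variants. A rough classification for further reading:

Loss modifications: IPO, cDPO, CPO
No paired data required: KTO, NPO, Smaug
No reference model required: sDPO, ORPO
Training-pipeline improvements: RSO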

Reference

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

A General Theoretical Paradigm to Understand Learning from Human Preferences

KTO: Model Alignment as Prospect Theoretic Optimization

Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

A note on DPO with noisy preferences & relationship to IPO

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

ORPO: Monolithic Preference Optimization without Reference Model

Statistical Rejection Sampling Improves Preference Optimization

sDPO: Don’t Use Your Data All at Once

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
