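A Detailed Look at the Reinforcement Learning Optimization Methods PPO and DPO, with an Analysis of Their Similarities and Differences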

samoyan

Last modified 2024-08-19 17:42:17

Tags: Artificial Intelligence

First published 2023-12-29 11:43:17

Article link: https://blog.csdn.net/baoyan2015/article/details/135287298

1. PPO (Proximal Policy Optimization)

How it works

Implementation steps

2. DPO (Direct Preference Optimization)

3. DPPO (Distributed Proximal Policy Optimization)

Proximal Policy Optimization (PPO) Algorithm Diagram

PPO Algorithm Overview

PPO Algorithm Details

PPO Algorithm Implementation

Sources

Direct Preference Optimization (DPO) Algorithm Diagram

DPO Algorithm Overview

DPO Algorithm Details

DPO Algorithm Implementation

Sources

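1. PPO (Proximal Policy Optimization)

How it works

Objective function: PPO improves the policy by maximizing a specific objective function. This objective usually contains an expected-return term and may include a regularization term (such as an entropy bonus) to encourage exploration.
Probability ratio clipping: PPO computes the ratio between the probabilities that the new policy and the old policy assign to an action. If this ratio drifts too far from 1, PPO clips it to limit the size of the update and avoid overly large policy changes.
Optimizing the objective: PPO optimizes the objective, usually with stochastic gradient ascent. Gradient updates are applied to the policy network's parameters to raise the probability of high-return actions and lower the probability of low-return actions.
Multiple update iterations: PPO usually runs several iterations (epochs) within one policy update, reusing the same batch of data multiple times to learn efficiently.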
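
The ratio clipping described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a full PPO implementation: the function name `clipped_surrogate`, its arguments, and the default `eps=0.2` are assumptions made for this example, and a real implementation would compute the same quantity on tensors inside an automatic-differentiation framework.

```python
import math

def clipped_surrogate(new_logps, old_logps, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a value to maximize).

    new_logps / old_logps: log-probabilities of the taken actions under the
    new and old policies; advantages: advantage estimates for those actions.
    All names here are illustrative assumptions.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)                # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # keep the ratio in [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)         # pessimistic (lower) bound
    return total / len(advantages)

# A ratio of 1.5 with a positive advantage is cut back to 1 + eps = 1.2,
# so the sample contributes about 1.2 * 2.0 instead of 1.5 * 2.0.
objective = clipped_surrogate([math.log(1.5)], [0.0], [2.0])
```

Taking the minimum of the clipped and unclipped terms means the objective never rewards moving the ratio further outside the clipping range, which is what keeps each update close to the old policy.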

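Implementation steps

Collect data: first, run the current policy in the environment for multiple steps and record the states, actions, and returns.
Compute the advantage function: then compute the advantage at each time step, which usually involves an estimate of the return and a baseline (such as the state-value function).
Optimize the policy: next, update the policy parameters by optimizing the objective function, which means computing its gradient and applying gradient ascent.
Iterate: repeat the steps above until the policy converges or a preset number of iterations is reached.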
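
As an illustration of the advantage step, the sketch below uses Generalized Advantage Estimation (GAE), one common estimator that combines rewards with a value-function baseline. The choice of GAE, the function name `gae_advantages`, and the defaults `gamma=0.99` and `lam=0.95` are assumptions for this example and are not prescribed by the steps above.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Advantage estimates for one trajectory using GAE (illustrative sketch).

    rewards[t]: reward received at step t
    values[t]:  baseline estimate V(s_t) from the value function
    last_value: V of the state after the final step (0.0 if the episode ended)
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better step t went than the baseline predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and all future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

advantages = gae_advantages(rewards=[1.0, 1.0, 1.0], values=[0.5, 0.5, 0.5])
```

With `lam=0` the estimate reduces to the one-step TD error, and with `gamma=1, lam=1` and a zero baseline it reduces to the plain sum of future rewards.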

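In RLHF, PPO loads four models: two for inference only and two for training.

Actor Model: the target language model that we want to train.
Critic Model: estimates the total (expected future) return.
Reward Model: computes the immediate reward.
Reference Model: adds a constraint on the language model during the RLHF stage so that it does not drift (that is, update in an uncontrolled direction where quality may keep getting worse).

Among these:

The Actor and Critic Models are trained during the RLHF stage, while the Reward and Reference Models have frozen parameters.
The Critic, Reward, and Reference Models together form a "reward-loss" computation system; their outputs are combined to compute the losses used to update the Actor and Critic Models.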
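
One common way to combine these outputs, sketched here as an assumption rather than as the only scheme, is to penalize every generated token by how far the Actor's log-probability moves from the Reference Model's, and to add the Reward Model's score on the final token. The function name `shaped_rewards` and the coefficient `kl_coef` are hypothetical names for this example.

```python
def shaped_rewards(actor_logps, ref_logps, reward_score, kl_coef=0.1):
    """Per-token rewards for one generated response (illustrative sketch).

    actor_logps[t]: log-probability of token t under the Actor Model
    ref_logps[t]:   log-probability of the same token under the Reference Model
    reward_score:   scalar score from the Reward Model for the whole response
    """
    # Penalty term: discourages the Actor from drifting away from the Reference Model.
    rewards = [-kl_coef * (a - r) for a, r in zip(actor_logps, ref_logps)]
    # The Reward Model judges the complete response, so its score lands on the last token.
    rewards[-1] += reward_score
    return rewards

rewards = shaped_rewards(
    actor_logps=[-1.0, -2.0], ref_logps=[-1.0, -2.5], reward_score=1.0
)
```

These per-token rewards, together with the Critic Model's value estimates, would then feed the advantage computation and the Actor and Critic losses.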

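2. DPO (Direct Preference Optimization)

DPO is a relatively new method that optimizes directly for the preferences of users or experts rather than for a conventional cumulative reward. It compares different decision sequences or policies (for a language model, different responses to the same prompt) and optimizes the model according to which one users or experts prefer, so that the final policy better matches the intended behavior. DPO is typically used where a reward function is hard to define explicitly, or where user preferences need to be encoded directly into the decision process.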

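Implementing DPO starts from preference data gathered from user or expert feedback. In practice this may require a mechanism for collecting preferences, for example pairwise comparison queries or ranking feedback. The data are then used to train the policy directly: the policy's log-probabilities, measured relative to a frozen reference model, act as an implicit preference score for each response, so no separate reward model has to be trained.

Only two models need to be loaded, one for inference (the frozen reference model) and one for training, and training runs directly on the preference data.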

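3. DPPO (Distributed Proximal Policy Optimization)

How it works

Distributed architecture: DPPO runs in parallel on multiple compute nodes. Each node holds its own copy of the policy and collects data in its own environment instance.
Data synchronization: after a certain number of time steps or at the end of each cycle, every node sends what it has computed (such as gradients or policy updates) to a central node.
Centralized update: the central node aggregates this data and updates the policy. The updated parameters are then distributed back to the compute nodes.
Parallel exploration: because each node can explore a different region of the state space independently, DPPO explores and exploits more efficiently, which speeds up learning.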

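Implementation steps

Assign tasks: distribute the parallel tasks across multiple compute nodes.
Local data collection: each node runs the policy independently and collects states, actions, returns, and other data.
Local gradient computation: each node computes gradients or parameter updates from the data it collected.
Global synchronization: all nodes send their gradients or parameter updates to the central node for aggregation.
Policy update and distribution: the central node updates the policy parameters and distributes the new policy to all nodes.
Iterate: repeat the steps above until the policy converges or a preset number of iterations is reached.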
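
The synchronization pattern in these steps can be simulated in a single process. The sketch below is only a stand-in: instead of PPO gradients computed from environment rollouts, each simulated node computes the gradient of a toy squared-error loss for the model `y = w * x` on its own data, so the update is gradient descent on a loss rather than gradient ascent on a PPO objective. The names `local_gradient`, `train_synchronously`, and `node_data` are hypothetical.

```python
def local_gradient(w, samples):
    """Gradient of the mean squared error of y = w * x on one node's local data."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)

def train_synchronously(node_data, w=0.0, lr=0.05, rounds=200):
    """Simulate the central-node loop: gather gradients, average, update, broadcast."""
    for _ in range(rounds):
        # Local gradient computation: every node starts from the same broadcast w.
        grads = [local_gradient(w, samples) for samples in node_data]
        # Global synchronization: the central node averages the nodes' gradients.
        mean_grad = sum(grads) / len(grads)
        # Policy update and distribution: one step, then w is reused by all nodes.
        w -= lr * mean_grad
    return w

# Three simulated nodes, each holding different samples of the same relation y = 3x.
node_data = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5)],
]
w_final = train_synchronously(node_data)
```

In a real deployment the list comprehension would be replaced by parallel workers and a communication layer, but the order of operations per round stays the same.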

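In short, PPO focuses on stabilizing policy updates by clipping the probability ratio, while DPPO adds distributed computation on top of it to make data collection and processing more efficient and to speed up learning.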

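4. Similarities and differences between PPO and DPPO in reinforcement learning

Similarities

Based on policy gradients: PPO (Proximal Policy Optimization) and DPPO (Distributed Proximal Policy Optimization) are both policy-gradient reinforcement learning algorithms. They learn a policy directly by optimizing a policy function that maps observed states to a probability distribution over actions.
Objective function: both optimize the policy with a similar objective function, learning by maximizing the expected return.
Clipping or limiting policy updates: both PPO and DPPO use a mechanism that prevents overly large changes during a policy update and so keeps learning stable. PPO does this by clipping the probability ratio or by constraining the KL divergence, and DPPO may adopt a similar mechanism to limit policy updates in distributed training.

Differences

Distributed training: the "Distributed" in DPPO means it is designed to run in a distributed computing environment, executing in parallel on multiple processors or machines, whereas PPO usually refers to the single-machine version of the algorithm.
Scalability and parallelism: because DPPO is designed for distributed environments, it scales better on large parallel training workloads, whereas PPO may be limited in this respect.
Communication and synchronization: in a distributed setting, DPPO needs efficient communication and synchronization mechanisms to coordinate the training nodes, a concern that PPO does not face on a single machine.
Resource utilization and efficiency: DPPO can usually make better use of multi-core processors or multiple machines, which in practice may yield faster training and better performance.

In short, PPO and DPPO share the same algorithmic framework and objective function but differ in implementation, degree of parallelism, and target computing environment. DPPO is especially suited to scenarios that require large-scale parallel processing.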

Proximal Policy Optimization (PPO) Algorithm Diagram

PPO Algorithm Overview

Proximal Policy Optimization (PPO) is a type of reinforcement learning algorithm developed by OpenAI. It is designed to perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become a default reinforcement learning algorithm at OpenAI due to its ease of use and good performance.

PPO Algorithm Details

PPO works by trying to compute an update at each step that minimizes the cost function while ensuring the deviation from the previous policy is relatively small. It uses a novel objective function that enables multiple epochs of minibatch updates. The objective function is expressed as:

\[ L^{CLIP}(\theta) = \hat{E}_{t}\left[ \min\left( r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon)\,\hat{A}_t \right) \right] \]

where:

\( \theta \) is the policy parameter
\( \hat{E}_{t} \) denotes the empirical expectation over timesteps
\( r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t) \) is the ratio of the action's probability under the new policy to its probability under the old policy
\( \hat{A}_t \) is the estimated advantage at time \( t \)
\( \varepsilon \) is a hyperparameter, usually 0.1 or 0.2

The PPO algorithm implements a way to do a Trust Region update which is compatible with Stochastic Gradient Descent and simplifies the algorithm by removing the KL penalty and the need to make adaptive updates.

PPO Algorithm Implementation

OpenAI's Baselines release includes scalable, parallel implementations of PPO and TRPO (Trust Region Policy Optimization), both written in Python3 with TensorFlow and both using MPI for data passing. OpenAI has also released a GPU-enabled implementation called PPO2, which runs approximately 3x faster than the original PPO baseline on Atari games.

Sources

OpenAI Research: Proximal Policy Optimization - OpenAI

Direct Preference Optimization (DPO) Algorithm Diagram

DPO Algorithm Overview

Direct Preference Optimization (DPO) is introduced as a new parameterization of the reward model in Reinforcement Learning from Human Feedback (RLHF) that enables extraction of the corresponding optimal policy in closed form. It solves the standard RLHF problem with a simple classification loss, eliminating the need for sampling from the Language Model (LM) during fine-tuning or performing significant hyperparameter tuning.

DPO Algorithm Details

DPO is stable, performant, and computationally lightweight. It fine-tunes Language Models (LMs) to align with human preferences effectively. Notably, DPO exceeds PPO-based RLHF in controlling the sentiment of generations and matches or improves response quality in summarization and single-turn dialogue tasks while being substantially simpler to implement and train.

DPO Algorithm Implementation

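DPO operates by directly optimizing a policy model \( \pi_\theta \) using preference data without the need for an explicit reward model. For a single preference pair, the DPO loss is computed as follows (in training it is averaged over all pairs in the dataset):

\[ \mathcal{L}_{DPO} = -\log \sigma\left( \beta \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right) \]

where:

\( \sigma \) is the sigmoid function
\( \beta \) is a hyperparameter, typically between 0.1 and 0.5
\( x \) is the prompt
\( y_w \) is the preferred response in the preference data
\( y_l \) is the less preferred response
\( \pi_\theta \) is the policy model being optimized
\( \pi_{ref} \) is the frozen reference model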

DPO updates aim to increase the relative log probability of preferred responses over less preferred ones, incorporating a dynamic, per-sample importance weight to prevent model degradation.
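
A minimal sketch of this loss for a single preference pair is shown below. Following the DPO paper, each log-probability is measured relative to a frozen reference model. The function name `dpo_loss` and its argument names are assumptions for this example, and each argument stands for the summed log-probability of a whole response given the prompt.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, less preferred) response pair.

    policy_logp_* : log-probability of the response under the policy being trained
    ref_logp_*    : log-probability of the same response under the frozen reference
    The *_w arguments belong to the preferred response, *_l to the other one.
    """
    # Implicit reward margin: how much more the policy favors y_w over y_l
    # than the reference model does, scaled by beta.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a form that avoids overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference model the margin is 0 and the loss is log(2).
start_loss = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Shifting probability toward the preferred response lowers the loss.
better_loss = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

Because the loss depends only on four log-probabilities per pair, training needs no sampling from the language model and no separately trained reward model.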

Sources

arXiv: Direct Preference Optimization: Your Language Model is Secretly a Reward Model - arXiv
Zhihu: DPO: Direct Preference Optimization 论文解读及代码实践 (paper walkthrough and code practice) - Zhihu