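[Deep Reinforcement Learning] (6) PPO Explained, with Complete PyTorch Code

Hello everyone. Today I would like to share the proximal policy optimization (PPO) algorithm from deep reinforcement learning and work through a small example using OpenAI's gym environment. The complete code is available on my GitHub: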

https://github.com/LiSir-HIT/Reinforcement-Learning/tree/main/Model


1. Algorithm Principles

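PPO was proposed mainly because it is hard to choose a learning rate for Policy Gradient when dealing with continuous action spaces. If the learning rate is too small, convergence is poor and training may never finish. If it is too large, the data becomes inconsistent between the old and new policies across iterations, which causes large fluctuations in learning or local oscillation. In addition, because Policy Gradient learns online, previously sampled data cannot be reused once the policy has been updated, so every iteration requires fresh sampling.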

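Trust Region Policy Optimization (TRPO) has drawbacks of its own. It uses importance sampling and the conjugate gradient method to improve sample efficiency and training speed, but its second-order approximation is computationally expensive, and the algorithm is complex to implement and has poor compatibility.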

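PPO retains some of the advantages of Policy Gradient and TRPO. It alternates between sampling data and optimizing a surrogate objective function by stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, PPO proposes a new objective function that allows multiple epochs of minibatch updates on the same sampled data.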
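The alternation between sampling and minibatch optimization can be sketched as an index generator. This is only an illustration: the function name `minibatch_indices`, its parameters, and the sizes used below are hypothetical and are not taken from the author's repository. It shows how one batch of sampled transitions could be reshuffled and reused for several epochs.

```python
import random

def minibatch_indices(n_samples, batch_size, n_epochs, seed=0):
    """Yield index lists for several epochs of shuffled minibatch updates."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # new random order every epoch
        for start in range(0, n_samples, batch_size):
            # Slicing copies, so later shuffles do not alter yielded batches
            yield indices[start:start + batch_size]

# One rollout of 8 transitions, reused for 3 epochs in minibatches of 4
batches = list(minibatch_indices(n_samples=8, batch_size=4, n_epochs=3))
print(len(batches))  # 6 minibatches: 2 per epoch, 3 epochs
```

In a training loop, each yielded index list would select the states, actions, old log-probabilities and advantages for one gradient step.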

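To address these problems, at each update the algorithm takes the probability \pi_{\theta}(a_t | s_t) that the current policy assigns to the action taken by the agent in state s_t at time t, and the probability \pi_{\theta_{old}}(a_t | s_t) that the previous policy assigned to the same action. The ratio of the two probabilities is used to control how far the new policy moves in one update. The ratio r_t is written as:

r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}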

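If the new and old policies differ markedly and the advantage function is large, the update magnitude is increased appropriately. The closer the ratio r_t is to 1, the smaller the difference between the new and old policies.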
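As a small numeric illustration of the ratio, the sketch below computes r_t from log-probabilities, which is how the ratio is usually evaluated in practice for numerical stability. The function name `probability_ratio` and the probabilities 0.4, 0.5 and 0.6 are made up for the example.

```python
import math

def probability_ratio(new_log_prob, old_log_prob):
    """r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space."""
    return math.exp(new_log_prob - old_log_prob)

# Hypothetical probabilities of the same action under the new and old policy
unchanged = probability_ratio(math.log(0.5), math.log(0.5))
more_likely = probability_ratio(math.log(0.6), math.log(0.5))
less_likely = probability_ratio(math.log(0.4), math.log(0.5))
print(unchanged, more_likely, less_likely)  # approximately 1.0, 1.2, 0.8
```

A ratio above 1 means the new policy makes the sampled action more likely than the old policy did, and a ratio below 1 means it makes the action less likely.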

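The advantage function measures how much better or worse taking action a in state s is compared with the average for that state. In the paper, the advantage function \hat{A}_t is computed with GAE (generalized advantage estimation):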

\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V
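The GAE sum can be evaluated with a backward recursion over a finite trajectory. The sketch below makes several assumptions that are not stated in the text above: the TD residual is taken to be \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t), as in the GAE paper; the infinite sum is truncated at the end of the collected segment; and the segment contains no terminal states. The function name `compute_gae` and its arguments are hypothetical.

```python
def compute_gae(rewards, values, next_value, gamma, lam):
    """Backward-recursive GAE over one trajectory segment without terminal states.

    rewards[t] and values[t] belong to step t; next_value is the value
    estimate of the state reached after the last step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else next_value
        # TD residual (assumed definition): delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * v_next - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}, which unrolls to the discounted sum
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

print(compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0))
# [3.0, 2.0, 1.0]
```

With lam=0 the estimate reduces to the one-step TD residual, and with lam=1 it becomes the discounted return minus the value baseline, so lam trades bias against variance.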

### Proximal Policy Optimization Algorithm in Reinforcement Learning

In reinforcement learning, proximal policy optimization (PPO) aims to improve both performance and stability during training. It addresses some limitations of earlier methods such as TRPO by simplifying the algorithm while maintaining or improving its effectiveness.

The core idea of PPO is to optimize the policy through iterative updates that are constrained so that the new policy does not deviate too far from the previous one. This constraint prevents large changes that could destabilize learning. Specifically, an objective function with a clipping mechanism keeps the adjustment in each update cycle small[^2].

To implement this approach effectively:

- **Objective Function**: The key innovation is to take gradient ascent steps on a clipped surrogate objective built from the probability ratio and the advantage estimates.

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_param):
    ratio = torch.exp(new_log_probs - old_log_probs)  # probability ratio r_t
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_param, 1 + clip_param) * advantages
    # Negated because optimizers minimize while PPO maximizes the surrogate
    return -torch.min(surr1, surr2).mean()
```

This snippet computes the loss from the clipped probability ratio between the new and old action probabilities for the states encountered during episodes. Minimizing this loss gives stable yet effective improvement over time, without drastic shifts in the behavior the agent has learned.

Furthermore, benchmarks such as those provided for safe policy optimization offer insight into practical applications where safety constraints matter alongside efficiency[^1]. Such resources support exploration beyond the theoretical foundations, toward real-world problems in complex environments that require robust decision-making.