Proximal Policy Optimization (PPO) with TensorFlow 2.x

In this article, we will try to understand OpenAI's Proximal Policy Optimization algorithm for reinforcement learning. After some basic theory, we will implement PPO with TensorFlow 2.x. Before you read further, I would recommend you take a look at the Actor-Critic method from here, as we will be modifying the code of that article for PPO.


Why PPO?

  1. Unstable Policy Update: In many policy gradient methods, policy updates are unstable because of large step sizes. A large step leads to a bad policy update, and when this new bad policy is then used to collect further experience, it leads to an even worse policy. If the steps are small, however, learning becomes slow.


  2. Data Inefficiency: Many learning methods learn only from the current experience and discard it after the gradient update. This makes the learning process slow, because a neural network needs a lot of data to learn.


PPO comes in handy to overcome both of the above issues.


Core Idea Behind PPO

In earlier policy gradient methods, the objective function was of the form $L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big]$. Now, instead of the log of the current policy, we will take the ratio of the current policy to the old policy.


$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad L^{CPI}(\theta) = \hat{\mathbb{E}}_t\big[\, r_t(\theta)\,\hat{A}_t \,\big]$
(from https://arxiv.org/abs/1707.06347)
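
To make the ratio concrete, here is a minimal TensorFlow 2.x sketch for a discrete (categorical) policy. This is not the article's final code; the function and argument names (`probability_ratio`, `old_log_probs`, etc.) are illustrative assumptions. The ratio is computed in log space and exponentiated, which is numerically more stable than dividing raw probabilities.

```python
import tensorflow as tf

def probability_ratio(logits, actions, old_log_probs):
    """r_t(theta) for a discrete action space (illustrative sketch).

    logits:        current policy network output, shape [N, num_actions]
    actions:       int32 actions that were actually taken, shape [N]
    old_log_probs: log pi_theta_old(a_t|s_t) stored at rollout time, shape [N]
    """
    # Log-probabilities of every action under the current policy.
    log_probs_all = tf.nn.log_softmax(logits)
    # Select the log-probability of the action that was actually taken.
    idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
    new_log_probs = tf.gather_nd(log_probs_all, idx)
    # Ratio of current to old policy, computed via a log-space difference.
    return tf.exp(new_log_probs - old_log_probs)
```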

We will also clip this ratio and take the minimum of the two terms, i.e., the clipped and the unclipped one.


$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(\, r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \,\big)\Big]$
(from https://arxiv.org/abs/1707.06347)

This clipped objective will restrict large policy updates as shown below.


[Figure: the clipped objective as a function of the probability ratio r, for positive and negative advantages (https://arxiv.org/abs/1707.06347)]
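
As a hedged illustration of how this clipped objective can look in TensorFlow 2.x (a sketch, not the article's final implementation; the names are assumptions, and the 0.2 value for epsilon is the default suggested in the paper):

```python
import tensorflow as tf

CLIP_EPSILON = 0.2  # epsilon from the paper's suggested default

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages):
    """PPO clipped surrogate objective, written as a loss to minimize."""
    # r_t(theta): ratio between the current and the old policy.
    ratio = tf.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms.
    surrogate = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - CLIP_EPSILON, 1.0 + CLIP_EPSILON) * advantages
    # Element-wise minimum, negated because optimizers minimize.
    return -tf.reduce_mean(tf.minimum(surrogate, clipped))
```

Minimizing this loss with a standard optimizer such as Adam is equivalent to maximizing $L^{CLIP}$, and the clipping keeps the new policy from moving too far from the old one in a single update.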