Proximal Policy Optimization (PPO) with TensorFlow 2.x

In this article, we will try to understand OpenAI's Proximal Policy Optimization algorithm for reinforcement learning. After some basic theory, we will implement PPO with TensorFlow 2.x. Before you read further, I would recommend you take a look at the Actor-Critic method from here, as we will be modifying the code of that article for PPO.


Why PPO?

  1. Unstable Policy Update: In many policy gradient methods, policy updates are unstable because of large step sizes. A large step leads to a bad policy update, and when this new bad policy is then used to collect further experience, it leads to an even worse policy. If the steps are small, however, learning becomes slow.


  2. Data Inefficiency: Many learning methods learn only from the current experience and discard it after the gradient update. This makes the learning process slow, because a neural network needs a lot of data to learn.


PPO comes in handy to overcome both of the above issues.


Core Idea Behind PPO

In earlier policy gradient methods, the objective function was of the form $L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big]$. Now, instead of the log of the current policy, we will take the ratio of the current policy to the old policy.


$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad L^{CPI}(\theta) = \hat{\mathbb{E}}_t\big[\, r_t(\theta)\,\hat{A}_t \,\big]$
(from https://arxiv.org/abs/1707.06347)
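
To make the ratio concrete, here is a minimal TensorFlow 2.x sketch for a discrete (categorical) policy. This is not the article's final code; the function and argument names (`probability_ratio`, `old_log_probs`, etc.) are illustrative assumptions. The ratio is computed in log space and exponentiated, which is numerically more stable than dividing raw probabilities.

```python
import tensorflow as tf

def probability_ratio(logits, actions, old_log_probs):
    """r_t(theta) for a discrete action space (illustrative sketch).

    logits:        current policy network output, shape [N, num_actions]
    actions:       int32 actions that were actually taken, shape [N]
    old_log_probs: log pi_theta_old(a_t|s_t) stored at rollout time, shape [N]
    """
    # Log-probabilities of every action under the current policy.
    log_probs_all = tf.nn.log_softmax(logits)
    # Select the log-probability of the action that was actually taken.
    idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
    new_log_probs = tf.gather_nd(log_probs_all, idx)
    # Ratio of current to old policy, computed via a log-space difference.
    return tf.exp(new_log_probs - old_log_probs)
```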

We will also clip this ratio and take the minimum of the two terms, i.e., the clipped and the unclipped one.


$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(\, r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \,\big)\Big]$
(from https://arxiv.org/abs/1707.06347)

This clipped objective will restrict large policy updates as shown below.


[Figure: the clipped objective as a function of the probability ratio r, for positive and negative advantages (https://arxiv.org/abs/1707.06347)]
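
As a hedged illustration of how this clipped objective can look in TensorFlow 2.x (a sketch, not the article's final implementation; the names are assumptions, and the 0.2 value for epsilon is the default suggested in the paper):

```python
import tensorflow as tf

CLIP_EPSILON = 0.2  # epsilon from the paper's suggested default

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages):
    """PPO clipped surrogate objective, written as a loss to minimize."""
    # r_t(theta): ratio between the current and the old policy.
    ratio = tf.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms.
    surrogate = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - CLIP_EPSILON, 1.0 + CLIP_EPSILON) * advantages
    # Element-wise minimum, negated because optimizers minimize.
    return -tf.reduce_mean(tf.minimum(surrogate, clipped))
```

Minimizing this loss with a standard optimizer such as Adam is equivalent to maximizing $L^{CLIP}$, and the clipping keeps the new policy from moving too far from the old one in a single update.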