DRL — Policy Based Methods — Chapter 3-3 Policy Gradient Methods


3.3.1 What are Policy Gradient Methods?

Policy-based methods are a class of algorithms that search directly for the optimal policy without simultaneously maintaining value function estimates.
Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent.
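As a concrete picture of these weights, the sketch below defines a small PyTorch policy network whose softmax output gives $\pi_\theta(a|s)$; the layer sizes are arbitrary (CartPole-like) assumptions, and the network's weights play the role of $\theta$ later in this chapter.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """A tiny softmax policy pi_theta(a|s); the network weights are theta.
    The sizes (4 state dims, 2 actions) are CartPole-like assumptions."""
    def __init__(self, state_size=4, action_size=2, hidden_size=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1),  # probabilities over the discrete actions
        )

    def forward(self, state):
        return self.net(state)
```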

3.3.2 The Big Picture

[Figures: the big picture of policy gradient methods, and the difference between policy gradient methods and supervised learning.]

3.3.4 Problem Setup

A trajectory is just a state-action sequence. It can correspond to a full episode or just a small part of an episode. We denote a trajectory with the Greek letter $\tau$, and write the cumulative reward from that trajectory as $R(\tau)$.
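To make the notation concrete, the sketch below rolls out one trajectory with a stochastic policy and records $R(\tau)$. The `Policy` class from the earlier sketch and a classic gym-style `env` (with `reset()` returning a state and `step()` returning a 4-tuple) are assumptions, not part of the original lesson.

```python
import torch
from torch.distributions import Categorical

def collect_trajectory(env, policy, max_steps=1000):
    """Roll out one trajectory tau = (s_0, a_0, s_1, a_1, ...) and return the
    log-probabilities of the chosen actions together with R(tau)."""
    log_probs, rewards = [], []
    state = env.reset()  # assumed classic gym-style API: reset() -> state
    for _ in range(max_steps):
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)          # stochastic policy over discrete actions
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())  # assumed 4-tuple step()
        rewards.append(reward)
        if done:
            break
    return log_probs, sum(rewards)  # R(tau): cumulative reward of the trajectory
```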

3.3.5 REINFORCE

Our goal is to find the values of the weights $\theta$ in the neural network that maximize the expected return $U$:

$$U(\theta) = \sum_\tau P(\tau;\theta)\,R(\tau)$$

where $\tau$ is an arbitrary trajectory. One way to determine the value of $\theta$ that maximizes this function is through gradient ascent. This algorithm is closely related to gradient descent, where the differences are that:

  • gradient descent is designed to find the minimum of a function, whereas gradient ascent will find the maximum, and
  • gradient descent steps in the direction of the negative gradient, whereas gradient ascent steps in the direction of the gradient.
Our update step for gradient ascent appears as follows:

$$\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$$
where $\alpha$ is the step size, which is generally allowed to decay over time. Once we know how to calculate or estimate this gradient, we can repeatedly apply this update step, in the hope that $\theta$ converges to the value that maximizes $U(\theta)$.
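A bare-bones illustration of this update rule, using a toy objective (an assumption chosen only so the example runs; it is not the true expected return):

```python
import torch

# One gradient-ascent step: theta <- theta + alpha * grad_theta U(theta).
theta = torch.randn(5, requires_grad=True)   # toy parameter vector
alpha = 1e-2                                 # step size (may be decayed over time)

U = -(theta ** 2).sum()                      # toy objective, maximized at theta = 0
U.backward()                                 # theta.grad now holds grad_theta U(theta)
with torch.no_grad():
    theta += alpha * theta.grad              # step *along* the gradient (ascent)
    theta.grad.zero_()
```

In practice (and in the REINFORCE sketch later in this section), the same thing is achieved by handing the negated objective to an ordinary gradient-descent optimizer.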

In the expected return above, we need to take into account both the probability of each possible trajectory and the return that each trajectory yields.
In fact, to calculate the gradient $\nabla_{\theta}U(\theta)$ exactly, we would have to consider every possible trajectory, which is computationally expensive. Instead, we sample a few trajectories using the policy and use only those sampled trajectories to estimate the gradient.
[Figures: estimating the gradient from sampled trajectories, and the gradient estimate for a single trajectory.]
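The figures above walk through the likelihood-ratio trick; stating the standard result here (from the usual REINFORCE derivation, since the figures are not reproduced), with $m$ sampled trajectories of horizon $H$:

$$\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)$$

For a single sampled trajectory ($m = 1$), this reduces to $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$.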

REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
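Putting the pieces together, here is a minimal REINFORCE sketch for the discrete-action case, reusing the hypothetical `Policy` and `collect_trajectory` helpers from the earlier sketches; for a continuous action space, the `Categorical` head would typically be replaced by, e.g., a `Normal` distribution whose mean (and standard deviation) the network outputs.

```python
import torch
import torch.optim as optim

def reinforce(env, policy, n_episodes=1000, alpha=1e-2):
    """Minimal REINFORCE sketch: sample one trajectory per update and step
    theta along sum_t grad log pi_theta(a_t|s_t) * R(tau)."""
    optimizer = optim.Adam(policy.parameters(), lr=alpha)
    returns = []
    for _ in range(n_episodes):
        log_probs, R = collect_trajectory(env, policy)  # helper sketched above
        # The optimizer minimizes, so negate to perform gradient *ascent* on U(theta).
        loss = -torch.stack(log_probs).sum() * R
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        returns.append(R)
    return returns
```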
