DRL — Policy Based Methods — Chapter 3-3 Policy Gradient Methods


3.3.1 What are Policy Gradient Methods?

Policy-based methods are a class of algorithms that search directly for the optimal policy without simultaneously maintaining value function estimates.
Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent.
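As a concrete picture of these weights, the sketch below defines a small PyTorch policy network whose softmax output gives $\pi_\theta(a|s)$; the layer sizes are arbitrary (CartPole-like) assumptions, and the network's weights play the role of $\theta$ later in this chapter.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """A tiny softmax policy pi_theta(a|s); the network weights are theta.
    The sizes (4 state dims, 2 actions) are CartPole-like assumptions."""
    def __init__(self, state_size=4, action_size=2, hidden_size=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1),  # probabilities over the discrete actions
        )

    def forward(self, state):
        return self.net(state)
```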

3.3.2 The Big Picture

[Figures: the big picture of policy gradient methods, and the difference between policy gradient methods and supervised learning.]

3.3.4 Problem Setup

A trajectory is just a state-action sequence. It can correspond to a full episode or just a small part of an episode. We denote a trajectory with the Greek letter $\tau$, and write the cumulative reward from that trajectory as $R(\tau)$.
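To make the notation concrete, the sketch below rolls out one trajectory with a stochastic policy and records $R(\tau)$. The `Policy` class from the earlier sketch and a classic gym-style `env` (with `reset()` returning a state and `step()` returning a 4-tuple) are assumptions, not part of the original lesson.

```python
import torch
from torch.distributions import Categorical

def collect_trajectory(env, policy, max_steps=1000):
    """Roll out one trajectory tau = (s_0, a_0, s_1, a_1, ...) and return the
    log-probabilities of the chosen actions together with R(tau)."""
    log_probs, rewards = [], []
    state = env.reset()  # assumed classic gym-style API: reset() -> state
    for _ in range(max_steps):
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)          # stochastic policy over discrete actions
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())  # assumed 4-tuple step()
        rewards.append(reward)
        if done:
            break
    return log_probs, sum(rewards)  # R(tau): cumulative reward of the trajectory
```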

3.3.5 REINFORCE

Our goal is to find the values of the weights $\theta$ in the neural network that maximize the expected return $U$:

$$U(\theta) = \sum_\tau P(\tau;\theta)\,R(\tau)$$

where $\tau$ is an arbitrary trajectory. One way to determine the value of $\theta$ that maximizes this function is through gradient ascent. This algorithm is closely related to gradient descent, where the differences are that:

  • gradient descent is designed to find the minimum of a function, whereas gradient ascent will find the maximum, and
  • gradient descent steps in the direction of the negative gradient, whereas gradient ascent steps in the direction of the gradient.
Our update step for gradient ascent appears as follows:

$$\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$$
where $\alpha$ is the step size, which is generally allowed to decay over time. Once we know how to calculate or estimate this gradient, we can repeatedly apply this update step, in the hope that $\theta$ converges to the value that maximizes $U(\theta)$.
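A bare-bones illustration of this update rule, using a toy objective (an assumption chosen only so the example runs; it is not the true expected return):

```python
import torch

# One gradient-ascent step: theta <- theta + alpha * grad_theta U(theta).
theta = torch.randn(5, requires_grad=True)   # toy parameter vector
alpha = 1e-2                                 # step size (may be decayed over time)

U = -(theta ** 2).sum()                      # toy objective, maximized at theta = 0
U.backward()                                 # theta.grad now holds grad_theta U(theta)
with torch.no_grad():
    theta += alpha * theta.grad              # step *along* the gradient (ascent)
    theta.grad.zero_()
```

In practice (and in the REINFORCE sketch later in this section), the same thing is achieved by handing the negated objective to an ordinary gradient-descent optimizer.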

In the expected return above, we need to take into account both the probability of each possible trajectory and the return that each trajectory yields.
In fact, to calculate the gradient $\nabla_{\theta}U(\theta)$ exactly, we would have to consider every possible trajectory, which is computationally expensive. Instead, we sample a few trajectories using the policy and use only those sampled trajectories to estimate the gradient.
[Figures: estimating the gradient from sampled trajectories, and the gradient estimate for a single trajectory.]
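The figures above walk through the likelihood-ratio trick; stating the standard result here (from the usual REINFORCE derivation, since the figures are not reproduced), with $m$ sampled trajectories of horizon $H$:

$$\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)$$

For a single sampled trajectory ($m = 1$), this reduces to $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$.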

REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
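Putting the pieces together, here is a minimal REINFORCE sketch for the discrete-action case, reusing the hypothetical `Policy` and `collect_trajectory` helpers from the earlier sketches; for a continuous action space, the `Categorical` head would typically be replaced by, e.g., a `Normal` distribution whose mean (and standard deviation) the network outputs.

```python
import torch
import torch.optim as optim

def reinforce(env, policy, n_episodes=1000, alpha=1e-2):
    """Minimal REINFORCE sketch: sample one trajectory per update and step
    theta along sum_t grad log pi_theta(a_t|s_t) * R(tau)."""
    optimizer = optim.Adam(policy.parameters(), lr=alpha)
    returns = []
    for _ in range(n_episodes):
        log_probs, R = collect_trajectory(env, policy)  # helper sketched above
        # The optimizer minimizes, so negate to perform gradient *ascent* on U(theta).
        loss = -torch.stack(log_probs).sum() * R
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        returns.append(R)
    return returns
```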
