【RL】4.Policy Gradient

BevnWu

Column: Reinforcement Learning_BW. Tags: reinforcement learning

Copyright: BevanWu

Article link: https://blog.csdn.net/qq_41407979/article/details/109369428

RL-Ch4-Policy Gradient

Policy Gradient

Examples of reinforcement learning

Scene	Agent	Env	Reward Function
Video game	Game controller	Console	20 points per monster killed
Go	AlphaGo	Lee Sedol	The rules of Go

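In the examples above, the policy $\pi$ can be represented concretely by a neural network; its parameters $\theta$ are the weight matrices between the input layer and the output layer.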
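As a minimal sketch of what "a policy parameterized by $\theta$" can look like, the block below uses a single linear layer followed by a softmax. The class name `LinearSoftmaxPolicy`, the feature-vector state, and the initialization scale are assumptions made for illustration, not part of the original article; practical agents usually use deeper networks.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


class LinearSoftmaxPolicy:
    """pi_theta(a | s) as one linear layer plus softmax (hypothetical example).

    theta is an (n_actions x n_features) weight matrix.
    """

    def __init__(self, n_features, n_actions, rng):
        # Small random initial weights; the scale 0.1 is an arbitrary choice.
        self.theta = [[rng.gauss(0.0, 0.1) for _ in range(n_features)]
                      for _ in range(n_actions)]

    def action_probs(self, state):
        """Return the distribution over actions for a feature-vector state."""
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.theta]
        return softmax(logits)

    def sample_action(self, state, rng):
        """Draw a_t ~ pi_theta(. | s_t)."""
        probs = self.action_probs(state)
        return rng.choices(range(len(probs)), weights=probs)[0]


policy = LinearSoftmaxPolicy(n_features=4, n_actions=3, rng=random.Random(0))
state = [1.0, 0.0, 0.5, -1.0]
probs = policy.action_probs(state)
action = policy.sample_action(state, random.Random(1))
```

The output of `action_probs` is a probability vector over actions, which is the quantity written $p_\theta(a_t|s_t)$ in the rest of the article.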

The figure below shows a Markov chain augmented with actions.

[Figure: Markov chain with actions]

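Let a trajectory be $\tau=\{s_1,a_1,\dots,s_T,a_T\}$. Its probability under the policy is

$$\begin{aligned} p_\theta(\tau) &= p(s_1)\,p_\theta(a_1|s_1)\,p(s_2|s_1,a_1)\,p_\theta(a_2|s_2)\,p(s_3|s_2,a_2)\cdots \\ &= p(s_1)\prod_{t=1}^{T} p_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t) \end{aligned} \tag{1}$$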
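The factorization of $p_\theta(\tau)$ can be checked on a toy problem. The sketch below assumes a hypothetical 2-state, 2-action MDP with a tabular softmax policy; the numbers in `P_INIT` and `P_TRANS` are made up for illustration. Because the product runs up to $p(s_{T+1}|s_T,a_T)$, the state list carries one more entry than the action list.

```python
import math

# Hypothetical 2-state, 2-action MDP, used only for illustration.
P_INIT = [0.6, 0.4]                      # p(s_1)
P_TRANS = [                              # P_TRANS[s][a][s'] = p(s' | s, a)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.3, 0.7]],
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy(theta, s):
    """p_theta(a | s) for a tabular softmax policy; theta[s][a] are logits."""
    return softmax(theta[s])


def trajectory_prob(theta, states, actions):
    """p_theta(tau) as the product p(s_1) * prod_t p_theta(a_t|s_t) p(s_{t+1}|s_t,a_t).

    states  = [s_1, ..., s_T, s_{T+1}]  (one more state than actions)
    actions = [a_1, ..., a_T]
    """
    prob = P_INIT[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= policy(theta, s)[a] * P_TRANS[s][a][s_next]
    return prob


# Uniform policy (all logits equal): each action has probability 0.5.
theta = [[0.0, 0.0], [0.0, 0.0]]
prob = trajectory_prob(theta, states=[0, 0, 1], actions=[0, 1])
# prob = 0.6 * (0.5 * 0.9) * (0.5 * 0.8) = 0.108
```

Only the `policy(theta, s)[a]` factors depend on $\theta$; the initial-state and transition terms belong to the environment.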

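Decisions also have to account for the return, so we add the reward $r_t$ of each step to the figure above. The total return of a trajectory is

$$R(\tau)=\sum_{t=1}^{T} r_t \tag{2}$$

[Figure: Markov chain with actions and rewards]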

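The expected return is then

$$\bar{R_\theta}=\sum_\tau R(\tau)\,p_\theta(\tau)=\mathbb{E}_{\tau\sim p_\theta(\tau)}[R(\tau)] \tag{3}$$

Taking the gradient with respect to $\theta$ and using the identity $\nabla p_\theta(\tau)=p_\theta(\tau)\nabla\log p_\theta(\tau)$ gives

$$\begin{aligned} \nabla \bar{R_\theta} &= \sum_\tau R(\tau)\,\nabla p_\theta(\tau) = \sum_\tau R(\tau)\,p_\theta(\tau)\,\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} \\ &= \sum_\tau R(\tau)\,p_\theta(\tau)\,\nabla \log p_\theta(\tau) \\ &= \mathbb{E}_{\tau\sim p_\theta(\tau)}[R(\tau)\,\nabla \log p_\theta(\tau)] \end{aligned} \tag{4}$$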
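The log-derivative form of the gradient can be sanity-checked numerically. The sketch below assumes the simplest possible setting, a one-step problem where a trajectory is a single action, with a made-up reward table `REWARDS` and a softmax policy over logits `theta`. It computes $\sum_\tau R(\tau)p_\theta(\tau)\nabla\log p_\theta(\tau)$ exactly and compares it with a finite-difference gradient of the expected return.

```python
import math

REWARDS = [1.0, 0.0, 3.0]   # hypothetical R(tau) for each of 3 actions


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_return(theta):
    """R_bar(theta) = sum_tau R(tau) p_theta(tau); here tau is a single action."""
    return sum(r * p for r, p in zip(REWARDS, softmax(theta)))


def grad_log_prob(theta, a):
    """Gradient of log softmax(theta)[a] w.r.t. theta: one_hot(a) - p."""
    p = softmax(theta)
    return [(1.0 if i == a else 0.0) - p[i] for i in range(len(theta))]


def policy_gradient(theta):
    """Exact sum_tau R(tau) p_theta(tau) grad log p_theta(tau)."""
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (r, pa) in enumerate(zip(REWARDS, p)):
        g = grad_log_prob(theta, a)
        for i in range(len(theta)):
            grad[i] += r * pa * g[i]
    return grad


def finite_difference(theta, eps=1e-6):
    """Central-difference gradient of expected_return, for comparison."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((expected_return(up) - expected_return(down)) / (2 * eps))
    return grad


theta = [0.1, -0.2, 0.3]
analytic = policy_gradient(theta)
numeric = finite_difference(theta)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The two gradients should agree up to finite-difference error, which illustrates that the identity is an exact rewrite and not an approximation.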

Computing the policy gradient

There are two ways to compute it:

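Monte Carlo sampling: Eq. (4) can be rewritten as Eq. (5) below. The expectation is approximated by an average over $N$ sampled trajectories $\tau^1,\dots,\tau^N$, and by Eq. (1) only the policy terms depend on $\theta$, so $\nabla\log p_\theta(\tau)=\sum_{t}\nabla\log p_\theta(a_t|s_t)$.

$$\nabla \bar{R_\theta}\approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)\,\nabla\log p_\theta(\tau^n)\overset{\text{Eq. (1)}}{=}\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n)\,\nabla\log p_\theta(a_t^n|s_t^n) \tag{5}$$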

Temporal-difference update: Eq. (4) can be rewritten as Eq. (6).
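Returning to the first of the two methods, the Monte Carlo estimator of Eq. (5) can be sketched as follows. This is a minimal illustration and not the author's implementation: the 2-state MDP, the fixed horizon `HORIZON`, the `reward` function, and the tabular softmax policy are all assumptions chosen so that the example runs with the standard library alone.

```python
import math
import random

# Hypothetical 2-state, 2-action episodic MDP, for illustration only.
P_INIT = [0.5, 0.5]                       # p(s_1)
P_TRANS = [                               # P_TRANS[s][a][s'] = p(s' | s, a)
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.5, 0.5], [0.2, 0.8]],
]
HORIZON = 3                               # T, fixed here for simplicity


def reward(s, a):
    """Hypothetical reward: 1 whenever action 1 is taken, else 0."""
    return 1.0 if a == 1 else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_trajectory(theta, rng):
    """Roll out one trajectory; returns [(s_t, a_t), ...] and R(tau)."""
    s = rng.choices([0, 1], weights=P_INIT)[0]
    steps, total = [], 0.0
    for _ in range(HORIZON):
        a = rng.choices([0, 1], weights=softmax(theta[s]))[0]
        steps.append((s, a))
        total += reward(s, a)
        s = rng.choices([0, 1], weights=P_TRANS[s][a])[0]
    return steps, total


def estimate_gradient(theta, n_samples, rng):
    """Monte Carlo estimate of Eq. (5) for a tabular softmax policy."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_samples):
        steps, total = sample_trajectory(theta, rng)
        for s, a in steps:
            p = softmax(theta[s])
            for i in range(2):
                # d/d theta[s][i] of log p_theta(a | s) = 1[i == a] - p[i]
                grad[s][i] += total * ((1.0 if i == a else 0.0) - p[i])
    return [[g / n_samples for g in row] for row in grad]


rng = random.Random(0)
theta = [[0.0, 0.0], [0.0, 0.0]]          # uniform initial policy
grad = estimate_gradient(theta, n_samples=5000, rng=rng)

# One gradient-ascent step; the learning rate 0.1 is an arbitrary choice.
theta = [[w + 0.1 * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(theta, grad)]
```

Since only action 1 earns reward in this toy setup, the estimated gradient should push the logits of action 1 up in both states. Note that every step of a trajectory is weighted by the same total return $R(\tau^n)$, exactly as written in Eq. (5).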
