Reinforcement Learning: Policy Gradient
This is a study note.
1. Value-based vs. policy-based
- Value-based RL tends to select the state or action with the largest value; it iteratively computes the optimal value function Q and improves the policy according to that value function.
- Policy-based RL, usually divided into stochastic and deterministic policies, needs no value function. It assigns a probability distribution over actions and executes actions in the current state according to that distribution. The idea is to parameterize the policy as $\pi_{\theta}(s)$ and search for the parameters $\theta$ that maximize the expected cumulative return: $\max E[\Sigma_{t=0}^{k}R(s_t)|\pi_{\theta}]$; the resulting $\pi_{\theta}(s)$ is then the optimal policy.
Differences:
1) Whereas value-based methods parameterize the value function, policy-based methods parameterize the policy directly, which makes $\pi_{\theta}(s)$ simpler, more efficient, and easier to converge.
2) Value-based methods suit discrete action spaces. A continuous action space can be discretized, but the discretization granularity is hard to choose, and a small change in the value function can cause a large change in the policy. Policy-based methods suit continuous action spaces: rather than computing a probability for every action, the action can be sampled from, e.g., a normal distribution.
3) Policy-based methods usually adopt stochastic policies, which build exploration ($\varepsilon$) into the learned policy itself.
4) Policy-based methods rely on gradient-based optimization and can fall into local optima.
5) Evaluating a single policy from samples is not sufficient on its own and suffers from high variance.
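To make the first two differences concrete, here is a minimal sketch (with made-up Q-values and Gaussian parameters, not from the lecture) of how each family picks an action: value-based takes the argmax over a discrete action set, while a stochastic policy samples a continuous action from a distribution whose parameters the policy outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Value-based: pick the discrete action with the largest Q-value.
q_values = np.array([0.1, 0.5, 0.3])  # hypothetical Q(s, a) for 3 actions
greedy_action = int(np.argmax(q_values))
print(greedy_action)  # 1

# Policy-based (continuous action space): the policy outputs distribution
# parameters, and the action is sampled from that distribution.
mu, sigma = 0.2, 0.5                  # hypothetical policy outputs for state s
sampled_action = float(rng.normal(mu, sigma))
print(sampled_action)                 # a random draw around mu
```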
2. Policy Gradient
Trajectory: $\tau: s_1,a_1,r_1,...,s_t,a_t,r_t$
Trajectory return: $R(\tau)=\Sigma_{t=0}^{k}R(s_t,a_t|\theta)$
Objective function:
$$l(\theta)=E[\Sigma_{t=0}^{k}R(s_t,a_t)|\pi(\theta)]=\Sigma_{\tau}p(\tau|\theta)R(\tau)$$
where $p(\tau|\theta)$ is the probability distribution over trajectories.
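When the trajectory space is small enough to enumerate, the objective is literally a probability-weighted sum of returns. A tiny sketch with two made-up trajectories (the numbers are arbitrary, chosen only to make the weighted-sum structure concrete):

```python
# l(theta) = sum over trajectories of p(tau|theta) * R(tau).
# Hypothetical values: the probabilities would come from the current policy
# parameters theta, the returns from the reward function.
p = {"tau1": 0.7, "tau2": 0.3}   # p(tau|theta)
R = {"tau1": 1.0, "tau2": 5.0}   # R(tau)

l_theta = sum(p[t] * R[t] for t in p)
print(l_theta)  # 0.7*1.0 + 0.3*5.0 = 2.2
```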
Gradient ascent update: $\theta_{new}=\theta_{old}+\alpha\nabla_\theta l(\theta)$
$$\nabla_\theta l(\theta)=\nabla_{\theta}\Sigma_{\tau}p(\tau|\theta)R(\tau)=\Sigma_{\tau}\nabla_{\theta}p(\tau|\theta)R(\tau)\\
=\Sigma_{\tau}\frac{p(\tau|\theta)}{p(\tau|\theta)}\nabla_{\theta}p(\tau|\theta)R(\tau)\\
=\Sigma_{\tau}p(\tau|\theta)\frac{\nabla_{\theta}p(\tau|\theta)}{p(\tau|\theta)}R(\tau)\\
=\Sigma_{\tau}p(\tau|\theta)\nabla_{\theta}\log p(\tau|\theta)R(\tau)$$
Here $\nabla_{\theta}\log p(\tau|\theta)=\frac{1}{p(\tau|\theta)}\nabla_{\theta}p(\tau|\theta)$, which follows from the identity $\frac{d\log f(x)}{dx}=\frac{1}{f(x)}\frac{df(x)}{dx}$.
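The identity can be sanity-checked numerically with a finite difference, using an arbitrary smooth positive function $f(x)=x^2+1$ chosen purely for illustration:

```python
import math

f = lambda x: x**2 + 1      # arbitrary positive f(x); f'(x) = 2x
x, h = 1.5, 1e-6

# Left side: finite-difference derivative of log f(x).
lhs = (math.log(f(x + h)) - math.log(f(x - h))) / (2 * h)
# Right side: f'(x) / f(x).
rhs = (2 * x) / f(x)

assert abs(lhs - rhs) < 1e-6
```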
The problem finally reduces to computing the expectation of $\nabla_{\theta}\log p(\tau|\theta)R(\tau)$, which can be estimated by an empirical average over $N$ sampled trajectories (writing $\overline R_{\theta}$ for the sampled objective):
$$\nabla_{\theta}l(\theta)\approx\nabla_{\theta}\overline R_{\theta}=\frac{1}{N}\Sigma_{n=1}^{N}\nabla_{\theta}\log p(\tau^{n}|\theta)R(\tau^{n})$$
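A quick sketch of why the empirical average works: sampling trajectories from $p(\tau|\theta)$ and averaging any per-trajectory quantity converges to the exact expectation. The values and probabilities below are made up for illustration; the scalar values stand in for $\nabla_{\theta}\log p(\tau|\theta)R(\tau)$, which in practice would be vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

values = np.array([1.0, 5.0])   # hypothetical per-trajectory quantity g(tau)
probs = np.array([0.7, 0.3])    # p(tau|theta)

exact = float(np.dot(probs, values))                # true expectation E[g]
samples = rng.choice(values, size=100_000, p=probs)
estimate = float(samples.mean())                    # empirical average

assert abs(estimate - exact) < 0.05
```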
1) How do we compute $\nabla_{\theta}\log p(\tau|\theta)$?
From the trajectory, the probability factorizes as:
$$p(\tau|\theta)=p(s_1)p(a_1|s_1,\theta)p(r_1,s_2|s_1,a_1)p(a_2|s_2,\theta)p(r_2,s_3|s_2,a_2)\cdots p(a_t|s_t,\theta)p(r_t,s_{t+1}|s_t,a_t)\\
=p(s_1)\prod_{t=1}^{T}p(a_t|s_t,\theta)p(r_t,s_{t+1}|s_t,a_t)$$
Taking the logarithm turns the product into a sum:
$$\log p(\tau|\theta)=\log p(s_1)+\Sigma_{t=1}^{T}\left[\log p(a_t|s_t,\theta)+\log p(r_t,s_{t+1}|s_t,a_t)\right]$$
The initial-state and dynamics terms do not depend on $\theta$, so their gradients vanish:
$$\nabla_{\theta}\log p(\tau|\theta)=\Sigma_{t=1}^{T}\nabla_{\theta}\log p(a_t|s_t,\theta)$$
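A small sketch of this result with a hypothetical tabular softmax policy (made up for illustration): the only $\theta$-dependent part of the trajectory log-probability is the sum of per-step action log-probs, so the dynamics terms $p(r_t,s_{t+1}|s_t,a_t)$ never need to be known.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical tabular softmax policy: theta[s] holds the logits for state s.
theta = np.array([[0.5, -0.2],
                  [0.1,  0.3]])      # 2 states, 2 actions

# A sampled trajectory as (s_t, a_t) pairs; rewards and dynamics are
# irrelevant for grad_theta log p(tau|theta).
trajectory = [(0, 1), (1, 0), (0, 0)]

# sum_t log p(a_t | s_t, theta) -- the only theta-dependent part.
log_p = sum(float(np.log(softmax(theta[s])[a])) for s, a in trajectory)
print(log_p)  # a negative number (sum of log-probabilities)
```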
2) How do we finally compute the policy gradient?
$\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}\overline R_{\theta_{old}}$
The derivation is as follows:
$$\nabla_{\theta}\overline R_{\theta}\approx\frac{1}{N}\Sigma_{n=1}^{N}\nabla_{\theta}\log p(\tau^{n}|\theta)R(\tau^{n})\\
=\frac{1}{N}\Sigma_{n=1}^{N}R(\tau^{n})\Sigma_{t=1}^{T}\nabla_{\theta}\log p(a_t^{n}|s_t^{n},\theta)\\
=\frac{1}{N}\Sigma_{n=1}^{N}\Sigma_{t=1}^{T}R(\tau^{n})\nabla_{\theta}\log p(a_t^{n}|s_t^{n},\theta)$$
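This estimator is the REINFORCE update. Below is a minimal runnable sketch on a made-up one-state, two-action toy problem (not from the lecture), using a tabular softmax policy; for softmax logits, $\nabla_{\theta}\log\pi(a|\theta)$ has the closed form one_hot$(a)-\pi$, which the code uses directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy problem: one state, two actions; action 1 yields reward 1,
# action 0 yields reward 0, for T steps per trajectory.
def reward(action):
    return float(action == 1)

theta = np.zeros(2)           # softmax logits of pi(a|theta)
alpha, N, T = 0.1, 10, 5      # step size, trajectories per batch, steps

for _ in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):
        glp = np.zeros_like(theta)    # sum_t grad_theta log pi(a_t|theta)
        ret = 0.0                     # R(tau)
        for _ in range(T):
            pi = softmax(theta)
            a = rng.choice(2, p=pi)
            ret += reward(a)
            glp += np.eye(2)[a] - pi  # softmax score function
        grad += ret * glp             # R(tau) * grad_theta log p(tau|theta)
    theta += alpha * grad / N         # gradient ascent on the estimate

print(softmax(theta))  # the policy should now strongly prefer action 1
```

Note the whole trajectory return $R(\tau^n)$ multiplies every step's score, exactly as in the last line of the derivation; common variance-reduction tricks (baselines, reward-to-go) are deliberately omitted here.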
Reference: Hung-yi Lee (李宏毅), Reinforcement Learning lectures.