Reinforcement Learning 2: Policy Gradient (1)

yyyybupt

Article link: https://blog.csdn.net/qq_41747565/article/details/88196917

Objective function

Given a policy $\pi_\theta(s,a)$ with parameters $\theta$, find the best $\theta$.

Definitions

start value: $J_1(\theta)=V^{\pi_\theta}(s_1)=E_{\pi_\theta}[v_1]$
average value: $J_{avV}(\theta)=\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s)$
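average reward per time-step: $J_{avR}(\theta)=\sum_s d^{\pi_\theta}(s)\sum_a\pi_\theta(s,a)R_s^a$, where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$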

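Optimizing the objective function

1. Finite-difference policy gradient

Policy gradient

A policy gradient algorithm moves $\theta$ along the gradient of the objective $J(\theta)$ toward a local maximum: $\Delta\theta=\alpha\nabla_\theta J(\theta)$, where $\alpha$ is the step size.
Policy gradient: $\nabla_\theta J(\theta)=\begin{pmatrix}\frac{\partial J(\theta)}{\partial\theta_1}\\\vdots\\\frac{\partial J(\theta)}{\partial\theta_n}\end{pmatrix}$
Finite-difference estimate of the policy gradient: for each dimension $k$, $\frac{\partial J(\theta)}{\partial\theta_k}\approx\frac{J(\theta+\varepsilon\mu_k)-J(\theta)}{\varepsilon}$, where $\mu_k$ is the unit vector with 1 in the $k$-th component and 0 elsewhere.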
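The finite-difference estimate can be sketched in a few lines of plain Python. This is only an illustration: `toy_objective` is a hypothetical stand-in for $J(\theta)$ (in practice $J$ would be estimated by running the policy), and the step size and `eps` are arbitrary choices.

```python
def finite_difference_gradient(J, theta, eps=1e-5):
    """Approximate dJ/dtheta_k by perturbing one coordinate at a time."""
    base = J(theta)
    grad = []
    for k in range(len(theta)):
        perturbed = list(theta)
        perturbed[k] += eps  # theta + eps * mu_k
        grad.append((J(perturbed) - base) / eps)
    return grad


def toy_objective(theta):
    # Hypothetical smooth objective with its maximum at theta = (1, -2).
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2


theta = [0.0, 0.0]
alpha = 0.1  # step size (assumed value)
for _ in range(200):
    grad = finite_difference_gradient(toy_objective, theta)
    # Gradient ascent: delta theta = alpha * grad J(theta)
    theta = [t + alpha * g for t, g in zip(theta, grad)]
```

After the loop, `theta` ends up close to `(1, -2)`. Note that each gradient estimate needs one extra evaluation of $J$ per parameter, which is why this approach scales poorly with the number of parameters.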

2. Monte Carlo policy gradient: score function

Likelihood-ratio identity: $\nabla_\theta\pi_\theta(s,a)=\pi_\theta(s,a)\frac{\nabla_\theta\pi_\theta(s,a)}{\pi_\theta(s,a)}=\pi_\theta(s,a)\nabla_\theta\log\pi_\theta(s,a)$

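score function: $\nabla_\theta\log\pi_\theta(s,a)$
softmax policy: assigns a probability to every available action by weighting a linear combination of features, $\pi_\theta(s,a)\propto e^{\varphi(s,a)^T\theta}$; its score function is $\nabla_\theta\log\pi_\theta(s,a)=\varphi(s,a)-E_{\pi_\theta}[\varphi(s,\cdot)]$
Gaussian policy:

mean: parameterized, for example as a linear combination of state features $\mu(s)=\varphi(s)^T\theta$
variance: a fixed value $\sigma^2$, or also parameterized
the action is a real number sampled from the distribution: $a\sim N(\mu(s),\sigma^2)$
score function (derivative of the Gaussian log-density): $\nabla_\theta\log\pi_\theta(s,a)=\frac{(a-\mu(s))\varphi(s)}{\sigma^2}$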
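As a sanity check, the two score functions can be compared against a numerical derivative of $\log\pi_\theta$. The sketch below assumes the linear-softmax form $\pi_\theta(s,a)\propto e^{\varphi(s,a)^T\theta}$ and a fixed $\sigma$ for the Gaussian policy; the feature vectors and parameter values are made up for illustration.

```python
import math


def softmax_probs(theta, features):
    """pi(a) proportional to exp(phi(s,a)^T theta); features[a] is phi(s,a)."""
    logits = [sum(p * t for p, t in zip(phi, theta)) for phi in features]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_score(theta, features, action):
    """Score function phi(s,a) - E_pi[phi(s,.)]."""
    probs = softmax_probs(theta, features)
    expected = [sum(probs[b] * features[b][i] for b in range(len(features)))
                for i in range(len(theta))]
    return [features[action][i] - expected[i] for i in range(len(theta))]


def gaussian_score(theta, phi, action, sigma):
    """Score function (a - mu(s)) * phi(s) / sigma^2 with mu(s) = phi(s)^T theta."""
    mu = sum(p * t for p, t in zip(phi, theta))
    return [(action - mu) * p / sigma ** 2 for p in phi]


def numeric_grad(f, theta, eps=1e-6):
    """Central-difference gradient of a scalar function f at theta."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


theta = [0.5, -0.3]

# Softmax policy with three actions (hypothetical features phi(s,a)).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
softmax_analytic = softmax_score(theta, features, 2)
softmax_numeric = numeric_grad(
    lambda th: math.log(softmax_probs(th, features)[2]), theta)

# Gaussian policy with hypothetical state features, action and sigma.
phi, a, sigma = [1.0, 2.0], 0.7, 0.5


def gaussian_log_pdf(th):
    mu = sum(p * t for p, t in zip(phi, th))
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)


gaussian_analytic = gaussian_score(theta, phi, a, sigma)
gaussian_numeric = numeric_grad(gaussian_log_pdf, theta)
```

In both cases the analytic and numeric vectors should agree up to the finite-difference error.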

3. Policy gradient theorem

For any differentiable policy $\pi_\theta(s,a)$,

and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$,

the policy gradient is: $\nabla_\theta J(\theta)=E_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)Q^{\pi_\theta}(s,a)]$
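The theorem can be checked by hand in the simplest possible case. The sketch below assumes a one-step MDP (a single state, the episode ends after one action), so that $Q^{\pi_\theta}(s,a)$ is just the immediate reward and $J(\theta)=\sum_a\pi_\theta(a)R(a)$. The reward values and the direct softmax-over-logits parameterization are illustrative assumptions.

```python
import math

REWARDS = [1.0, 0.0, 3.0]  # hypothetical immediate reward of each action


def policy(theta):
    """Softmax over one logit per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def objective(theta):
    """J(theta) = sum_a pi(a) R(a) for the one-step MDP."""
    return sum(p * r for p, r in zip(policy(theta), REWARDS))


def theorem_gradient(theta):
    """E_pi[ grad log pi(a) * Q(a) ], computed exactly by summing over actions."""
    probs = policy(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            # d log pi(a) / d theta_k for a softmax over logits
            score_k = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += probs[a] * score_k * REWARDS[a]
    return grad


def numeric_gradient(theta, eps=1e-6):
    """Central-difference gradient of J, used as an independent reference."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((objective(up) - objective(down)) / (2 * eps))
    return grad


theta = [0.1, -0.4, 0.3]
from_theorem = theorem_gradient(theta)
from_differences = numeric_gradient(theta)
```

The two gradient vectors should match up to the finite-difference error. In a real problem the expectation is not summed exactly but estimated from sampled actions.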

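4. Monte Carlo policy gradient (REINFORCE)

Algorithm

Update the parameters by stochastic gradient ascent
Use the policy gradient theorem, taking the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$
$\Delta\theta_t=\alpha\nabla_\theta\log\pi_\theta(s_t,a_t)v_t$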

function REINFORCE

Initialise $\theta$ arbitrarily

for each episode $\{s_1,a_1,r_1,\dots,s_{T-1},a_{T-1},r_{T-1}\}\sim\pi_\theta$ do

for t=1 to T-1 do

$\theta\leftarrow\theta+\alpha\nabla_\theta\log\pi_\theta(s_t,a_t)v_t$

end for

end for

return $\theta$

end function
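A minimal runnable sketch of the REINFORCE loop above, under several assumptions that are not in the original: a hypothetical corridor environment (start at position 0, goal at position 3, reward -1 per step, episodes capped at 20 steps), a tabular softmax policy with one logit per state-action pair, and undiscounted returns $v_t$.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 3, 3, 20  # non-terminal positions 0..2, goal at 3
LEFT, RIGHT = 0, 1


def action_probs(theta, s):
    """Tabular softmax policy: theta[s] holds one logit per action."""
    m = max(theta[s])
    exps = [math.exp(x - m) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]


def sample_episode(theta, rng):
    """Roll out one episode as a list of (state, action, reward)."""
    s, episode = 0, []
    for _ in range(MAX_STEPS):
        a = RIGHT if rng.random() < action_probs(theta, s)[RIGHT] else LEFT
        s_next = s + 1 if a == RIGHT else max(s - 1, 0)  # wall at position 0
        episode.append((s, a, -1.0))  # every step costs -1
        if s_next == GOAL:
            break
        s = s_next
    return episode


def reinforce(num_episodes=5000, alpha=0.01, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # initialise theta
    for _ in range(num_episodes):
        episode = sample_episode(theta, rng)
        # v_t = sum of rewards from step t to the end of the episode
        returns, g = [], 0.0
        for _s, _a, r in reversed(episode):
            g += r
            returns.append(g)
        returns.reverse()
        for (s, a, _r), v in zip(episode, returns):
            probs = action_probs(theta, s)
            for b in (LEFT, RIGHT):
                # grad of log pi(s, a) w.r.t. the logit theta[s][b]
                score = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * score * v  # gradient ascent step
    return theta


theta = reinforce()
right_probs = [action_probs(theta, s)[RIGHT] for s in range(N_STATES)]
```

After training, `right_probs` should favour moving right in every state, since that is the shortest path to the goal. Because $v_t$ comes from a single sampled episode, the updates are noisy, which is the usual motivation for replacing $v_t$ with a learned critic.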

5. Actor-Critic policy gradient

Subtracting a baseline $B(s)$ that depends only on the state does not change the expected policy gradient, because $\sum_{a\in A}\pi_\theta(s,a)=1$ and its gradient is therefore zero:

$\begin{array}{l}E_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)B(s)]\\=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_\theta\pi_\theta(s,a)B(s)\\=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\=0\end{array}$
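The baseline identity can also be verified numerically for a single state. The sketch below assumes a softmax policy over one logit per action; the logits and the baseline value are arbitrary illustrative numbers.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_baseline_term(theta, baseline):
    """sum_a pi(a) * grad log pi(a) * B, computed exactly for one state."""
    probs = softmax(theta)
    n = len(theta)
    total = [0.0] * n
    for a in range(n):
        for k in range(n):
            # d log pi(a) / d theta_k for a softmax over logits
            score_k = (1.0 if k == a else 0.0) - probs[k]
            total[k] += probs[a] * score_k * baseline
    return total


term = expected_baseline_term([0.2, -1.0, 0.7], baseline=5.0)
```

Every component of `term` is zero up to floating-point rounding, whatever value the baseline takes, so a baseline can reduce the variance of the gradient estimate without biasing it.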