Reinforcement Learning (3): Policy-Based Reinforcement Learning Algorithms

汤姆_布利柏

Last modified: 2024-04-25 22:57:13


First published: 2024-01-03 21:12:01

Article link: https://blog.csdn.net/weixin_46072670/article/details/135073093


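Recap: value-based methods learn a value function and then derive a policy from it; no explicit policy exists during learning. Examples include dynamic programming (DP), Monte Carlo (MC) methods, temporal-difference (TD) algorithms (SARSA and Q-learning), DQN, and the DQN variants. Policy-based methods learn a target policy directly and explicitly; the basic example is the policy gradient algorithm (REINFORCE). Methods that combine the two learn both a value function and a policy function, aiming at the optimal policy and the optimal value function: Actor-Critic (AC), Trust Region Policy Optimization (TRPO), PPO, Deep Deterministic Policy Gradient (DDPG), and Soft Actor-Critic (SAC). These combined methods can also be counted as policy-based methods. This article covers only policy-based reinforcement learning algorithms.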

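Value-based: learn a value function; the policy is implicit. A deterministic policy is generated greedily from the value function: $a_{t}=\underset{a\in A}{argmax}Q(s_{t},a)$ .

Policy-based: no value function; the policy is learned directly. The policy is parameterized as $\pi _{\theta }(a|s)$ , where $\theta$ is the learnable policy parameter and the output of $\pi _{\theta }(a|s)$ is a probability distribution over the action set.

AC: learn both a value function and a policy function. (Counted as policy-based.)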

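Advantages of policy learning:

a. Better convergence properties: it is guaranteed to converge at least to a local optimum (worst case) and may reach the global optimum (best case).

b. Policy gradients are more effective in high-dimensional action spaces.

c. Policy gradients can learn stochastic policies, whereas value functions cannot (they only yield deterministic policies).

Disadvantages of policy learning:

a. It usually converges to a local optimum rather than the global one.

b. Policy evaluation has high variance.

Policy learning also has an advantage over $\epsilon$-greedy action selection on a value function:

With a parameterized policy, the action probabilities change smoothly as a function of the parameters being optimized.

With $\epsilon$-greedy selection, an arbitrarily small change in the estimated action values can change which action has the maximal value, and the probability of selecting an action can then jump abruptly.

For this reason, policy-based learning has stronger convergence guarantees than learning based on action values.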


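1. Policy

Deterministic policy: in each state the policy outputs a single action, i.e. that action has probability 1 and every other action has probability 0. For continuous action spaces, DDPG can be used (the policy network outputs the continuous action value directly).

Stochastic policy: in each state the policy outputs a probability distribution over actions (for example a Bernoulli distribution, such as going left with probability 0.4 and right with probability 0.6, or a Gaussian distribution, which is typically used for continuous action spaces). An action is then obtained by sampling from that distribution.

A policy network that outputs a stochastic policy comes in two forms: one network for discrete actions and one for continuous actions.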
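To make the two kinds of stochastic policy concrete, here is a minimal sketch in plain Python. It assumes the network has already produced its raw outputs: two logits for the discrete case and a mean and standard deviation for the continuous case. The helper names `softmax`, `sample_discrete`, and `sample_gaussian` are illustrative and not part of the original article.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution over discrete actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(logits, rng):
    """Sample an action index from the softmax distribution over the logits."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


def sample_gaussian(mean, std, rng):
    """Sample a continuous action from a Gaussian policy."""
    return rng.gauss(mean, std)


rng = random.Random(0)
# Logits chosen so the probabilities are 0.4 (left) and 0.6 (right), as in the text.
action, probs = sample_discrete([0.0, math.log(1.5)], rng)
cont_action = sample_gaussian(0.0, 1.0, rng)
```

In a real policy network the logits, mean, and standard deviation would be functions of the state; here they are fixed constants only to show the sampling step.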

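2. Policy Gradient

Just as DQN, a value-based method, represents the value function with a function approximator (a neural network), policy-based methods parameterize the policy $\pi (a|s)$ , i.e. approximate it with a function $\pi _{\theta }(a|s)$ . Only differentiable approximators are considered: linear combinations of features or neural networks (in practice usually neural networks). The policy function $\pi _{\theta }(a|s)$ takes a state as input and outputs a probability distribution over actions; the model is trained to find the optimal policy and the optimal value function.

Policy-based methods do not need a value function. They optimize the policy function directly and can learn stochastic policies.

The policy gradient algorithm is the foundation of policy-based methods.

Policy network model: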

Define the objective function:

$J(\theta )=E_{s_{0}}[V^{\pi _{\theta }}(s_{0})]=E_{s_{0}}[G_{0}]=E_{s_{0}}[\sum_{t=0}^{T-1}\gamma ^{t}R(s_{t},a_{t})]$

Here $s_{0}$ is the initial state, and $J(\theta )$ is the expected return obtained by starting from $s_{0}$ and following the policy $\pi _{\theta }(a|s)$ .

Compute the gradient $\bigtriangledown_{\theta } J(\theta )$ of the objective with respect to the parameter $\theta$ , then maximize the objective, $\max_{\theta }J(\theta )$ , by gradient ascent to obtain the optimal policy.

Gradient descent: to minimize an objective $J(\theta )$ , i.e. $\min_{\theta }J(\theta )$ , compute the gradient $\bigtriangledown_{\theta } J(\theta )$ with respect to the parameter $\theta$ and update $\theta =\theta -\alpha\bigtriangledown_{\theta } J(\theta )$ .

Gradient ascent: to maximize an objective $J(\theta )$ , i.e. $\max_{\theta }J(\theta )$ , compute the gradient $\bigtriangledown_{\theta } J(\theta )$ with respect to the parameter $\theta$ and update $\theta =\theta +\alpha\bigtriangledown_{\theta } J(\theta )$ .
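As a small illustration of the gradient ascent update $\theta =\theta +\alpha\bigtriangledown_{\theta } J(\theta )$ , the sketch below maximizes a toy objective. The objective $J(\theta )=-(\theta -3)^{2}$ , the step size, and the iteration count are all assumptions chosen only for demonstration; they have nothing to do with an actual policy.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)


theta = 0.0   # initial parameter
alpha = 0.1   # learning rate
for _ in range(200):
    # Gradient ascent: move along the gradient to increase J.
    theta = theta + alpha * grad_J(theta)
```

After the loop `theta` is very close to 3, the maximizer of the toy objective. Flipping the sign of the update would turn this into gradient descent on the same function.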

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ (Form 1)

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}R(\tau )\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ (Form 2)

Updating the parameter $\theta$ shifts the distribution $p$ (each policy corresponds to one distribution $p$) so that data collected by the updated policy $\pi _{\theta }(a|s)$ scores higher in terms of $Q^{\pi _{\theta }}(s_{t},a_{t})$ or $R(\tau )$ .

Any method that fits this framework is called a policy gradient method.

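2.1. Policy Gradient Derivation 1

2.1.1. State visitation distribution:

Different policies yield different value functions because they visit states with different probabilities.

Let $\nu _{0}(s)$ be the initial state distribution of the MDP, and let $p{_{t}}^{\pi }(s)$ be the probability that the agent is in state $s$ at time $t$ when following policy $\pi (a|s)$ , so that $p{_{0}}^{\pi }(s)=\nu _{0}(s)$ . The state visitation distribution of a policy $\pi (a|s)$ is defined as:

$\nu ^{\pi }(s)=(1-\gamma )\sum_{t=0}^{\infty }\gamma ^{t}p{_{t}}^{\pi }(s)$ (for a finite-length trajectory the sum stops at the last step instead of running to $\infty$ )

Suppose there are three states. At every time step their probabilities sum to 1, i.e. $p{_{t}}^{\pi }(s_{1})+p{_{t}}^{\pi }(s_{2})+p{_{t}}^{\pi }(s_{3})=1$ .

The state visitation distribution $\nu ^{\pi }(s)$ sums, over all time steps, the probability of being in state $s$, with the probability at each step weighted by $\gamma ^{t}$ (the weight shrinks as time increases).

The factor $(1-\gamma )$ normalizes the distribution so that it sums to 1 over all states: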

$\nu ^{\pi }(s_{1})+\nu ^{\pi }(s_{2})+\nu ^{\pi }(s_{3})$

$=(1-\gamma )[\sum_{t=0}^{\infty }\gamma ^{t}p{_{t}}^{\pi }(s_{1})+\sum_{t=0}^{\infty }\gamma ^{t}p{_{t}}^{\pi }(s_{2})+\sum_{t=0}^{\infty }\gamma ^{t}p{_{t}}^{\pi }(s_{3})]$

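$=(1-\gamma )[1+\gamma +\gamma ^{2}+...]$

$=(1-\gamma )\frac{1}{1-\gamma }=1$ (for $\gamma < 1$ )

A property of the state visitation distribution:

$\nu ^{\pi }(s^{'})=(1-\gamma )\nu _{0}(s^{'})+\gamma \int p(s^{'}|s,a)\pi (a|s)\nu ^{\pi }(s)dsda$

In words, the visitation probability of state $s^{'}$ is $(1-\gamma )$ times the probability that $s^{'}$ is the initial state, plus $\gamma$ times the probability of transitioning into $s^{'}$ from a state drawn from the visitation distribution at the previous step. Because $\nu ^{\pi }(s)$ already contains the factor $(1-\gamma )$ , the second term needs no additional $(1-\gamma )$ .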
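The definition and the recursive property can be checked numerically. The sketch below uses a hypothetical two-state chain in which the policy has already been folded into the transition matrix, so `P[s][s2]` stands for $\sum_{a}\pi (a|s)p(s^{'}|s,a)$ . The matrix, the initial distribution, and the truncation at 2000 steps are assumptions made purely for illustration.

```python
gamma = 0.9
nu0 = [1.0, 0.0]           # initial state distribution (always start in state 0)
P = [[0.7, 0.3],           # P[s][s2]: probability of moving from s to s2 under the policy
     [0.4, 0.6]]


def step(p):
    """Propagate the state distribution one time step forward."""
    n = len(p)
    return [sum(p[s] * P[s][s2] for s in range(n)) for s2 in range(n)]


# nu(s) = (1 - gamma) * sum_t gamma^t * p_t(s), truncated after many steps.
nu = [0.0, 0.0]
p_t = nu0[:]
weight = 1.0 - gamma
for t in range(2000):
    nu = [nu[s] + weight * p_t[s] for s in range(2)]
    p_t = step(p_t)
    weight *= gamma

# Right-hand side of the recursive property:
# (1 - gamma) * nu0(s') + gamma * sum_s P(s'|s) * nu(s)
rhs = [(1 - gamma) * nu0[s2] + gamma * sum(P[s][s2] * nu[s] for s in range(2))
       for s2 in range(2)]
```

Here `sum(nu)` comes out at 1 up to truncation error, and `rhs` matches `nu` entry by entry, which is what the normalization argument and the recursive property predict.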

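Occupancy measure:

The occupancy measure of a policy:

$\rho ^{\pi }(s,a)=(1-\gamma )\sum_{t=0}^{\infty }\gamma ^{t}p{_{t}}^{\pi }(s)\pi (a|s)$

$=\nu ^{\pi }(s)\pi (a|s)$

The occupancy measure is the probability that a state-action pair is visited.

Theorem 1: if an agent interacts with the same MDP under policies $\pi _{1}(a|s)$ and $\pi _{2}(a|s)$ , the resulting occupancy measures $\rho ^{\pi _{1}}$ and $\rho ^{\pi _{2}}$ satisfy:

$\rho ^{\pi _{1}}=\rho ^{\pi _{2}}\leftrightarrow \pi _{1}=\pi _{2}$

That is, policies and occupancy measures are in one-to-one correspondence.

Theorem 2: given a valid occupancy measure (one for which there exists a policy whose interaction with the MDP produces exactly these state-action visitation probabilities), the unique policy that generates it is:

$\pi _{\rho }(a|s)=\frac{\rho (s,a)}{\sum_{a^{'}\in A}^{}\rho (s,a^{'})}$

This formula follows from the steps below (written for three actions):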

$\sum_{a\in A}^{}\rho (s,a)=\rho ^{\pi }(s,a_{1})+\rho ^{\pi }(s,a_{2})+\rho ^{\pi }(s,a_{3})$

$=\nu ^{\pi }(s)(\pi (a_{1}|s)+\pi (a_{2}|s)+\pi (a_{3}|s))$

$=\nu ^{\pi }(s)$

$\pi (a|s)=\frac{\rho ^{\pi }(s,a)}{\nu ^{\pi }(s)}=\frac{\rho ^{\pi }(s,a)}{\sum_{a^{'}\in A}^{}\rho ^{\pi }(s,a^{'})}$
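The relation between occupancy measure and policy can be illustrated with a tiny table. In this sketch the visitation distribution `nu` and the policy `pi` are made-up numbers, and `policy_from_occupancy` is a hypothetical helper that applies the normalization from Theorem 2.

```python
# Hypothetical state visitation distribution and policy (illustrative numbers only).
nu = {"s1": 0.25, "s2": 0.75}
pi = {"s1": {"left": 0.4, "right": 0.6},
      "s2": {"left": 0.9, "right": 0.1}}

# Occupancy measure: rho(s, a) = nu(s) * pi(a | s)
rho = {(s, a): nu[s] * pi[s][a] for s in nu for a in pi[s]}


def policy_from_occupancy(rho):
    """Recover pi(a|s) = rho(s, a) / sum_a' rho(s, a')."""
    totals = {}
    for (s, a), value in rho.items():
        totals[s] = totals.get(s, 0.0) + value
    return {(s, a): value / totals[s] for (s, a), value in rho.items()}


recovered = policy_from_occupancy(rho)
```

The recovered probabilities equal the entries of `pi`, and the per-state totals equal `nu`, mirroring the derivation above.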

2.1.2. Derivation:

Objective function:

$J(\theta )=E_{s_{0}}[V^{\pi _{\theta }}(s_{0})]$

The formula for $\bigtriangledown_{\theta } J(\theta )$ :

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

Derivation of $\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[Q^{\pi _{\theta }}(s,a)\bigtriangledown _{\theta }log\pi _{\theta }(a|s)]$ (using the state visitation distribution and $V^{\pi _{\theta }}(s)=\sum_{a\in A}^{}\pi _{\theta }(a|s)Q^{\pi _{\theta }}(s,a)$ ):

1. First obtain $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)$ , the gradient of the state value function $V^{\pi _{\theta }}(s)$ with respect to the parameter $\theta$ :

         $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)=\bigtriangledown _{\theta }(\sum_{a\in A}^{}\pi _{\theta }(a|s)Q^{\pi _{\theta }}(s,a))$

   $=\sum_{a\in A}^{}(\bigtriangledown _{\theta }\pi _{\theta }(a|s)Q^{\pi _{\theta }}(s,a)+\pi _{\theta }(a|s)\bigtriangledown _{\theta }Q^{\pi _{\theta }}(s,a))$

   $=\sum_{a\in A}^{}(\bigtriangledown _{\theta }\pi _{\theta }(a|s)Q^{\pi _{\theta }}(s,a)+\pi _{\theta }(a|s)\bigtriangledown _{\theta }(R(s,a)+$ $\gamma \sum_{s^{'}\in S}^{}p(s^{'}|s,a)V^{\pi _{\theta }}(s^{'})))$

   $=\sum_{a\in A}^{}(\bigtriangledown _{\theta }\pi _{\theta }(a|s)Q^{\pi _{\theta }}(s,a)+\gamma \pi _{\theta }(a|s)\sum_{s^{'}\in S}^{}p(s^{'}|s,a)\bigtriangledown _{\theta }$ $V^{\pi _{\theta }}(s^{'}))$

Let $W(s)=\sum_{a\in A}^{}\bigtriangledown _{\theta }\pi _{\theta }(a|s)Q^{\pi _{\theta }}(s,a)$ , and let $d^{\pi _{\theta }}(s\rightarrow x,k)$ be the probability of reaching state $x$ after $k$ steps when starting from state $s$ and following policy $\pi _{\theta }(a|s)$ (for example $d^{\pi _{\theta }}(s\rightarrow s^{'},1)=\sum_{a\in A}^{}\pi _{\theta }(a|s)p(s^{'}|s,a)$ ). Then:

         $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)=W(s)+\gamma \sum_{a\in A}^{}\pi _{\theta }(a|s)\sum_{s^{'}\in S}^{}p(s^{'}|s,a)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$

   $=W(s)+\gamma \sum_{s^{'}\in S}^{}\sum_{a\in A}^{}\pi _{\theta }(a|s)p(s^{'}|s,a)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$

   $=W(s)+\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$

This relates $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)$ to the gradient at the next state, $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$ . Unrolling the recursion further:

         $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)=W(s)+\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)[W(s^{'})+\gamma \sum_{s^{''}\in S}^{}$ $d^{\pi _{\theta }}(s^{'}\rightarrow s^{''},1)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{''})]$

   $=W(s)+\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)W(s^{'})+\gamma ^{2}\sum_{s^{''}\in S}^{}$ $d^{\pi _{\theta }}(s\rightarrow s^{''},2)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{''})$



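$=\cdots$

$=\sum_{x\in S}^{}\sum_{k=0}^{\infty }\gamma ^{k}d^{\pi _{\theta }}(s\rightarrow x,k)W(x)$

This already shows that the gradient $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)$ depends only on the $W$ terms at each step of a trajectory. (The transition probabilities are not computed explicitly, because the agent ultimately samples them by interacting with the environment, which is the usual approach for an unknown MDP.) Since $\bigtriangledown_{\theta } J(\theta )=\bigtriangledown_{\theta }E_{s_{0}}[V^{\pi _{\theta }}(s_{0})]$ , the gradient can be estimated by sampling $n$ trajectories, summing the per-step terms along each trajectory, and averaging over the $n$ trajectories.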

2. Next, derive $\bigtriangledown_{\theta } J(\theta )$ :

         $\bigtriangledown_{\theta } J(\theta )=\bigtriangledown_{\theta }E_{s_{0}}[V^{\pi _{\theta }}(s_{0})]$

                         $=E_{s_{0}}[\bigtriangledown_{\theta }V^{\pi _{\theta }}(s_{0})]$

                         $=E_{s_{0}}[\sum_{x\in S}^{}\sum_{k=0}^{\infty }\gamma ^{k}d^{\pi _{\theta }}(s_{0}\rightarrow x,k)W(x)]$

Let $\eta (s)=E_{s_{0}}[\sum_{k=0}^{\infty }\gamma ^{k}d^{\pi _{\theta }}(s_{0}\rightarrow s,k)]=\sum_{k=0}^{\infty }\gamma ^{k}p{_{k}}^{\pi _{\theta }}(s)=\frac{\nu ^{\pi _{\theta }}(s)}{1-\gamma }$ , the expected discounted probability of reaching state $s$ in any number of steps when starting from $s_{0}$ and following policy $\pi _{\theta }(a|s)$ .

$\bigtriangledown_{\theta } J(\theta )=\sum_{s\in S}^{}(E_{s_{0}}[\sum_{k=0}^{\infty }\gamma ^{k}d^{\pi _{\theta }}(s_{0}\rightarrow s,k)]W(s))$

$=\sum_{s\in S}^{}\eta (s)W(s)$

$=\sum_{s^{'}\in S}^{}\eta (s^{'})\sum_{s\in S}^{}\frac{\eta (s)}{\sum_{s^{'}\in S}^{}\eta (s^{'})}W(s)$

$=\frac{1}{1-\gamma }\sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)W(s)$ (since $\sum_{s\in S}^{}\eta (s)=\frac{1}{1-\gamma }> 1$ and $\frac{\eta (s)}{\sum_{s^{'}\in S}^{}\eta (s^{'})}=\nu ^{\pi _{\theta }}(s)$ )

$\propto \sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)W(s)$

$=\sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)\sum_{a\in A}^{}Q^{\pi _{\theta }}(s,a)\bigtriangledown _{\theta }\pi _{\theta }(a|s)$

tip: $\bigtriangledown_{\theta } J(\theta )\propto\sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)\sum_{a\in A}^{}(Q^{\pi _{\theta }}(s,a)-b(s))\bigtriangledown _{\theta }\pi _{\theta }(a|s)$ is the gradient formula with a baseline. The baseline $b(s)$ can be any function, even a random variable, as long as it does not depend on the action $a$, because its contribution vanishes:

         $\sum_{a\in A}^{}b(s)\bigtriangledown _{\theta }\pi _{\theta }(a|s)=b(s)\sum_{a\in A}^{}\bigtriangledown _{\theta }\pi _{\theta }(a|s)=b(s)\bigtriangledown1=0$
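The identity above can be verified numerically for a softmax policy. In the sketch below the logits `theta` play the role of the policy parameters for a single state, the gradient of each action probability is approximated by central finite differences, and the baseline value `b` is arbitrary. All names and numbers are assumptions for illustration.

```python
import math


def softmax(theta):
    """Action probabilities of a softmax policy with logits theta."""
    m = max(theta)
    exps = [math.exp(z - m) for z in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_pi(theta, a, eps=1e-6):
    """Finite-difference gradient of pi(a) with respect to every logit."""
    grad = []
    for j in range(len(theta)):
        up = theta[:]
        dn = theta[:]
        up[j] += eps
        dn[j] -= eps
        grad.append((softmax(up)[a] - softmax(dn)[a]) / (2 * eps))
    return grad


theta = [0.2, -1.0, 0.5]
b = 7.0  # arbitrary baseline, independent of the action

# b * sum_a grad pi(a): one entry per parameter, each should be (numerically) zero.
baseline_term = [b * sum(grad_pi(theta, a)[j] for a in range(3)) for j in range(3)]
```

Every entry of `baseline_term` is zero up to floating-point noise, because the action probabilities always sum to 1 and the gradient of a constant is zero.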

Finally, derive the commonly used form of $\bigtriangledown_{\theta } J(\theta )$ into which sampled data can be plugged:

         $\bigtriangledown_{\theta } J(\theta )\propto \sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)\sum_{a\in A}^{}Q^{\pi _{\theta }}(s,a)\bigtriangledown _{\theta }\pi _{\theta }(a|s)$

                         $=\sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)\sum_{a\in A}^{}\pi _{\theta }(a|s)Q^{\pi _{\theta }}(s,a)\frac{\bigtriangledown _{\theta }\pi _{\theta }(a|s)}{\pi _{\theta }(a|s)}$

                         $=\sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)\sum_{a\in A}^{}\pi _{\theta }(a|s)Q^{\pi _{\theta }}(s,a)\bigtriangledown _{\theta }log\pi _{\theta }(a|s)$

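Here $\nu ^{\pi _{\theta }}(s)$ is the state distribution, i.e. the discounted probability of being in state $s$ at each step, and $\nu ^{\pi _{\theta }}(s)\pi _{\theta }(a|s)$ is the probability that the state-action pair is visited. $\nu ^{\pi _{\theta }}(s)$ implicitly covers the whole trajectory, from the initial state $s_{0}$ to termination.

$=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ , i.e. the expected sum of the per-step terms $Q^{\pi _{\theta }}(s,a)\bigtriangledown _{\theta }log\pi _{\theta }(a|s)$ along trajectories that start from $s_{0}$ and follow policy $\pi _{\theta }(a|s)$ for any number of steps.

$\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t}^{i},a_{t}^{i})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}^{i}|s_{t}^{i})$

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

tips: each term $Q^{\pi _{\theta }}(s,a)\bigtriangledown _{\theta }log\pi _{\theta }(a|s)$ is the information from one step after starting at $s_{0}$ and following policy $\pi _{\theta }(a|s)$ .

$J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})log\pi _{\theta }(a_{t}|s_{t})]$ is the surrogate objective whose gradient, with $Q$ held fixed, matches the formula above. For every state of a trajectory in which the action is drawn from the policy, a large $Q^{\pi _{\theta }}(s_{t},a_{t})$ makes the policy increase the probability of that action, moving toward $maxJ(\theta )$ . The size of the gradient is driven mainly by the Q value: the larger Q is, the larger the gradient, so the update steers the policy toward sampling actions that obtain larger Q values.

The gradient is an expectation over data collected by the policy $\pi _{\theta }(a|s)$ , so the policy gradient algorithm is on-policy: the gradient must be computed from data sampled by the current policy $\pi _{\theta }(a|s)$ .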

2.3. Policy Gradient Derivation 2

Objective function:

$J(\theta )=E_{\tau \sim p_{\theta }(\tau )}[R(\tau )]=\sum_{\tau }^{}p(\tau |\theta )R(\tau )\approx \frac{1}{N}\sum_{n=1}^{N}R(\tau ^{n})$

Here $\tau =(s_{0}\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{0},a_{0}),s_{1}\overset{\pi _{\theta }(a|s)}{\rightarrow}...\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{T-1},a_{T-1}),s_{T})$ $\sim (\pi _{\theta }(a|s),p(s_{t+1}|s_{t},a_{t}))$ , $R(\tau )$ is the cumulative reward of one trajectory, and $p(\tau |\theta )=p(s_{0})\prod_{t=0}^{T-1}\pi _{\theta }(a_{t}|s_{t})p(s_{t+1}|s_{t},a_{t})$ .

Expectation formula: $E_{\tau \sim p_{\theta }(\tau )}[R(\tau )]=\sum_{\tau }^{}p(\tau |\theta )R(\tau )=\int p_{\theta }(\tau )R(\tau )d\tau$

The formula for $\bigtriangledown_{\theta } J(\theta )$ :

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}R(\tau )\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

Derivation of $\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}R(\tau )\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ :

         $\bigtriangledown_{\theta } J(\theta )=\bigtriangledown_{\theta }\sum_{\tau }^{}p(\tau |\theta )R(\tau )$

                         $=\sum_{\tau }^{}\bigtriangledown_{\theta }p(\tau |\theta )R(\tau )$

                         $=\sum_{\tau }^{}p(\tau |\theta )R(\tau )\frac{\bigtriangledown_{\theta }p(\tau |\theta )}{p(\tau |\theta )}$

$=\sum_{\tau }^{}p(\tau |\theta )R(\tau )\bigtriangledown_{\theta }logp(\tau |\theta )$ (using $\frac{dlog(f(x))}{dx}=\frac{1}{f(x)}\frac{df(x)}{dx}$ )

                         $=E_{\tau \sim p_{\theta }(\tau )}[R(\tau )\bigtriangledown_{\theta }logp(\tau |\theta )]$

Here $\bigtriangledown_{\theta }logp(\tau |\theta )=\bigtriangledown_{\theta }log(p(s_{0})\prod_{t=0}^{T-1}\pi _{\theta }(a_{t}|s_{t})p(s_{t+1}|s_{t},a_{t}))$

$=\bigtriangledown_{\theta }(logp(s_{0})+\sum_{t=0}^{T-1}log\pi _{\theta }(a_{t}|s_{t})+\sum_{t=0}^{T-1}logp(s_{t+1}|s_{t},a_{t}))$ (the initial state distribution and the transition probabilities do not depend on the policy, i.e. on $\theta$ )

$=\sum_{t=0}^{T-1}\bigtriangledown_{\theta }log\pi _{\theta }(a_{t}|s_{t})$

Therefore $\bigtriangledown_{\theta } J(\theta )=E_{\tau \sim p_{\theta }(\tau )}[R(\tau )\bigtriangledown_{\theta }logp(\tau |\theta )]$

$=E_{\tau \sim p_{\theta }(\tau )}[R(\tau )\sum_{t=0}^{T-1}\bigtriangledown_{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

$\approx \frac{1}{N}\sum_{i=1}^{N}R(\tau ^{i})\sum_{t=0}^{T-1}\bigtriangledown_{\theta }log\pi _{\theta }(a_{t}^{i}|s_{t}^{i})$

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}R(\tau )\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

$J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}R(\tau )log\pi _{\theta }(a_{t}|s_{t})]$ is the corresponding surrogate objective. For a trajectory (in each state one action is drawn from the policy), a large $R(\tau ^{n})$ makes the policy increase the probability of the actions taken in that trajectory, moving toward $maxJ(\theta )$ .
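The sample-average form of the gradient can be tried on the simplest possible case: a one-step problem with two actions, where a trajectory is a single action and $R(\tau )$ is that action's reward. This setup, the rewards, the sample count, and the softmax parameterization are assumptions for illustration. For a softmax policy with logits $\theta$, $\bigtriangledown _{\theta _{j}}log\pi _{\theta }(a)=1[a=j]-\pi _{\theta }(j)$ , and the exact gradient of $J$ is $\pi _{\theta }(j)(R_{j}-J)$ .

```python
import math
import random


def softmax(theta):
    m = max(theta)
    exps = [math.exp(z - m) for z in theta]
    total = sum(exps)
    return [e / total for e in exps]


rewards = [1.0, 3.0]   # R(tau) for each one-step trajectory (hypothetical)
theta = [0.0, 0.0]     # logits of the softmax policy
probs = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) R(a) for a softmax policy.
J = sum(p * r for p, r in zip(probs, rewards))
exact = [probs[j] * (rewards[j] - J) for j in range(2)]

# Monte Carlo estimate: (1/N) * sum_i R(tau_i) * grad log pi(a_i).
rng = random.Random(0)
N = 20000
estimate = [0.0, 0.0]
for _ in range(N):
    a = rng.choices([0, 1], weights=probs, k=1)[0]
    for j in range(2):
        score = (1.0 if j == a else 0.0) - probs[j]  # d log pi(a) / d theta_j
        estimate[j] += rewards[a] * score / N
```

With these numbers the exact gradient is `[-0.5, 0.5]`, and the sampled estimate lands close to it; the remaining gap is the Monte Carlo noise that the baseline discussed later is meant to reduce.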

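Intuition behind the gradient formula:

First, $R(\tau ^{n})$ :

If, in trajectory $\tau ^{n}$ , taking action $a_{t}^{n}$ in state $s_{t}^{n}$ eventually leads to a positive $R(\tau ^{n})$ , then updating $\theta$ increases the probability $\pi _{\theta }(a_{t}^{n}|s_{t}^{n})$ ;

If, in trajectory $\tau ^{n}$ , taking action $a_{t}^{n}$ in state $s_{t}^{n}$ eventually leads to a negative $R(\tau ^{n})$ , then updating $\theta$ decreases the probability $\pi _{\theta }(a_{t}^{n}|s_{t}^{n})$ .

If $R(\tau ^{n})$ were replaced by $r_{t}^{n}$ (the immediate reward for taking action $a_{t}^{n}$ in state $s_{t}^{n}$ ), the update would keep increasing the probability of actions that yield an immediate reward and decrease the probability of actions that yield no immediate reward but a higher total return later. The policy would end up always choosing immediately rewarded actions.

Second, why take the log?

Taking the log divides $\bigtriangledown _{\theta }\pi _{\theta }(a_{t}|s_{t})$ by $\pi _{\theta }(a_{t}|s_{t})$ .

Suppose the agent interacts with the environment many times and a particular state appears in $\tau ^{10}$ , $\tau ^{15}$ , $\tau ^{40}$ , $\tau ^{42}$ , and $\tau ^{50}$ .

In that state, action a is taken in $\tau ^{10}$ and action b is taken in $\tau ^{15}$ , $\tau ^{40}$ , $\tau ^{42}$ , and $\tau ^{50}$ .

$R(\tau ^{10})=3$ , $R(\tau ^{15})=1$ , $R(\tau ^{40})=1$ , $R(\tau ^{42})=1$ , $R(\tau ^{50})=1$

Updating with $\bigtriangledown _{\theta }\pi _{\theta }$ alone would bias the policy toward action b in that state, simply because b appears more often and is therefore updated more often, even though b yields a lower total return than a.

Dividing by the probability acts as a normalization: in $\frac{\bigtriangledown _{\theta }\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta }(a_{t}|s_{t})}$ , frequently sampled actions are divided by a large $\pi _{\theta }(a_{t}|s_{t})$ and rarely sampled actions by a small $\pi _{\theta }(a_{t}|s_{t})$ , so the update is not biased toward actions merely because they are sampled often.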

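Tip 1: Add a baseline

The return $R(\tau ^{n})$ is usually positive.

(1) In the ideal case this causes no problem:

Suppose a state has three actions and all of them yield positive total returns of different sizes, with actions a and c returning more than action b. After the update the probabilities of a and c increase and the probability of b decreases, because the probabilities must sum to 1 (softmax for discrete actions, a Gaussian distribution for continuous actions). Although every action receives a positive gradient, the actions with smaller gradients end up with lower probability and those with larger gradients with higher probability.

(2) In practice there is a problem:

When sampling in some state, only actions b and c may be observed and action a never appears. The update then increases the probabilities of b and c and decreases the probability of a, again because the probabilities must sum to 1, even though a may be a good action.

We therefore want the weights to take both signs. This is done by subtracting a baseline b from $R(\tau ^{n})$ , giving $(R(\tau )-b)$ . A common choice is $b\approx E[R(\tau )]$ (b can be any function, even a random variable, as long as it does not depend on the action a): during training the values of $R(\tau ^{n})$ are recorded and their running average is used as b.

As a result, the probabilities of the actions taken in a trajectory $\tau$ are increased only when $R(\tau ^{n})$ exceeds b and decreased only when $R(\tau ^{n})$ falls below b. This avoids lowering the probability of an action merely because it was not sampled.

The gradient formula becomes: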

$\bigtriangledown_{\theta } J(\theta )=E_{\tau \sim \pi _{\theta }}[\sum_{t=0}^{T-1}(R(\tau )-b)\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

Let $M=\sum_{t=0}^{T-1}R(\tau )\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$ and $N=\sum_{t=0}^{T-1}b\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$ . The baseline leaves the expected gradient unchanged because the expectation of $N$ is zero:

$E_{\tau \sim \pi _{\theta }}[\sum_{t=0}^{T-1}b\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

         $=\sum_{\tau }^{}p(\tau |\theta )\sum_{t=0}^{T-1}b\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$

         $=\int p_{\theta }(\tau )\sum_{t=0}^{T-1}b\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})d\tau$

         $=\int \int \nu _{\theta }(s_{t})\pi _{\theta }(a_{t}|s_{t})b\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})dsda$

         $=b\int \nu _{\theta }(s_{t})ds\int \pi _{\theta }(a_{t}|s_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})da$

         $=b\int \nu _{\theta }(s_{t})ds\int \bigtriangledown _{\theta }\pi _{\theta }(a_{t}|s_{t})da$

$=b\int \nu _{\theta }(s_{t})ds\bigtriangledown _{\theta }\int \pi _{\theta }(a_{t}|s_{t})da$ (since $\int \pi _{\theta }(a_{t}|s_{t})da=1$ )

$=b\int \nu _{\theta }(s_{t})ds\bigtriangledown _{\theta }1=0$

For the variance, the following is a heuristic argument that treats $R(\tau )$ as approximately constant across samples (in general $R(\tau )$ and the score terms are correlated, so this is only an approximation):

$var(\sum_{t=0}^{T-1}(R(\tau )-b)\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))$

$=var(M-N)$

$=var(M)+var(N)-2cov(M,N)$

$\approx R(\tau )^{2}var(\sum_{t=0}^{T-1}\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))+b^{2}var(\sum_{t=0}^{T-1}\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))$ $-2bR(\tau )var(\sum_{t=0}^{T-1}\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))$

$=(R(\tau )-b)^{2}var(\sum_{t=0}^{T-1}\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))$

$< R(\tau )^{2}var(\sum_{t=0}^{T-1}\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))$ (whenever $|R(\tau )-b|< |R(\tau )|$ , e.g. $0< b< 2R(\tau )$ for positive returns)

$\approx var(\sum_{t=0}^{T-1}R(\tau )\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))$

Choosing $b\approx E[R(\tau )]$ makes $(R(\tau )-b)^{2}$ small on average and so reduces the variance as much as possible.

In short, adding a baseline leaves the expectation unchanged and reduces the variance. The baseline b can be fitted with a function or a neural network (targeting $b\approx E[R(\tau )]$ ).
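The effect of the baseline can be computed exactly on a toy case. The sketch below reuses a hypothetical one-step problem with two equally likely actions and rewards 1 and 3, and evaluates the mean and variance of one component of the per-sample gradient term $(R-b)\bigtriangledown _{\theta }log\pi _{\theta }(a)$ for a softmax policy, first with $b=0$ and then with $b=E[R]=2$ . The numbers and the function name `mean_and_var` are assumptions for illustration.

```python
probs = [0.5, 0.5]     # pi(a) for the two actions
rewards = [1.0, 3.0]   # return obtained by each action


def mean_and_var(b, j=1):
    """Exact mean and variance of (R(a) - b) * d log pi(a) / d theta_j under a ~ pi."""
    # For a softmax policy, d log pi(a) / d theta_j = 1[a == j] - pi(j).
    values = [(rewards[a] - b) * ((1.0 if a == j else 0.0) - probs[j])
              for a in range(2)]
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, var


mean_no_b, var_no_b = mean_and_var(0.0)   # no baseline
mean_b, var_b = mean_and_var(2.0)         # baseline b = E[R] = 2
```

Both settings give the same mean (0.5), so the baseline does not change the expected gradient, while the variance drops from 1.0 to 0.0 in this particular example. In realistic problems the reduction is partial rather than complete.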

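Tip 2: Assign Suitable Credit

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}(R(\tau )-b)\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ multiplies every state-action pair of a trajectory by the same $(R(\tau )-b)$ . This is unfair: a high total return does not mean every action in the trajectory was good, and a low total return does not mean every action was bad.

(1) In the ideal case, with enough samples, this is not a problem: the same state-action pair receives different $(R(\tau )-b)$ values in different trajectories, and the differences average out.

(2) In practice, when samples are limited, each state-action pair should be assigned a reasonable share of the credit.

So within a trajectory each state-action pair is multiplied by its own weight, namely the cumulative reward obtained from that pair onward (rewards before it are not counted), $(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b)$ , instead of the shared $(R(\tau )-b)$ . This reflects much better whether each individual action was good.

The gradient formula becomes: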

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b)\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$
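The per-step weight $\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}$ (often called the reward-to-go) can be computed in one backward pass over the trajectory. The helper name `rewards_to_go` and the sample rewards below are illustrative assumptions.

```python
def rewards_to_go(rewards, gamma):
    """Discounted return from each time step onward, computed back to front."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = r_t + gamma * G_{t+1}
        out[t] = g
    return out


returns = rewards_to_go([1.0, 0.0, 2.0], gamma=0.5)
# returns == [1.5, 1.0, 2.0]
```

Subtracting a baseline from each entry of `returns` gives the weights that multiply $\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$ in the formula above.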

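This gradient formula is in fact the one obtained in Derivation 1.

“The formula can also be obtained as follows:

The gradient of the expected reward at an arbitrary time step $t^{'}$ :

$\bigtriangledown_{\theta } E_{\tau }[r_{t^{'}}]=E_{\tau }[r_{t^{'}}\sum_{t=0}^{t^{'}}\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ (the reward at a given time depends on the policy at every earlier time step, so the gradient sums over all of them)

The gradient of the expected total reward:

$\bigtriangledown_{\theta } J(\theta )=\bigtriangledown_{\theta }E_{\tau }[R(\tau )]$

$=E_{\tau }[\sum_{t^{'}=0}^{T-1}r_{t^{'}}\sum_{t=0}^{t^{'}}\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

$=E_{\tau }[\sum_{t=0}^{T-1}\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})\sum_{t^{'}=t}^{T-1}r_{t^{'}}]$

$=E_{\tau }[\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}r_{t^{'}})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ ”

This is exactly the REINFORCE algorithm.

$Q(s_{t},a_{t})=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}$ . A good choice for $b(s_{t})$ is the value function $V(s_{t})=E[r_{t}+r_{t+1}+...+r_{T-1}]$ , so $(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b)$ is in effect the advantage function $A^{\theta }(s,a)$ (with neural networks, $A^{\theta }(s,a)=Q(s,a)-V(s)$ ; V is the expectation of Q, so A takes both positive and negative values). $A^{\theta }(s,a)$ can be provided by a critic (it is the Q-V of Dueling DQN). This leads to the AC algorithms discussed later.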

tip: for the weight $Q(s_{t},a_{t})-b(s_{t})$ , one can show that the variance at each step depends on the covariance (that is, the correlation) between $Q(s_{t},a_{t})$ and $b(s_{t})$ : the higher the correlation, the smaller the variance. Choosing $b(s_{t})$ as an estimate of $V(s_{t})$ , the expectation of $Q(s_{t},a_{t})$ , therefore reduces the variance as much as possible while leaving the expectation unchanged.

With only a policy function and no value function (pure policy learning), $A^{\theta }(s,a)$ can only be computed from collected data via $(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b)$ . This raises the question of importance sampling (see Reinforcement Learning 2):

If the update uses data collected by the same policy $\pi _{\theta }(a|s)$ , use $\bigtriangledown_{\theta } J(\theta )=E_{\tau \sim \pi _{\theta }}[\sum_{t=0}^{T-1}(R(\tau )-b)\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ or $\bigtriangledown_{\theta } J(\theta )=E_{\tau \sim \pi _{\theta }}[\sum_{t=0}^{T-1}A^{\theta }(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ , where $A^{\theta }(s,a)=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b$ , the superscript $\theta$ indicates data from policy $\pi _{\theta }(a|s)$ , and $s_{t},a_{t}\sim \tau$ .

If the data is collected by a different policy $\pi _{\theta ^{'}}(a|s)$ but the parameter being updated is $\theta$ , importance sampling is required: $\bigtriangledown_{\theta } J(\theta )=E_{\tau \sim \pi _{\theta ^{'}}}[\sum_{t=0}^{T-1}\frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'}}(a_{t}|s_{t})}(R(\tau )-b)\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ or $\bigtriangledown_{\theta } J(\theta )=E_{\tau \sim \pi _{\theta ^{'}}}[\sum_{t=0}^{T-1}\frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'}}(a_{t}|s_{t})}A^{\theta ^{'}}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ , where $A^{\theta ^{'}}(s,a)=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b$ , the superscript $\theta^{'}$ indicates data from policy $\pi _{\theta ^{'}}(a|s)$ , and $s_{t},a_{t}\sim \tau$ .

tip: this treats $A^{\theta ^{'}}(s,a)=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b$ computed from data collected by $\pi _{\theta ^{'}}(a|s)$ as approximately equal to $A^{\theta }(s,a)=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b$ computed from data collected by $\pi _{\theta }(a|s)$ .

The off-policy objective can be written as $J^{\theta ^{'}}(\theta )=E_{(s_{t},a_{t})\sim \pi _{\theta ^{'}}}[\sum_{t=0}^{T-1}\frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'}}(a_{t}|s_{t})}(R(\tau )-b)]$ or $J^{\theta ^{'}}(\theta )=E_{(s_{t},a_{t})\sim \pi _{\theta ^{'}}}[\sum_{t=0}^{T-1}\frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'}}(a_{t}|s_{t})}A^{\theta ^{'}}(s_{t},a_{t})]$ . The policy that interacts with the environment is $\pi _{\theta ^{'}}(a|s)$ , while the parameter being updated is $\theta$ of policy $\pi _{\theta }(a|s)$ . After $\pi _{\theta ^{'}}(a|s)$ collects data, compute $A^{\theta ^{'}}(s_{t},a_{t})=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}r_{t^{'}}-b$ and multiply it by $\frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'}}(a_{t}|s_{t})}$ . For the gradient, multiply additionally by $\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$ .
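The role of the ratio $\frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'}}(a_{t}|s_{t})}$ can be seen in a single-state example: reweighting samples from the behavior policy by this ratio reproduces the expectation under the target policy. The two action distributions and the per-action values `f` below are hypothetical numbers chosen for illustration.

```python
pi_new = [0.2, 0.8]    # target policy pi_theta(a|s), the one being updated
pi_old = [0.5, 0.5]    # behavior policy pi_theta'(a|s), the one collecting data
f = [1.0, 3.0]         # any per-action quantity, e.g. an advantage estimate

# Expectation under the target policy, computed directly.
on_policy = sum(p * v for p, v in zip(pi_new, f))

# Same expectation computed under the behavior policy with importance weights.
off_policy = sum(q * (p / q) * v for p, q, v in zip(pi_new, pi_old, f))
```

Both values equal 2.6 here. With sampled data the weighted average only matches in expectation, and its variance grows when the two policies differ a lot.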

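The two derivations approach the problem from different angles: Form 1 starts from the per-step term $Q^{\pi _{\theta }}(s,a)\bigtriangledown _{\theta }log\pi _{\theta }(a|s)$ , while Form 2 starts from the whole-trajectory term $R(\tau )\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$ . They are essentially the same and differ only in how the update is weighted; Tip 2 of Form 2 is exactly Form 1.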

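The update with $R(\tau )\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$ is unbiased but has high variance (because it relies on Monte Carlo sampling). Ways to reduce the variance:

a. The form $Q^{\pi _{\theta }}(s,a)\bigtriangledown _{\theta }log\pi _{\theta }(a|s)$ also estimates Q by Monte Carlo, but its variance is lower than that of the $R(\tau )$ form.

b. Adding a baseline, i.e. Tip 1 of Form 2, reduces the variance.

2.4. REINFORCE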

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

Computing the gradient requires $Q^{\pi _{\theta }}(s,a)$ , which can be estimated in many ways. REINFORCE estimates the Q value with the Monte Carlo method:

$Q^{\pi _{\theta }}(s_{t},a_{t})=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}})$

Gradient:

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}}))\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

The REINFORCE algorithm:

Initialize the policy parameter $\theta$ (i.e. initialize the policy network $\pi _{\theta }(a|s)$ )

For each episode (one trajectory):

Roll out the policy network $\pi _{\theta }(a|s)$ in the environment to collect a trajectory:

$s_{0}\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{0},a_{0}),s_{1}\overset{\pi _{\theta }(a|s)}{\rightarrow}...\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{T-1},a_{T-1}),s_{T}$

Compute the return at every time step of the trajectory: $Q^{\pi _{\theta }}(s_{t},a_{t})=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}})$

Update the parameters of the policy network $\pi _{\theta }(a|s)$ :

$\theta =\theta +\alpha\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}}))\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$

REINFORCE code:

import gym
import torch
import torch.nn.functional as F
import numpy as np


class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)


class REINFORCE:
    def __init__(self, state_dim, hidden_dim, action_dim, learning_rate, gamma,
                 device):
        self.policy_net = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
        self.optimizer = torch.optim.Adam(self.policy_net.parameters(), lr=learning_rate)  # Adam optimizer
        self.gamma = gamma  # discount factor
        self.device = device

    def take_action(self, state):  # sample an action from the policy's action distribution
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.policy_net(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()

    def update(self, transition_dict):
        reward_list = transition_dict['rewards']
        state_list = transition_dict['states']
        action_list = transition_dict['actions']

        G = 0
        self.optimizer.zero_grad()
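        for i in reversed(range(len(reward_list))):  # iterate backward from the last step
            reward = reward_list[i]
            state = torch.tensor([state_list[i]], dtype=torch.float).to(self.device)
            action = torch.tensor([action_list[i]]).view(-1, 1).to(self.device)
            log_prob = torch.log(self.policy_net(state).gather(1, action))
            G = self.gamma * G + reward  # discounted return from step i onward
            loss = -log_prob * G  # per-step loss, negated so that minimizing it performs gradient ascent
            loss.backward()  # backpropagate; gradients accumulate across steps
        self.optimizer.step()  # apply the accumulated gradients in one update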


learning_rate = 1e-3
episodes = 1000
hidden_dim = 128
gamma = 0.98
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

env_name = "CartPole-v0"
env = gym.make(env_name)
env.seed(0)  # classic gym API (gym < 0.26); newer versions seed via env.reset(seed=0)
torch.manual_seed(0)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = REINFORCE(state_dim, hidden_dim, action_dim, learning_rate, gamma, device)

for i in range(episodes):
    transition_dict = {'states': [],'actions': [],'next_states': [],'rewards': [],'dones': []}
    state = env.reset()
    done = False
    while not done:
        action = agent.take_action(state)
        next_state, reward, done, _ = env.step(action)
        transition_dict['states'].append(state)
        transition_dict['actions'].append(action)
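        transition_dict['next_states'].append(next_state)
        transition_dict['rewards'].append(reward)  # needed by update(); without it no step is ever trained on
        transition_dict['dones'].append(done)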
        state = next_state
    agent.update(transition_dict)

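Define the policy network $\pi _{\theta }(a|s)$ : it takes a state as input and outputs the action probability distribution for that state, using softmax() to produce the distribution.

Sampling: a discrete action is sampled from the output of the policy network $\pi _{\theta }(a|s)$ (the action probability distribution).

Because the gradient is $\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}}))\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ , the loss can be written as $J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}}))log\pi _{\theta }(a_{t}|s_{t})]$ . The loss of one trajectory is then $\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}}))log\pi _{\theta }(a_{t}|s_{t})$ , and the loss of one step of a trajectory is $(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}}))log\pi _{\theta }(a_{t}|s_{t})$ . Since this is gradient ascent, the loss is negated before calling loss.backward(), and torch.optim.Adam then performs gradient descent on the negated loss, which amounts to gradient ascent on the objective.

The gym environment 'CartPole-v0' can be used for the experiment.

Summary:

Compared with DQN, REINFORCE is an on-policy algorithm: it must update from data collected by the current policy, so many trajectories have to be sampled.

In REINFORCE the agent interacts with the environment under the current policy, computes the gradient of the policy parameters directly from the sampled trajectories, and updates the policy toward maximizing the expected return.

The objective that REINFORCE optimizes (the expected return of the policy) is exactly the performance of the policy that will finally be used. This is more direct than the objective of value-based reinforcement learning algorithms (typically minimizing the temporal-difference error).

In theory REINFORCE is guaranteed to reach a local optimum. Estimating action values from Monte Carlo trajectories gives an unbiased gradient, but the variance of the gradient estimate is large, which can make training unstable.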

2.5. REINFORCE with Baseline

Initialize the policy parameter $\theta$ (i.e. initialize the policy network $\pi _{\theta }(a|s)$ ) and the state value function $V_{\omega }(s)$

For each episode (one trajectory):

Roll out the policy network $\pi _{\theta }(a|s)$ in the environment to collect a trajectory:

$s_{0}\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{0},a_{0}),s_{1}\overset{\pi _{\theta }(a|s)}{\rightarrow}...\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{T-1},a_{T-1}),s_{T}$

Compute the return at every time step of the trajectory: $Q^{\pi _{\theta }}(s_{t},a_{t})=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}})$

Update the parameters of the state value network $V_{\omega }(s)$ :

$\omega =\omega +\alpha_{\omega }\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}})-V_{\omega }(s_{t}))\bigtriangledown _{\omega }V_{\omega }(s_{t})$ or, equivalently,

$\omega =\omega -\alpha_{\omega }\sum_{t=0}^{T-1}(V_{\omega }(s_{t})-\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}}))\bigtriangledown _{\omega }V_{\omega }(s_{t})$

Update the parameters of the policy network $\pi _{\theta }(a|s)$ :

$\theta =\theta +\alpha_{\theta }\sum_{t=0}^{T-1}(\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}})-V_{\omega }(s_{t}))\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$

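3. Actor-Critic

Adding value function learning on top of policy learning, i.e. learning both the policy function $\pi _{\theta }(a|s)$ (parameter $\theta$ ) and the value function $Q_{\omega }(s,a)$ (parameter $\omega$ ), gives the Actor-Critic algorithm. It is still essentially a policy-based method.

REINFORCE with baseline learns both a policy function and a state value function, but it is not an AC algorithm, because its state value function is used only as a baseline and not as a critic. It is not used for bootstrapping (updating the value estimate of a state from the value estimates of subsequent states); it serves only as a baseline for the state whose estimate is being updated.

The bias introduced by bootstrapping is often worthwhile, because it reduces variance and speeds up learning.

REINFORCE with baseline uses a Monte Carlo estimate plus a baseline. It is unbiased and converges asymptotically to a local optimum, but it learns slowly and its estimates have high variance.

Multi-step methods (multi-step temporal difference) allow the degree of bootstrapping to be chosen flexibly.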

$\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$

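In the policy gradient formula, $\pi _{\theta }(a|s)$ is represented by a neural network, while $Q^{\pi _{\theta }}(s_{t},a_{t})$ can take many forms:

a. REINFORCE uses the Monte Carlo estimate $Q^{\pi _{\theta }}(s_{t},a_{t})=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}})$ of the Q value at each time step. The resulting policy gradient estimate is unbiased but has large variance.

b. Use the total return of a trajectory, $V(s)=\sum_{t^{'}=0}^{T-1}\gamma ^{t^{'}}R(s_{t^{'}},a_{t^{'}})$ , in place of $Q^{\pi _{\theta }}(s_{t},a_{t})$ at every time step of that trajectory.

c. Add a baseline to the Monte Carlo estimate of the per-step Q value, i.e. $Q^{\pi _{\theta }}(s_{t},a_{t})=\sum_{t^{'}=t}^{T-1}\gamma ^{t^{'}-t}R(s_{t^{'}},a_{t^{'}})-b(s_{t})$ , which reduces the variance.

d. Use a neural network $Q_{\omega }(s,a)$ to compute $Q_{\omega }(s_{t},a_{t})$ at each time step. This is AC.

e. Replace $Q^{\pi _{\theta }}(s_{t},a_{t})$ with the advantage function $A_{\omega }(s,a)=Q_{\omega }(s,a)-V_{\omega }(s)$ , i.e. subtract a baseline from $Q_{\omega }(s,a)$ . This is A2C with two networks.

f. Because $Q(s_{t},a_{t})=R(s_{t},a_{t})+\gamma V(s_{t+1})$ , the temporal-difference (TD) term $A_{\omega }(s_{t},a_{t})=R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t})$ can replace $Q^{\pi _{\theta }}(s_{t},a_{t})$ . V is the expectation of Q, so A takes both positive and negative values. This is A2C with a single network.

a, b, and c are all computed from Monte Carlo cumulative rewards, and c is REINFORCE with baseline. d, e, and f estimate Q or V with a neural network that is also trained from the rewards R, but the network, like TD, makes use of an estimate for the next time step. This reduces variance and improves robustness at the cost of some bias.

Method e can use the Dueling DQN network architecture or two separate networks (Q and V), but estimating two quantities is risky because their errors compound. In practice method f, based on $Q(s_{t},a_{t})=R(s_{t},a_{t})+\gamma V(s_{t+1})$ , is generally used, with a single network V.

For method f the value function network only needs to be $V_{\omega }(s)$ . Methods e and f add a baseline on top of the neural network estimate, which reduces the variance.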
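Method f can be sketched without any neural network by assuming the value estimates are already available as plain numbers. The function name `td_advantages`, the `dones` flags (used here to drop the bootstrap term at the end of an episode), and the sample values are assumptions for illustration.

```python
def td_advantages(rewards, values, next_values, dones, gamma):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    advantages = []
    for r, v, nv, done in zip(rewards, values, next_values, dones):
        bootstrap = 0.0 if done else gamma * nv   # no future value after a terminal step
        advantages.append(r + bootstrap - v)
    return advantages


adv = td_advantages(
    rewards=[1.0, 1.0],
    values=[0.5, 2.0],        # V(s_t) from the critic (made-up numbers)
    next_values=[2.0, 9.9],   # V(s_{t+1}); ignored where done is True
    dones=[False, True],
    gamma=0.9,
)
# adv is approximately [2.3, -1.0]
```

A positive entry means the action did better than the critic expected from that state, so its probability is pushed up; a negative entry pushes it down.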

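REINFORCE versus Actor-Critic (how the policy function is updated):

REINFORCE is based on Monte Carlo sampling and can only compute its update after a trajectory ends (because the cumulative reward is needed to estimate Q), which also requires the task to have a finite number of steps.

Actor-Critic can update the policy function after every step and places no limit on the number of steps in the task (similar to TD). Although the formula $\bigtriangledown_{\theta } J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})]$ sums the gradients over all time steps before updating, it is also possible to compute the gradient at each time step, update the parameters immediately, and let the updates accumulate.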

　3.1、AC

AC model: $Q_{\omega }(s,a)$ and $\pi _{\theta }(a|s)$

The Actor in the AC algorithm computes the policy gradient and updates the parameters $\theta$ by gradient ascent:

Objective (loss function) $J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}Q_{\omega }(s_{t},a_{t})log\pi _{\theta }(a_{t}|s_{t})]$

Objective for one episode $J(\theta )=\sum_{t=0}^{T-1}Q_{\omega }(s_{t},a_{t})log\pi _{\theta }(a_{t}|s_{t})$

Gradient for one episode $\bigtriangledown _{\theta }J(\theta )=\sum_{t=0}^{T-1}Q_{\omega }(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$

Update $\theta =\theta +\alpha_{\theta }\sum_{t=0}^{T-1}Q_{\omega }(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$
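For intuition, the Actor update above can be worked out by hand for a softmax policy over logits in a single state, where the gradient of $log\pi _{\theta }(a|s)$ with respect to logit k is $1[k=a]-\pi _{\theta }(k|s)$ . The sketch below assumes that tabular setup; `actor_step` and the numbers are hypothetical and stand in for the network and optimizer.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_step(logits, action, q_value, lr):
    """One gradient-ascent step on q_value * log pi(action) for softmax logits."""
    probs = softmax(logits)
    # d log pi(action) / d logit_k = 1[k == action] - pi(k)
    grad_log = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    return [x + lr * q_value * g for x, g in zip(logits, grad_log)]


logits = [0.0, 0.0]
raised = actor_step(logits, action=1, q_value=2.0, lr=0.1)    # positive weight
lowered = actor_step(logits, action=1, q_value=-2.0, lr=0.1)  # negative weight
```

A positive weight raises the probability of the sampled action and a negative weight lowers it, which is why the sign of the Critic's estimate matters.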

The Critic in the AC algorithm updates the parameters $\omega$ with a temporal-difference method (the discount factor $\gamma$ is written explicitly here, consistent with the TD target used elsewhere in this article):

Loss function $J(\omega )=E_{\pi _{\theta }}[\frac{1}{2}\sum_{t=0}^{T-1}(r+\gamma Q_{\omega ^{-}}(s_{t+1},a_{t+1})-Q_{\omega }(s_{t},a_{t}))^{2}]$

Loss for one episode $J(\omega )=\frac{1}{2}\sum_{t=0}^{T-1}(r+\gamma Q_{\omega ^{-}}(s_{t+1},a_{t+1})-Q_{\omega }(s_{t},a_{t}))^{2}$

Gradient for one episode $\bigtriangledown _{\omega }J(\omega )=-\sum_{t=0}^{T-1}(r+\gamma Q_{\omega ^{-}}(s_{t+1},a_{t+1})-Q_{\omega }(s_{t},a_{t}))\bigtriangledown _{\omega }Q_{\omega }(s_{t},a_{t})$

Update $\omega =\omega - \alpha_{\omega }(-\sum_{t=0}^{T-1}(r+\gamma Q_{\omega ^{-}}(s_{t+1},a_{t+1})-Q_{\omega }(s_{t},a_{t}))\bigtriangledown _{\omega }Q_{\omega }(s_{t},a_{t}))$
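A tabular version makes the Critic update easy to check by hand: when Q is a table, $\bigtriangledown _{\omega }Q_{\omega }(s_{t},a_{t})$ is one-hot, so the update only touches the entry for $(s_{t},a_{t})$ . The sketch below is an assumption-laden illustration (dictionary tables, a discount factor `gamma`, a hypothetical `critic_step` helper), not the article's network code.

```python
def critic_step(q, q_target, transition, gamma, lr):
    """Semi-gradient TD update of one table entry; returns the TD error."""
    s, a, r, s_next, a_next, done = transition
    # The bootstrap term comes from the frozen target table, as with Q_{omega^-}.
    bootstrap = 0.0 if done else gamma * q_target[(s_next, a_next)]
    td_error = r + bootstrap - q[(s, a)]
    q[(s, a)] += lr * td_error
    return td_error


q = {(0, 0): 0.0, (1, 1): 2.0}
q_target = dict(q)  # target table starts as a copy of the online table
delta = critic_step(q, q_target, (0, 0, 1.0, 1, 1, False), gamma=0.9, lr=0.5)
```

Because the target table is not changed by the update, the regression target stays fixed until the target parameters are synchronized again.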

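The policy network $\pi _{\theta }(a|s)$ (Actor) interacts with the environment to collect data. The values computed by the value network $Q_{\omega }(s,a)$ are then used as the weights of the policy gradient to update the parameters $\theta$ of the policy network $\pi _{\theta }(a|s)$ . This update lets the policy $\pi _{\theta }(a|s)$ learn to choose higher-value actions.

The value network $Q_{\omega }(s,a)$ (Critic) learns the value function $Q_{\omega }(s,a)$ from the data collected by the policy network $\pi _{\theta }(a|s)$ (Actor) while interacting with the environment.

The value network $Q_{\omega }(s,a)$ is therefore used to judge which actions are good and which are bad in a given state, helping the policy network move toward the optimal policy.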

AC algorithm flow:

Initialize the policy parameters $\theta$ and the value-function parameters $\omega$ (i.e. initialize the policy network $\pi _{\theta }(a|s)$ and the online value network $Q_{\omega }(s,a)$ ),

        and initialize the target network $Q_{\omega ^{-}}(s,a)$ with the same parameters $\omega ^{-}=\omega$

For each episode (one sequence):

    Interact with the environment using the policy network $\pi _{\theta }(a|s)$ to collect a trajectory:

    $s_{0}\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{0},a_{0}),s_{1}\overset{\pi _{\theta }(a|s)}{\rightarrow}...\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{T-1},a_{T-1}),s_{T}$

    For every time step of the current episode, compute $Q_{\omega }(s_{t},a_{t})$ (used to update $\pi _{\theta }(a|s)$ ) and $(r+\gamma Q_{\omega ^{-}}(s_{t+1},a_{t+1}))$ (used to update $Q_{\omega }(s,a)$ )

    Update the parameters of the policy network $\pi _{\theta }(a|s)$ :

    $\theta =\theta +\alpha_{\theta }\sum_{t=0}^{T-1}Q_{\omega }(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$

    $\theta =\theta -\alpha_{\theta }(-\sum_{t=0}^{T-1}Q_{\omega }(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))$ (when updating by gradient descent with torch.optim.Adam())

    Update the parameters of the value network $Q_{\omega }(s,a)$ :

    $\omega =\omega - \alpha_{\omega }(-\sum_{t=0}^{T-1}(r+\gamma Q_{\omega ^{-}}(s_{t+1},a_{t+1})-Q_{\omega }(s_{t},a_{t}))\bigtriangledown _{\omega }Q_{\omega }(s_{t},a_{t}))$

    Every fixed number of updates:

        Update the parameters of the target network $Q_{\omega ^{-}}(s,a)$ : $\omega ^{-}=\omega$

　3.2、A2C（Advantage AC）

A2C is the AC model with an advantage function, i.e. $A_{\omega }(s_{t},a_{t})=Q(s_{t},a_{t})-V_{\omega }(s_{t})$

A2C model with one network: $A_{\omega }(s_{t},a_{t})=Q(s_{t},a_{t})-V_{\omega }(s_{t})=R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t})$

The Actor in the A2C algorithm computes the policy gradient and updates the parameters $\theta$ by gradient ascent:

Objective (loss function) $J(\theta )=E_{\pi _{\theta }}[\sum_{t=0}^{T-1}(R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t}))log\pi _{\theta }(a_{t}|s_{t})]$

Objective for one episode $J(\theta )=\sum_{t=0}^{T-1}(R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t}))log\pi _{\theta }(a_{t}|s_{t})$

Gradient for one episode $\bigtriangledown _{\theta }J(\theta )=\sum_{t=0}^{T-1}(R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t}))\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$

Update $\theta =\theta +\alpha_{\theta }\sum_{t=0}^{T-1}(R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t}))\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$

The Critic in the A2C algorithm updates the parameters $\omega$ with a temporal-difference method (with the discount factor $\gamma$ written explicitly, matching the advantage above):

Loss function $J(\omega )=E_{\pi _{\theta }}[\frac{1}{2}\sum_{t=0}^{T-1}(r+\gamma V_{\omega ^{-}}(s_{t+1})-V_{\omega }(s_{t}))^{2}]$

Loss for one episode $J(\omega )=\frac{1}{2}\sum_{t=0}^{T-1}(r+\gamma V_{\omega ^{-}}(s_{t+1})-V_{\omega }(s_{t}))^{2}$

Gradient for one episode $\bigtriangledown _{\omega }J(\omega )=-\sum_{t=0}^{T-1}(r+\gamma V_{\omega ^{-}}(s_{t+1})-V_{\omega }(s_{t}))\bigtriangledown _{\omega }V_{\omega }(s_{t})$

Update $\omega =\omega - \alpha_{\omega }(-\sum_{t=0}^{T-1}(r+\gamma V_{\omega ^{-}}(s_{t+1})-V_{\omega }(s_{t}))\bigtriangledown _{\omega }V_{\omega }(s_{t}))$

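The policy network $\pi _{\theta }(a|s)$ (Actor) interacts with the environment to collect data. The values computed by the value network $V_{\omega }(s)$ are then used as the weights of the policy gradient to update the parameters $\theta$ of the policy network $\pi _{\theta }(a|s)$ . This update lets the policy $\pi _{\theta }(a|s)$ learn to choose higher-value actions.

The value network $V_{\omega }(s)$ (Critic) learns the value function $V_{\omega }(s)$ from the data collected by the policy network $\pi _{\theta }(a|s)$ (Actor) while interacting with the environment.

The value network $V_{\omega }(s)$ is therefore used to judge which actions are good and which are bad in a given state, helping the policy network move toward the optimal policy.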

A2C algorithm flow:

Initialize the policy parameters $\theta$ and the value-function parameters $\omega$ (i.e. initialize the policy network $\pi _{\theta }(a|s)$ and the value network $V_{\omega }(s)$ ),

        and initialize the target network $V_{\omega ^{-}}(s)$ with the same parameters $\omega ^{-}=\omega$

For each episode (one sequence):

    Interact with the environment using the policy network $\pi _{\theta }(a|s)$ to collect a trajectory:

    $s_{0}\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{0},a_{0}),s_{1}\overset{\pi _{\theta }(a|s)}{\rightarrow}...\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{T-1},a_{T-1}),s_{T}$

    For every time step of the current episode, compute $A_{\omega }(s_{t},a_{t})$ , i.e. $R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t})$ (used to update $\pi _{\theta }(a|s)$ )

    For every time step of the current episode, compute $(r+\gamma V_{\omega ^{-}}(s_{t+1}))$ (used to update $V_{\omega }(s)$ )

    Update the parameters of the policy network $\pi _{\theta }(a|s)$ :

    $\theta =\theta +\alpha_{\theta }\sum_{t=0}^{T-1}(R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t}))\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$

    $\theta =\theta -\alpha_{\theta }(-\sum_{t=0}^{T-1}(R(s_{t},a_{t})+\gamma V_{\omega }(s_{t+1})-V_{\omega }(s_{t}))\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t}))$ (when updating by gradient descent with torch.optim.Adam())

    Update the parameters of the online value network $V_{\omega }(s)$ :

    $\omega =\omega - \alpha_{\omega }(-\sum_{t=0}^{T-1}(r+\gamma V_{\omega ^{-}}(s_{t+1})-V_{\omega }(s_{t}))\bigtriangledown _{\omega }V_{\omega }(s_{t}))$

    Every fixed number of updates:

        Update the parameters of the target network $V_{\omega ^{-}}(s)$ : $\omega ^{-}=\omega$

A2C code:

import gym
import torch
import torch.nn.functional as F
import numpy as np


class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)


class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)


class A2C:
    def __init__(self, state_dim, hidden_dim, action_dim, actor_lr, critic_lr, gamma, target_update, device):
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)  # policy network
        self.critic = ValueNet(state_dim, hidden_dim).to(device)  # online value network
        self.target_critic = ValueNet(state_dim, hidden_dim).to(device)  # target value network
        self.target_critic.load_state_dict(self.critic.state_dict())  # start the target from the same parameters
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)  # policy network optimizer
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)  # value network optimizer
        self.gamma = gamma

        self.target_update = target_update  # how often the target network is synchronized
        self.count = 0  # counter of update calls
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1, 1).to(self.device)

        td_target = rewards + self.gamma * self.critic(next_states) * (1 - dones)  # TD target from the online critic (for the advantage)
        td_target2 = rewards + self.gamma * self.target_critic(next_states) * (1 - dones)  # TD target from the target critic (for the critic loss)
        td_delta = td_target - self.critic(states)  # TD error, used as the advantage
        log_probs = torch.log(self.actor(states).gather(1, actions))
        actor_loss = torch.mean(-log_probs * td_delta.detach())
        critic_loss = F.mse_loss(self.critic(states), td_target2.detach())
        self.actor_optimizer.zero_grad()
        self.critic_optimizer.zero_grad()
        actor_loss.backward()  # compute the policy network gradients
        critic_loss.backward()  # compute the value network gradients
        self.actor_optimizer.step()  # update the policy network parameters
        self.critic_optimizer.step()  # update the value network parameters

        if self.count % self.target_update == 0:
            self.target_critic.load_state_dict(self.critic.state_dict())  # synchronize the target network
        self.count += 1


actor_lr = 1e-3
critic_lr = 1e-2
num_episodes = 1000
hidden_dim = 128
gamma = 0.98
target_update = 10  # assumed value (not given in the original): sync the target network every 10 updates
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

env_name = "CartPole-v0"
env = gym.make(env_name)
env.seed(0)  # classic gym API (reset returns the state, step returns 4 values)
torch.manual_seed(0)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = A2C(state_dim, hidden_dim, action_dim, actor_lr, critic_lr, gamma, target_update, device)

for i in range(num_episodes):
    transition_dict = {'states': [],'actions': [],'next_states': [],'rewards': [],'dones': []}
    state = env.reset()
    done = False
    while not done:
        action = agent.take_action(state)
        next_state, reward, done, _ = env.step(action)
        transition_dict['states'].append(state)
        transition_dict['actions'].append(action)
        transition_dict['next_states'].append(next_state)
        transition_dict['rewards'].append(reward)  # needed by update() for the TD target
        transition_dict['dones'].append(done)
        state = next_state
    agent.update(transition_dict)  # one update per collected episode

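Define the policy network $\pi _{\theta }(a|s)$ : the input is a state and the output is the action probability distribution in that state, implemented with softmax().

Sampling: a discrete action is sampled from the output of the policy network $\pi _{\theta }(a|s)$ (the action probability distribution).

Define the value network $V_{\omega }(s)$ : the input is a state and the output is the state value of that state.

The gym 'CartPole-v0' environment can be used to run the experiment.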
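The softmax output and the sampling step described above can be reproduced without any deep-learning library. This is only a sketch: `sample_action` is a hypothetical helper, and the logits are chosen so that the two action probabilities are 0.25 and 0.75.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(logits, rng):
    """Draw one discrete action from the softmax distribution over logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [0.0, math.log(3.0)]
probs = softmax(logits)
counts = [0, 0]
for _ in range(10000):
    counts[sample_action(logits, rng)] += 1
```

In the PyTorch code, `torch.distributions.Categorical(probs).sample()` plays the role of `sample_action`.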

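Summary:

The AC model is more stable than REINFORCE and has lower variance.

The value module (Critic) learns state values from the data sampled by the policy module (Actor). It helps the Actor tell good actions from bad ones and thereby guides the Actor's policy updates.

As the Actor is trained, the distribution of the data it generates by interacting with the environment changes, so the Critic must adapt quickly to the new distribution to keep helping the Actor discriminate well.

Compared with AC, A2C adds the advantage function $A_{\omega }(s_{t},a_{t})$ (i.e. a baseline), which reduces variance.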
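The variance claim can be verified exactly on a toy problem. The sketch below assumes a two-action bandit with a softmax policy and deterministic rewards (all numbers are made up for illustration) and computes the mean and variance of the single-sample gradient with respect to the first logit, with and without a baseline.

```python
def grad_stats(probs, rewards, baseline):
    """Exact mean and variance of (r_a - b) * d log pi(a) / d logit_0."""
    samples = [
        (rewards[a] - baseline) * ((1.0 if a == 0 else 0.0) - probs[0])
        for a in range(len(probs))
    ]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var


probs = [0.5, 0.5]
rewards = [10.0, 11.0]
mean_plain, var_plain = grad_stats(probs, rewards, baseline=0.0)
mean_base, var_base = grad_stats(probs, rewards, baseline=10.5)  # baseline = average reward
```

The expected gradient is the same in both cases, so the baseline does not bias the update, but the spread of the individual samples shrinks sharply once the baseline is subtracted.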

　3.4、A3C

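The asynchronous advantage actor-critic algorithm (A3C) adds an asynchronous learning scheme on top of A2C. Because it uses multiple CPUs, it learns very quickly.

A3C has a global network (containing the policy network $\pi _{\theta }(a|s)$ and the value network $V_{\omega }(s)$ ; the two networks can share their first few layers). Suppose the parameters of the global network are $\alpha _{1}$ . Multiple processes are then started, each trained on one CPU. Before it starts working, every process copies the parameters of the global network (clones the global network) and then learns by interacting with the environment. Each process collects data from the environment, computes gradients, and sends those gradients back to the global network to update its parameters.

All processes train in parallel and independently. Each process sends its gradient back as soon as it is computed, so by the time a gradient arrives the global network's parameters may already have been updated by another process.

At first glance this makes A3C look like an off-policy algorithm, but it is on-policy: the Actor and Critic in each process compute gradients from data collected by the current policy interacting with the environment, and no historical data is stored for computing gradients. Training stability comes mainly from parallel exploration.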
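The effect of applying slightly stale gradients to a shared set of parameters can be imitated with a toy problem. The sketch below is not A3C itself: it replaces the actor-critic loss with the quadratic $(x-3)^{2}$ , uses a single scalar parameter, and simulates the workers in one thread so the result is reproducible. All names and numbers are assumptions.

```python
def simulate_async_workers(num_workers, rounds, lr):
    """Workers copy a global parameter, then push gradients one after another."""
    global_x = 0.0
    for _ in range(rounds):
        # Every worker first copies the current global parameter ...
        local_copies = [global_x for _ in range(num_workers)]
        # ... then the gradients arrive in sequence, so all but the first
        # were computed from a copy that is already out of date.
        for local_x in local_copies:
            grad = 2.0 * (local_x - 3.0)  # gradient of (x - 3)^2 at the local copy
            global_x -= lr * grad
    return global_x


final_x = simulate_async_workers(num_workers=4, rounds=50, lr=0.05)
```

With a small enough step size the shared parameter still converges to the minimizer even though most gradients are stale, which is the property asynchronous training relies on.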

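Summary of AC algorithms:

The basic format is one value network plus one policy network (different forms of the value network give different AC algorithms), with the value network judging how good an action is.

They are on-policy algorithms; to use them off-policy, importance sampling must be added.

Why are AC algorithms on-policy?

From the gradient formula $\bigtriangledown_{\theta } J(\theta )\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}Q^{\pi _{\theta }}(s_{t},a_{t})\bigtriangledown _{\theta }log\pi _{\theta }(a_{t}|s_{t})$ we can see that the gradient is "the derivative, with respect to the parameters, of the total return obtained from data sampled by the policy network", and $\frac{1}{N}\sum_{i=1}^{N}$ is a sample estimate over the distribution induced by the policy network $\pi _{\theta }(a|s)$ . For the estimate to stay correct, new data must therefore be sampled after every parameter update before the gradient is computed again.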

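4、Deterministic policy gradient algorithms

　4.1、Pathwise derivative policy gradient

The pathwise derivative policy gradient can be seen as DQN's way of handling continuous actions, or as a special kind of AC algorithm.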

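        From the DQN point of view, there is no general way to handle a continuous action vector, because we cannot find which action vector maximizes the Q value, i.e. $\underset{a\in A}{argmax}Q(s,a)$ . The pathwise derivative policy gradient solves this: a policy network $\pi _{\theta }(a|s)$ (the Actor) takes the state as input and outputs the action or action vector that yields the largest Q value. This is a deterministic policy.

From the AC point of view, the Critic in the earlier algorithms does not give the Actor enough information. It only tells the Actor whether the action taken in this state has a high value, not which action is good (i.e. which action has the highest value). With the pathwise derivative policy gradient, the Critic not only tells the Actor whether an action is good, it also tells the Actor which action to take to obtain a larger value.

        The pathwise derivative policy gradient is used in continuous action spaces, and its policy is deterministic, $a=\pi _{\theta }(a|s)$ , whereas all the earlier algorithms learn a stochastic policy $a\sim \pi _{\theta }(a|s)$ .

The earlier policy-based algorithms are all on-policy. Although the optimization objectives of TRPO and PPO include importance sampling, they only use data from the previous round's policy (so the two policy distributions do not differ much and the error stays small), not data from all past policies.

        The pathwise derivative policy gradient method uses a deterministic policy $a=\pi _{\theta }(a|s)$ and supports off-policy learning (so the target network and experience replay from DQN, as well as the Double DQN trick, can all be used).

        REINFORCE and the AC algorithms are on-policy because the weights used to update the policy parameters must be values under the current policy, not values estimated from data collected by another policy (importance sampling can be used, but it assumes the two policies are close; if their distributions differ too much, the error becomes large).

Why is the pathwise derivative policy gradient an off-policy algorithm?

The policy network $\pi _{\theta }(a|s)$ establishes a one-to-one mapping from state to action a (not a distribution), with the goal of obtaining $\pi _{\theta }(a|s)=\underset{a\in A}{argmax}Q(s,a)$ . So it does not matter if the values are computed from data that was not collected by the current policy; the network can still be updated.

The aim is to train the value network as well as possible, and then use it as the weight for learning the one-to-one mapping from state to action a.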

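Pathwise derivative policy gradient model:

The pathwise derivative policy gradient algorithm learns a Q function (the Critic, whose input is $s$ and $a$ and whose output is $Q_{\omega }(s,a)$ ), and then learns $\pi _{\theta }(a|s)$ (the Actor, whose input is $s$ and whose goal is to output the action that makes $Q_{\omega }(s,a)$ as large as possible, $\pi _{\theta }(a|s)=\underset{a\in A}{argmax}Q(s,a)$ ). The Actor's job is to solve the $\underset{a\in A}{argmax}Q(s,a)$ problem.

During training, the policy $\pi _{\theta }(a|s)$ interacts with the environment to collect data, and the Q network is learned from that data. Once the Q values have been estimated, the Q network is held fixed and only $\pi _{\theta }(a|s)$ is learned, updating the policy parameters. In other words, the Critic is learned first and then the Actor, so that for a given state the Actor outputs an action a that makes the output of the Q function as large as possible.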
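The "fix the Critic, then push the Actor uphill on Q" step can be illustrated with a one-dimensional example. The sketch assumes a known critic $Q(s,a)=-(a-2s)^{2}$ and a linear deterministic actor $a=\theta s$ ; both are made-up stand-ins for the networks, chosen so the best parameter is obviously $\theta =2$ .

```python
def dq_da(s, a):
    """Gradient of the fixed critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (a - 2.0 * s)


def actor_gradient(theta, states):
    """Chain rule: average over states of d pi / d theta * dQ / da at a = pi(s)."""
    total = 0.0
    for s in states:
        a = theta * s      # deterministic action a = pi_theta(s)
        dpi_dtheta = s     # derivative of theta * s with respect to theta
        total += dpi_dtheta * dq_da(s, a)
    return total / len(states)


theta = 0.0
states = [-1.0, 0.5, 2.0]
for _ in range(200):
    theta += 0.1 * actor_gradient(theta, states)  # gradient ascent on Q
```

The Critic's parameters are never touched in this loop; only the Actor moves, guided by the slope of Q with respect to the action.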

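Pathwise derivative policy gradient algorithm flow:

Initialize the policy parameters $\theta$ and the value-function parameters $\omega$ (i.e. initialize the policy network $\pi _{\theta }(a|s)$ and the value network $Q_{\omega }(s,a)$ )

        Initialize the target network $\pi _{\theta ^{-}}(a|s)$ with the same parameters $\theta ^{-}=\theta$ (the Actor also needs a target network because it is used to compute the target value)

        Initialize the target network $Q_{\omega ^{-}}(s,a)$ with the same parameters $\omega ^{-}=\omega$

Initialize the replay buffer

For each episode (one sequence):

    Interact with the environment using the policy network $\pi _{\theta }(a|s)$ to collect a trajectory ( $a=\pi (a|s)$ ):

    $s_{0}\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{0},a_{0}),s_{1}\overset{\pi _{\theta }(a|s)}{\rightarrow}...\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{T-1},a_{T-1}),s_{T}$

    Store each tuple $(s,a,r,s^{'})$ in the replay buffer, then:

        Randomly sample a batch (n tuples $(s,a,r,s^{'})$ ) from the replay buffer

        For each tuple $(s,a,r,s^{'})$ , compute $y_{t}=r_{t}+\gamma Q_{\omega ^{-}}(s_{t+1},a_{t+1}=\pi _{\theta ^{-}}(a|s_{t+1}))$ with the target networks and $Q_{\omega }(s_{t},a_{t})$ with the online network

        Minimize the loss $L=\frac{1}{n}\sum_{i=1}^{n}[Q_{\omega }(s_{t},a_{t})-y_{t}]^{2}$ to update the parameters $\omega$ of the network $Q_{\omega }(s,a)$

        Compute the gradient of the policy function:

        $\bigtriangledown _{\theta }J(\theta )=\frac{1}{N}\sum_{i=1}^{N}\bigtriangledown _{\theta }\pi _{\theta }(a_{t}|s_{t})\bigtriangledown _{a_{t}}Q(s_{t},a_{t})|_{a_{t}=\pi _{\theta }(a_{t}|s_{t})}$

        Update the parameters of the policy network $\pi _{\theta }(a|s)$ :

        $\theta =\theta +\alpha_{\theta }\bigtriangledown _{\theta }J(\theta )$

        $\theta =\theta -\alpha_{\theta }(-\bigtriangledown _{\theta }J(\theta ))$ (when updating by gradient descent with torch.optim.Adam())

        Every fixed number of updates:

            Update the parameters of the target network $Q_{\omega ^{-}}(s,a)$ : $\omega ^{-}=\omega$

            Update the parameters of the target network $\pi _{\theta ^{-}}(a|s)$ : $\theta ^{-}=\theta$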

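Changes compared with DQN:

a、The action $a_{t}$ taken in state $s_{t}$ is no longer decided by $Q_{\omega }(s_{t},a_{t})$ ; it is produced by the online policy network $\pi _{\theta }(a|s)$ .

b、The action $a_{t+1}$ taken in state $s_{t+1}$ is no longer decided by $\underset{a_{t+1}\in A}{argmax}Q_{\omega }(s_{t+1},a_{t+1})$ ; it is decided by the target policy network $\pi _{\theta ^{-}}(a|s)$ .

c、Previously only the Q network was learned; now there is also a policy network, which is learned in order to maximize the value network Q, i.e. to solve the $\underset{a\in A}{argmax}Q(s,a)$ problem.

d、Just as the Q network has both an online network and a target network, the policy network also has an online network and a target network.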

　4.2、DDPG

Deep deterministic policy gradient, i.e. DDPG.

"Deep": neural networks

"Deterministic": deterministic actions

"Policy gradient": a policy network

DDPG is essentially the same as the pathwise derivative policy gradient algorithm, except that DDPG specifically handles deterministic continuous actions, adds random noise to increase exploration, and updates the target network parameters with soft updates.

Objective function:

$J(\theta )=E_{s_{0}}[V^{\pi _{\theta }}(s_{0})]$

Formula for $\bigtriangledown_{\theta } J(\theta )$ :

$\bigtriangledown_{\theta } J(\theta )=E_{s\sim \nu ^{\pi _{\theta }}}[\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}]$

Derivation of $\bigtriangledown_{\theta } J(\theta )=E_{s\sim \nu ^{\pi _{\theta }}}[\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}]$ (it uses the state visitation distribution and $V^{\pi _{\theta }}(s)=Q^{\pi _{\theta }}(s,\pi _{\theta }(a|s))$ , where $\pi _{\theta }(a|s)$ is a deterministic policy):

1、First obtain $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)$ , the derivative of the state-value function $V^{\pi _{\theta }}(s)$ with respect to the parameters $\theta$ :

         $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)=\bigtriangledown _{\theta }Q^{\pi _{\theta }}(s,\pi (a|s))$

   $=\bigtriangledown _{\theta }(R(s,\pi (a|s))+\gamma \sum_{s^{'}\in S}^{}p(s^{'}|s,\pi (a|s))V^{\pi _{\theta }}(s^{'}))$

   $=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}R(s,a)|_{a=\pi _{\theta }(a|s)}+\gamma \sum_{s^{'}\in S}^{}(p(s^{'}|s,\pi (a|s))\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})+$ $\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}p(s^{'}|s,a)|_{a=\pi _{\theta }(a|s)}V^{\pi _{\theta }}(s^{'}))$

   $=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}(R(s,a)+\gamma \sum_{s^{'}\in S}^{}p(s^{'}|s,a)V^{\pi _{\theta }}(s^{'}))|_{a=\pi _{\theta }(a|s)}+$ $\gamma \sum_{s^{'}\in S}^{}p(s^{'}|s,\pi (a|s))\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$

   $=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}+\gamma \sum_{s^{'}\in S}^{}p(s^{'}|s,\pi (a|s))\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$

Let $d^{\pi _{\theta }}(s\rightarrow x,k)$ be the probability of reaching state x after k steps when starting from state $s$ and following policy $\pi _{\theta }(a|s)$ (for example $d^{\pi _{\theta }}(s\rightarrow s^{'},1)=p(s^{'}|s,\pi _{\theta }(a|s))$ ). Then:

         $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}+$ $\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$

This gives the relation between $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)$ and $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$ at the next state. Unrolling it further:

         $\bigtriangledown _{\theta }V^{\pi _{\theta }}(s)=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}+$ $\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{'})$

   $=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}+$ $\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)(\bigtriangledown _{\theta }\pi _{\theta }(a|s^{'})\bigtriangledown _{a}Q^{\pi _{\theta }}(s^{'},a)|_{a=\pi _{\theta }(a|s^{'})}+$ $\gamma \sum_{s^{''}\in S}^{}d^{\pi _{\theta }}(s^{'}\rightarrow s^{''},1)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{''}))$

   $=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}+$ $\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)\bigtriangledown _{\theta }\pi _{\theta }(a|s^{'})\bigtriangledown _{a}Q^{\pi _{\theta }}(s^{'},a)|_{a=\pi _{\theta }(a|s^{'})}+$ $\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)\gamma \sum_{s^{''}\in S}^{}d^{\pi _{\theta }}(s^{'}\rightarrow s^{''},1)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{''})$

   $=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}+$ $\gamma \sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},1)\bigtriangledown _{\theta }\pi _{\theta }(a|s^{'})\bigtriangledown _{a}Q^{\pi _{\theta }}(s^{'},a)|_{a=\pi _{\theta }(a|s^{'})}+$ $\gamma ^{2}\sum_{s^{''}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{''},2)\bigtriangledown _{\theta }V^{\pi _{\theta }}(s^{''})$



   $=\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}+$ $\sum_{t=1}^{\infty }\gamma ^{t}\sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},t)\bigtriangledown _{\theta }\pi _{\theta }(a|s^{'})\bigtriangledown _{a}Q^{\pi _{\theta }}(s^{'},a)|_{a=\pi _{\theta }(a|s^{'})}$

   $=\sum_{t=0}^{\infty }\gamma ^{t}\sum_{s^{'}\in S}^{}d^{\pi _{\theta }}(s\rightarrow s^{'},t)\bigtriangledown _{\theta }\pi _{\theta }(a|s^{'})\bigtriangledown _{a}Q^{\pi _{\theta }}(s^{'},a)|_{a=\pi _{\theta }(a|s^{'})}$

   $=\sum_{s^{'}\in S}^{}\sum_{t=0}^{\infty }\gamma ^{t}d^{\pi _{\theta }}(s\rightarrow s^{'},t)\bigtriangledown _{\theta }\pi _{\theta }(a|s^{'})\bigtriangledown _{a}Q^{\pi _{\theta }}(s^{'},a)|_{a=\pi _{\theta }(a|s^{'})}$

2、Next, derive $\bigtriangledown_{\theta } J(\theta )$ :

         $\bigtriangledown_{\theta } J(\theta )=\bigtriangledown_{\theta }E_{s_{0}}[V^{\pi _{\theta }}(s_{0})]$

                         $=E_{s_{0}}[\bigtriangledown_{\theta }V^{\pi _{\theta }}(s_{0})]$

                         $=E_{s_{0}}[\sum_{s^{'}\in S}^{}\sum_{k=0}^{\infty }\gamma ^{k}d^{\pi _{\theta }}(s_{0}\rightarrow s^{'},k)$ $\bigtriangledown _{\theta }\pi _{\theta }(a|s^{'})\bigtriangledown _{a}Q^{\pi _{\theta }}(s^{'},a)|_{a=\pi _{\theta }(a|s^{'})}]$

Let $\eta (s)=E_{s_{0}}[\sum_{k=0}^{\infty }\gamma ^{k}d^{\pi _{\theta }}(s_{0}\rightarrow s,k)]=E_{s_{0}}[\sum_{k=0}^{\infty }\gamma ^{k}p_{k}^{\pi _{\theta }}(s)]=\frac{\nu ^{\pi _{\theta }}(s)}{1-\gamma }$ , which represents the expected discounted probability of reaching state $s$ in any number of steps when starting from $s_{0}$ and following policy $\pi _{\theta }(a|s)$ .

         $\bigtriangledown_{\theta } J(\theta )=\sum_{s\in S}^{}(E_{s_{0}}[\sum_{k=0}^{\infty }\gamma ^{k}d^{\pi _{\theta }}(s_{0}\rightarrow s,k)]$ $\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)})$

   $=\sum_{s\in S}^{}\eta (s)\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}$

   $=(\sum_{s^{'}\in S}^{}\eta (s^{'}))\sum_{s\in S}^{}\frac{\eta (s)}{\sum_{s^{'}\in S}^{}\eta (s^{'})}\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}$

   $=\sum_{s\in S}^{}\frac{1}{1-\gamma }\nu ^{\pi _{\theta }}(s)\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}$ ( $\sum_{s\in S}^{}\eta (s)=\frac{1}{1-\gamma }> 1$ )

   $\propto \sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}$

Finally, this gives the commonly used form of $\bigtriangledown_{\theta } J(\theta )$ into which sampled data can be plugged:

         $\bigtriangledown_{\theta } J(\theta )\propto \sum_{s\in S}^{}\nu ^{\pi _{\theta }}(s)\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}$

                         $=E_{s\sim \nu ^{\pi _{\theta }}}[\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}]$

                         $\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\bigtriangledown _{\theta }\pi _{\theta }(a|s)\bigtriangledown _{a}Q^{\pi _{\theta }}(s,a)|_{a=\pi _{\theta }(a|s)}$

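        The derivation above is the on-policy form of DDPG ( $s\sim \nu ^{\pi _{\theta }}$ ). In the off-policy form of DDPG, the states are distributed as $s\sim \nu ^{\pi _{\theta ^{'}}}$ : data collected by another policy is fed into the Q network to compute the gradient weights used to update the policy network's parameters.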
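The quantities $\eta (s)$ and $\nu ^{\pi _{\theta }}(s)$ from the derivation can be computed numerically for a small Markov chain. The sketch below assumes a two-state chain whose transition matrix already has the deterministic policy folded in, and truncates the infinite sum at a long horizon. The function name and all numbers are illustrative.

```python
def discounted_visitation(start, transition, gamma, horizon=2000):
    """Return (eta, nu): eta(s) = sum_k gamma^k p_k(s), nu(s) = (1 - gamma) * eta(s)."""
    n = len(start)
    p_k = list(start)  # distribution over states after k steps
    eta = [0.0] * n
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            eta[s] += discount * p_k[s]
        # Advance the state distribution by one step of the chain.
        p_k = [sum(p_k[s] * transition[s][x] for s in range(n)) for x in range(n)]
        discount *= gamma
    nu = [(1.0 - gamma) * e for e in eta]
    return eta, nu


start = [1.0, 0.0]                      # always start in state 0
transition = [[0.9, 0.1], [0.5, 0.5]]   # p(s' | s, pi(s)), hypothetical
eta, nu = discounted_visitation(start, transition, gamma=0.9)
```

Here `eta` sums to $\frac{1}{1-\gamma }$ and `nu` sums to 1, which is exactly the normalization step used in the derivation.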

DDPG algorithm flow:

Initialize the policy parameters $\theta$ and the value-function parameters $\omega$ (i.e. initialize the policy network $\pi _{\theta }(a|s)$ and the value network $Q_{\omega }(s,a)$ )

        Initialize the target network $\pi _{\theta ^{-}}(a|s)$ with the same parameters $\theta ^{-}=\theta$ (the Actor also needs a target network because it is used to compute the target value)

        Initialize the target network $Q_{\omega ^{-}}(s,a)$ with the same parameters $\omega ^{-}=\omega$

Initialize the replay buffer and the random noise $N$

For each episode (one sequence):

    Interact with the environment using the policy network $\pi _{\theta }(a|s)$ to collect a trajectory ( $a=\pi (a|s)+N$ ):

    $s_{0}\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{0},a_{0}),s_{1}\overset{\pi _{\theta }(a|s)}{\rightarrow}...\overset{\pi _{\theta }(a|s)}{\rightarrow}R(s_{T-1},a_{T-1}),s_{T}$

    Store each tuple $(s,a,r,s^{'})$ in the replay buffer, then:

        Randomly sample a batch (n tuples $(s,a,r,s^{'})$ ) from the replay buffer

        For each tuple $(s,a,r,s^{'})$ , compute $y_{t}=r_{t}+\gamma Q_{\omega ^{-}}(s_{t+1},a_{t+1}=\pi _{\theta ^{-}}(a|s_{t+1}))$ with the target networks and $Q_{\omega }(s_{t},a_{t})$ with the online network

        Minimize the loss $L=\frac{1}{n}\sum_{i=1}^{n}[Q_{\omega }(s_{t},a_{t})-y_{t}]^{2}$ to update the parameters $\omega$ of the network $Q_{\omega }(s,a)$

        Compute the gradient of the policy function:

        $\bigtriangledown _{\theta }J(\theta )=\frac{1}{N}\sum_{i=1}^{N}\bigtriangledown _{\theta }\pi _{\theta }(a_{t}|s_{t})\bigtriangledown _{a_{t}}Q(s_{t},a_{t})|_{a_{t}=\pi _{\theta }(a_{t}|s_{t})}$

        Update the parameters of the policy network $\pi _{\theta }(a|s)$ :

        $\theta =\theta +\alpha_{\theta }\bigtriangledown _{\theta }J(\theta )$

        $\theta =\theta -\alpha_{\theta }(-\bigtriangledown _{\theta }J(\theta ))$ (when updating by gradient descent with torch.optim.Adam())

        Update the parameters of the target network $Q_{\omega ^{-}}(s,a)$ :

        $\omega ^{-}=\tau \omega+(1-\tau )\omega ^{-}$

        Update the parameters of the target network $\pi _{\theta ^{-}}(a|s)$ :

        $\theta ^{-}=\tau \theta +(1-\tau )\theta ^{-}$

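DQN explores actions with the $\epsilon -Greedy$ rule. DDPG, whose deterministic policy explores very little on its own, adds noise to the action, $a=\pi (a|s)+N$ , to explore.

When updating the target network parameters, DDPG does not need to copy them once every C steps as DQN does. It can use the soft updates $\omega ^{-}=\tau \omega+(1-\tau )\omega ^{-}$ and $\theta ^{-}=\tau \theta +(1-\tau )\theta ^{-}$ instead.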
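Both ideas above, the soft update and the exploration noise, fit in a few lines of plain Python. In this sketch the parameters are ordinary lists of floats rather than network tensors, and clipping the noisy action to the action bound is an added assumption (the original formula only adds the noise). The helper names are hypothetical.

```python
import random


def soft_update(target, online, tau):
    """Element-wise target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


def noisy_action(action, sigma, bound, rng):
    """Add zero-mean Gaussian noise, then clip to [-bound, bound] (assumed)."""
    noisy = action + rng.gauss(0.0, sigma)
    return max(-bound, min(bound, noisy))


target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.1)

rng = random.Random(0)
actions = [noisy_action(1.9, sigma=0.5, bound=2.0, rng=rng) for _ in range(100)]
```

With a small $\tau$ the target parameters trail the online parameters slowly, which keeps the TD target from shifting abruptly between updates.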

DDPG code:

import collections
import random
import gym
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F


# Experience replay buffer
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*transitions)
        return np.array(state), action, reward, np.array(next_state), done

    def size(self):
        return len(self.buffer)


# Policy network
class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)
        self.action_bound = action_bound  # largest action magnitude the environment accepts

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return torch.tanh(self.fc2(x)) * self.action_bound


# Value network
class QValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim + action_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc_out = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x, a):
        cat = torch.cat([x, a], dim=1)  # concatenate state and action
        x = F.relu(self.fc1(cat))
        x = F.relu(self.fc2(x))
        return self.fc_out(x)


# DDPG algorithm
class DDPG:
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound, sigma, actor_lr, critic_lr, tau, gamma, device):
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim, action_bound).to(device)  # online policy network
        self.critic = QValueNet(state_dim, hidden_dim, action_dim).to(device)  # online value network
        self.target_actor = PolicyNet(state_dim, hidden_dim, action_dim, action_bound).to(device)  # target policy network
        self.target_critic = QValueNet(state_dim, hidden_dim, action_dim).to(device)  # target value network
        self.target_critic.load_state_dict(self.critic.state_dict())  # target value parameters = online value parameters
        self.target_actor.load_state_dict(self.actor.state_dict())  # target policy parameters = online policy parameters
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)  # actor optimizer
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)  # critic optimizer
        self.gamma = gamma
        self.sigma = sigma  # standard deviation of the Gaussian noise (mean fixed at 0)
        self.tau = tau  # soft-update coefficient for the target networks
        self.action_dim = action_dim
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        action = self.actor(state).item()  # .item() assumes a one-dimensional action
        action = action + self.sigma * np.random.randn(self.action_dim)  # add Gaussian exploration noise
        return action

    def soft_update(self, net, target_net):
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions'], dtype=torch.float).view(-1, 1).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1, 1).to(self.device)

        next_q_values = self.target_critic(next_states, self.target_actor(next_states))
        q_targets = rewards + self.gamma * next_q_values * (1 - dones)
        critic_loss = torch.mean(F.mse_loss(self.critic(states, actions), q_targets))
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        actor_loss = -torch.mean(self.critic(states, self.actor(states)))
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        self.soft_update(self.actor, self.target_actor)  # soft-update the target policy network
        self.soft_update(self.critic, self.target_critic)  # soft-update the target value network


actor_lr = 3e-4
critic_lr = 3e-3
num_episodes = 200
hidden_dim = 64
gamma = 0.98
tau = 0.005  # soft-update coefficient
buffer_size = 10000
minimal_size = 1000
batch_size = 64
sigma = 0.01  # standard deviation of the Gaussian noise
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

env_name = 'Pendulum-v0'
env = gym.make(env_name)
random.seed(0)
np.random.seed(0)
env.seed(0)  # classic gym API (reset returns the state, step returns 4 values)
torch.manual_seed(0)
replay_buffer = ReplayBuffer(buffer_size)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_bound = env.action_space.high[0]  # maximum action value
agent = DDPG(state_dim, hidden_dim, action_dim, action_bound, sigma, actor_lr, critic_lr, tau, gamma, device)

for i_episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = agent.take_action(state)
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        state = next_state
        # Start training only once the buffer holds enough transitions.
        if replay_buffer.size() > minimal_size:
            b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
            transition_dict = {
                'states': b_s,
                'actions': b_a,
                'next_states': b_ns,
                'rewards': b_r,  # needed by update() for the TD target
                'dones': b_d}
            agent.update(transition_dict)

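Define the policy network $\pi _{\theta }(a|s)$ : the input is a state and the output is a deterministic continuous action (tanh(x) maps the output to the range $\left [ -1 ,1\right ]$ , which the code then scales by action_bound).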