Introduction to Deep Reinforcement Learning (Policy Gradient, Actor-Critic, PPO)

连理o

Last modified: 2024-09-05 20:36:47

First published: 2024-07-30 10:48:51

Article link: https://blog.csdn.net/weixin_42437114/article/details/140638274

What is RL?
Policy Gradient
Actor-Critic
Proximal Policy Optimization (PPO)
Deep Q Network (DQN)
References

What is RL?

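An RL setup consists of an actor and an environment. The actor chooses an action based on the current state, and the environment returns a reward for that action. The goal of RL is for the actor to learn a policy that maximizes the reward it collects.
This way of learning suits tasks where labels are hard to define. In a video game, for example, it is hard to specify a label telling the actor which action is best at every step, but the reward that follows each action can be read straight off the scoreboard. In such settings RL is far more convenient than conventional supervised learning.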
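The actor/environment loop described above can be sketched in a few lines. This is only an illustration: `CorridorEnv`, its `reset`/`step` methods, and the two example actors are hypothetical names invented here, not part of any library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, reward 1 for reaching position 4."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self.pos  # initial state

    def step(self, action):
        # action: 0 = move left, 1 = move right (positions are clamped to the corridor)
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.pos == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.pos, reward, done


def run_episode(env, actor):
    """Let the actor interact with the environment for one episode; return the total reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = actor(state)                   # actor decides based on the current state
        state, reward, done = env.step(action)  # environment answers with a reward
        total_reward += reward
    return total_reward


def random_actor(state):
    return random.choice([0, 1])


def always_right(state):
    return 1


print(run_episode(CorridorEnv(), always_right))  # 1.0: the goal is reached after 4 steps
```

No label ever tells the actor which move is correct; the only training signal is the reward returned by `step`.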

Policy Gradient

Three Steps in ML

So how do we optimize the actor from rewards? Let us follow the usual three steps of ML:

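(1) Function with Unknown. The function to optimize is the Policy Network. Its input is the current state and its output is a distribution over the next action; the action actually taken is sampled from this distribution (to encourage exploration). In essence it is therefore a classification network.
(2) Define “Loss”. The objective is the total reward (i.e., return) $R$ of one game, where one game is also called one episode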
$R=\sum_{t=1}^Tr_t$
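(3) Optimization. The difficulty in RL is that the environment and the reward are black boxes through which gradients cannot flow, so we cannot directly optimize the actor to maximize the return the way a GAN is trained end to end.
The idea behind Policy Gradient is simple. From the states, actions, and rewards recorded while the actor interacts with the environment, we can estimate whether taking action $\hat a$ in state $s$ was good or bad. If $\hat a$ led to positive feedback, we treat $\hat a$ as the ground-truth label and use the cross-entropy (CE) loss $e$ as the loss function, $L = e$, so that the actor keeps choosing $\hat a$. If it led to negative feedback, we use $L = -e$ so that the actor avoids $\hat a$ in the future. We can further weight each term by the strength of the feedback, which expresses how much we want the action to be taken (or not taken) in that state, giving $L=\sum_{n=1}^N A_n e_n$. For example, suppose the collected data contains $N$ state-action pairs and, judging from the reward sequence, we want the actor to keep taking $\hat a_1,\hat a_3$ but to avoid $\hat a_2,\hat a_N$; the signed strength of the feedback for the $n$-th pair is $A_n$ (Advantage).
The rest of this section is mainly about how to estimate a suitable $A_n$ (Advantage).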
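The advantage-weighted cross-entropy loss can be sketched as follows. As a simplifying assumption, the logits themselves are treated as the trainable parameters (a real Policy Network would backpropagate into its weights), and the helper names `policy_gradient_loss` and `descent_step` are made up for this illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, action):
    """e = -log p(action | state), with the taken action used as the label."""
    return -math.log(softmax(logits)[action])


def policy_gradient_loss(batch_logits, actions, advantages):
    """L = sum_n A_n * e_n over the collected state-action pairs."""
    return sum(adv * cross_entropy(z, a)
               for z, a, adv in zip(batch_logits, actions, advantages))


def descent_step(logits, action, advantage, lr=0.1):
    """One gradient-descent step on A * e with respect to the logits.

    For softmax + CE: d(A * e) / d logit_i = A * (p_i - 1[i == action]).
    """
    probs = softmax(logits)
    return [z - lr * advantage * (p - (1.0 if i == action else 0.0))
            for i, (z, p) in enumerate(zip(logits, probs))]


logits = [0.0, 0.0, 0.0]  # uniform policy over 3 actions
encouraged = softmax(descent_step(logits, action=1, advantage=+2.0))
discouraged = softmax(descent_step(logits, action=1, advantage=-2.0))
print(encouraged[1] > 1 / 3, discouraged[1] < 1 / 3)  # True True
```

A positive advantage raises the probability of the taken action and a negative one lowers it, which is the behavior the weighted loss is meant to produce.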

Estimate Advantage Function

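Version 0 (Short-sighted Version). The simplest choice is to use the immediate reward $r_n$ of each step as the Advantage. An actor trained this way is bound to be short-sighted and can hardly sacrifice short-term gain for long-term benefit. In a shooting game, for instance, it may learn only to fire and never to move left or right, because firing is the only action that directly yields a positive reward.
Version 1 (Cumulated Reward). A simple remedy is to use the cumulated reward $G_t=\sum_{n=t}^Nr_n$ as the Advantage. This accounts for the long-term payoff of an action, but it looks too far ahead: the reward at a given step most likely depends on the actions taken shortly before it and has little to do with actions taken much earlier, yet every earlier action receives full credit for it.
Version 2 (Discounted Cumulated Reward). This too has a simple fix: introduce a discount factor $\gamma<1$ and use the discounted cumulated reward $G_t'=\sum_{n=t}^N\gamma^{n-t}r_n$ as the Advantage.
Version 3 (Relative Reward). One small problem remains: whether a reward is good or bad is relative, so we should subtract a baseline $b$ from $G_t'$. A natural choice of $b$ is the average cumulated reward obtained from the current state over the actions the actor might take, which is exactly the quantity the critic estimates in Actor-Critic, introduced later.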
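The versions above differ only in how the reward sequence is post-processed, which a short sketch makes concrete. The function names are illustrative, and using the batch mean of the returns as the baseline is an assumption made here for simplicity (the text only requires some baseline $b$).

```python
def discounted_rewards(rewards, gamma):
    """Version 2: G'_t = sum_{n >= t} gamma^(n - t) * r_n, accumulated backwards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def cumulated_rewards(rewards):
    """Version 1: G_t = r_t + r_{t+1} + ... + r_N (Version 2 with gamma = 1)."""
    return discounted_rewards(rewards, gamma=1.0)


def relative_rewards(rewards, gamma, baseline=None):
    """Version 3: G'_t - b. If no baseline is given, the mean return is used (an assumption)."""
    returns = discounted_rewards(rewards, gamma)
    if baseline is None:
        baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


rewards = [0.0, 0.0, 1.0]                  # Version 0 would use these directly
print(cumulated_rewards(rewards))          # [1.0, 1.0, 1.0]
print(discounted_rewards(rewards, 0.5))    # [0.25, 0.5, 1.0]
print(relative_rewards(rewards, 0.5))      # early steps negative, last step positive
```

With Version 0 the first two actions would get no credit at all; Version 1 gives all three the same credit; Version 2 gives more credit to the actions closest to the reward.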

Policy Gradient Algorithm

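Note that in policy gradient, the training data collected from 1 episode can be used for only 1 parameter update. Training on data generated by the current model like this is called on-policy (the actor to train and the actor for interacting are the same), and it makes training very inefficient! Can we simply reuse the same batch of interaction data for several updates? Not directly, because the data was generated by earlier parameters and that experience does not necessarily carry over to the current parameters: interactions seen earlier may have become unlikely under the current policy, and interactions the current policy would produce may be missing from the old data (One man’s meat is another man’s poison).
How can this be fixed? PPO, introduced below, takes an off-policy approach to data reuse (the actor to train and the actor for interacting are different): it can train on data collected by an earlier version of the model, which improves sample efficiency and training speed. Because that earlier version must stay close to the current one, PPO is usually still classified as an on-policy method in the literature.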

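Collecting Training Data: Exploration. Note that while the actor interacts with the environment to generate data, some randomness is needed to encourage exploration and produce more diverse interaction data, which helps avoid getting stuck in a local optimum. Options include sampling from the actor's output action distribution, increasing the entropy of the action distribution, and adding random noise to the actor's parameters.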
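The on-policy loop described in this section (collect data with the current actor, compute advantages, update once, collect again) can be sketched on a deliberately tiny problem. Everything specific here is an assumption for illustration: the 2-armed bandit (arm 1 pays 1, arm 0 pays 0), the batch-mean baseline, the hyperparameters, and the function name `train_bandit`.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(iterations=200, samples_per_iter=20, lr=0.5, seed=0):
    """On-policy policy gradient on a hypothetical 2-armed bandit."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # actor parameters theta
    for _ in range(iterations):
        probs = softmax(logits)
        # 1) Collect data with the CURRENT actor; sampling (not argmax) provides exploration.
        actions = [1 if rng.random() < probs[1] else 0 for _ in range(samples_per_iter)]
        rewards = [float(a) for a in actions]  # arm 1 pays 1, arm 0 pays 0
        # 2) Advantage = reward minus a baseline (here the batch mean reward).
        baseline = sum(rewards) / len(rewards)
        advantages = [r - baseline for r in rewards]
        # 3) One gradient-ascent step on the batch mean of A_n * log p(a_n).
        #    For a softmax policy: d log p(a) / d logit_i = 1[i == a] - p_i.
        grad = [0.0, 0.0]
        for a, adv in zip(actions, advantages):
            for i in range(2):
                grad[i] += adv * ((1.0 if i == a else 0.0) - probs[i]) / samples_per_iter
        logits = [z + lr * g for z, g in zip(logits, grad)]
        # 4) The batch is discarded; the next iteration samples from the updated actor.
    return softmax(logits)


final_probs = train_bandit()
print(final_probs)  # the probability of arm 1 should end up close to 1
```

Step 4 is where the inefficiency comes from: every batch is used for exactly one update and then thrown away.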

Actor-Critic

What is Critic?

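In the Actor-Critic architecture, the critic estimates the Value function $V^\theta(s)$: the expected discounted cumulated reward obtainable from state $s$ when following the actor with parameters $\theta$. This is the baseline $b$ used in Version 3 (Relative Reward) of policy gradient to judge how good $G_t'$ is. If $G_t'>V^\theta(s_t)$, the action taken was a comparatively good choice.
$V^\theta(s)$ depends on both the state $s$ and the actor parameters $\theta$ being evaluated. Since the actor and the critic both take $s$ as input, they are usually made to share part of their weights.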

How to estimate $V^\theta(s)$

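(1) Monte-Carlo (MC) based approach. The critic is fitted to the sampled $G_t'$, i.e. $V^\theta(s_t)$ is regressed onto $G_t'$. The drawback is that $G_t'$ is only available once the episode has ended.
(2) Temporal-difference (TD) approach. This uses bootstrapping: since $V^\theta(s_t)=\gamma V^\theta(s_{t+1})+r_t$ holds in expectation, we can fit $V^\theta(s_t)-\gamma V^\theta(s_{t+1})$ to $r_t$, so the critic can be updated without waiting for the episode to end.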
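The two ways of fitting the critic can be compared on a tabular value function (a dictionary from state to value), which is an assumption made to keep the sketch small; a real critic would be a network trained by regression. The function names and the `done` flag, which treats the value after the terminal step as zero, are illustrative.

```python
def mc_update(V, states, returns, lr):
    """Monte-Carlo: move V(s_t) toward the sampled discounted return G'_t.

    The returns are only known after the whole episode has finished.
    """
    for s, g in zip(states, returns):
        V[s] += lr * (g - V[s])


def td_update(V, s, r, s_next, gamma, lr, done=False):
    """TD: move V(s_t) toward r_t + gamma * V(s_{t+1}); usable after every single step."""
    target = r if done else r + gamma * V[s_next]
    V[s] += lr * (target - V[s])


# One episode visiting states 0 -> 1 -> 2 with rewards 0, 0, 1 and gamma = 0.5.
V_mc = {0: 0.0, 1: 0.0, 2: 0.0}
mc_update(V_mc, states=[0, 1, 2], returns=[0.25, 0.5, 1.0], lr=1.0)

V_td = {0: 0.0, 1: 0.0, 2: 0.0}
td_update(V_td, s=2, r=1.0, s_next=None, gamma=0.5, lr=1.0, done=True)
td_update(V_td, s=1, r=0.0, s_next=2, gamma=0.5, lr=1.0)
td_update(V_td, s=0, r=0.0, s_next=1, gamma=0.5, lr=1.0)

print(V_mc)  # {0: 0.25, 1: 0.5, 2: 1.0}
print(V_td)  # {0: 0.25, 1: 0.5, 2: 1.0}
```

On this deterministic toy episode both approaches agree. In general the MC target has higher variance because it sums many random rewards, while the TD target leans on the critic's own (possibly inaccurate) estimate of the next state.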

Estimate Advantage Function

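Version 3.5. Following Version 3 (Relative Reward), replacing the baseline $b$ with $V^\theta(s_t)$ gives Version 3.5, i.e. $A_t=G_t'-V^\theta(s_t)$. This version still has a flaw: $V^\theta(s_t)$ is an expectation, whereas $G_t'$ is a single sample, just one of the possible outcomes after taking action $a_t$. We would rather replace $G_t'$ with the expected return after taking $a_t$.
Version 4 (Advantage Actor-Critic). This can be done with $V^\theta(s)$ itself: the expected return after taking action $a_t$ can be written as $r_t+\gamma V^\theta(s_{t+1})$, that is
$A(s_t,a_t)=r_t+\gamma V^\theta(s_{t+1})-V^\theta(s_t)$ With this form of the advantage function, training samples are also available without running the interaction to the end of the episode.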
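The Version 4 advantage needs only a single transition and the critic's estimates, as this sketch shows. The critic is again assumed to be a plain dictionary, and the `done` flag (value zero after the final step) is an illustrative convention, not something stated in the text.

```python
def td_advantage(V, s, r, s_next, gamma, done=False):
    """A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t), from one transition."""
    next_value = 0.0 if done else V[s_next]
    return r + gamma * next_value - V[s]


V = {"s0": 1.0, "s1": 2.0}  # hypothetical critic estimates
adv = td_advantage(V, s="s0", r=0.5, s_next="s1", gamma=0.9)
print(adv)  # about 1.3: the action led somewhere better than the critic expected from s0
```

A positive value means the action did better than the critic's expectation for $s_t$, so its probability should be increased; a negative value means the opposite.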

Proximal Policy Optimization (PPO)

Why PPO?

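As mentioned earlier, a major problem of on-policy algorithms is that sampled data can update the model parameters only once, so sample efficiency is poor. PPO relaxes this by using importance sampling, an off-policy technique, so that past samples can update the parameters multiple times (use the experience more than once), which improves sample efficiency. In addition, PPO is simple to implement and its hyperparameters are easy to tune.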

From On-policy to Off-policy

Importance Sampling. With importance sampling, we can estimate an expectation under a distribution $p$ using samples drawn from another distribution $q$
$E_{x\sim p}[f(x)]=\int f(x) p(x) d x=\int f(x) \frac{p(x)}{q(x)} q(x) d x=E_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]$
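Issue of Importance Sampling. The two estimators have the same mean but different variances; to keep the variances close, $p$ and $q$ should be as similar as possible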
$\begin{aligned} & \operatorname{Var}_{x \sim p}[f(x)]=E_{x \sim p}\left[f(x)^2\right]-\left(E_{x \sim p}[f(x)]\right)^2 \\ & \begin{aligned} \operatorname{Var}_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right] & =E_{x \sim q}\left[\left(f(x) \frac{p(x)}{q(x)}\right)^2\right]-\left(E_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]\right)^2 \\ & =E_{x \sim p}\left[f(x)^2 \frac{p(x)}{q(x)}\right]-\left(E_{x \sim p}[f(x)]\right)^2 \end{aligned} \end{aligned}$
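Both formulas can be checked exactly on a small discrete example, with no sampling noise. The distributions `p`, `q_close`, `q_far` and the values `f` below are made-up numbers chosen only for illustration.

```python
def expectation(values, probs):
    return sum(v * pr for v, pr in zip(values, probs))


def variance(values, probs):
    mean = expectation(values, probs)
    return expectation([v * v for v in values], probs) - mean ** 2


def weighted(f, p, q):
    """Importance-weighted values f(x) * p(x) / q(x)."""
    return [fx * px / qx for fx, px, qx in zip(f, p, q)]


f = [1.0, 2.0, 3.0]        # f(x) for x in {0, 1, 2}
p = [0.1, 0.2, 0.7]        # target distribution
q_close = [0.2, 0.2, 0.6]  # proposal similar to p
q_far = [0.7, 0.2, 0.1]    # proposal very different from p

mean_p = expectation(f, p)
mean_close = expectation(weighted(f, p, q_close), q_close)
mean_far = expectation(weighted(f, p, q_far), q_far)

var_p = variance(f, p)
var_close = variance(weighted(f, p, q_close), q_close)
var_far = variance(weighted(f, p, q_far), q_far)

print(mean_p, mean_close, mean_far)  # all three agree (about 2.6)
print(var_p, var_close, var_far)     # the variance grows as q moves away from p
```

Here the variances come out to roughly 0.44, 1.44, and 38.15. The estimate stays unbiased for any valid $q$, but with a poorly matched $q$ far more samples are needed before the sample mean becomes reliable.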

In the on-policy setting, the gradient of the objective is
$\begin{aligned} \nabla J(\theta)&=E_{(s_t,a_t)\sim\pi_\theta}[A^\theta(s_t,a_t)\nabla\log p_\theta(a_t|s_t)] \end{aligned}$ Here the data for each update must be sampled from $\pi_\theta$. With importance sampling, the on-policy algorithm can be turned into an off-policy one in which the same batch of interaction data, sampled from $\pi_{\theta'}$, is used for several parameter updates
$\begin{aligned} \nabla J^{\theta'}(\theta)&=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_{\theta}(s_t,a_t)}{p_{\theta'}(s_t,a_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)] \\&=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_{\theta}(a_t|s_t)p_\theta(s_t)}{p_{\theta'}(a_t|s_t)p_{\theta'}(s_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)] \\&\approx E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)\nabla\log p_\theta(a_t|s_t)] \end{aligned}$ The last step assumes that the state distributions under the two policies are similar, $p_\theta(s_t)\approx p_{\theta'}(s_t)$, so their ratio is dropped. Since $\nabla f(x)=f(x)\nabla\log f(x)$, this gradient corresponds to the off-policy objective $J^{\theta'}(\theta)$
$J^{\theta'}(\theta)=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)]$
The remaining question is how to make sure $\pi_\theta$ and $\pi_{\theta'}$ stay close enough, which is the problem PPO addresses.

PPO Algorithm

PPO (Adaptive KL Penalty). A simple idea is to add the constraint $\text{KL}(\theta,\theta')<\delta$ to the objective, which is what Trust Region Policy Optimization (TRPO) does. Note that the KL divergence here constrains the behavior $p_\theta(a_t|s_t)$, not the parameters $\theta$. PPO is more direct: it simply adds the constraint to the objective as a regularization term
$J^{\theta'}_{\text{PPO}}(\theta)=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)]-\beta\text{KL}(\theta,\theta')$ During optimization, $\beta$ is increased if $\text{KL}(\theta,\theta')>\text{KL}_{\max}$ and decreased if $\text{KL}(\theta,\theta')<\text{KL}_{\min}$
PPO (Clip). An even simpler approach is to use a clip function. Let $x=\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}$. When $A > 0$, optimizing $J^{\theta'}(\theta)$ pushes $x$ up; with clipping, the term being optimized becomes $A\cdot\min(x,\text{clip}(x,1-\varepsilon,1+\varepsilon))$, which stops rewarding any increase of $x$ beyond $1+\varepsilon$ and thus keeps $\pi_{\theta}$ close to $\pi_{\theta'}$. When $A < 0$, the term becomes $A\cdot\max(x,\text{clip}(x,1-\varepsilon,1+\varepsilon))$, which stops rewarding any decrease of $x$ below $1-\varepsilon$, again keeping $\pi_{\theta}$ close to $\pi_{\theta'}$
$\begin{aligned} J^{\theta'}_{\text{PPO}}(\theta)=E_{(s_t,a_t)\sim\pi_{\theta'}}\bigg[\min\bigg(&\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t), \\&\text{clip}(\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)},1-\varepsilon,1+\varepsilon)A^{\theta'}(s_t,a_t)\bigg)\bigg] \end{aligned}$
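Both PPO variants reduce to a few lines once the probability ratio and the advantage are known. In this sketch the function names, the doubling/halving factor for $\beta$, the default $\varepsilon=0.2$, and the direction of the KL divergence (old policy first) are assumptions chosen for illustration; the text above only fixes the general rules.

```python
import math


def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) between two categorical action distributions for one state."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def adapt_beta(beta, kl, kl_min, kl_max, factor=2.0):
    """Adaptive KL penalty: raise beta when the policies drift apart, lower it when too close."""
    if kl > kl_max:
        return beta * factor
    if kl < kl_min:
        return beta / factor
    return beta


def clip(x, low, high):
    return max(low, min(x, high))


def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(x * A, clip(x, 1 - eps, 1 + eps) * A) for a single (s_t, a_t) sample."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def ppo_clip_objective(new_probs, old_probs, advantages, eps=0.2):
    """Sample average of the clipped term over data collected with the old policy."""
    terms = [ppo_clip_term(pn / po, adv, eps)
             for pn, po, adv in zip(new_probs, old_probs, advantages)]
    return sum(terms) / len(terms)


for ratio, adv in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, adv, ppo_clip_term(ratio, adv))
```

The four printed cases show the asymmetry: with $A>0$ the term is capped at $(1+\varepsilon)A$ when the ratio is 1.5 but left untouched at 0.5, and with $A<0$ it is capped at $(1-\varepsilon)A$ when the ratio is 0.5 but left untouched at 1.5. Clipping removes the incentive to move further away from the old policy, while a move that has already hurt the objective is still penalized in full.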