A Brief Look at Policy Gradient Algorithms in Reinforcement Learning

止于至玄

Article link: https://blog.csdn.net/philthinker/article/details/81145928

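This article introduces reinforcement learning algorithms based on the policy gradient. We assume the reader is already familiar with the basic principles of reinforcement learning.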

- Policy Gradient Methods
  - REINFORCE
  - Baseline
- Actor-Critic
  - Asynchronous Advantage Actor-Critic (A3C)
  - Advantage Actor-Critic (A2C)

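Gradient-based estimation and optimization appear in many fields, such as convex optimization and machine learning. In reinforcement learning, gradients can be used either to estimate the value function of a policy or to estimate the policy directly. This article discusses only the latter.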

Policy Gradient Methods

The goal of reinforcement learning is to maximize the expected long-term return, so the objective can be written as:

$\pi^{*} = \arg\max_{\pi}E_{\tau \sim\pi(\tau)}[r(\tau)]$

where $\tau$ denotes a trajectory obtained by interacting with the environment and $r(\tau)$ denotes the total return of that trajectory. Let $J(\theta)$ denote the overall objective function. Expanding the expected return gives:

$J(\theta) = E_{\tau \sim\pi_{\theta}(\tau)}[r(\tau)]=\int\pi_{\theta}(\tau)r(\tau)\mathrm{d}\tau$

We want to maximize this objective by adjusting the policy parameters $\theta$ . A natural idea is to differentiate with respect to $\theta$ and then perform gradient ascent:

$\nabla_{\theta} J(\theta) = \int\nabla_{\theta}\pi_{\theta}(\tau)r(\tau)\mathrm{d}\tau$

This expression cannot be estimated directly from samples: the integrand is no longer weighted by $\pi_{\theta}(\tau)$ , so the integral is not an expectation over sampled trajectories. We therefore transform it using the identity

$\nabla_{x}\log y = \frac{1}{y}\nabla_{x}y \quad\text{ i.e. }\quad \nabla_{\theta}\pi_{\theta}(\tau) = \pi_{\theta}(\tau)\nabla_{\theta}\log \pi_{\theta}(\tau)$

which gives

$\nabla_{\theta} J(\theta) = \int\pi_{\theta}(\tau)\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)\mathrm{d}\tau = E_{\tau \sim\pi_{\theta}(\tau)}[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)]$
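The log-derivative identity can be checked numerically on a tiny example. The sketch below is only an illustration under stated assumptions: trajectories are replaced by three discrete outcomes, the distribution is a softmax over logits `theta`, and the helper names (`softmax`, `objective`, `score_function_gradient`, `finite_difference_gradient`) are hypothetical, not from the original article.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def objective(theta, r):
    """J(theta) = sum_x pi_theta(x) r(x), the discrete analogue of the integral."""
    return float(softmax(theta) @ r)

def score_function_gradient(theta, r):
    """E_{x ~ pi_theta}[grad log pi_theta(x) r(x)], summed exactly over x."""
    p = softmax(theta)
    grad = np.zeros_like(theta)
    for x in range(len(theta)):
        grad_log_p = -p.copy()
        grad_log_p[x] += 1.0  # grad of log softmax(theta)[x] is onehot(x) - p
        grad += p[x] * grad_log_p * r[x]
    return grad

def finite_difference_gradient(theta, r, eps=1e-6):
    """Central-difference approximation of grad J(theta)."""
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        d = np.zeros_like(theta)
        d[k] = eps
        grad[k] = (objective(theta + d, r) - objective(theta - d, r)) / (2 * eps)
    return grad

# Assumed toy values: three outcomes with arbitrary returns.
theta = np.array([0.2, -0.5, 1.0])
r = np.array([1.0, 3.0, -2.0])
g_score = score_function_gradient(theta, r)
g_fd = finite_difference_gradient(theta, r)
print(np.allclose(g_score, g_fd, atol=1e-6))
```

In a real problem the sum over outcomes is not available, which is why the expectation form matters: it can be approximated by averaging over sampled trajectories.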
At this point the gradient looks estimable, but it still contains a part that is hard to compute, so we decompose it further. Suppose $\tau = \{ s_{0},a_{0},s_{1},a_{1},\dots,s_{T},a_{T} \}$ . Then:

$\begin{split} \nabla_{\theta}\log\pi_{\theta}(\tau) &= \nabla_{\theta}\log\left[p(s_{0})\prod_{t=0}^{T}\pi_{\theta}(a_{t} | s_{t})p(s_{t+1}|s_{t},a_{t})\right] \\ &= \nabla_{\theta}\left[\log p(s_{0}) + \sum_{t=0}^{T}\log\pi_{\theta}(a_{t}|s_{t}) + \sum_{t=0}^{T}\log p(s_{t+1}|s_{t},a_{t})\right] \\ &= \sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}) \end{split}$

The initial-state distribution and the transition probabilities do not depend on $\theta$ , so their gradients vanish. Substituting this result into the gradient of the objective and approximating the expectation with $N$ sampled trajectories gives:

$\nabla_{\theta} J(\theta) = E_{\tau \sim\pi_{\theta}(\tau)}\left[\left(\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\right)\left(\sum_{t=0}^{T}r(s_{t},a_{t})\right)\right] \approx \frac{1}{N}\sum_{i=1}^{N}\left[\left(\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{i,t}|s_{i,t})\right)\left(\sum_{t=0}^{T}r(s_{i,t},a_{i,t})\right)\right]$

With this estimate of the gradient, we can update the parameters:

$\theta'=\theta+\alpha\widehat{\nabla_{\theta} J(\theta)}$

We can take

$J(\theta)=v_{\pi_{\theta}}(s_{0})$

that is, use the value of the initial state as the objective. This is not the only possible choice.

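From the viewpoint of function estimation, this method resembles maximum likelihood estimation: we want the action distribution produced by the policy model to match the action distribution of the sampled data as closely as possible. Following this idea, suppose the sampled trajectories $\tau_{i}$ form our data set and we model them by maximum likelihood. The objective can then be set up as:

$\theta^{*} = \arg\max_{\theta}\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T}\log\pi_{\theta}(a_{i,t} | s_{i,t})$

Calling the maximized quantity on the right-hand side $J(\theta)$ and differentiating it, we get:

$\nabla_{\theta}J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{i,t} | s_{i,t})$

Compared with the gradient formula obtained earlier, the maximum likelihood gradient lacks only the return term. This difference can be understood from two angles:

- From the policy gradient viewpoint, maximum likelihood ignores the long-term return, or equivalently treats it as the constant 1, so every sample $\tau_{i}$ influences training equally.
- From the maximum likelihood viewpoint, maximum likelihood gives every sample the same weight, whereas the policy gradient method weights each sample by the return of its trajectory. In other words, for samples with positive return we increase their likelihood, and for samples with negative return we decrease it.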

There is another form of the policy gradient derivation, which Richard S. Sutton's Reinforcement Learning: An Introduction calls the policy gradient theorem:

$\nabla_{\theta} J(\theta)=\sum_{s}d_{\pi}(s)\sum_{a}q_{\pi}(s,a)\nabla_{\theta}\pi(a|s,\theta)$

where the gradients in all cases are the column vectors of partial derivatives with respect to the components of $\theta$ , $\pi$ denotes the policy corresponding to weights vector $\theta$ and the distribution $d_\pi$ here is the expected number of time steps $t$ on which $S_{t}=s$ in a randomly generated episode starting in $s_{0}$ and following $\pi$ and the dynamics of the MDP.

Here and in the next subsection we follow Richard S. Sutton's notation. Note that:

$d_{\pi}(s) = \sum_{k=0}^{\infty}\gamma^{k}\Pr(s_{0}\to s,k,\pi)$

where $\Pr(s_{0}\to s,k,\pi)$ is the probability of reaching state $s$ from $s_{0}$ in $k$ steps under $\pi$ . The full derivation is rather involved and is omitted here.

REINFORCE

The policy gradient theorem gives us an expression for the parameter increment, but we do not know an analytic form for every term in it. What we need is a sampling-based way to estimate that expression. Notice that the right-hand side of the policy gradient theorem is a weighted sum over states, where the weight combines how often the state occurs under policy $\pi$ with $\gamma^{k}$ , $k$ being the number of steps taken to reach the state. We can therefore sample with policy $\pi$ and multiply by the discount factor $\gamma^{t}$ ; the expectation of the result estimates the gradient:

$\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi}\left[ \gamma^{t}\sum_{a}q_{\pi}(S_{t},a)\nabla_{\theta}\pi(a|S_{t},\theta) \right]$

Next we treat the sum over actions in the same way. Each term has to be weighted by the probability of that action under policy $\pi$ , so we multiply and divide by $\pi(a|S_{t},\theta)$ :

$\begin{split}\nabla_{\theta}J(\theta)&=\mathbb{E}_{\pi}\left[\gamma^{t}\sum_{a}q_{\pi}(S_{t},a)\pi(a|S_{t},\theta)\frac{\nabla_{\theta}\pi(a|S_{t},\theta)}{\pi(a|S_{t},\theta)}\right]\\ &=\mathbb{E}_{\pi}\left[\gamma^{t}q_{\pi}(S_{t},A_{t})\frac{\nabla_{\theta}\pi(A_{t}|S_{t},\theta)}{\pi(A_{t}|S_{t},\theta)}\right]\\ &=\mathbb{E}_{\pi}\left[\gamma^{t}G_{t}\frac{\nabla_{\theta}\pi(A_{t}|S_{t},\theta)}{\pi(A_{t}|S_{t},\theta)}\right]\end{split}$

The last step uses $\mathbb{E}_{\pi}[G_{t}|S_{t},A_{t}]=q_{\pi}(S_{t},A_{t})$ . We now have an expression that can be estimated from samples, which gives the update

$\theta_{t+1}=\theta_{t}+\alpha\gamma^{t}G_{t}\frac{\nabla_{\theta}\pi(A_{t}|S_{t},\theta)}{\pi(A_{t}|S_{t},\theta)}$

So REINFORCE is a Monte Carlo policy gradient method. The derivation above is written for discrete actions. (The policy can also be represented as a probability distribution over actions in order to generate continuous actions; $\theta$ then holds the parameters of that distribution, such as the mean and variance of a normal distribution.) The size of the update is proportional to the return and inversely proportional to the probability of the selected action. The former moves the parameters in the direction that yields the highest return. The latter prevents actions that are selected frequently, and therefore updated frequently, from winning out even when they do not yield the highest return.

The update can also be written as:

$\theta_{t+1}=\theta_{t}+\alpha\gamma^{t}G_{t}\nabla_{\theta}\log\pi(A_{t}|S_{t},\theta)$

The procedure is as follows:

Input: a differentiable policy parameterization $\pi(a|s,\theta)$ , $\forall a\in\mathcal{A}, s\in\mathcal{S}, \theta\in\mathbb{R}^{n}$
Initialize policy weights $\theta$
Repeat forever:
1. Generate an episode $S_{0}, A_{0}, R_{1}, \dots, S_{T-1}, A_{T-1}, R_{T}$ following $\pi(\cdot|\cdot,\theta)$ .
2. For each step of the episode $t=0,1,\dots,T-1$ :
  1. $G_{t}\leftarrow$ return from step $t$ .
  2. $\theta\leftarrow\theta+\alpha\gamma^{t}G_{t}\nabla_{\theta}\log\pi(A_{t}|S_{t},\theta)$
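The steps above can be sketched in a few lines of NumPy. This is a minimal sketch under several assumptions that are not in the original: a hypothetical four-state corridor environment (`step`) with reward -1 per move, a tabular softmax policy (`policy`), episodes truncated at `MAX_STEPS`, and arbitrarily chosen step size, discount, and episode count.

```python
import numpy as np

N_STATES, GOAL, MAX_STEPS = 4, 3, 50  # assumed toy corridor: states 0..3, goal at 3

def step(state, action):
    """Action 0 moves left (bounded at 0), action 1 moves right; reward is -1 per step."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return next_state, -1.0, next_state == GOAL

def policy(theta, state):
    """Softmax over the tabular action preferences theta[state]."""
    e = np.exp(theta[state] - np.max(theta[state]))
    return e / e.sum()

def generate_episode(theta, rng):
    """Roll out one episode from state 0, stopping at the goal or after MAX_STEPS."""
    states, actions, rewards = [], [], []
    state = 0
    for _ in range(MAX_STEPS):
        action = rng.choice(2, p=policy(theta, state))
        next_state, reward, done = step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards

def reinforce(num_episodes=1000, alpha=0.02, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))
    for _ in range(num_episodes):
        states, actions, rewards = generate_episode(theta, rng)
        # G_t: discounted return from step t, accumulated backwards.
        returns = np.zeros(len(rewards))
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            returns[t] = G
        # theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t, theta)
        for t, (s, a) in enumerate(zip(states, actions)):
            grad_log_pi = -policy(theta, s)
            grad_log_pi[a] += 1.0  # onehot(a) - pi(.|s) for a tabular softmax
            theta[s] += alpha * gamma**t * returns[t] * grad_log_pi
    return theta

theta = reinforce()
right_prob = np.array([policy(theta, s)[1] for s in range(GOAL)])
print(right_prob)  # probability of moving right in states 0, 1, 2
```

With these settings the learned policy is expected to shift probability toward moving right, but the exact numbers depend on the seed and step size, and plain REINFORCE can be noisy because it uses full Monte Carlo returns.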

In the REINFORCE update, the term $\frac{\nabla_{\theta}\pi(A_{t}|S_{t},\theta)}{\pi(A_{t}|S_{t},\theta)}$ , or equivalently $\nabla_{\theta}\log\pi(A_{t}|S_{t},\theta)$ , is the only place where the policy parameterization appears. This term is called the eligibility vector. With a softmax policy over linear action preferences, the eligibility vector can be written as

$\nabla_{\theta}\log\pi(A_{t}|S_{t},\theta)=\phi(S_{t},A_{t})-\sum_{b}\pi(b|S_{t},\theta)\phi(S_{t},b)$
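The closed form of the eligibility vector can be compared against a finite-difference gradient of $\log\pi$ . The sketch below assumes a single state with three actions and four random features per action; the feature matrix `phi_s` and the function names are hypothetical and chosen only for this check.

```python
import numpy as np

def action_probs(theta, phi_s):
    """Softmax over linear preferences; row b of phi_s is the feature vector phi(s, b)."""
    prefs = phi_s @ theta
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def eligibility_vector(theta, phi_s, a):
    """phi(s, a) - sum_b pi(b | s, theta) phi(s, b)."""
    return phi_s[a] - action_probs(theta, phi_s) @ phi_s

def numerical_grad_log_pi(theta, phi_s, a, eps=1e-6):
    """Central-difference gradient of log pi(a | s, theta) with respect to theta."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        d = np.zeros_like(theta)
        d[k] = eps
        up = np.log(action_probs(theta + d, phi_s)[a])
        down = np.log(action_probs(theta - d, phi_s)[a])
        grad[k] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
phi_s = rng.normal(size=(3, 4))  # assumed: 3 actions, 4 features each
theta = rng.normal(size=4)
analytic = eligibility_vector(theta, phi_s, a=1)
numeric = numerical_grad_log_pi(theta, phi_s, a=1)
print(np.allclose(analytic, numeric, atol=1e-5))
```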

Baseline

Our policy gradient formula has a problem: at every time step, the gradient of the log-policy is multiplied by the sum of the rewards over all time steps, which is clearly unreasonable. In principle, the decision at time $t$ can only affect the rewards from time $t$ onward and has nothing to do with earlier rewards, so the formula should be rewritten as:

$\nabla_{\theta}J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T}\left[ \nabla_{\theta}\log\pi_{\theta}(a_{i,t} | s_{i,t})\sum_{t'=t}^{T}r(s_{i,t'},a_{i,t'}) \right]$
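The inner sum $\sum_{t'=t}^{T}r$ is often called the reward-to-go. A small sketch of how it differs from the total-return weight, using an assumed three-step reward sequence and hypothetical helper names:

```python
import numpy as np

def reward_to_go(rewards):
    """Entry t is the sum of rewards[t], rewards[t+1], ..., rewards[T]."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]

def total_return_weights(rewards):
    """Entry t is the total return of the trajectory, the same for every t."""
    rewards = np.asarray(rewards, dtype=float)
    return np.full(rewards.shape, rewards.sum())

rewards = [1.0, 2.0, 3.0]
print(reward_to_go(rewards))          # [6. 5. 3.]
print(total_return_weights(rewards))  # [6. 6. 6.]
```

The action at the last step is weighted by 3 rather than 6, because the rewards earned before it cannot have been caused by it.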

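On the other hand, we know that the total return plays the role of the sample weight in maximum likelihood. This weight causes two problems:

- If the return of a trajectory is numerically large, the corresponding parameter update is large as well, so the optimization may fluctuate, and these fluctuations can hurt the result.
- In some reinforcement learning problems the rewards from the environment are always positive, so the accumulated long-term return is positive no matter how good the decisions are. Every sampled action is then reinforced, which is not what we intended.

To address these problems, we can adjust the value and range of the weight. A simple way is to subtract an offset from the long-term accumulated return. This offset is called the baseline, and we denote it by $b$ :

$\nabla_{\theta}J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T}\left[ \nabla_{\theta}\log\pi_{\theta}(a_{i,t} | s_{i,t})\left(\sum_{t'=t}^{T}r(s_{i,t'},a_{i,t'})-b_{t} \right)\right]$

The baseline can be chosen as the mean, over the sampled trajectories that start from the same initial state, of the long-term return from the same time step onward:

$b_{t} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t'=t}^{T}r(s_{i,t'},a_{i,t'})$

After this adjustment, the weights at each time step average to 0 across trajectories, so actions with positive and negative weights both occur, and the absolute values of the weights shrink. It is also easy to show that a baseline that does not depend on the action leaves the gradient estimate unbiased, since $E_{a\sim\pi_{\theta}(a|s)}[\nabla_{\theta}\log\pi_{\theta}(a|s)\,b]=b\,\nabla_{\theta}\sum_{a}\pi_{\theta}(a|s)=0$ .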
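As a small illustration of the effect of the baseline, the sketch below computes the per-time-step mean reward-to-go over a batch and subtracts it. It assumes three trajectories of equal length with made-up, strictly non-negative rewards; the function names are hypothetical.

```python
import numpy as np

def reward_to_go(rewards):
    """rewards has shape (N, T+1); entry [i, t] becomes the sum of rewards[i, t:]."""
    return np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]

def baseline_weights(rewards):
    """Return (reward-to-go minus baseline, baseline) for a batch of trajectories."""
    rtg = reward_to_go(np.asarray(rewards, dtype=float))
    b = rtg.mean(axis=0)  # mean over trajectories at each time step
    return rtg - b, b

# Assumed toy batch: every reward is non-negative.
rewards = np.array([[1.0, 1.0, 1.0],
                    [2.0, 0.0, 4.0],
                    [0.0, 3.0, 2.0]])
weights, b = baseline_weights(rewards)
print(b)                     # baseline per time step
print(weights.mean(axis=0))  # close to zero at every time step
print(weights)               # contains both positive and negative entries
```

Even though no reward is negative, the centered weights have both signs, so below-average trajectories are pushed down instead of being reinforced.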

With a baseline, the gradient based on the policy gradient theorem can be written as:

$\nabla_{\theta}J(\theta)=\sum_{s}d_{\pi}(s)\sum_{a}(q_{\pi}(s,a)-b(s))\nabla_{\theta}\pi(a|s,\theta)$

The update rule is:

$\theta_{t+1}=\theta_{t}+\alpha(G_{t}-b(S_{t}))\nabla_{\theta}\log\pi(A_{t}|S_{t},\theta)$

where $b(s)$ can be taken to be a learned state-value estimate $\hat{v}(S_{t},w)$ .

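Actor-Critic

In the previous subsection we used a baseline to improve the policy gradient algorithm, but the improved gradient formula still has a problem. In real training we usually need to limit the amount of interaction, and a finite number of interactions may not represent the true expectation over trajectories. Every sampled trajectory differs somewhat from the others, and so do the corresponding returns, so insufficient interaction leads to high variance in the trajectory return. To make the model stable, we can accept some bias in exchange for lower variance. One such approach is the actor-critic algorithm.

Regarding actor-critic methods, Richard S. Sutton writes in his book:

Methods that learn approximations to both policy and value functions are called actor-critic methods. REINFORCE with baseline methods use value functions only as a baseline, not a critic, i.e. not for bootstrapping. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation.

We first look at the one-step actor-critic algorithm, again following Richard S. Sutton's presentation. The main advantage of one-step methods is that they can run fully online, like TD(0), SARSA(0) and Q-learning. The parameter update of the one-step actor-critic algorithm is:

$\theta_{t+1}=\theta_{t}+\alpha(R_{t+1}+\gamma\hat{v}(S_{t+1},w)-\hat{v}(S_{t},w))\nabla_{\theta}\log\pi(A_{t}|S_{t},\theta)$

The detailed steps are as follows: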

Initialize a differentiable policy parameterization $\pi(a|s,\theta), \forall a\in\mathcal{A}, s\in\mathcal{S}, \theta\in\mathbb{R}^{n}$ .
Initialize a differentiable state-value parameterization $\hat{v}(s,w), \forall s\in\mathcal{S}, w\in\mathbb{R}^{m}$ .
Set step sizes $\alpha >0, \beta>0$ .
Repeat forever:
1. Initialize the first state of episode $S$
2. $I\leftarrow 1$
3. While S is not terminal:
  1. Take action $A\sim\pi(\cdot|S,\theta)$ , observe $S', R$
  2. $\delta\leftarrow R+\gamma\hat{v}(S',w)-\hat{v}(S,w)$ , (if $S'$ is terminal, $\hat{v}(S',w)=0$ )
  3. $w\leftarrow w+\beta\delta\nabla_{w}\hat{v}(S,w)$
  4. $\theta\leftarrow \theta+\alpha I\delta\nabla_{\theta}\log\pi(A|S,\theta)$
  5. $I\leftarrow \gamma I$
  6. $S\leftarrow S'$
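The loop above can be sketched with tabular function approximation. This is a minimal sketch under assumptions not found in the original: a hypothetical four-state corridor (`step`) with reward -1 per move, a tabular softmax actor `theta`, a tabular critic `w` with $\hat{v}(s,w)=w[s]$ , episodes truncated at `MAX_STEPS`, and arbitrary step sizes.

```python
import numpy as np

N_STATES, GOAL, MAX_STEPS = 4, 3, 50  # assumed toy corridor: states 0..3, goal at 3

def step(state, action):
    """Action 0 moves left (bounded at 0), action 1 moves right; reward is -1 per step."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return next_state, -1.0, next_state == GOAL

def policy(theta, state):
    """Softmax over the tabular action preferences theta[state]."""
    e = np.exp(theta[state] - np.max(theta[state]))
    return e / e.sum()

def one_step_actor_critic(num_episodes=1000, alpha=0.1, beta=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))  # actor parameters
    w = np.zeros(N_STATES)           # critic parameters: v_hat(s, w) = w[s]
    for _ in range(num_episodes):
        state, I = 0, 1.0
        for _ in range(MAX_STEPS):
            probs = policy(theta, state)
            action = rng.choice(2, p=probs)
            next_state, reward, done = step(state, action)
            # TD error; the value of a terminal state is defined as 0.
            v_next = 0.0 if done else w[next_state]
            delta = reward + gamma * v_next - w[state]
            # Critic update: grad_w v_hat(S, w) is one-hot at S for a table.
            w[state] += beta * delta
            # Actor update: grad log pi is onehot(A) - pi(.|S) for a tabular softmax.
            grad_log_pi = -probs
            grad_log_pi[action] += 1.0
            theta[state] += alpha * I * delta * grad_log_pi
            I *= gamma
            state = next_state
            if done:
                break
    return theta, w

theta, w = one_step_actor_critic()
print(w)                                            # learned state values
print([policy(theta, s)[1] for s in range(GOAL)])   # probability of moving right
```

Unlike REINFORCE, every update here uses only one transition, so learning can proceed online before the episode ends. The price is the bias introduced by bootstrapping from `w`.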
The actor-critic algorithm with eligibility traces proceeds as follows:
1. Initialize $\pi(a|s,\theta), \forall a\in\mathcal{A}, s\in\mathcal{S}, \theta\in\mathbb{R}^{n}$
2. Initialize $\hat{v}(s,w), \forall s\in\mathcal{S}, w\in\mathbb{R}^{m}$
3. Repeat forever:
  1. Initialize the first state of episode $S$
  2. $e^{\theta}\leftarrow 0$ , $e^{w}\leftarrow 0$ .
  3. $I\leftarrow 1$
  4. While S is not terminal:
    1. Take action $A\sim\pi(\cdot|S,\theta)$ , observe $S', R$
    2. $\delta\leftarrow R+\gamma\hat{v}(S',w)-\hat{v}(S,w)$ , (if $S'$ is terminal, $\hat{v}(S',w)=0$ )
    3. $e^{w}\leftarrow \lambda^{w}e^{w}+I\nabla_{w}\hat{v}(S,w)$
    4. $e^{\theta}\leftarrow \lambda^{\theta}e^{\theta}+I\nabla_{\theta}\log\pi(A|S,\theta)$
    5. $w\leftarrow w+\beta\delta e^{w}$
    6. $\theta\leftarrow \theta+\alpha\delta e^{\theta}$
    7. $I\leftarrow \gamma I$
    8. $S\leftarrow S'$
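Asynchronous Advantage Actor-Critic (A3C)

A3C is an asynchronous optimization method. Its main feature is asynchrony, that is, parallel interaction sampling and training. Both the policy gradient method and the actor-critic method update the policy through the gradient of the objective, and computing that gradient depends on the current policy model. Each time we compute the gradient, we therefore have to interact with the environment again using the latest policy model, collect trajectory samples, and use those samples to compute the gradient. Afterwards we discard the used samples and sample again. This training scheme is called on-policy learning. In deep reinforcement learning methods that use a replay buffer, a window of past interaction samples is stored, and the samples used for learning were not necessarily generated by the current model. This is called off-policy learning.

For on-policy learning, every model update needs a certain amount of new samples, so to collect samples faster we collect them in parallel. We start N threads that interact with the environment at the same time. As long as the environment settings differ between threads, the trajectories they produce will not be identical, which makes the samples more informative. After collecting its samples, each thread independently computes its parameter update and applies it asynchronously to the global model parameters. A3C uses a multi-step return estimate, so the advantage term becomes:

$\sum_{i=1}^{n}\gamma^{i-1}r_{t+i}+\gamma^{n}v(s_{t+n})-v(s_{t})$

To encourage exploration, the entropy of the policy is added to the objective:

$\nabla_{\theta}J(\theta)=\frac{1}{T}\sum_{t=0}^{T}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\left(\sum_{i=1}^{n}\gamma^{i-1}r_{t+i}+\gamma^{n}v(s_{t+n})-v(s_{t})\right)+\beta\nabla_{\theta}H(\pi_{\theta}(\cdot|s_{t}))\right]$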
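The two quantities in this objective, the n-step advantage and the policy entropy, are easy to compute for a single time step. The sketch below assumes the n-step advantage includes the $\gamma^{n}$ discount on the bootstrap value and that the policy is a discrete distribution; the function names and the numbers are hypothetical.

```python
import numpy as np

def n_step_advantage(rewards, v_start, v_end, gamma):
    """rewards = [r_{t+1}, ..., r_{t+n}], v_start = v(s_t), v_end = v(s_{t+n})."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)  # gamma^0, ..., gamma^(n-1)
    return float(discounts @ rewards + gamma**n * v_end - v_start)

def entropy(probs):
    """Shannon entropy of a discrete action distribution, skipping zero entries."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

adv = n_step_advantage([1.0, 1.0], v_start=0.5, v_end=2.0, gamma=0.5)
print(adv)                    # 1 + 0.5*1 + 0.25*2 - 0.5 = 1.5
print(entropy([0.25] * 4))    # log(4) for a uniform policy over 4 actions
print(entropy([1.0, 0.0]))    # 0 for a deterministic policy
```

A uniform policy has the largest entropy and a deterministic one has zero, so adding $\beta H$ to the objective penalizes policies that collapse onto a single action too early.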
  
The procedure executed by a single A3C thread is as follows:
1. $T \leftarrow 0$
2. Initialize policy parameter $\theta$
3. Initialize value parameter $\omega$
4. Repeat until $T>T_{max}$
  1. $d\theta \leftarrow 0, d\omega \leftarrow 0$ .
  2. Synchronize model parameters $\theta' \leftarrow \theta, \omega' \leftarrow \omega$ .
  3. Sample $n$ steps with policy $\pi_{\theta'}(a_{t}|s_{t})$ to collect $\{s_{0},a_{0},r_{0},\dots,s_{n-1},a_{n-1},r_{n-1},s_{n}\}$
  4. $T \leftarrow T+n$
  5. $R = V_{\omega'}(s_{n})$ for non-terminal $s_{n}$ , $R = 0$ for terminal $s_{n}$ .
  6. for $i\in\{n-1,\dots,0\}$ do
    1. $R \leftarrow r_{i} + \gamma R$
    2. $d\theta \leftarrow d\theta + \nabla_{\theta'}\log\pi_{\theta'}(a_{i} | s_{i})(R-V_{\omega'}(s_{i}))$
    3. $d\omega \leftarrow d\omega + \nabla_{\omega'}(R-V_{\omega'}(s_{i}))^{2}$
  7. end for
  8. Asynchronously update $\theta$ using $d\theta$ and $\omega$ using $d\omega$ .
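The backward loop in step 6 can be isolated as a small function. The sketch below only computes the bootstrapped returns $R$ and the advantages $R-V(s_{i})$ for one rollout; it does not implement the gradient accumulation or the asynchronous update. The function name, the rollout values, and the discount factor are assumptions made for illustration.

```python
import numpy as np

def rollout_targets(rewards, values, bootstrap_value, gamma):
    """rewards[i] and values[i] = V(s_i) for i = 0..n-1.

    bootstrap_value is V(s_n) for a non-terminal s_n and 0 for a terminal s_n.
    """
    n = len(rewards)
    returns = np.zeros(n)
    R = bootstrap_value
    for i in reversed(range(n)):  # i = n-1, ..., 0
        R = rewards[i] + gamma * R
        returns[i] = R
    advantages = returns - np.asarray(values, dtype=float)
    return returns, advantages

returns, advantages = rollout_targets(
    rewards=[1.0, 0.0, 2.0],
    values=[1.5, 1.0, 3.0],
    bootstrap_value=4.0,
    gamma=0.5,
)
print(returns)     # [2. 2. 4.]
print(advantages)  # [0.5 1.  1. ]
```

Walking backwards means each return reuses the one after it, so the whole rollout costs one pass. Step $i$ effectively gets an $(n-i)$ -step return, which is why the early steps of a rollout rely less on the critic than the late ones.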
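Advantage Actor-Critic (A2C)

A3C performs very well, but is its asynchronous updating actually necessary? Intuitively, whether updates are asynchronous or synchronous is not the main factor that determines how good the algorithm is. So why not try synchronous updates? This is where A2C comes from. An implementation of A2C can be found in the a2c folder of the OpenAI Baselines project, and the algorithm can be run directly via the run_atari.py file there. The official OpenAI blog also reports that A2C performs better than A3C.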