Trust Region Policy Optimization (TRPO): Derivation of the Formulas

I. Constructing the Optimization Problem

1. Starting from the expected cumulative reward

Denote the expected cumulative reward $\eta(\pi)$ by

$$\eta(\pi) = \mathbb{E}_{s_0 \sim \rho(s_0),\; a_t \sim \pi(a_t|s_t),\; s_{t+1}\sim P(s_{t+1}|s_t,a_t)}\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t)\Big], \tag{1}$$

where $\rho(s_0)$ is the distribution of the initial state $s_0$. It can be shown that

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0 \sim \rho(s_0),\; a_t \sim \tilde{\pi}(a_t|s_t),\; s_{t+1}\sim P(s_{t+1}|s_t,a_t)}\Big[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)\Big]. \tag{2}$$

The proof is given in Appendix A, Lemma 1 of the TRPO paper. Note that in the expectation of Eq. (2), the actions $a_t$ are sampled from the new policy $\tilde\pi$. Define the discounted state visitation frequencies

$$\rho_{\pi}(s) = P(s_0=s)+\gamma P(s_1=s) + \gamma^2 P(s_2=s) + \cdots.$$

Then Eq. (2) can be rewritten as

$$\begin{aligned}
\eta(\tilde{\pi}) &= \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s}P(s_t=s\mid\tilde{\pi})\sum_{a}\tilde{\pi}(a|s)\,\gamma^t A_\pi(s,a) \\
&= \eta(\pi) + \sum_{s}\sum_{t=0}^{\infty} \gamma^t P(s_t=s\mid\tilde{\pi}) \sum_{a}\tilde{\pi}(a|s)A_\pi(s,a) \\
&= \eta(\pi) + \sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_\pi(s,a).
\end{aligned}\tag{3}$$

The left-hand side of Eq. (3) is the expected cumulative reward $\eta(\tilde{\pi})$ of the new policy $\tilde{\pi}$; the right-hand side is the expected cumulative reward $\eta(\pi)$ of the old policy $\pi$ plus the term $\sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_\pi(s,a)$. The idea of TRPO is to improve the policy by maximizing $\sum_{s}\rho_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)A_\pi(s,a)$. This term cannot be optimized directly, however: because of $\rho_{\tilde{\pi}}(s)$ the objective contains an unknown quantity, i.e. before $\tilde\pi$ is obtained, $\rho_{\tilde{\pi}}(s)$ is unknown, so differentiating the objective directly is very complicated and hard to compute.
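To make the discounted visitation frequencies $\rho_\pi(s)$ concrete, here is a minimal Monte-Carlo sketch that estimates them in a small tabular MDP. The transition tensor `P`, policy `pi`, and initial distribution `rho0` below are made-up toy values for illustration only, not anything from the paper:

```python
import numpy as np

def rollout(P, pi, s0, horizon, rng):
    """Sample a state trajectory: a_t ~ pi[s_t], s_{t+1} ~ P[s_t, a_t]."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        a = rng.choice(len(pi[s]), p=pi[s])
        s = rng.choice(P.shape[-1], p=P[s, a])
    return states

def estimate_rho(P, pi, rho0, gamma=0.9, n_traj=500, horizon=100, seed=0):
    """Monte-Carlo estimate of rho_pi(s) = sum_t gamma^t P(s_t = s)."""
    rng = np.random.default_rng(seed)
    rho = np.zeros(P.shape[0])
    for _ in range(n_traj):
        s0 = rng.choice(P.shape[0], p=rho0)
        for t, s in enumerate(rollout(P, pi, s0, horizon, rng)):
            rho[s] += gamma ** t   # each visit at time t contributes gamma^t
    return rho / n_traj

# Toy 2-state, 2-action MDP (hypothetical numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
pi = np.array([[0.7, 0.3], [0.4, 0.6]])    # pi[s, a]
rho0 = np.array([1.0, 0.0])
print(estimate_rho(P, pi, rho0))           # approx rho_pi(s); entries sum to ~1/(1-gamma)
```

Each trajectory contributes $\gamma^t$ to the count of the state visited at time $t$, which is exactly the definition of $\rho_\pi(s)$ above.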

2. Approximating and simplifying the objective

Because of the problem described above, we first simplify Eq. (3) to

$$L_{\pi}(\tilde\pi) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)A_\pi(s,a). \tag{4}$$

Note that in Eq. (4), $\rho_{\tilde{\pi}}(s)$ has been replaced by $\rho_{\pi}(s)$. Eq. (4) has the following property:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_{\theta})\big|_{\theta = \theta_0}, \tag{5}$$

i.e. $L_\pi$ matches $\eta$ to first order at $\theta_0$ (a proof of this property is given in the linked post: proof of Eq. (5)). The following result relates $L_{\pi}(\tilde\pi)$ to $\eta(\tilde{\pi})$:

$$\eta(\tilde{\pi}) \ge L_{\pi}(\tilde\pi) - C\,D_{TV}^{\max}(\pi, \tilde\pi),$$

where $C=\dfrac{4\epsilon\gamma}{(1-\gamma)^2}$, $\epsilon=\max_{s,a}|A_\pi(s,a)|$, and

$$D_{TV}^{\max}(\pi, \tilde\pi)=\max_{s}D_{TV}\big(\pi(\cdot|s),\tilde\pi(\cdot|s)\big)=\max_{s}\frac{1}{2}\sum_{a}\big|\pi(a|s)-\tilde\pi(a|s)\big|.$$

The proof is given in Appendix A, Lemmas 2 and 3 of the TRPO paper. The paper then replaces the TV divergence with the KL divergence: using the inequality $D_{TV}(\pi, \tilde\pi)^2 \le D_{KL}(\pi, \tilde\pi)$, we obtain

$$\eta(\tilde{\pi}) \ge L_{\pi}(\tilde\pi) - C\,D_{KL}^{\max}(\pi, \tilde\pi), \tag{6}$$

where $C=\dfrac{4\epsilon\gamma}{(1-\gamma)^2}$ and $\epsilon=\max_{s,a}|A_\pi(s,a)|$. Maximizing the right-hand side $L_{\pi}(\tilde\pi) - C\,D_{KL}^{\max}(\pi, \tilde\pi)$ therefore guarantees monotonic improvement of the policy performance $\eta(\tilde{\pi})$.
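The relation $D_{TV}(\pi,\tilde\pi)^2 \le D_{KL}(\pi,\tilde\pi)$ used to pass from the TV bound to the KL bound can be sanity-checked numerically on discrete distributions. A minimal sketch with two arbitrary example distributions (natural logarithm in the KL):

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.sum(np.abs(p - q))

def kl(p, q):
    """KL divergence D_KL(p || q) with natural logarithm."""
    return np.sum(p * np.log(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(tv(p, q) ** 2, "<=", kl(p, q))   # the bound D_TV^2 <= D_KL holds here
```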

3. Further simplifying the optimization problem

π = π θ o l d , π ~ = π θ \pi = \pi_{\theta_{old}}, \tilde\pi=\pi_{\theta} π=πθold,π~=πθ,则优化问题可以构建为:
maximize θ [ L θ o l d ( θ ) − C D K L m a x ( θ o l d , θ ) ] . \text{maximize}_{\theta}[L_{\theta_{old}}(\theta)-CD_{KL}^{max}(\theta_{old}, \theta)]. maximizeθ[Lθold(θ)CDKLmax(θold,θ)].文章中指出,如果按照理论中设置C的大小,则每次更新的步长非常小,效率很低。因此,文章用信赖域的约束优化问题代替上面的问题:
maximize θ L θ o l d ( θ ) , s.t. D K L m a x ( θ o l d , θ ) ≤ δ . \text{maximize}_{\theta}L_{\theta_{old}}(\theta), \\ \text{s.t.} D_{KL}^{max}(\theta_{old}, \theta) \le \delta. maximizeθLθold(θ),s.t.DKLmax(θold,θ)δ.文章进一步将约束中的KL散度由最大值约束简化为了平均值约束:
maximize θ L θ o l d ( θ ) , s.t. D ˉ K L ρ θ o l d ( θ o l d , θ ) ≤ δ . \text{maximize}_{\theta}L_{\theta_{old}}(\theta), \\ \text{s.t.} \bar D_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta. maximizeθLθold(θ),s.t.DˉKLρθold(θold,θ)δ.其中 D ˉ K L ρ ( θ 1 , θ 2 ) = E s ∼ ρ [ D K L ( π θ 1 ( ⋅ ∣ s ) , π θ 2 ( ⋅ ∣ s ) ) ] \bar D_{KL}^{\rho}(\theta_1,\theta_2) = \mathbb{E}_{s\sim\rho}[D_{KL}(\pi_{\theta_1}(\cdot|s), \pi_{\theta_2}(\cdot|s))] DˉKLρ(θ1,θ2)=Esρ[DKL(πθ1(s),πθ2(s))]。由于在强化学习中,一般通过采样的方式收集样本,上述的优化目标还可以写成期望的形式:
maximize θ ∑ s ρ θ o l d ( s ) ∑ a ∼ π θ π θ ( a ∣ s ) A θ o l d ( s , a ) , s.t. D ˉ K L ρ θ o l d ( θ o l d , θ ) ≤ δ . \text{maximize}_{\theta} \sum_{s}\rho_{\theta_{old}}(s)\sum_{a\sim\pi_{\theta}}\pi_{\theta}(a|s)A_{\theta_{old}}(s,a), \\ \text{s.t.} \bar D_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta. maximizeθsρθold(s)aπθπθ(as)Aθold(s,a),s.t.DˉKLρθold(θold,θ)δ.在上式中, maximize θ ∑ s ρ θ o l d ( s ) [ . . . ] \text{maximize}_{\theta} \sum_{s}\rho_{\theta_{old}}(s)[...] maximizeθsρθold(s)[...]代表根据分布 s ∼ ρ θ o l d ( s ) s\sim\rho_{\theta_{old}}(s) sρθold(s)对状态进行采样,可以写成 E s ∼ ρ θ o l d ( s ) [ . . . ] \mathbb{E}_{s\sim \rho_{\theta_{old}}(s)}[...] Esρθold(s)[...],同时,由于在收集样本时,一般是采用某些随机策略 q ( ⋅ ∣ s ) q(\cdot|s) q(s),以增加对状态空间的探索,因此 ∑ a π θ ( a ∣ s ) \sum_a\pi_{\theta}(a|s) aπθ(as)与数据中 a t a_t at来源于 q ( ⋅ ∣ s ) q(\cdot|s) q(s)不一致,通过重要性采样将其改写为 ∑ a ∼ q π θ ( a ∣ s ) q ( ⋅ ∣ s ) \sum_{a\sim q}\frac{\pi_{\theta}(a|s)}{q(\cdot|s)} aqq(s)πθ(as)。综上,优化问题可以改写为:
maximize θ E s ∼ ρ θ o l d , a ∼ q π θ ( a ∣ s ) q ( ⋅ ∣ s ) A θ o l d ( s , a ) , s.t. D ˉ K L ρ θ o l d ( θ o l d , θ ) ≤ δ . (7) \text{maximize}_{\theta} \mathbb{E}_{s\sim \rho_{\theta_{old}},a\sim q}\frac{\pi_{\theta}(a|s)}{q(\cdot|s)}A_{\theta_{old}}(s,a), \\ \text{s.t.} \bar D_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta. \tag{7} maximizeθEsρθold,aqq(s)πθ(as)Aθold(s,a),s.t.DˉKLρθold(θold,θ)δ.(7)
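As a concrete illustration of Eq. (7), the following minimal sketch evaluates the sample-based surrogate objective and the average-KL constraint for a discrete-action policy, assuming the behavior policy is the old policy, i.e. $q = \pi_{\theta_{old}}$. The policy tables and batch data are placeholders invented for this example:

```python
import numpy as np

def surrogate_and_kl(pi_new, pi_old, states, actions, advantages):
    """Sample estimates of E[pi_new/pi_old * A] and E_s[KL(pi_old || pi_new)].

    pi_new, pi_old: arrays of shape (n_states, n_actions) with action probabilities.
    states, actions: integer arrays of sampled (s_t, a_t) pairs.
    advantages: advantage estimates A_{theta_old}(s_t, a_t).
    """
    ratio = pi_new[states, actions] / pi_old[states, actions]   # importance weights
    surrogate = np.mean(ratio * advantages)
    mean_kl = np.mean(np.sum(pi_old[states] *
                             np.log(pi_old[states] / pi_new[states]), axis=1))
    return surrogate, mean_kl

# Placeholder data: 3 states, 2 actions, 5 sampled transitions.
pi_old = np.array([[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]])
pi_new = np.array([[0.7, 0.3], [0.45, 0.55], [0.35, 0.65]])
states = np.array([0, 1, 2, 0, 1])
actions = np.array([0, 1, 1, 1, 0])
advantages = np.array([0.5, -0.2, 1.0, 0.1, -0.4])
print(surrogate_and_kl(pi_new, pi_old, states, actions, advantages))
```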

II. Solving the Optimization Problem

Following the idea of sequential quadratic programming, the paper linearizes the objective, takes a second-order Taylor expansion of the constraint, and solves problem (7) iteratively.

1. Linearizing the objective

f ( θ ) = π θ ( a ∣ s ) q ( ⋅ ∣ s ) A θ o l d ( s , a ) f(\theta) = \frac{\pi_{\theta}(a|s)}{q(\cdot|s)}A_{\theta_{old}}(s,a) f(θ)=q(s)πθ(as)Aθold(s,a),则 f ( θ ) = f ( θ o l d ) + ∇ θ f ( θ ) ∣ θ = θ o l d ( θ − θ o l d ) f(\theta) = f(\theta_{old})+\nabla_{\theta}f(\theta)|_{\theta=\theta_{old}}(\theta-\theta_{old}) f(θ)=f(θold)+θf(θ)θ=θold(θθold)。其中 ∇ θ f ( θ ) ∣ θ = θ o l d \nabla_{\theta}f(\theta)|_{\theta=\theta_{old}} θf(θ)θ=θold
可以通过自动微分计算。
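As a minimal PyTorch sketch of this step: the toy categorical policy `policy`, the batch tensors, and their sizes below are all hypothetical. The snippet evaluates the surrogate $f$ at $\theta_{old}$ and obtains its gradient (the vector $b$ that appears later in Eq. (9)) by automatic differentiation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 2))  # toy 2-action policy

# Placeholder batch: states, sampled actions, advantage estimates.
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
advantages = torch.randn(8)
idx = torch.arange(8)

# Log-probabilities under the old policy, treated as constants (q = pi_theta_old).
with torch.no_grad():
    old_log_probs = torch.log_softmax(policy(states), dim=-1)[idx, actions]

# Surrogate f(theta) = E[ pi_theta(a|s) / q(a|s) * A_theta_old(s,a) ].
log_probs = torch.log_softmax(policy(states), dim=-1)[idx, actions]
surrogate = torch.mean(torch.exp(log_probs - old_log_probs) * advantages)

# Gradient of the surrogate at theta_old, flattened into a single vector b.
grads = torch.autograd.grad(surrogate, policy.parameters())
b = torch.cat([g.reshape(-1) for g in grads])
print(b.shape)
```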

2. Second-order Taylor expansion of the constraint

The second-order Taylor expansion of the KL divergence takes the following form:

$$g(\theta) = D_{KL}(p_{\theta_0},p_{\theta}) = D_{KL}(p_{\theta_0},p_{\theta_0}) + \nabla_\theta D_{KL}(p_{\theta_0},p_{\theta})\big|_{\theta = \theta_0}(\theta - \theta_0) + \frac{1}{2}(\theta - \theta_0)^T F\,(\theta-\theta_0) + O(\|\theta-\theta_0\|^3),$$

where $D_{KL}(p_{\theta_0},p_{\theta_0})=0$ and the first-order term vanishes:

$$\begin{aligned}
\nabla_\theta D_{KL}(p_{\theta_0},p_{\theta})\big|_{\theta = \theta_0}
&= \nabla_\theta \int p_{\theta_0}(x)\big[\log p_{\theta_0}(x)-\log p_{\theta}(x)\big]\,dx\,\Big|_{\theta = \theta_0} \\
&= - \int p_{\theta_0}(x)\,\nabla_\theta\log p_{\theta}(x)\,dx\,\Big|_{\theta = \theta_0} \\
&= - \int p_{\theta_0}(x)\,\frac{\nabla_\theta p_{\theta}(x)}{p_{\theta}(x)}\,dx\,\Big|_{\theta = \theta_0} \\
&= - \nabla_\theta \int p_\theta(x)\,dx\,\Big|_{\theta = \theta_0} \\
&= - \nabla_\theta 1 = 0.
\end{aligned}$$

Here $F$ is the Fisher information matrix (see the definition of the Fisher information matrix); the detailed derivation can be found in the post on the second-order Taylor expansion of the KL divergence.
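In practice $F$ is never formed explicitly; what is needed are Fisher-vector products $Fv$, which can be obtained by differentiating the average KL divergence twice (double backpropagation). The sketch below continues the hypothetical `policy`/`states` setup from the previous snippet and adds a small damping term, a common implementation detail rather than part of the derivation:

```python
def fisher_vector_product(policy, states, v, damping=1e-2):
    """Compute F v, where F is the Hessian of the mean KL at theta_old,
    via double backpropagation."""
    log_p = torch.log_softmax(policy(states), dim=-1)
    p_old = log_p.detach().exp()                     # old policy treated as constant
    mean_kl = (p_old * (p_old.log() - log_p)).sum(dim=-1).mean()

    grads = torch.autograd.grad(mean_kl, policy.parameters(), create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()                   # scalar grad^T v
    hvp = torch.autograd.grad(grad_v, policy.parameters())
    return torch.cat([h.reshape(-1) for h in hvp]) + damping * v

v = torch.randn_like(b)          # any direction; b is the gradient from above
print(fisher_vector_product(policy, states, v).shape)
```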

Putting the pieces together, the optimization problem can be written as

$$\underset{\theta}{\text{maximize}}\;\; \mathbb{E}_{s\sim \rho_{\theta_{old}},\,a\sim q}\Big[ f(\theta_{old})+\nabla_{\theta}f(\theta)\big|_{\theta=\theta_{old}}(\theta-\theta_{old})\Big] \quad \text{s.t.}\;\; \frac{1}{2}(\theta - \theta_{old})^T F\,(\theta-\theta_{old}) \le \delta. \tag{8}$$

3. Solving the optimization problem efficiently

Applying the method of Lagrange multipliers to problem (8) gives

$$\nabla_{\theta}f(\theta)\big|_{\theta=\theta_{old}} + \lambda F\,(\theta-\theta_{old})=0,$$

so the update direction $g=(\theta-\theta_{old})$ satisfies, up to a scaling factor,

$$F g = b, \tag{9}$$

where $F$ is the Fisher information matrix, $g=(\theta-\theta_{old})$, and $b=\nabla_{\theta}f(\theta)\big|_{\theta=\theta_{old}}$. The paper solves Eq. (9) efficiently with the conjugate gradient method. After the update direction $g$ has been obtained from Eq. (9), the step size $\beta$ must satisfy the KL constraint, i.e.

$$\delta = D_{KL} \approx \frac{1}{2}(\beta g)^T F\,(\beta g)= \frac{1}{2}\beta^2\, g^T F g,$$

which gives $\beta = \sqrt{2\delta/(g^T F g)}$. Starting from $\beta = \sqrt{2\delta/(g^T F g)}$, the paper performs a line search to find the $\beta^*$ that maximizes $L_{\theta_{old}}(\theta)-\mathbb{I}\big(D_{KL}(\theta_{old},\theta)\le \delta\big)$, where the indicator $\mathbb{I}(\cdot)$ is $0$ when the condition holds and $+\infty$ otherwise.
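Putting the pieces together, the sketch below runs conjugate gradient on Eq. (9) using only Fisher-vector products, computes the initial step size $\beta=\sqrt{2\delta/(g^T F g)}$, and performs a backtracking line search on the step fraction. It reuses the hypothetical `policy`, `b`, `fisher_vector_product`, and batch tensors from the earlier snippets and is a simplified outline of the procedure, not the paper's reference implementation:

```python
def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F g = b using only Fisher-vector products fvp(v) = F v."""
    g = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        g += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return g

def get_flat_params(model):
    return torch.cat([p.data.reshape(-1) for p in model.parameters()])

def set_flat_params(model, flat):
    i = 0
    for p in model.parameters():
        p.data.copy_(flat[i:i + p.numel()].view_as(p))
        i += p.numel()

delta = 0.01
fvp = lambda v: fisher_vector_product(policy, states, v)
g = conjugate_gradient(fvp, b)                    # update direction from Eq. (9)
beta = torch.sqrt(2 * delta / (g @ fvp(g)))       # initial step size

# Backtracking line search: accept the largest step fraction that keeps the
# mean KL within delta and improves the surrogate over its value at theta_old.
surr_old = advantages.mean()                      # surrogate at theta_old (ratio = 1)
with torch.no_grad():
    old_probs = torch.softmax(policy(states), dim=-1)
    old_params = get_flat_params(policy)
    for frac in [0.5 ** i for i in range(10)]:
        set_flat_params(policy, old_params + frac * beta * g)
        new_log_p = torch.log_softmax(policy(states), dim=-1)
        surr = torch.mean(torch.exp(new_log_p[idx, actions] - old_log_probs) * advantages)
        mean_kl = (old_probs * (old_probs.log() - new_log_p)).sum(dim=-1).mean()
        if mean_kl <= delta and surr > surr_old:
            break
    else:
        set_flat_params(policy, old_params)       # no acceptable step: keep theta_old
```

The `for ... else` clause restores the old parameters when no step fraction satisfies both the KL constraint and the improvement condition.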
