强化学习保守策略迭代Conservative policy iteration推导

前言

最近在学习TRPO,但是本人发现TRPO用了好多好多好多前人的论文结论,包括策略梯度定理 (Policy gradient theorem, PGT),自然策略梯度 (Natural policy gradient, NPG)和保守策略迭代 (Conservative policy iteration, CPI)。

因此,本人最近更新的博客数量剧增。。。PGT 和 NPG 可以参考我的另外两个博客
PGT.
NPG.

CPI 的主要贡献在于它提出了一种混合策略更新的方法,并且给出了使用这种方法更新策略可以使值函数单调增加的证明。可以设想一下,如果值函数单调不减,那么理论上是可以说明得到的策略是越来越好的。

实际上,CPI 算法并不会被简单粗暴地应用在程序中,因为结论中提到的变量都是未知的,都是很难确定的,所以CPI 算法更多地是提供一种解决一类RL问题的思路。然后,十几年之后,TRPO的提出就是这个思路被成功应用的很典型的一个案例。

Greedy policy

贪婪策略是十分常见的一种策略更新方式,每一回合计算出优势函数 A π o l d ( s , a ) A^{\pi_{old}}(s,a) Aπold(s,a),然后反向找到使这个 A π o l d ( s , a ) A^{\pi_{old}}(s,a) Aπold(s,a)最大的策略 π \pi π作为 π n e w \pi_{new} πnew,简单粗暴。(有的时候也会用 Q π o l d ( s , a ) Q^{\pi_{old}}(s,a) Qπold(s,a),本质上一样),即:
π n e w = arg ⁡ max ⁡ π ∈ Π E s ∼ d μ π [ A π ( s , a ) ] \begin{align} \begin{aligned} \pi_{new}=\arg\max_{\pi\in\Pi}{\mathbb{E}_{s\sim d_{\mu}^{\pi}}{\left[A^{\pi}(s,a)\right]}} \end{aligned} \end{align} πnew=argπΠmaxEsdμπ[Aπ(s,a)]

但是这就存在一个问题,如果两次的改进比较大,那么很有可能会导致由 π o l d \pi_{old} πold π n e w \pi_{new} πnew所诱导出的轨迹存在较大差别。考虑到此时值函数并没有收敛 (甚至刚开始学习),那么就会导致策略在一定范围内“反复横跳” (哎,就是玩,你说啥都没用,我就不收敛,你能咋的…>_<)。这也是混合策略提出的初衷。

Conservative Policy Iteration

CPI将策略的更新变成旧策略和临时策略的加权平均,即
{ π ′ = arg ⁡ max ⁡ π ∈ Π E s ∼ d μ π o l d [ A π ( s , a ) ] π n e w = ( 1 − α ) π + α π ′ \begin{align} \left\{ \begin{aligned} & \pi'=\arg\max_{\pi\in\Pi}{\mathbb{E}_{s\sim d_{\mu}^{\pi_{old}}}{\left[A^{\pi}(s,a)\right]}} \\ & \pi_{new}=(1-\alpha)\pi+\alpha\pi' \end{aligned} \right. \end{align} π=argπΠmaxEsdμπold[Aπ(s,a)]πnew=(1α)π+απ
这个式子给出来挺容易,其实到这里就已经结束了,问题是,如何说明它确实“有所提高”。这里给出三个引理

Lemma 1 (Performance difference lemma)

对于两个策略 π \pi π π ′ \pi' π,以及他们所诱导出的状态的分布 d μ π d_{\mu}^{\pi} dμπ d μ π ′ d_{\mu}^{\pi'} dμπ,有
{ V π ′ ( s 0 ) − V π ( s 0 ) = 1 1 − γ E s ∼ d μ π ′ [ A π ( s , π ′ ( s ) ) ] V π ( s 0 ) − V π ′ ( s 0 ) = 1 1 − γ E s ∼ d μ π [ A π ′ ( s , π ( s ) ) ] \begin{align} \left\{ \begin{aligned} & V^{\pi'}(s_0)-V^{\pi}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi'}}\left[A^{\pi}(s,\pi'(s))\right] \\ & V^{\pi}(s_0)-V^{\pi'}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right] \end{aligned} \right. \end{align} Vπ(s0)Vπ(s0)=1γ1Esdμπ[Aπ(s,π(s))]Vπ(s0)Vπ(s0)=1γ1Esdμπ[Aπ(s,π(s))]
成立,其中 μ \mu μ为状态的初始分布, A A A是优势函数。

Proof V π ( s 0 ) − V π ′ ( s 0 ) = 1 1 − γ E s ∼ d μ π [ A π ′ ( s , π ( s ) ) ] V^{\pi}(s_0)-V^{\pi'}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right] Vπ(s0)Vπ(s0)=1γ1Esdμπ[Aπ(s,π(s))]

V π ( s 0 ) − V π ′ ( s 0 ) = V π ( s 0 ) − E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 V π ′ ( s 1 ) ] + E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 V π ′ ( s 1 ) ] − V π ′ ( s 0 ) = E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) ] ] − E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 [ V π ′ ( s 1 ) ] ] + E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 [ V π ′ ( s 1 ) ] ] − V π ′ ( s 0 ) = γ E a 0 ∼ π E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) − V π ′ ( s 1 ) ] + E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 [ V π ′ ( s 1 ) ] ] − V π ′ ( s 0 ) = γ E a 0 ∼ π E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) − V π ′ ( s 1 ) ] + E a 0 ∼ π [ Q π ′ ( s 0 , a 0 ) − V π ′ ( s 0 ) ] = γ E a 0 ∼ π E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) − V π ′ ( s 1 ) ] + E a 0 ∼ π A π ′ ( s 0 , a 0 ) \begin{align} \begin{aligned} V^{\pi}(s_0)-V^{\pi'}(s_0) & = V^{\pi}(s_0) -\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}V^{\pi'}(s_1)\right] + \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}V^{\pi'}(s_1)\right] - V^{\pi'}(s_0)\\ &=\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)\right]\right] -\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] \\ &+ \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] - V^{\pi'}(s_0) \\ &= \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right]+ \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] - V^{\pi'}(s_0) \\ & = \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] + \mathbb{E}_{a_0\sim\pi}\left[Q^{\pi'}(s_0,a_0) - V^{\pi'}(s_0)\right]\\ &= \textcolor{red}{\gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right]} + \mathbb{E}_{a_0\sim\pi}A^{\pi'}(s_0,a_0) \end{aligned} \end{align} Vπ(s0)Vπ(s0)=Vπ(s0)Ea0π[rs0a0+γEs1Ps0a0Vπ(s1)]+Ea0π[rs0a0+γEs1Ps0a0Vπ(s1)]Vπ(s0)=Ea0π[rs0a0+γEs1Ps0a0[Vπ(s1)]]Ea0π[rs0a0+γEs1Ps0a0[Vπ(s1)]]+Ea0π[rs0a0+γEs1Ps0a0[Vπ(s1)]]Vπ(s0)=γEa0πEs1Ps0a0[Vπ(s1)Vπ(s1)]+Ea0π[rs0a0+γEs1Ps0a0[Vπ(s1)]]Vπ(s0)=γEa0πEs1Ps0a0[Vπ(s1)Vπ(s1)]+Ea0π[Qπ(s0,a0)Vπ(s0)]=γEa0πEs1Ps0a0[Vπ(s1)Vπ(s1)]+Ea0πAπ(s0,a0)
红色部分中括号里面的部分的形式与最初的等式形式相同。因此,继续推导红色部分
γ E a 0 ∼ π E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) − V π ′ ( s 1 ) ] = γ E s 1 ∼ P 1 π ( ⋅ ∣ s 0 ) [ V π ( s 1 ) − V π ′ ( s 1 ) ] = γ E s 1 ∼ P 1 π ( ⋅ ∣ s 0 ) [ γ E a 1 ∼ π E s 2 ∼ P s 2 a 2 [ V π ( s 2 ) − V π ′ ( s 2 ) ] ] = γ 2 E s 2 ∼ P 2 π ( ⋅ ∣ s 0 ) [ V π ( s 2 ) − V π ′ ( s 2 ) ] + γ E a 1 ∼ π A π ′ ( s 1 , a 1 ) \begin{align} \begin{aligned} \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] & = \gamma\mathbb{E}_{s_1\sim P_1^{\pi}(\cdot|s_0)}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] \\ & = \gamma\mathbb{E}_{s_1\sim P_1^{\pi}(\cdot|s_0)}\left[\gamma\mathbb{E}_{a_1\sim\pi}\mathbb{E}_{s_2\sim P_{s_2}^{a_2}}\left[V^{\pi}(s_2)-V^{\pi'}(s_2)\right]\right] \\ &= \gamma^2\mathbb{E}_{s_2\sim P_2^{\pi}(\cdot|s_0)}\left[V^{\pi}(s_2)-V^{\pi'}(s_2)\right]+\gamma\mathbb{E}_{a_1\sim \pi}A^{\pi'}(s_1,a_1) \end{aligned} \end{align} γEa0πEs1Ps0a0[Vπ(s1)Vπ(s1)]=γEs1P1π(s0)[Vπ(s1)Vπ(s1)]=γEs1P1π(s0)[γEa1πEs2Ps2a2[Vπ(s2)Vπ(s2)]]=γ2Es2P2π(s0)[Vπ(s2)Vπ(s2)]+γEa1πAπ(s1,a1)
由此类推,推导至无穷项,可以得出
V π ( s 0 ) − V π ′ ( s 0 ) = ∑ t = 0 ∞ γ t E s , a ∼ π A π ′ ( s , a ) = 1 1 − γ E s ∼ d μ π [ A π ′ ( s , π ( s ) ) ] \begin{align} \begin{aligned} V^{\pi}(s_0)-V^{\pi'}(s_0) &= \sum_{t=0}^{\infty}\gamma^t \mathbb{E}_{s,a\sim\pi}A^{\pi'}(s,a)\\ &=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right] \end{aligned} \end{align} Vπ(s0)Vπ(s0)=t=0γtEs,aπAπ(s,a)=1γ1Esdμπ[Aπ(s,π(s))]
第二个的证明与第一个完全一样。整体的表达式是对称的,只需要把 π ′ \pi' π π \pi π互换就好。

Lemma 2 还有另外一种表达形式 (这个就是恶心的地方,论文里都说显然,然后不同的资料用的符号和定义都不一样,或者实际上一样,但是看起来不一样>_<)

Lemma 1的另一种表达形式

{ V π ′ ( s 0 ) − V π ( s 0 ) = ∑ s d π ′ ( s ) ∑ a π ′ ( s , a ) A π ( s , a ) V π ( s 0 ) − V π ′ ( s 0 ) = ∑ s d π ( s ) ∑ a π ( s , a ) A π ′ ( s , a ) \begin{align} \left\{ \begin{aligned} & V^{\pi'}(s_0)-V^{\pi}(s_0)=\sum_sd^{\pi'}(s)\sum_{a}{\pi'(s,a)A^{\pi}(s,a)} \\ & V^{\pi}(s_0)-V^{\pi'}(s_0)=\sum_sd^{\pi}(s)\sum_{a}{\pi(s,a)A^{\pi'}(s,a)} \end{aligned} \right. \end{align} Vπ(s0)Vπ(s0)=sdπ(s)aπ(s,a)Aπ(s,a)Vπ(s0)Vπ(s0)=sdπ(s)aπ(s,a)Aπ(s,a)

Lemma2

对于混合策略更新,有
V π n e w − V π ≥ α A π ( π ′ ) − 2 α 2 γ ϵ 1 − γ ( 1 − α ) \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi}\geq\alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma(1-\alpha)} \end{aligned} \end{align} VπnewVπαAπ(π)1γ(1α)2α2γϵ
成立。其中
A π ( π ′ ) ≡ ∑ s d π ( s ) ∑ a π ′ ( s , a ) A π ( s , a ) ϵ ≡ 1 1 − γ [ max ⁡ s ∑ a π ′ ( s , a ) A π ( s , a ) ] A_{\pi}(\pi')\equiv\sum_sd^{\pi}(s)\sum_a\pi'(s,a)A^{\pi}(s,a)\\ \epsilon\equiv\frac{1}{1-\gamma}\left[ \max_s\sum_a\pi'(s,a)A^{\pi}(s,a) \right] Aπ(π)sdπ(s)aπ(s,a)Aπ(s,a)ϵ1γ1[smaxaπ(s,a)Aπ(s,a)]

Proof
在这里,可以将 π n e w \pi_{new} πnew看做 α \alpha α的函数。当 α = 0 \alpha=0 α=0时, π n e w = π \pi_{new}=\pi πnew=π;当 α = 1 \alpha=1 α=1时, π n e w = π ′ \pi_{new}=\pi' πnew=π。我们计算 π n e w \pi_{new} πnew α \alpha α的导数在 α = 0 \alpha=0 α=0时的数值
∇ α V π n e w ∣ α = 0 = ∑ s d π n e w ( s ) ∑ a ∇ α π n e w A π n e w ( s , a ) ∣ α = 0 = ∑ s d π n e w ( s ) ∑ a ( π ′ − π ) A π n e w ( s , a ) ∣ α = 0 = ∑ s d π ( s ) ∑ a ( π ′ − π ) A π ( s , a ) = ∑ s d π ( s ) ∑ a π ′ A π ( s , a ) − ∑ s d π ( s ) ∑ a π A π ( s , a ) = ∑ s d π ( s ) ∑ a π ′ A π ( s , a ) = A π ( π ′ ) \begin{align} \begin{aligned} \nabla_{\alpha}V^{\pi_{new}}|_{\alpha=0} &= \left.\sum_sd^{\pi_{new}}(s)\sum_a\nabla_{\alpha}\pi_{new}A^{\pi_{new}}(s,a)\right|_{\alpha=0} \\ &= \left.\sum_sd^{\pi_{new}}(s)\sum_a\left(\pi'-\pi\right)A^{\pi_{new}}(s,a)\right|_{\alpha=0}\\ &= \sum_sd^{\pi}(s)\sum_a\left(\pi'-\pi\right)A^{\pi}(s,a)\\ &= \sum_sd^{\pi}(s)\sum_a\pi'A^{\pi}(s,a)-\sum_sd^{\pi}(s)\sum_a \pi A^{\pi}(s,a)\\ &= \sum_sd^{\pi}(s)\sum_a\pi'A^{\pi}(s,a)\\ &= A_{\pi}(\pi') \end{aligned} \end{align} αVπnewα=0=sdπnew(s)aαπnewAπnew(s,a) α=0=sdπnew(s)a(ππ)Aπnew(s,a) α=0=sdπ(s)a(ππ)Aπ(s,a)=sdπ(s)aπAπ(s,a)sdπ(s)aπAπ(s,a)=sdπ(s)aπAπ(s,a)=Aπ(π)
同理, A π ( π n e w ) = α A π ( π ′ ) A_{\pi}(\pi_{new})=\alpha A_{\pi}(\pi') Aπ(πnew)=αAπ(π)。将 V π n e w V^{\pi_{new}} Vπnew π n e w = π \pi_{new}=\pi πnew=π处Taylor展开,得到
V π n e w − V π = V π + α ∇ α V π n e w ∣ α = 0 − V π ≥ α A π ( π n e w ) − O ( α 2 ) \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi} & =V^{\pi} + \alpha\nabla_{\alpha}V^{\pi_{new}}|_{\alpha=0}-V^{\pi} \\ &\geq\alpha A_{\pi}(\pi_{new}) -O(\alpha^2) \end{aligned} \end{align} VπnewVπ=Vπ+ααVπnewα=0VπαAπ(πnew)O(α2)
这里只说明了值函数的增益大于一个下确界,但是并没有得到目标的形式。继续推导,根据Lemma 1可知
V π n e w − V π = ∑ s d π n e w ( s ) ∑ a π n e w ( s , a ) A π ( s , a ) \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi} & = \sum_sd^{\pi_{new}}(s)\sum_{a}{\pi_{new}(s,a)A^{\pi}(s,a)} \end{aligned} \end{align} VπnewVπ=sdπnew(s)aπnew(s,a)Aπ(s,a)
这里定义一个变量 η t \eta_t ηt,含义为截止到第 t t t个时刻时,策略 π n e w \pi_{new} πnew中没有采用 π \pi π的步数,则有:
P r τ ∼ π n e w ( S t = s ∣ η t = 0 ) = P r τ ∼ π ( S t = s ) = ( 1 − α ) t \begin{align} \begin{aligned} Pr_{\tau\sim\pi_{new}}(S_t=s|\eta_t=0)=Pr_{\tau\sim\pi}(S_t=s)=(1-\alpha)^{t} \end{aligned} \end{align} Prτπnew(St=sηt=0)=Prτπ(St=s)=(1α)t
有的推导中写 t − 1 t-1 t1次方,这都无所谓的,最后是无穷项等比数列相加,不差这一个数。进而有
p t ≡ P r ( η t > 0 ) = 1 − ( 1 − α ) t \begin{align} \begin{aligned} p_t\equiv Pr(\eta_t>0)=1-(1-\alpha)^{t} \end{aligned} \end{align} ptPr(ηt>0)=1(1α)t
进而有 (开始了)
V π n e w − V π = ∑ s d π n e w ( s ) ∑ a π n e w ( s , a ) A π ( s , a ) = E τ ∼ π n e w [ ∑ t γ t ∑ a α π ′ ( s t , a ) A π ( s t , a ) ] = α ∑ t ( 1 − p t ) γ t E τ ∼ π n e w [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ∣ η = 0 ] + α ∑ t p t γ t E τ ∼ π n e w [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ∣ η > 0 ] = α ∑ t ( 1 − p t ) γ t E τ ∼ π [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ] + α ∑ t p t γ t E τ ∼ π n e w [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ∣ η > 0 ] = α A π ( π ′ ) − α ∑ t p t γ t E τ ∼ π [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ] + α ∑ t p t γ t E τ ∼ π n e w [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ∣ η > 0 ] ≥ α A π ( π ′ ) − 2 α ∑ t γ t [ 1 − ( 1 − α ) t ] [ max ⁡ s ∑ a π ′ ( s , a ) A π ( s , a ) ] = α A π ( π ′ ) − 2 α 2 γ ϵ 1 − γ ( 1 − α ) \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi} &= \sum_sd^{\pi_{new}}(s)\sum_{a}{\pi_{new}(s,a)A^{\pi}(s,a)}\\ &= \mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_t\gamma^{t}\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &= \alpha\sum_t(1-p_t)\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)|\eta=0\right]\\ &+ \alpha\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)|\eta>0\right]\\ &= \alpha\sum_t(1-p_t)\gamma^t\mathbb{E}_{\tau\sim\pi}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &+ \alpha\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)|\eta>0\right]\\ &=\alpha A_{\pi}(\pi')-\alpha\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &+ \alpha\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)|\eta>0\right]\\ &\geq \alpha A_{\pi}(\pi')-2\alpha\sum_t\gamma^t\left[1-(1-\alpha)^{t}\right]\left[\max_s\sum_a\pi'(s,a)A^{\pi}(s,a)\right]\\ &= \alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma(1-\alpha)} \end{aligned} \end{align} VπnewVπ=sdπnew(s)aπnew(s,a)Aπ(s,a)=Eτπnew[tγtaαπ(st,a)Aπ(st,a)]=αt(1pt)γtEτπnew[aαπ(st,a)Aπ(st,a)η=0]+αtptγtEτπnew[aαπ(st,a)Aπ(st,a)η>0]=αt(1pt)γtEτπ[aαπ(st,a)Aπ(st,a)]+αtptγtEτπnew[aαπ(st,a)Aπ(st,a)η>0]=αAπ(π)αtptγtEτπ[aαπ(st,a)Aπ(st,a)]+αtptγtEτπnew[aαπ(st,a)Aπ(st,a)η>0]αAπ(π)2αtγt[1(1α)t][smaxaπ(s,a)Aπ(s,a)]=αAπ(π)1γ(1α)2α2γϵ
Proof completes.

单调改进

在得到Lemma 2之后,很容易得出
V π n e w − V π ≥ α A π ( π ′ ) − 2 α 2 γ ϵ 1 − γ \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi}\geq\alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma} \end{aligned} \end{align} VπnewVπαAπ(π)1γ2α2γϵ
R R R是奖励的上界 (大于零),则有
{ A π ( π ′ ) = ∑ s d π ( s ) ∑ a π ′ ( s , a ) A π ( s , a ) ≤ R 1 − γ ϵ = 1 1 − γ [ max ⁡ s ∑ a π ′ ( s , a ) A π ( s , a ) ] ≤ R 1 − γ \begin{align} \left\{ \begin{aligned} & A_{\pi}(\pi') = \sum_sd^{\pi}(s)\sum_a\pi'(s,a)A^{\pi}(s,a) \leq \frac{R}{1-\gamma} \\ & \epsilon = \frac{1}{1-\gamma}\left[ \max_s\sum_a\pi'(s,a)A^{\pi}(s,a) \right]\leq \frac{R}{1-\gamma} \end{aligned} \right. \end{align} Aπ(π)=sdπ(s)aπ(s,a)Aπ(s,a)1γRϵ=1γ1[smaxaπ(s,a)Aπ(s,a)]1γR
这时,若选取 α = A π ( π ′ ) ( 1 − γ ) 2 4 R \alpha=\frac{A_{\pi}(\pi')(1-\gamma)^2}{4R} α=4RAπ(π)(1γ)2,则有
V π n e w − V π ≥ α A π ( π ′ ) − 2 α 2 γ ϵ 1 − γ = A π 2 ( π ′ ) ( 1 − γ ) 2 4 R − 2 α 2 γ ϵ 1 − γ ≥ A π 2 ( π ′ ) ( 1 − γ ) 2 4 R − 2 α 2 γ R ( 1 − γ ) 2 = A π ( π ′ ) ( 1 − γ ) 2 8 R > 0 \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi} & \geq \alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma}\\ &= \frac{A_{\pi}^2(\pi')(1-\gamma)^2}{4R}-\frac{2\alpha^2\gamma\epsilon }{1-\gamma}\\ & \geq \frac{A_{\pi}^2(\pi')(1-\gamma)^2}{4R} - \frac{2\alpha^2\gamma R}{(1-\gamma)^2}\\ & = \frac{A_{\pi}(\pi')(1-\gamma)^2}{8R}>0 \end{aligned} \end{align} VπnewVπαAπ(π)1γ2α2γϵ=4RAπ2(π)(1γ)21γ2α2γϵ4RAπ2(π)(1γ)2(1γ)22α2γR=8RAπ(π)(1γ)2>0
如此,就保证了策略改进的单调提升,论文原文的后边还有很多推导,关于算法细节的,没太看懂,不往这放了。实际应用中,很多不会直接使用这个CPI技术,而是借用这个思路去设计。

OK,CPI理论推导到此结束了。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值