【RL】MDP(2)

Previous post:

MDP(1)

Policy Iteration

Policy Evaluation

Matrix Method

Given the dynamics $\mathcal{P}(s', r\mid s, a)$ and a policy $\pi$, we want to compute $q_\pi(s, a)$ (equivalently, the state-value function $V_\pi(s)$).

Stack the values of all $|\mathcal{S}|$ states into a vector:

$$V_\pi=\begin{pmatrix} V_\pi(s_1)\\ V_\pi(s_2)\\ \vdots\\ V_\pi(s_{|\mathcal{S}|}) \end{pmatrix}_{|\mathcal{S}|\times 1}$$

From the definition of $V_\pi(s)$ we have

$$\begin{aligned} V_\pi(s)&=\mathbb{E}_\pi[G_t\mid S_t=s] \\ &=\mathbb{E}_\pi\left[R_{t+1}+\gamma V_{\pi}(S_{t+1})\mid S_t=s\right] \\ &=\sum_a\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)\,[r+\gamma V_\pi(s')] \end{aligned}$$

Expanding gives

$$V_\pi(s)=\underbrace{\sum_a\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)\,r}_{(A)}+\underbrace{\gamma\sum_{a}\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)\,V_\pi(s')}_{(B)}$$
Term $(A)$ can be written as

$$\sum_{a}\pi(a\mid s)\sum_{r}r\,\mathcal{P}(r\mid s, a)=\sum_{a}\pi(a\mid s)\,\mathbb{E}[R_{t+1}\mid S_t=s, A_t=a]$$

Defining

$$r(s, a)\triangleq\mathbb{E}[R_{t+1}\mid S_t=s, A_t=a]$$

we obtain

$$(A)=\sum_a\pi(a\mid s)\,r(s, a)\triangleq r_\pi(s)$$

Introduce the vector $r_\pi$:

$$r_\pi=\begin{pmatrix} r_\pi(s_1)\\ r_\pi(s_2)\\ \vdots\\ r_\pi(s_{|\mathcal{S}|}) \end{pmatrix}_{|\mathcal{S}|\times 1}$$
Term $(B)$ equals

$$\begin{aligned} (B)&=\gamma\sum_a\pi(a\mid s)\sum_{s'}\mathcal{P}(s'\mid s, a)\,V_\pi(s') \\ &=\gamma\sum_{s'}\underbrace{\sum_a\pi(a\mid s)\,\mathcal{P}(s'\mid s, a)}_{P_\pi(s, s')}V_\pi(s')\\ &=\gamma\sum_{s'}P_\pi(s, s')\,V_\pi(s') \end{aligned}$$

With the state-transition matrix $P_\pi\triangleq[P_\pi(s, s')]$ under $\pi$, the Bellman equation becomes

$$V_\pi=r_\pi+\gamma P_\pi V_\pi$$

Solving for $V_\pi$ gives

$$V_\pi=(\mathbf{I}-\gamma P_\pi)^{-1}r_\pi$$

Inverting (or solving the linear system) costs $\mathcal{O}(|\mathcal{S}|^3)$ time.
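As a concrete illustration, here is a minimal NumPy sketch of the matrix method. The arrays `P`, `r`, and `pi` are hypothetical tabular inputs (not from any particular library), indexed as described in the docstring.

```python
import numpy as np

def policy_evaluation_matrix(P, r, pi, gamma=0.9):
    """Solve V_pi = (I - gamma * P_pi)^{-1} r_pi for a tabular MDP.

    P  : (|S|, |A|, |S|) array, P[s, a, s'] = P(s' | s, a)
    r  : (|S|, |A|)      array, r[s, a]     = E[R_{t+1} | S_t=s, A_t=a]
    pi : (|S|, |A|)      array, pi[s, a]    = pi(a | s)
    """
    # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a);  r_pi[s] = sum_a pi(a|s) r(s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    n = P_pi.shape[0]
    # Solving the |S| x |S| linear system is the O(|S|^3) step.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```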

Iterative Method

Given $\pi$, compute $V_\pi$ from the Bellman expectation equation

$$V_\pi(s)=\sum_a\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)\,[r+\gamma V_\pi(s')]$$

Construct a sequence $\{V_k\}_{k=1}^\infty\to V_\pi$ with the iteration

$$V_{k+1}(s)\triangleq \sum_a\pi(a\mid s)\sum_{s', r}\mathcal{P}(s', r\mid s, a)\,[r+\gamma V_k(s')]$$
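A minimal sketch of this iterative policy evaluation, using the same hypothetical tabular arrays `P`, `r`, and `pi` as above; the tolerance and iteration cap are illustrative choices.

```python
import numpy as np

def policy_evaluation_iterative(P, r, pi, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate V_{k+1}(s) = sum_a pi(a|s) [ r(s, a) + gamma * sum_{s'} P(s'|s, a) V_k(s') ]."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # q[s, a] = r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        q = r + gamma * np.einsum("sat,t->sa", P, V)
        V_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(V_new - V)) < tol:   # stop once the update is small enough
            return V_new
        V = V_new
    return V
```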

Policy Improvement

Having computed $q_\pi(s, a)$, construct a new policy $\pi'$ such that $q_{\pi'}(s, a)\geq q_\pi(s, a)$.

Policy improvement theorem: given policies $\pi$ and $\pi'$, if $q_\pi(s, \pi'(s))\geq V_\pi(s)$ for all $s\in S$, then $V_{\pi'}(s)\geq V_\pi(s)$ for all $s\in S$.

Proof

Starting from the definition of $q_\pi$ and the assumption $q_\pi(s, \pi'(s))\geq V_\pi(s)$,

$$\begin{aligned} V_\pi(s)&\leq q_\pi(s, \pi'(s))\\ &=\sum_{s', r}\mathcal{P}(s', r\mid s, \pi'(s))\,[r+\gamma V_\pi(s')] \\ &=\mathbb{E}[R_{t+1}+\gamma V_\pi(S_{t+1})\mid S_t=s, A_t=\pi'(s)] \\ &=\mathbb{E}_{\pi'}[R_{t+1}+\gamma V_\pi(S_{t+1})\mid S_t=s] \\ &\leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma q_\pi(S_{t+1}, \pi'(S_{t+1}))\mid S_t=s] \\ &=\mathbb{E}_{\pi'}[R_{t+1}+\gamma\,\mathbb{E}_{\pi'}[R_{t+2}+\gamma V_\pi(S_{t+2})\mid S_{t+1}]\mid S_t=s] \\ &=\mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2V_\pi(S_{t+2})\mid S_t=s]\\ &\;\;\vdots\\ &\leq \mathbb{E}_{\pi'}[\underbrace{R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots}_{G_t}\mid S_t=s]\\ &=V_{\pi'}(s) \end{aligned}$$

Greedy policy: $\forall s\in S,\ \pi'(s)=\arg\max_a q_\pi(s, a)$.

For this $\pi'$,

$$V_\pi(s)\leq \max_a q_\pi(s, a)=q_\pi(s, \pi'(s))$$

so by the policy improvement theorem

$$\forall s\in S,\quad V_{\pi'}(s)\geq V_\pi(s)$$
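A minimal sketch of greedy policy improvement, assuming the same hypothetical tabular `P` and `r` and a previously evaluated value vector `V`:

```python
import numpy as np

def greedy_policy(P, r, V, gamma=0.9):
    """Return the deterministic greedy policy pi'(s) = argmax_a q(s, a) as action indices."""
    # q[s, a] = r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
    q = r + gamma * np.einsum("sat,t->sa", P, V)
    return np.argmax(q, axis=1)
```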

If $V_{\pi'}=V_\pi$, then $V_{\pi'}=V_\pi=V_*$.

Proof

$V_{\pi'}=V_{\pi}\Rightarrow q_{\pi'}=q_{\pi}$, so for every $s\in S$

$$\begin{aligned} V_{\pi'}(s)&=\sum_{a}\pi'(a\mid s)\,q_{\pi'}(s, a)=\sum_{a}\pi'(a\mid s)\,q_{\pi}(s, a)\\ &=q_\pi(s, \pi'(s))\\ &=\max_a q_\pi(s, a)\\ &=\max_a \sum_{s', r}\mathcal{P}(s', r\mid s, a)\,[r+\gamma V_{\pi'}(s')] \end{aligned}\tag{1}$$

Equation $(1)$ is exactly the Bellman optimality equation, so $V_{\pi'}=V_\pi=V_*$.
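Putting the two steps together, here is a minimal sketch of the full policy iteration loop, reusing the `policy_evaluation_matrix` and `greedy_policy` sketches above (all names are illustrative). It stops when the greedy policy no longer changes, which by the argument above means $V_{\pi'}=V_\pi=V_*$.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = r.shape
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial deterministic policy
    while True:
        pi = np.eye(n_actions)[policy]          # one-hot encoding of pi(a|s)
        V = policy_evaluation_matrix(P, r, pi, gamma)
        new_policy = greedy_policy(P, r, V, gamma)
        if np.array_equal(new_policy, policy):  # V_{pi'} = V_pi  =>  V_pi = V_*
            return policy, V
        policy = new_policy
```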

Value Iteration

Value iteration (VI) is the extreme case of policy iteration in which policy evaluation is truncated to a single sweep before each improvement:

$$V_{k+1}(s)=\max_a\sum_{s', r}\mathcal{P}(s', r\mid s, a)\,[r+\gamma V_k(s')]$$
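A minimal value-iteration sketch under the same hypothetical tabular inputs; tolerance and iteration cap are illustrative:

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate V_{k+1}(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s'|s, a) V_k(s') ]."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        q = r + gamma * np.einsum("sat,t->sa", P, V)   # q_k(s, a)
        V_new = q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, np.argmax(q, axis=1)   # near-optimal values and the greedy policy
```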
