Reinforcement Learning 1: Policy Iteration and Value Iteration (Part 1)

We now turn to the MDP in the case $T=\infty$. Let us restate our model assumptions:

  1. The action space and the state space are both finite sets.
  2. State transitions satisfy the Markov property:
    $$\begin{aligned} &P(s_{t+1}=s_{t+1}'\mid a_t=a_t',s_t=s_t',a_{t-1}=a_{t-1}',\cdots,s_0=s_0')\\ =\,&P(s_{t+1}=s_{t+1}'\mid a_t=a_t',s_t=s_t') \end{aligned}$$
    The Markov property is equivalent to saying that, conditional on the current state and action, the future state is independent of the past states and actions:
    $$\begin{aligned} &P(s_{t+1}=s_{t+1}',a_{t-1}=a_{t-1}',\cdots,s_0=s_0'\mid a_t=a_t',s_t=s_t')\\ =\,&P(s_{t+1}=s_{t+1}'\mid a_t=a_t',s_t=s_t')\,P(a_{t-1}=a_{t-1}',\cdots,s_0=s_0'\mid a_t=a_t',s_t=s_t') \end{aligned}$$
  3. State transitions are time-homogeneous:
    $$P(s_{t+1}=s'\mid a_t=a,s_t=s)=P(s_1=s'\mid a_0=a,s_0=s)$$
    We call this probability the transition probability and write it as $p(s'\mid a,s)$. It is determined by the environment rather than by the policy, so it carries no superscript $\pi$.
  4. We could of course specify a sequence of policies $\{\pi_1,\pi_2,\cdots\}$ and optimize over such sequences, but that is inconvenient to solve. Here we assume the policy is stationary, i.e. the same policy $\pi$ is used at every stage; in other words, $\pi=\{\pi,\pi,\cdots\}$, and our goal is to find the optimal stationary policy:
    $$P^\pi(a_t=a\mid s_t=s)=\pi_t(a\mid s)=\pi_0(a\mid s)\quad\forall s\in\mathcal{S},\ a\in\mathcal{A},\ t=0,1,2,\cdots$$
    Here $\mathcal{A}=\cup_{s\in\mathcal{S}}\mathcal{A}(s)$; without loss of generality we let all states share a single action space $\mathcal{A}$ and simply require the policy to assign probability 0 to actions that cannot be chosen in a given state. (A concrete representation of this setup is sketched right after this list.)
  5. The action at any time depends only on the state at that time and the chosen policy, not on the history of states and actions:
    $$\begin{aligned} &P^\pi(a_{t+1}=a\mid s_{t+1}=s,a_t=a_t',s_t=s_t',\cdots,s_0=s_0')\\ =\,&P^\pi(a_{t+1}=a\mid s_{t+1}=s)=\pi(a\mid s) \end{aligned}$$
    Likewise, this assumption is equivalent to saying that, conditional on the current state, the current action is independent of the past states and actions:
    $$\begin{aligned} &P^\pi(a_{t+1}=a,a_t=a_t',s_t=s_t',\cdots,s_0=s_0'\mid s_{t+1}=s)\\ =\,&P^\pi(a_{t+1}=a\mid s_{t+1}=s)\,P^\pi(a_t=a_t',s_t=s_t',\cdots,s_0=s_0'\mid s_{t+1}=s) \end{aligned}$$
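To make these assumptions concrete, here is a minimal sketch in Python/NumPy (the toy arrays and their numbers are illustrative assumptions, not taken from the text): the transition kernel $p(s'\mid a,s)$ is stored as a 3-D array and a stationary policy $\pi(a\mid s)$ as a 2-D array, with each conditional distribution summing to 1.

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (numbers are illustrative).
# p[s, a, s2] = p(s2 | a, s): the environment's transition kernel (Assumption 3).
p = np.array([
    [[0.9, 0.1],   # from state 0 under action 0
     [0.2, 0.8]],  # from state 0 under action 1
    [[0.5, 0.5],   # from state 1 under action 0
     [0.0, 1.0]],  # from state 1 under action 1
])

# pi[s, a] = pi(a | s): a stationary stochastic policy (Assumption 4).
pi = np.array([
    [0.7, 0.3],
    [0.4, 0.6],
])

# Sanity checks: every conditional distribution sums to 1.
assert np.allclose(p.sum(axis=2), 1.0)
assert np.allclose(pi.sum(axis=1), 1.0)
```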

Under the above assumptions, the probability of any finite-length trajectory is determined:
$$\begin{aligned} &P^\pi(s_T,a_{T-1},s_{T-1},\cdots,a_0,s_0)\\ =\,&P^\pi(s_T\mid a_{T-1},s_{T-1},\cdots,a_0,s_0)\,P^\pi(a_{T-1},s_{T-1},\cdots,a_0,s_0)\\ =\,&p(s_T\mid a_{T-1},s_{T-1})\,P^\pi(a_{T-1}\mid s_{T-1},\cdots,a_0,s_0)\,P^\pi(s_{T-1},\cdots,a_0,s_0)\\ =\,&p(s_T\mid a_{T-1},s_{T-1})\,\pi(a_{T-1}\mid s_{T-1})\,P^\pi(s_{T-1},\cdots,a_0,s_0) \end{aligned}$$
and the remaining factor $P^\pi(s_{T-1},\cdots,a_0,s_0)$ is expanded recursively in the same way, as the sketch below checks numerically.
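A minimal sketch of this recursion: it multiplies out $\pi(a_t\mid s_t)$ and $p(s_{t+1}\mid a_t,s_t)$ along a trajectory. The toy arrays `p`, `pi` and the initial distribution `mu0` are assumptions for illustration only.

```python
import numpy as np

# Toy MDP (illustrative numbers): p[s, a, s2] = p(s2|a,s), pi[s, a] = pi(a|s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])  # assumed distribution of the initial state s_0

def trajectory_prob(states, actions):
    """P^pi(s_0, a_0, s_1, ..., a_{T-1}, s_T) via the recursive factorization."""
    prob = mu0[states[0]]
    for t, a in enumerate(actions):
        prob *= pi[states[t], a]                # pi(a_t | s_t)
        prob *= p[states[t], a, states[t + 1]]  # p(s_{t+1} | a_t, s_t)
    return prob

print(trajectory_prob(states=[0, 0, 1], actions=[0, 1]))  # 1.0 * 0.7*0.9 * 0.3*0.8 = 0.1512
```

From this factorization we also obtain the following corollaries: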

  1. Given a policy $\pi$, the next state depends only on the current state, not on past states and actions:
    $$\begin{aligned} &P^\pi(s_{t+1}=s_{t+1}'\mid s_t=s_t',a_{t-1}=a_{t-1}',\cdots,s_0=s_0')\\ =\,&P^\pi(s_{t+1}=s_{t+1}'\mid s_t=s_t') \end{aligned}$$

Proof:
$$\begin{aligned} &P^\pi(s_{t+1}=s_{t+1}'\mid s_t=s_t',a_{t-1}=a_{t-1}',\cdots,s_0=s_0')\\ =\,&\sum_{a\in\mathcal{A}}P^\pi(s_{t+1}=s_{t+1}'\mid a_t=a,s_t=s_t',a_{t-1}=a_{t-1}',\cdots,s_0=s_0')\\ &\qquad\times P^\pi(a_t=a\mid s_t=s_t',a_{t-1}=a_{t-1}',\cdots,s_0=s_0')\\ =\,&\sum_{a\in\mathcal{A}}P^\pi(s_{t+1}=s_{t+1}'\mid a_t=a,s_t=s_t')\,P^\pi(a_t=a\mid s_t=s_t')\\ =\,&P^\pi(s_{t+1}=s_{t+1}'\mid s_t=s_t') \end{aligned}$$
The first factor inside the sum is simplified using Assumption 2 (the Markov property) and the second using Assumption 5 (the action depends only on the current state).

The proof above also shows how to compute this probability:
$$P^\pi(s_{t+1}=s_{t+1}'\mid s_t=s_t')=\sum_{a\in\mathcal{A}}p(s_{t+1}'\mid a,s_t')\,\pi(a\mid s_t')$$
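In other words, under a fixed policy the state process is itself a time-homogeneous Markov chain with kernel $P^\pi(s'\mid s)=\sum_a p(s'\mid a,s)\,\pi(a\mid s)$. As a sketch (same illustrative toy arrays as above), this is a single contraction over the action axis:

```python
import numpy as np

p = np.array([[[0.9, 0.1], [0.2, 0.8]],   # p[s, a, s2] = p(s2 | a, s)
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3],                # pi[s, a] = pi(a | s)
               [0.4, 0.6]])

# P_pi[s, s2] = sum_a pi(a|s) * p(s2|a,s): transition matrix of the induced chain.
P_pi = np.einsum('sa,sab->sb', pi, p)
assert np.allclose(P_pi.sum(axis=1), 1.0)  # each row is again a distribution
print(P_pi)
```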

  2. For $T\geq t+2$ and a given policy $\pi$,
    $$\begin{aligned} &P^\pi(a_T=a_T',s_T=s_T',\cdots,a_{t+2}=a_{t+2}',s_{t+2}=s_{t+2}'\mid a_{t+1}=a_{t+1}',s_{t+1}=s_{t+1}',\cdots,a_0=a_0',s_0=s_0')\\ =\,&P^\pi(a_T=a_T',s_T=s_T',\cdots,a_{t+2}=a_{t+2}',s_{t+2}=s_{t+2}'\mid a_{t+1}=a_{t+1}',s_{t+1}=s_{t+1}')\\ =\,&P^\pi(a_{T-t-1}=a_T',s_{T-t-1}=s_T',\cdots,a_1=a_{t+2}',s_1=s_{t+2}'\mid a_0=a_{t+1}',s_0=s_{t+1}') \end{aligned}$$
    The first equality is a generalization of the Markov property and the second a generalization of time-homogeneity. The proof can be carried out by induction; it is tedious, so we omit it here.

  3. Similarly,
    $$\begin{aligned} &P^\pi(a_T=a_T',s_T=s_T',\cdots,a_{t+2}=a_{t+2}',s_{t+2}=s_{t+2}'\mid s_{t+1}=s_{t+1}',\cdots,a_0=a_0',s_0=s_0')\\ =\,&P^\pi(a_T=a_T',s_T=s_T',\cdots,a_{t+2}=a_{t+2}',s_{t+2}=s_{t+2}'\mid s_{t+1}=s_{t+1}')\\ =\,&P^\pi(a_{T-t-1}=a_T',s_{T-t-1}=s_T',\cdots,a_1=a_{t+2}',s_1=s_{t+2}'\mid s_0=s_{t+1}') \end{aligned}$$
    Again, the first equality generalizes the Markov property and the second generalizes time-homogeneity.

State-Value Function and State-Action Value Function

Given a policy $\pi$, the state-value function of state $s$ is defined as
$$V_t^\pi(s)=E^\pi\left[\sum_{i=0}^\infty\gamma^i R(s_{t+i},a_{t+i},s_{t+i+1})\,\bigg|\,s_t=s\right]$$
It is not hard to show that $V_t^\pi(s)$ is also time-homogeneous, i.e. $V_t^\pi(s)=V_0^\pi(s)$ for $t=0,1,\cdots$ and every $s\in\mathcal{S}$, so we may drop the time subscript.

Proof:
Since $\gamma\in(0,1)$ and the reward function is bounded, the bounded convergence theorem gives
$$\begin{aligned} V_t^\pi(s)=&E^\pi\left[\sum_{i=0}^\infty\gamma^i R(s_{t+i},a_{t+i},s_{t+i+1})\,\bigg|\,s_t=s\right]\\ =&\sum_{i=0}^\infty\gamma^i E^\pi\left[R(s_{t+i},a_{t+i},s_{t+i+1})\,\big|\,s_t=s\right] \end{aligned}$$
For each term,
$$\begin{aligned} &E^\pi\left[R(s_{t+i},a_{t+i},s_{t+i+1})\,\big|\,s_t=s\right]\\ =&\sum_{s',s''\in\mathcal{S},\,a\in\mathcal{A}}P^\pi(s_{t+i}=s',a_{t+i}=a,s_{t+i+1}=s''\mid s_t=s)\,R(s',a,s'')\\ =&\sum_{s',s''\in\mathcal{S},\,a\in\mathcal{A}}P^\pi(s_i=s',a_i=a,s_{i+1}=s''\mid s_0=s)\,R(s',a,s'')\\ =&E^\pi\left[R(s_i,a_i,s_{i+1})\,\big|\,s_0=s\right] \end{aligned}$$
where the second equality follows from Corollary 3 (summing over the remaining trajectory variables). Therefore
$$\begin{aligned} V_t^\pi(s)=&\sum_{i=0}^\infty\gamma^i E^\pi\left[R(s_{t+i},a_{t+i},s_{t+i+1})\,\big|\,s_t=s\right]\\ =&\sum_{i=0}^\infty\gamma^i E^\pi\left[R(s_i,a_i,s_{i+1})\,\big|\,s_0=s\right]\\ =&V_0^\pi(s) \end{aligned}$$
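The definition of $V^\pi$ can also be read directly as an expected discounted return, which the following Monte Carlo sketch estimates by sampling rollouts. The toy MDP and the reward array `R[s, a, s']` are assumptions for illustration, and the infinite sum is truncated at a horizon where $\gamma^i$ is negligible.

```python
import numpy as np

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
R = np.random.default_rng(0).normal(size=(2, 2, 2))  # R[s, a, s2], illustrative rewards
gamma, rng = 0.9, np.random.default_rng(1)

def rollout_return(s, horizon=200):
    """One sampled discounted return sum_i gamma^i R(s_i, a_i, s_{i+1}) from s_0 = s."""
    g = 0.0
    for i in range(horizon):  # gamma**200 is tiny, so the truncation error is negligible
        a = rng.choice(2, p=pi[s])
        s_next = rng.choice(2, p=p[s, a])
        g += gamma ** i * R[s, a, s_next]
        s = s_next
    return g

# Monte Carlo estimate of V^pi(0); by time-homogeneity the start time is irrelevant.
print(np.mean([rollout_return(0) for _ in range(2000)]))
```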

With time-homogeneity established, the Bellman equation simplifies to
$$V^\pi(s)=E^\pi\left[R(s,a_0,s_1)+\gamma V^\pi(s_1)\,\big|\,s_0=s\right]$$
Expanded, this reads
$$V^\pi(s)=E^\pi\left[R(s,a_0,s_1)\,\big|\,s_0=s\right]+\gamma\sum_{s'\in\mathcal{S}}P^\pi(s_1=s'\mid s_0=s)\,V^\pi(s')$$
which is a system of linear equations in the values $V^\pi(s)$; the number of unknowns equals the number of states in the state space.
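Because the unknowns enter linearly, $V^\pi$ can be obtained by one direct solve of $(I-\gamma P^\pi)V^\pi=r^\pi$, where $r^\pi(s)=E^\pi[R(s,a_0,s_1)\mid s_0=s]$. A minimal sketch under the same toy assumptions (`p`, `pi`, `R` are illustrative):

```python
import numpy as np

p = np.array([[[0.9, 0.1], [0.2, 0.8]],   # p[s, a, s2] = p(s2 | a, s)
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
R = np.random.default_rng(0).normal(size=(2, 2, 2))  # R[s, a, s2]
gamma = 0.9

# r_pi[s] = E^pi[ R(s, a_0, s_1) | s_0 = s ] = sum_{a, s2} pi(a|s) p(s2|a,s) R(s, a, s2)
r_pi = np.einsum('sa,sab,sab->s', pi, p, R)
# P_pi[s, s2] = sum_a pi(a|s) p(s2|a,s)
P_pi = np.einsum('sa,sab->sb', pi, p)

# Bellman equation: V = r_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = r_pi
V = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(V)
```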

In the same way, the state-action value function is also time-homogeneous, and
$$\begin{aligned} &Q^\pi(a,s)=E^\pi\left[R(s,a,s_1)\,\big|\,s_0=s,a_0=a\right]+\gamma\sum_{s'\in\mathcal{S}}p(s'\mid a,s)\,V^\pi(s')\\ &V^\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\,Q^\pi(a,s) \end{aligned}$$
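Given $V^\pi$, the first identity yields $Q^\pi$ in a single weighted sum over next states, and the second identity serves as a consistency check. Continuing the same illustrative example:

```python
import numpy as np

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
R = np.random.default_rng(0).normal(size=(2, 2, 2))
gamma = 0.9

P_pi = np.einsum('sa,sab->sb', pi, p)
r_pi = np.einsum('sa,sab,sab->s', pi, p, R)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Q[s, a] = sum_{s2} p(s2|a,s) * (R(s, a, s2) + gamma * V(s2))
Q = np.einsum('sab,sab->sa', p, R + gamma * V[None, None, :])
# Consistency check: V(s) = sum_a pi(a|s) Q(a, s)
assert np.allclose((pi * Q).sum(axis=1), V)
```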

Now suppose we have two policies $\pi'$ and $\pi''$. We define $\pi'\geq\pi''$ to mean $V^{\pi'}(s)\geq V^{\pi''}(s)$ for every $s\in\mathcal{S}$. Our goal is to find an optimal policy $\pi^\star$ such that $\pi^\star\geq\pi$ for all $\pi$; its value function is denoted $V^*(s)$.
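Since this ordering is a statewise comparison of value functions, whether $\pi'\geq\pi''$ holds can be checked directly once both policies have been evaluated. A sketch under the same toy assumptions (the two deterministic policies below are hypothetical):

```python
import numpy as np

def evaluate(pi, p, R, gamma=0.9):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum('sa,sab->sb', pi, p)
    r_pi = np.einsum('sa,sab,sab->s', pi, p, R)
    return np.linalg.solve(np.eye(p.shape[0]) - gamma * P_pi, r_pi)

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.random.default_rng(0).normal(size=(2, 2, 2))

pi1 = np.array([[1.0, 0.0], [1.0, 0.0]])  # always take action 0
pi2 = np.array([[0.0, 1.0], [0.0, 1.0]])  # always take action 1

V1, V2 = evaluate(pi1, p, R), evaluate(pi2, p, R)
print(np.all(V1 >= V2))  # True exactly when pi1 >= pi2 in the statewise sense
```

Note that the comparison may fail in both directions: two arbitrary policies need not be comparable under this partial order.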
