Reinforcement Learning exercise 3.13

Exercise 3.13 Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

First, we derive a useful identity from the multiplication (chain) rule of probability theory:
$$
\begin{aligned}
p(x|y) &= \frac{p(x,y)}{p(y)} \\
&= \frac{\sum_z p(x,y,z)}{p(y)} \\
&= \frac{\sum_z \bigl[ p(x|y,z) \cdot p(z|y) \cdot p(y) \bigr]}{p(y)} \\
&= \sum_z \bigl[ p(x|y,z) \cdot p(z|y) \bigr] \qquad (1)
\end{aligned}
$$
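As a quick sanity check, identity (1) can be verified numerically. The sketch below builds an arbitrary small joint distribution $p(x,y,z)$ (the array shapes and random numbers are made up purely for illustration) and confirms that $\sum_z p(x|y,z)\,p(z|y)$ reproduces $p(x|y)$:

```python
import numpy as np

# A minimal numerical sketch of identity (1), assuming an arbitrary joint
# distribution p(x, y, z) over small finite variables (made-up numbers).
rng = np.random.default_rng(0)
joint = rng.random((3, 2, 4))        # unnormalized p(x, y, z)
joint /= joint.sum()                 # normalize into a proper joint distribution

p_y = joint.sum(axis=(0, 2))                     # p(y),        shape (2,)
p_x_given_y = joint.sum(axis=2) / p_y            # p(x | y),    shape (3, 2)
p_yz = joint.sum(axis=0)                         # p(y, z),     shape (2, 4)
p_z_given_y = p_yz / p_y[:, None]                # p(z | y),    shape (2, 4)
p_x_given_yz = joint / p_yz                      # p(x | y, z), shape (3, 2, 4)

# Right-hand side of (1): sum over z of p(x | y, z) * p(z | y)
rhs = np.einsum('xyz,yz->xy', p_x_given_yz, p_z_given_y)
print(np.allclose(p_x_given_y, rhs))             # expected: True
```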
With formula (1), we can expand $q_\pi(s,a)$ as follows:
$$
\begin{aligned}
q_\pi(s,a) &= \mathbb E_\pi(G_t|S_t=s,A_t=a) \\
&= \mathbb E_\pi(R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a) \\
&= \mathbb E_\pi(R_{t+1} | S_t = s, A_t = a) + \gamma \mathbb E_\pi(G_{t+1}|S_t = s, A_t = a) \\
&= \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=g_{t+1}|S_t=s, A_t=a)
\end{aligned}
$$
Here, by definition, $g_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}$. Using formula (1) to condition on $S_{t+1}$, we can derive:
$$
\begin{aligned}
q_\pi(s,a) &= \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}|S_t=s, A_t=a) \\
&= \sum_r r \cdot \sum_{s'} Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \cdot Pr(S_{t+1} = s' | S_t=s, A_t=a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}|S_t=s, A_t=a, S_{t+1} = s') \cdot Pr(S_{t+1}=s'| S_t=s, A_t = a)
\end{aligned}
$$
Because of the Markov property, once $S_{t+1} = s'$ is known, the return $G_{t+1}$ depends only on what happens from state $s'$ onward; conditioning additionally on $S_t = s$ and $A_t = a$ has no effect on the distribution of $G_{t+1}$. So:
$$
\begin{aligned}
q_\pi(s,a) &= \sum_r r \cdot \sum_{s'} Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \cdot Pr(S_{t+1} = s' | S_t=s, A_t=a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} | S_{t+1} = s') \cdot Pr(S_{t+1}=s'| S_t=s, A_t = a) \\
&= \sum_{s'} \biggl\{ Pr(S_{t+1} = s' | S_t=s, A_t=a) \cdot \Bigl[ \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} | S_{t+1} = s') \Bigr] \biggr\} \\
&= \sum_{s'} \biggl\{ p(s'| s,a) \cdot \Bigl[ \mathbb E_\pi(r|s,a,s') + \gamma \cdot \mathbb E_\pi(G_{t+1}|S_{t+1}=s') \Bigr] \biggr\} \\
&= \sum_{s'} \biggl\{ p(s'| s,a) \cdot \Bigl[ \mathbb E_\pi(r|s,a,s') + \gamma \cdot v_\pi(s') \Bigr] \biggr\}
\end{aligned}
$$
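The key step above is dropping $S_t$ and $A_t$ from the condition on $G_{t+1}$. A small Monte Carlo sketch can illustrate it: on a toy two-state MDP (all transition probabilities, rewards, and the policy below are hypothetical), the sample average of $g_{t+1}$, grouped by $(s, a, s')$, agrees up to sampling noise across all $(s,a)$ pairs that lead to the same $s'$:

```python
import numpy as np

# Monte Carlo sketch of the step above: conditioned on S_{t+1} = s', the
# average of g_{t+1} should not depend on which (S_t, A_t) led to s'.
# The toy 2-state MDP and policy below are hypothetical.
rng = np.random.default_rng(0)
gamma, T = 0.9, 30                              # discount, rollout truncation
P = np.array([[[0.8, 0.2], [0.2, 0.8]],         # P[s, a, s'] = p(s' | s, a)
              [[0.6, 0.4], [0.2, 0.8]]])
R = np.array([[[0.0, 1.0], [0.5, 0.0]],         # R[s, a, s'] = reward on (s, a, s')
              [[1.0, 0.0], [0.0, 2.0]]])
pi = np.array([[0.5, 0.5], [0.3, 0.7]])         # pi(a | s)

returns = {}                                    # (s, a, s') -> sampled g_{t+1} values
for _ in range(20_000):
    s = int(rng.integers(2))
    a = int(rng.choice(2, p=pi[s]))
    s1 = int(rng.choice(2, p=P[s, a]))
    # Roll out T further steps from s1 and accumulate the discounted return g_{t+1}.
    g, st, disc = 0.0, s1, 1.0
    for _ in range(T):
        at = rng.choice(2, p=pi[st])
        nxt = rng.choice(2, p=P[st, at])
        g += disc * R[st, at, nxt]
        disc *= gamma
        st = nxt
    returns.setdefault((s, a, s1), []).append(g)

# Entries that share the same s' should agree up to Monte Carlo noise,
# regardless of the (s, a) pair that produced s'.
for key in sorted(returns):
    print(key, round(float(np.mean(returns[key])), 2))
```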
Denote $p(s'|s,a) = P_{s,s'}^a$ and $\mathbb E_\pi(r|s,a,s') = R_{s,s'}^a$; then
$$
q_\pi(s,a) = \sum_{s'} \biggl\{ P_{s,s'}^a \cdot \Bigl[ R_{s,s'}^a + \gamma \cdot v_\pi(s') \Bigr] \biggr\} \qquad (2)
$$
Here, equation (2) is the result.
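The exercise asks for the answer in terms of the four-argument $p(s', r | s, a)$. Since $P_{s,s'}^a = \sum_r p(s',r|s,a)$ and $P_{s,s'}^a \cdot R_{s,s'}^a = \sum_r r \cdot p(s',r|s,a)$, equation (2) can be written equivalently as

$$
q_\pi(s,a) = \sum_{s'} \sum_{r} p(s', r | s, a) \bigl[ r + \gamma \cdot v_\pi(s') \bigr]
$$

As a final sanity check, here is a minimal numerical sketch on a hypothetical 2-state, 2-action MDP (all numbers below are invented): it computes $v_\pi$ exactly, evaluates equation (2), and compares the result against $q_\pi$ obtained by iterating the Bellman equation for action values directly:

```python
import numpy as np

# Minimal sketch: verify equation (2) / the four-argument form on a
# hypothetical 2-state, 2-action MDP (all numbers below are invented).
gamma = 0.9
n_s, n_a = 2, 2
rewards = np.array([0.0, 1.0])                      # possible reward values r

# Four-argument dynamics: p[s, a, s', r_idx] = p(s', r | s, a)
p = np.zeros((n_s, n_a, n_s, len(rewards)))
p[0, 0] = [[0.7, 0.1], [0.1, 0.1]]
p[0, 1] = [[0.2, 0.0], [0.3, 0.5]]
p[1, 0] = [[0.4, 0.2], [0.3, 0.1]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]
pi = np.array([[0.5, 0.5], [0.3, 0.7]])             # policy pi(a | s)

# Shorthands used in equation (2): P[s, a, s'] and R[s, a, s']
P = p.sum(axis=3)                                   # p(s' | s, a)
R = (p * rewards).sum(axis=3) / P                   # E[R_{t+1} | s, a, s']

# v_pi from the Bellman equation v = r_pi + gamma * P_pi v, solved exactly
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sat,sat->s', pi, P, R)
v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

# Equation (2): q(s, a) = sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * v[s'])
q_eq2 = np.einsum('sat,sat->sa', P, R + gamma * v)

# Independent check: iterate the Bellman equation for q_pi directly,
# q(s,a) = sum_{s',r} p(s',r|s,a) * (r + gamma * sum_{a'} pi(a'|s') q(s',a'))
q = np.zeros((n_s, n_a))
for _ in range(500):
    target = rewards[None, :] + gamma * (pi * q).sum(axis=1)[:, None]
    q = np.einsum('satr,tr->sa', p, target)

print(np.allclose(q_eq2, q))                        # expected: True
```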
