Reinforcement Learning Exercise 4.2

Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions from the original states are unchanged. What, then, is $v_\pi(15)$ for the equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action down from state 13 takes the agent to the new state 15. What is $v_\pi(15)$ for the equiprobable random policy in this case?

For the first case, in which the transitions from the original states are unchanged, equation (4.4) gives:
$$
\begin{aligned}
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl\{ \sum_r \Bigl[ r \cdot p(s', r \mid s, a) \Bigr] + \sum_r \Bigl[ p(s', r \mid s, a) \cdot \gamma v_\pi(s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl\{ \sum_r \Bigl[ r \cdot p(r \mid s', s, a) \cdot p(s' \mid s, a) \Bigr] + p(s' \mid s, a) \cdot \gamma v_\pi(s') \biggr\} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl\{ p(s' \mid s, a) \Bigl[ \sum_r r \cdot p(r \mid s', s, a) + \gamma v_\pi(s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \sum_{s'} P_{s,s'}^a \Bigl[ R_{s,s'}^a + \gamma v_\pi(s') \Bigr]
\end{aligned}
$$
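As a sanity check on the state values used below, the last form of the Bellman equation translates directly into iterative policy evaluation on the original 4×4 gridworld of Example 4.1. Here is a minimal Python sketch; the state numbering is an assumption (cells 0–15 read row by row, with the two shaded corner cells 0 and 15 terminal, so cells 1–14 are the book's states 1–14):

```python
import numpy as np

# Iterative policy evaluation for the 4x4 gridworld of Example 4.1.
# Assumed numbering: cells 0..15 row by row; cells 0 and 15 are terminal.
GAMMA = 1.0                                   # Example 4.1 is undiscounted
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic move; actions that would leave the grid do nothing."""
    row, col = divmod(state, 4)
    new_row, new_col = row + action[0], col + action[1]
    if not (0 <= new_row < 4 and 0 <= new_col < 4):
        return state
    return new_row * 4 + new_col

v = np.zeros(16)
while True:
    delta = 0.0
    for s in range(1, 15):                    # sweep the nonterminal states
        new_v = sum(0.25 * (-1 + GAMMA * v[step(s, a)]) for a in ACTIONS)
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print(v.reshape(4, 4))   # bottom row: -22, -20, -14 for states 12, 13, 14
```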
So,
$$
\begin{aligned}
v_\pi(15) &= \pi(\mathrm{left} \mid 15)\, P_{15,12}^{\mathrm{left}} \Bigl[ R_{15,12}^{\mathrm{left}} + \gamma v_\pi(12) \Bigr] + \pi(\mathrm{up} \mid 15)\, P_{15,13}^{\mathrm{up}} \Bigl[ R_{15,13}^{\mathrm{up}} + \gamma v_\pi(13) \Bigr] \\
&\quad + \pi(\mathrm{right} \mid 15)\, P_{15,14}^{\mathrm{right}} \Bigl[ R_{15,14}^{\mathrm{right}} + \gamma v_\pi(14) \Bigr] + \pi(\mathrm{down} \mid 15)\, P_{15,15}^{\mathrm{down}} \Bigl[ R_{15,15}^{\mathrm{down}} + \gamma v_\pi(15) \Bigr]
\end{aligned}
$$
Because the agent follows the equiprobable random policy, $\pi(a \mid s) = 1/4$ for every action. The transitions are deterministic, so:
$$
P_{s,s'}^a =
\begin{cases}
1 & \text{if } a \text{ leads to } s' \\
0 & \text{if } a \text{ does not lead to } s'
\end{cases}
$$
Because no state's actions lead into the new state 15 in this case, the values of the original states are unchanged. Reading $v_\pi(12) = -22$, $v_\pi(13) = -20$, and $v_\pi(14) = -14$ from Figure 4.1, we have:
$$
\begin{aligned}
v_\pi(15) &= \frac{1}{4} \Bigl\{ \bigl[ -1 + \gamma(-22) \bigr] + \bigl[ -1 + \gamma(-20) \bigr] + \bigl[ -1 + \gamma(-14) \bigr] + \bigl[ -1 + \gamma v_\pi(15) \bigr] \Bigr\} \\
&= -1 - 14\gamma + \frac{1}{4}\gamma v_\pi(15)
\end{aligned}
$$
$$
\therefore v_\pi(15) = \frac{4 + 56\gamma}{\gamma - 4}
$$
In the undiscounted case ($\gamma = 1$) used throughout Example 4.1, this gives $v_\pi(15) = 60/(-3) = -20$.
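The fixed point can also be checked numerically. A minimal sketch, assuming $\gamma = 1$ and the successor values read from the figure above:

```python
# Case 1 check: v_pi(15) is the fixed point of v = -1 - 14*gamma + (gamma/4)*v,
# obtained from the successor values v(12) = -22, v(13) = -20, v(14) = -14.
gamma = 1.0                 # Example 4.1 is undiscounted
v15 = 0.0
for _ in range(1000):       # plain fixed-point iteration of the Bellman backup
    v15 = -1 - 14 * gamma + (gamma / 4) * v15
print(v15)                                  # -> -20.0
print((4 + 56 * gamma) / (gamma - 4))       # closed form above, also -20.0
```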
For the second case, in which the dynamics of state 13 are also changed, assume the values of the other original states keep their Figure 4.1 values (an assumption the solution below turns out to confirm). Then, similarly:
$$
\begin{aligned}
v_\pi(13) &= \pi(\mathrm{left} \mid 13)\, P_{13,12}^{\mathrm{left}} \Bigl[ R_{13,12}^{\mathrm{left}} + \gamma v_\pi(12) \Bigr] + \pi(\mathrm{up} \mid 13)\, P_{13,9}^{\mathrm{up}} \Bigl[ R_{13,9}^{\mathrm{up}} + \gamma v_\pi(9) \Bigr] \\
&\quad + \pi(\mathrm{right} \mid 13)\, P_{13,14}^{\mathrm{right}} \Bigl[ R_{13,14}^{\mathrm{right}} + \gamma v_\pi(14) \Bigr] + \pi(\mathrm{down} \mid 13)\, P_{13,15}^{\mathrm{down}} \Bigl[ R_{13,15}^{\mathrm{down}} + \gamma v_\pi(15) \Bigr] \\
v_\pi(15) &= \pi(\mathrm{left} \mid 15)\, P_{15,12}^{\mathrm{left}} \Bigl[ R_{15,12}^{\mathrm{left}} + \gamma v_\pi(12) \Bigr] + \pi(\mathrm{up} \mid 15)\, P_{15,13}^{\mathrm{up}} \Bigl[ R_{15,13}^{\mathrm{up}} + \gamma v_\pi(13) \Bigr] \\
&\quad + \pi(\mathrm{right} \mid 15)\, P_{15,14}^{\mathrm{right}} \Bigl[ R_{15,14}^{\mathrm{right}} + \gamma v_\pi(14) \Bigr] + \pi(\mathrm{down} \mid 15)\, P_{15,15}^{\mathrm{down}} \Bigl[ R_{15,15}^{\mathrm{down}} + \gamma v_\pi(15) \Bigr]
\end{aligned}
$$
$$
\begin{aligned}
v_\pi(13) &= \frac{1}{4} \Bigl\{ \bigl[ -1 + \gamma(-22) \bigr] + \bigl[ -1 + \gamma(-20) \bigr] + \bigl[ -1 + \gamma(-14) \bigr] + \bigl[ -1 + \gamma v_\pi(15) \bigr] \Bigr\} \\
&= -1 - 14\gamma + \frac{1}{4}\gamma v_\pi(15) \qquad (1) \\
v_\pi(15) &= \frac{1}{4} \Bigl\{ \bigl[ -1 + \gamma(-22) \bigr] + \bigl[ -1 + \gamma v_\pi(13) \bigr] + \bigl[ -1 + \gamma(-14) \bigr] + \bigl[ -1 + \gamma v_\pi(15) \bigr] \Bigr\} \\
&= -1 - 9\gamma + \frac{1}{4}\gamma v_\pi(13) + \frac{1}{4}\gamma v_\pi(15) \qquad (2)
\end{aligned}
$$
Rearranging (1) and (2) gives the system of equations:
$$
\begin{aligned}
v_\pi(13) - \frac{1}{4}\gamma v_\pi(15) &= -1 - 14\gamma \qquad (3) \\
-\frac{1}{4}\gamma v_\pi(13) + \Bigl(1 - \frac{1}{4}\gamma\Bigr) v_\pi(15) &= -1 - 9\gamma \qquad (4)
\end{aligned}
$$
Solving equations (3) and (4), we obtain:
$$
\begin{aligned}
v_\pi(15) &= \frac{14\gamma^2 + 37\gamma + 4}{\frac{1}{4}\gamma^2 + \gamma - 4} = \frac{56\gamma^2 + 148\gamma + 16}{\gamma^2 + 4\gamma - 16} \\
v_\pi(13) &= \frac{-20\gamma^2 + 224\gamma + 16}{\gamma^2 + 4\gamma - 16}
\end{aligned}
$$
For $\gamma = 1$ both reduce to $v_\pi(13) = v_\pi(15) = -20$, so the changed dynamics leave the value function unchanged, confirming the assumption above.
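As a quick numerical check, the $2 \times 2$ system (3)–(4) can be solved directly, e.g. with NumPy. A minimal sketch for the undiscounted case, with unknowns ordered $[v_\pi(13), v_\pi(15)]$:

```python
import numpy as np

# Coefficient matrix and right-hand side of equations (3) and (4)
gamma = 1.0
A = np.array([[1.0,        -gamma / 4],
              [-gamma / 4, 1 - gamma / 4]])
b = np.array([-1 - 14 * gamma, -1 - 9 * gamma])

v13, v15 = np.linalg.solve(A, b)
print(v13, v15)   # -> -20.0 -20.0, matching the closed forms above
```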
