Reinforcement Learning Exercise 4.2

Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions from the original states are unchanged. What, then, is $v_\pi(15)$ for the equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action down from state 13 takes the agent to the new state 15. What is $v_\pi(15)$ for the equiprobable random policy in this case?

For the first case, in which the transitions from the original states are unchanged, equation (4.4) gives:
$$
\begin{aligned}
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl\{ \sum_r \Bigl[ r \cdot p(s', r \mid s, a) \Bigr] + \sum_r \Bigl[ p(s', r \mid s, a) \cdot \gamma v_\pi(s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl\{ \sum_r \Bigl[ r \cdot p(r \mid s', s, a) \cdot p(s' \mid s, a) \Bigr] + p(s' \mid s, a) \cdot \gamma v_\pi(s') \biggr\} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl\{ p(s' \mid s, a) \Bigl[ \sum_r r \cdot p(r \mid s', s, a) + \gamma v_\pi(s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \sum_{s'} P_{s,s'}^a \Bigl[ R_{s,s'}^a + \gamma v_\pi(s') \Bigr]
\end{aligned}
$$
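As a sanity check on the state values used below, the last form of the Bellman equation translates directly into iterative policy evaluation on the original 4×4 gridworld of Example 4.1. Here is a minimal Python sketch; the state numbering is an assumption (cells 0–15 read row by row, with the two shaded corner cells 0 and 15 terminal, so cells 1–14 are the book's states 1–14):

```python
import numpy as np

# Iterative policy evaluation for the 4x4 gridworld of Example 4.1.
# Assumed numbering: cells 0..15 row by row; cells 0 and 15 are terminal.
GAMMA = 1.0                                   # Example 4.1 is undiscounted
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic move; actions that would leave the grid do nothing."""
    row, col = divmod(state, 4)
    new_row, new_col = row + action[0], col + action[1]
    if not (0 <= new_row < 4 and 0 <= new_col < 4):
        return state
    return new_row * 4 + new_col

v = np.zeros(16)
while True:
    delta = 0.0
    for s in range(1, 15):                    # sweep the nonterminal states
        new_v = sum(0.25 * (-1 + GAMMA * v[step(s, a)]) for a in ACTIONS)
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print(v.reshape(4, 4))   # bottom row: -22, -20, -14 for states 12, 13, 14
```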
So,
$$
\begin{aligned}
v_\pi(15) &= \pi(\mathrm{left} \mid 15)\, P_{15,12}^{\mathrm{left}} \Bigl[ R_{15,12}^{\mathrm{left}} + \gamma v_\pi(12) \Bigr] + \pi(\mathrm{up} \mid 15)\, P_{15,13}^{\mathrm{up}} \Bigl[ R_{15,13}^{\mathrm{up}} + \gamma v_\pi(13) \Bigr] \\
&\quad + \pi(\mathrm{right} \mid 15)\, P_{15,14}^{\mathrm{right}} \Bigl[ R_{15,14}^{\mathrm{right}} + \gamma v_\pi(14) \Bigr] + \pi(\mathrm{down} \mid 15)\, P_{15,15}^{\mathrm{down}} \Bigl[ R_{15,15}^{\mathrm{down}} + \gamma v_\pi(15) \Bigr]
\end{aligned}
$$
Because the agent follows the equiprobable random policy, $\pi(a \mid s) = 1/4$ for every action. The transitions are deterministic, so:
$$
P_{s,s'}^a =
\begin{cases}
1 & \text{if } a \text{ leads to } s' \\
0 & \text{if } a \text{ does not lead to } s'
\end{cases}
$$
Because no state's actions lead into the new state 15 in this case, the values of the original states are unchanged. Reading $v_\pi(12) = -22$, $v_\pi(13) = -20$, and $v_\pi(14) = -14$ from Figure 4.1, we have:
$$
\begin{aligned}
v_\pi(15) &= \frac{1}{4} \Bigl\{ \bigl[ -1 + \gamma(-22) \bigr] + \bigl[ -1 + \gamma(-20) \bigr] + \bigl[ -1 + \gamma(-14) \bigr] + \bigl[ -1 + \gamma v_\pi(15) \bigr] \Bigr\} \\
&= -1 - 14\gamma + \frac{1}{4}\gamma v_\pi(15)
\end{aligned}
$$
$$
\therefore v_\pi(15) = \frac{4 + 56\gamma}{\gamma - 4}
$$
In the undiscounted case ($\gamma = 1$) used throughout Example 4.1, this gives $v_\pi(15) = 60/(-3) = -20$.
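The fixed point can also be checked numerically. A minimal sketch, assuming $\gamma = 1$ and the successor values read from the figure above:

```python
# Case 1 check: v_pi(15) is the fixed point of v = -1 - 14*gamma + (gamma/4)*v,
# obtained from the successor values v(12) = -22, v(13) = -20, v(14) = -14.
gamma = 1.0                 # Example 4.1 is undiscounted
v15 = 0.0
for _ in range(1000):       # plain fixed-point iteration of the Bellman backup
    v15 = -1 - 14 * gamma + (gamma / 4) * v15
print(v15)                                  # -> -20.0
print((4 + 56 * gamma) / (gamma - 4))       # closed form above, also -20.0
```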
For the second case, in which the dynamics of state 13 are also changed, assume the values of the other original states keep their Figure 4.1 values (an assumption the solution below turns out to confirm). Then, similarly:
$$
\begin{aligned}
v_\pi(13) &= \pi(\mathrm{left} \mid 13)\, P_{13,12}^{\mathrm{left}} \Bigl[ R_{13,12}^{\mathrm{left}} + \gamma v_\pi(12) \Bigr] + \pi(\mathrm{up} \mid 13)\, P_{13,9}^{\mathrm{up}} \Bigl[ R_{13,9}^{\mathrm{up}} + \gamma v_\pi(9) \Bigr] \\
&\quad + \pi(\mathrm{right} \mid 13)\, P_{13,14}^{\mathrm{right}} \Bigl[ R_{13,14}^{\mathrm{right}} + \gamma v_\pi(14) \Bigr] + \pi(\mathrm{down} \mid 13)\, P_{13,15}^{\mathrm{down}} \Bigl[ R_{13,15}^{\mathrm{down}} + \gamma v_\pi(15) \Bigr] \\
v_\pi(15) &= \pi(\mathrm{left} \mid 15)\, P_{15,12}^{\mathrm{left}} \Bigl[ R_{15,12}^{\mathrm{left}} + \gamma v_\pi(12) \Bigr] + \pi(\mathrm{up} \mid 15)\, P_{15,13}^{\mathrm{up}} \Bigl[ R_{15,13}^{\mathrm{up}} + \gamma v_\pi(13) \Bigr] \\
&\quad + \pi(\mathrm{right} \mid 15)\, P_{15,14}^{\mathrm{right}} \Bigl[ R_{15,14}^{\mathrm{right}} + \gamma v_\pi(14) \Bigr] + \pi(\mathrm{down} \mid 15)\, P_{15,15}^{\mathrm{down}} \Bigl[ R_{15,15}^{\mathrm{down}} + \gamma v_\pi(15) \Bigr]
\end{aligned}
$$
$$
\begin{aligned}
v_\pi(13) &= \frac{1}{4} \Bigl\{ \bigl[ -1 + \gamma(-22) \bigr] + \bigl[ -1 + \gamma(-20) \bigr] + \bigl[ -1 + \gamma(-14) \bigr] + \bigl[ -1 + \gamma v_\pi(15) \bigr] \Bigr\} \\
&= -1 - 14\gamma + \frac{1}{4}\gamma v_\pi(15) \qquad (1) \\
v_\pi(15) &= \frac{1}{4} \Bigl\{ \bigl[ -1 + \gamma(-22) \bigr] + \bigl[ -1 + \gamma v_\pi(13) \bigr] + \bigl[ -1 + \gamma(-14) \bigr] + \bigl[ -1 + \gamma v_\pi(15) \bigr] \Bigr\} \\
&= -1 - 9\gamma + \frac{1}{4}\gamma v_\pi(13) + \frac{1}{4}\gamma v_\pi(15) \qquad (2)
\end{aligned}
$$
Rearranging (1) and (2) gives the system of equations:
$$
\begin{aligned}
v_\pi(13) - \frac{1}{4}\gamma v_\pi(15) &= -1 - 14\gamma \qquad (3) \\
-\frac{1}{4}\gamma v_\pi(13) + \Bigl(1 - \frac{1}{4}\gamma\Bigr) v_\pi(15) &= -1 - 9\gamma \qquad (4)
\end{aligned}
$$
Solving equations (3) and (4), we obtain:
$$
\begin{aligned}
v_\pi(15) &= \frac{14\gamma^2 + 37\gamma + 4}{\frac{1}{4}\gamma^2 + \gamma - 4} = \frac{56\gamma^2 + 148\gamma + 16}{\gamma^2 + 4\gamma - 16} \\
v_\pi(13) &= \frac{-20\gamma^2 + 224\gamma + 16}{\gamma^2 + 4\gamma - 16}
\end{aligned}
$$
For $\gamma = 1$ both reduce to $v_\pi(13) = v_\pi(15) = -20$, so the changed dynamics leave the value function unchanged, confirming the assumption above.
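As a quick numerical check, the $2 \times 2$ system (3)–(4) can be solved directly, e.g. with NumPy. A minimal sketch for the undiscounted case, with unknowns ordered $[v_\pi(13), v_\pi(15)]$:

```python
import numpy as np

# Coefficient matrix and right-hand side of equations (3) and (4)
gamma = 1.0
A = np.array([[1.0,        -gamma / 4],
              [-gamma / 4, 1 - gamma / 4]])
b = np.array([-1 - 14 * gamma, -1 - 9 * gamma])

v13, v15 = np.linalg.solve(A, b)
print(v13, v15)   # -> -20.0 -20.0, matching the closed forms above
```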
