Reinforcement Learning Exercise 6.6

Exercise 6.6 In Example 6.2 we stated that the true values for the random walk example are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$, for states A through E. Describe at least two different ways that these could have been computed. Which would you guess we actually used? Why?

Example 6.2 Random walk
In this example we empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC when applied to the following Markov reward process:
[Figure: the five-state random walk Markov reward process, states A through E between two terminal states.]
A Markov reward process, or MRP, is a Markov decision process without actions. We will often use MRPs when focusing on the prediction problem, in which there is no need to distinguish the dynamics due to the environment from those due to the agent. In this MRP, all episodes start in the center state, C, then proceed either left or right by one state on each step, with equal probability. Episodes terminate either on the extreme left or the extreme right. When an episode terminates on the right, a reward of +1 occurs; all other rewards are zero. For example, a typical episode might consist of the following state-and-reward sequence: C, 0, B, 0, C, 0, D, 0, E, 1. Because this task is undiscounted, the true value of each state is the probability of terminating on the right if starting from that state. Thus, the true value of the center state is $v_\pi(C) = 0.5$. The true values of all the states, A through E, are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$.
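To make the MRP's dynamics concrete, here is a minimal Python sketch (my own illustration, not from the book) that samples one episode; the integer state encoding and the `sample_episode` helper are assumptions of this sketch.

```python
import random

# States A..E are indexed 1..5; 0 and 6 are the left and right terminal states.
STATE_NAMES = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E"}

def sample_episode(start=3, seed=None):
    """Sample one episode of the random-walk MRP and return the
    state-and-reward sequence, e.g. ['C', 0, 'B', 0, ..., 'E', 1]."""
    rng = random.Random(seed)
    s = start
    trajectory = [STATE_NAMES[s]]
    while True:
        s += rng.choice([-1, +1])        # step left or right with equal probability
        if s == 6:                       # terminated on the right: reward +1
            trajectory.append(1)
            return trajectory
        if s == 0:                       # terminated on the left: reward 0
            trajectory.append(0)
            return trajectory
        trajectory.append(0)             # every non-terminal transition has reward 0
        trajectory.append(STATE_NAMES[s])

if __name__ == "__main__":
    # Prints something like ['C', 0, 'B', 0, 'C', 0, 'D', 0, 'E', 1]
    print(sample_episode())
```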

In this case, $\pi(a \mid s) = 0.5$ for every state, because the left and right actions are taken with equal probability. Since each action deterministically moves to the corresponding neighboring state, $p(s', r \mid s, a) = 1$, and $\gamma = 1$ because the task is undiscounted.
Method 1:
Apply the Bellman equation for $v_\pi(s)$ (Equation 3.14) directly. Under the conditions above, it simplifies to:
$$v_\pi(s) = 0.5 \sum_{s',r}\bigl[ r + v_\pi(s') \bigr]$$
So, for state A:
$$v_\pi(A) = 0.5\bigl[ 0 + v_\pi(\text{terminal}) + 0 + v_\pi(B) \bigr] = 0.5\,v_\pi(B) \qquad \text{(1)}$$
For state B:
$$v_\pi(B) = 0.5\bigl[ 0 + v_\pi(A) + 0 + v_\pi(C) \bigr] = 0.5\,v_\pi(A) + 0.5\,v_\pi(C) \qquad \text{(2)}$$
And so on, we have:
$$
\begin{aligned}
v_\pi(C) &= 0.5\,v_\pi(B) + 0.5\,v_\pi(D) \qquad \text{(3)}\\
v_\pi(D) &= 0.5\,v_\pi(C) + 0.5\,v_\pi(E) \qquad \text{(4)}\\
v_\pi(E) &= 0.5\,v_\pi(D) + 0.5 \qquad \qquad \;\;\text{(5)}
\end{aligned}
$$
Solving equations (1) through (5), we obtain the state values for A through E: $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$.
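As a sanity check on Method 1, equations (1) through (5) can also be written as a linear system and solved numerically. The NumPy sketch below is only an illustration of that step, not the book's computation; the matrix `M` and vector `b` simply restate the five equations above.

```python
import numpy as np

# Unknowns: v = [v(A), v(B), v(C), v(D), v(E)].
# Rearranging equations (1)-(5) into the form M v = b:
M = np.array([
    [ 1.0, -0.5,  0.0,  0.0,  0.0],   # v(A) - 0.5 v(B)            = 0
    [-0.5,  1.0, -0.5,  0.0,  0.0],   # v(B) - 0.5 v(A) - 0.5 v(C) = 0
    [ 0.0, -0.5,  1.0, -0.5,  0.0],   # v(C) - 0.5 v(B) - 0.5 v(D) = 0
    [ 0.0,  0.0, -0.5,  1.0, -0.5],   # v(D) - 0.5 v(C) - 0.5 v(E) = 0
    [ 0.0,  0.0,  0.0, -0.5,  1.0],   # v(E) - 0.5 v(D)            = 0.5
])
b = np.array([0.0, 0.0, 0.0, 0.0, 0.5])

v = np.linalg.solve(M, b)
print(v)        # approximately [0.1667 0.3333 0.5 0.6667 0.8333]
print(v * 6)    # approximately [1. 2. 3. 4. 5.], i.e. 1/6 through 5/6
```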

Method 2:
I have not yet found another approach that is fundamentally different from Method 1. If anybody knows one, please tell me.
