Exercise 6.6 In Example 6.2 we stated that the true values for the random walk example are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$, for states A through E. Describe at least two different ways that these could have been computed. Which would you guess we actually used? Why?
Example 6.2 Random walk
In this example we empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC when applied to the following Markov reward process:
A Markov reward process, or MRP, is a Markov decision process without actions. We will often use MRPs when focusing on the prediction problem, in which there is no need to distinguish the dynamics due to the environment from those due to the agent. In this MRP, all episodes start in the center state, C, then proceed either left or right by one state on each step, with equal probability. Episodes terminate either on the extreme left or the extreme right. When an episode terminates on the right, a reward of +1 occurs; all other rewards are zero. For example, a typical episode might consist of the following state-and-reward sequence: C, 0, B, 0, C, 0, D, 0, E, 1. Because this task is undiscounted, the true value of each state is the probability of terminating on the right if starting from that state. Thus, the true value of the center state is
$v_\pi(C) = 0.5$. The true values of all the states, A through E, are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$.
In this case, for all states, $\pi(a \mid s) = 0.5$, because the left and right actions are taken with equal probability. Also, $p(s', r \mid s, a) = 1$ because the actions are deterministic, and $\gamma = 1$ because the task is undiscounted.
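To make the dynamics concrete, here is a small sketch (my own illustration, not from the text; the integer state encoding 0–6 is an assumption chosen for convenience) that generates one episode of this MRP in the state-and-reward form shown above:

```python
import random

def generate_episode():
    """Simulate one episode of the random-walk MRP.

    States 1..5 stand for A..E; 0 and 6 are the terminal states
    (an encoding chosen here for convenience, not from the text).
    Returns the list of (next_state, reward) steps, starting from C.
    """
    s = 3                                  # all episodes start in C
    steps = []
    while 0 < s < 6:
        s = s + random.choice([-1, 1])     # left or right, equal probability
        reward = 1 if s == 6 else 0        # +1 only when terminating on the right
        steps.append((s, reward))
    return steps
```

Every generated episode terminates at one of the two ends, and only a right-side termination carries a nonzero reward, matching the example sequence C, 0, B, 0, C, 0, D, 0, E, 1.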
Method 1:
Use the Bellman equation for $v_\pi(s)$ (equation 3.14) directly; for this MRP it simplifies to:

$$v_\pi(s) = 0.5 \sum_{s', r} \bigl[ r + v_\pi(s') \bigr]$$
So, for state A,

$$v_\pi(A) = 0.5\bigl[ 0 + v_\pi(\text{terminal}) + 0 + v_\pi(B) \bigr] = 0.5\,v_\pi(B) \qquad \text{(1)}$$
For state B,

$$v_\pi(B) = 0.5\bigl[ 0 + v_\pi(A) + 0 + v_\pi(C) \bigr] = 0.5\,v_\pi(A) + 0.5\,v_\pi(C) \qquad \text{(2)}$$
And so on, we have:

$$\begin{aligned} v_\pi(C) &= 0.5\,v_\pi(B) + 0.5\,v_\pi(D) \qquad &\text{(3)}\\ v_\pi(D) &= 0.5\,v_\pi(C) + 0.5\,v_\pi(E) \qquad &\text{(4)}\\ v_\pi(E) &= 0.5\,v_\pi(D) + 0.5 \qquad &\text{(5)} \end{aligned}$$
Solving equations (1) through (5), we obtain the state values for A through E: $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$.
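The system (1)–(5) can also be solved mechanically by forward substitution. The sketch below (my own illustration, not from the text; it uses Python's `fractions` module for exact arithmetic) expresses each value as a multiple of $v_\pi(A)$ using equations (1)–(4), then uses (5) to pin down $v_\pi(A)$:

```python
from fractions import Fraction

# Write v(s) = c_s * v(A). Equation (1) gives c_B = 2, and rearranging
# each of (2)-(4) gives the recurrence c_next = 2*c_current - c_previous.
coeffs = [Fraction(1), Fraction(2)]          # c_A, c_B
for _ in range(3):                           # append c_C, c_D, c_E
    coeffs.append(2 * coeffs[-1] - coeffs[-2])
# Equation (5): c_E * v(A) = 0.5 * c_D * v(A) + 0.5, solved for v(A).
c_D, c_E = coeffs[3], coeffs[4]
vA = Fraction(1, 2) / (c_E - Fraction(1, 2) * c_D)
values = [c * vA for c in coeffs]            # v(A) .. v(E)
print(values)  # the fractions 1/6, 2/6, 3/6, 4/6, 5/6 (in lowest terms)
```

The recurrence produces $c = 1, 2, 3, 4, 5$, and equation (5) then forces $v_\pi(A) = \frac{1}{6}$, reproducing the stated values.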
Method 2:
I have not yet found another way that is completely different from Method 1. If anybody knows one, please tell me.
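One candidate for a second method (my own suggestion, not something the text above provides) is iterative policy evaluation from Chapter 4: start from arbitrary value estimates and repeatedly apply the simplified Bellman equation as an update rule until the values converge, rather than solving the equations exactly. A rough sketch, under that assumption:

```python
# A candidate "Method 2" (my sketch, not from the text): iterative policy
# evaluation. Sweep the simplified Bellman equation as an update,
#   v(s) <- 0.5 * [v(left) + (reward + v(right))],
# with both terminal values fixed at 0 and a +1 reward on the right exit.
v = [0.0] * 7                  # indices 0..6; 0 and 6 are terminal
for _ in range(1000):          # enough sweeps to converge for this small chain
    for s in range(1, 6):      # states A..E are indices 1..5
        right = (1.0 if s + 1 == 6 else 0.0) + v[s + 1]
        v[s] = 0.5 * (v[s - 1] + right)
print([round(x, 3) for x in v[1:6]])  # [0.167, 0.333, 0.5, 0.667, 0.833]
```

A brute-force Monte Carlo estimate from many simulated episodes would be yet another option, though both of these only approximate the exact fractions that Method 1 yields, which is why I would guess the authors used the analytical approach.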