Exercise 5.9 Modify the algorithm for first-visit MC policy evaluation (Section 5.1) to use the incremental implementation for sample averages described in Section 2.4.
The modified algorithm could look like this:
$$
\begin{aligned}
&\text{Input: an arbitrary target policy } \pi \\
&\text{Initialize, for all } s \in \mathcal S,\ a \in \mathcal A(s): \\
&\qquad Q(s,a) \in \mathbb R \text{ (arbitrarily)} \\
&\qquad C(s,a) \leftarrow 0 \\
&\text{Loop forever (for each episode):} \\
&\qquad b \leftarrow \text{any policy with coverage of } \pi \\
&\qquad \text{Generate an episode following } b\text{: } S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T \\
&\qquad G \leftarrow 0 \\
&\qquad W \leftarrow 1 \\
&\qquad \text{Loop for each step of episode, } t = T-1, T-2, \dots, 0, \text{ while } W \neq 0: \\
&\qquad\qquad G \leftarrow \gamma G + R_{t+1} \\
&\qquad\qquad \text{Unless the pair } S_t, A_t \text{ appears in } S_0, A_0, S_1, A_1, \dots, S_{t-1}, A_{t-1}: \\
&\qquad\qquad\qquad C(S_t, A_t) \leftarrow C(S_t, A_t) + W \\
&\qquad\qquad\qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)} \bigl[ G - Q(S_t, A_t) \bigr] \\
&\qquad\qquad W \leftarrow W \, \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}
\end{aligned}
$$
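The update loop above can be sketched in Python as a rough illustration. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `first_visit_mc_q`, the episode format (a list of `(state, action, reward)` tuples with the reward being $R_{t+1}$), and the callables `target_prob` and `behavior_prob` are all hypothetical. It also assumes that the cumulative weight $C$ is accumulated only on first visits and that the importance-sampling weight $W$ is updated on every step of the backward pass.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, target_prob, behavior_prob, gamma=1.0):
    """Incremental first-visit MC estimate of Q with weighted importance sampling.

    episodes: iterable of episodes, each a list of (state, action, reward)
        tuples, where reward is the reward received after taking the action.
    target_prob(a, s), behavior_prob(a, s): hypothetical callables returning
        pi(a|s) and b(a|s); b is assumed to cover pi.
    """
    Q = defaultdict(float)  # arbitrary initial values, here 0.0
    C = defaultdict(float)  # cumulative importance-sampling weights

    for episode in episodes:
        # Record the earliest time step at which each (state, action) pair occurs.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G, W = 0.0, 1.0
        # Walk the episode backwards, stopping early once the weight hits zero.
        for t in range(len(episode) - 1, -1, -1):
            if W == 0.0:
                break
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:  # assumption: update on first visits only
                C[(s, a)] += W
                Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            # Assumption: the weight is updated on every step, first visit or not.
            W *= target_prob(a, s) / behavior_prob(a, s)
    return dict(Q)
```

When `target_prob` and `behavior_prob` are the same function, $W$ stays at 1 and the update reduces to the incremental sample average with step size $1/n$ from Section 2.4. For example, first-visit returns of 3 and 5 for the same pair would yield an estimate of 4.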