Exercise 5.4 The pseudocode for Monte Carlo ES is inefficient because, for each state–action pair, it maintains a list of all returns and repeatedly calculates their mean. It would be more efficient to use techniques similar to those explained in Section 2.4 to maintain just the mean and a count (for each state–action pair) and update them incrementally. Describe how the pseudocode would be altered to achieve this.
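The altered pseudocode is shown below. The Returns(s, a) lists are replaced by a visit count for each state–action pair, and Q(s, a) is moved incrementally toward each new return G: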
\begin{aligned}
&\text{Initialize:} \\
&\qquad \pi(s) \in \mathcal A(s) \text{ (arbitrarily), for all } s \in \mathcal S \\
&\qquad Q(s, a) \in \mathbb R \text{ (arbitrarily), for all } s \in \mathcal S, a \in \mathcal A(s) \\
&\qquad \mathit{counts}(s, a) \leftarrow 0 \text{, for all } s \in \mathcal S, a \in \mathcal A(s) \\
&\text{Loop forever (for each episode):} \\
&\qquad \text{Choose } S_0 \in \mathcal S, A_0 \in \mathcal A(S_0) \text{ randomly such that all pairs have probability} > 0 \\
&\qquad \text{Generate an episode from } S_0, A_0, \text{ following } \pi: S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T \\
&\qquad G \leftarrow 0 \\
&\qquad \text{Loop for each step of episode, } t = T-1, T-2, \ldots, 0: \\
&\qquad \qquad G \leftarrow \gamma G + R_{t+1} \\
&\qquad \qquad \text{Unless the pair } S_t, A_t \text{ appears in } S_0, A_0, S_1, A_1, \ldots, S_{t-1}, A_{t-1}: \\
&\qquad \qquad \qquad \mathit{counts}(S_t, A_t) \leftarrow \mathit{counts}(S_t, A_t) + 1 \\
&\qquad \qquad \qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{G - Q(S_t, A_t)}{\mathit{counts}(S_t, A_t)} \\
&\qquad \qquad \qquad \pi(S_t) \leftarrow \operatorname{argmax}_a Q(S_t, a)
\end{aligned}
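As a rough illustration, the inner loop of the pseudocode above could be sketched in Python as follows. This is only a sketch under stated assumptions: the function name `update_from_episode`, the representation of an episode as a list of `(state, action, reward)` tuples (with the reward being R_{t+1}), the fixed action list shared by all states, and the toy episodes are all hypothetical and not part of the exercise. Exploring starts and episode generation are left out.

```python
from collections import defaultdict


def update_from_episode(episode, Q, counts, policy, actions, gamma=1.0):
    """First-visit update of Q, counts and policy from one finished episode.

    episode: list of (state, action, reward) tuples, where reward is R_{t+1}.
    (Hypothetical data layout, chosen only for this sketch.)
    """
    pairs = [(s, a) for s, a, _ in episode]
    G = 0.0
    # Walk the episode backwards: t = T-1, T-2, ..., 0.
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        G = gamma * G + r
        # Skip the update if (s, a) already occurred earlier in the episode.
        if (s, a) in pairs[:t]:
            continue
        # Incremental mean: no list of returns is stored.
        counts[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
        # Greedy policy improvement for the visited state.
        policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return Q, counts, policy


# Toy data, made up for illustration only.
actions = ["left", "right"]
Q = defaultdict(float)     # arbitrary initial values (here 0.0)
counts = defaultdict(int)  # visit counts start at 0
policy = {}

episodes = [
    [("s0", "right", 0.0), ("s1", "right", 1.0)],
    [("s0", "right", 0.0), ("s1", "left", 3.0)],
    [("s0", "left", 5.0)],
]
for ep in episodes:
    update_from_episode(ep, Q, counts, policy, actions)

print(Q[("s0", "right")], counts[("s0", "right")], policy)
```

Because the count is incremented before the division, the first update for a pair sets Q to the observed return exactly, so the arbitrary initial value of Q does not bias the running mean.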