Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.
Here, we can use the result of Exercise 3.17:
$$Q_\pi(s,a) = \sum_{s'} P_{ss'}^a R_{ss'}^a + \gamma \sum_{s'} P_{ss'}^a \Bigl[ \sum_{a'} \pi(s',a') \, Q_\pi(s',a') \Bigr]$$
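To make the backup concrete, here is a minimal sketch of this expected update in Python, assuming a hypothetical tabular layout where `P[s, a, s']` holds transition probabilities, `R[s, a, s']` expected rewards, and `pi[s, a]` the policy's action probabilities (these array names are illustrative, not from the book):

```python
import numpy as np

def q_backup(P, R, pi, Q, gamma, s, a):
    """One expected backup of Q_pi(s, a) under the formula above."""
    # sum_{s'} P^a_{ss'} R^a_{ss'}
    expected_reward = np.sum(P[s, a] * R[s, a])
    # For every s': sum_{a'} pi(s', a') Q(s', a'), then weight by P^a_{ss'}
    expected_next_value = np.sum(P[s, a] * np.sum(pi * Q, axis=1))
    return expected_reward + gamma * expected_next_value
```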
The algorithm analogous to that on page 80 is then as follows:
$$
\begin{aligned}
&\text{1. Initialization} \\
&\qquad Q_\pi(s,a) \in \mathbb{R} \text{ and } \pi(s) \in \mathcal{A}(s) \text{ arbitrarily, for all } s \in \mathcal{S} \text{ and } a \in \mathcal{A}(s) \\
&\text{2. Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad\qquad \Delta \leftarrow 0 \\
&\qquad\qquad \text{Loop for each } (s,a) \text{ pair:} \\
&\qquad\qquad\qquad q \leftarrow Q_\pi(s,a) \\
&\qquad\qquad\qquad Q_\pi(s,a) \leftarrow \sum_{s'} P_{ss'}^a R_{ss'}^a + \gamma \sum_{s'} P_{ss'}^a \Bigl[ \sum_{a'} \pi(s',a') \, Q_\pi(s',a') \Bigr] \\
&\qquad\qquad\qquad \Delta \leftarrow \max(\Delta, |q - Q_\pi(s,a)|) \\
&\qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3. Policy Improvement} \\
&\qquad \textit{policy-stable} \leftarrow \textit{true} \\
&\qquad \text{For each } s \in \mathcal{S}: \\
&\qquad\qquad \textit{old-action} \leftarrow \pi(s) \\
&\qquad\qquad \pi(s) \leftarrow \operatorname{argmax}_a Q_\pi(s,a) \\
&\qquad\qquad \text{If } \textit{old-action} \neq \pi(s) \text{, then } \textit{policy-stable} \leftarrow \textit{false} \\
&\qquad \text{If } \textit{policy-stable} \text{, then stop and return } Q_\pi \approx q_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$
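Below is a sketch of the whole procedure, under the same assumed `P` and `R` arrays as in the snippet above. Since improvement makes the policy deterministic, the inner sum $\sum_{a'} \pi(s',a')\,Q_\pi(s',a')$ collapses to $Q_\pi(s', \pi(s'))$, which is how the evaluation step is written here:

```python
import numpy as np

def policy_iteration_q(P, R, gamma, theta=1e-8):
    """Policy iteration on action values; returns (Q, pi) with Q ~ q_*.

    P[s, a, s'] and R[s, a, s'] are as before; pi is kept deterministic,
    represented as one chosen action per state.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))   # 1. Initialization
    pi = np.zeros(n_states, dtype=int)

    while True:
        # 2. Policy Evaluation: sweep all (s, a) pairs until convergence
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    q_old = Q[s, a]
                    # Q(s', pi(s')) for every s' under the current policy
                    next_value = Q[np.arange(n_states), pi]
                    Q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * next_value))
                    delta = max(delta, abs(q_old - Q[s, a]))
            if delta < theta:
                break

        # 3. Policy Improvement: act greedily w.r.t. Q in every state
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            pi[s] = np.argmax(Q[s])
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return Q, pi
```

Note that, unlike the state-value version on page 80, no model lookahead is needed in the improvement step: the greedy action is read directly off the stored $Q_\pi(s,\cdot)$, which is exactly why this formulation matters later in the book.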