Exercise 4.6 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon/|\mathcal A(s)|$. Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for $v_*$ on page 80.
The algorithm on page 80 in Section 4.2 assumes a deterministic policy. For the stochastic, $\epsilon$-soft case, we can modify the algorithm as follows:
$$
\begin{aligned}
&\text{1 Initialization} \\
&\qquad V(s) \in \mathbb R \text{ arbitrarily, and } \pi(a \mid s) \text{ an arbitrary } \epsilon\text{-soft policy, for all } s \in \mathcal S \\
&\text{2 Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad \qquad \Delta \leftarrow 0 \\
&\qquad \qquad \text{Loop for each } s \in \mathcal S: \\
&\qquad \qquad \qquad v \leftarrow V(s) \\
&\qquad \qquad \qquad V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \Bigl[ r + \gamma V(s') \Bigr] \\
&\qquad \qquad \qquad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\qquad \qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3 Policy Improvement} \\
&\qquad policy\text{-}stable \leftarrow true \\
&\qquad \text{For each } s \in \mathcal S: \\
&\qquad \qquad old\text{-}policy \leftarrow \pi(\cdot \mid s) \\
&\qquad \qquad A^* \leftarrow \operatorname{argmax}_a \sum_{s',r} p(s',r \mid s,a) \Bigl[ r + \gamma V(s') \Bigr] \\
&\qquad \qquad \pi(a \mid s) \leftarrow
\begin{cases}
1 - \epsilon + \epsilon/|\mathcal A(s)| & \text{if } a = A^* \\
\epsilon/|\mathcal A(s)| & \text{otherwise}
\end{cases} \\
&\qquad \qquad \text{If } old\text{-}policy \neq \pi(\cdot \mid s) \text{, then } policy\text{-}stable \leftarrow false \\
&\qquad \text{If } policy\text{-}stable \text{, then stop and return } V \approx v_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$

Note the two substantive changes: in step 2, the update now averages over actions with the weights $\pi(a \mid s)$, and in step 3, the improvement no longer assigns all probability to the greedy action; it assigns the greedy action as much probability as the $\epsilon$-soft constraint allows, and every other action the minimum $\epsilon/|\mathcal A(s)|$.
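The modified algorithm can be sketched in code. The following is a minimal, illustrative implementation on a hypothetical two-state, two-action MDP (the transition table `p`, the discount `gamma`, and the values of `epsilon` and `theta` are all assumptions chosen for the example, not part of the exercise):

```python
import numpy as np

# Hypothetical toy MDP: p[s][a] is a list of (prob, next_state, reward) triples,
# i.e. a tabular representation of p(s', r | s, a).
p = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions = 2, 2
gamma, epsilon, theta = 0.9, 0.1, 1e-8

# pi[s, a] is a stochastic policy; the uniform policy is epsilon-soft.
pi = np.full((n_states, n_actions), 1.0 / n_actions)
V = np.zeros(n_states)

def q(s, a):
    # One-step lookahead: sum_{s', r} p(s', r | s, a) [r + gamma V(s')]
    return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])

policy_stable = False
while not policy_stable:
    # 2. Policy evaluation: expected update weighted by pi(a | s)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            V[s] = sum(pi[s, a] * q(s, a) for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # 3. Policy improvement: greedy action gets 1 - eps + eps/|A|,
    #    every other action gets eps/|A|.
    policy_stable = True
    for s in range(n_states):
        old = pi[s].copy()
        best = max(range(n_actions), key=lambda a: q(s, a))
        pi[s] = epsilon / n_actions
        pi[s, best] += 1 - epsilon
        if not np.allclose(old, pi[s]):
            policy_stable = False

print(pi)  # each row sums to 1; greedy action has probability 1 - eps + eps/|A|
print(V)
```

With $\epsilon = 0.1$ and $|\mathcal A(s)| = 2$, the improved policy assigns probability $0.95$ to the greedy action in each state and $0.05$ to the other one.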
Because the only policies allowed are $\epsilon$-soft, each of the $|\mathcal A(s)| - 1$ non-greedy actions must receive probability $\epsilon/|\mathcal A(s)|$, so the probability that the policy doesn't select the greedy action $a$ is $\frac{\epsilon}{|\mathcal A(s)|}\cdot (|\mathcal A(s)| - 1)$. So,
$$
\begin{aligned}
\pi(a \mid s) &= 1 - \frac{\epsilon}{|\mathcal A(s)|}\cdot (|\mathcal A(s)| - 1) \\
&= 1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|}
\end{aligned}
$$
Substituting this $\pi(a \mid s)$ for the greedy action in the improvement step of the algorithm, with every other action receiving $\epsilon/|\mathcal A(s)|$, gives the final result.
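The identity above is easy to check numerically. A quick sanity check, using illustrative values $\epsilon = 0.1$ and $|\mathcal A(s)| = 4$ (these numbers are assumptions for the example only):

```python
# Numeric check of pi(a|s) = 1 - (eps/|A|)(|A|-1) = 1 - eps + eps/|A|
# for hypothetical values eps = 0.1, |A(s)| = 4.
epsilon, n = 0.1, 4

p_greedy = 1 - (epsilon / n) * (n - 1)                       # 1 - (eps/|A|)(|A|-1)
assert abs(p_greedy - (1 - epsilon + epsilon / n)) < 1e-12   # = 1 - eps + eps/|A|

# The epsilon-soft policy is a valid distribution: the greedy action
# plus the (n - 1) non-greedy actions sum to probability 1.
total = p_greedy + (n - 1) * (epsilon / n)
print(p_greedy, total)  # approximately 0.925 and 1.0
```

So the greedy action keeps probability $0.925$, and the remaining $0.075$ is spread evenly over the three other actions.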