Exercise 3.22 Consider the continuing MDP shown on the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{left}$ and $\pi_{right}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
Before solving this problem, we first derive an expression for $q_*(s,a)$ in terms of $R_{s,s'}^a$ and $P_{s,s'}^a$.
First,

$$
\begin{aligned}
q_*(s,a) &= \mathbb E\bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \bigm| S_t=s, A_t=a\bigr] \\
&= \sum_{s',r} \Bigl\{ p(s',r|s,a) \bigl[ r + \gamma \max_{a'} q_*(s',a') \bigr] \Bigr\} \\
&= \sum_{s',r} \bigl[ r\,p(s',r|s,a) \bigr] + \sum_{s',r} \bigl[ p(s',r|s,a)\,\gamma \max_{a'} q_*(s',a') \bigr] \\
&= \sum_{r} \bigl[ r\,p(r|s,a) \bigr] + \sum_{s'} \bigl[ p(s'|s,a)\,\gamma \max_{a'} q_*(s',a') \bigr] \\
&= \mathbb E(r|s,a) + \sum_{s'} \bigl[ p(s'|s,a)\,\gamma \max_{a'} q_*(s',a') \bigr] \\
&= \sum_{s'} \bigl[ \mathbb E(r|s',s,a)\,p(s'|s,a) \bigr] + \sum_{s'} \bigl[ p(s'|s,a)\,\gamma \max_{a'} q_*(s',a') \bigr] \\
&= \sum_{s'} \Bigl\{ \bigl[ \mathbb E(r|s',s,a) + \gamma \max_{a'} q_*(s',a') \bigr] p(s'|s,a) \Bigr\}
\end{aligned}
$$

The fourth line marginalizes $p(s',r|s,a)$ over $s'$ in the first sum and over $r$ in the second, and the sixth line uses the law of total expectation.
Denoting $\mathbb E(r|s',s,a) = R_{s,s'}^a$ and $p(s'|s,a) = P_{s,s'}^a$, we get the expression we wanted:

$$
q_*(s,a)=\sum_{s'} \Bigl\{ \bigl[ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr] P_{s,s'}^a \Bigr\} \tag{1}
$$
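Equation (1) can also be read as an update rule. The following is a minimal Python sketch of that idea (often called Q-value iteration), assuming a finite MDP, a discount $\gamma < 1$, and at least one action in every state. The function name `q_value_iteration` and the nested-dictionary layout `P[s][a][s2]`, `R[s][a][s2]` are illustrative choices made here, not part of the exercise.

```python
def q_value_iteration(P, R, gamma, tol=1e-12, max_iters=100_000):
    """Repeatedly apply equation (1) until the action values stop changing.

    P[s][a][s2] is the transition probability P^a_{s,s2}.
    R[s][a][s2] is the expected reward R^a_{s,s2}.
    Both layouts are assumptions of this sketch.
    """
    # Start from all-zero action values.
    q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(max_iters):
        new_q = {
            s: {
                a: sum(
                    prob * (R[s][a][s2] + gamma * max(q[s2].values()))
                    for s2, prob in P[s][a].items()
                )
                for a in P[s]
            }
            for s in P
        }
        # Largest change over all (state, action) pairs in this sweep.
        delta = max(abs(new_q[s][a] - q[s][a]) for s in P for a in P[s])
        q = new_q
        if delta < tol:
            break
    return q


# Toy check: one state with a single self-loop action that pays reward 1.
P = {"s": {"stay": {"s": 1.0}}}
R = {"s": {"stay": {"s": 1.0}}}
q = q_value_iteration(P, R, gamma=0.5)
print(q["s"]["stay"])  # close to 1 / (1 - 0.5) = 2
```

In the toy example the fixed point of equation (1) is $q = 1 + 0.5\,q$, so the iteration should settle near 2.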
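Next, we name the three states drawn as circles $s_A$, $s_B$, and $s_C$ (the top state, the state reached by going left, and the state reached by going right, respectively), and denote the action to the left as $a_l$ and the action to the right as $a_r$.

According to equation (1), we can write the Bellman optimality equations for $q_*$ in the three states. Here $a$ without a subscript denotes the single action available in $s_B$ or $s_C$.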
$$
\begin{aligned}
q_{*, \pi_{left}}(s_A, a_l) &= \Bigl\{ R_{s_A, s_B}^{a_l} + \gamma \max_{a'} q_*(s_B, a') \Bigr\} P_{s_A, s_B}^{a_l} + \Bigl\{ R_{s_A, s_C}^{a_l} + \gamma \max_{a'} q_*(s_C, a') \Bigr\} P_{s_A, s_C}^{a_l} \\
&= \bigl[ R_{s_A, s_B}^{a_l} + \gamma q_*(s_B, a) \bigr] P_{s_A, s_B}^{a_l} + \bigl[ R_{s_A, s_C}^{a_l} + \gamma q_*(s_C, a) \bigr] P_{s_A, s_C}^{a_l} \\
q_{*, \pi_{right}}(s_A, a_r) &= \Bigl\{ R_{s_A, s_B}^{a_r} + \gamma \max_{a'} q_*(s_B, a') \Bigr\} P_{s_A, s_B}^{a_r} + \Bigl\{ R_{s_A, s_C}^{a_r} + \gamma \max_{a'} q_*(s_C, a') \Bigr\} P_{s_A, s_C}^{a_r} \\
&= \bigl[ R_{s_A, s_B}^{a_r} + \gamma q_*(s_B, a) \bigr] P_{s_A, s_B}^{a_r} + \bigl[ R_{s_A, s_C}^{a_r} + \gamma q_*(s_C, a) \bigr] P_{s_A, s_C}^{a_r} \\
q_*(s_B, a) &= \Bigl\{ R_{s_B, s_A}^{a} + \gamma \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} P_{s_B, s_A}^{a} \\
q_*(s_C, a) &= \Bigl\{ R_{s_C, s_A}^{a} + \gamma \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} P_{s_C, s_A}^{a}
\end{aligned}
$$
$$
\because P_{s_A, s_B}^{a_r} = 0, \quad P_{s_A, s_C}^{a_l} = 0
$$

$$
\begin{aligned}
\therefore q_{*, \pi_{left}}(s_A, a_l) &= \bigl[ R_{s_A, s_B}^{a_l} + \gamma q_*(s_B, a) \bigr] P_{s_A, s_B}^{a_l} \\
q_{*, \pi_{right}}(s_A, a_r) &= \bigl[ R_{s_A, s_C}^{a_r} + \gamma q_*(s_C, a) \bigr] P_{s_A, s_C}^{a_r}
\end{aligned}
$$
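Because every remaining transition is deterministic (the calculations that follow use $P_{s_A, s_B}^{a_l} = P_{s_A, s_C}^{a_r} = P_{s_B, s_A}^{a} = P_{s_C, s_A}^{a} = 1$), the equations for $q_*(s_B, a)$ and $q_*(s_C, a)$ can be substituted into the two equations for $s_A$. As a sketch, writing $m$ for the larger of the two action values in $s_A$ (a shorthand introduced here, not used elsewhere in this solution):

$$
\begin{aligned}
m &= \max \bigl[ q_{*, \pi_{left}}(s_A, a_l),\ q_{*, \pi_{right}}(s_A, a_r) \bigr] \\
q_{*, \pi_{left}}(s_A, a_l) &= R_{s_A, s_B}^{a_l} + \gamma \bigl[ R_{s_B, s_A}^{a} + \gamma m \bigr] \\
q_{*, \pi_{right}}(s_A, a_r) &= R_{s_A, s_C}^{a_r} + \gamma \bigl[ R_{s_C, s_A}^{a} + \gamma m \bigr]
\end{aligned}
$$

Solving for a given $\gamma$ then amounts to guessing which action attains $m$, solving the resulting linear equations, and checking that the guess is consistent with the values obtained.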
Now let's consider each value of $\gamma$ in turn.
For $\gamma = 0$:

$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0 \cdot q_*(s_B, a) \bigr] \cdot 1 = 1 \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0 \cdot q_*(s_C, a) \bigr] \cdot 1 = 0
\end{aligned}
$$

So $\pi_{left}$ is the optimal policy when $\gamma = 0$.
For $\gamma = 0.5$:

$$
\begin{aligned}
q_*(s_B, a) &= \Bigl\{ 0 + 0.5 \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} \cdot 1 \\
&= 0.5 \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \\
q_*(s_C, a) &= \Bigl\{ 2 + 0.5 \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} \cdot 1 \\
&= 2 + 0.5 \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0.5 \cdot q_*(s_B, a) \bigr] \cdot 1 \\
&= 1 + 0.5 \cdot q_*(s_B, a) \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0.5 \cdot q_*(s_C, a) \bigr] \cdot 1 \\
&= 0.5 \cdot q_*(s_C, a)
\end{aligned}
$$
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:

$$
\begin{aligned}
q_*(s_B, a) &= 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l) \\
q_*(s_C, a) &= 2 + 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
$$

therefore,

$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{4}{3} \\
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl[ 2 + 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \Bigl[ 2 + \frac{2}{3} \Bigr] = \frac{4}{3}
\end{aligned}
$$

Here, $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r)$, so the assumption holds (with equality).
Assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:

$$
\begin{aligned}
q_*(s_B, a) &= 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r) \\
q_*(s_C, a) &= 2 + 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
$$

therefore,

$$
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl[ 2 + 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{4}{3} \\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{4}{3}
\end{aligned}
$$

Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r)$, so the assumption holds. Both actions have the same value in $s_A$, so both $\pi_{left}$ and $\pi_{right}$ are optimal policies for $\gamma = 0.5$.
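The guess-and-check procedure used above can be reproduced with exact rational arithmetic. The following is a small Python sketch, assuming the rewards used in this solution (left pays 1 and then 0, right pays 0 and then 2) and $\gamma < 1$. The helper name `solve_case` and its return layout are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction


def solve_case(gamma, assumed_best):
    """Solve the two action values in the top state under one assumption.

    assumed_best names the action assumed to attain the max in s_A.
    Returns (q_left, q_right, consistent), where consistent says whether
    the solved values agree with the assumption.
    Rewards (1, 0 on the left loop; 0, 2 on the right loop) are the ones
    used in this solution; the function itself is an illustrative helper.
    """
    g = Fraction(gamma)
    if assumed_best == "left":
        q_left = 1 / (1 - g * g)        # from q_left = 1 + g^2 * q_left
        q_right = g * (2 + g * q_left)
        consistent = q_left >= q_right
    elif assumed_best == "right":
        q_right = 2 * g / (1 - g * g)   # from q_right = g * (2 + g * q_right)
        q_left = 1 + g * g * q_right
        consistent = q_right >= q_left
    else:
        raise ValueError("assumed_best must be 'left' or 'right'")
    return q_left, q_right, consistent


half = Fraction(1, 2)
print(solve_case(half, "left"))   # (Fraction(4, 3), Fraction(4, 3), True)
print(solve_case(half, "right"))  # (Fraction(4, 3), Fraction(4, 3), True)
```

Passing `Fraction` values (or strings such as `"0.9"`) keeps the arithmetic exact; a float such as `0.9` would be converted to its binary approximation instead.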
For $\gamma = 0.9$:

$$
\begin{aligned}
q_*(s_B, a) &= \Bigl\{ 0 + 0.9 \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} \cdot 1 \\
&= 0.9 \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \\
q_*(s_C, a) &= \Bigl\{ 2 + 0.9 \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} \cdot 1 \\
&= 2 + 0.9 \max_{a'} \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0.9 \cdot q_*(s_B, a) \bigr] \cdot 1 \\
&= 1 + 0.9 \cdot q_*(s_B, a) \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0.9 \cdot q_*(s_C, a) \bigr] \cdot 1 \\
&= 0.9 \cdot q_*(s_C, a)
\end{aligned}
$$
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:

$$
\begin{aligned}
q_*(s_B, a) &= 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l) \\
q_*(s_C, a) &= 2 + 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
$$

therefore,

$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{100}{19} = \frac{500}{95} \\
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl[ 2 + 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \frac{128}{19} = \frac{576}{95}
\end{aligned}
$$

Here, $q_{*,\pi_{left}}(s_A, a_l) < q_{*,\pi_{right}}(s_A, a_r)$, which conflicts with the assumption, so the assumption fails.
Assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:

$$
\begin{aligned}
q_*(s_B, a) &= 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r) \\
q_*(s_C, a) &= 2 + 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
$$

therefore,

$$
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl[ 2 + 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{180}{19} \\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{1648}{190} = \frac{824}{95}
\end{aligned}
$$

Here, $q_{*,\pi_{left}}(s_A, a_l) < q_{*,\pi_{right}}(s_A, a_r)$ (about $8.67$ versus $9.47$), so the assumption holds. So $\pi_{right}$ is the optimal policy for $\gamma = 0.9$.
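
As a cross-check that does not go through the case analysis, one can evaluate each deterministic policy directly. Under a fixed policy the reward sequence starting from $s_A$ repeats with period two, so the return satisfies a one-line recursion. The state values $v_{\pi}(s_A)$ below are a shorthand introduced for this sketch, and the rewards are the ones used in the calculations above:

$$
\begin{aligned}
v_{\pi_{left}}(s_A) &= 1 + \gamma \cdot 0 + \gamma^2 v_{\pi_{left}}(s_A) &&\Longrightarrow\quad v_{\pi_{left}}(s_A) = \frac{1}{1-\gamma^2} \\
v_{\pi_{right}}(s_A) &= 0 + \gamma \cdot 2 + \gamma^2 v_{\pi_{right}}(s_A) &&\Longrightarrow\quad v_{\pi_{right}}(s_A) = \frac{2\gamma}{1-\gamma^2}
\end{aligned}
$$

Since the two policies differ only in $s_A$, comparing these two values should be enough for this MDP: $\pi_{left}$ is better when $\gamma < 0.5$, the two tie at $\gamma = 0.5$, and $\pi_{right}$ is better when $\gamma > 0.5$. This gives $1$ versus $0$ at $\gamma = 0$, $\frac{4}{3}$ versus $\frac{4}{3}$ at $\gamma = 0.5$, and $\frac{100}{19}$ versus $\frac{180}{19}$ at $\gamma = 0.9$.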