Exercise 3.23 Give the Bellman equation for $q_*$ for the recycling robot.
This picture shows the mechanism of the recycling robot.
To give the Bellman equation for $q_*$ for the recycling robot, we have to enumerate equations for $q_*(s_h, a_s)$, $q_*(s_h, a_w)$, $q_*(s_h, a_r)$, $q_*(s_l, a_s)$, $q_*(s_l, a_w)$ and $q_*(s_l, a_r)$. Here, the subscripts h, l, s, w, r denote ‘high’, ‘low’, ‘search’, ‘wait’ and ‘recharge’, respectively. For ‘high’ status, the only available actions are ‘search’ and ‘wait’, so $q_*(s_h, a_r)$ is excluded.
First, recall equation (1) from Exercise 3.22:

$$
q_*(s,a)=\sum_{s'} \Bigl \{ \bigl [ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr ] P_{s,s'}^a \Bigr \} \qquad{(1)}
$$

Here $P_{s,s'}^a$ is the probability of moving from $s$ to $s'$ when taking action $a$, and $R_{s,s'}^a$ is the reward received on that transition.
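To make the structure of equation (1) concrete, here is a minimal Python sketch of a single backup for one state-action pair. The dictionary layout and the name `bellman_backup` are assumptions made for this illustration only; they are not part of the exercise.

```python
from typing import Dict, Tuple

State = str
Action = str
# Hypothetical layout: model[(s, a)] maps each successor s' to (P, R),
# i.e. the transition probability and the reward of that transition.
Model = Dict[Tuple[State, Action], Dict[State, Tuple[float, float]]]
QTable = Dict[Tuple[State, Action], float]


def bellman_backup(model: Model, q: QTable, s: State, a: Action, gamma: float) -> float:
    """Evaluate the right-hand side of equation (1) for the pair (s, a)."""
    total = 0.0
    for s_next, (prob, reward) in model[(s, a)].items():
        # max over a' of q(s', a'); the actions available in s' are read
        # off the keys of the q table.
        best_next = max(v for (s2, _a2), v in q.items() if s2 == s_next)
        total += prob * (reward + gamma * best_next)
    return total
```

Successor states with zero probability can simply be left out of `model[(s, a)]`, since they contribute nothing to the sum.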
For status ‘high’, we have:
$$
\begin{aligned} q_*(s_h, a_s) &= \bigl [ R_{s_h, s_h}^{a_s} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_h,s_h}^{a_s} + \bigl [ R_{s_h, s_l}^{a_s} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_h,s_l}^{a_s} \qquad{(2)}\\ q_*(s_h, a_w) &= \bigl [ R_{s_h, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_h,s_h}^{a_w} + \bigl [ R_{s_h, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_h,s_l}^{a_w} \qquad{(3)} \end{aligned}
$$
For status ‘low’, we have:
$$
\begin{aligned} q_*(s_l, a_s) &= \bigl [ R_{s_l, s_h}^{a_s} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_l,s_h}^{a_s} + \bigl [ R_{s_l, s_l}^{a_s} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_l,s_l}^{a_s} \qquad{(4)} \\ q_*(s_l, a_w) &= \bigl [ R_{s_l, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_l,s_h}^{a_w} + \bigl [ R_{s_l, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_l,s_l}^{a_w} \qquad{(5)} \\ q_*(s_l, a_r) &= \bigl [ R_{s_l, s_h}^{a_r} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_l,s_h}^{a_r} + \bigl [ R_{s_l, s_l}^{a_r} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_l,s_l}^{a_r} \qquad{(6)} \end{aligned}
$$
Then, according to the table in the picture above, $R_{s_h,s_h}^{a_s}=r_{search}$, $P_{s_h,s_h}^{a_s}=\alpha$, $R_{s_h,s_l}^{a_s}=r_{search}$, $P_{s_h,s_l}^{a_s}=1-\alpha$, and so on. Plugging these values into equations (2) to (6), we get:
$$
\begin{aligned} q_*(s_h, a_s) &= \bigl [ r_{search} + \gamma \max_{a'} q_*(s_h,a') \bigr ] \alpha + \bigl [ r_{search} + \gamma \max_{a'} q_*(s_l,a') \bigr ] (1-\alpha)\\ &= r_{search} + \gamma \bigl [\alpha \max_{a'} q_*(s_h,a') +(1-\alpha) \max_{a'} q_*(s_l,a')\bigr ]\qquad{(7)}\\ q_*(s_h, a_w) &= \bigl [ r_{wait} + \gamma \max_{a'} q_*(s_h,a') \bigr ] \cdot 1 + \bigl [ R_{s_h, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr ] \cdot 0 \\ &= r_{wait} + \gamma \max_{a'} q_*(s_h,a')\qquad{(8)}\\ q_*(s_l, a_s) &= \bigl [ -3 + \gamma \max_{a'} q_*(s_h,a') \bigr ] (1-\beta) + \bigl [ r_{search} + \gamma \max_{a'} q_*(s_l,a') \bigr ] \beta \\ &= \beta r_{search} - 3(1-\beta) + \gamma \bigl [ (1-\beta)\max_{a'} q_*(s_h,a') + \beta \max_{a'} q_*(s_l,a') \bigr ] \qquad{(9)} \\ q_*(s_l, a_w) &= \bigl [ R_{s_l, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr ] \cdot 0 + \bigl [ r_{wait} + \gamma \max_{a'} q_*(s_l,a') \bigr ] \cdot 1 \\ &= r_{wait} + \gamma \max_{a'} q_*(s_l,a') \qquad{(10)} \\ q_*(s_l, a_r) &= \bigl [ 0 + \gamma \max_{a'} q_*(s_h,a') \bigr ] \cdot 1 + \bigl [ R_{s_l, s_l}^{a_r} + \gamma \max_{a'} q_*(s_l,a') \bigr ] \cdot 0 \\ &= \gamma \max_{a'} q_*(s_h,a')\qquad{(11)} \end{aligned}
$$
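The probabilities and rewards substituted into equations (2) to (6) can also be collected in one small lookup table, which makes it easy to check that the outgoing probabilities of every state-action pair sum to 1. The sketch below is only an illustration: the function names and the numeric parameter values are assumptions, not part of the exercise.

```python
def recycling_robot_model(alpha, beta, r_search, r_wait):
    """Return {(state, action): {next_state: (probability, reward)}}.

    Transitions with probability 0 are left out.
    """
    return {
        ("high", "search"): {"high": (alpha, r_search), "low": (1 - alpha, r_search)},
        ("high", "wait"): {"high": (1.0, r_wait)},
        ("low", "search"): {"high": (1 - beta, -3.0), "low": (beta, r_search)},
        ("low", "wait"): {"low": (1.0, r_wait)},
        ("low", "recharge"): {"high": (1.0, 0.0)},
    }


def expected_reward(model, s, a):
    """Probability-weighted immediate reward for taking action a in state s."""
    return sum(p * r for p, r in model[(s, a)].values())


# Assumed example values, chosen only for illustration.
model = recycling_robot_model(alpha=0.8, beta=0.5, r_search=2.0, r_wait=1.0)
for (s, a), outcomes in model.items():
    total_prob = sum(p for p, _ in outcomes.values())
    print(s, a, "prob sum:", total_prob, "expected reward:", expected_reward(model, s, a))
```

The expected reward is the constant term that remains once the $\gamma$ terms are factored out of each bracketed sum.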
For ‘high’ status, $a'$ can only be ‘search’ or ‘wait’, while for ‘low’ status, $a'$ can be ‘search’, ‘wait’ or ‘recharge’. So, equations (7) to (11) can be rearranged as below:
$$
\begin{aligned} q_*(s_h, a_s) &= r_{search} + \gamma \Bigl \{\alpha \max \bigl [ q_*(s_h,a_s), q_*(s_h,a_w) \bigr ]+(1-\alpha) \max \bigl [ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr ] \Bigr \} \qquad{(12)} \\ q_*(s_h, a_w) &= r_{wait} + \gamma \max \bigl [ q_*(s_h,a_s), q_*(s_h,a_w) \bigr ] \qquad {(13)}\\ q_*(s_l, a_s) &= \beta r_{search} - 3(1-\beta) + \gamma \Bigl \{ (1-\beta)\max \bigl [ q_*(s_h,a_s), q_*(s_h,a_w) \bigr ] + \beta \max \bigl [ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr ] \Bigr \} \qquad{(14)} \\ q_*(s_l, a_w) &= r_{wait} + \gamma \max \bigl [ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr ] \qquad{(15)} \\ q_*(s_l, a_r) &= \gamma \max \bigl [ q_*(s_h,a_s), q_*(s_h,a_w) \bigr ] \qquad{(16)} \end{aligned}
$$
Equations (12) to (16) are the Bellman optimality equations for $q_*$ for the recycling robot, and they can be solved in a similar way to Exercise 3.22.
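As a rough numerical illustration, the five equations can be treated as a fixed-point problem and iterated until the values stop changing (value iteration on $q$). This is a swapped-in technique, not necessarily the method used in Exercise 3.22. The function name `solve_q` and all numeric values of $\alpha$, $\beta$, $\gamma$, $r_{search}$ and $r_{wait}$ are assumptions made for the example. The immediate reward for searching in ‘low’ status is written as the probability-weighted sum $\beta r_{search} + (1-\beta)(-3)$, following the two bracketed terms of the low/search equation.

```python
def solve_q(alpha, beta, gamma, r_search, r_wait, tol=1e-10, max_iter=100_000):
    """Iterate the five recycling-robot equations until the largest change is below tol."""
    q = {
        ("high", "search"): 0.0,
        ("high", "wait"): 0.0,
        ("low", "search"): 0.0,
        ("low", "wait"): 0.0,
        ("low", "recharge"): 0.0,
    }
    for _ in range(max_iter):
        # Best achievable value in each status under the current estimate.
        v_high = max(q[("high", "search")], q[("high", "wait")])
        v_low = max(q[("low", "search")], q[("low", "wait")], q[("low", "recharge")])
        new_q = {
            ("high", "search"): r_search + gamma * (alpha * v_high + (1 - alpha) * v_low),
            ("high", "wait"): r_wait + gamma * v_high,
            ("low", "search"): beta * r_search - 3 * (1 - beta)
            + gamma * ((1 - beta) * v_high + beta * v_low),
            ("low", "wait"): r_wait + gamma * v_low,
            ("low", "recharge"): gamma * v_high,
        }
        delta = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q
        if delta < tol:
            break
    return q


# Assumed example parameters; they do not come from the exercise.
q_star = solve_q(alpha=0.8, beta=0.6, gamma=0.9, r_search=2.0, r_wait=1.0)
for (status, action), value in q_star.items():
    print(f"q*({status}, {action}) = {value:.4f}")
```

For $\gamma < 1$ the update is a contraction, so the iteration converges regardless of the starting values. The greedy action in each status can then be read off as the one with the largest $q_*$.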