Reinforcement Learning Exercise 3.23

Exercise 3.23 Give the Bellman equation for $q_*$ for the recycling robot.
[Figure: the recycling robot's dynamics, including its table of transition probabilities and rewards]
The figure shows how the recycling robot works.

To give the Bellman equation for $q_*$ for the recycling robot, we have to write out one equation for each admissible state-action pair: $q_*(s_h, a_s)$, $q_*(s_h, a_w)$, $q_*(s_l, a_s)$, $q_*(s_l, a_w)$ and $q_*(s_l, a_r)$. Here the subscripts h, l, s, w, r denote ‘high’, ‘low’, ‘search’, ‘wait’ and ‘recharge’, respectively. In the ‘high’ state the only available actions are ‘search’ and ‘wait’, so $q_*(s_h, a_r)$ is excluded.
First, recall equation (1) from Exercise 3.22:
$$
q_*(s,a)=\sum_{s'} \Bigl \{ \bigl [ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr ] P_{s,s'}^a \Bigr \} \qquad{(1)}
$$
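Equation (1) can be read as a one-step backup: for each possible next state, add the reward to the discounted best next action value, then weight by the transition probability. Below is a minimal Python sketch of that reading. The function name `backup` and the dictionary layout of `dynamics` are assumptions made for this illustration, and the one-state toy problem is not part of the exercise.

```python
def backup(q, dynamics, s, a, gamma):
    """Right-hand side of equation (1) for the pair (s, a).

    q        : dict mapping (state, action) -> current estimate of q_*
    dynamics : dict mapping (state, action) -> list of
               (next_state, probability, reward) triples
               (hypothetical layout chosen for this sketch)
    """
    total = 0.0
    for s_next, prob, reward in dynamics[(s, a)]:
        # Maximum of q over the actions that are available in s_next.
        best_next = max(v for (s2, _), v in q.items() if s2 == s_next)
        total += prob * (reward + gamma * best_next)
    return total


# Toy check (assumed example): one state, one self-looping action, reward 1.
# The fixed point satisfies q = 1 + gamma * q, i.e. q = 2 when gamma = 0.5.
toy_dynamics = {("s", "a"): [("s", 1.0, 1.0)]}
toy_q = {("s", "a"): 2.0}
toy_value = backup(toy_q, toy_dynamics, "s", "a", gamma=0.5)
print(toy_value)
```

Because `toy_q` is already the fixed point, the backup returns the same value it was given, which is exactly what the Bellman optimality equation demands of $q_*$.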
For the ‘high’ state, we have:
$$
\begin{aligned}
q_*(s_h, a_s) &= \bigl [ R_{s_h, s_h}^{a_s} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_h,s_h}^{a_s} + \bigl [ R_{s_h, s_l}^{a_s} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_h,s_l}^{a_s} \qquad{(2)}\\
q_*(s_h, a_w) &= \bigl [ R_{s_h, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_h,s_h}^{a_w} + \bigl [ R_{s_h, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_h,s_l}^{a_w} \qquad{(3)}
\end{aligned}
$$
For the ‘low’ state, we have:
$$
\begin{aligned}
q_*(s_l, a_s) &= \bigl [ R_{s_l, s_h}^{a_s} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_l,s_h}^{a_s} + \bigl [ R_{s_l, s_l}^{a_s} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_l,s_l}^{a_s} \qquad{(4)} \\
q_*(s_l, a_w) &= \bigl [ R_{s_l, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_l,s_h}^{a_w} + \bigl [ R_{s_l, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_l,s_l}^{a_w} \qquad{(5)} \\
q_*(s_l, a_r) &= \bigl [ R_{s_l, s_h}^{a_r} + \gamma \max_{a'} q_*(s_h,a') \bigr ] P_{s_l,s_h}^{a_r} + \bigl [ R_{s_l, s_l}^{a_r} + \gamma \max_{a'} q_*(s_l,a') \bigr ] P_{s_l,s_l}^{a_r} \qquad{(6)}
\end{aligned}
$$
Then, according to the table in the picture above, $R_{s_h,s_h}^{a_s}=r_{search}$, $P_{s_h,s_h}^{a_s}=\alpha$, $R_{s_h,s_l}^{a_s}=r_{search}$, $P_{s_h,s_l}^{a_s}=1-\alpha$, and so on. Plugging these values into equations (2) to (6), we get:
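The table lookups can also be written down as data, which makes it easy to check that every row is a valid probability distribution and to compute the expected immediate reward of each state-action pair. The sketch below uses the rewards and probabilities that appear in the equations of this post; the numeric values chosen for $\alpha$, $\beta$, $r_{search}$ and $r_{wait}$ are illustrative assumptions, since the exercise keeps them symbolic, and the `dynamics` layout is hypothetical.

```python
import math

# Illustrative parameter values (assumptions; the exercise leaves them symbolic).
alpha, beta = 0.75, 0.5
r_search, r_wait = 2.0, 1.0

# (state, action) -> list of (next_state, probability, reward)
dynamics = {
    ("high", "search"):  [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
    ("high", "wait"):    [("high", 1.0, r_wait)],
    ("low", "search"):   [("high", 1 - beta, -3.0), ("low", beta, r_search)],
    ("low", "wait"):     [("low", 1.0, r_wait)],
    ("low", "recharge"): [("high", 1.0, 0.0)],
}

# Each row of the table must be a probability distribution over next states.
for pair, outcomes in dynamics.items():
    assert math.isclose(sum(p for _, p, _ in outcomes), 1.0), pair

# Expected immediate reward of each pair: sum over s' of P * R.
expected_reward = {
    pair: sum(p * r for _, p, r in outcomes)
    for pair, outcomes in dynamics.items()
}
for pair, value in expected_reward.items():
    print(pair, value)
```

For (‘low’, ‘search’) the expected immediate reward is the probability-weighted sum $\beta\, r_{search} - 3(1-\beta)$, not a plain sum of the two rewards, because the two outcomes carry different rewards.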
$$
\begin{aligned}
q_*(s_h, a_s) &= \bigl [ r_{search} + \gamma \max_{a'} q_*(s_h,a') \bigr ] \alpha + \bigl [ r_{search} + \gamma \max_{a'} q_*(s_l,a') \bigr ] (1-\alpha)\\
&= r_{search} + \gamma \bigl [\alpha \max_{a'} q_*(s_h,a') +(1-\alpha) \max_{a'} q_*(s_l,a')\bigr ]\qquad{(7)}\\
q_*(s_h, a_w) &= \bigl [ r_{wait} + \gamma \max_{a'} q_*(s_h,a') \bigr ] \cdot 1 + \bigl [ R_{s_h, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr ] \cdot 0 \\
&= r_{wait} + \gamma \max_{a'} q_*(s_h,a')\qquad{(8)}\\
q_*(s_l, a_s) &= \bigl [ -3 + \gamma \max_{a'} q_*(s_h,a') \bigr ] (1-\beta) + \bigl [ r_{search} + \gamma \max_{a'} q_*(s_l,a') \bigr ] \beta \\
&= \beta\, r_{search} - 3(1-\beta) + \gamma \bigl [ (1-\beta)\max_{a'} q_*(s_h,a') + \beta \max_{a'} q_*(s_l,a') \bigr ] \qquad{(9)} \\
q_*(s_l, a_w) &= \bigl [ R_{s_l, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr ] \cdot 0 + \bigl [ r_{wait} + \gamma \max_{a'} q_*(s_l,a') \bigr ] \cdot 1 \\
&= r_{wait} + \gamma \max_{a'} q_*(s_l,a') \qquad{(10)} \\
q_*(s_l, a_r) &= \bigl [ 0 + \gamma \max_{a'} q_*(s_h,a') \bigr ] \cdot 1 + \bigl [ R_{s_l, s_l}^{a_r} + \gamma \max_{a'} q_*(s_l,a') \bigr ] \cdot 0 \\
&= \gamma \max_{a'} q_*(s_h,a')\qquad{(11)}
\end{aligned}
$$
In the ‘high’ state, $a'$ can only be ‘search’ or ‘wait’, while in the ‘low’ state $a'$ can be ‘search’, ‘wait’ or ‘recharge’. Writing out the maxima explicitly, equations (7) to (11) become:
$$
\begin{aligned}
q_*(s_h, a_s) &= r_{search} + \gamma \Bigl \{\alpha \max \bigl \{ q_*(s_h,a_s), q_*(s_h,a_w) \bigr \}+(1-\alpha) \max \bigl \{ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr \} \Bigr \} \qquad{(12)} \\
q_*(s_h, a_w) &= r_{wait} + \gamma \max \bigl \{ q_*(s_h,a_s), q_*(s_h,a_w) \bigr \} \qquad {(13)}\\
q_*(s_l, a_s) &= \beta\, r_{search} - 3(1-\beta) + \gamma \Bigl \{ (1-\beta)\max \bigl \{ q_*(s_h,a_s), q_*(s_h,a_w) \bigr \} + \beta \max \bigl \{ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr \} \Bigr \} \qquad{(14)} \\
q_*(s_l, a_w) &= r_{wait} + \gamma \max \bigl \{ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr \} \qquad{(15)} \\
q_*(s_l, a_r) &= \gamma \max \bigl \{ q_*(s_h,a_s), q_*(s_h,a_w) \bigr \} \qquad{(16)}
\end{aligned}
$$
Equations (12) to (16) are the Bellman optimality equations for $q_*$ for the recycling robot, and they can be solved in a way similar to Exercise 3.22.
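One numerical way to solve such a system, not necessarily the approach taken in Exercise 3.22, is fixed-point iteration: start from $q = 0$ and repeatedly apply the right-hand side of the Bellman optimality equation until the values stop changing. The sketch below does this for the recycling robot. The numeric values of $\alpha$, $\beta$, $\gamma$, $r_{search}$ and $r_{wait}$ are illustrative assumptions, and `solve_q` and `dynamics` are hypothetical names introduced here.

```python
# Illustrative parameter values (assumptions; the exercise keeps them symbolic).
alpha, beta, gamma = 0.75, 0.5, 0.9
r_search, r_wait = 2.0, 1.0

# (state, action) -> list of (next_state, probability, reward)
dynamics = {
    ("high", "search"):  [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
    ("high", "wait"):    [("high", 1.0, r_wait)],
    ("low", "search"):   [("high", 1 - beta, -3.0), ("low", beta, r_search)],
    ("low", "wait"):     [("low", 1.0, r_wait)],
    ("low", "recharge"): [("high", 1.0, 0.0)],
}


def solve_q(dynamics, gamma, tol=1e-12, max_iters=100_000):
    """Iterate the Bellman optimality backup for q until it converges."""
    q = {pair: 0.0 for pair in dynamics}
    for _ in range(max_iters):
        # Best action value in each state under the current estimate.
        v = {}
        for (s, _), value in q.items():
            v[s] = max(v.get(s, float("-inf")), value)
        # Apply the backup to every admissible state-action pair.
        new_q = {
            pair: sum(p * (r + gamma * v[s_next]) for s_next, p, r in outcomes)
            for pair, outcomes in dynamics.items()
        }
        delta = max(abs(new_q[pair] - q[pair]) for pair in q)
        q = new_q
        if delta < tol:
            break
    return q


q_star = solve_q(dynamics, gamma)
for pair, value in sorted(q_star.items()):
    print(pair, round(value, 4))
```

Since $\gamma < 1$ the backup is a contraction, so the iteration converges to the unique solution; the greedy action in each state can then be read off as the one with the largest $q_*$ value.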
