Reinforcement Learning Exercise 3.22

Exercise 3.22 Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{left}$ and $\pi_{right}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
[Figure: the continuing MDP for Exercise 3.22]
Before solving this problem, we first derive an expression for $q_*(s,a)$ in terms of $R_{s,s'}^a$ and $P_{s,s'}^a$.
First,
$$
\begin{aligned}
q_*(s,a) &= \mathbb{E}\bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\bigr] \\
&= \sum_{s',r} p(s',r \mid s,a) \bigl[ r + \gamma \max_{a'} q_*(s',a') \bigr] \\
&= \sum_{s',r} r\, p(s',r \mid s,a) + \sum_{s',r} p(s',r \mid s,a)\, \gamma \max_{a'} q_*(s',a') \\
&= \sum_{r} r\, p(r \mid s,a) + \sum_{s'} p(s' \mid s,a)\, \gamma \max_{a'} q_*(s',a') \\
&= \mathbb{E}[r \mid s,a] + \sum_{s'} p(s' \mid s,a)\, \gamma \max_{a'} q_*(s',a') \\
&= \sum_{s'} \mathbb{E}[r \mid s',s,a]\, p(s' \mid s,a) + \sum_{s'} p(s' \mid s,a)\, \gamma \max_{a'} q_*(s',a') \\
&= \sum_{s'} \bigl[ \mathbb{E}[r \mid s',s,a] + \gamma \max_{a'} q_*(s',a') \bigr] p(s' \mid s,a)
\end{aligned}
$$
Denoting $\mathbb{E}[r \mid s',s,a] = R_{s,s'}^a$ and $p(s' \mid s,a) = P_{s,s'}^a$, we get the expression we wanted:
$$
q_*(s,a) = \sum_{s'} \bigl[ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr] P_{s,s'}^a \tag{1}
$$
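As a side note, equation (1) maps directly onto code. Below is a minimal sketch of one sweep of that backup, assuming an encoding (ours, not the exercise's) of the MDP as NumPy arrays `P[s, a, s2]` $= P_{s,s'}^{a}$ and `R[s, a, s2]` $= R_{s,s'}^{a}$:

```python
import numpy as np

def bellman_optimality_backup(q, P, R, gamma):
    """One sweep of the q_* backup from equation (1).

    q[s, a]     : current action-value estimates
    P[s, a, s2] : transition probabilities p(s2 | s, a)
    R[s, a, s2] : expected rewards R_{s,s'}^a
    (array names are illustrative, not from the original post)
    """
    n_states, n_actions = q.shape
    q_new = np.zeros_like(q)
    for s in range(n_states):
        for a in range(n_actions):
            # q_*(s,a) = sum_{s'} [R + gamma * max_{a'} q(s',a')] * P
            q_new[s, a] = sum(
                P[s, a, s2] * (R[s, a, s2] + gamma * q[s2].max())
                for s2 in range(n_states)
            )
    return q_new
```

Iterating this sweep to a fixed point is value iteration on $q_*$; we reuse it after the $\gamma = 0.9$ case to double-check all three answers numerically.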
Next, we label the three states in the circles $s_A$, $s_B$, and $s_C$, and denote the left action by $a_l$ and the right action by $a_r$.
[Figure: the same MDP with the top state labeled $s_A$, the left state $s_B$, and the right state $s_C$]
According to equation (1), we can write the Bellman optimality equations for $q_*$ at the three states.
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \Bigl[ R_{s_A,s_B}^{a_l} + \gamma \max_{a'} q_*(s_B, a') \Bigr] P_{s_A,s_B}^{a_l} + \Bigl[ R_{s_A,s_C}^{a_l} + \gamma \max_{a'} q_*(s_C, a') \Bigr] P_{s_A,s_C}^{a_l} \\
&= \bigl[ R_{s_A,s_B}^{a_l} + \gamma q_*(s_B, a) \bigr] P_{s_A,s_B}^{a_l} + \bigl[ R_{s_A,s_C}^{a_l} + \gamma q_*(s_C, a) \bigr] P_{s_A,s_C}^{a_l} \\
q_{*,\pi_{right}}(s_A, a_r) &= \Bigl[ R_{s_A,s_B}^{a_r} + \gamma \max_{a'} q_*(s_B, a') \Bigr] P_{s_A,s_B}^{a_r} + \Bigl[ R_{s_A,s_C}^{a_r} + \gamma \max_{a'} q_*(s_C, a') \Bigr] P_{s_A,s_C}^{a_r} \\
&= \bigl[ R_{s_A,s_B}^{a_r} + \gamma q_*(s_B, a) \bigr] P_{s_A,s_B}^{a_r} + \bigl[ R_{s_A,s_C}^{a_r} + \gamma q_*(s_C, a) \bigr] P_{s_A,s_C}^{a_r} \\
q_*(s_B, a) &= \Bigl[ R_{s_B,s_A}^{a} + \gamma \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] P_{s_B,s_A}^{a} \\
q_*(s_C, a) &= \Bigl[ R_{s_C,s_A}^{a} + \gamma \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] P_{s_C,s_A}^{a}
\end{aligned}
$$

(The max over $a'$ at $s_B$ and $s_C$ disappears because each of those states has only the single action $a$.)

Since $P_{s_A,s_B}^{a_r} = 0$ and $P_{s_A,s_C}^{a_l} = 0$, it follows that

$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ R_{s_A,s_B}^{a_l} + \gamma q_*(s_B, a) \bigr] P_{s_A,s_B}^{a_l} \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ R_{s_A,s_C}^{a_r} + \gamma q_*(s_C, a) \bigr] P_{s_A,s_C}^{a_r}
\end{aligned}
$$
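Plugging in the numbers from the figure makes the case analysis below easier to follow: every transition along either policy is deterministic ($P_{s_A,s_B}^{a_l} = P_{s_A,s_C}^{a_r} = P_{s_B,s_A}^{a} = P_{s_C,s_A}^{a} = 1$), and the deterministic rewards are $+1$ for left, $0$ for right, $0$ from $s_B$, and $+2$ from $s_C$. The system therefore reduces to:

$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + \gamma\, q_*(s_B, a) \\
q_{*,\pi_{right}}(s_A, a_r) &= 0 + \gamma\, q_*(s_C, a) \\
q_*(s_B, a) &= 0 + \gamma \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_*(s_C, a) &= 2 + \gamma \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr)
\end{aligned}
$$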
Now, let's work through the cases for different values of $\gamma$.
For $\gamma = 0$:
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0 \cdot q_*(s_B, a) \bigr] \cdot 1 = 1 \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0 \cdot q_*(s_C, a) \bigr] \cdot 1 = 0
\end{aligned}
$$
So $\pi_{left}$ is the optimal policy when $\gamma = 0$.

For $\gamma = 0.5$:
$$
\begin{aligned}
q_*(s_B, a) &= \Bigl[ 0 + 0.5 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] \cdot 1 \\
&= 0.5 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_*(s_C, a) &= \Bigl[ 2 + 0.5 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] \cdot 1 \\
&= 2 + 0.5 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0.5\, q_*(s_B, a) \bigr] \cdot 1 = 1 + 0.5\, q_*(s_B, a) \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0.5\, q_*(s_C, a) \bigr] \cdot 1 = 0.5\, q_*(s_C, a)
\end{aligned}
$$
Assume $q_{*,\pi_{left}}(s_A, a_l) \ge q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.5\, q_{*,\pi_{left}}(s_A, a_l) \\
q_*(s_C, a) &= 2 + 0.5\, q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{4}{3} \\
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl[ 2 + 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr] = 0.5 \cdot \frac{8}{3} \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{4}{3}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r) = \frac{4}{3}$, so the assumption holds, but only with equality: the left action is not strictly better.
Conversely, assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.5\, q_{*,\pi_{right}}(s_A, a_r) \\
q_*(s_C, a) &= 2 + 0.5\, q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl[ 2 + 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{4}{3} \\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{4}{3}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r)$ again, so this assumption also holds with equality. Both $\pi_{left}$ and $\pi_{right}$ are therefore optimal policies for $\gamma = 0.5$.
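As a cross-check (our addition, not part of the original solution), the value of $s_A$ under each deterministic policy can also be computed directly as a geometric series: $\pi_{left}$ earns the reward stream $1, 0, 1, 0, \dots$ and $\pi_{right}$ earns $0, 2, 0, 2, \dots$, so

$$
v_{\pi_{left}}(s_A) = \sum_{k=0}^{\infty} \gamma^{2k} = \frac{1}{1-\gamma^2}, \qquad
v_{\pi_{right}}(s_A) = 2\gamma \sum_{k=0}^{\infty} \gamma^{2k} = \frac{2\gamma}{1-\gamma^2}
$$

The two policies tie exactly when $2\gamma = 1$, i.e. $\gamma = 0.5$, where both values equal $\frac{4}{3}$; for $\gamma < 0.5$ the left policy wins and for $\gamma > 0.5$ the right policy wins, consistent with all three cases treated here.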

For $\gamma = 0.9$:
$$
\begin{aligned}
q_*(s_B, a) &= \Bigl[ 0 + 0.9 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] \cdot 1 \\
&= 0.9 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_*(s_C, a) &= \Bigl[ 2 + 0.9 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] \cdot 1 \\
&= 2 + 0.9 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\; q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0.9\, q_*(s_B, a) \bigr] \cdot 1 = 1 + 0.9\, q_*(s_B, a) \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0.9\, q_*(s_C, a) \bigr] \cdot 1 = 0.9\, q_*(s_C, a)
\end{aligned}
$$
Assume $q_{*,\pi_{left}}(s_A, a_l) \ge q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.9\, q_{*,\pi_{left}}(s_A, a_l) \\
q_*(s_C, a) &= 2 + 0.9\, q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{100}{19} = \frac{500}{95} \\
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl[ 2 + 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{576}{95}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) < q_{*,\pi_{right}}(s_A, a_r)$, which contradicts the assumption, so the assumption fails.
Now assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.9\, q_{*,\pi_{right}}(s_A, a_r) \\
q_*(s_C, a) &= 2 + 0.9\, q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl[ 2 + 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{180}{19} \\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{1648}{190}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) < q_{*,\pi_{right}}(s_A, a_r)$, which is consistent with the assumption. So $\pi_{right}$ is the optimal policy for $\gamma = 0.9$.
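As a final sanity check (our addition, not part of the original solution), we can encode this particular MDP in the `P` and `R` arrays assumed by the `bellman_optimality_backup` sketch above and iterate it to convergence for each discount factor; the state and action indices below are an arbitrary choice:

```python
import numpy as np

# Hypothetical encoding: states A=0, B=1, C=2; actions left=0, right=1.
# s_B and s_C each really have a single action, so we duplicate their
# transition across both action slots to keep the arrays rectangular.
P = np.zeros((3, 2, 3))  # P[s, a, s2] = p(s2 | s, a)
R = np.zeros((3, 2, 3))  # R[s, a, s2] = reward for the s --a--> s2 step
P[0, 0, 1] = 1.0; R[0, 0, 1] = 1.0   # s_A --left-->  s_B, reward +1
P[0, 1, 2] = 1.0; R[0, 1, 2] = 0.0   # s_A --right--> s_C, reward  0
P[1, :, 0] = 1.0; R[1, :, 0] = 0.0   # s_B ---------> s_A, reward  0
P[2, :, 0] = 1.0; R[2, :, 0] = 2.0   # s_C ---------> s_A, reward +2

for gamma in (0.0, 0.5, 0.9):
    q = np.zeros((3, 2))
    for _ in range(500):                 # plenty of sweeps to converge here
        q = bellman_optimality_backup(q, P, R, gamma)
    print(f"gamma={gamma}: q*(s_A, left)={q[0, 0]:.4f}, "
          f"q*(s_A, right)={q[0, 1]:.4f}")
```

This should print approximately $1.0000$ vs $0.0000$ for $\gamma = 0$, $1.3333$ vs $1.3333$ for $\gamma = 0.5$, and $8.6737 = \frac{1648}{190}$ vs $9.4737 = \frac{180}{19}$ for $\gamma = 0.9$, matching the closed-form results above.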
