Exercise 3.19 The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.
Starting from the definition of the action value and expanding the return:

$$
\begin{aligned}
q_\pi(s,a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] \\
&= \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a \right] \\
&= \mathbb{E}_\pi\left[ R_{t+1} \mid S_t = s, A_t = a \right] + \gamma\, \mathbb{E}_\pi\left[ G_{t+1} \mid S_t = s, A_t = a \right].
\end{aligned}
$$

Once the action $a$ is fixed, the next reward and next state are determined by the environment dynamics alone, so the conditioning on $\pi$ can be dropped; and conditioning the second term on the next state gives $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$. This yields the first requested equation:

$$
q_\pi(s,a) = \mathbb{E}\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right].
$$

Writing the expectation out explicitly using the four-argument dynamics function $p(s', r \mid s, a)$ from (3.2) gives the second:

$$
q_\pi(s,a) = \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma\, v_\pi(s') \bigr].
$$
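The explicit-sum form can be checked numerically on a toy example. The sketch below assumes a hypothetical two-state MDP and arbitrary illustrative values for $v_\pi$; the state names, dynamics table, and numbers are inventions for illustration, not from the exercise.

```python
# Sketch: compute q_pi(s, a) = sum over (s', r) of p(s', r | s, a) * [r + gamma * v_pi(s')].
# The MDP below (states "s0"/"s1", action "go") is a made-up toy example.

gamma = 0.9

# Dynamics: p[(s, a)] is a list of (next_state, reward, probability) triples,
# i.e. an explicit tabulation of p(s', r | s, a).
p = {
    ("s0", "go"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "go"): [("s0", 0.0, 1.0)],
}

# Assumed state values v_pi (arbitrary numbers, standing in for a policy evaluation result).
v_pi = {"s0": 2.0, "s1": 3.0}

def q_pi(s, a):
    """Action value via the explicit sum over next states and rewards."""
    return sum(prob * (r + gamma * v_pi[s_next])
               for (s_next, r, prob) in p[(s, a)])

# 0.8 * (1 + 0.9 * 3) + 0.2 * (0 + 0.9 * 2) = 2.96 + 0.36 = 3.32
print(q_pi("s0", "go"))  # → 3.32
```

The dictionary-of-triples representation makes the correspondence with $\sum_{s',r} p(s',r \mid s,a)[r + \gamma v_\pi(s')]$ one-to-one: each triple is one $(s', r)$ term of the sum.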