Reinforcement Learning - Exercise 3.17

Exercise 3.17 What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s,a)$ in terms of the action values, $q_\pi(s',a')$, of possible successors to the state–action pair $(s,a)$. Hint: the backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
[Backup diagram for $q_\pi$]

According to the definition:
$$\begin{aligned} Q_\pi(s,a) &= \mathbb E_\pi\bigl[G_t \mid S_t=s, A_t=a\bigr] \\ &= \mathbb E_\pi\Bigl[\sum_{k=0}^\infty \gamma^k R_{t+k+1} \Bigm| S_t=s, A_t=a\Bigr] \\ &= \sum_{s'} \mathbb E_\pi\Bigl[\sum_{k=0}^\infty \gamma^k R_{t+k+1} \Bigm| S_t=s, A_t=a, S_{t+1}=s'\Bigr] P(S_{t+1}=s' \mid A_t=a, S_t=s) \\ &= \sum_{s'} \Bigl\{ \mathbb E_\pi\bigl[R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s'\bigr] + \mathbb E_\pi\Bigl[\sum_{k=1}^\infty \gamma^k R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s'\Bigr] \Bigr\} P(S_{t+1}=s' \mid A_t=a, S_t=s) \end{aligned}$$
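The first line above defines $q_\pi(s,a)$ as the expected discounted return after taking action $a$ in state $s$ and following $\pi$ thereafter. As a sanity check of that definition, here is a minimal Monte Carlo sketch on a hypothetical two-state MDP; the dynamics in `step`, the policy, and all names are illustrative assumptions, not part of the exercise:

```python
import random

gamma = 0.9  # discount factor

def step(s, a):
    """Sample (reward, next_state) from hypothetical dynamics p(s', r | s, a)."""
    if random.random() < 0.7:      # reward 1 with probability 0.7
        return 1.0, (s + a) % 2
    return 0.0, s

def policy(s):
    """Equiprobable random policy pi(a|s)."""
    return random.choice([0, 1])

def mc_estimate_q(s, a, episodes=5000, horizon=50):
    """Monte Carlo estimate of q_pi(s,a) = E[sum_k gamma^k R_{t+k+1}]."""
    total = 0.0
    for _ in range(episodes):
        g, discount = 0.0, 1.0
        r, s_next = step(s, a)              # first action is the given a
        g += discount * r
        for _ in range(horizon - 1):        # follow pi afterwards
            discount *= gamma
            r, s_next = step(s_next, policy(s_next))
            g += discount * r
        total += g
    return total / episodes
```

Because the reward here is 1 with probability 0.7 at every step regardless of state, the estimate should approach $0.7\sum_{k=0}^{49}\gamma^k \approx 6.96$ for any $(s,a)$.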
Denote
$$P(S_{t+1}=s' \mid A_t=a, S_t=s) = P_{ss'}^a$$
$$\mathbb E_\pi\bigl[R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s'\bigr] = R_{ss'}^a$$
then:
$$\begin{aligned} Q_\pi(s,a) &= \sum_{s'} R_{ss'}^a P_{ss'}^a + \sum_{s'} \mathbb E\Bigl[\sum_{k=1}^\infty \gamma^k R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s'\Bigr] P_{ss'}^a \\ &= \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \mathbb E\Bigl[\sum_{k=1}^\infty \gamma^{k-1} R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s'\Bigr] P_{ss'}^a \\ &= \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \mathbb E\Bigl[\sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s'\Bigr] P_{ss'}^a \\ &= \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \mathbb E\Bigl[\sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s'\Bigr] P_{ss'}^a \\ &= \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \sum_{a'} \mathbb E\Bigl[\sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s', A_{t+1}=a'\Bigr] P(A_{t+1}=a' \mid S_{t+1}=s')\, P_{ss'}^a \end{aligned}$$
where the fourth step drops the conditioning on $S_t$ and $A_t$ by the Markov property, and the last step conditions on the next action $A_{t+1}=a'$.
According to the definition,
$$\mathbb E\Bigl[\sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s', A_{t+1}=a'\Bigr] = Q_\pi(s',a'), \qquad P(A_{t+1}=a' \mid S_{t+1}=s') = \pi(s',a')$$
so
$$Q_\pi(s,a) = \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \Bigl[\sum_{a'} Q_\pi(s',a')\,\pi(s',a')\Bigr] P_{ss'}^a$$
This is the Bellman equation for action values.
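The final equation can be checked numerically: since it is linear in $Q_\pi$, one can solve for $q_\pi$ exactly on a small MDP and confirm both sides agree. A minimal sketch, assuming a hypothetical random MDP (the sizes, the random dynamics `P`, `R`, and the policy `pi` are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9          # hypothetical toy MDP sizes

# P[s, a, s'] = transition probability, R[s, a, s'] = expected reward R^a_{ss'}
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))

# A fixed stochastic policy pi[s, a]
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)

# The Bellman equation q = r + gamma * T q is linear, so solve it directly:
# r(s,a)       = sum_s' P^a_{ss'} R^a_{ss'}          (expected one-step reward)
# T[(s,a),(s',a')] = P^a_{ss'} pi(s',a')             (state-action transition)
r = (P * R).sum(axis=2).reshape(nS * nA)
T = np.einsum('kas,sb->kasb', P, pi).reshape(nS * nA, nS * nA)
q = np.linalg.solve(np.eye(nS * nA) - gamma * T, r).reshape(nS, nA)

# Verify the derived Bellman equation term by term:
# q(s,a) = sum_s' P^a_{ss'} R^a_{ss'} + gamma * sum_s' [sum_a' q(s',a') pi(s',a')] P^a_{ss'}
v_next = (pi * q).sum(axis=1)               # sum_a' pi(s',a') q(s',a')
rhs = (P * R).sum(axis=2) + gamma * (P * v_next).sum(axis=2)
print(np.allclose(q, rhs))                  # True if the equation holds
```

Solving the linear system plays the role of an exact "policy evaluation" step; iterating the right-hand side from any initial `q` would converge to the same fixed point.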
