Reinforcement Learning Exercise 7.4

Exercise 7.4 Prove that the n-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as
$$
G_{t:t+n} = Q_{t-1}(S_t,A_t) + \sum_{k=t}^{\min(t+n,T)-1} \gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k,A_k)\bigr]
$$
Proof:
First, $G_{t:t+n}$ can be written as a telescoping sum of differences between consecutive returns:
$$
\begin{aligned}
G_{t:t+n} &= G_{t:t+1} - G_{t:t+1} + G_{t:t+2} - G_{t:t+2} + \cdots + G_{t:t+n-2} - G_{t:t+n-2} + G_{t:t+n-1} - G_{t:t+n-1} + G_{t:t+n}\\
&= G_{t:t+1} + (G_{t:t+2} - G_{t:t+1}) + \cdots + (G_{t:t+n} - G_{t:t+n-1})\\
&= G_{t:t+1} + \sum_{i=2}^{n}(G_{t:t+i} - G_{t:t+i-1})
\end{aligned} \tag{1}
$$
According to the definition of the $n$-step Sarsa return (7.4),
$$
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \geq 1,\ 0 \leq t < T-n \tag{7.4}
$$
the reward terms up to $R_{t+n-1}$ cancel when two consecutive returns are subtracted, so for $n \geq 2$:
$$
\begin{aligned}
G_{t:t+n} - G_{t:t+n-1} &= \gamma^{n-1}R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}) - \gamma^{n-1} Q_{t+n-2}(S_{t+n-1}, A_{t+n-1}) \\
&= \gamma^{n-1}\bigl[R_{t+n} + \gamma Q_{t+n-1}(S_{t+n}, A_{t+n}) - Q_{t+n-2}(S_{t+n-1}, A_{t+n-1})\bigr]
\end{aligned} \tag{2}
$$
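Equation (2) can be sanity-checked numerically. The following is a minimal sketch, not part of the proof: the rewards and action values are random stand-ins, `Q[k][j]` is an assumed encoding of $Q_k(S_j, A_j)$, and the function names `sarsa_return` and `scaled_td_error` are hypothetical.

```python
import random

def sarsa_return(R, Q, t, n, gamma):
    """n-step Sarsa return (7.4); only valid while t + n < T."""
    rewards = sum(gamma ** (i - 1) * R[t + i] for i in range(1, n + 1))
    return rewards + gamma ** n * Q[t + n - 1][t + n]

def scaled_td_error(R, Q, t, n, gamma):
    """Right-hand side of equation (2)."""
    td = R[t + n] + gamma * Q[t + n - 1][t + n] - Q[t + n - 2][t + n - 1]
    return gamma ** (n - 1) * td

rng = random.Random(0)
T, gamma, t = 10, 0.9, 2

# R[1..T] are the rewards; R[0] is unused padding.
R = [rng.uniform(-1, 1) for _ in range(T + 1)]
# Assumed encoding: Q[k][j] stands for Q_k(S_j, A_j), filled with random values.
Q = [[rng.uniform(-1, 1) for _ in range(T + 1)] for _ in range(T + 1)]

# Absolute gap between the two sides of (2) for every n with t + n < T.
gaps = [
    abs(sarsa_return(R, Q, t, n, gamma)
        - sarsa_return(R, Q, t, n - 1, gamma)
        - scaled_td_error(R, Q, t, n, gamma))
    for n in range(2, T - t)
]
print(max(gaps))
```

The printed gap should be at the level of floating-point rounding, since the two sides of (2) agree for any choice of rewards and action values.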
For $n=1$, adding and subtracting $Q_{t-1}(S_t, A_t)$ gives:
$$
\begin{aligned}
G_{t:t+1} &= \gamma^0 R_{t+1} + \gamma^1 Q_{t+1-1}(S_{t+1}, A_{t+1}) \\
&= \gamma^0 R_{t+1} + \gamma^1 Q_{t+1-1}(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) + Q_{t-1}(S_t, A_t) \\
&= Q_{t-1}(S_t, A_t) + \gamma^0 \bigl[R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t)\bigr]
\end{aligned} \tag{3}
$$
Substituting equation (2), with $n$ replaced by $i$, and equation (3) into (1), we get:
$$
\begin{aligned}
G_{t:t+n} &= Q_{t-1}(S_t,A_t) + \gamma^0 \bigl[R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t)\bigr] \\
&\quad + \sum_{i=2}^{n} \gamma^{i-1}\bigl[R_{t+i} + \gamma Q_{t+i-1}(S_{t+i}, A_{t+i}) - Q_{t+i-2}(S_{t+i-1}, A_{t+i-1})\bigr] \\
&= Q_{t-1}(S_t,A_t) + \sum_{i=1}^{n} \gamma^{i-1}\bigl[R_{t+i} + \gamma Q_{t+i-1}(S_{t+i}, A_{t+i}) - Q_{t+i-2}(S_{t+i-1}, A_{t+i-1})\bigr]
\end{aligned} \tag{4}
$$
Let $k = i+t-1$, so that $i = k-t+1$ and $k$ runs from $t$ to $t+n-1$. Equation (4) can then be written as:
$$
G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{t+n-1}\gamma^{k-t}\bigl[R_{k+1} + \gamma Q_{k}(S_{k+1}, A_{k+1}) - Q_{k-1}(S_{k}, A_{k})\bigr] \tag{5}
$$
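Equation (5) was derived for $t+n < T$. If $t+n \geq T$, the return is truncated at termination ($G_{t:t+n} \doteq G_t$) and the action value of the terminal state is zero, so the same telescoping argument holds with the sum stopping at $k = T-1$. Both cases are covered by writing the upper limit as $\min(t+n,T)-1$: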
$$
G_{t:t+n} = Q_{t-1}(S_t,A_t) + \sum_{k=t}^{\min(t+n,T)-1} \gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k,A_k)\bigr]
$$
PROVED.
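
As a numerical cross-check of the final identity, including the truncated case $t+n \geq T$, the sketch below compares both sides on a random episode. It makes several assumptions of its own: the rewards and action values are random stand-ins, the time-indexed values $Q_k(S_j, A_j)$ are stored in a plain table with the terminal column set to zero, and all function names (`make_episode`, `n_step_return`, `td_error_form`, `max_gap`) are hypothetical.

```python
import random

def make_episode(T, seed=0):
    """Random rewards R[1..T] and a lookup Q(k, j) standing for Q_k(S_j, A_j)."""
    rng = random.Random(seed)
    R = [0.0] + [rng.uniform(-1, 1) for _ in range(T)]  # R[0] is unused padding
    # table[k + 1][j] holds Q_k(S_j, A_j) for k = -1..T-1 and j = 0..T.
    # The terminal column j = T is zero, as for any terminal state.
    table = [
        [rng.uniform(-1, 1) if j < T else 0.0 for j in range(T + 1)]
        for _ in range(T + 1)
    ]

    def Q(k, j):
        return table[k + 1][j]

    return R, Q

def n_step_return(R, Q, t, n, T, gamma):
    """n-step Sarsa return (7.4); equals the full return G_t when t + n >= T."""
    h = min(t + n, T)
    g = sum(gamma ** (k - t) * R[k + 1] for k in range(t, h))
    if t + n < T:
        g += gamma ** n * Q(t + n - 1, t + n)  # bootstrap only before termination
    return g

def td_error_form(R, Q, t, n, T, gamma):
    """Q_{t-1}(S_t, A_t) plus the discounted sum of the TD errors."""
    h = min(t + n, T)
    return Q(t - 1, t) + sum(
        gamma ** (k - t) * (R[k + 1] + gamma * Q(k, k + 1) - Q(k - 1, k))
        for k in range(t, h)
    )

def max_gap(T=8, gamma=0.9, seed=0):
    """Largest absolute difference between the two forms over all t and n."""
    R, Q = make_episode(T, seed)
    return max(
        abs(n_step_return(R, Q, t, n, T, gamma) - td_error_form(R, Q, t, n, T, gamma))
        for t in range(T)
        for n in range(1, T + 3)  # n also runs past the end of the episode
    )

print(max_gap())
```

The gap should again be at the level of floating-point rounding. Letting `n` run past `T - t` exercises the `min(t+n, T)` upper limit, which is the case that relies on the terminal action value being zero.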
