Reinforcement Learning Exercise 5.4

Exercise 5.4 The pseudocode for Monte Carlo ES is inefficient because, for each state–action pair, it maintains a list of all returns and repeatedly calculates their mean. It would be more efficient to use techniques similar to those explained in Section 2.4 to maintain just the mean and a count (for each state–action pair) and update them incrementally. Describe how the pseudocode would be altered to achieve this.

The altered pseudocode is shown below. Instead of maintaining a list Returns(s, a) and repeatedly averaging it, we keep a counter counts(s, a) and move Q(s, a) incrementally toward each new return G, exactly as in the incremental mean update of Section 2.4:
$$
\begin{aligned}
&\text{Initialize:} \\
&\qquad \pi(s) \in \mathcal{A}(s) \text{ (arbitrarily), for all } s \in \mathcal{S} \\
&\qquad Q(s, a) \in \mathbb{R} \text{ (arbitrarily), for all } s \in \mathcal{S},\ a \in \mathcal{A}(s) \\
&\qquad counts(s, a) \leftarrow 0 \text{, for all } s \in \mathcal{S},\ a \in \mathcal{A}(s) \\
&\text{Loop forever (for each episode):} \\
&\qquad \text{Choose } S_0 \in \mathcal{S},\ A_0 \in \mathcal{A}(S_0) \text{ randomly such that all pairs have probability} > 0 \\
&\qquad \text{Generate an episode from } S_0, A_0 \text{, following } \pi\text{: } S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T \\
&\qquad G \leftarrow 0 \\
&\qquad \text{Loop for each step of episode, } t = T-1, T-2, \dots, 0\text{:} \\
&\qquad\qquad G \leftarrow \gamma G + R_{t+1} \\
&\qquad\qquad \text{Unless the pair } S_t, A_t \text{ appears in } S_0, A_0, S_1, A_1, \dots, S_{t-1}, A_{t-1}\text{:} \\
&\qquad\qquad\qquad counts(S_t, A_t) \leftarrow counts(S_t, A_t) + 1 \\
&\qquad\qquad\qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{G - Q(S_t, A_t)}{counts(S_t, A_t)} \\
&\qquad\qquad\qquad \pi(S_t) \leftarrow \operatorname{argmax}_a Q(S_t, a)
\end{aligned}
$$
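For concreteness, here is a minimal Python sketch of the same idea. The environment interface (`env.states`, `env.actions(s)`, `env.generate_episode(s0, a0, policy)`) is hypothetical and only illustrates where the incremental update replaces the list-of-returns bookkeeping:

```python
from collections import defaultdict
import random

def mc_es_incremental(env, num_episodes, gamma=1.0):
    """Monte Carlo ES with incremental Q updates (sketch).

    `env` is an assumed object exposing `states`, `actions(s)`, and
    `generate_episode(s0, a0, policy)` returning a list of
    (S_t, A_t, R_{t+1}) triples; these names are illustrative.
    """
    Q = defaultdict(float)     # Q(s, a), arbitrary initialization (here 0.0)
    counts = defaultdict(int)  # counts(s, a) <- 0
    policy = {s: random.choice(env.actions(s)) for s in env.states}

    for _ in range(num_episodes):
        # Exploring starts: pick a random state-action pair to begin the episode.
        s0 = random.choice(env.states)
        a0 = random.choice(env.actions(s0))
        episode = env.generate_episode(s0, a0, policy)

        # Index of the first visit of each pair, for the first-visit check.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            # Update only if (S_t, A_t) does not appear earlier in the episode.
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                # Incremental mean: no list of returns is stored.
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
                policy[s] = max(env.actions(s), key=lambda act: Q[(s, act)])
    return Q, policy
```

The only change from the book's version is that `counts` and the one-line incremental update stand in for appending G to Returns(S_t, A_t) and re-averaging the whole list.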
