Reinforcement Learning Exercise 5.9

Exercise 5.9 Modify the algorithm for first-visit MC policy evaluation (Section 5.1) to use the incremental implementation for sample averages described in Section 2.4.
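For reference, the incremental implementation of sample averages in Section 2.4 avoids storing all returns by maintaining a running mean. For an ordinary (unweighted) average of returns $G_1, \dots, G_n$ the update is

$$
V_{n+1} = V_n + \frac{1}{n}\bigl[G_n - V_n\bigr],
$$

and for a weighted average with weights $W_i$ and cumulative weight $C_n = \sum_{i=1}^{n} W_i$ (the form derived in Exercise 5.10 and used in the pseudocode below) it becomes

$$
Q_{n+1} = Q_n + \frac{W_n}{C_n}\bigl[G_n - Q_n\bigr].
$$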

The modified algorithm is as follows (written here in the off-policy, weighted-importance-sampling form, with the first-visit check applied to state–action pairs):
$$
\begin{aligned}
&\text{Input: an arbitrary target policy } \pi \\
&\text{Initialize, for all } s \in \mathcal{S},\ a \in \mathcal{A}(s): \\
&\qquad Q(s,a) \in \mathbb{R} \ (\text{arbitrarily}) \\
&\qquad C(s,a) \leftarrow 0 \\
&\text{Loop forever (for each episode):} \\
&\qquad b \leftarrow \text{any policy with coverage of } \pi \\
&\qquad \text{Generate an episode following } b:\ S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T \\
&\qquad G \leftarrow 0 \\
&\qquad W \leftarrow 1 \\
&\qquad \text{Loop for each step of episode, } t = T-1, T-2, \dots, 0, \text{ while } W \neq 0: \\
&\qquad\qquad G \leftarrow \gamma G + R_{t+1} \\
&\qquad\qquad \text{Unless the pair } (S_t, A_t) \text{ appears in } S_0, A_0, S_1, A_1, \dots, S_{t-1}, A_{t-1}: \\
&\qquad\qquad\qquad C(S_t, A_t) \leftarrow C(S_t, A_t) + W \\
&\qquad\qquad\qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)}\bigl[G - Q(S_t, A_t)\bigr] \\
&\qquad\qquad W \leftarrow W\,\frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}
\end{aligned}
$$

Note that the cumulative weight $C(S_t, A_t)$ is only incremented at first visits, so that it stays consistent with the weighted average maintained in $Q(S_t, A_t)$, while the importance-sampling ratio $W$ is updated at every step, since the weight for earlier time steps must include the ratio $\pi(A_t \mid S_t)/b(A_t \mid S_t)$ of every later step.
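Below is a minimal executable sketch of the pseudocode above, assuming episodes are given as lists of $(S_t, A_t, R_{t+1})$ tuples and that the policies are plain functions returning $\pi(a \mid s)$ and $b(a \mid s)$. All names here (`evaluate_episode`, `target_policy`, `behavior_policy`) are illustrative, not from the book.

```python
from collections import defaultdict

def evaluate_episode(episode, Q, C, target_policy, behavior_policy, gamma=1.0):
    """One backward pass of incremental first-visit MC prediction with
    weighted importance sampling. Q and C are dicts keyed by (state, action)."""
    # Earliest index of each pair, so the "unless the pair appears in
    # S_0, A_0, ..., S_{t-1}, A_{t-1}" test is an O(1) lookup per step.
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)

    G, W = 0.0, 1.0
    for t in range(len(episode) - 1, -1, -1):   # t = T-1, T-2, ..., 0
        s, a, r = episode[t]
        G = gamma * G + r                        # G <- gamma*G + R_{t+1}
        if first_visit[(s, a)] == t:             # first visit of (s, a)
            C[(s, a)] += W                       # cumulative sum of weights
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
        W *= target_policy(a, s) / behavior_policy(a, s)
        if W == 0.0:                             # all remaining updates vanish
            break

# Hypothetical usage on a toy two-action problem:
Q, C = defaultdict(float), defaultdict(float)
pi = lambda a, s: 0.9 if a == 0 else 0.1   # stochastic target policy
b  = lambda a, s: 0.5                      # uniform behavior policy (covers pi)
episode = [(0, 0, 1.0), (1, 1, 0.0), (0, 0, 2.0)]  # (S_t, A_t, R_{t+1})
evaluate_episode(episode, Q, C, pi, b)
print(dict(Q))
```

Precomputing the first-visit indices avoids rescanning the episode prefix at every step, and the `W` update sits outside the first-visit test so the ratio product accumulates over every step, matching the pseudocode.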
