Reinforcement Learning Exercise 5.9

YeXiang\^-^/

Article link: https://blog.csdn.net/ballade2012/article/details/98671094

Exercise 5.9 Modify the algorithm for first-visit MC policy evaluation (Section 5.1) to use the incremental implementation for sample averages described in Section 2.4.
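
For reference, the incremental sample-average rule keeps only the current estimate and a counter instead of a list of returns. Writing $G_n$ for the $n$-th return observed for a state and $V_n$ for the estimate built from the first $n-1$ returns (this indexing is an assumption made here for illustration), the update can be derived as:

$\begin{aligned} V_{n+1} &= \frac{1}{n} \sum_{i=1}^{n} G_i \\ &= \frac{1}{n} \Bigl[ G_n + (n-1) V_n \Bigr] \\ &= V_n + \frac{1}{n} \bigl[ G_n - V_n \bigr] \end{aligned}$

When each return carries a weight $W_n$, as in weighted importance sampling, the step size $\frac{1}{n}$ becomes $\frac{W_n}{C_n}$, where $C_n$ is the cumulative sum of the weights up to and including $W_n$.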

The modified algorithm, written here for the general off-policy case with weighted importance sampling (target policy $\pi$, behavior policy $b$), looks like this:
$\begin{aligned}
&\text{Input: an arbitrary target policy }\pi \\
&\text{Initialize, for all }s \in \mathcal S, a \in \mathcal A(s): \\
& \qquad Q(s,a) \in \mathbb R \text{ (arbitrarily)} \\
& \qquad C(s,a) \leftarrow 0 \\
&\text{Loop forever (for each episode):} \\
& \qquad b \leftarrow \text{any policy with coverage of } \pi \\
& \qquad \text{Generate an episode following } b \text{: } S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T \\
& \qquad G \leftarrow 0 \\
& \qquad W \leftarrow 1 \\
& \qquad \text{Loop for each step of episode, } t=T-1,T-2,\ldots,0, \text{ while } W \neq 0: \\
& \qquad \qquad G \leftarrow \gamma G + R_{t+1} \\
& \qquad \qquad \text{Unless the pair } S_t, A_t \text{ appears in } S_0, A_0, S_1, A_1, \ldots , S_{t-1}, A_{t-1}:\\
& \qquad \qquad \qquad C(S_t, A_t) \leftarrow C(S_t, A_t) + W \\
& \qquad \qquad \qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac {W}{C(S_t, A_t)} \bigl [ G - Q(S_t, A_t)\bigr] \\
& \qquad \qquad W \leftarrow W \frac {\pi(A_t \mid S_t)}{b(A_t \mid S_t)} \\
\end{aligned}$

Note that $C$ and $Q$ are updated only on the first visit to a pair, so that the accumulated weights match the returns actually averaged, while $W$ is updated at every step because the importance ratio is a product over all later time steps.
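
For the on-policy setting that the exercise statement refers to (first-visit MC policy evaluation of state values, with episodes generated by $\pi$ itself), no importance weights are needed and the update is the plain incremental sample average. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: the function name `first_visit_mc_incremental`, the representation of an episode as a list of `(state, reward)` pairs, and the initial value 0 for every state are all choices made here for illustration.

```python
from collections import defaultdict


def first_visit_mc_incremental(episodes, gamma=1.0):
    """Estimate V(s) with incremental first-visit Monte Carlo.

    Assumed input format: each episode is a list of (state, reward) pairs,
    where reward is R_{t+1}, the reward received after leaving S_t.
    """
    V = defaultdict(float)  # value estimates, initialised to 0 (arbitrary)
    N = defaultdict(int)    # number of first visits recorded per state

    for episode in episodes:
        # Record the earliest time step at which each state occurs.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)

        G = 0.0
        # Walk backwards so that G is the discounted return from time t.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:
                # Incremental sample average: no list of returns is stored.
                N[state] += 1
                V[state] += (G - V[state]) / N[state]

    return dict(V)
```

Only `V` and `N` are kept between episodes, which is the point of the incremental form: memory no longer grows with the number of returns observed per state.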
