Reinforcement Learning Exercise 5.4

Exercise 5.4 The pseudocode for Monte Carlo ES is inefficient because, for each state–action pair, it maintains a list of all returns and repeatedly calculates their mean. It would be more efficient to use techniques similar to those explained in Section 2.4 to maintain just the mean and a count (for each state–action pair) and update them incrementally. Describe how the pseudocode would be altered to achieve this.

The altered pseudocode is shown below. Instead of maintaining a list Returns(s, a) and repeatedly averaging it, we keep a counter counts(s, a) and move Q(s, a) incrementally toward each new return G, exactly as in the incremental mean update of Section 2.4:
$$
\begin{aligned}
&\text{Initialize:} \\
&\qquad \pi(s) \in \mathcal{A}(s) \text{ (arbitrarily), for all } s \in \mathcal{S} \\
&\qquad Q(s, a) \in \mathbb{R} \text{ (arbitrarily), for all } s \in \mathcal{S},\ a \in \mathcal{A}(s) \\
&\qquad counts(s, a) \leftarrow 0 \text{, for all } s \in \mathcal{S},\ a \in \mathcal{A}(s) \\
&\text{Loop forever (for each episode):} \\
&\qquad \text{Choose } S_0 \in \mathcal{S},\ A_0 \in \mathcal{A}(S_0) \text{ randomly such that all pairs have probability} > 0 \\
&\qquad \text{Generate an episode from } S_0, A_0 \text{, following } \pi\text{: } S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T \\
&\qquad G \leftarrow 0 \\
&\qquad \text{Loop for each step of episode, } t = T-1, T-2, \dots, 0\text{:} \\
&\qquad\qquad G \leftarrow \gamma G + R_{t+1} \\
&\qquad\qquad \text{Unless the pair } S_t, A_t \text{ appears in } S_0, A_0, S_1, A_1, \dots, S_{t-1}, A_{t-1}\text{:} \\
&\qquad\qquad\qquad counts(S_t, A_t) \leftarrow counts(S_t, A_t) + 1 \\
&\qquad\qquad\qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{G - Q(S_t, A_t)}{counts(S_t, A_t)} \\
&\qquad\qquad\qquad \pi(S_t) \leftarrow \operatorname{argmax}_a Q(S_t, a)
\end{aligned}
$$
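For concreteness, here is a minimal Python sketch of the same idea. The environment interface (`env.states`, `env.actions(s)`, `env.generate_episode(s0, a0, policy)`) is hypothetical and only illustrates where the incremental update replaces the list-of-returns bookkeeping:

```python
from collections import defaultdict
import random

def mc_es_incremental(env, num_episodes, gamma=1.0):
    """Monte Carlo ES with incremental Q updates (sketch).

    `env` is an assumed object exposing `states`, `actions(s)`, and
    `generate_episode(s0, a0, policy)` returning a list of
    (S_t, A_t, R_{t+1}) triples; these names are illustrative.
    """
    Q = defaultdict(float)     # Q(s, a), arbitrary initialization (here 0.0)
    counts = defaultdict(int)  # counts(s, a) <- 0
    policy = {s: random.choice(env.actions(s)) for s in env.states}

    for _ in range(num_episodes):
        # Exploring starts: pick a random state-action pair to begin the episode.
        s0 = random.choice(env.states)
        a0 = random.choice(env.actions(s0))
        episode = env.generate_episode(s0, a0, policy)

        # Index of the first visit of each pair, for the first-visit check.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            # Update only if (S_t, A_t) does not appear earlier in the episode.
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                # Incremental mean: no list of returns is stored.
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
                policy[s] = max(env.actions(s), key=lambda act: Q[(s, act)])
    return Q, policy
```

The only change from the book's version is that `counts` and the one-line incremental update stand in for appending G to Returns(S_t, A_t) and re-averaging the whole list.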
