Reinforcement Learning Exercise 4.4


YeXiang\^-^/


Category: reinforcement learning. Tags: reinforcement learning

Article link: https://blog.csdn.net/ballade2012/article/details/93242932


Exercise 4.4 The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is OK for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed.
$\begin{aligned}
&\text{1. Initialization} \\
&\qquad V(s) \in \mathbb R \text{ and } \pi(s) \in \mathcal A(s) \text{ arbitrarily for all } s \in \mathcal S \\
&\qquad \text{For each } s \in \mathcal S \text{ create an empty list: } old\_list\_of\_a(s) \\
&\qquad \text{For each } s \in \mathcal S \text{ create an iterator: } iterator\_old\_list\_of\_a(s) \\
&\text{2. Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad \qquad \Delta \leftarrow 0 \\
&\qquad \qquad \text{Loop for each } s \in \mathcal S: \\
&\qquad \qquad \qquad v \leftarrow V(s) \\
&\qquad \qquad \qquad V(s) \leftarrow \sum_{s',r} p(s',r \mid s,\pi(s)) \Bigl[ r + \gamma V(s') \Bigr] \\
&\qquad \qquad \qquad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3. Policy Improvement} \\
&\qquad policy\text{-}stable \leftarrow true \\
&\qquad \text{For each } s \in \mathcal S: \\
&\qquad \qquad V_{max}(s) \leftarrow \max_a \sum_{s',r} p(s',r \mid s,a) \Bigl[ r + \gamma V(s') \Bigr] \\
&\qquad \qquad \text{Create an empty list: } new\_list\_of\_a(s) \\
&\qquad \qquad \text{For each } a \in \mathcal A(s): \\
&\qquad \qquad \qquad \text{If } \sum_{s',r} p(s',r \mid s,a) \Bigl[ r + \gamma V(s') \Bigr] \text{ is equal to } V_{max}(s): \\
&\qquad \qquad \qquad \qquad \text{Append } a \text{ to } new\_list\_of\_a(s) \\
&\qquad \qquad \text{If } new\_list\_of\_a(s) \text{ is not equal to } old\_list\_of\_a(s): \\
&\qquad \qquad \qquad policy\text{-}stable \leftarrow false \\
&\qquad \qquad \qquad old\_list\_of\_a(s) \leftarrow new\_list\_of\_a(s) \\
&\qquad \qquad \qquad iterator\_old\_list\_of\_a(s) \leftarrow \text{the beginning of } old\_list\_of\_a(s) \\
&\qquad \qquad \text{else:} \\
&\qquad \qquad \qquad \text{If } old\_list\_of\_a(s) \text{ is empty:} \\
&\qquad \qquad \qquad \qquad old\_list\_of\_a(s) \leftarrow new\_list\_of\_a(s) \\
&\qquad \qquad \qquad \qquad iterator\_old\_list\_of\_a(s) \leftarrow \text{the beginning of } old\_list\_of\_a(s) \\
&\qquad \qquad \qquad \qquad policy\text{-}stable \leftarrow false \\
&\qquad \qquad \qquad \text{else:} \\
&\qquad \qquad \qquad \qquad \text{If } iterator\_old\_list\_of\_a(s) \text{ is not equal to the end of } old\_list\_of\_a(s): \\
&\qquad \qquad \qquad \qquad \qquad \text{Move } iterator\_old\_list\_of\_a(s) \text{ to the next element} \\
&\qquad \qquad \qquad \qquad \qquad policy\text{-}stable \leftarrow false \\
&\qquad \qquad \text{If } iterator\_old\_list\_of\_a(s) \text{ is not equal to the end of } old\_list\_of\_a(s): \\
&\qquad \qquad \qquad \pi(s) \leftarrow \text{the action in } old\_list\_of\_a(s) \text{ that } iterator\_old\_list\_of\_a(s) \text{ points to} \\
&\qquad \qquad \text{else:} \\
&\qquad \qquad \qquad \pi(s) \leftarrow \text{an action selected at random from } old\_list\_of\_a(s) \\
&\qquad \text{If } policy\text{-}stable = true \text{ then stop and return } V \approx v_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}$
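For comparison, here is a minimal Python sketch of a simpler and commonly used variant of the fix. It is not the list-and-iterator scheme above: it just keeps the old action whenever that action is still among the greedy ones, within a tolerance. Everything named here is an assumption made for illustration: the array layout `P[s, a, s']` and `R[s, a]`, the helpers `q_values`, `evaluate` and `policy_iteration`, the parameters `keep_old_on_tie`, `tol` and `max_sweeps`, and the toy two-state MDP. With `keep_old_on_tie=False` the sketch deliberately breaks ties in the worst possible way (it always prefers a tied action that differs from the current one) so that the switching described in the exercise becomes visible.

```python
import numpy as np

def q_values(P, R, V, gamma):
    # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    return R + gamma * P @ V

def evaluate(P, R, pi, gamma, theta=1e-10):
    # In-place iterative policy evaluation (step 2 of the pseudocode).
    V = np.zeros(P.shape[0])
    while True:
        delta = 0.0
        for s in range(P.shape[0]):
            v = V[s]
            V[s] = R[s, pi[s]] + gamma * P[s, pi[s]] @ V
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

def policy_iteration(P, R, gamma=0.9, keep_old_on_tie=True,
                     tol=1e-8, max_sweeps=50):
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)
    for sweep in range(1, max_sweeps + 1):
        V = evaluate(P, R, pi, gamma)
        Q = q_values(P, R, V, gamma)
        stable = True
        for s in range(n_states):
            # All greedy actions, treating values within tol as tied.
            best = np.flatnonzero(Q[s] >= Q[s].max() - tol)
            if keep_old_on_tie and pi[s] in best:
                continue  # old action is still greedy: do not switch
            # Assumed worst-case tie-breaking: prefer a different greedy action.
            others = best[best != pi[s]]
            new_a = others[0] if len(others) else best[0]
            if new_a != pi[s]:
                pi[s] = new_a
                stable = False
        if stable:
            return pi, V, sweep
    return pi, V, None  # policy never stabilised within max_sweeps

# Toy MDP (hypothetical): both actions in state 0 are identical (a tie),
# while action 1 in state 1 is strictly better.
P = np.array([[[0.0, 1.0], [0.0, 1.0]],
              [[1.0, 0.0], [1.0, 0.0]]])
R = np.array([[1.0, 1.0],
              [0.0, 2.0]])

pi_fixed, V_fixed, sweeps_fixed = policy_iteration(P, R)
pi_naive, V_naive, sweeps_naive = policy_iteration(P, R, keep_old_on_tie=False)
print("fixed:", pi_fixed, sweeps_fixed)
print("naive:", pi_naive, sweeps_naive)
```

In this sketch the tolerance `tol` stands in for the exact "is equal to" comparison in the pseudocode, since floating-point action values that are mathematically tied may differ in their last digits. The variant terminates because a switch only happens when the old action is worse than the best one by more than `tol`, so each switch is a real improvement and there are only finitely many deterministic policies.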
