Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.
Here, we can use the result of Exercise 3.17:
$$Q_\pi(s,a) = \sum_{s'} P_{ss'}^a R_{ss'}^a + \gamma \sum_{s'} P_{ss'}^a \Bigl[ \sum_{a'} \pi(s',a') \, Q_\pi(s',a') \Bigr]$$
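To make the backup concrete, here is a minimal sketch of this expected update in Python, assuming a hypothetical tabular layout where `P[s, a, s']` holds transition probabilities, `R[s, a, s']` expected rewards, and `pi[s, a]` the policy's action probabilities (these array names are illustrative, not from the book):

```python
import numpy as np

def q_backup(P, R, pi, Q, gamma, s, a):
    """One expected backup of Q_pi(s, a) under the formula above."""
    # sum_{s'} P^a_{ss'} R^a_{ss'}
    expected_reward = np.sum(P[s, a] * R[s, a])
    # For every s': sum_{a'} pi(s', a') Q(s', a'), then weight by P^a_{ss'}
    expected_next_value = np.sum(P[s, a] * np.sum(pi * Q, axis=1))
    return expected_reward + gamma * expected_next_value
```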
The algorithm analogous to that on page 80 is then as follows:
$$
\begin{aligned}
&\text{1. Initialization} \\
&\qquad Q_\pi(s,a) \in \mathbb{R} \text{ and } \pi(s) \in \mathcal{A}(s) \text{ arbitrarily, for all } s \in \mathcal{S} \text{ and } a \in \mathcal{A}(s) \\
&\text{2. Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad\qquad \Delta \leftarrow 0 \\
&\qquad\qquad \text{Loop for each } (s,a) \text{ pair:} \\
&\qquad\qquad\qquad q \leftarrow Q_\pi(s,a) \\
&\qquad\qquad\qquad Q_\pi(s,a) \leftarrow \sum_{s'} P_{ss'}^a R_{ss'}^a + \gamma \sum_{s'} P_{ss'}^a \Bigl[ \sum_{a'} \pi(s',a') \, Q_\pi(s',a') \Bigr] \\
&\qquad\qquad\qquad \Delta \leftarrow \max(\Delta, |q - Q_\pi(s,a)|) \\
&\qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3. Policy Improvement} \\
&\qquad \textit{policy-stable} \leftarrow \textit{true} \\
&\qquad \text{For each } s \in \mathcal{S}: \\
&\qquad\qquad \textit{old-action} \leftarrow \pi(s) \\
&\qquad\qquad \pi(s) \leftarrow \operatorname{argmax}_a Q_\pi(s,a) \\
&\qquad\qquad \text{If } \textit{old-action} \neq \pi(s) \text{, then } \textit{policy-stable} \leftarrow \textit{false} \\
&\qquad \text{If } \textit{policy-stable} \text{, then stop and return } Q_\pi \approx q_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$
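Below is a sketch of the whole procedure, under the same assumed `P` and `R` arrays as in the snippet above. Since improvement makes the policy deterministic, the inner sum $\sum_{a'} \pi(s',a')\,Q_\pi(s',a')$ collapses to $Q_\pi(s', \pi(s'))$, which is how the evaluation step is written here:

```python
import numpy as np

def policy_iteration_q(P, R, gamma, theta=1e-8):
    """Policy iteration on action values; returns (Q, pi) with Q ~ q_*.

    P[s, a, s'] and R[s, a, s'] are as before; pi is kept deterministic,
    represented as one chosen action per state.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))   # 1. Initialization
    pi = np.zeros(n_states, dtype=int)

    while True:
        # 2. Policy Evaluation: sweep all (s, a) pairs until convergence
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    q_old = Q[s, a]
                    # Q(s', pi(s')) for every s' under the current policy
                    next_value = Q[np.arange(n_states), pi]
                    Q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * next_value))
                    delta = max(delta, abs(q_old - Q[s, a]))
            if delta < theta:
                break

        # 3. Policy Improvement: act greedily w.r.t. Q in every state
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            pi[s] = np.argmax(Q[s])
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return Q, pi
```

Note that, unlike the state-value version on page 80, no model lookahead is needed in the improvement step: the greedy action is read directly off the stored $Q_\pi(s,\cdot)$, which is exactly why this formulation matters later in the book.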