Exercise 4.6 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon/|\mathcal A(s)|$. Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for $v_*$ on page 80.
The algorithm on page 80 in Section 4.2 assumes a deterministic policy. For the stochastic, $\epsilon$-soft case, we can modify the algorithm as follows:
$$
\begin{aligned}
&\text{1 Initialization} \\
&\qquad V(s) \in \mathbb R \text{ arbitrarily, and } \pi(a \mid s) \text{ an arbitrary } \epsilon\text{-soft policy, for all } s \in \mathcal S \\
&\text{2 Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad \qquad \Delta \leftarrow 0 \\
&\qquad \qquad \text{Loop for each } s \in \mathcal S: \\
&\qquad \qquad \qquad v \leftarrow V(s) \\
&\qquad \qquad \qquad V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \Bigl[ r + \gamma V(s') \Bigr] \\
&\qquad \qquad \qquad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\qquad \qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3 Policy Improvement} \\
&\qquad policy\text{-}stable \leftarrow true \\
&\qquad \text{For each } s \in \mathcal S: \\
&\qquad \qquad old\text{-}policy \leftarrow \pi(\cdot \mid s) \\
&\qquad \qquad A^* \leftarrow \operatorname{argmax}_a \sum_{s',r} p(s',r \mid s,a) \Bigl[ r + \gamma V(s') \Bigr] \\
&\qquad \qquad \pi(a \mid s) \leftarrow
\begin{cases}
1 - \epsilon + \epsilon/|\mathcal A(s)| & \text{if } a = A^* \\
\epsilon/|\mathcal A(s)| & \text{otherwise}
\end{cases} \\
&\qquad \qquad \text{If } old\text{-}policy \neq \pi(\cdot \mid s) \text{, then } policy\text{-}stable \leftarrow false \\
&\qquad \text{If } policy\text{-}stable \text{, then stop and return } V \approx v_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$

Note the two substantive changes: in step 2, the update now averages over actions with the weights $\pi(a \mid s)$, and in step 3, the improvement no longer assigns all probability to the greedy action; it assigns the greedy action as much probability as the $\epsilon$-soft constraint allows, and every other action the minimum $\epsilon/|\mathcal A(s)|$.
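The modified algorithm can be sketched in code. The following is a minimal, illustrative implementation on a hypothetical two-state, two-action MDP (the transition table `p`, the discount `gamma`, and the values of `epsilon` and `theta` are all assumptions chosen for the example, not part of the exercise):

```python
import numpy as np

# Hypothetical toy MDP: p[s][a] is a list of (prob, next_state, reward) triples,
# i.e. a tabular representation of p(s', r | s, a).
p = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions = 2, 2
gamma, epsilon, theta = 0.9, 0.1, 1e-8

# pi[s, a] is a stochastic policy; the uniform policy is epsilon-soft.
pi = np.full((n_states, n_actions), 1.0 / n_actions)
V = np.zeros(n_states)

def q(s, a):
    # One-step lookahead: sum_{s', r} p(s', r | s, a) [r + gamma V(s')]
    return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])

policy_stable = False
while not policy_stable:
    # 2. Policy evaluation: expected update weighted by pi(a | s)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            V[s] = sum(pi[s, a] * q(s, a) for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # 3. Policy improvement: greedy action gets 1 - eps + eps/|A|,
    #    every other action gets eps/|A|.
    policy_stable = True
    for s in range(n_states):
        old = pi[s].copy()
        best = max(range(n_actions), key=lambda a: q(s, a))
        pi[s] = epsilon / n_actions
        pi[s, best] += 1 - epsilon
        if not np.allclose(old, pi[s]):
            policy_stable = False

print(pi)  # each row sums to 1; greedy action has probability 1 - eps + eps/|A|
print(V)
```

With $\epsilon = 0.1$ and $|\mathcal A(s)| = 2$, the improved policy assigns probability $0.95$ to the greedy action in each state and $0.05$ to the other one.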
Because the only policies allowed are $\epsilon$-soft, each of the $|\mathcal A(s)| - 1$ non-greedy actions must receive probability $\epsilon/|\mathcal A(s)|$, so the probability that the policy doesn't select the greedy action $a$ is $\frac{\epsilon}{|\mathcal A(s)|}\cdot (|\mathcal A(s)| - 1)$. So,
$$
\begin{aligned}
\pi(a \mid s) &= 1 - \frac{\epsilon}{|\mathcal A(s)|}\cdot (|\mathcal A(s)| - 1) \\
&= 1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|}
\end{aligned}
$$
Substituting this $\pi(a \mid s)$ for the greedy action in the improvement step of the algorithm, with every other action receiving $\epsilon/|\mathcal A(s)|$, gives the final result.
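The identity above is easy to check numerically. A quick sanity check, using illustrative values $\epsilon = 0.1$ and $|\mathcal A(s)| = 4$ (these numbers are assumptions for the example only):

```python
# Numeric check of pi(a|s) = 1 - (eps/|A|)(|A|-1) = 1 - eps + eps/|A|
# for hypothetical values eps = 0.1, |A(s)| = 4.
epsilon, n = 0.1, 4

p_greedy = 1 - (epsilon / n) * (n - 1)                       # 1 - (eps/|A|)(|A|-1)
assert abs(p_greedy - (1 - epsilon + epsilon / n)) < 1e-12   # = 1 - eps + eps/|A|

# The epsilon-soft policy is a valid distribution: the greedy action
# plus the (n - 1) non-greedy actions sum to probability 1.
total = p_greedy + (n - 1) * (epsilon / n)
print(p_greedy, total)  # approximately 0.925 and 1.0
```

So the greedy action keeps probability $0.925$, and the remaining $0.075$ is spread evenly over the three other actions.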