Reinforcement Learning--Explanation of Formula (5.2)

Latest recommended article published on 2021-10-22 21:17:02

YeXiang\^-^/

Category: reinforcement learning. Tags: reinforcement learning

Article link: https://blog.csdn.net/ballade2012/article/details/95535730

The book does not explain formula (5.2) clearly, and the second and third lines of the formula on page 101 confused me, so I work through them step by step here.
First, let $A^* = \arg\max_a q_\pi(s,a)$ denote the greedy action in state $s$. The $\epsilon$-greedy policy $\pi'$ satisfies
$\pi'(a \mid s) = \begin{cases} 1 - \epsilon + \epsilon / | \mathcal A(s) | & \text{if } a = A^* \\ \epsilon / | \mathcal A(s) | & \text{if } a \neq A^* \end{cases}$
Therefore
$\begin{aligned} q_\pi(s, \pi'(s)) &= \sum_a \pi'(a \mid s) q_\pi(s,a) \\ &= \sum_{a \neq A^*} \frac {\epsilon} {| \mathcal A(s) |} q_\pi(s,a) + \Bigl( 1 - \epsilon + \frac{\epsilon}{| \mathcal A(s) |} \Bigr) q_\pi(s, A^*) \\ &= \frac {\epsilon} {| \mathcal A(s) |} \sum_{a \neq A^*} q_\pi(s,a) + \frac {\epsilon} {| \mathcal A(s) |} q_\pi(s, A^*) + (1-\epsilon) q_\pi(s, A^*) \\ &= \frac {\epsilon} {| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \end{aligned}$
This is the second line of formula (5.2).
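The second line of formula (5.2) can be checked numerically. The sketch below is only an illustration: the function names `epsilon_greedy_value` and `second_line`, the action values in `q`, and the choice `eps = 0.1` are assumptions made for this example and do not come from the book.

```python
import random

def epsilon_greedy_value(q, eps):
    """Compute sum_a pi'(a|s) q(s,a) for the epsilon-greedy policy pi'."""
    n = len(q)
    best = max(range(n), key=lambda a: q[a])  # index of the greedy action A^*
    probs = [eps / n] * n                     # every action gets eps / |A(s)|
    probs[best] += 1 - eps                    # the greedy action gets the extra 1 - eps
    return sum(p * v for p, v in zip(probs, q))

def second_line(q, eps):
    """Right-hand side of the second line of (5.2)."""
    n = len(q)
    return eps / n * sum(q) + (1 - eps) * max(q)

random.seed(0)
q = [random.uniform(-1, 1) for _ in range(5)]  # hypothetical values q_pi(s, a)
eps = 0.1
lhs = epsilon_greedy_value(q, eps)
rhs = second_line(q, eps)
print(abs(lhs - rhs) < 1e-12)  # the two expressions should agree up to rounding
```

Both functions compute the same quantity in two ways, so agreement here only confirms the algebra for one sample of values; it is not a proof.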
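Next, define the value $x$ as
$x = \sum_a \Bigl [ \pi(a \mid s) - \frac {\epsilon}{| \mathcal A(s) |} \Bigr ]q_\pi(s,a)$
Because $\pi$ is $\epsilon$-soft, $\pi(a \mid s) \geq \epsilon / | \mathcal A(s) |$ for every $a$, so each bracketed weight is nonnegative, and the weights sum to $1 - \epsilon$. Replacing every $q_\pi(s,a)$ by the maximum therefore cannot decrease the sum:
$\begin{aligned} x &\leq \sum_a \Bigl [ \pi(a \mid s) - \frac {\epsilon}{| \mathcal A(s) |} \Bigr ] \max_{a'} q_\pi(s,a') \\ &= (1-\epsilon)\max_a q_\pi(s,a) \end{aligned}$
In the special case where $\pi$ is itself $\epsilon$-greedy with respect to $q_\pi$, we have $\pi(a \mid s) = \epsilon/| \mathcal A(s) |$ for $a \neq A^*$, so only the $a = A^*$ term survives and equality holds:
$\begin{aligned} x &= \Bigl [ \pi(A^* \mid s) - \frac {\epsilon}{| \mathcal A(s) |} \Bigr ]q_\pi(s, A^*) \\ &= \Bigl [ 1 - \epsilon + \frac {\epsilon}{| \mathcal A(s) |} - \frac {\epsilon}{| \mathcal A(s) |}\Bigr ]q_\pi(s, A^*) \\ &= ( 1 - \epsilon) q_\pi(s, A^*) \\ &= (1-\epsilon)\max_a q_\pi(s,a) \end{aligned}$
Also, $x$ can be rewritten as
$x = (1-\epsilon) \sum_a \frac { \pi(a \mid s) - \frac {\epsilon}{| \mathcal A(s) |} }{ 1 - \epsilon}q_\pi(s,a)$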
$\begin{aligned} \therefore q_\pi(s, \pi'(s)) &= \frac {\epsilon} {| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \\ & \geq \frac {\epsilon} {| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \sum_a \frac { \pi(a \mid s) - \frac {\epsilon}{| \mathcal A(s) |} }{ 1 - \epsilon}q_\pi(s,a) \end{aligned}$
This is the third line of formula (5.2), which should now be clear.
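The inequality between the second and third lines can also be tested on random inputs. The following is a minimal sketch under stated assumptions: the helpers `random_eps_soft_policy` and `check_inequality` are hypothetical names, the $\epsilon$-soft policy is built by mixing the uniform distribution with an arbitrary random distribution, and $\epsilon < 1$ is assumed so that the division by $1 - \epsilon$ is defined. It also checks that the third line collapses to $\sum_a \pi(a \mid s) q_\pi(s,a) = v_\pi(s)$.

```python
import random

def random_eps_soft_policy(n, eps, rng):
    """Build pi(.|s) with pi(a|s) >= eps / n by mixing uniform with a random distribution."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [eps / n + (1 - eps) * r / total for r in raw]

def check_inequality(q, pi, eps):
    """Return (second line, third line, v_pi(s)) of formula (5.2) for one state."""
    n = len(q)
    uniform_part = eps / n * sum(q)
    # Weights (pi(a|s) - eps/n) / (1 - eps): nonnegative and summing to 1.
    weights = [(p - eps / n) / (1 - eps) for p in pi]
    second = uniform_part + (1 - eps) * max(q)
    third = uniform_part + (1 - eps) * sum(w * v for w, v in zip(weights, q))
    v_pi = sum(p * v for p, v in zip(pi, q))
    return second, third, v_pi

rng = random.Random(0)
eps, n = 0.2, 4
for _ in range(1000):
    q = [rng.uniform(-5, 5) for _ in range(n)]  # hypothetical values q_pi(s, a)
    pi = random_eps_soft_policy(n, eps, rng)
    second, third, v_pi = check_inequality(q, pi, eps)
    assert second >= third - 1e-9     # second line >= third line
    assert abs(third - v_pi) < 1e-9   # third line simplifies to v_pi(s)
```

The check relies on the weights being a valid probability distribution, which is exactly why a weighted average of the $q_\pi(s,a)$ cannot exceed their maximum.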