Exercise 3.19 The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.
Starting from the definition of the action value and expanding the return:

$$
\begin{aligned}
q_\pi(s,a) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] \\
&= \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a \right] \\
&= \mathbb{E}_\pi\left[ R_{t+1} \mid S_t = s, A_t = a \right] + \gamma\, \mathbb{E}_\pi\left[ G_{t+1} \mid S_t = s, A_t = a \right].
\end{aligned}
$$

Once the action $a$ is fixed, the next reward and next state are determined by the environment dynamics alone, so the conditioning on $\pi$ can be dropped; and conditioning the second term on the next state gives $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$. This yields the first requested equation:

$$
q_\pi(s,a) = \mathbb{E}\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right].
$$

Writing the expectation out explicitly using the four-argument dynamics function $p(s', r \mid s, a)$ from (3.2) gives the second:

$$
q_\pi(s,a) = \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma\, v_\pi(s') \bigr].
$$
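The explicit-sum form can be checked numerically on a toy example. The sketch below assumes a hypothetical two-state MDP and arbitrary illustrative values for $v_\pi$; the state names, dynamics table, and numbers are inventions for illustration, not from the exercise.

```python
# Sketch: compute q_pi(s, a) = sum over (s', r) of p(s', r | s, a) * [r + gamma * v_pi(s')].
# The MDP below (states "s0"/"s1", action "go") is a made-up toy example.

gamma = 0.9

# Dynamics: p[(s, a)] is a list of (next_state, reward, probability) triples,
# i.e. an explicit tabulation of p(s', r | s, a).
p = {
    ("s0", "go"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "go"): [("s0", 0.0, 1.0)],
}

# Assumed state values v_pi (arbitrary numbers, standing in for a policy evaluation result).
v_pi = {"s0": 2.0, "s1": 3.0}

def q_pi(s, a):
    """Action value via the explicit sum over next states and rewards."""
    return sum(prob * (r + gamma * v_pi[s_next])
               for (s_next, r, prob) in p[(s, a)])

# 0.8 * (1 + 0.9 * 3) + 0.2 * (0 + 0.9 * 2) = 2.96 + 0.36 = 3.32
print(q_pi("s0", "go"))  # → 3.32
```

The dictionary-of-triples representation makes the correspondence with $\sum_{s',r} p(s',r \mid s,a)[r + \gamma v_\pi(s')]$ one-to-one: each triple is one $(s', r)$ term of the sum.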