Reinforcement Learning Exercise 3.19

This post works through the mathematical expression for the action-value function $q_\pi(s, a)$ in reinforcement learning, in terms of the expected next reward $R_{t+1}$ and the expected sum of the remaining rewards. Using a small backup diagram as intuition, it gives two equations: one as an expectation not conditioned on the policy, and one with the expectation written out explicitly in terms of the transition probability $p(s', r \mid s, a)$.

Exercise 3.19 The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:
[Figure: backup diagram rooted at the action node $(s, a)$, branching to the possible next states]
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.

$$
\begin{aligned}
q_\pi(s,a) &= \mathbb E_\pi\bigl( G_t \mid S_t = s, A_t = a \bigr) \\
&= \mathbb E_\pi\bigl( R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a \bigr) \\
&= \mathbb E_\pi\bigl( R_{t+1} \mid S_t = s, A_t = a \bigr) + \gamma\, \mathbb E_\pi\bigl( G_{t+1} \mid S_t = s, A_t = a \bigr) \\
&= \mathbb E_\pi\bigl( R_{t+1} \mid S_t = s, A_t = a \bigr) + \gamma\, \mathbb E_\pi\Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \,\Big|\, S_t = s, A_t = a \Bigr) \\
&= R_{t+1}(s,a) + \gamma \sum_{s'} \mathbb E_\pi\Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \,\Big|\, S_t = s, A_t = a, S_{t+1} = s' \Bigr) \Pr\bigl( S_{t+1} = s' \mid S_t = s, A_t = a \bigr)
\end{aligned}
$$
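By the Markov property, the inner expectation in the last line is exactly $v_\pi(s')$, and the dependence on the policy is then carried entirely by $v_\pi$, so the outer expectation no longer needs the $\pi$ subscript. One way to finish the derivation and reach the two forms the exercise asks for is:

$$
\begin{aligned}
q_\pi(s,a) &= \mathbb E\bigl( R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a \bigr) \\
&= \sum_{s'} \sum_{r} p(s', r \mid s, a)\, \bigl[ r + \gamma\, v_\pi(s') \bigr]
\end{aligned}
$$

As a quick numerical illustration (not part of the original exercise), the sketch below evaluates the second equation on a hypothetical two-state MDP; the dynamics table `p`, the state values `v`, and the discount `gamma` are made-up example values.

```python
# Minimal sketch: q_pi(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v_pi(s')).
# The dynamics table `p`, state values `v`, and `gamma` below are hypothetical example numbers.

def action_value(p, v, s, a, gamma):
    """Return q_pi(s, a) given dynamics p[(s, a)] = {(s_next, r): prob}."""
    return sum(prob * (r + gamma * v[s_next])
               for (s_next, r), prob in p[(s, a)].items())

# From state 0, action "go" reaches state 1 with reward +1 (prob 0.8)
# or stays in state 0 with reward 0 (prob 0.2).
p = {(0, "go"): {(1, 1.0): 0.8, (0, 0.0): 0.2}}
v = {0: 0.5, 1: 2.0}  # assumed state values v_pi(s)

print(action_value(p, v, s=0, a="go", gamma=0.9))  # 0.8*(1 + 0.9*2.0) + 0.2*(0 + 0.9*0.5) = 2.33
```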
