Reinforcement Learning exercise 3.13

Exercise 3.13 Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

First, we derive a useful identity from the multiplication (chain) rule of probability theory:
$$
\begin{aligned}
p(x|y) &= \frac{p(x,y)}{p(y)} \\
&= \frac{\sum_z p(x,y,z)}{p(y)} \\
&= \frac{\sum_z \bigl[ p(x|y,z) \cdot p(z|y) \cdot p(y) \bigr]}{p(y)} \\
&= \sum_z \bigl[ p(x|y,z) \cdot p(z|y) \bigr] \qquad (1)
\end{aligned}
$$
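As a quick sanity check, identity (1) can be verified numerically. The sketch below builds an arbitrary small joint distribution $p(x,y,z)$ (the array shapes and random numbers are made up purely for illustration) and confirms that $\sum_z p(x|y,z)\,p(z|y)$ reproduces $p(x|y)$:

```python
import numpy as np

# A minimal numerical sketch of identity (1), assuming an arbitrary joint
# distribution p(x, y, z) over small finite variables (made-up numbers).
rng = np.random.default_rng(0)
joint = rng.random((3, 2, 4))        # unnormalized p(x, y, z)
joint /= joint.sum()                 # normalize into a proper joint distribution

p_y = joint.sum(axis=(0, 2))                     # p(y),        shape (2,)
p_x_given_y = joint.sum(axis=2) / p_y            # p(x | y),    shape (3, 2)
p_yz = joint.sum(axis=0)                         # p(y, z),     shape (2, 4)
p_z_given_y = p_yz / p_y[:, None]                # p(z | y),    shape (2, 4)
p_x_given_yz = joint / p_yz                      # p(x | y, z), shape (3, 2, 4)

# Right-hand side of (1): sum over z of p(x | y, z) * p(z | y)
rhs = np.einsum('xyz,yz->xy', p_x_given_yz, p_z_given_y)
print(np.allclose(p_x_given_y, rhs))             # expected: True
```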
With formula (1), we can expand $q_\pi(s,a)$ as follows:
$$
\begin{aligned}
q_\pi(s,a) &= \mathbb E_\pi(G_t|S_t=s,A_t=a) \\
&= \mathbb E_\pi(R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a) \\
&= \mathbb E_\pi(R_{t+1} | S_t = s, A_t = a) + \gamma \mathbb E_\pi(G_{t+1}|S_t = s, A_t = a) \\
&= \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=g_{t+1}|S_t=s, A_t=a)
\end{aligned}
$$
Here, by definition, $g_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}$. Using formula (1) to condition on $S_{t+1}$, we can derive:
$$
\begin{aligned}
q_\pi(s,a) &= \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}|S_t=s, A_t=a) \\
&= \sum_r r \cdot \sum_{s'} Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \cdot Pr(S_{t+1} = s' | S_t=s, A_t=a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}|S_t=s, A_t=a, S_{t+1} = s') \cdot Pr(S_{t+1}=s'| S_t=s, A_t = a)
\end{aligned}
$$
Because of the Markov property, once $S_{t+1} = s'$ is known, the return $G_{t+1}$ depends only on what happens from state $s'$ onward; conditioning additionally on $S_t = s$ and $A_t = a$ has no effect on the distribution of $G_{t+1}$. So:
$$
\begin{aligned}
q_\pi(s,a) &= \sum_r r \cdot \sum_{s'} Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \cdot Pr(S_{t+1} = s' | S_t=s, A_t=a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} | S_{t+1} = s') \cdot Pr(S_{t+1}=s'| S_t=s, A_t = a) \\
&= \sum_{s'} \biggl\{ Pr(S_{t+1} = s' | S_t=s, A_t=a) \cdot \Bigl[ \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} | S_{t+1} = s') \Bigr] \biggr\} \\
&= \sum_{s'} \biggl\{ p(s'| s,a) \cdot \Bigl[ \mathbb E_\pi(r|s,a,s') + \gamma \cdot \mathbb E_\pi(G_{t+1}|S_{t+1}=s') \Bigr] \biggr\} \\
&= \sum_{s'} \biggl\{ p(s'| s,a) \cdot \Bigl[ \mathbb E_\pi(r|s,a,s') + \gamma \cdot v_\pi(s') \Bigr] \biggr\}
\end{aligned}
$$
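The key step above is dropping $S_t$ and $A_t$ from the condition on $G_{t+1}$. A small Monte Carlo sketch can illustrate it: on a toy two-state MDP (all transition probabilities, rewards, and the policy below are hypothetical), the sample average of $g_{t+1}$, grouped by $(s, a, s')$, agrees up to sampling noise across all $(s,a)$ pairs that lead to the same $s'$:

```python
import numpy as np

# Monte Carlo sketch of the step above: conditioned on S_{t+1} = s', the
# average of g_{t+1} should not depend on which (S_t, A_t) led to s'.
# The toy 2-state MDP and policy below are hypothetical.
rng = np.random.default_rng(0)
gamma, T = 0.9, 30                              # discount, rollout truncation
P = np.array([[[0.8, 0.2], [0.2, 0.8]],         # P[s, a, s'] = p(s' | s, a)
              [[0.6, 0.4], [0.2, 0.8]]])
R = np.array([[[0.0, 1.0], [0.5, 0.0]],         # R[s, a, s'] = reward on (s, a, s')
              [[1.0, 0.0], [0.0, 2.0]]])
pi = np.array([[0.5, 0.5], [0.3, 0.7]])         # pi(a | s)

returns = {}                                    # (s, a, s') -> sampled g_{t+1} values
for _ in range(20_000):
    s = int(rng.integers(2))
    a = int(rng.choice(2, p=pi[s]))
    s1 = int(rng.choice(2, p=P[s, a]))
    # Roll out T further steps from s1 and accumulate the discounted return g_{t+1}.
    g, st, disc = 0.0, s1, 1.0
    for _ in range(T):
        at = rng.choice(2, p=pi[st])
        nxt = rng.choice(2, p=P[st, at])
        g += disc * R[st, at, nxt]
        disc *= gamma
        st = nxt
    returns.setdefault((s, a, s1), []).append(g)

# Entries that share the same s' should agree up to Monte Carlo noise,
# regardless of the (s, a) pair that produced s'.
for key in sorted(returns):
    print(key, round(float(np.mean(returns[key])), 2))
```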
Denote $p(s'|s,a) = P_{s,s'}^a$ and $\mathbb E_\pi(r|s,a,s') = R_{s,s'}^a$; then
$$
q_\pi(s,a) = \sum_{s'} \biggl\{ P_{s,s'}^a \cdot \Bigl[ R_{s,s'}^a + \gamma \cdot v_\pi(s') \Bigr] \biggr\} \qquad (2)
$$
Here, equation (2) is the result.
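The exercise asks for the answer in terms of the four-argument $p(s', r | s, a)$. Since $P_{s,s'}^a = \sum_r p(s',r|s,a)$ and $P_{s,s'}^a \cdot R_{s,s'}^a = \sum_r r \cdot p(s',r|s,a)$, equation (2) can be written equivalently as

$$
q_\pi(s,a) = \sum_{s'} \sum_{r} p(s', r | s, a) \bigl[ r + \gamma \cdot v_\pi(s') \bigr]
$$

As a final sanity check, here is a minimal numerical sketch on a hypothetical 2-state, 2-action MDP (all numbers below are invented): it computes $v_\pi$ exactly, evaluates equation (2), and compares the result against $q_\pi$ obtained by iterating the Bellman equation for action values directly:

```python
import numpy as np

# Minimal sketch: verify equation (2) / the four-argument form on a
# hypothetical 2-state, 2-action MDP (all numbers below are invented).
gamma = 0.9
n_s, n_a = 2, 2
rewards = np.array([0.0, 1.0])                      # possible reward values r

# Four-argument dynamics: p[s, a, s', r_idx] = p(s', r | s, a)
p = np.zeros((n_s, n_a, n_s, len(rewards)))
p[0, 0] = [[0.7, 0.1], [0.1, 0.1]]
p[0, 1] = [[0.2, 0.0], [0.3, 0.5]]
p[1, 0] = [[0.4, 0.2], [0.3, 0.1]]
p[1, 1] = [[0.1, 0.1], [0.4, 0.4]]
pi = np.array([[0.5, 0.5], [0.3, 0.7]])             # policy pi(a | s)

# Shorthands used in equation (2): P[s, a, s'] and R[s, a, s']
P = p.sum(axis=3)                                   # p(s' | s, a)
R = (p * rewards).sum(axis=3) / P                   # E[R_{t+1} | s, a, s']

# v_pi from the Bellman equation v = r_pi + gamma * P_pi v, solved exactly
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sat,sat->s', pi, P, R)
v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

# Equation (2): q(s, a) = sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * v[s'])
q_eq2 = np.einsum('sat,sat->sa', P, R + gamma * v)

# Independent check: iterate the Bellman equation for q_pi directly,
# q(s,a) = sum_{s',r} p(s',r|s,a) * (r + gamma * sum_{a'} pi(a'|s') q(s',a'))
q = np.zeros((n_s, n_a))
for _ in range(500):
    target = rewards[None, :] + gamma * (pi * q).sum(axis=1)[:, None]
    q = np.einsum('satr,tr->sa', p, target)

print(np.allclose(q_eq2, q))                        # expected: True
```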
