Reinforcement Learning - Exercise 3.17

Latest recommended article published 2024-08-10 16:16:57

YeXiang\^-^/

Category: reinforcement learning. Tags: Reinforcement Learning

Article link: https://blog.csdn.net/ballade2012/article/details/89075428

Exercise 3.17 What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: the backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
[Figure: backup diagram for $q_\pi$]

By definition of the action-value function, and conditioning on the next state $S_{t+1} = s'$:
$\begin{aligned} Q_\pi(s,a) &= \mathbb E_\pi(G_t|S_t=s,A_t=a) \\ &= \mathbb E_\pi (\sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t=s, A_t=a) \\ &= \sum_{s'} \bigl[ \mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t=s, A_t=a, S_{t+1}=s' ) P( S_{t+1} =s' | A_t = a, S_t = s ) \bigr] \\ &= \sum_{s'} \Bigl\{ \bigl[ \mathbb E_\pi ( R_{t+1} | S_t = s , A_t = a , S_{t+1} = s' ) + \mathbb E_\pi ( \sum_{k=1}^\infty \gamma^k R_{t+1+k} | S_t = s , A_t = a , S_{t+1} = s' ) \bigr] P( S_{t+1} = s' | A_t = a , S_t = s ) \Bigr\} \end{aligned}$
Denote
$P(S_{t+1} = s' | A_t = a , S_t = s ) = P_{s,s'}^a$
$\mathbb E_\pi (R_{t+1} | S_t = s , A_t = a , S_{t+1} = s' ) = R_{s,s'}^a$
Then, factoring out $\gamma$, re-indexing the sum, applying the Markov property (given $S_{t+1} = s'$, the rewards from $R_{t+2}$ onward do not depend on $S_t$ or $A_t$), and finally conditioning on the next action $A_{t+1} = a'$:
$\begin{aligned} Q_\pi(s,a) &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \sum_{s'} \bigl[ \mathbb E_\pi (\sum_{k=1}^\infty \gamma^k R_{t+1+k} | S_t = s, A_t = a, S_{t+1} = s' ) P_{s,s'}^a \bigr] \\ &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \bigl[ \mathbb E_\pi ( \sum_{k=1}^\infty \gamma^{k-1} R_{t+1+k} | S_t=s,A_t=a,S_{t+1}=s' ) P_{s,s'}^a \bigr] \\ &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \bigl[ \mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+2+k} | S_t = s , A_t = a , S_{t+1} = s' ) P_{s,s'}^a \bigr] \\ &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \bigl[ \mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+2+k} | S_{t+1} = s' ) P_{s,s'}^a \bigr] \\ &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl\{ \sum_{a'} \bigl[ \mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+2+k} | S_{t+1} = s' , A_{t+1} = a' ) P( A_{t+1} = a' | S_{t+1} =s' ) \bigr] P_{s,s'}^a \Bigr\} \end{aligned}$
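By the definitions of the action-value function and of the policy (note that $\sum_{k=0}^\infty \gamma^k R_{t+2+k} = G_{t+1}$):
$\mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+2+k} | S_{t+1} = s' , A_{t+1} = a') = Q_\pi(s',a')$
$P( A_{t+1} = a' | S_{t+1} = s' ) = \pi(s',a')$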
So
$Q_\pi(s,a) = \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \bigl[ \sum_{a'} Q_\pi(s',a') \pi(s',a') \bigr] P_{s,s'}^a$
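For comparison with (3.14), the same result can be written with a four-argument dynamics function $p(s', r | s, a)$ and the policy written as $\pi(a'|s')$ instead of $\pi(s',a')$. This notation is an assumption here: it is the usual textbook form and is not used in the derivation above. The two notations are related by
$P_{s,s'}^a = \sum_r p(s', r | s, a)$
$R_{s,s'}^a P_{s,s'}^a = \sum_r r \, p(s', r | s, a)$
and substituting both into the result gives
$q_\pi(s,a) = \sum_{s', r} p(s', r | s, a) \bigl[ r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s',a') \bigr]$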
This is the Bellman equation for the action-value function $q_\pi$.
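The equation can be sanity-checked numerically. The following is a minimal sketch, assuming a small, randomly generated finite MDP; the array names `P`, `R`, `pi` and the helper `bellman_backup` are illustrative and not part of the exercise. It computes $Q_\pi$ in two ways: by repeatedly applying the right-hand side of the equation, and by solving the equivalent linear system directly. The two results should agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9  # hypothetical sizes and discount factor

# P[s, a, s2] = P(S_{t+1}=s2 | S_t=s, A_t=a); each P[s, a, :] sums to 1.
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
# R[s, a, s2] = expected reward for the transition (s, a) -> s2.
R = rng.random((n_s, n_a, n_s))
# pi[s, a] = probability of choosing action a in state s.
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)


def bellman_backup(Q):
    """Right-hand side of the Bellman equation, for every (s, a) at once."""
    expected_reward = (P * R).sum(axis=2)    # sum_{s'} R P, shape (n_s, n_a)
    next_state_value = (pi * Q).sum(axis=1)  # sum_{a'} Q(s',a') pi(s',a'), shape (n_s,)
    return expected_reward + gamma * (P @ next_state_value)


# Method 1: iterate the backup from zero (a contraction for gamma < 1).
Q_iter = np.zeros((n_s, n_a))
for _ in range(1000):
    Q_iter = bellman_backup(Q_iter)

# Method 2: solve (I - gamma * T) q = r, where
# T[(s, a), (s', a')] = P[s, a, s'] * pi[s', a'].
r = (P * R).sum(axis=2).reshape(-1)
T = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n_s * n_a, n_s * n_a)
Q_exact = np.linalg.solve(np.eye(n_s * n_a) - gamma * T, r).reshape(n_s, n_a)

# Largest disagreement between the two methods; expected to be tiny.
print(np.max(np.abs(Q_iter - Q_exact)))
```

Because the equation is linear in $Q_\pi$, the direct solve is exact up to floating-point error, while the iteration only uses the backup itself. Agreement between the two is a quick check that the vectorised backup matches the derived equation.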