Reinforcement Learning Exercise 3.12

Exercise 3.12 Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

$$
\begin{aligned}
v_\pi(s) &= \mathbb E_\pi(G_t \mid S_t=s) \\
&= \sum_{g_t} \bigl[ g_t \cdot p(g_t \mid s) \bigr] \\
&= \sum_{g_t} \Bigl[ g_t \cdot \frac{p(g_t, s)}{p(s)} \Bigr] \\
&= \sum_{g_t} \Bigl[ g_t \cdot \frac{\sum_{a \in \mathcal A} p(g_t, s, a)}{p(s)} \Bigr] \\
&= \sum_{g_t} \Bigl\{ g_t \cdot \frac{\sum_{a \in \mathcal A} \bigl[ p(g_t \mid s, a) \cdot p(s, a) \bigr]}{p(s)} \Bigr\} \\
&= \sum_{g_t} \Bigl\{ g_t \cdot \frac{\sum_{a \in \mathcal A} \bigl[ p(g_t \mid s, a) \cdot p(a \mid s) \cdot p(s) \bigr]}{p(s)} \Bigr\} \\
&= \sum_{g_t} \Bigl\{ g_t \cdot \sum_{a \in \mathcal A} \bigl[ p(g_t \mid s, a) \cdot p(a \mid s) \bigr] \Bigr\} \\
&= \sum_{a \in \mathcal A} \Bigl\{ p(a \mid s) \sum_{g_t} \bigl[ g_t \cdot p(g_t \mid s, a) \bigr] \Bigr\}
\end{aligned}
$$
By definition, $p(a \mid s) = \pi(a \mid s)$ and $\sum_{g_t} \bigl[ g_t \cdot p(g_t \mid s, a) \bigr] = q_\pi(s,a)$, so:
$$
v_\pi(s) = \sum_{a \in \mathcal A} \bigl[ \pi(a \mid s) \cdot q_\pi(s,a) \bigr]
$$
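
The final identity says that the state value is the policy-weighted average of the action values in that state. As a minimal sketch, the computation for a single state might look like the following in Python. The function name `state_value` and the numbers in `pi_s` and `q_s` are made up for illustration and do not come from the exercise.

```python
from typing import Sequence


def state_value(policy_probs: Sequence[float], action_values: Sequence[float]) -> float:
    """Return v_pi(s) = sum_a pi(a|s) * q_pi(s, a) for a single state s."""
    if len(policy_probs) != len(action_values):
        raise ValueError("need exactly one probability per action value")
    if abs(sum(policy_probs) - 1.0) > 1e-9:
        raise ValueError("policy probabilities must sum to 1")
    # Weight each action value by the probability of choosing that action.
    return sum(p * q for p, q in zip(policy_probs, action_values))


# Hypothetical state with three actions (illustrative values only).
pi_s = [0.5, 0.25, 0.25]   # pi(a|s) for each action a
q_s = [4.0, 2.0, 0.0]      # q_pi(s, a) for each action a

v_s = state_value(pi_s, q_s)
print(v_s)  # 2.5 = 0.5*4.0 + 0.25*2.0 + 0.25*0.0
```

For a deterministic policy, where one action has probability 1, the sum collapses to the action value of that single action.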
