Reinforcement Learning Exercise 3.18

Exercise 3.18 The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
[Backup diagram: the state $s$ at the root, with one branch to each possible action $a$]
Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a \mid s)$ such that no expected value notation appears in the equation.
$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] \\
&= \sum_{a \in \mathcal{A}} \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] P(A_t = a \mid S_t = s)
\end{aligned}
$$

Since $P(A_t = a \mid S_t = s) = \pi(a \mid s)$, this becomes

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] \pi(a \mid s)
$$
By definition,
$$
\mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] = q_\pi(s, a)
$$
so
$$
v_\pi(s) = \sum_{a \in \mathcal{A}} q_\pi(s, a) \, \pi(a \mid s)
$$
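To make the final identity concrete, here is a minimal numerical sketch. It assumes a hypothetical state with three actions and made-up values for $\pi(a \mid s)$ and $q_\pi(s, a)$, and computes $v_\pi(s)$ as the probability-weighted sum of action values:

```python
# Minimal sketch of v_pi(s) = sum_a pi(a|s) * q_pi(s, a).
# The policy probabilities and action values below are illustrative, made-up numbers.

pi_s = {"left": 0.2, "stay": 0.5, "right": 0.3}   # pi(a|s), must sum to 1
q_s  = {"left": 1.0, "stay": 4.0, "right": -2.0}  # q_pi(s, a) for each action

# Expectation over actions under the policy
v_s = sum(pi_s[a] * q_s[a] for a in pi_s)
print(v_s)  # 0.2*1.0 + 0.5*4.0 + 0.3*(-2.0) = 1.6
```

The weighted sum is exactly the expectation over the action branches of the backup diagram: each leaf value $q_\pi(s, a)$ contributes in proportion to how likely the policy is to take that action from $s$.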
