Reinforcement Learning Exercise 5.5


Exercise 5.5 Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability $p$ and transitions to the terminal state with probability $1-p$. Let the reward be $+1$ on all transitions, and let $\gamma = 1$. Suppose you observe one episode that lasts 10 steps, with a return of 10. What are the first-visit and every-visit estimators of the value of the nonterminal state?
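To make the setup concrete, here is a minimal sketch (my own, not part of the exercise) that samples episodes from this one-state MDP; the function name `sample_episode`, the seed, and the choice `p=0.9` are illustrative assumptions.

```python
import random

def sample_episode(p, rng):
    """Sample one episode: reward +1 each step; terminate with probability 1 - p."""
    rewards = []
    while True:
        rewards.append(1)               # reward is +1 on every transition
        if rng.random() >= p:           # with probability 1 - p, move to the terminal state
            break
    return rewards

rng = random.Random(0)
rewards = sample_episode(p=0.9, rng=rng)
print(len(rewards), sum(rewards))       # with gamma = 1, the return equals the episode length
```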

For the first-visit estimator, only the return following the first visit to the state in each episode is used. The nonterminal state is first visited at time 0, and with a reward of $+1$ on each of the 10 transitions and $\gamma = 1$, the return from that point is the full episode return:

$$
V(S_{nonterminal}) = G_0 = \sum_{t=0}^{9} 1 = 10
$$

For the every-visit estimator, the returns following all visits are averaged. The state is visited at times $t = 0, 1, \dots, 9$, and the return following the visit at time $t$ is $G_t = 10 - t$, so:

$$
V(S_{nonterminal}) = \frac{1}{10} \sum_{t=0}^{9} (10 - t) = \frac{10 + 9 + \cdots + 1}{10} = \frac{55}{10} = 5.5
$$
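As a check, here is a minimal sketch (mine, not part of the exercise) that computes both estimators directly from the observed 10-step episode; the helper name `returns_after_each_visit` is illustrative.

```python
def returns_after_each_visit(rewards, gamma=1.0):
    """Return G_t for every time step t at which the single state is visited."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))          # [G_0, G_1, ..., G_{T-1}]

rewards = [1] * 10                           # the observed episode: 10 steps, return 10
returns = returns_after_each_visit(rewards)  # [10, 9, ..., 1]

first_visit_estimate = returns[0]                    # only the return after the first visit
every_visit_estimate = sum(returns) / len(returns)   # average over all 10 visits

print(first_visit_estimate)   # 10.0
print(every_visit_estimate)   # 5.5
```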
