Reinforcement Learning Exercise 3.24

This post works through an exercise in reinforcement learning set in a particular gridworld. In this environment, states A and B yield rewards of +10 and +5 respectively, after which the agent jumps to the corresponding positions A' and B'. By analyzing the optimal policy, the return from state A can be written as an infinitely repeating sequence of rewards; applying the geometric series formula with γ = 0.9, the optimal value of state A comes out to 24.419 to three decimal places.

Exercise 3.24 Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.
[Figure 3.5 from Sutton & Barto: the gridworld, its exceptional states A, A', B, B', and the optimal value function]
The rules of this gridworld are as follows. If the agent is in state A, then whatever action it takes, it receives a reward of +10 and is moved to A' on the next step. Similarly, if the agent is in state B, then whatever action it takes, it receives a reward of +5 and is moved to B' on the next step. If the agent is at the edge of the gridworld and its action would take it off the grid, it receives a reward of -1 and stays in place; all other moves give a reward of 0.
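As a concrete illustration of these rules, here is a minimal Python sketch of the transition and reward function. The grid size and the coordinates chosen for A, A', B, and B' are assumptions matching the standard figure from the book, not values stated in this post.

```python
# Minimal sketch of the gridworld dynamics described above (Sutton & Barto, Example 3.5).
# Cell coordinates are assumptions for illustration only:
# A = (0, 1) jumps to A' = (4, 1) with reward +10; B = (0, 3) jumps to B' = (2, 3)
# with reward +5; moving off the grid costs -1; every other move gives 0.
N = 5
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward) for a single move."""
    if state == A:                       # any action from A teleports to A'
        return A_PRIME, 10.0
    if state == B:                       # any action from B teleports to B'
        return B_PRIME, 5.0
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < N and 0 <= c < N):  # would leave the grid: stay put, reward -1
        return state, -1.0
    return (r, c), 0.0
```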
According to the definition,
$$
\begin{aligned}
v_*(s) &= \max_{a \in \mathcal A(s)} q_{\pi_*}(s,a) \\
&= \max_{a \in \mathcal A(s)} \mathbb E_{\pi_*}\!\left[G_t \mid S_t=s, A_t=a\right] \\
&= \max_{a \in \mathcal A(s)} \mathbb E_{\pi_*}\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t=s, A_t=a\right]
\end{aligned}
$$
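From here the post's summary indicates how the computation finishes: under the optimal policy the agent takes any action in A, receives +10, is teleported to A', walks back to A along the shortest path in four steps (reward 0 each), and the cycle repeats. The return from A is therefore a geometric series; a sketch of that step, with γ = 0.9 as in the book:

$$
\begin{aligned}
v_*(A) &= 10 + \gamma^5 \cdot 10 + \gamma^{10} \cdot 10 + \cdots
        = \sum_{k=0}^{\infty} 10\,\gamma^{5k}
        = \frac{10}{1-\gamma^5} \\
       &= \frac{10}{1-0.9^5} = \frac{10}{0.40951} \approx 24.419
\end{aligned}
$$

This agrees with the 24.4 reported in Figure 3.5 and gives the three-decimal answer 24.419.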
