Exercise 3.24 Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.
The rules of this gridworld are: if the agent is in state A, then regardless of which action it takes, the reward is +10 and it is moved to A' on the next step. Similarly, if the agent is in state B, then regardless of the action, the reward is +5 and it is moved to B'. If the agent is at the edge of the gridworld and its action would take it off the grid, the reward is -1 and it stays in place.
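These dynamics can be sketched as a small step function. This is only a sketch: the cell coordinates (A, A', B, B') are assumed from the book's standard 5×5 layout, which the text above does not restate.

```python
# Sketch of the gridworld dynamics described above. Coordinates are
# (row, col) with row 0 at the top; positions of A, A', B, B' are
# assumptions taken from the book's standard 5x5 gridworld figure.
N = 5
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = {'north': (-1, 0), 'south': (1, 0), 'east': (0, 1), 'west': (0, -1)}

def step(state, action):
    """Return (next_state, reward) for one move in the gridworld."""
    if state == A:            # any action from A: reward +10, jump to A'
        return A_PRIME, 10.0
    if state == B:            # any action from B: reward +5, jump to B'
        return B_PRIME, 5.0
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < N and 0 <= c < N):  # would leave the grid: stay, -1
        return state, -1.0
    return (r, c), 0.0
```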
According to the definition,
$$
\begin{aligned}
v_*(s) &= \max_{a \in \mathcal A(s)} q_{\pi_*}(s,a) \\
&= \max_{a \in \mathcal A(s)} \mathbb E_{\pi_*}\!\left[G_t \mid S_t=s, A_t=a\right] \\
&= \max_{a \in \mathcal A(s)} \mathbb E_{\pi_*}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t=s, A_t=a\right]
\end{aligned}
$$
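For the best state A, this expectation can be evaluated in closed form. Under the optimal policy the agent in A receives +10, is teleported to A', and then takes four steps (each with reward 0) back to A, so the reward sequence repeats with period 5 and the return is a geometric series:

$$
v_*(A) = 10 + 10\gamma^5 + 10\gamma^{10} + \cdots = \sum_{k=0}^{\infty} 10\,\gamma^{5k} = \frac{10}{1-\gamma^5}
$$

With $\gamma = 0.9$ this gives $10 / (1 - 0.9^5) = 10 / 0.40951 \approx 24.419$, matching the 24.4 shown in Figure 3.5 to one decimal place.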
Reinforcement Learning Exercise 3.24
Summary: This post works through a reinforcement learning exercise set in a particular gridworld environment, in which states A and B yield rewards of +10 and +5 and teleport the agent to A' and B' respectively. By analyzing the optimal policy, the post shows that the optimal value of state A is the return of an infinitely repeating reward sequence and, using the geometric series formula with γ = 0.9, computes it to three decimal places as 24.419.
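As a quick numeric check, the closed form for the value of state A (the geometric series over the 5-step loop, with γ = 0.9) can be evaluated directly:

```python
# Evaluate v*(A) = 10 / (1 - gamma^5) for gamma = 0.9 and round
# to three decimal places, as asked by the exercise.
gamma = 0.9
v_star_A = 10 / (1 - gamma ** 5)
print(round(v_star_A, 3))  # 24.419
```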