Reinforcement Learning Notes: Sutton Book, Chapter 3 Exercise Solutions (Ex 3.17–3.29)

Contents

Exercise 3.17

Exercise 3.18

Exercise 3.19

Exercise 3.20

Exercise 3.21

Exercise 3.22

Exercise 3.23

Exercise 3.24

Exercise 3.25

Exercise 3.26

Exercise 3.27  

Exercise 3.28

Exercise 3.29


Exercise 3.17

What is the Bellman equation for action values, that is, for q_{\pi}? It must give the action value q_{\pi}(s,a) in terms of the action values, q_{\pi}(s',a'), of possible successors to the state–action pair (s, a). Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.

        (Backup diagram: rooted at the state–action pair (s,a), branching to each possible next state s' and then to the actions a' available in s'.)

Solution:

        As the backup diagram above shows, starting from (s,a) the transition to each possible s' is governed by p. The return along each branch has two parts: the immediate reward r, and the (discounted) value of the next state s'. This gives (we already derived this relation in Exercise 3.13 [https://blog.csdn.net/chenxy_bwave/article/details/122522897]):

        q_{\pi}(s,a) = \sum\limits_{s',r}p(s',r|s,a)(r + \gamma v_{\pi}(s')) \cdots (1)

        Further, again from the backup diagram (cf. Exercise 3.12), the state-value function can be expressed in terms of the action-value function as:

        v_{\pi}(s) = \sum\limits_{a}\pi(a|s)q_{\pi}(s,a) \cdots (2)

        Substituting (2) into (1) gives:

        q_{\pi}(s,a) = \sum\limits_{s',r}p(s',r|s,a)(r + \gamma \sum\limits_{a'}\pi(a'|s')q_{\pi}(s',a')) \cdots (3)

        This is the Bellman equation for the action-value function.

        As an aside, since the state-value and action-value functions can each be expressed in terms of the other, starting from those two mutual expressions and eliminating one by substitution yields the Bellman equation for the other. For the derivation of the Bellman equation for the state-value function, see the earlier post 强化学习笔记:策略、值函数及贝尔曼方程 (Reinforcement Learning Notes: Policies, Value Functions, and the Bellman Equation).
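
        As a quick numerical sanity check of equations (1)–(3), here is a minimal sketch on a made-up two-state, two-action MDP. All numbers are arbitrary assumptions, not from the book, and rewards are represented by their expected value per (s, a, s') transition, which is equivalent in expectation to summing over the four-argument p(s',r|s,a).

```python
import numpy as np

# Made-up 2-state, 2-action MDP (assumed numbers, for illustration only).
n_states, gamma = 2, 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],       # p[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],       # expected reward on each (s, a, s') transition
              [[0.5, 0.5], [1.0, -1.0]]])
pi = np.array([[0.4, 0.6],                    # pi[s, a]: an arbitrary stochastic policy
               [0.7, 0.3]])

# Solve v_pi exactly from the linear Bellman equation v = r_pi + gamma * P_pi v.
P_pi = np.einsum('sa,sat->st', pi, p)
r_pi = np.einsum('sa,sat,sat->s', pi, p, r)
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Equation (1): q_pi(s,a) = sum_{s'} p(s'|s,a) * (r + gamma * v_pi(s')).
exp_r = np.einsum('sat,sat->sa', p, r)
q_pi = exp_r + gamma * np.einsum('sat,t->sa', p, v_pi)

# Equation (2): v_pi(s) = sum_a pi(a|s) * q_pi(s,a).
assert np.allclose(v_pi, np.einsum('sa,sa->s', pi, q_pi))

# Equation (3): q_pi(s,a) = sum_{s'} p(s'|s,a) * (r + gamma * sum_{a'} pi(a'|s') q_pi(s',a')).
rhs = exp_r + gamma * np.einsum('sat,t->sa', p, np.einsum('ta,ta->t', pi, q_pi))
assert np.allclose(q_pi, rhs)
print(q_pi)
```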

Exercise 3.18

The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action: 

Give the equation corresponding to this intuition and diagram for the value at the root node, v_{\pi}(s), in terms of the value at the expected leaf node, q_{\pi}(s,a), given S_t=s. This equation should include an expectation conditioned on following the policy \pi. Then give a second equation in which the expected value is written out explicitly in terms of \pi(a|s) such that no expected value notation appears in the equation. 

Solution: As shown in the backup diagram, from state s each action node is reached with probability \pi(a|s), and each action node carries the action value q_{\pi}(s,a). The value of state s is simply the expectation of q_{\pi}(s,a) over the actions. Recall the basic facts about expectations:

        \begin{align} \mathbb{E}[X] &= \sum\limits_x x\cdot p(x) \\ Y &= g(X),\\ \mathbb{E}[Y] &= \sum\limits_x g(x)\cdot p(x) \end{align}

        With the correspondence x \rightarrow a, \ g(x)\rightarrow q_{\pi}(s,a),\ p(x)\rightarrow \pi(a|s), the state value is exactly the expectation of the action values under the policy:

        v_{\pi}(s) = \mathbb{E}_{\pi}[q_{\pi}(S_t,A_t)|S_t=s] = \sum\limits_{a}\pi(a|s)q_{\pi}(s,a)
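
        Numerically this is just a probability-weighted average over the action nodes; a tiny sketch with made-up numbers:

```python
import numpy as np

# A minimal sketch with assumed numbers: v_pi(s) as the pi-weighted average of q_pi(s, .).
pi_s = np.array([0.4, 0.6])          # pi(a|s) for the two actions available in s (assumed)
q_s = np.array([1.7, 0.9])           # q_pi(s, a) for those actions (assumed)
v_s = np.dot(pi_s, q_s)              # v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
print(v_s)                           # 0.4*1.7 + 0.6*0.9 = 1.22
```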

Exercise 3.19

The value of an action q_{\pi}(s,a), depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:        

Give the equation corresponding to this intuition and diagram for the action value, q_{\pi}(s,a), in terms of the expected next reward, R_{t+1}, and the expected next state value, v_{\pi}(S_{t+1}), given that S_t=s and A_t=a. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of p(s',r|s,a) defined by (3.2), such that no expected value notation appears in the equation.

Solution:

        From (s,a) the agent reaches each branch of the diagram with probability p(s',r|s,a). The return along branch k consists of the immediate reward R_{t+1}=r_k plus the value v_{\pi}(s_k') of the next state; since v_{\pi}(s_k') belongs to time step t+1, it must be discounted by \gamma when folded back to time t. The action value is then the expectation (probability-weighted average) of these branch returns:

        \begin{align} q_{\pi}(s,a) &= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t=s, A_t=a] \\ &= \sum\limits_{s',r}p(s',r|s,a)(r + \gamma v_{\pi}(s')) \end{align}
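
        A minimal sketch of this probability-weighted average, with assumed branch probabilities, rewards, and next-state values:

```python
import numpy as np

# A minimal sketch with assumed numbers: q_pi(s,a) as the probability-weighted average
# of r + gamma * v_pi(s') over the branches leaving the (s,a) node.
gamma = 0.9
branches = [                      # (p(s',r|s,a), reward r, index of s') -- made up
    (0.7, 1.0, 0),
    (0.3, -0.5, 1),
]
v_pi = np.array([2.0, 0.5])       # assumed values of the two possible next states

q_sa = sum(prob * (rew + gamma * v_pi[s_next]) for prob, rew, s_next in branches)
print(q_sa)                       # 0.7*(1.0 + 0.9*2.0) + 0.3*(-0.5 + 0.9*0.5) = 1.945
```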

Exercise 3.20

Draw or describe the optimal state-value function for the golf example.

Exercise 3.21

Draw or describe the contours of the optimal action-value function for putting, q_*(s,putter), for the golf example.

Exercise 3.22

Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, \pi_{left} and \pi_{right}. What policy is optimal if \gamma = 0? If \gamma = 0.9? If \gamma = 0.5?

        (Diagram of the continuing MDP not reproduced here.)

Exercise 3.23

Give the Bellman equation for q_* for the recycling robot. 


Exercise 3.24

Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.


Exercise 3.25

Give an equation for v_* in terms of q_*.

Solution: v_* is the optimal state-value function. By definition it equals the largest among the optimal action values obtained by taking some action a in state s and then following the optimal policy, hence:

        v_*(s) = \max\limits_{a}q_*(s,a)
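
        In code this is just a maximum over the action dimension; a tiny sketch with assumed numbers:

```python
import numpy as np

# A tiny sketch with assumed numbers: v_*(s) is the largest optimal action value in s.
q_star_s = np.array([3.1, 4.0, 2.7])   # q_*(s, a) for three hypothetical actions in s
v_star_s = q_star_s.max()              # v_*(s) = max_a q_*(s, a)  ->  4.0
print(v_star_s)
```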

 

Exercise 3.26

Give an equation for q_* in terms of v_* and the four-argument p. 

Solution: Compare Exercise 3.19.

        The optimal action value must back up, for every possible next state s', the optimal state value, hence:

        q_*(s,a) = \sum\limits_{s',r}p(s',r|s,a)(r+\gamma v_*(s'))
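
        Putting Exercises 3.25 and 3.26 together, the sketch below (the same kind of made-up 2-state, 2-action MDP as in 3.17, with rewards given as expected values per transition) runs value iteration to approximate v_*, recovers q_* from v_* and p, and checks that v_*(s) = \max_a q_*(s,a).

```python
import numpy as np

# Made-up 2-state, 2-action MDP (assumed numbers, for illustration only).
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],      # p[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],      # expected reward on each (s, a, s') transition
              [[0.5, 0.5], [1.0, -1.0]]])
exp_r = np.einsum('sat,sat->sa', p, r)       # r(s, a) = sum_{s'} p(s'|s,a) * r(s,a,s')

v = np.zeros(2)
for _ in range(1000):                        # value iteration: v <- max_a [r + gamma * P v]
    v = (exp_r + gamma * np.einsum('sat,t->sa', p, v)).max(axis=1)

q_star = exp_r + gamma * np.einsum('sat,t->sa', p, v)   # q_*(s,a) from v_* and p (Ex 3.26)
assert np.allclose(v, q_star.max(axis=1))                # v_*(s) = max_a q_*(s,a)  (Ex 3.25)
print(q_star)
```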

Exercise 3.27  

Give an equation for \pi_* in terms of q_*.

Solution: A policy selects an action from each state s. An optimal policy selects, from every state s, an optimal action, denoted a_*(s); that is, it selects a_*(s) with probability 1 and every non-optimal action with probability 0. Note that a state may have more than one optimal action, in which case any one of them may be chosen, and all of them necessarily share the same optimal action value.

        First, an optimal action in state s satisfies:

        a_*(s) = \arg\max\limits_{a}q_*(s,a)

        Then the optimal policy can be written as (assuming, for simplicity, that each state has a unique optimal action):

        \pi_*(a|s) = \begin{cases} 1, & a = \arg\max\limits_{a'}q_*(s,a') \\ 0, & \text{otherwise} \end{cases}
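
        A minimal sketch of reading off such a greedy policy from a q_* table (the numbers are assumptions for illustration):

```python
import numpy as np

# A minimal sketch with assumed numbers: the optimal (deterministic) policy is greedy in q_*.
q_star = np.array([[3.1, 4.0, 2.7],     # q_*[s, a] for 2 states x 3 actions (made up)
                   [0.2, 0.1, 0.9]])
a_star = q_star.argmax(axis=1)           # a_*(s) = argmax_a q_*(s, a)  ->  [1, 2]

# As a probability table pi_*(a|s): 1 for the greedy action, 0 elsewhere.
pi_star = np.zeros_like(q_star)
pi_star[np.arange(q_star.shape[0]), a_star] = 1.0
print(pi_star)
```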

Exercise 3.28

Give an equation for \pi_* in terms of v_* and the four-argument p. 

Solution: Combining 3.26 and 3.27 (substituting the answer of 3.26 into that of 3.27) gives:

        \pi_*(a|s) = \begin{cases} 1, & a = \arg\max\limits_{a'} \sum\limits_{s',r}p(s',r|s,a')(r+\gamma v_*(s')) \\ 0, & \text{otherwise} \end{cases}
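
        Equivalently, a one-step lookahead with v_* and the transition model gives the greedy action directly, without forming q_* as a separate table. A minimal sketch with assumed numbers (including a made-up v_*):

```python
import numpy as np

# One-step lookahead: pick the action maximizing sum_{s'} p(s'|s,a) * (r + gamma * v_*(s')).
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])          # p[s, a, s'] (assumed)
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])          # expected reward per transition (assumed)
v_star = np.array([7.0, 5.0])                      # pretend this is v_* (assumed numbers)

lookahead = np.einsum('sat,sat->sa', p, r) + gamma * np.einsum('sat,t->sa', p, v_star)
a_star = lookahead.argmax(axis=1)                  # greedy action in each state
print(a_star)
```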

Exercise 3.29

Rewrite the four Bellman equations for the four value functions (v_{\pi}, v_*, q_{\pi}, and q_*) in terms of the three-argument function p (3.4) and the two-argument function r (3.5).

Solution:

        \begin{align} v_{\pi}(s) &=\sum\limits_{a}\pi(a|s)\sum\limits_{s',r}p(s',r|s,a)(r+\gamma v_{\pi}(s')) \\ &= \sum\limits_{a}\pi(a|s)\bigg\{\sum\limits_{s',r} r\, p(s',r|s,a) + \gamma\sum\limits_{s',r}p(s',r|s,a)\, v_{\pi}(s') \bigg\}\\ &=\sum\limits_{a}\pi(a|s)\bigg\{r(s,a) + \gamma \sum\limits_{s'} v_{\pi}(s') \sum\limits_{r}p(s',r|s,a) \bigg\}\\ &=\sum\limits_{a}\pi(a|s)\bigg\{r(s,a) + \gamma \sum\limits_{s'} p(s'|s,a)\, v_{\pi}(s') \bigg\} \end{align}

        The remaining three equations (q_{\pi}, v_*, q_*) follow by the same manipulation and are omitted here.
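
        For example, carrying out the same substitution for q_{\pi} (my own working, following the pattern above) gives:

        q_{\pi}(s,a) = r(s,a) + \gamma \sum\limits_{s'} p(s'|s,a)\sum\limits_{a'}\pi(a'|s')q_{\pi}(s',a')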

Back to the series index: 强化学习笔记总目录 (Reinforcement Learning Notes: Table of Contents) https://chenxiaoyuan.blog.csdn.net/article/details/121715424

For the first half of the Chapter 3 exercises (Sutton RL book, 2nd edition), see: 强化学习笔记:Sutton-Book第三章习题解答(Ex1~Ex16) https://blog.csdn.net/chenxy_bwave/article/details/122522897
