Reinforcement Learning Notes: Sutton Book, Chapter 3 Exercise Solutions (Ex 3.17–3.29)

Contents

Exercise 3.17

Exercise 3.18

Exercise 3.19

Exercise 3.20

Exercise 3.21

Exercise 3.22

Exercise 3.23

Exercise 3.24

Exercise 3.25

Exercise 3.26

Exercise 3.27  

Exercise 3.28

Exercise 3.29


Exercise 3.17

What is the Bellman equation for action values, that is, for q_{\pi}? It must give the action value q_{\pi}(s,a) in terms of the action values, q_{\pi}(s',a'), of possible successors to the state–action pair (s, a). Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.

        (Backup diagram: rooted at the state–action pair (s,a), branching to each possible next state s' and then to the actions a' available in s'.)

Solution:

        As the backup diagram above shows, starting from (s,a) the transition to each possible s' is governed by p. The return along each branch has two parts: the immediate reward r, and the (discounted) value of the next state s'. This gives (we already derived this relation in Exercise 3.13 [https://blog.csdn.net/chenxy_bwave/article/details/122522897]):

        q_{\pi}(s,a) = \sum\limits_{s',r}p(s',r|s,a)(r + \gamma v_{\pi}(s')) \cdots (1)

        Further, again from the backup diagram (cf. Exercise 3.12), the state-value function can be expressed in terms of the action-value function as:

        v_{\pi}(s) = \sum\limits_{a}\pi(a|s)q_{\pi}(s,a) \cdots (2)

        Substituting (2) into (1) gives:

        q_{\pi}(s,a) = \sum\limits_{s',r}p(s',r|s,a)(r + \gamma \sum\limits_{a'}\pi(a'|s')q_{\pi}(s',a')) \cdots (3)

        This is the Bellman equation for the action-value function.

        As an aside, since the state-value and action-value functions can each be expressed in terms of the other, starting from those two mutual expressions and eliminating one by substitution yields the Bellman equation for the other. For the derivation of the Bellman equation for the state-value function, see the earlier post 强化学习笔记:策略、值函数及贝尔曼方程 (Reinforcement Learning Notes: Policies, Value Functions, and the Bellman Equation).
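
        As a quick numerical sanity check of equations (1)–(3), here is a minimal sketch on a made-up two-state, two-action MDP. All numbers are arbitrary assumptions, not from the book, and rewards are represented by their expected value per (s, a, s') transition, which is equivalent in expectation to summing over the four-argument p(s',r|s,a).

```python
import numpy as np

# Made-up 2-state, 2-action MDP (assumed numbers, for illustration only).
n_states, gamma = 2, 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],       # p[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],       # expected reward on each (s, a, s') transition
              [[0.5, 0.5], [1.0, -1.0]]])
pi = np.array([[0.4, 0.6],                    # pi[s, a]: an arbitrary stochastic policy
               [0.7, 0.3]])

# Solve v_pi exactly from the linear Bellman equation v = r_pi + gamma * P_pi v.
P_pi = np.einsum('sa,sat->st', pi, p)
r_pi = np.einsum('sa,sat,sat->s', pi, p, r)
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Equation (1): q_pi(s,a) = sum_{s'} p(s'|s,a) * (r + gamma * v_pi(s')).
exp_r = np.einsum('sat,sat->sa', p, r)
q_pi = exp_r + gamma * np.einsum('sat,t->sa', p, v_pi)

# Equation (2): v_pi(s) = sum_a pi(a|s) * q_pi(s,a).
assert np.allclose(v_pi, np.einsum('sa,sa->s', pi, q_pi))

# Equation (3): q_pi(s,a) = sum_{s'} p(s'|s,a) * (r + gamma * sum_{a'} pi(a'|s') q_pi(s',a')).
rhs = exp_r + gamma * np.einsum('sat,t->sa', p, np.einsum('ta,ta->t', pi, q_pi))
assert np.allclose(q_pi, rhs)
print(q_pi)
```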

Exercise 3.18

The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action: 

Give the equation corresponding to this intuition and diagram for the value at the root node, v_{\pi}(s), in terms of the value at the expected leaf node, q_{\pi}(s,a), given S_t=s. This equation should include an expectation conditioned on following the policy \pi. Then give a second equation in which the expected value is written out explicitly in terms of \pi(a|s) such that no expected value notation appears in the equation. 

Solution: As shown in the backup diagram, from state s each action node is reached with probability \pi(a|s), and each action node carries the action value q_{\pi}(s,a). The value of state s is simply the expectation of q_{\pi}(s,a) over the actions. Recall the basic facts about expectations:

        \begin{align} \mathbb{E}[X] &= \sum\limits_x x\cdot p(x) \\ Y &= g(X),\\ \mathbb{E}[Y] &= \sum\limits_x g(x)\cdot p(x) \end{align}

        With the correspondence x \rightarrow a, \ g(x)\rightarrow q_{\pi}(s,a),\ p(x)\rightarrow \pi(a|s), the state value is exactly the expectation of the action values under the policy:

        v_{\pi}(s) = \mathbb{E}_{\pi}[q_{\pi}(S_t,A_t)|S_t=s] = \sum\limits_{a}\pi(a|s)q_{\pi}(s,a)
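
        Numerically this is just a probability-weighted average over the action nodes; a tiny sketch with made-up numbers:

```python
import numpy as np

# A minimal sketch with assumed numbers: v_pi(s) as the pi-weighted average of q_pi(s, .).
pi_s = np.array([0.4, 0.6])          # pi(a|s) for the two actions available in s (assumed)
q_s = np.array([1.7, 0.9])           # q_pi(s, a) for those actions (assumed)
v_s = np.dot(pi_s, q_s)              # v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
print(v_s)                           # 0.4*1.7 + 0.6*0.9 = 1.22
```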

Exercise 3.19

The value of an action q_{\pi}(s,a), depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:        

Give the equation corresponding to this intuition and diagram for the action value, q_{\pi}(s,a), in terms of the expected next reward, R_{t+1}, and the expected next state value, v_{\pi}(S_{t+1}), given that S_t=s and A_t=a. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of p(s',r|s,a) defined by (3.2), such that no expected value notation appears in the equation.

Solution:

        From (s,a) the agent reaches each branch of the diagram with probability p(s',r|s,a). The return along branch k consists of the immediate reward R_{t+1}=r_k plus the value v_{\pi}(s_k') of the next state; since v_{\pi}(s_k') belongs to time step t+1, it must be discounted by \gamma when folded back to time t. The action value is then the expectation (probability-weighted average) of these branch returns:

        \begin{align} q_{\pi}(s,a) &= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t=s, A_t=a] \\ &= \sum\limits_{s',r}p(s',r|s,a)(r + \gamma v_{\pi}(s')) \end{align}
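
        A minimal sketch of this probability-weighted average, with assumed branch probabilities, rewards, and next-state values:

```python
import numpy as np

# A minimal sketch with assumed numbers: q_pi(s,a) as the probability-weighted average
# of r + gamma * v_pi(s') over the branches leaving the (s,a) node.
gamma = 0.9
branches = [                      # (p(s',r|s,a), reward r, index of s') -- made up
    (0.7, 1.0, 0),
    (0.3, -0.5, 1),
]
v_pi = np.array([2.0, 0.5])       # assumed values of the two possible next states

q_sa = sum(prob * (rew + gamma * v_pi[s_next]) for prob, rew, s_next in branches)
print(q_sa)                       # 0.7*(1.0 + 0.9*2.0) + 0.3*(-0.5 + 0.9*0.5) = 1.945
```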

Exercise 3.20

Draw or describe the optimal state-value function for the golf example.

Exercise 3.21

Draw or describe the contours of the optimal action-value function for putting, q_*(s,putter), for the golf example.

Exercise 3.22

Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, \pi_{left} and \pi_{right}. What policy is optimal if \gamma = 0? If \gamma = 0.9? If \gamma = 0.5?

        (Diagram of the continuing MDP not reproduced here.)

Exercise 3.23

Give the Bellman equation for q_* for the recycling robot. 


Exercise 3.24

Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.


Exercise 3.25

Give an equation for v_* in terms of q_*.

Solution: v_* is the optimal state-value function. By definition it equals the largest among the optimal action values obtained by taking some action a in state s and then following the optimal policy, hence:

        v_*(s) = \max\limits_{a}q_*(s,a)
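
        In code this is just a maximum over the action dimension; a tiny sketch with assumed numbers:

```python
import numpy as np

# A tiny sketch with assumed numbers: v_*(s) is the largest optimal action value in s.
q_star_s = np.array([3.1, 4.0, 2.7])   # q_*(s, a) for three hypothetical actions in s
v_star_s = q_star_s.max()              # v_*(s) = max_a q_*(s, a)  ->  4.0
print(v_star_s)
```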

 

Exercise 3.26

Give an equation for q_* in terms of v_* and the four-argument p. 

Solution: Compare Exercise 3.19.

        The optimal action value must back up, for every possible next state s', the optimal state value, hence:

        q_*(s,a) = \sum\limits_{s',r}p(s',r|s,a)(r+\gamma v_*(s'))
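
        Putting Exercises 3.25 and 3.26 together, the sketch below (the same kind of made-up 2-state, 2-action MDP as in 3.17, with rewards given as expected values per transition) runs value iteration to approximate v_*, recovers q_* from v_* and p, and checks that v_*(s) = \max_a q_*(s,a).

```python
import numpy as np

# Made-up 2-state, 2-action MDP (assumed numbers, for illustration only).
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],      # p[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],      # expected reward on each (s, a, s') transition
              [[0.5, 0.5], [1.0, -1.0]]])
exp_r = np.einsum('sat,sat->sa', p, r)       # r(s, a) = sum_{s'} p(s'|s,a) * r(s,a,s')

v = np.zeros(2)
for _ in range(1000):                        # value iteration: v <- max_a [r + gamma * P v]
    v = (exp_r + gamma * np.einsum('sat,t->sa', p, v)).max(axis=1)

q_star = exp_r + gamma * np.einsum('sat,t->sa', p, v)   # q_*(s,a) from v_* and p (Ex 3.26)
assert np.allclose(v, q_star.max(axis=1))                # v_*(s) = max_a q_*(s,a)  (Ex 3.25)
print(q_star)
```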

Exercise 3.27  

Give an equation for \pi_* in terms of q_*.

Solution: A policy selects an action from each state s. An optimal policy selects, from every state s, an optimal action, denoted a_*(s); that is, it selects a_*(s) with probability 1 and every non-optimal action with probability 0. Note that a state may have more than one optimal action, in which case any one of them may be chosen, and all of them necessarily share the same optimal action value.

        First, an optimal action in state s satisfies:

        a_*(s) = \arg\max\limits_{a}q_*(s,a)

        Then the optimal policy can be written as (assuming, for simplicity, that each state has a unique optimal action):

        \pi_*(a|s) = \begin{cases} 1, & a = \arg\max\limits_{a'}q_*(s,a') \\ 0, & \text{otherwise} \end{cases}
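
        A minimal sketch of reading off such a greedy policy from a q_* table (the numbers are assumptions for illustration):

```python
import numpy as np

# A minimal sketch with assumed numbers: the optimal (deterministic) policy is greedy in q_*.
q_star = np.array([[3.1, 4.0, 2.7],     # q_*[s, a] for 2 states x 3 actions (made up)
                   [0.2, 0.1, 0.9]])
a_star = q_star.argmax(axis=1)           # a_*(s) = argmax_a q_*(s, a)  ->  [1, 2]

# As a probability table pi_*(a|s): 1 for the greedy action, 0 elsewhere.
pi_star = np.zeros_like(q_star)
pi_star[np.arange(q_star.shape[0]), a_star] = 1.0
print(pi_star)
```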

Exercise 3.28

Give an equation for \pi_* in terms of v_* and the four-argument p. 

Solution: Combining 3.26 and 3.27 (substituting the answer of 3.26 into that of 3.27) gives:

        \pi_*(a|s) = \begin{cases} 1, & a = \arg\max\limits_{a'} \sum\limits_{s',r}p(s',r|s,a')(r+\gamma v_*(s')) \\ 0, & \text{otherwise} \end{cases}
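
        Equivalently, a one-step lookahead with v_* and the transition model gives the greedy action directly, without forming q_* as a separate table. A minimal sketch with assumed numbers (including a made-up v_*):

```python
import numpy as np

# One-step lookahead: pick the action maximizing sum_{s'} p(s'|s,a) * (r + gamma * v_*(s')).
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])          # p[s, a, s'] (assumed)
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])          # expected reward per transition (assumed)
v_star = np.array([7.0, 5.0])                      # pretend this is v_* (assumed numbers)

lookahead = np.einsum('sat,sat->sa', p, r) + gamma * np.einsum('sat,t->sa', p, v_star)
a_star = lookahead.argmax(axis=1)                  # greedy action in each state
print(a_star)
```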

Exercise 3.29

Rewrite the four Bellman equations for the four value functions (v_{\pi}, v_*, q_{\pi}, and q_*) in terms of the three-argument function p (3.4) and the two-argument function r (3.5).

Solution:

        \begin{align} v_{\pi}(s) &=\sum\limits_{a}\pi(a|s)\sum\limits_{s',r}p(s',r|s,a)(r+\gamma v_{\pi}(s')) \\ &= \sum\limits_{a}\pi(a|s)\bigg\{\sum\limits_{s',r} r\, p(s',r|s,a) + \gamma\sum\limits_{s',r}p(s',r|s,a)\, v_{\pi}(s') \bigg\}\\ &=\sum\limits_{a}\pi(a|s)\bigg\{r(s,a) + \gamma \sum\limits_{s'} v_{\pi}(s') \sum\limits_{r}p(s',r|s,a) \bigg\}\\ &=\sum\limits_{a}\pi(a|s)\bigg\{r(s,a) + \gamma \sum\limits_{s'} p(s'|s,a)\, v_{\pi}(s') \bigg\} \end{align}

        The remaining three equations (q_{\pi}, v_*, q_*) follow by the same manipulation and are omitted here.
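
        For example, carrying out the same substitution for q_{\pi} (my own working, following the pattern above) gives:

        q_{\pi}(s,a) = r(s,a) + \gamma \sum\limits_{s'} p(s'|s,a)\sum\limits_{a'}\pi(a'|s')q_{\pi}(s',a')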

Back to the series index: 强化学习笔记总目录 (Reinforcement Learning Notes: Table of Contents) https://chenxiaoyuan.blog.csdn.net/article/details/121715424

For the first half of the Chapter 3 exercises (Sutton RL book, 2nd edition), see: 强化学习笔记:Sutton-Book第三章习题解答(Ex1~Ex16) https://blog.csdn.net/chenxy_bwave/article/details/122522897
