Exercise 3.17
What is the Bellman equation for action values, that is, for q_π? It must give the action value q_π(s, a) in terms of the action values, q_π(s', a'), of possible successors to the state–action pair (s, a). Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
Solution:
As the backup diagram above shows, the transitions from (s, a) to each possible s' are governed by p. The total return along each branch has two parts: the immediate reward r, and the (discounted) value of the successor state s'. This gives (we already derived this relation in Exercise 3.13 [https://blog.csdn.net/chenxy_bwave/article/details/122522897]):

    q_π(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]        (1)

Further, again from the backup diagram (cf. Exercise 3.12), the state-value function can be expressed in terms of the action-value function:

    v_π(s) = Σ_a π(a | s) q_π(s, a)        (2)

Substituting (2) into (1) yields:

    q_π(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ Σ_{a'} π(a' | s') q_π(s', a') ]

This is the Bellman equation for the action-value function!
Incidentally, since the state-value and action-value functions can each be expressed in terms of the other, starting from the two mutual expressions and eliminating one by substitution yields the Bellman equation for the other. For the derivation of the Bellman equation for the state-value function, see 强化学习笔记:策略、值函数及贝尔曼方程 (RL notes: policies, value functions, and the Bellman equation).
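The substitution above can be checked numerically. Here is a small sketch on a made-up random MDP (all tables below are hypothetical): solve the linear system for v_π, build q_π from relation (1), and verify the Bellman equation for q:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# Hypothetical random MDP: p[s, a, s'] transition probabilities, r[s, a, s'] rewards.
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((nS, nA, nS))
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)

# Solve the linear Bellman system for v_pi, then obtain q_pi from relation (1).
P_pi = np.einsum('sa,sat->st', pi, p)        # state-to-state transition matrix under pi
r_pi = np.einsum('sa,sat,sat->s', pi, p, r)  # expected one-step reward under pi
v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
q = np.einsum('sat,sat->sa', p, r) + gamma * p @ v

# Bellman equation for q: q(s,a) = sum_{s',r} p(s',r|s,a)[r + gamma sum_{a'} pi(a'|s') q(s',a')]
q_next = (pi * q).sum(axis=1)                # inner sum, which equals v_pi(s')
q_bellman = np.einsum('sat,sat->sa', p, r) + gamma * p @ q_next
assert np.allclose(q, q_bellman)
```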
Exercise 3.18
The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
Give the equation corresponding to this intuition and diagram for the value at the root node, v_π(s), in terms of the value at the expected leaf node, q_π(s, a), given S_t = s. This equation should include an expectation conditioned on following the policy, π. Then give a second equation in which the expected value is written out explicitly in terms of π(a|s) such that no expected value notation appears in the equation.
Solution: As the diagram above shows, from state s we reach each action node with probability π(a|s), and each action node has action value q_π(s, a). The value of state s is the expectation of q_π(S_t, A_t):

    v_π(s) = E_π[ q_π(S_t, A_t) | S_t = s ]

so writing the expectation out, the state value is the policy-weighted average of the action values:

    v_π(s) = Σ_a π(a | s) q_π(s, a)
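As a tiny numeric illustration (the action values and policy probabilities below are made up):

```python
import numpy as np

# Hypothetical state with three actions: its action values and policy probabilities.
q_s  = np.array([1.0, 2.0, 4.0])    # q_pi(s, a) for a = 0, 1, 2
pi_s = np.array([0.5, 0.25, 0.25])  # pi(a | s)

# v_pi(s) = E_pi[q_pi(S_t, A_t) | S_t = s] = sum_a pi(a|s) q_pi(s, a)
v_s = float(pi_s @ q_s)
print(v_s)  # 0.5*1 + 0.25*2 + 0.25*4 = 2.0
```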
Exercise 3.19
The value of an action, q_π(s, a), depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:
Give the equation corresponding to this intuition and diagram for the action value, q_π(s, a), in terms of the expected next reward, R_{t+1}, and the expected next state value, v_π(S_{t+1}), given that S_t = s and A_t = a. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of p(s', r | s, a) defined by (3.2), such that no expected value notation appears in the equation.
Solution:
From (s, a) we reach the branches shown above with probabilities given by p. The total return along each branch consists of the immediate reward R_{t+1} and the value of the next state, v_π(S_{t+1}). The latter belongs to time step t+1, so discounting it back to time t requires multiplying by the discount factor γ. The action value is therefore the expectation (probability-weighted average) of the branch returns:

    q_π(s, a) = E[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a ]
              = Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]
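A quick numeric sketch of this weighted average, with two hypothetical branches from (s, a):

```python
gamma = 0.9

# Hypothetical branches from (s, a): (probability p(s',r|s,a), reward r, v_pi(s')).
branches = [
    (0.7, 1.0, 5.0),  # with prob 0.7: reward 1, next-state value 5
    (0.3, 0.0, 2.0),  # with prob 0.3: reward 0, next-state value 2
]

# q_pi(s, a) = sum over branches of p * (r + gamma * v_pi(s'))
q_sa = sum(p * (r + gamma * v) for p, r, v in branches)
print(round(q_sa, 3))  # 0.7*(1 + 4.5) + 0.3*(0 + 1.8) = 4.39
```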
Exercise 3.20
Draw or describe the optimal state-value function for the golf example.
Exercise 3.21
Draw or describe the contours of the optimal action-value function for putting, q*(s, putter), for the golf example.
Exercise 3.22
Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, π_left and π_right. What policy is optimal if γ = 0? If γ = 0.9? If γ = 0.5?
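The exercise is not worked out here, but under one reading of the figure (assumed here: the left action earns +1 and its return step 0, the right action earns 0 and its return step +2, each policy cycling with period 2) the two discounted returns from the top state have closed forms that can be compared directly:

```python
# Sketch under the assumed MDP structure described above: each deterministic
# policy produces a period-2 reward cycle, so its return is a geometric series.
def v_left(gamma):
    # rewards 1, 0, 1, 0, ...  ->  1 / (1 - gamma^2)
    return 1.0 / (1.0 - gamma**2)

def v_right(gamma):
    # rewards 0, 2, 0, 2, ...  ->  2*gamma / (1 - gamma^2)
    return 2.0 * gamma / (1.0 - gamma**2)

for g in (0.0, 0.9, 0.5):
    print(g, v_left(g), v_right(g))
# gamma = 0.0: left wins (1.0 vs 0.0); gamma = 0.9: right wins (~5.26 vs ~9.47);
# gamma = 0.5: tie (4/3 each)
```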
Exercise 3.23
Give the Bellman equation for q* for the recycling robot.
Exercise 3.24
Figure 3.5 gives the optimal value of the best state of the gridworld as
24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express
this value symbolically, and then to compute it to three decimal places.
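Assuming the optimal policy in the book's gridworld (jump from A to A' for +10, then return to A in four zero-reward steps, so the reward sequence repeats with period 5), the value is the geometric series Σ_k γ^{5k}·10 = 10/(1 − γ⁵), which the following sketch evaluates:

```python
gamma = 0.9

# Closed form of the geometric series sum_k gamma^(5k) * 10 = 10 / (1 - gamma^5).
v_star_A = 10.0 / (1.0 - gamma**5)
print(round(v_star_A, 3))  # 24.419

# Cross-check by summing the reward sequence 10, 0, 0, 0, 0, 10, ... directly.
v_sum = sum(10.0 * gamma**(5 * k) for k in range(1000))
assert abs(v_star_A - v_sum) < 1e-9
```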
Exercise 3.25
Give an equation for v* in terms of q*.
Solution: v* is the optimal state-value function. By definition it must equal the largest of the optimal action values obtainable by taking some action a in state s and following the optimal policy thereafter:

    v*(s) = max_a q*(s, a)
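A minimal sketch with a made-up q* table:

```python
import numpy as np

# Hypothetical q* table for 2 states x 3 actions.
q_star = np.array([[1.0, 3.0, 2.0],
                   [0.5, 0.2, 0.9]])

# v*(s) = max_a q*(s, a): row-wise maximum over actions.
v_star = q_star.max(axis=1)
print(v_star)  # [3.  0.9]
```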
Exercise 3.26
Give an equation for q* in terms of v* and the four-argument p.
Solution: Cf. Exercise 3.19. The optimal action value backs up the optimal state value of each possible next state s', so:

    q*(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ v*(s') ]
Exercise 3.27
Give an equation for π* in terms of q*.
Solution: A policy selects an action from a given state s. An optimal policy means that from every state s it selects a corresponding optimal action, denoted a* — that is, it selects a* with probability 1 and every non-optimal action with probability 0. Note that a state s may have more than one optimal action; in that case any one of them may be chosen, but the action values of the multiple optimal actions must all be equal.
First, an optimal action in state s satisfies:

    a* = argmax_a q*(s, a)

The optimal policy can then be expressed as (for simplicity, assuming each state has a single optimal action):

    π*(a | s) = 1 if a = argmax_{a'} q*(s, a'), and 0 otherwise
Exercise 3.28
Give an equation for π* in terms of v* and the four-argument p.
Solution: Combining 3.26 and 3.27 (substituting the answer of 3.26 into the answer of 3.27) gives:

    π*(a | s) = 1 if a = argmax_{a'} Σ_{s', r} p(s', r | s, a') [ r + γ v*(s') ], and 0 otherwise
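This one-step lookahead can be sketched numerically on a made-up MDP (all tables below are hypothetical, including the given v*):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.9

# Hypothetical MDP and an (assumed given) optimal state-value function v*.
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((nS, nA, nS))
v_star = rng.random(nS)

# 3.26: q*(s,a) = sum_{s',r} p(s',r|s,a) [ r + gamma v*(s') ]
q_star = np.einsum('sat,sat->sa', p, r) + gamma * p @ v_star

# 3.27/3.28: pi* puts probability 1 on an argmax action of the lookahead.
greedy_action = q_star.argmax(axis=1)
pi_star = np.zeros((nS, nA))
pi_star[np.arange(nS), greedy_action] = 1.0
print(greedy_action)
```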
Exercise 3.29
Rewrite the four Bellman equations for the four value functions (v_π, v_*, q_π, and q_*) in terms of the three-argument function p (3.4) and the two-argument function r (3.5).
Solution: For example, for v_π:

    v_π(s) = Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} p(s' | s, a) v_π(s') ]

The other three follow analogously and are omitted.
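The (3.4)/(3.5) form of the equation can be exercised directly as a fixed-point iteration; the MDP tables below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 3, 2, 0.9

# Hypothetical MDP in the (3.4)/(3.5) form:
# p3[s, a, s'] = p(s' | s, a)  (three-argument p), r2[s, a] = r(s, a)  (two-argument r).
p3 = rng.random((nS, nA, nS))
p3 /= p3.sum(axis=2, keepdims=True)
r2 = rng.random((nS, nA))
pi = np.full((nS, nA), 1.0 / nA)  # uniform random policy

# Iterate v <- sum_a pi(a|s) [ r(s,a) + gamma sum_{s'} p(s'|s,a) v(s') ] to the fixed point.
v = np.zeros(nS)
for _ in range(500):
    v = np.einsum('sa,sa->s', pi, r2 + gamma * p3 @ v)
print(np.round(v, 3))
```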
Back to the master index: 强化学习笔记总目录 (RL notes: master index) https://chenxiaoyuan.blog.csdn.net/article/details/121715424
For the first half of the Chapter 3 exercises of Sutton's RL book (2nd edition), see: 强化学习笔记:Sutton-Book第三章习题解答(Ex1~Ex16) https://blog.csdn.net/chenxy_bwave/article/details/122522897