Reinforcement Learning Notes 2: Finite Markov Decision Processes

Markov Decision Processes (MDPs) are a classical formalization of sequential decision making, in which each action influences not only the immediate reward but also subsequent states and, through them, future rewards. MDPs therefore involve delayed reward and the need to trade off immediate and delayed reward. In the bandit problem we cared about the value $q_*(a)$ of each action $a$; in an MDP we care about the value $q_*(s,a)$ of each action $a$ in each state $s$, and also about the value $v_*(s)$ of each state under optimal action selection. These state-dependent quantities are essential for correctly assigning credit for the long-term consequences of individual action choices. These notes correspond to Chapter 3 of Sutton and Barto's book.

The Agent-Environment Interface

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.

  • agent: the learner and decision maker
  • environment: the thing it interacts with, comprising everything outside the agent
[Figure: the agent-environment interaction in a Markov decision process]
The interaction goes on continually over a sequence of discrete time steps $t=0,1,2,3,\dots$. At each step the agent observes the state $S_t\in\mathcal S$ and selects an action $A_t\in\mathcal A(s)$; one step later, as a consequence of that action, the environment returns a new state $S_{t+1}$ and a numerical reward $R_{t+1}\in\mathcal R\subset\mathbb R$. The agent's objective is to maximize its cumulative reward. The interaction thus gives rise to a sequence, or trajectory:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$
In a finite Markov decision process, the sets of states, actions, and rewards $(\mathcal S, \mathcal A, \mathcal R)$ are all finite. In this case the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action.
For all $s', s\in\mathcal S$, $r\in\mathcal R$, $a\in\mathcal A(s)$, define (3.2):
$$p(s',r\mid s,a) := \Pr\{S_t=s',\, R_t=r \mid S_{t-1}=s,\, A_{t-1}=a\}\qquad(3.2)$$
For any $s\in\mathcal S$, $a\in\mathcal A(s)$, we have (3.3):
$$\sum_{s'\in\mathcal S}\sum_{r\in\mathcal R} p(s',r\mid s,a) = 1\qquad(3.3)$$
The four-argument function $p:\mathcal S\times\mathcal R\times\mathcal S\times\mathcal A\to[0,1]$ completely characterizes the dynamics of a finite MDP. From it we can compute anything else we might want to know about the environment, for example:
  • state-transition probabilities $p:\mathcal S\times\mathcal S\times\mathcal A\to[0,1]$:
    $$p(s'\mid s,a) := \Pr\{S_t=s'\mid S_{t-1}=s, A_{t-1}=a\} = \sum_{r\in\mathcal R} p(s',r\mid s,a)\qquad(3.4)$$
  • expected rewards for state-action pairs $r:\mathcal S\times\mathcal A\to\mathbb R$:
    $$r(s,a) := \mathbb E\big[R_t \mid S_{t-1}=s, A_{t-1}=a\big] = \sum_{r\in\mathcal R} r \sum_{s'\in\mathcal S} p(s',r\mid s,a)\qquad(3.5)$$
  • expected rewards for state-action-next-state triples $r:\mathcal S\times\mathcal A\times\mathcal S\to\mathbb R$:
    $$r(s,a,s') := \mathbb E\big[R_t \mid S_{t-1}=s, A_{t-1}=a, S_t=s'\big] = \sum_{r\in\mathcal R} r\,\frac{p(s',r\mid s,a)}{p(s'\mid s,a)}\qquad(3.6)$$
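As a concrete illustration of how the four-argument function $p$ can be treated as a lookup table, here is a minimal Python sketch of the derived quantities (3.4)-(3.6). The dictionary layout and the function names (`transition_prob`, `expected_reward`, `expected_reward_given_next`) are my own choices, not notation from the book.

```python
# Dynamics table: p[(s, a)] is a list of (s_next, reward, probability) triples,
# i.e. one convenient encoding of the four-argument function p(s', r | s, a).

def transition_prob(p, s, a, s_next):
    """State-transition probability p(s'|s,a), eq. (3.4)."""
    return sum(prob for (s2, r, prob) in p[(s, a)] if s2 == s_next)

def expected_reward(p, s, a):
    """Expected reward r(s,a), eq. (3.5)."""
    return sum(r * prob for (s2, r, prob) in p[(s, a)])

def expected_reward_given_next(p, s, a, s_next):
    """Expected reward r(s,a,s'), eq. (3.6)."""
    denom = transition_prob(p, s, a, s_next)
    numer = sum(r * prob for (s2, r, prob) in p[(s, a)] if s2 == s_next)
    return numer / denom if denom > 0 else 0.0
```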

How do we draw the boundary between agent and environment? The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. We do not assume that everything in the environment is unknown to the agent. The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge.

The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. Whatever the details of the sensory, memory, and control apparatus, and whatever objective one is trying to achieve, any problem of goal-directed learning can be reduced to three signals passing back and forth between an agent and its environment:

  • one signal to represent the choices made by the agent (actions)
  • one signal to represent the basis on which the choices are made (states)
  • one signal to define the agent's goal (rewards)

【Exercise 3.4】If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of the four-argument function $p$ (3.2)?
Solution:
$$\mathbb E\big[R_{t+1}\mid S_t=s\big] = \sum_{a\in\mathcal A(s)}\pi(a\mid s)\,r(s,a) = \sum_{a\in\mathcal A(s)}\pi(a\mid s)\sum_{r\in\mathcal R} r\sum_{s'\in\mathcal S} p(s',r\mid s,a)$$
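A quick numerical sanity check of this identity on a made-up two-state, two-action example (the numbers and the dictionary encoding are arbitrary illustrations, not from the book):

```python
# p[(s, a)] lists (s_next, reward, prob); pi[(s, a)] is pi(a|s).
p = {
    ("s0", "a0"): [("s0", 1.0, 0.7), ("s1", 0.0, 0.3)],
    ("s0", "a1"): [("s1", 5.0, 1.0)],
}
pi = {("s0", "a0"): 0.9, ("s0", "a1"): 0.1}

expected_r = sum(
    pi[("s0", a)] * sum(r * prob for (_, r, prob) in p[("s0", a)])
    for a in ("a0", "a1")
)
print(expected_r)  # 0.9*0.7 + 0.1*5.0 ≈ 1.13
```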

Example: Recycling Robot MDP

A mobile robot collects empty soda cans in an office environment. It has sensors for detecting cans, and an arm and gripper that can pick them up. The robot's control system has components for interpreting sensory information, for navigating, and for controlling the arm and gripper, while the high-level decision of how to search for cans is made by a reinforcement learning agent based on the current charge level of the battery. The agent must decide whether the robot should (1) actively search for a can for a certain period of time, (2) remain stationary and wait for someone to bring it a can, or (3) head back to its home base to recharge. The decision could be made periodically or whenever certain external events occur, such as finding an empty can. The agent therefore has three actions, and the state is determined mainly by the state of the battery. The reward is zero most of the time, becomes positive when the robot collects an empty can, and becomes a large negative value if the battery runs all the way down. We make some assumptions about the environment: the best way to find cans is to search actively, but this runs down the battery, whereas waiting does not. Whenever the robot is searching there is a possibility that the battery will become depleted, in which case the robot must shut down and be recharged (producing a low reward).
We can now build the MDP model. The battery charge level gives the state set $\mathcal S=\{\mathtt{high},\mathtt{low}\}$, and the robot's action set is $\mathcal A=\{\mathtt{wait},\mathtt{search},\mathtt{recharge}\}$. Recharging when the battery is already high would be foolish, so we define
$$\mathcal A(\mathtt{high}) := \{\mathtt{search},\mathtt{wait}\},\qquad \mathcal A(\mathtt{low}) := \{\mathtt{wait},\mathtt{search},\mathtt{recharge}\}.$$
If the battery is high, a period of active search leaves it high with probability $\alpha$ and reduces it to low with probability $1-\alpha$. If the battery is low, a period of active search leaves it low with probability $\beta$ and depletes it with probability $1-\beta$; when the battery is depleted the robot must be rescued and recharged, after which the battery is back to high. Each can the robot collects counts as a reward of $+1$, while having to be rescued yields a reward of $-3$. Let $r_{\mathtt{search}}$ and $r_{\mathtt{wait}}$, with $r_{\mathtt{search}} > r_{\mathtt{wait}}$, denote the expected numbers of cans collected while actively searching and while waiting, respectively. Finally, for simplicity, assume that no cans can be collected during a run home to recharge, nor on a step in which the battery is depleted.
This defines a finite MDP, whose transition probabilities and expected rewards are summarized in the table below:
[Table: transition probabilities $p(s'\mid s,a)$ and expected rewards $r(s,a,s')$ for the recycling robot]
A finite MDP's dynamics are often summarized with a transition graph:
[Figure: transition graph of the recycling robot MDP]
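The same table can be written down directly as a dynamics dictionary and checked against (3.3). The sketch below uses the parameters $\alpha$, $\beta$, $r_{\mathtt{search}}$, $r_{\mathtt{wait}}$ from the example; the numeric values assigned to them are arbitrary placeholders of my own choosing.

```python
# Recycling-robot dynamics as {(s, a): [(s_next, reward, prob), ...]}.
alpha, beta = 0.8, 0.6            # placeholder values for the example's parameters
r_search, r_wait = 2.0, 0.5       # expected cans collected while searching / waiting

p = {
    ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("low",  "search"):   [("low",  r_search, beta),  ("high", -3.0,    1 - beta)],
    ("high", "wait"):     [("high", r_wait, 1.0)],
    ("low",  "wait"):     [("low",  r_wait, 1.0)],
    ("low",  "recharge"): [("high", 0.0,    1.0)],
}

# Sanity check of (3.3): for every (s, a) the probabilities sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for (_, _, prob) in outcomes) - 1.0) < 1e-12, (s, a)
```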

Goals and Rewards

The reward hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

Returns and Episodes

How should the cumulative reward be defined?
Let $R_{t+1}, R_{t+2}, R_{t+3},\dots$ denote the sequence of rewards received after time step $t$. We define a function $G_t$ of this sequence, called the return, and the agent's goal is to maximize the expected return.
The simplest case is
$$G_t := R_{t+1} + R_{t+2} + \cdots + R_T\qquad(3.7)$$
where $T$ is a final time step. Definition (3.7) makes sense when a final time step exists naturally, that is, when the agent-environment interaction breaks naturally into subsequences, which we call episodes. Each episode ends in a special state called the terminal state, followed by a reset to a starting state or to a sample from a standard distribution of starting states. No matter how an episode ends, say by winning or losing a game, the next episode begins independently of how the previous one ended. All episodes can therefore be considered to end in the same terminal state, with different rewards for the different outcomes. Tasks with episodes of this kind are called episodic tasks. In episodic tasks we write $\mathcal S$ for the set of all nonterminal states and $\mathcal S^+$ for the set of all states including the terminal state. The time of termination, $T$, is a random variable that normally varies from episode to episode.
For tasks that cannot be broken into episodes, called continuing tasks, formula (3.7) is problematic, and we introduce the idea of discounting. Define the discounted return
$$G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\qquad(3.8)$$
where the parameter $\gamma$, $0\le\gamma\le1$, is called the discount rate. As long as $\{R_k\}$ is bounded and $\gamma<1$, the sum in (3.8) is finite. In particular, if $\gamma=0$ the agent cares only about the immediate reward and is said to be myopic; as $\gamma$ approaches 1, future rewards carry more weight and the agent becomes farsighted.
Returns at successive time steps obviously satisfy the recursion
$$G_t = R_{t+1} + \gamma G_{t+1}\qquad(3.9)$$
for all $t < T$.
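The recursion (3.9) gives a simple way to compute every return of a finite reward sequence in one backward sweep. A minimal sketch (the function name `returns_from_rewards` is my own):

```python
def returns_from_rewards(rewards, gamma):
    """Given rewards R_1, ..., R_T and a discount rate gamma, return [G_0, ..., G_T]
    via the recursion G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    G = [0.0] * (len(rewards) + 1)            # G[T] = 0
    for t in range(len(rewards) - 1, -1, -1):
        G[t] = rewards[t] + gamma * G[t + 1]  # rewards[t] is R_{t+1}
    return G
```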
【Exercise 3.6】The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).
Solution: ?
【Exercise 3.7】Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for $-1$ upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?
Solution: ?
【Exercise 3.8】Imagine that you are designing a robot to run a maze. You decide to give it a reward of $+1$ for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes (the successive runs through the maze), so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?
Solution: ?
【Exercise 3.9】Suppose $\gamma=0.5$ and the following sequence of rewards is received: $R_1=-1$, $R_2=2$, $R_3=6$, $R_4=3$, and $R_5=2$, with $T=5$. What are $G_0, G_1, \dots, G_5$?
Solution:
$$G_5 = 0;$$
$$G_4 = R_5 = 2;$$
$$G_3 = R_4 + \gamma G_4 = 3 + 0.5\times2 = 4;$$
$$G_2 = R_3 + \gamma G_3 = 6 + 0.5\times4 = 8;$$
$$G_1 = R_2 + \gamma G_2 = 2 + 0.5\times8 = 6;$$
$$G_0 = R_1 + \gamma G_1 = -1 + 0.5\times6 = 2.\qquad\square$$
【Exercise 3.10】Suppose $\gamma=0.9$ and the reward sequence is $R_1=2$ followed by an infinite sequence of 7s. What are $G_1$ and $G_0$?
Solution:
$$G_1 = R_2 + \gamma R_3 + \gamma^2 R_4 + \cdots = 7 + 7\gamma + 7\gamma^2 + \cdots = \frac{7}{1-\gamma} = 70$$
$$G_0 = R_1 + \gamma G_1 = 2 + 0.9\times70 = 65.\qquad\square$$
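Both answers are easy to check numerically: Exercise 3.9 with the backward recursion (3.9), and Exercise 3.10 with the closed-form geometric sum. A self-contained check using only the numbers stated in the two exercises:

```python
# Exercise 3.9: gamma = 0.5, rewards R_1..R_5 = -1, 2, 6, 3, 2, T = 5.
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]
G = [0.0] * (len(rewards) + 1)                # G[5] = 0
for t in range(len(rewards) - 1, -1, -1):
    G[t] = rewards[t] + gamma * G[t + 1]
print(G)                                      # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]

# Exercise 3.10: gamma = 0.9, R_1 = 2 followed by an infinite sequence of 7s.
gamma = 0.9
G1 = 7 / (1 - gamma)                          # geometric series 7 * (1 + gamma + ...)
G0 = 2 + gamma * G1
print(G1, G0)                                 # ~70.0 and ~65.0 (up to floating-point rounding)
```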

Unified Notation for Episodic and Continuing Tasks

$S_{t,i}$: the state representation at time $t$ of episode $i$ (and similarly $A_{t,i}$, $R_{t,i}$, $\pi_{t,i}$, $T_i$).
When discussing episodic tasks we almost always consider one particular episode, so for simplicity we write $S_t$ instead of $S_{t,i}$, and likewise for the other quantities.
The return can then be written in the unified form
$$G_t = \sum_{k=t+1}^{T}\gamma^{k-t-1}R_k$$
which covers both the possibility that $T=\infty$ and the possibility that $\gamma=1$ (but not both at once).

Policies and Value Functions

Policy: the probability of selecting each possible action in a given state, written $\pi(a\mid s)$.
Value: the expected return obtainable from a state when following a given policy.
The state-value function for policy $\pi$:
$$v_\pi(s) := \mathbb E_\pi\big[G_t \mid S_t=s\big] = \mathbb E_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\,\Big|\,S_t=s\right]$$
The action-value function for policy $\pi$:
$$q_\pi(s,a) := \mathbb E_\pi\big[G_t \mid S_t=s, A_t=a\big] = \mathbb E_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\,\Big|\,S_t=s, A_t=a\right]$$
Bellman equation for $v_\pi$:
$$\begin{aligned}
v_\pi(s) &= \mathbb E_\pi\big[G_t \mid S_t=s\big]\\
&= \mathbb E_\pi\big[R_{t+1}+\gamma G_{t+1} \mid S_t=s\big]\\
&= \sum_a \pi(a\mid s)\sum_{s'}\sum_r p(s',r\mid s,a)\Big[r + \gamma\,\mathbb E_\pi\big[G_{t+1}\mid S_{t+1}=s'\big]\Big]\\
&= \sum_a \pi(a\mid s)\sum_{s',r} p(s',r\mid s,a)\big[r + \gamma v_\pi(s')\big]\qquad(3.14)
\end{aligned}$$
[Figure: backup diagram for $v_\pi$]
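Equation (3.14) is one linear equation per state, which suggests two direct ways to compute $v_\pi$ for a small finite MDP: solve the linear system, or repeatedly apply the right-hand side as an update until the values stop changing (iterative policy evaluation, treated properly in Chapter 4). A minimal sketch of the iterative version, reusing the dynamics-dictionary encoding from earlier (the names are mine, not the book's):

```python
def evaluate_policy(states, actions, p, pi, gamma, tol=1e-10):
    """Apply the Bellman equation (3.14) as an update until convergence.

    p  : {(s, a): [(s_next, reward, prob), ...]}   -- dynamics p(s', r | s, a)
    pi : {(s, a): probability}                     -- policy pi(a | s)
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi.get((s, a), 0.0)
                * sum(prob * (r + gamma * v[s2]) for (s2, r, prob) in p.get((s, a), []))
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```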
Example 3.6: Gridworld
[Figure 3.3: gridworld dynamics (left) and the state-value function for the equiprobable random policy with $\gamma=0.9$ (right)]
Figure 3.3 uses a rectangular grid to define an MDP. Each cell corresponds to a state, and in every state four actions are available: north, south, east, and west. Each action deterministically moves the agent one cell in the chosen direction, except that an action that would move the agent off the grid leaves its location unchanged and yields a reward of $-1$. From state $A$, every action yields $+10$ and takes the agent to $A'$; from state $B$, every action yields $+5$ and takes the agent to $B'$; all other actions yield a reward of $0$.
Suppose the agent selects the four actions with equal probability in every state, and let the discount rate be $\gamma=0.9$. The right panel of Figure 3.3 then shows the corresponding value function, obtained by solving the system of Bellman equations (3.14). Notice the negative values in the lower rows: under the random policy, these states have a high probability of bumping the agent into the edge of the grid. $A$ is the best state, yet its expected return, 8.8, is less than its immediate reward of 10, because from $A$ the agent is carried to $A'$, whose value is negative and partly offsets the immediate reward. $B$, on the other hand, has an immediate reward of only 5, yet its value, 5.3, is greater than 5, because from $B$ the agent is carried to $B'$, whose value is positive. $\qquad\square$
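The value function in the right panel can be reproduced by solving (3.14) directly as a linear system, one equation per state. The sketch below assumes the usual 5×5 layout with $A$ at (0,1), $A'$ at (4,1), $B$ at (0,3), $B'$ at (2,3); it is an illustration of the computation, not code from the book.

```python
import numpy as np

N, gamma = 5, 0.9
A, A_prime, B, B_prime = (0, 1), (4, 1), (0, 3), (2, 3)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # north, south, west, east

def step(s, move):
    """Deterministic gridworld dynamics: (next state, reward)."""
    if s == A:
        return A_prime, 10.0
    if s == B:
        return B_prime, 5.0
    r, c = s[0] + move[0], s[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0                                    # bumped into the edge

states = [(r, c) for r in range(N) for c in range(N)]
idx = {s: i for i, s in enumerate(states)}

# Under the equiprobable policy, v = R + gamma * P v, i.e. (I - gamma * P) v = R.
P = np.zeros((len(states), len(states)))
R = np.zeros(len(states))
for s in states:
    for m in moves:
        s2, r = step(s, m)
        P[idx[s], idx[s2]] += 0.25
        R[idx[s]] += 0.25 * r

v = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
print(np.round(v.reshape(N, N), 1))                   # v[A] ~ 8.8, v[B] ~ 5.3, center ~ 0.7
```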

【Exercise 3.12】The Bellman equation (3.14) must hold for each state for the value function $v_\pi$ shown in Figure 3.3 (right) of Example 3.6. Show numerically that this equation holds for the center state, valued at $+0.7$, with respect to its four neighboring states, valued at $+2.3$, $+0.4$, $-0.4$, and $+0.7$. (These numbers are accurate only to one decimal place.)
Solution: Let $s$ be the center state. Under the equiprobable policy, $\pi(a\mid s)=\frac14$ for each of the four actions. In this gridworld the transitions are deterministic: each action moves the agent to the corresponding neighboring cell with probability 1, and the immediate reward for every move from the center is 0. Equation (3.14) therefore reduces to
$$v_\pi(s) = \sum_a \tfrac14\big[0 + \gamma\,v_\pi(s'_a)\big] = \tfrac{0.9}{4}\,(2.3 + 0.4 - 0.4 + 0.7) = \tfrac{0.9}{4}\times 3.0 = 0.675 \approx 0.7,$$
which agrees with the tabulated value to one decimal place. $\qquad\square$
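The same arithmetic in a couple of lines of Python, with the neighbor values read off the figure:

```python
v_center = 0.25 * 0.9 * (2.3 + 0.4 - 0.4 + 0.7)
print(round(v_center, 3))   # 0.675, i.e. 0.7 to one decimal place
```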

【Exercise 3.13】What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s,a)$ in terms of the action values, $q_\pi(s',a')$, of possible successors to the state-action pair $(s,a)$. Hint: the backup diagram below corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
[Figure: backup diagram for $q_\pi$]
Solution:
$$\begin{aligned}
q_\pi(s,a) &:= \mathbb E_\pi\big[G_t \mid S_t=s, A_t=a\big]\\
&= \mathbb E_\pi\big[R_{t+1}+\gamma G_{t+1} \mid S_t=s, A_t=a\big]\\
&= \sum_{s'}\sum_r p(s',r\mid s,a)\Big[r + \gamma\sum_{a'}\pi(a'\mid s')\,\mathbb E_\pi\big[G_{t+1}\mid S_{t+1}=s', A_{t+1}=a'\big]\Big]\\
&= \sum_{s',r} p(s',r\mid s,a)\Big[r + \gamma\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')\Big]\qquad\square
\end{aligned}$$

Example 3.7: Golf. Playing a hole of golf is treated as a reinforcement learning task in which we count a penalty (negative reward) of $-1$ for each stroke until we hit the ball into the hole. The state is the location of the ball, and the value of a state is the negative of the number of strokes needed to reach the hole from that location. Our actions are how we aim and swing at the ball, and which club we select; taking the former as given, we consider only the choice of club, assumed to be either a putter or a driver. [Figure 3.4, upper part: the state-value function $v_{\mathtt{putt}}(s)$ for the policy that always uses the putter.] The terminal state in the hole has value 0. From anywhere on the green we assume we can make a putt, so those states have value $-1$. Off the green we cannot reach the hole with a single putt, and the values are progressively more negative. The sand traps (shown in yellow in the figure) cannot be escaped with a putter, so their value is $-\infty$.

【Exercise 3.14】In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $v_c$ in terms of $c$ and $\gamma$?
Solution: Recall (3.8):
$$G_t = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}.$$
If a constant $c$ is added to every reward, the new return is
$$G_{t,c} = \sum_{k=0}^{\infty}\gamma^k\big(R_{t+k+1}+c\big) = G_t + c\sum_{k=0}^{\infty}\gamma^k = G_t + \frac{c}{1-\gamma}.$$
Defining $v_c := \frac{c}{1-\gamma}$, every state value is shifted by this same constant,
$$v_{\pi,c}(s) = \mathbb E_\pi\big[G_{t,c}\mid S_t=s\big] = \mathbb E_\pi\big[G_t + v_c\mid S_t=s\big] = v_\pi(s) + v_c,$$
so the relative values of the states are unchanged under any policy. In the continuing case, then, only the intervals between rewards matter, not their signs. $\qquad\square$
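A quick numeric illustration of the shift on a made-up two-state continuing chain (not an example from the book):

```python
import numpy as np

# Two states that alternate deterministically; reward +1 when leaving s0, 0 when leaving s1.
gamma, c = 0.9, 5.0
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
R = np.array([1.0, 0.0])

v       = np.linalg.solve(np.eye(2) - gamma * P, R)        # original values
v_shift = np.linalg.solve(np.eye(2) - gamma * P, R + c)    # all rewards shifted by c
print(v_shift - v)    # both entries equal c / (1 - gamma) = 50.0
```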

【Exercise 3.15】Now consider adding a constant $c$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

【Exercise 3.16】The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
[Figure: backup diagram rooted at a state, branching over actions]
Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s,a)$, given $S_t=s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a\mid s)$ such that no expected value notation appears in the equation.
Solution: In terms of an expectation conditioned on following $\pi$,
$$v_\pi(s) = \mathbb E_\pi\big[q_\pi(s, A_t)\mid S_t=s\big],$$
and writing the expectation out explicitly over the action probabilities,
$$v_\pi(s) = \sum_a \pi(a\mid s)\, q_\pi(s,a).\qquad\square$$

【Exercise 3.17】The value of an action, $q_\pi(s,a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state-action pair) and branching to the possible next states:
[Figure: backup diagram rooted at a state-action pair, branching over next states]
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s,a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t=s$ and $A_t=a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s',r\mid s,a)$ defined by (3.2), such that no expected value notation appears in the equation.
