One Piece of RL Theory a Day (1): Bellman Optimality


Source material: https://wensun.github.io/CS6789_fall_2021.html

This series tries to use the terminology and notation below consistently.

1. Infinite horizon discounted MDPs

Notation (symbol — meaning):

  • $S$ — the state space
  • $A$ — the action space
  • $P: S\times A \rightarrow \triangle(S)$ — the transition kernel of the environment, where $\triangle(S)$ denotes the set of probability distributions over $S$
  • $r: S\times A\rightarrow [0,1]$ — the reward function, mapping a state-action pair to a reward in $[0,1]$
  • $\gamma\in [0,1)$ — the discount factor
  • $\mathcal{M}=(S,A,P,r,\gamma)$ — the MDP
  • $\pi: S\rightarrow \triangle(A)$ — a policy, mapping each state to a distribution over actions
  • $V^\star(s)$ — the value of state $s$ under the optimal policy $\pi^\star$
  • $Q^\star(s,a)$ — the value of state $s$ and action $a$ under the optimal policy $\pi^\star$
  • Infinite horizon: the sequence is infinitely long, written $H\rightarrow \infty$
  • Trajectory: $\tau=\{s_0,a_0,\dots,s_h,a_h\}$, where $s_h$ is the state variable at time $h$, and $a_h=a$ means the action variable at time $h$ takes the value $a$

A minimal array representation of this tuple is sketched right after this list.
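As a concrete illustration (not from the source), the tuple $\mathcal{M}=(S,A,P,r,\gamma)$ of a small tabular MDP can be stored as NumPy arrays whose shapes mirror the definitions above; the sizes and random numbers below are made-up stand-ins for a real environment.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 4, 2        # |S| and |A| of a toy MDP (made up for illustration)
gamma = 0.9                       # discount factor, gamma in [0, 1)

# P[s, a] is a probability distribution over next states: P : S x A -> Delta(S)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

# r : S x A -> [0, 1]
r = rng.random((n_states, n_actions))

# A stationary stochastic policy pi : S -> Delta(A)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=-1, keepdims=True)
```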

Under this MDP setting we then have:

  1. Value function: the value of starting with the state variable equal to $s$ at time $0$:
    $$V^\pi(s)=\mathbb E_{s_{0}=s,\,a_h\sim \pi(\cdot\mid s_h),\,s_{h+1}\sim P(\cdot\mid s_h,a_h)}\left[\sum_{h=0}^\infty \gamma^h r(s_h,a_h)\right]$$
  2. Q function: the value of starting with state $s$ and action $a$ at time $0$:
    $$Q^\pi(s,a)=\mathbb E_{s_{0}=s,\,a_{0}=a,\,s_{h+1}\sim P(\cdot\mid s_h,a_h),\,a_{h+1}\sim \pi(\cdot\mid s_{h+1})}\left[\sum_{h=0}^\infty \gamma^h r(s_h,a_h)\right]$$
  3. Bellman consistency equations: the recursive relations between the value function and the Q function (a small numerical sketch follows this list):

$$\begin{aligned} V^\pi(s)&=\mathbb E_{a\sim \pi(\cdot\mid s)}\left[Q^\pi(s,a)\right] \quad \text{(V-Q)}\\ Q^\pi(s,a)&= r(s,a) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[V^\pi(s')\right]\quad \text{(Q-V)}\\ V^\pi(s)&=\mathbb E_{a\sim \pi(\cdot\mid s)}\left[ r(s,a) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[V^\pi(s')\right]\right]\quad \text{(V-V)}\\ Q^\pi(s,a)&=r(s,a) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[\mathbb E_{a'\sim \pi(\cdot\mid s')}\left[Q^\pi(s',a')\right]\right]\quad\text{(Q-Q)} \end{aligned}$$
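To make the Bellman consistency equations concrete, here is a minimal policy-evaluation sketch (my own illustration, not from the lecture notes): it iterates the (V-V) equation on a toy random MDP until it reaches a fixed point, then checks (Q-V) and (V-Q) there. The toy arrays and the name `policy_evaluation` are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states)); P /= P.sum(-1, keepdims=True)
r = rng.random((n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform policy as a stand-in

def policy_evaluation(P, r, pi, gamma, tol=1e-10):
    """Iterate the (V-V) Bellman consistency equation to its fixed point."""
    V = np.zeros(len(P))
    while True:
        # V(s) = E_{a~pi(.|s)}[ r(s,a) + gamma * E_{s'~P(.|s,a)}[ V(s') ] ]
        V_new = (pi * (r + gamma * P @ V)).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_pi = policy_evaluation(P, r, pi, gamma)
Q_pi = r + gamma * P @ V_pi                          # the (Q-V) equation
print(np.allclose(V_pi, (pi * Q_pi).sum(axis=1)))    # (V-Q) holds at the fixed point
```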

2. Bellman Optimality

(Proofs are given in Part 4.)

  • Property One: in the infinite horizon discounted MDP setting there exists a deterministic and stationary optimal policy $\pi^\star$ such that $V^{\pi^\star}(s) \geq V^\pi(s)$ for all $s$ and all $\pi$. The value function of this optimal policy, $V^{\pi^\star}$, is abbreviated $V^\star$ and satisfies
    $$V^\star(s)=\max_a \left[r(s,a) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right]$$
    $$Q^\star(s,a)=r(s,a)+\gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[\max_{a'}Q^\star (s',a')\right]$$
  • Property Two (V-version): for any value function $V$, if it satisfies $V(s)=\max_a\left[r(s,a)+\gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}V(s')\right]$ for all $s$, then
    $V(s)=V^\star(s)$. The Q-version is analogous:
    for any Q function $Q$, if it satisfies $Q(s,a)=r(s,a)+\gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[\max_{a'}Q(s',a')\right]$ for all $s,a$, then
    $Q(s,a)=Q^\star(s,a)$.
  1. On "stationary": in a time series, predicting the future from the past and the present presupposes some continuity between them. More precisely, certain basic properties of the state variable $s_h$ at past and current times should persist at future times $s_{h+k}$. When those properties are statistics of the variables we speak of weak stationarity; when it is their joint distribution, strong stationarity. (For more, see the article 如何理解时间序列的平稳性.)
  2. In notation, a non-stationary policy is written $\pi(a\mid s,t)$, i.e. it depends on the time $t$; $\pi(a\mid s)$ denotes a stationary policy, and $\pi(s)$ denotes a deterministic policy.
  3. The existence of a deterministic stationary optimal policy in the infinite horizon discounted MDP setting is proved in Section 4.1. A value-iteration sketch that computes $V^\star$ and $Q^\star$ and reads off the greedy policy follows this list.
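The sketch below (again my own illustration on a toy random MDP, with made-up names) runs value iteration, i.e. it repeatedly applies the backup from Property One, $V(s)\leftarrow\max_a\left[r(s,a)+\gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}V(s')\right]$. At the fixed point it recovers $V^\star$ and $Q^\star$, verifies the Q-version of Property One, and reads off the greedy deterministic stationary policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states)); P /= P.sum(-1, keepdims=True)
r = rng.random((n_states, n_actions))

def value_iteration(P, r, gamma, tol=1e-10):
    """Repeatedly apply V(s) <- max_a [ r(s,a) + gamma * E_{s'~P(.|s,a)}[ V(s') ] ]."""
    V = np.zeros(len(P))
    while True:
        Q = r + gamma * P @ V          # Q(s,a) induced by the current V
        V_new = Q.max(axis=1)          # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

V_star, Q_star = value_iteration(P, r, gamma)
pi_star = Q_star.argmax(axis=1)        # deterministic, stationary greedy policy

# Property One, Q-version: Q*(s,a) = r(s,a) + gamma * E_{s'}[ max_{a'} Q*(s',a') ]
print(np.allclose(Q_star, r + gamma * P @ Q_star.max(axis=1), atol=1e-8))
print(pi_star)
```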

3. Trajectory distribution & State-action distribution

  • (Trajectory distribution) Starting from initial state $s_0$ and following policy $\pi$, the probability of generating a length-$h$ trajectory $\tau_h=\{s_0,a_0,\dots, s_h,a_h\}$ is
    $$\mathbb P^\pi_{s_0}(\tau_h)= \pi(a_0\mid s_0)\prod_{t=0}^{h-1} P(s_{t+1}\mid s_t,a_t)\,\pi(a_{t+1}\mid s_{t+1})$$
  • Starting from $s_0$ and following $\pi$, the probability of visiting the pair $(s,a)$ at time $h$ is
    $$\mathbb P_{s_0}^\pi(s_h=s,a_h=a)=\sum_{a_0,s_1,a_1,\dots,s_{h-1},a_{h-1}}\mathbb P^\pi_{s_0}(\tau_{h-1})\,P(s_{h}=s\mid s_{h-1},a_{h-1})\,\pi(a_h=a\mid s)$$
    (summing over all possible realizations of the trajectory before time $h$)
  • (State-action distribution) Starting from $s_0$ and following $\pi$, the discounted probability of visiting $(s,a)$ is
    $$d^\pi_{s_0}(s,a)=(1-\gamma)\sum_{h=0}^\infty \gamma^h\, \mathbb P^\pi_{s_0}(s_h=s,a_h=a)$$
    (a discounted sum of the visitation probabilities over all times)

    Here $1-\gamma$ is a normalization constant that makes $d^\pi_{s_0}(s,a)$ a probability distribution, i.e. it sums to 1.

  • Writing the V function in terms of the state-action distribution (a numerical check follows below):
    $$V^\pi(s_0)=\frac{1}{1-\gamma}\sum_{s,a}d^\pi_{s_0}(s,a)\,r(s,a)$$
    (summing over all possible pairs $(s,a)$)
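Here is a minimal sketch (my own, on a toy random MDP; the name `state_action_distribution` and the truncation `horizon` are assumptions of the sketch) that computes $d^\pi_{s_0}$ by pushing the state distribution forward through $P$ while accumulating the $\gamma$-weighted visitation probabilities, then checks the identity $V^\pi(s_0)=\frac{1}{1-\gamma}\sum_{s,a}d^\pi_{s_0}(s,a)\,r(s,a)$ against plain policy evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states)); P /= P.sum(-1, keepdims=True)
r = rng.random((n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)

def state_action_distribution(P, pi, gamma, s0, horizon=2000):
    """d_{s0}^pi(s,a) = (1-gamma) * sum_h gamma^h * P(s_h=s, a_h=a), truncated at `horizon`."""
    rho = np.zeros(len(P)); rho[s0] = 1.0        # distribution of s_h, starting from s0
    d = np.zeros_like(pi)
    for h in range(horizon):
        mu = rho[:, None] * pi                   # joint probability P(s_h=s, a_h=a)
        d += (gamma ** h) * mu
        rho = np.einsum('sa,sap->p', mu, P)      # push forward through the dynamics
    return (1.0 - gamma) * d

s0 = 0
d = state_action_distribution(P, pi, gamma, s0)
print(d.sum())                                   # ~1: it is a proper distribution

# V^pi(s0) = 1/(1-gamma) * sum_{s,a} d(s,a) r(s,a); compare with policy evaluation
V_from_d = (d * r).sum() / (1.0 - gamma)
V = np.zeros(n_states)
for _ in range(5000):                            # iterate the (V-V) equation
    V = (pi * (r + gamma * P @ V)).sum(axis=1)
print(V_from_d, V[s0])                           # the two values agree
```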

4. Q&A and Supplements

Q1: Why is the discount factor $\gamma$ restricted to $[0,1)$? Why not allow $1$?

See Q2 first, then think about what would go wrong if $\gamma\geq 1$.

Q2: Why map the reward into $[0,1]$? What does this buy us?

In practice a bounded reward signal can always be rescaled into this interval. The benefit is that the value function and the Q function become bounded, which simplifies the analysis:

$$\begin{aligned} 0\leq V^\pi(s)&=\mathbb E_{s_{0}=s,\,a_h\sim \pi(\cdot\mid s_h),\,s_{h+1}\sim P(\cdot\mid s_h,a_h)}\left[\sum_{h=0}^\infty \gamma^h r(s_h,a_h)\right]\\ & \leq \sum_{h=0}^\infty \gamma^h = \frac{1}{1-\gamma} \end{aligned}$$

Similarly,
$$0\leq Q^\pi(s,a)\leq \frac{1}{1-\gamma}.$$
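For instance, with $\gamma=0.99$ the bound is $\sum_{h=0}^\infty 0.99^h=\frac{1}{1-0.99}=100$, so every $V^\pi(s)$ and $Q^\pi(s,a)$ lies in $[0,100]$.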

4.1 Existence of a deterministic stationary optimal policy in the infinite horizon discounted MDP

  • Proof sketch of existence. For policies $\pi$ in the policy class $\Pi$ ($s_h, a_h$ are random variables; the other symbols are their realizations):
    1. First show that, conditioned on the realizations $s,a$ at time $0$ and the state realization $s'$ at time $1$, the maximal future discounted value satisfies
      $$\max_{\pi\in\Pi} \mathbb E_{s_0=s,a_0=a,s_1=s',\,a_h\sim \pi(\cdot\mid s_h),\,s_{h+1}\sim P(\cdot\mid s_h,a_h)}\left[\sum_{h=1}^\infty \gamma^hr(s_h,a_h)\right]=\gamma V^\star(s')$$
      (This says that, over the policy class $\Pi$, once the initial realizations $s,a,s'$ are fixed, the best achievable future return depends only on the next state $s'$; intuitively this follows from the Markov property.)
    2. Then construct a deterministic stationary policy $\tilde \pi(s)$ as
      $$\tilde \pi(s)=\operatorname*{argmax}_{a\in A}\ \left[r(s,a)+\gamma\,\mathbb E_{s_1\sim P(\cdot\mid s,a)}\left[V^\star(s_1)\right]\right]$$
    3. Finally show that the constructed $\tilde\pi$ is an optimal policy, i.e.
      $$V^{\tilde \pi}(s)=V^\star(s)$$
  • Proof of 1:
    Let $h'=h-1$; then
    $$\begin{aligned} &\max_{\pi\in\Pi}\mathbb E_{s_0=s,a_0=a,s_1=s',\,a_h\sim \pi(\cdot\mid s_h),\,s_{h+1}\sim P(\cdot\mid s_h,a_h)}\left[\sum_{h=1}^\infty \gamma^hr(s_h,a_h)\right]\\ &=\max_{\pi\in\Pi}\mathbb E_{s_0=s',\,a_{h'}\sim \pi(\cdot\mid s_{h'}),\,s_{h'+1}\sim P(\cdot\mid s_{h'},a_{h'})}\left[\sum_{h'=0}^\infty\gamma^{h'+1}r(s_{h'},a_{h'})\right]\\ &=\max_{\pi\in\Pi}\gamma\, \mathbb E_{s_0=s',\,a_{h'}\sim \pi(\cdot\mid s_{h'}),\,s_{h'+1}\sim P(\cdot\mid s_{h'},a_{h'})}\left[\sum_{h'=0}^\infty\gamma^{h'}r(s_{h'},a_{h'})\right]\\ &=\max_{\pi\in\Pi}\gamma\, V^\pi(s')\\ &=\gamma\, V^\star(s') \end{aligned}$$
    (The change of variables re-indexes time so that $s'$ becomes the initial state; by the Markov property the conditioning on $s_0=s, a_0=a$ can then be dropped.)
  • Proof of 3 (optimality of $\tilde\pi$):
    By definition we always have $V^{\tilde \pi}(s)\leq V^\star(s)$ for all $s\in S$. For the state variable at time $0$, again by definition,
    $$\begin{aligned} V^\star(s_0)&= \max_{\pi\in \Pi}\mathbb E_{a_h\sim \pi(\cdot\mid s_h),\, s_{h+1}\sim P(\cdot\mid s_h,a_h)}\left[r(s_0,a_0) + \sum_{h=1}^\infty \gamma^h r(s_h,a_h)\right]\\ &=\max_{\pi\in\Pi} \mathbb E_{a_0\sim \pi(\cdot\mid s_0)}\left[r(s_0,a_0) + \mathbb E_{s_{1}\sim P(\cdot\mid s_0,a_0),\,a_{h}\sim\pi(\cdot\mid s_{h})}\left[\sum_{h=1}^\infty \gamma^h r(s_h,a_h)\right]\right]\\ &\leq \max_{\pi\in\Pi} \mathbb E_{a_0\sim \pi(\cdot\mid s_0)}\left[r(s_0,a_0) + \max_{\pi'\in\Pi}\mathbb E_{s_{1}\sim P(\cdot\mid s_0,a_0),\,a_{h}\sim\pi'(\cdot\mid s_{h})}\left[\sum_{h=1}^\infty \gamma^h r(s_h,a_h)\right]\right]\\ &=\max_{\pi\in\Pi} \mathbb E_{a_0\sim \pi(\cdot\mid s_0),\,s_1\sim P(\cdot\mid s_0,a_0)}\left[r(s_0,a_0) +\gamma V^\star(s_1)\right]\quad\text{(by step 1)}\\ &=\max_{a_0\in A}\ \mathbb E_{s_1\sim P(\cdot\mid s_0,a_0)}\left[r(s_0,a_0) +\gamma V^\star(s_1)\right]\quad\text{(a linear functional over $\triangle(A)$ is maximized at a point mass)}\\ &=r(s_0,\tilde\pi(s_0)) +\gamma\,\mathbb E_{s_1\sim P(\cdot\mid s_0,\tilde\pi(s_0))}\left[ V^\star(s_1)\right]\quad\text{(this is exactly the action the constructed policy picks)} \end{aligned}$$
    Unrolling this inequality repeatedly (the same nesting argument as in Section 4.2 below, applied with $\tilde\pi$) yields
    $$V^\star(s_0)\leq V^{\tilde\pi}(s_0),\quad \forall s_0\in S.$$

Hence the constructed policy is an optimal policy.

  • Deterministic: the construction picks its action by an argmax, i.e. a single action per state.
  • Stationary: this shows up in the change of variables $h'=h-1$ in the proof of step 1: after re-indexing, the problem seen from any time step is identical to the problem at time $0$, i.e. the relevant structure carries over from the present to the future, so a time-independent state-to-action rule suffices. A minimal numerical sketch of the construction follows.
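The sketch below is my own illustration of this construction on a toy random MDP (all names are made up): compute $V^\star$ by value iteration, build $\tilde\pi(s)=\operatorname{argmax}_a\left[r(s,a)+\gamma\,\mathbb E_{s_1\sim P(\cdot\mid s,a)}V^\star(s_1)\right]$, evaluate $\tilde\pi$, and check that $V^{\tilde\pi}=V^\star$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states)); P /= P.sum(-1, keepdims=True)
r = rng.random((n_states, n_actions))

# V* via value iteration (as in the earlier sketch)
V_star = np.zeros(n_states)
for _ in range(5000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

# Step 2: the constructed policy -- deterministic (argmax) and stationary (no time index)
pi_tilde = (r + gamma * P @ V_star).argmax(axis=1)

# Step 3: evaluate pi_tilde and check V^{pi_tilde} = V*
idx = np.arange(n_states)
V_tilde = np.zeros(n_states)
for _ in range(5000):
    V_tilde = r[idx, pi_tilde] + gamma * P[idx, pi_tilde] @ V_tilde
print(np.allclose(V_tilde, V_star))              # True up to numerical tolerance
```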

4.2 Proof of Property One

  • To prove $V^\star(s)=\max_a \left[r(s,a) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right]$, the idea is to sandwich the two sides with inequalities, expanding via the Bellman equations.
  • Observe that $\max_a \left[r(s,a) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right]=\max_a Q^\star(s,a)$, so the claim is $V^\star(s)=\max_a Q^\star(s,a)$, which is equivalent to saying that the policy $\hat \pi(s)=\operatorname*{argmax}_a Q^\star(s,a)$ is an optimal policy $\pi^\star$. It therefore suffices to show $V^{\hat\pi}(s)=V^\star(s)$, i.e. $\hat \pi=\pi^\star$.
  • Since $V^{\hat\pi}(s) \leq V^\star(s)$ holds by definition, it suffices to show $V^{\hat\pi}(s)\geq V^\star(s)$; together these give $V^{\hat\pi}(s)=V^\star(s)$, and hence $V^\star(s)=\max_a \left[r(s,a) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right]$.

We now show $V^\star(s)\leq V^{\hat\pi}(s)$:
$$\begin{aligned} V^\star(s)&=r(s,\pi^\star(s)) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,\pi^\star(s))}\left[V^\star(s')\right]\\ &\leq \max_a \left[r(s,a)+\gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right]\\ &=\max_a Q^\star(s,a)\\ &=r(s,\hat\pi(s)) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,\hat\pi(s))}[V^\star(s')] \qquad\text{(1)} \end{aligned}$$
Doesn't it feel like we are already done? Isn't (1) exactly $V^{\hat\pi}(s)$?
Unfortunately not: by the definition of the value function, $V^{\hat\pi}(s)=r(s,\hat\pi(s)) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,\hat\pi(s))}[V^{\hat\pi}(s')]$, with $V^{\hat\pi}$, not $V^\star$, inside the expectation.

But we can keep nesting the inequality $V^\star(s)\leq r(s,\hat\pi(s)) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,\hat\pi(s))}[V^\star(s')]$ into itself:
$$\begin{aligned} V^\star(s)&\leq r(s,\hat\pi(s)) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,\hat\pi(s))}[V^\star(s')]\\ &\leq r(s,\hat\pi(s))+\gamma\, \mathbb E_{s'\sim P(\cdot\mid s,\hat\pi(s))}\left[r(s',\hat\pi(s'))+\gamma\, \mathbb E_{s''\sim P(\cdot\mid s',\hat\pi(s'))}[V^\star(s'')]\right]\\ &\leq r(s,\hat\pi(s))+\gamma\, \mathbb E_{s'\sim P(\cdot\mid s,\hat\pi(s))}\left[r(s',\hat\pi(s'))+\gamma\, \mathbb E_{s''\sim P(\cdot\mid s',\hat\pi(s'))}\left[r(s'',\hat\pi(s''))+\gamma\, \mathbb E_{s'''\sim P(\cdot\mid s'',\hat\pi(s''))}[V^\star(s''')]\right]\right] \\ &\;\;\vdots\\ &=V^{\hat\pi}(s) \end{aligned}$$
In the limit the leftover $\gamma^k V^\star$ term vanishes (since $V^\star$ is bounded and $\gamma<1$), and the accumulated discounted rewards are exactly the return of $\hat\pi$, i.e. $V^{\hat\pi}(s)$.

In essence, Property One says that this deterministic stationary optimal policy is $\pi^\star(s)=\operatorname*{argmax}_a Q^\star(s,a)$.

4.3 Proof of Property Two

  • To prove that any value function $V$ satisfying $V(s)=\max_a\left[r(s,a)+\gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}V(s')\right]$ for all $s$ must equal $V^\star(s)$, the idea is to bound $|V(s)-V^\star(s)|$ from above by a quantity that equals $0$, again expanding via the Bellman equations.

Proof (the first step uses Property One, i.e. $V^\star(s)=\max_a \left[r(s,a) + \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right]$):
$$\begin{aligned} |V(s)-V^\star(s)|&=\left |\max_a\left[r(s,a)+\gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}V(s')\right]-\max_a\left[r(s,a)+\gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}V^\star(s')\right] \right|\\ &\leq \max_a\left|\gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}[V(s')-V^\star(s')]\right|\\ &\leq \max_a \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}\left|V(s')-V^\star(s')\right|\\ &\leq \max_a \gamma\, \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[ \max_{a'} \gamma\, \mathbb E_{s''\sim P(\cdot\mid s',a')}\left|V(s'')-V^\star(s'')\right|\right]\\ &\leq \max_{a_1,\dots,a_k} \gamma^k\, \mathbb E_{s_k\sim P(\cdot\mid s_{k-1},a_{k-1})}\left[|V(s_k)-V^\star(s_k)|\right]\\ &\leq \max_{a_1,\dots,a_k} \gamma^k \cdot \frac{2}{1-\gamma}\qquad\text{(using $\|V\|_\infty\leq\tfrac{1}{1-\gamma}$ and $\|V^\star\|_\infty\leq\tfrac{1}{1-\gamma}$)} \end{aligned}$$
Since this bound holds for every $k$ and $\lim_{k\rightarrow\infty} \frac{2\gamma^k}{1-\gamma}=0$, we conclude $|V(s)-V^\star(s)|=0$ for all $s$, i.e. $V=V^\star$. The key fact used above, that one Bellman optimality backup shrinks the sup-norm gap by a factor of $\gamma$, is checked numerically below.
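A minimal numerical check of this contraction property on a toy random MDP (my own sketch, with made-up names) is given below: applying the Bellman optimality backup to two arbitrary bounded value functions never increases their sup-norm distance by more than a factor of $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states)); P /= P.sum(-1, keepdims=True)
r = rng.random((n_states, n_actions))

def bellman_opt(V):
    """T V (s) = max_a [ r(s,a) + gamma * E_{s'~P(.|s,a)}[ V(s') ] ]."""
    return (r + gamma * P @ V).max(axis=1)

# Two arbitrary bounded "value functions"
U = rng.random(n_states) / (1 - gamma)
W = rng.random(n_states) / (1 - gamma)

lhs = np.max(np.abs(bellman_opt(U) - bellman_opt(W)))   # ||T U - T W||_inf
rhs = gamma * np.max(np.abs(U - W))                     # gamma * ||U - W||_inf
print(lhs <= rhs + 1e-12)                               # the contraction used in 4.3
```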

5. Summary

  1. Understand stationarity and determinism of policies.
  2. Understand the infinite horizon discounted setting.
  3. In this MDP setting there exists a deterministic stationary optimal policy satisfying two properties, collectively called Bellman Optimality.
  4. Consequently, if in this setting some $V$ and $Q$ satisfy these properties, then they are exactly the $V$ and $Q$ of the deterministic stationary optimal policy.