CS6789-1
Source: https://wensun.github.io/CS6789_fall_2021.html
This series tries to use the terminology below consistently.
1. Infinite horizon discounted MDPs
| Symbol | Meaning |
|---|---|
| $S$ | state space |
| $A$ | action space |
| $P:S\times A \rightarrow \triangle(S)$ | transition map $P$; $\triangle(S)$ denotes the set of probability distributions over states |
| $r:S\times A\rightarrow [0,1]$ | reward function $r$, mapping each state-action pair into $[0,1]$ |
| $\gamma\in [0,1)$ | discount factor |
| $\mathcal{M}=(S,A,P,r,\gamma)$ | the MDP |
| $\pi:S\rightarrow \triangle(A)$ | policy $\pi$, mapping states to distributions over actions |
| $V^\star(s)$ | value of the optimal policy $\pi^\star$ when the state outcome is $s$ |
| $Q^\star(s,a)$ | value of the optimal policy $\pi^\star$ when the state outcome is $s$ and the action outcome is $a$ |
- Infinite horizon: trajectories of unbounded length, written $H\rightarrow \infty$.
- Trajectory: $\tau=\{s_0,a_0,\dots, s_h,a_h\}$, where $s_h$ is the state variable at time $h$, and $a_h=a$ means the action variable at time $h$ takes the value $a$.
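To make the notation concrete, here is a minimal Python sketch (the 2-state, 2-action MDP and all of its numbers are made-up assumptions) that stores the tuple $\mathcal{M}=(S,A,P,r,\gamma)$ as NumPy arrays and samples a finite prefix of a trajectory $\tau$:

```python
import numpy as np

# A made-up tabular MDP M = (S, A, P, r, gamma) with 2 states and 2 actions.
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a] is a distribution over next
              [[0.5, 0.5], [0.1, 0.9]]])  # states, i.e. an element of Δ(S)
r = np.array([[1.0, 0.0],                 # r[s, a] lies in [0, 1]
              [0.0, 0.5]])
gamma = 0.9                               # discount factor in [0, 1)

# A stochastic stationary policy pi: S -> Δ(A).
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

def sample_trajectory(s0, horizon, seed=0):
    """Sample the prefix {s_0, a_0, ..., s_h, a_h} of an infinite trajectory."""
    rng = np.random.default_rng(seed)
    tau, s = [], s0
    for _ in range(horizon + 1):
        a = rng.choice(n_actions, p=pi[s])   # a_h ~ pi(. | s_h)
        tau.append((s, a))
        s = rng.choice(n_states, p=P[s, a])  # s_{h+1} ~ P(. | s_h, a_h)
    return tau

print(sample_trajectory(s0=0, horizon=5))
```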
Based on this MDP setting, we then have:
- Value function: the value when the initial state variable takes the value $s$:
$$V^\pi(s)=\mathbb{E}_{s_{0}=s,\,a_h\sim \pi(\cdot\mid s_h),\,s_{h+1}\sim p(\cdot\mid s_h,a_h)}\left[\sum_{h=0}^\infty \gamma^h r(s_h,a_h)\right]$$
- Q function: the value when the initial state variable takes the value $s$ and the action variable takes the value $a$:
$$Q^\pi(s,a)=\mathbb{E}_{s_{0}=s,\,a_{0}=a,\,s_{h+1}\sim p(\cdot\mid s_h,a_h),\,a_{h+1}\sim \pi(\cdot\mid s_{h+1})}\left[\sum_{h=0}^\infty \gamma^h r(s_h,a_h)\right]$$
- Bellman consistency equations: the recursive relations between the value function and the Q function (a policy-evaluation sketch in code follows this list):
$$\begin{aligned} V^\pi(s)&=\mathbb{E}_{a\sim \pi(\cdot\mid s)}\left[Q^\pi(s,a)\right] && \text{(V-Q)}\\ Q^\pi(s,a)&= r(s,a) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}\left[V^\pi(s')\right] && \text{(Q-V)}\\ V^\pi(s)&=\mathbb{E}_{a\sim \pi(\cdot\mid s)}\left[ r(s,a) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}\left[V^\pi(s')\right]\right] && \text{(V-V)}\\ Q^\pi(s,a)&=r(s,a) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}\left[\mathbb{E}_{a'\sim \pi(\cdot\mid s')}\left[Q^\pi(s',a')\right]\right] && \text{(Q-Q)} \end{aligned}$$
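In the tabular case, the (V-V) equation is a linear system $V^\pi = r_\pi + \gamma P_\pi V^\pi$, so policy evaluation reduces to a single linear solve. A minimal sketch, reusing the made-up 2-state MDP above (the numbers are illustrative assumptions, not anything from the lecture notes):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
gamma = 0.9
pi = np.array([[0.7, 0.3], [0.4, 0.6]])

# Marginalize the action out under pi:
# r_pi[s] = E_{a~pi}[r(s,a)],  P_pi[s,s'] = E_{a~pi}[P(s'|s,a)].
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)

# (V-V): V = r_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = r_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# (Q-V): Q(s,a) = r(s,a) + gamma * E_{s'~p(.|s,a)}[V(s')].
Q = r + gamma * np.einsum('sat,t->sa', P, V)

# (V-Q) sanity check: V(s) = E_{a~pi}[Q(s,a)].
assert np.allclose(V, (pi * Q).sum(axis=1))
print(V, Q)
```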
2. Bellman Optimality
(Proofs are given in Part 4.)
- Property 1: Under the infinite horizon discounted MDP setting, there exists a deterministic and stationary optimal policy $\pi^\star$ such that $V^{\pi^\star}(s) \geq V^\pi(s)$ for all $s$ and all $\pi$. The value function of this optimal policy $\pi^\star$ is $V^{\pi^\star}$, abbreviated $V^\star$, and it satisfies:
$$V^\star(s)=\max_a \left[r(s,a) + \gamma\, \mathbb{E}_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right]\\ Q^\star(s,a)=r(s,a)+\gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}\left[\max_{a'}Q^\star (s',a')\right]$$
- Property 2 (V-version): for any value function $V$, if it satisfies
$$V(s)=\max_a\left[r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}V(s')\right],\quad\forall s,$$
then $V(s)=V^\star(s)$. Similarly, the Q-version: for any Q function $Q$, if it satisfies
$$Q(s,a)=r(s,a)+\gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}\left[\max_{a'}Q (s',a')\right],\quad\forall s,a,$$
then $Q(s,a)=Q^\star(s,a)$. (A value-iteration sketch based on this fixed-point view follows this list.)
- Understanding "stationary": in a time series, predicting the future from history and the present presupposes some continuity between them. More rigorously, certain basic properties of the state variable $s_h$ at past and current times must remain unchanged at future times $s_{h+k}$. Invariance of the summary statistics of these variables is weak stationarity; invariance of their joint distribution is strong stationarity. For further reading, see "如何理解时间序列的平稳性" (on understanding stationarity in time series).
- In symbols, a non-stationary policy is written $\pi(a\mid s,t)$, depending on time $t$; $\pi(a\mid s)$ denotes a stationary policy; $\pi(s)$ denotes a deterministic policy.
- The existence proof for a deterministic stationary optimal policy under infinite horizon discounted MDPs is given in Section 4.1.
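Property 2 suggests an algorithm: repeatedly apply the Bellman optimality update; if the iteration converges, the limit satisfies the fixed-point equation and therefore must be $Q^\star$. A minimal value-iteration sketch on the same made-up 2-state MDP as before:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
gamma = 0.9

# Iterate Q <- r + gamma * E_{s'}[max_{a'} Q(s', a')] to a fixed point.
Q = np.zeros_like(r)
for _ in range(10_000):
    Q_next = r + gamma * np.einsum('sat,t->sa', P, Q.max(axis=1))
    if np.max(np.abs(Q_next - Q)) < 1e-12:
        break
    Q = Q_next

V_star = Q.max(axis=1)      # V*(s) = max_a Q*(s, a)
pi_star = Q.argmax(axis=1)  # a deterministic stationary optimal policy
print(V_star, pi_star)
```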
3. Trajectory distribution & state-action distribution
- (Trajectory distribution) Starting from initial state $s_0$, the probability that policy $\pi$ generates a length-$h$ trajectory $\tau_h=\{s_0,a_0,\dots, s_h,a_h\}$ is
$$\mathbb{P}^\pi_{s_0}(\tau_h)= \pi(a_0\mid s_0)\prod_{t=0}^{h-1} p(s_{t+1}\mid s_t,a_t)\,\pi(a_{t+1}\mid s_{t+1})$$
- Starting from initial state $s_0$, under policy $\pi$, the probability of visiting the outcome $s,a$ at time $h$ is
$$\mathbb{P}_{s_0}^\pi(s_h=s,a_h=a)=\sum_{a_0,s_1,a_1,\dots,s_{h-1},a_{h-1}}\mathbb{P}^\pi_{s_0}(\tau_{h-1})\,p(s_{h}=s\mid s_{h-1},a_{h-1})\,\pi(a_h=a\mid s)$$
(summing over all possible outcomes before time $h$)
- (State-action distribution) Starting from initial state $s_0$, under policy $\pi$, the discounted probability of visiting the outcome $s,a$ is
$$d^\pi_{s_0}(s,a)=(1-\gamma)\sum_{h=0}^\infty \gamma^h\, \mathbb{P}^\pi_{s_0}(s_h=s,a_h=a)$$
(adding up the visiting probabilities over all times), where the factor $1-\gamma$ normalizes $d^\pi_{s_0}(s,a)$ into a probability distribution, i.e. it sums to 1.
- Expressing the V function in terms of the state-action distribution (a numerical check follows this list):
$$V^\pi(s_0)=\frac{1}{1-\gamma}\sum_{s,a}d^\pi_{s_0}(s,a)\,r(s,a)$$
(ranging over all possible outcomes $s,a$)
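A minimal numerical check of this identity on the made-up 2-state MDP from earlier (the truncation horizon of 2000 steps is an assumption; the geometric tail it drops is negligible):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
gamma = 0.9
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
P_pi = np.einsum('sa,sat->st', pi, P)  # state-to-state kernel under pi

s0 = 0
mu = np.zeros(2); mu[s0] = 1.0         # P(s_0 = s0) = 1
d = np.zeros((2, 2))
for h in range(2000):                  # truncate the infinite sum
    d += (1 - gamma) * gamma**h * mu[:, None] * pi  # adds gamma^h * P(s_h=s, a_h=a)
    mu = mu @ P_pi                     # advance the state marginal one step
assert abs(d.sum() - 1.0) < 1e-8       # d is a probability distribution

# Compare against exact policy evaluation: V(s0) = (1/(1-gamma)) sum_{s,a} d(s,a) r(s,a).
V = np.linalg.solve(np.eye(2) - gamma * P_pi, (pi * r).sum(axis=1))
assert abs((d * r).sum() / (1 - gamma) - V[s0]) < 1e-6
print(d)
```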
4. Q&A and supplements
Q1: Why is the discount factor $\gamma$ restricted to $[0,1)$; why exclude 1?
See Q2, then think about what happens if $\gamma\geq1$. (Hint: the geometric series $\sum_h \gamma^h$ no longer converges, so the value function need not be finite.)
Q2: Why map rewards into $[0,1]$; what is this good for?
Practical reward signals can be mapped into this interval by reward shaping (rescaling). The benefit is that the value function and Q function become bounded, which makes analysis easier:
$$\begin{aligned} 0\leq V^\pi(s)&=\mathbb{E}_{s_{0}=s,\,a_h\sim \pi(\cdot\mid s_h),\,s_{h+1}\sim p(\cdot\mid s_h,a_h)}\left[\sum_{h=0}^\infty \gamma^h r(s_h,a_h)\right]\\ & \leq \sum_{h=0}^\infty \gamma^h = \frac{1}{1-\gamma} \end{aligned}$$
Similarly: $0\leq Q^\pi(s,a)\leq \frac{1}{1-\gamma}$. (A quick numerical check follows.)
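A quick sanity check of these bounds on randomly generated MDPs with rewards in $[0,1]$ (the state/action counts and the number of trials are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, nS, nA = 0.9, 5, 3
for _ in range(100):
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # random transition kernel
    r = rng.uniform(0.0, 1.0, size=(nS, nA))       # rewards in [0, 1]
    pi = rng.dirichlet(np.ones(nA), size=nS)       # random stationary policy
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * r).sum(axis=1))
    Q = r + gamma * np.einsum('sat,t->sa', P, V)
    assert (V >= 0).all() and (V <= 1 / (1 - gamma)).all()
    assert (Q >= 0).all() and (Q <= 1 / (1 - gamma)).all()
print("bounds hold on all sampled MDPs")
```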
4.1 Existence of a deterministic stationary optimal policy under infinite horizon discounted MDPs
- Proof sketch for existence: consider policies $\pi$ in the policy class $\Pi$ (here $s_h,a_h$ are random variables; the other symbols denote their realized values, i.e. outcomes).
  - First show: when the outcomes at time 0 are $s,a,r$ and the state outcome at time 1 is $s'$, the maximal future discounted value satisfies
$$\max_{\pi\in\Pi} \mathbb{E}_{s_0=s,a_0=a,s_1=s',\,a_h\sim \pi(\cdot\mid s_h),\,s_{h+1}\sim p(\cdot\mid s_h,a_h)}\left[\sum_{h=1}^\infty \gamma^hr(s_h,a_h)\right]=\gamma V^\star(s')$$
(This says that over the policy class $\Pi$, given the initial outcomes $s,a,r,s'$, the best achievable future value depends only on the next state outcome $s'$, which is intuitive from the Markov property.)
  - Next, construct a deterministic stationary policy $\tilde \pi(s)$ as follows:
$$\tilde \pi(s)=\operatorname*{argmax}_{a\in A}\,\mathbb{E}_{s_1\sim p(\cdot\mid s,a)}\left[r(s,a)+\gamma V^\star(s_1)\right]$$
  - Finally, show that the constructed $\tilde \pi(s)$ is an optimal policy, i.e.:
$$V^{\tilde \pi}(s)=V^\star(s)$$
- Proof 1 (of the first claim): with the change of variables $h'=h-1$, re-indexing so that $s'$ becomes the initial state, we have:
$$\begin{aligned} &\max_{\pi\in\Pi}\mathbb{E}_{s_0=s,a_0=a,s_1=s',\,a_h\sim \pi(\cdot\mid s_h),\,s_{h+1}\sim p(\cdot\mid s_h,a_h)}\left[\sum_{h=1}^\infty \gamma^hr(s_h,a_h)\right]\\ &=\max_{\pi\in\Pi}\mathbb{E}_{s_0=s',\,a_{h'}\sim \pi(\cdot\mid s_{h'}),\,s_{h'+1}\sim p(\cdot\mid s_{h'},a_{h'})}\left[\sum_{h'=0}^\infty\gamma^{h'+1}r(s_{h'},a_{h'})\right]\\ &=\max_{\pi\in\Pi}\gamma\, \mathbb{E}_{s_0=s',\,a_{h'}\sim \pi(\cdot\mid s_{h'}),\,s_{h'+1}\sim p(\cdot\mid s_{h'},a_{h'})}\left[\sum_{h'=0}^\infty\gamma^{h'}r(s_{h'},a_{h'})\right]\\ &=\max_{\pi\in\Pi}\gamma\, V^\pi(s')\\ &=\gamma\, V^\star(s') \end{aligned}$$
- Proof 2 (that the constructed policy is optimal):
First, by definition we must have $V^{\tilde \pi}(s)\leq V^\star(s)$ for all $s\in S$. Then, for the state variable at time 0, by definition:
$$\begin{aligned} V^\star(s_0)&= \max_{\pi\in \Pi}\mathbb{E}_{a_h\sim \pi(\cdot\mid s_h),\, s_{h+1}\sim p(\cdot\mid s_h,a_h)}\left[r(s_0,a_0) + \sum_{h=1}^\infty \gamma^h r(s_h,a_h)\right]\\ &=\max_{\pi\in\Pi} \mathbb{E}_{a_0\sim \pi(\cdot\mid s_0)}\left[r(s_0,a_0) + \mathbb{E}_{s_{1}\sim p(\cdot\mid s_0,a_0),\,a_{h}\sim\pi(\cdot\mid s_{h})}\left[\sum_{h=1}^\infty \gamma^h r(s_h,a_h)\right]\right]\\ &\leq \max_{\pi\in\Pi} \mathbb{E}_{a_0\sim \pi(\cdot\mid s_0)}\left[r(s_0,a_0) + \max_{\pi'\in\Pi}\mathbb{E}_{s_{1}\sim p(\cdot\mid s_0,a_0),\,a_{h}\sim\pi'(\cdot\mid s_{h})}\left[\sum_{h=1}^\infty \gamma^h r(s_h,a_h)\right]\right]\\ &=\max_{\pi\in\Pi} \mathbb{E}_{a_0\sim \pi(\cdot\mid s_0),\,s_1\sim p(\cdot\mid s_0,a_0)}\left[r(s_0,a_0) +\gamma V^\star(s_1)\right]\quad\text{(using Proof 1)}\\ &=\max_{a_0\in A}\,\mathbb{E}_{s_1\sim p(\cdot\mid s_0,a_0)}\left[r(s_0,a_0) +\gamma V^\star(s_1)\right]\\ &=r(s_0,\tilde\pi(s_0)) +\gamma\,\mathbb{E}_{s_1\sim p(\cdot\mid s_0,\tilde\pi(s_0))}\left[V^\star(s_1)\right]\quad\text{(the maximizer is exactly the action chosen by the constructed policy)} \end{aligned}$$
Applying this inequality to itself recursively (the same nesting argument spelled out in Section 4.2) replaces the remaining $V^\star$ term by the return of $\tilde\pi$, so
$$V^\star(s_0)\leq V^{\tilde\pi}(s_0),\quad\forall s_0\in S$$
Hence the constructed policy is an optimal policy.
- "Deterministic" comes from the construction: the chosen action is an argmax, i.e. a single action point.
- "Stationary" shows up in the change of variables $h'=h-1$ in Proof 1: the time index can be shifted without changing the problem, i.e. the variables at future times share the same basic structure as those at the current time. (A sketch of computing this construction numerically follows.)
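Once $V^\star$ is available, the construction in the proof is directly computable. A sketch of extracting $\tilde\pi$ on the made-up 2-state MDP from earlier, with $V^\star$ approximated by iterating the Bellman optimality update:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
gamma = 0.9

# Approximate V* by iterating V <- max_a [ r(s,a) + gamma * E_{s'}[V(s')] ].
V = np.zeros(2)
for _ in range(2000):
    V = (r + gamma * np.einsum('sat,t->sa', P, V)).max(axis=1)

# tilde_pi(s) = argmax_a [ r(s,a) + gamma * E_{s_1}[V*(s_1)] ]:
# deterministic (the argmax picks one action) and stationary (no time index).
tilde_pi = (r + gamma * np.einsum('sat,t->sa', P, V)).argmax(axis=1)
print(tilde_pi)
```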
4.2 Proof of Property 1
- The strategy for proving $V^\star(s)=\max_a \left[r(s,a) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}[V^\star(s')]\right]$ is to squeeze with inequalities in both directions, with the expansions based on the Bellman equation.
- Observe that $\max_a \left[r(s,a) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}[V^\star(s')]\right]=\max_a Q^\star(s,a)$, so the claim is $V^\star(s)=\max_a Q^\star(s,a)$, which is equivalent to saying that the policy $\hat \pi(s)=\operatorname*{argmax}_a Q^\star(s,a)$ is the optimal policy $\pi^\star$; hence it suffices to show $V^{\hat\pi}(s)=V^\star(s)$, or $\hat \pi=\pi^\star$.
- We already know $V^{\hat\pi}(s) \leq V^\star(s)$, so it suffices to show $V^{\hat\pi}(s)\geq V^\star(s)$; together these give $V^{\hat\pi}(s)=V^\star(s)$, and therefore $V^\star(s)=\max_a \left[r(s,a) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}[V^\star(s')]\right]$.
We now show $V^\star(s)\leq V^{\hat\pi}(s)$:
$$\begin{aligned} V^\star(s)&=r(s,\pi^\star(s)) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,\pi^\star(s))}\left[V^\star(s')\right]\\ &\leq \max_a \left[r(s,a)+\gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}[V^\star(s')]\right]\\ &=\max_a Q^\star(s,a)\\ &=r(s,\hat\pi(s)) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,\hat\pi(s))}[V^\star(s')] \quad \text{(1)} \end{aligned}$$
Feels like the proof is done, right? Isn't (1) exactly $V^{\hat\pi}(s)$?
Unfortunately it is not: by the definition of the value function,
$$V^{\hat\pi}(s)=r(s,\hat\pi(s)) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,\hat\pi(s))}[V^{\hat\pi}(s')],$$
with $V^{\hat\pi}$, not $V^\star$, inside the expectation.
But we can recursively nest the inequality we already have, $V^\star(s)\leq r(s,\hat\pi(s)) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,\hat\pi(s))}[V^\star(s')]$, inside itself:
$$\begin{aligned} V^\star(s)&\leq r(s,\hat\pi(s)) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,\hat\pi(s))}[V^\star(s')]\\ &\leq r(s,\hat\pi(s))+\gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,\hat\pi(s))}\left[r(s',\hat\pi(s'))+\gamma\, \mathbb{E}_{s''\sim p(\cdot\mid s',\hat\pi(s'))}[V^\star(s'')]\right]\\ &\leq r(s,\hat\pi(s))+\gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,\hat\pi(s))}\left[r(s',\hat\pi(s'))+\gamma\, \mathbb{E}_{s''\sim p(\cdot\mid s',\hat\pi(s'))}\left[r(s'',\hat\pi(s''))+\gamma\, \mathbb{E}_{s'''\sim p(\cdot\mid s'',\hat\pi(s''))}[V^\star(s''')]\right]\right] \\ &\;\cdots\\ &=V^{\hat\pi}(s) \end{aligned}$$
(Each nesting step discounts the remaining $V^\star$ term by another factor of $\gamma$, so in the limit it vanishes and only the discounted return of $\hat\pi$ remains.)
In essence, Property 1 says that this deterministic stationary optimal policy is exactly $\pi^\star(s)=\operatorname*{argmax}_a Q^\star(s,a)$.
4.3 Proof of Property 2
- To prove that any value function $V$ satisfying $V(s)=\max_a\left[r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}V(s')\right],\ \forall s$ must equal $V^\star(s)$, the strategy is to bound $|V(s)-V^\star(s)|$ from above by a quantity that vanishes, with the expansions based on the Bellman equation.
Proof (the first step uses Property 1, i.e. $V^\star(s)=\max_a \left[r(s,a) + \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}[V^\star(s')]\right]$):
$$\begin{aligned} |V(s)-V^\star(s)|&=\left |\max_a\left[r(s,a)+\gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}V(s')\right]-\max_a\left[r(s,a)+\gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}V^\star(s')\right] \right|\\ &\leq \max_a\left|\gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}[V(s')-V^\star(s')]\right|\\ &\leq \max_a \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}\left|V(s')-V^\star(s')\right|\\ &\leq \max_a \gamma\, \mathbb{E}_{s'\sim p(\cdot\mid s,a)}\left[ \max_{a'} \gamma\, \mathbb{E}_{s''\sim p(\cdot\mid s',a')}\left|V(s'')-V^\star(s'')\right|\right]\\ &\;\cdots\\ &\leq \max_{a_1,\dots,a_k} \gamma^k\, \mathbb{E}_{s_k\sim p(\cdot\mid s_{k-1},a_{k-1})}\left[|V(s_k)-V^\star(s_k)|\right]\\ &\leq \max_{a_1,\dots,a_k} \gamma^k \times \frac{2}{1-\gamma}\quad\left(\text{since }\|V\|_\infty\leq\tfrac{1}{1-\gamma}\text{ and }\|V^\star\|_\infty\leq\tfrac{1}{1-\gamma}\right)\\ &\xrightarrow{\;k\rightarrow\infty\;} 0 \end{aligned}$$
The left-hand side does not depend on $k$, so letting $k\rightarrow\infty$ forces $|V(s)-V^\star(s)|=0$, i.e. $V(s)=V^\star(s)$.
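The chain of inequalities above is really the statement that the Bellman optimality update is a $\gamma$-contraction in the sup norm, which is what forces its fixed point to be unique. A small numerical illustration on the made-up 2-state MDP (the two starting value functions are arbitrary assumptions):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
gamma = 0.9

def bellman_opt(V):
    """Apply (T V)(s) = max_a [ r(s,a) + gamma * E_{s'~p(.|s,a)}[V(s')] ]."""
    return (r + gamma * np.einsum('sat,t->sa', P, V)).max(axis=1)

rng = np.random.default_rng(0)
V1, V2 = rng.uniform(0, 10, 2), rng.uniform(0, 10, 2)
# Contraction: ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf.  If V1 and V2 were
# both fixed points, this would give ||V1 - V2|| <= gamma ||V1 - V2||, so V1 = V2.
lhs = np.abs(bellman_opt(V1) - bellman_opt(V2)).max()
rhs = gamma * np.abs(V1 - V2).max()
assert lhs <= rhs + 1e-12
print(lhs, rhs)
```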
5. Summary
- Understand the stationarity and determinism of policies.
- Understand the infinite horizon discounted setting.
- Under this MDP setting, there exists a deterministic stationary optimal policy satisfying the two properties above, collectively called Bellman optimality.
- Therefore, in this setting, any V and Q that satisfy these properties are precisely the V and Q of the deterministic stationary optimal policy.