CS6789-4 (Part 2)
- Adapted from: https://wensun.github.io/CS6789_fall_2021.html
- Details from: "Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal" (2020, JMLR)
- Topic: the sample complexity required by the classic algorithms VI & PI to reach a near-optimal policy when the transition matrix is unknown
- Setting: infinite-horizon discounted MDP / unknown transition probability / deterministic reward / deterministic policy
- Problem addressed: given $\mathcal M=(S,A,P,r,\gamma)$ with the transition matrix $P$ unknown, how many samples (transitions $(s,a,r,s')$) do the classic methods VI & PI need in order to learn a near-optimal (deterministic & stationary) policy?
- Theoretical tools used: statistics theory
1. What is the target of the sample complexity analysis?
- First, model the unknown transition matrix $P$ under the generative-model assumption (uniform access to samples from every $(s,a)$):
$$\widehat P(s'\mid s,a)=\frac{\#(s',s,a)}{N}$$
- All quantities learned by running VI in $\widehat M=(S,A,\widehat P,r,\gamma)$ are written with hats, e.g. $\widehat Q,\widehat V$; in particular $\widehat \pi^\star,\widehat Q^\star,\widehat V^\star$ denote the optimal policy, Q-function, and V-function of $\widehat M$, whereas what we ultimately want is the corresponding $\pi^\star$ of the true $M=(S,A,P,r,\gamma)$.
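As a concrete illustration of this estimator, here is a minimal numpy sketch (the sizes $S,A$, the sample count $N$, and the random ground-truth $P$ are all assumptions of the sketch) that draws $N$ next-state samples from a generative model for every $(s,a)$ and forms $\widehat P$ by counting:

```python
import numpy as np

# A minimal sketch of the generative-model estimator P_hat(s'|s,a) = #(s',s,a) / N.
rng = np.random.default_rng(0)
S, A, N = 5, 3, 1000

# Ground-truth transitions: P[s, a] is a distribution over next states s'.
P = rng.dirichlet(np.ones(S), size=(S, A))

# Generative model: for EVERY (s, a), draw N i.i.d. next states, then count.
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        next_states = rng.choice(S, size=N, p=P[s, a])
        P_hat[s, a] = np.bincount(next_states, minlength=S) / N

# The max-norm error shrinks as N grows (roughly like 1/sqrt(N)).
print(np.abs(P - P_hat).max())
```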
- We use $\epsilon$ to quantify how "near" a near-optimal policy is, written in shorthand as
$$\|V^\star-V^\pi\|_\infty\leq \epsilon_V,$$
i.e., $V^\star(s)-V^\pi(s)\leq \epsilon_V \quad \forall s$.
- Let $\pi_n(s)=\argmax_a Q_n(s,a)$ be the policy that is greedy with respect to the $n$-th VI iterate $Q_n$; its performance relative to the optimal policy $\pi^\star$ satisfies
$$V^{\pi_n}(s)\geq V^\star(s)-\frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_{\infty} \quad \forall s\in S.$$
Since this formula converts between $V$ and $Q$, we may choose the Q-function as the final object of analysis; after all, both VI and PI iterate directly on $Q$:
$$Q^\star(s,a)-\widehat Q^{\hat \pi_n}(s,a)\leq \epsilon_{Q} \quad \forall s,a$$
- So we only need to find the relationship between $\epsilon_Q$ and $N$ to answer the sample-complexity question: after modeling the transition matrix, what order of sample size $N$ is needed for VI to produce an $\epsilon$-optimal policy?
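As a numerical sanity check of the greedy-policy bound above, here is a small numpy sketch on a random MDP (the sizes, seed, reward scale, and iteration counts are assumptions of the sketch, not part of the lecture):

```python
import numpy as np

# Check V*(s) - V^{pi_n}(s) <= 2 gamma^n / (1 - gamma) * ||Q_0 - Q*||_inf
# for the greedy policy pi_n extracted after n value-iteration steps.
rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] = p(. | s, a)
r = rng.random((S, A))                      # deterministic reward in [0, 1]

def bellman_opt(Q):
    # (T Q)(s, a) = r(s, a) + gamma * E_{s' ~ p(.|s,a)} [max_a' Q(s', a')]
    return r + gamma * P @ Q.max(axis=1)

def v_pi(pi):
    # Exact evaluation of a deterministic policy: V^pi = (I - gamma P_pi)^{-1} r_pi.
    P_pi, r_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

Q_star = np.zeros((S, A))
for _ in range(2000):            # run VI (numerically) to convergence -> Q*
    Q_star = bellman_opt(Q_star)

n, Q = 25, np.zeros((S, A))      # n VI steps starting from Q_0 = 0
for _ in range(n):
    Q = bellman_opt(Q)
pi_n = Q.argmax(axis=1)          # greedy policy after n steps

gap = Q_star.max(axis=1) - v_pi(pi_n)                      # V* - V^{pi_n}, per state
bound = 2 * gamma**n / (1 - gamma) * np.abs(Q_star).max()  # ||Q_0 - Q*||_inf with Q_0 = 0
print(gap.max(), "<=", bound)    # the bound holds (and is typically quite loose)
```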
2. Analyzing and decomposing the final target
- First relax the final target and look for an upper bound via the triangle inequality:
$$\|Q^\star-\widehat Q^{\hat \pi_n}\|_\infty\leq \|Q^\star-\widehat{Q}^\star\|_\infty + \|\widehat Q^\star - \widehat Q^{\hat\pi_n}\|_\infty$$
- The term $\|\widehat Q^\star - \widehat Q^{\hat\pi_n}\|_\infty$ is not hard: it corresponds to the planning problem with a known MDP, and from the computational-complexity analysis of VI we know
$$\begin{aligned} \|\widehat Q^\star - \widehat Q^{\hat\pi_n}\|_\infty&\leq\gamma^n\|\widehat Q_0-\widehat Q^\star\|_\infty\leq \gamma^n\frac{1}{1-\gamma}=(1-(1-\gamma))^n\frac{1}{1-\gamma}\\ &\leq \exp(-(1-\gamma)n)\frac{1}{1-\gamma}\leq \epsilon_{opt} \end{aligned}$$
- Hence after $n\geq O((1-\gamma)^{-1}\ln\epsilon_{opt}^{-1})$ iterations we have $\|\widehat Q^\star - \widehat Q^{\hat\pi_n}\|_\infty\leq \epsilon_{opt}$; this term is called the optimization error, which is where the subscript of $\epsilon_{opt}$ comes from.
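Spelling out how the iteration count follows (a direct rearrangement; the extra $\ln\frac{1}{1-\gamma}$ factor is absorbed into the $O(\cdot)$):
$$\exp(-(1-\gamma)n)\frac{1}{1-\gamma}\leq \epsilon_{opt} \iff n\geq \frac{1}{1-\gamma}\ln\frac{1}{(1-\gamma)\epsilon_{opt}}.$$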
- The difficulty lies in the term $\|Q^\star-\widehat{Q}^\star\|_\infty$. Write it out in full and look closely (the optimal policy can be taken to be deterministic!): for a single $(s,a)$,
$$\widehat Q^\star(s,a)-Q^\star(s,a)=r(s,a)+\gamma\mathbb E_{s'\sim \widehat p(\cdot|s,a)}[\widehat Q^\star(s',\widehat \pi^\star(s'))]-r(s,a)-\gamma \mathbb E_{s'\sim p(\cdot|s,a)}[Q^\star(s',\pi^\star(s'))].$$
Not only are the Q-functions different, the policies are too; it seems we are stuck, so let us first review some tools.
3. Understanding and organizing the relevant formulas
3.1 Bellman Consistency Equation
$$\begin{aligned} V^\pi(s)&=\mathbb E_{a\sim \pi(\cdot\mid s)}\left[Q^\pi(s,a)\right] \quad \text{(V-Q)}\\ Q^\pi(s,a)&= r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[V^\pi(s')\right]\quad \text{(Q-V)}\\ V^\pi(s)&=\mathbb E_{a\sim \pi(\cdot\mid s)}\left[ r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[V^\pi(s')\right]\right]\quad \text{(V-V)}\\ Q^\pi(s,a)&=r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[\mathbb E_{a'\sim \pi(\cdot\mid s')}\left[Q^\pi(s',a')\right]\right]\quad\text{(Q-Q)} \end{aligned}$$
- Introduce the operator $P^\pi$ to abbreviate the (Q-Q) equation as $Q^\pi=r+\gamma P^\pi Q^\pi$. This can be understood in matrix form, where each entry is $P^\pi_{(s,a),(s',a')}=p(s'|s,a)\,\pi(a'|s')$:
$$\underbrace{Q^\pi}_{\in \mathbb R^{|S||A|\times 1}}=\underbrace{r}_{\in \mathbb R^{|S||A|\times 1}}+\gamma \underbrace{P^\pi}_{\in\mathbb R^{|S||A|\times |S||A|}}\underbrace{Q^\pi}_{\in \mathbb R^{|S||A|\times 1}}$$
- Similarly, introduce the operator $P$ to abbreviate the (Q-V) equation: $Q^\pi=r+\gamma PV^\pi$
- Hence Bellman consistency gives the Q-function a closed-form expression: $Q^\pi=(I-\gamma P^\pi)^{-1}r$
- $Q^\pi$ is defined at every point $(s,a)$ of its domain by $Q^\pi(s,a)=\mathbb E\left[\sum_{t=0}^\infty\gamma^tr(s_t,a_t)\,\Big|\,s_0=s,a_0=a\right]$
- From these two facts we can directly derive a lemma about $(I-\gamma P^\pi)^{-1}$: each entry of this inverse matrix satisfies (here $\Pr$ denotes the probability of encountering $(s',a')$ at time $t$ when starting from $s_0=s,a_0=a$ and following the transition matrix and the policy)
$$(I-\gamma P^\pi)^{-1}_{(s,a),(s',a')}=\sum_{t=0}^\infty \gamma^t\Pr\left(s_t=s',a_t=a'\mid s_0=s,a_0=a\right)\leq \frac{1}{1-\gamma}$$
- We noted above that this matrix is indeed invertible, since for any $x\neq 0$
$$\begin{aligned} \|(I-\gamma P^{\pi})x\|_{\infty}&=\|x-\gamma P^{\pi}x\|_{\infty}\\ &\geq \|x\|_{\infty}-\|\gamma P^\pi x\|_{\infty}\quad \text{(reverse triangle inequality)}\\ &\geq \|x\|_{\infty}-\gamma \|x\|_\infty\quad \text{($P^\pi$ is row-stochastic by definition)}\\ &=(1-\gamma)\|x\|_\infty >0 \end{aligned}$$
- So we easily obtain the inequality $\|(I-\gamma P^\pi)^{-1}v\|_\infty \leq \frac{\|v\|_\infty}{1-\gamma}$
Note that everything in this subsection follows purely from definitions: Bellman consistency applies in every setting and involves no deterministic-policy assumption, so it must be distinguished from Bellman optimality.
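To make this subsection concrete, here is a small numpy sketch (the sizes, seed, and the stochastic policy are assumptions of the sketch) that builds the $|S||A|\times|S||A|$ matrix $P^\pi$, computes the closed-form $Q^\pi$, and checks the resolvent lemma and the norm inequality numerically:

```python
import numpy as np

# Build P^pi[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s'), then verify:
#   Q^pi = (I - gamma P^pi)^{-1} r   (closed form / fixed point),
#   the resolvent is entrywise nonnegative with rows summing to 1/(1-gamma)
#   (which in particular bounds every entry by 1/(1-gamma)),
#   and ||(I - gamma P^pi)^{-1} v||_inf <= ||v||_inf / (1 - gamma).
rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # p(s'|s,a)
pi = rng.dirichlet(np.ones(A), size=S)       # pi(a|s), a stochastic policy
r = rng.random(S * A)                        # reward as a vector over (s, a)

P_pi = np.einsum('ijk,kl->ijkl', P, pi).reshape(S * A, S * A)
resolvent = np.linalg.inv(np.eye(S * A) - gamma * P_pi)
Q_pi = resolvent @ r                         # closed-form Q^pi

print(np.allclose(Q_pi, r + gamma * P_pi @ Q_pi))           # Bellman consistency (Q-Q)
print((resolvent >= -1e-12).all())                          # nonnegative entries
print(np.allclose(resolvent.sum(axis=1), 1 / (1 - gamma)))  # rows sum to 1/(1-gamma)

v = rng.standard_normal(S * A)
print(np.abs(resolvent @ v).max() <= np.abs(v).max() / (1 - gamma))
```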
3.2 Simulation Lemma
Using the operator $P^\pi$ introduced in 3.1, the following identities hold for any policy $\pi$:
$$\begin{aligned} Q^{\pi}-\widehat{Q}^{\pi} &=\left(I-\gamma P^{\pi}\right)^{-1} r-(I-\gamma \widehat{P}^{\pi})^{-1} r \\ &=(I-\gamma \widehat{P}^{\pi})^{-1}\left((I-\gamma \widehat{P}^{\pi})(I-\gamma P^{\pi})^{-1} -I\right)r\\ &=(I-\gamma \widehat{P}^{\pi})^{-1}\left((I-\gamma \widehat{P}^{\pi})-(I-\gamma P^{\pi}) \right)(I-\gamma P^{\pi})^{-1} r\\ &=(I-\gamma \widehat{P}^{\pi})^{-1}\left((I-\gamma \widehat{P}^{\pi})-(I-\gamma P^{\pi})\right) Q^{\pi} \\ &=\gamma(I-\gamma \widehat{P}^{\pi})^{-1}\left(P^{\pi}-\widehat{P}^{\pi}\right) Q^{\pi} \\ &=\gamma(I-\gamma \widehat{P}^{\pi})^{-1}(P-\widehat{P}) V^{\pi} \end{aligned}$$
$$\begin{aligned} Q^{\pi}-\widehat{Q}^{\pi} &=\left(I-\gamma P^{\pi}\right)^{-1} r-(I-\gamma \widehat{P}^{\pi})^{-1} r \\ &=(I-\gamma P^\pi)^{-1}\left(I-(I-\gamma P^\pi)(I-\gamma \widehat{P}^{\pi})^{-1}\right)r\\ &=(I-\gamma P^\pi)^{-1}\left((I-\gamma \widehat{P}^{\pi})-(I-\gamma P^\pi)\right)(I-\gamma \widehat{P}^{\pi})^{-1}r\\ &=\gamma(I-\gamma P^\pi)^{-1}(P^\pi-\widehat P^\pi)\widehat Q^\pi\\ &=\gamma(I-\gamma P^\pi)^{-1}(P-\widehat P)\widehat V^\pi \end{aligned}$$
The hallmark here: the policy is the same while the Q-functions differ, and the difference comes only from whether the transitions are $P$ or $\widehat P$, i.e., it depends only on the simulator, hence the name Simulation Lemma.
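A quick numerical confirmation of the first form of the lemma, on two unrelated random transition models (all sizes and the seed are assumptions of the sketch):

```python
import numpy as np

# Verify Q^pi - Qhat^pi = gamma (I - gamma Phat^pi)^{-1} (P - Phat) V^pi
# for a fixed stochastic policy pi and two different transition models.
rng = np.random.default_rng(3)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # true model p(s'|s,a)
P_hat = rng.dirichlet(np.ones(S), size=(S, A))  # "simulator" model
pi = rng.dirichlet(np.ones(A), size=S)          # pi(a|s)
r = rng.random(S * A)

def op(P_):
    # P^pi[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s') as an |S||A| x |S||A| matrix
    return np.einsum('ijk,kl->ijkl', P_, pi).reshape(S * A, S * A)

Q = np.linalg.solve(np.eye(S * A) - gamma * op(P), r)          # Q^pi
Q_hat = np.linalg.solve(np.eye(S * A) - gamma * op(P_hat), r)  # Qhat^pi
V = (pi * Q.reshape(S, A)).sum(axis=1)          # V^pi(s) = E_{a~pi} Q^pi(s,a)

gap = ((P - P_hat) @ V).reshape(-1)             # (P - Phat) V^pi over (s, a)
rhs = gamma * np.linalg.solve(np.eye(S * A) - gamma * op(P_hat), gap)
print(np.allclose(Q - Q_hat, rhs))              # True: the identity holds exactly
```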
3.3 Component-wise Bound
This is a well-known pair of bounds that acts directly on the quantity we were stuck on, $\|Q^\star-\widehat{Q}^\star\|_\infty$.
- Upper bound
$$\begin{aligned} Q^\star-\widehat{Q}^\star&=Q^{\pi^\star}-\widehat Q^{\hat \pi^\star}\leq Q^{\pi^\star}-\widehat Q^{\pi^\star}\quad\text{(switch to the same policy)}\\ &=\gamma\left(I-\gamma \widehat{P}^{\pi^\star}\right)^{-1}(P-\widehat{P}) V^{\pi^\star}\quad\text{(first simulation lemma)}\\ &=\gamma\left(I-\gamma P^{\pi^\star}\right)^{-1}(P-\widehat P)\widehat V^{\pi^\star}\quad\text{(second simulation lemma)} \end{aligned}$$
- Lower bound
$$\begin{aligned} Q^\star-\widehat{Q}^\star&=Q^{\pi^\star}-\widehat Q^{\hat \pi^\star}\\ &=(I-\gamma P^{\pi^\star})^{-1}r-(I-\gamma \widehat P^{\hat \pi^\star})^{-1}r\\ &=(I-\gamma \widehat P^{\hat \pi^\star})^{-1}\left((I-\gamma \widehat P^{\hat \pi^\star})(I-\gamma P^{\pi^\star})^{-1}-I\right)r\\ &=(I-\gamma \widehat P^{\hat \pi^\star})^{-1}\left((I-\gamma \widehat P^{\hat \pi^\star})-(I-\gamma P^{\pi^\star})\right)(I-\gamma P^{\pi^\star})^{-1}r\\ &=\gamma(I-\gamma \widehat P^{\hat \pi^\star})^{-1}\left(P^{\pi^\star}-\widehat P^{\hat \pi^\star}\right)Q^\star\\ &\geq \gamma(I-\gamma \widehat P^{\hat \pi^\star})^{-1}\left(P^{\pi^\star}-\widehat P^{\pi^\star}\right)Q^\star\quad (\star)\\ &=\gamma(I-\gamma \widehat P^{\hat \pi^\star})^{-1}\left(P-\widehat P\right)V^\star \end{aligned}$$
Hence the component-wise bound is
$$\gamma(I-\gamma \widehat P^{\hat \pi^\star})^{-1}\left(P-\widehat P\right)V^\star\leq Q^\star-\widehat{Q}^\star\leq \gamma\left(I-\gamma \widehat{P}^{\pi^\star}\right)^{-1}(P-\widehat{P}) V^{\star}$$
The component-wise bound simply builds on the proof of the simulation lemma with two small inequalities: the upper bound uses $Q^{\pi^\star}-\widehat Q^{\hat \pi^\star}\leq Q^{\pi^\star}-\widehat Q^{\pi^\star}$ (since $\hat\pi^\star$ is optimal in $\widehat M$, so $\widehat Q^{\hat\pi^\star}\geq \widehat Q^{\pi^\star}$), and the lower bound uses $\widehat P^{\hat \pi^\star}Q^\star \leq \widehat P^{\pi^\star}Q^\star$ at step $(\star)$ (since $\pi^\star$ is greedy with respect to $Q^\star$).
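And a numerical check of the component-wise bound itself, again on a random MDP with deterministic greedy policies (the sizes, seed, and VI convergence budget are assumptions of the sketch):

```python
import numpy as np

# Check  gamma (I - g Phat^{pihat*})^{-1} (P - Phat) V*
#          <=  Q* - Qhat*  <=
#        gamma (I - g Phat^{pi*})^{-1} (P - Phat) V*    component-wise.
rng = np.random.default_rng(4)
S, A, g = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # true model
P_hat = rng.dirichlet(np.ones(S), size=(S, A))  # empirical/simulator model
r = rng.random((S, A))

def q_star(P_):
    # Value iteration (to numerical convergence) for the optimal Q of model P_.
    Q = np.zeros((S, A))
    for _ in range(2000):
        Q = r + g * P_ @ Q.max(axis=1)
    return Q

def op(P_, pi_):
    # P^pi for a deterministic policy: P^pi[(s,a),(s',a')] = p(s'|s,a) * 1{a' = pi_(s')}
    M = np.zeros((S, A, S, A))
    M[:, :, np.arange(S), pi_] = P_
    return M.reshape(S * A, S * A)

Q_s, Qh_s = q_star(P), q_star(P_hat)                   # Q*, Qhat*
pi_s, pih_s = Q_s.argmax(axis=1), Qh_s.argmax(axis=1)  # pi*, pihat*
V_s = Q_s.max(axis=1)                                  # V*

gap = ((P - P_hat) @ V_s).reshape(-1)                  # (P - Phat) V* over (s, a)
lo = g * np.linalg.solve(np.eye(S * A) - g * op(P_hat, pih_s), gap)
hi = g * np.linalg.solve(np.eye(S * A) - g * op(P_hat, pi_s), gap)
diff = (Q_s - Qh_s).reshape(-1)                        # Q* - Qhat*
print((lo <= diff + 1e-8).all(), (diff <= hi + 1e-8).all())  # True True
```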
4. Analyzing the target with the formulas above
- Using the component-wise bound, we have two upper bounds on the target $Q^\star-\widehat{Q}^\star$:
$$\begin{aligned} &Q^\star-\widehat{Q}^\star\leq\gamma\left(I-\gamma \widehat{P}^{\pi^\star}\right)^{-1}(P-\widehat{P}) V^{\pi^\star}\\ &Q^\star-\widehat{Q}^\star\leq\gamma\left(I-\gamma P^{\pi^\star}\right)^{-1}(P-\widehat P)\widehat V^{\pi^\star} \end{aligned}$$
- The next question is: which of the two is easier to bound?
  - Is there a tool for controlling $(P-\widehat{P}) V^{\pi^\star}$ or $(P-\widehat P)\widehat V^{\pi^\star}$?
  - Is there a tool for controlling $\left(I-\gamma \widehat{P}^{\pi^\star}\right)^{-1}$ or $\left(I-\gamma P^{\pi^\star}\right)^{-1}$?
  - How do we bound these quantities with high probability?
- To be continued in the next installment.