1. Planning in MDP
This series posts one piece of basic RL theory per day. The core concepts covered in the first three articles (1-3) are:
- Bellman Optimality
$$
\begin{aligned} V^\star(s)&=\max_a \left[r(s,a) + \gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right]\\ Q^\star(s,a)&=r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot|s,a)}\left[\max_{a'}Q^\star (s',a')\right]\\ V^\star(s)&=\max_a Q^\star(s,a) \end{aligned}
$$
- Bellman Consistency Equation
$$
\begin{aligned} V^\pi(s)&=\mathbb E_{a\sim \pi(\cdot\mid s)}\left[Q^\pi(s,a)\right] \quad \text{(V-Q)}\\ Q^\pi(s,a)&= r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[V^\pi(s')\right]\quad \text{(Q-V)}\\ V^\pi(s)&=\mathbb E_{a\sim \pi(\cdot\mid s)}\left[ r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[V^\pi(s')\right]\right]\quad \text{(V-V)}\\ Q^\pi(s,a)&=r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[\mathbb E_{a'\sim \pi(\cdot\mid s')}\left[Q^\pi(s',a')\right]\right]\quad\text{(Q-Q)} \end{aligned}
$$
- The constructed deterministic optimal policy
$$
\tilde \pi(s)=\argmax_{a\in A}\mathbb E_{s_0=s,a_0=a}\left[r(s,a)+\gamma V^\star(s_1)\right]
$$
- Value Iteration. Define the Bellman Optimality Operator as
$$
\mathcal BQ(s,a):= r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[\max_{a'}Q(s',a')\right]
$$
$$
\begin{aligned} &\text{VI update:}\quad Q_{n+1}(s,a)= r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[\max_{a'}Q_n(s',a')]\quad \forall (s,a)\in S\times A\\ &\text{Contraction of }\mathcal B:\quad \|\mathcal BQ-\mathcal BQ'\|_{\infty}\leq \gamma \|Q-Q'\|_{\infty}\\ &\text{Fixed point of }Q^\star:\quad \mathcal BQ^\star=Q^\star\\ &\text{VI convergence rate:}\quad \|Q_n-Q^\star\|_{\infty}=\|\mathcal BQ_{n-1}-\mathcal BQ^\star\|_\infty\leq \gamma \|Q_{n-1}-Q^\star\|_\infty\leq \cdots\leq \gamma^n \|Q_0-Q^\star\|_{\infty}\\ &\text{VI policy:}\quad \pi_n(s)=\argmax_a Q_n(s,a)\\ &\text{VI policy performance:}\quad V^\star(s)-V^{\pi_n}(s)\leq \frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_{\infty}\leq \frac{2}{1-\gamma}\cdot \frac{\exp(-(1-\gamma)n)}{1-\gamma}\leq \epsilon \end{aligned}
$$
- $\mathcal B$ acts on the Q function over its arguments $(s,a)$, and $\|\cdot\|_\infty$ is the metric on that function space;
- $\|Q-Q'\|_{\infty}$ says that the distance between the two functions $Q$ and $Q'$ is their $\infty$-norm, i.e. worst-case pointwise;
- applying $\mathcal B$ is a $\gamma$-contraction: it pulls any two Q functions closer together over the $(s,a)$ space
- the VI convergence rate is $O(\frac{\gamma^n}{1-\gamma})$: after $n\geq \frac{\ln \frac{2}{\epsilon(1-\gamma)^2}}{1-\gamma}$ iterations, the resulting policy $\pi_n$ is $\epsilon$-optimal, i.e. $V^\star(s)-V^{\pi_n}(s)\leq \epsilon\ \ \forall s$
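The VI update above can be sketched in a few lines of numpy. The 2-state, 2-action MDP below (its `P`, `r`, `gamma`) is invented purely for illustration:

```python
import numpy as np

# A minimal tabular value-iteration sketch on a made-up MDP.
gamma = 0.9
P = np.array([  # P[s, a, s'] transition probabilities
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])  # r[s, a]

def bellman_optimality_op(Q):
    # (BQ)(s,a) = r(s,a) + gamma * E_{s'~P(.|s,a)}[max_a' Q(s',a')]
    return r + gamma * P @ Q.max(axis=1)

Q = np.zeros_like(r)
for n in range(1000):
    Q_next = bellman_optimality_op(Q)
    if np.max(np.abs(Q_next - Q)) < 1e-10:  # sup-norm stopping rule
        break
    Q = Q_next

pi = Q.argmax(axis=1)  # greedy policy from the (near-)fixed point
```

Both properties from the block above can be checked directly on this instance: `Q` is (numerically) a fixed point of the operator, and applying the operator to two arbitrary Q functions shrinks their $\infty$-distance by at least a factor $\gamma$.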
- Policy Iteration
$$
\begin{aligned} &\text{Policy Evaluation (Q-version):}\quad Q^{\pi_n}=(I-\gamma P^{\pi_n})^{-1}r \\ &\text{Policy Evaluation (V-version):}\quad V^\pi(s)=\mathbb E_{s'\sim p(\cdot|s,\pi(s))}\left[r(s,\pi(s))+\gamma V^\pi(s')\right],\ \text{i.e. } V^\pi=(I-\gamma P_\pi)^{-1}r\\ &\text{Policy Improvement:}\quad \pi_{n+1}(s)=\argmax_a Q^{\pi_n}(s,a)=\argmax_a\ r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot|s,a)}[V^{\pi_n}(s')]\\ &\text{PI convergence rate:}\quad \|V^{\pi_{n+1}}-V^{\star}\|_\infty \leq \gamma \|V^{\pi_n}-V^\star\|_\infty \end{aligned}
$$
- When $\pi$ is deterministic, Bellman Consistency gives $V^\pi(s)=Q^\pi(s,\pi(s))=r(s,\pi(s))+\gamma\mathbb E_{s'\sim p(\cdot|s,\pi(s))}[V^\pi(s')]=r(s,\pi(s))+\gamma \mathbb E_{s'\sim p(\cdot|s,\pi(s))}[Q^\pi(s',\pi(s'))]$
- PI uses Bellman Consistency for Policy Evaluation, while VI iterates with Bellman Optimality; both use the greedy policy $\pi(s)=\argmax_a Q^{\pi_n}(s,a)$
- Note that there is no notion of interacting with the environment here, because the transition matrix $P$ is known
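Policy iteration with exact policy evaluation via the linear solve $V^\pi=(I-\gamma P_\pi)^{-1}r_\pi$ can be sketched as follows; the small MDP is again made up for illustration:

```python
import numpy as np

# Tabular policy iteration: exact evaluation + greedy improvement.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])  # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])    # r[s, a]
n_states = 2

def evaluate(pi):
    # Build P_pi[s, s'] = P[s, pi(s), s'] and r_pi[s] = r[s, pi(s)],
    # then solve (I - gamma * P_pi) V = r_pi exactly.
    P_pi = P[np.arange(n_states), pi]
    r_pi = r[np.arange(n_states), pi]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

pi = np.zeros(n_states, dtype=int)
while True:
    V = evaluate(pi)                   # policy evaluation
    Q = r + gamma * P @ V              # one-step lookahead
    pi_next = Q.argmax(axis=1)         # greedy improvement
    if np.array_equal(pi_next, pi):    # stable policy => optimal
        break
    pi = pi_next
```

At termination the greedy policy is stable, so `V` satisfies the Bellman optimality equation on this instance.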
- Connection
- Policy quality is measured by $V^\star(s)-V^{\pi_n}(s)$; there is a general performance difference lemma: $$V^{\pi'}(s_0)-V^{\pi}(s_0)=\frac{1}{1-\gamma} \mathbb E_{(s,a)\sim d^{\pi'}_{s_0}}[Q^{\pi}(s,a)-V^{\pi}(s)]=\frac{1}{1-\gamma} \mathbb E_{(s,a)\sim d^{\pi'}_{s_0}}[A^{\pi}(s,a)]$$
- From
$$
V^\star(s)-V^{\pi_n}(s)\leq \frac{2}{1-\gamma}\|Q^{\pi_n}-Q^\star\|_{\infty}=\frac{2}{1-\gamma}\|Q_n-Q^\star\|_{\infty}\leq \frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_\infty
$$
we can read off the relation between policy performance and Q values: it suffices to bound $\|Q^{\pi_n}-Q^\star\|_\infty\leq \epsilon$, which then gives
$$
V^\star(s)-V^{\pi_n}(s)\leq \frac{2\epsilon}{1-\gamma}
$$
So it is enough to control the Q-value error in order to control the policy's performance
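The performance difference lemma above can be verified numerically: on a small made-up MDP, compute both sides exactly for two deterministic policies, obtaining the occupancy $d^{\pi'}_{s_0}$ via a linear solve (everything below — the MDP, the two policies — is invented for the check):

```python
import numpy as np

# Numeric check of the performance-difference lemma on a toy MDP.
gamma, n_states = 0.9, 2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])  # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])    # r[s, a]

def P_of(pi):  # P_pi[s, s'] under deterministic policy pi
    return P[np.arange(n_states), pi]

def V_of(pi):  # exact policy evaluation
    r_pi = r[np.arange(n_states), pi]
    return np.linalg.solve(np.eye(n_states) - gamma * P_of(pi), r_pi)

pi, pi_p, s0 = np.array([0, 0]), np.array([1, 1]), 0
V_pi, V_pip = V_of(pi), V_of(pi_p)
Q_pi = r + gamma * P @ V_pi
A_pi = Q_pi - V_pi[:, None]  # advantage of the *baseline* policy pi

# Discounted state occupancy of pi' from s0:
# d = (1 - gamma) * (I - gamma * P_{pi'}^T)^{-1} e_{s0}
e0 = np.eye(n_states)[s0]
d = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_of(pi_p).T, e0)

lhs = V_pip[s0] - V_pi[s0]
rhs = d @ A_pi[np.arange(n_states), pi_p] / (1 - gamma)
```

Since $\pi'$ is deterministic, $d^{\pi'}_{s_0}(s,a)$ collapses to the state occupancy $d(s)$ evaluated at $a=\pi'(s)$, which is what the last line computes.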
2. Sample Complexity
Sample-complexity analysis depends heavily on the problem itself. What does that mean?
- If the state space is large, e.g. high-dimensional and continuous, then the lower the order of the $|S|$-dependent terms in the complexity, the better
- If the sequence is long, e.g. infinite horizon, then the lower the order of the horizon-dependent term $H$, the better
- Likewise for the action space
Moreover, the sample complexity also depends on the specific algorithm.
Articles (4-6) of the series answer the question:
Under the Generative Model interaction mode, in the tabular MDP setting, modeling the environment as $\widehat P$ and running VI or PI, how many samples are needed at least to obtain an $\epsilon$-optimal policy $\pi$, i.e. $V^\star(s)-V^{\pi}(s)\leq \epsilon$?
2.1 Basic setting
In general the environment dynamics, i.e. the transition model $P$, is unknown. The way we are allowed to interact with the environment therefore determines the sample complexity.
- Generative model: a uniform-access assumption; we can query the entire joint state-action space uniformly, feeding a specific $(s,a)$ to the environment to get $s'$, then resetting. This neatly sidesteps the exploration-exploitation problem (we get to see where the reward lies over the whole state-action space)
- $\mu$-reset: we can only reset to a particular state distribution $\mu(s)$, pick actions with the current policy $\pi$, and interact to obtain trajectories (we only see the reward over a restricted part of the state-action space)
- Exploration: we must explore the whole state space to find where the reward is
2.2 Basic formulas
2.2.1 Simulation Lemma
$$
\begin{aligned} Q^{\pi}-\widehat{Q}^{\pi} &=\left(I-\gamma P^{\pi}\right)^{-1} r-\left(I-\gamma \widehat{P}^{\pi}\right)^{-1} r \\ &=\left(I-\gamma \widehat{P}^{\pi}\right)^{-1}\left((I-\gamma \widehat{P}^{\pi})(I-\gamma P^{\pi})^{-1} -I\right)r\\ &=\left(I-\gamma \widehat{P}^{\pi}\right)^{-1}\left((I-\gamma \widehat{P}^{\pi})-(I-\gamma P^{\pi}) \right)(I-\gamma P^{\pi})^{-1} r\\ &=\left(I-\gamma \widehat{P}^{\pi}\right)^{-1}\left(\left(I-\gamma \widehat{P}^{\pi}\right)-\left(I-\gamma P^{\pi}\right)\right) Q^{\pi} \\ &=\gamma\left(I-\gamma \widehat{P}^{\pi}\right)^{-1}\left(P^{\pi}-\widehat{P}^{\pi}\right) Q^{\pi} \\ &=\gamma\left(I-\gamma \widehat{P}^{\pi}\right)^{-1}(P-\widehat{P}) V^{\pi} \end{aligned}
$$
$$
\begin{aligned} Q^{\pi}-\widehat{Q}^{\pi} &=\left(I-\gamma P^{\pi}\right)^{-1} r-(I-\gamma \widehat{P}^{\pi})^{-1} r \\ &=(I-\gamma P^\pi)^{-1}\left(I-(I-\gamma P^\pi)(I-\gamma \widehat{P}^{\pi})^{-1}\right)r\\ &=(I-\gamma P^\pi)^{-1}\left((I-\gamma \widehat{P}^{\pi})-(I-\gamma P^\pi)\right)(I-\gamma \widehat{P}^{\pi})^{-1}r\\ &=\gamma(I-\gamma P^\pi)^{-1}(P^\pi-\widehat P^\pi)\widehat Q^\pi\\ &=\gamma(I-\gamma P^\pi)^{-1}(P-\widehat P)\widehat V^\pi \end{aligned}
$$
The key feature is that the policy is the same while the Q functions differ; and they differ only in whether the dynamics are $P$ or $\widehat P$, i.e. only in the simulator, hence the name Simulation Lemma
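As a sanity check, the second form of the lemma can be verified numerically on a randomly generated tabular MDP (the kernels, rewards, and policy below are all made up):

```python
import numpy as np

# Numeric check of the simulation lemma (second form above):
#   Q^pi - Qhat^pi = gamma (I - gamma P^pi)^{-1} (P - Phat) Vhat^pi
# where P, Phat are (|S||A| x |S|) kernels and P^pi is the
# (|S||A| x |S||A|) state-action transition matrix under pi.
gamma, nS, nA = 0.9, 2, 2
rng = np.random.default_rng(0)

def random_kernel():
    K = rng.random((nS * nA, nS))
    return K / K.sum(axis=1, keepdims=True)

P, Phat = random_kernel(), random_kernel()  # true dynamics vs. model
r = rng.random(nS * nA)
pi = np.array([0, 1])                       # deterministic policy pi(s)

def sa_matrix(K):
    # K^pi[(s,a), (s',a')] = K[(s,a), s'] * 1[a' = pi(s')]
    M = np.zeros((nS * nA, nS * nA))
    for sp in range(nS):
        M[:, sp * nA + pi[sp]] = K[:, sp]
    return M

def Q_of(K):
    # Q^pi under kernel K: Q = (I - gamma K^pi)^{-1} r
    return np.linalg.solve(np.eye(nS * nA) - gamma * sa_matrix(K), r)

Q, Qhat = Q_of(P), Q_of(Phat)
Vhat = np.array([Qhat[s * nA + pi[s]] for s in range(nS)])  # Vhat(s) = Qhat(s, pi(s))

lhs = Q - Qhat
rhs = gamma * np.linalg.solve(np.eye(nS * nA) - gamma * sa_matrix(P),
                              (P - Phat) @ Vhat)
```

The identity holds exactly (up to floating point), which makes it a useful unit test when implementing model-based error analyses.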
2.2.2 Component-wise bound
$$
\gamma(I-\gamma \widehat P^{\hat \pi^\star})^{-1}\left(P-\widehat P\right)V^\star\leq Q^\star-\widehat{Q}^\star\leq \gamma\left(I-\gamma \widehat{P}^{\pi^\star}\right)^{-1}(P-\widehat{P}) V^{\star}
$$
2.2.3 Bellman Variance Equation
Define:
$$
\Sigma_M^\pi(s,a)=\mathbb E_{\pi,P}\left[\left(\sum_{t=0}^\infty\gamma^t r(s_t,a_t)-Q^\pi_M(s,a)\right)^2\,\Big|\,s_0=s,a_0=a\right]
$$
$$
Q^\pi_M(s,a)=\mathbb E_{\pi,P}\left[\sum_{h=0}^\infty\gamma^h r(s_h,a_h)\,\big|\,s_0=s,a_0=a\right]
$$
$$
\text{Var}_P(Q_M^\pi)(s,a)=\mathbb E_{s'\sim p(\cdot\mid s,a),a'\sim\pi(\cdot|s')}\left[\left(Q_M^\pi(s',a')-\mathbb E_{s''\sim p(\cdot\mid s,a),a''\sim\pi(\cdot|s'')}\left[Q_M^\pi(s'',a'')\right]\right)^2\right]
$$
This yields a Bellman-style equation:
$$
\Sigma_M^\pi=\gamma^2\Big(\text{Var}_P(Q_M^\pi)+P^\pi\Sigma_M^\pi\Big)
$$
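The equation can be sanity-checked numerically: solve the linear system it implies, $\Sigma=(I-\gamma^2 P^\pi)^{-1}\gamma^2\,\text{Var}_P(Q^\pi)$, and compare against a Monte-Carlo estimate of the variance of the discounted return. The tiny MDP below is made up, and the Monte-Carlo comparison is only approximate:

```python
import numpy as np

# Sanity-check the Bellman variance equation on a made-up 2-state MDP.
gamma, nS, nA = 0.9, 2, 2
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
pi = np.array([0, 1])  # deterministic policy

# Exact Q^pi via the state-action transition matrix P^pi under pi.
P_pi = np.zeros((nS * nA, nS * nA))
for s in range(nS):
    for a in range(nA):
        for sp in range(nS):
            P_pi[s * nA + a, sp * nA + pi[sp]] = P[s, a, sp]
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, r.ravel())

# Var_P(Q^pi)(s,a): variance of Q^pi(s', pi(s')) over s' ~ P(.|s,a).
Qnext = np.array([Q[sp * nA + pi[sp]] for sp in range(nS)])
PK = P.reshape(nS * nA, nS)
var_q = PK @ Qnext**2 - (PK @ Qnext)**2

# Closed form implied by Sigma = gamma^2 (var_q + P^pi Sigma).
Sigma = np.linalg.solve(np.eye(nS * nA) - gamma**2 * P_pi, gamma**2 * var_q)

# Monte-Carlo variance of the return from (s0, a0) = (0, 0).
n, T = 20000, 200
s = np.zeros(n, dtype=int); a = np.zeros(n, dtype=int)
G, disc = np.zeros(n), 1.0
for t in range(T):
    G += disc * r[s, a]
    disc *= gamma
    u = rng.random(n)
    s = (u[:, None] > np.cumsum(P[s, a], axis=1)).sum(axis=1)
    s = np.minimum(s, nS - 1)  # guard against float round-off
    a = pi[s]
mc_sigma = np.mean((G - Q[0]) ** 2)  # empirical Sigma(0, 0)
```

With 20,000 truncated rollouts the Monte-Carlo estimate should land within a few percent of `Sigma[0]`.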
2.3 What we analyze
2.3.1 Two targets: value estimation & sub-optimality
- The first target is Value Estimation $\|\widehat Q^\star-Q^\star\|_\infty$, the distance between the optimal Q functions of the two MDPs $\widehat{\mathcal M}=(S,A,r,\widehat P,\gamma)$ and $\mathcal M=(S,A,r,P,\gamma)$
- At a single point $(s,a)$, the gap inside $\|\widehat Q^\star-Q^\star\|_\infty$ can be written as $$\widehat Q^\star(s,a)-Q^\star(s,a)=r(s,a)+\gamma\mathbb E_{s'\sim \widehat p(\cdot|s,a)}[\widehat Q^\star(s',\widehat \pi^\star(s'))]-r(s,a)-\gamma \mathbb E_{s'\sim p(\cdot|s,a)}[Q^\star(s',\pi^\star(s'))]$$
- So Value Estimation asks: how far is the optimal Q value of the modeled MDP from the optimal Q value of the true MDP?
- The second target is Sub-Optimality $\|Q^{\widehat{\pi^\star}}-Q^\star\|_\infty$: how far is the true-MDP Q value of the optimal policy $\widehat \pi^\star$ learned from $\widehat{\mathcal M}=(S,A,r,\widehat P,\gamma)$ from the optimal Q value?
2.3.2 Motivation
- Starting point: the whole point of sample complexity is to run some RL algorithm such as VI on the modeled MDP $\widehat{\mathcal M}=(S,A,r,\widehat P,\gamma)$ to obtain a policy $\widehat{\pi_n}$, and hope it is close to the true optimal policy $\pi^\star$
- The generative model is a strong assumption; the hope is to first get a good sample complexity bound under it, and then gradually weaken it toward the $\mu$-reset and exploration assumptions
- So the practical target is $\|Q^\star-\widehat Q^{\hat \pi_n}\|_\infty$, for which we make a small decomposition: $$\|Q^\star-\widehat Q^{\hat \pi_n}\|_\infty\leq \|Q^\star-\widehat{Q}^\star\|_\infty + \|\widehat Q^\star - \widehat Q^{\hat\pi_n}\|_\infty$$
- After the decomposition, $\|Q^\star-\widehat{Q}^\star\|_\infty$ is value estimation, and $\|\widehat Q^\star - \widehat Q^{\hat\pi_n}\|_\infty$ is the optimization error of the RL algorithm run inside $\widehat{\mathcal M}=(S,A,r,\widehat P,\gamma)$
- The sub-optimality target, after an identity rearrangement, also produces the value estimation term; the other term can be bounded with the simulation lemma, so we focus on value estimation: $$\begin{aligned} Q^{\widehat{\pi^\star}}-Q^\star&=\mathbb E_{s'\sim p(\cdot\mid s,a)}\left[Q(s',\widehat{\pi^\star}(s'))-Q(s',\pi^\star(s'))\right]\\ &= \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[Q(s',\widehat{\pi^\star}(s'))-\widehat Q(s',\widehat{\pi^\star}(s'))+\widehat Q(s',\widehat{\pi^\star}(s'))-Q(s',\pi^\star(s'))\right]\\ &= \mathbb E_{s'\sim p(\cdot\mid s,a)}\left[Q^{\widehat{\pi^\star}}-\widehat{Q^{\widehat{\pi^\star}}}+\widehat{Q^\star}-Q^\star\right] \end{aligned}$$ (the first pair is the same policy under different Q functions, handled by the simulation lemma; the second pair is value estimation; and the expectation is at most the value at its maximizing point)
3. The three Sample Complexity articles
Their Setting, Algorithms, and Assumption are identical; only the sample complexity bounds differ
- Setting: infinite horizon discounted MDP
- Algorithms: Value Iteration & Policy Iteration
- Assumption: generative model, i.e. uniform access, modeling the transition matrix as $\widehat P(s'\mid s,a)=\frac{\# (s,a,s')}{N}$
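Under the generative-model assumption, building $\widehat P$ is just counting: draw $N$ next-state samples for every $(s,a)$ pair and normalize. A sketch (the MDP and the sample size `N` are made up):

```python
import numpy as np

# Build the empirical transition model Phat(s'|s,a) = #(s,a,s') / N
# from N generative-model queries per (s, a) pair.
gamma, nS, nA, N = 0.9, 3, 2, 5000
rng = np.random.default_rng(0)
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # true dynamics

Phat = np.zeros_like(P)
for s in range(nS):
    for a in range(nA):
        # N i.i.d. draws s' ~ P(.|s,a) from the simulator
        samples = rng.choice(nS, size=N, p=P[s, a])
        counts = np.bincount(samples, minlength=nS)
        Phat[s, a] = counts / N
```

Every row of `Phat` is a valid distribution, and each entry concentrates around the true probability at rate roughly $1/\sqrt{N}$, which is exactly the quantity the model-accuracy lemmas below control.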
3.1 Naive Model-based Approach
The idea is to bound the values of all policies across the two MDPs (Uniform Value Accuracy), and thereby bound the value estimation target $\|Q^\star-\widehat{Q}^\star\|_\infty$
The proof is in article (4) of the series, Sample Complexity (part 1)
The overall proposition:
Under the uniform-access assumption, for $\epsilon\in(0,\frac{1}{1-\gamma})$, $\exists c>0$ such that if $$|S||A|N\geq \frac{\gamma}{(1-\gamma)^4}\frac{|S|^2|A|\log (\frac{c|S||A|}{\delta})}{\epsilon^2}$$ then with probability $\geq 1-\delta$ the following hold:
- (Model Accuracy): measures the gap between the true transition matrix and the transition model
$$
\max_{s,a}\|P(\cdot|s,a)-\widehat P(\cdot|s,a)\|_1\leq (1-\gamma)^2\epsilon
$$
- (Uniform Value Accuracy): measures, for every policy $\pi\in\Pi$, the gap between its true Q value and its Q value in the modeled MDP:
$$
\|Q^\pi-\widehat Q^\pi\|_{\infty}\leq \epsilon
$$
- (Near Optimal Planning): the first inequality measures the gap between the optimal Q values of $\mathcal M$ and $\widehat{\mathcal M}$; the second measures the gap, in true Q values, between the optimal policies $\widehat{\pi^\star}$ and $\pi^\star$, where $\widehat{\pi^\star}$ is the optimal policy obtained on $\widehat{\mathcal M}$ and $\widehat{Q^\star}$ is the optimal Q function obtained on $\widehat{\mathcal M}$:
$$
\|\widehat{Q^\star}-Q^\star\|_{\infty}\leq \epsilon,\quad \|Q^{\widehat{\pi^\star}}-Q^\star\|_{\infty}\leq 2\epsilon
$$
3.2 Generative model-based Approach
The idea here is to bound the value estimation target $\|Q^\star-\widehat{Q}^\star\|_\infty$ directly, ignoring policy errors during the algorithm's iterations; only the end result matters, not the process.
- Article (5) of the series, Sample Complexity (part 2), proves the simulation lemma and the component-wise bound
- Article (6), Sample Complexity (part 3), gives two routes to the bound: the first is a direct Hoeffding-inequality relaxation of the target's upper bound; the second rewrites the upper bound using the Bellman variance equation and then applies Bernstein's inequality, yielding a tighter bound
The overall proposition:
- (Value Estimation) For $\epsilon \leq 1$: if $$\text{total sample complexity}=|S||A|N\geq \frac{c|S||A|}{(1-\gamma)^3}\frac{\ln (c|S||A|/\delta)}{\epsilon^2}$$ then with probability $1-\delta$: $$\|Q^\star-\widehat Q^\star\|_\infty \leq \epsilon$$
- (Sub-Optimality) For $\epsilon \leq \sqrt{\frac{1}{1-\gamma}}$: if $$\text{total sample complexity}=|S||A|N\geq \frac{c|S||A|}{(1-\gamma)^3}\frac{\ln (c|S||A|/\delta)}{\epsilon^2}$$ then with probability $1-\delta$: $$\|Q^\star- Q^{\widehat \pi^\star}\|_\infty \leq \epsilon$$
- In short, the naive model-based sample complexity bound is quartic in the effective horizon $\frac{1}{1-\gamma}$ and scales as $O(|S|^2|A|\ln|S||A|)$ in the size of the state-action space
- The generative model-based bound is cubic in the effective horizon $\frac{1}{1-\gamma}$ and scales as $O(|S||A|\ln|S||A|)$ in the size of the state-action space
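Dropping constants and log factors, the gap between the two bounds can be seen by plugging in numbers; the helper functions below keep only the leading terms, which is an informal simplification:

```python
# Leading-order scaling of the two sample complexity bounds
# (constants c and log factors dropped; illustration only).

def naive_bound(S, A, gamma, eps):
    # naive model-based: |S|^2 |A| / ((1 - gamma)^4 eps^2)
    return S**2 * A / ((1 - gamma)**4 * eps**2)

def sharp_bound(S, A, gamma, eps):
    # generative model-based: |S| |A| / ((1 - gamma)^3 eps^2)
    return S * A / ((1 - gamma)**3 * eps**2)

S, A, gamma, eps = 100, 10, 0.99, 0.1
ratio = naive_bound(S, A, gamma, eps) / sharp_bound(S, A, gamma, eps)
# ratio = |S| / (1 - gamma): an extra state-space factor times an
# extra effective-horizon factor.
```

For $|S|=100$ and $\gamma=0.99$ the naive bound is four orders of magnitude larger, which is why shaving the extra $|S|$ and $\frac{1}{1-\gamma}$ factors matters.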
4. Where does the theory go next?
The preconditions of this sample complexity bound:
- Setting: infinite horizon discounted MDP
- Algorithms: Value Iteration & Policy Iteration
- Assumption: generative model, i.e. uniform access, modeling the transition matrix as $\widehat P(s'\mid s,a)=\frac{\# (s,a,s')}{N}$
Open theoretical questions:
- If the environment does not support the uniform-access (generative model) assumption, then under $\mu$-reset or exploration, which algorithms come with a tight sample complexity guarantee?
- If the state space is large, high-dimensional and continuous, the current bound carries an $O(|S||A|\ln|S||A|)$ factor; can some other algorithm reduce the dependence on $|S|$?
- The algorithms used so far are very naive, value iteration or policy iteration on what is essentially a tabular MDP, e.g. the VI update $Q_{n+1}(s,a)= r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[\max_{a'}Q_n(s',a')]\ \ \forall (s,a)\in S\times A$. If the state-action space is infinite, how should this update even represent the Q function? How should it update it?
So many moving parts. The next step is to gradually extend this naive setting, assumption, and algorithm, extend the results accordingly, and provide theoretical guarantees for the extensions.
A quick summary
We walked through the very simplest problem setting:
- Infinite horizon discounted, tabular MDP
- VI or PI algorithms
- The Generative Model interaction assumption, which sidesteps the exploration problem
Theoretical results in this setting:
- Convergence proofs, convergence rates, and computational complexity for VI & PI
- Sample complexity bounds for VI & PI to obtain an $\epsilon$-optimal policy under the generative model