CS6789-6 Linear Bellman Completion
Background
Previous problem setting:
- Infinite-horizon discounted, tabular MDP
- VI or PI algorithms
- Generative-model access, which sidesteps the exploration problem
Now we add an extra structural assumption, Linear Bellman Completeness:
- Slightly extend the tabular MDP to a large-scale MDP, i.e., from discrete state-action spaces to continuous ones
- Everything else stays the same
- The VI update $Q_{n+1}(s,a)= r(s,a) + \gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[\max_{a'}Q_n(s',a')]\ \ \forall (s,a)\in S\times A$ breaks down, because the Q-function $Q(s,a)$ can no longer be stored as a Q-table; we need function approximation
- The structural assumption used here is linear function approximation
The final question: after enlarging the state-action space in the original setting (roughly, from discrete to continuous), is there a sample-efficient algorithm that finds an $\epsilon$-optimal policy?
Logic
This note contains many proofs and a fairly long chain of reasoning. At a high level:
- The theory of least squares is the foundation, since LSVI uses least squares (Section 4.3.2)
- The D-optimal design assumption guarantees that the dataset used for least squares is sufficiently diverse (Section 3.3)
- Least squares + D-optimal design + Linear Bellman Completion together yield a low inherent Bellman error (Section 4.3.3)
- With a low inherent Bellman error, the policy performance can be bounded (Section 4.2.2)
The note proceeds by introducing concepts, understanding the theorem, and then proving it. Readers who do not care about the proofs can skip Section 4; its overall logic is summarized in Section 4.3.4.
1. From Infinite Horizon to Finite Horizon
To simplify the analysis, we move from the infinite-horizon setting by replacing the effective horizon $\frac{1}{1-\gamma}$ with a finite horizon $H$, and by relaxing the stationarity of the transitions: $P(\cdot\mid s,a)$ becomes time-dependent, $P_h(\cdot\mid s,a)$ for $h\leq H$.
The infinite-horizon definition $\mathcal M=(S,A,r,P,\gamma)$ is therefore rewritten as the finite-horizon definition $\mathcal M=(S,A,r,\{P_h\},H)$.
The connection between them: $H=\frac{1}{1-\gamma}$, and stationarity corresponds to $P_h=P$ for all $h$.
2. Linear Bellman Completeness
To address how to represent the Q-function when moving from tabular to large-scale MDPs, we first assume linear function approximation, and second require it to be Bellman complete; together this is called Linear Bellman Completeness, defined as follows:

Given a known feature map $\phi(s,a)\in \mathbb R^d$, assume it satisfies Bellman completeness, i.e., $\forall s\in S,\ a\in A,\ h\in [H],\ \theta\in\mathbb R^d$, $\exists w\in\mathbb R^d$ such that
$$w^\top\phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\theta^\top \phi(s',a')\right]$$
Recall the Bellman optimality equation for the Q-function:
$$Q^\star(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot|s,a)}\left[\max_{a'}Q^\star(s',a')\right]$$
Observe that Bellman optimality and Linear Bellman Completeness look very similar.
- Intuition: Q used to be a mapping defined over the $(s,a)$ space; it now becomes a point in the space spanned by $\phi(s,a)$.
- When $\phi(s,a)=\text{one-hot}(s,a)$, e.g. $\phi(s,a)=(1,0,0,\dots,0)$, we recover the tabular MDP.
- Suppose the state has two components $s_1,s_2$ and the action is a scalar $a$; the features can be designed as $\phi(s,a)=(s_1s_2,s_1^2,s_2^2,s_1a,s_2a,a,a^2,1)$, which can be viewed as an LQR problem.
One-sentence summary: with $\phi(s,a)$ known, the parameters $w,\theta$ in this assumption uniquely identify different Q-functions, and the optimal Q-function at step $h$ is $Q^\star_h(s,a)=(\theta_h^\star)^\top\phi(s,a)$.
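To make the one-hot special case concrete, here is a minimal numerical sketch (not from the lecture; the random MDP, sizes, and names such as `phi` are made-up illustrations): with one-hot features, any Bellman backup of a linear Q-function is again exactly linear in the features, which is precisely the Linear Bellman Completeness condition.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3                      # small tabular MDP (assumed sizes, for illustration)
d = S * A                        # one-hot feature dimension

# random reward in [0, 1] and random transition kernel P_h(s' | s, a)
r = rng.uniform(size=(S, A))
P = rng.uniform(size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)

def phi(s, a):
    """One-hot feature phi(s, a) in R^{S*A}."""
    x = np.zeros(d)
    x[s * A + a] = 1.0
    return x

# pick an arbitrary linear Q at step h+1: Q_{h+1}(s, a) = theta^T phi(s, a)
theta = rng.normal(size=d)
Q_next = np.array([[theta @ phi(s, a) for a in range(A)] for s in range(S)])

# Bellman backup: (B_h Q_{h+1})(s, a) = r(s, a) + E_{s'}[max_{a'} Q_{h+1}(s', a')]
backup = r + P @ Q_next.max(axis=1)

# Linear Bellman Completeness asks for w with w^T phi(s, a) = backup(s, a).
# With one-hot features w can be read off directly: w[s*A + a] = backup[s, a].
w = backup.reshape(-1)
recon = np.array([[w @ phi(s, a) for a in range(A)] for s in range(S)])
print(np.allclose(recon, backup))   # True: the backup is exactly linear in phi
```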
3. Theory of LSVI
3.1 The LSVI Algorithm
LSVI (Least Squares Value Iteration) is a sample-efficient algorithm under the strong Linear Bellman Completion assumption; it finds an $\epsilon$-optimal policy using polynomially many samples. First rewrite Linear Bellman Completion in its time-dependent form:
$$\theta_h^\top \phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_{h}(\cdot|s,a)}\left[\max_{a'}\theta_{h+1}^\top \phi(s',a')\right]$$
The LSVI procedure:
- At each time step $h\in \{0,1,\dots,H-1\}$, collect data with the generative model $s'\sim P_h(\cdot\mid s,a)$, obtaining $H$ datasets $D_0,D_1,\dots,D_{H-1}$, where each sample is stored as a tuple $(s,a,r,s')$
- Handle the boundary case by setting $V_H(s)=0,\ \forall s\in S$
- For $h=H-1\rightarrow 0$, learn the parameter $\theta_h$ from the dataset $D_h$:
  1. $\theta_h=\arg\min_{\theta}\sum_{(s,a,r,s')\in D_h}\left(\theta^\top \phi(s,a)-r-V_{h+1}(s')\right)^2$
  2. $V_h(s)=\max_a \theta_h^\top\phi(s,a),\ \forall s$

The final policy is read off from the learned Q-values: $\widehat \pi_h(s)=\arg\max_a\theta_h^\top\phi(s,a)\quad \forall h$
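A minimal Python/numpy sketch of this backward loop is given below. It assumes the datasets $D_h$ have already been collected with the generative model and that a feature map is available; `datasets`, `phi`, `actions` and the other names are placeholders, not part of the lecture notes.

```python
import numpy as np

def lsvi(datasets, phi, actions, H, d):
    """Least Squares Value Iteration under Linear Bellman Completeness.

    datasets: list of H lists, datasets[h] = [(s, a, r, s_next), ...]
    phi:      feature map, phi(s, a) -> np.ndarray of shape (d,)
    actions:  finite action set to maximize over
    Returns [theta_0, ..., theta_{H-1}].
    """
    thetas = [np.zeros(d) for _ in range(H)]

    def V(h, s):
        # boundary case V_H(s) = 0; otherwise max_a theta_h^T phi(s, a)
        if h == H:
            return 0.0
        return max(thetas[h] @ phi(s, a) for a in actions)

    for h in range(H - 1, -1, -1):
        # regression targets y = r + V_{h+1}(s')
        X = np.array([phi(s, a) for (s, a, r, s_next) in datasets[h]])
        y = np.array([r + V(h + 1, s_next) for (s, a, r, s_next) in datasets[h]])
        # theta_h = argmin_theta sum (theta^T phi(s,a) - y)^2, solved by least squares
        thetas[h], *_ = np.linalg.lstsq(X, y, rcond=None)
    return thetas

def greedy_policy(thetas, phi, actions, h, s):
    """pi_hat_h(s) = argmax_a theta_h^T phi(s, a)."""
    return max(actions, key=lambda a: thetas[h] @ phi(s, a))
```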
3.2 Sample Complexity of LSVI
Theorem:

In the finite-horizon, large-scale MDP, generative-model setting, assume Linear Bellman Completion and let $\widehat \pi$ be the policy returned by LSVI. If the collected datasets $\{D_h\}_{h=0}^{H-1}$ satisfy a D-optimal design, then with probability at least $1-\delta$ the policy is $\epsilon$-optimal, i.e. $V^\star-V^{\widehat \pi}\leq \epsilon$, and the sample complexity of each dataset $D_h$ is
$$O\!\left(d^2+H^6d^2/\epsilon^2\right)$$
Here $d$ is the dimension of the feature map $\phi(s,a)$; the (possibly infinite) $|S||A|$ factor in the tabular sample complexity is replaced by the finite feature dimension $d$.
Why do we still need a D-optimal design on top of Linear Bellman Completion? Because LSVI performs least-squares regression on the dataset $D_h$, which inevitably depends on the data distribution: if the test points are far from the training data, the regression cannot generalize, and the "optimal" part of $\epsilon$-optimal cannot be guaranteed. Hence we add D-optimal design as an assumption that makes the dataset sufficiently diverse.
3.3 D-optimal Design
Let the samples live in a space $\mathcal X\subset \mathbb R^d$. Sampling from $\mathcal X$ according to a distribution $p$ gives a dataset $D=\{x: x\sim p\}$; measuring the diversity of $D$ amounts to measuring the diversity of the sampling distribution $p$ over $\mathcal X$.
D-optimality: suppose the sampling distribution $p$ is supported on at most $d(d+1)/2$ points of $\mathcal X$, define $\Sigma=\mathbb E_{x\sim p}[xx^\top]$, and require $x^\top\Sigma^{-1} x\leq d\ \ \forall x\in \mathcal X$; such a $p$ is called a D-optimal design.
Here $d(d+1)/2$ bounds the number of support points of $p$, $\Sigma$ captures the information carried by the points of the distribution, and $x^\top\Sigma^{-1}x\leq d,\ \forall x\in \mathcal X$ says that every point of the sample space stays within "distance" $d$ of the points of $p$ (measured in the $\Sigma^{-1}$ norm), which quantifies diversity.
Taking D-optimality as the criterion, the sampling distribution is selected by maximizing $p=\arg\max_v \ln\det \mathbb E_{x\sim v}[xx^\top]$.
Because the sampling distribution $p$ is a D-optimal design, the dataset $D$ drawn from $\mathcal X$ according to it has a certain diversity (coverage).
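For a finite candidate set $\mathcal X=\{x_1,\dots,x_n\}$, the log-det objective above can be approximately maximized with a simple multiplicative-weight update. The sketch below is only an illustration under that finite-set assumption; the candidate matrix and iteration count are made up, and this is not the only way to compute a D-optimal design.

```python
import numpy as np

def d_optimal_design(candidates, n_iter=500):
    """Approximate D-optimal design over a finite set of feature vectors.

    candidates: array of shape (n, d), each row a point x in the sample space.
    Returns a probability vector p that (approximately) maximizes
    log det E_{x~p}[x x^T].
    """
    n, d = candidates.shape
    p = np.full(n, 1.0 / n)                            # start from the uniform design
    for _ in range(n_iter):
        M = candidates.T @ (p[:, None] * candidates)   # E_{x~p}[x x^T]
        lev = np.einsum('ij,jk,ik->i', candidates, np.linalg.inv(M), candidates)
        p *= lev / d                                   # multiplicative update; at the optimum max lev = d
        p /= p.sum()
    return p

# tiny usage example with random candidate features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
p = d_optimal_design(X)
M = X.T @ (p[:, None] * X)
print(np.max(np.einsum('ij,jk,ik->i', X, np.linalg.inv(M), X)))  # close to d = 4
```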
3.4 Final Statement of the LSVI Sample Complexity
4. Proof of the LSVI Sample Complexity
4.1 The Time-Dependent Finite-Horizon MDP
Since the iteration is on Q-values, rewrite the Q-related formulas:
- Bellman Optimality
$$\begin{aligned} &\text{Infinite, stationary:}\\ Q^\star(s,a)&=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot|s,a)}\left[\max_{a'}Q^\star (s',a')\right]\\ V^\star(s)&=\max_a Q^\star(s,a)\\ Q&=\mathcal BQ\ \text{(shorthand)}\\ &\text{Finite, non-stationary:}\\ Q_h^\star(s,a)&=r(s,a)+ \mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}Q_{h+1}^\star (s',a')\right]\\ V_h^\star(s)&=\max_a Q_h^\star(s,a)\\ Q_h&=\mathcal B_hQ_{h+1}\ \text{(shorthand)} \end{aligned}$$
- In the non-stationary case, $Q_h^\star$ satisfies:
$$Q^\star_H=0,\quad Q^\star_{H-1}=r,\quad Q_h^\star=\mathcal B_h Q^\star_{h+1}$$
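As a concrete reference point, the backward recursion $Q_H^\star=0$, $Q_h^\star=\mathcal B_h Q_{h+1}^\star$ can be written out exactly for a small tabular, time-dependent MDP. The sketch below (illustrative only; the array shapes for `r` and `P` are assumptions) is what LSVI approximates with least squares when the state space is too large to enumerate.

```python
import numpy as np

def finite_horizon_q_vi(r, P):
    """Exact backward Q-value iteration for a time-dependent finite-horizon MDP.

    r: array (H, S, A) with r[h, s, a] the reward at step h
    P: array (H, S, A, S) with P[h, s, a, s'] the transition probability at step h
    Returns Q of shape (H, S, A), where Q[h] = Q_h^*.
    """
    H, S, A = r.shape
    Q = np.zeros((H + 1, S, A))           # Q[H] = 0 handles the boundary case
    for h in range(H - 1, -1, -1):
        V_next = Q[h + 1].max(axis=1)     # V_{h+1}^*(s') = max_{a'} Q_{h+1}^*(s', a')
        Q[h] = r[h] + P[h] @ V_next       # Q_h^* = B_h Q_{h+1}^*
    return Q[:H]
```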
4.2 Inherent Bellman Error
If the estimates $\widehat Q_h$ produced by some algorithm satisfy $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ (this condition is called the inherent Bellman error), then
- $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in\{0,1,2,\dots,H-1\}$
- $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$, where $\widehat \pi_h(s)=\arg\max_a\widehat Q_h(s,a)$
4.2.1 Proof of $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon$
- Since $Q_H(s,a)=0$, applying the inherent Bellman error condition $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ at $h=H-1$ gives $\|\widehat Q_{H-1}-r\|_\infty\leq \epsilon$
- Since $Q^\star_{H-1}=r$, we get $\|\widehat Q_{H-1}-Q_{H-1}^\star\|_\infty\leq \epsilon$
- Bound the target quantity:
$$\begin{aligned} \|\widehat Q_h-Q^\star_h\|_\infty&\leq \|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty+\|\mathcal B_h\widehat Q_{h+1}-Q^\star_h\|_\infty\\ &\leq \epsilon +\|\mathcal B_h\widehat Q_{h+1}-\mathcal B_hQ^\star_{h+1}\|_\infty\\ &\leq \epsilon +\left\|\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat Q_{h+1}(s',a')\right]-\mathbb E_{s''\sim P_h(\cdot|s,a)}\left[\max_{a''}Q^\star_{h+1}(s'',a'')\right]\right\|_\infty\\ &\leq \epsilon + \|\widehat Q_{h+1}-Q^\star_{h+1}\|_\infty \end{aligned}$$
- Unrolling this recursion down to $\|\widehat Q_{H-1}-Q_{H-1}^\star\|_\infty\leq \epsilon$ yields $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon\ \ \forall h\in\{0,1,2,\dots,H-1\}$
4.2.2 Proof of $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$
- Work backwards in time; first consider step $H-1$ (each algebraic step below is just re-aligning the actions):
$$\begin{aligned} |V_{H-1}^\star(s)-V_{H-1}^{\widehat \pi}(s)|&=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))|\\ &=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))+Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))|\\ &=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &= |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s)) + \widehat Q_{H-1}(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s)) + \widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s))| + |\widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq 2\epsilon \end{aligned}$$
- Note that $\forall s,a$: $Q_{H-1}^\star(s,a)=r(s,a)=Q^{\widehat \pi}_{H-1}(s,a)$, hence $Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))=0$ (this justifies the third equality)
- On the other hand, $Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-\widehat Q_{H-1}(s,\widehat \pi_{H-1}(s))\neq0$ in general
- Now consider the recursion between consecutive steps $h$:
$$\begin{aligned} |V_{h}^\star(s)-V_{h}^{\widehat \pi}(s)|&=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q^{\widehat \pi}_{h}(s,\widehat \pi_{h}(s))|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+Q_{h}^\star(s,\widehat \pi_{h}(s))-Q^{\widehat \pi}_{h}(s,\widehat \pi_{h}(s))|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-\widehat Q_h(s,\widehat \pi_h(s))+\widehat Q_h(s,\widehat \pi_h(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &\leq|Q_{h}^\star(s,\pi^\star_{h}(s))-\widehat Q_h(s,\pi^\star_h(s))+\widehat Q_h(s,\widehat \pi_h(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &\leq 2(H-h)\epsilon +|\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]| \end{aligned}$$
- Expanding this recursion over the time steps:
$$\begin{aligned} |V_{H-1}^\star(s)-V_{H-1}^{\widehat \pi}(s)|&\leq 2\epsilon\\ |V_{H-2}^\star(s)-V_{H-2}^{\widehat \pi}(s)|&\leq 4\epsilon + 2\epsilon\\ |V_{H-3}^\star(s)-V_{H-3}^{\widehat \pi}(s)|&\leq 6\epsilon + 4\epsilon + 2\epsilon\\ &\cdots\\ |V_{h}^\star(s)-V_{h}^{\widehat \pi}(s)|&\leq 2\epsilon(H-h)H \end{aligned}$$
- Hence at $h=0$ we get $|V_{0}^\star(s)-V_{0}^{\widehat \pi}(s)|\leq 2 H^2 \epsilon$
4.3 Assumptions and Proof for LSVI
Key point: once we show that LSVI satisfies $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$, the results above apply.
4.3.1 Implicit Assumptions of LSVI
- Linear Bellman Completeness
$$\begin{aligned} \widehat Q_h(s,a)&=\widehat \theta_h^\top\phi(s,a)\\ \mathcal B_h\widehat Q_{h+1}(s,a)&=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat Q_{h+1}(s',a')\right]\\ &=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right] \end{aligned}$$
- D-optimal Design
$$\begin{aligned} &\phi(s,a)\in \mathbb R^d\\ &\Sigma=\mathbb E_{(s,a)\sim p}\left[\phi(s,a)\phi(s,a)^\top\right]\\ &p=\arg\max_v \ln\det \mathbb E_{(s,a)\sim v}\left[\phi(s,a)\phi(s,a)^\top\right]\\ &D_h=\{(s^{(i)},a^{(i)},r^{(i)},(s')^{(i)})\mid (s,a)\sim p,\ s'\sim P_h(\cdot|s,a)\}_{i=1}^{N}\\ &\Sigma_h=\sum_{(s,a)\in D_h}\phi(s,a)\phi(s,a)^\top \text{ is invertible, and } \Sigma_h\succeq N\Sigma\\ &\forall (s,a):\ \phi(s,a)^\top \Sigma_h^{-1}\phi(s,a)\leq \frac{d}{N} \end{aligned}$$
- Least squares inside LSVI
$$\begin{aligned} \text{Standard LS:}\quad &y=\theta^\top x+\epsilon\\ \text{LS for }Q_h\text{:}\quad &y=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right],\quad \theta=\widehat \theta_h,\ x=\phi(s,a) \end{aligned}$$
4.3.2 Ordinary Least Squares
A brief review of ordinary least squares theory; the implicit setting here is a fixed-design analysis, and the detailed proof can be found in the 2012 paper Random Design Analysis of Ridge Regression.
Given a dataset of $N$ samples $D=\{x^{(i)},y^{(i)}\}_{i=1}^N$, where $x^{(i)}\in \mathbb R^d$, $y^{(i)}\in \mathbb R$, and $x^{(i)}_j$ denotes the $j$-th coordinate of the $i$-th sample. The structural assumption is linear: there exists an optimal parameter vector $\theta^\star$ such that $y^{(i)}=(\theta^\star)^\top x^{(i)}+\epsilon^{(i)}$, where the noise satisfies $\mathbb E[\epsilon]=0$ and $|\epsilon|\leq \sigma$.
Let $X\in \mathbb R^{N\times d}$ be the sample matrix, with (unnormalized) covariance $\Sigma=X^\top X=\sum_{i=1}^N x^{(i)}(x^{(i)})^\top$, assumed invertible. The least-squares estimate is
$$\hat\theta=\arg\min_\theta \|X\theta-Y\|^2=(X^\top X)^{-1}X^\top Y=\Sigma^{-1}X^\top Y$$
Following the derivation in the 2012 paper Random Design Analysis of Ridge Regression, for any $\delta\in (0,1)$, with probability at least $1-\delta$:
$$(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star) =\|\hat \theta -\theta^\star\|_{\Sigma}^2 \leq \sigma^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$$
Expanding this, $(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star)=(\hat \theta -\theta^\star)^\top X^\top X(\hat \theta -\theta^\star)=(X(\hat \theta-\theta^\star))^\top (X(\hat \theta-\theta^\star))=\sum_{i=1}^N \left(\hat \theta^\top x^{(i)}-(\theta^\star)^\top x^{(i)}\right)^2$, i.e., the in-sample prediction error.
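A quick numerical sanity check of this fixed-design bound (an illustrative sketch with made-up dimensions and bounded noise, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 2000, 8, 0.5
X = rng.normal(size=(N, d))                     # fixed design matrix
theta_star = rng.normal(size=d)
eps = rng.uniform(-sigma, sigma, size=N)        # bounded, zero-mean noise
Y = X @ theta_star + eps

Sigma = X.T @ X
theta_hat = np.linalg.solve(Sigma, X.T @ Y)     # (X^T X)^{-1} X^T Y

# in-sample prediction error ||theta_hat - theta_star||_Sigma^2
err = (theta_hat - theta_star) @ Sigma @ (theta_hat - theta_star)
delta = 0.05
bound = sigma**2 * (d + np.sqrt(d * np.log(1 / delta)) + 2 * np.log(1 / delta))
print(err, bound)   # err should typically fall below the high-probability bound
```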
4.3.3 The Inherent Bellman Error of LSVI
First, plug in the least-squares result to bound the gap between the step-$h$ estimate $\hat \theta_h$ and the parameter $\theta^\star$ that exactly realizes the Bellman backup of $\widehat\theta_{h+1}$.
- Applying the LS result:
$$(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star) =\|\hat \theta -\theta^\star\|_{\Sigma}^2 \leq \sigma^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$$
- The correspondence when LSVI fits the parameter $\widehat \theta_h$ from dataset $D_h$ at step $h$:
- Each sample in $D_h$ is a tuple $(s,a,r,s')$ with input $(s,a)$ and $s'\sim P_h(\cdot|s,a)$; its true regression target is $y=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')]$
- For one sampled $(s,a,r,s')$ with $s'\sim P_h(\cdot|s,a)$, the observed target is $y^{(i)}=r(s,a)+\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')$
- Hence for a given $(s,a)$, the noise introduced by sampling is
$$\epsilon^{(i)}=y^{(i)}-y=r(s,a)+\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')-r(s,a)-\mathbb E_{s''\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s'',a')\right]$$
Naturally we have
$$\mathbb E[\epsilon]=\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right]-\mathbb E_{s''\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s'',a')\right]=0$$
In the infinite-horizon case the Q-function lies in $[0,\frac{1}{1-\gamma}]$ (with rewards in $[0,1]$), so in the finite-horizon case it lies in $[0,H]$; the noise term therefore satisfies
$$|\epsilon|\leq 2H$$
- $\Sigma=\Sigma_h=\sum_{(s,a)\in D_h}\phi(s,a)\phi(s,a)^\top$
- $\theta^\star$ is the parameter mapping $x=\phi(s,a)$ to the true target $y$, i.e. $(\theta^\star)^\top x=y$ (it exists thanks to Linear Bellman Completeness); plugging in:
$$(\theta^\star)^\top \phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right]$$
- So the least-squares fit of LSVI at step $h$ gives:
$$(\hat \theta_h -\theta^\star)^\top \Sigma_h (\hat \theta_h -\theta^\star)=\|\hat \theta_h-\theta^\star\|_{\Sigma_h}^2 \leq 4H^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$$
- Bound the target quantity:
$$\begin{aligned} \|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty&=\max_{s,a}\left|\hat\theta_h^\top \phi(s,a)-r(s,a)-\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right]\right|\\ &=\max_{s,a}\left|\hat\theta_h^\top \phi(s,a)-(\theta^\star)^\top \phi(s,a)\right|\\ &=\max_{s,a}\left|(\hat\theta_h-\theta^\star)^\top \phi(s,a)\right|\\ &=\max_{s,a}\left|(\hat\theta_h-\theta^\star)^\top \Sigma_h^{1/2}\,\Sigma_h^{-1/2}\phi(s,a)\right|\\ &\leq \max_{s,a}\sqrt{(\hat\theta_h-\theta^\star)^\top\Sigma_h (\hat\theta_h-\theta^\star)}\;\sqrt{\phi(s,a)^\top\Sigma_h^{-1}\phi(s,a)}\quad\text{(Cauchy-Schwarz)}\\ &\leq\sqrt{4H^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)\cdot \frac{d}{N}}\\ &\leq \frac{4Hd\sqrt{\ln (1/\delta)}}{\sqrt{N}} \end{aligned}$$
  (the last step assumes $\ln(1/\delta)\geq 1$)
- To determine the size $N$ of each dataset $D_h$, set $\epsilon= \frac{4Hd\sqrt{\ln (1/\delta)}}{\sqrt{N}}$; taking a union bound over the $H$ time steps replaces $\delta$ by $\delta/H$ and gives $N=\frac{16H^2d^2\ln(H/\delta)}{\epsilon^2}$
4.3.4 Putting the LSVI Sample Complexity Together
- Section 4.2: if the estimates $\widehat Q_h$ satisfy $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ (inherent Bellman error), then
  - $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in\{0,1,2,\dots,H-1\}$
  - $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$, where $\widehat \pi_h(s)=\arg\max_a\widehat Q_h(s,a)$
- Section 4.3.3: using least squares + D-optimal design, once $N\geq \frac{16H^2d^2\ln(H/\delta)}{\epsilon^2}$ we have $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$
- Overall logic:
  - Obtain the optimal sampling distribution $p$ over the $(s,a)$ space via $p=\arg\max_v \ln\det \mathbb E_{(s,a)\sim v}[\phi(s,a)\phi(s,a)^\top]$
  - Then at each step $h=0,1,\dots,H-1$, for every $(s,a)$ in the support of $p$, draw $\lceil p(s,a)N\rceil$ next states $s'$ from the true distribution $P_h(s'\mid s,a)$, so $|D_h|=\sum_{s,a}\lceil p(s,a)N\rceil$.
  - The total number of samples over the $H$ datasets is therefore
$$\text{total samples}=H\sum_{(s,a)\in \operatorname{supp}(p)}\lceil p(s,a)N\rceil \leq H\sum_{(s,a)\in \operatorname{supp}(p)}(1+p(s,a)N)$$
  - Set the target error $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon=\epsilon'$, so $\epsilon=\frac{\epsilon'}{2H^2}$; this requires an inherent Bellman error $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \frac{\epsilon'}{2H^2}$, hence $N= \frac{64H^6d^2\ln (H/\delta)}{(\epsilon')^2}$
  - Therefore, to guarantee an $\epsilon'$-optimal policy, i.e. $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon=\epsilon'$, we need (using that $p$ is supported on at most $d(d+1)/2\leq d^2$ points)
$$\text{total samples}=H\sum_{(s,a)\in \operatorname{supp}(p)}\lceil p(s,a)N\rceil \leq H\sum_{(s,a)\in \operatorname{supp}(p)}(1+p(s,a)N)\leq H\left(d^2+\frac{64H^6d^2\ln (H/\delta)}{(\epsilon')^2}\right)$$
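A tiny helper for plugging concrete values into these formulas (purely illustrative arithmetic; the constants come from the bound above and hide no additional content):

```python
import numpy as np

def lsvi_sample_sizes(H, d, eps_prime, delta):
    """Per-step dataset size N and total sample count from the bound above."""
    N = 64 * H**6 * d**2 * np.log(H / delta) / eps_prime**2
    total = H * (d**2 + N)        # support of p has at most d(d+1)/2 <= d^2 points
    return int(np.ceil(N)), int(np.ceil(total))

print(lsvi_sample_sizes(H=10, d=20, eps_prime=0.1, delta=0.05))
```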
Summary
There are many assumptions; let's group them.
- Assumptions about the problem formulation itself: finite horizon, large-scale MDP, time-dependent (non-stationary) policy
- Structural assumptions on the optimal Q-values: linear representation, with the Bellman backup preserving linearity, i.e. Linear Bellman Completion
- Distributional assumption on dataset diversity: D-optimality
- Assumption on the least squares inside the algorithm: fixed-design analysis
These assumptions are what guarantee that the corresponding algorithm finds an $\epsilon$-optimal policy with polynomially many samples.