One RL Theory Fundamental a Day (8): Linear Bellman Completeness

References

RL Theory Book, Chapter 3

Background

The previous problem setting:

  1. Infinite-horizon discounted, tabular MDP
  2. VI or PI algorithms
  3. Generative-model interaction, which sidesteps the exploration problem

We now add an extra structural assumption: Linear Bellman Completeness.

  • Slightly extend the tabular MDP to a large-scale MDP, i.e., from discrete state-action spaces to continuous ones
  • Everything else stays the same
  • The VI update $Q_{n+1}(s,a)= r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[\max_{a'}Q_n(s',a')]\ \ \forall (s,a)\in S\times A$ breaks down, because the Q-function $Q(s,a)$ can no longer be represented as a Q-table; we need function approximation
  • The structural assumption here is linear function approximation

The final question: after enlarging the state-action space of the original setting (loosely, from discrete to continuous), is there a sample-efficient algorithm that finds an $\epsilon$-optimal policy?

Roadmap

This post has many proofs and a long chain of reasoning. At a high level:

  1. The theory of least squares is the foundation, since LSVI uses least squares (Section 4.3.2)
  2. The D-optimal design assumption guarantees that the datasets used for least squares are sufficiently diverse (Section 3.3)
  3. Least squares + D-optimal design + Linear Bellman Completeness together yield a low inherent Bellman error (Section 4.3.3)
  4. With a low inherent Bellman error, the policy's performance can be bounded (Section 4.2.2)

The post proceeds by introducing concepts, then understanding the theorem, then proving it. If you do not want to read the proofs, skip Section 4 entirely; its overall logic is summarized in Section 4.3.4.

1. From Infinite Horizon to Finite Horizon

For ease of analysis we move to the finite-horizon setting: replace the effective horizon $\frac{1}{1-\gamma}$ of the infinite-horizon case with a finite horizon $H$, and additionally drop the stationarity of the transitions, so $P(\cdot\mid s,a)$ becomes time-dependent: $P_h(\cdot\mid s,a)$ for $h\leq H$.

So the infinite-horizon definition $\mathcal M=(S,A,r,P,\gamma)$ is rewritten as the finite-horizon definition $\mathcal M=(S,A,r,\{P_h\},H)$.

The connection between the two: $H=\frac{1}{1-\gamma}$, and stationarity corresponds to $\lim_{h\rightarrow \infty}P_h=P$.

2. Linear Bellman Completeness

To solve the Q-function representation problem when moving from tabular to large-scale MDPs, we first assume linear function approximation, and second require it to satisfy Bellman completeness. The combination is called Linear Bellman Completeness, defined as follows:

Given a known feature map $\phi(s,a)\in \mathbb R^d$, assume it satisfies Bellman completeness: for all $s\in S$, $a\in A$, $h\in [H]$, $\theta\in\mathbb R^d$, there exists $w\in\mathbb R^d$ such that
$$w^\top\phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\theta^\top \phi(s',a')\right]$$

For comparison, the Bellman optimality equation for the Q-function is:
$$Q^\star(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot|s,a)}\left[\max_{a'}Q^\star(s',a')\right]$$

Observe that Bellman optimality and Linear Bellman Completeness look very similar.

  • Intuition: Q used to be a map defined on the $(s,a)$ space; now it becomes a point in the space spanned by $\phi(s,a)$.
  • With $\phi(s,a)=\text{one-hot}(s,a)$, e.g. $\phi(s,a)=(1,0,0,\dots,0)$, we recover the tabular MDP as a special case.
  • If the state has two components $s_1,s_2$ and the action is a scalar $a$, a feature map such as $\phi(s,a)=(s_1s_2,s_1^2,s_2^2,s_1a,s_2a,a,a^2,1)$ corresponds to an LQR problem. (Both feature maps are sketched in code below.)

One-sentence summary: with $\phi(s,a)$ known, the parameters $w,\theta$ of this assumption uniquely identify different Q-functions; the optimal Q-function at step $h$ is $Q^\star_h(s,a)=(\theta_h^\star)^\top\phi(s,a)$.
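
As a quick illustration of the two bullet-point examples above, here is a minimal Python sketch of both feature maps. The state-action encoding in the tabular case is an assumption made purely for illustration.

```python
import numpy as np

def phi_tabular(s, a, n_states, n_actions):
    """One-hot features: recovers the tabular MDP (Q-table) as a special case."""
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0   # assumed enumeration of (s, a) pairs
    return v

def phi_lqr(s, a):
    """Quadratic features for a 2-dim state s=(s1,s2) and scalar action a,
    matching the LQR example above."""
    s1, s2 = s
    return np.array([s1 * s2, s1**2, s2**2, s1 * a, s2 * a, a, a**2, 1.0])

print(phi_tabular(1, 0, n_states=3, n_actions=2))  # one-hot at index 2
print(phi_lqr((0.5, -1.0), 2.0))
```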

3. Theory of LSVI

3.1 The LSVI algorithm

LSVI stands for Least Squares Value Iteration. Under the strong assumption of Linear Bellman Completeness, it is a sample-efficient algorithm that finds an $\epsilon$-optimal policy in polynomial time. First rewrite Linear Bellman Completeness in a time-dependent form:
$$\theta_h^\top \phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_{h}(\cdot|s,a)}\left[\max_{a'}\theta_{h+1}^\top \phi(s',a')\right]$$

The LSVI procedure:

  1. At each step $h\in \{0,1,\dots,H-1\}$, collect data through the generative model $s'\sim P_h(\cdot|s,a)$, yielding datasets $D_0,D_1,\dots,D_{H-1}$, where each sample is stored as a tuple $(s,a,r,s')$
  2. Handle the boundary case: set $V_H(s)=0$ for all $s\in S$
  3. For $h=H-1\rightarrow 0$, learn the parameter $\theta_h$ from dataset $D_h$:
    $$\theta_h=\operatorname*{argmin}_{\theta}\sum_{(s,a,r,s')\in D_h}\left(\theta^\top \phi(s,a)-r-V_{h+1}(s')\right)^2,\qquad V_h(s)=\max_a \theta_h^\top\phi(s,a)\quad \forall s$$

The final policy is greedy with respect to the learned Q-values: $\widehat \pi_h(s)=\operatorname*{argmax}_a\theta_h^\top\phi(s,a)$ for all $h$. A code sketch of the whole loop follows.
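
Below is a minimal Python sketch of LSVI under a generative model. The helper names (`phi`, `reward`, `sample_next`) and the one-sample-per-pair data collection are illustrative assumptions, not part of the original text.

```python
import numpy as np

def lsvi(phi, reward, sample_next, actions, H, D):
    """Least Squares Value Iteration sketch.

    phi(s, a)            -> feature vector (np.ndarray of length d)
    reward(s, a)         -> scalar reward r(s, a)
    sample_next(h, s, a) -> one draw s' ~ P_h(.|s,a) from the generative model
    D                    -> list over h of state-action pairs [(s, a), ...]
    Returns theta[0..H-1] with Q_h(s, a) = theta[h] @ phi(s, a).
    """
    s0, a0 = D[0][0]
    d = len(phi(s0, a0))
    theta = [np.zeros(d) for _ in range(H)]

    def V(h, s):                      # V_h(s) = max_a theta_h^T phi(s,a); V_H = 0
        if h == H:
            return 0.0
        return max(theta[h] @ phi(s, a) for a in actions)

    for h in range(H - 1, -1, -1):    # backwards in time, h = H-1, ..., 0
        X = np.array([phi(s, a) for (s, a) in D[h]])
        y = np.array([reward(s, a) + V(h + 1, sample_next(h, s, a))
                      for (s, a) in D[h]])       # regression targets
        theta[h] = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares fit
    return theta

# Greedy policy from the learned parameters:
#   pi_hat_h(s) = argmax_a theta[h] @ phi(s, a)
```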

3.2 LSVI sample complexity

The theorem:
In the finite-horizon, large-scale MDP, generative-model setting, assume Linear Bellman Completeness and let $\widehat \pi$ be the policy produced by LSVI. If the collected datasets $\{D_h\}_{h=0}^{H-1}$ satisfy a D-optimal design, then with probability at least $1-\delta$ the policy is $\epsilon$-optimal, i.e. $V^\star - V^{\widehat \pi}\leq \epsilon$, and the sample complexity of each dataset $D_h$ is $O(d^2+H^6d^2/\epsilon^2)$.

Here $d$ is the dimension of the feature map $\phi(s,a)$: the $|S||A|$ factor of the tabular sample complexity, which would be infinite here, is replaced by the finite feature dimension.

Why is a D-optimal design still needed on top of Linear Bellman Completeness? Because LSVI performs least-squares regression on the dataset $D_h$, which necessarily involves the data distribution: if the test data lie far from the training data, the regression cannot generalize, and the "optimal" in $\epsilon$-optimal cannot be guaranteed. So we add D-optimal design, an assumption that makes the dataset sufficiently diverse.

3.3 D-optimal Design

Let the samples live in a space $\mathcal X\subset \mathbb R^d$, and sample from $\mathcal X$ under a distribution $p$ to obtain a dataset $D=\{x:x\sim p\}$. Measuring the diversity of $D$ amounts to measuring the diversity of the sampling distribution $p$ over $\mathcal X$.

D-optimality: suppose the sampling distribution $p$ is supported on at most $d(d+1)/2$ points of $\mathcal X$, define $\Sigma=\mathbb E_{x\sim p}[xx^\top]$, and require $x^\top\Sigma^{-1} x\leq d$ for all $x\in \mathcal X$. Then $p$ is called a D-optimal design.

The support size $d(d+1)/2$ guarantees $p$ places mass on a bounded number of points; $\Sigma$ captures the information carried by the points under $p$; and $x^\top\Sigma^{-1} x\leq d$ for all $x\in \mathcal X$ says every point of the sample space is within "distance" $d$ of the distribution $p$, which quantifies diversity.

D D D-optimality为原则,来量化多样性,最大化以下目标筛选采样分布 p = arg max ⁡ v ln ⁡ det ⁡ E x ∼ v [ x x ⊤ ] p=\argmax_v \ln\det \mathbb E_{x\sim v}[xx^\top] p=vargmaxlndetExv[xx]

Because $p$ is a D-optimal design, the dataset $D$ sampled from $\mathcal X$ under $p$ has guaranteed diversity (coverage). A sketch of computing such a design follows.
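
One standard way to compute the $\operatorname{argmax}\ln\det$ design above over a finite candidate set is Titterington's multiplicative-weights iteration; this algorithmic choice is mine, not the book's. A minimal sketch:

```python
import numpy as np

def d_optimal_design(X, n_iters=500):
    """Multiplicative-weights iteration for a D-optimal design.

    X: (n, d) array of candidate feature vectors.
    Returns design weights p over the candidates with sum(p) == 1.
    """
    n, d = X.shape
    p = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        Sigma = X.T @ (p[:, None] * X)   # E_{x~p}[x x^T]
        g = np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X)  # x^T Sigma^{-1} x
        p = p * g / d                    # stays a distribution: sum_i p_i g_i = d
    return p

# Check the D-optimality condition max_x x^T Sigma^{-1} x <= d on random data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
p = d_optimal_design(X)
Sigma = X.T @ (p[:, None] * X)
print(np.max(np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X)))  # approaches 5
```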

3.4 Final statement of the LSVI sample complexity

(The formal proposition was shown as an image in the original post and is not recoverable here; see the theorem statement in Section 3.2.)

4. Proof of the LSVI Sample Complexity

4.1 Getting familiar with time-dependent finite-horizon MDPs

The proof mainly works with Q-value iteration, so we first rewrite the Q-related formulas:

  1. Bellman optimality
    $$\begin{aligned} \text{Infinite, stationary:}&\\ Q^\star(s,a)&=r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot|s,a)}\left[\max_{a'}Q^\star (s',a')\right]\\ V^\star(s)&=\max_a Q^\star(s,a)\\ Q&=\mathcal BQ\ \text{(shorthand)}\\ \text{Finite, non-stationary:}&\\ Q_h^\star(s,a)&=r(s,a)+ \mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}Q_{h+1}^\star (s',a')\right]\\ V_h^\star(s)&=\max_a Q_h^\star(s,a)\\ Q_h&=\mathcal B_hQ_{h+1}\ \text{(shorthand)} \end{aligned}$$
  2. In the non-stationary case $Q_h^\star$ satisfies:
    $$Q^\star_H=0,\quad Q^\star_{H-1}=r,\quad Q_h^\star=\mathcal B_h Q^\star_{h+1}$$
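
For concreteness, here is a tabular sketch of the non-stationary backup operator $\mathcal B_h$; the array shapes are assumptions for illustration. It also checks the boundary identity $Q^\star_{H-1}=\mathcal B_{H-1}Q^\star_H=r$.

```python
import numpy as np

def bellman_backup(r, P_h, Q_next):
    """(B_h Q_{h+1})(s,a) = r(s,a) + E_{s'~P_h(.|s,a)}[max_{a'} Q_{h+1}(s',a')].

    Assumed shapes: r is (S, A); P_h is (S, A, S); Q_next is (S, A).
    """
    V_next = Q_next.max(axis=1)     # V_{h+1}(s') = max_{a'} Q_{h+1}(s', a')
    return r + P_h @ V_next         # (S,A) + (S,A,S) @ (S,) -> (S,A)

S, A = 3, 2
rng = np.random.default_rng(0)
r = rng.random((S, A))
P_h = rng.dirichlet(np.ones(S), size=(S, A))   # valid transition kernel
Q_H = np.zeros((S, A))                         # Q*_H = 0
print(np.allclose(bellman_backup(r, P_h, Q_H), r))  # Q*_{H-1} = r -> True
```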

4.2 Inherent Bellman Error

If the $\widehat Q_h$ estimated by some algorithm satisfies $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ (this condition is called the inherent Bellman error), then:

  1. $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in\{0,1,2,\dots,H-1\}$
  2. $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$, where $\widehat \pi_h(s)=\operatorname*{argmax}_a\widehat Q_h(s,a)$

4.2.1 Proof that $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon$

  1. Since $Q_H(s,a)=0$, apply the inherent Bellman error bound $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ with $h=H-1$ to get $\|\widehat Q_{H-1}-r\|_\infty\leq \epsilon$
  2. Since $Q^\star_{H-1}=r$, this gives $\|\widehat Q_{H-1}-Q_{H-1}^\star\|_\infty\leq \epsilon$
  3. The main decomposition:
    $$\begin{aligned} \|\widehat Q_h-Q^\star_h\|_\infty&\leq \|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty+\|\mathcal B_h\widehat Q_{h+1}-Q^\star_h\|_\infty\\ &\leq \epsilon +\|\mathcal B_h\widehat Q_{h+1}-\mathcal B_hQ^\star_{h+1}\|_\infty\\ &\leq \epsilon +\left\|\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat Q_{h+1}(s',a')\right]-\mathbb E_{s''\sim P_h(\cdot|s,a)}\left[\max_{a''}Q^\star_{h+1}(s'',a'')\right]\right\|_\infty\\ &\leq \epsilon + \|\widehat Q_{h+1}-Q^\star_{h+1}\|_\infty \end{aligned}$$
  4. The familiar unrolling step: recurse down to $\|\widehat Q_{H-1}-Q_{H-1}^\star\|_\infty\leq \epsilon$, which gives $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon$ for all $h\in\{0,1,2,\dots,H-1\}$

4.2.2 Proof that $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$

  1. Work backwards, starting at step $H-1$ (each identity below is inserted to align the actions):
    $$\begin{aligned} |V_{H-1}^\star(s)-V_{H-1}^{\widehat \pi}(s)|&=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))|\\ &=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))+Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))|\\ &=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &= |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s)) + \widehat Q_{H-1}(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s)) + \widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s))| + |\widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq 2\epsilon \end{aligned}$$
  • Note that $Q_{H-1}^\star(s,a)=r(s,a)=Q^{\widehat \pi}_{H-1}(s,a)$ for all $s,a$, hence $Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))=0$
  • Note also that $Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-\widehat Q_{H-1}(s,\widehat \pi_{H-1}(s))\neq0$ in general, so those terms cannot be cancelled; instead, the fifth line uses that $\widehat\pi_{H-1}$ is greedy for $\widehat Q_{H-1}$, so $\widehat Q_{H-1}(s,\pi^\star_{H-1}(s))\leq \widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))$
  2. The recursion between arbitrary steps $h$ and $h+1$:
    $$\begin{aligned} |V_{h}^\star(s)-V_{h}^{\widehat \pi}(s)|&=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q^{\widehat \pi}_{h}(s,\widehat \pi_{h}(s))|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+Q_{h}^\star(s,\widehat \pi_{h}(s))-Q^{\widehat \pi}_{h}(s,\widehat \pi_{h}(s))|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-\widehat Q_h(s,\widehat \pi_h(s))+\widehat Q_h(s,\widehat \pi_h(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &\leq|Q_{h}^\star(s,\pi^\star_{h}(s))-\widehat Q_h(s,\pi^\star_h(s))+\widehat Q_h(s,\widehat \pi_h(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &\leq 2(H-h)\epsilon +|\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]| \end{aligned}$$
  3. Unroll the recursion over time:
    $$\begin{aligned} |V_{H-1}^\star(s)-V_{H-1}^{\widehat \pi}(s)|&\leq 2\epsilon\\ |V_{H-2}^\star(s)-V_{H-2}^{\widehat \pi}(s)|&\leq 4\epsilon + 2\epsilon\\ |V_{H-3}^\star(s)-V_{H-3}^{\widehat \pi}(s)|&\leq 6\epsilon + 4\epsilon + 2\epsilon\\ &\cdots\\ |V_{h}^\star(s)-V_{h}^{\widehat \pi}(s)|&\leq 2\epsilon(H-h)H \end{aligned}$$
  4. Therefore at $h=0$: $|V_{0}^\star(s)-V_{0}^{\widehat \pi}(s)|\leq 2 H^2 \epsilon$

4.3 LSVI: assumptions and proof

Key step: show that LSVI satisfies $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$; then the results above apply.

4.3.1 The assumptions implicit in LSVI

  1. Linear Bellman Completeness
    $$\begin{aligned} \widehat Q_h(s,a)&=\widehat \theta_h^\top\phi(s,a)\\ \mathcal B_h\widehat Q_{h+1}(s,a)&=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat Q_{h+1}(s',a')\right]\\ &=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right] \end{aligned}$$
  2. D-optimal design
    $$\begin{aligned} &\phi(s,a)\in \mathbb R^d\\ &\Sigma=\mathbb E_{(s,a)\sim p}\left[\phi(s,a)\phi(s,a)^\top\right]\\ &p=\operatorname*{argmax}_v \ln\det \mathbb E_{(s,a)\sim v}\left[\phi(s,a)\phi(s,a)^\top\right]\\ &D_h=\{(s^{(i)},a^{(i)},r^{(i)},(s')^{(i)})\mid (s,a)\sim p,\ s'\sim P_h(\cdot|s,a)\}_{i=1}^{N}\\ &\Sigma_h=\sum_{(s,a)\in D_h}\phi(s,a)\phi(s,a)^\top \text{ is invertible}\\ &\Sigma_h\approx N\Sigma,\ \text{so}\ \phi(s,a)^\top \Sigma_h^{-1}\phi(s,a)\leq \frac{d}{N}\quad \forall (s,a) \end{aligned}$$
  3. The least squares inside LSVI
    $$\begin{aligned} \text{Standard LS:}\quad &y=\theta^\top x+\epsilon\\ \text{LS for $Q_h$:}\quad &y=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right],\quad \theta=\widehat \theta_h,\ x=\phi(s,a) \end{aligned}$$

4.3.2 Ordinary Least Squares

A brief review of ordinary least squares theory. The implicit assumption here is a fixed design analysis; for the detailed proof see the 2012 paper Random Design Analysis of Ridge Regression.

Given a dataset of $N$ samples $D=\{x^{(i)},y^{(i)}\}_{i=1}^N$, with $x^{(i)}\in \mathbb R^d$, $y^{(i)}\in \mathbb R$, where $x^{(i)}_j$ denotes the $j$-th coordinate of the $i$-th sample. The structural assumption is linear: there exists an optimal parameter vector $\theta^\star$ such that $y^{(i)}=(\theta^\star)^\top x^{(i)}+\epsilon^{(i)}$, with noise satisfying $\mathbb E[\epsilon]=0$ and $|\epsilon|\leq \sigma$.

Let the sample matrix be $X\in \mathbb R^{N\times d}$, with (unnormalized) covariance $\Sigma=X^\top X=\sum_{i=1}^N x^{(i)}(x^{(i)})^\top$, assumed invertible. The least-squares estimate is $\hat\theta=\operatorname*{argmin}_\theta \|X\theta-Y\|^2=(X^\top X)^{-1}X^\top Y=\Sigma^{-1}X^\top Y$.

Following the derivation in Random Design Analysis of Ridge Regression (2012), for any $\delta\in (0,1)$, with probability at least $1-\delta$:
$$(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star) =\|\hat \theta -\theta^\star\|_{\Sigma}^2 \leq \sigma^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$$

Expanded, $(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star)=(\hat \theta -\theta^\star)^\top X^\top X(\hat \theta -\theta^\star)=(X(\hat \theta-\theta^\star))^\top (X(\hat \theta-\theta^\star))=\sum_{i=1}^N \left(\hat \theta^\top x^{(i)}-(\theta^\star)^\top x^{(i)}\right)^2$, i.e. the in-sample prediction error against the noiseless targets.
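
A minimal numerical sketch of this fixed-design setup (synthetic data; the dimensions are arbitrary assumptions): fit $\hat\theta$ in closed form and evaluate the $\Sigma$-norm error that the bound controls.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 1000, 8, 0.1
X = rng.normal(size=(N, d))
theta_star = rng.normal(size=d)
eps = rng.uniform(-sigma, sigma, size=N)    # E[eps] = 0, |eps| <= sigma
Y = X @ theta_star + eps

Sigma = X.T @ X                             # unnormalized covariance
theta_hat = np.linalg.solve(Sigma, X.T @ Y) # Sigma^{-1} X^T Y

err = theta_hat - theta_star
sigma_norm_sq = err @ Sigma @ err           # ||theta_hat - theta_star||_Sigma^2
print(sigma_norm_sq, sigma**2 * d)          # the bound's leading term for scale
```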

4.3.3 LSVI's Inherent Bellman Error

First, plug into the least-squares result to bound the gap between the step-$h$ estimate $\hat \theta_h$ and the optimal parameter $\theta^\star$.

  1. The least-squares bound:
    $$(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star) =\|\hat \theta -\theta^\star\|_{\Sigma}^2 \leq \sigma^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$$
  2. The correspondence when LSVI fits $\widehat \theta_h$ from dataset $D_h$ at step $h$:
  • For each sample $(s,a,r,s')$ in $D_h$ the input is $(s,a)$, with $s'\sim P_h(\cdot|s,a)$; the true target value is $y=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')]$
  • Sampling one tuple $(s,a,r,s')$ with $s'\sim P_h(\cdot|s,a)$, the observed target is $y^{(i)}=r(s,a)+\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')$
  • So for a given $(s,a)$ the sampling noise is $\epsilon^{(i)}=y^{(i)}-y=\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')-\mathbb E_{s''\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}^\top\phi(s'',a')]$ (the reward terms cancel)
    It follows immediately that
    $$\mathbb E[\epsilon^{(i)}]=\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right]-\mathbb E_{s''\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s'',a')\right]=0$$
    In the infinite-horizon case the Q-function lies in $(0,\frac{1}{1-\gamma})$, so in the finite-horizon case it lies in $(0,H)$; hence the noise satisfies
    $$|\epsilon|\leq 2H$$
  • $\Sigma=\Sigma_h=\sum_{(s,a)\in D_h}\phi(s,a)\phi(s,a)^\top$
  • $\theta^\star$ is the parameter that maps $x=\phi(s,a)$ to the true target $y$, i.e. $(\theta^\star)^\top x=y$; such a $\theta^\star$ exists by Linear Bellman Completeness, and plugging in:
    $$(\theta^\star)^\top \phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right]$$
  3. So the least squares at step $h$ of LSVI gives: $(\hat \theta_h -\theta^\star)^\top \Sigma_h (\hat \theta_h -\theta^\star)=\|\hat \theta_h-\theta^\star\|_{\Sigma_h}^2 \leq 4H^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$
  4. The target quantity:
    $$\begin{aligned} \|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty&=\max_{s,a}\left|\hat\theta_h^\top \phi(s,a)-r(s,a)-\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')]\right|\\ &=\max_{s,a}|\hat\theta_h^\top \phi(s,a)-(\theta^\star)^\top \phi(s,a)|\\ &=\max_{s,a}|(\hat\theta_h-\theta^\star)^\top \phi(s,a)|\\ &=\max_{s,a}\sqrt{|(\hat\theta_h-\theta^\star)^\top \phi(s,a)|^2}\\ &\leq \max_{s,a}\sqrt{(\hat\theta_h-\theta^\star)^\top\Sigma_h (\hat\theta_h-\theta^\star)\cdot \phi(s,a)^\top\Sigma_h^{-1}\phi(s,a)}\quad\text{(Cauchy–Schwarz in the $\Sigma_h$-weighted inner product)}\\ &\leq\max_{s,a}\sqrt{ 4H^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right) \frac{d}{N}}\\ &\leq \frac{4Hd\sqrt{\ln (1/\delta)}}{\sqrt{N}}\quad\text{(using $\ln(1/\delta)\geq 1$)} \end{aligned}$$
  5. Solve for the dataset size $N$: setting $\epsilon= \frac{4Hd\sqrt{\ln (1/\delta)}}{\sqrt{N}}$ and taking a union bound over the $H$ steps (replace $\delta$ by $\delta/H$) gives $N=\frac{16H^2d^2\ln(H/\delta)}{\epsilon^2}$

4.3.4 The overall logic of the LSVI sample complexity

  • Section 4.2
    If the algorithm's estimate $\widehat Q_h$ satisfies $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ (inherent Bellman error), then
    1. $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in\{0,1,2,\dots,H-1\}$
    2. $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$, where $\widehat \pi_h(s)=\operatorname*{argmax}_a\widehat Q_h(s,a)$
  • Section 4.3.3
    With least squares + D-optimal design, once $N\geq \frac{16H^2d^2\ln(H/\delta)}{\epsilon^2}$, the bound $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ holds
  • Overall logic (a small numeric sketch follows the list)
    1. Obtain the optimal sampling distribution $p$ over the $(s,a)$ space via $p=\operatorname*{argmax}_v \ln\det \mathbb E_{(s,a)\sim v}[\phi(s,a)\phi(s,a)^\top]$
    2. Then at each step $h=0,1,\dots,H-1$, for each $(s,a)$ in the support of $p$, draw $\lceil p(s,a)N\rceil$ next states $s'$ from the true distribution $P_h(s'\mid s,a)$; so $|D_h|=\sum_{s,a}\lceil p(s,a)N\rceil$
    3. Hence the total number of samples across the $H$ datasets is
      $$\text{total samples}=H\sum_{(s,a)\in\operatorname{supp}(p)}\lceil p(s,a)N\rceil \leq H\sum_{(s,a)\in\operatorname{supp}(p)}(1+p(s,a)N)$$
    4. Set the target error $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon=\epsilon'$, so $\epsilon=\frac{\epsilon'}{2H^2}$; this requires the inherent Bellman error $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \frac{\epsilon'}{2H^2}$, hence $N= \frac{64H^6d^2\ln (H/\delta)}{(\epsilon')^2}$
    5. Therefore, to guarantee an $\epsilon'$-optimal policy, i.e. $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon=\epsilon'$, we need
      $$\text{total samples}=H\sum_{(s,a)\in\operatorname{supp}(p)}\lceil p(s,a)N\rceil \leq H\sum_{(s,a)\in\operatorname{supp}(p)}(1+p(s,a)N)\leq H\left(d^2+\frac{64H^6d^2\ln (H/\delta)}{(\epsilon')^2}\right)$$
      using that the support of $p$ has at most $d(d+1)/2\leq d^2$ points
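
As a sanity check on the arithmetic above, here is a small calculator for the two formulas (per-step dataset size $N$ and the total sample count); the concrete inputs are arbitrary assumptions for illustration.

```python
import math

def lsvi_sample_sizes(H: int, d: int, eps_prime: float, delta: float):
    """Per-step N and total samples from the bounds derived above."""
    N = 64 * H**6 * d**2 * math.log(H / delta) / eps_prime**2
    total = H * (d**2 + N)   # support of p has at most d(d+1)/2 <= d^2 points
    return math.ceil(N), math.ceil(total)

print(lsvi_sample_sizes(H=10, d=5, eps_prime=0.1, delta=0.05))
```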

Summary

There are many assumptions; here is a classification.

  • Assumptions about the problem model itself: finite horizon, large-scale MDP, time-dependent (non-stationary) policy
  • Structural assumptions on the optimal Q-values: linearity, with the Bellman backup preserving linearity (Linear Bellman Completeness)
  • Distributional assumption for dataset diversity: D-optimality
  • Assumption for the least-squares step of the algorithm: fixed design analysis

These assumptions are what guarantee that the corresponding algorithm finds an $\epsilon$-optimal policy in polynomial time.
