One RL Theory Fundamental a Day (8): Linear Bellman Completeness

References

RL Theory Book, Chapter 3

Background

The previous problem setting:

  1. Infinite-horizon discounted, tabular MDP
  2. VI or PI algorithms
  3. Generative-model interaction, which sidesteps the exploration problem

We now add an extra structural assumption: Linear Bellman Completeness.

  • Slightly extend the tabular MDP to a large-scale MDP, i.e., from discrete state-action spaces to continuous ones
  • Everything else stays the same
  • The VI update $Q_{n+1}(s,a)= r(s,a) + \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[\max_{a'}Q_n(s',a')]\ \ \forall (s,a)\in S\times A$ breaks down, because the Q-function $Q(s,a)$ can no longer be represented as a Q-table; we need function approximation
  • The structural assumption here is linear function approximation

The final question: after enlarging the state-action space of the original setting (loosely, from discrete to continuous), is there a sample-efficient algorithm that finds an $\epsilon$-optimal policy?

Roadmap

This post has many proofs and a long chain of reasoning. At a high level:

  1. The theory of least squares is the foundation, since LSVI uses least squares (Section 4.3.2)
  2. The D-optimal design assumption guarantees that the datasets used for least squares are sufficiently diverse (Section 3.3)
  3. Least squares + D-optimal design + Linear Bellman Completeness together yield a low inherent Bellman error (Section 4.3.3)
  4. With a low inherent Bellman error, the policy's performance can be bounded (Section 4.2.2)

The post proceeds by introducing concepts, then understanding the theorem, then proving it. If you do not want to read the proofs, skip Section 4 entirely; its overall logic is summarized in Section 4.3.4.

1. From Infinite Horizon to Finite Horizon

For ease of analysis we move to the finite-horizon setting: replace the effective horizon $\frac{1}{1-\gamma}$ of the infinite-horizon case with a finite horizon $H$, and additionally drop the stationarity of the transitions, so $P(\cdot\mid s,a)$ becomes time-dependent: $P_h(\cdot\mid s,a)$ for $h\leq H$.

So the infinite-horizon definition $\mathcal M=(S,A,r,P,\gamma)$ is rewritten as the finite-horizon definition $\mathcal M=(S,A,r,\{P_h\},H)$.

The connection between the two: $H=\frac{1}{1-\gamma}$, and stationarity corresponds to $\lim_{h\rightarrow \infty}P_h=P$.

2. Linear Bellman Completeness

To solve the Q-function representation problem when moving from tabular to large-scale MDPs, we first assume linear function approximation, and second require it to satisfy Bellman completeness. The combination is called Linear Bellman Completeness, defined as follows:

Given a known feature map $\phi(s,a)\in \mathbb R^d$, assume it satisfies Bellman completeness: for all $s\in S$, $a\in A$, $h\in [H]$, $\theta\in\mathbb R^d$, there exists $w\in\mathbb R^d$ such that
$$w^\top\phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\theta^\top \phi(s',a')\right]$$

For comparison, the Bellman optimality equation for the Q-function is:
$$Q^\star(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot|s,a)}\left[\max_{a'}Q^\star(s',a')\right]$$

Observe that Bellman optimality and Linear Bellman Completeness look very similar.

  • Intuition: Q used to be a map defined on the $(s,a)$ space; now it becomes a point in the space spanned by $\phi(s,a)$.
  • With $\phi(s,a)=\text{one-hot}(s,a)$, e.g. $\phi(s,a)=(1,0,0,\dots,0)$, we recover the tabular MDP as a special case.
  • If the state has two components $s_1,s_2$ and the action is a scalar $a$, a feature map such as $\phi(s,a)=(s_1s_2,s_1^2,s_2^2,s_1a,s_2a,a,a^2,1)$ corresponds to an LQR problem. (Both feature maps are sketched in code below.)

One-sentence summary: with $\phi(s,a)$ known, the parameters $w,\theta$ of this assumption uniquely identify different Q-functions; the optimal Q-function at step $h$ is $Q^\star_h(s,a)=(\theta_h^\star)^\top\phi(s,a)$.
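
As a quick illustration of the two bullet-point examples above, here is a minimal Python sketch of both feature maps. The state-action encoding in the tabular case is an assumption made purely for illustration.

```python
import numpy as np

def phi_tabular(s, a, n_states, n_actions):
    """One-hot features: recovers the tabular MDP (Q-table) as a special case."""
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0   # assumed enumeration of (s, a) pairs
    return v

def phi_lqr(s, a):
    """Quadratic features for a 2-dim state s=(s1,s2) and scalar action a,
    matching the LQR example above."""
    s1, s2 = s
    return np.array([s1 * s2, s1**2, s2**2, s1 * a, s2 * a, a, a**2, 1.0])

print(phi_tabular(1, 0, n_states=3, n_actions=2))  # one-hot at index 2
print(phi_lqr((0.5, -1.0), 2.0))
```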

3. Theory of LSVI

3.1 The LSVI algorithm

LSVI stands for Least Squares Value Iteration. Under the strong assumption of Linear Bellman Completeness, it is a sample-efficient algorithm that finds an $\epsilon$-optimal policy in polynomial time. First rewrite Linear Bellman Completeness in a time-dependent form:
$$\theta_h^\top \phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_{h}(\cdot|s,a)}\left[\max_{a'}\theta_{h+1}^\top \phi(s',a')\right]$$

The LSVI procedure:

  1. At each step $h\in \{0,1,\dots,H-1\}$, collect data through the generative model $s'\sim P_h(\cdot|s,a)$, yielding datasets $D_0,D_1,\dots,D_{H-1}$, where each sample is stored as a tuple $(s,a,r,s')$
  2. Handle the boundary case: set $V_H(s)=0$ for all $s\in S$
  3. For $h=H-1\rightarrow 0$, learn the parameter $\theta_h$ from dataset $D_h$:
    $$\theta_h=\operatorname*{argmin}_{\theta}\sum_{(s,a,r,s')\in D_h}\left(\theta^\top \phi(s,a)-r-V_{h+1}(s')\right)^2,\qquad V_h(s)=\max_a \theta_h^\top\phi(s,a)\quad \forall s$$

The final policy is greedy with respect to the learned Q-values: $\widehat \pi_h(s)=\operatorname*{argmax}_a\theta_h^\top\phi(s,a)$ for all $h$. A code sketch of the whole loop follows.
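
Below is a minimal Python sketch of LSVI under a generative model. The helper names (`phi`, `reward`, `sample_next`) and the one-sample-per-pair data collection are illustrative assumptions, not part of the original text.

```python
import numpy as np

def lsvi(phi, reward, sample_next, actions, H, D):
    """Least Squares Value Iteration sketch.

    phi(s, a)            -> feature vector (np.ndarray of length d)
    reward(s, a)         -> scalar reward r(s, a)
    sample_next(h, s, a) -> one draw s' ~ P_h(.|s,a) from the generative model
    D                    -> list over h of state-action pairs [(s, a), ...]
    Returns theta[0..H-1] with Q_h(s, a) = theta[h] @ phi(s, a).
    """
    s0, a0 = D[0][0]
    d = len(phi(s0, a0))
    theta = [np.zeros(d) for _ in range(H)]

    def V(h, s):                      # V_h(s) = max_a theta_h^T phi(s,a); V_H = 0
        if h == H:
            return 0.0
        return max(theta[h] @ phi(s, a) for a in actions)

    for h in range(H - 1, -1, -1):    # backwards in time, h = H-1, ..., 0
        X = np.array([phi(s, a) for (s, a) in D[h]])
        y = np.array([reward(s, a) + V(h + 1, sample_next(h, s, a))
                      for (s, a) in D[h]])       # regression targets
        theta[h] = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares fit
    return theta

# Greedy policy from the learned parameters:
#   pi_hat_h(s) = argmax_a theta[h] @ phi(s, a)
```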

3.2 LSVI sample complexity

The theorem:
In the finite-horizon, large-scale MDP, generative-model setting, assume Linear Bellman Completeness and let $\widehat \pi$ be the policy produced by LSVI. If the collected datasets $\{D_h\}_{h=0}^{H-1}$ satisfy a D-optimal design, then with probability at least $1-\delta$ the policy is $\epsilon$-optimal, i.e. $V^\star - V^{\widehat \pi}\leq \epsilon$, and the sample complexity of each dataset $D_h$ is $O(d^2+H^6d^2/\epsilon^2)$.

Here $d$ is the dimension of the feature map $\phi(s,a)$: the $|S||A|$ factor of the tabular sample complexity, which would be infinite here, is replaced by the finite feature dimension.

Why is a D-optimal design still needed on top of Linear Bellman Completeness? Because LSVI performs least-squares regression on the dataset $D_h$, which necessarily involves the data distribution: if the test data lie far from the training data, the regression cannot generalize, and the "optimal" in $\epsilon$-optimal cannot be guaranteed. So we add D-optimal design, an assumption that makes the dataset sufficiently diverse.

3.3 D-optimal Design

Let the samples live in a space $\mathcal X\subset \mathbb R^d$, and sample from $\mathcal X$ under a distribution $p$ to obtain a dataset $D=\{x:x\sim p\}$. Measuring the diversity of $D$ amounts to measuring the diversity of the sampling distribution $p$ over $\mathcal X$.

D-optimality: suppose the sampling distribution $p$ is supported on at most $d(d+1)/2$ points of $\mathcal X$, define $\Sigma=\mathbb E_{x\sim p}[xx^\top]$, and require $x^\top\Sigma^{-1} x\leq d$ for all $x\in \mathcal X$. Then $p$ is called a D-optimal design.

The support size $d(d+1)/2$ guarantees $p$ places mass on a bounded number of points; $\Sigma$ captures the information carried by the points under $p$; and $x^\top\Sigma^{-1} x\leq d$ for all $x\in \mathcal X$ says every point of the sample space is within "distance" $d$ of the distribution $p$, which quantifies diversity.

D D D-optimality为原则,来量化多样性,最大化以下目标筛选采样分布 p = arg max ⁡ v ln ⁡ det ⁡ E x ∼ v [ x x ⊤ ] p=\argmax_v \ln\det \mathbb E_{x\sim v}[xx^\top] p=vargmaxlndetExv[xx]

Because $p$ is a D-optimal design, the dataset $D$ sampled from $\mathcal X$ under $p$ has guaranteed diversity (coverage). A sketch of computing such a design follows.
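
One standard way to compute the $\operatorname{argmax}\ln\det$ design above over a finite candidate set is Titterington's multiplicative-weights iteration; this algorithmic choice is mine, not the book's. A minimal sketch:

```python
import numpy as np

def d_optimal_design(X, n_iters=500):
    """Multiplicative-weights iteration for a D-optimal design.

    X: (n, d) array of candidate feature vectors.
    Returns design weights p over the candidates with sum(p) == 1.
    """
    n, d = X.shape
    p = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        Sigma = X.T @ (p[:, None] * X)   # E_{x~p}[x x^T]
        g = np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X)  # x^T Sigma^{-1} x
        p = p * g / d                    # stays a distribution: sum_i p_i g_i = d
    return p

# Check the D-optimality condition max_x x^T Sigma^{-1} x <= d on random data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
p = d_optimal_design(X)
Sigma = X.T @ (p[:, None] * X)
print(np.max(np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X)))  # approaches 5
```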

3.4 Final statement of the LSVI sample complexity

(The formal proposition was shown as an image in the original post and is not recoverable here; see the theorem statement in Section 3.2.)

4. Proof of the LSVI Sample Complexity

4.1 Getting familiar with time-dependent finite-horizon MDPs

The proof mainly works with Q-value iteration, so we first rewrite the Q-related formulas:

  1. Bellman optimality
    $$\begin{aligned} \text{Infinite, stationary:}&\\ Q^\star(s,a)&=r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot|s,a)}\left[\max_{a'}Q^\star (s',a')\right]\\ V^\star(s)&=\max_a Q^\star(s,a)\\ Q&=\mathcal BQ\ \text{(shorthand)}\\ \text{Finite, non-stationary:}&\\ Q_h^\star(s,a)&=r(s,a)+ \mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}Q_{h+1}^\star (s',a')\right]\\ V_h^\star(s)&=\max_a Q_h^\star(s,a)\\ Q_h&=\mathcal B_hQ_{h+1}\ \text{(shorthand)} \end{aligned}$$
  2. In the non-stationary case $Q_h^\star$ satisfies:
    $$Q^\star_H=0,\quad Q^\star_{H-1}=r,\quad Q_h^\star=\mathcal B_h Q^\star_{h+1}$$
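
For concreteness, here is a tabular sketch of the non-stationary backup operator $\mathcal B_h$; the array shapes are assumptions for illustration. It also checks the boundary identity $Q^\star_{H-1}=\mathcal B_{H-1}Q^\star_H=r$.

```python
import numpy as np

def bellman_backup(r, P_h, Q_next):
    """(B_h Q_{h+1})(s,a) = r(s,a) + E_{s'~P_h(.|s,a)}[max_{a'} Q_{h+1}(s',a')].

    Assumed shapes: r is (S, A); P_h is (S, A, S); Q_next is (S, A).
    """
    V_next = Q_next.max(axis=1)     # V_{h+1}(s') = max_{a'} Q_{h+1}(s', a')
    return r + P_h @ V_next         # (S,A) + (S,A,S) @ (S,) -> (S,A)

S, A = 3, 2
rng = np.random.default_rng(0)
r = rng.random((S, A))
P_h = rng.dirichlet(np.ones(S), size=(S, A))   # valid transition kernel
Q_H = np.zeros((S, A))                         # Q*_H = 0
print(np.allclose(bellman_backup(r, P_h, Q_H), r))  # Q*_{H-1} = r -> True
```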

4.2 Inherent Bellman Error

If the $\widehat Q_h$ estimated by some algorithm satisfies $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ (this condition is called the inherent Bellman error), then:

  1. $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in\{0,1,2,\dots,H-1\}$
  2. $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$, where $\widehat \pi_h(s)=\operatorname*{argmax}_a\widehat Q_h(s,a)$

4.2.1 Proof that $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon$

  1. Since $Q_H(s,a)=0$, apply the inherent Bellman error bound $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ with $h=H-1$ to get $\|\widehat Q_{H-1}-r\|_\infty\leq \epsilon$
  2. Since $Q^\star_{H-1}=r$, this gives $\|\widehat Q_{H-1}-Q_{H-1}^\star\|_\infty\leq \epsilon$
  3. The main decomposition:
    $$\begin{aligned} \|\widehat Q_h-Q^\star_h\|_\infty&\leq \|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty+\|\mathcal B_h\widehat Q_{h+1}-Q^\star_h\|_\infty\\ &\leq \epsilon +\|\mathcal B_h\widehat Q_{h+1}-\mathcal B_hQ^\star_{h+1}\|_\infty\\ &\leq \epsilon +\left\|\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat Q_{h+1}(s',a')\right]-\mathbb E_{s''\sim P_h(\cdot|s,a)}\left[\max_{a''}Q^\star_{h+1}(s'',a'')\right]\right\|_\infty\\ &\leq \epsilon + \|\widehat Q_{h+1}-Q^\star_{h+1}\|_\infty \end{aligned}$$
  4. The familiar unrolling step: recurse down to $\|\widehat Q_{H-1}-Q_{H-1}^\star\|_\infty\leq \epsilon$, which gives $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon$ for all $h\in\{0,1,2,\dots,H-1\}$

4.2.2 Proof that $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$

  1. Work backwards, starting at step $H-1$ (each identity below is inserted to align the actions):
    $$\begin{aligned} |V_{H-1}^\star(s)-V_{H-1}^{\widehat \pi}(s)|&=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))|\\ &=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))+Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))|\\ &=|Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &= |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s)) + \widehat Q_{H-1}(s,\pi^\star_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s)) + \widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq |Q_{H-1}^\star(s,\pi^\star_{H-1}(s))-\widehat Q_{H-1}(s,\pi^\star_{H-1}(s))| + |\widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))-Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))|\\ &\leq 2\epsilon \end{aligned}$$
  • Note that $Q_{H-1}^\star(s,a)=r(s,a)=Q^{\widehat \pi}_{H-1}(s,a)$ for all $s,a$, hence $Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-Q^{\widehat \pi}_{H-1}(s,\widehat \pi_{H-1}(s))=0$
  • Note also that $Q_{H-1}^\star(s,\widehat \pi_{H-1}(s))-\widehat Q_{H-1}(s,\widehat \pi_{H-1}(s))\neq0$ in general, so those terms cannot be cancelled; instead, the fifth line uses that $\widehat\pi_{H-1}$ is greedy for $\widehat Q_{H-1}$, so $\widehat Q_{H-1}(s,\pi^\star_{H-1}(s))\leq \widehat Q_{H-1}(s,\widehat\pi_{H-1}(s))$
  2. The recursion between arbitrary steps $h$ and $h+1$:
    $$\begin{aligned} |V_{h}^\star(s)-V_{h}^{\widehat \pi}(s)|&=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q^{\widehat \pi}_{h}(s,\widehat \pi_{h}(s))|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+Q_{h}^\star(s,\widehat \pi_{h}(s))-Q^{\widehat \pi}_{h}(s,\widehat \pi_{h}(s))|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &=|Q_{h}^\star(s,\pi^\star_{h}(s))-\widehat Q_h(s,\widehat \pi_h(s))+\widehat Q_h(s,\widehat \pi_h(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &\leq|Q_{h}^\star(s,\pi^\star_{h}(s))-\widehat Q_h(s,\pi^\star_h(s))+\widehat Q_h(s,\widehat \pi_h(s))-Q_{h}^\star(s,\widehat \pi_{h}(s))+\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]|\\ &\leq 2(H-h)\epsilon +|\mathbb E_{s'\sim P_h}[V_{h+1}^\star(s')-V_{h+1}^{\widehat \pi}(s')]| \end{aligned}$$
  3. Unroll the recursion over time:
    $$\begin{aligned} |V_{H-1}^\star(s)-V_{H-1}^{\widehat \pi}(s)|&\leq 2\epsilon\\ |V_{H-2}^\star(s)-V_{H-2}^{\widehat \pi}(s)|&\leq 4\epsilon + 2\epsilon\\ |V_{H-3}^\star(s)-V_{H-3}^{\widehat \pi}(s)|&\leq 6\epsilon + 4\epsilon + 2\epsilon\\ &\cdots\\ |V_{h}^\star(s)-V_{h}^{\widehat \pi}(s)|&\leq 2\epsilon(H-h)H \end{aligned}$$
  4. Therefore at $h=0$: $|V_{0}^\star(s)-V_{0}^{\widehat \pi}(s)|\leq 2 H^2 \epsilon$

4.3 LSVI: assumptions and proof

Key step: show that LSVI satisfies $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$; then the results above apply.

4.3.1 The assumptions implicit in LSVI

  1. Linear Bellman Completeness
    $$\begin{aligned} \widehat Q_h(s,a)&=\widehat \theta_h^\top\phi(s,a)\\ \mathcal B_h\widehat Q_{h+1}(s,a)&=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat Q_{h+1}(s',a')\right]\\ &=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right] \end{aligned}$$
  2. D-optimal design
    $$\begin{aligned} &\phi(s,a)\in \mathbb R^d\\ &\Sigma=\mathbb E_{(s,a)\sim p}\left[\phi(s,a)\phi(s,a)^\top\right]\\ &p=\operatorname*{argmax}_v \ln\det \mathbb E_{(s,a)\sim v}\left[\phi(s,a)\phi(s,a)^\top\right]\\ &D_h=\{(s^{(i)},a^{(i)},r^{(i)},(s')^{(i)})\mid (s,a)\sim p,\ s'\sim P_h(\cdot|s,a)\}_{i=1}^{N}\\ &\Sigma_h=\sum_{(s,a)\in D_h}\phi(s,a)\phi(s,a)^\top \text{ is invertible}\\ &\Sigma_h\approx N\Sigma,\ \text{so}\ \phi(s,a)^\top \Sigma_h^{-1}\phi(s,a)\leq \frac{d}{N}\quad \forall (s,a) \end{aligned}$$
  3. The least squares inside LSVI
    $$\begin{aligned} \text{Standard LS:}\quad &y=\theta^\top x+\epsilon\\ \text{LS for $Q_h$:}\quad &y=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right],\quad \theta=\widehat \theta_h,\ x=\phi(s,a) \end{aligned}$$

4.3.2 Ordinary Least Squares

A brief review of ordinary least squares theory. The implicit assumption here is a fixed design analysis; for the detailed proof see the 2012 paper Random Design Analysis of Ridge Regression.

Given a dataset of $N$ samples $D=\{x^{(i)},y^{(i)}\}_{i=1}^N$, with $x^{(i)}\in \mathbb R^d$, $y^{(i)}\in \mathbb R$, where $x^{(i)}_j$ denotes the $j$-th coordinate of the $i$-th sample. The structural assumption is linear: there exists an optimal parameter vector $\theta^\star$ such that $y^{(i)}=(\theta^\star)^\top x^{(i)}+\epsilon^{(i)}$, with noise satisfying $\mathbb E[\epsilon]=0$ and $|\epsilon|\leq \sigma$.

Let the sample matrix be $X\in \mathbb R^{N\times d}$, with (unnormalized) covariance $\Sigma=X^\top X=\sum_{i=1}^N x^{(i)}(x^{(i)})^\top$, assumed invertible. The least-squares estimate is $\hat\theta=\operatorname*{argmin}_\theta \|X\theta-Y\|^2=(X^\top X)^{-1}X^\top Y=\Sigma^{-1}X^\top Y$.

Following the derivation in Random Design Analysis of Ridge Regression (2012), for any $\delta\in (0,1)$, with probability at least $1-\delta$:
$$(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star) =\|\hat \theta -\theta^\star\|_{\Sigma}^2 \leq \sigma^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$$

Expanded, $(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star)=(\hat \theta -\theta^\star)^\top X^\top X(\hat \theta -\theta^\star)=(X(\hat \theta-\theta^\star))^\top (X(\hat \theta-\theta^\star))=\sum_{i=1}^N \left(\hat \theta^\top x^{(i)}-(\theta^\star)^\top x^{(i)}\right)^2$, i.e. the in-sample prediction error against the noiseless targets.
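
A minimal numerical sketch of this fixed-design setup (synthetic data; the dimensions are arbitrary assumptions): fit $\hat\theta$ in closed form and evaluate the $\Sigma$-norm error that the bound controls.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 1000, 8, 0.1
X = rng.normal(size=(N, d))
theta_star = rng.normal(size=d)
eps = rng.uniform(-sigma, sigma, size=N)    # E[eps] = 0, |eps| <= sigma
Y = X @ theta_star + eps

Sigma = X.T @ X                             # unnormalized covariance
theta_hat = np.linalg.solve(Sigma, X.T @ Y) # Sigma^{-1} X^T Y

err = theta_hat - theta_star
sigma_norm_sq = err @ Sigma @ err           # ||theta_hat - theta_star||_Sigma^2
print(sigma_norm_sq, sigma**2 * d)          # the bound's leading term for scale
```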

4.3.3 LSVI's Inherent Bellman Error

First, plug into the least-squares result to bound the gap between the step-$h$ estimate $\hat \theta_h$ and the optimal parameter $\theta^\star$.

  1. The least-squares bound:
    $$(\hat \theta -\theta^\star)^\top \Sigma (\hat \theta -\theta^\star) =\|\hat \theta -\theta^\star\|_{\Sigma}^2 \leq \sigma^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$$
  2. The correspondence when LSVI fits $\widehat \theta_h$ from dataset $D_h$ at step $h$:
  • For each sample $(s,a,r,s')$ in $D_h$ the input is $(s,a)$, with $s'\sim P_h(\cdot|s,a)$; the true target value is $y=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')]$
  • Sampling one tuple $(s,a,r,s')$ with $s'\sim P_h(\cdot|s,a)$, the observed target is $y^{(i)}=r(s,a)+\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')$
  • So for a given $(s,a)$ the sampling noise is $\epsilon^{(i)}=y^{(i)}-y=\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')-\mathbb E_{s''\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}^\top\phi(s'',a')]$ (the reward terms cancel)
    It follows immediately that
    $$\mathbb E[\epsilon^{(i)}]=\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right]-\mathbb E_{s''\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s'',a')\right]=0$$
    In the infinite-horizon case the Q-function lies in $(0,\frac{1}{1-\gamma})$, so in the finite-horizon case it lies in $(0,H)$; hence the noise satisfies
    $$|\epsilon|\leq 2H$$
  • $\Sigma=\Sigma_h=\sum_{(s,a)\in D_h}\phi(s,a)\phi(s,a)^\top$
  • $\theta^\star$ is the parameter that maps $x=\phi(s,a)$ to the true target $y$, i.e. $(\theta^\star)^\top x=y$; such a $\theta^\star$ exists by Linear Bellman Completeness, and plugging in:
    $$(\theta^\star)^\top \phi(s,a)=r(s,a)+\mathbb E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')\right]$$
  3. So the least squares at step $h$ of LSVI gives: $(\hat \theta_h -\theta^\star)^\top \Sigma_h (\hat \theta_h -\theta^\star)=\|\hat \theta_h-\theta^\star\|_{\Sigma_h}^2 \leq 4H^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right)$
  4. The target quantity:
    $$\begin{aligned} \|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty&=\max_{s,a}\left|\hat\theta_h^\top \phi(s,a)-r(s,a)-\mathbb E_{s'\sim P_h(\cdot|s,a)}[\max_{a'}\widehat \theta_{h+1}^\top\phi(s',a')]\right|\\ &=\max_{s,a}|\hat\theta_h^\top \phi(s,a)-(\theta^\star)^\top \phi(s,a)|\\ &=\max_{s,a}|(\hat\theta_h-\theta^\star)^\top \phi(s,a)|\\ &=\max_{s,a}\sqrt{|(\hat\theta_h-\theta^\star)^\top \phi(s,a)|^2}\\ &\leq \max_{s,a}\sqrt{(\hat\theta_h-\theta^\star)^\top\Sigma_h (\hat\theta_h-\theta^\star)\cdot \phi(s,a)^\top\Sigma_h^{-1}\phi(s,a)}\quad\text{(Cauchy–Schwarz in the $\Sigma_h$-weighted inner product)}\\ &\leq\max_{s,a}\sqrt{ 4H^2\left(d+\sqrt{d\ln (1/\delta)}+2\ln(1/\delta)\right) \frac{d}{N}}\\ &\leq \frac{4Hd\sqrt{\ln (1/\delta)}}{\sqrt{N}}\quad\text{(using $\ln(1/\delta)\geq 1$)} \end{aligned}$$
  5. Solve for the dataset size $N$: setting $\epsilon= \frac{4Hd\sqrt{\ln (1/\delta)}}{\sqrt{N}}$ and taking a union bound over the $H$ steps (replace $\delta$ by $\delta/H$) gives $N=\frac{16H^2d^2\ln(H/\delta)}{\epsilon^2}$

4.3.4 The overall logic of the LSVI sample complexity

  • Section 4.2
    If the algorithm's estimate $\widehat Q_h$ satisfies $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ (inherent Bellman error), then
    1. $\|\widehat Q_h-Q^\star_h\|_\infty\leq (H-h)\epsilon \quad \forall h\in\{0,1,2,\dots,H-1\}$
    2. $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon$, where $\widehat \pi_h(s)=\operatorname*{argmax}_a\widehat Q_h(s,a)$
  • Section 4.3.3
    With least squares + D-optimal design, once $N\geq \frac{16H^2d^2\ln(H/\delta)}{\epsilon^2}$, the bound $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \epsilon$ holds
  • Overall logic (a small numeric sketch follows the list)
    1. Obtain the optimal sampling distribution $p$ over the $(s,a)$ space via $p=\operatorname*{argmax}_v \ln\det \mathbb E_{(s,a)\sim v}[\phi(s,a)\phi(s,a)^\top]$
    2. Then at each step $h=0,1,\dots,H-1$, for each $(s,a)$ in the support of $p$, draw $\lceil p(s,a)N\rceil$ next states $s'$ from the true distribution $P_h(s'\mid s,a)$; so $|D_h|=\sum_{s,a}\lceil p(s,a)N\rceil$
    3. Hence the total number of samples across the $H$ datasets is
      $$\text{total samples}=H\sum_{(s,a)\in\operatorname{supp}(p)}\lceil p(s,a)N\rceil \leq H\sum_{(s,a)\in\operatorname{supp}(p)}(1+p(s,a)N)$$
    4. Set the target error $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon=\epsilon'$, so $\epsilon=\frac{\epsilon'}{2H^2}$; this requires the inherent Bellman error $\|\widehat Q_h-\mathcal B_h\widehat Q_{h+1}\|_\infty\leq \frac{\epsilon'}{2H^2}$, hence $N= \frac{64H^6d^2\ln (H/\delta)}{(\epsilon')^2}$
    5. Therefore, to guarantee an $\epsilon'$-optimal policy, i.e. $|V^{\widehat \pi}-V^\star|\leq 2H^2\epsilon=\epsilon'$, we need
      $$\text{total samples}=H\sum_{(s,a)\in\operatorname{supp}(p)}\lceil p(s,a)N\rceil \leq H\sum_{(s,a)\in\operatorname{supp}(p)}(1+p(s,a)N)\leq H\left(d^2+\frac{64H^6d^2\ln (H/\delta)}{(\epsilon')^2}\right)$$
      using that the support of $p$ has at most $d(d+1)/2\leq d^2$ points
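
As a sanity check on the arithmetic above, here is a small calculator for the two formulas (per-step dataset size $N$ and the total sample count); the concrete inputs are arbitrary assumptions for illustration.

```python
import math

def lsvi_sample_sizes(H: int, d: int, eps_prime: float, delta: float):
    """Per-step N and total samples from the bounds derived above."""
    N = 64 * H**6 * d**2 * math.log(H / delta) / eps_prime**2
    total = H * (d**2 + N)   # support of p has at most d(d+1)/2 <= d^2 points
    return math.ceil(N), math.ceil(total)

print(lsvi_sample_sizes(H=10, d=5, eps_prime=0.1, delta=0.05))
```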

Summary

There are many assumptions; here is a classification.

  • Assumptions about the problem model itself: finite horizon, large-scale MDP, time-dependent (non-stationary) policy
  • Structural assumptions on the optimal Q-values: linearity, with the Bellman backup preserving linearity (Linear Bellman Completeness)
  • Distributional assumption for dataset diversity: D-optimality
  • Assumption for the least-squares step of the algorithm: fixed design analysis

These assumptions are what guarantee that the corresponding algorithm finds an $\epsilon$-optimal policy in polynomial time.
