Sim2Real: When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online RL

Article link: https://blog.csdn.net/wdnmdwsmsa/article/details/136500958

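This post summarizes H2O, a method that combines offline and online learning and uses dynamics-aware policy evaluation to address the out-of-distribution (OOD) problem of offline data. The method measures the dynamics gap between simulated and real dynamics with a KL divergence, and improves Q-learning through an adaptive regularizer and importance sampling.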


NIPS 2022
paper

Introduction

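The dynamics model of a simulator differs from that of the real environment, so an RL policy trained online in simulation inevitably performs worse once it is transferred to the real world. Offline RL learns directly from real data, but it faces the challenge of out-of-distribution (OOD) data. This paper combines the offline and online settings and proposes H2O. H2O introduces a dynamics-aware policy evaluation scheme that adaptively penalizes Q-value learning on simulated state-action pairs with large dynamics gaps, while still allowing learning from real-environment data.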

Method

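The method builds on CQL and proposes a dynamics-aware policy evaluation scheme to handle the distribution mismatch between data from different sources. Concretely, it penalizes $Q(\mathbf{s},\mathbf{a})$ on state-action pairs with large dynamics gaps:
$\min_Q\max_{\color{red}{d^\phi}}\beta\left[\mathbb{E}_{\mathbf{s},\mathbf{a}\sim \color{red}{d^\phi(\mathbf{s},\mathbf{a})}}[Q(\mathbf{s},\mathbf{a})]-\mathbb{E}_{\mathbf{s},\mathbf{a}\sim\mathcal{D}}[Q(\mathbf{s},\mathbf{a})]+\color{red}{\mathcal{R}(d^\phi)}\right]+\color{blue}{\widetilde{\mathcal{E}}\left(Q,\hat{\mathcal{B}}^\pi\hat{Q}\right)}$
Here $d^\phi$ is a distribution over state-action pairs with large dynamics gaps, and $\mathcal{R}(d^\phi)$ is a regularization term. Minimizing over $Q$ pushes down the Q-values of the high-gap samples picked out by the inner max (the first term), while the second term protects the offline samples from $\mathcal{D}$. $\widetilde{\mathcal{E}}$ denotes a modified Bellman error on the mixed data (offline data plus simulated data). The red and blue parts are discussed in turn below.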

$\color{red}{\text{Red part}}$

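How should $d^\phi$ be designed? The paper controls $d^\phi$ through the regularization term $\mathcal{R}(d^\phi)$. Specifically, it takes a distribution $\omega$ that describes the dynamics gap of samples over the state-action space, and uses the KL divergence to constrain the distance between $d^\phi$ and $\omega$: $\mathcal{R}(d^{\phi})=-D_{KL}(d^{\phi}(\mathbf{s},\mathbf{a})\|\omega(\mathbf{s},\mathbf{a}))$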

The optimization over $d^\phi$ in the original objective then becomes:
$\max_{d^\phi}\mathbb{E}_{\mathbf{s},\mathbf{a}\sim d^\phi(\mathbf{s},\mathbf{a})}[Q(\mathbf{s},\mathbf{a})]-D_{KL}(d^\phi(\mathbf{s},\mathbf{a})\|\omega(\mathbf{s},\mathbf{a}))\quad\mathrm{s.t.}\sum_{\mathbf{s},\mathbf{a}}d^\phi(\mathbf{s},\mathbf{a})=1,\ d^\phi(\mathbf{s},\mathbf{a})\geq0$
This problem has a closed-form solution: $d^{\phi}(\mathbf{s},\mathbf{a})\propto\omega(\mathbf{s},\mathbf{a})\exp\left(Q(\mathbf{s},\mathbf{a})\right)$. Substituting it back into the objective gives:
$\min_Q\beta\left({\color{red}\log\sum_{\mathbf{s},\mathbf{a}}\omega(\mathbf{s},\mathbf{a})\exp\left(Q(\mathbf{s},\mathbf{a})\right)}-\mathbb{E}_{\mathbf{s},\mathbf{a}\sim\mathcal{D}}\left[Q(\mathbf{s},\mathbf{a})\right]\right)+\color{blue}{\widetilde{\mathcal{E}}\left(Q,\hat{\mathcal{B}}^\pi\hat{Q}\right)}$
This result is intuitively reasonable: Q-values are penalized more where $\omega(\mathbf{s},\mathbf{a})$ is large, which corresponds to simulated samples with large dynamics gaps. What remains is how to obtain $\omega$.
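To make the substitution step concrete, here is a minimal sketch in plain Python over a tiny finite set of state-action pairs. The function names (`logsumexp_weighted`, `adaptive_regularizer`, `closed_form_d`) and the toy numbers are illustrative assumptions, not taken from the paper. The sketch checks numerically that plugging $d^\phi\propto\omega\exp(Q)$ into $\mathbb{E}_{d^\phi}[Q]-D_{KL}(d^\phi\|\omega)$ gives $\log\sum\omega\exp(Q)$.

```python
import math

def logsumexp_weighted(weights, q_values):
    """Return log(sum_i w_i * exp(q_i)), shifted by max(q) for stability."""
    m = max(q_values)
    total = sum(w * math.exp(q - m) for w, q in zip(weights, q_values))
    return m + math.log(total)

def adaptive_regularizer(omega, q_sim, q_offline, beta=1.0):
    """beta * (log sum_i omega_i exp(Q_i) - mean Q over offline samples)."""
    penalty = logsumexp_weighted(omega, q_sim)
    protect = sum(q_offline) / len(q_offline)
    return beta * (penalty - protect)

def closed_form_d(omega, q_sim):
    """d_i proportional to omega_i * exp(Q_i), normalised to sum to one."""
    m = max(q_sim)
    unnorm = [w * math.exp(q - m) for w, q in zip(omega, q_sim)]
    z = sum(unnorm)
    return [x / z for x in unnorm]

# Toy values, for illustration only.
omega = [0.1, 0.2, 0.7]      # normalised dynamics-gap weights
q_sim = [1.0, 0.5, 2.0]      # Q-values on simulated pairs
q_offline = [0.8, 1.2]       # Q-values on offline pairs

d = closed_form_d(omega, q_sim)
# Inner objective E_d[Q] - KL(d || omega), evaluated at the closed-form d.
inner = sum(di * qi for di, qi in zip(d, q_sim)) - sum(
    di * math.log(di / wi) for di, wi in zip(d, omega)
)
reg = adaptive_regularizer(omega, q_sim, q_offline, beta=1.0)
```

In this toy setup `inner` agrees with `logsumexp_weighted(omega, q_sim)` up to floating-point error, which is the identity used to go from the max over $d^\phi$ to the log-sum-exp term.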

Specifically, the paper measures the dynamics gap between the simulated and real dynamics on each state-action pair:
$u(\mathbf{s},\mathbf{a}):=D_{KL}(P_{\widehat{\mathcal{M}}}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})\|P_{\mathcal{M}}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a}))=\mathbb{E}_{\mathbf{s}^{\prime}\sim P_{\widehat{\mathcal{M}}}}\log(P_{\widehat{\mathcal{M}}}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})/P_{\mathcal{M}}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a}))$
and $\omega$ is the normalized version of $u$: $\omega(\mathbf{s},\mathbf{a})=u(\mathbf{s},\mathbf{a})/\sum_{\tilde{\mathbf{s}},\tilde{\mathbf{a}}}u(\tilde{\mathbf{s}},\tilde{\mathbf{a}})$
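The normalization of $u$ into $\omega$ can be sketched as follows for a finite batch of pairs. The helper name `normalize_gaps`, the clipping of negative estimates to zero, and the uniform fallback are assumptions added for this sketch. A true KL divergence is non-negative, but a sampled estimate of it may not be.

```python
def normalize_gaps(u_values, eps=1e-8):
    """Turn per-pair dynamics-gap estimates u into weights omega that sum to one."""
    # Assumption: clip negative estimates, since a KL divergence cannot be negative.
    clipped = [max(u, 0.0) for u in u_values]
    total = sum(clipped)
    if total <= eps:
        # Assumption: fall back to uniform weights when every gap is (near) zero.
        n = len(clipped)
        return [1.0 / n] * n
    return [u / total for u in clipped]

# Toy gap estimates for four simulated state-action pairs.
omega = normalize_gaps([0.5, 1.5, -0.1, 2.0])
```

Pairs with a larger estimated gap receive a larger share of the weight, so they contribute more to the log-sum-exp penalty.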

The ratio of the two dynamics models is then rewritten with Bayes' rule:
$\begin{aligned} \frac{P_{\widehat{\mathcal{M}}}\left(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a}\right)}{P_{\mathcal{M}}\left(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a}\right)}& =\frac{p\left(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a},\mathrm{sim}\right)}{p\left(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a},\mathrm{real}\right)}=\frac{p\left(\mathrm{sim}|\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\right)}{p\left(\mathrm{sim}|\mathbf{s},\mathbf{a}\right)}/\frac{p\left(\mathrm{real}|\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\right)}{p\left(\mathrm{real}|\mathbf{s},\mathbf{a}\right)} \\ &=\frac{p\left(\mathrm{sim}|\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\right)}{p\left(\mathrm{real}|\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\right)}/\frac{p\left(\mathrm{sim}|\mathbf{s},\mathbf{a}\right)}{p\left(\mathrm{real}|\mathbf{s},\mathbf{a}\right)}=\frac{1-p\left(\mathrm{real}|\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\right)}{p\left(\mathrm{real}|\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\right)}/\frac{1-p\left(\mathrm{real}|\mathbf{s},\mathbf{a}\right)}{p\left(\mathrm{real}|\mathbf{s},\mathbf{a}\right)} \end{aligned}$
Here $p\left(\mathrm{real}|\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\right)$ and $p\left(\mathrm{real}|\mathbf{s},\mathbf{a}\right)$ are approximated by two discriminators, $D_{\Phi_{sas}}(\cdot|\mathbf{s},\mathbf{a},\mathbf{s}^{\prime})$ and $D_{\Phi_{sa}}(\cdot|\mathbf{s},\mathbf{a})$. The discriminators are trained on offline samples and simulated samples with a standard cross-entropy loss, similar to DARC.
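As a small illustration of the last line of the derivation, the sketch below turns two discriminator outputs into the log dynamics ratio. The function names and the `eps` clamp are assumptions for this sketch; in practice the two probabilities would come from trained classifiers, which are not shown here.

```python
import math

def log_dynamics_ratio(p_real_sas, p_real_sa, eps=1e-6):
    """log[P_sim(s'|s,a) / P_real(s'|s,a)] from the two discriminator outputs.

    p_real_sas: estimate of p(real | s, a, s')
    p_real_sa:  estimate of p(real | s, a)
    """
    # Assumption: clamp probabilities away from 0 and 1 to keep the logs finite.
    p_sas = min(max(p_real_sas, eps), 1.0 - eps)
    p_sa = min(max(p_real_sa, eps), 1.0 - eps)
    log_odds_sas = math.log(1.0 - p_sas) - math.log(p_sas)
    log_odds_sa = math.log(1.0 - p_sa) - math.log(p_sa)
    return log_odds_sas - log_odds_sa

def binary_cross_entropy(p_real, is_real, eps=1e-6):
    """Per-sample cross-entropy loss for a real-vs-sim discriminator."""
    p = min(max(p_real, eps), 1.0 - eps)
    return -math.log(p) if is_real else -math.log(1.0 - p)

# The (s, a, s') discriminator thinks the transition looks simulated,
# while the (s, a) discriminator is undecided.
ratio = log_dynamics_ratio(p_real_sas=0.2, p_real_sa=0.5)
```

With these inputs the log ratio is $\log 4$: the transition is four times more likely under the simulated dynamics than under the real ones. When both discriminators agree, the ratio is one and the log ratio is zero.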

$\color{blue}{\text{Blue part}}$

This part is the Bellman error $\color{blue}{\widetilde{\mathcal{E}}\left(Q,\hat{\mathcal{B}}^\pi\hat{Q}\right)}$. Because there is a dynamics gap between the real environment and the simulator, importance sampling is applied to the online (simulated) data:
$\begin{aligned}\widetilde{\mathcal{E}}\left(Q,\hat{\mathcal{B}}^{\pi}\hat{Q}\right)&=\frac12\mathbb{E}_{\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\sim\mathcal{D}}\left[\left(Q-\hat{\mathcal{B}}^{\pi}\hat{Q}\right)(\mathbf{s},\mathbf{a})\right]^2+\frac{1}{2}\mathbb{E}_{\mathbf{s},\mathbf{a}\sim B}\mathbb{E}_{\mathbf{s}^{\prime}\sim P_{\mathcal{M}}}\left[\left(Q-\hat{\mathcal{B}}^{\pi}\hat{Q}\right)(\mathbf{s},\mathbf{a})\right]^{2}\\&=\frac12\mathbb{E}_{\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\sim\mathcal{D}}\left[\left(Q-\hat{\mathcal{B}}^{\pi}\hat{Q}\right)(\mathbf{s},\mathbf{a})\right]^2+\frac12\mathbb{E}_{\mathbf{s},\mathbf{a},\mathbf{s}^{\prime}\sim B}\left[\frac{P_{\mathcal{M}}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})}{P_{\widehat{\mathcal{M}}}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})}\left(Q-\hat{\mathcal{B}}^{\pi}\hat{Q}\right)(\mathbf{s},\mathbf{a})\right]^2\end{aligned}$
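A rough sketch of the hybrid Bellman error on one mini-batch follows. It takes precomputed TD errors $Q-\hat{\mathcal{B}}^{\pi}\hat{Q}$ as inputs. Two points are assumptions of this sketch and not statements about the paper's implementation. First, the ratio $P_{\mathcal{M}}/P_{\widehat{\mathcal{M}}}$ is applied as a weight on the squared TD error, which is the usual importance-sampling reading of the formula. Second, the ratio is clipped to a range, which is a common way to limit variance. The name `hybrid_bellman_error` and the clip bounds are hypothetical.

```python
def hybrid_bellman_error(td_offline, td_sim, is_ratio, clip=(1e-2, 1e2)):
    """0.5 * mean squared TD error on offline data
    + 0.5 * importance-weighted mean squared TD error on simulated data.

    td_offline: TD errors (Q - target) on transitions from the offline dataset D
    td_sim:     TD errors (Q - target) on transitions from the simulation buffer B
    is_ratio:   estimates of P_real(s'|s,a) / P_sim(s'|s,a) for the simulated batch
    """
    lo, hi = clip
    real_term = sum(e * e for e in td_offline) / len(td_offline)
    # Assumption: the clipped ratio weights the squared error.
    weighted = [min(max(r, lo), hi) * e * e for r, e in zip(is_ratio, td_sim)]
    sim_term = sum(weighted) / len(weighted)
    return 0.5 * real_term + 0.5 * sim_term

# Toy batch: the first simulated transition is half as likely under real dynamics.
loss = hybrid_bellman_error(
    td_offline=[1.0, -1.0],
    td_sim=[2.0, 0.0],
    is_ratio=[0.5, 3.0],
)
```

Simulated transitions that are unlikely under the real dynamics get a small ratio and therefore contribute less to the Q-function update.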

Other details

The policy is optimized by sampling from the mixed data and applying the SAC policy improvement step (line 7 of the pseudocode).
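Sampling from the mixed data could look like the sketch below. The helper `sample_mixed_batch`, the `real_fraction` parameter, and its default of 0.5 are hypothetical. The notes above do not state the mixing ratio, and the SAC update itself is not shown.

```python
import random

def sample_mixed_batch(offline_data, sim_buffer, batch_size,
                       real_fraction=0.5, rng=random):
    """Draw a batch made of offline transitions and simulated transitions."""
    # Assumption: a fixed fraction of each batch comes from the offline dataset.
    n_real = int(round(batch_size * real_fraction))
    n_sim = batch_size - n_real
    real = [rng.choice(offline_data) for _ in range(n_real)]
    sim = [rng.choice(sim_buffer) for _ in range(n_sim)]
    return real, sim

# Transitions are represented by placeholder strings here.
offline_data = ["real_0", "real_1", "real_2"]
sim_buffer = ["sim_0", "sim_1", "sim_2", "sim_3"]
real_batch, sim_batch = sample_mixed_batch(
    offline_data, sim_buffer, batch_size=4, rng=random.Random(0)
)
```

Keeping the two halves separate makes it easy to apply the importance weights only to the simulated transitions in the critic update, while the policy step can use both halves together.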

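For computational reasons, the weighted average of $\exp(Q)$ over the entire state-action space in the original objective is approximated with mini-batches sampled from the simulated dataset $B$.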

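In addition, computing $u(\mathbf{s},\mathbf{a})=\mathbb{E}_{\mathbf{s}^{\prime}\sim P_{\widehat{\mathcal{M}}}}\log(P_{\widehat{\mathcal{M}}}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})/P_{\mathcal{M}}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a}))$ requires sampling next states from the dynamics model distribution, which is not feasible for a black-box simulator. The expectation is therefore approximated by the mean over $N$ samples drawn from a Gaussian $\mathcal{N}(\mathbf{s}^{\prime},\hat{\Sigma}_{\mathcal{D}})$, where $\hat{\Sigma}_{\mathcal{D}}$ is the covariance matrix of states computed from the offline dataset.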
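The Gaussian approximation can be sketched as below. For simplicity this sketch assumes a diagonal covariance, passed as per-dimension standard deviations `state_std`, which is a simplification of the full matrix $\hat{\Sigma}_{\mathcal{D}}$. The names `estimate_gap` and `log_ratio_fn` are hypothetical, and `log_ratio_fn` stands in for the discriminator-based log dynamics ratio.

```python
import random

def estimate_gap(s, a, s_next, log_ratio_fn, state_std, n_samples=10, rng=None):
    """Approximate u(s, a) by averaging the log dynamics ratio over
    n_samples next states drawn from a Gaussian centred on the observed s_next."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(n_samples):
        # Assumption: diagonal covariance, so each dimension is perturbed independently.
        noisy_next = [x + rng.gauss(0.0, sd) for x, sd in zip(s_next, state_std)]
        total += log_ratio_fn(s, a, noisy_next)
    return total / n_samples

def toy_log_ratio(s, a, s_next):
    """Stand-in for the learned log ratio: grows with the size of the state change."""
    return sum((y - x) ** 2 for x, y in zip(s, s_next))

u = estimate_gap(
    s=[0.0, 0.0], a=[1.0], s_next=[0.5, -0.5],
    log_ratio_fn=toy_log_ratio, state_std=[0.1, 0.1],
    n_samples=20, rng=random.Random(0),
)
```

Only one next state per simulated transition is needed from the simulator; the Gaussian samples around it replace the repeated queries that a black-box simulator cannot provide.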