O2O: Actor-Critic Alignment for Offline-to-Online Reinforcement Learning

Article link: https://blog.csdn.net/wdnmdwsmsa/article/details/136471561

ICML 2023 Poster

1 Introduction

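Offline-to-online (O2O) RL is prone to policy collapse caused by distribution shift. Existing remedies include constraining the policy shift and balancing how samples are drawn. However, these methods require estimating a distribution divergence or a density ratio. To avoid these costly operations, this paper does not reshape the Q-values as earlier actor-critic (AC) methods do. Instead, it aligns the critic with the offline policy, so that Q-values stay constrained even for actions outside the offline policy. Online fine-tuning can then run like any standard AC method.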

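The core of the method comes from SAC's policy representation, which is closely tied to a softmax over the Q-values. This form links the policy and the Q-values:
$\pi_\theta(a|s)=\exp{(\frac{1}{\alpha}Q_\mu(s,a))}\bigg/\sum_{a'\in\mathcal{A}}\exp{(\frac{1}{\alpha}Q_\mu(s,a'))}$
Setting $Z(s)=\alpha\log\sum_{a\in \mathcal{A}} \exp(Q_\mu(s,a)/\alpha)$, the expression above can be abbreviated as:
$Q_{\mu}(s,a)=Z(s)+\alpha\log\pi_{\theta}(a|s)$
Because of out-of-distribution (OOD) actions, the Q-values may be misestimated, whereas the policy is trustworthy (although it still needs online fine-tuning). The identity above shows that the SAC policy form naturally aligns the critic with the actor. This lets us initialize the online-stage Q-values from the offline policy. Online fine-tuning then only needs to run SAC, and the proposed method is reported to perform well across a variety of tasks.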
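To make the identity $Q_\mu(s,a)=Z(s)+\alpha\log\pi_\theta(a|s)$ concrete, here is a minimal sketch for a single state with a small discrete action set. The Q-values, the temperature, and the helper name `soft_policy` are made up for illustration and are not taken from the paper.

```python
import math

def soft_policy(q_values, alpha):
    """Return (pi, Z) for one state, given Q(s, a) for each discrete action."""
    m = max(q_values)
    # log-sum-exp of Q/alpha, with the max subtracted for numerical stability
    lse = m / alpha + math.log(sum(math.exp((q - m) / alpha) for q in q_values))
    z = alpha * lse  # Z(s) = alpha * log sum_a exp(Q(s, a) / alpha)
    pi = [math.exp((q - z) / alpha) for q in q_values]  # softmax of Q/alpha
    return pi, z

q = [1.0, 2.0, 0.5]  # made-up Q-values for three actions
alpha = 0.7          # made-up temperature
pi, z = soft_policy(q, alpha)

# Recover Q from the policy and the state-only term Z(s).
recovered = [z + alpha * math.log(p) for p in pi]
```

Here `recovered` matches `q` up to floating-point error, which is the sense in which the policy plus a state-only term $Z(s)$ determines the critic.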

2 Method

The proposed method consists of three stages: 1) offline, 2) actor-critic alignment, 3) online.

2.1 Offline

2.1.1 actor update

The actor is updated with a combination of SAC and maximum likelihood (ML):
$\begin{aligned}\mathcal{L}_\pi^{\mathrm{SAC+ML}}(\theta,\mathbf{d})&=\mathbb{E}_{(s,a)\sim\mathbf{d},\,b\sim\pi_\theta(\cdot|s)}\Big[-\log\pi_\theta(a|s)\\&\quad-\lambda\Big(Q_\mu(s,b)-\alpha\log\pi_\theta(b|s)\Big)\Big]\end{aligned}$
where the hyperparameter $\lambda$ balances the two terms. It is computed as:
$\lambda:=\omega\Big/\underset{(s,a)\sim\mathbf{d}}{\mathbb{E}}|Q_\mu(s,a)|,\mathrm{~where~}Q_\mu:=\min\{Q_{\mu_1},Q_{\mu_2}\}$
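The normalization of $\lambda$ can be sketched on a mini-batch as follows. The batch values, the value of $\omega$, and the function name `lambda_weight` are hypothetical; the expectation over the dataset is approximated by a batch mean.

```python
def lambda_weight(q1_batch, q2_batch, omega):
    """lambda = omega / mean(|min(Q1, Q2)|), estimated on one batch."""
    q_min = [min(a, b) for a, b in zip(q1_batch, q2_batch)]  # clipped double-Q
    mean_abs_q = sum(abs(q) for q in q_min) / len(q_min)
    return omega / mean_abs_q

q1 = [10.0, -4.0, 6.0]  # made-up outputs of the first critic
q2 = [8.0, -2.0, 7.0]   # made-up outputs of the second critic
lam = lambda_weight(q1, q2, omega=2.5)

# Rescaling every Q-value by 10 shrinks lambda by 10,
# so the product lambda * Q keeps the same magnitude.
lam_scaled = lambda_weight([10 * q for q in q1], [10 * q for q in q2], omega=2.5)
```

Dividing by the mean absolute Q-value keeps the SAC term on a scale comparable to the log-likelihood term regardless of the reward scale of the task.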

2.1.2 critic update

The critic is updated in the SAC fashion, and the temperature $\alpha$ is updated by gradient descent.
$\begin{aligned}&\mathcal{L}_Q^{\mathrm{SAC+ML}}(\mu_i,\mathbf{d}):=\mathbb{E}_{(s,a,r,s^{\prime})\sim\mathbf{d}}\Big[(Q_{\mu_i}(s,a)-y(r,s^{\prime}))^2\Big]\\&\mathrm{~with~}y(r,s^{\prime}):=r+\gamma\underset{a^{\prime}\sim\pi_\theta(\cdot|s^{\prime})}{\mathbb{E}}[Q_{\bar{\mu}}(s^{\prime},a^{\prime})-\alpha\log\pi_\theta(a^{\prime}|s^{\prime})],\end{aligned}$
where $\bar{\mu}$ denotes the delayed-update target Q-network, and $Q_{\bar{\mu}}(s,a)=\operatorname*{min}_{i\in\{1,2\}}Q_{\bar{\mu}_{i}}(s,a)$. The temperature loss is:
$\mathcal{L}_{\mathrm{temp}}^{\mathrm{SAC+ML}}(\alpha,\mathbf{d}):=-\alpha\underset{s\sim\mathbf{d}}{\mathbb{E}}\,\underset{a\sim\pi_\theta(\cdot|s)}{\mathbb{E}}\left[\log\pi_\theta(a|s)-\bar{\mathcal{H}}\right]$
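The soft TD target $y(r,s')$ can be illustrated with a small sketch. As a simplifying assumption it uses a discrete action set, so the expectation over $a'\sim\pi_\theta(\cdot|s')$ is an exact sum rather than a sample estimate; all numbers and the function name `soft_td_target` are invented for the example.

```python
import math

def soft_td_target(r, gamma, alpha, pi_next, q1_next, q2_next):
    """y = r + gamma * E_{a'~pi}[min(Q1, Q2)(s', a') - alpha * log pi(a'|s')]."""
    expectation = 0.0
    for p, qa, qb in zip(pi_next, q1_next, q2_next):
        if p > 0.0:  # zero-probability actions contribute nothing
            expectation += p * (min(qa, qb) - alpha * math.log(p))
    return r + gamma * expectation

y = soft_td_target(
    r=1.0, gamma=0.99, alpha=0.2,
    pi_next=[0.5, 0.5],   # made-up policy at s'
    q1_next=[2.0, 4.0],   # made-up target critic 1 at s'
    q2_next=[3.0, 1.0],   # made-up target critic 2 at s'
)
```

The element-wise minimum picks 2.0 and 1.0 here, and the entropy bonus $-\alpha\log\pi$ adds $0.2\log 2$ to each action's value before discounting.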

2.2 Align

The policy optimized in the offline stage usually performs well; denote it $\pi_{\theta_0}$. The critic, however, may collapse because of OOD actions. The paper therefore proposes an alignment step that ties the critic to the offline policy. This again relies on the SAC policy form, which naturally links the two. Q is set as follows, with $\alpha=1$:
$Q_i(s,a)=\log\pi_{\theta_0}(a|s)+Z_{\psi_i}(s)$

$Z_{\psi_i}(s)$ is optimized by minimizing the Bellman error:
$\mathcal{L}_{Z}^{\mathrm{SAC+ML}}(\psi_{i},\mathbf{d}):=\underset{(s,a,r,s^{\prime})\sim\mathbf{d}}{\mathbb{E}}[(\log\pi_{\theta_0}(a|s) +Z_{\psi_i}(s)-y(r,s'))^2]$
$\begin{aligned}\mathrm{where~}y(r,s^{\prime}):=&r+\gamma\underset{a^{\prime}\sim\pi_{\theta_0}(\cdot|s^{\prime})}{\mathbb{E}}[\log\pi_{\theta_0}(a^{\prime}|s^{\prime})+Z_{\psi}(s^{\prime})],\\&Z_{\psi}:=\min\{Z_{\psi_1},Z_{\psi_2}\}.\end{aligned}$
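The alignment step can be sketched on a toy tabular problem. This is only an illustration under strong assumptions: two states, two actions, a single $Z$ table instead of the minimum of two networks, no target network, and an exact per-state minimizer instead of gradient steps. The policy `pi0`, the transitions in `data`, and `gamma` are all made up.

```python
import math

# Hypothetical frozen offline policy pi_0 over 2 actions in 2 states.
pi0 = {0: [0.8, 0.2], 1: [0.5, 0.5]}
# Made-up offline transitions (s, a, r, s').
data = [(0, 0, 1.0, 1), (0, 1, 0.0, 0), (1, 0, 0.5, 0), (1, 1, 0.5, 1)]
gamma = 0.9

def expected_log_pi(s):
    """E_{a'~pi_0}[log pi_0(a'|s)], i.e. minus the policy entropy at s."""
    return sum(p * math.log(p) for p in pi0[s])

def bellman_update(z):
    """One exact regression step: per state, Z(s) = mean of y - log pi_0(a|s)."""
    new_z = {}
    for s in z:
        residuals = [r + gamma * (expected_log_pi(s2) + z[s2]) - math.log(pi0[s][a])
                     for (s1, a, r, s2) in data if s1 == s]
        new_z[s] = sum(residuals) / len(residuals)
    return new_z

z = {0: 0.0, 1: 0.0}
for _ in range(500):  # the update is a gamma-contraction, so this converges
    z = bellman_update(z)

def q_aligned(s, a):
    """Aligned critic Q(s, a) = log pi_0(a|s) + Z(s), with alpha = 1."""
    return math.log(pi0[s][a]) + z[s]
```

Whatever values $Z$ takes, the gap `q_aligned(s, a) - q_aligned(s, b)` equals $\log\pi_{\theta_0}(a|s)-\log\pi_{\theta_0}(b|s)$, so the action ranking of the aligned critic always agrees with the offline policy; only the state-dependent offset is learned from data.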

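Thanks to the alignment step, the Q-values learned in the offline stage are simply discarded. This avoids carrying erroneous offline Q-values into the online stage and collapsing there. The online stage then fine-tunes with SAC as usual.

[Figure]

As the figure shows, the policy performs better after alignment; the second plot also illustrates how well the policy and Q are aligned.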

2.3 Online

In the online fine-tuning stage, the Q-function is initialized as:
$\begin{aligned}Q_{\phi_i}(s,a)&:=\log\pi_{\theta_0}(a|s)+R_{\phi_i}(s,a),\\&\mathrm{where}\quad R_{\phi_i}(s,a)\text{ is initialized with }Z_{\psi_i}(s)\end{aligned}$
The critic is optimized as in SAC:
$\begin{aligned}&\mathcal{L}_Q(\phi_i,\mathbf{d}):=\underset{(s,a,r,s')\sim\mathbf{d}}{\mathbb{E}}\Big[\big(\log\pi_{\theta_0}(a|s)+R_{\phi_i}(s,a)-y(r,s^{\prime})\big)^2\Big]\\&\text{where }y(r,s'):=r+\gamma\,\mathbb{E}_{a'\sim\pi_\theta(\cdot|s')}\big[\log\pi_{\theta_0}(a'|s')+R_{\bar{\phi}}(s^{\prime},a^{\prime})-\alpha\log\pi_{\theta}(a^{\prime}|s^{\prime})\big]\end{aligned}$
The actor is again optimized much as in SAC, where $R_{\phi}:=\operatorname*{min}_{i\in\{1,2\}}R_{\phi_{i}}$ and $Q_{\phi}:=\log\pi_{\theta_{0}}+R_{\phi}$:
$\begin{aligned}\mathcal{L}_\pi(\theta,\mathbf{d}):=&-\underset{s\sim\mathbf{d}}{\mathbb{E}}\,\underset{a\sim\pi_\theta(\cdot|s)}{\mathbb{E}}\left[Q_\phi(s,a)-\alpha\log\pi_\theta(a|s)\right]\\=&-\mathbb{E}_{s\sim\mathbf{d}}\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[R_\phi(s,a)-\alpha\log\pi_\theta(a|s)\right]\\&-\underbrace{\mathbb{E}_{s\sim\mathbf{d}}\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}[\log\pi_{\theta_0}(a|s)]}_{\text{penalizing deviation of }\pi_\theta\text{ from }\pi_{\theta_0}}\end{aligned}$
The second term, a log-likelihood under $\pi_{\theta_0}$, can be viewed as a regularizer that keeps the new policy close to the offline-optimized policy.
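The decomposition of the actor loss can be checked numerically. The sketch below assumes a single state and discrete actions so that expectations are exact sums; the policies, the values of $R_\phi$, and the helper `actor_loss_terms` are hypothetical.

```python
import math

def actor_loss_terms(pi, pi0, r_values, alpha):
    """Return (full_loss, sac_part, regularizer) for one state, discrete actions."""
    # Full loss: -E_{a~pi}[Q(s, a) - alpha * log pi(a|s)], with Q = log pi_0 + R
    full = -sum(p * (math.log(p0) + r - alpha * math.log(p))
                for p, p0, r in zip(pi, pi0, r_values))
    # SAC-style part that only involves R
    sac_part = -sum(p * (r - alpha * math.log(p)) for p, r in zip(pi, r_values))
    # Regularizer: -E_{a~pi}[log pi_0(a|s)], the cross-entropy of pi against pi_0
    regularizer = -sum(p * math.log(p0) for p, p0 in zip(pi, pi0))
    return full, sac_part, regularizer

pi0 = [0.7, 0.2, 0.1]        # made-up offline policy
r_values = [0.3, 1.2, -0.4]  # made-up R(s, a)

full, sac_part, reg = actor_loss_terms([0.6, 0.3, 0.1], pi0, r_values, alpha=0.5)
_, _, reg_same = actor_loss_terms(pi0, pi0, r_values, alpha=0.5)
_, _, reg_far = actor_loss_terms([0.1, 0.2, 0.7], pi0, r_values, alpha=0.5)
```

In this toy case `full` equals `sac_part + reg`, and `reg_far` is larger than `reg_same`: the penalty grows when $\pi_\theta$ moves probability mass onto actions that $\pi_{\theta_0}$ considers unlikely.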