NIPS 2022 Offline RL Workshop
Intro
In offline-to-online (O2O) RL, out-of-distribution (OOD) data causes erroneous value-function estimates, which in turn produce a performance drop in the policy. Some methods, such as Off2OnRL, use a pessimistic ensemble Q estimate to mitigate the bootstrapping errors induced by this distribution shift; Off2OnRL also proposes a Balanced Replay Buffer to reuse offline data during the online phase.
The method proposed in this paper, UPQ, keeps the pessimistic ensemble Q estimate and additionally penalizes the Q function with an uncertainty-based term.
Method
The ensemble Q function and ensemble policy follow Off2OnRL:
$$
\begin{aligned}
Q_\theta^E(s,a)&:=\frac{1}{N}\sum_{i=1}^N Q_{\theta_i}(s,a),\\
\pi_\phi^E(\cdot\mid s)&=\mathcal{N}\bigg(\frac{1}{N}\sum_{i=1}^N\mu_{\phi_i}(s),\quad\frac{1}{N}\sum_{i=1}^N\Big(\sigma_{\phi_i}^2(s)+\mu_{\phi_i}^2(s)\Big)-\mu_{\phi}^2(s)\bigg),
\end{aligned}
$$

where $\mu_\phi(s)=\frac{1}{N}\sum_{i=1}^N\mu_{\phi_i}(s)$ is the mean of the policy heads' means, so the second argument is the moment-matched variance of the equally weighted Gaussian mixture.
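The aggregation above can be sketched in NumPy. `ensemble_q` and `ensemble_policy` are hypothetical helper names (not from the paper), assuming each policy head outputs a per-dimension Gaussian mean and standard deviation at state $s$:

```python
import numpy as np

def ensemble_q(q_values):
    """Mean of the N ensemble Q estimates at (s, a); q_values has shape (N,)."""
    return np.mean(q_values)

def ensemble_policy(mus, sigmas):
    """Moment-match N Gaussian policy heads into one Gaussian.

    mus, sigmas: shape (N, action_dim), each head's mean and std at state s.
    Returns (mu_E, var_E): mean of the head means, and the mixture variance
    E[a^2] - (E[a])^2 for the equally weighted Gaussian mixture.
    """
    mu_e = mus.mean(axis=0)
    var_e = (sigmas**2 + mus**2).mean(axis=0) - mu_e**2
    return mu_e, var_e
```

Note that when all heads agree, the mixture variance collapses back to the common $\sigma^2$, so disagreement between heads inflates the exploration noise.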
The uncertainty of the Q function is measured as:
$$
\mathcal{U}_{\theta_-}(s',a'):=\sigma\big(Q_{\theta_-}(s',a')\big)=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\Big(Q_{\theta_{i,-}}(s',a')-Q_{\theta_-}^{E}(s',a')\Big)^2},
$$
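This is simply the (population) standard deviation of the target-network ensemble. A minimal NumPy sketch, with `q_uncertainty` a hypothetical name:

```python
import numpy as np

def q_uncertainty(q_targets):
    """Std of the N target-ensemble estimates Q_{theta_i,-}(s', a').

    q_targets: shape (N,). Uses the population std (divide by N),
    matching the 1/N inside the square root above.
    """
    q_mean = q_targets.mean(axis=0)              # Q^E_{theta_-}(s', a')
    return np.sqrt(((q_targets - q_mean) ** 2).mean(axis=0))
```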
This uncertainty measure is then added as a penalty in the Bellman operator:
$$
\mathcal{T}Q_{\theta}^{E}(s,a):=r(s,a)+\gamma\,\mathbb{E}_{a'\sim\pi_{\phi}^{E}}\Big[Q_{\theta_-}^{E}(s',a')-\alpha\log\pi_{\phi}^{E}(a'\mid s')-\beta\,\mathcal{U}_{\theta_-}(s',a')\Big]
$$
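The penalized target can be sketched as follows; `bellman_target` is a hypothetical name, and the expectation over $a'$ is assumed to be approximated by a single sampled action:

```python
import numpy as np

def bellman_target(r, gamma, q_next, log_pi_next, alpha, beta):
    """Uncertainty-penalized soft Bellman target for one transition.

    q_next: shape (N,), target-ensemble estimates Q_{theta_i,-}(s', a')
            for a single sampled a' ~ pi^E_phi(.|s').
    log_pi_next: log pi^E_phi(a'|s'); alpha: entropy temperature;
    beta: uncertainty penalty weight.
    """
    q_mean = q_next.mean()                        # Q^E_{theta_-}(s', a')
    u = np.sqrt(((q_next - q_mean) ** 2).mean())  # ensemble std U_{theta_-}
    return r + gamma * (q_mean - alpha * log_pi_next - beta * u)
```

Larger ensemble disagreement (typically on OOD state-action pairs) shrinks the target, which is exactly the pessimism the penalty is meant to inject.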
The critic and actor are then updated with the following losses:
$$
\begin{aligned}
\mathcal{L}_{\mathrm{Critic}}(\theta)&=\mathbb{E}_{(s,a,s')\sim\mathcal{B}}\Big[\big(Q_\theta^E(s,a)-\mathcal{T}Q_\theta^E(s,a)\big)^2\Big],\\
\mathcal{L}_{\mathrm{Actor}}(\phi)&=\mathbb{E}_{s\sim\mathcal{B},\,a\sim\pi_\phi^E}\Big[\alpha\log\pi_\phi^E(a\mid s)-Q_\theta^E(s,a)\Big],
\end{aligned}
$$
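As a minimal sketch of the two losses over a sampled batch (function names are hypothetical; in practice the target $\mathcal{T}Q_\theta^E$ is computed with stopped gradients and the losses are minimized with a gradient-based optimizer):

```python
import numpy as np

def critic_loss(q_pred, td_target):
    """MSE between the ensemble Q estimate and the (detached) Bellman target.

    q_pred, td_target: shape (batch,).
    """
    return np.mean((q_pred - td_target) ** 2)

def actor_loss(log_pi, q_pred, alpha):
    """SAC-style actor objective: entropy-regularized Q maximization.

    log_pi: log pi^E_phi(a|s) for a ~ pi^E_phi; q_pred: Q^E_theta(s, a).
    """
    return np.mean(alpha * log_pi - q_pred)
```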