ICML 2021
paper
code
The variance of Q is used as a per-sample weight to reduce the influence of OOD data.
Intro
In offline reinforcement learning, the goal is to learn from a static dataset without any exploration or interaction. Existing Q-learning and actor-critic based algorithms struggle with out-of-distribution (OOD) actions or states, which can cause large errors in value estimation and destabilize training.
To address this, the paper proposes a new algorithm called Uncertainty Weighted Actor-Critic (UWAC). The key idea behind UWAC is to detect OOD state-action pairs and reduce their contribution to the training objective accordingly. This is achieved with a practical dropout-based uncertainty estimation method, which prevents the Q-function from learning overly optimistic values on OOD (high-uncertainty) data. Compared with existing RL algorithms, the method adds almost no extra overhead.
Method
Uncertainty estimation through dropout
Monte-Carlo Dropout is used to estimate the uncertainty of the Q-value: dropout is applied to the output of every hidden layer during training and kept active at test time; the same input is then passed through the network T times, and the variance of the predictions is estimated as
\begin{aligned}
Var[Q(s,a)] \approx \sigma^2 + \frac{1}{T}\sum_{t=1}^T \hat{Q}_t(s,a)^\top \hat{Q}_t(s,a) - E[\hat{Q}(s,a)]^\top E[\hat{Q}(s,a)]
\end{aligned}
The dropout source code is:
def forward(self, input, return_preactivations=False):
    h = input
    for i, fc in enumerate(self.fcs):
        h = fc(h)
        if self.layer_norm and i < len(self.fcs) - 1:
            h = self.layer_norms[i](h)
        h = self.hidden_activation(h)
        # F is torch.nn.functional; F.dropout defaults to training=True,
        # so dropout stays active at evaluation time (required for MC dropout)
        h = F.dropout(h, p=self.drop_rate)
    preactivation = self.last_fc(h)
    output = self.output_activation(preactivation)
    if return_preactivations:
        return output, preactivation
    else:
        return output
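Given this dropout forward pass, a minimal sketch of the T-pass Monte-Carlo estimate of the Q mean and variance could look like the following (the helper name mc_dropout_q_stats, the critic q_net, and T=10 are illustrative assumptions, not from the released code; the noise term σ² is treated as a constant and omitted):

import torch

def mc_dropout_q_stats(q_net, obs, act, T=10):
    """Run T stochastic forward passes (dropout stays active) and return
    the sample mean and variance of Q(s, a); the sigma^2 term is omitted."""
    inputs = torch.cat([obs, act], dim=-1)
    samples = torch.stack([q_net(inputs) for _ in range(T)], dim=0)  # (T, B, 1)
    q_mean = samples.mean(dim=0)
    q_var = samples.var(dim=0, unbiased=False)  # 1/T * sum Q_t^2 - mean^2
    return q_mean, q_var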
The uncertainty-weighted policy is expressed as
\begin{gathered}
\pi^{\prime}(a|s) = \frac{\beta}{Var\left[Q_0^{\pi^{\prime}}(s,a)\right]}\,\pi(a|s)/Z(s); \\
Z(s) = \int_{a}\frac{\beta}{Var\left[Q_{0}^{\pi^{\prime}}(s,a)\right]}\,\pi(a|s)\,da
\end{gathered}
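In code, this reweighting reduces to a per-sample weight proportional to β / Var[Q(s,a)]; a small sketch under the same assumptions (the epsilon is an added numerical safeguard, not taken from the paper):

def uncertainty_weight(q_var, beta, eps=1e-8):
    """Per-sample weight beta / Var[Q(s,a)]; the normalization Z(s) is
    absorbed into beta, as noted below."""
    return beta / (q_var + eps)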
Uncertainty Weighted Actor-Critic
After weighting, the Q-function is optimized as follows:
\begin{aligned}
\mathcal{L}(Q_{\theta}) &= \mathbb{E}_{(s^{\prime}|s,a)\sim\mathcal{D}}\,\mathbb{E}_{a^{\prime}\sim\pi^{\prime}(\cdot|s^{\prime})}\left[Err(s,a,s^{\prime},a^{\prime})^{2}\right] \\
&= \mathbb{E}_{(s^{\prime}|s,a)\sim\mathcal{D}}\,\mathbb{E}_{a^{\prime}\sim\pi(\cdot|s^{\prime})}\left[\frac{\beta}{Var\left[Q_{\theta^{\prime}}(s^{\prime},a^{\prime})\right]}Err(s,a,s^{\prime},a^{\prime})^{2}\right] \\
Err(s,a,s',a') &= Q_{\theta}(s,a)-\left(R(s,a)+\gamma Q_{\theta'}(s',a')\right).
\end{aligned}
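A minimal sketch of the resulting weighted critic update, reusing the helpers above (the target critic q_target, actor policy, batch keys, and gamma are assumed names, not the authors' implementation):

def uwac_critic_loss(q_net, q_target, policy, batch, beta, gamma=0.99, T=10):
    """Bellman error downweighted by beta / Var[Q_target(s', a')]."""
    obs, act, rew, next_obs = batch["obs"], batch["act"], batch["rew"], batch["next_obs"]
    with torch.no_grad():
        next_act = policy(next_obs)
        q_next_mean, q_next_var = mc_dropout_q_stats(q_target, next_obs, next_act, T)
        target = rew + gamma * q_next_mean
        weight = uncertainty_weight(q_next_var, beta)
    q_pred = q_net(torch.cat([obs, act], dim=-1))
    return (weight * (q_pred - target) ** 2).mean()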
where the normalization factor Z(s) is absorbed into β. Similarly, the policy is optimized as
\begin{aligned}
\mathcal{L}(\pi) &= -\mathbb{E}_{a\sim\pi^{\prime}(\cdot|s)}\left[Q_\theta(s,a)\right] \\
&= -\mathbb{E}_{a\sim\pi(\cdot|s)}\left[\frac{\beta}{Var\left[Q_{\theta}(s,a)\right]}Q_{\theta}(s,a)\right]
\end{aligned}
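Correspondingly, a sketch of the weighted actor objective under the same assumptions (the weight is treated as a constant, so gradients flow only through Q_θ(s, a)):

def uwac_actor_loss(q_net, policy, obs, beta, T=10):
    """Maximize Q_theta(s, a), downweighted by its MC-dropout variance."""
    act = policy(obs)
    with torch.no_grad():
        _, q_var = mc_dropout_q_stats(q_net, obs, act, T)
        weight = uncertainty_weight(q_var, beta)
    q = q_net(torch.cat([obs, act], dim=-1))
    return -(weight * q).mean()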