NeurIPS 2020
paper
Intro
RL from image observations faces two challenges: representation learning and task learning. This paper proposes to learn a latent variable model and to perform RL on top of that model.
Method
Consider a partially observable MDP (POMDP); its probabilistic graphical model is shown in the figure. Under this model, the agent can no longer influence the states and actions of the past $\tau$ steps; instead it predicts the optimal future actions up to the end of the episode.
Here $\mathcal{O}_{\tau+1}$ is a newly introduced random variable with distribution
$$p(\mathcal{O}_{t}=1\mid\mathbf{z}_{t},\mathbf{a}_{t})=\exp(r(\mathbf{z}_{t},\mathbf{a}_{t})).$$
The algorithm builds a joint sequence model over observations and rewards, and optimizes the policy by maximizing the likelihood
$$p(\mathbf{x}_{1:\tau+1},\mathcal{O}_{\tau+1:T}\mid\mathbf{a}_{1:\tau}).$$
An ELBO on this distribution is then derived via variational inference, where
$$r(\mathbf{z}_t,\mathbf{a}_t)=\log p(\mathcal{O}_t=1\mid\mathbf{z}_t,\mathbf{a}_t),$$
the variational distribution is
$$q(\mathbf{z}_{1:T},\mathbf{a}_{\tau+1:T}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})=\prod_{t=0}^{\tau} q(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\prod_{t=\tau+1}^{T-1}p(\mathbf{z}_{t+1}\mid\mathbf{z}_t,\mathbf{a}_t)\prod_{t=\tau+1}^{T}\pi(\mathbf{a}_t\mid\mathbf{x}_{1:t},\mathbf{a}_{1:t-1}),$$
and the generative model factorizes as
$$p(\mathbf{x}_{1:\tau+1},\mathcal{O}_{\tau+1:T},\mathbf{z}_{1:T},\mathbf{a}_{\tau+1:T}\mid\mathbf{a}_{1:\tau})=\prod_{t=1}^{\tau+1}p(\mathbf{x}_t\mid\mathbf{z}_t)\prod_{t=0}^{T-1}p(\mathbf{z}_{t+1}\mid\mathbf{z}_t,\mathbf{a}_t)\prod_{t=\tau+1}^{T}p(\mathcal{O}_t\mid\mathbf{z}_t,\mathbf{a}_t)\prod_{t=\tau+1}^{T}p(\mathbf{a}_t).$$
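For reference, a sketch of the bound that follows from this variational choice (standard Jensen/ELBO argument; the dynamics terms for $t>\tau$ cancel between $q$ and $p$), written here to make the two-part split below explicit:
$$\log p(\mathbf{x}_{1:\tau+1},\mathcal{O}_{\tau+1:T}\mid\mathbf{a}_{1:\tau})\ge\mathbb{E}_{q}\Big[\sum_{t=0}^{\tau}\big(\log p(\mathbf{x}_{t+1}\mid\mathbf{z}_{t+1})-D_{\mathrm{KL}}(q(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\,\|\,p(\mathbf{z}_{t+1}\mid\mathbf{z}_t,\mathbf{a}_t))\big)+\sum_{t=\tau+1}^{T}\big(r(\mathbf{z}_t,\mathbf{a}_t)+\log p(\mathbf{a}_t)-\log\pi(\mathbf{a}_t\mid\mathbf{x}_{1:t},\mathbf{a}_{1:t-1})\big)\Big]$$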
The problem therefore becomes maximizing this ELBO, which splits into two parts. The first part learns the latent variable model; its parameters are trained by minimizing the following loss:
$$J_{M}(\psi)=\mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_{\psi}}\left[\sum_{t=0}^{\tau}-\log p_{\psi}(\mathbf{x}_{t+1}\mid\mathbf{z}_{t+1})+D_{\mathrm{KL}}\big(q_{\psi}(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_{t},\mathbf{a}_{t})\,\|\,p_{\psi}(\mathbf{z}_{t+1}\mid\mathbf{z}_{t},\mathbf{a}_{t})\big)\right]$$
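As an illustration, here is a minimal PyTorch-style sketch of this model loss with diagonal-Gaussian distributions. The module names (`encoder`, `decoder`, `prior`) and their interfaces are assumptions made for the example, not the paper's exact implementation.

```python
import torch
from torch.distributions import Normal, kl_divergence

def latent_model_loss(encoder, decoder, prior, x_seq, a_seq, z0):
    """Sketch of J_M(psi): reconstruction NLL plus KL(q || p), summed over t = 0..tau.

    x_seq: observations x_{1:tau+1}, shape (tau+1, batch, obs_dim)
    a_seq: actions      a_{0:tau},   shape (tau+1, batch, act_dim)
    z0:    initial latent sample,    shape (batch, latent_dim)
    encoder(x, z, a) -> (mean, std) of q(z' | x, z, a)   # assumed interface
    prior(z, a)      -> (mean, std) of p(z' | z, a)      # assumed interface
    decoder(z)       -> (mean, std) of p(x | z)          # assumed interface
    """
    loss = 0.0
    z = z0
    for t in range(x_seq.shape[0]):
        q_mean, q_std = encoder(x_seq[t], z, a_seq[t])
        p_mean, p_std = prior(z, a_seq[t])
        q_dist, p_dist = Normal(q_mean, q_std), Normal(p_mean, p_std)
        z_next = q_dist.rsample()                        # reparameterized sample z_{t+1} ~ q_psi
        x_mean, x_std = decoder(z_next)
        recon_nll = -Normal(x_mean, x_std).log_prob(x_seq[t]).sum(-1)  # -log p_psi(x_{t+1} | z_{t+1})
        kl = kl_divergence(q_dist, p_dist).sum(-1)                     # D_KL(q_psi || p_psi)
        loss = loss + (recon_nll + kl).mean()
        z = z_next
    return loss
```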
For the second part, the paper assumes a uniform prior over $\mathbf{a}_t$, so the $\log p(\mathbf{a}_t)$ term is constant and can be dropped. The objective then reduces to maximum-entropy RL, for which the paper uses SAC. First, the Q-function is optimized by minimizing the soft Bellman residual in mean-squared error:
$$\begin{gathered} J_{Q}(\theta)=\mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_{\psi}}\left[\frac{1}{2}\left(Q_{\theta}(\mathbf{z}_{\tau},\mathbf{a}_{\tau})-(r_{\tau}+\gamma V_{\bar{\theta}}(\mathbf{z}_{\tau+1}))\right)^{2}\right], \\ V_{\theta}(\mathbf{z}_{\tau+1})=\mathbb{E}_{\mathbf{a}_{\tau+1}\sim\pi_{\phi}}\left[Q_{\theta}(\mathbf{z}_{\tau+1},\mathbf{a}_{\tau+1})-\alpha\log\pi_{\phi}(\mathbf{a}_{\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})\right] \end{gathered}$$
where $\mathbf{z}_{\tau+1}\sim q_\psi(\mathbf{z}_{\tau+1}\mid\mathbf{x}_{\tau+1},\mathbf{z}_{\tau},\mathbf{a}_{\tau})$. The policy is then optimized with the SAC actor objective (a combined sketch of both updates follows the next equation):
$$J_{\pi}(\phi)=\mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_{\psi}}\left[\mathbb{E}_{\mathbf{a}_{\tau+1}\sim\pi_{\phi}}\left[\alpha\log\pi_{\phi}(\mathbf{a}_{\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})-Q_{\theta}(\mathbf{z}_{\tau+1},\mathbf{a}_{\tau+1})\right]\right]$$
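Below is a minimal sketch of the two SAC-style updates on latent samples, in the spirit of the objectives above. The names (`critic`, `critic_target`, `actor`) and the history-conditioned `actor.sample` interface are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, critic_target, actor, z_t, a_t, r_t, z_next, history, alpha, gamma):
    """Soft Bellman MSE J_Q(theta): regress Q(z_tau, a_tau) onto r_tau + gamma * V(z_{tau+1})."""
    with torch.no_grad():
        a_next, logp_next = actor.sample(history)                    # a_{tau+1} ~ pi_phi(. | x_{1:tau+1}, a_{1:tau})
        v_next = critic_target(z_next, a_next) - alpha * logp_next   # soft value V(z_{tau+1})
        target = r_t + gamma * v_next
    q = critic(z_t, a_t)
    return 0.5 * F.mse_loss(q, target)

def actor_loss(critic, actor, z_next, history, alpha):
    """Actor objective J_pi(phi): minimize alpha * log pi - Q with reparameterized actions."""
    a_next, logp_next = actor.sample(history)                        # reparameterized sample and its log-prob
    return (alpha * logp_next - critic(z_next, a_next)).mean()
```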