Applications of Duality in Offline RL
Overview
Recall the three forms of duality:
1. Fenchel Duality
Given a primal problem
$$\min_{x\in D}J_P(x)=\min_{x\in D}f(x)+g(Ax)$$
its dual problem is
$$\max_{u\in D^*}J_D(u)=\max_{u\in D^*}-f^*(-A^\top u)-g^*(u)$$
Under suitable conditions the primal and dual problems are equivalent. A table of conjugates of commonly used functions comes in handy here:
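The table itself did not survive into this copy, so as a minimal substitute here are the conjugate pairs (standard results, with $h^*(y):=\sup_x\langle x,y\rangle-h(x)$) that the derivations later in this post actually rely on:
$$\begin{aligned} h(x)=\tfrac{1}{2}x^2 &\quad\Longleftrightarrow\quad h^*(y)=\tfrac{1}{2}y^2\\ h(x)=\delta_{\{c\}}(x) &\quad\Longleftrightarrow\quad h^*(y)=\langle c,y\rangle\\ h(d)=D_f(d\,||\,p) &\quad\Longleftrightarrow\quad h^*(y)=\mathbb E_{z\sim p}\left[f^*(y(z))\right] \end{aligned}$$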
2. Lagrange Duality
$$\begin{aligned} p^*&=\min_xJ_P(x)=\min_x\max_{\alpha,\beta:\alpha_i\geq 0} L(x,\alpha,\beta)\\ d^*&=\max_{\alpha,\beta:\alpha_i\geq0}J_D(\alpha,\beta)=\max_{\alpha,\beta:\alpha_i\geq0}\min_x L(x,\alpha,\beta)\\ d^*&\leq p^*\quad (\text{weak duality; equality holds under constraint qualifications such as Slater's condition}) \end{aligned}$$
3. Linear Program Duality (LP Duality)
$$\begin{aligned} &\min_x c^\top x\quad \text{s.t. }x\geq0,\ Ax=b\\ &\max_\mu b^\top \mu\quad \text{s.t. } A^\top \mu\leq c \end{aligned}$$
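To see how these three forms connect (the same move is reused below for the Q-LP), here is the standard derivation of the LP dual from the Lagrangian of the LP primal:
$$\begin{aligned} L(x,\mu,\lambda)&=c^\top x+\mu^\top(b-Ax)-\lambda^\top x,\qquad \lambda\geq 0\\ g(\mu,\lambda)=\min_x L(x,\mu,\lambda)&=b^\top\mu+\min_x\,(c-A^\top\mu-\lambda)^\top x=\begin{cases}b^\top\mu & \text{if } c-A^\top\mu-\lambda=0\\ -\infty & \text{otherwise}\end{cases}\\ \max_{\mu,\,\lambda\geq0} g(\mu,\lambda)&=\max_\mu\, b^\top\mu\quad \text{s.t. } A^\top\mu\leq c \end{aligned}$$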
The plan: model the two optimization problems in Offline RL, Policy Evaluation and Policy Optimization, as primal problems, adding constraints that reflect the problem structure.
Then use duality to turn these primal problems into dual problems, so that the primal can be solved by optimizing the dual.
Why duality in the first place: duality was introduced precisely to address the high variance that Importance Sampling suffers from when applied to an offline dataset.
1. Basic Terminology for Offline RL Modeling
The previous post stated everything in the finite-horizon form; this time we restate it in the infinite-horizon form.
- Q-value $Q^\pi(s,a)$:
$$\begin{aligned} Q^\pi(s_t,a_t)&=\mathbb E_{s_{t+1}\sim T(s_{t+1}|s_t,a_t),a_{t+1}\sim \pi(a_{t+1}|s_{t+1})}\left[\sum_{t'=t}^\infty\gamma^{t'-t}r(s_{t'},a_{t'})\right]\\ &=r(s_t,a_t)+\gamma\mathbb E_{s_{t+1}\sim T(s_{t+1}|s_t,a_t),a_{t+1}\sim \pi(a_{t+1}|s_{t+1})}\left[\sum_{t'=t+1}^\infty\gamma^{t'-t-1}r(s_{t'},a_{t'})\right]\\ &=r(s_t,a_t)+\gamma\mathbb E_{s_{t+1}\sim T(s_{t+1}|s_t,a_t),a_{t+1}\sim \pi(a_{t+1}|s_{t+1})}\left[Q^\pi(s_{t+1},a_{t+1})\right]\\ &=r(s_t,a_t)+\gamma P^\pi Q^\pi(s_t,a_t) \end{aligned}$$
where $P^\pi Q^\pi(s_t,a_t)=\mathbb E_{s_{t+1}\sim T(s_{t+1}|s_t,a_t),a_{t+1}\sim \pi(a_{t+1}|s_{t+1})}\left[Q^\pi(s_{t+1},a_{t+1})\right]$. We call $P^\pi$ the policy transition operator: starting from the current time $t$ and $Q^\pi(s_t,a_t)$, it draws the next state $s_{t+1}$ through the environment's transition dynamics $T(s_{t+1}|s_t,a_t)$, draws the next action $a_{t+1}$ through the policy $\pi(a_{t+1}|s_{t+1})$, and evaluates the Q-value there. The key point here is that we have defined the operator $P^\pi$.
- State-action visitation frequency $d^\pi(s,a)$:
  - Definition: the joint distribution $d^\pi(s,a)$ that a policy induces on the state $\times$ action space $\Omega$:
$$d^\pi(s,a)=(1-\gamma)\sum_{t=0}^\infty \gamma^t \Pr(s_t=s,a_t=a\mid s_0\sim d_0(s),a_t\sim\pi(\cdot|s_t),s_{t+1}\sim T(\cdot|s_t,a_t))$$
Here $d_0(s)$ is the initial-state distribution. In words, $d^\pi(s,a)$ is the joint distribution over $\Omega$ of the states and actions encountered as the policy $\pi(a|s)$ keeps interacting with the environment dynamics $T(\cdot|s,a)$.
  - Property: since this joint distribution $d^\pi(s,a)$ is stationary, a "stationary flow" equation holds automatically:
$$\begin{aligned} d^\pi(s,a)&=(1-\gamma)d_0(s)\pi(a|s)+\gamma \pi(a|s)\sum_{s',a'}d^\pi(s',a')T(s|s',a')\\ &=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d^\pi(s,a) \end{aligned}$$
The intuition: the mass arriving at the current $(s,a)$ = the mass placed there at the start, $(1-\gamma)d_0(s)\pi(a|s)$, plus the mass flowing in from elsewhere, $\gamma \pi(a|s)\sum_{s',a'}d^\pi(s',a')T(s|s',a')$. This defines an additional operator, the transpose policy transition operator $P^\pi_*$: the reverse-direction operator that carries visitation mass from the other pair $(s',a')$ into the current $(s,a)$ through the transition $T(s|s',a')$ and the policy $\pi(a|s)$.
  - In the definition of the joint distribution $d^\pi(s,a)$, be careful to distinguish the two operators:
    - the forward operator, i.e. the policy transition operator $P^\pi$: it moves from the current state-action pair to the next one (acting on value functions);
    - the reverse operator, i.e. the transpose policy transition operator $P^\pi_*$: it moves from the next state-action pair back to the current one (acting on visitation distributions).
- Policy Objective, or Policy Value (computing it is what we call Policy Evaluation):
$$\begin{aligned} J(\pi)&=(1-\gamma)\,\mathbb E_{\tau\sim \pi(\tau)}\left[\sum_{t=0}^\infty \gamma^tr(s_t,a_t)\right]\\ &=(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(a_0|s_0),s_1\sim T(\cdot|s_0,a_0),\cdots}\left[\sum_{t=0}^\infty \gamma^tr(s_t,a_t)\right]\\ &=(1-\gamma)\mathbb E_{s_0\sim d_0(s)}\left[V^\pi(s_0)\right]\\ &=(1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q^\pi(s_0,a_0)\right]\\ &=\mathbb E_{(s,a)\sim d^\pi(s,a)}\left[r(s,a)\right] \end{aligned}$$
(Throughout, $J(\pi)$ carries the $(1-\gamma)$ normalization, so that it equals the expected reward under $d^\pi$.)
- Policy Gradient, used for Policy Optimization:
$$\nabla J(\pi)=\mathbb E_{(s,a)\sim d^\pi(s,a)}\left[Q^\pi(s,a)\nabla \log \pi(a|s)\right]$$
The formulas here simply lay out the relationships between the (not yet parameterized) objects; nothing is being optimized or computed yet — this is purely problem definition, i.e. the modeling step. The main things to understand are the following four identities (a small numerical sketch follows the list):
- $Q^\pi(s_t,a_t)=r(s_t,a_t)+\gamma P^\pi Q^\pi(s_t,a_t)$
- $d^\pi(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d^\pi(s,a)$
- $J(\pi)=\mathbb E_{(s,a)\sim d^\pi(s,a)}\left[r(s,a)\right]$
- $\nabla J(\pi)=\mathbb E_{(s,a)\sim d^\pi(s,a)}\left[Q^\pi(s,a)\nabla \log \pi(a|s)\right]$
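As a sanity check on these four objects, here is a minimal tabular sketch (a hypothetical toy MDP, NumPy only, not any particular paper's setup) that builds $P^\pi$ and $P^\pi_*$ as matrices over state-action pairs and verifies that the Bellman identity, the flow equation, and the two expressions for $J(\pi)$ are consistent:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical toy MDP: dynamics T[s, a, s'], rewards r[s, a], policy pi[s, a]
T = rng.dirichlet(np.ones(nS), size=(nS, nA))   # T[s, a, :] sums to 1
r = rng.uniform(size=(nS, nA)).reshape(nS * nA)
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, :] sums to 1
d0 = rng.dirichlet(np.ones(nS))                 # initial state distribution

# Policy transition operator P^pi as an (nS*nA) x (nS*nA) matrix:
# P[(s,a), (s',a')] = T(s'|s,a) * pi(a'|s'); its transpose plays the role of P^pi_*
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(nS * nA, nS * nA)
mu0 = (d0[:, None] * pi).reshape(nS * nA)       # d0(s) * pi(a|s)

# Q^pi solves the Bellman identity Q = r + gamma P^pi Q
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P, r)

# d^pi solves the flow equation d = (1-gamma) d0 pi + gamma P^pi_* d
d = np.linalg.solve(np.eye(nS * nA) - gamma * P.T, (1 - gamma) * mu0)

# The two expressions for J(pi) agree
J_from_Q = (1 - gamma) * mu0 @ Q                # (1-gamma) E_{d0, pi}[Q^pi(s0, a0)]
J_from_d = d @ r                                # E_{d^pi}[r(s, a)]
print(J_from_Q, J_from_d, np.isclose(J_from_Q, J_from_d))
```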
2. The Policy Evaluation Optimization Problem
2.1 Q-LP Policy Evaluation
(Q refers to the Q-function, LP to Linear Program.)
Goal: given a policy $\pi$, measure how good (or bad) it is.
Approach 1: let the policy interact with the environment to collect experience, and from that experience learn a $Q^{\pi}(s,a)$ that measures the policy's quality.
So how do we learn this $Q^{\pi}(s,a)$?
Exactly: we use the known identity $Q^\pi(s_t,a_t)=r(s_t,a_t)+\gamma P^\pi Q^\pi(s_t,a_t)$.
The optimization problem reads as follows (among all Q satisfying the constraint, pick the one that minimizes the policy value; here $Q(s,a)$ is the unknown variable whose solution is $Q^{\pi}(s,a)$):
$$\begin{aligned} J(\pi)&=\min_Q\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]\\ \text{s.t. }\quad Q(s,a)&\geq r(s,a)+\gamma P^\pi Q(s,a)\quad\forall (s,a)\in \mathcal{S\times A} \end{aligned}$$
(Explanation: the objective $(1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]$ would not need a min at all if we kept the equality constraint $Q(s,a)=r(s,a)+\gamma P^\pi Q(s,a)=\mathcal B^\pi Q(s,a)$. But the equality constraint is very rigid, and since the Bellman operator $\mathcal B^\pi$ is monotone, the trick is to take the min over Q in the objective, which lets us relax the equality constraint to an inequality constraint.)
Its theoretical solution (essentially the theoretical root of the DQN family of methods): under this objective and constraint, the $Q^*(s,a)$ obtained by iteration is exactly $Q^{\pi}(s,a)$. The proof is elementary; see the detailed discussion of Bellman operator contraction in Sutton's RL book. In short, let $\mathcal B^\pi Q(s,a)=r(s,a)+\gamma P^\pi Q(s,a)$, initialize some $Q_0(s,a)$, and keep applying the Bellman operator, i.e. $\mathcal B^\pi Q_0(s,a)$; then
$$Q_0\rightarrow\mathcal B^\pi Q_0\rightarrow (\mathcal B^\pi)^2 Q_0\rightarrow \cdots \rightarrow (\mathcal B^\pi)^\infty Q_0=Q^\pi$$
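A minimal self-contained sketch of this fixed-point iteration (same kind of hypothetical toy MDP as in the earlier snippet):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(nS), size=(nS, nA))              # T[s, a, s']
r = rng.uniform(size=(nS, nA)).reshape(nS * nA)
pi = rng.dirichlet(np.ones(nA), size=nS)
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(nS * nA, nS * nA)

Q_exact = np.linalg.solve(np.eye(nS * nA) - gamma * P, r)  # fixed point of B^pi

# Repeatedly apply the Bellman operator: B^pi Q = r + gamma P^pi Q
Q = np.zeros(nS * nA)
for _ in range(500):
    Q = r + gamma * P @ Q

print(np.max(np.abs(Q - Q_exact)))   # close to 0: (B^pi)^k Q_0 converges to Q^pi
```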
Approach 2: let the policy interact with the environment to collect experience, and from that experience learn the joint distribution $d^{\pi}(s,a)$ to measure the policy's quality.
So how do we learn this $d^{\pi}(s,a)$?
Exactly: we use the "stationary flow" condition $d^\pi(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d^\pi(s,a)$ and solve for it iteratively.
The optimization problem reads as follows (here $d(s,a)$ is the unknown variable whose solution is $d^\pi(s,a)$):
$$\begin{aligned} J(\pi)&=\max_{d:\mathcal{S\times A}\rightarrow\mathbb R_{+}}\mathbb E_{(s,a)\sim d(s,a)}[r(s,a)]\\ \text{s.t. }\ d(s,a)&=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d(s,a) \end{aligned}$$
Its theoretical solution: the constraint amounts to solving a linear system:
$$\begin{aligned} &(I-\gamma P^\pi_*)d(s,a)=(1-\gamma)d_0(s)\pi(a|s)\\ &\rightarrow d(s,a)=(1-\gamma)(I-\gamma P^\pi_*)^{-1}d_0(s)\pi(a|s)\\ &\rightarrow d(s,a)=(1-\gamma)\sum_{t=0}^\infty \gamma^t(P^\pi_*)^t\,d_0(s)\pi(a|s)\\ &\rightarrow d(s,a)=(1-\gamma)\sum_{t=0}^\infty \gamma^t\Pr(s_t=s,a_t=a\mid s_0\sim d_0(s),a_t\sim \pi(\cdot|s_t),s_{t+1}\sim T(\cdot|s_t,a_t))\\ &\rightarrow d(s,a)=d^\pi(s,a) \end{aligned}$$
There is one step here worth working through yourself to deepen understanding: how $(I-\gamma P^\pi_*)^{-1}$ turns into $\sum_{t=0}^\infty \gamma^t (P^\pi_*)^t$, and why that is equivalent to the probabilistic definition of $d^\pi(s,a)$, i.e. the big $\Pr$ expression.
(A hint: to derive it, first treat each time step as its own random variable; second, keep track of each variable's value space; third, in the discrete view, regard $Q(s,a)$ and $d(s,a)$ as points over the space $\Omega:\mathcal{S\times A}$, i.e. as vectors or matrices.)
(Another hint: viewing the value function Q or the joint distribution d as a matrix, with $|\mathcal S|=m$ and $|\mathcal A|=n$, gives $d(s,a)\in \mathbb R^{m\times n}$; viewing it as a vector gives $d(s,a)\in \mathbb R^{mn}$.)
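The step in question is just a Neumann series; the standard argument, written out under the vector/matrix view from the hints above, is:
$$\begin{aligned} (I-\gamma P^\pi_*)\sum_{t=0}^{N}\gamma^t (P^\pi_*)^t &= I-\gamma^{N+1}(P^\pi_*)^{N+1}\;\longrightarrow\; I \quad (N\to\infty,\ \text{since } \gamma<1 \text{ and } P^\pi_* \text{ is a stochastic operator}),\\ \text{hence}\quad (I-\gamma P^\pi_*)^{-1}&=\sum_{t=0}^{\infty}\gamma^t (P^\pi_*)^t, \qquad\text{and}\qquad (P^\pi_*)^t\big[d_0(s)\pi(a|s)\big]=\Pr(s_t=s,a_t=a), \end{aligned}$$
which is exactly the probabilistic definition of $d^\pi(s,a)$ once the $(1-\gamma)\gamma^t$ weights are summed.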
So what the theoretical solutions above say is: if we solve these objective-plus-constraint optimization problems, we obtain the quantities needed to evaluate the policy $\pi$, namely $Q^\pi(s,a)$ and $d^\pi(s,a)$, and from them compute $J(\pi)$. In practice, however, as soon as either the state or the action space is not discrete, the closed-form solution is no longer available.
Essentially: from the optimization viewpoint, the LP dual of Approach 1 is exactly Approach 2, with $d(s,a)$ as the dual variable. (Try plugging the two problems into the LP Duality template and working out their relationship yourself; it is genuinely elegant!)
2.2 The Lagrange Dual of Policy Evaluation
The previous section was, at its core, an application of LP Duality to the problem below: the policy's joint distribution $d^\pi(s,a)$ is in fact the dual variable. We can therefore apply Lagrange Duality directly.
The primal problem that models the value function:
$$\begin{aligned} J(\pi)&=\min_Q\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]\\ \text{s.t. }\quad Q(s,a)&\geq r(s,a)+\gamma P^\pi Q(s,a)\quad\forall (s,a)\in \mathcal{S\times A} \end{aligned}$$
Build the Lagrangian on the primal problem (recall that the Lagrangian can itself be derived from the Fenchel conjugate):
$$J(\pi)=\min_Q\max_{d\geq0}\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\sum_{s,a}d(s,a)\Big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big)$$
$$J(\pi)=\min_Q\max_{d\geq0}\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d(s,a)}\Big[r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big]$$
So its dual problem is the max-min:
$$\max_{d\geq0}\min_Q\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d(s,a)}\Big[r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big]$$
Now let us formally bring in the Offline RL setting:
- In theory, the $(s,a)$ data used by this $J(\pi)$ are samples obtained by letting the policy $\pi$ interact with the environment.
- In Offline RL, however, all we have is an offline dataset of unknown provenance, $\mathcal D=\{(s^i_t,a^i_t,s^i_{t+1},a^i_{t+1})\}_{i=1}^N$; we cannot interact with the environment.
- So we extract from the dataset its state-action joint distribution, $d^{\mathcal D}(s,a)$.
- We then use Importance Sampling to exploit the data in $\mathcal D$ and carry out Policy Evaluation.
$$\begin{aligned} J(\pi)&=\min_Q\max_{d\geq0}\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d^\pi(s,a)}\Big[r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big]\\ &=\min_Q\max_{d\geq0}\ (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\int \frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}d^{\mathcal D}(s,a)\Big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big)\\ &=\min_Q\max_{d\geq0}\ (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d^{\mathcal D}(s,a)} \left[\frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}\Big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big)\right]\\ &=\min_Q\max_{\eta\geq0}\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d^{\mathcal D}(s,a)} \left[\eta(s,a)\Big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big)\right]\\ &\approx \min_Q\max_{\eta\geq0}\ (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a,r,s')\sim d^{\mathcal D},\,a'\sim \pi(\cdot|s')} \left[\eta(s,a)\Big(r(s,a)+\gamma Q(s',a')-Q(s,a)\Big)\right]\\ &=\min_Q\max_{\eta\geq 0}L(Q,\eta)\\ &\geq \max_{\eta\geq 0}\min_QL(Q,\eta) \end{aligned}$$
The problem can then be tackled with solution methods for min-max problems:
Faster saddle-point optimization for solving large-scale Markov decision processes
Minimax Weight and Q-Function Learning for Off-Policy Evaluation
Ignoring the details of how it is solved, the big picture is:
Given $d^{\mathcal D}$ and a policy $\pi$, the goal is to evaluate it, i.e. obtain $J(\pi)$. To do so we solve a min-max problem: sample $(s,a,s')$ from $d^{\mathcal D}$ and $a'$ from $\pi(\cdot|s')$, solve the max problem over $\eta$, then the min problem over $Q$, and iterate the two until we obtain $\eta^*$ or $Q^*$; Policy Evaluation, i.e. $J(\pi)$, is then done.
Why "or"? Because once we have $\eta^*$, we get $J(\pi)=\mathbb E_{(s,a)\sim d^{\mathcal D}}[\eta^*(s,a)r(s,a)]$.
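A minimal sketch of assembling the empirical objective $\hat L(Q,\eta)$ from such a batch (hypothetical tabular $Q$ and $\eta$ and a random stand-in batch; this only shows how the estimator is built, not how the saddle point is solved):

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, batch = 4, 3, 0.9, 512

# Hypothetical stand-ins: a tabular Q and eta, a target policy pi, an initial
# state distribution d0, and a random offline batch of (s, a, r, s') tuples.
Q = rng.normal(size=(nS, nA))
eta = rng.uniform(size=(nS, nA))          # importance weights eta(s, a) >= 0
pi = rng.dirichlet(np.ones(nA), size=nS)  # target policy pi(a|s)
d0 = rng.dirichlet(np.ones(nS))           # initial state distribution

s = rng.integers(nS, size=batch)          # (s, a, r, s') ~ D (random stand-in)
a = rng.integers(nA, size=batch)
rew = rng.uniform(size=batch)
s2 = rng.integers(nS, size=batch)

def empirical_lagrangian(Q, eta):
    # First term: (1 - gamma) E_{s0 ~ d0, a0 ~ pi}[Q(s0, a0)], computable exactly here.
    term0 = (1 - gamma) * np.sum(d0[:, None] * pi * Q)
    # Second term: E_D[eta(s, a) (r + gamma Q(s', a') - Q(s, a))], with a' ~ pi(.|s').
    a2 = np.array([rng.choice(nA, p=pi[si]) for si in s2])
    delta = rew + gamma * Q[s2, a2] - Q[s, a]
    return term0 + np.mean(eta[s, a] * delta)

print(empirical_lagrangian(Q, eta))
```

Actually carrying out the $\min_Q\max_{\eta\geq0}$ over this estimator is the hard part; that saddle-point optimization is precisely what the two papers above address.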
To summarize Q-LP Policy Evaluation:
We modeled it in two different forms — evaluating $J(\pi)$ through the value $Q^\pi(s,a)$, and evaluating $J(\pi)$ through the policy's joint distribution $d^\pi(s,a)$. The two are in fact primal and dual of each other, and the duality relation is an LP one.
The Q-LP primal problem:
$$\begin{aligned} J(\pi)&=\min_Q\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]\\ \text{s.t. }\quad Q(s,a)&\geq r(s,a)+\gamma P^\pi Q(s,a)\quad\forall (s,a)\in \mathcal{S\times A} \end{aligned}$$
The Q-LP dual problem:
$$\begin{aligned} J(\pi)&=\max_{d:\mathcal{S\times A}\rightarrow\mathbb R_{+}}\mathbb E_{(s,a)\sim d(s,a)}[r(s,a)]\\ \text{s.t. }\quad d(s,a)&=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d(s,a)\quad\forall (s,a)\in \mathcal{S\times A} \end{aligned}$$
(The dual problem has one drawback: it uses equality constraints, and the number of constraints scales with the possible values of $(s,a)$. This is overly strong — a $d$ satisfying all the equality constraints exactly may no longer be available in practice.)
The fact that they are LP duals of each other reveals that $d(s,a)$ is the dual variable of the Q-LP primal problem, and conversely $Q(s,a)$ is the dual variable of the Q-LP dual problem (both problems are convex).
Therefore, applying Lagrange Duality to the Q-LP primal problem to turn it into an unconstrained problem gives:
$$J(\pi)=\min_Q\max_{d\geq0}\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d(s,a)}\Big[r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big]$$
After bringing in the offline dataset and applying the Importance Sampling trick, we get:
$$J(\pi)=\min_Q\max_{\eta\geq0}\ (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a,r,s')\sim d^{\mathcal D},\,a'\sim \pi(\cdot|s')} \left[\eta(s,a)\Big(r(s,a)+\gamma Q(s',a')-Q(s,a)\Big)\right]$$
and the dual problem we actually optimize is:
$$\max_{\eta\geq0}\min_Q\ (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a,r,s')\sim d^{\mathcal D},\,a'\sim \pi(\cdot|s')} \left[\eta(s,a)\Big(r(s,a)+\gamma Q(s',a')-Q(s,a)\Big)\right]$$
So Policy Evaluation is first written as a min-max problem, and we then optimize its dual, the max-min problem. When the object being fitted is $\eta(s,a)=\frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}$, this is exactly the most primitive duality-based objective in Offline RL; the details of directly optimizing this max-min objective are in the two papers cited above [1][2].
2.3 Changing the Q-LP Objective
The problem setting is now Offline RL, so all we have is an offline dataset.
The Q-LP dual problem:
$$\begin{aligned} J(\pi)&=\max_{d:\mathcal{S\times A}\rightarrow\mathbb R_{+}}\mathbb E_{(s,a)\sim d(s,a)}[r(s,a)]\\ \text{s.t. }\quad d(s,a)&=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d(s,a)\quad\forall (s,a)\in \mathcal{S\times A} \end{aligned}$$
Change the dual problem's objective to the following (minimize the f-divergence $D_f$ between the policy's joint distribution $d^\pi(s,a)$ and the dataset's state-action joint distribution $d^{\mathcal D}(s,a)$; this implicitly assumes $\mathcal D$ is an expert dataset):
$$\begin{aligned} &\max_{d:\mathcal{S\times A}\rightarrow\mathbb R_{+}} -D_f(d\,||\,d^{\mathcal D})\\ &\text{s.t. }\quad d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d(s,a)\quad\forall (s,a)\in \mathcal{S\times A} \end{aligned}$$
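For concreteness, two standard choices of the generator $f$ (the second one is what the worked example in 2.3.2 uses), with $D_f(d\,||\,d^{\mathcal D})=\mathbb E_{(s,a)\sim d^{\mathcal D}}\big[f\big(\tfrac{d(s,a)}{d^{\mathcal D}(s,a)}\big)\big]$:
$$\begin{aligned} f(x)=x\log x &\quad\Longrightarrow\quad D_f(d\,||\,d^{\mathcal D})=\mathrm{KL}(d\,||\,d^{\mathcal D})\\ f(x)=\tfrac{1}{2}x^2 &\quad\Longrightarrow\quad D_f(d\,||\,d^{\mathcal D})=\tfrac{1}{2}\,\mathbb E_{(s,a)\sim d^{\mathcal D}}\!\left[\left(\frac{d(s,a)}{d^{\mathcal D}(s,a)}\right)^{2}\right] \end{aligned}$$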
2.3.1 Applying Lagrange Duality to the problem above (do not forget that $Q(s,a)$ is its dual variable)
$$\begin{aligned} \max_d \min_Q L(Q,d)&=\max_d \min_Q\ -D_f(d||d^{\mathcal D})+\sum_{s,a}Q(s,a)\Big((1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d(s,a)-d(s,a)\Big)\\ &=(1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]-D_f(d||d^{\mathcal D})+\sum_{s,a}Q(s,a)\Big(\gamma P^\pi_*d(s,a)-d(s,a)\Big)\quad(1)\\ &=(1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]-D_f(d||d^{\mathcal D})+\sum_{s,a}d(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)\quad(2)\\ &=(1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]-\mathbb E_{(s,a)\sim d^{\mathcal D}}\left[f\!\left(\frac{d(s,a)}{d^{\mathcal D}(s,a)}\right)\right]+\sum_{s,a}d(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)\\ &\stackrel{\eta(s,a)=\frac{d(s,a)}{d^{\mathcal D}(s,a)}}{\Longrightarrow}(1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]-\mathbb E_{(s,a)\sim d^{\mathcal D}}\left[f(\eta(s,a))\right]+\mathbb E_{(s,a)\sim d^\mathcal D(s,a)}\left[\eta(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)\right]\\ &=(1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]+\mathbb E_{(s,a,s')\sim d^\mathcal D}\left[\eta(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)-f(\eta(s,a))\right] \end{aligned}$$
The step $(1)\rightarrow(2)$ deserves a careful derivation of its own (it moves the operator from $d$ onto $Q$, i.e. $\sum_{s,a}Q(s,a)\,P^\pi_*d(s,a)=\sum_{s,a}d(s,a)\,P^\pi Q(s,a)$); the later step marked $\eta(s,a)=\frac{d(s,a)}{d^{\mathcal D}(s,a)}$ is where the importance-sampling trick for Offline RL is applied.
So, after changing the objective and applying Lagrange Duality, the optimization target becomes:
$$\max_{\eta\geq0} \min_Q\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]+\mathbb E_{(s,a,s')\sim d^\mathcal D}\left[\eta(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)-f(\eta(s,a))\right]$$
The algorithm that optimizes this objective concretely is DualDICE [3].
2.3.2 Applying Fenchel Duality to the problem above (this one we derive carefully)
- The optimization problem:
$$\begin{aligned} &\max_{d:\mathcal{S\times A}\rightarrow\mathbb R_{+}} -D_f(d\,||\,d^{\mathcal D})\\ &\text{s.t. }\quad d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d(s,a)\quad\forall (s,a)\in \mathcal{S\times A} \end{aligned}$$
- Fenchel Duality: for a primal problem
$$\min_{x\in D}J_P(x)=\min_{x\in D}f(x)+g(Ax)$$
the dual problem is
$$\max_{u\in D^*}J_D(u)=\max_{u\in D^*}-f^*(-A^\top u)-g^*(u)$$
The first task: set $x:=d(s,a)$, and express the equality constraint $d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d(s,a)$ in the form $g(Ax)$.
This is done with the indicator function:
$$\delta_C(x)= \begin{cases} 0 & \text{if }x\in C \\ +\infty & \text{otherwise} \end{cases}$$
The equality constraint can then be read as:
$$\underbrace{(1-\gamma)d_0(s)\pi(a|s)}_{C}=\underbrace{(I-\gamma P^\pi_*)}_{A}\underbrace{d(s,a)}_{x}$$
The primal problem in Fenchel form is therefore:
$$\min_d\ D_f(d\,||\,d^\mathcal D)+\delta_{\{(1-\gamma)d_0(s)\pi(a|s)\}}\big((I-\gamma P^\pi_*)d(s,a)\big)$$
Looking these up in the conjugate table: the conjugate of a point indicator is $\delta_{\{c\}}^*(y)=\langle c,y\rangle$, and the conjugate of $d\mapsto D_f(d\,||\,p)$ is $y\mapsto\mathbb E_{z\sim p}[f^*(y(z))]$. Plugging these into the Fenchel dual form $\max_{u\in D^*}-f^*(-A^\top u)-g^*(u)$, we can write out the dual problem. (Note that the dual variable $u$ is in fact exactly $Q(s,a)$.)
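For completeness, both lookups follow in one line each from the definition $h^*(y)=\sup_x\langle x,y\rangle-h(x)$:
$$\begin{aligned} \delta_{\{c\}}^*(y)&=\sup_x\ \langle x,y\rangle-\delta_{\{c\}}(x)=\langle c,y\rangle\\ \big(D_f(\cdot\,||\,p)\big)^*(y)&=\sup_d\ \sum_z y(z)d(z)-\sum_z p(z)f\!\left(\tfrac{d(z)}{p(z)}\right)=\sum_z p(z)\,\sup_{\eta}\big(y(z)\eta-f(\eta)\big)=\mathbb E_{z\sim p}\big[f^*(y(z))\big] \end{aligned}$$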
So the dual problem is:
$$\begin{aligned} &\max_Q\ - \mathbb E_{(s,a)\sim d^\mathcal D}\big[f^*\big((\gamma P^\pi_*-I)^\top Q(s,a)\big)\big]-\big\langle (1-\gamma )d_0(s)\pi(a|s),\,Q(s,a)\big\rangle\\ &\rightarrow\min _Q\ \mathbb E_{(s,a)\sim d^\mathcal D}\big[f^*\big(\gamma P^\pi Q(s,a)-Q(s,a)\big)\big]+(1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)] \end{aligned}$$
By the Fenchel Duality results, the primal solution $d^*(s,a)$ and the dual solution $Q^*(s,a)$ are related by (where $(f^*)'$ denotes the derivative of the conjugate of $f$):
$$d^*(s,a)=d^\mathcal D(s,a)\times (f^*)'\big(\gamma P^\pi Q^*(s,a)-Q^*(s,a)\big)$$
If we take $f=\frac{1}{2}x^2$, whose conjugate is $f^*=\frac{1}{2}y^2$, then
$$Q^*=\arg\min_Q\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)] + \frac{1}{2}\mathbb E_{(s,a)\sim d^\mathcal D}\big[\big(\gamma P^\pi Q(s,a)-Q(s,a)\big)^2\big]$$
In this special case, the relation between the primal and dual solutions gives:
$$(f^*)'\big(\gamma P^\pi Q^*(s,a)-Q^*(s,a)\big)=\gamma P^\pi Q^*(s,a)-Q^*(s,a)=\frac{d^*(s,a)}{d^\mathcal D(s,a)}=\frac{d^\pi(s,a)}{d^\mathcal D(s,a)}$$
2.4 Summary (extremely important)
The main line:
All these derivations may obscure the logical thread. Seen as a whole, though, the goal has always been the Policy Evaluation task, and the starting point has always been the on-policy objective.
The one difficulty introduced by the Offline RL setting is that we only have an offline dataset, so estimating the on-policy objective in an off-policy way requires compensating for the distribution mismatch. If that compensation is done with (trajectory-level) importance sampling, we have to compute the product of many ratios, $\frac{\pi_\theta(\tau)}{\pi_\beta(\tau)}$, and that estimator has very high variance.
So, after all the derivations here, we find that it suffices to optimize this objective:
$$Q^*=\arg\min_Q\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)] + \frac{1}{2}\mathbb E_{(s,a)\sim d^\mathcal D}\big[\big(\gamma P^\pi Q(s,a)-Q(s,a)\big)^2\big]$$
Once we have $Q^*$, we can compute $\gamma P^\pi Q^*(s,a)-Q^*(s,a)=\frac{d^\pi(s,a)}{d^\mathcal D(s,a)}$ (for the choice $f=\frac{1}{2}x^2$), and we can thus evaluate the on-policy objective using only the offline dataset, i.e.
$$J(\pi)=\mathbb E_{(s,a,r)\sim d^\mathcal D}\left[\frac{d^\pi(s,a)}{d^\mathcal D(s,a)}\, r(s,a)\right]$$
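The whole main line can be checked end to end on a toy tabular MDP. The sketch below (hypothetical setup, NumPy only) solves the quadratic objective for $Q^*$ in closed form, recovers $\eta=\gamma P^\pi Q^*-Q^*$, and confirms that it equals $d^\pi/d^{\mathcal D}$ and reproduces $J(\pi)$:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical toy MDP, target policy pi, behavior policy beta (defines d^D)
T = rng.dirichlet(np.ones(nS), size=(nS, nA))
r = rng.uniform(size=(nS, nA)).reshape(nS * nA)
pi = rng.dirichlet(np.ones(nA), size=nS)
beta = rng.dirichlet(np.ones(nA), size=nS)
d0 = rng.dirichlet(np.ones(nS))

def pair_matrix(policy):
    # P[(s,a),(s',a')] = T(s'|s,a) * policy(a'|s')
    return (T[:, :, :, None] * policy[None, None, :, :]).reshape(nS * nA, nS * nA)

def visitation(policy):
    mu0 = (d0[:, None] * policy).reshape(nS * nA)
    return np.linalg.solve(np.eye(nS * nA) - gamma * pair_matrix(policy).T,
                           (1 - gamma) * mu0)

P = pair_matrix(pi)
mu0 = (d0[:, None] * pi).reshape(nS * nA)
d_pi = visitation(pi)                 # true d^pi (for the check only)
dD = visitation(beta)                 # dataset distribution d^D > 0

# Objective: (1-gamma) mu0^T Q + 1/2 sum_{s,a} dD(s,a) ((gamma P - I) Q)(s,a)^2
# Its minimizer solves  A^T diag(dD) A Q = -(1-gamma) mu0,  with A = gamma P - I.
A = gamma * P - np.eye(nS * nA)
Q_star = np.linalg.solve(A.T @ (dD[:, None] * A), -(1 - gamma) * mu0)

eta = A @ Q_star                      # eta(s,a) = gamma P^pi Q* - Q*
print(np.allclose(eta, d_pi / dD))    # True: eta recovers the density ratio
print(dD @ (eta * r), d_pi @ r)       # J(pi) estimated from d^D vs. ground truth
```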
The supporting lines:
- Modeling the evaluation of $J(\pi)$ through the value function $Q^\pi(s,a)$ and modeling it through the policy's joint distribution $d^\pi(s,a)$ are two views of the same object $J(\pi)$ that are essentially related by LP Duality; the LP dual form has too many constraints to be solved directly.
- Reworking the LP with Lagrange Duality, the estimation of $J(\pi)$ ends up as a min-max problem, whose optimization easily gets stuck around saddle points.
- Reworking the LP by changing the objective and then applying Fenchel Duality, the estimation of $J(\pi)$ ends up as a pure min problem, which can be optimized with ordinary gradient-based methods.
3. The Policy Optimization Problem
Evaluating a policy's value is not enough. Usually we want the policy itself, not just its evaluated value $J(\pi)$.
3.1 Basic Theory
Take the Lagrange-Duality formulation of the evaluation quantity $J(\pi)$ and differentiate it directly with respect to the policy (assume the min-max problem has already been solved approximately, so $Q^*\approx Q^\pi$ and $d^*\approx d^\pi$, and then take the gradient with respect to the parameterized policy):
$$\begin{aligned} \nabla_\theta J(\pi_\theta)&=\nabla_\theta \left(\min_Q\max_{d\geq0}\ (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi_\theta(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d(s,a)}\Big[r(s,a)+\gamma P^{\pi_\theta} Q(s,a)-Q(s,a)\Big]\right)\\ &=\nabla_\theta \left( (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi_\theta(\cdot|s_0)}\left[Q^*(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d^*(s,a)}\Big[r(s,a)+\gamma P^{\pi_\theta} Q^*(s,a)-Q^*(s,a)\Big]\right)\\ &= (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi_\theta(\cdot|s_0)}\left[Q^*(s_0,a_0)\nabla_\theta \log \pi_\theta(a_0|s_0)\right] +\nabla_\theta\, \mathbb E_{(s,a)\sim d^*(s,a)}\Big[r(s,a)+\gamma P^{\pi_\theta} Q^*(s,a)-Q^*(s,a)\Big]\\ &= (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi_\theta(\cdot|s_0)}\left[Q^*(s_0,a_0)\nabla_\theta \log \pi_\theta(a_0|s_0)\right] + \mathbb E_{(s,a)\sim d^*(s,a)}\Big[\gamma \nabla_\theta\, P^{\pi_\theta} Q^*(s,a)\Big]\\ &= (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi_\theta(\cdot|s_0)}\left[Q^*(s_0,a_0)\nabla_\theta \log \pi_\theta(a_0|s_0)\right] + \gamma\,\mathbb E_{(s,a)\sim d^*(s,a),\,s'\sim T(\cdot|s,a),\,a'\sim \pi_\theta(\cdot|s')}\Big[ Q^*(s',a')\, \nabla_\theta \log\pi_\theta(a'|s') \Big]\\ &=\mathbb E_{(s,a)\sim d^*}[Q^*(s,a)\nabla_\theta \log \pi_\theta(a|s)] \end{aligned}$$
So as long as the min-max solution satisfies $Q^*\approx Q^\pi$ and $d^*\approx d^\pi$, differentiating the Lagrange-Duality objective is equivalent to the on-policy policy gradient.
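A minimal sketch verifying the resulting identity $\nabla_\theta J=\mathbb E_{d^\pi}[Q^\pi\nabla_\theta\log\pi_\theta]$ on a toy tabular MDP with a softmax policy (hypothetical setup; the exact $d^\pi$ and $Q^\pi$ stand in for $d^*$ and $Q^*$, and the gradient is checked against finite differences):

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(nS), size=(nS, nA))
r = rng.uniform(size=(nS, nA)).reshape(nS * nA)
d0 = rng.dirichlet(np.ones(nS))
theta = rng.normal(size=(nS, nA))             # softmax policy parameters

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def evaluate(theta):
    pi = softmax(theta)
    P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(nS * nA, nS * nA)
    mu0 = (d0[:, None] * pi).reshape(nS * nA)
    d = np.linalg.solve(np.eye(nS * nA) - gamma * P.T, (1 - gamma) * mu0)
    Q = np.linalg.solve(np.eye(nS * nA) - gamma * P, r)
    return d @ r, d, Q                        # J(pi_theta), d^pi, Q^pi

# Analytic gradient: E_{(s,a)~d^pi}[ Q^pi(s,a) * grad_theta log pi_theta(a|s) ]
_, d, Q = evaluate(theta)
pi = softmax(theta)
grad = np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        glogpi = -pi[s].copy()
        glogpi[a] += 1.0                      # d/dtheta[s,:] of log pi(a|s)
        grad[s] += d[s * nA + a] * Q[s * nA + a] * glogpi

# Finite-difference check of grad J(theta)
eps, fd = 1e-5, np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        e = np.zeros_like(theta)
        e[s, a] = eps
        fd[s, a] = (evaluate(theta + e)[0] - evaluate(theta - e)[0]) / (2 * eps)

print(np.max(np.abs(grad - fd)))              # close to 0: the two gradients match
```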
3.2 Offline Policy Gradient via Lagrangian
Following the basic theory, and applying the importance-sampling trick, it suffices to take gradients of the following objective.
$$\max_\pi\min_Q\max_{\eta\geq0} L(Q,\eta,\pi)=(1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]+\mathbb E_{(s,a,s')\sim d^\mathcal D,\,a'\sim \pi(\cdot|s')}\left[\eta(s,a)\Big(r(s,a)+\gamma Q(s',a')-Q(s,a)\Big)\right]$$
3.3 Offline Policy Gradient via Fenchel Duality
Consider the primal problem:
$$\begin{aligned} &\max_{d:\mathcal{S\times A}\rightarrow\mathbb R_{+}} \mathbb E_{(s,a)\sim d(s,a)}[r(s,a)]-D_f(d\,||\,d^{\mathcal D})\\ &\text{s.t. }\quad d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_*d(s,a)\quad\forall (s,a)\in \mathcal{S\times A} \end{aligned}$$
After applying Fenchel Duality, the objective [4] becomes:
$$\max_\pi\min_Q\ (1-\gamma)\mathbb E_{s_0\sim d_0(s),a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]+\log \mathbb E_{(s,a)\sim d^\mathcal D}\left[\exp{\big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\big)}\right]$$
For a fixed Q, the gradient of the objective is:
$$\nabla_\theta J(\pi_\theta)= (1-\gamma) \mathbb E_{s_0\sim d_0(s),a_0\sim \pi_\theta(\cdot|s_0)}\left[Q(s_0,a_0)\nabla_\theta \log \pi_\theta(a_0|s_0)\right] + \mathbb E_{(s,a,s')\sim d^\mathcal D,\,a'\sim \pi(\cdot|s')}\Big[\text{softmax}_{d^\mathcal D}\big(r+\gamma P^\pi Q-Q\big)(s,a)\cdot Q(s',a')\,\nabla_\theta \log \pi_\theta (a'|s')\Big]$$
where $\text{softmax}_p(h)(z):=\frac{\exp(h(z))}{\mathbb E_{\tilde z \sim p}[\exp(h(\tilde z))]}$.
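In sample form this softmax weight is just a self-normalized exponential of the Bellman residual over the batch. A minimal sketch of the second (weighted) term of the gradient, with a hypothetical random batch and a tabular softmax policy:

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma, batch = 4, 3, 0.9, 256

theta = rng.normal(size=(nS, nA))                      # softmax policy parameters
pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
Q = rng.normal(size=(nS, nA))                          # a fixed Q, as in the text

# Stand-in offline batch (s, a, r, s') and a' ~ pi(.|s')
s = rng.integers(nS, size=batch)
a = rng.integers(nA, size=batch)
rew = rng.uniform(size=batch)
s2 = rng.integers(nS, size=batch)
a2 = np.array([rng.choice(nA, p=pi[si]) for si in s2])

# Self-normalized softmax weights of the residual r + gamma Q(s',a') - Q(s,a)
resid = rew + gamma * Q[s2, a2] - Q[s, a]
w = np.exp(resid - resid.max())
w = w / w.mean()                                       # approximates softmax_{d^D}(.)

# Second term of the gradient: mean over the batch of w * Q(s',a') * grad log pi(a'|s')
grad = np.zeros_like(theta)
for i in range(batch):
    glogpi = -pi[s2[i]].copy()
    glogpi[a2[i]] += 1.0                               # grad_theta[s',:] of log pi(a'|s')
    grad[s2[i]] += w[i] * Q[s2[i], a2[i]] * glogpi / batch
print(grad)
```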
Summary
That is the duality-based family of Offline RL methods. It looks complicated, with a pile of derivation details.
Plainly put: we modeled Policy Evaluation as an optimization problem (objective plus constraints) and uncovered the LP duality relation between the two viewpoints.
Then we reshaped the optimization problem with Lagrange Duality and Fenchel Duality, sometimes tweaking the objective first, so that it fits the offline setting. That is all.
If you are familiar with Fenchel Duality, none of this is actually hard. A follow-up post will walk through DualDICE at the code level, to see how these objectives are used and implemented concretely.