Offline Reinforcement Learning (3): Applications of Duality

Overview

A review of the three forms of duality:

  1. Fenchel Duality
    Given a primal problem
    $$\min_{x\in D}J_P(x)=\min_{x\in D}f(x)+g(Ax),$$
    its dual problem is
    $$\max_{u\in D^*}J_D(u)=\max_{u\in D^*}-f^*(-A^\top u)-g^*(u).$$
    Under suitable conditions the primal and dual problems are equivalent. (A table of conjugates of commonly used functions appears here in the original post.)

  2. Lagrange Duality
    $$\begin{aligned} p^*&=\min_x J_P(x)=\min_x\max_{\alpha,\beta:\,\alpha_i\geq 0} L(x,\alpha,\beta)\\ d^*&=\max_{\alpha,\beta:\,\alpha_i\geq0}J_D(\alpha,\beta)=\max_{\alpha,\beta:\,\alpha_i\geq0}\min_x L(x,\alpha,\beta)\\ d^*&\leq p^* \end{aligned}$$
    with equality (strong duality) under suitable conditions, e.g., when the KKT conditions hold.

  3. Linear Program Duality (LP Duality)

    $$\begin{aligned} &\min_x c^\top x\quad \text{s.t. } x\geq0,\; Ax=b\\ &\max_\mu b^\top \mu\quad \text{s.t. } A^\top \mu\leq c \end{aligned}$$
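As a quick sanity check of LP duality, the sketch below solves a tiny primal LP and its dual numerically and confirms that the two optimal values coincide (the particular $c$, $A$, $b$ are arbitrary assumptions for illustration, and scipy is assumed to be available):

```python
# Sketch: numerically check strong LP duality on a tiny problem.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0, 3.0])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])

# Primal:  min c^T x   s.t.  x >= 0, A x = b
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)

# Dual:    max b^T mu  s.t.  A^T mu <= c  (mu free), written as minimizing -b^T mu
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

print("primal optimum:", primal.fun)    # c^T x*
print("dual optimum:  ", -dual.fun)     # b^T mu*  (equal to the primal optimum)
```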

For the two optimization problems in offline RL, policy evaluation and policy optimization, we first write down the primal problem and add constraints that reflect the problem structure.
We then use duality to turn the modeled problem into its dual, and solve the primal by optimizing the dual.

Why duality in the first place: duality was introduced to address the high-variance problem that arises when importance sampling is applied to an offline dataset.

1. Basic terminology for modeling offline RL

The previous post used the finite-horizon formulation; here we restate everything in the infinite-horizon (discounted) setting.

  1. Q-value $Q^\pi(s,a)$
    $$\begin{aligned} Q^\pi(s_t,a_t)&=\mathbb E_{s_{t+1}\sim T(\cdot|s_t,a_t),\,a_{t+1}\sim \pi(\cdot|s_{t+1})}\left[\sum_{t'=t}^\infty\gamma^{t'-t}r(s_{t'},a_{t'})\right]\\ &=r(s_t,a_t)+\gamma\,\mathbb E_{s_{t+1}\sim T(\cdot|s_t,a_t),\,a_{t+1}\sim \pi(\cdot|s_{t+1})}\left[\sum_{t'=t+1}^\infty\gamma^{t'-(t+1)}r(s_{t'},a_{t'})\right]\\ &=r(s_t,a_t)+\gamma\,\mathbb E_{s_{t+1}\sim T(\cdot|s_t,a_t),\,a_{t+1}\sim \pi(\cdot|s_{t+1})}\left[Q^\pi(s_{t+1},a_{t+1})\right]\\ &=r(s_t,a_t)+\gamma P^\pi Q^\pi(s_t,a_t) \end{aligned}$$
    where $P^\pi Q^\pi(s_t,a_t)=\mathbb E_{s_{t+1}\sim T(\cdot|s_t,a_t),\,a_{t+1}\sim\pi(\cdot|s_{t+1})}\left[Q^\pi(s_{t+1},a_{t+1})\right]$. $P^\pi$ is called the policy transition operator: starting from the current pair $(s_t,a_t)$, it draws the next state $s_{t+1}$ from the transition dynamics $T(s_{t+1}|s_t,a_t)$, draws the next action $a_{t+1}$ from the policy $\pi(a_{t+1}|s_{t+1})$, and evaluates $Q$ there. The key point of this item is the definition of the operator $P^\pi$.
  2. State-action visitation distribution $d^\pi(s,a)$
  • Definition: the joint distribution induced by a policy on the state-action space $\Omega=\mathcal S\times\mathcal A$:
    $$d^\pi(s,a)=(1-\gamma)\sum_{t=0}^\infty \gamma^t \Pr\big(s_t=s,a_t=a\;\big|\;s_0\sim d_0(s),\,a_t\sim\pi(\cdot|s_t),\,s_{t+1}\sim T(\cdot|s_t,a_t)\big),$$
    where $d_0(s)$ is the initial-state distribution. In words, $d^\pi(s,a)$ is the (discounted) joint distribution over $\Omega$ of the states and actions encountered as the policy $\pi(a|s)$ interacts with the dynamics $T(\cdot|s,a)$.
  • Property: this stationary joint distribution $d^\pi(s,a)$ satisfies a "flow conservation" equation that holds by construction:
    $$\begin{aligned} d^\pi(s,a)&=(1-\gamma)d_0(s)\pi(a|s)+\gamma\, \pi(a|s)\sum_{s',a'}d^\pi(s',a')\,T(s|s',a')\\ &=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d^\pi(s,a). \end{aligned}$$
    Intuitively: the probability mass arriving at the current $(s,a)$ equals the mass injected at the start, $(1-\gamma)d_0(s)\pi(a|s)$, plus the mass flowing in from elsewhere, $\gamma\,\pi(a|s)\sum_{s',a'}d^\pi(s',a')T(s|s',a')$. This defines a second operator, the transpose (adjoint) policy transition operator $P^\pi_*$: the reverse operator that carries mass from the neighboring pair $(s',a')$ through the transition $T(s|s',a')$ and the policy $\pi(a|s)$ into the current $(s,a)$.
  Note: within the definition of $d^\pi(s,a)$, be careful to distinguish the two operators:
  • the forward operator, i.e. the policy transition operator $P^\pi$: from the current pair to the next pair;
  • the reverse operator, i.e. the transpose policy transition operator $P^\pi_*$: from the next pair back to the current pair.
  3. Policy objective, also called policy value; the procedure for computing it is called policy evaluation:
    $$\begin{aligned} J(\pi)&=(1-\gamma)\,\mathbb E_{\tau\sim \pi(\tau)}\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\right]\\ &=(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0),\,s_1\sim T(\cdot|s_0,a_0),\,\dots}\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\right]\\ &=(1-\gamma)\,\mathbb E_{s_0\sim d_0(s)}\left[V^\pi(s_0)\right]\\ &=(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q^\pi(s_0,a_0)\right]\\ &=\mathbb E_{(s,a)\sim d^\pi(s,a)}\left[r(s,a)\right] \end{aligned}$$
    (with the $(1-\gamma)$ normalization so that the last identity holds).
  4. Policy gradient, used for policy optimization:
    $$\nabla J(\pi)=\mathbb E_{(s,a)\sim d^\pi(s,a)}\left[Q^\pi(s,a)\,\nabla \log \pi(a|s)\right]$$

The formulas above simply lay out the relationships between the (not yet parameterized) objects; nothing is being optimized or computed yet. This is just the problem-modeling step. The four relations to keep in mind are:

  • $Q^\pi(s_t,a_t)=r(s_t,a_t)+\gamma P^\pi Q^\pi(s_t,a_t)$
  • $d^\pi(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d^\pi(s,a)$
  • $J(\pi)=\mathbb E_{(s,a)\sim d^\pi(s,a)}\left[r(s,a)\right]$
  • $\nabla J(\pi)=\mathbb E_{(s,a)\sim d^\pi(s,a)}\left[Q^\pi(s,a)\,\nabla \log \pi(a|s)\right]$
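The sketch below makes these objects concrete in the tabular case; the toy random MDP, its sizes, and the variable names are all assumptions for illustration. It builds $P^\pi$ as an $|\mathcal S||\mathcal A|\times|\mathcal S||\mathcal A|$ matrix, uses its transpose as $P^\pi_*$, solves the two fixed-point equations by linear algebra, and checks that $\mathbb E_{d^\pi}[r]=(1-\gamma)\,\mathbb E_{d_0,\pi}[Q^\pi]$.

```python
# Minimal tabular sketch (toy MDP assumed): build P^pi and its transpose P^pi_*,
# then verify the four identities above by direct linear algebra.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

T = rng.random((S, A, S)); T /= T.sum(-1, keepdims=True)    # T(s'|s,a)
r = rng.random((S, A))                                      # r(s,a)
d0 = np.full(S, 1.0 / S)                                    # initial state distribution d0(s)
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)    # policy pi(a|s)

# Policy transition operator as a (S*A) x (S*A) matrix:
# P[(s,a),(s',a')] = T(s'|s,a) * pi(a'|s'); P^pi_* is its transpose.
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)

r_vec = r.reshape(S * A)
d0pi = (d0[:, None] * pi).reshape(S * A)                    # d0(s) * pi(a|s)

# Q^pi = r + gamma * P^pi Q^pi          -> linear system in Q
Q = np.linalg.solve(np.eye(S * A) - gamma * P, r_vec)

# d^pi = (1-gamma) d0*pi + gamma * P^pi_* d^pi  -> linear system in d
d = np.linalg.solve(np.eye(S * A) - gamma * P.T, (1 - gamma) * d0pi)

# J(pi) computed two ways: E_{d^pi}[r] and (1-gamma) E_{d0,pi}[Q^pi]
print("E_{d^pi}[r]           =", d @ r_vec)
print("(1-gamma) E[Q(s0,a0)] =", (1 - gamma) * d0pi @ Q)
print("d^pi sums to 1:", d.sum())
```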

2. The optimization problem for policy evaluation

2.1 Q-LP Policy Evaluation

(Q refers to the Q-function; LP refers to Linear Program.)
Goal: given a policy $\pi$, measure how good (or bad) it is.


Approach 1: roll out the policy in the environment to collect experience, and learn a $Q^{\pi}(s,a)$ from that experience to measure how good the policy is.
How do we learn this $Q^{\pi}(s,a)$?
Exactly: by using the known relation $Q^\pi(s_t,a_t)=r(s_t,a_t)+\gamma P^\pi Q^\pi(s_t,a_t)$.

The optimization problem reads as follows (among all $Q$ satisfying the constraint, pick the one that minimizes the policy value; here $Q(s,a)$ is the unknown variable used to recover $Q^{\pi}(s,a)$):

$$\begin{aligned} J(\pi)=\min_Q\;& (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]\\ \text{s.t. }\;& Q(s,a)\geq r(s,a)+\gamma P^\pi Q(s,a)\quad\forall (s,a)\in \mathcal{S}\times\mathcal{A} \end{aligned}$$

(Explanation: the objective $(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]$ would not need a $\min$ at all if we kept the equality constraint $Q(s,a)=r(s,a)+\gamma P^\pi Q(s,a)=\mathcal B^\pi Q(s,a)$. But the equality constraint is very rigid, and the Bellman operator $\mathcal B^\pi$ is monotone, so the trick is to take a $\min$ over $Q$ in the objective, which lets the equality constraint be relaxed into an inequality.)

Its theoretical solution (essentially the theoretical basis of DQN-style methods): under this objective and constraint, the $Q^*(s,a)$ obtained by iteration is exactly $Q^{\pi}(s,a)$. The proof is standard; see the detailed discussion of Bellman operator contraction in Sutton and Barto's book. In short, define $\mathcal B^\pi Q(s,a)=r(s,a)+\gamma P^\pi Q(s,a)$, initialize some $Q_0(s,a)$, and apply the Bellman operator repeatedly: $$Q_0\rightarrow\mathcal B^\pi Q_0\rightarrow (\mathcal B^\pi)^2 Q_0\rightarrow \cdots \rightarrow (\mathcal B^\pi)^\infty Q_0=Q^\pi$$
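In the tabular case this contraction is easy to see numerically; a minimal sketch (toy MDP assumed, as before):

```python
# Sketch: repeated application of the Bellman operator B^pi converges to Q^pi (toy tabular MDP).
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
T = rng.random((S, A, S)); T /= T.sum(-1, keepdims=True)
r = rng.random((S, A)).reshape(S * A)
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)
P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)   # P^pi

Q_exact = np.linalg.solve(np.eye(S * A) - gamma * P, r)               # fixed point of B^pi

Q = np.zeros(S * A)                                                   # arbitrary Q_0
for k in range(200):
    Q = r + gamma * P @ Q                                             # Q_{k+1} = B^pi Q_k
    if k % 50 == 0:
        print(f"iter {k:3d}  ||Q_k - Q^pi||_inf = {np.abs(Q - Q_exact).max():.2e}")
```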

Approach 2: roll out the policy in the environment to collect experience, and learn the joint distribution $d^{\pi}(s,a)$ from that experience to measure how good the policy is.

How do we learn this $d^{\pi}(s,a)$?
Exactly: by iterating the "flow conservation" condition $d^\pi(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d^\pi(s,a)$.

The optimization problem reads as follows (now $d(s,a)$ is the unknown variable used to recover $d^\pi(s,a)$):

$$\begin{aligned} J(\pi)=\max_{d:\,\mathcal{S}\times\mathcal{A}\rightarrow\mathbb R_{+}}\;&\mathbb E_{(s,a)\sim d(s,a)}[r(s,a)]\\ \text{s.t. }\;& d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d(s,a) \end{aligned}$$

Its theoretical solution: the constraint amounts to a system of linear equations,
$$\begin{aligned} &(I-\gamma P^\pi_*)\,d(s,a)=(1-\gamma)d_0(s)\pi(a|s)\\ \Rightarrow\;& d(s,a)=(1-\gamma)(I-\gamma P^\pi_*)^{-1}d_0(s)\pi(a|s)\\ \Rightarrow\;& d(s,a)=(1-\gamma)\sum_{t=0}^\infty \gamma^t(P^\pi_*)^t\, d_0(s)\pi(a|s)\\ \Rightarrow\;& d(s,a)=(1-\gamma)\sum_{t=0}^\infty \gamma^t\,\Pr\big(s_t=s,a_t=a\mid s_0\sim d_0(s),\,a_t\sim \pi(\cdot|s_t),\,s_{t+1}\sim T(\cdot|s_t,a_t)\big)\\ \Rightarrow\;& d(s,a)=d^\pi(s,a) \end{aligned}$$

There is one step here worth working out yourself to deepen understanding: how $(I-\gamma P^\pi_*)^{-1}$ expands into $\sum_{t=0}^\infty \gamma^t (P^\pi_*)^t$, and why this matches the probabilistic definition of $d^\pi(s,a)$, i.e., the $\Pr(\cdot)$ expression.

A hint: to push it through, (i) treat each time step as its own random variable, (ii) keep track of the space each variable lives in, and (iii) discretize, viewing $Q(s,a)$ and $d(s,a)$ as points over the space $\Omega=\mathcal{S}\times\mathcal{A}$, i.e., as vectors or matrices.

Another hint: if you view the value function $Q$ or the joint distribution $d$ as a matrix, with $|\mathcal S|=m$ states and $|\mathcal A|=n$ actions, then $d(s,a)\in\mathbb R^{m\times n}$; viewed as a vector, $d(s,a)\in\mathbb R^{mn}$.
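A quick numerical check of this Neumann-series step (arbitrary toy matrices, for illustration only):

```python
# Sketch: (I - gamma*M)^{-1} equals the Neumann series sum_t gamma^t M^t when gamma < 1,
# where M = P^pi_* is the transpose of a row-stochastic P^pi (spectral radius 1).
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 12, 0.9
P = rng.random((n, n)); P /= P.sum(-1, keepdims=True)    # row-stochastic P^pi
M = P.T                                                  # P^pi_*

inv = np.linalg.inv(np.eye(n) - gamma * M)

series = np.zeros((n, n)); term = np.eye(n)
for t in range(300):
    series += term
    term = gamma * M @ term                              # next term (gamma*M)^(t+1)

print("max |inverse - truncated series| =", np.abs(inv - series).max())
```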

So the "theoretical solutions" above say: if we solve these objective-plus-constraint optimization problems, we obtain the quantities needed to evaluate the policy $\pi$, namely $Q^\pi(s,a)$ or $d^\pi(s,a)$, and from them we can compute $J(\pi)$. In practice, however, as soon as either the states or the actions are not discrete, this closed-form route is no longer available.

The key point: from the optimization perspective, the LP dual of Approach 1 is exactly Approach 2, with $d(s,a)$ as the dual variable. (Try plugging these two problems into the LP duality template yourself and working out the correspondence; it is quite elegant.)

2.2 The Lagrange dual of policy evaluation

The previous section was, at its core, LP duality applied to the problem below, with the visitation distribution $d^\pi(s,a)$ playing the role of the dual variable. We can therefore apply Lagrange duality directly.

The primal problem, modeled in terms of the value function:

$$\begin{aligned} J(\pi)=\min_Q\;& (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]\\ \text{s.t. }\;& Q(s,a)\geq r(s,a)+\gamma P^\pi Q(s,a)\quad\forall (s,a)\in \mathcal{S}\times\mathcal{A} \end{aligned}$$

Build the Lagrangian on the primal problem (recall that the Lagrangian can itself be derived from the Fenchel conjugate):

$$\begin{aligned} J(\pi)&=\min_Q\max_{d\geq0}\; (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\sum_{s,a}d(s,a)\Big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big)\\ &=\min_Q\max_{d\geq0}\; (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d(s,a)}\Big[r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big] \end{aligned}$$

Its dual problem is then the max-min:
$$\max_{d\geq0}\min_Q\;(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d(s,a)}\Big[r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big]$$
Now let's formally bring in the offline RL problem setting:

  1. In theory, the data $(s,a)$ used in this $J(\pi)$ are samples generated by rolling out the policy $\pi$ in the environment.
  2. In offline RL, however, we only have an anonymous offline dataset $\mathcal D=\{(s^i_t,a^i_t,r^i_t,s^i_{t+1})\}_{i=1}^N$ and cannot interact with the environment.
  3. So we work with the state-action joint distribution of the dataset, $d^{\mathcal D}(s,a)$.
  4. We then use importance sampling to reuse the data in $\mathcal D$ for policy evaluation.

$$\begin{aligned} J(\pi)&=\min_Q\max_{d\geq0}\; (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d^\pi(s,a)}\Big[r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big]\\ &=\min_Q\max_{d\geq0}\;(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\int \frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}\,d^{\mathcal D}(s,a)\Big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big)\\ &=\min_Q\max_{d\geq0}\;(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d^{\mathcal D}(s,a)} \left[\frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}\Big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big)\right]\\ &=\min_Q\max_{\eta\geq0}\; (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d^{\mathcal D}(s,a)} \left[\eta(s,a)\Big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big)\right]\\ &\approx \min_Q\max_{\eta\geq0}\;(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a,r,s')\sim d^{\mathcal D},\,a'\sim \pi(\cdot|s')} \left[\eta(s,a)\Big(r(s,a)+\gamma Q(s',a')-Q(s,a)\Big)\right]\\ &=\min_Q\max_{\eta\geq 0}L(Q,\eta)\\ &\geq \max_{\eta\geq 0}\min_Q L(Q,\eta) \end{aligned}$$

This can then be attacked with solvers for min-max (saddle-point) problems, for example:
Faster Saddle-Point Optimization for Solving Large-Scale Markov Decision Processes
Minimax Weight and Q-Function Learning for Off-Policy Evaluation

Setting aside the details of the solvers, the big picture is:
we know $d^{\mathcal D}$ and the policy $\pi$, and the goal is to evaluate it, i.e., obtain $J(\pi)$. To do so we solve a min-max problem: sample $(s,a,r,s')$ from $d^{\mathcal D}$ and $a'$ from $\pi(\cdot|s')$, solve the inner max over $\eta$, then the min over $Q$, and alternate the two until we obtain $\eta^*$ or $Q^*$. That completes policy evaluation, i.e., yields $J(\pi)$.

Why "or"? Because once we have $\eta^*$, we immediately get $J(\pi)=\mathbb E_{(s,a)\sim d^{\mathcal D}}[\eta^*(s,a)\,r(s,a)]$.
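For a rough feel of what such a solver does, here is a deliberately naive sketch: tabular $Q$ and $\eta$ (kept nonnegative via a softplus), simultaneous stochastic gradient descent/ascent on $L(Q,\eta)$ over samples from a toy offline dataset, and finally the estimate $\hat J(\pi)=\hat{\mathbb E}_{d^{\mathcal D}}[\eta\, r]$. Everything here (the toy MDP, $d^{\mathcal D}$, the parameterization, the step sizes) is an assumption for illustration, and a plain alternating update like this has no convergence guarantee; the papers above exist precisely to do this step properly.

```python
# Naive sketch: simultaneous SGD ascent on eta and descent on Q for L(Q, eta),
# using samples from a toy offline dataset. Illustration only -- no convergence guarantee.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
S, A, gamma, N = 4, 2, 0.9, 5000

# Toy MDP, target policy pi, and behavior distribution d^D (all assumed).
T = torch.rand(S, A, S); T = T / T.sum(-1, keepdim=True)        # T(s'|s,a)
r = torch.rand(S, A)
d0 = torch.full((S,), 1.0 / S)
pi = torch.rand(S, A); pi = pi / pi.sum(-1, keepdim=True)       # target policy pi(a|s)
dD = torch.rand(S, A); dD = dD / dD.sum()                        # behavior distribution d^D(s,a)

# Offline dataset: (s, a, r, s') with (s,a) ~ d^D, plus a' ~ pi(.|s') and (s0, a0) ~ d0 x pi.
idx = torch.multinomial(dD.flatten(), N, replacement=True)
s = torch.div(idx, A, rounding_mode="floor"); a = idx % A
s_next = torch.multinomial(T[s, a], 1).squeeze(-1)
a_next = torch.multinomial(pi[s_next], 1).squeeze(-1)
rew = r[s, a]
s0 = torch.multinomial(d0, N, replacement=True)
a0 = torch.multinomial(pi[s0], 1).squeeze(-1)

Q = torch.zeros(S, A, requires_grad=True)                        # tabular Q
eta_raw = torch.zeros(S, A, requires_grad=True)                  # eta = softplus(eta_raw) >= 0
opt_Q = torch.optim.SGD([Q], lr=0.1)
opt_eta = torch.optim.SGD([eta_raw], lr=0.1)

for step in range(2000):
    eta = F.softplus(eta_raw)
    residual = rew + gamma * Q[s_next, a_next] - Q[s, a]         # r + gamma*Q(s',a') - Q(s,a)
    L = (1 - gamma) * Q[s0, a0].mean() + (eta[s, a] * residual).mean()

    opt_Q.zero_grad(); opt_eta.zero_grad()
    L.backward()
    eta_raw.grad.neg_()                                          # flip sign: ascend in eta, descend in Q
    opt_Q.step(); opt_eta.step()

with torch.no_grad():
    eta = F.softplus(eta_raw)
    print("estimated J(pi) = E_dD[eta * r]:", (eta[s, a] * rew).mean().item())
```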

To summarize Q-LP policy evaluation:

$J(\pi)$ can be modeled in two different ways, through the value function $Q^\pi(s,a)$ or through the visitation distribution $d^\pi(s,a)$, and the two formulations are in fact primal and dual of each other; the duality relating them is an LP duality.

The Q-LP primal problem:

$$\begin{aligned} J(\pi)=\min_Q\;& (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]\\ \text{s.t. }\;& Q(s,a)\geq r(s,a)+\gamma P^\pi Q(s,a)\quad\forall (s,a)\in \mathcal{S}\times\mathcal{A} \end{aligned}$$

The Q-LP dual problem:
$$\begin{aligned} J(\pi)=\max_{d:\,\mathcal{S}\times\mathcal{A}\rightarrow\mathbb R_{+}}\;&\mathbb E_{(s,a)\sim d(s,a)}[r(s,a)]\\ \text{s.t. }\;& d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d(s,a)\quad\forall (s,a)\in \mathcal{S}\times\mathcal{A} \end{aligned}$$

(The dual problem has a drawback: its constraints are equalities, one for every possible $(s,a)$, which is very restrictive; after imposing all of them there is essentially no freedom left in $d$.)
The fact that these two problems are LP duals of each other reveals that $d(s,a)$ is the dual variable of the Q-LP primal, and conversely $Q(s,a)$ is the dual variable of the Q-LP dual (both problems are convex).

Applying Lagrange duality to the Q-LP primal to turn it into an unconstrained problem gives:

$$J(\pi)=\min_Q\max_{d\geq0}\; (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d(s,a)}\Big[r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\Big]$$

After bringing in the offline dataset and applying the importance-sampling trick:
$$J(\pi)=\min_Q\max_{\eta\geq0}\;(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a,r,s')\sim d^{\mathcal D},\,a'\sim \pi(\cdot|s')} \left[\eta(s,a)\Big(r(s,a)+\gamma Q(s',a')-Q(s,a)\Big)\right]$$

The dual problem we actually optimize is:
$$\max_{\eta\geq0}\min_Q\;(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a,r,s')\sim d^{\mathcal D},\,a'\sim \pi(\cdot|s')} \left[\eta(s,a)\Big(r(s,a)+\gamma Q(s',a')-Q(s,a)\Big)\right]$$

So policy evaluation is first written as a min-max problem and then attacked through its max-min dual. When the quantity being fitted is $\eta(s,a)=\frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}$, this is the most basic duality-based objective for offline RL; the details of optimizing this max-min objective directly are in the two papers cited above.[1][2]

2.3 Changing the Q-LP objective

The problem setting is now offline RL, so all we have is a single offline dataset.

The Q-LP dual problem:
$$\begin{aligned} J(\pi)=\max_{d:\,\mathcal{S}\times\mathcal{A}\rightarrow\mathbb R_{+}}\;&\mathbb E_{(s,a)\sim d(s,a)}[r(s,a)]\\ \text{s.t. }\;& d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d(s,a)\quad\forall (s,a)\in \mathcal{S}\times\mathcal{A} \end{aligned}$$

Change the objective of the dual problem to the following (minimize the f-divergence $D_f$ between the policy's visitation distribution $d^\pi(s,a)$ and the dataset's state-action distribution $d^{\mathcal D}(s,a)$; this implicitly assumes that $\mathcal D$ is an expert dataset):

$$\begin{aligned} \max_{d:\,\mathcal{S}\times\mathcal{A}\rightarrow\mathbb R_{+}}\;& -D_f(d\,\|\,d^{\mathcal D})\\ \text{s.t. }\;& d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d(s,a)\quad\forall (s,a)\in \mathcal{S}\times\mathcal{A} \end{aligned}$$

2.3.1 Applying Lagrange duality to this problem (remember that $Q(s,a)$ is its dual variable)

$$\begin{aligned} \max_d \min_Q L(Q,d)&=\max_d \min_Q\;-D_f(d\,\|\,d^{\mathcal D})+\sum_{s,a}Q(s,a)\Big((1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d(s,a)-d(s,a)\Big)\\ &=(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]-D_f(d\,\|\,d^{\mathcal D})+\sum_{s,a}Q(s,a)\Big(\gamma P^\pi_* d(s,a)-d(s,a)\Big)\quad(1)\\ &=(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]-D_f(d\,\|\,d^{\mathcal D})+\sum_{s,a}d(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)\quad(2)\\ &=(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]-\mathbb E_{(s,a)\sim d^{\mathcal D}}\left[f\!\left(\frac{d(s,a)}{d^{\mathcal D}(s,a)}\right)\right]+\sum_{s,a}d(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)\\ &\stackrel{\eta(s,a)=\frac{d(s,a)}{d^{\mathcal D}(s,a)}}{\Longrightarrow}\;(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]-\mathbb E_{(s,a)\sim d^{\mathcal D}}\left[f(\eta(s,a))\right]+\mathbb E_{(s,a)\sim d^{\mathcal D}}\left[\eta(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)\right]\\ &=(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]+\mathbb E_{(s,a)\sim d^{\mathcal D}}\left[\eta(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)-f(\eta(s,a))\right] \end{aligned}$$

The step $(1)\rightarrow (2)$ is worth deriving carefully: it moves the operator onto $Q$ via the adjoint identity $\langle Q,\,P^\pi_* d\rangle=\langle P^\pi Q,\,d\rangle$. The substitution $\eta(s,a)=\frac{d(s,a)}{d^{\mathcal D}(s,a)}$ that follows is the importance-sampling trick that makes the objective usable with offline data.

So after changing the objective and applying Lagrange duality, the optimization target becomes:

$$\max_{\eta}\min_Q\;(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]+\mathbb E_{(s,a)\sim d^{\mathcal D}}\left[\eta(s,a)\Big(\gamma P^\pi Q(s,a)-Q(s,a)\Big)-f(\eta(s,a))\right]$$

The algorithm that optimizes this objective concretely is DualDICE.[3]

2.3.2 Applying Fenchel duality to the same problem (let's derive this one carefully)

  • The optimization problem:
    $$\begin{aligned} \max_{d:\,\mathcal{S}\times\mathcal{A}\rightarrow\mathbb R_{+}}\;& -D_f(d\,\|\,d^{\mathcal D})\\ \text{s.t. }\;& d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d(s,a)\quad\forall (s,a)\in \mathcal{S}\times\mathcal{A} \end{aligned}$$
  • Fenchel duality: the primal $\min_{x\in D}J_P(x)=\min_{x\in D}f(x)+g(Ax)$ has dual $\max_{u\in D^*}J_D(u)=\max_{u\in D^*}-f^*(-A^\top u)-g^*(u)$.

First step: set $x:=d(s,a)$ and express the equality constraint $d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d(s,a)$ in the form $g(Ax)$.

This is done with the indicator function
$$\delta_C(x)= \begin{cases} 0 & \text{if }x\in C \\ +\infty & \text{otherwise.} \end{cases}$$

The equality constraint can then be read as
$$\underbrace{(1-\gamma)d_0(s)\pi(a|s)}_{C}=\underbrace{(I-\gamma P^\pi_*)}_{A}\underbrace{d(s,a)}_{x}.$$

In the Fenchel form, the primal problem reads

$$\min_d\; D_f(d\,\|\,d^{\mathcal D})+\delta_{\{(1-\gamma)d_0(s)\pi(a|s)\}}\big((I-\gamma P^\pi_*)\,d(s,a)\big).$$

From the table of conjugate functions: $\delta_{\{a\}}^*(y)=\langle a,y\rangle$, and the conjugate of $d\mapsto D_f(d\,\|\,p)$ is $y\mapsto\mathbb E_{z\sim p}[f^*(y(z))]$. Plugging these into the Fenchel dual form $\max_{u\in D^*}-f^*(-A^\top u)-g^*(u)$ yields the dual problem. (Note that the dual variable $u$ is exactly $Q(s,a)$.)
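For completeness, the first of these conjugates is a one-line computation from the definition:
$$\delta_{\{a\}}^*(y)=\sup_x\;\big(\langle x,y\rangle-\delta_{\{a\}}(x)\big)=\langle a,y\rangle,$$
since the bracket is $-\infty$ for every $x\neq a$.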

So the dual problem is:

$$\begin{aligned} &\max_Q\; -\mathbb E_{(s,a)\sim d^{\mathcal D}}\Big[f^*\big((\gamma P^\pi_*-I)^\top Q(s,a)\big)\Big]-\big\langle (1-\gamma )d_0(s)\pi(a|s),\,Q(s,a)\big\rangle\\ \rightarrow\;&\min _Q\; \mathbb E_{(s,a)\sim d^{\mathcal D}}\Big[f^*\big(\gamma P^\pi Q(s,a)-Q(s,a)\big)\Big]+(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)] \end{aligned}$$

By the Fenchel duality optimality conditions, the primal solution $d^*(s,a)$ and the dual solution $Q^*(s,a)$ are related by (here $f_*'$ denotes $(f^*)'$, the derivative of the convex conjugate of $f$): $$d^*(s,a)=d^{\mathcal D}(s,a)\cdot f_*'\big(\gamma P^\pi Q^*(s,a)-Q^*(s,a)\big)$$

If we take $f(x)=\frac{1}{2}x^2$, whose conjugate is $f^*(y)=\frac{1}{2}y^2$, then

$$Q^*=\arg\min_Q\; (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)] + \frac{1}{2}\, \mathbb E_{(s,a)\sim d^{\mathcal D}}\Big[\big(\gamma P^\pi Q(s,a)-Q(s,a)\big)^2\Big]$$

and in this special case the primal-dual relation gives:
$$f_*'\big(\gamma P^\pi Q^*(s,a)-Q^*(s,a)\big)=\gamma P^\pi Q^*(s,a)-Q^*(s,a)=\frac{d^*(s,a)}{d^{\mathcal D}(s,a)}=\frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}$$
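Since the objective above is quadratic in $Q$, the tabular case can be solved in closed form, which makes the claimed primal-dual relation easy to verify numerically. In the sketch below, the toy MDP and the behavior distribution $d^{\mathcal D}$ are assumptions for illustration:

```python
# Sketch: with f(x) = x^2/2, solve the quadratic objective for Q*, then check
# gamma*P^pi Q* - Q* = d^pi / d^D and J(pi) = E_{d^D}[eta * r]  (toy tabular MDP).
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 5, 3, 0.9
T = rng.random((S, A, S)); T /= T.sum(-1, keepdims=True)
r = rng.random((S, A)).reshape(S * A)
d0 = np.full(S, 1.0 / S)
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)
dD = rng.random((S, A)).reshape(S * A); dD /= dD.sum()          # behavior distribution, full support

P = (T[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)   # P^pi
d0pi = (d0[:, None] * pi).reshape(S * A)

# Objective: (1-gamma) <d0*pi, Q> + 1/2 * sum_x dD(x) * ((gamma*P - I) Q)(x)^2
# Setting its gradient to zero gives a linear system for Q*.
B = gamma * P - np.eye(S * A)
Q_star = np.linalg.solve(B.T @ np.diag(dD) @ B, -(1 - gamma) * d0pi)

eta = B @ Q_star                                                # gamma*P^pi Q* - Q*
d_pi = np.linalg.solve(np.eye(S * A) - gamma * P.T, (1 - gamma) * d0pi)

print("max |eta - d^pi/d^D| =", np.abs(eta - d_pi / dD).max())
print("E_{d^D}[eta * r]     =", np.sum(dD * eta * r))
print("true J(pi)           =", d_pi @ r)
```

Here the closed-form linear solve stands in for the stochastic optimization one would actually run with function approximation.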

2.4 Summary (very important)

The main thread:
All these derivations may obscure the storyline. Viewed as a whole, the task is always policy evaluation, and the starting point is always the on-policy objective.
The only new difficulty introduced by offline RL is the offline dataset: estimating an on-policy objective in an off-policy way requires compensating for the distribution mismatch. Compensating with trajectory-level importance sampling means computing the product of many ratios, $\frac{\pi_\theta(\tau)}{\pi_\beta(\tau)}$, which has very high variance.
The chain of derivations above shows that it suffices to optimize the objective:

$$Q^*=\arg\min_Q\; (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)] + \frac{1}{2}\, \mathbb E_{(s,a)\sim d^{\mathcal D}}\Big[\big(\gamma P^\pi Q(s,a)-Q(s,a)\big)^2\Big]$$

Once we have $Q^*$, we can compute $\gamma P^\pi Q^*(s,a)-Q^*(s,a)=\frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}$, which lets us evaluate the on-policy objective using only the offline dataset:

$$J(\pi)=\mathbb E_{(s,a,r)\sim d^{\mathcal D}}\left[\frac{d^\pi(s,a)}{d^{\mathcal D}(s,a)}\, r(s,a)\right]$$

The supporting threads:

  1. Modeling $J(\pi)$ through the value function $Q^\pi(s,a)$ and modeling it through the visitation distribution $d^\pi(s,a)$ are two views of the same object, related by LP duality; the LP dual has too many constraints to be solved directly.
  2. Reworking the LP with Lagrange duality turns the estimation of $J(\pi)$ into a min-max problem, which tends to get stuck around saddle points during optimization.
  3. Reworking the LP by also changing the objective and then applying Fenchel duality turns the estimation of $J(\pi)$ into a plain minimization problem, which ordinary gradient-based methods can handle.

3. The optimization problem for policy optimization

Evaluating a policy's value is not enough; usually we want the policy itself, not just its evaluated value $J(\pi)$.

3.1 Basic theory

Take the Lagrange-duality formulation of the evaluated quantity $J(\pi)$ and differentiate it directly with respect to the policy (assume the min-max problem has already been solved approximately, $Q^*\approx Q^\pi$ and $d^*\approx d^\pi$, and then take the gradient with respect to the parameterized policy):
$$\begin{aligned} \nabla_\theta J(\pi_\theta)&=\nabla_\theta \left(\min_Q\max_{d\geq0}\; (1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi_\theta(\cdot|s_0)}\left[Q(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d(s,a)}\Big[r(s,a)+\gamma P^{\pi_\theta} Q(s,a)-Q(s,a)\Big]\right)\\ &=\nabla_\theta \left( (1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi_\theta(\cdot|s_0)}\left[Q^*(s_0,a_0)\right]+\mathbb E_{(s,a)\sim d^*(s,a)}\Big[r(s,a)+\gamma P^{\pi_\theta} Q^*(s,a)-Q^*(s,a)\Big]\right)\\ &=(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi_\theta(\cdot|s_0)}\left[Q^*(s_0,a_0)\,\nabla_\theta \log \pi_\theta(a_0|s_0)\right]+\nabla_\theta\, \mathbb E_{(s,a)\sim d^*(s,a)}\Big[r(s,a)+\gamma P^{\pi_\theta} Q^*(s,a)-Q^*(s,a)\Big]\\ &=(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi_\theta(\cdot|s_0)}\left[Q^*(s_0,a_0)\,\nabla_\theta \log \pi_\theta(a_0|s_0)\right]+ \mathbb E_{(s,a)\sim d^*(s,a)}\Big[\gamma\, \nabla_\theta P^{\pi_\theta} Q^*(s,a)\Big]\\ &=(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi_\theta(\cdot|s_0)}\left[Q^*(s_0,a_0)\,\nabla_\theta \log \pi_\theta(a_0|s_0)\right]+ \gamma\,\mathbb E_{(s,a)\sim d^*(s,a),\,s'\sim T(\cdot|s,a),\,a'\sim \pi_\theta(\cdot|s')}\Big[ Q^*(s',a')\, \nabla_\theta \log\pi_\theta(a'|s') \Big]\\ &=\mathbb E_{(s,a)\sim d^*}\big[Q^*(s,a)\,\nabla_\theta \log \pi_\theta(a|s)\big] \end{aligned}$$

So as long as the min-max solution gives $Q^*\approx Q^\pi$ and $d^*\approx d^\pi$, the gradient of the Lagrange-duality objective coincides with the on-policy policy gradient.
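In sample-based form, once $\eta^*(s,a)\approx d^*(s,a)/d^{\mathcal D}(s,a)$ and $Q^*$ have been fitted, a single policy-gradient step looks roughly like the sketch below (the tabular policy and the random stand-ins for `eta_star`, `Q_star`, and the batch are assumptions for illustration):

```python
# Sketch: one policy-gradient step for grad J = E_{d*}[ Q*(s,a) * grad log pi(a|s) ],
# estimated from offline samples re-weighted by eta*(s,a) = d*(s,a) / d^D(s,a).
import torch

torch.manual_seed(0)
S, A = 5, 3
logits = torch.zeros(S, A, requires_grad=True)           # tabular policy parameters theta
opt = torch.optim.Adam([logits], lr=1e-2)

# Assumed given: a batch (s,a) ~ d^D and previously fitted eta*, Q* (random stand-ins here).
s = torch.randint(S, (256,))
a = torch.randint(A, (256,))
eta_star = torch.rand(S, A)                              # stands in for d*/d^D
Q_star = torch.rand(S, A)                                # stands in for Q*

log_pi = torch.log_softmax(logits, dim=-1)               # log pi_theta(a|s)
# Surrogate whose gradient is the re-weighted policy gradient (eta*, Q* treated as constants).
surrogate = (eta_star[s, a] * Q_star[s, a] * log_pi[s, a]).mean()

opt.zero_grad()
(-surrogate).backward()                                  # ascend J  <=>  descend -J
opt.step()
```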

3.2 Offline Policy Gradient via Lagrangian

Following the basic theory and using the importance-sampling trick, one simply takes gradients of the objective
$$\max_\pi\min_Q\max_{\eta\geq0}\; L(Q,\eta,\pi)=(1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]+\mathbb E_{(s,a,r,s')\sim d^{\mathcal D},\,a'\sim \pi(\cdot|s')}\left[\eta(s,a)\Big(r(s,a)+\gamma Q(s',a')-Q(s,a)\Big)\right]$$

3.3 Offline Policy Gradient via Fenchel Duality

Consider the primal problem:
$$\begin{aligned} \max_{d:\,\mathcal{S}\times\mathcal{A}\rightarrow\mathbb R_{+}}\;& \mathbb E_{(s,a)\sim d(s,a)}[r(s,a)]-D_f(d\,\|\,d^{\mathcal D})\\ \text{s.t. }\;& d(s,a)=(1-\gamma)d_0(s)\pi(a|s)+\gamma P^\pi_* d(s,a)\quad\forall (s,a)\in \mathcal{S}\times\mathcal{A} \end{aligned}$$

After applying Fenchel duality, the objective becomes[4]:

$$\max_\pi\min_Q\; (1-\gamma)\,\mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi(\cdot|s_0)}[Q(s_0,a_0)]+\log \mathbb E_{(s,a)\sim d^{\mathcal D}}\left[\exp\big(r(s,a)+\gamma P^\pi Q(s,a)-Q(s,a)\big)\right]$$

For a fixed $Q$, the gradient of this objective with respect to the policy is:

$$\nabla_\theta J(\pi_\theta)=(1-\gamma)\, \mathbb E_{s_0\sim d_0(s),\,a_0\sim \pi_\theta(\cdot|s_0)}\left[Q(s_0,a_0)\,\nabla_\theta \log \pi_\theta(a_0|s_0)\right] + \gamma\,\mathbb E_{(s,a,s')\sim d^{\mathcal D},\,a'\sim \pi_\theta(\cdot|s')}\Big[\operatorname{softmax}_{d^{\mathcal D}}\big(r+\gamma P^{\pi_\theta} Q-Q\big)(s,a)\cdot Q(s',a')\,\nabla_\theta \log \pi_\theta (a'|s')\Big]$$

where $\operatorname{softmax}_p(h)(z):=\frac{\exp(h(z))}{\mathbb E_{\tilde z \sim p}[\exp(h(\tilde z))]}$.
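In a sample-based implementation, $\operatorname{softmax}_{d^{\mathcal D}}$ is just a self-normalized weight over the batch. A rough sketch of the second gradient term (the batch tensors, the fixed `Q` table, and the tabular policy are assumptions for illustration; the $\gamma$ factor is kept explicit):

```python
# Sketch: self-normalized softmax_{d^D} weights over a batch, and the corresponding
# surrogate for the second term of the policy gradient above (first term omitted here).
import torch

torch.manual_seed(0)
S, A, gamma = 5, 3, 0.9
logits = torch.zeros(S, A, requires_grad=True)            # tabular policy parameters

# Assumed batch from d^D plus a' ~ pi_theta(.|s'), and a fixed Q table (random stand-ins).
s = torch.randint(S, (256,)); a = torch.randint(A, (256,))
rew = torch.rand(256); s_next = torch.randint(S, (256,))
Q = torch.rand(S, A)

pi = torch.softmax(logits, dim=-1)
a_next = torch.multinomial(pi.detach()[s_next], 1).squeeze(-1)   # a' ~ pi_theta(.|s')

delta = rew + gamma * Q[s_next, a_next] - Q[s, a]         # r + gamma*Q(s',a') - Q(s,a)
w = torch.softmax(delta, dim=0)                           # softmax_{d^D}(delta) over the batch

log_pi = torch.log_softmax(logits, dim=-1)
# Second gradient term: gamma * E[ w * Q(s',a') * grad log pi(a'|s') ]; Q and w are constants here.
surrogate = gamma * (w * Q[s_next, a_next] * log_pi[s_next, a_next]).sum()
surrogate.backward()                                      # gradient now sits in logits.grad
```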

Summary

This is the duality-based family of offline RL methods. It looks involved, with a pile of derivation details.
In plain terms: policy evaluation is modeled as an optimization problem (objective plus constraints), and the LP duality between the two viewpoints is made explicit.
Then Lagrange duality or Fenchel duality is applied to transform that optimization problem, possibly after first changing the objective, so that it fits the offline setting. That is all there is to it.
If you are comfortable with Fenchel duality, none of this is really hard. A follow-up post will use the DualDICE paper as a code-level case study of how these objectives are actually used and implemented.


  1. Faster saddle-point optimization for solving large-scale Markov decision processes ↩︎

  2. Minimax Weight and Q-Function Learning for Off-Policy Evaluation ↩︎

  3. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections ↩︎

  4. AlgaeDICE: Policy Gradient from Arbitrary Experience ↩︎
