Paper-4: A Close Reading of GCL (ICML 2016)

Overview and Positioning

  1. This is an Inverse RL problem, also called Inverse Optimal Control: the goal is to learn a cost function (equivalently, a reward function) from expert demonstrations.
  2. Traditional IRL faces two problems: the form of the cost function has to be hand-designed, and learning the cost function is difficult under unknown dynamics, i.e., in the model-free setting.
  3. This paper therefore uses a neural network as the cost function to avoid hand-design, and learns this cost function with a sample-based approximation.

Background Theory
Deep Reinforcement Learning CS285, Lectures 13-15 (Part 2)

1. GCL Fundamentals

Setting: we introduce the problem from the reward perspective. The reward view and the cost view are really the same; the terminology only changes because the control literature tends to say cost while standard RL says reward. To stay consistent with more recent papers, reward is used throughout.

1.1 Trajectory modeling via a PGM

As discussed earlier in Control as Inference, we have a set of expert trajectories, which can be understood as optimal behaviors, and an optimality variable is introduced specifically to model the expert behavior:
$$p(O_t|s_t,a_t)=\exp\big(r(s_t,a_t)\big)=\exp\big(-c(s_t,a_t)\big)$$

$$
\begin{aligned}
p(\tau|O_{1:T})=\frac{p(\tau,O_{1:T})}{p(O_{1:T})}&\propto p(\tau,O_{1:T})\\
&=p(\tau)\,p(O_{1:T}|\tau)\\
&=p(\tau)\exp\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\\
&=\Big[p(s_1)\prod_{t=1}^{T}p(s_{t+1}|s_t,a_t)\Big]\exp\Big(\sum_{t=1}^T r(s_t,a_t)\Big)
\end{aligned}
$$

This may look odd: where did the policy in $p(\tau)$ go? Because this is a PGM-based model, the expert data already contains the states and actions directly, so there is no need to model the mapping from states to actions. As shown in the figure:
[Figure: pic-1]
What we want to parameterize now is the reward function, so:

$$
\begin{aligned}
p(\tau|O_{1:T},\psi)&\propto p(\tau)\exp\Big(\sum_{t=1}^T r_\psi(s_t,a_t)\Big)\\
&=p(\tau)\exp\big(r_\psi(\tau)\big)
\end{aligned}
$$

Given expert demonstrations $\tau^{(i)}$, the reward is learned by Maximum Likelihood Learning:
$$\max_\psi L(\psi)=\max_\psi \frac{1}{N}\sum_{i=1}^N \log p(\tau^{(i)}|O_{1:T},\psi)=\max_\psi \frac{1}{N}\sum_{i=1}^N r_\psi(\tau^{(i)})-\log Z$$
where $Z$ is the partition function, i.e., the normalizing factor over trajectories, needed to remove the proportionality sign $\propto$:
$$Z=\int p(\tau)\exp\big(r_\psi(\tau)\big)\,d\tau$$
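To make the parameterization concrete, below is a minimal sketch, assuming PyTorch and a small MLP (the architecture and sizes are illustrative, not the network from the paper), of the learned per-step reward $r_\psi(s_t,a_t)$; summing it over a trajectory gives $r_\psi(\tau)$.

```python
# A minimal sketch of r_psi as a small neural network (assumed architecture,
# not the paper's exact model). psi corresponds to the network parameters.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        # per-step reward r_psi(s_t, a_t); summing over t gives r_psi(tau)
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)
```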

1.2 Trajectory modeling via a policy

The following is the standard RL way of modeling a trajectory:
$$p(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$

[Figure: sampled trajectory diagram]
Strictly speaking there is also a reward here, but it is a scalar rather than a function to be learned, so it is left out. In this view the trajectory is generated by the policy together with the dynamics, and the trajectory distribution is an expression of the policy. The PGM model above, by contrast, expresses the distribution of the expert "trajectories".

As is well known, the biggest problem with a PGM is that the integral (or sum) in the partition function is intractable to compute, so it has to be approximated. How? By using the other way of constructing trajectories, a policy, to approximate the PGM's trajectory distribution, which makes the partition function $Z=\int p(\tau)\exp\big(r_\psi(\tau)\big)\,d\tau$ estimable!
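As a preview of the sample-based estimate derived in the next section, here is a minimal sketch, with assumed tensor shapes, of estimating $\log Z$ from trajectories sampled by a policy $\pi_\theta$; the dynamics terms cancel in the importance weight, as shown in the weight derivation below.

```python
# A minimal sketch (assumed shapes, not the paper's code) of estimating log Z
# from trajectories sampled by the current policy pi_theta.
# traj_rewards[j, t]  = r_psi(s_t, a_t) along sampled trajectory j
# traj_logprobs[j, t] = log pi_theta(a_t | s_t) along the same trajectory
import math
import torch

def estimate_log_Z(traj_rewards, traj_logprobs):
    # Importance weight of trajectory j (dynamics terms cancel):
    #   w_j = exp(sum_t r_psi(s_t, a_t)) / prod_t pi_theta(a_t | s_t)
    log_w = traj_rewards.sum(dim=1) - traj_logprobs.sum(dim=1)
    M = log_w.shape[0]
    # Z ≈ (1/M) * sum_j w_j, computed in log space for numerical stability
    return torch.logsumexp(log_w, dim=0) - math.log(M)
```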

2. Main Logic of the Paper

For estimating the partition function, the paper mentions Laplace approximation, value function approximation, and sample-based methods. The paper adopts the sample-based approach to estimate the partition function $Z$. The main idea is as follows:

  1. First expand the gradient of the objective:
    $$
    \begin{aligned}
    \nabla_\psi L(\psi)&=\nabla_\psi\Big[E_{\tau\sim\pi^*(\tau)}\big[r_\psi(\tau)\big]-\log\int p(\tau)\exp\big(r_\psi(\tau)\big)\,d\tau\Big]\\
    &=E_{\tau\sim\pi^*(\tau)}\big[\nabla_\psi r_\psi(\tau)\big]-\int \underbrace{\frac{1}{Z}\,p(\tau)\exp\big(r_\psi(\tau)\big)}_{p(\tau|O_{1:T},\psi)}\nabla_\psi r_\psi(\tau)\,d\tau\\
    &=E_{\tau\sim\pi^*(\tau)}\big[\nabla_\psi r_\psi(\tau)\big]-E_{\tau\sim p(\tau|O_{1:T},\psi)}\big[\nabla_\psi r_\psi(\tau)\big]
    \end{aligned}
    $$

  2. Express the second term $E_{\tau\sim p(\tau|O_{1:T},\psi)}[\nabla_\psi r_\psi(\tau)]$ through per-step state-action marginals, i.e., in the policy-style construction:
    $$
    \begin{aligned}
    E_{\tau\sim p(\tau|O_{1:T},\psi)}\big[\nabla_\psi r_\psi(\tau)\big]
    &=E_{\tau\sim p(\tau|O_{1:T},\psi)}\Big[\nabla_\psi \sum_{t=1}^T r_\psi(s_t,a_t)\Big]\\
    &=\sum_{t=1}^T E_{(s_t,a_t)\sim p(s_t,a_t|O_{1:T},\psi)}\big[\nabla_\psi r_\psi(s_t,a_t)\big]\\
    &=\sum_{t=1}^T E_{s_t\sim p(s_t|O_{1:T},\psi),\,a_t\sim p(a_t|s_t,O_{1:T},\psi)}\big[\nabla_\psi r_\psi(s_t,a_t)\big]
    \end{aligned}
    $$

  3. Samples from the soft optimal policy $p(a_t|s_t,O_{1:T},\psi)$ are approximated with another policy $\pi_\theta(a_t|s_t)$, so importance sampling is used to reuse the trajectory samples drawn from $\pi_\theta(a_t|s_t)$ (see the code sketch after this list):
    $$
    \begin{aligned}
    \nabla_\psi L(\psi)&=E_{\tau\sim\pi^*(\tau)}\big[\nabla_\psi r_\psi(\tau)\big]-E_{\tau\sim p(\tau|O_{1:T},\psi)}\big[\nabla_\psi r_\psi(\tau)\big]\\
    &\approx\frac{1}{N}\sum_{i=1}^N \nabla_\psi r_\psi(\tau^{(i)})-\frac{1}{\sum_j w_j}\sum_{j=1}^M w_j\,\nabla_\psi r_\psi(\tau^{(j)})
    \end{aligned}
    $$
    with importance weights
    $$
    \begin{aligned}
    w_j&=\frac{p(\tau^{(j)})\exp\big(r_\psi(\tau^{(j)})\big)}{\pi_\theta(\tau^{(j)})}\\
    &=\frac{p(s_1)\prod_{t}p(s_{t+1}|s_t,a_t)\exp\big(r_\psi(s_t,a_t)\big)}{p(s_1)\prod_{t}p(s_{t+1}|s_t,a_t)\,\pi_\theta(a_t|s_t)}\\
    &=\frac{\exp\big(\sum_t r_\psi(s_t,a_t)\big)}{\prod_t\pi_\theta(a_t|s_t)}
    \end{aligned}
    $$

  4. The procedure above uses another policy to compute the gradient of the reward function and thereby update $r_\psi(s_t,a_t)$, looking for the reward function under which the expert behavior has high value and the current policy's behavior has low value:
    $$\psi\leftarrow\psi + \alpha\,\nabla_\psi L(\psi)$$

  5. Next, fix the current reward function $r_\psi(\tau)$ and update the policy $\pi_\theta(a_t|s_t)$ with REINFORCE, looking for the policy that achieves high value under the current reward:
    $$\nabla_\theta L(\theta)= \frac{1}{M}\sum_{j=1}^M\nabla_\theta \log\pi_\theta(\tau_j)\,r_\psi(\tau_j),\qquad \theta \leftarrow \theta + \alpha\,\nabla_\theta L(\theta)$$

  6. Then iterate between updating the reward function and the policy: the reward function moves in the direction that gives the expert behavior high value and the current policy's behavior low value, while the policy moves toward behaviors with high value under the current reward, which again reflects the adversarial idea! The flowchart and a code sketch of one full iteration follow.

[Figure: GCL algorithm flowchart]
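Below is a hedged sketch of one GCL iteration along the lines of steps 1-6, assuming the `RewardNet` above plus a `policy` object with a `log_prob(states, actions)` method (a hypothetical interface); entropy/MaxEnt terms and the paper's regularizers are omitted.

```python
# A sketch of one GCL iteration (assumptions as in the lead-in; not the
# paper's exact algorithm): a reward step with self-normalized importance
# weights, then a REINFORCE policy step under the current reward.
import torch

def gcl_iteration(reward_net, policy, expert_batch, sample_batch,
                  reward_opt, policy_opt):
    # expert_batch / sample_batch: dicts of tensors
    #   'states':  [num_trajs, T, state_dim], 'actions': [num_trajs, T, action_dim]
    # sample_batch also carries 'logprobs': [M, T], the values of
    # log pi_theta(a_t | s_t) stored at rollout time (no gradient to theta here).

    # ---- reward update: maximize  E_expert[r_psi(tau)] - log Z --------------
    r_expert = reward_net(expert_batch['states'], expert_batch['actions']).sum(-1)
    r_sample = reward_net(sample_batch['states'], sample_batch['actions']).sum(-1)
    log_w = r_sample - sample_batch['logprobs'].sum(-1)   # log importance weights
    # The gradient of logsumexp(log_w) w.r.t. psi is the w_j-weighted average of
    # grad r_psi(tau_j), i.e. the self-normalized IS term; the -log M constant in
    # the log Z estimate has no gradient and is dropped.
    loss_reward = -(r_expert.mean() - torch.logsumexp(log_w, dim=0))
    reward_opt.zero_grad()
    loss_reward.backward()
    reward_opt.step()

    # ---- policy update: REINFORCE toward high value under the current r_psi --
    with torch.no_grad():
        returns = reward_net(sample_batch['states'], sample_batch['actions']).sum(-1)
    logp_tau = policy.log_prob(sample_batch['states'], sample_batch['actions']).sum(-1)
    loss_policy = -(logp_tau * returns).mean()             # no baseline, for brevity
    policy_opt.zero_grad()
    loss_policy.backward()
    policy_opt.step()
```

In the actual paper the sampling policy is optimized with a guided policy search style procedure rather than plain REINFORCE; the REINFORCE step here simply mirrors the update written in step 5.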

3. Experimental Details

  1. Where do the expert demonstrations come from?
    A: Trajectories generated by running the dynamics with policies obtained from Guided Policy Search, RL from scratch, or trajectory optimization.
  2. How is the policy $\pi_\theta(a_t|s_t)$ initialized?
    A: Starting from a random initial state, trajectory optimization is used to fit a linear-Gaussian controller to the expert data; in the paper's words, this "produces a motion that tracks the average demonstration with variance proportional to the variation between demonstrated motions" (a controller sketch follows this list).
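For intuition, here is a minimal sketch, assuming a time-varying linear-Gaussian controller $p(a_t|s_t)=\mathcal{N}(K_t s_t + k_t,\Sigma_t)$; fitting $K_t$, $k_t$ to track the mean demonstration and setting $\Sigma_t$ from the variation between demonstrations is assumed and not shown.

```python
# A minimal sketch (assumed, not the paper's implementation) of a time-varying
# linear-Gaussian controller p(a_t | s_t) = N(K_t s_t + k_t, Sigma_t), the kind
# of policy used to initialize pi_theta from the demonstrations.
import numpy as np

class LinearGaussianController:
    def __init__(self, K, k, Sigma):
        # K: [T, action_dim, state_dim], k: [T, action_dim],
        # Sigma: [T, action_dim, action_dim]
        self.K, self.k, self.Sigma = K, k, Sigma

    def act(self, s, t, rng=np.random):
        # sample a_t ~ N(K_t s + k_t, Sigma_t)
        mean = self.K[t] @ s + self.k[t]
        return rng.multivariate_normal(mean, self.Sigma[t])
```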

Experimental results:
[Figure: experimental results]

4. Summary

  • Contribution
    In the Inverse RL setting, what is learned is a cost function. This paper uses information from the policy update to guide the learning of the cost function, hence the name Guided Cost Learning. While the cost function is learned, the policy is learned as well; GAIL, by contrast, directly recovers a policy without learning a cost function.

  • One-sentence summary: no hand-designed cost function is needed; policy optimization on the raw state representation is used directly to guide the learning of the cost function, addressing high-dimensional problems and unknown dynamics.

  • Appendix A has a great summary of Guided Policy Search and of GPS + trajectory optimization under unknown dynamics.

Takeaway: the goal here is to learn a reward function, with policy optimization providing the guidance. The advantage is that raw state input works without hand-engineered state features, but visual input remains hard to handle.

Possible improvements

  1. Constraints could be added to the policy optimization update so that the direction of the reward signal becomes clearer.
  2. In the reward update, the partition function is estimated from samples via importance sampling; the importance weights could be optimized further, and the partition function could also be approximated in other ways rather than only with samples.

There are few open implementations of Guided Cost Learning; I only looked at one:
https://github.com/bbrighttaer/guided-irl
