Guided Cost Learning
Overview and Positioning
- This is an Inverse RL problem, also known as Inverse Optimal Control: the goal is to learn a cost function (equivalently, a reward function) from expert demonstrations.
- Traditional IRL faces two problems: the functional form of the cost must be hand-designed, and learning the cost function is difficult under unknown dynamics, i.e., in the model-free setting.
- This paper therefore uses a neural network as the cost function, avoiding hand-design, and learns it via a sample-based approximation.
Background theory
Deep Reinforcement Learning, CS285 lec13–lec15 (Part 2)
1. GCL Fundamentals
Setup: we introduce the problem from the reward perspective; the reward and cost viewpoints are equivalent. The terminology varies only because control theory tends to say cost while standard RL says reward. To stay close to recent papers, we will use reward throughout.
1.1 Modeling trajectories — the PGM view
As introduced earlier in Control as Inference: given a set of expert trajectories, which can be read as optimal behaviors, we introduce an optimality variable to model expert behavior: $p(O_t|s_t,a_t)=\exp\big(r(s_t,a_t)\big)=\exp\big(-c(s_t,a_t)\big)$
$$\begin{aligned} p(\tau|O_{1:T})=\frac{p(\tau,O_{1:T})}{p(O_{1:T})}&\propto p(\tau,O_{1:T})\\ &=p(\tau)\,p(O_{1:T}|\tau)\\ &=p(\tau)\exp\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\\ &=\Big[p(s_1)\prod_{t=1}^{T}p(s_{t+1}|s_t,a_t)\Big]\exp\Big(\sum_{t=1}^T r(s_t,a_t)\Big) \end{aligned}$$
This may feel odd: where did the policy inside $p(\tau)$ go? Because this is a PGM-based model, the expert data already contains both states and actions directly, so there is no need to model the mapping from states to actions. As shown in the figure:
Now what we want to parameterize is the reward function, giving:

$$\begin{aligned} p(\tau|O_{1:T},\psi)&\propto p(\tau)\exp\Big(\sum_{t=1}^T r_\psi(s_t,a_t)\Big)\\ &=p(\tau)\exp\big(r_\psi(\tau)\big) \end{aligned}$$
Given expert trajectories $\tau^{(i)}$, we learn the reward by Maximum Likelihood:
$$\max_\psi L(\psi)=\max_\psi \frac{1}{N}\sum_{i=1}^N\log p(\tau^{(i)}|O_{1:T},\psi)=\max_\psi \frac{1}{N}\sum_{i=1}^N r_\psi(\tau^{(i)})-\log Z$$

where $Z$ is the normalizing constant over trajectories, i.e., the partition function, needed to turn the proportionality $\propto$ into an equality:
$$Z=\int p(\tau)\exp\big(r_\psi(\tau)\big)\,d\tau$$
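Given background samples from $p(\tau)$, the objective above can be estimated by Monte Carlo. A minimal sketch, assuming a linear-in-features reward $r_\psi(\tau)=\psi^\top\phi(\tau)$ and trajectories summarized as hypothetical feature vectors (all names and data here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each trajectory tau is summarized by a feature vector phi(tau)
expert_feats = rng.normal(1.0, 0.5, size=(50, 4))   # tau^(i) from the expert
sample_feats = rng.normal(0.0, 0.5, size=(200, 4))  # tau ~ p(tau), background samples

def log_likelihood(psi):
    """L(psi) = (1/N) sum_i r_psi(tau^(i)) - log Z, with r_psi(tau) = psi . phi(tau)
    and Z = E_{tau ~ p(tau)}[exp(r_psi(tau))] estimated from the background samples."""
    data_term = (expert_feats @ psi).mean()
    r = sample_feats @ psi
    log_Z = np.log(np.mean(np.exp(r - r.max()))) + r.max()  # stable log-mean-exp
    return data_term - log_Z

# With psi = 0 every trajectory has reward 0, so Z = 1 and L(psi) = 0
print(log_likelihood(np.zeros(4)))  # -> 0.0
```

The log-mean-exp shift keeps the estimate numerically stable when rewards grow large.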
1.2 Modeling trajectories — the policy view
The following is the standard RL way of modeling a trajectory:
$$p(\tau)=p(s_1)\prod_{t=1}^T\pi(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$
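This factorization corresponds directly to sampling: draw $s_1$, then alternate action sampling and transition sampling. A toy sketch, where the Gaussian policy and the linear dynamics are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def rollout(policy, dynamics, init_state, T):
    """Sample tau = (s_1, a_1, ..., s_T, a_T) from
    p(tau) = p(s_1) * prod_t pi(a_t|s_t) * p(s_{t+1}|s_t, a_t)."""
    s = init_state()                  # s_1 ~ p(s_1)
    traj = []
    for _ in range(T):
        a = policy(s)                 # a_t ~ pi(a_t | s_t)
        traj.append((s, a))
        s = dynamics(s, a)            # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return traj

# 1-D example with illustrative components
traj = rollout(
    policy=lambda s: -0.5 * s + rng.normal(0.0, 0.1),
    dynamics=lambda s, a: s + a + rng.normal(0.0, 0.01),
    init_state=lambda: rng.normal(0.0, 1.0),
    T=10,
)
```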
Strictly speaking there is also a reward here, but it is a scalar signal rather than a function to be learned, so it is omitted. In this view a trajectory is generated by the policy together with the dynamics: the trajectory distribution is one way of expressing the policy. The PGM model above, in contrast, expresses the distribution over the expert trajectories.
As is well known, the biggest problem with PGMs is that the partition function, an integral (or sum) over all trajectories, is intractable, so it must be approximated. How? By using the other trajectory construction, the policy view, to approximate the PGM's trajectory distribution, which makes the partition function $Z=\int p(\tau)\exp\big(r_\psi(\tau)\big)\,d\tau$ estimable!
2. Main Logic of the Paper
For estimating the partition function, the paper mentions Laplace approximation, value-function approximation, and sample-based methods. The paper adopts the sample-based estimate of $Z$; the main steps are as follows:
- First expand the gradient of the objective:

$$\begin{aligned} \nabla_\psi L(\psi)&=\nabla_\psi\Big[E_{\tau\sim\pi^*(\tau)}\big[r_\psi(\tau)\big]-\log\int p(\tau)\exp\big(r_\psi(\tau)\big)d\tau\Big]\\ &=E_{\tau\sim\pi^*(\tau)}\big[\nabla_\psi r_\psi(\tau)\big]-\int\underbrace{\frac{1}{Z}\, p(\tau)\exp\big(r_\psi(\tau)\big)}_{p(\tau|O_{1:T},\psi)}\nabla_\psi r_\psi(\tau)\,d\tau\\ &=E_{\tau\sim\pi^*(\tau)}\big[\nabla_\psi r_\psi(\tau)\big]-E_{\tau\sim p(\tau|O_{1:T},\psi)}\big[\nabla_\psi r_\psi(\tau)\big] \end{aligned}$$
- Express the second term, $E_{\tau\sim p(\tau|O_{1:T},\psi)}[\nabla_\psi r_\psi(\tau)]$, via the policy construction:

$$\begin{aligned} &E_{\tau\sim p(\tau|O_{1:T},\psi)}\big[\nabla_\psi r_\psi(\tau)\big]\\ &=E_{\tau\sim p(\tau|O_{1:T},\psi)}\Big[\nabla_\psi \sum_{t=1}^T r_\psi(s_t,a_t)\Big]\\ &=\sum_{t=1}^T E_{(s_t,a_t)\sim p(s_t,a_t|O_{1:T},\psi)}\big[\nabla_\psi r_\psi(s_t,a_t)\big]\\ &=\sum_{t=1}^T E_{s_t\sim p(s_t|O_{1:T},\psi),\,a_t\sim p(a_t|s_t,O_{1:T},\psi)}\big[\nabla_\psi r_\psi(s_t,a_t)\big] \end{aligned}$$
- Samples from the soft optimal policy $p(a_t|s_t,O_{1:T},\psi)$ are approximated by another policy $\pi_\theta(a_t|s_t)$; importance sampling then lets us reuse trajectory samples drawn from $\pi_\theta(a_t|s_t)$:

$$\begin{aligned} \nabla_\psi L(\psi)&=E_{\tau\sim\pi^*(\tau)}\big[\nabla_\psi r_\psi(\tau)\big]-E_{\tau\sim p(\tau|O_{1:T},\psi)}\big[\nabla_\psi r_\psi(\tau)\big]\\ &\approx\frac{1}{N}\sum_{i=1}^N \nabla_\psi r_\psi(\tau^{(i)})-\frac{1}{\sum_j w_j}\sum_{j=1}^M w_j\nabla_\psi r_\psi(\tau^{(j)}) \end{aligned}$$

$$\begin{aligned} w_j&=\frac{p(\tau)\exp\big(r_\psi(\tau^{(j)})\big)}{\pi_\theta(\tau^{(j)})}\\ &=\frac{p(s_1)\prod_{t}p(s_{t+1}|s_t,a_t)\exp\big(r_\psi(s_t,a_t)\big)}{p(s_1)\prod_{t}p(s_{t+1}|s_t,a_t)\,\pi_\theta(a_t|s_t)}\\ &=\frac{\exp\big(\sum_t r_\psi(s_t,a_t)\big)}{\prod_t\pi_\theta(a_t|s_t)} \end{aligned}$$
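The weights $w_j$ are best computed in log space to avoid overflow. A small sketch; the array shapes and self-normalization are my own choices for illustration:

```python
import numpy as np

def importance_weights(rewards, log_pis):
    """Self-normalized IS weights w_j = exp(sum_t r_psi) / prod_t pi_theta(a_t|s_t).
    rewards, log_pis: arrays of shape (M, T) for M trajectories of length T."""
    log_w = rewards.sum(axis=1) - log_pis.sum(axis=1)  # log of exp(sum r) / prod pi
    log_w -= log_w.max()                               # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()                                 # the 1 / sum_j w_j factor

# Equal rewards and equal policy likelihoods give uniform weights
w = importance_weights(np.zeros((3, 5)), np.log(np.full((3, 5), 0.5)))
print(w)  # -> [0.33333333 0.33333333 0.33333333]
```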
- The above computes the gradient of the reward function by means of the other policy, which updates $r_\psi(s_t,a_t)$: we seek the reward function under which the expert behavior has high value and the current policy's behavior has low value:

$$\psi\leftarrow\psi + \alpha \nabla_\psi L(\psi)$$
- Next, fix the current reward function $r_\psi(\tau)$ and update the policy $\pi_\theta(a_t|s_t)$ with REINFORCE, seeking the policy that achieves high value under the current reward:

$$\nabla_\theta L(\theta)= \frac{1}{M}\sum_{j=1}^M\nabla_\theta \log\pi_\theta(\tau_j)\,r_\psi(\tau_j),\qquad \theta \leftarrow \theta + \alpha\nabla_\theta L(\theta)$$
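With per-trajectory score vectors $\nabla_\theta \log\pi_\theta(\tau_j)$ in hand, this estimator is one line; a sketch with made-up arrays standing in for real rollout statistics:

```python
import numpy as np

def reinforce_grad(score_vecs, traj_rewards):
    """grad_theta L(theta) ~ (1/M) sum_j grad_theta log pi_theta(tau_j) * r_psi(tau_j).
    score_vecs: (M, d) array of grad_theta log pi_theta(tau_j); traj_rewards: (M,)."""
    return (score_vecs * traj_rewards[:, None]).mean(axis=0)

g = reinforce_grad(np.ones((4, 2)), np.array([1.0, 2.0, 3.0, 4.0]))
theta = np.zeros(2)
theta = theta + 0.01 * g              # theta <- theta + alpha * grad_theta L(theta)
print(g)  # -> [2.5 2.5]
```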
- Then iterate, alternating updates of the reward function and the policy: the reward function moves in the direction that makes expert behavior high-value and current-policy behavior low-value, while the policy moves toward behavior with high value under the current reward, which again exhibits the adversarial idea! (Flowchart omitted.)
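The full alternation can be sketched end to end on a toy problem. Here trajectories are collapsed to feature vectors, the reward is linear in $\psi$, and the "policy" is a Gaussian over features with mean $\theta$ — all illustrative simplifications of mine, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)

expert_phi = rng.normal(1.0, 0.3, size=(50, 2))   # phi(tau^(i)) of expert demos

def sample_policy(theta, M=50):
    # stand-in for rolling out pi_theta: features clustered around theta
    return rng.normal(theta, 0.3, size=(M, 2))

def guided_cost_learning(iters=200, alpha=0.05):
    psi, theta = np.zeros(2), np.zeros(2)
    for _ in range(iters):
        policy_phi = sample_policy(theta)
        # reward step: push expert features up, current-policy features down
        psi = psi + alpha * (expert_phi.mean(axis=0) - policy_phi.mean(axis=0))
        # policy step: for this Gaussian, grad_theta E[psi . phi] is just psi
        theta = theta + alpha * psi
    return psi, theta

psi, theta = guided_cost_learning()
```

The reward step stops moving exactly when the policy's features match the expert's, which is the fixed point the alternation is driving toward.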
3. Experimental Details
- Where do the expert demonstrations come from?
Answer: trajectories generated by running policies from Guided Policy Search, RL from scratch, or trajectory optimization through the dynamics.
- How is the policy $\pi_\theta(a_t|s_t)$ initialized?
Answer: starting from a random initial state, fit a linear-Gaussian controller to the expert data via trajectory optimization, or produce a motion that tracks the average demonstration with variance proportional to the variation between demonstrated motions.
Experimental results: (figures not reproduced here)
4. Summary
- Contribution
In the Inverse RL setting, what is learned is a cost function; this paper uses information from the policy update to guide the learning of that cost function, hence the name Guided Cost Learning. While learning the cost function, a good policy is obtained as well. By contrast, GAIL directly recovers a policy without learning a cost function.
- One-sentence summary: no hand-design of the cost function is needed; policy optimization directly on the raw state representation guides the learning of the cost function, addressing both high dimensionality and unknown dynamics.
- Appendix A has a summary of Guided Policy Search and GPS + trajectory optimization under unknown dynamics; highly recommended.
Takeaway: the goal here is to learn a reward function, using policy optimization to provide guidance. The advantage is raw state input with no hand-engineered state features; visual input, however, remains hard to handle.
Possible improvements
- Constraints could be added to the policy-optimization update so that the direction of the learned reward becomes more explicit.
- In the reward update, the partition function is estimated from samples via importance sampling; the weights could be further optimized, and the partition function could also be approximated by other means, not necessarily samples.
Implementations of Guided Cost Learning are scarce; I have only looked at one:
https://github.com/bbrighttaer/guided-irl