Paper-6 Deep Dive: Deep Visuomotor Policies (2016 JMLR)

Overview

This is a 2016 Journal of Machine Learning Research paper: End-to-End Training of Deep Visuomotor Policies. It is very complete, and it is particularly well suited for organizing GPS, Trajectory Optimization, and Guided Cost Learning into a single framework.

In short, traditional methods take the raw state as input, whereas this paper takes an image, i.e., an observation, as input, so the problem being solved is a POMDP.

Normally the observation has to be turned into a state through state estimation or perception, and control is then performed on that state. End-to-end means that perception and control are trained together.

The derivations in the paper are a bit long-winded; if you lose the thread halfway through, feel free to jump straight to the summary.
[Figure: notation used in the paper]

One note on notation: the paper's $l(x_t,u_t)$ is written here in cost-function form as $c(x_t,u_t)$.

1. Sorting Out the Logic

1.1 Objective

The goal is to learn a visuomotor policy, i.e., $\pi(u_t|o_t)$: given an image observation, decide which action to take.

The object being parameterized is therefore $\pi_\theta(u_t|o_t)$:

$$\pi_\theta(u_t|x_t)=\int \pi_\theta(u_t|o_t)\,p(o_t|x_t)\,do_t$$

But the trajectory distribution is still defined over states:

$$\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)\,p(x_{t+1}|x_t,u_t)$$

The objective is therefore to find parameters $\theta$ that minimize the expected cost:

$$E_{\pi_\theta(\tau)}\Big[\sum_{t=1}^T c(x_t,u_t)\Big]$$

There are two unknown objects: the dynamics $p(x_{t+1}|x_t,u_t)$ and the observation distribution $p(o_t|x_t)$.

Only one object is parameterized: $\pi_\theta(u_t|o_t)$.

The network architecture below is therefore used to model $\pi_\theta(u_t|o_t)$, with its form fixed as $\pi_\theta(u_t|o_t)=N(\mu(o_t),\Sigma(o_t))$.
[Figure: CNN visuomotor policy architecture]
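
For a concrete picture of what such a network can look like, here is a minimal PyTorch sketch in the spirit of the paper's architecture: a few conv layers, a spatial softmax that extracts expected feature-point coordinates, then small fully connected layers producing the action mean. The layer sizes and the observation-independent diagonal covariance are my own illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisuomotorPolicy(nn.Module):
    """Sketch of pi_theta(u_t | o_t) = N(mu(o_t), Sigma): conv -> spatial softmax -> FC."""

    def __init__(self, action_dim=7, robot_state_dim=14):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=5), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=5), nn.ReLU(),
        )
        # 32 feature maps -> 32 expected (x, y) feature points = 64 numbers
        self.fc = nn.Sequential(
            nn.Linear(64 + robot_state_dim, 40), nn.ReLU(),
            nn.Linear(40, 40), nn.ReLU(),
            nn.Linear(40, action_dim),
        )
        # Illustrative assumption: a learned, observation-independent diagonal log-std
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    @staticmethod
    def spatial_softmax(feat):
        # feat: (B, C, H, W) -> per-channel expected image coordinates in [-1, 1]
        b, c, h, w = feat.shape
        probs = F.softmax(feat.view(b, c, h * w), dim=-1).view(b, c, h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=feat.device)
        ys = torch.linspace(-1.0, 1.0, h, device=feat.device)
        ex = (probs.sum(dim=2) * xs).sum(dim=-1)    # (B, C): expected x per channel
        ey = (probs.sum(dim=3) * ys).sum(dim=-1)    # (B, C): expected y per channel
        return torch.cat([ex, ey], dim=1)           # (B, 2C) feature points

    def forward(self, image, robot_state):
        points = self.spatial_softmax(self.conv(image))
        mu = self.fc(torch.cat([points, robot_state], dim=1))
        return mu, self.log_std.exp()               # mean and diagonal std of N(mu, Sigma)
```

The spatial softmax is the part that matters most here: it turns the feature maps into a small set of expected image coordinates, which keeps the fully connected part tiny.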

1.2 Source of Supervision

Having fixed the target $\pi_\theta(u_t|o_t)$, the next question is where the supervision comes from: given an $o_t$, which action $u_t$ should be taken?

The supervision comes from the actions solved for under the RL objective, i.e., from $p_i(u_t|x_t)$; the system state $x_t$ is known. (At each time step the pair $(x_t,o_t)$ is recorded; RL then trains a $p_i(u_t|x_t)$ whose actions provide supervision for $o_t$.)

The main questions therefore become:

  1. How do we train the linear-Gaussian controllers $p_i(u_t|x_t)$, and with what objective?
  2. What is the objective for training $\pi(u_t|o_t)$?
  3. What does the overall training procedure look like?

1.3 Training Framework

[Figure: training framework, with an Outer Loop that trains the controllers and an Inner Loop that trains the policy]

  1. Unknown dynamics: $p(x_{t+1}|x_t,u_t)$
  2. Linear-Gaussian controllers: $p_i(u_t|x_t)$
    The subscript $i$ in $p_i(u_t|x_t)$ indexes controllers starting from different initial states.
  3. Trajectory distribution:
    $p_i(\tau)=p_i(x_1)\prod_{t=1}^T p_i(u_t|x_t)\,p(x_{t+1}|x_t,u_t)$

So the leftmost loop, the Outer Loop, trains the controllers with a traditional method against the objective $L_p$, while the Inner Loop takes guided samples from those controllers and trains against the objective $L_\theta$ to obtain $\pi_\theta(u_t|o_t)$.

  1. Policy distribution:
    $\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)\,p(x_{t+1}|x_t,u_t)$

Note that the trajectory distributions and the policy distribution do not share the same initial states. For the policy to end up generalizing across initial states, it has to be iterated until its state distribution matches that of the trajectory distributions, and that is exactly the part of the Inner Loop where $L_p$ and $L_\theta$ are optimized in alternation.

With the overall flow clear, keep an eye below on how the two unknown objects are handled:
one is the dynamics $p(x_{t+1}|x_t,u_t)$, the other is the observation distribution $p(o_t|x_t)$.

2. Dive Into Details

2.1 Deriving the Overall Objective

First, what should the training objective for the learning target $\pi_\theta(u_t|o_t)$ be?

$$\pi_\theta(u_t|x_t)=\int \pi_\theta(u_t|o_t)\,p(o_t|x_t)\,do_t$$

So let us look at $\pi_\theta(u_t|x_t)$:

$$\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)\,p(x_{t+1}|x_t,u_t)$$

Optimizing the following objective would therefore learn $\pi_\theta(u_t|x_t)$:

$$E_{\pi_\theta(\tau)}[c(\tau)]$$

This is a classic RL problem; it could of course be solved with on-policy algorithms such as VPG, TRPO, or PPO, or with DDPG, TD3, or SAC, but those need rather a lot of samples, which is why they are usually trained in simulation and then transferred to the real world. Instead, the classic GPS scheme is used to guide the learning; the guiding object is $p(u_t|x_t)$ and the objective becomes:

$$\min_{p,\pi_\theta}E_{p(\tau)}[c(\tau)]\qquad \text{s.t.}\quad p(x_t)\,p(u_t|x_t)=p(x_t)\,\pi_\theta(u_t|x_t)$$

The samples now come from $p(\tau)$ instead of $\pi_\theta(\tau)$, and a constraint is added to keep the two trajectory distributions identical.

$$\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)\,p(x_{t+1}|x_t,u_t)\qquad p(\tau)=p(x_1)\prod_{t=1}^T p(u_t|x_t)\,p(x_{t+1}|x_t,u_t)$$

(The unknown environment dynamics enter both trajectory distributions identically, so when matching the two distributions only the corresponding per-time-step policies need to be constrained to agree.)
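
A quick check of why the dynamics drop out: writing the KL divergence between the two trajectory distributions and substituting both factorizations, the initial-state and dynamics terms cancel, leaving only per-time-step policy divergences:

$$\begin{aligned} D_{KL}\big(p(\tau)\,\|\,\pi_\theta(\tau)\big) &=E_{p(\tau)}\Big[\log\frac{p(x_1)\prod_t p(u_t|x_t)\,p(x_{t+1}|x_t,u_t)}{p(x_1)\prod_t \pi_\theta(u_t|x_t)\,p(x_{t+1}|x_t,u_t)}\Big]\\ &=\sum_{t=1}^T E_{p(x_t,u_t)}\Big[\log\frac{p(u_t|x_t)}{\pi_\theta(u_t|x_t)}\Big] =\sum_{t=1}^T E_{p(x_t)}\big[D_{KL}(p(u_t|x_t)\,\|\,\pi_\theta(u_t|x_t))\big]. \end{aligned}$$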

Next we set about actually solving this optimization problem.

2.2 Solving the Optimization Problem

The optimization uses BADMM, a variant of ADMM. A brief recap:

2.2.1 Augmented Lagrangian Method (ALM)

Optimization problem:

$$\min_x f(x)\qquad \text{s.t.}\quad Ax=b$$

Primal problem:

$$\min_x\max_\lambda L(x,\lambda)=f(x)+\lambda^T(Ax-b)+\frac{p}{2}\|Ax-b\|^2$$

Dual problem:

$$\max_\lambda\min_x L(x,\lambda)=f(x)+\lambda^T(Ax-b)+\frac{p}{2}\|Ax-b\|^2$$

Iterative solution:

$$\begin{aligned} &\text{step 1:}\quad x^{k+1}=\arg\min_x L(x,\lambda^k)\\ &\text{step 2:}\quad \lambda^{k+1}=\arg\max_\lambda L(x^{k+1},\lambda) \end{aligned}$$

The extra term $\frac{p}{2}\|Ax-b\|^2$ is there to make the Lagrangian more convex.
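
As a toy illustration of these two steps (my own example, not from the paper): the method of multipliers on a small equality-constrained quadratic, where the x-step has a closed form; `rho` plays the role of the penalty weight $p$ above.

```python
import numpy as np

# Toy problem: minimize 0.5 * ||x - x0||^2  subject to  a^T x = b.
x0 = np.array([3.0, -1.0])
a = np.array([1.0, 1.0])
b = 1.0
rho = 10.0           # penalty weight (the "p" in the text)
lam = 0.0            # Lagrange multiplier

x = np.zeros(2)
for k in range(50):
    # step 1: x^{k+1} = argmin_x L(x, lam) -- closed form because the objective is quadratic:
    #   (I + rho * a a^T) x = x0 - lam * a + rho * b * a
    x = np.linalg.solve(np.eye(2) + rho * np.outer(a, a), x0 - lam * a + rho * b * a)
    # step 2: dual ascent on lambda (the argmax step), with step size rho
    lam = lam + rho * (a @ x - b)

print(x, a @ x)      # converges to the constrained optimum [2.5, -1.5], with a^T x -> 1
```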

2.2.2 ADMM

Optimization problem:

$$\min_{x,z}f(x)+g(z)\qquad \text{s.t.}\quad Ax+Bz=C$$

Primal problem:

$$\min_{x,z}\max_\lambda L(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-C)+\frac{p}{2}\|Ax+Bz-C\|^2$$

Dual problem:

$$\max_\lambda\min_{x,z} L(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-C)+\frac{p}{2}\|Ax+Bz-C\|^2$$

Iterative solution:

$$\begin{aligned} &\text{step 1:}\quad x^{k+1}=\arg\min_x L(x,z^k,\lambda^k)\\ &\text{step 2:}\quad z^{k+1}=\arg\min_z L(x^{k+1},z,\lambda^k)\\ &\text{step 3:}\quad \lambda^{k+1}=\arg\max_\lambda L(x^{k+1},z^{k+1},\lambda) \end{aligned}$$
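
And a matching toy ADMM sketch (again my own example): a two-block consensus problem $\min f(x)+g(z)$ s.t. $x-z=0$, where all three steps have closed forms.

```python
import numpy as np

# Toy problem: minimize 0.5*||x - c1||^2 + 0.5*||z - c2||^2  subject to  x - z = 0.
c1, c2 = np.array([4.0, 0.0]), np.array([0.0, 2.0])
rho = 1.0
x, z, lam = np.zeros(2), np.zeros(2), np.zeros(2)

for k in range(100):
    # step 1: x-minimization of L(x, z^k, lam^k):  (x - c1) + lam + rho*(x - z) = 0
    x = (c1 - lam + rho * z) / (1.0 + rho)
    # step 2: z-minimization of L(x^{k+1}, z, lam^k):  (z - c2) - lam - rho*(x - z) = 0
    z = (c2 + lam + rho * x) / (1.0 + rho)
    # step 3: dual ascent on the multiplier of x - z = 0
    lam = lam + rho * (x - z)

print(x, z)   # both converge to the consensus point (c1 + c2) / 2 = [2., 1.]
```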

2.2.3 BADMM

BADMM then adds a Bregman divergence between the constrained variables x and z on top of ADMM, much like the quadratic penalty added in ALM, except that here the added term is a KL divergence.

Optimization problem:

$$\min_{x,z}f(x)+g(z)\qquad \text{s.t.}\quad Ax+Bz=C$$

The alternating subproblems:

$$\min_x L(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-C)+p\,KL(x\,\|\,z)$$

$$\min_z L(z,x,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-C)+p\,KL(z\,\|\,x)$$

$$\max_\lambda L(\lambda,x,z)=f(x)+g(z)+\lambda^T(Ax+Bz-C)$$

Iterative procedure:

For the exact update rules I simply paste the figure from the BADMM paper, since they are not the focus here. In that figure, $y_t=\lambda_t$ and the Bregman divergence B is chosen to be the KL divergence.
[Figure: BADMM update rules, from the BADMM paper]

2.2.4 Applying BADMM to the Objective

(Here x plays the role of $\theta$ and z plays the role of $p$.)
Optimization problem:

$$\min_{p,\pi_\theta}E_{p(\tau)}[c(\tau)]\qquad \text{s.t.}\quad p(x_t)\,p(u_t|x_t)=p(x_t)\,\pi_\theta(u_t|x_t)$$

Rewrite the trajectory-level problem in per-time-step form (an equivalent formulation, though it takes a moment of thought to convince yourself 🤔):

$$\min_{p,\theta}\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]\qquad \text{s.t.}\quad p(x_t)\,p(u_t|x_t)=p(x_t)\,\pi_\theta(u_t|x_t)$$

Writing it as unconstrained problems requires two of them, because the KL divergence is asymmetric!
The first one optimizes $\theta$:

$$L_\theta(\theta,p)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]+\lambda_t\big[p(x_t)\pi_\theta(u_t|x_t)-p(x_t,u_t)\big]+v_t\phi_t^\theta(\theta,p)$$

The second one optimizes $p$:

$$L_p(\theta,p)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]+\lambda_t\big[p(x_t)\pi_\theta(u_t|x_t)-p(x_t,u_t)\big]+v_t\phi_t^p(\theta,p)$$

$$\begin{aligned} &\phi_t^\theta(\theta,p)=E_{p(x_t)}\big[D_{KL}(\pi_\theta(u_t|x_t)\,\|\,p(u_t|x_t))\big]\\ &\phi_t^p(\theta,p)=E_{p(x_t)}\big[D_{KL}(p(u_t|x_t)\,\|\,\pi_\theta(u_t|x_t))\big] \end{aligned}$$

Then, dropping the terms that do not depend on the variable being solved for, the updates are:

$$\begin{aligned} &\theta\leftarrow \arg\min_\theta\sum_{t=1}^T \lambda_t\,p(x_t)\pi_\theta(u_t|x_t)+v_t\phi_t^\theta(\theta,p)\\ &p\leftarrow\arg\min_p\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]-\lambda_t\,p(x_t,u_t)+v_t\phi_t^p(\theta,p)\\ &\lambda_t\leftarrow \lambda_t+\alpha v_t\big(\pi_\theta(u_t|x_t)p(x_t)-p(u_t|x_t)p(x_t)\big) \end{aligned}$$

Here $v_t$ is set heuristically (see the paper's appendix for the exact rule), or, as in the paper, the updates can be written in expectation form.
Note that $l(x_t,u_t)=c(x_t,u_t)$ and $\lambda_t=\lambda_{x_t,u_t}$, i.e., the multiplier is indexed by state and action.
[Figure: the same updates written in expectation form, from the paper]

2.2.5 Practical Considerations

With the objective and the overall optimization scheme in place, we now run into the computational issues of an actual implementation.

First question: how should so many time-indexed constraints $p(x_t)p(u_t|x_t)=p(x_t)\pi_\theta(u_t|x_t)$ be represented?

Answer: only approximately. They are expressed through the expected action, simplified to matching the first moment, i.e. $E_{p(x_t)p(u_t|x_t)}[u_t]=E_{p(x_t)\pi_\theta(u_t|x_t)}[u_t]$, and the optimization objective then becomes:

[Figure: the relaxed objective after the first-moment substitution, from the paper]

A constraint that two distributions be equal is approximated by a constraint that their expected actions be equal (the real gem of the modeling here, well worth learning from!).
"First moment" means the expectation of the action at the current time step; in principle the constraint could also involve $u_{t+1}$ (another gap left open!).
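
To see what the relaxation buys computationally, here is a tiny numpy illustration (my own example): with linear-Gaussian conditionals, the per-time-step constraint reduces to a difference of expected actions, which is trivial to estimate from state samples, and the multiplier term couples it back into the two Lagrangians.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(1000, 2))                       # samples x_t ~ p(x_t)

# Two linear-Gaussian conditionals over the action given the same states
K_p, k_p = np.array([[1.0, 0.0]]), np.array([0.2])    # controller p(u_t|x_t) mean
K_pi, k_pi = np.array([[0.9, 0.1]]), np.array([0.3])  # policy pi_theta(u_t|x_t) mean

# First moments (expected actions) under p(x_t); covariances do not matter here
Eu_p = (xs @ K_p.T + k_p).mean(axis=0)
Eu_pi = (xs @ K_pi.T + k_pi).mean(axis=0)

lam = np.array([0.5])                                 # multiplier lambda_{u_t}
gap = Eu_pi - Eu_p                                    # E_pi[u_t] - E_p[u_t]
multiplier_term = lam @ gap                           # lambda^T gap, shared by L_theta and L_p
print(gap, multiplier_term)
```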

Second question: in the second objective, the one over $p$, the distribution $p(\tau)$ is unknown. How should this trajectory distribution be approximated?

Answer: with a mixture, using multiple $p_i(u_t|x_t)$ to approximate $p(u_t|x_t)$, where each $p_i(u_t|x_t)$ can be learned by trajectory optimization. (What, a GMM-style mixture? Would a VAE work? Another open gap; no idea whether anyone has claimed it yet, marking it for now~)

Keep the overall flow in mind.

[Figure: overall training flow]
$L_\theta$ is then written as:

$$\arg\min_\theta L_\theta=\arg\min_\theta\sum_{t=1}^T E_{p(x_t)\pi_\theta(u_t|x_t)}[u_t^T\lambda_{ut}]+v_t\phi_t^\theta(\theta,p)$$

and $L_p$ as:

$$\arg\min_p L_p=\arg\min_p \sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)-u_t^T\lambda_{ut}]+v_t\phi_t^p(p,\theta)$$

What remains is to learn $p_i(u_t|x_t)$ under unknown dynamics, and to spell out the concrete computation of $L_\theta$.

2.3 Trajectory Optimization under Unknown Dynamics

$p(\tau)$ is called the guiding distribution, and the $p_i(\tau)$ are trajectory distributions starting from different initial states; inside the policy objective they guide the learning of $\pi_\theta(u_t|x_t)$:

$$p(\tau)=\sum_i p_i(\tau)=\sum_i\Big(p_i(x_1)\prod_{t=1}^T p_i(u_t|x_t)\,p(x_{t+1}|x_t,u_t)\Big)$$

Next, model each individual $p_i(\tau)$:

$$p_i(u_t|x_t)=N(K_t x_t+k_t,\,C_t)\qquad p_i(x_{t+1}|x_t,u_t)=N(f_{xt}x_t+f_{ut}u_t+f_{ct},\,F_t)$$

With the linear-Gaussian form chosen, the setting determines how the parameters are learned. The problem here is stochastic with unknown dynamics, so as in traditional trajectory optimization a dynamics model has to be fitted, i.e., $p_i(x_{t+1}|x_t,u_t)$.

As covered in an earlier paper-reading post, traditional DDP-style methods such as iLQG and iLQR can fit a local linear-Gaussian controller with poor generalization from a handful of expert samples, so when iterating to learn the dynamics model we must stay "local" and not move too far per step.

Define $\hat p(\tau)$ as the trajectory distribution from the previous iteration; the optimization problem is:

$$\min_p L_p(p,\theta)\qquad \text{s.t.}\quad D_{KL}(p(\tau)\,\|\,\hat p(\tau))\leq \epsilon$$

This constraint is what keeps the dynamics fitting procedure honest: the fitted model is only trusted near $\hat p(\tau)$.

Only once we have a dynamics model can we optimize the trajectory distribution~
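
A minimal sketch of that dynamics-fitting step, under my own simplifications: plain per-time-step linear regression on rollout samples, without the GMM prior the paper uses to reduce the sample count.

```python
import numpy as np

def fit_linear_gaussian_dynamics(X, U):
    """Fit p(x_{t+1}|x_t,u_t) = N(f_x x_t + f_u u_t + f_c, F_t) for every time step.

    X: (N, T+1, dx) states from N rollouts, U: (N, T, du) actions.
    Returns a list of (f_xu, f_c, F) tuples for t = 0..T-1.
    """
    N, T, du = U.shape
    dx = X.shape[2]
    params = []
    for t in range(T):
        XU = np.concatenate([X[:, t], U[:, t]], axis=1)       # (N, dx+du)
        A = np.concatenate([XU, np.ones((N, 1))], axis=1)     # append a bias column
        # least squares x_{t+1} ~ A @ W, solved jointly for all next-state dimensions
        W, *_ = np.linalg.lstsq(A, X[:, t + 1], rcond=None)   # (dx+du+1, dx)
        f_xu, f_c = W[:-1].T, W[-1]                           # (dx, dx+du) and (dx,)
        resid = X[:, t + 1] - A @ W
        F = resid.T @ resid / max(N - 1, 1)                   # Gaussian noise covariance
        params.append((f_xu, f_c, F))
    return params

# Illustrative usage on random rollout data
rng = np.random.default_rng(0)
X, U = rng.normal(size=(20, 11, 4)), rng.normal(size=(20, 10, 2))
f_xu, f_c, F = fit_linear_gaussian_dynamics(X, U)[0]
```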

The form of $L_p(p,\theta)$:

$$L_p(p,\theta)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)-u_t^T\lambda_{ut}]+v_t\phi_t^p(\theta,p)$$

$$\phi_t^p(\theta,p)=E_{p(x_t)}\big[D_{KL}(p(u_t|x_t)\,\|\,\pi_\theta(u_t|x_t))\big]$$

$$\pi_\theta(u_t|x_t)=\int \pi_\theta(u_t|o_t)\,p(o_t|x_t)\,do_t$$

With the dynamics model fitted, the next issue is computing the $D_{KL}(p(u_t|x_t)\,\|\,\pi_\theta(u_t|x_t))$ term in the objective. This involves $\pi_\theta(u_t|x_t)$ and hence $\pi_\theta(u_t|o_t)$; the KL computation uses a Fisher-information-matrix approximation that was covered in an earlier post and is not repeated here. The training data used for this estimate is (recall that both $x_t$ and $o_t$ are recorded):

$$\{x_t^i,\;E_{\pi_\theta(u_t|o_t^i)}[u_t]\}_{i=1}^N$$

Now the KL term can be estimated approximately and the dynamics model has been fitted, so iLQG is used directly to handle the constraints and produce a $p_i(\tau)$.

Another way to solve it is to take a second-order approximation of the cost function, keep the linear-Gaussian dynamics, and run the trajectory-optimization pass of the LQR framework.
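
For reference, here is a bare-bones backward pass of that LQR step in numpy, under my own simplifications (quadratic cost and linear dynamics already given, and no KL constraint against the previous distribution). It returns exactly the linear-Gaussian controller form $N(K_t x_t+k_t, C_t)$ used above, with $C_t=Q_{uu,t}^{-1}$.

```python
import numpy as np

def lqr_backward(Fs, fs, Cs, cs, dx, du):
    """Backward pass for a time-varying linear-quadratic problem.

    Dynamics: x_{t+1} = F_t @ [x; u] + f_t,  with F_t of shape (dx, dx+du).
    Cost:     0.5 * [x; u]^T C_t [x; u] + [x; u]^T c_t.
    Returns per-step (K_t, k_t, Sigma_t) so that p(u_t|x_t) = N(K_t x_t + k_t, Sigma_t),
    where Sigma_t = Q_{uu,t}^{-1} (the maximum-entropy linear-Gaussian controller).
    """
    T = len(Fs)
    V = np.zeros((dx, dx))          # quadratic term of the value function
    v = np.zeros(dx)                # linear term of the value function
    controllers = [None] * T
    for t in reversed(range(T)):
        F, f, C, c = Fs[t], fs[t], Cs[t], cs[t]
        Q = C + F.T @ V @ F                     # quadratic term of the Q-function
        q = c + F.T @ (V @ f + v)               # linear term of the Q-function
        Qxx, Qxu = Q[:dx, :dx], Q[:dx, dx:]
        Qux, Quu = Q[dx:, :dx], Q[dx:, dx:]
        qx, qu = q[:dx], q[dx:]
        Quu_inv = np.linalg.inv(Quu)
        K, k = -Quu_inv @ Qux, -Quu_inv @ qu
        controllers[t] = (K, k, Quu_inv)
        # plug u = K x + k back in to get the value function at time t
        V = Qxx + Qxu @ K + K.T @ Qux + K.T @ Quu @ K
        v = qx + Qxu @ k + K.T @ qu + K.T @ Quu @ k
        V = 0.5 * (V + V.T)                     # keep V numerically symmetric
    return controllers
```

In the full method, the quadratic cost fed into this pass would come from expanding the modified cost inside $L_p$ (the $-u_t^T\lambda_{ut}$ and KL terms included), which is what ties it back to the BADMM objective.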

We have now obtained $p(\tau)$, i.e., the guiding distributions, completing the second update, the one over $p$:

$$\arg\min_p \sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)-u_t^T\lambda_{ut}]+v_t\phi_t^p(p,\theta)$$

2.4 Supervised Policy Optimization

Recall the object being modeled: $\pi_\theta(u_t|o_t)=N(\mu(o_t),\Sigma(o_t))$.

$$L_\theta(\theta,p)=\sum_{t=1}^T E_{p(x_t)\pi_\theta(u_t|x_t)}[u_t^T\lambda_{ut}]+v_t\phi_t^\theta(\theta,p)$$

$$\phi_t^\theta(\theta,p)=E_{p(x_t)}\big[D_{KL}(\pi_\theta(u_t|x_t)\,\|\,p(u_t|x_t))\big]$$

Substituting the controllers learned by trajectory optimization, $p_i(u_t|x_t)=N(\mu_{ti}^p(x_t),C_{ti})$, together with the model $\pi_\theta(u_t|o_t)=N(\mu(o_t),\Sigma(o_t))$, and approximating the expectations with N samples drawn from the $(x_t,o_t,u_t)$ data, gives:

$$L_\theta(\theta,p)=\frac{1}{2N}\sum_{i=1}^N\sum_{t=1}^T E_{p_i(x_t,o_t)}\Big[\mathrm{tr}\big[C_{ti}^{-1}\Sigma^\pi(o_t)\big]-\log|\Sigma^\pi(o_t)|+\big(\mu^\pi(o_t)-\mu_{ti}^p(x_t)\big)^T C_{ti}^{-1}\big(\mu^\pi(o_t)-\mu_{ti}^p(x_t)\big)+2\lambda_{ut}^T\mu^\pi(o_t)\Big]$$

The formula looks scary, but put simply it drives the mean $\mu(o_t)$ output by the model $\pi_\theta(u_t|o_t)$ towards the mean $\mu_{ti}^p(x_t)$ output by the guiding distributions, so the policy follows the guiding distributions; the remaining terms act as penalties that grow in proportion to how far the policy strays from the trajectory distribution.
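
As a concrete reading of that expression, here is a minimal PyTorch sketch of one batch of this supervised objective, under my own assumptions about tensor shapes (full covariance matrices passed in explicitly) and without the $\frac{1}{2N}$ bookkeeping over trajectories:

```python
import torch

def supervised_policy_loss(mu_pi, Sigma_pi, mu_p, C_inv, lam_u):
    """One time step of L_theta for a batch of samples.

    mu_pi:    (B, du)      policy mean mu(o_t) from the network
    Sigma_pi: (B, du, du)  policy covariance Sigma(o_t)
    mu_p:     (B, du)      controller mean mu^p_{ti}(x_t) from trajectory optimization
    C_inv:    (B, du, du)  inverse controller covariance C_{ti}^{-1}
    lam_u:    (du,)        Lagrange multiplier lambda_{u_t}
    """
    diff = (mu_pi - mu_p).unsqueeze(-1)                             # (B, du, 1)
    quad = (diff.transpose(1, 2) @ C_inv @ diff).squeeze(-1).squeeze(-1)
    trace = torch.diagonal(C_inv @ Sigma_pi, dim1=-2, dim2=-1).sum(-1)
    logdet = torch.logdet(Sigma_pi)
    lagr = 2.0 * (mu_pi @ lam_u)                                    # 2 * lambda^T mu(o_t)
    return 0.5 * (trace - logdet + quad + lagr).mean()
```

Dropping the multiplier term and the covariances, this reduces to weighted mean-squared-error regression onto the controller means, which is why the step is called supervised policy optimization.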

3. Summary

The paper's whole setup starts from a classic RL objective:

$$E_{\pi_\theta(\tau)}\Big[\sum_{t=1}^T c(x_t,u_t)\Big]$$

$$\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)\,p(x_{t+1}|x_t,u_t)$$

The robot's state is assumed known. Because actor-critic style solutions need far more samples than a real robot can afford, the paper follows the Guided Policy Search route and introduces an expert $p(u_t|x_t)$, which yields a series of per-time-step constraints:

$$\min_{p,\pi_\theta}E_{p(\tau)}[c(\tau)]\qquad \text{s.t.}\quad p(x_t)\,p(u_t|x_t)=p(x_t)\,\pi_\theta(u_t|x_t)$$

A major selling point of the paper is that control is performed from the camera's high-dimensional, complex image observations rather than from a few dozen state dimensions, hence

$$\pi_\theta(u_t|x_t)=\int \pi_\theta(u_t|o_t)\,p(o_t|x_t)\,do_t$$

which can be estimated by Monte Carlo from the $(x_t,o_t,u_t)$ data.

The pile of per-time-step constraints is represented by requiring expected actions to match, as a stand-in for the distributions being equal.

The BADMM optimization framework is used:

$$\min_{x,z}f(x)+g(z)\qquad \text{s.t.}\quad Ax+Bz=C$$

The first Lagrangian optimizes $\theta$:

$$L_\theta(\theta,p)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]+\lambda_t\big[p(x_t)\pi_\theta(u_t|x_t)-p(x_t,u_t)\big]+v_t\phi_t^\theta(\theta,p)$$

The second optimizes $p$:

$$L_p(\theta,p)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]+\lambda_t\big[p(x_t)\pi_\theta(u_t|x_t)-p(x_t,u_t)\big]+v_t\phi_t^p(\theta,p)$$

$$\begin{aligned} &\phi_t^\theta(\theta,p)=E_{p(x_t)}\big[D_{KL}(\pi_\theta(u_t|x_t)\,\|\,p(u_t|x_t))\big]\\ &\phi_t^p(\theta,p)=E_{p(x_t)}\big[D_{KL}(p(u_t|x_t)\,\|\,\pi_\theta(u_t|x_t))\big] \end{aligned}$$

Then the KL divergences are approximated, and trajectory optimization is iterated to obtain the guiding distributions.

The drawback is that the state must be fully known during training, even though control is ultimately done from visual observations.

Worth learning from: the way the modeling copes with an abundance of constraints, and the approximation tricks used to make the computation tractable.

One-sentence summary: with fully known state and unknown dynamics, Guided Policy Search is used to extend the problem to visual control from high-dimensional, complex observations. Getting it to train in practice also involves various pre-training stages and a fairly involved initialization.
