End-to-End Training of Deep Visuomotor Policies
Overview
This is a 2016 Journal of Machine Learning Research paper: End-to-End Training of Deep Visuomotor Policies. It is very complete and especially well suited for organizing GPS, trajectory optimization, and guided cost learning into a single framework.
In short, traditional methods take the raw state as input, whereas this paper takes an image observation as input, i.e. it tackles a POMDP.
Usually the observation is turned into a state via state estimation or perception, and control is then performed on that state. "End-to-End" means perception and control are trained together.
The formula details in the paper are somewhat lengthy; if you lose the thread halfway, feel free to jump straight to the summary.
On notation: the loss $l(x_t,u_t)$ from the paper is written here in cost-function form, i.e. $c(x_t,u_t)$.
1. Logical Outline
1.1 Objective
The goal is to learn a visuomotor policy $\pi(u_t|o_t)$: given an image observation, decide which action to take.
The parameterized object is therefore $\pi_\theta(u_t|o_t)$:
$$\pi_\theta(u_t|x_t)=\int \pi_\theta(u_t|o_t)p(o_t|x_t)do_t$$
But the trajectory distribution is still defined over states:
$$\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)p(x_{t+1}|x_t,u_t)$$
The objective is therefore to find parameters $\theta$ that minimize the expected cost:
$$E_{\pi_\theta(\tau)}\Big[\sum_{t=1}^T c(x_t,u_t)\Big]$$
There are two unknown objects: the dynamics $p(x_{t+1}|x_t,u_t)$ and the observation distribution $p(o_t|x_t)$.
Only one object is parameterized: $\pi_\theta(u_t|o_t)$.
Hence $\pi_\theta(u_t|o_t)$ is modeled with the paper's network architecture and given the form
$$\pi_\theta(u_t|o_t)=N(\mu(o_t),\Sigma(o_t))$$
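As a concrete reference, here is a minimal PyTorch sketch of such a network (my own illustration with hypothetical layer sizes; the paper's actual architecture uses convolutional layers followed by a spatial softmax and fully connected layers):

```python
import torch
import torch.nn as nn

class GaussianVisuomotorPolicy(nn.Module):
    """Maps an image observation o_t to N(mu(o_t), Sigma(o_t)) over actions.
    Hypothetical sizes; not the paper's exact architecture."""
    def __init__(self, action_dim=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),                       # collapse spatial dims to 4x4
        )
        self.fc = nn.Sequential(nn.Linear(32 * 4 * 4, 64), nn.ReLU())
        self.mu_head = nn.Linear(64, action_dim)           # mean mu(o_t)
        self.log_std_head = nn.Linear(64, action_dim)      # diagonal Sigma(o_t)

    def forward(self, obs):
        h = self.fc(self.conv(obs).flatten(1))
        mu = self.mu_head(h)
        std = self.log_std_head(h).exp()
        return torch.distributions.Normal(mu, std)

# usage: dist = policy(obs_batch); action = dist.sample()
```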
1.2 Supervision
With the target $\pi_\theta(u_t|o_t)$ fixed, where does the supervision come from? That is, given $o_t$, which action $u_t$ should be taken?
The supervision comes from actions solved out of an RL objective, i.e. $p_i(u_t|x_t)$, where the system state $x_t$ is known. (At each timestep, $(x_t,o_t)$ is recorded; RL then trains a $p_i(u_t|x_t)$ that provides supervision for $o_t$.)
The main questions therefore become:
- How do we train the linear-Gaussian controllers $p_i(u_t|x_t)$, and with what objective?
- What is the objective for training $\pi(u_t|o_t)$?
- What does the overall training procedure look like?
1.3 Training Framework
- Unknown dynamics: $p(x_{t+1}|x_t,u_t)$
- Linear-Gaussian controllers: $p_i(u_t|x_t)$, where $i$ indexes policies starting from different initial states
- Trajectory distribution: $p_i(\tau)=p_i(x_1)\prod_{t=1}^T p_i(u_t|x_t)p(x_{t+1}|x_t,u_t)$
So the outer loop (the leftmost loop in the framework figure) trains the controllers with traditional methods under objective $L_p$, while the inner loop takes guided samples from those controllers and trains under objective $L_\theta$ to obtain $\pi_\theta(u_t|o_t)$.
- Policy distribution:
$$\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)p(x_{t+1}|x_t,u_t)$$
The trajectory distribution and the policy distribution thus have different initial states. For the policy to generalize across initial states, it has to be iterated until its state distribution matches that of the trajectory distribution, which is exactly the alternating optimization of $L_p$ and $L_\theta$ in the inner loop.
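To make the alternation concrete, here is a hedged pseudocode sketch of the outer/inner loop (the helper names fit_dynamics, trajectory_optimization, train_policy_supervised, and update_duals are hypothetical placeholders for the steps described above, not the paper's actual API):

```python
def guided_policy_search(controllers, policy, lambdas,
                         fit_dynamics, trajectory_optimization,
                         train_policy_supervised, update_duals,
                         num_outer_iters=10, num_inner_iters=4):
    """Hedged sketch of the alternating GPS loop. All sub-procedures are passed
    in as callables with hypothetical signatures:
      fit_dynamics(controller)                      -> local linear-Gaussian dynamics
      trajectory_optimization(c, dyn, policy, lam)  -> updated controller (L_p step)
      train_policy_supervised(policy, ctrls, lam)   -> updated policy     (L_theta step)
      update_duals(lam, policy, ctrls)              -> updated Lagrange multipliers
    """
    for outer in range(num_outer_iters):
        # Outer loop: run each linear-Gaussian controller p_i(u_t|x_t) on the system
        # and fit local dynamics from the resulting samples.
        dynamics = [fit_dynamics(c) for c in controllers]
        for inner in range(num_inner_iters):
            # Improve each controller by trajectory optimization (objective L_p).
            controllers = [trajectory_optimization(c, dyn, policy, lambdas)
                           for c, dyn in zip(controllers, dynamics)]
            # Guided samples (x_t, o_t, u_t) from the controllers supervise the
            # visuomotor policy pi_theta(u_t|o_t) (objective L_theta).
            policy = train_policy_supervised(policy, controllers, lambdas)
            # Dual update keeps controllers and policy consistent.
            lambdas = update_duals(lambdas, policy, controllers)
    return policy, controllers
```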
With the workflow clear, note how the two unknown objects are handled:
one is the dynamics $p(x_{t+1}|x_t,u_t)$, the other is the observation distribution $p(o_t|x_t)$.
2. Dive Into Details
2.1 Deriving the Overall Objective
First, what should the objective for the learning target $\pi_\theta(u_t|o_t)$ be?
$$\pi_\theta(u_t|x_t)=\int \pi_\theta(u_t|o_t)p(o_t|x_t)do_t$$
So consider $\pi_\theta(u_t|x_t)$:
$$\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)p(x_{t+1}|x_t,u_t)$$
Optimizing the following objective therefore learns $\pi_\theta(u_t|x_t)$:
$$E_{\pi_\theta(\tau)}[c(\tau)]$$
This is a classic RL problem, so it could of course be solved with on-policy algorithms such as VPG, TRPO, or PPO, or with DDPG, TD3, or SAC. But those need rather many samples, which usually means training in simulation and then transferring to the real world. Instead, the classic GPS algorithm provides guidance, with guiding object $p(u_t|x_t)$, and the objective becomes:
$$\min_{p,\pi_\theta}E_{p(\tau)}[c(\tau)]\quad \text{s.t.}\quad p(x_t)p(u_t|x_t)=p(x_t)\pi_\theta(u_t|x_t)$$
Now samples come from $p(\tau)$ rather than $\pi_\theta(\tau)$, with a constraint forcing the two trajectory distributions to match.
$$\pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)p(x_{t+1}|x_t,u_t)\qquad p(\tau)=p(x_1)\prod_{t=1}^T p(u_t|x_t)p(x_{t+1}|x_t,u_t)$$
(The unknown environment dynamics appear identically in both trajectory distributions once the problem is cast as distribution matching, so only the corresponding per-step policies need to be constrained to be equal.)
Next we set about solving this optimization problem.
2.2 Solving the Objective
The optimization uses BADMM, a variant of ADMM. A brief recap:
2.2.1 Augmented Lagrangian Method (ALM)
Optimization problem:
$$\min_x f(x)\quad \text{s.t.}\quad Ax=b$$
Primal problem:
$$\min_x\max_\lambda L(x,\lambda)=f(x)+\lambda^T(Ax-b)+\frac{p}{2}\|Ax-b\|^2$$
Dual problem:
$$\max_\lambda\min_x L(x,\lambda)=f(x)+\lambda^T(Ax-b)+\frac{p}{2}\|Ax-b\|^2$$
Iterative updates:
$$\begin{aligned} &\text{step 1:}\quad x^{k+1}=\arg\min_x L(x,\lambda^k)\\ &\text{step 2:}\quad \lambda^{k+1}=\arg\max_\lambda L(x^{k+1},\lambda) \end{aligned}$$
The extra term $\frac{p}{2}\|Ax-b\|^2$ makes the Lagrangian more convex (better conditioned).
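A toy numerical illustration of these two steps (my own example, not from the paper): minimize $f(x)=x^2$ subject to $x=1$; the optimum is $x^*=1$ with multiplier $\lambda^*=-2$.

```python
# Toy ALM example: minimize x^2  s.t.  x = 1  (optimum x* = 1, lambda* = -2).
p = 10.0          # penalty weight (the p/2 coefficient above)
lam = 0.0         # Lagrange multiplier
x = 0.0
for k in range(20):
    # step 1: argmin_x  x^2 + lam*(x - 1) + (p/2)*(x - 1)^2  (closed form here)
    x = (p - lam) / (2.0 + p)
    # step 2: gradient ascent on lambda with step size p
    lam = lam + p * (x - 1.0)
print(x, lam)     # -> approximately 1.0, -2.0
```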
2.2.2 ADMM
Optimization problem:
$$\min_{x,z}f(x)+g(z)\quad \text{s.t.}\quad Ax+Bz=C$$
Primal problem:
$$\min_{x,z}\max_\lambda L(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-C)+\frac{p}{2}\|Ax+Bz-C\|^2$$
Dual problem:
$$\max_\lambda\min_{x,z} L(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-C)+\frac{p}{2}\|Ax+Bz-C\|^2$$
Iterative updates:
$$\begin{aligned} &\text{step 1:}\quad x^{k+1}=\arg\min_x L(x,z^k,\lambda^k)\\ &\text{step 2:}\quad z^{k+1}=\arg\min_z L(x^{k+1},z,\lambda^k)\\ &\text{step 3:}\quad \lambda^{k+1}=\arg\max_\lambda L(x^{k+1},z^{k+1},\lambda) \end{aligned}$$
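A similarly small ADMM illustration (again my own toy example): minimize $f(x)+g(z)=(x-3)^2+z^2$ subject to $x-z=0$, whose solution is $x=z=1.5$ with multiplier $\lambda^*=3$.

```python
# Toy ADMM example: minimize (x-3)^2 + z^2  s.t.  x - z = 0  (A=1, B=-1, C=0).
# Optimum: x* = z* = 1.5, lambda* = 3.
p = 1.0
x, z, lam = 0.0, 0.0, 0.0
for k in range(100):
    # step 1: argmin_x (x-3)^2 + lam*(x-z) + (p/2)*(x-z)^2
    x = (6.0 - lam + p * z) / (2.0 + p)
    # step 2: argmin_z z^2 + lam*(x-z) + (p/2)*(x-z)^2
    z = (lam + p * x) / (2.0 + p)
    # step 3: dual ascent on lambda
    lam = lam + p * (x - z)
print(x, z, lam)  # -> approximately 1.5, 1.5, 3.0
```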
2.2.3 BADMM
BADMM then adds to ADMM a Bregman divergence between the constrained variables $x$ and $z$, analogous to the quadratic penalty added in ALM, except that here the penalty is a KL divergence.
Optimization problem:
$$\min_{x,z}f(x)+g(z)\quad \text{s.t.}\quad Ax+Bz=C$$
Per-variable subproblems:
$$\min_x L(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-C)+p\,KL(x\|z)$$
$$\min_z L(z,x,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-C)+p\,KL(z\|x)$$
$$\max_\lambda L(\lambda,x,z)=f(x)+g(z)+\lambda^T(Ax+Bz-C)$$
Iterative updates:
For the concrete update steps, refer to the figure in the BADMM paper (not reproduced here; it is not the focus). There $y_t=\lambda_t$, and the Bregman divergence $B$ is chosen to be the KL divergence.
2.2.4 Applying BADMM to the Objective
(treat $x$ as $\theta$ and $z$ as $p$)
Optimization problem:
$$\min_{p,\pi_\theta}E_{p(\tau)}[c(\tau)]\quad \text{s.t.}\quad p(x_t)p(u_t|x_t)=p(x_t)\pi_\theta(u_t|x_t)$$
Rewrite the trajectory-level optimization problem in per-timestep form (an equivalent formulation, but worth pausing to convince yourself 🤔):
$$\min_{p,\theta}\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]\quad \text{s.t.}\quad p(x_t)p(u_t|x_t)=p(x_t)\pi_\theta(u_t|x_t)$$
Writing it as an unconstrained problem requires two Lagrangians, because the KL divergence is asymmetric!
The first optimizes $\theta$:
$$L_\theta(\theta,p)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]+\lambda_t\Big[p(x_t)\pi_\theta(u_t|x_t)-p(x_t,u_t)\Big]+v_t\phi_t^\theta(\theta,p)$$
The second optimizes $p$:
$$L_p(\theta,p)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]+\lambda_t\Big[p(x_t)\pi_\theta(u_t|x_t)-p(x_t,u_t)\Big]+v_t\phi_t^p(\theta,p)$$
$$\begin{aligned} &\phi_t^\theta(\theta,p)=E_{p(x_t)}\big[D_{KL}(p(u_t|x_t)\|\pi_\theta(u_t|x_t))\big]\\ &\phi_t^p(\theta,p)=E_{p(x_t)}\big[D_{KL}(\pi_\theta(u_t|x_t)\|p(u_t|x_t))\big] \end{aligned}$$
Dropping the terms that do not depend on the variable being optimized gives the updates:
$$\begin{aligned} &\theta\leftarrow \arg\min_\theta\sum_{t=1}^T p(x_t)\pi_\theta(u_t|x_t)\lambda_t+v_t\phi_t^\theta(\theta,p)\\ &p\leftarrow\arg\min_p\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]-\lambda_t p(x_t,u_t)+v_t\phi_t^p(\theta,p)\\ &\lambda_t\leftarrow \lambda_t+\alpha v_t\big(\pi_\theta(u_t|x_t)p(x_t)-p(u_t|x_t)p(x_t)\big) \end{aligned}$$
where $v_t$ is set heuristically (see the appendix for the specific rule), or written in expectation form as in the paper.
Note that $l(x_t,u_t)=c(x_t,u_t)$ and $\lambda_t=\lambda_{x_t,u_t}$.
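For intuition, the dual update can be estimated from rollout samples by comparing the first moments of the two policies; a minimal sketch (my own illustration, with the state weighting folded into a sample average over states drawn from $p(x_t)$):

```python
import numpy as np

def dual_update(lam_t, policy_mean_fn, controller_mean_fn, states, alpha, v_t):
    """Hedged sketch of the lambda_t update: move the multiplier along the
    difference between the policy's and the controller's expected actions,
    averaged over sampled states."""
    pi_mean = np.mean([policy_mean_fn(x) for x in states], axis=0)      # E_pi[u_t]
    p_mean = np.mean([controller_mean_fn(x) for x in states], axis=0)   # E_p[u_t]
    return lam_t + alpha * v_t * (pi_mean - p_mean)
```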
2.2.5 Practical Considerations
With the objective and the overall optimization framework in place, we now turn to the computational issues that come up in the actual implementation.
First question: how do we represent all these time-indexed constraints $p(x_t)p(u_t|x_t)=p(x_t)\pi_\theta(u_t|x_t)$?
Answer: only approximately. The constraint is expressed via expected actions and simplified to the first moment, i.e.
$$E_{p(x_t)p(u_t|x_t)}[u_t]=E_{p(x_t)\pi_\theta(u_t|x_t)}[u_t]$$
and the optimization objective is adjusted accordingly:
a constraint that two distributions be equal is approximated by a constraint that their expected actions be equal (a gem of modeling, well worth learning from!).
The first moment here refers to the expected action at the current timestep; it could of course also involve $u_{t+1}$ at the next timestep (another open thread!).
Second question: in the second objective, the one over $p$, the trajectory distribution $p(\tau)$ is unknown. How should it be approximated?
Answer: with a mixture of Gaussians, using multiple $p_i(u_t|x_t)$ to approximate $p(u_t|x_t)$; each $p_i(u_t|x_t)$ can be learned via trajectory optimization. (What, a GMM? Would a VAE work? Another open question; not sure whether anyone has taken it up, so marking it here~)
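A minimal sketch of the mixture idea using scikit-learn (my own illustration on random placeholder data, fitting a GMM over pooled $(x_t,u_t)$ samples; conditioning each joint Gaussian component on $x_t$ yields a linear-Gaussian $p_i(u_t|x_t)$):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical rollout samples pooled over time: 500 samples,
# 10-dim state, 4-dim action (random placeholders for illustration).
X = np.random.randn(500, 10)
U = np.random.randn(500, 4)
XU = np.concatenate([X, U], axis=1)

# Fit a mixture of Gaussians over (x_t, u_t); each component plays the role
# of one local Gaussian p_i.
gmm = GaussianMixture(n_components=5, covariance_type='full').fit(XU)
print(gmm.means_.shape)   # (5, 14): one joint mean per mixture component
```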
Recall the overall workflow.
Write $L_\theta$ as:
$$\arg\min_\theta L_\theta=\arg\min_\theta\sum_{t=1}^T E_{p(x_t)\pi_\theta(u_t|x_t)}[u_t^T\lambda_{ut}]+v_t\phi_t^\theta(\theta,p)$$
and $L_p$ as:
$$\arg\min_p L_p=\arg\min_p \sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)-u_t^T\lambda_{ut}]+v_t\phi_t^p(p,\theta)$$
What remains is to learn $p_i(u_t|x_t)$ under unknown dynamics, and to spell out the concrete computation of $L_\theta$.
2.3 Trajectory Optimization under Unknown Dynamics
$p(\tau)$ is called the guiding distribution; the $p_i(\tau)$ are trajectory distributions for different initial states, and in the policy objective they guide the learning of $\pi_\theta(u_t|x_t)$:
$$p(\tau)=\sum_i p_i(\tau)=\sum_i\Big(p_i(x_1)\prod_{t=1}^T p(u_t|x_t)p(x_{t+1}|x_t,u_t)\Big)$$
Now model a single $p_i(\tau)$:
$$p_i(u_t|x_t)=N(K_t x_t+k_t,\,C_t)\qquad p_i(x_{t+1}|x_t,u_t)=N(f_{x_t}x_t+f_{u_t}u_t+f_{c_t},\,F_t)$$
With the linear-Gaussian form fixed, the next step depends on the setting: pick a method to learn the parameters. The problem here assumes stochastic, unknown dynamics, so traditional trajectory optimization has to fit a dynamics model, i.e. $p_i(x_{t+1}|x_t,u_t)$.
As covered in an earlier paper walkthrough, DDP-style traditional methods such as iLQG and iLQR can fit a local linear-Gaussian controller with poor generalization from a few expert samples, so when iteratively learning the dynamics model we have to stay "local" and not step too far.
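A minimal sketch (my own illustration) of fitting the local linear-Gaussian dynamics $p_i(x_{t+1}|x_t,u_t)=N(f_{x_t}x_t+f_{u_t}u_t+f_{c_t},F_t)$ at a single timestep by least squares on rollout samples:

```python
import numpy as np

def fit_linear_gaussian_dynamics(X_t, U_t, X_next):
    """Least-squares fit of x_{t+1} ~ N(f_x x_t + f_u u_t + f_c, F_t) at one timestep.
    X_t: (N, dx), U_t: (N, du), X_next: (N, dx) rollout samples."""
    N = X_t.shape[0]
    Z = np.concatenate([X_t, U_t, np.ones((N, 1))], axis=1)   # regressors [x; u; 1]
    # Solve Z @ W ~= X_next in the least-squares sense; W stacks [f_x; f_u; f_c].
    W, _, _, _ = np.linalg.lstsq(Z, X_next, rcond=None)
    residuals = X_next - Z @ W
    F_t = residuals.T @ residuals / N                         # process-noise covariance
    dx, du = X_t.shape[1], U_t.shape[1]
    f_x, f_u, f_c = W[:dx].T, W[dx:dx + du].T, W[-1]
    return f_x, f_u, f_c, F_t
```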
Define $\hat p(\tau)$ as the trajectory distribution from the previous iteration; the optimization problem is:
$$\min_p L_p(p,\theta)\quad \text{s.t.}\quad D_{KL}(p(\tau)\|\hat p(\tau))\leq \epsilon$$
This is referred to as the dynamics fitting procedure.
Only once we have a dynamics model can we optimize the trajectory distribution~
The form of $L_p(p,\theta)$:
$$L_p(p,\theta)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)-u_t^T\lambda_{ut}]+v_t\phi_t^p(\theta,p)$$
$$\phi_t^p(\theta,p)=E_{p(x_t)}\big[D_{KL}(\pi_\theta(u_t|x_t)\|p(u_t|x_t))\big]$$
$$\pi_\theta(u_t|x_t)=\int \pi_\theta(u_t|o_t)p(o_t|x_t)do_t$$
Having just fitted a dynamics model, we now need to compute the $D_{KL}(\pi_\theta(u_t|x_t)\|p(u_t|x_t))$ term in the objective. This involves $\pi_\theta(u_t|x_t)$ and hence $\pi_\theta(u_t|o_t)$; the KL computation uses a Fisher information matrix approximation, covered in a previous post and not repeated here. The training data used to estimate that Fisher information matrix is (recall the setting: $x_t$ and $o_t$ are both recorded):
$$\{x_t^i,\;E_{\pi_\theta(u_t|o_t^i)}[u_t]\}_{i=1}^N$$
With the KL now approximately computable and the dynamics model fitted, iLQG can directly handle the constraints and produce a $p_i(\tau)$.
Another route is a second-order approximation of the cost function with linear-Gaussian dynamics, running the trajectory-optimization procedure of the LQR framework.
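For reference, a compact sketch (my own, covering only the deterministic LQR core, ignoring the KL constraint and stochastic terms) of the backward pass that produces the time-varying feedback gains $K_t$ of the linear-Gaussian controller; the affine terms $k_t$ appear once linear cost terms and dynamics offsets are included:

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Q_T, horizon):
    """Finite-horizon LQR backward recursion (deterministic core only).
    Dynamics x_{t+1} = A x_t + B u_t, cost 0.5 x'Qx + 0.5 u'Ru, terminal 0.5 x'Q_T x.
    Returns gains K_t with u_t = K_t x_t."""
    V = Q_T                                   # value-function Hessian at the horizon
    gains = []
    for t in reversed(range(horizon)):
        Quu = R + B.T @ V @ B                 # action curvature of the Q-function
        Qux = B.T @ V @ A
        Qxx = Q + A.T @ V @ A
        K = -np.linalg.solve(Quu, Qux)        # K_t = -Quu^{-1} Qux
        V = Qxx + Qux.T @ K                   # Riccati update of the value Hessian
        gains.append(K)
    return list(reversed(gains))              # K_0 ... K_{T-1}

# usage (hypothetical 2-state, 1-action system):
# K = lqr_backward_pass(np.eye(2), np.array([[0.], [1.]]), np.eye(2), np.eye(1), np.eye(2), 10)
```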
We now have $p(\tau)$, i.e. the guiding distributions, completing the second step, the update over $p$:
$$\arg\min_p \sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)-u_t^T\lambda_{ut}]+v_t\phi_t^p(p,\theta)$$
2.4 Supervised Policy Optimization
Recall the modeling target $\pi_\theta(u_t|o_t)=N(\mu(o_t),\Sigma(o_t))$.
$$L_\theta(\theta,p)=\sum_{t=1}^T E_{p(x_t)\pi_\theta(u_t|x_t)}[u_t^T\lambda_{ut}]+v_t\phi_t^\theta(\theta,p)$$
$$\phi_t^\theta(\theta,p)=E_{p(x_t)}\big[D_{KL}(p(u_t|x_t)\|\pi_\theta(u_t|x_t))\big]$$
Substituting the trajectory-optimization result $p_i(u_t|x_t)=N(u_{ti}^p,C_{ti})$ and the modeling target $\pi_\theta(u_t|o_t)=N(\mu(o_t),\Sigma(o_t))$ into this, and approximating with $N$ samples drawn from the $(x_t,o_t,u_t)$ data, gives:
$$L_\theta(\theta,p)=\frac{1}{2N}\sum_{i=1}^N\sum_{t=1}^T E_{p_i(x_t,o_t)}\Big[\mathrm{tr}\big[C_{ti}^{-1}\Sigma^\pi(o_t)\big]-\log|\Sigma^\pi(o_t)|+(\mu^\pi(o_t)-\mu_{ti}^p(x_t))^T C_{ti}^{-1}(\mu^\pi(o_t)-\mu_{ti}^p(x_t))+2\lambda_{ut}^T\mu^\pi(o_t)\Big]$$
The formula looks intimidating, but in plain terms it minimizes the error between the output mean $\mu(o_t)$ of the modeling target $\pi_\theta(u_t|o_t)$ and the output mean $\mu_{ti}^p(x_t)$ of the guiding distributions, so that the policy follows the guiding distribution; the remaining terms are penalties that charge the policy, in proportion, for straying far from the trajectory distribution.
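A hedged PyTorch-style sketch of that supervised objective for one batch (my own simplification: diagonal $\Sigma^\pi$, the $\lambda$ term and constant offsets dropped), showing the trace, log-determinant, and mean-matching structure:

```python
import torch

def supervised_policy_loss(mu_pi, log_std_pi, mu_p, C_inv):
    """Sketch of the per-sample L_theta terms (diagonal Sigma_pi assumed;
    lambda term and constants omitted).
    mu_pi, log_std_pi: (N, du) policy mean / log-std from mu(o_t), Sigma(o_t)
    mu_p:              (N, du) controller means mu^p_ti(x_t)
    C_inv:             (N, du, du) inverse controller covariances C_ti^{-1}."""
    var_pi = (2.0 * log_std_pi).exp()                         # diagonal of Sigma^pi(o_t)
    diff = (mu_pi - mu_p).unsqueeze(-1)                       # (N, du, 1)
    # tr(C^-1 Sigma^pi) with diagonal Sigma^pi
    trace_term = (torch.diagonal(C_inv, dim1=-2, dim2=-1) * var_pi).sum(-1)
    # (mu^pi - mu^p)^T C^-1 (mu^pi - mu^p)
    quad_term = (diff.transpose(-1, -2) @ C_inv @ diff).squeeze(-1).squeeze(-1)
    log_det_term = -(2.0 * log_std_pi).sum(-1)                # -log|Sigma^pi|
    return 0.5 * (trace_term + quad_term + log_det_term).mean()
```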
3. Summary
The paper's problem setting starts from a classic RL objective:
$$E_{\pi_\theta(\tau)}\Big[\sum_{t=1}^T c(x_t,u_t)\Big]\qquad \pi_\theta(\tau)=p(x_1)\prod_{t=1}^T\pi_\theta(u_t|x_t)p(x_{t+1}|x_t,u_t)$$
The robot's state is known. Actor-critic style solutions need a huge number of samples, which a real robot cannot endure, so from the Guided Policy Search angle experts $p(u_t|x_t)$ are introduced, giving a set of per-timestep constraints:
$$\min_{p,\pi_\theta}E_{p(\tau)}[c(\tau)]\quad \text{s.t.}\quad p(x_t)p(u_t|x_t)=p(x_t)\pi_\theta(u_t|x_t)$$
A major selling point of the paper is control from the camera's high-dimensional, complex image observations rather than a state of a few dozen dimensions, hence
$$\pi_\theta(u_t|x_t)=\int \pi_\theta(u_t|o_t)p(o_t|x_t)do_t$$
This integral can be Monte Carlo estimated from the $(x_t,o_t,u_t)$ data.
The bundle of time-indexed constraints is represented by requiring equal expected actions in place of equal probabilities.
Using the BADMM optimization framework:
$$\min_{x,z}f(x)+g(z)\quad \text{s.t.}\quad Ax+Bz=C$$
The first Lagrangian optimizes $\theta$:
$$L_\theta(\theta,p)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]+\lambda_t\Big[p(x_t)\pi_\theta(u_t|x_t)-p(x_t,u_t)\Big]+v_t\phi_t^\theta(\theta,p)$$
The second optimizes $p$:
$$L_p(\theta,p)=\sum_{t=1}^T E_{p(x_t,u_t)}[c(x_t,u_t)]+\lambda_t\Big[p(x_t)\pi_\theta(u_t|x_t)-p(x_t,u_t)\Big]+v_t\phi_t^p(\theta,p)$$
$$\begin{aligned} &\phi_t^\theta(\theta,p)=E_{p(x_t)}\big[D_{KL}(p(u_t|x_t)\|\pi_\theta(u_t|x_t))\big]\\ &\phi_t^p(\theta,p)=E_{p(x_t)}\big[D_{KL}(\pi_\theta(u_t|x_t)\|p(u_t|x_t))\big] \end{aligned}$$
The KL divergences are then approximated, and trajectory optimization is iterated to obtain the guiding distributions.
The drawback is that the state must be fully known during training, even though control is ultimately performed from visual observations.
Worth learning: the way the whole problem is modeled with many constraints and the approximation methods used to optimize it.
One-sentence summary: with the state fully known and the dynamics unknown, Guided Policy Search extends the problem to visual control from high-dimensional, complex observations. Getting it to actually train also involves various pre-training stages and a complicated initialization.