CS229 Lecture 19



Debugging RL algorithm
Differential Dynamic Programming (DDP)
Kalman Filter
Linear Quadratic Gaussian (LQG)


Review

The expected total payoff to be maximized is:

$$\max \ \mathbb{E}\left[R^{(t)}(s_t,a_t)+\cdots+R^{(T)}(s_T,a_T)\right]$$

The dynamic programming procedure is:

  1. $V_T^{\star}(s_T)=\max_{a_T}R^{(T)}(s_T,a_T)$
  2. $V_{t}^{\star}(s_t)=\max_{a_t}R^{(t)}(s_t,a_t)+\sum_{s_{t+1}}P_{s_ta_t}(s_{t+1})V_{t+1}^{\star}(s_{t+1})$
     $\pi_{t}^{\star}(s_t)=\arg\max_{a_t}R^{(t)}(s_t,a_t)+\sum_{s_{t+1}}P_{s_ta_t}(s_{t+1})V_{t+1}^{\star}(s_{t+1})$
  3. $V_{T}^{\star}\rightarrow V_{T-1}^{\star}\rightarrow V_{T-2}^{\star}\rightarrow\cdots\rightarrow V_{0}^{\star}$
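
As a concrete illustration of this backward pass, here is a minimal numpy sketch for a small finite MDP. The shapes of the transition array `P` and the time-dependent reward array `R` are stated in the docstring; these names and shapes are illustrative assumptions, not something given in the lecture.

```python
import numpy as np

def finite_horizon_dp(P, R):
    """Backward dynamic programming for a finite-horizon MDP.

    P : array of shape (S, A, S), P[s, a, s'] = transition probability
    R : array of shape (T + 1, S, A), R[t, s, a] = reward at time t
    Returns optimal values V[t, s] and greedy policies pi[t, s].
    """
    T = R.shape[0] - 1
    S, A = R.shape[1], R.shape[2]
    V = np.zeros((T + 1, S))
    pi = np.zeros((T + 1, S), dtype=int)

    # Step 1: V_T(s) = max_a R^(T)(s, a)
    V[T] = R[T].max(axis=1)
    pi[T] = R[T].argmax(axis=1)

    # Steps 2-3: work backwards from t = T-1 down to 0
    for t in range(T - 1, -1, -1):
        Q = R[t] + P @ V[t + 1]   # Q[s, a] = R(s,a) + sum_s' P(s'|s,a) V_{t+1}(s')
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```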

In the LQR setting, $s_{t+1}=A_ts_t+B_ta_t+w_t$, where $s_t\in\mathbb{R}^n$, $a_t\in\mathbb{R}^d$, and $w_t\sim\mathcal{N}(0,\Sigma_t)$.

If the model is nonlinear, we can linearize it. For the approximation to be good, the current $s$ and $a$ must stay close to the point of linearization; otherwise the approximation is poor. Suppose $s_{t+1}=f(s_t,a_t)$; the linearization is:

$$s_{t+1}\approx f(\bar{s}_t,\bar{a}_t)+\left(\nabla_{s}f(\bar{s}_t,\bar{a}_t)\right)^T(s_t-\bar{s}_t)+\left(\nabla_{a}f(\bar{s}_t,\bar{a}_t)\right)^T(a_t-\bar{a}_t)$$

The expression above can then be rewritten in the simplified form $s_{t+1}=A_ts_t+B_ta_t$.
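
As a sketch of how $A_t$ and $B_t$ could be obtained from a black-box simulator $f$, one option is finite differences. The function `f`, the expansion point, and the step size `eps` below are illustrative assumptions:

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Finite-difference Jacobians of s_{t+1} = f(s, a) around (s_bar, a_bar)."""
    f0 = f(s_bar, a_bar)
    n, d, m = len(s_bar), len(a_bar), len(f0)
    A = np.zeros((m, n))
    B = np.zeros((m, d))
    for i in range(n):                      # column i of A: df/ds_i at (s_bar, a_bar)
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f0) / eps
    for j in range(d):                      # column j of B: df/da_j at (s_bar, a_bar)
        da = np.zeros(d); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f0) / eps
    return A, B
```

Near the expansion point $(\bar{s}_t,\bar{a}_t)$, the model then behaves like the linear system $s_{t+1}\approx A_ts_t+B_ta_t$ used above.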

The corresponding reward function is $R^{(t)}(s_t,a_t)=-s_t^TU_ts_t-a_t^TW_ta_t$, where $U_t$ and $W_t$ are positive semi-definite matrices, which means the reward is always non-positive.

DP (Dynamic Programming)

For each state $s_t$, the optimal value function is quadratic: $V_t^{\star}(s_t)=s_t^T\Phi_ts_t+\Psi_t$.

  1. Initialize $\Phi_T=-U_T$ and $\Psi_T=0$.
  2. Recursively compute $\Phi_t$ and $\Psi_t$ from $\Phi_{t+1}$ and $\Psi_{t+1}$, for $t=T-1,\cdots,0$.
  3. Compute $L_t$ from $\Phi_{t+1}$ and $\Psi_{t+1}$, which gives $\pi_t^{\star}(s_t)=L_t\,s_t$.

The quantities $\Phi_t$, $\Psi_t$, and $L_t$ mentioned above are given by:

$$\Phi_t:=A_t^T\left(\Phi_{t+1}-\Phi_{t+1}B_t\left(B_t^T\Phi_{t+1}B_t-W_t\right)^{-1}B_t^T\Phi_{t+1}\right)A_t-U_t$$

$$\Psi_t=\mathrm{tr}\left(\Sigma_t\Phi_{t+1}\right)+\Psi_{t+1}$$

$$L_t=-\left(B_t^T\Phi_{t+1}B_t-W_t\right)^{-1}B_t^T\Phi_{t+1}A_t$$

From these formulas we can see that, if we only want the optimal policy, we never actually need $\Psi_t$: recursively computing $\Phi_t$ is enough. Moreover, the recursion for $\Phi_t$ does not depend on $\Sigma_t$, so the model's noise does not affect the optimal policy, and we can ignore $\Sigma_t$ entirely when computing it. The value function, however, depends on $\Psi_t$, so $V^{\star}$ does depend on $\Sigma$: the larger the noise, the worse the value.
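
A minimal numpy sketch of this backward recursion, assuming for brevity that `A, B, U, W, Sigma` are time-invariant (all names are illustrative):

```python
import numpy as np

def lqr_backward(A, B, U, W, Sigma, T):
    """Finite-horizon LQR backward pass for reward -s'Us - a'Wa and
    dynamics s_{t+1} = A s_t + B a_t + w_t with w_t ~ N(0, Sigma).

    Returns gains L[t] (optimal action a_t = L[t] @ s_t) and the
    value-function parameters Phi[t], Psi[t]."""
    Phi = [None] * (T + 1)
    Psi = [0.0] * (T + 1)
    L = [None] * T

    Phi[T] = -U                                   # Phi_T = -U_T, Psi_T = 0
    for t in range(T - 1, -1, -1):
        M = np.linalg.inv(B.T @ Phi[t + 1] @ B - W)
        L[t] = -M @ B.T @ Phi[t + 1] @ A          # feedback gain
        Phi[t] = A.T @ (Phi[t + 1] - Phi[t + 1] @ B @ M @ B.T @ Phi[t + 1]) @ A - U
        Psi[t] = np.trace(Sigma @ Phi[t + 1]) + Psi[t + 1]
    return L, Phi, Psi
```

Note that `Sigma` enters only through `Psi`, which is exactly why the optimal gains `L` do not depend on the noise.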


Differential Dynamic Programming (DDP)

We have a nonlinear simulator whose model is deterministic, $s_{t+1}=f(s_t,a_t)$. The steps of differential dynamic programming are:

  1. Use a simple controller to roll out an approximation of the desired trajectory:
     $s_0^{\star},a_0^{\star}\longrightarrow s_1^{\star},a_1^{\star}\longrightarrow s_2^{\star},a_2^{\star}\longrightarrow\cdots$
  2. Linearize $f$ around the points $s^{\star}$ of this trajectory:
     $s_{t+1}\approx f(s_t^{\star},a_t^{\star})+\left(\nabla_{s}f(s_t^{\star},a_t^{\star})\right)^T(s_t-s_t^{\star})+\left(\nabla_{a}f(s_t^{\star},a_t^{\star})\right)^T(a_t-a_t^{\star})=A_ts_t+B_ta_t$
     Here we expect $(s_t,a_t)\approx(s_t^{\star},a_t^{\star})$.
  3. Run LQR to obtain $\pi^{\star}$.
  4. Use the simulator with the new optimal $\pi^{\star}$ to generate a new trajectory:
     $s_0^{\star},\pi_0^{\star}(s_0)\longrightarrow s_1^{\star},\pi_1^{\star}(s_1)\longrightarrow s_2^{\star},\pi_2^{\star}(s_2)\longrightarrow\cdots\longrightarrow s_T^{\star}$
     Note that these transitions use the simulator's true function $f$ rather than its linear approximation, i.e., $s_{t+1}^{\star}=f(s_t^{\star},a_t^{\star})$. After generating the new trajectory, return to step 2 and iterate until a stopping criterion is met.

[Figure: DDP trajectory refinement.] The black curve is the desired trajectory. The outermost pink curve is the initial rollout from the simulator; the green and red curves are successive iterations, which approximate the desired trajectory more and more closely.
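
Below is a compact sketch of this loop (often called iterative LQR in this simplified form), reusing numpy and the `linearize` helper sketched earlier. It follows the simplified presentation above, dropping the constant terms of the linearization and applying the policy as $a_t=L_ts_t$; the simulator `f`, the cost matrices `U, W`, and the iteration count are illustrative assumptions.

```python
import numpy as np

def ddp(f, s0, U, W, T, num_iters=10):
    """Iterate: roll out, linearize around the trajectory, run LQR, roll out again."""
    d = W.shape[0]
    # Step 1: initial rollout with a simple controller (here: zero actions).
    a_bar = [np.zeros(d) for _ in range(T)]
    s_bar = [s0]
    for t in range(T):
        s_bar.append(f(s_bar[t], a_bar[t]))

    for _ in range(num_iters):
        # Step 2: linearize f around the current nominal trajectory.
        AB = [linearize(f, s_bar[t], a_bar[t]) for t in range(T)]
        # Step 3: LQR backward pass on the time-varying linearization.
        Phi, L = -U, [None] * T
        for t in range(T - 1, -1, -1):
            A, B = AB[t]
            M = np.linalg.inv(B.T @ Phi @ B - W)
            L[t] = -M @ B.T @ Phi @ A
            Phi = A.T @ (Phi - Phi @ B @ M @ B.T @ Phi) @ A - U
        # Step 4: roll out the new policy through the true simulator f.
        s_bar, a_bar = [s0], []
        for t in range(T):
            a_bar.append(L[t] @ s_bar[t])
            s_bar.append(f(s_bar[t], a_bar[t]))
    return L, s_bar, a_bar
```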


Partially Observable MDPs.

In reality we can often observe only part of the state, not all of it. How do we solve an MDP in that case? The policies we have optimized so far assume that the full state is observable.

Suppose $s_{t+1}=As_t+w_t$, ignoring the action $a_t$ for now. Consider a helicopter flying in a 2-D plane, with state $s_t=\begin{bmatrix} x\\ \dot{x}\\ y\\ \dot{y}\end{bmatrix}$ and $A=\begin{bmatrix} 1 & 1 & 0 & 0\\ 0 & 0.9 & 0 & 0\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 0.9\end{bmatrix}$. Then:

$$x_{t+1}=x_t+\dot{x}_t+\text{noise}$$

$$\dot{x}_{t+1}=0.9\,\dot{x}_t+\text{noise}$$

Since in this setting we can only observe part of the state, we have
$$\begin{cases} y_t=Cs_t+v_t \\ s_{t+1}=As_t+w_t \end{cases}$$
where $v_t\sim\mathcal{N}(0,\Sigma_v)$. If $C=\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}$, then $Cs_t=\begin{bmatrix} x_t\\ y_t\end{bmatrix}$, i.e., only the position is observed.

Suppose a radar tracks this moving helicopter and observes a position $y_t$ at each time step; because of measurement error, the observed position differs from the true position. What we want is $P(s_t\mid y_0,y_1,\ldots,y_T)$. Since $s_0,s_1,\ldots,s_T,y_0,y_1,\ldots,y_T$ are jointly Gaussian,
$$\begin{bmatrix} s_0\\ \vdots\\ s_T\\ y_0\\ \vdots\\ y_T \end{bmatrix}\sim\mathcal{N}(\mu,\Sigma),$$
we could in principle compute $P(s_t\mid y_0,\ldots,y_T)$ from the standard formulas for marginal and conditional Gaussians. However, the joint covariance $\Sigma$ grows with the number of observations, and the cost of working with it quickly becomes prohibitive, so this approach is only feasible in theory. The Kalman filter introduced next avoids this computational cost.


Kalman Filter
The Kalman filter computes this mean and covariance with a constant amount of work per time step, iterating between a predict step and an update step. Suppose we know the distribution of $s_t\mid y_1,y_2,\ldots,y_t$. The process is:

$$(s_t\mid y_1,\ldots,y_t)\xrightarrow{\text{predict}}(s_{t+1}\mid y_1,\ldots,y_t)\xrightarrow{\text{update}}(s_{t+1}\mid y_1,\ldots,y_t,y_{t+1})\xrightarrow{\text{predict}}\cdots$$

In the predict step, assume $s_t\mid y_1,\ldots,y_t\sim\mathcal{N}(s_{t|t},\Sigma_{t|t})$. Then the next state satisfies $s_{t+1}\mid y_1,\ldots,y_t\sim\mathcal{N}(s_{t+1|t},\Sigma_{t+1|t})$, where
$$\begin{cases} s_{t+1|t}=A\,s_{t|t}\\ \Sigma_{t+1|t}=A\,\Sigma_{t|t}\,A^T+\Sigma_w \end{cases}$$

In the update step, we are given $s_{t+1|t}$ and $\Sigma_{t+1|t}$, so that $s_{t+1}\mid y_1,\ldots,y_t\sim\mathcal{N}(s_{t+1|t},\Sigma_{t+1|t})$. On this basis one can show that $s_{t+1}\mid y_1,\ldots,y_t,y_{t+1}\sim\mathcal{N}(s_{t+1|t+1},\Sigma_{t+1|t+1})$, where
$$\begin{cases} s_{t+1|t+1}=s_{t+1|t}+K_t\left(y_{t+1}-Cs_{t+1|t}\right)\\ \Sigma_{t+1|t+1}=\Sigma_{t+1|t}-K_tC\,\Sigma_{t+1|t} \end{cases}$$
with the Kalman gain $K_t:=\Sigma_{t+1|t}C^T\left(C\Sigma_{t+1|t}C^T+\Sigma_v\right)^{-1}$.

Here $s_t$ is the unknown true state and $y_t$ is the observation; $s_{t|t}$, $s_{t+1|t}$, and $\Sigma_{t+1|t}$ are all computed quantities, and $s_{t+1|t+1}$ is our best estimate of $s_{t+1}$.
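
A minimal numpy sketch of the two steps, following the equations above; `Sigma_w` and `Sigma_v` denote the process- and observation-noise covariances (the names are illustrative):

```python
import numpy as np

def kalman_predict(s_tt, Sigma_tt, A, Sigma_w):
    """Predict: p(s_{t+1} | y_1..y_t) from p(s_t | y_1..y_t)."""
    s_pred = A @ s_tt
    Sigma_pred = A @ Sigma_tt @ A.T + Sigma_w
    return s_pred, Sigma_pred

def kalman_update(s_pred, Sigma_pred, y_new, C, Sigma_v):
    """Update: fold in the new observation y_{t+1}."""
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_v)  # Kalman gain
    s_new = s_pred + K @ (y_new - C @ s_pred)
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return s_new, Sigma_new
```

For the 2-D helicopter example, `A` would be the $4\times4$ matrix above and `C = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])`; each predict/update call costs the same regardless of how many observations have already been processed.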


Linear Quadratic Gaussian (LQG)

Applying the Kalman filter together with LQR gives Linear Quadratic Gaussian control: the Kalman filter provides an algorithm for estimating the state, LQR provides a control algorithm for linear systems, and combining the two gives a solution for this special class of partially observable MDPs.

$$s_{t+1}=A_ts_t+B_ta_t+w_t,\qquad w_t\sim\mathcal{N}(0,\Sigma_w)$$

$$y_t=Cs_t+v_t,\qquad v_t\sim\mathcal{N}(0,\Sigma_v)$$

The overall algorithm is:

At each time step, use the Kalman filter to estimate the state:

Initialize $s_{0|0}=s_0$ and $\Sigma_{0|0}=0$ (assuming the initial state is known exactly), so that $s_0\sim\mathcal{N}(s_{0|0},\Sigma_{0|0})$.

Predict: $\begin{cases} s_{t+1|t}=A\,s_{t|t}+Ba_t\\ \Sigma_{t+1|t}=A\,\Sigma_{t|t}\,A^T+\Sigma_w \end{cases}$ (the update step is the same as in the Kalman filter above).

Use the LQR algorithm to compute $L_t$, and then take $a_t=L_t\,s_{t|t}$. LQR assumes that $s_t$ is observed; since the true state cannot be observed here, we instead use the best state estimate computed from the observations $y$ to choose $a_t$.
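
Putting the pieces together, here is a sketch of the LQG loop that reuses the `lqr_backward`, `kalman_predict`, and `kalman_update` sketches above; the predicted mean is shifted by $Ba_t$ as in the equations, and `observe` is a made-up placeholder for reading the next measurement $y_{t+1}$.

```python
import numpy as np

def lqg_control(A, B, C, U, W, Sigma_w, Sigma_v, s0, T, observe):
    """Run LQG: Kalman-filter state estimates plugged into the LQR policy."""
    L, _, _ = lqr_backward(A, B, U, W, Sigma_w, T)   # offline LQR gains
    s_est, Sigma_est = s0, np.zeros_like(Sigma_w)    # s_{0|0} = s_0, Sigma_{0|0} = 0
    actions = []
    for t in range(T):
        a_t = L[t] @ s_est                           # a_t = L_t s_{t|t}
        actions.append(a_t)
        # Predict, including the effect of the chosen action.
        s_pred, Sigma_pred = kalman_predict(s_est, Sigma_est, A, Sigma_w)
        s_pred = s_pred + B @ a_t
        # Update with the next observation y_{t+1} = C s_{t+1} + v_{t+1}.
        y_next = observe(t + 1)
        s_est, Sigma_est = kalman_update(s_pred, Sigma_pred, y_next, C, Sigma_v)
    return actions
```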


Debugging RL algorithm

Suppose that:

  1. The helicopter simulator is accurate.
  2. The RL algorithm correctly controls the helicopter (in simulation) so as to minimize $J(\theta)$.
  3. Minimizing $J(\theta)$ corresponds to correct autonomous flight.

Then: the learned parameters $\theta_{RL}$ should fly well on the actual helicopter.

Diagnostics:

  1. If $\theta_{RL}$ flies well in simulation, but not in real life, then the problem is in the simulator. Otherwise:
  2. Let $\theta_{human}$ be the human control policy. If $J(\theta_{human}) < J(\theta_{RL})$, then the problem is in the reinforcement learning algorithm. (Failing to minimize the cost function $J$.)
  3. If $J(\theta_{human}) > J(\theta_{RL})$, then the problem is in the cost function. (Minimizing it doesn't correspond to good autonomous flight.)

Note: copied directly from ML-advice.pdf.
