CS229 Lecture 19



Debugging RL algorithm
Differential Dynamic Programming (DDP)
Kalman Filter
Linear Quadratic Gaussian (LQG)


Review

The expected total payoff to be maximized is:

$$\max \ \mathbb{E}\left[R^{(t)}(s_t,a_t)+\cdots+R^{(T)}(s_T,a_T)\right]$$

The dynamic programming procedure is:

  1. $V_T^{\star}(s_T)=\max_{a_T}R^{(T)}(s_T,a_T)$
  2. $V_{t}^{\star}(s_t)=\max_{a_t}R^{(t)}(s_t,a_t)+\sum_{s_{t+1}}P_{s_ta_t}(s_{t+1})V_{t+1}^{\star}(s_{t+1})$
     $\pi_{t}^{\star}(s_t)=\arg\max_{a_t}R^{(t)}(s_t,a_t)+\sum_{s_{t+1}}P_{s_ta_t}(s_{t+1})V_{t+1}^{\star}(s_{t+1})$
  3. $V_{T}^{\star}\rightarrow V_{T-1}^{\star}\rightarrow V_{T-2}^{\star}\rightarrow\cdots\rightarrow V_{0}^{\star}$
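
As a concrete illustration of this backward pass, here is a minimal numpy sketch for a small finite MDP. The shapes of the transition array `P` and the time-dependent reward array `R` are stated in the docstring; these names and shapes are illustrative assumptions, not something given in the lecture.

```python
import numpy as np

def finite_horizon_dp(P, R):
    """Backward dynamic programming for a finite-horizon MDP.

    P : array of shape (S, A, S), P[s, a, s'] = transition probability
    R : array of shape (T + 1, S, A), R[t, s, a] = reward at time t
    Returns optimal values V[t, s] and greedy policies pi[t, s].
    """
    T = R.shape[0] - 1
    S, A = R.shape[1], R.shape[2]
    V = np.zeros((T + 1, S))
    pi = np.zeros((T + 1, S), dtype=int)

    # Step 1: V_T(s) = max_a R^(T)(s, a)
    V[T] = R[T].max(axis=1)
    pi[T] = R[T].argmax(axis=1)

    # Steps 2-3: work backwards from t = T-1 down to 0
    for t in range(T - 1, -1, -1):
        Q = R[t] + P @ V[t + 1]   # Q[s, a] = R(s,a) + sum_s' P(s'|s,a) V_{t+1}(s')
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```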

In the LQR setting, $s_{t+1}=A_ts_t+B_ta_t+w_t$, where $s_t\in\mathbb{R}^n$, $a_t\in\mathbb{R}^d$, and $w_t\sim\mathcal{N}(0,\Sigma_t)$.

If the model is nonlinear, we can linearize it. For the approximation to be good, the current $s$ and $a$ must stay close to the point of linearization; otherwise the approximation is poor. Suppose $s_{t+1}=f(s_t,a_t)$; the linearization is:

$$s_{t+1}\approx f(\bar{s}_t,\bar{a}_t)+\left(\nabla_{s}f(\bar{s}_t,\bar{a}_t)\right)^T(s_t-\bar{s}_t)+\left(\nabla_{a}f(\bar{s}_t,\bar{a}_t)\right)^T(a_t-\bar{a}_t)$$

The expression above can then be rewritten in the simplified form $s_{t+1}=A_ts_t+B_ta_t$.
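
As a sketch of how $A_t$ and $B_t$ could be obtained from a black-box simulator $f$, one option is finite differences. The function `f`, the expansion point, and the step size `eps` below are illustrative assumptions:

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Finite-difference Jacobians of s_{t+1} = f(s, a) around (s_bar, a_bar)."""
    f0 = f(s_bar, a_bar)
    n, d, m = len(s_bar), len(a_bar), len(f0)
    A = np.zeros((m, n))
    B = np.zeros((m, d))
    for i in range(n):                      # column i of A: df/ds_i at (s_bar, a_bar)
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f0) / eps
    for j in range(d):                      # column j of B: df/da_j at (s_bar, a_bar)
        da = np.zeros(d); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f0) / eps
    return A, B
```

Near the expansion point $(\bar{s}_t,\bar{a}_t)$, the model then behaves like the linear system $s_{t+1}\approx A_ts_t+B_ta_t$ used above.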

The corresponding reward function is $R^{(t)}(s_t,a_t)=-s_t^TU_ts_t-a_t^TW_ta_t$, where $U_t$ and $W_t$ are positive semi-definite matrices, which means the reward is always non-positive.

DP (Dynamic Programming)

For each state $s_t$, the optimal value function is quadratic: $V_t^{\star}(s_t)=s_t^T\Phi_ts_t+\Psi_t$.

  1. Initialize $\Phi_T=-U_T$ and $\Psi_T=0$.
  2. Recursively compute $\Phi_t$ and $\Psi_t$ from $\Phi_{t+1}$ and $\Psi_{t+1}$, for $t=T-1,\cdots,0$.
  3. Compute $L_t$ from $\Phi_{t+1}$ and $\Psi_{t+1}$, which gives $\pi_t^{\star}(s_t)=L_t\,s_t$.

The quantities $\Phi_t$, $\Psi_t$, and $L_t$ mentioned above are given by:

$$\Phi_t:=A_t^T\left(\Phi_{t+1}-\Phi_{t+1}B_t\left(B_t^T\Phi_{t+1}B_t-W_t\right)^{-1}B_t^T\Phi_{t+1}\right)A_t-U_t$$

$$\Psi_t=\mathrm{tr}\left(\Sigma_t\Phi_{t+1}\right)+\Psi_{t+1}$$

$$L_t=-\left(B_t^T\Phi_{t+1}B_t-W_t\right)^{-1}B_t^T\Phi_{t+1}A_t$$

From these formulas we can see that, if we only want the optimal policy, we never actually need $\Psi_t$: recursively computing $\Phi_t$ is enough. Moreover, the recursion for $\Phi_t$ does not depend on $\Sigma_t$, so the model's noise does not affect the optimal policy, and we can ignore $\Sigma_t$ entirely when computing it. The value function, however, depends on $\Psi_t$, so $V^{\star}$ does depend on $\Sigma$: the larger the noise, the worse the value.
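
A minimal numpy sketch of this backward recursion, assuming for brevity that `A, B, U, W, Sigma` are time-invariant (all names are illustrative):

```python
import numpy as np

def lqr_backward(A, B, U, W, Sigma, T):
    """Finite-horizon LQR backward pass for reward -s'Us - a'Wa and
    dynamics s_{t+1} = A s_t + B a_t + w_t with w_t ~ N(0, Sigma).

    Returns gains L[t] (optimal action a_t = L[t] @ s_t) and the
    value-function parameters Phi[t], Psi[t]."""
    Phi = [None] * (T + 1)
    Psi = [0.0] * (T + 1)
    L = [None] * T

    Phi[T] = -U                                   # Phi_T = -U_T, Psi_T = 0
    for t in range(T - 1, -1, -1):
        M = np.linalg.inv(B.T @ Phi[t + 1] @ B - W)
        L[t] = -M @ B.T @ Phi[t + 1] @ A          # feedback gain
        Phi[t] = A.T @ (Phi[t + 1] - Phi[t + 1] @ B @ M @ B.T @ Phi[t + 1]) @ A - U
        Psi[t] = np.trace(Sigma @ Phi[t + 1]) + Psi[t + 1]
    return L, Phi, Psi
```

Note that `Sigma` enters only through `Psi`, which is exactly why the optimal gains `L` do not depend on the noise.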


Differential Dynamic Programming (DDP)

We have a nonlinear simulator whose model is deterministic, $s_{t+1}=f(s_t,a_t)$. The steps of differential dynamic programming are:

  1. Use a simple controller to roll out an approximation of the desired trajectory:
     $s_0^{\star},a_0^{\star}\longrightarrow s_1^{\star},a_1^{\star}\longrightarrow s_2^{\star},a_2^{\star}\longrightarrow\cdots$
  2. Linearize $f$ around the points $s^{\star}$ of this trajectory:
     $s_{t+1}\approx f(s_t^{\star},a_t^{\star})+\left(\nabla_{s}f(s_t^{\star},a_t^{\star})\right)^T(s_t-s_t^{\star})+\left(\nabla_{a}f(s_t^{\star},a_t^{\star})\right)^T(a_t-a_t^{\star})=A_ts_t+B_ta_t$
     Here we expect $(s_t,a_t)\approx(s_t^{\star},a_t^{\star})$.
  3. Run LQR to obtain $\pi^{\star}$.
  4. Use the simulator with the new optimal $\pi^{\star}$ to generate a new trajectory:
     $s_0^{\star},\pi_0^{\star}(s_0)\longrightarrow s_1^{\star},\pi_1^{\star}(s_1)\longrightarrow s_2^{\star},\pi_2^{\star}(s_2)\longrightarrow\cdots\longrightarrow s_T^{\star}$
     Note that these transitions use the simulator's true function $f$ rather than its linear approximation, i.e., $s_{t+1}^{\star}=f(s_t^{\star},a_t^{\star})$. After generating the new trajectory, return to step 2 and iterate until a stopping criterion is met.

[Figure: DDP trajectory refinement.] The black curve is the desired trajectory. The outermost pink curve is the initial rollout from the simulator; the green and red curves are successive iterations, which approximate the desired trajectory more and more closely.
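
Below is a compact sketch of this loop (often called iterative LQR in this simplified form), reusing numpy and the `linearize` helper sketched earlier. It follows the simplified presentation above, dropping the constant terms of the linearization and applying the policy as $a_t=L_ts_t$; the simulator `f`, the cost matrices `U, W`, and the iteration count are illustrative assumptions.

```python
import numpy as np

def ddp(f, s0, U, W, T, num_iters=10):
    """Iterate: roll out, linearize around the trajectory, run LQR, roll out again."""
    d = W.shape[0]
    # Step 1: initial rollout with a simple controller (here: zero actions).
    a_bar = [np.zeros(d) for _ in range(T)]
    s_bar = [s0]
    for t in range(T):
        s_bar.append(f(s_bar[t], a_bar[t]))

    for _ in range(num_iters):
        # Step 2: linearize f around the current nominal trajectory.
        AB = [linearize(f, s_bar[t], a_bar[t]) for t in range(T)]
        # Step 3: LQR backward pass on the time-varying linearization.
        Phi, L = -U, [None] * T
        for t in range(T - 1, -1, -1):
            A, B = AB[t]
            M = np.linalg.inv(B.T @ Phi @ B - W)
            L[t] = -M @ B.T @ Phi @ A
            Phi = A.T @ (Phi - Phi @ B @ M @ B.T @ Phi) @ A - U
        # Step 4: roll out the new policy through the true simulator f.
        s_bar, a_bar = [s0], []
        for t in range(T):
            a_bar.append(L[t] @ s_bar[t])
            s_bar.append(f(s_bar[t], a_bar[t]))
    return L, s_bar, a_bar
```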


Partially Observable MDPs.

In reality we can often observe only part of the state, not all of it. How do we solve an MDP in that case? The policies we have optimized so far assume that the full state is observable.

Suppose $s_{t+1}=As_t+w_t$, ignoring the action $a_t$ for now. Consider a helicopter flying in a 2-D plane, with state $s_t=\begin{bmatrix} x\\ \dot{x}\\ y\\ \dot{y}\end{bmatrix}$ and $A=\begin{bmatrix} 1 & 1 & 0 & 0\\ 0 & 0.9 & 0 & 0\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 0.9\end{bmatrix}$. Then:

$$x_{t+1}=x_t+\dot{x}_t+\text{noise}$$

$$\dot{x}_{t+1}=0.9\,\dot{x}_t+\text{noise}$$

Since in this setting we can only observe part of the state, we have
$$\begin{cases} y_t=Cs_t+v_t \\ s_{t+1}=As_t+w_t \end{cases}$$
where $v_t\sim\mathcal{N}(0,\Sigma_v)$. If $C=\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0\end{bmatrix}$, then $Cs_t=\begin{bmatrix} x_t\\ y_t\end{bmatrix}$, i.e., only the position is observed.

Suppose a radar tracks this moving helicopter and observes a position $y_t$ at each time step; because of measurement error, the observed position differs from the true position. What we want is $P(s_t\mid y_0,y_1,\ldots,y_T)$. Since $s_0,s_1,\ldots,s_T,y_0,y_1,\ldots,y_T$ are jointly Gaussian,
$$\begin{bmatrix} s_0\\ \vdots\\ s_T\\ y_0\\ \vdots\\ y_T \end{bmatrix}\sim\mathcal{N}(\mu,\Sigma),$$
we could in principle compute $P(s_t\mid y_0,\ldots,y_T)$ from the standard formulas for marginal and conditional Gaussians. However, the joint covariance $\Sigma$ grows with the number of observations, and the cost of working with it quickly becomes prohibitive, so this approach is only feasible in theory. The Kalman filter introduced next avoids this computational cost.


Kalman Filter
The Kalman filter computes this mean and covariance with a constant amount of work per time step, iterating between a predict step and an update step. Suppose we know the distribution of $s_t\mid y_1,y_2,\ldots,y_t$. The process is:

$$(s_t\mid y_1,\ldots,y_t)\xrightarrow{\text{predict}}(s_{t+1}\mid y_1,\ldots,y_t)\xrightarrow{\text{update}}(s_{t+1}\mid y_1,\ldots,y_t,y_{t+1})\xrightarrow{\text{predict}}\cdots$$

In the predict step, assume $s_t\mid y_1,\ldots,y_t\sim\mathcal{N}(s_{t|t},\Sigma_{t|t})$. Then the next state satisfies $s_{t+1}\mid y_1,\ldots,y_t\sim\mathcal{N}(s_{t+1|t},\Sigma_{t+1|t})$, where
$$\begin{cases} s_{t+1|t}=A\,s_{t|t}\\ \Sigma_{t+1|t}=A\,\Sigma_{t|t}\,A^T+\Sigma_w \end{cases}$$

In the update step, we are given $s_{t+1|t}$ and $\Sigma_{t+1|t}$, so that $s_{t+1}\mid y_1,\ldots,y_t\sim\mathcal{N}(s_{t+1|t},\Sigma_{t+1|t})$. On this basis one can show that $s_{t+1}\mid y_1,\ldots,y_t,y_{t+1}\sim\mathcal{N}(s_{t+1|t+1},\Sigma_{t+1|t+1})$, where
$$\begin{cases} s_{t+1|t+1}=s_{t+1|t}+K_t\left(y_{t+1}-Cs_{t+1|t}\right)\\ \Sigma_{t+1|t+1}=\Sigma_{t+1|t}-K_tC\,\Sigma_{t+1|t} \end{cases}$$
with the Kalman gain $K_t:=\Sigma_{t+1|t}C^T\left(C\Sigma_{t+1|t}C^T+\Sigma_v\right)^{-1}$.

Here $s_t$ is the unknown true state and $y_t$ is the observation; $s_{t|t}$, $s_{t+1|t}$, and $\Sigma_{t+1|t}$ are all computed quantities, and $s_{t+1|t+1}$ is our best estimate of $s_{t+1}$.
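
A minimal numpy sketch of the two steps, following the equations above; `Sigma_w` and `Sigma_v` denote the process- and observation-noise covariances (the names are illustrative):

```python
import numpy as np

def kalman_predict(s_tt, Sigma_tt, A, Sigma_w):
    """Predict: p(s_{t+1} | y_1..y_t) from p(s_t | y_1..y_t)."""
    s_pred = A @ s_tt
    Sigma_pred = A @ Sigma_tt @ A.T + Sigma_w
    return s_pred, Sigma_pred

def kalman_update(s_pred, Sigma_pred, y_new, C, Sigma_v):
    """Update: fold in the new observation y_{t+1}."""
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_v)  # Kalman gain
    s_new = s_pred + K @ (y_new - C @ s_pred)
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return s_new, Sigma_new
```

For the 2-D helicopter example, `A` would be the $4\times4$ matrix above and `C = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])`; each predict/update call costs the same regardless of how many observations have already been processed.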


Linear Quadratic Gaussian (LQG)

Applying the Kalman filter together with LQR gives Linear Quadratic Gaussian control: the Kalman filter provides an algorithm for estimating the state, LQR provides a control algorithm for linear systems, and combining the two gives a solution for this special class of partially observable MDPs.

$$s_{t+1}=A_ts_t+B_ta_t+w_t,\qquad w_t\sim\mathcal{N}(0,\Sigma_w)$$

$$y_t=Cs_t+v_t,\qquad v_t\sim\mathcal{N}(0,\Sigma_v)$$

The overall algorithm is:

At each time step, use the Kalman filter to estimate the state:

Initialize $s_{0|0}=s_0$ and $\Sigma_{0|0}=0$ (assuming the initial state is known exactly), so that $s_0\sim\mathcal{N}(s_{0|0},\Sigma_{0|0})$.

Predict: $\begin{cases} s_{t+1|t}=A\,s_{t|t}+Ba_t\\ \Sigma_{t+1|t}=A\,\Sigma_{t|t}\,A^T+\Sigma_w \end{cases}$ (the update step is the same as in the Kalman filter above).

Use the LQR algorithm to compute $L_t$, and then take $a_t=L_t\,s_{t|t}$. LQR assumes that $s_t$ is observed; since the true state cannot be observed here, we instead use the best state estimate computed from the observations $y$ to choose $a_t$.
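
Putting the pieces together, here is a sketch of the LQG loop that reuses the `lqr_backward`, `kalman_predict`, and `kalman_update` sketches above; the predicted mean is shifted by $Ba_t$ as in the equations, and `observe` is a made-up placeholder for reading the next measurement $y_{t+1}$.

```python
import numpy as np

def lqg_control(A, B, C, U, W, Sigma_w, Sigma_v, s0, T, observe):
    """Run LQG: Kalman-filter state estimates plugged into the LQR policy."""
    L, _, _ = lqr_backward(A, B, U, W, Sigma_w, T)   # offline LQR gains
    s_est, Sigma_est = s0, np.zeros_like(Sigma_w)    # s_{0|0} = s_0, Sigma_{0|0} = 0
    actions = []
    for t in range(T):
        a_t = L[t] @ s_est                           # a_t = L_t s_{t|t}
        actions.append(a_t)
        # Predict, including the effect of the chosen action.
        s_pred, Sigma_pred = kalman_predict(s_est, Sigma_est, A, Sigma_w)
        s_pred = s_pred + B @ a_t
        # Update with the next observation y_{t+1} = C s_{t+1} + v_{t+1}.
        y_next = observe(t + 1)
        s_est, Sigma_est = kalman_update(s_pred, Sigma_pred, y_next, C, Sigma_v)
    return actions
```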


Debugging RL algorithm

Suppose that:

  1. The helicopter simulator is accurate.
  2. The RL algorithm correctly controls the helicopter (in simulation) so as to minimize $J(\theta)$.
  3. Minimizing $J(\theta)$ corresponds to correct autonomous flight.

Then: the learned parameters $\theta_{RL}$ should fly well on the actual helicopter.

Diagnostics:

  1. If $\theta_{RL}$ flies well in simulation, but not in real life, then the problem is in the simulator. Otherwise:
  2. Let $\theta_{human}$ be the human control policy. If $J(\theta_{human}) < J(\theta_{RL})$, then the problem is in the reinforcement learning algorithm. (Failing to minimize the cost function $J$.)
  3. If $J(\theta_{human}) > J(\theta_{RL})$, then the problem is in the cost function. (Minimizing it doesn't correspond to good autonomous flight.)

Note: copied directly from ML-advice.pdf.
