【Optimal Control (CMU 16-745)】Lecture 8 Controllability and Dynamic Programming

Review:

  • LQR as a QP
  • Riccati recursion

Lecture 8 Controllability and Dynamic Programming

Overview

  • Infinite-horizon LQR
  • Controllability
  • Dynamic programming

1. Infinite-horizon LQR

  • For time-invariant LQR, the $\mathbf{K}$ matrices converge to constant values.
  • For stabilization problems we usually use a constant $\mathbf{K}$.
  • Backward recursion for $\mathbf{P}_k$:
    $$
    \begin{aligned}
    \mathbf{K}_k &= \left(\mathbf{R} + \mathbf{B}^\top \mathbf{P}_{k+1} \mathbf{B}\right)^{-1} \mathbf{B}^\top \mathbf{P}_{k+1} \mathbf{A}, \\
    \mathbf{P}_k &= \mathbf{Q} + \mathbf{A}^\top \mathbf{P}_{k+1} \left(\mathbf{A} - \mathbf{B}\mathbf{K}_k\right)
    \end{aligned}
    $$

There are two ways to get the infinite-horizon limit:
i. iterate the recursion until convergence (a fixed-point iteration);
ii. set $\mathbf{P}_\infty = \mathbf{P}_{k+1} = \mathbf{P}_k$ and solve the resulting discrete algebraic Riccati equation with Newton’s method.

Use a `dare` (discrete algebraic Riccati equation) solver in Julia/MATLAB/Python to compute $\mathbf{P}_\infty$.
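For example, here is a minimal Python sketch of both routes (the double-integrator $\mathbf{A}$, $\mathbf{B}$ and the weights $\mathbf{Q}$, $\mathbf{R}$ are assumptions for illustration; SciPy's `solve_discrete_are` plays the role of `dare`):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Assumed example system: discretized double integrator with quadratic costs
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
Q = np.eye(2)
R = np.array([[0.1]])

# (i) Fixed-point iteration: run the backward recursion until P stops changing
P = Q.copy()
for _ in range(1000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ (A - B @ K)
    if np.allclose(P_next, P, atol=1e-10):
        P = P_next
        break
    P = P_next

# (ii) Solve the discrete algebraic Riccati equation directly
P_inf = solve_discrete_are(A, B, Q, R)
K_inf = np.linalg.solve(R + B.T @ P_inf @ B, B.T @ P_inf @ A)

print(np.allclose(P, P_inf, atol=1e-6))  # True: both routes agree
```

The constant gain $\mathbf{K}_\infty$ then follows from $\mathbf{P}_\infty$ exactly as in the recursion above.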

2. Controllability

(1) Question: How do we know if LQR will work?

We already know $\mathbf{Q}\succeq 0$ and $\mathbf{R}\succ 0$. For the time-invariant case, there is a simple answer.
For any initial state $\mathbf{x}_0$, $\mathbf{x}_N$ is given by:
$$
\begin{aligned}
\mathbf{x}_N & = \mathbf{A}\mathbf{x}_{N-1} + \mathbf{B}\mathbf{u}_{N-1} \\
& = \mathbf{A}^2\mathbf{x}_{N-2} + \mathbf{A}\mathbf{B}\mathbf{u}_{N-2} + \mathbf{B}\mathbf{u}_{N-1} \\
& \;\;\vdots \\
& = \mathbf{A}^N\mathbf{x}_0 + \sum_{i=0}^{N-1} \mathbf{A}^{N-i-1}\mathbf{B}\mathbf{u}_i \\
& = \mathbf{A}^N\mathbf{x}_0 + \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{N-1}\mathbf{B} \end{bmatrix} \begin{bmatrix} \mathbf{u}_{N-1} \\ \mathbf{u}_{N-2} \\ \vdots \\ \mathbf{u}_0 \end{bmatrix} = \mathbf{0}
\end{aligned}
$$

We set $\mathbf{x}_N = \mathbf{0}$ because we want to drive the state to the origin.

Denote $\mathbf{C} = \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{N-1}\mathbf{B} \end{bmatrix}$.

This is equivalent to a least-squares problem for $\mathbf{u}_{0:N-1}$:
$$
\begin{bmatrix} \mathbf{u}_{N-1} \\ \mathbf{u}_{N-2} \\ \vdots \\ \mathbf{u}_0 \end{bmatrix} = \mathbf{C}^\top \left(\mathbf{C}\mathbf{C}^\top\right)^{-1} \left(\mathbf{x}_N - \mathbf{A}^N\mathbf{x}_0\right)
$$

For “short fat” matrices like $\mathbf{C}$ here (full row rank), $\mathbf{C}^\top \left(\mathbf{C}\mathbf{C}^\top\right)^{-1}$ is the pseudoinverse of $\mathbf{C}$ (a right inverse).
For “tall skinny” matrices (full column rank), $\left(\mathbf{C}^\top \mathbf{C}\right)^{-1} \mathbf{C}^\top$ is the pseudoinverse (a left inverse).
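As a quick numerical sanity check of the right-inverse formula (the random wide matrix below is just an assumed example):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 8))           # "short fat": full row rank almost surely

right_inv = C.T @ np.linalg.inv(C @ C.T)  # C^T (C C^T)^{-1}
print(np.allclose(right_inv, np.linalg.pinv(C)))  # True: matches the pseudoinverse
print(np.allclose(C @ right_inv, np.eye(3)))      # True: it is a right inverse
```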

For $\mathbf{C}\mathbf{C}^\top$ to be invertible, we need $\mathrm{rank}(\mathbf{C}) = n$, where $n = \dim(\mathbf{x})$.
We can stop at $N = n$ time steps when building $\mathbf{C}$ because the Cayley-Hamilton theorem says that $\mathbf{A}^n$ can be written as a linear combination of $\mathbf{I}, \mathbf{A}, \cdots, \mathbf{A}^{n-1}$. Namely,
$$
\mathbf{A}^n = \sum_{i=0}^{n-1} \alpha_i \mathbf{A}^i ,
$$
so although adding more time steps gives $\mathbf{C}\in\mathbb{R}^{n\times Nm}$ more columns (with $m = \dim(\mathbf{u})$), they cannot increase the rank of $\mathbf{C}$.
Thus we define the controllability matrix $\mathbf{C} = \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{n-1}\mathbf{B} \end{bmatrix}$. If $\mathrm{rank}(\mathbf{C}) = n$, the system is controllable.
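Putting these pieces together, here is a small sketch (reusing the assumed double-integrator $\mathbf{A}$, $\mathbf{B}$ from above): build $\mathbf{C}$, check its rank, and use the right pseudoinverse to find inputs that drive $\mathbf{x}_0$ to the origin in $N = n$ steps:

```python
import numpy as np

# Assumed double-integrator example
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
n = A.shape[0]

# Controllability matrix C = [B, AB, ..., A^{n-1} B]
C = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(n)])
print(np.linalg.matrix_rank(C) == n)  # True => controllable

# Minimum-norm inputs driving x0 to the origin in N = n steps,
# stacked as [u_{N-1}, ..., u_0] to match the derivation above
x0 = np.array([1.0, 0.0])
u = C.T @ np.linalg.solve(C @ C.T, -np.linalg.matrix_power(A, n) @ x0)

# Roll the dynamics forward, applying u_0 first
x = x0
for k in range(n):
    x = A @ x + B @ u[n - 1 - k : n - k]
print(np.allclose(x, 0.0))  # True: the state reaches the origin
```

The inputs come out large because we insist on reaching the origin in only $n = 2$ steps; LQR instead trades off state error against control effort over a longer horizon.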

One reason we usually don’t solve LQR with the shooting method is that powers like $\mathbf{A}^N$ blow up the condition number and make the optimization problem ill-conditioned.

3. Bellman’s Principle

  • Optimal control problems have an inherently sequential structure.
  • Past control inputs only affect future states; future control inputs cannot affect past states (causality).
  • Bellman’s Principle (“The principle of optimality”) states the consequence of this for optimal trajectories.
    (figure: an optimal trajectory from $x_0$ through $x_n$, with a candidate lower-cost “blue” path from $x_n$)

If the blue path had lower cost starting from $x_n$, I would have taken it starting from $x_0$.
$\Rightarrow$ Sub-trajectories of optimal trajectories have to be optimal for appropriately defined sub-problems.

4. Dynamic Programming

(1) Basic idea

Bellman’s principle suggests starting from the end of the trajectory and working backwards.
We’ve already seen hints of this from the Riccati equation and Pontryagin’s principle.

(2) Cost-to-go

Define the optimal cost-to-go (or value function) $V_k(\mathbf{x})$ as the cost incurred from state $\mathbf{x}$ at time $k$ if we act optimally.

(3) For LQR

$$
V_N\left(\mathbf{x}\right) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_N \mathbf{x} = \frac{1}{2} \mathbf{x}^\top \mathbf{P}_N \mathbf{x}
$$

Back up one step and calculate $V_{N-1}(\mathbf{x})$:
$$
\begin{aligned}
V_{N-1}\left(\mathbf{x}\right) & = \min_{\mathbf{u}} \left( \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_{N-1} \mathbf{x} + \frac{1}{2} \mathbf{u}^\top \mathbf{R}_{N-1} \mathbf{u} + V_N\left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) \right) \\
& = \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_{N-1} \mathbf{x} + \frac{1}{2}\min_{\mathbf{u}} \left( \mathbf{u}^\top \mathbf{R}_{N-1} \mathbf{u} + \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right)^\top \mathbf{P}_N \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) \right)
\end{aligned}
$$

Take the gradient with respect to $\mathbf{u}$ and set it to zero:
$$
\begin{aligned}
&\mathbf{R}_{N-1}\mathbf{u} + \mathbf{B}^\top \mathbf{P}_N \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) = \mathbf{0} \\
\Rightarrow\; &\mathbf{u}_{N-1} = -\left(\mathbf{R}_{N-1} + \mathbf{B}^\top \mathbf{P}_N \mathbf{B}\right)^{-1} \mathbf{B}^\top \mathbf{P}_N \mathbf{A}\,\mathbf{x} \triangleq -\mathbf{K}_{N-1}\,\mathbf{x}
\end{aligned}
$$

Plug $\mathbf{u}_{N-1}$ back into $V_{N-1}(\mathbf{x})$:
$$
\begin{aligned}
V_{N-1}\left(\mathbf{x}\right) & = \frac{1}{2} \mathbf{x}^\top \left[\mathbf{Q}_{N-1} + \mathbf{K}_{N-1}^\top \mathbf{R}_{N-1} \mathbf{K}_{N-1} + \left(\mathbf{A}-\mathbf{B}\mathbf{K}_{N-1}\right)^\top \mathbf{P}_N \left(\mathbf{A}-\mathbf{B}\mathbf{K}_{N-1}\right)\right] \mathbf{x} \\
& \triangleq \frac{1}{2} \mathbf{x}^\top \mathbf{P}_{N-1} \mathbf{x}
\end{aligned}
$$

Now we have a backward recursion for $\mathbf{K}$ and $\mathbf{P}$ that we iterate until $k = 0$.
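Here is a minimal sketch of this backward pass in Python (same assumed double integrator and weights as before; the horizon $N$ and terminal weight $\mathbf{Q}_N$ are also assumptions):

```python
import numpy as np

# Assumed example system and costs
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
Q = np.eye(2)
R = np.array([[0.1]])
QN = np.eye(2)          # terminal cost weight Q_N
N = 50                  # horizon length

P = QN.copy()           # V_N(x) = 1/2 x' P_N x
K_gains = [None] * N    # K_0, ..., K_{N-1}
for k in range(N - 1, -1, -1):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # Update from the derivation above: P_k = Q + K'RK + (A - BK)' P_{k+1} (A - BK)
    P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)
    K_gains[k] = K

# Forward rollout with the time-varying policy u_k = -K_k x_k
x = np.array([1.0, 0.0])
for k in range(N):
    x = A @ x + B @ (-K_gains[k] @ x)
```

For a long horizon the early gains $\mathbf{K}_k$ are essentially constant and match the infinite-horizon gain from Section 1.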

(4) General form of dynamic programming

$$
\begin{aligned}
& V_N(\mathbf{x}) \leftarrow \ell_N(\mathbf{x}) \\
& k \leftarrow N \\
& \text{while } k > 1 \\
& \qquad V_{k-1}(\mathbf{x}) = \min_{\mathbf{u}\in\mathcal{U}} \left( \ell_{k-1}(\mathbf{x}, \mathbf{u}) + V_k\left(\mathbf{f}_{k-1}(\mathbf{x}, \mathbf{u})\right) \right) \\
& \qquad k \leftarrow k-1 \\
& \text{end}
\end{aligned}
$$

If we know $V_{k+1}(\mathbf{x})$, the optimal policy at time $k$ is:
$$
\mathbf{u}_k(\mathbf{x}) = \arg\min_{\mathbf{u}\in\mathcal{U}} \left( \ell_k(\mathbf{x}, \mathbf{u}) + V_{k+1}\left(\mathbf{f}_k(\mathbf{x}, \mathbf{u})\right) \right)
$$

The DP equations can be written equivalently in terms of the action-value function, or Q-function:
$$
S_k(\mathbf{x}, \mathbf{u}) = \ell_k(\mathbf{x}, \mathbf{u}) + V_{k+1}\left(\mathbf{f}_k(\mathbf{x}, \mathbf{u})\right)
$$

$$
\Rightarrow \mathbf{u}_k(\mathbf{x}) = \arg\min_{\mathbf{u}\in\mathcal{U}} S_k(\mathbf{x}, \mathbf{u})
$$

  • This function is usually denoted $Q(\mathbf{x}, \mathbf{u})$, but we use $S$ to avoid confusion with the LQR weight $\mathbf{Q}$.
  • Working with $S$ avoids the need for the dynamics model $\mathbf{f}$ when extracting the policy.
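To make the general recursion and the Q-function concrete, here is a toy tabular sketch (the scalar dynamics, grids, and costs are all assumptions for illustration, not from the lecture): we tabulate $V_k$ on a state grid, evaluate $S_k$ over an input grid, and store the greedy policy.

```python
import numpy as np

# Toy problem (assumed): scalar dynamics f(x, u) = x + u,
# stage cost l(x, u) = x^2 + u^2, terminal cost l_N(x) = 10 x^2
xs = np.linspace(-2.0, 2.0, 41)    # state grid
us = np.linspace(-1.0, 1.0, 21)    # input grid
N = 10

def nearest(x):
    """Index of the grid point closest to x (crude discretization of f)."""
    return int(np.argmin(np.abs(xs - x)))

V = 10.0 * xs**2                   # V_N on the grid
policy = np.zeros((N, xs.size))    # tabulated u_k(x)

for k in range(N - 1, -1, -1):
    V_k = np.empty_like(V)
    for i, x in enumerate(xs):
        # S_k(x, u) = l(x, u) + V_{k+1}(f(x, u)) for each candidate input
        S = np.array([x**2 + u**2 + V[nearest(x + u)] for u in us])
        j = int(np.argmin(S))
        V_k[i] = S[j]              # V_k(x) = min_u S_k(x, u)
        policy[k, i] = us[j]       # u_k(x) = argmin_u S_k(x, u)
    V = V_k

# Greedy rollout from x_0 = 1.5 using the tabulated policy
x = 1.5
for k in range(N):
    x = x + policy[k, nearest(x)]
print(abs(x) < 0.1)  # True: the tabulated policy steers the state near the origin
```

Everything here grows exponentially with the state dimension, which is exactly the curse of dimensionality discussed next.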
(5) The curse of dimensionality

  • DP gives a sufficient condition for a global optimum.
  • It is only tractable for simple problems (LQR or low-dimensional systems).
  • $V(\mathbf{x})$ stays quadratic for LQR but becomes impossible to write down analytically for even simple nonlinear problems.
  • Even if we could write down $V(\mathbf{x})$, $\min_{\mathbf{u}} S(\mathbf{x}, \mathbf{u})$ will generally be non-convex and possibly hard to solve.
  • The cost of DP blows up with state dimension because of the difficulty of representing $V(\mathbf{x})$.
(6) Why do we care about DP?

  • Approximate DP, where a function approximator stands in for $V(\mathbf{x})$ or $S(\mathbf{x}, \mathbf{u})$, is very powerful.
  • Forms basis of modern reinforcement learning.
  • DP generalizes to stochastic problems (just wrap everything in expectation operators). Pontryagin’s principle does not.