【Optimal Control (CMU 16-745)】Lecture 8 Controllability and Dynamic Programming

Review:

  • LQR as a QP
  • Riccati recursion

Lecture 8 Controllability and Dynamic Programming

Overview

  • Infinite-horizon LQR
  • Controllability
  • Dynamic programming

1. Infinite-horizon LQR

  • For time-invariant LQR, the $\mathbf{K}$ matrices converge to constant values.
  • For stabilization problems we usually use a constant $\mathbf{K}$.
  • Backward recursion for $\mathbf{P}_k$:
    $$
    \begin{aligned}
    \mathbf{K}_k &= \left(\mathbf{R} + \mathbf{B}^\top \mathbf{P}_{k+1} \mathbf{B}\right)^{-1} \mathbf{B}^\top \mathbf{P}_{k+1} \mathbf{A}, \\
    \mathbf{P}_k &= \mathbf{Q} + \mathbf{A}^\top \mathbf{P}_{k+1} \left(\mathbf{A} - \mathbf{B}\mathbf{K}_k\right)
    \end{aligned}
    $$

There are two ways to get the infinite-horizon limit:
i. iterate the recursion until convergence (a fixed-point iteration);
ii. set $\mathbf{P}_\infty = \mathbf{P}_{k+1} = \mathbf{P}_k$ and solve the resulting discrete algebraic Riccati equation with Newton’s method.

Use a `dare` (discrete algebraic Riccati equation) solver in Julia/MATLAB/Python to compute $\mathbf{P}_\infty$.
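For example, here is a minimal Python sketch of both routes (the double-integrator $\mathbf{A}$, $\mathbf{B}$ and the weights $\mathbf{Q}$, $\mathbf{R}$ are assumptions for illustration; SciPy's `solve_discrete_are` plays the role of `dare`):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Assumed example system: discretized double integrator with quadratic costs
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
Q = np.eye(2)
R = np.array([[0.1]])

# (i) Fixed-point iteration: run the backward recursion until P stops changing
P = Q.copy()
for _ in range(1000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ (A - B @ K)
    if np.allclose(P_next, P, atol=1e-10):
        P = P_next
        break
    P = P_next

# (ii) Solve the discrete algebraic Riccati equation directly
P_inf = solve_discrete_are(A, B, Q, R)
K_inf = np.linalg.solve(R + B.T @ P_inf @ B, B.T @ P_inf @ A)

print(np.allclose(P, P_inf, atol=1e-6))  # True: both routes agree
```

The constant gain $\mathbf{K}_\infty$ then follows from $\mathbf{P}_\infty$ exactly as in the recursion above.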

2. Controllability

(1) Question: How do we know if LQR will work?

We already know $\mathbf{Q}\succeq 0$ and $\mathbf{R}\succ 0$. For the time-invariant case, there is a simple answer.
For any initial state $\mathbf{x}_0$, $\mathbf{x}_N$ is given by:
$$
\begin{aligned}
\mathbf{x}_N & = \mathbf{A}\mathbf{x}_{N-1} + \mathbf{B}\mathbf{u}_{N-1} \\
& = \mathbf{A}^2\mathbf{x}_{N-2} + \mathbf{A}\mathbf{B}\mathbf{u}_{N-2} + \mathbf{B}\mathbf{u}_{N-1} \\
& \;\;\vdots \\
& = \mathbf{A}^N\mathbf{x}_0 + \sum_{i=0}^{N-1} \mathbf{A}^{N-i-1}\mathbf{B}\mathbf{u}_i \\
& = \mathbf{A}^N\mathbf{x}_0 + \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{N-1}\mathbf{B} \end{bmatrix} \begin{bmatrix} \mathbf{u}_{N-1} \\ \mathbf{u}_{N-2} \\ \vdots \\ \mathbf{u}_0 \end{bmatrix} = \mathbf{0}
\end{aligned}
$$

We set $\mathbf{x}_N = \mathbf{0}$ because we want to drive the state to the origin.

Denote $\mathbf{C} = \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{N-1}\mathbf{B} \end{bmatrix}$.

This is equivalent to a least-squares problem for $\mathbf{u}_{0:N-1}$:
$$
\begin{bmatrix} \mathbf{u}_{N-1} \\ \mathbf{u}_{N-2} \\ \vdots \\ \mathbf{u}_0 \end{bmatrix} = \mathbf{C}^\top \left(\mathbf{C}\mathbf{C}^\top\right)^{-1} \left(\mathbf{x}_N - \mathbf{A}^N\mathbf{x}_0\right)
$$

For “short fat” matrices like $\mathbf{C}$ here (full row rank), $\mathbf{C}^\top \left(\mathbf{C}\mathbf{C}^\top\right)^{-1}$ is the pseudoinverse of $\mathbf{C}$ (a right inverse).
For “tall skinny” matrices (full column rank), $\left(\mathbf{C}^\top \mathbf{C}\right)^{-1} \mathbf{C}^\top$ is the pseudoinverse (a left inverse).
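As a quick numerical sanity check of the right-inverse formula (the random wide matrix below is just an assumed example):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 8))           # "short fat": full row rank almost surely

right_inv = C.T @ np.linalg.inv(C @ C.T)  # C^T (C C^T)^{-1}
print(np.allclose(right_inv, np.linalg.pinv(C)))  # True: matches the pseudoinverse
print(np.allclose(C @ right_inv, np.eye(3)))      # True: it is a right inverse
```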

For $\mathbf{C}\mathbf{C}^\top$ to be invertible, we need $\mathrm{rank}(\mathbf{C}) = n$, where $n = \dim(\mathbf{x})$.
We can stop at $N = n$ time steps when building $\mathbf{C}$ because the Cayley-Hamilton theorem says that $\mathbf{A}^n$ can be written as a linear combination of $\mathbf{I}, \mathbf{A}, \cdots, \mathbf{A}^{n-1}$. Namely,
$$
\mathbf{A}^n = \sum_{i=0}^{n-1} \alpha_i \mathbf{A}^i ,
$$
so although adding more time steps gives $\mathbf{C}\in\mathbb{R}^{n\times Nm}$ more columns (with $m = \dim(\mathbf{u})$), they cannot increase the rank of $\mathbf{C}$.
Thus we define the controllability matrix $\mathbf{C} = \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{n-1}\mathbf{B} \end{bmatrix}$. If $\mathrm{rank}(\mathbf{C}) = n$, the system is controllable.
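Putting these pieces together, here is a small sketch (reusing the assumed double-integrator $\mathbf{A}$, $\mathbf{B}$ from above): build $\mathbf{C}$, check its rank, and use the right pseudoinverse to find inputs that drive $\mathbf{x}_0$ to the origin in $N = n$ steps:

```python
import numpy as np

# Assumed double-integrator example
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
n = A.shape[0]

# Controllability matrix C = [B, AB, ..., A^{n-1} B]
C = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(n)])
print(np.linalg.matrix_rank(C) == n)  # True => controllable

# Minimum-norm inputs driving x0 to the origin in N = n steps,
# stacked as [u_{N-1}, ..., u_0] to match the derivation above
x0 = np.array([1.0, 0.0])
u = C.T @ np.linalg.solve(C @ C.T, -np.linalg.matrix_power(A, n) @ x0)

# Roll the dynamics forward, applying u_0 first
x = x0
for k in range(n):
    x = A @ x + B @ u[n - 1 - k : n - k]
print(np.allclose(x, 0.0))  # True: the state reaches the origin
```

The inputs come out large because we insist on reaching the origin in only $n = 2$ steps; LQR instead trades off state error against control effort over a longer horizon.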

One reason we usually don’t solve LQR with the shooting method is that powers like $\mathbf{A}^N$ blow up the condition number and make the optimization problem ill-conditioned.

3. Bellman’s Principle

  • Optimal control problems have an inherently sequential structure.
  • Past control inputs only affect future states; future control inputs cannot affect past states (causality).
  • Bellman’s Principle (“The principle of optimality”) states the consequence of this for optimal trajectories.
    (figure: an optimal trajectory from $x_0$ through $x_n$, with a candidate lower-cost “blue” path from $x_n$)

If the blue path had lower cost starting from $x_n$, I would have taken it starting from $x_0$.
$\Rightarrow$ Sub-trajectories of optimal trajectories have to be optimal for appropriately defined sub-problems.

4. Dynamic Programming

(1) Basic idea

Bellman’s principle suggests starting from the end of the trajectory and working backwards.
We’ve already seen hints of this from the Riccati equation and Pontryagin’s principle.

(2) Cost-to-go

Define the optimal cost-to-go (or value function) $V_k(\mathbf{x})$ as the cost incurred from state $\mathbf{x}$ at time $k$ if we act optimally.

(3) For LQR

$$
V_N\left(\mathbf{x}\right) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_N \mathbf{x} = \frac{1}{2} \mathbf{x}^\top \mathbf{P}_N \mathbf{x}
$$

Back up one step and calculate $V_{N-1}(\mathbf{x})$:
$$
\begin{aligned}
V_{N-1}\left(\mathbf{x}\right) & = \min_{\mathbf{u}} \left( \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_{N-1} \mathbf{x} + \frac{1}{2} \mathbf{u}^\top \mathbf{R}_{N-1} \mathbf{u} + V_N\left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) \right) \\
& = \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_{N-1} \mathbf{x} + \frac{1}{2}\min_{\mathbf{u}} \left( \mathbf{u}^\top \mathbf{R}_{N-1} \mathbf{u} + \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right)^\top \mathbf{P}_N \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) \right)
\end{aligned}
$$

Take the gradient with respect to $\mathbf{u}$ and set it to zero:
$$
\begin{aligned}
&\mathbf{R}_{N-1}\mathbf{u} + \mathbf{B}^\top \mathbf{P}_N \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) = \mathbf{0} \\
\Rightarrow\; &\mathbf{u}_{N-1} = -\left(\mathbf{R}_{N-1} + \mathbf{B}^\top \mathbf{P}_N \mathbf{B}\right)^{-1} \mathbf{B}^\top \mathbf{P}_N \mathbf{A}\,\mathbf{x} \triangleq -\mathbf{K}_{N-1}\,\mathbf{x}
\end{aligned}
$$

Plug $\mathbf{u}_{N-1}$ back into $V_{N-1}(\mathbf{x})$:
$$
\begin{aligned}
V_{N-1}\left(\mathbf{x}\right) & = \frac{1}{2} \mathbf{x}^\top \left[\mathbf{Q}_{N-1} + \mathbf{K}_{N-1}^\top \mathbf{R}_{N-1} \mathbf{K}_{N-1} + \left(\mathbf{A}-\mathbf{B}\mathbf{K}_{N-1}\right)^\top \mathbf{P}_N \left(\mathbf{A}-\mathbf{B}\mathbf{K}_{N-1}\right)\right] \mathbf{x} \\
& \triangleq \frac{1}{2} \mathbf{x}^\top \mathbf{P}_{N-1} \mathbf{x}
\end{aligned}
$$

Now we have a backward recursion for $\mathbf{K}$ and $\mathbf{P}$ that we iterate until $k = 0$.
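Here is a minimal sketch of this backward pass in Python (same assumed double integrator and weights as before; the horizon $N$ and terminal weight $\mathbf{Q}_N$ are also assumptions):

```python
import numpy as np

# Assumed example system and costs
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
Q = np.eye(2)
R = np.array([[0.1]])
QN = np.eye(2)          # terminal cost weight Q_N
N = 50                  # horizon length

P = QN.copy()           # V_N(x) = 1/2 x' P_N x
K_gains = [None] * N    # K_0, ..., K_{N-1}
for k in range(N - 1, -1, -1):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # Update from the derivation above: P_k = Q + K'RK + (A - BK)' P_{k+1} (A - BK)
    P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)
    K_gains[k] = K

# Forward rollout with the time-varying policy u_k = -K_k x_k
x = np.array([1.0, 0.0])
for k in range(N):
    x = A @ x + B @ (-K_gains[k] @ x)
```

For a long horizon the early gains $\mathbf{K}_k$ are essentially constant and match the infinite-horizon gain from Section 1.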

(4) General form of dynamic programming

$$
\begin{aligned}
& V_N(\mathbf{x}) \leftarrow \ell_N(\mathbf{x}) \\
& k \leftarrow N \\
& \text{while } k > 1 \\
& \qquad V_{k-1}(\mathbf{x}) = \min_{\mathbf{u}\in\mathcal{U}} \left( \ell_{k-1}(\mathbf{x}, \mathbf{u}) + V_k\left(\mathbf{f}_{k-1}(\mathbf{x}, \mathbf{u})\right) \right) \\
& \qquad k \leftarrow k-1 \\
& \text{end}
\end{aligned}
$$

If we know $V_{k+1}(\mathbf{x})$, the optimal policy at time $k$ is:
$$
\mathbf{u}_k(\mathbf{x}) = \arg\min_{\mathbf{u}\in\mathcal{U}} \left( \ell_k(\mathbf{x}, \mathbf{u}) + V_{k+1}\left(\mathbf{f}_k(\mathbf{x}, \mathbf{u})\right) \right)
$$

The DP equations can be written equivalently in terms of the action-value function, or Q-function:
$$
S_k(\mathbf{x}, \mathbf{u}) = \ell_k(\mathbf{x}, \mathbf{u}) + V_{k+1}\left(\mathbf{f}_k(\mathbf{x}, \mathbf{u})\right)
$$

$$
\Rightarrow \mathbf{u}_k(\mathbf{x}) = \arg\min_{\mathbf{u}\in\mathcal{U}} S_k(\mathbf{x}, \mathbf{u})
$$

  • This function is usually denoted $Q(\mathbf{x}, \mathbf{u})$, but we use $S$ to avoid confusion with the LQR weight $\mathbf{Q}$.
  • Working with $S$ avoids the need for the dynamics model $\mathbf{f}$ when extracting the policy.
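To make the general recursion and the Q-function concrete, here is a toy tabular sketch (the scalar dynamics, grids, and costs are all assumptions for illustration, not from the lecture): we tabulate $V_k$ on a state grid, evaluate $S_k$ over an input grid, and store the greedy policy.

```python
import numpy as np

# Toy problem (assumed): scalar dynamics f(x, u) = x + u,
# stage cost l(x, u) = x^2 + u^2, terminal cost l_N(x) = 10 x^2
xs = np.linspace(-2.0, 2.0, 41)    # state grid
us = np.linspace(-1.0, 1.0, 21)    # input grid
N = 10

def nearest(x):
    """Index of the grid point closest to x (crude discretization of f)."""
    return int(np.argmin(np.abs(xs - x)))

V = 10.0 * xs**2                   # V_N on the grid
policy = np.zeros((N, xs.size))    # tabulated u_k(x)

for k in range(N - 1, -1, -1):
    V_k = np.empty_like(V)
    for i, x in enumerate(xs):
        # S_k(x, u) = l(x, u) + V_{k+1}(f(x, u)) for each candidate input
        S = np.array([x**2 + u**2 + V[nearest(x + u)] for u in us])
        j = int(np.argmin(S))
        V_k[i] = S[j]              # V_k(x) = min_u S_k(x, u)
        policy[k, i] = us[j]       # u_k(x) = argmin_u S_k(x, u)
    V = V_k

# Greedy rollout from x_0 = 1.5 using the tabulated policy
x = 1.5
for k in range(N):
    x = x + policy[k, nearest(x)]
print(abs(x) < 0.1)  # True: the tabulated policy steers the state near the origin
```

Everything here grows exponentially with the state dimension, which is exactly the curse of dimensionality discussed next.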
(5) The curse of dimensionality

  • DP gives a sufficient condition for a global optimum.
  • It is only tractable for simple problems (LQR or low-dimensional systems).
  • $V(\mathbf{x})$ stays quadratic for LQR but becomes impossible to write down analytically for even simple nonlinear problems.
  • Even if we could write down $V(\mathbf{x})$, $\min_{\mathbf{u}} S(\mathbf{x}, \mathbf{u})$ will generally be non-convex and possibly hard to solve.
  • The cost of DP blows up with state dimension because of the difficulty of representing $V(\mathbf{x})$.
(6) Why do we care about DP?

  • Approximate DP, where a function approximator stands in for $V(\mathbf{x})$ or $S(\mathbf{x}, \mathbf{u})$, is very powerful.
  • Forms basis of modern reinforcement learning.
  • DP generalizes to stochastic problems (just wrap everything in expectation operators). Pontryagin’s principle does not.