Review:
- LQR as a QP
- Riccati recursion
Lecture 8: Controllability and Dynamic Programming
Overview
- Infinite-horizon LQR
- Controllability
- Dynamic programming
1. Infinite-horizon LQR
- For time-invariant LQR, the $\mathbf{K}$ matrices converge to constant values.
- For stabilization problems we usually use a constant $\mathbf{K}$.
- Backward recursion for $\mathbf{P}_k$:
$$
\begin{aligned}
\mathbf{K}_k &= \left(\mathbf{R} + \mathbf{B}^\top \mathbf{P}_{k+1} \mathbf{B}\right)^{-1} \mathbf{B}^\top \mathbf{P}_{k+1} \mathbf{A}, \\
\mathbf{P}_k &= \mathbf{Q} + \mathbf{A}^\top \mathbf{P}_{k+1} \left(\mathbf{A} - \mathbf{B}\mathbf{K}_k\right)
\end{aligned}
$$
There are two ways to get the infinite-horizon limit:
i. iterate the recursion until convergence (a fixed-point iteration);
ii. set $\mathbf{P}_\infty = \mathbf{P}_{k+1} = \mathbf{P}_k$ and use Newton's method to solve the resulting equation, known as the discrete algebraic Riccati equation (DARE).
Use `dare` in Julia/MATLAB (`solve_discrete_are` in SciPy for Python) to solve the Riccati equation.
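To make this concrete, here is a minimal Python/NumPy sketch of both routes. The discretized double integrator and the weights $\mathbf{Q}$, $\mathbf{R}$ below are illustrative assumptions, not from the lecture; SciPy's `solve_discrete_are` plays the role of `dare`.

```python
import numpy as np
from scipy.linalg import solve_discrete_are  # SciPy's DARE solver

# Toy double integrator discretized with step h (illustrative assumption).
h = 0.1
A = np.array([[1.0, h],
              [0.0, 1.0]])
B = np.array([[0.5 * h**2],
              [h]])
Q = np.eye(2)          # state weight (assumed)
R = np.array([[0.1]])  # control weight (assumed)

# Method (i): fixed-point iteration of the Riccati recursion.
P = Q.copy()
for _ in range(10_000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_new = Q + A.T @ P @ (A - B @ K)
    if np.max(np.abs(P_new - P)) < 1e-10:
        P = P_new
        break
    P = P_new

# Method (ii): solve the DARE directly.
P_inf = solve_discrete_are(A, B, Q, R)
assert np.allclose(P, P_inf, atol=1e-6)

K_inf = np.linalg.solve(R + B.T @ P_inf @ B, B.T @ P_inf @ A)
print("constant gain K_inf =", K_inf)
```

Both routes agree on $\mathbf{P}_\infty$, and the constant gain follows from the $\mathbf{K}_k$ formula above.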
2. Controllability
(1) Question: How do we know if LQR will work?
We already know $\mathbf{Q}\succeq 0$ and $\mathbf{R}\succ 0$. For the time-invariant case, there is a simple answer.
For any initial state $\mathbf{x}_0$, $\mathbf{x}_N$ is given by:
$$
\begin{aligned}
\mathbf{x}_N & = \mathbf{A}\mathbf{x}_{N-1} + \mathbf{B}\mathbf{u}_{N-1} \\
& = \mathbf{A}^2\mathbf{x}_{N-2} + \mathbf{A}\mathbf{B}\mathbf{u}_{N-2} + \mathbf{B}\mathbf{u}_{N-1} \\
& \;\;\vdots \\
& = \mathbf{A}^N\mathbf{x}_0 + \sum_{i=0}^{N-1} \mathbf{A}^{N-i-1}\mathbf{B}\mathbf{u}_i \\
& = \mathbf{A}^N\mathbf{x}_0 + \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{N-1}\mathbf{B} \end{bmatrix} \begin{bmatrix} \mathbf{u}_{N-1} \\ \mathbf{u}_{N-2} \\ \vdots \\ \mathbf{u}_0 \end{bmatrix} \\
& = \mathbf{0}
\end{aligned}
$$
The final state is set to zero because we want to drive the state to the origin.
Denote $\mathbf{C} = \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{N-1}\mathbf{B} \end{bmatrix}$.
This is equivalent to a least-squares problem for $\mathbf{u}_{0:N-1}$:
$$
\begin{bmatrix} \mathbf{u}_{N-1} \\ \mathbf{u}_{N-2} \\ \vdots \\ \mathbf{u}_0 \end{bmatrix} = \mathbf{C}^\top \left(\mathbf{C}\mathbf{C}^\top\right)^{-1} \left(\mathbf{x}_N - \mathbf{A}^N\mathbf{x}_0\right)
$$
For “short fat” matrices (full row rank), $\mathbf{C}^\top \left(\mathbf{C}\mathbf{C}^\top\right)^{-1}$ is the pseudo-inverse of $\mathbf{C}$ (a right inverse).
For “tall skinny” matrices (full column rank), $\left(\mathbf{C}^\top \mathbf{C}\right)^{-1} \mathbf{C}^\top$ is the pseudo-inverse of $\mathbf{C}$ (a left inverse).
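As a sanity check, here is a sketch of the minimum-norm input sequence on an assumed double integrator (all matrices below are illustrative). It applies the right-inverse formula above and verifies by rollout that the state lands at the origin.

```python
import numpy as np

# Illustrative double integrator (same assumed system as above).
h, N = 0.1, 10
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
n = A.shape[0]

# C = [B, AB, ..., A^{N-1}B], multiplying the stacked [u_{N-1}; ...; u_0].
C = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(N)])

x0 = np.array([1.0, 0.0])
xN = np.zeros(n)  # goal: drive the state to the origin

# Minimum-norm solution via the right inverse C^T (C C^T)^{-1}.
u_stack = C.T @ np.linalg.solve(C @ C.T, xN - np.linalg.matrix_power(A, N) @ x0)

# Verify by rolling the dynamics forward (u_stack is ordered u_{N-1} ... u_0).
u = u_stack[::-1]  # reorder so u[k] = u_k (scalar input here)
x = x0.copy()
for k in range(N):
    x = A @ x + B.flatten() * u[k]
print("x_N =", x)  # ~ [0, 0]
```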
For $\mathbf{C}\mathbf{C}^\top$ to be invertible, we need $\operatorname{rank}(\mathbf{C}) = n$, where $n = \dim(\mathbf{x})$.
We can stop at $N = n$ time steps in $\mathbf{C}$ because the Cayley–Hamilton theorem says that $\mathbf{A}^n$ can be written as a linear combination of $\mathbf{I}, \mathbf{A}, \cdots, \mathbf{A}^{n-1}$. Namely,
$$
\mathbf{A}^n = \sum_{i=0}^{n-1} \alpha_i \mathbf{A}^i.
$$
Although $\mathbf{C}\in\mathbb{R}^{n\times mN}$ (where $m = \dim(\mathbf{u})$) keeps growing with $N$, adding more time steps/columns to $\mathbf{C}$ beyond $N = n$ cannot increase the rank of $\mathbf{C}$.
Thus we define
$$
\mathbf{C} = \begin{bmatrix} \mathbf{B} & \mathbf{A}\mathbf{B} & \cdots & \mathbf{A}^{n-1}\mathbf{B} \end{bmatrix}
$$
as the controllability matrix. If $\operatorname{rank}(\mathbf{C}) = n$, then the system is controllable.
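A minimal sketch of the rank test, again on an assumed toy system; the loop at the end shows the Cayley–Hamilton consequence that the rank saturates once $N = n$.

```python
import numpy as np

def controllability_matrix(A, B, N=None):
    """Stack [B, AB, ..., A^{N-1}B]; N defaults to n = dim(x)."""
    n = A.shape[0]
    N = n if N is None else N
    return np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(N)])

# Same assumed double integrator as above.
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
n = A.shape[0]

C = controllability_matrix(A, B)
print("controllable:", np.linalg.matrix_rank(C) == n)  # True

# Cayley-Hamilton in action: columns beyond N = n never raise the rank.
for N in range(1, 2 * n + 1):
    print(N, np.linalg.matrix_rank(controllability_matrix(A, B, N)))
```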
We usually don't solve LQR with the shooting method because $\mathbf{A}^N$ enlarges the condition number and makes the optimization problem ill-conditioned.
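A quick numeric illustration of this claim, with an assumed (slightly unstable) $\mathbf{A}$:

```python
import numpy as np

# Assumed unstable A: powers of A quickly become badly conditioned, and any
# condensed/shooting formulation built from A^N inherits this.
A = np.array([[1.2, 0.1],
              [0.0, 1.1]])
for N in (1, 10, 50):
    print(N, np.linalg.cond(np.linalg.matrix_power(A, N)))
# The condition number grows rapidly with N.
```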
3. Bellman’s Principle
- Optimal control problems have an inherently sequential structure.
- Past control inputs only affect future states; future control inputs cannot affect past states (causality).
- Bellman’s Principle (“The principle of optimality”) states the consequence of this for optimal trajectories.
If a different path had lower cost starting from $\mathbf{x}_n$, we would have taken it starting from $\mathbf{x}_0$.
$\Rightarrow$ Sub-trajectories of optimal trajectories have to be optimal for appropriately defined sub-problems.
4. Dynamic Programming
(1) Basic idea
Bellman's principle suggests starting from the end of the trajectory and working backwards.
We’ve already seen hints of this from the Riccati equation and Pontryagin’s principle.
(2) Cost-to-go
Define the optimal cost-to-go (or value function) $V_k(\mathbf{x})$ as the cost incurred from state $\mathbf{x}$ at time $k$ if we act optimally.
(3) For LQR
$$
V_N\left(\mathbf{x}\right) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_N \mathbf{x} = \frac{1}{2} \mathbf{x}^\top \mathbf{P}_N \mathbf{x}
$$
Back up one step and calculate $V_{N-1}(\mathbf{x})$:
$$
\begin{aligned}
V_{N-1}\left(\mathbf{x}\right) & = \min_{\mathbf{u}} \left( \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_{N-1} \mathbf{x} + \frac{1}{2} \mathbf{u}^\top \mathbf{R}_{N-1} \mathbf{u} + V_N\left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) \right) \\
& = \frac{1}{2} \mathbf{x}^\top \mathbf{Q}_{N-1} \mathbf{x} + \frac{1}{2}\min_{\mathbf{u}} \left( \mathbf{u}^\top \mathbf{R}_{N-1} \mathbf{u} + \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right)^\top \mathbf{P}_N \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) \right)
\end{aligned}
$$
Take the gradient with respect to $\mathbf{u}$ and set it to zero:
$$
\begin{aligned}
\Rightarrow\ & \mathbf{R}_{N-1}\mathbf{u} + \mathbf{B}^\top \mathbf{P}_N \left(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}\right) = \mathbf{0} \\
\Rightarrow\ & \mathbf{u}_{N-1} = -\left(\mathbf{R}_{N-1} + \mathbf{B}^\top \mathbf{P}_N \mathbf{B}\right)^{-1} \mathbf{B}^\top \mathbf{P}_N \mathbf{A}\,\mathbf{x} \triangleq -\mathbf{K}_{N-1}\mathbf{x}
\end{aligned}
$$
Plug $\mathbf{u}_{N-1}$ back into $V_{N-1}(\mathbf{x})$:
$$
\begin{aligned}
V_{N-1}\left(\mathbf{x}\right) & = \frac{1}{2} \mathbf{x}^\top \left[\mathbf{Q}_{N-1} + \mathbf{K}_{N-1}^\top \mathbf{R}_{N-1} \mathbf{K}_{N-1} + \left(\mathbf{A}-\mathbf{B}\mathbf{K}_{N-1}\right)^\top \mathbf{P}_N \left(\mathbf{A}-\mathbf{B}\mathbf{K}_{N-1}\right)\right] \mathbf{x} \\
& \triangleq \frac{1}{2} \mathbf{x}^\top \mathbf{P}_{N-1} \mathbf{x}
\end{aligned}
$$
Now we have a backward recursion for $\mathbf{K}$ and $\mathbf{P}$ that we iterate until $k=0$.
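The whole backward pass fits in a few lines. This sketch reuses the assumed double integrator from the earlier examples and rolls out the closed loop $\mathbf{u}_k = -\mathbf{K}_k \mathbf{x}_k$ to check that the state is driven toward the origin.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, QN, N):
    """DP/Riccati backward pass; returns gains K_0..K_{N-1} and P_0."""
    P = QN.copy()
    Ks = []
    for _ in range(N):  # k = N-1, ..., 0
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)
        Ks.append(K)
    return Ks[::-1], P  # Ks[k] is the gain at time k

# Same assumed toy system as earlier examples.
h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
Q = QN = np.eye(2)
R = np.array([[0.1]])

Ks, P0 = lqr_backward_pass(A, B, Q, R, QN, N=50)
x = np.array([1.0, 0.0])
for K in Ks:       # closed-loop rollout with u_k = -K_k x_k
    x = (A - B @ K) @ x
print("x_N =", x)  # driven toward the origin
```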
(4) General form of dynamic programming
$$
\begin{aligned}
& V_N(\mathbf{x}) \leftarrow \ell_N(\mathbf{x}) \\
& k \leftarrow N \\
& \text{while } k > 1 \\
& \qquad V_{k-1}(\mathbf{x}) = \min_{\mathbf{u}\in\mathcal{U}} \left( \ell_{k-1}(\mathbf{x}, \mathbf{u}) + V_k\left(\mathbf{f}_{k-1}(\mathbf{x}, \mathbf{u})\right) \right) \\
& \qquad k \leftarrow k-1 \\
& \text{end}
\end{aligned}
$$
If we know $V_{k+1}(\mathbf{x})$, the optimal policy at time $k$ is:
$$
\mathbf{u}_k(\mathbf{x}) = \argmin_{\mathbf{u}\in\mathcal{U}} \left( \ell_k(\mathbf{x}, \mathbf{u}) + V_{k+1}\left(\mathbf{f}_k(\mathbf{x}, \mathbf{u})\right) \right)
$$
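For intuition, here is a tiny tabular instance of this recursion on a made-up 1D problem with discretized state and action grids (the dynamics, cost, and grids are all illustrative assumptions); it is exactly the while-loop above with the $\min$ taken by enumeration.

```python
import numpy as np

# Discretized state/action grids (assumed 1D problem, purely illustrative).
xs = np.linspace(-2.0, 2.0, 41)   # state grid
us = np.linspace(-1.0, 1.0, 21)   # action grid
N = 20                            # horizon

f = lambda x, u: np.clip(x + 0.1 * u, xs[0], xs[-1])  # assumed dynamics
l = lambda x, u: x**2 + 0.1 * u**2                    # assumed stage cost
nearest = lambda x: int(np.abs(xs - x).argmin())      # project onto grid

V = xs**2                                # terminal cost V_N(x)
policy = np.zeros((N, len(xs)))
for k in reversed(range(N)):             # backward in time, as in the pseudocode
    V_new = np.empty_like(V)
    for i, x in enumerate(xs):
        costs = [l(x, u) + V[nearest(f(x, u))] for u in us]
        j = int(np.argmin(costs))
        V_new[i], policy[k, i] = costs[j], us[j]
    V = V_new                            # V now holds V_k

print("u_0(x=1.0) =", policy[0, nearest(1.0)])
```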
The DP equations can be written equivalently in terms of the action-value function (or Q-function):
$$
S_k(\mathbf{x}, \mathbf{u}) = \ell_k(\mathbf{x}, \mathbf{u}) + V_{k+1}\left(\mathbf{f}_k(\mathbf{x}, \mathbf{u})\right)
$$
$$
\Rightarrow \mathbf{u}_k(\mathbf{x}) = \argmin_{\mathbf{u}\in\mathcal{U}} S_k(\mathbf{x}, \mathbf{u})
$$
- This function is usually denoted $Q(\mathbf{x}, \mathbf{u})$, but we use $S$ to avoid confusion with the LQR weight matrix $\mathbf{Q}$.
- Working with $S$ avoids the need for the dynamics model $\mathbf{f}$ when extracting the policy.
(5) The curse of dimensionality
- DP is sufficient for a global optimum.
- It is only tractable for simple problems (LQR or low-dimensional systems).
- $V(\mathbf{x})$ stays quadratic for LQR but becomes impossible to write analytically for even simple nonlinear problems.
- Even if we could write down $V(\mathbf{x})$, $\min_{\mathbf{u}} S(\mathbf{x}, \mathbf{u})$ will be non-convex and possibly hard to solve.
- The cost of DP blows up with state dimension due to the difficulty of representing $V(\mathbf{x})$.
(6) Why do we care about DP?
- Approximating DP with a function approximator for $V(\mathbf{x})$ or $S(\mathbf{x}, \mathbf{u})$ is very powerful.
- It forms the basis of modern reinforcement learning.
- DP generalizes to stochastic problems (just wrap everything in expectation operators); Pontryagin's principle does not.