Contents
1. Introduction of Optimal Control (OC)
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(上).
1.1 System dynamics
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(上).
1.2 Optimal control problem, cost function and constraints formulation
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(上).
1.3 Control policy
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(上).
1.4 Bellman’s Principle of Optimality (BPO) and Dynamic Programming (DP)
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(上).
1.5 Hamiltonian-Jacobian-Bellman
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(上).
1.6 Pontryagin’s Maximum Principle (PMP)
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(上).
2. Linear Quadratic Regulator (LQR)
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(中).
2.1 LQR with indirect shooting based on PMP
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(中).
2.2 LQR as Quadratic Programming
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(中).
2.3 LQR as Dynamic Programming
Please refer to this 最优控制 LQR 与 Differential Dynamic Programming(DDP) 详细公式推导(中).
3. Differential Dynamic Programming (DDP)
In the last section we discussed the Linear Quadratic Regulator (LQR) problem, which is a special case of the optimal control problem (OCP) where the system dynamics are linear and the cost function is quadratic. How about non-linearities, either in dynamics or cost functions? How can we tackle the value function? In this section, we will discuss a more general framework for solving OCPs, namely Differential Dynamic Programming (DDP) for discrete process. DDP is a trajectory optimization method that is based on the principle of optimality. It is a two-step method that first computes a locally optimal trajectory and then iteratively refines it. The method is iterative and can be used to solve non-linear OCPs with non-linear dynamics and non-linear cost functions.
Recall that we have an OCP defined in the discrete-time domain as follows:
min
u
J
(
x
0
,
u
0
,
.
.
.
,
N
−
1
)
s
u
b
j
e
c
t
t
o
x
k
+
1
=
f
(
x
k
,
u
k
)
x
∈
X
u
∈
U
(51)
\begin{aligned}&\operatorname*{min}_{\boldsymbol{u}}&&J\left(\boldsymbol{x}_{0},\boldsymbol{u}_{0,...,N-1}\right)\\&\mathrm{subject~to}&&\bm{x}_{k+1}=\bm{f}(\bm{x}_{k},\bm{u}_{k})\\&&&\bm{x}\in\mathcal{X}\\&&&\bm{u}\in\mathcal{U}\end{aligned}\tag{51}
uminsubject toJ(x0,u0,...,N−1)xk+1=f(xk,uk)x∈Xu∈U(51)
where the cost function is defined as:
J
(
x
0
,
u
0
,
.
.
.
,
N
−
1
)
=
∑
k
=
0
N
−
1
ℓ
(
x
k
,
u
k
)
⏟
cost-to-go
+
ℓ
f
(
x
N
)
⏟
terminal cost
(52)
\begin{aligned}&J\left(\boldsymbol{x}_{0},\boldsymbol{u}_{0,...,N-1}\right)=\sum_{k=0}^{N-1}\underbrace{\ell(\boldsymbol{x}_{k},\boldsymbol{u}_{k})}_{\text{cost-to-go}}+\underbrace{\ell_{f}(\boldsymbol{x}_{N})}_{\text{terminal cost}}\end{aligned}\tag{52}
J(x0,u0,...,N−1)=k=0∑N−1cost-to-go
ℓ(xk,uk)+terminal cost
ℓf(xN)(52)
As in DP, we need to define the value function and find the recursion for it. To begin with, the value function is defined as:
V ( x k ) = min u k , … , N − 1 J ( x k , u k , … , N − 1 ) = min u k , … , N − 1 [ ∑ i = k N − 1 ℓ ( x i , u i ) + ℓ f ( x N ) ] = min u k ℓ ( x k , u k ) + min u k + 1 , … , N − 1 [ ∑ i = k + 1 N − 1 ℓ ( x i , u i ) + ℓ f ( x N ) ] = min u k ℓ ( x k , u k ) + V ( x k + 1 ) (53) \begin{aligned}V(\boldsymbol{x}_{k})&=\min_{\boldsymbol{u}_{k,\ldots,N-1}}J\left(\boldsymbol{x}_{k},\boldsymbol{u}_{k,\ldots,N-1}\right)\\&=\min_{\boldsymbol{u}_{k,\ldots,N-1}}\left[\sum_{i=k}^{N-1}\ell(\boldsymbol{x}_{i},\boldsymbol{u}_{i})+\ell_{f}(\boldsymbol{x}_{N})\right]\\&=\min_{\boldsymbol{u}_{k}}\ell(\boldsymbol{x}_{k},\boldsymbol{u}_{k})+\min_{\boldsymbol{u}_{k+1,\ldots,N-1}}\left[\sum_{i=k+1}^{N-1}\ell(\boldsymbol{x}_{i},\boldsymbol{u}_{i})+\ell_{f}(\boldsymbol{x}_{N})\right]\\&=\min_{\boldsymbol{u}_{k}}\ell(\boldsymbol{x}_{k},\boldsymbol{u}_{k})+V(\boldsymbol{x}_{k+1})\\\end{aligned}\tag{53} V(xk)=uk,…,N−1minJ(xk,uk,…,N−1)=uk,…,N−1min[i=k∑N−1ℓ(xi,ui)+ℓf(xN)]=ukminℓ(xk,uk)+uk+1,…,N−1min[i=k+1∑N−1ℓ(xi,ui)+ℓf(xN)]=ukminℓ(xk,uk)+V(xk+1)(53)
Here we don’t stick with the assumption on linear dynamics and quadratic cost function any more, so we can’t use the Riccati equation to solve the value function. Instead, we can use differential-based approximation upon both dynamics and cost function.
3.1 Local Approximation
Start with the dynamics, we can use Taylor expansion to approximate the dynamics around the current state and control input, say ( x ˉ , u ˉ ) (\bar{\bm{x}},\bar{\bm{u}}) (xˉ,uˉ), then we can approximate the dynamics around ( x ˉ , u ˉ ) (\bar{\bm{x}},\bar{\bm{u}}) (xˉ,uˉ) as:
d x d t = f ( x , u ) = f ( x ˉ , u ˉ ) + ∇ x f ( x ˉ , u ˉ ) ( x − x ˉ ) ⏟ l i n e a r t e r m + ∇ x f ( x ˉ , u ˉ ) ( u − u ˉ ) ⏟ l i n e a r t e r m + 1 2 ∇ x x 2 f ( x ˉ , u ˉ ) ( x − x ˉ ) 2 ⏟ quadratic term + 1 2 ∇ u u 2 f ( x ˉ , u ˉ ) ( u − u ˉ ) 2 ⏟ quadratic term + f x u ( x ˉ , u ˉ ) ( x − x ˉ ) ( u − u ˉ ) ⏟ cross quaduatic lerm + O ( ∥ x − x ˉ ∥ 3 ) + O ( ∥ u − u ˉ ∥ 3 ) (54) \begin{aligned}\frac{d\boldsymbol{x}}{dt}=\boldsymbol{f}(\boldsymbol{x},\boldsymbol{u})=\boldsymbol{f}(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})&+\underbrace{\nabla_{\boldsymbol{x}}\boldsymbol{f}(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})(\boldsymbol{x}-\bar{\boldsymbol{x}})}_{\mathrm{linear~term}}+\underbrace{\nabla_{\boldsymbol{x}}\boldsymbol{f}(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})(\boldsymbol{u}-\bar{\boldsymbol{u}})}_{\mathrm{linear~term}}\\&+\underbrace{\frac12\nabla_{\boldsymbol{x}\boldsymbol{x}}^2f(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})(\bm{x}-\bar{\boldsymbol{x}})^2}_{\text{quadratic term}}+\underbrace{\frac12\nabla_{\boldsymbol{u}\boldsymbol{u}}^2f(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})(\boldsymbol{u}-\bar{\boldsymbol{u}})^2}_{\text{quadratic term}}\\&+\underbrace{\boldsymbol{f}_{\boldsymbol{x}\boldsymbol{u}}(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})(\boldsymbol{x}-\bar{\boldsymbol{x}})(\boldsymbol{u}-\bar{\boldsymbol{u}})}_{\text{cross quaduatic lerm}}\\&+\mathcal{O}\left(\|\bm{x}-\bar{\boldsymbol{x}}\|^3\right)+\mathcal{O}\left(\|\boldsymbol{u}-\bar{\boldsymbol{u}}\|^3\right)\end{aligned}\tag{54} dtdx=f(x,u)=f(xˉ,uˉ)+linear term ∇xf(xˉ,uˉ)(x−xˉ)+linear term ∇xf(xˉ,uˉ)(u−uˉ)+quadratic term 21∇xx2f(xˉ,uˉ)(x−xˉ)2+quadratic term 21∇uu2f(xˉ,uˉ)(u−uˉ)2+cross quaduatic lerm fxu(xˉ,uˉ)(x−xˉ)(u−uˉ)+O(∥x−xˉ∥3)+O(∥u−uˉ∥3)(54)
or we can put it into a more compact matrix form with notation δ x = x − x ˉ \delta \bm{x}=\bm{x}-\bar{\bm{x}} δx=x−xˉ and δ u = u − u ˉ \delta \bm{u}=\bm{u}-\bar{\bm{u}} δu=u−uˉ:
d δ x d t = d ( x − x ˉ ) d t = f ( x , u ) − f ( x ˉ , u ˉ ) ≈ [ ∇ x f ( x ˉ , u ˉ ) ∇ x f ( x ˉ , u ˉ ) ] T [ δ x δ u ] + 1 2 [ δ x δ u ] T [ ∇ x x 2 f ( x ˉ , u ˉ ) ∇ x u 2 f ( x ˉ , u ˉ ) ∇ u x 2 f ( x ˉ , u ˉ ) ∇ u u 2 f ( x ˉ , u ˉ ) ] [ δ x δ u ] (55) \begin{aligned}\frac{d\delta\boldsymbol{x}}{dt}&=\frac{d(\boldsymbol{x}-\bar{\boldsymbol{x}})}{dt}=\boldsymbol{f}(\boldsymbol{x},\boldsymbol{u})-\boldsymbol{f}(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})\\&\approx\left[\begin{array}{c}\nabla_{\boldsymbol{x}}\boldsymbol{f}(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})\\\nabla_{\boldsymbol{x}}\boldsymbol{f}(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})\end{array}\right]^T\left[\begin{array}{c}\delta x\\\delta u\end{array}\right]+\frac12\left[\begin{array}{c}\delta x\\\delta u\end{array}\right]^T\left[\begin{array}{c}\nabla_{\boldsymbol{x}\boldsymbol{x}}^2f(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})&\nabla_{\boldsymbol{x}\boldsymbol{u}}^2f(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})\\\nabla_{\boldsymbol{u}\boldsymbol{x}}^2f(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})&\nabla_{\boldsymbol{u}\boldsymbol{u}}^2f(\bar{\boldsymbol{x}},\bar{\boldsymbol{u}})\end{array}\right]\left[\begin{array}{c}\delta x\\\delta u\end{array}\right]\end{aligned}\tag{55} dtdδx=dtd(x−xˉ)=f(x,u)−f(xˉ,uˉ)≈[∇xf(xˉ,uˉ)∇xf(xˉ,uˉ)]T[δxδu]+21[δxδu]T[∇xx2f(xˉ,uˉ)∇ux2f(xˉ,uˉ)∇xu2f(xˉ,uˉ)∇uu2f(xˉ,uˉ)][δxδu](55)
Then similarly for the value function, we have
V
(
x
k
,
u
k
)
=
min
u
k
ℓ
(
x
k
,
u
k
)
+
V
(
x
k
+
1
)
=
min
δ
u
k
ℓ
(
x
ˉ
k
+
δ
x
k
,
u
ˉ
k
+
δ
u
k
)
+
V
(
x
ˉ
k
+
1
+
δ
x
k
+
1
)
(56)
\begin{aligned}V(\boldsymbol{x}_k,\boldsymbol{u}_k)&=\min_{\boldsymbol{u}_k}\ell(\boldsymbol{x}_k,\boldsymbol{u}_k)+V(\boldsymbol{x}_{k+1})\\ &=\min_{\delta\boldsymbol{u}_{k}}\ell(\bar{\boldsymbol{x}}_{k}+\delta\boldsymbol{x}_{k},\bar{\boldsymbol{u}}_{k}+\delta\boldsymbol{u}_{k})+V(\bar{\boldsymbol{x}}_{k+1}+\delta\boldsymbol{x}_{k+1})\end{aligned}\tag{56}
V(xk,uk)=ukminℓ(xk,uk)+V(xk+1)=δukminℓ(xˉk+δxk,uˉk+δuk)+V(xˉk+1+δxk+1)(56)
as in DP backward iteration, we need to express the value function at next time step
k
+
1
k + 1
k+1 as a function of the current state
x
k
\bm{x}_k
xk and control
u
k
\bm{u}_k
uk in order to solve for
u
k
∗
.
\bm{u}_k^*.
uk∗. Hence we can use the Taylor expansior again:
V ( x ˉ k + 1 + δ x k + 1 ) = V ( x ˉ k + 1 ) + ∇ x V ( x ˉ k + 1 ) T δ x k + 1 + 1 2 δ x k + 1 T ∇ x x 2 V ( x ˉ k + 1 ) δ x k + 1 + O ( ∥ δ x k + 1 ∥ 3 ) (57) \begin{aligned}V(\bar{\boldsymbol{x}}_{k+1}+\delta\boldsymbol{x}_{k+1})&=V(\bar{\boldsymbol{x}}_{k+1})+\nabla_{\boldsymbol{x}}V(\bar{\boldsymbol{x}}_{k+1})^{T}\delta\boldsymbol{x}_{k+1}+\frac{1}{2}\delta\boldsymbol{x}_{k+1}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V(\bar{\boldsymbol{x}}_{k+1})\delta\boldsymbol{x}_{k+1}\\ &+\mathcal{O}\left(\|\delta\boldsymbol{x}_{k+1}\|^3\right)\end{aligned}\tag{57} V(xˉk+1+δxk+1)=V(xˉk+1)+∇xV(xˉk+1)Tδxk+1+21δxk+1T∇xx2V(xˉk+1)δxk+1+O(∥δxk+1∥3)(57)
to go further, we need to express
δ
x
k
+
1
\delta \bm{x}_{k+1}
δxk+1 as a function of
δ
x
k
\delta\boldsymbol{x}_k
δxk and
δ
u
k
\delta\boldsymbol{u}_k
δuk, which can be done by the dynamics in Eq.55 as
δ
x
k
+
1
≈
δ
x
k
+
[
∇
x
f
(
x
ˉ
k
,
u
ˉ
k
)
∇
x
f
(
x
ˉ
k
,
u
ˉ
k
)
]
T
[
δ
x
k
δ
u
k
]
+
1
2
[
δ
x
k
δ
u
k
]
T
[
∇
x
x
2
f
(
x
ˉ
k
,
u
ˉ
k
)
∇
x
u
2
f
(
x
ˉ
k
,
u
ˉ
k
)
∇
u
x
2
f
(
x
ˉ
k
,
u
ˉ
k
)
∇
u
u
2
f
(
x
ˉ
k
,
u
ˉ
k
)
]
[
δ
x
k
δ
u
k
]
(58)
\begin{aligned}\delta\boldsymbol{x}_{k+1}&\approx\delta\boldsymbol{x}_{k}+\left[\begin{array}{c}\nabla_{\boldsymbol{x}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\\\nabla_{\boldsymbol{x}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\end{array}\right]^{T}\left[\begin{array}{c}\delta\boldsymbol{x}_{k}\\\delta\boldsymbol{u}_{k}\end{array}\right]\\&+\frac{1}{2}\left[\begin{array}{c}\delta\boldsymbol{x}_{k}\\\delta\boldsymbol{u}_{k}\end{array}\right]^{T}\left[\begin{array}{c}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})&\nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\\\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})&\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\end{array}\right]\left[\begin{array}{c}\delta\boldsymbol{x}_{k}\\\delta\boldsymbol{u}_{k}\end{array}\right]\\\end{aligned}\tag{58}
δxk+1≈δxk+[∇xf(xˉk,uˉk)∇xf(xˉk,uˉk)]T[δxkδuk]+21[δxkδuk]T[∇xx2f(xˉk,uˉk)∇ux2f(xˉk,uˉk)∇xu2f(xˉk,uˉk)∇uu2f(xˉk,uˉk)][δxkδuk](58)
Here we trim the
2
n
d
2^{nd}
2nd order term for simplicity, and use the non-matrix for for plugging it back into approximating the value function at
k
+
1
k + 1
k+1 as in Eq.57
δ
x
k
+
1
≈
δ
x
k
+
∇
x
f
(
x
ˉ
k
,
u
ˉ
k
)
δ
x
k
+
∇
u
f
(
x
ˉ
k
,
u
ˉ
k
)
δ
u
k
=
(
I
+
∇
x
f
(
x
ˉ
k
,
u
ˉ
k
)
)
⏟
∇
x
f
~
(
x
ˉ
k
,
u
ˉ
k
)
δ
x
k
+
∇
u
f
(
x
ˉ
k
,
u
ˉ
k
)
δ
u
k
(59)
\begin{aligned}\delta\boldsymbol{x}_{k+1}&\approx\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{x}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{u}_{k}\\&=\underbrace{(\boldsymbol{I}+\nabla_{\boldsymbol{x}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k}))}_{\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})}\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{u}_{k}\end{aligned}\tag{59}
δxk+1≈δxk+∇xf(xˉk,uˉk)δxk+∇uf(xˉk,uˉk)δuk=∇xf~(xˉk,uˉk)
(I+∇xf(xˉk,uˉk))δxk+∇uf(xˉk,uˉk)δuk(59)
which yields
V
(
x
ˉ
k
+
1
+
δ
x
k
+
1
)
≈
V
(
x
ˉ
k
+
1
)
+
∇
x
V
(
x
ˉ
k
+
1
)
T
[
∇
x
f
~
(
x
ˉ
k
,
u
ˉ
k
)
δ
x
k
+
∇
u
f
(
x
ˉ
k
,
u
ˉ
k
)
δ
u
k
]
+
1
2
[
∇
x
f
~
(
x
ˉ
k
,
u
ˉ
k
)
δ
x
k
+
∇
u
f
(
x
ˉ
k
,
u
ˉ
k
)
δ
u
k
]
T
∇
x
x
2
V
(
x
ˉ
k
+
1
)
+
[
∇
x
f
~
(
x
ˉ
k
,
u
ˉ
k
)
δ
x
k
+
∇
u
f
(
x
ˉ
k
,
u
ˉ
k
)
δ
u
k
]
(60)
\begin{aligned}V(\bar{\boldsymbol{x}}_{k+1}+\delta\boldsymbol{x}_{k+1})\approx&V(\bar{\boldsymbol{x}}_{k+1})\\&+\nabla_{\boldsymbol{x}}V(\bar{\boldsymbol{x}}_{k+1})^{T}\left[\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{u}_{k}\right]\\&+\frac{1}{2}\left[\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{u}_{k}\right]^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V(\bar{\boldsymbol{x}}_{k+1})\\&+\left[\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}\boldsymbol{f}(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{u}_{k}\right]\end{aligned}\tag{60}
V(xˉk+1+δxk+1)≈V(xˉk+1)+∇xV(xˉk+1)T[∇xf~(xˉk,uˉk)δxk+∇uf(xˉk,uˉk)δuk]+21[∇xf~(xˉk,uˉk)δxk+∇uf(xˉk,uˉk)δuk]T∇xx2V(xˉk+1)+[∇xf~(xˉk,uˉk)δxk+∇uf(xˉk,uˉk)δuk](60)
as shown above, the Taylor expansion equation can be pretty long and will get even longer afterwards, hence we need to adapt some notations to make it more compact.
V
(
x
ˉ
k
+
1
)
=
V
′
f
(
x
ˉ
k
,
u
ˉ
k
)
=
f
(61)
V(\bar{\boldsymbol{x}}_{k+1})=V'\\ \bm{f}(\bar{\bm{x}}_k,\bar{\bm{u}}_k)=\bm{f}\tag{61}
V(xˉk+1)=V′f(xˉk,uˉk)=f(61)
then
V
(
x
ˉ
k
+
1
+
δ
x
k
+
1
)
≈
V
′
+
∇
x
V
′
T
[
∇
x
f
~
δ
x
k
+
∇
u
f
δ
u
k
]
+
1
2
[
∇
x
f
~
δ
x
k
+
∇
u
f
δ
u
k
]
T
∇
x
x
2
V
′
[
∇
x
f
~
δ
x
k
+
∇
u
f
δ
u
k
]
=
V
′
+
∇
x
V
′
T
∇
x
f
~
δ
x
k
+
∇
x
V
′
T
∇
u
f
δ
u
k
+
1
2
δ
x
k
T
∇
x
f
~
T
∇
x
x
2
V
′
∇
x
f
~
δ
x
k
+
1
2
δ
x
k
T
∇
x
f
~
T
∇
x
x
2
V
′
∇
u
f
δ
u
k
+
1
2
δ
u
k
T
∇
u
f
T
∇
x
x
2
V
′
∇
x
f
~
δ
x
k
+
1
2
δ
u
k
T
∇
u
f
T
∇
x
x
2
V
T
∇
x
x
2
V
′
∇
u
f
δ
u
k
(62)
\begin{aligned}V(\bar{\boldsymbol{x}}_{k+1}+\delta\boldsymbol{x}_{k+1})&\approx V^{\prime}+\nabla_{\boldsymbol{x}}V^{\prime T}\left[\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}f\delta\boldsymbol{u}_{k}\right]\\ &\quad\quad+\frac{1}{2}\left[\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}f\delta\boldsymbol{u}_{k}\right]^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\left[\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}f\delta\boldsymbol{u}_{k}\right]\\ &=V^{\prime}+\nabla_{\boldsymbol{x}}V^{\prime T}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{x}}V^{\prime T}\nabla_{\boldsymbol{u}}f\delta\boldsymbol{u}_{k}\\ &\quad\quad+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}\delta\boldsymbol{x}_{k}+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{u}}f\delta\boldsymbol{u}_{k} \\ &\quad\quad+\frac{1}{2}\delta\boldsymbol{u}_{k}^{T}\nabla_{\boldsymbol{u}}\bm{f}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^2V^{\prime}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}\delta\boldsymbol{x}_{k}+\frac{1}{2}\delta\boldsymbol{u}_{k}^{T}\nabla_{\boldsymbol{u}}\bm{f}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^2V^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{u}}\bm{f}\delta\boldsymbol{u}_{k} \end{aligned}\tag{62}
V(xˉk+1+δxk+1)≈V′+∇xV′T[∇xf~δxk+∇ufδuk]+21[∇xf~δxk+∇ufδuk]T∇xx2V′[∇xf~δxk+∇ufδuk]=V′+∇xV′T∇xf~δxk+∇xV′T∇ufδuk+21δxkT∇xf~T∇xx2V′∇xf~δxk+21δxkT∇xf~T∇xx2V′∇ufδuk+21δukT∇ufT∇xx2V′∇xf~δxk+21δukT∇ufT∇xx2VT∇xx2V′∇ufδuk(62)
By now we have approximated the latter part of the RHS of Eq. 56, and we still need the second order expansion of the running cost function to fully approximate the value function, which is pretty straightforward:
ℓ
(
x
k
,
u
k
)
≈
ℓ
(
x
ˉ
k
,
u
ˉ
k
)
+
∇
x
ℓ
(
x
ˉ
k
,
u
ˉ
k
)
T
δ
x
k
+
∇
u
ℓ
(
x
ˉ
k
,
u
ˉ
k
)
T
δ
u
k
+
1
2
δ
x
k
T
∇
x
x
2
ℓ
(
x
ˉ
k
,
u
ˉ
k
)
T
δ
x
k
+
1
2
δ
u
k
T
∇
u
u
2
ℓ
(
x
ˉ
k
,
u
ˉ
k
)
δ
u
k
+
1
2
δ
x
k
T
∇
x
u
2
ℓ
(
x
ˉ
k
,
u
ˉ
k
)
δ
u
k
+
1
2
δ
u
k
T
∇
u
x
2
ℓ
(
x
ˉ
k
,
u
ˉ
k
)
δ
x
k
(63)
\begin{aligned}\ell(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&\approx\ell(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})+\nabla_{\boldsymbol{x}}\ell(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})^{T}\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}\ell(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})^{T}\delta\boldsymbol{u}_{k}\\&+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}\ell(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})^{T}\delta\boldsymbol{x}_{k}+\frac{1}{2}\delta\boldsymbol{u}_{k}^{T}\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}\ell(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{u}_{k}\\&+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}\ell(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{u}_{k}+\frac{1}{2}\delta\boldsymbol{u}_{k}^{T}\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}\ell(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})\delta\boldsymbol{x}_{k}\end{aligned}\tag{63}
ℓ(xk,uk)≈ℓ(xˉk,uˉk)+∇xℓ(xˉk,uˉk)Tδxk+∇uℓ(xˉk,uˉk)Tδuk+21δxkT∇xx2ℓ(xˉk,uˉk)Tδxk+21δukT∇uu2ℓ(xˉk,uˉk)δuk+21δxkT∇xu2ℓ(xˉk,uˉk)δuk+21δukT∇ux2ℓ(xˉk,uˉk)δxk(63)
Now we can plug Eq 62 and 63 into Eq. 56 and get the final approximation of the value function:
V
(
x
k
,
u
k
)
=
min
u
k
[
ℓ
+
V
′
+
(
∇
x
f
~
T
∇
x
V
′
+
∇
x
ℓ
)
T
δ
x
k
+
(
∇
u
f
T
∇
x
V
′
+
∇
u
ℓ
)
T
δ
u
k
+
1
2
δ
x
k
T
(
∇
x
f
~
T
∇
x
x
2
V
′
∇
x
f
~
+
∇
x
x
2
ℓ
)
δ
x
k
+
1
2
δ
x
k
T
(
∇
x
f
~
T
∇
x
x
2
V
′
∇
u
f
+
∇
x
u
2
ℓ
)
δ
u
k
+
1
2
δ
u
k
T
(
∇
u
f
T
∇
x
x
2
V
′
∇
x
f
~
+
∇
u
x
2
ℓ
)
δ
x
k
+
1
2
δ
u
k
T
(
∇
u
f
T
∇
x
x
2
V
′
∇
u
f
+
∇
u
u
2
ℓ
)
δ
u
k
]
(64)
\begin{aligned} V(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\min_{\boldsymbol{u}_{k}}[\ell+V^{\prime}+(\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{x}}V^{\prime}+\nabla_{\boldsymbol{x}}\ell)^{T}\delta\boldsymbol{x}_{k}+(\nabla_{\boldsymbol{u}}\boldsymbol{f}^{T}\nabla_{\boldsymbol{x}}V^{\prime}+\nabla_{\boldsymbol{u}}\ell)^{T}\delta\boldsymbol{u}_{k}\\ &+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}(\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}+\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}\ell)\delta\boldsymbol{x}_{k}\\ &+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}(\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}\:V^{\prime}\nabla_{\boldsymbol{u}}\boldsymbol{f}+\nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}\ell)\delta\boldsymbol{u}_{k}\\ &+\frac{1}{2}\delta\boldsymbol{u}_{k}^{T}(\nabla_{\boldsymbol{u}}\boldsymbol{f}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}+\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}\ell)\delta\boldsymbol{x}_{k}\\ &+\frac{1}{2}\delta\boldsymbol{u}_{k}^{T}(\nabla_{\boldsymbol{u}}\boldsymbol{f}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{u}}\boldsymbol{f}+\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}\ell)\delta\boldsymbol{u}_{k}] \end{aligned}\tag{64}
V(xk,uk)=ukmin[ℓ+V′+(∇xf~T∇xV′+∇xℓ)Tδxk+(∇ufT∇xV′+∇uℓ)Tδuk+21δxkT(∇xf~T∇xx2V′∇xf~+∇xx2ℓ)δxk+21δxkT(∇xf~T∇xx2V′∇uf+∇xu2ℓ)δuk+21δukT(∇ufT∇xx2V′∇xf~+∇ux2ℓ)δxk+21δukT(∇ufT∇xx2V′∇uf+∇uu2ℓ)δuk](64)
theoretically, we can now set
∇
u
k
V
(
x
,
u
)
=
0
\nabla_{\bm{u}_k}V(\bm{x},\bm{u})=0
∇ukV(x,u)=0 to get the optimal control input
u
k
∗
u_k^*
uk∗ at step
k
k
k. However, please note these terms in brackets are all derivatives of the dynamics, cost function and value function, hence we can bring in yet another handy notation to make the formulas more compact:
S
(
x
,
u
)
=
ℓ
(
x
,
u
)
+
V
′
(
f
(
x
,
u
)
)
(65)
S(\boldsymbol{x},\boldsymbol{u})=\ell(\boldsymbol{x},\boldsymbol{u})+V'(\bm{f}(\boldsymbol{x},\boldsymbol{u}))\tag{65}
S(x,u)=ℓ(x,u)+V′(f(x,u))(65)
which is commonly referred as the state-action value function. Now we can rewrite the ”optimal” value function at stage
k
k
k as:
V
(
x
k
)
=
min
u
k
S
(
x
k
,
u
k
)
=
min
u
k
S
(
x
ˉ
k
+
δ
x
,
u
ˉ
k
+
δ
u
)
=
min
u
k
[
S
(
x
ˉ
k
,
u
ˉ
k
)
+
∇
x
S
(
x
ˉ
k
,
u
ˉ
k
)
T
δ
x
k
+
∇
u
S
(
x
ˉ
k
,
u
ˉ
k
)
T
δ
u
k
+
1
2
δ
x
k
T
∇
x
x
2
S
(
x
k
,
u
k
)
δ
x
k
+
1
2
δ
x
k
T
∇
x
u
2
S
(
x
k
,
u
k
)
δ
u
k
+
1
2
δ
u
k
T
∇
u
x
2
S
(
x
k
,
u
k
)
δ
x
k
+
1
2
δ
u
k
T
∇
u
u
2
S
(
x
k
,
u
k
)
δ
u
k
]
(66)
\begin{aligned}V(\boldsymbol{x}_{k})&=\min_{\boldsymbol{u}_{k}}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\\ &=\min_{\boldsymbol{u}_{k}}S(\bar{\boldsymbol{x}}_{k}+\delta\boldsymbol{x},\bar{\boldsymbol{u}}_{k}+\delta\boldsymbol{u})\\ &=\min_{\boldsymbol{u}_{k}}[S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})+\nabla_{\boldsymbol{x}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})^{T}\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})^{T}\delta\boldsymbol{u}_{k}\\ &+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\delta\boldsymbol{x}_{k}+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}{\boldsymbol{u}}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\delta\boldsymbol{u}_{k}\\ &+\frac{1}{2}\delta\boldsymbol{u}_{k}^{T}\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\delta\boldsymbol{x}_{k}+\frac{1}{2}\delta\boldsymbol{u}_{k}^{T}\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\delta\boldsymbol{u}_{k}]\end{aligned}\tag{66}
V(xk)=ukminS(xk,uk)=ukminS(xˉk+δx,uˉk+δu)=ukmin[S(xˉk,uˉk)+∇xS(xˉk,uˉk)Tδxk+∇uS(xˉk,uˉk)Tδuk+21δxkT∇xx2S(xk,uk)δxk+21δxkT∇xu2S(xk,uk)δuk+21δukT∇ux2S(xk,uk)δxk+21δukT∇uu2S(xk,uk)δuk](66)
where the derivatives of the state-action value function are defined as in Eq. 64:
∇
x
S
(
x
ˉ
k
,
u
ˉ
k
)
=
∇
x
f
~
T
∇
x
V
′
+
∇
x
ℓ
∇
u
S
(
x
ˉ
k
,
u
ˉ
k
)
=
∇
u
f
T
∇
x
V
′
+
∇
u
ℓ
∇
x
x
2
S
(
x
k
,
u
k
)
=
∇
x
f
~
T
∇
x
x
2
V
′
∇
x
f
~
+
∇
x
x
2
ℓ
∇
x
u
2
S
(
x
k
,
u
k
)
=
∇
x
f
~
T
∇
x
x
2
V
′
∇
u
f
+
∇
x
u
2
ℓ
∇
u
x
2
S
(
x
k
,
u
k
)
=
∇
u
f
T
∇
x
x
2
V
′
∇
x
f
~
+
∇
u
x
2
ℓ
∇
u
u
2
S
(
x
k
,
u
k
)
=
∇
u
f
T
∇
x
x
2
V
′
∇
u
f
+
∇
u
u
2
ℓ
(67)
\begin{aligned} \nabla_{\boldsymbol{x}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})&=\nabla_{\boldsymbol{x}}\tilde{f}^{T}\nabla_{\boldsymbol{x}}V^{\prime}+\nabla_{\boldsymbol{x}}\ell \\ \nabla_{\boldsymbol{u}}S(\bar{\boldsymbol{x}}_k,\bar{\boldsymbol{u}}_k)&=\nabla_{\boldsymbol{u}}f^T\nabla_{\boldsymbol{x}}V^{\prime}+\nabla_{\boldsymbol{u}}\ell \\ \nabla_{\boldsymbol{xx}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{xx}}^{2}V^{\prime}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}+\nabla_{\boldsymbol{xx}}^{2}\ell \\ \nabla_{\boldsymbol{xu}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{xx}}^{2}V^{\prime}\nabla_{\boldsymbol{u}}f+\nabla_{\boldsymbol{xu}}^{2}\ell \\ \nabla_{\boldsymbol{ux}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\nabla_{\boldsymbol{u}}f^{T}\nabla_{\boldsymbol{xx}}^{2}V^{\prime}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}+\nabla_{\boldsymbol{ux}}^{2}\ell \\ \nabla_{\boldsymbol{uu}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\nabla_{\boldsymbol{u}}f^{T}\nabla_{\boldsymbol{xx}}^{2}V^{\prime}\nabla_{\boldsymbol{u}}\boldsymbol{f}+\nabla_{\boldsymbol{uu}}^{2}\ell \end{aligned} \tag{67}
∇xS(xˉk,uˉk)∇uS(xˉk,uˉk)∇xx2S(xk,uk)∇xu2S(xk,uk)∇ux2S(xk,uk)∇uu2S(xk,uk)=∇xf~T∇xV′+∇xℓ=∇ufT∇xV′+∇uℓ=∇xf~T∇xx2V′∇xf~+∇xx2ℓ=∇xf~T∇xx2V′∇uf+∇xu2ℓ=∇ufT∇xx2V′∇xf~+∇ux2ℓ=∇ufT∇xx2V′∇uf+∇uu2ℓ(67)
still, we can re-introduce the second order Taylor expansion of the dynamics which we have omitted earlier:
∇
x
S
(
x
ˉ
k
,
u
ˉ
k
)
=
∇
x
f
~
T
∇
x
V
′
+
∇
x
ℓ
∇
u
S
(
x
ˉ
k
,
u
ˉ
k
)
=
∇
u
f
T
∇
x
V
′
+
∇
u
ℓ
∇
x
x
2
S
(
x
k
,
u
k
)
=
∇
x
f
~
T
∇
x
x
2
V
′
∇
x
f
~
+
∇
x
V
′
T
∇
x
x
2
f
~
+
∇
x
x
2
ℓ
∇
x
u
2
S
(
x
k
,
u
k
)
=
∇
x
f
~
T
∇
x
x
2
V
′
∇
u
f
+
∇
x
V
′
T
∇
x
u
2
f
+
∇
x
u
2
ℓ
∇
u
x
2
S
(
x
k
,
u
k
)
=
∇
u
f
T
∇
x
x
2
V
′
∇
x
f
~
+
∇
x
V
′
T
∇
u
x
2
f
+
∇
u
x
2
ℓ
∇
u
u
2
S
(
x
k
,
u
k
)
=
∇
u
f
T
∇
x
x
2
V
′
∇
u
f
+
∇
x
V
′
T
∇
u
u
2
f
+
∇
u
u
2
ℓ
(68)
\begin{aligned} \nabla_{\boldsymbol{x}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})&=\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{x}}V^{\prime}+\nabla_{\boldsymbol{x}}\ell\\ \nabla_{\boldsymbol{u}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})&=\nabla_{\boldsymbol{u}}f^{T}\nabla_{\boldsymbol{x}}V^{\prime}+\nabla_{\boldsymbol{u}}\ell\\ \nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}+\nabla_{\boldsymbol{x}}V^{\prime T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}\tilde{\boldsymbol{f}}+\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}\ell\\ \nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{u}}\boldsymbol{f}+\nabla_{\boldsymbol{x}}V^{\prime T}\nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}\boldsymbol{f}+\nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}\ell\\ \nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\nabla_{\boldsymbol{u}}\boldsymbol{f}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{x}}\tilde{\boldsymbol{f}}+\nabla_{\boldsymbol{x}}V^{\prime T}\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}\boldsymbol{f}+\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}\ell\\ \nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})&=\nabla_{\boldsymbol{u}}\boldsymbol{f}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}V^{\prime}\nabla_{\boldsymbol{u}}{\boldsymbol{f}}+\nabla_{\boldsymbol{x}}V^{\prime T}\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}\boldsymbol{f}+\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}\ell \end{aligned}\tag{68}
∇xS(xˉk,uˉk)∇uS(xˉk,uˉk)∇xx2S(xk,uk)∇xu2S(xk,uk)∇ux2S(xk,uk)∇uu2S(xk,uk)=∇xf~T∇xV′+∇xℓ=∇ufT∇xV′+∇uℓ=∇xf~T∇xx2V′∇xf~+∇xV′T∇xx2f~+∇xx2ℓ=∇xf~T∇xx2V′∇uf+∇xV′T∇xu2f+∇xu2ℓ=∇ufT∇xx2V′∇xf~+∇xV′T∇ux2f+∇ux2ℓ=∇ufT∇xx2V′∇uf+∇xV′T∇uu2f+∇uu2ℓ(68)
these two sets of derivatives make iterative LQR and DDP respectively. The difference between the two methods is that iterative LQR uses the first order Taylor expansion of the dynamics, while DDP uses the second order Taylor expansion of the dynamics.
Now we can solve the optimal control input u k ∗ \bm{u}_k^* uk∗ by letting the gradient of the value function in Eq. 66 over u \bm{u} u be zero:
∇ u S ( x k , u k ) = ∇ u S ( x ˉ k , u ˉ k ) T + δ x k T ∇ x u 2 S ( x k , u k ) + δ u k T ∇ u u 2 S ( x k , u k ) = 0 ⟹ u k ∗ = − ∇ u u 2 S − 1 ∇ u S ( x ˉ k , u ˉ k ) ⏟ feedforward term k − ∇ u u 2 S − 1 ∇ u x 2 S δ x k ⏟ feedback term K (69) \begin{aligned}&\nabla_{\boldsymbol{u}}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})=\nabla_{\boldsymbol{u}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})^{T}+\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})+\delta\boldsymbol{u}_{k}^{T}\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})=0\\&\implies\boldsymbol{u}_{k}^{*}=\underbrace{-\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S^{-1}\nabla_{\boldsymbol{u}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})}_{\text{feedforward term k}}\underbrace{-\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S^{-1}\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}S\delta\boldsymbol{x}_{k}}_{\text{feedback term K}}\end{aligned}\tag{69} ∇uS(xk,uk)=∇uS(xˉk,uˉk)T+δxkT∇xu2S(xk,uk)+δukT∇uu2S(xk,uk)=0⟹uk∗=feedforward term k −∇uu2S−1∇uS(xˉk,uˉk)feedback term K −∇uu2S−1∇ux2Sδxk(69)
3.2 Backward Pass
Given the one step optimal policy u k ∗ \bm{u}_k^* uk∗, we can start from the terminal step k = N k = N k=N and working backwards as we did in Dynamic Programming with Bellman’s equation.
Unlike normal DP, with the action-value function introduced, we still need the recursion for S S S to propagate, and also we are relying heavily on the derivatives of the value function at the next step (in other words, we should calculate and store that for the next iteration backwards at current step), in order to tackle that, we can plug the optimal control input u k ∗ \bm{u}_k^* uk∗ back into Eq. 66.
V ( x k ) = S ( x k , u k ) = S ( x ˉ k + δ x , u ˉ k + δ u ) = S ( x ˉ k , u ˉ k ) + ∇ x S ( x ˉ k , u ˉ k ) T δ x k + ∇ u S ( x ˉ k , u ˉ k ) T δ u k ∗ + 1 2 δ x k T ∇ x x 2 S ( x k , u k ) δ x k + 1 2 δ x k T ∇ x u 2 S ( x k , u k ) δ u k ∗ + 1 2 δ u k ∗ T ∇ u x 2 S ( x k , u k ) δ x k + 1 2 δ u k ∗ T ∇ u u 2 S ( x k , u k ) δ u k ∗ (70) \begin{aligned}V(\boldsymbol{x}_{k})=&S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\\=&S(\bar{\boldsymbol{x}}_{k}+\delta\boldsymbol{x},\bar{\boldsymbol{u}}_{k}+\delta\boldsymbol{u})\\=&S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})+\nabla_{\boldsymbol{x}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})^{T}\delta\boldsymbol{x}_{k}+\nabla_{\boldsymbol{u}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})^{T}\delta\boldsymbol{u}_{k}^{*}\\&+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\delta\boldsymbol{x}_{k}+\frac{1}{2}\delta\boldsymbol{x}_{k}^{T}\nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\delta\boldsymbol{u}_{k}^{*}\\&+\frac{1}{2}\delta\boldsymbol{u}_{k}^{*T}\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\delta\boldsymbol{x}_{k}+\frac{1}{2}\delta\boldsymbol{u}_{k}^{*T}\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S(\boldsymbol{x}_{k},\boldsymbol{u}_{k})\delta\boldsymbol{u}_{k}^{*}\end{aligned}\tag{70} V(xk)===S(xk,uk)S(xˉk+δx,uˉk+δu)S(xˉk,uˉk)+∇xS(xˉk,uˉk)Tδxk+∇uS(xˉk,uˉk)Tδuk∗+21δxkT∇xx2S(xk,uk)δxk+21δxkT∇xu2S(xk,uk)δuk∗+21δuk∗T∇ux2S(xk,uk)δxk+21δuk∗T∇uu2S(xk,uk)δuk∗(70)
where
u
k
∗
=
−
∇
u
u
2
S
−
1
∇
u
S
(
x
ˉ
k
,
u
ˉ
k
)
−
∇
u
u
2
S
−
1
∇
u
x
2
S
δ
x
k
(71)
\boldsymbol{u}_{k}^{*}=-\nabla_{\boldsymbol{uu}}^{2}S^{-1}\nabla_{\boldsymbol{u}}S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})-\nabla_{\boldsymbol{uu}}^{2}S^{-1}\nabla_{\boldsymbol{ux}}^{2}S\delta\boldsymbol{x}_{k}\tag{71}
uk∗=−∇uu2S−1∇uS(xˉk,uˉk)−∇uu2S−1∇ux2Sδxk(71)
reorgnizing the above equation leads to
V
(
x
k
)
=
S
(
x
ˉ
k
,
u
ˉ
k
)
−
1
2
∇
u
S
T
∇
u
u
2
S
−
1
∇
u
S
+
(
∇
x
S
T
−
∇
u
S
T
∇
u
u
2
S
−
1
∇
u
,
x
2
S
)
δ
x
+
1
2
δ
x
T
(
∇
x
x
2
S
T
−
∇
x
u
2
S
∇
u
u
2
S
−
1
∇
u
x
2
S
)
δ
x
(72)
\begin{aligned}V(\boldsymbol{x}_{k})&=S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})-\frac{1}{2}\nabla_{\boldsymbol{u}}S^{T}\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S^{-1}\nabla_{\boldsymbol{u}}S+(\nabla_{\boldsymbol{x}}S^{T}-\nabla_{\boldsymbol{u}}S^{T}\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S^{-1}\nabla_{\boldsymbol{u},\boldsymbol{x}}^{2}S)\delta\boldsymbol{x}\\&+\frac{1}{2}\delta\boldsymbol{x}^{T}(\nabla_{\boldsymbol{x}\boldsymbol{x}}^{2}S^{T}-\nabla_{\boldsymbol{x}\boldsymbol{u}}^{2}S\nabla_{\boldsymbol{u}\boldsymbol{u}}^{2}S^{-1}\nabla_{\boldsymbol{u}\boldsymbol{x}}^{2}S)\delta\boldsymbol{x}\end{aligned}\tag{72}
V(xk)=S(xˉk,uˉk)−21∇uST∇uu2S−1∇uS+(∇xST−∇uST∇uu2S−1∇u,x2S)δx+21δxT(∇xx2ST−∇xu2S∇uu2S−1∇ux2S)δx(72)
which has a clear structure of ordered items from 0 to 2, i.e., we can approximately use
Δ
V
=
S
(
x
ˉ
k
,
u
ˉ
k
)
−
1
2
∇
u
S
T
∇
u
u
2
S
−
1
∇
u
S
∇
x
V
=
∇
x
S
T
−
∇
u
S
T
∇
u
u
2
S
−
1
∇
u
,
x
2
S
∇
x
x
V
=
∇
x
x
2
S
T
−
∇
x
u
2
S
∇
u
u
2
S
−
1
∇
u
x
2
S
(73)
\begin{aligned} \Delta V&=S(\bar{\boldsymbol{x}}_{k},\bar{\boldsymbol{u}}_{k})-\frac{1}{2}\nabla_{u}S^{T}\nabla_{uu}^{2}S^{-1}\nabla_{u}S\\ \nabla_{\bm{x}}V&=\nabla_{x}S^{T}-\nabla_{u}S^{T}\nabla_{uu}^{2}S^{-1}\nabla_{u,x}^{2}S\\ \nabla_{\bm{xx}}V&=\nabla_{xx}^{2}S^{T}-\nabla_{xu}^{2}S\nabla_{uu}^{2}S^{-1}\nabla_{ux}^{2}S\tag{73} \end{aligned}
ΔV∇xV∇xxV=S(xˉk,uˉk)−21∇uST∇uu2S−1∇uS=∇xST−∇uST∇uu2S−1∇u,x2S=∇xx2ST−∇xu2S∇uu2S−1∇ux2S(73)
for step
k
−
1
k-1
k−1.
3.3 Line Search
Please be noted that the second order expansion of dynamics, cost function and value function can only hold locally, it can be possible that the u ∗ \bm{u}* u∗ overshoots and leads to inefficiency or even unstable optimisation process. Hence we can introduce a line-search step for evluation the ferformance of applying scaled feedforward term by α = 0 ∼ 1 \alpha = 0 ∼ 1 α=0∼1 during the iteration.
3.4 Forward Pass
Once we have finished the backward pass from the last timestep to the first one, we can roll-out the system response forward with dynamics.
Algorithm 1 DDP |
---|
Require: Initial State x 0 \bm{x}_0 x0, dynamics model x i + 1 = f ( x i , u i ) \bm{x}_{i+1}=f(\bm{x}_i,\bm{u}_i) xi+1=f(xi,ui), Initial reference control sequence U 0 \bm{U}_0 U0 |
Ensure: Optimal Control Sequence U ∗ \bm{U}^* U∗, Optimal State Trajectory X ∗ \bm{X}^* X∗ |
1: function D D P ( f , l , l f ) DDP(f,l,l_f) DDP(f,l,lf) |
2: Initial State x 0 \bm{x}_0 x0, Initial reference control sequence U 0 \bm{U}_0 U0 |
3: while not converged do |
4: ∇ x V ( x N ) = ∇ x ℓ f \nabla_{\bm{x}}V(\boldsymbol{x}_N)=\nabla_{\boldsymbol{x}}\ell_f ∇xV(xN)=∇xℓf |
5: ∇ x x V ( x N ) = ∇ x x ℓ f \nabla_{\bm{xx}}V(\boldsymbol{x}_N)=\nabla_{\boldsymbol{x}\bm{x}}\ell_f ∇xxV(xN)=∇xxℓf |
6: for i = N − 1 → 0 i= N- 1\to 0 i=N−1→0 do |
7: k , K ← B a c k w a r d P a s s \bm{k}, \bm{K}\leftarrow Backward\ Pass k,K←Backward Pass |
8: end for |
9: for i = 0 → N − 1 i= 0\to N-1 i=0→N−1 do |
10: Forward Pass with line search |
11: end for |
12: return Optimal Trajectory |
13: end while |
14: end function |
15: function Forward Pass |
16: Initialization, x ˉ 0 ← x 0 , U = U 0 , α = 1 \bar {\bm{x}}_0\leftarrow \bm{x}_0, \boldsymbol{U}= \boldsymbol{U}_0, \alpha = 1 xˉ0←x0,U=U0,α=1 |
17: for i = 0 → N − 1 i= 0\to N-1 i=0→N−1 do |
18: u ‾ i ← u i + α k [ i ] + K [ i ] ( x ‾ i − x i ) \overline{\bm{u}}_i\leftarrow \bm{u}_i+\alpha k[i]+K[i](\overline{\bm{x}}_i-\bm{x}_i) ui←ui+αk[i]+K[i](xi−xi) |
19: x ‾ i + 1 = f ( x ‾ i , u ‾ i ) \overline{\bm{x}}_{i+1}=f(\overline{\bm{x}}_i,\overline{\bm{u}}_i) xi+1=f(xi,ui) |
20: X [ i ] ← x ‾ i \boldsymbol{X}[i]\leftarrow\overline{\bm{x}}_i X[i]←xi |
21: U [ i ] ← u ‾ i \bm{U}[i]\leftarrow\overline{\bm{u}}_i U[i]←ui |
22: end for |
23: α = ρ α \alpha=\rho\alpha α=ρα |
24: X [ N ] ← x ‾ N \bm{X}[N]\leftarrow\overline{\bm{x}}_N X[N]←xN |
25: T ← { X , U } \bm{T}\leftarrow\{\bm{X},\bm{U}\} T←{X,U} |
26: J = J ( T ) J=J(T) J=J(T) |
27: return j , T j,\bm T j,T |
28: end function |
29: function Backward Pass |
30: Initialization, V x ← l f , x ( x n ) , V x , x ← l f , x ( x n ) , k ← [ ] , K ← [ ] V_{\boldsymbol{x}}\leftarrow l_{f,\boldsymbol{x}}(\bm{x}_{n}),V_{\boldsymbol{x},\boldsymbol{x}}\leftarrow l_{f,\boldsymbol{x}}(\bm{x}_{n}),k\leftarrow[],K\leftarrow[] Vx←lf,x(xn),Vx,x←lf,x(xn),k←[],K←[] |
31: for i = N − 1 → 0 i= N- 1\to 0 i=N−1→0 do |
32: S x ← l x ∣ x i + ( f x T V x ) ∣ x i S_{x}\leftarrow l_{x}|_{x_{i}}+(f_{x}^{T}V_{x})|_{x_{i}} Sx←lx∣xi+(fxTVx)∣xi |
33: S u ← l u ∣ x i + ( f u T V x ) ∣ u i , x i S_{u}\leftarrow l_{u}|_{{\boldsymbol{x}_i}}+(f_{u}^{T}V_{x})|_{{\boldsymbol{u}_i,\boldsymbol{x}_i}} Su←lu∣xi+(fuTVx)∣ui,xi |
34: S x x ← l x x ∣ x i + ( f x T V x x f x ) ∣ x i S_{\boldsymbol{x}\boldsymbol{x}}\leftarrow l_{\boldsymbol{x}\boldsymbol{x}}|_{\boldsymbol{x}_i}+(f_x^TV_{\boldsymbol{xx}}f_x)|_{\boldsymbol{x}_i} Sxx←lxx∣xi+(fxTVxxfx)∣xi |
35: S u u ← l u u ∣ u i + ( f u T V x x f u ) ∣ u i , x i , u i S_{\boldsymbol{u}\boldsymbol{u}}\leftarrow l_{\boldsymbol{u}\boldsymbol{u}}|_{\boldsymbol{u}_i}+(f_u^TV_{\boldsymbol{xx}}f_u)|_{\boldsymbol{u}_i,\boldsymbol{x}_i,\boldsymbol{u}_i} Suu←luu∣ui+(fuTVxxfu)∣ui,xi,ui |
36: S u x ← l u x ∣ u i , x i + ( f u T V x x f x ) ∣ u i , x i , x i S_{\boldsymbol{u}\boldsymbol{x}}\leftarrow l_{\boldsymbol{u}\boldsymbol{x}}|_{\boldsymbol{u}_i,\boldsymbol{x}_i}+(f_u^TV_{\boldsymbol{xx}}f_x)|_{\boldsymbol{u}_i,\boldsymbol{x}_i,\boldsymbol{x}_i} Sux←lux∣ui,xi+(fuTVxxfx)∣ui,xi,xi |
37: S ^ u ← l u ∣ x i + ( f u T ( V x + μ I n ) ) ∣ u i , x i \widehat{S}_{\bm{u}}\leftarrow l_{u}|_{{\boldsymbol{x}_i}}+(f_{u}^{T}(V_{x}+\mu \bm{I}_n))|_{{\boldsymbol{u}_i,\boldsymbol{x}_i}} S u←lu∣xi+(fuT(Vx+μIn))∣ui,xi |
38: S ~ u u ← l u u ∣ u i + ( f u T ( V x x + μ I n ) ) f u ) ∣ u i , x i , u i \widetilde{S}_{\boldsymbol{u}\boldsymbol{u}}\leftarrow l_{\boldsymbol{u}\boldsymbol{u}}|_{\boldsymbol{u}_i}+(f_u^T(V_{\boldsymbol{xx}}+\mu \bm{I}_n))f_u)|_{\boldsymbol{u}_i,\boldsymbol{x}_i,\boldsymbol{u}_i} S uu←luu∣ui+(fuT(Vxx+μIn))fu)∣ui,xi,ui |
39: S ~ u x ← l u x ∣ u i , x i + ( f u T ( V x x + μ I n ) f x ) ∣ u i , x i , x i \widetilde{S}_{\boldsymbol{u}\boldsymbol{x}}\leftarrow l_{\boldsymbol{u}\boldsymbol{x}}|_{\boldsymbol{u}_i,\boldsymbol{x}_i}+(f_u^T(V_{\boldsymbol{xx}}+\mu \bm{I}_n)f_x)|_{\boldsymbol{u}_i,\boldsymbol{x}_i,\boldsymbol{x}_i} S ux←lux∣ui,xi+(fuT(Vxx+μIn)fx)∣ui,xi,xi |
40: k [ i ] ← − S ~ u u − 1 S ^ u k[i]\leftarrow-\widetilde{S}_{\boldsymbol{uu}}^{-1}\widehat{S}_{\boldsymbol{u}} k[i]←−S uu−1S u |
41: K [ i ] ← − S ~ u u − 1 S ~ u x K[i]\leftarrow -\widetilde{S}_{\boldsymbol{uu}}^{-1}\widetilde{S}_{\boldsymbol{u}\bm x} K[i]←−S uu−1S ux |
42: V x ← S x − K T S u u k V_{x}\leftarrow S_{x}-K^{T}S_{\boldsymbol{uu}}k Vx←Sx−KTSuuk |
43: V x x ← S x x − K T S u u K V_{xx}\leftarrow S_{xx}-K^TS_{uu}K Vxx←Sxx−KTSuuK |
44: end for |
45: return k \bm{k} k, K \bm{K} K |
46: end function |