Adaptive dynamic programming (ADP) iteratively approximates the true solution of the dynamic programming problem, and in doing so gradually approaches the optimal control solution of a nonlinear system.
I. Structure and basic principles of ADP
1. Basic structure of ADP
Consider the discrete-time nonlinear dynamical system:
\begin{gather} \begin{aligned} x(k+1)=f[x(k),u(k),k], \ \ \ k=0,1,\ldots \end{aligned}\tag{1}\end{gather}
where $x \in R^n$ denotes the system state vector, $u \in R^m$ denotes the control action, and $f$ is the system function. The performance-index (cost) function associated with this system at time $k$ is usually taken to be a quadratic cost of the form:
\begin{gather} \begin{aligned} J[x(k),k] = \sum\limits_{i = k}^\infty {\gamma ^{i - k}}\left( x(i)^T Q x(i) + u(i)^T R u(i) \right) \end{aligned}\tag{2}\end{gather}
where $Q \in R^{n \times n}$ is a positive-definite state weighting matrix, $R \in R^{m \times m}$ is a positive-definite control weighting matrix, and $\gamma$ is the discount factor, with $0 < \gamma \le 1$, which places more weight on near-term cost. The goal of dynamic programming is to choose a control sequence $u(i),\ i = k, k+1, \ldots$ that minimizes the cost function $J$ defined in Eq. (2).
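As a small illustration of Eq. (2), the following Python/NumPy sketch (illustrative only, not from the original post) computes the one-step quadratic utility and a finite-horizon approximation of the discounted cost; the function names `utility` and `discounted_cost` are made up for this example:

```python
import numpy as np

def utility(x, u, Q, R):
    """One-step quadratic utility U(k) = x^T Q x + u^T R u."""
    return x @ Q @ x + u @ R @ u

def discounted_cost(xs, us, Q, R, gamma=0.95):
    """Finite-horizon approximation of J[x(k), k] = sum_{i>=k} gamma^(i-k) U(i).

    xs and us are the state and control trajectories starting at time k.
    """
    return sum(gamma**i * utility(x, u, Q, R)
               for i, (x, u) in enumerate(zip(xs, us)))
```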
The basic structure of adaptive dynamic programming is shown in Fig. 1 (the dashed lines indicate the paths along which the networks are updated):
2. Basic principles of ADP
2.1 Critic network
The output $\hat J$ of the critic network is an estimate of the function $J$ given by Eq. (2). This estimate is obtained by minimizing, over time, the following error:
\begin{gather} \begin{aligned} \left\| {E_c} \right\| = \sum\limits_k {E_c}(k) = \frac{1}{2}\sum\limits_k {[\hat J(k) - U(k) - \gamma \hat J(k + 1)]^2} \end{aligned}\tag{3}\end{gather}
where $\hat J(k) = \hat J[x(k),u(k),k,W_c]$ and $W_c$ denotes the parameters of the critic network. The function $U(k) = x(k)^T Q x(k) + u(k)^T R u(k)$ is exactly the utility function appearing in Eq. (2); note that $U(k)$ is the utility of the single time step $k$, not the accumulation from time $k$ to infinity. When $E_c(k) = 0$ for all $k$, Eq. (3) implies
\begin{gather} \begin{aligned} \hat J(k) &= U(k) + \gamma \hat J(k + 1)\\ &= U(k) + \gamma [U(k+1) + \gamma \hat J(k + 2)]\\ &= \cdots\\ &= \sum\limits_{i = k}^{\infty} {\gamma ^{i - k}}U(i) \end{aligned}\tag{4}\end{gather}
Eq. (4) is exactly the cost function defined in Eq. (2). Therefore, minimizing the error function defined in Eq. (3) yields a trained neural network whose output $\hat J$ is an estimate of the cost function $J$ defined in Eq. (2).
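To make the roles of Eqs. (3) and (4) concrete, here is a minimal sketch (a hypothetical illustration, not from the original post) of the temporal-difference-style prediction error that the critic is trained to drive to zero; `j_hat_k` and `j_hat_next` stand for the critic outputs $\hat J(k)$ and $\hat J(k+1)$:

```python
import numpy as np

def critic_error(j_hat_k, j_hat_next, x_k, u_k, Q, R, gamma=0.95):
    """e_c(k) = J_hat(k) - U(k) - gamma * J_hat(k+1), cf. Eq. (3)."""
    U_k = x_k @ Q @ x_k + u_k @ R @ u_k   # one-step utility U(k)
    return j_hat_k - U_k - gamma * j_hat_next

# E_c(k) = 0.5 * critic_error(...)**2 is the squared-error term summed in Eq. (3)
```

Driving this error to zero for all $k$ is precisely the consistency condition that makes $\hat J$ reproduce the cost in Eq. (2).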
2.2 Action network
The action network is trained through the control signal $u(k) = u[x(k),k,W_a]$ (where $W_a$ denotes the parameters of the action network), with the objective of minimizing $\hat J(k)$. In other words, by training the action network to minimize the output of the critic network, we obtain a trained network that produces an optimal or near-optimal control signal.
II. Design and update of the critic-action (actor-critic) networks
1. Critic network design
The critic network takes the current system state as input and outputs the cost value. Thus, for an $n$-dimensional state, the critic uses a structure with $n$ input neurons, $p$ hidden-layer neurons, and 1 output neuron. The $n$ inputs are the $n$ components of the state vector, and the output is the estimate of the optimal performance index corresponding to the input state. The hidden layer of the critic network uses a bipolar sigmoidal activation function (other activation functions could also be used), and the output layer uses the linear function purelin. The critic network structure is shown in Fig. 2.
Training the critic network again consists of a forward computation and a backward error-propagation pass. The forward computation of the critic network is:
\begin{gather} \begin{aligned} c_{h1j}(k) = \sum\limits_{i = 1}^n {\hat x}_i(k)\cdot W_{c1ij}(k),\ \ \ \ \ \ j = 1,2,\ldots,p \end{aligned}\tag{5}\end{gather}
\begin{gather} \begin{aligned} c_{h2j}(k) = \frac{1 - e^{ - c_{h1j}(k)}}{1 + e^{ - c_{h1j}(k)}},\ \ \ \ \ \ j = 1,2,\ldots,p \end{aligned}\tag{6}\end{gather}
\begin{gather} \begin{aligned} \hat J(k) = \sum\limits_{j = 1}^p c_{h2j}(k) \cdot W_{c2j}(k) \end{aligned}\tag{7}\end{gather}
where $c_{h1j}(k)$ is the input to the $j$-th hidden-layer node of the critic network and $c_{h2j}(k)$ is the output of the $j$-th hidden-layer node. The critic network is likewise trained by gradient descent, by minimizing the error defined by:
\begin{gather} \begin{aligned} \left\| {E_c} \right\| = \sum\limits_k {E_c}(k) = \frac{1}{2}\sum\limits_k e_c^2(k) \end{aligned}\tag{8}\end{gather}
\begin{gather} \begin{aligned} e_c(k) = \hat J(k) - U(k) - \gamma \hat J(k + 1) \end{aligned}\tag{9}\end{gather}
The critic network weight updates are derived as follows:
① $W_{c2}$ (hidden-to-output weight matrix):
\begin{gather} \begin{aligned} \Delta W_{c2j}(k) &= l_c(k)\left[ - \frac{\partial E_c(k)}{\partial W_{c2j}(k)} \right]\\ &= l_c(k)\left[ - \frac{\partial E_c(k)}{\partial \hat J(k)}\frac{\partial \hat J(k)}{\partial W_{c2j}(k)} \right]\\ &= - l_c(k) \cdot e_c(k) \cdot c_{h2j}(k) \end{aligned}\tag{10}\end{gather}
\begin{gather} \begin{aligned} \Delta W_{c2}(k) = - l_c(k) \cdot e_c(k) \cdot c_{h2}^T(k) \end{aligned}\tag{11}\end{gather}
\begin{gather} \begin{aligned} W_{c2}(k + 1) = W_{c2}(k) + \Delta W_{c2}(k) \end{aligned}\tag{12}\end{gather}
② $W_{c1}$ (input-to-hidden weight matrix):
\begin{gather} \begin{aligned} \Delta W_{c1ij}(k) &= l_c(k)\left[ - \frac{\partial E_c(k)}{\partial W_{c1ij}(k)} \right]\\ &= l_c(k)\left[ - \frac{\partial E_c(k)}{\partial \hat J(k)}\frac{\partial \hat J(k)}{\partial c_{h2j}(k)}\frac{\partial c_{h2j}(k)}{\partial c_{h1j}(k)}\frac{\partial c_{h1j}(k)}{\partial W_{c1ij}(k)} \right]\\ &= - l_c(k) \cdot e_c(k) \cdot W_{c2j}(k) \cdot \frac{1}{2}\left[ 1 - c_{h2j}^2(k) \right] \cdot \hat x_i(k) \end{aligned}\tag{13}\end{gather}
\begin{gather} \begin{aligned} \Delta W_{c1}(k) = - \frac{1}{2} \cdot l_c(k) \cdot e_c(k) \cdot \hat x^T(k) \times \left\{ W_{c2}^T(k) \otimes \left[ 1 - c_{h2}(k) \otimes c_{h2}(k) \right] \right\} \end{aligned}\tag{14}\end{gather}
\begin{gather} \begin{aligned} W_{c1}(k + 1) = W_{c1}(k) + \Delta W_{c1}(k) \end{aligned}\tag{15}\end{gather}
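The equations above translate almost line by line into code. The following Python/NumPy sketch of an n-p-1 critic network illustrates Eqs. (5) to (15) under the stated design (bipolar sigmoid hidden layer, linear output, plain gradient descent). It is not the original MATLAB toolbox implementation; the class and variable names are my own:

```python
import numpy as np

class CriticNetwork:
    """n-p-1 critic: bipolar-sigmoid hidden layer, linear output (Eqs. 5-7)."""

    def __init__(self, n, p, lc=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.Wc1 = rng.uniform(-0.5, 0.5, size=(n, p))   # input -> hidden weights
        self.Wc2 = rng.uniform(-0.5, 0.5, size=(p,))     # hidden -> output weights
        self.lc = lc                                     # learning rate l_c(k)

    def forward(self, x):
        ch1 = x @ self.Wc1                               # Eq. (5)
        ch2 = (1 - np.exp(-ch1)) / (1 + np.exp(-ch1))    # Eq. (6): bipolar sigmoid
        return float(ch2 @ self.Wc2), ch2                # Eq. (7): J_hat(k)

    def update(self, x, e_c):
        """One gradient-descent step on E_c(k) = 0.5*e_c(k)^2, cf. Eqs. (10)-(15)."""
        _, ch2 = self.forward(x)
        dWc2 = -self.lc * e_c * ch2                                          # Eq. (11)
        dWc1 = -0.5 * self.lc * e_c * np.outer(x, self.Wc2 * (1 - ch2**2))   # Eq. (14)
        self.Wc2 += dWc2                                                     # Eq. (12)
        self.Wc1 += dWc1                                                     # Eq. (15)
```

Here `e_c` is the error of Eq. (9), computed from the critic outputs at times $k$ and $k+1$ as in the earlier snippet.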
2. Action network design
The action network has $n$ input neurons, $q$ hidden-layer neurons, and $m$ output neurons. The $n$ inputs are the $n$ components of the system state vector $x(k)$ at time $k$; the $m$ outputs are the $m$ components of the control vector $u(k)$ corresponding to the input state $x(k)$. The hidden layer of the action network uses a bipolar sigmoidal activation function, and the output layer uses the linear function purelin. The action network structure is shown in Fig. 3:
Training the action network also consists of a forward computation and a backward error-propagation pass. The forward computation of the action network is:
\begin{gather} \begin{aligned} a_{h1j}(k) = \sum\limits_{i = 1}^n {\hat x}_i(k)\cdot W_{a1ij}(k),\ \ \ \ \ \ j = 1,2,\ldots,q \end{aligned}\tag{16}\end{gather}
\begin{gather} \begin{aligned} a_{h2j}(k) = \frac{1 - e^{ - a_{h1j}(k)}}{1 + e^{ - a_{h1j}(k)}},\ \ \ \ \ \ j = 1,2,\ldots,q \end{aligned}\tag{17}\end{gather}
\begin{gather} \begin{aligned} u_j(k) = \sum\limits_{i = 1}^q a_{h2i}(k)\cdot W_{a2ij}(k), \ \ \ \ \ \ j = 1,2,\ldots,m \end{aligned}\tag{18}\end{gather}
where $a_{h1j}(k)$ is the input to the $j$-th hidden-layer node of the action network and $a_{h2j}(k)$ is the output of the $j$-th hidden-layer node. The action network is trained with the objective of minimizing $\hat J(k)$, again using gradient descent.
\begin{gather} \begin{aligned} \Delta W_a = l_a(k) \cdot \left[ - \frac{\partial \hat J(k)}{\partial W_a(k)} \right] = - l_a(k) \cdot \frac{\partial \hat J(k)}{\partial u(k)}\frac{\partial u(k)}{\partial W_a(k)} \end{aligned}\tag{19}\end{gather}
\begin{gather} \begin{aligned} \frac{\partial \hat J(k)}{\partial u(k)} = \frac{\partial U(k)}{\partial u(k)} + \gamma \frac{\partial \hat J(k + 1)}{\partial u(k)} \end{aligned}\tag{20}\end{gather}
where the value of $\frac{\partial U(k)}{\partial u(k)}$ depends on how the utility function is defined, which in turn depends on the specific controlled system. If the utility function here is defined as a quadratic form, i.e.:
\begin{gather} \begin{aligned} U(k) = x^T(k)Ax(k) + u^T(k)Bu(k) \end{aligned}\tag{21}\end{gather}
where $A$ and $B$ are the $n \times n$ and $m \times m$ identity matrices, respectively, then
$\frac{\partial U(k)}{\partial u(k)} = 2u(k)$, and hence:
\begin{gather} \begin{aligned} \frac{\partial \hat J(k)}{\partial u(k)} &= \frac{\partial U(k)}{\partial u(k)} + \gamma \frac{\partial \hat J(k + 1)}{\partial u(k)}\\ &= 2u(k) + \gamma \frac{\partial \hat J(k + 1)}{\partial u(k)} \end{aligned}\tag{22}\end{gather}
The action network weight updates are derived along the same lines (the full derivation runs to several pages and is omitted here).
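Since the detailed actor derivation is omitted, the following Python/NumPy sketch condenses it under the same design assumptions as the critic above (bipolar sigmoid hidden layer, linear output). The caller must supply $\partial \hat J(k)/\partial u(k)$ from Eq. (22); all names are illustrative and this is not the original MATLAB code:

```python
import numpy as np

class ActionNetwork:
    """n-q-m action network: bipolar-sigmoid hidden layer, linear output (Eqs. 16-18)."""

    def __init__(self, n, q, m, la=0.02, seed=1):
        rng = np.random.default_rng(seed)
        self.Wa1 = rng.uniform(-0.5, 0.5, size=(n, q))   # input -> hidden weights
        self.Wa2 = rng.uniform(-0.5, 0.5, size=(q, m))   # hidden -> output weights
        self.la = la                                     # learning rate l_a(k)

    def forward(self, x):
        ah1 = x @ self.Wa1                               # Eq. (16)
        ah2 = (1 - np.exp(-ah1)) / (1 + np.exp(-ah1))    # Eq. (17): bipolar sigmoid
        return ah2 @ self.Wa2, ah2                       # Eq. (18): u(k)

    def update(self, x, dJ_du):
        """Gradient step on J_hat(k): Delta W_a = -l_a * dJ_hat/du * du/dW_a (Eq. 19)."""
        _, ah2 = self.forward(x)
        dWa2 = -self.la * np.outer(ah2, dJ_du)           # du_j/dWa2[i,j] = ah2[i]
        # back-propagate dJ_hat/du through the linear output and the bipolar sigmoid
        delta_hidden = (self.Wa2 @ dJ_du) * 0.5 * (1 - ah2**2)
        dWa1 = -self.la * np.outer(x, delta_hidden)
        self.Wa2 += dWa2
        self.Wa1 += dWa1
```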
III. Example implementation based on the MATLAB Neural Network Toolbox
Example: consider the following discrete-time linear system:
\begin{gather} \begin{aligned} x_{k+1}=Ax_{k}+Bu_{k} \end{aligned}\tag{23}\end{gather}
where $x_{k}=[x_{1k}, x_{2k}]^T$ and $u \in R^1$, with $A=\begin{bmatrix}0 & 0.1\\ 0.3 & -1\end{bmatrix}$, $B=\begin{bmatrix}0\\ 0.5\end{bmatrix}$, and initial state $x_{0}=[1, -1]^T$. The performance index follows Eq. (2), i.e., $U(x_k, u_k) = x_k^TQx_k + u_k^TRu_k$, where $Q=I$, $R=0.5I$, and $I$ is the identity matrix.
Both policy iteration and value iteration are implemented with neural networks. In this example, the critic network and the actor network are three-layer BP neural networks with 2-8-1 and 2-8-1 structures, respectively. At each iteration step, the critic network and the action network are trained for 80 steps with a learning rate of $\alpha = 0.02$, until the network training error falls below $10^{-5}$.
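The original MATLAB code is not reproduced here. As a rough indication of how the pieces fit together, the following Python/NumPy sketch wires the `CriticNetwork` and `ActionNetwork` classes sketched above into a simple online, heuristic-dynamic-programming-style training loop for this example. The 2-8-1 structures and the 0.02 learning rate follow the text, but the discount factor, the iteration count, the single update per time step, and the absence of convergence safeguards are simplifying assumptions made for illustration:

```python
import numpy as np

# system and cost data from the example (Eq. 23)
A = np.array([[0.0, 0.1], [0.3, -1.0]])
B = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = 0.5 * np.eye(1)
gamma = 0.95                  # assumed; the post does not state the discount factor
x = np.array([1.0, -1.0])     # initial state x_0

critic = CriticNetwork(n=2, p=8, lc=0.02)       # 2-8-1 critic
actor = ActionNetwork(n=2, q=8, m=1, la=0.02)   # 2-8-1 action network

def critic_grad_x(net, x):
    """dJ_hat/dx for the n-p-1 critic, by the chain rule through Eqs. (5)-(7)."""
    ch1 = x @ net.Wc1
    ch2 = (1 - np.exp(-ch1)) / (1 + np.exp(-ch1))
    return net.Wc1 @ (net.Wc2 * 0.5 * (1 - ch2**2))

for k in range(100):                          # simplified online training loop
    u, _ = actor.forward(x)                   # control from the action network
    x_next = A @ x + B @ u                    # plant, Eq. (23)
    U_k = x @ Q @ x + u @ R @ u               # one-step utility U(k)

    # critic update: drive e_c(k) = J_hat(k) - U(k) - gamma*J_hat(k+1) to zero
    j_k, _ = critic.forward(x)
    j_next, _ = critic.forward(x_next)
    e_c = j_k - U_k - gamma * j_next
    critic.update(x, e_c)

    # actor update: dJ_hat(k)/du = dU/du + gamma * B^T dJ_hat(k+1)/dx(k+1), cf. Eq. (20)
    dJ_du = 2.0 * (R @ u) + gamma * (B.T @ critic_grad_x(critic, x_next))
    actor.update(x, dJ_du)

    x = x_next
```

With $Q=I$ and $R=0.5I$, $\partial U/\partial u = 2Ru = u(k)$; the $2u(k)$ form in Eq. (22) corresponds to the special case where the control weighting is the identity matrix.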

The complete code is available at this link: I posted [Adaptive dynamic programming code! ADP, the best introductory code, easy to understand; includes both value iteration and policy iteration] on Xianyu.
This concludes the mathematical derivation of adaptive dynamic programming and the accompanying example.
Typing out all these formulas was not easy, so please give this post a like, a favorite, and a share. Thank you! Feel free to bookmark it so you can look up the formulas whenever you need them later.