Adaptive Dynamic Programming (ADP) with an Actor-Critic (Critic-Action) Network: this one article is all you need [a very detailed, carefully reasoned derivation you can actually follow]


Adaptive Dynamic Programming (ADP) approximates the true solution of dynamic programming through successive iteration, and thereby gradually approaches the optimal control solution of a nonlinear system.

I. Structure and Basic Principles of ADP

1. Basic Structure of ADP

Consider the discrete-time nonlinear dynamic system

$$
x(k+1) = f[x(k), u(k), k], \qquad k = 0, 1, \ldots \tag{1}
$$

where $x \in \mathbb{R}^n$ is the system state vector, $u \in \mathbb{R}^m$ is the control action, and $f$ is the system function. The performance index (or cost) associated with this system at time $k$ is usually taken as the quadratic cost

$$
J[x(k), k] = \sum_{i=k}^{\infty} \gamma^{i-k}\left(x(i)^T Q\, x(i) + u(i)^T R\, u(i)\right) \tag{2}
$$

where $Q \in \mathbb{R}^{n \times n}$ is a positive-definite state weighting matrix, $R \in \mathbb{R}^{m \times m}$ is a positive-definite control weighting matrix, and $\gamma$ is the discount factor, $0 < \gamma \le 1$, which emphasizes near-term cost. The goal of dynamic programming is to choose a control sequence $u(i),\ i = k, k+1, \ldots$ that minimizes the cost $J$ defined in eq. (2).
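To make eq. (2) concrete, the following is a minimal MATLAB sketch that evaluates the one-step utility and a truncated approximation of the discounted cost along a stored trajectory. The weight matrices, discount factor, horizon N, and the dummy trajectory are illustrative choices, not values taken from this article.

```matlab
% Minimal sketch: one-step utility and a truncated version of the
% discounted cost in eq. (2). All numbers below are illustrative.
n = 2; m = 1; N = 100;                 % state/control dimensions, horizon
Q = eye(n); R = 0.5*eye(m); gamma = 0.95;
xs = randn(n, N); us = randn(m, N);    % placeholder state/control trajectory

J = 0;
for i = 1:N                            % truncation of the infinite-horizon sum
    U = xs(:,i)'*Q*xs(:,i) + us(:,i)'*R*us(:,i);   % one-step utility U(i)
    J = J + gamma^(i-1) * U;           % discounted accumulation from step k
end
```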

The basic structure of ADP is shown in Figure 1 (the dashed lines indicate the paths along which the networks are updated):

Figure 1: Schematic of the ADP structure

2. Basic Principle of ADP

2.1 Critic Network

The output $\hat J$ of the critic network is an estimate of the function $J$ given by eq. (2). This is achieved by minimizing, over time, the error

$$
\left\| E_c \right\| = \sum_k E_c(k) = \frac{1}{2}\sum_k \left[\hat J(k) - U(k) - \gamma \hat J(k+1)\right]^2 \tag{3}
$$

where $\hat J(k) = \hat J[x(k), u(k), k, W_c]$ and $W_c$ denotes the parameters of the critic network. The function $U(k) = x(k)^T Q\, x(k) + u(k)^T R\, u(k)$ is exactly the utility function appearing in eq. (2); note that $U(k)$ is the utility of the single time step $k$, not the accumulated cost from $k$ to infinity. When $E_c(k) = 0$ for all $k$, eq. (3) implies

$$
\begin{aligned}
\hat J(k) &= U(k) + \gamma \hat J(k+1) \\
          &= U(k) + \gamma\left[U(k+1) + \gamma \hat J(k+2)\right] \\
          &= \cdots \\
          &= \sum_{i=k}^{\infty} \gamma^{i-k} U(i)
\end{aligned} \tag{4}
$$

which is exactly the cost function defined in eq. (2). Therefore, by minimizing the error function defined in eq. (3) we obtain a trained neural network whose output $\hat J$ is an estimate of the cost function $J$ defined in eq. (2).
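As a small, self-contained illustration of eq. (3), the snippet below forms the bootstrapped target $U(k) + \gamma \hat J(k+1)$ and the instantaneous error $E_c(k)$. The quadratic `critic` handle and the dummy transition are placeholders so the snippet runs on its own; they are not part of the original implementation.

```matlab
% Illustrative only: the critic output is pulled toward U(k) + gamma*Jhat(k+1).
Q = eye(2); R = 0.5; gamma = 0.95;
critic = @(x) x'*x;                         % placeholder value estimate Jhat(x)
x = [1; -1]; u = 0.1; xnext = 0.9*x;        % dummy one-step transition
U  = x'*Q*x + u'*R*u;                       % one-step utility U(k)
ec = critic(x) - (U + gamma*critic(xnext)); % error term inside eq. (3)
Ec = 0.5 * ec^2;                            % instantaneous error E_c(k)
```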

2.2 Action Network

The action network is trained through the control signal $u(k) = u[x(k), k, W_a]$ (where $W_a$ denotes the parameters of the action network), with the objective of minimizing $\hat J(k)$. In other words, by training the action network to minimize the critic network's output, we obtain a trained network that produces an optimal or near-optimal control signal.

II. Design and Update of the Critic and Action (Actor-Critic) Networks

1. Critic Network Design

The critic network takes the current system state as its input and outputs the cost value. Hence, for a system with an n-dimensional state, it uses a structure with n input neurons, p hidden-layer neurons, and one output neuron. The n inputs are the n components of the state vector; the output is an estimate of the optimal performance index corresponding to the input state. The hidden layer of the critic network uses the bipolar sigmoid activation function (other activation functions could also be used), and the output layer uses the linear function purelin. The structure of the critic network is shown in Figure 2.
Figure 2: Structure of the critic network

Training the critic network consists of a forward computation pass and a backward error-propagation pass. The forward computation of the critic network is

$$
c_{h1j}(k) = \sum_{i=1}^{n} \hat x_i(k)\, W_{c1ij}(k), \qquad j = 1, 2, \ldots, p \tag{5}
$$

$$
c_{h2j}(k) = \frac{1 - e^{-c_{h1j}(k)}}{1 + e^{-c_{h1j}(k)}}, \qquad j = 1, 2, \ldots, p \tag{6}
$$

$$
\hat J(k) = \sum_{j=1}^{p} c_{h2j}(k)\, W_{c2j}(k) \tag{7}
$$

where $c_{h1j}(k)$ is the input of the $j$-th hidden node of the critic network and $c_{h2j}(k)$ is the output of the $j$-th hidden node.
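Eqs. (5)-(7) can be written compactly in matrix form. Below is a minimal MATLAB sketch of the critic forward pass, under the assumed storage convention that `Wc1` is an n-by-p matrix whose (i, j) entry is $W_{c1ij}$ and `Wc2` is a p-by-1 vector; the function name and shapes are illustrative, not the article's code.

```matlab
% Minimal sketch of the critic forward pass, eqs. (5)-(7).
% Assumed shapes: x is n x 1, Wc1 is n x p, Wc2 is p x 1.
function [Jhat, ch2] = critic_forward(x, Wc1, Wc2)
    ch1  = Wc1' * x;                            % eq. (5): hidden-layer inputs
    ch2  = (1 - exp(-ch1)) ./ (1 + exp(-ch1));  % eq. (6): bipolar sigmoid
    Jhat = Wc2' * ch2;                          % eq. (7): linear output layer
end
```

Each of these sketches would live in its own .m file (or as a local function at the end of a script).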
The critic network is trained with gradient descent, by minimizing the error defined by

$$
\left\| E_c \right\| = \sum_k E_c(k) = \frac{1}{2}\sum_k e_c^2(k) \tag{8}
$$

$$
e_c(k) = \hat J(k) - U(k) - \gamma \hat J(k+1) \tag{9}
$$

The critic weight updates are derived as follows.
$W_{c2}$ (hidden-to-output weight matrix):

$$
\begin{aligned}
\Delta W_{c2j}(k) &= l_c(k)\left[-\frac{\partial E_c(k)}{\partial W_{c2j}(k)}\right] \\
&= l_c(k)\left[-\frac{\partial E_c(k)}{\partial \hat J(k)}\frac{\partial \hat J(k)}{\partial W_{c2j}(k)}\right] \\
&= -l_c(k)\, e_c(k)\, c_{h2j}(k)
\end{aligned} \tag{10}
$$

$$
\Delta W_{c2}(k) = -l_c(k)\, e_c(k)\, c_{h2}^T(k) \tag{11}
$$

$$
W_{c2}(k+1) = W_{c2}(k) + \Delta W_{c2}(k) \tag{12}
$$

$W_{c1}$ (input-to-hidden weight matrix):

$$
\begin{aligned}
\Delta W_{c1ij}(k) &= l_c(k)\left[-\frac{\partial E_c(k)}{\partial W_{c1ij}(k)}\right] \\
&= l_c(k)\left[-\frac{\partial E_c(k)}{\partial \hat J(k)}\frac{\partial \hat J(k)}{\partial c_{h2j}(k)}\frac{\partial c_{h2j}(k)}{\partial c_{h1j}(k)}\frac{\partial c_{h1j}(k)}{\partial W_{c1ij}(k)}\right] \\
&= -l_c(k)\, e_c(k)\, W_{c2j}(k)\, \tfrac{1}{2}\left[1 - c_{h2j}^2(k)\right] \hat x_i(k)
\end{aligned} \tag{13}
$$

$$
\Delta W_{c1}(k) = -\frac{1}{2}\, l_c(k)\, e_c(k)\, \hat x^T(k) \times \left\{ W_{c2}^T(k) \otimes \left[1 - c_{h2}(k) \otimes c_{h2}(k)\right] \right\} \tag{14}
$$

where $\otimes$ denotes the element-wise product, and

$$
W_{c1}(k+1) = W_{c1}(k) + \Delta W_{c1}(k) \tag{15}
$$
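Stacking eqs. (10)-(15) gives one gradient-descent step for the critic. The sketch below reuses `critic_forward` from above and keeps the same shape assumptions; `Jhat_next` is the critic's evaluation of the successor state x(k+1), and `lc` is the learning rate $l_c(k)$.

```matlab
% Minimal sketch of one critic update, eqs. (8)-(15).
function [Wc1, Wc2] = critic_update(x, Wc1, Wc2, U, Jhat_next, gamma, lc)
    [Jhat, ch2] = critic_forward(x, Wc1, Wc2);  % forward pass, eqs. (5)-(7)
    ec = Jhat - (U + gamma * Jhat_next);        % prediction error, eq. (9)

    dWc2 = -lc * ec * ch2;                      % eq. (11), stored as p x 1
    % eq. (14): gradient w.r.t. the input-to-hidden weights (n x p)
    dWc1 = -0.5 * lc * ec * x * (Wc2 .* (1 - ch2.^2))';

    Wc2 = Wc2 + dWc2;                           % eq. (12)
    Wc1 = Wc1 + dWc1;                           % eq. (15)
end
```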

2. Action Network Design

The action network uses a structure with n input neurons, q hidden-layer neurons, and m output neurons. The n inputs are the n components of the system state vector $x(k)$ at time k; the m outputs are the m components of the control vector $u(k)$ corresponding to the input state $x(k)$. The hidden layer of the action network uses the bipolar sigmoid activation function, and the output layer uses the linear function purelin. The structure of the action network is shown in Figure 3:

Figure 3: Structure of the action network

As with the critic, training the action network consists of a forward computation pass and a backward error-propagation pass. The forward computation of the action network is

$$
a_{h1j}(k) = \sum_{i=1}^{n} \hat x_i(k)\, W_{a1ij}(k), \qquad j = 1, 2, \ldots, q \tag{16}
$$

$$
a_{h2j}(k) = \frac{1 - e^{-a_{h1j}(k)}}{1 + e^{-a_{h1j}(k)}}, \qquad j = 1, 2, \ldots, q \tag{17}
$$

$$
u_j(k) = \sum_{i=1}^{q} a_{h2i}(k)\, W_{a2ij}(k), \qquad j = 1, 2, \ldots, m \tag{18}
$$

where $a_{h1j}(k)$ is the input of the $j$-th hidden node of the action network and $a_{h2j}(k)$ is the output of the $j$-th hidden node.
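Analogously to the critic, eqs. (16)-(18) give the action network's forward pass in matrix form. The sketch below assumes `Wa1` is n-by-q and `Wa2` is q-by-m; names and shapes are again illustrative.

```matlab
% Minimal sketch of the action network forward pass, eqs. (16)-(18).
% Assumed shapes: x is n x 1, Wa1 is n x q, Wa2 is q x m, u is m x 1.
function [u, ah2] = actor_forward(x, Wa1, Wa2)
    ah1 = Wa1' * x;                             % eq. (16): hidden-layer inputs
    ah2 = (1 - exp(-ah1)) ./ (1 + exp(-ah1));   % eq. (17): bipolar sigmoid
    u   = Wa2' * ah2;                           % eq. (18): linear output layer
end
```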
The action network is trained to minimize $\hat J(k)$, again using gradient descent:

$$
\Delta W_a = l_a(k)\left[-\frac{\partial \hat J(k)}{\partial W_a(k)}\right] = -l_a(k)\,\frac{\partial \hat J(k)}{\partial u(k)}\frac{\partial u(k)}{\partial W_a(k)} \tag{19}
$$

$$
\frac{\partial \hat J(k)}{\partial u(k)} = \frac{\partial U(k)}{\partial u(k)} + \gamma\,\frac{\partial \hat J(k+1)}{\partial u(k)} \tag{20}
$$

Here the value of $\partial U(k)/\partial u(k)$ depends on how the utility function is defined, which in turn depends on the specific plant. If the utility function is taken to be quadratic, i.e.

$$
U(k) = x(k)^T A\, x(k) + u(k)^T B\, u(k) \tag{21}
$$

where $A$ and $B$ are the $n \times n$ and $m \times m$ identity matrices, respectively (they play the roles of $Q$ and $R$ in eq. (2)), then $\partial U(k)/\partial u(k) = 2u(k)$, so

$$
\begin{aligned}
\frac{\partial \hat J(k)}{\partial u(k)} &= \frac{\partial U(k)}{\partial u(k)} + \gamma\,\frac{\partial \hat J(k+1)}{\partial u(k)} \\
&= 2u(k) + \gamma\,\frac{\partial \hat J(k+1)}{\partial u(k)}
\end{aligned} \tag{22}
$$

The full derivation of the action-network weight updates runs to several pages and is omitted here; a compact completion is sketched below.
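The sketch below is my own completion under explicit assumptions, not the author's omitted pages: the plant model $x(k+1) = Ax(k) + Bu(k)$ is assumed known, so $\partial x(k+1)/\partial u(k) = B$; $\partial \hat J(k+1)/\partial x(k+1)$ is obtained by backpropagating through the critic; and `R` is the control weighting matrix (with $R = I$ this reproduces the $2u(k)$ term of eq. (22)).

```matlab
% Sketch of one action-network update via the chain rule in eqs. (19)-(22).
% Assumes a known linear plant x(k+1) = A*x + B*u and reuses the sketches above.
function [Wa1, Wa2] = actor_update(x, Wa1, Wa2, Wc1, Wc2, A, B, R, gamma, la)
    [u, ah2] = actor_forward(x, Wa1, Wa2);      % eqs. (16)-(18)
    xnext = A*x + B*u;                          % model-predicted next state

    % dJhat(k+1)/dx(k+1): backpropagate through the critic at x(k+1)
    ch1 = Wc1' * xnext;
    ch2 = (1 - exp(-ch1)) ./ (1 + exp(-ch1));
    dJdx_next = Wc1 * (Wc2 .* 0.5 .* (1 - ch2.^2));

    % eq. (22): dJhat(k)/du(k) = 2*R*u(k) + gamma * dJhat(k+1)/du(k)
    dJdu = 2*R*u + gamma * (B' * dJdx_next);

    % eq. (19): backpropagate dJdu through the action network and descend
    dWa2 = -la * ah2 * dJdu';                                  % q x m
    dWa1 = -la * x * ((Wa2 * dJdu) .* 0.5 .* (1 - ah2.^2))';   % n x q
    Wa2 = Wa2 + dWa2;
    Wa1 = Wa1 + dWa1;
end
```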

III. Example Implementation Based on the MATLAB Neural Network Toolbox

Example: consider the following discrete-time linear system:

$$
x_{k+1} = A x_k + B u_k \tag{23}
$$

where $x_k = [x_{1k}, x_{2k}]^T$ and $u \in \mathbb{R}^1$, with

$$
A = \begin{bmatrix} 0 & 0.1 \\ 0.3 & -1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0.5 \end{bmatrix},
$$

and initial state $x_0 = [1, -1]^T$. The cost index is given by eq. (2), with one-step utility $U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$, where $Q = I$, $R = 0.5I$, and $I$ is the identity matrix.

Both the policy-iteration and value-iteration algorithms are implemented with neural networks. In this example the critic and actor networks are three-layer BP neural networks, with structures 2-8-1 and 2-8-1 respectively. At each iteration step, the critic network and the action network are trained for 80 steps with a learning rate of α = 0.02, driving the network training error below $10^{-5}$.
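For reference, here is a simplified, self-contained training loop for this example that reuses the sketches above (`critic_forward`, `critic_update`, `actor_forward`, `actor_update`) instead of the Neural Network Toolbox. The weight initialization, the discount factor, and the outer/inner iteration counts are illustrative assumptions, and the loop is an online actor-critic style simplification rather than the exact value-iteration and policy-iteration routines in the linked code.

```matlab
% Simplified ADP training loop for the linear example (illustrative sketch).
A = [0 0.1; 0.3 -1];  B = [0; 0.5];
Q = eye(2);  R = 0.5;  gamma = 0.95;        % discount factor assumed, not given
n = 2; m = 1; p = 8; q = 8;                 % 2-8-1 critic, 2-8-1 actor
Wc1 = 0.1*randn(n,p);  Wc2 = 0.1*randn(p,1);
Wa1 = 0.1*randn(n,q);  Wa2 = 0.1*randn(q,m);
lc = 0.02;  la = 0.02;                      % learning rates (alpha = 0.02)

x = [1; -1];                                % initial state x0
for k = 1:200                               % outer simulation steps
    [u, ~]  = actor_forward(x, Wa1, Wa2);   % current control
    xnext   = A*x + B*u;                    % plant step
    U       = x'*Q*x + u'*R*u;              % one-step utility
    for step = 1:80                         % 80 inner training steps per k
        [Jnext, ~] = critic_forward(xnext, Wc1, Wc2);
        [Wc1, Wc2] = critic_update(x, Wc1, Wc2, U, Jnext, gamma, lc);
        [Wa1, Wa2] = actor_update(x, Wa1, Wa2, Wc1, Wc2, A, B, R, gamma, la);
    end
    x = xnext;                              % advance the state
end
```

In the experiments reported in this article the training continues until the network error drops below $10^{-5}$; the fixed 80 inner steps here simply mirror that schedule.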

Figure 4: Convergence of the value function

Figure 5: Comparison of the control results obtained with ADP and LQR

The complete code (covering both the value-iteration and policy-iteration versions) is available through my Xianyu listing: 【自适应动态规划代码!ADP,入门最佳代码,易懂。包括值迭代和策略迭代】.

That wraps up the mathematical derivation of adaptive dynamic programming and the worked example.
Typing out all of these formulas takes effort, so if this post helped you, please like, bookmark, and follow; feel free to save it as a reference for the formulas later on.
