Neural-Network-Based Adaptive Optimal Control

Paper: "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems", Draguna Vrabie, Frank Lewis

The nonlinear optimal control problem:

1. Problem formulation:

System dynamics:

$$\dot{x}(t)=f(x(t))+g(x(t))u(x(t)) ;\quad x(0)=x_0 \quad(1)$$

Assumption: $f(x)+g(x)u$ is Lipschitz continuous on a set $\Omega$ containing the origin, and the system is stabilizable.

Define the infinite-horizon integral cost:

$$V^u(x(t))=\int_t^\infty r(x(\tau),u(\tau))\,d\tau,\quad r(x,u)=Q(x)+u^TRu \quad(2)$$

$Q(x)$ is a positive definite function and $R$ is a positive definite matrix.

An admissible control $\mu\in\Psi(\Omega)$ satisfies: $\mu(x)$ is continuous on $\Omega$, $\mu(0)=0$, $\mu(x)$ stabilizes the system (1) on $\Omega$, and $V(x_0)$ is finite for all $x_0\in\Omega$.

For any $\mu\in\Psi(\Omega)$, the corresponding cost function is

$$V^\mu(x(t))=\int_t^\infty r(x(\tau),\mu(x(\tau)))\,d\tau\quad(3)$$

When $V^\mu(x)$ has a continuous first derivative, i.e. $V^\mu(x)\in C^1$, (3) can be written in differential form:

$$0=r(x,\mu(x))+(\nabla V_x^\mu)^T(f(x)+g(x)\mu(x)), \quad V^\mu(0)=0 \quad(4)$$
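The step from (3) to (4) can be made explicit. The following is a brief sketch, assuming $V^\mu\in C^1$ and that $r$ is continuous along the closed-loop trajectory; the interval length $T>0$ is introduced here only for illustration.

$$\begin{aligned}
V^\mu(x(t)) &= \int_t^{t+T} r(x(\tau),\mu(x(\tau)))\,d\tau + V^\mu(x(t+T)) \\
\frac{V^\mu(x(t+T))-V^\mu(x(t))}{T} &= -\frac1T\int_t^{t+T} r(x(\tau),\mu(x(\tau)))\,d\tau \\
\frac{d}{dt}V^\mu(x(t)) &= -r(x(t),\mu(x(t))) \qquad (T\to0) \\
(\nabla V_x^\mu)^T\big(f(x)+g(x)\mu(x)\big) &= -r(x,\mu(x)) \qquad \text{(chain rule with } \dot x=f+g\mu\text{)}
\end{aligned}$$

Moving $r$ to the left-hand side of the last line gives (4).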

Optimal control problem:

Given the continuous-time system (1), the set of admissible controls $\Psi(\Omega)$, and the infinite-horizon cost $V^\mu$ in (2), find the optimal control law $\mu^*$ that minimizes (2).

2. Policy iteration algorithm:

1. Policy evaluation

$$V^{\mu^{(i)}}(x(t))=\int_t^{t+T}r(x(\tau),\mu^{(i)}(x(\tau)))\,d\tau+V^{\mu^{(i)}}(x(t+T)), \quad V^{\mu^{(i)}}(0)=0 \quad (9)$$

2. Policy improvement

$$\mu^{(i+1)}(x)=-\frac12R^{-1}g^T(x)\nabla V_x^{\mu^{(i)}} \quad(11)$$

Lemma 1. Solving (9) for $V^{\mu^{(i)}}$ is equivalent to solving

$$0=r(x,\mu^{(i)}(x))+(\nabla V_x^{\mu^{(i)}})^T(f(x)+g(x)\mu^{(i)}(x)), \quad V^{\mu^{(i)}}(0)=0 \quad(12)$$
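The form of the policy improvement step (11) can be motivated by a pointwise minimization. As a sketch, writing $V$ for $V^{\mu^{(i)}}$ and treating $u$ as a free vector at a fixed state $x$ (the symbol $H$ is introduced here only for illustration), minimize the right-hand side of (12) over $u$:

$$\begin{aligned}
H(x,u) &= Q(x)+u^TRu+(\nabla V_x)^T\big(f(x)+g(x)u\big) \\
\frac{\partial H}{\partial u} &= 2Ru+g^T(x)\nabla V_x = 0 \\
u &= -\frac12R^{-1}g^T(x)\nabla V_x
\end{aligned}$$

Since $\partial^2H/\partial u^2=2R$ is positive definite, this stationary point is the minimizer, and it matches (11). Note that $f(x)$ drops out of the derivative, so only $g(x)$ appears in the update.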

3. Neural-network approximation of the cost function:

$$V^{\mu^{(i)}}(x)=\sum_{j=1}^Lw_j^{\mu^{(i)}}\varphi_j(x)=(\omega_L^{\mu^{(i)}})^T\varphi_L(x) \quad (14)$$

Substituting into (9) gives:

$$(\omega_L^{\mu^{(i)}})^T\varphi_L(x(t))=\int_t^{t+T}r(x(\tau),\mu^{(i)}(x(\tau)))\,d\tau+(\omega_L^{\mu^{(i)}})^T\varphi_L(x(t+T)) \quad (16)$$

The residual is:

$$\delta_L^{\mu^{(i)}}(x(t),T)=\int_t^{t+T} r(x(\tau),\mu^{(i)}(x(\tau)))\,d\tau+(\omega_L^{\mu^{(i)}})^T[\varphi_L(x(t+T))-\varphi_L(x(t))] \quad(17)$$

Using the method of least squares, minimize

$$S=\int_\Omega\delta_L^{\mu^{(i)}}(x,T)\,\delta_L^{\mu^{(i)}}(x,T)\,dx \quad(18)$$

Setting the derivative with respect to the weights to zero:

$$\int_\Omega\frac{d\delta_L^{\mu^{(i)}}(x,T)}{d\omega_L^{\mu^{(i)}}}\,\delta_L^{\mu^{(i)}}(x,T)\,dx=0$$

Written as an inner product in the Lebesgue-integral sense:

$$\left\langle\frac{d\delta_L^{\mu^{(i)}}(x,T)}{d\omega_L^{\mu^{(i)}}},\delta_L^{\mu^{(i)}}(x,T)\right\rangle_\Omega=0\quad(19)$$

Combining (17) and (19):

$$\begin{aligned} &\left\langle[\varphi_L(x(t+T))-\varphi_L(x(t))],[\varphi_L(x(t+T))-\varphi_L(x(t))]^T\right\rangle_\Omega\omega_L^{\mu^{(i)}}\\ &+\left\langle[\varphi_L(x(t+T))-\varphi_L(x(t))],\int_t^{t+T} r(x(\tau),\mu^{(i)}(x(\tau)))\,d\tau\right\rangle_\Omega=0\quad(20) \end{aligned}$$

Assuming $\Phi=\left\langle[\varphi_L(x(t+T))-\varphi_L(x(t))],[\varphi_L(x(t+T))-\varphi_L(x(t))]^T\right\rangle_\Omega$ is invertible:

$$\omega_L^{\mu^{(i)}}=-\Phi^{-1}\left\langle[\varphi_L(x(t+T))-\varphi_L(x(t))],\int_t^{t+T} r(x(\tau),\mu^{(i)}(x(\tau)))\,d\tau\right\rangle_\Omega\quad(21)$$
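In a numerical implementation the inner products over $\Omega$ in (21) are typically replaced by sums over a finite set of sampled state pairs. The MATLAB sketch below illustrates that sampled form on synthetic data. The weights `w_true`, the sample points, and all variable names are assumptions made for this illustration and are not taken from the paper; the data are generated so that (16) holds exactly, so the solve should recover `w_true`.

```matlab
% Sketch: sampled version of (21) with a quadratic basis (L = 3).
% All numbers below are illustrative assumptions, not data from the paper.
w_true = [0.5; 0; 1];                         % weights used to synthesize data
phi = @(x) [x(1)^2; x(1)*x(2); x(2)^2];       % basis vector phi_L(x)

N = 20;                                       % number of sampled state pairs
dPhi = zeros(N, 3);                           % rows: (phi(x(t+T)) - phi(x(t)))'
d    = zeros(N, 1);                           % stand-in for the integral of r
xs = linspace(-1, 1, N);
for k = 1:N
    xa = [xs(k); cos(3*k)];                   % stand-in for x(t)
    xb = 0.8*xa + [0.05; -0.02*k/N];          % stand-in for x(t+T)
    dPhi(k, :) = (phi(xb) - phi(xa))';
    d(k) = w_true' * (phi(xa) - phi(xb));     % makes (16) hold exactly
end

Phi = dPhi' * dPhi;                           % sampled Phi, assumed invertible
w_hat = -(Phi \ (dPhi' * d));                 % sampled form of (21)
disp(w_hat');
```

Using the backslash operator rather than forming $\Phi^{-1}$ explicitly is a numerical design choice; with real trajectory data `d` would hold the measured cost integrals over each interval.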

4. Online algorithm based on the actor/critic structure:

Algorithm structure:
[Figure: algorithm structure]
Algorithm flowchart:
[Figure: algorithm flowchart]


Example 1:

Consider the following dynamical system:
$$\left\{\begin{aligned} &\dot{x}_1= -x_1+x_2\\ &\dot{x}_2= f(x)+g(x)u \end{aligned}\right.$$
where $f(x)= -\frac12(x_1+x_2)+\frac12x_2\sin^2(x_1)$ and $g(x)=\sin(x_1)$.

Define the infinite-horizon cost function $V^u(x(t))=\int_t^\infty(Q(x)+u^2)\,d\tau$, with $Q(x)=x_1^2+x_2^2$.

For all $x\in\Omega$, $V^{\mu^{(i)}}(x)$ is approximated by the following smooth function:

$V_L^{\mu^{(i)}}(x)=(\omega_L^{\mu^{(i)}})^T\varphi_L(x)$, with $L=3$,

$\omega_3^{\mu^{(i)}}=[w_1^{\mu^{(i)}}\,\,w_2^{\mu^{(i)}}\,\,w_3^{\mu^{(i)}}]^T$ and $\varphi_3(x)=[x_1^2\,\,x_1x_2\,\,x_2^2]^T$.
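With this basis the policy improvement step (11) has a closed form. As a sketch, assuming the input enters only the second state equation, so that the input vector is $[0\ \ \sin(x_1)]^T$, and $R=1$ (superscripts on the weights are dropped for readability):

$$\begin{aligned}
\nabla V_x &= \left(\frac{\partial\varphi_3}{\partial x}\right)^T\omega_3
= \begin{bmatrix} 2x_1 & x_2 & 0\\ 0 & x_1 & 2x_2 \end{bmatrix}
\begin{bmatrix} w_1\\ w_2\\ w_3 \end{bmatrix}
= \begin{bmatrix} 2w_1x_1+w_2x_2\\ w_2x_1+2w_3x_2 \end{bmatrix} \\
\mu^{(i+1)}(x) &= -\frac12\begin{bmatrix} 0 & \sin(x_1) \end{bmatrix}\nabla V_x
= -\frac12\sin(x_1)\,(w_2x_1+2w_3x_2)
\end{aligned}$$

Only $w_2$ and $w_3$ enter the control, because the first component of the input vector is zero.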

The weights $w$ are updated according to (21).
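A sanity check on what the weights might be expected to approach. Assuming $f(x)=-\frac12(x_1+x_2)+\frac12x_2\sin^2(x_1)$ (the form implemented in `odefile` in the code that follows), input vector $[0\ \ \sin(x_1)]^T$ and $R=1$, one can verify by substitution that $V(x)=\frac12x_1^2+x_2^2$ together with $\mu(x)=-\sin(x_1)x_2$ is a fixed point of the iteration (9), (11):

$$\begin{aligned}
\nabla V_x &= [x_1\ \ 2x_2]^T,\qquad -\tfrac12\sin(x_1)\cdot2x_2=-\sin(x_1)x_2=\mu(x) \\
r(x,\mu(x)) &= x_1^2+x_2^2+x_2^2\sin^2(x_1) \\
(\nabla V_x)^T\dot{x} &= x_1(-x_1+x_2)+2x_2\left(-\tfrac12(x_1+x_2)-\tfrac12x_2\sin^2(x_1)\right) \\
&= -x_1^2-x_2^2-x_2^2\sin^2(x_1)
\end{aligned}$$

The last two lines sum to zero, so (12) holds, and the first line shows that (11) returns the same policy. In the basis $\varphi_3$ this $V$ corresponds to $\omega_3=[0.5\ \ 0\ \ 1]^T$, so weights close to these values are the outcome one would hope for if the iteration converges.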

The MATLAB code is as follows:

function odestart
clear all;close all;clc;

global P;          % critic weights w
global Target;  % accumulated cost integral
global v;

figure;hold on;

%initializations
%iteration step
j=0;
%initial state
x0=[1 1 0];
% P gives the controller parameters
P=[-1 3 1.5];

Target=0;
vv=[];
C=[];

T=0.1;      % sampling interval
Fsamples=150;   % total number of samples
nop=30;         % number of samples per weight update

%next WW gives the initial stabilizing controller
WW=zeros(length(P),1+Fsamples/nop);     % history of the weights w
WW(:,1)=P'; 

for k=1:Fsamples
    j=j+1;
    % simulation of the system to get the measurements
    tspan=[0 T];
    [t,x]= ode23(@odefile,tspan,x0);
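    x1=x(end,1:2);      % state at the end of the interval, x(t+T)
    % regressor row phi(x(t)) - phi(x(t+T)), cf. (16)
    X(j,:)=[x0(1)^2 x0(1)*x0(2) x0(2)^2] - [x1(1)^2 x1(1)*x1(2) x1(2)^2];
    
    Target=x(end,3);    % integral of r over [t, t+T]
    Y(j,:)=Target;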
    
    % for each weight update the initial state is re-randomized five times,
    % i.e. five state trajectories are sampled
    if mod(k,nop/5)==0
        x0=[2*(rand(1,2)-1/2) 0];
    else
        x0=[x1 0];
    end

    plot(t+T*(k-1),x(:,1));
    vv=[vv v];      % record the control signal

    % update w once every nop samples (batch least squares)
    if mod(k,nop)==0
        weights=X\Y;
        %calculating the matrix P
        P=[weights(1) weights(2) weights(3)];
        WW(:,k/nop+1)=[weights(1) weights(2) weights(3)]';
        X=zeros(nop,3);
        Y=zeros(nop,1);
        j=0;
        x0=[0.5*(rand(1,2)-1/2) 0];
    end

end
P=[weights(1) weights(2) weights(3)]

title('System states'); xlabel('Time (s)');

figure; plot([0:T:T*(length(vv)-1)],vv); title('Control signal'); xlabel('Time (s)')

figure; % in this figure we plot the neural network parameters at each iteration step in the policy iteration
WW % the matrix of parameters is printed in the command window
ss=size(WW);
plot((0:T*nop:T*Fsamples),WW(1,1:ss(2))','.-');hold on
plot((0:T*nop:T*Fsamples),WW(2,1:ss(2))','*:'); hold on
plot((0:T*nop:T*Fsamples),WW(3,1:ss(2))','o--'); 
legend('w_1','w_2 ','w_3');
title('W  parameters'); xlabel('Time (s)'); %hold on; plot(T*(Fsamples+1),WW(:,length(WW))','*'); title('W  parameters');

%-------------------------------------------------------------------------------------
function xdot=odefile(t,x);
global P;
global v;

Q=[1 0; 0 1];R=1;
x=[x(1) x(2)]';

%calculating the control signal
% P are the parameters of the critic
% policy improvement (11) with input vector [0; sin(x1)]
v=-1/2*(R\(sin(x(1))*P*[0; x(1); 2*x(2)]));

% closed-loop dynamics, augmented with the running cost r(x,v) as a third state
xdot=[[-1 1; -1/2 -1/2]*x+[0; 1/2*x(2)*(sin(x(1))^2)+sin(x(1))*v];   x'*Q*x+v'*R*v];
%-------------------------------------------------------------------------------------

Results:
[Figures: simulation result plots]
