Learning the MATLAB Reinforcement Learning Toolbox from Scratch (3): Creating a Simulink Environment and Training an Agent

pingping_TEL

Last modified on 2023-12-02 19:48:27


Category: Using the MATLAB Reinforcement Learning Toolbox. Tags: matlab, machine learning

First published on 2023-11-11 20:04:27

Article link: https://blog.csdn.net/pingping_tel/article/details/134340845


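Summary (generated automatically by CSDN): this article describes how to replace the controller of the water tank model in Simulink with a DDPG agent, covering the definition of the observation and action spaces, the construction of the neural networks, the training process, and a simulation that validates the trained agent.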

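Translated from the Reinforcement Learning Toolbox documentation, with my own notes added. All Simulink models used in this part are built-in models, so none of them has to be built by hand.

This example shows how to convert the PI controller in the watertank Simulink model into a reinforcement learning DDPG (deep deterministic policy gradient) agent.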

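Water Tank Model

The water tank Simulink model is shown below. The goal of the controller is to keep the water level in the tank at the reference level.
[Figure: the watertank Simulink model]
The model consists of the nonlinear Water-Tank System plant and a PI controller in a single-loop feedback system.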

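Water enters the tank from the top at a rate proportional to the voltage V applied to the pump. Water leaves through an opening in the base of the tank at a rate proportional to the square root of the water height H. The square root in the outflow rate is what makes the plant nonlinear.
[Figure: water tank diagram]
The differential equation of the system is:

$$\frac{d}{dt}Vol = A\frac{dH}{dt} = bV - a\sqrt{H}$$

where Vol is the volume of water in the tank, A is the cross-sectional area of the tank, and a and b are constants. The Water-Tank System block is built from this equation.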
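To get a feel for the plant before any controller is attached, the sketch below integrates the tank dynamics in plain MATLAB, assuming the model $A\,dH/dt = bV - a\sqrt{H}$ (inflow proportional to V, outflow proportional to the square root of H). The parameter values `a = 2`, `b = 5`, `A = 20`, the constant pump voltage, and the initial height are assumptions chosen for illustration and are not taken from this article.

```matlab
% Open-loop water tank sketch (assumed parameter values, illustration only)
a = 2;      % outflow constant (assumed)
b = 5;      % pump constant (assumed)
A = 20;     % cross-sectional area of the tank (assumed)
V = 1.2;    % constant pump voltage (assumed)

% dH/dt = (b*V - a*sqrt(H))/A; max(H,0) guards against sqrt of a negative height
dHdt = @(t,H) (b*V - a*sqrt(max(H,0)))/A;

% Integrate from an initial height of 1 over 600 s
[t,H] = ode45(dHdt,[0 600],1);

% Steady state: inflow equals outflow, so b*V = a*sqrt(Heq)
Heq = (b*V/a)^2;
fprintf("final height %.3f, predicted equilibrium %.3f\n",H(end),Heq);
```

With a constant voltage the level settles where inflow balances outflow. A controller, whether PI or a trained agent, has to pick V so that this balance point coincides with the reference level.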

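The original model is modified as follows:

Delete the PID controller;
Insert the RL Agent block;
Connect the observation vector $[\int e\,dt \quad e \quad h]^T$ , where $h$ is the height of the water in the tank, $r$ is the reference height, and $e = r - h$ ;
Set up the reward: $reward = 10(|e|<0.1) - 1(|e|\geq 0.1) - 100(h\leq 0\,||\,h\geq 20)$ , where each parenthesized condition evaluates to 1 when true and 0 when false, and $||$ is the logical OR;
Stop the simulation when $h\leq 0\,||\,h\geq 20$ .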
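The reward described above can be written as a one-line function. This is a minimal sketch for checking the arithmetic by hand; the name `rewardFcn` is hypothetical, and in the actual model the reward is computed by blocks inside Simulink rather than by this function.

```matlab
% Reward sketch: +10 inside the 0.1 error band, -1 outside it,
% and an extra -100 when the height leaves the interval (0, 20)
rewardFcn = @(e,h) 10*(abs(e) < 0.1) - 1*(abs(e) >= 0.1) ...
    - 100*(h <= 0 || h >= 20);

rInBand  = rewardFcn(0.05,10);   % small error, valid height
rOutBand = rewardFcn(1.5,8.5);   % large error, valid height
rFailure = rewardFcn(-10,20);    % height hits the upper bound
```

The two error terms are mutually exclusive, so every step earns either +10 or -1, and the -100 penalty applies only on the step that also ends the episode.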
Open the model with the following command:

open_system("rlwatertank")

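A dialog like the one below may appear when the model opens; clicking any one of its four links is fine.
[Figure: the dialog that may appear when opening the model]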

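Create the Environment

Define the observation space $[\int e\,dt \quad e \quad h]^T$ :

% Shape of the observation space and the lower/upper limit of each variable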
obsInfo = rlNumericSpec([3 1],...
    LowerLimit=[-inf -inf 0  ]',...
    UpperLimit=[ inf  inf inf]');
% Name and description are optional and not used by the software
obsInfo.Name = "observations";
obsInfo.Description = "integrated error, error, and measured height";

Define the action space:

% Action info
actInfo = rlNumericSpec([1 1]);
actInfo.Name = "flow";

Create the environment object:

env = rlSimulinkEnv("rlwatertank","rlwatertank/RL Agent",...
    obsInfo,actInfo);

Set a custom reset function that randomizes the reference water level and the initial water level of the model:

env.ResetFcn = @(in)localResetFcn(in);

The reset function is provided in a new function file, localResetFcn.m:

function in = localResetFcn(in)

% Randomize reference signal
blk = sprintf("rlwatertank/Desired \nWater Level");
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
in = setBlockParameter(in,blk,Value=num2str(h));

% Randomize initial height
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
blk = "rlwatertank/Water-Tank System/H";
in = setBlockParameter(in,blk,InitialCondition=num2str(h));

end
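The two `while` loops in the reset function are rejection sampling: a value is drawn from a normal distribution with mean 10 and standard deviation 3, and redrawn until it falls strictly inside (0, 20). The standalone sketch below repeats that draw many times to show what the resulting levels look like; the sample count `n` and the variable names are assumptions made for illustration.

```matlab
% Rejection-sampling sketch of the level randomization used in the reset function
rng(0)                     % fixed seed so the draw is repeatable
n = 1000;                  % number of illustrative samples (assumed)
levels = zeros(n,1);
numRejected = 0;

for k = 1:n
    h = 3*randn + 10;      % normal draw: mean 10, standard deviation 3
    while h <= 0 || h >= 20
        numRejected = numRejected + 1;
        h = 3*randn + 10;  % redraw until the level is inside (0, 20)
    end
    levels(k) = h;
end

fprintf("mean %.2f, min %.2f, max %.2f, rejected %d\n", ...
    mean(levels),min(levels),max(levels),numRejected);
```

Because the bounds sit more than three standard deviations from the mean, redraws are rare and the loop behaves almost like a plain normal draw while still guaranteeing a physically valid level.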

Specify the simulation time $T_f$ and the agent sample time $T_s$ , both in seconds:

Ts = 1.0;
Tf = 200;

Fix the random number generator seed for reproducibility:

rng(0)

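Create the Critic

A DDPG agent uses a parameterized Q-value function approximator (a neural network) to estimate the value of the policy. The network has two input layers (one for the observation channel and one for the action channel) and one output layer (the scalar value).

Define each network path as an array of layer objects, and give names to the input and output layers of each path. These names are used to connect the paths and, later, to associate the network input layers explicitly with the corresponding environment channels.

Observation input path: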

% Observation path
obsPath = [
    featureInputLayer(obsInfo.Dimension(1),Name="obsInLyr")
    fullyConnectedLayer(50)
    reluLayer
    fullyConnectedLayer(25,Name="obsPathOutLyr")
    ];

Action input path:

% Action path
actPath = [
    featureInputLayer(actInfo.Dimension(1),Name="actInLyr")
    fullyConnectedLayer(25,Name="actPathOutLyr")
    ];

Common path:

% Common path
commonPath = [
    additionLayer(2,Name="add")
    reluLayer
    fullyConnectedLayer(1,Name="QValue")
    ];

Create the network and connect the paths:

criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,obsPath);
criticNetwork = addLayers(criticNetwork,actPath);
criticNetwork = addLayers(criticNetwork,commonPath);

criticNetwork = connectLayers(criticNetwork, ...
    "obsPathOutLyr","add/in1");
criticNetwork = connectLayers(criticNetwork, ...
    "actPathOutLyr","add/in2");

Plot the network structure:

figure
plot(criticNetwork)

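[Figure: plot of the critic network]
Convert the network to a dlnetwork object (a deep learning network that supports inference and backpropagation) and display its properties: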

criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)

[Figure: summary output for the critic network]
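As a cross-check on the network summary, the number of learnable parameters can be counted by hand from the layer sizes defined above: each fully connected layer with `nIn` inputs and `nOut` outputs contributes `nIn*nOut` weights and `nOut` biases. The helper name `fcParams` is hypothetical, and the count assumes that only the four fully connected layers carry learnable parameters.

```matlab
% Hand count of the critic's learnable parameters (fully connected layers only)
fcParams = @(nIn,nOut) nIn*nOut + nOut;   % weights plus biases

obsFc1 = fcParams(3,50);    % observation path: 3 -> 50
obsFc2 = fcParams(50,25);   % observation path: 50 -> 25
actFc  = fcParams(1,25);    % action path: 1 -> 25
outFc  = fcParams(25,1);    % common path: 25 -> 1 (Q-value)

criticParams = obsFc1 + obsFc2 + actFc + outFc;
fprintf("critic learnables: %d\n",criticParams);
```

Both input paths end in 25 units so that the addition layer can sum them element by element before the final ReLU and the scalar output layer.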
Create the critic as a Q-value function approximator object:

critic = rlQValueFunction(criticNetwork, ...
    obsInfo,actInfo, ...
    ObservationInputNames="obsInLyr", ...
    ActionInputNames="actInLyr");

Check the critic output with a random observation and action input:

getValue(critic, ...
    {rand(obsInfo.Dimension)}, ...
    {rand(actInfo.Dimension)})

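[Figure: critic output for a random input]

Create the Actor

The actor implements a parameterized deterministic policy over a continuous action space: it takes the current observation as input and returns the action as output.

Define the network structure: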

actorNetwork = [
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(3)
    tanhLayer
    fullyConnectedLayer(actInfo.Dimension(1))
    ];

As with the critic, convert the network to a dlnetwork object and display its properties:

actorNetwork = dlnetwork(actorNetwork);
summary(actorNetwork)

[Figure: summary output for the actor network]
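To make the actor's structure concrete, the sketch below evaluates the same architecture by hand in plain MATLAB: a fully connected layer with 3 units, a tanh activation, and a fully connected output layer with 1 unit. The weights here are random placeholders (an assumption for illustration, not the values the toolbox initializes or learns), and the names `W1`, `b1`, `W2`, `b2`, and `policy` are hypothetical.

```matlab
% Hand-written forward pass with the same shape as the actor network
rng(0)                               % repeatable placeholder weights
W1 = randn(3,3);  b1 = zeros(3,1);   % stands in for fullyConnectedLayer(3)
W2 = randn(1,3);  b2 = 0;            % stands in for the 1-unit output layer

% Deterministic policy: the same observation always gives the same action
policy = @(s) W2*tanh(W1*s + b1) + b2;

s = rand(3,1);                       % observation [integral of e; e; h]
a = policy(s);

numParams = numel(W1) + numel(b1) + numel(W2) + numel(b2);
fprintf("action %.4f, learnable parameters %d\n",a,numParams);
```

The output layer has no bounding activation, so the action is unbounded, which matches the action specification defined earlier without limits.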
Create the actor. An rlContinuousDeterministicActor object represents a deterministic actor over a continuous action space:

actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo);

Check the actor output with a random observation input:

getAction(actor,{rand(obsInfo.Dimension)})

[Figure: actor output for a random observation]

Create the DDPG Agent

Create the DDPG agent from the actor and critic approximator objects:

agent = rlDDPGAgent(actor,critic);

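Set the hyperparameters:

% Agent sample time: how often the agent executes, in seconds
agent.SampleTime = Ts;

% Smoothing factor for soft updates of the target networks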
agent.AgentOptions.TargetSmoothFactor = 1e-3;
agent.AgentOptions.DiscountFactor = 1.0;
agent.AgentOptions.MiniBatchSize = 64;
agent.AgentOptions.ExperienceBufferLength = 1e6; 

agent.AgentOptions.NoiseOptions.Variance = 0.3;
agent.AgentOptions.NoiseOptions.VarianceDecayRate = 1e-5;

agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
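The `TargetSmoothFactor` setting controls how quickly the target networks track the learned ones. In DDPG the target parameters are updated by the soft rule $\theta_{target} \leftarrow \tau\theta + (1-\tau)\theta_{target}$ with smoothing factor $\tau$. The sketch below applies that rule to a toy two-element parameter vector; the vectors and the step count are assumptions made for illustration, and the learned parameters are held fixed, which is not the case during real training.

```matlab
% Soft target update sketch with tau = 1e-3 on a toy parameter vector
tau = 1e-3;
theta = [1.0; -2.0];        % "learned" parameters, held fixed here (assumed)
thetaTarget = [0; 0];       % target parameters start elsewhere (assumed)
numUpdates = 1000;

for k = 1:numUpdates
    thetaTarget = tau*theta + (1 - tau)*thetaTarget;
end

% Closed form for fixed theta and zero start: theta*(1 - (1-tau)^numUpdates)
fractionTracked = 1 - (1 - tau)^numUpdates;
fprintf("tracked %.1f%% of the gap after %d updates\n", ...
    100*fractionTracked,numUpdates);
```

With a factor this small the targets move slowly, closing roughly 63% of the gap after 1000 updates in this toy setting, which is the stabilizing effect soft updates are meant to provide.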

Check the agent output with a random observation input:

getAction(agent,{rand(obsInfo.Dimension)})

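[Figure: agent output for a random observation]

Train the Agent

Set the training options:

Run training for at most 5000 episodes;
Limit each episode to at most ceil(Tf/Ts) time steps, that is, the simulation time divided by the agent sample time (200 steps here);
Disable the command-line display and monitor training progress in the Episode Manager window;
Stop training when the agent's average cumulative reward over 20 consecutive episodes exceeds 800.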

trainOpts = rlTrainingOptions(... 
    MaxEpisodes=5000, ...    
    MaxStepsPerEpisode=ceil(Tf/Ts), ... 
    Verbose=false, ...
    Plots="training-progress",...  
    StopTrainingCriteria="AverageReward",...
    ScoreAveragingWindowLength=20, ...
    StopTrainingValue=800);
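It helps to relate the stopping value of 800 to the reward scale. The arithmetic sketch below assumes an episode that runs the full 200 steps without triggering the -100 termination penalty, so each step earns either +10 (error inside the 0.1 band) or -1 (outside it). The variable names are hypothetical.

```matlab
% Relating the stopping threshold to the per-step reward (full-length episode assumed)
Ts = 1.0;
Tf = 200;
maxSteps  = ceil(Tf/Ts);        % steps per episode
maxReturn = 10*maxSteps;        % every step inside the error band

% With k in-band steps the return is 10*k - (maxSteps - k) = 11*k - maxSteps.
% Smallest k that reaches 800:
threshold = 800;
kMin = ceil((threshold + maxSteps)/11);
returnAtKMin = 11*kMin - maxSteps;

fprintf("max return %d, need at least %d of %d steps in band\n", ...
    maxReturn,kMin,maxSteps);
```

Under these assumptions the criterion asks the agent to hold the level inside the band for a little under half of each episode on average, well below the best possible return.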

Train the agent:

trainingStats = train(agent,env,trainOpts);

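Training progress:
[Figure: training progress]

Validate the DDPG Agent

Simulate the model to validate the trained agent. Because the reset function randomizes the reference value, fix the random number generator seed for reproducibility: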

rng(1)

Run the simulation:

simOpts = rlSimulationOptions(MaxSteps=ceil(Tf/Ts),StopOnError="on");
experiences = sim(env,agent,simOpts);

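Simulation result:
[Figure: simulation result]
The trained agent performs well: the water level reaches the neighborhood of the reference level within 10 s, and by about 80 s it is almost exactly on the reference.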
