A review On reinforcement learning: Introduction and applications in industrial process control

Rui Nian, Jinfeng Liu, Biao Huang
https://doi.org/10.1016/j.compchemeng.2020.106886

Highlights


An overview of reinforcement learning with tutorials for industrial practitioners on implementing RL solutions into process control applications.


An introduction to different reinforcement learning algorithms.


Recent successes of RL applications with emphasis on process control applications.


A comparison with traditional optimal control methods.

Abstract

In recent years, reinforcement learning (RL) has attracted significant attention from both industry and academia due to its success in solving some complex problems. This paper provides an overview of RL along with tutorials for practitioners who are interested in implementing RL solutions into process control applications. The paper starts by providing an introduction to different reinforcement learning algorithms. Then, recent successes of RL applications across different industries will be explored, with more emphasis on process control applications. A detailed RL implementation example will also be shown. Afterwards, RL will be compared with traditional optimal control methods, in terms of stability and computational complexity among other factors, and the current shortcomings of RL will be introduced. This paper is concluded with a summary of RL’s potential advantages and disadvantages.

Keywords
Reinforcement learning; Model predictive control; Optimal control; Machine learning; Process industry; Process control

Nomenclature

a
Anomalous state

The set of anomalous states

b
Behaviour policy

𝔼
Expectation

H
Control horizon

k
Update step

Kd
Derivative gain

Ki
Integral gain

Kp
Proportional gain

N
Arbitrary number of time steps

Ornstein-Uhlenbeck exploratory noise

o
Observation

𝒪
The set of possible observations, observation space

p
Probability transition function

Action probability vector

Pr
Probability

q*
Action-value

Q
Estimated action-value

Qmpc
Tuning matrix for the states in MPC

r
Expected reward

R
Sampled reward

Rmpc
Tuning matrix for the inputs in MPC

t
Time step

u
Control action

u′
Noise-corrupted input signal

u*
Optimal control action

𝒰
The set of all possible control actions, action space

v
Expected value

V
Estimated value

w
Weight vector for function approximation

Wt
Wiener process

x
State

y
Predicted variable

𝒳
The set of possible states, state space

γ
Discount factor

π
Policy

α
Learning rate or step size

β
Fixed discount factor in semi-MDPs

η
Time spent in a particular state

μ
State distribution

μa
Threshold for the system to be identified as anomalous in RL anomaly detection

ϵ
Percent chance of performing a random action in TD methods

θ
Ornstein-Uhlenbeck hyperparameter for the time parameter

σ
Ornstein-Uhlenbeck hyperparameter for the Wiener process

Abbreviations

ADP
Approximate dynamic programming

AI
Artificial intelligence

CMDP
Constrained Markov decision process

CSTR
Continuously stirred tank reactor

DCS
Distributed control system

DPG
Deterministic policy gradient

DDPG
Deep deterministic policy gradient

DMC
Dynamic matrix control

DP
Dynamic programming

DQN
Deep q-learning network

EMPC
Economic model predictive control

FOMDP
Fully observable Markov decision process

FTC
Fault tolerant control

HARL
Heuristically accelerated reinforcement learning

LQG
Linear quadratic Gaussian

LQR
Linear quadratic regulator

LSTM
Long short term memory

MC
Monte Carlo

MDP
Markov decision process

MIMO
Multiple-input multiple-output

ML
Machine learning

MP
Mathematical programming

MPC
Model predictive control

MRP
Markov reward process

MSE
Mean squared (tracking) error

NP
Non-deterministic polynomial time

OPC
Open platform communication

P
Pressure

PID
Proportional-Integral-Derivative

POMDP
Partially observable Markov decision process

PPO
Proximal policy optimization

PUE
Power usage effectiveness (used by Google to quantify energy efficiency)

RL
Reinforcement learning

RNN
Recurrent neural network

RTO
Real time optimization

SGD
Stochastic gradient descent

SISO
Single-input single-output

SMDP
Semi Markov decision process

SP
Set-point

TD
Temporal difference

TPU
Tensor processing unit

TRPO
Trust region policy optimization

1. Introduction

Artificial intelligence (AI) has recently triggered a paradigm shift in numerous industries around the world, ranging from technology to health care. The previously arcane topic is now an insuppressible wildfire igniting countless industrial and academic minds alike. Rapid advancements in computer hardware and ever-cheapening data storage, combined with AI’s ability to ‘self-learn’, have pushed AI to the forefront of many applications such as computer vision and natural language processing. According to PwC (2019), AI is projected to contribute over 15 trillion USD to the world economy and provide a 26% boost in GDP by 2030. Overall, AI is a massive field encompassing many goals.

The major goals/topics of AI are shown in Fig. 1. Currently, the most influential topic in AI is machine learning (ML). ML can be described as the scientific field that studies and develops algorithms and statistical models to give machines the explicit ability to learn tasks without being programmed to do so (Russel and Norvig, 2009). The ML field can be further decomposed into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Fig. 1. The major goals of artificial intelligence.

The breakdown of the ML field is shown in Fig. 2. In supervised learning, the agent (ML algorithm) learns the input-output mapping (model) from a training data set labeled by subject matter expert(s) (Ng, 2018). A supervised learning algorithm generally attempts to generalize across training examples and uses this knowledge to predict labels for unseen data. Note that not all labels are guaranteed to be correct. In the process industry, the subject matter expert is often a sensor measuring the current state (temperature, pressure, etc.) of a process and could be unreliable and noisy. Ultimately, the performance of the supervised learning agent cannot outperform the subject matter expert or supervisor because the agent essentially only mimics the labeling behavior of the expert. In the literature, this performance limit of the agent is known as the Bayes error rate (Ng, 2018). Unsupervised learning is typically used for identifying hidden structures within unlabeled data sets. Examples include segregating data based on their similarity to identify different operating regimes, or identifying the principal components within a data set (Hinton and Sejnowski, 1999; Sutton and Barto, 2018). The three main goals of unsupervised learning are dimensionality reduction, feature extraction, and clustering. Semi-supervised learning is obtained by combining ideas from supervised and unsupervised learning. In the process industry, manually labeling data is a costly endeavor; however, many applications such as fault detection require labeled data sets before useful applications can materialize. Here, semi-supervised learning can be applied to learn from the small amount of labeled data and extract additional useful insights from the remaining unlabeled data (Ge et al., 2017). Nevertheless, semi-supervised learning still suffers from the inability to surpass the supervisor. Therefore, the previous methods can only create value through cost reductions while failing to expand the current capabilities of modern methods. An intuitive example is as follows:
World-renowned chemists may have the capability to achieve 95% purity in chemical A using state-of-the-art methods. With the aid of supervised learning agents to replicate trivial tasks, the synthesis process may be faster and/or cheaper; however, the purity will not increase beyond 95% because the agent is simply replicating the supervisor. In other words, the current knowledge base of chemistry for synthesizing a higher-purity chemical A has not changed.

Fig. 2. The sub-components of machine learning.

Reinforcement learning (RL) attempts to overcome the above-mentioned performance limit by combining previous ML fields into a unifying algorithm with modifications to the learning process. Ultimately, the goal of RL is to give machines the ability to surpass all known methods. Specifically, the goal of the RL agent is to push the boundaries of what is currently possible by learning the optimal mapping of situations to actions (called the policy) through a trial-and-error search guided by a scalar reward signal. In many challenging scenarios, actions affect not only the immediate reward, but also all subsequent rewards. These two features – guided trial-and-error search and delayed feedback – distinguish RL from all other topics of ML and ultimately enable RL to push the current boundaries of knowledge (Sutton and Barto, 2018).

Nevertheless, this characteristic introduces unique challenges to the RL agent, one being the trade-off between exploration and exploitation. The goal of the agent is to maximize the reward signal; however, the agent is initialized tabula rasa (i.e., with a blank slate). Thus, the agent must first explore the state space to identify the optimal actions. Furthermore, for stochastic systems or systems with delayed reward signals, each state must be visited many times to obtain reliable information. Ultimately, the agent must exploit the current information to maximize rewards. However, exploiting too soon leads to locally optimal or sub-optimal actions. Likewise, exploiting too late results in forgone rewards. The problem is further complicated by non-stationary or time-varying systems, where exploration is a requirement for the continued optimality of the agent. In control, this dilemma is known as identification (or estimation) versus control.

Mathematically, RL is formulated as an optimal sequential decision making algorithm with the ability to account for stochasticity within the system. Furthermore, RL was partly built upon stochastic optimal control, which may provide advantages in certain systems compared to modern control techniques (Rawlik et al., 2013). Traditional optimal control methods (e.g., model predictive control (MPC)) typically employ mathematical programming (MP) based trajectory optimization methods. The successes of such methods in addressing multi-stage optimal control problems are widely demonstrated; however, industrial applications of such methods in large-scale stochastic multiple-input multiple-output (MIMO) problems are still limited due to their online computational requirements (Maravelias and Sung, 2009). Furthermore, the solutions to systems with uncertainty typically use stochastic programming with only a finite number of uncertainty scenarios and assume that the information regarding the uncertainty is known. Unfortunately, in practice, uncertainty information is typically unknown, non-stationary and contains uncertainty itself (Shin et al., 2019). Moreover, the control horizon of MP methods for large MIMO systems is generally kept short to ensure computational feasibility, although the identified optimal solution for short horizons might be highly sub-optimal in the long term (Mayne and Rawlings, 2017).

Compared to MP control methods, RL overcomes the issue of long online computation times by pre-computing the optimal solutions offline, a concept similar to parametric programming in explicit model predictive control (Bemporad et al., 2002). This characteristic of RL is advantageous for systems where online computation time is of importance. However, for complex systems, offline training of the RL agent may require hundreds of thousands of steps to achieve even a near-optimal policy; thus, making the training step infeasible in live operations. To overcome this issue, the agent may be first trained in a process simulator to obtain general knowledge of the process. The performance of the agent after this phase will be strongly correlated with the fidelity of the simulator. It is expected that the RL agent’s ultimate performance during such a training process will not outperform the performance of a corresponding MPC designed based on the same simulator model if global optimality is ensured in obtaining the MPC’s solution. When compared to advanced control, RL is similar to the widely used combination of real-time optimization (RTO) and MPC. In traditional optimal control, RTO provides the optimal steady state set-point(s) and MPC identifies the optimal input trajectory to achieve the desired set-points. For RL, the learned optimal policy implicitly carries information regarding the optimal set-point(s) and the optimal input(s) to reach the set-points. This is indeed very similar to the concept of economic MPC which aims to combine the RTO and MPC layers (Ellis et al., 2014). Due to these unique features, it is a natural curiosity to explore the areas where RL may have potential in the process control industry.

In the literature, there exist some review papers focusing on ML and process control, such as Shin et al. (2019) and Lee et al. (2018). These existing review papers cater to academic researchers and typically involve more rigorous mathematics. The objectives of this paper are to introduce the motivations and concepts of RL in a tutorial way to potential practitioners. A general overview of all branches and state-of-the-art RL algorithms will be briefly explored. The focus of this study is on the potential implementation and value creation of RL in the process control industry, not on rigorous mathematical proofs and numerical studies of RL methods compared to traditional process control. This review starts with an introduction to RL, the Markov decision process (MDP) and different families of RL methods. Concepts explored will also be intuitively correlated to process control ideas to enhance understanding. Section 3 compares RL qualitatively to traditional control methods and also introduces some successful RL applications. A quantitative example where RL was applied to an industrial pumping system will also be shown there to provide additional intuition. The section is concluded with RL implementation techniques catered towards the process industry. Section 4 presents the current weaknesses that may be preventing RL from being adopted in the process industry. Finally, the review is concluded in Section 5 with the advantages and disadvantages of RL summarized.

2. Reinforcement learning

2.1. A brief history
RL originated from two main fields of research: optimal control using value functions and dynamic programming, and animal psychology inspiring trial-and-error search. The optimal control problem was originally proposed to design a controller to minimize the loss function of a dynamical system over time (Mayne and Rawlings, 2017). In the mid 1950s, Richard Bellman extended the works of Hamilton and Jacobi and developed an approach to solve the optimal control problem. This approach, now called dynamic programming, optimizes the input trajectory by using the functional equation (a function where the unknowns are also functions) generated from the system’s state information in conjunction with a value function (Bellman, 1957a). The functional equation, known as the Bellman equation, is given as:

V(x) = \max_{u} \left[ r(x) + \gamma \sum_{x'} P(x' \mid x, u) \, V(x') \right]    (1)

where V(x) is the value function in state x, γ is a discount factor to incorporate uncertainty into future rewards, r(x) is the reward obtained as a function of the system’s behaviour with respect to a desired performance, P(x′|x, u) is the transitional probability of arriving at the next state x′ given the current state and control input, x and u, respectively, and V(x′) is the value function at the next state x′. Intuitively, the value function can be understood as the goodness of being in a particular state, assuming optimal behaviour thereafter; high values correspond to good states and low values to bad states. Unfortunately, dynamic programming suffers from the curse of dimensionality (i.e., the computational cost grows exponentially with the number of states). A significant step in the RL literature was made when approximate dynamic programming (ADP) methods were developed to overcome this obstacle (Mes and Rivera, 2017). Reinforcement learning leverages one such ADP method to solve for the optimal policy offline. The design of the ADP may take many forms dependent on the objective of the agent. For example, Lee and Wong (2010) found that post-decision-state formulation of ADPs offers the most benefit to process control problems where the main objectives are safety and economics. More details can be found in Lee and Wong (2010). The concept of a reward/punishment trial-and-error learning system in RL originated from animal psychology. More specifically, the original concept was proposed by Thorndike and was named the Law of Effect, stating that actions resulting in good outcomes are likely to be repeated, while actions with bad outcomes are muted. Initially, the agent undergoes a trial-and-error search to identify the outcomes corresponding to each action, then only repeating the good outcome actions thereafter. Through the combination of dynamic programming from optimal control and trial-and-error search from animal psychology, the modern field of RL was developed. For additional details regarding the history of reinforcement learning, please refer to Sutton and Barto (2018).

2.2. The bandit problem
The evolution of RL is shown in Table 1. Reinforcement learning takes its roots from the k-armed bandit problem where the agent is only concerned with making the optimal decisions in one situation. Additionally, only the immediate consequences were considered. Eventually, the concept was extended in the early 1980s by Barto et al. (1981) to solve multi-situation systems. This new problem was named contextual bandits (also named associative search). Here, the agent was still only concerned with the immediate consequences. However, real world problems can rarely be solved by considering only the immediate consequences. In most scenarios, near term sacrifices are absolutely necessary to reach long term success. To overcome this dilemma, RL was developed to find the optimal decisions in multi-situation systems that optimized not only the immediate rewards, but also the trajectory thereafter.

Table 1. From left to right, the evolution of reinforcement learning.

k-armed bandits        Contextual bandits     Reinforcement learning
Optimal action         Optimal action         Optimal action
One situation          Many situations        Many situations
Immediate conseq.      Immediate conseq.      Long-term conseq.
The k-armed bandit problem establishes the fundamental knowledge for understanding modern RL. In this problem, the agent assumes the system has only one constant state with many possible control actions. A classic example of the bandit problem would be choosing which slot machines to play in a casino. Here, the agent has one state (being inside the casino), and must identify which slot machine yields the highest reward. Objectively, the agent attempts to maximize reward over N steps by identifying the optimal control action, u* ∈ 𝒰, where 𝒰 is a set of k possible actions. For each action in 𝒰, there is an expected reward called value, given by:

q^*(u) = \mathbb{E}\left[ R_t \mid u_t = u \right]    (2)

where u is the control action taken at time t and Rt is a scalar reward signal obtained by the agent after performing action u. For stationary (non-stationary) stochastic processes, Rt is drawn from a stationary (non-stationary) probability distribution, Rt ~ N(q*(u), σ²). Finally, q*(u) is the expected reward of taking action u. In a real process, q*(u) is unknown, but can be estimated through exploration of each u. The estimated value is denoted as Q(u). As all u ∈ 𝒰 are picked infinitely many times, by the law of large numbers, Q(u) → q*(u) (Borel, 1909).

Given any time t, one Qt(u) will be greater than all others, signifying its corresponding action is optimal at time t and should be picked. Methods where action selection is based on the estimated values are called action-value methods (Sutton and Barto, 2018). Algorithms to solve the k-armed bandit problem are easily applied to situations where the concept of state is inert and only the actions are of concern; a near impossibility in the real world.
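To make the action-value estimate of Eq. (2) concrete, the sketch below runs a sample-average, ε-greedy agent on a hypothetical 3-armed bandit; the true reward means, noise level, step count, and ε are arbitrary illustration values rather than anything prescribed in the text.

```python
import random

# Hypothetical 3-armed bandit: true expected rewards q*(u) (unknown to the agent).
TRUE_Q = [0.2, 0.5, 0.4]

def pull(u):
    """Sample a noisy reward R_t ~ N(q*(u), 0.1^2) for arm u."""
    return random.gauss(TRUE_Q[u], 0.1)

def run_bandit(steps=5000, epsilon=0.1):
    Q = [0.0] * len(TRUE_Q)   # estimated action-values Q(u)
    n = [0] * len(TRUE_Q)     # number of times each arm was pulled
    for _ in range(steps):
        # epsilon-greedy: explore with probability epsilon, otherwise exploit.
        u = random.randrange(len(Q)) if random.random() < epsilon else Q.index(max(Q))
        r = pull(u)
        n[u] += 1
        Q[u] += (r - Q[u]) / n[u]  # incremental sample average, so Q(u) -> q*(u)
    return Q

print(run_bandit())  # the estimates should approach TRUE_Q as the arm counts grow
```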

Naturally, Barto et al. (1981) extended the original problem to incorporate many states and named it contextual bandits. In contextual bandits, different optimal policies are associated with different states (contexts). Mathematically, Eq. (2) was extended to:
q^*(x, u) = \mathbb{E}\left[ R_t \mid x_t = x, u_t = u \right]    (3)
where x is the current state of the system. Here, the value function is a function of both the state and action; therefore, allowing the agent to behave differently for different situations. Nevertheless, the agent is still concerned with only the immediate reward, rather than the long term optimal trajectory. To alleviate this, RL was developed, introducing the concept of sequential decision making.

2.3. Markov decision process
The reinforcement learning paradigm is shown in Fig. 3 and consists of two components: the agent and the system. The agent is the continuously learning decision maker (i.e., the RL algorithm). The agent attempts to learn and conquer the system through meaningful interactions with the system. The system is comprised of everything the agent cannot arbitrarily change. Relating to process control, the agent would be the controller logic and everything else would make up the system. Reinforcement learning’s decision making process is formalized in the Markov decision process (MDP).

Fig. 3. The general Markov decision process framework. Original image from Sutton and Barto (2018).

The MDP is a discrete representation of the stochastic optimal control problem and a classical formulation of sequential decision making where both immediate and future rewards are considered (Bellman, 1957b; Sutton and Barto, 2018). MDPs provide formalism to agents when rationalizing about planning and acting in the face of uncertainty. Many different definitions of MDPs exist and are equivalent up to small alterations of the problem. One such definition is that an MDP is a tuple (𝒳, 𝒰, p(x′, r|x, u), γ, R) comprised of (Ng, 2003):

𝒳: state space that describes the industrial system. Typical states in the process industry include temperatures, pressures, flow rates, etc.

𝒰: action space of the reinforcement learning agent. In control, this is the bounded control inputs.

R: expected reward from the system after the agent performs u at x. Rewards are generated based on a desired performance metric and are called the objective function in the control literature. Typically, the reward is bounded above for stability and convergence purposes.

p(x′, r|x, u): dynamics function of the environment. It denotes the probability of transitioning to x′ and receiving r given x and u, as described below:

p(x', r \mid x, u) = \Pr\left\{ x_{t+1} = x', R_{t+1} = r \mid x_t = x, u_t = u \right\}    (4)

where p describes the dynamics of the system and Pr denotes probability (Sutton and Barto, 2018). Additionally, p is a probability distribution satisfying:

\sum_{x' \in \mathcal{X}} \sum_{r} p(x', r \mid x, u) = 1, \quad \text{for all } x \in \mathcal{X},\ u \in \mathcal{U}    (5)

Notice that p depends only on the immediate past, and thus assumes that x_t and u_t contain information about the history. This property is known as the Markov property and is critical for successful RL applications in process control. Note also that RL formulations where the state is augmented with past states and actions still exhibit the Markov property, because decisions can be made exclusively using the augmented current state and u_t.

γ: discount factor associated with future uncertainty (0 ≤ γ ≤ 1).

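As a concrete (and purely hypothetical) illustration of this tuple, a small finite MDP can be stored as a table of transition probabilities and rewards. The sketch below uses made-up state and action names and simply checks that the tabulated dynamics function p(x′, r|x, u) satisfies the normalization condition of Eq. (5).

```python
# A minimal, hypothetical finite MDP stored as nested dictionaries.
# MDP[(x, u)] is a list of (next_state, reward, probability) triples,
# i.e., a tabular representation of p(x', r | x, u) in Eq. (4).
MDP = {
    ("low_temp", "heat"): [("on_spec", 1.0, 0.9), ("low_temp", -1.0, 0.1)],
    ("low_temp", "idle"): [("low_temp", -1.0, 1.0)],
    ("on_spec", "heat"):  [("on_spec", 1.0, 0.2), ("low_temp", -1.0, 0.8)],
    ("on_spec", "idle"):  [("on_spec", 1.0, 0.95), ("low_temp", -1.0, 0.05)],
}
GAMMA = 0.9  # discount factor, 0 <= gamma <= 1

def check_dynamics(mdp):
    """Verify Eq. (5): the outcome probabilities sum to 1 for each (x, u) pair."""
    for (x, u), outcomes in mdp.items():
        total = sum(prob for (_, _, prob) in outcomes)
        assert abs(total - 1.0) < 1e-9, f"p(.|{x},{u}) sums to {total}"

check_dynamics(MDP)
print("All state-action pairs define valid probability distributions.")
```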
Three different versions of MDPs exist and are used to describe problems containing different characteristics as shown in Table 2 and are discussed below.

Table 2. A comparison of different Markov decision processes.

FOMDPs                   SMDPs                    POMDPs
All states observable    All states observable    Some states observable
Discrete time            Continuous time          Discrete time
2.3.1. Fully observable MDPs
Fully observable MDPs (FOMDP) are used by agents to reason about decision making in discrete systems where all states are observable (measurable in control terminology). They serve as the foundation of the MDP framework. The framework is initialized by the agent starting in some initial state x0. At each time t, the agent picks some action ut given xt corresponding to its policy π. Given xt and ut, the system will then transition to the new state x_{t+1} following Eq. (4) and output a reward R_{t+1} based on the desired performance metric. In control, the performance metric is typically the squared tracking error of the controller. By repeating this procedure many times, the agent is able to traverse through some sequence x0, u0, R1, x1, u1, R2, … and accumulate:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots    (6)

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}    (7)

where Gt is the cumulative discounted return at time t. Furthermore, γ is the discount factor and captures the uncertainty of future rewards. MDPs can be finite or infinite; the former describes episodic systems with explicit terminal states while the latter may continue forever. For example, a game of chess can be described as a finite MDP where the game is terminated after one player is defeated. Contrarily, an infinite MDP system, such as the control system in a refinery, could continue on indefinitely. For infinite MDP systems such as those in process control, γ < 1 is required to keep Gt bounded. The RL agent tries to find the optimal policy, π*, that maximizes Gt (instead of Rt in the bandit case) over N steps. The value function for each state in the system is given as (Sutton and Barto, 2018):

v_\pi(x) = \mathbb{E}_\pi\left[ G_t \mid x_t = x \right]    (8)

where vπ(x) is the value function of state x under policy π. Additionally, vπ is guaranteed to exist and be unique for continuous systems where γ < 1 or in systems with guaranteed termination. Compared to Eq. (2), Eq. (8) takes the expectation of Gt (defined in Eq. (7)) rather than Rt, therefore optimizing the long-term trajectory rather than only the immediate reward. The action-value version of Eq. (8) is given as:

q_\pi(x, u) = \mathbb{E}_\pi\left[ G_t \mid x_t = x, u_t = u \right]    (9)

FOMDPs can accurately represent discrete systems where all states are observable. Unfortunately, states in industrial processes are often not fully measurable due to hardware limitations or other factors. In such situations, the system no longer exhibits the Markov property, ultimately resulting in sub-optimal decision making.

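The short sketch below computes the discounted return of Eqs. (6) and (7) from a finite list of sampled rewards; the reward values are arbitrary placeholders, and only the discounting logic follows the equations above.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence (Eq. (7))."""
    g = 0.0
    # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical sampled rewards R_1, R_2, ... from one trajectory.
sampled_rewards = [0.0, 0.5, 1.0, 1.0]
print(discounted_return(sampled_rewards, gamma=0.9))  # 0.0 + 0.9*0.5 + 0.81*1.0 + 0.729*1.0
```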
2.3.2. Partially observable MDPs
Partially observable Markov decision processes (POMDPs) extend the concepts of FOMDPs and are used to solve systems with states that are no longer fully observable. Observability in RL terminology is equivalent to measurability in control as mentioned earlier; thus, the two terminologies will be used interchangeably henceforth. Previously in FOMDPs, the agent at each time t can observe its current state xt. In the more general setting of POMDPs, the agent has a set of possible observations 𝒪 rather than the set of states 𝒳. At each time t, the agent instead sees an observation ot which corresponds to a probability distribution over states, giving the agent information regarding the state it might currently be in (Ng, 2003). In a process control setting, existing sensors usually only measure a subset of the current states directly. Using available measurements and the input information, one is able to infer the remaining unmeasured states using probabilistic inference approaches such as the Kalman filter. Such systems are partially observable in RL terminology and are represented by POMDPs.

In general, finding the true optimal policy, π*, in a POMDP system is significantly harder compared to FOMDPs. Even finding a near-optimal policy is NP-hard (non-deterministic polynomial time) (Lusena et al., 2001). Additionally, agents knowing all the true value functions of the system are still unable to behave optimally in POMDP systems because the current states are unknown (Ng, 2003).

The use of belief states is one possible approach for agents to behave optimally in POMDPs. Belief states, b, are probability distributions over states representing what the agent thinks its current state is, given previous observations and actions. Using these probabilities, one can compute the value functions of each state-action pair and use them to act optimally. Note that the behaviour is not optimal with respect to the system; rather, it is optimal with respect to the available information. Ultimately, belief states transform the POMDP problem into a FOMDP since all belief states are fully available. A quantitative example is as follows:
Assume an agent exists in a simple POMDP system with two unmeasurable states (x1 and x2) and two actions (u1 and u2) and suppose the problem has a horizon of one (for longer horizons, the agent must identify the trade-off between immediate and long term rewards, making the example less intuitive). Such a system has four value functions corresponding to the immediate reward obtained for each state-action pair. Suppose u1 yields a reward of 2 in x1 and 0 in x2. Similarly, u2 yields a reward of 0 in x1 and 1 in x2. If the current belief state b is [0.2, 0.8] (the probabilities of being in x1 and x2, respectively), then the expected reward of u1 is 0.2 × 2 + 0.8 × 0 = 0.4 and that of u2 is 0.2 × 0 + 0.8 × 1 = 0.8, making u2 the optimal action. For more relevant examples, please refer to Kaelbling et al. (1999).
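The sketch below reproduces the arithmetic of this belief-state example in code; the reward table and belief vector are exactly the ones assumed in the example.

```python
import numpy as np

# Immediate rewards for each (state, action) pair from the example above.
# Rows: states x1, x2; columns: actions u1, u2.
rewards = np.array([[2.0, 0.0],
                    [0.0, 1.0]])

belief = np.array([0.2, 0.8])  # probabilities of being in x1 and x2

# Expected immediate reward of each action under the belief state: b^T * rewards.
expected_rewards = belief @ rewards
print(expected_rewards)                                         # [0.4 0.8]
print("optimal action: u%d" % (expected_rewards.argmax() + 1))  # u2
```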

In control theory, observers are leveraged to estimate unknown states. Observers are typically based on first-principles models, but could also be based on data-driven or probabilistic models. The concept of belief states is very similar to observer design in control theory. The Kalman filter is a popular observer design method. Likewise, in RL, recurrent neural networks (RNNs) are typically used to estimate the belief states. In Chenna et al. (2004), the performance of RNNs is compared with the Kalman filter, showing both methods’ similarities in performance, objective, and theory.

Optimal solutions from FOMDPs and POMDPs work well in discrete tasks where transition time is consistent and transition dynamics are unimportant; however, such topics are critical to successful optimal control in the process industry.

2.3.3. Semi-MDPs
Typical MDPs are discrete representations of the optimal control problem and are sub-optimal in continuous tasks. Semi-Markov decision processes (SMDP) are an extension of MDPs to continuing tasks with unknown transition time and system dynamics. In SMDPs, the transition dynamics of the system are explicitly captured using the reward function (Bradtke and Duff, 1994):

r(x_t, u_t) = \mathbb{E}\left[ \int_{0}^{\tau} e^{-\beta s} \, \rho(x_t, \pi(x_t)) \, ds \right]    (10)

where r(xt, ut) is the expected reward to be received when transitioning from xt to x_{t+1} after action ut. The rewards, R, are calculated at each time step in the transition period to explicitly capture transition information. Then, the average reward of the transition is used to update the agent. Here, ρ(xt, π(xt)) represents the average reward during the transition following policy π, and F(τ|xt, x_{t+1}) denotes the probability distribution of the time required to transit from xt to x_{t+1}. Finally, β > 0 denotes the constant discount factor in SMDPs, where higher β results in short-sighted agents. In SMDPs, the discount factor is corrected for the transition time during each update step. The corrected discount factor is given by:

\mathbb{E}\left[ e^{-\beta \tau} \right] = \int_{0}^{\infty} e^{-\beta \tau} \, F(d\tau \mid x_t, x_{t+1})    (11)

where 𝔼[e^{-βτ}] is the expected discount factor that will be applied to the value of state x_{t+1} during the update step shown in Eq. (1). The value function for SMDPs is obtained from combining Eqs. (10) and (8):

v_\pi(x_t) = \mathbb{E}\left[ \int_{0}^{\tau} e^{-\beta s} \, \rho(x_t, \pi(x_t)) \, ds + e^{-\beta \tau} \, v_\pi(x_{t+1}) \right]    (12)

where τ is the unknown transition time. Similarly, the action-value form is given by:

q_\pi(x_t, u_t) = \mathbb{E}\left[ \int_{0}^{\tau} e^{-\beta s} \, \rho(x_t, u_t) \, ds + e^{-\beta \tau} \, q_\pi(x_{t+1}, \pi(x_{t+1})) \right]    (13)

By representing control problems as SMDPs, control strategies resulting in large overshoot, inverse response, or any other undesirable dynamic behaviour can be minimized. Additionally, this representation can handle systems with unknown transition time. An intuitive example illustrating the advantages of SMDPs in process control is as follows:
Suppose a refinery company is operating a continuously stirred tank reactor (CSTR). Objectively, the CSTR must maintain 200 °C for optimal performance. The temperature is controlled through a heat exchanger using cold water. An RL agent was built to optimally control the flow of cold water to maintain the temperature at the set-point. Suppose the CSTR starts at 220 °C. Agents using MDP representations may be overly aggressive and send large input signals because the reward is only calculated right before the next evaluation step. Therefore, input signals resulting in large overshoot or inverse response may not be reflected in the reward. Contrarily, the SMDP representation uses the average reward accumulated along the trajectory to provide feedback to the agent, allowing the transition dynamics to be explicitly captured. This way, input signals resulting in undesirable behaviour can be captured and mitigated. Furthermore, SMDP representations can have flexible evaluation times (traditional representations evaluate after a set time period), enabling re-evaluation during the transitional period and adjusting the discount factor in accordance with the elapsed time from the last evaluation.
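The sketch below illustrates the idea behind this example under the exponential time-discounting assumed in Eq. (11): the feedback for a transition is the average of the rewards sampled during that transition, and the discount applied to the next state's value decays with the (variable) transition time. The temperatures, sampling interval, and β value are hypothetical.

```python
import math

SETPOINT = 200.0   # degC
BETA = 0.05        # SMDP discount rate (hypothetical)

def transition_feedback(sampled_temps, dt):
    """Average reward over a transition and the time-corrected discount factor.

    sampled_temps: temperatures recorded at each time step during the transition.
    dt: time between samples, so the transition time is tau = len(sampled_temps) * dt.
    """
    # Negative squared tracking error at each sample, averaged over the transition.
    rewards = [-(t - SETPOINT) ** 2 for t in sampled_temps]
    avg_reward = sum(rewards) / len(rewards)
    tau = len(sampled_temps) * dt
    discount = math.exp(-BETA * tau)  # e^{-beta * tau}, applied to the next state's value
    return avg_reward, discount

# A transition with an excursion below the set-point is penalized even though the
# final sample looks good, unlike a reward computed only at the evaluation instant.
avg_r, gamma_tau = transition_feedback([220.0, 207.0, 193.0, 200.0], dt=1.0)
print(avg_r, gamma_tau)
```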

2.4. Solving the Markov decision process
The optimal solution to a reinforcement learning problem refers to the policy that generates the highest reward over a trajectory. Formally, an optimal policy must satisfy the principle of optimality: the policy π* is optimal if and only if v_{π*}(x) ≥ v_π(x) for all x ∈ 𝒳 and all policies π (Poznyak, 2008). Note that there may exist many optimal policies, all of which share the same optimal value function. The optimal value function is denoted mathematically as:

v^*(x) = \max_{\pi} v_\pi(x)    (14)

Similarly, the optimal action-value function is denoted as:

q^*(x, u) = \max_{\pi} q_\pi(x, u)    (15)

In a more explicit form, the optimal value function and action-value function written in terms of Eqs. (8) and (9) are given, respectively, by Sutton and Barto (2018):

v^*(x) = \max_{u} \mathbb{E}\left[ R_{t+1} + \gamma v^*(x_{t+1}) \mid x_t = x, u_t = u \right]    (16)

q^*(x, u) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{u'} q^*(x_{t+1}, u') \mid x_t = x, u_t = u \right]    (17)

Here, the max denotes that the optimal action will be taken in the next step and thereafter for the remainder of the trajectory. In theory, the optimal value functions can be explicitly solved for all systems using the above equations; however, such a task would require tremendous amounts of computation power even in trivial tasks. In the following section, three popular methods will be introduced to estimate the value and action-value functions in reinforcement learning.
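Given a table of (estimated) optimal action-values, the relations in Eqs. (14)-(17) reduce to simple maximizations. The minimal sketch below assumes a hypothetical, already-known q* table for two states and two actions, and recovers v*(x) = max_u q*(x, u) together with the greedy optimal action in each state.

```python
# Hypothetical optimal action-values q*(x, u) for a tiny two-state, two-action system.
q_star = {
    "x1": {"u1": 1.8, "u2": 0.9},
    "x2": {"u1": 0.4, "u2": 1.2},
}

for x, action_values in q_star.items():
    # v*(x) = max_u q*(x, u), and the optimal policy picks the maximizing action.
    best_u = max(action_values, key=action_values.get)
    v_star = action_values[best_u]
    print(f"v*({x}) = {v_star}, optimal action = {best_u}")
```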

2.5. Methods of reinforcement learning
The three families of algorithms used to solve for optimal policies in RL are shown in Fig. 4. Dynamic programming (DP) methods can provide exact solutions to the optimal policy, but require a perfect system model and have infeasible computational requirements for non-trivial tasks. Comparatively, both Monte Carlo (MC) and temporal difference (TD) methods approximate DP solutions using less computational power and do not assume the presence of a perfect system model. MC methods find the optimal policy through averaging the value function over many sampled trajectories of states, actions, and rewards; however, the variance in the sampled trajectories results in high variance in the final results. TD methods combine the ideas of DP and MC methods into one unifying algorithm. TD methods learn from sampled data like MC methods, while also being able to perform mid-trajectory learning, like DP methods; however, TD methods experience high bias due to estimating values through previously estimated values (known as bootstrapping). The general details of each method will be shown throughout this section. For a comprehensive introduction to each algorithm, see Sutton and Barto (2018).

Fig. 4. The three families of reinforcement learning algorithms.

2.5.1. Dynamic programming
Dynamic programming refers to a set of algorithms with the ability to find optimal policies assuming a perfect model is available. DP algorithms are in general not widely used due to their very high computational cost for non-trivial problems. The two most popular methods in DP are policy iteration and value iteration.

On a high level, policy iteration searches for the optimal policy by iterating through many policies, π ∈ Π, keeping only the policy with the highest cumulative returns. The optimal policy is found when the cumulative returns of π can no longer be improved (convergence). Policy iteration includes two iterative steps: policy evaluation and policy improvement. Policy evaluation predicts the value functions for policy π through an iterative approach. The value functions for all states are initialized as 0, and are updated using:

v_{k+1}(x) = \sum_{u} \pi(u \mid x) \sum_{x', r} p(x', r \mid x, u) \left[ r + \gamma v_k(x') \right]    (18)

where k is the kth update step and v_{k+1}(x) is the predicted value function following policy π after k+1 update steps. As k → ∞, vk(x) → vπ(x) for all x ∈ 𝒳 (i.e., the true value functions for π). However, there may exist a π′ whose value functions are greater than those of π, deeming π sub-optimal. The goal of policy improvement is to identify situations where deviating from π yields a higher value than vπ(x) for any state. Once identified, π will violate the principle of optimality, hence disqualifying it from being optimal, and π′ will be deemed the new optimal policy. This procedure continues iteratively until a policy whose value functions can no longer be improved for any state x ∈ 𝒳 is found.
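A minimal sketch of the iterative policy evaluation update in Eq. (18), run on a hypothetical two-state tabular MDP (the same dictionary format used in the earlier MDP sketch); the evaluated policy is an arbitrary stochastic policy chosen purely for illustration.

```python
# Tabular dynamics p(x', r | x, u): (x, u) -> list of (x_next, reward, probability).
# Hypothetical two-state example, in the same format as the earlier MDP sketch.
MDP = {
    ("low_temp", "heat"): [("on_spec", 1.0, 0.9), ("low_temp", -1.0, 0.1)],
    ("low_temp", "idle"): [("low_temp", -1.0, 1.0)],
    ("on_spec", "heat"):  [("on_spec", 1.0, 0.2), ("low_temp", -1.0, 0.8)],
    ("on_spec", "idle"):  [("on_spec", 1.0, 0.95), ("low_temp", -1.0, 0.05)],
}

def policy_evaluation(mdp, policy, gamma=0.9, tol=1e-8):
    """Iteratively apply Eq. (18) until the value estimates stop changing."""
    states = {x for (x, _) in mdp}
    V = {x: 0.0 for x in states}  # value functions initialized to 0
    while True:
        delta = 0.0
        for x in states:
            v_new = sum(
                pi_u * p * (r + gamma * V[x_next])
                for u, pi_u in policy[x].items()
                for x_next, r, p in mdp[(x, u)]
            )
            delta = max(delta, abs(v_new - V[x]))
            V[x] = v_new
        if delta < tol:
            return V

# An arbitrary stochastic policy pi(u | x), chosen purely for illustration.
policy = {
    "low_temp": {"heat": 0.8, "idle": 0.2},
    "on_spec":  {"heat": 0.1, "idle": 0.9},
}
print(policy_evaluation(MDP, policy))
```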

A visualization of the policy iteration algorithm is shown in Fig. 5 and is as follows (Sutton and Barto, 2018):

\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} v^*    (19)

where \xrightarrow{E} and \xrightarrow{I} represent the policy evaluation and policy improvement steps, respectively. From Fig. 5, the agent starts with some arbitrary policy. Initially, there exists a large gap between Vπ and π, illustrating much room for improvement. As policy iteration continues, the gap is reduced until V(x), π → V*(x), π*. Presently, policy iteration is rarely used due to the high computational expense of the iterative steps required for each policy evaluation.

Fig. 5. A visualization of the policy iteration algorithm. Original image from Silver (2018).

Value iteration finds the optimal policy through identifying the optimal value functions rather than evaluating many policies. Intuitively, it is a special case of policy iteration where the policy evaluation is terminated after one step. After identifying the value functions, the optimal policy can be trivially extracted by simply traversing through the states and identifying the actions corresponding to the highest values. Note here that extraction of the optimal policy using V(x) is only possible if a perfect model of the system is provided. Using the model, one can find the state transition probabilities, and subsequently identify the actions with the highest probabilities of traversing to the high value states. Without a model, Q(x, u) must be identified instead to extract the optimal policy. The one-step policy evaluation for the value function and action-value function, respectively, is:

v_{k+1}(x) = \max_{u} \sum_{x', r} p(x', r \mid x, u) \left[ r + \gamma v_k(x') \right]    (20)

q_{k+1}(x, u) = \sum_{x', r} p(x', r \mid x, u) \left[ r + \gamma \max_{u'} q_k(x', u') \right]    (21)

where the max operation ensures that each vk(x) is updated using only the maximizing action; thus, ultimately finding v*(x). After convergence of all v*(x), an agent can be initialized in any arbitrary state and still behave optimally as long as the agent takes the maximizing action in each successive state. Note that both policy and value iteration bootstrap their current estimates using previously calculated values, vk(x′). This concept is used by RL to increase data efficiency and allow updates to explicitly capture long-term trajectory information; however, the method also introduces unintended biased updates.
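A minimal tabular value iteration sketch applying the max-backup of Eq. (20) until convergence and then extracting the greedy policy from the model, as described above; the MDP table is the same hypothetical two-state example used earlier.

```python
MDP = {  # (x, u) -> list of (x_next, reward, probability); hypothetical example
    ("low_temp", "heat"): [("on_spec", 1.0, 0.9), ("low_temp", -1.0, 0.1)],
    ("low_temp", "idle"): [("low_temp", -1.0, 1.0)],
    ("on_spec", "heat"):  [("on_spec", 1.0, 0.2), ("low_temp", -1.0, 0.8)],
    ("on_spec", "idle"):  [("on_spec", 1.0, 0.95), ("low_temp", -1.0, 0.05)],
}

def value_iteration(mdp, gamma=0.9, tol=1e-8):
    """Apply the max-backup of Eq. (20) until convergence, then extract the greedy policy."""
    states = {x for (x, _) in mdp}
    actions = {x: [u for (s, u) in mdp if s == x] for x in states}
    V = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            backups = [sum(p * (r + gamma * V[xn]) for xn, r, p in mdp[(x, u)])
                       for u in actions[x]]
            v_new = max(backups)
            delta = max(delta, abs(v_new - V[x]))
            V[x] = v_new
        if delta < tol:
            break
    # Greedy policy extraction requires the model (the transition probabilities).
    policy = {x: max(actions[x],
                     key=lambda u: sum(p * (r + gamma * V[xn]) for xn, r, p in mdp[(x, u)]))
              for x in states}
    return V, policy

V_star, pi_star = value_iteration(MDP)
print(V_star, pi_star)
```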

In industrial problems, both policy and value iteration have limited utility because their updates are applied to all states x ∈ 𝒳 simultaneously (i.e., the value functions for all states are found simultaneously). In large multi-dimensional problems, performing even one iterative step may be infeasible. Asynchronous dynamic programming methods try to avoid this problem by only updating frequently visited states, while avoiding the update of states that are never visited. Doing so can reduce computation time considerably, but renders the agents less useful in states that are rarely encountered.

2.5.2. Monte Carlo methods
Unlike dynamic programming, Monte Carlo methods technically do not require a model of the system (a characteristic known as model-free). MC methods find the optimal policy by first estimating the average returns for different policies by sampling many sequences of states, actions, and rewards under said policy. As enough samples are generated, the averaged returns converge to the true expected returns. The average returns are updated after each trajectory. Due to the nature of MC updates, the most suitable systems are finite tasks with explicit terminal states, called episodic tasks. Discrete manufacturing is an example of an episodic task in process control. After the assembly of each object (cars, toys, etc.), the system terminates and starts anew. In episodic tasks, the value functions can be updated naturally after each terminal state. Typically, episodic tasks are rare in process control because most control systems are required to operate indefinitely. Processes with no terminal states are known as continuous tasks. To train a continuous task agent using the MC method, a maximum episode length should be initially specified so the agent can stop and update its value functions at certain intervals. In doing so, the agent can exploit its new learnings before continuing onward. Note here that all estimated value functions are independent and unbiased since no bootstrapping was used (opposite of dynamic programming); however, MC methods may suffer from large variances for systems highly corrupted by noise (Sutton and Barto, 2018). Moreover, exploration is mandatory in MC methods since the system model is not available to the agent. Only through exploration can the agent discover the value functions (state transition probabilities and rewards) for each action in each state, and subsequently, the optimal policy. Typically, exploration is conducted by starting in a random state at the beginning of each episode. After a sufficiently large number of episodes are explored, all states will be visited sufficiently many times.

Policy search in MC methods is similar to policy iteration. There are three main differences: 1) not all states are updated simultaneously; 2) value functions are updated using sampled data from the agent interacting with the environment; 3) qπ(x, u) is identified instead of vπ(x). In MC methods, the action-value functions are identified instead because a model is not provided to the agent. The value functions alone are not useful to the agent because the actions required to transition to the high value states are not known. Instead, the action-values provide the agent with explicit information on the expected returns for each action in each state. The iterative procedure to find the cumulative returns is given by:
(22)
Intuitively, training starts by the agent being initiated in an unknown system. Here, the agent traverses through the state space
by performing actions
under a policy, π, and collect rewards
. Eventually, the agent will reach a terminal state, concluding the first episode. Upon termination, a sequence of returns
can be calculated via the collected rewards throughout the trajectory using:
or:
(23)
where Gm is the discounted cumulative return received on the mth step. Next, the action-values, Q(x, u), are estimated for each step by using the states, actions, and returns:
(24)
where Q_k(x, u) represents the k-th action-value estimate and G_k corresponds to the k-th sampled return of performing u in x. Note that Eq. (24) is unsuitable for non-stationary processes because, as k → ∞, 1/k → 0 and the estimate stops adapting. For non-stationary processes, Eq. (24) becomes:
Q_{k+1}(x, u) = Q_k(x, u) + α[G_k − Q_k(x, u)]    (25)
where α ∈ (0, 1] is the learning rate (also called the step size). Here, α is bounded away from zero so that the update never vanishes, allowing continual adaptation in non-stationary problems. At the same time, α should not be too high, or the updates will be significantly affected by short-term noise. After each update, a new episode starts and the above procedure is repeated. As the number of episodes approaches infinity (k → ∞), Q(x, u) → q(x, u) (i.e., the estimated action-values Q(x, u) approach the true action-values q(x, u)). Once the action-value functions converge, the optimal policy can be extracted by:
π*(x) = argmax_u Q(x, u)    (26)
That is, the optimal policy simply performs the action corresponding to the highest expected return in each state. MC methods allow the agent to learn directly from trial-and-error experience without a system model; however, whole episodes must be completed before the agent can update its knowledge base. Such a procedure is unnatural in continuous systems (the most common case in process control), severely disadvantageous in systems with long episodes, and unlike human behaviour: humans typically learn immediately after feedback, not in pre-set increments. Temporal difference methods combine the ideas of dynamic programming and Monte Carlo methods to learn in this more immediate fashion.
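To make this procedure concrete, the following is a minimal sketch of tabular every-visit Monte Carlo control with ϵ-greedy exploration, following Eqs. (23), (25), and (26). The `env` object (with `reset()` and `step(u)` returning the next state, reward, and a termination flag), the action list, and the constant step size are illustrative assumptions, not part of any specific experiment in this paper.

```python
import random
from collections import defaultdict

def mc_control(env, actions, episodes=5000, gamma=0.9, alpha=0.05, eps=0.1):
    """Every-visit Monte Carlo control with a constant step size (Eq. (25))."""
    Q = defaultdict(float)                      # Q[(x, u)], initialized to 0

    def policy(x):
        # epsilon-greedy over the current action-value estimates
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda u: Q[(x, u)])

    for _ in range(episodes):
        # 1) sample one full trajectory under the current policy
        x, done, traj = env.reset(), False, []
        while not done:
            u = policy(x)
            x_next, r, done = env.step(u)       # assumed environment interface
            traj.append((x, u, r))
            x = x_next

        # 2) compute returns backwards: G_m = R_{m+1} + gamma * G_{m+1}  (Eq. (23))
        G = 0.0
        for x, u, r in reversed(traj):
            G = r + gamma * G
            # 3) incremental update of the action-value (Eq. (25))
            Q[(x, u)] += alpha * (G - Q[(x, u)])

    # greedy policy extraction (Eq. (26))
    states = {x for (x, _) in Q}
    return {x: max(actions, key=lambda u: Q[(x, u)]) for x in states}
```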

2.5.3. Temporal difference learning
Temporal difference (TD) methods are the most widely used RL algorithms today because of their simplicity and relatively cheap computational cost. TD methods combine the ability to learn from experience (like MC methods) with bootstrapping (like DP methods). TD methods do not require a model of the system and instead learn the dynamics from interactions. Moreover, TD methods do not need to wait until the termination of an episode before updating their value functions. Instead, TD methods can update immediately after x_{t+1} and R_{t+1} are received. The TD updates for the value and action-value functions are given by Eqs. (27) and (28), respectively (Sutton, 1988):
V(x_t) ← V(x_t) + α[R_{t+1} + γ V(x_{t+1}) − V(x_t)]    (27)
Q(x_t, u_t) ← Q(x_t, u_t) + α[R_{t+1} + γ Q(x_{t+1}, u_{t+1}) − Q(x_t, u_t)]    (28)
where ← denotes the update operator. Intuitively, the old value functions are corrected at each update using the TD error, scaled by a fixed amount (dictated by α). The TD error is given as:
δ_t = R_{t+1} + γ V(x_{t+1}) − V(x_t)
where δ_t denotes the TD error at time t, the first two terms, R_{t+1} + γ V(x_{t+1}), denote what the agent now believes the value function to be according to the latest interaction, and V(x_t) is the old value function. After the agent traverses each state-action pair many times, V(x_t) → v(x_t) (i.e., the estimated values converge to the true values). The action-values follow a similar paradigm. After convergence, the optimal policy can be extracted via Eq. (26).
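As a minimal illustration of Eq. (27), the sketch below estimates the state-values of a fixed policy from a stream of interactions; the `env` and `policy` objects, and the chosen constants, are assumed placeholders.

```python
from collections import defaultdict

def td0_prediction(env, policy, steps=100000, gamma=0.9, alpha=0.1):
    """Tabular TD(0) value prediction (Eq. (27))."""
    V = defaultdict(float)                     # V[x], initialized to 0
    x = env.reset()
    for _ in range(steps):
        u = policy(x)
        x_next, r, done = env.step(u)          # assumed environment interface
        delta = r + gamma * V[x_next] - V[x]   # TD error
        V[x] += alpha * delta                  # bootstrapped update
        x = env.reset() if done else x_next
    return V
```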

The algorithms presented in Eqs. (27) and (28) are called TD(0) because the agent updates its knowledge after just one action. TD(0) is a special case of the more general TD(λ) algorithm. Like DP, TD(0) also experiences high bias due to bootstrapping. However, as λ increases from 0 to 1, bootstrapping is reduced, and at TD(1) the algorithm becomes the MC method. The details of TD(λ) are somewhat abstract and are omitted here; a detailed description and supplementary code can be found in Reis (2017).

Like MC methods, TD methods are also model-free; therefore, action-values are typically learned and exploration is mandatory. TD methods typically explore using ϵ-greedy policies, where the agent performs a random action with probability ϵ ∈ [0, 1] and performs the return-maximizing action otherwise. Furthermore, ϵ is typically decayed throughout training: it starts at a high value during the initial phase, when the agent knows nothing, and eventually decays to a low value when training is almost complete.
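A minimal sketch of ϵ-greedy action selection with a decaying ϵ is shown below; the decay schedule and its constants are illustrative assumptions rather than values recommended in the text.

```python
import random

def epsilon_greedy(Q, x, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy action.

    Q is assumed to be a dict of action-values keyed by (x, u), e.g. a defaultdict(float).
    """
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda u: Q[(x, u)])

def decayed_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=50000):
    """Linear decay from eps_start to eps_end over decay_steps (illustrative constants)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```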

There are two popular TD methods with slightly different update steps: SARSA and Q-learning. SARSA is an on-policy algorithm, meaning that its behaviour policy is identical to its target policy. The target policy refers to the policy the agent eventually wants to find, typically the optimal policy. The behaviour policy, b(u|x), is how the agent actually behaves. If the target and behaviour policies are identical, the agent is on-policy. During training, an on-policy agent (assuming the target policy is the optimal policy) may quickly converge to a local optimum and never explore (since exploratory policies are typically not optimal), resulting in a sub-optimal solution. Contrarily, off-policy agents, such as Q-learning, may follow an equiprobable random policy (equal probability of selecting each action in every state) during training to conduct deep exploration and then switch to the optimal policy online. Moreover, off-policy agents are guaranteed to find the optimal policy under the assumption that each state-action pair is visited infinitely many times and b(u*|x) > 0 (i.e., the probability of picking the optimal action under the behaviour policy is not 0) (Sutton, 1988). Since SARSA is an on-policy approach, the action-value functions are updated through Eq. (28) using the quintuple (x_t, u_t, R_{t+1}, x_{t+1}, u_{t+1}). Compared to SARSA, Q-learning updates using only the four parameters (x_t, u_t, R_{t+1}, x_{t+1}) through Eq. (29), treating u_{t+1} as a decision variable that maximizes the action-value function:
Q(x_t, u_t) ← Q(x_t, u_t) + α[R_{t+1} + γ max_u Q(x_{t+1}, u) − Q(x_t, u_t)]    (29)
In Q-learning, u_{t+1} is not used because the actual u taken at time t+1 might differ from the target policy, since the algorithm is off-policy. To ensure the Q-values are still updated towards the optimal policy, Eq. (29) uses the max operation to force the update to use the optimal u_{t+1}. Overall, TD methods unify the best parts of DP and MC methods, allowing the agent to learn purely from experience while still performing within-episode updates to exploit the most recent learnings.
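The two update rules are contrasted in the short sketch below, where `Q` is assumed to be a dictionary of action-values keyed by (x, u) (e.g., a `defaultdict(float)`); the step size and discount factor are illustrative.

```python
def sarsa_update(Q, x, u, r, x_next, u_next, alpha=0.1, gamma=0.9):
    """On-policy update (Eq. (28)): bootstraps on the action actually taken next."""
    target = r + gamma * Q[(x_next, u_next)]
    Q[(x, u)] += alpha * (target - Q[(x, u)])

def q_learning_update(Q, x, u, r, x_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update (Eq. (29)): bootstraps on the greedy action in x_next."""
    target = r + gamma * max(Q[(x_next, u_prime)] for u_prime in actions)
    Q[(x, u)] += alpha * (target - Q[(x, u)])
```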

2.5.4. Summary of different RL methods
A summary of the characteristics of the DP, MC, and TD methods is shown in Table 3. At a high level, dynamic programming requires a model to learn the value functions, while both MC and TD methods can learn from sampled states, actions, and rewards alone. DP and TD methods use bootstrapping to estimate value functions; that is, they estimate the current value function based on previously estimated value functions. This approach is data efficient but introduces unwanted bias into the estimates. On the other hand, MC methods estimate each value function independently by sampling many trajectories, avoiding estimation bias; however, this instead introduces high variance for noisy systems. In terms of computational cost, DP methods require the most because all value functions are updated simultaneously. Comparatively, MC methods only update the value functions visited in the sampled trajectories, and updates are conducted at the end of each episode. TD methods update the value function immediately after an experience and are also very efficient since they only update the value functions of the visited states. For exploration, DP methods do not require any because the model of the system (both transition probabilities and expected rewards) is known to the agent at all times. For MC methods, exploration is conducted by starting at random states when episodes are reset. In TD methods, the agent will occasionally perform a random action. A detailed introductory example is provided in Section 3 to build additional intuition and understanding of the implementation procedure.

Table 3. A comparison of DP, MC, and TD methods.

                        Dynamic Programming             Monte Carlo              Temporal Difference
Requires model          Yes                             No                       No
Estimate bias           High                            Low                      High
Estimate variance       Low                             High                     Low
Computational cost      High                            Medium                   Low
v(x) update             All states simultaneously       After a trajectory       After an experience
Exploration             Not needed (all states update)  Random initialization    Performing a random action
2.6. Function approximation methods
Function approximation is widely used in industrial applications of RL. Function approximation methods are briefly discussed in this section to provide a more comprehensive overview of RL methods.

Originally, RL was proposed to solve MDPs, which are discrete representations of the optimal control problem. For discrete tasks, the value functions are stored in a table, known as the Q-table, with the x- and y-axes being the states and actions. A visual representation of the Q-table for a system with n states and m actions is shown in Fig. 6. However, the states and actions of a typical process control application are both multi-dimensional and continuous. The storage of value functions in such complex tasks could be prohibitively large and intractable. Extending RL solutions to such systems requires the value functions to be generalized using a parameterized functional form. Function approximation can be used to address this issue. The approximate value function is given as:
v̂(x, w) ≈ v_π(x)    (30)
where v̂(x, w) is a continuously differentiable approximation of the value function and w ∈ R^d is the weight vector, where d is much smaller than the number of states. Supervised learning models, linear or nonlinear, can be used to represent v̂. In the tabular case, learning is decoupled – updates of one value function do not impact any other value functions – allowing the optimal value functions to be found for all states. This no longer holds in the approximation case. Instead, as an element w_i of w is updated, all approximate value functions utilizing w_i will change accordingly, making it impossible for all value functions to be exactly correct (Sutton and Barto, 2018). As such, RL methods using function approximation can approach, but never achieve, perfect optimal control. To focus the accuracy of the function approximation on the most important states, the state distribution μ(x) must be defined. Oftentimes, μ(x) can simply be the fraction of time spent in x and can be computed trivially as:
μ(x) = η(x) / ∑_{x′} η(x′)    (31)
where η(x) is the amount of time spent in state x and ∑_{x′} η(x′) is the total time spent in all states. Combining the state distribution with the mean squared error, the mean squared value error objective function can be defined as:
VE(w) = ∑_{x ∈ 𝒳} μ(x)[v_π(x) − v̂(x, w)]²    (32)
Ultimately, the goal of approximate value function methods is to find the optimal weight vector w* for which VE(w*) ≤ VE(w) for all w. The weights are typically identified using gradient methods such as stochastic gradient descent (SGD). SGD is a popular method in machine learning due to its ease of use and its advantages in big data applications. SGD is a special case of the more general gradient descent algorithm. In gradient descent, all available data are used to compute the gradient; however, such a method is computationally infeasible in big data applications. Instead, SGD randomly samples a subset of the data for the gradient computation. More detailed explanations of SGD can be found in Goodfellow et al. (2015). On each step, SGD adjusts the weights w slightly (with the adjustment size controlled by α) in accordance with the gradient of the loss function (Bottou, 2010):
w_{k+1} = w_k − (1/2) α ∇[v_π(x_t) − v̂(x_t, w_k)]²    (33)
w_{k+1} = w_k + α[v_π(x_t) − v̂(x_t, w_k)] ∇v̂(x_t, w_k)    (34)
where w_{k+1} is the weight vector after the k-th update. Furthermore, α and ∇ are the learning rate and gradient operator, respectively. A larger α results in larger update steps and is typically used at the beginning of training. As k → ∞, α → 0 to ensure that the weights do not oscillate in noisy systems. In Eq. (33), it was assumed that the true value v_π(x_t) was known; this is possible in theory but not in real-world applications. Instead, V_k(x_t) is used in practice, where V_k(x_t) is an unbiased estimate of v_π(x_t) satisfying E[V_k(x_t)] = v_π(x_t).
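A minimal sketch of the stochastic-gradient update of Eq. (34) is given below; the `features` function and the use of a sampled return (or any other estimate) as the `target` in place of v_π(x_t) are assumptions for illustration.

```python
import numpy as np

def sgd_value_update(w, features, x, target, alpha=0.01):
    """One stochastic-gradient step following Eq. (34).

    `target` plays the role of v_pi(x_t); in practice a sampled return G_t
    (Monte Carlo) or another unbiased estimate is substituted.
    """
    s = features(x)                 # feature vector s(x), same length as w
    v_hat = np.dot(w, s)            # current linear prediction (Eq. (35))
    # for a linear model the gradient of v_hat with respect to w is s(x) (Eq. (36))
    return w + alpha * (target - v_hat) * s
```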

Fig. 6. An example of the Q-table for a system with n states and m actions.

Now that the objective function and the optimization algorithm are defined, the simplest function approximation approach will be introduced. Suppose that each state x can be represented using a feature vector:
s(x) = (s_1(x), s_2(x), …, s_d(x))ᵀ
that has the same length as w. Additionally, each component s_i(x) of s(x) is a function known as a feature of x. In linear methods, the features form a linear basis for the set of approximation functions and are known as the basis functions (Sutton and Barto, 2018). Using the basis functions, the value function can be approximated as a linear combination:
v̂(x, w) = wᵀ s(x)
or:
v̂(x, w) = ∑_{i=1}^{d} w_i s_i(x)    (35)
Note that Eq. (35) is linear-in-parameter. Although such a model does not explore interaction effects between features, it is still effective for many value functions. The gradient of v̂(x, w) from Eq. (35) is nothing more than:
∇v̂(x, w) = s(x)    (36)
Combining Eqs. (36) and (34) yields:
w_{k+1} = w_k + α[v_π(x_t) − v̂(x_t, w_k)] s(x_t)    (37)
This is a very simple and elegant solution that is easy to implement and interpret. Additionally, the minimizer of VE(w) is unique due to the linear structure. Lastly, the design of the basis functions will be briefly explained. Here, the simplest basis function will be introduced: the polynomial basis function. Polynomial functions offer high flexibility in their model structure and are intuitive to implement. The general structure of a two-regressor polynomial model is given as:
v̂(x, w) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2    (38)
where w_i are the model weights and w_0 is the model bias. Here, x_1 and x_2 are states from an arbitrary system. A more illustrative example of designing the features for an agent in a process control problem is given below.
Suppose we are tasked with optimally controlling the temperature of a CSTR with two states: mass flow rate x_1 and temperature x_2. Due to limited hardware and continuous states, the system requires function approximation. The task is to construct a feature vector that captures both states. A trivial design would be:
s(x) = (x_1, x_2)ᵀ
where Eq. (35) becomes:
v̂(x, w) = w_1 x_1 + w_2 x_2
Such a design is simple and intuitive, but it does not explore the interaction effects between x_1 and x_2. In chemical reactors, the interaction effect between mass flow rate (x_1) and temperature (x_2) can be critically important for control and is used to estimate the enthalpy of the reaction (Borgnakke and Sonntag, 2008). Moreover, the value function may contain an affine component that does not depend on either x_1 or x_2. In such a case, that relationship is lost when both x_1 and x_2 are 0 because v̂(x, w) must then equal 0. To accommodate these factors, the feature vector could instead be:
s(x) = (1, x_1, x_2, x_1 x_2)ᵀ
where the constant feature 1 captures affine relationships and x_1 x_2 captures interaction effects.
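The feature design discussed above can be sketched as follows; the weight values in the usage line are purely illustrative.

```python
import numpy as np

def cstr_features(x1, x2):
    """Feature vector for the two-state CSTR example: a constant term for
    affine relationships, the raw states, and their interaction x1*x2."""
    return np.array([1.0, x1, x2, x1 * x2])

def value_prediction(w, x1, x2):
    """Linear-in-parameter approximation of Eq. (35): v_hat = w^T s(x)."""
    return float(np.dot(w, cstr_features(x1, x2)))

# example usage with an arbitrary (illustrative) weight vector
w = np.array([0.5, -0.1, 0.02, 0.003])
print(value_prediction(w, 12.0, 350.0))
```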

There exist many more advanced basis functions and function approximation methods, such as Fourier basis functions, coarse coding, tile coding, radial basis functions, and nonlinear neural network function approximation. For detailed information regarding these methods, please refer to Chapter 9 of Sutton and Barto (2018).

2.6.1. Emergence of deep reinforcement learning
This subsection introduces major contributions in the deep RL field, where deep neural networks are used for function approximation.

Fig. 7 shows the interest in RL worldwide from 2007 - 2019. It can be seen that RL had rather stagnant interest until 2015, when significant growth began. The first contributing factor to this growth was the publication of Mnih et al. (2013) in 2013, where the authors introduced a deep RL algorithm, called deep Q-learning network (DQN), that was able to play many Atari games from image inputs alone using the same algorithm (though the agent had to be re-trained for each game). Performance of the agent was compared to human players after training for 10 million frames (approximately 56 hours of continuous play time1). To ensure a fair comparison, the skill of the agent was handicapped by human features such as delays in response time and was only able to see information presented on the screen. DQN surpassed human level in half of the games. DQN used Q-learning with deep neural networks for function approximation; the complete details can be found in Mnih et al. (2013). In DQN, continuous states can be used, but the actions were discrete because Atari games typically had discrete inputs. Furthermore, like other TD methods, DQN experienced high bias. The algorithm was a massive contribution in RL literature but lacked continuous actions required for process control.

Fig. 7. Growth in search results of reinforcement learning on Google from 2007 - 2019. Figure from Trends (2017).

The first RL algorithm that can handle both continuous states and actions naturally was developed in Silver et al. (2014). The algorithm, called deterministic policy gradient (DPG), was a policy iteration method trained using Monte Carlo methods. DPG deterministically maps continuous states to continuous actions through a neural network that approximates the optimal policy. Although DPG can handle both continuous states and actions and is another major contribution, it is very computationally expensive, experiences high variance, and is unnatural for continuous tasks due to the Monte Carlo training method. More details regarding the DPG algorithm can be found in Silver et al. (2014).

In 2015, RL evolved again into an algorithm that may be suitable for online optimal control. Lillicrap et al. (2015) pushed the boundary of RL literature by unifying the ideas of Mnih et al. (2013) and Silver et al. (2014) into one actor-critic RL algorithm known as the deep deterministic policy gradient (DDPG). The general architecture is shown in Fig. 8. Here, the DPG was used to map states into actions and was known as the actor. Moreover, the DQN was the critic and was used to identify the action-values used to update the DPG without resorting to MC methods. Originally, DQN and DPG suffered from high bias and high variance, respectively. However, by unifying the two algorithms, both the bias and the variance can be significantly reduced (Lillicrap et al., 2015). Additionally, policy gradient methods are typically trained using Monte Carlo methods at the end of episodes (Sutton et al., 1999); in DDPG, this limitation is overcome by training the DPG with the gradient of the critic (i.e., the derivative of the Q-values with respect to the actor's weights) using gradient ascent. Intuitively, the DPG's weights are updated to maximize the action-value, Q(x, u). Traditionally, continuous-action RL algorithms struggle with exploration because techniques such as ϵ-greedy assume a discrete action space. DDPG instead explores through Ornstein-Uhlenbeck exploratory noise:
u′(x_t) = u(x_t) + 𝒩_t    (39)
where u′(x_t) is the input signal corrupted by the Ornstein-Uhlenbeck exploratory noise 𝒩_t, given by the following stochastic differential equation (Uhlenbeck and Ornstein, 1930):
d𝒩_t = θ(μ − 𝒩_t) dt + σ dW_t    (40)
where θ > 0, σ > 0, and W_t denotes a special case of a continuous-time stochastic process known as the Wiener process. Detailed information regarding the Wiener process and its properties can be found in Karatzas and Shreve (1991). Sometimes, Gaussian white noise is used for exploration; however, de-correlated random signals are ineffective for deep exploration since the signal has zero average effect, resulting in no displacement in any particular direction and simply introducing oscillation into the process. Intuitively, 𝒩_t is a temporally correlated process that promotes deep exploration and needs to be tuned to suit different environments. A more detailed theoretical overview of the DDPG algorithm can be found in Lillicrap et al. (2015).
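A simple Euler-Maruyama discretization of Eq. (40), used to perturb the actor's output as in Eq. (39), is sketched below. The parameter values and the scalar stand-in for the actor's output are illustrative assumptions and are not taken from Lillicrap et al. (2015).

```python
import numpy as np

class OUNoise:
    """Euler-Maruyama discretization of Eq. (40): dN = theta*(mu - N)*dt + sigma*dW."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.n = np.ones(dim) * mu

    def sample(self):
        dw = np.sqrt(self.dt) * np.random.randn(*self.n.shape)   # Wiener increment
        self.n += self.theta * (self.mu - self.n) * self.dt + self.sigma * dw
        return self.n

# exploration as in Eq. (39): u_noisy = actor(x) + N_t
noise = OUNoise(dim=1)
u_noisy = 0.5 + noise.sample()   # 0.5 stands in for the actor's deterministic output
```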

Fig. 8. The architecture of the deep deterministic policy gradient algorithm (Lillicrap et al., 2015).

In DDPG, four neural networks are used collectively when identifying the optimal policy. Furthermore, the neural networks are implicitly trained on each other's weights, which can cause function approximation errors and lead to sub-optimal policies. In Fujimoto et al. (2018), the authors introduced a novel way to minimize these errors by delaying policy updates. The new algorithm was also tested on different benchmark environments and showed better performance in each case.

DDPG was the first RL algorithm that effectively solved many high-dimensional continuous control tasks. Indeed, most previous actor-critic or policy optimization RL algorithms were never demonstrated to work on high dimensional state and action spaces due to instability caused by catastrophic interference (a symptom where newly learned information would undesirably overwrite previously learned knowledge) or were simply learning too slow for industrial applications (Deisenroth and Neumann, 2013). Instead, the abilities of DDPG were demonstrated on many high dimensional control problems in the MuJoCo physics environment, some being as large as 102 observations and 9 actions (Lillicrap et al., 2015).

At around the same time, Schulman et al. (2015) proposed another deep RL method utilizing policy optimization, which was also shown to be effective on high-dimensional continuous control tasks. The algorithm, known as trust region policy optimization (TRPO), guarantees monotonic improvement after each update step through careful parameter updates governed by a KL-divergence threshold between the new policy and the existing policy. More specifically, the update constraint guarantees that the new policy lies within the trust region: a subspace in which the local function approximations are reliable. Unlike DDPG, where there exist an actor and a critic, TRPO identifies the policy directly. Because of this, Lillicrap et al. (2015) argue that TRPO is much less data efficient. Moreover, due to the update constraint, the parameter updates must be solved using a conjugate gradient method, which may be difficult to implement.

To improve upon the previous flaws of TRPO, Schulman et al. (2017) published a new RL algorithm in 2017 called proximal policy optimization (PPO). Compared to TRPO, PPO was much simpler to implement, more general, and more data efficient. Specifically, PPO implements the update constraint directly in the objective function as a penalty, so SGD can be used instead. Test cases showed that PPO was able to achieve much better performance than TRPO after training for the same amount of time on various continuous control tasks. Unfortunately, Schulman et al. (2017) did not compare its performance against DDPG, making it difficult to conclude whether it is a better algorithm.

In 2018, Haarnoja et al. (2018) introduced a new actor-critic algorithm to improve the sample efficiency and convergence factors of previously introduced methods. In the new actor-critic algorithm, both expected reward and action randomness are maximized together during optimal policy search. Ultimately, this new algorithm was shown to surpass state-of-the-art performance on various continuous control benchmarks while being stable during its learning process.

A complete timeline of popular deep reinforcement learning methods is shown in Fig. 9. Deep RL methods began with continuous states and discrete actions but quickly evolved to accommodate continuous actions as well. Unfortunately, MC methods were initially required for training, resulting in high variance. DDPG was introduced to resolve this shortcoming; deep algorithms from 2015 onward can handle both continuous states and actions. For a comprehensive performance comparison between state-of-the-art continuous control deep RL algorithms, please refer to Henderson et al. (2018).

Fig. 9. Date of significant breakthroughs in deep reinforcement learning.

2.6.2. Deep RL and its implications on industrial control
The implementation potential of RL before the emergence of deep function approximation was quite limited because its application was confined to discrete state and action systems. Using deep function approximation, RL succeeded in solving various complex tasks such as those in Lillicrap et al. (2015) and Mnih et al. (2013). However, most deep RL applications thus far have been in simulated environments and have not been implemented in safety-sensitive industrial settings. Furthermore, using deep learning for function approximation requires proper tuning of the neural networks, and results are not repeatable because neural networks are typically initialized with random weights and the exploration phase is also stochastic. Moreover, training agents that leverage large neural networks can take up to several days depending on the computation power available. Assuming that the deep function approximators are tuned properly, two major hurdles for deep RL still exist: i) the black-box nature of the control policy; ii) the connectivity of RL to the industrial distributed control system (DCS).

Firstly, the black-box nature of RL introduces risks to process operations because predicting the behaviour of RL may be difficult. In MPC, the control actions can be understood by analyzing the system model. However, in RL, it is nearly impossible to understand why or how the agent learned the control policy. One could test the behaviour of RL by providing different operating conditions and observing the response of the agent; however, control policies using deep function approximation are likely to be highly nonlinear, which implies that operating conditions close to the tested points may still result in very different behaviours. Secondly, reliable RL communication with industrial DCS systems might pose challenges because there exists no industrially accepted RL software as of 2020. Moreover, deep learning control policies cannot be directly imported into modern industrial DCS because of a lack of support for large, interacting neural networks. One possible solution is to build the RL agent in external software and leverage Open Platform Communication (OPC) to communicate with modern control systems such as those used by AspenTech's Advanced Process Control Suite (AspenTech, 2019).

Due to the stochastic and black-box nature of deep learning and the difficulty of implementing such a system into modern control systems, deep RL may still require further research before it is industrially ready.

3. Applications of reinforcement learning
This section starts with a literature review of the most influential RL papers. Then, RL will be compared to traditional control frameworks. Afterwards, a detailed tutorial in which RL is applied to a pilot-scale industrial pumping system will be presented. Finally, a literature review of other potential applications of RL in the process control industry will be introduced.

3.1. Renowned triumphs
RL first gained massive publicity after the publication of Mnih et al. (2013) and Mnih et al. (2015), where a general agent was able to successfully conquer many ATARI video games using image inputs alone. However, such games are near-deterministic, and the state space was sufficiently small that even rules-based methods were feasible in such systems (though no previous algorithm could learn the games through one general algorithm). Although the algorithm was impressive, previous methods were also able to approach the ultimate performance achieved. To conquer a task never accomplished before, Silver et al. (2016) and DeepMind (2016a) were published in early 2016. The two studies introduced an RL algorithm to conquer Go, a board game invented more than 3000 years ago in China. Go is known as the most challenging classical game for AI due to its massive state and action space (more than 10^170 possible states) and the requirement to defeat opponents with different play styles. Before AlphaGo, AI solutions to Go struggled against even amateur players; AlphaGo, however, was able to convincingly defeat the world's best Go player, Ke Jie. At the beginning, the algorithm used supervised learning to obtain fundamental knowledge from amateur-level players. Then, expert-level players were used to learn advanced strategies. After confidently surpassing the experts, the agent continued to perfect itself through self-play, ultimately becoming the world's best Go player (Silver et al., 2016; DeepMind, 2016a). Intuitively, these experiments demonstrate the potential of RL to identify hidden patterns and provide valuable contributions to modern engineering beyond what is already known.

The original AlphaGo contained human-engineered features that were believed to aid the agent in learning. Interestingly, DeepMind believed the opposite: that the agent's skill was handicapped by these features. This led to the publication of AlphaGo Zero (zero referring to zero human knowledge), a version of AlphaGo without human bias (Silver et al., 2017b). In AlphaGo Zero, all human-engineered features were removed, leaving only the locations of the black and white stones as states. Within 40 days of training, AlphaGo Zero, starting tabula rasa, was able to surpass the best performance ever achieved by AlphaGo through pure self-play, a feat only achievable through RL. Moreover, only 3 days of training were needed for AlphaGo Zero to reach world-championship level. The changes also made AlphaGo Zero more efficient, consuming less than 10% of the power and using only 4 tensor processing units (TPUs) compared to the 48 used previously.

By late 2017, AlphaZero was released following the ideas of AlphaGo Zero: a general agent taught itself how to play Chess, Shogi, and Go and was able to defeat the world-champion program in each respective case (DeepMind, 2017a; Silver et al., 2018). The world's best Chess player in history, Magnus Carlsen, had a peak FIDE ELO (a measure of skill assigned by FIDE, a world-renowned Chess organization) of 2882. Using traditional methods such as supervised learning, the ideal AI would be capped at 2882, representing zero replication error. Within just 200,000 training steps, AlphaZero was able to achieve an ELO above 3300 from pure self-play. Within 300,000 steps (4 hours of physical time), AlphaZero surpassed the world's best Chess engine, Stockfish (Chabris, 2015). Comparing the two Chess engines, Stockfish required decades of refinement by expert engineers, whereas AlphaZero was simply initialized tabula rasa and became the best after 4 hours. Most impressively, AlphaZero demonstrated RL's ability for long-term decision making; that is, it sacrificed many pieces in the early game to obtain a significant advantage in the end game some thirty moves in the future. Furthermore, AlphaZero only searches on the order of 10^4 moves per turn, compared to traditional Chess engines that search up to 10^7 moves. Moreover, the same algorithm was used to learn Shogi and Go and was able to defeat the respective best game engines, Elmo and AlphaGo Zero.

The achievements of RL up until this point were remarkable; however, all of the previous applications assumed a perfect-information system where all system states are perfectly observable. For example, you cannot hide the location of your pieces in Chess from your opponent. Additionally, large amounts of time were allowed for the computer to find the optimal action. Unfortunately, systems in the real world may contain fast dynamics and are littered with incorrect, unmeasurable, and/or unreliable information. To demonstrate RL's ability to perform in a real-time, partially measurable setting, AlphaStar was released in late 2018 (DeepMind, 2019). Here, the agent played a game called StarCraft II, a real-time strategy game where the player is the general of an army and is tasked with building various structures for military, resource, or energy needs in order to ultimately defeat the opponent. Compared to previous games, StarCraft poses a highly challenging (most humans would find it difficult to play) and real-time environment where the opponent's moves are hidden and unknown. Additionally, the numbers of states and actions are near infinite. Here, the agent must evaluate fast enough to win real-time battles while managing the required resources of its army. After training, AlphaStar was able to decisively defeat two of the best StarCraft II players using pixel inputs alone. Moreover, AlphaStar was not given any information unavailable to human players and was handicapped to be at human level (e.g., the agent could not perform thousands of actions per second). AlphaStar showcased RL's ability to react to unexpected situations, perform real-time high-dimensional action selection, and conduct hierarchical long-term planning. Such characteristics could be very applicable in industrial process control for fault-tolerant control or high-dimensional multi-variate optimal control.

All of the above applications assumed a single-agent environment. In industrial process control, the agent must also understand the consequences of its actions on the overall system. RL's abilities in multi-agent, partially measurable systems were first demonstrated on DotA 2 by OpenAI. DotA, like StarCraft, is a real-time, high-dimensional strategy game where each team tries to defeat the opponent. Unlike StarCraft, there are five players per team; therefore, the agent's interaction effects with the other agents must also be considered. Additionally, the time horizon per game can be up to 80,000 time steps, a dramatic increase compared to Chess or Go, which typically end within 150 turns (OpenAI, 2018b). At each time t, the agent observes 20,000 continuous observations and has access to 1000 different actions. The reward function of the agent contains two components: individual performance and team performance. To enhance cooperation among the independent agents, a separate hyper parameter called team spirit, denoted here as ϕ, was used to specify the importance of the individual reward function compared to the team reward function. Throughout the game, team spirit is annealed from 0 to 1 to communicate that, in the end game, only the team reward function matters. The reward function is given as:
r_t = (1 − ϕ) r_individual + ϕ r_team    (41)
As of April 2019, OpenAI Five was able to defeat the best DotA 2 teams in the world (OpenAI, 2018a).

State-of-the-art RL research was first applied to near-deterministic low dimensional systems, eventually transitioning to complex video games that reflect the uncertain and stochastic nature of the real world. RL was shown to effectively handle partially measurable, long horizon, and high dimensional systems. Additionally, RL can quickly react to unexpected situations, learn to behave optimally in a team environment and is feasible for real-time applications with fast dynamics. Most importantly, RL was shown to be a general algorithm that can be used for different applications. Such characteristics hold huge implications for useful applications in industrial process control.

3.2. Comparison with common advanced control frameworks
A typical control framework is shown in Fig. 10. Real-time optimization (RTO) sits in the top layer and solves complex steady-state optimization problems to find the optimal steady states with respect to a desired performance metric (Huesman et al., 2008). Typically, this layer evaluates on an hourly time scale. The optimal steady states are then communicated to the MPC layer, where dynamic optimization is performed on a minutes scale to identify the optimal state and input trajectories required to arrive at the optimal steady states. Typically, the objective function in this layer is given by:
min_{u_0, …, u_{H−1}} ∑_{t=0}^{H−1} (x_tᵀ Q_mpc x_t + u_tᵀ R_mpc u_t)    (42)
due to its convex nature (Mayne and Rawlings, 2017). Here, H denotes the prediction and control horizon, and Q_mpc and R_mpc are the tuning matrices for the state and input costs, respectively. Oftentimes, plant managers do not allow direct manipulation of the process actuators by MPC due to safety concerns. In these scenarios, the state trajectories are computed from the process model using the ideal input trajectory determined by the MPC and are communicated to the regulatory control layer, where PIDs are typically used to track the set points. On a simplified level, RTO identifies the optimal set points for the economic objectives to be met, while MPC finds the optimal trajectory to achieve the desired set points. More recently, researchers began to intertwine ideas from RTO with MPC, placing the economic objective of RTO directly into the MPC objective function. Such a strategy is now known as economic model predictive control (EMPC) (Ellis et al., 2014).

Fig. 10. The traditional control architecture.

The gain in optimality after the addition of each control layer is shown in Fig. 11. The objective of each additional layer is to increase the control optimality with respect to performance metrics. In typical processes, an optimal operating condition exists at a boundary. However, the optimal operating point cannot be achieved because of uncertainty and other factors.

Fig. 11. Optimality of each control layer.

For each control layer, there exists a set of distinct algorithms such as PIDs for regulatory control and a variant of MPC for the upper layers. Theoretically, the flexible nature of RL may enable it to potentially replace any of the above layers due to its general nature. For example, if RL were to replace MPC, the agent’s reward function simply has to be the negative of Eq. (42), with the actions being the recommended set points. For the regulatory layer, the reward function can remain the same, but the action would result in direct control of the system’s actuators. Finally, RL as an EMPC would have the economic objective added onto Eq. (42) as the reward function. However, proper design of the state space, action space, and reward function must all be considered during the design of the agent and will most likely be more challenging compared to implementing the simpler algorithms.
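To illustrate this point, the sketch below defines an RL reward as the negative of the quadratic stage cost in Eq. (42); the tuning matrices and the example state and input are illustrative placeholders.

```python
import numpy as np

def stage_reward(x, u, Q_mpc, R_mpc):
    """Reward defined as the negative of the quadratic stage cost of Eq. (42)."""
    x, u = np.atleast_1d(x), np.atleast_1d(u)
    return -(x @ Q_mpc @ x + u @ R_mpc @ u)

# illustrative tuning matrices for a 2-state, 1-input system
Q_mpc = np.diag([1.0, 0.5])
R_mpc = np.array([[0.1]])
print(stage_reward([0.2, -0.1], [0.05], Q_mpc, R_mpc))
```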

Perhaps one of the biggest advantages of RL is its rapid online computation time. Using most solvers, MPCs have a computational complexity on the order of O(H³(n + m)³), where H is the horizon and n and m are the dimensions of the states and actions, respectively (Richter et al., 2012). Even after exploiting the problem structure, MPCs' computational complexity improves only to O(H(n + m)³) (Dunn and Bertsekas, 1989; Wright, 1997; Wang and Boyd, 2008). Therefore, the online computation time may be infeasible for large systems and/or for systems with long control horizons. On the other hand, RL's optimal policy is pre-computed offline, making online evaluation extremely fast. Such a method is very similar to the concept of parametric programming from explicit MPC (Bemporad et al., 2002). However, RL still has to be trained offline to identify the optimal policy and may require hundreds of thousands of interactions before reaching a near-optimal policy. For applications where offline computation time is not of concern, RL might be the preferred method.

Another significant difference is that RL is model-free. That is, a model is only required for initial training of the agent, simply to reduce training time, and is not used during real-time implementation. In contrast, the identified model is used exclusively online in MPC and can potentially lead to sub-optimal control. Offset-free control is used in MPC to overcome offset errors through online model modification (Pannocchia et al., 2015); however, it may not work well for very noisy processes. Furthermore, some complex process dynamics might be difficult to model explicitly and could increase online computation time. Adaptation in RL is conducted by changing the control policy directly and occurs after each control input. Model identification is not required, which could be advantageous for systems where accurate process models are not mathematically identifiable. However, the continually adaptive control policy of RL is black-box by nature and could pose safety concerns. Moreover, the speed of adaptation is a function of the learning rate and must be tuned. A low learning rate would render the adaptive feature meaningless. Conversely, a high learning rate may result in unstable control of the process, especially in processes with poor signal-to-noise ratios, because the agent will learn inaccurate system dynamics.

In terms of optimization window, MPC considers each future stage to be equally important when computing for the optimal input trajectory. On the other hand, traditional RL considers each consecutive stage to be of less value (due to the discount factor). Recently, Asis et al. (2020) published an RL algorithm formulated in a fixed-horizon fashion identical to MPC. In doing so, the authors also demonstrated the increased stability and effectiveness of the new algorithm.

Lastly, RL has very few hyper parameters in the tabular case; however, the initial design of RL typically requires process experts to configure. First of all, the state and action spaces must be properly configured to the regions of operation. Additionally, the reward function requires careful design so that unintended behaviour does not occur. For example, if the reward function does not contain a cost for manipulating the inputs, the agent may continuously change the input in noisy systems, leading to unwanted oscillation and equipment wear. As for convergence of the control policy, as long as the learning rate is annealed to near zero, RL should converge sufficiently well in most cases.

A high-level comparison between RL and MPC is shown in Table 4. In addition to the above comparisons, RL was also compared to the industrial MPCs currently available on the market because they exist as proven technology, not just academic studies. Most industrial MPCs are in the form of dynamic matrix control (DMC), an early variant of MPC. In any real application, the process models are never perfect, resulting in near-optimal solutions. Additionally, the online computation time for DMC is high, especially for nonlinear systems; RL can perform at the same speed regardless of system linearity. For online implementation, RL must explore random actions to adapt (an idea that sounds risky for live processes). Interestingly, industrial DMCs indeed perform random actions online for model adaptation. Typically, DMCs are initialized in a smart step mode, where the system performs random step tests to calibrate the identified model to the real process. Afterwards, the operators switch the system to a calibrate mode, in which the system continues to perform step tests that are much more infrequent and lower in magnitude. Such a model adaptation technique is analogous to RL, where exploration is plentiful initially but is eventually annealed to near zero. Currently, one major flaw that could be preventing industry-wide adoption of RL is the non-interpretable nature of its control policy. For a more detailed, explicit comparison between MPC and RL, see Gorges (2017). Although RL may appear to have many advantages compared to traditional optimal controllers, RL literature is still embryonic compared to MPC and lacks solutions to many fundamental problems. A list of shortcomings currently barring RL from industry-wide adoption is provided in Section 4. A summary of RL compared to MPC in the literature and industrial MPC is shown in Table 4.

Table 4. A comparison between RL, MPC in literature, and industrial MPC software.

                        Reinforcement Learning          Model Predictive Control     Typical Industrial DMC
Performance             Close to optimal                Optimal with perfect model   Close to optimal
Online comp. cost       Low                             High                         High
Offline comp. cost      Policy & model identification   Model identification         Model identification
Reliance on model       Only for training               At all times                 At all times
Online calibration      Exploratory moves               Various methods              Exploratory moves
Sensitivity to tuning   Low                             High                         High
3.3. An RL control experiment: Optimal control of a pump system
This section illustrates how to implement RL on a pilot-scale industrial process control system2. A tabular Q-learning RL agent will be implemented on the FLUIDMechatronix system from Turbine Technologies (shown in Fig. 12) for set-point tracking. The system is equipped with a variable frequency drive to regulate the output pressure. The output pressure, P, and the pump RPM (the manipulated input) will be used for this example. The operating ranges of the pressure and pump RPM are:
A FOMDP will be used to describe this system because the system measurements are available and the system dynamics are fast. The initial conditions of the system are given by:
(43)
(44)


Fig. 12. The FLUIDMechatronix experiment from Turbine Technologies (Technologies, 2019).

The steps required for implementation of RL algorithms are shown in Fig. 13. The RL set-up for this example is shown in Fig. 14. The RL agent will track the pressure set-point by changing the pump RPM on the single-input single-output (SISO) system. Specifically, the state and action of the agent are the current set-point tracking error:
(45) εt = Pt − Pt,sp
and the change in pump RPM, Δu, respectively. In Eq. (45), Pt denotes the pressure at time t and Pt,sp denotes the corresponding pressure set-point. Using the change in RPM as the action allows the agent to track multiple set-points. If the action were the pump RPM itself, the agent could only track one set-point, since it would simply map different tracking errors to the steady-state pump RPMs. The reward function of the agent is given by:
(46)
where Δu is the change in input (i.e., changing the input has a cost). Additionally, the reward is capped at -200 to avoid numerical issues and unexpectedly large bootstrapping errors. The agent evaluates every five seconds to ensure that the system has reached steady state before the next action is made. Five seconds was selected because it was the longest observed transition time required for the system to reach steady state. The very short dynamic transient is not considered in this experiment.
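For concreteness, the sketch below shows how the reward and the five-second evaluation interval could be wired together in code. The reward form (an absolute-error penalty plus a weighted input-move penalty, capped at -200) and the penalty weight DU_WEIGHT are assumptions for illustration, since Eq. (46) only specifies the cap and the presence of an input-change cost; read_pressure is a hypothetical callable that returns the measured pressure in kPa.

```python
import time

REWARD_FLOOR = -200.0   # cap from the text; avoids numerical issues and large bootstrapping errors
EVAL_INTERVAL = 5.0     # seconds; the longest observed settling time of the pump
DU_WEIGHT = 0.1         # assumed weight on the input-change cost (not specified in the text)

def reward(error_kpa, delta_u_rpm):
    """Hypothetical reward: penalize tracking error and input moves, capped at -200."""
    return max(-abs(error_kpa) - DU_WEIGHT * abs(delta_u_rpm), REWARD_FLOOR)

def evaluate_step(read_pressure, setpoint_kpa, delta_u_rpm):
    """Wait for the process to settle, then return the new tracking error and reward."""
    time.sleep(EVAL_INTERVAL)                 # system is assumed to be at steady state afterwards
    error = read_pressure() - setpoint_kpa    # epsilon_t = P_t - P_t,sp, as in Eq. (45)
    return error, reward(error, delta_u_rpm)
```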
实现RL算法所需的步骤如图13所示。本例的RL设置如图14所示。RL 代理将通过更改单输入单输出 (SISO) 系统上的泵 RPM 来跟踪压力设定点。具体来说,智能体的状态和动作分别是当前设定点跟踪误差和
(45)
泵 RPM Δu 的变化。在方程(45)中,P表示时间t时的压力,P t t,sp 表示相应的压力设定点。这种将 RMP 的更改用作操作的操作允许代理跟踪多个设定点。如果操作是泵 RPM,则代理只能跟踪一个设定点,因为它只是将不同的跟踪误差映射到稳态泵 RPM。智能体的奖励函数由下式给出:
(46)
其中 Δu 是输入的变化(即,改变输入是有代价的)。此外,奖励上限为 -200,以避免数字问题和意外的大引导错误。代理将每五秒评估一次,以确保系统在执行连续操作之前已达到稳定状态。之所以选择5秒,是因为这是系统达到稳定状态所需的最长过渡时间。本实验不考虑非常短的动态跃迁。

Fig. 13. General procedure for implementing industrial reinforcement learning.

Fig. 14. The RL set-up for the FLUIDMechatronix experiment.

The hyperparameters of the agent are summarized in Table 5. In this example, the states and actions are discretized as:
(47)
(48)
and the Q-matrix corresponding to the action-value functions is given in Fig. 15. Initially, all action-values are initialized to 0. The states and actions, x and u, correspond to ε and Δu, respectively. The discount factor, γ, of the agent was 0.9. Overall, the agent was trained for 2,000,000 time steps, corresponding to approximately 23 days of continuous operating experience. After every 400th time step, the agent was reset back to the initial states given in Eqs. (43) and (44) to prevent controller saturation during initial episodes. Such a long continuous training time on a live process is unreasonable; therefore, a crude model of the system was first identified and used to initially train the agent in simulation. The identified model had a mean squared error of 0.056 on the experimental data and is given as:
(49)
The model fit on a normalized test data set is shown in Fig. 16.
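A minimal sketch of the tabular set-up described above is given here. The exact grids of Eqs. (47) and (48) are not reproduced in this illustration, so the 41 error bins (consistent with the later (41 · 5) × 3 initialization) and the three RPM moves are assumed values.

```python
import numpy as np

# Assumed discretization grids for illustration only; the paper's exact grids
# are those of Eqs. (47) and (48).
ERROR_STATES = np.arange(-20.0, 20.5, 1.0)   # 41 discretized tracking errors, kPa
ACTIONS = np.array([-5.0, 0.0, 5.0])         # assumed changes in pump RPM

# Q-matrix of action-values, initialized to zero (one row per state, one column per action).
Q = np.zeros((len(ERROR_STATES), len(ACTIONS)))

def nearest_state_index(error_kpa):
    """Round a continuous tracking error to the index of the nearest discretized state."""
    return int(np.argmin(np.abs(ERROR_STATES - error_kpa)))
```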

Table 5. Summary of the agent's hyperparameters in the Mechatronix experiment.

Hyperparameter | Value
States, x | Discretized tracking error (Eq. (47))
Actions, u | Discretized change in pump RPM (Eq. (48))
Reward, r | max( · , -200) (Eq. (46))
Learning rate, α | [0.001, 0.7]
Discount factor, γ | 0.9
Exploratory factor, ϵ | [0.1, 1]
Evaluation interval | 5 seconds
System representation | FOMDP
Fig. 15. Q-matrix of the Mechatronix system.

Fig. 16. Performance of the identified system model on a test data set.

Exploration-wise, the agent starts with an equiprobable random policy (ϵ = 1), which decays linearly to ϵ = 0.1 by the 500,000th time step. Likewise, the learning rate, α, is initialized at 0.7 and decays linearly to 0.001.
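A small sketch of these linear annealing schedules is shown below; the decay horizon for the learning rate is assumed to match the 500,000-step horizon used for ϵ, since the text does not state it explicitly.

```python
def linear_decay(step, start, end, decay_steps=500_000):
    """Linearly anneal a hyperparameter from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

def epsilon(step):
    return linear_decay(step, 1.0, 0.1)      # equiprobable exploration annealed to 0.1

def alpha(step):
    return linear_decay(step, 0.7, 0.001)    # learning rate annealed to 0.001
```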

The agent behaves as follows: initially, the agent observes εt and performs action Δut; that is, the agent observes the current tracking error and then changes the previous input by some amount. Next, the input signal corresponding to ut = ut−1 + Δut is sent to the Mechatronix experiment. After waiting for five seconds to ensure the system has reached steady state, the agent receives reward Rt+1 and observes the new tracking error, εt+1. Then, the agent uses Eq. (29) to update its current knowledge, and the cycle starts anew. A sketch of this loop is given below, followed by a numerical example.
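The sketch below shows one way this loop could be written, using the standard tabular Q-learning update (the update the text refers to as Eq. (29)). The env object is a hypothetical wrapper around the pump (or its identified model) that applies ut = ut−1 + Δut, waits five seconds, and returns the sampled reward and the index of the new discretized state.

```python
GAMMA = 0.9  # discount factor from Table 5

def q_update(Q, s, a, r, s_next, alpha):
    """Tabular Q-learning update: Q(x,u) += alpha * (r + gamma * max_u' Q(x',u') - Q(x,u))."""
    td_error = r + GAMMA * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

def run_episode(env, Q, select_action, alpha, steps=400):
    """One training episode of 400 five-second update steps on the (simulated) pump."""
    s = env.reset()                      # back to the initial state of Eqs. (43) and (44)
    for _ in range(steps):
        a = select_action(Q, s)          # e.g., epsilon-greedy with random tie-breaking
        r, s_next = env.step(a)          # applies the input change and waits for steady state
        q_update(Q, s, a, r, s_next, alpha)
        s = s_next
```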
Suppose the agent discretized the system into five states and three actions. The Q-matrix was initialized as a 5 × 3 matrix of zeros, where rows 1–5 correspond to the five states and columns 1–3 correspond to the three actions. After the initial set-up, the agent has to be trained. RL agents are typically trained through a series of episodes, where each episode consists of multiple update steps. In this example, the agent will be trained through 1000 episodes, where each episode consists of 2000 seconds (2,000,000 seconds of total training time). Furthermore, the Q-matrix is updated every 5 seconds, resulting in 400 update steps per episode. At the end of each episode, the agent is reset to its initial states. Note that simulations were used for initial training; total simulation time was only a few minutes for the entire 1000 episodes.

The initial set-point of the system was set to 30 kPa. At the start, the system was at steady state at 15 kPa and 37 RPM, resulting in a tracking error of 15 − 30 = −15 kPa. The agent receives this error and rounds it to the nearest discretized state. Given this state, the agent then picks the action that coincides with the highest Q value in the corresponding row of the Q-matrix, where the first, second, and third entries are the current predicted Q values for the three actions, respectively. Because the agent was not provided with any prior information about the system, all entries are still zero and the agent must first pick an action arbitrarily to gather more information. Whenever the highest Q values are identical, ties must be broken arbitrarily to avoid biasing one action over the others.
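A short sketch of greedy selection with random tie-breaking is shown below; with an untrained (all-zero) Q row, every action is equally likely to be chosen.

```python
import numpy as np

rng = np.random.default_rng()

def greedy_with_random_ties(q_row):
    """Return the index of a maximal Q value, breaking ties uniformly at random."""
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))

# Untrained row: all three actions are tied at zero, so each is picked with probability 1/3.
print(greedy_with_random_ties(np.zeros(3)))
```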

Suppose the action that decreases the pump RPM was picked: the system would transition to a steady state of 7.4 kPa after five seconds and the new error would be −23 kPa. Clearly, this is a sub-optimal action, but the agent was not equipped with prior knowledge of this. The agent would then receive the corresponding (negative) reward and observe the new state. From this newly learned knowledge, the agent would update the Q-matrix using Eq. (29); the entry for the chosen state-action pair becomes negative while all other entries remain zero. Notice that, for the other states, all three actions remain tied as the maximizing action; therefore, ties are broken randomly there as well to avoid unnecessary bias. Suppose the first episode was terminated early and the system was reset back to its initial state. This time, the previously chosen action has a lower Q value than the other two, meaning that picking it is sub-optimal; the agent would therefore pick one of the other two actions instead.

After traversing the state space many times, the Q-matrix contains much more information about the system, and the agent can begin acting optimally. After once again resetting the system to 15 kPa and 37 RPM, the agent's decision making this time around is deterministic: given the learned action values for the current state, the agent picks the greedy (return-maximizing) action, the system transitions to a new steady state, and the agent receives the corresponding reward. The new state is rounded to the nearest discretized state, whose optimal action-value is then used in the update step of Eq. (29). Here, α is 0.001 due to decay throughout the training process. Furthermore, the TD error is −1.2, signifying that the agent's value functions are very close to optimal and the agent is well trained. All TD errors will eventually converge to near zero and the agent's policy will become optimal.

After simulating the system for many steps, the reward obtained by the agent during the training phase is shown in Fig. 17. The reward never converged to zero because the lower bound of ϵ was set to 0.1, meaning that the agent continued to explore sub-optimal actions during training. During training, the desired set-point was drawn from a Gaussian distribution N(30, 5).

Fig. 17. Loss curve of the agent during training.

After 2,000,000 update steps, the agent was applied to the real process and was tasked with tracking pressure set-points of 35 and 5 kPa. The pressure trajectories of the Mechatronix experiment are shown in Fig. 18a and b. The mean squared tracking errors (MSE) for the set-points of 35 and 5 kPa are 14.2 and 15.5, respectively. In both cases, the initial pressure of the system was about 5.5 kPa above the desired set-point to ensure fair MSE calculations. The agent behaves much like a linear self-tuning PID, where the RL policy maps tracking errors to changes in input. Such a set-up only works well locally for nonlinear systems because the system is locally linear. As the agent moves away from the local linear region, the performance deteriorates significantly, as shown in Fig. 18b. The agent's performance is significantly better when tracking the higher set-point because the set-points during training were biased towards the upper range. Additionally, it can be seen in Fig. 16 that the controller gain changes significantly at lower pressures, making the optimal policy for the higher pressure range sub-optimal in the lower pressure range. There also exists a large offset between the set-point and the pressure trajectory. This is caused by the discretization error; there is no action Δu that can bring the system exactly to the set-point. To overcome this, the action space could be more finely discretized, but this would increase the training time and space complexity required by the agent. A simpler alternative is introduced later in this example.

Fig. 18. Pressure trajectory of the Mechatronix experiment. Solid line represents the average of 10 runs to ensure reproducibility. Shaded area corresponds to one standard deviation.

One simple way to extend the current agent to nonlinear systems is to approximate the system using piecewise linear functions, as shown in Fig. 19. Communication-wise, the agent receives this information through a second state (the index of the current region), and the new state space for the agent is given by:
(50) x = (ε, region), region ∈ {1, 2, 3, 4, 5}
where the values 1, 2, 3, 4 and 5 for the second state correspond to the region the agent is currently in. Here, the Q-matrix is instead initialized as a (41 · 5) × 3 matrix of zeros to accommodate the second state. Intuitively, the agent now observes the current tracking error and the region in which this tracking error has occurred. Since each region is locally linear, the control law is also linear, allowing the agent's policy to be applied within each region (Seborg et al., 2013). Such an idea is similar to linear parameter-varying models or multimode models; the agent's policy changes depending on the region it is currently in.
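A sketch of the corresponding Q-table indexing is shown below; the bin counts are those implied by the (41 · 5) × 3 initialization, and the row-mapping convention is an assumption for illustration.

```python
import numpy as np

N_ERROR_STATES = 41   # discretized tracking errors, as implied by the (41*5)x3 initialization
N_REGIONS = 5         # piecewise-linear regions of the pump characteristic
N_ACTIONS = 3

# Two-state Q-matrix: one row per (error state, region) pair, one column per action.
Q2 = np.zeros((N_ERROR_STATES * N_REGIONS, N_ACTIONS))

def row_index(error_idx, region):
    """Map an (error-state index, region label in 1..5) pair to a row of the Q-matrix."""
    return error_idx * N_REGIONS + (region - 1)
```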

Fig. 19. Approximating the nonlinear Mechatronix system.

After training the new agent for 2,000,000 steps, the agent was again implemented on the Mechatronix experiment. The new pressure trajectories for tracking set-points of 35 and 5 kPa are shown in Fig. 20a and b, with MSEs of 14.2 and 12.5, respectively. It can be seen that the performance for tracking the lower set-point has dramatically improved; however, the offset still exists.

Fig. 20. Pressure trajectory using the nonlinear agent. Solid line represents the average of 10 runs to ensure reproducibility. Shaded area corresponds to one standard deviation.

Because the nonlinear system was approximated using linear components, both the system and the control law are locally linear. Using this characteristic, the optimal control action can be interpolated using the linear interpolation equation (Meijering, 2002):
(51) u* = ulow + ((x − xlow)/(xhigh − xlow)) (uhigh − ulow)
where x is the actual state from the system; typically, x is not one of the discretized states and instead lies between xhigh and xlow. For example, if the current state is −3 and the neighbouring discretized states are 0 and −5, then xhigh and xlow correspond to 0 and −5, respectively. Likewise, uhigh and ulow correspond to the greedy (i.e., return-maximizing) actions for xhigh and xlow, respectively. For example, if the greedy actions at those two states are 0 and −5 (the actions corresponding to the indices of the highest Q values), the interpolated optimal action for x = −3 would be u* = −5 + ((−3 − (−5))/(0 − (−5)))(0 − (−5)) = −3.
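The sketch below implements this interpolated action selection under the assumed form of Eq. (51); the two-row Q matrix is a hypothetical example chosen so that the greedy actions at states -5 and 0 are -5 and 0, reproducing the u* = -3 result above.

```python
import numpy as np

def interpolated_action(x, states, actions, Q):
    """Linearly interpolate the greedy action between the two discretized states bracketing x."""
    hi = min(int(np.searchsorted(states, x)), len(states) - 1)   # first state >= x
    lo = max(hi - 1, 0)
    x_lo, x_hi = states[lo], states[hi]
    u_lo = actions[int(np.argmax(Q[lo]))]                        # greedy action at the lower state
    u_hi = actions[int(np.argmax(Q[hi]))]                        # greedy action at the upper state
    if x_hi == x_lo:
        return u_lo
    return u_lo + (x - x_lo) / (x_hi - x_lo) * (u_hi - u_lo)

states = np.array([-5.0, 0.0])
actions = np.array([-5.0, 0.0, 5.0])
Q = np.array([[1.0, 0.0, 0.0],    # assumed values: greedy action -5 at state -5
              [0.0, 1.0, 0.0]])   # assumed values: greedy action 0 at state 0
print(interpolated_action(-3.0, states, actions, Q))  # -> -3.0
```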


By adding the interpolation technique to the two-state RL agent (without re-training the agent), the new pressure trajectories are shown in Fig. 21a and b, with new MSEs of 13.6 and 11.7, respectively. Now, the offset is eliminated and the system can operate optimally.

Fig. 21. Pressure trajectory of the nonlinear agent using interpolation action selection. Solid line represents the average of 10 runs to ensure reproducibility. Shaded area corresponds to one standard deviation.

A summary of the simple RL solutions and their respective characteristics is shown in Table 6. In this illustration, the implementation of a simple RL agent on a pilot-scale industrial experiment was introduced. Techniques to extend the agent's ability to nonlinear systems and to achieve offset-free control were also shown. The agent's performance on the live system was replicated 10 times for each algorithm to ensure reproducibility; the resultant standard deviation in the pressure trajectories was very narrow, representing high reproducibility.

Table 6. A comparison between normal Q-learning, two-state Q-learning, and two-state interpolated Q-learning on the FLUIDMechatronix system.

 | Normal Q-learning | Two-state Q-learning | Two-state interpolated Q-learning
MSE (High/Low SP) | 14.2 & 15.5 | 14.2 & 12.5 | 13.6 & 11.7
Offset | Yes | Yes | No
Nonlinear | No | Yes | Yes
RL agents typically only provide the control action for the immediate future. This is similar to traditional methods like MPC, where only the first input is used during each step; however, MPC is implemented in a receding-horizon fashion and also calculates an input trajectory for future steps, making open-loop control possible over short horizons. RL can also be implemented in such a way. Such RL methods are called planning methods or model-based RL and require a model of the system. In receding-horizon RL, the agent outputs the immediate control action, uses the model to predict the next state, identifies the optimal control action for that state, and continues the cycle thereafter.
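A minimal sketch of such a receding-horizon (planning) loop is shown below; model and greedy_action are hypothetical callables standing in for the identified process model and the learned greedy policy, and only the first entry of the returned trajectory would be applied to the plant.

```python
def plan_open_loop(x0, model, greedy_action, horizon=10):
    """Roll the learned greedy policy forward through a process model to obtain a
    short open-loop input trajectory, in the spirit of receding-horizon control."""
    trajectory = []
    x = x0
    for _ in range(horizon):
        u = greedy_action(x)      # immediate control action from the learned policy
        trajectory.append(u)
        x = model(x, u)           # model-predicted next state
    return trajectory             # apply trajectory[0], then re-plan at the next step
```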

Because the example shown here is for illustration purposes, it is simple, evaluates at preset intervals, and does not consider system dynamics or unmeasured states. For systems with variable transition times where the dynamics must be considered, SMDPs should be used. The SMDP Q-learning algorithm is given by Bradtke and Duff (1994):
(52)
where the reward rate term is given in Eq. (10). For systems with unmeasured states, the POMDP concepts provided in Section 2 are required.

If a deep RL algorithm, such as DDPG, were used in this particular example, the major difference would be the design of the state and action spaces. In DDPG, the two states and the control action would instead be continuous. In this set-up, there would naturally be no discretization error because both the states and actions are continuous. However, note that the search space for the optimal policy is much larger than in the tabular case, and significantly more time will be required before a near-optimal policy is found.

3.4. Automated PID tuning
Proportional-Integral-Derivative (PID) controllers are widespread throughout industry due to their effectiveness and ease of implementation. The general (parallel-form) PID formulation is given by Seborg et al. (2013):
(53) u(t) = Kp e(t) + Ki ∫0t e(τ) dτ + Kd de(t)/dt
where de(t)/dt is the derivative of the error at time t, and Kp, Ki, and Kd are hyperparameters corresponding to the proportional gain, integral gain, and derivative gain; they should be well tuned for acceptable controller performance. However, depending on the process to be controlled, this tuning process may be difficult and time-consuming, especially in MIMO systems where control loops are intertwined (i.e., tuning of one control loop results in de-tuning of another). Many methods exist for initial tuning, such as the Ziegler-Nichols method, but in most cases the final controller performs well below optimal, especially in industry where engineers are time constrained (Howell and Best, 2000). Instead, an RL agent can be used to automatically and optimally tune the PID parameters.

One of the earliest studies on automated PID tuning using concepts from RL was published by Howell and Best (2000) (architecture shown in Fig. 22), where the authors automated the tuning of a Ford Motors Zetec engine. The algorithm, named Continuous Action Reinforcement Learning Automata (CARLA), was used to fine-tune PIDs after initial parameters were set using methods like Ziegler-Nichols. Results showed a 60% reduction in the cost function after RL tuning. CARLA works as follows: initially, each system parameter is associated with one CARLA. The action of each CARLA is the recommended system parameter and is drawn from a corresponding probability density function, f(x). For example, in a system with one PID, there would be three CARLAs corresponding to the three PID parameters. The action of each CARLA is the recommended parameter and is implemented in the process. Afterwards, the performance of the system with respect to the desired performance metric is observed. Performances better than the mean cause the distribution means to be shifted towards the recommended parameters, and vice versa for lower performances. Exploration in CARLA is conducted in a similar way to DDPG, except that Gaussian white noise is injected into the action rather than Ornstein-Uhlenbeck noise. More detailed information regarding the CARLA algorithm can be found in Howell and Best (2000). Such a method is simple, but may not be scalable to large MIMO systems due to the vast number of CARLAs required.

Fig. 22. CARLA: An RL-powered automatic PID tuning algorithm.

By 2006, Wang et al. (2006) developed a more advanced PID tuning method utilizing an actor-critic RL algorithm. In this algorithm, the agent's states are the current error and its first- and second-order differences, and the actions are the three PID gains. Notice that the three states correspond to the integral, proportional, and derivative errors of the discrete PID formulation. The RL formulation here maps the current error and the first- and second-order differences of the error to optimal PID parameters at each time t (i.e., the PID parameters change at every time t). Intuitively, the agent observes the current errors and outputs the optimal PID parameters at the current time t. The PID is then re-parameterized using these new parameters and is used to calculate Δut:
(54) Δut = Ki et + Kp Δet + Kd Δ²et, with Δet = et − et−1 and Δ²et = et − 2et−1 + et−2
where Ki, Kp, and Kd are output by the agent and may change at each time t. From Δut, ut can be calculated as ut = ut−1 + Δut. From the original paper, simulation results showcased the algorithm's adaptability, robustness, and ability to perfectly track complex nonlinear systems. By 2008, Sedighizadeh et al. (2008) applied this adaptive PID to a wind turbine experiment, yielding near-perfect tracking performance in an industrial application. Complete application details can be found in Sedighizadeh et al. (2008). In 2015, the algorithm was implemented again, this time on an under-actuated robotic arm (Akbarimajd, 2015). The robotic arm has fast dynamics and lacks adequate actuators for ideal control. Such a study allows for the exploration of the RL-tuned PID's fault-tolerant characteristics. In the experiments, the robotic arm's goal was to maintain a formation, but it was exposed to many disturbances. Traditional control methods typically lead to overshooting and other undesired behaviours; however, the RL-tuned PID did not exhibit any such behaviour and showed significantly better performance in terms of disturbance rejection and response time compared to traditional approaches. By 2017, a Q-learning variant of this algorithm was used in Gurel (2017) to tune a race track robot. Compared to manually tuned PIDs, the RL-tuned PID robots achieved up to 59% faster lap times.
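A sketch of this adaptive PID scheme is given below, using the velocity-form increment assumed for Eq. (54); agent is a hypothetical policy that maps the three error features to a fresh set of gains at every step.

```python
def velocity_pid_increment(kp, ki, kd, e, e_prev, e_prev2):
    """Velocity-form discrete PID increment: Ki*e + Kp*(e - e_prev) + Kd*(e - 2*e_prev + e_prev2)."""
    return ki * e + kp * (e - e_prev) + kd * (e - 2.0 * e_prev + e_prev2)

def adaptive_pid_step(agent, u_prev, e, e_prev, e_prev2):
    """One step of an RL-tuned PID: the agent outputs new gains, which produce the next input."""
    kp, ki, kd = agent((e, e - e_prev, e - 2.0 * e_prev + e_prev2))  # gains for the current time t
    du = velocity_pid_increment(kp, ki, kd, e, e_prev, e_prev2)
    return u_prev + du   # u_t = u_{t-1} + delta_u_t
```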

In 2010, Brujeni et al. (2010) leveraged the SARSA RL algorithm to dynamically tune a PI-controller used to control a continuous stirred tank heater. The agent was first pre-trained on an estimated model for the process and was then implemented to continuously tune the tank heater online. The agent aimed to reject disturbances and track the set-point. In the end, the authors compared the performance of the RL tuned controller against internal model control tuning methods. It was found that RL was the superior tuning method due to its continuous adaptive nature.

In 2013, Hakim et al. (2013) implemented an automated tuning strategy, similar to the one proposed above, for a multi-PID soccer robot. The agent's states were altered: instead of receiving the error signals, the agent received the state it currently occupied in the soccer game. Intuitively, this allowed the agent to understand its current situation and tune its characteristics accordingly. For example, the robot may require faster speed while running down the field than when it is ready to take a shot. In an industrial setting, such ideas may be useful for an event-triggered control system; for example, if the weather conditions are poor, the control system should be more conservative and have less gain. Ultimately, the paper demonstrated the superior performance of the RL-tuned robots compared to robots tuned using the Ziegler-Nichols method.

On the more advanced side, RL was also shown to be effective in a model-based PID tuning strategy where the controller was tuned based on a finite horizon cost. Ultimately, this method was found to work on nonlinear MIMO systems with arbitrary couplings. The method was tested on Apollo, a real-life robot with imperfect low-level tracking controllers and unobserved dynamics. More details regarding this method can be found in Doerr et al. (2017).

Over the past three years, many more automated PID tuning methods using RL have been published; they are not all presented here. The ideas are very similar to those presented above, with only slight alterations to the agent set-up.

3.5. Various simulated process control applications
One key advantage of RL in optimal control is probably its direct adaptive characteristic. Optimal control methods aim to extremize the functional equation of the controlled system and have been shown to be less tractable, both computationally and analytically, than tracking or regulation problems. Consequently, adaptive optimal control has received relatively less attention, with existing studies mostly focusing on indirect methods (Sutton et al., 1991). Sutton et al. (1991) showed that RL can overcome this dilemma by serving as a direct optimal control method. Here, indirect methods refer to process re-identification methods, whereas direct methods alter the control policy directly. Direct adaptive optimal control could be especially useful for systems where accurate models are not available. In such scenarios, RL can update the control policy directly through interactions with the system, eventually adapting to the optimal policy. This was shown in Moriyama et al. (2018), where the authors applied an RL algorithm to a data-center cooling application in which accurate system models are very difficult to identify even with sufficient data. The RL agent was able to find an optimal policy to control the system after sufficient online interactions, resulting in 22% lower power consumption compared to previous model-based methods. In another work, Raju et al. (2015) showed that RL was able to adapt to changing load fluctuations in power systems, eventually resulting in optimal control after sufficient online interactions. Wireless networking is another domain with non-identifiable dynamics. In Fan et al. (2019), the authors achieved increased energy efficiency using deep RL by allowing the agent to learn the optimal policy online instead of mathematically modelling the system.

For general applications of RL in simulated control environments, Hoskins and Himmelblau (1992) was perhaps the first instance where reinforcement learning was used for process control (in a set-point tracking sense). The authors trained a neural-network-based agent to control a CSTR. More recently, Spielberg et al. (2017) showed that DDPG can be used to successfully control arbitrary SISO and MIMO systems so long as the reward functions are properly formulated; in Spielberg et al. (2017), the agents mapped the states directly to control actions. In Wang et al. (2017), an actor-critic RL method was applied to control the temperature of a building heating, ventilation, and air conditioning (HVAC) system, resulting in a 2.5% reduction in energy usage and a 15% increase in thermal comfort. The same HVAC system was also optimized using a proximal actor-critic RL agent in Wang et al. (2018). All the previous applications formulated the agent to perform set-point tracking; however, RL is very flexible and can be used for optimal control (i.e., to optimize an economic objective) by expressing the reward function in terms of that objective. The viability of RL has also been shown in fault-tolerant control (FTC) (Nian et al., 2019), where RL agents were able to mediate faults and adapt to changing operating conditions.

3.6. RL and chemical engineering
As early as 2005, Lee and Lee (2005) investigated two approximate dynamic programming (ADP) algorithms, J-learning and Q-learning, for optimal control of a CSTR. In both cases, the learning algorithm attempted to identify the optimal policy offline. Compared to MPC, these methods were shown to have a lower online computational burden and the ability to control model extrapolation when computing the optimal control actions. The performance of these algorithms was then compared to proportional-integral controllers, a successive-linearization MPC, and a nonlinear MPC on a CSTR. First, a nonlinear auto-regressive with exogenous input (ARX) model was identified for the CSTR. Then, each controller was applied in simulation to control the process. In the end, the performance of the two ADP methods was superior to that of their counterparts. In this particular example, MPC struggled to achieve closed-loop stability due to large model extrapolations during online optimization.

Joy and Kaisare (2011) built upon previous findings and applied the J-learning and Q-learning techniques to an adiabatic plug flow reactor. In this example, the plug flow reactor was modelled using three partial differential equations. Two different PIDs and three different linear MPCs were applied onto the system along with the ADP control algorithms. In the simulation cases presented, the ADP control algorithms were able to achieve superior control performance compared to the PIDs and linear MPCs.

By 2018, Sidhu et al. (2018) demonstrated the capabilities of an ADP method to optimally control the proppant concentration for hydraulic fracturing. It was shown that traditional optimal controllers were not ideal due to the high sampling time and large-scale nonlinear system. The proposed ADP controller was first trained on a simulation model of the reservoir to gain preliminary process insight. After training, the ADP optimal controller was used to generate an online pumping schedule while being able to handle plant-model mismatch within the rock formation.

3.7. Electricity optimization at Google data centers
One of the only live implementations of reinforcement learning, and one of the most impressive in terms of achievement and value creation, was achieved by Google DeepMind, where the agent succeeded in reducing the electricity usage of Google data centers by up to 40%. Indirectly, this also reduced the carbon footprint of all companies using Google's services. Services such as Google Search, Gmail, and YouTube all run on servers powered by Google's data centers and generate enormous amounts of heat. Consequently, the data centers' primary source of energy usage is cooling, which is accomplished using industrial equipment such as heat exchangers, pumps, and cooling towers. The difficulty comes from the problematic dynamics of the environment (DeepMind, 2016c):
1.
Highly complex, multi-dimensional environment with numerous nonlinear interactions. Such an environment renders traditional system identification approaches ineffective. Additionally, human operators simply cannot intuitively understand all the complex interactions.

2.
Highly variable internal and external conditions (such as weather, server load, etc.), rendering rules- and heuristics-based approaches fruitless.

3.
All data centers have unique set-ups, requiring custom-tuned models for each individual environment. Such a dilemma requires an artificial general intelligence set-up where one algorithm can learn many different scenarios.

To overcome this, DeepMind research scientists first used historical operating data from the data centers to train neural network models for different operating conditions. The inputs to the neural networks were sensor measurements such as temperatures, pump speeds, etc., and the output was the power usage effectiveness (PUE), defined as:
(55) PUE = (total facility energy) / (IT equipment energy)


The neural networks were used as simulators for the physical data centers. Different agents were trained on different data centers and under different operating conditions to minimize the PUE over a long horizon. Initially, only recommendations were provided by the algorithm. The PUE of a data center with and without implementing the agent's recommendations is shown in Fig. 23. By 2018, the agents were allowed to fully control the data centers after safety modifications were added. Specifically, an RL agent obtains measurements from sensors throughout the data center once every five minutes and replies with the optimal inputs that satisfy a robust set of safety constraints (DeepMind, 2018). The inputs are then verified with the local control operators to ensure that the system remains within constraint boundaries. Over the first few months of live implementation, the agents were able to successfully reduce electricity consumption by an average of 30% and are expected to improve further as they continue to learn online.

Fig. 23. Power usage effectiveness with and without ML control. Original figure from DeepMind (2016c).

3.8. Sequential anomaly detection
Anomaly detection is another field where reinforcement learning has gained traction. In industrial processes, anomaly detection is a proactive risk management strategy to identify and localize potential hazards before loss incidents occur. The sequential (time-series) nature of RL anomaly detection and its ability to self-learn provide an attractive edge. One of the earliest papers regarding RL-based anomaly detection was presented in Cannadey (2000) (architecture shown in Fig. 24), where the author used concepts of reinforcement learning to build an adaptive neural network to learn and identify new cyber attacks. The architecture was similar to time-series anomaly detection. The system was represented as a POMDP where the agent optimally mapped observations to actions, guided by a scalar reward. Observations were an augmentation of past states to give the agent access to time-series information. Correctly identifying anomalies yielded a +1 reward, while any misclassification yielded a -1 reward. More recently, Zighra (an online security company) deployed the proposed algorithm from Cannadey (2000) in a product called SensifyID. Although the idea was originally applied to network systems, a very similar concept was proposed in Nian et al. (2019), where the algorithm was instead used as a general fault detection tool for process control systems.

Fig. 24. A sample anomaly detection architecture.

By 2010, Xu (2010) extended the original ideas by first representing the system as a partially measurable Markov reward process (an MDP with no actions). The states remained the same and there were no actions in this system. Instead, the author proposed that the agent learn the probability of each state transitioning into an anomalous state:
(56) Pa(x) = Pr(xt+1 ∈ A | ot)
where Pa(x) denotes the probability of transitioning into an anomalous state given observation ot, and A is the set of anomalous states. If Pa(x) > μa, an anomaly was deemed imminent, where μa is a threshold hyperparameter. High values of μa reduce false positives but may miss anomalies; low values increase true positives but also increase false alarms. The value function of this approach is represented as:
(57) V(ot) = E[ Σk γk r(ot+k) ]
where r(ot) is the reward obtained in ot: r(ot) = 1 if ot corresponds to an anomalous state, and 0 otherwise. In this set-up, states with high values have a higher chance of being anomalous. In the end, Xu (2010) also compared the RL anomaly detection algorithm to other classification algorithms, such as support vector machines. Results showed that RL anomaly detection achieved a higher detection accuracy than all other methods, although all algorithms, including linear methods like logistic regression, scored an accuracy above 99.8% on the selected data sets.

More recently, in 2018, Huang et al. (2018) proposed a recurrent neural network (RNN) RL anomaly detection algorithm without the need to tune μa. Overall, the algorithm was very similar to Cannadey (2000), where a policy, π(u, x), was identified to map states to actions. In this representation, the classification is binary, so no threshold is required. Compared to Cannadey (2000), the system from Huang et al. (2018) was still a POMDP; however, a long short-term memory (LSTM) recurrent neural network was used to memorize previous states instead of the state augmentation strategy. The differences in performance between the two strategies have yet to be explored in the literature. Ultimately, the algorithm was applied to the Yahoo anomaly detection benchmark data set (Laptev et al., 2015) and was able to identify all anomalies with no false alarms.

3.9. Temporal credit assignment
Another advantageous characteristic of reinforcement learning is its credit assignment capability. There are three forms of credit assignment:
1.
Temporal credit assignment: given a sequence of state-action pairs, properly assign values to the different pairs based on a desired performance metric. For example, the reason a company went bankrupt is rarely the actions of the CEO on the day of the bankruptcy; most likely, a chain of poor decisions ultimately resulted in this outcome.

2.
Transfer credit assignment: the ability to generalize one action across many tasks; e.g., driving a car in Canada should be very similar to driving a car in the United States.

3.
Structural credit assignment: identifying the effects of individual parameters on the ultimate outcome. For example, increasing the temperature would increase the output flow rate by some amount.

Transfer credit assignment and structural credit assignment are mainly used in transfer learning and supervised learning, respectively. Reinforcement learning conducts temporal credit assignment by assigning values to different state-action pairs (Minsky, 1973), building these values through interaction with the system and an understanding of its dynamics. After training (assuming an accurate simulator), the agent's value functions for each state might be useful to aid process operators in determining the current state of the plant. States with high values denote good plant operation, while states with low values may denote sub-optimal operation.

For example in alarm management, RL can assign values to alarms when they go off based on the system’s state or state trajectory. In doing so, each alarm corresponds to a respective value and alarms can be sorted based on priority. Additionally, alarms failing to meet a certain value threshold can be filtered out altogether to mitigate alarm flooding from nuisance alarms. Other applications requiring temporal credit assignment, such as root cause analysis, should also be feasible using RL although no studies have been conducted so far on this topic.

4. Shortcomings of reinforcement learning
Although the potential of RL seems promising, there are still several problems hindering its industry-wide adoption. Some shortcomings include: data inefficiency, unestablished stability theory, model-free state constraint handling, and the requirement of a representative simulator.

4.1. Data inefficiency
Table 7 shows the time required for popular RL algorithms to learn specific tasks. Perhaps the most controversial aspect of reinforcement learning is its poor learning efficiency. For example, OpenAI Five accumulated over 180 years of experience per day, but still required training for many days before becoming competent in DotA (OpenAI, 2018b). Likewise, AlphaZero was trained for tens of thousands of games to master Chess, Shogi, and Go (Silver et al., 2017a; Silver et al., 2018). In process control, plant managers cannot wait for such a lengthy period of time for the initial agent training; therefore, the training of RL agents is infeasible if simulators cannot be used or are too inaccurate. Compared to a human, the learning speed of reinforcement learning seems unreasonably slow. For example, DQN required days of continuous play to become skilled at a game, a task that takes most humans a few minutes. One main difference between RL and humans is that RL starts tabula rasa, while humans are equipped with almost all of the knowledge required before learning a new task. For example, when a human learns to drive, they already understand the functionality of the car's pedals, the street signs, and the ultimate goal. On the other hand, RL starts by knowing nothing, not even the purpose of driving. Intuitively, this scenario is comparable to tasking a newborn baby with driving a car.

Table 7. Time required for RL agents to learn tasks.

RL algorithm | Approx. time required
OpenAI Five | Hundreds of years
AlphaZero | 10,000s of games
DQN | Days of continuous play
DDPG | Millions of steps
One method to inject prior knowledge into the agent is called transfer learning. The agent's weights are initialized from a previously trained agent whose task was similar. For example, the knowledge of an agent trained for set-point tracking of a pump could be transferred to an agent doing a similar task on a control valve. The topic of transfer learning is very popular in traditional deep learning, especially in image tasks where training is extremely computationally expensive. A survey on transfer learning in deep learning can be found in Tan et al. (2018). For a survey of transfer learning catered to reinforcement learning tasks, refer to Taylor and Stone (2009).

Another popular way to increase data efficiency is the replay buffer (also called experience replay), first introduced in Lin (1992). DQN was the first deep RL method to leverage a replay buffer (Mnih et al., 2013; Mnih et al., 2015). One of the advantages of RL is its ability to update the agent using tuples of experience. For example, Q-learning can be updated using the tuple (xt, ut, Rt+1, xt+1).
The replay buffer is a large archive (often 1,000,000 records) of previous experiences and is used to train the agent on de-correlated random experiences from the past. The buffer ensures that the agent's policy does not overfit the current operating regime during time-correlated tasks and enhances data efficiency by learning from the same experiences many times, a concept similar to running through many epochs in deep learning. During updates, random mini-batches of previous experiences are sampled from the replay buffer to update the agent. Relating to humans, a replay buffer is similar to hippocampal replay, where the memories of experiences are replayed over and over sub-consciously; indeed, that is one theory of how humans learn so efficiently (Schuck and Niv, 2019). One flaw is that the experiences are sampled uniformly at random. To overcome this, Schaul et al. (2015) proposed the prioritized experience replay algorithm, which extends the original concept by biasing sampling towards experiences exhibiting large TD errors. Such experiences can be intuitively understood as shocking because the outcome was significantly different from what was expected. For humans, shocking experiences are naturally remembered and replayed more often. The agent's learning speed using prioritized experience replay was compared to the original concept on ATARI games; results show that the agent learned faster in 41 out of 49 games.
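A minimal replay buffer sketch is shown below; the capacity follows the 1,000,000-record figure mentioned above, while the mini-batch size is an arbitrary illustrative choice.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store (x, u, r, x_next) tuples and sample
    de-correlated mini-batches for the agent's update step."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def add(self, x, u, r, x_next):
        self.buffer.append((x, u, r, x_next))

    def sample(self, batch_size=64):
        """Draw a uniformly random mini-batch (prioritized replay would bias this by TD error)."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```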

Eligibility traces is a third way to increase learning efficiency. The use of eligibility traces is equivalent to combining TD and MC methods into a unifying algorithm. On a high level, eligibility traces allow agents to update multiple value functions per step, like in MC methods, without the termination of an episode. A detailed overview regarding eligibility traces can be found in Sutton and Barto (2018).

It is also possible to speed up training by exploiting heuristics of the environment. A heuristically accelerated RL (HARL) algorithm uses an external heuristics function H to guide the agent's action selection (Reinaldo et al., 2008). Intuitively, during action selection, HARL agents observe the following values instead:
(58) Q(x, u) + H(x, u)
where H is the heuristics function that adds some value to each action-value to promote or discourage its corresponding action. Heuristics functions are very flexible. For example, an upper confidence bound heuristics function promotes exploration of potentially optimal states, instead of random exploration such as in ϵ-greedy, and is given by (Garivier and Moulines, 2008):
(59) H(x, u) = c sqrt(ln t / Nt(x, u))
where c is the degree of exploration and Nt(x, u) is the number of times that u was selected in x prior to time t. As Nt(x, u) → ∞, the corresponding Q(x, u) value becomes very accurate and H(x, u) → 0. Heuristically accelerated Q-learning was first introduced in Reinaldo et al. (2004), showing significantly better results when applied to mobile robots. For the design of a heuristics function compatible with eligibility traces, see Reinaldo et al. (2012). Ferreira et al. (2014) introduced a heuristics function to accelerate training in multi-agent, multi-objective environments. In Martins and Bianchi (2014), the performance of popular HARL algorithms is compared with their non-heuristic counterparts; it was found that the heuristics variants significantly improved the performance of the original algorithms.
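A sketch of UCB-guided (HARL-style) action selection is shown below, using the bonus form assumed for Eq. (59); unvisited actions are given an infinite bonus so that each is tried at least once.

```python
import math
import numpy as np

def ucb_bonus(t, visit_counts, c=1.0):
    """Heuristic bonus H(x, u) = c * sqrt(ln t / N_t(x, u)); infinite for unvisited actions."""
    counts = np.asarray(visit_counts, dtype=float)
    bonus = np.full(counts.shape, np.inf)
    visited = counts > 0
    bonus[visited] = c * np.sqrt(math.log(max(t, 1)) / counts[visited])
    return bonus

def harl_action(q_row, t, visit_counts, c=1.0):
    """HARL-style selection: act greedily on Q(x, u) + H(x, u), as in Eq. (58)."""
    return int(np.argmax(q_row + ucb_bonus(t, visit_counts, c)))
```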

Lastly, meta-RL is a newer topic introduced to reduce the calibration time of the agent. Specifically, meta-RL targets industrial applications where accurate simulators are difficult to identify. In such situations, the agent requires a long online calibration time even after being pre-trained in simulation because the simulator-process mismatch is large. By applying meta-RL, this calibration time can be significantly reduced. Meta-learning was first introduced in Hochreiter et al. (2001), but the ideas were first applied to RL in Wang et al. (2016) and Duan et al. (2017). In meta-RL, the agent learns a general policy by interacting with many different simulation models (the different models capture model uncertainty). During testing, the general policy should be able to adapt to new, similar tasks quickly. The framework of meta-RL is nearly identical to normal RL; the state and action spaces of the agent do not change across the different simulators. However, the policy identified in meta-RL is given by:
u_t = π(x_t, u_{t−1}, R_{t−1})    (60)

That is, the agent uses the previous action and reward in conjunction with the current states to obtain the current action. Intuitively, u_{t−1} and R_{t−1} provide the agent with intuition about what the objective of the current task may be, creating a near-Markovian setting. Additionally, RNNs are used to provide the agent with a memory of past states, so they do not need to be provided as explicit inputs. For more information regarding meta-RL, see Wang et al. (2016) and Duan et al. (2017).
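A minimal sketch of the policy input structure of Eq. (60), in which the previous action and reward are concatenated with the current state and passed through a recurrent cell whose hidden state acts as the agent's memory (a schematic illustration with assumed dimensions, not the architecture of Wang et al. (2016) or Duan et al. (2017)):

import numpy as np

def meta_rl_step(x_t, u_prev, r_prev, h_prev, W_in, W_h, W_out):
    """One forward step of a recurrent meta-RL policy u_t = pi(x_t, u_{t-1}, R_{t-1}).

    The hidden state h carries a memory of past states, so they do not need to be
    supplied as explicit inputs.
    """
    z = np.concatenate([x_t, u_prev, [r_prev]])    # augmented observation
    h = np.tanh(W_in @ z + W_h @ h_prev)           # simple RNN cell
    u_t = np.tanh(W_out @ h)                       # bounded control action
    return u_t, h

# Illustrative dimensions: 3 states, 2 inputs, hidden size 8
rng = np.random.default_rng(0)
W_in, W_h, W_out = rng.normal(size=(8, 6)), rng.normal(size=(8, 8)), rng.normal(size=(2, 8))
u_t, h = meta_rl_step(np.zeros(3), np.zeros(2), 0.0, np.zeros(8), W_in, W_h, W_out)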

Normally, RL algorithms are initiated tabula rasa and are very data inefficient. With the addition of a replay buffer, eligibility traces, heuristics, and more recently meta-RL, RL is now more feasible for small-scale industrial process control applications.

4.2. Scalability
Historically, the exact solution of MDPs required dynamic programming and was cursed by dimensionality (i.e., the computational complexity increased exponentially as the number of states and/or actions increased) (Bellman, 1957a). However, rapid advancements in nonlinear function approximation of the value function, as demonstrated in Mnih et al. (2013), Mnih et al. (2015), Silver et al. (2014), Lillicrap et al. (2015), Schulman et al. (2015), and Schulman et al. (2017), have largely solved the scalability issue for systems with a reasonable number of states (i.e., systems with fewer than 100 states). For larger systems, multi-agent RL architectures can be used, as demonstrated in OpenAI (2018b), to achieve optimality in an overall system. One flaw of function approximation approaches, especially those using deep learning, is their lack of explainability. If an incident were to occur, identifying the root cause is nearly impossible due to the black-box nature of the policy function. Conversely, explainable algorithms such as tabular Q-learning require tables that grow prohibitively large as the number of states and actions increases, making them infeasible for most problems. For deep RL methods to truly become explainable, more fundamental problems must first be solved in the deep learning literature; therefore, in the near future, industrial applications of RL may be limited to small-scale or safety-insensitive systems if explainability is mandatory.
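To make the dimensionality argument concrete, the following back-of-the-envelope sketch (with purely illustrative discretization sizes) compares the storage required by an explicit Q-table with the parameter count of a small Q-network for the same problem:

# Illustrative sizes only: 10 state variables, 20 bins each, 5 candidate actions
n_states, bins, n_actions, hidden = 10, 20, 5, 64

q_table_entries = (bins ** n_states) * n_actions          # explicit tabular storage
q_net_parameters = (n_states * hidden + hidden) + (hidden * n_actions + n_actions)

print(f"Q-table entries : {q_table_entries:.2e}")          # about 5.1e+13, infeasible
print(f"Q-network params: {q_net_parameters}")             # 1029, trivial to store and train

The table grows exponentially with the number of state variables, while the network grows only linearly, which is the scalability gain (at the cost of explainability) discussed above.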

4.3. Stability
Stability guarantees are also typically required before optimal control projects can be commissioned. The literature on RL stability in the process control space is quite embryonic, with most papers that explicitly study stability appearing from 2017 onwards. Berkenkamp et al. (2017) was the first influential paper on RL stability, in which the authors developed a model-based RL algorithm with stability guarantees. More specifically, the proposed algorithm was proved to be stable in a Lipschitz-continuous setting with a reliable system model. Additionally, it was assumed that the system is initiated in the region of attraction: a forward-invariant subspace within which all state trajectories remain and eventually converge to a goal state. The region of attraction is asymptotically stable and is initially found using the system model; as the agent explores, the region of attraction grows in size. However, in control theory it has been shown that the region of attraction is very difficult to find for large nonlinear systems (Zecevic and Siljak, 2009). In Jin and Lavaei (2018), a stability-certified RL method was introduced in which the input-output gradients of the policy are regulated during updates to obtain strong guarantees of robust stability. Furthermore, the authors applied the proposed algorithm to a multi-agent flight formation task and a power system frequency regulation task to demonstrate its effectiveness in terms of performance, learning speed, and stability. For a more theoretical treatment, Busoniu et al. (2018) provides a thorough overview of RL's stability theory compared to traditional optimal control methods such as H∞, the linear quadratic regulator (LQR), or the linear quadratic Gaussian (LQG).

4.4. Convergence
Convergence of tabular reinforcement learning algorithms for MDPs can be guaranteed given a learning rate 0 ≤ α < 1, a bounded reward function |R_n| ≤ R_max, and a step-size sequence satisfying (Sutton and Barto, 2018):

∑_{k=1}^{∞} α_k = ∞    (61)

∑_{k=1}^{∞} α_k² < ∞    (62)

Eq. (61) is a condition ensuring that the steps are large enough to overcome short-term noise. Subsequently, Eq. (62) imposes a condition ensuring that α eventually becomes sufficiently small for convergence to occur. Such conditions are strict but are required in the stochastic setting (Marti, 2008). For POMDPs, convergence guarantees are difficult because the agent does not know its true current state.
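As a concrete illustration of Eqs. (61) and (62): the commonly used harmonic schedule α_k = 1/k satisfies both conditions, since ∑_{k=1}^{∞} 1/k = ∞ while ∑_{k=1}^{∞} 1/k² = π²/6 < ∞, whereas a constant step size α_k = α_0 > 0 satisfies Eq. (61) but violates Eq. (62) and therefore carries no such guarantee.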

Linear function approximation methods were also proven to converge; the proofs can be found in Sutton and Barto (1998). No proofs currently exist for the convergence of nonlinear function approximation cases in which value functions are updated using direct methods. Direct methods refer to algorithms that perform gradient descent updates under the assumption that the model weights affect only Q(x_t, u_t) and not the successor value Q(x_{t+1}, u_{t+1}). The lack of proofs stems from this flawed assumption: in function approximation cases, parameter updates affect both Q(x_t, u_t) and Q(x_{t+1}, u_{t+1}) because they are often calculated using the same weights. Tabular methods do not suffer from this assumption because each action-value is explicitly stored in the Q-table and each update is independent. On the other hand, RL algorithms using residual gradient descent, in which both Q(x_t, u_t) and Q(x_{t+1}, u_{t+1}) are considered during weight updates, have been shown to converge, even in the POMDP case (Baird, 1995, Baird, 2013).

4.5. Constraints
For nearly all industrial control applications, both state and input constraints are required to address safety concerns (Arendt and Lorenzo, 2000). In RL, input constraints are trivial to enforce because the action space itself can be defined to exclude infeasible inputs. Conversely, state constraints (i.e., guaranteeing the avoidance of certain states) can be very challenging. One could implement soft state constraints by designing a reward function that discourages policies from reaching such states. Indeed, humans are trained in a similar way, with guardians discouraging undesirable behaviour; however, such a weak condition does not satisfy the strict safety requirements of industrial process control.

The earliest study of constrained RL was conducted in Altman (1999), where the author introduced the constrained Markov decision process (CMDP). In CMDPs, there exists an additional value function, called the constraint value function C, which is identified concurrently with Vπ. During action selection, the selected action must satisfy:

C(x_t, u_t) ≤ c

where c is a real-valued threshold. Typically, C is represented as the probability of violating some constraints and must satisfy:

C(x_t, u_t) = Pr(violating the state constraints | x_t, u_t) ≤ pr_0

where pr_0 is the maximum allowable probability of violating the given constraints (Geibel, 2006). CMDP systems are solved using linear programming because of the multiple reward functions. Furthermore, CMDP RL methods are usually model-based, and many solutions are intractable for high-dimensional problems. Moreover, nearly all recent progress on constrained RL uses concepts from constrained optimization to perform a policy search (Achiam, Held, Tamar, Abbeel, 2017, Andersson, Heintz, Doherty, 2015, Bhatia, Varakantham, Kumar, 2018). For example, Achiam et al. (2017) introduced a solution to continuous CMDPs using a constrained policy optimization method. In Bhatia et al. (2018), the authors developed three different optimization approaches on top of DDPG to handle resource allocation constraints.
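A minimal sketch of the constrained action selection described above, assuming a finite action set and a learned estimate C of the violation probability (the function and names are illustrative assumptions, not the algorithm of Altman (1999) or Geibel (2006)):

def constrained_greedy_action(Q, C, x, actions, pr_max=0.05):
    """Pick the highest-value action whose estimated constraint-violation
    probability C(x, u) stays below the allowable threshold pr_max.
    Falls back to the least risky action when no action is feasible."""
    feasible = [u for u in actions if C[(x, u)] <= pr_max]
    if not feasible:
        return min(actions, key=lambda u: C[(x, u)])
    return max(feasible, key=lambda u: Q[(x, u)])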

More recently, system constraints have been handled through a new field of study in RL named safe reinforcement learning, in which researchers attempt to design agents that solve tasks without violating the safety characteristics of the system. Typically, there are two approaches: i) modify the agent's optimality criterion to include a safety factor; or ii) provide the agent with external heuristics about the system or guide the agent using a risk metric. For example, Berkenkamp et al. (2017) trained an agent to solve the inverted pendulum problem without the pendulum ever falling down. A survey of safe RL methods can be found in Garcia (2015).

4.6. Accurate simulator
Overall, it seems that the most impressive applications of RL (i.e., agents that were easily and decisively able to surpass human-level performance) have all been applied to video games, with very limited applications elsewhere. The main reason is twofold: games have explicit rules that are well understood, and accurate simulators are lacking for industrial settings.

Firstly, even the most complex games are known completely by their designers; it is therefore much simpler to design reward functions because the ultimate goal is known. Indeed, an agent pursuing the true ultimate goal without unnecessary bias has been demonstrated to be vastly superior, as shown by AlphaGo Zero and AlphaZero (Silver, Huang, Maddison, Guez, Sifre, Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, Dieleman, Grewe, Nham, Kalchbrenner, Sutskever, Lillicrap, Leach, Kavukcuoglu, Graepel, Hassabis, 2016, Silver, Schrittwieser, Simonyan, Antonoglou, Huang, Guez, Hubert, Baker, Lai, Bolton, Chen, Chen, Lillicrap, Hui, Sifre, Driessche, Graepel, Hassabis, 2017b). However, when considering even the simplest SISO control tasks, it is not explicitly known whether a controller with oscillations but superior tracking performance is better than a more conservative counterpart. As such, designing a good reward function is very challenging because the true objective is often unclear. Notice that even in Google DeepMind's case, the ultimate objective was not simply to minimize electricity consumption, but rather to minimize PUE. The implications of this choice were not provided, but one can assume that minimizing electricity costs alone may lead to unintended consequences.

Secondly, most major triumphs of reinforcement learning were achieved on systems with a perfect simulator. That is, the performance achieved by following specific policies in the training phase can be exactly reproduced during online application, and the mapping of arbitrary simulated states to actions is exactly representative of what would occur in the real world. Without a doubt, such a condition is somewhat achievable in the real world for some tasks; for example, astronauts and pilots are first trained in simulation before deployment. However, producing such a simulator for industrial process control is time-consuming, costly, and oftentimes impossible. In fact, it might only be economically feasible for small-scale systems, or if the agent is given the opportunity to directly manipulate the industrial distributed control system. Moreover, if such models are identifiable and can be represented mathematically, model-based optimal control methods (known as planning methods in the RL literature), such as MPC, may be the superior choice, assuming feasible computation times. In previous sections, it was shown that RL can be trained on sub-optimal models and eventually adapt its policy to the environment (a significant advantage over traditional model-based approaches); however, this topic has yet to be thoroughly explored outside of simulations.

Most importantly, the system model identified during process identification must be representative of the real process. If it is not, the optimal policy identified during training will perform poorly and could potentially become a safety concern on the real process. Consider the following quantitative example:
Suppose there exists a simple arbitrary system given by:
(63)


Table 8 shows two data sets collected for the system described by Eq. (63). The first and second data sets were then used to identify Eqs. (64) and (65), respectively. In both cases, the MSE of the model was zero on its respective data set. Despite this, it can be seen that neither model represents the real system, shown in Eq. (63), in the slightest; in fact, Eq. (65) does not even consider u2 in the system model. To avoid confusion, the systems described by Eqs. (64) and (65) will be referred to as system 1 and system 2, respectively, and the actual system shown in Eq. (63) will be denoted as the real system. Moreover, the RL agents trained on systems 1 and 2 are denoted as agent 1 and agent 2, respectively.
(64)
(65)
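The pitfall can be reproduced with a purely hypothetical sketch (the plant, data, and models below are illustrative inventions and are not the system of Eq. (63), the data of Table 8, or the models of Eqs. (64) and (65)): two small data sets generated by the same plant are each fitted perfectly by a different, structurally wrong model, yet both models fail away from their own operating regime:

import numpy as np

# Hypothetical "real" plant, unknown to the modeller: y = u1 * u2
plant = lambda u1, u2: u1 * u2

# Data set A was logged while u2 was held at 2, so y = 2*u1 fits it with zero MSE
u1_a, u2_a = np.array([1.0, 3.0, 5.0]), np.full(3, 2.0)
model_a = lambda u1, u2: 2.0 * u1                        # ignores u2 entirely

# Data set B was logged while u1 was held at 3, so y = 3*u2 fits it with zero MSE
u1_b, u2_b = np.full(3, 3.0), np.array([1.0, 2.0, 4.0])
model_b = lambda u1, u2: 3.0 * u2                        # ignores u1 entirely

for name, model, u1, u2 in [("A", model_a, u1_a, u2_a), ("B", model_b, u1_b, u2_b)]:
    print(name, np.mean((plant(u1, u2) - model(u1, u2)) ** 2))   # both print 0.0

print(plant(5.0, 5.0), model_a(5.0, 5.0), model_b(5.0, 5.0))     # 25.0 10.0 15.0

A policy optimized against either fitted model would therefore be optimized for the wrong plant, which is exactly the failure mode of agents 1 and 2 below.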


Suppose that this arbitrary system requires a controller for set-point tracking, with the objective described by the following reward function:
(66)
where Δu1 and Δu2 denote the changes in the controller inputs between t−1 and t. For this task, two different RL agents were trained on the system models given by Eqs. (64) and (65). Both agents were trained for 50,000 update steps and shared the same hyperparameters, shown in Table 9.
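Since Eq. (66) itself is not reproduced here, the sketch below shows one typical form that such a set-point tracking reward can take, penalizing the squared tracking error and the change in each controller input; the weights and exact functional form are assumptions for illustration only:

def tracking_reward(y, y_sp, du1, du2, w_e=1.0, w_u=0.1):
    """Illustrative set-point tracking reward: penalize squared tracking error
    and the input moves du1 = u1(t) - u1(t-1), du2 = u2(t) - u2(t-1)."""
    return -(w_e * (y - y_sp) ** 2 + w_u * (du1 ** 2 + du2 ** 2))

r = tracking_reward(y=9.0, y_sp=10.0, du1=0.5, du2=-0.2)   # r = -1.029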

Fig. 25a and c show the state trajectories of the agents trained using Eqs. (64) and (65) when applied to their respective training systems. Fig. 25b and d show the trajectories of the actual system when the two agents are implemented. The cumulative reward of the different agents is shown in Table 10. It can be seen that both agents performed poorly on the real system due to the non-representative training models.

Ultimately, the poor performance of the RL agents was caused by the non-representativeness of the data and/or poor selection of the model structures. The optimal policy identified by an agent trained on either model is not reliable because the models do not resemble the real system; therefore, it is critical to ensure that the training model is at least somewhat representative of the real system.

Table 8. Two data sets collected from an arbitrary process.
Data set 1          Data set 2
10   1   3          6    2   −4
18   3   3          8    3   −7
12   2   2          10   4   −12
Table 9. Hyperparameters for the agents controlling an arbitrary system.

Hyperparameter              Value
States, x
Action 1, u1
Action 2, u2
Reward, r                   Eq. (66)
Learning rate, α            [0.001, 0.7]
Discount factor, γ          0.95
Exploratory factor, ϵ       [0.1, 1]
Evaluation time             1 second
System representation       FOMDP
Fig. 25. State trajectories of the two RL agents applied to system 1 and system 2 and the real system.

Table 10. Cumulative reward of the agents.
Agent 1 on System 1         −150
Agent 2 on System 2         −156
Agent 1 on Real System      −6255
Agent 2 on Real System      −10698
5. Conclusions
Reinforcement learning has demonstrated great potential to surpass human-level performance in many complex tasks, provided that a sufficiently accurate simulator exists or can be constructed. It has been shown to possess the ability to learn many different tasks using the same algorithm, which may imply great potential in engineering, where the custom design of similar projects results in significant cost and time expenditure. Moreover, reinforcement learning also exhibits the ability to self-learn, significantly reducing development time.

With a proper problem formulation, RL has also been successfully simulated in process control problems with regulation or set-point tracking objectives. Optimal control problems have likewise been shown to be feasible for RL and ADP methods in the literature. A gentle first step towards industrial-scale RL implementation could be to use RL for PID tuning. Indeed, many methods have been proposed to re-configure PID parameters as a function of the proportional, integral, and derivative errors, achieving superior control performance compared to other tuning methods. For more ambitious projects, a closely monitored RL agent might be feasible for continuous adaptive optimal control. Google was one of the first movers to do so, resulting in up to 40% electricity savings for cooling in their industrial data centers. Another potential advantage of RL is its direct adaptive characteristic: compared to traditional adaptive optimal control frameworks, where model re-identification is required, RL is model-free and adapts the policy directly. Lastly, RL can solve the temporal credit assignment problem, in which values are assigned to each state to denote its desirability. This information could have potential implications for alarm management, root cause analysis, and other related applications. The most influential advantages and disadvantages of reinforcement learning are summarized in Table 11.

Table 11. Most influential advantages and disadvantages of reinforcement learning.
Advantages                          Disadvantages
Online computation time             Accurate simulator required
Can learn many tasks                Reward design can be difficult
Direct adaptive optimal control     Stability theory lacking
Engineered features not needed      State constraints are difficult
As a closing note, the greatest characteristic of RL is its generality, which allows nearly anything to be learned through a general algorithm. Although modern RL still faces many shortcomings, it is expected that RL will play an important role in industrial process control in the near future.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments
This research was supported in part by the NSERC Industrial Research Chair in Control of Oil Sands Processes and NSERC discovery grants.

References
J. Achiam, D. Held, A. Tamar, P. Abbeel. Constrained policy optimization. ML (2017). arXiv:1705.10528.
A. Akbarimajd. Reinforcement learning adaptive PID controller for an under-actuated robot arm. Int. J. Integrat. Eng., 7 (2015), pp. 20-27.
E. Altman. Constrained Markov Decision Processes (1999). ISSN 01676377. 10.1016/0167-6377(96)00003-X.
O. Andersson, F. Heintz, P. Doherty. Model-based reinforcement learning in continuous environments using real-time constrained optimization. AAAI (2015).
J.S. Arendt, D.K. Lorenzo. Evaluating process safety in the chemical industry: a user's guide to quantitative risk analysis. American Institute of Chemical Engineers, New York, New York, USA (2000). ISBN 978-0-81-690746-5.
D.K. Asis, A. Chan, S. Pitis, R.S. Sutton, D. Graves. Fixed-horizon temporal difference methods for stable reinforcement learning. AAAI (2020). arXiv:1909.03906.
AspenTech (2019). AspenONE advanced process control. [Online]. Available: https://www.aspentech.com/uploadedfiles/v7/1732_15_aspen_apc_web.pdf.
L.C. Baird. Residual algorithms. Proceedings of the Workshop on Value Function Approximation, Machine Learning (1995).
L.C. Baird. Reinforcement learning with function approximation. Machine Learning (2013). San Francisco, CA.
A. Barto, R.S. Sutton, P.S. Brouwer. Associative search network: a reinforcement learning associative memory. Biol. Cybern., 40 (1981), pp. 201-211.
R.E. Bellman. Dynamic Programming. Princeton University Press, New Jersey, USA (1957a).
R.E. Bellman. A Markovian decision process. J. Math. Mech., 6 (1957b), pp. 679-684.
A. Bemporad, M. Morari, V. Dua, E.N. Pistikopoulos. The explicit linear quadratic regulator for constrained systems. Automatica, 38 (2002), pp. 3-20. 10.1016/s0005-1098(01)00174-1.
F. Berkenkamp, M. Turchetta, A.P. Schoellig, A. Krause. Safe model-based reinforcement learning with stability guarantees. NIPS (2017), pp. 1-11.
A. Bhatia, P. Varakantham, A. Kumar. Resource constrained deep reinforcement learning. ML (2018). arXiv:1812.00600.
E. Borel. Les probabilités dénombrables et leurs applications arithmétiques. Rendiconti del Circolo Matematico di Palermo, 27 (1909), pp. 247-271.
C. Borgnakke, R.E. Sonntag. Fundamentals of Thermodynamics. Wiley, Hoboken, New Jersey, USA (2008). ISBN 978-0-470-04192-5.
L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT (2010), pp. 177-186. 10.1007/978-3-7908-2604-3-16.
S.J. Bradtke, M.O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. NIPS (1994), pp. 393-400.
L.A. Brujeni, J.M. Lee, S.L. Shah. Dynamic tuning of PI-controllers based on model-free reinforcement learning methods. ICCAS (2010), pp. 453-458. Gyeonggi-do.
L. Busoniu, T.D. Bruin, D. Tolic, J. Kober, I. Palunko. Reinforcement learning for control: performance, stability, and deep approximators. Annu. Rev. Control, 46 (2018), pp. 8-28.
J. Cannadey. Next generation intrusion detection: autonomous reinforcement learning of network attacks. NISSC (2000), pp. 1-12.
C. Chabris. The real kings of chess are computers. Wall Street J. (2015).
S.K. Chenna, Y.K. Jain, H. Kapoor, R.S. Bapi, N. Yadaiah, A. Negi, V.S. Rao, B.L. Deekshatulu. State estimation and tracking problems: a comparison between Kalman filter and recurrent neural networks. ICONIP (2004), pp. 275-281.
DeepMind, G. (2016a). AlphaGo. [Online]. Available: https://deepmind.com/research/case-studies/alphago-the-story-so-far.
DeepMind, G. (2016b). AlphaZero: shedding new light on chess, shogi, and go. [Online]. Available: https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go.
DeepMind, G. (2016c). DeepMind AI reduces Google data centre cooling bill by 40%. [Online]. Available: https://deepmind.com/blog/article/deepmind-ai-reduces-google-data-centre-cooling-bill-40.
DeepMind, G. (2018). Safety-first AI for autonomous data centre cooling and industrial control. [Online]. Available: https://deepmind.com/blog/article/safety-first-ai-autonomous-data-centre-cooling-and-industrial-control.
DeepMind, G. (2019). AlphaStar: mastering the real-time strategy game StarCraft II. [Online]. Available: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii.
M.P. Deisenroth, G.P. Neumann. A survey on policy search for robotics. Found. Trend. Robot. (2013), pp. 1-142.
A. Doerr, D. Nguyen-Tuong, A. Marco, S. Schaal, S. Trimpe. Model-based policy search for automatic tuning of multivariate PID controllers. ICRA (2017). arXiv:1703.02899.
Y. Duan, J. Schulman, X. Chen, P. Bartlett, I. Sutskever, P. Abbeel. RL2: fast reinforcement learning via slow reinforcement learning. ICLR (2017). arXiv:1611.02779.
J.C. Dunn, D.P. Bertsekas. Efficient dynamic programming implementations of Newton's method for unconstrained optimal control problems. J. Optim. Theory Appl., 63 (1) (1989), pp. 23-38.
M. Ellis, H. Durand, P.D. Christofides. A tutorial review of economic model predictive control methods. J. Process Control, 24 (2014), pp. 1156-1178.
H. Fan, C. Yao, J. Guo, X. Lu. Deep reinforcement learning for energy efficiency optimization in wireless networks. ICCCBDA (2019).
L. Ferreira, C. Riberiro, R. Bianchi. Heuristically accelerated reinforcement learning modularization for multi-agent multi-objective problems. Appl. Intell., 41 (2014), pp. 551-562.
S. Fujimoto, H. Hoof, D. Meger. Addressing function approximation error in actor-critic methods. ICML (2018). arXiv:1802.09477.
J. Garcia. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16 (2015), pp. 1437-1480.
A. Garivier, E. Moulines. On upper-confidence bound policies for non-stationary bandit problems (2008). arXiv:0805.3415.
Z. Ge, Z. Song, S.X. Ding, B. Huang. Data mining and analytics in the process industry: the role of machine learning. IEEE (2017), pp. 20590-20616.
P. Geibel. Reinforcement learning for MDPs with constraints. ECML (2006), pp. 646-653.
I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. The MIT Press, Cambridge, Massachusetts, USA (2015).
D. Gorges. Relations between model predictive control and reinforcement learning. IFAC, 50 (2017), pp. 4920-4928.
C.S. Gurel. Technical Report: Q-Learning for Adaptive PID Control of a Line Follower Mobile Robot. University of Maryland (2017), pp. 1-19.
T. Haarnoja, A. Zhou, P. Abbeel, S. Levine. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML (2018). arXiv:1801.01290.
A. Hakim, H. Hindersah, E. Rijanto. Application of reinforcement learning on self-tuning PID controller for soccer robot multi-agent system. rICT & ICeV-T (2013).
P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger. Deep reinforcement learning that matters. AAAI (2018). arXiv:1709.06560.
J. Hinton, T. Sejnowski. Unsupervised Learning: Foundations of Neural Computation. The MIT Press, Cambridge, Massachusetts, USA (1999). ISBN 978-0-26-258168-4.
A. Hochreiter, S. Younger, P. Conwell. Learning to learn using gradient descent. ICANN (2001).
J.C. Hoskins, D.M. Himmelblau. Process control via neural networks and reinforcement learning. Comput. Chem. Eng., 16 (1992), pp. 241-251.
M.N. Howell, M.C. Best. On-line PID tuning for engine idle-speed control using continuous action reinforcement learning automata. Control Eng. Pract., 8 (2000), pp. 147-154.
C. Huang, Y. Wu, Y. Zuo, K. Pei, K. Min. Towards experienced anomaly detector through reinforcement learning. AAAI, 10 (2018), pp. 8087-8088.
A.E.M. Huesman, O.H. Bosgra, P.M.J. Van Hof. Integrating MPC and RTO in the process industry by economic dynamic lexicographic optimization; an open-loop exploration. AICHE (2008).
M. Jin, J. Lavaei. Stability-certified reinforcement learning: a control-theoretic perspective. ML (2018). arXiv:1810.11505.
M. Joy, N. Kaisare. Approximate dynamic programming-based control of distributed parameter systems. J. Chem. Eng. (6) (2011), pp. 452-459.
L. Kaelbling, M. Littman, A. Cassandra (1999). POMDPs and their algorithms, sans formula! [Online]. Available: http://cs.brown.edu/research/ai/pomdp/tutorial/index.html.
I. Karatzas, S.E. Shreve. Brownian Motion and Stochastic Calculus (2nd ed.). Springer-Verlag, Berlin, Germany (1991). ISBN 978-0-387-97655-6.
N. Laptev, S. Amizadeh, I. Flint. Generic and scalable framework for automated time-series anomaly detection. ACM SIGKDD (2015), pp. 1939-1947.
H.L. Lee, J. Shin, M.J. Realff. Machine learning: overview of the recent progresses and implications for the process systems engineering field. Comput. Chem. Eng., 114 (9) (2018), pp. 111-121.
J.H. Lee, W. Wong. Approximate dynamic programming approach for process control. J. Process Control, 20 (9) (2010), pp. 1038-1048.
J.M. Lee, J.H. Lee. Approximate dynamic programming-based approaches for input-output data-driven control of nonlinear processes. Automatica, 41 (7) (2005), pp. 1281-1288.
T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra. Continuous control with deep reinforcement learning (2015). arXiv:1509.02971.
L. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn., 8 (1992), pp. 293-321.
C. Lusena, M. Mundhenk, J. Goldsmith. Nonapproximability results for partially observable Markov decision processes. J. AI Res., 14 (2001), pp. 83-103.
C.T. Maravelias, C. Sung. Integration of production planning and scheduling: overview, challenges and opportunities. CCE, 33 (2009), pp. 1919-1930.
K. Marti. Stochastic Optimization Methods (2nd ed.). Springer, New York, USA (2008). ISBN 978-3-540-79458-5.
M. Martins, R. Bianchi. Heuristically-accelerated reinforcement learning: a comparative analysis of performance. TAROS (2014), pp. 15-27.
D. Mayne, J.B. Rawlings. Model Predictive Control: Theory and Design (2nd ed.). Nob Hill Publishing, Wisconsin, USA (2017).
E. Meijering. A chronology of interpolation: from ancient astronomy to modern signal and image processing. IEEE, 90 (2002), pp. 319-342.
M.R.K. Mes, A.P. Rivera. Approximate Dynamic Programming by Practical Examples. Springer, New York, USA (2017). ISBN 978-3-319-47764-0.
M. Minsky. Steps towards artificial intelligence. IRE (1973), pp. 8-31.
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller. Playing Atari with deep reinforcement learning. NIPS (2013).
V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518 (2015), pp. 529-533.
N. Monfort, I. Bogost. Racing the Beam: The Atari Video Computer System. The MIT Press, Cambridge, Massachusetts, USA (2018). ISBN 0-262-01257-X.
T. Moriyama, G.D. Magistris, M. Tatsubori, T. Pham, A. Munawar, R. Tachibana. Reinforcement learning testbed for power-consumption optimization. AsiaSim (2018).
A. Ng. Shaping and Policy Search in Reinforcement Learning. PhD dissertation, University of California, Berkeley (2003).
A. Ng (2018). Machine Learning Yearning. First edition.
R. Nian, J. Liu, B. Huang, M. Tawanda. Fault tolerant control system: a reinforcement learning approach. SICE (2019), pp. 1010-1015.
OpenAI (2018a). How to train your OpenAI Five. [Online]. Available: https://openai.com/blog/how-to-train-your-openai-five/.
OpenAI (2018b). OpenAI Five. [Online]. Available: https://openai.com/blog/openai-five/.
G. Pannocchia, M. Gabiccini, A. Artoni. Offset-free MPC explained: novelties, subtleties, and applications. IFAC, 48 (2015), pp. 342-351.
A.S. Poznyak. Advanced Mathematical Tools for Automatic Control Engineers: Deterministic Techniques, Volume 1. Elsevier, Oxford, United Kingdom (2008). ISBN 978-0-08-044674-5.
PwC (2019). Sizing the prize: what's the real value of AI for your business and how can you capitalise? [Online]. Available: www.pwc.com/gx/en/issues/analytics/assets/pwc-ai-analysis-sizing-the-prize-report.pdf.
L. Raju, R.S. Milton, S. Suresh, S. Sankar. Reinforcement learning in adaptive control of power system generation. Procedia Comput. Sci., 46 (2015), pp. 202-209.
K. Rawlik, M. Toussaint, S. Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. IJCAI (2013), pp. 3052-3060.
B. Reinaldo, C. Riberiro, A. Costa. Heuristically accelerated Q-learning: a new approach to speed up reinforcement learning. SBIA (2004), pp. 245-254.
B. Reinaldo, C. Riberiro, A. Costa. Accelerating autonomous learning by using heuristic selection of actions. J. Heurist., 14 (2008), pp. 135-168.
B. Reinaldo, C. Riberiro, A. Costa. Heuristically accelerated reinforcement learning: theoretical and experimental results. ECAI (2012), pp. 169-175.
A. Reis (2017). Reinforcement learning: eligibility traces and TD(λ). [Online]. Available: https://amreis.github.io/ml/reinf-learn/2017/11/02/reinforcement-learning-eligibility-traces.html.
S. Richter, C.N. Jones, M. Morari. Computational complexity certification for real-time MPC with input constraints based on the fast gradient method. Automat. Control, 57 (2012), pp. 1391-1403.
S.J. Russel, P. Norvig. Artificial Intelligence: A Modern Approach (3rd ed.). Prentice Hall, Upper Saddle River, New Jersey, USA (2009), pp. 1-22. ISBN 978-0-13-604259-4.
Schaul, Quan, Antonoglou, Silver, 2015
Schaul, Quan, Antonoglou, 银奖, 2015
T. Schaul, J. Quan, I. Antonoglou, D. Silver
T.Schaul, J.Quan, I.Antonoglou, D.西尔弗
Prioritized experience replay
优先体验重播
ML (2015) ML (2015年)
arXiv:1511.05952 arXiv:1511.05952
Google Scholar Google 学术搜索
Schuck, Niv, 2019 舒克,尼夫,2019
N.W. Schuck, Y. Niv NW舒克,Y.Niv
Sequential replay of nonspatial task states in the human hippocampus
人类海马体中非空间任务状态的顺序回放
Science, 364 (2019), pp. 1-11, 10.1126/science.aaw5181
《科学 (Science)》,第 364 卷 (2019 年),第 1-11 页,10.1126/science.aaw5181
View PDF
This article is free to access.
Google Scholar Google 学术搜索
Schulman, Levine, Moritz, Jordan, Abbeel, 2015
舒尔曼,莱文,莫里茨,乔丹,阿比尔,2015
J. Schulman, S. Levine, P. Moritz, M.I. Jordan, P. Abbeel
J.舒尔曼、S.莱文、P.莫里茨、M.I.乔丹、P.阿贝尔
Trust region policy optimization
信任区域策略优化
ICML (2015) ICML (2015年)
arXiv:1502.05477 arXiv:1502.05477
Google Scholar Google 学术搜索
Schulman, Wolski, Dhariwal, Radford, Klimov, 2017
舒尔曼、沃尔斯基、达里瓦尔、拉德福德、克里莫夫,2017
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov
J.舒尔曼、F.沃尔斯基、P.达里瓦尔、A.拉德福德、O.克里莫夫
Proximal policy optimization algorithms
近端策略优化算法
Mach. Learn. (2017) 学习。(2017)
arXiv:1707.06347 arXiv:1707.06347
Google Scholar Google 学术搜索
Seborg, Mellichamp, Edgar, Doyle, 2013
塞博格、梅利尚、埃德加、道尔,2013
D.E. Seborg, D.A. Mellichamp, T.F. Edgar, F.J. Doyle
D.E.塞博格、D.A.梅利尚、T.F.埃德加、F.J.道尔
Process Dynamics and Control
过程动力学与控制
(2nd), Wiley, Hoboken, New Jersey, USA (2013)
(第二名),Wiley,霍博肯,新泽西州,美国(2013)
Google Scholar Google 学术搜索
ISBN 978-0-470-12867-1 国际标准图书编号 978-0-470-12867-1

Sedighizadeh, Rezazadeh, Sun, 2008
Sedighizadeh, Rezazadeh, 太阳, 2008
M. Sedighizadeh, A. Rezazadeh, W. Sun
M.Sedighizadeh, A.Rezazadeh, W.Sun
Adaptive PID controller based on reinforcement learning for wind turbine control
基于强化学习的风电机组控制自适应PID控制器
Int. J. Electric. Inf. Eng., 2 (2008), pp. 124-130
国际电气。《Inf. Eng.》,第 2 卷(2008 年),第 124-130 页
Google Scholar Google 学术搜索
Shin, Badgwell, Liu, Lee, 2019
J. Shin, T.A. Badgwell, K. Liu, J.H. Lee
Reinforcement learning - overview of recent progress and implications for process control
Comput. Chem. Eng., 127 (2019), pp. 282-294, 10.1016/j.compchemeng.2019.05.029
Sidhu, Siddhamshetty, Kwon, 2018
H.S. Sidhu, P. Siddhamshetty, J.S. Kwon
Approximate dynamic programming based control of proppant concentration in hydraulic fracturing
Mathematics, 6 (8) (2018), p. 132
Silver
Silver, D. (2018). Class lecture, topic: "planning by dynamic programming." COMPGI13. Imperial College London, London, Mar. 12.
Silver, Huang, Maddison, Guez, Sifre, Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, Dieleman, Grewe, Nham, Kalchbrenner, Sutskever, Lillicrap, Leach, Kavukcuoglu, Graepel, Hassabis, 2016
D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis
Mastering the game of go with deep neural networks and tree search
Nature, 529 (2016), pp. 484-503
Silver, Hubert, Schrittwieser, Antonoglou, Lai, Guez, Lanctot, Sifre, Kumaran, Graepel, Lillicrap, Simonyan, Hassabis, 2017a
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis
Mastering chess and shogi by self-play with a general reinforcement learning algorithm
ML (2017), arXiv:1712.01815
Silver, Hubert, Schrittwieser, Antonoglou, Lai, Guez, Lanctot, Sifre, Kumaran, Graepel, Lillicrap, Simonyan, Hassabis, 2018
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis
A general reinforcement learning algorithm that masters chess, shogi, and go through self-play
Science, 362 (2018), 10.1126/science.aar6404
Silver, Lever, Heess, Degris, Wierstra, Riedmiller, 2014
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller
Deterministic policy gradient algorithms
ICML (2014)
Silver, Schrittwieser, Simonyan, Antonoglou, Huang, Guez, Hubert, Baker, Lai, Bolton, Chen, Chen, Lillicrap, Hui, Sifre, Driessche, Graepel, Hassabis, 2017b
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, A. Chen, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Driessche, T. Graepel, D. Hassabis
Mastering the game of go without human knowledge
Nature, 550 (2017), pp. 354-359
Spielberg, Gopaluni, Loewen, 2017
S.P.K. Spielberg, R.B. Gopaluni, P.D. Loewen
Deep reinforcement learning approaches for process control
AdCONIP (10) (2017), pp. 201-207, 10.1109/ADCONIP.2017.7983780
Sutton, Barto, 1998
R. Sutton, A. Barto
Reinforcement Learning: An Introduction
(1st ed.), The MIT Press, Cambridge, Massachusetts, USA (1998), pp. 206-207
Sutton, Barto, 2018
R. Sutton, A. Barto
Reinforcement Learning: An Introduction
(2nd ed.), The MIT Press, Cambridge, Massachusetts, USA (2018)
Sutton, 1988
R.S. Sutton
Learning to predict by the methods of temporal differences
Mach. Learn., 3 (1988), pp. 9-44
Sutton, Barto, Williams, 1991
R.S. Sutton, A.G. Barto, R.J. Williams
Reinforcement learning is direct adaptive optimal control
ACC (1991)
Sutton, McAllester, Singh, Mansour, 1999
R.S. Sutton, D. McAllester, S. Singh, Y. Mansour
Policy gradient methods for reinforcement learning with function approximation
NIPS (1999), pp. 1057-1063
Tan, Sun, Kong, Zhang, Yang, Liu, 2018
C.T. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, C. Liu
A survey on deep transfer learning
ICANN (2018)
Taylor, Stone, 2009
M.E. Taylor, P. Stone
Transfer learning for reinforcement learning domains: a survey
J. Mach. Learn. Res., 10 (2009), pp. 1633-1685
Technologies
Turbine Technologies (2019). Pumps & process automation lab. [Online]. Available: http://www.turbinetechnologies.com/educational-lab-products/pump-lab-with-automation.
Thorndike
Thorndike, E.L. Animal intelligence. Hafner, Darien, CT.
Trends
Google Trends (2017). The growth in search of reinforcement learning. [Online]. Available: https://trends.google.com/trends/explore?date=2007-01-01%202019-08-21&q=reinforcement%20learning.
Uhlenbeck, Ornstein, 1930
G. Uhlenbeck, L.S. Ornstein
On the theory of the Brownian motion
Phys. Rev., 36 (1930)
Vinyals, Babuschkin, Czarnecki, Mathieu, Dudzik, Chung, Choi, Powell, Lillicrap, Kavukcuoglu, Hassabis, Apps, Silver, 2019
O. Vinyals, I. Babuschkin, W. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. Choi, R. Powell, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, D. Silver
Grandmaster level in StarCraft II using multi-agent reinforcement learning
Nature, 575 (2019), 10.1038/s41586-019-1724-z
Wang, Kurth-Nelson, Tirumala, Soyer, Leibo, Munos, Blundell, Kumaran, Botvinick, 2016
J. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Leibo, R. Munos, C. Blundell, D. Kumaran, M. Botvinick
Learning to reinforcement learn
ML (2016), arXiv:1611.05763
Wang, Cheng, Sun, 2006
X. Wang, Y. Cheng, W. Sun
A proposal of adaptive PID controller based on reinforcement learning
J. China Univ. Min. Technol., 17 (2006), pp. 40-44
Wang, Boyd, 2008
Y. Wang, S. Boyd
Fast model predictive control using online optimization
Proc. 17th IFAC World Congress (2008)
Wang, Velswamy, Huang, 2017
Y. Wang, K. Velswamy, B. Huang
A long-short term memory recurrent neural network based reinforcement learning controller for office heating ventilation and air conditioning systems
Processes, 5 (2017), pp. 1-18, 10.3390/pr5030046
Wang, Velswamy, Huang, 2018
Y. Wang, K. Velswamy, B. Huang
A novel approach to feedback control with deep reinforcement learning
IFAC (2018), pp. 31-37
Wright, 1997
S.J. Wright
Applying new optimization algorithms to model predictive control
Chem. Process Control, 91 (316) (1997), pp. 147-155
Xu, 2010
X. Xu
Sequential anomaly detection based on temporal-difference learning: principles, models and case studies
Appl. Soft Comput., 10 (2010), pp. 859-867
Zecevic, Siljak, 2009
A.I. Zecevic, D.D. Siljak
Regions of attraction
Control Complex Syst. (2009), pp. 111-141