Deep Reinforcement Learning for Autonomous Driving: A Survey

Abstract:
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
Published in: IEEE Transactions on Intelligent Transportation Systems (Volume: 23, Issue: 6, June 2022)
Page(s): 4909 - 4926
Date of Publication: 09 February 2021
DOI: 10.1109/TITS.2021.3054625
Publisher: IEEE
SECTION I. Introduction
Autonomous driving (AD) systems consist of multiple perception level tasks that have now achieved high precision on account of deep learning architectures. Besides perception, autonomous driving systems involve multiple tasks where classical supervised learning methods are no longer applicable. First, the prediction of the agent’s action changes the future sensor observations received from the environment in which the autonomous driving agent operates, for example in the task of selecting the optimal driving speed in an urban area. Second, supervisory signals such as time to collision (TTC) or lateral error with respect to the optimal trajectory of the agent represent the dynamics of the agent as well as the uncertainty in the environment. Such problems require defining the stochastic cost function to be maximized. Third, the agent is required to learn new configurations of the environment, as well as to predict an optimal decision at each instant while driving in its environment. This represents a high dimensional space, since the number of unique configurations under which the agent and environment can be observed is combinatorially large. In all such scenarios we are aiming to solve a sequential decision process, which is formalized under the classical setting of Reinforcement Learning (RL), where the agent is required to learn and represent its environment as well as act optimally at each instant [1]. The mapping from states to optimal actions is referred to as the policy.

In this review we cover the notions of reinforcement learning and present a taxonomy of automated driving tasks where RL is a promising solution, especially in the domains of driving policy, predictive perception, path and motion planning, and low level controller design. We also focus our review on the different real world deployments of RL in the domain of autonomous driving, expanding our conference paper [2], since such deployments have not been reviewed in an academic setting. Finally, we motivate users by demonstrating the key computational challenges and risks in applying current day RL algorithms such as imitation learning and deep Q learning, among others. We also note from the publication trends in Figure 2 that the use of RL or Deep RL applied to autonomous driving or the self driving domain is an emergent field. This is due to the only recent adoption of RL/DRL algorithms in this domain, leaving open multiple real world challenges in implementation and deployment. We address these open problems in Section VI.

Fig. 1. Standard components in a modern autonomous driving systems pipeline listing the various tasks. The key problems addressed by these modules are Scene Understanding, Decision and Planning.

Fig. 2. Trend of publications for keywords 1. “reinforcement learning”, 2. “deep reinforcement”, and 3. “reinforcement learning” AND (“autonomous cars” OR “autonomous vehicles” OR “self driving”), showing academic publication trends, from [13].

The main contributions of this work can be summarized as follows:

Self-contained overview of RL background for the automotive community as it is not well known.

Detailed literature review of using RL for different autonomous driving tasks.

Discussion of the key challenges and opportunities for RL applied to real world autonomous driving.

The rest of the paper is organized as follows. Section II provides an overview of components of a typical autonomous driving system. Section III provides an introduction to reinforcement learning and briefly discusses key concepts. Section IV discusses more sophisticated extensions on top of the basic RL framework. Section V provides an overview of RL applications for autonomous driving problems. Section VI discusses challenges in deploying RL for real-world autonomous driving systems. Section VII concludes this paper with some final remarks.
SECTION II. Components of AD System
Figure 1 comprises the standard blocks of an AD system, demonstrating the pipeline from sensor stream to control actuation. The sensor architecture in a modern autonomous driving system notably includes multiple sets of cameras, radars and LIDARs, as well as a GPS-GNSS system for absolute localisation and inertial measurement units (IMUs) that provide the 3D pose of the vehicle in space.

The goal of the perception module is the creation of an intermediate level representation of the environment state (for example a bird’s-eye view map of all obstacles and agents) that is later utilised by a decision making system that ultimately produces the driving policy. This state would include lane position, drivable zone, location of agents such as cars and pedestrians, state of traffic lights and others. Uncertainties in the perception propagate to the rest of the information chain. Robust sensing is critical for safety, thus using redundant sources increases confidence in detection. This is achieved by a combination of several perception tasks like semantic segmentation [3], [4], motion estimation [5], depth estimation [6], soiling detection [7], etc., which can be efficiently unified into a multi-task model [8], [9].

A. Scene Understanding
This key module maps the abstract mid-level representation of the perception state obtained from the perception module to the high level action or decision making module. Conceptually, three tasks are grouped by this module, as seen in Figure 1: scene understanding, decision and planning. The scene understanding module aims to provide a higher level understanding of the scene; it is built on top of the algorithmic tasks of detection or localisation. By fusing heterogeneous sensor sources, it aims to robustly generalise to situations as the content becomes more abstract. This information fusion provides a general and simplified context for the decision making components.

Fusion provides a sensor agnostic representation of the environment and models the sensor noise and detection uncertainties across multiple modalities such as LIDAR, camera, radar and ultrasound. This essentially requires weighting the predictions in a principled way.

B. Localization and Mapping
Mapping is one of the key pillars of automated driving [10]. Once an area is mapped, current position of the vehicle can be localized within the map. The first reliable demonstrations of automated driving by Google were primarily reliant on localisation to pre-mapped areas. Because of the scale of the problem, traditional mapping techniques are augmented by semantic object detection for reliable disambiguation. In addition, localised high definition maps (HD maps) can be used as a prior for object detection.

C. Planning and Driving Policy
Trajectory planning is a crucial module in the autonomous driving pipeline. Given a route-level plan from HD maps or GPS based maps, this module is required to generate motion-level commands that steer the agent.

Classical motion planning ignores dynamics and differential constraints while using translations and rotations required to move an agent from source to destination poses [11]. A robotic agent capable of controlling 6 degrees of freedom (DOF) is said to be holonomic, while an agent with fewer controllable DOFs than its total DOF is said to be non-holonomic. Classical algorithms such as the A∗ algorithm based on Dijkstra’s algorithm do not work in the non-holonomic case for autonomous driving. Rapidly-exploring random trees (RRT) [12] are non-holonomic algorithms that explore the configuration space by random sampling and obstacle free path generation. Various versions of RRT are currently used for motion planning in autonomous driving pipelines.

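To make the sampling-based planning idea concrete, the following is a minimal, illustrative 2D RRT sketch and not a reproduction of any cited implementation; the step size, goal bias, workspace bounds and circular obstacle list are hypothetical parameters chosen for readability.

```python
import math
import random

def rrt_plan(start, goal, obstacles, step=0.5, goal_bias=0.1,
             max_iters=5000, bounds=((0.0, 20.0), (0.0, 20.0))):
    """Minimal 2D RRT: grow a tree by random sampling until the goal is reached.

    obstacles: list of (cx, cy, radius) circles treated as forbidden regions.
    Returns the path from start to goal as a list of (x, y) points, or None.
    """
    def collision_free(p):
        return all(math.hypot(p[0] - cx, p[1] - cy) > r for cx, cy, r in obstacles)

    nodes = [start]
    parent = {0: None}
    for _ in range(max_iters):
        # Sample a random point, occasionally biased towards the goal.
        sample = goal if random.random() < goal_bias else (
            random.uniform(*bounds[0]), random.uniform(*bounds[1]))
        # Find the nearest tree node and extend one step towards the sample.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0.0:
            continue
        new = (near[0] + step * (sample[0] - near[0]) / d,
               near[1] + step * (sample[1] - near[1]) / d)
        if not collision_free(new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) < step:          # goal reached: backtrack the path
            path, i = [goal], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None

path = rrt_plan((1.0, 1.0), (18.0, 18.0), obstacles=[(10.0, 10.0, 3.0)])
print(len(path) if path else "no path found")
```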
D. Control
A controller defines the speed, steering angle and braking actions necessary at every point in the path obtained from a pre-determined map such as Google Maps, or from an expert driving recording of the same values at every waypoint. Trajectory tracking, in contrast, involves a temporal model of the dynamics of the vehicle viewing the waypoints sequentially over time.

Current vehicle control methods are founded in classical optimal control theory, which can be stated as the minimisation of a cost function defined over a set of states x(t) and control actions u(t), subject to the vehicle dynamics ẋ = f(x(t), u(t)). The control input is usually defined over a finite time horizon and restricted to a feasible state space x ∈ X_free [14]. Velocity control is based on classical methods of closed loop control such as PID (proportional-integral-derivative) controllers and MPC (model predictive control). PIDs aim to minimise a cost function consisting of three terms: the current error with the proportional term, the effect of past errors with the integral term, and the effect of future errors with the derivative term. The family of MPC methods aims to stabilize the behavior of the vehicle while tracking the specified path [15]. A review of controllers, motion planning and learning based approaches for the same is provided in [16] for interested readers. Optimal control and reinforcement learning are intimately related, where optimal control can be viewed as a model based reinforcement learning problem in which the dynamics of the vehicle/environment are modeled by well defined differential equations. Reinforcement learning methods were developed to handle stochastic control problems as well as ill-posed problems with unknown rewards and state transition probabilities. Autonomous vehicle stochastic control is a large domain, and we advise readers to read the survey on this subject by the authors of [17].

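As a concrete illustration of the three PID terms described above, here is a minimal longitudinal speed controller sketch; the gains, time step and toy vehicle dynamics are hypothetical values chosen for illustration, not tuned for any particular vehicle.

```python
class PIDController:
    """Minimal PID sketch: u = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measurement):
        error = setpoint - measurement                 # current error (proportional term)
        self.integral += error * self.dt               # accumulated past errors (integral term)
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical usage: track a 15 m/s target speed with a crude first-order vehicle model.
pid = PIDController(kp=0.8, ki=0.1, kd=0.05, dt=0.1)
speed = 0.0
for _ in range(100):
    throttle = pid.step(setpoint=15.0, measurement=speed)
    speed += 0.1 * throttle                            # toy longitudinal dynamics
print(round(speed, 2))
```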
SECTION III. Reinforcement Learning
Machine learning (ML) is a process whereby a computer program learns from experience to improve its performance at a specified task [18]. ML algorithms are often classified under one of three broad categories: supervised learning, unsupervised learning and reinforcement learning (RL). Supervised learning algorithms are based on inductive inference where the model is typically trained using labelled data to perform classification or regression, whereas unsupervised learning encompasses techniques such as density estimation or clustering applied to unlabelled data. By contrast, in the RL paradigm an autonomous agent learns to improve its performance at an assigned task by interacting with its environment. Russell and Norvig define an agent as “anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators” [19]. RL agents are not told explicitly how to act by an expert; rather an agent’s performance is evaluated by a reward function R. For each state experienced, the agent chooses an action and receives an occasional reward from its environment based on the usefulness of its decision. The goal for the agent is to maximize the cumulative rewards received over its lifetime. Gradually, the agent can increase its long-term reward by exploiting knowledge learned about the expected utility (i.e. discounted sum of expected future rewards) of different state-action pairs. One of the main challenges in reinforcement learning is managing the trade-off between exploration and exploitation. To maximize the rewards it receives, an agent must exploit its knowledge by selecting actions which are known to result in high rewards. On the other hand, to discover such beneficial actions, it has to take the risk of trying new actions which may lead to higher rewards than the current best-valued actions for each system state. In other words, the learning agent has to exploit what it already knows in order to obtain rewards, but it also has to explore the unknown in order to make better action selections in the future. Examples of strategies which have been proposed to manage this trade-off include ϵ-greedy and softmax. When adopting the ubiquitous ϵ-greedy strategy, an agent either selects an action at random with probability 0 < ϵ < 1, or greedily selects the highest valued action for the current state with the remaining probability 1 − ϵ. Intuitively, the agent should explore more at the beginning of the training process when little is known about the problem environment. As training progresses, the agent may gradually conduct more exploitation than exploration. The design of exploration strategies for RL agents is an area of active research (see e.g. [20]).

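A minimal sketch of the ϵ-greedy strategy described above, assuming a tabular Q-function stored as a NumPy array indexed by state and action; the linear decay schedule is a hypothetical choice illustrating "explore more early, exploit more later".

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, state, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    n_actions = q_values.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))     # explore
    return int(np.argmax(q_values[state]))      # exploit

# Hypothetical annealing schedule: start exploratory, become greedier over training.
q = np.zeros((10, 4))                           # 10 states, 4 actions
for episode in range(1000):
    epsilon = max(0.05, 1.0 - episode / 500)    # linear decay down to 5% exploration
    action = epsilon_greedy(q, state=0, epsilon=epsilon)
```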
Markov decision processes (MDPs) are considered the de facto standard when formalising sequential decision making problems involving a single RL agent [21]. An MDP consists of a set S of states, a set A of actions, a transition function T and a reward function R [22], i.e. a tuple <S, A, T, R>. When in any state s ∈ S, selecting an action a ∈ A will result in the environment entering a new state s′ ∈ S with a transition probability T(s, a, s′) ∈ (0,1), and give a reward R(s, a). This process is illustrated in Fig. 3. The stochastic policy π: S → D is a mapping from the state space to a probability over the set of actions, and π(a|s) represents the probability of choosing action a at state s. The goal is to find the optimal policy π∗, which results in the highest expected sum of discounted rewards [21]:
$$\pi^{*} = \operatorname*{argmax}_{\pi}\; \mathbb{E}_{\pi}\left\{ \sum_{k=0}^{H-1} \gamma^{k} r_{k+1} \,\middle|\, s_{0}=s \right\} := V^{\pi}(s), \tag{1}$$
for all states s ∈ S, where r_k = R(s_k, a_k) is the reward at time k and V^π(s), the ‘value function’ at state s following a policy π, is the expected ‘return’ (or ‘utility’) when starting at s and following the policy π thereafter [1]. An important, related concept is the action-value function, a.k.a. the ‘Q-function’, defined as:
$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left\{ \sum_{k=0}^{H-1} \gamma^{k} r_{k+1} \,\middle|\, s_{0}=s,\, a_{0}=a \right\}. \tag{2}$$
The discount factor γ ∈ [0,1] controls how an agent regards future rewards. Low values of γ encourage myopic behaviour where an agent will aim to maximise short term rewards, whereas high values of γ cause agents to be more forward-looking and to maximise rewards over a longer time frame. The horizon H refers to the number of time steps in the MDP. In infinite-horizon problems H = ∞, whereas in episodic domains H has a finite value. Episodic domains may terminate after a fixed number of time steps, or when an agent reaches a specified goal state. The last state reached in an episodic domain is referred to as the terminal state. In finite-horizon or goal-oriented domains discount factors of (close to) 1 may be used to encourage agents to focus on achieving the goal, whereas in infinite-horizon domains lower discount factors may be used to strike a balance between short- and long-term rewards. If the optimal policy for a MDP is known, then V^{π∗} may be used to determine the maximum expected discounted sum of rewards available from any arbitrary initial state. A rollout is a trajectory produced in the state space by sequentially applying a policy to an initial state. A MDP satisfies the Markov property, i.e. system state transitions are dependent only on the most recent state and action, not on the full history of states and actions in the decision process. Moreover, in many real-world application domains, it is not possible for an agent to observe all features of the environment state; in such cases the decision-making problem is formulated as a partially-observable Markov decision process (POMDP). Solving a reinforcement learning task means finding a policy π that maximises the expected discounted sum of rewards over trajectories in the state space. RL agents may learn value function estimates, policies and/or environment models directly. Dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment in terms of reward and transition functions. Unlike DP, in Monte Carlo methods there is no assumption of complete environment knowledge. Monte Carlo methods are incremental in an episode-by-episode sense. Upon the completion of an episode, the value estimates and policies are updated. Temporal Difference (TD) methods, on the other hand, are incremental in a step-by-step sense, making them applicable to non-episodic scenarios. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods learn their estimates based on other estimates.

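The following short sketch makes the MDP notation above concrete: it rolls out a policy in a toy tabular MDP and estimates V^π(s) as the Monte Carlo average of discounted returns. The transition and reward tables below are hypothetical and exist only to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP <S, A, T, R>: 3 states, 2 actions, hypothetical dynamics and rewards.
T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],     # T[s, a, s'] transition probabilities
              [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.1], [0.5, 1.0], [0.0, 0.0]])    # R[s, a] immediate rewards
gamma, horizon = 0.95, 50

def rollout_return(policy, s0):
    """One rollout: apply the policy for `horizon` steps and sum discounted rewards."""
    s, g = s0, 0.0
    for k in range(horizon):
        a = policy(s)
        g += (gamma ** k) * R[s, a]
        s = rng.choice(3, p=T[s, a])                  # sample s' ~ T(s, a, .)
    return g

def value_estimate(policy, s0, n_rollouts=2000):
    """Monte Carlo estimate of V^pi(s0) = E[sum_k gamma^k r_{k+1} | s_0 = s0]."""
    return np.mean([rollout_return(policy, s0) for _ in range(n_rollouts)])

print(value_estimate(lambda s: 1, s0=0))              # value of the 'always action 1' policy
```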
Fig. 3. A graphical decomposition of the different components of an RL algorithm. It also demonstrates the different challenges encountered while training a D(RL) algorithm.

A. Value-Based Methods
Q-learning is one of the most commonly used RL algorithms. It is a model-free TD algorithm that learns estimates of the utility of individual state-action pairs (Q-functions defined in Eqn. 2). Q-learning has been shown to converge to the optimum state-action values for a MDP with probability 1, so long as all actions in all states are sampled infinitely often and the state-action values are represented discretely [23]. In practice, Q-learning will learn (near) optimal state-action values provided a sufficient number of samples are obtained for each state-action pair. If a Q-learning agent has converged to the optimal Q values for a MDP and selects actions greedily thereafter, it will receive the same expected sum of discounted rewards as calculated by the value function with π∗ (assuming that the same arbitrary initial starting state is used for both). Agents implementing Q-learning update their Q values according to the following update rule:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[ r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a) \right], \tag{3}$$
where Q(s, a) is an estimate of the utility of selecting action a in state s, α ∈ [0,1] is the learning rate which controls the degree to which Q values are updated at each time step, and γ ∈ [0,1] is the same discount factor used in Eqn. 1. The theoretical guarantees of Q-learning hold with any arbitrary initial Q values [23]; therefore the optimal Q values for a MDP can be learned by starting with any initial action value function estimate. The initialisation can be optimistic (each Q(s, a) returns the maximum possible reward), pessimistic (minimum) or even use knowledge of the problem to ensure faster convergence. Deep Q-Networks (DQN) [24] incorporate a variant of the Q-learning algorithm [25], using deep neural networks (DNNs) as a non-linear Q function approximator over high-dimensional state spaces (e.g. the pixels in a frame of an Atari game). Practically, the neural network predicts the value of all actions without the use of any explicit domain-specific information or hand-designed features. DQN applies the experience replay technique to break the correlation between successive experience samples and also for better sample efficiency. For increased stability, two networks are used, where the parameters of the target network are fixed for a number of iterations while the parameters of the online network are updated. Readers are directed to sub-section III-E for a more detailed introduction to the use of DNNs in Deep RL.

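A minimal tabular Q-learning sketch implementing the update rule of Eqn. 3; it assumes a small environment object `env` exposing a simplified Gym-like interface, which is a placeholder rather than any specific benchmark.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (Eqn. 3). `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done)."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD target uses the greedy action in the next state (off-policy update).
            td_target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```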
B. Policy-Based Methods
The difference between value-based and policy-based methods is essentially a matter of where the burden of optimality resides. Both method types must propose actions and evaluate the resulting behaviour, but while value-based methods focus on evaluating the optimal cumulative reward and have a policy that follows the recommendations, policy-based methods aim to estimate the optimal policy directly, and the value is secondary if it is calculated at all. Typically, the policy is parameterised as a neural network πθ. Policy gradient methods use gradient descent to estimate the parameters of the policy that maximise the expected reward. The result can be a stochastic policy where actions are selected by sampling, or a deterministic policy. Many real-world applications have continuous action spaces. Deterministic policy gradient (DPG) algorithms [1], [26] allow reinforcement learning in domains with continuous actions. Silver et al. [26] proved that a deterministic policy gradient exists for MDPs satisfying certain conditions, and that deterministic policy gradients have a simple model-free form that follows the gradient of the action-value function. As a result, instead of integrating over both state and action spaces as in stochastic policy gradients, DPG integrates over the state space only, leading to fewer samples in problems with large action spaces. To ensure sufficient exploration, actions are chosen using a stochastic policy, while learning a deterministic target policy. The REINFORCE [27] algorithm is a straightforward policy-based method. The discounted cumulative reward $g_t = \sum_{k=0}^{H-1} \gamma^{k} r_{k+t+1}$ at one time step is calculated by playing the entire episode, so no estimator is required for policy evaluation. The parameters are updated in the direction of the performance gradient:
$$\theta \leftarrow \theta + \alpha \gamma^{t} g \, \nabla \log \pi_{\theta}(a|s), \tag{4}$$
where α is the learning rate for a stable incremental update. Intuitively, we want to encourage state-action pairs that result in the best possible returns. Trust Region Policy Optimization (TRPO) [28] works by preventing the updated policies from deviating too much from previous policies, thus reducing the chance of a bad update. TRPO optimises a surrogate objective function where the basic idea is to limit each policy gradient update as measured by the Kullback-Leibler (KL) divergence between the current and the newly proposed policy. This method results in monotonic improvements in policy performance. Proximal Policy Optimization (PPO) [29] instead proposes a clipped surrogate objective function, adding a penalty for too large a policy change. Accordingly, PPO policy optimisation is simpler to implement, and has better sample complexity while ensuring the deviation from the previous policy is relatively small.

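A sketch of the REINFORCE update of Eqn. 4 for a small discrete-action policy network, written here with PyTorch as an assumed dependency; the episode is represented as lists of states, actions and rewards collected by some environment loop that is not shown, and the network sizes are hypothetical.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One policy-gradient step from a single completed episode (Monte Carlo returns)."""
    returns, g = [], 0.0
    for r in reversed(rewards):                      # g_t = sum_k gamma^k r_{t+k+1}
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions)
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)

    # Gradient ascent on E[g * log pi(a|s)]  <=>  descent on the negative objective.
    loss = -(returns * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```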
C. Actor-Critic Methods
Actor-critic methods are hybrid methods that combine the benefits of policy-based and value-based algorithms. The policy structure that is responsible for selecting actions is known as the ‘actor’. The estimated value function criticises the actions made by the actor and is known as the ‘critic’. After each action selection, the critic evaluates the new state to determine whether the result of the selected action was better or worse than expected. Both networks need their gradients to learn. Let J(θ) := E_{πθ}[r] represent a policy objective function, where θ designates the parameters of a DNN. Policy gradient methods search for a local maximum of J(θ). Since optimization in continuous action spaces could be costly and slow, the DPG (deterministic policy gradient) algorithm represents actions as a parameterised function μ(s|θ^μ), where θ^μ refers to the parameters of the actor network. Then the unbiased estimate of the policy gradient step is given as:
$$\nabla_{\theta} J = -\mathbb{E}_{\pi_{\theta}}\left\{ (g - b) \log \pi_{\theta}(a|s) \right\}, \tag{5}$$
where b is the baseline. Using b ≡ 0 is the simplification that leads to the REINFORCE formulation. Williams [27] explains that a well chosen baseline can reduce variance, leading to more stable learning. The baseline b can be chosen as V^π(s), Q^π(s,a) or the ‘advantage’ A^π(s,a), leading to the corresponding families of methods. Deep Deterministic Policy Gradient (DDPG) [30] is a model-free, off-policy (please refer to subsection III-D for a detailed distinction), actor-critic algorithm that can learn policies for continuous action spaces using deep neural net based function approximation, extending prior work on DPG to large and high-dimensional state-action spaces. When selecting actions, exploration is performed by adding noise to the actor policy. Like DQN, to stabilise learning a replay buffer is used to minimize data correlation. A separate actor-critic specific target network is also used. Normal Q-learning operates with a restricted number of discrete actions, and DDPG also needs a straightforward way to choose an action. Starting from Q-learning, we extend Eqn. 2 to define the optimal Q-value and optimal action as Q∗ and a∗:
$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a), \qquad a^{*} = \operatorname*{argmax}_{a} Q^{*}(s,a). \tag{6}$$

In the case of Q-learning, the action is chosen according to the Q-function as in Eqn. 6. DDPG instead chains the evaluation of Q after the action has already been chosen according to the policy. By correcting the Q-values towards the optimal values using the chosen action, we also update the policy towards the optimal action proposition. Thus two separate networks work at estimating Q∗ and π∗.

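To illustrate how DDPG "chains" the critic evaluation onto the action proposed by the actor, here is a sketch of the critic and actor updates, assuming PyTorch networks `actor`, `critic` (taking state and action) and their slowly updated target copies `actor_targ`, `critic_targ` already exist; all names and hyperparameters here are placeholders, not the reference implementation of [30].

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG step on a sampled batch (s, a, r, s2, done) of tensors."""
    s, a, r, s2, done = batch

    # Critic target: y = r + gamma * Q'(s', mu'(s')) using the *target* networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_targ(s2, actor_targ(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: push the policy towards actions the critic scores highly.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak averaging keeps the target networks slowly tracking the online ones.
    for targ, online in ((actor_targ, actor), (critic_targ, critic)):
        for p_t, p in zip(targ.parameters(), online.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```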
Asynchronous Advantage Actor Critic (A3C) [31] uses asynchronous gradient descent for optimization of deep neural network controllers. Deep reinforcement learning algorithms based on experience replay such as DQN and DDPG have demonstrated considerable success in difficult domains such as playing Atari games. However, experience replay uses a large amount of memory to store experience samples and requires off-policy learning algorithms. In A3C, instead of using an experience replay buffer, agents asynchronously execute on multiple parallel instances of the environment. In addition to reducing the correlation of the experiences, the parallel actor-learners have a stabilizing effect on the training process. This simple setup enables a much larger spectrum of on-policy as well as off-policy reinforcement learning algorithms to be applied robustly using deep neural networks. By combining several ideas, A3C exceeded the performance of the previous state-of-the-art at the time on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. It also demonstrates how using an estimate of the value function as the previously explained baseline b reduces variance and improves convergence time. By defining the advantage as A^π(a,s) = Q^π(s,a) − V^π(s), the expression of the policy gradient from Eqn. 5 is rewritten as $\nabla_{\theta} L = -\mathbb{E}_{\pi_{\theta}}\{A^{\pi}(a,s)\log \pi_{\theta}(a|s)\}$. The critic is trained to minimize $\frac{1}{2}\|A^{\pi_{\theta}}(a,s)\|^{2}$. The intuition of using advantage estimates rather than just discounted returns is to allow the agent to determine not just how good its actions were, but also how much better they turned out to be than expected, leading to reduced variance and more stable training. The A3C model also demonstrated good performance in 3D environments such as labyrinth exploration. Advantage Actor Critic (A2C) is a synchronous version of the asynchronous advantage actor critic model, which waits for each agent to finish its experience before conducting an update. The performance of A2C and A3C is comparable. Most greedy policies must alternate between exploration and exploitation, and good exploration visits the states where the value estimate is uncertain. This way, exploration focuses on trying to find the most uncertain state paths as they bring valuable information. In addition to the advantage, explained earlier, some methods use the entropy as the uncertainty quantity. Most A3C implementations include this as well. Two methods with common authors are energy-based policies [32] and, more recently and in widespread use, the Soft Actor Critic (SAC) algorithm [33]; both rely on adding an entropy term to the reward function, so the policy objective is updated from Eqn. 1 to Eqn. 7. We refer readers to [33] for an in depth explanation of the expression
$$\pi^{*}_{\text{MaxEnt}} = \operatorname*{argmax}_{\pi} \mathbb{E}_{\pi}\left\{ \sum_{t}\left[ r(s_t,a_t) + \alpha H(\pi(\cdot|s_t)) \right] \right\}, \tag{7}$$
shown here for illustration of how the entropy H is added.

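The advantage-weighted policy-gradient loss and the entropy bonus described above can be combined in a few lines. The following synchronous (A2C-style) sketch assumes hypothetical PyTorch networks `policy_net` (action logits) and `value_net` (state value); the entropy coefficient is an illustrative hyperparameter, not a value taken from [31] or [33].

```python
import torch

def a2c_loss(policy_net, value_net, states, actions, returns, entropy_coef=0.01):
    """Actor-critic loss: advantage-weighted log-prob term, value regression,
    and an entropy bonus that discourages premature convergence to a greedy policy."""
    dist = torch.distributions.Categorical(logits=policy_net(states))
    values = value_net(states).squeeze(-1)

    advantage = returns - values                      # A(s, a) ~ g_t - V(s_t)
    actor_loss = -(advantage.detach() * dist.log_prob(actions)).mean()
    critic_loss = 0.5 * advantage.pow(2).mean()       # critic minimises 1/2 ||A||^2
    entropy_bonus = dist.entropy().mean()

    return actor_loss + critic_loss - entropy_coef * entropy_bonus
```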
D. Model-Based (vs. Model-Free) & On/Off Policy Methods
In practical situations, interacting with the real environment could be limited due to many reasons including safety and cost. Learning a model of the environment dynamics may reduce the amount of interaction required with the real environment. Moreover, exploration can be performed on the learned models. In the case of model-based approaches (e.g. Dyna-Q [34], R-max [35]), agents attempt to learn the transition function T and reward function R, which can be used when making action selections. Keeping a model approximation of the environment means storing knowledge of its dynamics, and allows for fewer of the sometimes costly environment interactions. By contrast, in model-free approaches such knowledge is not a requirement. Instead, model-free learners sample the underlying MDP directly in order to gain knowledge about the unknown model, in the form of value function estimates for example. In Dyna-2 [36], the learning agent stores long-term and short-term memories, where a memory is defined as the set of features and corresponding parameters used by an agent to estimate the value function. Long-term memory is for general domain knowledge which is updated from real experience, while short-term memory is for specific local knowledge about the current situation, and the value function is a linear combination of long and short term memories.

Learning algorithms can be on-policy or off-policy depending on whether the updates are conducted on fresh trajectories generated by the policy itself or by another policy, which could be generated by an older version of the policy or provided by an expert. On-policy methods such as SARSA [37] estimate the value of a policy while using the same policy for control. However, off-policy methods such as Q-learning [25] use two policies: the behavior policy, the policy used to generate behavior; and the target policy, the one being improved on. An advantage of this separation is that the target policy may be deterministic (greedy), while the behavior policy can continue to sample all possible actions [1].

E. Deep Reinforcement Learning (DRL)
Tabular representations are the simplest way to store learned estimates (of e.g. values, policies or models), where each state-action pair has a discrete estimate associated with it. When estimates are represented discretely, each additional feature tracked in the state leads to an exponential growth in the number of state-action pair values that must be stored [38]. This problem is commonly referred to in the literature as the “curse of dimensionality”, a term originally coined by Bellman [39]. In simple environments this is rarely an issue, but it may lead to an intractable problem in real-world applications, due to memory and/or computational constraints. Learning over a large state-action space is possible, but may take an unacceptably long time to learn useful policies. Many real-world domains feature continuous state and/or action spaces; these can be discretised in many cases. However, large discretisation steps may limit the achievable performance in a domain, whereas small discretisation steps may result in a large state-action space where obtaining a sufficient number of samples for each state-action pair is impractical. Alternatively, function approximation may be used to generalise across states and/or actions, whereby a function approximator is used to store and retrieve estimates. Function approximation is an active area of research in RL, offering a way to handle continuous state and/or action spaces, mitigate against the state-action space explosion and generalise prior experience to previously unseen state-action pairs. Tile coding is one of the simplest forms of function approximation, where one tile represents multiple states or state-action pairs [38]. Neural networks are also commonly used to implement function approximation, one of the most famous examples being Tesauro’s application of RL to backgammon [40]. Recent work has applied deep neural networks as a function approximation method; this emerging paradigm is known as deep reinforcement learning (DRL). DRL algorithms have achieved human level performance (or above) on complex tasks such as playing Atari games [24] and playing the board game Go [41].

In DQN [24] it is demonstrated how a convolutional neural network can learn successful control policies from just raw video data for different Atari environments. The network was trained end-to-end and was not provided with any game specific information. The input to the convolutional neural network consists of an 84×84×4 tensor of 4 consecutive stacked frames used to capture the temporal information. Through consecutive layers, the network learns how to combine features in order to identify the action most likely to bring the best outcome. One layer consists of several convolutional filters. For instance, the first layer uses 32 filters with 8×8 kernels with stride 4 and applies a rectifier non-linearity. The second layer is 64 filters of 4×4 with stride 2, followed by a rectifier non-linearity. Next comes a third convolutional layer of 64 filters of 3×3 with stride 1 followed by a rectifier. The last intermediate layer is composed of 512 fully connected rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. For DQN training stability, two networks are used, where the parameters of the target network are fixed for a number of iterations while the online network parameters are updated. For practical reasons, the Q(s,a) function is modeled as a deep neural network that predicts the value of all actions given the input state. Accordingly, deciding what action to take requires performing a single forward pass of the network. Moreover, in order to increase sample efficiency, experiences of the agent are stored in a replay memory (experience replay), where the Q-learning updates are conducted on randomly selected samples from the replay memory. This random selection breaks the correlation between successive samples. Experience replay enables reinforcement learning agents to remember and reuse experiences from the past, where observed transitions are stored for some time, usually in a queue, and sampled uniformly from this memory to update the network. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. An alternative method is to use two separate experience buckets, one for positive and one for negative rewards [42]. Then a fixed fraction from each bucket is selected to replay. This method is only applicable in domains that have a natural notion of binary experience. Experience replay has also been extended with a framework for prioritising experience [43], where important transitions, based on the TD error, are replayed more frequently, leading to improved performance and faster training when compared to the standard experience replay approach.

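A sketch of the uniform experience replay buffer and periodic target-network synchronisation used by DQN-style training, written against an assumed PyTorch `online_net`/`target_net` pair; the capacity and batch size are hypothetical values.

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample uncorrelated minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(s2, dtype=torch.float32),
                torch.as_tensor(d, dtype=torch.float32))

def sync_target(online_net, target_net):
    """Copy the online parameters into the frozen target network every N steps."""
    target_net.load_state_dict(online_net.state_dict())
```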
The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action, resulting in over-optimistic value estimates. Double DQN (D-DQN) [44] tackles the overestimation problem in DQN: the greedy policy is evaluated according to the online network while the target network is used to estimate its value. It was shown that this algorithm not only yields more accurate value estimates, but leads to much higher scores on several games.

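The difference between the DQN and Double DQN targets lies only in which network selects the next action; the sketch below assumes hypothetical PyTorch networks `online_net` and `target_net` mapping state batches to action values.

```python
import torch

def dqn_target(r, s2, done, target_net, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the next action.
    return r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values

def double_dqn_target(r, s2, done, online_net, target_net, gamma=0.99):
    # Double DQN: the online network selects the greedy action, the target network
    # evaluates it, reducing over-optimistic value estimates.
    a_star = online_net(s2).argmax(dim=1, keepdim=True)
    q_eval = target_net(s2).gather(1, a_star).squeeze(1)
    return r + gamma * (1.0 - done) * q_eval
```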
In the dueling network architecture [45] the state value function and the associated advantage function are estimated separately and then combined to estimate the action value function. The advantage of the dueling architecture lies partly in its ability to learn the state-value function efficiently. In a single-stream architecture only the value for one of the actions is updated. In the dueling architecture, however, the value stream is updated with every update, allowing for a better approximation of the state values, which in turn need to be accurate for temporal difference methods like Q-learning.

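A sketch of a dueling head in PyTorch: the state value V(s) and the advantages A(s, a) are estimated by separate streams and recombined with the usual mean-subtraction so that the decomposition is identifiable; the layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Combine a scalar state-value stream and an advantage stream into Q-values."""

    def __init__(self, feature_dim, n_actions):
        super().__init__()
        self.value_stream = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                          nn.Linear(128, 1))
        self.advantage_stream = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                              nn.Linear(128, n_actions))

    def forward(self, features):
        v = self.value_stream(features)                  # V(s), shape [B, 1]
        a = self.advantage_stream(features)              # A(s, a), shape [B, n_actions]
        # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)); every update touches the value stream.
        return v + a - a.mean(dim=1, keepdim=True)
```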
DRQN [46] applies a modification to the DQN by combining a Long Short Term Memory (LSTM) with a Deep Q-Network. Accordingly, the DRQN is capable of integrating information across frames to detect information such as the velocity of objects. DRQN was shown to generalize its policies to the case of complete observations: when trained on Atari games and evaluated against flickering games, DRQN generalizes better than DQN.

SECTION IV. Extensions to Reinforcement Learning
This section introduces and discusses some of the main extensions to the basic single-agent RL paradigms which have been introduced over the years. As well as broadening the applicability of RL algorithms, many of the extensions discussed here have been demonstrated to improve scalability, learning speed and/or converged performance in complex problem domains.

A. Reward Shaping
As noted in Section III, the design of the reward function is crucial: RL agents seek to maximise the return from the reward function, therefore the optimal policy for a domain is defined with respect to the reward function. In many real-world application domains, learning may be difficult due to sparse and/or delayed rewards. RL agents typically learn how to act in their environment guided merely by the reward signal. Additional knowledge can be provided to a learner by the addition of a shaping reward to the reward naturally received from the environment, with the goal of improving learning speed and converged performance. This principle is referred to as reward shaping. The term shaping has its origins in the field of experimental psychology, and describes the idea of rewarding all behaviour that leads to the desired behaviour. Skinner [47] discovered while training a rat to push a lever that any movement in the direction of the lever had to be rewarded to encourage the rat to complete the task. Analogously to the rat, an RL agent may take an unacceptably long time to discover its goal when learning from delayed rewards, and shaping offers an opportunity to speed up the learning process. Reward shaping allows a reward function to be engineered in a way that provides more frequent feedback on appropriate behaviours [48], which is especially useful in domains with sparse rewards. Generally, the return from the reward function is modified as follows: r′ = r + f, where r is the return from the original reward function R, f is the additional reward from a shaping function F, and r′ is the signal given to the agent by the augmented reward function R′. Empirical evidence has shown that reward shaping can be a powerful tool to improve the learning speed of RL agents [49]. However, it can have unintended consequences. The implication of adding a shaping reward is that a policy which is optimal for the augmented reward function R′ may not in fact also be optimal for the original reward function R. A classic example of reward shaping gone wrong for this exact reason is reported in [49], where the bicycle agent under experiment would ride in circles to stay upright rather than reach its goal. Difference rewards (D) [50] and potential-based reward shaping (PBRS) [51] are two commonly used shaping approaches. Both D and PBRS have been successfully applied to a wide range of application domains and have the added benefit of convenient theoretical guarantees, meaning that they do not suffer from the same issues as the unprincipled reward shaping approaches described above (see e.g. [51]–[55]).

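A minimal potential-based reward shaping sketch, using a hypothetical potential function (here, negative distance to a goal state in a 1D toy task): the shaping term F(s, s′) = γΦ(s′) − Φ(s) is added to the environment reward, the form introduced in [51].

```python
def potential(state, goal):
    """Hypothetical potential: closer to the goal means higher potential."""
    return -abs(goal - state)

def shaped_reward(r, state, next_state, goal, gamma=0.99):
    """Augmented reward r' = r + F(s, s') with F(s, s') = gamma * phi(s') - phi(s)."""
    f = gamma * potential(next_state, goal) - potential(state, goal)
    return r + f

# Example: moving from state 3 to state 4 towards goal 10 yields a positive shaping bonus.
print(shaped_reward(r=0.0, state=3, next_state=4, goal=10))
```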
B. Multi-Agent Reinforcement Learning (MARL)
In multi-agent reinforcement learning, multiple RL agents are deployed into a common environment. The single-agent MDP framework becomes inadequate when multiple autonomous agents act simultaneously in the same domain. Instead, the more general stochastic game (SG) may be used in the case of a Multi-Agent System (MAS) [56]. A SG is defined as a tuple <S, A_{1…N}, T, R_{1…N}>, where N is the number of agents, S is the set of system states, A_i is the set of actions for agent i (and A is the joint action set), T is the transition function, and R_i is the reward function for agent i. The SG looks very similar to the MDP framework, apart from the addition of multiple agents. In fact, for the case of N = 1 a SG then becomes a MDP. The next system state and the rewards received by each agent depend on the joint action a of all of the agents in a SG, where a is derived from the combination of the individual actions a_i for each agent in the system. Each agent may have its own local state perception s_i, which is different to the system state s (i.e. individual agents are not assumed to have full observability of the system). Note also that each agent may receive a different reward for the same system state transition, as each agent has its own separate reward function R_i. In a SG, the agents may all have the same goal (collaborative SG), totally opposing goals (competitive SG), or there may be elements of collaboration and competition between agents (mixed SG). Whether RL agents in a MAS will learn to act together or at cross-purposes depends on the reward scheme used for a specific application.

C. Multi-Objective Reinforcement Learning
In multi-objective reinforcement learning (MORL) the reward signal is a vector, where each component represents the performance on a different objective. The MORL framework was developed to handle sequential decision making problems where tradeoffs between conflicting objective functions must be considered. Examples of real-world problems with multiple objectives include selecting energy sources (tradeoffs between fuel cost and emissions) [57] and watershed management (tradeoffs between generating electricity, preserving reservoir levels and supplying drinking water) [58]. Solutions to MORL problems are often evaluated using the concept of Pareto dominance [59], and MORL algorithms typically seek to learn or approximate the set of non-dominated solutions. MORL problems may be defined using the MDP or SG framework as appropriate, in a similar manner to single-objective problems. The main difference lies in the definition of the reward function: instead of returning a single scalar value r, the reward function R in multi-objective domains returns a vector r consisting of the rewards for each individual objective c ∈ C. Therefore, a regular MDP or SG can be extended to a Multi-Objective MDP (MOMDP) or Multi-Objective SG (MOSG) by modifying the return of the reward function. For a more complete overview of MORL beyond the brief summary presented in this section, the interested reader is referred to recent surveys [60], [61].

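For concreteness, the Pareto-dominance test used to compare vector-valued returns can be written in a few lines; the sketch below assumes all objectives are to be maximised, and the candidate return vectors are hypothetical.

```python
import numpy as np

def dominates(v, w):
    """True if return vector v Pareto-dominates w: no worse on every objective,
    strictly better on at least one."""
    v, w = np.asarray(v), np.asarray(w)
    return bool(np.all(v >= w) and np.any(v > w))

def pareto_front(returns):
    """Keep only the non-dominated return vectors from a list of candidate policies."""
    return [v for v in returns if not any(dominates(w, v) for w in returns if w is not v)]

candidates = [(1.0, 5.0), (2.0, 4.0), (1.5, 4.5), (1.0, 4.0)]   # (objective 1, objective 2)
print(pareto_front(candidates))   # (1.0, 4.0) is dominated; the other three are kept
```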
D. State Representation Learning (SRL)
State Representation Learning refers to feature extraction and dimensionality reduction to represent the state space with its history, conditioned by the actions and environment of the agent. A complete review of SRL for control is given in [62]. In its simplest form, SRL maps a high dimensional vector o_t into a small dimensional latent space s_t. The inverse operation decodes the state back into an estimate of the original observation ô_t. The agent then learns to map from the latent space to the action. Training the SRL chain is unsupervised in the sense that no labels are required. Reducing the dimension of the input effectively simplifies the task as it removes noise and decreases the domain’s size, as shown in [63]. SRL could be a simple auto-encoder (AE), though various methods exist for observation reconstruction such as Variational Auto-Encoders (VAE) or Generative Adversarial Networks (GANs), as well as forward models for predicting the next state or inverse models for predicting the action given a transition. A good learned state representation should be Markovian; i.e. it should encode all necessary information to be able to select an action based on the current state only, and not any previous states or actions [62], [64].
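
As a minimal instance of the mapping from a high dimensional observation o_t to a compact latent state s_t, the sketch below uses a linear projection (PCA via SVD) as a stand-in for the auto-encoder; a VAE or GAN-based model as discussed above would replace the linear encoder in practice, and the toy observations are an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observations" o_t: 64-D sensor vectors that actually vary along 4 directions.
true_latent = rng.normal(size=(1000, 4))
mixing = rng.normal(size=(4, 64))
observations = true_latent @ mixing + 0.01 * rng.normal(size=(1000, 64))

# Linear state representation: project onto the top-k principal directions.
latent_dim = 4
mean = observations.mean(axis=0)
_, _, Vt = np.linalg.svd(observations - mean, full_matrices=False)
encoder = Vt[:latent_dim].T                       # maps o_t (64-D) -> s_t (4-D)

states = (observations - mean) @ encoder          # s_t = encode(o_t)
reconstruction = states @ encoder.T + mean        # decode back to an estimate of o_t
print("reconstruction MSE:", float(np.mean((reconstruction - observations) ** 2)))
```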

E. Learning From Demonstrations
Learning from Demonstrations (LfD) is used by humans to acquire new skills in an expert-to-learner knowledge transmission process. LfD is important for initial exploration where reward signals are too sparse or the input domain is too large to cover. In LfD, an agent learns to perform a task from demonstrations, usually in the form of state-action pairs, provided by an expert without any feedback rewards. However, high quality and diverse demonstrations are hard to collect, leading to learning sub-optimal policies. Accordingly, learning merely from demonstrations can be used to initialize the learning agent with a good or safe policy, after which reinforcement learning can be conducted to discover a better policy by interacting with the environment.

Combining demonstrations and reinforcement learning has been conducted in recent research. AlphaGo [41], which combines tree search with deep neural networks, initializes its policy network by supervised learning on state-action pairs provided by recorded games played by human experts. Additionally, a value network is trained to tell how desirable a board state is. By conducting self-play and reinforcement learning, AlphaGo is able to discover new, stronger actions and learn from its mistakes, achieving superhuman performance. More recently, AlphaZero [65], developed by the same team, proposed a general framework for self-play models. AlphaZero is trained entirely using reinforcement learning and self-play, starting from completely random play, and requires no prior knowledge of human players. AlphaZero taught itself from scratch how to master the games of chess, shogi, and Go, beating a world-champion program in each case. In [66] it is shown that given the initial demonstration, no explicit exploration is necessary, and near-optimal performance can be attained. Measuring the divergence between the current policy and the expert policy for optimization is proposed in [67]. DQfD [68] pre-trains the agent and uses expert demonstrations by adding them into the replay buffer with additional priority. Moreover, a training framework that combines learning from both demonstrations and reinforcement learning is proposed in [69] for fast learning agents. Two policies close to maximizing the reward function can still have large differences in behaviour. To avoid degenerating to a solution which would fit the reward but not the original behaviour, the authors of [70] proposed a method for enforcing that the optimal policy learnt over the rewards should still match the observed policy in behaviour.

Behavior Cloning (BC) is applied as supervised learning that maps states to actions based on demonstrations provided by an expert. On the other hand, Inverse Reinforcement Learning (IRL) is about inferring the reward function that justifies the expert's demonstrations. IRL is the problem of extracting a reward function given observed, optimal behavior [71]. A key motivation is that the reward function provides a succinct and robust definition of a task. Generally, IRL algorithms can be expensive to run, requiring reinforcement learning in an inner loop between cost estimation and policy training and evaluation. Generative Adversarial Imitation Learning (GAIL) [72] introduces a way to avoid this expensive inner loop. In practice, GAIL trains a policy close enough to the expert policy to fool a discriminator. This process is similar to GANs [73], [74]. The resulting policy must visit the same MDP states as the expert, or the discriminator would pick up the differences. The theory behind GAIL is an equation simplification: qualitatively, if IRL goes from demonstrations to a cost function and RL from a cost function to a policy, then we should altogether be able to go from demonstration to policy in a single equation while avoiding the cost function estimation.
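
The following sketch illustrates Behavior Cloning in its simplest form: a least-squares fit of a linear policy to hypothetical expert state-action pairs. The state features and the proportional expert controller are assumptions made purely for illustration; in practice the policy would be a deep network trained on camera or LiDAR inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical expert demonstrations: state = (lateral offset, heading error, speed),
# action = steering command produced by a simple proportional expert controller.
states = rng.uniform(-1.0, 1.0, size=(500, 3))
expert_actions = -0.8 * states[:, 0] - 0.5 * states[:, 1] + 0.05 * rng.normal(size=500)

# Behavior cloning as supervised regression: fit policy parameters by least squares.
features = np.hstack([states, np.ones((len(states), 1))])   # add a bias term
weights, *_ = np.linalg.lstsq(features, expert_actions, rcond=None)

def bc_policy(state):
    """Cloned policy: maps a state to a steering command."""
    return float(np.append(state, 1.0) @ weights)

print("steering for a small left offset:", bc_policy(np.array([0.2, 0.0, 0.5])))
```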

SECTION V. Reinforcement Learning for Autonomous Driving Tasks
Autonomous driving tasks where RL could be applied include: controller optimization; path planning and trajectory optimization; motion planning and dynamic path planning; development of high-level driving policies for complex navigation tasks; scenario-based policy learning for highways, intersections, merges and splits; reward learning with inverse reinforcement learning from expert data for intent prediction of traffic actors such as pedestrians and vehicles; and finally learning of policies that ensure safety and perform risk estimation. Before discussing the applications of DRL to AD tasks we briefly review the state space, action space and reward schemes in the autonomous driving setting.

A. State Spaces, Action Spaces and Rewards
To successfully apply DRL to autonomous driving tasks, designing appropriate state spaces, action spaces, and reward functions is important. Leurent et al. [75] provided a comprehensive review of the different state and action representations which are used in autonomous driving research. Commonly used state space features for an autonomous vehicle include: position, heading and velocity of ego-vehicle, as well as other obstacles in the sensor view extent of the ego-vehicle. To avoid variations in the dimension of the state space, a Cartesian or Polar occupancy grid around the ego vehicle is frequently employed. This is further augmented with lane information such as lane number (ego-lane or others), path curvature, past and future trajectory of the ego-vehicle, longitudinal information such as Time-to-collision (TTC), and finally scene information such as traffic laws and signal locations.
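
A common way to obtain the fixed-dimension state mentioned above is to rasterise nearby obstacles into an ego-centric occupancy grid; the sketch below shows the basic construction (the grid size and cell resolution are illustrative choices, not values from the cited works).

```python
import numpy as np

def ego_occupancy_grid(ego_xy, ego_heading, obstacles_xy, size=20, cell=1.0):
    """Rasterise surrounding obstacles into a fixed-size ego-centric grid.

    The fixed (size x size) tensor keeps the state dimension constant no matter
    how many obstacles are currently visible.
    """
    grid = np.zeros((size, size), dtype=np.float32)
    cos_h, sin_h = np.cos(-ego_heading), np.sin(-ego_heading)
    for ox, oy in obstacles_xy:
        dx, dy = ox - ego_xy[0], oy - ego_xy[1]
        # Rotate into the ego frame so the grid is aligned with the heading.
        fx, fy = cos_h * dx - sin_h * dy, sin_h * dx + cos_h * dy
        col = int(fx / cell) + size // 2
        row = int(fy / cell) + size // 2
        if 0 <= row < size and 0 <= col < size:
            grid[row, col] = 1.0
    return grid

grid = ego_occupancy_grid((0.0, 0.0), np.pi / 2, [(2.0, 5.0), (-3.0, 1.0)])
print(grid.sum(), "cells occupied")
```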

Using raw sensor data such as camera images, LiDAR, radar, etc. provides the benefit of finer contextual information, while using condensed abstracted data reduces the complexity of the state space. In between, a mid-level representation such as a 2D bird eye view (BEV) is sensor agnostic but still close to the spatial organization of the scene. Fig. 4 is an illustration of a top-down view showing an occupancy grid, past and projected trajectories, and semantic information about the scene such as the position of traffic lights. This intermediary format retains the spatial layout of roads, whereas graph-based representations would not. Some simulators offer this view, such as Carla or Flow (see Table II).

TABLE I List of AD Tasks That Require D(RL) to Learn a Policy or Behavior

TABLE II Simulators for RL Applications in Advanced Driving Assistance Systems (ADAS) and Autonomous Driving

Fig. 4. Bird Eye View (BEV) 2D representation of a driving scene. Left demonstrates an occupancy grid. Right shows the combination of semantic information (traffic lights) with past (red) and projected (green) trajectories. The ego car is represented by a green rectangle in both images.

A vehicle policy must control a number of different actuators. Continuous-valued actuators for vehicle control include steering angle, throttle and brake. Other actuators such as gear changes are discrete. To reduce complexity and allow the application of DRL algorithms which work with discrete action spaces only (e.g. DQN), an action space may be discretised uniformly by dividing the range of continuous actuators such as steering angle, throttle and brake into equal-sized bins (see Section VI-C). Discretisation in log-space has also been suggested, as many steering angles which are selected in practice are close to the centre [76]. Discretisation does have disadvantages however; it can lead to jerky or unstable trajectories if the step values between actions are too large. Furthermore, when selecting the number of bins for an actuator there is a trade-off between having enough discrete steps to allow for smooth control, and not having so many steps that action selections become prohibitively expensive to evaluate. As an alternative to discretisation, continuous values for actuators may also be handled by DRL algorithms which learn a policy directly (e.g. DDPG). Temporal abstractions (the options framework [77]) may also be employed to simplify the process of selecting actions, where agents select options instead of low-level actions. These options represent a sub-policy that could extend a primitive action over multiple time steps.
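
The sketch below illustrates uniform discretisation of the continuous actuators into bins, together with a log-spaced steering alternative in the spirit of [76]; the ranges and bin counts are illustrative assumptions, and the example directly exposes the trade-off between resolution and the size of the discrete action set.

```python
import numpy as np

# Uniform bins over the continuous actuator ranges (illustrative ranges/bin counts).
steering_bins = np.linspace(-0.5, 0.5, num=9)   # radians
throttle_bins = np.linspace(0.0, 1.0, num=5)
brake_bins = np.linspace(0.0, 1.0, num=5)

# The discrete action set for a DQN-style agent is the Cartesian product of the bins.
discrete_actions = [(s, t, b) for s in steering_bins
                    for t in throttle_bins
                    for b in brake_bins]
print("number of discrete actions:", len(discrete_actions))  # 9 * 5 * 5 = 225

def to_continuous(action_index):
    """Map a discrete action index back to actuator commands."""
    return discrete_actions[action_index]

# Log-spaced steering alternative: finer resolution near the centre, as suggested in [76].
log_steering = np.concatenate([-np.geomspace(0.5, 0.01, num=4), [0.0],
                               np.geomspace(0.01, 0.5, num=4)])
```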

Designing reward functions for DRL agents for autonomous driving is still very much an open question. Examples of criteria for AD tasks include: distance travelled towards a destination [78], speed of the ego vehicle [78]–[80], keeping the ego vehicle at a standstill [81], collisions with other road users or scene objects [78], [79], infractions on sidewalks [78], keeping in lane, maintaining comfort and stability while avoiding extreme acceleration, braking or steering [80], [81], and following traffic rules [79].
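
As an illustration only, a shaped reward combining several of the criteria listed above might be written as a weighted sum; the terms and weights below are assumptions made for the sketch, not values taken from the cited works.

```python
def driving_reward(speed, distance_gain, collided, off_lane, jerk,
                   w_speed=0.1, w_progress=1.0, w_collision=100.0,
                   w_lane=5.0, w_comfort=0.5):
    """Illustrative weighted sum over common AD reward criteria (weights are assumptions)."""
    reward = w_progress * distance_gain + w_speed * speed
    reward -= w_collision * float(collided)   # collisions with road users or objects
    reward -= w_lane * float(off_lane)        # lane keeping / sidewalk infractions
    reward -= w_comfort * abs(jerk)           # comfort and stability
    return reward

# One simulation step: some progress at moderate speed, no infractions.
print(driving_reward(speed=8.0, distance_gain=0.4, collided=False,
                     off_lane=False, jerk=0.3))
```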

B. Motion Planning & Trajectory Optimization
Motion planning is the task of ensuring the existence of a path between target and destination points. This is necessary to plan trajectories for vehicles over prior maps, usually augmented with semantic information. Path planning in dynamic environments and under varying vehicle dynamics is a key problem in autonomous driving, for example negotiating the right to pass through an intersection [87] or merging into highways. Recent work in [89] contains real-world motions by various traffic actors, observed in diverse interactive driving scenarios. Recently, an application of DRL (DDPG) for AD was demonstrated using a full-sized autonomous vehicle [90]. The system was first trained in simulation, before being trained in real time using on-board computers, and was able to learn to follow a lane, successfully completing a real-world trial on a 250 metre section of road. Model-based deep RL algorithms have been proposed for learning models and policies directly from raw pixel inputs [91], [92]. In [93], deep neural networks have been used to generate predictions in simulated environments over hundreds of time steps. RL is also suitable for control. Classical optimal control methods such as the Linear Quadratic Regulator (LQR), used in linear regimes, and iterative LQR (iLQR), used in non-linear regimes, perform optimal control in stochastic settings and are compared with RL methods in [94]. A recent study in [95] demonstrates that random search over the parameters of a policy network can perform as well as LQR.
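
To illustrate the classical baseline referred to above, the following sketch runs a finite-horizon LQR backward Riccati recursion on a toy double-integrator model of lateral control; the dynamics and cost weights are illustrative assumptions, not a model used in the cited comparisons.

```python
import numpy as np

# Double-integrator lateral dynamics (illustrative): state = [lateral error, lateral rate].
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.diag([1.0, 0.1])     # penalise lateral error and lateral rate
R = np.array([[0.1]])       # penalise steering effort

def finite_horizon_lqr(A, B, Q, R, horizon=50):
    """Backward Riccati recursion returning time-varying feedback gains K_t."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

gains = finite_horizon_lqr(A, B, Q, R)
x = np.array([[1.0], [0.0]])             # start 1 m off the lane centre
for K in gains:
    u = -K @ x                           # optimal control at this step
    x = A @ x + B @ u
print("final lateral error:", float(x[0, 0]))
```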

C. Simulator & Scenario Generation Tools
Autonomous driving datasets address the supervised learning setup with training sets containing image-label pairs for various modalities. Reinforcement learning requires an environment where state-action pairs can be recovered while modelling the dynamics of the vehicle state and the environment, as well as the stochasticity in the movement and actions of the environment and agent respectively. Various simulators are actively used for training and validating reinforcement learning algorithms. Table II summarises various high fidelity perception simulators capable of simulating cameras, LiDARs and radar. Some simulators are also capable of providing the vehicle state and dynamics. A complete review of sensors and simulators utilised within the autonomous driving community is available in [105] for readers. Learned driving policies are stress tested in simulated environments before moving on to costly evaluations in the real world. A multi-fidelity reinforcement learning (MFRL) framework is proposed in [106] for settings where multiple simulators are available. In MFRL, a cascade of simulators with increasing fidelity in representing state dynamics (and thus increasing computational cost) enables the training and validation of RL algorithms, while finding near-optimal policies for the real world with fewer expensive real-world samples, using a remote controlled car. The CARLA Challenge [107] is a Carla simulator based AD competition with pre-crash scenarios characterized in a National Highway Traffic Safety Administration report [108]. The systems are evaluated in critical scenarios such as: the ego-vehicle loses control, the ego-vehicle reacts to an unseen obstacle, and lane change to evade a slow leading vehicle, among others. The scores of agents are evaluated as a function of the aggregated distance travelled in different circuits, with total points discounted due to infractions.

D. LfD and IRL for AD Applications
Early work on Behavior Cloning (BC) for driving cars in [109], [110] presented agents that learn from demonstrations (LfD) and try to mimic the behavior of an expert. BC is typically implemented as supervised learning, and accordingly, it is hard for BC to adapt to new, unseen situations. An architecture for learning a convolutional neural network end to end, in the self-driving car domain, was proposed in [111], [112]. The CNN is trained to map raw pixels from a single front-facing camera directly to steering commands. Using a relatively small training dataset from humans/experts, the system learns to drive in traffic on local roads with or without lane markings and on highways. The network learns image representations that detect the road successfully, without being explicitly trained to do so. The authors of [113] proposed to learn comfortable driving trajectory optimization using expert demonstrations from human drivers with Maximum Entropy Inverse RL. The authors of [114] used DQN as the refinement step in IRL to extract the rewards, in an effort to learn human-like lane change behavior.

SECTION VI. Real World Challenges and Future Perspectives
In this section, challenges for conducting reinforcement learning for real-world autonomous driving are presented and discussed along with the related research approaches for solving them.

A. Validating RL Systems
Henderson et al. [115] described challenges in validating reinforcement learning methods, focusing on policy gradient methods for continuous control such as PPO, DDPG and TRPO, as well as in reproducing benchmarks. They demonstrate with real examples that implementations often have varying code-bases and different hyper-parameter values, and that unprincipled ways of estimating the top-k rollouts can lead to incoherent interpretations of the performance of reinforcement learning algorithms, and furthermore of how well they generalize. The authors concluded that evaluation could be performed either on a well defined common setup or on real-world tasks. The authors of [116] proposed automated generation of challenging and rare driving scenarios in high-fidelity photo-realistic simulators. These adversarial scenarios are automatically discovered by parameterising the behavior of pedestrians and other vehicles on the road. Moreover, it is shown that by adding these scenarios to the training data of imitation learning, safety is increased.
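
A simple defence against the evaluation pitfalls described above is to report aggregate statistics over several random seeds rather than the best single rollout; the sketch below assumes a user-supplied train_and_eval routine and is not tied to any cited benchmark.

```python
import numpy as np

def evaluate_across_seeds(train_and_eval, seeds=(0, 1, 2, 3, 4)):
    """Run the same training/evaluation routine under several random seeds
    and report aggregate statistics rather than the best single rollout."""
    returns = np.array([train_and_eval(seed) for seed in seeds])
    return {"mean": float(returns.mean()), "std": float(returns.std()),
            "min": float(returns.min()), "max": float(returns.max())}

# Stand-in for an actual DRL training run; replace with the real pipeline.
def dummy_train_and_eval(seed):
    rng = np.random.default_rng(seed)
    return 100.0 + 15.0 * rng.normal()

print(evaluate_across_seeds(dummy_train_and_eval))
```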

B. Bridging the Simulation-Reality Gap
Simulation-to-real-world transfer learning is an active domain, since simulations are a source of large amounts of cheap data with perfect annotations. The authors of [117] train a robot arm to grasp objects in the real world by performing domain adaptation from simulation to reality, at both feature-level and pixel-level. The vision-based grasping system achieved comparable performance with 50 times fewer real-world samples. The authors of [118] randomized the dynamics of the simulator during training. The resulting policies were capable of generalising to different dynamics without requiring retraining on the real system. In the domain of autonomous driving, the authors of [119] train an A3C agent using simulation-to-real translated images of the driving environment. Following this, the trained policy was evaluated on a real-world driving dataset.
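
A minimal sketch of dynamics randomization in the spirit of [118] is shown below; the parameter names and ranges are assumptions, and make_simulator is a hypothetical constructor standing in for the actual simulator interface.

```python
import numpy as np

def sample_randomised_dynamics(rng):
    """Sample simulator dynamics parameters for one training episode
    (parameter names and ranges are illustrative assumptions)."""
    return {
        "vehicle_mass_kg": rng.uniform(1200.0, 1800.0),
        "tyre_friction": rng.uniform(0.6, 1.0),
        "steering_delay_s": rng.uniform(0.0, 0.2),
        "sensor_noise_std": rng.uniform(0.0, 0.05),
    }

rng = np.random.default_rng(0)
for episode in range(3):
    params = sample_randomised_dynamics(rng)
    # env = make_simulator(**params)  # hypothetical constructor for the simulator
    print(f"episode {episode}: {params}")
```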

The authors of [120] addressed the issue of performing imitation learning in simulation such that it transfers well to images from the real world. They achieved this by unsupervised domain translation between simulated and real-world images, which enables learning the prediction of steering in the real-world domain with only ground truth from the simulated domain. The authors remark that there were no pairwise correspondences between images in the simulated training set and the unlabelled real-world image set. Similarly, [121] performs domain adaptation to map real-world images to simulated images. In contrast to sim-to-real methods, they handle the reality gap during deployment of agents in real scenarios by adapting the real camera streams to the synthetic modality, so as to map the unfamiliar or unseen features of real images back into the simulated environment and states. The agents have already learnt a policy in simulation.

C. Sample Efficiency
Animals are usually able to learn new tasks in just a few trials, benefiting from their prior knowledge about the environment. However, one of the key challenges for reinforcement learning is sample efficiency: the learning process requires too many samples to learn a reasonable policy. This issue becomes more noticeable when valuable experience is expensive or even risky to collect. In the case of robot control and autonomous driving, sample efficiency is a difficult issue due to the delayed and sparse rewards found in typical settings, along with an unbalanced distribution of observations in a large state space.

Reward shaping enables the agent to learn intermediate goals by designing a more frequent reward function, encouraging the agent to learn faster from fewer samples. The authors of [122] design a second “trauma” replay memory that contains only collision situations in order to pool positive and negative experiences at each training step.
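
The two-buffer idea can be sketched as follows; the capacities and the fraction of “trauma” samples per batch are illustrative assumptions rather than the settings used in [122].

```python
import random
from collections import deque

class DualReplayBuffer:
    """Sketch of the two-buffer idea: a standard buffer plus a small 'trauma'
    buffer that only stores collision transitions, so every sampled batch
    mixes positive and negative experiences."""

    def __init__(self, capacity=100_000, trauma_capacity=5_000):
        self.regular = deque(maxlen=capacity)
        self.trauma = deque(maxlen=trauma_capacity)

    def add(self, transition, collided):
        (self.trauma if collided else self.regular).append(transition)

    def sample(self, batch_size, trauma_fraction=0.25):
        n_trauma = min(int(batch_size * trauma_fraction), len(self.trauma))
        batch = random.sample(self.trauma, n_trauma)
        batch += random.sample(self.regular, min(batch_size - n_trauma, len(self.regular)))
        return batch

buffer = DualReplayBuffer()
buffer.add(("s", "a", -10.0, "s_next"), collided=True)
buffer.add(("s", "a", 0.1, "s_next"), collided=False)
print(len(buffer.sample(2)))
```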

  1) IL Bootstrapped RL:
    Further efficiency can be achieved when the agent first learns an initial policy offline, performing imitation learning from roll-outs provided by an expert. The agent can then self-improve by applying RL while interacting with the environment.

Actor Critic with Experience Replay (ACER) [123] is a sample-efficient policy gradient algorithm that makes use of a replay buffer, enabling it to perform more than one gradient update using each piece of sampled experience, as well as a trust region policy optimization method.

Transfer learning is another approach to sample efficiency, which enables the reuse of a policy previously trained for a source task to initialize the learning of a target task. Policy composition, presented in [124], proposes composing previously learned basis policies so that they can be reused for a novel task, which leads to faster learning of new policies. A survey on transfer learning in RL is presented in [125]. The multi-fidelity reinforcement learning (MFRL) framework [106] was shown to transfer heuristics to guide exploration in high fidelity simulators and to find near-optimal policies for the real world with fewer real-world samples. The authors of [126] transferred policies learnt to handle simulated intersections to real-world examples, between DQN agents.

Meta-learning algorithms enable agents to adapt to new tasks and learn new skills rapidly from small amounts of experience, benefiting from their prior knowledge about the world. The authors of [127] addressed this issue by training a recurrent neural network on a training set of interrelated tasks, where the network input includes the action selected in addition to the reward received in the previous time step. Accordingly, the agent is trained to learn to exploit the structure of the problem dynamically and solve new problems by adjusting its hidden state. A similar approach for designing RL algorithms is presented in [128]: rather than designing a “fast” reinforcement learning algorithm, it is represented as a recurrent neural network and learned from data. In Model-Agnostic Meta-Learning (MAML), proposed in [129], the meta-learner seeks to find an initialisation for the parameters of a neural network that can be adapted quickly to a new task using only a few examples. Reptile [130] includes a similar model. The authors of [131] present a simple gradient-based meta-learning algorithm.
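
The sketch below shows a Reptile-style update on a toy family of 1-D regression tasks: adapt from the current initialisation with a few gradient steps, then move the initialisation towards the adapted parameters. The task family, step sizes and iteration counts are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A toy task family: 1-D linear regression with a task-specific slope."""
    slope = rng.uniform(-2.0, 2.0)
    xs = rng.uniform(-1.0, 1.0, size=32)
    return xs, slope * xs

def inner_sgd(w, xs, ys, steps=5, lr=0.1):
    """A few gradient steps on one task, starting from the meta-parameters."""
    for _ in range(steps):
        grad = 2.0 * np.mean((w * xs - ys) * xs)
        w -= lr * grad
    return w

# Reptile-style outer loop: move the initialisation towards task-adapted weights.
meta_w, meta_lr = 0.0, 0.1
for _ in range(1000):
    xs, ys = sample_task()
    adapted_w = inner_sgd(meta_w, xs, ys)
    meta_w += meta_lr * (adapted_w - meta_w)
print("meta-learned initial weight:", meta_w)
```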

  2) Efficient State Representations:
    World models proposed in [132] learn a compressed spatial and temporal representation of the environment using VAEs. A compact and simple policy can then be trained directly from the compressed state representation.

D. Exploration Issues With Imitation
In imitation learning, the agent makes use of trajectories provided by an expert. However, the distribution of states the expert encounters usually does not cover all the states the trained agent may encounter during testing. Furthermore, imitation assumes that the actions are independent and identically distributed (i.i.d.). One solution consists in using the Data Aggregation (DAgger) methods [133], where the end-to-end learned policy is executed, the extracted observation-action pairs are again labelled by the expert, and they are aggregated into the original expert observation-action dataset. Thus, iteratively collecting training examples from both reference and trained policies explores more valuable states and solves this lack of exploration. Following work on Search-based Structured Prediction (SEARN) [133], Stochastic Mixing Iterative Learning (SMILE) trains a stochastic stationary policy over several iterations and then makes use of a geometric stochastic mixing of the policies trained. In a standard imitation learning scenario, the demonstrator is required to cover sufficient states so as to avoid unseen states during testing. This constraint is costly and requires frequent human intervention. More recently, ChauffeurNet [134] demonstrated the limits of imitation learning, where even 30 million state-action samples were insufficient to learn an optimal policy that mapped bird-eye view images (states) to control (action). The authors propose the use of simulated examples which introduce perturbations and a higher diversity of scenarios such as collisions and/or going off the road. The feature net includes an agent RNN that outputs the waypoint, agent box position and heading at each iteration. The authors of [135] identify limits of imitation learning and train a DNN end-to-end on the ego vehicle's raw input image and the 2D and 3D locations of neighboring vehicles to simultaneously predict the ego-vehicle action as well as neighbouring vehicle trajectories.
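
A minimal DAgger-style loop, with a proportional controller standing in for the expert labeller and a linear least-squares learner (both assumptions for illustration), looks as follows: the learner's own rollouts are relabelled by the expert and aggregated into the dataset at each iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_policy(state):
    """Stand-in for the expert labeller (here, a simple proportional controller)."""
    return -0.8 * state[0] - 0.5 * state[1]

def fit_policy(states, actions):
    """Supervised step: least-squares linear policy on the aggregated dataset."""
    X = np.hstack([states, np.ones((len(states), 1))])
    w, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return lambda s: float(np.append(s, 1.0) @ w)

def rollout(policy, steps=50):
    """Execute the learned policy and record the states it actually visits."""
    states, state = [], np.array([1.0, 0.0])
    for _ in range(steps):
        states.append(state.copy())
        action = policy(state)
        state = state + 0.1 * np.array([state[1], action]) + 0.01 * rng.normal(size=2)
    return np.array(states)

# DAgger: iteratively aggregate expert labels on the learner's own state distribution.
states = rng.uniform(-1, 1, size=(100, 2))
actions = np.array([expert_policy(s) for s in states])
for _ in range(5):
    policy = fit_policy(states, actions)
    visited = rollout(policy)
    states = np.vstack([states, visited])
    actions = np.concatenate([actions, [expert_policy(s) for s in visited]])
print("aggregated dataset size:", len(states))
```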

E. Intrinsic Reward Functions
In controlled simulated environments such as games, an explicit reward signal is given to the agent along with its sensor stream. However, in real-world robotics and autonomous driving, deriving or designing a good reward function is essential so that the desired behaviour may be learned. The most common solution has been reward shaping [51], which consists in supplying additional well designed rewards to the agent to encourage the optimization towards the optimal policy. Rewards, as already pointed out earlier in the paper, could be estimated by inverse RL (IRL) [70], which depends on expert demonstrations. In the absence of explicit reward shaping and expert demonstrations, agents can use intrinsic rewards or intrinsic motivation [136] to evaluate whether their actions were good or not. The authors of [137] define curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. In [138] the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward. This enables the agent to determine what could be a useful behavior even without extrinsic rewards.
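
A minimal sketch of a prediction-error intrinsic reward is given below; the forward model is deliberately linear and updated online, which is far simpler than the learned feature spaces of [137], [138], but shows how the surprise signal is produced.

```python
import numpy as np

class ForwardModelCuriosity:
    """Intrinsic reward from the prediction error of a learned forward model
    (a linear model here, purely for illustration)."""

    def __init__(self, state_dim, action_dim, lr=0.05):
        self.W = np.zeros((state_dim + action_dim, state_dim))
        self.lr = lr

    def intrinsic_reward(self, state, action, next_state):
        x = np.concatenate([state, action])
        prediction = x @ self.W
        error = next_state - prediction
        # Online update of the forward model on the observed transition.
        self.W += self.lr * np.outer(x, error)
        # The surprise (squared prediction error) acts as the intrinsic reward.
        return float(np.mean(error ** 2))

curiosity = ForwardModelCuriosity(state_dim=2, action_dim=1)
r_int = curiosity.intrinsic_reward(np.array([0.0, 1.0]), np.array([0.5]),
                                   np.array([0.1, 1.2]))
print("intrinsic reward:", r_int)
```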

F. Incorporating Safety in DRL
Deploying an autonomous vehicle in real environments directly after training could be dangerous. Different approaches to incorporate safety into DRL algorithms are presented here. For imitation learning based systems, Safe DAgger [139] introduces a safety policy that learns to predict the error made by a primary policy trained initially with the supervised learning approach, without querying a reference policy. An additional safe policy takes both the partial observation of a state and the primary policy as inputs, and returns a binary label indicating whether the primary policy is likely to deviate from a reference policy without querying it. The authors of [140] addressed safety in multi-agent reinforcement learning for autonomous driving, where a balance is maintained between handling the unexpected behavior of other drivers or pedestrians and not being too defensive, so that normal traffic flow is achieved. While hard constraints are maintained to guarantee the safety of driving, the problem is decomposed into a composition of a policy over desires, to enable comfortable driving, and trajectory planning. Deep reinforcement learning algorithms for control such as DDPG are combined with safety-based control in [141], including the artificial potential field method that is widely used for robot path planning. Using the TORCS environment, DDPG is applied first for learning a driving policy in a stable and familiar environment, then the policy network and safety-based control are combined to avoid collisions. It was found that the combination of DRL and safety-based control performs well in most scenarios. In order to enable DRL to escape local optima, speed up the training process and avoid dangerous conditions or accidents, the Survival-Oriented Reinforcement Learning (SORL) model is proposed in [140], where survival is favored over maximizing total reward by modeling the autonomous driving problem as a constrained MDP and introducing a Negative-Avoidance Function to learn from previous failures. The SORL model was found to be not sensitive to the reward function and can use different DRL algorithms like DDPG. Furthermore, a comprehensive survey on safe reinforcement learning can be found in [142] for interested readers.
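
As a simple illustration of combining a learned policy with a hard safety constraint (a generic rule-based override, not the specific formulations of [139]–[141]), a time-to-collision check can gate the DRL action before it reaches the actuators; the thresholds and state fields below are assumptions.

```python
def safe_action(state, rl_action, ttc_threshold=2.0, max_brake=1.0):
    """Rule-based safety layer: override the learned action when the time-to-collision
    to the lead vehicle falls below a hard threshold (thresholds are illustrative)."""
    gap, closing_speed = state["gap_m"], state["closing_speed_mps"]
    ttc = gap / closing_speed if closing_speed > 0 else float("inf")
    if ttc < ttc_threshold:
        # The hard constraint takes priority over the DRL policy output.
        return {"steering": rl_action["steering"], "throttle": 0.0, "brake": max_brake}
    return rl_action

state = {"gap_m": 8.0, "closing_speed_mps": 6.0}      # TTC is roughly 1.3 s
rl_action = {"steering": 0.02, "throttle": 0.6, "brake": 0.0}
print(safe_action(state, rl_action))
```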

G. Multi-Agent Reinforcement Learning
Autonomous driving is a fundamentally multi-agent task; as well as the ego vehicle being controlled by an agent, there will also be many other actors present in simulated and real world autonomous driving settings, such as pedestrians, cyclists and other vehicles. Therefore, the continued development of explicitly multi-agent approaches to learning to drive autonomous vehicles is an important future research direction. Several prior methods have already approached the autonomous driving problem using a MARL perspective, e.g. [140], [143]–​[146].

One important area where MARL techniques could be very beneficial is in high-level decision making and coordination between groups of autonomous vehicles, in scenarios such as overtaking in highway scenarios [146], or negotiating intersections without signalised control. Another area where MARL approaches could be of benefit is in the development of adversarial agents for testing autonomous driving policies before deployment [145], i.e. agents controlling other vehicles in a simulation that learn to expose weaknesses in the behaviour of autonomous driving policies by acting erratically or against the rules of the road. Finally, MARL approaches could potentially have an important role to play in developing safe policies for autonomous driving [140], as discussed earlier.

SECTION VII. Conclusion
Reinforcement learning is still an active and emerging area in real-world autonomous driving applications. Although there are a few successful commercial applications, there is very little literature or large-scale public datasets available. Thus we were motivated to formalize and organize RL applications for autonomous driving. Autonomous driving scenarios involve interacting agents and require negotiation and dynamic decision making, which suits RL. However, there are many challenges to be resolved in order to have mature solutions, which we discuss in detail. In this work, a detailed overview of reinforcement learning theory is presented, along with a comprehensive literature survey on applying RL to autonomous driving tasks.

Challenges, future research directions and opportunities are discussed in section VI. This includes: validating the performance of RL based systems, the simulation-reality gap, sample efficiency, designing good reward functions, incorporating safety into decision making RL systems for autonomous agents.

Reinforcement learning results are usually difficult to reproduce and are highly sensitive to hyper-parameter choices, which are often not reported in detail. Both researchers and practitioners need to have a reliable starting point where the well known reinforcement learning algorithms are implemented, documented and well tested. These frameworks have been covered in table III.

TABLE III Open-Source Frameworks and Packages for State of the Art RL/DRL Algorithms and Evaluation

TABLE IV Acronyms Related to Reinforcement Learning (RL)
The development of explicitly multi-agent reinforcement learning approaches to the autonomous driving problem is also an important future challenge that has not received a lot of attention to date. MARL techniques have the potential to make coordination and high-level decision making between groups of autonomous vehicles easier, as well as providing new opportunities for testing and validating the safety of autonomous driving policies.

Furthermore, implementation of RL algorithms is a challenging task for researchers and practitioners. This work presents examples of well known and active open-source RL frameworks that provide well documented implementations, enabling the use, evaluation and extension of different RL algorithms. Finally, we hope that this overview paper encourages further research and applications.

References
1.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA, USA:MIT Press, 2018.
R. S. Sutton 和 AG Barto,强化学习:简介,美国马萨诸塞州剑桥:麻省理工学院出版社,2018 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
2.
V. Talpaert et al., “Exploring applications of deep reinforcement learning for real-world autonomous driving systems”, Proc. 14th Int. Joint Conf. Comput. Vis. Imag. Comput. Graph. Theory Appl., pp. 564-572, 2019.
V. Talpaert 等人,“探索深度强化学习在真实世界自动驾驶系统中的应用”,第 14 届国际联合会议论文集。Vis. Imag.计算。图。《应用理论》,第564-572页,2019年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索

M. Siam, S. Elkerdawy, M. Jagersand and S. Yogamani, “Deep semantic segmentation for automated driving: Taxonomy roadmap and challenges”, Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), pp. 1-8, Oct. 2017.
M. Siam、S. Elkerdawy、M. Jagersand 和 S. Yogamani,“自动驾驶的深度语义分割:分类路线图和挑战”,IEEE 第 20 届国际会议论文集。Transp. Syst. (ITSC),第 1-8 页,2017 年 10 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索

K. El Madawi, H. Rashed, A. El Sallab, O. Nasr, H. Kamel and S. Yogamani, “RGB and LiDAR fusion based 3D semantic segmentation for autonomous driving”, Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), pp. 7-12, Oct. 2019.
K. El Madawi、H. Rashed、A. El Sallab、O. Nasr、H. Kamel 和 S. Yogamani,“基于 RGB 和 LiDAR 融合的自动驾驶 3D 语义分割”,IEEE Intell 论文集。Transp. Syst. Conf. (ITSC),第 7-12 页,2019 年 10 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索

M. Siam, H. Mahgoub, M. Zahran, S. Yogamani, M. Jagersand and A. El-Sallab, “MODNet: Motion and appearance based moving object detection network for autonomous driving”, Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), pp. 2859-2864, Nov. 2018.
M. Siam、H. Mahgoub、M. Zahran、S. Yogamani、M. Jagersand 和 A. El-Sallab,“MODNet:基于运动和外观的自动驾驶运动物体检测网络”,第 21 届国际会议 Intell。Transp. Syst. (ITSC),第 2859-2864 页,2018 年 11 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索

V. R. Kumar et al., “Monocular fisheye camera depth estimation using sparse LiDAR supervision”, Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), pp. 2853-2858, Nov. 2018.
V. R. Kumar 等人,“使用稀疏 LiDAR 监督的单目鱼眼相机深度估计”,第 21 届国际会议论文集。Transp. Syst. (ITSC),第 2853-2858 页,2018 年 11 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索

M. Uricar, P. Krizek, G. Sistu and S. Yogamani, “SoilingNet: Soiling detection on automotive surround-view cameras”, Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), pp. 67-72, Oct. 2019.
M. Uricar、P. Krizek、G. Sistu 和 S. Yogamani,“SoilingNet:汽车环视摄像头上的污垢检测”,IEEE Intell 论文集。Transp. Syst. Conf. (ITSC),第 67-72 页,2019 年 10 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索

G. Sistu et al., “NeurAll: Towards a unified visual perception model for automated driving”, Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), pp. 796-803, Oct. 2019.
G. Sistu 等人,“NeurAll:迈向自动驾驶的统一视觉感知模型”,IEEE Intell.Transp. Syst. Conf. (ITSC),第 796-803 页,2019 年 10 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索

S. Yogamani et al., “WoodScape: A multi-task multi-camera fisheye dataset for autonomous driving”, Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 9308-9318, Oct. 2019.
S. Yogamani 等人,“WoodScape:用于自动驾驶的多任务多摄像头鱼眼数据集”,IEEE/CVF Int. Conf. Comput.Vis. (ICCV),第 9308-9318 页,2019 年 10 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索

S. Milz, G. Arbeiter, C. Witt, B. Abdallah and S. Yogamani, “Visual SLAM for automated driving: Exploring the applications of deep learning”, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 247-257, Jun. 2018.
S. Milz、G. Arbeiter、C. Witt、B. Abdallah 和 S. Yogamani,“用于自动驾驶的视觉 SLAM:探索深度学习的应用”,IEEE/CVF Conf. Comput.Vis. Pattern Recognit.研讨会 (CVPRW),第 247-257 页,2018 年 6 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索
11.
S. M. LaValle, Planning Algorithms, New York, NY, USA:Cambridge Univ. Press, 2006.
S. M. LaValle,规划算法,美国纽约:剑桥大学出版社,2006 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
12.
S. M. LaValle and J. J. Kuffner, “Randomized kinodynamic planning”, Int. J. Robot. Res., vol. 20, no. 5, pp. 378-400, May 2001.
S. M. LaValle 和 J. J. Kuffner,“随机激动力学规划”,Int. J. Robot。《研究》,第20卷,第5期,第378-400页,2001年5月。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
13.
T. D. Team, Dimensions Publication Trends, Oct. 2020, [online] Available: https://app.dimensions.ai/discover/publication.
T. D. Team,Dimensions Publication Trends,2020 年 10 月,[在线] 可用:https://app.dimensions.ai/discover/publication。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索

Y. Kuwata, S. Karaman, J. Teo, E. Frazzoli, J. P. How and G. Fiore, “Real-time motion planning with applications to autonomous urban driving”, IEEE Trans. Control Syst. Technol., vol. 17, no. 5, pp. 1105-1118, Sep. 2009.
Y. Kuwata, S. Karaman, J. Teo, E. Frazzoli, JPHow 和 G. Fiore,“实时运动规划及其在自动驾驶城市驾驶中的应用”,IEEE Trans. Control Syst. Technol.,第 17 卷,第 5 期,第 1105-1118 页,2009 年 9 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索

B. Paden, M. Cap, S. Z. Yong, D. Yershov and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles”, IEEE Trans. Intell. Vehicles, vol. 1, no. 1, pp. 33-55, Mar. 2016.
B. Paden、M. Cap、S. Z. Yong、D. Yershov 和 E. Frazzoli,“自动驾驶城市车辆运动规划和控制技术综述”,IEEE Trans. Intell.《车辆》,第 1 卷,第 1 期,第 33-55 页,2016 年 3 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索
16.
W. Schwarting, J. Alonso-Mora and D. Rus, “Planning and decision-making for autonomous vehicles”, Annu. Rev. Control Robot. Auto. Syst., vol. 1, no. 1, pp. 187-210, May 2018.
W. Schwarting、J. Alonso-Mora 和 D. Rus,“自动驾驶汽车的规划和决策”,Annu。Rev. 控制机器人。《Auto. Syst.》,第 1 卷,第 1 期,第 187-210 页,2018 年 5 月。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索

S. Kuutti, R. Bowden, Y. Jin, P. Barber and S. Fallah, “A survey of deep learning applications to autonomous vehicle control”, IEEE Trans. Intell. Transp. Syst., Jan. 2020.
S. Kuutti、R. Bowden、Y. Jin、P. Barber 和 S. Fallah,“深度学习在自动驾驶汽车控制中的应用调查”,IEEE Trans. Intell。Transp. Syst.,2020 年 1 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索
18.
T. M. Mitchell, “Machine learning” in McGraw-Hill Series in Computer Science, Boston, MA, USA:McGraw-Hill, 1997.
T. M. Mitchell,“机器学习”,McGraw-Hill 计算机科学系列,美国马萨诸塞州波士顿:McGraw-Hill,1997 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
19.
S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Upper Saddle River, NJ, USA:Prentice-Hall, 2009.
S. J. Russell 和 P. Norvig,人工智能:现代方法,美国新泽西州马鞍河上游:Prentice-Hall,2009 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
20.
Z.-W. Hong, T.-Y. Shann, S.-Y. Su, Y.-H. Chang, T.-J. Fu and C.-Y. Lee, “Diversity-driven exploration strategy for deep reinforcement learning” in Advances in Neural Information Processing Systems, New York, NY, USA:Curran Associates, Inc, vol. 31, pp. 10489-10500, 2018, [online] Available: https://proceedings.neurips.cc/paper/2018/file/a2802cade04644083dcde1c8c483ed9a-Paper.pdf.
洪志伟, T.-Y.Shann, S.-Y.苏永华张T.-J.傅和C.-Y.Lee,“深度强化学习的多样性驱动探索策略”,载于《神经信息处理系统进展》,美国纽约州纽约市:Curran Associates, Inc,第 31 卷,第 10489-10500 页,2018 年,[在线] 可用:https://proceedings.neurips.cc/paper/2018/file/a2802cade04644083dcde1c8c483ed9a-Paper.pdf。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
21.
M. van Otterlo, Reinforcement Learning: State-of-the-Art, Berlin, Germany:Springer, 2012.
M. van Otterlo,强化学习:最先进的技术,德国柏林:施普林格,2012 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
22.
M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, New York, NY, USA:Wiley, 1994.
M. L. Puterman,马尔可夫决策过程:离散随机动态规划,美国纽约:Wiley,1994 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
23.
C. J. Watkins and P. Dayan, “Technical note: Q-learning”, Mach. Learn., vol. 8, no. 3, pp. 279-292, 1992.
C. J. Watkins 和 P. Dayan,“技术说明:Q-learning”,Mach. Learn.,第 8 卷,第 3 期,第 279-292 页,1992 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
24.
V. Mnih et al., “Human-level control through deep reinforcement learning”, Nature, vol. 518, no. 7540, pp. 529-533, 2015.
Show in Context CrossRef Google Scholar
25.
C. J. C. H. Watkins, “Learning from delayed rewards”, 1989, [online] Available: https://ci.nii.ac.jp/naid/10008997819/.
Show in Context Google Scholar
26.
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra and M. Riedmiller, “Deterministic policy gradient algorithms”, Proc. ICML, pp. 387-395, 2014.
Show in Context Google Scholar
27.
R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning”, Mach. Learn., vol. 8, no. 3, pp. 229-256, May 1992.
Show in Context CrossRef Google Scholar
28.
J. Schulman, S. Levine, P. Abbeel, M. Jordan and P. Moritz, “Trust region policy optimization”, Proc. Int. Conf. Mach. Learn., pp. 1889-1897, 2015.
Show in Context Google Scholar
29.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, “Proximal policy optimization algorithms”, arXiv:1707.06347, 2017, [online] Available: http://arxiv.org/abs/1707.06347.
Show in Context Google Scholar
30.
T. P. Lillicrap et al., “Continuous control with deep reinforcement learning”, Proc. 4th Int. Conf. Learn. Represent. (ICLR), pp. 1-14, May 2016, [online] Available: https://iclr.cc/archive/www/doku.php%3Fid=iclr2016:accepted-main.html.
Show in Context Google Scholar
31.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, et al., “Asynchronous methods for deep reinforcement learning”, Proc. Int. Conf. Mach. Learn., pp. 1928-1937, 2016.
V. Mnih、A. P. Badia、M. Mirza、A. Graves、T. Lillicrap、T. Harley 等人,“深度强化学习的异步方法”,Proc. Int. Conf. Mach. Learn.,第 1928-1937 页,2016 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
32.
T. Haarnoja, H. Tang, P. Abbeel and S. Levine, “Reinforcement learning with deep energy-based policies”, Proc. 34th Int. Conf. Mach. Learn. (JMLR), vol. 70, pp. 1352-1361, 2017.
T. Haarnoja、H. Tang、P. Abbeel 和 S. Levine,“基于深度能量的策略的强化学习”,第 34 届国际会议 Mach. Learn。(JMLR),第 70 卷,第 1352-1361 页,2017 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
33.
T. Haarnoja et al., “Soft actor-critic algorithms and applications”, arXiv:1812.05905, 2018, [online] Available: http://arxiv.org/abs/1812.05905.
T. Haarnoja 等人,“软演员-评论家算法和应用”,arXiv:1812.05905,2018 年,[在线] 可用:http://arxiv.org/abs/1812.05905。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
34.
R. S. Sutton, “Integrated architectures for learning planning and reacting based on approximating dynamic programming” in Machine Learning Proceedings, Amsterdam, The Netherlands:Elsevier, 1990.
R. S. Sutton,“基于近似动态规划的学习规划和反应的集成架构”,载于《机器学习论文集》,荷兰阿姆斯特丹:爱思唯尔,1990 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
35.
R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time algorithm for near-optimal reinforcement learning”, J. Mach. Learn. Res., vol. 3, pp. 213-231, Oct. 2002.
R. I. Brafman 和 M. Tennenholtz,“R-max-a general polynomial time algorithm for near-optimal reinforcement learning”,J. Mach. Learn。《研究》,第3卷,第213-231页,2002年10月。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
36.
D. Silver, R. S. Sutton and M. Müller, “Sample-based learning and search with permanent and transient memories”, Proc. 25th Int. Conf. Mach. Learn. ICML, pp. 968-975, 2008.
D. Silver、R. S. Sutton 和 M. Müller,“基于样本的学习和搜索与永久和瞬态记忆”,第 25 届国际会议 Mach. Learn。ICML,第 968-975 页,2008 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
37.
G. A. Rummery and M. Niranjan, “On-line Q-learning using connectionist systems”, 1994.
G. A. Rummery 和 M. Niranjan,“使用联结主义系统的在线 Q 学习”,1994 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
38.
R. S. Sutton and A. G. Barto, Reinforcement Learning an Introduction, Cambridge, MA, USA:MIT Press, 2018, [online] Available: https://books.google.fr/books?id=uWV0DwAAQBAJ&lpg=PR7&ots=mioFt_0-p9&dq=Reinforcement%20learning%20an%20introduction%E2%80%93%20second%20edition&lr&pg=PR7#v=onepage&q=Reinforcement%20learning%20an%20introduction%E2%80%93%20second%20edition&f=false.
R. S. Sutton 和 A. G. Barto,Reinforcement Learning an Introduction,美国马萨诸塞州剑桥:麻省理工学院出版社,2018 年,[在线] 可用:https://books.google.fr/books?id=uWV0DwAAQBAJ&lpg=PR7&ots=mioFt_0-p9&dq=Reinforcement%20learning%20an%20introduction%E2%80%93%20second%20edition&lr&pg=PR7#v=onepage&q=Reinforcement%20learning%20an%20introduction%E2%80%93%20second%20edition&f=false。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
39.
R. Bellman, Dynamic Programming, Princeton, NJ, USA:Princeton Univ. Press, 1957.
R. Bellman,动态规划,美国新泽西州普林斯顿:普林斯顿大学出版社,1957 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索

G. Tesauro, “TD-gammon a self-teaching backgammon program achieves master-level play”, Neural Comput., vol. 6, no. 2, pp. 215-219, Mar. 1994.
G. Tesauro,“TD-gammon a self-teaching backgammon program achieves master-level play”,Neural Comput.,第 6 卷,第 2 期,第 215-219 页,1994 年 3 月。
Show in Context View Article
在上下文中显示 查看文章
Google Scholar
Google 学术搜索
41.
D. Silver et al., “Mastering the game of go with deep neural networks and tree search”, Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.
D. Silver 等人,“掌握深度神经网络和树搜索的围棋游戏”,《自然》,第 529 卷,第 7587 期,第 484-489 页,2016 年 1 月。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
42.
K. Narasimhan, T. Kulkarni and R. Barzilay, “Language understanding for text-based games using deep reinforcement learning”, Proc. Conf. Empirical Methods Natural Lang. Process., pp. 1-11, 2015.
K. Narasimhan、T. Kulkarni 和 R. Barzilay,“使用深度强化学习对基于文本的游戏进行语言理解”,Proc. Conf. Empirical Methods Natural Lang. Process.,第 1-11 页,2015 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
43.
T. Schaul, J. Quan, I. Antonoglou and D. Silver, “Prioritized experience replay”, arXiv:1511.05952, 2015, [online] Available: http://arxiv.org/abs/1511.05952.
T. Schaul、J. Quan、I. Antonoglou 和 D. Silver,“优先体验重播”,arXiv:1511.05952,2015 年,[在线] 可用:http://arxiv.org/abs/1511.05952。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
44.
H. Van Hasselt, A. Guez and D. Silver, “Deep reinforcement learning with double Q-learning”, Proc. AAAI, vol. 16, pp. 2094-2100, 2016.
H. Van Hasselt、A. Guez 和 D. Silver,“Deep reinforcement learning with double Q-learning”,Proc. AAAI,第 16 卷,第 2094-2100 页,2016 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
45.
Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot and N. de Freitas, “Dueling network architectures for deep reinforcement learning”, arXiv:1511.06581, 2015, [online] Available: http://arxiv.org/abs/1511.06581.
Show in Context Google Scholar
46.
M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs”, arXiv:1507.06527, 2015, [online] Available: https://arxiv.org/abs/1507.06527.
Show in Context Google Scholar
47.
B. F. Skinner, The Behavior of Organisms: An Experimental Analysis, New York, NY, USA:Appleton-Century, 1938.
Show in Context Google Scholar
48.
E. Wiewiora, “Reward shaping” in Encyclopedia of Machine Learning and Data Mining, Boston, MA, USA:Springer, pp. 1104-1106, 2017.
Show in Context CrossRef Google Scholar
49.
J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping”, Proc. 15th Int. Conf. Mach. Learn., pp. 463-471, 1998.
Show in Context Google Scholar
50.
D. H. Wolpert, K. R. Wheeler and K. Tumer, “Collective intelligence for control of distributed dynamical systems”, EPL (Europhys. Lett.), vol. 49, no. 6, pp. 708, 2000.
Show in Context CrossRef Google Scholar
51.
A. Y. Ng, D. Harada and S. J. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping”, Proc. 16th Int. Conf. Mach. Learn., pp. 278-287, 1999.
Show in Context Google Scholar
52.
S. Devlin and D. Kudenko, “Theoretical considerations of potential-based reward shaping for multi-agent systems”, Proc. 10th Int. Conf. Auto. Agents Multiagent Syst. (AAMAS), pp. 225-232, 2011.
Show in Context Google Scholar
53.
P. Mannion, S. Devlin, K. Mason, J. Duggan and E. Howley, “Policy invariance under reward transformations for multi-objective reinforcement learning”, Neurocomputing, vol. 263, pp. 60-73, Nov. 2017.
Show in Context CrossRef Google Scholar
54.
M. Colby and K. Tumer, “An evolutionary game theoretic analysis of difference evaluation functions”, Proc. Annu. Conf. Genetic Evol. Comput., pp. 1391-1398, Jul. 2015.
M. Colby 和 K. Tumer,“差异评估函数的进化博弈理论分析”,Proc.Conf. Genetic Evol.Comput.,第 1391-1398 页,2015 年 7 月。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
55.
P. Mannion, J. Duggan and E. Howley, “A theoretical and empirical analysis of reward transformations in multi-objective stochastic games”, Proc. 16th Int. Conf. Auto. Agents Multiagent Syst. (AAMAS), pp. 1-3, 2017.
P. Mannion、J. Duggan 和 E. Howley,“多目标随机博弈中奖励转换的理论和实证分析”,第 16 届国际会议自动代理多代理系统 (AAMAS),第 1-3 页,2017 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
56.
L. Buşoniu, R. Babuška and B. Schutter, “Multi-agent reinforcement learning: An overview” in Innovations in Multi-Agent Systems and Applications, Berlin, Germany:Springer, vol. 310, 2010.
L. Buşoniu、R. Babuška 和 B. Schutter,“多智能体强化学习:概述”,多智能体系统和应用创新,德国柏林:Springer,第 310 卷,2010 年。
Show in Context CrossRef Google Scholar
在上下文中显示 CrossRef Google 学术搜索
57.
P. Mannion, K. Mason, S. Devlin, J. Duggan and E. Howley, “Multi-objective dynamic dispatch optimisation using multi-agent reinforcement learning”, Proc. 15th Int. Conf. Auto. Agents Multiagent Syst. (AAMAS), pp. 1345-1346, 2016.
P. Mannion、K. Mason、S. Devlin、J. Duggan 和 E. Howley,“使用多智能体强化学习的多目标动态调度优化”,第 15 届国际会议自动代理多智能体系统 (AAMAS),第 1345-1346 页,2016 年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
58.
K. Mason, P. Mannion, J. Duggan and E. Howley, “Applying multi-agent reinforcement learning to watershed management”, Proc. Adapt. Learn. Agents Workshop (AAMAS), pp. 1-8, May 2016.
K. Mason、P. Mannion、J. Duggan 和 E. Howley,“将多智能体强化学习应用于流域管理”,Proc.学习。代理研讨会 (AAMAS),第 1-8 页,2016 年 5 月。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
59.
V. Pareto, Manual Political Economy, Oxford, U.K.:OUP Oxford, 1906.
V.帕累托,《手动政治经济学》,英国牛津:OUP Oxford,1906年。
Show in Context Google Scholar
在上下文中显示 Google 学术搜索
60.
D. M. Roijers, P. Vamplew, S. Whiteson and R. Dazeley, “A survey of multi-objective sequential decision-making”, J. Artif. Intell. Res., vol. 48, pp. 67-113, Oct. 2013.
61.
R. Rădulescu, P. Mannion, D. M. Roijers and A. Nowé, “Multi-objective multi-agent decision making: A utility-based analysis and survey”, Auto. Agents Multi-Agent Syst., vol. 34, no. 1, pp. 10, Apr. 2020.
62.
T. Lesort, N. Díaz-Rodríguez, J.-F. Goudou and D. Filliat, “State representation learning for control: An overview”, Neural Netw., vol. 108, pp. 379-392, Dec. 2018.
63.
A. Raffin, A. Hill, K. R. Traoré, T. Lesort, N. D. Rodríguez and D. Filliat, “Decoupling feature extraction from policy learning: Assessing benefits of state representation learning in goal based robotics”, arXiv:1901.08651, 2019, [online] Available: https://arxiv.org/abs/1901.08651.
64.
W. Böhmer, J. T. Springenberg, J. Boedecker, M. Riedmiller and K. Obermayer, “Autonomous learning of state representations for control: An emerging field aims to autonomously learn state representations for reinforcement learning agents from their real-world sensor observations”, KI - Künstliche Intelligenz, vol. 29, no. 4, pp. 353-362, Nov. 2015.
65.
D. Silver et al., “Mastering the game of go without human knowledge”, Nature, vol. 550, no. 7676, pp. 354, 2017.
66.
P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learning in reinforcement learning”, Proc. 22nd Int. Conf. Mach. Learn. ICML, pp. 1-8, 2005.
67.
B. Kang, Z. Jie and J. Feng, “Policy optimization with demonstrations”, Proc. Int. Conf. Mach. Learn., pp. 2474-2483, 2018.
68.
T. Hester et al., “Deep Q-learning from demonstrations”, Proc. 32nd AAAI Conf. Artif. Intell., pp. 1-12, 2018.
69.
S. Ibrahim and D. Nevin, “End-to-end framework for fast learning asynchronous agents”, Proc. 32nd Conf. Neural Inf. Process. Syst. Imitation Learn. Challenges Robot. Workshop (NeurIPS), 2018, [online] Available: https://sites.google.com/view/nips18-ilr#h.p_6wGpM-tJnQIU.
70.
P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning”, Proc. 21st Int. Conf. Mach. Learn., pp. 1, 2004.
71.
A. Y. Ng et al., “Algorithms for inverse reinforcement learning”, Proc. ICML, pp. 2, 2000.
72.
J. Ho and S. Ermon, “Generative adversarial imitation learning”, Proc. Adv. Neural Inf. Process. Syst., pp. 4565-4573, 2016.
73.
I. Goodfellow et al., “Generative adversarial nets”, Proc. Adv. Neural Inf. Process. Syst., pp. 2672-2680, 2014.
74.
M. Uřičář, P. Křížek, D. Hurych, I. Sobh, S. Yogamani and P. Denny, “Yes we gan: Applying adversarial techniques for autonomous driving”, Electron. Imag., vol. 2019, no. 15, pp. 1-48, 2019.
75.
E. Leurent, “A survey of state-action representations for autonomous driving”, 2018, [online] Available: https://hal.archives-ouvertes.fr/hal-01908175/document.
76.
H. Xu, Y. Gao, F. Yu and T. Darrell, “End-to-end learning of driving models from large-scale video datasets”, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2174-2182, Jul. 2017.
77.
R. S. Sutton, D. Precup and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning”, Artif. Intell., vol. 112, no. 1, pp. 181-211, Aug. 1999.
78.
A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez and V. Koltun, “CARLA: An open urban driving simulator”, Proc. 1st Annu. Conf. Robot Learn., pp. 1-16, 2017.
79.
C. Li and K. Czarnecki, “Urban driving with multi-objective deep reinforcement learning”, Proc. 18th Int. Conf. Auto. Agents MultiAgent Syst., pp. 359-367, 2019.
80.
S. Kardell and M. Kuosku, “Autonomous vehicle control via deep reinforcement learning”, 2017, [online] Available: https://www.semanticscholar.org/paper/Autonomous-vehicle-control-via-deep-reinforcement-Kardell-Kuosku/00440fbe53b0b099a7fa1a4714caf401c8663019.
81.
J. Chen, B. Yuan and M. Tomizuka, “Model-free deep reinforcement learning for urban autonomous driving”, Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), pp. 2765-2771, Oct. 2019.
82.
A. E. Sallab, M. Abdou, E. Perot and S. Yogamani, “End-to-end deep reinforcement learning for lane keeping assist”, Proc. MLITS NIPS Workshop, vol. 2, pp. 1-9, 2016.
83.
A. Sallab, M. Abdou, E. Perot and S. Yogamani, “Deep reinforcement learning framework for autonomous driving”, Electron. Imag., vol. 2017, no. 19, pp. 70-76, Jan. 2017.
84.
P. Wang, C.-Y. Chan and A. de La Fortelle, “A reinforcement learning based approach for automated lane change maneuvers”, Proc. IEEE Intell. Vehicles Symp. (IV), pp. 1379-1384, Jun. 2018.
85.
P. Wang and C.-Y. Chan, “Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge”, Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), pp. 1-6, Oct. 2017.
86.
D. C. K. Ngai and N. H. C. Yung, “A multiple-goal reinforcement learning method for complex vehicle overtaking maneuvers”, IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2, pp. 509-522, Jun. 2011.
87.
D. Isele, R. Rahimi, A. Cosgun, K. Subramanian and K. Fujimura, “Navigating occluded intersections with autonomous vehicles using deep reinforcement learning”, Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 2034-2039, May 2018.
88.
A. Keselman, S. Ten, A. Ghazali and M. Jubeh, “Reinforcement learning with A* and a deep heuristic”, arXiv:1811.07745, 2018, [online] Available: http://arxiv.org/abs/1811.07745.
89.
W. Zhan et al., “INTERACTION dataset: An INTERnational adversarial and cooperative moTION dataset in interactive driving scenarios with semantic maps”, arXiv:1910.03088, 2019, [online] Available: http://arxiv.org/abs/1910.03088.
90.
A. Kendall et al., “Learning to drive in a day”, Proc. Int. Conf. Robot. Autom. (ICRA), pp. 8248-8254, May 2019.
91.
M. Watter, J. Springenberg, J. Boedecker and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images”, Proc. Adv. Neural Inf. Process. Syst., pp. 2746-2754, 2015.
92.
N. Wahlström, T. B. Schön and M. P. Deisenroth, “Learning deep dynamical models from image pixels”, IFAC-PapersOnLine, vol. 48, no. 28, pp. 1059-1064, 2015.
93.
S. Chiappa, S. Racanière, D. Wierstra and S. Mohamed, “Recurrent environment simulators”, Proc. 5th Int. Conf. Learn. Represent. ICLR, pp. 1-61, Apr. 2017.
94.
B. Recht, “A tour of reinforcement learning: The view from continuous control”, Annu. Rev. Control Robot. Auto. Syst., vol. 2, no. 1, pp. 253-279, May 2019.
95.
H. Mania, A. Guy and B. Recht, “Simple random search of static linear policies is competitive for reinforcement learning”, Proc. 31st Annu. Conf. Adv. Neural Inf. Process. Syst. (NeurIPS), pp. 1800-1809, 2018, [online] Available: https://proceedings.neurips.cc/paper/2018.
96.
B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom and A. Sumner, Torcs the Open Racing Car Simulator, 2000, [online] Available: http://torcs.sourceforge.net.
97.
S. Shah, D. Dey, C. Lovett and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles” in Field and Service Robotics, Cham, Switzerland:Springer, pp. 621-635, 2018.
98.
N. Koenig and A. Howard, “Design and use paradigms for gazebo an open-source multi-robot simulator”, Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 2149-2154, Sep. 2004.
99.
P. A. Lopez et al., “Microscopic traffic simulation using SUMO”, Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), pp. 2575-2582, Nov. 2018.
100.
C. Quiter and M. Ernst, Deepdrive/Deepdrive: 2.0, Mar. 2018, [online] Available: https://deepdrive.voyage.auto/.
101.
Drive Constellation Now Available, Apr. 2019, [online] Available: https://blogs.nvidia.com/blog/2019/03/18/drive-constellation-now-available/.
102.
A. Santara et al., Multi-Agent Autonomous Driving Simulator Built on Top of TORCS, Apr. 2019, [online] Available: https://github.com/madras-simulator/MADRaS.
103.
C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky and A. M. Bayen, “Flow: Architecture and benchmarking for reinforcement learning in traffic control”, arXiv:1710.05465, 2017, [online] Available: https://arxiv.org/abs/1710.05465.
104.
E. Leurent, A Collection of Environments for Autonomous Driving and Tactical Decision-Making Tasks, Apr. 2019, [online] Available: https://github.com/eleurent/highway-env.
105.
F. Rosique, P. J. Navarro, C. Fernández and A. Padilla, “A systematic review of perception system and simulators for autonomous vehicles research”, Sensors, vol. 19, no. 3, pp. 648, Feb. 2019.
106.
M. Cutler, T. J. Walsh and J. P. How, “Reinforcement learning with multi-fidelity simulators”, Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 3888-3895, May 2014.
107.
F. C. German Ros, V. Koltun and A. M. Lopez, Carla Autonomous Driving Challenge, Apr. 2019, [online] Available: https://carlachallenge.org/.
108.
W. G. Najm et al., “Pre-crash scenario typology for crash avoidance research”, 2007, [online] Available: https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/pre-crash_scenario_typology-final_pdf_version_5-2-07.pdf.
109.
D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network”, Proc. Adv. Neural Inf. Process. Syst., pp. 1-16, 1989.
110.
D. A. Pomerleau, “Efficient training of artificial neural networks for autonomous navigation”, Neural Comput., vol. 3, no. 1, pp. 88-97, Feb. 1991.
111.
M. Bojarski et al., “End to end learning for self-driving cars”, Proc. NIPS Deep Learn. Symp., pp. 1-9, 2016.
112.
M. Bojarski et al., “Explaining how a deep neural network trained with End-to-End learning steers a car”, arXiv:1704.07911, 2017, [online] Available: http://arxiv.org/abs/1704.07911.
113.
M. Kuderer, S. Gulati and W. Burgard, “Learning driving styles for autonomous vehicles from demonstration”, Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 2641-2646, May 2015.
114.
S. Sharifzadeh, I. Chiotellis, R. Triebel and D. Cremers, “Learning to drive using inverse reinforcement learning and deep Q-networks”, Proc. NIPS Workshops, pp. 1-7, Dec. 2016.
115.
P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup and D. Meger, “Deep reinforcement learning that matters”, Proc. 32nd AAAI Conf. Artif. Intell., pp. 1-26, 2018.
116.
Y. Abeysirigoonawardena, F. Shkurti and G. Dudek, “Generating adversarial driving scenarios in high-fidelity simulators”, Proc. Int. Conf. Robot. Autom. (ICRA), pp. 8271-8277, May 2019.
117.
K. Bousmalis et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping”, Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 4243-4250, May 2018.
118.
X. B. Peng, M. Andrychowicz, W. Zaremba and P. Abbeel, “Sim-to-Real transfer of robotic control with dynamics randomization”, Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 1-8, May 2018.
119.
X. Pan, Y. You, Z. Wang and C. Lu, “Virtual to real reinforcement learning for autonomous driving”, Proc. Brit. Mach. Vis. Conf., pp. 2-14, 2017.
120.
A. Bewley et al., “Learning to drive from simulation without real world labels”, Proc. Int. Conf. Robot. Autom. (ICRA), pp. 4818-4824, May 2019.
121.
J. Zhang et al., “VR-goggles for robots: Real-to-sim domain adaptation for visual control”, IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 1148-1155, Apr. 2019.
122.
H. Chae, C. M. Kang, B. Kim, J. Kim, C. C. Chung and J. W. Choi, “Autonomous braking system via deep reinforcement learning”, Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), pp. 1-6, Oct. 2017.
123.
Z. Wang et al., “Sample efficient actor-critic with experience replay”, Proc. 5th Int. Conf. Learn. Represent. ICLR, pp. 1-20, Apr. 2017.
124.
R. Liaw, S. Krishnan, A. Garg, D. Crankshaw, J. E. Gonzalez and K. Goldberg, “Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning”, arXiv:1711.01503, 2017, [online] Available: http://arxiv.org/abs/1711.01503.
125.
M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning domains: A survey”, J. Mach. Learn. Res., vol. 10, pp. 1633-1685, Jul. 2009.
126.
D. Isele and A. Cosgun, “Transferring autonomous driving knowledge on simulated and real intersections”, Proc. Lifelong Learn. Reinforcement Learn. Approach ICML Workshop (NeurIPS), 2017, [online] Available: https://rlabstraction2016.wixsite.com/icml-2017/accepted-papers.
127.
J. X. Wang et al., “Learning to reinforcement learn”, Proc. Complete CogSci, 2016, [online] Available: https://cogsci.mindmodeling.org/2017/papers/0252/index.html.
128.
Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever and P. Abbeel, “RL²: Fast reinforcement learning via slow reinforcement learning”, arXiv:1611.02779, 2016, [online] Available: http://arxiv.org/abs/1611.02779.
129.
C. Finn, P. Abbeel and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks”, Proc. 34th Int. Conf. Mach. Learn., vol. 70, pp. 1126-1135, Aug. 2017.
130.
A. Nichol, J. Achiam and J. Schulman, “On first-order meta-learning algorithms”, arXiv:1803.02999, 2018, [online] Available: https://arxiv.org/abs/1803.02999.
131.
M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch and P. Abbeel, “Continuous adaptation via meta-learning in nonstationary and competitive environments”, Proc. 6th Int. Conf. Learn. Represent. ICLR, pp. 1-21, May 2018.
132.
D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution”, Proc. Adv. Neural Inf. Process. Syst., pp. 1-13, 2018.
133.
S. Ross and D. Bagnell, “Efficient reductions for imitation learning”, Proc. 13th Int. Conf. Artif. Intell. Statist., pp. 661-668, 2010.
134.
M. Bansal, A. Krizhevsky and A. Ogale, “ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst”, Robotics Science and Systems XV, 2018.
135.
T. Buhet, E. Wirbel and X. Perrotton, “Conditional vehicle trajectories prediction in CARLA urban environment”, Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), pp. 2310-2319, Oct. 2019.
136.
N. Chentanez, A. G. Barto and S. P. Singh, “Intrinsically motivated reinforcement learning”, Proc. Adv. Neural Inf. Process. Syst., pp. 1281-1288, 2005.
137.
D. Pathak, P. Agrawal, A. A. Efros and T. Darrell, “Curiosity-driven exploration by self-supervised prediction”, Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 2778-2787, Jul. 2017.
138.
Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell and A. A. Efros, “Large-scale study of curiosity-driven learning”, arXiv:1808.04355, 2018, [online] Available: http://arxiv.org/abs/1808.04355.
139.
J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end simulated driving”, Proc. 31st AAAI Conf. Artif. Intell., pp. 2891-2897, 2017.
140.
S. Shalev-Shwartz, S. Shammah and A. Shashua, “Safe multi-agent reinforcement learning for autonomous driving”, arXiv:1610.03295, 2016, [online] Available: http://arxiv.org/abs/1610.03295.
141.
X. Xiong, J. Wang, F. Zhang and K. Li, “Combining deep reinforcement learning and safety based control for autonomous driving”, arXiv:1612.00147, 2016, [online] Available: http://arxiv.org/abs/1612.00147.
142.
J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning”, J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437-1480, 2015.
143.
P. Palanisamy, “Multi-agent connected autonomous driving using deep reinforcement learning”, Proc. Int. Joint Conf. Neural Netw. (IJCNN), pp. 1-7, Jul. 2020.
144.
S. Bhalla, S. Ganapathi Subramanian and M. Crowley, “Deep multi agent reinforcement learning for autonomous driving” in Advances in Artificial Intelligence, Cham, Switzerland:Springer, pp. 67-78, 2020.
145.
A. Wachi, “Failure-scenario maker for rule-based agent using multi-agent adversarial reinforcement learning and its application to autonomous driving”, arXiv:1903.10654, 2019, [online] Available: http://arxiv.org/abs/1903.10654.
146.
C. Yu et al., “Distributed multiagent coordinated learning for autonomous driving in highways based on dynamic coordination graphs”, IEEE Trans. Intell. Transp. Syst., vol. 21, no. 2, pp. 735-748, Feb. 2020.
147.
P. Dhariwal et al., OpenAI Baselines, 2017, [online] Available: https://github.com/openai/baselines.
148.
A. Juliani et al., “Unity: A general platform for intelligent agents”, arXiv:1809.02627, 2018, [online] Available: http://arxiv.org/abs/1809.02627.
149.
S. Guadarrama et al., TF-Agents: A Library for Reinforcement Learning in Tensorflow, Jun. 2018, [online] Available: https://github.com/tensorflow/agents.
150.
A. Stooke and P. Abbeel, “Rlpyt: A research code base for deep reinforcement learning in PyTorch”, arXiv:1909.01500, 2019, [online] Available: http://arxiv.org/abs/1909.01500.
151.
I. Osband et al., “Behaviour suite for reinforcement learning”, arXiv:1908.03568, 2019, [online] Available: http://arxiv.org/abs/1908.03568.