Exploration in deep reinforcement learning: A survey

Highlights


Exploration in deep reinforcement learning is investigated comprehensively.

New classification method for exploratory approaches is designed.

Promising future research directions on exploration are discussed.

Abstract

This paper reviews exploration techniques in deep reinforcement learning. Exploration techniques are of primary importance when solving sparse reward problems. In sparse reward problems, the reward is rare, which means that the agent will not find the reward often by acting randomly. In such a scenario, it is challenging for reinforcement learning to learn the association between rewards and actions. Thus, more sophisticated exploration methods need to be devised. This review provides a comprehensive overview of existing exploration approaches, which are categorised based on their key contributions as: reward novel states, reward diverse behaviours, goal-based methods, probabilistic methods, imitation-based methods, safe exploration and random-based methods. Then, unsolved challenges are discussed to provide valuable future research directions. Finally, the approaches of different categories are compared in terms of complexity, computational effort and overall performance.

Keywords
Deep reinforcement learning; Exploration; Intrinsic motivation; Sparse reward problems

1. Introduction

In numerous real-world problems, the outcomes of a certain event are only visible after a significant number of other events have occurred. These types of problems are called sparse reward problems since the reward is rare and without a clear link to previous actions. We note that sparse reward problems are common in the real world. For example, during search and rescue missions, the reward is only given when an object is found, or during delivery, the reward is only given when an object is delivered. In sparse reward problems, thousands of decisions might need to be made before the outcomes are visible. Here, we present a review of a group of techniques that can solve this issue, namely exploration in reinforcement learning.

In reinforcement learning, an agent is given a state and a reward from the environment. The task of the agent is to determine an appropriate action. In reinforcement learning, the appropriate action is such that it maximises the reward, or it could be said that the action is exploitative. However, solving problems with just exploitation may not be feasible owing to reward sparseness. With reward sparseness, the agent is unlikely to find a reward quickly, and thus, it has nothing to exploit. Thus, an exploration algorithm is required to solve sparse reward problems.

The most common technique for exploration in reinforcement learning is random exploration [1]. In this type of approach, the agent decides what to do randomly regardless of its progress. The most commonly used technique of this type, called $\epsilon$-greedy, uses a time-decaying parameter $\epsilon$ to reduce exploration over time. This can theoretically solve the sparse reward problem given a sufficient amount of time. However, this is often impractical in real-world applications because learning times can be very long. Nevertheless, we note that even with random exploration alone, deep reinforcement learning has shown some impressive performance in Atari games [2], the MuJoCo simulator [3], controller tuning [4], autonomous landing [5], self-driving cars [6] and healthcare [7].
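
To make the $\epsilon$-greedy scheme concrete, the following is a minimal sketch of $\epsilon$-greedy action selection with a linearly decaying $\epsilon$. The decay schedule, the Q-value interface and the hyperparameter values are illustrative assumptions rather than part of any specific algorithm surveyed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_values, step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Pick an action with epsilon-greedy exploration, where epsilon decays linearly over time.

    q_values: 1-D array of Q-value estimates for the current state (illustrative interface).
    step: current environment step, used to anneal epsilon.
    """
    # Linearly anneal epsilon from eps_start to eps_end over decay_steps steps.
    frac = min(step / decay_steps, 1.0)
    epsilon = eps_start + frac * (eps_end - eps_start)
    if rng.random() < epsilon:
        # Explore: random action, independent of the agent's learning progress.
        return int(rng.integers(len(q_values)))
    # Exploit: greedy action with respect to the current Q-value estimates.
    return int(np.argmax(q_values))
```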

Another solution for exploration could be reward shaping. In reward shaping, the designer ‘artificially’ imposes a reward more often. For example, for search and rescue missions, agents can be given a negative reward every time they do not find the victim. However, reward shaping is a challenging problem that is heavily dependent on the experience of the designer. Punishing the agent too much could lead to the agent not moving at all [8], while rewarding it too much may cause the agent to repeat certain actions infinitely [9]. Thus, with the issues of random exploration and reward shaping, there is a need for more sophisticated exploration algorithms.

While exploration in reinforcement learning was considered as early as 1991 [10], [11], it is still under development. Recently, exploration has shown significant gains in performance compared to non-exploratory algorithms: Diversity is all you need (DIAYN) [12] improved on MuJoCo benchmarks; random network distillation (RND) [13] and pseudocounts [14] were the first to score on the difficult Montezuma's Revenge problem; and Agent57 [15] was the first agent to beat humans in all 57 Atari games.

This review focuses on exploratory approaches which fulfil at least one of the following criteria: (i) the approach determines the exploration degree based on the agent's learning, (ii) it actively decides to take certain actions in the hope of finding new outcomes, and (iii) it motivates itself to continue exploring despite a lack of environmental rewards. In addition, this review focuses on approaches that have been applied to deep reinforcement learning. Note that this review is intended for beginners in exploration for deep reinforcement learning; thus, the focus is on the breadth of approaches and their relatively simplified description. Note also that, throughout the paper, we will use 'reinforcement learning' rather than 'deep reinforcement learning', as it is the more general term.

Several review articles exist in the field of reinforcement learning. Aubert et al. [16] presented an overview of intrinsic motivation in reinforcement learning, Li [17] presented a comprehensive overview of techniques and applications, Nguyen et al. [18] considered an application to multi-agent problems, Levine [19] provided a tutorial and extensive comparison with probabilistic inference methods and [20] provided an extensive description of the key breakthrough methods in reinforcement learning, including ones in exploration. However, none of the aforementioned reviews focused on exploration or considered it in great detail. The only other review focused on exploration is from 1999 and is now outdated and inaccurate [21].

The contributions of this study are as follows. First, a systematic overview of exploration in deep reinforcement learning is presented. As mentioned above, no other modern review exists with this focus. Second, a categorisation of exploration in reinforcement learning is provided. The categorisation is devised to provide a good way of comparing different approaches. Finally, future challenges are identified and discussed.

2. Preliminaries

2.1. Introduction to reinforcement learning

2.1.1. Markov decision process

We consider a standard reinforcement learning setting in which an agent interacts with a stochastic and fully observable environment by sequentially choosing actions over discrete time steps to maximise the cumulative reward. This series of processes is called a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions the agent can select, and $\mathcal{P}$ is a transition probability that satisfies the Markov property given as:
(1) $P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$
$\mathcal{R}$ is a set of rewards, and $\gamma \in [0, 1]$ is a discount factor. At each time step $t$, the agent receives a state $s_t$ from the environment and selects the best possible action $a_t$ according to a policy $\pi$, which maps from states $s_t$ to actions $a_t$. The agent receives a reward $r_t$ from the environment for taking action $a_t$. The goal of the agent is to maximise the discounted expected return $\mathbb{E}\big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\big]$ from each state $s_t$.

2.1.2. Value-based methods

Given that the agent follows a policy $\pi$, the state-value function is defined as $V^{\pi}(s) = \mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s\big]$. Similarly, the action-value function $Q^{\pi}(s, a)$ is the expected value estimate for a given state $s$ when taking an action $a$. Q-learning is a typical type of off-policy learning that updates a target policy using samples generated by any stochastic behaviour policy in the environment. Following the Bellman equation and temporal difference (TD) learning for the action-value function, the Q-learning algorithm is recursively updated using the following equation:
(2) $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big]$
where the maximising action in $\max_{a} Q(s_{t+1}, a)$ follows the greedy target policy and $\alpha$ is the learning rate. While updating Q-learning, the next actions are sampled from the behaviour policy, which follows an $\epsilon$-greedy exploration strategy, and among them, the action that gives the largest Q-value, $\max_{a} Q(s_{t+1}, a)$, is selected.
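
As a concrete illustration of the update in Eq. (2), the following is a minimal tabular sketch. The environment interface (env.reset(), env.step() returning a (next_state, reward, done) triple), hashable states and the hyperparameter values are illustrative assumptions, not a reference implementation from the surveyed works.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, n_actions=4, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy, following Eq. (2)."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))  # Q(s, a) table, zero-initialised

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy over the current Q estimates.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # TD target uses the greedy (target-policy) action: max_a Q(s', a).
            td_target = reward + gamma * (0.0 if done else np.max(Q[next_state]))
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q
```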

2.1.3. Policy-based methods

In contrast to value-based methods, policy-based methods directly update a policy parameterised by $\theta$. In reinforcement learning, because the goal is to maximise the expected return over states, the objective function for the policy is defined as $J(\theta) = \mathbb{E}_{\pi_{\theta}}\big[\sum_{t} \gamma^{t} r_t\big]$. Williams [22] suggested the REINFORCE algorithm, which updates the policy network by taking a gradient ascent step in the direction of $\nabla_{\theta} J(\theta)$. The gradient of the objective function is expressed as:
(3) $\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\big]$
where $\rho^{\pi}$ denotes the state distribution. A general overview of reinforcement learning can be found in [23].
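
The gradient in Eq. (3) can be sketched as follows for a simple linear-softmax policy, where the empirical discounted return is used as the estimate of $Q^{\pi}(s, a)$. The feature and episode interfaces, as well as the learning rate, are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=1e-2):
    """One REINFORCE gradient-ascent step for a linear-softmax policy.

    theta: parameter matrix of shape (n_actions, n_features).
    episode: list of (features, action, reward) tuples collected with the current policy.
    """
    grad = np.zeros_like(theta)
    G = 0.0
    # Walk the episode backwards to accumulate the discounted return G_t.
    for features, action, reward in reversed(episode):
        G = reward + gamma * G
        probs = softmax(theta @ features)
        # grad log pi(a|s) for a linear-softmax policy: (1{a} - pi(.|s)) outer features.
        dlog = -np.outer(probs, features)
        dlog[action] += features
        grad += dlog * G
    return theta + lr * grad  # gradient ascent on J(theta)
```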

2.2. Exploration
Exploration can be defined as the activity of searching and finding out about something [24]. In the context of reinforcement learning, “something” is a reward function and the “searching and finding” is an agent’s attempt to try to maximise the reward function. Exploration in reinforcement learning is of particular importance because a reward function is often complex and agents are expected to improve over their lifetime. Exploration can take various forms such as randomly taking certain actions and seeing the output, following the best known solution, or actively considering moves that are good for novel discoveries.

Problems that can be solved by exploration are common in nature. Exploration is the act of searching for a solution to a problem. We note that exploration is most useful in problems in which a route to the actual solution (i.e. reward) is obstructed by local minima (maxima) or areas of flat reward. These conditions mean that discovering the true nature of rewards is challenging. The following examples are intuitive illustrations of such problems: (i) search and rescue: the agent needs to explore to find a target (victim); the agent is only rewarded when it finds the victim; otherwise, the reward is 0; and (ii) delivery: trying to deliver an object in unknown areas; the agent is only rewarded when the appropriate drop-off point has been found; otherwise, the reward is 0. Exploration can be considered a ubiquitous problem that is highly relevant to many domains of ongoing research.

2.3. Challenging problems
In this section, some of the challenging problems for exploration in reinforcement learning are described, namely noisy-TV and sparse reward problems.

2.3.1. Noisy-TV
In the noisy-TV [13] problem, the agent gets stuck exploring an infinite number of states that lead to no reward. This phenomenon can be easily explained with an example. Imagine a state containing a virtual TV where the agent can operate the remote, but operating the remote controller leads to no reward. A new random image is generated on the TV every time the remote is operated. Thus, the agent will experience novelty all the time. This keeps the agent's attention indefinitely but clearly leads to no meaningful progress. This kind of behaviour can also be described as the couch potato problem.

2.3.2. Sparse reward problems
Sparse rewards are a classical problem in exploration. In the sparse reward problem, the reward is relatively rare. In other words, there is a long gap between an action and a reward. This is problematic for reinforcement learning because for a long time (or at all times) it has no reward to learn from. The agent cannot learn any useful behaviours and eventually converges to a trivial solution. As an example, consider a maze where the agent has to complete numerous steps before reaching the end and being rewarded. The larger the maze is, the less likely it is for the agent to see the reward. Eventually, the maze will be so large that the agent will never see the reward; thus, it will have no opportunity to learn.

2.4. Benchmarks
In this section, the most commonly used benchmarks for reinforcement learning are briefly introduced and described. We highlight four benchmarks: Atari games, VizDoom, Malmo (Minecraft), and MuJoCo.

2.4.1. Atari games
The Atari games benchmark is a set of 57 Atari games combined under the Arcade Learning Environment (ALE) [25]. In Atari games, the state space is normally either images or random-access memory (RAM) snapshots. The action space consists of five joystick actions (up, down, left, right, and the action button). Atari games can be largely split into two groups: easy exploration (54 games) and difficult exploration (3 games) [26]. In easy exploration problems, the reward is relatively easy to find. In hard exploration problems, the reward is not often given, and the association between states and rewards is complex.

2.4.2. VizDoom
VizDoom [27] is a benchmark based on the Doom game. The game has a first-person perspective (i.e., the view from the character's eyes), and the image seen by the character is normally used as the state space. The action space normally consists of eight directional controls and two action buttons (picking up key cards and opening doors). Note that more actions can be added if needed. One of the key advantages of VizDoom is the availability of easy-to-use tools for editing scenarios and its low computational burden.

2.4.3. Malmo
Malmo [28] is a benchmark based on the game Minecraft. In Minecraft, environments are built using same-shaped blocks, similar to how Lego bricks are used for building. Similar to VizDoom, it is also from the first-person perspective, and the image is the state space. The key advantage of Malmo is its flexibility in terms of the environment structure, domain size, custom scripts, and reward functions.

2.4.4. Mujoco
MuJoCo [29] stands for Multi-Joint dynamics with Contact. MuJoCo is a popular benchmark used for physics-based simulations. In reinforcement learning, MuJoCo is typically used to simulate walking robots. These are typically the cheetah, ant, and humanoid robots and their derivatives. The task of reinforcement learning is to control various joint angles and forces to develop walking behaviour. Normally, the task is to walk as far as possible or to reach a specific goal.

3. Exploration in reinforcement learning
Exploration in reinforcement learning can be split into two main streams: efficiency and safe exploration. In efficiency, the idea is to make exploration more sample efficient so that the agent can explore in as few steps as possible. In safe exploration, the focus is on ensuring safety during exploration. We suggest splitting efficiency-based methods further into imitation-based and self-taught methods. In imitation-based learning, the agent learns how to utilise a policy from an expert to improve exploration. In self-taught methods, learning is performed from scratch. Self-taught methods can be further divided into planning, intrinsic rewards, and random methods. In planning methods, the agent plans its next action to gain a better understanding of the environment. In random methods, the agent does not make conscious plans; rather, it explores and then sees the consequences of this exploration. We divide intrinsic reward methods into two categories: (i) reward novel states: reward agents for visiting novel states; and (ii) reward diverse behaviours: reward agents for discovering novel behaviours. Note that intrinsic rewards are part of the larger notion of intrinsic motivation. For an extensive review of intrinsic motivation, see [16], [30]. In planning methods, two distinct categories are considered: (i) goal-based: an agent is given an exploratory goal to reach; and (ii) probability-based: probabilistic models of the environment are used. An overview of the entire categorisation is presented in Fig. 1. In the following, each category is described in detail. The main objective of the categorisation is to highlight the key contribution of each approach. Note that a certain approach could be a combination of various techniques. For example, Go-explore [31] utilises reward novel state methods, but its main contribution is best described by goal-based methods.

Fig. 1. Overview of exploration in reinforcement learning.

3.1. Reward novel states
In this section, approaches on rewarding novel states are discussed and compared. Reward novel state approaches give agents a reward for discovering new states. This reward is called an intrinsic reward. As can be observed in Fig. 2, the intrinsic reward ($r^{i}$) supplements the reward given by the environment ($r^{e}$, called an extrinsic reward). By rewarding novel states, agents will incorporate exploration into their behaviours [30].

Fig. 2. Overview of the reward novel state methods. In general, in reward novel states, the agent is given an additional reward $r^{i}$ for discovering novelty. This additional reward is generated by an intrinsic reward module.

These approaches were generalised in [30]. In general, there are two necessary components: “an adaptive predictor or compressor or model of the growing data history as the agent is interacting with its environment to provide an intrinsic reward, and a general reinforcement learner to learn behaviours” [30]. In this division, the reinforcement learner is asked to invent things which the predictor does not know yet. In our review, the former is simply referred to as the intrinsic reward module, and the latter is referred to as the agent.

There are different ways of classifying intrinsic rewards [16], [32]. Here, we largely follow the classification of [16] with the following categories: (i) prediction error methods, (ii) count-based methods and (iii) memory methods.

3.1.1. Prediction error methods
In prediction error methods, the error of a prediction model when predicting a previously visited state is used to compute the intrinsic reward. For a certain state, if the model's prediction is inaccurate, it means that the given state has not been seen often and the intrinsic reward is high. One of the key questions that needs to be addressed is how to use the model's error to compute the intrinsic reward. To this end, Achiam and Sastry [33] compared two intrinsic reward functions: (i) the magnitude of the prediction model's error and (ii) the learning progress. The first method has shown better performance and is therefore recommended; it can be formalised as:
(4) $r^{i}_{t} = g\big(\lVert M(\phi(s_t)) - \phi(s_{t+1}) \rVert\big)$
where $s$ represents a state, $M$ is an environmental model, $t$ and $t+1$ are two consecutive time steps, $\phi$ is an optional model for state representation, and $g$ is an optional reward scaling function.
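
A minimal sketch of a prediction-error intrinsic reward in the spirit of Eq. (4) is given below. The forward model, encoder and scaling constant are illustrative assumptions rather than the formulation of any specific method from the table below.

```python
import numpy as np

def prediction_error_bonus(forward_model, encoder, state, next_state, scale=1.0):
    """Intrinsic reward r_i = g(||M(phi(s_t)) - phi(s_{t+1})||), cf. Eq. (4).

    forward_model: callable predicting the next state representation from the current one.
    encoder: callable phi mapping raw states to a (possibly compact) representation.
    """
    phi_t = encoder(state)
    phi_tp1 = encoder(next_state)
    predicted = forward_model(phi_t)
    error = np.linalg.norm(predicted - phi_tp1)  # large error => rarely seen transition
    return scale * error                          # g(.) taken here as a simple linear scaling
```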

The simplest method of this type was described in [10], [11]. The intrinsic reward is measured as the Euclidean distance between a model's prediction of a state and that state. This simple idea was revisited in [34]. Generative adversarial networks (GANs [35]), distinguishing real from fake states, were proposed as a prediction error method in [36]. Since then, many other approaches have been devised, which can be further divided into: (i) state representation prediction, (ii) a priori knowledge and (iii) uncertainty about the environment.

State representation prediction methods.
In state representation prediction methods, the state is represented in a higher-dimensional space. Then, a model is tasked with predicting the next state representation given the previous state representation. The larger the prediction error is, the larger the intrinsic reward is. One way of providing a state representation is to use an autoencoder [37]. Both pre-trained and online-trained autoencoders were considered and showed similar performance. Improvements to autoencoder-based approaches were proposed in [38], [39], where a slowly trained autoencoder was added. Thus, the intrinsic reward decays more slowly and the agent explores for longer, increasing the chance of finding the optimal reward.

Another method of providing a state representation involves utilising a fixed network with random weights. Another network is then used to predict the outputs of the randomly initialised network, as shown in Fig. 3. The most popular approach of this type is called random network distillation (RND) [13]. A similar approach was considered in [40].
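
The following is a compact sketch of the RND idea [13]: a fixed, randomly initialised target network and a predictor network trained to match it, with the prediction error used as the intrinsic reward. The network sizes, optimiser and update schedule are illustrative assumptions; only the two-network structure follows [13].

```python
import torch
import torch.nn as nn

def make_net(obs_dim, out_dim=64):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

class RNDBonus:
    """Random network distillation: intrinsic reward = predictor's error on a fixed random target."""

    def __init__(self, obs_dim, lr=1e-4):
        self.target = make_net(obs_dim)      # fixed, randomly initialised target network
        self.predictor = make_net(obs_dim)   # trained to imitate the target's outputs
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def intrinsic_reward(self, obs):
        # obs: tensor of shape (batch, obs_dim); error is high for rarely seen observations.
        with torch.no_grad():
            err = (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
        return err

    def update(self, obs):
        loss = (self.predictor(obs) - self.target(obs)).pow(2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```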

A state representation method derived from inverse dynamics features (IDF) was used in [41]. In IDF, the representation comes from forcing the agent to predict the action, as illustrated in Fig. 4. IDF was compared against the state prediction method and random representation in [42], with the following conclusions: IDF had the best performance and it scaled best to unseen environments. IDF was utilised in [43], where the Euclidean distance between two consecutive state representations was used as the intrinsic reward, as shown in Fig. 4. Intuitively, the more significant the transition is, the larger the change in IDF's state representation. In another study, RND and IDF were combined into a single intrinsic reward [44].
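
A minimal sketch of the RIDE-style reward described above is given below, assuming an embedding function phi learned via inverse dynamics (not shown). The episodic count normalisation follows the description in [43], but its exact form here is an illustrative assumption.

```python
import numpy as np

def ride_bonus(phi, state, next_state, episodic_counts):
    """Impact-driven bonus: change in the learned state representation, scaled by an episodic count."""
    impact = np.linalg.norm(phi(next_state) - phi(state))
    # Discourage bouncing between the same few states within an episode.
    key = tuple(np.round(phi(next_state), 2))
    episodic_counts[key] = episodic_counts.get(key, 0) + 1
    return impact / np.sqrt(episodic_counts[key])
```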

Fig. 3. RND overview. The predictor is trying to predict output of a randomly parameterised target.

A compact representation using information theory was proposed in [45]. Information theory is used so that states that are close in the environment space are also close in the representation space. Information theory can also be used to create a bottleneck latent representation [46]. A bottleneck latent representation is obtained when the mutual information between the input to the network and the latent representation is minimised.

Fig. 4. IDF and rewarding impact driven exploration (RIDE) overview. In IDF, the features are extracted based on the network predicting the next action. In RIDE, the intrinsic reward is based on the difference in state representation (adapted from [43]).

A priori knowledge methods.
In some types of problems, it makes sense to use certain parts of the state space as an error signal for computing the intrinsic reward. Those parts could be a depth point cloud, position, or sound, and they rely on a priori knowledge from the designer.

Depth point cloud prediction error was used in [47]. The scalability of this approach was analysed in [48]. It was found that the performance was good in the same environment with different starting positions, but it did not scale to a new scenario. Positions in a 3D space can also be used [49]. An approach using the agent's position was proposed in [50]. The environment is split into an x–y grid, and an intrinsic reward is assigned to each grid node. When the episode terminates, the rewards are restored to a default value.

Sound as a source of intrinsic reward was used in [51]. To model sounds, the model is trained to recognise when the sound and the corresponding frame match. If the model indicates misalignment between frames and sounds, it means that the state is novel.

Uncertainty about the environment methods.
In these methods, the intrinsic reward is based on the uncertainty the agent has. If the agent is exploring highly uncertain areas of the environment, the reward is high. Uncertainty can be utilised using the following techniques: Bayesian, ensembles of models and information-theoretic approaches.

Bayesian approaches are generally intractable for large problem spaces; thus, approximations are used. Kolter et al. [52] presented a close-to-optimal approximation method using the Dirichlet probability distribution over the state, action, and next-state triplet. Another approximation could be to use ensembles of models, as proposed in [53]. The intrinsic reward is given based on model disagreement, as shown in Fig. 5. The models were initialised with different random weights and were trained on different mini-batches to maintain diversity.
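
A small sketch of a disagreement-based intrinsic reward in the spirit of [53] follows: the variance of an ensemble of forward models is used as the bonus. The ensemble interface is an illustrative assumption.

```python
import numpy as np

def disagreement_bonus(ensemble, state_repr, action):
    """Intrinsic reward = variance of next-state predictions across an ensemble of models.

    ensemble: list of callables, each predicting the next state representation.
    """
    predictions = np.stack([model(state_repr, action) for model in ensemble])
    # High variance => the models disagree => the (state, action) region is poorly understood.
    return float(np.mean(np.var(predictions, axis=0)))
```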

In information-theoretic approaches, the intrinsic reward is computed using the information gained from the agent's actions. The higher the gain is, the more the agent learns, and the higher the intrinsic reward is. The general framework for these types of approaches was presented in [54], [55]. One of the most popular information-theoretic approaches is variational information maximisation exploration (VIME) [56]. In this approach, the information gain is approximated as the Kullback–Leibler (KL) divergence between the weight distributions of a Bayesian neural network before and after seeing new observations. In [57], maximising the mutual information between a sequence of actions and the state it leads to is rewarded. Rewarding this mutual information gain means maximising the information contained in the action sequence about a state. Mutual information gain was combined with the state prediction error into a single intrinsic reward in [58], [59].

Discussion.
The key advantage of prediction error methods is that they rely only on a model of the environment. Thus, there is no need for buffers or complex approximation methods. Each of the four different categories of methods has unique advantages and challenges.

Fig. 5. Overview of self-supervised exploration via the disagreement method. The intrinsic reward is based on disagreement between models (adapted from [53]).

While predicting the state directly requires little to no a priori knowledge, the model needs to learn how to recognise different states. Additionally, such methods struggle when many states are present in the environment. State representation methods can cope with large state spaces at the cost of an increased designer burden and reduced accuracy. Moreover, in a state representation method, the agent cannot affect the state representation, which can often lead to different states being represented similarly. Utilising a priori knowledge relies on defining a special element of the state space as the source of error for computing the intrinsic reward. These methods do not suffer from problems with the speed of prediction and state recognition. However, they rely on the designer's experience to define parts of the state space appropriately. Finally, in uncertainty-about-the-environment approaches, the agent's uncertainty is used to generate the intrinsic reward. The key advantage of this approach is its high scalability and automatic transition between exploration and exploitation. Prediction error methods have also shown the ability to solve the couch potato (noisy-TV) problem by storing observations in a memory buffer [60]. An intrinsic reward is given only when an observation is sufficiently far away (in terms of time steps) from the observations stored in the buffer. This mitigates the couch potato problem, since repeatedly visiting states close to each other is not rewarded.

3.1.2. Count-based methods
In count-based methods, each state is associated with a visitation count $N(s)$. If a state has a low count, the agent will be given a high intrinsic reward to encourage revisiting it. The method of computing the reward based on the count was discussed in [61]. It has been shown that a bonus proportional to $1/\sqrt{N(s)}$ guarantees a faster convergence rate than the commonly used $1/N(s)$.
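
A minimal sketch of a count-based bonus using a visitation table keyed by a discretised state follows; the discretisation scheme and the bonus coefficient are illustrative assumptions.

```python
from collections import defaultdict
import numpy as np

visit_counts = defaultdict(int)

def count_bonus(state, beta=0.1, bins=10):
    """Return beta / sqrt(N(s)) after incrementing the visitation count of the discretised state."""
    key = tuple(np.floor(np.asarray(state, dtype=float) * bins).astype(int))
    visit_counts[key] += 1
    return beta / np.sqrt(visit_counts[key])
```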

In problems with a large number of states, counting visits to states is difficult because it requires saving the count for each state. To solve this problem, the count is normally computed on a reduced-size state representation.

Count on state representation methods.
In count on state representation methods, the states are represented in a reduced form to alleviate memory requirements. This allows storing the count and a state with minimal memory in a table, even in the case of a large state space.

One of the popular methods of this type was proposed in [62], where static hashing was used. Here, a technique called SimHash [63] was used, which represents images as a set of numbers called a hash. To generate an even more compact representation, in [64], the state was represented as the learned x–y position of an agent. This was achieved using an attentive dynamic model (ADM). Successor state representation (SSR) [65] is a method which combines count and representation. The SSR is based on the count and order between the states. Intuitively, the SSR can be used as a count replacement.
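
A sketch of SimHash-style counting as described above: states are projected with a fixed random matrix and the resulting sign pattern is used as the hash key for the count table. The projection size and bonus coefficient are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

class SimHashCounter:
    """Count visits to hashed states: hash(s) = sign(A s) with a fixed Gaussian matrix A."""

    def __init__(self, state_dim, n_bits=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((n_bits, state_dim))  # fixed random projection
        self.counts = defaultdict(int)

    def bonus(self, state, beta=0.1):
        bits = tuple((self.A @ np.asarray(state, dtype=float) > 0).astype(int))
        self.counts[bits] += 1
        return beta / np.sqrt(self.counts[bits])
```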

It is also possible to approximate the count of a state representation by using a function. For example, Bellemare et al. [14] proposed an approximation based on a density model. The density models include context tree switching (CTS) [14], Gaussian mixture models [66] or PixelCNN [67]. Martin et al. [68] proposed an improvement to the approximate count by computing counts on the feature space rather than on raw inputs.

Discussion.
Count-based methods approximate the intrinsic reward by counting the number of times a given state has been visited. To reduce the computational effort of count-based methods, counts are generally associated with state representations rather than states. This, however, relies on being able to efficiently represent states. State representations can still require a lot of memory and careful design.

3.1.3. Memory methods
In these methods, an intrinsic reward is given based on how easy it is to distinguish a state from all others. The easier a state is to distinguish from the others, the more novel it is. As comparing states directly is computationally expensive, several approximation methods have been devised. Here, we categorise them into comparison models and experience replay.

Models can be trained to compare state-to-state to reduce the computational load. One example method is to use the exemplar model [69] developed in [70]. Exemplar models are a set of $n$ classifiers, each of which is trained to distinguish a specific state from the others. Training multiple classifiers is generally computationally expensive. To further reduce the computational cost, the following two strategies are proposed: updating a single exemplar with each new data point and sampling $k$ exemplars from a buffer.

Instead of developing models for comparison, a limited-size experience replay was combined with prediction error methods in [71]. To devise the intrinsic reward, two rewards are combined: (i) an intrinsic episodic reward, where an experience replay is used to store states and compare them to others; and (ii) an intrinsic motivation reward, where RND [13] is used to determine the state's long-term novelty. Additionally, multiple policies are trained, each with a different ratio between the extrinsic and intrinsic reward. A meta-learner to automatically choose different ratios of extrinsic and intrinsic rewards at each step was proposed in [15].
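
A simplified sketch of an episodic memory bonus in the spirit of [71] follows: the current state embedding is compared with its nearest neighbours in an episodic buffer, and the result is modulated by a long-term novelty signal such as an RND error. The distance kernel and the combination rule are illustrative assumptions.

```python
import numpy as np

def episodic_bonus(embedding, memory, k=10, eps=1e-3):
    """Larger when the current embedding is far from anything seen so far in the episode."""
    if memory:
        dists = np.sort([np.linalg.norm(embedding - m) for m in memory])[:k]
        similarity = np.sum(1.0 / (dists + eps))
        bonus = 1.0 / np.sqrt(similarity + eps)
    else:
        bonus = 1.0  # first state of the episode is maximally novel
    memory.append(embedding)
    return bonus

def combined_bonus(embedding, memory, long_term_bonus, max_scale=5.0):
    """Scale the episodic bonus by a clipped long-term novelty signal (e.g. an RND error)."""
    return episodic_bonus(embedding, memory) * min(max(long_term_bonus, 1.0), max_scale)
```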

Discussion.
In memory-based approaches, the agent derives an intrinsic reward by comparing its current state with states stored in the memory. The comparison model method has the advantage of small memory requirements, but requires careful model parameter tuning. On the other hand, using a limited experience buffer does not suffer from model inaccuracies and has shown great performance in difficult exploration Atari games.

3.1.4. Summary
The reward novel state-based approaches are summarised in Table 1 (the legend is given below the table). Prediction error methods are the most commonly used methods. In general, they have shown very good performance (for example, RND [13] with 6500 in Montezuma's Revenge). However, they normally require a hand-designed state representation method for computational efficiency. This requires problem-specific adaptations, thus reducing the applicability of those approaches. Count-based methods are computationally efficient, but they can require either memory to store counts or complex models [14]. Also, counting states in continuous-state domains is challenging and requires grouping continuous states into discrete chunks. Recently, memory methods have shown good performance in games such as Montezuma's Revenge, scoring as much as 11,000 [71]. Memory methods require a careful balance of how much data to remember for comparison. Otherwise, computing the comparison can take a long time.

Table 1. Comparison of reward novel state approaches.

R | Prior Knowledge | U | Method | Top score on a key benchmark | Input Types | O | MB/MF | A/S
---|---|---|---|---|---|---|---|---
Pathak et al. [41] | | A3C | Prediction error | Vizdoom: very sparse 0.6 (A3C 0) | Vizdoom image | P | MB | D/D
Stadie et al. [37] | autoencoder | DQN | Prediction error | Atari: Alien 1436 (DQN 300) | Atari images | Q | MB | D/D
Savinov et al. [60] | pretrained discriminator (non-online) | PPO | Prediction error | Vizdoom: very sparse 1 (PPO 0); Dmlab: very sparse 30 (PPO+ICM 0) | Vizdoom images/ Mujoco joints angles | Ac | MB | C/C
Burda et al. [13] | | PPO | Prediction error | Atari: Montezuma Revenge 7500 (PPO 2500) | Atari images | P | MB | D/D
Bougie and Ichise [38] | | PPO | Prediction error | Atari: Montezuma Revenge 20 rooms found (RND 14) | Images | Ac | MB | D/D
Hong et al. [36] | | DQN | Prediction error | Atari: Montezuma Revenge 200 (DQN 0) | Enumerated state id/ Atari image | Q | MB | D/D
Kim et al. [45] | | TRPO | Prediction error | Atari: Frostbite 6000 (ICM 3000) | Atari images | Ac | MB | D/D
Stanton and Clune [50] | agent position, reward grid | A2C | Prediction error | Atari: Montezuma Revenge 3200 (A2C 0) | Atari images | Ac | MB | D/D
Achiam and Sastry [33] | | TRPO | Prediction error | Mujoco: halfcheetah 80 (VIME 40); Atari: Venture 400 (VIME 0) | Atari RAM states/ Mujoco joints angles | Ac | MB | C/C
Li et al. [44] | | A2C | Prediction error | Atari: Asterix 500000 (RND 10000) | Atari images | Ac | MB | D/D
Kim et al. [46] | | PPO | Prediction error | Atari: Montezuma Revenge with distraction 1500 (RND 0) | Atari images | Ac | MB | D/D
Chien and Hsu [59] | | DQN | Prediction error | PyDial: 85 (CME 80); OpenAI: Mario 0.8 (CME 0.8) | Images | Q | MB | D/D
Li et al. [34] | | DDPG | Prediction error | Robot: FetchPush 1 (DDPG 0) | Robot joints angles | Ac | MB | C/C
Raileanu and Rocktäschel [43] | | IMPALA | Prediction error | Vizdoom: 0.95 (ICM 0.95) | Vizdoom images | Ac | MB | D/D
Mirowski et al. [47] | | A3C | Prediction error | DM Lab: Random Goal 96 (LSTM-A3C 65) | DM Lab images | Ac | MB | C/C
Tang et al. [62] | | TRPO | Count-based | Atari: Montezuma Revenge 238 (TRPO 0); Mujoco: swimmergather 0.3 (VIME 0.15) | Atari images/ Mujoco joints angles | P | MF | C/C
Martin et al. [68] | Blob-PROST features | SARSA-e | Count-based | Atari: Montezuma Revenge 2000 (SARSA 200) | Blob-PROST features | Q | MB | D/D
Machado et al. [65] | | DQN | Count-based | Atari: Montezuma Revenge 1396 (Pseudo-counts 1671) | Atari images | Q | MF | D/D
Ostrovski et al. [67] | | DQN and Reactor | Count-based | Atari: Gravitar 1500 (Reactor 1000) | Atari images | Ac | MB | D/D
Badia et al. [71] | | R2D2 | Memory | Atari: Pitfall 15000 (R2D2 −0.5) | Atari images | P | MB | D/D
Badia et al. [15] | | R2D2 | Memory | Beat humans in all 57 Atari games | Atari images | P | MB | D/D
Fu et al. [70] | state encoder | TRPO | Memory | Mujoco: SparseHalfCheetah 173.2 (VIME 98); Atari: Frostbite 4901 (TRPO 2869); Doom: MyWayHome 0.788 (VIME 0.443) | Atari images/ Mujoco joints angles | Ac | MB | C/C
Legend: A — action space, Ac — action, R — reference, MB — model based, MF — model free, D — discrete, C — continuous, Q — Q values, V — values, P — policy, O — output, S — state space, U — underlying algorithm and Top score on a key benchmark explanation - [benchmark]:[scenario] [score] ([baseline approach] [score]).

3.2. Reward diverse behaviours
In reward diverse behaviours, the agent collects as many different experiences as possible, as shown in Fig. 6. This makes exploration an objective in itself rather than a means of finding reward. These types of approaches can also be called diversity methods and can be split into evolution strategies and policy learning.

Fig. 6. Overview of reward diverse behaviour-based methods. The key idea is for the agent to experience as many things as possible, in which either evolution or policy learning can be used to generate a set of diverse experiences.

3.2.1. Evolution strategies
Rewarding diverse behaviours was initially used with evolutionary approaches. In evolutionary approaches, a group of sample solutions (a population) is tested and evolves over time to get closer to the optimal solution. Note that evolutionary approaches are generally not considered part of reinforcement learning but can be used to solve the same type of problems [72], [73].

One of the earliest methods, called novelty search, was devised in [74], [75]. In novelty search, the agent is encouraged to generate numerous different behaviours using a metric called the diversity measure. The diversity measure must be hand-designed for each environment, limiting transferability between different domains. Recently, novelty search has been combined with other approaches, such as reward maximisation [76] and the reward novel state method [77]. In Conti et al. [76], the novelty-search policy is combined with a reward maximisation policy to encourage diverse behaviours and search for the reward. Gravina et al. [77] compared three ways of combining novelty search and reward novel state: (i) novelty search alone, (ii) the sum of reward novel state and novelty search, and (iii) sequential optimisation, of which the second performed best in a simulated robot environment. More detailed reviews of exploration in evolution strategies can be found in [78], [79].
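
A minimal sketch of the novelty-search score follows: the novelty of a policy's behaviour characterisation is its mean distance to the k nearest behaviours in an archive. The behaviour characterisation itself (e.g. the agent's final position) is the hand-designed, domain-specific element mentioned above, and the value of k is an illustrative assumption.

```python
import numpy as np

def novelty_score(behaviour, archive, k=15):
    """Mean distance to the k nearest behaviour characterisations stored in the archive."""
    if not archive:
        return float("inf")  # the first behaviour is maximally novel
    dists = np.sort([np.linalg.norm(np.asarray(behaviour) - np.asarray(b)) for b in archive])
    return float(np.mean(dists[:k]))
```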

Discussion.
Initially, novelty search was used as a stand-alone technique; however, recently, combining it with other techniques [76], [77] has shown more promise. Such a combination is more beneficial (in terms of reward) as diverse behaviours are more directed towards highly scoring ones.

3.2.2. Policy learning
Recently, diversity measures have been applied in policy learning approaches. The diversity among policies was measured in [80]. Diversity is computed by measuring the distance between policies (either the KL divergence or a simple mean squared error). Very promising results for diversity are presented in [12], as shown in Fig. 7. To generate diverse policies, the objective function consists of (i) maximising the entropy of skills, (ii) making skills inferable from the states visited, and (iii) maximising randomness within a skill. A similar approach was proposed in [81] with a new entropy-based objective function. A combination of diversity with a goal-based approach was proposed in [82]. In this study, the agent learns both diverse goals and goals useful for rewards using the skew-fit algorithm. In the skew-fit algorithm, the agent skews the empirical goal distribution so that rarely visited states are visited more frequently. The algorithm was tested using both simulations and real robots.
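
A compact sketch of the diversity pseudo-reward used in DIAYN-style methods follows: a discriminator tries to infer the active skill from the visited state, and the agent is rewarded when the skill is easy to infer. The discriminator interface and the uniform skill prior are illustrative assumptions.

```python
import numpy as np

def diversity_reward(discriminator, state, skill_id, n_skills):
    """DIAYN-style pseudo-reward: log q(z | s) - log p(z), with a uniform prior over skills.

    discriminator: callable returning a probability distribution over skills given a state.
    """
    q_z_given_s = discriminator(state)                 # shape (n_skills,)
    log_q = np.log(q_z_given_s[skill_id] + 1e-8)       # how identifiable the current skill is
    log_p = np.log(1.0 / n_skills)                     # uniform prior over skills
    return float(log_q - log_p)
```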

In [83], the agent stores a set of successful policies in an experience replay and then minimises the difference between the current policy and the best policies from storage. To allow exploration at the same time, the entropy of parameters between policies is maximised. The results show an advantage over evolution strategies and PPO in sparse reward Mujoco problems.

Fig. 7. An overview of Diversity is all you need (DIAYN), where the agent is encouraged to have as many diverse policies as possible (adapted from [12]).

Discussion.
Diversity in policy-based approaches is a relatively new concept that is still being developed. Careful design of the diversity criterion shows very promising performance, beating standard reinforcement learning by significant margins [12].

3.2.3. Summary
Reward diverse behaviour methods are summarised in Table 2. In evolution strategies approaches, a diverse population is used, whereas in policy learning, a diverse policy is found. Evolution strategies have the potential to find solutions that are not envisioned by designers, as they search for the neural network structure as well as diversity. However, evolution strategies suffer from low sample efficiency, making training either computationally expensive or slow. Policy learning is not able to go beyond pre-specified structures but can also show some remarkable results [12]. Another advantage of policy learning methods is their suitability to both continuous and discrete state-space problems.

Table 2. Comparison of reward diverse behaviour-based approaches.

R | Prior Knowledge | U | Method | Top score on a key benchmark | Input Types | O | MB/MF | A/S
---|---|---|---|---|---|---|---|---
Conti et al. [76] | domain specific behaviours | Reinforce | Evolution strategies | Atari: Frostbite 3785 (DQN 1000) | Atari RAM state/ Mujoco joints angles | Ac | MF | C/C
Gravina et al. [77] | NS | population based | Evolution strategies | Robotic navigation: 400 successes | six range finders, pie-slice goal-direction sensor | Ac | MB | C/C
Lehman and Stanley [74] | measure of policies distance | NEAT | Evolution strategies | maze: 295 (maximum achievable) | six range finders, pie-slice goal-direction sensor | Ac | MF | D/D
Risi et al. [75] | measure of policies distance | NEAT | Evolution strategies | T-maze: solved after 50,000 evaluations | enumerated state id | Ac | MF | D/D
Cohen et al. [81] | | SAC | Policy learning | Mujoco: Hopper 3155 (DIAYN 3120) | Mujoco joint angles | Ac | MB | C/C
Pong et al. [82] | | RIG | Policy learning | Door Opening (distance to the objective): 0.02 (RIG + DISCERN-g 0.04) | Robots joint angles | Ac | MB | C/C
Eysenbach et al. [12] | | SAC | Policy learning | Mujoco: half cheetah 4.5 (TRPO 2) | Mujoco joints angles | Ac | MF | C/C
Hong et al. [80] | | DQN, DDPG, A2C | Policy learning | Atari: Venture 900 (others 0); Mujoco: SparseHalfCheetah 80 (Noisy-DDPG 5) | Atari images/ Mujoco joints angles | Ac/Q | MB | C/C
Gangwani et al. [83] | | Itself | Policy learning | Mujoco: SparseHalfCheetah 1000 (PPO 0) | Robot joints angles | Ac | MF | C/C
Legend: A — action space, Ac — action, R — reference, MB — model based, MF — model free, D — discrete, C — continuous, Q — Q values, V — values, P — policy, O — output, S — state space, U — underlying algorithm and Top score on a key benchmark explanation - [benchmark]:[scenario] [score] ([baseline approach] [score]).

3.3. Goal-based methods
In goal-based methods, states of interest for exploration are used to guide the agent's exploration. In this way, the exploration can immediately focus on largely unknown areas. In these types of methods, the agent requires a goal generator, a policy to reach a goal, and an exploration strategy (see Fig. 8). The goal generator is responsible for creating goals for the agent. The policy is used to achieve the desired goals. The exploration strategy is used to explore once a goal has been achieved or while trying to achieve goals.
在基于目标的方法中,探索的感兴趣状态用于指导智能体的探索。通过这种方式,探索可以立即集中在大部分未知区域。在这些类型的方法中,智能体需要一个目标生成器、一个寻找目标的策略和一个探索策略(见图 8)。目标生成器负责为代理创建目标。该策略用于实现预期目标。探索策略用于在实现目标后或尝试实现目标时进行探索。
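To make these three components concrete, the following minimal Python sketch (our own illustration, not an implementation from any cited work) wires a goal generator, a goal-conditioned policy and an exploration strategy into one loop; GoalGenerator, goal_policy, explore_step and the gym-style env interface are hypothetical placeholders.

import random

# Minimal sketch of a goal-based exploration loop (illustrative only).
# GoalGenerator, goal_policy and explore_step are hypothetical placeholders
# for the three components described above; env is assumed to be gym-style.

class GoalGenerator:
    """Proposes goals; here simply a state sampled from those seen so far."""
    def __init__(self):
        self.seen_states = []

    def observe(self, state):
        self.seen_states.append(state)

    def propose(self):
        return random.choice(self.seen_states)

def goal_policy(state, goal):
    """Goal-conditioned policy: returns an action moving the agent towards the goal."""
    return 0  # placeholder action

def explore_step(state):
    """Exploration strategy used once the goal is reached, e.g. a random action."""
    return random.randint(0, 3)

def run_episode(env, generator, max_steps=100):
    state = env.reset()
    generator.observe(state)
    goal = generator.propose()
    for _ in range(max_steps):
        # Follow the goal-conditioned policy until the goal is reached,
        # then switch to the exploration strategy.
        action = explore_step(state) if state == goal else goal_policy(state, goal)
        state, reward, done, _ = env.step(action)
        generator.observe(state)
        if done:
            break

In practice, the goal generator and the goal-conditioned policy would both be learned, and the exploration strategy could be any of the methods surveyed elsewhere in this review.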

Fig. 8. Illustration of goal-based methods. In goal-based methods, the agent’s task is to reach a specific goal (or state). This goal is then either explored from using another exploration method (left) or used to generate an exploratory target goal (right). The key concept is to guide agents directly to unexplored areas.

Here, we split goal-based methods into two categories: goals to explore from and exploratory goal methods.

3.3.1. Goals to explore from methods
The main techniques used in these methods are: (i) memorise visited states and trajectories — store past states in a buffer and choose an exploratory goal from the buffer; and (ii) learn from the goal — assume that the goal state is known but a path to it is unknown.

One of the most famous approaches in which the goal is chosen from a buffer of this type is Go-Explore [31]. States and trajectories are saved in a buffer and selected probabilistically. Once the state to explore from has been chosen, the agent is teleported there and explores randomly from it. In [84], teleportation was replaced with policy learning, and Go-Explore was extended to continuous domains in [85]. Concurrently, similar concepts were developed in [86], [87], [88], [89]. In these approaches, a trajectory from the past is selected, which the agent either exploits or explores from randomly. If exploration is selected, a state sampled from the trajectory is chosen as a goal to explore based on the visitation count.
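The memorise visited states and trajectories idea can be sketched as follows (our simplification of the scheme in [31]; cell_of is a hypothetical coarse state abstraction, and env.restore stands in for the teleportation used in [31] or the learned return policy of [84]): an archive of cells is kept, a rarely chosen cell is selected probabilistically, and the agent returns to it and then explores randomly.

import random

# Illustrative sketch of archive-based exploration in the spirit of Go-Explore [31].
# States are assumed to be hashable after the coarse cell_of mapping.

def cell_of(state):
    return state  # placeholder, e.g. a tuple of coarse features such as an x-y position

def select_cell(archive):
    # Prefer cells that have been chosen less often.
    cells = list(archive)
    weights = [1.0 / (1 + archive[c]["chosen"]) for c in cells]
    return random.choices(cells, weights=weights, k=1)[0]

def go_explore_iteration(env, archive, explore_steps=50):
    if not archive:
        state = env.reset()
        archive[cell_of(state)] = {"state": state, "chosen": 0}
    cell = select_cell(archive)
    archive[cell]["chosen"] += 1
    state = env.restore(archive[cell]["state"])  # return to the stored state
    for _ in range(explore_steps):               # then explore randomly from it
        action = env.action_space.sample()
        state, reward, done, _ = env.step(action)
        c = cell_of(state)
        if c not in archive:
            archive[c] = {"state": state, "chosen": 0}
        if done:
            break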

Another goal-based method was proposed in [90], where the least-visited state on an x–y grid over an Atari game was selected as the goal. This significantly reduces the computational effort of remembering where the agent has been.

Learn from the goal methods assume that the agent knows what the state with the maximum reward looks like, but does not know how to get there. In this case, it is plausible to utilise this knowledge, as described in [91], [92]. In [91], a model was trained to predict backward steps in reinforcement learning. With such a model, the agent can ‘imagine’ states preceding the goal and thus explore from the goal state. Similarly, a scenario in which the agent can start at the reward position can be conceived; the agent then explores from the goal towards the starting position [92].

Discussion.
Memorise visited states and trajectories methods have shown remarkable results in hard exploration benchmarks such as Montezuma’s Revenge and Pitfall. By utilising a reward state as a goal, as outlined in learn from the goal methods, the exploration problem can be mitigated, as the agent knows where to look for the reward.

3.3.2. Exploratory goal methods
In exploratory goal methods, the agent is given an exploratory goal to reach, and exploration occurs while the agent attempts to reach it. The following techniques are considered: (i) meta-controllers, (ii) goals in the region of highest uncertainty, and (iii) sub-goal methods.

Meta-controllers.
In meta-controller methods, the algorithm consists of two parts: a controller and a worker. The controller has a high-level overview of the task and provides goals that the worker then tries to reach.

One of the simplest approaches is to generate and sample goals randomly [93]. The random goal selection mechanism was refined in [94] by selecting goals based on learning progress. A similar two-phase approach was proposed by Péré et al. [95]: first, the agent explores randomly to learn a representation of the environment; second, goals are randomly sampled from the learned representation. An approach in which both the goal creation and the goal selection mechanisms are devised by a meta-controller was proposed in [96]. In this work, the meta-controller proposes goals within a certain horizon for the worker to reach.

In [97], a multi-arm bandit-based method for choosing one strategy from a group of hand-designed strategies was proposed. In each episode, every ten steps, the agent chooses a strategy based on its performance. Selecting a goal from a group of hand-designed goals is also discussed by Kulkarni et al. [98]; the low-level controller is trained on a state–action space, while the meta-controller is trained on a goal-state space. An approach in which each sub-task is learned by one learner was proposed in [99]. To allow any sub-task learner to perform its task from all states, the starting points for learning are shared between sub-task learners.
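A meta-controller/worker loop of this kind can be sketched as two value functions operating at different levels, loosely following the two-level structure described for [98]; the goal set, the Q-tables and the reached predicate below are illustrative placeholders, and states are assumed to be hashable (e.g. enumerated).

import random
from collections import defaultdict

# Minimal sketch of a meta-controller/worker loop. Goals, Q-tables and the
# reached(state, goal) predicate are hypothetical placeholders.

GOALS = ["key", "door", "ladder"]   # hand-designed goals
meta_q = defaultdict(float)         # value of (state, goal) for the meta-controller
worker_q = defaultdict(float)       # value of ((state, goal), action) for the worker
EPS = 0.1

def choose(q_table, keys):
    """Epsilon-greedy choice over a list of keys of a Q-table."""
    if random.random() < EPS:
        return random.choice(keys)
    return max(keys, key=lambda k: q_table[k])

def run_episode(env, reached, max_steps=500):
    state, done, t = env.reset(), False, 0
    while not done and t < max_steps:
        # The meta-controller picks a goal for the worker.
        goal = choose(meta_q, [(state, g) for g in GOALS])[1]
        # The worker acts until the goal is reached (or the episode ends).
        while not done and t < max_steps and not reached(state, goal):
            action = choose(worker_q, [((state, goal), a) for a in range(env.action_space.n)])[1]
            state, reward, done, _ = env.step(action)
            t += 1
        # Q-learning style updates of worker_q and meta_q would go here.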

Sub-goals.
In sub-goal methods, the algorithm finds sub-goals for the agent to reach. In general, sub-goal methods can be split into: (i) bottleneck states (states which lead to many others) used as exploratory goals, (ii) sub-goals marking progress towards the main goal, which are likely to lead to the reward, and (iii) uncertainty-based sub-goals.

One of the early methods for discovering bottleneck states was described in [100], using an ant colony optimisation method: bottleneck states are the states often visited by ants when exploring (measured by pheromone levels). To discover bottleneck states, [101] proposed the use of proto-value functions based on the eigenvalues of representations. This allows the computation of eigenvector centrality [102], which is high for nodes with many connections. This was later improved in [103] by replacing the handcrafted adjacency matrix with successor representations.

To design sub-goals which lead to a reward, Fang et al. [104] proposed progressively generating sub-tasks that are closer and closer to the main task. To this end, two components are used: a learning progress estimator and a task generator. The learning progress estimator determines the learning progress on the main task, and the task generator uses this progress to generate sub-tasks closer to the main task.

In uncertainty-based methods, the sub-goals for the agent are positioned at the most uncertain states. One of the earliest attempts of this type was proposed by Guestrin et al. [105], in which the upper and lower bounds of the reward are estimated and states with high uncertainty regarding the reward are used as exploratory goals. Clustering states using k-means and visiting the least-visited clusters was proposed in [106]. Clustering can also help to solve the couch potato problem, as described in [107]. In this approach, the states are clustered using Gaussian mixture models, and the agent avoids the couch potato problem by clustering all states generated by a TV into a single avoidable cluster.

Discussion.
There are two main categories of exploratory goal methods: meta-controllers and sub-goals. The key advantage of meta-controllers is that they allow the agent to set its own goals without excessively rewarding itself. However, training the controller is a challenge that has not been fully solved yet. In sub-goal methods, what constitutes a goal is defined by human designers, which puts a significant burden on the designer to provide suitable and meaningful goals.

3.3.3. Summary
The goal-based methods are summarised in Table 3. Goals to explore from methods have recently shown very good performance [84], [89] in difficult exploratory games such as Montezuma’s Revenge. The key challenges of these methods are the need to store states and trajectories, and how to navigate to the goal. The latter issue is partially mitigated in [89] by using the agent’s position as the state representation; however, this is highly problem-specific. Exploratory goal methods are limited because devising an exploratory goal becomes more challenging as the sparsity of the reward increases. This is somewhat mitigated in [94] and [104], but these approaches rely on the ability to parametrise the task.

Table 3. Comparison of Goal-based approaches.

R | Prior Knowledge | U | Method | Top score on a key benchmark | Input Types | O | MB/MF | A/S
Guo et al. [87] |  | A2C and PPO | Goals to Explore from | Atari: Montezuma Revenge 20158 (A2C+CoEX 6600) | Atari images | P | MF | C/C
Guo and Brunskill [86] |  | DQN, DDPG | Goals to Explore from | Mujoco: Fetch Push 0.9 after 400 epochs (DDPG 0.5) | Mujoco joints angles | Q | MB | C/C
Florensa et al. [92] | goal position | TRPO | Goals to Explore from | Mujoco: Key Insertion 0.55 (TRPO 0.01) | Mujoco joints angles | Ac | MF | C/C
Edwards et al. [91] | goal state information | DDQN | Goals to Explore from | Gridworld 0 (DDQN −1) | Enumerated state id | Q | MF | D/D
Matheron et al. [85] | state storage method | DDPG | Goals to Explore from | Maze: reach reward after 146,000 steps (TD3 never) | x–y position | Ac | MB | C/C
Oh et al. [88] |  | A2C | Goals to Explore from | Atari: Montezuma Revenge 2500 (A2C 0) | Atari images | Ac | MF | D/D
Guo et al. [89] | access to agent position | Itself | Goals to Explore from | Atari: Pitfall 11,000 (PPO 0); Robot manipulation task: 40 (PPO 0) | Atari images/ agent positions/ robotics joint angles | Ac | MF | C/C
Ecoffet et al. [31] | teleportation ability | Itself | Goals to Explore from | Atari: Montezuma Revenge 46000 (RND 11000) | Atari images | Ac | MB | D/D
Ecoffet et al. [84] | access to agent position | Itself | Goals to Explore from | Atari: Montezuma Revenge 46000 (RND 11000) | Atari images | Ac | MB | D/D
Hester and Stone [97] | Strategies set | texplore-vanir | Exploratory Goal | Sensor Goal: −53 (greedy −54) | Enumerated state id | Ac | MB | D/D
Machado et al. [101] | handcrafted features | Itself | Exploratory Goal | 4-room domain: 1 | Enumerated state id | Ac | MB | D/D
Machado et al. [103] |  | Itself | Exploratory Goal | 4-room domain: 1 | Enumerated state id | Ac | MB | D/D
Abel et al. [106] |  | DQN | Exploratory Goal | Malmo: Visual Hill Climbing 170 (DQN+boosted 60) | Image/ Vehicle positions | Q | MB | C/C
Forestier et al. [93] | randomly generated goals | Itself | Exploratory Goal | Minecraft: mountain car 84% explored (ε-greedy 3%) | State Id | Ac | MB | C/C
Colas et al. [94] |  | M-UVFA | Exploratory Goal | OpenAI: Goal Fetch Arm 0.8 (M-UVFA 0.78) | Robot joints angles | Ac | MB | C/C
Péré et al. [95] |  | IMGEP | Exploratory Goal | Mujoco (KLC): ArmArrow 7.4 (IMGEP with handcrafted features 7.7) | Mujoco joints angles | Ac | MF | C/C
Ghafoorian et al. [100] |  | Q-learning | Exploratory Goal | Taxi Driver: found goal after 50 episodes (Q-learning after 200) | State Id | Q | MF | D/D
Riedmiller et al. [99] | rewards for auxiliary tasks | Itself | Exploratory Goal | Block stacking: 140 (DDPG 0) | Robot joints angles | Ac | MB | C/C
Fang et al. [104] | tasks parameterisation | Itself | Exploratory Goal | GridWorld: 1 (GoalGAN 0.6) | Robot joints angles, images | Ac | MB | C/C

Legend: A — action space, Ac — action, R — reference, MB — model based, MF — model free, D — discrete, C — continuous, Q — Q values, V — values, P — policy, O — output, S — state space, U — underlying algorithm. Top score on a key benchmark is reported as [benchmark]: [scenario] [score] ([baseline approach] [score]).

3.4. Probabilistic methods
In probabilistic approaches, the agent holds a probability distribution over states, actions, values, rewards or their combination, and chooses the next action based on that distribution. Probabilistic methods can be split into optimistic and uncertainty methods [108]. The main difference between them lies in how the probability is modelled and how the agent utilises it, as shown in Fig. 9. In optimistic methods, the estimate must depend on the reward, either implicitly or explicitly, and the upper bound of the estimate is used to select the action. In uncertainty-based methods, the estimate captures the uncertainty about the environment, such as uncertainty in the value function or in state prediction, and the agent takes actions that minimise this uncertainty. Note that uncertainty methods can use estimates from optimistic methods, but they utilise them differently.

Fig. 9. Overview of probabilistic methods. The agent uses uncertainty over the environment model to either behave optimistically (left) or follow the most uncertain solution (right). Both should lead to a reduction in the uncertainty of the agent.

3.4.1. Optimistic methods
In optimistic approaches, the agent follows the principle of optimism under uncertainty; in other words, it follows the upper confidence bound of the reward estimate. The use of a Gaussian process (GP) as a reward model was presented in [109]; the GP readily provides uncertainty, which can be used for reward estimation. A linear Gaussian algorithm can also be used as a model of the reward [110]. Bootstrapped deep Q-networks (DQN) and Thompson sampling were utilised in [111]; bootstrapped DQNs naturally provide a distribution over rewards and values, so that optimistic decisions can be taken.
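The principle of acting on an upper confidence bound can be illustrated with the generic UCB1 rule for a bandit-like setting; the sketch below is a textbook construction rather than the estimator used in any of the cited works, and it simply keeps a visit count and a running mean reward per action.

import math

# Generic sketch of optimism under uncertainty: act on the upper confidence
# bound of the per-action reward estimate (UCB1-style).

class UCBExplorer:
    def __init__(self, n_actions, c=2.0):
        self.counts = [0] * n_actions   # how often each action was tried
        self.means = [0.0] * n_actions  # running mean reward per action
        self.c = c                      # exploration coefficient

    def act(self):
        total = sum(self.counts) + 1
        # Untried actions get an infinite bonus, so they are tried first.
        ucb = [m + (self.c * math.sqrt(math.log(total) / n) if n > 0 else float("inf"))
               for m, n in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, action, reward):
        self.counts[action] += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]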

It is also possible to hold a set of value functions and sample from them during exploration [112], [113]. The most optimistic value function is used by the agent for an episode, and at the end of the episode the distribution over value functions is updated.

Discussion.
In optimistic approaches, the agent attempts to exploit the principle of optimism under uncertainty. To utilise this principle, the agent needs to be able to model the reward, either by modelling the reward directly or by approximating value functions. Value function approximation can be advantageous as reward sparsity increases, since the agent can utilise the partial reward signal provided by value functions for learning.

3.4.2. Uncertainty methods
In uncertainty-based methods, the agent holds a probability distribution over actions and/or states which represents its uncertainty about the environment, and it chooses actions that minimise this uncertainty. Here, the following subcategories are distinguished: parameter uncertainty, policy and Q-value uncertainty, network ensembles, and information-theoretic approaches.

Parameter uncertainty.
In parameter uncertainty methods, the agent holds uncertainty over the parameters defining the policy. The agent samples parameters from this distribution, follows the resulting policy for a certain time, and updates the distribution based on performance. One of the simplest approaches is to hold a distribution over the parameters of the network [114], from which the network parameters are sampled. Colas et al. [115] split exploration into two phases: (i) explore randomly and (ii) compare the collected experiences to an expert-created imitation to determine good behaviour.

In [116], the successor state representation was utilised as a model of the environment. Exploration was performed by sampling the parameters of a Bayesian linear regression model which predicts the successor representation.

Policy and Q-value uncertainty.
In policy and Q-value uncertainty methods, the agent holds uncertainty over Q-values or actions and samples an appropriate action. Some of the simplest approaches rely on optimisation to determine the distribution parameters. For example, in [117], the cross-entropy method (CEM) was used to control the variance of the Gaussian distribution from which actions were drawn. Alternatively, policies can be sampled [118]: a set of policies is sampled from a base policy, and at the end of the episode the best policy is chosen as an update to the base policy.

The most prevalent approach of this type is to use the Bayesian framework. In [119], a hypothesis is generated once and then followed for a certain number of steps, which saves computational time. This idea was further developed in [120], where Bayesian sampling was combined with a tree-based state representation for further efficiency gains. To enable Bayesian uncertainty approaches in deep learning, O’Donoghue et al. [121] derived Bayesian uncertainty such that it can be computed using the Bellman principle and the output of the neural network.

To minimise the uncertainty about the policy and/or Q-values, information-theoretic approaches can be used. The agent chooses actions that result in the maximal information gain, thus reducing uncertainty about the environment. An example of this approach, called information-directed sampling (IDS), is discussed in [122]. In IDS, the information gain function is expressed as a ratio between the regret of an action and how informative the action is.

Network ensembles.
In network ensemble methods, the agent uses several models (initialised with different parameters) to approximate the distribution. Sampling one model from the ensemble to follow was discussed in [123]: a DQN with multiple heads, each estimating Q-values, was proposed, and at each episode one head was chosen randomly to act.
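The multi-head idea can be sketched in simplified PyTorch form as below (layer sizes are arbitrary, and the bootstrapped data masks of the original work are omitted): all heads share a torso, and one randomly chosen head drives behaviour for an entire episode.

import random
import torch
import torch.nn as nn

# Sketch of a multi-head Q-network in the spirit of [123]; one head is sampled
# per episode and acted on greedily.

class MultiHeadQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, n_heads=10):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(64, n_actions) for _ in range(n_heads))

    def forward(self, obs, head):
        return self.heads[head](self.torso(obs))

def run_episode(env, qnet):
    head = random.randrange(len(qnet.heads))   # sample one head for the whole episode
    obs, done = env.reset(), False
    while not done:
        q = qnet(torch.as_tensor(obs, dtype=torch.float32), head)
        action = int(q.argmax())
        obs, reward, done, _ = env.step(action)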

It is difficult to determine model convergence by sampling one model at a time. Therefore, multiple models approximating the distribution over states were devised in [124]: the Q-values estimated by the different models were computed and fitted to a Gaussian distribution. A similar approach was developed in [125], using the information gain among the environmental models to decide where to go. Another ensemble model was presented in [126], where exploration is achieved by finding a policy which results in the highest disagreement among the environmental models.

Discussion.
In parameter sampling, the policy is parameterised (i.e. represented by a neural network) and a probability distribution over the parameters is maintained. The agent samples parameters and continues the update–exploitation cycle. In contrast, in policy and Q-value sampling methods, the probability distribution is not over policy parameters but over actions and Q-values. The advantage over parameter sampling is faster updates, because the policy can be adjusted dynamically; the disadvantage is that estimating the exact probability is intractable, and thus simplifications need to be made. Another option is to use network ensembles to approximate the distribution over actions/states. The agent can either sample from the distribution or choose one model to follow. While more computationally intensive, this approach can also be updated instantaneously.

3.4.3. Summary
A tabular summary of optimistic and uncertainty approaches is shown in Table 4; the two families have been extensively compared in [108]. That article concludes that the biggest issue for optimistic exploration is that the confidence sets are built independently of each other. Thus, an agent can have multiple states with high confidence, which results in unnecessary exploration as the agent visits states that do not lead to the reward. Remedying this issue would be computationally intractable. In uncertainty methods, the confidence bounds are built depending on each other and thus do not have this problem.

Table 4. Comparison of probabilistic approaches.

R | Prior Knowledge | U | Method | Top score on a key benchmark | Input Types | O | MB/MF | A/S
D’Eramo et al. [111] |  | bDQN, SARSA | Optimistic | Mujoco: Acrobot −100 (Thompson −120) | Mujoco joints angles | Q | MF | C/C
Osband et al. [112] |  | LSVI | Optimistic | Tetris: 5000 (LSVI 4000) | Hand tuned 22 features | Ac | MF | D/D
Jung and Stone [109] |  |  | Optimistic | Mujoco: Inverted Pendulum 0 (SARSA −10) | State Id | Ac | MB | D/C
Xie et al. [110] |  | MPC | Optimistic | Robotic hand simulation: complete each of 10 poses | Joints angles | Ac | MB | C/C
Osband et al. [113] |  | LSVI | Optimistic | Cartpole Swing up: 600 (DQN 0) | State Id | Ac | MF | D/D
Nikolov et al. [122] |  | bDQN and C51 | Uncertainty | 55 Atari games: 1058% of reference human performance | Atari images | Q | MB | D/D
Colas et al. [115] | a set of goal policies | DDPG | Uncertainty | Mujoco: Half Cheetah 6000 (DDPG 5445) | Mujoco joints angles | P | MF | C/C
Osband et al. [123] |  | DQN | Uncertainty | Atari: James Bond 1000 (DQN 600) | Atari images | Q | MB | D/D
Tang and Agrawal [114] |  | DDPG | Uncertainty | Mujoco: sparse MountainCar 0.2 (NoisyNet 0) | Mujoco joints angles | Ac | MF | D/C
Strens [119] |  | Dynamic Programming | Uncertainty | Maze: 1864 (QL SEMI-UNIFORM 1147) | Enumerated state id | Ac | MB | D/D
Akiyama et al. [118] | initial policy guess | LSPI | Uncertainty | Ball batting 2-DoF simulation: 67 (Passive learning 61) | Robot angles | Ac | MB | D/C
Henaff [126] |  | DQN | Uncertainty | Maze: −4 (UE2 −14) | Enumerated state id | Q | MB | D/D
Guez et al. [120] | guess of a prior | Policy learning | Uncertainty | Dearden Maze: 965.2 (SBOSS 671.3) | Enumerated state id | Ac | MB | D/D
O’Donoghue et al. [121] | prior distribution | DQN | Uncertainty | Atari: Montezuma Revenge 3000 (DQN 0) | Atari images | Q | MB | D/D
Shyam et al. [125] |  | SAC | Uncertainty | Chain: 100% explored (bootstrapped-DQN 30%) | Enumerated state id/ Mujoco joints angles | Ac | MB | C/C
Stulp [117] |  |  | Uncertainty | Ball batting: learned after 20 steps | Robot joints angles | Ac | MB | C/C
Janz et al. [116] |  | DQN | Uncertainty | 49 Atari games: 77.55% superhuman (Bootstrapped DQN 67.35%) | Atari images | Q | MB | D/D

Legend: A — action space, Ac — action, R — reference, MB — model based, MF — model free, D — discrete, C — continuous, Q — Q values, V — values, P — policy, O — output, S — state space, U — underlying algorithm. Top score on a key benchmark is reported as [benchmark]: [scenario] [score] ([baseline approach] [score]).

3.5. Imitation-based methods
In imitation learning, exploration is ‘kick-started’ with demonstrations from different sources (usually humans). This is similar to how humans learn, since we are initially guided in what to do by society and teachers. Thus, it is plausible to see imitation learning as a supplement to standard reinforcement learning. Note that demonstrations do not have to be perfect; they just need to be a good starting point. Imitation learning can be categorised into imitation in experience replay and imitation with an exploration strategy, as illustrated in Fig. 10.

Fig. 10. Overview of imitation-based methods. In imitation-based methods, the agent receives demonstrations from an expert on how to behave. These are then used in two ways: (i) directly learning on demonstrations or (ii) using demonstrations as a start for other exploration techniques.

3.5.1. Imitations in experience replay methods
One of the most common techniques is to combine samples from demonstrations with samples collected by the agent in a single experience replay. This guarantees that imitations can be used throughout the learning process alongside new experiences.

In [127], the demonstrations were stored in a prioritised experience replay alongside the agent’s experience, with transitions from demonstrations given a higher probability of being selected. Deep Q-learning from demonstrations (DQfD) [128] differs from [127] in two aspects: first, the agent is pre-trained on demonstrations only; second, the ratio between samples from the agent’s own runs and from demonstrations is controlled by a parameter. Similar work based on R2D2 was reported in [129]. Storing states in two different replays was presented in [130]; every time the agent samples for learning, it samples a certain amount from each buffer.
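The mixing of demonstration and agent experience can be sketched with two buffers and a ratio parameter, as below; this is a simplified illustration of the schemes above (the prioritisation used in [127], [128] is omitted, and all names are ours).

import random

# Sketch of sampling training batches from demonstration and agent buffers at a
# fixed ratio; demo_ratio plays the role of the mixing parameter described above.

demo_buffer = []    # filled once with expert transitions
agent_buffer = []   # filled online by the agent

def store(transition, from_demo=False):
    (demo_buffer if from_demo else agent_buffer).append(transition)

def sample_batch(batch_size=32, demo_ratio=0.25):
    """Return a batch in which roughly demo_ratio of the samples come from demonstrations."""
    n_demo = min(int(batch_size * demo_ratio), len(demo_buffer))
    n_agent = min(batch_size - n_demo, len(agent_buffer))
    batch = random.sample(demo_buffer, n_demo) + random.sample(agent_buffer, n_agent)
    random.shuffle(batch)
    return batch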

Discussion.
Using one or two experience replays seems to have a negligible impact on performance. However, storing everything in one experience replay is conceptually simpler and easier to implement. Moreover, it allows the agent to stop using imitation experiences when they are no longer needed.

3.5.2. Imitation with exploration strategy methods
Instead of using experience replays, imitations and exploration strategies can be combined directly. In such an approach, imitations are used as a ‘kick-start’ for exploration.

A single demonstration was used as a starting point for exploration in [131]. The agent randomly explores from states along a single demonstration run; an agent trained from a mediocre demonstration can still score highly in Montezuma’s Revenge. An auxiliary reward approach was proposed by Aytar et al. [26]. Their architecture can combine several YouTube videos into a single embedding space for training, and the auxiliary reward is added to every n-th frame of the demonstration video. An agent that can ask the demonstrator for help was proposed in [132]: if the agent detects an unknown environment, the human demonstrator is asked to show the agent how to navigate.

Discussion.
Using imitations as a starting point for exploration has shown impressive performance in difficult exploratory games. In particular, [26], [131] scored highly in Montezuma’s Revenge. This is the effect of overcoming the initial burden of exploration through demonstrations. The approach from [131] can score highly in Montezuma’s Revenge with just a single demonstration, making it very sample efficient, while the approach from [26] can combine data from multiple sources, making it more suitable for problems with many demonstrations.

3.5.3. Summary
A comparison of the imitation methods is presented in Table 5. Imitations in experience replay allow the agent to seamlessly and continuously learn from demonstration experiences. Imitations with exploration strategies, on the other hand, have the potential to find good novel strategies around existing ones and have shown a great capability to overcome the initial exploration difficulty. Overall, imitations with exploration strategies achieve better performance than using imitations in experience replay alone.

Table 5. Comparison of imitation-based approaches.

R | Prior Knowledge | U | Method | Top score on a key benchmark | Input Types | O | MB/MF | A/S
Hester et al. [128] | imitation trained policy | DQN | Imitations in Experience Replay | Atari: Pitfall 50.8 (Baseline 0) | Atari images | Q | MF | D/D
Vecerik et al. [127] | demonstrations | DDPG | Imitations in Experience Replay | Peg insertion: 5 (DDPG −15) | Robot joints angles | Ac | MF | C/C
Nair et al. [130] | demonstrations | DDPG | Imitations in Experience Replay | Brick stacking: Pick and Place 0.9 (Behaviour cloning 0.8) | Robot joints angles | Ac | MF | C/C
Gulcehre et al. [129] | demonstrations | R2D2 | Imitations in Experience Replay | Hard-eight: Drawbridge 12.5 (R2D2 0) | Vizdoom images | Ac | MF | D/D
Aytar et al. [26] | YouTube embedding | IMPALA | Imitation with Exploration Strategy | Atari: Montezuma’s Revenge 80k (DQfD 4k) | Atari images | Ac | MF | D/D
Salimans and Chen [131] | single demonstration | PPO | Imitation with Exploration Strategy | Atari: Montezuma Revenge with distraction 74500 (Playing by YouTube 41098) | Atari images | Ac | MF | D/D

Legend: A — action space, Ac — action, R — reference, MB — model based, MF — model free, D — discrete, C — continuous, Q — Q values, V — values, P — policy, O — output, S — state space, U — underlying algorithm. Top score on a key benchmark is reported as [benchmark]: [scenario] [score] ([baseline approach] [score]).

3.6. Safe exploration
In safe exploration, the problem of preventing agents from behaving unsafely is considered. This is an important aspect of exploration research, as the agent’s safety needs to be ensured. Safe exploration can be split into three categories: (i) human designer knowledge, (ii) prediction models, and (iii) auxiliary rewards, as illustrated in Fig. 11. For more details about safe exploration in reinforcement learning, the reader is referred to [133].

Fig. 11. Illustration of safe exploration methods. In safe exploration methods, attempts are made to prevent unsafe behaviours during exploration. Here, three techniques are highlighted: (i) human designer knowledge—the agent’s behaviours are limited by human-designed boundaries; (ii) prediction models—the agent learns unsafe behaviours and how to avoid them; and (iii) auxiliary rewards—agents are punished in dangerous states.

3.6.1. Human designer knowledge methods
Human-designated safety boundaries are used in human designer knowledge methods. Knowledge from the human designer can be split into baseline behaviours, direct human intervention and prediction models.

Baseline behaviour methods impose an impassable safety baseline. Garcia et al. [134] proposed adding a risk function (which determines unsafe states) and a baseline behaviour (which decides what to do in unsafe states). In [135], the agent was constrained by an additional pre-trained module that prevents unsafe actions, as shown in Fig. 12, while in [136], agents are expected to perform no worse than an a priori known baseline. Classifying which objects are dangerous and how to avoid them before training the agent was proposed in [137]; because the agent learns how to avoid certain objects rather than states, this approach can generalise to new scenarios.

The human intervention approach was discussed in [138]. During the initial phases of exploration, a human in the loop stops disasters. Then, a network trained with supervision on the data collected from humans is used as a replacement for the human.

Fig. 12. Overview of safe exploration in continuous action spaces [135]. The additional model is modifying the actions of the original policy.

In prediction model methods, a human-designed safety model determines whether the agent’s next action leads to an unsafe state and, if so, prevents it. In [139], a rover traversing terrain of different heights was considered; a Gaussian process model provides estimates of the height at a given location, and if the height is lower than the safe behaviour limit, the robot can explore safely. A heuristic safety model using a priori knowledge was proposed in [140]; to this end, the authors proposed an algorithm called action pruning, which uses heuristics to prevent the agent from committing unsafe actions.
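A prediction-model safety filter can be sketched as an action-pruning step executed before every environment step; predict_next_state, is_unsafe and baseline_action below are hypothetical placeholders rather than APIs from the cited works.

# Minimal sketch of a prediction-model safety filter: actions predicted to lead
# to unsafe states are pruned, with a fall-back to a known-safe baseline.

def predict_next_state(state, action):
    """Environment model (learned or hand-designed) predicting the next state."""
    raise NotImplementedError

def is_unsafe(state):
    """Human-designed safety check, e.g. predicted terrain height outside a safe limit [139]."""
    raise NotImplementedError

def baseline_action(state):
    """A priori known safe behaviour to fall back on, as in [134]."""
    raise NotImplementedError

def safe_action(state, candidate_actions, policy_scores):
    """Prune unsafe candidates and act greedily among the remaining ones."""
    allowed = [a for a in candidate_actions
               if not is_unsafe(predict_next_state(state, a))]
    if not allowed:                  # nothing is predicted safe: defer to the baseline
        return baseline_action(state)
    return max(allowed, key=lambda a: policy_scores[a])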

Discussion.
In human designer knowledge methods, the barriers to unsafe behaviours are placed by a human designer. Baseline behaviours and human intervention methods guarantee certain performance in certain situations, but they only work in pre-defined situations. Prediction model methods require a model of the environment, either in the form of a mathematical model [139] or heuristic rules [140]. Prediction models are more likely to work in previously unseen environments and to adapt than baseline behaviours and human intervention methods.

3.6.2. Auxiliary reward methods
In auxiliary rewards, the agent is punished for putting itself into a dangerous situation. This approach requires the least human intervention, but it generates the weakest safety behaviours.

One method is to find states in which an episode terminates and to discourage the agent from approaching them using an intrinsic fear penalty [141]. The approach counts back a certain number of states from death and applies a distance-to-death penalty. Additionally, the authors constructed a simple environment in which the highest positive reward is next to a negative reward; a DQN eventually jumps into the negative reward. The authors state: “We might critically ask, in what real-world scenario, we could depend upon a system [DQN] that cannot solve [these kinds of problems]”. A similar approach, but with more stochasticity, was later proposed in [142].
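A simplified, tabular sketch of such a distance-to-death penalty is given below; the original work [141] instead trains a classifier predicting the probability of catastrophe, so the snippet (with hypothetical names and hashable states assumed) only illustrates the idea.

# Sketch of an intrinsic-fear style penalty: states seen shortly before a failure
# are remembered and penalised when revisited.

fear_radius = 5      # how many states before the failure are penalised
fear_lambda = 1.0    # penalty scale
feared = {}          # state -> penalty weight in [0, 1]

def record_failure(trajectory):
    """Call when an episode ends in a catastrophe; trajectory is a list of states."""
    tail = trajectory[-fear_radius:]
    for i, state in enumerate(tail):
        # States closer to the terminal failure receive a larger weight.
        weight = (i + 1) / len(tail)
        feared[state] = max(feared.get(state, 0.0), weight)

def shaped_reward(state, external_reward):
    return external_reward - fear_lambda * feared.get(state, 0.0)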

Allowing the agent to autonomously learn undesirable states from previous experiences was discussed in [143]. The states and their advantage values are stored in a common buffer; frequently visited states with the lowest advantage then have additional negative rewards associated with them.

Discussion.
Auxiliary rewards can be an effective method of discouraging agents from unsafe behaviours; for example, in [141], half of the agent’s deaths were prevented. Moreover, some approaches, such as Karimpanal et al. [143], have shown the ability to determine undesirable states fully automatically and avoid them. This, however, assumes that when the agent perishes it has a low score, which may not always be the case.

3.6.3. Summary
An overview of the safety approaches is shown in Table 6. Safety is a vital aspect of reinforcement learning for practical applications in many domains. There are three general approaches: human designer knowledge, prediction models, and auxiliary rewards. Human designer knowledge guarantees safe behaviour in certain states, but the agent struggles to learn new safe behaviours. Auxiliary reward approaches can adjust to new scenarios, but they require training time and careful design of the negative reward.

Table 6. Comparison of Safe approaches.

R | Prior Knowledge | U | Method | Top score on a key benchmark | Input Types | O | MB/MF | A/S
Garcelon et al. [136] | Baseline safe policy | Policy-based UCRL2 | Human Designer Knowledge | Stochastic inventory control: never breaching the safety baseline | Amount of products in inventory | P | MB | D/D
Garcia and Fernandez [134] | baseline behaviour |  | Human Designer Knowledge | Car parking problem: 6.5 | Angles and positions of respective controllable vehicles | Ac | MB | D/D
Hunt et al. [137] | pretrained safety network | PPO | Human Designer Knowledge | Point mass environment: 0 unsafe actions (PPO 3000) | Bird’s eye view of the problem | Ac | MB | D/D
Saunders et al. [138] | human intervention data | DQN | Human Designer Knowledge | Atari: Space Invaders 0 catastrophes (DQN 800000) | Atari images | Q | MB | D/D
Dalal et al. [135] | pretrained safety model | DDPG | Human Designer Knowledge | Spaceship: Arena 1000 (DDPG 300) | x–y position | Ac | MB | C/C
Gao et al. [140] | environmental knowledge | PPO | Human Designer Knowledge | Pommerman: 0.8 (Baseline 0) | Agent, enemy agents and bombs positions | Ac | MB | D/D
Turchetta et al. [139] | Bayesian optimisation |  | Human Designer Knowledge | Simulated rover: 80% exploration (Random 0.98%) | x–y position | Ac | MB | C/C
Fatemi et al. [142] |  | DQN | Auxiliary Reward | Bridge: optimal after 14k episodes (ten times faster than competitor) | Card types/ Atari images | Q | MB | D/D
Lipton et al. [141] |  | DQN | Auxiliary Reward | Atari: Asteroids total deaths 40,000 (DQN 80,000) | Atari images | Ac | MB | C/C
Karimpanal et al. [143] |  | Q-learning and DDPG | Auxiliary Reward | Navigation environment: −3 (PQRL −3.5) | Enumerated state id | Q | MF | C/C

Legend: A — action space, Ac — action, R — reference, MB — model based, MF — model free, D — discrete, C — continuous, Q — Q values, V — values, P — policy, O — output, S — state space, U — underlying algorithm. Top score on a key benchmark is reported as [benchmark]: [scenario] [score] ([baseline approach] [score]).

3.7. Random-based methods
In random-based approaches, improvements to simple random exploration are discussed. Random exploration tends to be inefficient as it often revisits the same states. To address this, the following approaches are considered: (i) reduced states/actions for exploration, (ii) exploration parameter methods, and (iii) network parameter noise methods, as illustrated in Fig. 13.

Fig. 13. Overview of random based methods. In random-based methods, simple random exploration is modified for improved efficiency. In modifying the states for exploration, the number of actions to be taken randomly is reduced. In modifying the exploration parameters, the exploration is automatically decided. In the network parameter noise, the noise is imposed on the policy parameters.

3.7.1. Exploration parameters methods
In this section, exploration is parameterised (for example, ε in ε-greedy). Then, the parameters are modified according to the agent’s learning progress.

One technique for adapting the exploration rate is to simply consider the reward and adjust the random exploration parameter accordingly, as described in [144]. Using the pure reward can lead to problems when rewards are sparse; to solve this, in [145] ε was made to depend on the error of the value-function estimates instead of the reward. It is also possible to determine the amount of random exploration using the entropy of the environmental model, as discussed in [146]. The learning rate can also depend on exploration [147], where a parameter functionally equivalent to the learning rate is introduced: if the agent is exploring a lot, this parameter slows down learning to account for uncertainty. Khamassi et al. [148] used long-term and short-term reward averages to control exploration and exploitation: when the short-term average falls below the long-term average, exploration is increased.
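As an illustration of adapting the exploration rate from reward statistics, the sketch below (our own construction, loosely in the spirit of [148], with arbitrary coefficients) raises ε when a short-term reward average falls below a long-term average and lowers it otherwise.

# Sketch of reward-driven adaptation of the epsilon-greedy exploration rate.

class AdaptiveEpsilon:
    def __init__(self, eps=0.1, short=0.1, long=0.01, step=0.01):
        self.eps = eps
        self.short_avg = 0.0                     # fast-moving reward average
        self.long_avg = 0.0                      # slow-moving reward average
        self.short, self.long, self.step = short, long, step

    def update(self, reward):
        self.short_avg += self.short * (reward - self.short_avg)
        self.long_avg += self.long * (reward - self.long_avg)
        if self.short_avg < self.long_avg:       # doing worse than usual: explore more
            self.eps = min(1.0, self.eps + self.step)
        else:                                    # doing better than usual: exploit more
            self.eps = max(0.01, self.eps - self.step)
        return self.eps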

Chang et al. [149] used multiple agents (ants) to adjust the exploration parameters. At each step, the ants choose their actions randomly, but the choice is skewed by the pheromone values left by other ants.

Another approach of this type is to reduce the states available for exploration based on some predefined metric. An approach using adaptive resonance theory (ART) [150] was presented in [151] and later extended in [152]. In ART, knowledge about actions can be split into: (i) positive chunks, which lead to positive rewards, (ii) negative chunks, which lead to negative results, and (iii) empty chunks, which have not yet been taken. The action is chosen randomly from the positive and empty chunks; thus, the agent explores either new actions or ones with a positive reward. Wang et al. [152] extended this to include a probability of selecting the remaining actions based on how well they are known.

Discussion.
Different parameters can be adapted based on learning progress. Early approaches used learning progress, reward, or the value of states to determine the rate of exploration; the challenge with these approaches is determining the parameters that control the exploration. It is also possible to adjust the learning rate based on exploration [147]; the advantage is that the agent avoids learning uncertain information, but it slows down training. Finally, reducing the states available for exploration can make exploration more sample efficient, but it struggles to account for unseen states that occur after the eliminated states.

3.7.2. Random noise
In random noise approaches, random noise is used for exploration. The noise can either be imposed on the network parameters or be produced based on the states encountered during exploration.

The easiest way to include noise is to add a fixed amount of it [153]; that paper reviews the use of small perturbations in the parameter space. In [154], chaotic networks were used to induce the noise in the network. It is also possible to adjust the noise strength using backpropagation, as described in [155], where the noise is created by a constant noise source multiplied by a gradient-adaptable parameter. Another way of using the noise is to compare the decisions made by the noisy and noiseless policies [156]; exploration is imposed if the decisions are sufficiently different.
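The ‘noise source multiplied by a gradient-adaptable parameter’ idea can be sketched as a noisy linear layer, shown below in simplified PyTorch form (the factorised-noise construction of [155] is omitted and the initialisation constants are arbitrary): each weight and bias is perturbed by Gaussian noise scaled by a learnable sigma, so the amount of exploration noise is tuned by backpropagation.

import torch
import torch.nn as nn

# Simplified sketch of a noisy linear layer: mu_* are the usual parameters and
# sigma_* scale fresh Gaussian noise drawn at every forward pass.

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        weight = self.mu_w + self.sigma_w * torch.randn_like(self.sigma_w)
        bias = self.mu_b + self.sigma_b * torch.randn_like(self.sigma_b)
        return nn.functional.linear(x, weight, bias)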

In [157], the problem of assigning rewards when the same state is encountered multiple times is discussed. In such cases, the agent is likely to take different actions in the same state, making credit assignment difficult. To solve this, a random action generation function that depends on the input state was developed; if the state is the same, the random action is the same.

Discussion.
Network parameter noise was first developed for evolutionary approaches, such as [153]. More recently, parameter noise has been used in policy-based methods; in particular, [155] achieved a 50% improvement averaged over 52 Atari games.

3.7.3. Summary
A comparison of random-based approaches is presented in Table 7. The key advantage of reduced states for exploration methods is that exploration can be very effective, but the agent needs to hold a memory of where it has been. Exploration parameter methods handle the trade-off between exploration and exploitation well; however, the agent can still get stuck exploring unnecessary states. Random noise approaches are very simple to implement and show promising results, but they rely on careful tuning of parameters by designers.

Table 7. Comparison of Random based approaches.

R | Prior Knowledge | U | Method | Top score on a key benchmark | Input Types | O | MB/MF | A/S
Wang et al. [152] |  | ART | Exploration parameters | Minefield navigation (success rate): 91% (Baseline 91%) | Vehicles positions | Q | MB | C/C
Shani et al. [147] |  | DDQN and DDPG | Exploration parameters | Atari: Frostbite 2686 (DDPG 1720); Mujoco: HalfCheetah 4579 (DDPG 2255) | Atari images, Mujoco joints angles | Ac/Q | MF | C/C
Patrascu and Stacey [144] |  | Fuzzy ART MAP architecture | Exploration parameters | Changing world environment (grid with two alternating paths to reward): 0.9 | Enumerated state id | Ac | MB | D/D
Usama and Chang [146] |  | DQN | Exploration parameters | VizDoom: Defend the Centre 12.2 (ε-greedy 11.8) | Images | Q | MB | C/C
Tokic [145] |  | V-learning | Exploration parameters | Multi-arm bandit: 1.42 (Softmax 1.38) | Enumerated state id | V | MF | D/D
Khamassi et al. [148] |  | Q-learning | Exploration parameters | Nao simulator: Engagement 10 (Kalman-QL 5) | Robot joints angles | Q | MF | D/D
Shibata and Sakashita [154] |  | Actor–critic | Random noise | Area with randomly positioned obstacle: 0.6 out of 1 | Enumerated state id | Ac | MF | D/D
Plappert et al. [156] | measure of policies distance | DQN, DDPG and TRPO | Random noise | Atari: BeamRider 9000 (ε-greedy 5000); Mujoco: Half Cheetah 5000 (ε-greedy 1500) | Atari images/ Mujoco joints angles | Ac/Q | MF | C/C
Shibata and Sakashita [154] |  |  | Random noise | Multi-arm bandit problem: 1 (Optimal) | Stateless | Ac | MB | C/C
Fortunato et al. [155] |  |  | Random noise | Atari: 57 games 633 points (Duelling DQN 524) | Atari images | Ac | MF | C/C

Legend: A — action space, Ac — action, R — reference, MB — model based, MF — model free, D — discrete, C — continuous, Q — Q values, V — values, P — policy, O — output, S — state space, U — underlying algorithm. Top score on a key benchmark is reported as [benchmark]: [scenario] [score] ([baseline approach] [score]).

4. Future challenges
In this section, we discuss the following future challenges on exploration in reinforcement learning: evaluation, scalability, exploration–exploitation dilemma, intrinsic reward, noisy TV problems, safety, and transferability.

Evaluation.
Currently, evaluating and comparing different exploration algorithms is challenging. This issue arises for three reasons: the lack of a common benchmark, the lack of a common evaluation strategy, and the lack of good metrics to measure exploration.

Currently, the four major benchmarks used by the community are VizDoom [27], Minecraft [28], Atari games [15] and Mujoco [29]. Each benchmark is characterised by different complexities in terms of state space, reward sparseness, and action space, and each offers several scenarios with various degrees of complexity. Such a wealth of benchmarks is desirable for exposing agents to various complexities; however, the difference in complexity between benchmarks is not well understood, which makes it difficult to compare algorithms evaluated on different benchmarks. There have been attempts to address the evaluation issue with a common benchmark, for example in [158], but it is not yet commonly adopted.

Regarding the evaluation strategy, most algorithms report the reward obtained after a certain number of steps (where, in this context, steps could also mean episodes, iterations or epochs). This makes the reporting of results inconsistent in two respects: (i) the number of steps after which the algorithm is evaluated and (ii) how the reward is reported. The first makes comparisons between algorithms difficult because performance can vary widely depending on when the comparison is made. As for the second, most authors report the average reward scored by the agent; however, comparisons against average human performance are sometimes used instead (without a clear indication of what average human performance means exactly), and the distinction between average and maximum reward is not always made clear.
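To make the reporting ambiguity concrete, the following minimal sketch records both the average and the maximum episode return over a fixed evaluation budget, which is precisely the information that is often reported inconsistently. It is an illustration only: the environment interface (`reset()` and `step()` returning `(state, reward, done)`), the episode budget and the step cap are assumptions, not a prescribed protocol.

```python
def evaluate(policy, env, n_episodes=100, max_steps=1000):
    """Roll out a fixed number of evaluation episodes and report both the
    average and the maximum episode return, plus the budget used."""
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_steps):
            state, reward, done = env.step(policy(state))
            total += reward
            if done:
                break
        returns.append(total)
    return {"mean_return": sum(returns) / len(returns),
            "max_return": max(returns),
            "episodes": n_episodes}
```

Reporting both quantities, together with the evaluation budget, removes the ambiguity between average and maximum reward discussed above.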

Finally, it is arguable whether reward is an appropriate evaluation measure at all [37]. One key issue is that it fails to account for the speed of learning, which should be higher if exploration is more efficient [37]. Attempts have been made to address this in [37], but as of the time of writing this review, the proposed metric is not widely adopted. Another issue is that reward does not provide any information about the quality of the exploratory behaviour itself. This is even harder to assess in continuous action spaces, where computing novelty is considerably more challenging.

Scalability.
Exploration in reinforcement learning does not scale well to real-world problems. This is caused by two limitations: training time and inefficient state representation. Currently, even the fastest training requires millions of samples in complex environments, and even the most complex environments currently used in reinforcement learning are still relatively simple compared to the real world. In the real world, collecting millions of samples for training is unrealistic owing to the wear and tear of physical devices. To cope with the real world, either the sim-to-real gap needs to be reduced or exploration needs to become more sample efficient.

Another limitation is the lack of state representations efficient enough to make memorising states and actions feasible in large domains. For example, Go-Explore [31] does not scale well if the environment is large. This problem was discussed in [159] by comparing how the brain stores memories and computes novelty. The human brain is much faster at determining scene novelty and has a much larger capacity. To achieve this, the brain uses agreement between multiple neurons: the more neurons indicate that a given image is novel, the higher the novelty. Thus, the brain does not need to remember full states; instead, it trains itself to recognise novelty. In terms of representation efficiency, this is currently unmatched in reinforcement learning.
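As a loose computational analogue of this agreement-based novelty (not a model of [159] and not a method from the surveyed papers), the sketch below maintains an ensemble of simple learned predictors and treats their average prediction error on a state as a novelty score that shrinks with familiarity. The linear predictors, learning rate and ensemble size are all assumptions made purely for illustration.

```python
import numpy as np

class EnsembleNovelty:
    """Novelty from an ensemble of learned linear predictors: states whose
    projections the members still predict poorly are judged novel; repeated
    exposure drives the predictors towards their targets, reducing novelty."""

    def __init__(self, state_dim, n_members=8, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # One fixed random target and one trainable predictor per member.
        self.targets = [rng.normal(size=state_dim) for _ in range(n_members)]
        self.predictors = [np.zeros(state_dim) for _ in range(n_members)]
        self.lr = lr

    def novelty(self, state):
        state = np.asarray(state, dtype=float)
        errors = [float(np.dot(t - p, state) ** 2)
                  for t, p in zip(self.targets, self.predictors)]
        score = float(np.mean(errors))  # high average error = novel state
        # Gradient step that moves each predictor towards its target along this
        # state direction, so repeatedly seen states stop looking novel.
        for i, (t, p) in enumerate(zip(self.targets, self.predictors)):
            grad = -2.0 * np.dot(t - p, state) * state
            self.predictors[i] = p - self.lr * grad
        return score
```

The point of the analogy is only that novelty is read off from the collective error of many small predictors rather than from an explicit memory of full states.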

Exploration–exploitation dilemma.
The exploration–exploitation dilemma is an ongoing research topic, not only in reinforcement learning but in sequential decision making more generally. Most current exploration approaches have a built-in solution to the exploration–exploitation trade-off, but not all of them do; this is particularly true of goal-based methods, which rely on hand-designed solutions. Moreover, even in approaches that resolve the trade-off automatically, the balance is still mostly decided by a designer-provided threshold. One potential way of addressing this is to train a set of skills (policies) during exploration and combine those skills into larger goal-oriented policies [160]. This is similar to how humans solve problems: by learning smaller skills and later exploiting them as part of a larger policy.
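A hedged sketch of the "learn skills, then combine them" idea follows. It is not the method of [160]; the skill interface (callables mapping state to action), the fixed execution horizon and the bandit-style selection rule are all illustrative assumptions.

```python
import random

class SkillSelector:
    """Minimal high-level controller: pretrained skill policies are treated as
    arms of a bandit, and the controller learns which skill to invoke."""

    def __init__(self, skills, epsilon=0.1):
        self.skills = skills               # list of callables: state -> action
        self.values = [0.0] * len(skills)  # running estimate of each skill's return
        self.counts = [0] * len(skills)
        self.epsilon = epsilon

    def pick_skill(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.skills))
        return max(range(len(self.skills)), key=lambda i: self.values[i])

    def run_skill(self, index, env, state, horizon=20):
        """Execute one skill for a fixed horizon and update its value estimate.
        Assumes env.step(action) returns (state, reward, done)."""
        total = 0.0
        for _ in range(horizon):
            state, reward, done = env.step(self.skills[index](state))
            total += reward
            if done:
                break
        self.counts[index] += 1
        self.values[index] += (total - self.values[index]) / self.counts[index]
        return state, total
```

The exploratory behaviour lives inside the skills; the high-level policy only decides how to exploit them.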

Intrinsic reward.
Reward-novel-states and reward-diverse-behaviours approaches can be improved in two ways: (i) the agent should have more freedom in how it rewards itself, and (ii) a better balance between long-term and short-term novelty should be achieved.

In most intrinsic reward approaches, the exact reward formulation is designed by an expert. Designing a reward that guarantees good exploration is a challenging and time-consuming task. Moreover, there might be ways of rewarding agents that designers have not conceived of. Thus, it could be beneficial if an agent were not only trained in the environment but also trained in how to reward itself. This would be closer to human behaviour, where the self-rewarding mechanism developed through evolution.

Balancing long-term and short-term novelty is another challenge. Here, the agent trades off two tendencies: revisiting states often in the hope of finding something new there, or abandoning states quickly to look for something new elsewhere. This balance is currently set by a hand-designed parameter whose tuning is time-consuming. Recently, a fix was proposed in [15] in which meta-learning decides the appropriate balance, but at the cost of additional computational complexity during training.
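To illustrate what such a balance can look like, the sketch below combines a short-term (per-episode) novelty signal with a long-term (whole-training) one multiplicatively, in the spirit of the scheme used in [15]. The count-based novelty estimators, the scale factor and the clipping constant are placeholders chosen for clarity, not the original formulation.

```python
import math
from collections import Counter

class NoveltyMixer:
    """Intrinsic reward combining short-term and long-term novelty:
    r_int = episodic * clip(scale * lifelong, 1, clip_max)."""

    def __init__(self, scale=10.0, clip_max=5.0):
        self.episode_counts = Counter()   # reset at every episode (short-term)
        self.lifetime_counts = Counter()  # never reset (long-term)
        self.scale = scale
        self.clip_max = clip_max

    def start_episode(self):
        self.episode_counts.clear()

    def intrinsic_reward(self, state_key):
        self.episode_counts[state_key] += 1
        self.lifetime_counts[state_key] += 1
        episodic = 1.0 / math.sqrt(self.episode_counts[state_key])
        lifelong = 1.0 / math.sqrt(self.lifetime_counts[state_key])
        # The long-term signal only modulates the episodic one and is clipped,
        # so globally familiar states stop being attractive even when they are
        # new within the current episode.
        modulation = min(max(self.scale * lifelong, 1.0), self.clip_max)
        return episodic * modulation
```

The hand-designed constants here are exactly the kind of balance parameters that meta-learning approaches try to set automatically.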

Noisy-TV problem.
The noisy-TV problem (also called the couch-potato problem) remains largely unsolved. While memory-based methods can mitigate it, they are limited by their memory requirements; if the noisy sequence is very long and the state space is complex, memory-based approaches will also struggle. One method that has shown some promise is clustering [107], which groups noisy states into a cluster that the agent then avoids. However, this requires the clusters to be designed correctly.

Optimal exploration.
One area that is rarely considered in current research on exploration in reinforcement learning is how to explore optimally. Under optimal exploration, the agent does not revisit states unnecessarily and explores the most promising areas first. This problem and a proposed solution are described in detail in [161]. The solution uses a demand matrix, a states-by-actions matrix indicating how many times each state–action pair still needs to be explored. It then defines the exploration cost of an exploration policy, that is, the number of steps needed to satisfy the exploration demand of each state–action pair. Note that the demand matrix does not need to be known a priori and can be updated online. This aspect needs further development.
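The sketch below is not the algorithm of [161]; it only illustrates the bookkeeping that a demand matrix enables. The required visit count per pair, the enumeration of states and actions, and the greedy action choice are assumptions made for the example.

```python
import numpy as np

class DemandMatrix:
    """States-by-actions matrix of remaining exploration demand: entry (s, a)
    holds how many more times that state-action pair should still be visited."""

    def __init__(self, n_states, n_actions, visits_required=3):
        self.demand = np.full((n_states, n_actions), visits_required, dtype=int)

    def record_visit(self, state, action):
        if self.demand[state, action] > 0:
            self.demand[state, action] -= 1

    def exploration_cost(self):
        """Total number of exploratory steps still required under this demand."""
        return int(self.demand.sum())

    def most_demanded_action(self, state):
        """A greedy choice: the action with the largest remaining demand."""
        return int(np.argmax(self.demand[state]))


# Usage sketch: the demand can be updated online as new states are discovered.
dm = DemandMatrix(n_states=4, n_actions=2)
dm.record_visit(0, 1)
print(dm.exploration_cost())   # 23 = 4 * 2 * 3 - 1
```

An exploration policy can then be compared to another by the number of environment steps it needs to drive this cost to zero.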

Safe exploration.
Safe exploration is of paramount importance for real-world applications, yet so far very few approaches cope with this issue, and most of them rely heavily on hand-designed rules to prevent catastrophes. Moreover, it has been shown in [141] that current reinforcement learning struggles to prevent catastrophes even with carefully engineered rewards. Thus, the agent needs to recognise unsafe situations and act accordingly. However, what constitutes an unsafe situation is not well defined beyond hand-designed rules, which limits the scalability and transferability of safe exploration in reinforcement learning. A more rigorous definition of an unsafe situation would help address this problem.

Transferability.
Most exploratory approaches are currently limited to the domain on which they were trained. When faced with new environments (e.g., a larger state space or a different reward function), exploration strategies do not seem to perform well [43], [48]. Coping with this issue would be helpful in two scenarios. First, it would be beneficial to teach the agent behaviours in smaller scenarios and then have it perform well in larger ones, alleviating computational issues. Second, in some domains, defining a state space suitable for exploration is challenging, and its size may vary significantly between tasks (e.g., searching for the victim of an accident).

5. Conclusions
This paper presented a review of exploration in reinforcement learning. The following categories of methods were discussed: reward novel states, reward diverse behaviours, goal-based methods, probabilistic methods, imitation-based methods, safe exploration, and random-based methods.

In reward-novel-states methods, the agent is given a reward for discovering a novel or surprising state. This reward can be computed using prediction error, counts, or memory. In prediction-error methods, the reward is based on the accuracy of the agent's internal model of the environment. In count-based methods, the reward is based on how often a given state has been visited. In memory-based methods, the reward is computed based on how different a state is from the states stored in a buffer.
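As a minimal sketch of the count-based variant, the bonus below decays with the visit count of a state; the 1/sqrt(N) form and the discretisation of states into hashable keys are common choices assumed here rather than taken from any single surveyed method.

```python
import math
from collections import defaultdict

class CountBonus:
    """Count-based intrinsic reward: rarely visited states earn a larger bonus."""

    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, state_key):
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])


bonus = CountBonus()
# The agent then learns from the shaped reward: extrinsic + bonus.bonus(state_key)
```

Prediction-error and memory-based rewards follow the same pattern, with the count replaced by a model error or a distance to buffered states.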

In reward-diverse-behaviours methods, the agent is rewarded for discovering as many diverse behaviours as possible. Note that we use the word behaviour loosely to mean a sequence of actions or a policy. These methods can be divided into evolution strategies and policy learning: in evolution strategies, diversity among a population of agents is encouraged, whereas in policy learning, diversity of policy parameters is encouraged.
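A hedged sketch of the population-diversity idea follows: each individual's fitness is augmented with its average behavioural distance to its nearest neighbours before selection. The behaviour descriptors (fixed-length vectors), the neighbourhood size and the mixing weight are assumptions for illustration.

```python
import numpy as np

def novelty_scores(behaviours, k=3):
    """Behavioural novelty of each population member: mean Euclidean distance
    to its k nearest neighbours in behaviour space (larger = more diverse)."""
    behaviours = np.asarray(behaviours, dtype=float)
    dists = np.linalg.norm(behaviours[:, None, :] - behaviours[None, :, :], axis=-1)
    scores = []
    for i in range(len(behaviours)):
        neighbours = np.sort(dists[i])[1:k + 1]   # skip the zero distance to itself
        scores.append(float(neighbours.mean()))
    return scores

def diversity_augmented_fitness(rewards, behaviours, weight=0.5):
    """Mix task reward with behavioural novelty before selecting the next generation."""
    novelty = novelty_scores(behaviours)
    return [r + weight * n for r, n in zip(rewards, novelty)]
```

Setting weight to zero recovers a purely reward-driven evolution strategy; raising it pushes the population towards behavioural diversity.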

In goal-based methods, the agent is either given a goal to explore from or explores while trying to reach a goal. In the first approach, the agent chooses a goal to reach and then explores from it. This results in very efficient exploration, as the agent predominantly visits unknown areas. In the second approach, called exploratory goals, the agent explores while travelling towards a goal; the key idea is to provide goals that are suitable for exploration.
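The sketch below illustrates the first pattern (pick an archived goal, return to it, then explore from there). It is not the exact procedure of any cited method: the archive keying by coarse discretisation, the return mechanism (restoring a saved simulator snapshot) and the hypothetical environment methods `restore`, `observation` and `step_and_snapshot` are all assumptions.

```python
import random

class GoalArchive:
    """Archive of promising states; exploration proceeds by picking an archived
    goal, returning to it, and exploring from it with a random policy."""

    def __init__(self):
        self.archive = {}   # state_key -> saved environment snapshot

    def maybe_add(self, state_key, snapshot):
        if state_key not in self.archive:
            self.archive[state_key] = snapshot

    def explore_from_goal(self, env, random_policy, n_steps=50):
        state_key = random.choice(list(self.archive))
        env.restore(self.archive[state_key])      # return to the chosen goal
        state = env.observation()
        for _ in range(n_steps):                  # then explore from it
            state, reward, done, snapshot = env.step_and_snapshot(random_policy(state))
            self.maybe_add(self.state_key_of(state), snapshot)
            if done:
                break

    @staticmethod
    def state_key_of(state):
        # Coarse discretisation so that similar states share one archive cell.
        return tuple(round(x, 1) for x in state)
```

In practice the goal is usually selected with a bias towards novel or promising archive cells rather than uniformly at random.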

In probabilistic methods, the agent maintains an uncertainty model of the environment and uses it to choose its next move. These methods have two subcategories: optimistic methods and uncertainty methods. In optimistic methods, the agent follows the optimism-under-uncertainty principle, acting on the most optimistic plausible estimate of the reward. In uncertainty methods, the agent samples from its internal uncertainty to move towards the least-known areas.
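A bandit-scale sketch contrasting the two subcategories is given below; the Gaussian belief per action and the optimism bonus are simplifying assumptions, not the formulation of any particular surveyed method.

```python
import math
import random

class BayesianActionValues:
    """Per-action Gaussian belief over the mean reward, usable either for
    optimism (act on an upper bound) or for uncertainty sampling (Thompson)."""

    def __init__(self, n_actions):
        self.mean = [0.0] * n_actions
        self.var = [1.0] * n_actions
        self.count = [0] * n_actions

    def update(self, action, reward):
        self.count[action] += 1
        n = self.count[action]
        self.mean[action] += (reward - self.mean[action]) / n
        self.var[action] = 1.0 / n        # uncertainty shrinks with more data

    def optimistic_action(self, bonus=2.0):
        # Optimism under uncertainty: act on mean plus an uncertainty bonus.
        scores = [m + bonus * math.sqrt(v) for m, v in zip(self.mean, self.var)]
        return scores.index(max(scores))

    def thompson_action(self):
        # Uncertainty sampling: draw one plausible value per action and act on it.
        samples = [random.gauss(m, math.sqrt(v)) for m, v in zip(self.mean, self.var)]
        return samples.index(max(samples))
```

Both rules explore actions whose value is still uncertain, but the optimistic rule does so deterministically while the sampling rule does so stochastically.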

Imitation-based methods rely on demonstrations to help exploration. In general, there are two approaches: combining demonstrations with experience replay, and combining them with an exploration strategy. In the first, samples from demonstrations and samples collected by the agent are merged into one buffer from which the agent learns. In the second, demonstrations are used as a starting point for other exploration techniques, such as rewarding novel states.
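A hedged sketch of the first variant follows: demonstrations and agent experience share one replay buffer, and mini-batches are drawn with a fixed fraction of demonstration samples. The fixed ratio and the buffer capacity are assumptions made for the example.

```python
import random
from collections import deque

class MixedReplayBuffer:
    """Replay buffer that mixes expert demonstrations with the agent's own
    transitions at a fixed ratio when sampling mini-batches."""

    def __init__(self, demos, capacity=100_000, demo_fraction=0.25):
        self.demos = list(demos)                  # kept permanently
        self.agent = deque(maxlen=capacity)       # overwritten as training proceeds
        self.demo_fraction = demo_fraction

    def add(self, transition):
        self.agent.append(transition)

    def sample(self, batch_size):
        n_demo = min(int(batch_size * self.demo_fraction), len(self.demos))
        n_agent = min(batch_size - n_demo, len(self.agent))
        batch = random.sample(self.demos, n_demo) + random.sample(list(self.agent), n_agent)
        random.shuffle(batch)
        return batch
```

Keeping the demonstrations permanently in the buffer ensures the sparse-reward trajectories are never forgotten while the agent's own experience accumulates.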

Safe exploration methods were devised to ensure safe behaviour of the agent during exploration. The most prevalent approach is to use human designer knowledge to define boundaries for the agent. It is also possible to train a model that predicts and prevents disastrous moves. Finally, the agent can be discouraged from visiting dangerous states through a negative reward.
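The sketch below illustrates a designer-provided safety layer combining two of these ideas: a predicate over state–action pairs vetoes unsafe actions, and a penalty discourages the agent from proposing them again. The predicate, the fallback action and the penalty value are assumptions, not a method from the surveyed papers.

```python
class SafetyShield:
    """Wrap an environment so that designer-specified unsafe actions are vetoed
    and replaced by a safe fallback, with a penalty added to discourage them."""

    def __init__(self, env, is_unsafe, fallback_action, penalty=-1.0):
        self.env = env
        self.is_unsafe = is_unsafe          # callable: (state, action) -> bool
        self.fallback_action = fallback_action
        self.penalty = penalty
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action):
        shield_penalty = 0.0
        if self.is_unsafe(self.state, action):
            action = self.fallback_action   # block the potentially catastrophic move
            shield_penalty = self.penalty   # and discourage proposing it again
        self.state, reward, done = self.env.step(action)
        return self.state, reward + shield_penalty, done
```

The hand-written predicate is exactly the part that, as argued above, does not scale or transfer well and would benefit from a more rigorous definition of unsafe situations.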

Random exploration methods improve upon standard random exploration. These improvements include modifying the set of states and actions considered for exploration, adapting the exploration parameters, and adding noise to network parameters. In the first case, states and actions that have already been sufficiently explored are removed from the random choice. In the second, the parameters that control when to explore randomly are chosen automatically based on the agent's learning progress. Lastly, in the network-parameter-noise approach, random noise is applied to the parameters to induce exploration before the weights converge.
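Two of these variants are sketched below: an exploration rate tied to learning progress (here measured by the change in mean TD error, which is an assumption) and Gaussian noise added to a copy of the network parameters before acting. Both are minimal illustrations rather than the formulations of the cited methods.

```python
import numpy as np

class AdaptiveEpsilon:
    """Exploration parameter tied to learning progress: epsilon decays whenever
    the mean TD error decreased since the last check, and is held otherwise."""

    def __init__(self, eps=1.0, eps_min=0.05, decay=0.99):
        self.eps, self.eps_min, self.decay = eps, eps_min, decay
        self.prev_error = None

    def update(self, mean_td_error):
        if self.prev_error is not None and mean_td_error < self.prev_error:
            self.eps = max(self.eps * self.decay, self.eps_min)  # progress: exploit more
        self.prev_error = mean_td_error
        return self.eps


def perturbed_parameters(params, sigma=0.02, rng=None):
    """Parameter-space noise: act with a temporarily noisy copy of the weights."""
    rng = rng or np.random.default_rng()
    return [w + rng.normal(scale=sigma, size=w.shape) for w in params]
```

Parameter-space noise yields temporally consistent exploratory behaviour within an episode, in contrast to per-step action noise.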

Finally, the best approaches in terms of ease of implementation, computational cost and overall performance are highlighted. The easiest methods to implement are reward novel states, reward diverse behaviours and random-based approaches. Basic implementations of these approaches can be combined with almost any existing reinforcement learning algorithm, although they may require a few additions and some tuning to work. In terms of computational efficiency, random-based, reward-novel-states and reward-diverse-behaviours methods generally require the least resources; random-based approaches in particular are computationally efficient because the additional components are lightweight. Currently, the best-performing methods are goal-based and reward-novel-states methods, with goal-based methods achieving high scores on difficult exploration problems such as Montezuma's Revenge. However, goal-based methods tend to be the most complex to implement. Overall, reward-novel-states methods appear to be a good compromise between ease of implementation and performance.

CRediT authorship contribution statement
Pawel Ladosz: Conceptualization, Investigation, Visualization, Data curation, Writing – original draft, Writing – review & editing. Lilian Weng: Conceptualization, Validation, Writing – review & editing. Minwoo Kim: Visualization, Writing – review & editing, Learning. Hyondong Oh: Conceptualization, Validation, Writing – review & editing, Supervision.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References 引用
[1]
Sutton R.S., Barto A.G. 萨顿R.S.,巴托A.G.
Reinforcement Learning: An Introduction
强化学习:简介
A Bradford Book, Cambridge, MA, USA (2018)
布拉德福德书,美国马萨诸塞州剑桥(2018)
Google Scholar Google 学术搜索
[2]
Mnih V., Kavukcuoglu K., Silver D., Rusu A.A., Veness J., Bellemare M.G., Graves A., Riedmiller M., Fidjeland A.K., Ostrovski G., Petersen S., Beattie C., Sadik A., Antonoglou I., King H., Kumaran D., Wierstra D., Legg S., Hassabis D.
MnihV., KavukcuogluK., SilverD., RusuA.A., VenessJ., BellemareM.G., GravesA., RiedmillerM., FidjelandA.K., OstrovskiG., PetersenS., BeattieC., SadikA., AntonoglouI., KingH., KumaranD., WierstraD., LeggS., HassabisD.
Human-level control through deep reinforcement learning
通过深度强化学习实现人类水平的控制
Nature, 518 (7540) (2015), pp. 529-533, 10.1038/nature14236
《自然 (Nature)》,第 518 卷第 7540 期(2015 年),第 529-533 页,10.1038/nature14236
URL: http://dx.doi.org/10.1038/nature14236
View PDF 查看 PDF
Your institution provides access to this article.
您的机构提供对本文的访问。
View in Scopus 在 Scopus 中查看Google Scholar Google 学术搜索
[3]
Lillicrap T.P., Hunt J.J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D.
LillicrapT.P., HuntJ.J., PritzelA., HeessN., ErezT., TassaY., SilverD., WierstraD.
Continuous control with deep reinforcement learning
通过深度强化学习实现持续控制
4th International Conference on Learning Representations, ICLR 2016 (2016)
第四届学习表征国际会议,ICLR 2016 (2016)
arXiv:1509.02971 arXiv:1509.02971
Google Scholar Google 学术搜索
[4]
Lee S., Bang H. 李S.,BangH。
Automatic gain tuning method of a quad-rotor geometric attitude controller using A3C
基于A3C的四旋翼几何姿态控制器增益自动整定方法
Int. J. Aeronaut. Space Sci., 21 (2) (2020), pp. 469-478, 10.1007/s42405-019-00233-x
国际航空员。《空间科学 (Space Sci.)》,第 21 卷第 2 期(2020 年),第 469-478 页,10.1007/s42405-019-00233-x
View PDF 查看 PDF
Your institution provides access to this article.
您的机构提供对本文的访问。
View in Scopus 在 Scopus 中查看Google Scholar Google 学术搜索
[5]
Polvara R., Patacchiola M., Sharma S., Wan J., Manning A., Sutton R., Cangelosi A.
PolvaraR., PatacchiolaM., SharmaS., WanJ., ManningA., SuttonR., CangelosiA.
Autonomous quadrotor landing using deep reinforcement learning
使用深度强化学习的自主四旋翼着陆
(2017)
arXiv:1709.03339 arXiv:1709.03339
Google Scholar Google 学术搜索
[6]
Kiran B.R., Sobh I., Talpaert V., Mannion P., Sallab A.A., Yogamani S., Perez P.
KiranB.R., SobhI., TalpaertV., MannionP., SallabA.A., YogamaniS., PerezP.
Deep reinforcement learning for autonomous driving: A survey
自动驾驶的深度强化学习:一项调查
IEEE Trans. Intell. Transp. Syst. (2021), pp. 1-18, 10.1109/TITS.2021.3054625
IEEE翻译。系统(2021 年),第 1-18 页,10.1109/TITS.2021.3054625
arXiv:2002.00444 arXiv:2002.00444
View article Google Scholar
查看文章 Google 学术搜索
[7]
Yu C., Liu J., Nemati S. YuC., LiuJ., NematiS.
Reinforcement learning in healthcare: A survey
医疗保健中的强化学习:一项调查
(2019)
arXiv:1908.08796 arXiv:1908.08796
Google Scholar Google 学术搜索
[8]
Irpan A. 鸢鸠。
Deep reinforcement learning doesn’t work yet
深度强化学习尚不起作用
(2018)
https://www.alexirpan.com/2018/02/14/rl-hard.html
Google Scholar Google 学术搜索
[9]
Clark J., Amodei D. 克拉克J.,AmodeiD。
Faulty reward functions in the wild
野外奖励功能错误
(2016)
https://openai.com/blog/faulty-reward-functions/
Google Scholar Google 学术搜索
[10]
Schmidhuber J. 施密德胡伯J.
Curious model-building control systems
好奇的模型构建控制系统
1991 IEEE International Joint Conference on Neural Networks, IJCNN 1991 (1991), pp. 1458-1463, 10.1109/ijcnn.1991.170605
1991 IEEE神经网络国际联合会议,IJCNN 1991(1991),第1458-1463页,10.1109/ijcnn.1991.170605
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[11]
Schmidhuber J. 施密德胡伯J.
A possibility for implementing curiosity and boredom in model-building neural controllers
在模型构建神经控制器中实现好奇心和无聊的可能性
Proceedings of the First International Conference on Simulation of Adaptive Behavior, 1 (1991), pp. 5-10
第一届适应性行为模拟国际会议论文集 (Proceedings of the First International Conference on Simulation of Adaptive Behavior),第 1 期(1991 年),第 5-10 页
URL: ftp://ftp.idsia.ch/pub/juergen/curiositysab.pdf
Google Scholar Google 学术搜索
[12]
Eysenbach B., Gupta A., Ibarz J., Levine S.
艾森巴赫B., 古普塔A., 伊巴尔兹J., 莱文S.
Diversity is all you need
多样性就是您所需要的
7th International Conference on Learning Representations, ICLR 2019 (2019)
第七届学习表征国际会议,ICLR 2019 (2019)
URL: https://openreview.net/pdf?id=SJx63jRqFm
Google Scholar Google 学术搜索
[13]
Burda Y., Edwards H., Storkey A., Klimov O.
BurdaY., EdwardsH., StorkeyA., KlimovO.
Exploration by random network distillation
随机网络蒸馏探索
7th International Conference on Learning Representations, ICLR 2019 (2018)
第七届学习表征国际会议,ICLR 2019 (2018)
arXiv:1810.12894 arXiv:1810.12894
Google Scholar Google 学术搜索
[14]
Bellemare M.G., Srinivasan S., Ostrovski G., Schaul T., Saxton D., Munos R.
BellemareM.G., SrinivasanS., OstrovskiG., ShaulT., SaxtonD., MunosR.
Unifying count-based exploration and intrinsic motivation
统一基于计数的探索和内在动机
Conference on Neural Information Processing Systems, NeurIPS 2016 (2016), 10.1002/pola.10609
神经信息处理系统会议,NeurIPS 2016 (2016),10.1002/pola.10609
arXiv:1606.01868 arXiv:1606.01868
View PDF 查看 PDF
Your institution provides access to this article.
您的机构提供对本文的访问。
Google Scholar Google 学术搜索
[15]
Badia A.P., Piot B., Kapturowski S., Sprechmann P., Vitvitskyi A., Guo D., Blundell C.
BadiaA.P., PiotB., KapturowskiS., SprechmannP., VitvitskyiA., GuoD., BlundellC.
Agent57: Outperforming the atari human benchmark
Agent57:超越雅达利人类基准
(2020)
arXiv:2003.13350 arXiv:2003.13350
Google Scholar Google 学术搜索
[16]
Aubret A., Matignon L., Hassas S.
AubretA., MatignonL., HassasS.
A survey on intrinsic motivation in reinforcement learning
强化学习中内在动机的调查
(2019)
ArXiv, Im, arXiv:1908.06976
ArXiv,Im,arXiv:1908.06976
Google Scholar Google 学术搜索
[17]
Li Y. 李伊。
Deep reinforcement learning
深度强化学习
(2018), 10.18653/v1/p18-5007
(2018),10.18653/v1/p18-5007
arXiv:1911.10107 arXiv:1911.10107
View article Google Scholar
查看文章 Google 学术搜索
[18]
Nguyen T.T., Nguyen N.D., Nahavandi S.
NguyenT.T., NguyenN.D., NahavandiS.
Deep reinforcement learning for multi-agent systems: A review of challenges, solutions and applications
多智能体系统的深度强化学习:挑战、解决方案和应用回顾
(2018), pp. 3826-3839 (2018 年),第 3826-3839 页
ArXiv, 50 (9) ArXiv,50 (9)
Google Scholar Google 学术搜索
[19]
Levine S. 莱文。
Reinforcement learning and control as probabilistic inference: Tutorial and review
强化学习和控制作为概率推理:教程和回顾
(2018)
ArXiv, arXiv:1805.00909 ArXiv,arXiv:1805.00909
Google Scholar Google 学术搜索
[20]
Lazaridis A., Fachantidis A., Vlahavas I.
拉扎里迪斯A., 法坎蒂迪斯A., 弗拉哈瓦斯I.
Deep reinforcement learning: A state-of-the-art walkthrough
深度强化学习:最先进的演练
J. Artificial Intelligence Res., 69 (2020)
J. 人工智能研究, 69 (2020)
Google Scholar Google 学术搜索
[21]
Mcfarlane R. 麦克法兰R.
A survey of exploration strategies in reinforcement learning
强化学习中的探索策略综述
(1999), pp. 1-10 (1999年),第1-10页
URL: https://pdfs.semanticscholar.org/0276/1533d794ed9ed5dfd0295f2577e1e98c4fe2.pdf
Google Scholar Google 学术搜索
[22]
Williams R.J. 威廉姆斯R.J.
Simple statistical gradient-following algorithms for connectionist reinforcement learning
用于连接主义强化学习的简单统计梯度跟踪算法
Mach. Learn., 8 (3–4) (1992), pp. 229-256, 10.1007/bf00992696
《机器学习 (Mach. Learn)》,第 8 卷第 3–4 期(1992 年),第 229-256 页,10.1007/bf00992696
View PDF 查看 PDF
This article is free to access.
本文可免费访问。
Google Scholar Google 学术搜索
[23]
Arulkumaran K., Deisenroth M.P., Brundage M., Bharath A.A.
ArulkumaranK., DeisenrothM.P., BrundageM., BharathA.A.
Deep reinforcement learning: A brief survey
深度强化学习:简要调查
IEEE Signal Process. Mag., 34 (6) (2017), pp. 26-38, 10.1109/MSP.2017.2743240
IEEE信号处理。《Mag.》,第 34 卷第 6 期(2017 年),第 26-38 页,10.1109/MSP.2017.2743240
arXiv:arXiv:1708.05866v2 arXiv:arXiv:1708.05866v2
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[24]
Exploration 勘探
(2020)
https://dictionary.cambridge.org/dictionary/english/exploration. (Accessed: 09 April 2020)
Google Scholar Google 学术搜索
[25]
Bellemare M.G., Naddaf Y., Veness J., Bowling M.
BellemareM.G., NaddafY., VenessJ., 保龄球M.
The arcade learning environment: An evaluation platform for general agents
街机学习环境:面向一般代理商的评估平台
J. Artificial Intelligence Res., 47 (2013), pp. 253-279, 10.1613/jair.3912
J. 人工智能研究 (J. Artificial Intelligence Res.),第 47 卷 (2013 年),第 253-279 页,10.1613/jair.3912
arXiv:1207.4708 arXiv:1207.4708
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[26]
Aytar Y., Pfaff T., Budden D., Le Paine T., Wang Z., De Freitas N.
AytarY., PfaffT., BuddenD., Le PaineT., WangZ., De FreitasN.
Playing hard exploration games by watching youtube
通过观看 youtube 玩艰苦的探索游戏
Conference on Neural Information Processing Systems, NeurIPS 2018 (2018), pp. 2930-2941
神经信息处理系统会议,NeurIPS 2018 (2018),第 2930-2941 页
arXiv:1805.11592 arXiv:1805.11592
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[27]
Kempka M., Wydmuch M., Runc G., Toczek J., Jaskowski W.
KempkaM., WydblowM., RuncG., ToczekJ., JaskowskiW.
ViZDoom: A Doom-based AI research platform for visual reinforcement learning
ViZDoom:基于 Doom 的视觉强化学习 AI 研究平台
IEEE Conference on Computatonal Intelligence and Games, CIG (2016), 10.1109/CIG.2016.7860433
IEEE计算机智能与游戏会议,CIG (2016),10.1109/CIG.2016.7860433
arXiv:1605.02097 arXiv:1605.02097
View article Google Scholar
查看文章 Google 学术搜索
[28]
Johnson M., Hofmann K., Hutton T., Bignell D.
约翰逊M., 霍夫曼K., 赫顿T., 比格内尔D.
The malmo platform for artificial intelligence experimentation
用于人工智能实验的马尔默平台
IJCAI International Joint Conference on Artificial Intelligence, Vol. 2016-Janua (2016), pp. 4246-4247
IJCAI 国际人工智能联合会议,2016 年 1 月(2016 年),第 4246-4247 页
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[29]
Todorov E., Erez T., Tassa Y.
MuJoCo: A physics engine for model-based control
IEEE International Conference on Intelligent Robots and Systems, IEEE (2012), pp. 5026-5033, 10.1109/IROS.2012.6386109
View article View in ScopusGoogle Scholar
[30]
Schmidhuber J.
Formal theory of creativity, fun, and intrinsic motivation (1990–2010)
IEEE Trans. Auton. Ment. Dev., 2 (3) (2010), pp. 230-247, 10.1109/TAMD.2010.2056368
arXiv:1510.05840
View article View in ScopusGoogle Scholar
[31]
Ecoffet A., Huizinga J., Lehman J., Stanley K.O., Clune J.
Go-explore: A new approach for hard-exploration problems
(2019), pp. 1-37
arXiv:1901.10995
Google Scholar
[32]
Oudeyer P.-Y., Kaplan F.
What is intrinsic motivation? A typology of computational approaches
Front. Neurorobot., 1 (6) (2007), pp. 1184-1191, 10.3389/neuro.12.006.2007
arXiv:arXiv:1410.5401v2, URL: http://journal.frontiersin.org/article/10.3389/neuro.12.006.2007/abstract
View article Google Scholar
查看文章 Google 学术搜索
[33]
Achiam J., Sastry S. AchiamJ.,萨斯特里S。
Surprise-based intrinsic motivation for deep reinforcement learning
基于惊喜的深度强化学习的内在动机
(2017), pp. 1-13 (2017),第1-13页
arXiv:1703.01732 arXiv:1703.01732
Google Scholar Google 学术搜索
[34]
Li B., Lu T., Li J., Lu N., Cai Y., Wang S.
LiB., LuT., LiJ., LuN., CaiY., WangS.
Curiosity-driven exploration for off-policy reinforcement learning methods
好奇心驱动的非策略强化学习方法探索
IEEE International Conference on Robotics and Biomimetics, ROBIO 2019, December (2019), pp. 1109-1114, 10.1109/ROBIO49542.2019.8961529
IEEE机器人与仿生学国际会议,ROBIO 2019,2019年12月,第1109-1114页,10.1109/ROBIO49542.2019.8961529
View articleView in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[35]
Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y.
GoodfellowI., Pouget-AbadieJ., MirzaM., XuB., Warde-FarleyD., OzairS., CourvilleA., BengioY.
Generative adversarial nets
生成对抗网络
Ghahramani Z., Welling M., Cortes C., Lawrence N., Weinberger K.Q. (Eds.), Conference on Neural Information Processing Systems, NeurIPS 2014, Vol. 27 (2014)
GhahramaniZ., WellingM., CortesC., LawrenceN., WeinbergerK.Q.(编辑),神经信息处理系统会议,NeurIPS 2014,第 27 卷(2014 年)
URL: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
Google Scholar Google 学术搜索
[36]
Hong W., Zhu M., Liu M., Zhang W., Zhou M., Yu Y., Sun P.
洪伟, 朱明, 刘明, 张伟, 周明, 余英, 孙平.
Generative adversarial exploration for reinforcement learning
强化学习的生成对抗性探索
ACM International Conference Proceeding Series (2019), 10.1145/3356464.3357706
ACM国际会议论文集系列(2019),10.1145/3356464.3357706
View article Google Scholar
查看文章 Google 学术搜索
[37]
Stadie B.C., Levine S., Abbeel P.
StadieB.C., LevineS., AbbeelP.
Incentivizing exploration in reinforcement learning with deep predictive models
使用深度预测模型激励强化学习中的探索
(2015), pp. 1-11 (2015),第1-11页
arXiv:1507.00814 arXiv:1507.00814
View article CrossRefGoogle Scholar
查看文章 CrossRefGoogle 学术搜索
[38]
Bougie N., Ichise R. BougieN., IchiseR.
Fast and slow curiosity for high-level exploration in reinforcement learning
强化学习中高层次探索的快慢好奇心
Appl. Intell. (2020), 10.1007/s10489-020-01849-3
应用智能(2020), 10.1007/s10489-020-01849-3
View PDF 查看 PDF
This article is free to access.
本文可免费访问。
Google Scholar Google 学术搜索
[39]
Bougie N., Ichise R. BougieN., IchiseR.
Towards high-level intrinsic exploration in reinforcement learning
迈向强化学习中的高级内在探索
International Joint Conference on Artificial Intelligence, IJCAI-20 (2020)
人工智能国际联合会议,IJCAI-20(2020)
arXiv:1810.12894 arXiv:1810.12894
Google Scholar Google 学术搜索
[40]
Osband I., Aslanides J., Cassirer A.
奥斯班德I., 阿斯兰尼德斯J., 卡西尔A.
Randomized prior functions for deep reinforcement learning
用于深度强化学习的随机先验函数
Conference on Neural Information Processing Systems, NeurIPS 2018 (2018)
神经信息处理系统会议,NeurIPS 2018 (2018)
Google Scholar Google 学术搜索
[41]
Pathak D., Agrawal P., Efros A.A., Darrell T.
PathakD., AgrawalP., EfrosA.A., DarrellT.
Curiosity-driven exploration by self-supervised prediction
基于自监督预测的好奇心驱动探索
Proceedings of the 34th International Conference on Machine Learning (2017), pp. 488-489, 10.1109/CVPRW.2017.70
第 34 届机器学习国际会议论文集 (2017),第 488-489 页,10.1109/CVPRW.2017.70
arXiv:1705.05363 arXiv:1705.05363
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[42]
Burda Y., Edwards H., Pathak D., Storkey A., Darrell T., Efros A.A.
BurdaY., EdwardsH., PathakD., StorkeyA., DarrellT., EfrosA.A.
Large-scale study of curiosity-driven learning
好奇心驱动学习的大规模研究
7th International Conference on Learning Representations, ICLR 2019 (2019)
第七届学习表征国际会议,ICLR 2019 (2019)
arXiv:1808.04355 arXiv:1808.04355
Google Scholar Google 学术搜索
[43]
Raileanu R., Rocktäschel T.
RaileanuR., RocktäschelT.
RIDE: REwarding impact-driven exploration for procedurally-generated environments
RIDE:为程序生成的环境提供影响驱动的探索
8th International Conference on Learning Representations, ICLR 2020 (2020)
第八届学习表征国际会议,ICLR 2020 (2020)
arXiv:2002.12292 arXiv:2002.12292
Google Scholar Google 学术搜索
[44]
Li J., Shi X., Li J., Zhang X., Wang J.
LiJ., ShiX., LiJ., ZhangX., WangJ.
Random curiosity-driven exploration in deep reinforcement learning
深度强化学习中随机的好奇心驱动探索
Neurocomputing, 418 (2020), pp. 139-147, 10.1016/j.neucom.2020.08.024
《神经计算 (Neurocomputing)》,第 418 卷 (2020 年),第 139-147 页,10.1016/j.neucom.2020.08.024
View PDFView articleGoogle Scholar
查看 PDF查看文章Google 学术搜索
[45]
Kim H., Kim J., Jeong Y., Levine S., Song H.O.
KimH., KimJ., JeongY., LevineS., SongH.O.
EMI: EXploration with mutual information
EMI:利用互信息进行探索
36th International Conference on Machine Learning, ICML 2019 (2019), pp. 5837-5851
第 36 届机器学习国际会议,ICML 2019 (2019),第 5837-5851 页
arXiv:1810.01176 arXiv:1810.01176
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[46]
Kim Y., Nam W., Kim H., Kim J.H., Kim G.
KimY., NamW., KimH., KimJ.H., KimG.
Curiosity-bottleneck: Exploration by distilling task-specific novelty
好奇心瓶颈:通过提炼特定任务的新颖性进行探索
36th International Conference on Machine Learning, ICML 2019 (2019), pp. 5861-5874
第 36 届机器学习国际会议,ICML 2019 (2019),第 5861-5874 页
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[47]
Mirowski P., Pascanu R., Viola F., Soyer H., Ballard A.J., Banino A., Denil M., Goroshin R., Sifre L., Kavukcuoglu K., Kumaran D., Hadsell R.
MirowskiP., PascanuR., ViolaF., SoyerH., BallardA.J., BaninoA., DenilM., GoroshinR., SifreL., KavukcuogluK., KumaranD., HadsellR.
Learning to navigate in complex environments
学习在复杂环境中导航
5th International Conference on Learning Representations, ICLR 2017 (2017)
第五届学习表征国际会议,ICLR 2017 (2017)
arXiv:1611.03673 arXiv:1611.03673
Google Scholar Google 学术搜索
[48]
Dhiman V., Banerjee S., Griffin B., Siskind J.M., Corso J.J.
DhimanV., BanerjeeS., GriffinB., SiskindJ.M., CorsoJ.J.
A critical investigation of deep reinforcement learning for navigation
导航深度强化学习的关键研究
(2018)
arXiv:1802.02274 arXiv:1802.02274
Google Scholar Google 学术搜索
[49]
Li B., Lu T., Li J., Lu N., Cai Y., Wang S.
LiB., LuT., LiJ., LuN., CaiY., WangS.
ACDER: AUgmented curiosity-driven experience replay
ACDER:好奇心驱动的体验回放
IEEE International Conference on Robotics and Automation, ICRA 2020 (2020), pp. 4218-4224, 10.1109/ICRA40945.2020.9197421
IEEE机器人与自动化国际会议,ICRA 2020 (2020),第 4218-4224 页,10.1109/ICRA40945.2020.9197421
View articleView in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[50]
Stanton C., Clune J. 斯坦顿C.,克鲁内J。
Deep curiosity search: Intra-life exploration can improve performance on challenging deep reinforcement learning problems
深度好奇心搜索:生命内探索可以提高具有挑战性的深度强化学习问题的性能
(2018)
ArXiv, arXiv:1806.00553 ArXiv,arXiv:1806.00553
Google Scholar Google 学术搜索
[51]
Dean V., Tulsiani S., Gupta A.
DeanV., TulsianiS., GuptaA.
See, hear, explore: Curiosity via audio-visual association
看、听、探索:通过视听联想激发好奇心
(2020)
ArXiv, arXiv:2007.03669 ArXiv,arXiv:2007.03669
Google Scholar Google 学术搜索
[52]
Kolter J.Z., Ng A.Y. KolterJ.Z., NgA.Y.
Near-Bayesian exploration in polynomial time
多项式时间中的近贝叶斯探索
(2009), pp. 513-520 (2009年),第513-520页
View article CrossRefView in ScopusGoogle Scholar
查看文章 CrossRefView in ScopusGoogle 学术搜索
[53]
D. Pathak, D. Gandhi, A. Gupta, Self-supervised exploration via disagreement, in: Proceedings of the 36th International Conference on Machine Learning, 2019.
D. Pathak、D. Gandhi、A. Gupta,通过分歧进行自我监督探索,载于:第 36 届机器学习国际会议论文集,2019 年。
Google Scholar Google 学术搜索
[54]
Still S., Precup D. StillS., PrecupD.
An information-theoretic approach to curiosity-driven reinforcement learning
好奇心驱动的强化学习的信息论方法
Theory Biosci., 131 (3) (2012), pp. 139-148, 10.1007/s12064-011-0142-z
《生物科学理论 (Theory Biosci.)》,第 131 卷第 3 期(2012 年),第 131 卷第 3 期,第 131 卷第 3 期,第 131 卷第 3 期,第139-148, 10.1007/S12064-011-0142-Z
View PDF 查看 PDF
Your institution provides access to this article.
您的机构提供对本文的访问。
View in Scopus 在 Scopus 中查看Google Scholar Google 学术搜索
[55]
Still S. 剧照。
Information theoretic approach to interactive learning
互动学习的信息理论方法
(2009), pp. 1-6 (2009年),第1-6页
Arxiv Arxiv的
View PDF 查看 PDF
Your institution provides access to this article.
您的机构提供对本文的访问。
CrossRef 交叉参考View in Scopus 在 Scopus 中查看Google Scholar Google 学术搜索
[56]
Houthooft R., Chen X., Duan Y., Schulman J., De Turck F., Abbeel P.
HouthooftR., ChenX., DuanY., SchulmanJ., De TurckF., AbbeelP.
VIME: VAriational information maximizing exploration
VIME:虚拟信息最大化探索
Conference on Neural Information Processing Systems, NeurIPS 2016 (2016), pp. 1117-1125
神经信息处理系统会议,NeurIPS 2016 (2016),第 1117-1125 页
arXiv:1605.09674 arXiv:1605.09674
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[57]
Mohamed S., Rezende D.J. MohamedS., RezendeD.J.
Variational information maximisation for intrinsically motivated reinforcement learning
内在动机强化学习的变分信息最大化
Conference on Neural Information Processing Systems, NeurIPS 2015 (2015), pp. 2125-2133
神经信息处理系统会议,NeurIPS 2015 (2015),第 2125-2133 页
arXiv:1509.08731 arXiv:1509.08731
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[58]
De Abril I.M., Kanai R. De AbrilI.M., KanaiR.
Curiosity-driven reinforcement learning with homeostatic regulation
具有稳态调节的好奇心驱动的强化学习
Proceedings of the International Joint Conference on Neural Networks, Vol. 2018-July, no. 1, IEEE (2018), 10.1109/IJCNN.2018.8489075
Proceedings of the International Joint Conference on Neural Networks, Vol. 2018-July, no. 1, IEEE (2018), 10.1109/IJCNN.2018.8489075
arXiv:1801.07440 arXiv:1801.07440
View article Google Scholar
查看文章 Google 学术搜索
[59]
Chien J.-T., Hsu P.-C. ChienJ.-T., HsuP.-C.
Stochastic curiosity maximizing exploration
随机好奇心最大化探索
2020 International Joint Conference on Neural Networks, IJCNN (2020), 10.1109/ijcnn48605.2020.9207295
2020 年神经网络国际联合会议, IJCNN (2020), 10.1109/ijcnn48605.2020.9207295
View article Google Scholar
查看文章 Google 学术搜索
[60]
Savinov N., Raichuk A., Marinier R., Vincent D., Pollefeys M., Lillicrap T., Gelly S.
SavinovN., RaichukA., MarinierR., VincentD., PollefeysM., LillicrapT., GellyS.
Episodic curiosity through reachability
通过可达性满足偶发的好奇心
7th International Conference on Learning Representations, ICLR 2019 (2019), pp. 1-20
第七届学习表征国际会议,ICLR 2019 (2019),第 1-20 页
arXiv:1810.02274 arXiv:1810.02274
View article CrossRefGoogle Scholar
查看文章 CrossRefGoogle 学术搜索
[61]
Ménard P., Domingues O.D., Jonsson A., Kaufmann E., Leurent E., Valko M.
MénardP., DominguesO.D., JonssonA., KaufmannE., LeurentE., ValkoM.
Fast active learning for pure exploration in reinforcement learning
用于强化学习中纯探索的快速主动学习
(2020), pp. 1-36 (2020),第1-36页
arXiv:2007.13442 arXiv:2007.13442
Google Scholar Google 学术搜索
[62]
Tang H., Houthooft R., Foote D., Stooke A., Chen X., Duan Y., Schulman J., De Turck F., Abbeel P.
TangH., HouthooftR., FooteD., StookeA., ChenX., DuanY., SchulmanJ., De TurckF., AbbeelP.
Exploration: A study of count-based exploration for deep reinforcement learning
探索:基于计数的深度强化学习探索研究
Conference on Neural Information Processing Systems, NeurIPS 2017 (2017), pp. 2754-2763
神经信息处理系统会议,NeurIPS 2017 (2017),第 2754-2763 页
arXiv:1611.04717 arXiv:1611.04717
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[63]
Charikar M.S. 查里卡M.S.
Similarity estimation techniques from rounding algorithms
舍入算法的相似度估计技术
Conference Proceedings of the Annual ACM Symposium on Theory of Computing (2002), pp. 380-388, 10.1145/509907.509965
ACM 计算理论年度研讨会论文集 (2002),第 380-388 页,10.1145/509907.509965
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[64]
Choi J., Guo Y., Moczulski M., Oh J., Wu N., Norouzi M., Lee H.
ChoiJ., GuoY., MoczulskiM., OhJ., WuN., NorouziM., LeeH.
Contingency-aware exploration in reinforcement learning
强化学习中的权变感知探索
7th International Conference on Learning Representations, ICLR 2019 (2019), pp. 1-19
第七届学习表征国际会议,ICLR 2019 (2019),第 1-19 页
arXiv:1811.01483 arXiv:1811.01483
Google Scholar Google 学术搜索
[65]
Machado M.C., Bellemare M.G., Bowling M.
MachadoM.C., BellemareM.G., 保龄球M.
Count-based exploration with the successor representation
基于计数的探索,具有后继表示
AAAI Conference on Artificial Intelligence (2020), 10.1609/aaai.v34i04.5955
AAAI人工智能会议(2020),10.1609/aaai.v34i04.5955
arXiv:1807.11622 arXiv:1807.11622
View article Google Scholar
查看文章 Google 学术搜索
[66]
Zhao R., Tresp V. ZhaoR., TrespV.
Curiosity-driven experience prioritization via density estimation
通过密度估计确定好奇心驱动的体验优先级
(2019)
ArXiv, arXiv:1902.08039 ArXiv,arXiv:1902.08039
Google Scholar Google 学术搜索
[67]
Ostrovski G., Bellemare M.G., Van Den Oord A., Munos R.
奥斯特洛夫斯基G., BellemareM.G., Van Den OordA., MunosR.
Count-based exploration with neural density models
使用神经密度模型进行基于计数的探索
34th International Conference on Machine Learning, Vol. 6, ICML 2017 (2017), pp. 4161-4175
第 34 届机器学习国际会议,第 6 卷,ICML 2017 (2017),第 4161-4175 页
arXiv:1703.01310 arXiv:1703.01310
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[68]
Martin J., Narayanan S.S., Everitt T., Hutter M.
MartinJ., NarayananS.S., EverittT., HutterM.
Count-based exploration in feature space for reinforcement learning
基于计数的特征空间探索,用于强化学习
IJCAI International Joint Conference on Artificial Intelligence (2017), pp. 2471-2478, 10.24963/ijcai.2017/344
IJCAI人工智能国际联合会议(2017),第2471-2478页,10.24963/ijcai.2017/344
arXiv:arXiv:1706.08090v1 arXiv:arXiv:1706.08090v1
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[69]
Malisiewicz T., Gupta A., Efros A.A.
MalisiewiczT., GuptaA., EfrosA.A.
Ensemble of exemplar-SVMs for object detection and beyond
用于对象检测及其他功能的示例 SVM 集合
Proceedings of the IEEE International Conference on Computer Vision, IEEE (2011), pp. 89-96, 10.1109/ICCV.2011.6126229
IEEE计算机视觉国际会议论文集,IEEE(2011),第89-96页,10.1109/ICCV.2011.6126229
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[70]
Fu J., Co-Reyes J.D., Levine S.
FuJ., Co-ReyesJ.D., LevineS.
Ex2: Exploration with exemplar models for deep reinforcement learning
示例2:使用示例模型进行深度强化学习的探索
Conference on Neural Information Processing Systems, NeurIPS 2017 (2017), pp. 2578-2588
神经信息处理系统会议,NeurIPS 2017 (2017),第 2578-2588 页
arXiv:1703.01260 arXiv:1703.01260
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[71]
Badia A.P., Sprechmann P., Vitvitskyi A., Guo D., Piot B., Kapturowski S., Tieleman O., Arjovsky M., Pritzel A., Bolt A., Blundell C.
BadiaA.P., SprechmannP., VitvitskyiA., GuoD., PiotB., KapturowskiS., TielemanO., ArjovskyM., PritzelA., BoltA., BlundellC.
Never give up: Learning directed exploration strategies
永不放弃:学习定向探索策略
8th International Conference on Learning Representations, ICLR 2020 (2020)
第八届学习表征国际会议,ICLR 2020 (2020)
arXiv:2002.06038 arXiv:2002.06038
Google Scholar Google 学术搜索
[72]
Such F.P., Madhavan V., Conti E., Lehman J., Stanley K.O., Clune J.
SuchF.P., MadhavanV., ContiE., LehmanJ., StanleyK.O., CluneJ.
Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning
深度神经进化:遗传算法是训练深度神经网络进行强化学习的竞争性替代方案
(2017), 10.48550/ARXIV.1712.06567
(2017),10.48550/ARXIV.1712.06567
arXiv:1712.06567 arXiv:1712.06567
View article Google Scholar
查看文章 Google 学术搜索
[73]
Salimans T., Ho J., Chen X., Sidor S., Sutskever I.
SalimansT., HoJ., ChenX., SidorS., SutskeverI.
Evolution strategies as a scalable alternative to reinforcement learning
进化策略作为强化学习的可扩展替代方案
(2017), pp. 476-485, 10.1109/ICSTW.2011.58
(2017),第476-485页,10.1109/ICSTW.2011.58
ArXiv, arXiv:1703.03864v2
ArXiv,arXiv:1703.03864v2
View article Google Scholar
查看文章 Google 学术搜索
[74]
Lehman J., Stanley K.O. 雷曼兄弟J., StanleyK.O.
Abandoning objectives: Evolution through the search for novelty alone
摒弃目标:通过只追求新奇事物来进化
Evol. Comput., 19 (2) (2011), pp. 189-222, 10.1162/EVCO_a_00025
卷。《计算机 (Comput.)》,第 19 卷第 2 期(2011 年),第 189-222 页,10.1162/EVCO_a_00025
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[75]
Risi S., Vanderbleek S.D., Hughes C.E., Stanley K.O.
RisiS., VanderbleekS.D., HughesC.E., StanleyK.O.
How novelty search escapes the deceptive trap of learning to learn
新颖性搜索如何摆脱学习学习的欺骗性陷阱
Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation - GECCO ’09 (2009), p. 153, 10.1145/1569901.1569923
Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation - GECCO '09 (2009), p. 153, 10.1145/1569901.1569923
URL: http://portal.acm.org/citation.cfm?doid=1569901.1569923
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[76]
Conti E., Madhavan V., Such F.P., Lehman J., Stanley K.O., Clune J.
ContiE., MadhavanV., SuchF.P., LehmanJ., StanleyK.O., CluneJ.
Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents
通过一群寻求新奇的智能体来改进深度强化学习的进化策略探索
Conference on Neural Information Processing Systems, NeurIPS 2018 (2018), pp. 5027-5038
神经信息处理系统会议,NeurIPS 2018 (2018),第 5027-5038 页
arXiv:1712.06560 arXiv:1712.06560
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[77]
Gravina D., Liapis A., Yannakakis G.N.
GravinaD., LiapisA., YannakakisG.N.
Quality diversity through surprise
惊喜带来品质多样性
(2018), pp. 1-14, 10.1109/TEVC.2018.2877215
(2018),第 1-14 页,10.1109/TEVC.2018.2877215
arXiv:1807.02397 arXiv:1807.02397
View articleGoogle Scholar
查看文章 Google 学术搜索
[78]
Mouret J.B., Doncieux S. MouretJ.B., DoncieuxS.
Encouraging behavioral diversity in evolutionary robotics: An empirical study
鼓励进化机器人中的行为多样性:一项实证研究
Evol. Comput., 20 (1) (2012), pp. 91-133, 10.1162/EVCO_a_00048
卷。《计算机 (Comput.)》,第 20 卷第 1 期(2012 年),第 91-133 页,10.1162/EVCO_a_00048
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[79]
Pugh J.K., Soros L.B., Stanley K.O.
PughJ.K., SorosL.B., StanleyK.O.
Quality diversity: A new frontier for evolutionary computation
质量多样性:进化计算的新前沿
Front. Robot. AI, 3 (July) (2016), 10.3389/frobt.2016.00040
前面。机器人。AI, 3 (July) (2016), 10.3389/frobt.2016.00040
URL: http://journal.frontiersin.org/Article/10.3389/frobt.2016.00040/abstract
View article Google Scholar
查看文章 Google 学术搜索
[80]
Z.W. Hong, T.Y. Shann, S.Y. Su, Y.H. Chang, C.Y. Lee, Diversity-Driven Exploration Strategy for Deep Reinforcement Learning, in: 6th International Conference on Learning Representations, ICLR 2018 - Workshop Track Proceedings, 2018.
Z.W. Hong, T.Y. Shann, S.Y. Su, Y.H. Chang, C.Y. Lee, Diversity-Driven Exploration Strategy for Deep Reinforcement Learning, in: 6th International Conference on Learning Representations, ICLR 2018 - Workshop Track Proceedings, 2018.
Google Scholar Google 学术搜索
[81]
Cohen A., Yu L., Qiao X., Tong X.
CohenA., YuL., QiaoX., TongX.
Maximum entropy diverse exploration: Disentangling maximum entropy reinforcement learning
最大熵多样化探索:解开最大熵强化学习
(2019)
arXiv:1911.00828 arXiv:1911.00828
Google Scholar Google 学术搜索
[82]
Pong V.H., Dalal M., Lin S., Nair A., Bahl S., Levine S.
PongV.H., DalalM., LinS., NairA., BahlS., LevineS.
Skew-fit: State-covering self-supervised reinforcement learning
Skew-fit:状态覆盖自监督强化学习
(2019)
arXiv:1903.03698 arXiv:1903.03698
Google Scholar Google 学术搜索
[83]
T. Gangwani, Q. Liu, J. Peng, Learning Self-Imitating Diverse Policies, in: 7th International Conference on Learning Representations, ICLR 2019, 2019.
T. Gangwani, Q. Liu, J. Peng, 学习自我模仿多样化政策, 载于:第七届学习表征国际会议, ICLR 2019, 2019.
Google Scholar Google 学术搜索
[84]
Ecoffet A., Huizinga J., Lehman J., Stanley K.O., Clune J.
EcoffetA., HuizingaJ., LehmanJ., StanleyK.O., CluneJ.
First return then explore
先返回,再探索
(2020), pp. 1-46 (2020),第1-46页
arXiv:2004.12919 arXiv:2004.12919
View article CrossRefGoogle Scholar
查看文章 CrossRefGoogle 学术搜索
[85]
Matheron G., Perrin N., Sigaud O.
MatheronG., PerrinN., SigaudO.
PBCS: EFficient exploration and exploitation using a synergy between reinforcement learning and motion planning
PBCS:利用强化学习和运动规划之间的协同作用进行 EFficient 探索和开发
ICANN 2020, Vol. 12397 LNCS, Springer International Publishing (2020), pp. 295-307, 10.1007/978-3-030-61616-8_24
ICANN 2020 年,第 12397 卷 LNCS,施普林格国际出版社 (2020 年),第 295-307 页,10.1007/978-3-030-61616-8_24
arXiv:2004.11667 arXiv:2004.11667
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[86]
Guo Z.D., Brunskill E. 郭Z.D., BrunskillE.
Directed exploration for reinforcement learning
强化学习的定向探索
(2019)
arXiv:1906.07805 arXiv:1906.07805
Google Scholar Google 学术搜索
[87]
Guo Y., Choi J., Moczulski M., Bengio S., Norouzi M., Lee H.
GuoY., ChoiJ., MoczulskiM., BengioS., NorouziM., LeeH.
Self-imitation learning via trajectory-conditioned policy for hard-exploration tasks
基于轨迹条件策略的自我模仿学习,用于艰苦探索任务
(2019), pp. 1-22 (2019),第1-22页
arXiv:1907.10247 arXiv:1907.10247
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[88]
J. Oh, Y. Guo, S. Singh, H. Lee, Self-Imitation Learning, in: Proceedings of the 35th International Conference on Machine Learning, 2018.
J. Oh, Y. Guo, S. Singh, H. Lee, Self-Imitation Learning, in: Proceedings of the 35th International Conference on Machine Learning, 2018.
Google Scholar Google 学术搜索
[89]
Y. Guo, J. Choi, M. Moczulski, S. Feng, S. Bengio, M. Norouzi, H. Lee, Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards, in: Conference on Neural Information Processing Systems, NeurIPS 2020, 2020.
Y. Guo、J. Choi、M. Moczulski、S. Feng、S. Bengio、M. Norouzi、H. Lee,基于记忆的轨迹条件策略,用于从稀疏奖励中学习,载于:神经信息处理系统会议,NeurIPS 2020,2020。
Google Scholar Google 学术搜索
[90]
Liu E.Z., Keramati R., Seshadri S., Guu K., Pasupat P., Brunskill E., Liang P.
LiuE.Z., KeramatiR., SeshadriS., GuuK., PasupatP., BrunskillE., LiangP.
Learning abstract models for strategic exploration and fast reward transfer
学习抽象模型,实现战略探索和快速奖励转移
(2020)
arXiv:2007.05896 arXiv:2007.05896
Google Scholar Google 学术搜索
[91]
Edwards A.D., Downs L., Davidson J.C.
爱德华兹A.D.,唐斯L.,戴维森J.C.
Forward-backward reinforcement learning
前向-后向强化学习
(2018)
arXiv:1803.10227 arXiv:1803.10227
Google Scholar Google 学术搜索
[92]
Florensa C., Held D., Wulfmeier M., Zhang M., Abbeel P.
FlorensaC., HeldD., WulfmeierM., ZhangM., AbbeelP.
Reverse curriculum generation for reinforcement learning
强化学习的反向课程生成
(2017)
CoRL, arXiv:1707.05300 CoRL,arXiv:1707.05300
Google Scholar Google 学术搜索
[93]
Forestier S., Mollard Y., Oudeyer P.Y.
ForestierS., MollardY., OudeyerP.Y.
Intrinsically motivated goal exploration processes with automatic curriculum learning
具有自动课程学习的内在动机目标探索过程
(2017), pp. 1-33 (2017),第1-33页
ArXiv, arXiv:1708.02190 ArXiv,arXiv:1708.02190
Google Scholar Google 学术搜索
[94]
Colas C., Founder P., Sigaud O., Chetouani M., Oudeyer P.Y.
ColasC., FounderP., SigaudO., ChetouaniM., OudeyerP.Y.
CURIOUS: INtrinsically motivated modular multi-goal reinforcement learning
好奇:内在动机的模块化多目标强化学习
36th International Conference on Machine Learning, ICML 2019 (2019), pp. 2372-2387
第 36 届机器学习国际会议,ICML 2019 (2019),第 2372-2387 页
arXiv:1810.06284 arXiv:1810.06284
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[95]
Péré A., Forestier S., Sigaud O., Oudeyer P.Y.
PéréA., ForestierS., SigaudO., OudeyerP.Y.
Unsupervised learning of goal spaces for intrinsically motivated goal exploration
目标空间的无监督学习,用于内在动机的目标探索
6th International Conference on Learning Representations, ICLR 2018 (2018), pp. 1-26
第六届学习表征国际会议,ICLR 2018 (2018),第 1-26 页
arXiv:1803.00781 arXiv:1803.00781
Google Scholar Google 学术搜索
[96]
Vezhnevets A.S., Osindero S., Schaul T., Heess N., Jaderberg M., Silver D., Kavukcuoglu K.
VezhnevetsA.S., OsinderoS., SchaulT., HeessN., JaderbergM., SilverD., KavukcuogluK.
Feudal networks for hierarchical reinforcement learning
用于分层强化学习的封建网络
34th International Conference on Machine Learning, vol. 7, ICML 2017 (2017), pp. 5409-5418
第 34 届机器学习国际会议,第 7 卷,ICML 2017 (2017),第 5409-5418 页
arXiv:1703.01161 arXiv:1703.01161
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[97]
T. Hester, P. Stone, Learning Exploration Strategies in Model-Based Reinforcement Learning, in: Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems, AAAI, 2013.
T. Hester, P. Stone, Learning Exploration Strategies in Model-Based Reinforcement Learning, in: Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems, AAAI, 2013.
Google Scholar Google 学术搜索
[98]
Kulkarni T.D., Narasimhan K.R., Saeedi A., Tenenbaum J.B.
库尔卡尼T.D., 纳拉辛汉K.R., SaeediA., TenenbaumJ.B.
Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation
分层深度强化学习:整合时间抽象和内在动机
Conference on Neural Information Processing Systems, NeurIPS 2016 (2016), 10.1162/NECO
神经信息处理系统会议,NeurIPS 2016 (2016),10.1162/NECO
arXiv:NIHMS150003 arXiv:NIHMS150003
View article Google Scholar
查看文章 Google 学术搜索
[99]
M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess, T. Springenberg, Learning by Playing - Solving Sparse Reward Tasks From Scratch, in: Proceedings of the 35th International Conference on Machine Learning, 2018.
M. Riedmiller、R. Hafner、T. Lampe、M. Neunert、J. Degrave、T. van de Wiele、V. Mnih、N. Heess、T. Springenberg,通过游戏学习 - 从头开始解决稀疏奖励任务,载于:第 35 届机器学习国际会议论文集,2018 年。
Google Scholar Google 学术搜索
[100]
Ghafoorian M., Taghizadeh N., Beigy H.
GhafoorianM., TaghizadehN., BeigyH.
Automatic abstraction in reinforcement learning using ant system algorithm
使用蚂蚁系统算法进行强化学习中的自动抽象
AAAI Spring Symposium - Technical Report, Vol. SS-13-05 (2013), pp. 9-14
AAAI 春季研讨会 - 技术报告,第 SS-13-05 卷(2013 年),第 9-14 页
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[101]
Machado M.C., Bellemare M.G., Bowling M.
MachadoM.C., BellemareM.G., 保龄球M.
A Laplacian framework for option discovery in reinforcement learning
强化学习中选项发现的拉普拉斯框架
34th International Conference on Machine Learning, Vol. 5, ICML 2017 (2017), pp. 3567-3582
第 34 届机器学习国际会议,第 5 卷,ICML 2017 (2017),第 3567-3582 页
arXiv:1703.00956 arXiv:1703.00956
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[102]
Zaki M.J., Meira W. Jr. 扎基M.J., MeiraW.Jr.
Data Mining and Analysis: Fundamental Concepts and Algorithms
数据挖掘与分析:基本概念和算法
Cambridge University Press (2014), 10.1017/CBO9780511810114
剑桥大学出版社 (2014), 10.1017/CBO9780511810114
View article Google Scholar
查看文章 Google 学术搜索
[103]
Machado M.C., Rosenbaum C., Guo X., Liu M., Tesauro G., Campbell M.
MachadoM.C., RosenbaumC., GuoX., LiuM., TesauroG., CampbellM.
Eigenoption discovery through the deep successor representation
通过深度后继表示发现特征选项
6th International Conference on Learning Representations, ICLR 2018 (2018)
第六届学习表征国际会议,ICLR 2018 (2018)
arXiv:arXiv:1710.11089v3 arXiv:arXiv:1710.11089v3
Google Scholar Google 学术搜索
[104]
Fang K., Zhu Y., Savarese S., Fei-Fei L.
FangK., ZhuY., SavareseS., Fei-FeiL.
Adaptive procedural task generation for hard-exploration problems
针对困难探索问题的自适应程序任务生成
ICLR 2021 (2020) ICLR 2021 (2020年)
arXiv:2007.00350, submitted for publication
arXiv:2007.00350,提交发表
Google Scholar Google 学术搜索
[105]
Guestrin C., Patrascu R., Schuurmans D.
GuestrinC., PatrascuR., SchuurmansD.
Algorithm-directed exploration for model-based reinforcement learning in factored mdps
因子 mdps 中基于模型的强化学习的算法导向探索
Machine Learning International Workshop (2002), pp. 235-242
机器学习国际研讨会(2002 年),第 235-242 页
URL: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Algorithm-Directed+Exploration+for+Model-Based+Reinforcement+Learning+in+Factored+MDPs#0
Google Scholar Google 学术搜索
[106]
Abel D., Agarwal A., Diaz F., Krishnamurthy A., Schapire R.E.
AbelD., AgarwalA., DiazF., KrishnamurthyA., SchapireR.E.
Exploratory gradient boosting for reinforcement learning in complex domains
复杂领域中强化学习的探索性梯度提升
(2016)
arXiv:1603.04119 arXiv:1603.04119
Google Scholar Google 学术搜索
[107]
Kovač G., Laversanne-Finot A., Oudeyer P.-Y.
KovačG., Laversanne-FinotA., OudeyerP.-Y.
GRIMGEP: LEarning progress for robust goal sampling in visual deep reinforcement learning
GRIMGEP:在视觉深度强化学习中实现稳健目标采样的进展
(2020), pp. 1-15 (2020),第1-15页
CoRL, arXiv:2008.04388v1 CoRL,arXiv:2008.04388v1
View article CrossRefView in ScopusGoogle Scholar
查看文章 CrossRefView in ScopusGoogle 学术搜索
[108]
Osband I., Van Roy B. 奥斯班德I.,范罗伊B.
Why is posterior sampling better than optimism for reinforcement learning?
为什么后验抽样比乐观的强化学习更好?
34th International Conference on Machine Learning, ICML 2017 (2017), pp. 4133-4148
第 34 届机器学习国际会议,ICML 2017 (2017),第 4133-4148 页
arXiv:1607.00215 arXiv:1607.00215
View in ScopusGoogle Scholar
在 Scopus 中查看Google 学术搜索
[109]
Jung T., Stone P. 荣格,斯通P.
Gaussian Processes for sample efficient reinforcement learning with rmax-like exploration
高斯过程,用于样本高效强化学习和类似rmax的探索
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6321 LNAI (2010), pp. 601-616, 10.1007/978-3-642-15880-3_44
《计算机科学讲义》(包括《人工智能讲义》和《生物信息学讲义》),第 6321 卷 LNAI (2010),第 601-616 页,10.1007/978-3-642-15880-3_44
arXiv:1201.6604 arXiv:1201.6604
View PDF 查看 PDF
This article is free to access.
本文可免费访问。
View in Scopus 在 Scopus 中查看Google Scholar Google 学术搜索
[110]
Xie C., Patil S., Moldovan T., Levine S., Abbeel P.
XieC., PatilS., MoldovanT., LevineS., AbbeelP.
Model-based reinforcement learning with parametrized physical models and optimism-driven exploration
基于模型的强化学习,包括参数化物理模型和乐观驱动的探索
IEEE International Conference on Robotics and Automation, ICRA 2016, IEEE (2016), pp. 504-511, 10.1109/ICRA.2016.7487172
IEEE机器人与自动化国际会议,ICRA 2016,IEEE(2016),第504-511页,10.1109/ICRA.2016.7487172
arXiv:1509.06824 arXiv:1509.06824
View article View in ScopusGoogle Scholar
查看文章 在 Scopus 中查看Google 学术搜索
[111]
D’Eramo C., Cini A., Restelli M.
Exploiting action-value uncertainty to drive exploration in reinforcement learning
Proceedings of the International Joint Conference on Neural Networks, IJCNN 2019, IEEE (2019), 10.1109/IJCNN.2019.8852326
[112]
Osband I., Van Roy B., Wen Z.
Generalization and exploration via randomized value functions
33rd International Conference on Machine Learning (2016), pp. 3540-3561
arXiv:1402.0635
[113]
Osband I., Van Roy B., Russo D.J., Wen Z.
Deep exploration via randomized value functions
J. Mach. Learn. Res., 20 (2019), pp. 1-62
arXiv:1703.07608
[114]
Tang Y., Agrawal S.
Exploration by distributional reinforcement learning
IJCAI International Joint Conference on Artificial Intelligence, Vol. 2018-July (2018), pp. 2710-2716, 10.24963/ijcai.2018/376
arXiv:1805.01907
[115]
Colas C., Sigaud O., Oudeyer P.-Y.
GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms
Proceedings of the 35th International Conference on Machine Learning, ICML 2018 (2018)
[116]
Janz D., Hron J., Mazur P., Hofmann K., Hernández-Lobato J.M., Tschiatschek S.
Successor uncertainties: Exploration and uncertainty in temporal difference learning
Conference on Neural Information Processing Systems, Vol. 33, NeurIPS 2019 (2019)
arXiv:1810.06530
[117]
Stulp F.
Adaptive exploration for continual reinforcement learning
IEEE International Conference on Intelligent Robots and Systems, IROS 2012, IEEE (2012), pp. 1631-1636, 10.1109/IROS.2012.6385818
[118]
Akiyama T., Hachiya H., Sugiyama M.
Efficient exploration through active learning for value function approximation in reinforcement learning
Neural Netw., 23 (5) (2010), pp. 639-648, 10.1016/j.neunet.2009.12.010
[119]
Strens M.
A Bayesian framework for reinforcement learning
Proceedings of the 17th International Conference on Machine Learning (2000), pp. 943-950
URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1701&rep=rep1&type=pdf
[120]
Guez A., Silver D., Dayan P.
Efficient Bayes-adaptive reinforcement learning using sample-based search
Conference on Neural Information Processing Systems, NeurIPS 2012 (2012), pp. 1025-1033
arXiv:1205.3109
[121]
O’Donoghue B., Osband I., Munos R., Mnih V.
The uncertainty Bellman equation and exploration
35th International Conference on Machine Learning, Vol. 9, ICML 2018 (2018), pp. 6154-6173
arXiv:1709.05380
[122]
Nikolov N., Kirschner J., Berkenkamp F., Krause A.
Information-directed exploration for deep reinforcement learning
7th International Conference on Learning Representations, ICLR 2019 (2019)
arXiv:1812.07544
[123]
Osband I., Blundell C., Pritzel A., Van Roy B.
Deep exploration via bootstrapped DQN
Conference on Neural Information Processing Systems, NeurIPS 2016 (2016)
[124]
Pearce T., Anastassacos N., Zaki M., Neely A.
Bayesian inference with anchored ensembles of neural networks, and application to exploration in reinforcement learning
Exploration in Reinforcement Learning Workshop at the 35th International Conference on Machine Learning (2018)
arXiv:1805.11324
[125]
Shyam P., Jaskowski W., Gomez F.
Model-based active exploration
36th International Conference on Machine Learning, ICML 2019 (2019), pp. 10136-10152
arXiv:1810.12162
[126]
Henaff M.
Explicit explore-exploit algorithms in continuous state spaces
Conference on Neural Information Processing Systems, NeurIPS 2019 (2019)
arXiv:1911.00617
[127]
Vecerik M., Hester T., Scholz J., Wang F., Pietquin O., Piot B., Heess N., Rothörl T., Lampe T., Riedmiller M.
Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards
(2017)
arXiv:1707.08817
[128]
Hester T., Schaul T., Sendonaris A., Vecerik M., Piot B., Osband I., Pietquin O., Horgan D., Dulac-Arnold G., Lanctot M., Quan J., Agapiou J., Leibo J.Z., Gruslys A.
Deep Q-learning from demonstrations
32nd AAAI Conference on Artificial Intelligence, AAAI 2018 (2018), pp. 3223-3230
arXiv:1704.03732
[129]
Gulcehre C., Paine T.L., Shahriari B., Denil M., Hoffman M., Soyer H., Tanburn R., Kapturowski S., Rabinowitz N., Williams D., Barth-Maron G., Wang Z., de Freitas N.
Making efficient use of demonstrations to solve hard exploration problems
8th International Conference on Learning Representations, ICLR 2020 (2020)
[130]
Nair A., McGrew B., Andrychowicz M., Zaremba W., Abbeel P.
Overcoming exploration in reinforcement learning with demonstrations
Proceedings - IEEE International Conference on Robotics and Automation, IEEE (2018), pp. 6292-6299, 10.1109/ICRA.2018.8463162
arXiv:1709.10089
[131]
Salimans T., Chen R.
Learning Montezuma’s Revenge from a single demonstration
Conference on Neural Information Processing Systems (2018)
arXiv:1812.03381
[132]
Subramanian K., Isbell C.L., Thomaz A.L.
Exploration from demonstration for interactive reinforcement learning
Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS (2016), pp. 447-456
[133]
García J., Fernández F.
A comprehensive survey on safe reinforcement learning
J. Mach. Learn. Res., 16 (2015)
[134]
García J., Fernández F.
Safe exploration of state and action spaces in reinforcement learning
J. Artificial Intelligence Res., 45 (2012), pp. 515-564, 10.1613/jair.3761
[135]
Dalal G., Dvijotham K., Vecerik M., Hester T., Paduraru C., Tassa Y.
Safe exploration in continuous action spaces
(2018)
arXiv:1801.08757
[136]
Garcelon E., Ghavamzadeh M., Lazaric A., Pirotta M.
Conservative exploration in reinforcement learning
Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (2020)
arXiv:2002.03218
[137]
Hunt N., Fulton N., Magliacane S., Hoang N., Das S., Solar-Lezama A.
Verifiably safe exploration for end-to-end reinforcement learning
(2020)
arXiv:2007.01223
[138]
Saunders W., Stuhlmüller A., Sastry G., Evans O.
Trial without error: Towards safe reinforcement learning via human intervention
Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS (2018), pp. 2067-2069
arXiv:1707.05173
[139]
Turchetta M., Berkenkamp F., Krause A.
Safe exploration in finite Markov decision processes with Gaussian processes
Conference on Neural Information Processing Systems, NeurIPS 2016 (2016), pp. 4312-4320
arXiv:1606.04753
[140]
Gao C., Kartal B., Hernandez-Leal P., Taylor M.E.
On hard exploration for reinforcement learning: A case study in Pommerman
Fifteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (2019)
arXiv:1907.11788
[141]
Lipton Z.C., Azizzadenesheli K., Kumar A., Li L., Gao J., Deng L.
Combating reinforcement learning’s Sisyphean curse with intrinsic fear
(2016)
arXiv:1611.01211
[142]
Fatemi M., Sharma S., van Seijen H., Kahou S.E.
Dead-ends and secure exploration in reinforcement learning
36th International Conference on Machine Learning, ICML 2019 (2019), pp. 3315-3323
[143]
Karimpanal T.G., Rana S., Gupta S., Tran T., Venkatesh S.
Learning transferable domain priors for safe exploration in reinforcement learning
(2020), pp. 1-10, 10.1109/ijcnn48605.2020.9207344
arXiv:1909.04307
[144]
Patrascu R., Stacey D.
Adaptive exploration in reinforcement learning
Proceedings of the International Joint Conference on Neural Networks, Vol. 4 (1999), pp. 2276-2281, 10.1109/ijcnn.1999.833417
[145]
Tokic M.
Adaptive ε-greedy exploration in reinforcement learning based on value differences
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6359 LNAI (2010), pp. 203-210, 10.1007/978-3-642-16111-7_23
[146]
Usama M., Chang D.E.
Learning-driven exploration for reinforcement learning
(2019)
arXiv:1906.06890
[147]
Shani L., Efroni Y., Mannor S.
Exploration conscious reinforcement learning revisited
36th International Conference on Machine Learning, ICML 2019 (2019), pp. 9986-10012
arXiv:1812.05551
[148]
Khamassi M., Velentzas G., Tsitsimis T., Tzafestas C.
Active exploration and parameterized reinforcement learning applied to a simulated human-robot interaction task
2017 1st IEEE International Conference on Robotic Computing, IRC 2017, IEEE (2017), pp. 28-35, 10.1109/IRC.2017.33
[149]
Chang H.S.
An ant system based exploration-exploitation for reinforcement learning
Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics, Vol. 4 (2004), pp. 3805-3810, 10.1109/ICSMC.2004.1400937
[150]
Grossberg S.
Competitive learning: From interactive activation to adaptive resonance
Cogn. Sci., 11 (1) (1987), pp. 23-63, 10.1016/S0364-0213(87)80025-3
[151]
Teng T.H., Tan A.H.
Knowledge-based exploration for reinforcement learning in self-organizing neural networks
Proceedings - 2012 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2012, IEEE (2012), pp. 332-339, 10.1109/WI-IAT.2012.154
[152]
Wang P., Zhou W.J., Wang D., Tan A.H.
Probabilistic guided exploration for reinforcement learning in self-organizing neural networks
Proceedings - 2018 IEEE International Conference on Agents, ICA 2018, IEEE (2018), pp. 109-112, 10.1109/AGENTS.2018.8460067
[153]
Rückstieß T., Sehnke F., Schaul T., Wierstra D., Sun Y., Schmidhuber J.
Exploring parameter space in reinforcement learning
J. Behav. Robot., 1 (1) (2010), pp. 14-24, 10.2478/s13230-010-0002-4
[154]
Shibata K., Sakashita Y.
Reinforcement learning with internal-dynamics-based exploration using a chaotic neural network
Proceedings of the International Joint Conference on Neural Networks, IJCNN 2015, IEEE (2015), 10.1109/IJCNN.2015.7280430
[155]
Fortunato M., Azar M.G., Piot B., Menick J., Hessel M., Osband I., Graves A., Mnih V., Munos R., Hassabis D., Pietquin O., Blundell C., Legg S.
Noisy networks for exploration
6th International Conference on Learning Representations, ICLR 2018 (2018), pp. 1-21
arXiv:1706.10295
[156]
Plappert M., Houthooft R., Dhariwal P., Sidor S., Chen R.Y., Chen X., Asfour T., Abbeel P., Andrychowicz M.
Parameter space noise for exploration
6th International Conference on Learning Representations, ICLR 2018 (2018), pp. 1-18
arXiv:1706.01905
[157]
Rückstieß T., Felder M., Schmidhuber J.
State-dependent exploration for policy gradient methods
Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5212 LNAI (2008), pp. 234-249, 10.1007/978-3-540-87481-2_16
[158]
Osband I., Doron Y., Hessel M., Aslanides J., Sezener E., Saraiva A., McKinney K., Lattimore T., Szepesvari C., Singh S., van Roy B., Sutton R., Silver D., van Hasselt H.
Behaviour suite for reinforcement learning
8th International Conference on Learning Representations, ICLR 2020 (2020)
arXiv:1908.03568
[159]
Jaegle A., Mehrpour V., Rust N.
Visual novelty, curiosity, and intrinsic reward in machine learning and the brain
Curr. Opin. Neurobiol., 58 (2019), pp. 167-174, 10.1016/j.conb.2019.08.004
arXiv:1901.02478
[160]
OpenAI
Asymmetric self-play for automatic goal discovery in robotic manipulation
(2021)
arXiv:2101.04882
[161]
Zhang L., Tang K., Yao X.
Explicit planning for efficient exploration in reinforcement learning
Conference on Neural Information Processing Systems, Vol. 32, NeurIPS 2019 (2019)
