[Survey] A survey and critique of multiagent deep reinforcement learning

This article is a survey of multiagent deep reinforcement learning (MDRL) that examines its foundations, challenges, and future directions. MDRL combines reinforcement learning and deep learning to tackle complex problems in multiagent environments. The survey highlights difficulties such as non-stationary environments, high-dimensional spaces, and interactions among agents, offers practical guidance for newcomers, including benchmarks and open questions, and raises practical challenges of MDRL such as implementation and computational demands.

Contents

A survey and critique of multiagent deep reinforcement learning
Abstract
1 Introduction
2 Single-agent learning
2.1 Reinforcement learning
2.2 Deep reinforcement learning
3 Multiagent deep reinforcement learning (MDRL)
3.1 Multiagent learning
3.2 MDRL categorization
3.3 Emergent behaviors
3.4 Learning communication
3.5 Learning cooperation
3.6 Agents modeling agents
4 Bridging RL, MAL and MDRL
4.1 Avoiding deep learning amnesia: examples in MDRL
4.2 Lessons learned
4.3 Benchmarks for MDRL
4.4 Practical challenges in MDRL
4.5 Open questions
5 Conclusions
Notes

A survey and critique of multiagent deep reinforcement learning

Abstract

        Deep reinforcement learning (RL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent learning (MAL) scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. The primary goal of this article is to provide a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Additionally, we complement the overview with a broader analysis: (i) we revisit previous key components, originally presented in MAL and RL, and highlight how they have been adapted to multiagent deep reinforcement learning settings. (ii) We provide general guidelines to new practitioners in the area: describing lessons learned from MDRL works, pointing to recent benchmarks, and outlining open avenues of research. (iii) We take a more critical tone raising practical challenges of MDRL (e.g., implementation and computational demands). We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists (e.g., RL and MAL) in a joint effort to promote fruitful research in the multiagent community.

1 Introduction

        Almost 20 years ago Stone and Veloso’s seminal survey [305] laid the groundwork for defining the area of multiagent systems (MAS) and its open problems in the context of AI. About 10 years ago, Shoham et al. [289] noted that the literature on multiagent learning (MAL) was growing and it was not possible to enumerate all relevant articles. Since then, the number of published MAL works continues to steadily rise, which led to different surveys on the area, ranging from analyzing the basics of MAL and their challenges [7, 55, 333], to addressing specific subareas: game theory and MAL [233, 289], cooperative scenarios [213, 248], and evolutionary dynamics of MAL [38]. In just the last couple of years, three surveys related to MAL have been published: learning in non-stationary environments [141], agents modeling agents [6], and transfer learning in multiagent RL [290].

        The research interest in MAL has been accompanied by successes in artificial intelligence, first, in single-agent video games [221]; more recently, in two-player games, for example, playing Go [291, 293], poker [50, 224], and games of two competing teams, e.g., DOTA 2 [235] and StarCraft II [339].

        While different techniques and algorithms were used in the above scenarios, in general, they are all a combination of techniques from two main areas: reinforcement learning (RL) [315] and deep learning [184, 281].

        RL is an area of machine learning where an agent learns by interacting (i.e., taking actions) within a dynamic environment. However, one of the main challenges to RL, and traditional machine learning in general, is the need for manually designing quality features on which to learn. Deep learning enables efficient representation learning, thus allowing the automatic discovery of features [184, 281]. In recent years, deep learning has had successes in different areas such as computer vision and natural language processing [184, 281]. One of the key aspects of deep learning is the use of neural networks (NNs) that can find compact representations in high-dimensional data [13].

        In deep reinforcement learning (DRL) [13, 101] deep neural networks are trained to approximate the optimal policy and/or the value function. In this way the deep NN, serving as function approximator, enables powerful generalization. One of the key advantages of DRL is that it enables RL to scale to problems with high-dimensional state and action spaces. However, most existing successful DRL applications so far have been on visual domains (e.g., Atari games), and there is still a lot of work to be done for more realistic applications [359, 364] with complex dynamics, which are not necessarily vision-based.

        DRL has been regarded as an important component in constructing general AI systems [179] and has been successfully integrated with other techniques, e.g., search [291], planning [320], and more recently with multiagent systems, with an emerging area of multiagent deep reinforcement learning (MDRL) [232, 251] (see footnote 1).

        Learning in multiagent settings is fundamentally more difficult than the single-agent case due to the presence of multiagent pathologies, e.g., the moving target problem (non-stationarity) [55, 141, 289], curse of dimensionality [55, 289], multiagent credit assignment [2, 355], global exploration [213], and relative overgeneralization [105, 247, 347]. Despite this complexity, top AI conferences like AAAI, ICML, ICLR, IJCAI and NeurIPS, and specialized conferences such as AAMAS, have published works reporting successes in MDRL. In light of these works, we believe it is pertinent to first, have an overview of the recent MDRL works, and second, understand how these recent works relate to the existing literature.

        This article contributes to the state of the art with a brief survey of the current works in MDRL in an effort to complement existing surveys on multiagent learning [56, 141], cooperative learning [213, 248], agents modeling agents [6], knowledge reuse in multiagent RL [290], and (single-agent) deep reinforcement learning [13, 191].

        First, we provide a short review of key algorithms in RL such as Q-learning and REINFORCE (see Sect. 2.1). Second, we review DRL highlighting the challenges in this setting and reviewing recent works (see Sect. 2.2). Third, we present the multiagent setting and give an overview of key challenges and results (see Sect. 3.1). Then, we present the identified four categories to group recent MDRL works (see Fig. 1):

  • Analysis of emergent behaviors: evaluate single-agent DRL algorithms in multiagent scenarios (e.g., Atari games, social dilemmas, 3D competitive games).

  • Learning communication: agents learn communication protocols to solve cooperative tasks.

  • Learning cooperation: agents learn to cooperate using only actions and (local) observations.

  • Agents modeling agents: agents reason about others to fulfill a task (e.g., best response learners).

        For each category we provide a description as well as outline the recent works (see Sect. 3.2 and Tables 1, 2, 3, 4). Then, we take a step back and reflect on how these new works relate to the existing literature. In that context, first, we present examples on how methods and algorithms originally introduced in RL and MAL were successfully scaled to MDRL (see Sect. 4.1). Second, we provide some pointers for new practitioners in the area by describing general lessons learned from the existing MDRL works (see Sect. 4.2) and point to recent multiagent benchmarks (see Sect. 4.3). Third, we take a more critical view and describe practical challenges in MDRL, such as reproducibility, hyperparameter tuning, and computational demands (see Sect. 4.4). Then, we outline some open research questions (see Sect. 4.5). Lastly, we present our conclusions from this work (see Sect. 5).

        Our goal is to outline a recent and active area (i.e., MDRL), as well as to motivate future research to take advantage of the ample and existing literature in multiagent learning. We aim to enable researchers with experience in either DRL or MAL to gain a common understanding about recent works, and open problems in MDRL, and to avoid having scattered sub-communities with little interaction [6, 81, 141, 289].

Fig. 1 Categories of different MDRL works. a Analysis of emergent behaviors: evaluate single-agent DRL algorithms in multiagent scenarios. b Learning communication: agents learn with actions and through messages. c Learning cooperation: agents learn to cooperate using only actions and (local) observations. d Agents modeling agents: agents reason about others to fulfill a task (e.g., cooperative or competitive). For a more detailed description see Sects. 3.3–3.6 and Tables 1, 2, 3 and 4

2 Single-agent learning

This section presents the formalism of reinforcement learning and its main components before outlining deep reinforcement learning along with its particular challenges and recent algorithms. For a more detailed description we refer the reader to excellent books and surveys on the area [13, 101, 164, 315, 353].

2.1 Reinforcement learning

 

REINFORCE (Monte Carlo policy gradient) In contrast to value-based methods, which do not try to optimize directly over a policy space [175], policy gradient methods can learn parameterized policies without using intermediate value estimates.
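For concreteness, the standard REINFORCE update (the textbook form in Sutton and Barto's notation, not an equation reproduced from the survey) is

\[ \theta_{t+1} = \theta_t + \alpha\, G_t\, \nabla_\theta \log \pi_{\theta_t}(A_t \mid S_t), \]

where G_t is the Monte Carlo return observed from time t and \(\alpha\) is a step size; the update increases the log-probability of each taken action in proportion to the return it yielded.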

 

        The policy gradient update can be generalized to include a comparison to an arbitrary baseline of the state [354]. The baseline, b(s), can be any function, as long as it does not vary with the action; the baseline leaves the expected value of the update unchanged, but it can have an effect on its variance [315]. A natural choice for the baseline is a learned state-value function; this reduces the variance, and it is bias-free if learned by MC (see footnote 3). Moreover, when using the state-value function for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states) it assigns credit (reducing the variance but introducing bias), i.e., criticizes the policy’s action selections. Thus, in actor-critic methods [175], the actor represents the policy, i.e., action-selection mechanism, whereas a critic is used for the value function learning. In the case when the critic learns a state-action function (Q function) and a state value function (V function), an advantage function can be computed by subtracting state values from the state-action values [283, 315]. The advantage function indicates the relative quality of an action compared to other available actions computed from the baseline, i.e., state value function. An example of an actor-critic algorithm is Deterministic Policy Gradient (DPG) [292]. In DPG [292] the critic follows the standard Q-learning and the actor is updated following the gradient of the policy’s performance [128]; DPG was later extended to DRL (see Sect. 2.2) and MDRL (see Sect. 3.5). For multiagent learning settings the variance is further increased as all the agents’ rewards depend on the rest of the agents, and it is formally shown that as the number of agents increase, the probability of taking a correct gradient direction decreases exponentially [206]. Recent MDRL works addressed this high variance issue, e.g., COMA [97] and MADDPG [206] (see Sect. 3.5).
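As a reminder of the quantities involved (standard definitions, not reproduced from the survey), the baseline-subtracted policy gradient and the advantage function take the form

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\big[ \nabla_\theta \log \pi_\theta(A_t \mid S_t)\,\big(G_t - b(S_t)\big) \big], \qquad A(s,a) = Q(s,a) - V(s), \]

so an action with positive advantage is better than the critic's baseline estimate for that state and its selection probability is increased.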

        Policy gradient methods have a clear connection with deep reinforcement learning since the policy might be represented by a neural network whose input is a representation of the state, whose output are action selection probabilities or values for continuous control [192], and whose weights are the policy parameters.
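A minimal sketch of such a parameterized policy in PyTorch (layer sizes, state dimension, and action count are illustrative choices, not taken from the survey):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Minimal policy network: state representation in, action probabilities out."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns the network outputs into action-selection probabilities;
        # the network weights are the policy parameters theta.
        return torch.softmax(self.net(state), dim=-1)

# Sample an action from the parameterized policy pi_theta(a | s).
policy = PolicyNetwork(state_dim=4, num_actions=2)
probs = policy(torch.randn(1, 4))
action = torch.distributions.Categorical(probs).sample()
```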

2.2 Deep reinforcement learning

        While tabular RL methods such as Q-learning are successful in domains that do not suffer from the curse of dimensionality, there are many limitations: learning in large state spaces can be prohibitively slow, methods do not generalize (across the state space), and state representations need to be hand-specified [315]. Function approximators tried to address those limitations, using for example, decision trees [262], tile coding [314], radial basis functions [177], and locally weighted regression [46] to approximate the value function.
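For reference, the tabular Q-learning update referred to here has the standard form

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big], \]

where \(\alpha\) is the learning rate and \(\gamma\) the discount factor; the table entry of every visited state-action pair is updated independently, which is exactly what fails to scale to large state spaces.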

However, extending deep learning to RL problems comes with additional challenges including non-i.i.d. (not independently and identically distributed) data. Many supervised learning methods assume that training data is from an i.i.d. stationary distribution [36, 269, 281]. However, in RL, training data consists of highly correlated sequential agent-environment interactions, which violates the independence condition. Moreover, RL training data distribution is non-stationary as the agent actively learns while exploring different parts of the state space, violating the condition of sampled data being identically distributed [220].

        In practice, using function approximators in RL requires making crucial representational decisions and poor design choices can result in estimates that diverge from the optimal value function [1, 21, 46, 112, 334, 351]. In particular, function approximation, bootstrapping, and off-policy learning are considered the three main properties that when combined, can make the learning to diverge and are known as the deadly triad [315, 334]. Recently, some works have shown that non-linear (i.e., deep) function approximators poorly estimate the value function [104, 151, 331] and another work found problems with Q-learning using function approximation (over/under-estimation, instability and even divergence) due to the delusional bias: “delusional bias occurs whenever a backed-up value estimate is derived from action choices that are not realizable in the underlying policy class” [207]. Additionally, convergence results for reinforcement learning using function approximation are still scarce [21, 92, 207, 217, 330]; in general, stronger convergence guarantees are available for policy-gradient methods [316] than for value-based methods [315].

        Below we mention how the existing DRL methods aim to address these challenges when briefly reviewing value-based methods, such as DQN [221]; policy gradient methods, like Proximal Policy Optimization (PPO) [283]; and actor-critic methods like Asynchronous Advantage Actor-Critic (A3C) [158]. We refer the reader to recent surveys on single-agent DRL [13, 101, 191] for a more detailed discussion of the literature.

 

Fig. 2

 

The ER buffer provides stability for learning as random batches sampled from the buffer help alleviate the problems caused by the non-i.i.d. data. However, it comes with disadvantages, such as higher memory requirements and computation per real interaction [219]. The ER buffer is mainly used for off-policy RL methods as it can cause a mismatch between buffer content from an earlier policy and from the current policy for on-policy methods [219]. Extending the ER buffer for the multiagent case is not trivial, see Sects. 3.5, 4.1 and 4.2. Recent works were designed to reduce the problem of catastrophic forgetting (this occurs when the trained neural network performs poorly on previously learned tasks due to a non-stationary training distribution [111, 214]) and the ER buffer, in DRL [153] and MDRL [246].
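A minimal experience replay (ER) buffer, as commonly implemented (a sketch under the usual uniform-sampling assumption, not the exact implementation from the works cited above):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation of consecutive
        # agent-environment interactions (the non-i.i.d. issue discussed above).
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```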

DQN has been extended in many ways, for example, by using double estimators [130] to reduce the overestimation bias with Double DQN [336] (see Sect. 4.1) and by decomposing the Q-function with a dueling-DQN architecture [345], where two streams are learned, one estimating state values and the other estimating advantages, which are combined in the final layer to form Q values (this method improved over Double DQN).
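In standard notation (not quoted from the survey), the Double DQN target and the usual dueling aggregation are

\[ y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^{-}\big), \qquad Q(s,a) = V(s) + \Big( A(s,a) - \tfrac{1}{|\mathcal{A}|}\sum_{a'} A(s,a') \Big), \]

where \(\theta\) are the online network parameters and \(\theta^{-}\) the target network parameters; decoupling action selection from action evaluation reduces the overestimation bias, and subtracting the mean advantage keeps the V and A streams identifiable.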

In practice, DQN is trained using an input of four stacked frames (last four frames the agent has encountered). If a game requires a memory of more than four frames it will appear non-Markovian to DQN because the future game states (and rewards) do not depend only on the input (four frames) but rather on the history [132]. Thus, DQN’s performance declines when given incomplete state observations (e.g., one input frame) since DQN assumes full state observability.
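A sketch of the frame-stacking described here (an illustrative helper, not code from the DQN paper):

```python
import numpy as np
from collections import deque

FRAME_STACK = 4
frames = deque(maxlen=FRAME_STACK)  # call frames.clear() at every episode reset

def stacked_observation(new_frame: np.ndarray) -> np.ndarray:
    """Return the last four frames as one observation, approximating a Markov state."""
    frames.append(new_frame)
    while len(frames) < FRAME_STACK:     # pad at episode start by repeating the first frame
        frames.append(new_frame)
    return np.stack(list(frames), axis=0)  # shape: (4, H, W)
```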

Real-world tasks often feature incomplete and noisy state information resulting from partial observability (see Sect. 2.1). Deep Recurrent Q-Networks (DRQN) [131] proposed using recurrent neural networks, in particular Long Short-Term Memory (LSTM) cells [147], in DQN for this setting. Consider the architecture in Fig. 2 with the first dense layer after convolution replaced by a layer of LSTM cells. With this addition, DRQN has memory capacity so that it can even work with only one input frame rather than a stacked input of consecutive frames. This idea has been extended to MDRL, see Fig. 6 and Sect. 4.2. There are also other approaches to deal with partial observability such as finite state controllers [218] (where action selection is performed according to the complete observation history) and using an initiation set of options conditioned on the previously employed option [302].
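A sketch of the architectural change in PyTorch (feature, hidden, and action sizes are illustrative; the convolutional front-end is assumed to be the same as in DQN):

```python
import torch
import torch.nn as nn

class DRQNHead(nn.Module):
    """Recurrent head: the first dense layer after the convolutions is replaced
    by an LSTM, so a single frame per time step can suffice."""
    def __init__(self, conv_features: int = 3136, hidden: int = 512, num_actions: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(conv_features, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, num_actions)

    def forward(self, conv_out_seq: torch.Tensor, hidden_state=None):
        # conv_out_seq: (batch, time, conv_features) -- flattened conv outputs per frame.
        out, hidden_state = self.lstm(conv_out_seq, hidden_state)
        return self.q_head(out), hidden_state  # Q-values per step, plus recurrent memory
```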

Fig. 3

Policy gradient methods For many tasks, particularly for physical control, the action space is continuous and high dimensional where DQN is not suitable. Deep Deterministic Policy Gradient (DDPG) [192] is a model-free off-policy actor-critic algorithm for such domains, based on the DPG algorithm [292] (see Sect. 2.1). Additionally, it proposes a new method for updating the networks, i.e., the target network parameters slowly change (this could also be applicable to DQN), in contrast to the hard reset (direct weight copy) used in DQN. Given the off-policy nature, DDPG generates exploratory behavior by adding sampled noise from some noise processes to its actor policy. The authors also used batch normalization [152] to ensure generalization across many different tasks without performing manual normalizations. However, note that other works have shown batch normalization can cause divergence in DRL [274
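The slow (“soft”) target-network update referred to here is conventionally written as

\[ \theta' \leftarrow \tau\,\theta + (1 - \tau)\,\theta', \qquad \tau \ll 1, \]

a standard form (not quoted from the survey) that interpolates the target-network parameters \(\theta'\) toward the online parameters \(\theta\) at every step, in contrast to DQN's periodic hard copy.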
