A survey and critique of multiagent deep reinforcement learning

Abstract

Deep reinforcement learning (RL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent learning (MAL) scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. The primary goal of this article is to provide a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Additionally, we complement the overview with a broader analysis: (i) we revisit previous key components, originally presented in MAL and RL, and highlight how they have been adapted to multiagent deep reinforcement learning settings. (ii) We provide general guidelines to new practitioners in the area: describing lessons learned from MDRL works, pointing to recent benchmarks, and outlining open avenues of research. (iii) We take a more critical tone raising practical challenges of MDRL (e.g., implementation and computational demands). We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists (e.g., RL and MAL) in a joint effort to promote fruitful research in the multiagent community.

1 Introduction

Almost 20 years ago Stone and Veloso’s seminal survey [305] laid the groundwork for defining the area of multiagent systems (MAS) and its open problems in the context of AI. About 10 years ago, Shoham et al. [289] noted that the literature on multiagent learning (MAL) was growing and it was not possible to enumerate all relevant articles. Since then, the number of published MAL works continues to steadily rise, which led to different surveys on the area, ranging from analyzing the basics of MAL and their challenges [7, 55, 333], to addressing specific subareas: game theory and MAL [233, 289], cooperative scenarios [213, 248], and evolutionary dynamics of MAL [38]. In just the last couple of years, three surveys related to MAL have been published: learning in non-stationary environments [141], agents modeling agents [6], and transfer learning in multiagent RL [290].

The research interest in MAL has been accompanied by successes in artificial intelligence, first, in single-agent video games [221]; more recently, in two-player games, for example, playing Go [291, 293], poker [50, 224], and games of two competing teams, e.g., DOTA 2 [235] and StarCraft II [339].

While different techniques and algorithms were used in the above scenarios, in general, they are all a combination of techniques from two main areas: reinforcement learning (RL) [315] and deep learning [184, 281].

RL is an area of machine learning where an agent learns by interacting (i.e., taking actions) within a dynamic environment. However, one of the main challenges to RL, and traditional machine learning in general, is the need for manually designing quality features on which to learn. Deep learning enables efficient representation learning, thus allowing the automatic discovery of features [184, 281]. In recent years, deep learning has had successes in different areas such as computer vision and natural language processing [184, 281]. One of the key aspects of deep learning is the use of neural networks (NNs) that can find compact representations in high-dimensional data [13].

In deep reinforcement learning (DRL) [13, 101] deep neural networks are trained to approximate the optimal policy and/or the value function. In this way the deep NN, serving as function approximator, enables powerful generalization. One of the key advantages of DRL is that it enables RL to scale to problems with high-dimensional state and action spaces. However, most existing successful DRL applications so far have been on visual domains (e.g., Atari games), and there is still a lot of work to be done for more realistic applications [359, 364] with complex dynamics, which are not necessarily vision-based.

DRL has been regarded as an important component in constructing general AI systems [179] and has been successfully integrated with other techniques, e.g., search [291], planning [320], and more recently with multiagent systems, leading to the emerging area of multiagent deep reinforcement learning (MDRL) [232, 251].

Learning in multiagent settings is fundamentally more difficult than the single-agent case due to the presence of multiagent pathologies, e.g., the moving target problem (non-stationarity) [55, 141, 289], curse of dimensionality [55, 289], multiagent credit assignment [2, 355], global exploration [213], and relative overgeneralization [105, 247, 347]. Despite this complexity, top AI conferences like AAAI, ICML, ICLR, IJCAI and NeurIPS, and specialized conferences such as AAMAS, have published works reporting successes in MDRL. In light of these works, we believe it is pertinent to first, have an overview of the recent MDRL works, and second, understand how these recent works relate to the existing literature.

This article contributes to the state of the art with a brief survey of the current works in MDRL in an effort to complement existing surveys on multiagent learning [56, 141], cooperative learning [213, 248], agents modeling agents [6], knowledge reuse in multiagent RL [290], and (single-agent) deep reinforcement learning [13, 191].

First, we provide a short review of key algorithms in RL such as Q-learning and REINFORCE (see Sect. 2.1). Second, we review DRL highlighting the challenges in this setting and reviewing recent works (see Sect. 2.2). Third, we present the multiagent setting and give an overview of key challenges and results (see Sect. 3.1). Then, we present the identified four categories to group recent MDRL works (see Fig. 1):

Analysis of emergent behaviors: evaluate single-agent DRL algorithms in multiagent scenarios (e.g., Atari games, social dilemmas, 3D competitive games).

Learning communication: agents learn communication protocols to solve cooperative tasks.

Learning cooperation: agents learn to cooperate using only actions and (local) observations.

Agents modeling agents: agents reason about others to fulfill a task (e.g., best response learners).

For each category we provide a description as well as outline the recent works (see Sect. 3.2 and Tables 1, 2, 3, 4). Then, we take a step back and reflect on how these new works relate to the existing literature. In that context, first, we present examples of how methods and algorithms originally introduced in RL and MAL were successfully scaled to MDRL (see Sect. 4.1). Second, we provide some pointers for new practitioners in the area by describing general lessons learned from the existing MDRL works (see Sect. 4.2) and point to recent multiagent benchmarks (see Sect. 4.3). Third, we take a more critical view and describe practical challenges in MDRL, such as reproducibility, hyperparameter tuning, and computational demands (see Sect. 4.4). Then, we outline some open research questions (see Sect. 4.5). Lastly, we present our conclusions from this work (see Sect. 5).

Our goal is to outline a recent and active area (i.e., MDRL), as well as to motivate future research to take advantage of the ample existing literature in multiagent learning. We aim to enable researchers with experience in either DRL or MAL to gain a common understanding of recent works and open problems in MDRL, and to avoid having scattered sub-communities with little interaction [6, 81, 141, 289].

Fig. 1 Categories of different MDRL works. a Analysis of emergent behaviors: evaluate single-agent DRL algorithms in multiagent scenarios. b Learning communication: agents learn with actions and through messages. c Learning cooperation: agents learn to cooperate using only actions and (local) observations. d Agents modeling agents: agents reason about others to fulfill a task (e.g., cooperative or competitive). For a more detailed description see Sects. 3.3–3.6 and Tables 1, 2, 3 and 4

2 Single-agent learning
This section presents the formalism of reinforcement learning and its main components before outlining deep reinforcement learning along with its particular challenges and recent algorithms. For a more detailed description we refer the reader to excellent books and surveys on the area [13, 101, 164, 315, 353].

2.1 Reinforcement learning
RL formalizes the interaction of an agent with an environment using a Markov decision process (MDP) [261]. An MDP is defined by the tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$ where $\mathcal{S}$ represents a finite set of states, $\mathcal{A}$ represents a finite set of actions, the transition function $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ determines the probability of a transition from any state $s \in \mathcal{S}$ to any state $s' \in \mathcal{S}$ given any possible action $a \in \mathcal{A}$, the reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ defines the immediate and possibly stochastic reward that an agent would receive given that the agent executes action $a$ while in state $s$ and it is transitioned to state $s'$, and $\gamma \in [0, 1]$ represents the discount factor that balances the trade-off between immediate rewards and future rewards.

MDPs are adequate models to obtain optimal decisions in single-agent fully observable environments. Solving an MDP yields a policy $\pi : \mathcal{S} \rightarrow \mathcal{A}$, which is a mapping from states to actions. An optimal policy $\pi^*$ is the one that maximizes the expected discounted sum of rewards. There are different techniques for solving MDPs assuming a complete description of all their elements. One of the most common techniques is the value iteration algorithm [33], which requires a complete and accurate representation of states, actions, rewards, and transitions. However, this may be difficult to obtain in many domains. For this reason, RL algorithms often learn from experience interacting with the environment in discrete time steps.

Q-learning One of the most well known algorithms for RL is Q-learning [346]. It has been devised for stationary, single-agent, fully observable environments with discrete actions. A Q-learning agent keeps the estimate of its expected payoff starting in state s, taking action a, as $\hat{Q}(s, a)$. Each tabular entry $\hat{Q}(s, a)$ is an estimate of the corresponding optimal $Q^*$ function that maps state-action pairs to the discounted sum of future rewards starting with action a at state s and following the optimal policy thereafter. Each time the agent transitions from a state s to a state $s'$ via action a receiving payoff r, the Q table is updated as follows:

$$\hat{Q}(s, a) \leftarrow \hat{Q}(s, a) + \alpha \, [r + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s, a)] \qquad (1)$$

with the learning rate $\alpha$. Q-learning is proven to converge to $Q^*$ if state and action spaces are discrete and finite, the sum of the learning rates goes to infinity (so that each state-action pair is visited infinitely often) and the sum of the squares of the learning rates is finite (which is required to show that the convergence is with probability one) [94, 154, 168, 318, 319, 329, 346]. The convergence of single-step on-policy RL algorithms, i.e., SARSA ($\lambda = 0$), for both decaying exploration (greedy in the limit with infinite exploration) and persistent exploration (selecting actions probabilistically according to the ranks of the Q values) was demonstrated by Singh et al. [294]. Furthermore, Van Seijen [337] has proven convergence for Expected SARSA (see Sect. 3.1 for convergence results in multiagent domains).

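As a concrete illustration of the update in Eq. (1), the following is a minimal Python sketch of tabular Q-learning with epsilon-greedy exploration. The environment interface (env.reset(), env.step(), env.num_actions) and the values of alpha, gamma, and epsilon are illustrative assumptions, not details prescribed by the text above.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: repeatedly apply the update in Eq. (1)."""
    Q = defaultdict(float)  # tabular estimate Q(s, a), initialized to 0

    def greedy_action(s):
        return max(range(env.num_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the current estimates
            a = random.randrange(env.num_actions) if random.random() < epsilon else greedy_action(s)
            s_next, r, done = env.step(a)
            # Eq. (1): move Q(s, a) toward the one-step bootstrapped target
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(env.num_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```
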
REINFORCE (Monte Carlo policy gradient) In contrast to value-based methods, which do not try to optimize directly over a policy space [175], policy gradient methods can learn parameterized policies without using intermediate value estimates.

Policy parameters are learned by following the gradient of some performance measure with gradient descent [316]. For example, REINFORCE [354] uses estimated return by Monte Carlo (MC) methods with full episode trajectories to learn policy parameters $\theta$, with $\pi(a; s, \theta) \approx \pi(a; s)$, as follows:

$$\theta_{t+1} = \theta_t + \alpha \, G_t \frac{\nabla_\theta \pi(A_t; S_t, \theta_t)}{\pi(A_t; S_t, \theta_t)} \qquad (2)$$

where $G_t$ represents the return, $\alpha$ is the learning rate, and $A_t \sim \pi$. A main limitation is that policy gradient methods can have high variance [175].
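
The update in Eq. (2) can equivalently be written with the log-likelihood trick, $G_t \nabla_\theta \log \pi(A_t; S_t, \theta_t)$, which is how it is typically implemented with automatic differentiation. The following is a minimal PyTorch-style sketch under that formulation; the per-episode inputs and the discounted-return computation are illustrative assumptions.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One Monte Carlo policy-gradient update from a full episode trajectory.
    log_probs[t] = log pi(A_t | S_t; theta); rewards[t] = reward received at step t."""
    # Compute the return G_t for every timestep of the episode (backwards)
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Minimizing -sum_t G_t * log pi(A_t|S_t) ascends the objective behind Eq. (2)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```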

The policy gradient update can be generalized to include a comparison to an arbitrary baseline of the state [354]. The baseline, b(s), can be any function, as long as it does not vary with the action; the baseline leaves the expected value of the update unchanged, but it can have an effect on its variance [315]. A natural choice for the baseline is a learned state-value function; this reduces the variance, and it is bias-free if learned by MC. Moreover, when using the state-value function for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states) it assigns credit (reducing the variance but introducing bias), i.e., criticizes the policy's action selections. Thus, in actor-critic methods [175], the actor represents the policy, i.e., the action-selection mechanism, whereas a critic is used for the value function learning. In the case when the critic learns a state-action function (Q function) and a state value function (V function), an advantage function can be computed by subtracting state values from the state-action values [283, 315]. The advantage function indicates the relative quality of an action compared to other available actions computed from the baseline, i.e., the state value function. An example of an actor-critic algorithm is Deterministic Policy Gradient (DPG) [292]. In DPG [292] the critic follows the standard Q-learning and the actor is updated following the gradient of the policy's performance [128]; DPG was later extended to DRL (see Sect. 2.2) and MDRL (see Sect. 3.5). For multiagent learning settings the variance is further increased as all the agents' rewards depend on the rest of the agents, and it is formally shown that as the number of agents increases, the probability of taking a correct gradient direction decreases exponentially [206]. Recent MDRL works addressed this high variance issue, e.g., COMA [97] and MADDPG [206] (see Sect. 3.5).
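
As a minimal illustration of using a learned state-value function as the baseline, a one-step (bootstrapped) advantage estimate, of the kind actor-critic methods use to criticize action selections, could be sketched as follows; the value network and tensor shapes are illustrative assumptions.

```python
import torch

def one_step_advantage(value_net, s, r, s_next, done, gamma=0.99):
    """Advantage estimated as the TD error r + gamma * V(s') - V(s);
    the learned V(s) plays the role of the baseline b(s), reducing variance."""
    with torch.no_grad():
        v_s = value_net(s).squeeze(-1)
        v_next = value_net(s_next).squeeze(-1)
    return r + gamma * (1.0 - done) * v_next - v_s
```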

Policy gradient methods have a clear connection with deep reinforcement learning since the policy might be represented by a neural network whose input is a representation of the state, whose output are action selection probabilities or values for continuous control [192], and whose weights are the policy parameters.

2.2 Deep reinforcement learning
While tabular RL methods such as Q-learning are successful in domains that do not suffer from the curse of dimensionality, there are many limitations: learning in large state spaces can be prohibitively slow, methods do not generalize (across the state space), and state representations need to be hand-specified [315]. Function approximators tried to address those limitations, using for example, decision trees [262], tile coding [314], radial basis functions [177], and locally weighted regression [46] to approximate the value function.

Similarly, these challenges can be addressed by using deep learning, i.e., neural networks [46, 262], as function approximators. For example, $Q(s, a; \theta)$ can be used to approximate the state-action values with $\theta$ representing the neural network weights. This has two advantages: first, deep learning helps to generalize across states, improving the sample efficiency for large state-space RL problems; second, deep learning can be used to reduce (or eliminate) the need for manually designing features to represent state information [184, 281].

However, extending deep learning to RL problems comes with additional challenges including non-i.i.d. (not independently and identically distributed) data. Many supervised learning methods assume that training data is from an i.i.d. stationary distribution [36, 269, 281]. However, in RL, training data consists of highly correlated sequential agent-environment interactions, which violates the independence condition. Moreover, RL training data distribution is non-stationary as the agent actively learns while exploring different parts of the state space, violating the condition of sampled data being identically distributed [220].

In practice, using function approximators in RL requires making crucial representational decisions and poor design choices can result in estimates that diverge from the optimal value function [1, 21, 46, 112, 334, 351]. In particular, function approximation, bootstrapping, and off-policy learning are considered the three main properties that when combined, can make the learning to diverge and are known as the deadly triad [315, 334]. Recently, some works have shown that non-linear (i.e., deep) function approximators poorly estimate the value function [104, 151, 331] and another work found problems with Q-learning using function approximation (over/under-estimation, instability and even divergence) due to the delusional bias: “delusional bias occurs whenever a backed-up value estimate is derived from action choices that are not realizable in the underlying policy class”[207]. Additionally, convergence results for reinforcement learning using function approximation are still scarce [21, 92, 207, 217, 330]; in general, stronger convergence guarantees are available for policy-gradient methods [316] than for value-based methods [315].

Below we mention how the existing DRL methods aim to address these challenges when briefly reviewing value-based methods, such as DQN [221]; policy gradient methods, like Proximal Policy Optimization (PPO) [284]; and actor-critic methods like Asynchronous Advantage Actor-Critic (A3C) [219]. We refer the reader to recent surveys on single-agent DRL [13, 101, 191] for a more detailed discussion of the literature.

Value-based methods The major breakthrough work combining deep learning with Q-learning was the Deep Q-Network (DQN) [221]. DQN uses a deep neural network for function approximation [268] (see Fig. 2) and maintains an experience replay (ER) buffer [193, 194] to store interactions $\langle s, a, r, s' \rangle$. DQN keeps an additional copy of neural network parameters, $\theta^-$, for the target network in addition to the $\theta$ parameters to stabilize the learning, i.e., to alleviate the non-stationary data distribution. For each training iteration i, DQN minimizes the mean-squared error (MSE) between the Q-network and its target network using the loss function:

$$L_i(\theta_i) = \mathbb{E}_{s, a, r, s'} \big[ (r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i))^2 \big] \qquad (3)$$

where target network parameters $\theta^-$ are set to Q-network parameters $\theta$ periodically and mini-batches of $\langle s, a, r, s' \rangle$ tuples are sampled from the ER buffer, as depicted in Fig. 3.
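
To make the target computation in Eq. (3) concrete, here is a minimal PyTorch-style sketch of one DQN update on a minibatch sampled from the ER buffer; the network objects, optimizer, and hyperparameters are illustrative assumptions rather than the exact original implementation.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the loss in Eq. (3)."""
    s, a, r, s_next, done = batch  # tensors sampled from the ER buffer

    # Q(s, a; theta) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bootstrapped target uses the periodically copied target parameters theta^-
    with torch.no_grad():
        max_q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * max_q_next

    loss = F.mse_loss(q_sa, target)  # MSE between Q-network and its target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically (every C steps) the target network is synchronized:
    # target_net.load_state_dict(q_net.state_dict())
    return loss.item()
```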

Fig. 2 Deep Q-Network (DQN) [221]: Inputs are four stacked frames; the network is composed of several layers: Convolutional layers employ filters to learn features from high-dimensional data with a much smaller number of neurons and Dense layers are fully-connected layers. The last layer represents the actions the agent can take (in this case, 10 possible actions). Deep Recurrent Q-Network (DRQN) [131], which extends DQN to partially observable domains [63], is identical to this setup except the penultimate layer (Dense layer) is replaced with a recurrent LSTM layer [147]

The ER buffer provides stability for learning as random batches sampled from the buffer help alleviate the problems caused by the non-i.i.d. data. However, it comes with disadvantages, such as higher memory requirements and computation per real interaction [219]. The ER buffer is mainly used for off-policy RL methods as it can cause a mismatch between buffer content from an earlier policy and from the current policy for on-policy methods [219]. Extending the ER buffer for the multiagent case is not trivial, see Sects. 3.5, 4.1 and 4.2. Recent works were designed to reduce the problem of catastrophic forgetting (this occurs when the trained neural network performs poorly on previously learned tasks due to a non-stationary training distribution [111, 214]) and the ER buffer, in DRL [153] and MDRL [246].
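
A minimal sketch of such an ER buffer with uniform sampling is shown below; the capacity and batch size are illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) interactions with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest interactions are discarded first

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(list(self.storage), batch_size)
        # transpose the list of tuples into (states, actions, rewards, next_states, dones)
        return tuple(map(list, zip(*batch)))
```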

DQN has been extended in many ways, for example, by using double estimators [130] to reduce the overestimation bias with Double DQN [336] (see Sect. 4.1) and by decomposing the Q-function with a dueling-DQN architecture [345], where two streams are learned: one estimates state values and the other estimates advantages, which are combined in the final layer to form Q values (this method improved over Double DQN).

In practice, DQN is trained using an input of four stacked frames (last four frames the agent has encountered). If a game requires a memory of more than four frames it will appear non-Markovian to DQN because the future game states (and rewards) do not depend only on the input (four frames) but rather on the history [132]. Thus, DQN’s performance declines when given incomplete state observations (e.g., one input frame) since DQN assumes full state observability.

Real-world tasks often feature incomplete and noisy state information resulting from partial observability (see Sect. 2.1). Deep Recurrent Q-Networks (DRQN) [131] proposed using recurrent neural networks, in particular, Long Short-Term Memory (LSTMs) cells [147] in DQN, for this setting. Consider the architecture in Fig. 2 with the first dense layer after convolution replaced by a layer of LSTM cells. With this addition, DRQN has memory capacity so that it can even work with only one input frame rather than a stacked input of consecutive frames. This idea has been extended to MDRL, see Fig. 6 and Sect. 4.2. There are also other approaches to deal with partial observability such as finite state controllers [218] (where action selection is performed according to the complete observation history) and using an initiation set of options conditioned on the previously employed option [302].
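
A minimal PyTorch-style sketch of the DRQN-style modification described above (convolutional features feeding an LSTM whose hidden state carries memory across single-frame observations); the layer sizes and input resolution are illustrative assumptions, not the exact published architecture.

```python
import torch.nn as nn

class DRQN(nn.Module):
    """Convolutional encoder followed by an LSTM and a linear Q-value head."""
    def __init__(self, num_actions, hidden_size=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 9 * 9, hidden_size=hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84), i.e., one frame per step instead of a four-frame stack
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.q_head(out), hidden  # Q-values per timestep plus the recurrent state
```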

Fig. 3 Representation of a DQN agent that uses an experience replay buffer [193, 194] to keep $\langle s, a, r, s' \rangle$ tuples for minibatch updates. The Q-values are parameterized with a NN and a policy is obtained by selecting (greedily) over those values at every timestep

Policy gradient methods For many tasks, particularly for physical control, the action space is continuous and high-dimensional, where DQN is not suitable. Deep Deterministic Policy Gradient (DDPG) [192] is a model-free off-policy actor-critic algorithm for such domains, based on the DPG algorithm [292] (see Sect. 2.1). Additionally, it proposes a new method for updating the networks, i.e., the target network parameters slowly change (this could also be applicable to DQN), in contrast to the hard reset (direct weight copy) used in DQN. Given the off-policy nature, DDPG generates exploratory behavior by adding sampled noise from some noise processes to its actor policy. The authors also used batch normalization [152] to ensure generalization across many different tasks without performing manual normalizations. However, note that other works have shown batch normalization can cause divergence in DRL [274, 335].
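
As a minimal illustration of the slowly changing target networks mentioned above, a Polyak-style soft update could be sketched as follows; the interpolation coefficient tau is an illustrative choice.

```python
def soft_update(target_net, source_net, tau=0.001):
    """Slowly track the learned network: theta_target <- tau * theta + (1 - tau) * theta_target."""
    for target_param, param in zip(target_net.parameters(), source_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)
```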

Asynchronous Advantage Actor-Critic (A3C) [219] is an algorithm that employs a parallelized asynchronous training scheme (using multiple CPU threads) for efficiency. It is an on-policy RL method that does not use an experience replay buffer. A3C allows multiple workers to simultaneously interact with the environment and compute gradients locally. All the workers pass their computed local gradients to a global NN which performs the optimization and synchronizes with the workers asynchronously (see Fig. 4). There is also the Advantage Actor-Critic (A2C) method [234] that combines all the gradients from all the workers to update the global NN synchronously. The loss function for A3C is composed of two terms: policy loss (actor), $\mathcal{L}_\pi$, and value loss (critic), $\mathcal{L}_v$. A3C parameters are updated using the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, commonly used to reduce variance (see Sect. 2.1). An entropy loss for the policy, $H(\pi)$, is also commonly added, which helps to improve exploration by discouraging premature convergence to suboptimal deterministic policies [219]. Thus, the loss function is given by:

$$\mathcal{L}_{A3C} = \lambda_v \mathcal{L}_v + \lambda_\pi \mathcal{L}_\pi - \lambda_H \, \mathbb{E}_{s \sim \pi} [H(\pi(s, \cdot, \theta))]$$

with $\lambda_v$, $\lambda_\pi$, and $\lambda_H$ being weighting terms on the individual loss components. Wang et al. [344] took A3C's framework but used off-policy learning to create the Actor-critic with experience replay (ACER) algorithm. Gu et al. [118] introduced the Interpolated Policy Gradient (IPG) algorithm and showed a connection between ACER and DDPG: they are a pair of reparametrization terms (they are special cases of IPG) when they are put under the same stochastic policy setting, and when the policy is deterministic they collapse into DDPG.

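A minimal PyTorch-style sketch of this combined loss for one worker's rollout is given below; the particular advantage estimate and the weighting coefficients are illustrative assumptions.

```python
def a3c_loss(log_probs, values, returns, entropies,
             lambda_pi=1.0, lambda_v=0.5, lambda_h=0.01):
    """Combined objective: weighted policy loss + value loss - entropy bonus.
    log_probs, values, returns, entropies are 1-D tensors over one rollout."""
    advantages = returns - values                            # estimate of A(s_t, a_t)
    policy_loss = -(log_probs * advantages.detach()).mean()  # actor term
    value_loss = advantages.pow(2).mean()                    # critic regression toward returns
    entropy_bonus = entropies.mean()                         # encourages exploration
    return lambda_pi * policy_loss + lambda_v * value_loss - lambda_h * entropy_bonus
```
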
Fig. 4 Asynchronous Advantage Actor-Critic (A3C) employs multiple (CPU) workers without needing an ER buffer. Each worker has its own NN and independently interacts with the environment to compute the loss and gradients. Workers then pass computed gradients to the global NN that optimizes the parameters and synchronizes with the worker asynchronously. This distributed system is designed for single-agent deep RL. Compared to different DQN variants, A3C obtains better performance on a variety of Atari games using substantially less training time with multiple CPU cores of standard laptops without a GPU [219]. However, we note that more recent approaches use both multiple CPU cores for more efficient training data generation and GPUs for more efficient learning

Jaderberg et al. [158] built the Unsupervised Reinforcement and Auxiliary Learning (UNREAL) framework on top of A3C and introduced unsupervised auxiliary tasks (e.g., reward prediction) to speed up the learning process. Auxiliary tasks in general are not used for anything other than shaping the features of the agent, i.e., facilitating and regularizing the representation learning process [31, 288]; their formalization in RL is related to the concept of general value functions [315, 317]. The UNREAL framework optimizes a combined loss function $\mathcal{L}_{UNREAL} \approx \mathcal{L}_{A3C} + \sum_i \lambda_{AT_i} \mathcal{L}_{AT_i}$, that combines the A3C loss, $\mathcal{L}_{A3C}$, together with auxiliary task losses $\mathcal{L}_{AT_i}$, where $\lambda_{AT_i}$ are weight terms (see Sect. 4.1 for use of auxiliary tasks in MDRL). In contrast to A3C, UNREAL uses a prioritized ER buffer, in which transitions with positive reward are given higher probability of being sampled. This approach can be viewed as a simple form of prioritized replay [278], which was in turn inspired by model-based RL algorithms like prioritized sweeping [10, 223].

Another distributed architecture is the Importance Weighted Actor-Learner Architecture (IMPALA) [93]. Unlike A3C or UNREAL, IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralized learner, thus IMPALA decouples acting from learning.

Trust Region Policy Optimization (TRPO) [283] and Proximal Policy Optimization (PPO) [284] are recently proposed policy gradient algorithms, where the latter represents the state of the art with advantages such as being simpler to implement and having better empirical sample complexity. Interestingly, a recent work [151] studying PPO and TRPO arrived at the surprising conclusion that these methods often deviate from what the theoretical framework would predict: gradient estimates are poorly correlated with the true gradient and value networks tend to produce inaccurate predictions for the true value function. Compared to vanilla policy gradient algorithms, PPO prevents abrupt changes in policies during training through its loss function, similar to early work by Kakade [166]. Another advantage of PPO is that it can be used in a distributed fashion, i.e., Distributed PPO (DPPO) [134]. Note that distributed approaches like DPPO or A3C use parallelization only to improve the learning by more efficient training data generation through multiple CPU cores for single-agent DRL and they should not be considered multiagent approaches (except for recent work which tries to exploit this parallelization in a multiagent environment [19]).
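
A minimal PyTorch-style sketch of PPO's clipped surrogate objective, the mechanism through which PPO discourages abrupt policy changes; the clipping coefficient is an illustrative choice.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective: probability ratios are clipped to
    [1 - clip_eps, 1 + clip_eps] so each update stays close to the old policy."""
    ratio = torch.exp(log_probs_new - log_probs_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # minimize the negative surrogate
```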

Lastly, there is a connection between policy gradient algorithms and Q-learning [282] within the framework of entropy-regularized reinforcement learning [126], where the value and Q functions are slightly altered to consider the entropy of the policy. In this vein, Soft Actor-Critic (SAC) [127] is a recent algorithm that concurrently learns a stochastic policy, two Q-functions (taking inspiration from Double Q-learning) and a value function. SAC alternates between collecting experience with the current policy and updating from batches sampled from the ER buffer.

We have reviewed recent algorithms in DRL; while the list is not exhaustive, it provides an overview of the different state-of-the-art techniques and algorithms that will become useful when describing the MDRL techniques in the next section.

3 Multiagent deep reinforcement learning (MDRL)
First, we briefly introduce the general framework on multiagent learning and then we dive into the categories and the research on MDRL.

3.1 Multiagent learning
Learning in a multiagent environment is inherently more complex than in the single-agent case, as agents interact at the same time with the environment and potentially with each other [55]. The independent learners (a.k.a. decentralized learners) approach [323] directly uses single-agent algorithms in the multi-agent setting despite the underlying assumptions of these algorithms being violated (each agent independently learns its own policy, treating other agents as part of the environment). In particular the Markov property (the future dynamics, transitions, and rewards depend only on the current state) becomes invalid since the environment is no longer stationary [182, 233, 333]. This approach ignores the multiagent nature of the setting entirely and it can fail when an opponent adapts or learns, for example, based on the past history of interactions [289]. Despite the lack of guarantees, independent learners have been used in practice, providing advantages with regards to scalability while often achieving good results [213].

To understand why multiagent domains are non-stationary from agents' local perspectives, consider a simple stochastic (also known as Markov) game $(\mathcal{S}, \mathcal{N}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, which can be seen as an extension of an MDP to multiple agents [198, 200]. One key distinction is that the transition, $\mathcal{T}$, and reward function, $\mathcal{R}$, depend on the actions $\mathcal{A} = A_1 \times \cdots \times A_{\mathcal{N}}$ of all, $\mathcal{N}$, agents; this means $\mathcal{R} = R_1 \times \cdots \times R_{\mathcal{N}}$ and $\mathcal{T} = \mathcal{S} \times A_1 \times \cdots \times A_{\mathcal{N}}$.

Given a learning agent i and using the common shorthand notation $-i = \mathcal{N} \setminus \{i\}$ for the set of opponents, the value function now depends on the joint action $\boldsymbol{a} = (a_i, \boldsymbol{a}_{-i})$ and the joint policy $\boldsymbol{\pi}(s, \boldsymbol{a}) = \prod_j \pi_j(s, a_j)$:

$$V_i^{\boldsymbol{\pi}}(s) = \sum_{\boldsymbol{a}} \boldsymbol{\pi}(s, \boldsymbol{a}) \sum_{s'} \mathcal{T}(s, a_i, \boldsymbol{a}_{-i}, s') \, [R_i(s, a_i, \boldsymbol{a}_{-i}, s') + \gamma V_i(s')] \qquad (4)$$

Consequently, the optimal policy is dependent on the other agents' policies,

$$\pi_i^*(s, a_i, \boldsymbol{\pi}_{-i}) = \arg\max_{\pi_i} V_i^{(\pi_i, \boldsymbol{\pi}_{-i})}(s) = \arg\max_{\pi_i} \sum_{\boldsymbol{a}} \pi_i(s, a_i) \, \boldsymbol{\pi}_{-i}(s, \boldsymbol{a}_{-i}) \sum_{s'} \mathcal{T}(s, a_i, \boldsymbol{a}_{-i}, s') \, [R_i(s, a_i, \boldsymbol{a}_{-i}, s') + \gamma V_i(s')] \qquad (5)$$

Specifically, the opponents' joint policy $\boldsymbol{\pi}_{-i}(s, \boldsymbol{a}_{-i})$ can be non-stationary, i.e., it changes as the opponents' policies change over time, for example with learning opponents.

Convergence results Littman [200] studied convergence properties of reinforcement learning joint action agents [70] in Markov games with the following conclusions: in adversarial environments (zero-sum games) an optimal play can be guaranteed against an arbitrary opponent, i.e., Minimax Q-learning [198]. In coordination environments (e.g., in cooperative games all agents share the same reward function), strong assumptions need be made about other agents to guarantee convergence to optimal behavior [200], e.g., Nash Q-learning [149] and Friend-or-Foe Q-learning [199]. In other types of environments no value-based RL algorithms with guaranteed convergence properties are known [200].

Recent work on MDRL has addressed scalability and has focused significantly less on convergence guarantees, with few exceptions [22, 40, 255, 297]. One notable work has shown a connection between update rules for actor-critic algorithms for multiagent partially observable settings and (counterfactual) regret minimization: the advantage values are scaled counterfactual regrets. This led to new convergence properties of independent RL algorithms in zero-sum games with imperfect information [300]. The result is also used to support policy gradient optimization against worst-case opponents, in a new algorithm called Exploitability Descent [204].

We refer the interested reader to seminal works about convergence in multiagent domains [23, 41, 42, 45, 113, 165, 167, 277, 295, 357, 367]. Note that instead of convergence, some MAL algorithms have proved learning a best response against classes of opponents [66, 326, 349].

There are other common problems in MAL, including action shadowing [105, 347], the curse of dimensionality [55], and multiagent credit assignment [2]. Describing each problem is out of the scope of this survey. However, we refer the interested reader to excellent resources on general MAL [209, 333, 350], as well as surveys in specific areas: game theory and multiagent reinforcement learning [55, 233], cooperative scenarios [213, 248], evolutionary dynamics of multiagent learning [38], learning in non-stationary environments [141], agents modeling agents [6], and transfer learning in multiagent RL [290].

3.2 MDRL categorization
In Sect. 2.2 we outlined some recent works in single-agent DRL since an exhaustive list is out of the scope of this article. This explosion of works has led DRL to be extended and combined with other techniques [13, 191, 251]. One natural extension to DRL is to test whether these approaches could be applied in a multiagent environment.

We analyzed the most recent works that have a clear connection with MDRL (works not covered by previous MAL surveys [6, 141]; we do not consider genetic algorithms or swarm intelligence in this survey). We propose 4 categories which take inspiration from previous surveys [6, 55, 248, 305] and that conveniently describe and represent current works. Note that some of these works fit into more than one category (they are not mutually exclusive), therefore their summaries are presented in all applicable Tables 1, 2, 3 and 4; however, for ease of exposition, when describing them in the text we do so in only one category. Additionally, for each work we present its learning type, either a value-based method (e.g., DQN) or a policy gradient method (e.g., actor-critic); also, we mention if the setting is evaluated in a fully cooperative, fully competitive or mixed environment (both cooperative and competitive).

Analysis of emergent behaviors These works, in general, do not propose learning algorithms—their main focus is to analyze and evaluate DRL algorithms, e.g., DQN [188, 264, 322], PPO [24, 264] and others [187, 225, 264], in a multiagent environment. In this category we found works which analyze behaviors in the three major settings: cooperative, competitive and mixed scenarios; see Sect. 3.3 and Table 1.

Learning communication [96, 183, 225, 253, 256, 312]. These works explore a sub-area in which agents can share information with communication protocols, for example through direct messages [96] or via a shared memory [256]. This area is attracting attention and it had not been explored much in the MAL literature. See Sect. 3.4 and Table 2.

Learning cooperation While learning to communicate is an emerging area, fostering cooperation in learning agents has a long history of research in MAL [213, 248]. In this category the analyzed works are evaluated in either cooperative or mixed settings. Some works in this category take inspiration from MAL (e.g., leniency, hysteresis, and difference rewards concepts) and extend them to the MDRL setting [98, 244, 247]. A notable exception [99] takes a key component from RL (i.e., experience replay buffer) and adapts it for MDRL. See Sect. 3.5 and Table 3.

Agents modeling agents Albrecht and Stone [6] presented a thorough survey in this topic and we have found many works that fit into this category in the MDRL setting, some taking inspiration from DRL [133, 148, 265], and others from MAL [97, 136, 180, 263, 358]. Modeling agents is helpful not only to cooperate, but also for modeling opponents [133, 136, 148, 180], inferring goals [265], and accounting for the learning behavior of other agents [97]. In this category the analyzed algorithms present their results in either a competitive setting or a mixed one (cooperative and competitive). See Sect. 3.6 and Table 4.

In the rest of this section we describe each category along with the summaries of related works.

Table 1 These papers analyze emergent behaviors in MDRL
Table 2 These papers propose algorithms for learning communication
Table 3 These papers aim to learn cooperation
Table 4 These papers consider agents modeling agents

3.3 Emergent behaviors
Some recent works have analyzed the previously mentioned independent DRL agents (see Sect. 3.1) from the perspective of the types of emergent behaviors (e.g., cooperative or competitive).

One of the earliest MDRL works is by Tampuu et al. [322], which had two independent DQN learning agents to play the Atari Pong game. Their focus was to adapt the reward function for the learning agents, which resulted in either cooperative or competitive emergent behaviors.

Leibo et al. [188] meanwhile studied independent DQNs in the context of sequential social dilemmas: a Markov game that satisfies certain inequalities [188]. The focus of this work was to highlight that cooperative or competitive behaviors exist not only as discrete (atomic) actions, but they are temporally extended (over policies). In the related setting of one-shot Markov social dilemmas, Lerer and Peysakhovich [189] extended the famous Tit-for-Tat (TFT) strategy [15] for DRL (using function approximators) and showed (theoretically and experimentally) that such agents can maintain cooperation. To construct the agents they used self-play and two reward schemes: selfish and cooperative. Previously, different MAL algorithms were designed to foster cooperation in social dilemmas with Q-learning agents [77, 303].

Self-play is a useful concept for learning algorithms (e.g., fictitious play [49]) since under certain classes of games it can guarantee convergence and it has been used as a standard technique in previous RL and MAL works [43, 291, 325]. Despite its common usage, self-play can be brittle to forgetting past knowledge [180, 186, 275] (see Sect. 4.5 for a note on the role of self-play as an open question in MDRL). To overcome this issue, Leibo et al. [187] proposed Malthusian reinforcement learning as an extension of self-play to population dynamics. The approach can be thought of as community coevolution and has been shown to produce better results (avoiding local optima) than independent agents with intrinsic motivation [30]. A limitation of this work is that it does not place itself within the state of the art in evolutionary and genetic algorithms. Evolutionary strategies have been employed for solving reinforcement learning problems [226] and for evolving function approximators [351]. Similarly, they have been used in multiagent scenarios to compute approximate Nash equilibria [238] and as metaheuristic optimization algorithms [53, 54, 150, 248].

Bansal et al. [24] explored the emergent behaviors in competitive scenarios using the MuJoCo simulator [327]. They trained independent learning agents with PPO and incorporated two main modifications to deal with the MAL nature of the problem. First, they used exploration rewards [122] which are dense rewards that allow agents to learn basic (non-competitive) behaviors—this type of reward is annealed through time giving more weight to the environmental (competitive) reward. Exploration rewards come from early work in robotics [212] and single-agent RL [176], and their goal is to provide dense feedback for the learning algorithm to improve sample efficiency (Ng et al. [231] studied the theoretical conditions under which modifications of the reward function of an MDP preserve the optimal policy). For multiagent scenarios, these dense rewards help agents in the beginning phase of the training to learn basic non-competitive skills, increasing the probability of random actions from the agent yielding a positive reward. The second contribution was opponent sampling which maintains a pool of older versions of the opponent to sample from, in contrast to using the most recent version.
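
As a minimal sketch of the annealing idea described above (a dense exploration reward whose weight decays so that the environmental, competitive reward eventually dominates), one could write the shaped reward as follows; the linear schedule and its horizon are illustrative assumptions, not the exact schedule of Bansal et al.

```python
def shaped_reward(t, anneal_steps, r_exploration, r_competition):
    """Early in training the dense exploration reward dominates; its weight is
    annealed toward zero so the competitive reward takes over."""
    alpha = max(0.0, 1.0 - t / anneal_steps)  # linear annealing coefficient in [0, 1]
    return alpha * r_exploration + (1.0 - alpha) * r_competition
```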

Raghu et al. [264] investigated how DRL algorithms (DQN, A2C, and PPO) performed in a family of two-player zero-sum games with tunable complexity, called Erdos-Selfridge-Spencer games [91, 299]. Their reasoning is threefold: (i) these games provide a parameterized family of environments, (ii) optimal behavior can be completely characterized, and (iii) they support multiagent play. Their work showed that algorithms can exhibit wide variation in performance as the algorithms are tuned to the game's difficulty.

Lazaridou et al. [183] proposed a framework for language learning that relies on multiagent communication. The agents, represented by (feed-forward) neural networks, need to develop an emergent language to solve a task. The task is formalized as a signaling game [103] in which two agents, a sender and a receiver, obtain a pair of images. The sender is told one of them is the target and is allowed to send a message (from a fixed vocabulary) to the receiver. Only when the receiver identifies the target image do both agents receive a positive reward. The results show that agents can coordinate for the experimented visual-based domain. To analyze the semantic properties of the learned communication protocol, they examined whether symbol usage reflects the semantics of the visual space, and found that, despite some variation, many high-level object groups correspond to the same learned symbols, using a t-SNE [210] based analysis (t-SNE is a visualization technique for high-dimensional data and it has also been used to better understand the behavior of trained DRL agents [29, 362]). A key objective of this work was to determine if the agent's language could be human-interpretable. To achieve this, learned symbols were grounded with natural language by extending the signaling game with a supervised image labelling task (the sender will be encouraged to use conventional names, making communication more transparent to humans). To measure the interpretability of the extended game, a crowdsourced survey was performed, and in essence, the trained agent receiver was replaced with a human. The results showed that in 68% of the cases, human participants picked the correct image.

Similarly, Mordatch and Abbeel [225] investigated the emergence of language with the difference that in their setting there were no explicit roles for the agents (i.e., sender or receiver). To learn, they proposed an end-to-end differentiable model of all agent and environment state dynamics over time to calculate the gradient of the return with backpropagation.

3.4 Learning communication
As we discussed in the previous section, one of the desired emergent behaviors of multiagent interaction is the emergence of communication [183, 225]. This setting usually considers a set of cooperative agents in a partially observable environment (see Sect. 2.2) where agents need to maximize their shared utility by means of communicating information.

Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL) are two methods using deep networks to learn to communicate [96]. Both methods use a neural net that outputs the agent’s Q values (as done in standard DRL algorithms) and a message to communicate to other agents in the next timestep. RIAL is based on DRQN and also uses the concept of parameter sharing, i.e., using a single network whose parameters are shared among all agents. In contrast, DIAL directly passes gradients via the communication channel during learning, and messages are discretized and mapped to the set of communication actions during execution.

Memory-driven (MD) communication was proposed on top of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [206] method. In MD-MADDPG [256], the agents use a shared memory as a communication channel: before taking an action, the agent first reads the memory, then writes a response. In this case the agent’s policy becomes dependent on its private observation and its interpretation of the collective memory. Experiments were performed with two agents in cooperative scenarios. The results highlighted the fact that the communication channel was used differently in each environment, e.g., in simpler tasks agents significantly decrease their memory activity near the end of the task as there are no more changes in the environment; in more complex environments, the changes in memory usage appear at a much higher frequency due to the presence of many sub-tasks.

Dropout [301] is a technique to prevent overfitting (in supervised learning this happens when the learning algorithm achieves good performance only on a specific data set and fails to generalize) in neural networks which is based on randomly dropping units and their connections during training time. Inspired by dropout, Kim et al. [173] proposed a similar approach for multiagent environments where direct communication through messages is allowed. In this case, the messages of other agents are dropped out at training time, thus the authors proposed the Message-Dropout MADDPG algorithm [173]. This method is expected to work in both full and limited communication environments. The empirical results show that with a properly chosen message dropout rate, the proposed method significantly improves both the training speed and the robustness of learned policies (by introducing communication errors) during execution time. This capability is important as MDRL agents trained in simulated or controlled environments will be less fragile when transferred to more realistic environments.
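To make the message-dropout idea concrete, a minimal sketch is given below; the function name, the per-sender Bernoulli draw, and the inverted-dropout rescaling are our own simplifications for illustration, not details taken from [173].

```python
import numpy as np

def dropout_messages(messages, drop_rate, training, rng=None):
    """Zero out each incoming message with probability drop_rate at training time.

    messages: array of shape (n_senders, msg_dim); drop_rate in [0, 1).
    At execution time (training=False) messages pass through unchanged.
    """
    rng = rng or np.random.default_rng()
    if not training or drop_rate <= 0.0:
        return messages
    keep_mask = rng.random(len(messages)) >= drop_rate   # one Bernoulli draw per sender
    dropped = messages * keep_mask[:, None]
    # Rescale so the expected magnitude of the aggregated message is unchanged.
    return dropped / (1.0 - drop_rate)

# Example: an agent receives messages from three teammates during training.
msgs = np.ones((3, 4))
print(dropout_messages(msgs, drop_rate=0.5, training=True))
```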

While RIAL and DIAL used a discrete communication channel, CommNet [312] used a continuous vector channel. Through this channel agents receive the summed transmissions of other agents. The authors assume full cooperation and train a single network for all the agents. There are two distinctive characteristics of CommNet from previous works: it allows multiple communication cycles at each timestep and a dynamic variation of agents at run time, i.e., agents come and go in the environment.

In contrast to previous approaches, in the Multiagent Bidirectionally Coordinated Network (BiCNet) [253], communication takes place in the latent space (i.e., in the hidden layers). It also uses parameter sharing; however, it proposes bidirectional recurrent neural networks [285] to model the actor and critic networks. Note that in BiCNet agents do not explicitly share a message and thus it can be considered a method for learning cooperation.

Learning communication is an active area in MDRL with many open questions. In this context, we refer the interested reader to a recent work by Lowe et al. [205], which discusses common pitfalls (and recommendations to avoid them) when measuring communication in multiagent environments.

3.5 Learning cooperation
Although explicit communication is a new emerging trend in MDRL, there has already been a large amount of work in MAL for cooperative settingsFootnote12 that do not involve communication [213, 248]. Therefore, it was a natural starting point for many recent MDRL works.

Foerster et al. [99] studied the simple scenario of cooperation with independent Q-learning agents (see Sect. 3.1), where the agents use the standard DQN architecture of neural networks and an experience replay buffer (see Fig. 3). However, for the ER to work, the data distribution needs to follow certain assumptions (see Sect. 2.2) which are no longer valid due to the multiagent nature of the world: the dynamics that generated the data in the ER no longer reflect the current dynamics, making the experience obsolete [99, 194]. Their solution is to add information to the experience tuple that can help to disambiguate the age of the sampled data from the replay memory. Two approaches were proposed. The first is Multiagent Importance Sampling, which adds the probability of the joint action so an importance sampling correction [36, 260] can be computed when the tuple is later sampled for training. This is similar to previous works in adaptive importance sampling [4, 102] and off-environment RL [68]. The second approach is Multiagent Fingerprints, which adds the estimate (i.e., fingerprint) of other agents' policies (loosely inspired by Hyper-Q [326], see Sect. 4.1). For the practical implementation, good results were obtained by using the training iteration number and exploration rate as the fingerprint.
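As an illustration of the fingerprint idea, the sketch below stores the training iteration and exploration rate alongside each transition; the class and field names are our own and the buffer is deliberately minimal, under the assumption that the fingerprint is later concatenated to the observation fed to the Q-network.

```python
import random
from collections import deque, namedtuple

# Each transition is augmented with a "fingerprint" of the other agents' learning state;
# following the practical choice reported above, we use (training_iteration, epsilon).
Transition = namedtuple(
    "Transition", ["obs", "action", "reward", "next_obs", "done", "fingerprint"]
)

class FingerprintReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done, train_iter, epsilon):
        self.buffer.append(
            Transition(obs, action, reward, next_obs, done, (train_iter, epsilon))
        )

    def sample(self, batch_size):
        # Uniform sampling; the stored fingerprint lets the learner condition its
        # value estimate on how "old" (and hence how off-distribution) the sample is.
        return random.sample(self.buffer, batch_size)
```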

Gupta et al. [123] tackled cooperative environments in partially observable domains without explicit communication. They proposed parameter sharing (PS) as a way to improve learning in homogeneous multiagent environments (where agents have the same set of actions). The idea is to have one globally shared learning network that can still behave differently at execution time because its inputs (individual agent observation and agent index) will be different. They tested three variations of this approach with parameter sharing: PS-DQN, PS-DDPG and PS-TRPO, which extended the single-agent DQN, DDPG and TRPO algorithms, respectively. The results showed that PS-TRPO outperformed the other two. Note that Foerster et al. [96] concurrently proposed a similar concept, see Sect. 3.4.
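A minimal sketch of parameter sharing is shown below, assuming a tiny fully connected network in NumPy; the dimensions and the choice of appending the agent index as a one-hot vector are illustrative, not the architectures used in [123].

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, N_AGENTS, N_ACTIONS, HIDDEN = 8, 3, 4, 32

# A single parameter set shared by every agent (PS-DQN style Q-value head).
W1 = rng.normal(scale=0.1, size=(OBS_DIM + N_AGENTS, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def shared_q_values(obs, agent_id):
    """All agents use the same weights, but the one-hot agent index (plus each agent's
    own observation) lets the shared network produce agent-specific outputs."""
    agent_onehot = np.eye(N_AGENTS)[agent_id]
    x = np.concatenate([obs, agent_onehot])
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

obs = rng.normal(size=OBS_DIM)
print(shared_q_values(obs, agent_id=0))
print(shared_q_values(obs, agent_id=2))   # same weights, potentially different behavior
```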

Lenient-DQN (LDQN) [247] took the leniency concept [37] (originally presented in MAL) and extended their use to MDRL. The purpose of leniency is to overcome a pathology called relative overgeneralization [249, 250, 347]. Similar to other approaches designed to overcome relative overgeneralization (e.g., distributed Q-learning [181] and hysteretic Q-learning [213]) lenient learners initially maintain an optimistic disposition to mitigate the noise from transitions resulting in miscoordination, preventing agents from being drawn towards sub-optimal but wide peaks in the reward search space [246]. However, similar to other MDRL works [99], the LDQN authors experienced problems with the ER buffer and arrived at a similar solution: adding information to the experience tuple, in their case, the leniency value. When sampling from the ER buffer, this value is used to determine a leniency condition; if the condition is not met then the sample is ignored.
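The following sketch illustrates the leniency mechanism on a single tabular update; in LDQN the leniency value is stored in the experience tuple and decayed with a temperature schedule, which we abstract here into a single probability, so the names and the decay are illustrative assumptions.

```python
import numpy as np

def lenient_td_update(q, td_error, leniency, lr, rng=None):
    """Apply a TD update, but ignore value-lowering updates with probability `leniency`.

    `leniency` is the value stored with the transition when it was collected
    (high early in training, decayed as the state-action pair is revisited).
    """
    rng = rng or np.random.default_rng()
    if td_error >= 0 or rng.random() > leniency:
        return q + lr * td_error      # optimistic: increases are always accepted
    return q                          # lenient: this negative update is forgiven

# Early in training (leniency close to 1.0) most negative updates are skipped;
# as leniency decays to 0.0 the rule degenerates to standard Q-learning.
```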

In a similar vein, Decentralized-Hysteretic Deep Recurrent Q-Networks (DEC-HDRQNs) [244] were proposed for fostering cooperation among independent learners. The motivation is similar to LDQN, making an optimistic value update, however, their solution is different. Here, the authors took inspiration from Hysteretic Q-learning [213], originally presented in MAL, where two learning rates were used. A difference between lenient agents and hysteretic Q-learning is that lenient agents are only initially forgiving towards teammates. Lenient learners over time apply less leniency towards updates that would lower utility values, taking into account how frequently observation-action pairs have been encountered. The idea being that the transition from optimistic to average reward learner will help make lenient learners more robust towards misleading stochastic rewards [37]. Additionally, in DEC-HDRQNs the ER buffer is also extended into concurrent experience replay trajectories, which are composed of three dimensions: agent index, the episode, and the timestep; when training, the sampled traces have the same starting timesteps. Moreover, to improve on generalization over different tasks, i.e., multi-task learning[62], DEC-HDRQNs make use of policy distillation [146, 273] (see Sect. 4.1). In contrast to other approaches, DEC-HDRQNS are fully decentralized during learning and execution.
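For reference, the hysteretic update itself is compact; the sketch below shows the tabular version with two learning rates (DEC-HDRQNs apply the same asymmetric rule to the TD errors of a recurrent network), and the constants are illustrative.

```python
import numpy as np

def hysteretic_q_update(Q, s, a, r, s_next, done, alpha=0.1, beta=0.01, gamma=0.99):
    """Hysteretic Q-learning step: positive TD errors use the larger rate `alpha`,
    negative ones the smaller rate `beta`, making cooperating independent learners
    optimistic about teammates' exploratory mistakes."""
    target = r if done else r + gamma * np.max(Q[s_next])
    delta = target - Q[s, a]
    lr = alpha if delta >= 0 else beta
    Q[s, a] += lr * delta
    return delta

# Usage: Q = np.zeros((n_states, n_actions)); call once per observed transition.
```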

Weighted Double Deep Q-Network (WDDQN) [365] is based on having double estimators. This idea was originally introduced in Double Q-learning [130] and aims to remove the existing overestimation bias caused by using the maximum action value as an approximation for the maximum expected action value (see Sect. 4.1). It also uses a lenient reward [37] to be optimistic during the initial phase of coordination and proposes a scheduled replay strategy in which samples closer to the terminal states are heuristically given higher priority; this strategy might not be applicable to every domain. For other works extending the ER to multiagent settings see MADDPG [206], Sects. 4.1 and 4.2.

Fig. 5
A schematic view of the architecture used in FTW (For the Win) [156]: two unrolled recurrent neural networks (RNNs) operate at different time-scales; the idea is that the Slow RNN helps with long-term temporal correlations. Observations are the latent-space output of a convolutional neural network that learns non-linear features. Feudal Networks [338] is another work in single-agent DRL that also maintains a multi-time scale hierarchy where the slower network sets the goals and the faster network tries to achieve them. Feudal Networks were, in turn, inspired by early work in RL which proposed a hierarchy of Q-learners [82, 296]

While previous approaches were mostly inspired by how MAL algorithms could be extended to MDRL, other works take as their base the results of single-agent DRL. One example is the For The Win (FTW) [156] agent, which is based on the actor-learner structure of IMPALA [93] (see Sect. 2.2). The authors test FTW in a game where two opposing teams compete to capture each other's flags [57]. To deal with the MAL problem they propose two main additions: a hierarchical two-level representation with recurrent neural networks operating at different timescales, as depicted in Fig. 5, and population based training [157, 185, 271] where 30 agents were trained in parallel together with a stochastic matchmaking scheme that biases agents to be of similar skills. The Elo rating system [90] was originally devised to rate chess player skills;Footnote13 TrueSkill [138] extended Elo by tracking uncertainty in skill rating, supporting draws, and matches beyond 1 vs 1; α-Rank is a more recent alternative to Elo [243]. FTW did not use TrueSkill but a simpler extension of Elo for n vs n games (by adding individual agent ratings to compute the team skill). Hierarchical approaches were previously proposed in RL, e.g., Feudal RL [82, 296], and were later extended to DRL in Feudal networks [338]; population based training can be considered analogous to evolutionary strategies that employ self-adaptive hyperparameter tuning to modify how the genetic algorithm itself operates [20, 85, 185]. An interesting result from FTW is that the population-based training obtained better results than training via self-play [325], which was a standard concept in previous works [43, 291]. FTW used heavy compute resources: it used 30 agents (processes) in parallel, every training game lasted 4500 agent steps (approximately 5 min), and agents were trained for two billion steps (approximately 450K games).
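To illustrate the rating-based matchmaking ingredient, the sketch below implements the standard Elo expected score and the simple n vs n extension described above (a team rating formed by summing member ratings); the K-factor and the equal per-member update are our own illustrative choices, not details from [156].

```python
def elo_expected(r_a, r_b):
    """Probability that team A beats team B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_team_elo(ratings_a, ratings_b, score_a, k=32.0):
    """n-vs-n extension: a team's rating is the sum of its members' ratings,
    and each member receives the same rating change.

    score_a is 1.0 for a win by team A, 0.5 for a draw, 0.0 for a loss.
    """
    team_a, team_b = sum(ratings_a), sum(ratings_b)
    expected_a = elo_expected(team_a, team_b)
    delta = k * (score_a - expected_a)
    return ([r + delta for r in ratings_a],
            [r - delta for r in ratings_b])

# Matchmaking can then bias sampling towards opponents with similar (summed) ratings.
ratings_a, ratings_b = update_team_elo([1000.0, 1020.0], [990.0, 1010.0], score_a=1.0)
```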

Lowe et al. [206] noted that using standard policy gradient methods (see Sect. 2.1) on multiagent environments yields high variance and performs poorly. This occurs because the variance is further increased as all the agents' rewards depend on the rest of the agents, and it is formally shown that as the number of agents increases, the probability of taking a correct gradient direction decreases exponentially [206]. Therefore, to overcome this issue Lowe et al. proposed the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [206], building on DDPG [192] (see Sect. 2.2), to train a centralized critic per agent that is given all agents' policies during training to reduce the variance by removing the non-stationarity caused by the concurrently learning agents. Here, the actor only has local information (turning the method into centralized training with decentralized execution) and the ER buffer records experiences of all agents. MADDPG was tested in both cooperative and competitive scenarios, and experimental results show that it performs better than several decentralized methods (such as DQN, DDPG, and TRPO). The authors mention that traditional RL methods do not produce consistent gradient signals. This is exemplified in challenging competitive scenarios where agents continuously adapt to each other, causing the learned best-response policies to oscillate; for such a domain, MADDPG is shown to learn more robustly than DDPG.

Another approach based on policy gradients is Counterfactual Multi-Agent Policy Gradients (COMA) [98]. COMA was designed for the fully centralized setting and the multiagent credit assignment problem [332], i.e., how the agents should deduce their contributions when learning in a cooperative setting in the presence of only global rewards. Their proposal is to compute a counterfactual baseline, that is, to marginalize out the action of the agent while keeping the rest of the other agents' actions fixed. Then, an advantage function can be computed comparing the current Q value to the counterfactual. This counterfactual baseline has its roots in difference rewards, which is a method for obtaining the individual contribution of an agent in a cooperative multiagent team [332]. In particular, the aristocrat utility aims to measure the difference between an agent's actual action and the average action [355]. The intention would be equivalent to sidelining the agent by having the agent perform an action where the reward does not depend on the agent's actions, i.e., to consider the reward that would have arisen assuming a world in which that agent had never existed (see Sect. 4.2).
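The counterfactual baseline can be written in a few lines once the centralized critic's outputs are available; in the sketch below, q_joint and pi_i are placeholders for the critic and policy outputs of agent i (with the other agents' actions held fixed), and the names are ours.

```python
import numpy as np

def counterfactual_advantage(q_joint, pi_i, joint_action, agent_i):
    """COMA-style advantage for agent_i.

    q_joint: vector of centralized-critic values Q(s, (a_i, a_-i)) indexed by agent_i's
             own action, with the other agents' actions fixed at `joint_action`.
    pi_i:    agent_i's current policy over its own actions (same length as q_joint).
    """
    a_i = joint_action[agent_i]
    baseline = np.dot(pi_i, q_joint)   # marginalize agent_i's action out of the critic
    return q_joint[a_i] - baseline

# Example with 3 actions for agent 0:
q = np.array([1.0, 0.5, -0.2])          # Q(s, (a_i, a_-i)) for each a_i
pi = np.array([0.6, 0.3, 0.1])
adv = counterfactual_advantage(q, pi, joint_action=[0, 2], agent_i=0)
```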

On the one hand, fully centralized approaches (e.g., COMA) do not suffer from non-stationarity but have constrained scalability. On the other hand, independent learning agents are better suited to scale but suffer from non-stationarity issues. There are some hybrid approaches that learn a centralized but factored Q value function [119, 174]. Value Decomposition Networks (VDNs) [313] decompose a team value function into an additive decomposition of the individual value functions. Similarly, QMIX [266] relies on the idea of factorizing, however, instead of a sum, QMIX assumes a mixing network that combines the local values in a non-linear way, which can represent monotonic action-value functions. While the mentioned approaches have obtained good empirical results, the factorization of value functions in multiagent scenarios using function approximators (MDRL) is an ongoing research topic, with open questions such as how well factorizations capture complex coordination problems and how to learn those factorizations [64] (see Sect. 4.4).
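The contrast between the two factorizations can be sketched as follows; note that the one-layer, state-independent mixer below is only a caricature of QMIX, whose actual mixing weights are produced by a hypernetwork conditioned on the global state.

```python
import numpy as np

def vdn_joint_q(per_agent_q, actions):
    """VDN: the team value is the sum of the locally computed Q-values of the chosen actions."""
    return sum(q[a] for q, a in zip(per_agent_q, actions))

def monotonic_mix(per_agent_q, actions, weights, bias):
    """QMIX-flavored mixing: combine local values with non-negative weights so that the
    per-agent argmax is consistent with the joint argmax (monotonicity)."""
    w = np.abs(weights)                                   # non-negativity enforces monotonicity
    chosen = np.array([q[a] for q, a in zip(per_agent_q, actions)])
    return float(chosen @ w + bias)

q1, q2 = np.array([0.2, 1.0]), np.array([0.5, -0.1])
print(vdn_joint_q([q1, q2], actions=[1, 0]))              # 1.5
print(monotonic_mix([q1, q2], [1, 0], weights=np.array([0.7, 1.3]), bias=0.1))
```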

3.6 Agents modeling agents
An important ability for agents to have is to reason about the behaviors of other agents by constructing models that make predictions about the modeled agents [6]. An early work for modeling agents while using deep neural networks was the Deep Reinforcement Opponent Network (DRON) [133]. The idea is to have two networks: one which evaluates Q-values and a second one that learns a representation of the opponent’s policy. Moreover, the authors proposed to have several expert networks to combine their predictions to get the estimated Q value, the idea being that each expert network captures one type of opponent strategy [109]. This is related to previous works in type-based reasoning from game theory [129, 167] later applied in AI [6, 26, 109]. The mixture of experts idea was presented in supervised learning where each expert handled a subset of the data (a subtask), and then a gating network decided which of the experts should be used [155].
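A minimal sketch of the mixture-of-experts combination used in this family of models is given below; the expert Q-values and gating logits are placeholders for the outputs of the corresponding (expert and opponent-representation) networks, and the names are ours.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts_q(expert_qs, gate_logits):
    """Combine the Q-value predictions of several expert heads, each specialized in one
    type of opponent strategy, using weights produced by a gating network."""
    weights = softmax(gate_logits)                       # (n_experts,)
    expert_qs = np.asarray(expert_qs)                    # (n_experts, n_actions)
    return weights @ expert_qs                           # expected Q over experts

q_estimate = mixture_of_experts_q(
    expert_qs=[[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]],      # three opponent-type experts
    gate_logits=np.array([2.0, 0.1, -1.0]),              # from the opponent-representation net
)
```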

Fig. 6
a Deep Policy Inference Q-Network: receives four stacked frames as input (similar to DQN, see Fig. 2). b Deep Policy Inference Recurrent Q-Network: receives one frame as input and has an LSTM layer instead of a fully connected layer (FC). Both approaches [148] condition the Q-value outputs on the learned policy features, which are also used to learn the opponent policy


DRON uses hand-crafted features to define the opponent network. In contrast, Deep Policy Inference Q-Network (DPIQN) and its recurrent version, DPIRQN [148] learn policy features directly from raw observations of the other agents. The way to learn these policy features is by means of auxiliary tasks [158, 317] (see Sects. 2.2 and 4.1) that provide additional learning goals, in this case, the auxiliary task is to learn the opponents’ policies. This auxiliary task modifies the loss function by computing an auxiliary loss: the cross entropy loss between the inferred opponent policy and the ground truth (one-hot action vector) of the opponent. Then, the Q value function of the learning agent is conditioned on the opponent’s policy features (see Fig. 6), which aims to reduce the non-stationarity of the environment. The authors used an adaptive training procedure to adjust the attention (a weight on the loss function) to either emphasize learning the policy features (of the opponent) or the respective Q values of the agent. An advantage of these approaches is that modeling the agents can work for both opponents and teammates [148].
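The auxiliary-loss construction can be summarized as follows; the scalar attention weight and the exact way the two terms are combined are simplifications of the adaptive procedure in [148], and all names are illustrative.

```python
import numpy as np

def policy_feature_auxiliary_loss(q_pred, q_target, opp_policy_logits, opp_action, attention):
    """Combined loss sketch: a standard squared TD loss plus a cross-entropy auxiliary loss
    that pushes the policy-feature head to predict the opponent's observed action.

    `attention` is a weight in [0, 1] trading off the two terms.
    """
    td_loss = (q_pred - q_target) ** 2
    log_probs = opp_policy_logits - np.log(np.sum(np.exp(opp_policy_logits)))  # log-softmax
    aux_loss = -log_probs[opp_action]            # cross entropy vs the one-hot action
    return attention * td_loss + (1.0 - attention) * aux_loss

loss = policy_feature_auxiliary_loss(q_pred=0.7, q_target=1.0,
                                     opp_policy_logits=np.array([0.2, 1.5, -0.3]),
                                     opp_action=1, attention=0.5)
```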

In many previous works an opponent model is learned from observations. Self Other Modeling (SOM) [265] proposed a different approach, that is, using the agent's own policy as a means to predict the opponent's actions. SOM can be used in cooperative and competitive settings (with an arbitrary number of agents) and infers other agents' goals. This is important because in the evaluated domains, the reward function depends on the goal of the agents. SOM uses two networks, one used for computing the agent's own policy, and a second one used to infer the opponent's goal. The idea is that these networks have the same input parameters but with different values (the agent's or the opponent's). In contrast to previous approaches, SOM is not focused on learning the opponent's policy, i.e., a probability distribution over next actions, but rather on estimating the opponent's goal. SOM is expected to work best when agents share a set of goals from which each agent gets assigned one at the beginning of the episode and the reward structure depends on both of their assigned goals. Despite its simplicity, training takes longer as an additional optimization step is performed given the other agent's observed actions.

There is a long-standing history of combining game theory and MAL [43, 233, 289]. From that context, some approaches were inspired by influential game theory approaches. Neural Fictitious Self-Play (NFSP) [136] builds on fictitious (self-) play [49, 135], together with two deep networks, to find approximate Nash equilibriaFootnote14 in two-player imperfect information games [341] (for example, consider Poker: when it is an agent's turn to move it does not have access to all information about the world). One network learns an approximate best response (ε-greedy over Q values) to the historical behavior of other agents and the second one (called the average network) learns to imitate its own past best response behaviour using supervised classification. The agent behaves using a mixture of the average and the best response networks depending on the probability of an anticipatory parameter [287]. Comparisons with DQN in Leduc Hold'em Poker revealed that DQN's deterministic strategy is highly exploitable. Such strategies are sufficient to behave optimally in single-agent domains, i.e., MDPs for which DQN was designed. However, imperfect-information games generally require stochastic strategies to achieve optimal behaviour [136]. DQN learning experiences are both highly correlated over time and highly focused on a narrow state distribution. In contrast, the experience of NFSP agents varies more smoothly, resulting in a more stable data distribution, more stable neural networks and better performance.
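The behavioral mixture in NFSP can be sketched as below; q_net and avg_policy_net stand in for the two networks, and in the actual algorithm the choice between them is typically made per episode and the best-response experiences also feed the supervised (average-policy) buffer, details we omit here.

```python
import numpy as np

def nfsp_act(obs, q_net, avg_policy_net, eta=0.1, epsilon=0.05, rng=None):
    """With probability `eta` (the anticipatory parameter) act with the epsilon-greedy
    best-response network, otherwise act with the average (supervised) policy network.

    q_net(obs) returns a vector of Q values; avg_policy_net(obs) returns action
    probabilities that sum to 1.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < eta:
        q_values = q_net(obs)
        if rng.random() < epsilon:                  # epsilon-greedy best response
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))
    probs = avg_policy_net(obs)                     # average strategy (stochastic)
    return int(rng.choice(len(probs), p=probs))
```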

The (N)FSP concept was further generalized in Policy-Space Response Oracles (PSRO) [180], where it was shown that fictitious play is one specific meta-strategy distribution over a set of previous (approximate) best responses (summarized by a meta-game obtained by empirical game theoretic analysis [342]), but there are a wide variety to choose from. One reason to use mixed meta-strategies is that it prevents overfittingFootnote15 the responses to one specific policy, and hence provides a form of opponent/teammate regularization. An approximate scalable version of the algorithm leads to a graph of agents best-responding independently called Deep Cognitive Hierarchies (DCHs) [180] due to its similarity to behavioral game-theoretic models [59, 72].

Minimax is a paramount concept in game theory that is roughly described as minimizing the worst case scenario (maximum loss) [341]. Li et al. [190] took the minimax idea as an approach to robustify learning in multiagent environments so that the learned robust policy should be able to behave well even with strategies not seen during training. They extended the MADDPG algorithm [206] to Minimax Multiagent Deep Deterministic Policy Gradients (M3DDPG), which updates policies considering a worst-case scenario: assuming that all other agents act adversarially. This yields a minimax learning objective which is computationally intractable to directly optimize. They address this issue by taking ideas from robust reinforcement learning [227] which implicitly adopts the minimax idea by using the worst noise concept [257]. In MAL different approaches were proposed to assess the robustness of an algorithm, e.g., guarantees of safety [66, 259], security [73] or exploitability [80, 161, 215].

Previous approaches usually learned a model of the other agents as a way to predict their behavior. However, they do not explicitly account for anticipated learning of the other agents, which is the objective of Learning with Opponent-Learning Awareness (LOLA) [97]. LOLA optimizes the expected return after the opponent updates its policy one step. Therefore, a LOLA agent directly shapes the policy updates of other agents to maximize its own reward. One of LOLA’s assumptions is having access to opponents’ policy parameters. LOLA builds on previous ideas by Zhang and Lesser [363] where the learning agent predicts the opponent’s policy parameter update but only uses it to learn a best response (to the anticipated updated parameters).

Theory of mind is part of a group of recursive reasoning approaches[60, 61, 109, 110] in which agents have explicit beliefs about the mental states of other agents. The mental states of other agents may, in turn, also contain beliefs and mental states of other agents, leading to a nesting of beliefs [6]. Theory of Mind Network (ToMnet) [263] starts with a simple premise: when encountering a novel opponent, the agent should already have a strong and rich prior about how the opponent should behave. ToMnet has an architecture composed of three networks: (i) a character network that learns from historical information, (ii) a mental state network that takes the character output and the recent trajectory, and (iii) the prediction network that takes the current state as well as the outputs of the other networks as its input. The output of the architecture is open for different problems but in general its goal is to predict the opponent’s next action. A main advantage of ToMnet is that it can predict general behavior, for all agents; or specific, for a particular agent.

Deep Bayesian Theory of Mind Policy (Bayes-ToMoP) [358] is another algorithm that takes inspiration from theory of mind [76]. The algorithm assumes the opponent has different stationary strategies to act and changes among them over time [140]. Earlier work in MAL dealt with this setting, e.g., BPR+ [143] extends the Bayesian policy reuseFootnote16 framework [272] to multiagent settings (BPR assumes a single-agent environment; BPR+ aims to best respond to the opponent in a multiagent game). A limitation of BPR+ is that it behaves poorly against itself (self-play), thus, Deep Bayes-ToMoP uses theory of mind to provide a higher-level reasoning strategy which provides an optimal behavior against BPR+ agents.

Deep BPR+ [366] is another work inspired by BPR+ which uses neural networks as value-function approximators. It not only uses the environment reward but also uses the online learned opponent model [139, 144] to construct a rectified belief over the opponent strategy. Additionally, it leverages ideas from policy distillation [146, 273] and extends them to the multiagent case to create a distilled policy network. In this case, whenever a new acting policy is learned, distillation is applied to consolidate the new updated library which improves in terms of storage and generalization (over opponents).

4 Bridging RL, MAL and MDRL
This section aims to provide directions to promote fruitful cooperations between sub-communities. First, we address the pitfall of deep learning amnesia, roughly described as missing citations to the original works and not exploiting the advancements that have been made in the past. We present examples on how ideas originated earlier, for example in RL and MAL, were successfully extended to MDRL (see Sect. 4.1). Second, we outline lessons learned from the works analyzed in this survey (see Sect. 4.2). Then we point the readers to recent benchmarks for MDRL (see Sect. 4.3) and we discuss the practical challenges that arise in MDRL like high computational demands and reproducibility (see Sect. 4.4). Lastly, we pose some open research challenges and reflect on their relation with previous open questions in MAL [6] (see Sect. 4.5).

4.1 Avoiding deep learning amnesia: examples in MDRL
This survey focuses on recent deep works, however, in previous sections, when describing recent algorithms, we also point to original works that inspired them. Schmidhuber said “Machine learning is the science of credit assignment. The machine learning community itself profits from proper credit assignment to its members” [280]. In this context, we want to avoid committing the pitfall of not giving credit to original ideas that were proposed earlier, a.k.a. deep learning amnesia. Here, we provide some specific examples of research milestones that were studied earlier, e.g., RL or MAL, and that now became highly relevant for MDRL. Our purpose is to highlight that existent literature contains pertinent ideas and algorithms that should not be ignored. On the contrary, they should be examined and cited [58, 79] to understand recent developments [343].

Dealing with non-stationarity in independent learners It is well known that using independent learners makes the environment non-stationary from the agent’s point of view [182, 333]. Many MAL algorithms tried to solve this problem in different ways [141]. One example is Hyper-Q [326] which accounts for the (values of mixed) strategies of other agents and includes that information in the state representation, which effectively turns the learning problem into a stationary one. Note that in this way it is possible to even consider adaptive agents. Foerster et al. [96] make use of this insight to propose their fingerprint algorithm in an MDRL problem (see Sect. 3.5). Other examples include the leniency concept [37] and Hysteretic Q-learning [213] originally presented in MAL, which now have their “deep” counterparts, LDQNs [247] and DEC-HDRQNs[244], see Sect. 3.5.

Multiagent credit assignment In cooperative multiagent scenarios, it is common to use either local rewards, unique for each agent, or global rewards, which represent the entire group’s performance [3]. However, local rewards are usually harder to obtain, therefore, it is common to rely only on the global ones. This raises the problem of credit assignment: how does a single agent’s actions contribute to a system that involves the actions of many agents [2]. A solution that came from MAL research that has proven successful in many scenarios is difference rewards [3, 86, 332], which aims to capture an agent’s contribution to the system’s global performance. In particular the aristocrat utility aims to measure the difference between an agent’s actual action and the average action [355], however, it has a self-consistency problem and in practice it is more common to compute the wonderful life utility [355, 356], which proposes to use a clamping operation that would be equivalent to removing that player from the team. COMA [98] builds on these concepts to propose an advantage function based on the contribution of the agent, which can be efficiently computed with deep neural networks (see Sect. 3.5).
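A toy example of the wonderful life utility makes the credit-assignment intuition concrete; the global reward function below is invented purely for illustration.

```python
def wonderful_life_utility(global_reward_fn, joint_action, agent_i, null_action=None):
    """Difference-reward sketch: the agent's contribution is the global reward minus the
    global reward obtained when agent_i's action is clamped to a 'null' action
    (conceptually, removing the agent from the team)."""
    counterfactual = list(joint_action)
    counterfactual[agent_i] = null_action
    return global_reward_fn(joint_action) - global_reward_fn(counterfactual)

# Toy global objective: number of distinct sectors covered by the team.
def coverage_reward(joint_action):
    return len({a for a in joint_action if a is not None})

d_0 = wonderful_life_utility(coverage_reward, joint_action=[0, 0, 1], agent_i=0)  # 0 (redundant)
d_2 = wonderful_life_utility(coverage_reward, joint_action=[0, 0, 1], agent_i=2)  # 1 (unique contribution)
```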

Multitask learning In the context of RL, multitask learning [62] is an area that develops agents that can act in several related tasks rather than just in a single one [324]. Distillation, roughly defined as transferring the knowledge from a large model to a small model, was a concept originally introduced for supervised learning and model compression [52, 146]. Inspired by those works, Policy distillation [273] was extended to the DRL realm. Policy distillation was used to train a much smaller network and to merge several task-specific policies into a single policy, i.e., for multitask learning. In the MDRL setting, Omidshafiei et al. [244] successfully adapted policy distillation within Dec-HDRQNs to obtain a more general multitask multiagent network (see Sect. 3.5). Another example is Deep BPR+ [366] which uses distillation to generalize over multiple opponents (see Sect. 3.6).

Auxiliary tasks Jaderberg et al. [158] introduced the term auxiliary task with the insight that (single-agent) environments contain a variety of possible training signals (e.g., pixel changes). These tasks are naturally implemented in DRL in which the last layer is split into multiple parts (heads), each working on a different task. All heads propagate errors into the same shared preceding part of the network, which would then try to form representations, in its next-to-last layer, to support all the heads [315]. However, the idea of multiple predictions about arbitrary signals was originally suggested for RL, in the context of general value functions [315, 317], and there are still open problems, for example, better theoretical understanding [31, 88]. In the context of neural networks, early work proposed hints that improved the network performance and learning time. Suddarth and Kergosien [311] presented a minimal example of a small neural network where it was shown that adding an auxiliary task effectively removed local minima. One could think of extending these auxiliary tasks to modeling other agents' behaviors [142, 225], which is one of the key ideas that DPIQN and DPIRQN [148] proposed in MDRL settings (see Sect. 3.6).

Experience replay Lin [193, 194] proposed the concept of experience replay to speed up the credit assignment propagation process in single agent RL. This concept became central to many DRL works [220] (see Sect. 2.2). However, Lin stated that a condition for the ER to be useful is that “the environment should not change over time because this makes past experiences irrelevant or even harmful” [194]. This is a problem in domains where many agents are learning since the environment becomes non-stationary from the point of view of each agent. Since DRL relies heavily on experience replay, this is an issue in MDRL: the non-stationarity introduced means that the dynamics that generated the data in the agent’s replay memory no longer reflect the current dynamics in which it is learning [96]. To overcome this problem different methods have been proposed [99, 244, 247, 365], see Sect. 4.2.

Double estimators Double Q-learning [130] proposed to reduce the overestimation of action values in Q-learning, this is caused by using the maximum action value as an approximation for the maximum expected action value. Double Q-learning works by keeping two Q functions and was proven to convergence to the optimal policy [130]. Later this idea was applied to arbitrary function approximators, including deep neural networks, i.e., Double DQN [336], which were naturally applied since two networks were already used in DQN (see Sect. 2.2). These ideas have also been recently applied to MDRL [365].
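For completeness, the tabular Double Q-learning step is sketched below (the deep variants replace the two tables with online and target networks); the hyperparameter values are illustrative.

```python
import numpy as np

def double_q_update(qa, qb, s, a, r, s_next, done, alpha=0.1, gamma=0.99, rng=None):
    """Tabular Double Q-learning step: one table selects the greedy next action, the other
    evaluates it, removing the overestimation bias of the single max operator."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        qa, qb = qb, qa                         # randomly choose which table to update
    best_next = int(np.argmax(qa[s_next]))      # selection with the table being updated
    target = r + (0.0 if done else gamma * qb[s_next, best_next])  # evaluation with the other
    qa[s, a] += alpha * (target - qa[s, a])

# Usage: qa, qb = np.zeros((n_states, n_actions)), np.zeros((n_states, n_actions))
```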

4.2 Lessons learned
We have exemplified how RL and MAL can be extended for MDRL settings. Now, we outline general best practices learned from the works analyzed throughout this paper.

Experience replay buffer in MDRL While some works removed the ER buffer in MDRL [96], it is an important component in many DRL and MDRL algorithms. However, using the standard buffer (i.e., keeping ⟨s, a, r, s′⟩) will probably fail due to a lack of theoretical guarantees under this setting, see Sects. 2.2 and 4.1. Adding information in the experience tuple that can help disambiguate the sample is the solution adopted in many works, whether a value based method [99, 244, 247, 365] or a policy gradient method [206]. In this regard, it is an open question to consider how new DRL ideas could be best integrated into the ER [11, 83, 153, 196, 278] and how those ideas would fare in a MDRL setting.

Centralized learning with decentralized execution Many MAL works were either fully centralized or fully decentralized approaches. However, inspired by decentralized partially observable Markov decision processes (DEC-POMDPs) [34, 237],Footnote17 in MDRL this new mixed paradigm has been commonly used [98, 99, 180, 206, 247, 266] (a notable exception is DEC-HDRQNs [244], which perform learning and execution in a decentralized manner, see Sect. 3.5). Note that not all real-world problems fit into this paradigm; it is more common for robotics or games where a simulator is generally available [96]. The main benefit is that during learning additional information can be used (e.g., global state, action, or rewards) and during execution this information is removed.

Parameter sharing Another frequent component in many MDRL works is the idea of sharing parameters, i.e., training a single network in which agents share their weights. Note that, since agents could receive different observations (e.g., in partially observable scenarios), they can still behave differently. This method was proposed concurrently in different works [96, 124] and later it has been successfully applied in many others [99, 253, 266, 312, 313].

Recurrent networks Recurrent neural networks (RNNs) enhance neural networks with a memory capability; however, they suffer from the vanishing gradient problem, which renders them inefficient for long-term dependencies [252]. RNN variants such as LSTMs [114, 147] and GRUs (Gated Recurrent Units) [67] addressed this challenge. In single-agent DRL, DRQN [131] initially proposed the idea of using recurrent networks in single-agent partially observable environments. Then, Feudal Networks [338] proposed a hierarchical approach [82] with multiple LSTM networks operating at different time-scales, i.e., the observation input schedule is different for each LSTM network, to create a temporal hierarchy that can better address the long-term credit assignment challenge for RL problems. Recently, the use of recurrent networks has been extended to MDRL to address the challenge of partial observability [24, 96, 148, 244, 253, 263, 265, 266, 313], for example, in FTW [156], depicted in Fig. 5, and DPIRQN [148], depicted in Fig. 6. See Sect. 4.4 for practical challenges (e.g., training issues) of recurrent networks in MDRL.

Overfitting in MAL In single-agent RL, agents can overfit to the environment [352]. A similar problem can occur in multiagent settings [160]: agents can overfit, i.e., an agent's policy can easily get stuck in a local optimum and the learned policy may be only locally optimal to other agents' current policies [190]. This has the effect of limiting the generalization of the learned policies [180]. To reduce this problem, a solution is to have a set of policies (an ensemble) and learn from them or best respond to the mixture of them [133, 180, 206]. Another solution has been to robustify algorithms: a robust policy should be able to behave well even with strategies different from its training (better generalization) [190].

4.3 Benchmarks for MDRL
Standardized environments such as the Arcade Learning Environment (ALE) [32, 211] and OpenAI Gym [48] have allowed single-agent RL to move beyond toy domains. For DRL there are open-source frameworks that provide compact and reliable implementations of some state-of-the-art DRL algorithms [65]. Even though MDRL is a recent area, there are now a number of open sourced simulators and benchmarks to use with different characteristics, which we describe below.

Fully Cooperative Multiagent Object Transportation Problems (CMOTPs)Footnote18 were originally presented by Busoniu et al. [56] as a simple two-agent coordination problem in MAL. Palmer et al. [247] proposed two pixel-based extensions to the original setting which include narrow passages that test the agents' ability to master fully-cooperative sub-tasks, stochastic rewards and noisy observations, see Fig. 7a.

The Apprentice Firemen GameFootnote19 (inspired by the classic climb game [70]) is another two-agent pixel-based environment that simultaneously confronts learners with four pathologies in MAL: relative overgeneralization, stochasticity, the moving target problem, and the alter-exploration problem [246].

Pommerman [267] is a multiagent benchmark useful for testing cooperative, competitive and mixed (cooperative and competitive) scenarios. It supports partial observability and communication among agents, see Fig. 7b. Pommerman is a very challenging domain from the exploration perspective as the rewards are very sparse and delayed [107]. A recent competition was held during NeurIPS-2018Footnote20 and the top agents from that competition are available for training purposes.

Starcraft Multiagent Challenge [276] is based on the real-time strategy game StarCraft II and focuses on micromanagement challenges,Footnote21 that is, fine-grained control of individual units, where each unit is controlled by an independent agent that must act based on local observations. It is accompanied by a MDRL framework including state-of-the-art algorithms (e.g., QMIX and COMA).Footnote22

The Multi-Agent Reinforcement Learning in Malmö (MARLÖ) competition [254] is another multiagent challenge with multiple cooperative 3D gamesFootnote23 within Minecraft. The scenarios were created with the open source Malmö platform [162], providing examples of how a wider range of multiagent cooperative, competitive and mixed scenarios can be experimented on within Minecraft.

Hanabi is a cooperative multiplayer card game (two to five players). The main characteristic of the game is that players do not observe their own cards but other players can reveal information about them. This makes an interesting challenge for learning algorithms in particular in the context of self-play learning and ad-hoc teams [5, 44, 304]. The Hanabi Learning Environment [25] was recently releasedFootnote24 and it is accompanied with a baseline (deep RL) agent [145].

Arena [298] is a platform for multiagent researchFootnote25 based on the Unity engine [163]. It has 35 multiagent games (e.g., social dilemmas) and supports communication among agents. It has baseline implementations of recent DRL algorithms such as independent PPO learners.

MuJoCo Multiagent Soccer [203] uses the MuJoCo physics engine [327]. The environment simulates a 2 vs. 2 soccer game with agents having a 3-dimensional action space.Footnote26

Neural MMO [308] is a research platformFootnote27 inspired by the human game genre of Massively Multiplayer Online (MMO) Role-Playing Games. These games involve a large, variable number of players competing to survive.

Fig. 7
a A fully cooperative benchmark with two agents, Multiagent Object Transportation. b A mixed cooperative-competitive domain with four agents, Pommerman. For more MDRL benchmarks see Sect. 4.3

4.4 Practical challenges in MDRL
In this section we take a more critical view with respect to MDRL and highlight different practical challenges that already happen in DRL and that are likely to occur in MDRL such as reproducibility, hyperparameter tuning, the need of computational resources and conflation of results. We provide pointers on how we think those challenges could be (partially) addressed.

Reproducibility, troubling trends and negative results Reproducibility is a challenge in RL which is only aggravated in DRL due to different sources of stochasticity: baselines, hyperparameters, architectures [137, 228] and random seeds [69]. Moreover, DRL does not have common practices for statistical testing [100] which has led to bad practices such as only reporting the results when algorithms perform well, sometimes referred as cherry picking [16] (Azizzadenesheli also describes cherry planting as adapting an environment to a specific algorithm [16]). We believe that together with following the advice on how to design experiments and report results [197], the community would also benefit from reporting negative results [100, 108, 270, 286] for carefully designed hypothesis and experiments.Footnote28 However, we found very few papers with this characteristic[17, 170, 208] — we note that this is not encouraged in the ML community; moreover, negative results reduce the chance of paper acceptance [197]. In this regard, we ask the community to reflect on these practices and find ways to remove these obstacles.
可重复性、令人不安的趋势和负面结果 可重复性是RL中的一个挑战,由于随机性来源不同:基线、超参数、架构[137,228]和随机种子[69],在DRL中只会加剧。此外,DRL没有统计测试的通用做法[100],这导致了不良做法,例如仅在算法表现良好时才报告结果,有时被称为樱桃采摘[16](Azizzadenesheli也将樱桃种植描述为使环境适应特定的算法[16])。我们相信,除了遵循关于如何设计实验和报告结果的建议[197]外,社区还将从报告精心设计的假设和实验的负面结果[100,108,270,286]中受益。 Footnote28 然而,我们发现很少有论文具有这种特征[ 17, 170, 208] — 我们注意到这在ML社区中并不受欢迎;此外,阴性结果降低了论文被接受的机会[197]。在这方面,我们要求社区反思这些做法,并找到消除这些障碍的方法。
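To make the statistical-testing point concrete, the sketch below compares two algorithms over several random seeds with a Welch's t-test; it is a minimal illustration of the kind of reporting we advocate, not a prescription, and the algorithm names and score arrays are placeholders.

```python
# Minimal sketch: compare two algorithms across random seeds instead of single runs.
# The scores below are placeholders; in practice they would be final returns obtained
# by re-running each algorithm with a different random seed.
import numpy as np
from scipy import stats

algo_a = np.array([105.2, 98.7, 110.4, 101.9, 99.5])   # e.g., 5 seeds of algorithm A
algo_b = np.array([92.1, 118.3, 95.0, 121.7, 90.8])    # e.g., 5 seeds of algorithm B

print(f"A: mean={algo_a.mean():.1f} +/- {algo_a.std(ddof=1):.1f}")
print(f"B: mean={algo_b.mean():.1f} +/- {algo_b.std(ddof=1):.1f}")

# Welch's t-test (does not assume equal variances); a large p-value warns that
# the apparent difference may be explained by seed-to-seed variability alone.
t_stat, p_value = stats.ttest_ind(algo_a, algo_b, equal_var=False)
print(f"Welch's t-test: t={t_stat:.2f}, p={p_value:.3f}")
```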

Implementation challenges and hyperparameter tuning One problem is that canonical implementations of DRL algorithms often contain additional non-trivial optimizations; these are sometimes necessary for the algorithms to achieve good performance [151]. A recent study by Tucker et al. [331] found that several published works on action-dependent baselines contained bugs and errors, and that those, rather than the proposed methods, were the real reason for the high performance in the experimental results. Melis et al. [216] compared a series of works with increasing innovations in network architectures against vanilla LSTMs [147] (originally proposed in 1997). The results showed that, when properly tuned, LSTMs outperformed the more recent models. In this context, Lipton and Steinhardt noted that the community may have benefited more from learning the details of the hyperparameter tuning [197]. A partial reason for this surprising result might be that this type of network is known for being difficult to train [252], and there are recent works in DRL that report problems when using recurrent networks [78, 95, 106, 123]. Another known complication is catastrophic forgetting (see Sect. 2.2), with recent examples in DRL [264, 336]; we expect that these issues would likely occur in MDRL. The effects of hyperparameter tuning were analyzed in more detail in DRL by Henderson et al. [137], who arrived at the conclusion that hyperparameters can have significantly different effects across algorithms (they tested TRPO, DDPG, PPO and ACKTR) and environments, since there is an intricate interplay among them [137]. The authors urge the community to report all parameters used in the experimental evaluations for accurate comparison; we encourage a similar practice for MDRL. Note that hyperparameter tuning is related to the troubling trend of cherry picking in that it can show a carefully picked set of parameters that make an algorithm work (see previous challenge). Lastly, note that hyperparameter tuning is computationally very expensive, which relates to the following challenge of computational demands.
实现挑战和超参数调优 一个问题是,DRL算法的规范实现通常包含额外的非平凡优化,这些优化有时是算法实现良好性能所必需的[151]。Tucker等人[331]最近的一项研究发现,一些已发表的关于动作依赖基线的论文中存在错误和错误——这些是实验结果中高性能的真正原因,而不是所提出的方法。Melis等人[216]比较了一系列工作,这些工作在网络架构和原版LSTM[147](最初于1997年提出)方面不断创新。结果表明,当适当调整时,LSTM 的性能优于最近的模型。在这种情况下,Lipton和Steinhardt指出,通过学习超参数调优的细节,社区可能受益更多[197]。造成这一令人惊讶结果的部分原因可能是这种类型的网络以难以训练而闻名[252],并且最近在DRL中有一些研究报告了使用循环网络时的问题[78,95,106,123]。另一个已知的并发症是灾难性的遗忘(见第2.2节),最近在DRL中的例子[264,336]——我们预计这些问题可能会发生在MDRL中。Henderson等[137]在DRL中更详细地分析了超参数调优的影响,他们得出的结论是,超参数在算法(他们测试了TRPO、DDPG、PPO和ACKTR)和环境之间可以产生显着不同的影响,因为它们之间存在错综复杂的相互作用[137]。作者敦促社区报告实验评估中使用的所有参数,以便进行准确比较——我们鼓励 MDRL 采取类似的行为。 请注意,超参数优化与樱桃采摘的令人不安的趋势有关,因为它可以显示一组精心挑选的参数,这些参数使算法正常工作(参见之前的挑战)。最后,请注意,超参数调优在计算上非常昂贵,这与以下计算需求挑战有关。
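One low-cost habit that addresses both the reporting and the reproducibility concerns is to serialize every hyperparameter (and the random seed) alongside the results; a minimal sketch is shown below, where all parameter names and values are hypothetical placeholders rather than recommended settings.

```python
# Minimal sketch: save the full experiment configuration next to the results so that
# runs can be reproduced and compared. All names and values here are illustrative.
import json
import time

config = {
    "algorithm": "independent_ppo",   # hypothetical identifier
    "learning_rate": 2.5e-4,
    "discount_gamma": 0.99,
    "clip_range": 0.2,
    "num_agents": 4,
    "rollout_length": 128,
    "seed": 42,
}

run_id = f"run_{int(time.time())}"
with open(f"{run_id}_config.json", "w") as f:
    json.dump(config, f, indent=2)
# ...train, then store returns/metrics under the same run_id so config and results stay paired.
```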

Computational resources Deep RL usually requires millions of interactions for an agent to learn [9], i.e., low sample efficiency [361], which highlights the need for large computational infrastructure in general. The original A3C implementation [219] uses 16 CPU workers for 4 days to learn to play an Atari game with a total of 200M training frames (Footnote 29) (results are reported for 57 Atari games). Distributed PPO used 64 workers (presumably one CPU per worker, although this is not clearly stated in the paper) for 100 hours (more than 4 days) to learn locomotion tasks [134]. In MDRL, for example, in the Atari Pong game, agents were trained for 50 epochs, 250k time steps each, for a total of 1.25M training frames [322]. The FTW agent [156] uses 30 agents (processes) in parallel and every training game lasts for 5 min; FTW agents were trained for approximately 450K games, equivalent to roughly 4.2 years of gameplay. These examples highlight the computational demands sometimes needed within DRL and MDRL.
计算资源 深度强化学习通常需要数百万次交互才能让智能体学习[9],即样本效率低[361],这凸显了对大型计算基础设施的需求。最初的 A3C 实现 [219] 使用 16 个 CPU 工作进程训练 4 天来学习玩一个总共有 200M 个训练帧的 Atari 游戏(脚注 29)(报告了 57 个 Atari 游戏的结果)。分布式 PPO 使用 64 个 worker(大概每个 worker 一个 CPU,尽管论文中没有明确说明)学习运动任务 100 小时(超过 4 天)[134]。例如,在 MDRL 中的 Atari Pong 游戏中,智能体被训练了 50 个 epoch,每个 epoch 250k 时间步长,总共训练了 1.25M 帧 [322]。FTW 智能体 [156] 并行使用 30 个智能体(进程),每个训练游戏持续 5 分钟;FTW 智能体接受了大约 450K 场游戏的训练,约相当于 4.2 年的游戏时间。这些示例突出了 DRL 和 MDRL 中有时需要的计算需求。
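For a sense of scale, the back-of-the-envelope arithmetic behind two of the figures above is sketched below; the inputs are the numbers reported in the cited papers, and the derived quantities (game-years, CPU-days) are our own illustrative summaries.

```python
# Back-of-the-envelope arithmetic for two of the compute figures quoted above.

# FTW [156]: roughly 450K training games, each lasting about 5 minutes of game time.
ftw_minutes = 450_000 * 5
ftw_years = ftw_minutes / (60 * 24 * 365)
print(f"FTW game time: {ftw_years:.1f} years")  # ~4.3 years, matching the ~4.2 figure above

# A3C [219]: 16 CPU workers for 4 days on a single Atari game.
a3c_cpu_days = 16 * 4
print(f"A3C compute per game: {a3c_cpu_days} CPU-days")
```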

Recent works have reduced the learning of an Atari game to minutes (Stooke and Abbeel [306] trained DRL agents in less than one hour with hardware consisting of 8 GPUs and 40 cores). However, this is (for now) the exception, and computational infrastructure remains a major bottleneck for doing DRL and MDRL, especially for those who do not have such large compute power (e.g., most companies and most academic research groups) [29, 286] (Footnote 30). Within this context we propose two ways to address this problem. (1) Raising awareness: For DRL we found few works that study the computational demands of recent algorithms [9, 18]. For MDRL, most published works do not provide information regarding the computational resources used, such as CPU/GPU usage, memory demands, and wall-clock computation. Therefore, the first way to tackle this issue is to raise awareness and encourage authors to report metrics about computational demands for accurate comparison and evaluation. (2) Delving into algorithmic contributions: Another way to address these issues is to prioritize the algorithmic contribution of new MDRL algorithms rather than the computational resources spent. For this to work, it needs to be accompanied by high-quality reviewers.
最近的工作将雅达利游戏的学习时间缩短到几分钟(Stooke 和 Abbeel [306] 在不到一小时的时间内使用由 8 个 GPU 和 40 个内核组成的硬件训练了 DRL 代理)。然而,这是(目前)例外,计算基础设施是进行DRL和MDRL的主要瓶颈,特别是对于那些没有如此大计算能力的人(例如,大多数公司和大多数学术研究小组)[29,286]。 Footnote30 在此背景下,我们提出了两种解决这个问题的方法。(1)提高认识:对于DRL,我们发现很少有研究最新算法的计算需求的工作[9,18]。对于 MDRL,大多数已发表的作品都没有提供有关所用计算资源的信息,例如 CPU/GPU 使用率、内存需求和挂钟计算。因此,解决这个问题的第一种方法是提高认识并鼓励作者报告有关计算需求的指标,以便进行准确的比较和评估。(2)深入研究算法贡献。解决这些问题的另一种方法是优先考虑新 MDRL 算法的算法贡献,而不是花费的计算资源。事实上,要做到这一点,它需要有高质量的审稿人。

We have argued for raising awareness of computational demands and for reporting them; however, it remains an open question how and what to measure/report. There are several dimensions along which to measure efficiency: sample efficiency is commonly measured by counting the state-action pairs used for training; computational efficiency could be measured by the number of CPUs/GPUs and days used for training. How do we measure the impact of other resources, such as external data sources or annotations? (Footnote 31) Similarly, do we need to differentiate the computational needs of the algorithm itself from those of the environment it is run in? We do not have the answers; however, we point out that current standard metrics might not be entirely comprehensive.
我们主张提高对计算需求和报告结果的认识,但是,关于如何以及测量/报告什么仍然存在悬而未决的问题。衡量效率有几个维度:样本效率通常通过计算用于训练的状态-动作对来衡量;计算效率可以通过 CPU/GPU 的数量和用于训练的天数来衡量。我们如何衡量其他资源(例如外部数据源或注释)的影响? Footnote31 同样,我们是否需要区分算法本身的计算需求与它的运行环境?我们没有答案,但是,我们指出,当前的标准指标可能并不完全全面。
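A lightweight way to report the two dimensions mentioned above (sample efficiency and computational efficiency) is to track environment steps together with wall-clock time and hardware usage during training; the sketch below is one possible minimal logger, with all class and field names being illustrative rather than any standard API.

```python
# Minimal sketch of a training-cost logger covering the two efficiency dimensions
# discussed above: sample efficiency (environment steps) and computational efficiency
# (wall-clock time and hardware used). All fields are illustrative.
import json
import time


class ComputeReport:
    def __init__(self, num_cpus: int, num_gpus: int):
        self.start = time.time()
        self.env_steps = 0          # state-action pairs consumed
        self.num_cpus = num_cpus
        self.num_gpus = num_gpus

    def record_steps(self, n: int) -> None:
        self.env_steps += n

    def summary(self) -> dict:
        hours = (time.time() - self.start) / 3600
        return {
            "env_steps": self.env_steps,
            "wall_clock_hours": round(hours, 2),
            "cpu_hours": round(hours * self.num_cpus, 2),
            "gpu_hours": round(hours * self.num_gpus, 2),
        }


report = ComputeReport(num_cpus=8, num_gpus=1)
report.record_steps(10_000)  # called from the training loop after each rollout
print(json.dumps(report.summary(), indent=2))
```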

In the end, we believe that methods based on heavy compute act as a frontier for showcasing benchmarks [235, 339], i.e., they show what results are possible as data and compute are scaled up (e.g., OpenAI Five generates 180 years of gameplay data each day using 128,000 CPU cores and 256 GPUs [235]; AlphaStar uses 200 years of StarCraft II gameplay [339]); however, algorithmic methods based on lighter compute can also yield significant contributions and better tackle real-world problems.
最后,我们认为基于高计算的方法可以作为展示基准的前沿 [ 235, 339],即它们显示了随着数据和计算规模的扩大,可能的结果(例如,OpenAI Five 每天使用 128,000 个 CPU 内核和 256 个 GPU 生成 180 年的游戏数据 [ 235];AlphaStar 使用了 200 年的星际争霸 II 游戏玩法 [ 339]);然而,基于轻量级计算的算法方法也可以为更好地解决现实世界的问题做出重大贡献。

Occam’s razor and ablative analysis Finding the simplest context that exposes the innovative research idea remains challenging, and if ignored it leads to a conflation of fundamental research (working principles in the most abstract setting) and applied research (working systems that are as complete as possible). In particular, some deep learning papers are presented as learning from pixels without further explanation, when object-level representations would have already exposed the algorithmic contribution. Learning from pixels still makes sense when remaining comparable with established benchmarks (e.g., OpenAI Gym [48]), but less so if custom simulations are written without open-source access, as this introduces unnecessary variance in pixel-level representations and artificially inflates computational resources (see the previous point about computational resources) (Footnote 32). In this context there are some notable exceptions where the algorithmic contribution is presented in a minimal setting and then results are scaled to complex settings: LOLA [97] first presented a minimalist setting with a two-player, two-action game and then a more complex variant; similarly, QMIX [266] presented its results in a two-step (matrix) game and then in the more involved StarCraft II micromanagement domain [276].
奥卡姆剃刀和消融分析 找到揭示创新研究理念的最简单的背景仍然具有挑战性,如果忽视它,会导致基础研究(最抽象环境中的工作原理)和应用研究(工作系统尽可能完整)的混淆。特别是,一些深度学习论文在没有进一步解释的情况下被描述为从像素中学习,而对象级表示已经暴露了算法的贡献。与已建立的基准测试(例如,OpenAI Gym [ 48])保持可比性仍然有意义,但如果自定义模拟是在没有开源访问的情况下编写的,则意义不大,因为它在像素级表示中引入了不必要的差异,并人为地夸大了计算资源(参见上一点关于计算资源)。 Footnote32 在这种情况下,有一些值得注意的例外,其中算法贡献以最小的设置呈现,然后结果被缩放到复杂的设置中:LOLA [ 97] 首先提出了一个极简主义的设置,有一个双人双动作游戏,然后是一个更复杂的变体;类似地,QMIX[266]在两步(矩阵)博弈中展示了其结果,然后在更复杂的星际争霸II微观管理领域[276]。
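To illustrate what such a minimal setting looks like, the sketch below defines a generic two-player, two-action matrix game and evaluates a joint action; the payoff values are placeholders and are not those used by LOLA or QMIX.

```python
# Minimal sketch of a two-player, two-action (general-sum) matrix game, the kind of
# stripped-down setting in which an algorithmic contribution can first be exposed.
# Payoff values are placeholders, not those of any particular published experiment.
import numpy as np

# payoff[i][a0, a1] = reward to player i when player 0 plays a0 and player 1 plays a1.
payoff = [
    np.array([[3.0, 0.0],
              [5.0, 1.0]]),   # player 0
    np.array([[3.0, 5.0],
              [0.0, 1.0]]),   # player 1
]

def play(a0, a1):
    """Return the rewards of both players for a joint action."""
    return payoff[0][a0, a1], payoff[1][a0, a1]

print(play(0, 0))  # (3.0, 3.0) -> mutual cooperation in this placeholder game
print(play(1, 0))  # (5.0, 0.0) -> player 0 defects against a cooperator
```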

4.5 Open questions 4.5 开放性问题
Finally, here we present some open questions for MDRL and point to suggestions on how to approach them. We believe that there are solid ideas in earlier literature and we refer the reader to Sect. 4.1 to avoid deep learning amnesia.
最后,我们在这里提出了一些关于MDRL的开放性问题,并指出了如何处理这些问题的建议。我们相信早期文献中有坚实的思想,我们建议读者参考第 4.1 节以避免深度学习遗忘症。

On the challenge of sparse and delayed rewards.
关于稀疏和延迟奖励的挑战。

Recent MDRL competitions and environments have complex scenarios where many actions are taken before a reward signal is available (see Sect. 4.3). This sparseness is already a challenge for RL [89, 315], where approaches such as count-based exploration/intrinsic motivation [27, 30, 47, 279, 307] and hierarchical learning [87, 178, 278] have been proposed to address it; in MDRL this is even more problematic, since the agents not only need to learn basic behaviors (as in DRL) but also the strategic element (e.g., competitive/collaborative) embedded in the multiagent setting. To address this issue, recent MDRL approaches applied dense rewards [176, 212, 231] (a concept that originated in RL) at each step to allow the agents to learn basic motor skills, and then decreased these dense rewards over time in favor of the environmental reward [24], see Sect. 3.3. Recent works like OpenAI Five [235] use hand-crafted intermediate rewards to accelerate learning, and FTW [156] lets the agents learn their internal rewards via a hierarchical two-tier optimization. In single-agent domains, RUDDER [12] has recently been proposed for such delayed sparse reward problems. RUDDER generates a new MDP with more intermediate rewards whose optimal solution is still an optimal solution to the original MDP. This is achieved by using LSTM networks to redistribute the original sparse reward to earlier state-action pairs and automatically provide reward shaping. How best to extend RUDDER to multiagent domains is an open avenue of research.
最近的 MDRL 竞赛和环境具有复杂的场景,在奖励信号可用之前采取了许多行动(参见第 4.3 节)。这种稀疏性对于RL [ 89, 315]来说已经是一个挑战,其中基于计数的探索/内在动机[ 27, 30, 47, 279, 307] 和分层学习 [ 87, 178, 278] 等方法已被提出来解决它——在 MDRL 中,这更成问题,因为智能体不仅需要学习基本行为(如在 DRL 中),还需要学习战略要素(例如, 竞争/协作)嵌入在多智能体设置中。为了解决这个问题,最近的MDRL方法在每一步都应用了密集奖励[176,212,231](起源于RL的概念),以允许智能体学习基本的运动技能,然后随着时间的推移减少这些密集奖励,以支持环境奖励[24],参见第3.3节。最近的工作,如OpenAI Five[235]使用手工制作的中间奖励来加速学习,FTW [156]让智能体通过分层的两层优化来学习他们的内部奖励。在单智能体领域中,RUDDER [ 12] 最近被提出用于解决这种延迟稀疏奖励问题。RUDDER 生成具有更多中间奖励的新 MDP,其最优解仍然是原始 MDP 的最优解。这是通过使用 LSTM 网络将原始稀疏奖励重新分配给早期的状态-动作对并自动提供奖励整形来实现的。如何最好地将 RUDDER 扩展到多智能体域是一条开放的研究途径。
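The idea of starting from dense shaping rewards and gradually handing control back to the sparse environmental reward can be written as a simple annealed mixture; the sketch below is a minimal illustration, where the linear schedule and the shaping term are our own placeholder choices rather than the scheme of any cited work.

```python
# Minimal sketch of annealing a dense shaping reward in favor of the sparse
# environmental reward, as in the curriculum-style approaches discussed above.
# The shaping term and the linear schedule are illustrative choices.

def mixed_reward(env_reward: float, dense_reward: float, step: int,
                 anneal_steps: int = 1_000_000) -> float:
    """Blend dense shaping with the environmental reward.

    Early in training the dense term dominates (helps learn basic skills);
    by `anneal_steps` only the environmental reward remains.
    """
    w = max(0.0, 1.0 - step / anneal_steps)   # weight goes from 1 to 0 over training
    return env_reward + w * dense_reward

# Example: shaping counts fully at step 0, half at 500k steps, not at all after 1M.
print(mixed_reward(env_reward=0.0, dense_reward=0.1, step=0))          # 0.1
print(mixed_reward(env_reward=0.0, dense_reward=0.1, step=500_000))    # 0.05
print(mixed_reward(env_reward=1.0, dense_reward=0.1, step=2_000_000))  # 1.0
```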

On the role of self-play.
关于自我发挥的作用。

Self-play is a cornerstone in MAL with impressive results [42, 45, 71, 113, 149]. While notable results have also been shown in MDRL [43, 136], recent works have shown that plain self-play does not yield the best results. Adding diversity, e.g., through evolutionary methods [20, 85, 185, 271] or sampling-based methods, has shown good results [24, 156, 187]. A drawback of these solutions is their additional computational requirements, since they need either parallel training (more CPU computation) or additional memory. It thus remains an open problem to improve the computational efficiency of these previously proposed successful methods, i.e., to achieve similar training stability with smaller population sizes that use fewer CPU workers in MAL and MDRL (see Sect. 4.4 and Albrecht et al. [6, Section 5.5]).
自我游戏是MAL的基石,其成绩令人印象深刻[42,45,71,113,149]。虽然MDRL也显示出显著的结果[43,136],但最近的研究也表明,简单的自我游戏并不能产生最好的结果。然而,增加多样性,即进化方法[20,85,185,271]或基于抽样的方法,已经显示出良好的结果[24,156,187]。这些解决方案的一个缺点是额外的计算要求,因为它们需要并行训练(更多的 CPU 计算)或内存要求。然后,提高这些先前提出的成功方法的计算效率仍然是一个悬而未决的问题,即在 MAL 和 MDRL 中使用较少的 CPU 工作者的较小种群规模实现类似的训练稳定性(参见第 4.4 节和 Albrecht 等人 [ 6, 第 5.5 节])。
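A common way to add such diversity is to keep a growing pool of past policy snapshots and sample opponents from it instead of always playing against the latest policy; the sketch below is one simple, generic variant of this idea, where the policy objects, the uniform sampling scheme, and the `train_one_game` callback are placeholders.

```python
# Minimal sketch of self-play with opponent sampling from a pool of past policies,
# one simple way to add diversity beyond plain self-play. Policies are represented
# abstractly; the sampling scheme (uniform over snapshots) is an illustrative choice.
import copy
import random


def self_play_training(policy, train_one_game, num_games: int, snapshot_every: int = 100):
    pool = [copy.deepcopy(policy)]          # frozen snapshots of past selves
    for game in range(num_games):
        opponent = random.choice(pool)      # sample a (possibly old) opponent
        train_one_game(policy, opponent)    # user-supplied: play one game and update `policy`
        if (game + 1) % snapshot_every == 0:
            pool.append(copy.deepcopy(policy))  # grow the pool; memory cost grows with it
    return policy, pool
```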

On the challenge of the combinatorial nature of MDRL.
关于MDRL组合性质的挑战。

Monte Carlo tree search (MCTS) [51] has been the backbone of the major breakthroughs behind AlphaGo [291] and AlphaGo Zero [293], which combined search and DRL. A recent work [340] outlined how search and RL can be better combined for potentially new methods. However, multiagent scenarios pose an additional challenge for centralized methods: the joint action space grows exponentially with the number of agents [169]. One way to tackle this challenge within multiagent scenarios is the use of search parallelization [35, 171]. Given more scalable planners, there is room for research in combining these techniques in MDRL settings.
蒙特卡洛树搜索(MCTS)[51]是AlphaGo[291]和AlphaGo Zero[293]将搜索和DRL结合在一起的重大突破的支柱。最近的一项研究[340]概述了如何更好地将搜索和RL结合起来,以获得潜在的新方法。然而,对于多智能体场景,还有一个额外的挑战,即集中式方法的所有智能体的操作空间呈指数增长[169]。在多智能体场景中解决这一挑战的一种方法是使用搜索并行化[35,171]。鉴于更具可扩展性的规划器,在MDRL环境中结合这些技术还有研究空间。

To learn complex multiagent interactions some type of abstraction [84] is often needed. For example, factored value functions [8, 119, 120, 121, 174, 236] (see QMIX and VDN in Sect. 3.5 for recent work in MDRL) try to exploit independence among agents through (factored) structure; however, in MDRL there are still open questions such as understanding their representational power [64] (e.g., the accuracy of the learned Q-function approximations) and how to learn those factorizations, where ideas from transfer planning techniques could be useful [240, 335]. In transfer planning the idea is to define a simpler “source problem” (e.g., with fewer agents) in which the agent(s) can plan [240] or learn [335]; since it is less complex than the real multiagent problem, issues such as the non-stationarity of the environment can be reduced or removed. Lastly, another related idea is influence abstractions [28, 141, 241], where instead of learning a complex multiagent model, these methods try to build smaller models based on the influence agents can exert on one another. While this has not been sufficiently explored in actual multiagent settings, there is some evidence that these ideas can lead to effective inductive biases, improving the effectiveness of DRL in such local abstractions [309].
为了学习复杂的多智能体交互[84],通常需要某种类型的抽象[84],例如,分解值函数[8,119,120,121,174,236](参见第3.5节中的QMIX和VDN,了解MDRL的最新工作)试图通过(分解)结构来利用智能体之间的独立性;然而,在MDRL中,仍然存在一些悬而未决的问题,例如理解它们的表示能力[64](例如,学习到的Q函数近似的准确性)以及如何学习这些因式分解,其中来自转移规划技术的想法可能是有用的[240,335]。在转移计划中,这个想法是定义一个更简单的“源问题”(例如,使用更少的智能体),其中智能体可以计划[240]或学习[335];由于它比真正的多智能体问题更简单,因此可以减少/消除环境的非平稳性等问题。最后,另一个相关的想法是影响抽象[28,141,241],这些方法不是学习复杂的多智能体模型,而是尝试根据智能体可以相互施加的影响来构建较小的模型。虽然这在实际的多智能体环境中尚未得到充分探索,但有一些证据表明,这些想法可以导致有效的归纳偏差,从而提高DRL在这种局部抽象中的有效性[309]。
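As a concrete (and deliberately simplified) example of such factored structure, a VDN-style decomposition expresses the joint value as a sum of per-agent utilities; the sketch below shows only the value composition and greedy selection, not the end-to-end training used by VDN or QMIX, and the shapes and random values are placeholders for network outputs.

```python
# Minimal sketch of a VDN-style factored joint action-value, Q_tot = sum_i Q_i,
# one simple instance of the factored value functions discussed above.
# Shapes and values are illustrative; the full methods (VDN/QMIX) additionally
# define how this decomposition is trained end-to-end from the team reward.
import numpy as np

num_agents, num_actions = 3, 4
rng = np.random.default_rng(0)

# Per-agent utilities Q_i(tau_i, a_i), here random stand-ins for network outputs.
per_agent_q = rng.normal(size=(num_agents, num_actions))

# Decentralized greedy action selection: each agent maximizes its own utility...
greedy_actions = per_agent_q.argmax(axis=1)

# ...which, under the additive factorization, also maximizes the joint value Q_tot.
q_tot = per_agent_q[np.arange(num_agents), greedy_actions].sum()
print(greedy_actions, q_tot)
```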

5 Conclusions 5 结论
Deep reinforcement learning has shown recent success on many fronts [221, 224, 291] and a natural next step is to test multiagent scenarios. However, learning in multiagent environments is fundamentally more difficult due to non-stationarity, the increase in dimensionality, and the credit-assignment problem, among other factors [45, 55, 141, 246, 305, 332, 348].
深度强化学习最近在许多方面都取得了成功[221,224,291],下一步自然是测试多智能体场景。然而,由于非平稳性、维度的增加和学分分配问题等因素,在多智能体环境中学习从根本上更加困难[45,55,141,246,305,332,348]。

This survey provides a broad overview of recent works in the emerging area of Multiagent Deep Reinforcement Learning (MDRL). First, we categorized recent works into four different topics: emergent behaviors, learning communication, learning cooperation, and agents modeling agents. Then, we exemplified how key components that originated in RL and MAL (e.g., experience replay and difference rewards) need to be adapted to work in MDRL. We provided general lessons learned applicable to MDRL, pointed to recent multiagent benchmarks, and highlighted some open research problems. Finally, we also reflected on practical challenges in MDRL, such as computational demands and reproducibility.
本调查对多智能体深度强化学习(MDRL)这一新兴领域的最新工作进行了广泛的概述。首先,我们将最近的工作分为四个不同的主题:涌现行为、学习交流、学习合作和代理建模代理。然后,我们举例说明了源自 RL 和 MAL 的关键组件(例如,经验重放和差异奖励)需要如何适应 MDRL 中的工作。我们提供了适用于MDRL的一般经验教训,指出了最近的多药物基准,并强调了一些开放性研究问题。最后,我们还反思了MDRL中的计算需求和可重复性等实际挑战。

Our conclusion from this work is that while the number of works in DRL and MDRL is notable and represents important milestones for AI, we also acknowledge that there are open questions in both (deep) single-agent learning [81, 151, 211, 328] and multiagent learning [116, 172, 195, 242, 245, 360]. Our view is that there are practical issues within MDRL that hinder its scientific progress: the necessity of high compute power, complicated reproducibility (e.g., hyperparameter tuning), and the lack of sufficient encouragement for publishing negative results. However, we remain highly optimistic about the multiagent community and hope this work serves to raise those issues, find good solutions, and ultimately take advantage of the existing literature and resources available to move the area in the right direction.
我们对这项工作的结论是,虽然 DRL 和 MDRL 中的工作数量值得注意,并且代表了 AI 的重要里程碑,但与此同时,我们承认(深度)单智能体学习 [81, 151, 211, 328] 和多智能体学习 [ 116, 172, 195, 242, 245, 360] 也存在开放性问题。我们的观点是,MDRL内部存在阻碍其科学进步的实际问题:高计算能力的必要性,复杂的可重复性(例如,超参数调整),以及缺乏对发表负面结果的足够鼓励。然而,我们仍然对多智能体社区持高度乐观态度,并希望这项工作有助于提出这些问题,找到好的解决方案,并最终利用现有的文献和资源将该地区推向正确的方向。

Notes 笔记
We have noted inconsistency in abbreviations such as: D-MARL, MADRL, deep-multiagent RL and MA-DRL.
我们注意到缩写不一致,例如:D-MARL、MADRL、deep-multiagent RL 和 MA-DRL。

A Partially Observable Markov Decision Process (POMDP) [14, 63] explicitly models environments where the agent no longer sees the true system state and instead receives an observation (generated from the underlying system state).
部分可观察马尔可夫决策过程 (POMDP) [ 14, 63] 显式地模拟了代理不再看到真实系统状态的环境,而是接收观察(从底层系统状态生成)。

Action-dependent baselines have been proposed [117, 202]; however, a recent study by Tucker et al. [331] found that in many works the reason for good performance was bugs or errors in the code, rather than the proposed method itself.
已经提出了与动作相关的基线[117,202],然而,Tucker等人[331]最近的一项研究发现,在许多作品中,良好性能的原因是代码中的错误或错误,而不是所提出的方法本身。

Before DQN, many approaches used neural networks for representing the Q-value function [74], such as Neural Fitted Q-learning [268] and NEAT+Q [351].
在DQN之前,许多方法使用神经网络来表示Q值函数[74],如神经拟合Q学习[268]和NEAT+Q[351]。

Double Q-learning [130] originally proposed keeping two Q functions (estimators) to reduce the overestimation bias in RL, while still keeping the convergence guarantees, later it was extended to DRL in Double DQN [336] (see Sect. 4.1).
双Q学习[ 130]最初提出保留两个Q函数(估计器)以减少RL中的高估偏差,同时仍然保持收敛保证,后来在Double DQN [ 336]中扩展到DRL(参见第4.1节)。

In this setting each agent independently executes a policy, however, there are other cases where this does not hold, for example when agents have a coordinated exploration strategy.
在此设置中,每个代理独立执行策略,但是,在其他情况下,这不成立,例如,当代理具有协调的探索策略时。

Counterfactual regret minimization is a technique for solving large games based on regret minimization [230, 368] due to a well-known connection between regret and Nash equilibria [39]. It has been one of the reasons of successes in Poker [50, 224].
反事实后悔最小化是一种基于后悔最小化[230,368]解决大型博弈的技术,这是由于后悔和纳什均衡[39]之间众所周知的联系。这是扑克成功的原因之一 [ 50, 224]。

This algorithm is similar to CFR-BR [159] and has the main advantage that the current policy converges, rather than the average policy, so there is no need to learn the average strategy, which requires large reservoir buffers or many past networks.
该算法与CFR-BR [ 159]类似,其主要优点是当前策略收敛而不是平均策略,因此无需学习平均策略,这需要较大的储层缓冲区或许多过去的网络。

TFT originated in an iterated prisoner’s dilemma tournament and later inspired different strategies in MAL [258], its generalization, Godfather, is a representative of leader strategies [201].
TFT起源于一个迭代的囚徒困境竞赛,后来在MAL中启发了不同的策略[258],它的概括,教父,是领导者策略的代表[201]。

The average strategy profile of fictitious players converges to a Nash equilibrium in certain classes of games, e.g., two-player zero-sum and potential games [222].
在某些类别的博弈中,虚构玩家的平均策略特征收敛于纳什均衡,例如,双人零和博弈和潜在博弈[222]。

The vocabulary that agents use was arbitrary and had no initial meaning. To understand its emerging semantics they looked at the relationship between symbols and the sets of images they referred to [183].
代理使用的词汇是任意的,没有初始含义。为了理解其新兴的语义,他们研究了符号和他们所指的图像集之间的关系[183]。

There is a large body of research on coordinating multiagent teams by specifying communication protocols [115, 321]: these expect agents to know the team’s goal as well as the tasks required to accomplish the goal.
有大量关于通过指定通信协议来协调多智能体团队的研究[115,321]:这些期望智能体知道团队的目标以及实现目标所需的任务。

Elo uses a normal distribution for each player's skill, and after each match both players' distributions are updated based on a measure of surprise, i.e., if a player with previously lower (predicted) skill beats a highly skilled one, the lower-skilled player's rating is increased significantly.
Elo对每个玩家的技能使用正态分布,每场比赛后,两个玩家的分布都会根据惊喜的衡量标准进行更新,即,如果先前技能较低(预测)的用户击败了高技能的用户,则低技能玩家会显着增加。

Nash equilibrium [229] is a solution concept in game theory in which no agent would choose to deviate from its strategy (they are a best response to others’ strategies). This concept has been explored in seminal MAL algorithms like Nash-Q learning [149] and Minimax-Q learning [198, 199].
纳什均衡[229]是博弈论中的一个解决方案概念,在这个概念中,没有智能体会选择偏离其策略(它们是对他人策略的最佳响应)。这个概念已经在Nash-Q学习[149]和Minimax-Q学习[198,199]等开创性的MAL算法中得到探索。

Johanson et al. [160] also found “overfitting” when solving large extensive games (e.g., poker)—the performance in an abstract game improved but it was worse in the full game.
Johanson等[160]也发现,在解决大型广泛游戏(如扑克)时,会出现“过度拟合”——抽象游戏的性能有所提高,但在完整游戏中表现更差。

Bayesian policy reuse assumes an agent with prior experience in the form of a library of policies. When a novel task instance occurs, the objective is to reuse a policy from its library based on observed signals which correlate to policy performance [272].
贝叶斯策略重用假定代理具有策略库形式的先前经验。当一个新的任务实例出现时,目标是根据观察到的与策略性能相关的信号,重用其库中的策略[272]。

Centralized planning and decentralized execution is also a standard paradigm for multiagent planning [239].
集中式规划和分散式执行也是多智能体规划的标准范式[239]。

https://github.com/gjp1203/nui_in_madrl.

https://github.com/gjp1203/nui_in_madrl.

https://www.pommerman.com/.

https://github.com/oxwhirl/smac.

https://github.com/oxwhirl/pymarl.

https://github.com/crowdAI/marlo-single-agent-starter-kit/.

https://github.com/deepmind/hanabi-learning-environment.

https://github.com/YuhangSong/Arena-BuildingToolkit.

https://github.com/deepmind/dm_control/tree/master/dm_control/locomotion/soccer.

https://github.com/openai/neural-mmo.

This idea was initially inspired by the Workshop “Critiquing and Correcting Trends in Machine Learning” at NeurIPS 2018 where it was possible to submit Negative results papers: “Papers which show failure modes of existing algorithms or suggest new approaches which one might expect to perform well but which do not. The aim is to provide a venue for work which might otherwise go unpublished but which is still of interest to the community.” https://ml-critique-correct.github.io/.
这个想法最初受到 NeurIPS 2018 研讨会“批评和纠正机器学习趋势”的启发,在那里可以提交负面结果论文:“展示现有算法的失败模式或提出人们可能期望表现良好但表现不佳的新方法的论文。其目的是为那些可能未被发表但仍然引起社区兴趣的作品提供一个场所 https://ml-critique-correct.github.io/。

It is sometimes unclear in the literature what the meaning of “frame” is, due to the “frame skip” technique. It is therefore suggested to refer to “game frames” and “training frames” [310].
由于“跳帧”技术,文献中有时不清楚帧的含义是什么。因此,建议将“游戏帧”和“训练帧”称为“游戏帧”[310]。

One recent effort by Beeching et al. [29] proposes to use only “mid-range hardware” (8 CPUs and 1 GPU) to train deep RL agents.
Beeching等人[29]最近的一项研究建议仅使用“中端硬件”(8个CPU和1个GPU)来训练深度RL代理。

NeurIPS 2019 hosts the “MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors” where the primary goal of the competition is to foster the development of algorithms which can efficiently leverage human demonstrations to drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments [125].
NeurIPS 2019 举办了“MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors”,该竞赛的主要目标是促进算法的开发,这些算法可以有效地利用人类演示来大幅减少解决复杂、分层和稀疏环境所需的样本数量 [ 125]。

Cuccu, Togelius and Cudré-Mauroux achieved state-of-the-art policy learning in Atari games with only 6 to 18 neurons [75]. The main idea was to decouple image processing from decision-making.
Cuccu、Togelius 和 Cudré-Mauroux 在雅达利游戏中仅用 6 到 18 个神经元就实现了最先进的策略学习 [ 75]。主要思想是将图像处理与决策分离。

References 引用
Achiam, J., Knight, E., & Abbeel, P. (2019). Towards characterizing divergence in deep Q-learning. CoRR arXiv:1903.08894.
Achiam,J.,Knight,E.和Abbeel,P.(2019)。为了表征深度 Q 学习中的差异。CoRR arXiv:1903.08894。

Agogino, A. K., & Tumer, K. (2004). Unifying temporal and structural credit assignment problems. In Proceedings of 17th international conference on autonomous agents and multiagent systems.
Agogino,AK和Tumer,K.(2004)。统一时间和结构学分分配问题。在第 17 届自主代理和多代理系统国际会议论文集。

Agogino, A. K., & Tumer, K. (2008). Analyzing and visualizing multiagent rewards in dynamic and stochastic domains. Autonomous Agents and Multi-Agent Systems, 17(2), 320–338.
Agogino,AK和Tumer,K.(2008)。分析和可视化动态和随机域中的多智能体奖励。自主代理和多代理系统,17(2),320-338。

Ahamed, T. I., Borkar, V. S., & Juneja, S. (2006). Adaptive importance sampling technique for markov chains using stochastic approximation. Operations Research, 54(3), 489–504.
Ahamed, T. I., Borkar, VS, & Juneja, S. (2006)。使用随机逼近的马尔可夫链自适应重要性抽样技术.运筹学, 54(3), 489–504.

Albrecht, S. V., & Ramamoorthy, S. (2013). A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In Proceedings of the 12th international conference on autonomous agents and multi-agent systems. Saint Paul, MN, USA.
Albrecht,SV和Ramamoorthy,S.(2013)。一种用于多智能体系统中临时协调的博弈论模型和最佳响应学习方法。在第 12 届自主代理和多智能体系统国际会议论文集。美国明尼苏达州圣保罗。

Albrecht, S. V., & Stone, P. (2018). Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258, 66–95.
Albrecht, SV, & Stone, P. (2018)。对其他智能体进行建模的自主智能体:全面的调查和开放性问题。人工智能,258,66-95。

Alonso, E., D’inverno, M., Kudenko, D., Luck, M., & Noble, J. (2002). Learning in multi-agent systems. Knowledge Engineering Review, 16(03), 1–8.
Alonso,E.,D’inverno,M.,Kudenko,D.,Luck,M.和Noble,J.(2002)。在多智能体系统中学习。知识工程评论, 16(03), 1–8.

Amato, C., & Oliehoek, F. A. (2015). Scalable planning and learning for multiagent POMDPs. In AAAI (pp. 1995–2002).
Amato,C.和Oliehoek,FA(2015)。针对多智能体 POMDP 的可扩展规划和学习。在AAAI中(第1995-2002页)。

Amodei, D., & Hernandez, D. (2018). AI and compute. https://blog.openai.com/ai-and-compute.
Amodei,D.和Hernandez,D.(2018)。AI 和计算。https://blog.openai.com/ai-and-compute。

Andre, D., Friedman, N., & Parr, R. (1998). Generalized prioritized sweeping. In Advances in neural information processing systems (pp. 1001–1007).
安德烈,D.,弗里德曼,N.和帕尔,R.(1998)。广义优先扫描。在神经信息处理系统进展中(第 1001-1007 页)。

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., & Zaremba, W. (2017). Hindsight experience replay. In Advances in neural information processing systems.
Andrychowicz,M.,Wolski,F.,Ray,A.,Schneider,J.,Fong,R.,Welinder,P.,McGrew,B.,Tobin,J.,Abbeel,P.和Zaremba,W.(2017)。事后诸葛亮的经验回放。在神经信息处理系统的进展中。

Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., & Hochreiter, S. (2018). RUDDER: Return decomposition for delayed rewards. arXiv:1806.07857.
Arjona-Medina,JA,Gillhofer,M.,Widrich,M.,Unterthiner,T.和Hochreiter,S.(2018)。RUDDER:延迟奖励的返回分解。arXiv:1806.07857。

Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). A brief survey of deep reinforcement learning. arXiv:1708.05866v2.
Arulkumaran,K.,Deisenroth,MP,Brundage,M.和Bharath,AA(2017)。深度强化学习的简要调查。arXiv:1708.05866v2。

Astrom, K. J. (1965). Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1), 174–205.
阿斯特罗姆,KJ(1965 年)。对状态信息不完整的马尔可夫过程进行优化控制。数学分析与应用, 10(1), 174–205.

Axelrod, R., & Hamilton, W. D. (1981). The evolution of cooperation. Science, 211(27), 1390–1396.
阿克塞尔罗德,R.和汉密尔顿,WD(1981)。合作的演变。科学, 211(27), 1390–1396.

Azizzadenesheli, K. (2019). Maybe a few considerations in reinforcement learning research? In Reinforcement learning for real life workshop.
Azizzadenesheli,K.(2019 年)。也许在强化学习研究中有一些考虑因素?在现实生活中的强化学习研讨会中。

Azizzadenesheli, K., Yang, B., Liu, W., Brunskill, E., Lipton, Z., & Anandkumar, A. (2018). Surprising negative results for generative adversarial tree search. In Critiquing and correcting trends in machine learning workshop.
Azizzadenesheli,K.,Yang,B.,Liu,W.,Brunskill,E.,Lipton,Z.和Anandkumar,A.(2018)。生成对抗树搜索的令人惊讶的负面结果。在“批评和纠正机器学习趋势”研讨会中。

Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., & Kautz, J. (2017). Reinforcement learning through asynchronous advantage actor-critic on a GPU. In International conference on learning representations.
Babaeizadeh,M.,Frosio,I.,Tyree,S.,Clemons,J.和Kautz,J.(2017)。通过 GPU 上的异步优势 actor-critic 进行强化学习。在学习表征国际会议上。

Bacchiani, G., Molinari, D., & Patander, M. (2019). Microscopic traffic simulation by cooperative multi-agent deep reinforcement learning. In AAMAS.
Bacchiani,G.,Molinari,D.和Patander,M.(2019)。基于多智能体深度强化学习的微观交通仿真。在AAMAS中。

Back, T. (1996). Evolutionary algorithms in theory and practice: Evolution strategies, evolutionary programming, genetic algorithms. Oxford: Oxford University Press.
Back,T.(1996 年)。进化算法的理论与实践:进化策略、进化编程、遗传算法。牛津:牛津大学出版社。

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Machine Learning Proceedings, 1995, 30–37.
贝尔德,L.(1995 年)。残差算法:具有函数逼近的强化学习。机器学习论文集, 1995, 30–37.

Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., & Graepel, T. (2018). The mechanics of n-player differentiable games. In Proceedings of the 35th international conference on machine learning, proceedings of machine learning research (pp. 354–363). Stockholm, Sweden.
Balduzzi,D.,Racaniere,S.,Martens,J.,Foerster,J.,Tuyls,K.和Graepel,T.(2018)。n-player 可微博弈的机制。在第 35 届机器学习国际会议论文集中,机器学习研究论文集(第 354-363 页)。瑞典斯德哥尔摩。

Banerjee, B., & Peng, J. (2003). Adaptive policy gradient in multiagent learning. In Proceedings of the second international joint conference on Autonomous agents and multiagent systems (pp. 686–692). ACM.
Banerjee,B.和Peng,J.(2003)。多智能体学习中的自适应策略梯度。在第二届自治代理和多代理系统国际联合会议论文集(第 686-692 页)中。ACM。

Bansal, T., Pachocki, J., Sidor, S., Sutskever, I., & Mordatch, I. (2018). Emergent complexity via multi-agent competition. In International conference on machine learning.
Bansal,T.,Pachocki,J.,Sidor,S.,Sutskever,I.和Mordatch,I.(2018)。通过多智能体竞争实现的突发复杂性。在机器学习国际会议上。

Bard, N., Foerster, J. N., Chandar, S., Burch, N., Lanctot, M., & Song, H. F., et al. (2019). The Hanabi challenge: A new frontier for AI research. arXiv:1902.00506.
Bard, N., Foerster, JN, Chandar, S., Burch, N., Lanctot, M., & Song, HF, et al. (2019)。Hanabi 挑战:AI 研究的新前沿。arXiv:1902.00506。

Barrett, S., Stone, P., Kraus, S., & Rosenfeld, A. (2013). Teamwork with Limited Knowledge of Teammates. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pp. 102–108. Bellevue, WS, USA.
Barrett,S.,Stone,P.,Kraus,S.和Rosenfeld,A.(2013)。团队合作,对队友的了解有限。第二十七届AAAI人工智能会议论文集,第102-108页。美国华盛顿州贝尔维尤。

Barto, A. G. (2013). Intrinsic motivation and reinforcement learning. In M. Mirolli & G. Baldassarre (Eds.), Intrinsically motivated learning in natural and artificial systems (pp. 17–47). Berlin: Springer.
巴托,AG(2013 年)。内在动机和强化学习。在 M. Mirolli 和 G. Baldassarre(编辑)中,自然和人工系统中的内在动机学习(第 17-47 页)。柏林:施普林格。

Becker, R., Zilberstein, S., Lesser, V., & Goldman, C. V. (2004). Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22, 423–455.
Becker,R.,Zilberstein,S.,Lesser,V.和Goldman,CV(2004)。求解过渡独立分散马尔可夫决策过程。人工智能研究杂志, 22, 423–455.

Beeching, E., Wolf, C., Dibangoye, J., & Simonin, O. (2019). Deep reinforcement learning on a budget: 3D Control and reasoning without a supercomputer. CoRR arXiv:1904.01806.
Beeching,E.,Wolf,C.,Dibangoye,J.和Simonin,O.(2019)。预算内的深度强化学习:无需超级计算机的 3D 控制和推理。CoRR arXiv:1904.01806。

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., & Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in neural information processing systems (pp. 1471–1479).
Bellemare,M.,Srinivasan,S.,Ostrovski,G.,Schaul,T.,Saxton,D.和Munos,R.(2016)。统一基于计数的探索和内在动机。在神经信息处理系统的进展中(第 1471-1479 页)。

Bellemare, M. G., Dabney, W., Dadashi, R., Taïga, A. A., Castro, P. S., & Roux, N. L., et al. (2019). A geometric perspective on optimal representations for reinforcement learning. CoRR arXiv:1901.11530.
Bellemare, MG, Dabney, W., Dadashi, R., Taïga, AA, Castro, PS, & Roux, N. L., et al. (2019)。强化学习最佳表示的几何视角。CoRR arXiv:1901.11530。

Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
Bellemare,MG,Naddaf,Y.,Veness,J.和Bowling,M.(2013)。街机学习环境:面向总代理的评估平台。人工智能研究杂志, 47, 253–279.

Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 679–684.
贝尔曼,R.(1957 年)。马尔可夫决策过程。数学与力学杂志, 6(5), 679–684.

Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819–840.
Bernstein,DS,Givan,R.,Immerman,N.和Zilberstein,S.(2002)。马尔可夫决策过程分散控制的复杂性。运筹学数学, 27(4), 819–840.

Best, G., Cliff, O. M., Patten, T., Mettu, R. R., & Fitch, R. (2019). Dec-MCTS: Decentralized planning for multi-robot active perception. The International Journal of Robotics Research, 38(2–3), 316–337.
Best, G., Cliff, OM, Patten, T., Mettu, R. R., & Fitch, R. (2019)。Dec-MCTS:多机器人主动感知的分散式规划。国际机器人研究杂志,38(2-3),316-337。

Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin: Springer.
毕晓普,CM(2006 年)。模式识别和机器学习。柏林:施普林格。

Bloembergen, D., Kaisers, M., & Tuyls, K. (2010). Lenient frequency adjusted Q-learning. In Proceedings of the 22nd Belgian/Netherlands artificial intelligence conference.
Bloembergen,D.,Kaisers,M.和Tuyls,K.(2010)。宽松的频率调整 Q 学习。在第 22 届比利时/荷兰人工智能会议论文集。

Bloembergen, D., Tuyls, K., Hennes, D., & Kaisers, M. (2015). Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53, 659–697.
Bloembergen,D.,Tuyls,K.,Hennes,D.和Kaisers,M.(2015)。多智能体学习的进化动力学:一项调查。人工智能研究杂志, 53, 659–697.

Blum, A., & Monsour, Y. (2007). Learning, regret minimization, and equilibria. Chap. 4. In N. Nisan (Ed.), Algorithmic game theory. Cambridge: Cambridge University Press.
Blum,A.和Monsour,Y.(2007)。学习、后悔最小化和平衡。章节 4.在 N. Nisan (Ed.) 中,算法博弈论。剑桥:剑桥大学出版社。

Bono, G., Dibangoye, J. S., Matignon, L., Pereyron, F., & Simonin, O. (2018). Cooperative multi-agent policy gradient. In European conference on machine learning.
Bono,G.,Dibangoye,JS,Matignon,L.,Pereyron,F.和Simonin,O.(2018)。协作式多智能体策略梯度。在欧洲机器学习会议上。

Bowling, M. (2000). Convergence problems of general-sum multiagent reinforcement learning. In International conference on machine learning (pp. 89–94).
保龄球,M.(2000 年)。广义和多智能体强化学习的收敛问题.在机器学习国际会议上(第 89-94 页)。

Bowling, M. (2004). Convergence and no-regret in multiagent learning. Advances in neural information processing systems (pp. 209–216). Canada: Vancouver.
保龄球,M.(2004 年)。多智能体学习的收敛和无悔。神经信息处理系统的进展(第 209-216 页)。加拿大:温哥华。

Bowling, M., Burch, N., Johanson, M., & Tammelin, O. (2015). Heads-up limit hold’em poker is solved. Science, 347(6218), 145–149.
Bowling,M.,Burch,N.,Johanson,M.和Tammelin,O.(2015)。单挑限注德州扑克问题得到解决。科学,347(6218),145-149。

Bowling, M., & McCracken, P. (2005). Coordination and adaptation in impromptu teams. Proceedings of the nineteenth conference on artificial intelligence (Vol. 5, pp. 53–58).
Bowling,M.和McCracken,P.(2005)。在即兴团队中进行协调和适应。第十九届人工智能会议论文集(第 5 卷,第 53-58 页)。

Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2), 215–250.
保龄球,M.和Veloso,M.(2002)。使用可变学习率的多智能体学习。人工智能, 136(2), 215–250.

Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in neural information processing systems, pp. 369–376.
Boyan,JA和Moore,AW(1995)。强化学习中的泛化:安全地逼近值函数。在《神经信息处理系统进展》中,第 369-376 页。

Brafman, R. I., & Tennenholtz, M. (2002). R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct), 213–231.
Brafman,RI和Tennenholtz,M.(2002)。R-max-一种用于近优强化学习的通用多项式时间算法。机器学习研究杂志, 3(Oct), 213–231.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI gym. arXiv preprint arXiv:1606.01540.
Brockman,G.,Cheung,V.,Pettersson,L.,Schneider,J.,Schulman,J.,Tang,J.和Zaremba,W.(2016)。OpenAI 健身房。arXiv 预印本 arXiv:1606.01540。

Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1), 374–376.
布朗,GW(1951 年)。通过虚构游戏对游戏进行迭代解决方案。生产和分配活动分析,13(1),374-376。

Brown, N., & Sandholm, T. (2018). Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374), 418–424.
Brown,N.和Sandholm,T.(2018)。用于单挑无限制扑克的超人 AI:Libratus 击败了顶级专业人士。科学,359(6374),418-424。

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., et al. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1–43.
Browne,CB,Powley,E.,Whitehouse,D.,Lucas,SM,Cowling,PI,Rohlfshagen,P.等人(2012)。蒙特卡洛树搜索方法调查。IEEE游戏计算智能和人工智能汇刊,4(1),1-43。

Bucilua, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 535–541). ACM.
Bucilua,C.,Caruana,R.和Niculescu-Mizil,A.(2006)。模型压缩。在第 12 届 ACM SIGKDD 知识发现和数据挖掘国际会议论文集(第 535-541 页)中。ACM。

Bull, L. (1998). Evolutionary computing in multi-agent environments: Operators. In International conference on evolutionary programming (pp. 43–52). Springer.
Bull,L.(1998 年)。多智能体环境中的演化计算:算子。在进化编程国际会议上(第 43-52 页)。斯普林格。

Bull, L., Fogarty, T. C., & Snaith, M. (1995). Evolution in multi-agent systems: Evolving communicating classifier systems for gait in a quadrupedal robot. In Proceedings of the 6th international conference on genetic algorithms (pp. 382–388). Morgan Kaufmann Publishers Inc.
Bull,L.,Fogarty,TC和Snaith,M.(1995)。多智能体系统的演变:四足机器人步态的通信分类器系统的演变。第六届遗传算法国际会议论文集(第 382-388 页)。摩根·考夫曼出版公司

Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 38(2), 156–172.
Busoniu,L.,Babuska,R.和De Schutter,B.(2008)。多智能体强化学习的综合调查。IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 38(2), 156–172.

Busoniu, L., Babuska, R., & De Schutter, B. (2010). Multi-agent reinforcement learning: An overview. In D. Srinivasan & L. C. Jain (Eds.), Innovations in multi-agent systems and applications - 1 (pp. 183–221). Berlin: Springer.
Busoniu,L.,Babuska,R.和De Schutter,B.(2010)。多智能体强化学习:概述。在 D. Srinivasan 和 L. C. Jain(编辑)中,多智能体系统和应用的创新 - 1(第 183-221 页)。柏林:施普林格。

Capture the Flag: The emergence of complex cooperative agents. (2018). [Online]. Retrieved September 7, 2018, https://deepmind.com/blog/capture-the-flag/ .
夺旗:复杂合作代理的出现。(2018). [在线].2018年9月7日检索,https://deepmind.com/blog/capture-the-flag/。

Collaboration & Credit Principles, How can we be good stewards of collaborative trust? (2019). [Online]. Retrieved May 31, 2019, http://colah.github.io/posts/2019-05-Collaboration/index.html.
协作与信用原则,我们如何才能成为协作信任的好管家?(2019). [在线].2019年5月31日检索,http://colah.github.io/posts/2019-05-Collaboration/index.html。

Camerer, C. F., Ho, T. H., & Chong, J. K. (2004). A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3), 861.
Camerer,CF,Ho,TH和Chong,JK(2004)。游戏的认知层次模型。经济学季刊, 119(3), 861.

Camerer, C. F., Ho, T. H., & Chong, J. K. (2004). Behavioural game theory: Thinking, learning and teaching. In Advances in understanding strategic behavior (pp. 120–180). New York.
Camerer,CF,Ho,TH和Chong,JK(2004)。行为博弈论:思考、学习和教学。在理解战略行为的进展中(第 120-180 页)。纽约。

Carmel, D., & Markovitch, S. (1996). Incorporating opponent models into adversary search. AAAI/IAAI, 1, 120–125.
Carmel,D.和Markovitch,S.(1996)。将对手模型纳入对手搜索中。AAAI/IAAI,1,120-125。

Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
Caruana,R.(1997 年)。多任务学习。机器学习, 28(1), 41–75.

Cassandra, A. R. (1998). Exact and approximate algorithms for partially observable Markov decision processes. Ph.D. thesis, Computer Science Department, Brown University.
卡桑德拉,AR(1998 年)。部分可观察马尔可夫决策过程的精确和近似算法。布朗大学计算机科学系博士论文。

Castellini, J., Oliehoek, F. A., Savani, R., & Whiteson, S. (2019). The representational capacity of action-value networks for multi-agent reinforcement learning. In 18th International conference on autonomous agents and multiagent systems.
Castellini, J., Oliehoek, FA, Savani, R., & Whiteson, S. (2019)。多智能体强化学习的动作-价值网络的表征能力。在第 18 届自主代理和多智能体系统国际会议上。

Castro, P. S., Moitra, S., Gelada, C., Kumar, S., Bellemare, M. G. (2018). Dopamine: A research framework for deep reinforcement learning. arXiv:1812.06110.
卡斯特罗,PS,莫伊特拉,S.,Gelada,C.,Kumar,S.,Bellemare,MG(2018)。多巴胺:深度强化学习的研究框架。arXiv:1812.06110。

Chakraborty, D., & Stone, P. (2013). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-Agent Systems, 28(2), 182–213.
Chakraborty,D.和Stone,P.(2013)。存在内存限制代理的多智能体学习。自主代理和多智能体系统,28(2),182-213。

Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. In Deep learning and representation learning workshop.
Chung,J.,Gulcehre,C.,Cho,K.和Bengio,Y.(2014)。门控递归神经网络在序列建模中的实证评估。在深度学习和表征学习研讨会中。

Ciosek, K. A., & Whiteson, S. (2017). Offer: Off-environment reinforcement learning. In Thirty-first AAAI conference on artificial intelligence.
Ciosek,KA和Whiteson,S.(2017)。提供:非环境强化学习。在第31届AAAI人工智能会议上。

Clary, K., Tosch, E., Foley, J., & Jensen, D. (2018). Let’s play again: Variability of deep reinforcement learning agents in Atari environments. In NeurIPS critiquing and correcting trends workshop.
Clary,K.,Tosch,E.,Foley,J.和Jensen,D.(2018)。让我们再玩一遍:Atari环境中深度强化学习代理的可变性。在 NeurIPS 批评和纠正趋势研讨会上。

Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th national conference on artificial intelligence (pp. 746–752). Madison, Wisconsin, USA.
克劳斯,C.和Boutilier,C.(1998)。协作多智能体系统中强化学习的动力学。在第十五届全国人工智能学术会议论文集(第 746-752 页)中。美国威斯康星州麦迪逊。

Conitzer, V., & Sandholm, T. (2006). AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1–2), 23–43.
Conitzer,V.和Sandholm,T.(2006)。太棒了:一种通用的多智能体学习算法,在自我游戏中收敛,并学习对静止对手的最佳反应。机器学习, 67(1–2), 23–43.

Costa Gomes, M., Crawford, V. P., & Broseta, B. (2001). Cognition and behavior in normal-form games: An experimental study. Econometrica, 69(5), 1193–1235.
Costa Gomes,M.,Crawford,VP和Broseta,B.(2001)。正常形式游戏中的认知和行为:一项实验研究。计量经济学, 69(5), 1193–1235.

Crandall, J. W., & Goodrich, M. A. (2011). Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning, 82(3), 281–314.
克兰德尔,JW和古德里奇,MA(2011)。使用强化学习学习在重复游戏中学习竞争、协调和合作。机器学习, 82(3), 281–314.

Crites, R. H., & Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2–3), 235–262.
Crites, R. H., & Barto, AG (1998)。使用多个强化学习代理的电梯群控制。机器学习, 33(2–3), 235–262.

Cuccu, G., Togelius, J., & Cudré-Mauroux, P. (2019). Playing Atari with six neurons. In Proceedings of the 18th international conference on autonomous agents and multiagent systems (pp. 998–1006). International Foundation for Autonomous Agents and Multiagent Systems.
Cuccu,G.,Togelius,J.和Cudré-Mauroux,P.(2019)。用六个神经元玩雅达利。在第 18 届自主代理和多代理系统国际会议论文集(第 998-1006 页)中。国际自主代理和多代理系统基金会。

de Weerd, H., Verbrugge, R., & Verheij, B. (2013). How much does it help to know what she knows you know? An agent-based simulation study. Artificial Intelligence, 199–200©, 67–92.
de Weerd,H.,Verbrugge,R.和Verheij,B.(2013)。知道她所知道的有多大帮助?基于智能体的仿真研究。人工智能, 199–200(C), 67–92.

de Cote, E. M., Lazaric, A., & Restelli, M. (2006). Learning to cooperate in multi-agent social dilemmas. In Proceedings of the 5th international conference on autonomous agents and multiagent systems (pp. 783–785). Hakodate, Hokkaido, Japan.
de Cote,EM,Lazaric,A.和Restelli,M.(2006)。学会在多主体社会困境中合作。在第五届自主代理和多智能体系统国际会议论文集(第 783-785 页)中。函馆,北海道,日本。

Deep reinforcement learning: Pong from pixels. (2016). [Online]. Retrieved May 7, 2019, https://karpathy.github.io/2016/05/31/rl/.
深度强化学习:来自像素的乒乓球。(2016). [在线].2019年5月7日检索,https://karpathy.github.io/2016/05/31/rl/。

Do I really have to cite an arXiv paper? (2017). [Online]. Retrieved May 21, 2019, http://approximatelycorrect.com/2017/08/01/do-i-have-to-cite-arxiv-paper/.
我真的必须引用arXiv论文吗?(2017). [在线].2019年5月21日检索,http://approximatelycorrect.com/2017/08/01/do-i-have-to-cite-arxiv-paper/。

Damer, S., & Gini, M. (2017). Safely using predictions in general-sum normal form games. In Proceedings of the 16th conference on autonomous agents and multiagent systems. Sao Paulo.
Damer,S.和Gini,M.(2017)。在一般和正态形式游戏中安全地使用预测。在第 16 届自主代理和多智能体系统会议论文集中。圣保罗。

Darwiche, A. (2018). Human-level intelligence or animal-like abilities? Communications of the ACM, 61(10), 56–67.
达里奇,A.(2018 年)。人类水平的智力还是类似动物的能力?ACM通讯,61(10),56-67。

Dayan, P., & Hinton, G. E. (1993). Feudal reinforcement learning. In Advances in neural information processing systems (pp. 271–278).
Dayan,P.和Hinton,GE(1993)。封建强化学习。在神经信息处理系统进展中(第 271-278 页)。

De Bruin, T., Kober, J., Tuyls, K., & Babuška, R. (2018). Experience selection in deep reinforcement learning for control. The Journal of Machine Learning Research, 19(1), 347–402.
De Bruin,T.,Kober,J.,Tuyls,K.和Babuška,R.(2018)。在深度强化学习中体验控制选择。机器学习研究杂志,19(1),347–402。

De Hauwere, Y. M., Vrancx, P., & Nowe, A. (2010). Learning multi-agent state space representations. In Proceedings of the 9th international conference on autonomous agents and multiagent systems (pp. 715–722). Toronto, Canada.
De Hauwere,Y.M.,Vrancx,P.和Nowe,A.(2010)。学习多智能体状态空间表示。在第九届自主代理和多智能体系统国际会议论文集(第 715-722 页)中。加拿大多伦多。

De Jong, K. A. (2006). Evolutionary computation: A unified approach. Cambridge: MIT press.
德容,KA(2006 年)。进化计算:一种统一的方法。剑桥:麻省理工学院出版社。

Devlin, S., Yliniemi, L. M., Kudenko, D., & Tumer, K. (2014). Potential-based difference rewards for multiagent reinforcement learning. In 13th International conference on autonomous agents and multiagent systems, AAMAS 2014. Paris, France.
Devlin,S.,Yliniemi,LM,Kudenko,D.和Tumer,K.(2014)。基于潜力的多智能体强化学习差异奖励。在第 13 届自主代理和多代理系统国际会议上,AAMAS 2014。法国巴黎。

Dietterich, T. G. (2000). Ensemble methods in machine learning. In MCS proceedings of the first international workshop on multiple classifier systems (pp. 1–15). Springer, Berlin Heidelberg, Cagliari, Italy.
迪特里希,TG(2000 年)。机器学习中的集成方法。在MCS中,第一届多分类器系统国际研讨会论文集(第1-15页)。施普林格,柏林海德堡,卡利亚里,意大利。

Du, Y., Czarnecki, W. M., Jayakumar, S. M., Pascanu, R., & Lakshminarayanan, B. (2018). Adapting auxiliary losses using gradient similarity. arXiv preprint arXiv:1812.02224.
Du,Y.,Czarnecki,WM,Jayakumar,SM,Pascanu,R.和Lakshminarayanan,B.(2018)。使用梯度相似性调整辅助损失。arXiv 预印本 arXiv:1812.02224。

Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., & Clune, J. (2019). Go-explore: A new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
Ecoffet,A.,Huizinga,J.,Lehman,J.,Stanley,KO和Clune,J.(2019)。Go-explore:一种解决困难探索问题的新方法。arXiv 预印本 arXiv:1901.10995。

Elo, A. E. (1978). The rating of chessplayers, past and present. Nagoya: Arco Pub.
Elo,AE(1978 年)。过去和现在的国际象棋选手的评级。名古屋:Arco Pub。

Erdös, P., & Selfridge, J. L. (1973). On a combinatorial game. Journal of Combinatorial Theory, Series A, 14(3), 298–301.
Erdös, P., & Selfridge, J. L. (1973)。在组合游戏中。组合理论杂志,A辑,14(3),298-301。

Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr), 503–556.
Ernst,D.,Geurts,P.和Wehenkel,L.(2005)。基于树的批处理模式强化学习。机器学习研究杂志, 6(Apr), 503–556.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., & Dunning, I., et al. (2018). IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International conference on machine learning.
Espeholt,L.,Soyer,H.,Munos,R.,Simonyan,K.,Mnih,V.,Ward,T.,Doron,Y.,Firoiu,V.,Harley,T.和Dunning,I.等人(2018)。IMPALA:可扩展的分布式深度强化学习,具有重要性加权的 Actor-Learner 架构。在机器学习国际会议上。

Even-Dar, E., & Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec), 1–25.
Even-Dar,E.和Mansour,Y.(2003)。Q-learning的学习率。机器学习研究杂志, 5(Dec), 1–25.

Firoiu, V., Whitney, W. F., & Tenenbaum, J. B. (2017). Beating the World’s best at super smash Bros. with deep reinforcement learning. CoRR arXiv:1702.06230.
Firoiu,V.,Whitney,WF和Tenenbaum,JB(2017)。通过深度强化学习击败世界上最好的超级粉碎兄弟。CoRR arXiv:1702.06230。

Foerster, J. N., Assael, Y. M., De Freitas, N., & Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Advances in neural information processing systems (pp. 2145–2153).
Foerster,JN,Assael,YM,De Freitas,N.和Whiteson,S.(2016)。学习通过深度多智能体强化学习进行交流。在神经信息处理系统进展中(第 2145-2153 页)。

Foerster, J. N., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., & Mordatch, I. (2018). Learning with opponent-learning awareness. In Proceedings of 17th international conference on autonomous agents and multiagent systems. Stockholm, Sweden.
Foerster, J. N., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., & Mordatch, I. (2018)。以对手学习意识学习。在第 17 届自主代理和多代理系统国际会议论文集。瑞典斯德哥尔摩。

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., & Whiteson, S. (2017). Counterfactual multi-agent policy gradients. In 32nd AAAI conference on artificial intelligence.
Foerster,J.N.,Farquhar,G.,Afouras,T.,Nardelli,N.和Whiteson,S.(2017)。反事实的多智能体策略梯度。在第 32 届 AAAI 人工智能会议上。

Foerster, J. N., Nardelli, N., Farquhar, G., Afouras, T., Torr, P. H. S., Kohli, P., & Whiteson, S. (2017). Stabilising experience replay for deep multi-agent reinforcement learning. In International conference on machine learning.
Foerster,JN,Nardelli,N.,Farquhar,G.,Afouras,T.,Torr,PHS,Kohli,P.和Whiteson,S.(2017)。用于深度多智能体强化学习的稳定体验回放。在机器学习国际会议上。

Forde, J. Z., & Paganini, M. (2019). The scientific method in the science of machine learning. In ICLR debugging machine learning models workshop.
Forde, JZ 和 Paganini, M. (2019)。机器学习科学中的科学方法。在 ICLR 调试机器学习模型研讨会上。

François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., Pineau, J., et al. (2018). An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning, 11(3–4), 219–354.
François-Lavet,V.,Henderson,P.,Islam,R.,Bellemare,MG,Pineau,J.等人(2018)。深度强化学习简介。机器学习的基础和趋势®,11(3-4),219-354。

Frank, J., Mannor, S., & Precup, D. (2008). Reinforcement learning in the presence of rare events. In Proceedings of the 25th international conference on machine learning (pp. 336–343). ACM.
弗兰克,J.,曼诺,S.和Precup,D.(2008)。在存在罕见事件的情况下进行强化学习。在第 25 届机器学习国际会议论文集(第 336-343 页)中。ACM。

Fudenberg, D., & Tirole, J. (1991). Game theory. Cambridge: The MIT Press.
Fudenberg,D.和Tirole,J.(1991)。博弈论。剑桥:麻省理工学院出版社。

Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International conference on machine learning.
Fujimoto,S.,van Hoof,H.和Meger,D.(2018)。解决actor-critic方法中的函数逼近误差。在机器学习国际会议上。

Fulda, N., & Ventura, D. (2007). Predicting and preventing coordination problems in cooperative Q-learning systems. In Proceedings of the twentieth international joint conference on artificial intelligence (pp. 780–785). Hyderabad, India.
Fulda,N.和Ventura,D.(2007)。预测和预防合作 Q 学习系统中的协调问题。在第二十届国际人工智能联合会议论文集(第 780-785 页)中。印度海得拉巴。

Gao, C., Hernandez-Leal, P., Kartal, B., & Taylor, M. E. (2019). Skynet: A top deep RL agent in the inaugural pommerman team competition. In 4th multidisciplinary conference on reinforcement learning and decision making.
Gao,C.,Hernandez-Leal,P.,Kartal,B.和Taylor,ME(2019)。Skynet:首届 pommerman 团队竞赛中的顶级深度 RL 代理。在第 4 届关于强化学习和决策的多学科会议上。

Gao, C., Kartal, B., Hernandez-Leal, P., & Taylor, M. E. (2019). On hard exploration for reinforcement learning: A case study in pommerman. In AAAI conference on artificial intelligence and interactive digital entertainment.
Gao,C.,Kartal,B.,Hernandez-Leal,P.和Taylor,ME(2019)。关于强化学习的艰苦探索:pommerman的案例研究。在AAAI关于人工智能和互动数字娱乐的会议上。

Gencoglu, O., van Gils, M., Guldogan, E., Morikawa, C., Süzen, M., Gruber, M., Leinonen, J., & Huttunen, H. (2019). Hark side of deep learning–from grad student descent to automated machine learning. arXiv preprint arXiv:1904.07633.
Gencoglu,O.,van Gils,M.,Guldogan,E.,Morikawa,C.,Süzen,M.,Gruber,M.,Leinonen,J.和Huttunen,H.(2019)。深度学习的一面——从研究生到自动化机器学习。arXiv 预印本 arXiv:1904.07633。

Gmytrasiewicz, P. J., & Doshi, P. (2005). A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research, 24(1), 49–79.
Gmytrasiewicz,PJ和Doshi,P.(2005)。多智能体设置中的顺序规划框架。人工智能研究杂志, 24(1), 49–79.

Gmytrasiewicz, P. J., & Durfee, E. H. (2000). Rational coordination in multi-agent environments. Autonomous Agents and Multi-Agent Systems, 3(4), 319–350.
Gmytrasiewicz,PJ和Durfee,EH(2000)。多智能体环境中的合理协调。自主代理和多代理系统,3(4),319–350。

Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., & Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211
Goodfellow,IJ,Mirza,M.,Xiao,D.,Courville,A.和Bengio,Y.(2013)。基于梯度的神经网络中灾难性遗忘的实证研究。arXiv:1312.6211

Gordon, G. J. (1999). Approximate solutions to Markov decision processes. Technical report, Carnegie-Mellon University.

Greenwald, A., & Hall, K. (2003). Correlated Q-learning. In Proceedings of 17th international conference on autonomous agents and multiagent systems (pp. 242–249). Washington, DC, USA.

Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232.

Grosz, B. J., & Kraus, S. (1996). Collaborative plans for complex group action. Artificial Intelligence, 86(2), 269–357.

Grover, A., Al-Shedivat, M., Gupta, J. K., Burda, Y., & Edwards, H. (2018). Learning policy representations in multiagent systems. In International conference on machine learning.

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., & Levine, S. (2017). Q-prop: Sample-efficient policy gradient with an off-policy critic. In International conference on learning representations.

Gu, S. S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., & Levine, S. (2017). Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in neural information processing systems (pp. 3846–3855).

Guestrin, C., Koller, D., & Parr, R. (2002). Multiagent planning with factored MDPs. In Advances in neural information processing systems (pp. 1523–1530).

Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2003). Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19, 399–468.

Guestrin, C., Lagoudakis, M., & Parr, R. (2002). Coordinated reinforcement learning. In ICML (Vol. 2, pp. 227–234).

Gullapalli, V., & Barto, A. G. (1992). Shaping as a method for accelerating reinforcement learning. In Proceedings of the 1992 IEEE international symposium on intelligent control (pp. 554–559). IEEE.

Gupta, J. K., Egorov, M., & Kochenderfer, M. (2017). Cooperative multi-agent control using deep reinforcement learning. In G. Sukthankar & J. A. Rodriguez-Aguilar (Eds.), Autonomous agents and multiagent systems (pp. 66–83). Cham: Springer.

Gupta, J. K., Egorov, M., & Kochenderfer, M. J. (2017). Cooperative Multi-agent Control using deep reinforcement learning. In Adaptive learning agents at AAMAS. Sao Paulo.

Guss, W. H., Codel, C., Hofmann, K., Houghton, B., Kuno, N., Milani, S., Mohanty, S. P., Liebana, D. P., Salakhutdinov, R., Topin, N., Veloso, M., & Wang, P. (2019). The MineRL competition on sample efficient reinforcement learning using human priors. CoRR arXiv:1904.10079.

Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement learning with deep energy-based policies. In Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 1352–1361).

Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning.

Hafner, R., & Riedmiller, M. (2011). Reinforcement learning in feedback control. Machine Learning, 84(1–2), 137–169.

Harsanyi, J. C. (1967). Games with incomplete information played by “Bayesian” players, I–III part I. The basic model. Management Science, 14(3), 159–182.

Hasselt, H. V. (2010). Double Q-learning. In Advances in neural information processing systems (pp. 2613–2621).

Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. In International conference on learning representations.

Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13(1), 33–94.

He, H., Boyd-Graber, J., Kwok, K., Daume, H. (2016). Opponent modeling in deep reinforcement learning. In 33rd international conference on machine learning (pp. 2675–2684).

Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S. M. A., Riedmiller, M. A., & Silver, D. (2017). Emergence of locomotion behaviours in rich environments. arXiv:1707.02286v2

Heinrich, J., Lanctot, M., & Silver, D. (2015). Fictitious self-play in extensive-form games. In International conference on machine learning (pp. 805–813).

Heinrich, J., & Silver, D. (2016). Deep reinforcement learning from self-play in imperfect-information games. arXiv:1603.01121.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep reinforcement learning that matters. In 32nd AAAI conference on artificial intelligence.

Herbrich, R., Minka, T., & Graepel, T. (2007). TrueSkill™: A Bayesian skill rating system. In Advances in neural information processing systems (pp. 569–576).

Hernandez-Leal, P., & Kaisers, M. (2017). Learning against sequential opponents in repeated stochastic games. In The 3rd multi-disciplinary conference on reinforcement learning and decision making. Ann Arbor.

Hernandez-Leal, P., & Kaisers, M. (2017). Towards a fast detection of opponents in repeated stochastic games. In G. Sukthankar, & J. A. Rodriguez-Aguilar (Eds.) Autonomous agents and multiagent systems: AAMAS 2017 Workshops, Best Papers, Sao Paulo, Brazil, 8–12 May, 2017, Revised selected papers (pp. 239–257).

Hernandez-Leal, P., Kaisers, M., Baarslag, T., & Munoz de Cote, E. (2017). A survey of learning in multiagent environments—dealing with non-stationarity. arXiv:1707.09183.

Hernandez-Leal, P., Kartal, B., & Taylor, M. E. (2019). Agent modeling as auxiliary task for deep reinforcement learning. In AAAI conference on artificial intelligence and interactive digital entertainment.

Hernandez-Leal, P., Taylor, M. E., Rosman, B., Sucar, L. E., & Munoz de Cote, E. (2016). Identifying and tracking switching, non-stationary opponents: A Bayesian approach. In Multiagent interaction without prior coordination workshop at AAAI. Phoenix, AZ, USA.

Hernandez-Leal, P., Zhan, Y., Taylor, M. E., Sucar, L. E., & Munoz de Cote, E. (2017). Efficiently detecting switches against non-stationary opponents. Autonomous Agents and Multi-Agent Systems, 31(4), 767–789.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Thirty-second AAAI conference on artificial intelligence.

Hinton, G., Vinyals, O., & Dean, J. (2014). Distilling the knowledge in a neural network. In NIPS deep learning workshop.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Hong, Z. W., Su, S. Y., Shann, T. Y., Chang, Y. H., & Lee, C. Y. (2018). A deep policy inference Q-network for multi-agent systems. In International conference on autonomous agents and multiagent systems.

Hu, J., & Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. The Journal of Machine Learning Research, 4, 1039–1069.

Iba, H. (1996). Emergent cooperation for multiple agents using genetic programming. In International conference on parallel problem solving from nature (pp. 32–41). Springer.

Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2018). Are deep policy gradient algorithms truly policy gradient algorithms? CoRR arXiv:1811.02553.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning (pp. 448–456).

Isele, D., & Cosgun, A. (2018). Selective experience replay for lifelong learning. In Thirty-second AAAI conference on artificial intelligence.

Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems (pp. 703–710)

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., Hinton, G. E., et al. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.

Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castañeda, A. G., et al. (2019). Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443), 859–865. https://doi.org/10.1126/science.aau6249.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., & Simonyan, K., et al. (2017). Population based training of neural networks. arXiv:1711.09846.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., & Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In International conference on learning representations.

Johanson, M., Bard, N., Burch, N., & Bowling, M. (2012). Finding optimal abstract strategies in extensive-form games. In Twenty-sixth AAAI conference on artificial intelligence.

Johanson, M., Waugh, K., Bowling, M., & Zinkevich, M. (2011). Accelerating best response calculation in large extensive games. In Twenty-second international joint conference on artificial intelligence.

Johanson, M., Zinkevich, M. A., & Bowling, M. (2007). Computing robust counter-strategies. In Advances in neural information processing systems (pp. 721–728). Vancouver, BC, Canada.

Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo platform for artificial intelligence experimentation. In IJCAI (pp. 4246–4247).

Juliani, A., Berges, V., Vckay, E., Gao, Y., Henry, H., Mattar, M., & Lange, D. (2018). Unity: A general platform for intelligent agents. CoRR arXiv:1809.02627.

Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.

Kaisers, M., & Tuyls, K. (2011). FAQ-learning in matrix games: demonstrating convergence near Nash equilibria, and bifurcation of attractors in the battle of sexes. In AAAI Workshop on Interactive Decision Theory and Game Theory (pp. 309–316). San Francisco, CA, USA.

Kakade, S. M. (2002). A natural policy gradient. In Advances in neural information processing systems (pp. 1531–1538).

Kalai, E., & Lehrer, E. (1993). Rational learning leads to Nash equilibrium. Econometrica: Journal of the Econometric Society, 61, 1019–1045.

Kamihigashi, T., & Le Van, C. (2015). Necessary and sufficient conditions for a solution of the bellman equation to be the value function: A general principle. https://halshs.archives-ouvertes.fr/halshs-01159177

Kartal, B., Godoy, J., Karamouzas, I., & Guy, S. J. (2015). Stochastic tree search with useful cycles for patrolling problems. In 2015 IEEE international conference on robotics and automation (ICRA) (pp. 1289–1294). IEEE.

Kartal, B., Hernandez-Leal, P., & Taylor, M. E. (2019). Using Monte Carlo tree search as a demonstrator within asynchronous deep RL. In AAAI workshop on reinforcement learning in games.

Kartal, B., Nunes, E., Godoy, J., & Gini, M. (2016). Monte Carlo tree search with branch and bound for multi-robot task allocation. In The IJCAI-16 workshop on autonomous mobile service robots.

Khadka, S., Majumdar, S., & Tumer, K. (2019). Evolutionary reinforcement learning for sample-efficient multiagent coordination. arXiv e-prints arXiv:1906.07315.

Kim, W., Cho, M., & Sung, Y. (2019). Message-dropout: An efficient training method for multi-agent deep reinforcement learning. In 33rd AAAI conference on artificial intelligence.

Kok, J. R., & Vlassis, N. (2004). Sparse cooperative Q-learning. In Proceedings of the twenty-first international conference on Machine learning (p. 61). ACM.

Konda, V. R., & Tsitsiklis, J. (2000). Actor-critic algorithms. In Advances in neural information processing systems.

Konidaris, G., & Barto, A. (2006). Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the 23rd international conference on machine learning (pp. 489–496). ACM.

Kretchmar, R. M., & Anderson, C. W. (1997). Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning. In Proceedings of international conference on neural networks (ICNN’97) (Vol. 2, pp. 834–837). IEEE.

Kulkarni, T. D., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems (pp. 3675–3683).

Lake, B. M., Ullman, T. D., Tenenbaum, J., & Gershman, S. (2016). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 1–72.

Lanctot, M., Zambaldi, V. F., Gruslys, A., Lazaridou, A., Tuyls, K., Pérolat, J., Silver, D., & Graepel, T. (2017). A unified game-theoretic approach to multiagent reinforcement learning. In Advances in neural information processing systems.

Lauer, M., & Riedmiller, M. (2000). An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the seventeenth international conference on machine learning.

Laurent, G. J., Matignon, L., Fort-Piat, L., et al. (2011). The world of independent learners is not Markovian. International Journal of Knowledge-based and Intelligent Engineering Systems, 15(1), 55–64.

Lazaridou, A., Peysakhovich, A., & Baroni, M. (2017). Multi-agent cooperation and the emergence of (natural) language. In International conference on learning representations.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436.

Lehman, J., & Stanley, K. O. (2008). Exploiting open-endedness to solve problems through the search for novelty. In ALIFE (pp. 329–336).

Leibo, J. Z., Hughes, E., Lanctot, M., & Graepel, T. (2019). Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. CoRR arXiv:1903.00742.

Leibo, J. Z., Perolat, J., Hughes, E., Wheelwright, S., Marblestone, A. H., Duéñez-Guzmán, E., Sunehag, P., Dunning, I., & Graepel, T. (2019). Malthusian reinforcement learning. In 18th international conference on autonomous agents and multiagent systems.

Leibo, J. Z., Zambaldi, V., Lanctot, M., & Marecki, J. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th conference on autonomous agents and multiagent systems. Sao Paulo.

Lerer, A., & Peysakhovich, A. (2017). Maintaining cooperation in complex social dilemmas using deep reinforcement learning. CoRR arXiv:1707.01068.

Li, S., Wu, Y., Cui, X., Dong, H., Fang, F., & Russell, S. (2019). Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In AAAI conference on artificial intelligence.

Li, Y. (2017). Deep reinforcement learning: An overview. CoRR arXiv:1701.07274.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In International conference on learning representations.

Lin, L. J. (1991). Programming robots using reinforcement learning and teaching. In AAAI (pp. 781–786).

Lin, L. J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4), 293–321.

Ling, C. K., Fang, F., & Kolter, J. Z. (2018). What game are we playing? End-to-end learning in normal and extensive form games. In Twenty-seventh international joint conference on artificial intelligence.

Lipton, Z. C., Azizzadenesheli, K., Kumar, A., Li, L., Gao, J., & Deng, L. (2018). Combating reinforcement learning’s Sisyphean curse with intrinsic fear. arXiv:1611.01211v8.

Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. In ICML Machine Learning Debates workshop.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th international conference on machine learning (pp. 157–163). New Brunswick, NJ, USA.

Littman, M. L. (2001). Friend-or-foe Q-learning in general-sum games. In Proceedings of 17th international conference on autonomous agents and multiagent systems (pp. 322–328). Williamstown, MA, USA.

Littman, M. L. (2001). Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1), 55–66.

Littman, M. L., & Stone, P. (2001). Implicit negotiation in repeated games. In ATAL ’01: revised papers from the 8th international workshop on intelligent agents VIII.

Liu, H., Feng, Y., Mao, Y., Zhou, D., Peng, J., & Liu, Q. (2018). Action-dependent control variates for policy optimization via Stein's identity. In International conference on learning representations.

Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., & Graepel, T. (2019). Emergent coordination through competition. In International conference on learning representations.

Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J., Morrill, D., Timbers, F., & Tuyls, K. (2019). Computing approximate equilibria in sequential adversarial games by exploitability descent. CoRR arXiv:1903.05614.

Lowe, R., Foerster, J., Boureau, Y. L., Pineau, J., & Dauphin, Y. (2019). On the pitfalls of measuring emergent communication. In 18th international conference on autonomous agents and multiagent systems.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems (pp. 6379–6390).

Lu, T., Schuurmans, D., & Boutilier, C. (2018). Non-delusional Q-learning and value-iteration. In Advances in neural information processing systems (pp. 9949–9959).

Lyle, C., Castro, P. S., & Bellemare, M. G. (2019). A comparative analysis of expected and distributional reinforcement learning. In Thirty-third AAAI conference on artificial intelligence.

Multiagent Learning, Foundations and Recent Trends. (2017). [Online]. Retrieved September 7, 2018, https://www.cs.utexas.edu/~larg/ijcai17_tutorial/multiagent_learning.pdf .

Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., & Bowling, M. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61, 523–562.

Mahadevan, S., & Connell, J. (1992). Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55(2–3), 311–365.

Matignon, L., Laurent, G. J., & Le Fort-Piat, N. (2012). Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. Knowledge Engineering Review, 27(1), 1–31.

McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In G. H. Bower (Ed.), Psychology of learning and motivation (Vol. 24, pp. 109–165). Amsterdam: Elsevier.

McCracken, P., & Bowling, M. (2004) Safe strategies for agent modelling in games. In AAAI fall symposium (pp. 103–110).

Melis, G., Dyer, C., & Blunsom, P. (2018). On the state of the art of evaluation in neural language models. In International conference on learning representations.

Melo, F. S., Meyn, S. P., & Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning (pp. 664–671). ACM.

Meuleau, N., Peshkin, L., Kim, K. E., & Kaelbling, L. P. (1999). Learning finite-state controllers for partially observable environments. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence (pp. 427–436).

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928–1937).

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv:1312.5602v1.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

Monderer, D., & Shapley, L. S. (1996). Fictitious play property for games with identical interests. Journal of Economic Theory, 68(1), 258–265.

Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1), 103–130.

Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., et al. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337), 508–513.

Mordatch, I., & Abbeel, P. (2018). Emergence of grounded compositional language in multi-agent populations. In Thirty-second AAAI conference on artificial intelligence.

Moriarty, D. E., Schultz, A. C., & Grefenstette, J. J. (1999). Evolutionary algorithms for reinforcement learning. Journal of Artificial Intelligence Research, 11, 241–276.

Morimoto, J., & Doya, K. (2005). Robust reinforcement learning. Neural Computation, 17(2), 335–359.

Nagarajan, P., Warnell, G., & Stone, P. (2018). Deterministic implementations for reproducibility in deep reinforcement learning. arXiv:1809.05676

Nash, J. F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1), 48–49.

Neller, T. W., & Lanctot, M. (2013). An introduction to counterfactual regret minimization. In Proceedings of model AI assignments, the fourth symposium on educational advances in artificial intelligence (EAAI-2013).

Ng, A. Y., Harada, D., & Russell, S. J. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the sixteenth international conference on machine learning (pp. 278–287).

Nguyen, T. T., Nguyen, N. D., & Nahavandi, S. (2018). Deep reinforcement learning for multi-agent systems: A review of challenges, solutions and applications. arXiv preprint arXiv:1812.11794.

Nowé, A., Vrancx, P., & De Hauwere, Y. M. (2012). Game theory and multi-agent reinforcement learning. In M. Wiering & M. van Otterlo (Eds.), Reinforcement learning (pp. 441–470). Berlin: Springer.

OpenAI Baselines: ACKTR & A2C. (2017). [Online]. Retrieved April 29, 2019, https://openai.com/blog/baselines-acktr-a2c/ .

OpenAI Five. (2018). [Online]. Retrieved September 7, 2018, https://blog.openai.com/openai-five.

Oliehoek, F. A. (2018). Interactive learning and decision making - foundations, insights & challenges. In International joint conference on artificial intelligence.

Oliehoek, F. A., Amato, C., et al. (2016). A concise introduction to decentralized POMDPs. Berlin: Springer.

Oliehoek, F. A., De Jong, E. D., & Vlassis, N. (2006). The parallel Nash memory for asymmetric games. In Proceedings of the 8th annual conference on genetic and evolutionary computation (pp. 337–344). ACM.

Oliehoek, F. A., Spaan, M. T., & Vlassis, N. (2008). Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32, 289–353.

Oliehoek, F. A., Whiteson, S., & Spaan, M. T. (2013). Approximate solutions for factored Dec-POMDPs with many agents. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems (pp. 563–570). International Foundation for Autonomous Agents and Multiagent Systems.

Oliehoek, F. A., Witwicki, S. J., & Kaelbling, L. P. (2012). Influence-based abstraction for multiagent systems. In Twenty-sixth AAAI conference on artificial intelligence.

Omidshafiei, S., Hennes, D., Morrill, D., Munos, R., Perolat, J., Lanctot, M., Gruslys, A., Lespiau, J. B., & Tuyls, K. (2019). Neural replicator dynamics. arXiv e-prints arXiv:1906.00190.

Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J. B., et al. (2019). α-Rank: Multi-agent evaluation by evolution. Scientific Reports, 9, 9937.

Omidshafiei, S., Pazis, J., Amato, C., How, J. P., & Vian, J. (2017). Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the 34th international conference on machine learning. Sydney.

Ortega, P. A., & Legg, S. (2018). Modeling friends and foes. arXiv:1807.00196

Palmer, G., Savani, R., & Tuyls, K. (2019). Negative update intervals in deep multi-agent reinforcement learning. In 18th International conference on autonomous agents and multiagent systems.

Palmer, G., Tuyls, K., Bloembergen, D., & Savani, R. (2018). Lenient multi-agent deep reinforcement learning. In International conference on autonomous agents and multiagent systems.

Panait, L., & Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3), 387–434.

Panait, L., Sullivan, K., & Luke, S. (2006). Lenience towards teammates helps in cooperative multiagent learning. In Proceedings of the 5th international conference on autonomous agents and multiagent systems. Hakodate, Japan.

Panait, L., Tuyls, K., & Luke, S. (2008). Theoretical advantages of lenient learners: An evolutionary game theoretic perspective. JMLR, 9(Mar), 423–457.

Papoudakis, G., Christianos, F., Rahman, A., & Albrecht, S. V. (2019). Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737.

Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International conference on machine learning (pp. 1310–1318).

Peng, P., Yuan, Q., Wen, Y., Yang, Y., Tang, Z., Long, H., & Wang, J. (2017). Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv:1703.10069

Pérez-Liébana, D., Hofmann, K., Mohanty, S. P., Kuno, N., Kramer, A., Devlin, S., Gaina, R. D., & Ionita, D. (2019). The multi-agent reinforcement learning in Malmö (MARLÖ) competition. CoRR arXiv:1901.08129.

Pérolat, J., Piot, B., & Pietquin, O. (2018). Actor-critic fictitious play in simultaneous move multistage games. In 21st international conference on artificial intelligence and statistics.

Pesce, E., & Montana, G. (2019). Improving coordination in multi-agent deep reinforcement learning through memory-driven communication. CoRR arXiv:1901.03887.

Pinto, L., Davidson, J., Sukthankar, R., & Gupta, A. (2017). Robust adversarial reinforcement learning. In Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 2817–2826). JMLR.org.

Powers, R., & Shoham, Y. (2005). Learning against opponents with bounded memory. In Proceedings of the 19th international joint conference on artificial intelligence (pp. 817–822). Edinburg, Scotland, UK.

Powers, R., Shoham, Y., & Vu, T. (2007). A general criterion and an algorithmic framework for learning in multi-agent systems. Machine Learning, 67(1–2), 45–76.

Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the seventeenth international conference on machine learning.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.

Pyeatt, L. D., Howe, A. E., et al. (2001). Decision tree function approximation in reinforcement learning. In Proceedings of the third international symposium on adaptive systems: Evolutionary computation and probabilistic graphical models (Vol. 2, pp. 70–77). Cuba.

Rabinowitz, N. C., Perbet, F., Song, H. F., Zhang, C., Eslami, S. M. A., & Botvinick, M. (2018). Machine theory of mind. In International conference on machine learning. Stockholm, Sweden.

Raghu, M., Irpan, A., Andreas, J., Kleinberg, R., Le, Q., & Kleinberg, J. (2018). Can deep reinforcement learning solve Erdos–Selfridge–Spencer games? In Proceedings of the 35th international conference on machine learning.

Raileanu, R., Denton, E., Szlam, A., & Fergus, R. (2018). Modeling others using oneself in multi-agent reinforcement learning. In International conference on machine learning.

Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., & Whiteson, S. (2018). QMIX - monotonic value function factorisation for deep multi-agent reinforcement learning. In International conference on machine learning.

Resnick, C., Eldridge, W., Ha, D., Britz, D., Foerster, J., Togelius, J., Cho, K., & Bruna, J. (2018). Pommerman: A multi-agent playground. arXiv:1809.07124.

Riedmiller, M. (2005). Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In European conference on machine learning (pp. 317–328). Springer.

Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., & Tesauro, G. (2018). Learning to learn without forgetting by maximizing transfer and minimizing interference. CoRR arXiv:1810.11910.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638.

Rosin, C. D., & Belew, R. K. (1997). New methods for competitive coevolution. Evolutionary Computation, 5(1), 1–29.

Rosman, B., Hawasly, M., & Ramamoorthy, S. (2016). Bayesian policy reuse. Machine Learning, 104(1), 99–127.

Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., & Hadsell, R. (2016). Policy distillation. In International conference on learning representations.

Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in neural information processing systems (pp. 901–909).

Samothrakis, S., Lucas, S., Runarsson, T., & Robles, D. (2013). Coevolving game-playing agents: Measuring performance and intransitivities. IEEE Transactions on Evolutionary Computation, 17(2), 213–226.

Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G. J., Hung, C., Torr, P. H. S., Foerster, J. N., & Whiteson, S. (2019). The StarCraft multi-agent challenge. CoRR arXiv:1902.04043.

Sandholm, T. W., & Crites, R. H. (1996). Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems, 37(1–2), 147–166.

Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. In International conference on learning representations.

Schmidhuber, J. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the international conference on simulation of adaptive behavior: From animals to animats (pp. 222–227).

Schmidhuber, J. (2015). Critique of Paper by “Deep Learning Conspiracy” (Nature 521 p 436). http://people.idsia.ch/~juergen/deep-learning-conspiracy.html.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.

Schulman, J., Abbeel, P., & Chen, X. (2017) Equivalence between policy gradients and soft Q-learning. CoRR arXiv:1704.06440.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., & Moritz, P. (2015). Trust region policy optimization. In 31st international conference on machine learning. Lille, France.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.

Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.

Sculley, D., Snoek, J., Wiltschko, A., & Rahimi, A. (2018). Winner’s curse? On pace, progress, and empirical rigor. In ICLR workshop.

Shamma, J. S., & Arslan, G. (2005). Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria. IEEE Transactions on Automatic Control, 50(3), 312–327.

Shelhamer, E., Mahmoudieh, P., Argus, M., & Darrell, T. (2017). Loss is its own reward: Self-supervision for reinforcement learning. In ICLR workshops.

Shoham, Y., Powers, R., & Grenager, T. (2007). If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7), 365–377.

Silva, F. L., & Costa, A. H. R. (2019). A survey on transfer learning for multiagent reinforcement learning systems. Journal of Artificial Intelligence Research, 64, 645–703.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354.

Singh, S., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3), 287–308.

Singh, S., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In Proceedings of the sixteenth conference on uncertainty in artificial intelligence (pp. 541–548). Morgan Kaufmann Publishers Inc.

Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3–4), 323–339.

Song, X., Wang, T., & Zhang, C. (2019). Convergence of multi-agent learning with a finite step size in general-sum games. In 18th International conference on autonomous agents and multiagent systems.

Song, Y., Wang, J., Lukasiewicz, T., Xu, Z., Xu, M., Ding, Z., & Wu, L. (2019). Arena: A general evaluation platform and building toolkit for multi-agent intelligence. CoRR arXiv:1905.08085.

Spencer, J. (1994). Randomization, derandomization and antirandomization: three games. Theoretical Computer Science, 131(2), 415–429.

Srinivasan, S., Lanctot, M., Zambaldi, V., Pérolat, J., Tuyls, K., Munos, R., & Bowling, M. (2018). Actor-critic policy optimization in partially observable multiagent environments. In Advances in neural information processing systems (pp. 3422–3435).

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.

Steckelmacher, D., Roijers, D. M., Harutyunyan, A., Vrancx, P., Plisnier, H., & Nowé, A. (2018). Reinforcement learning in POMDPs with memoryless options and option-observation initiation sets. In Thirty-second AAAI conference on artificial intelligence.

Stimpson, J. L., & Goodrich, M. A. (2003). Learning to cooperate in a social dilemma: A satisficing approach to bargaining. In Proceedings of the 20th international conference on machine learning (ICML-03) (pp. 728–735).

Stone, P., Kaminka, G., Kraus, S., & Rosenschein, J. S. (2010). Ad Hoc autonomous agent teams: Collaboration without pre-coordination. In 32nd AAAI conference on artificial intelligence (pp. 1504–1509). Atlanta, Georgia, USA.

Stone, P., & Veloso, M. M. (2000). Multiagent systems - a survey from a machine learning perspective. Autonomous Robots, 8(3), 345–383.

Stooke, A., & Abbeel, P. (2018). Accelerated methods for deep reinforcement learning. CoRR arXiv:1803.02811.

Strehl, A. L., & Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8), 1309–1331.

Suarez, J., Du, Y., Isola, P., & Mordatch, I. (2019). Neural MMO: A massively multiagent game environment for training and evaluating intelligent agents. CoRR arXiv:1903.00784.

Suau de Castro, M., Congeduti, E., Starre, R. A., Czechowski, A., & Oliehoek, F. A. (2019). Influence-based abstraction in deep reinforcement learning. In Adaptive, learning agents workshop.

Such, F. P., Madhavan, V., Conti, E., Lehman, J., Stanley, K. O., & Clune, J. (2017). Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. CoRR arXiv:1712.06567.

Suddarth, S. C., & Kergosien, Y. (1990). Rule-injection hints as a means of improving network performance and learning time. In Neural networks (pp. 120–129). Springer.

Sukhbaatar, S., Szlam, A., & Fergus, R. (2016). Learning multiagent communication with backpropagation. In Advances in neural information processing systems (pp. 2244–2252).

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V. F., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., & Graepel, T. (2018). Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th international conference on autonomous agents and multiagent systems. Stockholm, Sweden.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems (pp. 1038–1044).

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). Cambridge: MIT Press.

Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., & Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th international conference on autonomous agents and multiagent systems (Vol. 2, pp. 761–768). International Foundation for Autonomous Agents and Multiagent Systems.

Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1), 1–103.

Szepesvári, C., & Littman, M. L. (1999). A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11(8), 2017–2060.

Tamar, A., Levine, S., Abbeel, P., Wu, Y., & Thomas, G. (2016). Value iteration networks. In Advances in neural information processing systems (pp. 2154–2162).

Tambe, M. (1997). Towards flexible teamwork. Journal of Artificial Intelligence Research, 7, 83–124.

Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., et al. (2017). Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4), e0172395.

Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning (pp. 330–337). University of Massachusetts, Amherst, June 27–29, 1993.

Taylor, M. E., & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10, 1633–1685.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58–68.

Tesauro, G. (2003). Extending Q-learning to general adaptive multi-agent systems. In Advances in neural information processing systems (pp. 871–878). Vancouver, Canada.

Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo - A physics engine for model-based control. In Intelligent robots and systems (pp. 5026–5033).

Torrado, R. R., Bontrager, P., Togelius, J., Liu, J., & Perez-Liebana, D. (2018). Deep reinforcement learning for general video game AI. CoRR arXiv:1806.02448.

Tsitsiklis, J. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), 185–202.

Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems (pp. 1075–1081).

Tucker, G., Bhupatiraju, S., Gu, S., Turner, R. E., Ghahramani, Z., & Levine, S. (2018). The mirage of action-dependent baselines in reinforcement learning. In International conference on machine learning.

Tumer, K., & Agogino, A. (2007). Distributed agent-based air traffic flow management. In Proceedings of the 6th international conference on autonomous agents and multiagent systems. Honolulu, Hawaii.

Tuyls, K., & Weiss, G. (2012). Multiagent learning: Basics, challenges, and prospects. AI Magazine, 33(3), 41–52.

van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., & Modayil, J. (2018). Deep reinforcement learning and the deadly triad. CoRR arXiv:1812.02648.

Van der Pol, E., & Oliehoek, F. A. (2016). Coordinated deep reinforcement learners for traffic light control. In Proceedings of learning, inference and control of multi-agent systems at NIPS.

Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Thirtieth AAAI conference on artificial intelligence.

Van Seijen, H., Van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In IEEE symposium on adaptive dynamic programming and reinforcement learning (pp. 177–184). Nashville, TN, USA.

Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). FeUdal networks for hierarchical reinforcement learning. In International conference on machine learning.

Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D., & Silver, D. (2019). AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/

Vodopivec, T., Samothrakis, S., & Ster, B. (2017). On Monte Carlo tree search and reinforcement learning. Journal of Artificial Intelligence Research, 60, 881–936.

Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior. Princeton, NJ: Princeton University Press.

Walsh, W. E., Das, R., Tesauro, G., & Kephart, J. O. (2002). Analyzing complex strategic interactions in multi-agent systems. In AAAI-02 workshop on game-theoretic and decision-theoretic agents (pp. 109–118).

Wang, H., Raj, B., & Xing, E. P. (2017). On the origin of deep learning. CoRR arXiv:1702.07800.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., & de Freitas, N. (2016). Sample efficient actor-critic with experience replay. CoRR arXiv:1611.01224.

Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In International conference on machine learning.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, King’s College, Cambridge, UK.

Wei, E., & Luke, S. (2016). Lenient learning in independent-learner stochastic cooperative games. Journal of Machine Learning Research, 17, 1–42.

Wei, E., Wicke, D., Freelan, D., & Luke, S. (2018). Multiagent soft Q-learning. CoRR arXiv:1804.09817.

Weinberg, M., & Rosenschein, J. S. (2004). Best-response multiagent learning in non-stationary environments. In Proceedings of the 3rd international conference on autonomous agents and multiagent systems (pp. 506–513). New York, NY, USA.

Weiss, G. (Ed.). (2013). Multiagent systems. Intelligent robotics and autonomous agents series (2nd ed.). Cambridge, MA: MIT Press.

Whiteson, S., & Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7(May), 877–917.

Whiteson, S., Tanner, B., Taylor, M. E., & Stone, P. (2011). Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL) (pp. 120–127). IEEE.

Wiering, M., & van Otterlo, M. (Eds.) (2012). Reinforcement learning. Adaptation, learning, and optimization (Vol. 12). Springer-Verlag Berlin Heidelberg.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.

Wolpert, D. H., & Tumer, K. (2002). Optimal payoff functions for members of collectives. In Modeling complexity in economic and social systems (pp. 355–369).

Wolpert, D. H., Wheeler, K. R., & Tumer, K. (1999). General principles of learning-based multi-agent systems. In Proceedings of the third international conference on autonomous agents.

Wunder, M., Littman, M. L., & Babes, M. (2010). Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. In Proceedings of the 27th international conference on machine learning (pp. 1167–1174). Haifa, Israel.

Yang, T., Hao, J., Meng, Z., Zhang, C., & Zheng, Y. Z. Z. (2019). Towards efficient detection and optimal response against sophisticated opponents. In IJCAI.

Yang, Y., Hao, J., Sun, M., Wang, Z., Fan, C., & Strbac, G. (2018). Recurrent deep multiagent Q-learning for autonomous brokers in smart grid. In Proceedings of the twenty-seventh international joint conference on artificial intelligence. Stockholm, Sweden.

Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. In Proceedings of the 35th international conference on machine learning. Stockholm, Sweden.

Yu, Y. (2018). Towards sample efficient reinforcement learning. In IJCAI (pp. 5739–5743).

Zahavy, T., Ben-Zrihem, N., & Mannor, S. (2016). Graying the black box: Understanding DQNs. In International conference on machine learning (pp. 1899–1908).

Zhang, C., & Lesser, V. (2010). Multi-agent learning with policy prediction. In Twenty-fourth AAAI conference on artificial intelligence.

Zhao, J., Qiu, G., Guan, Z., Zhao, W., & He, X. (2018). Deep reinforcement learning for sponsored search real-time bidding. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 1021–1030). ACM.

Zheng, Y., Hao, J., & Zhang, Z. (2018). Weighted double deep multiagent reinforcement learning in stochastic cooperative environments. CoRR arXiv:1802.08534.

Zheng, Y., Meng, Z., Hao, J., Zhang, Z., Yang, T., & Fan, C. (2018). A deep Bayesian policy reuse approach against non-stationary agents. In Advances in neural information processing systems (pp. 962–972).

Zinkevich, M., Greenwald, A., & Littman, M. L. (2006). Cyclic equilibria in Markov games. In Advances in neural information processing systems (pp. 1641–1648).

Zinkevich, M., Johanson, M., Bowling, M., & Piccione, C. (2008). Regret minimization in games with incomplete information. In Advances in neural information processing systems (pp. 1729–1736).
