Multi-agent deep reinforcement learning: a survey

Abstract

The advances in reinforcement learning have recorded sublime success in various domains. Although the multi-agent domain has been overshadowed by its single-agent counterpart during this progress, multi-agent reinforcement learning is gaining rapid traction, and the latest accomplishments address problems of real-world complexity. This article provides an overview of the current developments in the field of multi-agent deep reinforcement learning. We focus primarily on literature from recent years that combines deep reinforcement learning methods with a multi-agent scenario. To survey the works that constitute the contemporary landscape, the main contents are divided into three parts. First, we analyze the structure of training schemes that are applied to train multiple agents. Second, we consider the emergent patterns of agent behavior in cooperative, competitive and mixed scenarios. Third, we systematically enumerate challenges that exclusively arise in the multi-agent domain and review methods that are leveraged to cope with these challenges. To conclude this survey, we discuss advances, identify trends, and outline possible directions for future work in this research area.

1 Introduction

A multi-agent system describes multiple distributed entities—so-called agents—which take decisions autonomously and interact within a shared environment (Weiss 1999). Each agent seeks to accomplish an assigned goal for which a broad set of skills might be required to build intelligent behavior. Depending on the task, an intricate interplay between agents can occur such that agents start to collaborate or act competitively to outperform opponents. Specifying intelligent behavior a-priori through programming is a tough, if not impossible, task for complex systems. Therefore, agents require the ability to adapt and learn over time by themselves. The most common framework to address learning in an interactive environment is reinforcement learning (RL), which describes the change of behavior through a trial-and-error approach.

The field of reinforcement learning is currently thriving. Since the breakthrough of deep learning methods, works have been successful at mastering complex control tasks, e.g. in robotics (Levine et al. 2016; Lillicrap et al. 2016) and game playing (Mnih et al. 2015; Silver et al. 2016). The key to these results lies in learning techniques that employ neural networks as function approximators (Arulkumaran et al. 2017). Despite these achievements, the majority of works investigated single-agent settings only, although many real-world applications naturally comprise multiple decision-makers that interact at the same time. The areas of application encompass the coordination of distributed systems (Cao et al. 2013; Wang et al. 2016b) such as autonomous vehicles (Shalev-Shwartz et al. 2016) and multi-robot control (Matignon et al. 2012a), the networking of communication packages (Luong et al. 2019), or the trading on financial markets (Lux and Marchesi 1999). In these systems, each agent discovers a strategy alongside other entities in a common environment and adapts its policy in response to the behavioral changes of others. Carried by the advances of single-agent deep RL, the multi-agent reinforcement learning (MARL) community has surged with new interest and a plethora of literature has emerged lately (Hernandez-Leal et al. 2019; Nguyen et al. 2020). The use of deep learning methods enabled the community to move beyond the historically investigated tabular problems to challenging problems with real-world complexity (Baker et al. 2020; Berner et al. 2019; Jaderberg et al. 2019; Vinyals et al. 2019).

In this paper, we provide an extensive review of the recent advances in the area of multi-agent deep reinforcement learning (MADRL). Although multi-agent systems enjoy a rich history (Busoniu et al. 2008; Shoham et al. 2003; Stone and Veloso 2000; Tuyls and Weiss 2012), this survey aims to shed light on the contemporary landscape of the literature in MADRL.

1.1 Related work

The intersection of multi-agent systems and reinforcement learning holds a long record of active research. As one of the first surveys in the field, Stone and Veloso (2000) analyzed multi-agent systems from a machine learning perspective and classified the reviewed literature according to heterogeneous and homogeneous agent structures as well as communication skills. The authors discussed issues associated with each classification. Shoham et al. (2003) criticized the problem statement of MARL, which is in the authors' opinion ill-posed and unclear, and called for more grounded research. They proposed a coherent research agenda which includes four directions for future research. Yang and Gu (2004) reviewed algorithms and pointed out that the main difficulty lies in the generalization to continuous action and state spaces and in the scaling to many agents. Similarly, Busoniu et al. (2008) presented selected algorithms and discussed benefits as well as challenges of MARL. Benefits include computational speed-ups and the possibility of experience sharing between agents. In contrast, drawbacks are the specification of meaningful goals, the non-stationarity of the environment, and the need for coherent coordination in cooperative games. In addition to that, they posed challenges such as the exponential increase of computational complexity with the number of agents and the alter-exploration problem where agents must balance the acquisition of new knowledge against the exploitation of current knowledge. More specifically, Matignon et al. (2012b) identified challenges for the coordination of independent learners that arise in fully cooperative Markov Games such as non-stationarity, stochasticity, and shadowed equilibria. Further, they analyzed conditions under which algorithms can address such coordination issues. Another work by Tuyls and Weiss (2012) accounted for the historical developments of MARL and raised non-technical challenges. They criticized that the intersection of RL techniques and game theory dominates multi-agent learning, which may render the scope of the field too narrow and limit investigations to simplistic problems such as grid worlds. They claimed that the scalability to high numbers of agents and to large and continuous spaces is the holy grail of this research domain.

Since the advent of deep learning methods and the breakthrough of deep RL, the field of MARL has attracted new interest and a plethora of literature has emerged in recent years. Nguyen et al. (2020) presented five technical challenges including non-stationarity, partial observability, continuous spaces, training schemes, and transfer learning. They discussed possible solution approaches alongside their practical applications. Hernandez-Leal et al. (2019) concentrated on four categories including the analysis of emergent behaviors, learning communication, learning cooperation, and agent modeling. Further survey literature focuses on one particular sub-field of MADRL. Oroojlooyjadid and Hajinezhad (2019) reviewed recent works in the cooperative setting while Da Silva and Costa (2019) and Da Silva et al. (2019) focused on knowledge reuse. Lazaridou and Baroni (2020) reviewed the emergence of language and connected two perspectives, which comprise the conditions under which language evolves in communities and the ability to solve problems through dynamic communication. Based on theoretical analysis, Zhang et al. (2019) focused on MARL algorithms and presented challenges from a mathematical perspective.

Fig. 1  Schematic structure of the main contents in this survey. In Sect. 3, we review schemes that are applied to train agent behavior in the multi-agent setting. The training of agents can be divided into two paradigms which are namely distributed (Sect. 3.1) and centralized (Sect. 3.2). In Sect. 4, we consider the emergent patterns of agent behavior with respect to the reward structure (Sect. 4.1), the language (Sect. 4.2) and the social context (Sect. 4.3). In Sect. 5, we enumerate current challenges of MADRL which include the non-stationarity of the environment due to co-adapting agents (Sect. 5.1), the learning of communication (Sect. 5.2), the need for a coherent coordination of actions (Sect. 5.3), the credit assignment problem (Sect. 5.4), the ability to scale to an arbitrary number of decision-makers (Sect. 5.5), and non-Markovian environments due to partial observations (Sect. 5.6)
1.2 Contribution and survey structure
The contribution of this paper is to present a comprehensive survey of the recent research directions pursued in the field of MADRL. We provide a holistic overview of current challenges that arise exclusively in the multi-agent domain of deep RL and discuss state-of-the-art solutions that were proposed to address these challenges. In contrast to the surveys of Hernandez-Leal et al. (2019) and Nguyen et al. (2020), which focus on a subset of topics, we aim to provide a widened and more comprehensive overview of the current investigations conducted in the field of MADRL while recapitulating what has already been accomplished. We identify contemporary challenges and discuss literature that addresses them. We see our work as complementary to the theoretical survey of Zhang et al. (2019).

We dedicate this paper to an audience who wants an excursion to the realm of MADRL. Readers shall gain insights about the historical roots of this still young field and its current developments, but also understand the open problems to be faced by future research. The contents of this paper are organized as follows. We begin with a formal introduction to both single-agent and multi-agent RL and reveal pathologies that are present in MARL in Sect. 2. We then continue with the main contents, which are categorized according to the three-fold taxonomy as illustrated in Fig. 1.

We analyze training architectures in Sect. 3, where we categorize approaches according to a centralized or distributed training paradigm and additionally differentiate into execution schemes. Thereafter, we review literature that investigates emergent patterns of agent behavior in Sect. 4. We classify works in terms of the reward structure (Sect. 4.1), the language between multiple agents (Sect. 4.2), and the social context (Sect. 4.3). In Sect. 5, we enumerate current challenges of the multi-agent domain, which include the non-stationarity of the environment due to simultaneously adapting learners (Sect. 5.1), the learning of meaningful communication protocols in cooperative tasks (Sect. 5.2), the need for coherent coordination of agent actions (Sect. 5.3), the credit assignment problem (Sect. 5.4), the ability to scale to an arbitrary number of decision-makers (Sect. 5.5), and non-Markovian environments due to partial observations (Sect. 5.6). We discuss the matter of MADRL, pose trends that we identified in recent literature, and outline possible future work in Sect. 6. Finally, this survey concludes in Sect. 7.

2 Background
In this section, we provide a formal introduction into the concepts of RL. We start with the Markov decision process as a framework for single-agent learning in Sect. 2.1. We continue with the multi-agent case and introduce the Markov Game in Sect. 2.2. Finally, we pose pathologies that arise in the multi-agent domain such as the non-stationarity of the environment from the perspective of a single learner, relative over-generalization, and the credit assignment problem in Sect. 2.3. We provide the formal concepts behind these MARL pathologies in order to drive our discussion about the state-of-the-art approaches in Sect. 5. The scope of this background section is deliberately focusing on classical MARL works to reveal the roots of the domain and to give the reader insights into the early works on which modern MADRL approaches rest.

2.1 Single-agent reinforcement learning
The traditional reinforcement learning problem (Sutton and Barto 1998) is concerned with learning a control policy that optimizes a numerical performance by making decisions in stages. The decision-maker called agent interacts with an environment of unknown dynamics in a trial-and-error fashion and occasionally receives feedback upon which the agent wants to improve. The standard formulation for such sequential decision-making is the Markov decision process, which is defined as follows (Bellman 1957; Bertsekas 2012, 2017; Kaelbling et al. 1996).

Definition 1
Markov decision process (MDP) A Markov decision process is formalized by the tuple $\langle \mathcal{X}, \mathcal{U}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{X}$ and $\mathcal{U}$ are the state and action space, respectively, $\mathcal{P}: \mathcal{X} \times \mathcal{U} \times \mathcal{X} \rightarrow [0, 1]$ is the transition function describing the probability of a state transition, $\mathcal{R}: \mathcal{X} \times \mathcal{U} \times \mathcal{X} \rightarrow \mathbb{R}$ is the reward function providing an immediate feedback to the agent, and $\gamma \in [0, 1)$ describes the discount factor.

The agent's goal is to act in such a way as to maximize the expected performance on a long-term perspective with regard to an unknown transition function $\mathcal{P}$. Therefore, the agent learns a behavior policy $\pi(u \mid x)$ that optimizes the expected performance J throughout learning. The performance is defined as the expected value of discounted rewards

$$J(\pi) = \mathbb{E}_{x_0 \sim \rho_0,\, \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \, \mathcal{R}(x_t, u_t, x_{t+1}) \right] \qquad (1)$$

over the initial state distribution $\rho_0$ while selected actions are governed by the policy $\pi$. Here, we regard the infinite-horizon problem where the interaction between agent and environment does not terminate after a finite number of steps. Note that the learning objective can also be formalized for finite-horizon problems (Bertsekas 2012, 2017). As an alternative to the policy performance, which describes the expected performance as a function of the policy, one can define the utility of being in a particular state in terms of a value function. The state-value function $V^{\pi}$ describes the utility under policy $\pi$ when starting from state x, i.e.

$$V^{\pi}(x) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t \, \mathcal{R}(x_t, u_t, x_{t+1}) \,\Big|\, x_0 = x \right] \qquad (2)$$

In a similar manner, the action-value function $Q^{\pi}$ describes the utility of being in state x, performing action u, and following the policy $\pi$ thereafter, that is

$$Q^{\pi}(x, u) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t \, \mathcal{R}(x_t, u_t, x_{t+1}) \,\Big|\, x_0 = x, u_0 = u \right] \qquad (3)$$

In the context of deep reinforcement learning, either the policy, a value function or both are represented by neural networks.
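As a hedged illustration of Eqs. (2) and (3), the following sketch performs iterative policy evaluation on a small, invented two-state MDP; the transition table, rewards and policy are purely hypothetical and only show how the Bellman expectation backup computes $V^{\pi}$ and $Q^{\pi}$.

    import numpy as np

    # Hypothetical 2-state, 2-action MDP, used purely for illustration.
    P = np.array([               # P[x, u, x'] = probability of a state transition
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.5, 0.5], [0.1, 0.9]],
    ])
    R = np.array([[1.0, 0.0],    # R[x, u] = expected immediate reward (a simplification
                  [0.0, 2.0]])   # of R(x, u, x') from Definition 1)
    gamma = 0.95
    pi = np.array([[0.5, 0.5],   # pi[x, u] = probability of choosing action u in state x
                   [0.5, 0.5]])

    # Iterative policy evaluation: apply the Bellman expectation backup until
    # the state-value function of Eq. (2) converges.
    V = np.zeros(2)
    for _ in range(10000):
        Q = R + gamma * (P @ V)          # Q[x, u], cf. Eq. (3)
        V_new = np.sum(pi * Q, axis=1)   # V[x] = sum_u pi(u|x) Q(x, u)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new

    print('V^pi:', V)
    print('Q^pi:', Q)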

2.2 Multi-agent reinforcement learning
When the sequential decision-making is extended to multiple agents, Markov Games are commonly applied as a framework. The Markov Game was originally introduced by Littman (1994) to generalize MDPs to multiple agents that simultaneously interact within a shared environment and possibly with each other. The definition is formalized in a discrete-time setting and is denoted as follows (Littman 1994).

Definition 2
Markov Games (MG) The Markov Game is an extension to the MDP and is formalized by the tuple $\langle \mathcal{N}, \mathcal{X}, \{\mathcal{U}_i\}_{i \in \mathcal{N}}, \mathcal{P}, \{\mathcal{R}_i\}_{i \in \mathcal{N}}, \gamma \rangle$, where $\mathcal{N} = \{1, \ldots, N\}$ denotes the set of $N$ interacting agents and $\mathcal{X}$ is the set of states observed by all agents. The joint action space is denoted by $\mathcal{U} = \mathcal{U}_1 \times \cdots \times \mathcal{U}_N$, which is the collection of individual action spaces from agents $i \in \mathcal{N}$. The transition probability function $\mathcal{P}: \mathcal{X} \times \mathcal{U} \times \mathcal{X} \rightarrow [0, 1]$ describes the chance of a state transition. Each agent owns an associated reward function $\mathcal{R}_i: \mathcal{X} \times \mathcal{U} \times \mathcal{X} \rightarrow \mathbb{R}$ that provides an immediate feedback signal. Finally, $\gamma \in [0, 1)$ describes the discount factor.

At stage t, each agent $i \in \mathcal{N}$ selects and executes an action $u_{i,t}$ depending on the individual policy $\pi_i(u_i \mid x)$. The system evolves from state $x_t$ under the joint action $\mathbf{u}_t = (u_{1,t}, \ldots, u_{N,t})$ with respect to the transition probability function $\mathcal{P}$ to the next state $x_{t+1}$ while each agent receives $r_{i,t} = \mathcal{R}_i(x_t, \mathbf{u}_t, x_{t+1})$ as immediate feedback to the state transition. Akin to the single-agent problem, the aim of each agent is to change its policy in such a way as to optimize the received rewards on a long-term perspective.
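This interaction protocol can be made concrete with a minimal sketch of a Markov Game interface in which all agents act simultaneously and each receives its own reward; the class, its tabular dynamics and all sizes below are invented for illustration and do not correspond to an existing benchmark.

    import numpy as np

    class ToyMarkovGame:
        """Tabular Markov Game used only to illustrate the interaction loop
        of Definition 2 (randomly generated dynamics and rewards)."""

        def __init__(self, n_agents=2, n_states=4, n_actions=3, seed=0):
            self.rng = np.random.default_rng(seed)
            self.n_agents, self.n_states, self.n_actions = n_agents, n_states, n_actions
            shape = (n_states,) + (n_actions,) * n_agents
            # P[x, u_1, ..., u_N] -> categorical distribution over next states
            self.P = self.rng.dirichlet(np.ones(n_states), size=shape)
            # R[i, x, u_1, ..., u_N] -> reward of agent i for that transition
            self.R = self.rng.normal(size=(n_agents,) + shape)
            self.state = 0

        def reset(self):
            self.state = 0
            return self.state

        def step(self, joint_action):
            """All agents act simultaneously; each agent i receives its own r_i."""
            idx = (self.state,) + tuple(joint_action)
            rewards = [float(self.R[(i,) + idx]) for i in range(self.n_agents)]
            self.state = int(self.rng.choice(self.n_states, p=self.P[idx]))
            return self.state, rewards

    env = ToyMarkovGame()
    x = env.reset()
    for t in range(5):   # one short episode with uniformly random individual policies
        joint_action = [int(env.rng.integers(env.n_actions)) for _ in range(env.n_agents)]
        x, rewards = env.step(joint_action)
        print(f"t={t}  x'={x}  rewards={np.round(rewards, 2)}")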

A special case of the MG is the stateless setting $|\mathcal{X}| = 1$, called strategic-form game. Strategic-form games describe one-shot interactions where all agents simultaneously execute an action and receive a reward based on the joint action, after which the game ends. Significant progress within the MARL community has been accomplished by studying this simplified stateless setting, which is still under active research to cope with several pathologies as discussed later in this section. These games are also known as matrix games because the reward function is represented by an $|\mathcal{U}_1| \times \cdots \times |\mathcal{U}_N|$ matrix. The formalism which extends to multi-step sequential stages is called extensive-form game.

In contrast to the single-agent case, the value function $V_i^{\boldsymbol{\pi}}$ does not only depend on the individual policy of agent i but also on the policies of other agents, i.e. the value function for agent i is the expected sum

$$V_i^{\boldsymbol{\pi}}(x) = \mathbb{E}_{\boldsymbol{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t \, \mathcal{R}_i(x_t, \mathbf{u}_t, x_{t+1}) \,\Big|\, x_0 = x \right] \qquad (4)$$

when the agents behave according to the joint policy $\boldsymbol{\pi}$. We denote the joint policy $\boldsymbol{\pi}$ as the collection of all individual policies, i.e. $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_N)$. Further, we make use of the convention that $-i$ denotes all agents except i, meaning for policies that $\boldsymbol{\pi}_{-i} = (\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_N)$.

The optimal policy is determined by the individual policy and the other agents' strategies. However, when the other agents' policies are fixed, agent i can maximize its own utility by finding the best response $\pi_i^*$ with respect to the other agents' strategies.

Definition 3
Best response Agent i's best response $\pi_i^*$ to the joint policy $\boldsymbol{\pi}_{-i}$ of the other agents satisfies

$$V_i^{(\pi_i^*, \boldsymbol{\pi}_{-i})}(x) \geq V_i^{(\pi_i, \boldsymbol{\pi}_{-i})}(x)$$

for all states $x \in \mathcal{X}$ and policies $\pi_i$.

In general, when all agents learn simultaneously, the best response found may not be unique (Shoham and Leyton-Brown 2008). The concept of the best response can be leveraged to describe the most influential solution concept from game theory: the Nash equilibrium.

Definition 4
Nash equilibrium A solution where each agent's policy $\pi_i^*$ is the best response to the other agents' policies $\boldsymbol{\pi}_{-i}^*$ such that the following inequality

$$V_i^{(\pi_i^*, \boldsymbol{\pi}_{-i}^*)}(x) \geq V_i^{(\pi_i, \boldsymbol{\pi}_{-i}^*)}(x)$$

holds true for all agents $i \in \mathcal{N}$, all states $x \in \mathcal{X}$ and all policies $\pi_i$ is called Nash equilibrium.

Intuitively spoken, a Nash equilibrium is a solution where one agent cannot improve when the policies of other agents are fixed, that is, no agent can improve by unilaterally deviating from $\boldsymbol{\pi}^* = (\pi_1^*, \ldots, \pi_N^*)$. However, a Nash equilibrium may not be unique. Thus, the concept of Pareto-optimality might be useful (Matignon et al. 2012b).

Definition 5
Pareto-optimality A joint policy $\bar{\boldsymbol{\pi}}$ Pareto-dominates a second joint policy $\hat{\boldsymbol{\pi}}$ if and only if

$$V_i^{\bar{\boldsymbol{\pi}}}(x) \geq V_i^{\hat{\boldsymbol{\pi}}}(x) \;\; \forall i \in \mathcal{N}, \forall x \in \mathcal{X} \quad \text{and} \quad \exists j \in \mathcal{N}, \exists x \in \mathcal{X}: \; V_j^{\bar{\boldsymbol{\pi}}}(x) > V_j^{\hat{\boldsymbol{\pi}}}(x)$$

A Nash equilibrium is regarded as Pareto-optimal if no other equilibrium has greater value and, thus, it is not Pareto-dominated.
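To make Definitions 3 to 5 concrete, the sketch below enumerates the pure-strategy Nash equilibria of the fully cooperative climbing game of Claus and Boutilier (1998), a $3 \times 3$ matrix game in which both agents receive the identical payoff; the enumeration logic is a straightforward, hypothetical implementation of the best-response condition.

    import numpy as np

    # Climbing game (Claus and Boutilier 1998): both agents receive the same payoff.
    # Rows are agent 1's actions (a, b, c), columns are agent 2's actions.
    R = np.array([[ 11, -30,   0],
                  [-30,   7,   6],
                  [  0,   0,   5]])
    actions = "abc"

    # A joint action (i, j) is a pure Nash equilibrium if neither agent can gain
    # by unilaterally switching its own action (Definitions 3 and 4).
    nash = [(i, j) for i in range(3) for j in range(3)
            if R[i, j] >= R[:, j].max() and R[i, j] >= R[i, :].max()]

    for i, j in nash:
        print(f"({actions[i]}, {actions[j]}) is a Nash equilibrium with payoff {R[i, j]}")

    # With identical payoffs, Pareto dominance reduces to comparing the shared value,
    # so the equilibrium with the highest payoff, (a, a), is the Pareto-optimal one.
    best = max(nash, key=lambda ij: R[ij])
    print(f"Pareto-optimal equilibrium: ({actions[best[0]]}, {actions[best[1]]})")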

Classical MARL literature can be categorized according to different features, such as the type of task and the information available to agents. In the remainder of this section, we introduce MARL concepts based on the taxonomy proposed in Busoniu et al. (2008). For one, the primary factor that influences the learned agent behavior is the type of task. Whether agents compete or cooperate is promoted by the designed reward structure.

(1) Fully cooperative setting All agents receive the same reward $\mathcal{R}_1 = \mathcal{R}_2 = \cdots = \mathcal{R}_N$ for state transitions. In such an equally-shared reward setting, agents are motivated to collaborate and try to avoid the failure of an individual to maximize the performance of the team. More generally, we talk about cooperative settings when agents are encouraged to collaborate but do not own an equally-shared reward.

(2) Fully competitive setting Such a problem is described as a zero-sum Markov Game where the sum of rewards equals zero for any state transition, i.e. $\sum_{i=1}^{N} \mathcal{R}_i(x, \mathbf{u}, x') = 0$. Agents are prudent to maximize their own individual reward while minimizing the reward of the others. In a loose sense, we refer to competitive games when agents are encouraged to excel against opponents, but the sum of rewards does not equal zero.

(3) Mixed setting Also known as general-sum game, the mixed setting is neither fully cooperative nor fully competitive and, thus, does not incorporate restrictions on agent goals.

Besides the reward structure, other taxonomies may be used to differentiate between the information available to the agents. Claus and Boutilier (1998) distinguished between two types of learning, namely independent learners and joint-action learners. The former ignore the existence of other agents and cannot observe the rewards and selected actions of others, as considered in Bowling and Veloso (2002) and Lauer and Riedmiller (2000). Joint-action learners, however, observe the actions taken by all other agents a-posteriori, as shown in Hu and Wellman (2003) and Littman (2001).

2.3 Formal introduction to multi-agent challenges
In the single-agent formalism, the agent is the only decision-instance that influences the state of the environment. State transitions can be clearly attributed to the agent, whereas everything outside the agent’s field of impact is regarded as part of the underlying system dynamics. Even though the environment may be stochastic, the learning problem remains stationary.

On the contrary, one of the fundamental problems in the multi-agent domain is that agents update their policies during the learning process simultaneously, such that the environment appears non-stationary from the perspective of a single agent. Hence, the Markov assumption of an MDP no longer holds, and agents face—without further treatment—a moving target problem (Busoniu et al. 2008; Yang and Gu 2004).

Definition 6
Non-stationarity A single agent faces a moving target problem when the transition probability function changes

$$\mathcal{P}(x' \mid x, u_i, \pi_1, \ldots, \pi_N) \neq \mathcal{P}(x' \mid x, u_i, \bar{\pi}_1, \ldots, \bar{\pi}_N)$$

due to the co-adaption $\pi_j \neq \bar{\pi}_j$ of agents $j \neq i$.
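Numerically, the effect of Definition 6 can be seen by marginalizing the true transition kernel over the other agent's policy: the kernel that agent 1 effectively experiences changes whenever agent 2 adapts. The kernel and policies below are invented for illustration.

    import numpy as np

    # True two-agent transition kernel P[x, u1, u2, x'] for 2 states and
    # 2 actions per agent; the numbers are invented for illustration.
    P = np.array([
        [[[0.9, 0.1], [0.1, 0.9]],
         [[0.8, 0.2], [0.3, 0.7]]],
        [[[0.5, 0.5], [0.6, 0.4]],
         [[0.2, 0.8], [0.4, 0.6]]],
    ])

    def effective_kernel(pi2):
        """Kernel seen by agent 1 when agent 2 follows the policy pi2[x, u2]:
        P1[x, u1, x'] = sum_{u2} pi2(u2 | x) * P[x, u1, u2, x']."""
        return np.einsum('xb,xabn->xan', pi2, P)

    pi2_old = np.array([[1.0, 0.0], [1.0, 0.0]])   # agent 2 always plays action 0
    pi2_new = np.array([[0.0, 1.0], [0.0, 1.0]])   # after learning: always action 1

    print(effective_kernel(pi2_old)[0, 0])  # P(. | x=0, u1=0) before co-adaption
    print(effective_kernel(pi2_new)[0, 0])  # the same quantity afterwards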

Above, we have introduced the Nash equilibrium as a solution concept where each agent’s policy is the best response to the others. However, it has been shown that agents can converge, despite a high degree of randomness in action selection, to sub-optimal solutions or can get stuck between different solutions (Wiegand 2004). Fulda and Ventura (2007) investigated such convergence to solutions and described a Pareto-selection problem called shadowed equilibrium.

Definition 7
Shadowed equilibrium A joint policy $\hat{\boldsymbol{\pi}}$ is shadowed by another joint policy $\bar{\boldsymbol{\pi}}$ in a state x if and only if

$$\exists i \in \mathcal{N}, \exists \pi_i: \; V_i^{(\pi_i, \hat{\boldsymbol{\pi}}_{-i})}(x) < \min_{j,\, \pi_j} V_j^{(\pi_j, \bar{\boldsymbol{\pi}}_{-j})}(x) \qquad (5)$$

An equilibrium is shadowed by another when at least one agent exists who, when unilaterally deviating from $\hat{\boldsymbol{\pi}}$, can end up with a lower value than the worst value attainable by unilaterally deviating from $\bar{\boldsymbol{\pi}}$ (Matignon et al. 2012b). As a form of shadowed equilibrium, the pathology of relative over-generalization describes that a sub-optimal Nash equilibrium in the joint action space is preferred over an optimal solution. This phenomenon arises since each agent's policy performs relatively well when paired with arbitrary actions from other agents (Panait et al. 2006; Wei and Luke 2016; Wiegand 2004).
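A compact way to see relative over-generalization is to evaluate each action of the climbing game (the same payoff matrix as in the sketch above) against a uniformly random partner; the action belonging to the optimal equilibrium then looks worst. This is a hedged, simplified illustration rather than the formal treatment of Panait et al. (2006).

    import numpy as np

    # Climbing game payoff, identical for both agents (Claus and Boutilier 1998).
    R = np.array([[ 11, -30,   0],
                  [-30,   7,   6],
                  [  0,   0,   5]])

    # Expected payoff of each of agent 1's actions against a uniform random partner.
    expected = R.mean(axis=1)
    for action, value in zip("abc", expected):
        print(f"action {action}: expected payoff {value:+.2f} against a random partner")

    # Action a, which belongs to the optimal equilibrium (a, a) with payoff 11,
    # scores worst in expectation, so learners that average over the partner's
    # exploratory actions drift towards c: the optimal equilibrium is shadowed.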

In a Markov Game, we assumed that each agent observes a state x, which encodes all necessary information about the world. However, for complex systems, complete information might not be perceivable. In such partially observable settings, the agents do not observe the whole state space but merely a subset of it in the form of local observations $o_i \in \mathcal{O}_i$. Hence, the agents are confronted with sequential decision-making under uncertainty. The partially observable Markov Game (Hansen et al. 2004) is the generalization of both MG and MDP.

Definition 8
Partially observable Markov Games (POMG) The POMG is mathematically denoted by the tuple $\langle \mathcal{N}, \mathcal{X}, \{\mathcal{U}_i\}_{i \in \mathcal{N}}, \{\mathcal{O}_i\}_{i \in \mathcal{N}}, \mathcal{P}, \{\mathcal{R}_i\}_{i \in \mathcal{N}}, \gamma \rangle$, where $\mathcal{N} = \{1, \ldots, N\}$ denotes the set of $N$ interacting agents, $\mathcal{X}$ is the set of global but unobserved system states, and $\mathcal{U} = \mathcal{U}_1 \times \cdots \times \mathcal{U}_N$ is the set of individual action spaces $\mathcal{U}_i$. The observation space $\mathcal{O} = \mathcal{O}_1 \times \cdots \times \mathcal{O}_N$ denotes the collection of individual observation spaces $\mathcal{O}_i$. The transition probability function is denoted by $\mathcal{P}$, the reward function associated with agent i by $\mathcal{R}_i$, and the discount factor is $\gamma \in [0, 1)$.
When agents face a cooperative task with a shared reward function, the POMG is then known as decentralized partially observable Markov decision process (dec-POMDP) (Bernstein et al. 2002; Oliehoek and Amato 2016). In partially observable domains, the inference of good policies is extended in complexity since the history of interactions becomes meaningful. Hence, the agents usually incorporate history-dependent policies $\pi_i(u_i \mid h_{i,t})$ with $h_{i,t} = (o_{i,1}, \ldots, o_{i,t})$, which map from a history of observations to a distribution over actions.
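In deep RL, such history-dependent policies are commonly realized by letting a recurrent network summarize the observation history into a hidden state. The sketch below is a minimal, hypothetical PyTorch module and does not reproduce the architecture of any specific paper.

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class RecurrentPolicy(nn.Module):
        """pi_i(u | h): maps an observation history, summarized by a GRU hidden
        state, to a distribution over the agent's individual actions."""

        def __init__(self, obs_dim, n_actions, hidden_dim=64):
            super().__init__()
            self.encoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, n_actions)

        def forward(self, obs, hidden=None):
            # obs: (batch, 1, obs_dim) -- the newest observation of the history
            summary, hidden = self.encoder(obs, hidden)
            logits = self.head(summary[:, -1])
            return Categorical(logits=logits), hidden

    # Two consecutive decision steps for one agent with a 10-dimensional observation.
    policy = RecurrentPolicy(obs_dim=10, n_actions=4)
    dist, h = policy(torch.randn(1, 1, 10))        # first observation, empty history
    action = dist.sample()                          # u ~ pi_i(. | h_t)
    dist, h = policy(torch.randn(1, 1, 10), h)      # next step reuses the hidden state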

Definition 9
Credit assignment problem In the fully-cooperative setting with joint reward signals, an individual agent cannot conclude the impact of its own action towards the team's success and, thus, faces a credit assignment problem.

In cooperative games, agents are encouraged to maximize a common goal through a joint reward signal. However, agents cannot ascertain their contribution to the eventual reward when they do not observe the joint action taken or deal with partial observations. Associating rewards to agents is known as the credit assignment problem (Chang et al. 2004; Weiß 1995; Wolpert and Tumer 1999).

Some of the above-introduced pathologies occur in all cooperative, competitive, and mixed tasks, whereas some pathologies like relative over-generalization, credit assignment, and mis-coordination are predominant issues in cooperative settings. To cope with these pathologies, tabular worlds such as variations of the climbing game are still commonly studied, since solutions are not yet found for all cases, e.g. when the environment exhibits reward stochasticity (Claus and Boutilier 1998). Thus, simple worlds remain a fertile ground for further research, especially for problems like shadowed equilibria, non-stationarity or alter-exploration problems, and continue to matter for modern deep learning approaches.

3 Analysis of training schemes
The training of multiple agents has long been a computational challenge (Becker et al. 2004; Nair et al. 2003). Since the complexity in the state and action space grows exponentially with the number of agents, even modern deep learning approaches may reach their limits. In this section, we describe training schemes that are used in practice for learning agent policies in the multi-agent setting similar to the ones described in Bono et al. (2019). We denote training as the process during which agents acquire data to build up experience and optimize their behavior with respect to the received reward signals. In contrast, we refer to test time as the step after the training when the learned policy is evaluated but not further refined. The training of agents can be broadly divided into two paradigms, namely centralized and distributed (Weiß 1995). If the training of agents is applied in a centralized manner, policies are updated based on the mutual exchange of information during the training. This additional information is then usually removed at test time. In contrast to the centralized scheme, the training can also be handled in a distributed fashion where each agent performs updates on its own and develops an individual policy without utilizing foreign information.

In addition to the training paradigm, agents may differ in the way they select actions. We recognize two execution schemes. Centralized execution describes that agents are guided from a centralized unit, which computes the joint actions for all agents. On the contrary, agents determine actions according to their individual policy for decentralized execution. An overview of the training schemes is depicted in Fig. 2 while Table 1 lists the reviewed literature of this section.

Fig. 2  Training schemes in the multi-agent setting. (Left) CTCE holds a joint policy for all agents. (Middle) Each agent updates its own individual policy in DTDE. (Right) CTDE enables agents to exchange additional information during training which is then discarded at test time
3.1 Distributed training
In distributed training schemes, agents learn independently of other agents and do not rely on explicit information exchange.

Definition 10
Distributed training decentralized execution (DTDE) Each agent i has an associated policy $\pi_i(u_i \mid o_i)$ which maps local observations to a distribution over individual actions. No information is shared between agents such that each agent learns independently.

The fundamental drawback of the DTDE paradigm is that the environment appears non-stationary from a single agent’s viewpoint because agents neither have access to the knowledge of others, nor do they perceive the joint action. The first approaches in this training scheme were studied in tabular worlds. The work by Tan (1993) investigated the question if independently learning agents can match with cooperating agents. The results showed that independent learners learn slower in tabular and deterministic worlds. Based on that, Claus and Boutilier (1998) examined both independent and joint-action learners in cooperative stochastic-form games and empirically showed that both types of learning can converge to an equilibrium in deterministic games. Subsequent works elaborated on the DTDE scheme in discretized worlds (Hu and Wellman 1998; Lauer and Riedmiller 2000).
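A minimal sketch of the independent-learner idea (in the spirit of Tan 1993 and Claus and Boutilier 1998) is tabular Q-learning run separately per agent, each treating the other learner as part of the environment. The environment is the climbing game introduced earlier, and all hyper-parameters are illustrative.

    import numpy as np

    R = np.array([[ 11, -30,   0],    # climbing game, shared payoff for both agents
                  [-30,   7,   6],
                  [  0,   0,   5]], dtype=float)

    rng = np.random.default_rng(0)
    n_actions, eps, alpha = 3, 0.2, 0.1
    Q = [np.zeros(n_actions), np.zeros(n_actions)]   # one independent Q-table per agent

    for episode in range(5000):
        # Each agent selects its action epsilon-greedily from its own Q-table,
        # ignoring the existence of the other learner (DTDE).
        joint = [int(np.argmax(q)) if rng.random() > eps else int(rng.integers(n_actions))
                 for q in Q]
        reward = R[joint[0], joint[1]]               # shared reward of the one-shot game
        for i in range(2):                            # independent Q-learning updates
            Q[i][joint[i]] += alpha * (reward - Q[i][joint[i]])

    print("greedy joint action:", [int(np.argmax(q)) for q in Q])
    # With exploration, the independent learners often miss the optimal joint
    # action (a, a) and settle on a sub-optimal equilibrium instead.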

More recent works report that distributed training schemes scale poorly with the number of agents due to the extra sample complexity, which is added to the learning problem. Gupta et al. (2017) showed that distributed methods have inferior performance compared to policies that are trained with a centralized training paradigm. Similarly, Foerster et al. (2018b) showed that the speed of independently learning actor-critic methods is slower than using centralized training. In further works, DTDE has been applied to cooperative navigation tasks (Chen et al. 2016; Strouse et al. 2018), to partially observable domains (Dobbe et al. 2017; Nguyen et al. 2017b; Srinivasan et al. 2018), and to social dilemmas (Leibo et al. 2017).

Due to limited information in the distributed setting, independent learners are confronted with several pathologies (Matignon et al. 2012b). Besides non-stationarity, environments may exhibit stochastic transitions or stochastic rewards, which further complicates learning. In addition to that, the search for an optimal policy influences the other agents’ decision-making, which may lead to action shadowing and impacts the balance between exploration and knowledge exploitation.

A line of recent works expands independent learners with techniques to cope with the aforementioned MARL pathologies in cooperative domains. First, Omidshafiei et al. (2017) introduced a decentralized experience replay extension called Concurrent Experience Replay Trajectories (CERT) that enables independent learners to face a cooperative and partially observable setting by rendering samples more stable and efficient. Similarly, Palmer et al. (2018) extended the experience replay of Deep Q-Networks with leniency, which associates stored state-action pairs with decaying temperature values that govern the amount of applied leniency. They showed that this induces optimism in value function updates and can overcome relative over-generalization. Another work by Palmer et al. (2019) proposed negative update intervals double-DQN as a mechanism that identifies and removes generated data from the replay buffer that leads to mis-coordination. Likewise, Lyu and Amato (2020) proposed decentralized quantile estimators which identify non-stationary transition samples based on the likelihood of returns. Another work that aims to improve upon independent learners can be found in Zheng et al. (2018a) who used two auxiliary mechanisms, including a lenient reward approximation and a prioritized replay strategy.

A different research direction can be seen in distributed population-based training schemes where agents are optimized through an online evolutionary process such that under-performing agents are substituted by mutated versions of better agents (Jaderberg et al. 2019; Liu et al. 2019).

3.2 Centralized training
The centralized training paradigm describes agent policies that are updated based on mutual information. While the sharing of mutual information between agents is enabled during the training, this additional information is then discarded at test time. The centralized training can be further differentiated into the centralized and decentralized execution scheme.

Definition 11
Centralized training centralized execution (CTCE) The CTCE scheme describes a centralized executor $\pi(\mathbf{u} \mid o_1, \ldots, o_N)$ modeling the joint policy that maps the collection of distributed observations to a set of distributions over individual actions.

Some applications assume an unconstrained and instantaneous information exchange between agents. In such a setting, a centralized executor can be leveraged to learn the joint policy for all agents. The CTCE paradigm allows the straightforward employment of single-agent training methods such as actor-critics (Mnih et al. 2016) or policy gradient algorithms (Schulman et al. 2017) to multi-agent problems. An obvious flaw is that state-action spaces grow exponentially with the number of agents. To address the so-called curse of dimensionality, the joint model can be factored into individual policies for each agent. Gupta et al. (2017) represented the centralized executor as a set of independent sub-policies such that agents' individual action distributions are captured rather than the joint action distribution of all agents, i.e. the joint action distribution is factored into independent action distributions $\pi(\mathbf{u} \mid x) = \prod_{i=1}^{N} \pi_i(u_i \mid x)$. Next to the policy, the value function can be factored so that the joint value is decomposed into a sum of local value functions, e.g. the joint action-value function can be expressed by $Q(x, \mathbf{u}) = \sum_{i=1}^{N} Q_i(x, \mathbf{u})$ as shown in Russell and Zimdars (2003). A recent approach for the value function factorization is investigated in Sunehag et al. (2018). However, a phenomenon called lazy agents may occur in the CTCE setting when one agent learns a good policy but a second agent has less incentive to learn a good policy, as its actions may hinder the first agent, resulting in a lower reward (Sunehag et al. 2018).
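A hedged sketch of such a factored centralized executor is given below: a single network consumes the concatenation of all agents' observations and emits one independent categorical head per agent, so the joint distribution is the product of per-agent distributions. The class name, sizes and layer choices are assumptions for illustration, not the architecture of Gupta et al. (2017).

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class FactoredCentralExecutor(nn.Module):
        """Centralized executor whose joint policy factorizes into independent
        per-agent action distributions: pi(u | x) = prod_i pi_i(u_i | x)."""

        def __init__(self, n_agents, obs_dim, n_actions, hidden=128):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(n_agents * obs_dim, hidden), nn.ReLU())
            self.heads = nn.ModuleList(
                [nn.Linear(hidden, n_actions) for _ in range(n_agents)])

        def forward(self, all_obs):
            # all_obs: (batch, n_agents * obs_dim) -- concatenated observations
            z = self.body(all_obs)
            return [Categorical(logits=head(z)) for head in self.heads]

    executor = FactoredCentralExecutor(n_agents=3, obs_dim=8, n_actions=5)
    dists = executor(torch.randn(1, 3 * 8))
    joint_action = [d.sample() for d in dists]                 # one action per agent
    joint_log_prob = sum(d.log_prob(u) for d, u in zip(dists, joint_action))
    # log pi(u | x) = sum_i log pi_i(u_i | x), usable in any policy-gradient update.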

Although CTCE regards the learning problem as a single-agent case, we include the paradigm in this paper because the training schemes presented in the subsequent sections occasionally use CTCE as performance baseline and conduct comparisons.

Definition 12
Centralized training decentralized execution (CTDE) Each agent i holds an individual policy $\pi_i(u_i \mid o_i)$ which maps local observations to a distribution over individual actions. During training, agents are endowed with additional information, which is then discarded at test time.

The CTDE paradigm presents the state-of-the-art practice for learning with multiple agents (Kraemer and Banerjee 2016; Oliehoek et al. 2008). In classical MARL, such setting was utilized as joint action learners which has the advantage that perceiving joint actions a-posteriori discards the non-stationarity in the environment (Claus and Boutilier 1998). As of late, CTDE has been successful in MADRL approaches (Foerster et al. 2016; Jorge et al. 2016). Agents utilize shared computational facilities or other forms of communication to exchange information during training. By sharing mutual information, the training process can be eased and the learning speed can become superior when matched against independently trained agents (Foerster et al. 2018b). Moreover, agents can bypass non-stationarity when extra information about the selected actions is available to all agents during training such that the consequences of actions can be attributed to the respective agents. In what follows, we classify the CTDE literature according to the agent structure.

Homogeneous agents exhibit a common structure or the same set of skills, e.g. they use the same learning model or share common goals. Owning the same structure, agents can share parts of their learning model or experience with other agents. These approaches can scale well with the number of agents and may allow an efficient learning of behaviors. Gupta et al. (2017) showed that policies based on parameter sharing can be trained more efficiently and, thus, can outperform independently learned ones. Although agents own the same policy network, different agent behaviors can emerge because each agent perceives different observations at test time. It has been thoroughly demonstrated that parameter sharing can help to accelerate the learning progress (Ahilan and Dayan 2019; Chu and Ye 2017; Peng et al. 2017; Sukhbaatar et al. 2016; Sunehag et al. 2018). Next to parameter sharing, homogeneous agents can employ value-based methods where an approximation of the value function is learned based on mutual information. Agents profit from the joint actions and other agents' policies that are available during training and incorporate this extra information into centralized value functions (Foerster et al. 2016; Jorge et al. 2016). Such information is then discarded at test time. Many approaches consider the decomposition of a joint value function into combinations of individual value functions (Castellini et al. 2019; Rashid et al. 2018; Son et al. 2019; Sunehag et al. 2018). Through decomposition, each agent faces a simplified sub-problem of the original problem. Sunehag et al. (2018) showed that agents learning on local sub-problems scale better with the number of agents than CTCE or independent learners. We elaborate on value function-based factorization in more detail in Sect. 5.4 as an effective approach to tackle credit assignment problems.
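The additive decomposition of Sunehag et al. (2018) can be sketched as follows: per-agent utility networks (here with shared parameters) are summed into a joint action value, and only this sum is trained against a joint temporal-difference target. Network sizes, the dummy batch and the omission of a target network are simplifying assumptions.

    import torch
    import torch.nn as nn

    class SharedUtilityNet(nn.Module):
        """Per-agent utility Q_i(o_i, u_i); parameters are shared across agents."""
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))

        def forward(self, obs):                  # obs: (batch, n_agents, obs_dim)
            return self.net(obs)                 # -> (batch, n_agents, n_actions)

    def q_tot(utilities, actions):
        """Value decomposition: Q_tot(x, u) = sum_i Q_i(o_i, u_i)."""
        chosen = utilities.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        return chosen.sum(dim=1)                 # sum over the agent dimension

    # Simplified TD update on a dummy batch (2 agents, 8-dim observations, 4 actions).
    net = SharedUtilityNet(obs_dim=8, n_actions=4)
    obs, next_obs = torch.randn(32, 2, 8), torch.randn(32, 2, 8)
    actions = torch.randint(0, 4, (32, 2))
    team_reward, gamma = torch.randn(32), 0.95

    with torch.no_grad():                        # greedy joint target, no target network
        target = team_reward + gamma * net(next_obs).max(dim=-1).values.sum(dim=1)
    loss = ((q_tot(net(obs), actions) - target) ** 2).mean()
    loss.backward()                              # gradients flow into the shared network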

Heterogeneous agents, on the contrary, differ in structure and skill. An instance of heterogeneous policies can be seen in the extension of an actor-critic approach with a centralized critic, which allows information sharing to amplify the performance of individual agent policies. These methods can be distinguished from each other based on the representation of the critic. Lowe et al. (2017) utilized one centralized critic for each agent that is augmented with additional information during training. The critics are provided with information about every agent's policy, whereas the actors perceive only local observations. As a result, the agents do not depend on explicit communication and can overcome the non-stationarity in the environment. Likewise, Bono et al. (2019) trained multiple agents with individual policies that share information with a centralized critic and demonstrated that such setup might improve results on standard benchmarks. Besides the utilization of one critic for each agent, Foerster et al. (2018b) applied one centralized critic for all agents to estimate a counterfactual baseline function that marginalizes out a single agent's action. The critic is conditioned on the history of all agents' observations or, if available, on the true global state. Typically, actor-critic methods suffer from variance in the critic estimation, which is further exacerbated by the number of agents. Therefore, Wu et al. (2018) proposed an action-dependent baseline which includes information from other agents to reduce the variance in the critic estimation function. Further works that incorporate one centralized critic for distributed policies can be found in Das et al. (2019), Iqbal and Sha (2019) and Wei et al. (2018).
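A hedged sketch of the centralized-critic idea (in the spirit of Lowe et al. 2017): each agent keeps a decentralized actor conditioned only on its local observation, while its critic, used during training only, receives all observations and all actions. Dimensions, the dummy batch and the absence of target networks are simplifications.

    import torch
    import torch.nn as nn

    n_agents, obs_dim, act_dim = 2, 8, 2

    class Actor(nn.Module):
        """Decentralized actor: continuous action from the local observation only."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, act_dim), nn.Tanh())
        def forward(self, obs):
            return self.net(obs)

    class CentralCritic(nn.Module):
        """Centralized critic Q_i(o_1, ..., o_N, u_1, ..., u_N), training only."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_agents * (obs_dim + act_dim), 64),
                                     nn.ReLU(), nn.Linear(64, 1))
        def forward(self, all_obs, all_actions):
            return self.net(torch.cat([all_obs, all_actions], dim=-1))

    actors = [Actor() for _ in range(n_agents)]
    critics = [CentralCritic() for _ in range(n_agents)]   # one critic per agent

    obs = torch.randn(32, n_agents, obs_dim)                # dummy batch of observations
    actions = torch.stack([actors[i](obs[:, i]) for i in range(n_agents)], dim=1)
    all_obs, all_act = obs.reshape(32, -1), actions.reshape(32, -1)

    # Policy objective for agent 0: maximize its centralized critic's estimate.
    # The extra information (other agents' observations and actions) is available
    # during training only and is discarded at execution time.
    actor_loss = -critics[0](all_obs, all_act).mean()
    actor_loss.backward()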

Another way to perform decentralized execution is by employing a master-slave architecture, which can resolve coordination conflicts between multiple agents. Kong et al. (2017) applied a centralized master executor which shares information with decentralized slaves. In each time step, the master receives local information from the slaves and shares its internal state in return. The slaves compute actions conditioned on their local observation and the master’s internal state. Similar approaches that make use of different levels of abstraction are hierarchical methods (Kumar et al. 2017) that operate at different time scales or levels of abstraction. We elaborate on hierarchical methods in more detail in Sect. 5.3.

Table 1 Overview of training schemes applied in recent MADRL works
4 Emergent patterns of agent behavior
Agents adjust their policy to maximize the task success and react to the behavioral changes of other agents. The dynamic interaction between multiple decision-makers, which simultaneously affects the state of the environment, can cause the emergence of specific behavioral patterns. An obvious way to influence the development of agent behavior is through the designed reward structure. By promoting incentives for cooperation, agents can learn team strategies where they try to collaborate and optimize upon a mutual goal. Agents support other agents since the cumulative reward for cooperation is greater than acting selfishly. On the contrary, if the appeals for maximizing the individual performance are larger than being cooperative, agents can learn greedy strategies and maximize their individual reward. Such competitive attitudes can yield high-level strategies like manipulating adversaries to gain an advantage. However, the boundaries between competition and cooperation can be blurred in the multi-agent setting. For instance, if one agent competes with other agents, it is sometimes useful to cooperate temporarily in order to receive a higher reward in the long run.

In this section, we review the literature that is interested in developed agent behaviors. We differentiate occurring behaviors according to the reward structure (Sect. 4.1), the language between agents (Sect. 4.2), and the social context (Sect. 4.3). Table 2 summarizes the reviewed literature based on this classification. Note that we focus in this section not on works that introduce new methodologies but on literature that analyzes the emergent behavioral patterns.

4.1 Reward structure
The primary factor that influences the emergence of agent behavior is the reward structure. If the reward for mutual cooperation is larger than individual reward maximization, agents tend to learn policies that seek to collaboratively solve the task. In particular, Leibo et al. (2017) compared the magnitude of the team reward in relation to the individual agent reward. They showed that the higher the numerical team reward is compared to the individual reward, the greater is the willingness to collaborate with other agents. The work by Tampuu et al. (2017) demonstrated that punishing the whole team of agents for the failure of a single agent can also cause cooperation. Agents learn policies to avoid the malfunction of an individual, support other agents to prevent failure, and improve the performance of the whole team. Similarly, Diallo et al. (2017) used the Pong video game to investigate the coordination between agents and examined how the developed behaviors change with the reward function. For a comprehensive review of learning in cooperative settings, one can consider the article by Panait and Luke (2005) for classical MARL and Oroojlooyjadid and Hajinezhad (2019) for recent MADRL.

In contrast to the cooperative scenario, one can value individual performance greater than the collaboration among agents. A competitive setting motivates agents to outperform their adversary counterparts. Tampuu et al. (2017) used the video game Pong and manipulated the rewarding structure to examine the emergence of agent behavior. They showed that the higher the reward for competition, the more likely an agent tries to outplay its opponents by using techniques such as wall bouncing or faster ball speed. Employing such high-level strategies to overwhelm the adversary maximizes the individual reward. Similarly, Bansal et al. (2018) investigated competitive scenarios, where agents competed in a 3D world with simulated physics to learn locomotion skills such as running, blocking, or tackling other agents with arms and legs. They argued that adversarial training could help to learn more complex agent behaviors than the environment can exhibit. Likewise, the works of Leibo et al. (2017) and Liu et al. (2019) investigated the emergence of behaviors due to the reward structure in competitive scenarios.

If rewards occur only sparsely, agents can be equipped with intrinsic reward functions that provide denser feedback signals and, thus, can overcome the sparsity or even the absence of external rewards. One way to realize this is intrinsic motivation, which is based on the concept of maximizing an internal reinforcement signal by actively discovering novel or surprising patterns (Chentanez et al. 2005; Oudeyer and Kaplan 2007; Schmidhuber 2010). Intrinsic motivation encourages agents to explore states that have been scarcely or never visited and to perform novel actions in those states. Most approaches to intrinsic motivation can be broadly divided into two categories (Pathak et al. 2017). First, agents are encouraged to explore unknown states, where the novelty of states is measured by a model that captures the distribution of visited environment states (Bellemare et al. 2016). Second, agents can be motivated to reduce the uncertainty about the consequences of their own actions. The agent builds a model that learns the dynamics of the environment by lowering the prediction error of the follow-up states with respect to the taken actions. The uncertainty indicates the novelty of new experience since the model can only be accurate in states which it has already encountered or can generalize from previous knowledge (Houthooft et al. 2016; Pathak et al. 2017). For a recent survey on intrinsic motivation in RL, see the paper by Aubret et al. (2019). The concept of intrinsic motivation was transferred to the multi-agent domain by Sequeira et al. (2011), who studied the motivational impact on multiple agents. Investigations on the emergence of agent behavior based on intrinsic rewards have been conducted extensively in Baker et al. (2020), Hughes et al. (2018), Jaderberg et al. (2019), Jaques et al. (2018), Jaques et al. (2019), Peysakhovich and Lerer (2018), Sukhbaatar et al. (2017), Wang et al. (2019) and Wang et al. (2020b).
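
As a minimal illustration of the first category, the sketch below adds a count-based novelty bonus to the external reward; the class name, the `beta` coefficient, and the 1/sqrt(count) schedule are illustrative assumptions, and the state is assumed to be discretizable.

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Minimal count-based intrinsic reward (sketch of the first category above).

    The bonus decays with the visitation count of a discretized state, so rarely
    visited states yield a larger internal reinforcement signal.
    """
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def intrinsic_reward(self, state):
        key = tuple(state)                 # assumes a discretizable state
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])

bonus = CountBasedBonus(beta=0.1)
total_reward = 0.0 + bonus.intrinsic_reward((2, 3))  # external reward + intrinsic bonus
```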

Table 2 Overview of MADRL papers that investigate emergent patterns of agent behavior
4.2 Language
The development of language corpora and communication skills of autonomous agents attracts great attention within the community. For one, the behavior that emerges during the deployment of abstract language as well as the learned composition of multiple words to form meaningful contexts is of interest (Kirby 2002). Deep learning methods have widened the scope of computational methodologies for investigating the development of language between dynamic agents (Lazaridou and Baroni 2020). For building rich behaviors and complex reasoning, communication based on high-dimensional data like visual perception is a widespread practice (Antol et al. 2015). In the following, we focus on works that investigate the emergence of language and analyze behavior. Papers that propose new methodologies for developing communication protocols are discussed in Sect. 5.2. We classify the learning of language according to the performed task and the type of interaction the agents pursue. In particular, we differentiate between referential games and dialogues.

The former, referential games, describe cooperative games where the speaking agent communicates an objective via messages to another listening agent. Lazaridou et al. (2017) showed that agents could learn communication protocols solely through interaction. For a meaningful information exchange, agents evolved semantic properties in their language. A key element of the study was to analyze if the agents’ interactions are interpretable for humans, showing limited yet encouraging results. Likewise, Mordatch and Abbeel (2018) investigated the emergence of abstract language that arises through the interaction between agents in a physical environment. In their experiments, the agents should learn a discrete set of vocabulary by solving navigation tasks through communication. By involving more than three agents in the conversation and by penalizing an arbitrary size of vocabulary, agents agreed on a coherent set of vocabulary and discouraged ambiguous words. They also observed that agents learned a syntax structure in the communication protocol that is consistent in vocabulary usage. Another work by Li and Bowling (2019) found out that compositional languages are easier to communicate with other agents than languages with less structure. In addition, changing listening agents during the learning can promote the emergence of language grounded on a higher degree of structure. Many studies are concerned with the development of communication in referential games grounded on visual perception as it can be found in Choi et al. (2018), Evtimova et al. (2018), Havrylov and Titov (2017), Jorge et al. (2016), Lazaridou et al. (2018) and Lee et al. (2017). Further works consider the development of communication in social dilemmas (Jaques et al. 2018, 2019).
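
The following toy sketch illustrates the basic structure of such a referential game; the random speaker and listener policies are placeholders for the learned (e.g. policy-gradient-trained) agents of the cited works, and all names and sizes are assumptions for illustration.

```python
import random

def referential_game_round(num_candidates=3, vocab_size=5):
    """One round of a toy referential game (illustrative setup, not a cited model).

    A speaker observes which of `num_candidates` objects is the target and emits
    a discrete symbol; a listener sees only the candidates and the symbol and
    must point at the target. Both policies are random placeholders.
    """
    target = random.randrange(num_candidates)

    # Speaker policy: maps the target index to a symbol (placeholder: random choice).
    symbol = random.randrange(vocab_size)

    # Listener policy: maps the symbol to a guessed candidate (placeholder: random choice).
    guess = random.randrange(num_candidates)

    return 1.0 if guess == target else 0.0   # shared reward for both agents

print(sum(referential_game_round() for _ in range(1000)) / 1000)  # ~1/3 for random play
```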

As the second category, we describe the emergence of behavioral patterns in communication while conducting dialogues. One type of dialogue is negotiation, in which agents seek to agree on decisions. In a study about negotiations with natural language, Lewis et al. (2017) showed that agents could master linguistic and reasoning problems. Two agents were both shown a collection of items and were instructed to negotiate how to divide the objects among them. Each agent was expected to maximize the value of the bargained objects. Eventually, the agents learned to use high-level strategies such as deception to accomplish higher rewards over their opponents. Similar studies concerned with negotiations are covered in Cao et al. (2018) and He et al. (2018). Another type of dialogue comprises scenarios where the emergence of communication is investigated in a question-answering style, as shown by Das et al. (2017). One agent received an image as input and was instructed to ask questions about the shown image while the second agent responded, both in natural language.

Many of the above-mentioned papers report that utilizing a communication channel can increase task performance in terms of the cumulative reward. However, numerical performance measurements provide evidence but do not give insights about the communication abilities learned by the agents. Therefore, Lowe et al. (2019) surveyed metrics which are applied to assess the quality of learned communication protocols and provided recommendations about the usage of such metrics. Based on that, Eccles et al. (2019) proposed to incorporate inductive bias into the learning objective of agents, which could promote the emergence of a meaningful communication. They showed that inductive bias could lead to improved results in terms of interpretability.

4.3 Social context
Next to the reward structure and language, the research community actively investigates the emerging agent behaviors in social contexts. Akin to humans, artificial agents can develop strategies that exploit patterns in complex problems and adapt behaviors in response to others (Baker et al. 2020; Jaderberg et al. 2019). We differentiate the following literature along different dimensions, such as the type of social dilemma and the examined psychological variables.

Social dilemmas have long been studied as conflict scenarios in which agents weigh individualistic against collective profits (Crandall and Goodrich 2011; De Cote et al. 2006). The tension between cooperation and defection is evaluated as an atomic decision according to the numerical values of a pay-off matrix. This pay-off matrix satisfies inequalities in the reward function such that agents must decide between cooperation, to benefit as a whole team, and defection, to maximize selfish performance. To temporally extend matrix games, sequential social dilemmas have been introduced to investigate long-term strategic decisions of agent policies rather than short-term actions (Leibo et al. 2017). The behaviors arising in these dilemmas can be classified along psychological variables known from human interaction (Lange et al. 2013) such as the gain of individual benefits (Lerer and Peysakhovich 2017), the fear of future consequences (Pérolat et al. 2017), the assessment of the impact on another agent’s behavior (Jaques et al. 2018, 2019), the trust between agents (Pinyol and Sabater-Mir 2013; Ramchurn et al. 2004; Yu et al. 2013), and the impact of emotions on the decision-making (Moerland et al. 2018; Yu et al. 2013).
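
As a concrete instance of such a pay-off structure, the canonical prisoner's dilemma can be written down with the standard inequalities; the sketch below only encodes the matrix and checks the dilemma conditions and is not tied to any particular cited environment.

```python
# Canonical prisoner's dilemma pay-offs (row player), a standard example of the
# inequalities mentioned above: T > R > P > S and 2R > T + S.
T, R, P, S = 5, 3, 1, 0   # temptation, mutual reward, punishment, sucker's pay-off

payoff = {
    ("C", "C"): (R, R),   # both cooperate
    ("C", "D"): (S, T),   # row cooperates, column defects
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),   # both defect
}

assert T > R > P > S and 2 * R > T + S   # the dilemma conditions hold
# Defection dominates for a single decision, yet mutual cooperation (R, R)
# yields a higher joint return than mutual defection (P, P).
```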

Kollock (1998) divided social dilemmas into commons dilemmas and public goods dilemmas. The former, commons dilemmas, describe the trade-off between individualistic short-term benefits and long-term common interests on a task that is shared by all agents. Recent works on the commons dilemma can be found in Foerster et al. (2018a), Leibo et al. (2017) and Lerer and Peysakhovich (2017). In public goods dilemmas, agents face a scenario where common-pool resources are constrained, which obliges a sustainable use of resources. The phenomenon called the tragedy of the commons predicts that self-interested agents fail to find socially positive equilibria, which eventually results in the over-exploitation of the common resources (Hardin 1968). Investigations on trial-and-error learning in common-pool resource scenarios with multiple decision-makers are covered in Hughes et al. (2018), Pérolat et al. (2017) and Zhu and Kirley (2019).

5 Current challenges
In this section, we describe several challenges that arise in the multi-agent RL domain and, thus, are currently under active research. We approach the problem of non-stationarity (Sect. 5.1), which is due to the presence of multiple learners in a shared environment, and review literature regarding the development of communication skills (Sect. 5.2). We further investigate the challenge of learning coordination (Sect. 5.3). Then, we survey the difficulty of attributing rewards to specific agents, known as the credit assignment problem (Sect. 5.4), and examine scalability issues (Sect. 5.5), which grow with the number of agents. Finally, we consider environments where states are only partially observable (Sect. 5.6). While some challenges are omnipresent in the MARL domain, such as non-stationarity or scalability, others like the credit assignment problem or the learning of coordination and communication prevail primarily in the cooperative setting.

We aim to provide a holistic overview of the contemporary challenges that constitute the landscape of reinforcement learning with multiple agents and survey treatments that were suggested in recent works. In particular, we focus on those challenges which are currently under active research and where progress has been accomplished recently. There are still open problems that have so far been addressed only partially or not at all. Such problems are discussed in Sect. 6. We deliberately do not regard challenges that also persist in the single-agent domain, such as sparse rewards or the exploration-exploitation dilemma. For an overview of those topics, we refer the interested reader to the articles of Arulkumaran et al. (2017) and Li (2018). Much of the surveyed literature cannot be assigned to one particular challenge but rather to several of the proposed challenges. Hence, we associate the subsequent literature with the one challenge which we believe it addresses best (Table 3).

Table 3 Overview of MADRL challenges and approaches proposed in recent literature
5.1 Non-stationarity
One major problem resides in the presence of multiple agents that interact within a shared environment and learn simultaneously. Due to this co-adaptation, the environment dynamics appear non-stationary from the perspective of a single agent. Thus, agents face a moving-target problem if they are not provided with additional knowledge about other agents. As a result, the Markov assumption is violated, and the learning constitutes an inherently difficult problem (Hernandez-Leal et al. 2017; Laurent et al. 2011). The naïve approach is to neglect the adaptive behavior of agents. One can either ignore the existence of other agents (Matignon et al. 2012b) or discount the adaptive behavior by assuming the others’ behavior to be static or optimal (Lauer and Riedmiller 2000). By making such assumptions, the agents are considered as independent learners, and traditional single-agent reinforcement learning algorithms can be applied. First attempts were studied in Claus and Boutilier (1998) and Tan (1993), which showed that independent learners can perform well in simple deterministic environments. However, in complex or stochastic environments, independent learners often yield poor performance (Lowe et al. 2017; Matignon et al. 2012b). Moreover, Lanctot et al. (2017) argued that independent learners could over-fit to other agents’ policies during training and, thus, may fail to generalize at test time.

In the following, we review literature which addresses the non-stationarity in a multi-agent environment and categorize the approaches into those based on experience replay, centralized units, and meta-learning. A similar categorization was proposed by Papoudakis et al. (2019). We identify further approaches which cope with non-stationarity by establishing communication between agents (Sect. 5.2) or by building models of other agents (Sect. 5.3). However, we discuss these topics separately in the respective sections.

Experience replay mechanism Recent successes with reinforcement learning methods such as deep Q-networks (Mnih et al. 2015) rest upon an experience replay mechanism. However, it is not straightforward to employ experience replay in the multi-agent setting because past experience becomes obsolete as agent policies adapt over time. To counter this, Foerster et al. (2017) proposed two approaches. First, they decay outdated transition samples from the replay memory to stabilize targets and use importance sampling to incorporate off-policy samples. Since the agents’ policies are known during training, off-policy updates can be corrected with importance-weighted policy likelihoods. Second, the state space of each agent is enhanced with estimates of the other agents’ policies, so-called fingerprints, to prevent non-stationarity. The value functions can then be conditioned on a fingerprint, which disambiguates the age of the data sampled from the replay memory. Another extension of experience replay was proposed by Palmer et al. (2018), who applied leniency to every stored transition sample. Leniency associates each sample of the experience memory with a temperature value, which gradually decays with the number of state-action pair visits. Further utilization of the experience replay mechanism to cope with non-stationarity can be found in Tang et al. (2018) and Zheng et al. (2018a). Nevertheless, if the contemporary dynamics of the learners are neglected, algorithms can utilize short-term buffers as applied in Baker et al. (2020) and Leibo et al. (2017).
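
A minimal sketch of the fingerprint idea is given below: each stored transition is tagged with quantities that summarize the age of the data, here assumed to be the training iteration and the exploration rate. The class and its fields are illustrative and do not reproduce the implementation of Foerster et al. (2017).

```python
import random
from collections import deque

class FingerprintReplayBuffer:
    """Replay buffer that stores a 'fingerprint' with every transition (sketch)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done, train_iter, epsilon):
        fingerprint = (train_iter, epsilon)   # assumed fingerprint contents
        self.buffer.append((obs, action, reward, next_obs, done, fingerprint))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# The Q-network input would then be the concatenation of the local observation
# and the stored fingerprint, so outdated samples can be disambiguated.
```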

Centralized Training Scheme As already discussed in Sect. 3.2, the CTDE paradigm can be leveraged to share mutual information between learners to ease training. The availability of information during training can alleviate the non-stationarity of the environment since agents are augmented with information about others. One approach is to enhance actor-critic methods with centralized critics through which information is shared between agents during training (Bono et al. 2019; Iqbal and Sha 2019; Wei et al. 2018). Lowe et al. (2017) equipped each agent with a centralized critic that is augmented with all agents’ observations and actions. Based on this additional information, agents face a stationary environment during training while acting in a decentralized fashion on local observations at test time. Instead of one critic per agent, all agents can also share one global centralized critic. Foerster et al. (2018b) applied one centralized critic conditioned on the joint action and the observations of all agents. The critic computes an agent’s individual advantage by estimating the value of the joint action relative to a counterfactual baseline, which marginalizes out the single agent’s influence. Another approach to the CTDE scheme can be seen in value-based methods. Rashid et al. (2018) learned a joint action-value function conditioned on the joint observation-action history. The joint action-value function is then decomposed into individual agent value functions through a monotonic non-linear combination. Foerster et al. (2016) used action-value functions that share information through a communication channel during training but discard it at test time. Similarly, Jorge et al. (2016) employed communication during training to promote information exchange for optimizing action-value functions.
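
The following PyTorch sketch illustrates the basic CTDE structure with decentralized actors and a centralized critic over joint observations and actions; the network sizes, the number of agents, and the continuous (tanh) actions are illustrative assumptions, not the exact architecture of the cited methods.

```python
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 3, 8, 2   # illustrative sizes

class DecentralizedActor(nn.Module):
    """Actor conditioned only on the agent's local observation (used at test time)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Critic conditioned on the joint observations and actions of all agents
    (available during training only)."""
    def __init__(self):
        super().__init__()
        joint_dim = N_AGENTS * (OBS_DIM + ACT_DIM)
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, joint_obs, joint_actions):
        return self.net(torch.cat([joint_obs, joint_actions], dim=-1))

# Minimal forward pass with a batch of one
actors = [DecentralizedActor() for _ in range(N_AGENTS)]
obs = [torch.randn(1, OBS_DIM) for _ in range(N_AGENTS)]
actions = [actor(o) for actor, o in zip(actors, obs)]
critic = CentralizedCritic()
q_value = critic(torch.cat(obs, dim=-1), torch.cat(actions, dim=-1))
```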

Meta-Learning Sometimes it can be useful to learn how to adapt to the behavioral changes of others. This learning-to-learn approach is known as meta-learning (Finn and Levine 2018; Schmidhuber et al. 1996). Recent works in the single-agent domain have shown promising results (Duan et al. 2016; Wang et al. 2016a). Al-Shedivat et al. (2018) transferred this approach to the multi-agent domain and developed a meta-learning based method to tackle the continuous adaptation of agents in non-stationary environments. Treating non-stationarity as a sequence of stationary tasks, agents learn to exploit dependencies between successive tasks and to generalize over co-adapting agents at test time. They evaluated the resulting behaviors in a competitive multi-agent setting where agents fight in a simulated physics environment. Meta-learning can also be utilized to construct agent models (Rabinowitz et al. 2018). By learning how to model other agents and make inferences about them, agents learn to predict the other agents’ future action sequences and can thus capture their behavioral patterns efficiently.

5.2 Learning communication
Developing agents that are capable of communication and of forming language corpora poses one of the vital challenges in machine intelligence (Kirby 2002). Intelligent agents must not only decide what to communicate but also when and with whom. It is indispensable that the developed language is grounded in a common consensus such that all agents understand the spoken language, including its semantics. Research efforts in learning to communicate have intensified because many pathologies can be overcome by incorporating communication skills into agents, including non-stationarity, coherent coordination among agents, and partial observability. For instance, when an agent knows the actions taken by others, the learning problem becomes stationary again from a single agent’s perspective in a fully observable environment. Even partial observability can be mitigated by communicating local observations to other participants, which helps compensate for limited knowledge (Goldman and Zilberstein 2004).

The common framework to investigate communication is the dec-POMDP (Oliehoek and Amato 2016), a fully cooperative setting where agents perceive partial observations of the environment and try to improve upon an equally-shared reward. In such distributed systems, agents must not only learn how to cooperate but also how to communicate in order to optimize the mutual objective. Early MARL works investigated communication rooted in tabular worlds with limited observability (Kasai et al. 2008). Since the rise of deep learning methods, the research on learning communication has attracted great attention because advanced computational methods provide new opportunities to study highly complex data.

In the following, we categorize the surveyed literature according to how messages are addressed. First, we describe the broadcasting scenario where sent messages are received by all agents. Second, we look into works that use targeted messages and decide on the recipients by using an attention mechanism. Third and last, we review communication in networked settings where agents communicate only with their local neighborhood instead of the whole population. Figure 3 shows a schematic illustration of this categorization. Another taxonomy may be based on the discrete or continuous nature of messages and on the frequency of the passed messages.

Fig. 3 Schematic illustration of communication types. Unilateral arrows represent unidirectional messages, while bilateral arrows symbolize bidirectional message passing. (Left) In broadcasting, messages are sent to all participants of the communication channel; for better visualization, the broadcasting of only one agent is illustrated, but each agent can broadcast messages to all other agents. (Middle) Agents can target their communication through an attention mechanism that determines when, what and with whom to communicate. (Right) Networked communication describes the local connection to neighborhood agents

Broadcasting Messages are addressed to all participants of the communication channel. Foerster et al. (2016) studied how agents learn discrete communication protocols in dec-POMDPs in order to accomplish a fully-cooperative task. In their CTDE setting, communication is not restricted during training but is bandwidth-limited at test time. To discover meaningful communication protocols, they proposed two methods. The first, reinforced inter-agent learning (RIAL), is based on deep recurrent Q-networks combined with independent Q-learning, where each agent learns an action-value function conditioned on the observation history as well as on messages from other agents. Additionally, they applied parameter sharing so that all agents share and update common features from only one Q-network. The second method, differentiable inter-agent learning (DIAL), combines the centralized learning paradigm with deep Q-networks. Messages are delivered over discrete connections, which are relaxed during training to become differentiable. In contrast, Sukhbaatar et al. (2016) proposed CommNet as an architecture that allows the learning of communication between agents purely based on continuous protocols. They showed that each agent learns the joint action and a sparse communication protocol that encodes meaningful information. The authors emphasized that decreased observability of the environment increases the importance of communication between agents. To foster scalable communication protocols that also facilitate heterogeneous agents, Peng et al. (2017) introduced the bidirectionally-coordinated network (BiCNet), where agents learn to communicate in a vectorized actor-critic framework. Through communication, they were able to coordinate heterogeneous agents in a combat game of StarCraft.
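
A simplified single communication step in the spirit of CommNet is sketched below: each agent's hidden state is combined with the mean of the other agents' hidden states, which serves as a continuous broadcast message. Layer sizes and the single-step setup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CommNetLayer(nn.Module):
    """One CommNet-style communication step (simplified sketch)."""
    def __init__(self, hidden_dim=32):
        super().__init__()
        self.self_fc = nn.Linear(hidden_dim, hidden_dim)
        self.comm_fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h):                      # h: (n_agents, hidden_dim)
        n = h.shape[0]
        total = h.sum(dim=0, keepdim=True)     # sum over all agents
        comm = (total - h) / max(n - 1, 1)     # mean of the *other* agents' states
        return torch.tanh(self.self_fc(h) + self.comm_fc(comm))

h = torch.randn(4, 32)                         # hidden states of four agents
h_next = CommNetLayer()(h)
```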

Targeted communication When agents are endowed with targeted communication protocols, they utilize an attention mechanism to determine when, what and with whom to communicate. Jiang and Lu (2018) introduced ATOC as an attentional communication model that enables agents to send messages dynamically and selectively so that communication takes place among a group of agents only when required. They argued that attention is essential for large-scale settings because agents learn to decide which information is most useful for decision-making. Selective communication is the reason why ATOC outperforms CommNet and BiCNet on the conducted navigation tasks. A similar conclusion was drawn by Hoshen (2017) who introduced the vertex attention interaction network (VAIN) as an extension to the CommNet. The baseline approach is extended with an attention mechanism that increases performance due to the focus on only relevant agents. The work by Das et al. (2019) introduced targeted multi-agent communication (TarMAC) that uses attention to decide with whom and what to communicate by actively addressing other agents for message passing. Jain et al. (2019) proposed TBONE for visual navigation in cooperative tasks. In contrast to former works, which are limited to the fully-cooperative setting, Singh et al. (2019) considered mixed settings where each agent owns an individual reward function. They proposed the individualized controlled continuous communication model (IC3Net), where agents learn when to exchange information using a gating mechanism that blocks incoming communication requests if necessary.
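
The sketch below shows the core of such an attention mechanism: each agent computes a query, compares it against the keys of all agents, and aggregates their value vectors with the resulting soft weights. This is a generic single-head illustration in the spirit of targeted schemes such as TarMAC; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AttentionMessaging(nn.Module):
    """Targeted message aggregation via attention (simplified sketch)."""
    def __init__(self, hidden_dim=32, key_dim=16):
        super().__init__()
        self.query = nn.Linear(hidden_dim, key_dim)
        self.key = nn.Linear(hidden_dim, key_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h):                              # h: (n_agents, hidden_dim)
        q, k, v = self.query(h), self.key(h), self.value(h)
        scores = q @ k.t() / k.shape[-1] ** 0.5        # (n_agents, n_agents)
        weights = torch.softmax(scores, dim=-1)        # how much agent i attends to agent j
        return weights @ v                             # aggregated incoming messages

incoming = AttentionMessaging()(torch.randn(4, 32))
```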

Networked communication Another form of communication is a networked communication protocol where agents can exchange information with their neighborhood (Nedic and Ozdaglar 2009; Zhang et al. 2018). Agents act in a decentralized manner based on local observations and on messages received from network neighbors. Zhang et al. (2018) used an actor-critic framework where agents share their critic information with their network neighbors to promote global optimality. Chu et al. (2020) introduced the neural communication protocol (NeurComm) to enhance communication efficiency by reducing queue length and intersection delay. Further, they showed that a spatial discount factor can stabilize training when only the local vicinity is considered for policy updates. For theoretical contributions, one may consider the works of Qu et al. (2020), Zhang et al. (2018) and Zhang et al. (2019), whereas the paper of Chu et al. (2020) provides an application perspective in the domain of traffic light control.

Extensions Further methods approach the improvement of coordination skills by applying intrinsic motivation (Jaques et al. 2018, 2019), by making the communication protocol more robust or scalable (Kim et al. 2019; Singh et al. 2019), and maximizing the utility of the communication through efficient encoding (Celikyilmaz et al. 2018; Li et al. 2019b; Wang et al. 2020c).

The above-reviewed papers focus on new methodologies about communication protocols. Besides that, a bulk of literature considers the analysis of emergent language and the occurrence of agent behavior, which we discuss in Sect. 4.2.

5.3 Coordination
Successful coordination in multi-agent systems requires agents to agree on a consensus (Wei Ren et al. 2005). In particular, accomplishing a joint goal in cooperative settings demands a coherent action selection such that the joint action optimizes the mutual task performance. Cooperation among agents is complicated when stochasticity is present in system transitions and rewards or when agents observe only partial information of the environment’s state. Mis-coordination may arise in the form of action shadowing when exploratory behavior influences the other agents’ search space during learning and, as a result, sub-optimal solutions are found.

Therefore, the agreement upon a mutual consensus necessitates the sharing and collection of information about other agents to derive optimal decisions. Finding such a consensus in the decision-making may happen explicitly through communication or implicitly by constructing models of other agents. The former requires skills to communicate with others so that agents can express their purpose and align their coordination. For the latter, agents need the ability to observe other agents’ behavior and reason about their strategies to build a model. If the prediction model is accurate, an agent can learn the other agents’ behavioral patterns and direct actions towards a consensus, leading to coordinated behavior. Besides explicit communication and constructing agent models, the CTDE scheme can be leveraged to build different levels of abstraction, which are applied to learn high-level coordination while independent skills are trained at low-level.

In the remainder of this section, we focus on methods that solve coordination issues without establishing communication protocols between agents. Although communication may ease coordination, we discuss this topic separately in Sect. 5.2.

Independent learners The naïve approach to handling multi-agent problems is to regard each agent individually such that other agents are perceived as part of the environment and, thus, are neglected during learning. In contrast to joint-action learners, where agents experience the selected actions of others a-posteriori, independently learning agents face the main difficulty of coherently choosing actions such that the joint action becomes optimal with respect to the mutual goal (Matignon et al. 2012b). While learning good policies, agents influence each other’s search space, which can lead to action shadowing. The notion of coordination among several autonomously and independently acting agents has a long record, and a large body of research was conducted in settings with non-communicative agents (Fulda and Ventura 2007; Matignon et al. 2012b). Early works investigated the convergence of independent learners and showed that convergence to solutions is feasible under certain conditions in deterministic games but fails in stochastic environments (Claus and Boutilier 1998; Lauer and Riedmiller 2000). Stochasticity, relative over-generalization, and other pathologies such as non-stationarity and the alter-exploration problem led to new branches of research including hysteretic learning (Matignon et al. 2007) and leniency (Potter and De Jong 1994). Hysteretic Q-learning was introduced to counter the over-estimation of the value function evoked by stochasticity. Two learning rates are used for increasing and decreasing value function updates while relying on an optimistic form of learning. A modern approach to hysteretic learning can be seen in Palmer et al. (2018) and Omidshafiei et al. (2017). An alternative method to adjust the degree of applied optimism during learning is leniency (Panait et al. 2006; Wei and Luke 2016). Leniency associates selected actions with decaying temperature values that govern the amount of applied leniency. Agents are optimistic during the early phase when exploration is still high but become less lenient for frequently visited state-action pairs over the course of training so that value estimations become more accurate towards the end of learning.
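
A tabular sketch of the hysteretic update is shown below: positive temporal-difference errors use the larger learning rate and negative errors the smaller one. The specific rates, the tabular representation, and the state and action sizes are illustrative assumptions.

```python
import numpy as np

def hysteretic_q_update(Q, s, a, r, s_next, alpha=0.1, beta=0.01, gamma=0.95):
    """Hysteretic Q-learning update for an independent learner (sketch).

    The larger rate `alpha` is applied to positive TD errors and the smaller
    rate `beta` to negative ones, which makes the agent optimistic about low
    returns caused by teammates' exploration.
    """
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    lr = alpha if td_error >= 0 else beta
    Q[s, a] += lr * td_error
    return Q

Q = np.zeros((5, 2))                      # 5 states, 2 actions
Q = hysteretic_q_update(Q, s=0, a=1, r=1.0, s_next=2)
```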

Further works expanded independent learners with enhanced techniques to cope with the MARL pathologies mentioned above. Extensions to the deep Q-network can be seen in additional mechanisms used for the experience replay (Palmer et al. 2019), the utilization of specialized estimators (Zheng et al. 2018a) and the use of implicit quantile networks (Lyu and Amato 2020). Further literature investigated independent learners as benchmark reference but reported limited success in cooperative tasks of various domains when no other techniques are applied to alleviate the issue of independent learners (Foerster et al. 2018b; Sunehag et al. 2018).

Constructing models An implicit way to achieve coordination among agents is to capture the behavior of others by constructing models. Models are functions that take past interaction data as input and output predictions about the agents of interest. This can be very important to render the learning process robust against the decision-making of other agents in the environment (Hu and Wellman 1998). The constructed models and the predicted behavior vary widely depending on the approaches and the assumptions being made (Albrecht and Stone 2018).

One of the first works based on deep learning methods was conducted by He et al. (2016) in an adversarial setting. They proposed an architecture that utilizes two neural networks. One neural network captures the opponents’ strategies, and the second network estimates the opponents’ Q-values. These networks jointly learn models of opponents by encoding observations into a deep Q-network. Another work by Foerster et al. (2018a) introduced a learning method where the policy updates rely on the impact on other agents. The opponent’s policy parameters can be inferred from the observed trajectory by using a maximum likelihood technique. The arising non-stationarity is tackled by accounting for only recent data. An additional possibility is to address the information gain about other agents through Bayesian methods. Raileanu et al. (2018) employed a model where agents estimate the other agents’ hidden states and embed these estimations into their own policy. Inferring other agents’ hidden states from their behavior allows them to choose appropriate actions and promotes eventual coordination. Foerster et al. (2019) used all publicly available observations in the environment to calculate a public belief over agents’ local information. Another work by Yang et al. (2018a) used Bayesian techniques to detect opponent strategies in competitive games. A particular challenge is to learn agent models in the presence of fast-adapting agents, which amplifies the problem of non-stationarity. As a countermeasure, Everett and Roberts (2018) proposed the switching agent model (SAM), which learns a set of opponent models and a switching mechanism between models. By tracking and detecting the behavioral adaptation of other agents, the switching mechanism learns to select the best response from the learned set of opponent models and, thus, showed superior performance over single-model learners.
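
A generic sketch of such a model is given below: a small network predicts the opponent's action distribution from the agent's own observation and is trained in a supervised fashion on actions observed in past trajectories. The architecture, sizes, and training target are illustrative and do not reproduce the cited designs.

```python
import torch
import torch.nn as nn

class OpponentModel(nn.Module):
    """Predicts the opponent's next-action distribution from the own observation (sketch)."""
    def __init__(self, obs_dim=10, opp_actions=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, opp_actions))

    def forward(self, obs):
        return self.net(obs)               # logits over the opponent's actions

model = OpponentModel()
obs = torch.randn(1, 10)
logits = model(obs)
probs = torch.softmax(logits, dim=-1)      # predicted opponent policy
# Supervised update against an action actually observed in the trajectory (here: action 2)
loss = nn.functional.cross_entropy(logits, torch.tensor([2]))
```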

Further works on constructing models can be found in cooperative tasks (Barde et al. 2019; Tacchetti et al. 2019; Zheng et al. 2018b) with imitation learning (Grover et al. 2018; Le et al. 2017), in social dilemmas (Jaques et al. 2019; Letcher et al. 2019), and by predicting behaviors from observations (Hong et al. 2017; Hoshen 2017). For a comprehensive survey on constructing models in multi-agent systems, one may consider the work of Albrecht and Stone (2018).

Besides resolving the coordination problem, building models of other agents can cope with the non-stationarity in the environment. As soon as one agent has knowledge about others’ behavior, previously unexplainable transition dynamics can be attributed to the responsible agents, and the environment becomes stationary again from the viewpoint of an individual agent.

Hierarchical methods Learning to coordinate can be challenging if multiple decision-makers are involved due to the increasing complexity (Bernstein et al. 2002). An approach to deal with the coordination problem is by abstracting low-level coordination to higher levels. The idea originated in the single-agent domain where hierarchies for temporal abstraction are employed to ease long-term reward assignments (Dayan and Hinton 1993; Sutton et al. 1999). Lower levels entail only partial information of the higher levels so that the learning task becomes simpler the lower the level of abstraction. First attempts for hierarchical multi-agent RL can be found in the tabular case (Ghavamzadeh et al. 2006; Makar et al. 2001). A deep approach was proposed by Kumar et al. (2017), where a higher-level controller guides the information exchange between decentralized agents. Grounded on the high-level controller, the agents communicate with only one other agent at each time step, which allows the exploration of distributed policies. Another work by Han et al. (2019) is built upon the options framework (Sutton et al. 1999) where they embedded a dynamic termination criterion for Q-learning. By adding a termination criterion, agents could flexibly quit the option execution and react to the behavioral changes of other agents. Related to the idea of feudal networks (Dayan and Hinton 1993), Ahilan and Dayan (2019) applied a two-level abstraction of agents to a cooperative multi-agent setting where, in contrast to other methods, the hierarchy relied on rewards instead of state goals. They showed that this approach could be well suited for decentralized control problems. Jaderberg et al. (2019) used hierarchical representations that allowed agents to reason at different time scales. The authors demonstrated that agents are capable of solving mixed cooperative and competitive tasks in simulated physics environments. Another work by Lee et al. (2020) proposed a hierarchical method to coordinate two agents on robotic manipulation and locomotion tasks to accomplish collaboration such as object pick and placement. They learned primitive skills on the low-level, which are guided by a higher-level policy. Further works cover hierarchical methods in cooperation tasks (Cai et al. 2013; Ma and Wu 2020; Tang et al. 2018) or social dilemmas (Vezhnevets et al. 2019). An open challenge for hierarchical methods is the autonomous creation and discovery of abstract goals from data (Schaul et al. 2015; Vezhnevets et al. 2017).

5.4 Credit assignment problem
In the fully-cooperative setting, agents are encouraged to maximize an equally-shared reward signal. Even in a fully-observable state space, it is difficult to determine which agents and actions contributed to the eventual reward outcome when agents do not have access to the joint action. Claus and Boutilier (1998) showed that independent learners cannot differentiate between a teammate’s exploration and the stochasticity in the environment, even in a simple bi-matrix game. This can render the learning problem difficult because agents should ideally be provided with feedback that corresponds to their contribution to the task performance to enable sufficient learning. Associating rewards with agents is known as the credit assignment problem (Weiß 1995; Wolpert and Tumer 1999). This problem is intensified by the sequential nature of reinforcement learning, where agents must understand not only the impact of single actions but also of the entire action sequences that eventually lead to the reward outcome (Sen and Weiss 1999). An additional challenge arises when agents only have access to local observations of the environment, which we discuss in Sect. 5.6. In the remainder of this section, we consider three actively investigated approaches that deal with how to determine the contribution of agents in jointly-shared reward settings.

Decomposition Early works approached the credit assignment problem by applying filters (Chang et al. 2004) or modifying the reward function such as reward shaping (Ng et al. 1999). Recent approaches focus on exploiting dependencies between agents to decompose the reward among the agents with respect to their actual contribution towards the global reward (Kok and Vlassis 2006). The learning problem is simplified by dividing the task into smaller and, hence, easier sub-problems through decomposition. Sunehag et al. (2018) introduced the value decomposition network (VDN) which factorizes the joint action-value function into a linear combination of individual action-value functions. The VDN learns how to optimally assign an individual reward according to the agent’s performance. The neural network helps to disambiguate the joint reward signal concerning the impact of the agent. Rashid et al. (2018) proposed QMIX as an improvement over VDN. QMIX learns a centralized action-value function that is decomposed into agent individual action-value functions through non-linear combinations. Under the assumption of monotonic relationships between the centralized Q-function and the individual Q-functions, decentralized policies can be extracted by individual argmax operations. As an advancement over both VDN and QMIX, Son et al. (2019) proposed QTRAN, which discards the assumption of linearity and monotonicity in the factorization and allows any non-linear combination of value functions. Further approaches about the factorization of value functions can be found in Castellini et al. (2019), Chen et al. (2018), Nguyen et al. (2017b), Wang et al. (2020a), Wang et al. (2020c) and Yang et al. (2018b).
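
The VDN idea can be sketched in a few lines: per-agent utility networks are evaluated at the chosen actions and summed to form the joint value, which would be trained end-to-end against the team reward; QMIX would replace the sum with a learned monotonic mixing network. Sizes and the exact network layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VDNMixer(nn.Module):
    """Value decomposition in the spirit of VDN: Q_tot is the sum of per-agent utilities (sketch)."""
    def __init__(self, n_agents=3, obs_dim=8, n_actions=4):
        super().__init__()
        self.agent_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
            for _ in range(n_agents)
        ])

    def forward(self, observations, actions):
        # observations: list of (batch, obs_dim); actions: (batch, n_agents) action indices
        q_chosen = [
            net(obs).gather(1, actions[:, i : i + 1])
            for i, (net, obs) in enumerate(zip(self.agent_nets, observations))
        ]
        return torch.stack(q_chosen, dim=0).sum(dim=0)   # Q_tot = sum_i Q_i

mixer = VDNMixer()
obs = [torch.randn(2, 8) for _ in range(3)]
acts = torch.randint(0, 4, (2, 3))
q_tot = mixer(obs, acts)                                  # shape (2, 1)
```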

Marginalization In addition to the decomposition into simpler sub-problems, one can apply an extra function that marginalizes out the effect of individual agents’ actions. Nguyen et al. (2018) introduced a mean collective actor-critic framework which marginalizes out the actions of agents by using an approximation of the critic and reduces the variance of the gradient estimation. Similarly, Foerster et al. (2018b) marginalized out the individual actions of agents by applying a counterfactual baseline function. The counterfactual baseline function uses a centralized critic, which calculates the advantage of a single agent by comparing the estimated return of the current joint action to the counterfactual baseline. The impact of a single agent’s action is thereby determined and can be attributed to the agent itself. Another work by Wu et al. (2018) used a marginalized action-value function as a baseline to reduce the variance of critic estimates. The marginalization approaches are closely related to the difference rewards proposed by Tumer and Wolpert (2004), which determine the impact of an agent’s individual action compared to the average reward of all agents.
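
For reference, the counterfactual advantage of Foerster et al. (2018b) can be written in the usual notation, where \mathbf{u} denotes the joint action, u^a the action of agent a, \tau^a its action-observation history, and \mathbf{u}^{-a} the actions of all other agents:

```latex
A^{a}(s, \mathbf{u}) \;=\; Q(s, \mathbf{u}) \;-\; \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right) \, Q\!\left(s, \left(\mathbf{u}^{-a}, u'^{a}\right)\right)
```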

Inverse reinforcement learning Credit assignment problems can also arise from a poor design of the reinforcement learning problem. Agents can misinterpret the intended task and fail because unintended strategies are explored, e.g. if the reward function does not capture all important aspects of the underlying task (Amodei et al. 2016). Therefore, an important step in the problem design is the reward function. However, designing a reward function can be challenging for complex problems (Hadfield-Menell et al. 2017) and becomes even more complicated for multi-agent systems since different agents may pursue different goals. Another approach to address the credit assignment problem is inverse reinforcement learning (Ng and Russell 2000), in which an agent learns a reward function that explains the demonstrated behavior of an expert without having access to the reward signal. The learned reward function can then be used to build strategies. The work of Lin et al. (2018) applied the principle of inverse reinforcement learning to the multi-agent setting. They showed that multiple agents could recover reward functions that are correlated with the ground truths. Related to inverse RL, imitation learning can be used to learn from expert knowledge. Yu et al. (2019) imitated expert behaviors to learn high-dimensional policies in both cooperative and competitive environments. They were able to recover the expert policies for each individual agent from the provided expert demonstrations. Further works on imitation learning consider the fully cooperative setting (Barrett et al. 2017; Le et al. 2017) and Markov games with mixed settings (Song et al. 2018).

5.5 Scalability
Training a large number of agents is inherently difficult. Every agent involved in the environment adds extra complexity to the learning problem, so that the computational effort grows exponentially with the number of agents. Besides complexity concerns, sufficient scaling also demands that agents be robust towards the behavioral adaptation of other agents. However, agents can leverage the benefit of distributed knowledge that is shared and reused between agents to accelerate the learning process. In the following, we review approaches that address the handling of many agents and discuss possible solutions. We broadly classify the surveyed works into those that apply some form of knowledge reuse, those that reduce the complexity of the learning problem, and those that develop robustness against the policy adaptations of other agents.

Knowledge reuse The training of individual learning models scales poorly with an increasing number of agents because the combinatorial possibilities inflate the computational effort. Knowledge reuse strategies are employed to ease the learning process and to scale RL to complex problems by reutilizing previous knowledge in new tasks. Knowledge reuse can be applied in many facets (Silva et al. 2018).

First, agents can make use of a parameter sharing technique if they exhibit homogeneous structures, e.g. by sharing the weights of parts or of the whole neural network learning model with other agents. Sharing the parameters of a policy enables an efficient training process that can scale up to an arbitrary number of agents and, thus, can boost the learning process (Gupta et al. 2017). Parameter sharing has proven to be useful in various applications such as learning to communicate (Foerster et al. 2016; Jiang and Lu 2018; Peng et al. 2017; Sukhbaatar et al. 2016), modeling agents (Hernandez-Leal et al. 2019), and partially observable cooperative games (Sunehag et al. 2018). For a discussion of different parameter sharing strategies, one may consider the paper by Chu and Ye (2017).
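
A minimal sketch of parameter sharing is given below (it assumes homogeneous agents with identical observation and action spaces; all names and sizes are illustrative): a single policy network is instantiated once and queried by every agent, and a one-hot agent index appended to the observation still allows agent-specific behavior:

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    n_agents, obs_dim, n_actions = 4, 10, 5

    # One network whose weights are shared by all agents.
    shared_policy = nn.Sequential(
        nn.Linear(obs_dim + n_agents, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def act(agent_idx, observation):
        one_hot = torch.zeros(n_agents)
        one_hot[agent_idx] = 1.0                        # identify the querying agent
        logits = shared_policy(torch.cat([observation, one_hot]))
        return Categorical(logits=logits).sample()

    actions = [act(i, torch.randn(obs_dim)) for i in range(n_agents)]

Because only one set of weights is trained, the gradient updates of all agents flow into the same parameters, which is what makes the approach scale to an arbitrary number of agents.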

As the second approach, knowledge reuse can be applied in the form of transfer learning (Da Silva et al. 2019; Da Silva and Costa 2019). Experience obtained in learning to perform one task may also improve the performance in a related but different task (Taylor and Stone 2009). Da Silva and Costa (2017) used a knowledge database from which an agent can extract previous solutions of related tasks and embed such information into the current task's training. Likewise, Da Silva et al. (2017) applied expert demonstrations where the agents take the role of students that ask a teacher for advice. They demonstrated that simultaneously learning agents could advise each other through knowledge transfer. Further works on transfer learning can be found in the cooperative multi-agent setting (Omidshafiei et al. 2019) and in natural language applications (Luketina et al. 2019). In general multi-agent systems, the works of Boutsioukis et al. (2012) and Taylor et al. (2013) substantiate that transfer learning can speed up the learning process.
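
A very common and simple form of such knowledge reuse is to initialize the policy for a new task from weights trained on a related source task instead of starting from scratch; the sketch below is a hedged illustration in which the file name, architecture, and fine-tuning choice are placeholders rather than details of the cited works:

    import torch
    import torch.nn as nn

    def build_policy(obs_dim=10, n_actions=5):
        return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                             nn.Linear(64, n_actions))

    source_policy = build_policy()
    # ... train source_policy on the source task, then store its weights ...
    torch.save(source_policy.state_dict(), "source_task_policy.pt")

    # Target task: start from the transferred weights rather than from scratch.
    target_policy = build_policy()
    target_policy.load_state_dict(torch.load("source_task_policy.pt"))
    # Optionally fine-tune only the output layer on the target task.
    optimizer = torch.optim.Adam(target_policy[2].parameters(), lr=1e-4)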

Besides parameter sharing and transfer learning, curriculum learning may be applied for scaling to many agents. Since tasks become harder to master and more time consuming to train as the number of agents increases, it is often difficult to learn from scratch. Curriculum learning starts with a small number of agents and then gradually enlarges the number of agents over the course of training. Through the steady increase within the curriculum, trained policies can perform better than without a curriculum (Gupta et al. 2017; Long et al. 2020; Narvekar et al. 2016). Curriculum learning schemes can also yield improved generalization and faster convergence of agent policies (Bengio et al. 2009). Further works show that agents can generate learning curricula automatically (Sukhbaatar et al. 2017; Svetlik et al. 2017) or can create arms races in competitive settings (Baker et al. 2020).
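
The scheduling idea behind such curricula can be sketched as follows (the environment constructor, training routine, and threshold are hypothetical placeholders supplied by the user): training starts with few agents and the population is doubled once the current stage is sufficiently mastered:

    def curriculum_training(make_env, train_policy, evaluate,
                            start_agents=2, max_agents=16, threshold=0.8):
        # make_env(n) builds an environment with n agents; train_policy(env, policy)
        # continues training the given policy (or starts fresh if policy is None);
        # evaluate(policy, env) returns a success rate in [0, 1].
        n_agents, policy = start_agents, None
        while n_agents <= max_agents:
            env = make_env(n_agents)
            policy = train_policy(env, policy)     # warm-start from the previous stage
            if evaluate(policy, env) >= threshold:
                n_agents *= 2                      # advance to a harder stage
        return policy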

Complexity reduction Many real-world applications naturally encompass large numbers of simultaneously interacting agents (Nguyen et al. 2017a, b). As the quantity of agents increases, the need to contain the curse of dimensionality becomes inevitable. Yang et al. (2018b) addressed the issue of scalability with a mean-field method. The interactions within a large population of agents are approximated by the interplay between a single agent and the mean effect of the whole or of the local agent population. The complexity reduces as the problem is broken down into pairwise interactions between an agent and its neighborhood. Considering the average effect of its neighbors, each agent learns the best response towards its proximity. Another approach to constrain the explosion in complexity is to factorize the problem into smaller sub-problems (Guestrin et al. 2002). Chen et al. (2018) decomposed the joint action-value function into independent components and used pairwise interactions between agents to render large-scale problems computationally tractable. Further works studied large-scale MADRL problems with graphical models (Nguyen et al. 2017a) and the CTDE paradigm (Lin et al. 2018).
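
The core of the mean-field approximation can be sketched as follows (purely illustrative; Yang et al. (2018b) derive the full update rule): instead of conditioning on the joint action of all neighbors, each agent's action-value function receives only its own action together with the mean action of its neighborhood:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    obs_dim, n_actions = 10, 5

    # Q(s, a_i, mean of neighbor actions): the neighborhood's joint action is
    # summarized by the mean of the neighbors' one-hot encoded actions.
    mean_field_q = nn.Sequential(
        nn.Linear(obs_dim + 2 * n_actions, 64), nn.ReLU(), nn.Linear(64, 1))

    def q_value(observation, own_action, neighbor_actions):
        own_one_hot = F.one_hot(own_action, n_actions).float()
        neighbor_one_hot = F.one_hot(neighbor_actions, n_actions).float()
        mean_action = neighbor_one_hot.mean(dim=0)   # pairwise effects collapse to a mean
        return mean_field_q(torch.cat([observation, own_one_hot, mean_action]))

    value = q_value(torch.randn(obs_dim), torch.tensor(2), torch.tensor([0, 1, 4]))

The input size of the Q-network no longer depends on the number of neighbors, which is what keeps the approach tractable for large populations.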

Robustness Another desired property is the robustness of learned policies to perturbations in the environment caused by other agents. Perturbations are amplified by the number of agents and the resulting growth of the state-action space. In supervised learning, a common problem is that models can over-fit to the data set. Similarly, over-fitting can occur in RL frameworks if environments provide little or no deviation (Bansal et al. 2018). To maintain robustness over the training process and towards the adaptation of the other agents, several methods have been proposed.

First, regularization techniques can be used to prevent over-fitting to other agents' behavior. Examples can be seen in policy ensembles (Lowe et al. 2017), where a collection of different sub-policies is trained for each agent, or in best responses to policy mixtures (Lanctot et al. 2017).
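
The ensemble idea can be sketched as follows (a loose, hypothetical illustration of sub-policy ensembles in the spirit of Lowe et al. 2017, not their implementation): each agent holds K sub-policies and samples one of them per episode, so that opponents cannot over-fit to a single fixed behavior:

    import random
    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    def make_sub_policy(obs_dim=10, n_actions=5):
        return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                             nn.Linear(64, n_actions))

    K = 3
    ensemble = [make_sub_policy() for _ in range(K)]    # K sub-policies per agent

    def run_episode(env_step, obs_dim=10, horizon=50):
        # env_step(action) -> next observation is assumed to be provided.
        sub_policy = random.choice(ensemble)            # fix one sub-policy per episode
        observation = torch.randn(obs_dim)              # placeholder initial observation
        for _ in range(horizon):
            logits = sub_policy(observation)
            action = Categorical(logits=logits).sample()
            observation = env_step(action)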

Second, adversarial training can be applied to mitigate the vulnerability of policies towards perturbations. Pinto et al. (2017) added an adversarial agent to the environment that applied targeted disturbances to the learning process. By hampering the training, the agents were compelled to encounter these disturbances and develop robust policies. Similarly, Li et al. (2019a) used an adversarial setting to reduce the sensitivity of agents towards the environment. Bansal et al. (2018) demonstrated that policies which are trained in a competitive setting could yield behaviors that are far more complex than the environment itself. From an application perspective, Spooner and Savani (2020) studied robust decision-making in market making.

The observations from above are in accordance with the findings of related studies about the impact of self-play (Raghu et al. 2018; Sukhbaatar et al. 2017). Heinrich and Silver (2016) used self-play to learn approximate Nash equilibria of imperfect-information games and showed that self-play could be used to obtain better robustness in the learned policies. Similarly, self-play was used to compete with older versions of policies to render the learned behaviors more robust (Baker et al. 2020; Berner et al. 2019; Silver et al. 2018). Silver et al. (2016) adapted self-play as a regularization technique to prevent the policy network from over-fitting by playing against older versions of itself. However, Gleave et al. (2020) studied the existence of adversarial policies in competitive games and showed that complex policies could be fooled by comparably easy strategies. Although agents trained through self-play proved to be more robust, allegedly random and uncoordinated strategies caused agents to fail at the task. They argued that the vulnerability towards adversarial attacks increases with the dimensionality of the observation space. A further research direction for addressing robustness is to render the learning representation invariant towards permutations, as shown in Liu et al. (2020).
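
The snapshot mechanism that underlies such self-play schemes can be sketched as follows (a simplified illustration with hypothetical helper functions, not the training setup of the cited works): the learner periodically stores a frozen copy of itself and trains against opponents sampled from this growing pool:

    import copy
    import random

    def self_play(policy, train_against, snapshot_every=1000, iterations=10000):
        # policy is the learning agent; train_against(policy, opponent) is assumed
        # to perform one training update of `policy` against a fixed opponent.
        opponent_pool = [copy.deepcopy(policy)]         # start with the initial policy
        for step in range(1, iterations + 1):
            opponent = random.choice(opponent_pool)     # older versions stay in the pool
            train_against(policy, opponent)
            if step % snapshot_every == 0:
                opponent_pool.append(copy.deepcopy(policy))
        return policy, opponent_pool

Playing against the whole pool rather than only the latest version is the regularizing element: it discourages strategies that exploit one particular opponent and thereby improves robustness.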

5.6 Partial observability
Outside an idealized setting, agents can neither observe the global state of the environment nor access the internal knowledge of other agents. Since each agent perceives only partial observations, a single observation does not capture all relevant information about the environment and its history. Hence, the Markov property is not fulfilled, and the environment appears non-Markovian. An additional difficulty elicited by partial observability is the lazy agent problem, which can occur in cooperative settings (Sunehag et al. 2018). As introduced in Sect. 2.2, the common frameworks that deal with partial observability are POMDPs for general settings and dec-POMDPs for cooperative settings with a shared reward function. Dec-POMDPs are computationally challenging (Bernstein et al. 2002) and still intractable when solving problems with real-world complexity (Amato et al. 2015). However, recent work accomplished promising results in video games with imperfect information (Baker et al. 2020; Berner et al. 2019; Jaderberg et al. 2019; Vinyals et al. 2019).

A natural way to deal with non-Markovian environments is through information exchange between the decision-makers (Goldman and Zilberstein 2004). Agents that are able to communicate can compensate for their limited knowledge by propagating information and thereby fill the lack of knowledge about other agents or the environment (Foerster et al. 2016). As we already discussed in Sect. 5.2, there are several ways to incorporate communication capabilities into agents. A primary example is Jiang and Lu (2018), who used an attention mechanism to establish communication under partial observations. Rather than exchanging information at a fixed frequency, their agents learned to communicate on demand. Further approaches under partial observability have been investigated in cooperative tasks (Das et al. 2019; Sukhbaatar et al. 2016) and mixed settings (Singh et al. 2019).
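
A minimal differentiable message-passing sketch is given below (the shapes, the linear encoder, and the mean aggregation are illustrative choices, not the mechanism of any specific cited work): each agent encodes its partial observation into a message, and every policy conditions on its own observation plus the aggregated messages of the others:

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    n_agents, obs_dim, msg_dim, n_actions = 3, 10, 4, 5

    message_encoder = nn.Linear(obs_dim, msg_dim)       # shared by all agents
    policy = nn.Sequential(nn.Linear(obs_dim + msg_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions))

    observations = torch.randn(n_agents, obs_dim)       # placeholder partial observations
    messages = message_encoder(observations)            # one message per agent

    actions = []
    for i in range(n_agents):
        others = torch.cat([messages[:i], messages[i + 1:]])   # drop the own message
        incoming = others.mean(dim=0)                          # aggregate communication
        logits = policy(torch.cat([observations[i], incoming]))
        actions.append(Categorical(logits=logits).sample())

Because the messages are continuous and produced by a differentiable encoder, gradients can flow through the communication channel during centralized training, which is one way such protocols can be learned.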

In the following, we review papers that cope with partial observability by incorporating a memory mechanism. Agents that are able to memorize past experiences can compensate for the lack of information.

Memory mechanism A common way to tackle partial observability is the usage of deep recurrent neural networks, which equip agents with a memory mechanism to store information that can be relevant in the future (Hausknecht and Stone 2015). However, long-term dependencies render the decision-making difficult since experiences that were observed in the more distant past may have been forgotten (Hochreiter and Schmidhuber 1997). Approaches involving recurrent neural networks to deal with partial observability can be realized with value-based approaches (Omidshafiei et al. 2017) or actor-critic methods (Dibangoye and Buffet 2018; Foerster et al. 2018b; Gupta et al. 2017). Foerster et al. (2019) used a Bayesian method to tackle partial observability in cooperative settings. They used all publicly available features of the environment and agents to determine a public belief over the agents' internal states. A severe concern in MADRL is that the burden of memorizing past information is exacerbated by the number of agents involved in the learning process.
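
A minimal recurrent policy sketch (shapes and names are illustrative assumptions): an LSTM carries a hidden state across time steps so that the selected action can depend on the history of partial observations rather than on the last observation alone:

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class RecurrentPolicy(nn.Module):
        def __init__(self, obs_dim=10, hidden_dim=64, n_actions=5):
            super().__init__()
            self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, n_actions)

        def forward(self, observation, hidden=None):
            # observation has shape (batch, 1, obs_dim): one new partial observation.
            out, hidden = self.lstm(observation, hidden)
            return self.head(out[:, -1]), hidden        # action logits and updated memory

    policy, hidden = RecurrentPolicy(), None
    for _ in range(20):                                 # the hidden state accumulates history
        obs = torch.randn(1, 1, 10)
        logits, hidden = policy(obs, hidden)
        action = Categorical(logits=logits).sample()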

6 Discussion
In this section, we discuss findings from the previous sections. We enumerate trends that we have identified in recent literature. Since these trends are useful for addressing current challenges, they may also be an avenue for upcoming research. Towards the end of our discussion, we point out possible future work. We elaborate on problems on which only little research has been conducted so far and pose the two problems that we find the toughest ones to overcome.

Despite the recent advances in many directions, many pathologies such as relative over-generalization combined with reward stochasticity are not yet solved, even in allegedly simple tabular worlds. MADRL has profited from the history of MARL by scaling up its insights to more complex problems. Approaches for which strong solutions exist in simplified MARL settings may be transferable to the MADRL domain. Thus, by enhancing older methods with new deep learning approaches, unsolved problems and concepts from MARL continue to matter in MADRL. An essential point for MADRL is that reproducibility is taken seriously. Well-known papers from the single-agent domain underline the significance of hyper-parameters, the number of independent random seeds, and the chosen code-base for the eventual task performance (Henderson et al. 2018; Islam et al. 2017). To maintain steady progress, the reporting of all used hyper-parameters and a transparent conduct of experiments are crucial. We want to make the community aware that these findings may also be valid for the multi-agent domain. Therefore, it is inevitable that standardized frameworks are created in which different algorithms can be compared along with their merits and demerits. Many individual environments have been proposed which exhibit intricate structure and real-world complexity (Baker et al. 2020; Beattie et al. 2016; Johnson et al. 2016; Juliani et al. 2018; Song et al. 2019; Vinyals et al. 2017). However, no consistent benchmark yet exists that provides a unified interface and allows a fair comparison between different kinds of algorithms on a great variety of tasks, as the OpenAI Gym (Brockman et al. 2016) does for single-agent problems.

Table 4 Our identified trends in MADRL and the addressed challenges
6.1 Trends
Over the last years, approaches in the multi-agent domain have achieved successes based on recurring patterns of good practice. We have identified four trends in state-of-the-art literature that are frequently applied to address current challenges (Table 4).

As the first trend, we observe curriculum learning as an approach to divide the learning process into stages in order to deal with scalability issues. By starting with a small quantity, the number of agents is gradually enlarged over the learning course so that large-scale training becomes feasible (Gupta et al. 2017; Long et al. 2020; Narvekar et al. 2016). Alternatively, curricula can also be employed to create different stages of difficulty, where agents face relatively easy tasks at the beginning and gradually more complex tasks as their skills increase (Vinyals et al. 2019). Besides that, curriculum training is used to investigate the emergence of agent behavior. Curricula describe engineered changes in the dynamics of the environment. Agents adapt their behaviors over time in response to the strategic changes of others, which can yield arms races between agents. This process of continual adaptation is referred to as autocurricula (Leibo et al. 2019) and has been reported in several works (Baker et al. 2020; Sukhbaatar et al. 2017; Svetlik et al. 2017).

Second, we recognize a trend towards deep neural networks embedded with recurrent units to memorize experience. By having the ability to track the history of state transitions and the decisions of other agents, the non-stationarity of the environment due to multiple decision-makers and partially observable states can be addressed in small problems (Omidshafiei et al. 2017), and can be managed sufficiently well in complex problems (Baker et al. 2020; Berner et al. 2019; Jaderberg et al. 2019).

Third, an active line of research is exploring the development of communication skills. Due to the rise of deep learning methods, new computational approaches are available to investigate the emergence of language between interactive agents (Lazaridou and Baroni 2020). Besides the plethora of works that analyze emergent behaviors and semantics, many works propose methods that endow agents with communication skills. By expressing their intentions, agents can align their coordination and find a consensus (Foerster et al. 2016). The non-stationarity from the perspective of a single learner can be eluded when agents disclose their history. Moreover, agents can share their local information with others to alleviate partial observability (Foerster et al. 2018b; Omidshafiei et al. 2017).

Fourth and last, we note a clear trend towards the CTDE paradigm, which enables the sharing of information during the training. Local information such as the observation-action history, function values, or policies can be made available to all agents during the training, which renders the environment stationary from the viewpoint of an individual agent and may diminish partial observability (Lowe et al. 2017). Further, the credit assignment problem can be addressed when information about all agents is available and a centralized mechanism can attribute the individual contribution to the respective agent (Foerster et al. 2018b). Further challenges that can be loosened are coordination and scalability, since the lack of information of an individual agent can be compensated and the learning process accelerated (Gupta et al. 2017).

6.2 Future work
Next to our identified trends, which are already under active research, we recognize areas that have not been sufficiently explored yet. One such area is multi-goal learning, where each agent has an individually associated goal that needs to be optimized. However, global optimality can only be accomplished if agents also allow others to be successful in their task (Yang et al. 2020). Typical scenarios are cooperative tasks such as public good dilemmas, where agents are obliged to use limited resources sustainably, or autonomous driving, where agents have individual destinations and are supposed to coordinate the path-finding to avoid crashes. A similar direction is multi-task learning, where agents are expected to perform well not only on a single task but also on related other tasks (Omidshafiei et al. 2017; Taylor and Stone 2009). Besides multi-goal and multi-task learning, another avenue for future work lies in safe MADRL. Safety is a highly desired property because autonomously acting agents are expected to ensure system performance while holding to safety guarantees during learning and employment (García et al. 2015). Several works in single-agent RL are concerned with safety concepts, but their applicability to multiple agents is limited and still in its infancy (Zhang and Bastani 2019; Zhu et al. 2020). Akin to the growing interest in learning to communicate, a similar effect may happen in the multi-agent domain, where deep learning methods open new paths. For an application perspective on safe autonomous driving, one can consider the article by Shalev-Shwartz et al. (2016). Another possible direction for future research is the intersection between MADRL and evolutionary methodologies. Evolutionary algorithms have been used in versatile contexts of multi-agent RL, e.g. for building intrinsic motivation (Wang et al. 2019), shaping rewards (Jaderberg et al. 2019), generating curricula (Long et al. 2020) and analyzing dynamics (Bloembergen et al. 2015). Since evolution requires many entities to adapt, multi-agent RL is a natural playground for such algorithms.

Beyond the current challenges and the reviewed literature of Sect. 5, we identify two problems that we regard as the most challenging ones for future work to overcome. We primarily choose these two problems since they are the ones that matter the most when it comes to the applicability of algorithms to real-world scenarios. Most research focuses on learning within homogeneous settings where agents share common interests and optimize a mutual goal. For instance, the learning of communication is mainly studied in dec-POMDPs, where agents are expected to optimize upon a joint reward signal. When agents share common interests, the CTDE paradigm is usually a beneficial choice to exchange information between agents, and problems like non-stationarity, partial observability, and coordination can be diminished. However, heterogeneity implies that agents may have their own interests and goals, individual experience and knowledge, or different skills and capabilities. Limited research has been conducted in heterogeneous scenarios, although many real-world problems naturally comprise a mixture of different entities. Under real-world conditions, agents have access only to local and heterogeneous information on which decisions must be taken. The fundamental problem in the multi-agent domain is and ever has been the curse of dimensionality (Busoniu et al. 2008; Hernandez-Leal et al. 2019). The state-action space and the combinatorial possibilities of agent interactions increase exponentially with the number of agents, which renders sufficient exploration itself a difficult problem. This is intensified when agents have only access to partial observations of the environment or when the environment is of continuous nature. Although powerful function approximators like neural networks can cope with continuous spaces and generalize well over large spaces, open questions remain, such as how to explore large and complex spaces sufficiently well and how to solve large combinatorial optimization problems.

7 Conclusion
Even though multi-agent reinforcement learning enjoys a long record, historical approaches hardly exceeded the complexity of discretized environments with a limited number of states and actions (Busoniu et al. 2008; Tuyls and Weiss 2012). Since the breakthrough of deep learning methods, the field has been undergoing a rapid transformation, and many previously unsolved problems have step by step become tractable. Latest advances showed that tasks with real-world complexity can be mastered (Baker et al. 2020; Berner et al. 2019; Jaderberg et al. 2019; Vinyals et al. 2019). Still, MADRL is a young field which attracts growing interest, and the amount of published literature rises swiftly. In this article, we surveyed recent works that combine deep learning methods with multi-agent reinforcement learning. We analyzed training schemes that are used to learn policies, and we reviewed patterns of agent behavior that emerge when multiple entities interact simultaneously. In addition, we systematically investigated challenges that are present in the multi-agent context and studied recent approaches that are under active research. Finally, we outlined trends which we have identified in state-of-the-art literature and proposed possible avenues for future work. With this contribution, we want to equip interested readers with the necessary tools to understand the contemporary challenges in MADRL by providing a more holistic overview of the recent approaches. We want to emphasize its potential and reveal opportunities as well as limitations. In the foreseeable future, we expect an abundance of new literature to emerge and, hence, we want to encourage the community to further develop this interesting and young field of research.

Notes
Markov games are also known as Stochastic Games (Shapley 1953), but we continue to use the term Markov Game to draw a clear distinction between deterministic Markov Games and stochastic Markov Games.

The strategic-form game is also known as matrix game or normal-form game. The most commonly studied strategic-form game is the one with two players, the so-called bi-matrix game.

The alter-exploration dilemma, also known as the exploration-exploitation problem, describes the trade-off an agent faces to decide whether to choose actions that extend experience or take decisions that are already optimal according to the current knowledge.

Note that test and execution time are often used interchangeably in recent literature. For clarity, we use the term test for the post-training evaluation and the term execution for the action selection with respect to some policy.

Fingerprints draw their inspiration from Tesauro (2004) who eluded non-stationarity by conditioning each agent’s policy on estimates of other agents’ policies.

References
Ahilan S, Dayan P (2019) Feudal multi-agent hierarchies for cooperative reinforcement learning. CoRR arxiv: abs/1901.08492

Al-Shedivat M, Bansal T, Burda Y, Sutskever I, Mordatch I, Abbeel P (2018) Continuous adaptation via meta-learning in nonstationary and competitive environments. In: International conference on learning representations. https://openreview.net/forum?id=Sk2u1g-0-

Albrecht SV, Stone P (2018) Autonomous agents modelling other agents: a comprehensive survey and open problems. Artif Intell 258:66–95. https://doi.org/10.1016/j.artint.2018.01.002. http://www.sciencedirect.com/science/article/pii/S0004370218300249

Amato C, Konidaris G, Cruz G, Maynor CA, How JP, Kaelbling LP (2015) Planning for decentralized control of multiple robots under uncertainty. In: 2015 IEEE international conference on robotics and automation (ICRA), pp 1241–1248. https://doi.org/10.1109/ICRA.2015.7139350

Amodei D, Olah C, Steinhardt J, Christiano PF, Schulman J, Mané D (2016) Concrete problems in AI safety. CoRR. arxiv: abs/1606.06565,

Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) Vqa: Visual question answering. In: The IEEE international conference on computer vision (ICCV)

Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA (2017) Deep reinforcement learning: a brief survey. IEEE Signal Process Mag 34(6):26–38. https://doi.org/10.1109/MSP.2017.2743240

Aubret A, Matignon L, Hassas S (2019) A survey on intrinsic motivation in reinforcement learning. arXiv e-prints arXiv:1908.06976,

Baker B, Kanitscheider I, Markov T, Wu Y, Powell G, McGrew B, Mordatch I (2020) Emergent tool use from multi-agent autocurricula. In: International conference on learning representations. https://openreview.net/forum?id=SkxpxJBKwS

Bansal T, Pachocki J, Sidor S, Sutskever I, Mordatch I (2018) Emergent complexity via multi-agent competition. In: International conference on learning representations. https://openreview.net/forum?id=Sy0GnUxCb

Barde P, Roy J, Harvey FG, Nowrouzezahrai D, Pal C (2019) Promoting coordination through policy regularization in multi-agent reinforcement learning. arXiv e-prints arXiv:1908.02269,

Barrett S, Rosenfeld A, Kraus S, Stone P (2017) Making friends on the fly: cooperating with new teammates. Artif Intell 242:132–171

Beattie C, Leibo JZ, Teplyashin D, Ward T, Wainwright M, Küttler H, Lefrancq A, Green S, Valdés V, Sadik A, Schrittwieser J, Anderson K, York S, Cant M, Cain A, Bolton A, Gaffney S, King H, Hassabis D, Legg S, Petersen S (2016) Deepmind lab. CoRR. arxiv: abs/1612.03801

Becker R, Zilberstein S, Lesser V, Goldman CV (2004) Solving transition independent decentralized Markov decision processes. J Artif Intell Res 22:423–455

Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R (2016) Unifying count-based exploration and intrinsic motivation. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, Curran Associates, Inc., pp 1471–1479. http://papers.nips.cc/paper/6383-unifying-count-based-exploration-and-intrinsic-motivation.pdf

Bellman R (1957) A Markovian decision process. J Math Mechanics 6(5):679–684. http://www.jstor.org/stable/24900506

Bengio Y, Louradour J, Collobert R, Weston J (2009) Curriculum learning. In: Proceedings of the 26th annual international conference on machine learning, ACM, New York, NY, USA, ICML ’09, pp 41–48. https://doi.org/10.1145/1553374.1553380,

Berner C, Brockman G, Chan B, Cheung V, Debiak P, Dennison C, Farhi D, Fischer Q, Hashme S, Hesse C, Józefowicz R, Gray S, Olsson C, Pachocki JW, Petrov M, de Oliveira Pinto HP, Raiman J, Salimans T, Schlatter J, Schneider J, Sidor S, Sutskever I, Tang J, Wolski F, Zhang S (2019) Dota 2 with large scale deep reinforcement learning. ArXiv arxiv: abs/1912.06680

Bernstein DS, Givan R, Immerman N, Zilberstein S (2002) The complexity of decentralized control of Markov decision processes. Math Oper Res 27(4):819–840. https://doi.org/10.1287/moor.27.4.819.297

Bertsekas DP (2012) Dynamic programming and optimal control, vol 2, 4th edn. Athena Scientific, Belmont

Bertsekas DP (2017) Dynamic programming and optimal control, vol 1, 4th edn. Athena Scientific, Belmont

Bloembergen D, Tuyls K, Hennes D, Kaisers M (2015) Evolutionary dynamics of multi-agent learning: a survey. J Artif Intell Res 53:659–697

Bono G, Dibangoye JS, Matignon L, Pereyron F, Simonin O (2019) Cooperative multi-agent policy gradient. In: Berlingerio M, Bonchi F, Gärtner T, Hurley N, Ifrim G (eds) Machine learning and knowledge discovery in databases. Springer International Publishing, Cham, pp 459–476

Boutsioukis G, Partalas I, Vlahavas I (2012) Transfer learning in multi-agent reinforcement learning domains. In: Sanner S, Hutter M (eds) Recent advances in reinforcement learning. Springer, Berlin, pp 249–260

Bowling M, Veloso M (2002) Multiagent learning using a variable learning rate. Artif Intell 136(2):215–250

Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) Openai gym. arXiv:1606.01540

Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern Part C (Appl Rev) 38(2):156–172. https://doi.org/10.1109/TSMCC.2007.913919

Cai Y, Yang SX, Xu X (2013) A combined hierarchical reinforcement learning based approach for multi-robot cooperative target searching in complex unknown environments. In: 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), pp 52–59. https://doi.org/10.1109/ADPRL.2013.6614989

Cao K, Lazaridou A, Lanctot M, Leibo JZ, Tuyls K, Clark S (2018) Emergent communication through negotiation. In: International conference on learning representations. https://openreview.net/forum?id=Hk6WhagRW

Cao Y, Yu W, Ren W, Chen G (2013) An overview of recent progress in the study of distributed multi-agent coordination. IEEE Trans Industr Inf 9(1):427–438. https://doi.org/10.1109/TII.2012.2219061

Castellini J, Oliehoek FA, Savani R, Whiteson S (2019) The representational capacity of action-value networks for multi-agent reinforcement learning. In: Proceedings of the 18th international conference on autonomous agents and multiagent systems, international foundation for autonomous agents and multiagent systems, Richland, SC, AAMAS ’19, pp 1862–1864. http://dl.acm.org/citation.cfm?id=3306127.3331944

Celikyilmaz A, Bosselut A, He X, Choi Y (2018) Deep communicating agents for abstractive summarization. CoRR arxiv: abs/1803.10357,

Chang Y, Ho T, Kaelbling LP (2004) All learning is local: Multi-agent learning in global reward games. In: Thrun S, Saul LK, Schölkopf B (eds) Advances in neural information processing systems 16, MIT Press, pp 807–814. http://papers.nips.cc/paper/2476-all-learning-is-local-multi-agent-learning-in-global-reward-games.pdf

Chen Y, Zhou M, Wen Y, Yang Y, Su Y, Zhang W, Zhang D, Wang J, Liu H (2018) Factorized q-learning for large-scale multi-agent systems. CoRR arxiv: abs/1809.03738

Chen YF, Liu M, Everett M, How JP (2016) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. CoRR. arxiv: abs/1609.07845,

Chentanez N, Barto AG, Singh SP (2005) Intrinsically motivated reinforcement learning. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems 17, MIT Press, pp 1281–1288. http://papers.nips.cc/paper/2552-intrinsically-motivated-reinforcement-learning.pdf

Choi E, Lazaridou A, de Freitas N (2018) Multi-agent compositional communication learning from raw visual input. In: International conference on learning representations. https://openreview.net/forum?id=rknt2Be0-

Chu T, Chinchali S, Katti S (2020) Multi-agent reinforcement learning for networked system control. In: International conference on learning representations. https://openreview.net/forum?id=Syx7A3NFvH

Chu T, Wang J, Codecà L, Li Z (2020) Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans Intell Transp Syst 21(3):1086–1095

Chu X, Ye H (2017) Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning. CoRR arxiv: abs/1710.00336

Claus C, Boutilier C (1998) The dynamics of reinforcement learning in cooperative multiagent systems. In: Proceedings of the fifteenth national conference on artificial intelligence and tenth innovative applications of artificial intelligence conference, AAAI 98, IAAI 98, July 26–30, 1998, Madison, Wisconsin, USA, pp 746–752. http://www.aaai.org/Library/AAAI/1998/aaai98-106.php

Crandall JW, Goodrich MA (2011) Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Mach Learn 82(3):281–314. https://doi.org/10.1007/s10994-010-5192-9

Da Silva FL, Costa AHR (2017) Accelerating multiagent reinforcement learning through transfer learning. In: Proceedings of the thirty-first AAAI conference on artificial intelligence, AAAI Press, AAAI’17, pp 5034–5035. http://dl.acm.org/citation.cfm?id=3297863.3297988

Da Silva FL, Costa AHR (2019) A survey on transfer learning for multiagent reinforcement learning systems. J Artif Int Res 64(1):645–703. https://doi.org/10.1613/jair.1.11396

Da Silva FL, Glatt R, Costa AHR (2017) Simultaneously learning and advising in multiagent reinforcement learning. In: Proceedings of the 16th conference on autonomous agents and multiagent systems, international foundation for autonomous agents and multiagent systems, Richland, SC, AAMAS ’17, pp 1100–1108. http://dl.acm.org/citation.cfm?id=3091210.3091280

Da Silva FL, Warnell G, Costa AHR, Stone P (2019) Agents teaching agents: a survey on inter-agent transfer learning. Auton Agent Multi-Agent Syst 34(1):9. https://doi.org/10.1007/s10458-019-09430-0

Das A, Kottur S, Moura JMF, Lee S, Batra D (2017) Learning cooperative visual dialog agents with deep reinforcement learning. In: The IEEE international conference on computer vision (ICCV)

Das A, Gervet T, Romoff J, Batra D, Parikh D, Rabbat M, Pineau J (2019) TarMAC: Targeted multi-agent communication. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, PMLR, Long Beach, California, USA, Proceedings of machine learning research, vol 97, pp 1538–1546. http://proceedings.mlr.press/v97/das19a.html

Dayan P, Hinton GE (1993) Feudal reinforcement learning. In: Hanson SJ, Cowan JD, Giles CL (eds) Advances in neural information processing systems 5, Morgan-Kaufmann, pp 271–278. http://papers.nips.cc/paper/714-feudal-reinforcement-learning.pdf

De Cote EM, Lazaric A, Restelli M (2006) Learning to cooperate in multi-agent social dilemmas. In: Proceedings of the fifth international joint conference on autonomous agents and multiagent systems, ACM, New York, NY, USA, AAMAS ’06, pp 783–785. https://doi.org/10.1145/1160633.1160770

Diallo EAO, Sugiyama A, Sugawara T (2017) Learning to coordinate with deep reinforcement learning in doubles pong game. In: 2017 16th IEEE international conference on machine learning and applications (ICMLA), pp 14–19. https://doi.org/10.1109/ICMLA.2017.0-184

Dibangoye J, Buffet O (2018) Learning to act in decentralized partially observable MDPs. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, Stockholmsmässan, Stockholm Sweden, Proceedings of Machine Learning Research, vol 80, pp 1233–1242. http://proceedings.mlr.press/v80/dibangoye18a.html

Dobbe R, Fridovich-Keil D, Tomlin C (2017) Fully decentralized policies for multi-agent systems: an information theoretic approach. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 2941–2950. http://papers.nips.cc/paper/6887-fully-decentralized-policies-for-multi-agent-systems-an-information-theoretic-approach.pdf

Duan Y, Schulman J, Chen X, Bartlett PL, Sutskever I, Abbeel P (2016) RL²: Fast reinforcement learning via slow reinforcement learning. CoRR arxiv: abs/1611.02779

Eccles T, Bachrach Y, Lever G, Lazaridou A, Graepel T (2019) Biases for emergent communication in multi-agent reinforcement learning. In: Wallach H, Larochelle H, Beygelzimer A, Alche-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems 32, Curran Associates, Inc., pp 13111–13121. http://papers.nips.cc/paper/9470-biases-for-emergent-communication-in-multi-agent-reinforcement-learning.pdf

Everett R, Roberts S (2018) Learning against non-stationary agents with opponent modelling and deep reinforcement learning. In: 2018 AAAI Spring symposium series

Evtimova K, Drozdov A, Kiela D, Cho K (2018) Emergent communication in a multi-modal, multi-step referential game. In: International conference on learning representations. https://openreview.net/forum?id=rJGZq6g0-

Finn C, Levine S (2018) Meta-learning and universality: deep representations and gradient descent can approximate any learning algorithm. In: International conference on learning representations. https://openreview.net/forum?id=HyjC5yWCW

Foerster J, Assael IA, de Freitas N, Whiteson S (2016) Learning to communicate with deep multi-agent reinforcement learning. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, Curran Associates, Inc., pp 2137–2145. http://papers.nips.cc/paper/6042-learning-to-communicate-with-deep-multi-agent-reinforcement-learning.pdf

Foerster J, Nardelli N, Farquhar G, Afouras T, Torr PHS, Kohli P, Whiteson S (2017) Stabilising experience replay for deep multi-agent reinforcement learning. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, PMLR, International Convention Centre, Sydney, Australia, Proceedings of Machine Learning Research, vol 70, pp 1146–1155. http://proceedings.mlr.press/v70/foerster17b.html

Foerster J, Chen RY, Al-Shedivat M, Whiteson S, Abbeel P, Mordatch I (2018a) Learning with opponent-learning awareness. In: Proceedings of the 17th international conference on autonomous agents and multiagent systems, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, AAMAS ’18, pp 122–130. http://dl.acm.org/citation.cfm?id=3237383.3237408

Foerster J, Farquhar G, Afouras T, Nardelli N, Whiteson S (2018b) Counterfactual multi-agent policy gradients. https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193

Foerster J, Song F, Hughes E, Burch N, Dunning I, Whiteson S, Botvinick M, Bowling M (2019) Bayesian action decoder for deep multi-agent reinforcement learning. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, PMLR, Long Beach, California, USA, Proceedings of Machine Learning Research, vol 97, pp 1942–1951. http://proceedings.mlr.press/v97/foerster19a.html

Fulda N, Ventura D (2007) Predicting and preventing coordination problems in cooperative q-learning systems. In: Proceedings of the 20th international joint conference on artificial intelligence, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI’07, pp 780–785

García J, Fernández F (2015) A comprehensive survey on safe reinforcement learning. J Mach Learn Res 16(42):1437–1480. http://jmlr.org/papers/v16/garcia15a.html

Ghavamzadeh M, Mahadevan S, Makar R (2006) Hierarchical multi-agent reinforcement learning. Auton Agent Multi-Agent Syst. https://doi.org/10.1007/s10458-006-7035-4

Gleave A, Dennis M, Wild C, Kant N, Levine S, Russell S (2020) Adversarial policies: Attacking deep reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=HJgEMpVFwB

Goldman CV, Zilberstein S (2004) Decentralized control of cooperative systems: categorization and complexity analysis. J Artif Int Res 22(1):143–174. http://dl.acm.org/citation.cfm?id=1622487.1622493

Grover A, Al-Shedivat M, Gupta J, Burda Y, Edwards H (2018) Learning policy representations in multiagent systems. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, Stockholmsmässan, Stockholm Sweden, Proceedings of Machine Learning Research, vol 80, pp 1802–1811. http://proceedings.mlr.press/v80/grover18a.html

Guestrin C, Koller D, Parr R (2002) Multiagent planning with factored mdps. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14, MIT Press, pp 1523–1530. http://papers.nips.cc/paper/1941-multiagent-planning-with-factored-mdps.pdf

Gupta JK, Egorov M, Kochenderfer M (2017) Cooperative multi-agent control using deep reinforcement learning. In: Sukthankar G, Rodriguez-Aguilar JA (eds) autonomous agents and multiagent systems. Springer, Cham, pp 66–83

Hadfield-Menell D, Milli S, Abbeel P, Russell SJ, Dragan A (2017) Inverse reward design. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 6765–6774. http://papers.nips.cc/paper/7253-inverse-reward-design.pdf

Han D, Boehmer W, Wooldridge M, Rogers A (2019) Multi-agent hierarchical reinforcement learning with dynamic termination. In: Proceedings of the 18th international conference on autonomous agents and multiagent systems, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, AAMAS ’19, pp 2006–2008. http://dl.acm.org/citation.cfm?id=3306127.3331992

Hansen EA, Bernstein D, Zilberstein S (2004) Dynamic programming for partially observable stochastic games. In: AAAI

Hardin G (1968) The tragedy of the commons. Science 162(3859):1243–1248

Hausknecht M, Stone P (2015) Deep recurrent q-learning for partially observable mdps. https://www.aaai.org/ocs/index.php/FSS/FSS15/paper/view/11673

Havrylov S, Titov I (2017) Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 2149–2159. http://papers.nips.cc/paper/6810-emergence-of-language-with-multi-agent-games-learning-to-communicate-with-sequences-of-symbols.pdf

He H, Boyd-Graber J, Kwok K, Daumé III H (2016) Opponent modeling in deep reinforcement learning. In: Balcan MF, Weinberger KQ (eds) Proceedings of The 33rd international conference on machine learning, PMLR, New York, New York, USA, Proceedings of Machine Learning Research, vol 48, pp 1804–1813. http://proceedings.mlr.press/v48/he16.html

He H, Chen D, Balakrishnan A, Liang P (2018) Decoupling strategy and generation in negotiation dialogues. CoRR abs/1808.09637

Heinrich J, Silver D (2016) Deep reinforcement learning from self-play in imperfect-information games. CoRR abs/1603.01121

Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D (2018) Deep reinforcement learning that matters. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16669

Hernandez-Leal P, Kaisers M, Baarslag T, de Cote EM (2017) A survey of learning in multiagent environments: dealing with non-stationarity. CoRR abs/1707.09183

Hernandez-Leal P, Kartal B, Taylor ME (2019) Agent modeling as auxiliary task for deep reinforcement learning. CoRR abs/1907.09597

Hernandez-Leal P, Kartal B, Taylor ME (2019) A survey and critique of multiagent deep reinforcement learning. Auton Agent Multi-Agent Syst 33(6):750–797. https://doi.org/10.1007/s10458-019-09421-1

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Hong Z, Su S, Shann T, Chang Y, Lee C (2017) A deep policy inference q-network for multi-agent systems. CoRR abs/1712.07893

Hoshen Y (2017) Vain: Attentional multi-agent predictive modeling. In: Proceedings of the 31st international conference on neural information processing systems, Curran Associates Inc., USA, NIPS’17, pp 2698–2708. http://dl.acm.org/citation.cfm?id=3294996.3295030

Houthooft R, Chen X, Chen X, Duan Y, Schulman J, De Turck F, Abbeel P (2016) Vime: variational information maximizing exploration. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pp 1109–1117. http://papers.nips.cc/paper/6591-vime-variational-information-maximizing-exploration.pdf

Hu J, Wellman MP (1998) Multiagent reinforcement learning: theoretical framework and an algorithm. In: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, ICML ’98, pp 242–250. http://dl.acm.org/citation.cfm?id=645527.657296

Hu J, Wellman MP (2003) Nash q-learning for general-sum stochastic games. J Mach Learn Res 4:1039–1069

Hughes E, Leibo JZ, Phillips M, Tuyls K, Dueñez Guzman E, García Castañeda A, Dunning I, Zhu T, McKee K, Koster R, Roff H, Graepel T (2018) Inequity aversion improves cooperation in intertemporal social dilemmas. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems 31, Curran Associates, Inc., pp 3326–3336. http://papers.nips.cc/paper/7593-inequity-aversion-improves-cooperation-in-intertemporal-social-dilemmas.pdf

Iqbal S, Sha F (2019) Actor-attention-critic for multi-agent reinforcement learning. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, PMLR, Long Beach, California, USA, Proceedings of machine learning research, vol 97, pp 2961–2970. http://proceedings.mlr.press/v97/iqbal19a.html

Islam R, Henderson P, Gomrokchi M, Precup D (2017) Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. CoRR abs/1708.04133

Jaderberg M, Czarnecki WM, Dunning I, Marris L, Lever G, Castañeda AG, Beattie C, Rabinowitz NC, Morcos AS, Ruderman A, Sonnerat N, Green T, Deason L, Leibo JZ, Silver D, Hassabis D, Kavukcuoglu K, Graepel T (2019) Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science 364(6443):859–865

Jain U, Weihs L, Kolve E, Rastegari M, Lazebnik S, Farhadi A, Schwing AG, Kembhavi A (2019) Two body problem: Collaborative visual task completion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)

Jaques N, Lazaridou A, Hughes E, Gülçehre Ç, Ortega PA, Strouse D, Leibo JZ, de Freitas N (2018) Intrinsic social motivation via causal influence in multi-agent RL. CoRR abs/1810.08647

Jaques N, Lazaridou A, Hughes E, Gulcehre C, Ortega P, Strouse D, Leibo JZ, De Freitas N (2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In: International conference on machine learning, pp 3040–3049

Jiang J, Lu Z (2018) Learning attentional communication for multi-agent cooperation. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems 31, Curran Associates, Inc., pp 7254–7264. http://papers.nips.cc/paper/7956-learning-attentional-communication-for-multi-agent-cooperation.pdf

Johnson M, Hofmann K, Hutton T, Bignell D (2016) The Malmo platform for artificial intelligence experimentation. In: Proceedings of the twenty-fifth international joint conference on artificial intelligence, AAAI Press, IJCAI’16, pp 4246–4247. http://dl.acm.org/citation.cfm?id=3061053.3061259

Jorge E, Kågebäck M, Gustavsson E (2016) Learning to play Guess Who? and inventing a grounded language as a consequence. CoRR abs/1611.03218

Juliani A, Berges V, Vckay E, Gao Y, Henry H, Mattar M, Lange D (2018) Unity: a general platform for intelligent agents. CoRR abs/1809.02627

Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4(1):237–285. http://dl.acm.org/citation.cfm?id=1622737.1622748

Kasai T, Tenmoto H, Kamiya A (2008) Learning of communication codes in multi-agent reinforcement learning problem. In: 2008 IEEE conference on soft computing in industrial applications, pp 1–6

Kim W, Cho M, Sung Y (2019) Message-dropout: An efficient training method for multi-agent deep reinforcement learning. In: Proceedings of the AAAI conference on artificial intelligence 33(01):6079–6086

Kirby S (2002) Natural language from artificial life. Artif Life 8(2):185–215. https://doi.org/10.1162/106454602320184248

Kok JR, Vlassis N (2006) Collaborative multiagent reinforcement learning by payoff propagation. J Mach Learn Res 7:1789–1828. http://dl.acm.org/citation.cfm?id=1248547.1248612

Kollock P (1998) Social dilemmas: the anatomy of cooperation. Annu Rev Sociol 24(1):183–214. https://doi.org/10.1146/annurev.soc.24.1.183

Kong X, Xin B, Liu F, Wang Y (2017) Revisiting the master-slave architecture in multi-agent deep reinforcement learning. CoRR abs/1712.07305

Kraemer L, Banerjee B (2016) Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190:82–94

Kumar S, Shah P, Hakkani-Tür D, Heck LP (2017) Federated control with hierarchical multi-agent deep reinforcement learning. CoRR abs/1712.08266

Lanctot M, Zambaldi V, Gruslys A, Lazaridou A, Tuyls K, Perolat J, Silver D, Graepel T (2017) A unified game-theoretic approach to multiagent reinforcement learning. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 4190–4203. http://papers.nips.cc/paper/7007-a-unified-game-theoretic-approach-to-multiagent-reinforcement-learning.pdf

Lange PAV, Joireman J, Parks CD, Dijk EV (2013) The psychology of social dilemmas: a review. Organ Behav Hum Decis Process 120(2):125–141

Lauer M, Riedmiller M (2000) An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In: Proceedings of the seventeenth international conference on machine learning, Morgan Kaufmann, pp 535–542

Laurent GJ, Matignon L, Fort-Piat NL (2011) The world of independent learners is not Markovian. Int J Knowl-Based Intell Eng Syst 15(1):55–64. http://dl.acm.org/citation.cfm?id=1971886.1971887

Lazaridou A, Baroni M (2020) Emergent multi-agent communication in the deep learning era. arXiv abs/2006.02419

Lazaridou A, Peysakhovich A, Baroni M (2017) Multi-agent cooperation and the emergence of (natural) language. In: 5th international conference on learning representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. https://openreview.net/forum?id=Hk8N3Sclg

Lazaridou A, Hermann KM, Tuyls K, Clark S (2018) Emergence of linguistic communication from referential games with symbolic and pixel input. In: International conference on learning representations. https://openreview.net/forum?id=HJGv1Z-AW

Le HM, Yue Y, Carr P, Lucey P (2017) Coordinated multi-agent imitation learning. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, PMLR, International Convention Centre, Sydney, Australia, Proceedings of Machine Learning Research, vol 70, pp 1995–2003. http://proceedings.mlr.press/v70/le17a.html

Lee J, Cho K, Weston J, Kiela D (2017) Emergent translation in multi-agent communication. CoRR abs/1710.06922

Lee Y, Yang J, Lim JJ (2020) Learning to coordinate manipulation skills via skill behavior diversification. In: International conference on learning representations. https://openreview.net/forum?id=ryxB2lBtvH

Leibo JZ, Zambaldi V, Lanctot M, Marecki J, Graepel T (2017) Multi-agent reinforcement learning in sequential social dilemmas. In: Proceedings of the 16th conference on autonomous agents and multiagent systems, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, AAMAS ’17, pp 464–473. http://dl.acm.org/citation.cfm?id=3091125.3091194

Leibo JZ, Hughes E, Lanctot M, Graepel T (2019) Autocurricula and the emergence of innovation from social interaction: a manifesto for multi-agent intelligence research. CoRR abs/1903.00742

Lerer A, Peysakhovich A (2017) Maintaining cooperation in complex social dilemmas using deep reinforcement learning. CoRR abs/1707.01068

Letcher A, Foerster J, Balduzzi D, Rocktäschel T, Whiteson S (2019) Stable opponent shaping in differentiable games. In: International conference on learning representations. https://openreview.net/forum?id=SyGjjsC5tQ

Levine S, Finn C, Darrell T, Abbeel P (2016) End-to-end training of deep visuomotor policies. J Mach Learn Res 17(1):1334–1373. http://dl.acm.org/citation.cfm?id=2946645.2946684

Lewis M, Yarats D, Dauphin YN, Parikh D, Batra D (2017) Deal or no deal? End-to-end learning for negotiation dialogues. CoRR abs/1706.05125

Li F, Bowling M (2019) Ease-of-teaching and language structure from emergent communication. In: Wallach H, Larochelle H, Beygelzimer A, Alche-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems 32, Curran Associates, Inc., pp 15851–15861. http://papers.nips.cc/paper/9714-ease-of-teaching-and-language-structure-from-emergent-communication.pdf

Li S, Wu Y, Cui X, Dong H, Fang F, Russell S (2019a) Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. Proc AAAI Conf Artif Intell 33(01):4213–4220

Li X, Sun M, Li P (2019b) Multi-agent discussion mechanism for natural language generation. Proc AAAI Conf Artif Intell 33(01):6096–6103

Li Y (2018) Deep reinforcement learning. CoRR abs/1810.06339

Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2016) Continuous control with deep reinforcement learning. In: ICLR (Poster). http://arxiv.org/abs/1509.02971

Lin K, Zhao R, Xu Z, Zhou J (2018) Efficient large-scale fleet management via multi-agent deep reinforcement learning. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, ACM, New York, NY, USA, KDD ’18, pp 1774–1783. https://doi.org/10.1145/3219819.3219993

Lin X, Beling PA, Cogill R (2018) Multiagent inverse reinforcement learning for two-person zero-sum games. IEEE Trans Games 10(1):56–68. https://doi.org/10.1109/TCIAIG.2017.2679115

Littman M (2001) Value-function reinforcement learning in Markov games. Cogn Syst Res 2:55–66

Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the eleventh international conference on international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML’94, pp 157–163. http://dl.acm.org/citation.cfm?id=3091574.3091594

Liu IJ, Yeh RA, Schwing AG (2020) Pic: Permutation invariant critic for multi-agent deep reinforcement learning. In: PMLR, proceedings of machine learning research, vol 100, pp 590–602. http://proceedings.mlr.press/v100/liu20a.html

Liu S, Lever G, Heess N, Merel J, Tunyasuvunakool S, Graepel T (2019) Emergent coordination through competition. In: International conference on learning representations. https://openreview.net/forum?id=BkG8sjR5Km

Long Q, Zhou Z, Gupta A, Fang F, Wu Y, Wang X (2020) Evolutionary population curriculum for scaling multi-agent reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=SJxbHkrKDH

Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 6379–6390. http://papers.nips.cc/paper/7217-multi-agent-actor-critic-for-mixed-cooperative-competitive-environments.pdf

Lowe R, Foerster JN, Boureau Y, Pineau J, Dauphin YN (2019) On the pitfalls of measuring emergent communication. CoRR abs/1903.05168

Luketina J, Nardelli N, Farquhar G, Foerster JN, Andreas J, Grefenstette E, Whiteson S, Rocktäschel T (2019) A survey of reinforcement learning informed by natural language. CoRR abs/1906.03926

Luong NC, Hoang DT, Gong S, Niyato D, Wang P, Liang Y, Kim DI (2019) Applications of deep reinforcement learning in communications and networking: a survey. IEEE Communications Surveys Tutorials pp 1–1. https://doi.org/10.1109/COMST.2019.2916583

Lux T, Marchesi M (1999) Scaling and criticality in a stochastic multi-agent model of a financial market. Nature 397(6719):498–500. https://doi.org/10.1038/17290

Lyu X, Amato C (2020) Likelihood quantile networks for coordinating multi-agent reinforcement learning. In: Proceedings of the 19th international conference on autonomous agents and multiagent systems, pp 798–806

Ma J, Wu F (2020) Feudal multi-agent deep reinforcement learning for traffic signal control. In: Seghrouchni AEF, Sukthankar G, An B, Yorke-Smith N (eds) Proceedings of the 19th international conference on autonomous agents and multiagent systems, AAMAS ’20, Auckland, New Zealand, May 9-13, 2020, International Foundation for Autonomous Agents and Multiagent Systems, pp 816–824. https://dl.acm.org/doi/abs/10.5555/3398761.3398858

Makar R, Mahadevan S, Ghavamzadeh M (2001) Hierarchical multi-agent reinforcement learning. In: Proceedings of the fifth international conference on autonomous agents, ACM, New York, NY, USA, AGENTS ’01, pp 246–253. https://doi.org/10.1145/375735.376302

Matignon L, Laurent GJ, Le Fort-Piat N (2007) Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In: 2007 IEEE/RSJ international conference on intelligent robots and systems, pp 64–69

Matignon L, Jeanpierre L, Mouaddib AI (2012a) Coordinated multi-robot exploration under communication constraints using decentralized markov decision processes. https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/view/5038

Matignon L, Laurent GJ, Le Fort-Piat N (2012b) Review: independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. Knowl Eng Rev 27(1):1–31. https://doi.org/10.1017/S0269888912000057

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D (2015) Human-level control through deep reinforcement learning. Nature 518:529–533. https://doi.org/10.1038/nature14236

Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: Balcan MF, Weinberger KQ (eds) Proceedings of The 33rd international conference on machine learning, PMLR, New York, New York, USA, Proceedings of machine learning research, vol 48, pp 1928–1937. http://proceedings.mlr.press/v48/mniha16.html

Moerland TM, Broekens J, Jonker CM (2018) Emotion in reinforcement learning agents and robots: a survey. Mach Learn 107(2):443–480. https://doi.org/10.1007/s10994-017-5666-0

Mordatch I, Abbeel P (2018) Emergence of grounded compositional language in multi-agent populations. https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17007

Nair R, Tambe M, Yokoo M, Pynadath D, Marsella S (2003) Taming decentralized pomdps: towards efficient policy computation for multiagent settings. In: Proceedings of the 18th international joint conference on artificial intelligence, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI’03, pp 705–711. http://dl.acm.org/citation.cfm?id=1630659.1630762

Narvekar S, Sinapov J, Leonetti M, Stone P (2016) Source task creation for curriculum learning. In: Proceedings of the 2016 international conference on autonomous agents & multiagent systems, international foundation for autonomous agents and multiagent systems, Richland, SC, AAMAS ’16, pp 566–574. http://dl.acm.org/citation.cfm?id=2936924.2937007

Nedic A, Ozdaglar A (2009) Distributed subgradient methods for multi-agent optimization. IEEE Trans Autom Control 54(1):48–61

Ng AY, Russell SJ (2000) Algorithms for inverse reinforcement learning. In: Proceedings of the seventeenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML ’00, pp 663–670. http://dl.acm.org/citation.cfm?id=645529.657801

Ng AY, Harada D, Russell S (1999) Policy invariance under reward transformations: theory and application to reward shaping. In: Proceedings of the sixteenth international conference on machine learning, Morgan Kaufmann, pp 278–287

Nguyen DT, Kumar A, Lau HC (2017a) Collective multiagent sequential decision making under uncertainty. https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14891

Nguyen DT, Kumar A, Lau HC (2017b) Policy gradient with value function approximation for collective multiagent planning. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 4319–4329. http://papers.nips.cc/paper/7019-policy-gradient-with-value-function-approximation-for-collective-multiagent-planning.pdf

Nguyen DT, Kumar A, Lau HC (2018) Credit assignment for collective multiagent RL with global rewards. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems 31, Curran Associates, Inc., pp 8102–8113. http://papers.nips.cc/paper/8033-credit-assignment-for-collective-multiagent-rl-with-global-rewards.pdf

Nguyen TT, Nguyen ND, Nahavandi S (2020) Deep reinforcement learning for multiagent systems: a review of challenges, solutions, and applications. IEEE Trans Cybern 50(9):3826–3839

Oliehoek FA, Amato C (2016) A Concise Introduction to Decentralized POMDPs, 1st edn. Springer Publishing Company, Berlin

Oliehoek FA, Spaan MTJ, Vlassis N (2008) Optimal and approximate q-value functions for decentralized pomdps. J Artif Int Res 32(1):289–353. http://dl.acm.org/citation.cfm?id=1622673.1622680

Omidshafiei S, Pazis J, Amato C, How JP, Vian J (2017) Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, PMLR, International Convention Centre, Sydney, Australia, Proceedings of machine learning research, vol 70, pp 2681–2690. http://proceedings.mlr.press/v70/omidshafiei17a.html

Omidshafiei S, Kim DK, Liu M, Tesauro G, Riemer M, Amato C, Campbell M, How JP (2019) Learning to teach in cooperative multiagent reinforcement learning. Proc AAAI Conf Artif Intell 33(01):6128–6136

Oroojlooyjadid A, Hajinezhad D (2019) A review of cooperative multi-agent deep reinforcement learning. arXiv abs/1908.03963

Oudeyer PY, Kaplan F (2007) What is intrinsic motivation? A typology of computational approaches. Front Neurorobotics 1:6–6

Palmer G, Tuyls K, Bloembergen D, Savani R (2018) Lenient multi-agent deep reinforcement learning. In: Proceedings of the 17th international conference on autonomous agents and multiagent systems, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, AAMAS ’18, pp 443–451. http://dl.acm.org/citation.cfm?id=3237383.3237451

Palmer G, Savani R, Tuyls K (2019) Negative update intervals in deep multi-agent reinforcement learning. In: Proceedings of the 18th international conference on autonomous agents and multiagent systems, pp 43–51

Panait L, Luke S (2005) Cooperative multi-agent learning: the state of the art. Auton Agent Multi-Agent Syst 11(3):387–434. https://doi.org/10.1007/s10458-005-2631-2

Panait L, Sullivan K, Luke S (2006) Lenient learners in cooperative multiagent systems. In: Proceedings of the fifth international joint conference on autonomous agents and multiagent systems, Association for Computing Machinery, New York, NY, USA, AAMAS ’06, pp 801–803. https://doi.org/10.1145/1160633.1160776

Papoudakis G, Christianos F, Rahman A, Albrecht SV (2019) Dealing with non-stationarity in multi-agent deep reinforcement learning. CoRR abs/1906.04737

Pathak D, Agrawal P, Efros AA, Darrell T (2017) Curiosity-driven exploration by self-supervised prediction. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, PMLR, International Convention Centre, Sydney, Australia, Proceedings of Machine Learning Research, vol 70, pp 2778–2787. http://proceedings.mlr.press/v70/pathak17a.html

Peng P, Yuan Q, Wen Y, Yang Y, Tang Z, Long H, Wang J (2017) Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. CoRR abs/1703.10069

Pérolat J, Leibo JZ, Zambaldi V, Beattie C, Tuyls K, Graepel T (2017) A multi-agent reinforcement learning model of common-pool resource appropriation. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 3643–3652. http://papers.nips.cc/paper/6955-a-multi-agent-reinforcement-learning-model-of-common-pool-resource-appropriation.pdf

Peysakhovich A, Lerer A (2018) Prosocial learning agents solve generalized stag hunts better than selfish ones. In: Proceedings of the 17th international conference on autonomous agents and multiagent systems, international Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, AAMAS ’18, pp 2043–2044. http://dl.acm.org/citation.cfm?id=3237383.3238065

Pinto L, Davidson J, Sukthankar R, Gupta A (2017) Robust adversarial reinforcement learning. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, PMLR, International Convention Centre, Sydney, Australia, Proceedings of machine learning research, vol 70, pp 2817–2826. http://proceedings.mlr.press/v70/pinto17a.html

Pinyol I, Sabater-Mir J (2013) Computational trust and reputation models for open multi-agent systems: a review. Artif Intell Rev 40(1):1–25. https://doi.org/10.1007/s10462-011-9277-z

Potter MA, De Jong KA (1994) A cooperative coevolutionary approach to function optimization. In: Davidor Y, Schwefel HP, Männer R (eds) Parallel problem solving from nature - PPSN III. Springer, Berlin, pp 249–257

Qu G, Wierman A, Li N (2020) Scalable reinforcement learning of localized policies for multi-agent networked systems. PMLR, The Cloud, Proceedings of machine learning research, vol 120, pp 256–266. http://proceedings.mlr.press/v120/qu20a.html

Rabinowitz N, Perbet F, Song F, Zhang C, Eslami SMA, Botvinick M (2018) Machine theory of mind. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, Stockholmsmässan, Stockholm Sweden, Proceedings of machine learning research, vol 80, pp 4218–4227. http://proceedings.mlr.press/v80/rabinowitz18a.html

Raghu M, Irpan A, Andreas J, Kleinberg B, Le Q, Kleinberg J (2018) Can deep reinforcement learning solve Erdos-Selfridge-Spencer games? In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, Stockholmsmässan, Stockholm Sweden, Proceedings of machine learning research, vol 80, pp 4238–4246. http://proceedings.mlr.press/v80/raghu18a.html

Raileanu R, Denton E, Szlam A, Fergus R (2018) Modeling others using oneself in multi-agent reinforcement learning. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, Stockholmsmässan, Stockholm Sweden, Proceedings of machine learning research, vol 80, pp 4257–4266. http://proceedings.mlr.press/v80/raileanu18a.html

Ramchurn SD, Huynh D, Jennings NR (2004) Trust in multi-agent systems. Knowl Eng Rev 19(1):1–25. https://doi.org/10.1017/S0269888904000116

Rashid T, Samvelyan M, Schroeder C, Farquhar G, Foerster J, Whiteson S (2018) QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, Stockholmsmässan, Stockholm Sweden, Proceedings of machine learning research, vol 80, pp 4295–4304. http://proceedings.mlr.press/v80/rashid18a.html

Russell S, Zimdars AL (2003) Q-decomposition for reinforcement learning agents. In: Proceedings of the twentieth international conference on international conference on machine learning, AAAI Press, ICML’03, pp 656–663. http://dl.acm.org/citation.cfm?id=3041838.3041921

Schaul T, Horgan D, Gregor K, Silver D (2015) Universal value function approximators. In: Proceedings of the 32nd international conference on international conference on machine learning - volume 37, JMLR.org, ICML’15, pp 1312–1320

Schmidhuber J (2010) Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Trans Auton Ment Dev 2(3):230–247. https://doi.org/10.1109/TAMD.2010.2056368

Schmidhuber J, Zhao J, Wiering M (1996) Simple principles of metalearning. Tech. rep

Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347

Sen S, Weiss G (1999) Multiagent systems. MIT Press, Cambridge, MA, USA. http://dl.acm.org/citation.cfm?id=305606.305612

Sequeira P, Melo FS, Prada R, Paiva A (2011) Emerging social awareness: exploring intrinsic motivation in multiagent learning. In: 2011 IEEE international conference on development and learning (ICDL), vol 2, pp 1–6. https://doi.org/10.1109/DEVLRN.2011.6037325

Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. CoRR abs/1610.03295

Shapley LS (1953) Stochastic games. Proc Nat Acad Sci 39(10):1095–1100

Shoham Y, Leyton-Brown K (2008) Multiagent systems: algorithmic, game-theoretic, and logical foundations. Cambridge University Press, USA

Shoham Y, Powers R, Grenager T (2003) Multi-agent reinforcement learning: a critical survey. Tech. rep

Silva FLD, Taylor ME, Costa AHR (2018) Autonomously reusing knowledge in multiagent reinforcement learning. In: Proceedings of the twenty-seventh international joint conference on artificial intelligence, IJCAI-18, International joint conferences on artificial intelligence organization, pp 5487–5493. https://doi.org/10.24963/ijcai.2018/774

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D (2016) Mastering the game of go with deep neural networks and tree search. Nature 529:484–489. https://doi.org/10.1038/nature16961

Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, Lillicrap T, Simonyan K, Hassabis D (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362(6419):1140–1144

Singh A, Jain T, Sukhbaatar S (2019) Learning when to communicate at scale in multiagent cooperative and competitive tasks. In: International conference on learning representations. https://openreview.net/forum?id=rye7knCqK7

Son K, Kim D, Kang WJ, Hostallero DE, Yi Y (2019) Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In: International conference on machine learning, pp 5887–5896

Song J, Ren H, Sadigh D, Ermon S (2018) Multi-agent generative adversarial imitation learning. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, Curran Associates, Inc., vol 31, pp 7461–7472. https://proceedings.neurips.cc/paper/2018/file/240c945bb72980130446fc2b40fbb8e0-Paper.pdf

Song Y, Wang J, Lukasiewicz T, Xu Z, Xu M, Ding Z, Wu L (2019) Arena: a general evaluation platform and building toolkit for multi-agent intelligence. CoRR abs/1905.08085

Spooner T, Savani R (2020) Robust market making via adversarial reinforcement learning. In: Proceedings of the 19th international conference on autonomous agents and multiagent systems, pp 2014–2016

Srinivasan S, Lanctot M, Zambaldi V, Perolat J, Tuyls K, Munos R, Bowling M (2018) Actor-critic policy optimization in partially observable multiagent environments. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems 31, Curran Associates, Inc., pp 3422–3435. http://papers.nips.cc/paper/7602-actor-critic-policy-optimization-in-partially-observable-multiagent-environments.pdf

Stone P, Veloso M (2000) Multiagent systems: a survey from a machine learning perspective. Auton Robots 8(3):345–383. https://doi.org/10.1023/A:1008942012299

Strouse D, Kleiman-Weiner M, Tenenbaum J, Botvinick M, Schwab DJ (2018) Learning to share and hide intentions using information regularization. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems 31, Curran Associates, Inc., pp 10249–10259. http://papers.nips.cc/paper/8227-learning-to-share-and-hide-intentions-using-information-regularization.pdf

Sukhbaatar S, szlam a, Fergus R (2016) Learning multiagent communication with backpropagation. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems 29, Curran Associates, Inc., pp 2244–2252. http://papers.nips.cc/paper/6398-learning-multiagent-communication-with-backpropagation.pdf

Sukhbaatar S, Kostrikov I, Szlam A, Fergus R (2017) Intrinsic motivation and automatic curricula via asymmetric self-play. CoRR arxiv: abs/1703.05407

Sunehag P, Lever G, Gruslys A, Czarnecki WM, Zambaldi V, Jaderberg M, Lanctot M, Sonnerat N, Leibo JZ, Tuyls K, Graepel T (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In: Proceedings of the 17th international conference on autonomous agents and multiagent systems, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, AAMAS ’18, pp 2085–2087. http://dl.acm.org/citation.cfm?id=3237383.3238080

Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. Adaptive computation and machine learning, MIT Press. http://www.worldcat.org/oclc/37293240

Sutton RS, Precup D, Singh S (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artif Intell 112(1):181–211

Svetlik M, Leonetti M, Sinapov J, Shah R, Walker N, Stone P (2017) Automatic curriculum graph generation for reinforcement learning agents. In: Proceedings of the 31st AAAI conference on artificial intelligence. https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14961

Tacchetti A, Song HF, Mediano PAM, Zambaldi V, Kramár J, Rabinowitz NC, Graepel T, Botvinick M, Battaglia PW (2019) Relational forward models for multi-agent learning. In: International conference on learning representations. https://openreview.net/forum?id=rJlEojAqFm

Tampuu A, Matiisen T, Kodelja D, Kuzovkin I, Korjus K, Aru J, Aru J, Vicente R (2017) Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 12(4):1–15. https://doi.org/10.1371/journal.pone.0172395

Tan M (1993) Multi-agent reinforcement learning: Independent vs. cooperative agents. In: Proceedings of the tenth international conference on machine learning, Morgan Kaufmann, pp 330–337

Tang H, Hao J, Lv T, Chen Y, Zhang Z, Jia H, Ren C, Zheng Y, Fan C, Wang L (2018) Hierarchical deep multiagent reinforcement learning. CoRR arxiv: abs/1809.09332

Taylor A, Dusparic I, Cahill V (2013) Transfer learning in multi-agent systems through parallel transfer. In: Workshop on theoretically grounded transfer learning at the 30th international conference on machine learning (poster)

Taylor ME, Stone P (2009) Transfer learning for reinforcement learning domains: a survey. J Mach Learn Res 10:1633–1685. http://dl.acm.org/citation.cfm?id=1577069.1755839

Tesauro G (2004) Extending q-learning to general adaptive multi-agent systems. In: Thrun S, Saul LK, Schölkopf B (eds) Advances in neural information processing systems 16, MIT Press, pp 871–878. http://papers.nips.cc/paper/2503-extending-q-learning-to-general-adaptive-multi-agent-systems.pdf

Tumer K, Wolpert DH (2004) Collectives and the design of complex systems. Springer, Berlin

Tuyls K, Weiss G (2012) Multiagent learning: basics, challenges, and prospects. AI Mag 33(3):41

Vezhnevets AS, Osindero S, Schaul T, Heess N, Jaderberg M, Silver D, Kavukcuoglu K (2017) FeUdal networks for hierarchical reinforcement learning. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, PMLR, International Convention Centre, Sydney, Australia, Proceedings of Machine Learning Research, vol 70, pp 3540–3549. http://proceedings.mlr.press/v70/vezhnevets17a.html

Vezhnevets AS, Wu Y, Leblond R, Leibo JZ (2019) Options as responses: grounding behavioural hierarchies in multi-agent RL. CoRR arxiv: abs/1906.01470

Vinyals O, Ewalds T, Bartunov S, Georgiev P, Vezhnevets AS, Yeo M, Makhzani A, Küttler H, Agapiou J, Schrittwieser J, Quan J, Gaffney S, Petersen S, Simonyan K, Schaul T, van Hasselt H, Silver D, Lillicrap TP, Calderone K, Keet P, Brunasso A, Lawrence D, Ekermo A, Repp J, Tsing R (2017) Starcraft II: a new challenge for reinforcement learning. CoRR arxiv: abs/1708.04782

Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, Powell R, Ewalds T, Georgiev P, Oh J, Horgan D, Kroiss M, Danihelka I, Huang A, Sifre L, Cai T, Agapiou JP, Jaderberg M, Vezhnevets AS, Leblond R, Pohlen T, Dalibard V, Budden D, Sulsky Y, Molloy J, Paine TL, Gulcehre C, Wang Z, Pfaff T, Wu Y, Ring R, Yogatama D, Wünsch D, McKinney K, Smith O, Schaul T, Lillicrap T, Kavukcuoglu K, Hassabis D, Apps C, Silver D (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575(7782):350–354. https://doi.org/10.1038/s41586-019-1724-z

Wang JX, Kurth-Nelson Z, Tirumala D, Soyer H, Leibo JZ, Munos R, Blundell C, Kumaran D, Botvinick M (2016a) Learning to reinforcement learn. CoRR arxiv: abs/1611.05763

Wang JX, Hughes E, Fernando C, Czarnecki WM, Duéñez Guzmán EA, Leibo JZ (2019) Evolving intrinsic motivations for altruistic behavior. In: Proceedings of the 18th international conference on autonomous agents and multiagent systems, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, AAMAS ’19, pp 683–692. http://dl.acm.org/citation.cfm?id=3306127.3331756

Wang S, Wan J, Zhang D, Li D, Zhang C (2016b) Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback and coordination. Comput Netw 101:158–168. https://doi.org/10.1016/j.comnet.2015.12.017. http://www.sciencedirect.com/science/article/pii/S1389128615005046 (Industrial Technologies and Applications for the Internet of Things)

Wang T, Dong H, Lesser VR, Zhang C (2020a) ROMA: multi-agent reinforcement learning with emergent roles. CoRR arxiv: abs/2003.08039

Wang T, Wang J, Wu Y, Zhang C (2020b) Influence-based multi-agent exploration. In: International conference on learning representations. https://openreview.net/forum?id=BJgy96EYvr

Wang T, Wang J, Zheng C, Zhang C (2020c) Learning nearly decomposable value functions via communication minimization. In: International conference on learning representations. https://openreview.net/forum?id=HJx-3grYDB

Wei E, Luke S (2016) Lenient learning in independent-learner stochastic cooperative games. J Mach Learn Res 17(84):1–42. http://jmlr.org/papers/v17/15-417.html

Wei E, Wicke D, Freelan D, Luke S (2018) Multiagent soft q-learning. In: AAAI spring symposium series. https://www.aaai.org/ocs/index.php/SSS/SSS18/paper/view/17508

Ren W, Beard RW, Atkins EM (2005) A survey of consensus problems in multi-agent coordination. In: Proceedings of the 2005 American control conference, vol 3, pp 1859–1864. https://doi.org/10.1109/ACC.2005.1470239

Weiß G (1995) Distributed reinforcement learning. In: Steels L (ed) The biology and technology of intelligent autonomous agents. Springer, Berlin, pp 415–428

Weiss G (ed) (1999) Multiagent systems: a modern approach to distributed artificial intelligence. MIT Press, Cambridge

Wiegand RP (2004) An analysis of cooperative coevolutionary algorithms. PhD thesis, USA, AAI3108645

Wolpert DH, Tumer K (1999) An introduction to collective intelligence. CoRR cs.LG/9908014. http://arxiv.org/abs/cs.LG/9908014

Wu C, Rajeswaran A, Duan Y, Kumar V, Bayen AM, Kakade S, Mordatch I, Abbeel P (2018) Variance reduction for policy gradient with action-dependent factorized baselines. In: International conference on learning representations. https://openreview.net/forum?id=H1tSsb-AW

Yang E, Gu D (2004) Multiagent reinforcement learning for multi-robot systems: a survey. Tech. rep

Yang J, Nakhaei A, Isele D, Fujimura K, Zha H (2020) Cm3: Cooperative multi-goal multi-stage multi-agent reinforcement learning. In: International conference on learning representations. https://openreview.net/forum?id=S1lEX04tPr

Yang T, Meng Z, Hao J, Zhang C, Zheng Y (2018a) Bayes-tomop: a fast detection and best response algorithm towards sophisticated opponents. CoRR arxiv: abs/1809.04240

Yang Y, Luo R, Li M, Zhou M, Zhang W, Wang J (2018b) Mean field multi-agent reinforcement learning. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, Stockholmsmässan, Stockholm Sweden, Proceedings of machine learning research, vol 80, pp 5571–5580. http://proceedings.mlr.press/v80/yang18d.html

Yu C, Zhang M, Ren F (2013) Emotional multiagent reinforcement learning in social dilemmas. In: Boella G, Elkind E, Savarimuthu BTR, Dignum F, Purvis MK (eds) PRIMA 2013: principles and practice of multi-agent systems. Springer, Berlin, pp 372–387

Yu H, Shen Z, Leung C, Miao C, Lesser VR (2013) A survey of multi-agent trust management systems. IEEE Access 1:35–50. https://doi.org/10.1109/ACCESS.2013.2259892

Yu L, Song J, Ermon S (2019) Multi-agent adversarial inverse reinforcement learning. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, PMLR, Long Beach, California, USA, Proceedings of machine learning research, vol 97, pp 7194–7201. http://proceedings.mlr.press/v97/yu19e.html

Zhang K, Yang Z, Basar T (2018) Networked multi-agent reinforcement learning in continuous spaces. In: 2018 IEEE conference on decision and control (CDC), pp 2771–2776

Zhang K, Yang Z, Liu H, Zhang T, Basar T (2018) Fully decentralized multi-agent reinforcement learning with networked agents. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, PMLR, Stockholmsmässan, Stockholm Sweden, Proceedings of machine learning research, vol 80, pp 5872–5881. http://proceedings.mlr.press/v80/zhang18n.html

Zhang K, Yang Z, Başar T (2019) Multi-agent reinforcement learning: a selective overview of theories and algorithms. ArXiv arxiv: abs/1911.10635

Zhang W, Bastani O (2019) Mamps: Safe multi-agent reinforcement learning via model predictive shielding. ArXiv arxiv: abs/1910.12639

Zheng Y, Meng Z, Hao J, Zhang Z (2018a) Weighted double deep multiagent reinforcement learning in stochastic cooperative environments. In: Geng X, Kang BH (eds) PRICAI 2018: trends in artificial intelligence. Springer International Publishing, Cham, pp 421–429

Zheng Y, Meng Z, Hao J, Zhang Z, Yang T, Fan C (2018b) A deep bayesian policy reuse approach against non-stationary agents. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems 31, Curran Associates, Inc., pp 954–964. http://papers.nips.cc/paper/7374-a-deep-bayesian-policy-reuse-approach-against-non-stationary-agents.pdf

Zhu H, Kirley M (2019) Deep multi-agent reinforcement learning in a common-pool resource system. In: 2019 IEEE congress on evolutionary computation (CEC), pp 142–149. https://doi.org/10.1109/CEC.2019.8790001

Zhu Z, Biyik E, Sadigh D (2020) Multi-agent safe planning with gaussian processes. ArXiv arxiv: abs/2008.04452
