[Review] Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications

This post reviews the challenges, solutions, and applications of deep reinforcement learning (DRL) in multiagent systems. Moving from the single-agent DQN to multiagent deep reinforcement learning (MADRL) raises issues such as nonstationarity, partial observability, training schemes, and continuous action spaces. MADRL has shown promise in games, traffic control, and cooperative robotics, but still faces challenges in human-machine collaboration, model efficiency, and scalability to large systems. Future research directions include human-in-the-loop architectures, model-based MADRL, and optimization strategies for handling heterogeneous agents.

Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications

Abstract:

Reinforcement learning (RL) algorithms have been around for decades and employed to solve various sequential decision-making problems. These algorithms, however, have faced great challenges when dealing with high-dimensional environments. The recent development of deep learning has enabled RL methods to drive optimal policies for sophisticated and capable agents, which can perform efficiently in these challenging environments.

This article addresses an important aspect of deep RL related to situations that require multiple agents to communicate and cooperate to solve complex tasks. A survey of different approaches to problems related to multiagent deep RL (MADRL) is presented, including nonstationarity, partial observability, continuous state and action spaces, multiagent training schemes, and multiagent transfer learning. The merits and demerits of the reviewed methods are analyzed and discussed, and their corresponding applications are explored. It is envisaged that this review provides insights into various MADRL methods and can lead to the future development of more robust and highly useful multiagent learning methods for solving real-world problems.
Published in: IEEE Transactions on Cybernetics (Volume: 50, Issue: 9, September 2020)
Page(s): 3826 - 3839
Date of Publication: 20 March 2020
SECTION I.

Introduction

Reinforcement learning was instigated by a trial-and-error (TE) procedure, conducted by Thorndike in an experiment on cats' behaviors in 1898 [1]. In 1954, Minsky [2] designed the first neural computer, called stochastic neural-analog reinforcement calculators (SNARCs), which simulated the rat's brain to solve a maze puzzle. SNARCs marked the uplift of TE learning into a computational period. Almost two decades later, Klopf [3] integrated the mechanism of temporal-difference (TD) learning from psychology into the computational model of TE learning. That integration succeeded in making TE learning a feasible approach for large systems. In 1989, Watkins and Dayan [4] brought the theory of optimal control [5], including the Bellman equation and the Markov decision process (MDP), together with TD learning to form the well-known Q-learning. Since then, Q-learning has been applied to solve various real-world problems, but it is unable to solve high-dimensional problems where the number of calculations increases drastically with the number of inputs. This problem, known as the curse of dimensionality, exceeds the computational constraints of conventional computers. In 2015, Mnih et al. [6] made an important breakthrough by combining deep learning with reinforcement learning (RL) to partially overcome the curse of dimensionality. Deep RL has attracted significant attention from the research community since then. Milestones of the development of RL, spanning from the TE method to deep RL, are presented in Fig. 1.

Fig. 1. Emergence of deep RL through different essential milestones.

RL originates from animal learning in psychology and thus it can mimic the human ability to learn to select actions that maximize long-term profit in interactions with the environment. RL has been widely used in robotics and autonomous systems, e.g., Mahadevan and Connell [7] designed a robot that can push cubes (1992); Schaal [8] created a humanoid robot that can effectively solve the pole-balancing task (1997); Benbrahim and Franklin [9] made a biped robot that can learn to walk without any knowledge of the environment (1997); Riedmiller et al. [10] built a soccer robot team (2009); and Mulling et al. [11] trained a robot to play table tennis (2013).

Modern RL is truly marked by the success of deep RL in 2015, when Mnih et al. [6] made use of a structure named deep Q-network (DQN) in creating an agent that outperformed a professional player in a series of 49 classic Atari games [12]. In 2016, Google's DeepMind created a self-taught AlphaGo program that could beat the best professional Go players, including China's Ke Jie and Korea's Lee Sedol [13]. Deep RL has also been used to solve MuJoCo physics problems [14] and 3-D maze games [15]. In 2017, OpenAI announced a bot that could beat the best professional gamer at the online game Dota 2, which is supposed to be more complex than the game of Go. More importantly, deep RL has become a promising approach to solving real-world problems due to its practical methodology, e.g., optimal control of nonlinear systems [16], pedestrian regulation [17], or traffic grid signal control [18]. Enterprise corporations, such as Google, Tesla, and Uber, have been in a race to make self-driving cars. Moreover, recent advances in deep RL have been extended to solve NP-hard problems such as the vehicle routing problem, which is critical in logistics [19], [20].

As real-world problems have become increasingly complex, there are many situations that a single deep RL agent is not able to cope with. In such situations, applications of a multiagent system (MAS) are indispensable. In an MAS, agents must compete or cooperate to obtain the best overall results. Examples of such systems include multiplayer online games, cooperative robots in production factories, traffic control systems, and autonomous military systems like unmanned aerial vehicles (UAVs), surveillance, and spacecraft. Among the many applications of deep RL in the literature, there are a large number of studies using deep RL in MASs, henceforth multiagent deep RL (MADRL). Extending from a single-agent domain to a multiagent environment creates several challenges. Previous surveys considered different perspectives: for example, Busoniu et al. [21] examined the stability and adaptation aspects of agents, Bloembergen et al. [22] analyzed the evolutionary dynamics, Hernandez-Leal et al. [23] considered emergent behaviors, communication, and cooperation learning perspectives, and da Silva et al. [24] reviewed methods for knowledge reuse autonomy in multiagent RL (MARL). This article presents an overview of technical challenges in multiagent learning as well as deep RL approaches to these challenges. We cover numerous MADRL perspectives, including nonstationarity, partial observability, multiagent training schemes, transfer learning in MASs, and continuous state and action spaces in multiagent learning. Applications of MADRL in various fields are also reviewed and analyzed in the current study. In the last section, we present extensive discussions and interesting future research directions of MADRL.

SECTION II.

Background: Reinforcement Learning

A. Preliminary

RL is TE learning that 1) interacts directly with the environment; 2) self-teaches over time; and 3) eventually achieves the designated goal. Specifically, RL defines any decision maker (learner) as an agent and anything outside the agent as the environment. The interactions between the agent and the environment are described via three essential elements: 1) state $s$; 2) action $a$; and 3) reward $r$ [25]. The state of the environment at time step $t$ is denoted as $s_t$. Thereby, the agent examines $s_t$ and performs a corresponding action $a_t$. The environment then alters its state from $s_t$ to $s_{t+1}$ and provides a feedback reward $r_{t+1}$ to the agent.

The agent's decision making is formalized by defining the concept of a policy. A policy $\pi$ is a mapping function from any perceived state $s$ to the action $a$ taken from that state. A policy is deterministic if the probability of choosing an action $a$ from $s$ satisfies $p(a \mid s) = 1$ for all states $s$. In contrast, the policy is stochastic if there exists a state $s$ such that $p(a \mid s) < 1$. In either case, we can define the policy $\pi$ as a probability distribution over the candidate actions that can be selected from a certain state as

$$\pi = \Psi(s) = \left\{\, p(a_i \mid s) \;\middle|\; \forall a_i \in \Delta_\pi \ \wedge\ \sum_i p(a_i \mid s) = 1 \right\} \tag{1}$$

where $\Delta_\pi$ represents all candidate actions (the action space) of the policy $\pi$. For clarity, we assume that the action space is discrete because the continuous case can be straightforwardly inferred by using integral notation. Furthermore, we presume that the next state $s_{t+1}$ and feedback reward $r_{t+1}$ are entirely determined by the current state–action pair $(s_t, a_t)$ regardless of the history. Any RL problem satisfying this "memoryless" condition is known as an MDP. Therefore, the dynamics (model) of an RL problem is completely specified by giving all transition probabilities $p(a_i \mid s)$.
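
To make the notation concrete, below is a minimal Python sketch (not from the paper) of a tabular stochastic policy over a discrete action space, together with a toy environment step; the transition probabilities and rewards are invented purely for illustration, and a real task would supply $s_{t+1}$ and $r_{t+1}$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 4, 2

# Tabular stochastic policy: policy[s] is a distribution over the action space,
# i.e., the probabilities p(a_i | s) in (1), which sum to 1 for each state s.
policy = np.full((n_states, n_actions), 1.0 / n_actions)

def select_action(state: int) -> int:
    """Sample an action a ~ p(. | s) from the current policy."""
    return int(rng.choice(n_actions, p=policy[state]))

# A toy MDP used only for illustration: transition probabilities and rewards
# are made up; a real environment would supply s_{t+1} and r_{t+1} instead.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s' | s, a)
R = rng.normal(size=(n_states, n_actions))                        # expected r_{t+1}

def env_step(state: int, action: int) -> tuple[int, float]:
    next_state = int(rng.choice(n_states, p=P[state, action]))
    reward = float(R[state, action])
    return next_state, reward

s = 0
for t in range(5):
    a = select_action(s)
    s_next, r = env_step(s, a)
    print(f"t={t}  s={s}  a={a}  r={r:+.2f}  s'={s_next}")
    s = s_next
```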

B. Bellman Equation

Recall that the agent receives a feedback reward $r_{t+1}$ for every time step $t$ until it reaches the terminal state $s_T$. However, the immediate reward $r_{t+1}$ does not represent the long-term profit; we instead leverage a generalized return value $R_t$ at time step $t$

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T = \sum_{i=0}^{T-t-1} \gamma^i r_{t+i+1} \tag{2}$$

where $\gamma$ is a discount factor such that $0 \le \gamma < 1$. The agent becomes farsighted when $\gamma$ approaches 1 and, vice versa, the agent becomes shortsighted when $\gamma$ is close to 0.
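
As a quick illustration of (2) (not part of the original article), the snippet below computes the discounted return $R_t$ from a recorded reward sequence and shows how a $\gamma$ close to 1 weights distant rewards far more heavily than a $\gamma$ close to 0.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{i=0}^{T-t-1} gamma^i * r_{t+i+1} as in (2).

    `rewards` holds r_{t+1}, r_{t+2}, ..., r_T observed after time step t.
    """
    R = 0.0
    for r in reversed(rewards):      # accumulate backwards: R <- r + gamma * R
        R = r + gamma * R
    return R

rewards = [1.0, 0.0, 0.0, 10.0]      # toy reward sequence, for illustration only
print(discounted_return(rewards, gamma=0.99))  # farsighted: ~10.70
print(discounted_return(rewards, gamma=0.10))  # shortsighted: ~1.01
```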

The next step is to define a value function that is used to evaluate how "good" a certain state $s$ or a certain state–action pair $(s,a)$ is. Specifically, the value function of the state $s$ under the policy $\pi$ is calculated by obtaining the expected return value from $s$: $V^\pi(s) = \mathbb{E}[R_t \mid s_t = s, \pi]$. Likewise, the value function of the state–action pair $(s,a)$ is $Q^\pi(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$. We can leverage value functions to compare how "good" two policies $\pi$ and $\pi'$ are relative to each other using the following rule [25]:

$$\pi \le \pi' \iff \big[ V^{\pi}(s) \le V^{\pi'}(s)\ \ \forall s \big] \ \vee\ \big[ Q^{\pi}(s,a) \le Q^{\pi'}(s,a)\ \ \forall (s,a) \big]. \tag{3}$$

Based on (2), we can expand $V^\pi(s)$ and $Q^\pi(s,a)$ to present the relationship between two consecutive states $s = s_t$ and $s' = s_{t+1}$ as [25]

$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} p(s' \mid s,a) \big( W_{s \to s' \mid a} + \gamma V^{\pi}(s') \big) \tag{4}$$

and

$$Q^{\pi}(s,a) = \sum_{s'} p(s' \mid s,a) \Big( W_{s \to s' \mid a} + \gamma \sum_{a'} \pi(s',a') Q^{\pi}(s',a') \Big) \tag{5}$$

where $W_{s \to s' \mid a} = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$. Solving (4) or (5), we can find the value function $V(s)$ or $Q(s,a)$, respectively. Equations (4) and (5) are called the Bellman equations.
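
For illustration only, the sketch below performs iterative policy evaluation by repeatedly applying (4) as an update rule on an invented random model; it assumes the transition probabilities $p(s' \mid s, a)$ and expected rewards $W_{s \to s' \mid a}$ are known, which is exactly the model information that the model-free methods reviewed next avoid.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 2, 0.9

# Invented model for illustration: p(s'|s,a) and expected rewards W_{s->s'|a}.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # shape (nS, nA, nS)
W = rng.normal(size=(nS, nA, nS))               # E[r_{t+1} | s, a, s']
pi = np.full((nS, nA), 1.0 / nA)                # a fixed stochastic policy pi(s, a)

# Iterative policy evaluation: apply (4) as an update until V stops changing.
V = np.zeros(nS)
for sweep in range(1000):
    V_new = np.zeros(nS)
    for s in range(nS):
        for a in range(nA):
            V_new[s] += pi[s, a] * np.sum(P[s, a] * (W[s, a] + gamma * V))
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("V_pi(s) =", np.round(V, 3))
```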

Dynamic programming and its variants [26]–[31] can be used to approximate the solutions of the Bellman equations. However, this requires complete knowledge of the dynamics of the problem, and thus, when the number of states is large, this approach is infeasible due to the limited memory and computational power of conventional computers. In the next section, we review two model-free RL methods (which require no knowledge of the transition probabilities $p(a_i \mid s)$) to approximate the value functions.

C. RL Methods

In this section, we review two well-known learning schemes in RL: 1) Monte-Carlo (MC) and 2) TD learning. These methods do not require the dynamic information of the environment, that is, they can deal with larger state-space problems than dynamic programming approaches.

1) Monte-Carlo Method:

This method estimates the value function by repeatedly generating episodes and recording the average return at each state or each state–action pair. The MC method does not require any knowledge of transition probabilities, that is, the MC method is model free. However, this approach makes two essential assumptions to ensure that convergence happens: 1) the number of episodes is large and 2) every state and every action must be visited a significant number of times.

Generally, MC algorithms are divided into two groups: 1) on-policy and 2) off-policy. In on-policy methods, we use the policy $\pi$ for both evaluation and exploration purposes. Therefore, the policy $\pi$ must be stochastic, or soft. In contrast, off-policy methods use a different policy $\pi' \ne \pi$ to generate the episodes, and hence $\pi$ can be deterministic. Although the off-policy approach is desirable due to its simplicity, the on-policy method is more stable when working with continuous state-space problems and when used together with a function approximator (such as a neural network) [32].
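
The following is a minimal sketch (not from the paper) of first-visit MC policy evaluation: it averages the returns observed after the first visit to each state over many episodes, and assumes the caller provides a `generate_episode()` function that follows the policy $\pi$.

```python
from collections import defaultdict

def mc_evaluate(generate_episode, gamma=0.99, n_episodes=10_000):
    """First-visit Monte-Carlo policy evaluation.

    `generate_episode()` must return a list of (state, reward) pairs
    [(s_0, r_1), (s_1, r_2), ...] produced by following the policy pi.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode()
        G, first_visit = 0.0, {}
        # Walk backwards so G accumulates the return that follows each state;
        # overwriting leaves the return after the *first* visit of each state.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            first_visit[s] = G
        for s, G in first_visit.items():
            returns_sum[s] += G
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```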

2) Temporal-Difference Method:

Similar to MC, the TD method also learns from experiences (it is a model-free method). However, unlike MC, TD learning does not wait until the end of an episode to make an update. It makes an update at every step within the episode by leveraging the Bellman equation (4) and hence possibly provides faster convergence. Equation (6) presents the one-step TD method

$$V_i(s_t) \longleftarrow \alpha V_{i-1}(s_t) + (1-\alpha)\big( r_{t+1} + \gamma V_{i-1}(s_{t+1}) \big) \tag{6}$$

where $\alpha$ is a step-size parameter with $0 < \alpha < 1$. TD learning uses the previous estimated values $V_{i-1}$ to update the current ones $V_i$, which is known as the bootstrapping method. Basically, bootstrapping methods learn faster than nonbootstrapping ones in most cases [25]. TD learning is also divided into two categories: 1) on-policy TD control (Sarsa) and 2) off-policy TD control (Q-learning).
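
A minimal tabular sketch of the one-step TD update in (6) is given below (illustrative only); note that, following the paper's notation, $\alpha$ weights the previous estimate, so the update blends the old value with the bootstrapped target $r_{t+1} + \gamma V(s_{t+1})$.

```python
from collections import defaultdict

V = defaultdict(float)   # tabular value function, V[s] defaults to 0.0

def td0_update(s_t, r_next, s_next, alpha=0.9, gamma=0.99, terminal=False):
    """One-step TD update following (6):
    V(s_t) <- alpha * V(s_t) + (1 - alpha) * (r_{t+1} + gamma * V(s_{t+1})).
    """
    bootstrap = 0.0 if terminal else V[s_next]
    V[s_t] = alpha * V[s_t] + (1.0 - alpha) * (r_next + gamma * bootstrap)

# Example: a single transition observed while interacting with the environment.
td0_update(s_t="s0", r_next=1.0, s_next="s1")
print(dict(V))
```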

In practice, MC and TD learning often use a table memory structure (tabular method) to save the value function of each state or each state–action pair. This makes them inefficient, due to the lack of memory, in solving complex problems where the number of states is large. Therefore, the actor–critic (AC) architecture is designed to subdue this limitation. Specifically, AC includes two separate memory structures for an agent: 1) the actor and 2) the critic. The actor structure is used to select a suitable action according to the observed state, which is transferred to the critic structure for evaluation. The critic structure uses a TD error function to decide the future tendency of the selected action. AC methods can be on-policy or off-policy depending on the implementation details. Table I summarizes the characteristics of RL methods as well as their advantages and disadvantages. Table II highlights the differences between dynamic programming and RL methods, which include MC, Sarsa, Q-learning, and AC. As opposed to dynamic programming, RL algorithms are model free [33]. While other RL methods, such as Sarsa, Q-learning, and AC, use the bootstrapping method, MC requires completed episodes to update its value function. Notably, algorithms based on AC are the most generic as they can belong to any category.
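
To illustrate the AC idea (this is a rough sketch, not an algorithm from the paper), the critic below maintains a table of state values and computes a TD error, while the actor adjusts its action preferences in the direction indicated by that error.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA = 5, 3
V = np.zeros(nS)              # critic: tabular state-value estimates
prefs = np.zeros((nS, nA))    # actor: tabular action preferences

def actor_select(s):
    """Softmax over the actor's preferences gives a stochastic policy."""
    p = np.exp(prefs[s] - prefs[s].max())
    p /= p.sum()
    return int(rng.choice(nA, p=p))

def ac_update(s, a, r, s_next, alpha_v=0.1, alpha_pi=0.05, gamma=0.99):
    """One actor-critic step: the critic computes a TD error, the actor follows it."""
    td_error = r + gamma * V[s_next] - V[s]   # critic's evaluation of the transition
    V[s] += alpha_v * td_error                # critic update (bootstrapped, as in TD)
    prefs[s, a] += alpha_pi * td_error        # actor reinforces or suppresses action a
```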

TABLE I. Characteristics of RL Methods

TABLE II. Comparisons Between Dynamic Programming and RL Methods

SECTION III.

Deep RL: Single Agent

A. Deep Q-Network

Deep RL is a broad term that indicates the combination of deep learning and RL to deal with high-dimensional environments [34]–[36]. In 2015, Mnih et al. [6] first announced the success of this combination by creating an autonomous agent that can competently play a series of 49 Atari games. Concisely, the authors proposed a novel structure called DQN that leverages a convolutional neural network (CNN) [37] to directly interpret the graphical representation of the input state $s$ from the environment. The output of DQN produces the Q-values of all possible actions $a \in \Delta_\tau$ taken at state $s$, where $\Delta_\tau$ denotes the action space [38]. Therefore, DQN can be seen as a policy network $\tau$, parameterized by $\beta$, which is continually trained so as to approximate the optimal policy. Mathematically, DQN uses the Bellman equation to minimize the loss function $L(\beta)$ as

$$L(\beta) = \mathbb{E}\Big[ \big( r + \gamma \max_{a'} Q(s',a' \mid \beta') - Q(s,a \mid \beta) \big)^2 \Big]. \tag{7}$$

However, using a neural network to approximate the value function has proved to be unstable and may result in divergence due to the bias originating from correlated samples [32]. To make the samples uncorrelated, Mnih et al. [6] created a target network $\tau'$, parameterized by $\beta'$, which is updated every $N$ steps from the estimation network $\tau$. Moreover, generated samples are stored in an experience replay memory. The samples are then retrieved randomly from the experience replay and fed into the training process [39], as described in Fig. 2.

Fig. 2. Deep Q-network structure with experience replay memory and a target network whose parameters are updated from the estimation network after every N steps to ensure a stable learning process.
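
The following is a rough PyTorch-style sketch of one DQN training step along the lines of Fig. 2 and loss (7); it is not the authors' code, and the network sizes, replay buffer, and hyperparameters are placeholders.

```python
import random
from collections import deque

import torch
import torch.nn as nn

n_obs, n_actions, gamma, N = 8, 4, 0.99, 1000   # placeholder sizes/hyperparameters

def make_net():
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_net(), make_net()       # estimation network tau and target tau'
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                   # experience replay memory

# The interaction loop (not shown) appends transitions, e.g.
# replay.append((list(obs), action, reward, list(next_obs), float(done))),
# and copies q_net into target_net every N steps.

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)    # random retrieval decorrelates samples
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                        # target term uses the frozen parameters beta'
        target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q_sa, target)  # squared error of loss (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```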

Although DQN basically solved a challenging problem in RL, the curse of dimensionality, this is just a rudimentary step toward completely solving real-world applications. DQN has numerous drawbacks, which can be remedied by different schemes, from simple forms to the complex modifications that we discuss in the next section.

B. DQN Variants

The first and simplest form of DQN variant is the double DQN (DDQN) proposed in [40] and [41]. The idea of DDQN is to separate the selection of the "greedy" action from the action evaluation. In this way, DDQN expects to reduce the overestimation of Q-values in the training process. In other words, the max operator in (7) is decoupled into two different operators, as represented by the following loss function:

$$L_{\mathrm{DDQN}}(\beta) = \mathbb{E}\Big[ \big( r + \gamma Q\big(s', \arg\max_{a'} Q(s',a' \mid \beta) \mid \beta'\big) - Q(s,a \mid \beta) \big)^2 \Big]. \tag{8}$$

Empirical experimental results on 57 Atari games show that the normalized performance of DDQN is two times greater than that of DQN without tuning and three times greater than that of DQN with tuning.
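
To highlight the decoupling in (8), the sketch below (illustrative, reusing the hypothetical `q_net` and `target_net` from the DQN sketch above) selects the greedy action with the online network and evaluates it with the target network.

```python
import torch

def ddqn_target(r, s2, done, q_net, target_net, gamma=0.99):
    """DDQN target from (8): action selection by q_net, evaluation by target_net."""
    with torch.no_grad():
        greedy_a = q_net(s2).argmax(dim=1, keepdim=True)          # argmax_a' Q(s', a' | beta)
        q_eval = target_net(s2).gather(1, greedy_a).squeeze(1)    # Q(s', greedy_a | beta')
        return r + gamma * q_eval * (1 - done)
```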

Second, the experience replay in DQN plays an important role in breaking the correlations between samples and, at the same time, in reminding the policy network of "rare" samples that it may rapidly forget. However, selecting samples randomly from the experience replay does not completely separate the sample data. Specifically, we prefer rare and goal-related samples to appear more frequently than redundant ones. Therefore, Schaul et al. [42] proposed a prioritized experience replay that gives priority to a sample $i$ based on the absolute value of its TD error

$$p_i = |\delta_i| = \big| r_i + \gamma \max_{a} Q(s_i, a \mid \beta') - Q(s_{i-1}, a_{i-1} \mid \beta) \big|. \tag{9}$$

The prioritized experience replay, when combined with DDQN, provides stable convergence of the policy network and achieves a performance up to five times greater than DQN in terms of normalized mean score on 57 Atari games.
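
A minimal sketch of proportional prioritization (not the authors' implementation): each stored transition keeps a priority derived from the absolute TD error in (9), sampling probabilities are proportional to $p_i^{\alpha}$, and the importance-sampling correction used in the original paper is omitted for brevity.

```python
import numpy as np

class PrioritizedReplay:
    """Toy proportional prioritized replay; priorities follow |TD error| as in (9)."""

    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        p = np.asarray(self.priorities)
        p = p / p.sum()                                   # P(i) proportional to p_i^alpha
        idx = rng.choice(len(self.data), size=batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, delta in zip(idx, td_errors):              # refresh after each training step
            self.priorities[i] = (abs(delta) + self.eps) ** self.alpha
```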

DQN's policy evaluation process struggles to work in "redundant" situations, that is, when there are two or more candidate actions that can be selected without getting any negative results. For instance, when driving a car with no obstacles ahead, we can follow either the left lane or the right lane. If there is an obstacle ahead in the left lane, we must be in the right lane to avoid crashing. Therefore, it is more efficient if we focus only on the road and obstacles ahead. To resolve such situations, Wang et al. [43] proposed a novel network architecture called the dueling network. In the dueling architecture, two collateral networks coexist: one network, parameterized by $\theta$, estimates the state-value function $V(s \mid \theta)$ and the other one, parameterized by $\theta'$, estimates the advantage action function $A(s,a \mid \theta')$. The two networks are then aggregated using the following equation to approximate the Q-value function:

$$Q(s,a \mid \theta, \theta') = V(s \mid \theta) + \Big( A(s,a \mid \theta') - \frac{1}{|\Delta_\pi|} \sum_{a'} A(s,a' \mid \theta') \Big). \tag{10}$$

Because the dueling network represents the action-value function, it was combined with DDQN and prioritized experience replay to boost the performance up to six times that of the standard DQN on the Atari domain [43].
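
A rough PyTorch-style sketch of the dueling aggregation in (10) is shown below (not the authors' architecture); a small fully connected trunk stands in for the convolutional feature extractor.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')) as in (10)."""

    def __init__(self, n_obs=8, n_actions=4, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_obs, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s | theta)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a | theta')

    def forward(self, obs):
        h = self.trunk(obs)
        v = self.value(h)                      # shape (batch, 1)
        a = self.advantage(h)                  # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingQNet()
print(q(torch.zeros(2, 8)).shape)   # torch.Size([2, 4])
```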

Another drawback of DQN is that it uses a history of only four frames as the input to the policy network. DQN is therefore inefficient at solving problems where the current state depends on a significant amount of history information, such as "Double Dunk" or "Frostbite" [44]. These games are often called partially observable MDP problems. The straightforward solution is to replace the fully connected layer right after the last convolutional layer of the policy network with a recurrent long short-term memory, as described in [44]. This DQN variant, called the deep recurrent Q-network (DRQN), outperforms the standard DQN by up to 700 percent in the "Double Dunk" and "Frostbite" games. Furthermore, Lample and Chaplot [45] successfully created an agent that beats an average player on "Doom," a 3-D FPS (first-person shooter) environment, by adding a game feature layer to DRQN. Another interesting variant of DRQN is the deep attention recurrent Q-network (DARQN) [46]. In that article, Sorokin et al. added an attention mechanism to DRQN so that the network can focus only on important regions in the game, allowing smaller network parameters and hence speeding up the training process. As a result, DARQN achieves a score of 7263 on the game "Seaquest," compared with 1284 and 1421 for DQN and DRQN, respectively.
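
As a rough sketch of the DRQN idea (a simplification, not the original architecture), the recurrent layer below carries a hidden state across time steps so that Q-values can depend on the observation history rather than on a short stack of frames; a small fully connected encoder stands in for the convolutional layers.

```python
import torch
import torch.nn as nn

class TinyDRQN(nn.Module):
    """Recurrent Q-network sketch: encoder -> LSTM -> Q-values per action."""

    def __init__(self, n_obs=8, n_actions=4, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_obs, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, n_obs); hidden_state carries memory across calls.
        h, hidden_state = self.lstm(self.encoder(obs_seq), hidden_state)
        return self.q_head(h), hidden_state

net = TinyDRQN()
q_values, state = net(torch.zeros(1, 10, 8))   # Q-values for each of the 10 steps
print(q_values.shape)                          # torch.Size([1, 10, 4])
```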

SECTION IV.

Deep RL: Multiagent

MASs have attracted great attention because they are able to solve complex tasks through the cooperation of individual agents. Within an MAS, the agents communicate with each other and interact with the environment. In a multiagent learning domain, the MDP is generalized to a stochastic game, or a Markov game. Let us denote $n$ as the number of agents, $S$ as a discrete set of environmental states, and $A_i, i = 1, 2, \ldots, n$, as the set of actions for each agent. The joint action set for all agents is defined by $A = A_1 \times A_2 \times \cdots \times A_n$. The state transition probability function is represented by $p: S \times A \times S \to [0,1]$ and the reward function is specified as $r: S \times A \times S \to \mathbb{R}^n$. The value function of each agent is dependent on the joint action and joint policy, which is characterized by $V^\pi: S \times A \to \mathbb{R}^n$. The following sections describe challenges and MADRL solutions as well as their applications to solving real-world problems.
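
To make the Markov-game notation concrete, the toy interface below (purely illustrative) steps the environment with a joint action $(a_1, \ldots, a_n)$ and returns one reward per agent, matching $r: S \times A \times S \to \mathbb{R}^n$; the transition and reward rules are invented.

```python
import numpy as np

class ToyMarkovGame:
    """Two-agent toy Markov game: one shared state, one reward per agent."""

    def __init__(self, n_agents=2, n_states=5, n_actions=2):
        self.n_agents, self.n_states, self.n_actions = n_agents, n_states, n_actions
        self.state = 0

    def step(self, joint_action):
        assert len(joint_action) == self.n_agents
        # The transition depends on the *joint* action, which is what couples the agents.
        self.state = (self.state + sum(joint_action)) % self.n_states
        rewards = np.array([1.0 if a == self.state % self.n_actions else 0.0
                            for a in joint_action])      # one reward per agent
        return self.state, rewards

game = ToyMarkovGame()
s, r = game.step((1, 0))
print(s, r)
```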

A. MADRL: Challenges and Solutions

1) Nonstationarity:

Controlling multiple agents poses several additional challenges compared to the single-agent setting, such as the heterogeneity of agents, how to define suitable collective goals, the scalability to large numbers of agents that requires the design of compact representations, and, more importantly, the nonstationarity problem. In a single-agent environment, an agent is concerned only with the outcome of its own actions. In a multiagent domain, an agent observes not only the outcomes of its own actions but also the behavior of other agents. Learning among the agents is complex because all agents potentially interact with each other and learn concurrently. The interactions among multiple agents constantly reshape the environment and lead to nonstationarity. In this case, learning among the agents sometimes causes changes in the policy of an agent and can affect the optimal policy of other agents. The estimated potential rewards of an action would be inaccurate, and therefore good policies at a given point in the multiagent setting could not remain so in the future. The convergence theory of Q-learning applied in a single-agent setting is not guaranteed for most multiagent problems, as the Markov property does not hold anymore in the nonstationary environment [47]. Therefore, collecting and processing information must be performed with certain recurrence while ensuring that it does not affect the agents' stability. The exploration–exploitation dilemma can be more involved under multiagent settings.

The popular independent Q-learning [48] and experience replay-based DQN [6] were not designed for nonstationary environments. Castaneda [49] proposed two variants of DQN, namely, the deep repeated update Q-network (DRUQN) and the deep loosely coupled Q-network (DLCQN), to deal with the nonstationarity problem in MASs. The DRUQN is developed based on the repeated update Q-learning (RUQL) model introduced in [50] and [51]. It aims to avoid policy bias by updating the action value inversely proportional to the likelihood of selecting an action. On the other hand, DLCQN relies on the loosely coupled Q-learning proposed in [52], which specifies and adjusts an independence degree for each agent using its negative rewards and observations. Through this independence degree, the agent learns to decide whether it needs to act independently or cooperate with other agents in different circumstances. Likewise, Diallo et al. [53] extended DQN to a multiagent concurrent DQN and demonstrated that this method can converge in a nonstationary environment. Foerster et al. [54] alternatively introduced two methods for stabilizing the experience replay of DQN in MADRL. The first method uses an importance sampling approach to naturally decay obsolete data, while the second method disambiguates the age of the samples retrieved from the replay memory using a fingerprint.
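
As a rough illustration of the fingerprint idea from Foerster et al. [54] (the exact fingerprint and surrounding training code are simplified assumptions here), each transition stored in the replay memory is augmented with a low-dimensional marker of when it was generated, such as the training iteration and exploration rate, so that sampled experience is no longer ambiguous about how stale the other agents' policies were.

```python
from collections import deque

replay = deque(maxlen=100_000)

def store_with_fingerprint(obs, action, reward, next_obs, done,
                           train_iter, epsilon):
    """Augment the stored observation with a fingerprint of the sample's age.

    The fingerprint (train_iter, epsilon) is one simple choice of a
    low-dimensional summary of how the other agents' policies were evolving
    when this transition was collected.
    """
    fingerprint = (float(train_iter), float(epsilon))
    replay.append((tuple(obs) + fingerprint, action, reward,
                   tuple(next_obs) + fingerprint, done))

# Usage: during training, the current iteration and exploration rate are recorded
# alongside each transition so the learner can condition on the sample's age.
store_with_fingerprint(obs=[0.1, 0.2], action=1, reward=0.0,
                       next_obs=[0.3, 0.4], done=False,
                       train_iter=1250, epsilon=0.05)
```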
