[Survey] Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

https://arxiv.org/abs/1911.10635v1

Abstract

        Recent years have witnessed significant advances in reinforcement learning (RL), which has registered tremendous success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one single agent, which naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history that has recently re-emerged due to advances in single-agent RL techniques.

        Though empirically successful, theoretical foundations for MARL are relatively lacking in the literature.

        In this chapter, we provide a selective overview of MARL, with a focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. Some of the new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an on-the-mark assessment of the current state of the field, to identify fruitful future research directions on theoretical studies of MARL. We expect this chapter to serve as a continuing stimulus for researchers interested in working on this exciting yet challenging topic.

1 Introduction

        Recent years have witnessed sensational advances of reinforcement learning (RL) in many prominent sequential decision-making problems, such as playing the game of Go [1, 2], playing real-time strategy games [3, 4], robotic control [5, 6], playing card games [7, 8], and autonomous driving [9], especially accompanied by the development of deep neural networks (DNNs) for function approximation [10]. Intriguingly, most of the successful applications involve the participation of more than one single agent/player, which should be modeled systematically as multi-agent RL (MARL) problems. Specifically, MARL addresses the sequential decision-making problem of multiple autonomous agents that operate in a common environment, each of which aims to optimize its own long-term return by interacting with the environment and other agents [11]. Besides the aforementioned popular ones, learning in multi-agent systems finds potential applications in other subareas, including cyber-physical systems [12, 13], finance [14, 15], sensor/communication networks [16, 17], and social science [18, 19].
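        To make this objective concrete, here is a rough sketch in standard Markov-game notation; the symbols N, s_t, a_t^i, R^i, γ, and π^i are our own shorthand for this excerpt rather than definitions taken from the chapter. Each agent i seeks a policy that maximizes its own expected long-term (discounted) return, while the state evolution is driven by the joint action of all agents:

\[
\max_{\pi^i}\; J^i\big(\pi^1,\dots,\pi^N\big)
  \;=\; \mathbb{E}\!\left[\sum_{t\ge 0}\gamma^t\, R^i\big(s_t,\,a_t^1,\dots,a_t^N\big)\right],
  \qquad i = 1,\dots,N,
\]

where \(a_t^i \sim \pi^i(\cdot \mid s_t)\) and the next state \(s_{t+1}\) is drawn from a transition kernel depending on \(s_t\) and the joint action \((a_t^1,\dots,a_t^N)\). This coupling through the joint action is what distinguishes the problem from N separate single-agent RL problems.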

        Largely, MARL algorithms can be placed into three groups, fully cooperative, fully competitive, and a mix of the two, depending on the types of settings they address. In particular, in the cooperative setting, agents collaborate to optimize a common long-term return; while in the competitive setting, the returns of the agents usually sum up to zero. The mixed setting involves both cooperative and competitive agents, with general-sum returns. Modeling disparate MARL settings requires frameworks spanning optimization theory, dynamic programming, game theory, and decentralized control; see §2.2 for more detailed discussions. In spite of these existing multiple frameworks, several challenges in MARL are in fact common across the different settings, especially for the theoretical analysis. Specifically, first, the learning goals in MARL are multi-dimensional, as the objectives of all agents are not necessarily aligned, which brings up the challenge of dealing with equilibrium points, as well as some additional performance criteria beyond return-optimization, such as the efficiency of communication/coordination, and robustness against potential adversarial agents. Moreover, as all agents are improving their policies according to their own interests concurrently, the environment faced by each agent becomes non-stationary. This breaks or invalidates the basic framework of most theoretical analyses in the single-agent setting. Furthermore, the joint action space, which grows exponentially with the number of agents, may cause scalability issues, known as the combinatorial nature of MARL [20]. Additionally, the information structure, i.e., who knows what, in MARL is more involved, as each agent has limited access to the observations of others, leading to possibly suboptimal decision rules locally. A detailed elaboration on the underlying challenges can be found in §3.
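        As a rough illustration in the same shorthand notation as above (again our own, not taken from the chapter), the three settings differ in the structure imposed on the reward functions \(R^1,\dots,R^N\), and the scalability issue stems from the size of the joint action space:

\[
\text{fully cooperative: } R^1 = R^2 = \cdots = R^N, \qquad
\text{fully competitive (zero-sum): } \sum_{i=1}^{N} R^i \equiv 0, \qquad
\text{mixed: } \sum_{i=1}^{N} R^i \text{ unconstrained},
\]
\[
\mathcal{A} \;=\; \mathcal{A}^1 \times \cdots \times \mathcal{A}^N,
  \qquad |\mathcal{A}| \;=\; \prod_{i=1}^{N} |\mathcal{A}^i|,
\]

so the joint action space grows exponentially with the number of agents N, which is the combinatorial nature referred to above.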
