The StarCraft Multi-Agent Challenge (SMAC)

SMAC is a multi-agent challenge built on StarCraft II, aimed at partially observable, cooperative multi-agent learning problems. It provides a set of challenge scenarios for testing and evaluating how well algorithms handle high-dimensional inputs, partial observability, and coordinated behaviour. The reported results show that, with centralised training and decentralised execution, deep reinforcement learning methods show promise on SMAC.

Table of Contents

The StarCraft Multi-Agent Challenge
Abstract
1 Introduction
2 Related Work
3 Multi-Agent Reinforcement Learning
    Dec-POMDPs
    Centralised training with decentralised execution
4 SMAC
    Scenarios
    State and Observations
    Action Space
    Rewards
5 PyMARL
6 Results
7 Conclusion and Future Work
Acknowledgements
References
Appendix A SMAC
    A.1 Scenarios
    A.2 Environment Setting
Appendix B Evaluation Methodology
    B.1 Evaluation Metrics
Appendix C Experimental Setup
    C.1 Architecture and Training
    C.2 Reward and Observation
Appendix D Table of Results

The StarCraft Multi-Agent Challenge

https://arxiv.org/abs/1902.04043

Abstract

        In the last few years, deep multi-agent reinforcement learning (RL) has become a highly active area of research. A particularly challenging class of problems in this area is partially observable, cooperative, multi-agent learning, in which teams of agents must learn to coordinate their behaviour while conditioning only on their private observations. This is an attractive research area since such problems are relevant to a large number of real-world systems and are also more amenable to evaluation than general-sum problems.

        Standardised environments such as the ALE and MuJoCo have allowed single-agent RL to move beyond toy domains, such as grid worlds. However, there is no comparable benchmark for cooperative multi-agent RL. As a result, most papers in this field use one-off toy problems, making it difficult to measure real progress. In this paper, we propose the StarCraft Multi-Agent Challenge (SMAC) as a benchmark problem to fill this gap. SMAC is based on the popular real-time strategy game StarCraft II and focuses on micromanagement challenges where each unit is controlled by an independent agent that must act based on local observations. We offer a diverse set of challenge scenarios and recommendations for best practices in benchmarking and evaluations. We also open-source a deep multi-agent RL learning framework including state-of-the-art algorithms. We believe that SMAC can provide a standard benchmark environment for years to come. Videos of our best agents for several SMAC scenarios are available at: https://youtu.be/VZ7zmQ_obZ0.

1 Introduction

        Deep reinforcement learning (RL) promises a scalable approach to solving arbitrary sequential decision-making problems, demanding only that a user specify a reward function that expresses the desired behaviour. However, many real-world problems that might be tackled by RL are inherently multi-agent in nature. For example, the coordination of self-driving cars, autonomous drones, and other multi-robot systems is becoming increasingly critical. Network traffic routing, distributed sensing, energy distribution, and other logistical problems are also inherently multi-agent. As such, it is essential to develop multi-agent RL (MARL) solutions that can handle decentralisation constraints and deal with the exponentially growing joint action space of many agents.

        Partially observable, cooperative, multi-agent learning problems are of particular interest. Cooperative problems avoid the difficulties in evaluation inherent to general-sum games (e.g., which opponents to evaluate against). Cooperative problems also map well to a large class of critical problems where a single user that manages a distributed system can specify the overall goal, e.g., minimising traffic or other inefficiencies. Most real-world problems depend on inputs from noisy or limited sensors, so partial observability must also be dealt with effectively. This often includes limitations on communication that result in a need for decentralised execution of learned policies. However, there is commonly access to additional information during training, which may be carried out in controlled conditions or in simulation.

        A growing number of recent works Foerster et al. (2018a); Rashid et al. (2018); Sunehag et al. (2017); Lowe et al. (2017) have begun to address the problems in this space. However, there is a clear lack of standardised benchmarks for research and evaluation. Instead, researchers often propose one-off environments which can be overly simple or tuned to the proposed algorithms. In single-agent RL, standard environments such as the Arcade Learning Environment Bellemare et al. (2013), or MuJoCo for continuous control Plappert et al. (2018), have enabled great progress. In this paper, we aim to follow this successful model by offering challenging standard benchmarks for deep MARL and to facilitate more rigorous experimental methodology across the field.

        Some testbeds have emerged for other multi-agent regimes, such as Poker Heinrich & Silver (2016), Pong Tampuu et al. (2015), Keepaway Soccer Stone et al. (2005), or simple gridworld-like environments Lowe et al. (2017); Leibo et al. (2017); Yang et al. (2018); Zheng et al. (2017). Nonetheless, we identify a clear gap in challenging and standardised testbeds for the important set of domains described above.

        To fill this gap, we introduce the StarCraft Multi-Agent Challenge (SMAC). SMAC is built on the popular real-time strategy game StarCraft II and makes use of the SC2LE environment Vinyals et al. (2017). Instead of tackling the full game of StarCraft with centralised control, we focus on decentralised micromanagement challenges (Figure 1). In these challenges, each of our units is controlled by an independent, learning agent that has to act based only on local observations, while the opponent’s units are controlled by the hand-coded built-in StarCraft II AI. We offer a diverse set of scenarios that challenge algorithms to handle high-dimensional inputs and partial observability, and to learn coordinated behaviour even when restricted to fully decentralised execution.
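
As a concrete illustration of this decentralised setup, the sketch below runs random agents on a SMAC scenario through the interface of the open-sourced smac package. It is a minimal sketch based on the package's documented API (StarCraft2Env, get_obs_agent, get_avail_agent_actions, step); exact names and signatures may differ between versions.

```python
# Random-agent episode loop on a SMAC scenario (minimal sketch; follows the
# interface documented for the open-sourced `smac` package).
import numpy as np
from smac.env import StarCraft2Env

env = StarCraft2Env(map_name="3m")          # a small SMAC micromanagement scenario
n_agents = env.get_env_info()["n_agents"]

for episode in range(5):
    env.reset()
    terminated = False
    episode_return = 0.0
    while not terminated:
        actions = []
        for agent_id in range(n_agents):
            # Each agent conditions only on its own local observation
            # (a learning agent would feed this into its policy).
            obs = env.get_obs_agent(agent_id)
            # Only the currently available actions may be chosen.
            avail = env.get_avail_agent_actions(agent_id)
            actions.append(np.random.choice(np.nonzero(avail)[0]))
        # A single shared team reward per step (fully cooperative setting).
        reward, terminated, info = env.step(actions)
        episode_return += reward
    print(f"Episode {episode}: return = {episode_return}")

env.close()
```

Note that nothing here queries the global state: in this setting the state is reserved for centralised training, while execution relies only on each agent's local observation.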

        The full games of StarCraft: BroodWar and StarCraft II have already been used as RL environments, due to the many interesting challenges inherent to the games Synnaeve et al. (2016); Vinyals et al. (2017). DeepMind’s AlphaStar DeepMind (2019) has recently shown an impressive level of play on a StarCraft II matchup using a centralised controller. In contrast, SMAC is not intended as an environment to train agents for use in full StarCraft II gameplay. Instead, by introducing strict decentralisation and local partial observability, we use the StarCraft II game engine to build a new set of rich cooperative multi-agent problems that bring unique challenges, such as the nonstationarity of learning Foerster et al. (2017), multi-agent credit assignment Foerster et al. (2018a), and the difficulty of representing the value of joint actions Rashid et al. (2018).

        To further facilitate research in this field, we also open-source PyMARL, a learning framework that can serve as a starting point for other researchers and includes implementations of several key MARL algorithms. PyMARL is modular, extensible, built on PyTorch, and serves as a template for dealing with some of the unique challenges of deep MARL in practice. We include results on our full set of SMAC environments using QMIX Rashid et al. (2018) and several baseline algorithms, and challenge the community to make progress on difficult environments in which good performance has remained out of reach so far. We also offer a set of guidelines for best practices in evaluations using our benchmark, including the reporting of standardised performance metrics, sample efficiency, and computational requirements (see Appendix B).
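
As one example of the algorithms PyMARL ships, the sketch below illustrates the monotonic value-mixing idea behind QMIX (Rashid et al., 2018): chosen per-agent utilities are combined into a joint value Q_tot by a mixing network whose weights are produced from the global state by hypernetworks and constrained to be non-negative, so that Q_tot is monotonic in every agent's utility. This is a simplified illustration of that idea, not PyMARL's actual code; the layer sizes are placeholders.

```python
# Simplified QMIX-style monotonic mixing network (illustrative sketch).
import torch
import torch.nn as nn


class MonotonicMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        # Hypernetworks map the global state (available only during
        # centralised training) to the weights of the mixing network.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) chosen per-agent utilities
        # state:    (batch, state_dim) global state
        batch = agent_qs.size(0)
        # abs() keeps the mixing weights non-negative, so dQ_tot/dQ_a >= 0.
        w1 = torch.abs(self.hyper_w1(state)).view(batch, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(batch, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(batch, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2          # (batch, 1, 1)
        return q_tot.view(batch, 1)
```

Because the argmax of Q_tot then factorises into per-agent argmaxes, each agent can act greedily on its own utility at execution time, and the state-conditioned mixer is needed only during centralised training.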

        We hope SMAC will serve as a valuable standard benchmark, enabling systematic and robust progress in deep MARL for years to come.

2 Related Work

Much work has gone into designing environments to test and develop MARL agents. However, few of these focus on providing a qualitatively challenging environment that combines partial observability, challenging dynamics, and high-dimensional observation spaces.

Stone et al. (2005) presented Keepaway soccer, a domain built on the RoboCup soccer simulator (Kitano et al., 1997), a 2D simulation of a football environment with simplified physics, where the main task consists of keeping a ball within a pre-defined area in which agents in teams can reach, steal, and pass the ball, providing a simplified setup for studying cooperative MARL. This domain was later extended to the Half Field Offense task (Kalyanakrishnan et al., 2006; Hausknecht et al., 2016), which increases the difficulty of the problem by requiring the agents not only to keep the ball within bounds but also to score a goal. Neither task scales well in difficulty with the number of agents, as most agents need to do little coordination. There is also a lack of interesting environment dynamics beyond the simple 2D physics, and a lack of good reward signals, reducing the impact of the environment as a testbed.

Multiple gridworld-like environments have also been explored. Lowe et al. (2017) released a set of simple grid-world-like environments for multi-agent RL alongside an implementation of MADDPG, featuring a mix of competitive and cooperative tasks focused on shared communication and low-level continuous control. Leibo et al. (2017) show several mixed-cooperative Markov environments focused on testing social dilemmas; however, they did not release an implementation to further explore the tasks. Yang et al. (2018); Zheng et al. (2017) present a framework for creating gridworlds focused on many-agent tasks, where the number of agents ranges from the hundreds to the millions. This work, however, focuses on testing for emergent behaviour, since environment dynamics and control spaces need to remain relatively simple for the tasks to be tractable. Resnick et al. (2018) propose a multi-agent environment based on the game Bomberman, encompassing a series of cooperative and adversarial tasks meant to provide a more challenging set of tasks with a relatively straightforward 2D state observation and simple grid-world-like action spaces.

Learning to play StarCraft games has also been investigated in several communities: work ranging from evolutionary algorithms to tabular RL has shown that the game is an excellent testbed for both modelling and planning (Ontanón et al., 2013); however, most of it has focused on single-agent settings with multiple controllers and classical algorithms. More recently, progress has been made on developing frameworks that enable researchers working with deep neural networks to test recent algorithms on these games. With the release of TorchCraft (Synnaeve et al., 2016) and SC2LE (Vinyals et al., 2017), interfaces to StarCraft: BroodWar and StarCraft II respectively, work on applying deep RL algorithms to single-agent and multi-agent versions of micromanagement tasks has thus been steadily appearing (Usunier et al., 2016; Foerster et al., 2017, 2018a; Rashid et al., 2018; Nardelli et al., 2018; Hu et al., 2018; Shao et al., 2018; Foerster et al., 2018b). Our work presents the first standardised testbed for decentralised control in this space.

Other work focuses on playing the full game of StarCraft, including macromanagement and tactics (Pang et al., 2018; Sun et al., 2018; DeepMind, 2019). Because we introduce decentralisation and local observability, our agents are excessively restricted compared to normal full gameplay. SMAC is therefore not intended as an environment to train agents for use in full gameplay. Instead, we use the StarCraft II game engine to build rich and interesting multi-agent problems.

3 Multi-Agent Reinforcement Learning

In SMAC, we focus on tasks where a team of agents needs to work together to achieve a common goal. We briefly review the formalism of such fully cooperative multi-agent tasks as Dec-POMDPs but refer readers to Oliehoek & Amato (2016) for a more complete picture.

Dec-POMDPs

Formally, a Dec-POMDP $G$ is given by a tuple $G = \langle S, U, P, r, Z, O, n, \gamma \rangle$, where $s \in S$ is the true state of the environment. At each time step, each agent $a \in A \equiv \{1, \dots, n\}$ chooses an action $u^{a} \in U$, forming a joint action $\mathbf{u} \in \mathbf{U} \equiv U^{n}$. This causes a transition of the environment according to the state transition function $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$.
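
For reference, the remaining components of the tuple, together with the per-agent histories and policies defined on top of them, are conventionally given as follows. This is the standard Dec-POMDP notation used in this line of work (see Oliehoek & Amato, 2016), not anything specific to SMAC:

```latex
% Standard Dec-POMDP components (following Oliehoek & Amato, 2016).
\begin{align*}
  r(s, \mathbf{u}) &: S \times \mathbf{U} \to \mathbb{R}
      && \text{team reward shared by all agents} \\
  O(s, a) &: S \times A \to Z
      && \text{observation function: agent } a \text{ receives } z \in Z \\
  \gamma &\in [0, 1)
      && \text{discount factor} \\
  \tau^{a} &\in T \equiv (Z \times U)^{*}
      && \text{action-observation history of agent } a \\
  \pi^{a}(u^{a} \mid \tau^{a}) &: T \times U \to [0, 1]
      && \text{decentralised stochastic policy of agent } a
\end{align*}
```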
