QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

QMIX is a novel method for deep multi-agent reinforcement learning that trains decentralised policies in a centralised, end-to-end fashion. By structurally enforcing that the joint action-value is monotonic in each per-agent value, QMIX makes it tractable to maximise the joint action-value in off-policy learning and guarantees consistency between the centralised and decentralised policies. Across a range of StarCraft II micromanagement tasks, QMIX outperforms existing value-based multi-agent reinforcement learning methods.

Contents

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
Abstract
1 Introduction
2 Related Work
3 Background
3.1 Deep Q-Learning
3.2 Deep Recurrent Q-Learning
3.3 Independent Q-Learning
3.4 Value Decomposition Networks
4 QMIX
4.1 Representational Complexity
5 Two-Step Game
6 Experimental Setup
6.1 Decentralised StarCraft II Micromanagement
6.2 Ablations
7 Results
7.1 Main Results
7.2 Ablation Results
7.3 Learned Policies
8 Conclusion
Acknowledgements
Appendix A QMIX
A.1 Representational Complexity
Appendix B Two-Step Game
B.1 Architecture and Training
B.2 Learned Value Functions
B.3 Results
Appendix C StarCraft II Setup
C.1 Environment Features
C.2 Architecture and Training
Appendix D StarCraft II Results

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Abstract

        In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint-action value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.

Machine Learning, ICML

1 Introduction

Reinforcement learning (RL) holds considerable promise to help address a variety of cooperative multi-agent problems, such as coordination of robot swarms (Hüttenrauch et al., 2017) and autonomous cars (Cao et al., 2012).

In many such settings, partial observability and/or communication constraints necessitate the learning of decentralised policies, which condition only on the local action-observation history of each agent. Decentralised policies also naturally attenuate the problem that joint action spaces grow exponentially with the number of agents, often rendering the application of traditional single-agent RL methods impractical.
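
To make this growth concrete (the numbers here are purely illustrative and not taken from the paper): with n agents that each choose among |U| individual actions, the joint action space contains |U|^n joint actions, so for n = 8 and |U| = 14 there are already 14^8 ≈ 1.5 × 10^9 joint actions to consider, whereas each decentralised policy only ever selects among its own 14 actions.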

Fortunately, decentralised policies can often be learned in a centralised fashion in a simulated or laboratory setting. This often grants access to additional state information, otherwise hidden from agents, and removes inter-agent communication constraints. The paradigm of centralised training with decentralised execution (Oliehoek et al., 2008; Kraemer & Banerjee, 2016) has recently attracted attention in the RL community (Jorge et al., 2016; Foerster et al., 2018). However, many challenges surrounding how to best exploit centralised training remain open.

One of these challenges is how to represent and use the action-value function that most RL methods learn. On the one hand, properly capturing the effects of the agents’ actions requires a centralised action-value function Qtot that conditions on the global state and the joint action. On the other hand, such a function is difficult to learn when there are many agents and, even if it can be learned, offers no obvious way to extract decentralised policies that allow each agent to select only an individual action based on an individual observation.
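
What consistency between the centralised and decentralised policies requires can be stated precisely: it suffices that a greedy per-agent choice with respect to each Qa recovers the greedy joint action with respect to Qtot. QMIX obtains this by enforcing monotonicity of Qtot in the per-agent values (the structural constraint mentioned in the abstract):

\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u})
= \Big( \arg\max_{u^1} Q_1(\tau^1, u^1), \; \dots, \; \arg\max_{u^n} Q_n(\tau^n, u^n) \Big),
\qquad
\frac{\partial Q_{tot}}{\partial Q_a} \ge 0, \quad \forall a,

where \tau^a is agent a's local action-observation history, u^a its individual action, and \boldsymbol{\tau}, \mathbf{u} the joint history and joint action.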

Figure 1: Decentralised unit micromanagement in StarCraft II, where each learning agent controls an individual unit. The goal is to coordinate behaviour across agents to defeat all enemy units.

The simplest option is to forgo a centralised action-value function and let each agent a learn an individual action-value function Qa independently, as in independent Q-learning (IQL) (Tan, 1993). However, this approach cannot explicitly represent interactions between the agents and may not converge, as each agent’s learning is confounded by the learning and exploration of others.
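
As a reference point for what learning "independently" means operationally, below is a minimal tabular sketch of IQL in Python. The table sizes and hyperparameters are assumptions for illustration, not details from the paper; the key point is that each agent updates its own Q-table from its own local observation and the shared team reward, so the other simultaneously learning and exploring agents are folded into its environment, which is exactly why that environment appears non-stationary.

```python
# A minimal tabular sketch of independent Q-learning (IQL), for illustration
# only: agent counts, table sizes, and hyperparameters are assumed values.
import random
from collections import defaultdict

n_agents, n_actions = 2, 5
alpha, gamma, eps = 0.1, 0.99, 0.1

# One independent Q-table per agent, keyed by that agent's local observation.
Q = [defaultdict(lambda: [0.0] * n_actions) for _ in range(n_agents)]

def select_action(agent, obs):
    """Epsilon-greedy action for one agent, using only its own Q-values."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda u: Q[agent][obs][u])

def q_update(agent, obs, action, reward, next_obs, done):
    """Standard per-agent Q-learning update on the shared team reward.
    From this agent's point of view, the other learning agents are part of
    the environment, so the transition dynamics it experiences drift as
    they learn and explore."""
    target = reward if done else reward + gamma * max(Q[agent][next_obs])
    Q[agent][obs][action] += alpha * (target - Q[agent][obs][action])
```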

At the other extreme, we can learn a fully centralised state-action value function Qtot and then use it to guide the optimisation of decentralised policies in an actor-critic framework, an approach taken by counterfactual multi-agent (COMA) policy gradients (Foerster et al., 2018), as well as work by Gupta et al. (2017). However, this requires on-policy learning, which can be sample-inefficient, and training the fully centralised critic becomes impractical when there are more than a handful of agents.

We evaluate QMIX on a range of unit micromanagement tasks built in StarCraft II (Vinyals et al., 2017). Our experiments show that QMIX outperforms IQL and VDN, both in terms of absolute performance and learning speed. In particular, our method shows considerable performance gains on a task with heterogeneous agents. Moreover, our ablations show both the necessity of conditioning on the state information and the non-linear mixing of agent Q-values in order to achieve consistent performance across tasks.
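
Below is a minimal PyTorch-style sketch of the kind of monotonic mixing network this refers to: the chosen per-agent Q-values are combined non-linearly into Q_tot, the mixing weights are generated from the global state by hypernetworks, and an absolute value keeps those weights non-negative so that Q_tot is monotonic in every Q_a. Layer sizes and names are illustrative rather than the paper's exact hyperparameters.

```python
# A sketch of a QMIX-style monotonic mixing network (illustrative sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: map the global state to the mixing weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) Q-values of the actions the agents took
        # state:    (batch, state_dim) global state, available during training
        bs = agent_qs.size(0)
        agent_qs = agent_qs.view(bs, 1, self.n_agents)
        # Absolute value keeps every mixing weight non-negative, which
        # enforces dQ_tot/dQ_a >= 0; biases may take any sign.
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs, w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(bs, 1)
```

Q_tot produced this way can be trained end-to-end against ordinary TD targets, while at execution time each agent simply acts greedily with respect to its own Q_a; the monotonicity constraint is what makes that decentralised greedy choice consistent with the centralised one.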

2 Related Work

Recent work in multi-agent RL has started moving from tabular methods (Yang & Gu, 2004; Busoniu et al., 2008) to deep learning methods that can tackle high-dimensional state and action spaces (Tampuu et al., 2017; Foerster et al., 2018; Peng et al., 2017). In this paper, we focus on cooperative settings.

On the one hand, a natural approach to finding policies for a multi-agent system is to directly learn decentralised value functions or policies. Independent Q-learning (Tan, 1993) trains independent action-value functions for each agent using Q-learning (Watkins, 1989). Tampuu et al. (2017) extend this approach to deep neural networks using DQN (Mnih et al., 2015). While trivially achieving decentralisation, these approaches are prone to instability arising from the non-stationarity of the environment induced by simultaneously learning and exploring agents. Omidshafiei et al. (2017) and Foerster et al. (2017) address learning stabilisation to some extent, but still learn decentralised value functions and do not allow for the inclusion of extra state information during training.

On the other hand, centralised learning of joint actions can naturally handle coordination problems and avoids non-stationarity, but is hard to scale, as the joint action space grows exponentially in the number of agents. Classical approaches to scalable centralised learning include coordination graphs (Guestrin et al., 2002), which exploit conditional independencies between agents by decomposing a global reward function into a sum of agent-local terms. Sparse cooperative Q-learning (Kok & Vlassis, 2006) is a tabular Q-learning algorithm that learns to coordinate the actions of a group of cooperative agents only in the states in which such coordination is necessary, encoding those dependencies in a coordination graph. These methods require the dependencies between agents to be pre-supplied, whereas we do not require such prior knowledge. Instead, we assume that each agent always contributes to the global reward, and learns the magnitude of its contribution in each state.
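
For comparison, the simplest value factorisation in this spirit, used by the value decomposition networks (VDN) that QMIX is evaluated against, is purely additive,

Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^a, u^a),

which is a special case of the monotonic combinations QMIX can represent, since any non-negative (here, unit) weighting of the Q_a satisfies \partial Q_{tot} / \partial Q_a \ge 0.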

More recent approaches for centralised learning require even more communication during execution: CommNet (Sukhbaatar et al., 
