Multi-Agent Collaboration via Reward Attribution Decomposition
https://arxiv.org/abs/2010.08531
Abstract
Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders-of-magnitude more training rounds than humans and may not generalize to slightly altered environments or new agent configurations (i.e., ad hoc team play). In this work, we propose Collaborative Q-learning (CollaQ) that achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization on reward assignment and show that under certain conditions, each agent has a decentralized Q-function that is approximately optimal and can be decomposed into two terms: the self-term that only relies on the agent’s own state, and the interactive term that is related to states of nearby agents, often observed by the current agent. The two terms are jointly trained using regular DQN, regulated with a Multi-Agent Reward Attribution (MARA) loss that ensures both terms retain their semantics. CollaQ is evaluated on various StarCraft maps, outperforming existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN) by improving the win rate by 40% with the same number of environment steps. In the more challenging ad hoc team play setting (i.e., reweight/add/remove units without re-training or finetuning), CollaQ outperforms previous SoTA by over 30%.
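To make the decomposition concrete, the following is a minimal sketch in illustrative notation (the symbols $Q_i^{\text{self}}$, $Q_i^{\text{inter}}$, and the empty observation $\varnothing$ are our shorthand, not necessarily the paper's exact formulation):

$$Q_i(s_i, o_i, a_i) \;\approx\; Q_i^{\text{self}}(s_i, a_i) \;+\; Q_i^{\text{inter}}(s_i, o_i, a_i),$$
$$\mathcal{L}_{\text{MARA}} \;=\; \mathbb{E}\!\left[\big(Q_i^{\text{inter}}(s_i, \varnothing, a_i)\big)^2\right],$$

where $s_i$ is agent $i$'s own state, $o_i$ its observation of nearby agents, and $a_i$ its action. The MARA regularizer pushes the interactive term toward zero when no teammates are observed, so the self-term keeps the semantics of the agent acting alone; the full objective adds $\mathcal{L}_{\text{MARA}}$ to the standard DQN temporal-difference loss on $Q_i$.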
1 Introduction
In recent years, multi-agent deep reinforcement learning (MARL) has drawn increasing interest from the research community. MARL algorithms have shown super-human performance in various games such as Dota 2 (Berner et al., 2019), Quake 3 Arena (Jaderberg et al., 2019), and StarCraft (Samvelyan et al., 2019). However, these algorithms (Schulman et al., 2017; Mnih et al., 2013) are far less sample-efficient than humans. For example, in Hide and Seek (Baker et al., 2019), agents need 2.69–8.62 million episodes to learn a simple door-blocking strategy, while humans need only a few rounds to learn the same behavior. One key reason for this slow learning is that the number of joint states grows exponentially with the number of agents.
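As a back-of-the-envelope illustration of this blow-up (ours, not the paper's): if each of $N$ agents has $k$ local states, the joint state space contains $k^N$ states, so 10 agents with only 100 local states each already induce $100^{10} = 10^{20}$ joint states that a centralized learner would in principle have to cover.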
Moreover, many real-world situations require agents to adapt to new team configurations. This can be modeled as the ad hoc multi-agent reinforcement learning setting (Stone et al., 2010) (Ad-hoc MARL), in which agents must adapt to different team sizes and configurations at test time. In contrast to the MARL setting, where agents can learn a fixed, team-dependent policy, in the Ad-hoc MARL setting agents must assess and adapt to the capabilities of others in order to behave optimally. Existing work on ad hoc team play either requires sophisticated online learning at test time (Barrett et al., 2011) or relies on prior knowledge about teammate behaviors (Barrett and Stone, 2015); as a result, these methods do not generalize to complex real-world scenarios. Most existing work focuses either on improving generalization to different opponent strategies (Lanctot et al., 2017; Hu et al., 2020) or on simple ad hoc settings such as a varying number of test-time teammates (Schwab et al., 2018; Long et al., 2020). We consider a more general setting in which test-time teammates may have different capabilities. The need to reason about different team configurations in Ad-hoc MARL results in an additional exponential increase in representational complexity (Stone et al., 2010) compared to the MARL setting.
In collaborative settings, one way to address the complexity of ad hoc team play is to explicitly model how agents collaborate. A key observation in this paper is that, when collaborating with different agents, an agent changes its behavior because it realizes the team could function better if it focuses on some of the rewards while leaving the rest to its teammates. Inspired by this principle, we formulate multi-agent collaboration as a joint optimization over an implicit reward assignment among agents. Because the rewards are assigned dif