【CollaQ】【Multi-Agent Collaboration via Reward Attribution Decomposition】

CollaQ is a multi-agent collaboration method that performs collaborative Q-learning via reward attribution decomposition. On the StarCraft Multi-Agent Challenge it improves the win rate by 40%, and by allowing agents to adapt to different team configurations it outperforms existing techniques by more than 30% in ad hoc team play.

Multi-Agent Collaboration via Reward Attribution Decomposition

https://arxiv.org/abs/2010.08531

Abstract

        Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders-of-magnitude more training rounds than humans and may not generalize to slightly altered environments or new agent configurations (i.e., ad hoc team play). In this work, we propose Collaborative Q-learning (CollaQ) that achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization on reward assignment and show that under certain conditions, each agent has a decentralized Q-function that is approximately optimal and can be decomposed into two terms: the self-term that only relies on the agent’s own state, and the interactive term that is related to states of nearby agents, often observed by the current agent. The two terms are jointly trained using regular DQN, regulated with a Multi-Agent Reward Attribution (MARA) loss that ensures both terms retain their semantics. CollaQ is evaluated on various StarCraft maps, outperforming existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN) by improving the win rate by 40% with the same number of environment steps. In the more challenging ad hoc team play setting (i.e., reweight/add/remove units without re-training or finetuning), CollaQ outperforms previous SoTA by over 30%. 
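
To make the decomposition concrete, here is a minimal PyTorch-style sketch of the structure the abstract describes: each agent's Q-value is the sum of a self term that sees only the agent's own observation and an interactive term that also sees nearby agents, trained with a standard DQN TD loss plus a MARA-style regularizer that pushes the interactive term toward zero when no teammates are observed. All names (SelfQNet, InteractiveQNet, obs_alone, mara_weight, ...) are illustrative assumptions, not the authors' code, and target networks and QMIX-style mixing are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfQNet(nn.Module):
    """Self term Q_self(o_i^self, a): depends only on the agent's own observation."""
    def __init__(self, self_obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(self_obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs_self):
        return self.net(obs_self)

class InteractiveQNet(nn.Module):
    """Interactive term Q_interact(o_i, a): also conditions on observed nearby agents."""
    def __init__(self, full_obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(full_obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs_full):
        return self.net(obs_full)

def collaq_style_loss(q_self, q_interact, batch, gamma=0.99, mara_weight=1.0):
    """DQN TD loss on Q_self + Q_interact plus a MARA-style regularizer.

    obs_alone is the full observation with teammate features masked out;
    forcing the interactive term toward zero on it keeps each term's semantics.
    """
    q = q_self(batch["obs_self"]) + q_interact(batch["obs_full"])      # decomposed Q
    q_taken = q.gather(1, batch["actions"].unsqueeze(1)).squeeze(1)

    with torch.no_grad():                                              # TD target (target nets omitted)
        q_next = q_self(batch["next_obs_self"]) + q_interact(batch["next_obs_full"])
        target = batch["rewards"] + gamma * (1 - batch["dones"]) * q_next.max(dim=1).values

    td_loss = F.mse_loss(q_taken, target)
    mara_loss = q_interact(batch["obs_alone"]).pow(2).mean()           # interactive term ≈ 0 without teammates
    return td_loss + mara_weight * mara_loss
```

The point of the regularizer is exactly what the abstract states: without it, the interactive network could absorb the self term and the decomposition would lose its intended meaning.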

1 Introduction 

        In recent years, multi-agent deep reinforcement learning (MARL) has drawn increasing interest from the research community. MARL algorithms have shown super-human level performance in various games like Dota 2 (Berner et al., 2019), Quake 3 Arena (Jaderberg et al., 2019), and StarCraft (Samvelyan et al., 2019). However, the algorithms (Schulman et al., 2017; Mnih et al., 2013) are far less sample-efficient than humans. For example, in Hide and Seek (Baker et al., 2019), it takes agents 2.69–8.62 million episodes to learn a simple strategy of door blocking, while it only takes humans several rounds to learn this behavior. One of the key reasons for the slow learning is that the number of joint states grows exponentially with the number of agents.
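
As a back-of-the-envelope illustration of that last point (our numbers, not the paper's): with $|S|$ local states per agent and $N$ agents, the joint state space already contains

$$|\mathcal{S}_{\text{joint}}| = |S|^{N}, \qquad \text{e.g. } |S| = 10,\; N = 5 \;\Rightarrow\; 10^{5} \text{ joint states},$$

so every additional agent multiplies the space a centralized learner would have to cover.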

        Moreover, many real-world situations require agents to adapt to new team configurations. This can be modeled as ad hoc multi-agent reinforcement learning (Stone et al., 2010) (Ad-hoc MARL), in which agents must adapt to different team sizes and configurations at test time. In contrast to the MARL setting, where agents can learn a fixed and team-dependent policy, in the Ad-hoc MARL setting agents must assess and adapt to the capabilities of others to behave optimally. Existing work on ad hoc team play either requires sophisticated online learning at test time (Barrett et al., 2011) or prior knowledge about teammate behaviors (Barrett and Stone, 2015). As a result, it does not generalize to complex real-world scenarios. Most existing works either focus on improving generalization towards different opponent strategies (Lanctot et al., 2017; Hu et al., 2020) or on simple ad hoc settings such as a varying number of test-time teammates (Schwab et al., 2018; Long et al., 2020). We consider a more general setting where test-time teammates may have different capabilities. The need to reason about different team configurations in Ad-hoc MARL results in an additional exponential increase (Stone et al., 2010) in representational complexity compared to the MARL setting.

        In the situation of collaboration, one way to address the complexity of the ad hoc team play setting is to explicitly model and address how agents collaborate. In this paper, one key observation is that when collaborating with different agents, an agent changes her behavior because she realizes that the team could function better if she focuses on some of the rewards while leaving other rewards to other teammates. Inspired by this principle, we formulate multi-agent collaboration as a joint optimization over an implicit reward assignment among agents. Because the rewards are assigned differently…
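
One way to write this reward-assignment view down, as a hedged sketch in our own notation rather than the paper's exact objective: let $r$ be the external team reward and $r_i$ the implicit reward assigned to agent $i$; collaboration is then a joint optimization over the assignment and the agents' decentralized behaviors, e.g.

$$\max_{r_1, \dots, r_K} \; \sum_{i=1}^{K} V_i^{*}(s_i;\, r_i) \quad \text{s.t.} \quad \sum_{i=1}^{K} r_i = r,$$

where $V_i^{*}(s_i; r_i)$ is agent $i$'s optimal value under its assigned reward. Under this reading, the self/interactive split in the abstract corresponds to expanding each agent's optimal Q-function around the assignment it would receive with no nearby teammates.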
