
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks

Abstract

        Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we provide a systematic evaluation and comparison of three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) in a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms across different learning tasks, and we provide insights regarding the effectiveness of different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase to include additional algorithms and allow for flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research which focus on coordination under sparse rewards.

1 Introduction 


        Multi-agent reinforcement learning (MARL) algorithms use RL techniques to co-train a set of agents in a multi-agent system. Recent years have seen a plethora of new MARL algorithms which integrate deep learning techniques (Papoudakis et al., 2019; Hernandez-Leal et al., 2019). However, comparison of MARL algorithms is difficult due to a lack of established benchmark tasks, evaluation protocols, and metrics. While several comparative studies exist for single-agent RL (Duan et al., 2016; Henderson et al., 2018; Wang et al., 2019), we are unaware of such comparative studies for recent MARL algorithms. Albrecht and Ramamoorthy (2012) compare several MARL algorithms but focus on the application of classic (non-deep) approaches in simple matrix games. Such comparisons are crucial in order to understand the relative strengths and limitations of algorithms, which may guide practical considerations and future research.

        We contribute a comprehensive empirical comparison of nine MARL algorithms in a diverse set of cooperative multi-agent tasks. We compare three classes of MARL algorithms: independent learning, which applies single-agent RL algorithms for each agent without consideration of the multi-agent structure (Tan, 1993); centralised multi-agent policy gradient (Lowe et al., 2017; Foerster et al., 2018; Yu et al., 2021); and value decomposition (Sunehag et al., 2018; Rashid et al., 2018) algorithms. The two latter classes of algorithms follow the Centralised Training Decentralised Execution (CTDE) paradigm. These algorithm classes are frequently used in the literature either as baselines or building blocks for more complex algorithms (He et al., 2016; Sukhbaatar et al., 2016; Foerster et al., 2016; Raileanu et al., 2018; Jaques et al., 2019; Iqbal and Sha, 2019; Du et al., 2019; Ryu et al., 2020). We evaluate algorithms in two matrix games and four multi-agent environments, in which we define a total of 25 different cooperative learning tasks. Hyperparameters of each algorithm are optimised separately in each environment using a grid-search, and we report the maximum and average evaluation returns during training. We run experiments with shared and non-shared parameters between agents, a common implementation detail in MARL that has been shown to affect converged returns (Christianos et al., 2021). In addition to reporting detailed benchmark results, we analyse and discuss insights regarding the effectiveness of different learning approaches.
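To make the parameter-sharing implementation detail concrete, the sketch below contrasts non-shared and shared agent networks in PyTorch. The `AgentNetwork` class, layer sizes, and dimensions are illustrative assumptions rather than EPyMARL code.

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """Hypothetical per-agent network, used only to illustrate parameter sharing."""
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

n_agents, obs_dim, n_actions = 3, 10, 5

# Non-shared parameters: every agent trains its own network.
independent_nets = [AgentNetwork(obs_dim, n_actions) for _ in range(n_agents)]

# Shared parameters: a single network is reused by all agents. A one-hot agent ID
# is commonly appended to the observation so the shared network can still specialise.
shared_net = AgentNetwork(obs_dim + n_agents, n_actions)

obs = torch.randn(n_agents, obs_dim)
agent_ids = torch.eye(n_agents)
shared_logits = shared_net(torch.cat([obs, agent_ids], dim=-1))
independent_logits = torch.stack([net(o) for net, o in zip(independent_nets, obs)])
```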

To facilitate our comparative evaluation, we created the open-source codebase EPyMARL (Extended PyMARL)1, an extension of PyMARL (Samvelyan et al., 2019) which is commonly used in MARL research. EPyMARL implements additional algorithms and allows for flexible configuration of different implementation details, such as whether or not agents share network parameters. Moreover, we have implemented and open-sourced two new multi-agent environments: Level-Based Foraging (LBF) and Multi-Robot Warehouse (RWARE). With these environments we aim to test the algorithms' ability to learn coordination tasks under sparse rewards and partial observability.
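As a rough sketch of how a gym-registered LBF task might be stepped with random joint actions, see below; the package names, environment ID, and version suffix are assumptions and may differ between releases of the two environments.

```python
import gym
# Assumed to be installed via `pip install lbforaging rware`; the environment ID
# below is illustrative and its version suffix may differ between releases.
import lbforaging  # noqa: F401  (registers the Foraging-* gym environments)
import rware       # noqa: F401  (registers the rware-* gym environments)

env = gym.make("Foraging-8x8-2p-1f-v2")   # assumed ID: 2 agents, 8x8 grid, 1 food item
obs = env.reset()                          # tuple of per-agent observations

for _ in range(25):
    joint_action = env.action_space.sample()         # one discrete action per agent
    obs, rewards, dones, info = env.step(joint_action)
    if all(dones):
        obs = env.reset()
```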

2 Algorithms 


2.1 Independent Learning (IL)


For IL, each agent is learning independently and perceives the other agents as part of the environment.

IQL: In Independent Q-Learning (IQL) (Tan, 1993), each agent has a decentralised state-action value function that is conditioned only on the local history of observations and actions of each agent. Each agent receives its local history of observations and updates the parameters of the Q-value network (Mnih et al., 2015) by minimising the standard Q-learning loss (Watkins and Dayan, 1992).
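A minimal sketch of the decentralised update described above: the standard Q-learning loss for a single agent, assuming a feed-forward Q-network, a target network, and a replay batch of that agent's local transitions (all names are illustrative).

```python
import torch
import torch.nn.functional as F

def iql_loss(q_net, target_q_net, batch, gamma=0.99):
    """Standard Q-learning loss, computed independently for one agent.

    `batch` is assumed to hold the agent's local observations (or observation
    histories), actions, rewards, next observations, and done flags.
    """
    obs, actions, rewards, next_obs, dones = batch
    q_values = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_values, target)
```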

IA2C: Independent synchronous Advantage Actor-Critic (IA2C) is a variant of the commonly-used A2C algorithm (Mnih et al., 2016; Dhariwal et al., 2017) for decentralised training in multi-agent systems. Each agent has its own actor to approximate the policy and critic network to approximate the value-function. Both actor and critic are trained, conditioned on the history of local observations, actions and rewards the agent perceives, to minimise the A2C loss.
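A minimal sketch of the per-agent A2C loss described above, assuming discrete actions and precomputed n-step returns; the names and coefficients are illustrative defaults, not values from the paper.

```python
import torch

def ia2c_loss(policy_logits, values, actions, returns, entropy_coef=0.01, value_coef=0.5):
    """A2C loss for one independent agent (illustrative sketch).

    `returns` are n-step returns computed from the agent's own rewards;
    the advantage is the return minus the critic's value estimate.
    """
    dist = torch.distributions.Categorical(logits=policy_logits)
    log_probs = dist.log_prob(actions)
    advantages = returns - values.detach()

    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    entropy_bonus = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```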

IPPO: Independent Proximal Policy Optimisation (IPPO) is a variant of the commonly-used PPO algorithm (Schulman et al., 2017) for decentralised training in multi-agent systems. The architecture of IPPO is identical to IA2C. The main difference between PPO and A2C is that PPO uses a surrogate objective which constrains the relative change of the policy at each update, allowing for more update epochs using the same batch of trajectories. In contrast to PPO, A2C can only perform one update epoch per batch of trajectories to ensure that the training batch remains on-policy.
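For comparison, a minimal sketch of the PPO clipped surrogate objective used by IPPO, assuming log-probabilities under the current and behaviour policies and precomputed advantages; the clipping coefficient is an illustrative default.

```python
import torch

def ippo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective for one independent agent (sketch).

    The ratio of new to old action probabilities is clipped so that the multiple
    update epochs on the same batch cannot move the policy too far.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```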

2.2 Centralised Training Decentralised Execution (CTDE)

