MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning

MALib is a parallel framework for population-based multi-agent reinforcement learning (PB-MARL), designed to handle the heterogeneous tasks and data demands that arise in PB-MARL. It provides a centralized task-dispatching model that supports self-generated tasks and scalable training, an Actor-Evaluator-Learner programming architecture that achieves high parallelism for both training and sampling, and a higher-level abstraction of MARL training paradigms that enables code reuse and flexible deployment. Experiments show that MALib delivers significant performance gains over RLlib and OpenSpiel on multi-agent tasks.

MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning


Abstract

        Population-based multi-agent reinforcement learning (PB-MARL) refers to the series of methods nested with reinforcement learning (RL) algorithms, which produces a self-generated sequence of tasks arising from the coupled population dynamics. By leveraging auto-curricula to induce a population of distinct emergent strategies, PB-MARL has achieved impressive success in tackling multi-agent tasks. Despite the remarkable prior art in distributed RL frameworks, PB-MARL poses new challenges for parallelizing the training frameworks due to the additional complexity of multiple nested workloads between sampling, training and evaluation involved with heterogeneous policy interactions. To solve these problems, we present MALib, a scalable and efficient computing framework for PB-MARL. Our framework comprises three key components: (1) a centralized task dispatching model, which supports the self-generated tasks and scalable training with heterogeneous policy combinations; (2) a programming architecture named Actor-Evaluator-Learner, which achieves high parallelism for both training and sampling, and meets the evaluation requirement of auto-curriculum learning; (3) a higher-level abstraction of MARL training paradigms, which enables efficient code reuse and flexible deployment on different distributed computing paradigms. Experiments on a series of complex tasks such as multi-agent Atari games show that MALib achieves a throughput higher than 40K FPS on a single machine with 32 CPU cores, a 5× speedup over RLlib, and at least a 3× speedup over OpenSpiel in multi-agent training tasks. MALib is publicly available at https://github.com/sjtu-marl/malib.

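To make the second component above more concrete, here is a minimal single-process sketch of how an Actor-Evaluator-Learner split can be organized. Everything in it is an illustrative assumption for this post, not MALib's actual API: the class names, the scalar stand-in for policy parameters, and the random rewards are all placeholders, and in MALib these roles run as parallel workers rather than objects driven by one loop.

```python
# A minimal, self-contained sketch of the Actor-Evaluator-Learner pattern.
# All names here are illustrative placeholders, NOT MALib's actual API.

import random
from collections import deque


class Actor:
    """Generates rollouts (sampling) with the current policy parameters."""

    def __init__(self, policy_pool):
        self.policy_pool = policy_pool

    def rollout(self, policy_id, num_steps=32):
        # Placeholder environment interaction: random "rewards" around the policy weight.
        weight = self.policy_pool[policy_id]
        return [("obs", "act", random.gauss(weight, 1.0)) for _ in range(num_steps)]


class Learner:
    """Consumes sampled data and updates the policy it owns."""

    def __init__(self, policy_pool, policy_id, lr=0.05):
        self.policy_pool = policy_pool
        self.policy_id = policy_id
        self.lr = lr

    def train(self, batch):
        mean_reward = sum(r for _, _, r in batch) / len(batch)
        # Toy "gradient step": nudge the scalar policy toward higher reward.
        self.policy_pool[self.policy_id] += self.lr * mean_reward


class Evaluator:
    """Scores policies so an outer auto-curriculum can decide what to train next."""

    def evaluate(self, actor, policy_id, episodes=4):
        returns = [sum(r for _, _, r in actor.rollout(policy_id)) for _ in range(episodes)]
        return sum(returns) / len(returns)


if __name__ == "__main__":
    policy_pool = {"policy_0": 0.0}                 # shared parameter store
    actor = Actor(policy_pool)
    learner = Learner(policy_pool, "policy_0")
    evaluator = Evaluator()
    buffer = deque(maxlen=1024)

    for iteration in range(10):
        buffer.extend(actor.rollout("policy_0"))        # sampling
        learner.train(list(buffer))                     # training
        score = evaluator.evaluate(actor, "policy_0")   # evaluation feeds task generation
        print(f"iter={iteration} score={score:.2f}")
```

In a distributed setting the three roles would exchange data through shared buffers and a parameter store rather than direct method calls, which is what allows sampling and training to proceed in parallel as the abstract describes.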

 1 Introduction


        Training intelligent agents that can adapt to a diverse set of complex environments and agents has been a long-standing challenge. A feasible way to handle these tasks is multi-agent reinforcement learning (MARL) [2], which has shown great potential for solving multi-agent tasks such as real-time strategy games [45], traffic light control [47] and ride-hailing [50]. In particular, PB-MARL algorithms combine deep reinforcement learning (DRL) and dynamic population selection methodologies (e.g., game theory [9], evolutionary strategies [34]) to generate auto-curricula. In this way, PB-MARL continually generates more advanced intelligence and has achieved impressive successes in non-trivial tasks such as Dota 2 [30], StarCraft II [44] and Leduc Poker [23].
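As a loose illustration of the auto-curriculum idea (not the algorithms used in the cited works), the toy loop below grows a population of pure strategies for rock-paper-scissors: an outer step mixes over the current population, and an inner step adds a naive best response to that mixture. The payoff, best_response and uniform_meta_strategy helpers are hypothetical simplifications standing in for the RL training and game-theoretic meta-solvers that real PB-MARL methods (e.g., PSRO-style approaches) use.

```python
# A simplified, illustrative PSRO-style auto-curriculum loop on a symmetric
# zero-sum matrix game (rock-paper-scissors). The "best response" and
# meta-strategy steps are deliberately naive stand-ins; none of this is MALib code.

import random


def payoff(a, b):
    """Payoff for player 1; actions 0=rock, 1=paper, 2=scissors."""
    if a == b:
        return 0.0
    return 1.0 if (a - b) % 3 == 1 else -1.0


def best_response(opponent_population, opponent_weights):
    """Return the pure action maximizing expected payoff vs. the opponent mixture."""
    def value(action):
        return sum(w * payoff(action, o) for o, w in zip(opponent_population, opponent_weights))
    return max(range(3), key=value)


def uniform_meta_strategy(population):
    """Naive meta-solver: play every policy in the population uniformly."""
    n = len(population)
    return [1.0 / n] * n


population = [random.randrange(3)]                     # start from one arbitrary policy
for generation in range(5):
    weights = uniform_meta_strategy(population)        # outer loop: solve the meta-game
    new_policy = best_response(population, weights)    # inner loop: "train" a response
    population.append(new_policy)                      # grow the population
    print(f"generation {generation}: population = {population}")
```

Each added best response changes the meta-game that the next iteration has to solve, which is exactly the self-generated sequence of tasks referred to in the abstract.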

        However, due to the intrinsic dynamics arising from multi-agent interaction and population evolution, these algorithms have intricately nested structures and are extremely data-hungry, requiring a flexible and scalable training framework to ground their effectiveness.

