A Review of Cooperative Multi-Agent Deep Reinforcement Learning

https://arxiv.org/abs/1908.03963

Abstract

        Deep Reinforcement Learning has made significant progress in multi-agent systems in recent years. In this review article, we focus on recent approaches to Multi-Agent Reinforcement Learning (MARL) algorithms. In particular, we cover five common approaches to modeling and solving cooperative multi-agent reinforcement learning problems: (I) independent learners, (II) fully observable critic, (III) value function factorization, (IV) consensus, and (V) learn to communicate. First, we elaborate on each of these methods, their possible challenges, and how those challenges are mitigated in the relevant papers. Where applicable, we further draw connections among the different papers in each category. Next, we cover some emerging research areas in MARL along with the relevant recent papers. Given the recent success of MARL in real-world applications, we devote a section to reviewing these applications and the corresponding articles. A list of available environments for MARL research is also provided in this survey. Finally, the paper concludes with proposals for possible research directions.

Keywords: Reinforcement Learning, Multi-agent systems, Cooperative.

1 Introduction 


        Multi-Agent Reinforcement Learning (MARL) algorithms deal with systems consisting of several agents (robots, machines, cars, etc.) that interact within a common environment. Each agent makes a decision at each time step and works along with the other agent(s) toward a predetermined goal. The goal of MARL algorithms is to learn a policy for each agent such that all agents together achieve the goal of the system. In particular, the agents are learnable units that aim to learn an optimal policy on the fly, maximizing the long-term cumulative discounted reward through interaction with the environment. Due to the complexity of the environment or the combinatorial nature of the problem, training the agents is typically challenging, and several of the problems MARL deals with are categorized as NP-hard, e.g. manufacturing scheduling (Gabel and Riedmiller 2007, Dittrich and Fohlmeister 2020), the vehicle routing problem (Silva et al. 2019, Zhang et al. 2020b), and some multi-agent games (Bard et al. 2020), to name only a few.
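
        For concreteness, the long-term objective just described can be written in the standard form below. This formalization is our own addition for reference, not quoted from the survey; $\pi_i$ denotes the policy of agent $i$, $r_t$ the (shared) team reward at time step $t$, and $\gamma$ the discount factor:

\[
\max_{\pi_1,\dots,\pi_N} \; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_t \,\right], \qquad 0 \le \gamma < 1 .
\]

        In the cooperative setting reviewed here, all agents share the single reward signal $r_t$, so maximizing this expectation aligns every agent with the goal of the system.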

        Deep reinforcement learning (RL) has recently achieved super-human-level control on Atari games (Mnih et al. 2015), mastered the games of Go (Silver et al. 2016) and chess (Silver et al. 2017), and succeeded in robotics (Kober et al. 2013), health care planning (Liu et al. 2017), power grids (Glavic et al. 2017), routing (Nazari et al. 2018), and inventory optimization (Oroojlooyjadid et al.). Motivated by these successes on one hand, and by the importance of multi-agent systems (Wang et al. 2016b, Leibo et al. 2017) on the other, considerable research has focused on deep MARL. One naive approach to solving these problems is to convert them into a single-agent problem and make the decisions for all agents using a centralized controller. However, in this approach the number of joint actions typically grows exponentially, which makes the problem intractable. Besides, each agent needs to send its local information to the central controller, and as the number of agents increases this approach becomes very expensive or impossible. In addition to the communication cost, this approach depends entirely on the central unit and is vulnerable to any incident that results in the loss of the network. Moreover, in multi-agent problems each agent usually has access only to some local information, and due to privacy concerns the agents may not be allowed to share that information with one another.
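
        To see why the centralized reformulation becomes intractable, count the joint actions a central controller must choose among. The worked numbers below are an illustrative addition of ours, assuming $N$ agents that share a common discrete action set $A$:

\[
|\mathcal{A}_{\text{joint}}| \;=\; \prod_{i=1}^{N} |\mathcal{A}_i| \;=\; |A|^{N},
\]

        so, for example, $10$ agents with $5$ actions each already give $5^{10} = 9{,}765{,}625$ joint actions, and every additional agent multiplies that count by another factor of $|A|$.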

        Several properties of a system are important when modeling it as a multi-agent system: (i) centralized or decentralized control, (ii) a fully or partially observable environment, and (iii) a cooperative or competitive environment. With centralized control, a central unit makes the decision for every agent at each time step; in a decentralized system, each agent makes its own decision. Also, the agents might cooperate to achieve a common goal (e.g. a group of robots trying to identify a source), or they might compete with each other to maximize their own rewards (e.g. players on different teams of a game). In each of these cases, an agent might be able to access the whole information and the sensory observations (if any) of the other agents, or it might be able to observe only its own local information. In this paper, we focus on decentralized problems with a cooperative goal, and we review most of the relevant papers with either full or partial observability. Note that Weiß (1995), Matignon et al. (2012), Buşoniu et al. (2010), and Bu et al. (2008) provide reviews of cooperative games and general MARL algorithms published up to 2012. Also, Da Silva and Costa (
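
        The cooperative, partially observable setting on which this paper focuses is commonly formalized as a decentralized partially observable Markov decision process (Dec-POMDP). The notation below is a standard formulation that we add for reference; it is not spelled out in this excerpt:

\[
\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, P, R, \{\Omega_i\}_{i \in \mathcal{N}}, O, \gamma \rangle,
\]

        where $\mathcal{N}$ is the set of agents, $\mathcal{S}$ the state space, $\mathcal{A}_i$ the action set of agent $i$, $P(s' \mid s, \mathbf{a})$ the transition function over joint actions $\mathbf{a}$, $R(s, \mathbf{a})$ the shared team reward, $\Omega_i$ the observation set of agent $i$, $O$ the observation function, and $\gamma$ the discount factor. Fully observable cooperative problems are the special case in which each agent observes the state $s$ directly.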
