MARL: Actor-Attention-Critic for Multi-Agent Reinforcement Learning

https://arxiv.org/abs/1810.02912

Abstract

        Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings. We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. This attention mechanism enables more effective and scalable learning in complex multi-agent environments, when compared to recent approaches. Our approach is applicable not only to cooperative settings with shared rewards, but also individualized reward settings, including adversarial settings, as well as settings that do not provide global states, and it makes no assumptions about the action spaces of the agents. As such, it is flexible enough to be applied to most multi-agent learning problems.

Machine Learning, ICML, MARL, RL, Reinforcement Learning, Multi Agent, Attention, Actor-Critic

1 Introduction 


        Reinforcement learning has recently made exciting progress in many domains, including Atari games (Mnih et al., 2015), the ancient Chinese board game Go (Silver et al., 2016), and complex continuous control tasks involving locomotion (Lillicrap et al., 2016; Schulman et al., 2015; 2017; Heess et al., 2017). While most reinforcement learning paradigms focus on single agents acting in a static environment (or against themselves in the case of Go), real-world agents often compete or cooperate with other agents in a dynamically shifting environment. In order to learn effectively in multi-agent environments, agents must not only learn the dynamics of their environment, but also those of the other learning agents present.
        

        To this end, several approaches for multi-agent reinforcement learning have been developed. The simplest approach is to train each agent independently to maximize their individual reward, while treating other agents as part of the environment. However, this approach violates the basic assumption underlying reinforcement learning, that the environment should be stationary and Markovian. Any single agent’s environment is dynamic and nonstationary due to other agents’ changing policies. As such, standard algorithms developed for stationary Markov decision processes fail.

        At the other end of the spectrum, all agents can be collectively modeled as a single agent whose action space is the joint action space of all agents (Buşoniu et al., 2010). While allowing coordinated behaviors across agents, this approach is not scalable, as the size of the joint action space increases exponentially with respect to the number of agents. It also demands a high degree of communication during execution, as the central policy must collect observations from and distribute actions to the individual agents. In real-world settings, this demand can be problematic.
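
To make the scaling concrete, here is a tiny illustrative calculation (not from the paper): if each of N agents chooses among |A| discrete actions, a single central policy over the joint action space must handle |A|^N joint actions.

num_actions_per_agent = 5          # illustrative assumption: 5 actions per agent

for num_agents in (2, 4, 8, 16):
    joint_actions = num_actions_per_agent ** num_agents
    print(f"{num_agents:2d} agents -> {joint_actions:,} joint actions")

With only 5 actions per agent, 16 agents already yield more than 10^11 joint actions, which is why the fully centralized formulation does not scale.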

        Recent work (Lowe et al., 2017; Foerster et al., 2018) attempts to combine the strengths of these two approaches. In particular, a critic (or a number of critics) is centrally learned with information from all agents. The actors, however, receive information only from their corresponding agents. Thus, during testing, executing the policies does not require the knowledge of other agents’ actions. This paradigm circumvents the challenge of non-Markovian and non-stationary environments during learning. Despite this progress, however, algorithms for multi-agent reinforcement learning are still far from being scalable (to larger numbers of agents) and generically applicable to environments and tasks that are cooperative (sharing a global reward), competitive, or mixed.
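
The training/execution split described here can be sketched roughly as follows. This is a minimal, generic illustration of centralized critics with decentralized actors (random linear functions stand in for learned networks; it is not the authors' implementation):

import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim = 3, 8, 2

# Decentralized actors: each policy sees only its own observation.
actor_weights = [rng.normal(size=(obs_dim, act_dim)) for _ in range(n_agents)]

def act(i, obs_i):
    return np.tanh(obs_i @ actor_weights[i])              # local information only

# Centralized critics (used only during training): agent i's critic scores
# its action given every agent's observation and action.
critic_weights = [rng.normal(size=(n_agents * (obs_dim + act_dim),))
                  for _ in range(n_agents)]

def q_value(i, all_obs, all_acts):
    joint = np.concatenate([np.concatenate([o, a]) for o, a in zip(all_obs, all_acts)])
    return float(joint @ critic_weights[i])

obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
acts = [act(i, obs[i]) for i in range(n_agents)]          # execution: no communication
print(q_value(0, obs, acts))                              # training signal: centralized

Because only the critics take the joint information, they can be discarded after training and each agent then acts from its own observation alone.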

        Our approach extends these prior works in several directions. The main idea is to learn a centralized critic with an attention mechanism. The intuition behind our idea comes from the fact that, in many real-world environments, it is beneficial for agents to know which other agents they should pay attention to. For example, a soccer defender needs to pay attention to attackers in their vicinity as well as the player with the ball, while she/he rarely needs to pay attention to the opposing team’s goalie. The specific attackers that the defender is paying attention to can change at different parts of the game, depending on the formation and strategy of the opponent. A typical centralized approach to multi-agent reinforcement learning does not take these dynamics into account, instead simply considering all agents at all timepoints. Our attention critic is able to dynamically select which agents to attend to at each time point during training, improving performance in multi-agent domains with complex interactions.
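
A rough sketch of this kind of attention is shown below in Python; the single attention head, the dimensions, and the random placeholder encodings are illustrative assumptions rather than the paper's exact architecture. Agent i's critic forms a query from its own encoding, compares it with keys computed from the other agents' encodings, and takes a softmax-weighted sum of their values, so the resulting weights say how strongly agent i attends to each other agent at that timestep:

import numpy as np

rng = np.random.default_rng(0)
n_agents, enc_dim, attend_dim = 4, 16, 8

# Per-agent encodings of (observation, action); in practice these come from
# learned encoders, here they are random placeholders.
e = rng.normal(size=(n_agents, enc_dim))

# Attention parameters shared across agents (a single illustrative head).
W_q = rng.normal(size=(enc_dim, attend_dim))
W_k = rng.normal(size=(enc_dim, attend_dim))
W_v = rng.normal(size=(enc_dim, attend_dim))

def attend(i):
    """Attention weights of agent i over the other agents, plus the
    weighted sum of their values."""
    others = [j for j in range(n_agents) if j != i]
    q = e[i] @ W_q                                    # query from agent i
    k = np.stack([e[j] @ W_k for j in others])        # keys from the other agents
    v = np.stack([e[j] @ W_v for j in others])        # values from the other agents
    logits = k @ q / np.sqrt(attend_dim)              # scaled dot-product scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                                      # softmax over the other agents
    return w, w @ v                                   # weights, attended summary

weights, summary = attend(0)
print("attention over other agents:", np.round(weights, 3))
# Agent 0's critic would combine `summary` with agent 0's own encoding to
# estimate its Q-value; the weights change as observations and actions change.

Because the attended summary has a fixed size no matter how many other agents contribute to it, each critic's input grows only linearly with the number of agents, which is the scaling property discussed next.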

        Our proposed approach has an input space linearly increasing with respect to the number of agents, as opposed to the quadratic increase in a previous approach (Lowe et al., 2017). It is also applicable to cooperative, competitive, and mixed environments, exceeding the capability of prior work that focuses only on cooperative environments (Foerster et al., 2018). We have validated our approach on three simulated environments and tasks.

        The rest of the paper is organized as follows.
