IPPO: Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?

Paper: https://arxiv.org/abs/2011.09533

Abstract

Most recently developed approaches to cooperative multi-agent reinforcement learning in the centralized training with decentralized execution setting involve estimating a centralized, joint value function. In this paper, we demonstrate that, despite its various theoretical shortcomings, Independent PPO (IPPO), a form of independent learning in which each agent simply estimates its local value function, can perform just as well as or better than state-of-the-art joint learning approaches on the popular multi-agent benchmark suite SMAC with little hyperparameter tuning. We also compare IPPO to several variants; the results suggest that IPPO's strong performance may be due to its robustness to some forms of environment non-stationarity.
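To make the core idea concrete, below is a minimal sketch of the per-agent learner that IPPO is built on: a standard PPO actor paired with a critic that estimates only that agent's local value, both conditioned on the agent's local observation. This is an illustration of the idea under assumptions (PyTorch framing, network sizes, hyperparameters), not the authors' released implementation; the paper's use of parameter sharing across agents is also not shown here.

```python
# Sketch of one IPPO agent: PPO with a *local* value function.
# Network sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class AgentActorCritic(nn.Module):
    """Actor and critic that condition only on one agent's local observation."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1))

    def forward(self, obs):
        # Returns the action distribution and the local value estimate.
        return Categorical(logits=self.actor(obs)), self.critic(obs).squeeze(-1)

def ippo_loss(model, obs, actions, old_log_probs, advantages, returns,
              clip_eps=0.2, value_coef=0.5):
    """PPO clipped surrogate plus a local value loss, computed per agent."""
    dist, values = model(obs)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (values - returns).pow(2).mean()  # local value function only
    return policy_loss + value_coef * value_loss
```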

1 Introduction

Many practical control problems feature a team of multiple agents that must coordinate to achieve a common goal [5, 12]. Cooperative multi-agent reinforcement learning (MARL) has shown considerable promise in solving tasks that can be described as a Dec-POMDP [17], i.e., where agents optimise a single scalar team reward signal in a partially observable environment while choosing actions based only on their own local action-observation histories [18, 2, 29].
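As a rough illustration of this Dec-POMDP setting, the loop below has every agent act only on its own local observation history while the whole team receives a single scalar reward. The `env` and `policies` interfaces are hypothetical placeholders, not SMAC's actual API.

```python
# Illustrative Dec-POMDP interaction loop. `env` and `policies` are hypothetical:
# env.step takes one action per agent and returns new local observations,
# a single team reward, and a done flag.
def run_episode(env, policies, max_steps=200):
    obs = env.reset()                      # one local observation per agent
    histories = [[o] for o in obs]         # each agent keeps only its own history
    team_return = 0.0
    for _ in range(max_steps):
        actions = [pi(h) for pi, h in zip(policies, histories)]
        obs, team_reward, done = env.step(actions)   # single scalar team reward
        team_return += team_reward
        for h, o in zip(histories, obs):
            h.append(o)
        if done:
            break
    return team_return
```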

Independent learning (IL) decomposes an n-agent MARL problem into n decentralised single-agent problems in which all other agents are treated as part of the environment, and policies are learnt that condition only on an agent's local observation history. While easy to distribute and decentralisable by construction, IL suffers from a variety of theoretical limitations that may result in learning instabilities or the suboptimal performance observed in practice [27, 9, IQL, IAC]. Firstly, the presence of other learning and exploring agents renders the resulting environment non-stationary from the given agent's perspective, forfeiting convergence guarantees [27]. Secondly, independent learners are not always able to distinguish environment stochasticity from another agent's exploration, making them unable to learn optimal policies in some environments [6].
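The decomposition can be pictured as a wrapper that presents the multi-agent environment to one agent as an ordinary single-agent problem, with every other agent's (still-changing) policy folded into the environment dynamics. The interface below is a hypothetical sketch; it also makes the non-stationarity visible: whenever the other policies update, agent i's effective environment changes.

```python
# Hypothetical sketch of the independent-learning view for agent `i`:
# all other agents are treated as part of the environment.
class IndependentView:
    def __init__(self, env, agent_idx, other_policies):
        self.env = env
        self.i = agent_idx
        self.others = other_policies   # maps agent index -> policy; these keep
                                       # learning, so this "environment" is non-stationary

    def reset(self):
        self.obs = self.env.reset()
        return self.obs[self.i]

    def step(self, my_action):
        joint = [my_action if j == self.i else self.others[j](o)
                 for j, o in enumerate(self.obs)]
        self.obs, team_reward, done = self.env.step(joint)
        return self.obs[self.i], team_reward, done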

In fact, decentralised policies need not be learnt in a decentralised fashion. For safety and efficiency reasons [26], MARL training frequently takes place centrally in a laboratory or in simulation, allowing agents access to each other's observations during training, as well as to otherwise unobservable extra state information. Centralised training allows a single joint policy to be trained for all agents that conditions on the joint observations and extra state information. While centralised joint learning reduces or removes issues surrounding partial observability and environment non-stationarity, it must cope with joint action spaces that grow exponentially with the number of agents, as well as a variety of learning pathologies that can result in suboptimal policies [32]. Importantly, vanilla joint policies are not inherently decentralisable, and naive policy distillation approaches are often ineffective [4]. Joint learning does not immediately address the multi-agent credit assignment problem either.
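The exponential blow-up is easy to see: with n agents each choosing from |A| actions, a joint policy must reason over |A|^n action combinations, whereas n independent learners each handle only |A|. The numbers below are illustrative, not taken from any specific SMAC map.

```python
from itertools import product

n_agents, n_actions = 8, 14            # illustrative sizes, not a specific SMAC map
print(n_actions ** n_agents)           # 14**8 = 1,475,789,056 joint actions

# Enumerating joint actions explicitly is only feasible for tiny problems:
tiny = list(product(range(3), repeat=2))   # 2 agents, 3 actions each -> 9 combinations
print(len(tiny))
```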

Recent research has focused on algorithms that can exploit the benefits of combining centralised training with decentralised execution [
