IPPO: Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?

Paper: https://arxiv.org/abs/2011.09533

Abstract

Most recently developed approaches to cooperative multi-agent reinforcement learning in the centralized training with decentralized execution setting involve estimating a centralized, joint value function. In this paper, we demonstrate that, despite its various theoretical shortcomings, Independent PPO (IPPO), a form of independent learning in which each agent simply estimates its local value function, can perform just as well as or better than state-of-the-art joint learning approaches on the popular multi-agent benchmark suite SMAC with little hyperparameter tuning. We also compare IPPO to several variants; the results suggest that IPPO's strong performance may be due to its robustness to some forms of environment non-stationarity.
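To make the core idea concrete, below is a minimal sketch of the per-agent learner that IPPO is built on: a standard PPO actor paired with a critic that estimates only that agent's local value, both conditioned on the agent's local observation. This is an illustration of the idea under assumptions (PyTorch framing, network sizes, hyperparameters), not the authors' released implementation; the paper's use of parameter sharing across agents is also not shown here.

```python
# Sketch of one IPPO agent: PPO with a *local* value function.
# Network sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class AgentActorCritic(nn.Module):
    """Actor and critic that condition only on one agent's local observation."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1))

    def forward(self, obs):
        # Returns the action distribution and the local value estimate.
        return Categorical(logits=self.actor(obs)), self.critic(obs).squeeze(-1)

def ippo_loss(model, obs, actions, old_log_probs, advantages, returns,
              clip_eps=0.2, value_coef=0.5):
    """PPO clipped surrogate plus a local value loss, computed per agent."""
    dist, values = model(obs)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (values - returns).pow(2).mean()  # local value function only
    return policy_loss + value_coef * value_loss
```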

1 Introduction

Many practical control problems feature a team of multiple agents that must coordinate to achieve a common goal [5, 12]. Cooperative multi-agent reinforcement learning (MARL) has shown considerable promise in solving tasks that can be described as a Dec-POMDP [17], i.e., where agents optimise a single scalar team reward signal in a partially observable environment while choosing actions based only on their own local action-observation histories [18, 2, 29].
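As a rough illustration of this Dec-POMDP setting, the loop below has every agent act only on its own local observation history while the whole team receives a single scalar reward. The `env` and `policies` interfaces are hypothetical placeholders, not SMAC's actual API.

```python
# Illustrative Dec-POMDP interaction loop. `env` and `policies` are hypothetical:
# env.step takes one action per agent and returns new local observations,
# a single team reward, and a done flag.
def run_episode(env, policies, max_steps=200):
    obs = env.reset()                      # one local observation per agent
    histories = [[o] for o in obs]         # each agent keeps only its own history
    team_return = 0.0
    for _ in range(max_steps):
        actions = [pi(h) for pi, h in zip(policies, histories)]
        obs, team_reward, done = env.step(actions)   # single scalar team reward
        team_return += team_reward
        for h, o in zip(histories, obs):
            h.append(o)
        if done:
            break
    return team_return
```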

Independent learning (IL) decomposes an n-agent MARL problem into n decentralised single-agent problems in which all other agents are treated as part of the environment, and policies are learnt that condition only on an agent's local observation history. While easy to distribute and decentralisable by construction, IL suffers from a variety of theoretical limitations that may result in learning instabilities or the suboptimal performance observed in practice [27, 9, IQL, IAC]. Firstly, the presence of other learning and exploring agents renders the resulting environment non-stationary from the given agent's perspective, forfeiting convergence guarantees [27]. Secondly, independent learners are not always able to distinguish environment stochasticity from another agent's exploration, making them unable to learn optimal policies in some environments [6].
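The decomposition can be pictured as a wrapper that presents the multi-agent environment to one agent as an ordinary single-agent problem, with every other agent's (still-changing) policy folded into the environment dynamics. The interface below is a hypothetical sketch; it also makes the non-stationarity visible: whenever the other policies update, agent i's effective environment changes.

```python
# Hypothetical sketch of the independent-learning view for agent `i`:
# all other agents are treated as part of the environment.
class IndependentView:
    def __init__(self, env, agent_idx, other_policies):
        self.env = env
        self.i = agent_idx
        self.others = other_policies   # maps agent index -> policy; these keep
                                       # learning, so this "environment" is non-stationary

    def reset(self):
        self.obs = self.env.reset()
        return self.obs[self.i]

    def step(self, my_action):
        joint = [my_action if j == self.i else self.others[j](o)
                 for j, o in enumerate(self.obs)]
        self.obs, team_reward, done = self.env.step(joint)
        return self.obs[self.i], team_reward, done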

In fact, decentralised policies need not be learnt in a decentralised fashion. For safety and efficiency reasons [26], MARL training frequently takes place centrally in a laboratory or in simulation, allowing agents access to each other's observations during training, as well as to otherwise unobservable extra state information. Centralised training allows a single joint policy to be trained for all agents that conditions on the joint observations and extra state information. While centralised joint learning reduces or removes issues surrounding partial observability and environment non-stationarity, it must cope with joint action spaces that grow exponentially with the number of agents, as well as a variety of learning pathologies that can result in suboptimal policies [32]. Importantly, vanilla joint policies are not inherently decentralisable, and naive policy distillation approaches are often ineffective [4]. Joint learning does not immediately address the multi-agent credit assignment problem either.
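The exponential blow-up is easy to see: with n agents each choosing from |A| actions, a joint policy must reason over |A|^n action combinations, whereas n independent learners each handle only |A|. The numbers below are illustrative, not taken from any specific SMAC map.

```python
from itertools import product

n_agents, n_actions = 8, 14            # illustrative sizes, not a specific SMAC map
print(n_actions ** n_agents)           # 14**8 = 1,475,789,056 joint actions

# Enumerating joint actions explicitly is only feasible for tiny problems:
tiny = list(product(range(3), repeat=2))   # 2 agents, 3 actions each -> 9 combinations
print(len(tiny))
```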

Recent research has focused on algorithms that can exploit the benefits of combining centralised training with decentralised execution [
