[VDN] Value-Decomposition Networks For Cooperative Multi-Agent Learning

Contents

Value-Decomposition Networks For Cooperative Multi-Agent Learning
Abstract
1 Introduction
1.1 Other Related Work
2 Background
2.1 Reinforcement Learning
2.2 Deep Q-Learning
2.3 Multi-Agent Reinforcement Learning
3 A Deep-RL Architecture for Coop-MARL
4 Experiments
4.1 Agents
4.2 Environments
4.3 Results
4.4 The Learned Q-Decomposition
5 Conclusions
Appendix A: Plots
Appendix B: Diagrams

Value-Decomposition Networks For Cooperative Multi-Agent Learning


https://arxiv.org/pdf/1706.05296.pdf

Submitted 16 June 2017

Abstract

        We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the “lazy agent” problem, which arises due to partial observability. We address these problems by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially-observable multi-agent domains and show that learning such value-decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.

1 Introduction

We consider the cooperative multi-agent reinforcement learning (MARL) problem (Panait and Luke, 2005, Busoniu et al., 2008, Tuyls and Weiss, 2012), in which a system of several learning agents must jointly optimize a single reward signal – the team reward – accumulated over time. Each agent has access to its own (“local”) observations and is responsible for choosing actions from its own action set. Coordinated MARL problems emerge in applications such as coordinating self-driving vehicles and/or traffic signals in a transportation system, or optimizing the productivity of a factory comprised of many interacting components. More generally, with AI agents becoming more pervasive, they will have to learn to coordinate to achieve common goals.

Although in practice some applications may require local autonomy, in principle the cooperative MARL problem could be treated using a centralized approach, reducing the problem to single-agent reinforcement learning (RL) over the concatenated observations and combinatorial action space. We show that the centralized approach consistently fails on relatively simple cooperative MARL problems in practice. We present a simple experiment in which the centralised approach fails by learning inefficient policies with only one agent active and the other being “lazy”. This happens when one agent learns a useful policy, but a second agent is discouraged from learning because its exploration would hinder the first agent and lead to worse team reward.[1]

[1] For example, imagine training a 2-player soccer team using RL with the number of goals serving as the team reward signal. Suppose one player has become a better scorer than the other. When the worse player takes a shot the outcome is on average much worse, and the weaker player learns to avoid taking shots (Hausknecht, 2016).

An alternative approach is to train independent learners to optimize for the team reward. In general each agent is then faced with a non-stationary learning problem, because the dynamics of its environment effectively change as teammates change their behaviours through learning (Laurent et al., 2011). Furthermore, since from a single agent’s perspective the environment is only partially observed, agents may receive spurious reward signals that originate from their teammates’ (unobserved) behaviour. Because of this inability to explain its own observed rewards, naive independent RL is often unsuccessful: for example, Claus and Boutilier (1998) show that independent Q-learners cannot distinguish teammates’ exploration from stochasticity in the environment, and fail to solve even an apparently trivial, 2-agent, stateless, 3×3-action problem; the general Dec-POMDP problem is known to be intractable (Bernstein et al., 2000, Oliehoek and Amato, 2016). Though we here focus on 2-player coordination, we note that the problems with individual learners and centralized approaches only get worse with more agents: most rewards then do not relate to the individual agent, and the action space grows exponentially for the fully centralized approach.
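
To see why this matters, the sketch below runs two independent tabular Q-learners on a stateless 3×3 cooperative matrix game in the spirit of the example above (a minimal illustration with an assumed payoff matrix and hyperparameters, not the exact game from Claus and Boutilier, 1998). Each learner observes only its own action and the shared team reward, so a teammate’s ε-greedy exploration is indistinguishable from reward stochasticity, and the greedy joint policy frequently settles on a safe but suboptimal joint action.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stateless 3x3 cooperative matrix game (illustrative payoffs, not from the paper).
# The optimal joint action is (0, 0), but miscoordination around it is punished.
PAYOFF = np.array([[ 11, -30,   0],
                   [-30,   7,   6],
                   [  0,   0,   5]], dtype=float)

n_actions = 3
eps, alpha = 0.2, 0.1
Q = [np.zeros(n_actions), np.zeros(n_actions)]  # one independent Q-table per agent

def act(q):
    # epsilon-greedy over the agent's own (local) Q-values
    return rng.integers(n_actions) if rng.random() < eps else int(np.argmax(q))

for _ in range(20000):
    a0, a1 = act(Q[0]), act(Q[1])
    r = PAYOFF[a0, a1]                      # single joint (team) reward
    # Each agent updates as if the reward depended only on its own action,
    # so the teammate's exploration looks like reward stochasticity.
    Q[0][a0] += alpha * (r - Q[0][a0])
    Q[1][a1] += alpha * (r - Q[1][a1])

print("Agent 0 Q-values:", np.round(Q[0], 2))
print("Agent 1 Q-values:", np.round(Q[1], 2))
print("Greedy joint action:", (int(np.argmax(Q[0])), int(np.argmax(Q[1]))))
```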

One approach to improving the performance of independent learners is to design individual reward functions, more directly related to individual agent observations. However, even in the single-agent case, reward shaping is difficult and only a small class of shaped reward functions are guaranteed to preserve optimality w.r.t. the true objective (Ng et al., 1999, Devlin et al., 2014, Eck et al., 2016). In this paper we aim for more general autonomous solutions, in which the decomposition of the team value function is learned.
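
For reference, the small class alluded to here is potential-based reward shaping (Ng et al., 1999): adding a shaping term of the form F(s, a, s') = γΦ(s') − Φ(s), for some potential function Φ over states, is guaranteed not to change which policies are optimal; more general hand-designed per-agent rewards carry no such guarantee, which is why manual reward design is fragile.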

We introduce a novel learned additive value-decomposition approach over individual agents. Implicitly, the value decomposition network aims to learn an optimal linear value decomposition from the team reward signal, by back-propagating the total Q gradient through deep neural networks representing the individual component value functions. This additive value decomposition is specifically motivated by avoiding the spurious reward signals that emerge in purely independent learners. The implicit value function learned by each agent depends only on local observations, and so is more easily learned. Our solution also ameliorates the coordination problem of independent learning highlighted in Claus and Boutilier (1998).
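
To make the additive decomposition concrete: the team value is approximated as Q((h1, ..., hd), (a1, ..., ad)) ≈ Σi Q̃i(hi, ai), where each Q̃i depends only on agent i’s local observations and actions. The sketch below is a minimal illustration, not the authors’ implementation: the feed-forward agent networks, their sizes, and all training details are assumptions (the paper trains recurrent agents with the usual DQN machinery). It shows how a single TD loss on the team reward back-propagates through the sum into every agent network.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent value network: local observation -> Q-values for local actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

class VDN(nn.Module):
    """Additive value decomposition: Q_tot = sum_i Q_i(o_i, a_i)."""
    def __init__(self, n_agents, obs_dim, n_actions):
        super().__init__()
        self.agents = nn.ModuleList(
            [AgentQNet(obs_dim, n_actions) for _ in range(n_agents)]
        )

    def forward(self, obs, actions):
        # obs: [batch, n_agents, obs_dim], actions: [batch, n_agents] (int64)
        q_i = [
            net(obs[:, i]).gather(1, actions[:, i:i+1]).squeeze(1)
            for i, net in enumerate(self.agents)
        ]
        return torch.stack(q_i, dim=1).sum(dim=1)   # Q_tot per batch element

    def max_q_tot(self, obs):
        # Because Q_tot is a sum, each agent can maximise its own head independently.
        return torch.stack(
            [net(obs[:, i]).max(dim=1).values for i, net in enumerate(self.agents)],
            dim=1,
        ).sum(dim=1)

# One DQN-style update on a batch of random transitions (shapes are assumptions).
n_agents, obs_dim, n_actions, batch, gamma = 2, 8, 5, 32, 0.99
model = VDN(n_agents, obs_dim, n_actions)
target = VDN(n_agents, obs_dim, n_actions)
target.load_state_dict(model.state_dict())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

obs      = torch.randn(batch, n_agents, obs_dim)
actions  = torch.randint(n_actions, (batch, n_agents))
team_rew = torch.randn(batch)            # single joint reward signal
next_obs = torch.randn(batch, n_agents, obs_dim)

q_tot = model(obs, actions)
with torch.no_grad():
    td_target = team_rew + gamma * target.max_q_tot(next_obs)
loss = nn.functional.mse_loss(q_tot, td_target)   # gradient flows into every agent net
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```

Because Q_tot is a sum of per-agent terms, the argmax over the joint action factorises into independent per-agent argmaxes, which is what lets training use the centralised team signal while execution remains fully decentralised.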
