[IQL] Multiagent Cooperation and Competition with Deep Reinforcement Learning

Contents

Multiagent Cooperation and Competition with Deep Reinforcement Learning
Abstract
Introduction
1 Methods
1.1 The Deep Q-Learning Algorithm
1.2 Adaptation of the Code for the Multiplayer Paradigm
1.3 Game Selection
1.4 Rewarding Schemes
1.4.1 Score More than the Opponent (Fully Competitive)
1.4.2 Losing the Ball Penalizes Both Players (Fully Cooperative)
1.4.3 Transition Between Cooperation and Competition
1.5 Training Procedure
1.6 Collecting the Game Statistics
2 Results
2.1 Emergence of Competitive Agents
2.2 Emergence of Collaborative Agents
2.3 Progression from Competition to Collaboration
3 Discussion
3.1 Limitations
3.2 Future Work
4 Supplementary Materials
4.1 Evolution of Deep Q-Networks
4.2 Code
4.3 Videos
Author Contributions
Acknowledgements

Multiagent Cooperation and Competition with Deep Reinforcement Learning

https://arxiv.org/abs/1511.08779

Published 27 November 2015

Abstract

        Multiagent systems appear in most social, economical, and political situations. In the present work we extend the Deep Q-Learning Network architecture proposed by Google DeepMind to multiagent environments and investigate how two agents controlled by independent Deep Q-Networks interact in the classic videogame Pong. By manipulating the classical rewarding scheme of Pong we demonstrate how competitive and collaborative behaviors emerge. Competitive agents learn to play and score efficiently. Agents trained under collaborative rewarding schemes find an optimal strategy to keep the ball in the game as long as possible. We also describe the progression from competitive to collaborative behavior. The present work demonstrates that Deep Q-Networks can become a practical tool for studying the decentralized learning of multiagent systems living in highly complex environments.

Introduction

        In the ever-changing world, biological and engineered agents need to cope with unpredictability. By learning from trial and error an animal or a robot can adapt its behavior in a novel or changing environment. This is the main intuition behind reinforcement learning [SB98, PM10]. A reinforcement learning agent modifies its behavior based on the rewards that it collects while interacting with the environment. By trying to maximize the reward during these interactions an agent can learn to implement complex long-term strategies.

        Due to the astronomic number of states of any realistic scenario, for a long time algorithms implementing reinforcement learning were either limited to simple environments or needed to be assisted by additional information about the dynamics of the environment. Recently, however, the Swiss AI Lab IDSIA [KCSG13] and Google DeepMind [MKS+13, MKS+15] have produced spectacular results in applying reinforcement learning to very high-dimensional and complex environments such as video games. In particular, DeepMind demonstrated that AI agents can achieve superhuman performance in a diverse range of Atari video games. Remarkably, the learning agent only used raw sensory input (screen images) and the reward signal (increase in game score). The proposed methodology, the so-called Deep Q-Network, combines a convolutional neural network for feature representation with Q-learning training [Lin93]. It represents the state-of-the-art approach for model-free reinforcement learning problems in complex environments. The fact that the same algorithm was used for learning very different games suggests its potential for general purpose applications.

        The present article builds on the work of DeepMind and explores how multiple agents controlled by autonomous Deep Q-Networks interact when sharing a complex environment. Multiagent systems appear naturally in most social, economical, and political scenarios. Indeed, most game theory problems deal with multiple agents taking decisions to maximize their individual returns in a static environment [BBDS08]. Collective animal behavior [Sum10] and distributed control systems are also important examples of multiple autonomous systems. Phenomena such as cooperation, communication, and competition may emerge in reinforced multiagent systems.

        The goal of the present work is to study emergent cooperative and competitive strategies between multiple agents controlled by autonomous Deep Q-Networks. As in the original article by DeepMind, we use Atari video games as the environment where the agents receive only the raw screen images and respective reward signals as input. We explore how two agents behave and interact in such complex environments when trained with different rewarding schemes.

        In particular, using the video game Pong we demonstrate that by changing the rewarding schemes of the agents, either competitive or cooperative behavior emerges. Competitive agents get better at scoring, while the collaborative agents find an optimal strategy to keep the ball in the game for as long as possible. We also tune the rewarding schemes in order to study the intermediate states between competitive and cooperative modes and observe the progression from competitive to collaborative behavior.

1 Methods 


1.1 The Deep Q-Learning Algorithm


        The goal of reinforcement learning is to find a policy – a rule to decide which action to take in each of the possible states – that maximizes the agent’s accumulated long-term reward in a dynamic environment. The problem is especially challenging when the agent must learn without explicit information about the dynamics of the environment or the rewards. In this case perhaps the most popular learning algorithm is Q-learning [Wat89]. Q-learning allows one to estimate the value, or quality, of each action in a particular state of the environment.
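
        For reference, the standard tabular Q-learning update that produces these estimates can be written as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],$$

where $\alpha$ is the learning rate and $\gamma$ the discount factor. The Deep Q-Network approach discussed below replaces the table $Q(s, a)$ with a convolutional neural network.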

        Recently Google DeepMind trained convolutional neural networks to approximate these so-called Q-value functions. Leveraging the powerful feature representation of convolutional neural networks, the so-called Deep Q-Networks have obtained state-of-the-art results in complex environments. In particular, a trained agent achieved superhuman performance in a range of Atari video games by using only raw sensory input (screen images) and the reward signal [MKS+15, SQAS15].
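
        As an illustration only, a minimal PyTorch sketch of the network shape reported in [MKS+15] is given below. The reference implementation accompanying that paper is written in Lua/Torch, so everything beyond the published layer sizes is an assumption of this sketch.

```python
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of the convolutional Q-network shape reported in [MKS+15]."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 stacked 84x84 frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7x64 feature maps for 84x84 input
            nn.Linear(512, n_actions),              # one Q-value per available action
        )

    def forward(self, x):
        # x holds raw pixel values in [0, 255]; scale to [0, 1] before the convolutions.
        return self.head(self.conv(x / 255.0))
```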

        When two or more agents share an environment the problem of reinforcement learning is much less understood. The distributed nature of the learning offers new benefits but also challenges, such as the definition of good learning goals or the convergence and consistency of algorithms [Sch14, BBDS08]. For example, in the multiagent case the environment state transitions and rewards are affected by the joint action of all agents. This means that the value of an agent’s action depends on the actions of the others, and hence each agent must keep track of each of the other learning agents, possibly resulting in an ever-moving target. In general, learning in the presence of other agents requires a delicate trade-off between the stability and adaptive behavior of each agent.

        There exist several possible adaptations of the Q-learning algorithm for the multiagent case. However, this is an open research area and theoretical guarantees for multiagent model-free reinforcement learning algorithms are scarce and restricted to specific types of tasks [Sch14, BBDS08]. In practice the simplest method consists of using an autonomous Q-learning algorithm for each agent in the environment, thereby using the environment as the sole source of interaction between agents. In this work we use this method due to its simplicity, decentralized nature, computational speed, and ability to produce consistent results for the range of tasks we report. Therefore, in our tests each agent is controlled by an independent Deep Q-Network with architecture and parameters as reported in [MKS+15].
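
        The decentralized scheme can be summarized by the schematic sketch below. The two-player environment interface (one shared screen, and one action and one reward per agent) is hypothetical and serves only to illustrate that each agent runs its own unmodified DQN learner and interacts with the other agent solely through the environment.

```python
def train_independent_agents(env, agent_left, agent_right, n_steps):
    """Two autonomous DQN learners sharing one environment (hypothetical interface)."""
    frame = env.reset()
    for _ in range(n_steps):
        # Each agent picks its action from its own Q-network, given the same screen.
        a_left = agent_left.act(frame)
        a_right = agent_right.act(frame)
        # The environment is the only channel of interaction between the agents:
        # it returns a separate reward signal for each player.
        next_frame, reward_left, reward_right, done = env.step(a_left, a_right)
        # Each agent keeps its own replay memory and performs its own Q-learning update.
        agent_left.observe(frame, a_left, reward_left, next_frame, done)
        agent_right.observe(frame, a_right, reward_right, next_frame, done)
        agent_left.learn()
        agent_right.learn()
        frame = env.reset() if done else next_frame
```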

1.2 Adaptation of the Code for the Multiplayer Paradigm
