Asynchronous Methods for Deep Reinforcement Learning
Submitted 4 February 2016
https://arxiv.org/abs/1602.01783
Abstract
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
1 Introduction
Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013, 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent’s data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013, 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.
Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction; and it requires off-policy learning algorithms that can update from data generated by an older policy.
In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents’ data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.
Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results in far less time than previous GPU-based algorithms, using far fewer resources than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents, makes it the most general and successful reinforcement learning agent to date.
2 Related Work
The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).
In earlier work, (Li & Schuurmans, 2011) applied the Map Reduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.
(Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.
Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.
3 Reinforcement Learning Background
4 Asynchronous RL Framework
We now present multi-threaded asynchronous variants of one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic. The aim in designing these methods was to find RL algorithms that can train deep neural network policies reliably and without large resource requirements. While the underlying RL methods are quite different, with actor-critic being an on-policy policy search method and Q-learning being an off-policy value-based method, we use two main ideas to make all four algorithms practical given our design goal.
First, we use asynchronous actor-learners, similarly to the Gorila framework (Nair et al., 2015), but instead of using separate machines and a parameter server, we use multiple CPU threads on a single machine. Keeping the learners on a single machine removes the communication costs of sending gradients and parameters and enables us to use Hogwild! (Recht et al., 2011) style updates for training.
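The Hogwild!-style update pattern described above can be sketched as follows. This is a minimal illustration in plain Python threads, not the paper's implementation: the toy regression loss, learning rate, and thread count are our own choices, and the point is only that every worker writes gradients directly into the shared parameter vector without any locking.

```python
import threading
import numpy as np

# Shared parameter vector, updated lock-free by all actor-learner threads.
theta = np.zeros(4)
target = np.array([1.0, -2.0, 3.0, 0.5])  # toy regression target (illustrative)

def worker(seed, steps=2000, lr=0.05):
    """Each worker computes gradients on its own stream of data and applies
    them directly to the shared parameters, with no synchronization."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x = rng.standard_normal(4)
        # Gradient of 0.5 * (theta.x - target.x)^2 with respect to theta.
        err = theta @ x - target @ x
        grad = err * x
        theta[:] -= lr * grad  # in-place write to shared memory, no lock

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.round(theta, 1))  # converges near [1, -2, 3, 0.5]
```

Despite occasional lost or stale updates from the unsynchronized writes, the shared parameters still converge, which is exactly the robustness property Hogwild! (Recht et al., 2011) relies on.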
Second, we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.
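One simple way to realize per-thread exploration diversity is to sample a different epsilon for each actor-learner's epsilon-greedy policy at thread start. The sketch below is illustrative only; the specific epsilon values and the way they are drawn here are our assumptions, not the paper's exact schedule.

```python
import random

def make_exploration_policy(rng):
    """Give each actor-learner its own epsilon, sampled once at thread start,
    so parallel workers follow different exploration policies.
    (The candidate epsilon values here are illustrative.)"""
    epsilon = rng.choice([0.5, 0.1, 0.01])
    def act(q_values):
        if rng.random() < epsilon:
            return rng.randrange(len(q_values))  # explore: random action
        # exploit: greedy action with the highest Q-value
        return max(range(len(q_values)), key=q_values.__getitem__)
    return act

# Four workers, each with its own randomly chosen exploration rate.
policies = [make_exploration_policy(random.Random(seed)) for seed in range(4)]
q = [0.1, 0.9, 0.3]
actions = [act(q) for act in policies]
```

Because each thread explores at a different rate, the parallel workers visit different parts of the state space, decorrelating the updates they apply to the shared parameters.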
In addition to stabilizing learning, using multiple parallel actor-learners has several practical benefits. First, we obtain a reduction in training time that is roughly linear in the number of parallel actor-learners. Second, since we no longer rely on experience replay for stabilizing learning, we are able to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way. We now describe our variants of one-step Q-learning, one-step Sarsa, n-step Q-learning and advantage actor-critic.
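The n-step variants share one core computation: working backwards from the last state of a rollout segment, each return is built recursively as R = r + gamma * R, seeded with the critic's bootstrap value V(s_{t_max}) (or 0 at episode end), and the actor-critic update then uses the advantage R_t - V(s_t). The sketch below shows that computation on made-up numbers; the reward and value figures are purely illustrative.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """n-step returns for a rollout segment, computed backwards:
    R_t = r_t + gamma * R_{t+1}, seeded with V(s_{t_max}) or 0 at episode end."""
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

# Advantage estimates for the actor-critic update: A_t = R_t - V(s_t).
rewards = [1.0, 0.0, 1.0]           # rewards from a 3-step rollout (illustrative)
values = [0.5, 0.4, 0.6]            # V(s_t) from the critic (illustrative)
returns = n_step_returns(rewards, bootstrap_value=0.2)
advantages = [R - v for R, v in zip(returns, values)]
```

Each thread would accumulate policy gradients weighted by these advantages (plus a value-regression term for the critic) over its segment, then apply them to the shared parameters as described above.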