A3C Paper Translation

Asynchronous Methods for Deep Reinforcement Learning

Abstract
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state of the art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

1. Introduction
Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction, and it requires off-policy learning algorithms that can update from data generated by an older policy.

In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.

Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less time than previous GPU-based algorithms, using far fewer resources than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents, makes it the most general and successful reinforcement learning agent to date.
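
To make the parallel actor-learner idea above concrete, here is a minimal sketch of several worker threads applying asynchronous, lock-free updates to shared parameters while each interacts with its own copy of an environment. The toy random-walk environment, the tabular value function, and all hyperparameters are illustrative assumptions, not the setup used in the paper.

```python
import threading
import numpy as np

# Toy stand-in for an environment instance: a 1-D random walk with terminal
# states at both ends. Purely illustrative; not an environment from the paper.
class RandomWalk:
    def __init__(self, n_states=7, seed=0):
        self.n = n_states
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = self.n // 2
        return self.s

    def step(self):
        self.s += self.rng.choice([-1, 1])
        done = self.s in (0, self.n - 1)
        reward = 1.0 if self.s == self.n - 1 else 0.0
        return self.s, reward, done

# Shared parameters: a tabular value estimate that all workers update
# asynchronously, without locks.
theta = np.zeros(7)

def actor_learner(worker_id, episodes=500, alpha=0.1, gamma=1.0):
    env = RandomWalk(seed=worker_id)  # each worker acts in its own environment copy
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step()
            target = r if done else r + gamma * theta[s2]
            theta[s] += alpha * (target - theta[s])  # asynchronous one-step TD update
            s = s2

threads = [threading.Thread(target=actor_learner, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned state values:", np.round(theta, 2))
```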

2. Related Work
The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

In earlier work, (Li & Schuurmans, 2011) applied the MapReduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication. (Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated, as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.
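
The Gorila setup described above follows a parameter-server pattern: learners push gradients to a central copy of the model and pull fresh parameters at fixed intervals. The sketch below illustrates only that communication pattern; the class names, dimensions, and the stand-in gradient are hypothetical and not taken from Gorila's implementation.

```python
import numpy as np

class ParameterServer:
    """Central copy of the model parameters, updated as gradients arrive."""
    def __init__(self, dim, lr=0.01):
        self.params = np.zeros(dim)
        self.lr = lr

    def push_gradients(self, grads):
        self.params -= self.lr * grads  # applied asynchronously in Gorila

    def pull_params(self):
        return self.params.copy()

class ActorLearner:
    """Worker that pushes gradients and refreshes its local copy periodically."""
    def __init__(self, server, sync_every=5):
        self.server = server
        self.local = server.pull_params()
        self.sync_every = sync_every
        self.steps = 0

    def step(self):
        grads = np.random.randn(self.local.size)  # stand-in for a DQN-loss gradient
        self.server.push_gradients(grads)
        self.steps += 1
        if self.steps % self.sync_every == 0:  # fixed-interval parameter refresh
            self.local = self.server.pull_params()

server = ParameterServer(dim=4)
workers = [ActorLearner(server) for _ in range(3)]
for _ in range(20):
    for w in workers:
        w.step()
print(server.params)
```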

5. Experiments

We use four different platforms for assessing the properties of the proposed framework. We perform most of our experiments using the Arcade Learning Environment (Bellemare et al., 2012), which provides a simulator for Atari 2600 games. This is one of the most commonly used benchmark environments for RL algorithms. We use the Atari domain to compare against state-of-the-art results (Van Hasselt et al., 2015; Wang et al., 2015; Schaul et al., 2015; Nair et al., 2015; Mnih et al., 2015), as well as to carry out a detailed stability and scalability analysis of the proposed methods. We performed further comparisons using the TORCS 3D car racing simulator (Wymann et al., 2013). We also use two additional domains to evaluate only the A3C algorithm – MuJoCo and Labyrinth. MuJoCo (Todorov, 2015) is a physics simulator for evaluating agents on continuous motor control tasks with contact dynamics. Labyrinth is a new 3D environment where the agent must learn to find rewards in randomly generated mazes from a visual input. The precise details of our experimental setup can be found in Supplementary Section 8.

5.1. Atari 2600 Games

We first present results on a subset of Atari 2600 games to demonstrate the training speed of the new methods. Figure 1 compares the learning speed of the DQN algorithm trained on an Nvidia K40 GPU with the asynchronous methods trained using 16 CPU cores on five Atari 2600 games. The results show that all four asynchronous methods we presented can successfully train neural network controllers on the Atari domain. The asynchronous methods tend to learn faster than DQN, with significantly faster learning on some games, while training on only 16 CPU cores. Additionally, the results suggest that n-step methods learn faster than one-step methods on some games. Overall, the policy-based advantage actor-critic method significantly outperforms all three value-based methods.

We then evaluated asynchronous advantage actor-critic on 57 Atari games. In order to compare with the state of the art in Atari game playing, we largely followed the training and evaluation protocol of (Van Hasselt et al., 2015). Specifically, we tuned hyperparameters (learning rate and amount of gradient norm clipping) using a search on six Atari games (Beamrider, Breakout, Pong, Q*bert, Seaquest and Space Invaders) and then fixed all hyperparameters for all 57 games. We trained both a feedforward agent with the same architecture as (Mnih et al., 2015; Nair et al., 2015; Van Hasselt et al., 2015) as well as a recurrent agent with an additional 256 LSTM cells after the final hidden layer. We additionally used the final network weights for evaluation to make the results more comparable to the original results from (Bellemare et al., 2012). We trained our agents for four days using 16 CPU cores, while the other agents were trained for 8 to 10 days on Nvidia K40 GPUs. Table 1 shows the average and median human-normalized scores obtained by our agents trained by asynchronous advantage actor-critic (A3C) as well as the current state of the art. Supplementary Table S3 shows the scores on all games. A3C significantly improves on the state-of-the-art average score over 57 games in half the training time of the other methods while using only 16 CPU cores and no GPU. Furthermore, after just one day of training, A3C matches the average human-normalized score of Dueling Double DQN and almost reaches the median human-normalized score of Gorila. We note that many of the improvements presented in Double DQN (Van Hasselt et al., 2015) and Dueling Double DQN (Wang et al., 2015) could be incorporated into the 1-step Q and n-step Q methods presented in this work with similar potential improvements.
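
For reference, human-normalized scores of the kind reported in Table 1 are conventionally computed relative to a random agent and a human tester, as in the prior work cited above. The sketch below shows that calculation; the per-game baseline numbers are hypothetical and only illustrate the arithmetic.

```python
def human_normalized(agent_score, random_score, human_score):
    """Human-normalized score in percent: 100 * (agent - random) / (human - random)."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Hypothetical (agent, random, human) scores, for illustration only.
games = {
    "GameA": (350.0, 2.0, 30.0),
    "GameB": (18.0, -21.0, 9.0),
}
scores = [human_normalized(a, r, h) for a, r, h in games.values()]
print("mean human-normalized score:", sum(scores) / len(scores))
```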

5.2. TORCS Car Racing Simulator

We also compared the four asynchronous methods on the TORCS 3D car racing game (Wymann et al., 2013). TORCS not only has more realistic graphics than Atari 2600 games, but also requires the agent to learn the dynamics of the car it is controlling. At each step, an agent received only a visual input in the form of an RGB image of the current frame as well as a reward proportional to the agent's velocity along the center of the track at the agent's current position. We used the same neural network architecture as the one used in the Atari experiments specified in Supplementary Section 8. We performed experiments using four different settings – the agent controlling a slow car with and without opponent bots, and the agent controlling a fast car with and without opponent bots. Full results can be found in Supplementary Figure S6. A3C was the best performing agent, reaching between roughly 75% and 90% of the score obtained by a human tester on all four game configurations in about 12 hours of training. A video showing the learned driving behavior of the A3C agent can be found at https://youtu.be/0xo1Ldx3L5Q.
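
The reward signal described above is proportional to the car's velocity along the track's centre line. A minimal sketch of that kind of reward follows; the exact scaling and sign conventions used in the paper are not given in this excerpt, so they are assumptions here.

```python
import math

def torcs_style_reward(speed, angle_to_track_axis):
    """Component of the car's speed along the track's centre-line direction."""
    return speed * math.cos(angle_to_track_axis)

print(torcs_style_reward(speed=20.0, angle_to_track_axis=math.radians(15)))
```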

5.3. Continuous Action Control Using the MuJoCo Physics Simulator

We also examined a set of tasks where the action space is continuous. In particular, we looked at a set of rigid body physics domains with contact dynamics where the tasks include many examples of manipulation and locomotion. These tasks were simulated using the MuJoCo physics engine. We evaluated only the asynchronous advantage actor-critic algorithm since, unlike the value-based methods, it is easily extended to continuous actions. In all problems, using either the physical state or pixels as input, asynchronous advantage actor-critic found good solutions in less than 24 hours of training and typically in under a few hours. Some successful policies learned by our agent can be seen in the following video: https://youtu.be/Ajjc08-iPx8. Further details about this experiment can be found in Supplementary Section 9.
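
A common way to extend the actor-critic policy to continuous actions, consistent with the description above, is to have the network output the mean and standard deviation of a diagonal Gaussian over actions. The sketch below shows sampling, log-probability, and entropy for such a head; the specific parameterization is an assumption rather than a detail quoted from this excerpt.

```python
import numpy as np

def gaussian_policy_sample(mu, sigma, rng=np.random.default_rng(0)):
    """Sample an action from a diagonal Gaussian policy and return the
    quantities needed for a policy-gradient update."""
    action = rng.normal(mu, sigma)
    log_prob = -0.5 * np.sum(((action - mu) / sigma) ** 2
                             + 2.0 * np.log(sigma) + np.log(2.0 * np.pi))
    entropy = 0.5 * np.sum(np.log(2.0 * np.pi * np.e * sigma ** 2))
    return action, log_prob, entropy

action, log_prob, entropy = gaussian_policy_sample(mu=np.zeros(3), sigma=0.5 * np.ones(3))
print(action, log_prob, entropy)
```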

5.4. Labyrinth

We performed an additional set of experiments with A3C on a new 3D environment called Labyrinth. The specific task we considered involved the agent learning to find rewards in randomly generated mazes. At the beginning of each episode the agent was placed in a new randomly generated maze consisting of rooms and corridors. Each maze contained two types of objects that the agent was rewarded for finding – apples and portals. Picking up an apple led to a reward of 1. Entering a portal led to a reward of 10, after which the agent was respawned in a new random location in the maze and all previously collected apples were regenerated. An episode terminated after 60 seconds, after which a new episode would begin. The aim of the agent is to collect as many points as possible in the time limit, and the optimal strategy involves first finding the portal and then repeatedly going back to it after each respawn. This task is much more challenging than the TORCS driving domain because the agent is faced with a new maze in each episode and must learn a general strategy for exploring random mazes.

We trained an A3C LSTM agent on this task using only 84 × 84 RGB images as input. The final average score of around 50 indicates that the agent learned a reasonable strategy for exploring random 3D mazes using only a visual input. A video showing one of the agents exploring previously unseen mazes is included at https://youtu.be/nMR5mjCFZCw.
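
A tiny sketch of the episode scoring implied by the reward scheme above (+1 per apple, +10 per portal, within a 60-second episode); the event sequence is made up for illustration.

```python
APPLE_REWARD, PORTAL_REWARD = 1, 10

def episode_return(events):
    """events: the 'apple' / 'portal' pickups collected within one 60-second episode."""
    return sum(APPLE_REWARD if e == "apple" else PORTAL_REWARD for e in events)

print(episode_return(["apple", "apple", "portal", "apple", "portal"]))
```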

5.5. Scalability and Data Efficiency

We analyzed the effectiveness of our proposed framework by looking at how the training time and data efficiency change with the number of parallel actor-learners. When using multiple workers in parallel and updating a shared model, one would expect that in an ideal case, for a given task and algorithm, the number of training steps to achieve a certain score would remain the same with varying numbers of workers. Therefore, the advantage would be solely due to the ability of the system to consume more data in the same amount of wall-clock time and possibly improved exploration. Table 2 shows the training speed-up achieved by using increasing numbers of parallel actor-learners, averaged over seven Atari games. These results show that all four methods achieve substantial speedups from using multiple worker threads, with 16 threads leading to at least an order of magnitude speedup. This confirms that our proposed framework scales well with the number of parallel workers, making efficient use of resources. Somewhat surprisingly, asynchronous one-step Q-learning and Sarsa algorithms exhibit superlinear speedups that cannot be explained by purely computational gains. We observe that one-step methods (one-step Q and one-step Sarsa) often require less data to achieve a particular score when using more parallel actor-learners. We believe this is due to the positive effect of multiple threads reducing the bias in one-step methods. These effects are shown more clearly in Figure 3, which plots the average score against the total number of training frames for different numbers of actor-learners and training methods on five Atari games, and Figure 4, which plots the average score against wall-clock time.
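
The speed-ups in Table 2 can be read as the time (or number of training steps) needed to reach a fixed reference score with one worker, divided by the time needed with k workers. The sketch below shows that calculation on made-up timings; the numbers are not from the paper.

```python
# Hours to reach a fixed reference score, keyed by number of actor-learners
# (made-up values, purely to illustrate the speed-up calculation).
hours_to_reference_score = {1: 40.0, 2: 19.5, 4: 9.8, 8: 4.6, 16: 2.1}

baseline = hours_to_reference_score[1]
for k, hours in sorted(hours_to_reference_score.items()):
    speedup = baseline / hours
    note = "  (superlinear)" if speedup > k else ""
    print(f"{k:2d} workers: {speedup:4.1f}x speed-up{note}")
```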

5.6. Robustness and Stability

Finally, we analyzed the stability and robustness of the four proposed asynchronous algorithms. For each of the four algorithms we trained models on five games (Breakout, Beamrider, Pong, Q*bert, Space Invaders) using 50 different learning rates and random initializations. Figure 2 shows scatter plots of the resulting scores for A3C, while Supplementary Figure S11 shows plots for the other three methods. There is usually a range of learning rates for each method and game combination that leads to good scores, indicating that all methods are quite robust to the choice of learning rate and random initialization. The fact that there are virtually no points with scores of 0 in regions with good learning rates indicates that the methods are stable and do not collapse or diverge once they are learning.
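
The robustness study above amounts to a random search over learning rates with fresh random initializations. A minimal sketch follows; the log-uniform sampling range is an assumption, and train() is a hypothetical placeholder for training one agent.

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 learning rates drawn log-uniformly; the range here is an assumption.
learning_rates = 10 ** rng.uniform(-4, -2, size=50)

def train(learning_rate, seed):
    """Hypothetical placeholder: train one agent and return its final score."""
    ...

# results = [(lr, train(lr, seed)) for seed, lr in enumerate(learning_rates)]
# Plotting score against learning rate then reveals the stable range, as in Figure 2.
print(learning_rates.min(), learning_rates.max())
```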

6. Conclusions and Discussion

We have presented asynchronous versions of four standard reinforcement learning algorithms and showed that they are able to train neural network controllers on a variety of domains in a stable manner. Our results show that in our proposed framework stable training of neural networks through reinforcement learning is possible with both value-based and policy-based methods, off-policy as well as on-policy methods, and in discrete as well as continuous domains. When trained on the Atari domain using 16 CPU cores, the proposed asynchronous algorithms train faster than DQN trained on an Nvidia K40 GPU, with A3C surpassing the current state of the art in half the training time.

One of our main findings is that using parallel actor-learners to update a shared model had a stabilizing effect on the learning process of the three value-based methods we considered. While this shows that stable online Q-learning is possible without experience replay, which was used for this purpose in DQN, it does not mean that experience replay is not useful. Incorporating experience replay into the asynchronous reinforcement learning framework could substantially improve the data efficiency of these methods by reusing old data. This could in turn lead to much faster training times in domains like TORCS, where interacting with the environment is more expensive than updating the model for the architecture we used.

Combining other existing reinforcement learning methods or recent advances in deep reinforcement learning with our asynchronous framework presents many possibilities for immediate improvements to the methods we presented. While our n-step methods operate in the forward view (Sutton & Barto, 1998) by using corrected n-step returns directly as targets, it has been more common to use the backward view to implicitly combine different returns through eligibility traces (Watkins, 1989; Sutton & Barto, 1998; Peng & Williams, 1996). The asynchronous advantage actor-critic method could potentially be improved by using other ways of estimating the advantage function, such as the generalized advantage estimation of (Schulman et al., 2015b). All of the value-based methods we investigated could benefit from different ways of reducing overestimation bias of Q-values (Van Hasselt et al., 2015; Bellemare et al., 2016). Yet another, more speculative, direction is to try and combine the recent work on true online temporal difference methods (van Seijen et al., 2015) with nonlinear function approximation.

In addition to these algorithmic improvements, a number of complementary improvements to the neural network architecture are possible. The dueling architecture of (Wang et al., 2015) has been shown to produce more accurate estimates of Q-values by including separate streams for the state value and advantage in the network. The spatial softmax proposed by (Levine et al., 2015) could improve both value-based and policy-based methods by making it easier for the network to represent feature coordinates.
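
As a concrete illustration of the forward-view targets mentioned above, the sketch below computes corrected n-step returns for one rollout segment by bootstrapping from the critic's value of the last state (or 0 at a terminal state). The reward values and discount are illustrative.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Forward-view n-step return targets:
    R_t = r_t + gamma * r_{t+1} + ... + gamma^(k-1) * r_{t+k-1} + gamma^k * V(s_{t+k}).
    rewards: the k rewards of one rollout segment, oldest first.
    bootstrap_value: V(s_{t+k}) from the critic, or 0.0 if s_{t+k} is terminal."""
    returns, R = [], bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

print(n_step_returns([1.0, 0.0, 0.0, 1.0], bootstrap_value=0.5))
```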
