QUANT[14] Reinforcement Learning (RL) Paper 1: Human-Level Control Through Deep Reinforcement Learning

Notes on the paper "Human-level control through deep reinforcement learning"

Contents

1. Abstract

2. Model

3. Experimental setup

4. Results

5. Conclusion

6. Main text


1. Abstract

The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behavior, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture, and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

2. Model

Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (note: the snaking blue line symbolizes the sliding of each filter across the input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).

Model architecture. There are several possible ways of parameterizing Q using a neural network. Because Q maps history-action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network. The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1, followed by a rectifier. The final hidden layer is fully connected and consists of 512 rectifier units. The output layer is a fully connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.
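To make the layer sizes concrete, here is a minimal sketch of the described architecture in PyTorch. It is an illustration of the text above, not the authors' original implementation; the 7 × 7 × 64 flattened size follows from the stated filter sizes and strides applied to an 84 × 84 input.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Network described in the text: 84x84x4 input, three conv layers,
    one fully connected hidden layer, and a linear output per valid action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x84x84 -> 32x20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per valid action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of stacked preprocessed frames, shape (N, 4, 84, 84)
        return self.head(self.features(x))

q_net = DQNNetwork(n_actions=18)
q_values = q_net(torch.zeros(1, 4, 84, 84))  # shape (1, 18): all actions in one forward pass
```

The single forward pass returning one Q-value per action is exactly the advantage the paragraph above describes over feeding (state, action) pairs one at a time.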

3. Experimental setup

Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artifacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artifact caused by the limited number of sprites the Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 × 84. The function φ from Algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
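A minimal Python sketch of this preprocessing is given below. The paper specifies the pixel-wise maximum over frame pairs, the luminance extraction and the 84 × 84 rescale, but not the exact colour weights or resampling filter; the Rec. 601 weights and OpenCV's INTER_AREA interpolation used here are illustrative assumptions.

```python
import numpy as np
import cv2  # used here only for rescaling; any resampling routine would do

def preprocess_frame(frame, prev_frame):
    """Map a raw 210x160x3 Atari RGB frame to a single 84x84 luminance image."""
    # 1. Pixel-wise maximum over the current and previous frame (removes sprite flicker).
    merged = np.maximum(frame, prev_frame)
    # 2. Luminance (Y channel); Rec. 601 weights are one common choice.
    luminance = merged @ np.array([0.299, 0.587, 0.114])
    # 3. Rescale to 84x84.
    return cv2.resize(luminance.astype(np.float32), (84, 84),
                      interpolation=cv2.INTER_AREA)

def stack_frames(frames):
    """Stack the m = 4 most recent preprocessed frames into the 84x84x4 input."""
    return np.stack(frames[-4:], axis=-1)
```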

Code availability. The source code can be accessed at https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.

Training details. We performed experiments on 49 Atari 2600 games where results were available for all other comparable methods. A different network was trained on each game: the same network architecture, learning algorithm and hyperparameter settings (see Extended Data Table 1) were used across all games, showing that our approach is robust enough to work on a variety of games while incorporating only minimal prior knowledge (see below). While we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training only. As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude. For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training. In these experiments, we used the RMSProp algorithm (see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) with minibatches of size 32. The behavior policy during training was ε-greedy, with ε annealed linearly from 1.0 to 0.1 over the first million frames and fixed at 0.1 thereafter. We trained for a total of 50 million frames (that is, around 38 days of game experience in total) and used a replay memory of the 1 million most recent frames. Following previous approaches to playing Atari 2600 games, we also use a simple frame-skipping technique[15]. More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games. The values of all the hyperparameters and optimization parameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders, and Beam Rider. We did not perform a systematic grid search owing to the high computational cost. These parameters were then held fixed across all other games. The values and descriptions of all hyperparameters are provided in Extended Data Table 1.
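A small sketch of the two game-agnostic transformations described above, reward clipping and the linear ε schedule, follows; the function names are illustrative, and the clipping follows the wording of the text (all positive rewards become +1, all negative rewards −1, zero stays 0).

```python
def clip_reward(reward: float) -> float:
    """All positive rewards -> +1, all negative rewards -> -1, zero unchanged."""
    return float((reward > 0) - (reward < 0))

def epsilon(frame_idx: int,
            eps_start: float = 1.0,
            eps_end: float = 0.1,
            anneal_frames: int = 1_000_000) -> float:
    """Exploration rate: linear anneal from 1.0 to 0.1 over the first million
    frames, then fixed at 0.1."""
    fraction = min(frame_idx / anneal_frames, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```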

Our experimental setup amounts to using the following minimal prior knowledge: that the input data consisted of visual images (motivating our use of a convolutional deep network), the game-specific score (with no modification), the number of actions, although not their correspondences (for example, specification of the 'up' button), and the life count.

Evaluation procedure. The trained agents were evaluated by playing each game 30 times for up to 5 min each time with different initial random conditions (noop; see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure is adopted to minimize the possibility of overfitting during evaluation. The random agent served as a baseline comparison and chose a random action at 10 Hz, which is every sixth frame, repeating its last action on intervening frames. 10 Hz is about the fastest that a human player can select the fire button, and setting the random agent to this frequency avoids spurious baseline scores in a handful of the games. We did also assess the performance of a random agent that selected an action at 60 Hz (that is, every frame). This had a minimal effect: changing the normalized DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy Climber, Demon Attack, Krull and Robotank), and in all these games DQN outperformed the expert human by a considerable margin. The professional human tester used the same emulator engine as the agents and played under controlled conditions. The human tester was not allowed to pause, save or reload games. As in the original Atari 2600 environment, the emulator was run at 60 Hz and the audio output was disabled: as such, the sensory input was equated between human players and agents. The human performance is the average reward achieved from around 20 episodes of each game lasting a maximum of 5 min each, following around 2 h of practice playing each game.

 

4. Results

In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have. In contrast to previous work [24,26], our approach incorporates end-to-end reinforcement learning that uses the reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation.

This principle draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within primate visual cortex. Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm involving the storage and representation of recently experienced transitions.

Figure 3 | Comparison of the DQN agent with the best reinforcement learning methods[15] in the literature. The performance of DQN is normalized with respect to a professional human games tester (that is, 100% level) and random play (that is, 0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as 100 × (DQN score − random play score)/(human score − random play score). It can be seen that DQN outperforms competing methods (see also Extended Data Table 2) in almost all the games, and performs at a level that is broadly comparable with or superior to a professional human games tester (that is, operationalized as a level of 75% or above) in the majority of games; audio output was disabled for both human players and agents. Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions.
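As a small worked example of the normalization formula in the caption (the scores used below are purely illustrative, not taken from the paper):

```python
def normalized_score(agent_score: float, random_score: float, human_score: float) -> float:
    """Human-normalized performance as a percentage:
    100 * (agent - random) / (human - random)."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Illustrative numbers: an agent scoring 400 where random play scores 20 and the
# human tester scores 500 is at 100 * (400 - 20) / (500 - 20) ~= 79.2%, which is
# above the 75% "human-level" cut-off used in the figure.
```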

Figure 4 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders. The plot was generated by letting the DQN agent play for 2 h of real game time and running the t-SNE algorithm[25] on the last hidden layer representations assigned by DQN to each experienced game state. The points are coloured according to the state values (V, the maximum expected reward of a state) predicted by DQN for the corresponding game states (ranging from dark red (highest V) to dark blue (lowest V)). The screenshots corresponding to a selected number of points are shown. The DQN agent predicts high state values for both full (top right screenshots) and nearly complete screens (bottom left screenshots) because it has learned that completing a screen leads to a new screen full of enemy ships. Partially completed screens (bottom screenshots) are assigned lower state values because less immediate reward is available. The screens shown on the bottom right, top left and middle are less perceptually similar than the other examples but are still mapped to nearby representations and similar values, because the orange bunkers do not carry great significance near the end of a level. With permission from Square Enix Limited.

5. Conclusion

Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are capable of learning to master a diverse array of challenging tasks.

6. Main text

We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence that has eluded previous efforts. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural networks known as deep neural networks. Notably, recent advances in deep neural networks, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields, inspired by Hubel and Wiesel's seminal work on feedforward processing in early visual cortex, thereby exploiting the local spatial correlations present in images and building in robustness to natural transformations such as changes of viewpoint or scale.

We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function.
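The equation that followed here appeared as an image in the original; it is reproduced below from the standard definition of the optimal action-value function, consistent with the description above:

$$ Q^{*}(s,a) \;=\; \max_{\pi}\, \mathbb{E}\!\left[\, r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \cdots \;\middle|\; s_t = s,\; a_t = a,\; \pi \right] $$

that is, the maximum sum of rewards r_t, discounted by γ at each time-step t, achievable by a behaviour policy π = P(a|s), after making an observation s and taking an action a.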

 

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values r + γ max_{a′} Q(s′, a′). We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay[21-23] that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
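A minimal sketch of these two ideas in Python follows. The one-million-transition capacity and batch size of 32 match the training details given earlier; class and function names are illustrative, not the authors' code.

```python
import random
from collections import deque, namedtuple

# One stored transition e_t = (s_t, a_t, r_t, s_{t+1}).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """Fixed-size buffer of recent transitions, sampled uniformly at random."""

    def __init__(self, capacity=1_000_000):
        # Oldest transitions are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

# Second idea: a separate target network whose weights stay frozen between
# periodic copies from the online Q-network (the copy interval is a
# hyperparameter listed in Extended Data Table 1).  With two networks of
# identical architecture, e.g. the PyTorch sketch above:
#     target_net.load_state_dict(q_net.state_dict())
```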

While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration, these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function Q(s, a; θ_i) using the deep convolutional neural network shown in Fig. 1, in which θ_i are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay we store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) at each time-step t in a data set D_t = {e_1, ..., e_t}. During learning, we apply Q-learning updates on samples (or minibatches) of experience (s, a, r, s′) ∼ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the following loss function.
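The loss function referred to here appeared as an equation image in the original; based on the definitions in this paragraph (with θ_i⁻ denoting the parameters of the periodically updated target network introduced above), it can be written as

$$ L_i(\theta_i) \;=\; \mathbb{E}_{(s,a,r,s') \sim U(D)}\!\left[ \left( r + \gamma \max_{a'} Q\!\left(s', a'; \theta_i^{-}\right) - Q\!\left(s, a; \theta_i\right) \right)^{2} \right]. $$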

To evaluate our DQN agent, we took advantage of the Atari 2600 platform, which offers a diverse array of tasks (n = 49) designed to be difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

We compared DQN with the best-performing methods from the reinforcement learning literature on the 49 games where results were available. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y-axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games; see Fig. 3, Supplementary Discussion and Extended Data Table 2). In additional simulations (see Supplementary Discussion and Extended Data Tables 3 and 4), we demonstrate the importance of the individual core components of the DQN agent (the replay memory, separate target Q-network and deep convolutional network architecture) by disabling them and demonstrating the detrimental effects on performance.

Figure 2 | Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with an ε-greedy policy (ε = 0.05) for 520k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to the clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.

We next examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders (see Supplementary Video 1 for a demonstration of the performance of DQN), by using a technique developed for the visualization of high-dimensional data called t-SNE[25] (Fig. 4). As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. Interestingly, we also found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar (Fig. 4, bottom right, top left, and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs. Furthermore, we also show that the representations learned by DQN are able to generalize to data generated from policies other than its own: in simulations we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion). Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values. It is worth noting that the games in which DQN excels are extremely varied in their nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro).
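For readers who want to reproduce a Figure-4-style plot, here is an illustrative sketch using scikit-learn's t-SNE and matplotlib. The activation and value arrays (and their file names) are assumptions about how one might have logged the last hidden layer and the predicted state values during play; they are not artifacts shipped with the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed logs collected during play (e.g. via a forward hook on the network):
# hidden: (n_states, 512) last-hidden-layer activations for visited game states
# values: (n_states,) predicted state values V(s) = max_a Q(s, a)
hidden = np.load("last_hidden_layer.npy")
values = np.load("state_values.npy")

# Embed the 512-dimensional representations into 2-D with t-SNE.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(hidden)

# Colour each embedded state by its predicted value, as in Fig. 4.
plt.scatter(embedding[:, 0], embedding[:, 1], c=values, cmap="coolwarm", s=4)
plt.colorbar(label="predicted state value V")
plt.title("t-SNE of DQN last-hidden-layer representations")
plt.show()
```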

Indeed, in certain games DQN is able to discover a relatively long-term strategy (for example, Breakout: the agent learns the optimal strategy, which is to first dig a tunnel around the side of the wall, allowing the ball to be sent around the back to destroy a large number of blocks; see Supplementary Video 2 for an illustration of the development of DQN's performance over the course of training). Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents, including DQN (for example, Montezuma's Revenge).

In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have. In contrast to previous work[24,26], our approach incorporates end-to-end reinforcement learning that uses the reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation. This principle draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within primate visual cortex[27,28]. Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm[21-23] involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods[21,22] (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia[22]. In the future, it will be important to explore the potential use of biasing the content of experience replay towards salient events, a phenomenon that characterizes empirically observed hippocampal replay[29] and relates to the notion of 'prioritized sweeping'[30] in reinforcement learning. Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are capable of learning to master a diverse array of challenging tasks.

 
