A novel DDPG method with prioritized experience replay

Abstract:

Recently, a state-of-the-art algorithm called deep deterministic policy gradient (DDPG) has achieved good performance in many continuous control tasks in the MuJoCo simulator. To further improve the efficiency of the experience replay mechanism in DDPG and thus speed up the training process, in this paper a prioritized experience replay method is proposed for the DDPG algorithm, where prioritized sampling is adopted instead of uniform sampling. The proposed DDPG with prioritized experience replay is tested on an inverted pendulum task via OpenAI Gym. The experimental results show that DDPG with prioritized experience replay can reduce the training time and improve the stability of the training process, and is less sensitive to changes in hyperparameters such as the replay buffer size, the minibatch size and the update rate of the target network.
Published in: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC)
Date of Conference: 05-08 October 2017
Date Added to IEEE Xplore: 30 November 2017
DOI: 10.1109/SMC.2017.8122622
Publisher: IEEE
Conference Location: Banff, AB, Canada

SECTION I. Introduction

In conventional continuous control domains, where the action space is large and actions are real-valued, it is very difficult for agents to learn directly from pixel-level information to accomplish complex manipulation tasks [1], [2]. Recently, a novel deep reinforcement learning algorithm called deep deterministic policy gradient (DDPG) [1] has achieved good performance in many simulated continuous control problems by integrating the deterministic policy gradient algorithm [3] with techniques introduced for making computers play Atari games [4]–[6]. It uses a replay buffer to remove the correlations in the input experiences (i.e., the experience replay mechanism, where an experience is a four-element tuple $(s_t, a_t, r_t, s_{t+1})$) [7], [8], and exploits the target network approach to stabilize the training process. As an essential part of DDPG, experience replay significantly affects the performance and speed of the learning process through the experiences selected to train the neural networks.

In the experience replay mechanism, a finite-size memory called a replay buffer stores previous experiences, and at each time step a fixed number of experiences (i.e., a minibatch) is randomly selected to update the neural networks. Mixing recent experiences with old ones to update the network greatly weakens the temporal correlations in the replayed experiences. However, the experience replay mechanism is based on the intuition that all experiences stored in the replay buffer are equally important, and it therefore samples a minibatch uniformly to perform the network update. This intuition goes against the common phenomenon that when people learn to do something, experiences with big rewards, very successful attempts or painful lessons stay consistently fresh in their memory during the learning process, and these experiences are therefore more valuable than others. During the learning process, instead of replaying all experiences uniformly, we tend to frequently replay experiences associated with more successful attempts; similar ideas can be found in [9], [10]. Experiences with terrible tumbles are also frequently replayed to help us better realize the consequences of the wrong actions, correct them aggressively in the corresponding conditions, and avoid making the same mistakes again. Hence, in this paper we consider the values of different experiences and propose to apply prioritized sampling, or the prioritized experience replay mechanism [11], to improve the DDPG algorithm, while the original experience replay mechanism adopted in the DDPG algorithm is referred to as uniform sampling or uniform experience replay.
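To make the uniform replay mechanism described above concrete, the following minimal Python sketch stores experiences as four-element tuples and samples a minibatch uniformly. The class name, buffer capacity and minibatch size are illustrative assumptions of this sketch, not the authors' implementation.

```python
import random
from collections import deque

class UniformReplayBuffer:
    """Replay buffer with uniform sampling; each experience is the
    four-element tuple (s_t, a_t, r_t, s_{t+1}) mentioned in the text."""

    def __init__(self, capacity=10_000):
        # A bounded deque discards the oldest experience once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, minibatch_size=32):
        # Uniform sampling: every stored experience is treated as equally important.
        return random.sample(list(self.buffer), minibatch_size)
```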

The prioritized experience replay mechanism works as follows. First, we evaluate the learning value of experiences by their absolute temporal-difference errors (TD-errors), which are already computed in many classical reinforcement learning algorithms such as SARSA and Q-learning [12]. Then, by ranking the experiences in the replay buffer by the absolute value of their TD-errors, we replay those with a high magnitude of TD-error more frequently. This practice inevitably changes the state visitation frequency and thus introduces bias, which is corrected by importance-sampling weights [13]. The proposed algorithm is most similar to [14], whose application scenarios are more restrictive and limited to discrete action spaces. In this paper, we extend the application from discrete action domains to continuous action domains, and the model is implemented in TensorFlow [15], [16]. DDPG with prioritized sampling is tested with the inverted pendulum task on the OpenAI Gym platform [17]. Compared with the original DDPG method, the experimental results show that DDPG with prioritized experience replay not only achieves similar performance within a shorter training time, but also demonstrates a more stable training process and is less sensitive to changes in some hyperparameters.

SECTION II. DDPG with Prioritized Experience Replay

In this section, deep reinforcement learning, DDPG and the idea of prioritized experience replay are first introduced, and then an integrated algorithm of DDPG with prioritized experience replay is presented.

A. Deep Reinforcement Learning and DDPG
In standard reinforcement learning scenarios, an agent interacts with the environment to maximize the long-term reward. Typically, this interaction process is formulated as a Markov decision process (MDP), described by a four-element tuple $(S, A, R, P)$, where $S$ is the state space, $A$ is the action space, $R: S \times A \to \mathbb{R}$ is the reward function and $P: S \times A \times S \to [0, 1]$ is the transition probability. In this setting, the goal of the agent is to learn a policy $\pi: S \to A$ that maximizes the long-term reward $R_0 = \sum_{i=0}^{T} r(s_i, a_i)$, where $T$ is the termination time step and $r(s_i, a_i)$ is the reward of executing action $a_i$ in state $s_i$. Usually, this long-term reward is discounted by a factor $\gamma$ as $R_0^{\gamma} = \sum_{i=0}^{T} \gamma^i r(s_i, a_i)$, where $\gamma \in (0, 1)$. The action-value function is used to represent the expected long-term reward of executing action $a_t$ in state $s_t$:

$$Q(s_t, a_t) = \mathbb{E}\big[R_t^{\gamma} \mid s = s_t, a = a_t\big] = \mathbb{E}\Big[\sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)\Big]. \tag{1}$$

The Bellman equation is often used to find the optimal action-value function:

$$Q^*(s_t, a_t) = \mathbb{E}\big[r(s_t, a_t) + \gamma \max_{a'_{t+1}} Q^*(s_{t+1}, a'_{t+1})\big]. \tag{2}$$

However, the above approach is limited to cases where both the state space and the action space are discrete. In order to apply it to problems with a continuous state-action space, two deep neural networks are designed, i.e., the action-value network $Q(s_t, a_t, w)$ and the actor network $\mu(s_t, v)$, to approximate the action-value function $Q(s_t, a_t)$ and the actor function $\mu(s_t)$, where $w$ and $v$ are the network parameters and $\mu(s_t)$ maps the state $s_t$ to a deterministic action $a_t$. Approaches combining conventional reinforcement learning algorithms (e.g., Q-learning) with deep neural networks are called deep reinforcement learning algorithms (e.g., Deep Q-Network [5]).

The training of the action-value network is based on minimizing the following loss function $L(w)$:

$$L(w) = \big(r(s_t, a_t) + \gamma Q'(s_{t+1}, a_{t+1}, w) - Q(s_t, a_t, w)\big)^2. \tag{3}$$

The parameters of the actor network are updated using the policy gradient algorithm [3]:

$$\nabla_v Q(s, a, w)\big|_{s=s_t,\, a=\mu(s_t, v)} = \nabla_a Q(s, a, w)\big|_{s=s_t,\, a=\mu(s_t, v)}\, \nabla_v \mu(s, v)\big|_{s=s_t}. \tag{4}$$
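As a concrete illustration of Eqs. (3) and (4), the sketch below performs one critic update and one actor update on a sampled minibatch. The paper's implementation is based on TensorFlow; this sketch uses PyTorch purely for brevity, the network sizes (400/300 hidden units) follow the experimental settings of Section III-A, and all function names are hypothetical.

```python
import torch
import torch.nn as nn

def make_critic(state_dim, action_dim):
    # Action-value network Q(s, a, w); 400/300 hidden units as in Sec. III-A.
    return nn.Sequential(nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
                         nn.Linear(400, 300), nn.ReLU(),
                         nn.Linear(300, 1))

def make_actor(state_dim, action_dim):
    # Actor network mu(s, v); tanh keeps the action bounded.
    return nn.Sequential(nn.Linear(state_dim, 400), nn.ReLU(),
                         nn.Linear(400, 300), nn.ReLU(),
                         nn.Linear(300, action_dim), nn.Tanh())

def ddpg_update(critic, actor, target_critic, target_actor,
                critic_opt, actor_opt, batch, gamma=0.99):
    # Minibatch tensors: s (N, state_dim), a (N, action_dim), r (N, 1), s_next (N, state_dim).
    s, a, r, s_next = batch

    # Critic update, Eq. (3): minimize (r + gamma * Q'(s_{t+1}, mu'(s_{t+1})) - Q(s_t, a_t))^2,
    # where the primed networks are the slowly updated target networks.
    with torch.no_grad():
        y = r + gamma * target_critic(torch.cat([s_next, target_actor(s_next)], dim=1))
    critic_loss = ((y - critic(torch.cat([s, a], dim=1))) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update, Eq. (4): follow grad_a Q(s, a) * grad_v mu(s, v), realized by
    # ascending Q(s, mu(s, v)) with respect to the actor parameters v.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```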

Previously, training such deep neural networks was difficult or even impossible, since there is no theoretical guarantee and the networks are prone to diverge during training. However, recent advances in deep learning (i.e., the experience replay mechanism, the target network approach and batch normalization) have helped overcome this difficulty. The experience replay mechanism breaks the correlations between the input training experiences, and the target network approach gives the training process a consistent target. In addition, batch normalization [18] is adopted to constrain the change of the network parameters. These approaches stabilize the training process and make the training of large non-linear neural networks possible.

B. Prioritized Experience Replay
The core idea of prioritized experience replay is to replay more frequently the experiences associated with very successful attempts or extremely poor performance. Therefore, the criterion for defining the value of an experience is the central issue. In most reinforcement learning algorithms, the TD-error is used to update the estimate of the action-value function Q(s, a). The TD-error acts as the correction for the estimate and may implicitly reflect to what extent an agent can learn from an experience. The bigger the magnitude of the absolute TD-error, the more aggressive the correction for the expected action-value. In this case, experiences with high TD-errors are more likely to be of high value and associated with very successful attempts. Besides, experiences with large negative TD-errors correspond to conditions where the agent behaves miserably and whose states are badly learned by the agent. Replaying these experiences more frequently will help the agent gradually realize the true consequences of the wrong behavior in the corresponding states and avoid repeating it, which can improve the overall performance.
Therefore, these badly learned experiences are also considered to be of high value. In this paper, we select the absolute TD-error $|\delta|$ of an experience as the index to evaluate its value. The TD-error $\delta_j$ of experience $j$ is computed as

$$\delta_j = r(s_t, a_t) + \gamma Q'(s_{t+1}, a_{t+1}, w) - Q(s_t, a_t, w), \tag{5}$$

where $Q'(s_{t+1}, a_{t+1}, w)$ is the target action-value network parameterized by $w$. In many classical reinforcement learning algorithms such as SARSA and Q-learning, this value has already been computed, which facilitates the evaluation of the experiences. We define the probability of sampling experience $j$ as

$$P(j) = \frac{D_j^{\alpha}}{\sum_k D_k^{\alpha}}, \tag{6}$$

where $D_j = \frac{1}{\mathrm{rank}(j)} > 0$ and $\mathrm{rank}(j)$ is the rank of experience $j$ in the replay buffer, with the absolute TD-error as the criterion. The parameter $\alpha$ controls to what extent the prioritization is used.
This definition of the sampling probability can be seen as adding a stochastic factor to the selection of experiences, since even those with low TD-errors still have a probability of being replayed, which guarantees the diversity of the sampled experiences. Such diversity helps prevent the neural network from over-fitting.

However, since we tend to replay experiences with high TD-errors more frequently, this will no doubt change the state visitation frequency. This change may cause the training process of the neural network to oscillate or even diverge. In order to handle this issue, importance-sampling weights are used in the computation of the weight changes:

$$W_j = \frac{1}{S^{\beta} \cdot P(j)^{\beta}}, \tag{7}$$

where $S$ is the size of the replay buffer, $P(j)$ is the probability of sampling experience $j$, and the parameter $\beta$ controls to what extent the correction is used.
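The rank-based priority of Eq. (6) and the importance-sampling weight of Eq. (7) can be sketched as follows. The values of α and β here are placeholders rather than the paper's settings, and normalizing the weights by their maximum is a common stabilization choice, not something stated in the text.

```python
import numpy as np

def prioritized_sample(abs_td_errors, minibatch_size=32, alpha=0.7, beta=0.5, rng=None):
    """Rank-based prioritized sampling with importance-sampling correction.

    abs_td_errors: array of |delta_j| for every experience in the replay buffer.
    alpha and beta are the exponents of Eqs. (6) and (7); the defaults are placeholders.
    """
    if rng is None:
        rng = np.random.default_rng()
    S = len(abs_td_errors)

    # rank(j) = 1 for the largest |TD-error|, S for the smallest.
    ranks = np.empty(S, dtype=np.int64)
    ranks[np.argsort(-np.asarray(abs_td_errors))] = np.arange(1, S + 1)

    # Eq. (6): P(j) = D_j^alpha / sum_k D_k^alpha, with D_j = 1 / rank(j).
    priorities = (1.0 / ranks) ** alpha
    probs = priorities / priorities.sum()

    indices = rng.choice(S, size=minibatch_size, p=probs)

    # Eq. (7): W_j = 1 / (S^beta * P(j)^beta). Dividing by the maximum weight so the
    # correction only scales gradients down is a common choice, not prescribed by the paper.
    weights = (S * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights
```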

Algorithm 1: DDPG with Prioritized Experience Replay

Based on the prioritized experience replay method introduced above, the integrated algorithm of DDPG with prioritized experience replay is shown in Algorithm 1.

SECTION III. Simulated Experiments
In this section, the experimental settings are first introduced, and then the proposed algorithm is tested on the classical continuous control problem of the inverted pendulum task. Several groups of experimental results are given in comparison with the original DDPG method.

A. Experimental Settings
We use a low-dimensional neural network similar to the one introduced in [1], and a momentum-based algorithm called Adam [19] is adopted to train the neural networks, with a learning rate of $10^{-4}$ for the action-value network and $10^{-3}$ for the actor network. To make sure that the weights of the neural networks do not become too large, an L2 regularization term is included in the loss function used to update the action-value network. The discount factor is set to 0.99 and the update rate of the target network is set to 0.01. Two hidden layers are included in the low-dimensional networks (i.e., the action-value network and the actor network), with 400 hidden units for the first layer and 300 units for the second layer, respectively.

The weights of the two neural networks are initialized from a uniform distribution scaled with respect to the input size of each layer, while the weights of the final layers are uniformly sampled between $-3\times 10^{-3}$ and $3\times 10^{-3}$. We use the Ornstein-Uhlenbeck process [20] to produce the noise added to the exploration policy, which helps the agent explore the environment thoroughly, where $\theta$ and $\sigma$ are set to 0.15 and 0.2, respectively. The minibatch size is set to 32 and the size of the replay buffer is set to $10^4$. The storage structure of the replay buffer is a circular queue, and the priority queue is implemented as a binary heap to minimize the additional time cost of prioritized sampling. In the process of interacting with the environment, the agent receives low-dimensional vectors as the observation (i.e., state); these vectors contain the velocity, joint angles and coordinate information. MuJoCo [21], [22] is used as the simulator and OpenAI Gym is adopted to load the MuJoCo model. Learning speed, learning stability and sensitivity to the values of some hyperparameters are used as three criteria to evaluate the performance of the algorithms. In this paper, we compare the performance of the original DDPG method, DDPG with prioritized experience replay and the best baseline of OpenAI Gym on the inverted pendulum task.
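The Ornstein-Uhlenbeck exploration noise mentioned above can be sketched as follows; θ = 0.15 and σ = 0.2 follow the text, while the time step dt and the mean μ = 0 are assumptions of this sketch.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise added to the actor's action."""

    def __init__(self, action_dim, theta=0.15, sigma=0.2, dt=1e-2, mu=0.0):
        self.theta, self.sigma, self.dt, self.mu = theta, sigma, dt, mu
        self.x = np.full(action_dim, mu)

    def reset(self):
        # Reset the process at the start of each episode.
        self.x[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.x += self.theta * (self.mu - self.x) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        return self.x
```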

B. Inverted Pendulum Task
The inverted pendulum task is a classical continuous control problem. As shown in Fig. 1, a cart, to which a pendulum is attached through a pivot point, can move in the horizontal direction along a finite-length bar. A horizontal force is applied to keep the pendulum balanced for as long as possible. The action is defined as the horizontal force applied to the cart, and the observation (i.e., state) of the agent is a four-dimensional vector that represents the state of the pendulum. In the experiment, the agent is given a positive reward of +1 at each time step if the angle $\theta$ between the pendulum and the vertical line is kept within a desired range. A training episode terminates automatically if $\theta$ exceeds the target range or the cart moves off the bar. We set the maximum number of training steps in an episode to 1,000, and the inverted pendulum task is deemed solved when the agent obtains an average reward of 950.0 over 100 consecutive episodes. Theoretically, the maximum reward the agent can receive is 1,000.
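The solved criterion described above can be checked with a short helper like the one below; the function name and structure are illustrative only.

```python
from collections import deque

# Rewards of the most recent 100 episodes; the task counts as solved once their
# average reaches 950.0 (each episode yields at most 1,000 reward).
recent_rewards = deque(maxlen=100)

def record_episode(episode_reward):
    recent_rewards.append(episode_reward)
    return len(recent_rewards) == 100 and sum(recent_rewards) / 100.0 >= 950.0
```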

Fig. 1: Inverted pendulum task. (a) The inverted pendulum task in OpenAI Gym; (b) the mathematical model of the inverted pendulum.

As shown in Fig. 2, DDPG with prioritized sampling clearly outperforms DDPG with uniform experience replay in terms of the training steps needed to accomplish the inverted pendulum task. The prioritized experience replay method takes 700 episodes to accomplish the task, while the uniform sampling method needs around 1,300 episodes to achieve the same performance. DDPG with prioritized sampling also outperforms the best method in OpenAI Gym, which needs 850 episodes to converge. In addition, fewer spikes appear in the reward curve of prioritized experience replay compared with the other two algorithms, which demonstrates a more stable training process for DDPG with the prioritized experience replay mechanism.

Fig. 2: Learning performance for the inverted pendulum task regarding the accumulated rewards.

We then further study the distribution of the selected samples with different magnitudes of TD-error under prioritized experience replay. The selected experiences are divided into three parts: high, medium and low magnitude of TD-error. Since the minibatch size is 32 and the size of the replay buffer is set to $10^4$ in the experiment, we divide the replay buffer into 32 segments and regard the first 8 segments (ranks 1–98) as high TD-error, the last 8 segments (ranks 3,544–10,000) as low TD-error, and the rest (ranks 99–3,543) as medium TD-error (the boundaries of these segments are determined by stratified sampling), as sketched below.
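One way to realize this stratified determination of the segment boundaries is to split the rank-based distribution of Eq. (6) into segments of equal total probability, as in the sketch below. The exponent α is an assumption of this sketch, so the resulting boundaries will only approximately match the rank ranges quoted in the text.

```python
import numpy as np

def segment_boundaries(buffer_size=10_000, num_segments=32, alpha=0.7):
    """Split ranks 1..buffer_size into segments of roughly equal total
    probability under the rank-based distribution of Eq. (6)."""
    priorities = (1.0 / np.arange(1, buffer_size + 1)) ** alpha
    cdf = np.cumsum(priorities) / priorities.sum()

    # The k-th boundary is the smallest rank whose cumulative probability
    # reaches k / num_segments.
    targets = np.arange(1, num_segments + 1) / num_segments
    boundaries = np.searchsorted(cdf, targets) + 1
    return np.minimum(boundaries, buffer_size)  # boundaries[k-1]: last rank in segment k
```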
In the experiment, we find that prioritized experience replay tends to choose experiences with medium and high magnitudes of TD-error, as they are of high value to the learning process of the agent. Prioritized sampling also tends to choose more recent experiences, since they are not yet as well estimated by the action-value function as older ones. Besides, experiences with low TD-errors are also included among the selected experiences, which implicitly shows the diversity of the sampled experiences. To sum up, prioritized experience replay not only focuses on experiences with higher magnitudes of TD-error to help accelerate the training process, but also includes those with low TD-errors to increase the sample diversity of the training experiences.

In addition, we change the values of some hyperparameters (i.e., the size of the replay buffer S, the minibatch size K and the update rate of the target network λ; see Algorithm 1) to test whether DDPG with prioritized sampling and DDPG with uniform sampling are robust enough to be insensitive to these changes. Three groups of experiments are carried out as follows.

  1. Changing the Size of the Replay Buffer
    The size of the buffer is set to $10^4$, $10^5$ and $10^6$, respectively. From Fig. 3 and Fig. 4, we find that DDPG with prioritized experience replay exhibits more robust performance than the uniform one: its maximum average reward remains 1,000 and its total training time remains 700 episodes although the size of the replay buffer changes.
    In contrast, the training time of the agent with uniform sampling increases from 1,300 episodes to 1,400 episodes when the size of the replay buffer grows from $10^4$ to $10^6$. One possible explanation is that experiences are randomly selected for replay in the uniform sampling method, and increasing the size of the buffer is likely to cause more useless experiences to be selected for the network update.
    The agent has already performed fairly well in these conditions, so these experiences are less beneficial to its learning process than the more valuable ones. As a result, the training time of the agent with uniform sampling increases.

Fig. 3: Performance of DDPG with prioritized experience replay under different sizes of replay buffer.

  2. Changing the Value of the Minibatch
    We set the size of the minibatch to 16, 32 and 64, respectively. From Fig. 5, we find that the three average reward curves of prioritized sampling almost overlap, which means the agent using prioritized experience replay still achieves stable performance.
    Nevertheless, the total training time of the agent with uniform sampling increases, and its learning process suffers from more aggressive changes in the average reward curves (see Fig. 6).

  3. Changing the Updating Rate of the Target Network
    The updating rate of the target network is set to 0.01, 0.05 and 0.1, respectively. According to the experimental results, the training time of the agent is directly related to this hyperparameter. From Fig. 7 and Fig. 8, we find that when the updating rate of the target network changes from 0.01 to 0.1, the training times of the two algorithms decline to 600 episodes and 1,000 episodes, respectively. From an empirical perspective, this improvement is due to the relative easiness of the training task.
    Specifically, a slight increase in the updating rate of the target network speeds up the convergence of the network and thus reduces the training time.
    However, DDPG with prioritized sampling uses less training time and exhibits more stable performance in the training process than the uniform one.

According to the above three groups of experiments, it can be concluded that DDPG with prioritized sampling is less affected by changes in the hyperparameters and is more robust than DDPG with uniform sampling.

Fig. 4: Performance of DDPG with uniform experience replay under different sizes of replay buffer.

Fig. 5: Performance of DDPG with prioritized experience replay under different sizes of minibatch.

Fig. 6: Performance of DDPG with uniform experience replay under different sizes of minibatch.

Fig. 7: Performance of DDPG with prioritized experience replay under different updating rates of the target network.

Fig. 8: Performance of DDPG with uniform experience replay under different updating rates of the target network.

SECTION IV. Conclusion
In this paper, prioritized experience replay is proposed to replace the uniform experience replay in the DDPG algorithm. This idea is based on the intuition that some experiences have higher learning value than others, and that replaying these valuable experiences more frequently will improve the learning performance. The proposed algorithm with prioritized sampling greatly shortens the total training time and exhibits a more stable learning process. In addition, prioritized experience replay is less affected by changes in the hyperparameters and is more robust than the original DDPG algorithm.

However, more problems need to be investigated in our future work. For example, a better criterion for evaluating the value of experiences should be studied further [14]. The proposed algorithm only concentrates on selecting which experience to replay and inevitably ignores the process of selecting which experience to discard. In addition, more experience selection and action selection methods for deep reinforcement learning may be inspired by quantum probability to design new exploration strategies [23]–[26]. We will focus on these problems in our future work and apply the proposed algorithm to more difficult tasks such as Hopper and Walker [17].

ACKNOWLEDGMENT

This work was supported by the National Key Research and Development Program of China (No. 2016YFD0702100) and by the National Natural Science Foundation of China (No. 61432008).

REFERENCES
[1] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning", Computer Science, vol. 8, no. 6, pp. A187, 2016.
[2] N. Heess, G. Wayne, D. Silver et al., "Learning continuous control policies by stochastic value gradients", Proceedings of the International Conference on Neural Information Processing Systems, pp. 2944-2952, 2015.
[3] D. Silver, G. Lever, N. Heess et al., "Deterministic policy gradient algorithms", Proceedings of the International Conference on Machine Learning, pp. 387-395, 2014.
[4] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Playing Atari with deep reinforcement learning", December 2013.
[5] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning", Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[6] X. Guo, S. Singh, H. Lee et al., "Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning", Proceedings of the International Conference on Neural Information Processing Systems, pp. 3338-3346, 2014.
[7] L. J. Lin, "Reinforcement learning for robots using neural networks", Carnegie Mellon University, 1993.
[8] S. Adam, L. Busoniu and R. Babuska, "Experience replay for real-time reinforcement learning control", IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 42, no. 2, pp. 201-212, 2012.
[9] A. C. Singer and L. M. Frank, "Rewarded outcomes enhance reactivation of experience in the hippocampus", Neuron, vol. 64, pp. 910-921, 2009.
[10] C. G. McNamara, A. Tejero-Cantero, S. Trouche et al., "Dopaminergic neurons promote hippocampal reactivation and spatial memory persistence", Nature Neuroscience, vol. 17, pp. 1658-1660, 2014.
[11] A. W. Moore and C. G. Atkeson, "Prioritized sweeping: Reinforcement learning with less data and less time", Machine Learning, vol. 13, no. 1, pp. 103-130, 1993.
[12] R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press, 1998.
[13] A. R. Mahmood, H. V. Hasselt and R. S. Sutton, "Weighted importance sampling for off-policy learning with linear function approximation", Proceedings of the International Conference on Neural Information Processing Systems, pp. 3014-3022, 2014.
[14] T. Schaul, J. Quan, I. Antonoglou et al., "Prioritized experience replay", Proceedings of the International Conference on Learning Representations, 2016.
[15] M. Abadi, A. Agarwal, P. Barham et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems", 2016.
[16] M. Abadi, P. Barham, J. Chen et al., "TensorFlow: A system for large-scale machine learning", Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[17] G. Brockman, V. Cheung, L. Pettersson et al., "OpenAI Gym", CoRR, 2016.
[18] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", 2015.
[19] D. Kingma and J. Ba, "Adam: A method for stochastic optimization", 2017.
[20] G. E. Uhlenbeck and L. S. Ornstein, "On the theory of the Brownian motion", Physical Review, vol. 36, pp. 823-841, 1930.
[21] E. Todorov, T. Erez and Y. Tassa, "MuJoCo: A physics engine for model-based control", Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012.
[22] E. Todorov, T. Erez and Y. Tassa, "Simulation tools for model-based robotics: Comparison of Bullet, Havok, MuJoCo, ODE and PhysX", Proceedings of the IEEE International Conference on Robotics and Automation, pp. 4397-4404, 2015.
[23] D. Dong, C. Chen and Z. Chen, "Quantum reinforcement learning", Lecture Notes in Computer Science, no. 3611, pp. 686-689, 2005.
[24] D. Dong, C. Chen, C. Zhang and Z. Chen, "Quantum robot: Structure, algorithms and applications", Robotica, vol. 24, no. 4, pp. 513-521, 2006.
[25] D. Dong, C. Chen, H. X. Li and T. J. Tarn, "Quantum reinforcement learning", IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 38, no. 5, pp. 1207-1220, 2008.
[26] C. Chen, D. Dong, H. X. Li, J. Chu and T. J. Tarn, "Fidelity-based probabilistic Q-learning for control of quantum systems", IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 920-933, 2014.
