Path Planning via an Improved DQN-Based Learning Policy
Abstract:
Path planning is an important part of navigation, which lies at the core of robotics research. Reinforcement learning is a popular class of algorithms that learns from experience by mimicking the way humans acquire new skills.
When learning a new skill, comprehensive and diverse experience helps refine one's grasp of it; we refer to these two aspects as the depth and the breadth of experience.
For path planning, this paper proposes an improved learning policy based on the different demands for the depth and breadth of experience at different learning stages, in which the deep Q-network that computes the Q-value adopts a dense network framework.
In the initial stage of learning, an experience value evaluation network is created to increase the proportion of deep experience, so that the environmental rules are understood more quickly.
When the path wandering phenomenon occurs, a parallel exploration structure takes both the wandering point and other points into account to improve the breadth of the experience pool.
In addition, the network structure is improved by adopting the dense connection method, which enhances the learning and expressive abilities of the network to some extent.
Finally, the experimental results show that our model improves convergence speed, planning success rate, and path accuracy.
Under the same experimental conditions, the method of this paper is compared with conventional deep Q-network reinforcement learning. The results show that the indicators of our method are significantly higher.
SECTION I.
Introduction
In the field of artificial intelligence, how to find the best path from a start point to a goal in a given grid environment is a well-known and important problem.
For a long time, researchers have devoted considerable effort to path planning and have proposed many algorithms for path search and optimization. Representative heuristic algorithms include the A* algorithm [1], [2], the simulated annealing algorithm [3], [4], and the artificial potential field method [5], [6], as well as swarm intelligence algorithms such as the particle swarm algorithm [7], [8] and the ant colony algorithm [8], [9]. As research has deepened, the planning speed and accuracy of path planning have continued to increase, but these traditional algorithms still suffer from shortcomings such as poor real-time performance and a tendency to fall into local optima.
With the advent of the artificial intelligence era, the environments faced in the path planning field are becoming more and more complex, which requires path planning algorithms to respond quickly to complex environmental changes and to learn flexibly.
The expressive power of traditional algorithms has reached a bottleneck, and it is difficult for them to model dynamic, complex problems, so the path planning algorithm framework needs to change. Combining deep learning with reinforcement learning [10]–[12] is an effective approach to autonomous learning [32]–[34]. The concept of "reinforcement learning" first appeared in 1954. This learning method transforms a sequential decision problem into a Markov model [13], establishes a mapping between the environment state and the state-action value function through the interaction between the agent and the environment, and then obtains the optimal state-action value function and thus the optimal action sequence.
Over time, dynamic programming [14], Q-learning [15], SARSA [16] and other reinforcement learning algorithms have been proposed, but these tabular methods have obvious limitations on the size of the state space and action space. In 2013, Mnih et al. proposed the first successful Deep Q-network (DQN) [17] framework that combines deep learning with reinforcement learning. In DQN, an experience pool is used to break up the sample order, solving the problem that the experience gained from reinforcement learning is correlated in time.
This improves the stability of the combination and achieves impressive performance. Several improvements to DQN were proposed in 2015 [18]–[25], including the dueling network structure [18], more elaborate experience sampling policies [20] and the advantage function. These give the DQN framework noticeable gains in learning speed and value-function estimation accuracy. Q-learning is well suited to discrete sequential decision problems, and in recent years some researchers have applied it to path planning with excellent results.
However, as mentioned above, the tabular reinforcement learning algorithm has limited capacity, so the model must be trained for each map and has no generalization performance.
DQN can solve these problems very well, so the DQN framework has great potential in path planning.
The chosen policy is a very important factor influencing learning outcomes. Many educators advocate learning from typical experiences first and then refining knowledge through diverse experiences.
In artificial intelligence, special samples can indeed speed up learning, while comprehensive samples can reduce over-fitting and improve the accuracy of the model.
The network structure determines the learning efficiency and expressive ability of the network. In 2017, Huang proposed the densely connected network (Dense Network) [30], [31], which uses channel-level short-circuit connections between features, effectively improving feature reuse and mitigating gradient vanishing; a small number of parameters yields surprisingly good results.
Unlike traditional path planning methods that require modeling the various constraints of the problem, our path planning method uses only map image data as input.
In addition, to account for the generalization ability of the model, the input map in each episode is different. Under these conditions, we expect the agent to find the shortest collision-free path.
To accomplish this, a deep reinforcement learning framework is used to address common problems in path planning. In terms of policy and network architecture, the main contributions of this paper are as follows:
An experience value evaluation network is built. At the beginning of training, when the Q network is still essentially untrained, this network helps the Q network gain more deep experience and learn the environmental rules quickly.
A parallel exploration structure is created to make good use of every step. When the path wandering phenomenon occurs, the exploration of the wandering point and of other points are both taken into account, improving the breadth of the experience pool and increasing accuracy.
The dense connection is added to increase the utilization of network features. A convolutional layer is used as the transition layer to reduce the dimension while retaining more high-dimensional information and position sensitivity.
In general, we propose a rapid learning policy that changes the probability distribution of experience in the skill learning process, so that the model can obtain more needed experience at different stages of learning, which improves learning efficiency.
In the path planning application, we only need the map image as input instead of modeling the map. For the image input, we improve the network structure and enhance the expressive ability of the network. The remainder of the paper is structured as follows.
Traditional reinforcement learning and DQN are introduced in Section II. In Section III, we propose an efficient learning policy based on the different demands for the depth and breadth of experience at different learning stages, and, furthermore, build a value evaluation network to control the depth of experience and speed up the learning process.
The simulation process and results are provided in Section IV to demonstrate the effectiveness of our model. Section V concludes the paper with some discussion of future research directions.
SECTION II. Related Work
A. Reinforcement Learning
Compared with "open loop" machine learning methods based on existing static data, reinforcement learning is a "closed loop" method that learns from experience gained through interaction with the environment.
By interacting with the environment, a basic reinforcement learning framework learns how to maximize the return of a sequential decision problem. The problem can be represented by a Markov decision process (MDP) [13], described by the 5-tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is the state set, $A$ is the set of finite actions, $R$ is a finite set of expected rewards $r_t$, $\gamma \in [0,1]$ is a discount factor, and $P$ stands for the state transition probability, written in the simplified form $P_a(s_t, s_{t+1})$ for the probability that performing action $a_t$ in state $s_t$ leads to state $s_{t+1}$ at time $t+1$.
The purpose of reinforcement learning is to find an optimal sequence of actions $\pi^* = \{a_1^*, a_2^*, \cdots, a_t^*, \cdots\}$ in a given environment that maximizes the cumulative reward of the agent. For a given action policy $\pi$, the cumulative reward and state value functions are defined to quantify the value of a state:
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{1}$$
In this formula, $G_t$ is called the cumulative reward, which represents the sum of the discounted rewards from time step $t$ until the end of the action sequence. $\gamma \in [0,1]$ is a discount factor that determines the trade-off between short-term and long-term gains; $\gamma = 1$ means that the agent weights rewards equally regardless of how far they lie in the future.
Since the sequence of actions may differ from the same state, the cumulative reward from a given state is an expected value rather than a fixed value.
However, we can define a state value function to quantify the expected cumulative reward from a state under a given policy, as follows:
$$V_\pi(s_t) = \mathbb{E}_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s = s_t\right\} \tag{2}$$
where $V_\pi(s_t)$ represents the expected cumulative reward of the agent starting from state $s_t$ under policy $\pi$. Furthermore, considering the executed action $a_t$, we can extend the state value function (2) into the state-action value function to describe the cumulative reward, as below:
$$q(s_t, a_t) = \mathbb{E}_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s = s_t, a = a_t\right\} \tag{3}$$
Obviously, a reinforcement learning algorithm needs to obtain the maximum of the state-action value function, $q^*(s_t, a_t) = \max_{a \in A} q(s_t, a)$, in a given state. In other words, the optimal policy then follows directly as $a_t^* = \arg\max_{a \in A} q^*(s_t, a)$.
For the problem of policy evaluation, each $q(s_t, a_t)$ and eventually the optimal policy are generally found by updating a Q table. Note that the state-action value function in (3) requires the entire action sequence to update its value. This implies a heavy computational burden, and such an algorithm is undoubtedly inefficient. According to the Bellman equation, $q(s_t, a_t)$ can be rewritten in the following form:
$$q(s_t, a_t) = \mathbb{E}_\pi\left\{r_{t+1} + \gamma q(s_{t+1}, a_{t+1}) \,\middle|\, s = s_t, a = a_t\right\} \tag{4}$$
The above equation indicates that the state-action value function can be written in terms of the state-action value of the next state and the reward of the specified action. Obviously, according to (4), the Q value of a state can be updated at every step, and the algorithm is more efficient. The value iteration algorithm is based on this idea, directly updating the table to estimate the optimal state-action value function:
$$q^*(s_t, a_t) = \mathbb{E}_\pi\left\{r_{t+1} + \gamma \max_{a_{t+1} \in A} q(s_{t+1}, a_{t+1}) \,\middle|\, s = s_t, a = a_t\right\} \tag{5}$$
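To make the tabular backup in (5) concrete, the following minimal Python sketch performs value iteration on a toy deterministic grid; the `transition` and `reward` callables and the corridor example are illustrative assumptions, not part of the paper.

```python
import numpy as np

def q_value_iteration(num_states, num_actions, transition, reward, gamma=0.9, sweeps=100):
    """Tabular value iteration for q*(s, a) following Eq. (5).

    `transition(s, a)` returns the next state index and `reward(s, a)` returns a scalar;
    both are assumed deterministic here for simplicity."""
    q = np.zeros((num_states, num_actions))
    for _ in range(sweeps):
        q_new = np.empty_like(q)
        for s in range(num_states):
            for a in range(num_actions):
                s_next = transition(s, a)
                # Bellman optimality backup: r + gamma * max_a' q(s', a')
                q_new[s, a] = reward(s, a) + gamma * q[s_next].max()
        q = q_new
    return q

# Toy 1-D corridor: 5 states, actions 0 = left / 1 = right, goal at state 4
# (the goal is not treated as absorbing, which is fine for this illustration).
transition = lambda s, a: max(0, min(4, s + (1 if a == 1 else -1)))
reward = lambda s, a: 1.0 if transition(s, a) == 4 else 0.0
q_star = q_value_iteration(5, 2, transition, reward)
greedy_policy = q_star.argmax(axis=1)  # optimal action per state
```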
B. Deep Q-Network
The traditional reinforcement learning algorithm learns the optimal policy by establishing a Q table and updating the Q table, but the reinforcement learning method based on the Q table has an inevitable capacity limitation.
When the state space and the action space are large or continuous, the algorithm requires a large amount of memory or cannot even express the problem. In 2013, Mnih et al. proposed DQN, the first framework that combines deep learning with reinforcement learning, to address the problems of capacity limitation and sample correlation. In 2015 [17], a double-network structure was proposed to break the correlation between the state-action value function and its update target. The main achievements are as follows: (1) a deep convolutional neural network $q(s, a; \theta)$ is used to represent $q(s, a)$, avoiding the limited capacity of the Q table and the need to train each state-action value separately; (2) an experience replay (ER) structure is proposed, which removes the temporal correlation of samples and improves training stability; (3) a separate target network is set up to handle the temporal difference (TD) target, estimating the state-action value and the TD target with different weights.
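As an illustration of achievement (2), a minimal experience replay pool could be organized as in the sketch below (Python; the tuple layout and capacity handling are assumptions for the sketch rather than the authors' exact implementation).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool; uniform sampling breaks temporal correlation."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling shuffles the time order of the stored experiences.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```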
DQN is an algorithm based on Q-learning. Q-learning updates the value function with the temporal difference formula:
$$q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a_{t+1} \in A} q(s_{t+1}, a_{t+1}) - q(s_t, a_t)\right]$$
where $q(s_t, a_t)$ is the state-action value function at the current moment, $q(s_{t+1}, a_{t+1})$ is the state-action value function at the next moment, and $\alpha$ is the update step size. After introducing the deep convolutional neural network, the update instead adjusts the weights $\theta$ of the Q network:
$$\theta_{t+1} = \theta_t + \alpha\left[r_t + \gamma \max_{a_{t+1} \in A} q(s_{t+1}, a_{t+1}; \theta^{TD}) - q(s_t, a_t; \theta_t)\right] \nabla q(s_t, a_t; \theta_t) \tag{7}$$
where $r_t + \gamma \max_{a_{t+1} \in A} q(s_{t+1}, a_{t+1}; \theta^{TD})$ is the TD target. The network updates its weights to make $q(s_t, a_t; \theta_t)$ fit the TD target, and a separate target network is employed to represent the TD target. Since the $\theta_t$ used to calculate the gradient differs from the $\theta^{TD}$ used to calculate the TD target, the unstable training caused by the correlation between samples is alleviated. The term $r_t + \gamma \max_{a_{t+1} \in A} q(s_{t+1}, a_{t+1}; \theta^{TD}) - q(s_t, a_t; \theta_t)$ in the equation is the TD error; the network is trained by minimizing the corresponding loss, and finally $q(s_t, a_t)$ is estimated. The algorithm uses stochastic gradient descent to train the network: $\theta_t$ is updated at every training iteration, and $\theta_t$ is copied to $\theta^{TD}$ every $C$ iterations. The DQN algorithm is shown in Algorithm 1, where pre_train_step, decline_step, $\mu$, mini_batch and terminal will be defined in Section III.
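In a modern autodiff framework, the update (7) amounts to one gradient step on the squared TD error with a frozen target network. The PyTorch-style sketch below illustrates this under assumed tensor shapes; it is a generic DQN update, not the exact training code of this paper.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient step on the squared TD error behind Eq. (7).

    `batch` is assumed to hold float tensors (states, rewards, next_states, dones)
    and an int64 tensor of actions."""
    states, actions, rewards, next_states, dones = batch

    # q(s_t, a_t; theta_t) for the actions that were actually taken
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target r_t + gamma * max_a' q(s_{t+1}, a'; theta_TD); no gradient flows
    # through the target network, and terminal transitions use r_t only.
    with torch.no_grad():
        q_next_max = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * q_next_max * (1.0 - dones)

    loss = F.mse_loss(q_taken, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```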
Algorithm 1 DQN.
Initialization: Initialize replay memory space $D$ to capacity $N$, initialize the Q network $Q$ with random weights $\theta_0$, and initialize the weights $\theta^{TD}$ of the target network $Q_t$ with $\theta_0$. Initialize $t = 0$.
1: for $t < t_{max}$ do
2:  if $t \neq 1$ then $s_t = s_{t+1}$
3:  else get the initial observation $s_t$
4:  end if
5:  if $t <$ pre_train_step then
6:   select a random action $a_t$
7:  else
8:   if $\mu < \epsilon$ then select a random $a_t$
9:   else select $a_t = \arg\max_{a \in A} q(s_t, a; \theta_t)$
10:  end if
11: end if
12: Store experience $ex_t = (s_t, a_t, r_t, s_{t+1})$
13: if $t <$ decline_step then
14:  $\epsilon$ decreases by a certain percentage
15: end if
16: if $t \geq$ pre_train_step then
17:  Sample mini_batch in $D$ and calculate $y_i$:
18:  $y_i = \begin{cases} r_i, & \text{if terminal} \\ r_i + \gamma \max_{a_{i+1} \in A} q_t(s_{i+1}, a_{i+1}; \theta^{TD}), & \text{otherwise} \end{cases}$
19:  Calculate the loss $(q(s_i, a_i; \theta_t) - y_i)^2$
20:  Train and update the Q network's weights $\theta_{t+1}$
21:  Every $C$ steps copy $\theta_{t+1}$ to $\theta^{TD}$
22: end if
23: end for
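For readers who prefer code to pseudocode, the outer loop of Algorithm 1 might be organized as in the following sketch (Python/PyTorch). The `env` interface, the `buffer.sample_tensors` helper, the assumption that observations are tensors, and the default step budgets are illustrative choices that mirror the pseudocode, and `dqn_update` is the update sketch shown above.

```python
import copy
import torch

def train_dqn(env, q_net, optimizer, buffer, dqn_update,
              t_max=100_000, pre_train_step=1_000, decline_step=50_000,
              mini_batch=32, copy_every=5, eps_start=1.0, eps_min=0.01):
    target_net = copy.deepcopy(q_net)            # theta_TD initialized from theta_0
    eps = eps_start
    state = env.reset()                          # assumed to return an observation tensor
    for t in range(t_max):
        # epsilon-greedy action selection (purely random during pre-training)
        if t < pre_train_step or torch.rand(1).item() < eps:
            action = env.sample_random_action()
        else:
            with torch.no_grad():
                action = int(q_net(state.unsqueeze(0)).argmax(dim=1))
        next_state, reward, done = env.step(action)
        buffer.store(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        if t < decline_step:                     # epsilon decreases by a fixed amount per step
            eps = max(eps_min, eps - (eps_start - eps_min) / decline_step)

        if t >= pre_train_step and len(buffer) >= mini_batch:
            batch = buffer.sample_tensors(mini_batch)   # assumed helper returning tensors
            dqn_update(q_net, target_net, optimizer, batch)
            if t % copy_every == 0:              # copy theta to theta_TD every C steps
                target_net.load_state_dict(q_net.state_dict())
    return q_net
```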
C. Dense Network
Convolutional neural networks (CNNs) have become the mainstream method in the field of computer vision. In general, the deeper the network is, the higher its nonlinearity and fitting accuracy are.
However, the gradient in network training is propagated from the back to the front, so the gradient received by the front layers gradually becomes smaller, and training may even stagnate once the number of network layers exceeds a certain limit. He et al. proposed the Residual Network (ResNet) [26]–[29] in 2015, in which short-circuit connections between earlier and later layers strengthen the flow of information and make the training of deeper networks possible.
In 2017, Huang employed the idea of the residual network and connected each layer with all the layers after it (Dense Network [30], [31]), enhancing feature reuse through better propagation and utilization of features and achieving better performance with fewer parameters. As shown in Figure 1, consider an $L$-layer network ($L = 4$). Each layer applies a nonlinear transformation $H_l(\cdot)$, and the output of the $(l-1)$-th layer is $x_{l-1}$.
FIGURE 1. The structure of dense connection.
Then, in the Dense Network, the output of the $l$-th layer is expressed mathematically as:
$$x_l = H_l([x_0, x_1, \cdots, x_{l-1}]) \tag{8}$$
where $[x_0, x_1, \cdots, x_{l-1}]$ is the concatenation of the output feature maps produced by layers $0, 1, 2, \cdots, l-1$ in the dense block, which is a tensor. The advantages of this connection are that the model is more compact, information can be transferred to deeper layers, the connection of features between layers is enhanced, and gradient vanishing is mitigated.
The premise of feature map concatenation is that the feature maps have the same spatial dimensions, so the Dense Network is divided into dense blocks and transition layers. Convolutions with a stride of one are used in the dense blocks to keep the feature map size unchanged.
These convolutions use multiple small kernels to obtain higher nonlinearity, while the transition layer uses average or maximum pooling to reduce the dimension of the feature map.
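A minimal PyTorch-style dense block illustrating the concatenation in (8) could look as follows; the layer sizes are arbitrary placeholders, not the parameters of the Q network described later.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all earlier feature maps, as in Eq. (8)."""

    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            # H_l: BN -> ReLU -> 3x3 conv with stride 1, which keeps the spatial size unchanged
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate  # the next layer sees all previous outputs

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # [x_0, x_1, ..., x_{l-1}]
            features.append(out)
        return torch.cat(features, dim=1)

# Example: a 4-layer dense block on a 16-channel 20x20 feature map
block = DenseBlock(in_channels=16, growth_rate=8, num_layers=4)
y = block(torch.randn(1, 16, 20, 20))   # output channels: 16 + 4*8 = 48
```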
SECTION III. Path Planning Based on Efficient Learning Policy DQN Combined With Dense Network
In the context of path planning, this paper develops a Policy- and Network-improved Deep Q-network (PN-DQN) model.
We propose several policies, such as an action experience value evaluation network and a parallel greedy random exploration structure, and incorporate the dense connection method, aiming to improve path planning speed and accuracy.
This section introduces our algorithm through the problem statement, the learning settings of the model (environment observation, action space, reward design, and policy setting) and the network architecture, and then describes the algorithm flow and training process in detail.
A. Problem Statement
The aim of our model is to find an optimal collision-free path from the start point to the end point in a randomly generated map. The agent is assumed to move in a four-connected environment consisting of passable and non-passable grid cells. Given a start point $s$ and a goal $g$ that are connectable, the task of the agent is to find a feasible sequence of actions from $s$ to $g$; this is also the policy $\pi(s, g)$.
Considering the general path planning problem and assuming that the map $M$, the start point $s$ and the goal $g$ are known, we use $E(M, s, g)$ to represent the environment. Traditional path planning algorithms handle such problems by mathematically modeling the constraints and transforming them into search optimization or energy optimization problems. When $E(M, s, g)$ changes, the problem must be solved again, and $E(M, s, g)$ is also difficult to model. Our model mimics the process by which humans learn skills, finding the optimal path by learning the rules of the environment. Even if a given obstacle or goal changes, the model does not need to be retrained, because the learned rules are universal. Taking the picture of $E(M, s, g)$ as input, the agent selects the optimal policy to achieve the maximum return by observing its own position, the obstacle positions and the goal position. In order for the trained model to have good generalization performance, a new $E(M, s, g)$ is randomly generated in each episode.
When humans learn new skills, typical cases tend to leave a deeper and clearer impression than general experience.
Furthermore, it is important for the refinement of skills to accumulate comprehensive and multi-faceted experience, which we call the depth and the breadth of experience.
We can therefore improve the learning policy, based on the depth and breadth of experience needed at different stages of learning, in two respects:
An action experience value evaluation network is built that increases the proportion of special experience (at obstacles or reaching the end) at the beginning of training. It is helpful for the model to learn environmental rules faster;
构建动作经验值评估网络,增加训练初期特殊经验(在障碍处或到达终点)的比例。有助于模型更快地学习环境规律;
A parallel exploration structure is created.
If the path wandering phenomenon occurs during training, the learning policy continues to explore the wandering point while also taking other points on the map into account, in order to obtain more diverse experience and help the model master the skill in detail.
Combined with the efficient learning policy, we improve the network structure and obtain the PN-DQN model, whose overall structure is shown in Figure 2. The picture of $E(M, s, g)$ is taken as input. The Q network at the top of the figure is responsible for estimating the state-action value, while the experience value evaluation network at the bottom evaluates the value of the experience produced by each action.
For faster training speed and higher accuracy, the Q network is combined with dense connections to improve the extraction and propagation of image features. The value evaluation network, on the other hand, uses a convolutional neural network with a simple structure.
Ultimately, the model selects the actions performed by taking into account the output of the two networks.
FIGURE 2. The structure of PN-DQN model.
B. Learning Settings
1) Environment Observation
Figure 3 shows two different environment observations. An observation consists of background, obstacle, current point and goal.
According to the image information of the environment, an 80 × 80 × 3 RGB pixel matrix is formed, and gray-scale processing is then performed on the RGB image matrix to obtain an 80 × 80 gray matrix. In general, the gray matrix contains four types of pixel values.
Through preprocessing, these four types of pixel values are rewritten into a matrix $[P_b, P_o, p_c, p_g]$ consisting of the background pixel set $P_b$, the obstacle pixel set $P_o$, the current point pixel $p_c$ and the goal pixel $p_g$. The purpose of this preprocessing is to distinguish different objects more accurately and to obtain a more manageable observation matrix.
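A sketch of this preprocessing is given below (NumPy); the specific values assigned to each class, and the fact that the class masks are passed in rather than recovered from gray levels, are illustrative assumptions to keep the example self-contained.

```python
import numpy as np

# Illustrative values for the four object classes (assumed, not the paper's exact numbers).
BACKGROUND, OBSTACLE, CURRENT, GOAL = 0.0, 0.25, 0.75, 1.0

def preprocess(rgb_image, obstacle_mask, current_pos, goal_pos):
    """Build the 80x80 observation matrix [P_b, P_o, p_c, p_g] from an 80x80x3 map image.

    In practice the masks would be derived from the distinct gray levels of the rendered
    map; here they are passed explicitly so the sketch runs on its own."""
    height, width = rgb_image.shape[:2]
    obs = np.full((height, width), BACKGROUND, dtype=np.float32)
    obs[obstacle_mask] = OBSTACLE     # obstacle pixel set P_o
    obs[current_pos] = CURRENT        # current point pixel p_c
    obs[goal_pos] = GOAL              # goal pixel p_g
    return obs
```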
FIGURE 3. Two different observations.
2) Action Space
There are two common kinds of action space in grid path planning tasks: four-neighborhood and eight-neighborhood. These definitions of the action space control how the current location can change.
The experiments in this paper all use the four-neighborhood action space, because the research goal is the best path rather than a motion plan. In the experiments, the speed is fixed so that one unit distance is moved per time step.
3) Reward Design
The reward is the only feedback that the model can get from the environment, and it is the learning orientation of the model. The reward determines the skills that the model learns and the efficiency of the model.
Therefore, a good reward design should be concise and fully reflect the designer's intent. In our task, the reward design focuses on two aspects: reaching the goal and avoiding obstacles.
Based on these requirements, the reward function is defined in a sparse form:
$$r_t = \begin{cases} r_{reach}, & \text{if } p_c = p_g \\ r_{crash}, & \text{if } p_c \in P_o \\ 0, & \text{otherwise} \end{cases} \tag{9}$$
which divides the reward function into three parts according to the point reached at the next moment. Combined with the Bellman optimality equation (5), it can be seen that the value of the action that reaches the goal is $r_{reach}$, and the value of the action that hits an obstacle is $r_{crash}$. We generally give $r_{reach}$ a positive value to encourage the model to find the goal and give $r_{crash}$ a negative value to punish collisions. The value of a normal action decreases as the distance between the current point and the goal increases, and $\gamma$ needs to be less than 1 to drive the agent toward the goal in this experiment.
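The sparse reward in (9) is straightforward to implement; a minimal sketch follows, with the values of r_reach and r_crash as placeholder defaults rather than the paper's exact settings.

```python
def reward(current_pos, goal_pos, obstacle_set, r_reach=1.0, r_crash=-1.0):
    """Sparse reward of Eq. (9): positive at the goal, negative on collision, 0 otherwise."""
    if current_pos == goal_pos:
        return r_reach
    if current_pos in obstacle_set:
        return r_crash
    return 0.0

# Example: the agent steps onto an obstacle cell
print(reward((2, 3), (7, 7), {(2, 3), (4, 1)}))   # -> -1.0
```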
4) Policy Setting
DQN generally uses the $\epsilon$-greedy policy to balance exploration and exploitation:
$$\pi(s_t) = \begin{cases} \tilde{a}, & \text{if } \mu \leq \epsilon \\ \arg\max_{a \in A} q(s_t, a), & \text{otherwise} \end{cases} \tag{10}$$
where $\mu$ is a random value drawn from [0, 1] in each round, $\epsilon$ is the exploration rate and $\tilde{a}$ is a random action.
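A one-function sketch of the selection rule (10), assuming the Q values for the current state are available as a plain list:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Epsilon-greedy choice of Eq. (10): explore with probability epsilon,
    otherwise act greedily on the estimated Q values."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # random action a~
    return max(range(len(q_values)), key=lambda a: q_values[a])       # argmax_a q(s_t, a)
```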
While retaining a certain degree of randomness and considering the characteristics of grid path planning, we make the following improvements to the policy:
a: Policy of the Depth of Experience
In order to gain more special experience in the early stage, we create an experience value evaluation network.
The evaluation network only considers a rectangle formed by the eight-neighborhood centered on the current point, and evaluates how valuable the experience of choosing a particular action is. The $t$-th loss of the evaluation network $E$ is defined as:
$$L_t(\theta_t) = \mathbb{E}_{s,a}\left\{\left((1 + |r_t|) - e(s_t, a_t; \theta^E_t)\right)^2\right\} \tag{11}$$
where the value evaluation function $e(s_t, a_t; \theta^E_t)$ gradually approaches $1 + |r_t|$ through training and $\theta^E$ is the weight of the evaluation network $E$. Combined with (9), it can be seen that the value of $e(s_t, a_t; \theta^E_t)$ will converge to:
$$e(s_t, a_t; \theta^E_t) = \begin{cases} 1 + |r_{reach}|, & \text{if } p_c = p_g \\ 1 + |r_{crash}|, & \text{if } p_c \in P_o \\ 1, & \text{otherwise} \end{cases} \tag{12}$$
The value evaluation network $E$ completes its training in the pre-training stage, before the training of network $Q$, and then helps to select actions:
$$a_t = \arg\max_{a \in A} q(s_t, a; \theta_t)\, e(s_t, a; \theta^E_t) \tag{13}$$
Remark 1: According to (12), the model selects an action based on the product of the value evaluation function and the state-action value function. In the initial stage of Q network training, because of the maximum operation in (5), the state-action value function gives a positively biased estimate for every action. At this time, $e(s_t, a_t; \theta^E_t)$ encourages the model to choose actions that reach the goal or crash in the next step, increasing the proportion of special experience. When $q(s_t, a_t; \theta_t)$ begins to correctly identify obstacles, $e(s_t, a_t; \theta^E_t)$ suppresses collisions, encourages the agent to explore more locations, increases the diversity of experience, and helps the model learn the skill carefully. Moreover, in later exploration, when the model misjudges an obstacle ($q(s_t, a_t; \theta_t) > 0$), it is encouraged to gather experience at the misjudged place.
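In code, the selection rule (13) simply multiplies the two network outputs before the argmax. A PyTorch-style sketch under assumed network interfaces (both networks mapping one observation to one value per action):

```python
import torch

def select_action(q_net, e_net, state):
    """Action selection of Eq. (13): argmax_a q(s, a; theta) * e(s, a; theta_E)."""
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0)).squeeze(0)   # q(s_t, a; theta_t)
        e_values = e_net(state.unsqueeze(0)).squeeze(0)   # e(s_t, a; theta_E)
        return int(torch.argmax(q_values * e_values))
```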
b: Policy of the Breadth of Experience
In order to make better use of each step, the model creates a parallel structure for the path wandering phenomenon (left-right-left-right, or up-down-up-down) that may appear during training, as shown in Figure 4. Under normal conditions the model selects the action that maximizes $q(s_t, a; \theta_t)\, e(s_t, a; \theta^E_t)$. When the path wandering phenomenon occurs, the parallel structure is triggered: it continues to explore the rest of the map with the greedy random policy while simultaneously continuing to gather experience at the wandering point.
The greedy policy randomly selects actions with a certain probability, or greedily chooses to move the current point closer to the goal without considering obstacles.
We do not recommend simply forcing the agent out of the wandering point, because the wandering phenomenon shows that the model lacks understanding of that point, and the corresponding experience is very important. For the wandering processing shown in the right part of Figure 4, we extract the experience of the two steps before that point, $ex_{t-1} = (s_{t-1}, a_{t-1}, r_{t-1}, s_t)$ and $ex_t = (s_t, a_t, r_t, s_{t+1})$, together with the current_step in the current map. The model interacts with the environment through the greedy random policy and also checks whether the updated network weights can escape the wandering point.
If the model can identify that point, or if the number of steps reaches the pre-set maximum exploration steps for a single map, the structure is terminated; otherwise, it continues to add experience of the wandering point.
Our general idea is to take into account the wandering experience gained and the exploration of other locations on the map.
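One simple way to detect the wandering pattern described above (position sequences such as A-B-A-B) is sketched below; the window length is an assumption, since the paper does not state the exact trigger condition.

```python
def is_wandering(position_history, window=4):
    """Return True if the last `window` positions alternate between two cells,
    e.g. A, B, A, B (left-right-left-right or up-down-up-down)."""
    if len(position_history) < window:
        return False
    recent = position_history[-window:]
    a, b = recent[0], recent[1]
    if a == b:
        return False
    return all(p == (a if i % 2 == 0 else b) for i, p in enumerate(recent))

# Example: the agent bounces between (3, 4) and (3, 5)
print(is_wandering([(3, 3), (3, 4), (3, 5), (3, 4), (3, 5)]))  # -> True
```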
FIGURE 4. The algorithm flow charts of PN-DQN.
c: Policy of Avoiding Incorrect Evaluation
In order to save time, we usually set a maximum number of steps, max_step, that the agent can move in each episode. This assumption introduces the problem of incorrectly evaluating the value function. In Algorithm 1:
$$y_i = \begin{cases} r_i, & \text{if terminal} \\ r_i + \gamma \max_{a_{i+1} \in A} q_t(s_{i+1}, a_{i+1}; \theta^{TD}), & \text{otherwise} \end{cases} \tag{14}$$
where terminal means $p_c = p_g$ or $p_c \in P_o$ or current_step = max_step. If there is no collision and the target is not reached, the value function should be $r_i + \gamma \max_{a_{i+1} \in A} q_t(s_{i+1}, a_{i+1}; \theta^{TD})$. However, when current_step is exactly equal to max_step, the target used by the model is $r_i$. This causes a large error in the evaluation of the value function, which in turn leads to unstable training. Therefore, in the experiments we abandon the experience in which current_step reaches max_step.
Remark 2: The terminal in the figure represents three cases: 1) current_step is equal to max_step; 2) a collision occurs; 3) the target point is reached. The terminal_w in the figure indicates that the model successfully identifies the defect, that is, the action selected under the current weights differs from the action stored in the experience, and the END in the figure indicates that $t$ reaches $t_{max}$.
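A sketch of how such truncated experiences can be filtered out before computing the targets of (14) is given below (plain Python over a list of experience tuples; the tuple layout and the `q_target` callable returning per-action values are assumptions).

```python
def build_targets(experiences, q_target, gamma=0.9, max_step=25):
    """Compute the TD targets of Eq. (14), skipping experiences whose episode ended only
    because current_step reached max_step (their target would be mis-valued).

    Each experience is assumed to be a tuple
    (state, action, reward, next_state, reached_or_crashed, current_step)."""
    targets = []
    for s, a, r, s_next, reached_or_crashed, current_step in experiences:
        if not reached_or_crashed and current_step == max_step:
            continue                      # abandon the truncated experience
        if reached_or_crashed:
            targets.append((s, a, r))     # y_i = r_i at a true terminal
        else:
            y = r + gamma * max(q_target(s_next))   # y_i = r_i + gamma * max_a' q_t(s', a')
            targets.append((s, a, y))
    return targets
```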
C. Model and Network Architecture
To complete the navigation task successfully, we propose the learning model PN-DQN, which is suited to the current task and shown in Figure 2. The model consists of the value evaluation network $E$ at the bottom of the figure, the deep Q network $Q$ at the top, and the target network $Q_t$ with the same structure. The value evaluation network is shown in Figure 5.
FIGURE 5. The structure of value evaluation network.
The value evaluation network consists of convolution layers and fully connected layers. All convolution layers in this paper consist of convolution and batch normalization, extracting features, changing dimensions and reducing the possibility of over-fitting.
The activation function is ReLU, which mitigates gradient vanishing and speeds up training. Using the same padding mode, the relationship between the input feature map size $W_{in}$ of a convolution layer, its output feature map size $W_{out}$ and the stride $S$ is given by:
$$W_{out} = \frac{W_{in}}{S} \tag{15}$$
The Q network structure is shown in Figure 6. The network consists of a pre-processing layer, dense blocks and a fully connected layer. The input of the network is an 80 × 80 × 4 gray matrix. The first layer is a convolution layer that uses the ReLU activation function, with a kernel size of 8 × 8 and a stride of 4, which reduces the image dimension, reduces subsequent computation, and extracts features.
Behind the convolution layer is a 2 × 2 overlapping pooling layer, which maintains the feature map size and increases the generalization performance of the model to avoid over-fitting. Then there are three dense blocks and transition layers; the growth rates are 8, 16 and 16 respectively, and the bottleneck is 2, which determines the output of the 3 × 3 convolution layer. The 1 × 1 convolution layer integrates features and reduces the amount of subsequent computation; its number of output channels is bottleneck × growth rate. The dense blocks use dense connections and multiple small convolution kernels to improve feature propagation and reuse and to increase nonlinearity. The pooling layer is discarded in the transition layer, and a convolution layer is employed instead to reduce the dimension.
The main purpose is to retain more high-dimensional features and location information. The transition layer has a ratio of input to output channels of 2:1, which compresses the features to make the network lighter.
The third part is the fully connected layer, which integrates the features and outputs the state-action value for four actions. The specific parameters of the network are shown in Table 1.
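Under the hyper-parameters just described (8 × 8 stride-4 input convolution, three dense blocks with growth rates 8/16/16, bottleneck 2, convolutional stride-2 transition layers with 2:1 channel compression, and a four-action fully connected head), the Q network could be assembled roughly as in the sketch below. The number of layers per dense block, the hidden width of the head and the omission of the overlapping pooling layer are assumptions where Table 1 is not reproduced here.

```python
import torch
import torch.nn as nn

def dense_block(in_ch, growth_rate, num_layers, bottleneck=2):
    """Dense block layers: 1x1 bottleneck conv followed by 3x3 conv, with concatenation."""
    layers, ch = nn.ModuleList(), in_ch
    for _ in range(num_layers):
        layers.append(nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, bottleneck * growth_rate, kernel_size=1, bias=False),   # integrates features
            nn.BatchNorm2d(bottleneck * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False),
        ))
        ch += growth_rate
    return layers, ch

class DenseQNet(nn.Module):
    """Sketch of the Q network: stride-4 pre-processing conv, three dense blocks with
    convolutional (stride-2, 2:1 compression) transition layers, and a 4-action head."""

    def __init__(self, in_channels=4, num_actions=4, layers_per_block=(4, 4, 4)):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(in_channels, 16, kernel_size=8, stride=4, padding=2),
                                 nn.ReLU(inplace=True))                  # 80x80 -> 20x20
        self.blocks, self.transitions = nn.ModuleList(), nn.ModuleList()
        ch = 16
        for growth, n_layers in zip((8, 16, 16), layers_per_block):
            block, ch = dense_block(ch, growth, n_layers)
            self.blocks.append(block)
            out_ch = ch // 2                                             # 2:1 channel compression
            self.transitions.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False)))  # conv, not pooling
            ch = out_ch
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(ch * 3 * 3, 256),
                                  nn.ReLU(inplace=True), nn.Linear(256, num_actions))

    def forward(self, x):
        x = self.pre(x)
        for block, trans in zip(self.blocks, self.transitions):
            feats = [x]
            for layer in block:
                feats.append(layer(torch.cat(feats, dim=1)))             # dense concatenation
            x = trans(torch.cat(feats, dim=1))
        return self.head(x)

q_net = DenseQNet()
print(q_net(torch.randn(1, 4, 80, 80)).shape)   # -> torch.Size([1, 4])
```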
TABLE 1. The Parameters in Q Network
FIGURE 6. The structure of Q network.
To train the model, we calculate the mean squared error loss and update the network parameters with (7). The algorithm flow and pseudo code are as follows:
Remark 3: As shown in Algorithm 2, we first initialize the environment, the experience pool, the value network, the target network, the value evaluation network, and so on. A pre_train_step is set to gather some experience through random actions, store it in the experience pool, and complete the training of the value evaluation network in the pre-training stage. The training phase sets an $\epsilon$ that decreases as the number of steps increases, and uses the $\epsilon$-greedy policy to decide whether the current step is random exploration or exploitation according to (13). If the path wandering phenomenon occurs, exploitation switches to the greedy random policy, and the parallel structure shown in Figure 4 is used, taking into account both the exploration of other points and the acquisition of wandering experience. It should be noted that writing the evaluation network weights as $\theta^E_t$ is only to distinguish the moment and to unify the notation with the other weights; in fact, neither $\theta^E_t$ nor $\theta_t$ changes continuously with time $t$, and each is updated according to the computed gradient only during its own training rounds. During training, mini_batch experiences are drawn from the experience pool at every step to train the Q network. Finally, the parameters of the Q network $Q$ are copied to the target network every $C$ steps.
Algorithm 2 PN-DQN.
Initialization: Initialize replay memory space $D$ to capacity $N$, initialize the Q network $Q$ with random weights $\theta_0$, initialize the target network $Q_t$'s weights $\theta^{TD}$ with $\theta_0$, and initialize the value evaluation network $E$ with random weights $\theta^E$. Initialize $t = 0$.
1: for $t < t_{max}$ do
2:  if $t \neq 1$ then $s_t = s_{t+1}$
3:  else get the initial observation $s_t$
4:  end if
5:  if $t <$ pre_train_step then
6:   Select a random action $a_t$
7:   if $t >$ mini_batch then
8:    Sample mini_batch in $D$ and calculate $y_i = r_i$
9:    Calculate the loss $(e(s_i, a_i; \theta^E_t) - y_i)^2$
10:   Train and update the E network's weights $\theta^E_{t+1}$
11:  end if
12: else
13:  if $\mu < \epsilon$ then
14:   select a random action $a_t$
15:  else
16:   select $a_t = \arg\max_{a \in A} q(s_t, a; \theta_t)\, e(s_t, a; \theta^E_t)$
17:  end if
18: end if
19: Store experience $ex_t = (s_t, a_t, r_t, s_{t+1})$
20: if $t <$ decline_step then
21:  $\epsilon$ decreases by a certain percentage
22: end if
23: if $t \geq$ pre_train_step then
24:  Sample mini_batch in $D$ and calculate $y_i$:
25:  $y_i = \begin{cases} r_i, & \text{if terminal} \\ r_i + \gamma \max_{a_{i+1} \in A} q_t(s_{i+1}, a_{i+1}; \theta^{TD}), & \text{otherwise} \end{cases}$
26:  Calculate the loss $(q(s_i, a_i; \theta_t) - y_i)^2$
27:  Train and update the Q network's weights $\theta_{t+1}$
28:  Every $C$ steps copy $\theta_{t+1}$ to $\theta^{TD}$
29: end if
30: end for
SECTION IV. Test and Result Analysis
The evaluation of the model path planning ability is carried out in a grid environment. The goal of the model is to find the optimal sequence of actions from the starting point to the target point without collision.
In order to verify that the model can adapt to different environments, the map $E(M, s, g)$ is regenerated every episode. The model has only a limited number of steps, max_step, in each episode, and the episode ends early when the agent hits an obstacle or reaches the target. The experiments are divided into two groups with map sizes of 5 * 5 and 8 * 8. In each group PN-DQN is compared with DQN, P-DQN (only the policy changed) and N-DQN (only the network changed) to show the advantage of our idea.
The policies and networks used in the experiments have been introduced in the previous section. Table 2 provides some of the basic parameters used in the experiments:
TABLE 2. The Parameters in Algorithm 2
The mini-batch size is set to 32, and the parameters of the value network are copied to the target network every 5 steps. We set $\gamma$ to 0.9 instead of the common 0.99. Since $E(M, s, g)$ differs in every round, the minimum value is set to 0.01 to observe the model performance more intuitively.
Remark 4: The mini-batch determines the direction in which the gradient descends. Too small a batch may cause over-fitting, while too large a batch slows down convergence, and the memory size of the computer is also a limit. Since the reward function (9) stipulates that the return is 0 under normal conditions, we use the discount factor so that the magnitude of the Q value more clearly distinguishes actions according to their distance from the target point. $N$ should be appropriately large to allow the model to learn from more varied experience and to prevent over-fitting and local optima. A small $C$ keeps the loss calculation up to date and effective, but too small a value leads to training oscillations.
To evaluate the performance of the model we define the following metrics:
Success rate: the ratio of the number of rounds that successfully find the target point to the total number of rounds;
Accuracy: the ratio of the shortest path steps to the number of steps used in a successful round;
Loss: the loss during training.
The premise that the model ultimately achieves good training results is that the value evaluation network can accurately identify different points.
The training process for the 5 * 5 map consists of 1,450,000 exploration steps and 50,000 pre-training steps (without network updates), and the maximum number of steps per episode is 15.
Similarly, the training process for the 8 * 8 map consists of 1,850,000 exploration steps and 150,000 pre-training steps, and the maximum number of steps per episode is 25.
After the pre-training step is completed, we observe whether the value evaluation network has converged and then train the Q network.
Figure 7 shows the loss of the value evaluation network over the 50,000 pre-training steps on the 5 * 5 map. It can be seen that the loss decreases as the number of training steps increases, which indicates that the parameters of the evaluation network are being optimized by gradient descent.
The model converges at around 25,000 steps, and the value evaluation network can then accurately identify different points.
FIGURE 7. The loss of the value evaluation network in the 5 * 5 map.
Figure 8 shows the success rate and the accuracy during training. The results show that as training progresses the models become familiar with the environment, and both the success rate and the accuracy increase. Figure 8(a) and Figure 8(b) show the success rates. From the policy point of view, P-DQN and PN-DQN learn faster than DQN and N-DQN because they obtain more deep experience in the early stage of training, and after stabilizing, their success rates are higher than those of DQN (and N-DQN).
From the perspective of network architecture, N-DQN and PN-DQN benefit from dense connections and efficient feature learning and utilization, so they have stronger learning ability and higher learning speed. In Figure 8(b), the P-DQN policy is better suited to learning, so its success rate in the early stage is higher than that of N-DQN; however, owing to the limitation of its network architecture, its success rate is overtaken by N-DQN after convergence.
The PN-DQN model is excellent in both convergence speed and success rate.
FIGURE 8. The success rate and the accuracy of the training process. (a) The success rate in the 5 * 5 map. (b) The success rate in the 8 * 8 map. (c) The accuracy in the 5 * 5 map. (d) The accuracy in the 8 * 8 map.
Figure 8(c) and Figure 8(d) show the accuracy, whose trends are similar to those of the success rates. It is worth noting that the accuracy is affected more by the policy than the success rate is.
It can be seen that the policy greatly improves the accuracy rate while slightly increasing the success rate of the model, showing the superiority of our policy.
In general, both the improved policy and the improved network structure bring gains in learning speed, success rate and accuracy, and the PN-DQN model performs excellently in all of these aspects.
Figure 9 shows the success rate and the accuracy of the test performed after training, which runs for a total of 200,000 steps. Figure 9(a) and Figure 9(b) show the success rates. Comparing the four models in each graph, it can be seen that both the policy and the network improve the navigation success rate. Comparing Figure 9(a) and Figure 9(b), Figure 9(b) shows obvious stratification according to the network used by the model: the more complicated the environment is, the more obvious the learning advantage of dense connections becomes. In terms of accuracy, it can be seen from Figure 9(c) and Figure 9(d) that the efficient learning policy brings a large improvement. In Figure 9(c), P-DQN even achieves better performance than N-DQN thanks to the policy advantage, which reflects that the efficient learning policy is more conducive to mastering path planning skills. Figure 9(d) once again shows the learning advantage of dense connections. In Figure 9, the PN-DQN model is far ahead of the DQN model in both accuracy and success rate.
FIGURE 9. The success rate and the accuracy of the testing process. (a) The success rate in the 5 * 5 map. (b) The success rate in the 8 * 8 map. (c) The accuracy in the 5 * 5 map. (d) The accuracy in the 8 * 8 map.
Deep experience has a stronger impact (expressed through the learning of an appropriate loss function) and speeds up learning in the early period, while breadth experience improves the diversity of the samples and improves learning accuracy.
In general, the expressive ability of the network model determines the success rate and accuracy of the model, and our policy changes the distribution of the experience we have acquired, which is more suitable for skill learning, and has a higher success rate and accuracy, especially accuracy.
一般来说,网络模型的表达能力决定了模型的成功率和准确率,我们的政策改变了我们获得的经验的分布,更适合技能学习,并且具有更高的成功率和准确率,尤其是准确率。
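One way to picture this stage-dependent mix is as two sub-pools of experience whose sampling ratio shifts as training progresses. The sketch below is only an illustration of that idea: the value threshold, the ratio schedule, and all names are chosen by us for exposition rather than taken from the paper.

```python
import random
from collections import deque

class MixedReplayBuffer:
    """Toy buffer with a 'deep' pool (high-value transitions) and a 'breadth' pool
    (everything else), sampled with a stage-dependent ratio (illustrative only)."""

    def __init__(self, capacity=10_000, value_threshold=1.0):
        self.deep = deque(maxlen=capacity)
        self.breadth = deque(maxlen=capacity)
        self.value_threshold = value_threshold  # assumed cutoff for "deep" experience

    def add(self, transition, estimated_value):
        # An experience-value score decides which sub-pool a transition joins.
        pool = self.deep if estimated_value >= self.value_threshold else self.breadth
        pool.append(transition)

    def sample(self, batch_size, progress):
        """progress in [0, 1]: early training favours deep experience, later training breadth."""
        deep_ratio = max(0.2, 0.8 * (1.0 - progress))  # illustrative schedule
        n_deep = min(len(self.deep), int(batch_size * deep_ratio))
        n_breadth = min(len(self.breadth), batch_size - n_deep)
        return random.sample(list(self.deep), n_deep) + random.sample(list(self.breadth), n_breadth)
```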
Figure 10 shows the loss during training. Comparing P-DQN with DQN and PN-DQN with N-DQN, the models that use the efficient learning policy converge faster under the same conditions, because the policy supplies more diversified samples that yield steeper gradients in the early stage. This in turn provides more rounds, and therefore more varied experience, which makes the training process more stable. Compared with N-DQN and DQN, the loss curves of PN-DQN and P-DQN are smaller and more stable, reflecting the favourable learning characteristics of our policy.
In terms of network structure, the loss curves of N-DQN and DQN are very similar, but owing to the higher learning efficiency and stronger fitting ability of our model, N-DQN converges faster and reaches a smaller loss.
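The loss tracked here is the usual DQN temporal-difference error, so these curves can be read as how quickly each variant drives that error down. A minimal PyTorch-style sketch of the standard formulation is given below; the network handles and hyper-parameters are placeholders, not a restatement of the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Standard DQN TD loss: (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2."""
    states, actions, rewards, next_states, dones = batch
    # Q-values of the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap target from the (frozen) target network.
        q_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * q_next
    return F.mse_loss(q_sa, targets)
```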
Overall, our PN-DQN model is more stable, faster, and more accurate than the traditional DQN model.
FIGURE 10. The loss during the training process. (a) The loss in the 5 * 5 map. (b) The loss in the 8 * 8 map.
TABLE 3. Results.
Table 3 lists the performance indices of each model in this experiment: the success rate and accuracy are averaged over the last 50,000 steps of the test, and the loss is the average error over the last 50,000 steps of training. The probability of finding the path in our 5 * 5 and 8 * 8 variable-map environments is 99.58% and 98.70% respectively, the path accuracy is 99.95% and 99.70%, and the state-action value estimation errors are 4.0800e-5 and 1.0741e-4. All of these aspects are improved.
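The reported values are plain tail averages of the per-step logs; for instance (illustrative, assuming one logged value per step):

```python
import numpy as np

def tail_average(per_step_values, window=50_000):
    """Average of the final `window` entries of a per-step metric log."""
    values = np.asarray(per_step_values, dtype=float)
    return float(values[-window:].mean())

# e.g. success = tail_average(success_log); loss = tail_average(loss_log)
```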
SECTION V.
Conclusions
This paper has proposed an efficient PN-DQN algorithm. We observe that when people learn new skills, they like to study typical cases to master the general framework of the skill and then enrich their understanding through diversified experience.
Accordingly, the proportions of deep experience and breadth experience in the experience pool are varied according to the type of experience required at each stage of learning.
In the grid path-planning experiment, a value evaluation network is built to control the depth of experience and speed up learning. For the path-wandering phenomenon, a parallel exploration structure is created that makes more efficient use of time steps and increases the breadth of experience. In addition, a dense connection scheme is incorporated to enhance the learning ability of the network model.
Finally, simulation experiments show that our algorithm clearly outperforms the traditional DQN algorithm in learning speed, path-planning success rate, and path accuracy. We believe that our learning policy can speed up learning whenever the state space is discrete. Our further research topics include improving the algorithm and applying it to obstacle avoidance and navigation of aircraft.