【Paper Notes 1】【DDQN】【Double Deep Q-Network】

Table of Contents

Autonomous maneuver decision-making for a UCAV in short-range aerial combat based on an MS-DDQN algorithm

DDQN-TS: A novel bi-objective intelligent scheduling algorithm in the cloud environment

4.3. DDQN-TS algorithm

Attitude Control of Fixed-wing UAV Based on DDQN

A Novel Convolutional Neural Networks for Stock Trading Based on DDQN Algorithm

DDQN-Based Trajectory and Resource Optimization for UAV-Aided MEC Secure Communications

Algorithm 1: DDQN-Based Trajectory Optimization and Resource Allocation.

To cope with uncertainty, the DQN algorithm also builds a target network with the same structure to update the Q-value. The target network has the same initial structure as the Q-function network, but its parameters are held fixed. The parameters of the Q-function network are copied to the target network at regular intervals, so that the Q-value targets remain unchanged for a period of time. The optimal solution is obtained by minimizing the following loss function with gradient descent:
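The excerpt does not reproduce the loss function itself; for reference, the standard DQN loss with a target network (writing the reward as r, the discount factor as γ, the online Q-network parameters as θ, and the periodically copied target-network parameters as θ⁻) has the form:

```latex
L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^{2}\right]
```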

The deep Q-network (DQN) model mainly consists of a maneuver-decision model built on a multi-layer back-propagation (BP) network and a decision model based on Q-learning.

The DQN algorithm suffers from a cold-start problem. It is difficult to bring the model into a relatively ideal environment during the early stages of algorithm iteration, which leads to large errors in the estimate of the value function early in learning. The resulting slow learning affects the later learning process and makes it difficult to find the optimal strategy. With the double Q-learning algorithm, the model uses different networks for choosing the optimal action and for calculating the target value, thereby reducing overestimation of the value function.
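As a minimal sketch of this decoupling (assuming a PyTorch-style setup; `q_net`, `target_net`, and the batch tensors are illustrative names, not taken from the papers), the double-DQN target can be computed as follows:

```python
import torch

def ddqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network chooses the greedy next action,
    while the periodically frozen target network evaluates its value."""
    with torch.no_grad():
        # Action selection with the online (Q-function) network
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Terminal transitions contribute no bootstrapped value
        return rewards + gamma * (1.0 - dones) * next_q
```

Using separate networks for selection and evaluation is what distinguishes this target from the plain DQN target, where the same target network both selects and evaluates the next action.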

A neural network approximation function is also used, in which the input layer includes the current situation information and the output layer includes longitudinal maneuver decision-making instructions, lateral maneuver decision-making instructions, and speed control commands (Fig. 11). The situation information is fed into the neural network, and the maneuver action values are output. At the same time, the optimal maneuver action is obtained through interaction with the environment, which allows the UCAV to independently make operational decisions and improves its intelligence.
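A minimal sketch of such an approximation network is given below, assuming a PyTorch implementation; the layer sizes, the dimension of the situation vector, and the size of the discrete maneuver set are illustrative placeholders, since the paper's exact values are not reproduced here:

```python
import torch
import torch.nn as nn

class ManeuverQNetwork(nn.Module):
    """Maps the current situation information to one value per discrete maneuver."""
    def __init__(self, situation_dim=12, n_maneuvers=15):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(situation_dim, 128),  # input layer: current situation information
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_maneuvers),    # output layer: value of each maneuver command
        )

    def forward(self, situation: torch.Tensor) -> torch.Tensor:
        return self.net(situation)

# The greedy maneuver is the action with the highest predicted value:
# maneuver = ManeuverQNetwork()(situation_tensor).argmax(dim=-1)
```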

Autonomous maneuver decision-making for a UCAV in short-range aerial combat based on an MS-DDQN algorithm

DDQN-TS: A novel bi-objective intelligent scheduling algorithm in the cloud environment


 

 4.3. DDQN-TS algorithm 


 

Fig. 2. The architecture of DDQN.

A good predictive model requires the neural networks to be trained on a large number of training samples, and the prediction models obtained by different training methods also differ somewhat. Generally, two different training methods exist: online training and offline training.

The online training strategy refers to training, adjusting network parameters, and making decisions simultaneously. Its advantage is that the network parameters can be adjusted in real time according to changes in the external environment, so it suits scenarios where the environment changes in real time, for example, large-scale online games.

The offline training strategy refers to fully training the neural network on a large number of generated training samples to obtain a network prediction model with fixed parameters; the obtained prediction model is then used for prediction and testing. Its advantage is that, after sufficient training, the prediction accuracy of the model can be improved.

Since the environment of the application scenario does not need to change frequently and we aim for high prediction accuracy and predictive integrity, we use a combination of online decision making and offline training to achieve intelligent scheduling.
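A rough sketch of this split, under the assumption of a PyTorch-style value network (the function names and the pre-generated dataset are illustrative, not from the paper), might look like this:

```python
import torch

def train_offline(q_net, optimizer, dataset, epochs=50):
    """Offline phase: fully train the network on a large, pre-generated training set."""
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for states, target_q in dataset:  # batches of (state, target Q-value) pairs
            loss = loss_fn(q_net(states), target_q)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def decide_online(q_net, state):
    """Online phase: the trained model is kept fixed and only queried for scheduling decisions."""
    with torch.no_grad():
        return q_net(state).argmax(dim=-1)
```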

Attitude Control of Fixed-wing UAV Based on DDQN


Fig. 1.  Agent training process based on DDQN

Fig. 2.  Agent testing process based on DDQN

A Novel Convolutional Neural Networks for Stock Trading Based on DDQN Algorithm


FIGURE 2.  Structure of the multiscale reinforcement trading system.

DDQN-Based Trajectory and Resource Optimization for UAV-Aided MEC Secure Communications


Algorithm 1: DDQN-Based Trajectory Optimization and Resource Allocation.


Fig. 2.  DDQN-based trajectory optimization and resource allocation scheme.

 
