Notes on Deep Recurrent Q-Learning for Partially Observable MDPs (DRQN)

Melody1211

Category: Paper reading notes

Article link: https://blog.csdn.net/Melody1211/article/details/105104248

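The paper proposes DRQN, a DQN combined with an LSTM, for solving partially observable Markov decision processes (POMDPs). Although DRQN receives only a single frame as input, it can integrate information over time. When observations are degraded at evaluation time, its performance drops less than DQN's, and it adapts to changes in observation quality.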

(Summary generated automatically by CSDN.)

Deep Recurrent Q-Learning for Partially Observable MDPs

1. What the paper is about / main contributions

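Conventional DQN mainly targets MDP environments. Even in its Atari experiments, it feeds the network a stack of several frames so that the input observation reflects the underlying system state. Most real-world problems, however, are only partially observable, i.e. they are POMDPs. Building on DQN, this paper exploits the properties of recurrent neural networks and combines an LSTM with DQN to obtain DRQN, which can handle partial observability and, in the experiments, needs only a single frame of observation as input at each step.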
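The difference between the two input schemes can be illustrated with a toy example. The sketch below is not the paper's network: it assumes a hypothetical 1-D "ball position" observation and uses a hand-written state update in place of a learned LSTM. It only shows that velocity, which cannot be read from a single frame, can be recovered either from a stack of frames (DQN style) or from one frame plus a carried state (DRQN style).

```python
from collections import deque


def stacked_velocity(positions, k=2):
    """DQN-style input: a stack of the last k frames; velocity is read off the stack."""
    stack = deque(maxlen=k)
    out = []
    for p in positions:
        stack.append(p)
        # With fewer than two frames in the stack, velocity is not observable yet.
        out.append(stack[-1] - stack[-2] if len(stack) >= 2 else None)
    return out


def recurrent_velocity(positions):
    """DRQN-style input: one frame per step; a carried state remembers the past."""
    hidden = None  # stand-in for the LSTM state; hand-written here, not learned
    out = []
    for p in positions:
        out.append(p - hidden if hidden is not None else None)
        hidden = p  # update the carried state with the current observation
    return out


ball_x = [0, 2, 4, 6, 5]  # hypothetical ball positions, one per frame
print(stacked_velocity(ball_x))    # [None, 2, 2, 2, -1]
print(recurrent_velocity(ball_x))  # [None, 2, 2, 2, -1]
```

Both functions produce the same estimates, but the recurrent version never looks at more than one frame at a time, which is the property DRQN relies on.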

2. Paper abstract:

Deep Reinforcement Learning has yielded proficient controllers for complex tasks. However, these controllers have limited memory and rely on being able to perceive the complete game screen at each decision point. To address these shortcomings, this article investigates the effects of adding recurrency to a Deep Q-Network (DQN) by replacing the first post-convolutional fully-connected layer with a recurrent LSTM. The resulting Deep Recurrent Q-Network (DRQN), although capable of seeing only a single frame at each timestep, successfully integrates information through time and replicates DQN’s performance on standard Atari games and partially observed equivalents featuring flickering game screens. Additionally, when trained with partial observations and evaluated with incrementally more complete observations, DRQN’s performance scales as a function of observability. Conversely, when trained with full observations and evaluated with partial observations, DRQN’s performance degrades less than DQN’s. Thus, given the same length of history, recurrency is a viable alternative to stacking a history of frames in the DQN’s input layer and while recurrency confers no systematic advantage when learning to play the game, the recurrent net can better adapt at evaluation time if the quality of observations changes.

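In summary: deep reinforcement learning can already train proficient controllers for complex tasks, but those controllers have limited memory and depend on seeing the entire game screen at every decision point. To address this, the paper studies the effect of adding recurrence to DQN by replacing the first fully-connected layer after the convolutional layers with a recurrent LSTM. The resulting Deep Recurrent Q-Network (DRQN) sees only one frame per timestep, yet it integrates information over time and matches DQN's performance both on standard Atari games and on their partially observable counterparts, in which the game screen flickers. When trained with partial observations and evaluated with progressively more complete ones, DRQN's performance scales with observability. Conversely, when trained with full observations and evaluated with partial ones, DRQN's performance degrades less than DQN's. So, given the same length of history, recurrence is a viable alternative to stacking a history of frames in DQN's input layer: it offers no systematic advantage while learning to play the game, but the recurrent network adapts better when observation quality changes at evaluation time.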
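The flickering-screen setting mentioned in the abstract can be sketched as a small observation filter. This is a minimal illustration under stated assumptions, not the authors' code: the names `flicker` and `observed_fraction`, the nested-list frame format, and the default `p_obscure=0.5` are all hypothetical choices made here. Each step, the frame is either passed through unchanged or replaced by an all-zero frame.

```python
import random


def flicker(frame, p_obscure=0.5, rng=random):
    """Return the frame unchanged, or an all-zero frame with probability p_obscure."""
    if rng.random() < p_obscure:
        return [[0 for _ in row] for row in frame]
    return frame


def observed_fraction(num_steps, p_obscure, seed=0):
    """Fraction of steps on which the true frame is actually visible."""
    rng = random.Random(seed)  # seeded generator so runs are repeatable
    frame = [[1, 1], [1, 1]]   # dummy 2x2 frame of non-zero pixels
    seen = sum(
        1 for _ in range(num_steps) if flicker(frame, p_obscure, rng) == frame
    )
    return seen / num_steps


# Sweep observability the way an evaluation protocol might.
for p in (0.0, 0.5, 1.0):
    print(p, observed_fraction(1000, p))
```

Varying `p_obscure` between training and evaluation gives the two protocols the abstract describes: train with partial observations and evaluate with more complete ones, or the reverse.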