Paper Reading (DRQN): Deep Recurrent Q-Learning for Partially Observable MDPs

A brief summary of the paper:
a. Contribution: the DRQN architecture, i.e. DQN + LSTM.
b. Motivation: DQN has two limitations: (1) its replay/frame memory is limited, and (2) it requires the complete game screen at every decision point.
c. Modification: replace DQN's first fully-connected layer with an LSTM layer.

Introduction:
DQN takes only the past four frames (four images) as input. If a game requires memory reaching back more than four frames, the task becomes a Partially Observable Markov Decision Process (POMDP). Pong is an example: a single frame shows the positions of the paddles and the ball but not the ball's velocity, so the state information is incomplete and noisy. Choosing a direction of movement requires observing the ball's motion over time rather than only the current state (the ball's position at one instant), which violates the Markov assumption that the future depends only on the current state.

Because DQN's performance degrades when it faces incomplete states (incomplete state), an LSTM is introduced to compensate for flickering game screens and for the convolutional layers' inability to perceive velocity; DRQN is therefore better at coping with missing information.
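A rough PyTorch sketch of this change (the convolutional stack mirrors the common DQN layout; the exact sizes and names here are illustrative assumptions rather than the paper's reference implementation):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """DQN convolutional stack with the first fully-connected layer replaced by an LSTM."""
    def __init__(self, n_actions, hidden_size=512):
        super().__init__()
        # Same style of convolutional front end as DQN, applied to one 84x84 frame per step.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # The LSTM integrates information across time steps instead of relying on frame stacking.
        self.lstm = nn.LSTM(input_size=64 * 7 * 7, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84)
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.head(out), hidden  # Q-values per time step, plus the recurrent state

# Example: q_values, h = DRQN(n_actions=6)(torch.zeros(2, 4, 1, 84, 84))
```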

DQN:
Its biggest difference from standard Q-Learning is that DQN no longer builds a complete Q-table at initialization. Instead, the Q-value of every observed state is produced by a neural network: the features of the current state are fed in, the network outputs a Q-value for each action, and actions are then selected based on these Q-values.
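As a minimal sketch of this idea (the network size and the helper name `select_action` are illustrative assumptions):

```python
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps state features to one Q-value per action, replacing the Q-table."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def select_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy selection based on the predicted Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()
```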

The loss is the squared TD error between the target network's estimate and the current (evaluation) network's estimate:

$L(\theta)=\mathbb{E}\left[\left(r+\gamma\max_{a'}\hat{Q}(s',a';\theta^{-})-Q(s,a;\theta)\right)^{2}\right]$
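As a sketch under the same notation (the tensor names and the use of a Huber loss here are assumptions, not prescribed by the paper):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """TD loss: online Q(s, a) vs. reward + discounted max target-network Q(s', a')."""
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    return F.smooth_l1_loss(q_sa, target)
```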

# Deep Reinforcement Learning for Keras

[![Build Status](https://api.travis-ci.org/matthiasplappert/keras-rl.svg?branch=master)](https://travis-ci.org/matthiasplappert/keras-rl) [![Documentation](https://readthedocs.org/projects/keras-rl/badge/)](http://keras-rl.readthedocs.io/) [![License](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/matthiasplappert/keras-rl/blob/master/LICENSE) [![Join the chat at https://gitter.im/keras-rl/Lobby](https://badges.gitter.im/keras-rl/Lobby.svg)](https://gitter.im/keras-rl/Lobby)

## What is it?

`keras-rl` implements some state-of-the art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library [Keras](http://keras.io). Just like Keras, it works with either [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/), which means that you can train your algorithm efficiently either on CPU or GPU. Furthermore, `keras-rl` works with [OpenAI Gym](https://gym.openai.com/) out of the box. This means that evaluating and playing around with different algorithms is easy.

Of course you can extend `keras-rl` according to your own needs. You can use built-in Keras callbacks and metrics or define your own. Even more so, it is easy to implement your own environments and even algorithms by simply extending some simple abstract classes.

In a nutshell: `keras-rl` makes it really easy to run state-of-the-art deep reinforcement learning algorithms, uses Keras and thus Theano or TensorFlow and was built with OpenAI Gym in mind.

## What is included?

As of today, the following algorithms have been implemented:

- Deep Q Learning (DQN) [[1]](http://arxiv.org/abs/1312.5602), [[2]](http://home.uchicago.edu/~arij/journalclub/papers/2015_Mnih_et_al.pdf)
- Double DQN [[3]](http://arxiv.org/abs/1509.06461)
- Deep Deterministic Policy Gradient (DDPG) [[4]](http://arxiv.org/abs/1509.02971)
- Continuous DQN (CDQN or NAF) [[6]](http://arxiv.org/abs/1603.00748)
- Cross-Entropy Method (CEM) [[7]](http://learning.mpi-sws.org/mlss2016/slides/2016-MLSS-RL.pdf), [[8]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf)
- Dueling network DQN (Dueling DQN) [[9]](https://arxiv.org/abs/1511.06581)
- Deep SARSA [[10]](http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf)

You can find more information on each agent in the [wiki](https://github.com/matthiasplappert/keras-rl/wiki/Agent-Overview).

I'm currently working on the following algorithms, which can be found on the `experimental` branch:

- Asynchronous Advantage Actor-Critic (A3C) [[5]](http://arxiv.org/abs/1602.01783)

Notice that these are **only experimental** and might currently not even run.

## How do I install it and how do I get started?

Installing `keras-rl` is easy. Just run the following commands and you should be good to go:

```bash
pip install keras-rl
```

This will install `keras-rl` and all necessary dependencies.

If you want to run the examples, you'll also have to install `gym` by OpenAI. Please refer to [their installation instructions](https://github.com/openai/gym#installation). It's quite easy and works nicely on Ubuntu and Mac OS X. You'll also need the `h5py` package to load and save model weights, which can be installed using the following command:

```bash
pip install h5py
```

Once you have installed everything, you can try out a simple example:

```bash
python examples/dqn_cartpole.py
```

This is a very simple example and it should converge relatively quickly, so it's a great way to get started! It also visualizes the game during training, so you can watch it learn. How cool is that?

Unfortunately, the documentation of `keras-rl` is currently almost non-existent. However, you can find a couple of more examples that illustrate the usage of both DQN (for tasks with discrete actions) as well as for DDPG (for tasks with continuous actions). While these examples are not replacement for a proper documentation, they should be enough to get started quickly and to see the magic of reinforcement learning yourself. I also encourage you to play around with other environments (OpenAI Gym has plenty) and maybe even try to find better hyperparameters for the existing ones.

If you have questions or problems, please file an issue or, even better, fix the problem yourself and submit a pull request!

## Do I have to train the models myself?

Training times can be very long depending on the complexity of the environment. [This repo](https://github.com/matthiasplappert/keras-rl-weights) provides some weights that were obtained by running (at least some) of the examples that are included in `keras-rl`. You can load the weights using the `load_weights` method on the respective agents.

## Requirements

- Python 2.7
- [Keras](http://keras.io) >= 1.0.7

That's it. However, if you want to run the examples, you'll also need the following dependencies:

- [OpenAI Gym](https://github.com/openai/gym)
- [h5py](https://pypi.python.org/pypi/h5py)

`keras-rl` also works with [TensorFlow](https://www.tensorflow.org/). To find out how to use TensorFlow instead of [Theano](http://deeplearning.net/software/theano/), please refer to the [Keras documentation](http://keras.io/#switching-from-theano-to-tensorflow).

## Documentation

We are currently in the process of getting a proper documentation going. [The latest version of the documentation is available online](http://keras-rl.readthedocs.org). All contributions to the documentation are greatly appreciated!

## Support

You can ask questions and join the development discussion:

- On the [Keras-RL Google group](https://groups.google.com/forum/#!forum/keras-rl-users).
- On the [Keras-RL Gitter channel](https://gitter.im/keras-rl/Lobby).

You can also post **bug reports and feature requests** (only!) in [Github issues](https://github.com/matthiasplappert/keras-rl/issues).

## Running the Tests

To run the tests locally, you'll first have to install the following dependencies:

```bash
pip install pytest pytest-xdist pep8 pytest-pep8 pytest-cov python-coveralls
```

You can then run all tests using this command:

```bash
py.test tests/.
```

If you want to check if the files conform to the PEP8 style guidelines, run the following command:

```bash
py.test --pep8
```

## Citing

If you use `keras-rl` in your research, you can cite it as follows:

```bibtex
@misc{plappert2016kerasrl,
    author = {Matthias Plappert},
    title = {keras-rl},
    year = {2016},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/matthiasplappert/keras-rl}},
}
```

## Acknowledgments

The foundation for this library was developed during my work at the [High Performance Humanoid Technologies (H²T)](https://h2t.anthropomatik.kit.edu/) lab at the [Karlsruhe Institute of Technology (KIT)](https://kit.edu). It has since been adapted to become a general-purpose library.

## References

1. *Playing Atari with Deep Reinforcement Learning*, Mnih et al., 2013
2. *Human-level control through deep reinforcement learning*, Mnih et al., 2015
3. *Deep Reinforcement Learning with Double Q-learning*, van Hasselt et al., 2015
4. *Continuous control with deep reinforcement learning*, Lillicrap et al., 2015
5. *Asynchronous Methods for Deep Reinforcement Learning*, Mnih et al., 2016
6. *Continuous Deep Q-Learning with Model-based Acceleration*, Gu et al., 2016
7. *Learning Tetris Using the Noisy Cross-Entropy Method*, Szita et al., 2006
8. *Deep Reinforcement Learning (MLSS lecture notes)*, Schulman, 2016
9. *Dueling Network Architectures for Deep Reinforcement Learning*, Wang et al., 2016
10. *Reinforcement learning: An introduction*, Sutton and Barto, 2011

## Todos

- Documentation: Work on the documentation has begun but not everything is documented in code yet. Additionally, it would be super nice to have guides for each agents that describe the basic ideas behind it.
- TRPO, priority-based memory, A3C, async DQN, ...
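Tying this back to the DQN discussion above, here is a minimal usage sketch in the spirit of the bundled `dqn_cartpole.py` example; the hyperparameters are illustrative, and the exact API may vary between `keras-rl` versions:

```python
import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

# Environment with a small discrete action space.
env = gym.make('CartPole-v0')
nb_actions = env.action_space.n

# Simple feed-forward Q-network; keras-rl prepends a window dimension to observations.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16, activation='relu'))
model.add(Dense(nb_actions, activation='linear'))

memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

dqn.fit(env, nb_steps=50000, visualize=False, verbose=2)
dqn.test(env, nb_episodes=5, visualize=True)
```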
### Improvements to Deep Q-Learning and recent developments

#### Experience replay for better generalization

To overcome the strong correlation between consecutive samples in traditional Q-learning, experience replay was introduced. The state-transition tuples $(s, a, r, s')$ produced by the agent's past interactions with the environment are stored in a replay memory $D$, and small minibatches are drawn from it at random for parameter updates. This breaks the sequential dependence between samples and makes training more stable and efficient[^2].

#### Double network structure to reduce estimation bias

To address overestimated action values and inaccurate value estimates in the original DQN, the Double DQN framework was proposed. In addition to the online network for the current policy, a fixed target network $\hat{Q}$ is kept and only synchronized when a condition is met (e.g., after a certain number of iterations); in Double DQN the online network selects the greedy next action while the target network evaluates it. This reduces the positive bias in the TD target and avoids the oscillation caused by updating the target too frequently[^3].

#### Multi-step returns for better long-term planning

Single-step returns often fail to capture complex sequential decision patterns, so some work borrows the idea of n-step TD: the discounted rewards accumulated over several future time steps are folded into the learning target. This strengthens the agent's awareness of long-term consequences and improves the quality of its decisions from a global perspective.

A simplified sketch of a replay-based learning step follows; the original placeholders are filled in with a standard one-step target, and the network/optimizer wiring is illustrative:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F


class Agent:
    def __init__(self, q_net, target_net, optimizer):
        self.memory = deque(maxlen=10000)  # replay buffer D of (s, a, r, s') transitions
        self.q_net = q_net                 # online network
        self.target_net = target_net       # fixed target network, synced periodically
        self.optimizer = optimizer

    def learn(self, batch_size, gamma):
        # Sample a random minibatch to break the temporal correlation between transitions.
        transitions = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states = map(torch.stack, zip(*transitions))

        # Q(s, a) predicted by the online network for the actions actually taken.
        current_q_values = self.q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

        # One-step target r + gamma * max_a' Q_target(s', a'); an n-step variant would
        # instead sum the discounted rewards of n consecutive transitions before bootstrapping.
        with torch.no_grad():
            expected_q_values = rewards + gamma * self.target_net(next_states).max(dim=1).values

        loss = F.smooth_l1_loss(current_q_values, expected_q_values)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
```

#### Distributed architectures for large-scale training

Modern RL tasks often involve huge observation and action spaces, so a single device can hardly meet the computational demand, and building distributed training platforms on cluster resources has become a natural trend. For example, Google's IMPALA (Importance Weighted Actor-Learner Architecture) combines a distributed actor-learner setup in the actor-critic family with recurrent network units, achieving efficient cross-node cooperation while maintaining good convergence properties.
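To make the Double DQN update above concrete, here is a small sketch of the target computation (a generic illustration; the tensor names are assumptions):

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online network picks the next action, the target network evaluates it."""
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # action selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # action evaluation
        return rewards + gamma * (1.0 - dones) * next_q
```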