强化学习 7—— 一文读懂 Deep Q-Learning(DQN)算法

上篇文章强化学习——状态价值函数逼近介绍了价值函数逼近(Value Function Approximation,VFA)的理论,本篇文章介绍大名鼎鼎的DQN算法。DQN算法是 DeepMind 团队在2015年提出的算法,对于强化学习训练苦难问题,其开创性的提出了两个解决办法,在atari游戏上都有不俗的表现。论文发表在了 Nature 上,此后的一些DQN相关算法都是在其基础上改进,可以说是打开了深度强化学习的大门,意义重大。

论文地址:Mnih, Volodymyr; et al. (2015). Human-level control through deep reinforcement learning

一、DQN简介

其实DQN就是 Q-Learning 算法 + 神经网络。我们知道,Q-Learning 算法需要维护一张 Q 表格,按照下面公式来更新:
Q ( S t , A t ) ← Q ( S t , A t ) + α [ R t + 1 + γ    m a x a Q ( S t + 1 , a ) − Q ( S t , A t ) ] Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \; max_aQ(S_{t+1}, a) - Q(S_t, A_t)] Q(St,At)Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)Q(St,At)]
然后学习的过程就是更新 这张 Q表格,如下图所示:
在这里插入图片描述

而DQN就是用神经网络来代替这张 Q 表格,其余相同,如下图:
在这里插入图片描述

其更新方式为:
Q ( S t , A t , w ) ← Q ( S t , A t , w ) + α [ R t + 1 + γ    m a x a    q ^ ( s t + 1 , a t , w ) − Q ( S t , A t , w ) ] Q(S_t, A_t, w) \leftarrow Q(S_t, A_t, w) + \alpha[R_{t+1} + \gamma\;max_a\; \hat{q}(s_{t+1}, a_t, w) - Q(S_t, A_t, w)] Q(St,At,w)Q(St,At,w)+α[Rt+1+γmaxaq^(st+1,at,w)Q(St,At,w)]

其中 Δ w \Delta w Δw :
Δ w = α ( R t + 1 + γ    m a x a    q ^ ( s t + 1 , a t , w ) − q ^ ( s t , a t , w ) ) ⋅ ∇ w q ^ ( s t , a t , w ) \Delta w = \alpha ({\color{red}R_{t+1} + \gamma\;max_a\; \hat{q}(s_{t+1}, a_t, w)} - \hat{q}{(s_t, a_t, w)})\cdot \nabla_w\hat{q}{(s_t, a_t, w)} Δw=α(Rt+1+γmaxaq^(st+1,at,w)q^(st,at,w))wq^(st,at,w)

二、Experience replay

DQN 第一个特色是使用 Experience replay ,也就是经验回放,为何要用经验回放?还请看下文慢慢详述

对于网络输入,DQN 算法是把整个游戏的像素作为 神经网络的输入,具体网络结构如下图所示:

在这里插入图片描述

第一个问题就是样本相关度的问题,因为在强化学习过程中搜集的数据就是一个时序的玩游戏序列,游戏在像素级别其关联度是非常高的,可能只是在某一处特别小的区域像素有变化,其余像素都没有变化,所以不同时序之间的样本的关联度是非常高的,这样就会使得网络学习比较困难。

DQN的解决办法就是 经验回放(Experience replay)。具体来说就是用一块内存空间 D D D ,用来存储每次探索获得数据 < s t , a t , r t , s t + 1 > <s_t, a_t, r_t, s_{t+1}> <st,at,rt,st+1> 。然后按照以下步骤重复进行:

  • sample:从 D D D 中取出一个 batch 的数据 ( s , a , r , s ′ ) ∈ D (s, a, r, s') \in D (s,a,r,s)D
  • 对于取出的样本数据计算 Target 值: r + γ    m a x a ′ Q ^ ( s ′ , a ′ , w ) r + \gamma\; max_{a'}\hat{Q}(s',a',w) r+γmaxaQ^(s,a,w)
  • 使用随机梯度下降来更新网络权重 w:

Δ w = α ( r + γ    m a x a ′    Q ^ ( s ′ , a ′ , w ) − Q ^ ( s , a , w ) ) ⋅ ∇ w Q ^ ( s , a , w ) \Delta w = \alpha (r + \gamma\;max_{a'}\; \hat{Q}(s', a', w) - \hat{Q}{(s, a, w)})\cdot \nabla_w\hat{Q}{(s, a, w)} Δw=α(r+γmaxaQ^(s,a,w)Q^(s,a,w))wQ^(s,a,w)

利用经验回放,可以充分发挥 off-policy 的优势,behavior policy 用来搜集经验数据,而 target policy 只专注于价值最大化。

三、Fixed Q targets

第二个问题是前文已经说过的,使用 q ^ ( s t , a t , w ) \hat{q}(s_t, a_t, w) q^(st,at,w) 来代替 TD Target,也就是说在TD Target 中已经包含了要优化的 参数 w。也就是说在训练的时候 监督数据 target 是不固定的,所以就使得训练比较困难。

想象一下,把 要估计的 Q ^ \hat{Q} Q^ 叫做 Q estimation,把它看做一只猫。把监督数据 Q Target 看做是一只老鼠,现在可以把训练的过程看做猫捉老鼠的过程(不断减少之间的距离,类比于 Q estimation 网络拟合 Q Target 的过程)。现在问题是猫和老鼠都在移动,这样猫想要捉住老鼠是比较困难的,如下所示:

在这里插入图片描述

那么让老鼠在一段时间间隔内不动(固定住),而这期间,猫是可以动的,这样就比较容易抓住老鼠了。在 DQN 中也是这样解决的,有两套一样的网络,分别是 Q estimation 网络和 Q Target 网络。要做的就是固定住 Q target 网络,那如何固定呢?比如可以让 Q estimation 网路训练10次,然后把 Q estimation 网络更新后的参数 w 赋给 Q target 网络。然后再让Q estimation 网路训练10次,如此往复下去,试想如果不固定 Q Target 网络,两个网络都在不停地变化,这样 拟合是很困难的,如果让 Q Target 网络参数一段时间固定不变,那么拟合过程就会容易很多。下面是 DQN 算法流程图:

在这里插入图片描述

如上图所示,首先智能体不断与环境交互,获取交互数据 < s , a , r , s ′ > <s,a,r,s'> <s,a,r,s>存入replay memory ,当经验池中有足够多的数据后,从经验池中 随机取出一个 batch_size 大小的数据,利用当前网络计算 Q的预测值,使用 Q-Target 网络计算出 Q目标值,然后计算两者之间的损失函数,利用梯度下降来更新当前 网络参数,重复若干次后,把当前网络的参数 复制给 Q-Target 网络。

关于DQN的实现代码部分下篇介绍

参考资料:

# Deep Reinforcement Learning for Keras [![Build Status](https://api.travis-ci.org/matthiasplappert/keras-rl.svg?branch=master)](https://travis-ci.org/matthiasplappert/keras-rl) [![Documentation](https://readthedocs.org/projects/keras-rl/badge/)](http://keras-rl.readthedocs.io/) [![License](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/matthiasplappert/keras-rl/blob/master/LICENSE) [![Join the chat at https://gitter.im/keras-rl/Lobby](https://badges.gitter.im/keras-rl/Lobby.svg)](https://gitter.im/keras-rl/Lobby) ## What is it? `keras-rl` implements some state-of-the art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library [Keras](http://keras.io). Just like Keras, it works with either [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/), which means that you can train your algorithm efficiently either on CPU or GPU. Furthermore, `keras-rl` works with [OpenAI Gym](https://gym.openai.com/) out of the box. This means that evaluating and playing around with different algorithms is easy. Of course you can extend `keras-rl` according to your own needs. You can use built-in Keras callbacks and metrics or define your own. Even more so, it is easy to implement your own environments and even algorithms by simply extending some simple abstract classes. In a nutshell: `keras-rl` makes it really easy to run state-of-the-art deep reinforcement learning algorithms, uses Keras and thus Theano or TensorFlow and was built with OpenAI Gym in mind. ## What is included? As of today, the following algorithms have been implemented: - Deep Q Learning (DQN) [[1]](http://arxiv.org/abs/1312.5602), [[2]](http://home.uchicago.edu/~arij/journalclub/papers/2015_Mnih_et_al.pdf) - Double DQN [[3]](http://arxiv.org/abs/1509.06461) - Deep Deterministic Policy Gradient (DDPG) [[4]](http://arxiv.org/abs/1509.02971) - Continuous DQN (CDQN or NAF) [[6]](http://arxiv.org/abs/1603.00748) - Cross-Entropy Method (CEM) [[7]](http://learning.mpi-sws.org/mlss2016/slides/2016-MLSS-RL.pdf), [[8]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf) - Dueling network DQN (Dueling DQN) [[9]](https://arxiv.org/abs/1511.06581) - Deep SARSA [[10]](http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf) You can find more information on each agent in the [wiki](https://github.com/matthiasplappert/keras-rl/wiki/Agent-Overview). I'm currently working on the following algorithms, which can be found on the `experimental` branch: - Asynchronous Advantage Actor-Critic (A3C) [[5]](http://arxiv.org/abs/1602.01783) Notice that these are **only experimental** and might currently not even run. ## How do I install it and how do I get started? Installing `keras-rl` is easy. Just run the following commands and you should be good to go: ```bash pip install keras-rl ``` This will install `keras-rl` and all necessary dependencies. If you want to run the examples, you'll also have to install `gym` by OpenAI. Please refer to [their installation instructions](https://github.com/openai/gym#installation). It's quite easy and works nicely on Ubuntu and Mac OS X. You'll also need the `h5py` package to load and save model weights, which can be installed using the following command: ```bash pip install h5py ``` Once you have installed everything, you can try out a simple example: ```bash python examples/dqn_cartpole.py ``` This is a very simple example and it should converge relatively quickly, so it's a great way to get started! It also visualizes the game during training, so you can watch it learn. How cool is that? Unfortunately, the documentation of `keras-rl` is currently almost non-existent. However, you can find a couple of more examples that illustrate the usage of both DQN (for tasks with discrete actions) as well as for DDPG (for tasks with continuous actions). While these examples are not replacement for a proper documentation, they should be enough to get started quickly and to see the magic of reinforcement learning yourself. I also encourage you to play around with other environments (OpenAI Gym has plenty) and maybe even try to find better hyperparameters for the existing ones. If you have questions or problems, please file an issue or, even better, fix the problem yourself and submit a pull request! ## Do I have to train the models myself? Training times can be very long depending on the complexity of the environment. [This repo](https://github.com/matthiasplappert/keras-rl-weights) provides some weights that were obtained by running (at least some) of the examples that are included in `keras-rl`. You can load the weights using the `load_weights` method on the respective agents. ## Requirements - Python 2.7 - [Keras](http://keras.io) >= 1.0.7 That's it. However, if you want to run the examples, you'll also need the following dependencies: - [OpenAI Gym](https://github.com/openai/gym) - [h5py](https://pypi.python.org/pypi/h5py) `keras-rl` also works with [TensorFlow](https://www.tensorflow.org/). To find out how to use TensorFlow instead of [Theano](http://deeplearning.net/software/theano/), please refer to the [Keras documentation](http://keras.io/#switching-from-theano-to-tensorflow). ## Documentation We are currently in the process of getting a proper documentation going. [The latest version of the documentation is available online](http://keras-rl.readthedocs.org). All contributions to the documentation are greatly appreciated! ## Support You can ask questions and join the development discussion: - On the [Keras-RL Google group](https://groups.google.com/forum/#!forum/keras-rl-users). - On the [Keras-RL Gitter channel](https://gitter.im/keras-rl/Lobby). You can also post **bug reports and feature requests** (only!) in [Github issues](https://github.com/matthiasplappert/keras-rl/issues). ## Running the Tests To run the tests locally, you'll first have to install the following dependencies: ```bash pip install pytest pytest-xdist pep8 pytest-pep8 pytest-cov python-coveralls ``` You can then run all tests using this command: ```bash py.test tests/. ``` If you want to check if the files conform to the PEP8 style guidelines, run the following command: ```bash py.test --pep8 ``` ## Citing If you use `keras-rl` in your research, you can cite it as follows: ```bibtex @misc{plappert2016kerasrl, author = {Matthias Plappert}, title = {keras-rl}, year = {2016}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/matthiasplappert/keras-rl}}, } ``` ## Acknowledgments The foundation for this library was developed during my work at the [High Performance Humanoid Technologies (H²T)](https://h2t.anthropomatik.kit.edu/) lab at the [Karlsruhe Institute of Technology (KIT)](https://kit.edu). It has since been adapted to become a general-purpose library. ## References 1. *Playing Atari with Deep Reinforcement Learning*, Mnih et al., 2013 2. *Human-level control through deep reinforcement learning*, Mnih et al., 2015 3. *Deep Reinforcement Learning with Double Q-learning*, van Hasselt et al., 2015 4. *Continuous control with deep reinforcement learning*, Lillicrap et al., 2015 5. *Asynchronous Methods for Deep Reinforcement Learning*, Mnih et al., 2016 6. *Continuous Deep Q-Learning with Model-based Acceleration*, Gu et al., 2016 7. *Learning Tetris Using the Noisy Cross-Entropy Method*, Szita et al., 2006 8. *Deep Reinforcement Learning (MLSS lecture notes)*, Schulman, 2016 9. *Dueling Network Architectures for Deep Reinforcement Learning*, Wang et al., 2016 10. *Reinforcement learning: An introduction*, Sutton and Barto, 2011 ## Todos - Documentation: Work on the documentation has begun but not everything is documented in code yet. Additionally, it would be super nice to have guides for each agents that describe the basic ideas behind it. - TRPO, priority-based memory, A3C, async DQN, ...
评论 5
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值