[Reinforcement Learning] A Step-by-Step Guide to Deep Q-Networks (DQN)

1 Q-learning and Deep Q-learning

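Q-learning is an algorithm for training a Q-function, an action-value function that gives the value of being in a particular state and taking a particular action in that state. In Q-learning the Q-function is stored as a table, with states along the horizontal axis and actions along the vertical axis; each cell holds the value of taking that action in that state. When there are too many states to represent as a table or array, the better approach is to approximate the Q-values with a parameterized Q-function. Because neural networks are good at modeling complex functions, we can use a neural network (a Deep Q-Network) to estimate the Q-function.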

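DQN works on much the same principle as Q-learning. It starts from arbitrary Q-value estimates and explores the environment with an ε-greedy policy. At its core, each iterative update uses a pair of actions: the current action with its current Q-value $Q(S_t, A_t)$, and the target action with its target Q-value $Q(S_{t+1}, a)$. The gap between the two is what drives the improvement of the Q-value estimates.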
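
To make the two ideas in this section concrete, here is a minimal sketch of ε-greedy action selection and a single tabular Q-learning update. The function names, the learning rate `alpha`, the discount `gamma`, and the toy table values are all assumptions chosen for illustration; they do not come from any particular library.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Move Q(s, a) a fraction alpha toward r + gamma * max_a' Q(s_next, a')."""
    target = r + gamma * max(q_table[s_next])
    q_table[s][a] += alpha * (target - q_table[s][a])
    return target

# Toy table (assumed values): 3 states x 2 actions, estimates start at 0.
q_table = [[0.0, 0.0] for _ in range(3)]
q_table[1] = [0.5, 2.0]

# epsilon = 0 means "always greedy", so action 1 is chosen in state 1.
action = epsilon_greedy(q_table[1], epsilon=0.0)

# One update for the transition (s=0, a=1, r=1.0, s_next=1).
target = q_learning_update(q_table, s=0, a=1, r=1.0, s_next=1)
```

DQN keeps the same target, but replaces the table lookup `q_table[s]` with the output of a neural network.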

2 Components of DQN

DQN has three main components: the Q network, the Target network, and the Experience Replay component.

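The Q network is the one that is trained to produce the best state-action values. The Target network computes the Q-values of the actions available in the next state. Experience Replay interacts with the environment and generates the data used to train the network.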

3 Key Techniques in DQN (Important)

DQN relies on three key innovations:

  • Experience Replay: makes more efficient use of past experience;
  • Fixed Q-Target: stabilizes training and speeds up convergence;
  • Double Deep Q-Learning: addresses the overestimation of Q-values.

3.1 Experience Replay

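The Experience Replay component interacts with the environment using an ε-greedy policy: with probability ε it takes a random action, and otherwise it takes the action with the highest estimated value in the current state. It receives the reward and the next state from the environment and stores each observation as a training sample (Current State, Action, Reward, Next State). When the network is trained, samples are drawn at random from this stored data. This raises a question: why not simply let the agent keep interacting with the environment and train the network on each new sample as it arrives?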

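Recall that when we train a neural network, we usually shuffle the training data and then pick a batch of samples. This keeps the training data diverse enough for the network to learn meaningful weights, generalize well, and handle a range of data values.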

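Take a robot learning to navigate a factory floor. Suppose that at some point it is trying to get around one corner of the factory; for the next several moves, every action it takes is confined to that area. If the network learns from this run of consecutive actions, it updates its weights to handle that one local spot and learns nothing about the rest of the factory. If the robot is later moved somewhere else, learning over the next stretch of time concentrates on the new location, and the network may forget everything it learned at the original one. Sampling at random from the stored experience avoids this, because each batch mixes samples from many different parts of the environment.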
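
The storage-and-random-sampling idea can be sketched in a few lines. The class name `ReplayBuffer`, its methods, and the capacity limit are assumptions made for illustration; the text above only requires that (Current State, Action, Reward, Next State) samples are stored and later drawn at random.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) samples."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently drops the oldest sample once it is full.
        self.samples = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.samples.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive,
        # highly correlated transitions.
        return self.rng.sample(list(self.samples), batch_size)

    def __len__(self):
        return len(self.samples)

buffer = ReplayBuffer(capacity=5, seed=0)
for t in range(8):
    # Placeholder transitions; a real agent would get these from the environment.
    buffer.add(state=t, action=t % 2, reward=1.0, next_state=t + 1)

batch = buffer.sample(batch_size=3)
```

With a capacity of 5 and 8 insertions, only the 5 most recent transitions remain, and each batch is a random subset of them rather than the last few steps in order.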

3.2 Fixed Q-Target

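DQN has two neural networks, but only the Q network is continually trained. The Target network simply copies the Q network's parameters at fixed intervals and is used to compute the Q-Target. So why not build DQN with a single network?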

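We could in fact build a DQN with only a Q network and no Target network. In that case we make two passes through the Q network: the first outputs the predicted Q-value, $Q(S_t, A_t)$, and the second outputs the target Q-value, $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$. This creates a potential problem. The Q network's weights are updated at every time step to improve the predicted Q-value, but because the same network and weights also produce the target, every update changes the target Q-value as well. The targets never hold still from one update to the next, which is like chasing a moving target.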

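Using a second network that is not trained keeps the target Q-values stable for a short period. Those target values are still predictions, though, and they need to improve too, so after a preset number of time steps the Q network's weights are copied to the Target network.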
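
The periodic copy can be sketched as follows. `TinyQNetwork` is a hypothetical stand-in for a real neural network, the `+ 0.1` weight change is a placeholder for a gradient step, and `SYNC_EVERY = 3` is an arbitrary assumed interval.

```python
import copy

class TinyQNetwork:
    """Stand-in for a neural network: one weight per action, Q(s, a) = w[a] * s."""

    def __init__(self, weights):
        self.weights = list(weights)

    def q_values(self, state):
        return [w * state for w in self.weights]

def sync_target(q_net, target_net):
    """Copy the Q network's weights into the Target network."""
    target_net.weights = copy.deepcopy(q_net.weights)

SYNC_EVERY = 3  # assumed number of steps between copies

q_net = TinyQNetwork([0.0, 0.0])
target_net = TinyQNetwork(q_net.weights)  # starts as a copy of the Q network

history = []
for step in range(1, 7):
    # Placeholder for a gradient update: only the Q network changes here.
    q_net.weights = [w + 0.1 for w in q_net.weights]
    if step % SYNC_EVERY == 0:
        sync_target(q_net, target_net)
    history.append(target_net.weights[0])
```

Here `history` stays flat between copies and jumps only at steps 3 and 6, which is the "fixed for a while, then refreshed" behavior described above.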

3.3 Double Deep Q-Learning

Computing the Q-Target raises a simple question: how can we be sure that the best action for the next state is the one with the highest Q-value?

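The accuracy of the Q-values depends on which actions we have tried and which states we have explored. Early in training we do not have enough information to know the best action, so treating the action with the largest (and noisy) Q-value as the best one can produce false positives. If non-optimal actions are regularly given higher Q-values than the optimal action, learning becomes difficult.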

The solution is to use two networks when computing the Q-Target, decoupling action selection from target Q-value generation:

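  • Use the DQN network to select the best action for the next state, that is, the action with the highest Q-value: $a^* = \arg\max_a Q(S_{t+1}, a)$.
  • Use the Target network to compute the target Q-value of taking that action in the next state, plus the reward: $R_{t+1} + \gamma Q_{\text{target}}(S_{t+1}, a^*)$.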

In this way Double DQN reduces the overestimation of Q-values, which helps training go faster and makes learning more stable.
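
The difference between the plain target and the Double DQN target is easiest to see side by side. This is a sketch under assumed inputs: the function names and the Q-value lists are made up, with `online_q_next` standing for the DQN network's outputs and `target_q_next` for the Target network's outputs on the next state.

```python
def vanilla_target(reward, gamma, target_q_next):
    """Plain DQN: the Target network both selects and evaluates the action."""
    return reward + gamma * max(target_q_next)

def double_dqn_target(reward, gamma, online_q_next, target_q_next):
    """Double DQN: the DQN network selects, the Target network evaluates."""
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return reward + gamma * target_q_next[best_action]

# Assumed Q-values for the next state, one entry per action.
online_q_next = [1.0, 2.0, 1.5]   # DQN network prefers action 1
target_q_next = [1.2, 1.1, 3.0]   # Target network has a noisy spike on action 2

y_vanilla = vanilla_target(1.0, 0.9, target_q_next)
y_double = double_dqn_target(1.0, 0.9, online_q_next, target_q_next)
```

In this toy case the plain target follows the spike on action 2, while the Double DQN target uses the Target network's value for action 1, the action the DQN network actually selected, and so comes out lower.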

4 How DQN Runs

DQN runs through the following steps:

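  1. Initialize the Experience Replay component. It interacts with the environment to accumulate an initial set of training samples (Current State, Action, Reward, Next State) and stores them as training data.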

  2. Initialize the Q network's parameters and copy them to the Target network.

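  3. Use the initialized Q network together with the Experience Replay component to generate training data (the network is not trained in this step). Experience Replay passes the current state to the Q network as input. Using its randomly initialized parameters, the Q network outputs the Q-values of all actions available in that state, and the action to execute is selected with the ε-greedy policy and handed back to Experience Replay. Experience Replay executes the action in the environment, receives the next state and the reward, and stores the resulting sample alongside the data collected in step 1.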

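  4. Randomly select a batch of training data from the stored samples, for example (S1, a4, R1, S2). Feed the current state S1 into the Q network to get the Q-values of all actions in that state, and keep the Q-value for a4, $\hat{q}_4$, for computing the loss later.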

  5. Feed the next state S2 from the sample (S1, a4, R1, S2) into the Target network to compute the Q-values of all actions available in the next state, and take the largest one, $q_9$. Then compute $\text{Target Q-Value} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$, which here is $R_1 + \gamma q_9$.

  6. Compute the mean squared error (MSE) loss between $\hat{q}_4$ (the Predicted Q-Value) output by the Q network and $R_1 + \gamma q_9$ (the Target Q-Value) built from the Target network's output $q_9$.

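  7. Backpropagate the loss and update the Q network's weights by gradient descent. The Target network is not trained and stays fixed, so no loss is computed for it and no backpropagation is run through it.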

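  8. Repeat steps 3 to 7 to keep training the Q network while holding the Target network's parameters fixed; otherwise we would be chasing a constantly moving target.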

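  9. After T time steps, copy the Q network's weights to the Target network. The Target network now has the improved weights and can predict more accurate Q-values. The process then continues as before.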
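
Steps 4 to 7 can be traced with a small numeric sketch. To keep it dependency-free, plain lookup tables stand in for the two networks, so the "gradient step" is a direct adjustment of one table entry. The function names, `gamma`, `lr`, and all Q-values are assumptions for illustration, not the method of any specific library.

```python
def td_targets(batch, target_q, gamma):
    """Step 5: R + gamma * max_a Q_target(S', a) for every sample in the batch."""
    return [r + gamma * max(target_q[s_next]) for (_, _, r, s_next) in batch]

def mse_loss(predicted, targets):
    """Step 6: mean squared error between predicted and target Q-values."""
    return sum((p - y) ** 2 for p, y in zip(predicted, targets)) / len(targets)

def train_step(batch, q, target_q, gamma=0.9, lr=0.5):
    predicted = [q[s][a] for (s, a, _, _) in batch]   # step 4
    targets = td_targets(batch, target_q, gamma)      # step 5
    loss = mse_loss(predicted, targets)               # step 6
    # Step 7: move each predicted Q(s, a) toward its target. Constant factors
    # of the MSE gradient are folded into lr. target_q is never modified.
    for (s, a, _, _), p, y in zip(batch, predicted, targets):
        q[s][a] -= lr * (p - y)
    return loss

# Assumed tables: state name -> Q-value per action.
q = {"S1": [0.0, 0.0], "S2": [0.0, 0.0]}
target_q = {"S1": [0.0, 0.0], "S2": [1.0, 3.0]}

# One sample in the (Current State, Action, Reward, Next State) format.
batch = [("S1", 1, 2.0, "S2")]
loss = train_step(batch, q, target_q)
```

For this sample the target is 2.0 + 0.9 × 3.0 = 4.7, the prediction starts at 0, and after the step `q["S1"][1]` has moved halfway to the target while `target_q` is unchanged, matching the note in step 7 that only the Q network is updated.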

5 References

  1. The Deep Q-Learning Algorithm, Hugging Face Deep RL Course
  2. Reinforcement Learning Explained Visually (Part 5): Deep Q Networks, step-by-step