Q-Learning and Deep Q-Networks

1 Value Iteration

The value iteration method assumes that all states of the environment are known in advance, that we can iterate over them, and that we can store approximate values associated with them.
For state values the update step is as follows: (figure: state-value iteration update)
For action values the update step is as follows: (figure: action-value iteration update)
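The standard updates that the two figures presumably showed (written here in conventional notation; the transition probabilities $p(s'|s,a)$ and rewards $r(s,a,s')$ are assumed known) are

$$V(s) \leftarrow \max_{a} \sum_{s'} p(s'|s,a)\bigl[r(s,a,s') + \gamma V(s')\bigr]$$

for state values, and

$$Q(s,a) \leftarrow \sum_{s'} p(s'|s,a)\bigl[r(s,a,s') + \gamma \max_{a'} Q(s',a')\bigr]$$

for action values.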
Problems with this approach:

  1. The state space must be small enough that we can collect the number of samples needed to obtain good estimates of the transition dynamics.
  2. The problem is restricted to discrete action spaces.
  3. We rarely know the transition probabilities and the reward matrix for the actions.

2 Tabular Q-Learning

Idea: we do not need to iterate over every state in the state space; we have an environment that can serve as a source of real state samples. If some states never show up, we do not need to care about their values, and we can update state values using only the states obtained from the environment, which saves a lot of work. For the case with an explicit state-to-value mapping, the method has the following steps (a sketch follows below):
(figure: tabular Q-learning steps)
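A minimal sketch of the tabular Q-learning loop described above, assuming the classic gym API and a small discrete environment (the environment name and hyperparameter values are placeholders):

```python
import collections

import gym  # assumes the classic gym API: reset() -> state, step() -> (state, reward, done, info)

GAMMA = 0.9      # discount factor (placeholder value)
ALPHA = 0.2      # learning rate used to blend old and new Q estimates
EPISODES = 1000

env = gym.make("FrozenLake-v1")           # any small environment with discrete states/actions works
q_table = collections.defaultdict(float)  # maps (state, action) -> value, 0.0 by default

def best_value(state):
    # value of the best action in the given state under the current table
    return max(q_table[(state, a)] for a in range(env.action_space.n))

for _ in range(EPISODES):
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # pure random exploration, for simplicity
        new_state, reward, done, _ = env.step(action)
        # Bellman update, blended with the old value via the learning rate ALPHA
        target = reward + GAMMA * best_value(new_state)
        q_table[(state, action)] = (1 - ALPHA) * q_table[(state, action)] + ALPHA * target
        state = new_state
```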

3 Deep Q-Learning

The Q-learning method removes the need to iterate over the full state set, but it still struggles when the number of observable states is very large. For example, Atari games have a huge number of different screen states, so if we treated raw pixels as individual states we would quickly realize there are far too many states to track and estimate. In some environments the number of distinct observable states is practically infinite.
As a solution, we can use a nonlinear representation that maps both state and action to a value, i.e. model it as a regression problem, and modify the Q-learning algorithm accordingly: (figure: Q-learning with function approximation)
But this does not work very well on its own; the following problems can arise:

Interaction with the environment

We need to interact with the environment somehow to obtain data to train on. In simple environments we can act randomly, but this is not the best strategy: in some environments random actions yield no reward most of the time, or only a very small one, which means we would wait a very long time before encountering situations with large rewards. As an alternative, we can use our Q-function approximation as the source of actions.
When the Q approximation is still poor, random actions are better at the start of training because they provide more uniformly distributed information about the environment's states. As training progresses, random actions become inefficient, and we want to let the Q approximation decide how to act (a sketch of this switch follows below).
(figure: combining random and Q-driven action selection)
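A minimal sketch of this switch between random and Q-driven actions (the common epsilon-greedy scheme; the schedule constants are placeholders):

```python
import random

import numpy as np

EPS_START, EPS_END, EPS_DECAY_STEPS = 1.0, 0.02, 100_000  # placeholder schedule

def select_action(q_values, step):
    # Linearly anneal epsilon from EPS_START down to EPS_END over EPS_DECAY_STEPS steps
    epsilon = max(EPS_END, EPS_START - step * (EPS_START - EPS_END) / EPS_DECAY_STEPS)
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: random action
    return int(np.argmax(q_values))             # exploit: best action under the current Q estimate
```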

SGD optimization

We use a neural network to approximate the complex nonlinear function $Q(s,a)$, but one of the fundamental requirements of stochastic gradient optimization is that the training data be independent and identically distributed (i.i.d.). In reinforcement learning the data does not satisfy these conditions:

  1. The samples are not independent.
  2. The distribution of the training data is not the same as the distribution of samples provided by the optimal policy we want to learn; the data we have is the result of other policies.
    To deal with this, we usually draw training data from past experience instead of using only the most recent experience. This technique is called a replay buffer. The simplest implementation is a fixed-size buffer: new data is appended to the end while the oldest data is pushed out (a sketch follows below).
    The replay buffer lets us train on roughly independent data, while the data stays fresh enough that we still train on samples generated by the recent policy.
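A minimal sketch of such a fixed-size replay buffer (the class and field names are my own, not from any particular library):

```python
import collections
import random

Experience = collections.namedtuple(
    "Experience", ["state", "action", "reward", "done", "next_state"])

class ReplayBuffer:
    """Fixed-size buffer: appending beyond capacity pushes out the oldest transitions."""

    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```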

Correlation between steps

Another practical problem with the default training procedure is also related to the lack of i.i.d. data, but in a different way. The Bellman equation provides the value of $Q(s,a)$ via $Q(s',a')$. However, the states $s$ and $s'$ are only one step apart, which makes them very similar and hard for a neural network to distinguish. When we update the network's parameters to improve $Q(s,a)$, we indirectly change the values produced for $Q(s',a')$ and other nearby states, and this makes training very unstable: after updating $Q$ for state $s$, we may find on subsequent steps that $Q(s',a')$ has become worse, and trying to update it in turn corrupts our estimate of $Q(s,a)$.
To make training more stable, we use the target-network trick: we keep a copy of the network and use it to compute the $Q(s',a')$ value in the Bellman equation. The target network is synchronized with the main network only periodically, for example once every N steps. (figure: target network synchronization)
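With the target network (written $\hat{Q}$ with parameters $\theta^{-}$, as in the DQN papers), the training target for a transition $(s, a, r, s')$ becomes

$$y = r + \gamma \max_{a'} \hat{Q}(s', a'; \theta^{-}),$$

and $\theta^{-}$ is overwritten with the main network's parameters $\theta$ only once every N steps.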

The Markov property

Reinforcement learning methods are built on the MDP formalism, which assumes the environment satisfies the Markov property. But a single observation may not capture all the important information; for example, we may need velocity and direction, which one observation alone cannot provide, and this clearly violates the Markov property. A practical fix is to keep several past observations and use them together as the state (a sketch follows below).
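A minimal sketch of this idea, keeping the last k observations and returning them stacked as a single state (class and parameter names are assumptions):

```python
import collections

import numpy as np

class FrameStack:
    """Keeps the k most recent observations and returns them stacked as one state."""

    def __init__(self, k=4):
        self.k = k
        self.frames = collections.deque(maxlen=k)

    def reset(self, first_obs):
        # fill the stack with copies of the first observation at the start of an episode
        for _ in range(self.k):
            self.frames.append(first_obs)
        return np.stack(self.frames)

    def step(self, obs):
        # push the newest observation, dropping the oldest one
        self.frames.append(obs)
        return np.stack(self.frames)
```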

The final form of DQN training

Paper: Playing Atari with Deep Reinforcement Learning
Algorithm steps:
(figure: the DQN training algorithm)
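A hedged sketch of the resulting training loop, reusing the `ReplayBuffer`, `Experience`, and `select_action` helpers sketched earlier and assuming the classic 4-tuple gym step API; the network, environment, and hyperparameter values are placeholders, not the paper's exact setup:

```python
import copy

import gym
import numpy as np
import torch
import torch.nn as nn

GAMMA, BATCH_SIZE, SYNC_EVERY, LR = 0.99, 32, 1000, 1e-4  # placeholder hyperparameters

env = gym.make("CartPole-v1")             # a small environment instead of Atari, for brevity
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = nn.Sequential(nn.Linear(obs_size, 128), nn.ReLU(), nn.Linear(128, n_actions))
tgt_net = copy.deepcopy(net)              # target network: a periodically synced copy
optimizer = torch.optim.Adam(net.parameters(), lr=LR)
buffer = ReplayBuffer(capacity=10_000)    # the replay buffer sketched earlier

state = env.reset()
for step in range(100_000):
    # 1. act epsilon-greedily and store the transition
    q_row = net(torch.as_tensor(state, dtype=torch.float32)).detach().numpy()
    action = select_action(q_row, step)
    next_state, reward, done, _ = env.step(action)
    buffer.append(Experience(state, action, reward, done, next_state))
    state = env.reset() if done else next_state

    if len(buffer) < BATCH_SIZE:
        continue

    # 2. sample a batch and compute Bellman targets with the target network
    batch = buffer.sample(BATCH_SIZE)
    states, actions, rewards, dones, next_states = map(np.array, zip(*batch))
    states_t = torch.as_tensor(states, dtype=torch.float32)
    q_values = net(states_t).gather(1, torch.as_tensor(actions)[:, None]).squeeze(1)
    with torch.no_grad():
        next_q = tgt_net(torch.as_tensor(next_states, dtype=torch.float32)).max(1).values
        next_q[torch.as_tensor(dones)] = 0.0  # no future reward after a terminal state
        targets = torch.as_tensor(rewards, dtype=torch.float32) + GAMMA * next_q

    # 3. minimize the MSE between predicted Q values and the targets
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 4. periodically sync the target network with the main one
    if step % SYNC_EVERY == 0:
        tgt_net.load_state_dict(net.state_dict())
```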

4 DQN Extensions

4.1 N-step DQN

Paper: Learning to Predict by the Methods of Temporal Differences
Method:
(figure: the n-step unrolled Bellman update)
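The n-step variant replaces the one-step target with an unrolled one (this is the standard form, reconstructed here rather than taken from the original figure):

$$Q(s_t, a_t) \leftarrow r_t + \gamma r_{t+1} + \dots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} \max_{a} Q(s_{t+n}, a)$$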
Benefit: values propagate faster, which speeds up convergence.
Drawback: may fail to converge and requires tuning of the number of steps.

4.2 Double DQN

Paper: Deep Reinforcement Learning with Double Q-Learning
Idea: basic DQN tends to overestimate Q values, which can be harmful to training and sometimes produces a suboptimal policy. The root cause is the max operation in the Bellman equation. The fix is to use the training network to select the action but take its Q value from the target network. (figure: the Double DQN update)
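Written as a formula, the Double DQN target for a transition $(s, a, r, s')$ selects the action with the online network $Q$ but evaluates it with the target network $\hat{Q}$:

$$y = r + \gamma\, \hat{Q}\bigl(s', \arg\max_{a'} Q(s', a')\bigr)$$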

4.3 Noisy Networks

Paper: Noisy Networks for Exploration
Idea: learn the exploration behaviour during training instead of maintaining a separate exploration schedule.
Basic DQN explores by choosing random actions according to the hyperparameter epsilon, which is slowly decayed over time from 1.0 down to a small fraction such as 0.1 or 0.02; this parameter needs tuning to make training efficient.
Approach: add noise to the weights of the fully connected layers and adjust the noise parameters via backpropagation during training. There are two ways of adding the noise (see below):
(figure: the two ways of adding noise)
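A noisy linear layer replaces the deterministic $y = wx + b$ with learnable noise parameters (standard formulation from the paper, where $\odot$ is element-wise multiplication and $\varepsilon$ is freshly sampled noise):

$$y = (\mu^{w} + \sigma^{w} \odot \varepsilon^{w})\,x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b}$$

The two variants differ in how $\varepsilon$ is sampled: independent Gaussian noise draws one random value per weight, while factorized Gaussian noise draws one random vector for the inputs and one for the outputs and combines them, reducing the number of random values needed.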

4.4 Prioritized Replay Buffer

Paper: Prioritized Experience Replay
Basic DQN uses a replay buffer to break the correlation between consecutive transitions within an episode. The experience samples in an episode are highly correlated; most of the time the environment is "smooth" and does not change much as a result of the actions taken. However, SGD assumes that the training data is i.i.d. To deal with this, basic DQN uses a large transition buffer and samples from it uniformly at random to obtain the next training batch.
Idea: question the uniform random sampling strategy; the paper shows that assigning priorities to buffer samples according to their training loss, and sampling from the buffer in proportion to those priorities, can substantially improve both convergence speed and the quality of the policy produced by DQN.
The tricky part is keeping the balance between "unusual" samples and the rest during training: if we focus only on a small subset of the buffer, we lose the i.i.d. property and easily overfit that subset.
**The most popular way to choose priorities:** proportional to the loss computed for the sample in the Bellman update; a new sample is given the maximum priority when it is first added to the buffer, to make sure it gets sampled soon.
Adjusting sample priorities introduces bias into the data distribution (some transitions are sampled much more often than others), which needs to be compensated for so that SGD still works properly. To achieve this, the authors use sample weights that multiply each individual sample's loss. Each sample's weight is defined as $w_i = (N \cdot P(i))^{-\beta}$, where $\beta$ is another hyperparameter between 0 and 1. With $\beta = 1$ the sampling bias is fully compensated, but the authors suggest that, for better convergence, it is best to start with a $\beta$ between 0 and 1 and gradually increase it to 1 during training.
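Concretely, if $p_i$ is the priority of sample $i$, the sampling probability and the compensating weight used in the paper are

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad w_i = \bigl(N \cdot P(i)\bigr)^{-\beta},$$

where $\alpha$ controls how strongly the priorities are emphasized ($\alpha = 0$ recovers uniform sampling).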

4.5 Dueling DQN

Paper: Dueling Network Architectures for Deep Reinforcement Learning
Idea: in the network architecture, explicitly separate the state value $V(s)$ from the advantage $A(s,a)$; this brings better training stability and faster convergence.
$$Q(s,a) = V(s) + A(s,a)$$
(figure: dueling network architecture)
We also need a constraint to make sure the network learns $V(s)$ and $A(s,a)$ the way we expect: force the mean advantage over the actions in any state to be zero.
(figure: aggregation of the two streams under the zero-mean constraint)
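With the zero-mean constraint on the advantages, the two streams are combined as in the paper:

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$$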

4.6 Categorical DQN

Paper: A Distributional Perspective on Reinforcement Learning
Idea: the authors question the scalar Q value at the heart of Q-learning and try to replace it with a more general probability distribution of Q values. Both Q-learning and value iteration represent the value of an action or a state with a single number that tells us how much total reward can be obtained from that state or state-action pair. But squeezing all possible future rewards into a single value is not very realistic: in complex environments the future is stochastic, and different outcomes occur with different probabilities and different values.
In a Markov decision process the situation is even more complicated, because there is a whole sequence of decisions to make and each decision can affect the future. Working with the mean reward value discards a lot of information about the underlying dynamics, so when the underlying values have a complex distribution it can be more beneficial to work with the distribution directly rather than restricting ourselves to predicting an action's mean value.
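In the paper's notation, the scalar Bellman equation is replaced by a distributional one over the random return $Z(s,a)$:

$$Z(s,a) \overset{D}{=} R(s,a) + \gamma Z(S', A'),$$

where the equality holds in distribution and $(S', A')$ are the random next state and action. Categorical DQN (C51) approximates $Z$ with a fixed support of 51 atoms and trains with a cross-entropy loss after projecting the target distribution back onto that support.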
