Deep Reinforcement Learning (1): Deep Q-Learning

Deep Q-Learning

The reinforcement learning solution methods we have covered so far, whether dynamic programming (DP), Monte Carlo (MC), or temporal-difference (TD) methods, all work with a discrete, finite state set $S$. Problems of that size are relatively easy to solve. But what if we face a complex state set? In many cases the state is even continuous, so the set remains huge after discretization, and traditional tabular methods such as Q-Learning simply cannot keep such a large Q-table in memory.

Value Function Approximation
Because the state set is large, a feasible modeling approach is to represent the value function approximately. Mathematically, function approximation methods can be divided into parametric and non-parametric approaches, so value function estimation in reinforcement learning can likewise be parametric or non-parametric. Parametric approximation is further divided into linear and nonlinear parametric approximation.

In this section we focus on parametric approximation. The idea is to introduce a state-value function $\hat{v}$, described by parameters $\theta$, which takes a state $s$ as input and outputs the value of $s$; we write the approximate value function as $\hat{v}(s,\theta)$. Similarly, we introduce an action-value function $\hat{q}$, also described by parameters $\theta$, which takes a state $s$ and an action $a$ as input and outputs the action value; we write this approximation as $\hat{q}(s,a,\theta)$.

Once the structure of the approximating function is fixed, approximating the value function is equivalent to approximating its parameters, and updating the value function is equivalent to updating the parameters. In other words, we use data collected from experience to update the parameter values.

Fitting $\hat{v}(s,\theta)$ is a supervised learning process whose data-label pairs are $(S_t, U_t)$, where the target $U_t$ corresponds to $G_t$ in Monte Carlo methods, to $r+\gamma Q(s',a')$ in TD methods, and to $G_t^{\lambda}$ in TD($\lambda$).
The training objective is:

$$\arg\min_{\theta}\left(q(s,a)-\hat{q}(s,a,\theta)\right)^2$$
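As a concrete illustration, here is a minimal sketch of one stochastic gradient step on this objective for the simplest parametric form, a linear approximator $\hat{q}(s,a,\theta)=\theta^{\top}\phi(s,a)$. The feature map, dimensions, and numbers below are illustrative assumptions, not part of any particular algorithm.

```python
import numpy as np

def q_hat(theta, phi_sa):
    """Linear approximation: q_hat(s, a, theta) = theta . phi(s, a)."""
    return np.dot(theta, phi_sa)

def sgd_step(theta, phi_sa, target, alpha=0.01):
    """One stochastic gradient step on (target - q_hat)^2.
    For a linear approximator the gradient of q_hat w.r.t. theta is phi(s, a);
    for a neural network it would be computed by backpropagation instead."""
    td_error = target - q_hat(theta, phi_sa)
    return theta + alpha * td_error * phi_sa

# Usage with made-up numbers: theta has one weight per feature of phi(s, a),
# and the target plays the role of G_t (MC), r + gamma * Q(s', a') (TD), etc.
theta = np.zeros(8)
phi_sa = np.random.randn(8)   # assumed feature vector for some (s, a)
theta = sgd_step(theta, phi_sa, target=1.0)
```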

The value function can be approximated either linearly or nonlinearly; the most common nonlinear approximator is a neural network. Below we discuss nonlinear approximation.

DQN
This part covers DQN, the algorithm from DeepMind's Nature paper titled "Human-level control through deep reinforcement learning".

The Q-learning algorithm was proposed by Watkins in 1989, and the DQN introduced in the 2015 Nature paper is obtained by modifying Q-learning. DQN modifies Q-learning in three main ways:
(1) DQN approximates the value function with a deep convolutional neural network;
(2) DQN trains the reinforcement learning process with experience replay;
(3) DQN sets up an independent target network to handle the TD target in the temporal-difference algorithm separately.

(1) DQN approximates the action-value function with a deep convolutional neural network
Approximating the value function with a neural network is a form of nonlinear approximation, but it is still parametric approximation. The value function corresponds to a set of parameters; in a neural network the parameters are the weights of each layer, which we denote by $\theta$, so the value function is written as $Q(s,a;\theta)$. Updating the value function then really means updating the parameters $\theta$: once the network structure is fixed, $\theta$ determines the value function. The network DQN uses consists of three convolutional layers followed by two fully connected layers, as shown in the figure below:
[Figure: the DQN network architecture, three convolutional layers followed by two fully connected layers]
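A sketch of this architecture in Keras-style code. The layer sizes follow the network published in the Nature paper (32 8x8 filters with stride 4, 64 4x4 filters with stride 2, 64 3x3 filters with stride 1, then a 512-unit dense layer over a stack of four preprocessed 84x84 frames); the import paths and exact settings here are assumptions and may need adapting to your Keras/TensorFlow version.

```python
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

def build_q_network(n_actions):
    """Nature-DQN style network: a stack of 4 preprocessed 84x84 frames in,
    one Q value per action out (three conv layers + two fully connected)."""
    return Sequential([
        Conv2D(32, (8, 8), strides=4, activation='relu', input_shape=(84, 84, 4)),
        Conv2D(64, (4, 4), strides=2, activation='relu'),
        Conv2D(64, (3, 3), strides=1, activation='relu'),
        Flatten(),
        Dense(512, activation='relu'),
        Dense(n_actions, activation='linear'),
    ])
```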

(2) DQN trains the learning process with experience replay
While people sleep, the hippocampus replays the day's memories to the cerebral cortex. Inspired by this mechanism, the DeepMind researchers devised a training technique for the neural network: experience replay.
Why does experience replay make the training of the neural network converge and stay stable?
The reason is that training a neural network assumes the data are independent and identically distributed, whereas the data collected through reinforcement learning are correlated in time; training on them in sequence naturally makes the network unstable. Experience replay breaks the correlations between the data. Concretely, it works as follows:
[Figure: schematic of experience replay, where transitions are stored in a replay memory and sampled uniformly at random for training]
During learning, the agent stores its transitions in a replay memory, then draws samples from it by uniform random sampling and uses the sampled data to train the neural network. This experience replay trick breaks the correlations between the data.
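A minimal replay-buffer sketch in Python; the capacity and the layout of the stored tuple are assumptions chosen for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform-sampling experience replay: store transitions, then draw
    random minibatches to break the temporal correlation between samples."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```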

(3) DQN sets up a target network to handle the TD target in the temporal-difference algorithm separately
When the value function is approximated with a neural network, updating the value function means updating the parameters $\theta$, and the update method is gradient descent. Each value-function update therefore becomes one step of a supervised learning update, with the gradient-descent rule:

$$\theta_{t+1}=\theta_{t}+\alpha\left[r+\gamma \max_{a'} Q\left(s', a'; \theta\right)-Q(s, a; \theta)\right] \nabla Q(s, a; \theta)$$

where $r+\gamma\max_{a'}Q(s',a';\theta)$ is the TD target, and the network parameters used to compute $\max_{a'}Q(s',a';\theta)$ are the same $\theta$.

We call the network used to compute the TD target the target network. Previously, when neural networks were used to approximate value functions, the parameters $\theta$ used to compute the TD target were the same as the parameters of the value function being fitted in the gradient step, which easily introduces correlations and makes training unstable. To solve this problem, DeepMind uses a separate network with parameters $\theta^-$ to compute the TD target, while the network being fitted keeps parameters $\theta$: the network approximating the action-value function is updated at every step, whereas the network computing the TD target is updated only once every fixed number of steps.
The value-function update therefore becomes:

$$\theta_{t+1}=\theta_{t}+\alpha\left[r+\gamma \max_{a'} Q\left(s', a'; \theta^{-}\right)-Q(s, a; \theta)\right] \nabla Q(s, a; \theta)$$
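A small sketch of this two-network setup, building on the `build_q_network` sketch above; the value of $C$, the optimizer, and the number of actions are illustrative assumptions.

```python
from keras.models import clone_model

# Online network (parameters theta): updated at every gradient step.
q_net = build_q_network(n_actions=4)
q_net.compile(optimizer='adam', loss='mse')

# Target network (parameters theta^-): a frozen copy used only to compute
# the TD target, refreshed from the online network every C steps.
target_net = clone_model(q_net)
target_net.set_weights(q_net.get_weights())

C = 1000  # target-network update period (illustrative)
```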

The DQN Algorithm

Input: number of iterations $T$, state feature dimension $n$, action set $A$, step size $\alpha$, discount factor $\gamma$, exploration rate $\epsilon$, the current Q network $Q$, the target Q network $Q'$, minibatch size $m$ for gradient descent, and the target Q network update frequency $C$.
Output: the parameters of the $Q$ network.

  1. Randomly initialize the values $Q$ of all state-action pairs by randomly initializing all parameters $w$ of the current Q network, and initialize the target Q network $Q'$ with parameters $w' = w$. Empty the experience replay memory $D$.
  2. For $i$ from 1 to $T$, iterate:
     a) Initialize $S$ as the first state of the current episode and obtain its feature vector $\phi(S)$.

     b) Feed $\phi(S)$ into the Q network to obtain the Q values of all actions, and use the $\epsilon$-greedy policy to select an action $A$ from these outputs.

     c) Execute action $A$ in state $S$, obtaining the feature vector $\phi(S')$ of the new state $S'$, the reward $R$, and the termination flag $is\_end$.

     d) Store the tuple $\{\phi(S), A, R, \phi(S'), is\_end\}$ in the experience replay memory $D$.

     e) $S = S'$.

     f) Sample $m$ transitions $\{\phi(S_j), A_j, R_j, \phi(S_j'), is\_end_j\},\ j = 1, 2, \dots, m$ from the replay memory $D$ and compute the target Q value $y_j$ for each:
     $$y_j=\begin{cases}R_j & \text{if } is\_end_j \text{ is true}\\ R_j+\gamma\max_{a'}Q'\left(\phi(S_j'),a',w'\right) & \text{if } is\_end_j \text{ is false}\end{cases}$$

     g) Using the mean squared error loss $\frac{1}{m}\sum_{j=1}^{m}\left(y_j-Q(\phi(S_j),A_j,w)\right)^2$, update all parameters $w$ of the Q network by gradient backpropagation.

     h) If the step counter is a multiple of $C$ (i.e., every $C$ steps), update the target Q network parameters: $w' = w$.

     i) If $S'$ is a terminal state, the current episode ends; otherwise return to step b).

Note that the Q values in steps f) and g) of step 2 are also computed with the Q network. In addition, in practice the exploration rate $\epsilon$ needs to be decayed as the iterations proceed for the algorithm to converge well.
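Putting steps f) through h) together, here is a condensed training-step sketch. It reuses `q_net`, `target_net`, and the `ReplayBuffer` from the earlier sketches, and the batch size, discount factor, and update period are illustrative assumptions.

```python
import numpy as np

GAMMA, BATCH_SIZE = 0.99, 32

def train_step(q_net, target_net, memory, step, C=1000):
    """One DQN update: sample a minibatch, build TD targets with theta^-,
    fit the online network on the MSE loss, and periodically sync theta^-."""
    if len(memory) < BATCH_SIZE:
        return
    batch = memory.sample(BATCH_SIZE)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))

    # y_j = R_j if the transition is terminal,
    #       R_j + gamma * max_a' Q'(phi(S'_j), a'; w') otherwise
    next_q = target_net.predict(next_states, verbose=0)
    y = rewards + GAMMA * (1.0 - dones.astype(np.float32)) * next_q.max(axis=1)

    # Only the Q value of the action actually taken is pushed towards y_j;
    # the remaining outputs keep their current predictions (zero error).
    targets = q_net.predict(states, verbose=0)
    targets[np.arange(len(batch)), actions] = y
    q_net.train_on_batch(states, targets)        # q_net was compiled with an MSE loss

    if step % C == 0:                            # step h): w' = w every C steps
        target_net.set_weights(q_net.get_weights())
```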

References:
[1] https://www.cnblogs.com/pinard/p/9756075.html
[2] https://zhuanlan.zhihu.com/p/26052182

# Deep Reinforcement Learning for Keras

[![Build Status](https://api.travis-ci.org/matthiasplappert/keras-rl.svg?branch=master)](https://travis-ci.org/matthiasplappert/keras-rl) [![Documentation](https://readthedocs.org/projects/keras-rl/badge/)](http://keras-rl.readthedocs.io/) [![License](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/matthiasplappert/keras-rl/blob/master/LICENSE) [![Join the chat at https://gitter.im/keras-rl/Lobby](https://badges.gitter.im/keras-rl/Lobby.svg)](https://gitter.im/keras-rl/Lobby)

## What is it?

`keras-rl` implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library [Keras](http://keras.io). Just like Keras, it works with either [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/), which means that you can train your algorithm efficiently either on CPU or GPU. Furthermore, `keras-rl` works with [OpenAI Gym](https://gym.openai.com/) out of the box. This means that evaluating and playing around with different algorithms is easy.

Of course you can extend `keras-rl` according to your own needs. You can use built-in Keras callbacks and metrics or define your own. Even more so, it is easy to implement your own environments and even algorithms by simply extending some simple abstract classes.

In a nutshell: `keras-rl` makes it really easy to run state-of-the-art deep reinforcement learning algorithms, uses Keras and thus Theano or TensorFlow, and was built with OpenAI Gym in mind.

## What is included?

As of today, the following algorithms have been implemented:

- Deep Q Learning (DQN) [[1]](http://arxiv.org/abs/1312.5602), [[2]](http://home.uchicago.edu/~arij/journalclub/papers/2015_Mnih_et_al.pdf)
- Double DQN [[3]](http://arxiv.org/abs/1509.06461)
- Deep Deterministic Policy Gradient (DDPG) [[4]](http://arxiv.org/abs/1509.02971)
- Continuous DQN (CDQN or NAF) [[6]](http://arxiv.org/abs/1603.00748)
- Cross-Entropy Method (CEM) [[7]](http://learning.mpi-sws.org/mlss2016/slides/2016-MLSS-RL.pdf), [[8]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf)
- Dueling network DQN (Dueling DQN) [[9]](https://arxiv.org/abs/1511.06581)
- Deep SARSA [[10]](http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf)

You can find more information on each agent in the [wiki](https://github.com/matthiasplappert/keras-rl/wiki/Agent-Overview).

I'm currently working on the following algorithms, which can be found on the `experimental` branch:

- Asynchronous Advantage Actor-Critic (A3C) [[5]](http://arxiv.org/abs/1602.01783)

Notice that these are **only experimental** and might currently not even run.

## How do I install it and how do I get started?

Installing `keras-rl` is easy. Just run the following commands and you should be good to go:

```bash
pip install keras-rl
```

This will install `keras-rl` and all necessary dependencies.

If you want to run the examples, you'll also have to install `gym` by OpenAI. Please refer to [their installation instructions](https://github.com/openai/gym#installation). It's quite easy and works nicely on Ubuntu and Mac OS X.

You'll also need the `h5py` package to load and save model weights, which can be installed using the following command:

```bash
pip install h5py
```

Once you have installed everything, you can try out a simple example:

```bash
python examples/dqn_cartpole.py
```

This is a very simple example and it should converge relatively quickly, so it's a great way to get started! It also visualizes the game during training, so you can watch it learn. How cool is that?

Unfortunately, the documentation of `keras-rl` is currently almost non-existent. However, you can find a couple more examples that illustrate the usage of both DQN (for tasks with discrete actions) as well as DDPG (for tasks with continuous actions). While these examples are not a replacement for proper documentation, they should be enough to get started quickly and to see the magic of reinforcement learning yourself. I also encourage you to play around with other environments (OpenAI Gym has plenty) and maybe even try to find better hyperparameters for the existing ones.

If you have questions or problems, please file an issue or, even better, fix the problem yourself and submit a pull request!

## Do I have to train the models myself?

Training times can be very long depending on the complexity of the environment. [This repo](https://github.com/matthiasplappert/keras-rl-weights) provides some weights that were obtained by running (at least some) of the examples that are included in `keras-rl`. You can load the weights using the `load_weights` method on the respective agents.

## Requirements

- Python 2.7
- [Keras](http://keras.io) >= 1.0.7

That's it. However, if you want to run the examples, you'll also need the following dependencies:

- [OpenAI Gym](https://github.com/openai/gym)
- [h5py](https://pypi.python.org/pypi/h5py)

`keras-rl` also works with [TensorFlow](https://www.tensorflow.org/). To find out how to use TensorFlow instead of [Theano](http://deeplearning.net/software/theano/), please refer to the [Keras documentation](http://keras.io/#switching-from-theano-to-tensorflow).

## Documentation

We are currently in the process of getting a proper documentation going. [The latest version of the documentation is available online](http://keras-rl.readthedocs.org). All contributions to the documentation are greatly appreciated!

## Support

You can ask questions and join the development discussion:

- On the [Keras-RL Google group](https://groups.google.com/forum/#!forum/keras-rl-users).
- On the [Keras-RL Gitter channel](https://gitter.im/keras-rl/Lobby).

You can also post **bug reports and feature requests** (only!) in [Github issues](https://github.com/matthiasplappert/keras-rl/issues).

## Running the Tests

To run the tests locally, you'll first have to install the following dependencies:

```bash
pip install pytest pytest-xdist pep8 pytest-pep8 pytest-cov python-coveralls
```

You can then run all tests using this command:

```bash
py.test tests/.
```

If you want to check if the files conform to the PEP8 style guidelines, run the following command:

```bash
py.test --pep8
```

## Citing

If you use `keras-rl` in your research, you can cite it as follows:

```bibtex
@misc{plappert2016kerasrl,
    author = {Matthias Plappert},
    title = {keras-rl},
    year = {2016},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/matthiasplappert/keras-rl}},
}
```

## Acknowledgments

The foundation for this library was developed during my work at the [High Performance Humanoid Technologies (H²T)](https://h2t.anthropomatik.kit.edu/) lab at the [Karlsruhe Institute of Technology (KIT)](https://kit.edu). It has since been adapted to become a general-purpose library.

## References

1. *Playing Atari with Deep Reinforcement Learning*, Mnih et al., 2013
2. *Human-level control through deep reinforcement learning*, Mnih et al., 2015
3. *Deep Reinforcement Learning with Double Q-learning*, van Hasselt et al., 2015
4. *Continuous control with deep reinforcement learning*, Lillicrap et al., 2015
5. *Asynchronous Methods for Deep Reinforcement Learning*, Mnih et al., 2016
6. *Continuous Deep Q-Learning with Model-based Acceleration*, Gu et al., 2016
7. *Learning Tetris Using the Noisy Cross-Entropy Method*, Szita et al., 2006
8. *Deep Reinforcement Learning (MLSS lecture notes)*, Schulman, 2016
9. *Dueling Network Architectures for Deep Reinforcement Learning*, Wang et al., 2016
10. *Reinforcement learning: An introduction*, Sutton and Barto, 2011

## Todos

- Documentation: Work on the documentation has begun but not everything is documented in code yet. Additionally, it would be super nice to have guides for each agent that describe the basic ideas behind it.
- TRPO, priority-based memory, A3C, async DQN, ...
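As a companion to the "How do I install it and how do I get started?" section above, here is a condensed sketch of what the bundled `examples/dqn_cartpole.py` script roughly does. The keras-rl class and argument names below are recalled from the library's examples and may differ between keras-rl and Keras versions, so treat the exact signatures as assumptions.

```python
import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy

env = gym.make('CartPole-v0')
nb_actions = env.action_space.n

# Small fully connected Q network; keras-rl adds a window dimension in front.
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(16, activation='relu'),
    Dense(16, activation='relu'),
    Dense(nb_actions, activation='linear'),
])

memory = SequentialMemory(limit=50000, window_length=1)   # experience replay
policy = BoltzmannQPolicy()                               # exploration policy
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

dqn.fit(env, nb_steps=50000, visualize=True, verbose=2)   # train
dqn.test(env, nb_episodes=5, visualize=True)              # evaluate
```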