[Notes] Understanding the Learning Process of DQN (Deep Q-Learning) with a Simple Logic Diagram

DQN is the most basic deep-learning method in reinforcement learning. After working through Markov chains, the Bellman equation, and basic Q-learning, I am recording here how I came to understand DQN.

An attempt to understand the DQN (deep Q-learning) process


I. Background of DQN

Let's first review Q-learning to better see why DQN was introduced:

Q-learning: an off-policy reinforcement learning method, meaning the policy used to act can differ from the policy being evaluated and improved. It is centered on value iteration: by repeatedly playing episodes to completion, it updates the Q-value (the long-term return of taking action a in state s) with a Bellman-equation-style rule:

For a finite set of states and actions, the update rule is: Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') − Q(s, a) ], that is, new Q-value = old Q-value + learning rate × (immediate reward + discount factor × maximum Q-value of the successor state − old Q-value).

Conceptually, Q-learning maintains a Q-table that records, for every state, the Q-value of each action. The table is updated iteratively: each update pulls the entries nearest the goal toward their targets, and over successive epochs the updates propagate back to states farther and farther from the goal until the Q-values converge.
Because every state s must be recorded, this breaks down for tasks such as playing a game from pixels, where the possible pixel combinations are essentially unbounded and the states cannot all be stored. DQN therefore replaces the table with a trainable function: by updating network parameters, it learns a mapping from the endless space of input images (states) to output Q-values.
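As a minimal, illustrative sketch of the tabular update above (the state/action-space sizes and the hyperparameter values are assumptions made for the example, not taken from this article):

import numpy as np

N_STATES, N_ACTIONS = 16, 4                     # assumed sizes of the state and action spaces
ALPHA, GAMMA = 0.1, 0.9                         # assumed learning rate and discount factor

q_table = np.zeros((N_STATES, N_ACTIONS))       # the Q-table: one row per state, one column per action

def q_update(s, a, r, s_):
    # new Q = old Q + alpha * (immediate reward + gamma * max Q of successor state - old Q)
    td_target = r + GAMMA * q_table[s_].max()
    q_table[s, a] += ALPHA * (td_target - q_table[s, a])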

DQN: DQN (deep Q-learning) replaces the Q-table in Q-learning with a neural network, removing the limit on the number of states.
In other words, a fitted function outputs the Q-values: typically the network takes a state as input and outputs one Q-value per action. Similar states produce similar outputs, so the method generalizes well.



II. The Core Ideas of DQN

DQN has two core ingredients: first, the experience replay buffer; second, two neural networks with identical structure, one of which, the Q-target network, is held fixed.

1. Experience replay (the replay buffer)

The replay buffer, as the name suggests, records every step of every episode played so far: the observed state s, the action a taken in s, the reward r received after taking a, and the next state s_ reached after a. These are stored by a store_transition() method, as in the following code:

import numpy as np

MEMORY_CAPACITY = 2000                                    # replay buffer capacity
    def store_transition(self, s, a, r, s_):              # store one transition (s, a, r, s_) in the buffer
        transition = np.hstack((s, [a, r], s_))           # concatenate into one flat row
        # once the buffer is full, overwrite the oldest data
        index = self.memory_counter % MEMORY_CAPACITY     # row index where this transition goes
        self.memory[index, :] = transition                # write the transition
        self.memory_counter += 1                          # advance the transition counter

Each time the agent learns, it samples a batch of batch_size transitions from this memory and trains on them, which breaks the correlation between consecutive training samples. Once more than MEMORY_CAPACITY (2000) transitions have been stored, the oldest entries are overwritten, so the buffer always holds recent data and training reflects the latest interactions with the environment.
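A minimal sketch of this sampling step, assuming the flat memory layout used by store_transition above; BATCH_SIZE and N_STATES are hypothetical names for the batch size and the state dimension:

import numpy as np
import torch

BATCH_SIZE = 32                                                      # assumed batch size
    def sample_batch(self):                                          # draw a random batch from the replay buffer
        sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE) # random row indices
        b_memory = self.memory[sample_index, :]                      # the sampled transitions
        b_s  = torch.FloatTensor(b_memory[:, :N_STATES])             # states s
        b_a  = torch.LongTensor(b_memory[:, N_STATES:N_STATES + 1].astype(int))  # actions a
        b_r  = torch.FloatTensor(b_memory[:, N_STATES + 1:N_STATES + 2])         # rewards r
        b_s_ = torch.FloatTensor(b_memory[:, -N_STATES:])            # next states s_
        return b_s, b_a, b_r, b_s_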


2. What is the fixed Q-target?

1. Introducing eval_net and target_net

DQN contains two neural networks, eval_net and target_net, both of which grow out of the Q-table idea.
The roles of the two networks:
1. eval_net: the evaluation network, net2 in the diagram.
① It evaluates the Q-value of each action in the current state. The agent uses this network to choose an action, picking (with high probability) the action with the largest Q-value; the environment then returns the reward r and the next state s_, giving a transition (s, a, r, s_) that is stored in the replay buffer.
How is the action chosen? By the ε-greedy strategy (see the sketch after this list).
② Besides feeding the replay buffer with ε-greedy experience, eval_net also computes the predicted Q-value: given the state s as input, it outputs the predicted q_value.
2. target_net: the target network, net1 in the diagram.
It takes s_ (the next environment state) as input and outputs the Q-value of every action a in that state. Combining this with the reward r obtained from taking action a in s, the Bellman equation yields the training target, the Q-target. See the update formula in Section I.
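As promised above, here is a minimal sketch of ε-greedy action selection with eval_net; the EPSILON value and the choose_action name are assumptions, not taken from the article:

import numpy as np
import torch

EPSILON = 0.9                                              # assumed probability of exploiting the greedy action
    def choose_action(self, s):                            # epsilon-greedy action selection using eval_net
        s = torch.unsqueeze(torch.FloatTensor(s), 0)       # add a batch dimension
        if np.random.uniform() < EPSILON:                  # exploit: take the action with the largest predicted Q
            actions_value = self.eval_net(s)               # eval_net outputs one Q-value per action
            action = int(torch.argmax(actions_value, 1).item())
        else:                                              # explore: take a random action
            action = np.random.randint(0, N_ACTIONS)
        return action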

2. The essence of "fixed"

But what exactly is "fixed"? Why use two networks? The very heart of DQN is this notion of fixing one network and the way its parameters are refreshed over time. The diagram below shows the update process clearly; let me walk through it briefly.
(Figure: DQN flow diagram)
In the figure, the upper half is the experience-collection loop that fills the replay buffer, and the lower half is the training loop.

  1. Observe one or more frames of the environment to obtain the observation o, and pass it through the evaluation network eval_net (input o, parameters w2) to predict the Q-values.
  2. Use the ε-greedy strategy to choose an action a and execute it, obtaining the next observation o_. Store the tuple (o, a, r, o_) produced by steps 1 and 2 in the replay buffer, and keep storing until the buffer is full.

  1. Once the buffer is full (or each time its contents have been refreshed up to the limit), run one training step. Note that what is trained are the parameters w2 of net2, the evaluation network eval_net; the target network's parameters stay fixed throughout, and backpropagation only updates the evaluation network's parameters w2. This is what "fixed Q-target" means. The training data are n transitions (o, a, r, o_) sampled from the replay buffer.
  2. Compute the Q-target from the target network and the predicted Q-value of the chosen action from the evaluation network, form the loss between them, and backpropagate to update the parameters of the evaluation network net2. Throughout this step the target network's parameters do not change (fixing the target network keeps the Q-network's training targets from becoming unstable). A code sketch of this training step follows after this list.
  3. After training for a while, once net2's parameters have converged somewhat, copy net2's trained parameters into net1 and fix this newly copied set of parameters in turn. The cycle of steps 1–3 then repeats.
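As referenced in step 2 above, here is a minimal sketch of the training step. It assumes the batch (b_s, b_a, b_r, b_s_) is sampled as in the earlier snippet, and that the agent holds an optimizer (self.optimizer), a loss function (self.loss_func) and a step counter (self.learn_step_counter); GAMMA and TARGET_REPLACE_ITER are assumed hyperparameters:

GAMMA = 0.9                                        # assumed discount factor
TARGET_REPLACE_ITER = 100                          # assumed interval (in learn steps) for syncing the target network
    def learn(self):
        # every TARGET_REPLACE_ITER steps, copy eval_net's parameters into target_net (the "fixed Q-target")
        if self.learn_step_counter % TARGET_REPLACE_ITER == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())
        self.learn_step_counter += 1

        b_s, b_a, b_r, b_s_ = self.sample_batch()                # a random batch from the replay buffer
        q_eval = self.eval_net(b_s).gather(1, b_a)               # Q predicted by eval_net for the actions actually taken
        q_next = self.target_net(b_s_).detach()                  # target_net output; detach so no gradient flows into it
        q_target = b_r + GAMMA * q_next.max(1)[0].view(-1, 1)    # Bellman target: r + gamma * max_a' Q_target(s_, a')
        loss = self.loss_func(q_eval, q_target)                  # e.g. nn.MSELoss() between prediction and target

        self.optimizer.zero_grad()                               # backpropagation updates only eval_net's parameters
        loss.backward()
        self.optimizer.step()

Because q_next is detached and the optimizer only holds eval_net's parameters, the target network really does stay fixed between syncs.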

The code below also shows that the two networks have the same structure; their parameters diverge over time and are updated at different points in time.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):                            # define the attributes of Net
        # a subclass of nn.Module must call the parent constructor first
        super(Net, self).__init__()                # equivalent to nn.Module.__init__()

        self.fc1 = nn.Linear(N_STATES, 50)         # first fully connected layer (input to hidden): N_STATES neurons to 50 neurons
        self.fc1.weight.data.normal_(0, 0.1)       # weight initialization (normal distribution, mean 0, std 0.1)
        self.out = nn.Linear(50, N_ACTIONS)        # second fully connected layer (hidden to output): 50 neurons to N_ACTIONS neurons
        self.out.weight.data.normal_(0, 0.1)       # weight initialization (normal distribution, mean 0, std 0.1)

    def forward(self, x):                          # define the forward pass (x is the state)
        x = F.relu(self.fc1(x))                    # input layer to hidden layer, with ReLU activation on the hidden output
        actions_value = self.out(x)                # hidden layer to output layer, producing the Q-value of each action
        return actions_value                       # return the action values
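To make the "two networks with the same structure" point concrete, here is a sketch of how the agent might create and hold them; the hyperparameter values and the optimizer choice are assumptions consistent with the snippets above:

class DQN(object):
    def __init__(self):
        self.eval_net, self.target_net = Net(), Net()                # two networks with identical structure
        self.learn_step_counter = 0                                  # counts learn steps, used for target syncing
        self.memory_counter = 0                                      # counts stored transitions
        self.memory = np.zeros((MEMORY_CAPACITY, N_STATES * 2 + 2))  # each row holds one (s, a, r, s_)
        self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=0.01)  # only eval_net is trained
        self.loss_func = nn.MSELoss()                                # loss between q_eval and q_target

With this setup, target_net's parameters are only ever touched by load_state_dict in the learn step sketched earlier.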

Summary

The best way to internalize DQN is to draw the logic diagram yourself until the loop closes, and to read it alongside the code; once you can run the code and follow it end to end, the method falls into place.
