Deep Q-Learning: Deep Reinforcement Learning (Code Walkthrough)

Building the DQN

Initialization

    def __init__(self, n_actions, n_features, learning_rate, reward_decay,
                 e_greedy, replace_target_iter, memory_size, batch_size,
                 e_greedy_increment=None):
        # number of actions
        self.n_actions = n_actions
        # number of state features
        self.n_features = n_features
        # learning rate
        self.lr = learning_rate
        # reward discount factor gamma used in Q-learning
        self.gamma = reward_decay
        # upper bound of the epsilon-greedy exploitation probability
        self.epsilon_max = e_greedy
        # number of learning steps between updates of the target-network parameters
        self.replace_target_iter = replace_target_iter
        # capacity of the replay memory
        self.memory_size = memory_size
        # number of samples drawn from memory per learning step
        self.batch_size = batch_size
        self.epsilon_increment = e_greedy_increment
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max
        # counter of learning steps taken
        self.learn_step_counter = 0
        # replay memory: current n_features + next n_features + reward + action
        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))
        # history of training losses, appended to in learn()
        self.cost_his = []

        # build the eval and target networks (see _build_net() below)
        self._build_net()

        # ops that copy the eval-network parameters into the target network
        t_params = tf.get_collection('target_net_params')
        e_params = tf.get_collection('eval_net_params')
        # a list of TensorFlow ops: [tf.assign(t1, e1), tf.assign(t2, e2), tf.assign(t3, e3)]
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

Building the neural networks

Building the Q-evaluation network

    def _build_net(self):
        # input: current state
        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')
        # input: Q-target values computed in learn()
        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')

        with tf.variable_scope('eval_net'):
            # collections that the variables are added to
            c_names = ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
            # number of neurons in the first layer
            n_l1 = 10
            # weight initializer
            w_initializer = tf.random_normal_initializer(0., 0.3)
            # bias initializer
            b_initializer = tf.constant_initializer(0.1)

            # first layer
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)
            # second layer
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_eval = tf.matmul(l1, w2) + b2

        # loss: mean squared error between the Q-target and Q-eval values
        with tf.variable_scope('loss'):
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))

        # training op
        with tf.variable_scope('train'):
            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

Building the Q-target network (this code follows directly on from the block above and is still inside the _build_net() function)

    #输入
    self.s_sub = tf.placeholder(tf.float32, [None, self.n_features], name='s_sub')    
    with tf.variable_scope('target_net'):
        #collection
        c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

        #第一层神经元
        with tf.variable_scope('l1'):
            w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
            b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
            l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

        #第二层神经元
        with tf.variable_scope('l2'):
            w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
            b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
            self.q_next = tf.matmul(l1, w2) + b2
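
To see in isolation what the replace_target_op built in the initialization does, here is a small self-contained sketch. It is separate from the class above, and the toy scopes and variable shapes are illustrative assumptions; it only demonstrates that running the assign ops copies the eval-network parameters into the target network.

import numpy as np
import tensorflow as tf

tf.reset_default_graph()
with tf.variable_scope('eval_net'):
    w_eval = tf.get_variable('w', [2, 3], initializer=tf.random_normal_initializer(0., 0.3),
                             collections=['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES])
with tf.variable_scope('target_net'):
    w_target = tf.get_variable('w', [2, 3], initializer=tf.random_normal_initializer(0., 0.3),
                               collections=['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES])

t_params = tf.get_collection('target_net_params')
e_params = tf.get_collection('eval_net_params')
replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(np.allclose(*sess.run([w_eval, w_target])))  # almost surely False: independent random inits
    sess.run(replace_target_op)
    print(np.allclose(*sess.run([w_eval, w_target])))  # True: the target net now holds the eval parameters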

Storing transitions

    def store_transition(self, s, a, r, s_):
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0

        # current state as a list ==> [x, y]
        # action and reward merged into a list ==> [action, reward]
        # next state ==> [x_next, y_next]
        transition = np.hstack((s, [a, r], s_))
        # hstack result ==> [x, y, a, r, x_next, y_next]

        # once memory_size transitions have been stored, overwrite the oldest entries
        index = self.memory_counter % self.memory_size

        # memory is a 2-D array and transition a row vector; write it into row `index`
        self.memory[index, :] = transition
        self.memory_counter += 1
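
A quick NumPy-only sketch (the numbers are made up) of how a transition is packed by np.hstack and how the modulo index turns the memory array into a ring buffer:

import numpy as np

memory_size, n_features = 4, 2
memory = np.zeros((memory_size, n_features * 2 + 2))

s, a, r, s_ = [0.0, 1.0], 2, -1.0, [0.0, 2.0]
transition = np.hstack((s, [a, r], s_))
print(transition)                          # [ 0.  1.  2. -1.  0.  2.]

for memory_counter in range(6):
    index = memory_counter % memory_size   # 0, 1, 2, 3, 0, 1: old rows get overwritten
    memory[index, :] = transition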

Choosing an action

    def choose_action(self, observation):
        # turn the observation list [x, y] into a row vector [[x, y]]
        observation = observation[np.newaxis, :]

        if np.random.uniform() < self.epsilon:
            # get the estimated Q value of every action
            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
            # exploit: pick the action with the largest Q value
            action = np.argmax(actions_value)
        else:
            # explore: pick a random action
            action = np.random.randint(0, self.n_actions)
        return action
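
A tiny NumPy-only illustration of the two branches above, with a made-up Q row standing in for the output of sess.run(self.q_eval, ...):

import numpy as np

np.random.seed(1)
epsilon, n_actions = 0.9, 3
observation = np.array([0.5, -0.2])          # state [x, y]
observation = observation[np.newaxis, :]     # shape (1, 2): a batch containing one row
actions_value = np.array([[0.1, 0.7, 0.3]])  # made-up Q estimates, one per action

if np.random.uniform() < epsilon:            # exploit with probability epsilon
    action = np.argmax(actions_value)
else:                                        # otherwise explore
    action = np.random.randint(0, n_actions)
print(action)                                # 1 with this seed (the action with the largest Q value)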

The learning process

    def learn(self):
        # periodically copy the eval-network parameters into the target network
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.sess.run(self.replace_target_op)

        # sample batch_size transition indices from the filled part of the memory
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)

        # each sampled row is one transition [x, y, a, r, x_next, y_next]
        batch_memory = self.memory[sample_index, :]

        q_next, q_eval = self.sess.run(
            [self.q_next, self.q_eval],
            feed_dict={
                self.s_: batch_memory[:, -self.n_features:],  # next states -> target net (fixed params)
                self.s: batch_memory[:, :self.n_features],    # current states -> eval net (newest params)
            })

        q_target = q_eval.copy()

        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_act_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]

        # Q-learning target: r + gamma * max_a' Q_target(s', a'), written only
        # into the column of the action actually taken in each sample
        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        # train the eval network
        _, self.cost = self.sess.run([self._train_op, self.loss],
                                     feed_dict={self.s: batch_memory[:, :self.n_features],
                                                self.q_target: q_target})
        self.cost_his.append(self.cost)

        # gradually increase epsilon so the agent explores less over time
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1

A worked example of the process above

Data structures

  • n_actions = 3
  • n_features = 2
  • batch_size = 2

Structure of q_eval:

              action_0   action_1   action_2
  sample 0        1          2          1
  sample 1        2          3          2

Rows: one per sample
Columns: the Q value of each action

q_next and q_target have the same structure as q_eval.

batch_index: the index of each sample in the batch

1-D list ==> [0, 1]  # length: batch_size

eval_act_index: the action taken in each sample, i.e. the column index into that sample's row

1-D list ==> [0, 2]

reward: the reward received in each sample

1-D list ==> [1, 2]


Procedure

  1. Copy the values of q_eval into q_target.
  2. Use the Q-learning update, reward + gamma * max(q_next), to compute the target Q value for the action taken in each sample (see the NumPy sketch after this list):
    • sample 0 took action 0; suppose its target works out to -1
    • sample 1 took action 2; suppose its target works out to -2
  3. Write these targets into q_target:

              action_0   action_1   action_2
  sample 0       -1          2          1
  sample 1        2          3         -2

  4. Train the network on the difference between the updated q_target and q_eval; only the columns of the actions actually taken differ, so only they contribute to the loss.
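
The fancy-indexing update used in learn() can be reproduced in a few lines of NumPy. The q_next values and gamma below are made-up illustrative numbers (the article does not give them), so the resulting targets differ from the table above; the point is the indexing mechanism, which touches only the column of the action taken in each row:

import numpy as np

gamma = 0.9                                  # illustrative discount factor
q_eval = np.array([[1., 2., 1.],
                   [2., 3., 2.]])            # shape (batch_size, n_actions)
q_next = np.array([[0.5, 0.1, 0.2],
                   [0.3, 0.4, 0.6]])         # made-up target-net output for the next states
reward = np.array([1., 2.])
batch_index = np.arange(2)                   # [0, 1]
eval_act_index = np.array([0, 2])            # action taken in each sample

q_target = q_eval.copy()
q_target[batch_index, eval_act_index] = reward + gamma * np.max(q_next, axis=1)
print(q_target)
# [[1.45 2.   1.  ]
#  [2.   3.   2.54]]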


Running the simulation

def run_maze():
    # total number of steps taken so far (used to decide when learning starts)
    step = 0
    # play 300 episodes
    for episode in range(300):
        # reset the environment and get the initial observation
        observation = env.reset()

        while True:
            # render the environment
            env.render()

            # choose an action
            action = RL.choose_action(observation)

            # apply the action to the environment and receive:
            # the next observation observation_,
            # the reward,
            # and the episode-finished flag done
            observation_, reward, done = env.step(action)

            # store the transition
            RL.store_transition(observation, action, reward, observation_)

            # after 200 warm-up steps, learn from a random batch every 5 steps
            if (step > 200) and (step % 5 == 0):
                RL.learn()

            # move on to the next observation
            observation = observation_

            # stop the episode when the game is over
            if done:
                break

            step += 1
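
For completeness, run_maze() relies on two module-level objects, env and RL, that are not constructed above. Here is a minimal sketch of how they might be wired together, assuming a Tkinter-based Maze environment that exposes n_actions and n_features, and assuming the agent class built in the earlier sections is named DeepQNetwork (both names are assumptions, not given in the article):

from maze_env import Maze   # hypothetical module providing the maze environment

if __name__ == "__main__":
    env = Maze()
    # DeepQNetwork is the assumed name of the class assembled in the sections above
    RL = DeepQNetwork(n_actions=env.n_actions, n_features=env.n_features,
                      learning_rate=0.01, reward_decay=0.9, e_greedy=0.9,
                      replace_target_iter=200, memory_size=2000, batch_size=32)
    env.after(100, run_maze)   # schedule the training loop on the Tkinter event loop
    env.mainloop()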