Algorithms in Practice (1): Implementing the Classic DQN Algorithm in TensorFlow

samurasun

Category: Reinforcement Learning Notes. Tags: reinforcement learning, artificial intelligence

Article link: https://blog.csdn.net/samurasun/article/details/108235044

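In "Basic Algorithms (4): Solving Reinforcement Learning Problems with Value Function Approximation" we introduced the classic DQN algorithm. Today we get practical and actually implement it.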

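Implementing the Classic DQN Algorithm in TensorFlow

I. Background on the Game
II. Creating the File and Writing the Main Function
III. What the Agent Does
Summary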

I. Background on the Game

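The environment used in this implementation is the CartPole game from Gym, shown below:
[Figure: the CartPole game]
The rules are simple: move the black cart left and right so that the pole on top of it stays balanced. In Gym's CartPole, an episode ends in failure when the cart moves more than 2.4 units from the center or the pole tilts more than 12 degrees from vertical. The values 4.8 units and 24 degrees are the bounds of the observation space, which are twice the failure thresholds.
The observation space of this game is a Box(4,) object, i.e., an array of four floats, and the action space is Discrete(2), i.e., 0 or 1. These two values determine the input and output sizes of the neural network we build later.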
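As a rough illustration of what these two spaces mean in practice, the sketch below hard-codes their sizes and the failure check. It does not call Gym; the names STATE_DIM, ACTION_DIM and is_failed are made up for this example. The limits of 2.4 units and 12 degrees are the termination thresholds of Gym's CartPole, while 4.8 units and 24 degrees are the bounds of its observation space.

```python
import math

# Termination thresholds of Gym's CartPole (the observation bounds are twice these).
X_THRESHOLD = 2.4                        # cart position limit, in units from the center
THETA_THRESHOLD = 12 * math.pi / 180     # pole angle limit, in radians (12 degrees)

# Box(4,): cart position, cart velocity, pole angle, pole angular velocity
STATE_DIM = 4
# Discrete(2): 0 = push the cart left, 1 = push the cart right
ACTION_DIM = 2


def is_failed(state):
    """Return True if the cart or the pole has left the allowed range."""
    x, _x_dot, theta, _theta_dot = state
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD
```

A Q network for this game therefore takes STATE_DIM inputs and produces ACTION_DIM outputs, one Q value per action.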
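Now let's move on to writing the code.
A note: for setting up the programming environment, see "Side Story: Setting Up a Basic Reinforcement Learning Environment". TensorFlow can be installed with pip install, so we will not repeat the details here. The code in this post uses TensorFlow 1.x APIs such as tf.placeholder and tf.InteractiveSession.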

II. Creating the File and Writing the Main Function

First, create a Python file and, following the DQN algorithm introduced earlier, write the main function:

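import gym

# ENV_NAME, EPISODES, STEPS, BATCH_SIZE, UPDATE_STEP and the DQN class must be defined before this point
env = gym.make(ENV_NAME)
agent = DQN(env.observation_space, env.action_space)
for episode in range(EPISODES):
    # get the initial state (classic Gym API: reset() returns only the observation)
    state = env.reset()
    for step in range(STEPS):
        # choose an action for the current state
        action = agent.Choose_Action(state)
        # step the env forward and get the new state
        # (classic Gym API: step() returns 4 values; Gym 0.26+ returns 5)
        next_state, reward, done, info = env.step(action)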
        # store the data in order to update the network in future
        agent.Store_Data(state, action, reward, next_state, done)
        if len(agent.replay_buffer) > BATCH_SIZE:
            agent.Train_Network(BATCH_SIZE)
        if step % UPDATE_STEP == 0:
            agent.Update_Target_Network()
        state = next_state
        if done:
            break

Let's go through it step by step:

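1. Create the environment object env.
2. Create the agent object from the observation space and the action space [see (1), (2), (3)].
3. For each episode of the game:
4. First reset the environment to get an initial state.
5. Then play the game; for each step:
6. The agent chooses an action based on the state [see (4)].
7. Feed this action into the environment to advance the game by one step, obtaining the new state, the immediate reward, and a flag indicating whether the game is over.
8. Store the data from this step in a replay buffer [see (5)].
9. If the buffer holds more transitions than the batch size we sample for learning, run a training step [see (6)].
10. If the step index within the episode is a multiple of the update interval we set, update the weights of the DQN target network [see (7)].
11. Finally, if the episode has ended, move on to the next one. Once all planned episodes have been played, the run ends.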
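The control flow in the steps above can be tried without Gym or TensorFlow. The following is a minimal sketch, not the author's code: ToyEnv and CountingAgent are hypothetical stand-ins that only mimic the method names used in the main function, and the constant values are assumptions chosen to keep the run short.

```python
import random
from collections import deque

# Assumed values for illustration only; the source does not show its constants.
EPISODES, STEPS, BATCH_SIZE, UPDATE_STEP = 3, 50, 8, 5


class ToyEnv:
    """Hypothetical stand-in for a Gym env: every episode ends after 10 steps."""

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        next_state = [float(self.t), 0.0, 0.0, 0.0]
        done = self.t >= 10
        return next_state, 1.0, done, {}


class CountingAgent:
    """Hypothetical agent with the same method names as the DQN class.

    It acts randomly and only counts how often it is asked to train
    or to refresh the target network.
    """

    def __init__(self):
        self.replay_buffer = deque()
        self.train_calls = 0
        self.target_updates = 0

    def Choose_Action(self, state):
        return random.randint(0, 1)

    def Store_Data(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def Train_Network(self, batch_size):
        # Draw a random minibatch, as experience replay does; no learning happens here.
        random.sample(list(self.replay_buffer), batch_size)
        self.train_calls += 1

    def Update_Target_Network(self):
        self.target_updates += 1


def run(env, agent):
    for episode in range(EPISODES):
        state = env.reset()
        for step in range(STEPS):
            action = agent.Choose_Action(state)
            next_state, reward, done, info = env.step(action)
            agent.Store_Data(state, action, reward, next_state, done)
            if len(agent.replay_buffer) > BATCH_SIZE:
                agent.Train_Network(BATCH_SIZE)
            if step % UPDATE_STEP == 0:
                agent.Update_Target_Network()
            state = next_state
            if done:
                break
    return agent


agent = run(ToyEnv(), CountingAgent())
```

Note that training only starts once the buffer holds more than BATCH_SIZE transitions, and that the target update is keyed to the step index inside the episode, so it also fires on step 0 of every episode.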

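As the above shows, the part of the algorithm that deserves the most attention is the agent object created from the DQN class. We describe this object in detail below.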

III. What the Agent Does

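The agent in this algorithm is an instance of the DQN class. Its main job is to produce actions from the environment state while optimizing the network parameters, ultimately minimizing the TD error described in the DQN algorithm. The relevant parts are described below.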
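To make "minimizing the TD error" concrete, here is a small sketch of the standard one-step DQN target and error for a single transition. The function names and the discount factor gamma = 0.9 are assumptions for illustration; the source does not show the value it uses.

```python
def td_target(reward, next_q_values, done, gamma=0.9):
    """One-step TD target.

    If the episode ended, the target is just the reward; otherwise it is
    reward + gamma * max over actions of the target network's Q values
    for the next state.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, reward, next_q_values, done, gamma=0.9):
    """Difference between the TD target and the current estimate Q(s, a)."""
    return td_target(reward, next_q_values, done, gamma) - q_value


# Example transition: reward 1.0, target-network Q values [0.5, 2.0] for the next state,
# and a current estimate Q(s, a) = 2.0 for the action that was taken.
y = td_target(1.0, [0.5, 2.0], done=False)
loss = td_error(2.0, 1.0, [0.5, 2.0], done=False) ** 2  # squared TD error
```

Training adjusts the weights of the current network so that this squared error, averaged over a sampled batch, becomes smaller.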

(1) The DQN class initializer

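The initializer of the DQN class takes the state space and the action space as input, builds the deep neural network that approximates the Q values, and sets up the method for updating the network parameters, as follows: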

def __init__(self, observation_space, action_space):
    # the state is the input vector of network, in this env, it has four dimensions
    self.state_dim = observation_space.shape[0]
    # the action is the output vector and it has two dimensions
    self.action_dim = action_space.n
    # init experience replay; deque (from collections import deque) is used here as a first-in, first-out buffer
    self.replay_buffer = deque()
    # you can create the network by the two parameters
    self.create_Q_network()
    # after create the network, we can define the training methods
    self.create_updating_method()
    # set the value in choose_action
    self.epsilon = INITIAL_EPSILON
    # Init session
    self.session = tf.InteractiveSession()
    self.session.run(tf.global_variables_initializer())

The details of how the network is built and how its parameters are updated are described below.
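Two pieces of state set up in the initializer, the replay buffer and epsilon, can be illustrated in isolation. This is a minimal sketch under stated assumptions: the source creates an unbounded deque(), whereas the maxlen cap and the REPLAY_SIZE name here are additions for illustration; the source also does not show how epsilon is used, so the epsilon-greedy rule below is the conventional choice rather than the author's confirmed code. The constant values are made up.

```python
import random
from collections import deque

REPLAY_SIZE = 3        # assumed capacity, kept tiny so the eviction is visible
INITIAL_EPSILON = 0.5  # assumed starting exploration rate

# A deque with maxlen drops the oldest item when a new one is appended to a full buffer.
replay_buffer = deque(maxlen=REPLAY_SIZE)
for transition_id in range(5):
    replay_buffer.append(transition_id)
# The buffer now holds only the three most recent items: 2, 3, 4.


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Index of the largest Q value
    return max(range(len(q_values)), key=lambda a: q_values[a])


greedy_action = choose_action([0.1, 0.9], epsilon=0.0)
explore_action = choose_action([0.1, 0.9], epsilon=INITIAL_EPSILON)
```

With epsilon set to 0 the choice is always the action with the highest Q value; larger values trade some of that for exploration.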

(2) Building the deep neural network

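The DQN algorithm uses two networks, current_net and target_net. They share the same structure: a four-layer fully connected network (state -> 50 -> 20 -> action). In the rest of the algorithm, current_net is used to produce actions and target_net is used to compute the TD target. The code is as follows: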

def create_Q_network(self):
    # first, set the input of networks
    self.state_input = tf.placeholder("float", [None, self.state_dim])
    # second, create the current_net
    with tf.variable_scope('current_net'):
        # first, set the network's weights
        W1 = self.weight_variable([self.state_dim, 50])
        b1 = self.bias_variable([50])
        W2 = self.weight_variable([50, 20])
        b2 = self.bias_variable([20])
        W3 = self.weight_variable([20, self.action_dim])
        b3 = self.bias_variable([self.action_dim])
        # second, set the layers
        # hidden layer one
        h_layer_one = tf.nn.relu(tf.matmul(self.state_input, W1) + b1)
        # hidden layer two
        h_layer_two = tf.nn.relu(tf.matmul(h_layer_one, W2) +
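The layer sizes described in this section (state -> 50 -> 20 -> action) and the idea of two networks with identical structure can be illustrated without TensorFlow. The following pure-Python sketch is not the author's implementation: the helper names, the random initialization, the bias value and the linear output layer are all assumptions made for illustration.

```python
import copy
import random


def relu(values):
    return [max(0.0, v) for v in values]


def dense(x, W, b):
    """Fully connected layer: W[i][j] connects input i to output j."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def init_layer(n_in, n_out, rng):
    # Assumed initialization: small Gaussian weights and a small constant bias.
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    b = [0.01] * n_out
    return W, b


def create_network(state_dim, action_dim, rng):
    """Build the weights for state_dim -> 50 -> 20 -> action_dim."""
    sizes = [state_dim, 50, 20, action_dim]
    return [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def q_values(net, state):
    """Forward pass: ReLU on the two hidden layers, linear output (assumed)."""
    h = state
    for W, b in net[:-1]:
        h = relu(dense(h, W, b))
    W, b = net[-1]
    return dense(h, W, b)  # one Q value per action


rng = random.Random(0)
current_net = create_network(4, 2, rng)
# The target network starts as an exact copy of the current network.
target_net = copy.deepcopy(current_net)

state = [0.0, 0.1, -0.05, 0.2]
q = q_values(current_net, state)
```

Because target_net is a separate copy, later changes to current_net do not affect the TD targets until the weights are copied over again.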
