Project in Practice: Teaching a Machine to Play a Game with the Deep Q Network (DQN) Algorithm (Part 2)

平行的空间

Article link: https://blog.csdn.net/zhm2229/article/details/102545528

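This project is presented in three articles, organized as follows:

Project in Practice: Teaching a Machine to Play a Game with the Deep Q Network (DQN) Algorithm (Part 1): overview and the game

Project in Practice: Teaching a Machine to Play a Game with the Deep Q Network (DQN) Algorithm (Part 2): the algorithm

Project in Practice: Teaching a Machine to Play a Game with the Deep Q Network (DQN) Algorithm (Part 3): interaction between the algorithm and the game, model training, model evaluation, and using the same algorithm and parameters to play a different game

(Part 2) The Algorithm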

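Because the input data are images, the project uses a convolutional neural network (CNN). (See the CNN introduction.)

Neural network architecture

The architecture of the CNN is as follows:

[Figure: CNN architecture diagram]

The input we get from the game is a 700*700 color image. Since the actual colors of the pieces do not help the algorithm, we convert the image to grayscale and then shrink it to 60*60.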

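    # Requires `import cv2` (OpenCV) at module level
    def pre_process(self, frame, crop_size):
        # Piece colors carry no useful information, so convert the BGR frame to grayscale
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Shrink the frame to crop_size, e.g. (60, 60)
        frame = cv2.resize(frame, crop_size, interpolation=cv2.INTER_CUBIC)
        return frame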

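The network contains three convolutional layers, all of which use zero-padding.

The first convolutional layer uses 32 filters of size 6*6*1 with a stride of 4*4, producing 32 feature maps of size 15*15.

Next comes a max-pooling layer with a 2*2 filter and a stride of 2*2.

The second convolutional layer uses 64 filters of size 4*4*32 with a stride of 2*2.

The third convolutional layer uses 64 filters of size 3*3*64 with a stride of 1*1.

After these three convolutional layers, the three-dimensional output is flattened into a flatten layer of size 1024*1.

This is followed by a fully connected layer with 512 neurons (a 1024*512 weight matrix).

Finally comes the output layer. Since each move chooses one of 7 columns as the action, the output layer has 7 neurons.

Each convolutional layer and the hidden fully connected layer is followed by an activation function, ReLU. The output layer is linear, so it can produce arbitrary Q values.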
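
As a quick check of the layer sizes described above, here is a minimal Python sketch that traces the spatial size through the network. It assumes that with zero ("SAME") padding the output size of each layer is ceil(input / stride). The helper names same_out and trace_shapes are illustrative and not part of the project.

```python
import math


def same_out(size, stride):
    """Spatial output size of a layer with SAME (zero) padding: ceil(size / stride)."""
    return math.ceil(size / stride)


def trace_shapes(input_size=60):
    """Return the output shape of each layer for a square input_size x input_size image."""
    shapes = {}
    s = same_out(input_size, 4)   # conv1: 32 filters of 6*6, stride 4
    shapes["conv1"] = (s, s, 32)
    s = same_out(s, 2)            # max pooling: 2*2 window, stride 2
    shapes["pool1"] = (s, s, 32)
    s = same_out(s, 2)            # conv2: 64 filters of 4*4, stride 2
    shapes["conv2"] = (s, s, 64)
    s = same_out(s, 1)            # conv3: 64 filters of 3*3, stride 1
    shapes["conv3"] = (s, s, 64)
    shapes["flat"] = s * s * 64   # number of values after flattening
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes().items():
        print(name, shape)
```

Under this assumption the last convolutional layer yields 4*4*64 = 1024 values, which matches the size of the flatten layer.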

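As explained in the DQN introduction, the output of the DQN is the Q value of each action, so this is a regression problem. The project therefore uses mean squared error as the loss function.

We use mini-batch gradient descent with a batch size of 48.

The optimizer is adaptive moment estimation (Adam).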

self.graph = tf.Graph()
with self.graph.as_default():
    self.sess = tf.Session(graph=self.graph)
with self.sess.as_default():
    with self.graph.as_default():
        self.inp = tf.placeholder("float", [None, 60, 60, self.image_channel], name=self.player + '_inp')
        with tf.variable_scope(self.player + '_net'):
            with tf.variable_scope('l1'):
                self.W_conv1 = tf.Variable(tf.truncated_normal([6, 6, self.image_channel, 32], stddev=0.02), name=self.player + '_w_conv1')
                self.b_conv1 = tf.Variable(tf.constant(0.01, shape=[32]), name=self.player + '_b_conv1')
                self.conv1 = tf.nn.relu(tf.nn.conv2d(self.inp, self.W_conv1, strides=[1, 4, 4, 1], padding="SAME") + self.b_conv1,
                    name=self.player + '_conv1')
                self.pool1 = tf.nn.max_pool(self.conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="SAME",
                                        name=self.player + '_pool1')
            with tf.variable_scope('l2'):
                self.W_conv2 = tf.Variable(tf.truncated_normal([4, 4, 32, 64], stddev=0.02), name=self.player + '_w_conv2')
                self.b_conv2 = tf.Variable(tf.constant(0.01, shape=[64]), name=self.player + '_b_conv2')
                self.conv2 = tf.nn.relu(
                    tf.nn.conv2d(self.pool1, self.W_conv2, strides=[1, 2, 2, 1], padding="SAME") + self.b_conv2,
                    name=self.player + '_conv2')
            with tf.variable_scope('l3'):
                self.W_conv3 = tf.Variable(tf.truncated_normal([3, 3, 64, 64], stddev=0.02), name=self.player + '_w_conv3')
                self.b_conv3 = tf.Variable(tf.constant(0.01, shape=[64]), name=self.player + '_b_conv3')
                self.conv3 = tf.nn.relu(
                    tf.nn.conv2d(self.conv2, self.W_conv3, strides=[1, 1, 1, 1], padding="SAME") + self.b_conv3,
                    name=self.player + '_conv3')
                self.conv3_flat = tf.reshape(self.conv3, [-1, 1024])
            with tf.variable_scope('l4'):
                self.W_fc4 = tf.Variable(tf.truncated_normal([1024, 512], stddev=0.02), name=self.player + '_w_fc4')
                self.b_fc4 = tf.Variable(tf.constant(0.01, shape=[512]), name=self.player + '_b_fc4')
                self.fc4 = tf.nn.relu(tf.matmul(self.conv3_flat, self.W_fc4) + self.b_fc4,
                                      name=self.player + '_fc4')

            with tf.variable_scope('l5'):
                self.W_fc5 = tf.Variable(tf.truncated_normal([512, self.ACTIONS], stddev=0.02), name=self.player + '_w_fc5')
                self.b_fc5 = tf.Variable(tf.constant(0.01, shape=[self.ACTIONS]), name=self.player + '_b_fc5')
                self.out = tf.matmul(self.fc4, self.W_fc5) + self.b_fc5


        self.argmax = tf.placeholder("float", [None, self.ACTIONS],name=self.player + "_argmax")
        self.gt = tf.placeholder("float", [None], name=self.player + "_gt")  # ground truth

        with tf.variable_scope(self.player + '_loss'):
            # multiply by the one-hot action mask and sum to keep only the Q value of the action taken
            self.predict_q_value = tf.reduce_sum(tf.multiply(self.out, self.argmax), axis=1,
                                        name=self.player + '_action')
            # cost function (mean squared error) that we minimize through backpropagation
            self.cost = tf.reduce_mean(tf.square(self.predict_q_value - self.gt), name=self.player + '_cost')
        # optimization function (Adam) that minimizes the cost function
        with tf.variable_scope(self.player + '_train'):
            self.train_step = tf.train.AdamOptimizer(1e-6).minimize(self.cost, name=self.player+'_train_step')
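
To make the loss in the code above more concrete, the following plain-Python sketch reproduces the same arithmetic on small lists. The network outputs are multiplied by a one-hot action mask, summed to obtain the Q value of the action actually taken, and compared with the target by mean squared error. The function name masked_mse and its argument names are hypothetical. This is an illustration of the computation, not the project's TensorFlow code.

```python
def masked_mse(q_values, one_hot_actions, targets):
    """Mean squared error between Q(s, a_taken) and the target Q value.

    q_values:        list of rows, one Q value per action (the network output)
    one_hot_actions: list of rows, 1.0 at the action taken and 0.0 elsewhere
    targets:         list of target Q values, one per sample
    """
    squared_errors = []
    for q_row, mask_row, target in zip(q_values, one_hot_actions, targets):
        # The one-hot mask zeroes every entry except the Q value of the taken action
        predicted = sum(q * m for q, m in zip(q_row, mask_row))
        squared_errors.append((predicted - target) ** 2)
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    q = [[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]]
    mask = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
    print(masked_mse(q, mask, [3.0, 0.5]))
```

Because of the mask, only the output neuron of the chosen action contributes to the loss for each sample.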

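Hyperparameters

The algorithm has several hyperparameters, which are set as follows:

[Table: hyperparameter settings]

To make sure enough uncorrelated data has been collected, the DQN does not learn during the first 50,000 steps.

The experience queue stores the experience replay data. Its size is the number of past experiences from which we can sample.

The discount rate determines how much the rewards of later actions count toward the value of the current action.

The batch size is the number of samples fed into the neural network in each training step.

The Adam learning rate is the learning rate used by the Adam optimizer.

The epsilon-related parameters are described together below.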

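Action selection

As explained in the DQN introduction, the output of the DQN is the Q value of every possible action. To choose an action we use exploitation-exploration. (See the introduction to exploitation-exploration.)

With probability epsilon we choose a random action, and with probability 1-epsilon we choose the action with the highest Q value. Epsilon is a number in the interval [0,1].

There are two ways to set epsilon: a fixed value or a changing value. This project uses a changing value. At the beginning epsilon is large, so the agent chooses random actions with high probability and can explore the environment thoroughly. As training progresses and the agent learns more about the environment, we gradually shift toward choosing the action with the largest Q value.

In the parameter table, initial epsilon is the starting value of epsilon, final epsilon is its final value, and how many steps to anneal epsilon is the number of steps over which epsilon decreases from the initial value to the final value. Epsilon is updated once per training step.

[Figure: epsilon over training steps for the two players]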

def choose_action(self, observation, mode):
    # Q values of all actions for the current observation
    out_t = self.out.eval(session=self.sess, feed_dict={self.inp: [observation]})
    out_t = out_t[0]

    self.argmax_t = np.zeros([self.ACTIONS])

    if random.random() <= self.epsilon:
        # Explore: pick a random action (choice is assumed to be numpy.random.choice)
        maxIndex = choice(range(self.ACTIONS), 1)
        maxIndex = maxIndex[0]
    else:
        # Exploit: pick the action with the highest Q value
        maxIndex = np.argmax(out_t)
    self.argmax_t[maxIndex] = 1

    if mode == 'use mode':
        self.epsilon = self.FINAL_EPSILON
    else:
        # Anneal epsilon linearly until it reaches FINAL_EPSILON
        if self.epsilon > self.FINAL_EPSILON:
            self.epsilon -= (self.INITIAL_EPSILON - self.FINAL_EPSILON) / self.EXPLORE

    return maxIndex, self.epsilon
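
The linear annealing performed step by step in choose_action can also be written in closed form. The sketch below is only an illustration: the function name annealed_epsilon is hypothetical, and the numbers in the example (1.0, 0.1, 1000) are placeholder values, not the project's settings.

```python
def annealed_epsilon(step, initial_epsilon, final_epsilon, explore_steps):
    """Linearly decay epsilon from initial to final over explore_steps, then hold it."""
    decay_per_step = (initial_epsilon - final_epsilon) / explore_steps
    return max(final_epsilon, initial_epsilon - decay_per_step * step)


if __name__ == "__main__":
    # Placeholder values chosen for illustration only
    for step in (0, 250, 500, 1000, 2000):
        print(step, annealed_epsilon(step, 1.0, 0.1, 1000))
```

Early in training the agent acts almost entirely at random. Once explore_steps steps have passed, epsilon stays at its final value.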

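Experience replay

At every step we obtain a tuple of data (the state at time t, the action, the reward, and the state at time t+1). This tuple is fed to the DQN for training.

If we trained only on the current data and discarded it afterwards, there would be two problems.

First, past experience would never be used again in later training, so it would not be fully exploited.

Second, the data are correlated, because consecutive samples follow one another in time. The action at one step is computed from the Q values of the previous step, and the reward it obtains in turn affects the next step. Such correlation between samples is harmful to the model.

To solve both problems, we use experience replay.

Experience replay means building a queue to hold these tuples. Every step stores a new tuple, and when the number of tuples exceeds the size of the queue, the oldest one is removed. At each training step, a batch of batch size tuples is sampled at random from the experience queue and fed to the DQN.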

def store_transition(self, inp_t, action, reward_t, inp_t1):
    action_list = np.zeros([self.ACTIONS])
    action_list[action] = 1
    self.D.append((inp_t, action_list, reward_t, inp_t1))
    if len(self.D) > self.REPLAY_MEMORY:
        self.D.popleft()
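
As a self-contained illustration of the same idea, here is a minimal sketch of an experience queue built on collections.deque. The class name ReplayBuffer and its method names are hypothetical. Unlike store_transition above, it relies on the maxlen argument of deque to drop the oldest tuple automatically instead of calling popleft by hand.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size experience queue; the oldest transition is dropped when it is full."""

    def __init__(self, capacity):
        # A deque with maxlen discards items from the left once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal ordering
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


if __name__ == "__main__":
    buf = ReplayBuffer(capacity=3)
    for t in range(5):
        buf.store(t, 0, 0.0, t + 1)
    print(len(buf), buf.sample(2))
```

Sampling at random from the queue is what lets old experience be reused and reduces the correlation between the samples in one batch.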

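Model learning

1. When we take input data from the experience queue, the first element of the tuple, the state at time t (the current image of the board), is fed into the CNN, which outputs the Q values of the 7 possible actions.

2. Based on the Q values, choose the action to take at time t.

3. Execute this action in the game and obtain the corresponding reward returned by the game.

4. Compute the target Q value and the loss according to the DQN formula.

5. Update the neural network with the optimization algorithm.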

def learn(self, step):
    minibatch = random.sample(self.D, self.BATCH)

    inp_batch = [d[0] for d in minibatch]
    argmax_batch = [d[1] for d in minibatch]
    reward_batch = [d[2] for d in minibatch]
    inp_t1_batch = [d[3] for d in minibatch]

    gt_batch = []
    out_batch = self.sess.run(self.out, feed_dict={self.inp: inp_t1_batch})

    for i in range(0, len(minibatch)):
        gt_batch.append(reward_batch[i] + self.GAMMA * np.max(out_batch[i]))


    bb_cost, bb_train_step = self.sess.run([self.cost, self.train_step],
                                    feed_dict={
                                            self.gt: gt_batch,
                                            self.argmax: argmax_batch,
                                            self.inp: inp_batch
                                            })
    if step % self.SAVE_STEP == 0:
        self.saver.save(self.sess, self.ckp_path + '/model.ckpt', global_step=step)
        print("save checkpoint %s:"%str(self.ckp_path + '/model.ckpt'))
    return bb_cost
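
The target computed in the loop of learn is the reward plus the discounted maximum Q value of the next state. A minimal plain-Python sketch of that step follows. The function name q_targets and the example numbers are hypothetical and serve only to show the arithmetic.

```python
def q_targets(rewards, next_q_values, gamma):
    """Target for each sampled transition: reward + gamma * max over actions of Q(next state)."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_values)]


if __name__ == "__main__":
    rewards = [1.0, 0.0]
    next_q = [[0.5, 2.0], [-1.0, -0.5]]   # Q values of the next states, one row per sample
    print(q_targets(rewards, next_q, gamma=0.5))
```

Many DQN implementations additionally use the reward alone as the target when the next state is terminal. The tuples stored by store_transition carry no such flag, so this sketch, like learn, does not model that case.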

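Training two models at the same time

Because we use self-play, each of the two players is a model. They have the same network structure and hyperparameters and learn from each other, so we need to train two models at the same time.

In this project we build the models with TensorFlow. TensorFlow uses a graph to define the data flow and a session to run operations on the graph.

We therefore need two separate graphs and sessions to keep the two models apart. Only then are the parameters of each model updated on the correct model.

References: https://my.oschina.net/u/3800567/blog/1786556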

https://www.tensorflow.org/guide/graphs?hl=zh-cn

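Saving the model

Because training takes a very long time, we need to save the trained model for later use. We use the checkpoint mechanism provided by TensorFlow to save and restore the model.

Restoring the model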

self.saver = tf.train.Saver()
checkpoint = tf.train.latest_checkpoint(self.ckp_path)
# Restore the model from a trained checkpoint
if checkpoint is not None and self.mode == 'use mode':
    print('%s Restore Checkpoint %s' % (str(self.player), checkpoint))
    self.saver.restore(self.sess, checkpoint)
    print("Model restored.")
else:
    # Do not use previously trained data; initialize the variables from scratch
    init = tf.global_variables_initializer()
    self.sess.run(init)
    print("%s Initialized new Graph" % (str(self.player)))

Saving a checkpoint

self.saver.save(self.sess, self.ckp_path + '/model.ckpt', global_step=step)
print("save checkpoint %s:"%str(self.ckp_path + '/model.ckpt'))

The complete code and the trained checkpoint data are available on GitHub: https://github.com/zm2229/use-DQN-to-play-a-simple-game.