An introduction to Deep Q-Learning: let’s play Doom

by Thomas Simonini

This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

Last time, we learned about Q-Learning: an algorithm which produces a Q-table that an agent uses to find the best action to take given a state.

But as we’ll see, producing and updating a Q-table can become ineffective in big state space environments.

This article is the third part of a series of blog post about Deep Reinforcement Learning. For more information and more resources, check out the syllabus of the course.

Today, we’ll create a Deep Q Neural Network. Instead of using a Q-table, we’ll implement a Neural Network that takes a state and approximates Q-values for each action based on that state.

Thanks to this model, we’ll be able to create an agent that learns to play Doom!

In this article you’ll learn:

  • What is Deep Q-Learning (DQL)?
  • What are the best strategies to use with DQL?
  • How to handle the temporal limitation problem
  • Why we use experience replay
  • What are the mathematics behind DQL
  • How to implement it in Tensorflow

Adding ‘Deep’ to Q-Learning

In the last article, we created an agent that plays Frozen Lake thanks to the Q-learning algorithm.

We implemented the Q-learning function to create and update a Q-table. Think of this as a “cheat-sheet” to help us to find the maximum expected future reward of an action, given a current state. This was a good strategy — however, this is not scalable.

Imagine what we’re going to do today. We’ll create an agent that learns to play Doom. Doom is a big environment with a gigantic state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient at all.

The best idea in this case is to create a neural network that will approximate, given a state, the different Q-values for each action.

How does Deep Q-Learning work?

This will be the architecture of our Deep Q Learning:

This can seem complex, but I’ll explain the architecture step by step.

Our Deep Q Neural Network takes a stack of four frames as an input. These pass through its network, and output a vector of Q-values for each action possible in the given state. We need to take the biggest Q-value of this vector to find our best action.

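As a minimal sketch of that last step (q_network and state_stack are illustrative placeholder names, not objects from the notebook), choosing the action is just a forward pass followed by an argmax:

import numpy as np

# state_stack: the preprocessed stack of 4 frames, e.g. with shape (84, 84, 4)
q_values = q_network(state_stack[np.newaxis, ...]).numpy()[0]  # one Q-value per possible action
best_action = int(np.argmax(q_values))                         # pick the action with the biggest Q-value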

In the beginning, the agent does really badly. But over time, it begins to associate frames (states) with best actions to do.

Preprocessing part

Preprocessing is an important step. We want to reduce the complexity of our states to reduce the computation time needed for training.

First, we can grayscale each of our states. Color does not add important information (in our case, we just need to find the enemy and kill him, and we don’t need color to find him). This is an important saving, since we reduce our three color channels (RGB) to one (grayscale).

Then, we crop the frame. In our example, seeing the roof is not really useful.

Then we reduce the size of the frame, and we stack four sub-frames together.

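Here is a rough preprocessing sketch with scikit-image; the crop window and the 84×84 target size are illustrative assumptions, not necessarily the exact values used in the notebook:

from skimage import transform
from skimage.color import rgb2gray

def preprocess_frame(frame):
    # collapse the three RGB channels into a single grayscale channel (values in [0, 1])
    gray = rgb2gray(frame)
    # crop the roof and the side borders, which carry no useful information
    cropped = gray[30:-10, 30:-30]
    # shrink the frame to cut down the computation needed for training
    return transform.resize(cropped, [84, 84])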

The problem of temporal limitation

Arthur Juliani gives an awesome explanation about this topic in his article. He has a clever idea: using LSTM neural networks for handling the problem.

However, I think it’s better for beginners to use stacked frames.

The first question you might ask is: why do we stack frames together?

We stack frames together because it helps us to handle the problem of temporal limitation.

Let’s take an example from the game of Pong. When you see this frame:

Can you tell me where the ball is going?

No, because one frame is not enough to have a sense of motion!

But what if I add three more frames? Here you can see that the ball is going to the right.

That’s the same thing for our Doom agent. If we give it only one frame at a time, it has no idea of motion. And how can it make a correct decision, if it can’t determine where and how fast objects are moving?

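A common way to handle this, sketched below under the assumption of the preprocess_frame function from the preprocessing sketch above, is to keep the four most recent frames in a deque and stack them along the channel axis:

from collections import deque
import numpy as np

STACK_SIZE = 4  # number of frames the agent sees at once

def stack_frames(stacked_frames, raw_frame, is_new_episode):
    frame = preprocess_frame(raw_frame)
    if is_new_episode:
        # no history yet: repeat the first frame four times
        stacked_frames = deque([frame] * STACK_SIZE, maxlen=STACK_SIZE)
    else:
        # append the newest frame; the oldest one is dropped automatically
        stacked_frames.append(frame)
    state = np.stack(stacked_frames, axis=2)  # shape (84, 84, 4)
    return state, stacked_frames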

Using convolution networks

The frames are processed by three convolution layers. These layers allow you to exploit spatial relationships in images. But also, because frames are stacked together, you can exploit some temporal properties across those frames.

If you’re not familiar with convolution, please read this excellent intuitive article by Adam Geitgey.

Each convolution layer will use ELU as an activation function. ELU has been proven to be a good activation function for convolution layers.

We use one fully connected layer with an ELU activation function and one output layer (a fully connected layer with a linear activation function) that produces the Q-value estimation for each action.

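Here is a sketch of that architecture written with tf.keras; the notebook itself uses lower-level Tensorflow ops, and the filter counts and kernel sizes below are illustrative assumptions rather than its exact hyperparameters:

import tensorflow as tf

def build_dqn(n_actions, input_shape=(84, 84, 4)):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        # three convolution layers, each with an ELU activation
        tf.keras.layers.Conv2D(32, kernel_size=8, strides=4, activation="elu"),
        tf.keras.layers.Conv2D(64, kernel_size=4, strides=2, activation="elu"),
        tf.keras.layers.Conv2D(64, kernel_size=3, strides=2, activation="elu"),
        tf.keras.layers.Flatten(),
        # one fully connected layer with ELU...
        tf.keras.layers.Dense(512, activation="elu"),
        # ...and a linear output layer: one Q-value per action
        tf.keras.layers.Dense(n_actions, activation="linear"),
    ])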

Experience Replay: making more efficient use of observed experience

Experience replay will help us to handle two things:

  • Avoid forgetting previous experiences.
  • Reduce correlations between experiences.

I will explain these two concepts.

This part and the illustrations were inspired by the great explanation in the Deep Q Learning chapter in the Deep Learning Foundations Nanodegree by Udacity.

Avoid forgetting previous experiences

We have a big problem: the variability of the weights, because there is high correlation between actions and states.

Remember in the first article (Introduction to Reinforcement Learning), we spoke about the Reinforcement Learning process:

At each time step, we receive a tuple (state, action, reward, new_state). We learn from it (we feed the tuple into our neural network), and then throw this experience away.

Our problem is that we feed our neural network sequential samples from our interactions with the environment, and it tends to forget previous experiences as it overwrites them with new ones.

For instance, if we are in the first level and then the second (which is totally different), our agent can forget how to behave in the first level.

As a consequence, it can be more efficient to make use of previous experience, by learning with it multiple times.

Our solution: create a “replay buffer.” This stores experience tuples while interacting with the environment, and then we sample a small batch of tuples to feed our neural network.

Think of the replay buffer as a folder where every sheet is an experience tuple. You feed it by interacting with the environment. And then you take some random sheets to feed the neural network.

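A minimal version of that idea (a sketch, not the notebook’s exact memory class) only needs a fixed-capacity deque plus uniform random sampling:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # when the buffer is full, the oldest experience is dropped automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # sampling at random breaks the correlation between consecutive experiences
        return random.sample(self.buffer, batch_size)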

This prevents the network from only learning about what it has immediately done.

Reducing correlation between experiences

We have another problem — we know that every action affects the next state. This outputs a sequence of experience tuples which can be highly correlated.

If we train the network in sequential order, we risk our agent being influenced by the effect of this correlation.

By sampling from the replay buffer at random, we can break this correlation. This prevents action values from oscillating or diverging catastrophically.

It will be easier to understand that with an example. Let’s say we play a first-person shooter, where a monster can appear on the left or on the right. The goal of our agent is to shoot the monster. It has two guns and two actions: shoot left or shoot right.

We learn with ordered experience. Say we know that if we shoot a monster, the probability that the next monster comes from the same direction is 70%. In our case, this is the correlation between our experiences tuples.

Let’s begin the training. Our agent sees the monster on the right, and shoots it using the right gun. This is correct!

Then the next monster also comes from the right (with 70% probability), and the agent will shoot with the right gun. Again, this is good!

And so on and on…

The problem is, this approach increases the value of using the right gun through the entire state space.

And if our agent doesn’t see many examples on the left (since only 30% will probably come from the left), it will end up always choosing to shoot right, regardless of where the monster comes from. This is not rational at all.

We have two parallel strategies to handle this problem.

First, we must stop learning while interacting with the environment. We should try different things and play a little randomly to explore the state space. We can save these experiences in the replay buffer.

Then, we can recall these experiences and learn from them. After that, we go back to playing with the updated value function.

As a consequence, we will have a better set of examples. We will be able to generalize patterns from across these examples, recalling them in whatever order.

This helps avoid being fixated on one region of the state space. This prevents reinforcing the same action over and over.

This approach can be seen as a form of Supervised Learning.

We’ll see in future articles that we can also use “prioritized experience replay.” This lets us present rare or “important” tuples to the neural network more frequently.

Our Deep Q-Learning algorithm

First a little bit of mathematics:

Remember that we update our Q value for a given state and action using the Bellman equation:

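In symbols, reconstructed from the previous article, that tabular update is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$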
In our case, we want to update our neural net’s weights to reduce the error.

The error (or TD error) is calculated by taking the difference between our Q_target (the maximum possible value for the next state) and Q_value (our current prediction of the Q-value).

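Written out, consistent with the pseudocode below, the target and the error are:

$$Q_{\text{target}} = R(s, a) + \gamma \max_{a'} Q(s', a')$$

$$\text{TD error} = Q_{\text{target}} - Q(s, a)$$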

Initialize Doom Environment E
Initialize replay Memory M with capacity N (= finite capacity)
Initialize the DQN weights w

for episode in max_episode:
    s = Environment state
    for steps in max_steps:
        Choose action a from state s using epsilon greedy.
        Take action a, get r (reward) and s' (next state)
        Store experience tuple <s, a, r, s'> in M
        s = s' (state = new_state)

        Get random minibatch of exp tuples from M
        Set Q_target = reward(s, a) + γ max Q(s')
        Update w = α(Q_target - Q_value) * ∇w Q_value

There are two processes that are happening in this algorithm:

  • We sample the environment where we perform actions, and store the observed experience tuples in a replay memory.
  • We select a small batch of tuples at random and learn from it using a gradient descent update step (see the sketch after this list).
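
Here is a hedged sketch of that gradient descent update written with Tensorflow 2 eager mode, assuming the build_dqn and ReplayBuffer sketches above (the notebook itself builds a Tensorflow 1.x graph and session):

import tensorflow as tf

def train_step(q_network, optimizer, batch, gamma, n_actions):
    # each element of batch is a numpy array over the minibatch;
    # actions holds integer indices, dones holds 0/1 flags
    states, actions, rewards, next_states, dones = batch
    # Q_target = r + gamma * max_a' Q(s', a'), with no bootstrapping from terminal states
    next_q = q_network(next_states)
    q_target = tf.cast(rewards, tf.float32) + gamma * (1.0 - tf.cast(dones, tf.float32)) * tf.reduce_max(next_q, axis=1)
    with tf.GradientTape() as tape:
        q_values = q_network(states)
        # keep only the Q-value of the action that was actually taken
        q_taken = tf.reduce_sum(q_values * tf.one_hot(actions, n_actions), axis=1)
        loss = tf.reduce_mean(tf.square(q_target - q_taken))
    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
    return loss

In the loop from the pseudocode above, you would call this on each minibatch sampled from the replay buffer, with any tf.keras optimizer (Adam, for instance).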

Let’s implement our Deep Q Neural Network

We made a video where we implement a Deep Q-learning agent with Tensorflow that learns to play Atari Space Invaders.

Now that we know how it works, we’ll implement our Deep Q Neural Network step by step. Each step and each part of the code is explained directly in the Jupyter notebook linked below.

You can access it in the Deep Reinforcement Learning Course repo.

That’s all! You’ve just created an agent that learns to play Doom. Awesome!

Don’t forget to implement each part of the code by yourself. It’s really important to try to modify the code I gave you. Try to add epochs, change the architecture, add fixed Q-values, change the learning rate, use a harder environment (such as Health Gathering), and so on. Have fun!

In the next article, I will discuss the last improvements in Deep Q-learning:

  • Fixed Q-values
  • Prioritized Experience Replay
  • Double DQN
  • Dueling Networks

But next time we’ll work on Policy Gradients by training an agent that plays Doom, and we’ll try to survive in a hostile environment by collecting health.

If you liked my article, please click the 👏 below as many times as you liked the article so other people will see this here on Medium. And don’t forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Keep learning, stay awesome!

Deep Reinforcement Learning Course with Tensorflow

Syllabus

Video version

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I

Translated from: https://www.freecodecamp.org/news/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8/
