

by Thomas Simonini

An introduction to Deep Q-Learning: let’s play Doom

This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

Last time, we learned about Q-Learning: an algorithm which produces a Q-table that an agent uses to find the best action to take given a state.

But as we’ll see, producing and updating a Q-table can become ineffective in environments with a large state space.


This article is the third part of a series of blog posts about Deep Reinforcement Learning. For more information and more resources, check out the syllabus of the course.

Today, we’ll create a Deep Q Neural Network. Instead of using a Q-table, we’ll implement a Neural Network that takes a state and approximates Q-values for each action based on that state.

Thanks to this model, we’ll be able to create an agent that learns to play Doom!


In this article you’ll learn:


  • What is Deep Q-Learning (DQL)?

  • What are the best strategies to use with DQL?

  • How to handle the temporal limitation problem

  • Why we use experience replay

  • What are the mathematics behind DQL

  • How to implement it in Tensorflow


Adding ‘Deep’ to Q-Learning

In the last article, we created an agent that plays Frozen Lake thanks to the Q-learning algorithm.

We implemented the Q-learning function to create and update a Q-table. Think of this as a “cheat-sheet” to help us to find the maximum expected future reward of an action, given a current state. This was a good strategy — however, this is not scalable.

Imagine what we’re going to do today. We’ll create an agent that learns to play Doom. Doom is a big environment with a gigantic state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient at all.
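
To get a feel for the scale, here is a rough back-of-the-envelope sketch. It assumes a hypothetical 84x84 grayscale frame with 256 gray levels per pixel (illustrative numbers only, not the actual Doom resolution) and counts how many digits the number of distinct possible frames has. Most of those frames never occur in a real game, but it shows why one table row per state is hopeless.

```python
import math

# Assumed, illustrative frame format (not taken from the Doom environment).
WIDTH, HEIGHT = 84, 84
GRAY_LEVELS = 256

def count_digits_of_state_space(width, height, levels):
    """Number of decimal digits in levels ** (width * height)."""
    pixels = width * height
    return math.floor(pixels * math.log10(levels)) + 1

digits = count_digits_of_state_space(WIDTH, HEIGHT, GRAY_LEVELS)
print(f"The number of possible frames has {digits} digits.")
```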

The best idea in this case is to create a neural network that will approximate, given a state, the different Q-values for each action.

How does Deep Q-Learning work?

This will be the architecture of our Deep Q Learning:


This can seem complex, but I’ll explain the architecture step by step.


Our Deep Q Neural Network takes a stack of four frames as input. These pass through the network, which outputs a vector containing one Q-value for each possible action in the given state. We take the biggest Q-value of this vector to find our best action.
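
As a minimal sketch of that last step, picking the best action is just an argmax over the output vector. The action names and Q-values below are made up for illustration; the real action set depends on the Doom scenario.

```python
# Hypothetical action names; the real action set depends on the Doom scenario.
ACTIONS = ["move_left", "move_right", "shoot"]

def best_action(q_values):
    """Return the index of the largest Q-value (the greedy action)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])

# Made-up network output for one stacked-frame state.
q_values = [0.12, -0.40, 1.35]
chosen = best_action(q_values)
print(ACTIONS[chosen])  # prints "shoot"
```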

In the beginning, the agent does really badly. But over time, it begins to associate frames (states) with the best actions to take.

Preprocessing part

Preprocessing is an important step. We want to reduce the complexity of our states to reduce the computation time needed for training.

First, we can grayscale each of our states. Color does not add important information (in our case, we just need to find the enemy and kill him, and we don’t need color to find him). This is an important saving, since we reduce our three color channels (RGB) to one (grayscale).

Then, we crop the frame. In our example, seeing the roof is not really useful.

Then we reduce the size of the frame, and we stack four sub-frames together.
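
The preprocessing steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: frames are nested lists of (r, g, b) tuples, grayscale is a plain channel average, and resizing is done by keeping every second pixel. The names `preprocess`, `stack_frames`, `crop_top`, and `step` are hypothetical; a real implementation would use an image library.

```python
from collections import deque

STACK_SIZE = 4  # four frames per state, as described above

def preprocess(frame, crop_top=0, step=2):
    """Grayscale, crop, and shrink one frame.

    frame is a list of rows; each row is a list of (r, g, b) tuples in 0..255.
    crop_top and step are illustrative parameters, not values from the course.
    """
    cropped = frame[crop_top:]                      # drop the top rows (the roof)
    gray = [[(r + g + b) / 3 / 255.0 for (r, g, b) in row] for row in cropped]
    return [row[::step] for row in gray[::step]]    # keep every step-th pixel

def stack_frames(stacked, frame, is_new_episode):
    """Return a deque holding the last STACK_SIZE preprocessed frames."""
    processed = preprocess(frame)
    if stacked is None or is_new_episode:
        # At the start of an episode, repeat the first frame to fill the stack.
        return deque([processed] * STACK_SIZE, maxlen=STACK_SIZE)
    stacked.append(processed)                       # the oldest frame falls out
    return stacked

white = [[(255, 255, 255)] * 4 for _ in range(4)]
black = [[(0, 0, 0)] * 4 for _ in range(4)]
state = stack_frames(None, white, is_new_episode=True)
state = stack_frames(state, black, is_new_episode=False)
print(len(state), state[0][0][0], state[-1][0][0])  # 4 1.0 0.0
```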


The problem of temporal limitation

Arthur Juliani gives an awesome explanation about this topic in his article. He has a clever idea: using LSTM neural networks for handling the problem.

However, I think it’s better for beginners to use stacked frames.


The first question you might ask is: why do we stack frames together?


We stack frames together because it helps us to handle the problem of temporal limitation.


Let’s take an example, in the game of Pong. When you see this frame:

Can you tell me where the ball is going?


No, because one frame is not enough to have a sense of motion!


But what if I add three more frames? Here you can see that the ball is going to the right.

It’s the same for our Doom agent. If we give it only one frame at a time, it has no idea of motion. And how can it make a correct decision if it can’t determine where and how fast objects are moving?

Using convolution networks

The frames are processed by three convolution layers. These layers allow you to exploit spatial relationships in images. And because frames are stacked together, they can also exploit temporal relationships across those frames.

If you’re not familiar with convolution, please read this excellent intuitive article by Adam Geitgey.

Each convolution layer will use ELU as an activation function. ELU has been shown to work well as an activation function for convolution layers.
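
For reference, ELU (Exponential Linear Unit) leaves positive inputs unchanged and smoothly saturates negative ones. Here is a minimal scalar sketch; `alpha=1.0` is the commonly used default and is an assumption here, not a value stated in this article.

```python
import math

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

for value in (-2.0, -0.5, 0.0, 1.5):
    print(value, elu(value))
```

Unlike ReLU, the output for negative inputs is not clipped to zero, so a small gradient still flows through those units.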

We use one fully connected layer with an ELU activation function and one output layer (a fully connected layer with a linear activation function) that produces the Q-value estimate for each action.
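
To see how three convolution layers shrink the input before it reaches the fully connected layers, here is a small sketch of the standard output-size formula. The 84x84 input and the kernel and stride values are hypothetical choices for illustration, not the settings used in the notebook.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial size after a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical hyperparameters for three convolution layers: (kernel, stride).
LAYERS = [(8, 4), (4, 2), (3, 1)]

size = 84  # assumed width/height of each preprocessed frame
sizes = []
for kernel, stride in LAYERS:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

print(sizes)  # [20, 9, 7]
```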


Experience Replay: making more efficient use of observed experience

Experience replay will help us to handle two things:


  • Avoid forgetting previous experiences.

  • Reduce correlations between experiences.


I will explain these two concepts.


This part and the illustrations were inspired by the great explanation in the Deep Q Learning chapter in the Deep Learning Foundations Nanodegree by Udacity.


Avoid forgetting previous experiences

We have a big problem: the variability of the weights, because there is high correlation between actions and states.


Remember in the first article (Introduction to Reinforcement Learning), we spoke about the Reinforcement Learning process:

At each time step, we receive a tuple (state, action, reward, new_state). We learn from it (we feed the tuple to our neural network), and then throw this experience away.

Our problem is that we give sequential samples from interactions with the environment to our neural network. And it tends to forget the previous experiences as it overwrites with new experiences.

For instance, if we are in the first level and then the second (which is totally different), our agent can forget how to behave in the first level.


As a consequence, it can be more efficient to make use of previous experience, by learning with it multiple times.


Our solution: create a “replay buffer.” This stores experience tuples while we interact with the environment, and then we sample a small batch of tuples to feed our neural network.

Think of the replay buffer as a folder where every sheet is an experience tuple. You fill it by interacting with the environment, and then you take some random sheets to feed the neural network.

This prevents the network from only learning about what it has immediately done.
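
A replay buffer along these lines can be sketched with a bounded deque. This is a minimal illustration, not the notebook's implementation; the class and method names are hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of experience tuples (hypothetical helper class)."""

    def __init__(self, capacity):
        # When the deque is full, appending drops the oldest experience.
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        """Store one (state, action, reward, new_state) tuple."""
        self.buffer.append(experience)

    def sample(self, batch_size):
        """Return batch_size distinct experiences chosen uniformly at random."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayBuffer(capacity=3)
for step in range(5):
    memory.add((step, 0, 0.0, step + 1))  # dummy experience tuples

print(len(memory))       # 3, the two oldest tuples were dropped
print(memory.sample(2))  # two random tuples from the buffer
```

The finite capacity means very old experiences are eventually forgotten, but each stored experience can be replayed many times before that happens.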


Reducing correlation between experiences

We have another problem: we know that every action affects the next state. This produces a sequence of experience tuples which can be highly correlated.

If we train the network in sequential order, we risk our agent being influenced by the effect of this correlation.


By sampling from the replay buffer at random, we can break this correlation. This prevents action values from oscillating or diverging catastrophically.

It will be easier to understand that with an example. Let’s say we play a first-person shooter, where a monster can appear on the left or on the right. The goal of our agent is to shoot the monster. It has two guns and two actions: shoot left or shoot right.

We learn with ordered experience. Say we know that if we shoot a monster, the probability that the next monster comes from the same direction is 70%. In our case, this is the correlation between our experience tuples.

Let’s begin the training. Our agent sees the monster on the right, and shoots it using the right gun. This is correct!

Then the next monster also comes from the right (with 70% probability), and the agent will shoot with the right gun. Again, this is good!

And so on and on…


The problem is, this approach increases the value of using the right gun through the entire state space.


And if our agent doesn’t see a lot of left examples (since only 30% will probably come from the left), it will end up always choosing right, regardless of where the monster comes from. This is not rational at all.
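
To make the correlation concrete, here is a small simulation sketch. It assumes a simplified model where each monster appears on the same side as the previous one with probability 0.7 (the `monster_sides` helper and the seeds are hypothetical), and it compares how often neighbouring samples agree when read in order versus after shuffling.

```python
import random

def monster_sides(n, stay_prob=0.7, seed=0):
    """Simulate n monsters; each repeats the previous side with stay_prob."""
    rng = random.Random(seed)
    sides = [rng.choice(["left", "right"])]
    for _ in range(n - 1):
        if rng.random() < stay_prob:
            sides.append(sides[-1])
        else:
            sides.append("left" if sides[-1] == "right" else "right")
    return sides

def same_side_rate(sides):
    """Fraction of neighbouring pairs that share the same side."""
    pairs = list(zip(sides, sides[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

ordered = monster_sides(10_000)
shuffled = ordered[:]
random.Random(1).shuffle(shuffled)  # mimics sampling experiences at random

print("ordered :", same_side_rate(ordered))   # close to 0.7
print("shuffled:", same_side_rate(shuffled))  # close to 0.5
```

In order, each sample mostly repeats the previous one. Once shuffled, a sample tells you almost nothing about its neighbour, which is the effect we want from random sampling.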

We have two parallel strategies to handle this problem.


First, we must stop learning while interacting with the environment. We should try different things and play a little randomly to explore the state space. We can save these experiences in the replay buffer.

Then, we can recall these experiences and learn from them. After that, we go back to playing with the updated value function.

As a consequence, we will have a better set of examples. We will be able to generalize patterns from across these examples, recalling them in whatever order.

This helps avoid being fixated on one region of the state space. This prevents reinforcing the same action over and over.

This approach can be seen as a form of Supervised Learning.


We’ll see in future articles that we can also use “prioritized experience replay.” This lets us present rare or “important” tuples to the neural network more frequently.

Our Deep Q-Learning algorithm

First a little bit of mathematics:


Remember that we update our Q value for a given state and action using the Bellman equation:
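
For reference, the standard form of that Q-learning update is written out below. It is the textbook version, reconstructed here rather than copied from the article's figure, with learning rate α, discount rate γ, reward r, and next state s'.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```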


In our case, we want to update our neural network’s weights to reduce the error.


The error (or TD error) is calculated by taking the difference between our Q_target (the reward plus the discounted maximum Q-value of the next state) and Q_value (our current prediction of the Q-value).
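
Written out as a weight update, this gives the expression below. It is a sketch in standard notation, where \hat{Q}(s, a, w) denotes the network's estimate with weights w; the bracketed term is the TD error, that is, Q_target minus Q_value.

```latex
\Delta w = \alpha \left[ \left( r + \gamma \max_{a'} \hat{Q}(s', a', w) \right) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w)
```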


Initialize Doom Environment E
Initialize replay Memory M with capacity N (= finite capacity)
Initialize the DQN weights w
for episode in max_episode:
    s = Environment state
    for steps in max_steps:
        Choose action a from state s using epsilon greedy.
        Take action a, get r (reward) and s' (next state)
        Store experience tuple <s, a, r, s'> in M
        s = s' (state = new_state)

        Get random minibatch of exp tuples from M
        Set Q_target = reward(s,a) + γ max Q(s')
        Update w = w + α(Q_target - Q_value) * ∇w Q_value
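
The "epsilon greedy" step in the algorithm can be sketched as follows: with probability epsilon the agent explores with a random action, otherwise it exploits the action with the highest Q-value. The function name and the example Q-values are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q_values = [0.1, 0.9, 0.3]  # made-up Q-values for three actions
print(epsilon_greedy(q_values, epsilon=0.0))  # 1, always greedy when epsilon is 0
```

In practice epsilon usually starts high and decays over training, so the agent explores a lot early on and relies more on its Q-values later.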

There are two processes that are happening in this algorithm:


  • We sample the environment: we perform actions and store the observed experience tuples in a replay memory.

  • We select a small random batch of tuples and learn from it using a gradient descent update step.
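
The second process can be sketched with a tiny linear stand-in for the network, so the gradient is easy to write by hand. Everything here is an assumption for illustration: the linear model, the GAMMA and ALPHA values, and the extra `done` flag that marks terminal transitions (the pseudocode above does not show it).

```python
import random

GAMMA = 0.95  # discount rate (assumed value)
ALPHA = 0.1   # learning rate (assumed value)

def q_values(weights, state):
    """Linear stand-in for the DQN: one weight vector per action."""
    return [sum(w_i * s_i for w_i, s_i in zip(w, state)) for w in weights]

def learn_from_batch(weights, batch):
    """Apply w += ALPHA * (Q_target - Q_value) * grad_w Q_value for each tuple."""
    for state, action, reward, next_state, done in batch:
        # A terminal transition has no next state to bootstrap from.
        if done:
            q_target = reward
        else:
            q_target = reward + GAMMA * max(q_values(weights, next_state))
        q_value = q_values(weights, state)[action]
        td_error = q_target - q_value
        # For a linear model, the gradient of Q(s, a) w.r.t. weights[action] is the state.
        weights[action] = [w_i + ALPHA * td_error * s_i
                           for w_i, s_i in zip(weights[action], state)]
    return weights

# Dummy replay memory with two-feature states and two actions.
memory = [([1.0, 0.0], 0, 1.0, [0.0, 1.0], True),
          ([0.0, 1.0], 1, 0.0, [1.0, 0.0], False)]
weights = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    weights = learn_from_batch(weights, random.sample(memory, 1))
print(weights)
```

With a real neural network, the hand-written gradient line is replaced by the framework's optimizer minimizing the squared TD error.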


Let’s implement our Deep Q Neural Network

We made a video where we implement a Deep Q-learning agent with Tensorflow that learns to play Atari Space Invaders.

Now that we know how it works, we’ll implement our Deep Q Neural Network step by step. Each step and each part of the code is explained directly in the Jupyter notebook linked below.

You can access it in the Deep Reinforcement Learning Course repo.

That’s all! You’ve just created an agent that learns to play Doom. Awesome!

Don’t forget to implement each part of the code by yourself. It’s really important to try to modify the code I gave you. Try to add epochs, change the architecture, add fixed Q-values, change the learning rate, use a harder environment (such as Health Gathering), and so on. Have fun!

In the next article, I will discuss the latest improvements in Deep Q-learning:


  • Fixed Q-values

  • Prioritized Experience Replay

  • Double DQN

  • Dueling Networks


But next time we’ll work on Policy Gradients by training an agent that plays Doom, and we’ll try to survive in a hostile environment by collecting health.


If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.


Keep learning, stay awesome!


Deep Reinforcement Learning Course with Tensorflow

Syllabus

Video version

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I

Translated from: https://www.freecodecamp.org/news/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8/






