DL/LSTM: A Translation and Commentary on "Understanding LSTM Networks"

Overview: Background: a traditional RNN cannot handle long-term dependencies well; it struggles to learn to use information from time steps far in the past.
LSTM networks solve this problem by adding an input gate, a forget gate, and an output gate. Working together, the three gates control what flows into, out of, and stays in the cell state, so the network can effectively remember the long-term information it needs. The article explains how each part of the LSTM works:
>> the input gate decides how much of the new input enters the cell state
>> the forget gate controls how much of the previous time step's memory is kept
>> the output gate controls what is emitted from the memory cell
The article also introduces several LSTM variants, such as adding "peephole" connections and coupling the forget and input gates.
LSTMs work much better than vanilla RNNs on many tasks; essentially all of the major recent RNN results in language processing, translation, image analysis, and so on rely on LSTMs.
The article expects an important future direction of RNN research to be the attention mechanism, which lets each time step choose which part of the input to focus on, further improving model performance.

Table of Contents

Understanding LSTM Networks

Recurrent Neural Networks

Recurrent neural networks have loops; an unrolled recurrent neural network

The Problem of Long-Term Dependencies

A case where the context gap is short, e.g., a language model

A case where the context gap is long, e.g., a cloze-style completion

LSTM Networks

The repeating module in a standard RNN contains a single layer.

The repeating module in an LSTM contains four interacting layers.

The Core Idea Behind LSTMs

Step-by-Step LSTM Walk Through

Variants on Long Short Term Memory

Conclusion

Acknowledgments


Understanding LSTM Networks

Source

Article URL: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Date

August 27, 2015

Author

colah

Recurrent Neural Networks

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.


Recurrent neural networks have loops; an unrolled recurrent neural network

In the above diagram, a chunk of neural network, \(A\), looks at some input \(x_t\) and outputs a value \(h_t\). A loop allows information to be passed from one step of the network to the next.


These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:


This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural neural network architecture to use for such data.

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.


The Problem of Long-Term Dependencies

A case where the context gap is short, e.g., a language model

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.


A case where the context gap is long, e.g., a cloze-style completion

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.


In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem!


LSTM Networks

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
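Written out as an equation (a standard formulation consistent with the diagram, not quoted from the post), that single tanh layer computes

\(h_t = \tanh\left(W \cdot [h_{t-1}, x_t] + b\right)\)

where \([h_{t-1}, x_t]\) is the concatenation of the previous hidden state and the current input, and the same \(W\) and \(b\) are reused at every time step.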


The repeating module in a standard RNN contains a single layer.

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.


The repeating module in an LSTM contains four interacting layers.

Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.


In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.
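One small concrete note on the notation (my addition, not from the post): when two lines merge before entering a yellow box, the vectors are simply stacked, so for \(h_{t-1} \in \mathbb{R}^{d_h}\) and \(x_t \in \mathbb{R}^{d_x}\) the merged line carries the concatenation \([h_{t-1}, x_t] \in \mathbb{R}^{d_h + d_x}\), which is what the weight matrices in the equations below act on.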



The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.


The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.
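In generic form (the specific weights are named in the walkthrough below), a gate computes \(g = \sigma\left(W_g \cdot [h_{t-1}, x_t] + b_g\right)\) and is applied to some signal \(v\) by element-wise multiplication, \(g * v\).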


The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.


Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at \(h_{t-1}\) and \(x_t\), and outputs a number between \(0\) and \(1\) for each number in the cell state \(C_{t-1}\). A \(1\) represents “completely keep this” while a \(0\) represents “completely get rid of this.”
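The accompanying figure in the original post gives the forget gate as

\(f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)\)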


Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.


The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, \(\tilde{C}_t\), that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
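The corresponding equations from the original post are

\(i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)\)
\(\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)\)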

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.


It’s now time to update the old cell state, \(C_{t-1}\), into the new cell state \(C_t\). The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by \(f_t\), forgetting the things we decided to forget earlier. Then we add \(i_t*\tilde{C}_t\). This is the new candidate values, scaled by how much we decided to update each state value.
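In equation form, the cell state update is

\(C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\)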


In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.


Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through \(\tanh\) (to push the values to be between \(-1\) and \(1\)) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
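As equations, the output step is

\(o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)\)
\(h_t = o_t * \tanh(C_t)\)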

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
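Putting the four steps together, here is a minimal NumPy sketch of a single LSTM time step (my own illustration, not code from the post; the parameter names and weight shapes are assumptions chosen to mirror the notation above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step following the equations above.
    Each W_* has shape (hidden_size, hidden_size + input_size); each b_* has shape (hidden_size,)."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                     # new cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)                               # new hidden state
    return h_t, C_t

# Toy usage with random weights, purely for illustration.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
params = {}
for name in ("f", "i", "C", "o"):
    params["W_" + name] = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
    params["b_" + name] = np.zeros(hidden_size)
h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):             # a short input sequence
    h, C = lstm_step(x, h, C, params)

This is only the forward pass of one cell; in practice the weights are learned by backpropagation through time rather than drawn at random.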


Variants on Long Short Term Memory

What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.


The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
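In the peephole version shown in that figure, the gate equations from the original post become

\(f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right)\)
\(i_t = \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right)\)
\(o_t = \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)\)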

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
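With coupled gates, the cell state update from the post becomes

\(C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t\)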


A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
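The GRU equations given in the original post are

\(z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)\)
\(r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)\)
\(\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right)\)
\(h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\)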



These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.


Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.


LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…


Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!


Acknowledgments

I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.


I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.

