LSTM Networks: A Close Reading

Understanding LSTM Networks

Posted on August 27, 2015

Recurrent Neural Networks

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

[Figure: Recurrent Neural Networks have loops.]
In the above diagram, a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

[Figure: An unrolled recurrent neural network.]
This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data.
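
To make the unrolled picture concrete, here is a minimal NumPy sketch of a vanilla RNN applied to a toy sequence; the dimensions, weight names, and random inputs are illustrative assumptions, not anything from the post:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The same function is applied at every time step: the new hidden
    # state depends on the current input and the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 16, 5
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))  # a toy input sequence
h = np.zeros(hidden_dim)                    # initial hidden state
for x_t in xs:                              # the "unrolled" loop
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)   # each copy passes h to its successor
```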

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.

The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

[Figure: Neural networks struggle with long-term dependencies.]
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem!

LSTM Networks

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work.1 They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
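
Written out, the single tanh layer of that repeating module computes something like the following, where $W$ and $b$ are the layer’s learned weight matrix and bias (notation assumed for this write-up):

$$h_t = \tanh\left(W \cdot [h_{t-1}, x_t] + b\right)$$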

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

[Figure: An LSTM neural network. The repeating module in an LSTM contains four interacting layers.]
Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
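
As a tiny numeric illustration (the values are made up), a gate is just a sigmoid output multiplied pointwise into whatever is flowing past it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

candidate = np.array([2.0, -3.0, 0.5])      # information trying to pass through
gate = sigmoid(np.array([6.0, -6.0, 0.0]))  # roughly [1.0, 0.0, 0.5]
print(gate * candidate)                     # first component passes, second is blocked,
                                            # half of the third gets through
```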

An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$. A $1$ represents “completely keep this” while a $0$ represents “completely get rid of this.”
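
In the usual notation, with $\sigma$ the sigmoid function and $W_f$, $b_f$ the gate’s learned weights and bias, the forget gate is:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$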

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
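
In the same notation, the input gate and the candidate values are:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$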

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.
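
Written as an equation, with $*$ denoting pointwise multiplication, the cell-state update is:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$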

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through $\tanh$ (to push the values to be between $-1$ and $1$) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
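
In the same notation, the output gate and the new hidden state are:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t * \tanh\left(C_t\right)$$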

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
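
Putting the four steps together, here is a minimal NumPy sketch of a single LSTM step. The stacked weight matrix and the toy dimensions are implementation conveniences assumed for this sketch, not something prescribed by the post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # One affine transform of [h_prev, x_t], then split into the three gates
    # and the candidate values described in the walk-through above.
    z = np.concatenate([h_prev, x_t]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    C_tilde = np.tanh(g)                          # candidate values ~C_t
    C_t = f * C_prev + i * C_tilde                # forget old info, add new info
    h_t = o * np.tanh(C_t)                        # filtered view of the cell state
    return h_t, C_t

hidden_dim, input_dim = 4, 3
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(hidden_dim + input_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, C = lstm_step(rng.normal(size=input_dim), h, C, W, b)
```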

Variants on Long Short Term Memory

What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
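
With peepholes on every gate, the gate equations pick up the cell state as an extra input; a common formulation (with the output gate looking at the new cell state $C_t$) is:

$$f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right)$$
$$o_t = \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)$$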

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
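
Written as an equation, the coupled update replaces the separate input gate with $1 - f_t$:

$$C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$$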

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
[Figure: A gated recurrent unit neural network.]
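
In the same style of notation, the GRU update is typically written as:

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$
$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$
$$\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right)$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$
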
These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

Acknowledgments

I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.

In addition to the original authors, a lot of people contributed to the modern LSTM. A non-comprehensive list is: Felix Gers, Fred Cummins, Santiago Fernandez, Justin Bayer, Daan Wierstra, Julian Togelius, Faustino Gomez, Matteo Gagliolo, and Alex Graves.