Long Short-Term Memory Networks Are Dying

The Long Short-Term Memory (LSTM) network has become a staple in deep learning, popularized as a better variant of the recurrent neural network. But as methods come and go ever faster with the acceleration of machine learning research, it seems the LSTM has begun its way out.

Let’s take a few steps back and explore the evolution of language modelling, from its baby steps to modern advances on complex problems.

Fundamentally, like any other supervised machine learning problem, the goal of language modelling is to predict some output y given a document d. The document d must somehow be represented in numerical form so that it can be processed by a machine learning algorithm.

The initial solution for representing documents as numbers is the bag of words (BoW). Each word occupies one dimension in a vector, and each value represents how many times that word appears in the document. This method, however, does not take into account the ordering of the words, which matters a lot (I live to work, I work to live).
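
To make this concrete, here is a minimal sketch of bag-of-words vectorization (assuming scikit-learn is available; the two toy sentences are just the example above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two documents containing the same words in a different order
docs = ["I live to work", "I work to live"]

# Bag of words counts each word and discards ordering entirely
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['live' 'to' 'work']
print(bow.toarray())                       # both rows are identical: the ordering is lost
```

Both documents map to exactly the same vector, even though they mean very different things.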

In order to remedy this problem, n-grams are used. These are sequences of n words, in which each element indicates the presence of a word combination. If there are 10,000 words in our dataset and we want to store bi-grams, we will need to store 10,000² unique combinations. For any reasonably good model, we will likely need tri-grams or even quad-grams, each of which raises the vocabulary size to another power.
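
A rough sketch of the idea (pure Python, written for illustration; the helper function and sentence are made up):

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I live to work".split()
print(ngrams(tokens, 2))  # [('I', 'live'), ('live', 'to'), ('to', 'work')]

# The vocabulary blow-up: a 10,000-word vocabulary already allows
# 10,000**2 possible bi-grams and 10,000**3 possible tri-grams
print(10_000 ** 2, 10_000 ** 3)
```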

Obviously, n-grams and BoW cannot handle even slightly complex language tasks. Their vectorization procedures are too sparse and too large, and they fail to capture the spirit of language itself. A solution? The recurrent neural network.

Instead of using high-dimensional, sparse vectorization schemes that attempt to feed the entire document to the model at once, a recurrent neural network works with the sequential nature of text. RNNs can be expressed as a recursive function, hₜ = A(hₜ₋₁, xₜ), where A is the transformation function applied at each timestep, h is the set of hidden-layer states, and x represents the input data.
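
As a rough sketch of the recursion hₜ = A(hₜ₋₁, xₜ) (a hypothetical numpy implementation; the tanh nonlinearity and the weight shapes are illustrative assumptions):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # A(h, x): the same transformation is applied at every timestep
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4))   # hidden-to-hidden weights
W_x = rng.normal(size=(4, 3))   # input-to-hidden weights
b = np.zeros(4)

h = np.zeros(4)                        # h_0
for x_t in rng.normal(size=(10, 3)):   # a sequence of 10 input vectors
    h = rnn_step(h, x_t, W_h, W_x, b)  # h_t = A(h_{t-1}, x_t)
```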

Each time step is created with knowledge of the previous time step, producing a new output by applying the same function to the previous output. When RNNs are ‘unrolled’, we can see how inputs at various timesteps are fed into the model with knowledge of what the model has previously seen.

Figure: “Understanding LSTM Networks” by C. Olah (2015). Image free to share.

Because the RNN applies the same function to every input, it has the added benefit of being able to handle variable-length inputs. The rationale behind using the same function can be thought of as applying ‘universal language/sequence rules’ at each time step.

The recursive aspect of the RNN that makes it great, however, is also what causes issues. Expanding our recursive definition of the RNN out to just the fourth hidden state, h₄ = A(A(A(A(h₀, x₁), x₂), x₃), x₄), we see that the function A is applied many times over.

A(x) is really just multiplication by a weight matrix plus the addition of a bias. Making big simplifications, of course, after ten time steps the initial input x₀ has essentially been multiplied by w¹⁰, where w is the weight matrix. As any calculation will show, raising numbers to powers yields extreme results:

  • 0.3¹⁰ ≈ 0.0000059
  • 0.5¹⁰ ≈ 0.00098
  • 1.5¹⁰ ≈ 57.7
  • 1.7¹⁰ ≈ 201.6

This causes a lot of problems. The weight matrix will cause values either to shrink towards zero (vanish) or to grow towards positive or negative infinity (explode). Hence, RNNs suffer from the vanishing and exploding gradient problems. Not only does this cause calculation problems when the weights are updated, it also gives the network a kind of dementia: it forgets anything more than a few time steps back, since earlier inputs have been obscured or amplified beyond recognition through repeated multiplication.
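
A toy numpy demonstration of the effect (the scaled identity matrices are a deliberate simplification of a real weight matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
h0 = rng.normal(size=4)

for scale in (0.3, 1.7):
    W = scale * np.eye(4)   # stand-in for a recurrent weight matrix
    h = h0.copy()
    for _ in range(10):     # ten timesteps of repeated multiplication
        h = W @ h
    # Prints roughly 0.3**10 (vanished) and 1.7**10 (exploded)
    print(scale, np.linalg.norm(h) / np.linalg.norm(h0))
```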

Hence, when using RNNs to generate text, you may see infinite loops:

I walked on the street and walked on the street and walked on the street and walked on the street and walked on the street and…

When the network generates its second ‘walked on’, it has forgotten that it just said it. Through its naïve mechanics, it thinks that given the previous inputs ‘the street and…’, the next output should be ‘walked on’. The cycle continues because its attention frame is so small.

The cure: the LSTM network, first introduced in 1997 (yeah — wow) but largely unappreciated until recently, when computing resources made the discovery more practical.

It is still a recurrent network, but it applies more sophisticated transformations to the inputs. The inputs to each cell are manipulated through a series of complex maneuvers, yielding two outputs, which can be thought of as the ‘long-term memory’ (the top line running through the cells) and the ‘short-term memory’ (the bottom output).

Figure: the LSTM cell, by Chris Olah. Image free to share.

Vectors that pass through the long-term memory channel can travel through the entire chain without any interference. Only the gates (the pink dots) can block or add information, so if the network chooses to, it can retain data it found useful an arbitrary number of cells ago.
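
For the curious, here is a rough numpy sketch of one cell step using the standard LSTM gate equations (the weight layout and toy dimensions are illustrative assumptions, not the only formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x_t, W, b):
    # W and b hold one weight matrix and bias per path: forget, input, output, candidate
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: what to drop from long-term memory
    i = sigmoid(W["i"] @ z + b["i"])        # input gate: what new information to write
    o = sigmoid(W["o"] @ z + b["o"])        # output gate: what to expose as short-term memory
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate values
    c = f * c_prev + i * c_tilde            # long-term memory (the top line through the cells)
    h = o * np.tanh(c)                      # short-term memory (the bottom output)
    return c, h

hidden, n_in = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + n_in)) for k in "fioc"}
b = {k: np.zeros(hidden) for k in "fioc"}

c, h = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, n_in)):
    c, h = lstm_step(c, h, x_t, W, b)
```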

This addition of a long-term information stream drastically expands the network’s attention size. It can access not only the previous cell states but also useful learning from a while ago, giving it the ability to reference context — a key attribute of more human communication.

The LSTM worked well — for a while. It could perform character generation reasonably well on shorter text lengths and avoided many of the problems that plagued early natural language processing, offering more global depth and an understanding not only of individual words but of their collective meaning.

However, the LSTM network has its downsides. It is still a recurrent network: if the input sequence has 1,000 characters, the LSTM cell is called 1,000 times, creating a long gradient path. While the addition of a long-term memory channel helps, there is a limit to how much it can hold.

Additionally, because LSTMs are recursive in nature (to find the current state you need to find the previous state), they cannot be trained in parallel.

What is perhaps more pressing is that transfer learning doesn’t work well on LSTMs (or RNNs). Deep convolutional neural networks were popularized in part because pre-trained models like Inception could simply be downloaded and fine-tuned. The valuable ability to begin training with a model that already knows the universal rules of its task makes deep learning more accessible and feasible.
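
For comparison, this is roughly what that looks like in practice (a minimal PyTorch/torchvision sketch; the weight tag and the 5-class head are illustrative assumptions):

```python
import torch
import torchvision

# Download a pretrained Inception v3 and fine-tune only a new final layer
model = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False                      # freeze the pretrained backbone
model.fc = torch.nn.Linear(model.fc.in_features, 5)  # fresh head for a hypothetical 5-class task

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ...train only the new head on the downstream dataset
```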

Sometimes, pretrained LSTMs can be transferred successfully, but there’s a reason why it’s not a common practice. This makes sense — each piece of text has its own unique style. Unlike images, which almost always follow some sort of rigid universal rules of shadows and edges, the structure of text is less apparent and more fluid.

Yes, there are basic grammar rules that uphold the framework of text, but they are much less strict than the rules governing images. On top of this, there are different sets of grammar rules: different forms of poetry, different dialects (Shakespeare and Old English), different use cases (texting language on Twitter, a written version of an improvised speech). It’s likely not much easier to start from, say, an LSTM pretrained on Wikipedia than to learn a dataset from scratch.

Beyond pretrained embeddings, LSTMs are limited when given more demanding modern problems, like machine translation across several languages or text generation completely indistinguishable from human-written text. Increasingly, a newer architecture is being used to address these more challenging tasks: the transformer.

Released initially in the paper “Attention Is All You Need” to address language translation, the architecture of the transformer is plenty complex. The important part, though, is the idea of attention.

Figure: Hong Jing. Image free to share.

Earlier in the article, we discussed attention span as how many hidden states back the recurrent neural network could look. Transformers have an infinite attention size, which is the core of their advantage over LSTMs. The key to doing this?

Transformers don’t use recursion.

Transformers accomplish an infinite attention size by using an all-to-all comparison. Instead of processing each word sequentially, they process the entire sequence at once to create an ‘attention matrix’, where each output is a weighted sum of the inputs. So, for instance, we may express the French word ‘accord’ as ‘The’(0) + ‘agreement’(1) + …. The network learns the weightings of the attention matrix.
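
A minimal numpy sketch of the idea, using scaled dot-product self-attention (the toy shapes and the use of the same matrix for queries, keys, and values are illustrative simplifications):

```python
import numpy as np

def attention(Q, K, V):
    # All-to-all comparison: every output is a weighted sum of the rows of V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1: this is the attention matrix
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 5, 8                  # 5 tokens, 8-dimensional representations
X = rng.normal(size=(seq_len, d))
out, attn = attention(X, X, X)     # self-attention: every token attends to every token
print(attn.shape)                  # (5, 5): one weight for every input/output pair
```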

Figure: attention weights from “Neural Machine Translation by Jointly Learning to Align and Translate”. Image free to share.

The region in the red-bounded box is interesting: although ‘European Economic Area’ translates word-for-word to ‘européenne économique zone’, in French the ordering of the words is actually ‘zone économique européenne’. Attention matrices are able to capture these relationships directly.

Attention allows direct access between output and input values; LSTMs must access this information indirectly and sequentially through their memory channels.

Transformers are expensive to compute — there is no getting around the O(n²) runtime of constructing the attention matrix. However, for a variety of reasons, it’s not as severe as some may think. For one, due to the non-recursive nature of the transformer, the model can be trained in parallel, something that is not possible with LSTMs or RNNs.

Additionally, GPUs and other hardware have evolved to the point where they scale incredibly well — multiplying 10-by-10 matrices is essentially as fast as multiplying 1000-by-1000 matrices.

Much of the long computing time for modern transformers is not attributable to the attention mechanism. Instead, with the help of attention, the problems of recursive language modelling are solved.

Transformer models also exhibit great results when applied with transfer learning, which has been a huge contributor to their popularity.
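
As an illustration of how accessible this has become (a sketch assuming the Hugging Face transformers library is installed; BERT and the two-label head are illustrative choices):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download a pretrained transformer and attach a fresh two-class classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The agreement on the European Economic Area was signed.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): ready to fine-tune on a downstream task
```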

So what is the future of the LSTM?

It still has a long way to go before it truly ‘dies’, but it is certainly on a downslope. For one, variants of the LSTM have shown success in sequence modelling in general, for example in generating music or forecasting stock prices, where the ability to reference far back and retain an effectively infinite attention span is not as important given the additional computational burden.

Summary

  • Recurrent neural networks were created to address the sparsity, inefficiency, and lack of ordering information in traditional n-gram and BoW methods by passing the previous output into the next input, creating a more sequential approach to modelling.

  • LSTMs were created to address RNNs forgetting inputs more than a few time steps back, by introducing long-term and short-term memory channels controlled by gates.

  • Some downsides of LSTMs include unfriendliness towards transfer learning, unsuitability for parallel computing, and a limited attention span, even after being expanded.

  • Transformers throw away recursive modelling. Instead, with attention matrices, they can directly access other elements of the sequence, which gives them an effectively infinite attention size. Additionally, they can be trained in parallel.

  • LSTMs still have applications in sequential modelling, for example music generation or stock forecasting. However, much of the hype associated with LSTMs for language modelling is expected to dissipate as transformers become more accessible, powerful, and practical.

Thanks for reading!

Translated from: https://towardsdatascience.com/long-short-term-memory-networks-are-dying-whats-replacing-it-5ff3a99399fe
