Reformer: The Efficient Transformer


The Transformer (Vaswani et al.) is great: it attends to longer contexts, it offers parallelization in computation that RNNs don't, and, most importantly, it achieves state-of-the-art results.


In this article, we'll be covering the Reformer model, which was proposed in the paper Reformer: The Efficient Transformer by Google Research. This model addresses some efficiency constraints of the Transformer and proposes an improved version that implements Locality Sensitive Hashing (LSH) and reversible layers to make the model much more efficient. We'll be discussing these in greater detail in the coming sections of this post.


Inefficiency of the Transformers

Despite being the state of the art, the Transformer is very expensive (w.r.t. memory). One of the most well-known Transformers is BERT (Devlin et al.), and even that is trained with a maximum allowed sequence length of 512. To delineate this further, we'll take the same example as in the paper:


The largest reported configuration of the Transformer has 64 layers and 0.5B parameters per layer. Say we want to train a Transformer on sequences as long as 64K tokens. Here, the 0.5B parameters of a single layer alone account for 2GB of memory. Moreover, the activations for 1024-dimensional embeddings at a batch size of 8 account for 64K x 1K x 8 = 0.5B floats, which is another 2GB of memory. If the model had only a single layer, we could easily train it on a single GPU; however, there are 63 more layers stacked on top. Moreover, the corpus on which BERT is trained requires 17GB to store.

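As a quick sanity check, here is a rough back-of-the-envelope version of those numbers (a minimal sketch assuming 4-byte float32 values; it only reproduces the paper's example, not an exact memory profile):

bytes_per_float = 4                               # assuming float32
params_per_layer = 0.5e9                          # 0.5B parameters in one layer
print(params_per_layer * bytes_per_float / 1e9)   # ~2.0 GB of weights per layer

seq_len, d_model, batch_size = 64 * 1024, 1024, 8
activations = seq_len * d_model * batch_size      # 64K x 1K x 8 ≈ 0.5B floats
print(activations * bytes_per_float / 1e9)        # ~2.1 GB of activations per layer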

In light of the above, the following are the problems with the original Transformer that the Reformer addresses:


  1. N layers require N times as much memory as a single layer, due to the fact that each layer's inputs need to be stored for backpropagation.


  2. The d_ff (dimension of the intermediate feed-forward layer) is quite large compared to d_model and hence accounts for a large share of memory usage.


  3. Attention computation for a sequence of length L is O(L²) in both computational and space complexity.


In the coming sections, we’ll see how Reformer overcomes these.


Locality Sensitive Hashing (LSH) Attention

Figure: Scaled Dot-Product Attention (Vaswani et al.)

This is the Scaled Dot-Product Attention from the original Transformer model. Here, each token's embedding is first projected into 3 different vectors, namely query (Q), key (K), and value (V). Then the dot product of the queries and the keys is computed, which tells how much each vector contributes to the output for a given vector (attention, basically). For more on self-attention, you can read my blog on the Transformers.

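As a refresher, here is a minimal NumPy sketch of single-head scaled dot-product attention (no batching, masking, or multi-head logic); it only serves to make the QKᵀ term in the next section explicit and is not any particular library's implementation:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: arrays of shape (seq_len, d_k) for a single head
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of value vectors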

Memory Efficient Attention

The main issue lies with the QKᵀ term. Let the shape of the queries and the keys be (batch_size, seq_length, d_model) each. Then QKᵀ will have a shape of (batch_size, seq_length, seq_length). So even for a batch size of 1, a sequence of length 64K makes QKᵀ a 64K x 64K matrix, which is 16GB of memory.


The workaround for this is that, instead of computing the whole QKᵀ matrix at once, one can compute attention for each query q_i separately, i.e.


o_i = softmax(q_i Kᵀ / √d_k) V    (memory-efficient attention, Reformer paper)
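
Building on the earlier sketch, a per-query version might look like this (again an illustrative NumPy sketch, not the paper's code); it trades the (seq_len, seq_len) score matrix for a loop over queries:

def attention_per_query(q_i, K, V):
    # q_i: a single query vector of shape (d_k,); K, V: (seq_len, d_k)
    # The full (seq_len, seq_len) score matrix is never materialized.
    d_k = q_i.shape[-1]
    scores = K @ q_i / np.sqrt(d_k)          # (seq_len,) scores for this query only
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                       # one output row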

Now, this sounds inefficient, as it in a way takes away the parallel-processing ability of the Transformer. However, you'll be surprised to know that LSH attention (which we are about to see) compensates for this part.


Q = K

In the original implementation of the Transformer, Q, K and V are produced using 3 different sets of weights (linear layers). In contrast, the authors suggest using the same weights for both the query and the key. This is called a shared-QK model.


It turns out that sharing QK does not affect the performance of Transformer

Reformer Paper
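
To make the shared-QK idea concrete, here is a minimal sketch (the weight names W_qk and W_v are hypothetical, not the Reformer's actual parameter names):

import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 1024, 64
W_qk = rng.normal(size=(d_model, d_k))   # one projection shared by queries and keys
W_v = rng.normal(size=(d_model, d_k))    # values keep their own projection

def shared_qk_projection(x):             # x: (seq_len, d_model)
    qk = x @ W_qk                        # queries and keys are literally the same vectors
    v = x @ W_v
    return qk, qk, v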

LSH Attention

So we've discussed computing the attention for each query separately, which seems quite inefficient as it isn't parallel. But what if we took, say, just the 32 or 64 keys that are closest to a given query out of the complete sequence of length 64K and computed attention over just those? This is exactly what LSH attention does. Let's see how LSH works:


To get a clear idea of this, we'll quickly recall the role of the vector space first. At the first layer of the Transformer, we essentially map each token in a given sequence to a 'vector'. This means we are mapping all our tokens to a common vector space where the vector representations of all the tokens in the vocabulary co-exist. These mappings are trained on a language such that the vector representations of similar or related tokens end up close to each other, while unrelated ones lie far apart (FYI, we can measure the relatedness of these vectors using distance metrics, e.g. cosine distance).


Figure: Angular locality-sensitive hashing (Reformer paper)

Consider any 2 vectors from the vector space we discussed earlier as two points, x and y. We imagine a circle (a sphere, really) that contains these points (the circle stands in for the vector space). Then we divide the circle into 4 parts (4 hashing buckets) and rotate this partition randomly, i.e. rotate the circle randomly. We make 2 observations here (refer to the figure):


  1. In the first case (top), the vectors (points) are relatively far from each other. Hence, under random rotations, the vectors are likely to end up in different buckets with high probability.


  2. Whereas in the second case (bottom), the vectors are appreciably closer to each other. So, as you can see, they are likely to end up in the same bucket with high probability.


A hashing scheme that assigns each vector x to a hash h(x) is called locality-sensitive if nearby vectors get the same hash with high probability and distant ones do not.

Reformer Paper

In the Reformer, this is achieved by taking a random matrix R of size (d_k, b/2), where d_k is the size of the key vector and b is the number of buckets. The hash function h for a vector x is then given as:


h(x) = argmax([ xR ; -xR ])


where [ a ; b ] denotes concatenation. So by taking xR and -xR, we are essentially taking random projections of x. Intuitively speaking, we are taking b vectors (one per bucket) of size d_k and finding the bucket whose vector x is most aligned with.

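A minimal NumPy sketch of this hashing scheme (an illustration of the formula above, not the trax implementation):

import numpy as np

def lsh_hash(x, R):
    # x: (seq_len, d_k) query/key vectors; R: (d_k, n_buckets // 2) random matrix
    # Angular LSH: h(x) = argmax([xR ; -xR]) gives one bucket id per vector.
    xR = x @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

# Toy usage with made-up sizes:
rng = np.random.default_rng(0)
d_k, n_buckets, seq_len = 64, 8, 16
R = rng.normal(size=(d_k, n_buckets // 2))
buckets = lsh_hash(rng.normal(size=(seq_len, d_k)), R)   # shape (16,), values in [0, 8)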

Figure: LSH attention (Reformer paper)

The figure above depicts the flow of LSH Attention implemented in the Reformer.


  1. The query/key (queries = keys) vectors are assigned to their respective buckets using the LSH scheme that we just discussed.


  2. We sort the query/key vectors according to their buckets.


  3. Since the hash buckets may be uneven in size, they are difficult to batch over. Hence, the approach adopted is to take fixed-size chunks after sorting; i.e. we take chunks of size m and let each vector attend to vectors from the same bucket that lie in the same chunk or one chunk back. Typically the chunk size is 2l / b, where l is the sequence length and b is the number of buckets. (A simplified sketch of this whole flow follows after this list.)


  • For causal masking (the look-ahead mask), the position ids of the query/key vectors are carried along, reordered according to the bucket-wise sorting, and the mask is then computed by comparing these position ids.

  • Moreover, there is a possibility that similar vectors fall in different buckets. To tackle this, we can perform n_rounds rounds of hashing with different hash functions.

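Putting steps 1-3 together, here is a heavily simplified, single-round, single-head sketch of the flow. It reuses lsh_hash from above, skips causal masking, multi-round hashing, and the separate value projection, and assumes seq_len is divisible by chunk_size; it is meant to show the mechanics, not the Reformer's actual implementation:

import numpy as np

def lsh_attention_sketch(x, R, chunk_size):
    seq_len, d_k = x.shape
    buckets = lsh_hash(x, R)                       # step 1: hash the shared query/key vectors
    order = np.argsort(buckets, kind="stable")     # step 2: sort vectors by bucket
    x_sorted, b_sorted = x[order], buckets[order]

    out_sorted = np.zeros_like(x)
    for c in range(seq_len // chunk_size):         # step 3: fixed-size chunks
        lo, hi = c * chunk_size, (c + 1) * chunk_size
        ctx_lo = max(0, lo - chunk_size)           # attend within this chunk + one chunk back
        q, k = x_sorted[lo:hi], x_sorted[ctx_lo:hi]
        same_bucket = b_sorted[lo:hi, None] == b_sorted[None, ctx_lo:hi]
        scores = np.where(same_bucket, q @ k.T / np.sqrt(d_k), -1e9)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out_sorted[lo:hi] = w @ k                  # reuse the shared QK vectors as values here
    out = np.empty_like(out_sorted)
    out[order] = out_sorted                        # undo the bucket sort
    return out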

Figure: Behavior and sparsity of the attention matrix at each step of LSH attention (Reformer paper)

Reversible Transformer

Figure: Comparison of a regular residual connection (a) vs. RevNets (b), (c), from Gomez et al. via the Google AI Blog

We solved the memory issue in attention computation using LSH attention, as discussed above. However, there is another major source of memory consumption in the Transformer. In the feed-forward layers, it is normal for a Transformer to have a d_ff value of ~4K and ~16 layers. Moreover, the inputs to each layer need to be stored for backpropagation. In such a setting, with a sequence length of 64K, we would still end up in the impractical 16GB memory range.


So to solve this, the Reformer borrows an idea from RevNets (Reversible Residual Networks), proposed in Gomez et al. To understand this architecture, consider the figure above:


(a) A regular skip connection in ResNet: We take the input, compute the layer's function of it, and add the result to the original layer input. In the figure, the values of x and y need to be stored in memory for backpropagation. So:


y = x + F(x)

z = y + G(y)

(b) A forward-pass in RevNet: The idea is to allow the inputs at any layer to be recovered from the inputs of its following layer; i.e.


y1 = x1;

y2 = z2 = x2 + F(x1)

z1 = y1 + G(y2)

(c) A backward-pass in RevNet: Now we can easily reconstruct the inputs for the layers using the following:


x1 = y1 = z1 − G(y2)

x2 = y2 − F(x1) = z2 − F(x1)

So clearly, rather than storing the input values for backpropagation, we can reconstruct them on the fly and save memory.


Finally, the Reversible Transformer implements this as follows:


Y1 = X1 + Attention(X2)

Y2 = X2 + FeedForward(Y1)
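
A tiny sketch of a reversible block in this formulation, with generic callables f and g standing in for Attention and FeedForward (an illustration under those assumptions, not trax's RevNet code):

import numpy as np

def reversible_forward(x1, x2, f, g):
    # One reversible block: Y1 = X1 + f(X2), Y2 = X2 + g(Y1)
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def recover_inputs(y1, y2, f, g):
    # Reconstruct the block's inputs from its outputs; nothing needs to be stored.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# Quick sanity check with toy functions standing in for Attention / FeedForward:
f = lambda t: np.tanh(t)
g = lambda t: 0.5 * t
x1, x2 = np.ones(4), np.arange(4.0)
y1, y2 = reversible_forward(x1, x2, f, g)
assert all(np.allclose(a, b) for a, b in zip(recover_inputs(y1, y2, f, g), (x1, x2)))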

Splitting d_ff

Finally, we address the intermediate feed-forward layer, which has d_ff output units, where d_ff is quite large compared to d_model (generally ~4K).


Consider a sequence of 64K tokens: in a standard Transformer, the feed-forward outputs for all positions are calculated in parallel, and the resulting activations take a lot of memory. Although the feed-forward output is computed in parallel for the whole sequence, it need not be, because the output for any given token's vector is independent of the other vectors.


Hence, the Reformer suggests processing this layer in ‘c’ chunks:


Y2 = [Y2^(1); Y2^(2); ...; Y2^(c)], where Y2^(j) = X2^(j) + FeedForward(Y1^(j))    (chunked feed-forward, Reformer paper)
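
A short sketch of this chunked computation (illustrative NumPy only; the residual X2 term follows the reversible formulation above):

import numpy as np

def chunked_feed_forward(y1, x2, ff, n_chunks):
    # Apply the position-wise feed-forward layer chunk by chunk along the sequence axis.
    outputs = [x2_c + ff(y1_c)
               for y1_c, x2_c in zip(np.array_split(y1, n_chunks),
                                     np.array_split(x2, n_chunks))]
    return np.concatenate(outputs, axis=0)

# Because ff acts on each position independently, the chunked result equals x2 + ff(y1)
# while only one chunk of the d_ff-sized intermediate activations lives in memory at a time.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 12
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
ff = lambda t: np.maximum(t @ W1, 0.0) @ W2
y1, x2 = rng.normal(size=(seq_len, d_model)), rng.normal(size=(seq_len, d_model))
assert np.allclose(chunked_feed_forward(y1, x2, ff, n_chunks=3), x2 + ff(y1))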

With chunking + reversible layers, the layer-input memory of a Reformer model is independent of the number of layers.


Summing it all up

  1. The O(L²) attention computation and memory issue is overcome by using LSH attention.

  2. The layer-input storage problem is solved by using reversible layers.

  3. The memory cost of high-dimensional feed-forward layers with d_ff output units is kept in check by processing the input sequence in chunks.


Results

Table: Complexity comparison of Transformer architecture variants (Reformer paper)

where,


b => batch size, l => sequence length, d_ff => intermediate feed-forward dimension, n_h => number of heads, n_l => number of layers, d_model => hidden state dimension, n_r => number of hashing rounds, c => chunk size.

The Reformer can be used to generate a complete image from a partial image:


Figure. Top: image fragments used as input to the Reformer. Bottom: "completed" full-frame images. Original images are from the Imagenet64 dataset, via the Google AI Blog.

Fun Fact: The Reformer is so efficient that it can process text sequences of lengths up to 1M words on a single GPU with just 16GB memory.


Fun Fact: Reformer can process entire novels, all at once and on a single device.


Conclusion

In this (really long) article, we covered the Reformer model in depth. We saw which efficiency bottlenecks of the Transformer it addresses and how it overcomes them.


Here is the link to the GitHub repository of the Reformer code in Google's trax library.


Here's a Colab notebook by Google for an image-generation demo, and another one for a text-generation demo.


Here's the link to Hugging Face's API docs for the Reformer implementation and pre-trained weights.


Translated from: https://towardsdatascience.com/reformer-the-efficient-transformer-dd9830164703
