Transformer

DecafTea

Article link: https://blog.csdn.net/DecafTea/article/details/111567760

1. Attention-based encoder-decoder sequence model architecture (A is an RNN, LSTM, or GRU)

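Attention is essentially a weighted sum over the values. Each value carries the information of the corresponding word, and the weight alpha expresses how important that information is: the larger the weight, the more attention is paid to the corresponding value.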
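
A minimal sketch of this weighted sum, assuming NumPy and column vectors. The value matrix `V` and the raw `scores` are made-up toy numbers, and turning scores into weights with a softmax is the usual choice rather than something stated in the sentence above.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into non-negative weights that sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Three value vectors stored as columns, one per word (toy numbers).
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0]])

# Hypothetical relevance scores; the first word scores highest.
scores = np.array([2.0, 0.0, 0.0])

alpha = softmax(scores)   # attention weights, one per value
c = V @ alpha             # context vector: weighted sum of the columns of V
```

Because the first weight is the largest, `c` is pulled towards the first column of `V`.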

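The state s_t is updated from c_t-1, s_t-1, and x'_t. s_t is then fed as the feature vector into a softmax classifier, which yields the predicted probability distribution p_t. Sampling from p_t gives the prediction z_t, and z_t is used as the next input x'_t+1.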
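
One decoder step of this kind could look roughly as follows. This is a sketch under assumptions: a simple tanh RNN cell stands in for A (an LSTM or GRU would differ), and the sizes, the matrices `A`, `W_out`, and the function name `decoder_step` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_s, d_c, vocab = 4, 5, 5, 7   # hypothetical sizes

# Assumed parameters: a simple RNN cell and a softmax classifier.
A = rng.normal(size=(d_s, d_x + d_s + d_c)) * 0.1
b = np.zeros(d_s)
W_out = rng.normal(size=(vocab, d_s)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(x_t, s_prev, c_prev):
    # Update the state from x'_t, s_{t-1} and c_{t-1}.
    s_t = np.tanh(A @ np.concatenate([x_t, s_prev, c_prev]) + b)
    # s_t is the feature vector of the softmax classifier.
    p_t = softmax(W_out @ s_t)
    # Sample the prediction z_t (a word index) from p_t.
    z_t = rng.choice(vocab, p=p_t)
    return s_t, p_t, z_t

s_t, p_t, z_t = decoder_step(np.ones(d_x), np.zeros(d_s), np.zeros(d_c))
```

In a full model, `z_t` would be embedded and used as the next input, and a new context vector would be computed from `s_t`.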

2. Attention without RNN

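Attention with and without an RNN are very similar. Attention without an RNN uses the context vector c as the feature vector of the softmax classifier, while attention with an RNN uses the hidden state s as the feature vector.

[Figure: comparison of attention with an RNN and attention without an RNN]
Two kinds of attention are introduced below: the attention layer and the self-attention layer.

(1) Attention layer
Assumed task: an English-to-German Seq2Seq translation task.
The German word generated at the current step (the prediction) is used as the input at the next step. Suppose t German words have been generated so far: x'_1, x'_2, …, x'_t are used as input to generate the next word, and the newly generated word becomes the (t+1)-th input x'_t+1.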
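
The feedback loop described here can be sketched as below. The toy vocabulary and the function `next_word_probs` are hypothetical stand-ins: a real model would compute the distribution from x and x'_1, …, x'_t, whereas this stub just returns random probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<start>", "<end>", "ich", "bin", "hier"]   # hypothetical toy vocabulary

def next_word_probs(generated):
    """Hypothetical stand-in for the model: a distribution over VOCAB."""
    logits = rng.normal(size=len(VOCAB))
    logits[0] = -np.inf                  # never emit <start> again
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(max_len=10):
    generated = ["<start>"]              # x'_1
    while len(generated) < max_len:
        p = next_word_probs(generated)   # uses x'_1, ..., x'_t
        word = VOCAB[rng.choice(len(VOCAB), p=p)]
        generated.append(word)           # becomes x'_{t+1}
        if word == "<end>":
            break
    return generated

out = generate()
```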
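
Each x'_j corresponds to one query q_j and one context vector c_j. c_j depends on all the keys k, all the values v, and its own query q_j.

In the MT task, x is the input sequence (the text to be translated) and x' is the predicted output sequence (the translation).
step 1: Feed in the whole input sequence to obtain all keys k and values v.
step 2: Compute c_1, …, c_t one after another.
Note: c_j depends on q_j and on all k and v, and each q_j corresponds to one c_j. c_j is fed into a softmax classifier (a neural network layer followed by a softmax activation) as the feature vector to predict the probability distribution of the next character or word. A word is then sampled from that distribution as the next character/word, which is used as the input at the next step.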
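
The two steps can be sketched in NumPy as follows. Assumptions: vectors are stored as columns, the weights are computed as a softmax over K^T q_j, and no 1/sqrt(d) scaling is applied (the original Transformer does scale the scores). The sizes and random matrices are illustrative only.

```python
import numpy as np

def softmax_cols(Z):
    """Column-wise softmax: every column becomes a weight vector."""
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def attention(X, X_prime, W_Q, W_K, W_V):
    # step 1: keys and values come from the whole input sequence X.
    K = W_K @ X                   # (d, m)
    V = W_V @ X                   # (d, m)
    # step 2: one query, and hence one context vector, per x'_j.
    Q = W_Q @ X_prime             # (d, t)
    A = softmax_cols(K.T @ Q)     # (m, t), column j holds the weights alpha_j
    C = V @ A                     # (d, t), column j is c_j
    return C, A

rng = np.random.default_rng(0)
D, d, m, t = 6, 4, 5, 3           # hypothetical sizes
X = rng.normal(size=(D, m))       # input sequence (English)
X_prime = rng.normal(size=(D, t)) # words generated so far (German)
W_Q, W_K, W_V = (rng.normal(size=(d, D)) for _ in range(3))
C, A = attention(X, X_prime, W_Q, W_K, W_V)
```

The code computes all c_j at once; computing them one by one as x'_j become available gives the same columns, since c_j uses only q_j together with all of K and V.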
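(2) Self-attention layer
Input: X = (x_1, x_2, …, x_m)
Output: C = (c_1, c_2, …, c_m). Each c_j has the same dimension as the value vectors v; the m-dimensional vector is the weight vector alpha_j, which holds one weight per input position.
Parameters: W_Q, W_K, W_V

Procedure:
step 1: Compute q, k, v.
step 2: Compute alpha (the attention weights).
step 3: Compute c (the context vector).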
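
The three steps can be sketched with an explicit loop over positions. As before, this assumes column vectors and an unscaled softmax over K^T q_j; the sizes and random parameters are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def self_attention(X, W_Q, W_K, W_V):
    # step 1: queries, keys and values all come from the same X.
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X      # each (d, m)
    m = X.shape[1]
    A = np.zeros((m, m))
    C = np.zeros((V.shape[0], m))
    for j in range(m):
        # step 2: alpha_j has m entries, one weight per input position.
        A[:, j] = softmax(K.T @ Q[:, j])
        # step 3: c_j is a weighted sum of the values, so it is d-dimensional.
        C[:, j] = V @ A[:, j]
    return C, A

rng = np.random.default_rng(0)
D, d, m = 6, 3, 4                             # hypothetical sizes
X = rng.normal(size=(D, m))
W_Q, W_K, W_V = (rng.normal(size=(d, D)) for _ in range(3))
C, A = self_attention(X, W_Q, W_K, W_V)
```

The output has one column per input position, so the sequence length m is preserved.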

3. Multi-head attention layer

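In multi-head attention, each single-head self-attention layer has its own three W matrices and produces its own outputs c_1, …, c_m. Concatenating the outputs of all heads gives the output of multi-head attention.

Question: what advantage does multi-head attention have over single-head attention?

(1) Multi-head self-attention
Each single-head attention layer has 3 weight matrices W, which are used to obtain the query, key, and value.
Each single-head attention layer produces one set of m context vectors c_1, …, c_m. Stacking the l sets of context vectors gives the multi-head context vectors c_1, …, c_m.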
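
A sketch of the stacking step, assuming column vectors so that the l head outputs (each of shape d × m) are stacked vertically into an (l·d) × m matrix. The sizes and random parameters are hypothetical, and the output projection used in the original Transformer is left out.

```python
import numpy as np

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def single_head_self_attention(X, W_Q, W_K, W_V):
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X
    return V @ softmax_cols(K.T @ Q)            # (d, m)

def multi_head_self_attention(X, heads):
    # Every head has its own (W_Q, W_K, W_V); nothing is shared.
    outputs = [single_head_self_attention(X, *W) for W in heads]
    return np.vstack(outputs)                   # (l * d, m)

rng = np.random.default_rng(0)
D, d, m, l = 6, 4, 5, 3                          # hypothetical sizes
X = rng.normal(size=(D, m))
heads = [tuple(rng.normal(size=(d, D)) for _ in range(3)) for _ in range(l)]
C = multi_head_self_attention(X, heads)
```

Column j of `C` is the multi-head context vector c_j: the l single-head vectors for position j placed on top of each other.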

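(2) Multi-head attention
Multi-head attention is built in the same way: l single-head attention layers, each with its own parameters, whose outputs are concatenated.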

4. Building the Transformer with multi-head attention layers

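(1) Encoder: stacked self-attention layers
All in one picture:
[Figure: Transformer encoder]
Transformer encoder: a stack of several blocks, each composed of one multi-head self-attention layer and one dense (fully connected) layer.

Any number of blocks can be stacked. The Transformer's encoder has 6 blocks. Each block has two layers. Each block has its own parameters; parameters are not shared between blocks.

Details inside a block:
The same dense layer W_U is applied to every context vector c_1, …, c_m, producing u_1, …, u_m.
c_j depends on all of x_1, …, x_m, but it is usually influenced most by x_j (a word tends to be most relevant to itself). Remember that the role of self-attention is to capture syntactic features, phrase structure, and semantic features among the words of the same sentence.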

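Several "multi-head self-attention layer + dense layer" blocks can be stacked to form a deep network. The idea is the same as for multi-layer RNNs or multi-layer LSTMs: because the input and output sequence lengths are equal, these blocks can replace RNN models.

Moreover, each x_j has dimension D×1 and each u_j has dimension D×1. Each c_j has the dimension of the concatenated head outputs, not m×1; m is the number of input vectors, which is the length of each attention weight vector.

Because the input size and output size of a block are the same, a ResNet-style skip connection can be used, adding X to the block's output.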
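
A sketch of one encoder block with the skip connection, under stated assumptions: column vectors, l heads of size d with l·d = D, a ReLU activation in the dense layer, and no layer normalization or positional encoding (both are used in the original Transformer but not discussed here). All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, l = 8, 5, 2            # hypothetical: vector size, sequence length, heads
d = D // l                   # per-head size, so that l * d == D

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def single_head(X, W_Q, W_K, W_V):
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X
    return V @ softmax_cols(K.T @ Q)                      # (d, m)

def init_block():
    heads = [tuple(rng.normal(size=(d, D)) * 0.1 for _ in range(3))
             for _ in range(l)]
    W_U = rng.normal(size=(D, l * d)) * 0.1
    return heads, W_U

def encoder_block(X, heads, W_U):
    # Multi-head self-attention: stack the head outputs, c_j is (l*d) x 1.
    C = np.vstack([single_head(X, *W) for W in heads])    # (l*d, m)
    # Dense layer: the same W_U is applied to every column c_j.
    U = np.maximum(0.0, W_U @ C)                          # (D, m)
    # Skip connection: possible because U has the same shape as X.
    return X + U

X = rng.normal(size=(D, m))
heads, W_U = init_block()
out = encoder_block(X, heads, W_U)
```

Since `out` has the same shape as `X`, it can be passed directly into another block with its own parameters.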

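(2) Decoder: stacked self-attention + attention layers
Transformer's decoder: a stack of blocks, each composed of 3 layers: a self-attention layer, an attention layer, and a dense layer.

All in pictures:
[Figure: one decoder block]
The figure shows one block of the decoder. The Transformer's decoder is also a stack of 6 blocks. Each block has two inputs and one output. One input is the encoder's output matrix U, which is reused by every block; its size depends on m, the number of English words. The other input is the output of the previous block; its size depends on t, the number of German words generated so far.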

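(3) Transformer: put everything together
The Transformer's encoder is a stack of 6 encoder blocks. Each block has one input X and one output U; the input of each block is the output of the previous one, and the matrix size stays the same throughout. m is the number of English words. Each encoder block consists of one multi-head self-attention layer and one dense layer.

The Transformer's decoder is also a stack of 6 decoder blocks. Each block has two inputs and one output. One input is the encoder's output matrix U, which is reused by every block (m is the number of English words). The other input X' is the output of the previous decoder block (t is the number of German words generated so far). Each decoder block consists of one multi-head self-attention layer, one multi-head attention layer, and one dense layer.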
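
The wiring of the 6 + 6 blocks can be sketched as follows. To keep the example short it makes several simplifications that are assumptions, not part of the description above: one head per attention layer, no skip connections, no layer normalization, no positional encoding, and no masking in the decoder's self-attention. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, t, N_BLOCKS = 8, 5, 3, 6    # hypothetical sizes; 6 blocks as in the text

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def attn(X_kv, X_q, W):
    """Keys and values come from X_kv, queries come from X_q."""
    K, V, Q = W["K"] @ X_kv, W["V"] @ X_kv, W["Q"] @ X_q
    return V @ softmax_cols(K.T @ Q)

def dense(C, W_U):
    return np.maximum(0.0, W_U @ C)

def new_attn():
    return {name: rng.normal(size=(D, D)) * 0.1 for name in ("Q", "K", "V")}

def new_dense():
    return rng.normal(size=(D, D)) * 0.1

def encoder_block(X, p):
    return dense(attn(X, X, p["self"]), p["U"])            # (D, m)

def decoder_block(U, X_prime, p):
    C = attn(X_prime, X_prime, p["self"])   # self-attention over X'
    Z = attn(U, C, p["cross"])              # attention: k, v from U, q from C
    return dense(Z, p["U"])                                # (D, t)

# Every block gets its own parameters; nothing is shared between blocks.
enc_params = [{"self": new_attn(), "U": new_dense()} for _ in range(N_BLOCKS)]
dec_params = [{"self": new_attn(), "cross": new_attn(), "U": new_dense()}
              for _ in range(N_BLOCKS)]

def transformer(X, X_prime):
    U = X
    for p in enc_params:            # each block's output feeds the next block
        U = encoder_block(U, p)
    Y = X_prime
    for p in dec_params:            # every decoder block reuses the same U
        Y = decoder_block(U, Y, p)
    return U, Y

X = rng.normal(size=(D, m))         # m English words
X_prime = rng.normal(size=(D, t))   # t German words generated so far
U, Y = transformer(X, X_prime)
```

The encoder output keeps the shape of X and the decoder output keeps the shape of X', which is what allows the blocks to be stacked.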

5. Comparison with RNN Seq2Seq Model

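Both the Transformer and the RNN sequence model take two inputs and produce one output, and the sizes of the inputs and the output are the same in both models. This means the Transformer can be used for any task that an RNN sequence model can handle.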

6. Summary

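(1) Single-head attention, multi-head attention
[Figure: from single-head to multi-head self-attention]
Parameters are not shared between the single-head attention layers.
The figure above shows the step from single-head to multi-head self-attention; the same construction applies to attention.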

(2) Transformer's encoder

Parameters are not shared between blocks.
(3) Transformer's decoder

(4) Transformer

The Transformer is a Seq2Seq model; it has an encoder and a decoder.
The Transformer is not an RNN; it has no recurrent structure.
The Transformer has the same input and output sizes as an RNN Seq2Seq model, so it can be used for any task an RNN can do.
On machine translation benchmarks, the Transformer outperformed the best RNN-based models available when it was introduced.

Reference:
[1] https://www.bilibili.com/video/BV1qZ4y1H7Pg
[2] https://www.bilibili.com/video/BV1Ap4y1Q7nT