Before reading this post, see the previously reposted article for a detailed introduction to self-attention and a fairly complete summary of the Transformer.
With that self-attention background in place, the paper becomes much easier to follow.
Paper: Attention Is All You Need.
1-2 Introduction & Background
RNN: "This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples."
Workarounds (they treat the symptom, not the cause, since the fundamental constraint of sequential computation remains):
- factorization tricks.
- conditional computation.
CNNs have also been used as the basic building block, e.g. ByteNet and ConvS2S, but this makes it more difficult to learn dependencies between distant positions: the number of operations required to relate two arbitrary positions grows with the distance between them (linearly for ConvS2S, logarithmically for ByteNet).
History:
| Name | Notes | Limitations |
|---|---|---|
| seq2seq | | |
| encoder-decoder | the classic setup, usually paired with an RNN | |
| RNN / LSTM / GRU | direction: unidirectional or bidirectional; depth: single layer or multi-layer | struggles with long sequences; cannot be parallelized; alignment problems; the network has to compress all the necessary information of the source sentence into a fixed-length vector |
| CNN | parallel computation; handles variable-length sequences | memory-hungry; many tricks required; parameter tuning is hard on large datasets |
| Attention Mechanism | attends to a subset of vectors; addresses the alignment problem | |
Terms brought up along the way:
- self-attention (see the sketch after this list);
- recurrent attention mechanism;
- transduction models.
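
Since self-attention is the building block everything else rests on, a minimal single-head sketch may help as a reference. The formula Attention(Q, K, V) = softmax(QKᵀ/√d_k)V is the one the paper defines in Section 3.2; the function name, the projection matrices `w_q`, `w_k`, `w_v`, and the toy dimensions below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over one sequence x of shape (seq_len, d_model)."""
    q = x @ w_q                                      # queries, (seq_len, d_k)
    k = x @ w_k                                      # keys,    (seq_len, d_k)
    v = x @ w_v                                      # values,  (seq_len, d_v)
    d_k = q.size(-1)
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v                               # (seq_len, d_v)

# Toy usage: 5 tokens, d_model = 8, d_k = d_v = 4 (illustrative sizes only).
torch.manual_seed(0)
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 4) for _ in range(3))
print(scaled_dot_product_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 4])
```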
3 Model Architecture
Most neural sequence transduction models share an encoder-decoder structure:
Input sequence: x = (x1, …, xn), n symbols.
The encoder's continuous representations: z = (z1, …, zn), n of them.
The decoder's outputs: y = (y1, …, ym), m of them.
Generation happens one element at a time, auto-regressively: "consuming the previously generated symbols as additional input when generating the next."


The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
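
"Consuming the previously generated symbols as additional input" is easiest to see as a decoding loop. The sketch below shows greedy auto-regressive decoding; `encoder`, `decoder`, `bos_id`, `eos_id`, and the toy stand-ins are hypothetical names for illustration, not anything defined in the paper.

```python
import torch

def greedy_decode(encoder, decoder, src_ids, bos_id, eos_id, max_len=50):
    """Auto-regressive generation: the decoder re-consumes everything it has
    produced so far as additional input at every step."""
    # Encoder maps x = (x1, ..., xn) to continuous representations z = (z1, ..., zn).
    z = encoder(src_ids)                              # (n, d_model)
    ys = [bos_id]                                     # symbols generated so far
    for _ in range(max_len):
        tgt = torch.tensor(ys).unsqueeze(0)           # (1, len(ys))
        logits = decoder(tgt, z)                      # (1, len(ys), vocab_size)
        next_id = int(logits[0, -1].argmax())         # one element at a time
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys

# Toy stand-ins just to show the call shapes (vocab of 100, d_model of 512).
toy_encoder = lambda src: torch.randn(src.size(0), 512)
toy_decoder = lambda tgt, z: torch.randn(1, tgt.size(1), 100)
print(greedy_decode(toy_encoder, toy_decoder, torch.tensor([5, 17, 3]), bos_id=1, eos_id=2))
```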
3.1 Encoder and Decoder Stacks


Encoder: a stack of N = 6 identical layers. Each layer has two sub-layers (from bottom to top):
- a multi-head self-attention mechanism.
- a simple, position-wise fully connected feed-forward network (referred to below as the FFN).
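
A minimal sketch of one such encoder layer, assuming PyTorch's `torch.nn.MultiheadAttention` for the multi-head self-attention sub-layer; the residual connection plus layer normalization around each sub-layer and the default sizes (d_model = 512, h = 8, d_ff = 2048) follow the paper, but this is an illustrative re-implementation, not the authors' code.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One of the N = 6 identical encoder layers: multi-head self-attention,
    then a position-wise FFN, each wrapped in a residual connection followed
    by layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(              # applied to each position separately
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # sub-layer 1: multi-head self-attention
        x = self.norm1(x + attn_out)           # residual + LayerNorm
        x = self.norm2(x + self.ffn(x))        # sub-layer 2: position-wise FFN
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])   # the N = 6 stack
print(encoder(torch.randn(2, 10, 512)).shape)                  # torch.Size([2, 10, 512])
```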
