Attention Is All You Need
- Paper link. This post is only a record of my own learning process; if there are mistakes, corrections are very welcome.
1. Background
Over the past few years, recurrent neural networks such as LSTM and GRU have played a major role in NLP, but these models have several drawbacks: first, memory constraints limit the batch size that can be used during training; second, they are computationally expensive; and third, an inherent flaw of the architecture, the connection the model can establish between two elements of a sequence depends on the distance between them. This is where the attention mechanism comes in: it lets the relationship between two elements of a sequence no longer be determined by their distance.
2. Explaining the title
Attention Is All You Need: attention is all you need. Until recently, the attention mechanism was usually combined with recurrent neural networks; for example, in my model for the Global AI Challenger competition, attention was one layer inside the network and played its own specific role. With the words "all you need", the authors emphasize that a model can be built from attention alone, without any LSTM or GRU. This model is called the Transformer.
3. Model architecture
The figure (Figure 1) is reproduced directly from the paper.
- Encoder and Decoder Stacks (quoted directly from the paper; the original text is already so concise that it cannot be summarized any further)
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
The left part of Figure 1 corresponds to one encoder layer. Each layer consists of two sub-parts: the lower half is the attention sub-layer and the upper half is the feed-forward sub-layer. Both sub-layers are wrapped in residual connections, analogous to residual networks (ResNet) in computer vision, whose purpose is to make the network easier to train; the layer normalization is likewise there to ease training.
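As a rough sketch of how one encoder layer is wired (my own illustrative NumPy code, not the paper's implementation; `self_attention` and `feed_forward` are assumed to be callables standing in for the two sub-layers):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's d_model-dimensional vector
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    """x has shape (seq_len, d_model); self_attention and feed_forward
    are placeholders for the two sub-layers described above."""
    # sub-layer 1: multi-head self-attention, wrapped as LayerNorm(x + Sublayer(x))
    x = layer_norm(x + self_attention(x))
    # sub-layer 2: position-wise feed-forward, wrapped the same way
    x = layer_norm(x + feed_forward(x))
    return x

# toy usage with placeholder sub-layers, seq_len=10 and d_model=512
x = np.random.randn(10, 512)
out = encoder_layer(x, self_attention=lambda t: t, feed_forward=np.tanh)
```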
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
The right part of Figure 1 corresponds to one decoder layer, which consists of three sub-parts. Two of them are similar to those in the encoder, so we focus on the masked attention sub-layer: its purpose is to make the prediction for position i depend only on the positions before i, i.e. attention is restricted to the earlier positions and the later positions are masked out.
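The masking itself can be sketched roughly as follows (my own illustrative NumPy code, not the paper's implementation): before the softmax over the attention scores, every position after the current one is pushed to a very large negative value, so that its weight becomes (almost) zero.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_mask(scores):
    """scores has shape (seq_len, seq_len); scores[i, j] is the raw
    attention score of position i attending to position j."""
    seq_len = scores.shape[0]
    # entries with j > i correspond to future positions
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    # push them towards -inf so the softmax assigns them ~0 weight
    return np.where(future, -1e9, scores)

# each row i of the resulting weights only attends to positions <= i
weights = softmax(causal_mask(np.random.randn(5, 5)))
```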
4. Attention
We define this class of functions as attention. Below is a rough sketch in Python code; it may not be entirely rigorous...
```python
import numpy as np

def attention(query, dictionary, h=np.dot):
    """
    dictionary is a Python dict of key -> value pairs; the query, every key,
    every value and the output are all vectors. Keys are stored as tuples so
    that they can serve as dict keys; h is the compatibility function that
    scores the query against a key (a plain dot product by default).
    """
    result = np.zeros_like(next(iter(dictionary.values())))
    for key, value in dictionary.items():
        # weight each value by how well the query matches its key
        result = result + h(query, np.asarray(key)) * value
    return result
```
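A tiny usage example with made-up numbers, just to show how the function above would be called (the keys are tuples so that they can live in a Python dict):

```python
# hypothetical toy dictionary: two key-value pairs, 3-dimensional vectors
d = {
    (1.0, 0.0, 0.0): np.array([1.0, 2.0, 3.0]),
    (0.0, 1.0, 0.0): np.array([4.0, 5.0, 6.0]),
}
q = np.array([0.9, 0.1, 0.0])
print(attention(q, d))  # weighted sum of the two values -> [1.3 2.3 3.3]
```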
- Attention function 1
The illustration below is quoted from the paper; the attention function above is based on the following formula.