Paper: https://arxiv.org/abs/1706.03762
Code implementation: https://github.com/Kyubyong/transformer
Summary notes following the structure of the original paper.
# 1. Model Architecture
1.1. Encoder & Decoder Stacks
stacks = 6 identical layers each for the encoder and the decoder (N = 6)
sublayers (per encoder layer) = multi-head self-attention + position-wise feed-forward network
Embedding dimension d_model = 512
Main points to study in the code: 1. self-attention 2. positional encoding 3. masking
Encoder input = [batch_size, seq_length, 512]
Encoder output = [batch_size, seq_length, 512] — this output supplies the keys and values for the decoder's encoder-decoder attention (see 1.2.a below).
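
A minimal NumPy sketch of the sinusoidal positional encoding from Section 3.5 of the paper; the function name and shapes are illustrative, not taken from the linked repo.

```python
import numpy as np

def positional_encoding(seq_length, d_model=512):
    """Sinusoidal positional encoding (paper, Section 3.5).

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Returns [seq_length, d_model], added to the token embeddings.
    """
    pos = np.arange(seq_length)[:, None]                 # [seq_length, 1]
    i = np.arange(d_model)[None, :]                      # [1, d_model]
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                           # [seq_length, d_model]
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even dims: sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd dims: cos
    return pe

# Broadcasts over the batch axis of embeddings shaped [batch_size, seq_length, 512]
pe = positional_encoding(seq_length=10)
```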
1.2. Attention
Number of heads h = 8, so each head works in d_k = d_v = d_model / h = 512 / 8 = 64 dimensions.
a. Decoder layer = 3 sublayers:
self-attention + encoder-decoder attention + feed-forward network
Encoder-decoder attention: keys and values come from the encoder output; queries come from the previous decoder layer.
b. Encoder self-attention: keys, values, and queries all come from the same place (the output of the previous encoder layer).
c. Decoder self-attention: during training, a mask ensures that each position in the decoder can only see the words to its left (earlier positions); see the mask sketch below.
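
A quick sketch of that look-ahead mask (NumPy; names are illustrative): a lower-triangular matrix in which entry [t, s] is True exactly when position t is allowed to attend to position s, i.e. s <= t.

```python
import numpy as np

def look_ahead_mask(seq_length):
    """Decoder self-attention mask: [t, s] is True iff s <= t,
    so position t can only attend to itself and earlier positions."""
    return np.tril(np.ones((seq_length, seq_length), dtype=bool))

# In the attention step, disallowed positions get a large negative score
# before the softmax, so their attention weight becomes ~0:
# scores = np.where(look_ahead_mask(L), scores, -1e9)
```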
Code implementation, focusing mainly on self-attention:
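
As a starting point, here is a hedged NumPy sketch of scaled dot-product attention (Eq. 1 in the paper) and an 8-head self-attention wrapper. It is a simplified re-derivation from the paper's formulas, not the code from the Kyubyong repo; the projection matrices are random placeholders standing in for the learned parameters W^Q, W^K, W^V, W^O.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V  (paper, Eq. 1).

    q: [..., seq_q, d_k], k: [..., seq_k, d_k], v: [..., seq_k, d_v]
    mask: broadcastable boolean array, True where attending is allowed.
    """
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)   # [..., seq_q, seq_k]
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # block masked positions
    weights = softmax(scores, axis=-1)
    return weights @ v                               # [..., seq_q, d_v]

def multi_head_self_attention(x, num_heads=8):
    """Self-attention: Q, K, V are all projections of the same input x.

    x: [batch_size, seq_length, d_model]; d_k = d_v = d_model / h = 64.
    """
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    # Random placeholders for the learned projections W^Q, W^K, W^V, W^O.
    w_q, w_k, w_v, w_o = (rng.normal(0.0, 0.02, (d_model, d_model))
                          for _ in range(4))

    def split_heads(t):  # [b, s, d_model] -> [b, h, s, d_head]
        return t.reshape(batch, seq_len, num_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    heads = scaled_dot_product_attention(q, k, v)     # [b, h, s, d_head]
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return concat @ w_o                               # [b, s, d_model]

# Shapes match the notes above: input and output are [batch_size, seq_length, 512]
x = np.random.randn(2, 10, 512)
out = multi_head_self_attention(x)
assert out.shape == (2, 10, 512)
```

Encoder-decoder attention reuses the same scaled_dot_product_attention, except that k and v are projections of the encoder output while q comes from the previous decoder layer, exactly as noted in 1.2.a.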