What do the Encoder and Decoder in a Transformer actually do?
A: Take translation as an example: Chinese → English, e.g. "我爱你" → "I love you". The process splits into two stages, training and inference:
Training stage:
Encoder: receives the Chinese input and computes an embedding for each token; the self-attention mechanism inside it captures the contextual relationships between tokens.
Decoder: receives the English input together with the Encoder's representation of the Chinese input, and is trained in parallel: it predicts "I" from "<s>" + "我爱你", "love" from "<s> I" + "我爱你", "you" from "<s> I love" + "我爱你", and so on. (A mask matrix is used to hide "future" tokens.)
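The mask matrix mentioned above can be sketched as follows; this is a minimal NumPy illustration (not code from the original post), where positions above the diagonal are set to negative infinity so that softmax assigns them zero attention weight:

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may only attend to positions <= i.

    Masked entries are set to -inf so softmax gives them zero weight.
    """
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

def masked_softmax(scores, mask):
    """Add the mask to raw attention scores, then softmax each row."""
    scores = scores + mask
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy scores for the 3-token target "<s> I love": each row is one query position.
scores = np.random.randn(3, 3)
weights = masked_softmax(scores, causal_mask(3))
# Row 0 ("<s>") attends only to itself; row 2 sees all three positions.
```

This is what lets the decoder train on all target positions in parallel: every row of the score matrix is computed at once, but the mask guarantees no position can peek at "future" tokens.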
Inference stage:
Much like training, the input still has to be encoded by the encoder; the difference is that the decoder generates "I", "love", "you" serially, one token at a time, rather than in parallel.
(An important point: how does the model know which tokens in the input sequence come first and which come later? Answer: positional embeddings.)
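The positional embeddings mentioned in the parenthetical can be made concrete with the fixed sinusoidal scheme from "Attention Is All You Need" (one common choice; learned embeddings are another). A minimal NumPy sketch, not from the original post:

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings from 'Attention Is All You Need':

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)  # even dimensions
    pe[:, 1::2] = np.cos(positions / div)  # odd dimensions
    return pe

pe = sinusoidal_position_encoding(seq_len=50, d_model=16)
# The encoding is simply added to the token embeddings before the first layer,
# giving each position a unique, order-aware signature.
```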
How should the two kinds of attention in the Decoder be understood?
From The Transformer Family Lil’Log
A: From bottom to top: the first attention (masked self-attention) only strengthens the relationships among the target-language tokens themselves; the second, in my view, builds the relationship between the input and the output.
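The distinction between the two can be sketched in NumPy (an illustration under my own variable names, not from the cited post): both are the same scaled dot-product attention, differing only in where the queries, keys, and values come from.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
dec_states = rng.standard_normal((3, d))  # 3 target-side positions ("I love you")
enc_states = rng.standard_normal((4, d))  # 4 source-side positions (encoder output)

# First attention (self-attention): Q, K, V all come from the decoder states,
# relating target-language tokens to each other.
self_out = attention(dec_states, dec_states, dec_states)

# Second attention (cross-attention): Q comes from the decoder, while K and V
# come from the encoder output, so each target token attends to the source sentence.
cross_out = attention(dec_states, enc_states, enc_states)
```

Note that the output length always follows the queries (3 target positions here), while the keys and values determine what is being attended over.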
What does the feed-forward neural network in Transformer architecture actually learn?
A1: Actually, as far as I understand, the Self-Attention layer doesn't learn representations based on the surrounding words (what you call contextualized meaning embeddings). It's merely a way to calculate a similarity score between each word and the rest of the words. That is then normalized (with softmax) to become an attention score, and each word is then weighted by its similarities to other words.
The input to the feed-forward layer consists of the input embeddings to the self-attention layer, but now weighted by similarity scores. The feed-forward network then creates a new representation (via a nonlinear transform, as opposed to the self-attention layer) based on the information from the self-attention layer, which yields the "contextualized embeddings". (from Reddit)
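The nonlinear transform described above is the position-wise feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position independently. A minimal NumPy sketch with made-up dimensions (not from the quoted answer):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position.

    Every row (token position) goes through the same two-layer network,
    with no interaction between positions -- that mixing already happened
    in the attention layer.
    """
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5   # hypothetical sizes for illustration
x = rng.standard_normal((seq_len, d_model))   # attention-weighted representations

W1 = rng.standard_normal((d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)); b2 = np.zeros(d_model)

out = position_wise_ffn(x, W1, b1, W2, b2)   # new per-token representation
```

The ReLU between the two linear maps is what makes this a nonlinear transform, in contrast to the attention layer's weighted averaging.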
A2: Understanding BERT Transformer: Attention isn’t all you need