What do the Encoder and Decoder in a Transformer actually do?
A: Take translation as an example: Chinese → English, e.g. "我爱你" → "I love you". The process splits into two stages, training and inference:
Training stage:
Encoder: receives the Chinese input and computes an embedding for each token; the self-attention mechanism inside it captures the contextual relationships between tokens.
Decoder: receives the English input together with the Encoder's representation of the Chinese input, and is trained in parallel: it predicts "I" from "<s>" + "我爱你", "love" from "<s> I" + "我爱你", "you" from "<s> I love" + "我爱你", and so on. (A mask matrix is used to hide "future" tokens.)
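The mask matrix mentioned above can be sketched as follows; this is a minimal NumPy illustration (not code from the original post), where positions above the diagonal are set to negative infinity so that softmax assigns them zero attention weight:

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may only attend to positions <= i.

    Masked entries are set to -inf so softmax gives them zero weight.
    """
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

def masked_softmax(scores, mask):
    """Add the mask to raw attention scores, then softmax each row."""
    scores = scores + mask
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy scores for the 3-token target "<s> I love": each row is one query position.
scores = np.random.randn(3, 3)
weights = masked_softmax(scores, causal_mask(3))
# Row 0 ("<s>") attends only to itself; row 2 sees all three positions.
```

This is what lets the decoder train on all target positions in parallel: every row of the score matrix is computed at once, but the mask guarantees no position can peek at "future" tokens.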
Inference stage:
Much like training, the input still has to be encoded by the encoder; the difference is that the decoder generates "I", "love", "you" serially, one token at a time, rather than in parallel.
(An important point: how does the model know which tokens in the input sequence come first and which come later? Answer: positional embeddings.)
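The positional embeddings mentioned in the parenthetical can be made concrete with the fixed sinusoidal scheme from "Attention Is All You Need" (one common choice; learned embeddings are another). A minimal NumPy sketch, not from the original post:

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings from 'Attention Is All You Need':

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)  # even dimensions
    pe[:, 1::2] = np.cos(positions / div)  # odd dimensions
    return pe

pe = sinusoidal_position_encoding(seq_len=50, d_model=16)
# The encoding is simply added to the token embeddings before the first layer,
# giving each position a unique, order-aware signature.
```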
How should the two kinds of attention in the Decoder be understood?
From The Transformer Family Lil’Log
A: From bottom to top: the first attention (masked self-attention) only strengthens the relationships among the target-language tokens themselves; the second, in my view, builds the relationship between the input and the output.
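The distinction between the two can be sketched in NumPy (an illustration under my own variable names, not from the cited post): both are the same scaled dot-product attention, differing only in where the queries, keys, and values come from.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
dec_states = rng.standard_normal((3, d))  # 3 target-side positions ("I love you")
enc_states = rng.standard_normal((4, d))  # 4 source-side positions (encoder output)

# First attention (self-attention): Q, K, V all come from the decoder states,
# relating target-language tokens to each other.
self_out = attention(dec_states, dec_states, dec_states)

# Second attention (cross-attention): Q comes from the decoder, while K and V
# come from the encoder output, so each target token attends to the source sentence.
cross_out = attention(dec_states, enc_states, enc_states)
```

Note that the output length always follows the queries (3 target positions here), while the keys and values determine what is being attended over.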
What does the feed-forward neural network in Transformer architecture actually learn?
A1: Actually, as far as I understand, the Self-Attention layer doesn't learn representations based on the surrounding words (what you call contextualized meaning embeddings). It's merely a way to calculate a similarity score between each word and the rest of the words. That is then normalized (with softmax) to become an attention score, and each word is then weighted by its similarities to other words.
The input to the feed-forward layer consists of the input embeddings to the self-attention layer, but now weighted by similarity scores. The feed-forward network then creates a new representation (via a nonlinear transform, as opposed to the self-attention layer) based on the information from the self-attention layer, which yields the "contextualized embeddings". (from Reddit)
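The nonlinear transform described above is the position-wise feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position independently. A minimal NumPy sketch with made-up dimensions (not from the quoted answer):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position.

    Every row (token position) goes through the same two-layer network,
    with no interaction between positions -- that mixing already happened
    in the attention layer.
    """
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5   # hypothetical sizes for illustration
x = rng.standard_normal((seq_len, d_model))   # attention-weighted representations

W1 = rng.standard_normal((d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)); b2 = np.zeros(d_model)

out = position_wise_ffn(x, W1, b1, W2, b2)   # new per-token representation
```

The ReLU between the two linear maps is what makes this a nonlinear transform, in contrast to the attention layer's weighted averaging.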
A2: Understanding BERT Transformer: Attention isn’t all you need