Q&A about Transformer

What exactly do the Encoder and Decoder in the Transformer do?

A: Take a translation task as an example: Chinese → English, e.g. "我爱你" → "I love you". There are two phases, training and inference:
Training:
Encoder: takes the Chinese input and computes an embedding for each token; the self-attention mechanism inside it captures the contextual relationships between tokens.
Decoder: takes the English input together with the Encoder's embeddings of the Chinese input, and is trained in parallel: from "&lt;s&gt;" + "我爱你" it predicts "I", from "&lt;s&gt; I" + "我爱你" it predicts "love", from "&lt;s&gt; I love" + "我爱你" it predicts "you", and so on. (A mask matrix is used to hide the "future" tokens.)
Inference:
Much like training, the input must first be encoded by the encoder; the difference is that the decoder generates "I", "love", "you" serially rather than in parallel.
(One important point: how does the model know which tokens in the sequence come first and which come later? Answer: position embeddings.)
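The two ingredients mentioned above, the "future"-hiding mask matrix and the position embedding, can be sketched in NumPy. This is a minimal illustration (function names and shapes are my own, not from the original post); the positional encoding follows the sinusoidal formula from "Attention Is All You Need":

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular mask: position i may attend only to positions <= i.
    # 0.0 where attention is allowed, -inf where the "future" is hidden
    # (added to the attention scores before softmax).
    upper = np.triu(np.ones((n, n)), k=1)   # 1s strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

def sinusoidal_pe(seq_len, d_model):
    # Sinusoidal position embedding: even dims get sin, odd dims get cos,
    # at geometrically spaced frequencies.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

# For the 3-token prefix "<s> I love": row 0 ("<s>") sees only itself,
# row 2 ("love") sees all three tokens.
m = causal_mask(3)
pe = sinusoidal_pe(3, 8)
```

With this mask, all target positions can be trained in one forward pass (teacher forcing), which is exactly why training is parallel while generation is serial.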

How should we understand the two kinds of attention in the Decoder?

[Figure: Transformer architecture, from "The Transformer Family", Lil'Log]
A: From bottom to top: the first attention only reinforces the relationships among the target-language tokens themselves; the second, in my view, builds the relationship between the input and the output.
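The distinction above can be shown with plain scaled dot-product attention: the two decoder sublayers run the same computation, differing only in where the keys and values come from. A minimal sketch (the helper and the random shapes are my own, not from the original post):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
tgt = rng.normal(size=(3, 8))  # 3 target-side (English) token vectors
src = rng.normal(size=(4, 8))  # 4 encoder outputs for the source (Chinese)

# First sublayer: target attends to target (masked self-attention).
self_out = attention(tgt, tgt, tgt)
# Second sublayer: target queries the encoder output (cross-attention).
cross_out = attention(tgt, src, src)
```

Note that in cross-attention the output length follows the queries (3 target positions), while the information mixed in comes from the 4 source positions.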

What does the feed-forward neural network in Transformer architecture actually learn?

A1: Actually, as far as I understand, the self-attention layer doesn't learn representations based on the surrounding words (what you call contextualized meaning embeddings). It's merely a way to calculate a similarity score between each word and the rest of the words. That is then normalized (with softmax) to become an attention score, and each word is then weighted by its similarities to other words.
The input to the feed-forward layer is the set of input embeddings to the self-attention layer, but now weighted by similarity scores. The feed-forward network then creates a new representation (via a nonlinear transform, as opposed to the self-attention layer) based on the information from the self-attention layer, which is now the "contextualized embeddings". (from Reddit)
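The "position-wise" nature of that feed-forward network can be made concrete: the same two-layer MLP is applied to every token vector independently, so it never mixes information across positions (that mixing happened in attention). A minimal sketch, with weights and shapes of my own choosing:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: the same two-layer ReLU MLP applied
    # to each row (token vector) of x independently.
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)

x = rng.normal(size=(3, d_model))  # 3 attention-weighted token vectors
y = ffn(x, W1, b1, W2, b2)

# Feeding a single token through alone gives the same row as the batch,
# confirming the transform acts per position.
y_first = ffn(x[:1], W1, b1, W2, b2)
```

So the nonlinearity is applied to vectors that are already attention-mixed, which is what turns the similarity-weighted sums into new "contextualized" representations.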

A2: Understanding BERT Transformer: Attention isn’t all you need
