Transformer
Seq2seq
- The output length is determined by the model
- Speech Recognition
- Machine Translation
- Speech Translation (for languages without a written form)
- Question Answering
- Many NLP problems can be solved by seq2seq
- Syntactic Parsing
the output is a tree, which can be regarded as a sequence, so parsing can be treated as a seq2seq problem
- Multi-label Classification
An object can belong to multiple classes
- Encoder
The encoder takes a sequence of vectors as input and outputs another sequence of vectors of the same length. The input first goes through positional encoding to add position information, then through a self-attention layer. The self-attention output is passed through a residual connection (the input is added to the output) and then normalization, which gives the input to the fully connected layer; after the fully connected layer there is another add & norm step, producing the final encoder output vectors (a minimal sketch of one encoder block is given after the list below).
1. The block in the net:
(1) residual: add the block's input to the processed output
(2) norm (layer normalization): $x'_i = \frac{x_i - m}{\sigma}$, where $m$ and $\sigma$ are the mean and standard deviation of the elements of the vector
the final block architecture:
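Below is a minimal NumPy sketch of one encoder block (self-attention, then add & norm, then a fully connected layer, then another add & norm). The function names, weight matrices, and shapes are illustrative assumptions, not from the original notes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # x'_i = (x_i - m) / sigma, computed per vector
    m = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - m) / (sigma + eps)

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model); single-head attention for simplicity
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def encoder_block(x, Wq, Wk, Wv, W1, b1, W2, b2):
    # 1) self-attention, then residual add + layer norm
    a = self_attention(x, Wq, Wk, Wv)
    x = layer_norm(x + a)
    # 2) position-wise fully connected layer, then residual add + layer norm
    h = np.maximum(0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + h)
```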
- Decoder
There are two kinds of decoders: the first is autoregressive (AT), the other is non-autoregressive (NAT).
- Autoregressive
Take speech recognition as an example:
the encoder's input is the speech waveform, and its output is the corresponding sequence of vectors;
the decoder's inputs are the encoder's outputs together with one-hot vectors. First, a BEGIN symbol in one-hot form is fed to the decoder; the decoder produces an output that goes through a softmax to give a distribution over the vocabulary, and the resulting token (e.g. "机") then becomes the next decoder input, and the process continues.
The difference between the decoder and the encoder is that the decoder's input first passes through a masked self-attention layer. It differs from ordinary self-attention in that each vector can only attend to the vectors before it, not to all vectors before and after it. Because, as in the speech recognition example above, the decoder's inputs arrive one vector at a time, each vector cannot yet see the later inputs, so it cannot attend to them.
Masked Self-attention
As shown in the figure above, the inputs arrive one by one, so we cannot feed all the $a^i$ in at once.
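A minimal sketch of the masking idea, assuming a single attention head; the mask blocks attention to future positions so each vector only attends to itself and earlier vectors.

```python
import numpy as np

def masked_self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model); each position may only attend to itself and earlier positions
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # set scores for future positions to a large negative value so softmax gives them zero weight
    seq_len = x.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```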
Another issue is that in seq2seq the input and output lengths differ, so how do we decide when to stop producing output? In practice, we prepare a special end symbol (END); when the decoder outputs this END symbol, generation stops.
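A minimal sketch of the autoregressive decoding loop with BEGIN and END tokens. Here `decoder_step` stands in for the full decoder (masked self-attention, cross attention, FC, softmax) and its interface is an assumption for illustration.

```python
def greedy_decode(encoder_outputs, decoder_step, begin_id, end_id, max_len=100):
    """decoder_step(encoder_outputs, tokens) -> probability distribution over the
    vocabulary for the next token (assumed interface)."""
    tokens = [begin_id]
    for _ in range(max_len):
        probs = decoder_step(encoder_outputs, tokens)
        next_token = int(probs.argmax())   # greedy: take the most likely token
        if next_token == end_id:           # END symbol: stop generating
            break
        tokens.append(next_token)          # feed the new token back in as the next input
    return tokens[1:]                      # drop BEGIN
```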
- Non-autoregressive (NAT)
For NAT, the idea is to feed in a whole row of BEGIN tokens at once, rather than generating the sequence one token at a time (see the sketch after the bullets below).
- We do not know the output length:
(1) another predictor for output length
(2) output a long sequence including END, and ignore the tokens after END
- Advantages:
(1) parallel
(2) controllable output length
- NAT is usually worse than AT
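A minimal sketch contrasting NAT with the autoregressive loop above: the output length is predicted first, then all positions are decoded in parallel from a row of BEGIN tokens. `length_predictor` and `nat_decoder` are assumed components, named here only for illustration.

```python
def nat_decode(encoder_outputs, length_predictor, nat_decoder, begin_id):
    # predict the output length from the encoder outputs
    n = length_predictor(encoder_outputs)
    # feed a whole row of BEGIN tokens at once; all positions are decoded in parallel
    begin_tokens = [begin_id] * n
    probs = nat_decoder(encoder_outputs, begin_tokens)   # (n, vocab_size)
    return [int(p.argmax()) for p in probs]
```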
3. Encoder-Decoder
Next is the question of how the encoder and the decoder are connected:
all of the encoder's outputs and one decoder vector are fed into the cross attention layer
1. Cross attention
This is the cross attention operation. From the figure above we can see that the inputs of the attention layer in the middle come from both the encoder's outputs and the decoder's input. For example, in the figure below, the decoder first takes BEGIN as input and transforms it into a query $q$; the attention weights $\alpha'_1, \alpha'_2, \alpha'_3$ are computed from $q$ and the encoder outputs, and cross attention uses them to take a weighted sum over the encoder outputs, giving a new vector $v$, which is what we finally feed into the FC layer. The next token ("机") is then fed into the decoder as input, again goes through cross attention with the encoder outputs on the left, and so on until we obtain the decoder's final output.
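A minimal NumPy sketch of one cross attention step: the query comes from the decoder vector, while keys and values come from the encoder outputs; the weight matrices are illustrative assumptions.

```python
import numpy as np

def cross_attention(decoder_vec, encoder_outputs, Wq, Wk, Wv):
    # decoder_vec: (d_model,)   encoder_outputs: (src_len, d_model)
    q = decoder_vec @ Wq                      # query from the decoder
    k = encoder_outputs @ Wk                  # keys from the encoder outputs
    v = encoder_outputs @ Wv                  # values from the encoder outputs
    scores = k @ q / np.sqrt(k.shape[-1])     # one score per encoder position
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # alpha'_i after softmax
    return alpha @ v                          # weighted sum v, fed to the FC layer
```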
- Training
- Teacher Forcing
During training, the decoder's input at each step is the correct answer (the ground-truth token), and the target output is also the correct answer; this scheme is called teacher forcing.
Then, in the testing process, the decoder's first input is the BEGIN symbol.
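A minimal PyTorch-style sketch of one teacher forcing training step, assuming a `model(src, decoder_input)` that returns per-position logits; the names and interface are illustrative, not from the original notes.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_step(model, optimizer, src, tgt, begin_id):
    # tgt: ground-truth token ids, shape (batch, tgt_len)
    begin = torch.full((tgt.size(0), 1), begin_id, dtype=tgt.dtype, device=tgt.device)
    decoder_input = torch.cat([begin, tgt[:, :-1]], dim=1)  # BEGIN + ground truth shifted right
    logits = model(src, decoder_input)                       # (batch, tgt_len, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```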
- tips:
(1) chat-bot: copy some words from the question
(2) guided attention
ex: speech recognition
the attention should focus on the beginning of the input first
solution: monotonic attention, location-aware attention
(3) Beam Search
greedy decoding: select the token with the maximum probability at every step
solution: use beam search instead (see the sketch below)
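A minimal beam search sketch under the same assumed `decoder_step` interface as the greedy loop above; instead of keeping only the single best token at each step, it keeps the `beam_size` most probable partial sequences.

```python
import math

def beam_search(encoder_outputs, decoder_step, begin_id, end_id, beam_size=3, max_len=100):
    # each beam is (tokens, log probability); start from BEGIN
    beams = [([begin_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_id:          # finished beams are kept as-is
                candidates.append((tokens, score))
                continue
            probs = decoder_step(encoder_outputs, tokens)
            for tok, p in enumerate(probs):
                candidates.append((tokens + [tok], score + math.log(p + 1e-12)))
        # keep the beam_size best partial sequences instead of only the single best
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t[-1] == end_id for t, _ in beams):
            break
    return beams[0][0][1:]                    # best sequence without BEGIN
```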
(4) optimizing evaluation metrics
In the training process, we use cross entropy as the loss. But in the end, we use the BLEU score to evaluate the result.
BLEU score: compares the output with the correct answer, but it is not differentiable.
solution: if we really want to optimize the BLEU score, treat it as a reward and use an RL method (see the sketch below).
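Because BLEU is not differentiable, a common setup is to keep cross entropy as the training loss and use BLEU only for validation and model selection. A minimal sketch using NLTK's `sentence_bleu` (assuming NLTK is installed); the helper name is illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def validation_bleu(pairs):
    """pairs: list of (hypothesis_tokens, reference_tokens); returns average sentence BLEU."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([ref], hyp, smoothing_function=smooth)
        for hyp, ref in pairs
    ]
    return sum(scores) / len(scores)
```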
(5) Exposure bias
For the decoder, if every token it sees during training is correct, then at test time, once one generated token is wrong, the outputs that follow are likely to be wrong as well, since the decoder has never seen its own mistakes during training.