Transformer
Seq2seq
- The output length is determined by the model
- Speech Recognition
- Machine Translation
- Speech Translation (for languages without a written form)
- Question Answering
- Many NLP problems can be solved by seq2seq
- Syntactic Parsing
the output is a tree, which can be regarded as a sequence, so parsing can be treated as a seq2seq problem
- Multi-label Classification
An object can belong to multiple classes
- Encoder
The encoder takes a sequence of vectors as input and outputs another sequence of vectors of the same length. The input first goes through positional encoding to add position information, then through a self-attention layer. The self-attention output is passed through a residual connection (the input is added to the output) and then normalization, which gives the input to the fully connected layer; after the fully connected layer there is another add & norm step, producing the final encoder output vectors (a minimal sketch of one encoder block is given after the list below).
1. The block in the net:
(1) residual: add the block's input to the processed output
(2) norm (layer normalization): $x'_i = \frac{x_i - m}{\sigma}$, where $m$ and $\sigma$ are the mean and standard deviation of the elements of the vector
the final block architecture:
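Below is a minimal NumPy sketch of one encoder block (self-attention, then add & norm, then a fully connected layer, then another add & norm). The function names, weight matrices, and shapes are illustrative assumptions, not from the original notes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # x'_i = (x_i - m) / sigma, computed per vector
    m = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - m) / (sigma + eps)

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model); single-head attention for simplicity
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def encoder_block(x, Wq, Wk, Wv, W1, b1, W2, b2):
    # 1) self-attention, then residual add + layer norm
    a = self_attention(x, Wq, Wk, Wv)
    x = layer_norm(x + a)
    # 2) position-wise fully connected layer, then residual add + layer norm
    h = np.maximum(0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + h)
```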
- Decoder
There are two kinds of decoders: the first is autoregressive (AT), the other is non-autoregressive (NAT).
- Autoregressive
Take speech recognition as an example:
the encoder's input is the speech waveform, and its output is the corresponding sequence of vectors;
the decoder's inputs are the encoder's outputs together with one-hot vectors. First, a BEGIN symbol in one-hot form is fed to the decoder; the decoder produces an output that goes through a softmax to give a distribution over the vocabulary, and the resulting token (e.g. "机") then becomes the next decoder input, and the process continues.
The difference between the decoder and the encoder is that the decoder's input first passes through a masked self-attention layer. It differs from ordinary self-attention in that each vector can only attend to the vectors before it, not to all vectors before and after it. Because, as in the speech recognition example above, the decoder's inputs arrive one vector at a time, each vector cannot yet see the later inputs, so it cannot attend to them.
Masked Self-attention
As shown in the figure above, the inputs arrive one by one, so we cannot feed all the $a^i$ in at once.
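A minimal sketch of the masking idea, assuming a single attention head; the mask blocks attention to future positions so each vector only attends to itself and earlier vectors.

```python
import numpy as np

def masked_self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model); each position may only attend to itself and earlier positions
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # set scores for future positions to a large negative value so softmax gives them zero weight
    seq_len = x.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```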
Another issue is that in seq2seq the input and output lengths differ, so how do we decide when to stop producing output? In practice, we prepare a special end symbol (END); when the decoder outputs this END symbol, generation stops.
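A minimal sketch of the autoregressive decoding loop with BEGIN and END tokens. Here `decoder_step` stands in for the full decoder (masked self-attention, cross attention, FC, softmax) and its interface is an assumption for illustration.

```python
def greedy_decode(encoder_outputs, decoder_step, begin_id, end_id, max_len=100):
    """decoder_step(encoder_outputs, tokens) -> probability distribution over the
    vocabulary for the next token (assumed interface)."""
    tokens = [begin_id]
    for _ in range(max_len):
        probs = decoder_step(encoder_outputs, tokens)
        next_token = int(probs.argmax())   # greedy: take the most likely token
        if next_token == end_id:           # END symbol: stop generating
            break
        tokens.append(next_token)          # feed the new token back in as the next input
    return tokens[1:]                      # drop BEGIN
```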
- Non-autoregressive (NAT)
For NAT, the idea is to feed in a whole row of BEGIN tokens at once, rather than generating the sequence one token at a time (see the sketch after the bullets below).
- We do not know the output length:
(1) another predictor for output length
(2) output a long sequence including END, and ignore the tokens after END
- Advantages:
(1) parallel
(2) controllable output length
- NAT is usually worse than AT
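A minimal sketch contrasting NAT with the autoregressive loop above: the output length is predicted first, then all positions are decoded in parallel from a row of BEGIN tokens. `length_predictor` and `nat_decoder` are assumed components, named here only for illustration.

```python
def nat_decode(encoder_outputs, length_predictor, nat_decoder, begin_id):
    # predict the output length from the encoder outputs
    n = length_predictor(encoder_outputs)
    # feed a whole row of BEGIN tokens at once; all positions are decoded in parallel
    begin_tokens = [begin_id] * n
    probs = nat_decoder(encoder_outputs, begin_tokens)   # (n, vocab_size)
    return [int(p.argmax()) for p in probs]
```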
3. Encoder-Decoder
Next is the question of how the encoder and the decoder are connected:
all of the encoder's outputs and one decoder vector are fed into the cross attention layer
1. Cross attention
This is the cross attention operation. From the figure above we can see that the inputs of the attention layer in the middle come from both the encoder's outputs and the decoder's input. For example, in the figure below, the decoder first takes BEGIN as input and transforms it into a query $q$; the attention weights $\alpha'_1, \alpha'_2, \alpha'_3$ are computed from $q$ and the encoder outputs, and cross attention uses them to take a weighted sum over the encoder outputs, giving a new vector $v$, which is what we finally feed into the FC layer. The next token ("机") is then fed into the decoder as input, again goes through cross attention with the encoder outputs on the left, and so on until we obtain the decoder's final output.
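A minimal NumPy sketch of one cross attention step: the query comes from the decoder vector, while keys and values come from the encoder outputs; the weight matrices are illustrative assumptions.

```python
import numpy as np

def cross_attention(decoder_vec, encoder_outputs, Wq, Wk, Wv):
    # decoder_vec: (d_model,)   encoder_outputs: (src_len, d_model)
    q = decoder_vec @ Wq                      # query from the decoder
    k = encoder_outputs @ Wk                  # keys from the encoder outputs
    v = encoder_outputs @ Wv                  # values from the encoder outputs
    scores = k @ q / np.sqrt(k.shape[-1])     # one score per encoder position
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # alpha'_i after softmax
    return alpha @ v                          # weighted sum v, fed to the FC layer
```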
- Training
- Teacher Forcing
During training, the decoder's input at each step is the correct answer (the ground-truth token), and the target output is also the correct answer; this scheme is called teacher forcing.
Then, in the testing process, the decoder's first input is the BEGIN symbol.
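A minimal PyTorch-style sketch of one teacher forcing training step, assuming a `model(src, decoder_input)` that returns per-position logits; the names and interface are illustrative, not from the original notes.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_step(model, optimizer, src, tgt, begin_id):
    # tgt: ground-truth token ids, shape (batch, tgt_len)
    begin = torch.full((tgt.size(0), 1), begin_id, dtype=tgt.dtype, device=tgt.device)
    decoder_input = torch.cat([begin, tgt[:, :-1]], dim=1)  # BEGIN + ground truth shifted right
    logits = model(src, decoder_input)                       # (batch, tgt_len, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```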
- tips:
(1) chat-bot: copy some words from the question
(2) guided attention
ex: speech recognition
the attention should focus on the beginning of the input first
solution: monotonic attention, location-aware attention
(3) Beam Search
greedy decoding: select the token with the maximum probability at every step
solution: use beam search instead (see the sketch below)
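A minimal beam search sketch under the same assumed `decoder_step` interface as the greedy loop above; instead of keeping only the single best token at each step, it keeps the `beam_size` most probable partial sequences.

```python
import math

def beam_search(encoder_outputs, decoder_step, begin_id, end_id, beam_size=3, max_len=100):
    # each beam is (tokens, log probability); start from BEGIN
    beams = [([begin_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_id:          # finished beams are kept as-is
                candidates.append((tokens, score))
                continue
            probs = decoder_step(encoder_outputs, tokens)
            for tok, p in enumerate(probs):
                candidates.append((tokens + [tok], score + math.log(p + 1e-12)))
        # keep the beam_size best partial sequences instead of only the single best
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t[-1] == end_id for t, _ in beams):
            break
    return beams[0][0][1:]                    # best sequence without BEGIN
```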
(4) optimizing evaluation metrics
In the training process, we use cross entropy as the loss. But in the end, we use the BLEU score to evaluate the result.
BLEU score: compares the output with the correct answer, but it is not differentiable.
solution: if we really want to optimize the BLEU score, treat it as a reward and use an RL method (see the sketch below).
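Because BLEU is not differentiable, a common setup is to keep cross entropy as the training loss and use BLEU only for validation and model selection. A minimal sketch using NLTK's `sentence_bleu` (assuming NLTK is installed); the helper name is illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def validation_bleu(pairs):
    """pairs: list of (hypothesis_tokens, reference_tokens); returns average sentence BLEU."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([ref], hyp, smoothing_function=smooth)
        for hyp, ref in pairs
    ]
    return sum(scores) / len(scores)
```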
(5) Exposure bias
For the decoder, if every token it sees during training is correct, then at test time, once one generated token is wrong, the outputs that follow are likely to be wrong as well, since the decoder has never seen its own mistakes during training.