seq2seq model: encoder-decoder + example: Machine Translation

seq2seq model: encoder-decoder

1.1. its probabilistic model


1.2. RNN encoder-decoder model architecture

  • context vector c = the encoder's final hidden state, i.e. a fixed global representation of the entire input sequence
  • teacher forcing: during the training phase, at time step t, in addition to the new input x(t) and the previous hidden state h(t-1), we also feed in the ground truth y(t-1) (so the decoder has seen y(1), y(2), …, y(t-1)). During the testing/inference/prediction phase, we use the predicted results instead, because the ground truth is unknown.
  • In summary, the decoder consumes the hidden state and the ground-truth token (h, y) to update h.
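The training-vs-inference difference above can be sketched in a few lines; `decoder_step` is a toy stand-in for one RNN decoder cell (an illustrative assumption, not a real model):

```python
# Minimal sketch of teacher forcing vs. free-running decoding.
# decoder_step maps (previous hidden state, previous output token) to
# (new hidden state, predicted token); the arithmetic is a placeholder.

def decoder_step(h_prev, y_prev):
    h_new = h_prev + 1                 # pretend state update
    y_pred = (h_new + y_prev) % 10     # pretend prediction
    return h_new, y_pred

def decode(ground_truth, h0, teacher_forcing):
    h, y_prev = h0, 0                  # 0 acts as the start token
    outputs = []
    for t in range(len(ground_truth)):
        h, y_pred = decoder_step(h, y_prev)
        outputs.append(y_pred)
        # training: feed the ground truth y(t); inference: feed the prediction
        y_prev = ground_truth[t] if teacher_forcing else y_pred
    return outputs

train_out = decode([3, 1, 4], h0=0, teacher_forcing=True)
infer_out = decode([3, 1, 4], h0=0, teacher_forcing=False)
```

The only difference between the two phases is which token is fed back: the ground truth during training, the model's own prediction at inference.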

1.3. attention-based encoder-decoder model architecture

  • context vector ci depends on the position i of the output word yi: it encodes how much attention yi pays to each word in the input sequence. Each ci (or c(t)) is a different vector, built from attention weights that sum to 1.
  • In summary, the decoder consumes the context vector, hidden state, and ground-truth token (c(t), h(t-1), y(t-1)) to update h(t).
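A minimal numeric sketch of how a locality-dependent context vector could be formed from the encoder states; the dot-product score and the toy 2-d vectors are assumptions for illustration:

```python
import math

# Encoder hidden states h_1..h_4 (toy 2-d vectors) and one decoder state s.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
s = [1.0, 0.0]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Alignment scores: dot(s, h_j). Note it is the weights alpha that
# sum to 1, not the dimensions of the context vector itself.
scores = [sum(a * b for a, b in zip(s, h)) for h in encoder_states]
alpha = softmax(scores)

# Context vector c = attention-weighted sum of the encoder states.
c = [sum(a * h[d] for a, h in zip(alpha, encoder_states)) for d in range(2)]
```

Recomputing alpha against a new decoder state at every step is what makes each c(t) different.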

1.4. LSTM machine translation model

(a seq2seq model for machine translation, taking character-level tokenization + prediction as an example)

  • model architecture: in the inference/testing phase, the prediction z is used as the next input. Note: during the training phase, the decoder's new input is the ground truth of all preceding characters or words, whereas in the testing phase the decoder's new input is the character or word predicted at the previous time step.

Note: as in text generation, the predicted result z is obtained by sampling from the predicted probability distribution p. There are three sampling strategies: 1. greedy; 2. random sampling; 3. use a temperature to modify the probability distribution (pd), making high probabilities higher and low probabilities lower, then randomly sample from the new pd.
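The three sampling strategies can be sketched as follows (the example distribution `p` is made up for illustration):

```python
import math, random

def greedy(p):
    # strategy 1: always pick the highest-probability token
    return max(range(len(p)), key=lambda i: p[i])

def random_sample(p, rng):
    # strategy 2: sample a token index according to the distribution
    return rng.choices(range(len(p)), weights=p, k=1)[0]

def with_temperature(p, T):
    # strategy 3: reshape the distribution, then sample from the result;
    # T < 1 makes high probabilities higher and low ones lower, T > 1 flattens
    logits = [math.log(x) / T for x in p]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

p = [0.1, 0.6, 0.3]
sharp = with_temperature(p, T=0.5)   # the 0.6 entry grows, the others shrink
idx = random_sample(sharp, random.Random(0))
```

Greedy decoding is deterministic; temperature-then-sample keeps some diversity while still favoring likely tokens.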

(figures: decoder computation during the training phase and during the testing phase)
Once the stop token is sampled, the string is returned; this is the model's translation result.
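The stop-token loop could look like this; `predict_next` is a hypothetical stand-in for the trained decoder, with canned outputs so the example terminates:

```python
# Sketch of the inference loop: keep feeding the predicted character back
# in until the stop token is sampled, then return the accumulated string.

STOP = "\n"

def predict_next(prefix):
    # toy "model": a real decoder would sample from its predicted
    # distribution given the prefix; these outputs are assumptions
    canned = {"": "h", "h": "i", "hi": STOP}
    return canned[prefix]

def translate():
    out = ""
    while True:
        z = predict_next(out)
        if z == STOP:            # stop token sampled: translation is done
            return out
        out += z                 # otherwise append and continue decoding
```

Calling `translate()` accumulates characters until the stop token appears.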

  • How to improve the model?
    (1) use a Bi-LSTM instead of an LSTM in the encoder - mitigates the short-term-memory problem
    (2) use word-level tokenization instead of character-level.

character-level

  • tokenization and build dictionary: sentence -(tokenizer)-> input tokens -(dictionary)-> label list
    For each language, we build a dictionary in which each character maps to a label.
    Question: why do we need different tokenizers and dictionaries for different languages?
    At the char level, languages have different alphabets.
    At the word level, languages have different vocabularies.
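The sentence -> tokens -> label list pipeline can be sketched as below (the tiny corpus is an illustrative assumption):

```python
# Sketch of char-level tokenization + dictionary building for one language.
# Each distinct character gets an integer label; a sentence becomes a label list.

def build_char_dict(corpus):
    chars = sorted(set("".join(corpus)))          # the language's "alphabet"
    return {ch: i for i, ch in enumerate(chars)}

corpus = ["hello", "hold"]
char2label = build_char_dict(corpus)              # e.g. {'d': 0, 'e': 1, ...}
labels = [char2label[ch] for ch in "hole"]        # sentence -> label list
```

A second language would get its own `build_char_dict` call over its own corpus, since its alphabet differs.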

  • one-hot encoding: input sentence matrix = a tensor [batch_size (number of sentences per training batch), sequence_length (sentence length; every sentence must be padded/truncated to the same length), input_vector_length (length of each character's one-hot vector)]
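Building that [batch_size, sequence_length, input_vector_length] tensor from label lists might look like this (pure-Python nested lists; the pad label 0 is an assumption):

```python
# Sketch: pad every label list to seq_len, then one-hot encode each label.

def one_hot_batch(label_lists, seq_len, vocab_size, pad_label=0):
    batch = []
    for labels in label_lists:
        padded = (labels + [pad_label] * seq_len)[:seq_len]  # equal lengths
        rows = []
        for lab in padded:
            row = [0.0] * vocab_size
            row[lab] = 1.0                                   # one-hot vector
            rows.append(row)
        batch.append(rows)
    return batch   # shape: [batch_size][seq_len][vocab_size]

x = one_hot_batch([[1, 2], [3]], seq_len=3, vocab_size=4)
```

The second sentence is shorter, so it gets padded with the pad label before encoding.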

  • predicted result: the next character
    Each language needs its own dictionary.

word-level

  • tokenization and build dictionary: for each language, we build a dictionary in which each word maps to a label. sentence -(tokenizer)-> input tokens -(dictionary)-> label list
  • word embedding: input sentence matrix = a tensor [batch_size (number of sentences per training batch), sequence_length (sentence length; every sentence must be padded/truncated to the same length), input_vector_length (length of each word's embedding vector)]

Note: the word embedding layer has many parameters, so if the machine translation dataset is not large enough, it overfits easily. Hence char-level suits small datasets and word-level suits large datasets. Word-level works better, not only because a word embedding is itself a vector that encodes semantics and morphology, but also because a sentence contains roughly 4-5x as many characters as words, so word-level sequences are shorter and easier for the model to remember.
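The parameter-count argument can be made concrete; the vocabulary sizes and dimensions below are illustrative assumptions, not from the source:

```python
# Why word embeddings overfit on small data: the embedding layer alone holds
# vocab_size * embedding_dim parameters. Sizes here are assumptions.

char_vocab, word_vocab, dim = 80, 30_000, 256
char_params = char_vocab * dim      # tens of thousands of parameters
word_params = word_vocab * dim      # millions of parameters

# A lookup replaces the one-hot multiply: row i of the table is word i's vector.
embedding = [[0.01 * (i + j) for j in range(4)] for i in range(5)]  # toy 5x4 table
vector = embedding[3]               # embedding of the word with label 3
```

With two orders of magnitude more parameters in the word-level table, a small dataset cannot constrain them all, which is the overfitting risk the note describes.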

  • predicted result: the next word

(3) multi-task learning

  • Using N datasets to train N tasks.
  • Use the same encoder but different decoders: 1 encoder + N decoders
  • Advantage: the encoder is trained better and distills a more effective global representation of the input sequence, because training the encoder on one language's translation task also improves the encoder's performance on the other languages' translation tasks.
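The 1-encoder + N-decoders wiring can be sketched as below; `Encoder`/`Decoder` here are toy placeholder classes (assumptions), just to show that every task shares and updates one encoder object:

```python
# Sketch of multi-task seq2seq: one shared encoder, one decoder per language.

class Encoder:
    def __init__(self):
        self.updates = 0
    def encode(self, sentence):
        self.updates += 1            # stands in for a gradient update
        return len(sentence)         # toy "global representation"

class Decoder:
    def __init__(self, language):
        self.language = language
    def decode(self, context):
        return f"<{self.language}:{context}>"

encoder = Encoder()
decoders = {lang: Decoder(lang) for lang in ["fr", "de", "ja"]}

# Training on any language pair goes through the same shared encoder,
# so every task's data contributes to improving it.
for lang, dec in decoders.items():
    dec.decode(encoder.encode("hello"))
```

Because the encoder object is shared, training the "fr" task also moves the encoder that the "de" and "ja" tasks use.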

(4) attention mechanism
use locality-dependent context vectors ci for each yi instead of a fixed global representation of the input sequence c = the last hidden state of the input sequence, denoted hT or hL.
h_tilde: the last hidden state of the decoder network
