seq2seq model: encoder-decoder + example: Machine Translation

seq2seq model: encoder-decoder

1.1. its probabilistic model


1.2. RNN encoder-decoder model architecture

  • context vector c = the encoder's final hidden state, i.e. a fixed global representation of the entire input sequence
  • teacher forcing: during the training phase, at time step t, in addition to the new input x(t) and the previous hidden state h(t-1), we also feed in the ground truth y(t-1) (so the decoder has seen y(1), y(2), …, y(t-1)). During the testing/inference/prediction phase, we use the predicted results instead, because the ground truth is unknown.
  • In summary, the decoder consumes the hidden state and the ground-truth token (h, y) to update h.
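The training-vs-inference difference above can be sketched in a few lines; `decoder_step` is a toy stand-in for one RNN decoder cell (an illustrative assumption, not a real model):

```python
# Minimal sketch of teacher forcing vs. free-running decoding.
# decoder_step maps (previous hidden state, previous output token) to
# (new hidden state, predicted token); the arithmetic is a placeholder.

def decoder_step(h_prev, y_prev):
    h_new = h_prev + 1                 # pretend state update
    y_pred = (h_new + y_prev) % 10     # pretend prediction
    return h_new, y_pred

def decode(ground_truth, h0, teacher_forcing):
    h, y_prev = h0, 0                  # 0 acts as the start token
    outputs = []
    for t in range(len(ground_truth)):
        h, y_pred = decoder_step(h, y_prev)
        outputs.append(y_pred)
        # training: feed the ground truth y(t); inference: feed the prediction
        y_prev = ground_truth[t] if teacher_forcing else y_pred
    return outputs

train_out = decode([3, 1, 4], h0=0, teacher_forcing=True)
infer_out = decode([3, 1, 4], h0=0, teacher_forcing=False)
```

The only difference between the two phases is which token is fed back: the ground truth during training, the model's own prediction at inference.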

1.3. attention-based encoder-decoder model architecture

  • context vector ci depends on the position i of the output word yi: it encodes how much attention yi pays to each word in the input sequence. Each ci (or c(t)) is a different vector, built from attention weights that sum to 1.
  • In summary, the decoder consumes the context vector, hidden state, and ground-truth token (c(t), h(t-1), y(t-1)) to update h(t).
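A minimal numeric sketch of how a locality-dependent context vector could be formed from the encoder states; the dot-product score and the toy 2-d vectors are assumptions for illustration:

```python
import math

# Encoder hidden states h_1..h_4 (toy 2-d vectors) and one decoder state s.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
s = [1.0, 0.0]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Alignment scores: dot(s, h_j). Note it is the weights alpha that
# sum to 1, not the dimensions of the context vector itself.
scores = [sum(a * b for a, b in zip(s, h)) for h in encoder_states]
alpha = softmax(scores)

# Context vector c = attention-weighted sum of the encoder states.
c = [sum(a * h[d] for a, h in zip(alpha, encoder_states)) for d in range(2)]
```

Recomputing alpha against a new decoder state at every step is what makes each c(t) different.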

1.4. LSTM machine translation model

(a seq2seq model for machine translation, taking character-level tokenization + prediction as an example)

  • model architecture: in the inference/testing phase, the prediction z is used as the next input. Note: during the training phase, the decoder's new input is the ground truth of all preceding characters or words, whereas in the testing phase the decoder's new input is the character or word predicted at the previous time step.

Note: as in text generation, the predicted result z is obtained by sampling from the predicted probability distribution p. There are three sampling strategies: 1. greedy; 2. random sampling; 3. use a temperature to modify the probability distribution (pd), making high probabilities higher and low probabilities lower, then randomly sample from the new pd.
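The three sampling strategies can be sketched as follows (the example distribution `p` is made up for illustration):

```python
import math, random

def greedy(p):
    # strategy 1: always pick the highest-probability token
    return max(range(len(p)), key=lambda i: p[i])

def random_sample(p, rng):
    # strategy 2: sample a token index according to the distribution
    return rng.choices(range(len(p)), weights=p, k=1)[0]

def with_temperature(p, T):
    # strategy 3: reshape the distribution, then sample from the result;
    # T < 1 makes high probabilities higher and low ones lower, T > 1 flattens
    logits = [math.log(x) / T for x in p]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

p = [0.1, 0.6, 0.3]
sharp = with_temperature(p, T=0.5)   # the 0.6 entry grows, the others shrink
idx = random_sample(sharp, random.Random(0))
```

Greedy decoding is deterministic; temperature-then-sample keeps some diversity while still favoring likely tokens.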

(figures: decoder computation during the training phase and during the testing phase)
Once the stop token is sampled, the string is returned; this is the model's translation result.
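The stop-token loop could look like this; `predict_next` is a hypothetical stand-in for the trained decoder, with canned outputs so the example terminates:

```python
# Sketch of the inference loop: keep feeding the predicted character back
# in until the stop token is sampled, then return the accumulated string.

STOP = "\n"

def predict_next(prefix):
    # toy "model": a real decoder would sample from its predicted
    # distribution given the prefix; these outputs are assumptions
    canned = {"": "h", "h": "i", "hi": STOP}
    return canned[prefix]

def translate():
    out = ""
    while True:
        z = predict_next(out)
        if z == STOP:            # stop token sampled: translation is done
            return out
        out += z                 # otherwise append and continue decoding
```

Calling `translate()` accumulates characters until the stop token appears.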

  • How to improve the model?
    (1) use a Bi-LSTM instead of an LSTM in the encoder - mitigates the short-term-memory problem
    (2) use word-level tokenization instead of character-level.

character-level

  • tokenization and build dictionary: sentence -(tokenizer)-> input tokens -(dictionary)-> label list
    For each language, we build a dictionary in which each character maps to a label.
    Question: why do we need different tokenizers and dictionaries for different languages?
    At the char level, languages have different alphabets.
    At the word level, languages have different vocabularies.
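The sentence -> tokens -> label list pipeline can be sketched as below (the tiny corpus is an illustrative assumption):

```python
# Sketch of char-level tokenization + dictionary building for one language.
# Each distinct character gets an integer label; a sentence becomes a label list.

def build_char_dict(corpus):
    chars = sorted(set("".join(corpus)))          # the language's "alphabet"
    return {ch: i for i, ch in enumerate(chars)}

corpus = ["hello", "hold"]
char2label = build_char_dict(corpus)              # e.g. {'d': 0, 'e': 1, ...}
labels = [char2label[ch] for ch in "hole"]        # sentence -> label list
```

A second language would get its own `build_char_dict` call over its own corpus, since its alphabet differs.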

  • one-hot encoding: input sentence matrix = a tensor [batch_size (number of sentences per training batch), sequence_length (sentence length; every sentence must be padded/truncated to the same length), input_vector_length (length of each character's one-hot vector)]
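Building that [batch_size, sequence_length, input_vector_length] tensor from label lists might look like this (pure-Python nested lists; the pad label 0 is an assumption):

```python
# Sketch: pad every label list to seq_len, then one-hot encode each label.

def one_hot_batch(label_lists, seq_len, vocab_size, pad_label=0):
    batch = []
    for labels in label_lists:
        padded = (labels + [pad_label] * seq_len)[:seq_len]  # equal lengths
        rows = []
        for lab in padded:
            row = [0.0] * vocab_size
            row[lab] = 1.0                                   # one-hot vector
            rows.append(row)
        batch.append(rows)
    return batch   # shape: [batch_size][seq_len][vocab_size]

x = one_hot_batch([[1, 2], [3]], seq_len=3, vocab_size=4)
```

The second sentence is shorter, so it gets padded with the pad label before encoding.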

  • predicted result: the next character
    Each language needs its own dictionary.

word-level

  • tokenization and build dictionary: for each language, we build a dictionary in which each word maps to a label. sentence -(tokenizer)-> input tokens -(dictionary)-> label list
  • word embedding: input sentence matrix = a tensor [batch_size (number of sentences per training batch), sequence_length (sentence length; every sentence must be padded/truncated to the same length), input_vector_length (length of each word's embedding vector)]

Note: the word embedding layer has many parameters, so if the machine translation dataset is not large enough, it overfits easily. Hence char-level suits small datasets and word-level suits large datasets. Word-level works better, not only because a word embedding is itself a vector that encodes semantics and morphology, but also because a sentence contains roughly 4-5x as many characters as words, so word-level sequences are shorter and easier for the model to remember.
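The parameter-count argument can be made concrete; the vocabulary sizes and dimensions below are illustrative assumptions, not from the source:

```python
# Why word embeddings overfit on small data: the embedding layer alone holds
# vocab_size * embedding_dim parameters. Sizes here are assumptions.

char_vocab, word_vocab, dim = 80, 30_000, 256
char_params = char_vocab * dim      # tens of thousands of parameters
word_params = word_vocab * dim      # millions of parameters

# A lookup replaces the one-hot multiply: row i of the table is word i's vector.
embedding = [[0.01 * (i + j) for j in range(4)] for i in range(5)]  # toy 5x4 table
vector = embedding[3]               # embedding of the word with label 3
```

With two orders of magnitude more parameters in the word-level table, a small dataset cannot constrain them all, which is the overfitting risk the note describes.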

  • predicted result: the next word

(3) multi-task learning

  • Using N datasets to train N tasks.
  • Use the same encoder but different decoders: 1 encoder + N decoders
  • Advantage: the encoder is trained better and distills a more effective global representation of the input sequence, because training the encoder on one language's translation task also improves the encoder's performance on the other languages' translation tasks.
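The 1-encoder + N-decoders wiring can be sketched as below; `Encoder`/`Decoder` here are toy placeholder classes (assumptions), just to show that every task shares and updates one encoder object:

```python
# Sketch of multi-task seq2seq: one shared encoder, one decoder per language.

class Encoder:
    def __init__(self):
        self.updates = 0
    def encode(self, sentence):
        self.updates += 1            # stands in for a gradient update
        return len(sentence)         # toy "global representation"

class Decoder:
    def __init__(self, language):
        self.language = language
    def decode(self, context):
        return f"<{self.language}:{context}>"

encoder = Encoder()
decoders = {lang: Decoder(lang) for lang in ["fr", "de", "ja"]}

# Training on any language pair goes through the same shared encoder,
# so every task's data contributes to improving it.
for lang, dec in decoders.items():
    dec.decode(encoder.encode("hello"))
```

Because the encoder object is shared, training the "fr" task also moves the encoder that the "de" and "ja" tasks use.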

(4) attention mechanism
use locality-dependent context vectors ci for each yi instead of a fixed global representation of the input sequence c = the last hidden state of the input sequence, denoted hT or hL.
h_tilde: the last hidden state of the decoder network
