On Using Very Large Target Vocabulary for Neural Machine Translation

Advantages of neural machine translation:
(1) It requires relatively little domain knowledge (e.g., hand-crafted features of the source and target languages).
(2) The whole model is tuned jointly, whereas the components of a phrase-based system are tuned separately.
(3) It requires comparatively little memory.

Despite these advantages, some drawbacks are unavoidable:
the set of target words is limited, and the training complexity of the model grows as the number of target words increases.

Since my goal is just to understand the overall machine translation pipeline, here is a brief description of the current practice, followed by the improvements this paper proposes on top of it:
Because the target-language vocabulary is enormous, the usual practice is to take the K most frequent words as the target vocabulary (commonly called the shortlist), with K typically between 30,000 and 80,000; all words not in the shortlist are replaced by a single UNK token. This works well when only a small fraction of words become UNK, but translation quality drops sharply as more and more words fall outside the shortlist.
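As a rough sketch of this preprocessing step (my own illustration, not from the paper; names such as `build_shortlist` are made up), a shortlist keeps the K most frequent words and maps everything else to UNK:

```python
from collections import Counter

def build_shortlist(corpus_tokens, k=30000, unk_token='<unk>'):
    """Keep the k most frequent words; everything else maps to UNK."""
    counts = Counter(corpus_tokens)
    shortlist = [unk_token] + [w for w, _ in counts.most_common(k)]
    return {w: i for i, w in enumerate(shortlist)}

def encode(tokens, word2id, unk_token='<unk>'):
    """Replace out-of-shortlist words with the UNK id."""
    unk_id = word2id[unk_token]
    return [word2id.get(w, unk_id) for w in tokens]
```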

However, simply enlarging the shortlist increases the computational cost, mainly in the output softmax:
At each decoding step the probability of a target word y_t is a softmax over the whole target vocabulary V:

p(y_t | y_{<t}, x) = (1/Z) exp( w_t^T φ(y_{t-1}, z_t, c_t) + b_t ),

with the normalization constant

Z = Σ_{k: y_k ∈ V} exp( w_k^T φ(y_{t-1}, z_t, c_t) + b_k ).

Because the vocabulary is so large, computing the normalization constant Z at every step makes training expensive.
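A toy illustration of this bottleneck (my own numbers and variable names, mirroring the symbols w_k, b_k, φ above): obtaining Z requires scoring every word in the target vocabulary at every decoding step, so the cost of the output layer grows linearly with |V|:

```python
import torch

vocab_size, hidden = 50_000, 1000        # e.g. a 50k-word shortlist
W = torch.randn(vocab_size, hidden)      # output word embeddings w_k
b = torch.randn(vocab_size)              # output biases b_k
phi = torch.randn(hidden)                # decoder state φ(y_{t-1}, z_t, c_t)

# Full softmax: every target word must be scored just to obtain Z.
scores = W @ phi + b                     # O(|V| * hidden) per decoding step
Z = torch.exp(scores).sum()
p = torch.exp(scores) / Z                # p(y_t | y_<t, x) over the entire vocabulary
```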

To address this, the paper proposes an importance sampling method: at each parameter update, only a small subset of the target vocabulary is used to approximate Z. This makes it possible to use a much larger shortlist, improving translation quality, while keeping the computational cost low (a small sketch of the idea follows the links below). Here is a rough intuition for importance sampling:
How to understand importance sampling:
http://blog.csdn.net/tudouniurou/article/details/6277526

importance sampling:
http://blog.sina.com.cn/s/blog_60b44d6a0101l45z.html
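A minimal sketch of the importance-sampling idea (my own simplification: a uniform proposal distribution instead of the paper's partition-based sampling scheme): only a small sample of words is scored, and the scores are reweighted by the proposal probability to estimate Z.

```python
import torch

vocab_size, hidden, sample_size = 50_000, 1000, 500
W = torch.randn(vocab_size, hidden)           # output word embeddings w_k
b = torch.randn(vocab_size)                   # output biases b_k
phi = torch.randn(hidden)                     # decoder state φ(y_{t-1}, z_t, c_t)
target = 42                                   # index of the correct target word y_t

# Proposal distribution q: uniform over the vocabulary (the paper instead
# samples from predefined partitions of the training corpus).
sample = torch.randint(0, vocab_size, (sample_size,))
q = 1.0 / vocab_size

# Importance-sampling estimate: Z = E_{k~q}[ exp(score_k) / q(k) ],
# approximated by an average over the sampled words.
sample_scores = W[sample] @ phi + b[sample]   # only |sample| words are scored
Z_hat = (torch.exp(sample_scores) / q).mean()

target_score = W[target] @ phi + b[target]
p_hat = torch.exp(target_score) / Z_hat       # approximate p(y_t | y_<t, x)
```

With `sample_size` much smaller than `vocab_size`, each update touches only a small fraction of the output embeddings, which is what lets the model scale to very large target vocabularies.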

The code below defines a named tuple `Hypothesis` with two fields, `value` and `score`; this is a convenient way to store and manipulate hypotheses (for example during beam search) in sequence-to-sequence models. The `NMT` class is a PyTorch module that implements a simple neural machine translation model: a bidirectional LSTM encoder, a unidirectional LSTM decoder, and a global attention mechanism based on Luong et al. (2015). Here's a breakdown of the code:

```python
from collections import namedtuple

import torch
import torch.nn as nn
import torch.nn.functional as F

# A decoding hypothesis: the generated tokens and their (log-probability) score.
Hypothesis = namedtuple('Hypothesis', ['value', 'score'])


class NMT(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, emb_size, hidden_size):
        super(NMT, self).__init__()
        self.src_embed = nn.Embedding(src_vocab_size, emb_size)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, emb_size)
        # Bidirectional encoder: its outputs have dimension 2 * hidden_size.
        self.encoder = nn.LSTM(emb_size, hidden_size, batch_first=True, bidirectional=True)
        # The decoder input is the target embedding concatenated with the context vector.
        self.decoder = nn.LSTMCell(emb_size + 2 * hidden_size, hidden_size)
        # Luong-style "general" attention: score = (W_a * h_enc) . h_dec
        self.attention = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, tgt_vocab_size)
        self.hidden_size = hidden_size
        self.tgt_vocab_size = tgt_vocab_size

    def forward(self, src, tgt):
        batch_size = src.size(0)
        tgt_len = tgt.size(1)

        # Encode the source sentence.
        src_embedded = self.src_embed(src)                        # (batch, src_len, emb)
        encoder_outputs, (last_hidden, last_cell) = self.encoder(src_embedded)

        # Initialize the decoder states by summing the forward and backward directions.
        decoder_hidden = last_hidden.sum(dim=0)                   # (batch, hidden)
        decoder_cell = last_cell.sum(dim=0)

        # Initialize the attention context vector and the output scores.
        context = torch.zeros(batch_size, 2 * self.hidden_size, device=src.device)
        outputs = torch.zeros(batch_size, tgt_len, self.tgt_vocab_size, device=src.device)

        # Decode the target sentence one step at a time.
        for t in range(tgt_len):
            tgt_embedded = self.tgt_embed(tgt[:, t])
            decoder_input = torch.cat([tgt_embedded, context], dim=1)
            decoder_hidden, decoder_cell = self.decoder(decoder_input,
                                                        (decoder_hidden, decoder_cell))

            # Attention: project the encoder outputs, score them against the decoder
            # state, and take a softmax over source positions.
            attention_scores = self.attention(encoder_outputs)    # (batch, src_len, hidden)
            attention_weights = F.softmax(
                torch.bmm(attention_scores, decoder_hidden.unsqueeze(2)).squeeze(2), dim=1)
            context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)

            # Project the decoder state to scores over the target vocabulary.
            output = self.out(decoder_hidden)
            outputs[:, t] = output

        return outputs
```

The `__init__` method initializes the model parameters and layers. It takes four arguments:

- `src_vocab_size`: the size of the source vocabulary
- `tgt_vocab_size`: the size of the target vocabulary
- `emb_size`: the size of the word embeddings
- `hidden_size`: the size of the encoder and decoder hidden states

The model's main components are:

- `src_embed`: an embedding layer for the source sentence
- `tgt_embed`: an embedding layer for the target sentence
- `encoder`: a bidirectional LSTM encoder that encodes the source sentence
- `decoder`: a unidirectional LSTM decoder (an `LSTMCell`) that generates the target sentence
- `attention`: the linear layer used to compute attention scores
- `out`: the linear layer that maps the decoder state to target-vocabulary scores

The `forward` method implements attention-based decoding. It takes two arguments:

- `src`: the source sentence tensor of shape `(batch_size, src_len)`
- `tgt`: the target sentence tensor of shape `(batch_size, tgt_len)`

The method first encodes the source sentence with the bidirectional LSTM encoder; the per-position encoder outputs and the final hidden and cell states are stored in `encoder_outputs`, `last_hidden`, and `last_cell`. The decoder states are then initialized from the encoder's final hidden and cell states.

At each time step, the decoder takes as input the embedded target word concatenated with the context vector, a weighted sum of the encoder outputs given by the attention weights. The decoder hidden and cell states are updated by the `LSTMCell` module. Attention scores are computed by applying a linear transform to the encoder outputs and taking their dot product with the current decoder hidden state, followed by a softmax over source positions; the resulting weights define the new context vector. Finally, the decoder hidden state is passed through the output layer to produce scores over the target vocabulary, which are collected in the `outputs` tensor and returned.
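A quick usage sketch with made-up toy dimensions, assuming the `NMT` class and imports above, just to show the expected tensor shapes:

```python
model = NMT(src_vocab_size=1000, tgt_vocab_size=1200, emb_size=64, hidden_size=128)

src = torch.randint(0, 1000, (8, 15))    # batch of 8 source sentences of length 15
tgt = torch.randint(0, 1200, (8, 12))    # batch of 8 target sentences of length 12

scores = model(src, tgt)                 # shape (8, 12, 1200): per-step vocabulary scores
print(scores.shape)
```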