Seq2Seq Code Notes

7 Sequence to Sequence Learning



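  • Following the design of the encoder-decoder architecture, we can use two recurrent neural networks to build a sequence to sequence learning model.
  • When implementing the encoder and the decoder, we can use multilayer RNNs.
  • We can use masks to filter out irrelevant computations, for example when calculating the loss.
  • In encoder-decoder training, teacher forcing feeds the original output sequence (rather than the predictions) into the decoder.
  • BLEU is a popular evaluation measure that scores a prediction by matching n-grams between the predicted sequence and the label sequence.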

As we have discussed in Section 9.5, machine translation is a major problem domain for sequence transduction models, whose input and output are both variable-length sequences. To handle this type of inputs and outputs, we can design an architecture with two major components.

  • The first component is an encoder: it takes a variable-length sequence as the input and transforms it into a state with a fixed shape (a fixed-shape hidden state).
  • The second component is a decoder: it maps the encoded state of a fixed shape to a variable-length sequence.

This is called an encoder-decoder architecture, which is depicted in Fig. 9.6.1.

Following the design principle of the encoder-decoder architecture,

  • the RNN encoder takes a variable-length sequence as the input and transforms it into a fixed-shape hidden state. In other words, information of the input (source) sequence is encoded in the hidden state of the RNN encoder.
  • To generate the output sequence token by token, a separate RNN decoder predicts the next token based on what tokens have been seen (such as in language modeling) or generated, together with the encoded information of the input sequence. (Its inputs are the tokens observed so far and the hidden state information from the encoder.)

Fig. 9.7.1 illustrates how to use two RNNs for sequence to sequence learning in machine translation.

Fig. 9.7.1 Sequence to sequence learning with an RNN encoder and an RNN decoder

In Fig. 9.7.1, the special "<eos>" token marks the end of the sequence. The model can stop making predictions once this token is generated. At the initial time step of the RNN decoder, there are two special design decisions.

  • First, the special beginning-of-sequence "<bos>" token is an input.
  • Second, the final hidden state of the RNN encoder is used to initialize the hidden state of the decoder.

For example, in designs such as [Sutskever et al., 2014], this is exactly how the encoded input sequence information is fed into the decoder for generating the output (target) sequence. In some other designs such as [Cho et al., 2014b], the final hidden state of the encoder is also fed into the decoder as part of the inputs at every time step, as shown in Fig. 9.7.1. Similar to the training of language models in Section 8.3, we can allow the labels to be the original output sequence, shifted by one token: "<bos>", "Ils", "regardent", "." → "Ils", "regardent", ".", "<eos>".
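The one-token shift between decoder inputs and labels can be sketched in plain Python. The helper name `make_decoder_io` is hypothetical and is used only for this illustration; it is not part of d2l.

```python
def make_decoder_io(target_tokens, bos="<bos>", eos="<eos>"):
    """Build decoder inputs and labels that differ by a one-token shift."""
    # Labels: the target sentence followed by the end-of-sequence token
    labels = list(target_tokens) + [eos]
    # Decoder inputs: "<bos>" followed by the labels without their last token
    dec_input = [bos] + labels[:-1]
    return dec_input, labels


dec_input, labels = make_decoder_io(["Ils", "regardent", "."])
print(dec_input)  # ['<bos>', 'Ils', 'regardent', '.']
print(labels)     # ['Ils', 'regardent', '.', '<eos>']
```

At every position the label is the token that should follow the corresponding decoder input, which is why the two lists have the same length.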

In the following, we will explain the design of Fig. 9.7.1 in greater detail. We will train this model for machine translation on the English-French dataset as introduced in Section 9.5.

import collections
import math
import torch
from torch import nn
from d2l import torch as d2l

7.1. Encoder

Technically speaking, the encoder transforms an input sequence of variable length into a fixed-shape context variable $\mathbf{c}$, and encodes the input sequence information in this context variable. As depicted in Fig. 9.7.1, we can use an RNN to design the encoder.

Let us consider a sequence example (batch size: 1). Suppose that the input sequence is $x_1, \ldots, x_T$, such that $x_t$ is the $t^{\mathrm{th}}$ token in the input text sequence. At time step $t$, the RNN transforms the input feature vector $\mathbf{x}_t$ for $x_t$ and the hidden state $\mathbf{h}_{t-1}$ from the previous time step into the current hidden state $\mathbf{h}_t$. We can use a function $f$ to express the transformation of the RNN's recurrent layer:

$$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}).$$

In general, the encoder transforms the hidden states at all the time steps into the context variable through a customized function $q$:

$$\mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T).$$

For example, when choosing $q(\mathbf{h}_1, \ldots, \mathbf{h}_T) = \mathbf{h}_T$ such as in Fig. 9.7.1, the context variable is just the hidden state $\mathbf{h}_T$ of the input sequence at the final time step.

  • So far we have used a unidirectional RNN to design the encoder, where a hidden state only depends on the input subsequence at and before the time step of the hidden state.
  • We can also construct encoders using bidirectional RNNs. In this case, a hidden state depends on the subsequence before and after the time step (including the input at the current time step), which encodes the information of the entire sequence.
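The recurrence $\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1})$ and the choice $q(\mathbf{h}_1, \ldots, \mathbf{h}_T) = \mathbf{h}_T$ can be sketched with a toy scalar RNN. The function names and the weights `w_x` and `w_h` below are assumptions made for illustration only; a real encoder uses vectors and learned parameters.

```python
import math


def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8):
    """Toy scalar recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    return math.tanh(w_x * x_t + w_h * h_prev)


def encode(xs, h0=0.0):
    """Run the recurrence over the whole input and return all states and c."""
    hidden_states, h = [], h0
    for x_t in xs:
        h = rnn_step(x_t, h)  # h_t depends only on x_t and h_{t-1}
        hidden_states.append(h)
    context = hidden_states[-1]  # q(h_1, ..., h_T) = h_T
    return hidden_states, context


hidden_states, context = encode([1.0, -1.0, 2.0])
print(len(hidden_states), context == hidden_states[-1])
```

The input may have any length, yet `context` always has the same (here scalar) shape, which is the property the decoder relies on.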

Now let us implement the RNN encoder. Note that we use an embedding layer to obtain the feature vector for each token in the input sequence. The weight of an embedding layer is a matrix whose number of rows equals the size of the input vocabulary (vocab_size) and whose number of columns equals the feature vector's dimension (embed_size). For any input token index $i$, the embedding layer fetches the $i^{\mathrm{th}}$ row (starting from 0) of the weight matrix to return its feature vector. Besides, here we choose a multilayer GRU to implement the encoder.

class Seq2SeqEncoder(d2l.Encoder):
    """The RNN encoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers,
                          dropout=dropout)

    def forward(self, X, *args):
        # The output `X` shape: (`batch_size`, `num_steps`, `embed_size`)
        X = self.embedding(X)
        # In RNN models, the first axis corresponds to time steps
        X = X.permute(1, 0, 2)
        # When state is not mentioned, it defaults to zeros
        output, state = self.rnn(X)
        # `output` shape: (`num_steps`, `batch_size`, `num_hiddens`)
        # `state` shape: (`num_layers`, `batch_size`, `num_hiddens`)
        return output, state
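
The row lookup performed by the embedding layer can be checked in isolation. This is a small sketch assuming PyTorch is installed; the sizes 10 and 8 and the token indices are arbitrary illustration values.

```python
import torch
from torch import nn

torch.manual_seed(0)
# 10 rows (vocabulary size), 8 columns (feature dimension)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=8)
tokens = torch.tensor([[3, 0, 9]])  # shape: (batch_size=1, num_steps=3)
features = embedding(tokens)        # shape: (1, 3, 8)

print(features.shape)
# The feature vector of token index 3 is row 3 of the weight matrix
print(torch.equal(features[0, 0], embedding.weight[3]))
```

No matrix multiplication is involved: each token index simply selects one row of the weight matrix.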

The returned variables of recurrent layers have been explained in Section 8.6. Let us still use a concrete example to illustrate the above encoder implementation. Below we instantiate a two-layer GRU encoder whose number of hidden units is 16.

  • Input: a minibatch of sequence inputs X (batch size: 4, number of time steps: 7).
  • Output: the hidden states of the last layer at all the time steps (the output returned by the encoder's recurrent layers) form a tensor of shape (number of time steps, batch size, number of hidden units).

encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16,
                         num_layers=2)
encoder.eval()  # Evaluation mode: disables dropout (and batch-norm updates)
X = torch.zeros((4, 7), dtype=torch.long)
output, state = encoder(X)
output.shape

torch.Size([7, 4, 16])

  • GRU state: since a GRU is employed here, the shape of the multilayer hidden states at the final time step is (number of hidden layers, batch size, number of hidden units). If an LSTM is used, memory cell information will also be contained in state.

state.shape

torch.Size([2, 4, 16])
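
The relation between the two returned tensors can be checked directly on `nn.GRU`. The sketch below assumes PyTorch is installed and uses a unidirectional GRU with the same sizes as above and random inputs.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.GRU(input_size=8, hidden_size=16, num_layers=2)
X = torch.randn(7, 4, 8)  # (num_steps, batch_size, embed_size)
output, state = rnn(X)

print(output.shape, state.shape)
# `output[-1]`: last layer at the final time step
# `state[-1]`:  final time step of the last layer
print(torch.allclose(output[-1], state[-1]))
```

For a unidirectional RNN both slices refer to the same hidden state, which is why the decoder can take `state[-1]` as the context variable.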

7.2. Decoder

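  • Because the decoder here is built from an RNN, its inputs are already processed in temporal order (this follows from how an RNN works internally).
  • With an attention mechanism, a mask has to be set up explicitly to enforce this temporal ordering.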

As we just mentioned, the context variable $\mathbf{c}$ of the encoder's output encodes the entire input sequence $x_1, \ldots, x_T$. Given the output sequence $y_1, y_2, \ldots, y_{T'}$ from the training dataset, for each time step $t'$ (the symbol differs from the time step $t$ of input sequences or encoders), the probability of the decoder output $y_{t'}$ is conditional on the previous output subsequence $y_1, \ldots, y_{t'-1}$ and the context variable $\mathbf{c}$, i.e., $P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$.

To model this conditional probability on sequences, we can use another RNN as the decoder. At any time step $t'$ on the output sequence, the RNN

  • takes the output $y_{t'-1}$ from the previous time step and the context variable $\mathbf{c}$ as its input,
  • then transforms them and the previous hidden state $\mathbf{s}_{t'-1}$ into the hidden state $\mathbf{s}_{t'}$ at the current time step.

As a result, we can use a function $g$ to express the transformation of the decoder's hidden layer:

$$\mathbf{s}_{t'} = g(y_{t'-1}, \mathbf{c}, \mathbf{s}_{t'-1}). \tag{9.7.3}$$

  • $y_{t'-1}$: the output from the previous time step (the previous target token during training, or the decoder's previous prediction at inference).
  • $\mathbf{c}$: the context variable (taken from the final hidden state of the encoder's last layer).
  • $\mathbf{s}_{t'-1}$: the previous hidden state of the decoder.

After obtaining the hidden state of the decoder, we can use an output layer and the softmax operation to compute the conditional probability distribution $P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$ for the output at time step $t'$.

Following Fig. 9.7.1, when implementing the decoder as follows,

  • we directly use the hidden state at the final time step of the encoder to initialize the hidden state of the decoder. This requires that the RNN encoder and the RNN decoder have the same number of layers and hidden units.
  • To further incorporate the encoded input sequence information, the context variable is concatenated with the decoder input at all the time steps.
  • To predict the probability distribution of the output token, a fully-connected layer is used to transform the hidden state at the final layer of the RNN decoder.

class Seq2SeqDecoder(d2l.Decoder):
    """The RNN decoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqDecoder, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers,
                          dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]

    def forward(self, X, state):
        # The output `X` shape: (`num_steps`, `batch_size`, `embed_size`)
        X = self.embedding(X).permute(1, 0, 2)
        # Broadcast `context` so it has the same `num_steps` as `X`
        context = state[-1].repeat(X.shape[0], 1, 1)
        X_and_context = torch.cat((X, context), 2)
        output, state = self.rnn(X_and_context, state)
        output = self.dense(output).permute(1, 0, 2)
        # `output` shape: (`batch_size`, `num_steps`, `vocab_size`)
        # `state` shape: (`num_layers`, `batch_size`, `num_hiddens`)
        return output, state
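
The shapes involved in broadcasting the context and concatenating it with the decoder input can be traced with dummy tensors. This is only a shape check assuming PyTorch is installed; the zero tensors stand in for real embeddings and states.

```python
import torch

num_steps, batch_size, embed_size, num_hiddens, num_layers = 7, 4, 8, 16, 2
X = torch.zeros(num_steps, batch_size, embed_size)        # embedded inputs
state = torch.zeros(num_layers, batch_size, num_hiddens)  # encoder state

# Last layer's state, copied once per decoder time step
context = state[-1].repeat(num_steps, 1, 1)
# Concatenate along the feature axis
X_and_context = torch.cat((X, context), 2)

print(context.shape)
print(X_and_context.shape)
```

The last dimension of the concatenated tensor is `embed_size + num_hiddens`, which matches the input size given to the decoder's GRU.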

To illustrate the implemented decoder, below we instantiate it with the same hyperparameters as the aforementioned encoder. As we can see,

  • Output after the fully-connected layer: the output shape of the decoder becomes (batch size, number of time steps, vocabulary size), where the last dimension of the tensor stores the predicted token distribution.
  • State: its shape is (num_layers, batch_size, num_hiddens) and it still holds the hidden states at the final time step. For this shape check the state is not needed further; the tensor of interest is output.

decoder = Seq2SeqDecoder(vocab_size=10, embed_size=8, num_hiddens=16,
                         num_layers=2)
state = decoder.init_state(encoder(X))
output, state = decoder(X, state)
output.shape, state.shape
(torch.Size([4, 7, 10]), torch.Size([2, 4, 16]))

To summarize, the layers in the above RNN encoder-decoder model are illustrated in Fig. 9.7.2.

Fig. 9.7.2 Layers in an RNN encoder-decoder model.

7.3. Loss Function

At each time step, the decoder predicts a probability distribution for the output tokens.

Similar to language modeling, we can

  • apply softmax to obtain the distribution and
  • calculate the cross-entropy loss for optimization. (Note that torch.nn.CrossEntropyLoss already includes the softmax operation, so it takes unnormalized scores as input.)

Recall from Section 9.5 that the special padding tokens are appended to the end of sequences, so sequences of varying lengths can be efficiently loaded in minibatches of the same shape. However, predictions of padding tokens should be excluded from loss calculations.

To this end, we can use the following sequence_mask function to mask irrelevant entries with zero values, so that later multiplication of any irrelevant prediction with zero equals zero. For example, if the valid lengths of two sequences excluding padding tokens are one and two, respectively, the remaining entries after the first one and the first two entries are cleared to zeros.

def sequence_mask(X, valid_len, value=0):
    """Mask irrelevant entries in sequences.

    :param X: tensor of shape (batch_size, num_steps) or
        (batch_size, num_steps, ...)
    :param valid_len: tensor of shape (batch_size,), valid length per sequence
    :param value: value written into the masked (padding) positions
    :return: `X`, modified in place, with its shape unchanged
    """
    maxlen = X.size(1)  # Number of time steps (size of axis 1)
    # `torch.arange` works like `range`; comparing positions of shape
    # (1, maxlen) with lengths of shape (batch_size, 1) broadcasts to a
    # boolean mask of shape (batch_size, maxlen)
    mask = torch.arange((maxlen), dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X

X = torch.tensor([[1, 2, 3], [4, 5, 6]])
sequence_mask(X, torch.tensor([1, 2]))
tensor([[1, 0, 0],
        [4, 5, 0]])
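
The broadcasted comparison inside sequence_mask can be written out in plain Python to show which positions are kept. The helper name `build_mask` is hypothetical and exists only for this illustration.

```python
def build_mask(maxlen, valid_len):
    """Position `pos` of a sequence is kept iff `pos < valid length`."""
    return [[pos < n for pos in range(maxlen)] for n in valid_len]


mask = build_mask(3, [1, 2])
print(mask)  # [[True, False, False], [True, True, False]]

# Applying the mask reproduces the result shown above
X = [[1, 2, 3], [4, 5, 6]]
masked = [[x if keep else 0 for x, keep in zip(row, row_mask)]
          for row, row_mask in zip(X, mask)]
print(masked)  # [[1, 0, 0], [4, 5, 0]]
```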

We can also mask all the entries across the last few axes. If you like, you may even specify a non-zero value to replace such entries.

X = torch.ones(2, 3, 4)
sequence_mask(X, torch.tensor([1, 2]), value=-1)
tensor([[[ 1.,  1.,  1.,  1.],
         [-1., -1., -1., -1.],
         [-1., -1., -1., -1.]],

        [[ 1.,  1.,  1.,  1.],
         [ 1.,  1.,  1.,  1.],
         [-1., -1., -1., -1.]]])

Now we can extend the softmax cross-entropy loss to allow the masking of irrelevant predictions.

  • Initially, masks for all the predicted tokens are set to one.
  • Once the valid length is given, the mask corresponding to any padding token will be cleared to zero.
  • In the end, the loss for all the tokens will be multiplied by the mask to filter out irrelevant predictions of padding tokens in the loss.

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """The softmax cross-entropy loss with masks."""
    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
    # `label` shape: (`batch_size`, `num_steps`)
    # `valid_len` shape: (`batch_size`,)
    def forward(self, pred, label, valid_len):
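        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        # Keep per-token losses so that the mask can be applied to each token
        self.reduction = 'none'
        # Call the parent `nn.CrossEntropyLoss.forward`, which expects the
        # class axis at position 1
        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
            pred.permute(0, 2, 1), label)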
        weighted_loss = (unweighted_loss * weights).mean(dim=1)
        return weighted_loss
  • For a sanity check, we can create three identical sequences.
  • Then we can specify that the valid lengths of these sequences are 4, 2, and 0, respectively.
  • As a result, the loss of the first sequence should be twice as large as that of the second sequence, while the third sequence should have a zero loss.

loss = MaskedSoftmaxCELoss()
loss(torch.ones(3, 4, 10), torch.ones((3, 4), dtype=torch.long),
     torch.tensor([4, 2, 0]))

tensor([2.3026, 1.1513, 0.0000])
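
These three numbers can be reproduced by hand. Since all 10 scores of each prediction are equal, softmax yields a uniform distribution and every token has a cross-entropy of ln 10; masking then keeps only the valid tokens before averaging over the 4 time steps. A minimal sketch of this arithmetic:

```python
import math

vocab_size, num_steps = 10, 4
per_token_loss = math.log(vocab_size)  # -log(1 / vocab_size)

losses = []
for valid_len in (4, 2, 0):
    # Masked tokens contribute zero, but the mean is still over `num_steps`
    losses.append(per_token_loss * valid_len / num_steps)

print([round(l, 4) for l in losses])  # [2.3026, 1.1513, 0.0]
```

Note that the mean is taken over all time steps, including the masked ones, which is why halving the valid length halves the loss.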

7.4. Training

In the following training loop, we concatenate the special beginning-of-sequence token "<bos>" and the original output sequence (the reference translation Y) excluding the final token "<eos>" as the input to the decoder, as shown in Fig. 9.7.1.

  • This is called teacher forcing because the original output sequence (token labels) is fed into the decoder.
  • Alternatively, we could also feed the predicted token (generated by the decoder itself) from the previous time step as the current input to the decoder.

def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
    """Train a model for sequence to sequence."""
    def xavier_init_weights(m):
        # Xavier initialization for linear layers and GRU weight matrices
        if type(m) == nn.Linear:
            nn.init.xavier_uniform_(m.weight)
        if type(m) == nn.GRU:
            for param in m._flat_weights_names:
                if "weight" in param:
                    nn.init.xavier_uniform_(m._parameters[param])

    net.apply(xavier_init_weights)
    net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss = MaskedSoftmaxCELoss()
    net.train()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
        for batch in data_iter:
            optimizer.zero_grad()
            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
                               device=device).reshape(-1, 1)
            dec_input = torch.cat([bos, Y[:, :-1]], 1)  # Teacher forcing
            Y_hat, _ = net(X, dec_input, X_valid_len)
            l = loss(Y_hat, Y, Y_valid_len)
            l.sum().backward()  # Make the loss scalar for `backward`
            d2l.grad_clipping(net, 1)
            num_tokens = Y_valid_len.sum()
            optimizer.step()
            with torch.no_grad():
                metric.add(l.sum(), num_tokens)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
          f'tokens/sec on {str(device)}')

Now we can create and train an RNN encoder-decoder model for sequence to sequence learning on the machine translation dataset.

embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs, device = 0.005, 300, d2l.try_gpu()

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = Seq2SeqEncoder(
    len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqDecoder(
    len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
net = d2l.EncoderDecoder(encoder, decoder)
train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)

loss 0.019, 11254.5 tokens/sec on cuda:0

7.5. Prediction

To predict the output sequence token by token, at each decoder time step the predicted token from the previous time step is fed into the decoder as an input.

  • Similar to training, at the initial time step the beginning-of-sequence ("<bos>") token is fed into the decoder. This prediction process is illustrated in Fig. 9.7.3.
  • When the end-of-sequence ("<eos>") token is predicted, the prediction of the output sequence is complete.

Fig. 9.7.3 Predicting the output sequence token by token using an RNN encoder-decoder.

We will introduce different strategies for sequence generation in Section 9.8.

def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    """Predict for sequence to sequence."""
    # Set `net` to eval mode for inference
    net.eval()
    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
        src_vocab['<eos>']]
    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
    # Add the batch axis
    enc_X = torch.unsqueeze(
        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
    enc_outputs = net.encoder(enc_X, enc_valid_len)
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    # Add the batch axis
    dec_X = torch.unsqueeze(torch.tensor(
        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
    output_seq, attention_weight_seq = [], []
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)
        # We use the token with the highest prediction likelihood as the input
        # of the decoder at the next time step
        dec_X = Y.argmax(dim=2)
        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
        # Save attention weights (to be covered later); this only works with
        # a decoder that exposes `attention_weights`, unlike the one above
        if save_attention_weights:
            attention_weight_seq.append(net.decoder.attention_weights)
        # Once the end-of-sequence token is predicted, the generation of the
        # output sequence is complete
        if pred == tgt_vocab['<eos>']:
            break
        output_seq.append(pred)
    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
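
The control flow of this greedy loop can be seen in isolation with a toy "model". In the sketch below, `TOY_TABLE` is a made-up table of next-token scores that replaces the decoder, and `greedy_decode` is a hypothetical helper; neither comes from d2l.

```python
def greedy_decode(next_token_scores, bos="<bos>", eos="<eos>", max_steps=10):
    """Feed the best-scoring token back in until `eos` or `max_steps`."""
    output, prev = [], bos
    for _ in range(max_steps):
        scores = next_token_scores(prev)
        prev = max(scores, key=scores.get)  # argmax over candidate tokens
        if prev == eos:
            break                           # stop without emitting "<eos>"
        output.append(prev)
    return output


# Made-up scores: maps the previous token to candidate next tokens
TOY_TABLE = {
    "<bos>": {"je": 0.7, "il": 0.3},
    "je": {"suis": 0.9, "<eos>": 0.1},
    "suis": {"calme": 0.6, "<eos>": 0.4},
    "calme": {".": 0.8, "<eos>": 0.2},
    ".": {"<eos>": 1.0},
}

print(greedy_decode(TOY_TABLE.__getitem__))  # ['je', 'suis', 'calme', '.']
```

As in predict_seq2seq, generation stops either when "<eos>" is chosen or when the step limit is reached, whichever comes first.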

7.6. Evaluation of Predicted Sequences

We can evaluate a predicted sequence by comparing it with the label sequence (the ground-truth). BLEU (Bilingual Evaluation Understudy), though originally proposed for evaluating machine translation results [Papineni et al., 2002], has been extensively used in measuring the quality of output sequences for different applications. In principle, for any n-gram in the predicted sequence, BLEU evaluates whether this n-gram appears in the label sequence.

Denote by $p_n$ the precision of $n$-grams, which is the ratio of the number of matched $n$-grams in the predicted and label sequences to the number of $n$-grams in the predicted sequence. To explain, given a label sequence $A$, $B$, $C$, $D$, $E$, $F$, and a predicted sequence $A$, $B$, $B$, $C$, $D$, we have $p_1 = 4/5$, $p_2 = 3/4$, $p_3 = 1/3$, and $p_4 = 0$. Besides, let $\mathrm{len}_{\text{label}}$ and $\mathrm{len}_{\text{pred}}$ be the numbers of tokens in the label sequence and the predicted sequence, respectively. Then, BLEU is defined as

$$\exp\left(\min\left(0, 1 - \frac{\mathrm{len}_{\text{label}}}{\mathrm{len}_{\text{pred}}}\right)\right) \prod_{n=1}^k p_n^{1/2^n}, \tag{9.7.4}$$

where $k$ is the longest $n$-gram for matching.

  • Based on the definition of BLEU in (9.7.4), whenever the predicted sequence is the same as the label sequence, BLEU is 1.
  • Moreover, since matching longer $n$-grams is more difficult, BLEU assigns a greater weight to a longer $n$-gram precision. Specifically, when $p_n$ is fixed, $p_n^{1/2^n}$ increases as $n$ grows (the original paper uses $p_n^{1/n}$).
  • Furthermore, since predicting shorter sequences tends to obtain a higher $p_n$ value, the coefficient before the multiplication term in (9.7.4) penalizes shorter predicted sequences. For example, when $k=2$, given the label sequence $A$, $B$, $C$, $D$, $E$, $F$ and the predicted sequence $A$, $B$, although $p_1 = p_2 = 1$, the penalty factor $\exp(1-6/2) \approx 0.14$ lowers the BLEU.

We implement the BLEU measure as follows.

def bleu(pred_seq, label_seq, k):  #@save
    """Compute the BLEU."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score
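
The worked numbers from the definition can be reproduced with a small standalone check. The helper `ngram_precision` below is a hypothetical function written for this illustration; it returns the matched count and the total count rather than their ratio, so the fractions are easy to read.

```python
import collections
import math


def ngram_precision(pred_tokens, label_tokens, n):
    """Return (matched n-grams, n-grams in the prediction)."""
    label_counts = collections.Counter(
        tuple(label_tokens[i:i + n])
        for i in range(len(label_tokens) - n + 1))
    num_pred = len(pred_tokens) - n + 1
    matches = 0
    for i in range(num_pred):
        gram = tuple(pred_tokens[i:i + n])
        if label_counts[gram] > 0:
            matches += 1
            label_counts[gram] -= 1  # each label n-gram matches only once
    return matches, num_pred


label = "A B C D E F".split()
pred = "A B B C D".split()
precisions = [ngram_precision(pred, label, n) for n in range(1, 5)]
print(precisions)  # [(4, 5), (3, 4), (1, 3), (0, 2)]

# Brevity penalty for a 2-token prediction against a 6-token label
penalty = math.exp(min(0, 1 - 6 / 2))
print(round(penalty, 2))  # 0.14
```

The second B in the prediction does not count as a unigram match because the label contains B only once, which is why $p_1$ is 4/5 rather than 1.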

In the end, we use the trained RNN encoder-decoder to translate a few English sentences into French and compute the BLEU of the results.

engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, attention_weight_seq = predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device)
    print(f'{eng} => {translation}, bleu {bleu(translation, fra, k=2):.3f}')

go . => va !, bleu 1.000
i lost . => j’ai rouler !, bleu 0.000
he’s calm . => il est paresseux prie ., bleu 0.548
i’m home . => je suis chez moi ., bleu 1.000


  • Following the design of the encoder-decoder architecture, we can use two RNNs to design a model for sequence to sequence learning.
  • When implementing the encoder and the decoder, we can use multilayer RNNs.
  • We can use masks to filter out irrelevant computations, such as when calculating the loss.
  • In encoder-decoder training, the teacher forcing approach feeds original output sequences (in contrast to predictions) into the decoder.
  • BLEU is a popular measure for evaluating output sequences by matching $n$-grams between the predicted sequence and the label sequence.


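Transformer seq2seq is a seq2seq model built on the Transformer. It uses the encoder-decoder architecture, taking one sequence as input and producing another sequence as output. Compared with the RNN-based seq2seq model, it replaces the recurrent network with Transformer blocks. Such models are widely used for speech recognition, machine translation, speech translation, speech synthesis, chatbot training, and other NLP problems. They are broadly applicable, although some specific tasks may need a customized model to obtain better results.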


