[PyTorch Natural Language Processing Series] 8. Advanced Sequence Models for Natural Language Processing (Part 2)

Latest recommended article published on 2023-06-07 17:50:38

数据与智能

Tags: python, machine learning, artificial intelligence, deep learning, java

Article link: https://blog.csdn.net/qq_43045873/article/details/121279519

Source | Natural Language Processing with PyTorch

Authors | Rao, McMahan

Translator | Liangchu

Proofreader | gongyouliu

Editor | auroral-L

The full text is 9,275 characters; estimated reading time is 55 minutes.

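Chapter 8: Advanced Sequence Models for Natural Language Processing

8.1 Sequence-to-Sequence Models, Encoder-Decoder Models, and Conditioned Generation

8.2 Capturing More from a Sequence: Bidirectional Recurrent Models

8.3 Capturing More from a Sequence: Attention

8.3.1 Attention in Deep Neural Networks

8.4 Evaluating Sequence Generation Models

8.5 Example: Neural Machine Translation

8.5.1 The Machine Translation Dataset

8.5.2 A Vectorization Pipeline for NMT

8.5.3 Encoding and Decoding in the NMT Model

8.5.3.1 A closer look at attention

8.5.3.2 Learning to search and scheduled sampling

8.5.4 Training Routine and Results

8.6 Summary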

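8.5 Example: Neural Machine Translation

In this section, we walk through an implementation of the most common use of S2S models: machine translation. As deep learning grew in popularity in the early 2010s, it became apparent that, given enough data, word embeddings and RNNs were an extremely powerful way to translate between two languages. Machine translation models improved further with the introduction of the attention mechanism described in "Capturing More from a Sequence: Attention." In this section, we describe an implementation based on Luong, Pham, and Manning (2015), which simplified the attention approach in S2S models.

We begin by outlining the dataset and the special kinds of bookkeeping needed for neural machine translation. The dataset is a parallel corpus composed of pairs of English sentences and their corresponding French translations. Because we are dealing with two sequences of potentially different lengths, we need to keep track of the maximum length and the vocabulary of both the input sequence and the output sequence. For the most part, this example is a straightforward extension of what you have seen in previous chapters.

After covering the dataset and the bookkeeping data structures, we walk through the model and how it generates the target sequence by attending to different positions in the source sequence. The encoder in our model uses a bidirectional GRU (bi-GRU) to compute a vector for each position in the source sequence, and each of these vectors is informed by all parts of the sequence. To accomplish this, we use PyTorch's PackedSequence data structure, which we cover in more depth in "Encoding and Decoding in the NMT Model." The attention mechanism discussed in "Capturing More from a Sequence: Attention" is applied to the output of the bi-GRU and is used to condition the generation of the target sequence. We discuss the model's results and ways it could be improved in "Training Routine and Results."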

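8.5.1 The Machine Translation Dataset

For this example, we use a dataset of English-French sentence pairs from the Tatoeba Project. The data preprocessing starts by making all sentences lowercase and applying NLTK's English and French tokenizers to each of the sentence pairs. Next, we apply NLTK's language-specific word tokenizer to create a list of tokens. Even though we do further computations, described next, this list of tokens is already a preprocessed dataset.

In addition to the standard preprocessing just described, we use a specified list of syntactic patterns to select a subset of the data in order to simplify the learning problem. In essence, this means we narrow the scope of the data to a limited range of syntactic patterns. In turn, this means the model sees less variation during training and reaches higher performance in a shorter training time.

Note: When building new models and experimenting with new architectures, you should aim for faster iteration cycles between modeling choices and the evaluation of those choices.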

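The subset of the data we select consists of the English sentences that begin with "i am," "he is," "she is," "they are," "you are," or "we are." This reduces the dataset from 135,842 sentence pairs to 13,062 sentence pairs, roughly a factor of 10. To finalize the learning setup, we split the remaining 13,062 sentence pairs into 70% training, 15% validation, and 15% test sets. The proportion of sentences beginning with each of the prefixes just listed is held constant by first grouping by sentence beginning, creating the splits within those groups, and then merging the splits from each group.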
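
The grouped split described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name stratified_split, the integer percentage parameters, the fixed seed, and the toy sentence pairs are all hypothetical and are not taken from the book's preprocessing code.

```python
import random
from collections import defaultdict

PREFIXES = ("i am", "he is", "she is", "they are", "you are", "we are")

def stratified_split(pairs, train_pct=70, val_pct=15, seed=0):
    """Split (english, french) pairs into train/val/test within each prefix group."""
    groups = defaultdict(list)
    for english, french in pairs:
        # keep only sentences that start with one of the selected patterns
        prefix = next((p for p in PREFIXES if english.startswith(p + " ")), None)
        if prefix is not None:
            groups[prefix].append((english, french))

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for prefix in PREFIXES:
        group = groups[prefix]
        rng.shuffle(group)
        # integer arithmetic avoids floating-point surprises in the cut points
        n_train = len(group) * train_pct // 100
        n_val = len(group) * val_pct // 100
        splits["train"].extend(group[:n_train])
        splits["val"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])   # remainder goes to test
    return splits

# toy data: 20 sentences per prefix plus one sentence that gets filtered out
toy_pairs = [(f"{p} sentence {i}", f"phrase {i}") for p in PREFIXES for i in range(20)]
toy_pairs.append(("the cat sleeps", "le chat dort"))
splits = stratified_split(toy_pairs)
print({name: len(rows) for name, rows in splits.items()})
```

Because the cut points are computed per group, every prefix contributes to each split in the same proportion, which is the property the paragraph above asks for.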

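8.5.2 A Vectorization Pipeline for NMT

Vectorizing the source English and target French sentences requires a more complex pipeline than we have seen in previous chapters. There are two reasons for the added complexity. First, the source and target sequences have different roles in the model, belong to different languages, and are vectorized in two different ways. Second, as a prerequisite to using PyTorch's PackedSequence, we sort each minibatch by the length of the source sentences. To handle both complexities, the NMTVectorizer is instantiated with two separate SequenceVocabulary objects and two measurements of the maximum sequence length, as shown in Example 8-1.

Example 8-1: Constructing the NMTVectorizer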

class NMTVectorizer(object): 
    """ The Vectorizer which coordinates the Vocabularies and puts them to use""" 
    def __init__(self, source_vocab, target_vocab, max_source_length, 
                 max_target_length): 
        """ 
        Args: 
            source_vocab (SequenceVocabulary): maps source words to integers 
            target_vocab (SequenceVocabulary): maps target words to integers 
            max_source_length (int): the longest sequence in the source dataset 
            max_target_length (int): the longest sequence in the target dataset 
        """ 
        self.source_vocab = source_vocab 
        self.target_vocab = target_vocab 


        self.max_source_length = max_source_length 
        self.max_target_length = max_target_length 


    @classmethod 
    def from_dataframe(cls, bitext_df): 
        """Instantiate the vectorizer from the dataset dataframe 


        Args: 
            bitext_df (pandas.DataFrame): the parallel text dataset 
        Returns: 
            an instance of the NMTVectorizer 
        """ 
        source_vocab = SequenceVocabulary() 
        target_vocab = SequenceVocabulary() 
        max_source_length, max_target_length = 0, 0 


        for _, row in bitext_df.iterrows(): 
            source_tokens = row["source_language"].split(" ") 
            if len(source_tokens) > max_source_length: 
                max_source_length = len(source_tokens) 
            for token in source_tokens: 
                source_vocab.add_token(token) 


            target_tokens = row["target_language"].split(" ") 
            if len(target_tokens) > max_target_length: 
                max_target_length = len(target_tokens) 
            for token in target_tokens: 
                target_vocab.add_token(token) 


        return cls(source_vocab, target_vocab, max_source_length, 
                   max_target_length)

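The first source of added complexity is the different ways the source and target sequences are handled. The source sequence is vectorized with the BEGIN-OF-SEQUENCE token inserted at the beginning and the END-OF-SEQUENCE token appended at the end. The model uses a bi-GRU to create summary vectors for each token in the source sentence, and these summary vectors benefit greatly from having an indication of the sentence boundaries. In contrast, the target sequence is vectorized as two copies offset by one token: the first copy needs the BEGIN-OF-SEQUENCE token and the second copy needs the END-OF-SEQUENCE token. Recall from Chapter 7 that sequence prediction tasks require observing an input token and an output token at every time step. The decoder in an S2S model performs this task, with the added availability of the encoder context. To handle this complexity, we write the core vectorization method, _vectorize(), so that it is indifferent to whether the indices are source or target indices. We then write two methods to handle the source and target indices separately. Finally, these sets of indices are coordinated by the NMTVectorizer.vectorize() method, which is the method invoked by the dataset. Example 8-2 shows the code.

Example 8-2: The vectorization functions in the NMTVectorizer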

import numpy as np

class NMTVectorizer(object): 
    """ The Vectorizer which coordinates the Vocabularies and puts them to use""" 
    def _vectorize(self, indices, vector_length=-1, mask_index=0): 
        """Vectorize the provided indices 


        Args: 
            indices (list): a list of integers that represent a sequence 
            vector_length (int): an argument for forcing the length of index vector 
            mask_index (int): the mask_index to use; almost always 0 
        """ 
        if vector_length < 0: 
            vector_length = len(indices) 
        vector = np.zeros(vector_length, dtype=np.int64) 
        vector[:len(indices)] = indices 
        vector[len(indices):] = mask_index 
        return vector 


    def _get_source_indices(self, text): 
        """Return the vectorized source text 


        Args: 
            text (str): the source text; tokens should be separated by spaces 
        Returns: 
            indices (list): list of integers representing the text 
        """ 
        indices = [self.source_vocab.begin_seq_index] 
        indices.extend(self.source_vocab.lookup_token(token) 
                       for token in text.split(" ")) 
        indices.append(self.source_vocab.end_seq_index) 
        return indices 


    def _get_target_indices(self, text): 
        """Return the vectorized target text 

        Args: 
            text (str): the target text; tokens should be separated by spaces 
        Returns: 
            a tuple: (x_indices, y_indices) 
                x_indices (list): list of ints; observations in target decoder 
                y_indices (list): list of ints; predictions in target decoder 
        """ 
        indices = [self.target_vocab.lookup_token(token) 
                   for token in text.split(" ")] 
        x_indices = [self.target_vocab.begin_seq_index] + indices 
        y_indices = indices + [self.target_vocab.end_seq_index] 
        return x_indices, y_indices 


    def vectorize(self, source_text, target_text, use_dataset_max_lengths=True): 
        """Return the vectorized source and target text 


        Args: 
            source_text (str): text from the source language 
            target_text (str): text from the target language 
            use_dataset_max_lengths (bool): whether to use the max vector lengths 
        Returns: 
            The vectorized data point as a dictionary with the keys: 
                source_vector, target_x_vector, target_y_vector, source_length 
        """ 
        source_vector_length = -1 
        target_vector_length = -1 


        if use_dataset_max_lengths: 
            source_vector_length = self.max_source_length + 2 
            target_vector_length = self.max_target_length + 1 


        source_indices = self._get_source_indices(source_text) 
        source_vector = self._vectorize(source_indices, 
                                        vector_length=source_vector_length, 
                                        mask_index=self.source_vocab.mask_index) 


        target_x_indices, target_y_indices = self._get_target_indices(target_text) 
        target_x_vector = self._vectorize(target_x_indices, 
                                        vector_length=target_vector_length, 
                                        mask_index=self.target_vocab.mask_index) 
        target_y_vector = self._vectorize(target_y_indices, 
                                        vector_length=target_vector_length, 
                                        mask_index=self.target_vocab.mask_index) 
        return {"source_vector": source_vector, 
                "target_x_vector": target_x_vector, 
                "target_y_vector": target_y_vector, 
                "source_length": len(source_indices)}
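
To make the one-token offset between the two target copies concrete, here is a small standalone sketch that mirrors the logic of Example 8-2 without the SequenceVocabulary class. The special-token indices (0 for the mask, 1 for unknown words, 2 for BEGIN-OF-SEQUENCE, 3 for END-OF-SEQUENCE), the two toy vocabularies, and the helper names pad and vectorize_pair are assumptions made for illustration.

```python
MASK, UNK, BOS, EOS = 0, 1, 2, 3   # assumed special-token indices
source_vocab = {"he": 4, "is": 5, "happy": 6}        # toy English vocabulary
target_vocab = {"il": 4, "est": 5, "heureux": 6}     # toy French vocabulary

def pad(indices, length, mask_index=MASK):
    """Right-pad a list of indices with the mask index up to `length`."""
    return indices + [mask_index] * (length - len(indices))

def vectorize_pair(source_text, target_text, max_source_len, max_target_len):
    # source: BOS + tokens + EOS, hence the "+ 2" on the padded length
    src = [BOS] + [source_vocab.get(t, UNK) for t in source_text.split(" ")] + [EOS]
    tgt = [target_vocab.get(t, UNK) for t in target_text.split(" ")]
    return {
        "source_vector": pad(src, max_source_len + 2),
        # decoder observations start with BOS; labels end with EOS ("+ 1" each)
        "target_x_vector": pad([BOS] + tgt, max_target_len + 1),
        "target_y_vector": pad(tgt + [EOS], max_target_len + 1),
        "source_length": len(src),
    }

example = vectorize_pair("he is happy", "il est heureux",
                         max_source_len=5, max_target_len=4)
for key, value in example.items():
    print(key, value)
```

At every position, target_y_vector holds the token that the decoder should predict after observing the token at the same position in target_x_vector.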

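The second source of added complexity again comes from the source sequence. To encode the source sequence with a bi-GRU, we use PyTorch's PackedSequence data structure. Normally, a minibatch of variable-length sequences is represented numerically as rows in a matrix of integers, in which each sequence is left-aligned and zero-padded to accommodate the variable lengths. The PackedSequence data structure represents variable-length sequences as a single array by concatenating the data for the sequences at each time step, one after another, and recording the number of sequences at each time step, as shown in Figure 8-11.

Creating a PackedSequence has two prerequisites: knowing the length of each sequence, and sorting the sequences in descending order by the length of the source sequence. To reflect this newly sorted matrix, the remaining tensors in the minibatch are sorted in the same order so that they stay aligned with the source sequence encoding. In Example 8-3, the generate_batches() function is modified to become the generate_nmt_batches() function.

Example 8-3: Generating minibatches for the NMT example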

from torch.utils.data import DataLoader

def generate_nmt_batches(dataset, batch_size, shuffle=True, 
                            drop_last=True, device="cpu"): 
    """A generator function which wraps the PyTorch DataLoader; NMT Version """ 
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size, 
                            shuffle=shuffle, drop_last=drop_last) 


    for data_dict in dataloader: 
        lengths = data_dict['x_source_length'].numpy() 
        sorted_length_indices = lengths.argsort()[::-1].tolist() 


        out_data_dict = {} 
        for name, tensor in data_dict.items(): 
            out_data_dict[name] = data_dict[name][sorted_length_indices].to(device) 
        yield out_data_dict
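
The reordering step in Example 8-3 can be traced on a toy minibatch with plain Python lists standing in for tensors. The helper name sort_batch_by_source_length and the toy values are hypothetical; only the key name x_source_length comes from the example above.

```python
def sort_batch_by_source_length(batch):
    """Reorder every field of a minibatch by descending source length."""
    lengths = batch["x_source_length"]
    # indices of the rows, longest source sequence first
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return {name: [rows[i] for i in order] for name, rows in batch.items()}

toy_batch = {
    "x_source": [[2, 8, 3, 0, 0], [2, 4, 5, 6, 3], [2, 7, 9, 3, 0]],
    "x_source_length": [3, 5, 4],
    "y_target": [["a"], ["b"], ["c"]],   # placeholder rows for the other tensors
}
sorted_batch = sort_batch_by_source_length(toy_batch)
print(sorted_batch["x_source_length"])
print(sorted_batch["y_target"])
```

Applying the same index order to every field is what keeps the target rows aligned with the source rows after sorting.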

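8.5.3 Encoding and Decoding in the NMT Model

In this example, the source sequence is an English sentence and the target sequence is its French translation. The standard approach is to use the encoder-decoder model described in "Sequence-to-Sequence Models, Encoder-Decoder Models, and Conditioned Generation." In the model presented in Examples 8-4 and 8-5, the encoder first maps each source sequence to a sequence of vector states with a bi-GRU (see "Capturing More from a Sequence: Bidirectional Recurrent Models"). The decoder then starts from the encoder's hidden states as its initial hidden state and uses an attention mechanism (see "Capturing More from a Sequence: Attention") to select different information from the source sequence as it generates the output sequence. In the remainder of this section, we explain this process in more detail.

Example 8-4: The NMTModel encapsulates and coordinates the encoder and decoder in a single forward() method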

class NMTModel(nn.Module): 
    """ A Neural Machine Translation Model """ 
    def __init__(self, source_vocab_size, source_embedding_size, 
                 target_vocab_size, target_embedding_size, encoding_size, 
                 target_bos_index): 
        """ 
        Args: 
            source_vocab_size (int): number of unique words in source language 
            source_embedding_size (int): size of the source embedding vectors 
            target_vocab_size (int): number of unique words in target language 
            target_embedding_size (int): size of the target embedding vectors 
            encoding_size (int): the size of the encoder RNN. 
            target_bos_index (int): index for BEGIN-OF-SEQUENCE token 
        """ 
        super(NMTModel, self).__init__() 
        self.encoder = NMTEncoder(num_embeddings=source_vocab_size, 
                                  embedding_size=source_embedding_size, 
                                  rnn_hidden_size=encoding_size) 
        decoding_size = encoding_size * 2 
        self.decoder = NMTDecoder(num_embeddings=target_vocab_size, 
                                  embedding_size=target_embedding_size, 
                                  rnn_hidden_size=decoding_size, 
                                  bos_index=target_bos_index) 


    def forward(self, x_source, x_source_lengths, target_sequence): 
        """The forward pass of the model 


        Args: 
            x_source (torch.Tensor): the source text data tensor. 
                x_source.shape should be (batch, vectorizer.max_source_length) 
            x_source_lengths (torch.Tensor): the length of the sequences in x_source 
            target_sequence (torch.Tensor): the target text data tensor 
        Returns: 
            decoded_states (torch.Tensor): prediction vectors at each output step 
        """ 
        encoder_state, final_hidden_states = self.encoder(x_source, 
                                                          x_source_lengths) 
        decoded_states = self.decoder(encoder_state=encoder_state, 
                                      initial_hidden_state=final_hidden_states, 
                                      target_sequence=target_sequence) 
        return decoded_states

Example 8-5: The encoder embeds the source words and extracts features with a bi-GRU

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class NMTEncoder(nn.Module): 
    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size): 
        """ 
        Args: 
            num_embeddings (int): size of source vocabulary 
            embedding_size (int): size of the embedding vectors 
            rnn_hidden_size (int): size of the RNN hidden state vectors 
        """ 
        super(NMTEncoder, self).__init__() 
 
        self.source_embedding = nn.Embedding(num_embeddings, embedding_size, 
                                             padding_idx=0) 
        self.birnn = nn.GRU(embedding_size, rnn_hidden_size, bidirectional=True, 
                            batch_first=True) 
 
    def forward(self, x_source, x_lengths): 
        """The forward pass of the model 
 
        Args: 
            x_source (torch.Tensor): the input data tensor. 
                x_source.shape is (batch, seq_size) 
            x_lengths (torch.Tensor): vector of lengths for each item in the batch 
        Returns: 
            a tuple: x_unpacked (torch.Tensor), x_birnn_h (torch.Tensor) 
                x_unpacked.shape = (batch, seq_size, rnn_hidden_size * 2) 
                x_birnn_h.shape = (batch, rnn_hidden_size * 2) 
        """ 
        x_embedded = self.source_embedding(x_source) 
        # create PackedSequence; x_packed.data.shape=(number_items, embedding_size) 
        x_lengths = x_lengths.detach().cpu().numpy() 
        x_packed = pack_padded_sequence(x_embedded, x_lengths, batch_first=True) 
 
        # x_birnn_h.shape = (num_rnn, batch_size, feature_size) 
        x_birnn_out, x_birnn_h  = self.birnn(x_packed) 
        # permute to (batch_size, num_rnn, feature_size) 
        x_birnn_h = x_birnn_h.permute(1, 0, 2) 
 
        # flatten features; reshape to (batch_size, num_rnn * feature_size) 
        #  (recall: -1 takes the remaining positions, 
        #           flattening the two RNN hidden vectors into 1) 
        x_birnn_h = x_birnn_h.contiguous().view(x_birnn_h.size(0), -1) 
 
        x_unpacked, _ = pad_packed_sequence(x_birnn_out, batch_first=True) 
        return x_unpacked, x_birnn_h

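In general, the encoder takes a sequence of integers as input and creates a feature vector for each position. In this example, the encoder's output is these vectors together with the final hidden state of the bi-GRU used to produce them. This hidden state is used to initialize the decoder's hidden state, as described later in this section.

Looking more closely at the encoder, we first embed the input sequence with an embedding layer. Usually, just by setting the padding_idx flag on the embedding layer, we enable the model to handle variable-length sequences, because any position equal to padding_idx is given a zero-valued vector that is not updated during optimization. Recall that this is called the mask. In this encoder-decoder model, however, masked positions need to be handled differently because we use a bi-GRU to encode the source sequence. The primary reason is that the backward component can be influenced by the masked positions, by an amount proportional to the number of masked positions it encounters before it starts on the actual sequence.

To handle the masked positions of variable-length sequences in the bi-GRU, we use PyTorch's PackedSequence data structure. PackedSequences are derived from how CUDA allows variable-length sequences to be handled in a batched format. Any zero-padded sequence, such as the embedded source sequence in the encoder in Example 8-5, can be converted to a PackedSequence if two conditions are met: the length of each sequence is provided, and the minibatch is sorted by sequence length. Figure 8-11 showed this visually; because it is a complex topic, we demonstrate it again in Example 8-6.

Example 8-6: A simple demonstration of pack_padded_sequence() and pad_packed_sequence()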

Input[0] 
abcd_padded = torch.tensor([1, 2, 3, 4], dtype=torch.float32) 
efg_padded = torch.tensor([5, 6, 7, 0], dtype=torch.float32) 
h_padded = torch.tensor([8, 0, 0, 0], dtype=torch.float32) 
 
padded_tensor = torch.stack([abcd_padded, efg_padded, h_padded]) 
 
describe(padded_tensor) 
 
Output[0] 
Type: torch.FloatTensor 
Shape/size: torch.Size([3, 4]) 
Values: 
tensor([[ 1.,  2.,  3.,  4.], 
        [ 5.,  6.,  7.,  0.], 
        [ 8.,  0.,  0.,  0.]]) 
 
Input[1] 
lengths = [4, 3, 1] 
packed_tensor = pack_padded_sequence(padded_tensor, lengths,    
                                     batch_first=True) 
packed_tensor 
 
Output[1] 
PackedSequence(data=tensor([ 1.,  5.,  8.,  2.,  6.,  3.,  7.,  4.]), 
               batch_sizes=tensor([ 3,  2,  2,  1])) 
Input[2] 
unpacked_tensor, unpacked_lengths = \ 
    pad_packed_sequence(packed_tensor, batch_first=True) 
 
describe(unpacked_tensor) 
describe(unpacked_lengths) 
 
Output[2] 
Type: torch.FloatTensor 
Shape/size: torch.Size([3, 4]) 
Values: 
tensor([[ 1.,  2.,  3.,  4.], 
        [ 5.,  6.,  7.,  0.], 
        [ 8.,  0.,  0.,  0.]]) 
Type: torch.LongTensor 
Shape/size: torch.Size([3]) 
Values: 
tensor([ 4,  3,  1])
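
To see where the data and batch_sizes values in Output[1] come from, the packing layout can be reproduced with plain Python lists. This is a conceptual sketch only: pack_rows and unpack_rows are hypothetical helpers that mimic the layout for a one-dimensional feature, not PyTorch's implementation.

```python
def pack_rows(padded_rows, lengths):
    """Lay out rows (sorted by descending length) one time step after another."""
    assert lengths == sorted(lengths, reverse=True), "rows must be sorted by length"
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        # only sequences that are still "alive" at step t contribute a value
        active = [row[t] for row, n in zip(padded_rows, lengths) if t < n]
        data.extend(active)
        batch_sizes.append(len(active))
    return data, batch_sizes

def unpack_rows(data, batch_sizes, pad_value=0.0):
    """Rebuild the zero-padded matrix from the packed layout."""
    rows = [[pad_value] * len(batch_sizes) for _ in range(batch_sizes[0])]
    position = 0
    for t, size in enumerate(batch_sizes):
        for b in range(size):
            rows[b][t] = data[position]
            position += 1
    return rows

padded = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 0.0],
          [8.0, 0.0, 0.0, 0.0]]
data, batch_sizes = pack_rows(padded, lengths=[4, 3, 1])
print(data)         # same ordering as the PackedSequence data in Output[1]
print(batch_sizes)  # number of sequences still active at each time step
restored = unpack_rows(data, batch_sizes)
```

Reading the padded matrix column by column and skipping the padded cells gives exactly the packed ordering, which is why the batch must be sorted by length first.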

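As described earlier, we handle the sorting when generating each minibatch. Then, as shown in the encoder in Example 8-5, PyTorch's pack_padded_sequence() function is invoked by passing it the embedded sequences, the lengths of the sequences, and a Boolean flag indicating that the first dimension is the batch dimension. The output of this function is a PackedSequence. The resulting PackedSequence is fed into the bi-GRU to create state vectors for the downstream decoder. The output of the bi-GRU is unpacked into a full tensor using another Boolean flag indicating that the batch is on the first dimension. As illustrated in Figure 8-11, the unpacking operation sets every masked position to a zero-valued vector, preserving the integrity of downstream computations.

Example 8-7: The NMTDecoder constructs a target sentence from the encoded source sentence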

class NMTDecoder(nn.Module): 
    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size, bos_index): 
        """ 
        Args: 
            num_embeddings (int): number of embeddings is also the number of 
                unique words in target vocabulary 
            embedding_size (int): the embedding vector size 
            rnn_hidden_size (int): size of the hidden rnn state 
            bos_index(int): begin-of-sequence index 
        """ 
        super(NMTDecoder, self).__init__() 
        self._rnn_hidden_size = rnn_hidden_size 
        self.target_embedding = nn.Embedding(num_embeddings=num_embeddings, 
                                             embedding_dim=embedding_size, 
                                             padding_idx=0) 
        self.gru_cell = nn.GRUCell(embedding_size + rnn_hidden_size, 
                                   rnn_hidden_size) 
        self.hidden_map = nn.Linear(rnn_hidden_size, rnn_hidden_size) 
        self.classifier = nn.Linear(rnn_hidden_size * 2, num_embeddings) 
        self.bos_index = bos_index 
 
    def _init_indices(self, batch_size): 
        """ return the BEGIN-OF-SEQUENCE index vector """ 
        return torch.ones(batch_size, dtype=torch.int64) * self.bos_index 
 
    def _init_context_vectors(self, batch_size): 
        """ return a zeros vector for initializing the context """ 
        return torch.zeros(batch_size, self._rnn_hidden_size) 
 
    def forward(self, encoder_state, initial_hidden_state, target_sequence): 
        """The forward pass of the model 
 
        Args: 
            encoder_state (torch.Tensor): the output of the NMTEncoder 
            initial_hidden_state (torch.Tensor): The last hidden state in the  NMTEncoder 
            target_sequence (torch.Tensor): the target text data tensor 
        Returns: 
            output_vectors (torch.Tensor): prediction vectors at each output step 
        """ 
        # We are making an assumption there: The batch is on first 
        # The input is (Batch, Seq) 
        # We want to iterate over sequence so we permute it to (S, B) 
        target_sequence = target_sequence.permute(1, 0) 
 
        # use the provided encoder hidden state as the initial hidden state 
        h_t = self.hidden_map(initial_hidden_state) 
 
        batch_size = encoder_state.size(0) 
        # initialize context vectors to zeros 
        context_vectors = self._init_context_vectors(batch_size) 
        # initialize first y_t word as BOS 
        y_t_index = self._init_indices(batch_size) 
 
        h_t = h_t.to(encoder_state.device) 
        y_t_index = y_t_index.to(encoder_state.device) 
        context_vectors = context_vectors.to(encoder_state.device) 
 
        output_vectors = [] 
        # All cached tensors are moved from the GPU and stored for analysis 
        self._cached_p_attn = [] 
        self._cached_ht = [] 
        self._cached_decoder_state = encoder_state.cpu().detach().numpy() 
 
        output_sequence_size = target_sequence.size(0) 
        for i in range(output_sequence_size): 
 
            # Step 1: Embed word and concat with previous context 
            y_input_vector = self.target_embedding(target_sequence[i]) 
            rnn_input = torch.cat([y_input_vector, context_vectors], dim=1) 
 
            # Step 2: Make a GRU step, getting a new hidden vector 
            h_t = self.gru_cell(rnn_input, h_t) 
            self._cached_ht.append(h_t.cpu().data.numpy()) 
 
            # Step 3: Use the current hidden to attend to the encoder state 
            # verbose_attention (Example 8-8) returns two values 
            context_vectors, p_attn = \ 
                verbose_attention(encoder_state_vectors=encoder_state, 
                                  query_vector=h_t) 
 
            # auxiliary: cache the attention probabilities for visualization 
            self._cached_p_attn.append(p_attn.cpu().detach().numpy()) 
 
            # Step 4: Use the current hidden and context vectors 
            #         to make a prediction to the next word 
            prediction_vector = torch.cat((context_vectors, h_t), dim=1) 
            score_for_y_t_index = self.classifier(prediction_vector) 
 
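            # auxiliary: collect the prediction scores 
            output_vectors.append(score_for_y_t_index) 

        # stack the per-step scores and permute to (batch, seq, vocab) 
        output_vectors = torch.stack(output_vectors).permute(1, 0, 2) 

        return output_vectors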

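After the encoder creates the state vectors with its bi-GRU and the packing-unpacking coordination, the decoder iterates over the time steps to generate an output sequence. Functionally, this loop should seem very similar to the generation loop in Chapter 7, but with a few differences that are distinctly the methodological choices of Luong, Pham, and Manning's (2015) style of attention. First, a target sequence is provided as observations at each time step. Hidden states are computed by using a GRUCell. The initial hidden state is computed by applying a Linear layer to the concatenated final hidden states of the encoder bi-GRU. The input to the decoder GRU at each time step is the concatenation of an embedded input token and the context vector from the previous time step. The context vector is intended to capture information that is useful for that time step and acts to condition the output of the model. For the first time step, the context vector is all zeros, representing no context and mathematically allowing only the input to contribute to the GRU computation.

With the new hidden state as a query vector, a new set of context vectors is created using the attention mechanism for the current time step. These context vectors are concatenated with the hidden state to create a vector representing the decoding information at that time step. This decoding information state vector is used in a classifier (in this case, a simple Linear layer) to create a prediction vector, score_for_y_t_index. These prediction vectors can be turned into probability distributions over the output vocabulary with the softmax function, or they can be used with cross-entropy loss to optimize for the ground-truth targets. Before we turn to how the prediction vector is used in the training routine, we first examine the attention computation itself.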
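
The two uses of a prediction vector mentioned above, turning it into a probability distribution with softmax and scoring it against a ground-truth index with cross-entropy, can be worked through by hand for a single time step. The four-word vocabulary and the score values below are made up for illustration; in the model, the scores would come from the classifier layer.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Negative log-probability assigned to the ground-truth index."""
    return -math.log(softmax(scores)[target_index])

scores = [2.0, 0.5, -1.0, 0.0]                # hypothetical prediction vector
probs = softmax(scores)
loss_if_correct_is_0 = cross_entropy(scores, target_index=0)
loss_if_correct_is_2 = cross_entropy(scores, target_index=2)
print(probs)
print(loss_if_correct_is_0, loss_if_correct_is_2)
```

The loss is small when the highest-scoring word is the correct one and large when the correct word received a low score, which is the signal that training pushes back through the decoder.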

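8.5.3.1 A closer look at attention

It is important to understand how the attention mechanism works in this example. Recall from "Attention in Deep Neural Networks" that the attention mechanism can be described using queries, keys, and values. A score function takes the query vector and the key vectors as input to compute a set of weights that select among the value vectors. In this example, we use the dot-product scoring function, which is not the only possible scoring function. The decoder's hidden state is used as the query vector, and the set of encoder state vectors serves as both the key vectors and the value vectors.

The dot product of the decoder's hidden state with the encoder state vectors creates a scalar for each item in the encoded sequence. After applying the softmax function, these scalars become a probability distribution over the encoder state vectors. These probabilities are used to weight the encoder state vectors before they are added together, yielding a single vector for each batch item. In summary, the decoder hidden state is allowed to preferentially weight the encoder states at each time step. This works like a spotlight, giving the model the ability to learn how to highlight the information it needs to generate the output sequence. Example 8-8 demonstrates this version of the attention mechanism. The first function tries to spell out the operations verbosely; it uses the view() operation to insert dimensions of size 1 so that one tensor can be broadcast against another. In the terse_attention() version, the view() operation is replaced with the more commonly accepted practice, unsqueeze(). In addition, rather than multiplying elementwise and summing, it uses the more efficient matmul() operation.

Example 8-8: An attention mechanism that performs elementwise multiplication and summing more explicitly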

import torch
import torch.nn.functional as F

def verbose_attention(encoder_state_vectors, query_vector): 
    """ 
    encoder_state_vectors: 3dim tensor from bi-GRU in encoder 
    query_vector: hidden state in decoder GRU 
    """ 
    batch_size, num_vectors, vector_size = encoder_state_vectors.size() 
    vector_scores = \ 
        torch.sum(encoder_state_vectors * query_vector.view(batch_size, 1, 
                                                            vector_size), 
                  dim=2) 
    vector_probabilities = F.softmax(vector_scores, dim=1) 
    weighted_vectors = \ 
        encoder_state_vectors * vector_probabilities.view(batch_size, 
                                                          num_vectors, 1) 
    context_vectors = torch.sum(weighted_vectors, dim=1) 
    return context_vectors, vector_probabilities 
 
def terse_attention(encoder_state_vectors, query_vector): 
    """ 
    encoder_state_vectors: 3dim tensor from bi-GRU in encoder 
    query_vector: hidden state 
    """ 
    # squeeze only the trailing size-1 dimension so a batch of size 1 is preserved 
    vector_scores = torch.matmul(encoder_state_vectors, 
                                 query_vector.unsqueeze(dim=2)).squeeze(dim=2) 
    vector_probabilities = F.softmax(vector_scores, dim=-1) 
    context_vectors = torch.matmul(encoder_state_vectors.transpose(-2, -1), 
                                   vector_probabilities.unsqueeze(dim=2)).squeeze(dim=2) 
    return context_vectors, vector_probabilities
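
For a single batch item, the same dot-product attention can be followed by hand with small numbers. The sketch below uses plain Python lists instead of tensors; the function name dot_product_attention and the toy encoder states and query are assumptions chosen so that the arithmetic is easy to check.

```python
import math

def dot_product_attention(encoder_states, query):
    """Dot-product attention for one batch item, written with plain lists."""
    # 1. one scalar score per encoder position: dot(state, query)
    scores = [sum(e * q for e, q in zip(state, query)) for state in encoder_states]
    # 2. softmax over positions turns the scores into attention probabilities
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. the context vector is the probability-weighted sum of the encoder states
    context = [sum(p * state[d] for p, state in zip(probs, encoder_states))
               for d in range(len(query))]
    return context, probs

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three positions, size 2
query = [2.0, 0.0]                                       # decoder hidden state
context, probs = dot_product_attention(encoder_states, query)
print(probs)
print(context)
```

Here the scores are 2, 0, and 2, so the first and third positions receive equal, large weights and the second position is mostly ignored; the context vector is pulled toward the states that align with the query.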

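8.5.3.2 Learning to search and scheduled sampling

As currently written, the model assumes that the target sequence is provided and will be used as the input at each time step in the decoder. At test time this assumption is violated, because the model cannot cheat and know the sequence it is trying to generate. To accommodate this fact, one technique is to allow the model to use its own predictions during training. This technique is explored in the literature as "learning to search" and "scheduled sampling." One intuitive way to understand it is to think of the prediction problem as a search problem. At each time step, the model has many paths from which to choose (the number of choices is the size of the target vocabulary), and the data is observations of correct paths. At test time, the model is finally allowed to go "off path" because it is not provided with the correct path from which it should be computing probability distributions. Thus, letting the model sample its own path provides a way to optimize the model toward having better probability distributions when it deviates from the target sequences in the dataset.

There are three primary modifications to the code to have the model sample its own predictions during training. First, the initial indices are made more explicit as the BEGIN-OF-SEQUENCE token indices. Second, a random sample is drawn for each step in the generation loop, and if the random sample is smaller than the sample probability, the model's prediction is used during that iteration. Finally, the actual sampling itself is done under the conditional if use_sample. In Example 8-9, the commented line shows how you could use the maximum prediction, whereas the uncommented line shows how to actually sample indices at rates proportional to their probabilities.

Example 8-9: The decoder with a sampling procedure built into the forward pass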

class NMTDecoder(nn.Module): 
    def __init__(self, num_embeddings, embedding_size, rnn_size, bos_index): 
        super(NMTDecoder, self).__init__() 
        # ... other init code here ... 
 
        # arbitrarily set; any small constant will be fine 
        self._sampling_temperature = 3 
 
    def forward(self, encoder_state, initial_hidden_state, target_sequence, 
                sample_probability=0.0): 
        if target_sequence is None: 
            sample_probability = 1.0 
        else: 
            # We are making an assumption there: The batch is on first 
            # The input is (Batch, Seq) 
            # We want to iterate over sequence so we permute it to (S, B) 
            target_sequence = target_sequence.permute(1, 0) 
            output_sequence_size = target_sequence.size(0) 
 
        # ... nothing changes from the other implementation 
 
        output_sequence_size = target_sequence.size(0) 
        for i in range(output_sequence_size): 
            # new: a helper boolean and the teacher y_t_index 
            use_sample = np.random.random() < sample_probability 
            if not use_sample: 
                y_t_index = target_sequence[i] 
 
            # Step 1: Embed word and concat with previous context 
            # ... code omitted for space 
            # Step 2: Make a GRU step, getting a new hidden vector 
            # ... code omitted for space 
            # Step 3: Use the current hidden to attend to the encoder state 
            # ... code omitted for space 
            # Step 4: Use the current hidden and context vectors 
            #         to make a prediction to the next word 
            prediction_vector = torch.cat((context_vectors, h_t), dim=1) 
            score_for_y_t_index = self.classifier(prediction_vector) 
            # new: sampling if boolean is true. 
            if use_sample: 
                # sampling temperature forces a peakier distribution 
                p_y_t_index = F.softmax(score_for_y_t_index * 
                                        self._sampling_temperature, dim=1) 
                # method 1: choose most likely word 
                # _, y_t_index = torch.max(p_y_t_index, 1) 
                # method 2: sample from the distribution 
                y_t_index = torch.multinomial(p_y_t_index, 1).squeeze() 
 
            # auxiliary: collect the prediction scores 
            output_vectors.append(score_for_y_t_index) 
 
        output_vectors = torch.stack(output_vectors).permute(1, 0, 2) 
 
        return output_vectors
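
Two ideas in Example 8-9 can be isolated in a few lines of plain Python: the per-step coin flip that decides between the ground-truth token and a sampled one, and the effect of multiplying the scores by the sampling temperature before the softmax. This is a simplified sketch, not the decoder's actual control flow; the function names, the toy scores, and the use of Python's random module in place of torch.multinomial are all assumptions.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Softmax over scores multiplied by `temperature` (as in Example 8-9)."""
    scaled = [s * temperature for s in scores]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def choose_next_input(teacher_index, scores, sample_probability, rng,
                      temperature=3.0):
    """Return the ground-truth index or an index sampled from the model's scores."""
    use_sample = rng.random() < sample_probability
    if not use_sample:
        return teacher_index                     # teacher forcing for this step
    probs = softmax(scores, temperature)
    # draw one index at a rate proportional to its probability
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

rng = random.Random(0)
scores = [0.1, 2.0, 0.3]                          # hypothetical prediction scores
plain = softmax(scores, temperature=1.0)
peaky = softmax(scores, temperature=3.0)
print(max(plain), max(peaky))                     # larger multiplier, peakier distribution
print(choose_next_input(7, scores, sample_probability=0.0, rng=rng))
print(choose_next_input(7, scores, sample_probability=1.0, rng=rng))
```

With sample_probability set to 0.0 the function always returns the ground-truth index, and with 1.0 it always samples, which matches how the forward pass falls back to pure sampling when no target sequence is given.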

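8.5.4 Training Routine and Results

The training routine for this example is nearly identical to the training routines seen in previous chapters. For a fixed number of epochs, we iterate over the dataset in chunks called minibatches. Here, however, each minibatch is composed of four tensors: a matrix of integers for the source sequence, two matrices of integers for the target sequence, and a vector of integers for the source sequence lengths. The two target sequence matrices are the target sequence offset by one and padded either with BEGIN-OF-SEQUENCE tokens, to act as target sequence observations, or with END-OF-SEQUENCE tokens, to act as target sequence prediction labels. The model takes the source sequence and the target sequence observations as input to produce the target sequence predictions. The target sequence prediction labels are used in the loss function to compute the cross-entropy loss, which is then backpropagated to each model parameter so that its gradient is known. The optimizer is then invoked and updates each model parameter by an amount proportional to its gradient.

In addition to the loop over the training portion of the dataset, there is a loop over the validation portion. The validation score serves as a less-biased metric of model improvement. The procedure is identical to the training routine except that the model is put into eval mode and is not updated relative to the validation data.

After training the model, measuring its performance becomes an important question. Several sequence generation evaluation metrics were described in "Evaluating Sequence Generation Models," and metrics such as BLEU, which measures the n-gram overlap between predicted sentences and reference sentences, have become a standard for the machine translation field. We omit the evaluation code that aggregates the results here; you can find it in the book's GitHub repo. In that code, the model's outputs are aggregated with the source sentence, the reference target sentence, and the attention probability matrix for that example. Finally, BLEU-4 is computed for each generated sentence against its reference sentence.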
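
The n-gram overlap at the core of BLEU can be illustrated with a short sketch. Note that this is not the full metric: BLEU-4 combines the clipped precisions for n = 1 to 4 and applies a brevity penalty, whereas the hypothetical helper below computes only the clipped precision for a single n. The example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference, so repeating a correct word does not inflate the score.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    if not hyp_counts:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return overlap / sum(hyp_counts.values())

reference = "je suis très heureux".split()
hypothesis = "je suis heureux".split()
unigram_precision = clipped_ngram_precision(hypothesis, reference, n=1)
bigram_precision = clipped_ngram_precision(hypothesis, reference, n=2)
print(unigram_precision, bigram_precision)
```

All three words of the hypothesis appear in the reference, but only one of its two bigrams does, which shows how higher-order n-grams penalize missing or misordered words.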

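To assess qualitatively how well the model is working, we visualize the attention probability matrix as alignments between the source and generated text. It is important to note, however, that recent research has shown that attention-based alignments are not exactly the same as alignments in classical machine translation. Instead of the alignments between words and phrases being indicative of translation synonymy, attention-based alignment scores could indicate useful information for the decoder, such as attending to the sentence's subject when generating the output verb (Koehn and Knowles, 2017).

The two versions of our model differ in how they interact with the target sentence. The first version uses the provided target sequence as the input at each time step in the decoder. The second version uses scheduled sampling to allow the model to see its own predictions as inputs in the decoder. This has the benefit of forcing the model to optimize against its own errors. Table 8-1 shows the BLEU scores. Remember that, for ease of training, we chose a simplified version of the standard NMT task, which is why the scores seem higher than what you would normally find in the research literature. Although the second model, the one with scheduled sampling, has a higher BLEU score, the scores are fairly close. But what do these scores mean, exactly? To investigate this question, we need to inspect the model qualitatively.

For our deeper inspection, we plot the attention scores to see whether they provide any sort of alignment information between the source and target sentences. During this inspection, we find a stark contrast between the two models. Figure 8-12 shows the attention probability distribution at each decoder time step for the model with scheduled sampling. In this model, the attention weights line up fairly well for a sentence sampled from the validation portion of the dataset.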

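8.6 Summary

This chapter focused on producing sequence outputs given a conditioning context with so-called conditioned generation models. When the conditioning context is itself derived from another sequence, we refer to it as a sequence-to-sequence (S2S) model. We also discussed how S2S models are a special case of encoder-decoder models. To get the most out of a sequence, we discussed structural variants of the sequence models covered in Chapters 6 and 7, specifically the bidirectional models. We also learned how the attention mechanism can be incorporated to efficiently capture longer-range context. Finally, we discussed how to evaluate sequence-to-sequence models and demonstrated with an end-to-end machine translation example. So far, each chapter of the book has been dedicated to a specific network architecture. In the next chapter, we tie together this chapter and all the previous chapters and see examples of how you can combine these model architectures to build real-world systems.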