Training and Deploying an Attention-Based Neural Machine Translation Model with TensorFlow 2.0+

Article link: https://blog.csdn.net/Hellolanternfish/article/details/108821432

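This article walks through the complete process of building an attention-based neural machine translation model with TensorFlow 2.0: data preprocessing, model definition, training, model deployment, and result validation. Training on 10 million examples ran into memory problems, and vocabulary generation was resolved with jieba word segmentation. After training, the model's loss decreased and its accuracy improved. Finally, the article shows how the model performs on long sentences. In practice, using fewer than 200,000 examples is recommended to shorten the training cycle and make hyperparameter tuning easier.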

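This is a hands-on article on artificial intelligence and machine translation, aimed at AI R&D engineers and professional translators. It is meant to help translators understand, from a programming perspective, how machine translation is implemented in code, and to share the pleasure of reading source code with fellow machine translation developers.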

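The example uses the current TensorFlow 2.0+ framework to implement a Transformer model: attention, multi-head attention, positional embedding, masking, layer-by-layer extraction of translation results, and publishing as a Flask service. The example code is complete and runs as copied. If you would like the full .py files, leave a message on the WeChat official account to receive them.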

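This article is mainly adapted from the official TensorFlow 2.0 tutorial. Readers who prefer the original English version should refer to that tutorial. The official tutorial translates between Portuguese and English and covers only training and prediction. It gives no Chinese example and cannot be deployed as an application.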

Sample results after training on 10 million examples:

The user may open a window during driving, for example.
例如，用户可在驾驶期间打开车窗。

The flight simulator includes an Ethernet network, a flight control system, a host processing system, and a display system.
飞行模拟器包括以太网网络、飞行控制系统、主机处理系统以及显示系统。

Examples of the computer readable mediums include, but are not limited to, ROM, RAM, compact disc (CD)-ROMs, magnetic tapes, floppy disks, flash drives, smart cards and optical data storage devices.
计算机可读介质的示例包括但不限于ROM、RAM、光盘（CD）-ROM、磁带、软盘、闪存驱动、智能卡和光学数据存储设备。

It is to be understood that the present aspects are not limited to the disclosed forms, but, on the contrary, are intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
应当理解，本方面并不限于公开的形式，而是相反，本发明旨在涵盖所附权利要求的精神和范围内包括的各种修改和等同布置。

The rubber composition of claim 1, wherein the filler comprises at least one selected from the group consisting of carbon black, calcium carbonate, talc, clay, silica, mica, titanium dioxide, graphite, wollastonite, and nanosilver.
根据权利要求1所述的橡胶组合物的橡胶组合物，其中，所述加注件包括选自由炭黑、碳酸钙、滑石、粘土、二氧化硅、云母、二氧化钛、石墨、硅灰石和纳米壳组成的组中的至少一种。

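The whole process can be divided into the following steps:

1. Create the training data for the machine translation model.
2. Write the model and set the hyperparameters.
3. Train the model.
4. Publish the model as an application service.
5. Validate the machine translation model's results.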

Shared parameters: HyperParams.py

class HyperParams:
    # data
    source_temp = './data_src/train.tags.en_zh.en50'
    target_temp = './data_src/train.tags.en_zh.zh50'
    source_train = './data_src/corpus_raw.en-zh.en'
    target_train = './data_src/corpus_raw.en-zh.zh'
    source_test = 'data_src/test.tags.en_zh.en'
    target_test = 'data_src/test.tags.en_zh.zh'

    source_vocab_name = 'source.vocab.tsv'
    target_vocab_name = 'target.vocab.tsv'
    vocab_path = './preprocessed'
    source_vocab = vocab_path+'/'+source_vocab_name
    target_vocab = vocab_path+'/'+target_vocab_name
    model_path = './export_model'
    ckpt_path = './ckpt1'
    checkpoint_path = "./checkpoints1/train"
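    min_cnt = 2            # minimum occurrence count; words that appear too rarely fall below it

    # The base Transformer model uses num_layers=6, d_model=512, dff=2048.

    BUFFER_SIZE = 40000
    BATCH_SIZE = 128

    MAX_LENGTH = 100    # maximum sample length; longer samples are removed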
    num_layers = 4
    d_model = 256
    dff = 1024
    num_heads = 8
    dropout_rate = 0.2
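
As a quick sanity check on these settings, the sketch below verifies that d_model splits evenly across the attention heads and reports the resulting depth per head. The helper name `check_hparams` is an assumption made for illustration and is not part of the original code.

```python
# Hypothetical helper (not in the original code): validates the settings above.
def check_hparams(d_model, num_heads, dff, num_layers):
    # Multi-head attention splits d_model evenly across the heads.
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return {
        "depth_per_head": d_model // num_heads,
        "dff_ratio": dff / d_model,
        "num_layers": num_layers,
    }

# Values copied from HyperParams.
summary = check_hparams(d_model=256, num_heads=8, dff=1024, num_layers=4)
print(summary)
```

With the values above each head works on 32 dimensions, and the feed-forward layer is four times as wide as d_model, the same ratio as the base Transformer values quoted in the comment (2048 / 512).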

1. Create the training data for the machine translation model.
Generate the vocabularies for the source and target languages:

import tensorflow_datasets as tfds
import tensorflow as tf
import os

# Note: recent tensorflow_datasets releases expose this encoder as
# tfds.deprecated.text.SubwordTextEncoder; adjust the references below if
# tfds.features.text is not available in your version.
data_path = '/DATA/xxx/project/tf2/en_zh/'

# Load the parallel corpus from which the vocabularies are built
def init_tfds_data():
    source_files = [data_path+'data_src/corpus_raw.en-zh.en']
    target_files = [data_path+'data_src/corpus_raw.en-zh.zh']
    source_lines_dataset = tf.data.TextLineDataset(source_files)
    target_lines_dataset = tf.data.TextLineDataset(target_files)

    lines_dataset = tf.data.Dataset.zip((source_lines_dataset, target_lines_dataset))
    return lines_dataset

# Build and save the source-language and target-language vocabularies
def init_tokenizer(lines_dataset):
    # Save under data_path so that load_tokenizer reads the same files back
    os.makedirs(data_path+'tokenizer', exist_ok=True)
    tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus((en.numpy() for en, zh in lines_dataset), target_vocab_size=32000)
    tokenizer_en.save_to_file(data_path+'tokenizer/tokenizer_en')
    tokenizer_zh = tfds.features.text.SubwordTextEncoder.build_from_corpus((zh.numpy() for en, zh in lines_dataset), target_vocab_size=32000)
    tokenizer_zh.save_to_file(data_path+'tokenizer/tokenizer_zh')

# Load the vocabularies, building them first if they do not exist yet
def load_tokenizer(vocab_path):
    if not os.path.exists(vocab_path):
        train_examples = init_tfds_data()
        init_tokenizer(train_examples)
    tokenizer_en = tfds.features.text.SubwordTextEncoder.load_from_file(data_path+'tokenizer/tokenizer_en')
    tokenizer_zh = tfds.features.text.SubwordTextEncoder.load_from_file(data_path+'tokenizer/tokenizer_zh')
    return tokenizer_en, tokenizer_zh

# Check the vocabularies with an encode/decode round trip
def test_tokenizer(tokenizer_en, tokenizer_zh):
    source_string = 'Transformer is awesome.'
    tokenized_string = tokenizer_en.encode(source_string)
    print(tokenized_string, tokenizer_en.decode(tokenized_string))

    target_string = '例如，其特征在于，形成了高分子修饰层的橢球'
    target_tokenized_string = tokenizer_zh.encode(target_string)
    print(target_tokenized_string, tokenizer_zh.decode(target_tokenized_string))
    print('------'*10)

if __name__ == '__main__':
    tokenizer_en, tokenizer_zh = load_tokenizer(data_path+'tokenizer/tokenizer_en.subwords')
    test_tokenizer(tokenizer_en, tokenizer_zh)

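Note: if the dataset is too large, this step runs out of memory. That happened to me with 10 million examples, so I regenerated the Chinese subword vocabulary separately using jieba word segmentation.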
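
One way to keep memory bounded on a corpus of this size is to stream the file line by line and keep only token counts rather than the lines themselves. The following is a minimal sketch of that idea, not the author's actual script: `build_vocab` and its `tokenize` parameter are hypothetical names, and the default whitespace split is a stand-in for a real segmenter such as `jieba.lcut`. It produces a plain word list rather than a subword vocabulary, and it borrows the `min_cnt` threshold from HyperParams to drop rare tokens.

```python
from collections import Counter

def build_vocab(lines, tokenize=str.split, min_cnt=2):
    """Count tokens from an iterable of lines without holding the corpus in memory."""
    counts = Counter()
    for line in lines:                      # a file object yields one line at a time
        counts.update(tokenize(line.strip()))
    # Keep tokens seen at least min_cnt times, most frequent first.
    return [tok for tok, n in counts.most_common() if n >= min_cnt]

# Tiny pre-segmented sample standing in for the real corpus file.
sample = ["用户 可 打开 车窗", "用户 可 关闭 车窗", "飞行 模拟器"]
vocab = build_vocab(sample)
print(vocab)
```

With a real corpus, `lines` would be an open file handle, and `tokenize=jieba.lcut` could be passed if jieba is installed.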

2. Write the model and set the hyperparameters: transformer_model.py

import tensorflow as tf
import numpy as np

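"""
Positional encoding
Because this model contains no recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in a sentence. The positional encoding vector is added to the embedding vector. An embedding represents a token in a d-dimensional space where tokens with similar meaning lie closer to each other. However, the embedding does not encode the relative position of words in a sentence. After the positional encoding is added, words are therefore closer to each other in the d-dimensional space based on both the similarity of their meaning and their position in the sentence.
"""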

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)
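
To see what this encoding looks like, here is a NumPy-only restatement of the two functions above. It is a sketch that replaces the final tf.cast with a NumPy cast so that it runs without TensorFlow, and the name `positional_encoding_np` is introduced here for illustration. Each position pos and dimension index i gets the angle pos / 10000^(2*(i//2)/d_model), with sine applied to the even dimensions and cosine to the odd ones.

```python
import numpy as np

def positional_encoding_np(position, d_model):
    pos = np.arange(position)[:, np.newaxis]            # (position, 1)
    i = np.arange(d_model)[np.newaxis, :]               # (1, d_model)
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # even indices: sine
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # odd indices: cosine
    return angle_rads[np.newaxis, ...].astype(np.float32)

pe = positional_encoding_np(100, 256)    # MAX_LENGTH and d_model from HyperParams
print(pe.shape)                          # (1, 100, 256)
print(pe[0, 0, :4])                      # position 0 alternates sin(0)=0 and cos(0)=1
```

The leading axis of size 1 lets the encoding broadcast over the batch dimension when it is added to the embeddings.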
 

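"""
Masking
Mask all the pad tokens in a batch of sequences. This ensures that the model does not treat padding as input. The mask indicates where the pad value 0 appears: it outputs 1 at those positions and 0 everywhere else.
"""
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

    # add extra dimensions so the padding mask can be added to the attention logits
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)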
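"""
The look-ahead mask is used to mask the future tokens in a sequence. In other words, the mask indicates which entries should not be used.
This means that to predict the third word, only the first and second words are used. Similarly, to predict the fourth word, only the first, second, and third words are used, and so on.
"""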

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)
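
The two masks are easiest to understand on a tiny batch. The sketch below mirrors both functions in NumPy so that it runs without TensorFlow. The `_np` names and the token ids are made up for illustration and are not part of the original module. np.triu with k=1 yields the same strictly upper-triangular pattern as 1 - band_part(ones, -1, 0).

```python
import numpy as np

def create_padding_mask_np(seq):
    mask = (np.asarray(seq) == 0).astype(np.float32)   # 1 where the token is padding
    return mask[:, np.newaxis, np.newaxis, :]          # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask_np(size):
    # 1 above the diagonal marks the future positions that must be hidden.
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)  # (seq_len, seq_len)

batch = [[7, 6, 0, 0], [1, 2, 3, 0]]    # made-up token ids, 0 = padding
pad_mask = create_padding_mask_np(batch)
look_ahead = create_look_ahead_mask_np(3)
print(pad_mask[:, 0, 0, :])
print(look_ahead)
```

In the look-ahead mask, row i marks every position after i with 1, so position i can attend only to positions up to and including itself.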
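"""
Scaled dot-product attention
The attention function used by the Transformer takes three inputs: Q (query), K (key), and V (value). The attention weights are computed as Attention(Q, K, V) = softmax_k(Q K^T / sqrt(dk)) V. The dot-product attention is scaled down by the square root of the depth. This is done because for large depth values the dot product grows large in magnitude, pushing the softmax function into regions where it has very small gradients and producing a very hard softmax.
For example, assume that Q and K have a mean of 0 and a variance of 1. Their matrix product then has a mean of 0 and a variance of dk. The square root of dk (and not any other number) is therefore used for scaling, so that the scaled product of Q and K has a mean of 0 and a variance of 1, which gives a gentler softmax.
The mask is multiplied by -1e9 (close to negative infinity). This is done because the mask is added to the scaled matrix product of Q and K and is applied immediately before the softmax. The goal is to zero out these cells, since large negative inputs to softmax come out close to zero.
"""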
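
The variance argument can be checked numerically. The sketch below is an illustration using NumPy with an arbitrary seed and sample size, not code from the original post. It draws random Q and K rows with mean 0 and variance 1 and compares the variance of their dot products before and after dividing by the square root of dk.

```python
import numpy as np

rng = np.random.default_rng(0)         # arbitrary seed, for reproducibility only
dk = 64
q = rng.standard_normal((5000, dk))    # 5000 query vectors, mean 0, variance 1
k = rng.standard_normal((5000, dk))    # 5000 key vectors, mean 0, variance 1

dots = np.sum(q * k, axis=-1)          # one q.k dot product per row
scaled = dots / np.sqrt(dk)            # same products after scaling

print("variance of raw dot products:   ", dots.var())
print("variance of scaled dot products:", scaled.var())
```

The first figure should land near dk and the second near 1, which is why the division keeps the softmax inputs in a range where the gradients stay useful.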

def scaled_dot_product_attention(q, k, v, mask):
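    """Compute the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimensions, i.e. seq_len_k = seq_len_v.
  The mask has different shapes depending on its type (padding or look-ahead),
  but it must be broadcastable for addition.
  
  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with a shape broadcastable to
          (..., seq_len_q, seq_len_k). May be None.
    
  Returns:
    output, attention_weights
  """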

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # softmax is normalized on the last axis (seq_len_k) so that the scores add up to 1
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights
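"""
Because the softmax is normalized over K, its values determine how much importance is given to Q. The output is the product of the attention weights and the V (value) vectors. This ensures that the words to focus on are kept as they are, while irrelevant words are flushed out.
"""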
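
A small NumPy sketch makes the effect of the weights and of the mask concrete. It follows the same steps as scaled_dot_product_attention above but is a re-implementation for illustration only. The function name `attention_np` and the toy inputs are assumptions, not part of the original module.

```python
import numpy as np

def attention_np(q, k, v, mask=None):
    logits = q @ k.T / np.sqrt(k.shape[-1])                # (seq_len_q, seq_len_k)
    if mask is not None:
        logits = logits + mask * -1e9                      # masked keys get a huge negative score
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the key axis
    return weights @ v, weights

k = np.array([[10., 0., 0.], [0., 10., 0.], [0., 0., 10.]])
v = np.array([[1., 0.], [10., 0.], [100., 5.]])
q = np.array([[0., 10., 0.]])                              # aligns with the second key

out, w = attention_np(q, k, v)
print(w.round(3), out.round(3))        # nearly all weight falls on the second key

mask = np.array([[0., 1., 0.]])                            # hide the second key
out_m, w_m = attention_np(q, k, v, mask)
print(w_m.round(3), out_m.round(3))    # the first and third keys share the weight
```

Without a mask the output is essentially the second value row. With the second key masked, the two remaining keys score equally, so the output is the average of the first and third value rows.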

def print_out(q, k, v):
    temp_out, temp_attn