简单解析transformer代码

最新推荐文章于 2024-08-20 18:22:45 发布

洛克-李

最新推荐文章于 2024-08-20 18:22:45 发布

阅读量5.3k

点赞数 5

分类专栏：深度学习文章标签： tensorflow python transformer

本文链接：https://blog.csdn.net/qq_30232405/article/details/103807476

版权

深度学习专栏收录该内容

28 篇文章 1 订阅

订阅专栏

详解transformer代码

文章目录

详解transformer代码
5.train.py

1.代码下载：

在github下载了比较热门的transformer代码的实现，其gith地址为:https://github.com/Kyubyong/transformer

2.prepro.py

主要负责生成对应的预处理语料文件，并利用sentencepiece包来处理原始语料。

2.1 首先进行语料预处理阶段

    # train
    _prepro = lambda x:  [line.strip() for line in open(x, 'r').read().split("\n") \
                      if not line.startswith("<")]
    prepro_train1, prepro_train2 = _prepro(train1), _prepro(train2)
    assert len(prepro_train1)==len(prepro_train2), "Check if train source and target files match."

    # eval
    _prepro = lambda x: [re.sub("<[^>]+>", "", line).strip() \
                     for line in open(x, 'r').read().split("\n") \
                     if line.startswith("<seg id")]
    prepro_eval1, prepro_eval2 = _prepro(eval1), _prepro(eval2)
    assert len(prepro_eval1) == len(prepro_eval2), "Check if eval source and target files match."

    # test
    prepro_test1, prepro_test2 = _prepro(test1), _prepro(test2)
    assert len(prepro_test1) == len(prepro_test2), "Check if test source and target files match."

代码中可以看到，针对train，eval和test数据集进行了预处理，把其中的一些标点符号去掉

2.2 生成预处理过后的对应数据集

    def _write(sents, fname):
    with open(fname, 'w') as fout:
        fout.write("\n".join(sents))
    _write(prepro_train1, "iwslt2016/prepro/train.de")
    _write(prepro_train2, "iwslt2016/prepro/train.en")
    _write(prepro_train1+prepro_train2, "iwslt2016/prepro/train")
    _write(prepro_eval1, "iwslt2016/prepro/eval.de")
    _write(prepro_eval2, "iwslt2016/prepro/eval.en")
    _write(prepro_test1, "iwslt2016/prepro/test.de")
    _write(prepro_test2, "iwslt2016/prepro/test.en")

其中生成的“train”文件，是结合了“train.de”和“train.en”，以“de”结尾的是德语，以“en”结尾的是翻译成的英语。

2.3 sentencepiece处理

在代码中，利用了sentencepiece中的bpe算法来进行分词。
BPE（Byte Pair Encoding，双字节编码）。2016年应用于机器翻译，解决集外词（OOV）和罕见词（Rare word）问题。
BPE算法属于的是预处理中中的subword算法

    import sentencepiece as spm
    train = '--input=iwslt2016/prepro/train --pad_id=0 --unk_id=1 \ #输入训练文件
             --bos_id=2 --eos_id=3\
             --model_prefix=iwslt2016/segmented/bpe --vocab_size={} \ #输出文件的前缀名陈
             --model_type=bpe'.format(hp.vocab_size)
    spm.SentencePieceTrainer.Train(train)

    logging.info("# Load trained bpe model")
    sp = spm.SentencePieceProcessor()
    sp.Load("iwslt2016/segmented/bpe.model")    

    logging.info("# Segment")
    def _segment_and_write(sents, fname):
        with open(fname, "w") as fout:
            for sent in sents:
                pieces = sp.EncodeAsPieces(sent)    #对文件进行编码
                fout.write(" ".join(pieces) + "\n")

（1） BPE algorithm

BPE(字节对)编码或二元编码是一种简单的数据压缩形式，其中最常见的一对连续字节数据被替换为该数据中不存在的字节。后期使用时需要一个替换表来重建原始数据。OpenAI GPT-2 与Facebook RoBERTa均采用此方法构建subword vector.

优点：可以有效地平衡词汇表大小和步数(编码句子所需的token数量)。
缺点：基于贪婪和确定的符号替换，不能提供带概率的多个分片结果。

（2）BPE算法过程

1）准备足够大的训练语料

2）确定期望的subword词表大小

3）将单词拆分为字符序列并在末尾添加后缀“ </ w>”，统计单词频率。 本阶段的subword的粒度是字符。 例如，“ low”的频率为5，那么我们将其改写为“ l o w </ w>”：5

4）统计每一个连续字节对的出现频率，选择最高频者合并成新的subword

5）重复第4步直到达到第2步设定的subword词表大小或下一个最高频的字节对出现频率为1

停止符"“的意义在于表示subword是词后缀。举例来说：“st"字词不加”“可以出现在词首如"st ar”，加了”“表明改字词位于词尾，如"wide st”，二者意义截然不同。

每次合并后词表可能出现3种变化：

+1，表明加入合并后的新字词，同时原来在2个子词还保留（2个字词不是完全同时连续出现）
+0，表明加入合并后的新字词，同时原来2个子词中一个保留，一个被消解（一个字词完全随着另一个字词的出现而紧跟着出现）
-1，表明加入合并后的新字词，同时原来2个子词都被消解（2个字词同时连续出现）

实际上，随着合并的次数增加，词表大小通常先增加后减小。

（3）BPE算法例子

输入：
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

Iter 1, 最高频连续字节对"e"和"s"出现了6+3=9次，合并成"es"。输出：
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}

Iter 2, 最高频连续字节对"es"和"t"出现了6+3=9次, 合并成"est"。输出：
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}

Iter 3, 以此类推，最高频连续字节对为"est"和"</w>" 输出：
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}

Iter n, 继续迭代直到达到预设的subword词表大小或下一个最高频的字节对出现频率为1。

3.data_load.py

主要负责生成batch数据

3.1 主方法：get_batch

用来生成多个batch数据。
主要使用了“tf.data.Dataset.from_generator”加载数据的方法，把句子中的词语生成在词典中的id。

def get_batch(fpath1, fpath2, maxlen1, maxlen2, vocab_fpath, batch_size, shuffle=False):
    '''Gets training / evaluation mini-batches
    fpath1: source file path. string.
    fpath2: target file path. string.
    maxlen1: source sent maximum length. scalar.
    maxlen2: target sent maximum length. scalar.
    vocab_fpath: string. vocabulary file path.
    batch_size: scalar
    shuffle: boolean

    Returns
    batches
    num_batches: number of mini-batches
    num_samples
    '''
    sents1, sents2 = load_data(fpath1, fpath2, maxlen1, maxlen2)
    batches = input_fn(sents1, sents2, vocab_fpath, batch_size, shuffle=shuffle)
    num_batches = calc_num_batches(len(sents1), batch_size)
    return batches, num_batches, len(sents1)

参数返回	描述
batches	tf.data.Dataset的一种形式，包含了元组xs（）和元组ys
num_batches	总共有多个个batches进行迭代，也就是有多少轮epoch
len(sents1)	数据集的大小

3.2 load_data

加载数据

def load_data(fpath1, fpath2, maxlen1, maxlen2):
    '''Loads source and target data and filters out too lengthy samples.
    fpath1: source file path. string.
    fpath2: target file path. string.
    maxlen1: source sent maximum length. scalar.
    maxlen2: target sent maximum length. scalar.

    Returns
    sents1: list of source sents
    sents2: list of target sents
    '''
    sents1, sents2 = [], []
    with open(fpath1, 'r') as f1, open(fpath2, 'r') as f2:
        for sent1, sent2 in zip(f1, f2):
            if len(sent1.split()) + 1 > maxlen1: continue # 1: </s>
            if len(sent2.split()) + 1 > maxlen2: continue  # 1: </s>
            sents1.append(sent1.strip())
            sents2.append(sent2.strip())
    return sents1, sents2

当句子超过长度maxlen1或者maxlen2的时候，则抛弃
保存句子，返回输入和输入句子的列表

3.3 input_fn

def encode(inp, type, dict):
    '''Converts string to number. Used for `generator_fn`.
    inp: 1d byte array.
    type: "x" (source side) or "y" (target side)
    dict: token2idx dictionary

    Returns
    list of numbers
    '''
    inp_str = inp.decode("utf-8")
    if type=="x": tokens = inp_str.split() + ["</s>"]
    else: tokens = ["<s>"] + inp_str.split() + ["</s>"]

    x = [dict.get(t, dict["<unk>"]) for t in tokens]
    return x

def generator_fn(sents1, sents2, vocab_fpath):
    '''Generates training / evaluation data
    sents1: list of source sents
    sents2: list of target sents
    vocab_fpath: string. vocabulary file path.

    yields
    xs: tuple of
        x: list of source token ids in a sent
        x_seqlen: int. sequence length of x
        sent1: str. raw source (=input) sentence
    labels: tuple of
        decoder_input: decoder_input: list of encoded decoder inputs
        y: list of target token ids in a sent
        y_seqlen: int. sequence length of y
        sent2: str. target sentence
    '''
    token2idx, _ = load_vocab(vocab_fpath)
    for sent1, sent2 in zip(sents1, sents2):
        x = encode(sent1, "x", token2idx)
        y = encode(sent2, "y", token2idx)
        decoder_input, y = y[:-1], y[1:]

        x_seqlen, y_seqlen = len(x), len(y)
        yield (x, x_seqlen, sent1), (decoder_input, y, y_seqlen, sent2)

def input_fn(sents1, sents2, vocab_fpath, batch_size, shuffle=False):
    '''Batchify data
    sents1: list of source sents
    sents2: list of target sents
    vocab_fpath: string. vocabulary file path.
    batch_size: scalar
    shuffle: boolean

    Returns
    xs: tuple of
        x: int32 tensor. (N, T1) # 句子中每个词语转换为id
        x_seqlens: int32 tensor. (N,) # 句子原有长度
        sents1: str tensor. (N,) # 单个句子
    ys: tuple of
        decoder_input: int32 tensor. (N, T2) # 句子中每个词语转换为id，decoder输入
        y: int32 tensor. (N, T2) # 句子中每个词语转换为id，decoder输出
        y_seqlen: int32 tensor. (N, ) # 句子原有长度
        sents2: str tensor. (N,) # 单个句子
    '''
    shapes = (([None], (), ()),
              ([None], [None], (), ()))
    types = ((tf.int32, tf.int32, tf.string),
             (tf.int32, tf.int32, tf.int32, tf.string))
    paddings = ((0, 0, ''),
                (0, 0, 0, ''))

    dataset = tf.data.Dataset.from_generator(
        generator_fn,
        output_shapes=shapes,
        output_types=types,
        args=(sents1, sents2, vocab_fpath))  # <- arguments for generator_fn. converted to np string arrays

    if shuffle: # for training
        dataset = dataset.shuffle(128*batch_size)

    dataset = dataset.repeat()  # iterate forever
    dataset = dataset.padded_batch(batch_size, shapes, paddings).prefetch(1)

    return dataset

用了“tf.data.Dataset.from_generator”加载数据，生成器方法为：generator_fn
返回的dataset中，有两个变量xs和ys。其中xs中包含:x，x_seqlens，sents1；ys中包含：decoder_input，y，y_seqlen，sents2
在encoder输入句子x中，添加了结尾符号“</s>”；在输出句子y中，添加了开头符号“<s>”和结尾符号“</s>”
decoder模型中，decoder_input：用了y[:-1]
decoder模型中，输出：用了y[1:]
最后再用“0” padding 句子，把所有的词语转换成对应的词典id
xs和ys的描述：

xs: tuple of
    x: int32 tensor. (N, T1) # 句子中每个词语转换为id
    x_seqlens: int32 tensor. (N,) # 句子原有长度
    sents1: str tensor. (N,) # 单个句子
ys: tuple of
    decoder_input: int32 tensor. (N, T2) # 句子中每个词语转换为id，decoder输入
    y: int32 tensor. (N, T2) # 句子中每个词语转换为id，decoder输出
    y_seqlen: int32 tensor. (N, ) # 句子原有长度
    sents2: str tensor. (N,) # 单个句子

4.model.py

实现transformer模型的主要代码

4.1 初始化

    def __init__(self, hp):
        self.hp = hp
        self.token2idx, self.idx2token = load_vocab(hp.vocab)
        self.embeddings = get_token_embeddings(self.hp.vocab_size, self.hp.d_model, zero_pad=True)

get_token_embeddings：生成词向量矩阵，这个矩阵是随机初始化的，且self.embeddings设置成了tf.get_variable共享参数
self.embeddings：其维度为（vocab_size，d_model），作者默认为维度是（32001，512）

4.2 encode模型

这部分代码是实现encode模型的

    def encode(self, xs, training=True):
        '''
        Returns
        memory: encoder outputs. (N, T1, d_model)
        '''
        with tf.variable_scope("encoder", reuse=tf.AUTO_REUSE):
            x, seqlens, sents1 = xs

            # src_masks
            src_masks = tf.math.equal(x, 0) # (N, T1)

            # embedding
            enc = tf.nn.embedding_lookup(self.embeddings, x) # (N, T1, d_model)
            enc *= self.hp.d_model**0.5 # scale

            enc += positional_encoding(enc, self.hp.maxlen1)
            enc = tf.layers.dropout(enc, self.hp.dropout_rate, training=training)

            ## Blocks
            for i in range(self.hp.num_blocks):
                with tf.variable_scope("num_blocks_{}".format(i), reuse=tf.AUTO_REUSE):
                    # self-attention
                    enc = multihead_attention(queries=enc,
                                              keys=enc,
                                              values=enc,
                                              key_masks=src_masks,
                                              num_heads=self.hp.num_heads,
                                              dropout_rate=self.hp.dropout_rate,
                                              training=training,
                                              causality=False)
                    # feed forward
                    enc = ff(enc, num_units=[self.hp.d_ff, self.hp.d_model])
        memory = enc
        return memory, sents1, src_masks

（1）超参数的含义：

超参数名称	含义
N	batch_size
T1	句子长度
d_model	词向量维度

（2）实现的功能

输入词向量+positional_encoding
encode中共有6个blocks进行连接，每个encode中有multihead attention和全连接层ff进行连接

（3）具体的分析

4.2.1 positional encoding

transformer模型中缺少一种解释输入序列中单词顺序的方法，它跟序列模型还不不一样。为了处理这个问题，transformer给encoder层和decoder层的输入添加了一个额外的向量Positional Encoding，维度和embedding的维度一样，这个向量采用了一种很独特的方法来让模型学习到这个值，这个向量能决定当前词的位置，或者说在一个句子中不同的词之间的距离。这个位置向量的具体计算方法有很多种，论文中的计算方法如下：

其中pos是指当前词在句子中的位置，i是指向量中每个值的index，可以看出，在偶数位置，使用正弦编码，在奇数位置，使用余弦编码。

def positional_encoding(inputs,
                        maxlen,
                        masking=True,
                        scope="positional_encoding"):
    '''Sinusoidal Positional_Encoding. See 3.5
    inputs: 3d tensor. (N, T, E)
    maxlen: scalar. Must be >= T
    masking: Boolean. If True, padding positions are set to zeros.
    scope: Optional scope for `variable_scope`.

    returns
    3d tensor that has the same shape as inputs.
    '''

    E = inputs.get_shape().as_list()[-1] # static
    N, T = tf.shape(inputs)[0], tf.shape(inputs)[1] # dynamic
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # position indices
        position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1]) # (N, T)

        # First part of the PE function: sin and cos argument
        position_enc = np.array([
            [pos / np.power(10000, (i-i%2)/E) for i in range(E)]
            for pos in range(maxlen)])

        # Second part, apply the cosine to even columns and sin to odds.
        position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
        position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1
        position_enc = tf.convert_to_tensor(position_enc, tf.float32) # (maxlen, E)

        # lookup
        outputs = tf.nn.embedding_lookup(position_enc, position_ind)

        # masks
        if masking:
            outputs = tf.where(tf.equal(inputs, 0), inputs, outputs)

        return tf.to_float(outputs)

最后得到的positional encoding加在原始的初始化词向量中。

4.2.2 multihead_attention

multihead attention的主要公式为：
在这里插入图片描述

def scaled_dot_product_attention(Q, K, V, key_masks,
                                 causality=False, dropout_rate=0.,
                                 training=True,
                                 scope="scaled_dot_product_attention"):
    '''See 3.2.1.
    Q: Packed queries. 3d tensor. [N, T_q, d_k].
    K: Packed keys. 3d tensor. [N, T_k, d_k].
    V: Packed values. 3d tensor. [N, T_k, d_v].
    key_masks: A 2d tensor with shape of [N, key_seqlen]
    causality: If True, applies masking for future blinding
    dropout_rate: A floating point number of [0, 1].
    training: boolean for controlling droput
    scope: Optional scope for `variable_scope`.
    '''
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        d_k = Q.get_shape().as_list()[-1]

        # dot product
        outputs = tf.matmul(Q, tf.transpose(K, [0, 2, 1]))  # (N, T_q, T_k)

        # scale
        outputs /= d_k ** 0.5

        # key masking
        outputs = mask(outputs, key_masks=key_masks, type="key")

        # causality or future blinding masking
        if causality:
            outputs = mask(outputs, type="future")

        # softmax
        outputs = tf.nn.softmax(outputs)
        attention = tf.transpose(outputs, [0, 2, 1])
        tf.summary.image("attention", tf.expand_dims(attention[:1], -1))

        # # query masking
        # outputs = mask(outputs, Q, K, type="query")

        # dropout
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=training)

        # weighted sum (context vectors)
        outputs = tf.matmul(outputs, V)  # (N, T_q, d_v)

    return outputs

def multihead_attention(queries, keys, values, key_masks,
                        num_heads=8, 
                        dropout_rate=0,
                        training=True,
                        causality=False,
                        scope="multihead_attention"):
    '''Applies multihead attention. See 3.2.2
    queries: A 3d tensor with shape of [N, T_q, d_model].
    keys: A 3d tensor with shape of [N, T_k, d_model].
    values: A 3d tensor with shape of [N, T_k, d_model].
    key_masks: A 2d tensor with shape of [N, key_seqlen]
    num_heads: An int. Number of heads.
    dropout_rate: A floating point number.
    training: Boolean. Controller of mechanism for dropout.
    causality: Boolean. If true, units that reference the future are masked.
    scope: Optional scope for `variable_scope`.
        
    Returns
      A 3d tensor with shape of (N, T_q, C)  
    '''
    d_model = queries.get_shape().as_list()[-1]
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # Linear projections
        Q = tf.layers.dense(queries, d_model, use_bias=True) # (N, T_q, d_model)
        K = tf.layers.dense(keys, d_model, use_bias=True) # (N, T_k, d_model)
        V = tf.layers.dense(values, d_model, use_bias=True) # (N, T_k, d_model)
        
        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, d_model/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, d_model/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, d_model/h)

        # Attention
        outputs = scaled_dot_product_attention(Q_, K_, V_, key_masks, causality, dropout_rate, training)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, d_model)
              
        # Residual connection
        outputs += queries
              
        # Normalize
        outputs = ln(outputs)
 
    return outputs

其中d_k参数的值为：512/8 = 64
用到了residual方法和layer normalization方法
可以看到，在encode模型中Q,K,V三个参数都是一致的。

4.3 decode模型

这部分代码是实现decode模型的

    def decode(self, ys, memory, src_masks, training=True):
        '''
        memory: encoder outputs. (N, T1, d_model)
        src_masks: (N, T1)

        Returns
        logits: (N, T2, V). float32.
        y_hat: (N, T2). int32
        y: (N, T2). int32
        sents2: (N,). string.
        '''
        with tf.variable_scope("decoder", reuse=tf.AUTO_REUSE):
            decoder_inputs, y, seqlens, sents2 = ys

            # tgt_masks
            tgt_masks = tf.math.equal(decoder_inputs, 0)  # (N, T2)

            # embedding
            dec = tf.nn.embedding_lookup(self.embeddings, decoder_inputs)  # (N, T2, d_model)
            dec *= self.hp.d_model ** 0.5  # scale

            dec += positional_encoding(dec, self.hp.maxlen2)
            dec = tf.layers.dropout(dec, self.hp.dropout_rate, training=training)

            # Blocks
            for i in range(self.hp.num_blocks):
                with tf.variable_scope("num_blocks_{}".format(i), reuse=tf.AUTO_REUSE):
                    # Masked self-attention (Note that causality is True at this time)
                    dec = multihead_attention(queries=dec,
                                              keys=dec,
                                              values=dec,
                                              key_masks=tgt_masks,
                                              num_heads=self.hp.num_heads,
                                              dropout_rate=self.hp.dropout_rate,
                                              training=training,
                                              causality=True,
                                              scope="self_attention")

                    # Vanilla attention
                    dec = multihead_attention(queries=dec,
                                              keys=memory,
                                              values=memory,
                                              key_masks=src_masks,
                                              num_heads=self.hp.num_heads,
                                              dropout_rate=self.hp.dropout_rate,
                                              training=training,
                                              causality=False,
                                              scope="vanilla_attention")
                    ### Feed Forward
                    dec = ff(dec, num_units=[self.hp.d_ff, self.hp.d_model])

        # Final linear projection (embedding weights are shared)
        weights = tf.transpose(self.embeddings) # (d_model, vocab_size)
        logits = tf.einsum('ntd,dk->ntk', dec, weights) # (N, T2, vocab_size)
        y_hat = tf.to_int32(tf.argmax(logits, axis=-1))

        return logits, y_hat, y, sents2

encode模型中与encode不同之处在于，实现了两个multihead attention结构。

第一个multihead attention结构和encode模型中的一样，都为self-attention结构
第二个multihead attention结构，在Q,K,V的输入就不同了，其输入memory实际上是encode模型的输入结果。
tf.einsum函数是实现矩阵相乘的方式，一般来说如果用mutual的话，不能进行三维矩阵和二维矩阵的相乘，而einsum则可以，通过设置参数，如“ntd,dk->ntk”可以得到矩阵维度是[n,t,k]。

4.4 train函数

用来训练模型的函数
label_smoothing函数用来进行one hot函数的丝滑处理：

def label_smoothing(inputs, epsilon=0.1):
    '''Applies label smoothing. See 5.4 and https://arxiv.org/abs/1512.00567.
    inputs: 3d tensor. [N, T, V], where V is the number of vocabulary.
    epsilon: Smoothing rate.
    
    For example,
    
    ''
    import tensorflow as tf
    inputs = tf.convert_to_tensor([[[0, 0, 1], 
       [0, 1, 0],
       [1, 0, 0]],

      [[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0]]], tf.float32)
       
    outputs = label_smoothing(inputs)
    
    with tf.Session() as sess:
        print(sess.run([outputs]))
    
    >>
    [array([[[ 0.03333334,  0.03333334,  0.93333334],
        [ 0.03333334,  0.93333334,  0.03333334],
        [ 0.93333334,  0.03333334,  0.03333334]],

       [[ 0.93333334,  0.03333334,  0.03333334],
        [ 0.93333334,  0.03333334,  0.03333334],
        [ 0.03333334,  0.93333334,  0.03333334]]], dtype=float32)]   
    ''   
    '''
    V = inputs.get_shape().as_list()[-1] # number of channels
    return ((1-epsilon) * inputs) + (epsilon / V)

loss函数用到了交叉熵函数，但是在计算的时候去掉了padding的影响。

    y_ = label_smoothing(tf.one_hot(y, depth=self.hp.vocab_size))
    ce = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=y_)
    nonpadding = tf.to_float(tf.not_equal(y, self.token2idx["<pad>"]))  # 0: <pad>
    loss = tf.reduce_sum(ce * nonpadding) / (tf.reduce_sum(nonpadding) + 1e-7)

同时多学习率进行调整，用到了warmup操作，初始阶段lr慢慢上升，迭代后期则慢慢下降

    global_step = tf.train.get_or_create_global_step()
    lr = noam_scheme(self.hp.lr, global_step, self.hp.warmup_steps)

4.5 eval函数

大致和train函数差不多
在decoder输入的时候，没有用到输出数据集，而是用了xs的输入数据集进行构造，这个decoder_input实际的大小为[N,1]，其中N为batch size，也即是仅仅输入了<s>开头标记。同时在推断下一个词语，一个一个词语进行拼接

    decoder_inputs, y, y_seqlen, sents2 = ys

    decoder_inputs = tf.ones((tf.shape(xs[0])[0], 1), tf.int32) * self.token2idx["<s>"]
    ys = (decoder_inputs, y, y_seqlen, sents2)

    memory, sents1, src_masks = self.encode(xs, False)

    logging.info("Inference graph is being built. Please be patient.")
    for _ in tqdm(range(self.hp.barrages_maxlen2)):
        logits, y_hat, y, sents2 = self.decode(ys, memory, src_masks, False)
        if tf.reduce_sum(y_hat, 1) == self.token2idx["<pad>"]: break

        _decoder_inputs = tf.concat((decoder_inputs, y_hat), 1)
        ys = (_decoder_inputs, y, y_seqlen, sents2)