Deep Learning 13: A Detailed Code Walkthrough of the Transformer Decoder

Opening words:
Start from the basics, keep learning, and never give up.
A life-loving, tech-loving programmer from Mars

The BERT series:

  1. BERT corpus generation
  2. BERT loss analysis
  3. BERT: detailed walkthrough of the Transformer encoder
  4. BERT: detailed walkthrough of the Transformer decoder

Without further ado, let's dive into today's main content.

    def decode(self, targets, encoder_outputs, attention_bias):
        """
        :param targets:  [batch_size, target_length]
        :param encoder_outputs: [batch_size, input_length, hidden_size]
        :param attention_bias:  [batch_size, 1, 1, input_length]
        :return: [batch_size, target_length, vocab_size]
        """
        with tf.name_scope('decode'):
            #   [batch_size, target_length, hidden_size]
            decoder_inputs = self.embedding_layer(targets)
            with tf.name_scope('shift_targets'):
                #   pad embedding value 0 at the head of sequence and remove eos_id
                decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
            with tf.name_scope('add_pos_embedding'):
                length = tf.shape(decoder_inputs)[1]
                position_decode = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
                decoder_inputs = tf.add(decoder_inputs, position_decode)

            if self.train:
                decoder_inputs = tf.nn.dropout(decoder_inputs, 1. - self.params.get('encoder_decoder_dropout'))

            decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(length)

            outputs = self.decoder_stack(
                decoder_inputs,
                encoder_outputs,
                decoder_self_attention_bias,
                attention_bias
            )

            #   [batch_size, target_length, vocab_size]
            logits = self.embedding_layer.linear(outputs)

            return logits

The shapes of the input arguments are documented in the code comments above.
OK, let's walk through the code step by step.

1. embedding_layer

This is the same as in the encoder, so I won't explain it again here; if anything is unclear, see the previous post. The returned shape is [batch_size, sequence_length, hidden_size].

2. pad

decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]

This call should be fairly easy to follow. The input shape is [batch_size, sequence_length, hidden_size], so rank 3. The first dimension is not padded, the last dimension is not padded, and the middle (time) dimension gets one step of padding at the front and none at the back, giving a padded shape of [batch_size, sequence_length + 1, hidden_size]. The last step along the second dimension (which corresponds to the [EOS] token) is then dropped, so the shape returns to [batch_size, sequence_length, hidden_size]. A small demonstration follows.
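
As a quick illustration, here is a NumPy sketch (not from the original repo; the toy tensor is made up) of the same shift: pad one all-zero embedding at the front of the time axis and drop the last position, so position t of the decoder input holds the embedding of target t-1.

import numpy as np

# hypothetical toy embeddings: batch_size=1, target_length=4, hidden_size=2
decoder_inputs = np.arange(1, 9, dtype=np.float32).reshape(1, 4, 2)

# pad one zero vector at the head of the time axis, then drop the last step,
# which mirrors tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
shifted = np.pad(decoder_inputs, ((0, 0), (1, 0), (0, 0)))[:, :-1, :]

print(shifted.shape)   # (1, 4, 2) -- same shape as before
print(shifted[0, 0])   # [0. 0.]   -- position 0 now sees a zero "start" embedding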

3. get_position_encoding

This step is also the same as in the previous post, so I won't go into detail. It returns a tensor of shape [sequence_length, hidden_size], which is simply added (broadcast) to the embedding output, giving a result of shape [batch_size, sequence_length, hidden_size]. A dropout layer is then applied on top. A sketch of the sinusoidal encoding is shown below.
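
The repo's model_utils.get_position_encoding is the standard sinusoidal encoding from "Attention Is All You Need". Here is a NumPy sketch of the usual formulation (the exact scaling constants in the repo may differ slightly):

import numpy as np

def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    """Sinusoidal position encoding, shape [length, hidden_size].

    Sketch of the common Transformer formulation: the first half of the
    channels are sines, the second half cosines, each at a different timescale.
    """
    position = np.arange(length, dtype=np.float32)                     # [length]
    num_timescales = hidden_size // 2
    log_timescale_increment = np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * np.exp(
        np.arange(num_timescales, dtype=np.float32) * -log_timescale_increment)
    scaled_time = position[:, None] * inv_timescales[None, :]          # [length, hidden_size // 2]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

pos = get_position_encoding(10, 8)
print(pos.shape)   # (10, 8) -- broadcast-added to [batch_size, length, hidden_size]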

4. get_decoder_self_attention_bias

_NEG_INF = -1e9

def get_decoder_self_attention_bias(length):
    with tf.name_scope("decoder_self_attention_bias"):
        # lower-triangular matrix of ones: position i may only attend to positions <= i
        valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
        valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
        # invalid (future) positions get -1e9, valid positions get 0
        decoder_bias = _NEG_INF * (1.0 - valid_locs)
    return decoder_bias

valid_locs is the lower triangular part, just like the matrix below.

[[1. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0.]
 [1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1.]]

The final output looks like the following: the bias ends up with the -1e9 values in the upper triangular part.

tf.Tensor(
[[[[-0.e+00 -1.e+09 -1.e+09 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -1.e+09 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -0.e+00]]]], shape=(1, 1, 5, 5), dtype=float32)
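
To see what this bias does, add it to a row of attention logits and run softmax: every future position receives -1e9, so its attention weight collapses to (numerically) zero. A small NumPy check, with made-up uniform logits:

import numpy as np

_NEG_INF = -1e9
length = 5

# lower-triangular "valid" mask, then turn the invalid (upper) part into -1e9
valid_locs = np.tril(np.ones((length, length), dtype=np.float32))
decoder_bias = _NEG_INF * (1.0 - valid_locs)             # [length, length]

logits = np.zeros((length, length), dtype=np.float32)    # toy uniform logits
masked = logits + decoder_bias
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
# row t attends only to positions <= t, e.g. row 0 is [1, 0, 0, 0, 0]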

5. decoder_stack

OK, let's take a look at the DecoderStack.

class DecoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(DecoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )

            vanilla_attention_layer = AttentionLayer(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )

            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )

            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(vanilla_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )

        self.output_norm = LayerNormalization(self.params.get('hidden_size'))
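
The post only shows __init__; the stack's forward pass is not included. Based on how the three wrapped sub-layers are described in sections 5.1 to 5.4 below, its call method presumably looks roughly like the following sketch (my reconstruction, not the repo's exact code):

    def call(self, decoder_inputs, encoder_outputs, decoder_self_attention_bias, attention_bias):
        # Sketch of the omitted forward pass, assuming the usual sub-layer ordering.
        for n, layer in enumerate(self.layers):
            self_attention_layer, enc_dec_attention_layer, ffn_layer = layer
            with tf.name_scope('layer_%d' % n):
                # 1. masked self-attention: Q = K = V = decoder_inputs
                decoder_inputs = self_attention_layer(decoder_inputs, decoder_self_attention_bias)
                # 2. encoder-decoder attention: Q from the decoder, K/V from encoder_outputs
                decoder_inputs = enc_dec_attention_layer(decoder_inputs, encoder_outputs, attention_bias)
                # 3. position-wise feed-forward network
                decoder_inputs = ffn_layer(decoder_inputs)
        # final layer normalization on the stack output
        return self.output_norm(decoder_inputs)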

5.1 self_attention

This part is exactly the same as the encoder's self_attention: Q, K, and V are all decoder_inputs, and the computation is identical.
The only difference is that the bias comes from get_decoder_self_attention_bias above, which masks out future positions. A single-head sketch follows.
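
For intuition, here is a compact NumPy sketch of the masked self-attention for a single head (the dense projections and the multi-head reshapes are omitted; the names are illustrative, not the repo's):

import numpy as np

def masked_self_attention(x, bias):
    """Single-head sketch of the decoder's masked self-attention.

    x:    [batch_size, target_length, hidden_size]; here Q = K = V = x
          (the real layer first projects x with three dense layers).
    bias: [target_length, target_length] mask of 0 / -1e9 values, i.e. the
          decoder_bias from step 4 with its leading [1, 1] dimensions squeezed out.
    """
    d = x.shape[-1]
    q = k = v = x
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)              # [B, T_d, T_d]
    logits = logits + bias                                      # future positions get -1e9
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over the last axis
    return weights @ v                                          # [B, T_d, hidden_size]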

5.2 vanilla_attention

What makes this attention different is that it attends from the decoder to the encoder. This is arguably the most important attention in the model, since it aligns the decoder with the encoder. In the shapes below, B is batch_size, T_d the target length, T_e the encoder (input) length, D the hidden_size, and H the num_heads; a code sketch follows the list.

Q = decoder_inputs
K, V = encoder_outputs

  1. Q has shape [B, T_d, D], while K and V have shape [B, T_e, D].
  2. Q, K, and V each go through split_heads. Q becomes [B, H, T_d, D//H], and K and V become [B, H, T_e, D//H].
  3. Q = scale(Q)
  4. logits = tf.matmul(Q, K, transpose_b=True), returning shape [B, H, T_d, T_e].
  5. logits = tf.add(logits, bias). This bias is the attention_bias computed at the very beginning (step 1 of the first section), with shape [B, 1, 1, T_e]; the result still has shape [B, H, T_d, T_e].
  6. weights = tf.nn.softmax(logits)
  7. dropout(weights)
  8. attention_output = tf.matmul(weights, V). weights has shape [B, H, T_d, T_e] and V has shape [B, H, T_e, D//H], so the result is [B, H, T_d, D//H].
  9. out = combine(heads), returning shape [B, T_d, D].
  10. dense(out, D), returning shape [B, T_d, D].
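
Here is a NumPy sketch of those steps for all heads at once. split_heads, combine_heads, and encoder_decoder_attention are illustrative names, not the repo's functions, and the dense projections for Q, K, V as well as the final dense layer of step 10 are omitted:

import numpy as np

def split_heads(x, num_heads):
    """[B, T, D] -> [B, H, T, D//H] (step 2)."""
    b, t, d = x.shape
    return x.reshape(b, t, num_heads, d // num_heads).transpose(0, 2, 1, 3)

def combine_heads(x):
    """[B, H, T, D//H] -> [B, T, D] (step 9)."""
    b, h, t, dh = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, t, h * dh)

def encoder_decoder_attention(decoder_inputs, encoder_outputs, attention_bias, num_heads=4):
    """Q comes from the decoder, K and V from the encoder outputs.

    decoder_inputs:  [B, T_d, D]
    encoder_outputs: [B, T_e, D]
    attention_bias:  [B, 1, 1, T_e]  (the padding bias from the encoder side)
    """
    d = decoder_inputs.shape[-1]
    q = split_heads(decoder_inputs, num_heads)                # [B, H, T_d, D//H]
    k = split_heads(encoder_outputs, num_heads)               # [B, H, T_e, D//H]
    v = split_heads(encoder_outputs, num_heads)               # [B, H, T_e, D//H]

    q = q * (d // num_heads) ** -0.5                          # step 3: scale Q by the per-head depth
    logits = q @ k.transpose(0, 1, 3, 2) + attention_bias     # steps 4-5: [B, H, T_d, T_e]
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # step 6: softmax over T_e
    # step 7 (dropout on the weights) is skipped in this sketch
    out = weights @ v                                         # step 8: [B, H, T_d, D//H]
    return combine_heads(out)                                 # step 9: [B, T_d, D]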

5.3 feed_forward

This is the same as in the encoder. A minimal sketch is shown below.
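
For completeness, a minimal NumPy sketch of the position-wise feed-forward computation (the real FFNLayer also handles padding via allow_pad and applies relu_dropout; the weight names here are hypothetical):

import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Sketch: dense up to filter_size, ReLU, dense back down to hidden_size.

    x:  [batch_size, length, hidden_size]
    w1: [hidden_size, filter_size]    b1: [filter_size]
    w2: [filter_size, hidden_size]    b2: [hidden_size]
    """
    hidden = np.maximum(x @ w1 + b1, 0.0)   # ReLU; relu_dropout would be applied here during training
    return hidden @ w2 + b2                 # back to [batch_size, length, hidden_size]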

5.4 norm

This is also the same as in the encoder; a sketch of the layer normalization follows.
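
A NumPy sketch of the layer normalization used by output_norm and by the wrapper shown further below (scale and bias are the learned per-hidden-unit parameters; the function name is illustrative):

import numpy as np

def layer_norm(x, scale, bias, epsilon=1e-6):
    """Normalize over the last (hidden) axis, then apply learned scale and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon) * scale + bias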

5.5 linear

    def linear(self, inputs):
        """
        :param inputs:  a tensor with shape [batch_size, length, hidden_size]
        :return: float32 tensor with shape [batch_size, length, vocab_size]
        """

        with tf.name_scope('pre_softmax_linear'):
            batch_size = tf.shape(inputs)[0]
            length = tf.shape(inputs)[1]

            inputs = tf.reshape(inputs, [-1, self.hidden_size])
            """
                inputs              [batch_size, length, hidden_size]
                shared_weights      [vocab_size, hidden_size]
                transpose           [hidden_size, vocab_size]
                logits              [batch_size, length, vocab_size]
            """
            logits = tf.matmul(inputs, self.shared_weights, transpose_b=True)

            return tf.reshape(logits, [batch_size, length, self.vocab_size])

I won't go into much detail here. The notable point is that shared_weights is the same weight matrix initialized for the embedding lookup (weight tying). The final output is, for each position, the logits over the vocabulary, which become a probability distribution after softmax. A quick sketch follows.
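
A NumPy sketch of that weight-tied projection (the toy shapes and random tensors are made up): the same [vocab_size, hidden_size] matrix used for the embedding lookup is reused, transposed, to project decoder outputs back onto the vocabulary.

import numpy as np

batch_size, length, hidden_size, vocab_size = 2, 5, 8, 100

# shared_weights is the embedding matrix, [vocab_size, hidden_size]
shared_weights = np.random.randn(vocab_size, hidden_size).astype(np.float32)
outputs = np.random.randn(batch_size, length, hidden_size).astype(np.float32)

# reshape to 2-D, multiply by the transposed embedding matrix, reshape back
flat = outputs.reshape(-1, hidden_size)                     # [B*T, D]
logits = flat @ shared_weights.T                            # [B*T, vocab_size]
logits = logits.reshape(batch_size, length, vocab_size)     # [B, T, vocab_size]

print(logits.shape)   # (2, 5, 100); softmax over the last axis gives the vocab distribution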

Finally:
One piece I haven't mentioned yet is the PrePostProcessingWrapper:

class PrePostProcessingWrapper(object):
    """Wrapper class that applies layer pre-processing and post-processing."""

    def __init__(self, layer, params, train):
        self.layer = layer
        self.postprocess_dropout = params["layer_postprocess_dropout"]
        self.train = train

        # Create normalization layer
        self.layer_norm = LayerNormalization(params["hidden_size"])

    def __call__(self, x, *args, **kwargs):
        # Preprocessing: apply layer normalization
        y = self.layer_norm(x)

        # Get layer output
        y = self.layer(y, *args, **kwargs)

        # Postprocessing: apply dropout and residual connection
        if self.train:
            y = tf.nn.dropout(y, 1 - self.postprocess_dropout)
        return x + y

This wrapper:

  1. First applies a layer norm to the input (the same norm described earlier).
  2. Then runs the wrapped layer to get its output.
  3. Applies dropout to that output (during training).
  4. Finally adds the original input back, i.e. a residual connection.

This wrapping is applied around the input and output of every sub-layer.

Thank you for reading.

For more code, please head over to my personal GitHub, which is updated from time to time.
Feel free to follow.
