A few words to start:
Build from the basics, keep learning, keep at it, and good luck.
From a life-loving, tech-loving programmer doggo from Mars
bert series:
- bert corpus generation
- bert loss analysis
- bert transformer detailed analysis: encoder
- bert transformer detailed analysis: decoder
Without further ado, let's get straight into today's main content.
def decode(self, targets, encoder_outputs, attention_bias):
    """
    :param targets: [batch_size, target_length]
    :param encoder_outputs: [batch_size, input_length, hidden_size]
    :param attention_bias: [batch_size, 1, 1, input_length]
    :return: [batch_size, target_length, vocab_size]
    """
    with tf.name_scope('decode'):
        # [batch_size, target_length, hidden_size]
        decoder_inputs = self.embedding_layer(targets)
        with tf.name_scope('shift_targets'):
            # pad embedding value 0 at the head of sequence and remove eos_id
            decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
        with tf.name_scope('add_pos_embedding'):
            length = tf.shape(decoder_inputs)[1]
            position_decode = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
            decoder_inputs = tf.add(decoder_inputs, position_decode)
        if self.train:
            decoder_inputs = tf.nn.dropout(decoder_inputs, 1. - self.params.get('encoder_decoder_dropout'))
        decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(length)
        outputs = self.decoder_stack(
            decoder_inputs,
            encoder_outputs,
            decoder_self_attention_bias,
            attention_bias
        )
        # [batch_size, target_length, vocab_size]
        logits = self.embedding_layer.linear(outputs)
        return logits
The shape of each input parameter is already documented in detail in the code comments. OK, let's walk through the code step by step.
1. embedding_layer
This is the same as in the encoder, so I won't explain it again here; if you have questions, see the previous post. The returned shape is [batch_size, sequence_length, hidden_size].
2. pad
decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
This one is fairly easy to follow. The input has shape [batch_size, sequence_length, hidden_size], i.e. rank 3. The first dimension is not padded and neither is the last; only the middle (time) dimension gets padded, and only at the front, not at the back. So after padding the shape is [batch_size, sequence_length + 1, hidden_size]. Then the last position along the second dimension (which holds the [EOS] marker) is sliced off, so the shape is back to [batch_size, sequence_length, hidden_size]. The net effect is shifting the targets right by one step.
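To see the shift concretely, here is a tiny standalone demo (toy numbers, not from the repo):
import tensorflow as tf

# toy batch: batch_size=1, sequence_length=3, hidden_size=2
x = tf.constant([[[1., 1.], [2., 2.], [3., 3.]]])
# pad one step of zeros at the front of the time axis, then drop the last step
shifted = tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
# shifted == [[[0., 0.], [1., 1.], [2., 2.]]] -- the whole sequence moved right by one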
3. get_position_encoding
This step is also the same as in the previous post, so I won't go into detail. It returns a tensor of shape [sequence_length, hidden_size], which is then simply added to the embedding output (broadcast over the batch dimension), so the result still has shape [batch_size, sequence_length, hidden_size]. A dropout layer is applied afterwards.
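For reference, the sinusoidal encoding typically looks like the following (a sketch based on the standard Transformer formulation; the exact model_utils code may differ in details):
import math
import tensorflow as tf

def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    # one sin/cos pair per timescale, hidden_size // 2 timescales in total
    position = tf.cast(tf.range(length), tf.float32)
    num_timescales = hidden_size // 2
    log_timescale_increment = math.log(max_timescale / min_timescale) / (num_timescales - 1)
    inv_timescales = min_timescale * tf.exp(
        tf.cast(tf.range(num_timescales), tf.float32) * -log_timescale_increment)
    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
    # [length, hidden_size]
    return tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)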
4. get_decoder_self_attention_bias
def get_decoder_self_attention_bias(length):
    with tf.name_scope("decoder_self_attention_bias"):
        # lower-triangular matrix of ones: position i may attend to positions <= i
        valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
        valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
        # invalid (future) locations get a large negative bias (_NEG_INF = -1e9)
        decoder_bias = _NEG_INF * (1.0 - valid_locs)
        return decoder_bias
tf.matrix_band_part(..., -1, 0) keeps the lower triangular part, just like this:
[[1. 0. 0. 0. 0.]
[1. 1. 0. 0. 0.]
[1. 1. 1. 0. 0.]
[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 1.]]
The final output looks like the following; the -1e9 values now fill the strictly upper triangular part:
tf.Tensor(
[[[[-0.e+00 -1.e+09 -1.e+09 -1.e+09 -1.e+09]
[-0.e+00 -0.e+00 -1.e+09 -1.e+09 -1.e+09]
[-0.e+00 -0.e+00 -0.e+00 -1.e+09 -1.e+09]
[-0.e+00 -0.e+00 -0.e+00 -0.e+00 -1.e+09]
[-0.e+00 -0.e+00 -0.e+00 -0.e+00 -0.e+00]]]], shape=(1, 1, 5, 5), dtype=float32)
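Why -1e9? Because this bias is added to the attention logits before the softmax, so future positions end up with weight ~0. A quick demo, reusing the function above (note: tf.matrix_band_part is called tf.linalg.band_part in TF2):
import tensorflow as tf

logits = tf.zeros([1, 1, 5, 5])                       # pretend attention logits
weights = tf.nn.softmax(logits + get_decoder_self_attention_bias(5))
# row i spreads its weight uniformly over positions 0..i and gives ~0 to the future:
# weights[0, 0, 0] ~ [1.   0.   0.   0.   0. ]
# weights[0, 0, 4] ~ [0.2  0.2  0.2  0.2  0.2]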
5. decoder_stack
OK, let's take a look at the decoder_stack.
class DecoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(DecoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            # masked self-attention over the decoder inputs
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )
            # encoder-decoder ("vanilla") attention
            vanilla_attention_layer = AttentionLayer(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )
            # position-wise feed-forward network
            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )
            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(vanilla_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )
        self.output_norm = LayerNormalization(self.params.get('hidden_size'))
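The class above only shows __init__; the forward pass is not shown. Based on how decode calls it, the call method should look roughly like this (a sketch, not necessarily the repo's exact code):
def call(self, decoder_inputs, encoder_outputs, decoder_self_attention_bias, attention_bias):
    for self_attention_layer, vanilla_attention_layer, ffn_layer in self.layers:
        # masked self-attention: the bias hides future positions
        decoder_inputs = self_attention_layer(decoder_inputs, decoder_self_attention_bias)
        # encoder-decoder attention: the bias hides encoder padding
        decoder_inputs = vanilla_attention_layer(decoder_inputs, encoder_outputs, attention_bias)
        # position-wise feed-forward
        decoder_inputs = ffn_layer(decoder_inputs)
    return self.output_norm(decoder_inputs)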
5.1 self_attention
This part is exactly the same as the encoder's self_attention: Q, K, V = decoder_inputs, and the computation proceeds in exactly the same way. The only difference is that the bias is produced by get_decoder_self_attention_bias, so every position can only attend to itself and the positions before it.
5.2 vanilla_attention
What makes this attention different is that it is the decoder attending to the encoder. This is arguably the crucial attention: it is what aligns the decoder with the encoder.
Q = decoder_inputs
K, V = encoder_outputs
- Q has shape [B, T_d, D], while K and V have shape [B, T_e, D]
- split_heads is applied to Q, K and V separately: Q's shape becomes [B, H, T_d, D//H], and K's and V's become [B, H, T_e, D//H], where H is num_heads
- Q = scale(Q)
- logits = tf.matmul(Q, K, transpose_b=True), returning shape [B, H, T_d, T_e]
- logits = tf.add(logits, bias); this bias is the attention_bias computed in the very first step of the first section, with shape [B, 1, 1, T_e]. The result still has shape [B, H, T_d, T_e]
- weights = tf.nn.softmax(logits)
- dropout(weights)
- attention_output = tf.matmul(weights, V): weights has shape [B, H, T_d, T_e] and V has shape [B, H, T_e, D//H], so the result is [B, H, T_d, D//H]
- out = combine(heads), returning shape [B, T_d, D]
- dense(out, D), returning shape [B, T_d, D]
The whole pipeline is sketched in code right after this list.
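Putting those steps together, here is a minimal TF2-style sketch of the multi-head encoder-decoder attention. The dense projections of Q/K/V and the final output projection are omitted for brevity, and the function names are illustrative, not the repo's exact API:
import tensorflow as tf

def split_heads(x, num_heads):
    # [B, T, D] -> [B, H, T, D//H]
    batch = tf.shape(x)[0]
    length = tf.shape(x)[1]
    depth = x.shape[-1] // num_heads
    x = tf.reshape(x, [batch, length, num_heads, depth])
    return tf.transpose(x, [0, 2, 1, 3])

def combine_heads(x):
    # [B, H, T, D//H] -> [B, T, D]
    batch = tf.shape(x)[0]
    length = tf.shape(x)[2]
    hidden = x.shape[1] * x.shape[-1]
    return tf.reshape(tf.transpose(x, [0, 2, 1, 3]), [batch, length, hidden])

def vanilla_attention(decoder_inputs, encoder_outputs, bias, num_heads):
    # Q comes from the decoder, K and V from the encoder
    q = split_heads(decoder_inputs, num_heads)      # [B, H, T_d, D//H]
    k = split_heads(encoder_outputs, num_heads)     # [B, H, T_e, D//H]
    v = split_heads(encoder_outputs, num_heads)     # [B, H, T_e, D//H]
    q *= q.shape[-1] ** -0.5                        # scale
    logits = tf.matmul(q, k, transpose_b=True)      # [B, H, T_d, T_e]
    logits += bias                                  # bias [B, 1, 1, T_e] broadcasts
    weights = tf.nn.softmax(logits)                 # attention distribution
    return combine_heads(tf.matmul(weights, v))     # [B, T_d, D]
In the real layer, q, k and v each go through their own dense projection first, and a final dense layer maps the combined heads back to D, matching the last bullet above.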
5.3 feed_forward
Same as the encoder's.
5.4 norm
Same as the encoder's.
5.5 linear
def linear(self, inputs):
    """
    :param inputs: a tensor with shape [batch_size, length, hidden_size]
    :return: float32 tensor with shape [batch_size, length, vocab_size]
    """
    with tf.name_scope('pre_softmax_linear'):
        batch_size = tf.shape(inputs)[0]
        length = tf.shape(inputs)[1]
        # flatten to [batch_size * length, hidden_size]
        inputs = tf.reshape(inputs, [-1, self.hidden_size])
        # shared_weights: [vocab_size, hidden_size]
        # transposed:     [hidden_size, vocab_size]
        # logits:         [batch_size * length, vocab_size]
        logits = tf.matmul(inputs, self.shared_weights, transpose_b=True)
        return tf.reshape(logits, [batch_size, length, self.vocab_size])
I won't go into detail on this one. What is worth noting is that shared_weights is the matrix initialized for the embedding layer, i.e. the embedding and the pre-softmax projection share the same weights. The final output is the logits over the vocabulary at each position (a softmax turns them into a probability distribution).
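To make the weight sharing concrete, here is a tiny sketch (toy sizes; variable names are illustrative):
import tensorflow as tf

vocab_size, hidden_size = 8, 4
shared_weights = tf.Variable(tf.random.normal([vocab_size, hidden_size]))

# embedding lookup: ids -> vectors, using shared_weights as the table
ids = tf.constant([[1, 5, 2]])                       # [batch_size, length]
embedded = tf.gather(shared_weights, ids)            # [batch_size, length, hidden_size]

# pre-softmax projection: the SAME matrix, transposed, maps back to vocab logits
flat = tf.reshape(embedded, [-1, hidden_size])       # [batch_size * length, hidden_size]
logits = tf.matmul(flat, shared_weights, transpose_b=True)
logits = tf.reshape(logits, [1, 3, vocab_size])      # [batch_size, length, vocab_size]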
Finally:
One thing I haven't mentioned so far:
class PrePostProcessingWrapper(object):
    """Wrapper class that applies layer pre-processing and post-processing."""

    def __init__(self, layer, params, train):
        self.layer = layer
        self.postprocess_dropout = params["layer_postprocess_dropout"]
        self.train = train
        # Create normalization layer
        self.layer_norm = LayerNormalization(params["hidden_size"])

    def __call__(self, x, *args, **kwargs):
        # Preprocessing: apply layer normalization
        y = self.layer_norm(x)
        # Get layer output
        y = self.layer(y, *args, **kwargs)
        # Postprocessing: apply dropout and residual connection
        if self.train:
            y = tf.nn.dropout(y, 1 - self.postprocess_dropout)
        return x + y
This wrapper:
- first applies a norm to the input, the same norm as described earlier;
- then gets the layer's output;
- applies dropout to the result;
- and finally adds the input back in, forming a residual connection.
This wrapping is applied to the input and output of every single layer.
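In one line, every wrapped sublayer computes output = x + dropout(sublayer(layer_norm(x))), i.e. the pre-norm residual form of the Transformer block.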
Thanks!
For more code, head over to my personal github, which is updated from time to time.
Feel free to follow.