Deep Learning 12. Transformer Code Walkthrough: The Encoder

Latest recommended article published on 2024-04-24 17:22:14

adowu

Category: Models. Tags: bert, transformer

Article link: https://blog.csdn.net/WUUUSHAO/article/details/88634941

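A word before we begin:
Start from the basics, keep learning, and persevere. Keep going!
A programmer from Mars who loves life and technology

BERT series:

Using the transformer code on my GitHub (address at the end of this post), let's go through the code and its logic in detail.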

    def __call__(self, feature, targets=None):
        initializer = tf.variance_scaling_initializer(
            scale=self.params.get('initializer_gain'),
            mode='fan_avg',
            distribution='uniform'
        )

        with tf.variable_scope('transformer', initializer=initializer):
            #   [batch_size, 1, 1, length]
            attention_bias = model_utils.get_padding_bias(feature)

            encoder_outputs = self.encode(feature, attention_bias)

            if targets is None:
                return self.predict(encoder_outputs, attention_bias)

            logits = self.decode(targets, encoder_outputs, attention_bias)
            return logits

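This is the main entry point, the __call__() method. (The comments were removed from the code to keep it short.) The first thing it does is call get_padding_bias():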

def get_padding_bias(x):
    with tf.name_scope("attention_bias"):
        padding = get_padding(x)
        attention_bias = padding * _NEG_INF
        attention_bias = tf.expand_dims(
            tf.expand_dims(attention_bias, axis=1), axis=1)
    return attention_bias

get_padding_bias() relies on get_padding(), shown below, to obtain attention_bias:

def get_padding(x, padding_value=0):
    with tf.name_scope("padding"):
        return tf.to_float(tf.equal(x, padding_value))

x has shape [batch_size, sequence_length] and has already been padded. get_padding() tells us which positions are padding and which are non-padding. It returns a tensor of shape [batch_size, sequence_length], where 0 -> non-padding and 1 -> padding.
_NEG_INF = -1e9. As the name suggests, the padding bias is a bias added at the padding positions, and -1e9 is the value used. Once this bias is added to the attention logits, softmax assigns the padding positions a weight of effectively zero.
The final returned shape is [batch_size, 1, 1, sequence_length].
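
To make the shapes concrete, here is a minimal NumPy sketch of the same two helpers. It is a re-implementation for illustration only, not the TensorFlow code from the repository; the example token ids are made up, and 0 is assumed to be the padding id as in the code above.

```python
import numpy as np

NEG_INF = -1e9  # same value as _NEG_INF in the post


def get_padding(x, padding_value=0):
    """Return 1.0 where x equals padding_value, else 0.0. Shape [batch, length]."""
    return (x == padding_value).astype(np.float32)


def get_padding_bias(x):
    """Return a bias of shape [batch, 1, 1, length]: -1e9 at padding, 0 elsewhere."""
    bias = get_padding(x) * NEG_INF
    # Two extra axes so the bias broadcasts over heads and query positions.
    return bias[:, np.newaxis, np.newaxis, :]


# Hypothetical batch of two padded sequences (0 is the padding id).
token_ids = np.array([[4, 9, 2, 0, 0],
                      [7, 3, 0, 0, 0]])
padding = get_padding(token_ids)
bias = get_padding_bias(token_ids)
print(padding)
print(bias.shape)
```

The two singleton axes are what allow the bias to be added directly to attention logits of shape [batch_size, num_heads, length, length] later on.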

Step 1: encoder

    def encode(self, inputs, attention_bias):
        with tf.name_scope('encode'):
            #   [batch_size, length, hidden_size]
            embedded_inputs = self.embedding_layer(inputs)
            #   [batch_size, length]
            inputs_padding = model_utils.get_padding(inputs)
            with tf.name_scope('add_pos_embedding'):
                length = tf.shape(embedded_inputs)[1]
                #   use sin cos calculate position embeddings
                pos_encoding = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
                encoder_inputs = tf.add(embedded_inputs, pos_encoding)
            if self.train:
                encoder_inputs = tf.nn.dropout(encoder_inputs, 1 - self.params.get('encoder_decoder_dropout'))
            return self.encoder_stack(encoder_inputs, attention_bias, inputs_padding)

OK, let's go through it step by step!

1.1 embedding_layer

The main implementation is as follows:

    def call(self, inputs, **kwargs):
        with tf.name_scope('embedding'):
            mask = tf.to_float(tf.not_equal(inputs, 0))
            embeddings = tf.gather(self.shared_weights, inputs)
            embeddings *= tf.expand_dims(mask, -1)
            embeddings *= self.hidden_size ** 0.5
            return embeddings

This code is fairly easy to follow. The mask sets the embeddings at padding positions to 0. The embeddings are then scaled by the square root of hidden_size. The returned shape is [batch_size, sequence_length, hidden_size].
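
The lookup, mask, and scale steps can be sketched in NumPy as follows. This is an illustrative stand-in, not the layer from the repository: the function name embed and the all-ones weight table are assumptions chosen so the result is easy to check by eye.

```python
import numpy as np


def embed(token_ids, shared_weights):
    """Look up embeddings, zero out padding (id 0), and scale by sqrt(hidden_size)."""
    hidden_size = shared_weights.shape[1]
    mask = (token_ids != 0).astype(shared_weights.dtype)   # [batch, length]
    embeddings = shared_weights[token_ids]                  # [batch, length, hidden]
    embeddings = embeddings * mask[..., np.newaxis]         # padding rows become 0
    return embeddings * hidden_size ** 0.5


# Hypothetical weight table: 10 tokens, hidden_size 4, all ones.
shared_weights = np.ones((10, 4), dtype=np.float32)
token_ids = np.array([[3, 5, 0]])
embedded = embed(token_ids, shared_weights)
print(embedded.shape)
print(embedded[0])
```

With hidden_size = 4 the scale factor is 2, so the two real tokens map to rows of 2.0 and the padding position stays at 0.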

1.2 get_padding

def get_padding(x, padding_value=0):
    with tf.name_scope("padding"):
        return tf.to_float(tf.equal(x, padding_value))

This is the same logic used for attention_bias above. Here it returns a tensor of shape [batch_size, sequence_length], where 0 -> non-padding and 1 -> padding.

1.3 get_position_encoding

def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    position = tf.to_float(tf.range(length))
    num_timescales = hidden_size // 2
    log_timescale_increment = (
            math.log(float(max_timescale) / float(min_timescale)) / (tf.to_float(num_timescales) - 1))
    inv_timescales = min_timescale * tf.exp(tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)

    signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
    return signal

This computes sin and cos values that serve as the position encoding for each position along sequence_length. The returned shape is [sequence_length, hidden_size] (for an even hidden_size, since the sin half and the cos half each have hidden_size // 2 columns). It is then simply added to the embedding output, broadcasting over the batch dimension, which gives a shape of [batch_size, sequence_length, hidden_size]. A dropout layer follows, and the result is passed into encoder_stack.
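
For readers who want to inspect the values, here is a NumPy sketch of the same computation. It mirrors the TensorFlow function line by line but is not the repository code; the length and hidden_size in the usage lines are arbitrary, and hidden_size is assumed to be even and at least 4.

```python
import math

import numpy as np


def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    """Return sinusoidal position encodings of shape [length, hidden_size]."""
    position = np.arange(length, dtype=np.float32)
    num_timescales = hidden_size // 2
    # Geometric progression of wavelengths between min_timescale and max_timescale.
    log_timescale_increment = (
        math.log(float(max_timescale) / float(min_timescale)) / (num_timescales - 1))
    inv_timescales = min_timescale * np.exp(
        np.arange(num_timescales, dtype=np.float32) * -log_timescale_increment)
    scaled_time = position[:, np.newaxis] * inv_timescales[np.newaxis, :]
    # First half of the columns is sin, second half is cos.
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)


signal = get_position_encoding(length=6, hidden_size=8)
print(signal.shape)
print(signal[0])
```

At position 0 every sin column is 0 and every cos column is 1, which is a quick sanity check on the layout.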

1.4 encoder_stack

class EncoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(EncoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )
            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )
            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )
        self.output_norm = LayerNormalization(self.params.get('hidden_size'))

The structure is simple: each of the num_blocks blocks is a self_attention layer plus a feed_forward layer, and a norm layer is applied to the final output.

    def call(self, encoder_inputs, attention_bias, inputs_padding):
        """
        :param encoder_inputs: [batch_size, input_length, hidden_size]
        :param attention_bias: [batch_size, 1, 1, inputs_length]
        :param inputs_padding: [batch_size, length]
        :return: [batch_size, input_length, hidden_size]
        """
        for n, layer in enumerate(self.layers):
            self_attention_layer = layer[0]
            ffn_layer = layer[1]
            with tf.variable_scope('encoder_stack_lay_{}'.format(n)):
                with tf.variable_scope('self_attention'):
                    encoder_inputs = self_attention_layer(encoder_inputs, attention_bias)
                with tf.variable_scope('ffn'):
                    encoder_inputs = ffn_layer(encoder_inputs, inputs_padding)
        return self.output_norm(encoder_inputs)

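The input parameters of call() are documented in the docstring in the code.

1.4.1 self_attention

The self_attention layer is explained in detail through step-by-step comments in the GitHub code, which are quite clear, so I won't go into depth here. The main flow is: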

Q, K, V = encoder_inputs, each with shape [B, T, D] (B is batch_size, T is the sequence length, D is hidden_size; the shorthand keeps things simple).
Apply split_head to Q, K, and V separately. Each then has shape [B, H, T, D // H], where H is num_heads.
Q = scale(Q)
logits = tf.matmul(Q, K, transpose_b=True), with shape [B, H, T, T]
logits = tf.add(logits, bias), where bias is the attention_bias computed at the very beginning
weights = tf.nn.softmax(logits)
dropout(weights)
attention_output = tf.matmul(weights, V), with shape [B, H, T, D // H]
out = combine(heads), with shape [B, T, D]
dense(out, D), with shape [B, T, D]

That completes one pass through self-attention.
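
The steps listed above can be sketched in NumPy as follows. This is only an illustration under several assumptions, not the repository's SelfAttention class: Q, K, and V are taken to be encoder_inputs directly, as in the list; the scale is assumed to be (D // H) ** -0.5, the usual choice; dropout is skipped, as at inference time; and w_out is a hypothetical weight matrix standing in for the final dense layer.

```python
import numpy as np

NEG_INF = -1e9


def split_heads(x, num_heads):
    """[B, T, D] -> [B, H, T, D // H]"""
    b, t, d = x.shape
    return x.reshape(b, t, num_heads, d // num_heads).transpose(0, 2, 1, 3)


def combine_heads(x):
    """[B, H, T, D // H] -> [B, T, D]"""
    b, h, t, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, t, h * depth)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(x, bias, num_heads, w_out):
    depth = x.shape[-1] // num_heads
    q = split_heads(x, num_heads)
    k = split_heads(x, num_heads)
    v = split_heads(x, num_heads)
    q = q * depth ** -0.5                      # assumed scaling factor
    logits = q @ k.transpose(0, 1, 3, 2)       # [B, H, T, T]
    logits = logits + bias                     # bias broadcasts from [B, 1, 1, T]
    weights = softmax(logits)
    out = combine_heads(weights @ v)           # [B, T, D]
    return out @ w_out, weights                # final dense projection


# Hypothetical input: one sequence of length 4 whose last two tokens are padding.
ids = np.array([[5, 7, 0, 0]])
bias = (ids == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :] * NEG_INF
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8)).astype(np.float32)
w_out = rng.normal(size=(8, 8)).astype(np.float32)
out, weights = self_attention(x, bias, num_heads=2, w_out=w_out)
print(out.shape, weights.shape)
```

Looking at weights confirms the role of attention_bias: the columns for the two padding positions receive essentially zero attention from every query.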

1.4.2 feed_forward

    def call(self, inputs, padding=None):
        padding = padding if self.allow_pad else None
        batch_size = tf.shape(inputs)[0]
        length = tf.shape(inputs)[1]
        if padding is not None:
            with tf.name_scope('remove_padding'):
                pad_mask = tf.reshape(padding, [-1])         
                non_pad_ids = tf.to_int32(tf.where(pad_mask < 1e-9))
                inputs = tf.reshape(inputs, [-1, self.hidden_size])             
                inputs = tf.gather_nd(params=inputs, indices=non_pad_ids)
                inputs.set_shape([None, self.hidden_size])
                inputs = tf.expand_dims(inputs, axis=0)
        outputs = self.filter_layer(inputs)
        if self.train:
            outputs = tf.nn.dropout(outputs, 1.0 - self.relu_dropout)
        outputs = self.output_layer(outputs)
        if padding is not None:
            with tf.name_scope('re_add_padding'):
                outputs = tf.squeeze(outputs, axis=0)
                outputs = tf.scatter_nd(
                    indices=non_pad_ids,
                    updates=outputs,
                    shape=[batch_size * length, self.hidden_size]
                )
                outputs = tf.reshape(outputs, [batch_size, length, self.hidden_size])
        return outputs

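The padding here is the result computed by get_padding() in 1.2 above. When allow_pad is enabled, the padding positions are removed before the two dense layers run and are put back as zeros afterwards, so the feed-forward computation only touches real tokens.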
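
The remove-padding and re-add-padding trick can be sketched in NumPy like this. It is an illustration, not the FFNLayer itself: the function name ffn_without_padding is made up, and the ffn argument is a placeholder for the two dense layers (here a trivial function that adds 1 to every element).

```python
import numpy as np


def ffn_without_padding(inputs, padding, ffn):
    """Apply ffn to non-padding rows only, then scatter results back (padding -> 0)."""
    batch_size, length, hidden_size = inputs.shape
    pad_mask = padding.reshape(-1)                       # [batch * length]
    non_pad_ids = np.where(pad_mask < 1e-9)[0]           # indices of real tokens
    flat = inputs.reshape(-1, hidden_size)               # [batch * length, hidden]
    processed = ffn(flat[non_pad_ids])                   # only real tokens are computed
    outputs = np.zeros((batch_size * length, hidden_size), dtype=processed.dtype)
    outputs[non_pad_ids] = processed                     # plays the role of tf.scatter_nd
    return outputs.reshape(batch_size, length, hidden_size)


inputs = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
padding = np.array([[0, 0, 1],
                    [0, 1, 1]], dtype=np.float32)        # 1 marks padding
result = ffn_without_padding(inputs, padding, lambda h: h + 1.0)
print(result)
```

The saving comes from the middle step: with heavily padded batches, the dense layers see far fewer rows than batch_size * length.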

1.4.3 norm

    def call(self, x, epsilon=1e-6):
        mean = tf.reduce_mean(x, axis=[-1], keepdims=True)
        variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)
        norm_x = (x - mean) * tf.rsqrt(variance + epsilon)
        return norm_x * self.scale + self.bias
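
As a quick check of what this layer does, here is a NumPy sketch of the same formula. It is a standalone function for illustration, with scale and bias passed in explicitly rather than stored as learned variables; the input values are arbitrary.

```python
import numpy as np


def layer_norm(x, scale, bias, epsilon=1e-6):
    """Normalize over the last axis to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = np.square(x - mean).mean(axis=-1, keepdims=True)
    norm_x = (x - mean) / np.sqrt(variance + epsilon)
    return norm_x * scale + bias


x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 30.0]])
normed = layer_norm(x, scale=np.ones(4), bias=np.zeros(4))
print(normed.mean(axis=-1))
print(normed.var(axis=-1))
```

With scale set to ones and bias to zeros, each row of the output has mean close to 0 and variance close to 1, regardless of the magnitude of the input row.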