deep learning 12. A detailed walkthrough of the Transformer code: the encoder

Opening words:
Start from the basics, keep learning, and keep at it.
A programmer from Mars who loves life and technology.

BERT series:

  1. BERT corpus generation
  2. BERT loss analysis
  3. BERT Transformer detailed walkthrough: encoder
  4. BERT Transformer detailed walkthrough: decoder

Working from the Transformer code in my own GitHub repository (link at the end of this article), this post analyzes the code and the logic in detail.

    def __call__(self, feature, targets=None):
        initializer = tf.variance_scaling_initializer(
            scale=self.params.get('initializer_gain'),
            mode='fan_avg',
            distribution='uniform'
        )

        with tf.variable_scope('transformer', initializer=initializer):
            #   [batch_size, 1, 1, length]
            attention_bias = model_utils.get_padding_bias(feature)

            encoder_outputs = self.encode(feature, attention_bias)

            if targets is None:
                return self.predict(encoder_outputs, attention_bias)

            logits = self.decode(targets, encoder_outputs, attention_bias)
            return logits

This is the main __call__() entry point. (The inline comments were stripped to keep the code short.)

def get_padding_bias(x):
    with tf.name_scope("attention_bias"):
        padding = get_padding(x)
        attention_bias = padding * _NEG_INF
        attention_bias = tf.expand_dims(
            tf.expand_dims(attention_bias, axis=1), axis=1)
    return attention_bias

The get_padding() method it relies on is shown below; the overall goal here is to obtain the attention_bias.

def get_padding(x, padding_value=0):
    with tf.name_scope("padding"):
        return tf.to_float(tf.equal(x, padding_value))

x has shape [batch_size, sequence_length] and is data that has already been padded. After this method, you know which positions are padding and which are non-padding. The returned shape is [batch_size, sequence_length], where 0 -> non-padding and 1 -> padding.
_NEG_INF = -1e9. The padding bias is simply a bias added to the padding positions, and this is the value assigned to them.
The final returned shape is [batch_size, 1, 1, sequence_length].
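
For intuition, here is a minimal NumPy sketch (not the repo's TF code) of what get_padding plus get_padding_bias compute on a toy batch:

import numpy as np

_NEG_INF = -1e9

# A toy batch of already-padded token ids; 0 marks padding.
x = np.array([[5, 3, 9, 0, 0],
              [7, 2, 0, 0, 0]])

padding = (x == 0).astype(np.float32)              # [batch_size, length], 1 -> padding
attention_bias = padding * _NEG_INF                # padded positions get a huge negative bias
attention_bias = attention_bias[:, None, None, :]  # [batch_size, 1, 1, length]

print(attention_bias.shape)                        # (2, 1, 1, 5)

This bias is later added to the attention logits, so after the softmax the padded positions receive (almost) zero attention weight.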

Step 1: encoder

    def encode(self, inputs, attention_bias):
        with tf.name_scope('encode'):
            #   [batch_size, length, hidden_size]
            embedded_inputs = self.embedding_layer(inputs)
            #   [batch_size, length]
            inputs_padding = model_utils.get_padding(inputs)
            with tf.name_scope('add_pos_embedding'):
                length = tf.shape(embedded_inputs)[1]
                #   use sin cos calculate position embeddings
                pos_encoding = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
                encoder_inputs = tf.add(embedded_inputs, pos_encoding)
            if self.train:
                encoder_inputs = tf.nn.dropout(encoder_inputs, 1 - self.params.get('encoder_decoder_dropout'))
            return self.encoder_stack(encoder_inputs, attention_bias, inputs_padding)

OK, let's go through it step by step!

1.1 embedding_layer

The main implementation is as follows:

    def call(self, inputs, **kwargs):
        with tf.name_scope('embedding'):
            mask = tf.to_float(tf.not_equal(inputs, 0))
            embeddings = tf.gather(self.shared_weights, inputs)
            embeddings *= tf.expand_dims(mask, -1)
            embeddings *= self.hidden_size ** 0.5
            return embeddings

This code is fairly straightforward to understand. The mask zeroes out the padded positions, and the embeddings are then scaled by sqrt(hidden_size). The returned shape is [batch_size, sequence_length, hidden_size].
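
The same three steps (gather, mask, scale) in NumPy, with a made-up vocabulary size, purely as an illustration:

import numpy as np

vocab_size, hidden_size = 10, 4
shared_weights = np.random.randn(vocab_size, hidden_size).astype(np.float32)

inputs = np.array([[5, 3, 0, 0]])             # [batch_size, length], 0 is padding
mask = (inputs != 0).astype(np.float32)       # 1 -> real token, 0 -> padding
embeddings = shared_weights[inputs]           # gather: [batch_size, length, hidden_size]
embeddings *= mask[..., None]                 # zero out the padded positions
embeddings *= hidden_size ** 0.5              # scale by sqrt(hidden_size)

print(embeddings.shape)                       # (1, 4, 4)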

1.2 get_padding

def get_padding(x, padding_value=0):
    with tf.name_scope("padding"):
        return tf.to_float(tf.equal(x, padding_value))

Same logic as the attention_bias computation above. The return here is [batch_size, sequence_length], where 0 -> non-padding and 1 -> padding.

1.3 get_position_encoding

def get_position_encoding( length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    position = tf.to_float(tf.range(length))
    num_timescales = hidden_size // 2
    log_timescale_increment = (
            math.log(float(max_timescale) / float(min_timescale)) / (tf.to_float(num_timescales) - 1))
    inv_timescales = min_timescale * tf.exp(tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)

    signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
    return signal

It computes sin and cos values as the position encoding over the sequence length. The returned shape is [sequence_length, hidden_size]. This is then added element-wise to the embedding output, giving a final shape of [batch_size, sequence_length, hidden_size]. A dropout layer follows, and the result is fed into the encoder_stack.
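
A NumPy version of the same computation, just to confirm the output shape (a sketch, not code from the repo):

import math
import numpy as np

def position_encoding_np(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    # Mirrors get_position_encoding() above: half sin channels, half cos channels.
    position = np.arange(length, dtype=np.float32)
    num_timescales = hidden_size // 2
    log_timescale_increment = (
        math.log(float(max_timescale) / float(min_timescale)) / (num_timescales - 1))
    inv_timescales = min_timescale * np.exp(
        np.arange(num_timescales, dtype=np.float32) * -log_timescale_increment)
    scaled_time = position[:, None] * inv_timescales[None, :]
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

print(position_encoding_np(50, 512).shape)    # (50, 512); broadcast-added to [batch, 50, 512]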

1.4 encoder_stack

class EncoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(EncoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )
            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )
            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )
        self.output_norm = LayerNormalization(self.params.get('hidden_size'))

The structure is simple: a self_attention layer + a feed_forward layer + a norm layer.
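
The keys EncoderStack reads from params are all visible in the constructor above. A hypothetical config might look like this (the values are illustrative, not the repo's defaults, and PrePostProcessingWrapper may read additional keys not shown here):

# Illustrative values only -- the real defaults live in the repo's config.
params = {
    'num_blocks': 6,          # number of (self-attention + FFN) blocks
    'hidden_size': 512,       # model dimension D
    'num_heads': 8,           # attention heads H
    'attention_dropout': 0.1,
    'filter_size': 2048,      # inner dimension of the feed-forward layer
    'relu_dropout': 0.1,
    'allow_ffn_pad': True,    # enable the pad-removal trick in the FFN layer
}

encoder_stack = EncoderStack(params, train=True)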

    def call(self, encoder_inputs, attention_bias, inputs_padding):
        """
        :param encoder_inputs: [batch_size, input_length, hidden_size]
        :param attention_bias: [batch_size, 1, 1, inputs_length]
        :param inputs_padding: [batch_size, length]
        :return: [batch_size, input_length, hidden_size]
        """
        for n, layer in enumerate(self.layers):
            self_attention_layer = layer[0]
            ffn_layer = layer[1]
            with tf.variable_scope('encoder_stack_lay_{}'.format(n)):
                with tf.variable_scope('self_attention'):
                    encoder_inputs = self_attention_layer(encoder_inputs, attention_bias)
                with tf.variable_scope('ffn'):
                    encoder_inputs = ffn_layer(encoder_inputs, inputs_padding)
        return self.output_norm(encoder_inputs)

The input parameters of call() are documented in the docstring above.

1.4.1 self_attention

A detailed, step-by-step explanation of the self_attention layer has already been added as comments on GitHub, so I won't repeat it here. The main flow is as follows (a NumPy shape sketch follows below the list):

  1. Q, K, V = encoder_inputs, each with shape [B, T, D] (using this shorthand to keep things simple).
  2. Q, K, V each go through the split_heads operation, giving shape [B, H, T, D//H], where H is num_heads.
  3. Q = scale(Q)
  4. logits = tf.matmul(Q, K, transpose_b=True), returning shape [B, H, T, T]
  5. logits = tf.add(logits, bias), where bias is the attention_bias computed at the very beginning
  6. weights = tf.nn.softmax(logits)
  7. dropout(weights)
  8. attention_output = tf.matmul(weights, V), returning shape [B, H, T, D//H]
  9. out = combine(heads), returning shape [B, T, D]
  10. dense(out, D), returning shape [B, T, D]

That completes one pass of self-attention.
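
To make the shapes concrete, here is a rough NumPy sketch of steps 1 through 9. It ignores the learned Q/K/V and output projections and the dropout, so it only illustrates the tensor shapes, not the repo's SelfAttention layer:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

B, T, D, H = 2, 5, 8, 2                            # batch, length, hidden, heads
x = np.random.randn(B, T, D).astype(np.float32)    # encoder_inputs
bias = np.zeros((B, 1, 1, T), dtype=np.float32)    # attention_bias (all non-padding here)

def split_heads(t):                                # step 2: [B, T, D] -> [B, H, T, D//H]
    return t.reshape(B, T, H, D // H).transpose(0, 2, 1, 3)

q = split_heads(x) * (D // H) ** -0.5              # steps 1-3 (projections omitted), scale Q
k, v = split_heads(x), split_heads(x)
logits = q @ k.transpose(0, 1, 3, 2) + bias        # steps 4-5: [B, H, T, T]
weights = softmax(logits)                          # step 6 (step 7 dropout omitted)
out = weights @ v                                  # step 8: [B, H, T, D//H]
out = out.transpose(0, 2, 1, 3).reshape(B, T, D)   # step 9: combine heads -> [B, T, D]
print(out.shape)                                   # (2, 5, 8); step 10 is a final dense layer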

1.4.2 feed_forward
    def call(self, inputs, padding=None):
        padding = padding if self.allow_pad else None
        batch_size = tf.shape(inputs)[0]
        length = tf.shape(inputs)[1]
        if padding is not None:
            with tf.name_scope('remove_padding'):
                #   flatten the pad mask and keep only the non-pad positions
                pad_mask = tf.reshape(padding, [-1])
                non_pad_ids = tf.to_int32(tf.where(pad_mask < 1e-9))
                #   [batch_size, length, hidden_size] -> [num_non_pad, hidden_size]
                inputs = tf.reshape(inputs, [-1, self.hidden_size])
                inputs = tf.gather_nd(params=inputs, indices=non_pad_ids)
                inputs.set_shape([None, self.hidden_size])
                inputs = tf.expand_dims(inputs, axis=0)
        outputs = self.filter_layer(inputs)
        if self.train:
            outputs = tf.nn.dropout(outputs, 1.0 - self.relu_dropout)
        outputs = self.output_layer(outputs)
        if padding is not None:
            with tf.name_scope('re_add_padding'):
                #   scatter the processed rows back; padded positions become zeros
                outputs = tf.squeeze(outputs, axis=0)
                outputs = tf.scatter_nd(
                    indices=non_pad_ids,
                    updates=outputs,
                    shape=[batch_size * length, self.hidden_size]
                )
                outputs = tf.reshape(outputs, [batch_size, length, self.hidden_size])
        return outputs

The padding here is exactly the result of get_padding() from section 1.2 above.
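
The point of this gather/scatter pair is to avoid running the two dense layers on padded positions. A NumPy sketch of just that round trip (the dense layers in the middle are elided):

import numpy as np

batch_size, length, hidden_size = 2, 4, 3
inputs = np.random.randn(batch_size, length, hidden_size).astype(np.float32)
padding = np.array([[0., 0., 1., 1.],
                    [0., 1., 1., 1.]], dtype=np.float32)    # output of get_padding()

# remove_padding: keep only the rows belonging to real tokens
pad_mask = padding.reshape(-1)
non_pad_ids = np.where(pad_mask < 1e-9)[0]
compact = inputs.reshape(-1, hidden_size)[non_pad_ids]      # [num_real_tokens, hidden_size]

# ... filter_layer -> dropout -> output_layer would run on `compact` here ...

# re_add_padding: scatter the rows back; padded positions become all zeros
restored = np.zeros((batch_size * length, hidden_size), dtype=np.float32)
restored[non_pad_ids] = compact
restored = restored.reshape(batch_size, length, hidden_size)
print(restored.shape)                                       # (2, 4, 3)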

1.4.3 norm
    def call(self, x, epsilon=1e-6):
        mean = tf.reduce_mean(x, axis=[-1], keepdims=True)
        variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)
        norm_x = (x - mean) * tf.rsqrt(variance + epsilon)
        return norm_x * self.scale + self.bias

This one is easy to understand: compute the mean and variance, then normalize.
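
Here, scale and bias are the layer's trainable parameters (presumably initialized to ones and zeros in the class's build(), which isn't shown here). A NumPy equivalent of the math, for intuition:

import numpy as np

def layer_norm_np(x, scale, bias, epsilon=1e-6):
    # Normalize over the last (hidden) dimension, then apply the learned scale and bias.
    mean = x.mean(axis=-1, keepdims=True)
    variance = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon) * scale + bias

hidden_size = 4
x = np.random.randn(2, 3, hidden_size).astype(np.float32)
out = layer_norm_np(x, np.ones(hidden_size), np.zeros(hidden_size))
print(out.mean(axis=-1))    # per-position means are ~0
print(out.std(axis=-1))     # per-position stds are ~1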

And that's the whole encoder flow. It is fairly simple overall; you have to admire how strong the folks at Google are.

The next post will cover the decoder part.

Thanks.

For more code, please head over to my personal GitHub, which is updated from time to time.
Feel free to follow.
