A few words to start:
Build from the basics, keep learning, keep at it, and good luck.
From a life-loving, tech-loving programmer doggo from Mars
bert series:
- bert corpus generation
- bert loss analysis
- bert transformer detailed analysis: encoder
- bert transformer detailed analysis: decoder
Without further ado, let's get straight into today's main content.
def decode(self, targets, encoder_outputs, attention_bias):
    """
    :param targets: [batch_size, target_length]
    :param encoder_outputs: [batch_size, input_length, hidden_size]
    :param attention_bias: [batch_size, 1, 1, input_length]
    :return: [batch_size, target_length, vocab_size]
    """
    with tf.name_scope('decode'):
        # [batch_size, target_length, hidden_size]
        decoder_inputs = self.embedding_layer(targets)
        with tf.name_scope('shift_targets'):
            # pad embedding value 0 at the head of sequence and remove eos_id
            decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
        with tf.name_scope('add_pos_embedding'):
            length = tf.shape(decoder_inputs)[1]
            position_decode = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
            decoder_inputs = tf.add(decoder_inputs, position_decode)
        if self.train:
            decoder_inputs = tf.nn.dropout(decoder_inputs, 1. - self.params.get('encoder_decoder_dropout'))
        decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(length)
        outputs = self.decoder_stack(
            decoder_inputs,
            encoder_outputs,
            decoder_self_attention_bias,
            attention_bias
        )
        # [batch_size, target_length, vocab_size]
        logits = self.embedding_layer.linear(outputs)
        return logits
The shape of each input parameter is already documented in detail in the code comments. OK, let's walk through the code step by step.
1. embedding_layer
This is the same as in the encoder, so I won't explain it again here; if you have questions, see the previous post. The returned shape is [batch_size, sequence_length, hidden_size].
2. pad
decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
This one is fairly easy to follow. The input has shape [batch_size, sequence_length, hidden_size], i.e. rank 3. The first dimension is not padded and neither is the last; only the middle (time) dimension gets padded, and only at the front, not at the back. So after padding the shape is [batch_size, sequence_length + 1, hidden_size]. Then the last position along the second dimension (which holds the [EOS] marker) is sliced off, so the shape is back to [batch_size, sequence_length, hidden_size]. The net effect is shifting the targets right by one step.
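To see the shift concretely, here is a tiny standalone demo (toy numbers, not from the repo):
import tensorflow as tf

# toy batch: batch_size=1, sequence_length=3, hidden_size=2
x = tf.constant([[[1., 1.], [2., 2.], [3., 3.]]])
# pad one step of zeros at the front of the time axis, then drop the last step
shifted = tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
# shifted == [[[0., 0.], [1., 1.], [2., 2.]]] -- the whole sequence moved right by one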
3. get_position_encoding
This step is also the same as in the previous post, so I won't go into detail. It returns a tensor of shape [sequence_length, hidden_size], which is then simply added to the embedding output (broadcast over the batch dimension), so the result still has shape [batch_size, sequence_length, hidden_size]. A dropout layer is applied afterwards.
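For reference, the sinusoidal encoding typically looks like the following (a sketch based on the standard Transformer formulation; the exact model_utils code may differ in details):
import math
import tensorflow as tf

def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    # one sin/cos pair per timescale, hidden_size // 2 timescales in total
    position = tf.cast(tf.range(length), tf.float32)
    num_timescales = hidden_size // 2
    log_timescale_increment = math.log(max_timescale / min_timescale) / (num_timescales - 1)
    inv_timescales = min_timescale * tf.exp(
        tf.cast(tf.range(num_timescales), tf.float32) * -log_timescale_increment)
    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
    # [length, hidden_size]
    return tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)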
4. get_decoder_self_attention_bias
def get_decoder_self_attention_bias(length):
    with tf.name_scope("decoder_self_attention_bias"):
        # lower-triangular matrix of ones: position i may attend to positions <= i
        valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
        valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
        # invalid (future) locations get a large negative bias (_NEG_INF = -1e9)
        decoder_bias = _NEG_INF * (1.0 - valid_locs)
        return decoder_bias
tf.matrix_band_part(..., -1, 0) keeps the lower triangular part, just like this:
[[1. 0. 0. 0. 0.]
[1. 1. 0. 0. 0.]
[1. 1. 1. 0. 0.]
[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 1.]]
The final output looks like the following; the -1e9 values now fill the strictly upper triangular part:
tf.Tensor(
[[[[-0.e+00 -1.e+09 -1.e+09 -1.e+09 -1.e+09]
[-0.e+00 -0.e+00 -1.e+09 -1.e+09 -1.e+09]
[-0.e+00 -0.e+00 -0.e+00 -1.e+09 -1.e+09]
[-0.e+00 -0.e+00 -0.e+00 -0.e+00 -1.e+09]
[-0.e+00 -0.e+00 -0.e+00 -0.e+00 -0.e+00]]]], shape=(1, 1, 5, 5), dtype=float32)
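Why -1e9? Because this bias is added to the attention logits before the softmax, so future positions end up with weight ~0. A quick demo, reusing the function above (note: tf.matrix_band_part is called tf.linalg.band_part in TF2):
import tensorflow as tf

logits = tf.zeros([1, 1, 5, 5])                       # pretend attention logits
weights = tf.nn.softmax(logits + get_decoder_self_attention_bias(5))
# row i spreads its weight uniformly over positions 0..i and gives ~0 to the future:
# weights[0, 0, 0] ~ [1.   0.   0.   0.   0. ]
# weights[0, 0, 4] ~ [0.2  0.2  0.2  0.2  0.2]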
5. decoder_stack
OK, let's take a look at the decoder_stack.
class DecoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(DecoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            # masked self-attention over the decoder inputs
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )
            # encoder-decoder ("vanilla") attention
            vanilla_attention_layer = AttentionLayer(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )
            # position-wise feed-forward network
            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )
            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(vanilla_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )
        self.output_norm = LayerNormalization(self.params.get('hidden_size'))
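The class above only shows __init__; the forward pass is not shown. Based on how decode calls it, the call method should look roughly like this (a sketch, not necessarily the repo's exact code):
def call(self, decoder_inputs, encoder_outputs, decoder_self_attention_bias, attention_bias):
    for self_attention_layer, vanilla_attention_layer, ffn_layer in self.layers:
        # masked self-attention: the bias hides future positions
        decoder_inputs = self_attention_layer(decoder_inputs, decoder_self_attention_bias)
        # encoder-decoder attention: the bias hides encoder padding
        decoder_inputs = vanilla_attention_layer(decoder_inputs, encoder_outputs, attention_bias)
        # position-wise feed-forward
        decoder_inputs = ffn_layer(decoder_inputs)
    return self.output_norm(decoder_inputs)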
5.1 self_attention
This part is exactly the same as the encoder's self_attention: Q, K, V = decoder_inputs, and the computation proceeds in exactly the same way. The only difference is that the bias is produced by get_decoder_self_attention_bias, so every position can only attend to itself and the positions before it.
5.2 vanilla_attention
What makes this attention different is that it is the decoder attending to the encoder. This is arguably the crucial attention: it is what aligns the decoder with the encoder.
Q = decoder_inputs
K, V = encoder_outputs
- Q has shape [B, T_d, D], while K and V have shape [B, T_e, D]
- split_heads is applied to Q, K and V separately: Q's shape becomes [B, H, T_d, D//H], and K's and V's become [B, H, T_e, D//H], where H is num_heads
- Q = scale(Q)
- logits = tf.matmul(Q, K, transpose_b=True), returning shape [B, H, T_d, T_e]
- logits = tf.add(logits, bias); this bias is the attention_bias computed in the very first step of the first section, with shape [B, 1, 1, T_e]. The result still has shape [B, H, T_d, T_e]
- weights = tf.nn.softmax(logits)
- dropout(weights)
- attention_output = tf.matmul(weights, V): weights has shape [B, H, T_d, T_e] and V has shape [B, H, T_e, D//H], so the result is [B, H, T_d, D//H]
- out = combine(heads), returning shape [B, T_d, D]
- dense(out, D), returning shape [B, T_d, D]
The whole pipeline is sketched in code right after this list.
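Putting those steps together, here is a minimal TF2-style sketch of the multi-head encoder-decoder attention. The dense projections of Q/K/V and the final output projection are omitted for brevity, and the function names are illustrative, not the repo's exact API:
import tensorflow as tf

def split_heads(x, num_heads):
    # [B, T, D] -> [B, H, T, D//H]
    batch = tf.shape(x)[0]
    length = tf.shape(x)[1]
    depth = x.shape[-1] // num_heads
    x = tf.reshape(x, [batch, length, num_heads, depth])
    return tf.transpose(x, [0, 2, 1, 3])

def combine_heads(x):
    # [B, H, T, D//H] -> [B, T, D]
    batch = tf.shape(x)[0]
    length = tf.shape(x)[2]
    hidden = x.shape[1] * x.shape[-1]
    return tf.reshape(tf.transpose(x, [0, 2, 1, 3]), [batch, length, hidden])

def vanilla_attention(decoder_inputs, encoder_outputs, bias, num_heads):
    # Q comes from the decoder, K and V from the encoder
    q = split_heads(decoder_inputs, num_heads)      # [B, H, T_d, D//H]
    k = split_heads(encoder_outputs, num_heads)     # [B, H, T_e, D//H]
    v = split_heads(encoder_outputs, num_heads)     # [B, H, T_e, D//H]
    q *= q.shape[-1] ** -0.5                        # scale
    logits = tf.matmul(q, k, transpose_b=True)      # [B, H, T_d, T_e]
    logits += bias                                  # bias [B, 1, 1, T_e] broadcasts
    weights = tf.nn.softmax(logits)                 # attention distribution
    return combine_heads(tf.matmul(weights, v))     # [B, T_d, D]
In the real layer, q, k and v each go through their own dense projection first, and a final dense layer maps the combined heads back to D, matching the last bullet above.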
5.3 feed_forward
Same as the encoder's.
5.4 norm
Same as the encoder's.
5.5 linear
def linear(self, inputs):
    """
    :param inputs: a tensor with shape [batch_size, length, hidden_size]
    :return: float32 tensor with shape [batch_size, length, vocab_size]
    """
    with tf.name_scope('pre_softmax_linear'):
        batch_size = tf.shape(inputs)[0]
        length = tf.shape(inputs)[1]
        # flatten to [batch_size * length, hidden_size]
        inputs = tf.reshape(inputs, [-1, self.hidden_size])
        # shared_weights: [vocab_size, hidden_size]
        # transposed:     [hidden_size, vocab_size]
        # logits:         [batch_size * length, vocab_size]
        logits = tf.matmul(inputs, self.shared_weights, transpose_b=True)
        return tf.reshape(logits, [batch_size, length, self.vocab_size])
I won't go into detail on this one. What is worth noting is that shared_weights is the matrix initialized for the embedding layer, i.e. the embedding and the pre-softmax projection share the same weights. The final output is the logits over the vocabulary at each position (a softmax turns them into a probability distribution).
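To make the weight sharing concrete, here is a tiny sketch (toy sizes; variable names are illustrative):
import tensorflow as tf

vocab_size, hidden_size = 8, 4
shared_weights = tf.Variable(tf.random.normal([vocab_size, hidden_size]))

# embedding lookup: ids -> vectors, using shared_weights as the table
ids = tf.constant([[1, 5, 2]])                       # [batch_size, length]
embedded = tf.gather(shared_weights, ids)            # [batch_size, length, hidden_size]

# pre-softmax projection: the SAME matrix, transposed, maps back to vocab logits
flat = tf.reshape(embedded, [-1, hidden_size])       # [batch_size * length, hidden_size]
logits = tf.matmul(flat, shared_weights, transpose_b=True)
logits = tf.reshape(logits, [1, 3, vocab_size])      # [batch_size, length, vocab_size]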
Finally:
One thing I haven't mentioned so far:
class PrePostProcessingWrapper(object):
    """Wrapper class that applies layer pre-processing and post-processing."""

    def __init__(self, layer, params, train):
        self.layer = layer
        self.postprocess_dropout = params["layer_postprocess_dropout"]
        self.train = train
        # Create normalization layer
        self.layer_norm = LayerNormalization(params["hidden_size"])

    def __call__(self, x, *args, **kwargs):
        # Preprocessing: apply layer normalization
        y = self.layer_norm(x)
        # Get layer output
        y = self.layer(y, *args, **kwargs)
        # Postprocessing: apply dropout and residual connection
        if self.train:
            y = tf.nn.dropout(y, 1 - self.postprocess_dropout)
        return x + y
This wrapper:
- first applies a norm to the input, the same norm as described earlier;
- then gets the layer's output;
- applies dropout to the result;
- and finally adds the input back in, forming a residual connection.
This wrapping is applied to the input and output of every single layer.
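In one line, every wrapped sublayer computes output = x + dropout(sublayer(layer_norm(x))), i.e. the pre-norm residual form of the Transformer block.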
Thanks!
For more code, head over to my personal github, which is updated from time to time.
Feel free to follow.