A word before we start:
Start from the basics, keep learning, and persevere. Keep going!
A programmer from Mars who loves life and technology
BERT series:
- BERT corpus generation
- BERT loss analysis
- BERT transformer in detail: the encoder
- BERT transformer in detail: the decoder
This post walks through the transformer code on my GitHub (link at the end of the post) and explains the code and its logic in detail.
def __call__(self, feature, targets=None):
    initializer = tf.variance_scaling_initializer(
        scale=self.params.get('initializer_gain'),
        mode='fan_avg',
        distribution='uniform'
    )
    with tf.variable_scope('transformer', initializer=initializer):
        # [batch_size, 1, 1, length]
        attention_bias = model_utils.get_padding_bias(feature)
        encoder_outputs = self.encode(feature, attention_bias)
        if targets is None:
            return self.predict(encoder_outputs, attention_bias)
        logits = self.decode(targets, encoder_outputs, attention_bias)
        return logits
This is the main entry point, the __call__() method. (The comments were removed from the code to keep it short.)
def get_padding_bias(x):
    with tf.name_scope("attention_bias"):
        padding = get_padding(x)
        attention_bias = padding * _NEG_INF
        attention_bias = tf.expand_dims(
            tf.expand_dims(attention_bias, axis=1), axis=1)
    return attention_bias
The main purpose here is to obtain attention_bias. The code of get_padding(), which it relies on, is as follows:
def get_padding(x, padding_value=0):
    with tf.name_scope("padding"):
        return tf.to_float(tf.equal(x, padding_value))
x has shape [batch_size, sequence_length] and has already been padded. After this method runs, we know which positions are padding and which are non-padding. The returned shape is [batch_size, sequence_length], where 0 -> non-padding and 1 -> padding.
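_NEG_INF = -1e9. As the name suggests, the padding bias adds a bias to the padding positions, and this value is the bias assigned to them. Because the bias is added to the attention logits before the softmax, padding positions end up with an attention weight of practically zero.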
The final returned shape is [batch_size, 1, 1, sequence_length].
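To make the effect of the bias concrete, here is a minimal pure-Python sketch (nested lists instead of tensors). The token ids and logits are made-up values for illustration, and the softmax helper is my own addition, not part of the post's code.

```python
import math

NEG_INF = -1e9  # same value as _NEG_INF in the post


def get_padding(ids, padding_value=0):
    """Return 1.0 where a token id equals padding_value, else 0.0."""
    return [[1.0 if tok == padding_value else 0.0 for tok in row] for row in ids]


def get_padding_bias(ids):
    """Return a [batch, 1, 1, length] nested list: NEG_INF at padding, 0 elsewhere."""
    return [[[[p * NEG_INF for p in row]]] for row in get_padding(ids)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


ids = [[5, 7, 0, 0]]            # hypothetical batch: one sequence, two padding tokens
bias = get_padding_bias(ids)    # shape [1, 1, 1, 4]
logits = [1.0, 2.0, 3.0, 4.0]   # made-up attention logits for a single query
weights = softmax([l + b for l, b in zip(logits, bias[0][0][0])])
print(weights)                  # the last two weights underflow to 0.0
```

Even though the padded positions had the largest raw logits, they receive no attention weight once the bias is added.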
Step 1: encoder
def encode(self, inputs, attention_bias):
    with tf.name_scope('encode'):
        # [batch_size, length, hidden_size]
        embedded_inputs = self.embedding_layer(inputs)
        # [batch_size, length]
        inputs_padding = model_utils.get_padding(inputs)
        with tf.name_scope('add_pos_embedding'):
            length = tf.shape(embedded_inputs)[1]
            # use sin cos calculate position embeddings
            pos_encoding = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
            encoder_inputs = tf.add(embedded_inputs, pos_encoding)
        if self.train:
            encoder_inputs = tf.nn.dropout(encoder_inputs, 1 - self.params.get('encoder_decoder_dropout'))
        return self.encoder_stack(encoder_inputs, attention_bias, inputs_padding)
OK, let's go through it step by step!
1.1 embedding_layer
The main implementation is as follows:
def call(self, inputs, **kwargs):
    with tf.name_scope('embedding'):
        mask = tf.to_float(tf.not_equal(inputs, 0))
        embeddings = tf.gather(self.shared_weights, inputs)
        embeddings *= tf.expand_dims(mask, -1)
        embeddings *= self.hidden_size ** 0.5
        return embeddings
This code is fairly easy to follow. The mask sets the embeddings at padding positions (token id 0) to 0. The embeddings are then scaled by the square root of hidden_size. The returned shape is [batch_size, sequence_length, hidden_size].
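The lookup, mask, and scale steps can be sketched in pure Python as follows. The embedding table is a hypothetical 3-word vocabulary with hidden_size = 4, chosen only so the numbers are easy to check by hand.

```python
def embed(ids, table, hidden_size):
    """Look up each id, zero the rows of padding tokens (id 0), scale by sqrt(hidden_size)."""
    scale = hidden_size ** 0.5
    out = []
    for row in ids:
        out_row = []
        for tok in row:
            mask = 0.0 if tok == 0 else 1.0
            out_row.append([v * mask * scale for v in table[tok]])
        out.append(out_row)
    return out


# Hypothetical embedding table: 3 token ids, hidden_size = 4
table = [
    [0.1, 0.1, 0.1, 0.1],   # id 0 is the padding token
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.5, 0.5, 0.5],
]
emb = embed([[1, 2, 0]], table, hidden_size=4)   # shape [1, 3, 4]
print(emb)
```

Note that the padding row of the table holds non-zero values, yet the output at that position is all zeros because of the mask.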
1.2 get_padding
def get_padding(x, padding_value=0):
    with tf.name_scope("padding"):
        return tf.to_float(tf.equal(x, padding_value))
The logic is the same as in the attention_bias step above. This returns a tensor of shape [batch_size, sequence_length], where 0 -> non-padding and 1 -> padding.
1.3 get_position_encoding
def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    position = tf.to_float(tf.range(length))
    num_timescales = hidden_size // 2
    log_timescale_increment = (
        math.log(float(max_timescale) / float(min_timescale)) / (tf.to_float(num_timescales) - 1))
    inv_timescales = min_timescale * tf.exp(tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
    signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
    return signal
This computes sin and cos values that serve as the position encoding for each of the sequence_length positions. The returned shape is [sequence_length, hidden_size]. It is then added to the embedding output with a plain element-wise add, broadcast over the batch dimension, so the result has shape [batch_size, sequence_length, hidden_size]. A dropout layer follows (applied only during training), and the result is passed into encoder_stack.
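For readers who want to inspect actual numbers, here is a pure-Python sketch that mirrors the function above with lists instead of tensors. It is a re-implementation for illustration only; like the original, it needs hidden_size >= 4, otherwise num_timescales - 1 would be zero.

```python
import math


def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    """Return a [length, hidden_size] nested list: sin values first, then cos values."""
    num_timescales = hidden_size // 2
    log_increment = math.log(max_timescale / min_timescale) / (num_timescales - 1)
    inv_timescales = [min_timescale * math.exp(-i * log_increment)
                      for i in range(num_timescales)]
    signal = []
    for pos in range(length):
        scaled = [pos * inv for inv in inv_timescales]
        # The two halves are concatenated, not interleaved, as in tf.concat(..., axis=1)
        signal.append([math.sin(s) for s in scaled] + [math.cos(s) for s in scaled])
    return signal


pe = get_position_encoding(length=3, hidden_size=4)
for row in pe:
    print(row)
```

Position 0 always encodes to zeros in the sin half and ones in the cos half, which is a quick sanity check when debugging.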
1.4 encoder_stack
class EncoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(EncoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )
            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )
            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )
        self.output_norm = LayerNormalization(self.params.get('hidden_size'))
The structure is simple: each block is a self_attention layer plus a feed_forward layer, and the whole stack ends with a norm layer.
    def call(self, encoder_inputs, attention_bias, inputs_padding):
        """
        :param encoder_inputs: [batch_size, input_length, hidden_size]
        :param attention_bias: [batch_size, 1, 1, inputs_length]
        :param inputs_padding: [batch_size, length]
        :return: [batch_size, input_length, hidden_size]
        """
        for n, layer in enumerate(self.layers):
            self_attention_layer = layer[0]
            ffn_layer = layer[1]
            with tf.variable_scope('encoder_stack_lay_{}'.format(n)):
                with tf.variable_scope('self_attention'):
                    encoder_inputs = self_attention_layer(encoder_inputs, attention_bias)
                with tf.variable_scope('ffn'):
                    encoder_inputs = ffn_layer(encoder_inputs, inputs_padding)
        return self.output_norm(encoder_inputs)
The input parameters of the main call() method are documented in the docstring in the code.
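The control flow of the stack can be sketched with plain callables. Note that the code of PrePostProcessingWrapper is not shown in this post, so pre_post_wrap below is only an assumption: it follows the common pattern of normalizing the input, running the sublayer, and adding the residual (dropout omitted). The toy sublayer and the identity norm are hypothetical stand-ins.

```python
def pre_post_wrap(layer, norm):
    """Hypothetical wrapper (assumed pattern): norm -> layer -> residual add."""
    def wrapped(x, *args):
        y = layer(norm(x), *args)
        return [a + b for a, b in zip(x, y)]
    return wrapped


def encoder_stack(x, blocks, output_norm):
    """Run every (self_attention, ffn) pair in order, then the final norm."""
    for self_attention_layer, ffn_layer in blocks:
        x = self_attention_layer(x)
        x = ffn_layer(x)
    return output_norm(x)


identity = lambda v: v                  # stand-in for LayerNormalization
add_one = lambda v: [1.0 for _ in v]    # toy sublayer that outputs a vector of ones
blocks = [(pre_post_wrap(add_one, identity), pre_post_wrap(add_one, identity))
          for _ in range(2)]            # num_blocks = 2
result = encoder_stack([0.0, 0.0], blocks, identity)
print(result)                           # each of the 4 sublayers adds 1 via the residual
```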
1.4.1 self_attention
The self_attention layer is annotated step by step in the code on GitHub and is quite clear, so I won't go into much detail here. The main flow is:
- Q, K, V = encoder_inputs, each of shape [B, T, D]. As shorthand, B is batch_size, T is sequence_length, and D is hidden_size.
- Apply split_heads to Q, K, and V separately. Each then has shape [B, H, T, D//H], where H is num_heads.
- Q = scale(Q)
- logits = tf.matmul(Q, K, transpose_b=True), with shape [B, H, T, T]
- logits = tf.add(logits, bias), where bias is the attention_bias computed in the first step
- weights = tf.nn.softmax(logits)
- dropout(weights)
- attention_output = tf.matmul(weights, V), with shape [B, H, T, D//H]
- out = combine(heads), with shape [B, T, D]
- dense(out, D), with shape [B, T, D]
That completes one pass through self-attention.
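The steps listed above can be sketched in pure Python for a single sequence (no batch dimension). This is a simplified illustration, not the code from the repository: dropout and the dense layers are omitted, and the scale factor (D//H) ** -0.5 is an assumption based on standard scaled dot-product attention, since the post does not spell out what scale(Q) does.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(x, num_heads):
    """[T, D] -> [H, T, D // H]"""
    depth = len(x[0]) // num_heads
    return [[row[h * depth:(h + 1) * depth] for row in x] for h in range(num_heads)]


def combine_heads(heads):
    """[H, T, D // H] -> [T, D]"""
    return [[v for head in heads for v in head[t]] for t in range(len(heads[0]))]


def self_attention(x, bias, num_heads):
    """x: [T, D]; bias: length-T list (0 or -1e9 per key position); returns [T, D]."""
    depth = len(x[0]) // num_heads
    out_heads = []
    for head in split_heads(x, num_heads):   # Q = K = V = head here (no projections)
        out = []
        for query in head:
            q = [v * depth ** -0.5 for v in query]                      # Q = scale(Q)
            logits = [sum(a * b for a, b in zip(q, key)) + bias[j]      # QK^T + bias
                      for j, key in enumerate(head)]
            weights = softmax(logits)
            out.append([sum(w * value[d] for w, value in zip(weights, head))
                        for d in range(depth)])                         # weights @ V
        out_heads.append(out)
    return combine_heads(out_heads)


x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [9.0, 9.0, 9.0, 9.0]]    # made-up inputs; the last position is padding
bias = [0.0, 0.0, -1e9]
out = self_attention(x, bias, num_heads=2)
for row in out:
    print(row)
```

Every output value stays between 0 and 1, which shows that the padded position (filled with 9.0) contributes nothing to any row.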
1.4.2 feed_forward
def call(self, inputs, padding=None):
    padding = padding if self.allow_pad else None
    batch_size = tf.shape(inputs)[0]
    length = tf.shape(inputs)[1]
    if padding is not None:
        with tf.name_scope('remove_padding'):
            # Flatten to [batch_size * length] and keep only non-padding rows
            pad_mask = tf.reshape(padding, [-1])
            non_pad_ids = tf.to_int32(tf.where(pad_mask < 1e-9))
            inputs = tf.reshape(inputs, [-1, self.hidden_size])
            inputs = tf.gather_nd(params=inputs, indices=non_pad_ids)
            inputs.set_shape([None, self.hidden_size])
            inputs = tf.expand_dims(inputs, axis=0)
    outputs = self.filter_layer(inputs)
    if self.train:
        outputs = tf.nn.dropout(outputs, 1.0 - self.relu_dropout)
    outputs = self.output_layer(outputs)
    if padding is not None:
        with tf.name_scope('re_add_padding'):
            # Scatter the results back; padding rows are filled with zeros
            outputs = tf.squeeze(outputs, axis=0)
            outputs = tf.scatter_nd(
                indices=non_pad_ids,
                updates=outputs,
                shape=[batch_size * length, self.hidden_size]
            )
            outputs = tf.reshape(outputs, [batch_size, length, self.hidden_size])
    return outputs
The padding here is the result computed by get_padding() in 1.2 above.
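The remove-padding and re-add-padding trick is easier to see on nested lists. The sketch below is a pure-Python illustration; the ffn argument is a hypothetical stand-in for filter_layer followed by output_layer, and the inputs are made-up values.

```python
def ffn_without_padding(inputs, padding, ffn):
    """inputs: [B, T, D]; padding: [B, T] with 1.0 at padding; ffn maps a D-vector to a D-vector."""
    batch_size, length = len(inputs), len(inputs[0])
    hidden_size = len(inputs[0][0])
    flat = [vec for seq in inputs for vec in seq]              # [B * T, D]
    pad_mask = [p for row in padding for p in row]             # [B * T]
    non_pad_ids = [i for i, p in enumerate(pad_mask) if p < 1e-9]
    kept = [ffn(flat[i]) for i in non_pad_ids]                 # FFN runs on real tokens only
    # Equivalent of scatter_nd: start from zeros, write results back at their indices
    restored = [[0.0] * hidden_size for _ in range(batch_size * length)]
    for i, vec in zip(non_pad_ids, kept):
        restored[i] = vec
    return [restored[b * length:(b + 1) * length] for b in range(batch_size)]


double = lambda vec: [2.0 * v for v in vec]    # stand-in for the two dense layers
inputs = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
padding = [[0.0, 0.0, 1.0]]
out = ffn_without_padding(inputs, padding, double)
print(out)
```

The benefit is that the dense layers never spend computation on padding tokens, and the padding positions come back as zeros.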
1.4.3 norm
def call(self, x, epsilon=1e-6):
    mean = tf.reduce_mean(x, axis=[-1], keepdims=True)
    variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)
    norm_x = (x - mean) * tf.rsqrt(variance + epsilon)
    return norm_x * self.scale + self.bias
This one is easy to follow: compute the mean and variance over the last dimension, normalize, and then apply the learned scale and bias.
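As a quick numeric check, here is the same computation for a single vector in pure Python. The scale and bias values are hypothetical (ones and zeros), chosen so that the output is just the normalized input.

```python
import math


def layer_norm(x, scale, bias, epsilon=1e-6):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(variance + epsilon)    # same role as tf.rsqrt
    return [(v - mean) * inv_std * s + b for v, s, b in zip(x, scale, bias)]


y = layer_norm([1.0, 2.0, 3.0, 4.0], scale=[1.0] * 4, bias=[0.0] * 4)
print(y)
```

The output has a mean of about 0 and a variance of about 1 (slightly below 1 because of epsilon).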
That completes the encoder flow. It is fairly simple, and you have to admire how capable the Google folks are.
The next post will cover the decoder.
Thanks
More code is available on my personal GitHub, which is updated from time to time.
Feel free to follow.