BERT principles and code analysis
- Input: input_ids
Shape: [batch_size, seq_length] (the embedding lookup code treats it as [batch_size, seq_length, 1])
Token ID encoding: obtaining tokens
BasicTokenizer: whitespace/punctuation splitting; each Chinese character is separated out as its own token
WordpieceTokenizer:
input = "unaffable"
output = ["un", "##aff", "##able"]
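A minimal runnable sketch of this greedy longest-match-first WordPiece step (the tiny vocab below is illustrative only, not the real vocab file):

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    # greedy longest-match-first: take the longest prefix found in the vocab,
    # then continue on the remainder, prefixing continuation pieces with "##"
    sub_tokens, start = [], 0
    while start < len(word):
        end, cur_substr = len(word), None
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr
            if substr in vocab:
                cur_substr = substr
                break
            end -= 1                     # shrink from the right until a match is found
        if cur_substr is None:
            return [unk_token]           # nothing matched: the whole word is unknown
        sub_tokens.append(cur_substr)
        start = end
    return sub_tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))   # ['un', '##aff', '##able']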
Token ID encoding: token embedding / embedding_lookup_factorized
- Create the embedding_table
Shape: [vocab_size, embedding_size]
embedding_size = 128
- first project one-hot vectors into a lower dimensional embedding space of size E
One-hot path:
one_hot_input_ids = tf.one_hot(flat_input_ids,depth=vocab_size) # one_hot_input_ids=[batch_size * sequence_length,vocab_size]
output_middle = tf.matmul(one_hot_input_ids, embedding_table) # output=[batch_size * sequence_length,embedding_size]
Otherwise, the lookup path:
output_middle = tf.gather(embedding_table,flat_input_ids)
# embedding_table [vocab_size, embedding_size] indexed by flat_input_ids [batch_size * sequence_length,] ---> [batch_size * sequence_length, embedding_size]
flat_input_ids shape: (batch_size * sequence_length,)
- Then project the vector (output_middle) into the hidden space
project_variable: [embedding_size, hidden_size]
output = tf.matmul(output_middle, project_variable)
# ([batch_size * sequence_length, embedding_size] * [embedding_size, hidden_size])--->[batch_size * sequence_length, hidden_size]
output: (batch_size, sequence_length, hidden_size) after reshaping back to 3D
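Putting the lookup and projection together, a small numpy sketch of the factorized embedding (random weights and illustrative sizes, not the actual checkpoint):

import numpy as np

vocab_size, embedding_size, hidden_size = 30522, 128, 768
batch_size, seq_length = 2, 4

embedding_table  = np.random.randn(vocab_size, embedding_size) * 0.02    # [vocab_size, embedding_size]
project_variable = np.random.randn(embedding_size, hidden_size) * 0.02   # [embedding_size, hidden_size]

input_ids = np.random.randint(0, vocab_size, size=(batch_size, seq_length))
flat_input_ids = input_ids.reshape(-1)               # (batch_size * seq_length,)
output_middle = embedding_table[flat_input_ids]      # gather branch: [B*S, embedding_size]
output = output_middle @ project_variable            # project to hidden space: [B*S, hidden_size]
output = output.reshape(batch_size, seq_length, hidden_size)
print(output.shape)                                  # (2, 4, 768)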
Segment/position encoding: segment embedding, position embedding / embedding_postprocessor
token_type_table: distinguishes Segment A and Segment B for the next sentence prediction task
full_position_embeddings: positions [0, 1, 2, ..., seq_length-1]
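A numpy sketch of what embedding_postprocessor adds on top of the token embeddings (the real code also finishes with layer norm and dropout; values here are random):

import numpy as np

batch_size, seq_length, hidden_size = 2, 4, 768
token_type_vocab_size, max_position_embeddings = 2, 512

token_embeddings = np.zeros((batch_size, seq_length, hidden_size))
token_type_table = np.random.randn(token_type_vocab_size, hidden_size) * 0.02
full_position_embeddings = np.random.randn(max_position_embeddings, hidden_size) * 0.02

token_type_ids = np.array([[0, 0, 1, 1]] * batch_size)        # 0 = Segment A, 1 = Segment B
segment_embeddings = token_type_table[token_type_ids]         # [B, S, hidden_size]
position_embeddings = full_position_embeddings[:seq_length]   # [S, hidden_size], positions 0..seq_length-1

output = token_embeddings + segment_embeddings + position_embeddings[None, :, :]
print(output.shape)   # (2, 4, 768)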
transformer_model
attention_mask
mask of shape [batch_size, seq_length, seq_length]
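The 3D mask comes from broadcasting the 2D padding mask; a small numpy sketch of the idea behind create_attention_mask_from_input_mask:

import numpy as np

input_mask = np.array([[1, 1, 1, 0],      # 1 = real token, 0 = padding
                       [1, 1, 0, 0]])     # [batch_size, seq_length]
batch_size, seq_length = input_mask.shape

# every "from" position may attend to every non-padding "to" position
attention_mask = np.ones((batch_size, seq_length, 1), dtype=input_mask.dtype) * input_mask[:, None, :]
print(attention_mask.shape)   # (2, 4, 4)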
input: [batch_size, seq_length, input_width] (input_width = 768)
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
attention
Input: from_tensor
[batch_size, from_seq_length, num_attention_heads * size_per_head]
Output: [batch_size, from_seq_length, num_attention_heads * size_per_head]
B = batch size (number of sequences)
F = from_tensor sequence length
T = to_tensor sequence length
N = num_attention_heads
H = size_per_head
# `query_layer` = [B*F, N*H]
query_layer = tf.layers.dense(
from_tensor_2d,
num_attention_heads * size_per_head,
activation=query_act,
name="query",
kernel_initializer=create_initializer(initializer_range))
key_layer = [B*T, N*H]
value_layer = [B*T, N*H]
After transpose_for_scores (reshape, then transpose):
query_layer = [B, N, F, H]
key_layer = [B, N, T, H]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
attention_scores = tf.multiply(attention_scores,
1.0 / math.sqrt(float(size_per_head)))
`attention_scores` = [B, N, F, T]
How the attention_mask is applied:
attention_mask = tf.expand_dims(attention_mask, axis=[1])
adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
attention_scores += adder
attention_probs = tf.nn.softmax(attention_scores)
`value_layer` = [B, N, T, H]
context_layer = tf.matmul(attention_probs, value_layer)
context_layer = [B, N, F, H]
after tf.transpose: context_layer = [B, F, N, H]
after tf.reshape: context_layer = [B, F, N*H]
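A compact numpy walkthrough of the score / mask / softmax / context steps above, with random tensors and the same shape notation:

import math
import numpy as np

B, F, T, N, H = 2, 4, 4, 12, 64

query_layer = np.random.randn(B, N, F, H)
key_layer   = np.random.randn(B, N, T, H)
value_layer = np.random.randn(B, N, T, H)
attention_mask = np.ones((B, F, T))                  # 1 = attend, 0 = masked

attention_scores = query_layer @ key_layer.transpose(0, 1, 3, 2)   # [B, N, F, T]
attention_scores *= 1.0 / math.sqrt(float(H))
adder = (1.0 - attention_mask[:, None, :, :]) * -10000.0            # broadcast over the head axis
attention_scores = attention_scores + adder

# softmax over the "to" dimension
exp_scores = np.exp(attention_scores - attention_scores.max(axis=-1, keepdims=True))
attention_probs = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

context_layer = attention_probs @ value_layer         # [B, N, F, H]
context_layer = context_layer.transpose(0, 2, 1, 3)   # [B, F, N, H]
context_layer = context_layer.reshape(B, F, N * H)    # [B, F, N*H]
print(context_layer.shape)                             # (2, 4, 768)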
When the attention step finishes, the head outputs are collected:
attention_heads.append(attention_head)
attention_output = tf.concat(attention_heads, axis=-1)
The concatenated output then passes through one more dense projection (the attention "output" layer):
with tf.variable_scope("output"):
attention_output = tf.layers.dense(
attention_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
Feed-forward expansion layer ("intermediate"):
# The activation is only applied to the "intermediate" hidden layer.
with tf.variable_scope("intermediate"):
intermediate_output = tf.layers.dense(
attention_output,
intermediate_size,
activation=intermediate_act_fn,
kernel_initializer=create_initializer(initializer_range))
Then project back down to the original hidden size:
with tf.variable_scope("output"):
layer_output = tf.layers.dense(
intermediate_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
prev_output = layer_output
Then append it: all_layer_outputs.append(layer_output)
Finally, either all layers or only the last layer can be returned.
do_return_all_layers=True
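For orientation, a condensed numpy sketch of one encoder layer's data flow; the residual additions and layer_norm calls are part of the real transformer_model even though they are not shown in the excerpts above (weights are random, dropout omitted, and the attention output is stubbed):

import numpy as np

hidden_size, intermediate_size, num_hidden_layers = 768, 3072, 3

def dense(x, out_dim):
    w = np.random.randn(x.shape[-1], out_dim) * 0.02   # stand-in for tf.layers.dense
    return x @ w

def layer_norm(x, eps=1e-12):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

prev_output = np.random.randn(2 * 4, hidden_size)       # [B*S, hidden_size], the 2D layout
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
    layer_input = prev_output
    attention_output = layer_input                       # stub for the attention computation
    # "output" scope after attention: project, add residual, layer norm
    attention_output = layer_norm(dense(attention_output, hidden_size) + layer_input)
    # "intermediate" scope: expand with the GELU activation
    intermediate_output = gelu(dense(attention_output, intermediate_size))
    # second "output" scope: project back down, add residual, layer norm
    layer_output = layer_norm(dense(intermediate_output, hidden_size) + attention_output)
    prev_output = layer_output
    all_layer_outputs.append(layer_output)
print(all_layer_outputs[-1].shape)   # (8, 768)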
self.sequence_output = self.all_encoder_layers[-1] # [batch_size, seq_length, hidden_size]
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
Here only the first token of each sequence (the [CLS] token) is taken: [:, 0:1, :].