This is something I should have written up long ago. I've read through the source code a few times before but never kept detailed notes. The BERT source code is quite elegant, so this time I'm recording it for future reference.
First, the overall layout of the repository:
├── README.md
├── create_pretraining_data.py
├── extract_features.py
├── modeling.py
├── modeling_test.py
├── multilingual.md
├── optimization.py
├── optimization_test.py
├── predicting_movie_reviews_with_bert_on_tf_hub.ipynb
├── requirements.txt
├── run_classifier.py
├── run_classifier_with_tfhub.py
├── run_pretraining.py
├── run_squad.py
├── sample_text.txt
├── tokenization.py
└── tokenization_test.py
create_pretraining_data.py: builds the pre-training data;
extract_features.py: utilities for extracting/converting features;
modeling.py: the BERT model architecture;
modeling_test.py: unit tests for the model;
optimization.py: the optimizer;
optimization_test.py: unit tests for the optimizer;
run_classifier.py: classification example;
run_classifier_with_tfhub.py: same as above, but via TF Hub;
run_pretraining.py: runs pre-training;
run_squad.py: example on the SQuAD dataset;
tokenization.py: tokenization, text cleanup, etc.;
tokenization_test.py: unit tests for tokenization and text cleanup.
Let's start with modeling.py, the core model code, and go through it piece by piece:
class BertConfig(object):
"""Configuration for `BertModel`."""
def __init__(self,
vocab_size,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,  # 2 in the released bert_config.json
initializer_range=0.02):
"""Constructs BertConfig.
Args:
vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
hidden_size: Size of the encoder layers and the pooler layer.
num_hidden_layers: Number of hidden layers in the Transformer encoder.
num_attention_heads: Number of attention heads for each attention layer in
the Transformer encoder.
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
layer in the Transformer encoder.
hidden_act: The non-linear activation function (function or string) in the
encoder and pooler.
hidden_dropout_prob: The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob: The dropout ratio for the attention
probabilities.
max_position_embeddings: The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048).
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
`BertModel`.
initializer_range: The stdev of the truncated_normal_initializer for
initializing all weight matrices.
"""
There is not much to add here; the docstring explains each field. Note that type_vocab_size defaults to 16 here but is 2 in the released bert_config.json.
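In practice you rarely construct a BertConfig by hand; it is usually loaded from the bert_config.json that ships with a pretrained checkpoint (modeling.py provides BertConfig.from_json_file for this). A minimal sketch; the checkpoint directory name is just an example:
import modeling

config = modeling.BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.hidden_size)      # 768 for BERT-Base
print(config.type_vocab_size)  # 2 in the released configs, despite the default of 16 above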
Next, the BertModel class. First, the usage example from its docstring:
# Already been converted into WordPiece token ids
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
model = modeling.BertModel(config=config, is_training=True,
input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
label_embeddings = tf.get_variable(...)
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
input_ids has shape [batch_size, seq_len]; here batch_size=2 and seq_len=3.
input_mask: the two input sequences have real lengths 3 and 2 respectively.
token_type_ids: in the first sequence, the first two tokens belong to sentence A and the third to sentence B;
in the second sequence, the first token belongs to sentence A, the second has token type 2 (an arbitrary value from the docstring; in practice only 0 and 1 are used), and the third is padding.
Then a BertConfig is created with vocab_size=32000, hidden_size=512, num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024.
(These are toy values from the docstring; the real values should come from bert_config.json — for BERT-Base that means hidden_size=768, 12 layers, 12 heads, intermediate_size=3072.) One thing worth noting is that embedding_size equals hidden_size, both 768 in BERT-Base.
Now the BertModel class itself:
def __init__(self,
config,
is_training,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=False,
scope=None):
config: a BertConfig instance
is_training: True for training, False for eval; controls dropout
input_ids: int32 Tensor of shape [batch_size, seq_length]
input_mask: optional, same shape and dtype as input_ids
token_type_ids: optional, same shape and dtype as input_ids
use_one_hot_embeddings: optional; whether to fetch word embeddings via one-hot matrix multiplication (faster on TPU)
scope: optional variable scope name, defaults to "bert"
The constructor flow:
config = copy.deepcopy(config)  # deep-copy the config
if not is_training:  # outside training, disable dropout
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0
input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]
if input_mask is None:  # if input_mask is None, use all ones, i.e. no padding
input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
if token_type_ids is None:  # if token_type_ids is None, use all zeros, i.e. every token belongs to the first sentence
token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids (token embeddings).
(self.embedding_output, self.embedding_table) = embedding_lookup(
input_ids=input_ids,
vocab_size=config.vocab_size,
embedding_size=config.hidden_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
# Add positional embeddings and token type embeddings, then layer
# normalize and perform dropout.
self.embedding_output = embedding_postprocessor(
input_tensor=self.embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)
with tf.variable_scope("encoder"):
# This converts a 2D mask of shape [batch_size, seq_length] to a 3D
# mask of shape [batch_size, seq_length, seq_length] which is used
# for the attention scores.
attention_mask = create_attention_mask_from_input_mask(
input_ids, input_mask)
# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
# self.all_encoder_layers is a list of length num_hidden_layers (12 for BERT-Base);
# each element has shape [batch_size, seq_length, hidden_size].
self.all_encoder_layers = transformer_model(
input_tensor=self.embedding_output,
attention_mask=attention_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True)
self.sequence_output = self.all_encoder_layers[-1]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
# From the last-layer output self.sequence_output, take the [CLS] slice [:, 0:1, :], which has shape [batch_size, 1, hidden_size];
# squeeze away the middle dimension to get [batch_size, hidden_size], then apply a dense layer, keeping [batch_size, hidden_size].
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
Everything worth annotating is in the comments above.
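Before stepping through each piece, here is a minimal sketch of how downstream code typically consumes these outputs; the getter names are the ones defined on BertModel, while the classification head is purely hypothetical (and assumes `model` was built as in the usage example above):
sequence_output = model.get_sequence_output()    # [batch_size, seq_length, hidden_size], last encoder layer
pooled_output = model.get_pooled_output()        # [batch_size, hidden_size], tanh-transformed [CLS] vector
all_layers = model.get_all_encoder_layers()      # list of length num_hidden_layers

# Hypothetical two-class head on top of the pooled output:
logits = tf.layers.dense(pooled_output, 2, name="hypothetical_classifier")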
Now let's walk through the flow above step by step:
def embedding_lookup(input_ids,
vocab_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=False):
"""Looks up words embeddings for id tensor.
Args:
input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
ids.
vocab_size: int. Size of the embedding vocabulary.
embedding_size: int. Width of the word embeddings.
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.gather()`.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1])
embedding_table = tf.get_variable(
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))
flat_input_ids = tf.reshape(input_ids, [-1])
if use_one_hot_embeddings:
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
output = tf.matmul(one_hot_input_ids, embedding_table)
else:
output = tf.gather(embedding_table, flat_input_ids)
input_shape = get_shape_list(input_ids)
output = tf.reshape(output,
input_shape[0:-1] + [input_shape[-1] * embedding_size])
return (output, embedding_table)
The code is straightforward and the docstring describes the inputs and outputs. The use_one_hot_embeddings branch fetches word embeddings via matrix multiplication: the input ids are turned into one-hot vectors and multiplied by embedding_table. This is said to be faster on TPUs, whereas on CPU/GPU it is faster to index embedding_table by id with tf.gather (the same idea as tf.nn.embedding_lookup).
The function also adds an extra dimension to input_ids, which you can essentially ignore: after expand_dims the last dimension is always 1, and the final reshape restores the expected output shape.
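As a quick sanity check (under TF 1.x, like the rest of the repo), the two lookup paths give identical results; the toy table below is made up:
import numpy as np
import tensorflow as tf

flat_input_ids = tf.constant([3, 0, 2, 2])
# toy embedding table: vocab_size=5, embedding_size=4
embedding_table = tf.reshape(tf.range(20, dtype=tf.float32), [5, 4])

one_hot_ids = tf.one_hot(flat_input_ids, depth=5)
out_matmul = tf.matmul(one_hot_ids, embedding_table)      # the TPU-friendly path
out_gather = tf.gather(embedding_table, flat_input_ids)   # the CPU/GPU path

with tf.Session() as sess:
    a, b = sess.run([out_matmul, out_gather])
    print(np.allclose(a, b))  # True; both have shape [4, 4]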
With that done, let's look at embedding_postprocessor:
def embedding_postprocessor(input_tensor,
use_token_type=False,
token_type_ids=None,
token_type_vocab_size=16,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=0.02,
max_position_embeddings=512,
dropout_prob=0.1):
"""Performs various post-processing on a word embedding tensor.
Args:
input_tensor: float Tensor of shape [batch_size, seq_length,
embedding_size].
use_token_type: bool. Whether to add embeddings for `token_type_ids`.
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
Must be specified if `use_token_type` is True.
token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
token_type_embedding_name: string. The name of the embedding table variable
for token type ids.
use_position_embeddings: bool. Whether to add position embeddings for the
position of each token in the sequence.
position_embedding_name: string. The name of the embedding table variable
for positional embeddings.
initializer_range: float. Range of the weight initialization.
max_position_embeddings: int. Maximum sequence length that might ever be
used with this model. This can be longer than the sequence length of
input_tensor, but cannot be shorter.
dropout_prob: float. Dropout probability applied to the final output tensor.
Returns:
float tensor with same shape as `input_tensor`.
Raises:
ValueError: One of the tensor shapes or input values is invalid.
"""
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
width = input_shape[2]
output = input_tensor
if use_token_type:
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
token_type_table = tf.get_variable(
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.
# (token_type_vocab_size is tiny, usually 2, so the one-hot matmul against token_type_table is faster here.)
flat_token_type_ids = tf.reshape(token_type_ids, [-1])
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width])
output += token_type_embeddings
if use_position_embeddings:
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
# seq_length must not exceed max_position_embeddings
with tf.control_dependencies([assert_op]):
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
# Since the position embedding table is a learned variable, we create it
# using a (long) sequence length `max_position_embeddings`. The actual
# sequence length might be shorter than this, for faster training of
# tasks that do not have long sequences.
# The position table has shape [max_position_embeddings, width], but the actual seq_length is usually
# much shorter than 512, so for speed tf.slice keeps only the rows for positions [0, 1, ..., seq_length-1].
# So `full_position_embeddings` is effectively an embedding table
# for position [0, 1, 2, ..., max_position_embeddings-1], and the current
# sequence has positions [0, 1, 2, ... seq_length-1], so we can just
# perform a slice.
position_embeddings = tf.slice(full_position_embeddings, [0, 0],
[seq_length, -1])
num_dims = len(output.shape.as_list())
# Only the last two dimensions are relevant (`seq_length` and `width`), so
# we broadcast among the first dimensions, which is typically just
# the batch size.
# (The position embeddings have shape [seq_length, width] while the output is [batch_size, seq_length, width],
# so they are added via broadcasting.)
position_broadcast_shape = []
for _ in range(num_dims - 2):
position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])
position_embeddings = tf.reshape(position_embeddings,
position_broadcast_shape)
# position_embeddings now has shape [1, seq_length, width]
output += position_embeddings
output = layer_norm_and_dropout(output, dropout_prob)
return output
The detailed comments are above; in short, this function adds the token-type (segment) embeddings and position embeddings to the word embeddings, then applies layer normalization and dropout.
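To make the slice-and-broadcast step concrete, here is a small standalone sketch (toy sizes, TF 1.x) of what happens to the position embeddings:
import tensorflow as tf

batch_size, seq_length, width = 2, 3, 4
max_position_embeddings = 8

output = tf.zeros([batch_size, seq_length, width])  # stands in for word + token-type embeddings
full_position_embeddings = tf.reshape(
    tf.range(max_position_embeddings * width, dtype=tf.float32),
    [max_position_embeddings, width])

# keep only the rows for positions 0 .. seq_length-1
position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])
# reshape to [1, seq_length, width] so the addition broadcasts over the batch dimension
position_embeddings = tf.reshape(position_embeddings, [1, seq_length, width])

output += position_embeddings

with tf.Session() as sess:
    print(sess.run(output).shape)  # (2, 3, 4); every example gets the same position vectors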
Next, attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask). Before explaining it, a brief note on masks: **the Transformer involves two kinds of mask, the padding mask and the sequence mask.** The padding mask is needed in every scaled dot-product attention, while the sequence mask is only used in the decoder's self-attention.
(1) Padding mask: the sequences in a batch have different lengths, so they have to be aligned. Shorter sequences are padded with 0 at the end, and sequences that are too long are truncated (the left part is kept, the rest discarded). The padded positions carry no information, so attention should not be placed on them. The trick is to add a very large negative number (effectively negative infinity) to the attention scores at those positions; after softmax, their probabilities become close to 0.
(2) Sequence mask: the sequence mask prevents the decoder from seeing the future. For a sequence at time step t, the decoder's output may depend only on the outputs before t, not on anything after it. A simple way to hide the future is a lower-triangular matrix applied to every sequence.
For the decoder's self-attention, the scaled dot-product attention needs both the padding mask and the sequence mask as attn_mask; in practice the two masks are combined (added) to form attn_mask. In all other cases, attn_mask is simply the padding mask.
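BERT is encoder-only, so only the padding mask matters here. A hedged sketch: later, in attention_layer, the 0/1 attention mask is turned into an additive bias of roughly (1.0 - mask) * -10000.0 before the softmax; the lower-triangular sequence mask below is shown only for comparison and is not used in BERT:
import tensorflow as tf

seq_length = 4
input_mask = tf.constant([[1., 1., 1., 0.]])  # [batch=1, seq_length], last token is padding
attention_mask = tf.ones([1, seq_length, 1]) * tf.reshape(input_mask, [1, 1, seq_length])

adder = (1.0 - attention_mask) * -10000.0  # big negative bias on padded key positions
sequence_mask = tf.matrix_band_part(tf.ones([seq_length, seq_length]), -1, 0)  # decoder-style lower triangle

with tf.Session() as sess:
    print(sess.run(adder)[0])        # last column is -10000 for every query position
    print(sess.run(sequence_mask))   # row t has ones only at positions <= t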
Here is an example of what this function does:
Input:
input_ids=[
[1,2,3,0,0],
[1,3,5,6,1]
]
input_mask=[
[1,1,1,0,0],
[1,1,1,1,1]
]
Output (shape [batch_size, seq_length, seq_length]): the first seq_length axis indexes the query token, the second indexes the tokens it can attend to (1 = can attend, 0 = cannot):
[
[1, 1, 1, 0, 0], # token 1 can attend to the first 3 tokens
[1, 1, 1, 0, 0], # token 2 can attend to the first 3 tokens
[1, 1, 1, 0, 0], # token 3 can attend to the first 3 tokens
[1, 1, 1, 0, 0], # meaningless: input token 4 is padding
[1, 1, 1, 0, 0] # meaningless: input token 5 is padding
]
[
[1, 1, 1, 1, 1], # token 1 can attend to all 5 tokens
[1, 1, 1, 1, 1], # token 2 can attend to all 5 tokens
[1, 1, 1, 1, 1], # token 3 can attend to all 5 tokens
[1, 1, 1, 1, 1], # token 4 can attend to all 5 tokens
[1, 1, 1, 1, 1] # token 5 can attend to all 5 tokens
]
def create_attention_mask_from_input_mask(from_tensor, to_mask):
"""Create 3D attention mask from a 2D tensor mask.
Args:
from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
to_mask: int32 Tensor of shape [batch_size, to_seq_length].
Returns:
float Tensor of shape [batch_size, from_seq_length, to_seq_length].
"""
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_shape = get_shape_list(to_mask, expected_rank=2)
to_seq_length = to_shape[1]
to_mask = tf.cast(
tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
# We don't assume that `from_tensor` is a mask (although it could be). We
# don't actually care if we attend *from* padding tokens (only *to* padding)
# tokens so we create a tensor of all ones.
#
# `broadcast_ones` = [batch_size, from_seq_length, 1]
broadcast_ones = tf.ones(
shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
# Here we broadcast along two dimensions to create the mask.
mask = broadcast_ones * to_mask
return mask
Once you see the code, it is simple. Note that the `*` here is element-wise multiplication with broadcasting: [batch_size, from_seq_length, 1] * [batch_size, 1, to_seq_length] broadcasts to [batch_size, from_seq_length, to_seq_length], which for each example is just the outer product of the ones column and the mask row.
Here is another implementation, from Su Jianlin (苏神). With 0 as the padding id (and 1 usually being the UNK id), it derives the mask directly from the token ids: positions whose id is greater than 0 are kept, so only padding is masked; masking UNK as well would simply mean comparing against 1 instead of 0.
# x is the matrix of token ids
mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(x)
It is a neat trick: it replaces both the broadcast_ones multiplication and the separate input_mask above.
Let's compare the two implementations:
def test_mask1():
    from keras.layers import Lambda
    from keras import backend as K
    input_ids = tf.constant([
        [11, 2, 3, 0, 0],
        [11, 3, 5, 6, 0]
    ], dtype=tf.float32)
    mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(input_ids)
    # mask has shape [batch_size, seq_length, 1]
    input_a = K.expand_dims(input_ids, 1)
    # input_ids expanded to [batch_size, 1, seq_length]
    with tf.Session() as sess:
        # print(sess.run(mask))
        print(sess.run(mask * input_a))
        # the product has shape [batch_size, seq_length, seq_length]
def test_mask2():
    input_ids = tf.constant([
        [11, 2, 3, 0, 0],
        [1, 3, 5, 6, 0]
    ])
    input_mask = tf.constant([
        [1, 1, 1, 0, 0],
        [1, 1, 1, 1, 0]
    ])
    batch_size = input_ids.shape[0]
    seq_length = input_ids.shape[1]
    to_mask = tf.cast(tf.reshape(input_mask, [batch_size, 1, seq_length]), tf.float32)
    broadcast_ones = tf.ones(shape=[batch_size, seq_length, 1], dtype=tf.float32)
    mask = broadcast_ones * to_mask
    with tf.Session() as sess:
        print(sess.run(mask))
Output of the first approach:
[[[11. 2. 3. 0. 0.]
[11. 2. 3. 0. 0.]
[11. 2. 3. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
[[11. 3. 5. 6. 0.]
[11. 3. 5. 6. 0.]
[11. 3. 5. 6. 0.]
[11. 3. 5. 6. 0.]
[ 0. 0. 0. 0. 0.]]]
Output of the second approach:
[[[1. 1. 1. 0. 0.]
[1. 1. 1. 0. 0.]
[1. 1. 1. 0. 0.]
[1. 1. 1. 0. 0.]
[1. 1. 1. 0. 0.]]
[[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 0.]]]
Comparing the two, the first approach seems a bit nicer: the mask comes straight from the token ids, with no separate input_mask needed (the first printout shows raw id values rather than 0/1 only because it prints mask * input_a).
That is all for this part; the post is already quite long, so the rest continues in part two (Bert源码注解(二)).