This is something I should have written up long ago. I've read through the source code a few times before but never kept detailed notes. The BERT source code is quite elegant, so this time I'm recording it for future reference.
First, the overall layout of the repository:
├── README.md
├── create_pretraining_data.py
├── extract_features.py
├── modeling.py
├── modeling_test.py
├── multilingual.md
├── optimization.py
├── optimization_test.py
├── predicting_movie_reviews_with_bert_on_tf_hub.ipynb
├── requirements.txt
├── run_classifier.py
├── run_classifier_with_tfhub.py
├── run_pretraining.py
├── run_squad.py
├── sample_text.txt
├── tokenization.py
└── tokenization_test.py
create_pretraining_data.py: builds the pre-training data;
extract_features.py: utilities for extracting/converting features;
modeling.py: the BERT model architecture;
modeling_test.py: unit tests for the model;
optimization.py: the optimizer;
optimization_test.py: unit tests for the optimizer;
run_classifier.py: classification example;
run_classifier_with_tfhub.py: same as above, but via TF Hub;
run_pretraining.py: runs pre-training;
run_squad.py: example on the SQuAD dataset;
tokenization.py: tokenization, text cleanup, etc.;
tokenization_test.py: unit tests for tokenization and text cleanup.
Let's start with modeling.py, the core model code, and go through it piece by piece:
class BertConfig(object):
"""Configuration for `BertModel`."""
def __init__(self,
vocab_size,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,  # 2 in the released bert_config.json
initializer_range=0.02):
"""Constructs BertConfig.
Args:
vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
hidden_size: Size of the encoder layers and the pooler layer.
num_hidden_layers: Number of hidden layers in the Transformer encoder.
num_attention_heads: Number of attention heads for each attention layer in
the Transformer encoder.
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
layer in the Transformer encoder.
hidden_act: The non-linear activation function (function or string) in the
encoder and pooler.
hidden_dropout_prob: The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob: The dropout ratio for the attention
probabilities.
max_position_embeddings: The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048).
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
`BertModel`.
initializer_range: The stdev of the truncated_normal_initializer for
initializing all weight matrices.
"""
There is not much to add here; the docstring explains each field. Note that type_vocab_size defaults to 16 here but is 2 in the released bert_config.json.
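In practice you rarely construct a BertConfig by hand; it is usually loaded from the bert_config.json that ships with a pretrained checkpoint (modeling.py provides BertConfig.from_json_file for this). A minimal sketch; the checkpoint directory name is just an example:
import modeling

config = modeling.BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.hidden_size)      # 768 for BERT-Base
print(config.type_vocab_size)  # 2 in the released configs, despite the default of 16 above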
Next, the BertModel class. First, the usage example from its docstring:
# Already been converted into WordPiece token ids
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
model = modeling.BertModel(config=config, is_training=True,
input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
label_embeddings = tf.get_variable(...)
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
input_ids has shape [batch_size, seq_len]; here batch_size=2 and seq_len=3.
input_mask: the two input sequences have real lengths 3 and 2 respectively.
token_type_ids: in the first sequence, the first two tokens belong to sentence A and the third to sentence B;
in the second sequence, the first token belongs to sentence A, the second has token type 2 (an arbitrary value from the docstring; in practice only 0 and 1 are used), and the third is padding.
Then a BertConfig is created with vocab_size=32000, hidden_size=512, num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024.
(These are toy values from the docstring; the real values should come from bert_config.json — for BERT-Base that means hidden_size=768, 12 layers, 12 heads, intermediate_size=3072.) One thing worth noting is that embedding_size equals hidden_size, both 768 in BERT-Base.
Now the BertModel class itself:
def __init__(self,
config,
is_training,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=False,
scope=None):
config: a BertConfig instance
is_training: True for training, False for eval; controls dropout
input_ids: int32 Tensor of shape [batch_size, seq_length]
input_mask: optional, same shape and dtype as input_ids
token_type_ids: optional, same shape and dtype as input_ids
use_one_hot_embeddings: optional; whether to fetch word embeddings via one-hot matrix multiplication (faster on TPU)
scope: optional variable scope name, defaults to "bert"
The constructor flow:
config = copy.deepcopy(config)  # deep-copy the config
if not is_training:  # outside training, disable dropout
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0
input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]
if input_mask is None:  # if input_mask is None, use all ones, i.e. no padding
input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
if token_type_ids is None:  # if token_type_ids is None, use all zeros, i.e. every token belongs to the first sentence
token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids (token embeddings).
(self.embedding_output, self.embedding_table) = embedding_lookup(
input_ids=input_ids,
vocab_size=config.vocab_size,
embedding_size=config.hidden_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
# Add positional embeddings and token type embeddings, then layer
# normalize and perform dropout.
self.embedding_output = embedding_postprocessor(
input_tensor=self.embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)
with tf.variable_scope("encoder"):
# This converts a 2D mask of shape [batch_size, seq_length] to a 3D
# mask of shape [batch_size, seq_length, seq_length] which is used
# for the attention scores.
attention_mask = create_attention_mask_from_input_mask(
input_ids, input_mask)
# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
# self.all_encoder_layers is a list of length num_hidden_layers (12 for BERT-Base);
# each element has shape [batch_size, seq_length, hidden_size].
self.all_encoder_layers = transformer_model(
input_tensor=self.embedding_output,
attention_mask=attention_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True)
self.sequence_output = self.all_encoder_layers[-1]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
# From the last-layer output self.sequence_output, take the [CLS] slice [:, 0:1, :], which has shape [batch_size, 1, hidden_size];
# squeeze away the middle dimension to get [batch_size, hidden_size], then apply a dense layer, keeping [batch_size, hidden_size].
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
Everything worth annotating is in the comments above.
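Before stepping through each piece, here is a minimal sketch of how downstream code typically consumes these outputs; the getter names are the ones defined on BertModel, while the classification head is purely hypothetical (and assumes `model` was built as in the usage example above):
sequence_output = model.get_sequence_output()    # [batch_size, seq_length, hidden_size], last encoder layer
pooled_output = model.get_pooled_output()        # [batch_size, hidden_size], tanh-transformed [CLS] vector
all_layers = model.get_all_encoder_layers()      # list of length num_hidden_layers

# Hypothetical two-class head on top of the pooled output:
logits = tf.layers.dense(pooled_output, 2, name="hypothetical_classifier")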
Now let's walk through the flow above step by step:
def embedding_lookup(input_ids,
vocab_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=False):
"""Looks up words embeddings for id tensor.
Args:
input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
ids.
vocab_size: int. Size of the embedding vocabulary.
embedding_size: int. Width of the word embeddings.
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.gather()`.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1])
embedding_table = tf.get_variable(
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))
flat_input_ids = tf.reshape(input_ids, [-1])
if use_one_hot_embeddings:
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
output = tf.matmul(one_hot_input_ids, embedding_table)
else:
output = tf.gather(embedding_table, flat_input_ids)
input_shape = get_shape_list(input_ids)
output = tf.reshape(output,
input_shape[0:-1] + [input_shape[-1] * embedding_size])
return (output, embedding_table)
The code is straightforward and the docstring describes the inputs and outputs. The use_one_hot_embeddings branch fetches word embeddings via matrix multiplication: the input ids are turned into one-hot vectors and multiplied by embedding_table. This is said to be faster on TPUs, whereas on CPU/GPU it is faster to index embedding_table by id with tf.gather (the same idea as tf.nn.embedding_lookup).
The function also adds an extra dimension to input_ids, which you can essentially ignore: after expand_dims the last dimension is always 1, and the final reshape restores the expected output shape.
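As a quick sanity check (under TF 1.x, like the rest of the repo), the two lookup paths give identical results; the toy table below is made up:
import numpy as np
import tensorflow as tf

flat_input_ids = tf.constant([3, 0, 2, 2])
# toy embedding table: vocab_size=5, embedding_size=4
embedding_table = tf.reshape(tf.range(20, dtype=tf.float32), [5, 4])

one_hot_ids = tf.one_hot(flat_input_ids, depth=5)
out_matmul = tf.matmul(one_hot_ids, embedding_table)      # the TPU-friendly path
out_gather = tf.gather(embedding_table, flat_input_ids)   # the CPU/GPU path

with tf.Session() as sess:
    a, b = sess.run([out_matmul, out_gather])
    print(np.allclose(a, b))  # True; both have shape [4, 4]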
With that done, let's look at embedding_postprocessor:
def embedding_postprocessor(input_tensor,
use_token_type=False,
token_type_ids=None,
token_type_vocab_size=16,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=0.02,
max_position_embeddings=512,
dropout_prob=0.1):
"""Performs various post-processing on a word embedding tensor.
Args:
input_tensor: float Tensor of shape [batch_size, seq_length,
embedding_size].
use_token_type: bool. Whether to add embeddings for `token_type_ids`.
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
Must be specified if `use_token_type` is True.
token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
token_type_embedding_name: string. The name of the embedding table variable
for token type ids.
use_position_embeddings: bool. Whether to add position embeddings for the
position of each token in the sequence.
position_embedding_name: string. The name of the embedding table variable
for positional embeddings.
initializer_range: float. Range of the weight initialization.
max_position_embeddings: int. Maximum sequence length that might ever be
used with this model. This can be longer than the sequence length of
input_tensor, but cannot be shorter.
dropout_prob: float. Dropout probability applied to the final output tensor.
Returns:
float tensor with same shape as `input_tensor`.
Raises:
ValueError: One of the tensor shapes or input values is invalid.
"""
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
width = input_shape[2]
output = input_tensor
if use_token_type:
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
token_type_table = tf.get_variable(
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.
# (token_type_vocab_size is tiny, usually 2, so the one-hot matmul against token_type_table is faster here.)
flat_token_type_ids = tf.reshape(token_type_ids, [-1])
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width])
output += token_type_embeddings
if use_position_embeddings:
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
# seq_length must not exceed max_position_embeddings
with tf.control_dependencies([assert_op]):
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
# Since the position embedding table is a learned variable, we create it
# using a (long) sequence length `max_position_embeddings`. The actual
# sequence length might be shorter than this, for faster training of
# tasks that do not have long sequences.
# The position table has shape [max_position_embeddings, width], but the actual seq_length is usually
# much shorter than 512, so for speed tf.slice keeps only the rows for positions [0, 1, ..., seq_length-1].
# So `full_position_embeddings` is effectively an embedding table
# for position [0, 1, 2, ..., max_position_embeddings-1], and the current
# sequence has positions [0, 1, 2, ... seq_length-1], so we can just
# perform a slice.
position_embeddings = tf.slice(full_position_embeddings, [0, 0],
[seq_length, -1])
num_dims = len(output.shape.as_list())
# Only the last two dimensions are relevant (`seq_length` and `width`), so
# we broadcast among the first dimensions, which is typically just
# the batch size.
# (The position embeddings have shape [seq_length, width] while the output is [batch_size, seq_length, width],
# so they are added via broadcasting.)
position_broadcast_shape = []
for _ in range(num_dims - 2):
position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])
position_embeddings = tf.reshape(position_embeddings,
position_broadcast_shape)
# position_embeddings now has shape [1, seq_length, width]
output += position_embeddings
output = layer_norm_and_dropout(output, dropout_prob)
return output
The detailed comments are above; in short, this function adds the token-type (segment) embeddings and position embeddings to the word embeddings, then applies layer normalization and dropout.
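To make the slice-and-broadcast step concrete, here is a small standalone sketch (toy sizes, TF 1.x) of what happens to the position embeddings:
import tensorflow as tf

batch_size, seq_length, width = 2, 3, 4
max_position_embeddings = 8

output = tf.zeros([batch_size, seq_length, width])  # stands in for word + token-type embeddings
full_position_embeddings = tf.reshape(
    tf.range(max_position_embeddings * width, dtype=tf.float32),
    [max_position_embeddings, width])

# keep only the rows for positions 0 .. seq_length-1
position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])
# reshape to [1, seq_length, width] so the addition broadcasts over the batch dimension
position_embeddings = tf.reshape(position_embeddings, [1, seq_length, width])

output += position_embeddings

with tf.Session() as sess:
    print(sess.run(output).shape)  # (2, 3, 4); every example gets the same position vectors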
Next, attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask). Before explaining it, a brief note on masks: **the Transformer involves two kinds of mask, the padding mask and the sequence mask.** The padding mask is needed in every scaled dot-product attention, while the sequence mask is only used in the decoder's self-attention.
(1) Padding mask: the sequences in a batch have different lengths, so they have to be aligned. Shorter sequences are padded with 0 at the end, and sequences that are too long are truncated (the left part is kept, the rest discarded). The padded positions carry no information, so attention should not be placed on them. The trick is to add a very large negative number (effectively negative infinity) to the attention scores at those positions; after softmax, their probabilities become close to 0.
(2) Sequence mask: the sequence mask prevents the decoder from seeing the future. For a sequence at time step t, the decoder's output may depend only on the outputs before t, not on anything after it. A simple way to hide the future is a lower-triangular matrix applied to every sequence.
For the decoder's self-attention, the scaled dot-product attention needs both the padding mask and the sequence mask as attn_mask; in practice the two masks are combined (added) to form attn_mask. In all other cases, attn_mask is simply the padding mask.
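BERT is encoder-only, so only the padding mask matters here. A hedged sketch: later, in attention_layer, the 0/1 attention mask is turned into an additive bias of roughly (1.0 - mask) * -10000.0 before the softmax; the lower-triangular sequence mask below is shown only for comparison and is not used in BERT:
import tensorflow as tf

seq_length = 4
input_mask = tf.constant([[1., 1., 1., 0.]])  # [batch=1, seq_length], last token is padding
attention_mask = tf.ones([1, seq_length, 1]) * tf.reshape(input_mask, [1, 1, seq_length])

adder = (1.0 - attention_mask) * -10000.0  # big negative bias on padded key positions
sequence_mask = tf.matrix_band_part(tf.ones([seq_length, seq_length]), -1, 0)  # decoder-style lower triangle

with tf.Session() as sess:
    print(sess.run(adder)[0])        # last column is -10000 for every query position
    print(sess.run(sequence_mask))   # row t has ones only at positions <= t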
Here is an example of what this function does:
Input:
input_ids=[
[1,2,3,0,0],
[1,3,5,6,1]
]
input_mask=[
[1,1,1,0,0],
[1,1,1,1,1]
]
Output (shape [batch_size, seq_length, seq_length]): the first seq_length axis indexes the query token, the second indexes the tokens it can attend to (1 = can attend, 0 = cannot):
[
[1, 1, 1, 0, 0], # token 1 can attend to the first 3 tokens
[1, 1, 1, 0, 0], # token 2 can attend to the first 3 tokens
[1, 1, 1, 0, 0], # token 3 can attend to the first 3 tokens
[1, 1, 1, 0, 0], # meaningless: input token 4 is padding
[1, 1, 1, 0, 0] # meaningless: input token 5 is padding
]
[
[1, 1, 1, 1, 1], # token 1 can attend to all 5 tokens
[1, 1, 1, 1, 1], # token 2 can attend to all 5 tokens
[1, 1, 1, 1, 1], # token 3 can attend to all 5 tokens
[1, 1, 1, 1, 1], # token 4 can attend to all 5 tokens
[1, 1, 1, 1, 1] # token 5 can attend to all 5 tokens
]
def create_attention_mask_from_input_mask(from_tensor, to_mask):
"""Create 3D attention mask from a 2D tensor mask.
Args:
from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
to_mask: int32 Tensor of shape [batch_size, to_seq_length].
Returns:
float Tensor of shape [batch_size, from_seq_length, to_seq_length].
"""
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_shape = get_shape_list(to_mask, expected_rank=2)
to_seq_length = to_shape[1]
to_mask = tf.cast(
tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
# We don't assume that `from_tensor` is a mask (although it could be). We
# don't actually care if we attend *from* padding tokens (only *to* padding)
# tokens so we create a tensor of all ones.
#
# `broadcast_ones` = [batch_size, from_seq_length, 1]
broadcast_ones = tf.ones(
shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
# Here we broadcast along two dimensions to create the mask.
mask = broadcast_ones * to_mask
return mask
Once you see the code, it is simple. Note that the `*` here is element-wise multiplication with broadcasting: [batch_size, from_seq_length, 1] * [batch_size, 1, to_seq_length] broadcasts to [batch_size, from_seq_length, to_seq_length], which for each example is just the outer product of the ones column and the mask row.
Here is another implementation, from Su Jianlin (苏神). With 0 as the padding id (and 1 usually being the UNK id), it derives the mask directly from the token ids: positions whose id is greater than 0 are kept, so only padding is masked; masking UNK as well would simply mean comparing against 1 instead of 0.
# x is the matrix of token ids
mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(x)
It is a neat trick: it replaces both the broadcast_ones multiplication and the separate input_mask above.
Let's compare the two implementations:
def test_mask1():
    from keras.layers import Lambda
    from keras import backend as K
    input_ids = tf.constant([
        [11, 2, 3, 0, 0],
        [11, 3, 5, 6, 0]
    ], dtype=tf.float32)
    mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(input_ids)
    # mask has shape [batch_size, seq_length, 1]
    input_a = K.expand_dims(input_ids, 1)
    # input_ids expanded to [batch_size, 1, seq_length]
    with tf.Session() as sess:
        # print(sess.run(mask))
        print(sess.run(mask * input_a))
        # the product has shape [batch_size, seq_length, seq_length]
def test_mask2():
    input_ids = tf.constant([
        [11, 2, 3, 0, 0],
        [1, 3, 5, 6, 0]
    ])
    input_mask = tf.constant([
        [1, 1, 1, 0, 0],
        [1, 1, 1, 1, 0]
    ])
    batch_size = input_ids.shape[0]
    seq_length = input_ids.shape[1]
    to_mask = tf.cast(tf.reshape(input_mask, [batch_size, 1, seq_length]), tf.float32)
    broadcast_ones = tf.ones(shape=[batch_size, seq_length, 1], dtype=tf.float32)
    mask = broadcast_ones * to_mask
    with tf.Session() as sess:
        print(sess.run(mask))
Output of the first approach:
[[[11. 2. 3. 0. 0.]
[11. 2. 3. 0. 0.]
[11. 2. 3. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
[[11. 3. 5. 6. 0.]
[11. 3. 5. 6. 0.]
[11. 3. 5. 6. 0.]
[11. 3. 5. 6. 0.]
[ 0. 0. 0. 0. 0.]]]
Output of the second approach:
[[[1. 1. 1. 0. 0.]
[1. 1. 1. 0. 0.]
[1. 1. 1. 0. 0.]
[1. 1. 1. 0. 0.]
[1. 1. 1. 0. 0.]]
[[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 0.]
[1. 1. 1. 1. 0.]]]
Comparing the two, the first approach seems a bit nicer: the mask comes straight from the token ids, with no separate input_mask needed (the first printout shows raw id values rather than 0/1 only because it prints mask * input_a).
That is all for this part; the post is already quite long, so the rest continues in part two (Bert源码注解(二)).