Bert源码解析笔记（二）：模型主体

最新推荐文章于 2024-07-02 09:59:22 发布

vinojie

最新推荐文章于 2024-07-02 09:59:22 发布

阅读量573

点赞数

分类专栏：自然语言处理文章标签：深度学习 python 自然语言处理

本文链接：https://blog.csdn.net/vinojie/article/details/107024889

版权

自然语言处理专栏收录该内容

5 篇文章 0 订阅

订阅专栏

Bert模型是基于Transformer架构的（论文：Attention is all you need），它在处理Seq2Seq问题的时候，直接利用注意力机制代替了传统RNN，LSTM，RNN等的固有模式，这些之前固有的模式有个问题就是计算输出的时候不能并行计算，所以Transformner的比之前固有的模式优势在于靠attention机制，不使用RNN，CNN等，并行度高，通过attention，抓长距离依赖关系比RNN强。Bert大火却不懂Transformer？这篇文章关于Tansformer说的比较好。Bert模型就是建立在Transformer架构上，下面介绍Bert模型源码的主体部分modeling.py.

1、模型配置参数

class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
      hidden_size: Size of the encoder layers and the pooler layer.
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

模型配置依次是：词典大小、隐层神经元个数、transformer的层数、attention的头数、激活函数、中间层神经元个数、隐层dropout比例、attention里面dropout比例、sequence最大长度、token_type_ids的词典大小、truncated_normal_initializer的stdev。
源码里有关于参数的解释，没什么好多说的。

Bert主要流程是输入(也就是上篇说的预训练部分的输出）先经过embedding（包括postion和token_type的embedding），然后调用Transformer得到输出结构，其中其中embedding、embedding_table、所有transformer层输出、最后transformer层输出以及pooled_output都可以获得，用于迁移学习的fine-tune和预测任务；

2、word embedding

构造一个词嵌入矩阵embedding_table（在训练过程中利用梯度进行迭代变化的）,进行word embedding词嵌入，比如说原始文本是N个单词，词典大小为M，那么对N个单词进行独热编码后矩阵为NM，也就是源码中的one_hot_input_ids，然后初始化一个矩阵也就是源码中的embedding_tableW为MK，K<<M,emdedding_table就是，它们乘积好就是我们的output，也就是词嵌入的结果。其中embedding_table是要训练的的。

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

3、词向量的后续处理

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]
  output = input_tensor
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings
  output = layer_norm_and_dropout(output, dropout_prob)
  return output

输入参数解释：

input_tensor：浮点型张量，形状为[batch_size，seq_length， embedding_size]。
use_token_type：布尔。是否为token_type_ids添加嵌入。
token_type_ids ：（可选）int32张量，形状为[batch_size，seq_length]。如果use_token_type为True，则必须指定。
token_type_vocab_size：整数。 “ token_type_ids”的词汇量。
token_type_embedding_name：字符串。嵌入表变量的名称令牌类型ID。
use_position_embeddings：布尔。是否添加广告的位置嵌入每个令牌在序列中的位置。
position_embedding_name：字符串。嵌入变量的名称用于位置嵌入。
initializer_range：浮动。权重初始化的范围。
max_position_embeddings：整数。最大序列长度可能是与该模型一起使用。该长度可以长于input_tensor，但不能更短。
dropout_prob：浮动。落差概率应用于最终输出张量。

在这里插入图片描述
这里所做的事情就是上面这幅图的Segment Embeddings 和 Postion EMbeddings,Token Embeddings是第二节word embeding中的函数embedding_lookup所做的事。总之这里所做的事就是在词向量的基础上添加word的位置信息和对应的token type等信息，并且最后进行了正则化和dropout之后返回。其中Segment embedding 就是每个序列的词向量中属于同一分句的词做标记，默认最多的分句为16，（要注意的是，我们的序列总共识512个词，这其中有很多分句，给word做一个对应分句的标记），position embeddings容易理解，就给每词在词向量中位置作个标记，在Transformer论文中的position embedding是由sin/cos函数生成的固定的值，而在这里代码实现中是跟普通word embedding一样随机生成的，可以训练的。作者这里这样选择的原因可能是BERT训练的数据比Transformer那篇大很多，完全可以让模型自己去学习。

4、构造attention mask

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]
  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]
  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
  mask = broadcast_ones * to_mask
  return mask

将shape为[batch_size, to_seq_length]的2D mask转换为一个shape 为[batch_size, from_seq_length, to_seq_length] 的3D mask用于attention当中。

5、attention layer

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    attention_scores += adder

  attention_probs = tf.nn.softmax(attention_scores)

  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*V]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*V]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer

transformer的主要内容都在这里，输入的from_tensor当做query，to_tensor当做key和value，self-attention时，from_tensor和to_tensor是同一个值。
（1）函数一开始对输入的shape进行校验，获取batch_size、from_seq_length 、to_seq_length 。输入如果是3D张量则转化成2D矩阵(以输入为word_embedding为例[batch_size, seq_lenth, hidden_size] -> [batch_size*seq_lenth, hidden_size])。
（2）通过全连接线性投影生成query_layer、key_layer 、value_layer，输出的第二个维度变成num_attention_heads * size_per_head（整个模型默认hidden_size=num_attention_heads * size_per_head）。然后通过transpose_for_scores转换成多头。
（3）根据公式计算attention_probs（attention score）：
在这里插入图片描述
如果attention_mask is not None，对mask的部分加上一个很大的负数，这样softmax之后相应的概率值接近为0，再dropout。
（4）最后再将value和attention_probs相乘，返回3D张量或者2D矩阵。

该函数相比其他版本的的transformer很多地方都有简化，有以下四点：
（1）缺少scale的操作；

（2）没有Causality mask，个人猜测主要是bert没有decoder的操作，所以对角矩阵mask是不需要的，从另一方面来说正好体现了双向transformer的特点；

（3）没有query mask。跟（2）理由类似，encoder都是self attention，query和key相同所以只需要一次key mask就够了

（4）没有query的Residual层和normalize.

6、Transformer

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          attention_output = tf.concat(attention_heads, axis=-1)
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

transformer是对attention的利用，分以下几步：

（1）计算attention_head_size，attention_head_size = int(hidden_size / num_attention_heads)即将隐层的输出等分给各个attention头。然后将input_tensor转换成2D矩阵；

（2）对input_tensor进行多头attention操作，再做：线性投影——dropout——layer norm——intermediate线性投影——线性投影——dropout——attention_output的residual——layer norm

intermediate也就是论文中的feedforward层，intermediate线性投影的hidden_size可以自行指定，其他层的线性投影hidden_size需要统一，目的是为了对齐。

（3）如此循环计算若干次，且保存每一次的输出，最后返回所有层的输出或者最后一层的输出。

这里的transformer只存在encoder，而不存在decoder操作，所有层的multi-head attention操作都是self encoder。
在这里插入图片描述

7、模型入口BertModel

“Transformer的双向编码表示“
在这里插入图片描述

(1）设置各种参数，如果input_mask为None的话，就指定所有input_mask值为1，即不进行过滤；如果token_type_ids是None的话，就指定所有token_type_ids值为0；

(2）对输入的input_ids进行embedding操作，再embedding_postprocessor操作，前面我们说了。主要是加入位置和token_type信息到词向量里面；

(3）转换attention_mask 后，通过调用transformer_model进行encoder操作；

(4）获取最后一层的输出sequence_output和pooled_output，pooled_output是取sequence_output的第一个切片然后线性投影获得（可以用于分类问题）

class BertModel(object):
  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=True,
               scope=None):
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]
      with tf.variable_scope("pooler"):
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

总结：

Bert的主要流程是先embedding（包括segment embeddings 和 position embeddibngs）,然后调用transformer的编码器得到输出结果，其中其中embedding、embedding_table、所有transformer层输出、最后transformer层输出以及pooled_output都可以获得，用于迁移学习的fine-tune和预测任务；
bert对于transformer的使用仅限于encoder，没有decoder的过程。这是因为模型存粹是为了预训练服务，而预训练是通过语言模型，不同于NLP其他特定任务。在做迁移学习时可以自行添加；
正因为没有decoder的操作，所以在attention函数里面也相应地减少了很多不必要的功能。

vinojie

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Bert源码解析笔记（二）：模型主体

Bert模型是基于Transformer架构的（论文：Attention is all you need），它在处理Seq2Seq问题的时候，直接利用注意力机制代替了传统RNN，LSTM，RNN等的固有模式，这些之前固有的模式有个问题就是计算输出的时候不能并行计算，所以Transformner的比之前固有的模式优势在于靠attention机制，不使用RNN，CNN等，并行度高，通过attention，抓长距离依赖关系比RNN强。Bert大火却不懂Transformer？这篇文章关于Tansformer说的比
复制链接

扫一扫

专栏目录