Explaining the input_mask parameter in the BERT source code

will-wil

Category: NLP study notes. Tags: natural language processing, neural networks

Article link: https://blog.csdn.net/yangyanbao8389/article/details/116795360

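This post walks through BERT's TensorFlow source code to explain how the input_mask parameter is used. All code shown here is taken from the parts of the BERT source that involve input_mask.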

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  """Create 3D attention mask from a 2D tensor mask.

  Args:
    from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
    to_mask: int32 Tensor of shape [batch_size, to_seq_length].

  Returns:
    float Tensor of shape [batch_size, from_seq_length, to_seq_length].
  """
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # We don't assume that `from_tensor` is a mask (although it could be). We
  # don't actually care if we attend *from* padding tokens (only *to* padding)
  # tokens so we create a tensor of all ones.
  #
  # `broadcast_ones` = [batch_size, from_seq_length, 1]
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  # Here we broadcast along two dimensions to create the mask.
  mask = broadcast_ones * to_mask

  return mask

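to_mask (that is, input_mask) has shape [batch, seq_length], with 1 at positions that should be attended to and 0 at positions that should not (padding). The create_attention_mask_from_input_mask function converts it to shape [batch, seq_length, seq_length]: the mask is first reshaped to [batch, 1, seq_length], and broadcasting against a tensor of ones then effectively copies it seq_length times along dimension 1.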
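The broadcasting step can be reproduced on a tiny input to see what the 3D mask looks like. This is a minimal sketch that uses NumPy in place of TensorFlow; the helper name create_attention_mask and the sample input_mask are made up for illustration and are not part of the BERT code.

```python
import numpy as np


def create_attention_mask(from_seq_length, to_mask):
    """Illustrative NumPy version of the broadcasting step (not the BERT API)."""
    to_mask = np.asarray(to_mask)
    batch_size, to_seq_length = to_mask.shape

    # [batch_size, 1, to_seq_length]
    to_mask = to_mask.reshape(batch_size, 1, to_seq_length).astype(np.float32)

    # [batch_size, from_seq_length, 1]
    broadcast_ones = np.ones((batch_size, from_seq_length, 1), dtype=np.float32)

    # Broadcasting gives [batch_size, from_seq_length, to_seq_length]:
    # every row of the result is a copy of the original 1D mask.
    return broadcast_ones * to_mask


# Hypothetical batch: one sequence of length 4 whose last token is padding.
input_mask = [[1, 1, 1, 0]]
mask = create_attention_mask(from_seq_length=4, to_mask=input_mask)

print(mask.shape)
print(mask[0])
```

Each of the four rows of mask[0] is the same as the original input_mask row, so every query position is blocked from attending to the padded last position, while the padded position itself is still allowed to attend to others.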

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

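The resulting attention_mask is then used to mask the attention computation that follows. The code above is the attention part of the TensorFlow source. After transpose_for_scores, query_layer has shape [batch, num_heads, seq_length, size_per_head] (written [B, N, F, H] in the comments), where num_heads * size_per_head equals BERT's hidden size. The resulting attention_scores has shape [batch, num_heads, seq_length, seq_length] ([B, N, F, T]): within each attention head, the row for a given token (third dimension) holds that token's relevance scores against every token in the sequence (fourth dimension).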
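The shape bookkeeping for the score computation can be checked with a small sketch. It again uses NumPy instead of TensorFlow, and the sizes B, N, F, T, H and the random inputs are arbitrary values chosen for illustration only.

```python
import math

import numpy as np

# Hypothetical small sizes: B=batch, N=heads, F=from_seq_length,
# T=to_seq_length, H=size_per_head.
B, N, F, T, H = 2, 3, 4, 4, 5

rng = np.random.default_rng(0)
query_layer = rng.standard_normal((B, N, F, H))  # [B, N, F, H]
key_layer = rng.standard_normal((B, N, T, H))    # [B, N, T, H]

# NumPy counterpart of tf.matmul(query_layer, key_layer, transpose_b=True):
# only the last two axes of key_layer are swapped before the product.
attention_scores = query_layer @ key_layer.transpose(0, 1, 3, 2)

# Scale by 1 / sqrt(size_per_head), as in the BERT code.
attention_scores = attention_scores * (1.0 / math.sqrt(float(H)))

print(attention_scores.shape)
```

Entry [b, n, i, j] of attention_scores is the scaled dot product between query token i and key token j in head n of example b, which is why the last axis lines up with the token positions covered by the mask.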

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

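The adder variable is the term that applies the mask. Positions that should not be attended to (mask value 0) become -10000 and attended positions become 0, so adding it to attention_scores removes the influence of padding tokens. The same adder is broadcast to every attention head through the size-1 dimension added by tf.expand_dims, because the last dimension of attention_scores indexes the token positions of each sequence in the batch and therefore lines up with the expanded attention_mask. After the softmax, padding positions receive a weight of effectively zero, so they no longer affect the attention output.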
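The effect of adder on the softmax output can be seen in a small numeric sketch. It uses NumPy rather than TensorFlow; the softmax helper, the all-zero raw scores, and the sample mask are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis (illustrative helper).
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


# attention_mask: [B, F, T], with the last key position masked out (padding).
attention_mask = np.ones((1, 4, 1), dtype=np.float32) * np.array(
    [[[1, 1, 1, 0]]], dtype=np.float32)

# Hypothetical raw scores, all equal, for two heads: [B, N, F, T].
attention_scores = np.zeros((1, 2, 4, 4), dtype=np.float32)

# [B, 1, F, T]: 0.0 where attention is allowed, -10000.0 where it is masked.
# The size-1 axis lets the same adder broadcast over every head.
adder = (1.0 - attention_mask[:, np.newaxis, :, :]) * -10000.0

attention_probs = softmax(attention_scores + adder)

print(attention_probs[0, 0, 0])
```

With equal raw scores, the three real tokens share the probability mass evenly and the padded position gets a weight that is indistinguishable from zero, in both heads.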
