BERT Source Code Study (Part I)

lbertj

Category: NLP. Tags: bert, natural language processing, neural networks

Article link: https://blog.csdn.net/weixin_42419825/article/details/120071180

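This article walks through the source code of Google's BERT model, focusing on the configuration class, the embedding lookup, the attention layer, and the Transformer module. From the embedding lookup through the construction of the attention mask to the multi-head attention implementation, it explains BERT's core mechanics.

Summary generated by CSDN using AI technology.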

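Introduction:

This post studies the source code of BERT, the large-scale pre-trained language model open-sourced by Google. It starts with the core model implementation, BertModel, which lives in the modeling.py module. The module consists of seven main parts: the configuration class (BertConfig), the word embedding lookup (embedding lookup), the embedding post-processing step (embedding postprocessor), the attention mask construction, the attention layer, the core Transformer module, and the constructor of the BertModel class. Each part is covered in turn below.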

bert config
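This part defines the model's hyperparameters, along with methods for building a configuration from a Python dictionary or a JSON file and for serializing it back to a dictionary or a JSON string.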

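import copy
import json

import six
import tensorflow as tf  # TensorFlow 1.x API (tf.gfile, tf.get_variable)


class BertConfig(object):
  """Configuration class for the BERT model."""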

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

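The parameters have the following meanings:

vocab_size: size of the vocabulary
hidden_size: number of hidden units, i.e. the dimensionality of the encoder layers
num_hidden_layers: number of hidden layers in the Transformer encoder
num_attention_heads: number of heads in multi-head attention
intermediate_size: number of units in the encoder's "intermediate" layer (i.e. the feed-forward layer)
hidden_act: activation function of the hidden layers
hidden_dropout_prob: dropout probability of the hidden layers
attention_probs_dropout_prob: dropout probability applied to the attention probabilities
max_position_embeddings: maximum number of positions, i.e. the longest sequence the model can encode
type_vocab_size: vocabulary size of token_type_ids
initializer_range: standard deviation of the truncated_normal_initializer used to initialize the weights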
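
The dictionary and JSON helpers can be tried without TensorFlow. The following is a minimal sketch, assuming a hypothetical stand-in class `MiniConfig` that keeps only three of the hyperparameters; it is not the actual BertConfig, and the numeric values are illustrative only.

```python
import copy
import json


class MiniConfig(object):
    """Hypothetical, TensorFlow-free stand-in mirroring BertConfig's dict/JSON helpers."""

    def __init__(self, vocab_size, hidden_size=768, num_attention_heads=12):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads

    @classmethod
    def from_dict(cls, json_object):
        # Start from the defaults, then overwrite every attribute found in the dict.
        config = cls(vocab_size=None)
        for key, value in json_object.items():
            config.__dict__[key] = value
        return config

    def to_dict(self):
        # Deep copy so callers cannot mutate the config through the returned dict.
        return copy.deepcopy(self.__dict__)

    def to_json_string(self):
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


# Illustrative values, not taken from any released checkpoint.
config = MiniConfig.from_dict(
    {"vocab_size": 30522, "hidden_size": 256, "num_attention_heads": 4})

# Round trip: config -> JSON string -> dict -> config.
restored = MiniConfig.from_dict(json.loads(config.to_json_string()))

# Each attention head works on an equal slice of the hidden vector.
head_size = config.hidden_size // config.num_attention_heads
```

Because `from_dict` writes straight into `__dict__`, keys that are missing from the dictionary keep their default values, and unknown keys are stored without any validation.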

embedding lookup
Takes word ids as input and returns the corresponding word embeddings.

def embedding_lookup(input_ids,  # word ids: [batch_size, seq_length]
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):

  # This function assumes an input of shape [batch_size, seq_length, num_inputs].
  # A 2D input of shape [batch_size, seq_length] is expanded to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])  # [batch_size * seq_length * num_inputs]
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:  # look up the table rows by index
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  # output: [batch_size * seq_length * num_inputs, embedding_size]
  # reshaped to: [batch_size, seq_length, num_inputs * embedding_size]
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

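Parameters:

input_ids: word ids, shape [batch_size, seq_length]
vocab_size: size of the embedding vocabulary
embedding_size: dimensionality of the embeddings
initializer_range: initialization range of the embeddings
word_embedding_name: name of the embedding table
use_one_hot_embeddings: whether to look up embeddings via one-hot vectors
Returns: a tensor of shape [batch_size, seq_length, embedding_size], together with the embedding table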
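
The two lookup branches yield the same rows: multiplying a one-hot vector by the table selects exactly the row that the index-based gather picks. The following is a rough pure-Python sketch of that equivalence and of the final reshape. The helper names `gather_lookup` and `one_hot_lookup` and the toy table are assumptions made for illustration; they are not part of modeling.py.

```python
def gather_lookup(embedding_table, flat_ids):
    """Index-based lookup: pick one table row per id (mirrors the tf.gather branch)."""
    return [list(embedding_table[i]) for i in flat_ids]


def one_hot_lookup(embedding_table, flat_ids):
    """One-hot lookup: one-hot matrix times the table (mirrors the tf.matmul branch)."""
    vocab_size = len(embedding_table)
    embedding_size = len(embedding_table[0])
    output = []
    for token_id in flat_ids:
        one_hot = [1.0 if j == token_id else 0.0 for j in range(vocab_size)]
        row = [sum(one_hot[j] * embedding_table[j][k] for j in range(vocab_size))
               for k in range(embedding_size)]
        output.append(row)
    return output


# Toy table with vocab_size=4 and embedding_size=2 (made-up values).
table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
input_ids = [[1, 3, 0], [2, 2, 1]]  # [batch_size=2, seq_length=3]

# Flatten to [batch_size * seq_length], then look up one row per id.
flat_ids = [token_id for row in input_ids for token_id in row]
flat_out = gather_lookup(table, flat_ids)  # [batch_size * seq_length, embedding_size]

# Reshape back to [batch_size, seq_length, embedding_size].
seq_length = len(input_ids[0])
output = [flat_out[b * seq_length:(b + 1) * seq_length]
          for b in range(len(input_ids))]
```

The one-hot path does far more arithmetic per token, so which branch is preferable depends on the hardware the model runs on; the sketch only demonstrates that the results agree.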

embedding postprocessor
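BERT's input representation consists of token embeddings, segment embeddings, and position embeddings. The embedding lookup in the previous section produces only the token embeddings; this step adds the remaining information, applies layer normalization and dropout, and outputs the final embeddings.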

def embedding_postprocessor(input_tensor,                # [batch_size, seq_length, embedding_size]
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
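
The idea described above (sum the three embeddings, then normalize) can be sketched in pure Python for a single sequence. This is a simplified illustration under stated assumptions, not the body of embedding_postprocessor: the helper names `layer_norm` and `postprocess`, the epsilon value, and all numbers are hypothetical, the learned scale and shift of layer normalization are left out, and dropout is omitted because it only applies during training.

```python
import math


def layer_norm(vector, epsilon=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + epsilon) for x in vector]


def postprocess(token_embeddings, token_type_ids, token_type_table, position_table):
    """Add segment and position embeddings to each token embedding, then normalize."""
    output = []
    for position, (token_vec, type_id) in enumerate(
            zip(token_embeddings, token_type_ids)):
        summed = [t + s + p for t, s, p in
                  zip(token_vec, token_type_table[type_id], position_table[position])]
        output.append(layer_norm(summed))
    return output


# One sequence with seq_length=2 and embedding_size=4 (made-up values).
token_embeddings = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 1.5]]
token_type_ids = [0, 1]  # first token in segment 0, second in segment 1
token_type_table = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
position_table = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0]]

final_embeddings = postprocess(
    token_embeddings, token_type_ids, token_type_table, position_table)
```

Each row of `final_embeddings` keeps the embedding size of the input, so the result can be fed to the encoder in place of the raw token embeddings.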
