ALBERT: Pre-training on Your Own Dataset and Testing Semantic Similarity on LCQMC (Part 2)

chenmingwei000

Article link: https://blog.csdn.net/chenmingwei000/article/details/103567243

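The previous part explained how to build the pre-training data. This part walks through the training process and looks at where it differs from BERT.

1.2 Walkthrough of run_pretraining.py

As before, we step through the code in debug mode.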

    bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)  # load the configuration from a JSON file

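The parameters in it are:
"directionality": "bidi",  # bidirectional transformer
"embedding_size": 128,  # word-embedding dimension
"hidden_act": "gelu",
"hidden_dropout_prob": 0.0,  # dropout probability of the hidden layers
"hidden_size": 768,  # dimension after transformer encoding
"initializer_range": 0.02,  # range used for parameter initialization
"max_position_embeddings": 512,  # maximum number of positions that can be encoded
"num_attention_heads": 12,  # number of heads in the transformer's multi-head attention
"num_hidden_layers": 12,  # number of transformer layers
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,  # vocabulary size
"ln_type": "postln"
After the model configuration is loaded, the script creates the output directory for the trained model and collects the paths of the training data files: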

    input_files = []  # the input may be several files separated by commas, or a glob pattern such as "input_x*"
    for input_pattern in FLAGS.input_file.split(","):
        input_files.extend(tf.gfile.Glob(input_pattern))
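
To make the comma-plus-glob behaviour concrete, here is a minimal sketch that uses the standard-library glob module in place of tf.gfile.Glob. The helper name expand_input_patterns and the file names are made up for illustration; they are not part of run_pretraining.py.

```python
import glob
import os
import tempfile


def expand_input_patterns(input_file_flag):
    """Split a comma-separated flag value and expand every glob pattern."""
    input_files = []
    for input_pattern in input_file_flag.split(","):
        # sorted() only keeps this toy example deterministic
        input_files.extend(sorted(glob.glob(input_pattern)))
    return input_files


# Create a throwaway directory holding three empty stand-in files.
tmp_dir = tempfile.mkdtemp()
for name in ("input_0.tfrecord", "input_1.tfrecord", "extra.tfrecord"):
    open(os.path.join(tmp_dir, name), "w").close()

flag_value = ",".join([
    os.path.join(tmp_dir, "input_*"),         # a glob pattern
    os.path.join(tmp_dir, "extra.tfrecord"),  # a plain file name
])
matched = [os.path.basename(p) for p in expand_input_patterns(flag_value)]
print(matched)
```

Both forms can be mixed in one flag value: each comma-separated piece is expanded on its own and the results are concatenated.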

The following code sets up the basic run configuration and should need no explanation:

run_config = tf.contrib.tpu.RunConfig(
        keep_checkpoint_max=20,  # 10
        cluster=tpu_cluster_resolver,
        master=FLAGS.master,
        model_dir=FLAGS.output_dir,
        save_checkpoints_steps=FLAGS.save_checkpoints_steps,
        tpu_config=tf.contrib.tpu.TPUConfig(
            iterations_per_loop=FLAGS.iterations_per_loop,
            num_shards=FLAGS.num_tpu_cores,
            per_host_input_for_training=is_per_host))

Next comes the more important step, building the model graph:

 model_fn = model_fn_builder(
        bert_config=bert_config,
        init_checkpoint=FLAGS.init_checkpoint,
        learning_rate=FLAGS.learning_rate,
        num_train_steps=FLAGS.num_train_steps,
        num_warmup_steps=FLAGS.num_warmup_steps,
        use_tpu=FLAGS.use_tpu,
        use_one_hot_embeddings=FLAGS.use_tpu)

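Here bert_config differs from the original BERT configuration in the entry "ln_type": "postln". So apart from the whole-word masking added to the data for Chinese, this is the point where the model differs from BERT so far; we look into it further below. model_fn_builder is the function that builds the model: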

def model_fn_builder(bert_config, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):

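As its docstring says ("Returns `model_fn` closure for TPUEstimator"), this function returns a nested function (a closure). Running this line therefore only returns the inner function, and the inner function is what actually builds the model; we debug into it below. Training itself is driven by TPUEstimator, so it is worth reading up on how it works. The next step is to create the TPUEstimator instance: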

 estimator = tf.contrib.tpu.TPUEstimator(
        use_tpu=FLAGS.use_tpu,
        model_fn=model_fn,
        config=run_config,
        train_batch_size=FLAGS.train_batch_size,
        eval_batch_size=FLAGS.eval_batch_size)

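This creates the object that will train the model, which we can call the training instance, estimator. Next comes the data input function, which is also built as a nested function: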

train_input_fn = input_fn_builder(
            input_files=input_files,
            max_seq_length=FLAGS.max_seq_length,
            max_predictions_per_seq=FLAGS.max_predictions_per_seq,
            is_training=True)

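input_fn_builder sets up the logic for processing the files in input_files. Its nested function is input_fn(params), so that is the function that ultimately processes the data.
From the above, a model trained with TPUEstimator is made up of three parts:
The first part is TPUEstimator's own settings, for example the model output path FLAGS.output_dir, FLAGS.save_checkpoints_steps, FLAGS.train_batch_size, the maximum number of checkpoints to keep, and the TPU settings, which can be ignored if you have no TPU.
The second part is the data processing module, handled by input_fn_builder.
The third part is the model construction module: model_fn_builder builds the whole model.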
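
The three-part split can be sketched in plain Python. This is only an illustration of the builder-and-closure pattern: ToyEstimator is a hypothetical stand-in, not the TPUEstimator API, and the two builders are simplified versions that keep only the closure mechanics.

```python
def input_fn_builder(input_files, is_training):
    """Returns an input_fn closure; nothing is read until input_fn is called."""
    def input_fn(params):
        batch_size = params["batch_size"]
        # Stand-in for the TFRecord pipeline: one "batch" of fake examples.
        return [{"file": f, "is_training": is_training}
                for f in input_files[:batch_size]]
    return input_fn


def model_fn_builder(learning_rate):
    """Returns a model_fn closure that remembers learning_rate."""
    def model_fn(features, params):
        return {"num_examples": len(features), "learning_rate": learning_rate}
    return model_fn


class ToyEstimator:
    """Hypothetical stand-in for TPUEstimator: it owns the settings (part 1)
    and decides when to call the two closures."""

    def __init__(self, model_fn, train_batch_size):
        self.model_fn = model_fn
        self.params = {"batch_size": train_batch_size}

    def train(self, input_fn):
        features = input_fn(self.params)             # part 2: data
        return self.model_fn(features, self.params)  # part 3: model


estimator = ToyEstimator(model_fn_builder(learning_rate=1e-4), train_batch_size=2)
result = estimator.train(
    input_fn_builder(["a.tfrecord", "b.tfrecord", "c.tfrecord"], is_training=True))
print(result)
```

This is why a breakpoint on the builder call shows nothing interesting: the body of input_fn and model_fn only runs once the estimator calls them.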
We start with the data processing module. Debug into:

 def input_fn(params):
     batch_size = params["batch_size"]

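Because this is the nested function of input_fn_builder, data preparation starts here at run time. It first reads batch_size from params, and then reads the data according to the TFRecord schema written in the data-processing stage. The schema is: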

 name_to_features = {
        "input_ids":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "input_mask":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "segment_ids":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "masked_lm_positions":
            tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
        "masked_lm_ids":
            tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
        "masked_lm_weights":
            tf.FixedLenFeature([max_predictions_per_seq], tf.float32),
        "next_sentence_labels":
            tf.FixedLenFeature([1], tf.int64),
    }  # the fields are not explained again here; if anything is unclear, see the data-processing code
   

The records are then read with the standard TFRecord input pipeline and grouped into batches of batch_size:

if is_training:
    d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
    d = d.repeat()
    d = d.shuffle(buffer_size=len(input_files))

    # `cycle_length` is the number of parallel files that get read.
    cycle_length = min(num_cpu_threads, len(input_files))

    # `sloppy` mode means that the interleaving is not exact. This adds
    # even more randomness to the training pipeline.
    d = d.apply(
        tf.contrib.data.parallel_interleave(
            tf.data.TFRecordDataset,
            sloppy=is_training,
            cycle_length=cycle_length))
    d = d.shuffle(buffer_size=100)

All of the code above is the standard pattern for reading TFRecord files.

 d = d.apply(
            tf.contrib.data.map_and_batch(
                lambda record: _decode_record(record, name_to_features),
                batch_size=batch_size,
                num_parallel_batches=num_cpu_threads,
                drop_remainder=True))

The map_and_batch call parses the TFRecords. The parsing itself happens in _decode_record(record, name_to_features), which is defined as follows:

def _decode_record(record, name_to_features):
    """Decodes a record to a TensorFlow example."""
    example = tf.parse_single_example(record, name_to_features)

    # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
    # So cast all int64 to int32.
    for name in list(example.keys()):
        t = example[name]
        if t.dtype == tf.int64:
            t = tf.to_int32(t)
        example[name] = t

    return example

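In other words, for every example the int64 features that hold ids are cast to int32 (float features could be converted in the same way), and return example returns a single example. As long as the examples in the TFRecord keep this dictionary form, they are easy to process; batching and shuffling are fully encapsulated by the pipeline.
Next we focus on how the model is built and structured. Debug into def model_fn(features, labels, mode, params). The core idea of TPUEstimator appears to be that the data function and the model function are the fixed modules you provide, and the framework handles everything else.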

        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        masked_lm_positions = features["masked_lm_positions"]
        masked_lm_ids = features["masked_lm_ids"]
        masked_lm_weights = features["masked_lm_weights"]
        next_sentence_labels = features["next_sentence_labels"]

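The model inputs are: input_ids, the token ids of the example; input_mask, which marks real tokens with 1 and padding with 0 and so encodes the actual length; segment_ids, which distinguishes sentence A from sentence B; masked_lm_positions, the positions of the masked tokens to predict; masked_lm_ids, the true token ids at those positions, that is, the labels; next_sentence_labels, the sentence-order label; and masked_lm_weights, which is 1.0 for every real prediction and 0.0 for the padded prediction slots.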
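
A toy example may help to see how these seven fields line up. The sketch below pads one sentence pair by hand; the vocabulary, the tokens, and the helper name build_toy_features are invented for illustration and do not come from the data-processing script.

```python
def build_toy_features(tokens_a, tokens_b, masked_positions, vocab,
                       max_seq_length, max_predictions_per_seq):
    """Lay out one padded example with the same field names as the TFRecord schema."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Remember the true ids before the tokens are replaced by [MASK].
    masked_lm_ids = [vocab[tokens[p]] for p in masked_positions]
    for p in masked_positions:
        tokens[p] = "[MASK]"

    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)

    # Pad the sequence fields up to max_seq_length.
    pad = max_seq_length - len(input_ids)
    input_ids += [0] * pad
    input_mask += [0] * pad
    segment_ids += [0] * pad

    # Pad the prediction fields up to max_predictions_per_seq.
    pred_pad = max_predictions_per_seq - len(masked_positions)
    return {
        "input_ids": input_ids,
        "input_mask": input_mask,
        "segment_ids": segment_ids,
        "masked_lm_positions": list(masked_positions) + [0] * pred_pad,
        "masked_lm_ids": masked_lm_ids + [0] * pred_pad,
        "masked_lm_weights": [1.0] * len(masked_positions) + [0.0] * pred_pad,
        "next_sentence_labels": [0],
    }


toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
             "a": 4, "b": 5, "c": 6, "d": 7}
features = build_toy_features(["a", "b"], ["c", "d"], masked_positions=[2],
                              vocab=toy_vocab, max_seq_length=8,
                              max_predictions_per_seq=2)
for key, value in features.items():
    print(key, value)
```

Note that the second prediction slot is pure padding: its position and id are 0, and only its weight of 0.0 tells the loss to ignore it.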
The model is then constructed:

 model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)

Stepping into BertModel, the constructor begins with:

        config = copy.deepcopy(config)
        if not is_training:
            config.hidden_dropout_prob = 0.0
            config.attention_probs_dropout_prob = 0.0

        input_shape = get_shape_list(input_ids, expected_rank=2)
        batch_size = input_shape[0]
        seq_length = input_shape[1]

        if input_mask is None:
            input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

        if token_type_ids is None:
            token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

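This opening part of the constructor only does simple initialization checks and obtains batch_size and seq_length. token_type_ids is the same thing as segment_ids: it marks whether a token belongs to sentence A or to the following sentence B.
The next code obtains the vocabulary embedding table (self.embedding_table), the ALBERT-specific projection between embedding_size and hidden_size, embedding_table_2 (self.embedding_table_2), and the embedding output for input_ids.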

 with tf.variable_scope(scope, default_name="bert"):
            with tf.variable_scope("embeddings"):
                # Perform embedding lookup on the word ids, but use the style of factorized embedding parameterization from albert. add by brightmart, 2019-09-28
                (self.embedding_output, self.embedding_table, self.embedding_table_2) = embedding_lookup_factorized(
                    input_ids=input_ids,
                    vocab_size=config.vocab_size,
                    hidden_size=config.hidden_size,
                    embedding_size=config.embedding_size,
                    initializer_range=config.initializer_range,
                    word_embedding_name="word_embeddings",
                    use_one_hot_embeddings=use_one_hot_embeddings)

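Step into embedding_lookup_factorized to see how these outputs are produced. Here the word-embedding size is embedding_size=128 and hidden_size=312 (the configuration listed earlier shows hidden_size 768, so this run presumably uses a smaller configuration).
1. Expand the dimensions of input_ids to get [batch_size, seq_length, 1]: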

    print("embedding_lookup_factorized. factorized embedding parameterization is used.")
    if input_ids.shape.ndims == 2:
        input_ids = tf.expand_dims(input_ids, axis=[-1])  # shape of input_ids is:[ batch_size, seq_length, 1]

2. Initialize the vocabulary embedding table with shape [vocab_size, embedding_size]:

    embedding_table = tf.get_variable(
        name=word_embedding_name,
        shape=[vocab_size, embedding_size],
        initializer=create_initializer(initializer_range))

3. Flatten the input to get flat_input_ids:

  flat_input_ids = tf.reshape(input_ids, [-1])  # one rank. shape as (batch_size * sequence_length,)

4. Look up the embedding vectors for input_ids. The result has shape [batch_size * sequence_length, embedding_size]:

 output_middle = tf.gather(embedding_table,
                                  flat_input_ids)

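tf.gather uses a 1-D array of indices to pull the corresponding rows out of a tensor; tf.nn.embedding_lookup could be used instead.
5. Initialize the projection variable that maps between the word-embedding size embedding_size and the hidden size used by the transformer. Its shape is [embedding_size, hidden_size]:

    # 2. project vector(output_middle) to the hidden space
    # this is also the table that maps the transformer's hidden_size output back to embedding_size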
    project_variable = tf.get_variable( 
        name=word_embedding_name + "_2",
        shape=[embedding_size, hidden_size],
        initializer=create_initializer(initializer_range))

Next is the code that projects the word embeddings to hidden_size:

    output = tf.matmul(output_middle, project_variable)
    # [batch_size * sequence_length, embedding_size] x [embedding_size, hidden_size]
    # ---> [batch_size * sequence_length, hidden_size]
    # reshape back to 3 rank
    input_shape = get_shape_list(input_ids)  # input_shape=[batch_size, seq_length, 1]
    batch_size, sequence_length, _ = input_shape
    output = tf.reshape(output,
                        (batch_size, sequence_length, hidden_size))  # output=[batch_size, sequence_length, hidden_size]
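
The lookup-then-project steps of embedding_lookup_factorized can be replayed with plain Python lists. This is a toy sketch with made-up numbers, not the TensorFlow implementation; param_counts is a hypothetical helper that counts embedding parameters with and without the factorization, using the sizes mentioned in this post (vocab_size=21128, embedding_size=128, hidden_size=312).

```python
def embedding_lookup_factorized(input_ids, table, projection):
    """Toy version: gather rows of `table`, then project them to the hidden size."""
    batch_size, seq_length = len(input_ids), len(input_ids[0])
    hidden_size = len(projection[0])
    # Step 3: flatten [batch_size, seq_length] -> [batch_size * seq_length]
    flat_ids = [tok for row in input_ids for tok in row]
    # Step 4: gather -> [batch_size * seq_length, embedding_size]
    middle = [table[i] for i in flat_ids]
    # Step 5: multiply by the [embedding_size, hidden_size] projection
    flat_out = [[sum(v[k] * projection[k][j] for k in range(len(v)))
                 for j in range(hidden_size)]
                for v in middle]
    # Reshape back to [batch_size, seq_length, hidden_size]
    return [flat_out[b * seq_length:(b + 1) * seq_length]
            for b in range(batch_size)]


def param_counts(vocab_size, embedding_size, hidden_size):
    """Embedding parameters with the factorization versus a direct table."""
    factorized = vocab_size * embedding_size + embedding_size * hidden_size
    direct = vocab_size * hidden_size
    return factorized, direct


toy_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # vocab_size=3, embedding_size=2
toy_projection = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # embedding_size=2, hidden_size=3
toy_output = embedding_lookup_factorized([[0, 2], [1, 1]], toy_table, toy_projection)
print(toy_output)
print(param_counts(21128, 128, 312))
```

With these sizes the factorized form needs 2,744,320 embedding parameters against 6,591,936 for a direct [vocab_size, hidden_size] table, and the gap widens as hidden_size grows.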

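This is worth comparing with the original BERT, whose code has no such projection between the word embedding and hidden_size; reading the two side by side shows the structural difference. After that, position information and the information that distinguishes sentence A from sentence B are added:

                # Add positional embeddings and token type embeddings, then layer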
                # normalize and perform dropout.
                self.embedding_output = embedding_postprocessor(
                    input_tensor=self.embedding_output,
                    use_token_type=True,
                    token_type_ids=token_type_ids,
                    token_type_vocab_size=config.type_vocab_size,
                    token_type_embedding_name="token_type_embeddings",
                    use_position_embeddings=True,
                    position_embedding_name="position_embeddings",
                    initializer_range=config.initializer_range,
                    max_position_embeddings=config.max_position_embeddings,
                    dropout_prob=config.hidden_dropout_prob)

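self.embedding_output is the initial representation of input_ids, with shape [batch_size, sequence_length, hidden_size]. token_type_ids distinguishes sentence A from sentence B, which is why config.type_vocab_size=2. Stepping into embedding_postprocessor, the details are as follows:
1. Get the dimensions of the input:

    input_shape = get_shape_list(input_tensor, expected_rank=3)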
    batch_size = input_shape[0]
    seq_length = input_shape[1]
    width = input_shape[2]
    output = input_tensor

2. Initialize the embedding variable for token_type_ids. token_type_table has shape [token_type_vocab_size, hidden_size], so its width matches the word-embedding output self.embedding_output:

    if use_token_type:
        if token_type_ids is None:
            raise ValueError("`token_type_ids` must be specified if "
                             "`use_token_type` is True.")
        token_type_table = tf.get_variable(
            name=token_type_embedding_name,
            shape=[token_type_vocab_size, width],
            initializer=create_initializer(initializer_range))

3. The next code uses a one-hot matrix multiplication, but in effect it is just an embedding lookup, so it needs no further explanation:

        flat_token_type_ids = tf.reshape(token_type_ids, [-1])
        one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
        token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
        token_type_embeddings = tf.reshape(token_type_embeddings,
                                           [batch_size, seq_length, width])

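4. Then add the result to the word embeddings. What is added at this step is the token-type information that distinguishes A from B; the position embeddings come afterwards: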

output += token_type_embeddings
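
That the one-hot product in step 3 is the same as a row lookup can be checked with a few lines of plain Python. The table values below are arbitrary; one_hot and matmul are small list-based helpers written for this sketch.

```python
def one_hot(ids, depth):
    """One row per id, with a single 1.0 at the id's index."""
    return [[1.0 if i == j else 0.0 for j in range(depth)] for i in ids]


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


token_type_table = [[0.1, 0.2, 0.3],   # row 0: sentence A
                    [0.7, 0.8, 0.9]]   # row 1: sentence B
flat_token_type_ids = [0, 0, 1, 1, 0]

via_one_hot = matmul(one_hot(flat_token_type_ids, depth=2), token_type_table)
via_lookup = [token_type_table[i] for i in flat_token_type_ids]
print(via_one_hot == via_lookup)
```

Each one-hot row selects exactly one row of the table, so the two results are identical. With only two token types the one-hot matrix is tiny, which is presumably why the code takes this route here.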

The next code adds the position information of every token in the sequence.
1. Initialize the position embeddings, sized by the maximum length max_position_embeddings, and add the corresponding position embeddings. None of this code has changed from BERT:

    if use_position_embeddings:
        assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
        # seq_length must not exceed max_position_embeddings
        with tf.control_dependencies([assert_op]):
            full_position_embeddings = tf.get_variable(
                name=position_embedding_name,
                shape=[max_position_embeddings, width],
                initializer=create_initializer(initializer_range))
            position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                           [seq_length, -1])
            num_dims = len(output.shape.as_list())

            # Only the last two dimensions are relevant (`seq_length` and `width`), so
            # we broadcast among the first dimensions, which is typically just
            # the batch size.
            position_broadcast_shape = []
            for _ in range(num_dims - 2):
                position_broadcast_shape.append(1)
            position_broadcast_shape.extend([seq_length, width])
            position_embeddings = tf.reshape(position_embeddings,
                                             position_broadcast_shape)
            output += position_embeddings

    output = layer_norm_and_dropout(output, dropout_prob)
    return output
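
The slice-and-broadcast logic of the position embeddings can be sketched with nested lists. The function name add_position_embeddings and the numbers are chosen for illustration; layer normalization and dropout are left out.

```python
def add_position_embeddings(output, full_position_embeddings):
    """output: [batch_size, seq_length, width];
    full_position_embeddings: [max_position_embeddings, width]."""
    seq_length = len(output[0])
    if seq_length > len(full_position_embeddings):
        raise ValueError("seq_length exceeds max_position_embeddings")
    # Like tf.slice(..., [0, 0], [seq_length, -1]): keep the first seq_length rows.
    position_embeddings = full_position_embeddings[:seq_length]
    # Broadcast over the batch dimension: every example gets the same rows added.
    return [[[x + p for x, p in zip(token, position)]
             for token, position in zip(example, position_embeddings)]
            for example in output]


toy_output = [[[1, 1], [1, 1]],
              [[2, 2], [2, 2]]]                # batch_size=2, seq_length=2, width=2
toy_positions = [[10, 20], [30, 40], [50, 60]]  # max_position_embeddings=3
summed = add_position_embeddings(toy_output, toy_positions)
print(summed)
```

Only the first seq_length rows of the table are used, so the third row in this toy table never contributes.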

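The code that follows is the core of the transformer, the encoder part of the model:
1. Use input_ids and input_mask to build a mask of shape [batch_size, seq_length, seq_length].

create_attention_mask_from_input_mask uses input_mask to tell the positions that hold real tokens from the padded positions. Each [seq_length, seq_length] matrix contains 1.0 or 0.0, where 1.0 means there is text and 0.0 means there is none; it is used to remove the effect of padding on the output.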

  with tf.variable_scope("encoder"):
       # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
       # mask of shape [batch_size, seq_length, seq_length] which is used
       # for the attention scores.
       attention_mask = create_attention_mask_from_input_mask(input_ids,
        input_mask)
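
A minimal sketch of what this mask looks like and how it is typically applied, assuming the convention of the original BERT attention layer, where masked positions get a large negative number added to their scores before the softmax. The toy function here takes only input_mask, and masked_softmax is a helper written for this sketch.

```python
import math


def create_attention_mask_from_input_mask(input_mask):
    """[batch_size, seq_length] -> [batch_size, seq_length, seq_length].
    Every query row receives a copy of the key-side mask."""
    return [[[float(m) for m in row] for _ in row] for row in input_mask]


def masked_softmax(scores, mask_row):
    """Add -10000.0 to masked positions, then apply a softmax."""
    adjusted = [s + (1.0 - m) * -10000.0 for s, m in zip(scores, mask_row)]
    top = max(adjusted)
    exps = [math.exp(a - top) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


attention_mask = create_attention_mask_from_input_mask([[1, 1, 0]])
probs = masked_softmax([0.0, 0.0, 0.0], attention_mask[0][0])
print(attention_mask)
print(probs)
```

The padded third position ends up with zero attention weight, so it cannot influence the representations of the real tokens.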

Next we enter the transformer itself:

        print("ln_type:", ln_type)
        if ln_type == 'postln' or ln_type is None:  # currently, base or large of albert used post-LN structure
            print("old structure of transformer.use: transformer_model,which use post-LN")
            self.all_encoder_layers = transformer_model(
                input_tensor=self.embedding_output,
                attention_mask=attention_mask,
                hidden_size=config.hidden_size,
                num_hidden_layers=config.num_hidden_layers,
                num_attention_heads=config.num_attention_heads,
                intermediate_size=config.intermediate_size,
                intermediate_act_fn=get_activation(config.hidden_act),
                hidden_dropout_prob=config.hidden_dropout_prob,
                attention_probs_dropout_prob=config.attention_probs_dropout_prob,
                initializer_range=config.initializer_range,
                do_return_all_layers=True)
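
For readers unfamiliar with the term, post-LN means that layer normalization is applied after the residual addition, while the pre-LN alternative normalizes the input of the sub-layer instead. The sketch below contrasts the two orderings on a single vector. It is a simplified illustration without the learned scale and bias, and the pre-LN branch is not shown in the excerpt above.

```python
import math


def layer_norm(x, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post-LN: residual add first, LayerNorm last."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-LN: LayerNorm on the sub-layer input, residual add last."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def double(v):
    """Stand-in for an attention or feed-forward sub-layer."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0]
post_out = post_ln_block(x, double)
pre_out = pre_ln_block(x, double)
print(post_out)
print(pre_out)
```

With post-LN the block output is always normalized, whereas with pre-LN the unnormalized input is carried through the residual path.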

transformer_model returns the outputs of all layers as all_encoder_layers; if you are interested in transformer_model, step into it. Finally the training objectives are added. First, the masked LM loss over the vocabulary:

        (masked_lm_loss,
         masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
            bert_config,
            model.get_sequence_output(),
            model.get_embedding_table(),
            model.get_embedding_table_2(),
            masked_lm_positions,
            masked_lm_ids,
            masked_lm_weights)

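get_sequence_output() returns the model's final output, the hidden_size representation of every token. model.get_embedding_table() is the vocabulary embedding table, and model.get_embedding_table_2() is the projection between the word embeddings and the hidden representation. masked_lm_positions gives the positions in the sequence to predict and masked_lm_ids gives the true labels. Because every example is padded to the maximum number of predictions, masked_lm_weights records how many predictions are real. Stepping into the function, it first gathers the input_tensor vectors at the positions to be predicted; these are the inputs for predicting those positions: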

def get_masked_lm_output(bert_config, input_tensor, output_weights, project_weights, positions,
                         label_ids, label_weights):
    """Get loss and log probs for the masked LM."""
    input_tensor = gather_indexes(input_tensor, positions)
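
The body of gather_indexes is not shown in this post. The following is a minimal list-based sketch of the usual approach: flatten the batch and offset each example's positions by b * seq_length. The toy tensor is made up so that every vector records its own (example, position) pair.

```python
def gather_indexes(sequence_tensor, positions):
    """sequence_tensor: [batch_size, seq_length, width];
    positions: [batch_size, max_predictions_per_seq].
    Returns [batch_size * max_predictions_per_seq, width]."""
    batch_size = len(sequence_tensor)
    seq_length = len(sequence_tensor[0])
    # Flatten to [batch_size * seq_length, width]
    flat_sequence = [token for example in sequence_tensor for token in example]
    # Shift each example's positions by its offset in the flattened tensor.
    flat_positions = [b * seq_length + p
                      for b in range(batch_size)
                      for p in positions[b]]
    return [flat_sequence[i] for i in flat_positions]


# Each toy vector is [example index, position index].
toy_sequence = [[[0, 0], [0, 1], [0, 2]],
                [[1, 0], [1, 1], [1, 2]]]
gathered = gather_indexes(toy_sequence, positions=[[2, 0], [1, 0]])
print(gathered)
```

The result has one row per prediction slot, padded slots included, which is why the first dimension of log_probs later equals batch_size * max_predictions_per_seq.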

input_tensor then goes through one more dense transformation followed by layer normalization:

    with tf.variable_scope("cls/predictions"):
        # We apply one more non-linear transformation before the output layer.
        # This matrix is not used after pre-training.
        with tf.variable_scope("transform"):
            input_tensor = tf.layers.dense(
                input_tensor,
                units=bert_config.hidden_size,
                activation=modeling.get_activation(bert_config.hidden_act),
                kernel_initializer=modeling.create_initializer(
                    bert_config.initializer_range))
            input_tensor = modeling.layer_norm(input_tensor)

An output bias with one entry per vocabulary item is then added (21128 entries here):

        output_bias = tf.get_variable(
            "output_bias",
            shape=[bert_config.vocab_size],
            initializer=tf.zeros_initializer())

input_tensor is then projected to the embedding_size dimension and multiplied by the vocabulary embedding table (get_embedding_table) to get the final score for every vocabulary entry:

        input_project = tf.matmul(input_tensor, project_weights, transpose_b=True)
        logits = tf.matmul(input_project, output_weights, transpose_b=True)
        # input_project=[-1, embedding_size], output_weights=[vocab_size, embedding_size],
        # output_weights_transpose=[embedding_size, vocab_size] ---> [-1, vocab_size]

        logits = tf.nn.bias_add(logits, output_bias)
        log_probs = tf.nn.log_softmax(logits, axis=-1)
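
The two transposed products can be traced on a tiny example. The sketch below uses hidden_size=3, embedding_size=2 and vocab_size=4 with hand-picked weights; the function names are chosen for this illustration and the numbers are not from a real model.

```python
import math


def matmul_transpose_b(a, b):
    """a: [n, k], b: [m, k] -> a @ b^T with shape [n, m]."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def log_softmax(row):
    """Numerically stable log-softmax of one row."""
    top = max(row)
    log_z = top + math.log(sum(math.exp(v - top) for v in row))
    return [v - log_z for v in row]


def masked_lm_log_probs(input_tensor, project_weights, output_weights, output_bias):
    # [n, hidden_size] x [embedding_size, hidden_size]^T -> [n, embedding_size]
    input_project = matmul_transpose_b(input_tensor, project_weights)
    # [n, embedding_size] x [vocab_size, embedding_size]^T -> [n, vocab_size]
    logits = matmul_transpose_b(input_project, output_weights)
    logits = [[v + b for v, b in zip(row, output_bias)] for row in logits]
    return [log_softmax(row) for row in logits]


toy_input = [[1.0, 0.0, -1.0]]                       # n=1, hidden_size=3
toy_project = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]     # [embedding_size, hidden_size]
toy_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # [vocab_size, embedding_size]
toy_log_probs = masked_lm_log_probs(toy_input, toy_project, toy_table, [0.0] * 4)
print(toy_log_probs)
```

Both matrices are the ones created in the embedding layer, so the output layer reuses the embedding table and its projection rather than learning a separate [hidden_size, vocab_size] matrix.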

log_probs has shape (1632, 21128). The labels are then converted to the same shape:

        label_ids = tf.reshape(label_ids, [-1])
        label_weights = tf.reshape(label_weights, [-1])

        one_hot_labels = tf.one_hot(label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

        # The `positions` tensor might be zero-padded (if the sequence is too
        # short to have the maximum number of predictions). The `label_weights`
        # tensor has a value of 1.0 for every real prediction and 0.0 for the
        # padding predictions.
        per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
        numerator = tf.reduce_sum(label_weights * per_example_loss)
        denominator = tf.reduce_sum(label_weights) + 1e-5
        loss = numerator / denominator
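
The weighted average can be reproduced in a few lines. Multiplying log_probs by a one-hot label row and summing simply picks out the log-probability of the true label, which is what the sketch below does directly; the probabilities are invented for illustration.

```python
import math


def masked_lm_loss(log_probs, label_ids, label_weights):
    """Weighted mean of the per-prediction negative log-likelihood."""
    # Same value as -reduce_sum(log_probs * one_hot_labels, axis=-1)
    per_example_loss = [-row[label] for row, label in zip(log_probs, label_ids)]
    numerator = sum(w * l for w, l in zip(label_weights, per_example_loss))
    denominator = sum(label_weights) + 1e-5
    return numerator / denominator


toy_log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.75)],
    [math.log(0.9), math.log(0.1)],   # padded prediction slot
]
toy_labels = [0, 1, 0]
toy_weights = [1.0, 1.0, 0.0]
loss = masked_lm_loss(toy_log_probs, toy_labels, toy_weights)
print(loss)
```

The third slot has weight 0.0, so whatever the model predicts there has no effect on the loss; the 1e-5 in the denominator only guards against division by zero when a batch has no real predictions.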

This gives the masked LM loss. The next-sentence loss is computed next, and the two losses are added together:

        (next_sentence_loss, next_sentence_example_loss,
         next_sentence_log_probs) = get_next_sentence_output(
            bert_config, model.get_pooled_output(), next_sentence_labels)

        total_loss = masked_lm_loss + next_sentence_loss

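The remaining code initializes the parameters of the newly built model from a pre-trained model, for those parameters that exist in it:

        tvars = tf.trainable_variables()  # all trainable parameters of the new model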

        initialized_variable_names = {}
        print("init_checkpoint:", init_checkpoint)
        scaffold_fn = None
        if init_checkpoint:
            (assignment_map, initialized_variable_names
             ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)

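The variables in tvars are initialized from the pre-trained checkpoint file init_checkpoint; see the function below.
tvars = tf.trainable_variables() is the full set of trainable parameters of the model.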

def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
    """Compute the union of the current variables and checkpoint variables.
       tvars: all the variables of the current model
    """
    assignment_map = {}
    initialized_variable_names = {}  # names of the variables that have been initialized

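The next code builds a dictionary from each model variable's name to the variable itself; name_to_variable is that dictionary for the current model:

    name_to_variable = collections.OrderedDict()
    for var in tvars:
        name = var.name
        m = re.match("^(.*):\\d+$", name)
        if m is not None:
            name = m.group(1)
        name_to_variable[name] = var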

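Get the list of variables stored in the checkpoint; this list covers every layer available for initialization:

    init_vars = tf.train.list_variables(init_checkpoint)

Then build the dictionary of variables that can be assigned. The idea is to check which checkpoint variables also exist in the model and to record their names in initialized_variable_names. The function returns the assignment map together with the initialized variable names:

    assignment_map = collections.OrderedDict()
    for x in init_vars:
        (name, var) = (x[0], x[1])
        if name not in name_to_variable:
            continue
        assignment_map[name] = name
        initialized_variable_names[name] = 1
        initialized_variable_names[name + ":0"] = 1
    return (assignment_map, initialized_variable_names)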
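
The name matching can be tried without TensorFlow by working on plain name strings. The helper build_assignment_map and the variable names below are toy values chosen for illustration; the real function receives variable objects and reads the checkpoint with tf.train.list_variables.

```python
import collections
import re


def build_assignment_map(trainable_var_names, checkpoint_var_names):
    """Match model variables to checkpoint variables by name."""
    name_to_variable = collections.OrderedDict()
    for name in trainable_var_names:
        # Strip the ":0" output suffix that TensorFlow appends to variable names.
        m = re.match(r"^(.*):\d+$", name)
        if m is not None:
            name = m.group(1)
        name_to_variable[name] = name  # the real code stores the variable object

    assignment_map = collections.OrderedDict()
    initialized_variable_names = {}
    for name in checkpoint_var_names:
        if name not in name_to_variable:
            continue  # in the checkpoint but not in the model
        assignment_map[name] = name
        initialized_variable_names[name] = 1
        initialized_variable_names[name + ":0"] = 1
    return assignment_map, initialized_variable_names


model_vars = ["bert/embeddings/word_embeddings:0",
              "bert/embeddings/word_embeddings_2:0",
              "cls/predictions/output_bias:0"]
checkpoint_vars = ["bert/embeddings/word_embeddings",
                   "bert/embeddings/word_embeddings_2",
                   "global_step"]
assignment_map, initialized = build_assignment_map(model_vars, checkpoint_vars)
print(list(assignment_map))
```

In this toy case output_bias exists only in the model, so it keeps its fresh initialization, and global_step exists only in the checkpoint, so it is skipped.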

The values are then loaded from the checkpoint:

tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

The following code logs which variables were initialized from the checkpoint:

tf.logging.info("**** Trainable Variables ****")
for var in tvars:
    init_string = ""
    if var.name in initialized_variable_names:
        init_string = ", *INIT_FROM_CKPT*"
    tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                    init_string)