BERT Source Code Annotations (Part 3)

舒语---依依

Article link: https://blog.csdn.net/matlabjenny/article/details/116143701

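This part covers run_squad.py, the script for the reading comprehension task. The dataset can be SQuAD 1.0 or 2.0, and its format is as follows:
[Figure: SQuAD dataset format]
Reference: https://www.cnblogs.com/xuehuiping/p/12262700.html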

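SQuAD 2.0 adds an is_impossible field to each question that indicates whether the question has an answer. If it is False, the answer can be found in the context; if it is True, plausible_answers is given instead, in the same format as answers.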
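To make the layout concrete, here is a minimal sketch of a SQuAD 2.0 style record and a loop that walks it. The sample text, the ids, and the helper name iter_qas are made up for illustration; only the field names (data, paragraphs, context, qas, answers, answer_start, is_impossible, plausible_answers) come from the dataset format.

```python
# A tiny hand-written record in the SQuAD 2.0 layout (contents are illustrative).
squad_v2_sample = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "The leader was John Smith (1895-1943).",
            "qas": [
                {"id": "q1",
                 "question": "What year was John Smith born?",
                 "is_impossible": False,
                 "answers": [{"text": "1895", "answer_start": 27}]},
                {"id": "q2",
                 "question": "Where was John Smith born?",
                 "is_impossible": True,
                 "answers": [],
                 "plausible_answers": [{"text": "John Smith", "answer_start": 15}]},
            ],
        }],
    }],
}


def iter_qas(dataset):
    """Yield one flat dict per question (hypothetical helper, not part of BERT)."""
    for entry in dataset["data"]:
        for paragraph in entry["paragraphs"]:
            for qa in paragraph["qas"]:
                impossible = qa.get("is_impossible", False)
                if impossible:
                    # Unanswerable questions carry no usable answer span.
                    answer_text, answer_start = "", -1
                else:
                    answer = qa["answers"][0]
                    answer_text, answer_start = answer["text"], answer["answer_start"]
                yield {"id": qa["id"], "question": qa["question"],
                       "context": paragraph["context"],
                       "answer_text": answer_text, "answer_start": answer_start,
                       "is_impossible": impossible}


for row in iter_qas(squad_v2_sample):
    print(row["id"], row["is_impossible"], repr(row["answer_text"]))
```

Note that answer_start is a character offset into context, which is why the later steps need a character-to-word mapping.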

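With that, on to the main topic.
The code of the main function is not reproduced here to save space. Start with read_squad_examples, which reads the data. Each qas entry becomes one example, and each example contains:
question_text: string, the question text, taken unprocessed from the raw data;
doc_tokens: list, the context split on whitespace, e.g. ['beyonce', 'giselle', 'knowles-carter', ...];
orig_answer_text: string, the answer text, taken from the raw data;
start_position: -1 if is_impossible is True; otherwise the index in doc_tokens of the word where the answer starts;
end_position: -1 if is_impossible is True; otherwise derived from the answer length and the start offset.
(char_to_word_offset: list with one entry per character of context; the entry at index i is the index in doc_tokens of the word that character i belongs to,
e.g. [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...], where a space is assigned to the preceding word.)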
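The whitespace split and the character-to-word mapping described above can be sketched as follows. This is a simplified reconstruction under the assumption that whitespace means space, tab, carriage return, or newline; the function names split_context and answer_word_span are illustrative, not the names used in run_squad.py.

```python
def is_whitespace(c):
    # Assumed whitespace set for this sketch.
    return c in (" ", "\t", "\r", "\n")


def split_context(context):
    """Split on whitespace and record, for every character, its word index."""
    doc_tokens = []
    char_to_word_offset = []
    prev_is_whitespace = True
    for c in context:
        if is_whitespace(c):
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(c)      # first character of a new word
            else:
                doc_tokens[-1] += c       # continue the current word
            prev_is_whitespace = False
        # A whitespace character inherits the index of the preceding word.
        char_to_word_offset.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word_offset


def answer_word_span(char_to_word_offset, answer_start, answer_text):
    """Map a character-level answer to (start_position, end_position) in words."""
    start_position = char_to_word_offset[answer_start]
    end_position = char_to_word_offset[answer_start + len(answer_text) - 1]
    return start_position, end_position


tokens, offsets = split_context("Beyonce Giselle Knowles-Carter")
print(tokens)
print(answer_word_span(offsets, 8, "Giselle Knowles-Carter"))
```

Here the answer starts at character 8, so it covers words 1 through 2 of doc_tokens.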

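convert_examples_to_features:
The input is built as [CLS] + query + [SEP] + paragraph_context + [SEP]. The question is placed first, which can take advantage of BERT's Next Sentence Prediction pretraining to obtain richer semantic information;
the query is truncated according to max_query_length.
Because doc_tokens is only split on whitespace, extra processing is needed to locate the start and end positions after tokenization.
doc_tokens: ['beyonce', 'giselle', 'knowles-carter']
all_doc_tokens: ['beyonce', 'gi', '##selle', 'knowles', '-', 'carter'], the result after tokenization;
orig_to_tok_index: len(doc_tokens) entries, each giving the index in all_doc_tokens of the first sub-token of that word, e.g. [0, 1, 3]
tok_to_orig_index: same length as all_doc_tokens, mapping each sub-token back to its word in doc_tokens, e.g. [0, 1, 1, 2, 2, 2]
The example.start_position passed in is an index into doc_tokens; orig_to_tok_index maps it to the corresponding position after tokenization.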
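The two index maps and the span projection can be sketched like this. The dictionary TOY_PIECES stands in for the real WordPiece tokenizer and is purely hypothetical; it is chosen so the output matches the example above. The end-position rule (first sub-token of the next word, minus one) is how I recall run_squad.py handling it, so treat it as an assumption.

```python
# Hypothetical stand-in for a WordPiece tokenizer, covering only this example.
TOY_PIECES = {
    "giselle": ["gi", "##selle"],
    "knowles-carter": ["knowles", "-", "carter"],
}


def toy_tokenize(word):
    return TOY_PIECES.get(word, [word])


def build_index_maps(doc_tokens, tokenize):
    """Return (all_doc_tokens, orig_to_tok_index, tok_to_orig_index)."""
    all_doc_tokens, orig_to_tok_index, tok_to_orig_index = [], [], []
    for i, word in enumerate(doc_tokens):
        # Index of the first sub-token produced for word i.
        orig_to_tok_index.append(len(all_doc_tokens))
        for sub_token in tokenize(word):
            tok_to_orig_index.append(i)
            all_doc_tokens.append(sub_token)
    return all_doc_tokens, orig_to_tok_index, tok_to_orig_index


def project_span(start_position, end_position, orig_to_tok_index,
                 num_doc_tokens, num_sub_tokens):
    """Map a word-level span to a sub-token-level span (inclusive ends)."""
    tok_start = orig_to_tok_index[start_position]
    if end_position < num_doc_tokens - 1:
        # Last sub-token of the end word = first sub-token of the next word - 1.
        tok_end = orig_to_tok_index[end_position + 1] - 1
    else:
        tok_end = num_sub_tokens - 1
    return tok_start, tok_end


doc_tokens = ["beyonce", "giselle", "knowles-carter"]
all_doc_tokens, o2t, t2o = build_index_maps(doc_tokens, toy_tokenize)
print(all_doc_tokens, o2t, t2o)
print(project_span(1, 1, o2t, len(doc_tokens), len(all_doc_tokens)))
```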

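Next is _improve_answer_span. The SQuAD annotations are character based, so after projecting an answer onto whitespace-separated words and then tokenizing, a tighter span that matches the answer text exactly can often be found, and this function looks for it. It does not always succeed: an annotator may mark "Japan" as a sub-span of the word "Japanese", but the WordPiece tokenizer does not split "Japanese", so "Japanese" is used as the label. This case is rare but does happen.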
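A minimal sketch of that search, assuming a toy tokenizer (lowercase, then split off punctuation with a regular expression) in place of WordPiece. The sentences are illustrative, and the function below is a reconstruction of the idea rather than a verbatim copy of the BERT code.

```python
import re


def toy_tokenize(text):
    # Assumed stand-in for WordPiece: lowercase and separate punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def improve_answer_span(all_doc_tokens, input_start, input_end,
                        tokenize, orig_answer_text):
    """Shrink [input_start, input_end] to the sub-span matching the answer."""
    tok_answer_text = " ".join(tokenize(orig_answer_text))
    for new_start in range(input_start, input_end + 1):
        for new_end in range(input_end, new_start - 1, -1):
            text_span = " ".join(all_doc_tokens[new_start:new_end + 1])
            if text_span == tok_answer_text:
                return new_start, new_end
    # No exact match (the "Japan" inside "Japanese" case): keep the input span.
    return input_start, input_end


context_tokens = toy_tokenize("The leader was John Smith (1895-1943).")
# The whitespace word "(1895-1943)." covers sub-tokens 5..10.
print(improve_answer_span(context_tokens, 5, 10, toy_tokenize, "1895"))

japan_tokens = toy_tokenize("The Japanese electronics industry")
print(improve_answer_span(japan_tokens, 1, 1, toy_tokenize, "Japan"))
```

In the first call the span narrows to the single token "1895"; in the second no narrower match exists, so the original span is returned unchanged.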

max_tokens_for_doc = max_seq_length - len(query_tokens) - 3, where the -3 accounts for [CLS], [SEP] and [SEP].

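To handle documents longer than the maximum sequence length, a sliding window splits the document into overlapping doc spans. When a token appears in more than one span, the span that gives it the most context is the one treated as its "max context" span.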

# Doc: the man went to the store and bought a gallon of milk
# Span A: the man went to the
# Span B: to the store and bought
# Span C: and bought a gallon of
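The spans listed above can be reproduced with a short sliding-window sketch. The window size of 5 and stride of 3 are assumptions picked to match this toy sentence (in run_squad.py they correspond to max_tokens_for_doc and the doc_stride flag), and make_doc_spans is an illustrative name.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def make_doc_spans(num_tokens, max_tokens_for_doc, doc_stride):
    """Cover num_tokens with overlapping windows of at most max_tokens_for_doc."""
    doc_spans = []
    start_offset = 0
    while start_offset < num_tokens:
        length = min(num_tokens - start_offset, max_tokens_for_doc)
        doc_spans.append(DocSpan(start=start_offset, length=length))
        if start_offset + length == num_tokens:
            break  # the last window reached the end of the document
        start_offset += min(length, doc_stride)
    return doc_spans


words = "the man went to the store and bought a gallon of milk".split()
for span in make_doc_spans(len(words), max_tokens_for_doc=5, doc_stride=3):
    print(span, " ".join(words[span.start:span.start + span.length]))
```

With these settings the first three windows are Span A, Span B and Span C, followed by a shorter final window that covers the end of the sentence.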

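For the word "bought", span B gives it 4 tokens of left context and 0 of right context, while span C gives it 1 token of left context and 3 of right context. The score is defined as score = min(num_left_context, num_right_context) + 0.01 * doc_span.length, and the span with the larger score wins, so C is chosen here (1.05 versus 0.05). It is a neat design.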

create_model:

# Excerpt from the body of create_model; `model` is the BertModel built earlier.
# Token-level BERT output, shape [batch_size, seq_length, hidden_size].
final_hidden = model.get_sequence_output()

final_hidden_shape = modeling.get_shape_list(final_hidden, expected_rank=3)
batch_size = final_hidden_shape[0]
seq_length = final_hidden_shape[1]
hidden_size = final_hidden_shape[2]

# One weight row per output: row 0 scores "start", row 1 scores "end".
output_weights = tf.get_variable(
    "cls/squad/output_weights", [2, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

output_bias = tf.get_variable(
    "cls/squad/output_bias", [2], initializer=tf.zeros_initializer())

# Flatten batch and sequence so every token goes through the same dense layer.
final_hidden_matrix = tf.reshape(final_hidden,
                                 [batch_size * seq_length, hidden_size])
logits = tf.matmul(final_hidden_matrix, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)

# Restore [batch_size, seq_length, 2], then move the start/end axis to the front.
logits = tf.reshape(logits, [batch_size, seq_length, 2])
logits = tf.transpose(logits, [2, 0, 1])

# Split into two tensors of shape [batch_size, seq_length].
unstacked_logits = tf.unstack(logits, axis=0)

(start_logits, end_logits) = (unstacked_logits[0], unstacked_logits[1])

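The model is also simple: take the BERT output, apply a fully connected layer (matrix multiply plus bias), then reshape and unstack the result.
① Multiply and add the bias; the result has shape [batch_size * seq_length, 2]
② Reshape to [batch_size, seq_length, 2]
③ tf.transpose to [2, batch_size, seq_length]
④ tf.unstack: unstacked_logits = tf.unstack(logits, axis=0)
This gives the final (start_logits, end_logits) = (unstacked_logits[0], unstacked_logits[1]), each of shape [batch_size, seq_length].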
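To see what the four steps compute without TensorFlow, here is a plain-Python sketch that produces the same two outputs directly from nested lists. It skips the reshape and transpose bookkeeping and simply scores each token twice; the function name span_logits and the toy numbers are illustrative assumptions.

```python
def span_logits(final_hidden, output_weights, output_bias):
    """final_hidden: [batch][seq][hidden]; output_weights: [2][hidden]; output_bias: [2].

    Returns (start_logits, end_logits), each shaped [batch][seq].
    """
    start_logits, end_logits = [], []
    for sequence in final_hidden:
        start_row, end_row = [], []
        for hidden in sequence:
            # Dense layer with two outputs: dot product with each weight row, plus bias.
            scores = [sum(w * h for w, h in zip(weight_row, hidden)) + bias
                      for weight_row, bias in zip(output_weights, output_bias)]
            start_row.append(scores[0])
            end_row.append(scores[1])
        start_logits.append(start_row)
        end_logits.append(end_row)
    return start_logits, end_logits


# batch_size=1, seq_length=3, hidden_size=2
final_hidden = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
output_weights = [[1.0, 0.0], [0.0, 1.0]]
output_bias = [0.0, 1.0]
start_logits, end_logits = span_logits(final_hidden, output_weights, output_bias)
print(start_logits)
print(end_logits)
```

Every token gets one start score and one end score from the same shared weights, which is why a single [2, hidden_size] matrix is enough for the whole task head.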

From here on, the rest is straightforward and follows the usual pattern.