Understanding BERT (Paper + TensorFlow Source Code)

BERT kept showing up in my feed for months. All I knew was that it was Google's remarkably effective pre-trained model, but the moment I heard it had to be trained on TPUs my reaction was "thanks, but no thanks." Lately, though, everything people are discussing and researching seems to revolve around BERT and the family of pre-trained models around it, which says a lot about where the field is heading. I finally found time to sit down and read the paper and its source code carefully, and this post is my write-up. BERT has been out for a while and has already been dissected thoroughly by many others, so if anything here is inaccurate or incomplete, corrections are welcome.

I. Preface

BERT, short for Bidirectional Encoder Representations from Transformers, is an NLP pre-training model released by Google in November 2018. On release it swept the state of the art across a wide range of NLP tasks, and Google generously released both the source code and the pre-trained models. It is fair to call it an ImageNet-style milestone for NLP.

PS: I have to say, of all the pre-training papers I have read, this one has the clearest line of reasoning and the chapter organization that best matches how I think; I read it in one sitting.

So what exactly is this much-hyped model? Let's take a look.

II. How BERT Works

1. Model Architecture

BERT's full name already says it all: it is the encoder of a bidirectional Transformer. Unsurprisingly, Google built it on its own Transformer, which has proven to be an excellent base model.

The genuinely novel part is the bidirectionality. The pre-training papers I had read before basically all trained the language model in a single direction; even ELMo, which claims to be bidirectional, actually trains a forward model and a backward model independently. Under the usual LM training setup the model has to be unidirectional: if you made it bidirectional, say by concatenating ELMo's forward and backward outputs and feeding them into the next layer, then at that layer the model has already seen the tokens to the right, so by the time the next layer is trained the prediction targets have long since leaked, and there is nothing left to learn. How does BERT avoid this problem?

The answer is that BERT's bidirectionality rests on the MLM task it defines, a different way of training a language model that admits bidirectional context; this is covered in detail below.

Google defines two model sizes:

  • $BERT_{BASE}$: L=12, H=768, A=12, 110M parameters in total, sized to match GPT for comparison (a rough parameter-count sketch follows this list)
  • $BERT_{LARGE}$: L=24, H=1024, A=16, 340M parameters in total
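As a quick sanity check on the 110M figure, here is a rough back-of-the-envelope count for $BERT_{BASE}$. This is my own approximation, not a breakdown from the paper: biases and LayerNorm parameters are ignored and the vocabulary size is taken as roughly 30k.

# Rough BERT_BASE parameter count (approximation; biases/LayerNorm ignored)
V, P, L, H, FF = 30000, 512, 12, 768, 4 * 768   # vocab, positions, layers, hidden, FFN size

embeddings = V * H + P * H + 2 * H   # word + position + segment embeddings
per_layer = 4 * H * H + 2 * H * FF   # Q/K/V/output projections + feed-forward
pooler = H * H

total = embeddings + L * per_layer + pooler
print(total / 1e6)                   # ~109 million, consistent with the reported 110M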

2. Model Input

BERT's input can be a single sentence or two sentences packed together. This accommodates the input formats of different downstream tasks, e.g. single-sentence sentiment classification versus two-sentence relation judgments. During pre-training, however, the input is always two segments packed together.

Concretely, BERT's input is WordPiece-tokenized, with a vocabulary of 30,000 tokens and a maximum input length of 512. (Why define this maximum? As we will see in the source code section, the inputs are not individual natural sentences but groups of adjacent sentences packed together to get as close to this length as possible, which reduces the computation wasted on padding.)
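For reference, these WordPiece tokens are produced by the repo's tokenization.FullTokenizer. A minimal usage sketch (the vocab path is a placeholder for the vocab.txt of a released checkpoint, and the exact subword split depends on that vocab):

import tokenization  # tokenization.py from the google-research/bert repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("The unaffable man walked away.")
# Rare words are split into subword pieces prefixed with "##",
# e.g. something like ['the', 'un', '##aff', '##able', 'man', 'walked', 'away', '.']
ids = tokenizer.convert_tokens_to_ids(tokens)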

The construction of the input is illustrated in the figure below:

A few interesting points here:

  1. The first position holds the special [CLS] token, a dedicated timestep whose job is to absorb a representation of the whole sequence; it can later be used for classification tasks. Quite clever!
  2. If the input is always a sentence pair, how are the two sentences told apart? In two ways: 1) a [SEP] separator token is inserted between them; 2) a segment embedding explicitly marks each token with its sentence id; for single-sentence input, a single segment id is used throughout. (A small worked example follows this list.)
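For example, for the sentence pair "my dog is cute" / "he likes playing" (the illustration used in the paper), the packed input looks like this:

tokens:       [CLS]  my  dog  is  cute  [SEP]  he  likes  play  ##ing  [SEP]
segment_ids:    0    0    0   0    0      0    1    1      1      1      1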

3. Pre-training Tasks

This is the heart of the matter and the essence of BERT's pre-training. BERT defines two tasks to learn from, quite different from conventional LM training: one is MLM and the other is NSP. Let's go through them in turn:

  1. The MLM task (Masked LM)

This is the key supporting task that lets the Transformer above be trained bidirectionally. It can be seen as an alternative way of training a language model.

The idea: randomly mask out some tokens in a sequence and have the model predict them using both the left and the right context, rather like a cloze test.

The paper masks roughly 15% of the WordPiece tokens in each sequence. However, since the fine-tuning inputs never contain the [MASK] token, this would create a mismatch between the pre-training and fine-tuning input distributions, so the authors add a clever twist. Once a token has been chosen for masking:

  • 80% of the time: replace it with [MASK]
  • 10% of the time: replace it with a random token. This feels like noise injection that should make the model more robust, and 15% x 10% = 1.5% noise is unlikely to hurt the LM's modeling ability
  • 10% of the time: keep it unchanged, to stay close to the real distribution seen at fine-tuning time

The paper also raises the question: if only 15% of the tokens are predicted each step, won't convergence be slow? The authors' confident answer: yes, it converges more slowly than a unidirectional LM, but the slowdown is well worth the gain in quality. Only Google can afford to be this relaxed about compute.

  2. The NSP task (Next Sentence Prediction)

This task is designed for downstream tasks that take a sentence pair and must reason about the relationship between the two sentences, such as QA and NLI. The authors argue that language modeling alone does not capture this kind of knowledge, so they add NSP as a separate pre-training task.

The insight behind the task is very direct: given two sentences, predict whether the second one actually follows the first, a binary classification problem. When constructing the input, the true next sentence is chosen 50% of the time, and a sentence that is not the true next one is chosen the other 50%. For example:
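The paper's own illustration is along these lines (with the MLM masking applied as well):

Input  = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label  = IsNext

Input  = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label  = NotNext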

The paper reports that the final accuracy on this task reaches roughly 97%-98%.

4. Pre-training Procedure

The corpus is BooksCorpus (800M words) plus English Wikipedia (2,500M words). The authors stress that a document-level corpus works better than one of shuffled sentences (such as the Billion Word Benchmark), because only the former allows extracting long contiguous spans, which is what BERT's data construction needs.

I paid particular attention to the training time: the BASE version uses 16 TPU chips and the LARGE version uses 64, each trained for 4 days. That is an extraordinary amount of compute.

5. Fine-tuning Procedure

Fine-tuning mostly involves small, task-specific changes to the top of the model, following the principle of modifying as little as possible. The experiments section below describes what the modification looks like for each task.

III. Experiments

1. Evaluation Datasets

The paper evaluates the method on 11 tasks: 8 tasks from GLUE, plus SQuAD 1.1, NER, and SWAG. (The first 8 below all come from GLUE, the General Language Understanding Evaluation benchmark, a standard suite containing multiple tasks.)

  1. MNLI: Multi-Genre Natural Language Inference. Given a sentence pair, decide whether the second sentence is an entailment, a contradiction, or neutral with respect to the first; 3-way classification
  2. QQP: Quora Question Pairs. Given a sentence pair, decide whether the two questions are semantically equivalent; binary classification
  3. QNLI: Question Natural Language Inference. Given a (question, sentence) pair, decide whether the sentence contains the answer to the question; binary classification
  4. SST-2: Stanford Sentiment Treebank. Given a single sentence, classify its sentiment; binary classification
  5. CoLA: Corpus of Linguistic Acceptability. Given a single sentence, decide whether it is linguistically acceptable; binary classification
  6. STS-B: Semantic Textual Similarity Benchmark. Given a sentence pair, predict a similarity score from 1 to 5
  7. MRPC: Microsoft Research Paraphrase Corpus. Given a sentence pair, decide whether the two sentences are semantically equivalent; binary classification
  8. RTE: Recognizing Textual Entailment. Given a sentence pair, decide whether the first entails the second; binary classification
  9. SQuAD 1.1: Stanford Question Answering Dataset, the classic reading-comprehension task. Given a (question, paragraph) pair, predict the start and end positions of the answer within the paragraph (span prediction)
  10. CoNLL-2003 NER: named entity recognition. Given a single sentence, label every token
  11. SWAG: Situations With Adversarial Generations. The original task gives a sentence and 4 candidate continuations and asks which one follows; here it is cast as a classification over input pairs, 4-way (a rough scoring sketch follows this list)
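For item 11, the conversion works roughly as follows: each of the four (context, candidate ending) pairs is encoded separately, the [CLS] vector of each pair is scored by a dot product with a learned vector, and a softmax over the four scores picks the answer. A hedged sketch of that scoring (this is not code from the repo; pooled_output, hidden_size, and labels are assumed to come from a BertModel built as in the fine-tuning section below):

import tensorflow as tf

# pooled_output: [batch * 4, hidden_size], the [CLS] representation of each
# (context, candidate-ending) pair, with an example's 4 candidates kept adjacent.
score_vector = tf.get_variable(
    "swag_score_vector", [hidden_size, 1],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

scores = tf.matmul(pooled_output, score_vector)   # [batch * 4, 1]
scores = tf.reshape(scores, [-1, 4])              # [batch, 4]
log_probs = tf.nn.log_softmax(scores, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=4, dtype=tf.float32)
loss = -tf.reduce_mean(tf.reduce_sum(one_hot_labels * log_probs, axis=-1))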

2. Model and Task Adaptation

The figure below illustrates how BERT is applied to the tasks above:

In the figure, (a) and (b) are classification tasks, where the input can be a single sentence or a sentence pair and the output is taken from the [CLS] representation; (c) and (d) are token-level labeling tasks. In essentially every case only a single classification layer is added on top, which really is a minimal modification.

3. Experimental Results

Results on the 8 GLUE tasks:

Results on SQuAD:

Results on CoNLL-2003 NER:

Results on SWAG:

4. Some Analysis

The authors ablate different combinations of pre-training tasks:

  • No NSP: pre-train with MLM only, without the NSP task
  • LTR & No NSP: train a left-to-right (LTR) LM instead of MLM, and drop NSP as well
  • + BiLSTM: additionally stack a randomly initialized BiLSTM on top during fine-tuning

The authors' conclusion is that MLM with bidirectional training is the decisive improvement in BERT. Personally I think NSP also matters, even though removing it does not hurt nearly as much.

They also analyze the effect of model size:

Bigger is essentially always better, regardless of dataset size and regardless of how similar the downstream task is to the pre-training tasks.

They further analyze the effect of the number of pre-training steps:

The conclusion is that MLM has more headroom: it converges more slowly, but ends up better.

Finally, the authors compare using BERT as an offline feature extractor:

This is evaluated on CoNLL-2003 by feeding the extracted features into a two-layer BiLSTM; concatenating the outputs of the last four layers works best. This is indirect evidence that BERT works well both for fine-tuning and for feature extraction, though in practice you should verify on your own task.
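A minimal sketch of the "concatenate the last four layers" feature extraction (this is not the repo's extract_features.py; bert_config, input_ids, input_mask, and segment_ids are assumed to be built the same way as in the pre-training code of Section IV, and all_encoder_layers is the attribute set in the BertModel constructor shown there):

import tensorflow as tf
import modeling  # modeling.py from the google-research/bert repo

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

# List of per-layer outputs, each of shape [batch, seq_len, hidden].
all_layers = model.all_encoder_layers
# Contextual token features: [batch, seq_len, 4 * hidden].
token_features = tf.concat(all_layers[-4:], axis=-1)
# These features can then be fed to a downstream tagger such as a 2-layer BiLSTM.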

IV. TensorFlow Implementation

I read the original TensorFlow source here. PyTorch is what I am more comfortable with, and there is indeed a PyTorch port recommended by Google, but I still wanted to dissect Google's official TensorFlow code.

Compared with GPT and GPT-2, the BERT release is anything but stingy: it includes the pre-training model architecture, the training code, the fully pre-trained weights, and fine-tuning code for several tasks, plus a collection of tips and common errors. Importantly, BERT also provides an ELMo-style interface for extracting embeddings to a file in advance. Interested readers should go through the repository for the details.

Below I split the walkthrough into a pre-training part and a fine-tuning part. My TensorFlow is not the strongest, so please let me know if I misread anything.

1. Unsupervised pre-training

In the pre-training stage, both the construction of the inputs and the training of the model matter.

  1. Constructing the input data:

Take the sample_text.txt shipped with the BERT source as an example; its content is:

This text is included to make sure Unicode is handled properly: 力加勝北区ᴵᴺᵀᵃছজটডণত
Text should be one-sentence-per-line, with empty lines between documents.
This sample text is public domain and was randomly selected from Project Guttenberg.

The rain had only ceased with the gray streaks of morning at Blazing Star, and the settlement awoke to a moral sense of cleanliness, and the finding of forgotten knives, tin cups, and smaller camp utensils, where the heavy showers had washed away the debris and dust heaps before the cabin doors.
Indeed, it was recorded in Blazing Star that a fortunate early riser had once picked up on the highway a solid chunk of gold quartz which the rain had freed from its incumbering soil, and washed into immediate and glittering popularity.
Possibly this may have been the reason why early risers in that locality, during the rainy season, adopted a thoughtful habit of body, and seldom lifted their eyes to the rifted or india-ink washed skies above them.
"Cass" Beard had risen early that morning, but not with a view to discovery.
A leak in his cabin roof,--quite consistent with his careless, improvident habits,--had roused him at 4 A. M., with a flooded "bunk" and wet blankets.
The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor's to supply the deficiency.
This was nearly opposite.
Mr. Cassius crossed the highway, and stopped suddenly.
Something glittered in the nearest red pool before him.
Gold, surely!
But, wonderful to relate, not an irregular, shapeless fragment of crude ore, fresh from Nature's crucible, but a bit of jeweler's handicraft in the form of a plain gold ring.
Looking at it more attentively, he saw that it bore the inscription, "May to Cass."
Like most of his fellow gold-seekers, Cass was superstitious.

The fountain of classic wisdom, Hypatia herself.
As the ancient sage--the name is unimportant to a monk--pumped water nightly that he might study by day, so I, the guardian of cloaks and parasols, at the sacred doors of her lecture-room, imbibe celestial knowledge.
From my youth I felt in me a soul above the matter-entangled herd.
She revealed to me the glorious fact, that I am a spark of Divinity itself.
A fallen star, I am, sir!' continued he, pensively, stroking his lean stomach--'a fallen star!--fallen, if the dignity of philosophy will allow of the simile, among the hogs of the lower world--indeed, even into the hog-bucket itself. Well, after all, I will show you the way to the Archbishop's.
There is a philosophic pleasure in opening one's treasures to the modest young.
Perhaps you will assist me by carrying this basket of fruit?' And the little man jumped up, put his basket on Philammon's head, and trotted off up a neighbouring street.
Philammon followed, half contemptuous, half wondering at what this philosophy might be, which could feed the self-conceit of anything so abject as his ragged little apish guide;
but the novel roar and whirl of the street, the perpetual stream of busy faces, the line of curricles, palanquins, laden asses, camels, elephants, which met and passed him, and squeezed him up steps and into doorways, as they threaded their way through the great Moon-gate into the ample street beyond, drove everything from his mind but wondering curiosity, and a vague, helpless dread of that great living wilderness, more terrible than any dead wilderness of sand which he had left behind.
Already he longed for the repose, the silence of the Laura--for faces which knew him and smiled upon him; but it was too late to turn back now.
His guide held on for more than a mile up the great main street, crossed in the centre of the city, at right angles, by one equally magnificent, at each end of which, miles away, appeared, dim and distant over the heads of the living stream of passengers, the yellow sand-hills of the desert;
while at the end of the vista in front of them gleamed the blue harbour, through a network of countless masts.
At last they reached the quay at the opposite end of the street;
and there burst on Philammon's astonished eyes a vast semicircle of blue sea, ringed with palaces and towers.
He stopped involuntarily; and his little guide stopped also, and looked askance at the young monk, to watch the effect which that grand panorama should produce on him.

The source code's requirements for this input file are: 1) one sentence per line, ideally an actual sentence from real text rather than a whole paragraph or an arbitrary fragment; 2) documents separated by blank lines, so that the NSP task never crosses document boundaries.

When files like this are passed in as input_files, they are processed by the following code:

rng = random.Random(FLAGS.random_seed)
instances = create_training_instances(
  input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
  FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
  rng)
print(instances[0])

An example of the resulting output:

tokens: [CLS] for more than a [MASK] up [MASK] great main street , [MASK] [MASK] [MASK] centre of the city , at right angles , [MASK] one equally magnificent , at each end ##ミ which , miles away , appeared , dim and distant over the heads of the living stream of passengers , the yellow [MASK] - hills of [MASK] [MASK] ; while at the end of the vista in front [MASK] them gleamed the blue harbour , through a network [SEP] possibly this may have been the reason why early rise ##rs in [MASK] locality , during the rainy season , adopted [MASK] [MASK] [MASK] of body , and seldom lifted [MASK] eyes [MASK] the rift [MASK] or india - ink washed skies above them . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: True
masked_lm_positions: 5 7 12 13 14 24 32 55 59 60 71 94 103 104 105 109 112 114 117
masked_lm_labels: mile the crossed in the by of sand the desert of that a thoughtful habit and their to ##ed

As shown, the output contains:

  • tokens: the token sequence after packing the two segments together and adding [CLS], [SEP], and [MASK]
  • segment_ids: the segment markers, 0 for the first segment and 1 for the second
  • is_random_next: whether the second segment really follows the first; the label for the NSP task
  • masked_lm_positions: for the MLM task, the positions that were masked
  • masked_lm_labels: for the MLM task, the original tokens at the masked positions

So what exactly does create_training_instances do internally? See the code below:

def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  """Create `TrainingInstance`s from raw text."""
  all_documents = [[]]
  for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
      while True:
        line = tokenization.convert_to_unicode(reader.readline())
        if not line:
          break
        line = line.strip()

        # Empty lines are used as document delimiters
        if not line:
          all_documents.append([])
        tokens = tokenizer.tokenize(line)
        if tokens:
          all_documents[-1].append(tokens)

  # Remove empty documents
  all_documents = [x for x in all_documents if x]
  rng.shuffle(all_documents)

  vocab_words = list(tokenizer.vocab.keys())
  instances = []
  for _ in range(dupe_factor):
    for document_index in range(len(all_documents)):
      instances.extend(
          create_instances_from_document(
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

  rng.shuffle(instances)
  return instances

The first part reads the files, tokenizes, and shuffles, producing all_documents, a three-level nested list of documents, sentences, and tokens. The per-document processing happens in create_instances_from_document. The dupe_factor parameter controls how many passes are made over all the documents: the masking is random each time, so duplicating the data this way yields differently masked copies of the same text.
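To make the three-level structure concrete, after reading the sample file above, all_documents looks roughly like this (an illustration only, with token lists abbreviated rather than actual tokenizer output):

all_documents = [
    [  # document 0: one list of WordPiece tokens per input line
        ["this", "text", "is", "included", "to", "make", "sure", "..."],
        ["text", "should", "be", "one", "-", "sentence", "-", "per", "-", "line", "..."],
    ],
    [  # document 1, started by the blank line in the file
        ["the", "rain", "had", "only", "ceased", "with", "the", "gray", "..."],
        ["indeed", ",", "it", "was", "recorded", "in", "blazing", "star", "..."],
    ],
]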

Next, the create_instances_from_document function:

def create_instances_from_document(
    all_documents, document_index, max_seq_length, short_seq_prob,
    masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
  """Creates `TrainingInstance`s for a single document."""
  document = all_documents[document_index]

  # Account for [CLS], [SEP], [SEP]
  max_num_tokens = max_seq_length - 3
  
  target_seq_length = max_num_tokens
  if rng.random() < short_seq_prob:
    target_seq_length = rng.randint(2, max_num_tokens)

  instances = []
  current_chunk = []
  current_length = 0
  i = 0
  while i < len(document):
    segment = document[i]
    current_chunk.append(segment)
    current_length += len(segment)
    if i == len(document) - 1 or current_length >= target_seq_length:
      if current_chunk:
        a_end = 1
        if len(current_chunk) >= 2:
          a_end = rng.randint(1, len(current_chunk) - 1)

        tokens_a = []
        for j in range(a_end):
          tokens_a.extend(current_chunk[j])

        tokens_b = []
        # Random next
        is_random_next = False
        if len(current_chunk) == 1 or rng.random() < 0.5:
          is_random_next = True
          target_b_length = target_seq_length - len(tokens_a)

          for _ in range(10):
            random_document_index = rng.randint(0, len(all_documents) - 1)
            if random_document_index != document_index:
              break

          random_document = all_documents[random_document_index]
          random_start = rng.randint(0, len(random_document) - 1)
          for j in range(random_start, len(random_document)):
            tokens_b.extend(random_document[j])
            if len(tokens_b) >= target_b_length:
              break
              
          num_unused_segments = len(current_chunk) - a_end
          i -= num_unused_segments
        # Actual next
        else:
          is_random_next = False
          for j in range(a_end, len(current_chunk)):
            tokens_b.extend(current_chunk[j])
        truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

        assert len(tokens_a) >= 1
        assert len(tokens_b) >= 1

        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
          tokens.append(token)
          segment_ids.append(0)

        tokens.append("[SEP]")
        segment_ids.append(0)

        for token in tokens_b:
          tokens.append(token)
          segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

        (tokens, masked_lm_positions,
         masked_lm_labels) = create_masked_lm_predictions(
             tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
        instance = TrainingInstance(
            tokens=tokens,
            segment_ids=segment_ids,
            is_random_next=is_random_next,
            masked_lm_positions=masked_lm_positions,
            masked_lm_labels=masked_lm_labels)
        instances.append(instance)
      current_chunk = []
      current_length = 0
    i += 1

  return instances

It is long, but the overall idea is:

  • Using max_num_tokens (how this value is computed is explained at the end), limit the length of the instance's token sequence: keep appending sentences from the document to the current chunk until the accumulated length reaches the target length (or the document ends).
  • Within that chunk of sentences, pick a random split point; everything before it becomes sentence A for the NSP task, and what follows is handled in the next step.
  • With probability 0.5, use the sentences after the split point as sentence B and set is_random_next to False; with the other 0.5 probability, pick a random document, pick a random starting sentence in it, take enough sentences to fill the remaining length as sentence B, and set is_random_next to True.
  • Truncate the pair down to max_num_tokens: repeatedly take the longer of sentence A and sentence B and randomly drop a token from its front or back until the pair fits.
  • Pass the resulting sequence through create_masked_lm_predictions to generate the masks needed for MLM.

Note that NSP is not built from two physically adjacent sentences but from two adjacent groups of sentences, which reduces the computation wasted on padding. As for max_num_tokens, it is simply max_seq_length minus 3 (for [CLS] and the two [SEP] tokens), and the short_seq_prob parameter occasionally shortens the target length: because the sentence-group packing above inevitably differs somewhat from the inputs seen during fine-tuning, shorter sequences are produced with some probability to reduce the mismatch.
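The truncate_seq_pair helper called in the code above implements exactly this truncation. A sketch consistent with that behavior (the repo's version may differ in small details):

def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
  """Truncates a pair of token lists in place until their total length fits."""
  while len(tokens_a) + len(tokens_b) > max_num_tokens:
    # Always shorten the longer of the two sequences...
    trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
    assert len(trunc_tokens) >= 1
    # ...and randomly drop from the front or the back to avoid a positional bias.
    if rng.random() < 0.5:
      del trunc_tokens[0]
    else:
      trunc_tokens.pop()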

The MLM side is built by create_masked_lm_predictions, whose input is the NSP token sequence constructed above. Its code is as follows:

PS: MLM construction never sees a single-sentence input; during pre-training, BERT's input takes one form only, a pack of two sentence groups.

def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
  """Creates the predictions for the masked LM objective."""

  cand_indexes = []
  for (i, token) in enumerate(tokens):
    if token == "[CLS]" or token == "[SEP]":
      continue
    cand_indexes.append(i)

  rng.shuffle(cand_indexes)

  output_tokens = list(tokens)

  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []
  covered_indexes = set()
  for index in cand_indexes:
    if len(masked_lms) >= num_to_predict:
      break
    if index in covered_indexes:
      continue
    covered_indexes.add(index)

    masked_token = None
    # 80% of the time, replace with [MASK]
    if rng.random() < 0.8:
      masked_token = "[MASK]"
    else:
      # 10% of the time, keep original
      if rng.random() < 0.5:
        masked_token = tokens[index]
      # 10% of the time, replace with random word
      else:
        masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

    output_tokens[index] = masked_token

    masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

  masked_lms = sorted(masked_lms, key=lambda x: x.index)

  masked_lm_positions = []
  masked_lm_labels = []
  for p in masked_lms:
    masked_lm_positions.append(p.index)
    masked_lm_labels.append(p.label)

  return (output_tokens, masked_lm_positions, masked_lm_labels)

The core is exactly what the paper describes: sample at most num_to_predict tokens to mask, then replace each selected token according to the 8:1:1 split.

  2. Pre-training the model

The main things to look at are the model construction and the loss computation:

model = modeling.BertModel(
    config=bert_config,
    is_training=is_training,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=use_one_hot_embeddings)

(masked_lm_loss,
 masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
     bert_config, model.get_sequence_output(), model.get_embedding_table(),
     masked_lm_positions, masked_lm_ids, masked_lm_weights)

(next_sentence_loss, next_sentence_example_loss,
 next_sentence_log_probs) = get_next_sentence_output(
     bert_config, model.get_pooled_output(), next_sentence_labels)

total_loss = masked_lm_loss + next_sentence_loss

Here BertModel builds the model, get_masked_lm_output computes the MLM loss, and get_next_sentence_output computes the NSP loss. Let's go through them one by one.

First, BertModel:

with tf.variable_scope(scope, default_name="bert"):
  with tf.variable_scope("embeddings"):
    # Perform embedding lookup on the word ids.
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)

    # Add positional embeddings and token type embeddings, then layer
    # normalize and perform dropout.
    self.embedding_output = embedding_postprocessor(
        input_tensor=self.embedding_output,
        use_token_type=True,
        token_type_ids=token_type_ids,
        token_type_vocab_size=config.type_vocab_size,
        token_type_embedding_name="token_type_embeddings",
        use_position_embeddings=True,
        position_embedding_name="position_embeddings",
        initializer_range=config.initializer_range,
        max_position_embeddings=config.max_position_embeddings,
        dropout_prob=config.hidden_dropout_prob)

  with tf.variable_scope("encoder"):
    attention_mask = create_attention_mask_from_input_mask(
        input_ids, input_mask)

    # Run the stacked transformer.
    # `sequence_output` shape = [batch_size, seq_length, hidden_size].
    self.all_encoder_layers = transformer_model(
        input_tensor=self.embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        intermediate_act_fn=get_activation(config.hidden_act),
        hidden_dropout_prob=config.hidden_dropout_prob,
        attention_probs_dropout_prob=config.attention_probs_dropout_prob,
        initializer_range=config.initializer_range,
        do_return_all_layers=True)

  self.sequence_output = self.all_encoder_layers[-1]
  with tf.variable_scope("pooler"):
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    self.pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,
        activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))

The code is quite clear. First the embedding layer maps token ids to embedding vectors; then embedding_postprocessor adds the position embedding and the segment embedding; then the encoder layer builds a standard stack of Transformer encoder blocks; finally the pooler layer, which exists for the NSP task, passes the output at the first token ([CLS]) through a linear layer for the subsequent classification.
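To summarize the interface the pre-training code above relies on (the shape comments are my own annotations):

sequence_output = model.get_sequence_output()   # [batch, seq_len, hidden]: last encoder layer, fed to the MLM head
pooled_output   = model.get_pooled_output()     # [batch, hidden]: tanh-transformed [CLS] vector, fed to the NSP head
embedding_table = model.get_embedding_table()   # [vocab_size, hidden]: reused as the MLM output projection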

Next, the get_masked_lm_output function:

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    label_ids = tf.reshape(label_ids, [-1])
    label_weights = tf.reshape(label_weights, [-1])

    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

    # The `positions` tensor might be zero-padded (if the sequence is too
    # short to have the maximum number of predictions). The `label_weights`
    # tensor has a value of 1.0 for every real prediction and 0.0 for the
    # padding predictions.
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator

Here the hidden states first go through an extra linear transform, then the embedding table from the input layer is reused as the classifier weights (output_weights, passed in as an argument) to project onto the vocabulary, and the loss is computed over the masked positions, with label_weights zeroing out padded prediction slots as the comment in the code notes.

Finally, the get_next_sentence_output function:

  with tf.variable_scope("cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights",
        shape=[2, bert_config.hidden_size],
        initializer=modeling.create_initializer(bert_config.initializer_range))
    output_bias = tf.get_variable(
        "output_bias", shape=[2], initializer=tf.zeros_initializer())

    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    labels = tf.reshape(labels, [-1])
    one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

This is just a standard binary classifier.

PS: One small detail. When computing the MLM loss from BERT's output, an extra layer is inserted before the softmax and thrown away once pre-training ends. When computing the NSP loss, an extra layer (the pooler) is also inserted before the softmax, but that one is kept permanently. My guess at the reason: when BERT's output is later used as a representation, you do not want the representation that was directly used to predict masked tokens, so the layer feeding MLM is discarded; for downstream sentence classification, on the other hand, you do want a vector tailored for classification, which is why the linear layer feeding NSP is kept.

2. Supervised fine-tuning

The source provides fine-tuning examples for two tasks: the sentence-pair classification task MRPC and the question-answering task SQuAD. Let's look at each in turn.

  1. The MRPC task

Again, the focus is on how the model and the loss are built:

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  output_layer = model.get_pooled_output()

  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)

This simply reuses the BERT model from before, takes the representation of the first token (via get_pooled_output), and puts a classifier with the appropriate number of classes on top.

  2. The SQuAD task

Likewise, the model and loss construction:

(start_logits, end_logits) = create_model(
        bert_config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)
        
def compute_loss(logits, positions):
    one_hot_positions = tf.one_hot(
        positions, depth=seq_length, dtype=tf.float32)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    loss = -tf.reduce_mean(
        tf.reduce_sum(one_hot_positions * log_probs, axis=-1))
    return loss

start_positions = features["start_positions"]
end_positions = features["end_positions"]

start_loss = compute_loss(start_logits, start_positions)
end_loss = compute_loss(end_logits, end_positions)

total_loss = (start_loss + end_loss) / 2.0

Here create_model first produces the start and end logits at every position. compute_loss then applies a softmax over all positions and computes the cross-entropy against the target position; the final loss is the average of the start loss and the end loss. The code of create_model is:

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  final_hidden = model.get_sequence_output()

  final_hidden_shape = modeling.get_shape_list(final_hidden, expected_rank=3)
  batch_size = final_hidden_shape[0]
  seq_length = final_hidden_shape[1]
  hidden_size = final_hidden_shape[2]

  output_weights = tf.get_variable(
      "cls/squad/output_weights", [2, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "cls/squad/output_bias", [2], initializer=tf.zeros_initializer())

  final_hidden_matrix = tf.reshape(final_hidden,
                                   [batch_size * seq_length, hidden_size])
  logits = tf.matmul(final_hidden_matrix, output_weights, transpose_b=True)
  logits = tf.nn.bias_add(logits, output_bias)

  logits = tf.reshape(logits, [batch_size, seq_length, 2])
  logits = tf.transpose(logits, [2, 0, 1])

  unstacked_logits = tf.unstack(logits, axis=0)

  (start_logits, end_logits) = (unstacked_logits[0], unstacked_logits[1])

  return (start_logits, end_logits)

So a single linear layer computes, for every position, a score for being the start of the answer span and a score for being the end.
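At inference time the two logit vectors still have to be turned into an answer span. A minimal decoding sketch (the real post-processing in run_squad.py additionally handles document strides, n-best lists, and mapping token positions back to the original text):

import numpy as np

def decode_span(start_logits, end_logits, max_answer_length=30, n_best=20):
  """Pick the (start, end) pair with the highest combined logit score."""
  start_candidates = np.argsort(start_logits)[::-1][:n_best]
  end_candidates = np.argsort(end_logits)[::-1][:n_best]
  best_span, best_score = (0, 0), -np.inf
  for s in start_candidates:
    for e in end_candidates:
      if e < s or e - s + 1 > max_answer_length:
        continue   # skip spans that end before they start or run too long
      score = start_logits[s] + end_logits[e]
      if score > best_score:
        best_span, best_score = (int(s), int(e)), score
  return best_span   # token positions of the predicted answer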

V. Summary

Strengths

  1. The bidirectional idea is very reasonable, and elegantly executed
  2. The design of the two pre-training tasks, MLM and NSP, is ingenious
  3. It genuinely works, and mere mortals like us can follow it
  4. The source code and documentation are exceptionally complete, as you would expect from Google

Weaknesses

  1. A lot of the design is still manual, e.g. the choice of the two tasks and the masking probabilities in MLM. These choices amount to baking in assumptions about the downstream tasks. What about a more diverse set of tasks? Is there a single unified pre-training objective? For instance, the idea running through GPT-2 is that a language model is an unsupervised multi-task learner, and that the LM alone is enough.
  2. The training details for the multilingual model are not given. How is the input constructed? Does multilingual training differ from the standard recipe? My guess: the input is sampled across languages with probabilities smoothed by corpus size, a shared multilingual subword vocabulary is built, and then the multilingual documents are simply concatenated and processed with the same pipeline. This is purely my own speculation; if anyone knows or has seen the details, please let me know.

Links

Paper: https://arxiv.org/abs/1810.04805
Code: https://github.com/google-research/bert (TensorFlow; the official release now includes multilingual models as well as Chinese)
https://github.com/huggingface/pytorch-pretrained-BERT (PyTorch)
https://github.com/soskek/bert-chainer (Chainer)
