BERT in Detail: Notes and Code Walkthrough

BERT Overview

BERT stands for Bidirectional Encoder Representations from Transformers, i.e. the encoder of a bidirectional Transformer (the decoder is not used, because it cannot see the information it is asked to predict). Take a large model (12 to 24 Transformer layers), train it on a large corpus (Wikipedia + BookCorpus) for a long time (1M update steps), and that is BERT.

  • The model's main innovation is the pre-training method: it uses two objectives, Masked LM and Next Sentence Prediction, to capture word-level and sentence-level representations respectively.

    • Masked LM --> word
    • Next Sentence Prediction --> sentence


Masking

In the original preprocessing code, WordPiece tokens are selected for masking at random.
For example:
Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head
Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

Whole Word Masking improvement:
Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

The idea behind the improvement:
Training is the same: each masked WordPiece token is still predicted independently. The gain comes from the fact that, for words that had been split into multiple WordPieces, the original prediction task was too "easy".

  • Predicting one masked WordPiece at a time is too easy; masking every WordPiece of the same word raises the difficulty (see the sketch below).
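
A minimal sketch of the whole-word selection idea (simplified; the hypothetical whole_word_mask helper below is not the official create_pretraining_data.py logic, which also applies the 80/10/10 rule and caps the total number of masked tokens). WordPieces starting with "##" are grouped with the preceding piece, and a whole group is masked together:

import random

def whole_word_mask(tokens, mask_prob=0.15):
  # Group WordPieces into whole words: a "##" piece continues the previous word.
  groups = []
  for i, tok in enumerate(tokens):
    if tok.startswith("##") and groups:
      groups[-1].append(i)
    else:
      groups.append([i])
  output = list(tokens)
  for group in groups:
    if random.random() < mask_prob:   # mask all pieces of the word, or none
      for i in group:
        output[i] = "[MASK]"
  return output

print(whole_word_mask("the man jumped up , put his basket on phil ##am ##mon ' s head".split()))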

Embedding

The input representation is the sum of three embeddings (a small sketch follows the list below):

  • Token Embeddings are the word vectors; the first token is the [CLS] marker, whose representation can later be used for classification tasks
  • Segment Embeddings distinguish the two sentences, because pre-training includes not only the LM objective but also a classification task that takes a sentence pair as input
  • Position Embeddings differ from the Transformer described in the earlier post: they are learned rather than fixed sinusoidal functions
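
As a rough sketch (toy shapes and random values only, not the actual modeling.py code), the three embeddings are simply added element-wise:

import numpy as np

batch, seq_len, hidden = 2, 8, 768
token_emb    = np.random.randn(batch, seq_len, hidden)  # looked up from the word-embedding table
segment_emb  = np.random.randn(batch, seq_len, hidden)  # sentence-A/B (segment) embeddings
position_emb = np.random.randn(seq_len, hidden)         # learned position table, broadcast over the batch

input_repr = token_emb + segment_emb + position_emb     # shape: (batch, seq_len, hidden)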

Pre-training Task 1: Masked Language Model

Why bidirectional?

The point is that when the pre-trained model is applied to other tasks, people want more than the context to the left of a word; they want the context on both sides.

  • During training the authors randomly mask 15% of the tokens, instead of predicting every word the way CBOW does. The final loss is computed only on the masked tokens.
Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon

Masking tricks:

How to mask also takes some care. Always substituting the [MASK] token (which never appears at inference time) would hurt the model, so among the randomly selected tokens 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged (see the sketch below).

  • Note that during Masked LM pre-training the model does not know which tokens were actually masked out, so it has to attend to every token.
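
A minimal sketch of the 80/10/10 rule for one token that has already been selected for masking (hypothetical helper; vocab is assumed to be a list of WordPiece strings):

import random

def mask_selected_token(token, vocab):
  r = random.random()
  if r < 0.8:
    return "[MASK]"              # 80%: replace with [MASK]
  elif r < 0.9:
    return random.choice(vocab)  # 10%: replace with a random token
  else:
    return token                 # 10%: keep the original token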

sequence_length:

  • Because long sequences (512) slow down training, 90% of the steps train with seq_len=128 and the remaining 10% of the steps train on inputs of length 512.

Pre-training Task 2: Next Sentence Prediction

Because tasks such as QA and NLI are involved, a second pre-training task was added.

  • The goal is to make the model understand the relationship between two sentences. The training input is a pair of sentences A and B, where B is the actual next sentence of A with 50% probability; given the two sentences, the model predicts whether B follows A. Pre-training reaches 97-98% accuracy on this task (a construction sketch follows the examples below).

Note: the authors specifically point out that the choice of corpus is critical; it should be document-level rather than sentence-level, so the model can learn to abstract over long contiguous sequences.

Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence
Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label: NotNextSentence
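
A construction sketch for such pairs (hypothetical, not the official create_pretraining_data.py; it assumes documents is a list of documents, each a list of at least two sentences):

import random

def make_nsp_example(documents):
  doc = random.choice(documents)
  idx = random.randrange(len(doc) - 1)
  sent_a = doc[idx]
  if random.random() < 0.5:
    return sent_a, doc[idx + 1], "IsNextSentence"       # the real next sentence
  other = random.choice(documents)                      # otherwise: a random sentence
  return sent_a, random.choice(other), "NotNextSentence"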

fine-tuning

Code: run_classifier.py / run_squad.py (TPU)

Sentence (and sentence-pair) classification tasks

Before running this example you must download the GLUE data by running the download script and unpack it to some directory $GLUE_DIR. Next, download the BERT-Base checkpoint and unzip it to some directory $BERT_BASE_DIR.

This example code fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC), which contains only 3,600 examples and can be fine-tuned in a few minutes on most GPUs.

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

***** Eval results *****
  eval_accuracy = 0.845588
  eval_loss = 0.505248
  global_step = 343
  loss = 0.505248

After training your classifier you can use it in inference mode with the --do_predict=true flag. The input folder must contain a file named test.tsv. Output is written to a file named test_results.tsv in the output folder; each line contains the output for one example, and the columns are the class probabilities (see the reading sketch after the command below).

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue
export TRAINED_CLASSIFIER=/path/to/fine/tuned/classifier

python run_classifier.py \
  --task_name=MRPC \
  --do_predict=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$TRAINED_CLASSIFIER \
  --max_seq_length=128 \
  --output_dir=/tmp/mrpc_output/
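
test_results.tsv can then be read back with a few lines of Python (a sketch, assuming tab-separated probability columns in the same order as the processor's get_labels()):

import csv

with open("/tmp/mrpc_output/test_results.tsv") as f:
  for row in csv.reader(f, delimiter="\t"):
    probs = [float(p) for p in row]
    print(probs.index(max(probs)), probs)   # predicted class id and class probabilities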

Factors that affect memory usage:

  1. max_seq_length: the released models were trained with sequence lengths up to 512, but you can fine-tune with a shorter maximum sequence length to save substantial memory. This is controlled by the max_seq_length flag in the example code.

  2. train_batch_size: memory usage is also directly proportional to the batch size.

  3. Model type, BERT-Base vs. BERT-Large: the BERT-Large model requires significantly more memory than BERT-Base.

  4. Optimizer: BERT's default optimizer is Adam, which needs a lot of extra memory to store the m and v vectors. Switching to a more memory-efficient optimizer can reduce memory usage, but it may also affect the results. Other optimizers have not been tried for fine-tuning.

Code Walkthrough

class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """Constructs a InputExample.

    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label


class DataProcessor(object):
  """Base class for data converters for sequence classification data sets."""

	  def get_train_examples(self, data_dir):
	    """Gets a collection of `InputExample`s for the train set."""
	    raise NotImplementedError()
	
	  def get_dev_examples(self, data_dir):
	    """Gets a collection of `InputExample`s for the dev set."""
	    raise NotImplementedError()
	
	  def get_test_examples(self, data_dir):
	    """Gets a collection of `InputExample`s for prediction."""
	    raise NotImplementedError()
	
	  def get_labels(self):
	    """Gets the list of labels for this data set."""
	    raise NotImplementedError()
	
	  @classmethod
	  def _read_tsv(cls, input_file, quotechar=None):
	    """Reads a tab separated value file."""
	    with tf.gfile.Open(input_file, "r") as f:
	      reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
	      lines = []
	      for line in reader:
	        lines.append(line)
	      return lines

  1. XNLI (Cross-lingual Natural Language Inference) is a dataset for cross-lingual natural language inference. It extends the NLI task with the goal of advancing research on reasoning across languages and the development of cross-lingual models.
  2. XNLI is built on top of the English MultiNLI corpus (note that the code below reads multinli.train.*.tsv). It contains sentence pairs in 15 languages spanning several language families, such as Indo-European, Austroasiatic, and Niger-Congo; each language comes with a few thousand development and test examples, while the training data is the English MultiNLI set (optionally machine-translated into each target language).
  3. The goal of XNLI is to provide a unified benchmark for multilingual inference by translating the English data into other languages. For a given sentence pair, the model must decide whether the relationship is entailment, contradiction, or neutral. Evaluating inference across many languages measures a model's cross-lingual generalization and understanding. The release of XNLI has pushed cross-lingual NLP research forward and provides a benchmark and evaluation standard for models that handle multilingual text.

  • Loading the data; note that this follows the official dataset's input format

class XnliProcessor(DataProcessor):
	  """Processor for the XNLI data set."""
	
	  def __init__(self):
	    self.language = "zh"
	
	  def get_train_examples(self, data_dir):
	    """See base class."""
	    lines = self._read_tsv(
	        os.path.join(data_dir, "multinli",
	                     "multinli.train.%s.tsv" % self.language))
	    examples = []
	    for (i, line) in enumerate(lines):
	      if i == 0:
	        continue
	      guid = "train-%d" % (i)
	      text_a = tokenization.convert_to_unicode(line[0])
	      text_b = tokenization.convert_to_unicode(line[1])
	      label = tokenization.convert_to_unicode(line[2])
	      if label == tokenization.convert_to_unicode("contradictory"):
	        label = tokenization.convert_to_unicode("contradiction")
	      examples.append(
	          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
	    return examples
	
	  def get_dev_examples(self, data_dir):
	    """See base class."""
	    lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv"))
	    examples = []
	    for (i, line) in enumerate(lines):
	      if i == 0:
	        continue
	      guid = "dev-%d" % (i)
	      language = tokenization.convert_to_unicode(line[0])
	      if language != tokenization.convert_to_unicode(self.language):
	        continue
	      text_a = tokenization.convert_to_unicode(line[6])
	      text_b = tokenization.convert_to_unicode(line[7])
	      label = tokenization.convert_to_unicode(line[1])
	      examples.append(
	          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
	    return examples
	
	  def get_labels(self):
	    """See base class."""
	    return ["contradiction", "entailment", "neutral"]

  • Input format for a single example
class InputFeatures(object):
	  """A single set of features of data."""
	
	  def __init__(self,
	               input_ids,
	               input_mask,
	               segment_ids,
	               label_id,
	               is_real_example=True):
	    self.input_ids = input_ids
	    self.input_mask = input_mask
	    self.segment_ids = segment_ids
	    self.label_id = label_id
	    self.is_real_example = is_real_example

  • The function that converts an example into token-level features
def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
	  """Converts a single `InputExample` into a single `InputFeatures`."""
	 # On TPU every batch must be completely filled, so padding examples are needed
	  if isinstance(example, PaddingInputExample):
	    return InputFeatures(
	        input_ids=[0] * max_seq_length,
	        input_mask=[0] * max_seq_length,
	        segment_ids=[0] * max_seq_length,
	        label_id=0,
	        is_real_example=False)
	# Mapping from label string to label id
	  label_map = {}
	  for (i, label) in enumerate(label_list):
	    label_map[label] = i
	# Tokenize each sentence into WordPiece tokens
	  tokens_a = tokenizer.tokenize(example.text_a)
	  tokens_b = None
	  if example.text_b:
	    tokens_b = tokenizer.tokenize(example.text_b)
	
	  if tokens_b:
	    # Modifies `tokens_a` and `tokens_b` in place so that the total
	    # length is less than the specified length.
	    # Account for [CLS], [SEP], [SEP] with "- 3"
	    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
	  else:
	    # Account for [CLS] and [SEP] with "- 2"
	    if len(tokens_a) > max_seq_length - 2:
	      tokens_a = tokens_a[0:(max_seq_length - 2)]
	
	  # The convention in BERT is:
	  # (a) For sequence pairs:
	  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
	  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
	  # (b) For single sequences:
	  #  tokens:   [CLS] the dog is hairy . [SEP]
	  #  type_ids: 0     0   0   0  0     0 0
	  #
	  # Where "type_ids" are used to indicate whether this is the first
	  # sequence or the second sequence. The embedding vectors for `type=0` and
	  # `type=1` were learned during pre-training and are added to the wordpiece
	  # embedding vector (and position vector). This is not *strictly* necessary
	  # since the [SEP] token unambiguously separates the sequences, but it makes
	  # it easier for the model to learn the concept of sequences.
	  #
	  # For classification tasks, the first vector (corresponding to [CLS]) is
	  # used as the "sentence vector". Note that this only makes sense because
	  # the entire model is fine-tuned.

	# Add [CLS] and [SEP] around the tokens to build the model input
	  tokens = []
	  segment_ids = []
	  tokens.append("[CLS]")
	  segment_ids.append(0)
	  for token in tokens_a:
	    tokens.append(token)
	    segment_ids.append(0)
	  tokens.append("[SEP]")
	  segment_ids.append(0)
	
	  if tokens_b:
	    for token in tokens_b:
	      tokens.append(token)
	      segment_ids.append(1)
	    tokens.append("[SEP]")
	    segment_ids.append(1)
	
	  input_ids = tokenizer.convert_tokens_to_ids(tokens)
	
	  # The mask has 1 for real tokens and 0 for padding tokens. Only real
	  # tokens are attended to.
	  input_mask = [1] * len(input_ids)
	
	  # Zero-pad up to the sequence length.
	  while len(input_ids) < max_seq_length:
	    input_ids.append(0)
	    input_mask.append(0)
	    segment_ids.append(0)
	
	  assert len(input_ids) == max_seq_length
	  assert len(input_mask) == max_seq_length
	  assert len(segment_ids) == max_seq_length
	
	  label_id = label_map[example.label]
	  if ex_index < 5:
	    tf.logging.info("*** Example ***")
	    tf.logging.info("guid: %s" % (example.guid))
	    tf.logging.info("tokens: %s" % " ".join(
	        [tokenization.printable_text(x) for x in tokens]))
	    tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
	    tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
	    tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
	    tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
	
	  feature = InputFeatures(
	      input_ids=input_ids,
	      input_mask=input_mask,
	      segment_ids=segment_ids,
	      label_id=label_id,
	      is_real_example=True)
	  return feature

  • Truncation for sentence-pair tasks (see the worked example after the function)

def _truncate_seq_pair(tokens_a, tokens_b, max_length):
	  """Truncates a sequence pair in place to the maximum length.
	这段代码是一个用于截断序列对的函数。它的作用是将序列对(tokens_a和tokens_b)截断到最大长度(max_length)。
	函数使用了一个简单的启发式方法来截断序列。它会逐个删除一个token,直到序列对的总长度小于等于最大长度。如果tokens_a的长度大于tokens_b的长度,则删除tokens_a的最后一个token;否则,删除tokens_b的最后一个token。
	这个截断函数的目的是确保序列对的总长度不超过最大长度,以便在处理序列对时能够满足模型的输入要求。通过逐个删除token,可以保留较长序列中更多的信息,从而更好地处理不同长度的序列对。
	"""
	
	  # This is a simple heuristic which will always truncate the longer sequence
	  # one token at a time. This makes more sense than truncating an equal percent
	  # of tokens from each, since if one sequence is very short then each token
	  # that's truncated likely contains more information than a longer sequence.
	  while True:
	    total_length = len(tokens_a) + len(tokens_b)
	    if total_length <= max_length:
	      break
	    if len(tokens_a) > len(tokens_b):
	      tokens_a.pop()
	    else:
	      tokens_b.pop()
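
A quick worked example of the heuristic: tokens are popped from the longer list until the pair fits.

tokens_a = ["the", "quick", "brown", "fox", "jumps"]
tokens_b = ["hello", "world"]
_truncate_seq_pair(tokens_a, tokens_b, max_length=5)
print(tokens_a, tokens_b)   # ['the', 'quick', 'brown'] ['hello', 'world']
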
  • Building the model

def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):
  """Returns `model_fn` closure for TPUEstimator."""

  def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
    """The `model_fn` for TPUEstimator."""

    tf.logging.info("*** Features ***")
    for name in sorted(features.keys()):
      tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]
    label_ids = features["label_ids"]
    is_real_example = None
    if "is_real_example" in features:
      is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
    else:
      is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)

    is_training = (mode == tf.estimator.ModeKeys.TRAIN)

    (total_loss, per_example_loss, logits, probabilities) = create_model(
        bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
        num_labels, use_one_hot_embeddings)

    tvars = tf.trainable_variables()
    initialized_variable_names = {}
    scaffold_fn = None
    if init_checkpoint:
      (assignment_map, initialized_variable_names
      ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
      if use_tpu:

        def tpu_scaffold():
          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
          return tf.train.Scaffold()

        scaffold_fn = tpu_scaffold
      else:
        tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

    tf.logging.info("**** Trainable Variables ****")
    for var in tvars:
      init_string = ""
      if var.name in initialized_variable_names:
        init_string = ", *INIT_FROM_CKPT*"
      tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                      init_string)

    output_spec = None
    if mode == tf.estimator.ModeKeys.TRAIN:

      train_op = optimization.create_optimizer(
          total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          train_op=train_op,
          scaffold_fn=scaffold_fn)
    elif mode == tf.estimator.ModeKeys.EVAL:

      def metric_fn(per_example_loss, label_ids, logits, is_real_example):
        predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
        accuracy = tf.metrics.accuracy(
            labels=label_ids, predictions=predictions, weights=is_real_example)
        loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
        return {
            "eval_accuracy": accuracy,
            "eval_loss": loss,
        }

      eval_metrics = (metric_fn,
                      [per_example_loss, label_ids, logits, is_real_example])
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          eval_metrics=eval_metrics,
          scaffold_fn=scaffold_fn)
    else:
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          predictions={"probabilities": probabilities},
          scaffold_fn=scaffold_fn)
    return output_spec

  return model_fn
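
model_fn delegates the actual network to create_model (not shown above). Condensed from run_classifier.py, the classification head is roughly a dropout plus a single dense layer on top of the pooled [CLS] vector, trained with softmax cross-entropy:

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  model = modeling.BertModel(
      config=bert_config, is_training=is_training, input_ids=input_ids,
      input_mask=input_mask, token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)
  output_layer = model.get_pooled_output()        # [batch_size, hidden_size]
  hidden_size = output_layer.shape[-1].value
  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())
  if is_training:
    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
  logits = tf.nn.bias_add(
      tf.matmul(output_layer, output_weights, transpose_b=True), output_bias)
  probabilities = tf.nn.softmax(logits, axis=-1)
  log_probs = tf.nn.log_softmax(logits, axis=-1)
  one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
  per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
  loss = tf.reduce_mean(per_example_loss)
  return (loss, per_example_loss, logits, probabilities)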

Using BERT to extract fixed feature vectors

In some cases, rather than fine-tuning the whole pre-trained model end-to-end, it is useful to obtain pre-trained contextual embeddings: fixed contextual representations of each input token produced by the hidden layers of the pre-trained model. This should also mitigate most out-of-memory issues.

# Sentence A and Sentence B are separated by the ||| delimiter for sentence
# pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the
# delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

If you need to maintain alignment between the original and tokenized words (for projecting training labels), see the Tokenization section below.

Note: you might see a message like Could not find trained model in model_dir: /tmp/tmpuB5g5c, running initialization to predict. This message is expected; it just means we are using the init_from_checkpoint() API rather than the SavedModel API. If you do not specify a checkpoint, or specify an invalid one, the script will complain.

  • Converting your input file into token features with BERT (a sketch for reading the JSON output follows the code)
model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=FLAGS.init_checkpoint,
      layer_indexes=layer_indexes,
      use_tpu=FLAGS.use_tpu,
      use_one_hot_embeddings=FLAGS.use_one_hot_embeddings)

  # If TPU is not available, this will fall back to normal Estimator on CPU
  # or GPU.
  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=FLAGS.use_tpu,
      model_fn=model_fn,
      config=run_config,
      predict_batch_size=FLAGS.batch_size)

  input_fn = input_fn_builder(
      features=features, seq_length=FLAGS.max_seq_length)

  with codecs.getwriter("utf-8")(tf.gfile.Open(FLAGS.output_file,
                                               "w")) as writer:
    for result in estimator.predict(input_fn, yield_single_examples=True):
      unique_id = int(result["unique_id"])
      feature = unique_id_to_feature[unique_id]
      output_json = collections.OrderedDict()
      output_json["linex_index"] = unique_id
      all_features = []
      for (i, token) in enumerate(feature.tokens):
        all_layers = []
        for (j, layer_index) in enumerate(layer_indexes):
          layer_output = result["layer_output_%d" % j]
          layers = collections.OrderedDict()
          layers["index"] = layer_index
          layers["values"] = [
              round(float(x), 6) for x in layer_output[i:(i + 1)].flat
          ]
          all_layers.append(layers)
        features = collections.OrderedDict()
        features["token"] = token
        features["layers"] = all_layers
        all_features.append(features)
      output_json["features"] = all_features
      writer.write(json.dumps(output_json) + "\n")
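
Each line of output.jsonl is one JSON object; a small sketch for reading it back (layers[0] corresponds to the first entry of --layers, i.e. the top layer -1 in the command above):

import json

with open("/tmp/output.jsonl") as f:
  for line in f:
    example = json.loads(line)
    for feat in example["features"]:
      vec = feat["layers"][0]["values"]      # one vector per token per requested layer
      print(feat["token"], len(vec))         # e.g. 768 values per token for BERT-Base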

Tokenization

  1. Instantiate an instance of tokenizer = tokenization.FullTokenizer

  2. Tokenize the raw text with tokens = tokenizer.tokenize(raw_text).

  3. Truncate to the maximum sequence length. (You can use up to 512, but you will probably want something shorter for memory and speed reasons.)

  4. Add the [CLS] and [SEP] tokens in the right places.

Before describing the general recipe for handling word-level tasks, it is important to understand exactly what our tokenizer is doing. It has three main steps:

  • (1) Text normalization: convert all whitespace characters to spaces and (for the Uncased model) lowercase the input and strip out accent markers. E.g., John Johanson’s, → john johanson’s, .

  • (2) Punctuation splitting: split all punctuation characters on both sides (i.e., add whitespace around all punctuation characters). Punctuation characters are defined as (a) anything with a P* Unicode class, (b) any non-letter/number/space ASCII character (e.g., characters like $ which are technically not punctuation). E.g., john johanson’s, → john johanson ’ s ,

  • (3) WordPiece tokenization: apply whitespace tokenization to the output of the above procedure, and apply WordPiece tokenization to each token separately. (Our implementation is directly based on the one from tensor2tensor, which is linked.) E.g., john johanson ’ s , → john johan ##son ’ s ,

### Input
orig_tokens = ["John", "Johanson", "'s",  "house"]
labels      = ["NNP",  "NNP",      "POS", "NN"]

### Output
bert_tokens = []

# Token map will be an int -> int mapping between the `orig_tokens` index and
# the `bert_tokens` index.
orig_to_tok_map = []

tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=True)

bert_tokens.append("[CLS]")
for orig_token in orig_tokens:
  orig_to_tok_map.append(len(bert_tokens))
  bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]
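
With orig_to_tok_map, word-level labels can then be projected onto the first WordPiece of each original word (a common convention for tagging tasks; the "O" filler label below is an assumption of this sketch, not something the repo prescribes):

# Give each original label to the first WordPiece of its word; the remaining
# pieces (e.g. "##son") and the [CLS]/[SEP] positions keep a filler label.
bert_labels = ["O"] * len(bert_tokens)
for label, tok_index in zip(labels, orig_to_tok_map):
  bert_labels[tok_index] = label
# bert_labels == ["O", "NNP", "NNP", "O", "POS", "O", "NN", "O"]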

Classification tasks


Pre-trained models

Each .zip file contains three items:

  1. A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (actually 3 files).

  2. A vocab file (vocab.txt) that maps WordPieces to word ids.

  3. A config file (bert_config.json) that specifies the model's hyperparameters.


Code Walkthrough

https://github.com/google-research/bert/blob/master/run_classifier.py

Input components:

  • guid: Unique id for the example.
    text_a: string. The untokenized text of the first sequence. For single sequence tasks, only this sequence must be specified.
    text_b: (Optional) string. The untokenized text of the second sequence. Only must be specified for sequence pair tasks.
    label: (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.

BertConfig


class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
      hidden_size: Size of the encoder layers and the pooler layer.
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
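
In practice you rarely construct a BertConfig by hand; the released bert_config.json is loaded with the class method defined above (the path mirrors the BERT-Base checkpoint used earlier):

bert_config = modeling.BertConfig.from_json_file(
    "/path/to/bert/uncased_L-12_H-768_A-12/bert_config.json")
print(bert_config.hidden_size, bert_config.num_hidden_layers)   # 768 12 for BERT-Base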

BertModel

class BertModel(object):
  """BERT model ("Bidirectional Encoder Representations from Transformers").

  Example usage:

  # Already been converted into WordPiece token ids
  input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
  input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])      # 1 for real tokens, 0 for padding
  token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])  # segment ids marking which sentence each token belongs to

# Set up the BERT model hyperparameters
  config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
    num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
# Build the BERT model from the config and the inputs
  model = modeling.BertModel(config=config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
# Label embedding matrix
  label_embeddings = tf.get_variable(...)
# Pooled output of BERT (the transformed [CLS] representation)
  pooled_output = model.get_pooled_output()
# pooled output x label embeddings = classification logits
  logits = tf.matmul(pooled_output, label_embeddings)
# Compute the loss
  ...

  """
	
	  def __init__(self,
	               config,
	               is_training,
	               input_ids,
	               input_mask=None,
	               token_type_ids=None,
	               use_one_hot_embeddings=False,
	               scope=None):
	    """Constructor for BertModel.
	
	    Args:
	      config: `BertConfig` instance.
	      is_training: bool. true for training model, false for eval model. Controls
	        whether dropout will be applied.
	      input_ids: int32 Tensor of shape [batch_size, seq_length].
	      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
	      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
	      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
	        embeddings or tf.embedding_lookup() for the word embeddings.
	      scope: (optional) variable scope. Defaults to "bert".
	
	    Raises:
	      ValueError: The config is invalid or one of the input tensor shapes
	        is invalid.
	    """
	    config = copy.deepcopy(config)
	    if not is_training:
	      config.hidden_dropout_prob = 0.0
	      config.attention_probs_dropout_prob = 0.0
	
	    input_shape = get_shape_list(input_ids, expected_rank=2)
	    batch_size = input_shape[0]
	    seq_length = input_shape[1]
	
	    if input_mask is None:
	      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
	
	    if token_type_ids is None:
	      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
	
	    with tf.variable_scope(scope, default_name="bert"):
	      with tf.variable_scope("embeddings"):
	        # Perform embedding lookup on the word ids.
	        (self.embedding_output, self.embedding_table) = embedding_lookup(
	            input_ids=input_ids,
	            vocab_size=config.vocab_size,
	            embedding_size=config.hidden_size,
	            initializer_range=config.initializer_range,
	            word_embedding_name="word_embeddings",
	            use_one_hot_embeddings=use_one_hot_embeddings)
	
	        # Add positional embeddings and token type embeddings, then layer
	        # normalize and perform dropout.
	        self.embedding_output = embedding_postprocessor(
	            input_tensor=self.embedding_output,
	            use_token_type=True,
	            token_type_ids=token_type_ids,
	            token_type_vocab_size=config.type_vocab_size,
	            token_type_embedding_name="token_type_embeddings",
	            use_position_embeddings=True,
	            position_embedding_name="position_embeddings",
	            initializer_range=config.initializer_range,
	            max_position_embeddings=config.max_position_embeddings,
	            dropout_prob=config.hidden_dropout_prob)
	
	      with tf.variable_scope("encoder"):
	        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
	        # mask of shape [batch_size, seq_length, seq_length] which is used
	        # for the attention scores.
	        attention_mask = create_attention_mask_from_input_mask(
	            input_ids, input_mask)
	
	        # Run the stacked transformer.
	        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
	        self.all_encoder_layers = transformer_model(
	            input_tensor=self.embedding_output,
	            attention_mask=attention_mask,
	            hidden_size=config.hidden_size,
	            num_hidden_layers=config.num_hidden_layers,
	            num_attention_heads=config.num_attention_heads,
	            intermediate_size=config.intermediate_size,
	            intermediate_act_fn=get_activation(config.hidden_act),
	            hidden_dropout_prob=config.hidden_dropout_prob,
	            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
	            initializer_range=config.initializer_range,
	            do_return_all_layers=True)
	
	      self.sequence_output = self.all_encoder_layers[-1]
	      # The "pooler" converts the encoded sequence tensor of shape
	      # [batch_size, seq_length, hidden_size] to a tensor of shape
	      # [batch_size, hidden_size]. This is necessary for segment-level
	      # (or segment-pair-level) classification tasks where we need a fixed
	      # dimensional representation of the segment.
	      with tf.variable_scope("pooler"):
	        # We "pool" the model by simply taking the hidden state corresponding
	        # to the first token. We assume that this has been pre-trained
	        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
	        self.pooled_output = tf.layers.dense(
	            first_token_tensor,
	            config.hidden_size,
	            activation=tf.tanh,
	            kernel_initializer=create_initializer(config.initializer_range))
	
	  def get_pooled_output(self):
	    return self.pooled_output
	
	  def get_sequence_output(self):
	    """Gets final hidden layer of encoder.
	
	    Returns:
	      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
	      to the final hidden of the transformer encoder.
	    """
	    return self.sequence_output
	
	  def get_all_encoder_layers(self):
	    return self.all_encoder_layers
	
	  def get_embedding_output(self):
	    """Gets output of the embedding lookup (i.e., input to the transformer).
	
	    Returns:
	      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
	      to the output of the embedding layer, after summing the word
	      embeddings with the positional embeddings and the token type embeddings,
	      then performing layer normalization. This is the input to the transformer.
	    """
	    return self.embedding_output
	
	  def get_embedding_table(self):
	    return self.embedding_table

word embedding

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

"""
tf.get_variable()函数用于创建或获取一个共享变量(可以初始化)。共享变量是在模型的不同部分之间共享和重用的变量
embedding_table 其中每一行是一个词的嵌入向量
output说嵌入后的张量
"""
  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  # input_ids: [batch_size, seq_length]; flattened by the reshape to [batch_size * seq_length].
  # One-hot encoding against the vocabulary gives [batch_size * seq_length, vocab_size];
  # multiplying by the [vocab_size, embedding_size] table yields
  # output = [batch_size * seq_length, embedding_size].
  flat_input_ids = tf.reshape(input_ids, [-1])
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)
"""
具体来说,假设output的形状为(batch_size, seq_length, embedding_size),input_shape为[batch_size, seq_length, embedding_size],embedding_size为一个整数。那么,input_shape[0:-1]将得到[batch_size, seq_length],[input_shape[-1] * embedding_size]将得到[embedding_size * embedding_size]。最终,tf.reshape()函数将output重塑为形状为(batch_size, seq_length, embedding_size * embedding_size)的张量。
"""
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

Adding the position embeddings

  • The position embeddings work like the word embeddings above: there is a table of a fixed initialized size that is looked up
  • Note the meaning of [seq_length, width]: the position embedding encodes each token's position in the text, so every position gets one entry, and width is the size of that positional representation
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output

transformer


def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.

  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the same, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-with vector.

  This function first projects `from_tensor` into a "query" tensor and
  `to_tensor` into "key" and "value" tensors. These are (effectively) a list
  of tensors of length `num_attention_heads`, where each tensor is of shape
  [batch_size, seq_length, size_per_head].

  Then, the query and key tensors are dot-producted and scaled. These are
  softmaxed to obtain attention probabilities. The value tensors are then
  interpolated by these probabilities, then concatenated back to a single
  tensor and returned.

  In practice, the multi-headed attention are done with transposes and
  reshapes rather than actual separate tensors.

  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
      from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
      attention probabilities.
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
      * from_seq_length, num_attention_heads * size_per_head]. If False, the
      output will be of shape [batch_size, from_seq_length, num_attention_heads
      * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
      of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `to_tensor`.

  Returns:
    float Tensor of shape [batch_size, from_seq_length,
      num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
      true, this will be of shape [batch_size * from_seq_length,
      num_attention_heads * size_per_head]).

  Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
  """


  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
	"""
	output_tensor是一个形状为(batch_size, num_heads, seq_length, head_size)的张量,其中:
	
	batch_size表示批次大小,即输入样本的数量。
	num_heads表示注意力头的数量,Transformer中通常会使用多头注意力机制,将输入特征分为多个子空间进行注意力计算。
	seq_length表示序列长度,即输入特征的长度。
	head_size表示每个注意力头的特征维度。
	通过tf.transpose(output_tensor, [0, 2, 1, 3])操作,将output_tensor的维度进行交换。具体来说,交换后的维度顺序为(batch_size, seq_length, num_heads, head_size)。
	
	这种维度交换的目的是为了将注意力头的维度(num_heads)与序列长度的维度(seq_length)进行交换,以便在后续的矩阵乘法操作中,将注意力头的数量作为矩阵乘法的维度,从而实现多头注意力的计算。
	"""
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
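
The effect of the -10000.0 adder can be checked numerically: after the softmax, masked positions get essentially zero attention probability (toy NumPy sketch, one query over three key positions):

import numpy as np

scores = np.array([2.0, 1.0, 0.5])            # raw attention scores
mask   = np.array([1.0, 1.0, 0.0])            # the last position is padding
scores = scores + (1.0 - mask) * -10000.0
probs  = np.exp(scores) / np.exp(scores).sum()
print(probs)                                   # ~[0.73, 0.27, 0.00]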
