201020 Study Notes (BERT)

Prerequisites: word2vec, RNN models, and a basic understanding of how word vectors are built
Focus: the Transformer architecture, how BERT is trained, and practical usage
The basic structure is still the Seq2Seq network commonly used in machine translation models.

Problem with traditional RNNs:
Each step depends on the output of the previous step, so computation cannot be parallelized.
Transformer:
The self-attention mechanism computes everything in parallel; the input and output sequences have the same length, and all output positions are produced at the same time. Transformers have essentially replaced RNNs.
The idea is to fuse each word's context into its word vector.
Take two words x1 and x2:
Step 1: initialize the vectors, i.e. convert each word to an embedding (here a 4-dimensional vector with four features).
Step 2: compute the Q (query), K (key), and V (value) matrices with the help of three auxiliary weight matrices.

Compute a score between the current word and every word in the sequence.
Softmax turns these scores into weights: how much each position influences the encoding of the current word.
Larger vector dimensions tend to produce larger dot products without being more important, so the scores are scaled to remove the effect of dimensionality.
Each word computes a score against every K in the sequence, and the V features are then re-weighted by those scores to produce the attention output.
Overall process (a minimal sketch follows):
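Here is a minimal numpy sketch of that scaled dot-product self-attention for the two-word example; the weight matrices, their sizes, and the random inputs are made up purely for illustration.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # build Q, K, V with three auxiliary matrices
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # score of every word against every word, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)     # how much each position influences the current word
    return weights @ V                     # re-weight the value features -> attention output

X = np.random.randn(2, 4)                  # two words, four features each
Wq = np.random.randn(4, 3)
Wk = np.random.randn(4, 3)
Wv = np.random.randn(4, 3)
Z = self_attention(X, Wq, Wk, Wv)          # attention output, shape (2, 3)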
Multi-head attention
One set of Q/K/V extracts one feature representation of the current word; multiple sets of Q/K/V extract multiple representations, which are concatenated.
After self-attention, the output is added back to the input (a residual connection) and layer normalization is applied; a rough sketch follows.
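A rough numpy sketch of that Add & Norm step (residual addition followed by layer normalization over the feature dimension); the learnable scale and shift of a real layer norm are omitted here.

import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)    # normalize each token's feature vector
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.random.randn(2, 4)                  # input to the self-attention block
attn_out = np.random.randn(2, 4)           # self-attention output (same shape)
out = layer_norm(x + attn_out)             # residual connection, then layer norm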

Decoder: its input and output are also sequences.

Model: BERT_BASE_DIR
Data: glue_data
Task: MRPC, which asks whether two strings describe the same thing.

Files:
BERT_BASE_DIR/uncased…/
bert_config.json: configuration parameters
ckpt: the pretrained model checkpoint released by Google
vocab.txt: the vocabulary, i.e. all the tokens of the corpus

run_classifier.py
Set up the run configuration (in your IDE) with the following arguments:
Arguments:
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=…/GLUE/glue_data/MRPC \
--vocab_file=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/vocab.txt \
--bert_config_file=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=1 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=…/GLUE/output

Task name; do_train toggles training; do_eval toggles evaluation. On Windows, avoid absolute paths and avoid Chinese characters in paths.

run_classifier.py:
Lines 177-192: reading the data is the part you have to implement yourself.
Line 842: train_examples = processor.get_train_examples(FLAGS.data_dir)
reads the training examples (jump to line 299).

num_train_steps = int(len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)

↑ Line 844: train_examples contains 3668 examples; with batch_size=100 that is roughly 3668/100 ≈ 37 iterations per epoch, and with num_train_epochs=3 about 110 training steps in total.

num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)

↑ Line 845, num_warmup_steps: the learning rate is kept small at the start of training and restored after the warmup phase (warmup_proportion is set to 0.1, so the rate ramps up during the first 110 × 0.1 ≈ 11 steps). A sketch of the idea follows.
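The warmup idea in a rough plain-Python sketch (the real schedule lives in optimization.py and also decays the learning rate after warmup; the numbers are the ones computed above):

def learning_rate_at(global_step, init_lr=2e-5, num_warmup_steps=11):
    # during warmup the learning rate ramps up linearly from near 0 to init_lr
    if global_step < num_warmup_steps:
        return init_lr * float(global_step + 1) / float(num_warmup_steps)
    # afterwards the configured rate is used (the repo additionally applies polynomial decay)
    return init_lr

for step in range(15):
    print(step, learning_rate_at(step))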

Line 869: reading the data:

file_based_convert_examples_to_features(train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)

Inside file_based_convert_examples_to_features:

writer = tf.python_io.TFRecordWriter(output_file)

↑ Line 483: the output is written in TFRecord format.

    if ex_index % 10000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

↑ Line 485: log progress every 10,000 examples.

feature = convert_single_example(ex_index, example, label_list,max_seq_length, tokenizer)

↑ Line 489: the core function; jump to line 377.

  label_map = {}
  for (i, label) in enumerate(label_list):  # build the label -> index map
    label_map[label] = i

Line 389: build the label map; here there are just two classes, 0 and 1.

tokens_a = tokenizer.tokenize(example.text_a)  # tokenize the first sentence

Line 393: tokenization; jump to tokenization.py, around line 170:

    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)

↑ WordPiece tokenization.
Example:
<class 'list'>: ['am', '##ro', '##zi', 'accused', 'his', 'brother', ',', 'whom', 'he', 'called', '"', 'the', 'witness', '"', ',', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.']
Chinese text is basically split into individual characters; the general idea is always to break the input into finer-grained pieces. The output above can be reproduced with the sketch below.
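A quick way to reproduce the WordPiece output above (the vocab path is only an assumption; point it at your own BERT_BASE_DIR):

import tokenization  # tokenization.py from the BERT repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=True)

tokens = tokenizer.tokenize("Amrozi accused his brother of deliberately distorting his evidence.")
# -> ['am', '##ro', '##zi', 'accused', 'his', 'brother', ...]
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # look the tokens up in vocab.txt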
Once tokenization is done we return to run_classifier; if a second sentence exists, it is tokenized as well.
Line 398 checks for this:

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"  # reserve room for 3 special tokens
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)  # truncate the pair if it is too long
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

① If the sequences are too long they are truncated (the helper is sketched below).
② With a second sentence b, room is reserved for three special tokens ([CLS], [SEP], [SEP]); without b, for two ([CLS], [SEP]).
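The truncation helper works roughly like this (a sketch of _truncate_seq_pair from memory; it always trims the currently longer sentence so both keep a fair share of the budget):

def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncate the two token lists in place until their total length fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()   # drop from the end of the longer sentence
        else:
            tokens_b.pop()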
Line 408 starts building the encoding; the code's own comment explains the convention:

# The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1  # indicates which sentence each token comes from
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  #
  # Where "type_ids" are used to indicate whether this is the first
  # sequence or the second sequence. The embedding vectors for `type=0` and
  # `type=1` were learned during pre-training and are added to the wordpiece
  # embedding vector (and position vector). This is not *strictly* necessary
  # since the [SEP] token unambiguously separates the sequences, but it makes
  # it easier for the model to learn the concept of sequences.
  #
  # For classification tasks, the first vector (corresponding to [CLS]) is
  # used as the "sentence vector". Note that this only makes sense because
  # the entire model is fine-tuned.

type_id = 0 marks the first sentence, 1 marks the second.

  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

Line 426 starts the encoding: the first token is always [CLS], with segment id 0; then every WordPiece token of sentence a is appended, each with segment id 0; after all of a's tokens, the separator [SEP] is appended, again with segment id 0.

  if tokens_b:
    for token in tokens_b:
      tokens.append(token)
      segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

Line 436: if sentence b exists, its tokens are appended with segment id 1, followed by another [SEP] with segment id 1.
Line 443: the tokens are converted to IDs via their indices in vocab.txt:
<class 'list'>: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]

max_seq_length=128; shorter sequences are padded with zeros.

  while len(input_ids) < max_seq_length:  # pad up to the configured maximum length
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

↑ Line 450: because of the zero padding, self-attention needs an extra mask to tell real tokens from padding: real tokens get input_mask=1 and take part in the attention computation, padded positions get input_mask=0 and do not.
input_ids:<class ‘list’>: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
input_masks:<class ‘list’>: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Line 459 logs the resulting features.

Line 470: InputFeatures (defined around line 161) simply stores these values.
Line 485: loop over every example.
Line 496 onwards: process each example.

Lines 496-502: convert the values into TFRecord features.

Lines 504-505:

    tf_example = tf.train.Example(features=tf.train.Features(feature=features))
    writer.write(tf_example.SerializeToString())

Each tf.train.Example is serialized and written to the TFRecord writer; a self-contained sketch of this step follows.
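A self-contained sketch of this serialization step, with dummy feature values and the TF 1.x API the repo uses (the real code builds the features inside file_based_convert_examples_to_features):

import collections
import tensorflow as tf

def create_int_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

# dummy values just to show the shapes: sequences padded to length 128
input_ids   = [101, 2572, 102] + [0] * 125
input_mask  = [1, 1, 1] + [0] * 125
segment_ids = [0] * 128
label_id    = 1

features = collections.OrderedDict()
features["input_ids"]   = create_int_feature(input_ids)
features["input_mask"]  = create_int_feature(input_mask)
features["segment_ids"] = create_int_feature(segment_ids)
features["label_ids"]   = create_int_feature([label_id])

writer = tf.python_io.TFRecordWriter("train.tf_record")
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
writer.close()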

Embedding layer:
Line 574 onwards creates the BERT model:

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,#(8,128)
      input_mask=input_mask,#(8,128)
      token_type_ids=segment_ids,#(8,128)
      use_one_hot_embeddings=use_one_hot_embeddings)

config: the configuration file
is_training: whether we are training
input_ids: (8, 128), i.e. batch size 8, each sequence 128 tokens long
input_mask: 0 or 1, i.e. whether a position is padding or real content
segment_ids: which sentence each token belongs to

modeling.py:
Line 165:

    if input_mask is None:  # if no mask is given, default to all ones
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

mask: if no mask is provided, it defaults to all ones (which is bad for self-attention when there is padding).
token_type_ids: if the number of sentences is not specified, a single sentence is assumed and everything is set to 0.

The embedding layer is built first, starting at line 171.
Each of the 128 token positions is turned into a vector; the three encodings (word, token type, position) must have the same dimensionality so they can be added together.

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

↑ Line 171, the embedding step: input_ids is 8 × 128; vocab_size is about thirty thousand (fixed by the pretrained model); embedding_size is the embedding dimensionality (768 in the official model); initializer_range is the initialization range (0.02); use_one_hot_embeddings defaults to False. Don't change these parameters when using a pretrained model.

Additional encoding features:
Input: two dimensions, (batch_size × max_length) = 8 × 128
Output: batch_size × max_length × 768-dimensional vectors
modeling.py
Lines 171-180: the word embedding
Line 409 onwards:

  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
  embedding_table = tf.get_variable(  # the word embedding table, 30522 x 768
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  flat_input_ids = tf.reshape(input_ids, [-1])
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)  # on CPU/GPU: 1024 x 768, the lookup results for the whole batch

First an extra dimension is added to the input: 8 × 128 × 1.
Flattening gives flat_input_ids with 8 × 128 × 1 = 1024 entries.
output is 1024 × 768.

  input_shape = get_shape_list(input_ids)
  output = tf.reshape(output,input_shape[0:-1] + [input_shape[-1] * embedding_size]) #(8, 128, 768)
  return (output, embedding_table)

↑ Line 421: output has three dimensions, 8 × 128 × 768 = batch_size × words per sentence × vector per word.
Each word has now become a vector; the same lookup in numpy is sketched below.
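The same lookup in numpy, just to make the shapes explicit (random table, 30522-word vocabulary as in the uncased model):

import numpy as np

vocab_size, hidden = 30522, 768
embedding_table = np.random.normal(0.0, 0.02, (vocab_size, hidden))  # word embedding table

input_ids = np.random.randint(0, vocab_size, (8, 128))  # a batch of token ids
flat_ids = input_ids.reshape(-1)                        # 1024
output = embedding_table[flat_ids]                      # 1024 x 768, same idea as tf.gather
output = output.reshape(8, 128, hidden)                 # back to 8 x 128 x 768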

Lines 184-194: the position encoding (position_embedding) and token-type encoding are added.
This only blends extra information in; the shape does not change.
Jump to line 472:

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(#(2, 768)
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])#(1024)
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width]) #8, 128, 768
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))

Because at most two sentences are assumed, the token-type table is (2, 768); the type id can only be 0 or 1.
For every token we look up whether it belongs to sentence 0 or 1.
The one-hot here is purely for speed (the "vocabulary" of types is tiny).
The multiplication token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
is (1024 × 2) × (2 × 768): 1024 tokens, each with 2 possible types, and the table maps each type to a 768-dimensional vector, so the result is again 1024 × 768.
It is then reshaped to 8 × 128 × 768.
full_position_embeddings is 512 × 768.
Lines 505-507:

      position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])  # the position table is larger than needed; for speed, slice out just the first seq_length rows -> 128 x 768
      num_dims = len(output.shape.as_list())

A slice is taken from the 512 × 768 table, so the returned position_embeddings only covers 128 × 768.

Lines 512-518:

      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])  # [1, 128, 768]: the position encoding does not depend on the input data; the original embedding has batch_size as its first dimension, so a leading 1 is added here for the computation
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

The position embedding obtained here is 128 × 768, and the same one is added to every example in the batch; since it does not depend on which words are fed in, an extra dimension is added to get [1, 128, 768].
So the extra encodings capture: 1. which sentence a token is in (2 possibilities), 2. its position (128 positions).

  output = layer_norm_and_dropout(output, dropout_prob)
  return output

Finally layer normalization and dropout are applied; the output is the sum of the three encodings. A shape-only sketch is below.
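Putting the three encodings together, as a shape-only numpy sketch (random values; the real layer_norm_and_dropout also applies learned layer-norm parameters):

import numpy as np

batch, seq, hidden = 8, 128, 768
word_embeddings       = np.random.randn(batch, seq, hidden)  # from the word embedding lookup
token_type_embeddings = np.random.randn(batch, seq, hidden)  # which sentence each token is from
position_embeddings   = np.random.randn(1, seq, hidden)      # identical for every example

# all three share the last dimension (768) and are simply summed;
# the position encoding broadcasts over the batch dimension
output = word_embeddings + token_type_embeddings + position_embeddings  # 8 x 128 x 768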

Mask mechanism
modeling.py, line 200:

        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)

For every word, this determines which words it should attend to (attend to positions marked 1, ignore those marked 0).
Input: 8 × 128; output: 8 × 128 × 128, where the last 128 encodes which words each word is allowed to see. A numpy sketch follows.
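A numpy sketch of how the 8 × 128 input_mask becomes the 8 × 128 × 128 attention mask (each row says which positions that word may attend to):

import numpy as np

batch, seq = 8, 128
input_mask = np.zeros((batch, seq), dtype=np.float32)
input_mask[:, :50] = 1.0   # pretend the first 50 tokens are real, the rest padding

# broadcast a column of ones against the row mask,
# the same idea as create_attention_mask_from_input_mask
attention_mask = np.ones((batch, seq, 1), dtype=np.float32) * input_mask[:, None, :]
print(attention_mask.shape)   # (8, 128, 128)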

Transformer
modeling.py, line 205:

self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,  # number of Transformer (encoder) layers
            num_attention_heads=config.num_attention_heads,  # number of attention heads
            intermediate_size=config.intermediate_size,  # number of units in the feed-forward (intermediate) layer
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)  # whether to return the output of every layer

input_tensor: the embedding output computed above
attention_mask: the 0/1 mask saying whether each word should be attended to
Fine-tuning continues from the pretrained weights, so many of these parameters must not be changed.
Line 802:

  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

hidden_size = 768
num_attention_heads = 12
768 / 12 = 64 features per head; the per-head vectors are concatenated back together afterwards. If hidden_size were not divisible by the number of heads, the later computations would get messy. The head split is sketched below.
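The head split itself is just a reshape plus a transpose; modeling.py does it roughly like this (sketch of transpose_for_scores from memory):

import tensorflow as tf

def transpose_for_scores(input_tensor, batch_size, num_heads, seq_length, head_size):
    # [batch*seq, num_heads*head_size] -> [batch, seq, num_heads, head_size]
    output = tf.reshape(input_tensor, [batch_size, seq_length, num_heads, head_size])
    # -> [batch, num_heads, seq, head_size], so each head runs its own 128 x 128 attention
    return tf.transpose(output, [0, 2, 1, 3])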
Line 807 ↓

  attention_head_size = int(hidden_size / num_attention_heads)  # 768 output features in total, split evenly across the heads
  input_shape = get_shape_list(input_tensor, expected_rank=3) # [8, 128, 768]
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

Line 815 ↓

  if input_width != hidden_size:  # the residual connection requires the two to have the same dimensionality
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

The residual connection is an addition, not a concatenation.
The input is 768-dimensional, so the output must also be 768-dimensional for the addition to work, hence this check.

reshape: 8 × 128 is flattened to 1024 (probably for speed, see the comment below).
Line 819:

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)  # keep the tensor 2D; the reshape is probably for speed (see the comment above)

The input becomes 1024 × 768.
Line 825:

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

Each layer's output becomes the next layer's input; layer_input and prev_output are both 1024 × 768.
The attention block starts at line 830:

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

from_tensor and to_tensor are both layer_input, i.e. attention over itself (self-attention).
attention_mask: the 0/1 mask built earlier.
A 2D tensor is returned; from_seq_length and to_seq_length are both 128.

Line 558: the attention_layer function:

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `

Line 637:

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])#[1024, 768]
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])#[1024, 768]
  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences) 8
  #   F = `from_tensor` sequence length 128
  #   T = `to_tensor` sequence length 128
  #   N = `num_attention_heads` 12
  #   H = `size_per_head` 64

Building the Q, K, V matrices:
Query, line 666:

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

The Query matrix is built from from_tensor with a dense layer of size num_attention_heads × size_per_head = 12 × 64, so query_layer is 1024 × 768 ([B*F, N*H]).
8 examples × 128 words = 1024 words; every word computes dot products with every other word, so each word needs its own query vector; 12 heads × 64 features per head = 768.

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

The Key matrix must match the Query matrix in dimensionality; it is built from to_tensor, with otherwise the same parameters.

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

V holds the features that actually get re-weighted (see the Q/K/V description earlier), so V has the same dimensionality as K.

Dot-product computation:

  # `query_layer` = [B, N, F, H]  # transposed to speed up the dot products
  query_layer = transpose_for_scores(query_layer, batch_size, num_attention_heads,
                                     from_seq_length, size_per_head)
  # `key_layer` = [B, N, T, H]  # transposed to speed up the dot products
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)
  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)  # result: (8, 12, 128, 128)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))  # remove the effect of dimensionality

The raw scores are divided by √dk = √64 = 8 to remove the effect of the feature dimensionality (attention = softmax(QKᵀ/√dk) · V).

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0  # mask=1 -> adder is 0; mask=0 -> adder is a very large negative number

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder  # adding this to the raw scores leaves mask=1 positions unchanged and pushes mask=0 positions to a huge negative value

When the attention mask is 1 (a real token): adder = (1 - 1) × -10000 = 0.
When the mask is 0 (padding): adder = (1 - 0) × -10000 = -10000.
After softmax, a score around -10000 gets essentially zero probability, so padded positions receive no attention weight. A tiny numeric demo is below.
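A tiny numeric demo of the -10000 trick (toy scores over 4 positions, the last two of which are padding):

import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.1])
mask   = np.array([1.0, 1.0, 0.0, 0.0])   # 1 = real token, 0 = padding

adder = (1.0 - mask) * -10000.0           # 0 for real tokens, -10000 for padding
masked = scores + adder

probs = np.exp(masked - masked.max())
probs /= probs.sum()
print(probs)   # padded positions get essentially zero probability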

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)  # softmax of a very large negative score is ~0, so masked positions are effectively ignored

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

Multiplying the attention probabilities by V:

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])  # (8, 128, 12, 64)

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])  # (8, 12, 128, 64)

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)  # the final context features, (8, 12, 128, 64)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])  # back to [8, 128, 12, 64]

Line 857: residual connection:

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"): #1024, 768 残差连接
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

After the fully connected layers, line 884 decides what to return:

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

Either the output of every layer is returned, or only the final layer's.

Creating the model
run_classifier.py, line 577:

 """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,#(8,128)
      input_mask=input_mask,#(8,128)
      token_type_ids=segment_ids,#(8,128)
      use_one_hot_embeddings=use_one_hot_embeddings)

Line 590 defines the output:

  # If you want to use the token-level output, use model.get_sequence_output()
  # instead.
  output_layer = model.get_pooled_output()
  hidden_size = output_layer.shape[-1].value  #768

  output_weights = tf.get_variable(  # the classification head: an extra fully connected layer
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(  # bias for the two classes (0 and 1), fine-tuned
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

get_pooled_output: takes the first position, the [CLS] token, whose representation covers the whole input.
hidden_size: 768
output_weights: (2, 768)
num_labels = 2: binary classification

modeling.py, line 205: the final result:

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,  # number of Transformer (encoder) layers
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,  # number of units in the feed-forward layer
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)  # whether to return the output of every layer

      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

first_token_tensor: the first token's vector, i.e. [CLS].

Once BertModel has produced the encoded representation, you attach whatever fully connected head your task needs.
run_classifier.py, line 601:

 with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)

logits = output_layer × weights + bias; a softmax gives the class probabilities, and the loss is the cross-entropy between the one-hot labels and the log-softmax.

To adapt this to your own task you basically only need to change the data reading and preprocessing, around line 177 of run_classifier.py:

class DataProcessor(object):
  """Base class for data converters for sequence classification data sets."""

  def get_train_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the train set."""
    raise NotImplementedError()

  def get_dev_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the dev set."""
    raise NotImplementedError()

  def get_test_examples(self, data_dir):
    """Gets a collection of `InputExample`s for prediction."""
    raise NotImplementedError()

  def get_labels(self):
    """Gets the list of labels for this data set."""
    raise NotImplementedError()


Reading your own data set: run_classifier.py, line 199:


To avoid touching the rest of the source code, text_b is simply set to None (a single-sentence task).
InputExample is just a plain container; each record is wrapped in one and appended to the examples list.

class MyDataProcessor(DataProcessor):  # our own processor, inheriting from DataProcessor
      """Data converter for our own sequence classification data set."""

      def get_train_examples(self, data_dir):
          """Gets a collection of `InputExample`s for the train set."""
          file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\train_sentiment.txt')
          f = open(file_path, 'r', encoding='utf-8')
          train_data = []
          index = 0
          for line in f.readlines():
              guid = "train-%d" % (index)  # assign an id
              line = line.replace('\n', '').split('\t')  # strip the newline and split on tabs
              text_a = tokenization.convert_to_unicode(str(line[1]))
              label = str(line[2])
              train_data.append(
                  InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
              index += 1
          return train_data

      def get_dev_examples(self, data_dir):
          """Gets a collection of `InputExample`s for the dev set."""
          file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\test_sentiment.txt')
          f = open(file_path, 'r', encoding='utf-8')
          dev_data = []
          index = 0
          for line in f.readlines():
              guid = "dev-%d" % (index)  # assign an id
              line = line.replace('\n', '').split('\t')  # strip the newline and split on tabs
              text_a = tokenization.convert_to_unicode(str(line[1]))
              label = str(line[2])
              dev_data.append(
                  InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
              index += 1
          return dev_data

      def get_test_examples(self, data_dir):
          """Gets a collection of `InputExample`s for prediction."""
          file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\test.csv')
          test_df = pd.read_csv(file_path, encoding='utf-8')  # requires `import pandas as pd` at the top of the file
          test_data = []
          for index, test in enumerate(test_df.values):
              guid = "test-%d" % (index)
              text_a = tokenization.convert_to_unicode(str(test[0]))
              label = str(test[1])
              test_data.append(
                  InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
          return test_data

      def get_labels(self):
          """Gets the list of labels for this data set."""
          return ['0', '1', '2']

Register your own preprocessing as a task:
In the processors dict, add the entry "mydata": MyDataProcessor (see the sketch below).
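The registration is one extra entry in the processors dict inside main() of run_classifier.py (a sketch; the other entries are the processors shipped with the repo):

processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "mydata": MyDataProcessor,  # our own processor, selected with --task_name=mydata
}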
Run arguments:
--data_dir=data
--task_name=mydata
--vocab_file=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/vocab.txt \
--bert_config_file=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_config.json \
--output_dir=…/mydata_model
--do_train=true
--do_eval=true
--init_checkpoint=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=70 \
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=3.0 \
