201020 Study Notes (BERT)

Prerequisites: word2vec, RNN models, and a basic understanding of how word vectors are built
Focus: the Transformer architecture, how BERT is trained, and practical usage
The basic structure is still the Seq2Seq network commonly used in machine translation models.

Problem with traditional RNNs:
Each step depends on the output of the previous step, so computation cannot be parallelized.
Transformer:
The self-attention mechanism computes everything in parallel; the input and output sequences have the same length, and all output positions are produced at the same time. Transformers have essentially replaced RNNs.
The idea is to fuse each word's context into its word vector.
Take two words x1 and x2:
Step 1: initialize the vectors, i.e. convert each word to an embedding (here a 4-dimensional vector with four features).
Step 2: compute the Q (query), K (key), and V (value) matrices with the help of three auxiliary weight matrices.

Compute a score between the current word and every word in the sequence.
Softmax turns these scores into weights: how much each position influences the encoding of the current word.
Larger vector dimensions tend to produce larger dot products without being more important, so the scores are scaled to remove the effect of dimensionality.
Each word computes a score against every K in the sequence, and the V features are then re-weighted by those scores to produce the attention output.
Overall process (a minimal sketch follows):
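Here is a minimal numpy sketch of that scaled dot-product self-attention for the two-word example; the weight matrices, their sizes, and the random inputs are made up purely for illustration.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # build Q, K, V with three auxiliary matrices
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # score of every word against every word, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)     # how much each position influences the current word
    return weights @ V                     # re-weight the value features -> attention output

X = np.random.randn(2, 4)                  # two words, four features each
Wq = np.random.randn(4, 3)
Wk = np.random.randn(4, 3)
Wv = np.random.randn(4, 3)
Z = self_attention(X, Wq, Wk, Wv)          # attention output, shape (2, 3)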
Multi-head attention
One set of Q/K/V extracts one feature representation of the current word; multiple sets of Q/K/V extract multiple representations, which are concatenated.
After self-attention, the output is added back to the input (a residual connection) and layer normalization is applied; a rough sketch follows.
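A rough numpy sketch of that Add & Norm step (residual addition followed by layer normalization over the feature dimension); the learnable scale and shift of a real layer norm are omitted here.

import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)    # normalize each token's feature vector
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.random.randn(2, 4)                  # input to the self-attention block
attn_out = np.random.randn(2, 4)           # self-attention output (same shape)
out = layer_norm(x + attn_out)             # residual connection, then layer norm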

Decoder: its input and output are also sequences.

Model: BERT_BASE_DIR
Data: glue_data
Task: MRPC, which asks whether two strings describe the same thing.

Files:
BERT_BASE_DIR/uncased…/
bert_config.json: configuration parameters
ckpt: the pretrained model checkpoint released by Google
vocab.txt: the vocabulary, i.e. all the tokens of the corpus

run_classifier.py
Set up the run configuration (in your IDE) with the following arguments:
Arguments:
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=…/GLUE/glue_data/MRPC \
--vocab_file=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/vocab.txt \
--bert_config_file=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=1 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=…/GLUE/output

Task name; do_train toggles training; do_eval toggles evaluation. On Windows, avoid absolute paths and avoid Chinese characters in paths.

run_classifier.py:
Lines 177-192: reading the data is the part you have to implement yourself.
Line 842: train_examples = processor.get_train_examples(FLAGS.data_dir)
reads the training examples (jump to line 299).

num_train_steps = int(len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)

↑ Line 844: train_examples contains 3668 examples; with batch_size=100 that is roughly 3668/100 ≈ 37 iterations per epoch, and with num_train_epochs=3 about 110 training steps in total.

num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)

↑ Line 845, num_warmup_steps: the learning rate is kept small at the start of training and restored after the warmup phase (warmup_proportion is set to 0.1, so the rate ramps up during the first 110 × 0.1 ≈ 11 steps). A sketch of the idea follows.
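The warmup idea in a rough plain-Python sketch (the real schedule lives in optimization.py and also decays the learning rate after warmup; the numbers are the ones computed above):

def learning_rate_at(global_step, init_lr=2e-5, num_warmup_steps=11):
    # during warmup the learning rate ramps up linearly from near 0 to init_lr
    if global_step < num_warmup_steps:
        return init_lr * float(global_step + 1) / float(num_warmup_steps)
    # afterwards the configured rate is used (the repo additionally applies polynomial decay)
    return init_lr

for step in range(15):
    print(step, learning_rate_at(step))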

Line 869: reading the data:

file_based_convert_examples_to_features(train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)

Inside file_based_convert_examples_to_features:

writer = tf.python_io.TFRecordWriter(output_file)

↑ Line 483: the output is written in TFRecord format.

    if ex_index % 10000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

↑ Line 485: log progress every 10,000 examples.

feature = convert_single_example(ex_index, example, label_list,max_seq_length, tokenizer)

↑ Line 489: the core function; jump to line 377.

  label_map = {}
  for (i, label) in enumerate(label_list):  # build the label -> index map
    label_map[label] = i

Line 389: build the label map; here there are just two classes, 0 and 1.

tokens_a = tokenizer.tokenize(example.text_a)  # tokenize the first sentence

Line 393: tokenization; jump to tokenization.py, around line 170:

    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)

↑ WordPiece tokenization.
Example:
<class 'list'>: ['am', '##ro', '##zi', 'accused', 'his', 'brother', ',', 'whom', 'he', 'called', '"', 'the', 'witness', '"', ',', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.']
Chinese text is basically split into individual characters; the general idea is always to break the input into finer-grained pieces. The output above can be reproduced with the sketch below.
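A quick way to reproduce the WordPiece output above (the vocab path is only an assumption; point it at your own BERT_BASE_DIR):

import tokenization  # tokenization.py from the BERT repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=True)

tokens = tokenizer.tokenize("Amrozi accused his brother of deliberately distorting his evidence.")
# -> ['am', '##ro', '##zi', 'accused', 'his', 'brother', ...]
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # look the tokens up in vocab.txt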
Once tokenization is done we return to run_classifier; if a second sentence exists, it is tokenized as well.
Line 398 checks for this:

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"  # reserve room for 3 special tokens
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)  # truncate the pair if it is too long
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

① If the sequences are too long they are truncated (the helper is sketched below).
② With a second sentence b, room is reserved for three special tokens ([CLS], [SEP], [SEP]); without b, for two ([CLS], [SEP]).
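The truncation helper works roughly like this (a sketch of _truncate_seq_pair from memory; it always trims the currently longer sentence so both keep a fair share of the budget):

def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncate the two token lists in place until their total length fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()   # drop from the end of the longer sentence
        else:
            tokens_b.pop()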
Line 408 starts building the encoding; the code's own comment explains the convention:

# The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1  # indicates which sentence each token comes from
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  #
  # Where "type_ids" are used to indicate whether this is the first
  # sequence or the second sequence. The embedding vectors for `type=0` and
  # `type=1` were learned during pre-training and are added to the wordpiece
  # embedding vector (and position vector). This is not *strictly* necessary
  # since the [SEP] token unambiguously separates the sequences, but it makes
  # it easier for the model to learn the concept of sequences.
  #
  # For classification tasks, the first vector (corresponding to [CLS]) is
  # used as the "sentence vector". Note that this only makes sense because
  # the entire model is fine-tuned.

type_id = 0 marks the first sentence, 1 marks the second.

  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

Line 426 starts the encoding: the first token is always [CLS], with segment id 0; then every WordPiece token of sentence a is appended, each with segment id 0; after all of a's tokens, the separator [SEP] is appended, again with segment id 0.

  if tokens_b:
    for token in tokens_b:
      tokens.append(token)
      segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

Line 436: if sentence b exists, its tokens are appended with segment id 1, followed by another [SEP] with segment id 1.
Line 443: the tokens are converted to IDs via their indices in vocab.txt:
<class 'list'>: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]

max_seq_length=128; shorter sequences are padded with zeros.

  while len(input_ids) < max_seq_length:  # pad up to the configured maximum length
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

↑ Line 450: because of the zero padding, self-attention needs an extra mask to tell real tokens from padding: real tokens get input_mask=1 and take part in the attention computation, padded positions get input_mask=0 and do not.
input_ids:<class ‘list’>: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
input_masks:<class ‘list’>: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Line 459 logs the resulting features.

Line 470: InputFeatures (defined around line 161) simply stores these values.
Line 485: loop over every example.
Line 496 onwards: process each example.

Lines 496-502: convert the values into TFRecord features.

Lines 504-505:

    tf_example = tf.train.Example(features=tf.train.Features(feature=features))
    writer.write(tf_example.SerializeToString())

Each tf.train.Example is serialized and written to the TFRecord writer; a self-contained sketch of this step follows.
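A self-contained sketch of this serialization step, with dummy feature values and the TF 1.x API the repo uses (the real code builds the features inside file_based_convert_examples_to_features):

import collections
import tensorflow as tf

def create_int_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

# dummy values just to show the shapes: sequences padded to length 128
input_ids   = [101, 2572, 102] + [0] * 125
input_mask  = [1, 1, 1] + [0] * 125
segment_ids = [0] * 128
label_id    = 1

features = collections.OrderedDict()
features["input_ids"]   = create_int_feature(input_ids)
features["input_mask"]  = create_int_feature(input_mask)
features["segment_ids"] = create_int_feature(segment_ids)
features["label_ids"]   = create_int_feature([label_id])

writer = tf.python_io.TFRecordWriter("train.tf_record")
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
writer.close()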

Embedding layer:
Line 574 onwards creates the BERT model:

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,#(8,128)
      input_mask=input_mask,#(8,128)
      token_type_ids=segment_ids,#(8,128)
      use_one_hot_embeddings=use_one_hot_embeddings)

config: the configuration file
is_training: whether we are training
input_ids: (8, 128), i.e. batch size 8, each sequence 128 tokens long
input_mask: 0 or 1, i.e. whether a position is padding or real content
segment_ids: which sentence each token belongs to

modeling.py:
Line 165:

    if input_mask is None:  # if no mask is given, default to all ones
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

mask: if no mask is provided, it defaults to all ones (which is bad for self-attention when there is padding).
token_type_ids: if the number of sentences is not specified, a single sentence is assumed and everything is set to 0.

The embedding layer is built first, starting at line 171.
Each of the 128 token positions is turned into a vector; the three encodings (word, token type, position) must have the same dimensionality so they can be added together.

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

↑ Line 171, the embedding step: input_ids is 8 × 128; vocab_size is about thirty thousand (fixed by the pretrained model); embedding_size is the embedding dimensionality (768 in the official model); initializer_range is the initialization range (0.02); use_one_hot_embeddings defaults to False. Don't change these parameters when using a pretrained model.

Additional encoding features:
Input: two dimensions, (batch_size × max_length) = 8 × 128
Output: batch_size × max_length × 768-dimensional vectors
modeling.py
Lines 171-180: the word embedding
Line 409 onwards:

  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
  embedding_table = tf.get_variable(  # the word embedding table, 30522 x 768
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  flat_input_ids = tf.reshape(input_ids, [-1])
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)  # on CPU/GPU: 1024 x 768, the lookup results for the whole batch

First an extra dimension is added to the input: 8 × 128 × 1.
Flattening gives flat_input_ids with 8 × 128 × 1 = 1024 entries.
output is 1024 × 768.

  input_shape = get_shape_list(input_ids)
  output = tf.reshape(output,input_shape[0:-1] + [input_shape[-1] * embedding_size]) #(8, 128, 768)
  return (output, embedding_table)

↑ Line 421: output has three dimensions, 8 × 128 × 768 = batch_size × words per sentence × vector per word.
Each word has now become a vector; the same lookup in numpy is sketched below.
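The same lookup in numpy, just to make the shapes explicit (random table, 30522-word vocabulary as in the uncased model):

import numpy as np

vocab_size, hidden = 30522, 768
embedding_table = np.random.normal(0.0, 0.02, (vocab_size, hidden))  # word embedding table

input_ids = np.random.randint(0, vocab_size, (8, 128))  # a batch of token ids
flat_ids = input_ids.reshape(-1)                        # 1024
output = embedding_table[flat_ids]                      # 1024 x 768, same idea as tf.gather
output = output.reshape(8, 128, hidden)                 # back to 8 x 128 x 768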

Lines 184-194: the position encoding (position_embedding) and token-type encoding are added.
This only blends extra information in; the shape does not change.
Jump to line 472:

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(#(2, 768)
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])#(1024)
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width]) #8, 128, 768
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))

Because at most two sentences are assumed, the token-type table is (2, 768); the type id can only be 0 or 1.
For every token we look up whether it belongs to sentence 0 or 1.
The one-hot here is purely for speed (the "vocabulary" of types is tiny).
The multiplication token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
is (1024 × 2) × (2 × 768): 1024 tokens, each with 2 possible types, and the table maps each type to a 768-dimensional vector, so the result is again 1024 × 768.
It is then reshaped to 8 × 128 × 768.
full_position_embeddings is 512 × 768.
Lines 505-507:

      position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])  # the position table is larger than needed; for speed, slice out just the first seq_length rows -> 128 x 768
      num_dims = len(output.shape.as_list())

A slice is taken from the 512 × 768 table, so the returned position_embeddings only covers 128 × 768.

Lines 512-518:

      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])  # [1, 128, 768]: the position encoding does not depend on the input data; the original embedding has batch_size as its first dimension, so a leading 1 is added here for the computation
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

The position embedding obtained here is 128 × 768, and the same one is added to every example in the batch; since it does not depend on which words are fed in, an extra dimension is added to get [1, 128, 768].
So the extra encodings capture: 1. which sentence a token is in (2 possibilities), 2. its position (128 positions).

  output = layer_norm_and_dropout(output, dropout_prob)
  return output

Finally layer normalization and dropout are applied; the output is the sum of the three encodings. A shape-only sketch is below.
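Putting the three encodings together, as a shape-only numpy sketch (random values; the real layer_norm_and_dropout also applies learned layer-norm parameters):

import numpy as np

batch, seq, hidden = 8, 128, 768
word_embeddings       = np.random.randn(batch, seq, hidden)  # from the word embedding lookup
token_type_embeddings = np.random.randn(batch, seq, hidden)  # which sentence each token is from
position_embeddings   = np.random.randn(1, seq, hidden)      # identical for every example

# all three share the last dimension (768) and are simply summed;
# the position encoding broadcasts over the batch dimension
output = word_embeddings + token_type_embeddings + position_embeddings  # 8 x 128 x 768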

Mask mechanism
modeling.py, line 200:

        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)

For every word, this determines which words it should attend to (attend to positions marked 1, ignore those marked 0).
Input: 8 × 128; output: 8 × 128 × 128, where the last 128 encodes which words each word is allowed to see. A numpy sketch follows.
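A numpy sketch of how the 8 × 128 input_mask becomes the 8 × 128 × 128 attention mask (each row says which positions that word may attend to):

import numpy as np

batch, seq = 8, 128
input_mask = np.zeros((batch, seq), dtype=np.float32)
input_mask[:, :50] = 1.0   # pretend the first 50 tokens are real, the rest padding

# broadcast a column of ones against the row mask,
# the same idea as create_attention_mask_from_input_mask
attention_mask = np.ones((batch, seq, 1), dtype=np.float32) * input_mask[:, None, :]
print(attention_mask.shape)   # (8, 128, 128)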

Transformer
modeling.py, line 205:

self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,  # number of Transformer (encoder) layers
            num_attention_heads=config.num_attention_heads,  # number of attention heads
            intermediate_size=config.intermediate_size,  # number of units in the feed-forward (intermediate) layer
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)  # whether to return the output of every layer

input_tensor: the embedding output computed above
attention_mask: the 0/1 mask saying whether each word should be attended to
Fine-tuning continues from the pretrained weights, so many of these parameters must not be changed.
Line 802:

  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

hidden_size = 768
num_attention_heads = 12
768 / 12 = 64 features per head; the per-head vectors are concatenated back together afterwards. If hidden_size were not divisible by the number of heads, the later computations would get messy. The head split is sketched below.
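The head split itself is just a reshape plus a transpose; modeling.py does it roughly like this (sketch of transpose_for_scores from memory):

import tensorflow as tf

def transpose_for_scores(input_tensor, batch_size, num_heads, seq_length, head_size):
    # [batch*seq, num_heads*head_size] -> [batch, seq, num_heads, head_size]
    output = tf.reshape(input_tensor, [batch_size, seq_length, num_heads, head_size])
    # -> [batch, num_heads, seq, head_size], so each head runs its own 128 x 128 attention
    return tf.transpose(output, [0, 2, 1, 3])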
Line 807 ↓

  attention_head_size = int(hidden_size / num_attention_heads)  # 768 output features in total, split evenly across the heads
  input_shape = get_shape_list(input_tensor, expected_rank=3) # [8, 128, 768]
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

Line 815 ↓

  if input_width != hidden_size:  # the residual connection requires the two to have the same dimensionality
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

The residual connection is an addition, not a concatenation.
The input is 768-dimensional, so the output must also be 768-dimensional for the addition to work, hence this check.

reshape: 8 × 128 is flattened to 1024 (probably for speed, see the comment below).
Line 819:

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)  # keep the tensor 2D; the reshape is probably for speed (see the comment above)

The input becomes 1024 × 768.
Line 825:

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

Each layer's output becomes the next layer's input; layer_input and prev_output are both 1024 × 768.
The attention block starts at line 830:

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

from_tensor and to_tensor are both layer_input, i.e. attention over itself (self-attention).
attention_mask: the 0/1 mask built earlier.
A 2D tensor is returned; from_seq_length and to_seq_length are both 128.

Line 558: the attention_layer function:

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `

Line 637:

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])#[1024, 768]
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])#[1024, 768]
  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences) 8
  #   F = `from_tensor` sequence length 128
  #   T = `to_tensor` sequence length 128
  #   N = `num_attention_heads` 12
  #   H = `size_per_head` 64

Building the Q, K, V matrices:
Query, line 666:

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

The Query matrix is built from from_tensor with a dense layer of size num_attention_heads × size_per_head = 12 × 64, so query_layer is 1024 × 768 ([B*F, N*H]).
8 examples × 128 words = 1024 words; every word computes dot products with every other word, so each word needs its own query vector; 12 heads × 64 features per head = 768.

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

The Key matrix must match the Query matrix in dimensionality; it is built from to_tensor, with otherwise the same parameters.

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

V holds the features that actually get re-weighted (see the Q/K/V description earlier), so V has the same dimensionality as K.

Dot-product computation:

  # `query_layer` = [B, N, F, H]  # transposed to speed up the dot products
  query_layer = transpose_for_scores(query_layer, batch_size, num_attention_heads,
                                     from_seq_length, size_per_head)
  # `key_layer` = [B, N, T, H]  # transposed to speed up the dot products
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)
  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)  # result: (8, 12, 128, 128)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))  # remove the effect of dimensionality

The raw scores are divided by √dk = √64 = 8 to remove the effect of the feature dimensionality (attention = softmax(QKᵀ/√dk) · V).

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0  # mask=1 -> adder is 0; mask=0 -> adder is a very large negative number

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder  # adding this to the raw scores leaves mask=1 positions unchanged and pushes mask=0 positions to a huge negative value

When the attention mask is 1 (a real token): adder = (1 - 1) × -10000 = 0.
When the mask is 0 (padding): adder = (1 - 0) × -10000 = -10000.
After softmax, a score around -10000 gets essentially zero probability, so padded positions receive no attention weight. A tiny numeric demo is below.
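A tiny numeric demo of the -10000 trick (toy scores over 4 positions, the last two of which are padding):

import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.1])
mask   = np.array([1.0, 1.0, 0.0, 0.0])   # 1 = real token, 0 = padding

adder = (1.0 - mask) * -10000.0           # 0 for real tokens, -10000 for padding
masked = scores + adder

probs = np.exp(masked - masked.max())
probs /= probs.sum()
print(probs)   # padded positions get essentially zero probability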

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)  # softmax of a very large negative score is ~0, so masked positions are effectively ignored

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

Multiplying the attention probabilities by V:

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])  # (8, 128, 12, 64)

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])  # (8, 12, 128, 64)

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)  # the final context features, (8, 12, 128, 64)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])  # back to [8, 128, 12, 64]

Line 857: residual connection:

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"): #1024, 768 残差连接
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

After the fully connected layers, line 884 decides what to return:

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

Either the output of every layer is returned, or only the final layer's.

Creating the model
run_classifier.py, line 577:

 """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,#(8,128)
      input_mask=input_mask,#(8,128)
      token_type_ids=segment_ids,#(8,128)
      use_one_hot_embeddings=use_one_hot_embeddings)

Line 590 defines the output:

  # If you want to use the token-level output, use model.get_sequence_output()
  # instead.
  output_layer = model.get_pooled_output()
  hidden_size = output_layer.shape[-1].value  #768

  output_weights = tf.get_variable(  # the classification head: an extra fully connected layer
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(  # bias for the two classes (0 and 1), fine-tuned
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

get_pooled_output: takes the first position, the [CLS] token, whose representation covers the whole input.
hidden_size: 768
output_weights: (2, 768)
num_labels = 2: binary classification

modeling.py, line 205: the final result:

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,  # number of Transformer (encoder) layers
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,  # number of units in the feed-forward layer
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)  # whether to return the output of every layer

      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

first_token_tensor: the first token's vector, i.e. [CLS].

Once BertModel has produced the encoded representation, you attach whatever fully connected head your task needs.
run_classifier.py, line 601:

 with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)

logits = output_layer × weights + bias; a softmax gives the class probabilities, and the loss is the cross-entropy between the one-hot labels and the log-softmax.

To adapt this to your own task you basically only need to change the data reading and preprocessing, around line 177 of run_classifier.py:

class DataProcessor(object):
  """Base class for data converters for sequence classification data sets."""

  def get_train_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the train set."""
    raise NotImplementedError()

  def get_dev_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the dev set."""
    raise NotImplementedError()

  def get_test_examples(self, data_dir):
    """Gets a collection of `InputExample`s for prediction."""
    raise NotImplementedError()

  def get_labels(self):
    """Gets the list of labels for this data set."""
    raise NotImplementedError()


Reading your own data set: run_classifier.py, line 199:


To avoid touching the rest of the source code, text_b is simply set to None (a single-sentence task).
InputExample is just a plain container; each record is wrapped in one and appended to the examples list.

class MyDataProcessor(DataProcessor):  # our own processor, inheriting from DataProcessor
      """Data converter for our own sequence classification data set."""

      def get_train_examples(self, data_dir):
          """Gets a collection of `InputExample`s for the train set."""
          file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\train_sentiment.txt')
          f = open(file_path, 'r', encoding='utf-8')
          train_data = []
          index = 0
          for line in f.readlines():
              guid = "train-%d" % (index)  # assign an id
              line = line.replace('\n', '').split('\t')  # strip the newline and split on tabs
              text_a = tokenization.convert_to_unicode(str(line[1]))
              label = str(line[2])
              train_data.append(
                  InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
              index += 1
          return train_data

      def get_dev_examples(self, data_dir):
          """Gets a collection of `InputExample`s for the dev set."""
          file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\test_sentiment.txt')
          f = open(file_path, 'r', encoding='utf-8')
          dev_data = []
          index = 0
          for line in f.readlines():
              guid = "dev-%d" % (index)  # assign an id
              line = line.replace('\n', '').split('\t')  # strip the newline and split on tabs
              text_a = tokenization.convert_to_unicode(str(line[1]))
              label = str(line[2])
              dev_data.append(
                  InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
              index += 1
          return dev_data

      def get_test_examples(self, data_dir):
          """Gets a collection of `InputExample`s for prediction."""
          file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\test.csv')
          test_df = pd.read_csv(file_path, encoding='utf-8')  # requires `import pandas as pd` at the top of the file
          test_data = []
          for index, test in enumerate(test_df.values):
              guid = "test-%d" % (index)
              text_a = tokenization.convert_to_unicode(str(test[0]))
              label = str(test[1])
              test_data.append(
                  InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
          return test_data

      def get_labels(self):
          """Gets the list of labels for this data set."""
          return ['0', '1', '2']

Register your own preprocessing as a task:
In the processors dict, add the entry "mydata": MyDataProcessor (see the sketch below).
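The registration is one extra entry in the processors dict inside main() of run_classifier.py (a sketch; the other entries are the processors shipped with the repo):

processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "mydata": MyDataProcessor,  # our own processor, selected with --task_name=mydata
}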
Run arguments:
--data_dir=data
--task_name=mydata
--vocab_file=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/vocab.txt \
--bert_config_file=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_config.json \
--output_dir=…/mydata_model
--do_train=true
--do_eval=true
--init_checkpoint=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=70 \
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=3.0 \
