Overview
- BERT: Bidirectional Encoder Representations from Transformers.
- pre-train: unsupervised learning on a very large corpus yields general-purpose token representations.
- fine-tune: add one task-specific layer on top of the pre-trained model and fine-tune all parameters; this alone already gives satisfying results.
1. The GLUE Benchmark
GLUE: General Language Understanding Evaluation, pronounced [ɡluː], like the English word "glue".
See references [1] [2].
It is a popular benchmark for evaluating NLP tasks. Its main task categories are:
1.1 Single-sentence classification
CoLA
The Corpus of Linguistic Acceptability. Decide whether a sequence of words forms a grammatical English sentence.
SST-2
Stanford Sentiment Treebank. Classify the sentiment of sentences such as movie reviews into two classes: positive / negative.
1.2 Semantic equivalence
MRPC
The Microsoft Research Paraphrase Corpus. Decide whether a sentence pair is semantically equivalent.
QQP
Quora Question Pairs. Binary classification of sentence pairs: given two questions from the Quora Q&A site, decide whether they ask the same thing.
1.3 Inference tasks
MNLI
Multi-Genre Natural Language Inference. An inference task: given a sentence pair, decide which of the following three relations the second sentence has to the first:
- entailment: the second sentence can be inferred from the first
- contradiction: the two sentences contradict each other
- neutral: neither entailment nor contradiction
Sentence 1: A man is playing a large drum.
Sentence 2: The man is playing an instrument.
Label: entailment

Sentence 1: A man is playing a large drum.
Sentence 2: The man is dancing.
Label: contradiction
QNLI
Question Natural Language Inference. Binary classification of sentence pairs. The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph which do not contain the answer.
WNLI
Winograd Natural Language Inference. A reading-comprehension task: the system reads a sentence containing a pronoun and selects, from a list of choices, what the pronoun refers to.
2. The BERT Model
BERT is a stack of Transformer encoder blocks; there is no decoder.
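For orientation, here is a minimal sketch of building the model with the official modeling.py (TF1 style). The config path and the token id values below are made-up placeholders, not taken from the original post.

import tensorflow as tf        # TF 1.x
from bert import modeling      # google-research/bert

# Toy inputs: batch of 2 sequences, max length 8.
# For a sentence pair "[CLS] A [SEP] B [SEP]", token_type_ids is 0 over the first
# segment and 1 over the second; input_mask is 1 on real tokens, 0 on padding.
input_ids = tf.constant([[101, 2769, 812, 102, 3221, 102, 0, 0],
                         [101, 872, 1962, 102, 0, 0, 0, 0]], dtype=tf.int32)
input_mask = tf.constant([[1, 1, 1, 1, 1, 1, 0, 0],
                          [1, 1, 1, 1, 0, 0, 0, 0]], dtype=tf.int32)
token_type_ids = tf.constant([[0, 0, 0, 0, 1, 1, 0, 0],
                              [0, 0, 0, 0, 0, 0, 0, 0]], dtype=tf.int32)

bert_config = modeling.BertConfig.from_json_file("bert_config.json")  # placeholder path
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=token_type_ids)

sequence_output = model.get_sequence_output()  # [batch_size, seq_length, hidden_size]
pooled_output = model.get_pooled_output()      # [batch_size, hidden_size], from [CLS]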
token_emb
embedding_table = tf.get_variable(
    name=word_embedding_name,
    shape=[vocab_size, embedding_size],
    initializer=tf.truncated_normal_initializer(stddev=initializer_range))
# initializer_range = 0.02
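The snippet above only creates the table; the lookup itself is, in simplified form (the library also has a one-hot matmul path for TPUs, omitted here):

# input_ids: [batch_size, seq_length] int32
token_embeddings = tf.nn.embedding_lookup(embedding_table, input_ids)
# token_embeddings: [batch_size, seq_length, embedding_size]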
position encoding
The Transformer uses fixed sin/cos position encodings; in BERT, each position instead has its own learnable embedding that is trained along with the rest of the model.
In bert.modeling.embedding_postprocessor():
full_position_embeddings = tf.get_variable(
    name=position_embedding_name,
    shape=[max_position_embeddings, width],
    initializer=create_initializer(initializer_range))
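A rough sketch of what embedding_postprocessor() then does with this table (simplified, not the exact library code): slice out the first seq_length rows and broadcast-add them to the token embeddings.

# token_embeddings: [batch_size, seq_length, width], output of the token-embedding lookup
seq_length = 128   # example value
width = 768        # hidden size
position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])
# broadcast over the batch dimension and add
output = token_embeddings + tf.reshape(position_embeddings, [1, seq_length, width])
# (the real code also adds token-type embeddings, then applies layer norm and dropout)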
pre-train tasks and loss
Pre-training consists of the following two tasks.
masked LM task
At input time, tokens in the sequence are replaced with [MASK] with probability 0.15, and the model is trained to predict the original tokens at those positions.
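A minimal sketch of that masking step (simplified: the real create_pretraining_data.py keeps 10% of the selected tokens unchanged and replaces another 10% with random tokens instead of always using [MASK]):

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace tokens with [MASK]; return masked tokens and the labels to predict."""
    masked = list(tokens)
    labels = {}  # position -> original token
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:
            labels[i] = tok
            masked[i] = mask_token
    return masked, labels

# e.g. mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"])

The loss for these masked positions is then computed as follows.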
# Multiply by the whole vocab embedding matrix (transposed): a dot product against every
# token embedding, giving one logit per vocabulary token (argmax would give the prediction).
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
# The next two lines are equivalent to (multi-class) softmax cross-entropy; see reference [1]
log_probs = tf.nn.log_softmax(logits, axis=-1)
per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
next sentence prediction task
Many GLUE tasks involve language inference, i.e. judging the relationship between two sentences, so pre-training should also include a task about sentence relationships; the paper notes that "pre-training towards this task is very beneficial to both QA and NLI".
BERT takes the hidden vector of [CLS] and adds an FC head on top for binary classification:
first dense(hidden_size, tanh), then dense(2, activation=None).
The code is shown below.
with tf.variable_scope("pooler"):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token. We assume that this has been pre-trained.
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,  # 768
        activation=tf.tanh,  # in [-1, 1]
        kernel_initializer=create_initializer(config.initializer_range))
def get_next_sentence_output(bert_config, input_tensor, labels):
  """Get loss and log probs for the next sentence prediction."""
  # Simple binary classification. Note that 0 is "next sentence" and 1 is
  # "random sentence". This weight matrix is not used after pre-training.
  with tf.variable_scope("yichu_cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights",
        shape=[2, bert_config.hidden_size],  # hidden_size changed to 128 in our modified model
        initializer=modeling.create_initializer(bert_config.initializer_range))
    output_bias = tf.get_variable(
        "output_bias", shape=[2], initializer=tf.zeros_initializer())

    # input_tensor is the pooled_output shown above
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    labels = tf.reshape(labels, [-1])
    one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, per_example_loss, log_probs)
The code uses log_softmax() together with a one-hot reduction, which is exactly softmax cross-entropy; see reference [7] for a comparison and discussion.
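A quick sanity check of that equivalence (a standalone TF1 sketch with made-up logits, not part of the BERT code):

import numpy as np
import tensorflow as tf  # TF 1.x

logits = tf.constant(np.random.randn(4, 10), dtype=tf.float32)  # 4 examples, 10 classes
labels = tf.constant([1, 3, 0, 7])
one_hot = tf.one_hot(labels, depth=10)

manual = -tf.reduce_sum(one_hot * tf.nn.log_softmax(logits, axis=-1), axis=-1)
builtin = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)

with tf.Session() as sess:
    a, b = sess.run([manual, builtin])
    print(np.allclose(a, b))  # True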
additional pre-training
The official GitHub release includes the BERT-Base, Chinese
pre-trained model: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters.
Starting from this released ckpt, we can continue pre-training on our own corpus.
If your task has a large domain-specific corpus available (e.g., “movie reviews” or “scientific papers”), it will likely be beneficial to run additional steps of pre-training on your corpus, starting from the BERT checkpoint.
Modifying the model
The official BERT-Base, Chinese release has hidden size 768; if we want a lower-dimensional hidden vector (e.g. d=128), we can append a 128-d dense layer at the end of the network, as sketched below.
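A sketch of that change, projecting the 768-d pooled_output down to 128 dimensions. The scope name and variable names here are our own illustrative choices, not part of the official code.

# pooled_output: [batch_size, 768], from model.get_pooled_output()
with tf.variable_scope("yichu_projection"):
    low_dim_output = tf.layers.dense(
        pooled_output,
        128,  # the new, lower hidden dimension
        activation=tf.tanh,
        kernel_initializer=modeling.create_initializer(0.02))
# low_dim_output: [batch_size, 128]; the NSP head above then uses hidden_size=128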
First continued pre-training run
- init_checkpoint
  Here this is the official ckpt, e.g. bert_base_dir/chinese_L-12_H-768_A-12/bert_model.ckpt.
  All parameters in the ckpt are restored with tf.train.init_from_checkpoint(init_checkpoint, assignment_map); see the sketch after this list.
- output_dir
  The directory of the continued-run model, corresponding to tf.estimator.RunConfig(model_dir=FLAGS.output_dir, ...).
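Roughly how that restore happens inside the pre-training script (a simplified sketch around the helper in modeling.py):

# init_checkpoint = FLAGS.init_checkpoint, the path to the official ckpt
tvars = tf.trainable_variables()
# Map variables in the current graph to variables found in the checkpoint;
# variables that only exist in our modified model are left out of the map
# and simply keep their random initialization.
(assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(
    tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)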
Continuing a continued run
At this point the init_checkpoint argument is no longer needed: all parameters (including those newly added by our own model) already live in output_dir.
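This relies on the Estimator's usual behavior of resuming from the latest checkpoint in model_dir; a minimal sketch, assuming model_fn and FLAGS as in the pre-training script (the real script uses the TPU variants of these classes):

run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.output_dir,  # already contains our continued-run checkpoints
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# estimator.train(...) resumes from the latest checkpoint in output_dir,
# so no init_checkpoint is needed for the second continued run.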
fine-tune
Figure: fine-tune architectures for the different downstream tasks.
Note how [CLS] is used differently in sentence-pair tasks versus single-sentence tasks.
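For the sentence(-pair) classification tasks, the fine-tune head is essentially the NSP head with num_labels classes instead of 2; a sketch along the lines of the model-building code in run_classifier.py (the function name here is our own):

def classifier_output(bert_config, pooled_output, labels, num_labels, is_training):
    """Task-specific head for fine-tuning: one dense layer over the [CLS] pooled vector."""
    output_weights = tf.get_variable(
        "output_weights", [num_labels, bert_config.hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    if is_training:
        pooled_output = tf.nn.dropout(pooled_output, keep_prob=0.9)
    logits = tf.matmul(pooled_output, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)

    log_probs = tf.nn.log_softmax(logits, axis=-1)
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return loss, logits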
feature extraction
In certain cases, rather than fine-tuning the entire pre-trained model end-to-end, it can be beneficial to obtain pre-trained contextual embeddings, which are fixed contextual representations of each input token generated from the hidden layers of the pre-trained model.
This corresponds to the extract_features.py script.
- init_checkpoint
  If you have run your own continued pre-training, use your own ckpt; otherwise use the officially released one.
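If you prefer to stay in Python rather than call the script, the same fixed features can be read off the model object directly (a sketch; taking the last four layers is our choice, echoing the paper's feature-based experiments):

# model is the modeling.BertModel instance, with the chosen init_checkpoint restored
all_layers = model.get_all_encoder_layers()    # list of [batch, seq_len, hidden], one per layer
sequence_output = model.get_sequence_output()  # the final layer only

# e.g. concatenate the last 4 layers as fixed contextual token features
token_features = tf.concat(all_layers[-4:], axis=-1)  # [batch, seq_len, 4 * hidden]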
3. RoBERTa and Other Successors
RoBERTa, A Robustly Optimized BERT Pretraining Approach (I am not fond of the common Chinese rendering of "robust" as "鲁棒").
- Data: adds CC-NEWS; the corpus grows from 16 GB to 160 GB.
- Graph / network structure: longer sequences; the NSP task is removed.
- Training: larger batch size; modified Adam hyperparameters.
Performance comparison, see the figure below: on MNLI, for example, the score improves from roughly 86 to 90.