Overview
- BERT: Bidirectional Encoder Representations from Transformers.
- pre-train: unsupervised learning on a very large corpus yields general-purpose token representations.
- fine-tune: add one task-specific layer on top of the pre-trained model and fine-tune all parameters; this alone already gives satisfying results.
1. The GLUE Benchmark
GLUE: General Language Understanding Evaluation, pronounced [ɡluː], like the English word "glue".
See references [1] [2].
It is a popular benchmark for evaluating NLP tasks. Its main task categories are:
1.1 Single-sentence classification
CoLA
The Corpus of Linguistic Acceptability. Decide whether a sequence of words forms a grammatical English sentence.
SST-2
Stanford Sentiment Treebank. Classify the sentiment of sentences such as movie reviews into two classes: positive / negative.
1.2 Semantic equivalence
MRPC
The Microsoft Research Paraphrase Corpus. Decide whether a sentence pair is semantically equivalent.
QQP
Quora Question Pairs. Binary classification of sentence pairs: given two questions from the Quora Q&A site, decide whether they ask the same thing.
1.3 Inference tasks
MNLI
Multi-Genre Natural Language Inference. An inference task: given a sentence pair, decide which of the following three relations the second sentence has to the first:
- entailment: the second sentence can be inferred from the first
- contradiction: the two sentences contradict each other
- neutral: neither entailment nor contradiction
Sentence 1: A man is playing a large drum.
Sentence 2: The man is playing an instrument.
Label: entailment

Sentence 1: A man is playing a large drum.
Sentence 2: The man is dancing.
Label: contradiction
QNLI
Question Natural Language Inference. Binary classification of sentence pairs. The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph which do not contain the answer.
WNLI
Winograd Natural Language Inference. A reading-comprehension task: the system reads a sentence containing a pronoun and selects, from a list of choices, what the pronoun refers to.
2. The BERT Model
BERT is a stack of Transformer encoder blocks; there is no decoder.
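For orientation, here is a minimal sketch of building the model with the official modeling.py (TF1 style). The config path and the token id values below are made-up placeholders, not taken from the original post.

import tensorflow as tf        # TF 1.x
from bert import modeling      # google-research/bert

# Toy inputs: batch of 2 sequences, max length 8.
# For a sentence pair "[CLS] A [SEP] B [SEP]", token_type_ids is 0 over the first
# segment and 1 over the second; input_mask is 1 on real tokens, 0 on padding.
input_ids = tf.constant([[101, 2769, 812, 102, 3221, 102, 0, 0],
                         [101, 872, 1962, 102, 0, 0, 0, 0]], dtype=tf.int32)
input_mask = tf.constant([[1, 1, 1, 1, 1, 1, 0, 0],
                          [1, 1, 1, 1, 0, 0, 0, 0]], dtype=tf.int32)
token_type_ids = tf.constant([[0, 0, 0, 0, 1, 1, 0, 0],
                              [0, 0, 0, 0, 0, 0, 0, 0]], dtype=tf.int32)

bert_config = modeling.BertConfig.from_json_file("bert_config.json")  # placeholder path
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=token_type_ids)

sequence_output = model.get_sequence_output()  # [batch_size, seq_length, hidden_size]
pooled_output = model.get_pooled_output()      # [batch_size, hidden_size], from [CLS]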
token_emb
embedding_table = tf.get_variable(
    name=word_embedding_name,
    shape=[vocab_size, embedding_size],
    initializer=tf.truncated_normal_initializer(stddev=initializer_range))
# initializer_range = 0.02
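The snippet above only creates the table; the lookup itself is, in simplified form (the library also has a one-hot matmul path for TPUs, omitted here):

# input_ids: [batch_size, seq_length] int32
token_embeddings = tf.nn.embedding_lookup(embedding_table, input_ids)
# token_embeddings: [batch_size, seq_length, embedding_size]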
position encoding
The Transformer uses fixed sin/cos position encodings; in BERT, each position instead has its own learnable embedding that is trained along with the rest of the model.
In bert.modeling.embedding_postprocessor():
full_position_embeddings = tf.get_variable(
    name=position_embedding_name,
    shape=[max_position_embeddings, width],
    initializer=create_initializer(initializer_range))
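A rough sketch of what embedding_postprocessor() then does with this table (simplified, not the exact library code): slice out the first seq_length rows and broadcast-add them to the token embeddings.

# token_embeddings: [batch_size, seq_length, width], output of the token-embedding lookup
seq_length = 128   # example value
width = 768        # hidden size
position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])
# broadcast over the batch dimension and add
output = token_embeddings + tf.reshape(position_embeddings, [1, seq_length, width])
# (the real code also adds token-type embeddings, then applies layer norm and dropout)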
pre-train tasks and loss
Pre-training consists of the following two tasks.
masked LM task
At input time, tokens in the sequence are replaced with [MASK] with probability 0.15, and the model is trained to predict the original tokens at those positions.
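A minimal sketch of that masking step (simplified: the real create_pretraining_data.py keeps 10% of the selected tokens unchanged and replaces another 10% with random tokens instead of always using [MASK]):

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace tokens with [MASK]; return masked tokens and the labels to predict."""
    masked = list(tokens)
    labels = {}  # position -> original token
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:
            labels[i] = tok
            masked[i] = mask_token
    return masked, labels

# e.g. mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"])

The loss for these masked positions is then computed as follows.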
# Multiply by the whole vocab embedding matrix (transposed): a dot product against every
# token embedding, giving one logit per vocabulary token (argmax would give the prediction).
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
# The next two lines are equivalent to (multi-class) softmax cross-entropy; see reference [1]
log_probs = tf.nn.log_softmax(logits, axis=-1)
per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
next sentence prediction task
Many GLUE tasks involve language inference, i.e. judging the relationship between two sentences, so pre-training should also include a task about sentence relationships; the paper notes that "pre-training towards this task is very beneficial to both QA and NLI".
BERT takes the hidden vector of [CLS] and adds an FC head on top for binary classification:
first dense(hidden_size, tanh), then dense(2, activation=None).
The code is shown below.
with tf.variable_scope("pooler"):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token. We assume that this has been pre-trained.
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,  # 768
        activation=tf.tanh,  # in [-1, 1]
        kernel_initializer=create_initializer(config.initializer_range))
def get_next_sentence_output(bert_config, input_tensor, labels):
  """Get loss and log probs for the next sentence prediction."""
  # Simple binary classification. Note that 0 is "next sentence" and 1 is
  # "random sentence". This weight matrix is not used after pre-training.
  with tf.variable_scope("yichu_cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights",
        shape=[2, bert_config.hidden_size],  # hidden_size changed to 128 in our modified model
        initializer=modeling.create_initializer(bert_config.initializer_range))
    output_bias = tf.get_variable(
        "output_bias", shape=[2], initializer=tf.zeros_initializer())

    # input_tensor is the pooled_output shown above
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    labels = tf.reshape(labels, [-1])
    one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, per_example_loss, log_probs)
The code uses log_softmax() together with a one-hot reduction, which is exactly softmax cross-entropy; see reference [7] for a comparison and discussion.
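A quick sanity check of that equivalence (a standalone TF1 sketch with made-up logits, not part of the BERT code):

import numpy as np
import tensorflow as tf  # TF 1.x

logits = tf.constant(np.random.randn(4, 10), dtype=tf.float32)  # 4 examples, 10 classes
labels = tf.constant([1, 3, 0, 7])
one_hot = tf.one_hot(labels, depth=10)

manual = -tf.reduce_sum(one_hot * tf.nn.log_softmax(logits, axis=-1), axis=-1)
builtin = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)

with tf.Session() as sess:
    a, b = sess.run([manual, builtin])
    print(np.allclose(a, b))  # True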
additional pre-training
The official GitHub release includes the BERT-Base, Chinese
pre-trained model: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters.
Starting from this released ckpt, we can continue pre-training on our own corpus.
If your task has a large domain-specific corpus available (e.g., “movie reviews” or “scientific papers”), it will likely be beneficial to run additional steps of pre-training on your corpus, starting from the BERT checkpoint.
Modifying the model
The official BERT-Base, Chinese release has hidden size 768; if we want a lower-dimensional hidden vector (e.g. d=128), we can append a 128-d dense layer at the end of the network, as sketched below.
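A sketch of that change, projecting the 768-d pooled_output down to 128 dimensions. The scope name and variable names here are our own illustrative choices, not part of the official code.

# pooled_output: [batch_size, 768], from model.get_pooled_output()
with tf.variable_scope("yichu_projection"):
    low_dim_output = tf.layers.dense(
        pooled_output,
        128,  # the new, lower hidden dimension
        activation=tf.tanh,
        kernel_initializer=modeling.create_initializer(0.02))
# low_dim_output: [batch_size, 128]; the NSP head above then uses hidden_size=128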
First continued pre-training run
- init_checkpoint
  Here this is the official ckpt, e.g. bert_base_dir/chinese_L-12_H-768_A-12/bert_model.ckpt.
  All parameters in the ckpt are restored with tf.train.init_from_checkpoint(init_checkpoint, assignment_map); see the sketch after this list.
- output_dir
  The directory of the continued-run model, corresponding to tf.estimator.RunConfig(model_dir=FLAGS.output_dir, ...).
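Roughly how that restore happens inside the pre-training script (a simplified sketch around the helper in modeling.py):

# init_checkpoint = FLAGS.init_checkpoint, the path to the official ckpt
tvars = tf.trainable_variables()
# Map variables in the current graph to variables found in the checkpoint;
# variables that only exist in our modified model are left out of the map
# and simply keep their random initialization.
(assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(
    tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)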
Continuing a continued run
At this point the init_checkpoint argument is no longer needed: all parameters (including those newly added by our own model) already live in output_dir.
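This relies on the Estimator's usual behavior of resuming from the latest checkpoint in model_dir; a minimal sketch, assuming model_fn and FLAGS as in the pre-training script (the real script uses the TPU variants of these classes):

run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.output_dir,  # already contains our continued-run checkpoints
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# estimator.train(...) resumes from the latest checkpoint in output_dir,
# so no init_checkpoint is needed for the second continued run.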
fine-tune
Figure: fine-tune architectures for the different downstream tasks.
Note how [CLS] is used differently in sentence-pair tasks versus single-sentence tasks.
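For the sentence(-pair) classification tasks, the fine-tune head is essentially the NSP head with num_labels classes instead of 2; a sketch along the lines of the model-building code in run_classifier.py (the function name here is our own):

def classifier_output(bert_config, pooled_output, labels, num_labels, is_training):
    """Task-specific head for fine-tuning: one dense layer over the [CLS] pooled vector."""
    output_weights = tf.get_variable(
        "output_weights", [num_labels, bert_config.hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    if is_training:
        pooled_output = tf.nn.dropout(pooled_output, keep_prob=0.9)
    logits = tf.matmul(pooled_output, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)

    log_probs = tf.nn.log_softmax(logits, axis=-1)
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return loss, logits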
feature extraction
In certain cases, rather than fine-tuning the entire pre-trained model end-to-end, it can be beneficial to obtain pre-trained contextual embeddings, which are fixed contextual representations of each input token generated from the hidden layers of the pre-trained model.
This corresponds to the extract_features.py script.
- init_checkpoint
  If you have run your own continued pre-training, use your own ckpt; otherwise use the officially released one.
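If you prefer to stay in Python rather than call the script, the same fixed features can be read off the model object directly (a sketch; taking the last four layers is our choice, echoing the paper's feature-based experiments):

# model is the modeling.BertModel instance, with the chosen init_checkpoint restored
all_layers = model.get_all_encoder_layers()    # list of [batch, seq_len, hidden], one per layer
sequence_output = model.get_sequence_output()  # the final layer only

# e.g. concatenate the last 4 layers as fixed contextual token features
token_features = tf.concat(all_layers[-4:], axis=-1)  # [batch, seq_len, 4 * hidden]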
3. RoBERTa and Other Successors
RoBERTa, A Robustly Optimized BERT Pretraining Approach (I am not fond of the common Chinese rendering of "robust" as "鲁棒").
- Data: adds CC-NEWS; the corpus grows from 16 GB to 160 GB.
- Graph / network structure: longer sequences; the NSP task is removed.
- Training: larger batch size; modified Adam hyperparameters.
Performance comparison, see the figure below: on MNLI, for example, the score improves from roughly 86 to 90.