 Paper:《BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding用于语言理解的深度双向Transformers预训练模型》翻译与解读 

导读:BERT(Bidirectional Encoder Representations from Transformers)是 Google AI 于 NAACL 2019 提出的一个预训练语言模型。BERT 的创新点在于提出了有效的无监督预训练任务,使模型能够从无标注语料中学到通用的语言表示能力,更擅长文本理解。

BERT模型的主要特点和核心技术包括:

>> 深度双向Transformer编码器:采用深度双向Transformer编码器框架,可以同时学习词符的左右上下文信息。这与传统单向语言模型不同,可以更充分地学习词符依赖关系。

>> 联合MLM和NSP任务:在预训练过程中,利用掩蔽语言模型目标函数。通过随机掩蔽部分词符,强制模型学习词符全序列依赖关系。这比单纯预测下一个词符更符合实际应用场景。联合利用下一个句子预测目标函数,可以进一步优化文本对表示学习效果。

>> 下游任务只需简单修改:微调阶段仅需增添一个简单的输出层,新增参数极少,并对所有参数进行端到端调优。这简化了各种下游任务的网络结构修改。

>> 11个NLP评估任务效果显著:可以很好地应用于问答、自然语言推断等各种NLP任务,且无需重大的任务定制,显示出强大的泛化能力。在11个NLP评估任务上都取得了最先进(SOTA)结果,特别是在GLUE得分和SQuAD v1.1/v2.0上的显著提升。

>> 下游任务成本低效率高:预训练模型直接应用于不同下游任务,训练成本低,效率高。

目录

Paper:《BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding》翻译与解读

Abstract

1、Introduction

语言模型预训练的有效性

两种将预训练语言表示应用于下游任务的策略:基于特征的方法(比如ELMo)、微调(比如Transformer)

问题与局限性:微调方法中的单向性

BERT模型的提出:采用MLM任务,而实现了深度双向Transformer的预训练

本论文的贡献:BERT证明双向预训练对语言表示的重要性、提高了11个NLP任务的技术水平

2、Related Work

2.1、基于无监督特征的方法Unsupervised Feature-based Approaches

单词表示:包括非神经网络方法和神经网络方法,比如预训练词嵌入,更粗粒度的层次,如句子嵌入或段落嵌入

ELMo(基于特征的)→LSTM

2.2、无监督微调方法Unsupervised Fine-tuning Approaches

基于特征的方法→句子或文档编码器的发展(比如GPT)→预训练目标

图1:BERT的总体预训练和微调过程。

2.3 从监督数据迁移学习Transfer Learning from Supervised Data

从具有大型数据集的监督任务中有效转移的实战案例:计算机视觉中的迁移学习(如基于ImageNet预训练的模型进行微调),自然语言推理、机器翻译

3 BERT

BERT的两大步骤(预训练、微调)→BERT在不同任务中使用统一的架构

BERT的模型架构:基于原始Transformer编码器的多层双向结构

BERT的输入/输出表示:采用WordPiece嵌入,30,000个标记词汇表

3.1 Pre-training BERT

BERT使用两个无监督任务进行预训练

任务 #1:Masked LM(MLM)

任务 #2:下一个句子预测(NSP)

图2:BERT输入表示。输入嵌入是token嵌入、分段嵌入和位置嵌入的总和。

预训练数据

3.2 Fine-tuning BERT

BERT能够直接应用于许多下游任务,实现了两个句子之间的双向交叉注意力

BERT任务细节;与预训练相比,微调相对廉价

4 Experiments:展示了BERT在各项任务上的卓越性能

4.1 GLUE

4.2 SQuAD v1.1

4.3 SQuAD v2.0

4.4 SWAG

5 Ablation Studies消融研究

5.1 Effect of Pre-training Tasks  预训练任务的影响,去除NSP任务性能有显著影响

5.2 Effect of Model Size模型规模的影响,将模型扩展到极大规模还会在非常小规模任务上产生显著的改进

5.3 Feature-based Approach with BERT基于特征的方法,结果表明BERT在微调和基于特征的方法上都表现出色

6 Conclusion

References


Paper:《BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding》翻译与解读

论文地址:https://arxiv.org/abs/1810.04805

时间

2018年10月11日

最新版本,2019年5月24日

作者

Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova

Google AI Language

总结

该论文提出了BERT模型,主要解决了以下问题:

背景痛点

>> 现有语言表示学习模型大多使用单向模型,无法同时利用左右上下文信息,限制了预训练表示的能力。标准语言模型只能单向训练,不利于学习深层双向语言表示。任务设计上也有局限性,如OpenAI GPT仅考虑前文信息。

>> 特征式方法(如ELMo)学习双向语言表示,但仅为浅层表示

>> 目前语言模型预训练主要采用无监督任务;fine-tuning方法难以支持需要任务专用网络结构的任务。

具体解决方案

>> 提出了BERT模型,采用双向Transformer编码器架构,可以同时利用左右上下文信息。预训练深层双向Transformer编码器模型BERT。

>> 在预训练过程中,利用masked language model目标函数,通过随机mask部分词符来推测被遮盖的词,强制模型学习整个序列的双向依赖关系。利用“遮蔽语言模型”(MLM)预训练目标,将部分词替换为mask并预测被遮蔽词,使模型学习双向上下文信息。

>> 提出下一句预测(NSP)目标函数,联合预训练文本对表示。加入“下一句预测”任务,联合训练句子对表示。

>> 直接微调预训练模型进行下游任务训练。fine-tuning过程中,仅增加一个简单的输出层,最小化新参数,所有参数都进行端到端调优。

核心特点

>> BERT采用深度双向架构,可以同时学习左右上下文。

>> Masked LM目标函数强制模型学习整个序列的双向依赖关系。

>> 下一句预测(NSP)任务有助于文本对相关任务。

>> Fine-tuning方法可以兼容各种任务而无需特定网络修改。

主要优势

>> 超越OpenAI GPT,显著提升GLUE评分。在11个自然语言处理任务上取得最先进(SOTA)结果,GLUE得分提升7.7个百分点。在SQuAD、SWAG等问答与常识推理任务上也达到最先进水平。

>> 可很好应用于各种NLP任务,如问答、自然语言推断等,而无需重大任务定制。

>> 参数量大,表达能力强,对小样本数据也有很好表现。

>> 预训练模型可以直接用于不同任务,训练成本低,效率高。

>> 是首个在大量句子级和词符级任务上都达到最先进水平的基于微调的表示模型。

>> 微调过程稳定、效果好,训练高效。

>> 代码和预训练模型公开提供,方便应用。

本文通过深层双向Transformer结构与MLM、NSP预训练目标,实现了深层双向语言表示学习,直接微调预训练模型即可在多任务上达到SOTA水平,提升了NLP基准任务的state of the art,弥补了传统单向语言模型的不足。

Abstract

  • 模型介绍:介绍了一种新的语言表示模型,命名为BERT(Bidirectional Encoder Representations from Transformers)。

  • 设计特点:与之前的语言表示模型不同,BERT的设计目的是从未标记的文本中预训练深度双向表示,通过在所有层中联合对左右上下文进行条件建模。

  • 微调过程:BERT的预训练模型可以通过添加一个额外的输出层进行微调,以创建适用于各种任务(如问答和语言推理)的最先进模型,而无需进行重大的任务特定架构修改

  • 概念简单而强大:BERT在概念上简单,但在实证效果上表现强大。在十一个自然语言处理任务上取得了新的最先进结果

We introduce a new language representa- tion model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language repre- sentation models (Peters et al., 2018a; Rad- ford et al., 2018), BERT is designed to pre- train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a re- sult, the pre-trained BERT model can be fine- tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task- specific architecture modifications.

我们引入了一种新的语言表示模型BERT,它代表来自Transformers的双向编码器表示。与最近的语言表示模型不同(Peters等人,2018a;Radford等人,2018),BERT旨在通过在所有层中联合地以左右上下文为条件,从无标签文本中预训练深度双向表示。因此,预训练的BERT模型只需增加一个额外的输出层进行微调,就能为问答、语言推断等广泛任务创建最先进的模型,而无需对任务特定的体系结构进行大量修改。

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art re- sults on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement),MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answer- ing Test F1 to 93.2 (1.5 point absolute im- provement) and SQuAD v2.0 Test F1 to 83.1(5.1 point absolute improvement).

BERT在概念上很简单,但在实证上很强大。它在11个自然语言处理任务上取得了新的最先进结果,包括将GLUE分数提高到80.5%(7.7个百分点的绝对提升),MultiNLI准确率提高到86.7%(4.6%的绝对提升),SQuAD v1.1问答测试F1提高到93.2(1.5分的绝对提升),以及SQuAD v2.0测试F1提高到83.1(5.1分的绝对提升)。

1、Introduction

语言模型预训练的有效性

论文指出语言模型的预训练对提升多个自然语言处理任务有效,引用了相关研究的例证,包括自然语言推理、释义、命名实体识别和问答等任务。

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the re- lationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).

语言模型预训练已被证明对改善许多自然语言处理任务是有效的(Dai和Le, 2015;Peters等人,2018a;Radford等人,2018;Howard和Ruder, 2018)。这包括句子级任务,如自然语言推理(Bowman等人,2015;Williams等人,2018)和释义(Dolan和Brockett, 2005),其目的是通过整体分析来预测句子之间的关系;也包括token级任务,如命名实体识别和问答,这类任务要求模型在token级别产生细粒度的输出(Tjong Kim Sang和De Meulder, 2003;Rajpurkar等人,2016)。

两种将预训练语言表示应用于下游任务的策略:基于特征的方法(比如ELMo)、微调(比如Transformer)

文章提到两种将预训练语言表示应用于下游任务的策略,分别为基于特征的方法和微调。其中,基于特征的方法如ELMo使用任务特定的架构,将预训练表示作为额外特征,而微调方法(如OpenAI GPT)则引入极少的任务特定参数,在下游任务上通过微调所有预训练参数来训练。

There are two existing strategies for apply- ing pre-trained language representations to down- stream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as addi- tional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre- trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

将预先训练好的语言表示应用到下游任务中有两种现有的策略:基于特征的和微调。基于特征的方法,如ELMo (Peters等人,2018a),使用特定于任务的体系结构,包括预先训练的表示作为附加特征。

微调方法,如生成式预训练Transformer(OpenAI GPT) (Radford等人,2018),引入了最小的任务特定参数,并通过简单地微调所有预训练参数在下游任务上进行训练。这两种方法在预训练过程中共享相同的目标函数,它们使用单向语言模型学习一般的语言表示。

问题与局限性:微调方法中的单向性

论文认为目前的技术限制了预训练表示的威力,尤其是对于微调方法。主要限制在于标准语言模型是单向的,这在预训练过程中限制了可用的架构选择,对于涉及双向信息的任务,如问答,这种限制可能是有害的。

We argue that current techniques restrict the power of the pre-trained representations, espe- cially for the fine-tuning approaches. The ma- jor limitation is that standard language models are unidirectional, and this limits the choice of archi- tectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to- right architecture, where every token can only at- tend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such re- strictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine- tuning based approaches to token-level tasks such as question answering, where it is crucial to incor- porate context from both directions.

我们认为,目前的技术限制了预训练表示的能力,特别是对于微调方法。主要的限制是标准语言模型是单向的,这限制了预训练期间可用的架构选择。例如,在OpenAI GPT中,作者使用了从左到右的体系结构,其中每个token在Transformer的自注意力层中只能关注其之前的token(Vaswani等人,2017)。这样的限制对于句子级任务来说是次优的,而在将基于微调的方法应用于问答等token级任务时可能非常有害,因为在这类任务中,融合来自两个方向的上下文至关重要。

BERT模型的提出:采用MLM任务,而实现了深度双向Transformer的预训练

通过提出BERT(Bidirectional Encoder Representations from Transformers),作者改进了基于微调的方法。BERT通过使用“masked language model”(MLM)预训练目标,消除了之前提到的单向性约束。MLM在输入中随机掩盖一些标记,目标是仅基于上下文预测被掩盖词的原始词汇ID。与左到右语言模型预训练不同,MLM目标使表示能够融合左右上下文,从而实现了深度双向Transformer的预训练

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model” (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a “next sentence prediction” task that jointly pre-trains text-pair representations.

在本文中,我们通过提出BERT(来自Transformers的双向编码器表示)来改进基于微调的方法。BERT受完形填空任务(Taylor, 1953)启发,使用“掩码语言模型”(MLM)预训练目标,缓解了前面提到的单向性约束。掩码语言模型从输入中随机掩盖一些token,其目标是仅基于上下文预测被掩盖单词的原始词汇表id。与从左到右的语言模型预训练不同,MLM目标使表示能够融合左右上下文,这使我们可以预训练一个深度双向Transformer。除了掩码语言模型外,我们还使用“下一句预测”任务联合预训练文本对表示。

本论文的贡献:BERT证明双向预训练对语言表示的重要性、提高了11个NLP任务的技术水平

论文总结了其贡献,包括证明双向预训练对语言表示的重要性、展示预训练表示减少对许多复杂的任务特定架构的需求、以及在11个自然语言处理任务上推动了技术水平的提高。提供了代码和预训练模型的链接。

The contributions of our paper are as follows:

(1)、We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

(2)、We show that pre-trained representations reduce the need for many heavily-engineered task- specific architectures. BERT is the first fine- tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-specific architectures.

(3)、BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/google-research/bert.

本文的贡献如下:

(1),我们证明了双向预训练对语言表示的重要性。与Radford等人(2018)使用单向语言模型进行预训练不同,BERT使用掩码语言模型来实现预训练的深度双向表示。这也与Peters等人(2018a)形成了对比,后者使用独立训练的从左到右和从右到左LM的浅层拼接。

(2),我们表明,预训练的表示减少了对许多重度工程化的任务专用架构的需求。BERT是第一个在大量句子级和token级任务上取得最先进性能的基于微调的表示模型,超越了许多任务专用的体系结构。

(3),BERT提高了11个NLP任务的技术水平。代码和预训练模型可在 https://github.com/google-research/bert 获得。

2、Related Work

There is a long history of pre-training general lan- guage representations, and we briefly review the most widely-used approaches in this section.

预训练通用语言表示有很长的历史,我们简要回顾了本节中最广泛使用的方法。

2.1、基于无监督特征的方法Unsupervised Feature-based Approaches

单词表示:包括非神经网络方法和神经网络方法,比如预训练词嵌入;以及更粗粒度的层次,如句子嵌入或段落嵌入

学习广泛适用的单词表示是几十年来研究的一个活跃领域,包括非神经网络方法和神经网络方法。预训练的词嵌入是现代自然语言处理系统的重要组成部分,相较于从头开始学习的嵌入,预训练词嵌入带来了显著的改进。

预训练词向量: 为了预训练词嵌入向量,采用了左到右的语言建模目标以及在左右上下文中区分正确和不正确单词的目标。这些方法已经推广到更粗粒度的层次,如句子嵌入或段落嵌入。

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013).

These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016).

几十年来,学习广泛适用的单词表示一直是一个活跃的研究领域,包括非神经网络方法(Brown等人,1992;Ando和Zhang, 2005;Blitzer等人,2006)和神经网络方法(Mikolov等人,2013;Pennington等人,2014)。预训练词嵌入是现代NLP系统的一个组成部分,与从头开始学习的嵌入相比有显著改进(Turian等人,2010)。为了预训练词嵌入向量,人们使用了从左到右的语言建模目标(Mnih和Hinton, 2009),以及在左右上下文中区分正确和不正确单词的目标(Mikolov等人,2013)。这些方法已经被推广到更粗的粒度,例如句子嵌入(Kiros等人,2015;Logeswaran和Lee, 2018)或段落嵌入(Le和Mikolov, 2014)。为了训练句子表示,之前的工作使用的目标包括:对候选的下一个句子进行排序(Jernite等人,2017;Logeswaran和Lee, 2018),在给定前一个句子表示的情况下从左到右生成下一个句子的单词(Kiros等人,2015),或由去噪自编码器派生的目标(Hill等人,2016)。
 

ELMo(基于特征的)→LSTM

ELMo及其前身在不同维度上推广了传统的词嵌入研究。它们从左到右和从右到左的语言模型中提取上下文敏感特征,每个标记的上下文表示是左到右和右到左表示的连接。将上下文词嵌入与现有任务特定架构集成时,ELMo在几个重要的自然语言处理基准测试中推动了技术水平的提高。

  • ELMo特点: ELMo提取上下文敏感特征,其模型是基于特征的,而不是深度双向的。它在多个任务上取得了良好的性能,包括问答、情感分析和命名实体识别。

  • 其他相关工作: Melamud等人提出通过LSTM从左右上下文预测单个词来学习上下文表示。Fedus等人展示了完形填空任务可以用于提高文本生成模型的鲁棒性。

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els.

ELMo及其前身(Peters et al., 2017,2018a)将传统的词嵌入研究在不同维度上进行了推广。他们从从左到右和从右到左的语言模型中提取上下文敏感的特征。每个标记的上下文表示是从左到右和从右到左表示的串联。当将上下文词嵌入与现有的任务特定架构集成时,ELMo在几个主要的NLP基准(Peters等人,2018a)中推进了最先进的技术,包括问答(Rajpurkar等人,2016)、情感分析(Socher等人,2013)和命名实体识别(Tjong Kim Sang和De Meulder, 2003)。Melamud等人(2016)提出通过使用lstm从左右上下文中预测单个单词的任务来学习上下文表示。与ELMo类似,他们的模型是基于特征的,而不是深度双向的。Fedus等人(2018)表明,完形填空任务可以用来提高文本生成模型的鲁棒性。

2.2、无监督微调方法Unsupervised Fine-tuning Approaches

基于特征的方法→句子或文档编码器的发展(比如GPT)→预训练目标

  • 特征为基础的方法: 早期探索这个方向的方法主要是从未标记的文本中预训练单词嵌入参数的基于特征的方法,例如Collobert和Weston(2008)的工作。

  • 句子或文档编码器的发展: 近年来,从未标记的文本中预训练并在监督下游任务中进行微调的句子或文档编码器逐渐兴起,这些编码器生成上下文令牌表示。这种方法的优势在于几乎没有需要从头学习的参数。其中,OpenAI GPT(Radford et al., 2018)在GLUE基准测试的许多句子级任务上取得了之前最先进的结果。

  • 预训练目标: 针对这些模型的预训练,使用了左到右语言建模和自编码器目标,例如Howard和Ruder(2018)、Radford等(2018)、Dai和Le(2015)的工作。这些目标有助于学习上下文表示。

As with the feature-based approaches, the first works in this direction only pre-trained word em- bedding parameters from unlabeled text (Col- lobert and Weston, 2008).

More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre-viously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015).

与基于特征的方法一样,这个方向上最早的工作只从未标记文本中预训练词嵌入参数(Collobert和Weston, 2008)。

最近,生成上下文token表示的句子或文档编码器已经从未标记的文本中进行预训练,并针对有监督的下游任务进行微调(Dai和Le, 2015;Howard和Ruder, 2018;Radford等人,2018)。这些方法的优点是需要从头学习的参数很少。至少部分由于这一优势,OpenAI GPT(Radford等人,2018)在GLUE基准(Wang等人,2018a)的许多句子级任务上取得了之前最先进的结果。从左到右的语言建模和自编码器目标已被用于预训练此类模型(Howard和Ruder, 2018;Radford等人,2018;Dai和Le, 2015)。

图1:BERT的总体预训练和微调过程。

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

图1:BERT的总体预训练和微调过程。除了输出层,预训练和微调使用相同的体系结构。相同的预训练模型参数被用来初始化不同下游任务的模型。在微调过程中,对所有参数进行微调。[CLS]是添加在每个输入示例前面的特殊符号,[SEP]是一个特殊的分隔符(例如,分隔问题/答案)。

2.3 从监督数据迁移学习Transfer Learning from Supervised Data

从具有大型数据集的监督任务中有效转移的实战案例:计算机视觉中的迁移学习(如基于ImageNet预训练的模型进行微调),自然语言推理、机器翻译

  • 有效的迁移学习示例: 有研究表明,从具有大型数据集的监督任务中进行迁移学习是有效的,例如自然语言推理(Conneau et al., 2017)和机器翻译(McCann et al., 2017)。
  • 计算机视觉中的迁移学习: 计算机视觉研究同样展示了从大型预训练模型进行迁移学习的重要性。一个有效的方法是对使用ImageNet进行预训练的模型进行微调,这在计算机视觉领域取得了成功(Deng et al., 2009; Yosinski et al., 2014)。

There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet (Deng et al., 2009; Yosinski et al., 2014).

也有研究表明,从具有大型数据集的监督任务中进行迁移是有效的,如自然语言推理(Conneau等人,2017)和机器翻译(McCann等人,2017)。计算机视觉研究也证明了从大型预训练模型进行迁移学习的重要性,其中一个有效的做法是微调用ImageNet预训练的模型(Deng等人,2009;Yosinski等人,2014)。

3 BERT

BERT的两大步骤(预训练、微调)→BERT在不同任务中使用统一的架构

  • 两大步骤
    在预训练阶段,模型在未标记的数据上进行训练,执行不同的预训练任务。
    微调阶段中,BERT模型使用预训练参数进行初始化,然后通过来自下游任务的标记数据对所有参数进行微调。不同的下游任务有不同的微调模型,尽管它们使用相同的预训练参数。BERT在本文中的运行示例是图1中的问答示例。
  • BERT的统一架构: BERT的一个显著特点是其在不同任务中使用统一的架构。预训练架构与最终的下游任务架构之间几乎没有差异。

We introduce BERT and its detailed implementa-tion in this section. There are two steps in our framework: pre-training and fine-tuning. Dur-ing pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the param-eters are fine-tuned using labeled data from the downstream tasks. Each downstream task has sep-arate fine-tuned models, even though they are ini-tialized with the same pre-trained parameters. The question-answering example in Figure 1 will serve as a running example for this section.

A distinctive feature of BERT is its unified ar-chitecture across different tasks. There is minimal difference between the pre-trained architec-ture and the final downstream architecture.

在本节中,我们将介绍BERT及其详细实现。在我们的框架中有两个步骤:预训练和微调。在预训练期间,模型在不同的预训练任务上对未标记数据进行训练。对于微调,首先使用预训练的参数初始化BERT模型,然后使用来自下游任务的标记数据对所有参数进行微调。每个下游任务都有单独的微调模型,即使它们是用相同的预训练参数初始化的。图1中的问答示例将作为本节的运行示例。BERT的一个显著特征是其跨不同任务的统一架构。预训练的体系结构和最终的下游体系结构之间的差别很小。
 

BERT的模型架构:基于原始Transformer编码器的多层双向结构

  • 模型架构: BERT的模型架构是基于原始Transformer编码器的多层双向结构,与Vaswani等人(2017)的实现相似。主要报告两个模型尺寸的结果:
    >> BERTBASE(L=12,H=768,A=12,总参数=110M)和
    >> BERTLARGE(L=24,H=1024,A=16,总参数=340M)。BERTBASE选用与OpenAI GPT相同的模型尺寸进行比较,
    >> 但需要注意,BERT Transformer使用双向自注意力,而GPT Transformer使用受限制的自注意力。

Model Architecture BERT’s model architec-ture is a multi-layer bidirectional Transformer en-coder based on the original implementation de-scribed in Vaswani et al. (2017) and released in the tensor2tensor library.1 Because the use of Transformers has become common and our im-plementation is almost identical to the original, we will omit an exhaustive background descrip-tion of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”2

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A.3 We primarily report results on two model sizes: BERTBASE (L=12, H=768, A=12, Total Param-eters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).

BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Trans-former uses constrained self-attention where every token can only attend to context to its left.4

BERT的模型架构是一个多层双向Transformer编码器,基于Vaswani等人(2017)描述的原始实现,并在tensor2tensor库中发布。由于Transformer的使用已经很普遍,而且我们的实现与原始实现几乎相同,我们将省略对模型架构的详尽背景描述,请读者参考Vaswani等人(2017)以及诸如“The Annotated Transformer”这样的优秀指南。在这项工作中,我们将层数(即Transformer块数)记为L,隐藏层维度记为H,自注意力头数记为A。我们主要报告两种模型尺寸的结果:BERTBASE(L=12, H=768, A=12, 总参数=110M)和BERTLARGE(L=24, H=1024, A=16, 总参数=340M)。为了便于比较,BERTBASE选用了与OpenAI GPT相同的模型大小。但关键的是,BERT Transformer使用双向自注意力,而GPT Transformer使用受限的自注意力,其中每个token只能关注其左侧的上下文。
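为便于理解上面的 L/H/A 记号,下面用一小段 Python 配置示意 BERTBASE 与 BERTLARGE 的超参数(仅为示意性草图,非官方实现;前馈层维度取 4H 是沿用 Transformer 论文惯例的假设):

```python
from dataclasses import dataclass

@dataclass
class BertConfig:
    num_layers: int            # L:Transformer 块数
    hidden_size: int           # H:隐藏层维度
    num_attention_heads: int   # A:自注意力头数

    @property
    def feed_forward_size(self) -> int:
        # 假设前馈中间层维度取 4H(Transformer 论文的常见设置)
        return 4 * self.hidden_size

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_attention_heads=12)    # 约 110M 参数
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_attention_heads=16)  # 约 340M 参数

if __name__ == "__main__":
    for name, cfg in [("BERT-Base", BERT_BASE), ("BERT-Large", BERT_LARGE)]:
        print(name, cfg, "FFN size:", cfg.feed_forward_size)
```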

BERT的输入/输出表示:采用WordPiece嵌入,30,000个标记词汇表

  • 输入/输出表示: 为了让BERT处理各种下游任务,输入表示能够明确地表示单个句子或一对句子。使用WordPiece嵌入,其中包括一个30,000个标记的词汇表。每个序列的第一个标记始终是特殊的分类标记([CLS]),对应于此标记的最终隐藏状态用作分类任务的聚合序列表示。句子对通过一个特殊标记([SEP])分开,并通过一个学习的嵌入指示每个标记属于句子A还是句子B。
  • 输入表示构建: 对于给定的标记,其输入表示是通过将相应的标记、段和位置嵌入相加构建的。输入表示构建的可视化可在图2中看到。

Input/Output Representations To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g.,  Question, Answer ) in one token sequence. Throughout this work, a “sentence” can be an arbi-trary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the in-put token sequence to BERT, which may be a sin-gle sentence or two sentences packed together.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special clas-sification token ([CLS]). The final hidden state corresponding to this token is used as the ag-gregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embed-ding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as E, the final hidden vector of the special [CLS] token as C ∈ RH , and the final hidden vector for the ith input token as Ti ∈ RH .

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualiza-tion of this construction can be seen in Figure 2.

输入/输出表示:为了使BERT能处理各种下游任务,我们的输入表示能够在一个token序列中明确地表示单个句子和一对句子(例如,问题-答案对)。在这项工作中,一个“句子”可以是一段任意的连续文本,而不一定是实际的语言学意义上的句子。“序列”指的是输入给BERT的token序列,它可以是单个句子,也可以是打包在一起的两个句子。

我们使用WordPiece嵌入(Wu等人,2016),词汇表包含30,000个token。每个序列的第一个token总是一个特殊的分类token([CLS]),与该token对应的最终隐藏状态被用作分类任务的聚合序列表示。句子对被打包成一个单一的序列。我们用两种方法区分句子:首先,用一个特殊token([SEP])将它们分开;其次,为每个token添加一个可学习的嵌入,指明它属于句子A还是句子B。如图1所示,我们将输入嵌入表示为E,特殊[CLS] token的最终隐藏向量表示为C∈R^H,第i个输入token的最终隐藏向量表示为Ti∈R^H。

对于给定标记,其输入表示是通过将相应的标记、段和位置嵌入相加来构建的。图2显示了这种结构的可视化。
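下面用 PyTorch 给出输入表示构建方式的极简示意(假设:词表大小 30,000、最大序列长度 512、隐藏维度 768,均按论文设置;BertEmbeddings 为示意性草图,并非官方实现):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """输入表示 = token 嵌入 + 段(segment)嵌入 + 位置嵌入(对应图2)。"""
    def __init__(self, vocab_size=30000, hidden_size=768, max_len=512, type_vocab_size=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.segment_emb = nn.Embedding(type_vocab_size, hidden_size)  # 句子A=0,句子B=1
        self.position_emb = nn.Embedding(max_len, hidden_size)         # BERT 使用可学习的位置嵌入

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))

# 用法示例:batch=1,序列 "[CLS] 问题 [SEP] 段落 [SEP]"(此处 token id 仅为随意示例)
emb = BertEmbeddings()
token_ids = torch.tensor([[101, 2054, 102, 3449, 102]])
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])
print(emb(token_ids, segment_ids).shape)  # torch.Size([1, 5, 768])
```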

3.1 Pre-training BERT

BERT使用两个无监督任务进行预训练

与Peters等人(2018a)和Radford等人(2018)不同,BERT的预训练不使用传统的从左到右或从右到左的语言模型。相反,BERT使用两个无监督任务进行预训练。

Unlike Peters et al. (2018a) and Radford et al.(2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsuper-vised tasks, described in this section. This step is presented in the left part of Figure 1.

与Peters等人(2018a)和Radford等人(2018)不同,我们不使用传统的从左到右或从右到左的语言模型来预训练BERT。相反,我们使用两个非监督任务预先训练BERT,在本节中进行描述。这个步骤显示在图1的左侧。

任务 #1:Masked LM(MLM)

  • 通过屏蔽一定比例的输入标记并预测这些屏蔽的标记,实现深度双向模型的预训练。这个任务称为“Masked LM”(MLM),也被文献中常称为Cloze任务。在实验中,以每个序列中的所有WordPiece标记的15%随机屏蔽。

Task #1: Masked LM Intuitively, it is reason-able to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirec-tional conditioning would allow each word to in-directly “see itself”, and the model could trivially predict the target word in a multi-layered context.

In order to train a deep bidirectional representa-tion, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece to-kens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than recon-structing the entire input.

Although this allows us to obtain a bidirec-tional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not ap-pear during fine-tuning. To mitigate this, we do not always replace “masked” words with the ac-tual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, Ti will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.

直观地说,我们有理由相信深度双向模型严格地比从左到右的模型或从左到右和从右到左的模型的浅连接更强大。不幸的是,标准条件语言模型只能被从左到右或从右到左训练,因为双向条件作用将允许每个单词间接地“看到自己”,模型可以在多层上下文中简单地预测目标单词。

为了训练深度双向表示,我们只需随机屏蔽一定比例的输入token,然后预测这些被屏蔽的token。我们把这个过程称为“掩码LM”(MLM),尽管在文献中它常被称为完形填空任务(Taylor, 1953)。在这种情况下,与掩码token对应的最终隐藏向量被送入词汇表上的输出softmax,就像在标准LM中一样。在我们所有的实验中,我们在每个序列中随机屏蔽15%的WordPiece token。与去噪自编码器(Vincent等人,2008)不同,我们只预测被掩盖的单词,而不是重建整个输入。

尽管这使我们能够获得一个双向的预训练模型,但缺点是我们在预训练和微调之间造成了不匹配,因为[MASK] token在微调期间不会出现。为了缓解这个问题,我们并不总是用实际的[MASK] token替换被“掩盖”的单词。训练数据生成器随机选择15%的token位置进行预测。如果第i个token被选中,我们(1)以80%的概率将其替换为[MASK] token;(2)以10%的概率替换为随机token;(3)以10%的概率保持第i个token不变。然后用Ti以交叉熵损失预测原始token。我们在附录C.2中比较了这个过程的变体。
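下面是对上述“15% 掩码 + 80%/10%/10% 替换”策略的一个极简 Python 示意(假设输入已经是 WordPiece token 列表;create_mlm_example 为示意性的假设函数,并非论文官方的数据生成器):

```python
import random

MASK_TOKEN = "[MASK]"

def create_mlm_example(tokens, vocab, mask_prob=0.15, rng=random):
    """随机选取约 15% 的位置作为预测目标,并按 80/10/10 规则替换。
    返回 (处理后的 tokens, {位置: 原 token})。"""
    tokens = list(tokens)
    labels = {}
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    num_to_mask = max(1, int(round(len(candidates) * mask_prob)))
    for i in rng.sample(candidates, num_to_mask):
        labels[i] = tokens[i]              # 训练目标:用交叉熵预测被选中位置的原始 token
        p = rng.random()
        if p < 0.8:
            tokens[i] = MASK_TOKEN         # 80%:替换为 [MASK]
        elif p < 0.9:
            tokens[i] = rng.choice(vocab)  # 10%:替换为随机 token
        # 剩余 10%:保持原 token 不变
    return tokens, labels

vocab = ["my", "dog", "is", "hairy", "cute", "the"]
print(create_mlm_example(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"], vocab))
```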

任务 #2:下一个句子预测(NSP)

  • 为了训练理解句子关系的模型,BERT进行了二元化的下一个句子预测任务。对于每个预训练示例,选择句子A和B,50%的时间B是实际跟随A的下一个句子(标记为IsNext),另外50%的时间是来自语料库的随机句子(标记为NotNext)。

Task #2: Next Sentence Prediction (NSP) Many important downstream tasks such as Ques-tion Answering (QA) and Natural Language Infer-ence (NLI) are based on understanding the rela-tionship between two sentences, which is not di-rectly captured by language modeling. In order to train a model that understands sentence rela-tionships, we pre-train for a binarized next sen-tence prediction task that can be trivially gener-ated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, C is used for next sentence predic-tion (NSP).5 Despite its simplicity, we demon-strate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI. 6

The NSP task is closely related to representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to down-stream tasks, where BERT transfers all pa-rameters to initialize end-task model parameters.

任务#2:下一句预测(NSP)。许多重要的下游任务,如问答(QA)和自然语言推理(NLI),都建立在理解两个句子之间关系的基础上,而这是语言建模无法直接捕获的。为了训练一个理解句子关系的模型,我们对一个可以从任何单语语料库轻易生成的二元化下一句预测任务进行预训练。具体来说,在为每个预训练示例选择句子A和B时,50%的情况下B是A之后真实的下一个句子(标记为IsNext),50%的情况下是语料库中的随机句子(标记为NotNext)。如图1所示,C用于下一句预测(NSP)。尽管它很简单,我们在第5.1节中证明,针对该任务的预训练对QA和NLI都非常有益。

NSP任务与Jernite等人(2017)以及Logeswaran和Lee(2018)使用的表示学习目标密切相关。然而,在之前的工作中,只有句子嵌入被迁移到下游任务,而BERT会迁移所有参数来初始化最终任务模型的参数。
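下面用一小段 Python 示意 NSP 训练样本的构造方式(假设语料已按文档切分为句子列表;create_nsp_example 为示意性的假设函数,未覆盖官方数据生成器的全部细节):

```python
import random

def create_nsp_example(doc, sent_idx, all_docs, rng=random):
    """为文档 doc 中第 sent_idx 句构造一个 (句子A, 句子B, 标签) 样本:
    50% 取真实的下一句(IsNext),50% 从语料中随机取一句(NotNext)。
    示意:未排除随机句恰好来自同一文档的情形。"""
    sent_a = doc[sent_idx]
    if sent_idx + 1 < len(doc) and rng.random() < 0.5:
        return sent_a, doc[sent_idx + 1], "IsNext"
    random_doc = rng.choice(all_docs)
    return sent_a, rng.choice(random_doc), "NotNext"

docs = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
print(create_nsp_example(docs[0], 0, docs))
```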

图2:BERT输入表示。输入嵌入是token嵌入、分段嵌入和位置嵌入的总和。

  • BERT使用WordPiece嵌入,并通过特殊标记([CLS]和[SEP])和学习的嵌入表示两个句子的输入。通过对应的标记、段和位置嵌入相加来构建每个标记的输入表示。

Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.

图2:BERT输入表示。输入嵌入是token嵌入、分段嵌入和位置嵌入的总和。

预训练数据

  • 使用BooksCorpus(800M words)和English Wikipedia(2,500M words)进行预训练。选择使用文档级语料库而不是洗牌的句子级语料库,以提取长的连续序列。

Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is criti-cal to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

预训练过程在很大程度上遵循了现有的语言模型预训练文献。对于预训练语料库,我们使用BooksCorpus(8亿词)(Zhu等人,2015)和英文维基百科(25亿词)。对于维基百科,我们只提取文本段落,而忽略列表、表格和标题。为了提取长的连续序列,使用文档级语料库,而不是像十亿词基准(Chelba等人,2013)那样被打乱的句子级语料库,是至关重要的。

3.2 Fine-tuning BERT

BERT能够直接应用于许多下游任务,实现了两个句子之间的双向交叉注意力

  • 微调过程简介: 由于Transformer中的自注意机制,BERT能够直接应用于许多下游任务,无论是涉及单一文本还是文本对。微调是直截了当的,因为可以通过交换适当的输入和输出来模型化许多下游任务。

  • 文本对应用的自注意机制: 对于涉及文本对的应用,通常的模式是在应用双向交叉注意力之前独立编码文本对。BERT使用自注意机制将这两个阶段统一起来,通过使用自注意力对连接的文本对进行编码,有效地实现了两个句子之间的双向交叉注意力。

Fine-tuning is straightforward since the self-attention mechanism in the Transformer al-lows BERT to model many downstream tasks—whether they involve single text or text pairs—by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs be-fore applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidi-rectional cross attention between two sentences.

微调是直接的,因为Transformer中的自注意力机制允许BERT通过替换适当的输入和输出来建模许多下游任务——无论它们涉及单个文本还是文本对。对于涉及文本对的应用,常见的模式是在应用双向交叉注意力之前独立编码文本对,如Parikh等人(2016)、Seo等人(2017)。而BERT使用自注意力机制统一了这两个阶段,因为用自注意力编码一个拼接起来的文本对,实际上已经包含了两个句子之间的双向交叉注意力。

BERT任务细节;与预训练相比,微调相对廉价

  • 任务细节: 对于每个任务,将任务特定的输入和输出插入BERT,并对所有参数进行端到端的微调。输入方面,来自预训练的句子A和句子B类似于(1)在释义中的句子对,(2)在蕴涵中的假设-前提对,(3)在问答中的问题-段落对,和(4)在文本分类或序列标记中的退化文本-∅对。输出方面,对于标记级任务(如序列标记或问答),标记表示被送入输出层,而对于分类任务(如蕴涵或情感分析),[CLS]表示被送入输出层。

  • 微调相对廉价: 与预训练相比,微调相对廉价。论文中的所有结果可以在至多1小时的单个Cloud TPU上复制,或在GPU上几个小时内从完全相同的预训练模型开始。微调的任务特定细节在第4节的相应子节中有详细描述,附录A.5中还提供了更多细节。

For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the in-put, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphras-ing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and(4) a degenerate text-∅ pair in text classification or sequence tagging. At the output, the token rep-resentations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as en-tailment or sentiment analysis.

Compared to pre-training, fine-tuning is rela-tively inexpensive. All of the results in the pa-per can be replicated in at most 1 hour on a sin-gle Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.7 We de-scribe the task-specific details in the correspond-ing subsections of Section 4. More details can be found in Appendix A.5.

对于每个任务,我们只需将任务特定的输入和输出接入BERT,并对所有参数进行端到端的微调。在输入端,预训练中的句子A和句子B类似于:(1)释义任务中的句子对,(2)蕴涵任务中的假设-前提对,(3)问答中的问题-段落对,(4)文本分类或序列标注中的退化的 文本-∅ 对。在输出端,对于token级任务(如序列标注或问答),token表示被送入输出层;对于分类任务(如蕴涵或情感分析),[CLS]表示被送入输出层。

与预训练相比,微调的代价相对低廉。从完全相同的预训练模型开始,论文中的所有结果都可以在单个Cloud TPU上最多1小时内复现,或者在GPU上几个小时内复现。我们将在第4节的相应小节中描述任务特定的细节,更多细节见附录A.5。
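下面用一个小函数示意如何把不同下游任务的输入统一打包成 [CLS] A [SEP](B [SEP])的形式并生成段编号(示意性草图,假设输入已完成分词):

```python
def pack_inputs(tokens_a, tokens_b=None):
    """将单句或句对打包为 BERT 输入:返回 (tokens, segment_ids)。
    句对任务(释义、蕴涵、问答)提供 tokens_b;单句任务(分类、序列标注)则 tokens_b=None。"""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# 问答:问题为句子A,段落为句子B
print(pack_inputs(["who", "wrote", "hamlet", "?"], ["shakespeare", "wrote", "hamlet"]))
# 单句分类:退化为 文本-∅ 对
print(pack_inputs(["this", "movie", "is", "great"]))
```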

4 Experiments:展示了BERT在各项任务上的卓越性能

In this section, we present BERT fine-tuning re-sults on 11 NLP tasks.

在本节中,我们将展示11个NLP任务上的BERT微调结果。

  • GLUE任务结果

    • BERT在GLUE基准测试中取得了显著的成绩,使用单一模型和单一任务。
    • GLUE测试结果包括各任务的准确度,F1分数和Spearman相关性,其中使用了BERTBASE和BERTLARGE两个模型。
    • BERTBASE和BERTLARGE在所有任务上均明显优于之前的最先进系统,平均准确度提高了4.5%和7.0%。
    • 在GLUE任务中,BERTLARGE相对于BERTBASE在所有任务中表现更好,特别是在训练数据较少的任务中。
  • SQuAD v1.1结果

    • BERT在Stanford Question Answering Dataset (SQuAD v1.1)上进行了微调,优化了回答问题的性能
    • BERT模型在SQuAD v1.1上的表现超过了其他系统,包括领先的集成系统,F1得分较其它系统提高了1.5。
    • 使用了数据增强技术(在TriviaQA上进行微调)进一步提高了性能
  • SQuAD v2.0结果

    • SQuAD 2.0任务对SQuAD 1.1进行了扩展,允许问题在给定段落中没有短的答案。
    • BERT模型在SQuAD 2.0上相对于之前最好的系统提高了+5.1 F1。
  • SWAG结果

    • BERT在Situations With Adversarial Generations(SWAG)数据集上进行了微调,该数据集评估了基于常识的推理。
    • BERTLARGE在SWAG任务中表现出色,相对于作者的基线系统提高了27.1%,相对于OpenAI GPT提高了8.3%。

总体来说,该论文通过在不同自然语言处理任务上进行BERT微调,展示了BERT在各项任务上的卓越性能,并突显了BERTLARGE相对于BERTBASE的优越性,特别是在数据有限的任务中。

4.1 GLUE

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a col-lection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix B.1.

To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector C ∈ R^H corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights W ∈ R^{K×H}, where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(CW^T)).

通用语言理解评估(GLUE)基准(Wang et al., 2018a)是各种自然语言理解任务的集合。GLUE数据集的详细描述见附录B.1。

为了在GLUE上进行微调,我们按第3节所述表示输入序列(单句或句子对),并使用与第一个输入token([CLS])对应的最终隐藏向量C∈R^H作为聚合表示。微调过程中引入的唯一新参数是分类层权重W∈R^{K×H},其中K为标签数。我们用C和W计算标准分类损失,即log(softmax(CW^T))。
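下面用 PyTorch 示意由 [CLS] 向量 C 和新增权重 W 计算标准分类损失的过程(示意性草图;H、K 的取值只是示例,C 在真实场景中应来自 BERT 的前向输出):

```python
import torch
import torch.nn.functional as F

H, K = 768, 3                                # 隐藏维度 H,标签数 K(例如 MNLI 为 3 类)
W = torch.randn(K, H, requires_grad=True)    # 微调时新引入的分类层权重 W ∈ R^{K×H}

def classification_loss(C, labels):
    """C: [batch, H] 的 [CLS] 最终隐藏向量;标准分类损失即 -log softmax(C W^T)[label]。"""
    logits = C @ W.t()                       # [batch, K]
    return F.cross_entropy(logits, labels)

C = torch.randn(2, H)                        # 假设来自 BERT 的两个样本的 [CLS] 表示
labels = torch.tensor([0, 2])
print(classification_loss(C, labels))
```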

Table 1: GLUE Test results, scored by the evaluation server (https://gluebenchmark.com/leaderboard). The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set.8 BERT and OpenAI GPT are single-model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.

表1:GLUE测试结果,由评估服务器(https://gluebenchmark.com/leaderboard)评分。每个任务下面的数字表示训练样例的数量。“平均”列与官方GLUE分数略有不同,因为我们排除了有问题的WNLI集。BERT和OpenAI GPT都是单模型、单任务。QQP和MRPC报告F1分数,STS-B报告Spearman相关系数,其他任务报告准确率。我们排除了将BERT用作其组件之一的条目。

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERTLARGE we found that fine-tuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. With random restarts, we use the same pre-trained checkpoint but per-form different fine-tuning data shuffling and clas-sifier layer initialization.9

Results are presented in Table 1. Both BERTBASE and BERTLARGE outperform all sys-tems on all tasks by a substantial margin, obtaining 4.5% and 7.0% respective average accuracy im-provement over the prior state of the art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture apart from the at-tention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.6%absolute accuracy improvement. On the official GLUE leaderboard10, BERTLARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.

We find that BERTLARGE significantly outper-forms BERTBASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.

我们使用32的batch size,并对所有GLUE任务的数据微调3个epoch。对于每个任务,我们在Dev集上选择最佳的微调学习率(在5e-5、4e-5、3e-5和2e-5中)。此外,对于BERTLARGE,我们发现微调在小数据集上有时不稳定,因此我们进行了若干次随机重启,并在Dev集上选择最佳模型。随机重启时,我们使用相同的预训练检查点,但采用不同的微调数据打乱方式和分类器层初始化。
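上述“在 Dev 集上挑选学习率、必要时多次随机重启”的流程可以示意如下(train_and_eval 为假设的占位函数,这里用随机数代替真实的训练与评估):

```python
import random

def train_and_eval(lr, seed):
    """假设的占位函数:以给定学习率和随机种子微调 3 个 epoch(batch size 32),
    返回 Dev 集得分。此处仅用随机数模拟。"""
    random.seed(hash((lr, seed)) % (2 ** 32))
    return random.random()

best = None
for lr in (5e-5, 4e-5, 3e-5, 2e-5):   # 论文在 Dev 集上比较的候选学习率
    for seed in range(5):              # 对不稳定的小数据集做若干次随机重启
        score = train_and_eval(lr, seed)
        if best is None or score > best[0]:
            best = (score, lr, seed)
print("Dev 最优:", best)
```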

结果如表1所示。BERTBASE和BERTLARGE在所有任务上都大幅优于所有系统,相对之前的最先进水平分别获得4.5%和7.0%的平均准确率提升。注意,除了注意力掩码方式不同之外,BERTBASE和OpenAI GPT在模型体系结构上几乎相同。对于规模最大、报道最广的GLUE任务MNLI,BERT获得了4.6%的绝对准确率提升。在GLUE官方排行榜上,BERTLARGE获得了80.5分,而OpenAI GPT在本文撰写时为72.8分。

我们发现BERTLARGE在所有任务中都明显优于BERTBASE,特别是在训练数据很少的任务中。模型大小的影响在第5.2节中有更深入的探讨。

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowd-sourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. We only introduce a start vector S ∈ R^H and an end vector E ∈ R^H during fine-tuning. The probability of word i being the start of the answer span is computed as a dot product between Ti and S followed by a softmax over all of the words in the paragraph: P_i = e^{S·T_i} / Σ_j e^{S·T_j}. The analogous formula is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·T_i + E·T_j, and the maximum scoring span where j ≥ i is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.

斯坦福问答数据集(SQuAD v1.1)收集了10万个众包的问题/答案对(Rajpurkar等人,2016)。给定一个问题和一篇包含答案的维基百科文章段落,任务是预测段落中答案文本的跨度。

如图1所示,在问答任务中,我们将输入的问题和段落表示为单个打包序列,其中问题使用A嵌入,段落使用B嵌入。我们在微调过程中只引入一个起始向量S∈R^H和一个结束向量E∈R^H。单词i作为答案跨度起点的概率,由Ti和S的点积再对段落中所有单词做softmax得到:P_i = e^{S·T_i} / Σ_j e^{S·T_j}。答案跨度的终点使用类似的公式。从位置i到位置j的候选跨度的得分定义为S·T_i + E·T_j,并以满足j≥i的最大得分跨度作为预测。训练目标是正确的起始位置和结束位置的对数似然之和。我们以5e-5的学习率、32的batch size微调3个epoch。
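下面用 PyTorch 示意 SQuAD v1.1 中起止位置打分与最佳答案跨度(要求 j ≥ i)的选择过程(示意性草图,张量均为随机示例):

```python
import torch

def best_answer_span(T, S, E):
    """T: [seq_len, H] 各 token 的最终隐藏向量;S、E: [H] 起始/结束向量。
    候选跨度 (i, j) 的得分为 S·T_i + E·T_j,返回满足 j >= i 的最高得分跨度。"""
    start_scores = T @ S                      # [seq_len],对应 S·T_i
    end_scores = T @ E                        # [seq_len],对应 E·T_j
    best_score, best_span = float("-inf"), (0, 0)
    for i in range(T.size(0)):
        for j in range(i, T.size(0)):
            score = (start_scores[i] + end_scores[j]).item()
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span, best_score

# 随机示例:8 个 token、隐藏维度 768
T, S, E = torch.randn(8, 768), torch.randn(768), torch.randn(768)
print(best_answer_span(T, S, E))
```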

Table 2 shows top leaderboard entries as well as results from top published systems (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018). The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available, and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and+1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble sys-tem in terms of F1 score. Without TriviaQA fine-tuning data, we only lose 0.1-0.4 F1, still outper-forming all existing systems by a wide margin.12

表2显示了排行榜最前列的条目,以及已发表的最优系统的结果(Seo等人,2017;Clark和Gardner, 2018;Peters等人,2018a;Hu等人,2018)。SQuAD排行榜上排名靠前的结果没有最新的公开系统描述,并且在训练系统时被允许使用任何公开数据。因此,我们在系统中使用了适度的数据增强:先在TriviaQA(Joshi等人,2017)上微调,再在SQuAD上微调。

我们表现最好的系统在集成(ensemble)设置下比排行榜第一的系统高出1.5 F1,作为单一系统则高出1.3 F1。事实上,我们的单个BERT模型在F1分数上优于排名最高的集成系统。在没有TriviaQA微调数据的情况下,我们只损失0.1-0.4 F1,仍然大幅超过所有现有系统。

Table 2: SQuAD 1.1 results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.
Table 3: SQuAD 2.0 results. We exclude entries that use BERT as one of their components.
Table 4: SWAG Dev and Test accuracies. †Human performance is measured with 100 samples, as reported in the SWAG paper.

表2:SQuAD 1.1结果。BERT集成(ensemble)由7个系统组成,它们使用不同的预训练检查点和微调随机种子。


表3:SQuAD 2.0结果。我们排除了使用BERT作为组件之一的条目。


表4:SWAG的Dev和Test准确率。†人类表现是用100个样本测得的,数据来自SWAG论文。

4.3 SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided para-graph, making the problem more realistic.

We use a simple approach to extend the SQuAD v1.1 BERT model for this task. We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span, s_null = S·C + E·C, to the score of the best non-null span, ŝ_{i,j} = max_{j≥i} (S·T_i + E·T_j). We predict a non-null answer when ŝ_{i,j} > s_null + τ, where the threshold τ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.

SQuAD 2.0任务扩展了SQuAD 1.1问题定义,它允许所提供的段落中不存在简短答案的可能性,从而使问题更加现实。

我们使用一种简单的方法为这项任务扩展SQuAD v1.1的BERT模型。我们将没有答案的问题视为其答案跨度的起点和终点都在[CLS] token上。起始和结束答案位置的概率空间被扩展为包含[CLS] token的位置。在预测时,我们将无答案跨度的得分 s_null = S·C + E·C 与最佳非空跨度的得分 ŝ_{i,j} = max_{j≥i}(S·T_i + E·T_j) 进行比较。当 ŝ_{i,j} > s_null + τ 时,我们预测一个非空答案,其中阈值τ在开发集上选取以使F1最大化。我们没有在这个模型中使用TriviaQA数据。我们以5e-5的学习率、48的batch size微调了2个epoch。
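在上面跨度打分的基础上,SQuAD v2.0 的“无答案”判断可以示意如下(沿用上一小节的记号;τ 需在开发集上选取,这里的 predict_answer 只是示意性草图):

```python
import torch

def predict_answer(T, S, E, C, tau):
    """C: [CLS] 的最终隐藏向量。无答案得分 s_null = S·C + E·C;
    若最佳非空跨度得分 s_hat > s_null + tau,则输出该跨度,否则输出"无答案"。"""
    s_null = torch.dot(S, C) + torch.dot(E, C)
    start_scores, end_scores = T @ S, T @ E
    s_hat, span = float("-inf"), None
    for i in range(T.size(0)):
        for j in range(i, T.size(0)):          # 仅考虑 j >= i 的非空跨度
            score = (start_scores[i] + end_scores[j]).item()
            if score > s_hat:
                s_hat, span = score, (i, j)
    return span if s_hat > (s_null + tau).item() else "no answer"

T, S, E, C = torch.randn(8, 768), torch.randn(768), torch.randn(768), torch.randn(768)
print(predict_answer(T, S, E, C, tau=0.0))
```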

The results compared to prior leaderboard en-tries and top published work (Sun et al., 2018; Wang et al., 2018b) are shown in Table 3, exclud-ing systems that use BERT as one of their com-ponents. We observe a +5.1 F1 improvement over the previous best system.

结果与之前的排行榜条目以及已发表的最优工作(Sun等人,2018;Wang等人,2018b)的比较如表3所示,其中不包括将BERT用作其组件之一的系统。我们观察到比之前最好的系统提高了+5.1 F1。

4.4 SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair com-pletion examples that evaluate grounded common-sense inference (Zellers et al., 2018). Given a sen-tence, the task is to choose the most plausible con-tinuation among four choices.

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameters introduced is a vec-tor whose dot product with the [CLS] token rep-resentation C denotes a score for each choice which is normalized with a softmax layer.

对抗生成情境(SWAG)数据集包含11.3万个句对补全示例,用于评估基于情境的常识推理(Zellers等人,2018)。给定一个句子,任务是在四个选项中选择最合理的后续。

在SWAG数据集上微调时,我们构造四个输入序列,每个序列都是给定句子(句子A)与一个可能的后续(句子B)的拼接。引入的唯一任务特定参数是一个向量,它与[CLS] token表示C的点积给出每个选项的分数,再用softmax层对四个分数进行归一化。
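下面示意 SWAG 微调时对四个候选延续的打分方式:每个选项各自构成一个“句子A+候选延续”序列,取其 [CLS] 表示与打分向量做点积,再经 softmax 归一化(示意性草图,[CLS] 向量用随机张量代替真实的 BERT 输出):

```python
import torch
import torch.nn.functional as F

H = 768
score_vector = torch.randn(H, requires_grad=True)   # 唯一新增的任务参数:打分向量

def swag_choice_probs(cls_vectors):
    """cls_vectors: [4, H],四个 "句子A+候选延续" 序列各自的 [CLS] 表示 C。
    每个选项的得分为 C 与打分向量的点积,softmax 归一化后即为选项概率。"""
    scores = cls_vectors @ score_vector              # [4]
    return F.softmax(scores, dim=0)

cls_vectors = torch.randn(4, H)                      # 假设来自 BERT 对四个输入序列的编码
probs = swag_choice_probs(cls_vectors)
print(probs, "预测选项:", int(torch.argmax(probs)))
```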

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Re-sults are presented in Table 4. BERTLARGE out-performs the authors’ baseline ESIM+ELMo sys-tem by +27.1% and OpenAI GPT by 8.3%.

我们以2e-5的学习率、16的batch size对模型微调3个epoch。结果如表4所示。BERTLARGE比作者的基线ESIM+ELMo系统高出27.1%,比OpenAI GPT高出8.3%。

5 Ablation Studies消融研究

In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance. Additional ablation studies can be found in Appendix C.

在本节中,我们对BERT的多个方面进行消融实验,以便更好地理解它们的相对重要性。其他消融研究见附录C。

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR + No NSP” model during fine-tuning.

表5:使用BERTBASE架构对预训练任务进行消融。“No NSP”表示在没有下一句预测任务的情况下训练;“LTR & No NSP”表示像OpenAI GPT一样训练为从左到右的LM,且没有下一句预测;“+ BiLSTM”表示微调时在“LTR + No NSP”模型之上添加一个随机初始化的BiLSTM。

5.1 Effect of Pre-training Tasks  预训练任务的影响,去除NSP任务性能有显著影响

  • 通过对BERT的消融实验,研究了BERT不同方面的重要性。
  • 使用BERTBASE架构进行的预训练任务消融,包括:
    • No NSP(无下一个句子预测任务):使用"masked LM"(MLM)进行训练,但没有"next sentence prediction"(NSP)任务。
    • LTR & No NSP(仅左侧上下文模型):使用标准的从左到右(LTR)语言模型进行训练,不包括NSP任务。这与OpenAI GPT直接可比,但使用了更大的训练数据集、输入表示和微调方案。
  • 结果表明,去除NSP任务对QNLI、MNLI和SQuAD 1.1的性能有显著影响。同时,LTR模型在所有任务上表现较差,特别是在MRPC和SQuAD上

We demonstrate the importance of the deep bidi-rectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperpa-rameters as BERTBASE:

我们使用与BERTBASE完全相同的预训练数据、微调方案和超参数来评估两个预训练目标,以展示BERT深度双向性的重要性:

No NSP: A bidirectional model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

LTR & No NSP: A left-context-only model which is trained using a standard Left-to-Right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input repre-sentation, and our fine-tuning scheme.

无NSP:使用“掩模LM”(MLM)训练的双向模型,但不使用“下句预测”(NSP)任务。

LTR & No NSP:一个仅使用左侧上下文的模型,用标准的从左到右(LTR)LM(而非MLM)训练。仅左侧的约束在微调时也被保留,因为去掉它会导致预训练/微调不匹配,从而降低下游性能。此外,该模型在没有NSP任务的情况下进行了预训练。这与OpenAI GPT直接可比,但使用了我们更大的训练数据集、我们的输入表示和我们的微调方案。

We first examine the impact brought by the NSP task. In Table 5, we show that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1. Next, we evaluate the impact of training bidirectional representations by com-paring “No NSP” to “LTR & No NSP”. The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.

For SQuAD it is intuitively clear that a LTR model will perform poorly at token predictions, since the token-level hidden states have no right-side context. In order to make a good faith at-tempt at strengthening the LTR system, we added a randomly initialized BiLSTM on top. This does significantly improve results on SQuAD, but the results are still far worse than those of the pre-trained bidirectional models. The BiLSTM hurts performance on the GLUE tasks.

我们首先考察了NSP任务带来的影响。在表5中,我们表明在QNLI、MNLI和SQuAD 1.1上删除NSP会显著影响性能。接下来,我们通过比较“No NSP”和“LTR & No NSP”来评估训练双向表示的影响。LTR模型在所有任务上的表现都比MLM模型差,在MRPC和SQuAD上有很大的下降。

对于SQuAD来说,很明显LTR模型在token预测方面表现不佳,因为token级隐藏状态没有右侧上下文。为了在加强LTR系统方面做一个善意的尝试,我们在上面添加了一个随机初始化的BiLSTM。这确实显著改善了SQuAD上的结果,但结果仍然比预先训练的双向模型差得多。BiLSTM会影响GLUE任务的性能。

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two mod-els, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) this it is strictly less powerful than a deep bidirectional model, since it can use both left and right context at every layer.

我们认识到,也可以像ELMo那样训练单独的LTR和RTL模型,并将每个token表示为两个模型输出的拼接。然而:(a)这样做的开销是单个双向模型的两倍;(b)对于QA这样的任务,这是不直观的,因为RTL模型无法以问题为条件来生成答案;(c)严格来说,这不如深度双向模型强大,因为深度双向模型可以在每一层同时使用左右上下文。

5.2 Effect of Model Size模型规模的影响,将模型扩展到极大规模还会在非常小规模任务上产生显著的改进

  • 探讨了模型规模对微调任务准确性的影响。
  • 通过在相同超参数和训练程序的情况下使用不同层数、隐藏单元和注意头的BERT模型进行实验。
  • 结果显示,较大的模型在所有四个数据集上都导致了准确性的提升。即使对于MRPC这样的数据集,它只有3600个标记的训练样本,并且与预训练任务相差较大,也取得了显著的改进。
  • 该研究首次明确地证明,将模型扩展到极大规模还会在非常小规模任务上产生显著的改进,前提是模型经过充分的预训练。

In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training pro-cedure as described previously.

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict ac-curacy improvement across all four datasets, even for MRPC which only has 3,600 labeled train-ing examples, and is substantially different from the pre-training tasks. It is also perhaps surpris-ing that we are able to achieve such significant improvements on top of models which are al-ready quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE contains 110M parameters and BERTLARGE con-tains 340M parameters.

在本节中,我们将探讨模型大小对微调任务准确性的影响。我们使用不同数量的层、隐藏单元和注意头来训练许多BERT模型,同时使用前面描述的相同的超参数和训练过程。

所选GLUE任务的结果如表6所示。在这个表中,我们报告了5次随机重启微调的平均Dev集准确率。我们可以看到,更大的模型在所有四个数据集上都带来了严格的准确率提升,即使对于只有3,600个带标注训练样本、且与预训练任务差异很大的MRPC也是如此。同样令人惊讶的是,在相对于现有文献已经相当大的模型之上,我们仍能取得如此显著的改进。例如,Vaswani等人(2017)探索的最大Transformer是(L=6, H=1024, A=16),编码器约有1亿参数;而我们在文献中找到的最大Transformer是(L=64, H=512, A=2),有2.35亿参数(Al-Rfou等人,2018)。相比之下,BERTBASE包含1.1亿参数,BERTLARGE包含3.4亿参数。

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convinc-ingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been suffi-ciently pre-trained. Peters et al. (2018b) presented mixed results on the downstream task impact of increasing the pre-trained bi-LM size from two to four layers and Melamud et al. (2016) men-tioned in passing that increasing hidden dimen-sion size from 200 to 600 helped, but increasing further to 1,000 did not bring further improve-ments. Both of these prior works used a feature-based approach — we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of ran-domly initialized additional parameters, the task-specific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small.

人们早就知道,增大模型规模会在机器翻译和语言建模等大规模任务上带来持续的改进,这一点可以从表6所示的留出(held-out)训练数据上的LM困惑度得到证明。然而,我们相信这是第一个令人信服地证明以下结论的工作:只要模型得到充分的预训练,将模型扩展到极大规模也会在非常小规模的任务上带来巨大改进。Peters等人(2018b)就将预训练的双向LM从两层增加到四层对下游任务的影响给出了好坏参半的结果;Melamud等人(2016)顺便提到,将隐藏维度从200增加到600有所帮助,但进一步增加到1,000并没有带来更多改进。这两项先前的工作都使用了基于特征的方法——我们假设,当模型直接在下游任务上微调、且只使用非常少量随机初始化的附加参数时,即使下游任务数据非常小,任务特定模型也能从更大、更有表现力的预训练表示中受益。

5.3 Feature-based Approach with BERT基于特征的方法,结果表明BERT在微调和基于特征的方法上都表现出色

  • 讨论了BERT的两种方法:微调方法和基于特征的方法。
  • 微调方法是在预训练模型上添加一个简单的分类层,并在下游任务上联合微调所有参数。
  • 基于特征的方法是从预训练模型中提取固定特征,然后在这些特征上运行更便宜的模型进行实验。
  • 在CoNLL-2003命名实体识别任务上,使用了BERT的两种方法,结果表明BERT在微调和基于特征的方法上都表现出色。

All of the BERT results presented so far have used the fine-tuning approach, where a simple classifi-cation layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a down-stream task. However, the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages. First, not all tasks can be easily represented by a Trans-former encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation. In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we for-mulate this as a tagging task but do not use a CRF layer in the output. We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

到目前为止给出的所有BERT结果都使用了微调方法:在预训练模型上添加一个简单的分类层,并在下游任务上联合微调所有参数。然而,从预训练模型中提取固定特征的基于特征的方法也有一定的优势。首先,并非所有任务都能方便地用Transformer编码器体系结构表示,因而需要添加任务专用的模型结构。其次,先一次性预先计算训练数据的昂贵表示,再在该表示之上用更廉价的模型进行多次实验,在计算上有很大的好处。在本节中,我们通过将BERT应用于CoNLL-2003命名实体识别(NER)任务(Tjong Kim Sang和De Meulder, 2003)来比较这两种方法。在BERT的输入中,我们使用保留大小写的WordPiece模型,并包含数据提供的最大文档上下文。按照标准做法,我们将其表述为标注(tagging)任务,但在输出中不使用CRF层。我们使用每个词第一个子token的表示作为NER标签集上token级分类器的输入。

To ablate the fine-tuning approach, we apply the feature-based approach by extracting the activa-tions from one or more layers without fine-tuning any parameters of BERT. These contextual em-beddings are used as input to a randomly initial-ized two-layer 768-dimensional BiLSTM before the classification layer.

Results are presented in Table 7. BERTLARGE performs competitively with state-of-the-art meth-ods. The best performing method concatenates the token representations from the top four hidden lay-ers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both fine-tuning and feature-based approaches.

为了对微调方法做消融,我们采用基于特征的方法:从一个或多个层中提取激活,而不微调BERT的任何参数。这些上下文嵌入被用作分类层之前一个随机初始化的两层768维BiLSTM的输入。

结果见表7。BERTLARGE与最先进的方法相比具有竞争力。表现最好的方法是将预训练Transformer顶部四个隐藏层的token表示拼接起来,其效果只比微调整个模型低0.3 F1。这表明BERT对于微调和基于特征的方法都是有效的。
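下面示意这种基于特征的用法:不微调 BERT,而是取其顶部四个隐藏层的 token 表示拼接后,送入一个随机初始化的两层 768 维 BiLSTM 再接分类层(示意性草图;层数、维度按论文描述假设,标签数等为示例值,隐藏状态用随机张量代替真实的 BERT 输出):

```python
import torch
import torch.nn as nn

H, NUM_LABELS = 768, 9          # NER 标签数此处仅为示例值

class FeatureBasedTagger(nn.Module):
    def __init__(self):
        super().__init__()
        # 输入为顶部 4 层隐藏状态的拼接,维度 4*H
        self.bilstm = nn.LSTM(4 * H, H, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * H, NUM_LABELS)  # 双向输出维度为 2*H

    def forward(self, all_hidden_states):
        """all_hidden_states: 长度为 L 的列表,每个元素 [batch, seq_len, H],
        取自冻结的 BERT(不回传梯度)。"""
        feats = torch.cat(all_hidden_states[-4:], dim=-1)  # 拼接顶部四层
        out, _ = self.bilstm(feats)
        return self.classifier(out)                        # 每个 token 的标签 logits

layers = [torch.randn(2, 16, H) for _ in range(12)]        # 假设取 12 层输出的随机示例
print(FeatureBasedTagger()(layers).shape)                  # torch.Size([2, 16, 9])
```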

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.
Table 7: CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

表6:对BERT模型尺寸的消融。#L = 层数;#H = 隐藏层维度;#A = 注意力头数。“LM (ppl)”是留出训练数据上的掩码LM困惑度。


表7:CoNLL-2003命名实体识别结果。超参数在Dev集上选择。报告的Dev和Test分数是使用这些超参数进行5次随机重启的平均值。

6 Conclusion

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architec-tures. Our major contribution is further general-izing these findings to deep bidirectional architec-tures, allowing the same pre-trained model to suc-cessfully tackle a broad set of NLP tasks.

最近由于语言模型迁移学习的经验改进表明,丰富的无监督预训练是许多语言理解系统的组成部分。特别是,这些结果使低资源任务也能从深度单向架构中受益。我们的主要贡献是将这些发现进一步推广到深度双向架构,允许相同的预训练模型成功地处理广泛的NLP任务。

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level lan- guage modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.

Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC. NIST.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 120–128. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large anno- tated corpus for learning natural language inference. In EMNLP. Association for Computational Linguis- tics.

Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez- Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancou- ver, Canada. Association for Computational Lin- guistics.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robin- son. 2013. One billion word benchmark for measur- ing progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehen- sion. In ACL.

Kevin Clark, Minh-Thang Luong, Christopher D Man- ning, and Quoc Le. 2018. Semi-supervised se- quence modeling with cross-view training. In Pro- ceedings of the 2018 Conference on Empirical Meth- ods in Natural Language Processing, pages 1914– 1925.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Pro- ceedings of the 25th international conference on Machine learning, pages 160–167. ACM.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo¨ıc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Nat- ural Language Processing, pages 670–680, Copen- hagen, Denmark. Association for Computational Linguistics.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. MaskGAN: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics.

Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In IJCAI.

Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In CoNLL.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Andriy Mnih and Geoffrey E Hinton. 2009. A scalable hierarchical distributed language model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1081–1088. Curran Associates, Inc.

Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.

Fu Sun, Linyang Li, Xipeng Qiu, and Yang Liu. 2018. U-Net: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1810.06638.

Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Wei Wang, Ming Yan, and Chen Wu. 2018b. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
