《BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding》论文翻译--中英对照

Abstract
摘要

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
我们介绍了一种新的语言表征模型BERT,它是“Bidirectional Encoder Representations from Transformers”的缩写。与最近的语言表征模型(Peters et al., 2018; Radford et al., 2018)不同,BERT旨在通过在所有层中同时以左右上下文(上文和下文)为条件,来预训练深度双向表征。因此,只需在预训练的BERT表征之上添加一个输出层进行微调,就能在问答、语言推断等多种任务上得到SOTA的模型,而无需针对特定任务对模型结构做大量修改。

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5 absolute improvement), outperforming human performance by 2.0.
BERT在概念上很简单,在实验上也非常强大。它在11项自然语言处理任务上取得了新的SOTA结果,包括将GLUE benchmark提升到80.4%(绝对提升7.6%),将MultiNLI(多体裁自然语言推理)的准确率提升到86.7%(绝对提升5.6%),以及将SQuAD v1.1问答测试集的F1提升到93.2(绝对提升1.5),比人类表现高出2.0。

1 Introduction(简介)

Language model pre-training has shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017, 2018; Radford et al., 2018; Howard and Ruder, 2018). These tasks include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), where models are required to produce fine-grained output at the token-level.
预训练语言模型对改善许多自然语言处理任务非常有效(Dai and Le, 2015; Peters et al., 2017, 2018; Radford et al., 2018; Howard and Ruder, 2018) 。 这些任务包括句子级任务,例如自然语言推理(Bowman et al., 2015; Williams et al., 2018)和释义(原文:paraphrasing)(Dolan and Brockett, 2005),旨在通过整体分析来预测句子之间的关系,以及token-level的任务,例如命名实体识别(Tjong Kim Sang and De Meulder, 2003)和SQuAD问题解答(Rajpurkar et al., 2016),在这些任务中模型预测粒度比较细,如token级别的输出。

There are two existing strategies for applying pre-trained language representations to down-stream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses tasks-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters. In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
现有两种将预训练语言表征应用于下游任务的策略:基于特征的方法和基于微调的方法。基于特征的方法,例如ELMo(Peters等人,2018),使用特定于任务的架构,把预训练的向量表征作为额外特征加入。基于微调(fine-tuning)的方法,例如Generative Pre-trained Transformer(OpenAI GPT)(Radford et al., 2018),只引入极少的任务特定参数,并通过在下游任务上对预训练参数进行简单微调来训练。在先前的工作中,这两种方法在预训练期间使用相同的目标函数,即都用单向语言模型来学习通用的语言表征。

We argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be devastating when applying fine-tuning based approaches to token-level tasks such as SQuAD question answering (Rajpurkar et al., 2016), where it is crucial to incorporate context from both directions.
我们认为,当前的技术严重限制了预训练表征的能力,对于基于微调的方法尤其如此。主要的限制在于标准语言模型是单向的,这限制了预训练期间可选用的模型结构。例如,在OpenAI GPT中,作者使用从左到右的结构,使得在Transformer的self-attention层中,每个token只能关注到它之前的token(Vaswani等人,2017)。这样的限制对于句子级任务并非最优,而在对token级任务(例如SQuAD问答,Rajpurkar等人,2016)采用基于微调的方法时可能是毁灭性的,因为在这些任务中从两个方向整合上下文至关重要。

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT addresses the previously mentioned unidirectional constraints by proposing a new pre-training objective: the “masked language model” (MLM), inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also introduce a “next sentence prediction” task that jointly pre-trains text-pair representations.
在本文中,我们通过提出BERT–来自Transformer的双向编码器表征法,来改进了基于微调的方法。 BERT通过提出 受完形填空任务启发的“屏蔽语言模型”(MLM)(Taylor, 1953)这个新的预训练目标 来解决前面提到的单向约束问题。 MLM从输入中随机屏蔽了某些token,目的是仅根据其上下文来预测屏蔽单词的原始词汇ID。 与从左到右的语言模型的预训练不同,MLM允许模型的表征融合左侧和右侧的上下文,这使得我们可以预训练深层的双向Transformer。 除了屏蔽语言模型之外,我们还引入了“下一个句子预测”任务,该任务参与了预训练中文本对(text-pair)表征的训练。

The contributions of our paper are as follows:
本文的贡献如下:

  • We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.
    我们证明了双向预训练对于语言表征的重要性。与Radford等人(2018)使用单向语言模型进行预训练不同,BERT使用掩盖语言模型(MLM)来实现深度双向表征的预训练。这也与Peters等人(2018)形成对比,后者使用的是独立训练的从左到右和从右到左语言模型的浅层拼接。

  • We show that pre-trained representations eliminate the needs of many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.
    我们指出,经过预训练的表征消除了许多针对特定任务精心设计体系结构的需求。BERT是第一个基于微调的表征模型,在一系列句子级和token级任务上取得了SOTA的结果,其性能优于许多采用任务特定架构的系统。

  • BERT advances the state-of-the-art for eleven NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained model will be available at goo.gl/language/bert.$^1$
    BERT刷新了11项NLP任务的SOTA。我们还报告了对BERT的大量消融研究(见后文第5节 Ablation Studies),表明模型的双向性质是其中最重要的一项新贡献。代码和预训练模型将在 goo.gl/language/bert 上提供。$^1$

2 Related Work(相关工作)

There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.
通用的语言表征 预训练工作已有很长的历史,本节中我们将简要回顾最流行的一些方法。

2.1 Feature-based Approaches(基于特征的方法)

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992;Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).
几十年来,学习通用的单词表征法一直是学术界的活跃领域,包括非神经(Brown et al., 1992;Ando and Zhang, 2005; Blitzer et al., 2006)和神经(Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014)方法。 经过预训练的词嵌入被认为是现代NLP系统不可或缺的一部分,与从头开始学习的嵌入相比有显著改进(Turian et al., 2010)。

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are also typically used as features in a downstream model.
这些方法已推广到较粗粒度(相对于词嵌入),例如句子嵌入(Kiros等人,2015; Logeswaran和Lee,2018)或段落嵌入(Le和Mikolov,2014)。 与传统的单词嵌入一样,这些学习的表征形式通常也用作下游模型中的特征。

ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension. They propose to extract context-sensitive features from a language model. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state-of-the-art for several major NLP bench-marks (Peters et al., 2018) including question answering (Rajpurkar et al., 2016) on SQuAD, sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).
ELMo从另一个维度推广了传统的词嵌入研究。他们提出从语言模型中提取上下文相关的特征。当把这些上下文相关的词嵌入与现有的任务特定架构结合时,ELMo在几个主要的NLP基准上刷新了SOTA(Peters等,2018),包括SQuAD上的问答(Rajpurkar等,2016)、情感分析(Socher等人,2013)以及命名实体识别(Tjong Kim Sang和De Meulder,2003)。

2.2 Fine-tuning Approaches (微调方法)

A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on a LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).
从语言模型(LMs)进行迁移学习的最新趋势是,在针对监督的下游任务微调相同模型之前,先在LM目标上预先训练一些模型架构(Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018)。 这些方法的优点是几乎不需要从头学习参数。 至少部分是由于这一优势,OpenAI GPT(Radford et al., 2018)在GLUE基准测试中在许多句子级任务上获得了之前的SOTA结果(Wang et al., 2018)。

2.3 Transfer Learning from Supervised Data(在有监督数据上的迁移学习)

While the advantage of unsupervised pre-training is that there is a nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet (Deng et al., 2009; Yosinski et al., 2014).
尽管无监督预训练的优势在于几乎可以无限量地使用数据,但也有工作表明可以从具有大型数据集的监督任务中进行有效迁移,例如自然语言推理(Conneau et al., 2017)和机器翻译(McCann et al., 2017)。在NLP之外,计算机视觉研究也证明了从大型预训练模型进行迁移学习的重要性,一个有效的方法是fine-tune在ImageNet上预训练的模型(Deng et al., 2009; Yosinski et al., 2014)。

3 BERT

We introduce BERT and its detailed implementation in this section. We first cover the model architecture and the input representation for BERT. We then introduce the pre-training tasks, the core innovation in this paper, in Section 3.3. The pre-training procedures, and fine-tuning procedures are detailed in Section 3.4 and 3.5, respectively. Finally, the differences between BERT and OpenAI GPT are discussed in Section 3.6.
我们将在本节中介绍BERT及其详细实现。 我们首先介绍BERT的模型架构和输入信息。 然后,在第3.3节中介绍预训练任务,即本文的核心创新点。 预训练程序和微调程序分别在3.4和3.5节中详细介绍。 最后,在第3.6节中讨论了BERT与OpenAI GPT之间的区别。

3.1 Model Architecture(模型结构)

BERT’s model architecture is a multilayer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.$^2$ Because the use of Transformers has become ubiquitous recently and our implementation is effectively identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”$^3$
BERT的模型架构是一个多层双向Transformer编码器,基于Vaswani等人(2017)描述的原始实现,并已在tensor2tensor库中发布。$^2$ 由于Transformer的使用最近已非常普遍,而且我们的实现与原始实现基本相同,因此我们省略对模型结构的详尽背景介绍,请读者参阅Vaswani等人(2017)以及诸如“The Annotated Transformer”$^3$ 这样出色的指南。

[Figure 1]

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for the H = 768 and 4096 for the H = 1024. We primarily report results on two model sizes:
在这项工作中,我们将层数(即Transformer模块的个数)记为L,隐藏层大小记为H,self-attention head的数量记为A。在所有情况下,我们都将前馈/filter的大小设置为4H,即H=768时为3072,H=1024时为4096。我们主要报告以下两种模型规模的结果(下面的列表之后给出一个仅作示意的配置草图):

  • BERT$_{BASE}$: L=12, H=768, A=12, Total Parameters=110M
  • BERT$_{LARGE}$: L=24, H=1024, A=16, Total Parameters=340M
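
下面是一个仅作示意的配置草图(非论文官方实现),用来说明上文中 L、H、A 与前馈/filter 大小 4H 之间的关系;其中的类名 BertConfig 和各字段名均为本文为演示而假设的。

```python
from dataclasses import dataclass

@dataclass
class BertConfig:
    """仅作示意的配置对象:字段名为本文假设,并非官方实现。"""
    num_layers: int           # L: Transformer 块的层数
    hidden_size: int          # H: 隐藏层大小
    num_attention_heads: int  # A: self-attention head 数量

    @property
    def feed_forward_size(self) -> int:
        # 论文中前馈/filter 大小统一取 4H(H=768 -> 3072,H=1024 -> 4096)
        return 4 * self.hidden_size

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_attention_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_attention_heads=16)

print(BERT_BASE.feed_forward_size)   # 3072
print(BERT_LARGE.feed_forward_size)  # 4096
```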

BERT$_{BASE}$ was chosen to have an identical model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation. The comparisons between BERT, OpenAI GPT and ELMo are shown visually in Figure 1.
为了便于比较,我们将BERT$_{BASE}$设置为与OpenAI GPT相同的模型规模。但关键区别在于,BERT的Transformer使用双向的self-attention,而GPT的Transformer使用受限的self-attention,每个token只能关注其左侧的上下文。需要注意的是,在文献中双向Transformer通常被称为“Transformer encoder”,而只能利用左侧上下文的版本被称为“Transformer decoder”,因为它可以用于文本生成。BERT、OpenAI GPT和ELMo的比较如Figure 1所示。

3.2 Input Representation(输入表示)

Our input representation is able to unambiguously represent both a single text sentence or a pair of text sentences (e.g., [Question, Answer]) in one token sequence.$^4$ For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings. A visual representation of our input representation is given in Figure 2.
我们的输入表征能够在一个token序列中无歧义地表示单个文本句子或一对文本句子(例如[Question, Answer])。$^4$ 对于给定的token,它的输入表征由对应的token嵌入、segment(分句)嵌入和position(位置)嵌入相加构成。我们在Figure 2中给出了输入表征的示意图。
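
下面用 numpy 给出一个仅作示意的草图,展示“token 嵌入 + segment(句子A/B)嵌入 + position 嵌入逐位相加”这一输入构造方式;其中的嵌入矩阵用随机数代替训练得到的参数,token id 等数值均为演示用的假设值,并非论文或官方代码中的实现。

```python
import numpy as np

# 示意性的超参数
vocab_size, max_len, hidden_size = 30000, 512, 768
rng = np.random.default_rng(0)

# 三张可学习的嵌入表(这里用随机数代替训练得到的参数)
token_emb    = rng.normal(size=(vocab_size, hidden_size))
segment_emb  = rng.normal(size=(2, hidden_size))        # 0 = 句子A, 1 = 句子B
position_emb = rng.normal(size=(max_len, hidden_size))  # learned positional embeddings

def build_input_representation(token_ids, segment_ids):
    """对每个 token,把 token / segment / position 三个嵌入逐位相加。"""
    positions = np.arange(len(token_ids))
    return (token_emb[token_ids]
            + segment_emb[segment_ids]
            + position_emb[positions])

# 假设的已转 id 的序列:[CLS] 句子A ... [SEP] 句子B ... [SEP]
token_ids   = np.array([101, 2023, 2003, 102, 2008, 2001, 102])
segment_ids = np.array([0,   0,    0,    0,   1,    1,    1])
print(build_input_representation(token_ids, segment_ids).shape)  # (7, 768)
```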

[Figure 2]

The specifics are:

  • We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. We denote split word pieces with ##.
    我们使用WordPiece词嵌入(Wu et al., 2016),词表包含30,000个token。被切分开的词片用“##”标记。

  • We use learned positional embeddings with supported sequence lengths up to 512 tokens.
    我们使用可学习的位置嵌入(learned positional embeddings),支持的序列长度最长为512个token。

  • The first token of every sequence is always the special classification embedding ([CLS]). The final hidden state (i.e., output of Transformer) corresponding to this token is used as the aggregate sequence representation for classification tasks. For nonclassification tasks, this vector is ignored.
    每个序列的第一个token始终是特殊的分类嵌入([CLS])。与该token对应的最终隐藏状态(即Transformer的输出)在分类任务中被用作整个序列的聚合表征。对于非分类任务,该向量会被忽略。

  • Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned sentence A embedding to every token of the first sentence and a sentence B embedding to every token of the second sentence.
    句子对被打包成一个序列。我们通过两种方式区分两个句子。第一,我们用一个特殊的token([SEP])把它们分隔开。第二,我们给第一个句子的每个token加上可学习的句子A嵌入,给第二个句子的每个token加上可学习的句子B嵌入。

  • For single-sentence inputs we only use the sentence A embeddings.
    对于单句输入,我们只使用句子A嵌入。

3.3 Pre-training Tasks(预训练任务)

Unlike Peters et al. (2018) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two novel unsupervised prediction tasks, described in this section.
与Peters、Radford等人不同,我们没有使用传统的从左到右或者从右到左的语言模型来预训练BERT。相反,我们使用了两个新颖的任务来进行预训练,这两个任务会在本节中描述。

3.3.1 Task 1: Masked LM(任务1:基于遮盖的语言模型)

Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself” in a multi-layered context.
从直觉上讲,有理由相信,一个深度双向模型比 从左到右的模型 或者 双向浅层相连的模型要更加强大。不幸的是,标准的语言模型只能从左到右或从右到左训练,因为双向语言模型将允许每个单词在多层上下文中间接“看到自己”。

In order to train a deep bidirectional representation, we take a straightforward approach of masking some percentage of the input tokens at random, and then predicting only those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.
为了训练深度双向表征,我们采用了一种直接的方法:随机屏蔽一定比例的输入token,然后只预测这些被屏蔽的token。我们称此方法为“掩盖式语言模型”(MLM),尽管在文献中它通常被称为完形填空(Cloze)任务(Taylor,1953)。在这种情况下,与被屏蔽token对应的最终隐藏向量会被送入一个覆盖整个词表的输出softmax,这与标准语言模型相同。在我们所有的实验中,我们随机屏蔽每个序列中15%的WordPiece token。与去噪自编码器(Vincent et al., 2008)不同,我们只预测被屏蔽的词,而不是重建整个输入。

Although this does allow us to obtain a bidirectional pre-trained model, there are two downsides to such an approach. The first is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my dog is hairy it chooses hairy. It then performs the following procedure:
尽管这确实能让我们得到一个双向的预训练模型,但这种方法有两个缺点。第一个缺点是,由于[MASK]这个token在微调阶段从未出现,这会造成预训练与微调之间的不匹配。为了缓解这个问题,我们并不总是用真正的[MASK]去替换被选中的词。实际上,训练数据生成器会随机选择15%的token,例如在句子“my dog is hairy”中选中了“hairy”,然后执行下面的流程(列表之后给出一个示意性的代码草图):

  • Rather than always replacing the chosen words with [MASK], the data generator will do the following:
    数据生成器不会始终使用[MASK]来代替选择的单词,它会执行以下操作:
  • 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
    80%的情况下:使用[MASK]替换随机选中的词,例如:my dog is hairy → my dog is [MASK]
  • 10% of the time: Replace the word with a randomword,e.g.,my dog is hairy → my dog is apple
    10%的情况下:随机选取个单词来替换选中的词,例如:my dog is hairy → my dog is apple
  • 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
    10%的情况下:保持选中的词不变,例如:my dog is hairy → my dog is hairy,这样做的目的是使表征偏向实际观察到的单词。
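
下面是对上述“随机选取15%的token,并按80%/10%/10%规则替换输入”流程的一个示意性草图(纯 Python);函数名、示意词表均为本文假设,并非论文作者的数据生成代码。

```python
import random

MASK = "[MASK]"
VOCAB = ["apple", "dog", "store", "milk", "bird"]  # 演示用的小词表(假设)

def create_mlm_example(tokens, mask_prob=0.15, rng=random.Random(0)):
    """随机选取约 15% 的 token 作为预测目标,并按 80/10/10 规则改写输入。"""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                        # 未被选中:既不替换也不预测
        labels[i] = tok                     # 被选中的 token 成为预测目标(原词)
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK                # 80%:替换为 [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(VOCAB)   # 10%:替换为随机词
        # 剩余 10%:保持原词不变
    return inputs, labels

print(create_mlm_example("my dog is hairy".split()))
```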

The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model’s language understanding capability.
Transformer编码器并不知道哪些词会被要求预测,也不知道哪些词已经被随机替换,因此它被迫为每个输入token保留一个分布式的上下文表征。此外,由于随机替换只发生在所有token的1.5%上(即15%中的10%),这似乎不会损害模型的语言理解能力。

The second downside of using an MLM is that only 15% of tokens are predicted in each batch, which suggests that more pre-training steps may be required for the model to converge. In Section 5.3 we demonstrate that MLM does converge marginally slower than a left-to-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.
使用MLM的第二个缺点是,每个batch中只有15%的token会被预测,这意味着模型可能需要更多的预训练步数才能收敛。在5.3节中,我们证实了MLM的收敛速度确实比从左到右的模型(预测每一个token)略慢,但MLM模型带来的实验提升远远超过了增加的训练成本。

3.3.2 Task 2: Next Sentence Prediction (任务2:下一句预测)

Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two text sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus.
许多重要的下游任务,例如问答(QA)和自然语言推论(NLI),都是基于对两个文本句子之间关系的理解,而语言建模并不能直接捕获这些关系。为了训练一个能够理解句子关系的模型,我们预训练了一个预测下一个句子的二值化预测任务,这个任务可以使用任何单语语料库。具体而言,为每个预训练样本选择句子A和B时,50%的情况B是A的下一个句子,而50%情况B是来自语料库的随机句子。

For example:
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

举例:
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

We choose the NotNext sentences completely at random, and the final pre-trained model achieves 97%-98% accuracy at this task. Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.
我们完全随机的选择NotNext句子,最终的预训练模型在此任务上达到97%-98%的准确率。 尽管这个任务很简单,但我们在5.1节中证明了完成此任务的预训练对QA和NLI都非常有益。
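
下面给出“下一句预测”训练样本构造方式的一个示意性草图:50%的情况取真实的下一句(IsNext),50%的情况从语料库中随机取一句(NotNext)。其中的函数名和数据组织方式均为本文假设。

```python
import random

def create_nsp_example(document, all_documents, idx, rng=random.Random(0)):
    """document 是句子列表;50% 用真实下一句 (IsNext),50% 用随机句子 (NotNext)。"""
    sentence_a = document[idx]
    if rng.random() < 0.5 and idx + 1 < len(document):
        return sentence_a, document[idx + 1], "IsNext"
    random_doc = rng.choice(all_documents)   # 从整个语料中随机取一篇文档
    sentence_b = rng.choice(random_doc)      # 再从中随机取一句
    return sentence_a, sentence_b, "NotNext"

corpus = [["the man went to the store", "he bought a gallon of milk"],
          ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(create_nsp_example(corpus[0], corpus, 0))
```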

3.4 Pre-training Procedure(预训练方法)

The pre-training procedure largely follows the existing literature on language model pre-training.For the pre-training corpus we use the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.
预训练过程很大程度上遵循了现有文献中有关语言模型预训练的方法。对于预训练语料库,我们把BooksCorpus(800M字)(Zhu等人,2015)与English Wikipedia(2,500M字)串联起来使用。对于Wikipedia,我们仅提取文本段落,而忽略列表、表格和标题。为了提取长的连续序列,使用文档级语料库至关重要,而不是诸如Billion Word Benchmark(Chelba等人,2013)之类的经过打乱顺序的句子级语料库。

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.
为了生成每个训练输入序列,我们从语料库中采样两段文本,我们把它们称作“句子”,尽管它们通常比单个句子长得多(但也可能更短)。第一段文本使用A嵌入,第二段使用B嵌入。50%的情况下B确实是A之后的下一段文本,另外50%的情况下B是随机选取的文本,这是为“next sentence prediction”任务准备的。两段文本采样时保证总长度≤512个token。LM的屏蔽操作在WordPiece切分之后进行,统一的屏蔽率为15%,并且没有对部分词片(partial word pieces)做特殊处理。

We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood.
我们使用batch大小为256个序列(256 sequences * 512 tokens = 128,000 tokens/batch)训练了1,000,000步,这大约相当于在33亿词的语料上训练40个epoch。我们使用Adam优化器,学习率为1e-4,β1=0.9,β2=0.999,L2权重衰减为0.01,学习率在前10,000步进行warmup,之后线性衰减。我们在所有层上使用0.1的dropout概率。与OpenAI GPT一样,我们使用gelu激活函数(Hendrycks和Gimpel,2016)而不是标准的relu。训练损失是MLM似然的均值与下一句预测似然的均值之和。
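
下面是一个示意性的学习率日程函数,对应上文“前10,000步进行warmup、之后线性衰减”的描述;线性warmup以及衰减到0等具体细节是本文的假设,并非官方实现。

```python
def bert_lr_schedule(step, base_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """前 warmup_steps 步线性升到 base_lr,之后线性衰减(此处假设衰减到 0)。"""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

for s in (0, 5_000, 10_000, 500_000, 1_000_000):
    print(s, round(bert_lr_schedule(s), 8))
```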

Training of BERT$_{BASE}$ was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total).$^5$ Training of BERT$_{LARGE}$ was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training took 4 days to complete.
BERT$_{BASE}$的训练在Pod配置的4块Cloud TPU上进行(总共16个TPU芯片);$^5$ BERT$_{LARGE}$的训练在16块Cloud TPU上进行(总共64个TPU芯片)。每次预训练需要4天完成。

3.5 Fine-tuning Procedure (微调方法)

For sequence-level classification tasks, BERT fine-tuning is straightforward. In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding. We denote this vector as C ∈ $\mathbb{R}^{H}$. The only new parameters added during fine-tuning are for a classification layer W ∈ $\mathbb{R}^{K\times H}$, where K is the number of classifier labels. The label probabilities P ∈ $\mathbb{R}^{K}$ are computed with a standard softmax, P = softmax(CW$^{T}$). All of the parameters of BERT and W are fine-tuned jointly to maximize the log-probability of the correct label. For span-level and token-level prediction tasks, the above procedure must be modified slightly in a task-specific manner. Details are given in the corresponding subsection of Section 4.

对于序列级别的分类任务,BERT的微调很直接。为了获得输入序列的固定维度的聚合(pooled)表征,我们取输入中第一个token对应的最终隐藏状态(即Transformer的输出),按构造方式它对应于特殊的[CLS]词嵌入。我们将该向量记为$C\in\mathbb{R}^{H}$。微调期间唯一新增的参数是分类层$W\in\mathbb{R}^{K\times H}$,其中K是分类标签的数量。标签概率$P\in\mathbb{R}^{K}$用标准softmax计算,即$P=\mathrm{softmax}(CW^{T})$。BERT的所有参数和W共同微调,以最大化正确标签的对数概率。对于span级和token级的预测任务,需要按具体任务对上述流程做少量修改,细节在第4节的相应小节中给出。
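
下面用 numpy 给出 $P=\mathrm{softmax}(CW^{T})$ 的一个示意性计算草图:其中 C 用随机向量代替 BERT 对 [CLS] 的真实输出,W 是微调时新增的分类层参数,数值均为演示用的假设。

```python
import numpy as np

rng = np.random.default_rng(0)
H, K = 768, 3                      # 隐藏维度 H,标签数 K(示意值)

C = rng.normal(size=(H,))          # [CLS] 对应的最终隐藏向量(此处用随机数代替)
W = rng.normal(scale=0.02, size=(K, H))  # 微调阶段唯一新增的分类层参数

logits = W @ C                     # 等价于 C W^T
P = np.exp(logits - logits.max())
P /= P.sum()                       # 标准 softmax,得到 K 个标签的概率
print(P, P.sum())
```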

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:
对微调来说,模型的大多数超参数与预训练时相同,只有batch size、learning rate和训练epoch数不同。dropout概率始终保持为0.1。最优的超参数取值因任务而异,但我们发现下列取值范围在所有任务上都表现良好:

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 3, 4

We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.
我们还注意到,大数据集(例如超过10万条有标注训练样本)对超参数选择的敏感度远低于小数据集。微调通常非常快,因此直接在上述参数上做穷举搜索,并选择在开发集上表现最好的模型是合理的做法。
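
下面是对“在上述超参数范围内做穷举搜索、取开发集上表现最好的配置”这一做法的示意性草图;其中 train_and_eval 是一个假设的占位函数,仅用于说明搜索流程,并非真实的训练代码。

```python
import itertools, random

def train_and_eval(batch_size, lr, epochs):
    """占位函数:实际中应返回微调后模型在开发集上的得分(这里用伪随机数代替)。"""
    return random.Random(hash((batch_size, lr, epochs))).random()

search_space = {
    "batch_size": [16, 32],
    "lr": [5e-5, 3e-5, 2e-5],
    "epochs": [3, 4],
}
# 穷举所有组合,取开发集得分最高的配置
best = max(
    (dict(zip(search_space, values)) for values in itertools.product(*search_space.values())),
    key=lambda cfg: train_and_eval(**cfg),
)
print("在开发集上表现最好的配置:", best)
```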

3.6 Comparison of BERT and OpenAI GPT (BERT与GPT的比较)

The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally chosen to be as close to GPT as possible so that the two methods could be minimally compared. The core argument of this work is that the two novel pre-training tasks presented in Section 3.3 account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained:
现有的预训练方法中与BERT最具可比性的是OpenAI GPT,它在一个大型文本语料上训练了从左到右的Transformer语言模型。事实上,BERT中的许多设计决策被有意地选择得尽可能接近GPT,以便两种方法可以在最小差异下进行比较。这项工作的核心论点是,第3.3节中提出的两个新预训练任务贡献了大部分的实验提升,但我们也注意到BERT和GPT的训练方式还存在其他一些差异:

  • GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).
    GPT使用的训练语料是BooksCorpus(800M词);BERT的训练语料是BooksCorpus(800M词)和Wikipedia(2,500M词)。

  • GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.
    GPT使用句子分隔符([SEP])和分类token([CLS]),它们仅在微调时引入;BERT在预训练期间学习[SEP],[CLS]和句子A / B的嵌入。

  • GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.
    GPT训练了1M步,一个batch中有32000字; BERT也训练了1M步,但一个batch包含128,000字。

  • GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.
    GPT对所有微调实验都使用值为5e-5的学习率; BERT根据不同的任务选择不同的微调学习速率,该速率在开发集上表现最佳。

To isolate the effect of these differences, we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the new pre-training tasks.
为了排除这些差异带来的影响,我们在5.1节中进行了消融实验,该实验表明,大多数改进实际上来自于新的预训练任务。

4 Experiments(实验)

In this section, we present BERT fine-tuning results on 11 NLP tasks.
在本节中,我们将介绍BERT模型在11个NLP任务上的微调结果。
[Figure 3]
Figure 3: Our task specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, Ti represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.
Figure 3:我们的特定任务模型是通过将BERT与一个附加的输出层结合而构成的,因此只需从头学习极少量的参数。在这些任务中,(a)和(b)是序列级任务,而(c)和(d)是token级任务。图中,$E$表示输入嵌入,$T_i$表示token $i$的上下文表征,[CLS]是用于分类输出的特殊符号,[SEP]是用于分隔非连续token序列的特殊符号。

4.1 GLUE Datasets(GLUE数据集)

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of diverse natural language understanding tasks. Most of the GLUE datasets have already existed for a number of years, but the purpose of GLUE is to (1) distribute these datasets with canonical Train, Dev, and Test splits, and (2) set up an evaluation server to mitigate issues with evaluation inconsistencies and Test set overfitting. GLUE does not distribute labels for the Test set and users must upload their predictions to the GLUE server for evaluation, with limits on the number of submissions.
通用语言理解评估(GLUE)基准(Wang et al., 2018)是一个各种自然语言理解任务的集合。大多数GLUE数据集已经存在了很多年,但是GLUE的目的是(1)使用规范的方法来把数据集划分成Train,Dev和Test ,以及(2)设置评估服务器以缓解评估不一致的问题,以及测试集过拟合的问题。GLUE不会分发测试集的标签,用户必须将其预测结果上传到GLUE服务器以进行评估,但要限制提交的数量。

The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018):
GLUE基准测试包括以下数据集,其描述最初由Wang等人(2018)总结:

MNLI Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.
MNLI(Multi-Genre Natural Language Inference)是一项大规模的众包蕴含分类任务(Williams et al., 2018)。给定一对句子,目标是预测第二个句子相对于第一个句子是蕴含、矛盾还是中立关系。

QQP Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent (Chen et al., 2018).
QQP Quora Question Pairs是一个二元分类任务,目标是确定在Quora上询问的两个问题在语义上是否等效。

QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer.
QNLI Question Natural Language Inference是Stanford Question Answering数据集(Rajpurkar et al., 2016)的一个版本,该数据集已转换为二分类任务(Wang等,2018)。 正例是(question, sentence)句对,正例句对包含正确的答案,负例是来自同一段落的(question, sentence)句对,不包含答案。

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).
SST-2斯坦福情感树库是一个二分类的单句分类任务,由从电影评论中提取的句子,以及人工标注的情感标签组成。(Socher等人,2013)。

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).
CoLA语言可接受语料库是一个二元单句分类任务,其目标是预测英语句子在语言上是否“可接受”(Warstadt et al., 2018))。

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.
STS-B 语义文本相似性基准语料库是从新闻标题和其他来源提取的句子对的集合(Cer et al。,2017)。 他们用1到5分打分,表示这两个句子在语义上有多相似。

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).
MRPC 微软研究院释义语料库由自动从在线新闻资源中提取的句子对组成,并带有说明句子对中的句子在语义上是否等效的人工标注信息(Dolan and Brockett, 2005)。

RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).$^6$
RTE 识别文本蕴含是一项类似于MNLI的二分类蕴含任务,但是训练数据少得多(Bentivogli et al., 2009)。

WNLI Winograd NLI is a small natural language inference dataset deriving from (Levesque et al., 2011). The GLUE webpage notes that there are issues with the construction of this dataset,$^7$ and every trained system that’s been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set out of fairness to OpenAI GPT. For our GLUE submission, we always predicted the majority class.
WNLI Winograd NLI是一个小型自然语言推理数据集,源自(Levesque et al., 2011)。GLUE网页指出,该数据集的构建存在问题,$^7$ 而且每个提交给GLUE的训练系统的表现都比直接预测多数类的65.1基线准确率更差。因此,出于对OpenAI GPT公平的考虑,我们将这个数据集排除在外。在我们的GLUE提交中,这个任务我们始终预测多数类。

4.1.1 GLUE Results(GLUE数据集上的结果)

[Table 1]
Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERT$_{BASE}$ = (L=12, H=768, A=12); BERT$_{LARGE}$ = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single task. All results obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised/.
Table 1:由GLUE评估服务器评分的GLUE测试结果。每个任务下方的数字表示训练样本的数量。“Average”列与官方GLUE得分略有不同,因为我们排除了有问题的WNLI集。OpenAI GPT =(L=12,H=768,A=12);BERT$_{BASE}$ =(L=12,H=768,A=12);BERT$_{LARGE}$ =(L=24,H=1024,A=16)。BERT和OpenAI GPT均为单模型、单任务。所有结果来自 https://gluebenchmark.com/leaderboard 和 https://blog.openai.com/language-unsupervised/ 。

To fine-tune on GLUE, we represent the input sequence or sequence pair as described in Section 3, and use the final hidden vector C ∈ $\mathbb{R}^{H}$ corresponding to the first input token ([CLS]) as the aggregate representation. This is demonstrated visually in Figure 3 (a) and (b). The only new parameters introduced during fine-tuning is a classification layer W ∈ $\mathbb{R}^{K\times H}$, where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(CW$^{T}$)).
为了在GLUE上进行微调,我们按照第3节中描述的方法表示输入序列或序列对,并使用与第一个输入token([CLS])对应的最终隐藏向量$C\in\mathbb{R}^{H}$作为聚合表征,如图3(a)和(b)所示。微调过程中引入的唯一新参数是分类层$W\in\mathbb{R}^{K\times H}$,其中K是标签数目。我们用C和W计算标准分类损失,即$\log(\mathrm{softmax}(CW^{T}))$。

We use a batch size of 32 and 3 epochs over the data for all GLUE tasks. For each task, we ran fine-tunings with learning rates of 5e-5, 4e-5, 3e-5, and 2e-5 and selected the one that performed best on the Dev set. Additionally, for BERT$_{LARGE}$ we found that fine-tuning was sometimes unstable on small data sets (i.e., some runs would produce degenerate results), so we ran several random restarts and selected the model that performed best on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization. We note that the GLUE data set distribution does not include the Test labels, and we only made a single GLUE evaluation server submission for each BERT$_{BASE}$ and BERT$_{LARGE}$.
我们在所有GLUE任务上都使用32的batch size,训练3个epoch。对每个任务,我们分别用5e-5、4e-5、3e-5和2e-5的学习率进行微调,并选择在开发集上表现最好的那个。此外,对于BERT$_{LARGE}$,我们发现在小数据集上微调有时不稳定(即某些运行会产生退化的结果),因此我们进行了数次随机重启,并选择在开发集上表现最好的模型。随机重启时,我们使用相同的预训练检查点,但采用不同的微调数据打乱方式和分类层初始化。我们注意到GLUE发布的数据集不包含测试集标签,并且对BERT$_{BASE}$和BERT$_{LARGE}$,我们各自只向GLUE评估服务器提交了一次结果。

Results are presented in Table 1. Both BERT$_{BASE}$ and BERT$_{LARGE}$ outperform all existing systems on all tasks by a substantial margin, obtaining 4.4% and 6.7% respective average accuracy improvement over the state-of-the-art. Note that BERT$_{BASE}$ and OpenAI GPT are nearly identical in terms of model architecture outside of the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.7% absolute accuracy improvement over the state-of-the-art. On the official GLUE leaderboard,$^8$ BERT$_{LARGE}$ obtains a score of 80.4, compared to the top leaderboard system, OpenAI GPT, which obtains 72.8 as of the date of writing.
实验结果显示在Table 1中。BERT$_{BASE}$和BERT$_{LARGE}$在所有任务上均大幅优于所有现有系统,相对SOTA结果分别取得了4.4%和6.7%的平均准确率提升。请注意,除注意力屏蔽方式外,BERT$_{BASE}$和OpenAI GPT在模型架构上几乎完全相同。对于规模最大、报告最广的GLUE任务MNLI,BERT比SOTA提高了4.7%的绝对准确率。在官方GLUE排行榜上,$^8$ BERT$_{LARGE}$的得分为80.4,而截至撰写本文时排行榜第一的系统OpenAI GPT的得分为72.8。

We find that BERT$_{LARGE}$ significantly outperforms BERT$_{BASE}$ across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.
我们发现,在所有任务中,尤其是训练数据很少的任务上,BERT$_{LARGE}$的性能明显优于BERT$_{BASE}$。模型大小的影响将在5.2节中更全面地探讨。

4.2 SQuAD v1.1 (SQuAD数据集)

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowd-sourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage. For example:
斯坦福大学问答数据集(SQuAD v1.1)收集了10万个众包问题/答案对(Rajpurkar et al., 2016)。 给定一个问题以及Wikipedia中包含答案的段落,任务是预测段落中包含答案的文本范围。举例:

  • Input Question:
    Where do water droplets collide with ice
    crystals to form precipitation?
  • Input Paragraph:
    … Precipitation forms as smaller droplets
    coalesce via collision with other rain drops
    or ice crystals within a cloud. …
  • Output Answer:
    within a cloud

This type of span prediction task is quite different from the sequence classification tasks of GLUE, but we are able to adapt BERT to run on SQuAD in a straightforward manner. Just as with GLUE, we represent the input question and paragraph as a single packed sequence, with the question using the A embedding and the paragraph using the B embedding. The only new parameters learned during fine-tuning are a start vector S ∈ $\mathbb{R}^{H}$ and an end vector E ∈ $\mathbb{R}^{H}$. Let the final hidden vector from BERT for the i-th input token be denoted as $T_i$ ∈ $\mathbb{R}^{H}$. See Figure 3 (c) for a visualization. Then, the probability of word i being the start of the answer span is computed as a dot product between $T_i$ and S followed by a softmax over all of the words in the paragraph:
这种span预测任务与GLUE的序列分类任务有很大不同,但我们能够以简单直接的方式让BERT在SQuAD上运行。与GLUE一样,我们把输入的问题和段落打包成一个序列来表示,问题使用A嵌入,段落使用B嵌入。微调阶段唯一需要学习的新参数是一个起始向量$S\in\mathbb{R}^H$和一个结束向量$E\in\mathbb{R}^H$。我们把BERT对第$i$个输入token输出的最终隐藏向量记为$T_i\in\mathbb{R}^H$,可视化示意见Figure 3(c)。那么,第$i$个词作为答案span起点的概率由$T_i$与$S$的点积再经过对段落中所有词的softmax得到:

$$P_i = \frac{e^{S\cdot T_i}}{\sum_{j} e^{S\cdot T_j}}$$

The same formula is used for the end of the answer span, and the maximum scoring span is used as the prediction. The training objective is the log-likelihood of the correct start and end positions.
answer span的结束位置的概率也使用相同的公式。得分最高的span做为预测的结果。训练目标是正确开始和结束位置的对数似然函数。

We train for 3 epochs with a learning rate of 5e-5 and a batch size of 32. At inference time, since the end prediction is not conditioned on the start, we add the constraint that the end must come after the start, but no other heuristics are used. The tokenized labeled span is aligned back to the original untokenized input for evaluation.
我们使用5e-5的学习率、32的batch size训练了3个epoch。在推理时,由于结束位置的预测并不以开始位置为条件,我们加入了“结束必须在开始之后”的约束,除此之外没有使用其他启发式方法。经过tokenize的已标注span会对齐回原始未tokenize的输入上,用于评估。
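
下面用 numpy 给出答案起止位置打分的一个示意性草图:按上式计算每个位置作为起点/终点的概率,并在“终点不早于起点”的约束下取得分最高的span;所有向量均用随机数代替真实的 BERT 输出。

```python
import numpy as np

rng = np.random.default_rng(0)
H, seq_len = 768, 20
T = rng.normal(size=(seq_len, H))   # 段落中每个 token 的最终隐藏向量(示意)
S = rng.normal(size=(H,))           # 可学习的起点向量
E = rng.normal(size=(H,))           # 可学习的终点向量

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

start_prob = softmax(T @ S)         # P_i = exp(S·T_i) / Σ_j exp(S·T_j)
end_prob = softmax(T @ E)           # 终点概率用同样的公式

# 推理时:在 end >= start 的约束下选取得分最高的 span
best = max(
    ((i, j) for i in range(seq_len) for j in range(i, seq_len)),
    key=lambda ij: start_prob[ij[0]] * end_prob[ij[1]],
)
print("预测的答案跨度:", best)
```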

Results are presented in Table 2. SQuAD uses a highly rigorous testing procedure where the submitter must manually contact the SQuAD organizers to run their system on a hidden test set, so we only submitted our best system for testing. The result shown in the table is our first and only Test submission to SQuAD. We note that the top results from the SQuAD leaderboard do not have up-to-date public system descriptions available, and are allowed to use any public data when training their systems. We therefore use very modest data augmentation in our submitted system by jointly training on SQuAD and TriviaQA (Joshi et al., 2017).
结果在Table 2中展示。SQuAD采用非常严格的测试流程,提交者必须手动联系SQuAD组织者,由其在隐藏的测试集上运行系统,因此我们只提交了最好的系统进行测试。表中显示的结果是我们向SQuAD提交的第一次也是唯一一次测试结果。我们注意到,SQuAD排行榜上排名最高的结果没有最新的公开系统描述,并且这些系统在训练时可以使用任何公开数据。因此,我们在提交的系统中只使用了非常有限的数据扩增:在SQuAD和TriviaQA(Joshi等人,2017)上联合训练。

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. If we fine-tune on only SQuAD (without TriviaQA) we lose 0.1-0.4 F1 and still outperform all existing systems by a wide margin.
我们表现最好的系统在整体上比排名靠前的系统F1值高出1.5个百分点,BERT做为单一系统时,它的性能比排名靠前的系统的F1高出1.3个点。事实上,我们的单BERT模型在F1成绩方面优于顶级的集成系统。如果我们只在SQuAD语料上进行微调(不使用TriviaQA),我们的F1将损失0.1-0.4个点,但相比于现有系统,仍有很大的优势。

4.3 Named Entity Recognition (命名实体识别)

To evaluate performance on a token tagging task, we fine-tune BERT on the CoNLL 2003 Named Entity Recognition (NER) dataset. This dataset consists of 200k training words which have been annotated as Person, Organization, Location, Miscellaneous, or Other (non-named entity).
为了评估在token标注任务上的表现,我们在CoNLL-2003命名实体识别(NER)数据集上微调了BERT。该数据集包含20万个训练词,这些词被标注为人名(Person)、机构(Organization)、地点(Location)、杂项(Miscellaneous)或Other(非命名实体)。

For fine-tuning, we feed the final hidden representation $T_i$ ∈ $\mathbb{R}^{H}$ for each token i into a classification layer over the NER label set. The predictions are not conditioned on the surrounding predictions (i.e., non-autoregressive and no CRF). To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the classifier. For example:
为了进行微调,我们把最后一层隐层向量的每个token( $T_i \in\mathbb{R}^H $) 输入到一个可以映射到NER标签集合的分类layer上。预测结果之间条件独立(即,非自回归,没有CRF)。为了与WordPiece相兼容,我们把CoNLL语料中已经tokenize过的数据再次输入到WordPiece中做tokenize,然后使用第一个子标记相对应的隐藏状态作为分类器的输入,例如:

Jim    Hen    ##son  was  a  puppet  ##eer
I-PER  I-PER  X      O    O  O       X

Where no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test. A visual representation is also given in Figure 3 (d). A cased WordPiece model is used for NER, whereas an uncased model is used for all other tasks.
其中X不做预测。由于WordPiece切分边界是输入的已知部分,训练和测试都采用这种做法。图3(d)给出了可视化示意。NER使用区分大小写(cased)的WordPiece模型,而其他所有任务使用不区分大小写(uncased)的模型。
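
下面是一个示意性草图,说明把词级NER标签对齐到WordPiece子词的做法:每个词只有第一个子词保留标签,其余子词标为X且不参与预测;其中的简化tokenizer仅用于复现上面的例子,并非真实的WordPiece实现。

```python
def toy_wordpiece(word):
    """极简化的演示用切分,仅用于复现上面的例子,并非真实的 WordPiece 实现。"""
    pieces = {"Henson": ["Hen", "##son"], "puppeteer": ["puppet", "##eer"]}
    return pieces.get(word, [word])

def align_labels(words, labels):
    """每个词只有第一个子词保留原标签,后续子词记为 'X'(不做预测)。"""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        for i, piece in enumerate(toy_wordpiece(word)):
            tokens.append(piece)
            aligned.append(label if i == 0 else "X")
    return tokens, aligned

words  = ["Jim", "Henson", "was", "a", "puppeteer"]
labels = ["I-PER", "I-PER", "O", "O", "O"]
print(align_labels(words, labels))
```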

Results are presented in Table 3. BERT$_{LARGE}$ outperforms the existing SOTA, Cross-View Training with multi-task learning (Clark et al., 2018), by +0.2 on CoNLL-2003 NER Test.
结果如Table 3所示。在CoNLL-2003 NER测试集上,BERT$_{LARGE}$比现有的SOTA,即结合多任务学习的Cross-View Training(Clark et al., 2018),高出0.2个点。


4.4 SWAG(常识推理)

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded common-sense inference (Zellers et al., 2018).
SWAG(Situations With Adversarial Generations)数据集包含11.3万个句子对补全示例,用于评估基于具体情境的常识推理(Zellers等人,2018)。

Given a sentence from a video captioning dataset, the task is to decide among four choices the most plausible continuation. For example:
给定视频字幕数据集中的句子,任务是在四个备选答案中选择最合理的下文。 例如:

A girl is going across a set of monkey bars. She
(i) jumps up across the monkey bars.
(ii) struggles onto the bars to grab her head.
(iii) gets to the end stands on a wooden plank.
(iv) jumps up and does a back flip.

Adapting BERT to the SWAG dataset is similar to the adaptation for GLUE. For each example, we construct four input sequences, which each contain the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameters we introduce is a vector V ∈ $\mathbb{R}^{H}$, whose dot product with the final aggregate representation $C_i$ ∈ $\mathbb{R}^{H}$ denotes a score for each choice i. The probability distribution is the softmax over the four choices:
将BERT用于SWAG数据集的方式与GLUE类似。对每个样本,我们构造4个输入序列,每个序列由给定的句子(句子A)和一个可能的后续句(句子B)拼接而成。我们引入的唯一任务特定参数是一个向量$V\in\mathbb{R}^H$,它与最终聚合表征$C_i\in\mathbb{R}^H$的点积表示选项$i$的得分。概率分布是对这4个选项得分的softmax:
$$P_i = \frac{e^{V\cdot C_i}}{\sum_{j=1}^{4} e^{V\cdot C_j}}$$
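
下面用 numpy 给出上式的一个示意性计算草图:对四个选项分别取聚合表征 $C_i$,与任务向量 $V$ 做点积后再对四个得分做softmax;向量均用随机数代替真实的 BERT 输出,仅作演示。

```python
import numpy as np

rng = np.random.default_rng(0)
H, num_choices = 768, 4
C = rng.normal(size=(num_choices, H))  # 四个 (句子A, 选项B_i) 序列各自的聚合表征(示意)
V = rng.normal(size=(H,))              # 微调时唯一新增的任务向量

scores = C @ V                         # 每个选项 i 的得分 V·C_i
P = np.exp(scores - scores.max())
P /= P.sum()                           # 对四个选项做 softmax
print("各选项概率:", P, "预测:", int(P.argmax()))
```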

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4. BERT$_{LARGE}$ outperforms the authors’ baseline ESIM+ELMo system by +27.1%.
我们使用lr = 2e-5、batch size = 16、epochs = 3的设置对模型进行了微调。结果在Table 4中展示。BERT$_{LARGE}$比作者给出的ESIM+ELMo基线系统高出27.1%。
[Table 4]

5 Ablation Studies(消融实验)

Although we have demonstrated extremely strong empirical results, the results presented so far have not isolated the specific contributions from each aspect of the BERT framework. In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance.
尽管我们已经展示了非常强的实验结果,但到目前为止的结果还没有把BERT框架各个方面的具体贡献区分开来。在本节中,我们对BERT的多个方面进行消融实验,以便更好地理解它们的相对重要性。

5.1 Effect of Pre-training Tasks(预训练任务的影响)

One of our core claims is that the deep bidirectionality of BERT, which is enabled by masked LM pre-training, is the single most important improvement of BERT compared to previous work. To give evidence for this claim, we evaluate two new models which use the exact same pre-training data, fine-tuning scheme and Transformer hyper-parameters as BERTBASE:
我们的核心观点之一是,与以前的工作相比,由掩盖式语言模型预训练所实现的BERT的深度双向性,是BERT最重要的一项改进。为了证明这一观点,我们评估了两个新模型,它们使用与BERT$_{BASE}$完全相同的预训练数据、微调方案和Transformer超参数:

  1. No NSP(不使用NSP任务):

    A model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.
    模型只使用“masked LM”(MLM)进行训练,没有使用“下一句预测”(NSP)任务。

  2. LTR & No NSP(LTR 及不使用NSP任务):

    A model which is trained using a Left-to-Right (LTR) LM, rather than an MLM. In this case, we predict every input word and do not apply any masking. The left-only constraint was also applied at fine-tuning, because we found it is always worse to pre-train with left-only-context and fine-tune with bidirectional context. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme.
    使用从左到右(LTR)语言模型而不是MLM训练的模型。在这种情况下,我们预测每个输入词,不做任何屏蔽。微调时同样施加只看左侧上下文的约束,因为我们发现用仅左侧上下文预训练、再用双向上下文微调,效果总是更差。此外,该模型的预训练没有使用NSP任务。它可以与OpenAI GPT直接对比,只是使用了我们更大的训练数据集、我们的输入表示和我们的微调方案。

Results are presented in Table 5. We first examine the impact brought by the NSP task. We can see that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD. These results demonstrate that our pre-training method is critical in obtaining the strong empirical results presented previously.
结果显示在表5中。我们首先检查NSP任务带来的影响。 我们可以看到,删除NSP任务会严重降低QNLI,MNLI和SQuAD任务上的表现。 这些结果表明,我们的NSP预训练方法对于获得前文提到的有力的实验结果至关重要。
[Table 5]

Next, we evaluate the impact of training bidirectional representations by comparing “No NSP” to “LTR & No NSP”. The LTR model performs worse than the MLM model on all tasks, with extremely large drops on MRPC and SQuAD. For SQuAD it is intuitively clear that an LTR model will perform very poorly at span and token prediction, since the token-level hidden states have no right-side context. For MRPC it is unclear whether the poor performance is due to the small data size or the nature of the task, but we found this poor performance to be consistent across a full hyperparameter sweep with many random restarts. In order to make a good faith attempt at strengthening the LTR system, we tried adding a randomly initialized BiLSTM on top of it for fine-tuning. This does significantly improve results on SQuAD, but the results are still far worse than the pre-trained bidirectional models. It also hurts performance on all four GLUE tasks.
接下来,我们通过比较“No NSP”和“LTR & No NSP”来评估双向表征训练的影响。LTR模型在所有任务上的表现都比MLM模型差,在MRPC和SQuAD上的下降尤其大。对于SQuAD,直观上很清楚,由于token级的隐藏状态没有右侧上下文,LTR模型在span和token预测上的表现会非常差。对于MRPC,尚不清楚表现差是由于数据量小还是任务本身的性质,但我们发现在多次随机重启的完整超参数扫描中,这种差的表现是一致的。为了尽力增强LTR系统,我们尝试在其上添加一个随机初始化的BiLSTM再进行微调。这确实显著改善了SQuAD上的结果,但仍然远差于预训练的双向模型,而且还会降低所有四个GLUE任务上的表现。

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) it is strictly less powerful than a deep bidirectional model, since a deep bidirectional model could choose to use either left or right context.
我们也意识到,可以像ELMo那样分别训练LTR和RTL模型,并把每个token表示为两个模型输出的拼接。但是:(a)这样做的开销是单个双向模型的两倍;(b)对于QA这样的任务不直观,因为RTL模型无法以问题为条件来生成答案;(c)严格来说它不如深度双向模型强大,因为深度双向模型可以自行选择使用左侧或右侧上下文。

5.2 Effect of Model Size(模型大小的影响)

In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure as described previously.
在本节中,我们探索模型大小对微调任务准确性的影响。我们训练了许多具有不同层数,隐藏单元和attention heads的BERT模型,而其他方面则使用了与前面所述相同的超参数和训练过程。

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE contains 110M parameters and BERTLARGE contains 340M parameters.
表6中显示了选定GLUE任务上的结果。表中我们报告了5次随机重启微调后在开发集上的平均准确率。可以看到,更大的模型在全部4个数据集上都带来了稳定的准确率提升,即使对于只有3,600条有标注训练样本、并且与预训练任务差异很大的MRPC也是如此。同样也许令人惊讶的是,在相对现有文献而言已经相当大的模型之上,我们仍能取得如此显著的提升。例如,Vaswani等人(2017)探索的最大Transformer为(L=6,H=1024,A=16),编码器参数约100M,而我们在文献中见到的最大Transformer为(L=64,H=512,A=2),参数约235M(Al-Rfou et al., 2018)。相比之下,BERT$_{BASE}$包含110M参数,BERT$_{LARGE}$包含340M参数。
[Table 6]

It has been known for many years that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate that scaling to extreme model sizes also leads to large improve-ments on very small scale tasks, provided that the model has been sufficiently pre-trained.
多年以来,人们都知道增大模型规模可以持续提升机器翻译、语言建模等大规模任务的效果,表6中给出的留出(held-out)训练数据上的语言模型困惑度也证明了这一点。然而,我们认为这是第一个证明在模型已被充分预训练的前提下,把模型规模扩展到极致同样能在非常小规模的任务上带来巨大提升的工作。

5.3 Effect of Number of Training Steps(训练步数的影响)

Figure 4 presents MNLI Dev accuracy after fine-tuning from a checkpoint that has been pre-trained for k steps. This allows us to answer the following questions:
图4展示了从预训练k步的检查点出发进行微调后,在MNLI开发集上的准确率。这使我们能够回答以下问题:

  1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy? Answer: Yes, BERTBASE achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps.
    问题:BERT真的需要如此大的预训练量(128,000 words/batch * 1,000,000 steps)才能达到较高的微调准确率吗?
    回答:是的,与预训练50万步相比,BERT$_{BASE}$预训练100万步后在MNLI上的准确率提升了近1.0%。

  2. Question: Does MLM pre-training converge slower than LTR pre-training, since only 15% of words are predicted in each batch rather than every word? Answer: The MLM model does converge slightly slower than the LTR model. However, in terms of absolute accuracy the MLM model begins to outperform the LTR model almost immediately.
    问题:相比于LTR的每个batch预测所有字,MLM每个batch只有15%的数据会被预测,所以MLM 预训练的收敛速度是否比 LTR 预训练收敛速度慢?
    回答:MLM模型的收敛速度确实比LTR模型稍慢。 然而,就绝对精度而言,MLM模型几乎在训练一开始就优于LTR模型。

    [Figure 4]

5.4 Feature-based Approach with BERT(BERT中基于特征的方法)

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a down-stream task. However, the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages. First, not all NLP tasks can be easily be represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to being able to pre-compute an expensive representation of the training data once and then run many experiments with less expensive models on top of this representation.
到目前为止,我们展示的所有BERT结果都采用微调的方法:在预训练模型上添加一个简单的分类层,并在下游任务上联合微调所有参数。然而,从预训练模型中提取固定特征的基于特征的方法也有一定优势。首先,并非所有NLP任务都能方便地用Transformer编码器架构来表示,因而需要添加任务特定的模型结构。其次,只需一次性付出较高代价预先计算训练数据的表征,之后就可以在这一表征之上用开销更小的模型运行许多实验,这在计算上有很大的好处。

In this section we evaluate how well BERT performs in the feature-based approach by generating ELMo-like pre-trained contextual representations on the CoNLL-2003 NER task. To do this, we use the same input representation as in Section 4.3, but use the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.
在本节中,我们通过在CoNLL-2003 NER任务上生成类似ELMo的预训练上下文表征,来评估BERT在基于特征的方法中的表现。为此,我们使用与4.3节相同的输入表示,但只使用来自一层或多层的激活值,而不微调BERT的任何参数。这些上下文嵌入被用作分类层之前一个随机初始化的两层768维BiLSTM的输入。

Results are shown in Table 7. The best performing method is to concatenate the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both the fine-tuning and feature-based approaches.
结果如表7所示。表现最好的方法是把预训练Transformer最上面四个隐藏层的token表征拼接起来,其F1只比微调整个模型低0.3。这表明BERT对于微调和基于特征的方法都是有效的。
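
下面用 numpy 给出“拼接最上面四个隐藏层的token表征,作为下游两层BiLSTM输入特征”这一做法的示意性草图;这里只演示特征的拼接,各层输出用随机数代替,BiLSTM本身未包含在内。

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, seq_len, H = 12, 8, 768
# 假设这是冻结参数的 BERT 各层对一个句子的输出(此处用随机数代替)
all_layer_outputs = rng.normal(size=(num_layers, seq_len, H))

# 取最后(最上面)四层,在特征维度上拼接,得到每个 token 的 4H 维特征
top4 = all_layer_outputs[-4:]             # (4, seq_len, H)
features = np.concatenate(top4, axis=-1)  # (seq_len, 4H)
print(features.shape)                     # (8, 3072) -> 作为两层 BiLSTM 的输入
```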

[Table 7]

6 Conclusion(结束语)

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from very deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.
最近,基于语言模型的迁移学习所带来的实验效果提升表明,丰富的、无监督的预训练是许多语言理解系统不可或缺的组成部分。特别地,这些结果使得低资源任务也能从非常深的单向架构中受益。我们的主要贡献是将这些发现进一步推广到深度双向架构,使同一个预训练模型能够成功地处理一系列广泛的NLP任务。

While the empirical results are strong, in some cases surpassing human performance, important future work is to investigate the linguistic phenomena that may or may not be captured by BERT.
虽然实验结果很强,在某些情况下甚至超过了人类的表现,但未来的重要工作是研究哪些语言现象能够被BERT捕获、哪些不能。

References

  1. Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

  2. Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.

  3. Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC. NIST.

  4. John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics.

  5. Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP. Association for Computational Linguistics.

  6. Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

  7. Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

  8. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

  9. Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.

  10. Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. 2018. Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370.

  11. Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08.

  12. Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

  13. Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

  14. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.

  15. William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

  16. Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.

  17. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics.

  18. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.

  19. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

  20. Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

  21. Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

  22. Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.

  23. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

  24. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

  25. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

  26. Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

  27. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

  28. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

  29. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

  30. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

  31. Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

  32. Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics.

  33. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 384–394.

  34. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

  35. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

  36. Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

  37. A. Warstadt, A. Singh, and S. R. Bowman. 2018. Corpus of linguistic acceptability.

  38. Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

  39. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

  40. Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.

  41. Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

  42. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.
