Google AI paper on BERT, the bidirectional encoder representation model: state of the art on 11 NLP machine reading comprehension benchmarks (reply "谷歌BERT论文" to this public account to download the color-annotated PDF of the paper)
Original: Qin Longji (秦陇纪), 数据简化DataSimp, today
DataSimp editor's note: The Google AI Language paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" introduces a new language representation model, BERT (Bidirectional Encoder Representations from Transformers). Unlike recent language representation models, BERT pre-trains deep bidirectional representations by conditioning on both left and right context in all layers. It is the first fine-tuning based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures and setting new state-of-the-art results on 11 NLP tasks, arguably the strongest NLP pre-training model to date and a likely foundation for future work in the field. This article presents the BERT paper; a simplified version of the BERT source code was released on October 30, and we will analyze it in a later issue. Happy studying! To advance human civilization we cannot stop at knocking on doors and shouting; a life of designs that are never built is a life wasted; engineering ability matters above all. Qin Longji shares this encouragement with you.
Google AI paper on BERT, the bidirectional encoder representation model: state of the art on 11 NLP machine reading comprehension benchmarks (62,264 characters)
Contents
A. Google AI paper on the BERT bidirectional encoder representation model (58,914 characters)
1. Introduction
2. Related Work
3. BERT: Bidirectional Encoder Representations from Transformers
4. Experiments
5. Ablation Studies
6. Conclusion
References
B. BERT surpasses humans on 11 NLP machine reading comprehension tasks (2,978 characters)
1. Main contributions of the BERT model
2. How BERT differs from two other models
References (1,214 characters); Appendix (845 characters): About the 数据简化DataSimp community
A. Google AI paper on the BERT bidirectional encoder representation model (58,914 characters)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Text | the BERT authors, Google AI Language; Translation | Qin Longji (秦陇纪), 数据简化DataSimp, 2018-10-13 (Sat) to 2018-11-03 (Sat)
Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper: https://arxiv.org/pdf/1810.04805.pdf
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Affiliation: Google AI Language, {jacobdevlin,mingweichang,kentonl,kristout}@google.com
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5 absolute improvement), outperforming human performance by 2.0.
In addition to the abstract above, the paper has six sections: Introduction, Related Work, BERT, Experiments, Ablation Studies, and Conclusion, followed by 42 references.
1. Introduction
Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017, 2018; Radford et al., 2018; Howard and Ruder, 2018). These tasks include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), where models are required to produce fine-grained output at the token level. (Translator's note 1: "token" ordinarily means a sign, keepsake, or voucher; in linguistics and computing it denotes a lexical unit. The translator prefers the rendering 符标, but follows the common Chinese NLP convention of translating token as 词块.)
There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters. In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
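To make the two transfer strategies concrete, here is a minimal PyTorch-style sketch (not from the paper; the encoder, head, and learning rates are illustrative assumptions): the feature-based route freezes the pre-trained encoder and trains only a task-specific head on its outputs, while the fine-tuning route updates the pre-trained parameters together with a small task head.

```python
import torch
from torch import nn

class TaskHead(nn.Module):
    """Illustrative task-specific head; not an API from the paper."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.classifier(features)

def feature_based_setup(encoder: nn.Module, head: TaskHead):
    # Feature-based transfer (ELMo-style): freeze the pre-trained encoder and
    # train only the task-specific architecture on top of its fixed features.
    for p in encoder.parameters():
        p.requires_grad_(False)
    return torch.optim.Adam(head.parameters(), lr=1e-3)

def fine_tuning_setup(encoder: nn.Module, head: TaskHead):
    # Fine-tuning transfer (OpenAI GPT / BERT-style): update all pre-trained
    # parameters jointly with the minimal task-specific parameters.
    params = list(encoder.parameters()) + list(head.parameters())
    return torch.optim.Adam(params, lr=2e-5)
```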
We argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be devastating when applying fine-tuning based approaches to token-level tasks such as SQuAD question answering (Rajpurkar et al., 2016), where it is crucial to incorporate context from both directions.
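The small sketch below (illustrative only, using PyTorch tensors) contrasts the causal attention mask of a left-to-right model such as OpenAI GPT with the unrestricted mask available to a bidirectional encoder; True marks positions a token is allowed to attend to.

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """(seq_len, seq_len) boolean mask; entry [i, j] says whether token i may attend to token j."""
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)
    if causal:
        # Left-to-right LM (e.g. OpenAI GPT): token i sees only positions j <= i.
        return torch.tril(full)
    # Bidirectional encoder (BERT): every token sees the whole sequence.
    return full

print(attention_mask(4, causal=True))   # lower-triangular
print(attention_mask(4, causal=False))  # all True
```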
In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT addresses the previously mentioned unidirectional constraints by proposing a new pre-training objective: the "masked language model" (MLM), inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also introduce a "next sentence prediction" task that jointly pre-trains text-pair representations.
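As a rough illustration of how MLM and next-sentence-prediction training data can be built, here is a minimal Python sketch. It is not the paper's exact recipe (which, among other things, sometimes keeps or replaces the selected token instead of masking it); the masking rate and helper names are assumptions for illustration.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens; the originals become prediction targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            targets.append(tok)    # the model must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)   # no prediction loss at unmasked positions
    return inputs, targets

def next_sentence_example(sent_a, sent_b, corpus_sentences, rng=random):
    """Build a next-sentence-prediction pair: keep the true next sentence half
    the time (label 1), otherwise substitute a random sentence (label 0)."""
    if rng.random() < 0.5:
        return (sent_a, sent_b), 1
    return (sent_a, rng.choice(corpus_sentences)), 0

print(mask_tokens("my dog is hairy and likes to play".split(), seed=0))
```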
The contributions of our paper are as follows:
• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs (language models).
• We show that pre-trained representations eliminate the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.
• BERT advances the state of the art for eleven NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained model will be available at goo.gl/language/bert.¹
¹ Will be released before the end of October 2018.
2. Related Work
There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.
2.1 Feature-based Approaches
Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).
These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are typically used as features in a downstream model.
ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension, proposing to extract context-sensitive features from a language model. When contextual word embeddings are integrated with existing task-specific architectures, ELMo advances the state of the art on several major NLP benchmarks (Peters et al., 2018), including SQuAD question answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).
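As a concrete example of the feature-based use of pre-trained embeddings described above, the sketch below loads externally learned word vectors as frozen features for a downstream classifier; the vectors, sizes, and model are placeholder assumptions, not part of the paper.

```python
import torch
from torch import nn

# Stand-in for word vectors learned elsewhere (e.g. word2vec or GloVe);
# random numbers here, only the wiring is the point.
vocab_size, dim = 1000, 50
pretrained_vectors = torch.randn(vocab_size, dim)

# Frozen pre-trained embeddings supplied as features to a downstream model.
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
classifier = nn.Linear(dim, 2)

token_ids = torch.tensor([[1, 42, 7]])        # one toy sentence
features = embedding(token_ids).mean(dim=1)   # crude fixed sentence feature
logits = classifier(features)                 # only the classifier trains
```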