LLMs之GPT:《Improving Language Understanding by Generative Pre-Training》翻译与解读


导读:这篇文章介绍了通过预训练来提升自然语言理解能力的研究工作。

背景:传统监督学习方法依赖大量标注数据,但标注数据成本高且不足。仅利用词级信息预训练的方法难以获得高质量的语义表示

方法:利用大量非标注文本数据,采用语言模型目标函数进行Transformer网络的无监督预训练。将预训练模型微调到各下游任务,实现任务知识的转换。

成果:在包括文本蕴涵(句子关系判断)、问答、文本分类等12个任务中的9个任务上刷新了现有最优结果。模型通用性强,对不同规模的数据集效果都好。

设计原理:利用Transformer结构有助于捕获长程依赖关系。输入转换方式减少了微调阶段对模型结构的改动。

优点

>> 预训练阶段对计算资源需求较大,但预训练完成后,微调阶段效率很高。

>> 模型推广能力强,对新任务有一定的零样本(zero-shot)能力。

限制

>> 理解能力依赖文本数据,难免存在偏差。

>> 通用模型仍容易在新的情况下表现不稳定。

总之,这项工作验证了预训练方法可以获得通用语义知识,通过微调可实现自然语言多任务学习,取得了很好的效果。

目录

GPT的demo体验

T1、OpenAI API

T2、Inferkit工具(文本生成)

NLP之Inferkit:Inferkit工具(文本生成)的简介、使用方法、案例应用之详细攻略

T3、国内类似GPT的demo:悟道测试

博客文章《Improving language understanding with unsupervised learning》翻译与解读

引言

两个阶段:基于transformer无监督预训练+语言建模→有监督微调

为什么选择无监督学习?

缺点

未来

计算

论文《Improving Language Understanding by Generative Pre-Training》翻译与解读

Abstract

NLP领域标注数据的稀缺性→采用无监督预训练+特定任务微调(任务感知的输入变换+有效迁移)→证明了方法有效性

1、Introduction

从原始文本中学习对减轻对监督学习的依赖至关重要→利用未标记数据的语言信息(无监督学习),比如词嵌入的成功→但更需要利用更高层次的信息

半监督学习方法实现迁移学习:采用无监督预训练和有监督微调相结合的半监督方法→两阶段训练过程(使用语言建模目标在未标记数据上学习神经网络模型的初始参数+使用相应的监督目标将这些参数适应到目标任务)

模型架构选择Transformer(更适合处理文本中的长依赖关系)→特定任务的输入适应

评估结果:12个任务中有9个显著改进

2 、Related Work

本研究主要属于自然语言处理领域的半监督学习:词级别信息—早期【词频统计】→近期【词嵌入】,高级语义信息—短语级和句子级嵌入

无监督预训练:适应任务,早期【图像分类/回归任务】→当前【训练深度神经网络】,网络基座,LSTM→Transformer

辅助训练目标:半监督学习的替代形式

3、Framework:两个阶段

3.1、无监督的预训练阶段UPT

3.2、有监督的微调阶段SFT

Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer.图1:(左)此工作中使用的Transformer架构和训练目标。(右)对不同任务进行微调的输入转换。我们将所有结构化输入转换为标记序列,由我们的预训练模型处理,然后是线性+softmax层。

不同的任务模型的输入token序列不同

3.3、特定于任务的输入转换

提出了一种任务输入转换的方法

输入转换示例

4、Experiments

4.1、Setup

Table 1: A list of the different tasks and datasets used in our experiments.表1:我们实验中使用的不同任务和数据集的列表。

无监督预训练数据集

模型架构:基于Transformer的解码器,包括Masked self-attention、position-wise feed-forward networks、Adam、余弦调度学习率

调整细节:重用了无监督预训练的超参数,3个epoch、线性学习率衰减计划

4.2、Supervised fine-tuning

自然语言推理、问题回答、语义相似性和文本分类

Table 3: Results on question answering and commonsense reasoning, comparing our model with current state-of-the-art methods.. 9x means an ensemble of 9 models. 表3:问题回答和常识推理的结果,将我们的模型与当前最先进的方法进行比较。9x表示9个模型的集合。

Table 4: Semantic similarity and classification results, comparing our model with current state-of-the-art methods. All task evaluations in this table were done using the GLUE benchmark. (mc= Mathews correlation, acc=Accuracy, pc=Pearson correlation)表4:语义相似性和分类结果,将我们的模型与当前最先进的方法进行比较。本表中的所有任务评估都是使用GLUE基准测试完成的。(mc= Mathews相关性,acc=准确度,pc=Pearson相关性)

5、Analysis

Figure 2: (left) Effect of transferring increasing number of layers from the pre-trained language model on RACE and MultiNLI. (right) Plot showing the evolution of zero-shot performance on different tasks as a function of LM pre-training updates. Performance per task is normalized between a random guess baseline and the current state-of-the-art with a single model.图 2:(左)从 RACE 和 MultiNLI 上的预训练语言模型转移更多层的效果。 (右)绘图显示了不同任务的零样本性能随 LM 预训练更新的变化。 每个任务的性能在随机猜测基线和当前最先进的单个模型之间进行标准化。

Table 5: Analysis of various model ablations on different tasks. Avg. score is a unweighted average of all the results. (mc= Mathews correlation, acc=Accuracy, pc=Pearson correlation)表5:不同任务下的各种模型消融分析。平均分是所有成绩的未加权平均。(mc= Mathews相关性,acc=准确度,pc=Pearson相关性)

层次数的转移影响:每一层都包含对解决目标任务有用的功能

Zero-shot行为:Transformer体系结构的归纳偏差有助于迁移。

消融实验:采用辅助语言建模、选择大型数据集、选择Transformer、要有预训练

6、Conclusion

引入框架——生成式预训练和判别式微调

无监督训练提升性能

References


GPT的demo体验

T1、OpenAI API

地址:https://beta.openai.com/

T2、Inferkit工具(文本生成)

NLP之Inferkit:Inferkit工具(文本生成)的简介、使用方法、案例应用之详细攻略

https://yunyaniu.blog.csdn.net/article/details/135027610

T3、国内类似GPT的demo:悟道测试

地址:https://models.aminer.cn/democenter

博客文章《Improving language understanding with unsupervised learning》翻译与解读

地址

博客文章Improving language understanding with unsupervised learning

时间2018年6月11日

作者

OpenAI

总结

导读:这篇文章介绍了通过预训练来提升自然语言理解能力的研究工作。

背景:传统监督学习方法依赖大量标注数据,但标注数据成本高且不足。仅利用词级信息预训练的方法难以获得高质量的语义表示

方法:利用大量非标注文本数据,采用语言模型目标函数进行Transformer网络的无监督预训练。将预训练模型微调到各下游任务,实现任务知识的转换。

成果:在包括文本蕴涵(句子关系判断)、问答、文本分类等12个任务中的9个任务上刷新了现有最优结果。模型通用性强,对不同规模的数据集效果都好。

设计原理:利用Transformer结构有助于捕获长程依赖关系。输入转换方式减少了微调阶段对模型结构的改动。

优点

>> 预训练阶段对计算资源需求较大,但预训练完成后,微调阶段效率很高。

>> 模型推广能力强,对新任务有一定的零样本(zero-shot)能力。

限制

>> 理解能力依赖文本数据,难免存在偏差。

>> 通用模型仍容易在新的情况下表现不稳定。

总之,这项工作验证了预训练方法可以获得通用语义知识,通过微调可实现自然语言多任务学习,取得了很好的效果。

引言

我们使用一种可扩展的、与任务无关的系统,在一系列多样化的语言任务中取得了最先进的结果,并发布了这个系统。我们的方法是两种现有思想的结合:transformers 和无监督预训练。这些结果提供了一个令人信服的例子,即将监督学习方法与无监督预训练结合起来效果非常好;这是许多人过去已经探索过的一个想法,我们希望我们的结果能激发进一步的研究,将这个想法应用到更大、更多样化的数据集上。

两个阶段:基于transformer无监督预训练+语言建模→有监督微调

我们的系统分为两个阶段;首先,我们以无监督的方式在大量数据上训练一个transformer模型——使用语言建模作为训练信号——然后我们在较小的监督数据集上对这个模型进行微调,以帮助它解决特定任务。我们在情感神经元工作的基础上发展了这种方法,在那项工作中我们注意到,当在足够的数据上进行无监督学习时,可以产生出令人惊讶的区分特征。在这里,我们想进一步探索这个想法:我们是否能够开发一个模型,在大量数据上以无监督的方式训练它,然后微调模型以在许多不同任务上取得良好的性能?我们的结果表明,这种方法效果非常好;相同的核心模型可以在最小的调整下微调,用于非常不同的任务。

这项工作建立在《半监督序列学习》中介绍的方法基础上,该方法展示了如何通过使用LSTM的无监督预训练,然后进行监督微调,来提高文档分类性能。它还延伸了ULMFiT的研究,该研究表明,可以通过微调单一的与数据集无关的LSTM语言模型,在各种文档分类数据集上获得最先进的性能;我们的工作展示了在这种方法中如何使用基于Transformer的模型,以在文档分类以外的更广泛任务上成功,例如常识推理语义相似性阅读理解。它与ELMo类似,但更加与任务无关,ELMo结合了预训练但使用任务定制的架构,在广泛的任务套件上获得最先进的结果。

我们在实现这些结果时只需要很少的调优。所有数据集使用单一的前向语言模型,没有任何集成,大多数报告的结果使用完全相同的超参数设置。

我们特别激动的一个结果是我们的方法在测试常识推理和阅读理解的三个数据集(COPA、RACE和ROCStories)上的表现。我们的模型在这些数据集上大幅领先,取得了新的最先进结果。这些数据集被认为需要多句子推理和丰富的世界知识来解决,这表明我们的模型主要通过无监督学习提高这些技能。这表明通过无监督技术有望发展复杂的语言理解能力

为什么选择无监督学习?

监督学习是近期机器学习成功的核心。然而,为了取得良好的效果,它可能需要大量、经过精心清理的、昂贵的数据集。无监督学习具有吸引力,因为它有潜力解决这些缺点。由于无监督学习消除了明确人工标注的瓶颈,它也能够很好地适应当前计算能力增强和原始数据可用性提高的趋势。无监督学习是一个非常活跃的研究领域,但其实际应用通常仍然受到限制。

最近有一波尝试通过使用无监督学习来增强大量未标记数据的系统,以进一步提升语言能力。通过无监督技术训练的词汇表示可以使用包含几TB信息的大型数据集,并在与监督学习结合时提高各种NLP任务的性能。直到最近,用于NLP的这些无监督技术(例如GLoVeword2vec)使用简单的模型(单词向量)和训练信号(单词的局部共现关系)。Skip-Thought Vectors是一个引人注目的早期演示,展示了更复杂方法可能实现的潜在改进。但是现在正在使用一些进一步提高性能的新技术。这些技术包括使用预训练的句子表示模型、上下文化的单词向量(尤其是ELMoCoVE),以及使用定制架构将无监督预训练与监督微调融合的方法,就像我们自己的方法。

我们还注意到,即使不在具体任务上训练,我们也可以直接使用底层语言模型来执行这些任务。例如,在多项选择题中选出正确答案的性能,会随着底层语言模型的改进而稳步提高。尽管这些方法的绝对性能与最先进的监督方法相比通常仍然相当低(在问答任务上,它仍然被一个简单的滑动窗口基线超越),但令人鼓舞的是,这种行为在广泛的任务集合中都表现出鲁棒性。不包含任何任务和世界信息的随机初始化网络,使用这些启发式方法的效果并不比随机猜测好。这为生成式预训练为什么能提高下游任务的性能提供了一些见解。

我们还可以使用模型中已有的语言功能执行情感分析。对于斯坦福情感树库数据集,该数据集包含来自积极和消极电影评论的句子,我们可以使用语言模型通过在句子后添加词语“very”,并观察模型更可能预测“positive”还是“negative”来猜测评论是积极还是消极。这种方法在没有对模型进行任何任务适应的情况下,与经典基线相当,精度约为80%。

我们的工作还验证了Transformer架构的稳健性和实用性,表明在不需要复杂的任务特定定制或超参数调整的情况下,它足以在各种任务上取得最先进的结果。

缺点

这个项目有一些值得注意的未解决问题:

>> 计算要求:许多先前的NLP任务方法在单个GPU上训练相对较小的模型。我们的方法需要昂贵的预训练步骤——在8个GPU上进行1个月的训练。幸运的是,这只需要做一次,我们正在发布我们的模型,以便其他人可以避免这个步骤。它还是一个大型模型(与先前工作相比),因此使用了更多的计算和内存——我们使用了一个37层(12块)的Transformer架构,并在长达512个令牌的序列上进行训练。大多数实验是在4和8个GPU系统上进行的。该模型对新任务进行微调非常快,有助于缓解额外的资源需求。

>> 通过文本学习世界的限制和偏见:互联网上现成的书籍和文本不包含关于世界的完整甚至准确的信息。最近的工作表明,通过文本学习某些类型的信息是困难的,其他研究表明模型会学习和利用数据分布中的偏见

>> 仍然存在脆弱的泛化性:尽管我们的方法改善了广泛任务上的性能,但当前的深度学习NLP模型仍然会表现出令人惊讶的、反直觉的行为,特别是在系统性、对抗性或分布外的评估中。尽管我们的方法也不能完全避免这些问题,但我们观察到了一些进展的迹象。我们的方法在文本蕴涵方面比以前的纯神经方法表现出更好的词汇鲁棒性。在Glockner等人(2018)引入的数据集上,我们的模型达到了83.75%,表现与通过WordNet整合外部知识的KIM相近。

未来

>> 扩展该方法:我们观察到,语言模型性能的提升与下游任务性能的提升是相关的。我们目前使用的是通用硬件(一台8 GPU机器)和只有几千本书(约5GB文本)的训练数据集。这表明,用这套已被验证的方法配合更多的计算和数据,还有很大的改进空间。

>> 改进微调:我们的方法目前非常简单。使用更复杂的适应和迁移技术,比如ULMFiT中探索的技术,可能会取得重大进展。

>> 更好地理解为什么生成预训练有帮助:尽管我们在这里讨论了一些我们偏爱的观点,但更有针对性的实验和研究将有助于区分相互竞争的解释。例如,我们观察到的好处有多少是由于处理更广泛背景的能力的提高而不是世界知识的提高?

计算

我们越来越关注我们在模型训练中消耗的计算与最终输出之间的关系。训练该模型所使用的总计算量为0.96 petaflop days(pfs-days)。

8 P600 GPU × 30 天 × 12 TFLOPS/GPU × 0.33 利用率 ≈ 0.96 pfs-days
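正文给出了训练总计算量0.96 pfs-days的来源,下面用一小段Python核对这笔算术(其中0.33为约数,按1/3计算恰好得到0.96;变量命名为本文示意,并非官方脚本):

```python
# 核对上文的 pfs-days(petaflop/s-days)估算:
# 1 pfs-day = 以 10^15 FLOP/s 的速度持续运行一天
gpus = 8              # 8块 P600 GPU
days = 30             # 训练约30天
tflops_per_gpu = 12   # 每块GPU约12 TFLOPS峰值算力
utilization = 1 / 3   # 利用率,文中取约0.33

pfs_days = gpus * days * (tflops_per_gpu * 1e12) * utilization / 1e15
print(f"{pfs_days:.2f} pfs-days")  # 输出:0.96 pfs-days
```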

论文《Improving Language Understanding by Generative Pre-Training》翻译与解读

地址

论文地址01:https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

论文地址02:https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

博客文章Improving language understanding with unsupervised learning

时间2018年6月11日

作者

OpenAI

Alec Radford,OpenAI,alec@openai.com

Karthik Narasimhan,OpenAI,karthikn@openai.com

Tim Salimans,OpenAI,tim@openai.com

Ilya Sutskever,OpenAI,ilyasu@openai.com

总结

介绍了一种通过无监督预训练监督微调的方法来提升自然语言理解任务的效果。

背景痛点:自然语言理解任务依赖于大量标记数据训练,但标注数据成本高,数量不足。仅利用词向量预训练无法学习到较高水平的语义表示

解决方案

>> 利用大量未标注文本数据,采用语言模型目标函数进行变换器的无监督预训练,使模型学习到通用语义表示

>> 将预训练模型微调至各自然语言理解任务,采用任务特定的输入转换和监督训练,实施知识迁移

核心特点

>> 采用Transformer网络结构,能更好捕捉长距离依存关系。

>> 输入转换方式使结构化输入以序列形式输入模型,减少微调阶段对模型的改动。

>> 预训练模型具有一定的零样本能力,可以在未见数据上完成部分任务。

优势

>> 在12个语言理解任务中的9个任务上超越了目前最先进的方法。

>> 对不同规模数据集效果都比较好,可扩展性强

>> 只需少量参数微调即可实现知识迁移,模型设计简单。

总之,这篇论文通过语言模型预训练获取通用语义知识,再通过微调实现不同下游任务,取得了很好的效果,解决了标注数据匮乏的问题,实现了自然语言理解中的一个重要突破。

Abstract

NLP领域标注数据的稀缺性→采用无监督预训练+特定任务微调(任务感知的输入变换+有效迁移)→证明了方法有效性

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

自然语言理解包括范围广泛的不同任务,如文本蕴涵、问题回答、语义相似性评估和文档分类。虽然大量的未标记文本语料库很丰富,但用于学习这些特定任务的标记数据很少,这使得经过判别训练的模型难以充分发挥作用。我们证明,通过先在多样化的无标签文本语料库上对语言模型进行生成式预训练,再针对每个特定任务进行判别式微调,可以在这些任务上获得巨大收益。与以前的方法相比,我们在微调期间使用任务感知的输入变换来实现有效迁移,同时只需对模型架构进行最小的更改。我们在广泛的自然语言理解基准上证明了我们方法的有效性。我们的通用、与任务无关的模型优于为每个任务专门设计架构、经判别式训练的模型,在所研究的12个任务中有9个显著提升了当前最优水平。例如,我们在常识推理(Stories Cloze Test,故事完形填空测试)上取得了8.9%的绝对提升,在问答(RACE)上取得了5.7%的绝对提升,在文本蕴涵(MultiNLI)上取得了1.5%的绝对提升。

1、Introduction

从原始文本中学习对减轻对监督学习的依赖至关重要→利用未标记数据的语言信息(无监督学习),比如词嵌入的成功→但更需要利用更高层次的信息

The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP). Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation, which can be time-consuming and expensive. Further, even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. The most compelling evidence for this so far has been the extensive use of pre-trained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Recent research has looked at various objectives such as language modeling [44], machine translation [38], and discourse coherence [22], with each method outperforming the others on different tasks.1 Second, there is no consensus on the most effective way to transfer these learned representations to the target task. Existing techniques involve a combination of making task-specific changes to the model architecture [43, 44], using intricate learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing.

从原始文本中有效学习的能力对于减轻NLP中对监督学习的依赖至关重要。大多数深度学习方法都需要大量的手工标记数据,这限制了它们在许多缺乏标注资源的领域的适用性[61]。在这些情况下,可以利用未标记数据中语言信息的模型,为收集更多标注(既耗时又昂贵)提供了一种有价值的替代方案。此外,即使在有大量监督的情况下,以无监督的方式学习良好的表示也可以显著提升性能。迄今为止,最令人信服的证据是广泛使用预训练词嵌入[10,39,42]来提高一系列NLP任务的性能[8,11,26,45]。然而,从未标记的文本中利用单词级以上的信息具有挑战性,主要有两个原因。

首先,目前尚不清楚哪种类型的优化目标在学习对迁移有用的文本表示时最有效。最近的研究着眼于不同的目标,如语言建模[44]、机器翻译[38]和语篇连贯[22],每种方法在不同的任务上都优于其他方法。

其次,对于将这些学习到的表征转移到目标任务最有效方法,目前还没有达成共识。现有的技术包括对模型架构进行特定于任务的更改[43,44],使用复杂的学习方案[21]和添加辅助学习目标[50]。这些不确定性使得开发有效的语言处理半监督学习方法变得困难

半监督学习方法实现迁移学习:采用无监督预训练和有监督微调相结合的半监督方法→两阶段训练过程(使用语言建模目标在未标记数据上学习神经网络模型的初始参数+使用相应的监督目标将这些参数适应到目标任务)

In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.

在本文中,我们探索了一种结合无监督预训练和有监督微调的半监督方法来完成语言理解任务。我们的目标是学习一种通用的表示,只需很少的适应就能迁移到广泛的任务中。我们假设可以访问一个大型的无标记文本语料库和几个带有人工标注训练样本的数据集(目标任务)。我们的设置不要求这些目标任务与未标记的语料库处于同一领域。我们采用两阶段训练程序。首先,我们在未标记的数据上使用语言建模目标来学习神经网络模型的初始参数。随后,我们使用相应的监督目标将这些参数调整适应到目标任务。

模型架构选择Transformer(更适合处理文本中的长依赖关系)→特定任务的输入适应

模型架构选择: 使用Transformer模型,相对于循环网络等替代方案,它在处理文本中的长距离依赖关系时提供了更结构化的记忆,从而在各种任务之间实现强大的迁移性能。
任务特定的输入适应: 采用源自遍历式(traversal-style)方法的任务特定输入转换,使微调时只需对模型架构做最小的更改。

For our model architecture, we use the Transformer [62], which has been shown to perform strongly on various tasks such as machine translation [62], document generation [34], and syntactic parsing [29]. This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style approaches [52], which process structured text input as a single contiguous sequence of tokens. As we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal changes to the architecture of the pre-trained model.

对于我们的模型架构,我们使用Transformer[62],它已被证明在各种任务上表现出色,如机器翻译[62]、文档生成[34]和语法解析[29]。与循环网络等替代方案相比,这种模型选择为我们提供了更结构化的记忆来处理文本中的长距离依赖关系,从而在不同任务之间实现了稳健的迁移性能。在迁移过程中,我们利用源自遍历式方法[52]的特定于任务的输入适应,该方法将结构化文本输入处理为单个连续的tokens序列。正如我们在实验中所演示的,这些调整使我们能够在对预训练模型的架构进行最小更改的情况下有效地进行微调。

评估结果:12个任务中有9个显著改进

We evaluate our approach on four types of language understanding tasks – natural language inference, question answering, semantic similarity, and text classification. Our general task-agnostic model outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40], 5.7% on question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66] and 5.5% on the recently introduced GLUE multi-task benchmark [64]. We also analyzed zero-shot behaviors of the pre-trained model on four different settings and demonstrate that it acquires useful linguistic knowledge for downstream tasks.

我们在四种类型的语言理解任务上评估我们的方法:自然语言推理、问题回答、语义相似性和文本分类。我们的通用、任务无关的模型优于采用为每项任务专门设计架构的判别式训练模型,在所研究的12个任务中有9个显著改进。例如,我们在常识推理(故事完形测试,Stories Cloze Test)[40]上实现了8.9%的绝对改进,在问答(RACE)[30]上实现了5.7%的绝对改进,在文本蕴涵(MultiNLI)[66]上实现了1.5%的绝对改进,在最近引入的GLUE多任务基准测试[64]上实现了5.5%的绝对改进。我们还分析了预训练模型在四种不同设置下的零样本行为,并证明它为下游任务学到了有用的语言知识。

2 、Related Work

本研究主要属于自然语言处理领域的半监督学习:词级别信息—早期【词频统计】→近期【词嵌入】,高级语义信息—短语级和句子级嵌入

半监督学习在NLP中有着广泛的应用,包括序列标注、文本分类等任务。

Semi-supervised learning for NLP Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling [24, 33, 57] or text classification [41, 70]. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using word embeddings [11, 39, 42], which are trained on unlabeled corpora, to improve performance on a variety of tasks [8, 11, 26, 45]. These approaches, however, mainly transfer word-level information, whereas we aim to capture higher-level semantics.

Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled corpus, have been used to encode text into suitable vector representations for various target tasks [28, 32, 1, 36, 22, 12, 56, 31].

我们的工作大致属于自然语言的半监督学习范畴。这种范式已经引起了极大的兴趣,并应用于序列标记[24,33,57]或文本分类[41,70]等任务。

最早的方法是使用未标记的数据来计算词级或短语级统计,然后将其用作监督模型中的特征[33]。

在过去的几年里,研究人员已经证明了使用词嵌入的好处[11,39,42],它是在未标记的语料库上训练的,可以提高各种任务的性能[8,11,26,45]。然而,这些方法主要是传递词级信息,而我们的目标是捕获更高级别的语义

最近的方法研究了从未标记数据中学习和利用超过单词级别的语义。短语级或句子级嵌入可以使用未标记的语料库进行训练,已用于将文本编码为适合各种目标任务的向量表示[28,32,1,36,22,12,56,31]。

无监督预训练:适应任务,早期【图像分类/回归任务】→当前【训练深度神经网络】,网络基座,LSTM→Transformer

  • 无监督预训练的特殊情况: 无监督预训练是半监督学习的一种特殊情况,其目标是找到一个良好的初始化点而不是修改监督学习目标。

  • 前期研究: 先前的研究探讨了在图像分类和回归任务中使用无监督预训练的技术。最近的工作表明,预训练充当了正则化方案,能够使深度神经网络更好地泛化。

  • 相似方法的对比: 与其他方法的对比表明,该研究选择使用Transformer网络,使其能够捕捉更长范围的语言结构,与使用LSTM模型的其他方法相比,提高了性能。

Unsupervised pre-training

Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective. Early works explored the use of the technique in image classification [20, 49, 63] and regression tasks [3]. Subsequent research [15] demonstrated that pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. In recent work, the method has been used to help train deep neural networks on various tasks like image classification [69], speech recognition [68], entity disambiguation [17] and machine translation [48].

The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and Ruder [21] follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longer-range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the effectiveness of our model on a wider range of tasks including natural language inference, paraphrase detection and story completion. Other approaches [43, 44, 38] use hidden representations from a pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. This involves a substantial amount of new parameters for each separate target task, whereas we require minimal changes to our model architecture during transfer.

无监督预训练

无监督预训练是半监督学习的一种特殊情况,其目标是找到一个好的初始化点,而不是修改监督学习目标。早期工作探索了该技术在图像分类[20,49,63]和回归任务[3]中的使用。随后的研究[15]表明,预训练作为一种正则化方案,能够在深度神经网络中实现更好的泛化。在最近的工作中,该方法已用于帮助训练深度神经网络完成各种任务,如图像分类[69]、语音识别[68]、实体消歧[17]和机器翻译[48]。

与我们最接近的工作是使用语言建模目标预训练神经网络,然后在监督下对目标任务进行微调。Dai et al.[13]和Howard and Ruder[21]采用该方法改进文本分类。然而,尽管预训练阶段有助于捕获一些语言信息,但LSTM模型的使用将其预测能力限制在较短的范围内。相比之下,我们对Transformer网络的选择使我们能够捕捉更长距离的语言结构,正如我们的实验所证明的那样。此外,我们还证明了我们的模型在更广泛的任务上的有效性,包括自然语言推理、释义检测和故事补全。其他方法[43,44,38]使用来自预训练语言模型或机器翻译模型的隐藏表示作为辅助特征,同时在目标任务上训练监督模型。这需要为每个独立的目标任务引入大量新参数,而我们在迁移过程中只需要对模型架构进行最小的更改。

辅助训练目标:半监督学习的替代形式

  • 辅助训练目标是半监督学习的替代形式: 辅助训练目标是半监督学习的一种替代形式,旨在通过添加额外的无监督训练目标来提高模型性能。

  • Collobert和Weston的早期工作: Collobert和Weston的早期工作使用了各种辅助NLP任务,如词性标注、分块、命名实体识别和语言建模,以提高语义角色标注的性能。

  • Rei的近期研究: Rei的近期研究在目标任务目标之外添加了辅助语言建模目标,展示了在序列标注任务上的性能提升。

  • 本研究的辅助目标使用: 本研究同样采用了辅助目标,但如我们所示,无监督预训练已经学到了与目标任务相关的多个语言学方面。

Auxiliary training objectives

Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks.

辅助训练目标

添加辅助的无监督训练目标是半监督学习的另一种形式。Collobert和Weston的早期工作[10]使用了各种辅助NLP任务,如POS标记、分块、命名实体识别和语言建模来改进语义角色标记。

最近,Rei[50]在他们的目标任务目标中添加了一个辅助语言建模目标,并证明了序列标记任务的性能提高。我们的实验也使用了一个辅助目标,但如我们所示,无监督的预训练已经学习了与目标任务相关的几个语言方面。

3、Framework:两个阶段

Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.

我们的训练程序包括两个阶段。第一阶段是在大型文本语料库上学习一个高容量的语言模型。接下来是微调阶段,在此阶段,我们利用带标签的数据使模型适应判别式任务。

3.1、无监督的预训练阶段UPT

Unsupervised pre-training

Given an unsupervised corpus of tokens U = {u1, . . . , un}, we use a standard language modeling objective to maximize the following likelihood:

 where k is the size of the context window, and the conditional probability P is modeled using a neural network with parameters Θ. These parameters are trained using stochastic gradient descent [51].

In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multiheaded self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:

 where U = (u−k, . . . , u−1) is the context vector of tokens, n is the number of layers, We is the token embedding matrix, and Wp is the position embedding matrix.

给定一个无监督的语料库(无标签的tokens语料库)U = {u1,…, un},我们使用标准的语言模型目标函数(即根据前面k个词预测下一个词),最大化下面的似然函数:

其中,k为上下文窗口的大小,条件概率P由参数为Θ的神经网络建模,这些参数通过随机梯度下降(SGD)[51]训练优化得到。

在我们的实验中,我们对语言模型使用了一个多层Transformer解码器[34],它是Transformer的一个变体[62]。该模型在输入上下文tokens上应用了一个多头自注意力操作,然后是按位置的前馈层,以在目标tokens上产生输出分布:

在GPT-1中,使用了12个transformer块的结构作为解码器,每个transformer块是一个多头的自注意力机制,然后通过全连接得到输出的概率分布

该模型的训练过程,其实就是将输入文本中每个词的Embedding作为输入,输出预测的下一个词

其中U = (u−k,…, u−1)为token的上下文向量,n为层数,We为token嵌入矩阵,Wp为位置嵌入矩阵。

第1步

输入为前k个词和位置的embedding,

其中U={u−k, …, u−1}是文本中每个词的词向量(当前时间片的上下文token),n为层数,We表示词嵌入矩阵,Wp表示位置嵌入矩阵。

第2步

经过n层transformer-decoder层(GPT-1中为12层)

第3步

乘上一个token embedding矩阵,通过softmax得到概率
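结合上文的符号定义,3.1节涉及的目标函数与前向计算可以整理成如下LaTeX形式(按论文的文字描述重写,仅作对照示意):

```latex
% 公式(1):无监督预训练的语言建模目标,k 为上下文窗口大小,\Theta 为网络参数
L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)

% 多层Transformer解码器的前向计算:
% U=(u_{-k},\ldots,u_{-1}) 为上下文token,W_e 为token嵌入矩阵,W_p 为位置嵌入矩阵,n 为层数
h_0 = U W_e + W_p
h_l = \mathrm{transformer\_block}(h_{l-1}), \quad \forall l \in [1, n]
P(u) = \mathrm{softmax}\left(h_n W_e^{\top}\right)
```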

3.2、有监督的微调阶段SFT

第二阶段为Supervised fine-tuning,即在特定任务上使用少量带标签的数据对模型参数进行有监督微调。与预训练阶段相比,可以看出模型只是多增加了一个"Task Classifier"(任务分类)模块。

After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset C, where each instance consists of a sequence of input tokens, x1, . . . , xm, along with a label y. The inputs are passed through our pre-trained model to obtain the final transformer block’s activation hlm, which is then fed into an added linear output layer with parameters Wy to predict y:

 This gives us the following objective to maximize:

当得到无监督的预训练模型之后,我们将它直接应用到有监督任务中。

在用公式(1)中的目标训练模型后,我们将参数调整以适应有监督的目标任务。

我们假设有一个带标签的数据集C,其中每个实例由m个输入token组成的序列{x1, x2, …, xm}和对应的标签y构成。

输入序列经过我们预训练的模型,得到最后一个transformer块的激活hlm,

然后,将其输入到一个附加的、参数为Wy的线性输出层来预测y:

这为我们提供了以下最大化目标:

We additionally found that including language modeling as an auxiliary objective to the fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work [50, 43], who also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight λ):

Overall, the only extra parameters we require during fine-tuning are Wy, and embeddings for delimiter tokens (described below in Section 3.3).

我们还发现,将语言建模作为微调的辅助目标有助于

(a)、提高监督模型的泛化

(b)、加速收敛

这与之前的工作一致[50,43],他们也观察到使用这种辅助目标可以提高性能。具体来说,我们优化了以下目标(权值λ):

总的来说,我们在微调过程中需要的唯一额外参数是Wy和分隔符tokens 的嵌入(将在3.3节中描述)。

Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer.图 1:(左)本工作中使用的 Transformer 架构和训练目标。 (右)用于微调不同任务的输入转换。 我们将所有结构化输入转换为标记序列,以供我们的预训练模型处理,然后是线性+softmax 层。

第1步

首先,将这些token输入到训练好的预训练模型中,得到最终的特征向量hlm。

然后,再通过一个全连接层得到预测结果y,

具体任务相应的目标函数为:

 其中,Wy为全连接层的参数,有监督的目标则是最大化上边公式的值。

第2步

作者并没有只使用L2,而是在其中加入了L1,并使用λ对两个目标进行加权,λ的值一般取0.5。

其中,x1, …, xm为特定下游任务的输入,y为标签,hlm为最后一个token xm对应的最后一层transformer-decoder的输出,即预训练阶段最后一个词位置的输出。

所以,需要额外调整的参数只有Wy

原文中还提到了用辅助训练目标的方法,来帮助模型在微调时拥有更好的泛化能力并加速收敛。

具体做法是:在使用最后一个词的预测结果进行监督学习的同时,前面的词继续上一步的无监督训练。

最终,微调阶段的,要优化的目标函数如下:

L3(C)=L2(C)+λ∗L1(C)

当进行有监督微调的时候,我们只训练输出层的Wy和分隔符的嵌入值。
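把3.2节的微调目标同样整理成LaTeX形式,便于与上面的L1、L3对照(按论文的文字描述重写,仅作示意):

```latex
% 最后一个token x^m 在最后一个transformer块的激活 h_l^m,经线性层 W_y + softmax 预测标签 y
P\left(y \mid x^1, \ldots, x^m\right) = \mathrm{softmax}\left(h_l^m W_y\right)

% 公式(2):有监督微调目标;公式(3):加入辅助语言建模目标(权重 \lambda)后的总目标
L_2(\mathcal{C}) = \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```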

Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer.图1:(左)此工作中使用的Transformer架构和训练目标。(右)对不同任务进行微调的输入转换。我们将所有结构化输入转换为标记序列,由我们的预训练模型处理,然后是线性+softmax层。

不同的任务模型的输入token序列不同

针对不同的任务,模型的输入token序列是有区别的。简单总结如下:

任务1:文本分类(判断输入文本属于指定的哪个类别)

>> 不需要对输入做特殊修改,格式与预训练时一致:[start; text; extract];

>> 将起始和终止token加入到原始序列两端,输入transformer中得到特征向量,最后经过一个全连接层,得到预测的概率分布。

任务2:文本蕴涵/推理(判断两个句子之间的关系:蕴涵entailment、矛盾contradiction、中立neutral)

>> 在前提(premise)和假设(hypothesis)中间插入一个分隔符(delimiter):[start; premise; delimiter; hypothesis; extract];

>> 两端加上起始和终止token,再依次通过transformer和全连接层,得到预测结果。

任务3:文本相似度度量(判断两个句子是否语义相关)

>> 由于两个文本之间没有固有的顺序,将两种顺序([start; text1; delimiter; text2; extract] 和 [start; text2; delimiter; text1; extract])分别输入模型;

>> 得到两个hlm后按位相加,再经过全连接层得到预测结果。即输入的两个句子正向和反向各拼接一次,分别输入transformer得到特征向量,相加后送入全连接层。

任务4:问答与常识推理QA(类似多选题:输入一篇文章、一个问题以及若干候选答案,输出每个答案的预测概率)

>> 将上下文文档(context)和问题(question)与每个候选答案(answer)分别拼接,中间用分隔符隔开:[start; context; question; delimiter; answer1]、[start; context; question; delimiter; answer2]、…、[start; context; question; delimiter; answerN];

>> 每个序列经过模型后,再经过softmax层归一化。相当于把n个选项的问题转化为n个二分类问题:每个选项分别与内容拼接,各自送入transformer和全连接层,最后选择置信度最高的作为预测结果。

3.3、特定于任务的输入转换

提出了一种任务输入转换的方法

提出了一种任务输入转换的方法,可以将结构化任务输入(如有序句对或问题-文档-答案三元组)转换成连续文本序列,以适应预训练模型的特征,从而在不同任务间实现架构几近不变的迁移学习

  • 文本分类与结构化输入任务的差异: 对于某些任务(如文本分类),可以直接按照上述描述微调模型。然而,对于一些结构化输入任务(如问答或文本蕴涵),需要进行一些修改以适应有序句子对或文档-问题-答案三元组等结构化输入。

  • 避免任务特定架构的引入: 先前的工作在传递的表示之上提出了学习任务特定架构的方法,但这种方法重新引入了大量任务特定的定制,并未在这些额外的架构组件上使用迁移学习。

  • 采用遍历式方法: 本研究采用了一种遍历式方法,将结构化输入转换为有序序列,使得预训练模型能够处理。这些输入变换允许在任务之间避免对架构进行广泛的更改

Task-specific input transformations

For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks. Previous work proposed learning task specific architectures on top of transferred representations [44]. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens <s>,<e>.

对于某些任务,如文本分类,我们可以按上文所述直接微调我们的模型。而某些其他任务,如问答或文本蕴涵,具有结构化的输入,如有序的句子对,或文档、问题和答案的三元组。由于我们的预训练模型是在连续的文本序列上训练的,所以我们需要进行一些修改才能将其应用于这些任务。

以前的工作提出了在转移表示[44]之上学习特定任务的架构。这种方法重新引入了大量特定于任务的定制,并且没有对这些额外的体系结构组件使用迁移学习。

相反,我们使用遍历风格的方法[52],将结构化输入转换为预先训练的模型可以处理的有序序列。这些输入转换允许我们避免跨任务对体系结构进行广泛的更改。我们在下面简要描述了这些输入转换,图1提供了一个可视化的说明。所有转换包括添加随机初始化的开始和结束令牌<s>,<e>。

输入转换示例

  • 文本蕴涵任务: 将前提p和假设h的标记序列连接起来,中间用分隔符($)分隔。
  • 相似性任务: 对于相似性任务,两个比较的句子没有固有的顺序。为了反映这一点,修改输入序列以包含两种可能的句子顺序,并分别处理每个顺序以产生两个序列表示,然后在馈入线性输出层之前进行逐元素相加。
  • 问答和常识推理: 对于这些任务,给定上下文文档z、问题q和一组可能的答案{ak}。将文档上下文和问题与每个可能的答案连接起来,中间加入分隔符,以获取[z; q; $; ak]。每个这样的序列都独立地与我们的模型一起处理,然后通过softmax层进行归一化,生成可能答案的输出分布。

Textual entailment

For entailment tasks, we concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between.

文本蕴涵

对于蕴涵任务,我们将前提p和假设h的token序列连接起来,中间加入一个分隔符token($)。

Similarity

For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations hlm which are added element-wise before being fed into the linear output layer.

相似性

对于相似性任务,被比较的两个句子没有固有的顺序。为了反映这一点,我们修改输入序列,使其包含两种可能的句子顺序(中间加分隔符),并分别独立处理每种顺序,生成两个序列表示hlm,在输入线性输出层之前将它们按元素相加。

Question Answering and Commonsense Reasoning

For these tasks, we are given a context document z, a question q, and a set of possible answers {ak}. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get [z; q; $; ak]. Each of these sequences are processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers.

问题回答和常识推理

对于这些任务,我们得到一个上下文文档z、一个问题q和一组可能的答案{ak}。我们将文档上下文和问题与每个可能的答案连接起来,在两者之间添加分隔符token,得到[z; q; $; ak]。我们的模型对每个这样的序列进行独立处理,然后通过softmax层进行归一化,以生成可能答案的输出分布。
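下面用一段可运行的Python代码示意上述遍历式输入转换的拼接方式(token以字符串表示,start/delimiter/extract等名称为本文假设的占位符,并非论文官方实现):

```python
# 按3.3节的描述,把各类结构化输入拼成单一的token序列(示意用)
START, DELIM, END = "<s>", "$", "<e>"   # 随机初始化的起始、分隔、终止token

def classification_input(text):
    # 文本分类:直接在文本两端加起始/终止token
    return [START, *text, END]

def entailment_input(premise, hypothesis):
    # 文本蕴涵:[start; premise; $; hypothesis; extract]
    return [START, *premise, DELIM, *hypothesis, END]

def similarity_inputs(text1, text2):
    # 语义相似度:两种顺序各构造一个序列,分别过模型后将两个 h_l^m 按位相加
    return (
        [START, *text1, DELIM, *text2, END],
        [START, *text2, DELIM, *text1, END],
    )

def qa_inputs(context, question, answers):
    # 问答/常识推理:上下文+问题分别与每个候选答案拼接,得到 [z; q; $; a_k],
    # 各自过模型后再经softmax归一化,取概率最高的答案
    return [[START, *context, *question, DELIM, *ans, END] for ans in answers]

if __name__ == "__main__":
    print(entailment_input(["a", "man", "is", "running"], ["someone", "is", "moving"]))
```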

4、Experiments

4.1、Setup

Table 1: A list of the different tasks and datasets used in our experiments.表1:我们实验中使用的不同任务和数据集的列表。

无监督预训练数据集

使用BooksCorpus数据集进行语言模型的训练,包含超过7,000本未发表的书籍,涵盖冒险、奇幻和言情等多种流派,具有长篇连续文本,有助于模型学习对长距离信息的条件依赖。

Unsupervised pre-training

We use the BooksCorpus dataset [71] for training the language model. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information. An alternative dataset, the 1B Word Benchmark, which is used by a similar approach, ELMo [44], is approximately the same size but is shuffled at a sentence level - destroying long-range structure. Our language model achieves a very low token level perplexity of 18.4 on this corpus.

无监督预训练

我们使用BooksCorpus数据集[71]来训练语言模型。它包含了7000多本各种类型的未出版书籍,包括冒险、奇幻和言情。最重要的是,它包含长段的连续文本,这使生成模型能够学习以长距离信息为条件。另一个可供选择的数据集是被类似方法ELMo[44]使用的1B Word Benchmark,其大小大致相同,但在句子级别上被打乱,破坏了长距离结构。我们的语言模型在这个语料库上达到了非常低的token级困惑度18.4。

模型架构:基于Transformer的解码器,包括Masked self-attention、position-wise feed-forward networks、Adam、余弦调度学习率

模型主要基于原始的Transformer工作,采用了一个12层的解码器,带有掩码(masked)自注意力头(768维状态和12个注意力头)。对于position-wise前馈网络,采用了3072维的内部状态。采用Adam优化方案,在前2000次更新内将学习率从0线性增加到最大值2.5e-4,然后使用余弦调度将其退火至0。在64个随机抽取的、长度为512个token的连续序列组成的小批次上训练100个epoch。

Model specifications

Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm [2] is used extensively throughout the model, a simple weight initialization of N(0, 0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [37], with w = 0.01 on all non bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work. We use the ftfy library2 to clean the raw text in BooksCorpus, standardize some punctuation and whitespace, and use the spaCy tokenizer.3

模型规范

我们的模型在很大程度上遵循了原始的Transformer工作[62]。我们训练了一个12层、仅含解码器(decoder-only)的Transformer,带有掩码自注意力头(768维状态和12个注意力头)。对于position-wise前馈网络,我们使用3072维的内部状态。我们使用Adam优化方案[27],最大学习率为2.5e-4;在前2000次更新中,学习率从0线性增加,然后按余弦调度退火到0。我们在64个随机采样的、每条512个token的连续序列组成的小批次上训练100个epoch。由于layernorm[2]在整个模型中被广泛使用,简单的N(0, 0.02)权重初始化就足够了。我们使用了包含40000次合并的字节对编码(BPE)词表[53],并使用0.1的残差、嵌入和注意力dropout进行正则化。我们还采用了[37]中提出的L2正则化的修改版本,对所有非偏置、非增益权重设置w = 0.01。激活函数使用高斯误差线性单元(GELU)[18]。我们使用可学习的位置嵌入,而不是原始工作中提出的正弦版本。我们使用ftfy库来清理BooksCorpus中的原始文本,标准化一些标点符号和空格,并使用spaCy分词器。
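上文提到的学习率策略(前2000次更新线性升温到2.5e-4,随后余弦退火到0)可以用下面的小函数示意(total_steps等参数为本文假设,并非官方实现):

```python
import math

MAX_LR = 2.5e-4       # 论文中的最大学习率
WARMUP_STEPS = 2000   # 前2000次更新线性升温

def gpt1_lr(step, total_steps):
    """线性warmup + 余弦退火到0的学习率调度(示意)。"""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 * MAX_LR * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for s in (0, 1000, 2000, 50_000, 100_000):
        print(s, f"{gpt1_lr(s, total_steps=100_000):.2e}")
```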

调整细节:重用了无监督预训练的超参数,3个epoch、线性学习率衰减计划

除非另有说明,均重用无监督预训练阶段的超参数设置。在分类器中加入了0.1的dropout。对于大多数任务,使用6.25e-5的学习率和32的批量大小。模型微调速度很快,大多数情况下训练3个epoch已经足够。采用线性学习率衰减计划,并在前0.2%的训练步数中进行warmup。λ设置为0.5。

Fine-tuning details

Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5.

调整细节

除非另有说明,我们重用来自无监督预训练的超参数设置。我们在分类器上加入0.1的dropout。对于大多数任务,我们使用6.25e-5的学习率和32的批量大小。我们的模型微调得很快,对大多数情况,3个epoch的训练就已足够。我们使用线性学习率衰减计划,并在前0.2%的训练中进行warmup。λ设为0.5。
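把上面的微调超参数汇总成一份配置,便于查阅(字段名为本文假设,数值取自上文):

```python
# GPT-1 微调阶段的超参数汇总(示意)
FINETUNE_CONFIG = {
    "learning_rate": 6.25e-5,        # 多数任务使用的学习率
    "batch_size": 32,
    "epochs": 3,                     # 多数任务训练3个epoch即可
    "classifier_dropout": 0.1,       # 分类器上的dropout
    "lr_schedule": "linear_decay",   # 线性学习率衰减
    "warmup_fraction": 0.002,        # 前0.2%的训练步数用于warmup
    "aux_lm_weight": 0.5,            # 辅助语言建模目标的权重 λ
}
print(FINETUNE_CONFIG)
```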

4.2、Supervised fine-tuning

自然语言推理、问题回答、语义相似性和文本分类

We perform experiments on a variety of supervised tasks including natural language inference, question answering, semantic similarity, and text classification. Some of these tasks are available as part of the recently released GLUE multi-task benchmark [64], which we make use of. Figure 1 provides an overview of all the tasks and datasets.

我们对各种监督任务进行实验,包括自然语言推理、问题回答、语义相似性和文本分类。其中一些任务是最近发布的GLUE多任务基准测试[64]的一部分,我们正在使用它。图1提供了所有任务和数据集的概述。

Natural Language Inference

The task of natural language inference (NLI), also known as recog-nizing textual entailment, involves reading a pair of sentences and judging the relationship between them from one of entailment, contradiction or neutral. Although there has been a lot of recent interest [58, 35, 44], the task remains challenging due to the presence of a wide variety of phenomena like lexical entailment, coreference, and lexical and syntactic ambiguity. We evaluate on five datasets with diverse sources, including image captions (SNLI), transcribed speech, popular fiction, and government reports (MNLI), Wikipedia articles (QNLI), science exams (SciTail) or news articles (RTE).

Table 2 details various results on the different NLI tasks for our model and previous state-of-the-art approaches. Our method significantly outperforms the baselines on four of the five datasets, achieving absolute improvements of upto 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI and 0.6% on SNLI over the previous best results. This demonstrates our model’s ability to better reason over multiple sentences, and handle aspects of linguistic ambiguity. On RTE, one of the smaller datasets we evaluate on (2490 examples), we achieve an accuracy of 56%, which is below the 61.7% reported by a multi-task biLSTM model. Given the strong performance of our approach on larger NLI datasets, it is likely our model will benefit from multi-task training as well but we have not explored this currently.


自然语言推理

自然语言推理(NLI)的任务,也称为文本蕴涵识别,包括阅读一对句子,并从蕴涵、矛盾或中性中判断它们之间的关系。虽然最近有很多兴趣[58,35,44],但由于存在各种各样的现象,如词汇蕴涵,共指,词汇和句法歧义,任务仍然具有挑战性。我们评估了五个具有不同来源的数据集,包括图像说明(SNLI)、转录演讲、流行小说和政府报告(MNLI)、维基百科文章(QNLI)、科学考试(SciTail)或新闻文章(RTE)。

表2详细说明了我们的模型和以前最先进的方法在不同NLI任务上的各种结果。我们的方法在五个数据集中的四个数据集上显著优于基线,在MNLI上实现了高达1.5%的绝对改进,在SciTail上实现了5%的绝对改进,在QNLI上实现了5.8%的绝对改进,在SNLI上实现了0.6%的绝对改进。这证明了我们的模型能够更好地对多个句子进行推理,并处理语言歧义的各个方面。在RTE(我们评估的一个较小的数据集)上(2490个示例),我们实现了56%的准确性,低于多任务biLSTM模型报告的61.7%。鉴于我们的方法在更大的NLI数据集上的强大性能,我们的模型很可能也会受益于多任务训练,但我们目前还没有探索这一点。


Question answering and commonsense reasoning

Another task that requires aspects of single and multi-sentence reasoning is question answering. We use the recently released RACE dataset [30], consisting of English passages with associated questions from middle and high school exams. This corpus has been shown to contain more reasoning type questions that other datasets like CNN [19] or SQuaD [47], providing the perfect evaluation for our model which is trained to handle long-range contexts. In addition, we evaluate on the Story Cloze Test [40], which involves selecting the correct ending to multi-sentence stories from two options. On these tasks, our model again outperforms the previous best results by significant margins - up to 8.9% on Story Cloze, and 5.7% overall on RACE. This demonstrates the ability of our model to handle long-range contexts effectively.

问题回答和常识推理

另一个需要单句和多句推理能力的任务是回答问题。我们使用最近发布的RACE数据集[30],由初中和高中考试中的英语文章和相关问题组成。与CNN[19]或SQuaD[47]等其他数据集相比,该语料库已被证明包含更多的推理类型问题,为我们训练处理长期上下文的模型提供了完美的评估。此外,我们还对[40]故事完形填空测试进行了评估,该测试涉及从两个选项中选择多句故事的正确结局。在这些任务中,我们的模型再次以显著的幅度超过了之前的最佳结果——在故事完形上高达8.9%,在RACE上总体达到5.7%。这证明了我们的模型有效地处理长期上下文的能力。

Semantic Similarity

Semantic similarity (or paraphrase detection) tasks involve predicting whether two sentences are semantically equivalent or not. The challenges lie in recognizing rephrasing of concepts, understanding negation, and handling syntactic ambiguity. We use three datasets for this task: the Microsoft Paraphrase corpus (MRPC) [14] (collected from news sources), the Quora Question Pairs (QQP) dataset [9], and the Semantic Textual Similarity benchmark (STS-B) [6]. We obtain state-of-the-art results on two of the three semantic similarity tasks (Table 4) with a 1 point absolute gain on STS-B. The performance delta on QQP is significant, with a 4.2% absolute improvement over Single-task BiLSTM + ELMo + Attn.

语义相似度

语义相似性(或释义检测)任务包括预测两个句子是否在语义上等价。挑战在于识别概念的改写、理解否定和处理句法歧义。我们使用三个数据集来完成这个任务:微软释义语料库(MRPC)[14](从新闻来源收集)、Quora问题对(QQP)数据集[9],以及语义文本相似性基准(STS-B)[6]。我们在三个语义相似性任务中的两个(表4)上获得了最先进的结果,在STS-B上获得了1个点的绝对增益。QQP上的性能提升非常显著,与单任务BiLSTM + ELMo + Attn相比绝对提高了4.2%。

Classification

Finally, we also evaluate on two different text classification tasks. The Corpus of Linguistic Acceptability (CoLA) [65] contains expert judgements on whether a sentence is grammatical or not, and tests the innate linguistic bias of trained models. The Stanford Sentiment Treebank (SST-2) [54], on the other hand, is a standard binary classification task. Our model obtains an score of 45.4 on CoLA, which is an especially big jump over the previous best result of 35.0, showcasing the innate linguistic bias learned by our model. The model also achieves 91.3% accuracy on SST-2, which is competitive with the state-of-the-art results. We also achieve an overall score of 72.8 on the GLUE benchmark, which is significantly better than the previous best of 68.9.

分类

最后,我们还评估了两种不同的文本分类任务。语言可接受性语料库(CoLA)[65]包含对句子是否合乎语法的专家判断,可以测试训练后模型的先天语言偏差。另一方面,斯坦福情感树库(SST-2)[54]是一个标准的二分类任务。我们的模型在CoLA上获得了45.4分,比之前的最佳结果35.0有了特别大的提升,展示了我们的模型学到的先天语言偏差。该模型在SST-2上也达到了91.3%的准确率,与最先进的结果相当。我们在GLUE基准测试上还取得了72.8的总分,明显好于之前的最好成绩68.9。

Overall, our approach achieves new state-of-the-art results in 9 out of the 12 datasets we evaluate on, outperforming ensembles in many cases. Our results also indicate that our approach works well across datasets of different sizes, from smaller datasets such as STS-B (≈5.7k training examples) –to the largest one – SNLI (≈550k training examples).

总的来说,我们的方法在我们评估的12个数据集中的9个数据集中获得了最新的结果,在许多情况下表现优于集成。我们的结果还表明,我们的方法在不同大小的数据集上都能很好地工作,从较小的数据集,如STS-B(≈5.7k训练示例)到最大的数据集——SNLI(≈550k训练示例)。

Table 2: Experimental results on natural language inference tasks, comparing our model with current state-of-the-art methods. 5x indicates an ensemble of 5 models. All datasets use accuracy as the evaluation metric.
表 2:自然语言推理任务的实验结果,将我们的模型与当前最先进的方法进行比较。 5x 表示 5 个模型的集合。 所有数据集都使用准确性作为评估指标。

 

Table 3: Results on question answering and commonsense reasoning, comparing our model with current state-of-the-art methods.. 9x means an ensemble of 9 models. 表3:问题回答和常识推理的结果,将我们的模型与当前最先进的方法进行比较。9x表示9个模型的集合。

 

Table 4: Semantic similarity and classification results, comparing our model with current state-of-the-art methods. All task evaluations in this table were done using the GLUE benchmark. (mc= Mathews correlation, acc=Accuracy, pc=Pearson correlation)表4:语义相似性和分类结果,将我们的模型与当前最先进的方法进行比较。本表中的所有任务评估都是使用GLUE基准测试完成的。(mc= Mathews相关性,acc=准确度,pc=Pearson相关性)

5、Analysis

Figure 2: (left) Effect of transferring increasing number of layers from the pre-trained language model on RACE and MultiNLI. (right) Plot showing the evolution of zero-shot performance on different tasks as a function of LM pre-training updates. Performance per task is normalized between a random guess baseline and the current state-of-the-art with a single model.
图 2:(左)在 RACE 和 MultiNLI 上,从预训练语言模型迁移不同数量的层的效果。(右)不同任务上的零样本性能随语言模型预训练更新次数变化的曲线。每个任务的性能在随机猜测基线和当前最先进的单模型之间进行了归一化。

Table 5: Analysis of various model ablations on different tasks. Avg. score is a unweighted average of all the results. (mc= Mathews correlation, acc=Accuracy, pc=Pearson correlation)
表5:不同任务下的各种模型消融分析。平均分是所有成绩的未加权平均。(mc= Mathews相关性,acc=准确度,pc=Pearson相关性)

层次数的转移影响:每一层都包含对解决目标任务有用的功能

层次数的转移影响: 研究了从无监督预训练向有监督目标任务迁移不同层数的影响。结果显示,每个被迁移的Transformer层都带来进一步的性能提升,在MultiNLI上完全迁移时最多可提升9%。这表明预训练模型的每一层都包含对解决目标任务有用的功能。

Impact of number of layers transferred

We observed the impact of transferring a variable number of layers from unsupervised pre-training to the supervised target task. Figure 2(left) illustrates the performance of our approach on MultiNLI and RACE as a function of the number of layers transferred. We observe the standard result that transferring embeddings improves performance and that each transformer layer provides further benefits up to 9% for full transfer on MultiNLI. This indicates that each layer in the pre-trained model contains useful functionality for solving target tasks.

转移层数的影响

我们观察了将不同数量的层从无监督预训练迁移到有监督目标任务的影响。图2(左)展示了我们的方法在MultiNLI和RACE上的性能随迁移层数变化的情况。我们观察到的标准结果是,迁移嵌入可以提高性能,并且每个Transformer层都能带来进一步的收益,在MultiNLI上完全迁移时最多可提升9%。这表明预训练模型中的每一层都包含解决目标任务的有用功能。

Zero-shot行为:Transformer体系结构的归纳偏差有助于迁移

Zero-shot行为: 通过探究语言模型预训练在transformer中的有效性,观察了在没有监督微调的情况下,基于生成模型的启发式解决方案在训练过程中的有效性。结果显示这些启发式解决方案的性能稳定,并随着训练逐渐提高,暗示了生成预训练支持学习广泛任务相关功能的能力。Transformer体系结构的归纳偏差有助于迁移

Zero-shot Behaviors

We’d like to better understand why language model pre-training of transform-ers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs. We designed a series of heuristic solutions that use the underlying generative model to perform tasks without supervised finetuning. We visualize the effectiveness of these heuristic solutions over the course of generative pre-training in Fig 2(right). We observe the performance of these heuristics is stable and steadily increases over training suggesting that generative pretraining supports the learning of a wide variety of task relevant functionality. We also observe the LSTM exhibits higher variance in its zero-shot performance suggesting that the inductive bias of the Transformer architecture assists in transfer.

For CoLA (linguistic acceptability), examples are scored as the average token log-probability the generative model assigns and predictions are made by thresholding. For SST-2 (sentiment analysis), we append the token very to each example and restrict the language model’s output distribution to only the words positive and negative and guess the token it assigns higher probability to as the prediction. For RACE (question answering), we pick the answer the generative model assigns the highest average token log-probability when conditioned on the document and question. For DPRD [46] (winograd schemas), we replace the definite pronoun with the two possible referrents and predict the resolution that the generative model assigns higher average token log-probability to the rest of the sequence after the substitution.

Zero-shot行为

我们想更好地理解为什么Transformer的语言模型预训练是有效的。一个假设是,底层生成模型为了提高其语言建模能力,学会了执行我们所评估的许多任务,并且与LSTM相比,Transformer更结构化的注意力记忆有助于迁移。我们设计了一系列启发式解决方案,使用底层生成模型来执行任务,而无需有监督的微调。我们在图2(右)中可视化了这些启发式解决方案在生成式预训练过程中的有效性。我们观察到这些启发式方法的性能是稳定的,并且随训练稳步提升,这表明生成式预训练支持学习各种各样与任务相关的功能。我们还观察到LSTM的零样本性能方差更高,这表明Transformer架构的归纳偏差有助于迁移。

对于CoLA(语言可接受性),用生成模型给出的平均token对数概率为样本打分,并通过设定阈值进行预测。对于SST-2(情感分析),我们在每个样本后附加token "very",并将语言模型的输出分布限制为"positive"和"negative"两个词,取模型赋予更高概率的那个词作为预测。对于RACE(问答),我们选择在以文档和问题为条件时,生成模型赋予最高平均token对数概率的答案。对于DPRD[46](Winograd schemas),我们将定代词替换为两个可能的指代对象,并预测替换后生成模型对序列其余部分赋予更高平均token对数概率的那个消解结果。
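以SST-2的零样本启发式为例:在句子后附加"very",比较语言模型赋予"positive"与"negative"的对数概率。下面用Hugging Face transformers里的GPT-2权重给出一个最小示意(GPT-1权重不易直接获取,这里仅用GPT-2演示同样的打分思路,属于本文假设的替代实现):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_logprob(prompt: str, candidate: str) -> float:
    """候选词(可能拆成多个子词)在prompt之后的平均对数概率。"""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cand_ids = tokenizer(" " + candidate, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cand_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    total = 0.0
    for i in range(cand_ids.shape[1]):
        # 第 i 个候选子词由它前一个位置的logits预测
        pos = prompt_ids.shape[1] + i - 1
        total += log_probs[0, pos, cand_ids[0, i]].item()
    return total / cand_ids.shape[1]

sentence = "The movie was a complete waste of time."
prompt = sentence + " very"
pred = max(["positive", "negative"], key=lambda w: avg_logprob(prompt, w))
print(pred)  # 预期更偏向 "negative"
```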

消融实验:采用辅助语言建模、选择大型数据集、选择Transformer、要有预训练

消融研究: 进行了三个不同的消融研究。首先,研究了在微调过程中没有辅助语言建模目标的方法的性能。结果显示,在NLI任务和QQP任务中,辅助目标对性能有所帮助,总体趋势表明大型数据集受益于辅助目标,而小型数据集则没有。其次,通过与具有相同框架的单层2048单元LSTM进行比较,分析了Transformer的效果。结果显示,使用LSTM而不是Transformer会导致平均分数下降5.6。最后,将与直接在监督目标任务上进行训练的Transformer架构进行比较,没有预训练。结果显示,缺乏预训练会损害所有任务的性能,相比于完整模型降低了14.8%。

Ablation studies

We perform three different ablation studies (Table 5). First, we examine the performance of our method without the auxiliary LM objective during fine-tuning. We observe that the auxiliary objective helps on the NLI tasks and QQP. Overall, the trend suggests that larger datasets benefit from the auxiliary objective but smaller datasets do not. Second, we analyze the effect of the Transformer by comparing it with a single layer 2048 unit LSTM using the same framework. We observe a 5.6 average score drop when using the LSTM instead of the Transformer. The LSTM only outperforms the Transformer on one dataset – MRPC. Finally, we also compare with our transformer architecture directly trained on supervised target tasks, without pre-training. We observe that the lack of pre-training hurts performance across all the tasks, resulting in a 14.8% decrease compared to our full model.

消融实验

我们进行了三个不同的消融研究(表5)。首先,我们考察了在微调期间去掉辅助LM目标后我们方法的性能。我们观察到辅助目标对NLI任务和QQP有帮助。总的来说,这一趋势表明较大的数据集受益于辅助目标,而较小的数据集则不然。其次,我们通过与使用相同框架的单层2048单元LSTM进行比较,分析了Transformer的效果。使用LSTM而不是Transformer时,我们观察到平均分数下降了5.6分。LSTM只在一个数据集(MRPC)上胜过Transformer。最后,我们还与不经预训练、直接在有监督目标任务上训练的Transformer架构进行了比较。我们观察到,缺乏预训练会损害所有任务的表现,与完整模型相比下降了14.8%。

6、Conclusion

引入框架——生成式预训练判别式微调

  • 引入了一个框架,通过生成式预训练判别式微调,在一个通用任务的模型中实现强大的自然语言理解。通过在包含大量连续文本的多样性语料库上进行预训练,模型获得了显著的世界知识和处理长距离依赖性的能力,成功将这些能力迁移到解决判别性任务,如问答、语义相似性评估、蕴涵判断和文本分类,在研究的12个数据集中有9个达到了最先进水平。

无监督训练提升性能

  • 使用无监督(预)训练提升判别性任务性能一直是机器学习研究的重要目标。研究表明,实现显著的性能提升是可能的,并提供了关于哪些模型(Transformers)和数据集(具有长距离依赖性的文本)最适合这种方法的线索。希望这将促进对无监督学习的新研究,不仅适用于自然语言理解,还适用于其他领域,进一步提高我们对无监督学习如何以及何时起作用的理解。

We introduced a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets we study.

Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Our work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets (text with long range dependencies) work best with this approach. We hope that this will help enable new research into unsupervised learning, for both natural language understanding and other domains, further improving our understanding of how and when unsupervised learning works.

我们引入了一个框架,通过生成式预训练和判别式微调,使用与任务无关的单一模型实现强大的自然语言理解。通过在包含长段连续文本的多样化语料库上进行预训练,我们的模型获得了重要的世界知识和处理长距离依赖关系的能力,并成功地将其迁移到问答、语义相似性评估、蕴涵判断和文本分类等判别式任务上,改进了我们研究的12个数据集中9个的最新技术水平。

使用无监督(预)训练来提高辨别任务的性能一直是机器学习研究的一个重要目标。我们的工作表明,实现显著的性能提升确实是可能的,并提供了关于哪种模型(Transformers)和数据集(具有长期依赖关系的文本)最适合这种方法的提示。我们希望这将有助于对自然语言理解和其他领域的无监督学习进行新的研究,进一步提高我们对无监督学习如何以及何时起作用的理解

References

[1] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.

[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pages 153–160, 2007.

[4] L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo. The fifth pascal recognizing textual entailment challenge. In TAC, 2009.

[5] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. EMNLP, 2015.

[6] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.

[7] S. Chaturvedi, H. Peng, and D. Roth. Story comprehension for predicting what happens next. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1603–1614, 2017.

[8] D. Chen and C. Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 740–750, 2014.

[9] Z. Chen, H. Zhang, X. Zhang, and L. Zhao. Quora question pairs. https://data.quora.com/First-Quora- Dataset-Release-Question-Pairs, 2018.

[10] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.

[11] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.

[12] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence representations from natural language inference data. EMNLP, 2017.

[13] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087, 2015.

[14] W. B. Dolan and C. Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.

[15] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

[16] S. Gray, A. Radford, and K. P. Diederik. Gpu kernels for block-sparse weights. 2017.

[17] Z. He, S. Liu, M. Li, M. Zhou, L. Zhang, and H. Wang. Learning entity representation for entity disam- biguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 30–34, 2013.

[18] D. Hendrycks and K. Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415, 2016.

[19] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693– 1701, 2015.

[20] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[21] J. Howard and S. Ruder. Universal language model fine-tuning for text classification. Association for Computational Linguistics (ACL), 2018.

[22] Y. Jernite, S. R. Bowman, and D. Sontag. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557, 2017.

[23] Y. Ji and J. Eisenstein. Discriminative improvements to distributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 891–896, 2013.

[24] F. Jiao, S. Wang, C.-H. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 209–216. Association for Computational Linguistics, 2006.

[25] T. Khot, A. Sabharwal, and P. Clark. Scitail: A textual entailment dataset from science question answering.In Proceedings of AAAI, 2018.

[26] Y. Kim. Convolutional neural networks for sentence classification. EMNLP, 2014.

[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[28] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.

[29] N. Kitaev and D. Klein. Constituency parsing with a self-attentive encoder. ACL, 2018.

[30] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. Race: Large-scale reading comprehension dataset from examinations. EMNLP, 2017.

[31] G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. ICLR, 2018.

[32] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.

[33] P. Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology, 2005.

[34] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer. Generating wikipedia by summarizing long sequences. ICLR, 2018.

[35] X. Liu, K. Duh, and J. Gao. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888, 2018.

[36] L. Logeswaran and H. Lee. An efficient framework for learning sentence representations. ICLR, 2018.

[37] I. Loshchilov and F. Hutter. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 2017.

[38] B. McCann, J. Bradbury, C. Xiong, and R. Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308, 2017.

[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[40] N. Mostafazadeh, M. Roth, A. Louis, N. Chambers, and J. Allen. Lsdsem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51, 2017.

[41] K. Nigam, A. McCallum, and T. Mitchell. Semi-supervised text classification using em. Semi-Supervised Learning, pages 33–56, 2006.

[42] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[43] M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power. Semi-supervised sequence tagging with bidirec- tional language models. ACL, 2017.

[44] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextual- ized word representations. NAACL, 2018.

[45] Y. Qi, D. S. Sachan, M. Felix, S. J. Padmanabhan, and G. Neubig. When and why are pre-trained word embeddings useful for neural machine translation? NAACL, 2018.

[46] A. Rahman and V. Ng. Resolving complex cases of definite pronouns: the winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777–789. Association for Computational Linguistics, 2012.

[47] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. EMNLP, 2016.

[48] P. Ramachandran, P. J. Liu, and Q. V. Le. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683, 2016.

[49] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, pages 1137–1144, 2007.

[50] M. Rei. Semi-supervised multitask learning for sequence labeling. ACL, 2017.

[51] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

[52] T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kocˇisky`, and P. Blunsom. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.

[53] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

[54] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.

[55] S. Srinivasan, R. Arora, and M. Riedl. A simple and effective approach to the story cloze test. arXiv preprint arXiv:1803.05547, 2018.

[56] S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079, 2018.

[57] J. Suzuki and H. Isozaki. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. Proceedings of ACL-08: HLT, pages 665–673, 2008.

[58] Y. Tay, L. A. Tuan, and S. C. Hui. A compare-propagate architecture with alignment factorization for natural language inference. arXiv preprint arXiv:1801.00102, 2017.

[59] Y. Tay, L. A. Tuan, and S. C. Hui. Multi-range reasoning for machine comprehension. arXiv preprint arXiv:1803.09074, 2018.

[60] J. Tian, Z. Zhou, M. Lan, and Y. Wu. Ecnu at semeval-2017 task 1: Leverage kernel-based traditional nlp features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 191–197, 2017.

[61] Y. Tsvetkov. Opportunities and challenges in working with low-resource languages. CMU, 2017.

[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

[63] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.

[64] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[65] A. Warstadt, A. Singh, and S. R. Bowman. Corpus of linguistic acceptability. http://nyu-mll.github.io/cola, 2018.

[66] A. Williams, N. Nangia, and S. R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. NAACL, 2018.

[67] Y. Xu, J. Liu, J. Gao, Y. Shen, and X. Liu. Towards human-level machine reading comprehension: Reasoning and inference with multiple strategies. arXiv preprint arXiv:1711.04964, 2017.

[68] D. Yu, L. Deng, and G. Dahl. Roles of pre-training and fine-tuning in context-dependent dbn-hmms for real-world speech recognition. In Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.

[69] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, volume 1, page 6, 2017.

[70] X. Zhu. Semi-supervised learning literature survey. 2005.

[71] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27, 2015.
