Classic NLP Papers Series (2) -- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This post introduces the BERT model, a deep bidirectional pre-training technique for natural language processing. BERT works in two stages, pre-training and fine-tuning: it first learns general language representations from large-scale unlabeled text and is then optimized for specific NLP tasks, which markedly improves downstream performance. The post discusses unsupervised feature-based methods, fine-tuning methods, and transfer learning from supervised data, and walks through BERT's pre-training and fine-tuning procedures.


First, the link to the original paper:

https://arxiv.org/pdf/1810.04805.pdf

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Unlike recent language representation models, BERT pre-trains deep bidirectional representations from unlabeled text by conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned well with just one additional output layer, avoiding extensive task-specific architecture modifications while producing state-of-the-art models for a wide range of tasks, such as question answering and language inference.

BERT is conceptually simple and empirically powerful. It achieves new state-of-the-art results on eleven natural language processing tasks. (The remainder of the abstract lists results on several classic benchmarks, demonstrating the model's strength through the new scores and the absolute improvements over prior results.)
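To make the "just one additional output layer" idea concrete, below is a minimal sketch of the fine-tuning setup, assuming PyTorch and the Hugging Face `transformers` library (neither is mentioned in the paper); the model name, the use of the pooled [CLS] vector, and the classification head are illustrative choices, not the paper's prescribed recipe.

```python
# Minimal sketch: a pre-trained BERT encoder plus one task-specific output layer.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertForSentenceClassification(nn.Module):
    def __init__(self, num_labels: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)    # pre-trained encoder
        # The single additional output layer mentioned in the abstract:
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the pooled [CLS] representation as the sentence summary.
        return self.classifier(outputs.pooler_output)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSentenceClassification(num_labels=2)
batch = tokenizer(["BERT is conceptually simple."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])  # shape: (1, 2)
```

During fine-tuning, the parameters of the pre-trained encoder and the new output layer are updated jointly on the downstream task's labeled data.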

1 Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
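The practical difference between the two strategies is which parameters get trained downstream. The rough sketch below is my own illustration (it reuses `torch` and the `model` instance from the previous snippet, and the learning rates are typical choices rather than values from this paper): the feature-based route freezes the pre-trained encoder and trains only the task-specific layers on top of its representations, while the fine-tuning route updates all pre-trained parameters together with the small task head.

```python
import torch

# Feature-based (ELMo-style): freeze the pre-trained encoder and train only the
# task-specific layers; the encoder's outputs act as additional input features.
for p in model.bert.parameters():
    p.requires_grad = False
feature_based_optim = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)

# Fine-tuning (GPT/BERT-style): unfreeze everything and update all pre-trained
# parameters together with the small task-specific head.
for p in model.bert.parameters():
    p.requires_grad = True
fine_tuning_optim = torch.optim.Adam(model.parameters(), lr=2e-5)
```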

We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
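The "every token can only attend to previous tokens" restriction corresponds to a lower-triangular (causal) mask on the self-attention scores. The toy example below, assuming PyTorch, illustrates that constraint; it is not code from either paper.

```python
import torch

T = 5                                               # sequence length
scores = torch.randn(T, T)                          # unnormalized attention scores
causal_mask = torch.tril(torch.ones(T, T)).bool()   # True on and below the diagonal
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn = scores.softmax(dim=-1)
# Row i of `attn` has non-zero weight only on columns 0..i: position i sees
# left context only. BERT's masked LM objective removes this restriction, so
# every layer can condition on both directions.
print(attn)
```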

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a "next sentence prediction" task that jointly pre-trains text-pair representations.
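The MLM objective is straightforward to sketch: corrupt the input by masking a random subset of tokens and compute the loss only at the masked positions. The toy function below, assuming PyTorch, is my own simplification (the paper masks 15% of tokens, and its full recipe sometimes keeps or swaps the selected token instead of always inserting [MASK]); the [MASK] id used here matches the bert-base-uncased vocabulary.

```python
import torch

MASK_ID = 103        # id of [MASK] in the bert-base-uncased vocabulary
IGNORE = -100        # label value for positions the loss should skip

def mask_tokens(input_ids: torch.Tensor, mask_prob: float = 0.15):
    """Randomly mask a fraction of tokens; labels keep the original ids only
    at masked positions, so the model must predict them from both-side context."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = IGNORE                  # no loss on unmasked tokens
    corrupted = input_ids.clone()
    corrupted[masked] = MASK_ID               # replace selected tokens with [MASK]
    return corrupted, labels

ids = torch.randint(1000, 2000, (1, 12))      # a fake batch of token ids
corrupted, labels = mask_tokens(ids)
# `corrupted` is fed to the bidirectional encoder; a cross-entropy loss with
# ignore_index=IGNORE is computed only where labels != IGNORE. For the next
# sentence prediction task, sentence pairs are additionally labeled IsNext /
# NotNext and classified from the pooled [CLS] representation.
```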

The contributions of our paper are as follows:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.
