RoBERTa: A Robustly Optimized BERT Pretraining Approach

1. Setup

BERT takes as input a concatenation of two segments (sequences of tokens), x_1, ..., x_N and y_1, ..., y_M.

Segments usually consist of more than one natural sentence.

The two segments are presented as a single input sequence to BERT with special tokens delimiting them: [CLS], x_1, ..., x_N, [SEP], y_1, ..., y_M, [EOS].

M and N are constrained such that M + N < T, where T is a parameter that controls the maximum sequence length during training.
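To make the packing concrete, here is a minimal sketch of how two token-ID segments might be combined into one input sequence. The special-token IDs and the truncation heuristic are illustrative assumptions, not the actual BERT implementation.

```python
# Minimal sketch of packing two segments into one BERT-style input.
# The special-token IDs below are placeholders, not real vocabulary IDs.
CLS_ID, SEP_ID, EOS_ID = 0, 1, 2
T = 512  # maximum sequence length during training

def pack_segments(x, y, max_len=T):
    """Build [CLS] x_1..x_N [SEP] y_1..y_M [EOS], keeping N + M within budget."""
    x, y = list(x), list(y)
    budget = max_len - 3  # reserve room for the three special tokens
    # Simple truncation heuristic (an assumption): trim the longer segment
    # until the pair fits within the budget.
    while len(x) + len(y) > budget:
        longer = x if len(x) >= len(y) else y
        longer.pop()
    return [CLS_ID] + x + [SEP_ID] + y + [EOS_ID]

print(pack_segments([10, 11, 12], [20, 21]))
# -> [0, 10, 11, 12, 1, 20, 21, 2]
```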

2. Architecture

BERT uses the now ubiquitous transformer architecture (Vaswani et al., 2017), which we will not review in detail. We use a transformer architecture with L layers. Each block uses A self-attention heads and hidden dimension H.
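For concreteness, the two model sizes reported in the original BERT paper (Devlin et al., 2019) can be written as a small configuration sketch; the dataclass itself is only illustrative.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    num_layers: int   # L: number of transformer blocks
    num_heads: int    # A: self-attention heads per block
    hidden_size: int  # H: hidden dimension

# Sizes from Devlin et al. (2019).
BERT_BASE = TransformerConfig(num_layers=12, num_heads=12, hidden_size=768)
BERT_LARGE = TransformerConfig(num_layers=24, num_heads=16, hidden_size=1024)
```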

3. Training Objectives

(1) Masked Language Model (MLM)

A random sample of the tokens in the input sequence is selected and replaced with the special token [MASK]. The MLM objective is a cross-entropy loss on predicting the masked tokens. BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token.

In the original implementation, random masking and replacement is performed once in the beginning and saved for the duration of training, although in practice, data is duplicated so the mask is not always the same for every training sentence (see Section 4.1).
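The 15% / 80-10-10 recipe is easy to sketch in code. The snippet below is illustrative rather than the official implementation: MASK_ID and VOCAB_SIZE are placeholder values, and -100 marks positions that are ignored by the loss (a common convention, not something specified in the paper). Applying the function once during preprocessing corresponds to the static masking described above; re-applying it each time a sequence is sampled yields a different mask per epoch.

```python
import random

MASK_ID = 4         # placeholder [MASK] token ID (assumption)
VOCAB_SIZE = 30000  # placeholder vocabulary size (assumption)

def apply_mlm_masking(token_ids, mask_prob=0.15, rng=random):
    """Return (corrupted_ids, labels) following the 15% / 80-10-10 recipe."""
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position not selected for prediction
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:      # uniformly select ~15% of tokens
            labels[i] = tok               # the model must predict the original token
            r = rng.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                corrupted[i] = MASK_ID
            elif r < 0.9:                 # 10%: replace with a random vocabulary token
                corrupted[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: leave the token unchanged
    return corrupted, labels
```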

(2) Next Sentence Prediction (NSP)

NSP is a binary classification loss for predicting whether two segments follow each other in the original text. Positive examples are created by taking consecutive sentences from the text corpus. Negative examples are created by pairing segments from different documents. Positive and negative examples are sampled with equal probability.
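A rough sketch of how such pairs can be drawn is shown below. It is illustrative only and assumes every document contains at least two segments and the corpus holds at least two documents.

```python
import random

def sample_nsp_pair(docs, rng=random):
    """Return (segment_a, segment_b, is_next) for the NSP objective.

    `docs` is a list of documents, each a list of segments (lists of token IDs).
    Positives and negatives are drawn with equal probability.
    """
    doc_idx = rng.randrange(len(docs))
    doc = docs[doc_idx]
    seg_idx = rng.randrange(len(doc) - 1)
    segment_a = doc[seg_idx]
    if rng.random() < 0.5:
        # Positive: the segment that actually follows segment_a in the same document.
        return segment_a, doc[seg_idx + 1], 1
    # Negative: a segment drawn from a different document.
    other_idx = rng.randrange(len(docs) - 1)
    if other_idx >= doc_idx:
        other_idx += 1  # skip the current document
    other_doc = docs[other_idx]
    return segment_a, other_doc[rng.randrange(len(other_doc))], 0
```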

The NSP objective was designed to improve performance on downstream tasks, such as Natural Language Inference (Bowman et al., 2015), which require reasoning about the relationships between pairs of sentences.

4. Optimization

BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.
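As a sketch of this setup in PyTorch (an assumption for illustration; the original implementation used TensorFlow, and AdamW is one common way to realize Adam with 0.01 weight decay), the warmup-then-linear-decay schedule can be expressed with LambdaLR:

```python
import torch

PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 1e-4, 10_000, 1_000_000

model = torch.nn.Linear(768, 768)  # stand-in for the real model

# Adam with beta1=0.9, beta2=0.999, eps=1e-6 and weight decay 0.01
# (AdamW applies the decay in decoupled form; shown here for illustration).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)

def lr_scale(step):
    """Linear warmup over the first 10k steps, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step()
```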

5. Data

BERT is trained on a combination of BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA, which totals 16GB of uncompressed text.

 

PS: This post only covers BERT itself; its variants will be introduced in a follow-up.