[Course Summary] Day 23: Large Model Training Strategies (BERT and GLM)

Preface

In the previous two chapters we looked at how large models are trained; the base model (the foundation model) is trained with a fill-in-the-blank (cloze) strategy. In this chapter we read the BERT and GLM papers to understand the concrete pre-training strategies the two models use.

Resources

  • BERT paper: https://arxiv.org/pdf/1810.04805
  • GLM paper: https://arxiv.org/pdf/2103.10360

BERT Training Strategy

Background

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model proposed by Google in 2018.

Contribution
BERT established the pre-train / fine-tune paradigm for NLP: researchers can pre-train on large-scale unlabeled data and thereby greatly reduce the dependence on labeled data for downstream tasks.

Reading the Paper

Abstract

Paper original (No. 1)

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

Paper translation (No. 1)

We introduce a new language representation model called BERT (Bidirectional Encoder Representations from Transformers), i.e., a bidirectional encoder representation model based on Transformers. Unlike some recent language representation models, BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in every layer.
As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks (such as question answering and language inference), without substantial task-specific architecture changes.

Notes

  • Bidirectional context: a token in the sequence attends to its left and right context at the same time, rather than only to the left as in autoregressive models (see the sketch after this list).
# Example: a piece of text
    A B C D E F G
# 1. With unidirectional (autoregressive) attention, token D can only attend to A, B, C (and itself);
# 2. With bidirectional attention, D attends to A, B, C, D, E, F, G simultaneously.
  • Fine-tuning: the statement above captures the training philosophy of today's large models: first train a general-purpose model that is not tied to any particular task, then adapt it by simply adding one output layer and fine-tuning.
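
To make the contrast concrete, here is a minimal sketch (plain NumPy, my own illustration rather than anything from the paper) that builds the two kinds of attention masks for the 7-token example above and shows what token D may attend to under each:

import numpy as np

tokens = ["A", "B", "C", "D", "E", "F", "G"]
n = len(tokens)

# Causal (unidirectional) mask: token i may attend to positions 0..i only.
causal_mask = np.tril(np.ones((n, n), dtype=int))

# Bidirectional mask (BERT-style): every token may attend to every position.
bidirectional_mask = np.ones((n, n), dtype=int)

d = tokens.index("D")
print("D attends to (causal):       ", [t for t, ok in zip(tokens, causal_mask[d]) if ok])
print("D attends to (bidirectional):", [t for t, ok in zip(tokens, bidirectional_mask[d]) if ok])
# Causal:        ['A', 'B', 'C', 'D']
# Bidirectional: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
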
Introduction

Paper original (No. 2)

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

Paper translation (No. 2)

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that take the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces only minimal task-specific parameters and is trained on the downstream task by simply fine-tuning all pre-trained parameters. The two strategies share the same objective function during pre-training, and both use unidirectional language models to learn general language representations.

Notes

  • There are two training strategies: feature-based and fine-tuning-based.
  • The feature-based strategy is task-specific: the pre-trained representations are fed as extra features into a separately designed, task-specific architecture.
  • The fine-tuning strategy keeps the model architecture unchanged and only changes the data (plus a small output layer); see the sketch below for the contrast.
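
A minimal PyTorch-style sketch of the contrast, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint (the linear task head is purely illustrative):

import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)       # illustrative task-specific head

# Feature-based: freeze the pre-trained encoder and train only the task head,
# treating BERT's hidden states as fixed additional features.
for p in bert.parameters():
    p.requires_grad = False
feature_based_params = list(classifier.parameters())

# Fine-tuning: keep the architecture as-is and update ALL parameters
# (encoder + small output layer) on the downstream task's labeled data.
for p in bert.parameters():
    p.requires_grad = True
fine_tuning_params = list(bert.parameters()) + list(classifier.parameters())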

Paper original (No. 3)

We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

Paper translation (No. 3)

We argue that current techniques restrict the power of pre-trained representations, especially for the fine-tuning approach. The main limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training. For example, OpenAI GPT uses a left-to-right architecture, so in the Transformer's self-attention layers (Vaswani et al., 2017) every token can only attend to the tokens before it. Such restrictions are sub-optimal for sentence-level tasks and can be very harmful when fine-tuning-based approaches are applied to token-level tasks such as question answering, where incorporating context from both directions is crucial.

Paper original (No. 4)

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers.
BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model” (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pretrain a deep bidirectional Transformer. In addition to the masked language model, we also use a “next sentence prediction” task that jointly pretrains text-pair representations. The contributions of our paper are as follows:

  • We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This
    is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.
  • We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

Paper translation (No. 4)

In this paper, we improve the fine-tuning-based approach by proposing BERT, which alleviates the unidirectionality constraint mentioned above by using a "masked language model" (MLM) pre-training objective, an idea borrowed from the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original tokens based only on their context. Unlike left-to-right language-model pre-training, the MLM objective lets the representation fuse the left and right context, which allows us to pre-train a deep bidirectional Transformer. Besides the masked language model, we also use a "next sentence prediction" task that jointly pre-trains text-pair representations. The contributions of this paper are as follows:

  • We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), who use unidirectional language models for pre-training, BERT uses masked language models to obtain pre-trained deep bidirectional representations. This also contrasts with Peters et al. (2018a), who use a shallow concatenation of independently trained left-to-right and right-to-left language models.
  • We show that pre-trained representations reduce the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

Related Work

Paper original (No. 5)

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

Paper translation (No. 5)

Figure 1: The overall pre-training and fine-tuning procedure of BERT. Apart from the output layers, the same architecture is used in pre-training and fine-tuning, and the same pre-trained parameters are used to initialize the models for different downstream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g., separating a question from its answer).

BERT

Paper original (No. 6)

We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For finetuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. The question-answering example in Figure 1 will serve as a running example for this section. A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

Paper translation (No. 6)

In this section we introduce BERT and its detailed implementation. The framework has two steps: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over several pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all parameters are then fine-tuned with labeled data from the downstream task. Each downstream task gets its own fine-tuned model, even though they all start from the same pre-trained parameters. The question-answering example in Figure 1 serves as the running example for this section. A distinctive feature of BERT is its unified architecture across tasks: the difference between the pre-trained architecture and the final downstream architecture is minimal.

Model Architecture

Paper original (No. 7)

BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”
In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. We primarily report results on two model sizes: BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024,A=16, Total Parameters=340M).

Paper translation (No. 7)

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has become common and our implementation is almost identical to the original, we omit an exhaustive background description of the architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as "The Annotated Transformer".

In this work we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. We mainly report results for two model sizes: BERT-Base (L=12, H=768, A=12, total parameters = 110M) and BERT-Large (L=24, H=1024, A=16, total parameters = 340M).
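
As a sanity check on the reported sizes, here is a rough back-of-the-envelope estimate (my own sketch; it ignores layer norms and the pooler, and assumes the standard 30,522-token vocabulary, 512 positions and a 4x feed-forward width) that lands close to the paper's 110M / 340M figures:

def approx_bert_params(L, H, vocab=30522, max_pos=512, ffn_mult=4):
    # Note: the number of heads A does not change the parameter count, because the
    # Q/K/V/output projections are H x H matrices regardless of how they are split.
    embeddings = (vocab + max_pos + 2) * H                               # word + position + segment
    attention = 4 * (H * H + H)                                          # Q, K, V, output projections
    ffn = H * (ffn_mult * H) + ffn_mult * H + (ffn_mult * H) * H + H     # two dense layers
    return embeddings + L * (attention + ffn)

print(f"BERT-Base : ~{approx_bert_params(12, 768) / 1e6:.0f}M")   # ~109M
print(f"BERT-Large: ~{approx_bert_params(24, 1024) / 1e6:.0f}M")  # ~334M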

Input/Output Representations

Paper original (No. 8)

To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first
token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.

Paper translation (No. 8)

So that BERT can handle a variety of downstream tasks, our input representation can unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a "sentence" can be an arbitrary span of contiguous text rather than an actual linguistic sentence, and a "sequence" refers to the input token sequence fed to BERT, which may be a single sentence or two sentences packed together.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary. The first token of every sequence is always the special classification token [CLS]; the final hidden state of this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed into a single sequence and distinguished in two ways: first, the two sentences are separated by the special token [SEP]; second, a learned segment embedding is added to every token to indicate whether it belongs to sentence A or sentence B.
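
A quick sketch of what this looks like in practice, assuming the Hugging Face transformers tokenizer and the public bert-base-uncased checkpoint (not part of the paper itself): the tokenizer adds [CLS] and [SEP] and returns the segment ids that distinguish sentence A from sentence B.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("Who proposed BERT?", "BERT was proposed by Google in 2018.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'who', 'proposed', ..., '[SEP]', ..., '[SEP]']
print(enc["token_type_ids"])      # 0 for sentence-A positions, 1 for sentence-B positions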

Task 1: Masked LM

Paper original (No. 9)

Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context. In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input. Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, T_i will be used to predict the original token with cross entropy loss.

Paper translation (No. 9)

Intuitively, a deep bidirectional model is strictly more powerful than either a left-to-right model or a shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, because bidirectional conditioning would let each word indirectly "see itself", so the model could trivially predict the target word in a multi-layer context.

To train deep bidirectional representations, we simply mask a percentage of the input tokens at random and then predict those masked tokens. We call this procedure a "masked LM" (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, just as in a standard language model. In all of our experiments we randomly mask 15% of the WordPiece tokens in each sequence. In contrast to denoising autoencoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

Although this gives us a bidirectional pre-trained model, a downside is that it creates a mismatch between pre-training and fine-tuning, because the [MASK] token never appears during fine-tuning. To mitigate this, we do not always replace a "masked" word with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction; if the i-th token is chosen, it is replaced with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) left unchanged 10% of the time. T_i is then used to predict the original token with a cross-entropy loss.
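
A minimal sketch of the 15% / 80-10-10 masking procedure described above, written as my own illustrative function over a plain list of tokens (not the authors' data pipeline):

import random

def mask_for_mlm(tokens, vocab, mask_rate=0.15):
    """Return (masked_tokens, labels); labels is None where no prediction is made."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:         # only ~15% of positions are predicted
            continue
        labels[i] = tok                          # the original token is the target
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"                 # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)     # 10%: replace with a random token
        # else: 10% keep the token unchanged (it is still predicted)
    return masked, labels

vocab = ["i", "like", "eating", "apples", "movies", "weather"]
print(mask_for_mlm(["i", "like", "eating", "apples", "."], vocab))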

Task 2: Next Sentence Prediction (NSP)

Paper original (No. 10)

Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext),
and 50% of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, C is used for next sentence prediction (NSP). Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.

Paper translation (No. 10)

Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), depend on understanding the relationship between two sentences, which is not directly captured by language modeling. To train a model that understands sentence relationships, we pre-train on a binarized next-sentence-prediction task that can be trivially generated from any monolingual corpus.

Specifically, when choosing sentences A and B for each pre-training example, 50% of the time B is the actual sentence that follows A (labeled IsNext), and 50% of the time it is a random sentence from the corpus (labeled NotNext). As shown in Figure 1, C is used for next sentence prediction (NSP). Despite its simplicity, we show in Section 5.1 that pre-training on this task is very beneficial for both QA and NLI.
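
A small sketch of how the 50/50 IsNext / NotNext pairs can be generated from a document-level corpus (an illustrative toy version, not the released pipeline; a "document" here is just an ordered list of sentences):

import random

def make_nsp_examples(documents):
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                examples.append((doc[i], doc[i + 1], "IsNext"))   # the real next sentence
            else:
                # a random sentence from the corpus (a toy version: it could, rarely,
                # coincide with the true continuation)
                random_doc = random.choice(documents)
                examples.append((doc[i], random.choice(random_doc), "NotNext"))
    return examples

docs = [["I like watching movies.", "This movie is really good."],
        ["The weather is nice today.", "Let's go for a walk."]]
for example in make_nsp_examples(docs):
    print(example)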

Pre-training Data

Paper original (No. 11)

The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

Paper translation (No. 11)

The pre-training procedure largely follows the existing literature on language-model pre-training. The pre-training corpus consists of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. Using a document-level corpus, rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013), is critical for extracting long contiguous sequences.

Summary of the Paper

Pre-training stage

During pre-training, BERT is trained on large amounts of unlabeled text, mainly with the following two tasks:

Task 1: Masked Language Model (MLM)

  • Procedure: randomly select some tokens in the input sentence and replace them with the special [MASK] token; the model's objective is to predict the masked tokens from their context, i.e., fill in the blanks (an inference sketch follows this example).
# Example:

# Input sentence: "I like eating apples."
# After random masking: "I like eating [MASK]."
# Model prediction: the model must predict the masked word, e.g., "apples".
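
To see MLM prediction in action, a pre-trained checkpoint can fill the blank directly. A minimal sketch, assuming the Hugging Face transformers fill-mask pipeline and the public bert-base-uncased model (not something from the paper):

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I like eating [MASK].")[:3]:           # top-3 candidates for the blank
    print(pred["token_str"], round(pred["score"], 3))
# The model ranks plausible completions of the blank by probability.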

Task 2: Next Sentence Prediction (NSP)

  • Procedure: the model receives a pair of sentences and classifies whether sentence B really follows sentence A; the decision is made from the final hidden state of the [CLS] token (the C in Figure 1), using the IsNext / NotNext labels (a usage sketch follows this example).
# Example:
# Input sentence pairs:
#   Sentence A: "I like watching movies."
#   Sentence B1: "This movie is really good."   (is the next sentence)
#   Sentence B2: "The weather is nice."          (is not the next sentence)
# Model prediction: decide whether A and B are related, i.e., whether B1 is the next sentence of A.
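
A usage sketch for the NSP head, assuming the transformers library's BertForNextSentencePrediction class and the bert-base-uncased checkpoint (in that implementation, label 0 corresponds to IsNext and label 1 to NotNext):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "I like watching movies."
for sentence_b in ["This movie is really good.", "The weather is nice."]:
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # shape (1, 2): [IsNext, NotNext]
    print(sentence_b, "-> IsNext probability:", torch.softmax(logits, dim=-1)[0, 0].item())
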
Fine-tuning stage

During fine-tuning, BERT is trained on a specific downstream task, usually by adding just one output layer. The procedure is:

  • Procedure: load the pre-trained BERT model for the target task (e.g., question answering, sentiment analysis) and train it on labeled data (a QA sketch follows this example).
# Example (extractive question answering):
# Input question/context pair:
#   Question: "What is BERT?"
#   Context: "BERT is a powerful language model."
# Model output: the start and end positions of the answer span.
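
A sketch of the extractive-QA setup, assuming transformers' BertForQuestionAnswering: the added output layer produces a start logit and an end logit for every token, and the answer span is read off the argmax positions. The generic bert-base-uncased checkpoint below is only a placeholder; its QA head is untrained, so meaningful answers require fine-tuning first.

import torch
from transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")   # fine-tune before real use

question = "What is BERT?"
context = "BERT is a powerful language model."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)
start = out.start_logits.argmax(dim=-1).item()     # predicted start position of the answer span
end = out.end_logits.argmax(dim=-1).item()         # predicted end position of the answer span
print(tokenizer.decode(inputs["input_ids"][0, start:end + 1]))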

GLM Training Strategy

Background

GLM (General Language Model) is a general-purpose language model proposed by the Tsinghua University / Zhipu AI (智谱AI) team. It aims to improve natural language processing through stronger context understanding and flexible adaptation to different types of tasks.

Contribution
GLM introduces a general language model based on autoregressive blank infilling.

Reading the Paper

Abstract

Paper original (No. 1)

There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of the pretraining frameworks performs the best for all tasks of three main categories including natural language understanding (NLU), unconditional generation, and conditional generation. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25× parameters of BERTLarge, demonstrating its generalizability to different downstream tasks.

Paper translation (No. 1)

There are several types of pre-training architectures, including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of these pre-training frameworks performs best across all three main task categories: natural language understanding (NLU), unconditional generation, and conditional generation. To address this challenge, we propose a General Language Model (GLM) based on autoregressive blank infilling.

GLM improves blank-infilling pre-training by adding 2D positional encodings and allowing spans to be predicted in arbitrary order, which yields performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pre-trained for different types of tasks by varying the number and lengths of the blanks. Across a wide range of NLU, conditional-generation, and unconditional-generation tasks, GLM outperforms BERT, T5, and GPT given the same model size and data, and a single pre-trained model with 1.25x the parameters of BERT-Large achieves the best performance across all of them, demonstrating its generalizability to different downstream tasks.

Introduction

Paper original (No. 2)

In this paper, we propose a pretraining framework named GLM (General Language Model), based on autoregressive blank infilling. We randomly blank out continuous spans of tokens from the input text, following the idea of autoencoding, and train the model to sequentially reconstruct the spans, following the idea of autoregressive pretraining (see Figure 1). While blanking filling has been used in T5 (Raffel et al., 2020) for text-to-text pretraining, we propose two improvements, namely span shuffling and 2D positional encoding. Empirically, we show that with the same amount of parameters and computational cost, GLM significantly outperforms BERT on the SuperGLUE benchmark by a large margin of 4.6% – 5.0% and outperforms RoBERTa and BART when pretrained on a corpus of similar size (158GB). GLM also significantly outperforms T5 on NLU and generation tasks with fewer parameters and data.

Paper translation (No. 2)

In this paper we propose a pre-training framework called GLM (General Language Model), based on autoregressive blank infilling. Following the idea of autoencoding, we randomly blank out contiguous spans of tokens from the input text, and following the idea of autoregressive pre-training, we train the model to reconstruct those spans sequentially (see Figure 1). While blank filling has been used in T5 (Raffel et al., 2020) for text-to-text pre-training, we propose two improvements, namely span shuffling and 2D positional encoding. Empirically, we show that with the same number of parameters and the same computational cost, GLM significantly outperforms BERT on the SuperGLUE benchmark by a large margin of 4.6%-5.0%, and outperforms RoBERTa and BART when pre-trained on a corpus of similar size (158GB). GLM also significantly outperforms T5 on NLU and generation tasks with fewer parameters and less data.

Paper original (No. 3)

Inspired by Pattern-Exploiting Training (PET) (Schick and Schütze, 2020a), we reformulate NLU tasks as manually-crafted cloze questions that mimic human language. Different from the BERT-based models used by PET, GLM can naturally handle multi-token answers to the cloze question via autoregressive blank filling.

Paper translation (No. 3)

Inspired by Pattern-Exploiting Training (PET) (Schick and Schütze, 2020a), we reformulate NLU tasks as manually crafted cloze questions that mimic natural language. Unlike the BERT-based models used by PET, GLM can naturally handle multi-token answers to a cloze question via autoregressive blank infilling.

Paper original (No. 4)

Figure 2: GLM pretraining. (a) The original text is [x1, x2, x3, x4, x5, x6]. Two spans [x3] and [x5, x6] are sampled. (b) Replace the sampled spans with [M] in Part A, and shuffle the spans in Part B. (c) GLM autoregressively generates Part B. Each span is prepended with [S] as input and appended with [E] as output. 2D positional encoding represents inter- and intra-span positions. (d) Self-attention mask. Grey areas are masked out. Part A tokens can attend to themselves (blue frame) but not B. Part B tokens can attend to A and their antecedents in B (yellow and green frames correspond to the two spans). [M] := [MASK], [S] := [START], and [E] := [END].

Paper translation (No. 4)

Figure 2: GLM pre-training.

(a) The original text is [x1, x2, x3, x4, x5, x6]. Two spans, [x3] and [x5, x6], are sampled.
(b) In Part A the sampled spans are replaced with [M]; in Part B the spans are shuffled.
(c) GLM autoregressively generates Part B. Each span is prepended with [S] as input and appended with [E] as output. 2D positional encoding represents inter- and intra-span positions.
(d) Self-attention mask. Grey areas are masked out. Part A tokens can attend to each other (blue frame) but not to Part B. Part B tokens can attend to Part A and to their predecessors within Part B (the yellow and green frames correspond to the two spans). [M] := [MASK], [S] := [START], [E] := [END].
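
A small NumPy sketch of panel (d) (my own illustration, not the authors' code): Part A tokens attend to all of Part A and nothing in Part B, while each Part B token attends to all of Part A plus the Part B tokens at or before its own position.

import numpy as np

part_a = ["x1", "x2", "[M]", "x4", "[M]"]
part_b = ["[S]", "x5", "x6", "[S]", "x3"]        # shuffled spans, each prefixed with [S]
tokens = part_a + part_b
n_a, n = len(part_a), len(tokens)

mask = np.zeros((n, n), dtype=int)               # mask[i, j] = 1 -> token i may attend to token j
mask[:, :n_a] = 1                                # every token sees Part A (bidirectional prefix)
for i in range(n_a, n):
    mask[i, n_a:i + 1] = 1                       # Part B is causal within itself

print(tokens)
print(mask)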

Summary of the Paper

The GLM training procedure:
Step 1: prepare the data and randomly select spans of tokens to mask (a runnable sketch of steps 1 and 2 follows the second example block below).

# Example:

# Original text: [I, like, learning, AI, and, ML]
#                [x1,  x2,       x3, x4,  x5, x6]

Step 2: replace the sampled spans with [M] and split the sequence into two parts

# Sampled spans: "learning" (x3) and "and ML" (x5, x6)
# Part A: [I, like, [M], AI, [M]]
# Part B: [[and, ML], [learning]]    (span order is shuffled)
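
A toy sketch of steps 1 and 2 (my own illustration; the paper samples span lengths from a Poisson distribution with lambda = 3, whereas this version simply takes the span boundaries as given):

import random

def blank_spans(tokens, spans):
    """spans: list of (start, end) index pairs (end exclusive) to blank out."""
    part_b = [tokens[s:e] for s, e in spans]
    part_a, cursor = [], 0
    for s, e in sorted(spans):
        part_a += tokens[cursor:s] + ["[M]"]     # each span is replaced by a single [M]
        cursor = e
    part_a += tokens[cursor:]
    random.shuffle(part_b)                       # spans are predicted in a random order
    return part_a, part_b

tokens = ["I", "like", "learning", "AI", "and", "ML"]     # x1 .. x6
part_a, part_b = blank_spans(tokens, [(2, 3), (4, 6)])    # blank out "learning" and "and ML"
print("Part A:", part_a)    # ['I', 'like', '[M]', 'AI', '[M]']
print("Part B:", part_b)    # e.g. [['and', 'ML'], ['learning']]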

Step 3: 2D positional encoding

  • Procedure: GLM gives every token two position ids. Position 1 (P1) is the token's position within Part A; all tokens of a Part B span share the position of the [M] they replace. Position 2 (P2) is 0 for Part A tokens and counts 1, 2, ... within each Part B span, starting from the [S] token. Together they tell the model where a span belongs and the order of tokens inside it (a sketch that computes these ids follows the table).
#       I   like  [M]  AI  [M]  [S]  and  ML  [S]  learning
# P1    0   1     2    3   4    4    4    4   2    2
# P2    0   0     0    0   0    1    2    3   1    2
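
A sketch that computes the two position-id sequences for this example (my own illustration of the scheme; it takes, for each Part B span, the index of the [M] in Part A that the span replaces):

def glm_positions(part_a, spans_with_mask_pos):
    """spans_with_mask_pos: list of (index of the [M] in part_a, span tokens)."""
    tokens, pos1, pos2 = [], [], []
    for i, tok in enumerate(part_a):              # Part A: P1 = index in Part A, P2 = 0
        tokens.append(tok); pos1.append(i); pos2.append(0)
    for mask_pos, span in spans_with_mask_pos:    # Part B: P1 = position of the [M] being filled,
        for j, tok in enumerate(["[S]"] + span):  # P2 counts 1, 2, ... inside the span
            tokens.append(tok); pos1.append(mask_pos); pos2.append(j + 1)
    return tokens, pos1, pos2

part_a = ["I", "like", "[M]", "AI", "[M]"]
tokens, p1, p2 = glm_positions(part_a, [(4, ["and", "ML"]), (2, ["learning"])])
print(tokens)   # ['I', 'like', '[M]', 'AI', '[M]', '[S]', 'and', 'ML', '[S]', 'learning']
print(p1)       # [0, 1, 2, 3, 4, 4, 4, 4, 2, 2]
print(p2)       # [0, 0, 0, 0, 0, 1, 2, 3, 1, 2]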

Step 4: autoregressive blank infilling (a teacher-forcing sketch follows this walkthrough)

#   Conditioning context (all of Part A): I, like, [M], AI, [M]

#   Round 1
#   Input:  I, like, [M], AI, [M], [S]
#   Output: ..., and

#   Round 2
#   Input:  I, like, [M], AI, [M], [S], and
#   Output: ..., ML

#   Round 3
#   Input:  I, like, [M], AI, [M], [S], and, ML
#   Output: ..., [E]        (first span finished; a new [S] is appended as input)

#   Round 4
#   Input:  I, like, [M], AI, [M], [S], and, ML, [E], [S]
#   Output: ..., learning

#   Round 5
#   Input:  I, like, [M], AI, [M], [S], and, ML, [E], [S], learning
#   Output: ..., [E]        (second span finished)

#   Final sequence: I, like, [M], AI, [M], [S], and, ML, [E], [S], learning, [E]
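
Putting the pieces together, here is a toy sketch of how input and targets line up for this example under teacher forcing: during training all Part B positions are predicted in parallel, and the round-by-round loop above corresponds to inference-time generation. This is my own illustration of the scheme, not the released GLM code.

def glm_training_pair(part_a, spans):
    """Build (model_input, targets); targets is None where no loss is computed."""
    model_input = list(part_a)
    targets = [None] * len(part_a)               # no loss on Part A positions
    for span in spans:                           # spans in generation order
        model_input += ["[S]"] + span            # input side:  [S] + span tokens
        targets += span + ["[E]"]                # target side: span tokens + [E] (shifted by one)
    return model_input, targets

part_a = ["I", "like", "[M]", "AI", "[M]"]
inp, tgt = glm_training_pair(part_a, [["and", "ML"], ["learning"]])
for x, y in zip(inp, tgt):
    print(f"{x:10s} -> {y}")
# Part A positions print "-> None"; the Part B portion reads:
#   [S]       -> and        (each position predicts the next token of its span)
#   and       -> ML
#   ML        -> [E]
#   [S]       -> learning
#   learning  -> [E]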

Summary

  • What BERT and GLM have in common: both are pre-trained on massive amounts of unlabeled data.
  • How BERT and GLM differ:
    • For the fill-in-the-blank pre-training objective, BERT masks individual tokens at random, while GLM masks contiguous spans of tokens.
    • BERT is an encoder-only architecture, while GLM uses a single Transformer stack whose attention mask makes it act as an encoder over Part A and a decoder over Part B (a prefix-decoder style design rather than a GPT-style pure decoder).
  • Through this self-attention mask, GLM connects the encoder-like and decoder-like behaviour in one model and realizes autoregressive blank infilling.
  • Pre-training (PT) is a long, tedious process, but it is where the model builds its foundation.
  • Supervised fine-tuning (SFT) is the follow-up training built on top of pre-training, and its content has to be adapted to the actual application.

Afterword

The way large models are trained feels almost universally applicable. It reminds me of how Master Roshi trains Goku and Krillin in Dragon Ball:

  • The basic training is long, repetitive, and tedious, but it is what lays the foundation.

  • Only once the foundation is solid can you train in a targeted way through real fights, accumulate experience, and eventually surpass yourself.

