ELECTRA: Efficiently Learning an Encoder that Classifies Token Replacements Accurately

Recent advances in NLP suggest training language models with either a causal language modeling objective or a denoising autoencoding objective (e.g. the masked language modeling objective). The framework proceeds with self-supervised pre-training of the model on one of these objectives, followed by fine-tuning on specific downstream tasks. Models like BERT, RoBERTa, XLNet, ALBERT, T5, etc. are trained with such objectives and have achieved state-of-the-art results on their respective benchmarks.


In this article, we are going to discuss a rather unique approach to language model pre-training proposed by Google AI in the paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. We'll first glance through the existing pre-training approaches and then discuss ELECTRA's approach in depth.


Note that these models implement the Transformer (Vaswani et al.) architecture. For more on Transformers, consider giving this article a read.


Existing Pre-training Approaches

Figure: Existing pre-training approaches (source: Google AI Blog)

These approaches can be broadly classified as:


  1. Language Modeling or Causal Language Modeling Objective (left): In this objective, the language model is trained in an autoregressive setting, i.e. for a sequence of tokens we maximize the probability of token x_t occurring at position t, given all the tokens preceding it in the sequence. Formally, maximize: P(x_t | x_(i < t)).


  2. Denoising Autoencoding or, loosely, the Masked Language Modeling Objective (right): In this objective, the language model is trained in an autoencoding setting, i.e. for a sequence of tokens we maximize the probability of token x_t occurring at position t, given the corrupted (partially masked) sequence x_hat. Formally, maximize: P(x_t | x_hat). A toy sketch contrasting the two settings follows this list.

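To make the two objectives concrete, here is a minimal, purely illustrative sketch of how training examples are built in each setting. The token IDs and the MASK id are made up for illustration and are not tied to any particular tokenizer:

```python
# Toy illustration of how training pairs differ between the two objectives.
tokens = [5, 12, 7, 3, 9]   # a toy input sequence of token IDs
MASK = 0                    # hypothetical [MASK] token id

# 1) Causal language modeling: predict each token from the tokens before it,
#    i.e. maximize P(x_t | x_(i < t)).
causal_inputs = tokens[:-1]   # [5, 12, 7, 3]
causal_targets = tokens[1:]   # [12, 7, 3, 9]

# 2) Masked language modeling: corrupt some positions and predict only those,
#    i.e. maximize P(x_t | x_hat) for the masked positions t.
masked_positions = [1, 3]
x_hat = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
mlm_targets = {i: tokens[i] for i in masked_positions}

print(causal_inputs, causal_targets)   # [5, 12, 7, 3] [12, 7, 3, 9]
print(x_hat, mlm_targets)              # [5, 0, 7, 0, 9] {1: 12, 3: 3}
```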

OpenAI GPT, GPT-2, and XLNet are trained on the Language Modeling objective, while BERT and RoBERTa are trained on the Denoising Autoencoding objective.


ELECTRA

Figure: Replaced token detection (source: Google AI Blog)

The ELECTRA paper proposes a Replaced Token Detection objective. Instead of masking the inputs and training the model to predict the correct tokens, we replace some of the original tokens in the sequence with tokens that are not correct but still make some sense in the given sequence (e.g. in the animation above, ‘the’ and ‘cooked’ are replaced by ‘a’ and ‘ate’, which, although incorrect, still make sense), and then train the model to classify each token as ‘original’ or ‘replaced’.


The model consists of a Generator and a Discriminator network, as in Generative Adversarial Networks (GANs). However, unlike in GANs, the training procedure is NOT adversarial. We will go through a few reasons why ELECTRA is not a GAN later in this article.


The Generator

This is essentially a small masked language model. The inputs are first corrupted with [MASK] tokens, and the generator is then trained to predict the correct tokens from the original sequence (or plausible tokens that make sense, as discussed earlier).

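As a quick, hedged illustration (using the Hugging Face transformers library and the publicly released google/electra-small-generator checkpoint; the example sentence is my own, chosen to mirror the animation above), the generator simply fills in [MASK] positions:

```python
import torch
from transformers import ElectraTokenizerFast, ElectraForMaskedLM

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-generator")
generator = ElectraForMaskedLM.from_pretrained("google/electra-small-generator")

# Corrupt the input with a [MASK] token, then let the generator propose a token.
text = "The chef [MASK] the meal."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = generator(**inputs).logits

# Locate the masked position and take the most likely token there.
mask_position = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_position].argmax().item()
print(tokenizer.decode([predicted_id]))  # a plausible token, e.g. "cooked" or "ate"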

Moreover, the reason for having a small generator is the computational cost of the model:


If the generator and discriminator are the same size, training ELECTRA would take around twice as much compute per step as training only with masked language modeling


ELECTRA Paper


Formally,


Equation: the masked language modeling (MLM) objective (image from the paper)
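A hedged reconstruction of that objective, following the paper's notation (m denotes the set of masked-out positions, x^masked the corrupted input, and p_G the generator's output distribution):

```latex
\mathcal{L}_{\mathrm{MLM}}(\mathbf{x}, \theta_G)
  = \mathbb{E}\left[ \sum_{i \in \mathbf{m}} -\log p_G\!\left(x_i \mid \mathbf{x}^{\mathrm{masked}}\right) \right]
```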

The Discriminator

This is essentially a Transformer encoder, trained to maximize the likelihood of classifying a replaced token as ‘replaced’ and an original token as ‘original’.

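As another hedged illustration (again using Hugging Face transformers and the public google/electra-small-discriminator checkpoint; the corrupted sentence is my own), the discriminator emits one "original vs. replaced" score per token:

```python
import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# "ate" replaces the original "cooked"; the discriminator should flag it as replaced.
corrupted = "The chef ate the meal."
inputs = tokenizer(corrupted, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits  # one logit per token

predictions = (torch.sigmoid(logits[0]) > 0.5).long()
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())
for token, label in zip(tokens, predictions):
    print(f"{token:>8}  {'replaced' if label == 1 else 'original'}")
```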

Formally,


Equation: the replaced token detection objective (image from the paper)
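A hedged reconstruction, where x^corrupt is the input with generator samples substituted at the masked positions and D(x^corrupt, t) is the discriminator's probability that the token at position t is original:

```latex
\mathcal{L}_{\mathrm{Disc}}(\mathbf{x}, \theta_D)
  = \mathbb{E}\left[ \sum_{t=1}^{n}
      -\mathbb{1}\!\left(x_t^{\mathrm{corrupt}} = x_t\right) \log D\!\left(\mathbf{x}^{\mathrm{corrupt}}, t\right)
      -\mathbb{1}\!\left(x_t^{\mathrm{corrupt}} \neq x_t\right) \log\!\left(1 - D\!\left(\mathbf{x}^{\mathrm{corrupt}}, t\right)\right)
    \right]
```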

Combined Model

Figure: An overview of ELECTRA (image from the paper)

The combined model is pre-trained with both the generator and the discriminator: the generator on the masked language modeling objective and the discriminator on the Replaced Token Detection objective.


Equation: the combined pre-training objective (image from the paper)
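A hedged sketch of the combined loss, minimized over the pre-training corpus X with a weight lambda on the discriminator term:

```latex
\min_{\theta_G,\, \theta_D} \;
  \sum_{\mathbf{x} \in \mathcal{X}}
    \mathcal{L}_{\mathrm{MLM}}(\mathbf{x}, \theta_G)
    + \lambda \, \mathcal{L}_{\mathrm{Disc}}(\mathbf{x}, \theta_D)
```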

The generator is discarded after the pre-training and the discriminator is used for fine-tuning on downstream tasks.


Hence, the Discriminator is ELECTRA.


How is ELECTRA not a GAN?

Although the model architecture and/or training objective may be reminiscent of GANs, ELECTRA is not one! The following are some peculiarities of ELECTRA that contrast with GANs:


  1. If the generator happens to generate the correct token, that token is considered “real” instead of “fake”.

  2. The generator is trained with maximum likelihood rather than being trained adversarially to trick the discriminator.

  3. Since we sample tokens from the generator's output and feed them to the discriminator, it is impossible to backpropagate the discriminator's loss through this sampling step, which would be necessary for adversarial training.


Conclusion

In this article, we covered a new approach to language model pre-training. We also discussed the resemblance between ELECTRA and GANs, and how the two differ from each other.


Here is a link to the GitHub repository for ELECTRA


Here is an API reference to implement ELECTRA using huggingface transformers

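For completeness, here is a minimal, hedged sketch (my own example, not taken from the repository or the API docs) of fine-tuning the pre-trained discriminator on a downstream classification task with Hugging Face transformers:

```python
import torch
from transformers import ElectraTokenizerFast, ElectraForSequenceClassification

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "google/electra-small-discriminator", num_labels=2
)

# A toy sentiment batch; in practice this comes from the downstream dataset.
batch = tokenizer(["a delightful movie", "a tedious movie"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # cross-entropy loss over the two classes
outputs.loss.backward()                  # then step an optimizer as usual
print(outputs.loss.item())
```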

Translated from: https://medium.com/@rojagtap/electra-efficiently-learning-an-encoder-that-classifies-token-replacements-accurately-59253abd5f25
