Recent Advances in Pre-trained Language Models

zzz_qing

Published 2023-04-23 23:08:40


Tags: language models, deep learning, artificial intelligence

Article link: https://blog.csdn.net/zzz_qing/article/details/130332770


（一）Background knowledge

（二）The Problems of PLMs

1. Data scarcity in downstream tasks

2. PLMs are too big, and they are still getting bigger

（三）The Solutions of Those Problems

1. Labeled Data Scarcity → Data-Efficient Fine-tuning

2. PLMs Are Gigantic → Reducing the Number of Parameters

（一）Background knowledge

Pre-trained Language Models

Neural Language Models: A neural network that defines the probability over sequences of words.

How are these language models trained? Given an incomplete sentence, predict the rest of the sentence.

Training a language model is self-supervised learning: the training targets come from the text itself, so no human labels are needed.
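As a rough sketch of what "defines the probability over sequences of words" means, the standard chain-rule factorization and the matching training loss can be written as follows. The symbols $w_t$ (the $t$-th word), $T$ (sequence length), and $\theta$ (model parameters) are notation introduced here for illustration, not taken from the original notes.

```latex
% Probability of a sequence, factorized word by word
P_\theta(w_1, \dots, w_T) = \prod_{t=1}^{T} P_\theta(w_t \mid w_1, \dots, w_{t-1})

% Self-supervised training objective: negative log-likelihood of the observed text
\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log P_\theta(w_t \mid w_1, \dots, w_{t-1})
```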

There are two kinds of pre-trained language models:

① Autoregressive Language Models (ALMs): Complete the sentence given its prefix.

A transformer-based ALM consists of many stacked transformer layers.
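To make "complete the sentence given its prefix" concrete, here is a minimal sketch of a greedy completion loop. The table `NEXT_TOKEN_PROBS` is a hand-written stand-in for a trained model and its values are made up; it also conditions only on the last token, whereas a real transformer-based ALM conditions on the whole prefix.

```python
# Hypothetical next-token distributions standing in for a trained ALM.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"</s>": 1.0},
}


def complete(prefix, max_new_tokens=10):
    """Greedily append the most probable next token until </s> or the limit."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:  # the toy table has no entry for this token
            break
        next_token = max(dist, key=dist.get)
        tokens.append(next_token)
        if next_token == "</s>":
            break
    return tokens


completion = complete(["<s>", "the"])
```

Each generated token is fed back as input for the next step, which is what "autoregressive" refers to.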

② Masked Language Models (MLMs): Use the unmasked words to predict the masked word.
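A small sketch of how an MLM training example might be built: some tokens are replaced by a mask symbol, and the original tokens at those positions become the prediction targets. The function name `mask_tokens` and the hand-picked positions are assumptions for illustration; real MLM pre-training chooses the masked positions at random.

```python
MASK = "[MASK]"


def mask_tokens(tokens, positions):
    """Return (masked input, targets) where targets maps position -> original token."""
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


masked_input, targets = mask_tokens(["the", "cat", "sat", "down"], [1])
```

The model sees `masked_input` and is trained to recover the tokens in `targets` from the unmasked context on both sides.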

"Pre-trained" here means that a neural language model is trained on a large corpus. Pre-trained models fall into two families:

Autoregressive pre-trained: the GPT series (GPT, GPT-2, GPT-3)
MLM-based pre-trained: the BERT series (BERT, RoBERTa, ALBERT)

The benefits of pre-training, fine-tuning, and the GPT and BERT models are covered in my notes on Self-supervised Learning, so I won't repeat them here.

（二）The Problems of PLMs

1. Data scarcity in downstream tasks

A large amount of labeled data is not easy to obtain for each downstream task.

2. PLMs are too big, and they are still getting bigger

With standard fine-tuning, each downstream task needs its own copy of the model.

Inference takes too long, and the models consume too much space.

（三）The Solutions of Those Problems

1. Labeled Data Scarcity → Data-Efficient Fine-tuning

Prompt Tuning: by converting the data points in the dataset into natural language prompts, we make it easier for the model to know what it should do.

Core idea: give the model cues so that it knows what task we are doing. Format the downstream task as a language modeling task, using pre-defined templates to turn each data point into a natural language prompt.

In prompt tuning, we need:

A prompt template: convert data points into a natural language prompt.

A PLM: perform language modeling.
A verbalizer: A mapping between the label and the vocabulary.
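The three ingredients above can be sketched for a sentiment task as follows. The template text, the label words in `VERBALIZER`, and the scores in `toy_scores` are all made up for illustration; `toy_scores` stands in for the scores a PLM would assign to candidate words at the `[MASK]` position.

```python
# Verbalizer: maps each label to a word in the vocabulary (hypothetical choice).
VERBALIZER = {"positive": "great", "negative": "terrible"}


def build_prompt(review):
    """Prompt template: turn a data point into a natural language prompt."""
    return f"{review} It was [MASK]."


def predict_label(mask_word_scores):
    """Pick the label whose verbalizer word scores highest at the mask position."""
    return max(
        VERBALIZER,
        key=lambda label: mask_word_scores.get(VERBALIZER[label], float("-inf")),
    )


prompt = build_prompt("The plot was gripping.")
toy_scores = {"great": 2.3, "terrible": -1.1, "okay": 0.4}  # stand-in for PLM output
label = predict_label(toy_scores)
```

Because the task is phrased as filling in a word, the PLM reuses its language modeling ability instead of learning a classifier head from scratch.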

Prompt tuning vs. standard fine-tuning

The following describes how prompts help training under different degrees of data scarcity.

Few-shot learning: We have some labeled training data.

Semi-supervised learning: We have some labeled training data and a large amount of unlabeled data.

Pattern-Exploiting Training (PET):

Step 1: Use different prompts and verbalizers to prompt-tune different PLMs on the labeled dataset.

Step 2: Predict the unlabeled dataset and combine the predictions from different models.

Step 3: Use a PLM with a classifier head to train on the soft-labeled dataset.
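Step 2 can be sketched as turning several models' predictions into one soft label per unlabeled example. The version below simply averages class probabilities with equal weights, which is an assumption made here for simplicity; the notes do not specify how the predictions are combined.

```python
def combine_predictions(per_model_probs):
    """Average class probabilities from several prompt-tuned models into a soft label."""
    num_models = len(per_model_probs)
    num_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / num_models
        for c in range(num_classes)
    ]


# Made-up predictions of three prompt-tuned models for one unlabeled example.
predictions = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
soft_label = combine_predictions(predictions)
```

The resulting soft labels keep the ensemble's uncertainty, and the final classifier in Step 3 is trained on them rather than on hard labels.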

Zero-shot inference: inference on the downstream task without any training data.

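If there is no training data at all, we need a model that can perform zero-shot inference on downstream tasks.

GPT-3 shows that zero-shot inference (with a task description) is feasible when the model is large enough. GPT-3 predicts the answer from only a natural language description of the task; no gradient updates are performed.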
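A zero-shot prompt in this style is just the task description followed by the query, with no solved examples. The sketch below assumes a translation task and a `=>` separator; the helper name `zero_shot_prompt` is hypothetical.

```python
def zero_shot_prompt(task_description, query):
    """Build a prompt containing only a task description and the query."""
    return f"{task_description}\n{query} =>"


prompt = zero_shot_prompt("Translate English to French:", "cheese")
```

The model is expected to continue the text after `=>`; its parameters stay unchanged throughout.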

2. PLMs Are Gigantic → Reducing the Number of Parameters

Pre-train a large model, but use a smaller model for the downstream tasks

Share the parameters among the transformer layers
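A back-of-the-envelope sketch of why sharing parameters across layers shrinks the model. It counts only the attention and feed-forward weight matrices of a transformer layer, and assumes a feed-forward hidden size of 4 times the model size; biases, layer norms, and embeddings are ignored, so the numbers are approximate.

```python
def layer_params(hidden_size, ffn_multiplier=4):
    """Approximate weight count of one transformer layer (biases and norms ignored)."""
    attention = 4 * hidden_size * hidden_size  # query, key, value, output projections
    feed_forward = 2 * hidden_size * (ffn_multiplier * hidden_size)  # two linear maps
    return attention + feed_forward


def encoder_params(hidden_size, num_layers, share_layers):
    """Total layer weights, with or without sharing one layer's weights across the stack."""
    per_layer = layer_params(hidden_size)
    return per_layer if share_layers else per_layer * num_layers


unshared = encoder_params(768, 12, share_layers=False)
shared = encoder_params(768, 12, share_layers=True)
```

With sharing, the stack still applies 12 layers at inference time, so this saves storage but not computation.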

Parameter-Efficient Fine-tuning: Use a small number of parameters for each downstream task

Fine-tuning = modifying the hidden representation based on a PLM

① Adapter: Use special submodules to modify hidden representations

Adapters: small trainable submodules inserted in transformers.

All downstream tasks share the PLM; the adapters in each layer and the classifier heads are the task-specific modules.

During fine-tuning, only update the adapters and the classifier head.
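A minimal sketch of an adapter, assuming the common bottleneck design: project the hidden vector down, apply a nonlinearity, project back up, and add the result to the original hidden vector. The sizes, the ReLU nonlinearity, and the zero initialization of `w_up` are choices made here for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def adapter(hidden, w_down, w_up):
    """Bottleneck adapter with a residual connection: h + W_up relu(W_down h)."""
    bottleneck = relu(matvec(w_down, hidden))
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


hidden_size, bottleneck_size = 4, 2
w_down = [[0.1] * hidden_size for _ in range(bottleneck_size)]  # trainable
w_up = [[0.0] * bottleneck_size for _ in range(hidden_size)]  # trainable, zero init
h = [1.0, -2.0, 0.5, 3.0]
out = adapter(h, w_down, w_up)  # identical to h while w_up is all zeros
adapter_params = 2 * hidden_size * bottleneck_size  # biases ignored
```

Only `w_down` and `w_up` would be updated for a given task, while the PLM weights that produced `h` stay frozen and shared.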

② LoRA: Use special submodules to modify hidden representations!
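As a sketch of LoRA, assuming its usual formulation: the frozen weight matrix W is left untouched and a trainable low-rank product B·A is added alongside it, so the output becomes W x + scale · B A x. The matrices, sizes, and `scale` value below are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Frozen linear map plus a trainable low-rank update: W x + scale * B (A x)."""
    base = matvec(w_frozen, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


d_in, d_out, rank = 4, 3, 1
w_frozen = [  # pre-trained weights, never updated
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
lora_a = [[0.5, 0.5, 0.5, 0.5]]  # rank x d_in, trainable
lora_b = [[0.0], [0.0], [0.0]]  # d_out x rank, trainable, zero init
x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(x, w_frozen, lora_a, lora_b)  # equals W x while B is zero

full_params = d_in * d_out  # weights touched by full fine-tuning
lora_params = rank * (d_in + d_out)  # weights trained by LoRA
```

The saving grows with matrix size: for realistic dimensions and a small rank, `lora_params` is a small fraction of `full_params`.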

③ Prefix Tuning: Use special submodules to modify hidden representations!

④ Soft Prompting: Prepend the prefix embedding at the input layer

Soft Prompts: vectors (can be initialized from some word embeddings)

Hard Prompts: words (that are originally in the vocabulary)
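Soft prompting can be sketched as concatenating a few trainable vectors in front of the token embeddings before they enter the model. The embedding size, the number of soft prompt vectors, and all values below are made up for illustration.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Put the trainable soft prompt vectors in front of the input embeddings."""
    return soft_prompt + token_embeddings


embedding_size = 3
soft_prompt = [[0.01] * embedding_size, [0.02] * embedding_size]  # trainable vectors
token_embeddings = [  # frozen embeddings of the actual input words
    [0.5, 0.1, -0.2],
    [0.3, 0.3, 0.3],
    [-0.4, 0.0, 0.9],
]
model_input = prepend_soft_prompt(soft_prompt, token_embeddings)
trainable_params = len(soft_prompt) * embedding_size
```

Unlike a hard prompt, these vectors need not correspond to any word in the vocabulary, and they are the only parameters updated for the task.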

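Parameter-efficient fine-tuning has three benefits:

① It drastically reduces the number of task-specific parameters.

② It is less prone to overfitting the training data and gives better out-of-domain performance.

③ With fewer parameters to fine-tune, it is a good candidate when training on small datasets.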
