1. Introduction

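In recent years, large-scale pre-trained language models (PLMs), represented by the BERT and GPT families, have achieved great success across all areas of NLP. This article collects PLM-related papers published since BERT and GPT appeared, selects 163 representative works based on citation counts, and provisionally groups them into six categories: surveys, benchmark datasets, PLM design, PLM analysis, efficient PLMs, and PLM usage.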

The paper list compiled here is also available on GitHub and will be updated continuously. Follows and Stars are welcome.


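Wherever possible, each paper is followed by links to its PDF, code implementation, and project homepage, so that readers can explore the work further.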

2. Surveys

  1. "Pre-trained models for natural language processing: A survey". Science China Technological Sciences(2020)

  2. "Which *BERT? A Survey Organizing Contextualized Encoders". EMNLP(2020)

  3. "A Primer in BERTology: What We Know About How BERT Works". TACL(2020) 

  4. "From static to dynamic word representations: a survey". International Journal of Machine Learning and Cybernetics(2020) 

  5. "Overview of the Transformer-based Models for NLP Tasks". 2020 15th Conference on Computer Science and Information Systems (FedCSIS) 

  6. "A Survey on Contextual Embeddings". arXiv(2020) 

  7. "The NLP Cookbook: Modern Recipes for Transformer Based Deep Learning Architectures". IEEE Access(2021) 

  8. "Pre-Trained Models: Past, Present and Future". arXiv(2021) 

  9. "A Survey of Transformers". arXiv(2021) 

3. Benchmark Datasets

  1. XNLI: "XNLI: Evaluating Cross-lingual Sentence Representations". EMNLP(2018) 

  2. GLUE: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". ICLR(2019)

  3. SuperGLUE: "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems". NeurIPS(2019) 

  4. CLUE: "CLUE: A Chinese Language Understanding Evaluation Benchmark". COLING(2020) 

  5. XTREME: "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization". ICML(2020) 

  6. XGLUE: "XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation". EMNLP(2020) 

  7. DialoGLUE: "DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue". arXiv(2020) 

4. PLM Design

4.1 General Design

  1. GPT: "Improving Language Understanding by Generative Pre-Training". OpenAI(2018) 

  2. GPT-2: "Language Models are Unsupervised Multitask Learners". OpenAI(2019) 

  3. BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL(2019) 

  4. XLNet: "XLNet: Generalized Autoregressive Pretraining for Language Understanding". NeurIPS(2019) 

  5. SBERT: "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". ACL(2019) 

  6. UniLM: "Unified Language Model Pre-training for Natural Language Understanding and Generation". NeurIPS(2019) 

  7. MASS: "MASS: Masked Sequence to Sequence Pre-training for Language Generation". ICML(2019) 

  8. Chinese-BERT-wwm: "Pre-Training with Whole Word Masking for Chinese BERT". arXiv(2019) 

  9. "Cloze-driven Pretraining of Self-attention Networks". EMNLP(2019) 

  10. "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model". Workshop on Methods for Optimizing and Evaluating Neural Language Generation(2019) 

  11. GPT-3: "Language Models are Few-Shot Learners". arXiv(2020) 

  12. T5: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". JMLR(2020) 

  13. BART: "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension". ACL(2020) 

  14. Poly-encoders: "Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring". ICLR(2020) 

  15. SpanBERT: "SpanBERT: Improving Pre-training by Representing and Predicting Spans". TACL(2020) 

  16. ERNIE 2.0: "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding". AAAI(2020) 

  17. SemBERT: "Semantics-Aware BERT for Language Understanding". AAAI(2020) 

  18. "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks". TACL(2020) 

  19. ProphetNet: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training". EMNLP(2020)

  20. UniLMv2: "UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training". ICML(2020) 

  21. MacBERT: "Revisiting Pre-Trained Models for Chinese Natural Language Processing". EMNLP(2020) 

  22. MPNet: "MPNet: Masked and Permuted Pre-training for Language Understanding". arXiv(2020) 

  23. DeBERTa: "DeBERTa: Decoding-enhanced BERT with Disentangled Attention". ICLR(2021)

  24. PALM: "PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation". EMNLP(2020) 

4.2 Knowledge Enhancement

  1. ERNIE(Baidu): "ERNIE: Enhanced Representation through Knowledge Integration". arXiv(2019) 

  2. KnowBert: "Knowledge Enhanced Contextual Word Representations". EMNLP(2019) 

  3. ERNIE(Tsinghua): "ERNIE: Enhanced Language Representation with Informative Entities". ACL(2019) 

  4. COMET: "COMET: Commonsense Transformers for Automatic Knowledge Graph Construction". ACL(2019) 

  5. K-BERT: "K-BERT: Enabling Language Representation with Knowledge Graph". AAAI(2020) 

  6. WKLM: "Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model". ICLR(2020) 

  7. LUKE: "LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention". EMNLP(2020) 

  8. K-Adapter: "K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters". ICLR(2021) 

  9. KEPLER: "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation". TACL(2021) 

4.3 Multilingual

  1. XLM: "Cross-lingual Language Model Pretraining". arXiv(2019) 

  2. "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond". TACL(2019) 

  3. UDify: "75 Languages, 1 Model: Parsing Universal Dependencies Universally". EMNLP(2019) 

  4. Unicoder: "Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks". EMNLP(2019) 

  5. XLM-R: "Unsupervised Cross-lingual Representation Learning at Scale". ACL(2020) 

  6. "Multilingual Alignment of Contextual Word Representations". ICLR(2020) 

  7. mBART: "Multilingual Denoising Pre-training for Neural Machine Translation". TACL(2020) 

  8. mT5: "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer". NAACL(2021) 

  9. InfoXLM: "InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training". NAACL(2021) 

4.4 Multimodal

  1. ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks". NeurIPS(2019)

  2. LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". EMNLP(2019) 

  3. VideoBERT: "VideoBERT: A Joint Model for Video and Language Representation Learning". ICCV(2019)

  4. MulT: "Multimodal Transformer for Unaligned Multimodal Language Sequences". ACL(2019) 

  5. VisualBERT: "VisualBERT: A Simple and Performant Baseline for Vision and Language". arXiv(2019) 

  6. B2T2: "Fusion of Detected Objects in Text for Visual Question Answering". EMNLP(2019) 

  7. VL-BERT: "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". ICLR(2020) 

  8. Unicoder-VL: "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". AAAI(2020) 

  9. VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA". AAAI(2020) 

  10. UNITER: "UNITER: UNiversal Image-TExt Representation Learning". ECCV(2020) 

  11. Oscar: "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks". ECCV(2020) 

  12. "12-in-1: Multi-Task Vision and Language Representation Learning". CVPR(2020) 

  13. ActBERT: "ActBERT: Learning Global-Local Video-Text Representations". CVPR(2020) 

  14. VLN: "Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks". CVPR(2020) 

  15. VILLA: "Large-Scale Adversarial Training for Vision-and-Language Representation Learning". arXiv(2020) 

  16. ImageBERT: "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data". arXiv(2020) 

  17. ALIGN: "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". ICML(2021) 

  18. ClipBERT: "Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling". CVPR(2021) 

  19. DALL·E: "Zero-Shot Text-to-Image Generation". arXiv(2021) 

  20. CLIP: "Learning Transferable Visual Models From Natural Language Supervision". arXiv(2021) 

4.5 Information Retrieval

  1. ORQA: "Latent Retrieval for Weakly Supervised Open Domain Question Answering". ACL(2019) 

  2. REALM: "REALM: Retrieval-Augmented Language Model Pre-Training". arXiv(2020) 

  3. RAG: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". NeurIPS(2020) 

  4. DPR: "Dense Passage Retrieval for Open-Domain Question Answering". EMNLP(2020) 

5. PLM Analysis

5.1 Knowledge

  1. "What Does BERT Look at? An Analysis of BERT’s Attention". BlackboxNLP(2019)

  2. "BERT Rediscovers the Classical NLP Pipeline". ACL(2019) 

  3. "How Multilingual is Multilingual BERT?". ACL(2019) 

  4. "A Structural Probe for Finding Syntax in Word Representations". NAACL(2019) 

  5. "Language Models as Knowledge Bases?". EMNLP(2019) 

  6. "What Does BERT Learn about the Structure of Language?". ACL(2019) 

  7. "Linguistic Knowledge and Transferability of Contextual Representations". NAACL(2019) 

  8. "Assessing BERT's Syntactic Abilities". arXiv(2019) 

  9. "Probing Neural Network Comprehension of Natural Language Arguments". ACL(2019)

  10. "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings". EMNLP(2019) 

  11. "Visualizing and Measuring the Geometry of BERT". NeurIPS(2019) 

  12. "Designing and Interpreting Probes with Control Tasks". EMNLP(2019) 

  13. "Open Sesame: Getting inside BERT’s Linguistic Knowledge". BlackboxNLP(2019) 

  14. "What do you learn from context? Probing for sentence structure in contextualized word representations". ICLR(2019) 

  15. "Commonsense Knowledge Mining from Pretrained Models". EMNLP(2019) 

  16. "Do NLP Models Know Numbers? Probing Numeracy in Embeddings". EMNLP(2019) 

  17. "On the Cross-lingual Transferability of Monolingual Representations". ACL(2020) 

  18. "Cross-Lingual Ability of Multilingual BERT: An Empirical Study". ICLR(2020) 

  19. "What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models". TACL(2020) 

  20. "How Much Knowledge Can You Pack Into the Parameters of a Language Model?". EMNLP(2020) 

  21. "How Can We Know What Language Models Know?". TACL(2020) 

  22. "oLMpics - On What Language Model Pre-training Captures". TACL(2020)

  23. "Information-Theoretic Probing with Minimum Description Length". EMNLP(2020) 

  24. "Inducing Relational Knowledge from BERT". AAAI(2020) 

  25. AutoPrompt: "AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts". EMNLP(2020) 

  26. "Emergent linguistic structure in artificial neural networks trained by self-supervision". PNAS(2020) 

  27. "Evaluating Commonsense in Pre-Trained Language Models". AAAI(2020) 

5.2 Robustness

  1. "Universal Adversarial Triggers for Attacking and Analyzing NLP". EMNLP(2019) 

  2. "Pretrained Transformers Improve Out-of-Distribution Robustness". ACL(2020) 

  3. BERT-ATTACK: "BERT-ATTACK: Adversarial Attack Against BERT Using BERT". EMNLP(2020) 

  4. "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment". AAAI(2020) 

5.3 Sparsity

  1. "Are Sixteen Heads Really Better than One?". NeurIPS(2019) 

  2. "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned". ACL(2019) 

  3. "Revealing the Dark Secrets of BERT". EMNLP(2019) 

  4. "The Lottery Ticket Hypothesis for Pre-trained BERT Networks". NeurIPS(2020) 

  5. "When BERT Plays the Lottery, All Tickets Are Winning". EMNLP(2020) 

5.4 Others

  1. "Scaling Laws for Neural Language Models". arXiv(2020)

  2. "Extracting Training Data from Large Language Models". arXiv(2020) 

6. Efficient PLMs

6.1 Model Training

  1. RoBERTa: "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv(2019) 

  2. "Efficient Training of BERT by Progressively Stacking". ICML(2019) 

  3. Megatron-LM: "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism". arXiv(2019) 

  4. ELECTRA: "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators". ICLR(2020) 

  5. "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes". ICLR(2020) 

  6. GShard: "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". arXiv(2020) 

  7. Admin: "Understanding the Difficulty of Training Transformers". EMNLP(2020) 

  8. ZeRO: "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models". SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

  9. Switch Transformers: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity". arXiv(2021) 

6.2 Model Compression

  1. DistilBERT: "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". arXiv(2019) 

  2. PKD: "Patient Knowledge Distillation for BERT Model Compression". EMNLP(2019) 

  3. "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks". arXiv(2019) 

  4. Q8BERT: "Q8BERT: Quantized 8Bit BERT". 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019 

  5. ALBERT: "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". ICLR(2020) 

  6. TinyBERT: "TinyBERT: Distilling BERT for Natural Language Understanding". EMNLP(2020) 

  7. LayerDrop: "Reducing Transformer Depth on Demand with Structured Dropout". ICLR(2020)

  8. Q-BERT: "Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT". AAAI(2020) 

  9. MobileBERT: "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices". ACL(2020) 

  10. "Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning". 5th Workshop on Representation Learning for NLP(2020) 

  11. MiniLM: "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers". arXiv(2020) 

  12. FastBERT: "FastBERT: a Self-distilling BERT with Adaptive Inference Time". ACL(2020) 

  13. DeeBERT: "DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference". ACL(2020) 

7. PLM Usage

7.1 Two-Stage

  1. "Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks". arXiv(2018) 

  2. "How to Fine-Tune BERT for Text Classification?". CCL(2019) 

  3. "Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks". ACL(2020) 

  4. "Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?". ACL(2020) 

7.2 Multi-Task

  1. MT-DNN: "Multi-Task Deep Neural Networks for Natural Language Understanding". ACL(2019) 

  2. "BAM! Born-Again Multi-Task Networks for Natural Language Understanding". ACL(2019) 

  3. "Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding". arXiv(2019) 

7.3 Adapter

  1. "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning". ICML(2019) 

  2. Adapter: "Parameter-Efficient Transfer Learning for NLP". ICML(2019) 

7.4 Prompt

  1. PET: "Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference". EACL(2021) 

  2. "It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners". NAACL(2021) 

  3. "Prefix-Tuning: Optimizing Continuous Prompts for Generation". arXiv(2021)

  4. LM-BFF: "Making Pre-trained Language Models Better Few-shot Learners". ACL(2021) 

  5. "What Makes Good In-Context Examples for GPT-3?". arXiv(2021) 

  6. "The Power of Scale for Parameter-Efficient Prompt Tuning". arXiv(2021) 

7.5 Others

  1. "To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks". RepL4NLP(2019) 

  2. "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models". NAACL(2019) 

  3. "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping". arXiv(2020) 

  4. SMART: "SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization". EMNLP(2020) 

  5. "Revisiting Few-sample BERT Fine-tuning". ICLR(2021) 


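A pre-trained language model (PLM) is a general-purpose language model trained on a large-scale corpus that can then be applied to a wide range of natural language processing tasks. The emergence of PLMs has greatly advanced the field and made them one of the most active research directions of recent years.

The history of PLMs can be traced back to Word2vec (2013) and GloVe (2014). Both are built on static word vectors: each word receives a single vector regardless of where it appears, so neither can capture word order or context-dependent meaning. In 2018, a team at Google proposed BERT (Bidirectional Encoder Representations from Transformers), which uses the Transformer architecture to learn from the context on both sides of each token and achieved strong results on many NLP tasks. BERT opened a new era for PLMs.

Improvements and extensions to BERT followed quickly. For example, XLNet uses permutation-based autoregressive pre-training to further improve performance; RoBERTa trains on more data and longer sequences, which further improves generalization; and ELECTRA replaces masked-token prediction with replaced token detection, in which a discriminator learns to tell original tokens from tokens substituted by a small generator, which makes pre-training more sample-efficient.

Besides the BERT family, there are other PLMs such as GPT (Generative Pre-trained Transformer), which is geared toward text generation, and T5 (Text-to-Text Transfer Transformer), which casts every task as text-to-text. These models perform well on the tasks they target and are convenient for practical applications.

However, training a PLM consumes large amounts of compute and time, which puts it out of reach for most users. For this reason, major organizations release their pre-trained model parameters for direct use. These parameters can be applied quickly to many NLP tasks, greatly reducing the time and resources needed for training.

In short, PLMs are an important advance in natural language processing and provide strong support for a wide range of NLP tasks. As the technology develops, their performance and range of applications will continue to expand.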
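To make the bidirectional objective mentioned above for BERT more concrete, the following is a minimal sketch of how a masked-language-model training example can be built. The 15% selection rate and the 80/10/10 replacement split follow the BERT paper; the function name `mask_tokens`, the toy vocabulary, and the per-position independent sampling are simplifying assumptions for illustration, not the original implementation.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Build one masked-LM example (hypothetical helper, illustration only).

    Returns (corrupted, labels). labels[i] is the original token at each
    position the model must predict, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
# A higher mask_prob than 0.15 is used here only so the short toy
# sentence is likely to contain some selected positions.
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

Because the model sees the whole corrupted sequence at once, it can use the tokens on both sides of a selected position when predicting the original token, which is what distinguishes this objective from left-to-right language modeling.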


