ERNIE 3.0: LARGE-SCALE KNOWLEDGE ENHANCED PRE-TRAINING FOR LANGUAGE UNDERSTANDING AND GENERATION
Sun Y, Wang S, Feng S, et al. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2107.02137, 2021.
Keywords: 10-billion-parameter large model, Transformer-XL, knowledge graph
Pre-training incorporates knowledge-graph triples, and the basic model unit changes from the Transformer used in ERNIE 2.0 to Transformer-XL.
The model can be tried out via Baidu Wenxin: https://wenxin.baidu.com/wenxin/ernie
1、Basic characteristics of ERNIE 3.0
(1) Parameter scale: 10 billion
(2) Knowledge graph integration
- Large-scale knowledge-enhanced model: trained on a 4 TB corpus consisting of plain texts and a large-scale knowledge graph
(3) Fuses an auto-regressive network and an auto-encoding network
- Handles both natural language understanding and generation tasks via zero-shot learning, few-shot learning, or fine-tuning
(4) Model performance
- Outperforms state-of-the-art models on 54 Chinese NLP tasks
- The English version took first place on the SuperGLUE benchmark (July 3, 2021), surpassing human performance by 0.8% (90.6% vs. 89.8%)
2、ERNIE 3.0 framework
Continual Multi-Paradigm Unified Pre-training Framework
(1) Universal Representation Module: once pre-training is complete, the universal semantic representation layer is frozen and no longer updated (not even during fine-tuning).
(2) Task-specific Representation Modules: updated when fine-tuning on downstream tasks, which keeps fine-tuning efficient; the NLU and NLG modules do not share parameters.
- NLU-specific representation module (a bidirectional modeling network, auto-encoding)
- NLG-specific representation module (a unidirectional modeling network, auto-regressive)
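The split between a frozen universal backbone and trainable task-specific heads can be illustrated with a toy sketch (class names and parameter counts are hypothetical, not taken from the paper):

```python
# Toy sketch of the ERNIE 3.0 module layout: a shared universal backbone
# whose parameters are frozen after pre-training, plus separate NLU and
# NLG task-specific modules that remain trainable during fine-tuning.
# All names and sizes here are illustrative assumptions.

class Module:
    def __init__(self, name, n_params, trainable=True):
        self.name = name
        self.n_params = n_params
        self.trainable = trainable

class Ernie3Sketch:
    def __init__(self):
        # Large shared backbone: frozen once pre-training is done.
        self.universal = Module("universal_repr", n_params=9_000_000_000,
                                trainable=False)
        # Task-specific modules: NLU (bidirectional, auto-encoding) and
        # NLG (unidirectional, auto-regressive); no parameter sharing.
        self.nlu_head = Module("nlu_repr", n_params=500_000_000)
        self.nlg_head = Module("nlg_repr", n_params=500_000_000)

    def finetune_params(self, task):
        # During fine-tuning only the relevant task head is updated;
        # the frozen backbone is filtered out here.
        head = self.nlu_head if task == "nlu" else self.nlg_head
        return [m for m in (self.universal, head) if m.trainable]

model = Ernie3Sketch()
print([m.name for m in model.finetune_params("nlu")])  # → ['nlu_repr']
```

Because only the small task head is updated, fine-tuning touches a fraction of the 10-billion-parameter model.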
3、Pre-training tasks
Multiple pre-training tasks at different semantic levels are used:
(1) Word-aware: capture lexical information
(2) Structure-aware: capture syntactic information
(3) Knowledge-aware: improve knowledge memorization and reasoning
3.1 Word-aware
(1) Knowledge Masked Language Modeling: same as ERNIE 1.0
(2) Document Language Modeling: following ERNIE-Doc, unidirectional language-model pre-training over longer texts, mainly to improve text-generation ability
3.2 Structure-aware
(1) Sentence Reordering: same as ERNIE 2.0
(2) Sentence Distance: same as ERNIE 2.0
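A sentence-reordering training instance can be sketched as follows (assuming the ERNIE 2.0 recipe: shuffle k segments and classify which of the k! permutations was applied; the function name is illustrative):

```python
# Hedged sketch of building a Sentence Reordering instance: the segments
# of a paragraph are permuted, and the label is the index of the applied
# permutation, so the task is a k!-way classification problem.
import itertools
import random

def make_reordering_example(sentences, seed=0):
    k = len(sentences)
    perms = list(itertools.permutations(range(k)))
    perm = random.Random(seed).choice(perms)       # pick a permutation
    shuffled = [sentences[i] for i in perm]        # shuffled input
    return shuffled, perms.index(perm)             # label in [0, k!)

shuffled, label = make_reordering_example(["s1", "s2", "s3"])
```

The model sees `shuffled` and must predict `label`, from which the original order can be reconstructed.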
3.3 Knowledge-aware
Universal Knowledge-Text Prediction (parallel pre-training of text and knowledge)
(1) Training method: parallel pre-training on massive unsupervised text (unstructured) and a large-scale knowledge graph (structured).
(2) Training corpus: pairs formed from 50 million knowledge-graph triples and the related passages in the 4 TB corpus.
(3) Result: jointly feeding entity relations from the large-scale knowledge graph and large-scale text data into the model for joint masked training promotes information sharing between structured knowledge and unstructured text, greatly improving the model's knowledge memorization and reasoning ability.
(4) Concrete procedure of the parallel text-knowledge pre-training:
Step 1: run a knowledge-graph mining algorithm over a sentence to extract triples.
E.g.: The Nightingale is written by Danish author Hans Christian Andersen.
Step 2: obtain the knowledge-graph triple.
E.g.: <Andersen, Write, Nightingale>
Step 3: concatenate the triple and the original sentence as the model input.
E.g.: Andersen Write Nightingale [SEP] The Nightingale is written by Danish author Hans Christian Andersen [SEP]
To make the model learn knowledge-graph relations, two masking strategies are used:
(1) Remove an entity or the relation from the triple, then predict the masked part of segment A from segment B.
(2) Remove an entity from the sentence, then predict the masked part of segment B from segment A.
The triple (segment A) represents a pair of entities and their relation; the relation carries semantic information (e.g., a logical relation), which is generally regarded as knowledge.
The original sentence (segment B) represents the raw text (plain text).
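The two masking strategies above can be sketched as input construction (the `[MASK]`/`[SEP]` token conventions follow BERT-style formatting; the exact ERNIE 3.0 preprocessing is an assumption):

```python
# Hedged sketch of building a Universal Knowledge-Text Prediction (UKTP)
# input: concatenate the triple (segment A) with the source sentence
# (segment B), masking either a triple element (strategy 1) or an entity
# mention in the sentence (strategy 2).

def build_uktp_input(triple, sentence, mask_side="triple", mask_item=None):
    head, relation, tail = triple
    seg_a = [head, relation, tail]
    seg_b = sentence.split()
    if mask_side == "triple":
        # Strategy 1: mask part of the triple; predict it from the sentence.
        seg_a[seg_a.index(mask_item)] = "[MASK]"
    else:
        # Strategy 2: mask an entity in the sentence; predict it from the triple.
        seg_b = ["[MASK]" if tok == mask_item else tok for tok in seg_b]
    return " ".join(seg_a) + " [SEP] " + " ".join(seg_b) + " [SEP]"

example = build_uktp_input(
    ("Andersen", "Write", "Nightingale"),
    "The Nightingale is written by Danish author Hans Christian Andersen",
    mask_side="triple", mask_item="Write")
print(example)
# → "Andersen [MASK] Nightingale [SEP] The Nightingale is written by
#    Danish author Hans Christian Andersen [SEP]"
```

Masking the relation forces the model to infer "Write" from the plain text; masking "Andersen" in the sentence forces it to use the structured triple.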
4、ERNIE 3.0 training data
(1) Data scale: 4 TB of storage across 11 different categories
(2) Data sources
(2.1) General data:
- Baike, Wikipedia, feed
- Baidu Search (including Baijiahao, Zhidao, Tieba, Experience)
- Webtext, QA-long, QA-short, Poetry & Couplet
(2.2) Domain-specific data: the medical, law, and financial areas, plus the Baidu knowledge graph with more than 50 million facts
5、ERNIE 3.0 training details
(1) Pre-training algorithm
Progressive training improves training stability:
- Vanilla Transformer: learning-rate warm-up strategy
- ERNIE 3.0: progressive learning strategy that gradually increases training factors including the input sequence length, the batch size, the learning rate, and the dropout rate
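A minimal sketch of such a progressive ramp-up, assuming linear interpolation over a warm-up stage (the stage length, start values, and linear shape are assumptions, not the paper's settings):

```python
# Hedged sketch of progressive learning: ramp several training factors
# (sequence length, batch size, learning rate, dropout) together from a
# small starting value to the full target over a warm-up stage.

def progressive_schedule(step, warmup_steps=10_000,
                         seq_len=(128, 512), batch=(1024, 6144),
                         lr=(1e-5, 1e-4), dropout=(0.0, 0.1)):
    frac = min(step / warmup_steps, 1.0)           # progress in [0, 1]
    interp = lambda lo_hi: lo_hi[0] + frac * (lo_hi[1] - lo_hi[0])
    return {
        "seq_len": int(interp(seq_len)),
        "batch_size": int(interp(batch)),
        "learning_rate": interp(lr),
        "dropout": interp(dropout),
    }

print(progressive_schedule(0))       # start: short sequences, small batch
print(progressive_schedule(10_000))  # end: full 512 / 6144 / 1e-4 / 0.1
```

Starting small stabilizes early optimization; by the end of warm-up the run reaches the full settings listed in the parameter section below.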
(2) Data preprocessing
- Deduplication
  - Character level: a character repeated many times in a row is replaced by a single occurrence
  - Paragraph level: consecutive repetitions of the same paragraph are replaced by a single paragraph
  - Document level: duplicate documents are filtered by comparing the sum of the MD5 (Message Digest Algorithm 5) hashes of each document's 3 longest sentences
- Filtering: drop sentences with fewer than 10 words
- Sentence segmentation and word segmentation
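The character-level and document-level heuristics can be sketched like this (the run-length threshold and the naive sentence splitting are simplifying assumptions):

```python
# Hedged sketch of two of the dedup heuristics described in the notes.
import hashlib
import re

def collapse_char_runs(text):
    # Character-level dedup: a character repeated consecutively many
    # times is collapsed to one (the 4+ threshold is an assumption).
    return re.sub(r"(.)\1{3,}", r"\1", text)

def doc_dedup_key(document):
    # Document-level dedup: the sum of the MD5 digests (as integers) of
    # the document's three longest sentences serves as the duplicate key.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    longest3 = sorted(sentences, key=len, reverse=True)[:3]
    return sum(int(hashlib.md5(s.encode()).hexdigest(), 16)
               for s in longest3)

print(collapse_char_runs("soooooo good"))  # → "so good"
```

Two documents sharing the same three longest sentences get the same key, so one of them can be dropped without a full-text comparison.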
(3) Parameter settings
- Total parameters: 10 billion
- Activation function: GeLU
- Maximum context sequence length: 512
- Memory length for language generation: 128
- Batch size: 6144
- Optimizer: Adam with a learning rate of 1e-4
- Training tokens: 375 billion
- Hardware: 384 NVIDIA V100 GPU cards
- Deep-learning framework: PaddlePaddle
- Parameter sharding
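The 128-token memory length refers to the Transformer-XL style segment recurrence used for generation; the cache bookkeeping can be sketched as follows (a toy abstraction with string "hidden states", not a real implementation):

```python
# Hedged sketch of Transformer-XL segment-level recurrence: hidden
# states from the previous segment are cached and prepended to the
# current segment's context, extending the effective context window.

class SegmentMemory:
    def __init__(self, mem_len=128):
        self.mem_len = mem_len
        self.mem = []  # cached hidden states from earlier segments

    def extend(self, hidden_states):
        # Attention context = cached memory + current segment; the cache
        # then keeps only the last `mem_len` states (in a real model,
        # gradients would be stopped through the cache).
        context = self.mem + list(hidden_states)
        self.mem = context[-self.mem_len:]
        return context

mem = SegmentMemory(mem_len=4)
mem.extend(["h1", "h2", "h3"])        # first segment: no memory yet
ctx = mem.extend(["h4", "h5", "h6"])  # second segment also sees h1..h3
print(len(ctx), mem.mem)  # → 6 ['h3', 'h4', 'h5', 'h6']
```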
6、Experiments
54 NLP tasks in total.
6.1、Experiments on Fine-tuning Tasks
(1) Natural Language Understanding Tasks
45 tasks of 14 types, with an average improvement of about 5%; anaphora resolution improved from 69.7% to 95.4%.
- Sentiment Analysis
- Opinion Extraction
- Natural Language Inference
- Winograd Schema Challenge (anaphora resolution)
- Relation Extraction
- Event Extraction
- Semantic Similarity
- Chinese News Classification
- Closed-Book Question Answering
- Named Entity Recognition
- Machine Reading Comprehension
- Legal Documents Analysis
- Cant Understanding
- Document Retrieval
(2) Natural Language Generation Tasks
9 tasks of 7 types, with an average improvement of 7.4%.
- Text Summarization
- Question Generation
- Closed-Book Question Answering
- Math
- Advertisement Generation
- Translation
- Dialogue Generation
(3) LUGE benchmark
- 5.36% improvement
6.2、Experiments on Zero-shot Learning
- NLG: average accuracy improvement of 5.3%
6.3、Experiments on SuperGLUE
- English model
- Surpasses human performance by 0.8% (90.6% vs. 89.8%)
https://super.gluebenchmark.com/leaderboard
7 Key techniques
- The Effectiveness of the Task-specific Representation Modules
- Universal Knowledge-Text Prediction
- Progressive Learning to Speed up Convergence