what is bert?
BERT: Bidirectional Encoder Representations from Transformers, i.e., the encoder of a bidirectional Transformer.
relation to other word embeddings
Word embeddings such as Word2vec are word-level and fixed once training finishes. BERT instead produces sentence-level, contextual representations: the vector for a token depends on its surrounding context.
why bert?
- BERT fundamentally changed the relationship between pre-trained word vectors and downstream NLP tasks;
- it uses Masked LM and Next Sentence Prediction to capture word-level and sentence-level representations respectively, improving the generalization of the embedding model;
- it demonstrates the importance of bidirectional models for representing text features;
- it shows that pre-training removes the need for many heavy, task-specific network architectures;
- the gap between the pre-training architecture and the final downstream architecture is minimal, enabling transfer learning;
- it improved the state of the art on 11 NLP tasks
bert general goal
- designed for pre-training bidirectional representations from unlabelled data
- conditions on both left and right context in all layers
- pre-trained model can be fine-tuned with one additional output layer for many tasks (e.g., NLI, QA, …)
- for many tasks, no modifications to the bert architecture are required
the architecture of bert
BERT: every layer is a bidirectional Transformer, trained with the objective P(w_i | w_1, ..., w_{i-1}, w_{i+1}, ..., w_n).
OpenAI GPT: a left-to-right Transformer;
ELMo: two LSTMs, one left-to-right and one right-to-left, trained independently with the objectives P(w_i | w_1, ..., w_{i-1}) and P(w_i | w_{i+1}, ..., w_n); the two representations are then concatenated.
the core ideas of bert
input & output
input representation:
input embeddings = token embeddings + segment embeddings + position embeddings
Token Embeddings: WordPiece token vectors; the first token is the [CLS] marker, whose final hidden state can be used for downstream classification;
Segment Embeddings: distinguish the two sentences, since pre-training includes not only the LM task but also a classification task that takes a sentence pair as input;
Position Embeddings: unlike the sinusoidal encodings of the original Transformer, these are learned.
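A minimal sketch (in PyTorch, with illustrative sizes rather than a loaded checkpoint) of how the three embeddings are summed into the input representation:

```python
import torch
import torch.nn as nn

# Illustrative sizes close to BERT-base: vocab ~30k, hidden 768, max length 512.
VOCAB_SIZE, HIDDEN, MAX_LEN, NUM_SEGMENTS = 30522, 768, 512, 2

class BertInputEmbeddings(nn.Module):
    def __init__(self):
        super().__init__()
        self.token = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.segment = nn.Embedding(NUM_SEGMENTS, HIDDEN)
        self.position = nn.Embedding(MAX_LEN, HIDDEN)   # learned, not sinusoidal
        self.norm = nn.LayerNorm(HIDDEN)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(x)

emb = BertInputEmbeddings()
token_ids = torch.randint(0, VOCAB_SIZE, (1, 16))    # batch of 1, sequence length 16
segment_ids = torch.zeros(1, 16, dtype=torch.long)   # all tokens belong to sentence A
print(emb(token_ids, segment_ids).shape)             # torch.Size([1, 16, 768])
```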
Sentence markup
[CLS]: start-of-sequence marker
[SEP]: separator between the sentences of a pair
the input is either a single sentence or a sentence pair
WordPiece embeddings (English)
character-level tokens (Chinese): compared with whole words, this keeps the vocabulary small
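A short example (assuming the HuggingFace transformers library and the bert-base-uncased checkpoint) showing how a sentence pair is wrapped with [CLS] and [SEP] and how segment ids separate the two sentences:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("my dog is cute", "he likes playing")

# Roughly: ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', ..., '[SEP]']
# (the exact WordPiece split depends on the vocabulary)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))

# Segment ids: 0 for sentence A (including [CLS] and the first [SEP]), 1 for sentence B
print(enc["token_type_ids"])
```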
why subword?
- Classic word representations cannot handle unseen or rare words well;
- Character embeddings are one solution to the out-of-vocabulary (OOV) problem,
- but they may be too fine-grained and miss some important information;
- Subword units sit between words and characters: not too fine-grained, yet still able to handle unseen or rare words;
- e.g., subword = sub + word
- Three commonly used algorithms:
a. Byte Pair Encoding (BPE)
b. WordPiece
c. Unigram Language Model
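As an illustration of the WordPiece idea, here is a toy greedy longest-match tokenizer over a hand-written vocabulary (the real algorithm also learns the vocabulary from data; everything here is a simplified sketch):

```python
# Continuation pieces are marked with "##", as in BERT's WordPiece vocabulary.
VOCAB = {"sub", "word", "##word", "play", "##ing", "un"}

def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:                        # try the longest match first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate      # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                          # no piece matches: fall back to [UNK]
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece("subword"))   # ['sub', '##word']
print(wordpiece("playing"))   # ['play', '##ing']
```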
pre-train (Task 1): Masked LM
Masked LM: randomly mask 15% of the WordPiece tokens
(1) replace with the [MASK] token (80%)
(2) replace with a random token (10%)
(3) leave unchanged (10%)
Why bidirectional: when the model is applied to a task, it needs information from both the left and the right context, not only the left. (Important departure from previous embedding models: don't train the model to predict the next word, but train it to predict the whole context.)
problem: how can we prevent trivial copying via the self-attention mechanism?
solution: mask 15% of the tokens in the input sequence; train the model to predict these.
problem: masking creates mismatch between pre-training and fine-tuning: [MASK] token is not seen during fine-tuning.
solution:
1. do not always replace masked words with [MASK]; instead, choose 15% of token positions at random for prediction
2. if i-th token is chosen, we replace the i-th token with:
a. the [MASK] token 80% of the time
b. a random token 10% of the time
c. the unchanged i-th token 10% of the time
3. now use T_i, the final hidden vector of the i-th token, to predict the original token with a cross-entropy loss
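A minimal sketch of the 80% / 10% / 10% replacement rule above, written over a list of token ids; the [MASK] id, the vocabulary size, and the -100 "ignore" label (PyTorch's default ignore index for cross-entropy) are assumptions borrowed from common BERT implementations:

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30522  # typical values for bert-base-uncased; treated as assumptions here

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (inputs, labels); labels are -100 except at positions chosen for prediction."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue                                   # ~85%: position not selected, no loss
        labels[i] = tok                                # predict the original token here
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                        # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: replace with a random token
        # else: 10%: keep the observed token unchanged
    return inputs, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7099, 6251])
print(inputs, labels)
```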
why masking?
- if we always use the [MASK] token, the model would not have to learn good representations for the other words.
- if we always use the [MASK] token or a random word, the model would learn that the observed word is never correct.
- if we always use the [MASK] token or the observed word, the model would be biased toward trivially copying.
pre-train (Task 2): Next Sentence Prediction (NSP)
Goal: teach the model to understand relations between sentence pairs, as needed for question answering, natural language inference, etc.
Training data construction (see the sketch below):
50%: B is the actual sentence that follows A (labeled IsNext)
50%: B is a random sentence from the corpus (labeled NotNext)
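A hedged sketch of how IsNext/NotNext pairs could be built from a document-level corpus (function and variable names are illustrative, not from the original implementation):

```python
import random

def make_nsp_pairs(documents):
    """documents: list of documents, each a list of sentences.
    Yields (sentence_a, sentence_b, label) with label 0 = IsNext, 1 = NotNext."""
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                yield doc[i], doc[i + 1], 0                       # actual next sentence (IsNext)
            else:
                other = random.randrange(len(documents) - 1)
                if other >= doc_idx:
                    other += 1                                    # pick a different document
                yield doc[i], random.choice(documents[other]), 1  # random sentence (NotNext)

docs = [["Sentence A1.", "Sentence A2."], ["Sentence B1.", "Sentence B2."]]
for a, b, label in make_nsp_pairs(docs):
    print(label, a, "||", b)
```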
Comparison:
BERT: all pre-trained parameters are transferred to initialize the final task model
others: only sentence embeddings are transferred to the downstream task
The final model achieves 97%-98% accuracy on NSP.
Note: the authors stress that the choice of corpus matters: it should be document-level rather than sentence-level, so the model can learn features of long contiguous sequences.
Fine-tuning
The self-attention mechanism in the Transformer allows BERT to model many downstream tasks simply by swapping in the appropriate inputs and outputs.
Plug the task-specific inputs and outputs into BERT and fine-tune all parameters end to end.
Hyperparameters:
Batch size: 16, 32
Learning rate (Adam): 5e-5, 3e-5, 2e-5
Number of epochs: 2, 3, 4
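A hedged sketch of the fine-tuning setup with the HuggingFace transformers library (assumed available), picking one value from each range above (learning rate 2e-5, 3 epochs); the two-example dataset is a placeholder:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# One extra output layer (the classification head on [CLS]) is added on top of BERT.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # lr from the recommended range

texts, labels = ["a positive example", "a negative example"], [1, 0]   # placeholder data
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt").to(device)
labels = torch.tensor(labels).to(device)

model.train()
for epoch in range(3):                        # 2-4 epochs are typical
    optimizer.zero_grad()
    out = model(**batch, labels=labels)       # all parameters are fine-tuned end to end
    out.loss.backward()
    optimizer.step()
    print(epoch, out.loss.item())
```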
parameters
BERT_BASE (L = 12, H = 768, A = 12, total parameters = 110M)
BERT_LARGE (L = 24, H = 1024, A = 16, total parameters = 340M)
In all cases the feed-forward/filter size is set to 4H, i.e., 3072 for H = 768 and 4096 for H = 1024.
L:the number of layers (transformer blocks)
H: the dimensionality of the hidden layers
A: the number of self-attention heads
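A rough back-of-the-envelope check of the 110M / 340M figures, counting only the embedding matrices, the attention projections, and the feed-forward layers (biases, layer norms, and the pooler are ignored; the vocabulary size of ~30,522 is assumed from the released English models):

```python
def approx_params(L, H, vocab=30522, max_len=512, segments=2):
    embeddings = (vocab + max_len + segments) * H    # token + position + segment embeddings
    attention = 4 * H * H                            # Q, K, V and output projections per layer
    ffn = 2 * H * (4 * H)                            # feed-forward/filter size is 4H
    return embeddings + L * (attention + ffn)

print(f"BERT_BASE  ~ {approx_params(12, 768) / 1e6:.0f}M")    # ~109M, close to the reported 110M
print(f"BERT_LARGE ~ {approx_params(24, 1024) / 1e6:.0f}M")   # ~334M, close to the reported 340M
```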
code
others
what does each layer of bert learn?
ACL 2019:What does BERT learn about the structure of language?
https://hal.inria.fr/hal-02131630/document
Lower layers capture phrase-level structural information;
surface features appear in the lower layers (3-4), syntactic features in the middle layers (6-9), and semantic features in the higher layers (9-12);
subject-verb agreement is handled best in the middle layers (8-9).
bert variants
RoBERTa
static masking -> dynamic masking
drops the sentence-pair NSP task; the input is multiple contiguous sentences
more data, larger batch size, longer training
ALBERT
Parameter reduction (slimming the model):
factorized embeddings: a small dimension E is inserted between the vocabulary size V and the hidden size H
parameters are shared across all layers: attention and FFN
SOP replaces NSP: the negative sample becomes two sentences from the same document in reversed order
BERT masks and predicts 15% of individual tokens; ALBERT predicts n-gram spans, which carry more complete semantic information
training sequence length: 90% of inputs use length 512, whereas BERT uses length 128 for 90% of training
compared with BERT-large: H: 1024 -> 4096, L: 24 -> 12, i.e., from narrow and deep to wide and shallow
Baidu ERNIE
bert vs. GPT
Encoder vs. Decoder
more data: BooksCorpus and Wikipedia vs. BooksCorpus only
[SEP] and [CLS] are learned during pre-training (GPT introduces them only at fine-tuning time)
more words per batch
fine-tuning learning rate: task-specific (5e-5, 4e-5, 3e-5, or 2e-5) vs. 5e-5 for every task
bert & text classification
problems bert can solve
sequence labeling
classification
sentence-pair relations
(it cannot handle generation tasks)
bert text classification approaches
feature extraction + another model (see the sketch after the steps below)
step1
step2
step3
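A hedged sketch of the "feature extraction + another model" route: a frozen pre-trained BERT produces [CLS] vectors and a separate classifier (scikit-learn logistic regression here) is trained on top; the library choices and the tiny in-line dataset are assumptions for illustration only:

```python
import torch
from transformers import BertModel, BertTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()    # frozen feature extractor

def cls_features(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state     # (batch, seq_len, 768)
    return hidden[:, 0, :].numpy()                   # the [CLS] vector of each sentence

texts = ["great movie", "terrible movie", "loved it", "hated it"]   # toy data
labels = [1, 0, 1, 0]

clf = LogisticRegression(max_iter=1000).fit(cls_features(texts), labels)
print(clf.predict(cls_features(["what a wonderful film"])))
```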