BERT
- Published by Google in 2018
- Bidirectional Encoder Representations from Transformers
- Two phases: pre-training and fine-tuning (a fine-tuning sketch follows this list)
- Uses the Transformer, proposed in Attention Is All You Need (Google, 2017), in place of RNNs
- BERT combines ideas from several earlier models: (1) predicting a word from its context - Word2Vec CBOW; (2) bidirectional language modeling - ELMo; (3) the Transformer instead of an RNN - GPT (Generative Pre-Training)
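As an illustration of the fine-tuning phase, here is a minimal sketch using the Hugging Face transformers library (the library, checkpoint name, and the two-class task are assumptions of this example, not part of the original notes): a small classification head is placed on top of the pre-trained encoder and everything is trained end-to-end on the downstream labels.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained encoder and attach a fresh 2-class classification head
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One labeled example for the downstream task (toy data, illustrative only)
inputs = tokenizer("the man went to the store", return_tensors="pt")
labels = torch.tensor([1])

# Forward pass returns the task loss; backprop updates BERT weights plus the new head
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```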
BERT - Masked Language Model
BERT is pre-trained with a masked language model objective, inspired by the Cloze task.
Randomly select 15% of the tokens in each sequence for prediction (acting like dropout, so the model cannot over-rely on particular tokens); each selected token is then handled as follows (a sketch follows this list):
- 80% of the time, replace it with "[MASK]"
- 10% of the time, keep the original word
- 10% of the time, replace it with a random word
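A minimal sketch of this masking procedure, assuming whitespace-tokenized input and a toy vocabulary for the random replacements (both are illustrative assumptions, not BERT's actual WordPiece pipeline):

```python
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["the", "man", "store", "milk", "penguin", "birds"]  # stand-in for the real vocabulary

def mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0)):
    """Select ~15% of positions; of those, 80% -> [MASK], 10% -> random word, 10% -> unchanged."""
    out = list(tokens)
    labels = [None] * len(tokens)            # None means "not selected, nothing to predict"
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                  # the model is trained to predict the original token here
            r = rng.random()
            if r < 0.8:
                out[i] = MASK_TOKEN          # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(TOY_VOCAB)  # 10%: replace with a random word
            # remaining 10%: keep the original token unchanged
    return out, labels

masked, labels = mask_tokens("the man went to the store to buy a gallon of milk".split())
```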
BERT - Next Sentence Prediction
To learn sentence relationships, BERT is also trained to predict whether the second sentence actually follows the first (IsNext) or is a random sentence from the corpus (NotNext):
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
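A minimal sketch of how such sentence pairs could be assembled (the toy corpus and whitespace splitting are assumptions of this example; the real pipeline samples from a large document corpus and applies WordPiece tokenization):

```python
import random

def make_nsp_example(sent_a, next_sent, corpus_sentences, rng=random.Random(0)):
    """Build one next-sentence-prediction example:
    50% of the time keep the true next sentence (IsNext),
    otherwise substitute a random corpus sentence (NotNext)."""
    if rng.random() < 0.5:
        sent_b, label = next_sent, "IsNext"
    else:
        sent_b, label = rng.choice(corpus_sentences), "NotNext"
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    return tokens, label

corpus = ["penguin are flightless birds", "it was raining outside"]
tokens, label = make_nsp_example("the man went to the store",
                                 "he bought a gallon of milk",
                                 corpus)
```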