Model Architecture
BERT's model architecture is a multi-layer bidirectional Transformer encoder.
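A minimal sketch of such an encoder, using PyTorch's nn.TransformerEncoder as a stand-in for BERT's own implementation (the sizes below are those of BERT-BASE; the variable names are illustrative):

```python
import torch
import torch.nn as nn

# BERT-BASE: L = 12 layers, H = 768 hidden size, A = 12 attention heads
hidden_size, num_layers, num_heads = 768, 12, 12

encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=num_heads,
    dim_feedforward=4 * hidden_size,  # feed-forward size is 4H
    activation="gelu",
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# "Bidirectional" means no causal mask: every token attends to every other token.
embeddings = torch.randn(2, 128, hidden_size)  # toy input of shape (batch, seq_len, H)
hidden_states = encoder(embeddings)            # final hidden vectors, shape (2, 128, 768)
```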
Input/Output Representations
input representation = Token Embeddings + Segment Embeddings + Position Embeddings
Token Embeddings use WordPiece embeddings, which map each token to a fixed-length vector.
Segment Embeddings indicate whether a token belongs to sentence A or sentence B, because sentence pairs are packed together into a single input sequence.
Position Embeddings let BERT learn the position of each token in the sequence.
Output Representation: the final hidden vector of each token (of dimension H).
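A minimal sketch of building the input representation as the sum of the three embeddings, assuming H = 768 as in BERT-BASE (the token ids and variable names below are toy values for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; BERT-BASE uses hidden size H = 768
vocab_size, max_len, type_vocab_size, H = 30522, 512, 2, 768

token_emb    = nn.Embedding(vocab_size, H)       # WordPiece token embeddings
segment_emb  = nn.Embedding(type_vocab_size, H)  # sentence A -> 0, sentence B -> 1
position_emb = nn.Embedding(max_len, H)          # learned position embeddings

# Toy packed pair: [CLS] sentence A [SEP] sentence B [SEP] (ids are made up)
input_ids    = torch.tensor([[101, 2023, 2003, 102, 2008, 2001, 102]])
segment_ids  = torch.tensor([[0,   0,    0,    0,   1,    1,    1  ]])
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

# Input representation = Token + Segment + Position embeddings, shape (batch, seq_len, H)
input_repr = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
```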
Pre-training BERT
Task #1: Masked Language Model (MLM)
Definition: We simply mask some percentage of the input tokens at random, and then predict those masked tokens.
In this case, the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM, and the model is trained to predict the original tokens with a cross-entropy loss.
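A simplified sketch of the masking and loss computation, assuming a 15% masking rate for concreteness and using random tensors in place of a real encoder's hidden states (all sizes and ids are illustrative, and the 80/10/10 replacement details are omitted):

```python
import torch
import torch.nn as nn

# Hypothetical sizes and ids for illustration
vocab_size, H, mask_token_id, mask_prob = 30522, 768, 103, 0.15

input_ids = torch.randint(1000, 2000, (2, 32))   # toy token ids, shape (batch, seq_len)
labels = input_ids.clone()

# Randomly choose positions to mask; unmasked positions get label -100 (ignored by the loss)
mask = torch.rand(input_ids.shape) < mask_prob
labels[~mask] = -100
input_ids[mask] = mask_token_id                  # replace chosen tokens with [MASK]

# Stand-in for the encoder's final hidden vectors over the corrupted input
hidden_states = torch.randn(2, 32, H)

# Output softmax over the vocabulary, trained with cross-entropy on the masked positions only
mlm_head = nn.Linear(H, vocab_size)
logits = mlm_head(hidden_states)                 # shape (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)
```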
Task #2: Next Sentence Prediction (NSP)
We pre-train for a binarized next sentence prediction task (binary classification).
When choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).
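A minimal sketch of how such pairs could be sampled (the corpus and helper function below are made up for illustration; a real implementation would avoid sampling the true next sentence as a NotNext example):

```python
import random

# Toy "document": a list of consecutive sentences (hypothetical data)
corpus = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]

def make_nsp_example(corpus, idx):
    """Build one (sentence A, sentence B, label) example for next sentence prediction."""
    sent_a = corpus[idx]
    if random.random() < 0.5:
        sent_b, label = corpus[idx + 1], "IsNext"          # actual next sentence
    else:
        sent_b, label = random.choice(corpus), "NotNext"   # random sentence from the corpus
    return sent_a, sent_b, label

print(make_nsp_example(corpus, 0))
```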
Fine-tuning BERT
BERT encodes a concatenated text pair with self-attention, which effectively includes bidirectional cross attention between the two sentences.
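A minimal sketch of fine-tuning for sentence-pair classification, assuming an encoder like the one above and using the final hidden vector of the first ([CLS]) token as the aggregate representation (the classifier head, sizes, and label are illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; H = 768 as in BERT-BASE, 2 labels for a binary task
H, num_labels = 768, 2
classifier = nn.Linear(H, num_labels)  # new task-specific layer added on top for fine-tuning

# Stand-in for the final hidden states of a packed "[CLS] sentence A [SEP] sentence B [SEP]" input
hidden_states = torch.randn(1, 32, H)
cls_vector = hidden_states[:, 0]       # final hidden vector of the [CLS] token

logits = classifier(cls_vector)
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))
# During fine-tuning, the encoder and the classifier are trained end-to-end on this loss.
```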