Model details
1. Residual connections
2. Layer Normalization
3. Masked multi-head attention
4. Output layer
BN: normalizes the same channel using statistics computed across different samples
LN: normalizes across the different channels (features) of a single sample (see the NumPy sketch below)
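A minimal NumPy sketch that makes the two normalization axes concrete (tensor shape and epsilon are illustrative assumptions, not values from the BERT code):

import numpy as np

x = np.random.randn(8, 16)   # hypothetical activations: (batch_size, hidden_size)

# BN: each feature/channel is normalized with statistics over the batch dimension
x_bn = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)

# LN: each sample is normalized with statistics over its own feature dimension
x_ln = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-5)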
Masked Language Model (Masked LM, MLM)
• 15% of the tokens in each sentence are replaced with [MASK]
• The encoder (Transformer encoder) does not know which tokens it will be asked to predict, or which have been replaced with random words
• This forces it to maintain a distributed contextual representation of every input token (see the masking sketch after this list)
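A simplified Python sketch of the masking rule; the function name is hypothetical and the 80/10/10 split follows the original BERT paper, not the MindSpore data pipeline:

import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Select ~15% of tokens; the model must predict the originals at those positions."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                        # prediction target
            r = random.random()
            if r < 0.8:
                masked[i] = mask_token             # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)   # 10%: replace with a random word
            # else: 10% keep the original token unchanged
    return masked, labels

masked, labels = mask_tokens("my dog is hairy".split(), vocab=["my", "dog", "is", "hairy", "cat"])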
Next Sentence Prediction (NSP)
• Complements the word-level Masked LM objective by teaching the model the relationship between two sentences (see the pair-construction sketch after this list)
• Supports downstream tasks such as question answering and natural language inference
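A minimal sketch of how NSP training pairs are typically built; the helper name is hypothetical and this is not the repo's data pipeline:

import random

def make_nsp_example(sentence_a, true_next, corpus):
    """Return (sentence_a, sentence_b, label): label 1 = IsNext, 0 = NotNext."""
    if random.random() < 0.5:
        return sentence_a, true_next, 1
    return sentence_a, random.choice(corpus), 0   # random sentence from the corpus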
Step 1: Build the word embedding lookup, which returns the word embeddings and the embedding table
word_embeddings, embedding_tables = self.bert_embedding_lookup(input_ids)

self.bert_embedding_lookup = EmbeddingLookup(
    vocab_size=config.vocab_size,
    embedding_size=self.embedding_size,
    embedding_shape=output_embedding_shape,
    use_one_hot_embeddings=use_one_hot_embeddings,
    initializer_range=config.initializer_range)
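Conceptually, the lookup gathers rows from a learned table; a minimal NumPy sketch (the vocabulary size, hidden size, and 0.02 scale stand in for the config values):

import numpy as np

vocab_size, embedding_size = 30522, 768
embedding_table = (np.random.randn(vocab_size, embedding_size) * 0.02).astype(np.float32)

input_ids = np.array([[101, 2023, 2003, 102]])    # (batch_size, seq_length)
word_embeddings = embedding_table[input_ids]      # gather -> (batch_size, seq_length, embedding_size)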
Step 2: Fuse position encodings and token-type (segment) information into the word embeddings
embedding_output = self.bert_embedding_postprocessor(token_type_ids, word_embeddings)

self.bert_embedding_postprocessor = EmbeddingPostprocessor(
    embedding_size=self.embedding_size,
    embedding_shape=output_embedding_shape,
    use_token_type=True,
    token_type_vocab_size=config.type_vocab_size,
    use_one_hot_embeddings=use_one_hot_embeddings,
    initializer_range=0.02,
    max_position_embeddings=config.max_position_embeddings,
    dropout_prob=config.hidden_dropout_prob)
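Conceptually, the post-processor adds token-type (segment) and position embeddings to the word embeddings; a simplified NumPy sketch (the real EmbeddingPostprocessor also applies LayerNorm and dropout, and the sizes below are placeholders):

import numpy as np

batch_size, seq_length, embedding_size = 1, 4, 768
type_vocab_size, max_position_embeddings = 2, 512

word_embeddings = np.zeros((batch_size, seq_length, embedding_size), np.float32)   # from Step 1
token_type_table = (np.random.randn(type_vocab_size, embedding_size) * 0.02).astype(np.float32)
full_position_table = (np.random.randn(max_position_embeddings, embedding_size) * 0.02).astype(np.float32)

token_type_ids = np.array([[0, 0, 1, 1]])              # segment A vs. segment B
embedding_output = (word_embeddings
                    + token_type_table[token_type_ids]   # token-type (segment) embedding
                    + full_position_table[:seq_length])  # learned position embedding, broadcast over batch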
Step 3: Construct the attention mask
attention_mask = self._create_attention_mask_from_input_mask(input_mask)

self._create_attention_mask_from_input_mask = CreateAttentionMaskFromInputMask(config)

def construct(self, input_mask):
    if not self.input_mask_from_dataset:
        input_mask = self.input_mask
    input_mask = self.cast(self.reshape(input_mask, self.shape), mstype.float32)
    attention_mask = self.batch_matmul(self.broadcast_ones, input_mask)
    return attention_mask
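In NumPy terms, the construct above expands a padding mask of shape (batch_size, seq_length) into an attention mask of shape (batch_size, seq_length, seq_length) via a batched matmul with a column of ones:

import numpy as np

input_mask = np.array([[1, 1, 1, 0]], dtype=np.float32)    # 1 = real token, 0 = padding
batch_size, seq_length = input_mask.shape

broadcast_ones = np.ones((batch_size, seq_length, 1), dtype=np.float32)
attention_mask = broadcast_ones @ input_mask.reshape(batch_size, 1, seq_length)
# attention_mask[b, i, j] == 1 only when position j is a real token,
# so every query position i may attend to j; padded columns are masked out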
Step 4: Construct the encoder
# bert encoder
encoder_output = self.bert_encoder(self.cast_compute_type(embedding_output),
                                   attention_mask)
sequence_output = self.cast(encoder_output[self.last_idx], self.dtype)

self.bert_encoder = BertTransformer(
    batch_size=self.batch_size,
    hidden_size=self.hidden_size,
    seq_length=self.seq_length,
    num_attention_heads=config.num_attention_heads,
    num_hidden_layers=self.num_hidden_layers,
    intermediate_size=config.intermediate_size,
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
    use_one_hot_embeddings=use_one_hot_embeddings,
    initializer_range=config.initializer_range,
    hidden_dropout_prob=config.hidden_dropout_prob,
    use_relative_positions=config.use_relative_positions,
    hidden_act=config.hidden_act,
    compute_type=config.compute_type,
    return_all_encoders=True)
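Inside each encoder layer, the mask is consumed by scaled dot-product attention; a simplified single-head NumPy sketch (not the BertTransformer implementation) of how scores at padded positions are suppressed:

import numpy as np

def attention(q, k, v, attention_mask):
    # q, k, v: (batch, seq_length, head_dim); attention_mask: (batch, seq_length, seq_length)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores + (1.0 - attention_mask) * -10000.0        # large negative where mask == 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ v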
Step 5: Construct the pooling layer
# pooler
sequence_slice = self.slice(sequence_output,
                            (0, 0, 0),
                            (self.batch_size, 1, self.hidden_size),
                            (1, 1, 1))
first_token = self.squeeze_1(sequence_slice)
pooled_output = self.dense(first_token)

self.slice = P.StridedSlice()
self.squeeze_1 = P.Squeeze(axis=1)
self.dense = nn.Dense(self.hidden_size, self.hidden_size,
                      activation="tanh",
                      weight_init=TruncatedNormal(config.initializer_range))
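What the pooler computes, in plain NumPy: slice out the hidden state of the first token ([CLS]) and feed it through a tanh dense layer (the weight values below are placeholders, not trained parameters):

import numpy as np

batch_size, seq_length, hidden_size = 2, 4, 768
sequence_output = np.random.randn(batch_size, seq_length, hidden_size).astype(np.float32)

weight = (np.random.randn(hidden_size, hidden_size) * 0.02).astype(np.float32)
bias = np.zeros(hidden_size, dtype=np.float32)

first_token = sequence_output[:, 0, :]            # slice + squeeze -> (batch_size, hidden_size)
pooled_output = np.tanh(first_token @ weight + bias)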
Overview: constructing BERT with MindSpore
def construct(self, input_ids, token_type_ids, input_mask):
    # embedding
    if not self.token_type_ids_from_dataset:
        token_type_ids = self.token_type_ids
    word_embeddings, embedding_tables = self.bert_embedding_lookup(input_ids)
    embedding_output = self.bert_embedding_postprocessor(token_type_ids,
                                                         word_embeddings)
    # attention mask [batch_size, seq_length, seq_length]
    attention_mask = self._create_attention_mask_from_input_mask(input_mask)
    # bert encoder
    encoder_output = self.bert_encoder(self.cast_compute_type(embedding_output),
                                       attention_mask)
    sequence_output = self.cast(encoder_output[self.last_idx], self.dtype)
    # pooler
    sequence_slice = self.slice(sequence_output,
                                (0, 0, 0),
                                (self.batch_size, 1, self.hidden_size),
                                (1, 1, 1))
    first_token = self.squeeze_1(sequence_slice)
    pooled_output = self.dense(first_token)
    return sequence_output, pooled_output, embedding_tables
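A hedged usage sketch, assuming the surrounding class is the model_zoo-style BertModel and that a BertConfig with the fields referenced above is available; exact constructor arguments may differ between MindSpore versions:

import numpy as np
from mindspore import Tensor

# assumed: BertConfig / BertModel as in the MindSpore model_zoo BERT example
config = BertConfig(batch_size=1, seq_length=128)        # remaining fields left at their defaults
model = BertModel(config, is_training=True)

input_ids = Tensor(np.ones((1, 128), np.int32))
token_type_ids = Tensor(np.zeros((1, 128), np.int32))
input_mask = Tensor(np.ones((1, 128), np.int32))

sequence_output, pooled_output, embedding_tables = model(input_ids, token_type_ids, input_mask)
print(sequence_output.shape, pooled_output.shape)        # (1, 128, 768), (1, 768) with the default hidden_size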