Overview
BERT study notes:
PyTorch code: https://github.com/daiwk/BERT-pytorch/tree/master
Paper: https://arxiv.org/pdf/1810.04805
Constructing the data
NSP task:
- Purpose:
Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
- Construction: each line of the corpus contains two sentences, t1 and t2. If the random draw is greater than 0.5, t2 is used as-is and the label is set to 1 (is-next); otherwise a sentence sampled at random from the dataset is paired with t1 and the label is set to 0 (not-next).
```python
t1, (t2, is_next_label) = self.datas[item][0], self.random_sent(item)

# Each line holds two sentences. Take the first sentence, then with 50%
# probability keep the true second sentence (label 1) or swap in a randomly
# chosen sentence (label 0).
def random_sent(self, index):
    # output_text, label (isNotNext: 0, isNext: 1)
    if random.random() > 0.5:
        return self.datas[index][1], 1
    else:
        return self.datas[random.randrange(len(self.datas))][1], 0
```
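To see the 50/50 pairing in isolation, here is a minimal self-contained sketch; the toy `datas` list and the `make_nsp_pair` helper are illustrative stand-ins for the dataset class above, not code from the repo:

```python
import random

# Hypothetical toy corpus: each entry is a (sentence, next_sentence) pair,
# mirroring the self.datas layout used by the dataset class above.
datas = [
    ("the cat sat", "on the mat"),
    ("deep learning is", "a subfield of ML"),
    ("bert reads text", "in both directions"),
]

def make_nsp_pair(datas, index):
    """Return (t1, t2, is_next_label) using the same 50/50 rule as random_sent."""
    t1 = datas[index][0]
    if random.random() > 0.5:
        return t1, datas[index][1], 1                          # real next sentence
    return t1, datas[random.randrange(len(datas))][1], 0       # random sentence

print(make_nsp_pair(datas, 0))
```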
MLM task:
- Purpose (learn what a word means from its bidirectional context):
Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.
- Construction: tokens in the sentence are randomly masked. Each token has a 15% chance of being selected. Of the selected tokens, 80% are replaced with the mask token, 10% are replaced with a random token, and the remaining 10% are kept unchanged; in all three cases the label is set to the token's index in the vocab. Unselected tokens are left unchanged and their label is 0.
```python
t1_random, t1_label = self.random_word(t1)
t2_random, t2_label = self.random_word(t2)

def random_word(self, sentence):
    tokens = sentence.split()
    output_label = []

    for i, token in enumerate(tokens):
        prob = random.random()
        if prob < 0.15:
            # rescale prob to [0, 1) within the selected 15%
            prob /= 0.15

            # 80%: replace the token with the mask token
            if prob < 0.8:
                tokens[i] = self.vocab.mask_index
            # 10%: replace the token with a random token
            elif prob < 0.9:
                tokens[i] = random.randrange(len(self.vocab))
            # 10%: keep the current token
            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

            # the label is the vocab index of the original token
            output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))
        else:
            tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
            output_label.append(0)

    return tokens, output_label
```
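A quick way to sanity-check the 15% selection rate is to run the same logic standalone on a toy vocabulary. The `ToyVocab` class and the free-function `random_word` below are illustrative stand-ins, not the repo's classes:

```python
import random

class ToyVocab:
    """Minimal stand-in for the repo's vocab: a string-to-index map plus special indices."""
    def __init__(self, words):
        self.pad_index, self.unk_index, self.sos_index, self.eos_index, self.mask_index = range(5)
        self.stoi = {w: i + 5 for i, w in enumerate(words)}  # indices 0-4 are reserved

    def __len__(self):
        return len(self.stoi) + 5

def random_word(sentence, vocab):
    # Same masking logic as above, written as a free function for easy testing.
    tokens = sentence.split()
    output_label = []
    for i, token in enumerate(tokens):
        prob = random.random()
        if prob < 0.15:
            prob /= 0.15
            if prob < 0.8:
                tokens[i] = vocab.mask_index
            elif prob < 0.9:
                tokens[i] = random.randrange(len(vocab))
            else:
                tokens[i] = vocab.stoi.get(token, vocab.unk_index)
            output_label.append(vocab.stoi.get(token, vocab.unk_index))
        else:
            tokens[i] = vocab.stoi.get(token, vocab.unk_index)
            output_label.append(0)
    return tokens, output_label

vocab = ToyVocab("the cat sat on mat".split())
selected, total = 0, 0
for _ in range(10000):
    _, labels = random_word("the cat sat on the mat", vocab)
    selected += sum(1 for lab in labels if lab != 0)
    total += len(labels)
print(selected / total)  # should be close to 0.15
```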
Add special tokens to the masked sentences:
(1) [CLS] t1_random [SEP]
(2) t2_random [SEP]
Then pad the labels of t1 and t2 with 0 at the positions of the special tokens.
```python
# [CLS] tag = SOS tag, [SEP] tag = EOS tag
t1 = [self.vocab.sos_index] + t1_random
```
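For orientation, here is a minimal sketch of how this step typically finishes in BERT-pytorch-style dataset classes. It is a sketch under that assumption, not verbatim code from the linked repo; names such as `seq_len`, `bert_input`, `bert_label`, and `segment_label` are assumptions:

```python
# A sketch, assuming the usual BERT-pytorch dataset layout; not verbatim repo code.
t1 = t1 + [self.vocab.eos_index]                       # t1 ends with [SEP]
t2 = t2_random + [self.vocab.eos_index]                # t2 also ends with [SEP]

# Special-token positions get label 0 (pad) so no MLM loss is computed there.
t1_label = [self.vocab.pad_index] + t1_label + [self.vocab.pad_index]
t2_label = t2_label + [self.vocab.pad_index]

# Segment ids distinguish the two sentences; everything is then cut and padded
# to a fixed seq_len (seq_len is an assumed attribute here).
segment_label = ([1] * len(t1) + [2] * len(t2))[:self.seq_len]
bert_input = (t1 + t2)[:self.seq_len]
bert_label = (t1_label + t2_label)[:self.seq_len]

padding = [self.vocab.pad_index] * (self.seq_len - len(bert_input))
bert_input.extend(padding)
bert_label.extend(padding)
segment_label.extend(padding)
```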