BERT Study Notes

Overview

BERT study notes.
PyTorch implementation: https://github.com/daiwk/BERT-pytorch/tree/master
Paper: https://arxiv.org/pdf/1810.04805

Constructing the Data


NSP (Next Sentence Prediction) task:

  • Purpose (from the paper):
    Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
  • Each line of the data contains two sentences, t1 and t2. If the random number is greater than 0.5 (as in the code below), the real t2 is used and the label is set to 1; otherwise a sentence is drawn at random from the dataset, paired with t1, and the label is set to 0.
    t1, (t2, is_next_label) = self.datas[item][0], self.random_sent(item)

    # Each line of the data holds two sentences: keep the first, then with 50%
    # probability replace the second with a random sentence, setting the label to 1 or 0.
    def random_sent(self, index):
        # output_text, label (isNotNext: 0, isNext: 1)
        if random.random() > 0.5:
            return self.datas[index][1], 1
        else:
            return self.datas[random.randrange(len(self.datas))][1], 0
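
For intuition, the same 50/50 sampling can be run standalone. The toy corpus and standalone function below are hypothetical illustrations, not code from the repo:

    import random

    # Toy corpus: each entry is one line of the file, split into (sentence1, sentence2).
    datas = [
        ("the man went to the store", "he bought a gallon of milk"),
        ("penguins are flightless birds", "they live in the southern hemisphere"),
        ("i like coffee", "it keeps me awake"),
    ]

    def random_sent(index):
        # Return (second sentence, is_next label): 1 = true next sentence, 0 = random.
        if random.random() > 0.5:
            return datas[index][1], 1
        else:
            return datas[random.randrange(len(datas))][1], 0

    t1, (t2, is_next_label) = datas[0][0], random_sent(0)
    print(t1, "|", t2, "| label:", is_next_label)

Note that the random branch can occasionally draw the true next sentence anyway; with a large corpus this is rare enough to ignore, and the repo code behaves the same way.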

MLM (Masked Language Model) task:

  • Purpose (understand a word's meaning from its context):
    Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.

Tokens in the sentence are randomly masked: each token has a 15% chance of being selected. Among the selected tokens, 80% are replaced by the mask token, 10% are replaced by a random token from the vocab, and the remaining 10% keep the original token; in all three cases the label is set to the token's index in the vocab (the target the model must predict). All other tokens are left unchanged and their label is 0, so they are excluded from the MLM loss.

    t1_random, t1_label = self.random_word(t1)
    t2_random, t2_label = self.random_word(t2)

    def random_word(self, sentence):
        tokens = sentence.split()
        output_label = []

        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                # rescale prob to [0, 1) within the selected 15%
                prob /= 0.15

                # 80%: replace token with the mask token
                if prob < 0.8:
                    tokens[i] = self.vocab.mask_index

                # 10%: replace token with a random token
                elif prob < 0.9:
                    tokens[i] = random.randrange(len(self.vocab))

                # 10%: keep the current token (as its vocab index)
                else:
                    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                # the label is the original token's vocab index (the prediction target)
                output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
                output_label.append(0)

        return tokens, output_label
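
To see the masking in action, here is a minimal runnable sketch. ToyVocab and mask_tokens are hypothetical stand-ins for the repo's vocab object and the random_word method above, exposing only the fields the code relies on (stoi, mask_index, unk_index):

    import random

    class ToyVocab:
        # Hypothetical minimal vocab mimicking the fields random_word expects.
        def __init__(self, words):
            self.itos = ["<pad>", "<unk>", "<eos>", "<sos>", "<mask>"] + words
            self.stoi = {w: i for i, w in enumerate(self.itos)}
            self.unk_index, self.mask_index = 1, 4

        def __len__(self):
            return len(self.itos)

    vocab = ToyVocab("the man went to store he bought milk".split())

    def mask_tokens(sentence, vocab):
        tokens, labels = sentence.split(), []
        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                prob /= 0.15
                if prob < 0.8:
                    tokens[i] = vocab.mask_index                        # 80%: [MASK]
                elif prob < 0.9:
                    tokens[i] = random.randrange(len(vocab))            # 10%: random token
                else:
                    tokens[i] = vocab.stoi.get(token, vocab.unk_index)  # 10%: keep token
                labels.append(vocab.stoi.get(token, vocab.unk_index))   # predict original
            else:
                tokens[i] = vocab.stoi.get(token, vocab.unk_index)
                labels.append(0)                                        # 0 = not predicted
        return tokens, labels

    print(mask_tokens("the man went to the store", vocab))

On average only about 15% of positions get a nonzero label, so most runs of this snippet mask zero or one token in such a short sentence.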

Add special tokens around the masked sentences:
(1) [CLS] t1_random [SEP]
(2) t2_random [SEP]

Pad the labels of t1 and t2 at the special-token positions, so those positions are not predicted.

    # [CLS] tag = SOS tag, [SEP] tag = EOS tag
    t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index]
    t2 = t2_random + [self.vocab.eos_index]
    t1_label = [self.vocab.pad_index] + t1_label + [self.vocab.pad_index]
    t2_label = t2_label + [self.vocab.pad_index]
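
From here, the repo's __getitem__ concatenates the two segments, builds segment labels, and pads everything to a fixed length. The sketch below assumes a self.seq_len attribute and the vocab's pad_index, following the upstream BERT-pytorch code; verify the exact field names against the source:

    # Segment ids: 1 for sentence-A tokens, 2 for sentence-B tokens
    segment_label = ([1 for _ in range(len(t1))] + [2 for _ in range(len(t2))])[:self.seq_len]
    bert_input = (t1 + t2)[:self.seq_len]
    bert_label = (t1_label + t2_label)[:self.seq_len]

    # Pad inputs, labels, and segment ids up to seq_len
    padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))]
    bert_input.extend(padding)
    bert_label.extend(padding)
    segment_label.extend(padding)

    output = {"bert_input": bert_input,
              "bert_label": bert_label,
              "segment_label": segment_label,
              "is_next": is_next_label}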