Overview
BERT study notes:
PyTorch code: https://github.com/daiwk/BERT-pytorch/tree/master
Paper: https://arxiv.org/pdf/1810.04805
Constructing the data
NSP task:
- Purpose:
Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
- Construction: each line of the corpus contains two sentences, t1 and t2. If the random draw is greater than 0.5, t2 is used as-is and the label is set to 1 (is-next); otherwise a sentence sampled at random from the dataset is paired with t1 and the label is set to 0 (not-next).
```python
t1, (t2, is_next_label) = self.datas[item][0], self.random_sent(item)

# Each line holds two sentences. Take the first sentence, then with 50%
# probability keep the true second sentence (label 1) or swap in a randomly
# chosen sentence (label 0).
def random_sent(self, index):
    # output_text, label (isNotNext: 0, isNext: 1)
    if random.random() > 0.5:
        return self.datas[index][1], 1
    else:
        return self.datas[random.randrange(len(self.datas))][1], 0
```
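To see the 50/50 pairing in isolation, here is a minimal self-contained sketch; the toy `datas` list and the `make_nsp_pair` helper are illustrative stand-ins for the dataset class above, not code from the repo:

```python
import random

# Hypothetical toy corpus: each entry is a (sentence, next_sentence) pair,
# mirroring the self.datas layout used by the dataset class above.
datas = [
    ("the cat sat", "on the mat"),
    ("deep learning is", "a subfield of ML"),
    ("bert reads text", "in both directions"),
]

def make_nsp_pair(datas, index):
    """Return (t1, t2, is_next_label) using the same 50/50 rule as random_sent."""
    t1 = datas[index][0]
    if random.random() > 0.5:
        return t1, datas[index][1], 1                          # real next sentence
    return t1, datas[random.randrange(len(datas))][1], 0       # random sentence

print(make_nsp_pair(datas, 0))
```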
MLM task:
- Purpose (learn what a word means from its bidirectional context):
Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.
- Construction: tokens in the sentence are randomly masked. Each token has a 15% chance of being selected. Of the selected tokens, 80% are replaced with the mask token, 10% are replaced with a random token, and the remaining 10% are kept unchanged; in all three cases the label is set to the token's index in the vocab. Unselected tokens are left unchanged and their label is 0.
```python
t1_random, t1_label = self.random_word(t1)
t2_random, t2_label = self.random_word(t2)

def random_word(self, sentence):
    tokens = sentence.split()
    output_label = []

    for i, token in enumerate(tokens):
        prob = random.random()
        if prob < 0.15:
            # rescale prob to [0, 1) within the selected 15%
            prob /= 0.15

            # 80%: replace the token with the mask token
            if prob < 0.8:
                tokens[i] = self.vocab.mask_index
            # 10%: replace the token with a random token
            elif prob < 0.9:
                tokens[i] = random.randrange(len(self.vocab))
            # 10%: keep the current token
            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

            # the label is the vocab index of the original token
            output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))
        else:
            tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
            output_label.append(0)

    return tokens, output_label
```
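A quick way to sanity-check the 15% selection rate is to run the same logic standalone on a toy vocabulary. The `ToyVocab` class and the free-function `random_word` below are illustrative stand-ins, not the repo's classes:

```python
import random

class ToyVocab:
    """Minimal stand-in for the repo's vocab: a string-to-index map plus special indices."""
    def __init__(self, words):
        self.pad_index, self.unk_index, self.sos_index, self.eos_index, self.mask_index = range(5)
        self.stoi = {w: i + 5 for i, w in enumerate(words)}  # indices 0-4 are reserved

    def __len__(self):
        return len(self.stoi) + 5

def random_word(sentence, vocab):
    # Same masking logic as above, written as a free function for easy testing.
    tokens = sentence.split()
    output_label = []
    for i, token in enumerate(tokens):
        prob = random.random()
        if prob < 0.15:
            prob /= 0.15
            if prob < 0.8:
                tokens[i] = vocab.mask_index
            elif prob < 0.9:
                tokens[i] = random.randrange(len(vocab))
            else:
                tokens[i] = vocab.stoi.get(token, vocab.unk_index)
            output_label.append(vocab.stoi.get(token, vocab.unk_index))
        else:
            tokens[i] = vocab.stoi.get(token, vocab.unk_index)
            output_label.append(0)
    return tokens, output_label

vocab = ToyVocab("the cat sat on mat".split())
selected, total = 0, 0
for _ in range(10000):
    _, labels = random_word("the cat sat on the mat", vocab)
    selected += sum(1 for lab in labels if lab != 0)
    total += len(labels)
print(selected / total)  # should be close to 0.15
```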
Add special tokens to the masked sentences:
(1) [CLS] t1_random [SEP]
(2) t2_random [SEP]
Then pad the labels of t1 and t2 with 0 at the positions of the special tokens.
```python
# [CLS] tag = SOS tag, [SEP] tag = EOS tag
t1 = [self.vocab.sos_index] + t1_random
```
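For orientation, here is a minimal sketch of how this step typically finishes in BERT-pytorch-style dataset classes. It is a sketch under that assumption, not verbatim code from the linked repo; names such as `seq_len`, `bert_input`, `bert_label`, and `segment_label` are assumptions:

```python
# A sketch, assuming the usual BERT-pytorch dataset layout; not verbatim repo code.
t1 = t1 + [self.vocab.eos_index]                       # t1 ends with [SEP]
t2 = t2_random + [self.vocab.eos_index]                # t2 also ends with [SEP]

# Special-token positions get label 0 (pad) so no MLM loss is computed there.
t1_label = [self.vocab.pad_index] + t1_label + [self.vocab.pad_index]
t2_label = t2_label + [self.vocab.pad_index]

# Segment ids distinguish the two sentences; everything is then cut and padded
# to a fixed seq_len (seq_len is an assumed attribute here).
segment_label = ([1] * len(t1) + [2] * len(t2))[:self.seq_len]
bert_input = (t1 + t2)[:self.seq_len]
bert_label = (t1_label + t2_label)[:self.seq_len]

padding = [self.vocab.pad_index] * (self.seq_len - len(bert_input))
bert_input.extend(padding)
bert_label.extend(padding)
segment_label.extend(padding)
```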