NLP Tutorial Notes: BERT, a Bidirectional Language Model

_APTX4869

Article link: https://blog.csdn.net/nanke_4869/article/details/113745886

What it is

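BERT is the same kind of thing as GPT and ELMo. Its purpose is to serve as a pretrained model that supplies sentence understanding for NLP tasks. ELMo uses a bidirectional LSTM as its sentence feature extractor, which also lets it represent the different meanings a word takes on in different sentences. GPT is a unidirectional language model that uses attention to extract richer semantic information. BERT belongs to the same family as GPT: both are derived from the Transformer. So how do BERT and GPT differ?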

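The biggest difference is that BERT asks: if we read a sentence in only one direction, aren't we missing the information from the other direction? So BERT, like ELMo, counts as a bidirectional language model. And this bidirectionality is exactly the Encoder part of the Transformer, unchanged.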

How it is trained

BERT is simply a Transformer Encoder; only the training procedure differs. This tutorial does not go into the Encoder's structure in detail.

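To make BERT understand semantics, its training is much trickier than GPT's. GPT's training scheme is simple because we train it like an RNN: predict the following words from the preceding ones, with a mask hiding everything that comes later. No information leaks from the future into the past, which is one reason a unidirectional language model is easy to train. But if we want to use context from both sides (no mask on the following words) and still keep training easy, we have a problem: when predicting word X, the model would be looking at X itself in order to predict X, which is meaningless.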
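To make the difference in visibility concrete, here is a minimal sketch of the two attention masks in plain Python. The function names (`causal_mask`, `bidirectional_mask`, `show`) are made up for this illustration and are not taken from the tutorial's code; a real model would apply such a mask inside the attention layer.

```python
def causal_mask(n):
    """GPT-style mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 1s (visible) and 0s (hidden)."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


if __name__ == "__main__":
    print("unidirectional (GPT):")
    print(show(causal_mask(4)))
    print("bidirectional (BERT):")
    print(show(bidirectional_mask(4)))
```

In the bidirectional case every row is all ones, including the diagonal: each word can see itself, which is exactly why predicting a word from an unmasked input would be trivial.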

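Fortunately, the BERT authors came up with a workable solution: hide X in the sentence so the model cannot see it, then predict X from the context on both sides. This is the core idea of BERT's training.

This creates a new problem, though. We humans understand a cloze test: the blank (mask) means nothing is there. The model does not know that. Because we feed it the embedding of a token called [MASK], it treats the blank as just another word, as if it were supposed to predict something from the [MASK] embedding itself. (The paper gives a related reason: [MASK] never appears in downstream fine-tuning input, so always masking would create a mismatch between pretraining and fine-tuning.) To avoid this, the researchers added another trick: a position chosen for prediction is not always shown as [MASK]; sometimes it is replaced with a random other word, and sometimes it is left unchanged. Concretely, there are three cases: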

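Randomly select 15% of the tokens and change each of them as follows:

80% of the time, replace it with [MASK]
10% of the time, replace it with a random other word
10% of the time, leave it unchanged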

For example:

Input:  The man went to [MASK] store with [MASK] dog
Target:                 the               his
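The 15% selection and the 80/10/10 rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tutorial's `random_mask_or_replace()`: the function name `mask_tokens`, its parameters, and the toy vocabulary are made up here, each token is selected independently with probability 0.15 (rather than picking exactly 15% of the sequence), and special tokens are not treated separately.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted tokens, {position: original token to predict})."""
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: left as-is and not predicted
        targets[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a different word from the vocabulary
            candidates = [w for w in vocab if w != tok]
            if candidates:
                corrupted[i] = rng.choice(candidates)
        # remaining 10%: keep the original token unchanged
    return corrupted, targets


if __name__ == "__main__":
    sentence = "the man went to the store with his dog".split()
    toy_vocab = sorted(set(sentence)) + ["milk", "penguin"]
    corrupted, targets = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
    print(corrupted)
    print(targets)
```

Note that a selected position is always a prediction target, even when the token is left unchanged; the model cannot tell from the input alone which words it will be asked about.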

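Predicting [MASK] is BERT's main task. What else can we do with unlabeled data to give the model more tasks to train on? We can also use context at the sentence level: have the model judge whether two sentences are consecutive, that is, whether the second one actually follows the first in the original text.

For example, I take a two-sentence passage, split it into its two sentences, feed both to the model together, and have it output True/False for whether the second follows the first. I can also randomly pair sentences that are not consecutive, so the model learns what a non-consecutive pair looks like.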

Input : the man went to the store [SEP] he bought a gallon of milk [SEP]
Is next : True

Input : the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Is next : False
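Building such pairs from raw text could look roughly like the sketch below. The helper `make_nsp_example` and its arguments are hypothetical, the second toy sentence pair is invented for the demo, and the 50/50 split between true and random pairs is the ratio used in the BERT paper rather than something stated in this post. It needs at least two sentence pairs to draw a negative from.

```python
import random

SEP = "[SEP]"


def make_nsp_example(pairs, rng=None, p_next=0.5):
    """pairs: list of (first, second) sentences that are consecutive in the source.

    Returns (input text, is_next label).
    """
    if rng is None:
        rng = random.Random()
    idx = rng.randrange(len(pairs))
    first, second = pairs[idx]
    is_next = rng.random() < p_next
    if not is_next:
        # Negative example: borrow the second sentence from a different pair.
        other = rng.choice([i for i in range(len(pairs)) if i != idx])
        second = pairs[other][1]
    return f"{first} {SEP} {second} {SEP}", is_next


if __name__ == "__main__":
    toy_pairs = [
        ("the man went to the store", "he bought a gallon of milk"),
        ("penguins are flightless birds", "they live in the southern hemisphere"),
    ]
    rng = random.Random(0)
    for _ in range(4):
        text, label = make_nsp_example(toy_pairs, rng=rng)
        print("Input :", text)
        print("Is next :", label)
```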

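With these two tasks, [MASK] prediction and next-sentence prediction, we can create a great deal of training data for supervised training. In effect, we turn unlabeled data into two supervised tasks; the model is still trained with supervision.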

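Note: my BERT code differs from the paper in one respect. I don't think it is necessary to pass the model a [CLS] input that tells it which task it is currently doing. (In the paper, [CLS] is a special token prepended to every input, and its final hidden state is used as the sequence-level representation for classification tasks such as next-sentence prediction.) What I want is a language understander; for different tasks, you can fine-tune different heads to fit. After all, who knows whether your downstream task is one you trained on? So I see no need to add a task input just for the sake of a task. I care more about training a language model than a language-task model.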

Code

The data used here is the same as for ELMo and GPT (MRPC), so the results can be compared side by side.

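import utils  # helper module from this tutorial series (assumed importable)

def train(model, data, step=10000):
    for t in range(step):
        # Re-sample the masked and replaced positions on every step
        # (the remaining arguments are elided here).
        seqs, segs, seqs_, loss_mask, xlen, nsp_labels = random_mask_or_replace(data, ...)
        # One training step covering both the [MASK] and next-sentence tasks.
        loss, pred = model.step(seqs, segs, seqs_, loss_mask, nsp_labels)

d = utils.MRPCData("./MRPC", 2000)  # same MRPC data as for ELMo and GPT
m = BERT()
train(m, d, step=10000)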

Note that random_mask_or_replace() applies a fresh round of MASK and replace operations to the data on every iteration. Its purpose is to give BERT token positions to predict.
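The name `loss_mask` suggests that the loss is computed only at the selected positions. That is an assumption about the tutorial's code, which is not shown here. Under that assumption, a masked cross-entropy could be sketched in plain Python as follows; `masked_cross_entropy` is a made-up name, and a real implementation would do the same thing with tensors.

```python
import math


def masked_cross_entropy(probs, targets, loss_mask):
    """Average -log p(target) over positions where loss_mask is 1.

    probs:     one probability distribution (list of floats) per position
    targets:   the correct token id per position
    loss_mask: 1 where the token was selected for prediction, else 0
    """
    total, count = 0.0, 0
    for dist, target, keep in zip(probs, targets, loss_mask):
        if keep:
            total += -math.log(dist[target])
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
    targets = [0, 1, 0]
    loss_mask = [1, 0, 1]  # only positions 0 and 2 were selected
    print(masked_cross_entropy(probs, targets, loss_mask))
```

Positions with a 0 in the mask contribute nothing, so only a small fraction of each sequence produces a training signal at any given step.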

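If we print the results during training, we can see that BERT does converge, but far more slowly than GPT. The GPT we trained last time reached a fairly good point in only 5000 steps, whereas this BERT has trained for 10000 steps and still has not converged especially well. One likely reason is that only the roughly 15% of selected positions are predicted at each step, while GPT learns from every position. This is a real weakness of BERT's training.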

step:  0 | time: 0.64 | loss: 9.655
| tgt:  <GO> <quote> we can 't change the past , sour we can schering-plough a lot about the future , <quote> sheehan said gamecocks a news conference wednesday afternoon . prevents <quote> we heads 't change the past leg but goldman can do a lot about analogous future , <quote> sheehan said hours after arriving in phoenix
| prd:  tennis subject bar condition adviser down higher ko larned sleep charing arrest shipments alone corp. forging lord rucker humans requiring peaks assignment communion parking locked jeb novels aboard civilians sciences moroccan offer juvenile non-discriminatory reactors <NUM>-to-49-year-old slashed touch-screen underperformed aches trenton north partway odds tito websites company-sponsored orthopedic behind mother-of-two breaking campaigning cooperate down denver marched
| tgt word:  ['but', 'do', 'at', '<SEP>', 'can', ',', 'we', 'the']
| prd word:  ['sleep', 'shipments', 'communion', 'sciences', 'juvenile', 'touch-screen', 'aches', 'websites']


step:  100 | time: 14.04 | loss: 8.924
| tgt:  <GO> this year , local health departments hired part-time water samplers and purchased testing equipment with a $ <NUM> grant from the environmental protection agency . <SEP> this year , peninsula health officials got the money to hire part-time water samplers and purchase testing equipment thanks to a $ <NUM> grant from the environmental protection agency
| prd:  <GO> harrison operated <GO> <GO> , <GO> the the <SEP> <SEP> <GO> the the <SEP> manila stuck <SEP> the <GO> medics <SEP> <GO> <SEP> the sherry offend daschle cronan , washington-area , membership , the <NUM> , , the the the <NUM> the the the the , stricter <NUM> the , , the <NUM> the the
| tgt word:  ['testing', 'with', '<NUM>', 'protection', ',', 'health', 'water', 'protection']
| prd word:  ['the', 'manila', 'the', '<SEP>', ',', ',', 'the', 'the']

...

step:  9800 | time: 14.16 | loss: 2.888
| tgt:  <GO> in <NUM> , the building 's owners , the port authority of new york and new jersey , issued guidelines to upgrade the fireproofing to a thickness of <NUM> { inches . <SEP> the nist discovered that in <NUM> the port authority issued guidelines to upgrade the fireproofing to a thickness of <NUM> 1 / <NUM> inches
| prd:  <GO> in <NUM> , the new 's the , the , , of new new and new <NUM> , , to to to the fireproofing to a of of <NUM> fireproofing <NUM> the <SEP> the nist the that in <NUM> the to to , <NUM> to <NUM> the , to a , of <NUM> to , <NUM> to
| tgt word:  ['authority', 'inches', 'that', 'in', 'authority', 'a', '1', '/']
| prd word:  [',',