NLP Tutorial Notes: GPT, a Unidirectional Language Model

Article link: https://blog.csdn.net/nanke_4869/article/details/113741998

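This article introduces GPT, a pretrained language model built on the Transformer attention mechanism. GPT is trained without labels to predict the following text from the preceding text, and it uses a Future Mask so that information about the answer cannot leak into the prediction. The article also discusses how GPT differs from the Transformer, and provides a training example with the core code and the model's attention results.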



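Today's topic is a model that outperforms ELMo on natural language tasks. It does not rely on the recurrence of RNNs; it relies on the Transformer's attention mechanism. It applies Transformer attention to language modeling successfully, and its predictions are accurate enough to have surprised people in many areas.

That model is Generative Pre-Training (GPT). At the time of writing it has gone through three versions, and the strongest, GPT3, has been praised to the skies by the media. Hype aside, GPT3 really is an excellent NLP model.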

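What is GPT

GPT's main goal is to be a good pretrained model: train on unlabeled human language data, then finetune the result so that it also does well on other tasks. Because downstream finetuning tasks vary enormously, this tutorial focuses on the GPT model itself: what it looks like and what properties it has. The finetuning step that follows is considerably easier than the model itself.

GPT is inseparable from the Transformer. Some people say it is the Transformer's Decoder, but I think that is not quite accurate. It is more like a combination of the Transformer Decoder and Encoder: it uses the Decoder's Future Mask (Look Ahead Mask), but structurally it looks more like the Encoder.

Why design it this way? In the end, the design makes GPT easy to train. It predicts later text from earlier text, so it needs the Future Mask.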

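Without the Future Mask, unsupervised learning over a large corpus would very likely let the model see A while it is predicting A, so the answer leaks into the input. To be specific: in the Transformer's MultiHead Attention, every Head sees every token. If we predict later text from earlier text without a Future Mask, the model can see exactly what it is supposed to predict, and the training is useless. The Future Mask hides that leaked information from the model, like an invisible hand covering its X-ray eyes.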
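To make the idea concrete, here is a minimal pure-Python sketch of a look-ahead mask. The function name `look_ahead_mask` and the convention that 1 means "blocked" are assumptions made for illustration, not taken from the tutorial's code.

```python
def look_ahead_mask(n):
    """Return an n x n matrix: entry [i][j] is 1 when position i must NOT
    attend to position j (j lies in the future), otherwise 0.
    The 1-means-blocked convention is an assumption for this sketch."""
    return [[1 if j > i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "rest", "said", "they"]
mask = look_ahead_mask(len(tokens))

# Show which tokens each position is allowed to look at.
for tok, row in zip(tokens, mask):
    visible = [t for t, blocked in zip(tokens, row) if not blocked]
    print(f"{tok:>5} sees {visible}")
```

Each row allows one more token than the row above it, so the first position sees only itself and the last position sees the whole sequence.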

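The other difference from the Transformer Decoder is that GPT does not use the attention over the Encoder's outputs (the encoder-decoder attention). Each GPT block therefore has fewer sub-layers than a Transformer Decoder block. At first glance the final model does look like one part of the Transformer, with two differences:

The Decoder drops the layers that connect to the Encoder;
Attention uses only the Future Mask (Look ahead mask).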
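The two differences can be sketched in a few lines of pure Python. This is a heavily simplified illustration, not the tutorial's model: it uses a single head, no learned projections, no layer normalization and no feed-forward layer, and the names `causal_self_attention` and `gpt_block` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """Single-head self-attention with a look-ahead mask.
    Queries, keys and values are all x itself (an assumption that
    keeps the sketch short)."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Position i scores only positions 0..i. Skipping the future is
        # equivalent to masking it with -inf before the softmax.
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * x[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out


def gpt_block(x):
    """Masked self-attention plus a residual connection. There is no
    second attention sub-layer that reads Encoder outputs."""
    attn = causal_self_attention(x)
    return [[a + b for a, b in zip(xi, ai)] for xi, ai in zip(x, attn)]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = gpt_block(x)

# Changing the last token must leave the earlier outputs untouched.
x_changed = [x[0], x[1], [5.0, -3.0]]
y_changed = gpt_block(x_changed)
print(y[:2] == y_changed[:2])  # True
```

The final check is the property the Future Mask is there to guarantee: an output at position i depends only on positions up to i.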

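Worked example

As with ELMo, we have a huge amount of unlabeled text on the web, so a language model can learn without supervision, and the resulting pretrained model gives us a head start. This example again uses the Microsoft Research Paraphrase Corpus (MRPC) data used for ELMo training, which allows a side-by-side comparison. The dataset is organized roughly as follows.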
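Each row has two sentences, #1 String and #2 String. If they have the same meaning, Quality is 1; otherwise it is 0. The dataset supports two uses:

Combine the two sentences to train text matching;
Treat the two sentences separately to model human language, that is, learn a language model.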
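As a rough illustration of these two uses, the sketch below splits one tab-separated row in the MRPC column layout (Quality, #1 ID, #2 ID, #1 String, #2 String). The sentence text is borrowed from the training log in this post; the IDs and the Quality value are made-up placeholders, not real dataset values.

```python
# One row in the MRPC layout. The IDs ("0001", "0002") and the label "1"
# are hypothetical placeholders chosen for illustration.
row = ("1\t0001\t0002\t"
       "the rest said they belonged to another party or had no affiliation .\t"
       "the rest said they had no affiliation or belonged to another party .")

quality, id1, id2, s1, s2 = row.split("\t")

# Use 1: text matching. Both sentences together, with the label.
pair_example = {"s1": s1, "s2": s2, "label": int(quality)}

# Use 2: language modeling. Each sentence on its own, label ignored.
lm_examples = [s1, s2]

print(pair_example["label"], len(lm_examples))
```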

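In this tutorial the language model is trained the unsupervised way, which is use 2, but use 1 is involved as well. Besides unsupervised training we can also bring in supervised learning, so this GPT example lets you try training both at once.

We can treat the unsupervised objective as one task and predicting whether the second sentence is the next sentence as another. There can be many more tasks, depending on what your data supports. Training one model on several tasks together helps it generalize better across tasks.

Let's take a look at how training turns out.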

Code

First the training steps again, since the training loop is where the differences in training show most clearly.

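import utils  # the tutorial's own helper module, which provides MRPCData

def train(model, data, step=10000):
    for t in range(step):
        # sample a batch of 16 sentence pairs
        seqs, segs, xlen, nsp_labels = data.sample(16)
        # inputs drop the last token, targets drop the first (shift by one)
        loss, pred = model.step(seqs[:, :-1], segs[:, :-1], seqs[:, 1:], nsp_labels)

d = utils.MRPCData("./MRPC", 2000)
m = GPT(...)
train(m, d, step=5000)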

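Here utils.MRPCData() already wraps the data handling. In model.step():

seqs[:, :-1] is the sentence content of the X input;
segs[:, :-1] is the segment information of the X input, marking each token as belonging to the first or the second sentence. Both sentences go into seqs together, so the model needs to know which sentence a token belongs to;
seqs[:, 1:] is the Y target for unsupervised learning: the input shifted by one token, so the preceding text predicts what follows;
nsp_labels indicates whether the two input sentences are consecutive context.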
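The slicing described above can be seen on a single example. This is a sketch under assumptions: the special tokens `<GO>` and `<SEP>` and the helper `build_example` are hypothetical, and the real `utils.MRPCData` works on padded integer batches, not on lists of strings.

```python
GO, SEP = "<GO>", "<SEP>"  # hypothetical special tokens for this sketch


def build_example(s1, s2):
    """Join two tokenized sentences into one sequence plus segment ids."""
    seq = [GO] + s1 + [SEP] + s2 + [SEP]
    # Segment 0 covers <GO>, sentence 1 and its <SEP>; segment 1 covers the rest.
    seg = [0] * (len(s1) + 2) + [1] * (len(s2) + 1)
    return seq, seg


seq, seg = build_example(["we", "declare", "war"], ["we", "declared", "war"])

x_tokens = seq[:-1]  # plays the role of seqs[:, :-1]
x_segs = seg[:-1]    # plays the role of segs[:, :-1]
y_tokens = seq[1:]   # plays the role of seqs[:, 1:]

# Each input position is paired with the token that follows it.
for x, s, y in zip(x_tokens, x_segs, y_tokens):
    print(f"input={x:<10} seg={s} -> target={y}")
```

Together with the Future Mask, this shift means that position i sees inputs 0..i and is asked for token i+1, so the target is never visible in the input.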

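In short, the text and segment information of both sentences go into the model, which is trained on two tasks:

unsupervised prediction of the following text;
whether the second sentence is the next sentence.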
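One common way to train two such tasks together is to add their losses. The sketch below assumes an unweighted sum of two cross-entropy terms; the tutorial's `model.step()` may combine them differently, and the function names and toy probabilities here are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])


def multitask_loss(token_probs, token_targets, nsp_probs, nsp_label):
    # Task 1: next-token prediction, averaged over positions.
    lm = sum(cross_entropy(p, t)
             for p, t in zip(token_probs, token_targets)) / len(token_targets)
    # Task 2: is-next-sentence classification, one label per pair.
    nsp = cross_entropy(nsp_probs, nsp_label)
    # Unweighted sum (an assumption made for this sketch).
    return lm + nsp, lm, nsp


# Toy predictions over a 3-word vocabulary at two positions.
token_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
token_targets = [0, 1]
nsp_probs = [0.25, 0.75]  # [not next, next]

total, lm, nsp = multitask_loss(token_probs, token_targets, nsp_probs, 1)
print(f"lm={lm:.3f} nsp={nsp:.3f} total={total:.3f}")
```

Because both terms flow back through the same shared layers, each gradient step moves the model toward doing well on both tasks at once.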

Printed out over the whole run, training looks like this:

step:  0 | time: 0.63 | loss: 9.663 
| tgt:  they also are reshaping the retail business relationship elsewhere , as companies take away ideas and practices that change how they do business in their own firms and with others . <SEP> they also are reshaping the retail-business relationship , as companies take away concepts and practices that change how they do business internally and with others . 
| prd:  kinsley van-vliet franco atheist bottom kent performance toured trapeze reporting alta miz crush <NUM>-month crush kennedy dominick clarence ``will thames mr scanning abuses losses sleeping since detection punching scrutiny fare-beating shiites sue gagne canfor built schafer chronicle assignment cat deadline action slipping enhances crush tearing cat mobile widen treaty retire towards an-najaf virtually alta widen files gillian jamaica


step:  100 | time: 14.39 | loss: 8.227 
| tgt:  <quote> we are declaring war on sexual harassment and sexual assault . <SEP> <quote> we have declared war on sexual assault and sexual harassment , <quote> rosa said . 
| prd:  the the the the the the the the the , the the <SEP> the the , , , , , , , , , <NUM> <SEP> <SEP> <NUM> the

...

step:  4800 | time: 14.08 | loss: 0.612 
| tgt:  the rest said they belonged to another party or had no affiliation . <SEP> the rest said they had no affiliation or belonged to another party . 
| prd:  the company said they remain to another party or had no affiliation . <SEP> the rest said they had no affiliation or belonged to another party or


step:  4900 | time: 14.05 | loss: 0.677 
| tgt:  <quote> craxi begged me to intervene because he believed the operation damaged the state , <quote> mr berlusconi said . <SEP> <quote> i had no direct interest and craxi begged me to intervene because he believed that the deal was damaging to the state , <quote> berlusconi testified . 
| prd:  the the begged me to intervene because he believed the operation damaged the state , <quote> mr berlusconi said . <SEP> <quote> i had no direct interest and craxi begged me to intervene because in believed that the deal was damaging to the state , <quote> berlusconi testified .

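After 5000 training steps, the model goes from predicting "the" over and over to predicting the second half of the sequence fairly well. Because of the future mask, GPT cannot predict the first half of a sentence well: too little preceding information is available there. That is why we call GPT a unidirectional language model.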
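As a quick check of this claim, the snippet below compares `tgt` and `prd` from the step 4800 log entry token by token, separately for the part before and after `<SEP>`. The helper `half_accuracy` is hypothetical, and a single sample is only illustrative, not evidence about the model in general.

```python
# Copied from the step 4800 log entry above.
tgt = ("the rest said they belonged to another party or had no affiliation . "
       "<SEP> the rest said they had no affiliation or belonged to another party .")
prd = ("the company said they remain to another party or had no affiliation . "
       "<SEP> the rest said they had no affiliation or belonged to another party or")


def half_accuracy(tgt, prd, sep="<SEP>"):
    """Count matching tokens before and after the first separator in tgt."""
    t, p = tgt.split(), prd.split()
    cut = t.index(sep)
    hits = [a == b for a, b in zip(t, p)]
    first, second = hits[:cut], hits[cut + 1:]
    return sum(first), len(first), sum(second), len(second)


print(half_accuracy(tgt, prd))  # (11, 13, 12, 13)
```

On this sample 11 of 13 tokens match in the first half and 12 of 13 in the second. The misses in the first half come early, where the model has the least context to work with.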

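For the architecture we reuse the Encoder code from the Transformer, because the two are interchangeable. We only need to replace the Mask rule in the Encoder and add a few embedding parameters to the model for seg and word.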

class GPT(keras.Model):
    def __init__(self, ...):
        self.word_emb = keras.layers.Embedding(...)     # [n_vocab, dim]
        self.segment_emb = keras.layers.Embedding(...)   # [max_seg, dim]
        self.position_emb = self.add_weight(.