Bert and its family: GPT

After writing up ELMO and Bert, there is one more family member left to record: GPT. I have been meaning to write this for a while, but I have mostly been playing around lately. So what is GPT? GPT is short for Generative Pre-Training, and it is essentially the decoder of the transformer. What kind of job does GPT do? Given the previous words of a sentence, we want the GPT model to produce the next word of that sentence, and that is all. Of course, because the GPT-2 model is enormous, it achieves astonishing results on many tasks and can even do zero-shot learning (roughly speaking, it transfers extremely well): on reading comprehension, for example, it gets good results without using any reading comprehension training set at all.
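To make "given the previous words, predict the next word" concrete, here is a minimal sketch. It assumes the Hugging Face transformers library and the public gpt2 checkpoint, which are my choices for illustration and not something this post depends on:

```python
# Minimal next-word-prediction sketch (assumed setup: Hugging Face transformers +
# the public "gpt2" checkpoint; an English prompt stands in for the "潮水" example).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The tide has", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                # shape: (1, seq_len, vocab_size)

top5 = torch.topk(logits[0, -1], k=5).indices      # 5 most likely next tokens
print([tokenizer.decode(int(i)) for i in top5])    # candidate next words
```

The model only ever scores "what comes next", which is exactly the training objective described above.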

How does GPT work? As mentioned above, what GPT does is take a word and predict the next one. For example, you give it the begin-of-sentence (BOS) token, then feed in the word "潮水" ("the tide"), and you want the GPT model to output the word "退了" ("went out"). How does it output "退了"? As before, GPT also does self-attention: as shown in the figure below, "潮水" is fed in and produces a query, a key and a value, namely the q^{2}, k^{2}, v^{2} in the figure, and then self-attention is computed. How exactly that is done was already covered in the earlier transformer posts, so I will not repeat it here.
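As a rough illustration of that step, here is a toy numpy sketch with made-up random weights; q2, k2, v2 below correspond to the q^{2}, k^{2}, v^{2} in the figure, and none of the matrices are GPT's real parameters:

```python
# Toy single-head self-attention over the tokens seen so far: <BOS> and "潮水".
import numpy as np

d = 8                                    # embedding / head dimension (made up)
np.random.seed(0)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

x = np.random.randn(2, d)                # stand-in embeddings for <BOS> and "潮水"

Q, K, V = x @ Wq, x @ Wk, x @ Wv         # q^i, k^i, v^i for every position so far
q2 = Q[1]                                # the query of "潮水" (q^{2} in the figure)

scores = K @ q2 / np.sqrt(d)             # scores of "潮水" against <BOS> and itself
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax
b2 = alpha @ V                           # attended output for position 2
print(b2.shape)                          # (8,) -- goes through more layers to predict "退了"
```

Because only the tokens that have already been fed in exist at this point, the model can never peek at future words; that is the "masked" part of masked self-attention.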

After "退了" has been predicted, it is taken and fed back in, and GPT is asked which word should come after "退了". The word "退了" also produces its own query, key and value, namely the q^{3}, k^{3}, v^{3} in the figure; q^{3} then does self-attention with all the words produced so far, including itself, and the resulting embedding passes through many more layers to predict the word that follows. This process simply repeats over and over, and that is what the GPT model does.
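That repeated predict-append-predict process can be sketched as a simple greedy loop; again this assumes the Hugging Face transformers library and the gpt2 checkpoint purely for illustration:

```python
# Greedy autoregressive generation: predict a word, append it, ask again.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The tide has", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                                      # generate 5 more tokens
        logits = model(ids).logits                          # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()                    # most likely next word
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)  # feed it back in

print(tokenizer.decode(ids[0]))
```

In practice one would usually sample instead of always taking the argmax, but the loop structure is the same.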

GPT and GPT-2 do the same thing: they train a language model, only GPT-2's language model is very, very large. So what are the points of difference between GPT-2 and Bert?

1. Language model: although Bert and GPT-2 both use the transformer, Bert uses the transformer's encoder, i.e. full self-attention, and is a bidirectional language model; GPT-2 uses the transformer's decoder with the middle Encoder-Decoder Attention layer removed, i.e. masked self-attention, and is a unidirectional language model (a small sketch follows this list).
2. Structure: Bert follows pre-training + fine-tuning, while GPT-2 only does pre-training.
3. Input vectors: GPT-2 uses token embedding + position embedding; Bert uses token embedding + position embedding + segment embedding.
4. Parameter count: Bert has about 300 million parameters, while GPT-2 has 1.5 billion.
5. Bert introduces Masked LM and Next Sentence Prediction; GPT-2 is trained purely as a unidirectional language model and uses neither.
6. Bert cannot directly do generative tasks, while GPT-2 can. Reference: NLP——Bert与GPT-2的纠葛 - 知乎 (zhihu.com)
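Here is the small sketch referenced in difference 1: the structural difference between Bert-style and GPT-2-style attention comes down to the mask that decides which positions a token may look at (toy sizes, plain PyTorch, not tied to any checkpoint):

```python
# Bidirectional vs. masked (causal) self-attention, shown only through the mask.
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)            # raw attention scores (made up)

# Bert-style encoder: every position may attend to every other position.
bert_attn = torch.softmax(scores, dim=-1)

# GPT-2-style decoder: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
gpt_attn = torch.softmax(masked_scores, dim=-1)   # upper triangle becomes 0

print(bert_attn)
print(gpt_attn)                                   # strictly lower-triangular weights
```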

GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are both advanced natural language processing (NLP) models, developed by OpenAI and Google respectively. Although they share some similarities, there are key differences between the two models.

1. Pre-training objective: GPT is pre-trained with a language modeling objective, where the model learns to predict the next word in a sequence. BERT is trained with a masked language modeling objective: some words in the input sequence are masked, and the model learns to predict those masked words from the surrounding context.
2. Transformer architecture: both GPT and BERT use the transformer architecture, a neural network architecture designed for processing sequential data such as text. However, GPT uses a unidirectional transformer, which processes the input sequence in the forward direction only, while BERT uses a bidirectional transformer, which lets every position condition on both its left and right context.
3. Fine-tuning: both models can be fine-tuned on specific NLP tasks such as text classification, question answering, and text generation. GPT is better suited to text generation, while BERT is better suited to tasks that require a deep understanding of context, such as question answering.
4. Training data: GPT is trained on a massive corpus of text such as web pages, books, and news articles. BERT is pre-trained on a similar unlabeled corpus (BooksCorpus and English Wikipedia); labeled data from specific NLP tasks, such as the Stanford Question Answering Dataset (SQuAD), is only used at fine-tuning time.

In summary, GPT and BERT are both powerful NLP models with different strengths depending on the task at hand: GPT is better at generating coherent, fluent text, while BERT is better at tasks that require a deep understanding of context.
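As a quick, hedged illustration of the two pre-training objectives described above, the Hugging Face pipeline API can be used with the public bert-base-uncased and gpt2 checkpoints (both the library and the checkpoints are my own choices for the example):

```python
# BERT fills in a masked word from both sides; GPT-2 continues text left to right.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The tide has [MASK] out.")[0]["token_str"])             # masked LM objective

gen = pipeline("text-generation", model="gpt2")
print(gen("The tide has", max_new_tokens=5)[0]["generated_text"])   # causal LM objective
```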
