NLP Language Models BERT, GPT-2/3, T-NLG: Changing the Rules of the Game

Summary: key concepts of popular language model capabilities

We are all aware of the current revolution in the field of Artificial Intelligence (AI), and Natural Language Processing (NLP) is one of its major contributors.

For NLP-related tasks, where we build techniques for human-computer interaction, we first develop a language-specific understanding in our machine so that it can extract some context out of the training data. This is also the first basic thing in our parenting: our babies first understand the language, then we gradually start giving them complex tasks.

In the conventional world, we need to nurture each baby individually, but on the other hand, if you take the example of any subject like Physics, a lot of people have contributed so far and we have a predefined ecosystem, like books and universities, to pass the earned knowledge on to the next person. Our conventional NLP language models were like the first case: everyone needed to develop their own language understanding using some technique, and no one could leverage others' work. The computer vision side of AI had already achieved this reuse using the ImageNet object dataset. This concept is called Transfer Learning. As per Wikipedia:

Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.

To cut down on a lot of repetitive, time-intensive, cost-unfriendly and compute-intensive work, many major companies were working on such language models, where someone else can leverage their language understanding, but BERT from Google was the major defining moment that almost changed the industry. Before that, the popular ones were ELMo and GPT.

Before going into these language models, we must understand a few key concepts.

Embedding: we know that most of our algorithms can't understand natural language directly, so we need to provide some numerical representation, and embedding does exactly that, producing different numerical representations of the same text. It can be a simple count-based embedding like TF-IDF, a prediction-based one, or a context-based one. Here we are only focused on the context-based kind.
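
As an illustration of the simplest, count-based flavour, here is a minimal sketch using scikit-learn's TfidfVectorizer (the two example sentences are borrowed from the BERT discussion later in this post). Such a representation assigns one column per word regardless of context, which is exactly the limitation the context-based models below address.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two sentences that both use the word "pretty", but in different senses
corpus = [
    "this painting is pretty ugly",
    "this watch is pretty",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse (2 x vocabulary_size) matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one count-based vector per sentence
```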

Number of parameters of the neural network: all our language models use this term as a performance metric, and a larger number of parameters is generally assumed to mean a more accurate model. The parameters are typically the weights of the connections, learned during the training stage.
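
To make "number of parameters" concrete, here is a small sketch (the toy network is an arbitrary example, not any of the models discussed here) that counts the trainable weights of a PyTorch module, the same way headline figures such as "340 million" or "17 billion" are obtained.

```python
import torch.nn as nn

# A toy two-layer feed-forward block, roughly the shape used inside a Transformer layer
toy_block = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768),
)

# Each weight and bias entry is one learnable parameter
num_params = sum(p.numel() for p in toy_block.parameters())
print(f"{num_params:,} trainable parameters")  # 768*3072 + 3072 + 3072*768 + 768 = 4,722,432
```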

Transformers

This is where the story changed: this is not a CNN and not an RNN, this is something totally different. Let's go into more detail.

Suppose your model goes through this tweet:

Now, your model will be confused about whether Narendra Modi is telling these social media companies something like 'I quit' or just updating his followers about his decision. Even a person with basic English knowledge can be confused if they don't pay attention, and I repeat, 'attention'. This is the key concept from which the Transformer architecture evolved: it creates a 'self-attention' layer while reading the whole corpus.

If you have ever had any connection with electronics or software encryption, you know the words encoder and decoder. The first one changes the original input into some cryptic form, and the second one does the reverse, i.e. cryptic back to original.

Figure: the Transformer architecture (source: https://arxiv.org/pdf/1706.03762v5.pdf)

This is the diagram from its white paper, and here are the key steps:

  1. The first part is the encoder, which has a multi-head attention layer followed by a feed-forward neural network.
  2. The second part is the decoder, which has one additional layer, 'masked multi-head attention'.
  3. Nx denotes the number of layers for both the encoder and the decoder.
  4. First we have the stack of encoder layers, where the output of one layer works as the input of the next.
  5. The attention layer of each encoder checks the context using query, key and value vectors (see the sketch after this list).
  6. Then the encoder passes its understanding to the next layer, and so on.
  7. The final encoder output is passed to all decoders as the key and value vectors.
  8. The decoder first predicts the first word as final output, then takes that first word back as decoder input and predicts the next word.
  9. This process is repeated until the last word is predicted.
  10. Terminate the loop :) remember your early programming days.
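
To make step 5 concrete, here is a minimal NumPy sketch of the scaled dot-product attention computed inside each attention head; the shapes and random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in 'Attention Is All You Need'."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how much each query attends to each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the value vectors

# Toy example: 4 tokens, 8-dimensional query/key/value projections (random, for illustration)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8): one context-aware vector per token
```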

Now we can discuss the popular language models.

BERT

Bidirectional Encoder Representations from Transformers, Google

This is the actual breakthrough in the field of NLP pre-trained models, and it can understand context, like the difference between 'this painting is pretty ugly' and 'this watch is pretty'. Both sentences have the word 'pretty', but BERT can understand the different context in each. As per the official documentation:

BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
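
To see this context-sensitivity in practice, here is a small sketch (assuming the Hugging Face transformers and torch packages are installed, which the original post does not use) that compares BERT's vector for 'pretty' in the two sentences above; the exact similarity value will vary, but it stays well below 1.0 because the surrounding words differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(word, sentence):
    """Return BERT's contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]             # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = vector_for("pretty", "this painting is pretty ugly")
v2 = vector_for("pretty", "this watch is pretty")
print(torch.cosine_similarity(v1, v2, dim=0))  # same word, different contexts, different vectors
```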

BERT mainly has two keywords: a) bidirectional, b) transformers. Transformers were already explained, and the bidirectional part is implemented by masking out some of the words in the input and then conditioning on each word in both directions to predict the masked words. This is not a new concept, but BERT is the one that implemented it successfully before anyone else.
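
That masked-word objective is easy to poke at with the Hugging Face transformers library (a convenience on my part, not something the original post uses): BERT looks at the words on both sides of the mask before predicting it.

```python
from transformers import pipeline

# Pre-trained bert-base-uncased together with its masked-language-model head
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model conditions on BOTH sides of [MASK] to fill in the blank
for prediction in unmasker("The watch is pretty, but the painting is pretty [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```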

Before BERT, ELMo was the technique used for context-based learning, but here is the key advantage of BERT:

[Figure: the key advantage of BERT over earlier approaches such as ELMo]

What to do with BERT:

  1. Pre-training: this is very compute intensive and only needed if you want to train from scratch for some language; Google has already trained and provides two models, a) BERT-Base and b) BERT-Large.
  2. Fine-tuning: this is the task-specific work where you can fine-tune the model. TensorFlow is the officially supported framework, and non-official PyTorch and Chainer support is also available (a minimal fine-tuning setup is sketched after this list).
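
As a rough illustration of the fine-tuning route (this sketch uses the Hugging Face transformers library rather than the official TensorFlow repository, purely for brevity): a fresh classification head is placed on top of the pre-trained encoder, and only then is task data introduced.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained BERT encoder plus a new, untrained 2-class classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One labelled example; a real fine-tuning run would loop over batches of these
batch = tokenizer(["this watch is pretty"], return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)   # returns both loss and logits
outputs.loss.backward()                   # gradients flow through the whole pre-trained model
print(outputs.loss.item(), outputs.logits.shape)  # scalar loss, (1, 2) logits
```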

You can find the implementation code here and can also execute this notebook directly.

BERT is basically designed for fill-in-the-blank kinds of activity, and it supports 340 million parameters (in the BERT-Large variant).

BERT major adoptions

RoBERTa, FairSeq team, Facebook

This was released in PyTorch, and as per their official documentation:

RoBERTa builds on BERT’s language masking strategy and modifies key hyperparameters in BERT, including removing BERT’s next-sentence pretraining objective, and training with much larger mini-batches and learning rates.

AzureML-BERT, Microsoft

This is the cloud-based adoption of BERT, where the Azure cloud can run the end-to-end process; as per their website, it has better metrics than the Google-native setup.

Figure: AzureML-BERT results (source: https://azure.microsoft.com/en-in/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/)

ALBERT: A Lite BERT, Google

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

GPT-2/3

Generative Pretrained Transformer, OpenAI

OpenAI is an Elon Musk initiative, which also received a 1 billion dollar investment from Microsoft. The model has the word 'generative' in its name, as it was trained, using unsupervised techniques, to predict the next token based on a sequence of tokens.

Considering its content generation capabilities, at the time it was released the management said they were not releasing the full version as open source; they feared it would be dangerous if used to create fake news.

This pretrained model mainly used content from across the internet, Wikipedia and Reddit, and it is basically built for content writing, i.e. generating new text. It is a unidirectional language model. Unlike BERT, it mainly uses the decoder side and generates new text word by word.
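
The publicly released GPT-2 weights make this word-by-word generation easy to try with the Hugging Face transformers library (again an illustration, not code from the original post); the prompt below is arbitrary.

```python
from transformers import pipeline

# GPT-2 keeps sampling the next token conditioned on everything generated so far
generator = pipeline("text-generation", model="gpt2")

result = generator("Language models are changing the rules of the game because",
                   max_length=40, num_return_sequences=1)
print(result[0]["generated_text"])
```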

OpenAI also released a music generation module, which uses the same GPT-2 approach to build its understanding of music.

GPT-2 is basically designed for essay-writing kinds of activity, and it supports 1.5 billion parameters. GPT-3 has also been announced, with even more advanced capabilities, and it is really a big discussion topic around the world.

T-NLG

Turing Natural Language Generation, Microsoft

Considering the recent developments in the field of language models, this is Microsoft's bid to solve NLP tasks like conversation, language understanding, question answering, summarization, etc. As per their claim, it is a 17-billion-parameter language model that needs a Microsoft-developed optimizer called ZeRO and a deep learning optimization library called DeepSpeed. Using both, the model can be trained across many GPUs.
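
For a flavour of what that looks like in practice, below is a rough sketch of the kind of ZeRO configuration DeepSpeed consumes; the keys and values are illustrative only (check the DeepSpeed documentation for the exact schema), and the actual T-NLG training setup has not been published in this form.

```python
# Illustrative DeepSpeed/ZeRO settings, expressed as the Python dict that would be
# handed to deepspeed.initialize(...); all values here are placeholders.
ds_config = {
    "train_batch_size": 512,
    "fp16": {"enabled": True},            # mixed precision to cut memory use
    "zero_optimization": {"stage": 2},    # ZeRO partitions optimizer state/gradients across GPUs
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=my_model, model_parameters=my_model.parameters(), config=ds_config)
```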

This model naturally solves the question-and-direct-answer problem, which is very useful for AI-enabled assistants. It can also answer without a context passage; in that case, the model relies on knowledge gained during pre-training to generate an answer.
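
T-NLG itself has not been released, so the question-answering task it targets is illustrated below with an openly available extractive model from the transformers library (a stand-in: T-NLG would generate the answer rather than extract it).

```python
from transformers import pipeline

# A small extractive QA model as a stand-in for T-NLG's generative question answering
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

answer = qa(question="How many parameters does T-NLG have?",
            context="T-NLG is a 17 billion parameter language model by Microsoft.")
print(answer["answer"], answer["score"])
```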

It supports abstractive summarization, like a human, rather than extractive summarization, where the summary only reduces the number of sentences. It can summarize multiple kinds of documents, like emails, Excel sheets, etc.
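
The abstractive flavour of summarization can likewise be illustrated with a public model (again a stand-in for T-NLG; here a BART model fine-tuned for summarization, and a made-up email as input).

```python
from transformers import pipeline

# facebook/bart-large-cnn generates new sentences rather than only copying existing ones
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

email = ("Hi team, the quarterly numbers are in. Revenue grew 12 percent, costs were flat, "
         "and the new NLP feature drove most of the upside. Full report attached; "
         "please review before Friday's meeting.")
print(summarizer(email, max_length=30, min_length=10)[0]["summary_text"])
```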

Microsoft has not made most of the details public here, so I have taken most of this content from the source website.

This is the USP from their website:

Figure: T-NLG highlights (source: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/)

Commercial vs Research

You have seen the commercial models, but have a look at the research benchmarks too, i.e. the GLUE benchmark as of 19th July 2020:

Figure: GLUE leaderboard (source: https://gluebenchmark.com/leaderboard)

Conclusion

This space is very interesting, and it will change how the world currently communicates with machines, so watch this space.

PS: a Transformer may learn 'this space' as 'transfer learning in NLP' :D

Translated from: https://medium.com/analytics-vidhya/nlp-language-models-bert-gpt2-t-nlg-changing-the-rules-of-the-game-3334b23020a9
