The ABBA Explainer to BERT and GPT-2

For the life of me, I couldn’t understand how BERT or GPT-2 worked.

I read articles, followed diagrams, squinted at equations, watched recorded classes, read code documentation, and still struggled to make sense of it all.

It wasn’t the math that made it hard.

It was more that the big-picture part you'd expect to precede the nitty-gritty was somehow missing.

This article bridges the gap, explaining in simple terms how these models are built. It's the article I wish I could have read first; many of the details would have then slotted right into place. With the generous help of ABBA, we'll introduce three different ideas. I guarantee there won't be a single mathematical equation in any of them:

1. Thinking about words and meaning — Attention
2. Pancake stacking small components — Deep Learning
3. Thinking about words and signals — Embeddings

In the fourth and final section, we'll see how these ideas tie neatly into a bow.

1. Thinking about words and meaning — Attention

Let’s look at the following sentence:

[Image: the example sentence about summer rain]

Suppose we asked three friends to read this sentence. While they'd probably agree the sentence's topic is the rain, they might diverge on which words matter most, in relation to 'rain', for conveying the sentence's true meaning.

Benny might say ‘tapped’ and ‘window’ were the most important, because the sentence is about the noise the rain makes.

Frida might say ‘softly’ and ‘summer’ are the most important, because this is a sentence about what summer rain is like.

Bjorn might take a different approach altogether, and focus on ‘ed’ and ‘that’ to suggest this sentence is about the memory of past rain.

While all three readers are taking in the same sentence, each is paying attention to different parts. Each is attributing a different weighting to some words in relation to 'rain', while discounting the importance of others.

Their different ways of understanding this sentence also set their expectations of what comes next. When asked to continue with a following sentence starting with "It …", they might say:

[Image] The ABBA attention heads. Image source: Wikipedia, 'Summer Night City'

Each of these three options makes sense.

This concept of attention, of how words in a text relate to one another in lots of different ways, is a central idea. We'd like to take this idea of paying attention to semantic relationships between words and teach a machine to do it.

The task models like BERT and GPT-2 are trained on is correctly guessing the next word in a piece of text. Once trained, such a model can therefore be used to generate new text, word by word.
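
To make that training objective concrete, here is a minimal sketch of asking a pretrained GPT-2 for its top guesses for the next word. It assumes the Hugging Face transformers and PyTorch packages are installed, and the prompt is only a stand-in for the article's example sentence:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "The rain tapped softly on the window that summer. It"  # stand-in prompt
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # one score per vocabulary entry, per position

next_token_scores = logits[0, -1]            # scores for whatever comes after "It"
top5 = torch.topk(next_token_scores, k=5).indices
print([tokenizer.decode(t) for t in top5])   # the model's five favourite next tokens
```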

The main difference between the two models is that, during training, in the BERT model you’re allowed to also look at the text that comes after the missing word, whereas with GPT-2 you’re only looking backwards. Generating text word-by-word sounds iffy; wouldn’t it make sense to generate several words that together make sense?

Indeed, to have a longer horizon beyond a single next word, you can think of a “beam” into the future: instead of going straight for your top next-word candidate, you keep the top 5 candidates, and for each, generate their top 5 next-words, and so on.

After several rounds of this, you have quite a few potential extensions to the sentence, and you can choose the best ones. If we then ask GPT-2 to complete a sentence, it might offer sensible suggestions like so:

[Image] Source: Write with Transformer, GPT-2. HuggingFace
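
This beam idea is available off the shelf in the Hugging Face transformers library through the generate function. A minimal sketch, where the prompt and the parameter values are illustrative choices rather than anything taken from the article:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The rain tapped softly on the window that summer. It"  # stand-in prompt
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,              # keep the 5 best partial continuations at each step
    num_return_sequences=3,   # hand back the 3 best finished beams
    max_new_tokens=15,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```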

2. Pancake stacking components — Deep Learning

The premise — and promise — of this article is to explain how large language models like GPT-2 work.

At the very least, then, I need to tell you what they are. The answer is: They’re a pancake stack.

GPT-2 is a stack of identical components called decoders, and BERT is a stack of slightly different identical components called encoders.

[Image] Source: The Illustrated GPT-2. Alammar, J.

How big are these stacks? Well, it varies.

[Image: the sizes of the different GPT-2 models] Source: The Illustrated GPT-2. Alammar, J.
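
If you'd like to check these sizes yourself, the configuration files of the published GPT-2 checkpoints list them. A small sketch using the Hugging Face transformers library (it downloads only the configs, not the model weights):

```python
from transformers import AutoConfig

for name in ("gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"):
    cfg = AutoConfig.from_pretrained(name)
    # n_layer = how many decoder "pancakes"; n_embd = the model dimensionality
    print(f"{name:12s} layers={cfg.n_layer:3d} dimensionality={cfg.n_embd}")
```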

The more layers there are, the more your input gets tumbled, mixed and crunched. That's where the 'Deep' in 'Deep Learning' comes from. One of the things researchers are finding (and this was a specific research question when building GPT-3) is that the more you stack, the better the results on some natural language processing tasks.

You’ll notice it’s not just the stack getting taller as the models get bigger. The number at the bottom, Model Dimensionality, is also getting bigger. We’ll get to this number in the next section; for now, please ignore it.

To understand how a large language model works, all you need to know is what happens to the input as it passes through one of these components. From then on, it’s just rinse-and-repeat.

Everything boiling down to a pass-through step inside a small box is a core idea in deep learning. To keep a firm grip on the bigger picture, I won't walk through that step exactly here.

Instead, let’s first get clear on inputs.

These models “eat up” chunks of text. (For the extra-large GPT-2 it’s 1024 tokens, which makes for about 800–900 consecutive words. For BERT it’s half this amount.) This chunk of text is called context.
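
As a rough, hands-on check of that words-to-tokens ratio, here is a small sketch using the Hugging Face transformers tokenizer (the sample text is illustrative):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "It rained softly all through that summer. " * 100   # illustrative sample text
n_words = len(text.split())
n_tokens = len(tokenizer(text)["input_ids"])

print(n_words, "words ->", n_tokens, "tokens")
print("fits in GPT-2's 1024-token context:", n_tokens <= 1024)
```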

Every word is processed individually, and the model calculates an impression of how each word relates to the words that came before it.

Also, even without specifics, you can be sure that a pass-through step will be made out of a combination of five types of actions:

[Image: the list of five types of actions]

Why those? For two very specific reasons:

1. Theoretically, if you combine functions of these types over and over again, you can approximate any function you want.

Pretty much all of deep learning is founded on the above theoretical premise.

When it comes to natural language processing, though, I feel not enough pause is given to the odd idea that language tasks are functions too:

When we translate from English to French, we're performing a function that takes in a bunch of words in English and outputs a specific bunch of words in French.

For text generation, we’re approximating a function that takes in a bunch of written text and spits out a sensible choice for the next word.
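
Written as bare Python signatures, the 'language tasks are functions' idea looks something like this (the names are made up for illustration; the bodies are exactly what the models learn to approximate):

```python
# Illustrative signatures only: the bodies are what models like GPT-2 learn to approximate.
def translate_english_to_french(english_words):
    ...

def next_word(text_so_far):
    ...
```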

No question, these are intricate functions: English has tens of thousands of words, and an untold number of relationships between them.

But to get a tingly feel for how something sophisticated can be approximated with a whole lot of simple functions, think of an ever-more intricate 3D mesh:

[Image] Source: Progressive Compression of Manifold Polygon Meshes. 2012.

2. In practice, approximating something like text generation or translation means repeating such actions MANY MANY MANY TIMES. For the theoretical idea of attention to be computationally feasible, the calculations performed need to have specific mathematical properties.

Looking at the list, one of the five possible actions is redundant. For some reason, it also appears in big, shouty letters. That reason is explained in the next part:

3. Thinking about words and signals — Embeddings

Let’s think about adding things together. On the one hand, we can add NUMBERS: if we take 3 and add 4, we get 7. But, just looking at 7, there’s no trace left of either 3 or 4. It’s just 7.

On the other hand, you can add MUSIC together: if you take a song and overlay it with another song, you get a new track, but you can still make out each of the two songs. You’ve created something new, but still retained quite a lot of information about each of the original tracks. As of now, I’d like you to start thinking of the words you read as if each and every one were a different 3-minute musical track.

The reason for this musical detour is that humans have utterly rubbish intuition when it comes to higher-dimensional spaces. We extrapolate based on our experience with 2D and 3D. But in many important ways, such as adding stuff together, things behave very differently in higher dimensions.

A word’s track is called its embedding. Without telling you how it’s done, I’ll state that words with similar meanings end up having similar embeddings.
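
If you want to poke at this claim, you can look up GPT-2's own embedding table with the Hugging Face transformers library. A minimal sketch, assuming each chosen word maps to a single token:

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

def embedding(word):
    # look up the 768-number "track" for the word's (first) token
    token_id = tokenizer(" " + word)["input_ids"][0]
    return model.wte.weight[token_id]

def similarity(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

rain, snow, window = embedding("rain"), embedding("snow"), embedding("window")
print("rain~snow:", similarity(rain, snow))
print("rain~window:", similarity(rain, window))
```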

Before words go into models like BERT and GPT-2, they’re first turned into long signals — just not musical ones. But remember the core idea: the sum retains lots of information about each of the original parts.

Going back to the list of functions each pancake would be made of:

[Image: the list of five types of actions, one in big, shouty letters]

The reason for the big shouty letters starts becoming clear. Think back to our ABBA attention committee:

We’re showing Benny, Frida and Bjorn the rain sentence, and asking them to judge how the word ‘It’ in the following sentence relates to words that came before.

Then comes Agnetha, who plays the role of the committee chair:

[Image: Agnetha weighting the committee's opinions] Image source: Wikipedia, 'Summer Night City'

The numbers I threw in (20% or 60%) are not fixed. As the model learns, what changes is both how Benny, Bjorn and Frida decide which past words are important, and also the weighting Agnetha gives to their opinions.

This all happens within one pancake. Now we stack them: each consecutive box essentially receives as input the sum of the weighted judgement from the previous box, and the previous box’s input. This makes sure the opinions of lower layers don’t get completely lost along the way.
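
Here is a toy numpy sketch of that flow. It is only an illustration of the committee-plus-residual idea, not GPT-2's actual code, and every size and weight in it is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_committee(x):
    # Benny, Frida and Bjorn: each head mixes the word-signals with its own weights
    n_words, _ = x.shape
    opinions = []
    for _ in range(3):
        w = rng.random((n_words, n_words))
        w /= w.sum(axis=1, keepdims=True)      # each word's attention adds up to 1
        opinions.append(w @ x)
    chair = np.array([0.2, 0.2, 0.6])          # Agnetha's (learned) weighting of the heads
    return sum(c * o for c, o in zip(chair, opinions))

def pancake(x):
    # the box's output is its weighted judgement PLUS its own input,
    # so lower layers' opinions don't get completely lost on the way up
    return x + attention_committee(x)

x = rng.standard_normal((7, 16))               # 7 words, each a 16-number "track" (toy size)
for _ in range(4):                             # stack a few pancakes
    x = pancake(x)
print(x.shape)
```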

Remember those big numbers I told you to ignore? As the model grows, it's not just the number of pancakes in the stack that grows, but also how long the "musical signal" is.

Imagine moving from ABBA's Waterloo, which is just under 3 minutes long, to Queen's Bohemian Rhapsody, which is nearly 6 minutes long.

Intuitively, you can sense that when adding together longer musical tracks, information about added components is preserved even better.
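
A toy numpy sketch backs up that intuition: two random high-dimensional vectors stay clearly recoverable from their sum, because random directions in high dimensions are nearly orthogonal. (The dimensions below are illustrative; 768 and 12,288 happen to be the embedding sizes of GPT-2 small and GPT-3.)

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for dim in (2, 768, 12_288):
    a, b = rng.standard_normal(dim), rng.standard_normal(dim)
    s = a + b                                    # "overlay" the two tracks
    print(f"dim={dim:6d}  s~a={cosine(s, a):.2f}  s~b={cosine(s, b):.2f}")
# In 2 dimensions the sum can end up far from one of its parts;
# in high dimensions it stays close (around 0.7) to both of them.
```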

4. Tying the three ideas in a neat bow

How we grasp meaning is closely tied to how we pay attention to different words. We weigh the influence of words, near and far, in the context of whatever we’re currently reading.

It allows us, as human readers, to perform tasks like completing gaps in a sentence, or continuing the writing in a way that makes sense. So, philosophically speaking, the true meaning of a word in the current context is the collective influence of all the words before it. We can thus imagine a theoretical language model, which would 'understand' text by keeping track of relationships — however subtle — between words.

It would learn this by repeatedly trying to guess the next word in a text. Every time, it'll check how well it's done, and if it guessed wrongly, it'll re-balance the different weights it gives to relationships between words.

When making a real-life language model, we use embeddings instead of written words, and these get crunched inside the likes of BERT and GPT-2.

To get a better feel for what happens to these representations along the way, I suggested we think of them like music, because adding two of them together looks more like overlaying tracks than adding regular numbers.

Then we looked at the belly of the language machine. It didn’t look very fancy, just a pancake-stack of identical boxes, each one feeding from the one below.

And, although we didn’t say exactly what goes on inside these boxes, we understand conceptually that they’re a way to implement paying attention to relationships between words, near and far.

Summary

Have I glossed over some of the details? Hell yeah! The aim was to never lose sight of the bigger picture. While the ABBA committee is not EXACTLY what happens in each decoder, it's close enough to the truth to make the details comprehensible. If I now told you there aren't three attention heads in each layer, but twelve of them, it would hardly make much difference, right?

Generally speaking, whenever this article mentions a specific number (like 12 or 1024), assume researchers have already made it bigger.

That’s the story of GPT-3 in a nutshell:

Instead of 12 layers — 96.
Instead of 1024 tokens in the context — 2048.
Instead of 12 attention heads per layer — 96.
Instead of an embedding size of 1600 — 12,288 (!!!)

(Given everything we’ve seen, that last line gives room for pause: Do we actually need to go from an ABBA song to a 27-minute musical number for each and every word? Probably not.)

But the basic premise stays the same. Once confined to the stratified experimental realm, language models are now fast becoming part of everyday life. They've already changed Google Search. They're already changing translation services. They're already used to generate text.

[Image] Source: Write with Transformer, GPT-2. HuggingFace

When technology moves from being an academic toy to something with tangible influence, I don't believe researchers are entitled to stick with obfuscating mathematical language.

There’s a moral duty to explain these concepts so that anyone can understand. At least, that’s what I told ABBA so they’d agree to come and help.

Sources:

1. Seth, Y. 2019. BERT Explained — A list of Frequently Asked Questions. https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/
2. Alammar, J. The Illustrated GPT-2. https://jalammar.github.io/illustrated-gpt2/
3. Arora, A. 2020. The Annotated GPT-2. https://amaarora.github.io/2020/02/18/annotatedGPT2.html
4. Smith, N.A. 2019. Contextual Word Representations: A Contextual Introduction. https://arxiv.org/abs/1902.06006
5. Li, C. 2020. OpenAI's GPT-3 Language Model: A Technical Overview. https://lambdalabs.com/blog/demystifying-gpt-3/
6. Calvo, M.R. 2018. Dissecting BERT. https://medium.com/dissecting-bert
7. HuggingFace. Write with Transformers. https://transformer.huggingface.co/doc/gpt2-large

Translated from: https://towardsdatascience.com/the-abba-explainer-to-bert-and-gpt-2-ada8ff4d1eca
