Text generation before transformers
It's important to note that generative algorithms are not new. Previous generations of language models made use of an architecture called recurrent neural networks, or RNNs. RNNs, while powerful for their time, were limited by the amount of compute and memory needed to perform well at generative tasks.
Let's look at an example of an RNN carrying out a simple next-word prediction generative task.
With just one previous word seen by the model, the prediction can't be very good.
As you scale the RNN implementation to be able to see more of the preceding words in the text, you have to significantly scale the resources that the model uses. As for the prediction, well, the model failed here.
Even though you scale the model, it still hasn't seen enough of the input to make a good prediction. To successfully predict the next word, models need to see more than just the previous few words. Models need to have an understanding of the whole sentence or even the whole document. The problem here is that language is complex.
In many languages, one word can have multiple meanings. These are homonyms. In this case, it's only with the context of the sentence that we can see what kind of bank is meant.
Words within a sentence structure can be ambiguous, or have what we might call syntactic ambiguity. Take for example this sentence, "The teacher taught the students with the book." Did the teacher teach using the book, or did the students have the book, or was it both? How can an algorithm make sense of human language if sometimes we can't?
Well in 2017, after the publication of this paper, Attention is All You Need, from Google and the University of Toronto, everything changed. The transformer architecture had arrived. This novel approach unlocked the progress in generative AI that we see today.
It can be scaled efficiently to use multi-core GPUs, it can process input data in parallel, making use of much larger training datasets, and crucially, it's able to learn to pay attention to the meaning of the words it's processing. And attention is all you need. It's in the title.
Transformer architecture
Building large language models using the transformer architecture dramatically improved the performance of natural language tasks over the earlier generation of RNNs, and led to an explosion in generative capability.
The power of the transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence. Not just, as you see here, of each word to its neighbor, but of each word to every other word in the sentence. The model applies attention weights to those relationships so that it learns the relevance of each word to every other word, no matter where they are in the input. This gives the algorithm the ability to learn who has the book, who could have the book, and if it's even relevant to the wider context of the document.
These attention weights are learned during LLM training and you'll learn more about this later this week. This diagram is called an attention map and can be useful to illustrate the attention weights between each word and every other word. Here in this stylized example, you can see that the word book is strongly connected with, or paying attention to, the word teacher and the word student. This is called self-attention, and the ability to learn attention in this way across the whole input significantly improves the model's ability to encode language.
Now that you've seen one of the key attributes of the transformer architecture, self-attention, let's cover at a high level how the model works. Here's a simplified diagram of the transformer architecture so that you can focus at a high level on where these processes are taking place. The transformer architecture is split into two distinct parts, the encoder and the decoder. These components work in conjunction with each other and they share a number of similarities. Also, note here that the diagram you see is derived from the original Attention is All You Need paper. Notice how the inputs to the model are at the bottom and the outputs are at the top. Where possible, we'll try to remain faithful to this throughout the course.
Now, machine-learning models are just big statistical calculators and they work with numbers, not words. So before passing text into the model to process, you must first tokenize the words. Simply put, this converts the words into numbers, with each number representing a position in a dictionary of all the possible words that the model can work with. You can choose from multiple tokenization methods. For example, token IDs can match complete words, or token IDs can represent parts of words, as you can see here. What's important is that once you've selected a tokenizer to train the model, you must use the same tokenizer when you generate text.
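To make the idea concrete, here is a minimal sketch of word-level tokenization in Python. The vocabulary and sentence below are invented purely for illustration; real models typically use subword schemes such as byte-pair encoding, but the principle of mapping text to integer IDs is the same.

```python
# A toy word-level tokenizer: map each word to an ID from a fixed vocabulary.
# The vocabulary here is made up for illustration only.
vocab = {"<unk>": 0, "the": 1, "teacher": 2, "taught": 3,
         "student": 4, "with": 5, "a": 6, "book": 7}

def tokenize(text: str) -> list[int]:
    """Convert a sentence into a list of token IDs, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

token_ids = tokenize("The teacher taught the student with a book")
print(token_ids)  # [1, 2, 3, 1, 4, 5, 6, 7]
```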
Now that your input is represented as numbers, you can pass it to the embedding layer. This layer is a trainable vector embedding space, a high-dimensional space where each token is represented as a vector and occupies a unique location within that space. Each token ID in the vocabulary is matched to a multi-dimensional vector, and the intuition is that these vectors learn to encode the meaning and context of individual tokens in the input sequence. Embedding vector spaces have been used in natural language processing for some time; previous-generation language algorithms like Word2vec used this concept. Don't worry if you're not familiar with this. You'll see examples of this throughout the course, and there are some links to additional resources in the reading exercises at the end of this week.
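As a rough sketch, you can think of the embedding layer as a lookup into a trainable matrix with one row per token ID. The sizes and random values below are illustrative stand-ins, not weights from any real model.

```python
import numpy as np

vocab_size, d_model = 8, 512            # illustrative sizes; the original paper used d_model = 512
rng = np.random.default_rng(0)

# In a real model this matrix is a trainable parameter; here it is just random.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = [1, 2, 3, 1, 4, 5, 6, 7]    # e.g. the output of the toy tokenizer above
embedded = embedding_matrix[token_ids]  # shape: (sequence_length, d_model)
print(embedded.shape)                   # (8, 512)
```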
Looking back at the sample sequence, you can see that in this simple case, each word has been matched to a token ID, and each token is mapped into a vector.
In the original transformer paper, the vector size was actually 512, much bigger than we can fit onto this image. For simplicity, if you imagine a vector size of just three, you could plot the words in a three-dimensional space and see the relationships between those words. You can see now how you can relate words that are located close to each other in the embedding space, and how you can calculate the distance between the words as an angle, which gives the model the ability to mathematically understand language.
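The "distance as an angle" idea is usually measured with cosine similarity. Here is a small sketch using made-up three-dimensional vectors; the specific numbers are hypothetical and only meant to show that related words end up with a higher score.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, invented for illustration.
student = np.array([0.9, 0.1, 0.3])
teacher = np.array([0.8, 0.2, 0.4])
banana  = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(student, teacher))  # high: related words sit close together
print(cosine_similarity(student, banana))   # lower: unrelated words point in different directions
```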
As you add the token vectors into the base of the encoder or the decoder, you also add positional encoding. The model processes each of the input tokens in parallel. So by adding the positional encoding, you preserve the information about the word order and don't lose the relevance of the position of the word in the sentence.
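One way to encode position, and the one used in the original paper, is the sinusoidal positional encoding. A minimal sketch, with the sequence length and model dimension chosen only for illustration:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from the original transformer paper:
    PE(pos, 2i) = sin(pos / 10000**(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000**(2i/d_model))."""
    positions = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Added element-wise to the token embeddings before they enter the encoder/decoder,
# e.g. x = embedded + positional_encoding(8, 512) for the embedding sketch above.
print(positional_encoding(8, 512).shape)  # (8, 512)
```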
Once you've summed the input tokens and the positional encodings, you pass the resulting vectors to the self-attention layer. Here, the model analyzes the relationships between the tokens in your input sequence. As you saw earlier, this allows the model to attend to different parts of the input sequence to better capture the contextual dependencies between the words. The self-attention weights that are learned during training and stored in these layers reflect the importance of each word in that input sequence to all other words in the sequence.
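At the heart of this layer is scaled dot-product attention, which the original paper writes as softmax(QK^T / sqrt(d_k)) V, where the queries Q, keys K, and values V are learned projections of the input. Below is a minimal numpy sketch in which random matrices stand in for trained weights and the sizes are illustrative.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x: np.ndarray, w_q, w_k, w_v) -> np.ndarray:
    """Scaled dot-product self-attention: softmax(QK^T / sqrt(d_k)) V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # (seq_len, seq_len): the attention map
    weights = softmax(scores)         # each row sums to 1
    return weights @ v                # context-aware token representations

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 8, 512, 64        # illustrative sizes
x = rng.normal(size=(seq_len, d_model))   # stand-in for embeddings + positional encodings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (8, 64)
```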
But this does not happen just once; the transformer architecture actually has multi-headed self-attention. This means that multiple sets of self-attention weights, or heads, are learned in parallel, independently of each other. The number of attention heads included in the attention layer varies from model to model, but numbers in the range of 12-100 are common. The intuition here is that each self-attention head will learn a different aspect of language. For example, one head may see the relationship between the people entities in our sentence, while another head may focus on the activity of the sentence, and yet another head may focus on some other properties, such as whether the words rhyme. It's important to note that you don't dictate ahead of time what aspects of language the attention heads will learn. The weights of each head are randomly initialized and, given sufficient training data and time, each will learn different aspects of language. While some attention maps are easy to interpret, like the examples discussed here, others may not be.
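Extending the single-head sketch above (and reusing its `self_attention`, `x`, `rng`, `d_model`, and `d_k` definitions), multi-headed attention simply runs several independently initialized sets of Q/K/V projections and concatenates their outputs. The head count here is chosen only so that the sizes line up.

```python
def multi_head_attention(x, heads):
    """Run each head's own Q/K/V projections and concatenate the outputs.
    `heads` is a list of (w_q, w_k, w_v) tuples, one per attention head."""
    outputs = [self_attention(x, w_q, w_k, w_v) for w_q, w_k, w_v in heads]
    return np.concatenate(outputs, axis=-1)        # (seq_len, n_heads * d_k)

n_heads = 8                                        # illustrative; many models use 12-100 heads
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
print(multi_head_attention(x, heads).shape)        # (8, 512), since n_heads * d_k == d_model
```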
Now that all of the attention weights have been applied to your input data, the output is processed through a fully-connected feed-forward network. The output of this layer is a vector of logits proportional to the probability score for each and every token in the tokenizer dictionary.
You can then pass these logits to a final softmax layer, where they are normalized into a probability score for each word. This output includes a probability for every single word in the vocabulary, so there's likely to be thousands of scores here. One single token will have a score higher than the rest. This is the most likely predicted token. But as you'll see later in the course, there are a number of methods that you can use to vary the final selection from this vector of probabilities.
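As a final sketch, here is how logits over a toy vocabulary become probabilities, and how greedy selection picks the single highest-scoring token. The vocabulary and logit values are made up for illustration, and the sampling strategies covered later in the course would replace the simple argmax.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Normalize logits into a probability distribution over the vocabulary."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy vocabulary and made-up logits for the next-token position.
vocab = ["the", "teacher", "taught", "student", "with", "a", "book"]
logits = np.array([0.5, 0.1, 0.2, 2.3, 0.0, 0.4, 3.1])

probs = softmax(logits)
next_token = vocab[int(np.argmax(probs))]   # greedy decoding: pick the most likely token
print(next_token, probs.max())              # "book", with the highest probability score
```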