Study Notes on Large Language Models and Generative AI: 1.1.5 Introduction to the LLM and Generative AI Project Lifecycle (Generating Text with Transformers; Transformers: Attention Is All You Need)

Generating text with transformers

At this point, you've seen a high-level overview of some of the major components inside the transformer architecture. But you still haven't seen how the overall prediction process works from end to end. Let's walk through a simple example. In this example, you'll look at a translation task or a sequence-to-sequence task, which incidentally was the original objective of the transformer architecture designers. You'll use a transformer model to translate the French phrase “J'aime l'apprentissage automatique” into English.

First, you'll tokenize the input words using the same tokenizer that was used to train the network. These tokens are then added into the input on the encoder side of the network, passed through the embedding layer, and then fed into the multi-headed attention layers. The outputs of the multi-headed attention layers are fed through a feed-forward network to the output of the encoder. At this point, the data that leaves the encoder is a deep representation of the structure and meaning of the input sequence. This representation is inserted into the middle of the decoder to influence the decoder's self-attention mechanisms.
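To make the encoder-side flow concrete, here is a minimal sketch using the Hugging Face transformers library with a T5-style sequence-to-sequence checkpoint. The checkpoint name and the "translate French to English:" task prefix are illustrative assumptions, not something prescribed by the lecture.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"   # illustrative checkpoint; the course labs use a T5 variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the source phrase with the same tokenizer the model was trained with.
inputs = tokenizer(
    "translate French to English: J'aime l'apprentissage automatique",
    return_tensors="pt",
)

# Run only the encoder: embeddings -> multi-headed attention -> feed-forward layers.
encoder_outputs = model.get_encoder()(**inputs)

# One deep-representation vector per input token: (batch, sequence_length, d_model).
print(encoder_outputs.last_hidden_state.shape)
```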

Next, a start of sequence token is added to the input of the decoder. This triggers the decoder to predict the next token, which it does based on the contextual understanding that it's being provided from the encoder. The output of the decoder's self-attention layers gets passed through the decoder feed-forward network and through a final softmax output layer. At this point, we have our first token.
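A rough sketch of this first decoding step, reusing the tokenizer, model, and encoder_outputs objects from the snippet above:

```python
import torch

# Give the decoder the encoder's output plus a start-of-sequence token,
# then read the logits for the first predicted token.
decoder_start = torch.tensor([[model.config.decoder_start_token_id]])

outputs = model(encoder_outputs=encoder_outputs, decoder_input_ids=decoder_start)

# A softmax over the final logits gives a probability for every token in the vocabulary.
probabilities = torch.softmax(outputs.logits[0, -1], dim=-1)
first_token_id = int(probabilities.argmax())
print(tokenizer.decode([first_token_id]))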

You'll continue this loop, passing the output token back to the input to trigger the generation of the next token, until the model predicts an end-of-sequence token. At this point, the final sequence of tokens can be detokenized into words, and you have your output. In this case, “I love machine learning”. There are multiple ways in which you can use the output from the softmax layer to predict the next token. These can influence how creative your generated text is. You will look at these in more detail later this week.
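The loop itself can be sketched as a simple greedy decoder, again continuing the snippets above. In practice you would normally call model.generate(), which wraps this loop and exposes the alternative sampling strategies mentioned here.

```python
# Minimal greedy decoding loop, reusing model, tokenizer, and encoder_outputs.
generated = [model.config.decoder_start_token_id]
for _ in range(50):                                          # cap the output length
    step = model(encoder_outputs=encoder_outputs,
                 decoder_input_ids=torch.tensor([generated]))
    next_token_id = int(step.logits[0, -1].argmax())         # greedy: most likely token
    generated.append(next_token_id)
    if next_token_id == tokenizer.eos_token_id:              # stop at end-of-sequence
        break

# Detokenize the final sequence of tokens back into words.
print(tokenizer.decode(generated, skip_special_tokens=True))
```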

Let's summarize what you've seen so far. The complete transformer architecture consists of encoder and decoder components. The encoder encodes input sequences into a deep representation of the structure and meaning of the input. The decoder, working from input token triggers, uses the encoder's contextual understanding to generate new tokens. It does this in a loop until some stop condition has been reached.

While the translation example you explored here used both the encoder and decoder parts of the transformer, you can split these components apart for variations of the architecture. Encoder-only models also work as sequence-to-sequence models, but without further modification, the input sequence and the output sequence are the same length. Their use is less common these days, but by adding additional layers to the architecture, you can train encoder-only models to perform classification tasks such as sentiment analysis. BERT is an example of an encoder-only model. Encoder-decoder models, as you've seen, perform well on sequence-to-sequence tasks such as translation, where the input sequence and the output sequence can be different lengths. You can also scale and train this type of model to perform general text generation tasks. Examples of encoder-decoder models include BART (as opposed to BERT) and T5, the model that you'll use in the labs in this course. Finally, decoder-only models are some of the most commonly used today. Again, as they have scaled, their capabilities have grown. These models can now generalize to most tasks. Popular decoder-only models include the GPT family of models, BLOOM, Jurassic, LLaMA, and many more. You'll learn more about these different varieties of transformers and how they are trained later this week.
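As a hedged illustration of how these three variants show up in practice, here is how the Hugging Face transformers library loads one model of each kind; the checkpoint names are examples, not requirements of the course.

```python
from transformers import (
    AutoModelForSequenceClassification,   # encoder-only, with a classification head added
    AutoModelForSeq2SeqLM,                # encoder-decoder
    AutoModelForCausalLM,                 # decoder-only
)

encoder_only = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")
```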

That was quite a lot. The main goal of this overview of transformer models is to give you enough background to understand the differences between the various models being used out in the world and to be able to read model documentation. I want to emphasize that you don't need to worry about remembering all the details you've seen here, as you can come back to this explanation as often as you need. Remember that you'll be interacting with transformer models through natural language, creating prompts using written words, not code. You don't need to understand all of the details of the underlying architecture to do this. This is called prompt engineering, and that's what you'll explore in the next part of this course. Let's move on to the next video to learn more.


Transformers: Attention is all you need

"Attention is All You Need" is a research paper published in 2017 by Google researchers, which introduced the Transformer model, a novel architecture that revolutionized the field of natural language processing (NLP) and became the basis for the LLMs we now know - such as GPT, PaLM and others. The paper proposes a neural network architecture that replaces traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) with an entirely attention-based mechanism.

The Transformer model uses self-attention to compute representations of input sequences, which allows it to capture long-term dependencies and parallelize computation effectively. The authors demonstrate that their model achieves state-of-the-art performance on several machine translation tasks and outperforms previous models that rely on RNNs or CNNs.
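The core computation the paper builds on can be sketched in a few lines of PyTorch. This is a generic illustration of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, not code taken from the paper itself.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, applied over whole sequences at once."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise token similarities
    weights = torch.softmax(scores, dim=-1)         # attention weights per position
    return weights @ v                              # weighted sum of value vectors

# Every position attends to every other position in one matrix multiplication,
# which is what makes the computation easy to parallelize compared with an RNN.
q = k = v = torch.randn(1, 6, 64)                   # (batch, sequence_length, d_k)
print(scaled_dot_product_attention(q, k, v).shape)
```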

The Transformer architecture consists of an encoder and a decoder, each of which is composed of several layers. Each layer consists of two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. The multi-head self-attention mechanism allows the model to attend to different parts of the input sequence, while the feed-forward network applies a point-wise fully connected layer to each position separately and identically.
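As a rough sketch, PyTorch ships exactly this sub-layer combination as nn.TransformerEncoderLayer; the hyperparameters below follow the base configuration reported in the paper (d_model = 512, 8 heads, feed-forward width 2048).

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention followed by a position-wise
# feed-forward network, with residual connections and layer normalization inside.
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)

x = torch.randn(1, 10, 512)   # (batch, sequence_length, d_model)
print(layer(x).shape)         # shape is preserved: each position is transformed in place
```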

The Transformer model also uses residual connections and layer normalization to facilitate training and prevent overfitting. In addition, the authors introduce a positional encoding scheme that encodes the position of each token in the input sequence, enabling the model to capture the order of the sequence without the need for recurrent or convolutional operations.
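The sinusoidal scheme can be sketched directly from the formulas in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)               # even indices 2i
    angles = positions / (10000 ** (dims / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions use cosine
    return pe

# Added to the token embeddings so the model can distinguish token positions.
print(sinusoidal_positional_encoding(10, 512).shape)
```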

"注意力是你所需要的一切"是2017年由谷歌研究人员发布的一篇研究论文,它引入了Transformer模型,这是一种革命性的架构,彻底改变了自然语言处理(NLP)领域,并成为我们现在所知的LLMs——如GPT、PaLM等的基础。该论文提出了一种神经网络架构,用完全基于注意力机制的方式取代传统的循环神经网络(RNNs)和卷积神经网络(CNNs)。

Transformer模型使用自注意力来计算输入序列的表示形式,这允许它捕捉长期依赖性并有效地并行化计算。作者证明他们的模型在几个机器翻译任务上实现了最先进的性能,并且超过了之前依赖于RNN或CNN的模型。

Transformer架构由一个编码器和一个解码器组成,每个都由几层组成。每一层包括两个子层:多头自注意力机制和前馈神经网络。多头自注意力机制允许模型关注输入序列的不同部分,而前馈网络则对每个位置分别且相同地应用点对点全连接层。

Transformer模型还使用残差连接和层归一化来促进训练并防止过拟合。此外,作者引入了一种位置编码方案,对输入序列中每个标记的位置进行编码,使模型能够捕捉序列的顺序,无需循环或卷积操作。
