Large Language Models and Generative AI Study Notes: 1.1.6 Introduction to LLMs and the Generative AI Project Lifecycle - Prompting and Prompt Engineering, Generative Configuration

Prompting and prompt engineering

Okay, just to remind you of some of the terminology. The text that you feed into the model is called the prompt, the act of generating text is known as inference, and the output text is known as the completion. The full amount of text or the memory that is available to use for the prompt is called the context window. Although the example here shows the model performing well, you'll frequently encounter situations where the model doesn't produce the outcome that you want on the first try. You may have to revise the language in your prompt or the way that it's written several times to get the model to behave in the way that you want. This work to develop and improve the prompt is known as prompt engineering. This is a big topic. But one powerful strategy to get the model to produce better outcomes is to include examples of the task that you want the model to carry out inside the prompt. Providing examples inside the context window is called in-context learning.

Let's take a look at what this term means. With in-context learning, you can help LLMs learn more about the task being asked by including examples or additional data in the prompt. Here is a concrete example. Within the prompt shown here, you ask the model to classify the sentiment of a review, that is, whether the review of this movie is positive or negative. The prompt consists of the instruction, "Classify this review," followed by some context, which in this case is the review text itself, and an instruction to produce the sentiment at the end. This method, including your input data within the prompt, is called zero-shot inference. The largest of the LLMs are surprisingly good at this, grasping the task to be completed and returning a good answer. In this example, the model correctly identifies the sentiment as positive.
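
As a rough sketch of what such a prompt can look like in code, the snippet below assembles a zero-shot sentiment prompt as a plain string (the review text and the exact template wording are invented for illustration):

```python
# A minimal zero-shot prompt: an instruction plus the input data, no examples.
# The review text and template wording are invented for this illustration.
review = "The plot was gripping and the performances were superb."

zero_shot_prompt = (
    "Classify this review:\n"
    f"{review}\n"
    "Sentiment:"
)

print(zero_shot_prompt)
```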

Smaller models, on the other hand, can struggle with this. Here's an example of a completion generated by GPT-2, an earlier smaller version of the model that powers ChatGPT. As you can see, the model doesn't follow the instruction. While it does generate text with some relation to the prompt, the model can't figure out the details of the task and does not identify the sentiment. This is where providing an example within the prompt can improve performance.

Here you can see that the prompt text is longer and now starts with a completed example that demonstrates the task to be carried out to the model. After specifying that the model should classify the review, the prompt text includes a sample review, "I loved this movie," followed by a completed sentiment analysis. In this case, the review is positive. Next, the prompt states the instruction again and includes the actual input review that we want the model to analyze. You pass this new, longer prompt to the smaller model, which now has a better chance of understanding the task you're specifying and the format of the response that you want. The inclusion of a single example is known as one-shot inference, in contrast to the zero-shot prompt you supplied earlier.
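
Continuing the sketch above, a one-shot prompt simply prepends one completed example before the review you actually want classified (the second review text here is invented):

```python
# One-shot prompt: a single completed example demonstrates the task and the
# expected output format before the review we actually want classified.
example_review = "I loved this movie!"
new_review = "The plot made no sense and I nearly fell asleep."  # invented input

one_shot_prompt = (
    "Classify this review:\n"
    f"{example_review}\n"
    "Sentiment: Positive\n\n"
    "Classify this review:\n"
    f"{new_review}\n"
    "Sentiment:"
)

print(one_shot_prompt)
```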

Sometimes a single example won't be enough for the model to learn what you want it to do. So you can extend the idea of giving a single example to include multiple examples. This is known as few-shot inference. Here, you're working with an even smaller model that failed to carry out good sentiment analysis with one-shot inference. Instead, you're going to try few-shot inference by including a second example, this time a negative review. Including a mix of examples with different output classes can help the model to understand what it needs to do. You pass the new prompt to the model, and this time it understands the instruction and generates a completion that correctly identifies the sentiment of the review as negative.
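
Extending the same sketch, a few-shot prompt includes several completed examples with a mix of output classes (the negative example and the final review are invented):

```python
# Few-shot prompt: multiple completed examples, mixing positive and negative
# classes, followed by the review to be classified. Texts are invented.
examples = [
    ("I loved this movie!", "Positive"),
    ("What a waste of two hours.", "Negative"),
]
new_review = "The soundtrack was beautiful but the story dragged."

few_shot_prompt = ""
for text, label in examples:
    few_shot_prompt += f"Classify this review:\n{text}\nSentiment: {label}\n\n"
few_shot_prompt += f"Classify this review:\n{new_review}\nSentiment:"

print(few_shot_prompt)
```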

So to recap, you can engineer your prompts to encourage the model to learn by examples. While the largest models are good at zero-shot inference with no examples, smaller models can benefit from one-shot or few-shot inference that includes examples of the desired behavior. But remember the context window, because you have a limit on the amount of in-context learning that you can pass into the model. Generally, if you find that your model isn't performing well when, say, including five or six examples, you should try fine-tuning your model instead. Fine-tuning performs additional training on the model using new data to make it more capable of the task you want it to perform. You'll explore fine-tuning in detail in week 2 of this course.

As larger and larger models have been trained, it's become clear that the ability of models to perform multiple tasks, and how well they perform those tasks, depends strongly on the scale of the model. As you heard earlier in the lesson, models with more parameters are able to capture more understanding of language. The largest models are surprisingly good at zero-shot inference and are able to infer and successfully complete many tasks that they were not specifically trained to perform. In contrast, smaller models are generally only good at a small number of tasks, typically those that are similar to the task that they were trained on. You may have to try out a few models to find the right one for your use case. Once you've found the model that is working for you, there are a few settings that you can experiment with to influence the structure and style of the completions that the model generates. Let's take a look at some of these configuration settings in the next video.

Generative configuration

In this video, you'll examine some of the methods and associated configuration parameters that you can use to influence the way that the model makes the final decision about next-word generation. If you've used LLMs in playgrounds such as the Hugging Face website or AWS, you might have been presented with controls like these to adjust how the LLM behaves. Each model exposes a set of configuration parameters that can influence the model's output during inference. Note that these are different from the training parameters, which are learned during training time. Instead, these configuration parameters are invoked at inference time and give you control over things like the maximum number of tokens in the completion, and how creative the output is.

Max new tokens is probably the simplest of these parameters, and you can use it to limit the number of tokens that the model will generate. You can think of this as putting a cap on the number of times the model will go through the selection process. Here you can see examples of max new tokens being set to 100, 150, or 200. But note how the length of the completion in the example for 200 is shorter. This is because another stop condition was reached, such as the model predicting an end-of-sequence token. Remember, it's max new tokens, not a hard number of new tokens generated.
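
As a minimal sketch with the Hugging Face transformers library (using gpt2 here only because it is small and publicly available, not because it is the model used in the course), the cap is passed to generate() as max_new_tokens:

```python
# Sketch: capping generation length with max_new_tokens in Hugging Face
# transformers. "gpt2" is an arbitrary small public model for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The best thing about learning to code is", return_tensors="pt")

# max_new_tokens is an upper bound; the completion can still end earlier if
# another stop condition is reached, such as an end-of-sequence token.
output_ids = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```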

The output from the transformer's softmax layer is a probability distribution across the entire dictionary of words that the model uses. Here you can see a selection of words and their probability score next to them. Although we are only showing four words here, imagine that this is a list that carries on to the complete dictionary. Most large language models by default will operate with so-called greedy decoding. This is the simplest form of next-word prediction, where the model will always choose the word with the highest probability. This method can work very well for short generation but is susceptible to repeated words or repeated sequences of words. If you want to generate text that's more natural, more creative and avoids repeating words, you need to use some other controls.
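
As a toy illustration of greedy decoding, the snippet below uses the word probabilities mentioned in this example (cake 0.2, donut 0.1, banana 0.02; the fourth word is made up to round out the list, and the rest of the vocabulary is omitted):

```python
# Greedy decoding over a toy, truncated probability distribution:
# always pick the single most probable next word.
probs = {"cake": 0.20, "donut": 0.10, "banana": 0.02, "apple": 0.01}

greedy_choice = max(probs, key=probs.get)
print(greedy_choice)  # "cake" is always selected, which can lead to repetition
```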

Random sampling is the easiest way to introduce some variability. Instead of selecting the most probable word every time, with random sampling the model chooses an output word at random, using the probability distribution to weight the selection. For example, in the illustration, the word banana has a probability score of 0.02. With random sampling, this equates to a 2% chance that this word will be selected. By using this sampling technique, we reduce the likelihood that words will be repeated. However, depending on the setting, there is a possibility that the output may be too creative, producing words that cause the generation to wander off into other topics or that just don't make sense. Note that in some implementations, you may need to disable greedy decoding and enable random sampling explicitly. For example, the Hugging Face transformers implementation that we use in the lab requires that we set do_sample to True.
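
A sketch of weighted random sampling over the same toy distribution; in the Hugging Face transformers generate() API, this behavior corresponds to passing do_sample=True (otherwise greedy decoding is used):

```python
import random

# Random sampling: choose the next word at random, weighted by probability.
# "banana" at 0.02 now has roughly a 2% chance of being selected.
words = ["cake", "donut", "banana", "apple"]
weights = [0.20, 0.10, 0.02, 0.01]

sampled = random.choices(words, weights=weights, k=1)[0]
print(sampled)
```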

Let's explore the top k and top p sampling techniques, two settings that we can use to limit the random sampling and increase the chance that the output will be sensible. To limit the options while still allowing some variability, you can specify a top k value which instructs the model to choose from only the k tokens with the highest probability. In this example here, k is set to three, so you're restricting the model to choose from these three options. The model then selects from these options using the probability weighting, and in this case it chooses donut as the next word. This method can help the model have some randomness while preventing the selection of highly improbable completion words. This in turn makes your text generation more likely to sound reasonable and to make sense.
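
A sketch of the top k idea over the same toy distribution (in the transformers generate() API this corresponds to passing do_sample=True together with top_k):

```python
import random

# Top-k sampling sketch with k=3: keep only the three most probable tokens,
# then sample among them using their probabilities as weights.
probs = {"cake": 0.20, "donut": 0.10, "banana": 0.02, "apple": 0.01}
k = 3

top_k = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
words, weights = zip(*top_k)  # ("cake", "donut", "banana")
print(random.choices(words, weights=weights, k=1)[0])
```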

Alternatively, you can use the top p setting to limit the random sampling to the predictions whose combined probabilities do not exceed p. For example, if you set p to equal 0.3, the options are cake and donut since their probabilities of 0.2 and 0.1 add up to 0.3. The model then uses the random probability weighting method to choose from these tokens. With top k, you specify the number of tokens to randomly choose from, and with top p, you specify the total probability that you want the model to choose from.
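
A sketch of the top p idea, following the description above with p = 0.3 (in the transformers generate() API this corresponds to passing do_sample=True together with top_p):

```python
import random

# Top-p (nucleus) sampling sketch with p=0.3: keep tokens, in order of
# probability, while their combined probability does not exceed p.
probs = {"cake": 0.20, "donut": 0.10, "banana": 0.02, "apple": 0.01}
p = 0.3

kept, total = [], 0.0
for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    if total + prob > p + 1e-9:  # small tolerance for floating-point rounding
        break
    kept.append((word, prob))
    total += prob

words, weights = zip(*kept)  # ("cake", "donut"), as in the example above
print(random.choices(words, weights=weights, k=1)[0])
```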

One more parameter that you can use to control the randomness of the model output is known as temperature. This parameter influences the shape of the probability distribution that the model calculates for the next token. Broadly speaking, the higher the temperature, the higher the randomness, and the lower the temperature, the lower the randomness. The temperature value is a scaling factor that's applied within the final softmax layer of the model and that impacts the shape of the probability distribution of the next token. In contrast to the top k and top p parameters, changing the temperature actually alters the predictions that the model will make. If you choose a low value of temperature, say, less than one, the resulting probability distribution from the softmax layer is more strongly peaked, with the probability being concentrated in a smaller number of words. You can see this here in the blue bars beside the table, which show a probability bar chart turned on its side. Most of the probability here is concentrated on the word cake. The model will select from this distribution using random sampling, and the resulting text will be less random and will more closely follow the most likely word sequences that the model learned during training. If instead you set the temperature to a higher value, say, greater than one, then the model will calculate a broader, flatter probability distribution for the next token. Notice that in contrast to the blue bars, the probability is more evenly spread across the tokens. This leads the model to generate text with a higher degree of randomness and more variability in the output compared to a cool temperature setting. This can help you generate text that sounds more creative. If you leave the temperature value equal to one, this will leave the softmax function at its default, and the unaltered probability distribution will be used.
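
A toy sketch of temperature scaling (the logit values are invented; in the transformers generate() API the analogous knob is the temperature parameter): the logits are divided by the temperature before the softmax, so values below one sharpen the distribution and values above one flatten it.

```python
import math

# Temperature scaling sketch: divide the logits by the temperature before
# applying the softmax. Logit values are invented for illustration.
logits = {"cake": 2.0, "donut": 1.0, "banana": -1.0, "apple": -2.0}

def softmax_with_temperature(logits, temperature):
    scaled = {w: v / temperature for w, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {w: math.exp(v) / z for w, v in scaled.items()}

print(softmax_with_temperature(logits, 0.5))  # peaked: probability piles onto "cake"
print(softmax_with_temperature(logits, 1.0))  # default softmax, distribution unaltered
print(softmax_with_temperature(logits, 2.0))  # flatter: probability spread more evenly
```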

You've covered a lot of ground so far. You've examined the types of tasks that LLMs are capable of performing and learned about transformers, the model architecture that powers these amazing tools. You've also explored how to get the best possible performance out of these models using prompt engineering and by experimenting with different inference configuration parameters. In the next video, you'll start building on this foundational knowledge by thinking through the steps required to develop and launch an LLM-powered application.
