Generative AI with Large Language Models Study Notes - 2.1.1 Instruction Fine-Tuning of LLMs - Week 2 Introduction

Fine-tuning LLMs with instruction

Introduction - Week 2

Welcome back. I'm here with the instructors for this week, Mike and Shelby. Last week you learned about transformer networks, which are really a key foundation for large language models, as well as the generative AI project life cycle. This week there's lots more to dive into, starting with instruction fine-tuning of large language models, and then later how to carry out fine-tuning in an efficient way.

>> Yes, so we take a look at instruction fine-tuning. When you have your base model, the thing that's initially pretrained, it has encoded a lot of really good information, usually about the world. So it knows about things, but it doesn't necessarily know how to respond to our prompts, our questions. When we instruct it to do a certain task, it doesn't necessarily know how to respond. Instruction fine-tuning helps it change its behavior to be more helpful for us.
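
To make that concrete, here is a minimal sketch of what a single instruction-tuning example might look like; the prompt template and field names below are illustrative assumptions, not taken from the course.

```python
# A hypothetical instruction-tuning example: the prompt states the task
# in natural language, and the completion is the response the model is
# fine-tuned to produce.
example = {
    "prompt": (
        "Summarize the following conversation.\n\n"
        "Customer: My order arrived damaged.\n"
        "Agent: I'm sorry to hear that, I'll send a replacement today.\n\n"
        "Summary:"
    ),
    "completion": (
        " The customer reported a damaged order and the agent offered "
        "to ship a replacement the same day."
    ),
}

# During fine-tuning, the model sees the prompt and learns to generate
# the completion token by token.
print(example["prompt"] + example["completion"])
```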

>> I think instruction fine-tuning was one of those major breakthroughs in the history of large language models. Because by learning from general text off the Internet and other sources, the model learns to predict the next word. But predicting the next word on the Internet is not the same as following instructions. I think it's amazing that you can take a large language model, train it on hundreds of billions of words from the Internet, and then fine-tune it with a much smaller dataset on following instructions, and it just learns to do that.

>> That's right, and one of the things you have to watch out for, of course, is catastrophic forgetting, and this is something that we talk about in the course. That's where you train the model on some extra data, in this case during instruction fine-tuning, and then it forgets all of the stuff it had learned before, or a big chunk of it. There are some techniques that we'll talk about in the course to help combat that, such as doing instruction fine-tuning across a really broad range of different instruction types. So it's not just a case of tuning it on only the thing you want it to do; you might have to be a little bit broader than that as well, but we talk about it in the course.
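
As a rough illustration of that idea, the sketch below interleaves examples from several instruction types when building training batches; the dataset names and sampling scheme are assumptions for illustration, not the course's exact recipe.

```python
import random

# Hypothetical pools of instruction examples for different task types.
datasets = {
    "summarization": ["Summarize the article: ..."],
    "translation": ["Translate this sentence to German: ..."],
    "question_answering": ["Answer the question: ..."],
    "sentiment": ["Classify the sentiment of this review: ..."],
}

def sample_mixed_batch(datasets, batch_size=8, seed=0):
    """Draw a batch spread across task types instead of a single task."""
    rng = random.Random(seed)
    tasks = list(datasets)
    return [rng.choice(datasets[rng.choice(tasks)]) for _ in range(batch_size)]

print(sample_mixed_batch(datasets, batch_size=4))
```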

>> And so it turns out that there are two types of fine-tuning that are very much worth doing. One is the instruction fine-tuning we just talked about, Mike. The other is when a specific developer is trying to fine-tune a model for their own specialized application. One of the problems with fine-tuning is that you take a giant model and you fine-tune every single parameter in that model. You then have this big thing to store and deploy, and it's actually very compute- and memory-expensive. So fortunately, there are better techniques than that.

>> Right, and we talk about parameter efficient fine-tuning, or PEFT for short, as a set of methods that can allow you to mitigate some of those concerns, right? So we have a lot of customers that do want to be able to tune for very specific tasks, very specific domains. And parameter efficient fine-tuning is a great way to still achieve performance similar to full fine-tuning on a lot of tasks, while actually taking advantage of techniques that allow you to freeze the original model weights, or add adapter layers on top of them with a much smaller memory footprint, right? So that you can train for multiple tasks.
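
A minimal sketch of that pattern, assuming the Hugging Face peft library and a FLAN-T5 base model (the model choice and hyperparameters are illustrative, not the course's exact setup): the base weights stay frozen and only small adapter matrices are trained.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

# Load a pretrained base model whose weights will be kept frozen.
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Configure small trainable adapter matrices on the attention projections.
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the low-rank update
    lora_alpha=32,              # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q", "v"],  # which modules to adapt (model-specific names)
)

peft_model = get_peft_model(base_model, peft_config)

# Shows how small the trainable parameter count is relative to the full model.
peft_model.print_trainable_parameters()
```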

>> In fact, one of the techniques that I know you've used a lot is LoRA. I remember when I read the LoRA paper, I thought, this just makes sense, this is going to work.

>> Right, we see a lot of excitement and demand around LoRA because of the performance results of using those low-rank matrices as opposed to full fine-tuning, right? So you're able to get really good performance results with minimal compute and memory requirements.
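
To see why the low-rank matrices are so cheap, here is a back-of-the-envelope sketch in PyTorch with illustrative dimensions (not tied to any specific model): the frozen weight W is left untouched, and only two small factors A and B are trained. The alpha/r scaling that LoRA normally applies is omitted to keep the sketch minimal.

```python
import torch
import torch.nn as nn

d_in, d_out, r = 4096, 4096, 8   # illustrative layer size and LoRA rank

W = torch.randn(d_out, d_in)                     # frozen pretrained weight
A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # trainable, r x d_in
B = nn.Parameter(torch.zeros(d_out, r))          # trainable, d_out x r

def lora_forward(x):
    # Output of the frozen layer plus the learned low-rank update B @ A.
    return x @ W.T + x @ (B @ A).T

full_params = d_out * d_in                       # 16,777,216 trainable params
lora_params = r * d_in + d_out * r               # 65,536 trainable params
print(f"full fine-tuning: {full_params:,} parameters for this matrix")
print(f"LoRA with r={r}:  {lora_params:,} parameters for this matrix")
```

With these illustrative numbers, the trainable parameter count for this single matrix drops by a factor of 256, which is where the memory savings mentioned above come from.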

>> So what I'm seeing among different developers is that many will often start off with prompting, and sometimes that gives you good enough performance, and that's great. But sometimes prompting hits a ceiling in performance, and then this type of fine-tuning with LoRA or another PEFT technique is really critical for unlocking that extra level of performance. The other thing I'm seeing among a lot of LLM developers is a debate about the cost of using a giant model, which has a lot of benefits, versus fine-tuning a smaller model for your application.

>> Exactly, full fine-tuning can be cost-prohibitive, to say the least. So techniques like PEFT put fine-tuning of generative AI models in the hands of everyday users who do have cost constraints and are cost-conscious, which is pretty much everyone in the real world, right?

>> That's right. And of course, if you're concerned about where your data is going as well, so that it needs to run under your control, then having a model of an appropriate size is really important.

>> And so, once again, tons of exciting stuff to dive into this week. Let's go on to the next video, where Mike will kick things off with instruction fine-tuning.
