Large Language Models and Generative AI Study Notes — 1.2 Introduction to Large Language Models and the Generative AI Project Lifecycle — Week 1 Course Introduction

Week 1

Generative AI use cases, project lifecycle, and model pre-training

Learning Objectives


  • Discuss model pre-training and the value of continued pre-training vs fine-tuning
  • Define the terms Generative AI, large language model, and prompt, and describe the transformer architecture that powers LLMs
  • Describe the steps in a typical LLM-based generative AI model lifecycle, and discuss the constraining factors that drive decisions at each step of the lifecycle
  • Discuss computational challenges during model pre-training and determine how to efficiently reduce the memory footprint
  • Define the term scaling law, and describe the laws that have been discovered for LLMs relating to training dataset size, compute budget, inference requirements, and other factors


Introduction - Week 1

>> Welcome back. There's a lot of exciting material to go over this week, and one of the first topics that Mike will share with you in a little bit is a deep dive into how transformer networks actually work. 

>> Yeah, so look, it's a complicated topic, right? In 2017, the paper Attention Is All You Need came out, and it laid out all of these fairly complex data processes that happen inside the transformer architecture. So we take a little bit of a high-level view, but we do go down into some depth. We talk about things like self-attention and the multi-headed self-attention mechanism. 

So we can see why it is that these models actually work, how it is that they actually gain an understanding of language. 
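
As a study note: the self-attention mechanism mentioned above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the course's code; the toy dimensions and random weights are assumptions chosen purely for demonstration.

    import numpy as np

    def self_attention(X, W_q, W_k, W_v):
        # X: (seq_len, d_model) token embeddings; W_*: learned projection matrices
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) pairwise scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
        return weights @ V                              # attention-weighted mix of values

    rng = np.random.default_rng(0)                      # toy sizes: 4 tokens, d_model=8, d_k=4
    X = rng.normal(size=(4, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)       # -> (4, 4)

Row i of the weights matrix says how strongly token i attends to every other token, which is the sense in which the model learns relationships across the whole sequence.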

>> And it's amazing how long the transformer architecture has been around and it's still state of the art for many models. 

>> I remember after I saw the transformer paper when it first came out, I thought, yep, I get this equation. I acknowledge this is a math equation. But what's it actually doing? And it's always seemed a little bit magical. It took me a long time playing with it to finally go, okay, this is why it works. And so I think in this first week, you learn the intuitions behind some of these terms you may have heard before, like multi-headed attention. What is that and why does it make sense? 
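
For reference, the equation in question is the scaled dot-product attention formula from Attention Is All You Need, together with its multi-head extension:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

    \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V), \qquad \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O

Multi-headed attention simply runs h of these attention computations with separate learned projections, so each head can specialize in a different kind of relationship between tokens.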

And why did the transformer architecture really take off? I think attention had been around for a long time, but I actually think one of the things that really made it take off was that it allowed attention to work in a massively parallel way. So it worked on modern GPUs and could be scaled up. I think these nuances around transformers are not well understood by many, so I'm looking forward to when you deep dive into that. 
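
To make the parallelism point concrete: every head and every token position is computed at once as a few batched matrix multiplications, exactly the workload modern GPUs are built for (in contrast to recurrent networks, which must step through the sequence one position at a time). A rough NumPy sketch; the head count and dimensions are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(1)
    seq_len, d_model, h = 6, 16, 4
    d_k = d_model // h
    X = rng.normal(size=(seq_len, d_model))

    def project(X):
        # Project once, then split into h heads: shape (h, seq_len, d_k)
        W = rng.normal(size=(d_model, d_model))
        return (X @ W).reshape(seq_len, h, d_k).transpose(1, 0, 2)

    Q, K, V = project(X), project(X), project(X)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # all heads, all positions at once
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = (weights @ V).transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    print(out.shape)                                    # -> (6, 16)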

>> Absolutely, I mean, the scale is part of it and how it's able to take in all that data. I just want to say as well, though, that we're not going to go into this at a level that's going to make people's heads explode. If people want that, they can go ahead and read the paper too. 

What we're going to do is look at the really important parts of the transformer architecture that give you the intuition you need, so that you can actually make practical use of these models. 

>> One thing I've been surprised and delighted by is that, even though this course focuses on text, the basic transformer architecture is also creating a foundation for vision transformers. 

So even though in this course you learn mostly about large language models, models about text, I think understanding transformers is also helping people understand these really exciting vision transformers and other modalities as well. It's going to be a really critical building block for a lot of machine learning.

>> Absolutely. 

>> And then beyond transformers, there's a second major topic I'm looking forward to having this first week cover, which is the generative AI project lifecycle. I know a lot of people are thinking, boy, with all this LLM stuff, what do I do with it? And the generative AI project lifecycle, which we'll talk about in a little bit, helps you plan out how to think about building your own generative AI project. 

>> That's right, and the generative AI project lifecycle walks you through the individual stages and decisions you have to make when you're developing generative AI applications. So one of the first things you have to decide is whether you're taking a foundation model off the shelf or actually pre-training your own model, and then, as a follow-up, whether you want to fine-tune and customize that model, maybe for your specific data. 
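
As a concrete picture of the "foundation model off the shelf" path: a library such as Hugging Face transformers lets you load a pretrained model in a couple of lines and defer fine-tuning until you know you need it. A minimal sketch; the checkpoint google/flan-t5-base is an illustrative assumption, not a course recommendation:

    from transformers import pipeline

    # Off-the-shelf: a pretrained seq2seq foundation model, used as-is
    summarizer = pipeline("summarization", model="google/flan-t5-base")

    dialogue = ("Agent: How can I help? "
                "Customer: My order never arrived. "
                "Agent: Sorry about that, I'll reship it today.")
    print(summarizer(dialogue, max_new_tokens=40)[0]["summary_text"])

If the off-the-shelf output isn't good enough on your data, the lifecycle's follow-up decision is whether to fine-tune that same checkpoint on your own examples.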

>> Yeah, in fact, there are so many large language model options out there, some open source, some not, that I see many developers wondering, which of these models do I want to use? 

And so it helps to have a way to evaluate them, and then also to choose the right model size. I know in your other work, you've talked about when you need a giant model, 100 billion parameters or even much bigger, versus when a 1-to-30-billion-parameter model, or even a sub-1-billion-parameter model, can be just fantastic for a specific application. 
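
For a rough sense of what those parameter counts mean in practice: at 4 bytes per parameter (fp32), the weights alone work out as below, and training typically needs several times more memory for gradients and optimizer state. The sizes are just the ones mentioned above, not benchmarks:

    # Back-of-the-envelope: memory for model weights alone at fp32 (4 bytes/param)
    for params in (1e9, 30e9, 100e9):
        print(f"{params / 1e9:5.0f}B parameters -> ~{params * 4 / 1e9:,.0f} GB of weights")
    # 1B -> ~4 GB, 30B -> ~120 GB, 100B -> ~400 GB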

>> Exactly, so there might be use cases where you really need the model to be very comprehensive and able to generalize to a lot of different tasks. 

And there might be use cases where you're just optimizing for a single-use case, right? 

And you can potentially work with a smaller model and achieve similar or even very good results. 

>> Yeah, I think that might be one of the really surprising things for some people to learn is that you can actually use quite small models and still get quite a lot of capability out of them. 

>> Yeah, I think when you want your large language model to have a lot of general knowledge about the world, when you want it to know stuff about history and philosophy and the sciences and how to write Python code and so on, it helps to have a giant model with hundreds of billions of parameters. But for a single task like summarizing dialogue or acting as a customer service agent for one company, for applications like that, you can sometimes use a model with hundreds of billions of parameters, but that's not always necessary. So lots of really exciting material to get into this week. 

With that, let's go on to the next video, where Mike will kick things off with a deep dive into the many different use cases of large language models.

