Accelerate FM development with Amazon SageMaker JumpStart

All right, I think we're good. Welcome everyone, Monday evening at re:Invent. Thank you so much for joining us. I know it's just before dinner time, but we're gonna make this super duper exciting. By way of introduction, my name is Carl Albertson. I lead our SageMaker models and partnership teams, and I'm gonna be joined on stage in a moment by Jeff Boudier, who leads product and growth at Hugging Face, as well as Mark Carp, who's a senior machine learning architect on the SageMaker team.

So as I mentioned, we've got a really exciting hour planned for you. We're gonna be talking about some of the really exciting things, and also some of the challenges, that you're likely facing as you begin to adopt generative AI. We're gonna talk about what we have done at AWS to help simplify some of these processes, especially with SageMaker JumpStart. Then Jeff's gonna talk us through a lot of the really interesting innovations going on at Hugging Face to bring open source models to you and make those really easy to implement, right at your fingertips. And then finally, Mark's gonna come up and walk you through, in detail, how you can really start taking these models, customizing them, and optimizing them for deployment at scale.

So without further ado, I actually want to do a little bit of a poll of the audience to gauge where folks are in their generative AI journey. I'm gonna ask three questions, and I'd love a show of hands to get an idea of where your experiences are.

So first question: can I see hands of who has started to play around with large language models, or generative AI models in general? All right, so the majority of this room, great. Who's gone a step further and started to compare models, customize or maybe fine-tune models, or even begin to optimize some of the parameters? Very nice. And then lastly, who in this room has actually taken models as far as they can go and put them into production endpoints for a specific business use case? OK.

So I think we all saw there's a huge variety of experience and of where folks are on their journeys here. The use cases are really compelling, and there's a lot of excitement in the room. But I think many of you that have gone through the various stages can also share your experiences and war stories, and there are a lot of challenges along the way.

So let me touch on a couple of what those can be. First off, the space of large language models and generative AI is expanding and evolving very rapidly. Over the last six months or so, I think we've seen a tremendous introduction of new models: large models, medium-size models, small models. Pretty much every animal I can think of that flies, or that has, I think it's not fur, but something that can be turned into sweaters, has been turned into a model. We're also seeing quantized variants, models that can be optimized for certain applications, and then domain-specific models and language-specific models. So this is an example to show that the landscape of models to choose from is vast, and we see it growing and continuing to grow very rapidly.

Second is security and compliance. It's one thing to play with models in a laboratory setting, but as you start thinking about taking them to production settings and applying them to business use cases, this is where regulatory compliance becomes critical. We have acronyms such as GDPR, PCI, and HIPAA compliance, FedRAMP compliance, and so forth. But more importantly, it's about retaining control over your data, your prompts, and your inference, and where that goes. That's a critical component that many customers are facing.

The third is, as I mentioned, there are many models out there. But how do we actually compare these models? How do we find that balance between the accuracy a model can bring, the performance, and the cost? And as things change over time, how do we compare against new models and then improve through fine-tuning and so forth?

And finally, as we think about going from proof of concept to production at scale, a lot of these questions of cost begin to come into play. Can we get that same accuracy, or close enough accuracy to what we're looking for, at a cost that is scalable and meets the needs of your business?

So those are a lot of the challenges that many of you in this room are facing today or will face. And that's exactly what we're thinking about here at AWS: how can we simplify that? How can we take that burden off of you?

So there are a couple of services, two in particular, you're gonna hear a ton about this week. One is Amazon Bedrock, the other is Amazon SageMaker. I won't go into too much detail on either of those services, but I want to leave you with a few things to help when these questions come up of, when should I think about Bedrock? When should I think about SageMaker?

So I'll start with Bedrock. Bedrock is designed to be the easiest place to get started with generative AI. If your focus is on building applications on top of generative AI models, absolutely, Bedrock is a service designed for you. If you're thinking about getting into the models themselves, and you think your focus is gonna be around fine-tuning, around pushing the limits of open source or proprietary models, or even training from scratch, that's where SageMaker really lives to help you out.

Within SageMaker, I have two kinds of icons on the screen to give you a differentiation, recognizing that many customers still want the performance that SageMaker offers for the model builder, but an easy place to get started. That's where we've designed SageMaker JumpStart to work really well with Hugging Face and bring the generative AI models to your needs. But you can always go as deep as you need to, all the way to training from scratch, and many large model providers do train their models from scratch on SageMaker. So all that power and all that flexibility is available to you at your fingertips if you need it.

So let me give you a quick screenshot of what JumpStart is today, but also explain the paradigm shift we're seeing in ML that gives us the thesis of why we keep investing here and why we think this is really where the future of generative AI is going.

I would say even a year ago, if I was on this stage, I would still be telling you that most customers start with their data: they build, they train, they deploy. You're fundamentally starting with your data and then you work towards a model. What's interesting about large language models is that this paradigm has now shifted, so that you can begin your journey with a model.

So this promise of transfer learning, where someone else trained a model, maybe a generic one or maybe one for a specific domain, that you can then use in your application, is now possible. Now, that might get you 80 or 90% of where you need to go, and you still need to do the work to get from 90 to 100% through fine-tuning and other things. But that idea of being able to go from 0 to 80% with a model kind of out of the box, that's brand new and that's super exciting.

And so we saw a lot of customers still wanting to use SageMaker for these purposes, but starting with a model that in many cases is good enough, or close to good enough, for their needs. And that's where JumpStart came into play.

So what do we offer on JumpStart today? This slide is not designed for you to go ahead and read through it all. It's just designed to show that many of the most relevant models in the space today are all available on JumpStart.

So for example, the Llama 2 family of models: the 7, 13, and 70 billion parameter versions, both the base and the chat models, for inference as well as fine-tuning, with scripts available for you to use today. The models from Falcon and Mistral, as well as the key proprietary model providers we're seeing in the space, such as AI21 and Cohere, are all available.

So we're trying to bring what we think are the most relevant models to you and make those available directly within SageMaker, so they can be used with all of the SageMaker enterprise-ready features and tools that you know, and you'll see plenty of those being rolled out throughout the week.

And that brings us to what Jeff is gonna talk about: the number of open source models far extends beyond what's on this page. And so that's where we're really excited to deepen our partnership with Hugging Face, to make sure that whether you find the models on Hugging Face and deploy them into SageMaker, or you're already a SageMaker customer and want to start in SageMaker and find the best of Hugging Face, that's all available very simply and at your fingertips.

I want to bring this up regarding data protection and security, because that's always a really critical question we get from customers that are going deep on SageMaker. And I think there are really two components to this slide that I want you to walk away with.

One is we are very adamant that when it's your data, it's under your control. So in the world of an open source model, you're actually gonna be deploying that into an endpoint in your account. You have full visibility of the model weights, and full visibility and, obviously, full control over the prompts, the inference, and so forth.

The same is true if you're using models from a proprietary model provider such as AI21, Cohere, or Stability AI. Those models are actually gonna be deployed within your account. So none of the prompts that you provide, or the inference, would leave your account and go to the model provider themselves. And that's key for many customers that are operating in an enterprise setting.

And we take that a step further with SageMaker and JumpStart: you can not only deploy these models into your VPC, but you can also do so with network isolation. We bring them into the AWS infrastructure and package them up so they can easily be deployed with one click.

OK. So I'm gonna walk through, at a high level, where you can find JumpStart and how you can get going with it, and then we'll go into more of the details that Jeff and Mark are gonna lead.

So SageMaker JumpStart today lives inside SageMaker Studio. This is where we have our graphical user interface where you can explore. You can easily browse the models by modality, you can search if there's a specific model you're looking for, and there are a lot of cool features coming out, I think Tuesday and Wednesday this week, where we're gonna go much deeper into how models can be filtered by various attributes, plus key improvements to the UX and so forth. So we're very excited to make this easier and easier as we go.

Once you find the model that you're looking for, you go in, and there's a model detail page that has everything you need to understand what the model does: the description, the size, what it can be used for.

"And key here is uh the licenses and license type, everything from a very permissive apache 2.0 to potentially some more restrictive licenses. So you're aware and can make the right choice for your business uh at this page.

This is where you can directly take action. If you want to run the model for inference, for proof of concept, or even take it to production, you can do so in one click. If you want to fine-tune the model, we can have you point to your fine-tuning data on S3 and away you go.

And if you want to use these programmatically, we have the API code snippets, which you can copy and paste directly into your workflow.

So what does that look like for deployment? We actually have a default deployment, but we're giving you the ability to select certain endpoint configurations. For example, if you're looking for a cost-optimized implementation, a latency-optimized implementation, or a throughput-optimized implementation, we're gonna be providing those defaults for you. And then obviously you have full control over the hyperparameters and can go from there.

And one of the key elements of JumpStart is that we try to provide as much of an open-box, or clear-box, service as possible. So while we make it easy to get going, you have full visibility into the code that we're using and, in the case of open source models, the weights themselves. So you can take this, customize it, and completely make it your own from there.

Now, when it comes to fine-tuning: for many of the key models we provide, we have the fine-tuning scripts available. You can access that directly from SageMaker Studio with the train button, and it's as simple as pointing it to the S3 bucket with your data. If you want to do that programmatically, we have the API code snippets and example notebooks which show you how you can take those training scripts and your training data, and away you go.

And then lastly, which Mark is gonna explain to you a little bit later: when you're ready to scale, or when you're ready to test models in a programmatic fashion, all of that is available at your fingertips through the SDK.

So I'm gonna take a moment before I switch over to Jeff and share a little bit of a story of how I've seen customers in the real world begin to start with different open source or closed models and experiment.

This was probably a few months ago, when I was working with a large enterprise customer. They were very excited to bring generative AI and LLMs into their customer experience use case. There are a couple of elements about this which are interesting. One, I would call this an internal use case, where they weren't exposing a chatbot or a large language model directly to their customers, so in this case the trust and safety layer was less critical. And also, what they were trying to do here was actually look at how they could do analysis over many of the calls that were coming in.

So if you can imagine, when you call a customer service agent, you might have a conversation that goes on for a couple of minutes. The transcript is maybe somewhere between 500 and 1,000 words or so, and many customer service facilities, when they try to summarize, often ask many of the same basic questions: can you tell me, within two sentences, what happened on the call? What was the root cause? What was the action the agent took? And so forth.

So we kind of call this a guided Q&A use case, where the same questions are coming up over and over. And ultimately, this customer I was working with wanted to take that and run some analysis.
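To make that concrete, here is a minimal sketch of what such a guided-Q&A summarization prompt might look like. The questions mirror the ones mentioned above; the template wording and variable names are illustrative, not taken from the customer's actual system.

```python
# Hypothetical guided-Q&A summarization prompt template (illustrative only).
GUIDED_QA_PROMPT = """You are summarizing a customer service call transcript.

Transcript:
{transcript}

Answer each question in one or two sentences:
1. What happened on the call?
2. What was the root cause of the issue?
3. What action did the agent take?
"""

def build_prompt(transcript: str) -> str:
    """Fill the template with a single call transcript."""
    return GUIDED_QA_PROMPT.format(transcript=transcript)
```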

The reason I bring that up is they actually started with a very powerful closed model to run the summarization. And as you can imagine, it worked flawlessly. They loved the results, they said this was amazing. But then they started doing the math. They said, OK, this works for this proof of concept, but actually my use case is that I wanna be summarizing on the order of 100,000 conversations every day, or potentially every week. And at that point, that's when they really struggled to make the business case close of how to bring this to production. Are there opportunities in front of us to actually run this cheaper, more efficiently, and so forth?

So we started working with them, and at the time we started making suggestions like: hey, have you looked at a smaller model? Have you looked at some of the open source models? Is there a model that is potentially good enough? It could mean the difference between being able to actually deploy this use case or not.

At the time we did this, the model that they actually picked was a very simple Flan, or FLAN-T5, model, which, now that a few months have gone by, you could argue seems quaint. But for their use case it was plenty sufficient. And what was really compelling is that the cost of the workload at scale dropped about 95 to 97%. Their eyes went wide and they got super excited, because this is the difference between this being a toy project and something they can actually take to the real world and put in production.

So I'm really excited that they have subsequently taken that workload and put it into production with a smaller, open source model, at a fraction of the cost. And that was really the key. I think there are a couple of themes in there that are compelling: the open source model ecosystem is not only alive and well but critical as customers move into production, and this need to balance what the appropriate model is, if you're trying to optimize for accuracy, cost, and throughput, is gonna be a key theme for many customers, potentially yourselves included, all throughout 2024.

So with that, I'm gonna turn the stage over to Jeff, and he can talk about all the great innovations going on at Hugging Face and what we're doing with SageMaker. Thank you so much, Carl. There you are, sir. Thank you.

Hi, everyone. I'm so excited to tell you about how you can use Hugging Face to build your own AI using open models and open source libraries, and also to tell you about our collaboration, our partnership with AWS, so that you can do all this work in a secure way, in a scalable way, and in a cost-effective way using SageMaker and JumpStart. But I guess the first question is: hold on, what is Hugging Face? So, how many of you, please raise your hand if you already know about Hugging Face. All right, thank you very much. Well, you know, Hugging Face is used every day by very different constituencies. Maybe you're a researcher that publishes their papers onto Hugging Face. Maybe you're a data scientist that comes to find those pre-trained models and evaluate them on the Hugging Face Hub. Maybe you're a developer that wants to use an easy API and goes to Hugging Face to find the best models to do so. Maybe you're a machine learning engineer and you're trying to scale endpoints. Maybe you're a student and you're learning about data science. All these people's experiences are quite a bit different, and it's hard to put all these perspectives into a set of slides. So instead of that, I have a 30-second video that puts it all together into a fast-paced little story. I'm gonna play it for you and it's gonna go real fast, so pay attention, and then I'll test you on what Hugging Face actually is.

It's a leading open platform for AI builders, democratizing good machine learning. But what does that actually mean? There are three pillars: models, datasets, and Spaces. Models are the thing that actually do the AI. We host hundreds of thousands of models, including ones from major companies like Google, Microsoft, and Meta, as well as research institutions like Stanford and open source communities like EleutherAI. Then there are datasets, openly accessible data used to train the models. And finally Spaces, which let you easily demo and showcase models, a recent example being Illusion Diffusion. All of these pillars together are combined with open source libraries like Transformers, Diffusers, and Accelerate, making it easy to build, share, and use the latest AI models. All of this is free. But if you need more compute or help deploying your own models, we have that too. So if you want to be involved, go to hf.co. Easy, right?

Did you get that? Let me double click on the most important points. And maybe the first one is our mission. Our mission is to democratize good machine learning, and in democratization there are multiple layers. We want to make AI accessible to as many people as possible. We want to make AI easy to understand for as many people as possible. We want to make AI easy to use by as many people as possible, and we want to make AI cost effective. And all of these layers apply to the work and collaboration that we've been doing with the teams at AWS and SageMaker. And we want to democratize good machine learning. What does that mean for us, good machine learning? Well, it means machine learning that is built from open source. It means machine learning that is community driven. And it means machine learning that is built from ethics-first principles. So that's our mission. And where are we today?

So today we are hosting over a million AI repositories. That's a million models, datasets, or applications, which is what we call Spaces. And it's models to do any kind of machine learning task. Maybe it's a natural language processing task using a transformer model, like translation, summarization, text generation, or classification. Maybe it's running a speech or audio model to transcribe what I'm saying into text, or to recognize speakers in a conversation. Maybe it's a computer vision model, including generating images with the Diffusers library. So today it really encompasses all modalities of machine learning, and not just Hugging Face open source libraries but models from over 30 different open source libraries. But how did we get there? Carl mentioned transfer learning.

If you ask me what has been the biggest game changing innovation over the past five years, over the past five years of exponential growth of machine learning, to me, it is transfer learning.

And what transfer learning is is the ability to leverage huge models that have been created at huge expense using massive amounts of compute, gigantic datasets over a large period of time to create these pre-trained models you can use off the shelf and then adapt very efficiently with relatively little data.

And that's how we go from hundreds of architectures - architectures being the code implementation of a model within a library like the Transformers library - into hundreds of thousands of model checkpoints, model weights.

And today on the Hugging Face Hub, you can find all of those model weights and repositories that have been contributed by the community, in order to find the right model for whichever task you're targeting, whichever language you're going to be working in, whichever domain your data is in.

And that's what created this need for a central place for the community to share and access all of these off-the-shelf, pre-trained, fine-tuned, compiled, quantized models. And that's how we got to a million repositories and all these models translate into an incredible amount of usage.

And today, there are over 10 million model downloads happening every single day on the Hugging Face Hub. And these statistics are actually open source too: you can find them in a Space on Hugging Face.

And what's interesting is that if you look at the top 10 architectures, they actually cover many different modalities and libraries. You have Wav2Vec2 for speech transcription, you have models for text classification, you have models for text generation like Llama, and you have models for computer vision.

So today, Hugging Face is the leading open platform for AI builders and also the home for large language models and generative AI, which is a subset of machine learning.

So if we zoom into LLMs, which is the theme of this session, I think one takeaway that I want you to have is the Open LLM Leaderboard, which is a free and accessible resource on the Hugging Face Hub to be able to comb through the jungle of open LLMs that are hosted on Hugging Face.

And today there are thousands, close to 5,000, open large language models that are being continuously evaluated on the Open LLM Leaderboard. So you can easily find the model that would work best for you, filtering by model size.

Are you looking for a pre-trained, a fine-tuned, or a chat model? Are you looking for a specific architecture or license? You can filter very easily using this tool.
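The leaderboard itself is a web app, but if you prefer to browse the Hub programmatically, the huggingface_hub library can do a similar kind of filtering. A minimal sketch, assuming you just want the most-downloaded text-generation models (the filter values are examples, not a replica of the leaderboard's evaluation-based ranking):

```python
# Minimal sketch: browsing text-generation models on the Hugging Face Hub programmatically.
# Requires `pip install huggingface_hub`; the task filter and limit are illustrative.
from huggingface_hub import list_models

for model in list_models(task="text-generation", sort="downloads", direction=-1, limit=10):
    print(model.id, getattr(model, "downloads", None))
```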

Alright. So I talked about how companies have been able to build these amazing foundation models and make them available to the community through the Hugging Face Hub, but we don't stop there. We also create, or contribute to, our own foundation models.

And we do this using our managed cluster on SageMaker, using some very cool features that you're going to hear about this week. And some of the great models that we've been able to create using our managed cluster on SageMaker are, for instance, StarCoder.

StarCoder comes from the BigCode initiative, and it's a state-of-the-art open model for code generation that, upon release, beat every proprietary model that was available. And what made it really special is that it was built from a completely open dataset, built from consented, transparent data.

Another great example of that is IDEFICS. IDEFICS is the largest open multimodal model, meaning that it takes as input not just text but also images to generate text, like Flamingo and GPT-4.

And the latest example of that is Zephyr. Zephyr is a 7-billion-parameter model, a fine-tune of Mistral 7B with a new fine-tuning technique, direct preference optimization, that makes it the most helpful assistant model in its weight category.

And so we're coming up on three years of our collaboration with AWS, and all the great things that I've been talking about are things that you can use today, through the experiences that we build in collaboration with AWS and SageMaker.

So you're benefiting from three years of collaboration and work to have these ready-to-use experiences. Our goal for this collaboration is to make it easy, to make it secure, and to make it cost effective for you to leverage the best open models.

We make it easy by providing you with the tools. We make it secure because you're benefiting from all the security that has been built into the SageMaker platform, everything that Carl talked about before. And we make it cost effective because you can use all of the cost optimization techniques built into SageMaker, like spot instances and like hardware accelerators such as Trainium and Inferentia.

So let's start with easy. How can you leverage all these great open models directly within your environment? The most important artifacts for you to do that are our Deep Learning Containers, which we build and maintain in collaboration with the SageMaker team.

So that when Meta releases Llama 2 on Hugging Face, within days you can use it on SageMaker. So that when the next model provider releases their model, you can use it within days on SageMaker. And these Deep Learning Containers are created to give you an out-of-the-box experience, whichever type of model and whichever type of hardware you are targeting.

And today we offer Deep Learning Containers for training, for inference of models. And specifically for large language model inference, we offer Deep Learning Containers targeting CPU, targeting GPU and also now targeting Trainium and Inferentia, so the Neuron architecture.

And of course, we offer Deep Learning Containers that are optimized for your machine learning framework of choice. We make it very cost effective for you to deploy those models at scale.

And one of the key tools to do that, and I want to zoom into the large language model use case, is Text Generation Inference. Text Generation Inference is an open, source-accessible library which is built to deliver the highest possible throughput, meaning the largest amount of text that you can generate per unit of time on hardware accelerators.

And today, I'm super excited to announce, or tell you, that we now support TGI, which is the short name for Text Generation Inference, with support for the Neuron architecture.

And earlier today, my colleague Philipp Schmid gave a great demo of how you can deploy Llama 2 using Inferentia 2 and benefit from all those optimizations.

And another thing that is really exciting: Text Generation Inference, as I mentioned, is a source-accessible library, and through SageMaker you can use the library without any restrictions under its new HFOIL license.
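To make the DLC plus TGI story concrete, here is a minimal sketch of deploying an open model from the Hub onto a SageMaker real-time endpoint using the Hugging Face LLM (TGI) container. The model ID, instance type, environment settings, and IAM role below are assumptions you would replace for your own setup.

```python
# Minimal sketch: deploying an open LLM with the Hugging Face TGI container on SageMaker.
# Model ID, instance type, env settings, and role are placeholders/assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()                     # or an explicit IAM role ARN
image_uri = get_huggingface_llm_image_uri("huggingface")  # TGI deep learning container

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",   # Hub model to load (example)
        "SM_NUM_GPUS": "1",                           # tensor parallel degree
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
    },
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
print(predictor.predict({"inputs": "What is SageMaker JumpStart?"}))
```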

And finally, to put it all together, how can we make it very easy for you to go through the whole workflow that Carl described to you? How can you go from model discovery to deploying the model at scale?

And the best example of that is the ready-to-use code snippet that we make available directly on the Hugging Face Hub. So if you go to the model page for any of the Hugging Face models that were listed in the previous slide and you click on Deploy, you will find a Deploy on SageMaker option.

If you click on that, you will find that with JumpStart, we give you a ready to use code snippet that will leverage the SageMaker SDK. So you can deploy the model super, super easily.

And deploying the model is only one of the things that you can accomplish using SageMaker JumpStart. To walk you through all the great things you can do with JumpStart, I'm happy to hand it over to Mark.

Thank you. Cool, thanks, Jeff. So now we've seen where SageMaker JumpStart lies within the GenAI offering on AWS. We've also seen how to discover and deploy some of these models via the UI, and we've seen how Hugging Face is a key factor in accelerating access to these models.

But what does it take, you know, to analyze, evaluate, test, and maybe even retrain these models? So let's take a look at what that looks like with SageMaker.

So as many of you may know, there are a number of models out there, and the question that customers always bring to us is: well, what model do I pick? The recommendation we give is, first of all, work backwards from your use case, right?

But follow the following heuristic somewhat. First is prompt engineering. Take a look at the different pre-trained models that are out there today. If it's for text generation, take a look at specific text generation models. If it's code generation, take a look at the pre-trained code generation models, and see how far you can get with just manipulating the prompt, right?

If you're looking at, let's say, a text generation use case and you fixate on the Llama family of models, start out with a smaller variation, see how far you can get and see what you can accomplish.

So for example, if you have a summarization use case where you need to summarize, let's say, transcripts from a call center, and you wanna see if a model can actually solve your use case, you may notice that zero-shot prompts to the model may not work, right?

The model may not understand how to summarize your specific data in the correct format or structure you expect. So you could look at providing the model some examples of how to summarize, and we call that few-shot learning.
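As an illustration of the few-shot idea, the prompt can embed a couple of worked transcript/summary pairs before the transcript you actually want summarized. The examples below are made up purely to show the shape of the prompt:

```python
# Hypothetical few-shot prompt for call summarization (examples are made up).
FEW_SHOT_PROMPT = """Summarize each call transcript in two sentences.

Transcript: Customer reported that their package arrived damaged. The agent apologized,
filed a claim, and shipped a replacement.
Summary: The customer received a damaged package. The agent filed a claim and sent a replacement.

Transcript: Customer could not log in after a password reset. The agent walked them through
clearing their cache and resetting the password again.
Summary: The customer was locked out after a password reset. The agent guided them through a second reset.

Transcript: {transcript}
Summary:"""

prompt = FEW_SHOT_PROMPT.format(transcript="Customer asked to move their billing date to the 15th...")
```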

Give the model a few examples like that and see if it can accomplish your use case. The next thing is retrieval augmented generation. One thing to note is that the knowledge these models have is locked in time, right? The knowledge the models have is only what they were trained on.

So what if you have a use case where you wanna plug in some external knowledge base? Let's say you want to be able to ask questions and get answers on your internal policies and procedures. The pre-trained models, as they are, were obviously not trained on your internal knowledge base.

So you can implement something called RAG, whereby, at a high level, the concept is: take your internal knowledge base or your external knowledge, query that data, get the relevant information that pertains to the question, and pass that knowledge, those sources, to the pre-trained model to discern the final answer.
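Here is a minimal, framework-free sketch of that flow. The retrieval step uses a naive keyword-overlap score purely to illustrate the idea; a real system would use an embedding model and a vector store.

```python
# Minimal RAG sketch: retrieve relevant documents, then build a grounded prompt.
# The keyword-overlap scoring is a stand-in for real embedding-based retrieval.
from typing import List

def retrieve(question: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k documents sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def build_rag_prompt(question: str, documents: List[str]) -> str:
    """Stuff the retrieved context into a prompt for the pre-trained model."""
    context = "\n---\n".join(retrieve(question, documents))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

docs = [
    "Employees may carry over up to five unused vacation days into the next year.",
    "Expense reports must be submitted within 30 days of purchase.",
]
print(build_rag_prompt("How many vacation days can I carry over?", docs))
```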

So RAG is great for use cases such as chatbots, Q&A, and so on. But what if you can't solve your use case with prompt engineering and RAG alone? Well, that's when you're gonna have to look at customizing the model somewhat.

And customizing a model could also be beneficial in one respect: the ability to customize a model may open the opportunity for you to use a smaller model that's fine-tuned on a specific task. And we've seen that hosting smaller models is more cost effective in production, right?

Hosting a 7 billion parameter model requires less GPU memory than hosting a 70 billion parameter model. So let's say you've now decided on a few models that you wanna look at. You wanna test them out and follow this heuristic we've been discussing. How does that look within the SageMaker SDK experience?

Well, deploying a pre-trained model via the SDK, just like deploying from the UI, only requires a few lines of code. All we need to create in the SageMaker SDK is the JumpStart model, and here we pass the specific model ID.

In our case, we want to test out the pre-trained Falcon 7B model. And all we do once we've created the JumpStart model object is call .deploy(). Then we can actually make predictions, passing the payload or prompt that we want to send to it.

And the output of this code is what is known as a SageMaker endpoint that will allow you to interact with the model. We'll take a look at what exactly a SageMaker endpoint is in the following slides.
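A minimal sketch of that flow with the SageMaker Python SDK is below. The JumpStart model ID, instance type, and payload format are assumptions; look up the exact model ID and request schema on the model's detail page in Studio.

```python
# Minimal sketch: deploying a JumpStart model and invoking the resulting endpoint.
# The model ID, instance type, and payload schema are placeholders/assumptions.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-falcon-7b-bf16")  # example JumpStart model ID

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",   # optional override of the JumpStart default
)

response = predictor.predict({
    "inputs": "Summarize this call transcript in two sentences: ...",
    "parameters": {"max_new_tokens": 128, "temperature": 0.2},
})
print(response)
```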

So now we've deployed a pre-trained model. We also want to maybe implement some sort of RAG system, right? We want to maybe create the chatbot or Q&A that we were discussing. So how do we do that?

So SageMaker endpoints actually do integrate with LangChain, and LangChain is essentially a framework that allows you to build GenAI applications. Here we need to create a SageMaker endpoint object within the LangChain SDK.

And we pass the specific endpoint name, the endpoint name of the model we would have deployed with the previous code snippet.
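A minimal sketch of that integration is below, assuming a TGI-style endpoint that accepts an "inputs" field and returns a list with a "generated_text" field. The endpoint name and region are placeholders, and the exact import path may differ across LangChain versions.

```python
# Minimal sketch: wrapping a deployed SageMaker endpoint as a LangChain LLM.
# Assumes a TGI-style request/response schema; endpoint name and region are placeholders.
import json
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Serialize the prompt into the JSON body the endpoint expects.
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output) -> str:
        # Parse the endpoint's JSON response and return the generated text.
        return json.loads(output.read().decode("utf-8"))[0]["generated_text"]

llm = SagemakerEndpoint(
    endpoint_name="my-falcon-endpoint",   # placeholder: the endpoint deployed earlier
    region_name="us-east-1",
    model_kwargs={"max_new_tokens": 128},
    content_handler=ContentHandler(),
)

print(llm("What is SageMaker JumpStart?"))
```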

And then, fine-tuning: maybe we actually need to customize the model, or maybe we want to try a smaller model and see if we can fine-tune it on a specific task. Again, fine-tuning with the SageMaker SDK is a few lines of code. Here we pass the model ID.

"It's the Vulcan Seven B model. We pass it some hyper parameters that JumpStart has exposed for us to play with - the specific instance type. And then we kick off the job with a simple .fit() function, passing our train and validation data which lies within S3.

Cool. So we've deployed our model, maybe we've implemented it into some sort of RAG solution, and maybe we've even fine-tuned some models. How do we analyze and evaluate these models? We can look at this in two ways - quantitatively and qualitatively.

Quantitatively, we can look at specific metrics of a model such as toxicity, accuracy, semantic robustness and so on. And there's a number of open source benchmarks out there where we can read, you know, specific metrics for a specific model when it relates to toxicity, for example. Or we can run some open source algorithms to calculate this. And we at SageMaker are actually working to make this easier.

And you'll see this in the coming days. But today, let's say you want to compare a pre-trained model and a fine-tuned model. By default, when you launch a fine-tuning job, SageMaker JumpStart models will output some default metrics during training into CloudWatch, such as loss and perplexity.

And perplexity, shown here on the screen, is essentially just a measure of how well a model is able to predict the next word in a sequence; a smaller value is better than a big value. So here we can see that on the pre-trained model, we calculated our PPL to be 8.147, and after fine-tuning we got it down to 1.437.
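For reference, perplexity is just the exponential of the average negative log-likelihood (cross-entropy) per token, which is why lower is better. A tiny sketch with made-up values:

```python
# Perplexity = exp(mean negative log-likelihood per token). Values below are made up.
import math

def perplexity(token_nlls):
    """Compute perplexity from per-token negative log-likelihoods."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.1, 1.9, 2.3, 2.0]))  # exp(2.075) ≈ 7.96
```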

So that's the quantitative side. Qualitatively, we can actually play with the models and see how they do. Just as an example, here we are asking a pre-trained model: what is Item 7 of the 10-K SEC filing about? The context of this question is that we expect the model to answer in terms of the Amazon 10-K SEC filing. And essentially, the 10-K SEC filing is just a report containing the business operations and conditions.

So if we look at the pre-trained model's response, it tried to answer but didn't do very well. After fine-tuning, we can see that it accurately provided the answer, and this is because the knowledge was embedded into the model itself.

And so with this idea of looking at it quantitatively and qualitatively, you can compare fine-tuned against pre-trained models, and even different pre-trained models against each other.

So let's say we've now come up with a specific model that we're happy to move to production with. What does it take to actually make inference against this model, integrate it with our application, make it available to our users, scale based on our request volume, and handle some security considerations which are important in production applications?

So the first thing when it comes to real-time inference on SageMaker is the concept of the SageMaker endpoint. A SageMaker endpoint is a managed, production-grade endpoint that allows you to host your model and interact with it in real time, so we can get real-time responses from it. It's accessible to any external application as it's exposed via an HTTPS endpoint. And it also integrates with Application Auto Scaling, so it can scale up and down to your needs.

And if we open up what that SageMaker endpoint looks like internally - this is all managed for you - it contains, for example, a managed load balancer that will load balance the requests to your instances. It will also run a specific container. In many cases, customers may be using their own container, or they may be using one of the prebuilt deep learning containers (DLCs) that Jeff mentioned.

And then there are the actual model weights themselves. Customers today have the flexibility to do whatever they like; SageMaker is a framework-agnostic platform. But what is JumpStart? JumpStart essentially encapsulates all of those things related to hosting the model - the container, for example, the hosting script, and the actual model server that's going to host the model.

And also even some of the instance types. You know, we have customers who say they don't really know what instance type a specific model should be hosted on - there are different quantizations, different GPU sizes, and so on. SageMaker JumpStart encapsulates all of that, so you don't have to worry about it.

So if we double click on that little purple box, which is the JumpStart model for SageMaker open models, it's made up of the ECR image that we were speaking about - that's the image that contains the model server. In the case of Hugging Face models, that will most likely be the Text Generation Inference container. And then the S3 model artifacts - this is maybe the inference script that contains the logic to host the model, and also the model weights themselves.

So this is for open models such as Falcon, Llama, and so on. Proprietary models on SageMaker are also made up of these core building blocks - the ECR image and the model weights, obviously. But you as an end user don't have access to see the actual underlying weights or the image; the model providers have provided these production-ready, out of the box.

So now we've seen how we can actually deploy a model to a SageMaker real-time endpoint and get inference from it. How do we actually integrate that with our application?

SageMaker endpoints allow for integration in a number of ways. First is API Gateway - let's say you want to control authorization and authentication for some reason, you can integrate your SageMaker endpoint with API Gateway, either with a Lambda function or directly with API Gateway. The Lambda function here could maybe do some light preprocessing, for example, and then ultimately make inference against the endpoint.

Or if you don't need that, API Gateway offers direct integration as well. Another way, if you don't necessarily need to control authorization and authentication, is to interact with the SageMaker endpoints via Boto3 or even the high-level SageMaker SDK. And then finally, if you don't want to use any SDK specifically, you can directly make POST requests.

So for example, if you want to just make a POST request with your inference payload, you can make that directly to the endpoint. The only consideration is that the request needs to be signed with SigV4.
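A minimal sketch of the Boto3 path is below; the endpoint name, region, and payload schema are placeholders for your own deployed model. The direct-POST path sends the same JSON body to the endpoint's HTTPS URL, with the request signed using SigV4.

```python
# Minimal sketch: invoking a SageMaker endpoint with Boto3.
# Endpoint name, region, and payload schema are placeholders for your deployed model.
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

response = runtime.invoke_endpoint(
    EndpointName="my-falcon-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Hello, world", "parameters": {"max_new_tokens": 64}}),
)

print(json.loads(response["Body"].read()))
```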

So now when it comes to scaling up and down based on your traffic needs, this is also a very important consideration when it comes to moving to production.

SageMaker endpoints, as I said before, integrate with Application Auto Scaling. Auto scaling will automatically distribute these instances across AZs on your behalf, fully managed, in order to improve and enforce high availability, and you can dynamically adjust the number of instances as you need.

There is no traffic interruption during scaling events, and the scale-in and scale-out options can be tuned to your custom traffic pattern. We support both predefined metrics, like the number of invocations hitting the endpoint at any point in time, and your own custom metrics - maybe you have a queue within your application and you want to scale the endpoint based on the number of requests sitting in that queue. You can do all of this with auto scaling on SageMaker endpoints, and setting it up is also very simple.

You can use the SDK, but it's also available in the console. You simply set your minimum and maximum instance counts and the specific target metric - in this case, I'm using SageMaker invocations per instance. I set the specific target value and off we go - auto scaling is set up.

We also have some optional parameters, like the scale-in cooldown and scale-out cooldown, if we want to set those.
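The same setup can be scripted with the Application Auto Scaling API. A minimal sketch, where the endpoint name, variant name, capacity limits, target value, and cooldowns are placeholders:

```python
# Minimal sketch: target-tracking auto scaling for a SageMaker endpoint variant.
# Endpoint/variant names, capacities, target value, and cooldowns are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-falcon-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # invocations per instance before scaling out
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,   # seconds
        "ScaleOutCooldown": 60,   # seconds
    },
)
```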

So finally, security. Carl mentioned some of the built-in functionality when it comes to security in SageMaker. By default, out of the box, SageMaker has preventative controls on both an infrastructure and a network isolation level.

So you can deploy your SageMaker endpoint into a VPC where you can control access to and from the model container. SageMaker endpoints integrate with AWS PrivateLink. So if you want to make inference to the endpoint without even having your request leave your own VPC, you can do that as well.

We also have the ability to provide a VPC config to the endpoint itself, and that's where we set specific subnets and security groups, so that if the container makes any network calls, all those calls are made within your VPC.

And then finally there's network isolation. For open models, this is a toggle where we can enable network isolation - that basically means locking down the container so the container itself cannot make any network calls.

So think about it - if you're sending a prompt to your model, you don't want that prompt to necessarily be sent to a model provider or anything like that. You can lock down that container so it's completely isolated.

So for open models, it's a toggle. By default, SageMaker does not share any prompt information, and you can customize, for example, the inference script as you like. Proprietary models, on the other hand, are always network isolated - it is not a toggle option. The containers running the proprietary models, like those from AI21 or Cohere, are completely network isolated: no network requests can be made from those containers.
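For open models, those controls map to a couple of arguments at deploy time. A minimal sketch, where the subnet IDs, security group IDs, and model ID are placeholders:

```python
# Minimal sketch: deploying a JumpStart open model into your VPC with network isolation.
# Subnet IDs, security group IDs, and the model ID are placeholders.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="huggingface-llm-falcon-7b-bf16",
    vpc_config={
        "Subnets": ["subnet-0123456789abcdef0"],
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
    enable_network_isolation=True,   # container cannot make outbound network calls
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```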

And so, if you wanna get started with SageMaker JumpStart, I highly recommend taking a look at the JumpStart documentation. We have a great example repository for you to get started with all these popular models, and take a look at the SageMaker JumpStart product detail page for the latest JumpStart offerings.

So thanks from myself, Jeff, and Carl. We hope you had a great first day at re:Invent. Please do remember to complete the session survey in the mobile app. Cheers.
