Scaling FM inference to hundreds of models with Amazon SageMaker

Right. Good day, good people. Wow, it's amazing to be here in person, such a great venue with you all. And I have to say thank you very much for joining us in today's session on scaling foundation model inference with SageMaker.

And now what if I tell you that this whole presentation and the demo that you're going to see in this session were completely created by generative AI? Shocking. Well, the reality is we hand crafted this whole presentation and demo, but I wish I could do that.

My name is Dhawal Patel. I lead the machine learning specialist solutions architecture team at AWS. We help customers like Salesforce, joining us today, scale machine learning workloads on AWS. And I am joined by my colleague Alan Tan, who is from the SageMaker inference product team. He is going to share new features that we are launching in SageMaker inference.

Today in this session, we are very honored to have the Vice President at Salesforce who is responsible for building their machine learning platform. He is going to share his story of how his team uses SageMaker to scale foundation model inference cost efficiently. His name is Bhavesh Doshi. We are very honored to have you here in this session.

So a quick question and feel free to raise your hand if I ask a question. So who all are building generative AI applications currently? Alright, almost. Alright. Cool.

Who is using open source foundation models for that? Quite a lot. Alright.

Um how many of you are using SageMaker for deploying the models? Alright, cool.

Alright. How many of you need to use more than one foundation model for inference for building applications? More than one? Cool.

Alright, we've got a good audience. You're gonna love it. Alright, let's get started.

So we are seeing a paradigm shift today as organizations strategically embed generative AI applications into the very fabric of daily operations, and the result is unlocking new possibilities and unleashing an unprecedented level of productivity increase.

And the core foundation for building generative AI applications is the foundation model. A foundation model is a large pre-trained transformer-based model that excels at either generic tasks or is fine-tuned towards doing a specific task.

And generally, there is more than one foundation model involved in building an end-to-end generative AI application for production use. Let me give an example: a generative AI chatbot application needs a variety of different foundation models, including toxicity detection, PII detection, summarization and Q&A models, workflow generation, search models, and embedding models. The list goes on and on and on.

On top of that, customers also want to improve the accuracy of these foundation models based on their customer data sets or domain data sets. They want to fine-tune these foundation models further and deploy them in production.

On top of that, you want to use multiple modalities of data, like image or text or code or video, and the list goes on. And you quickly end up having multiple foundation models, maybe from tens to even hundreds of foundation models, that you end up deploying at scale.

And you need a scalable, cost-efficient way to host these foundation models at scale. Additionally, these models are large: up to hundreds of billions of parameters and hundreds of gigabytes of memory needed to load them. They are based on the transformer architecture, and transformers are slow; they can get surprisingly slow even on expensive hardware.

Now, think about all the iterative experimentation that you have to go through in performance-tuning these foundation models in order to get the best response time and throughput at low cost.

And don't forget, you also have to put guardrails in place to isolate these models and control the blast radius, so that a problem with one model does not affect the other models and you prevent noisy neighbor problems.

And by the way, you might end up having hundreds of inference endpoints to host these foundation models, and that is going to increase the operational overhead for you.

So how about I take you on the journey of deploying a single foundation model into production, and let's see how you can scale to hundreds of foundation models cost efficiently using SageMaker. Let's get started.

So the first thing that you do in order to host a foundation model is calculate its size. In this case, we take the example of the Llama 2 13B model. The size of the model is primarily dependent on the number of parameters and the size of each parameter.

Llama 2 13B is a 13-billion-parameter model. So you have 13 billion parameters, and with each parameter taking four bytes, which is FP32, we get about 52 gigabytes of memory just to load the parameters of the model.

Now, in a large language foundation model, there is an autoregressive decoding process that takes the input tokens and processes them iteratively to output the next token. During this whole process, it ends up creating attention key-value tensors that it has to keep in memory, and that occupies an additional 4 to 10 gigabytes of memory.

In addition to that, frameworks like PyTorch or NVIDIA CUDA are going to occupy about another 3 to 4 gigabytes of memory. So overall, we are talking about 65 gigabytes of memory just to load one instance of Llama 2 13B.

Now, you can always try to optimize this by compressing the model using techniques like quantization, but that comes with a trade-off of accuracy loss. So for now, we are going to keep using four bytes per parameter, which means we still have 65 gigabytes to handle just for one instance.
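As a quick sanity check, here is that back-of-the-envelope arithmetic in Python. The KV-cache and framework overheads are the speaker's ballpark figures rather than exact numbers, so treat this purely as an illustrative sketch.

```python
# Rough memory estimate for serving an LLM, using the speaker's ballpark overheads.
def estimate_serving_memory_gb(params_billions: float,
                               bytes_per_param: int = 4,   # 4 = FP32, 2 = FP16/BF16, 1 = INT8
                               kv_cache_gb: float = 10.0,  # attention key/value tensors (~4-10 GB)
                               framework_gb: float = 4.0): # PyTorch / CUDA runtime (~3-4 GB)
    weights_gb = params_billions * bytes_per_param         # 1e9 params * bytes, divided by 1e9 bytes/GB
    return weights_gb + kv_cache_gb + framework_gb

print(estimate_serving_memory_gb(13))                      # ~66 GB, in line with the ~65 GB above
print(estimate_serving_memory_gb(13, bytes_per_param=2))   # ~40 GB if you quantize/cast to FP16
```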

Alright. So we have 65 gigabytes of memory to load, and we choose an instance. In this case, we choose an ml.p4d instance, which has eight NVIDIA A100 GPUs, each with 40 gigabytes of high-bandwidth memory. So obviously, we cannot fit the 65-gigabyte model into any single GPU's memory.

So we split that model using model partitioning, or model parallel, logic: we split the whole model into two shards and occupy two GPUs out of the eight.

Alright. Well, what about the other six GPUs? They are just sitting idle out there. So we still have 75% of the compute and memory free and idle. Can we make this more efficient? Let's see how.

So how about we load additional copies of the same model onto the remaining six GPUs, so that we saturate and maximize the utilization of the memory. We load these copies into the additional GPUs, we ultimately get four model copies, and then possibly we can drive the utilization of the GPUs up to 100%.

That depends on whether the model receives enough production traffic to drive the utilization of the compute up to 100%. If the model is not being invoked as much as we expect, it's going to sit idle and you're not going to get the best price performance. Alright, so let's keep that in mind.

So let's put this model into production and scale according to the traffic of the model. For that, SageMaker inference offers the broadest and deepest set of inference options for any use case. You can use a fully managed SageMaker real-time inference endpoint, invoke it using a RESTful API, and get a streaming response.

SageMaker also supports offline inference options for large data sets. And if your payload is large in size, like images or video, you can use a SageMaker asynchronous inference endpoint and scale down to zero depending on the traffic. You can host one model, or you can host an ensemble of models.

You can also orchestrate these models into a serial inference pipeline, put them together and orchestrate the workflow, and a SageMaker endpoint can host multiple models on the same endpoint as well.

On the hardware side, on the infrastructure side, you can use CPUs, AWS Inferentia, GPUs, or even go completely serverless. Alright.

So let's use a SageMaker real-time inference endpoint to host the Llama 2 13B model for our chatbot use case. Alright.

So you have Llama 2, and then you have to choose a container to host this model. In this case, we are going to use the Large Model Inference (LMI) container. Why use this container? Because we just launched a slew of new features in the Large Model Inference container in SageMaker, which will give you 20% lower latency on average.

With these new features, we introduce new optimizations like an optimized all-reduce algorithm. We also have integration with the TensorRT-LLM backend, which will give you optimized performance in terms of response time and throughput.

So do use the Large Model Inference container for hosting foundation models for inference. Alright.
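For reference, here is a minimal sketch of the kind of serving.properties file the DJL-Serving-based LMI container reads, written out from Python. The property names follow the public LMI documentation as I recall it, and the engine choice and S3 location are placeholders, so verify against the container docs before relying on this.

```python
# Write a minimal serving.properties for the LMI (DJL Serving) container.
# Keys and values here are illustrative; check the LMI docs for the exact options.
lmi_properties = [
    "engine=MPI",                                    # hypothetical choice; DeepSpeed/Python also exist
    "option.model_id=s3://my-bucket/llama-2-13b/",   # placeholder location of the model weights
    "option.tensor_parallel_degree=2",               # shard the 13B model across two GPUs, as above
    "option.rolling_batch=auto",                     # enable continuous (rolling) batching
]
with open("serving.properties", "w") as f:
    f.write("\n".join(lmi_properties) + "\n")
```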

So we have the Large Model Inference container, and we pick a machine learning instance. In this case, we use the p4d as I presented before, with its eight A100 GPUs. Alright.

Now we attach an auto scaling policy to scale horizontally based on the user traffic. You attach that to the SageMaker inference endpoint and scale horizontally to any number of instances, depending on the throughput, the hardware utilization, or the response time targets.

And then the user can invoke it using a RESTful API and get a streaming response. Alright, sounds good.
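As a rough sketch of that auto scaling setup, here is what registering a target-tracking policy on the endpoint's variant might look like with Application Auto Scaling. The endpoint and variant names and the target value are placeholders, not values from the talk.

```python
import boto3

aas = boto3.client("application-autoscaling")
endpoint, variant = "llama2-13b-endpoint", "AllTraffic"   # placeholder names

# Allow the endpoint to scale between 1 and 4 instances.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint}/variant/{variant}",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-track on invocations per instance (throughput); latency- or
# utilization-based policies are also possible via custom metrics.
aas.put_scaling_policy(
    PolicyName="llama2-invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint}/variant/{variant}",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,   # assumed target, tune to your workload
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```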

So you have a Llama 2 model up and running in production, scaling out for your chatbot. Cool. Alright.

Now you want to moderate the content of your chatbot application, so you now have a toxicity detection endpoint. Oh, by the way, you also want to protect the PII data of your customers, so you also have a PII detection endpoint. Cool.

By the way, you also have to make sure that your chatbot does not hallucinate, so you decide to add search models and embedding models. And then you also want to support additional use cases for developers, where a developer can use your chatbot for code generation.

So you end up having a code generation endpoint, and then, by the way, you need multilingual support, you need speech generation. Oh, and by the way, now you want to fine-tune these models based on your customer data.

And then you just keep on increasing the number of endpoints.

Look at the number of endpoints that you end up hosting and managing. This is high operational overhead. You can't continue like that; you can't scale like this. You need a better way to cost-efficiently serve these foundation models and build generative AI applications on SageMaker.

All right. So can we pack more models on the same SageMaker inference endpoint?

All right. Let's see. How about we take these foundation models and host them on a single SageMaker endpoint? That's a good option. But if these models do not fit on the machine learning accelerators of that instance, you will not be able to host all of the models at the same time. What do we do here?

Well, then you decide to use a SageMaker multi-model endpoint. A SageMaker multi-model endpoint is a fully managed option to co-host multiple foundation models spread across multiple ML instances in a fleet. It loads these models dynamically, based on which model is being invoked by the user, during the invocation. It also has smart routing logic to route the inference traffic for a specific model only to those instances that have already loaded it into memory, to avoid the cold start. And it also replicates based on the popularity of the models: if a model is receiving high traffic, it's going to spin up more replicas dynamically. How cool is that? Great.

However, the initial invocation requests coming to these models are going to experience cold start latency, because these models are large and will probably take seconds to minutes to load onto the high-bandwidth memory. This might not be a good solution for applications that are ultra latency sensitive.
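For contrast with the inference components introduced next, here is roughly what the multi-model endpoint setup described above looks like in code. Names, the role ARN, and the container image are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

# Point one container at an S3 prefix that holds many model.tar.gz artifacts.
sm.create_model(
    ModelName="chatbot-multi-model",                      # placeholder
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "<inference-container-image-uri>",
        "Mode": "MultiModel",
        "ModelDataUrl": "s3://my-bucket/models/",         # prefix containing the model archives
    },
)

# At invocation time you name the artifact; SageMaker loads it on demand,
# which is where the cold-start latency mentioned above comes from.
smr.invoke_endpoint(
    EndpointName="chatbot-mme-endpoint",                  # assumes an endpoint created from this model
    TargetModel="toxicity-detector.tar.gz",
    ContentType="application/json",
    Body=b'{"inputs": "some text to check"}',
)
```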

So what do we do here? Well, for hosting multiple foundation models at scale for inference in ultra-low-latency applications, what you need is to be able to pack multiple foundation models into a single endpoint, to avoid high operational overhead and save costs. You also want to make sure that you don't have to deal with cold start latency: you can load models into memory early, before the inference requests come, and pin them in memory. You also want fine-grained allocation of the machine learning accelerators based on each model's needs. And then you want guardrails with a fine-grained auto scaling policy based on the model, not just on the machine learning instance. You should be able to scale out based on the model, so that a model receiving high traffic can scale while a model receiving almost no traffic doesn't need to, and can even scale down to zero to save costs.

And on top of that, you need model-specific metrics for observability, like model-specific response times, model-specific throughput, and model-specific hardware utilization. To satisfy these requirements, I am happy to announce the launch of the new SageMaker inference components, along with a slew of other performance-enhancing features in SageMaker, in this session. With that, I will hand it over to Alan Tan to dive deep into these new features. Thank you very much. Ok, who is excited to learn about these new features? Ok, we got a lot of people, so great.

Thank you, Dhawal, for introducing the background and the challenges that our customers are seeing. First, we added the flexibility for customers to deploy one to hundreds of models efficiently by packing them onto a single endpoint. This flexibility means you can grow the endpoint as your use cases change. By deploying multiple models on a single endpoint, we've seen customers reduce their inference costs by 50% on average. Each of the models deployed on the endpoint can have its own auto scaling policy, so you can scale them independently. And we've published a new set of CloudWatch metrics and CloudWatch logs to provide you the data to set up very effective auto scaling policies as well.

And not only that, you can scale each of the models deployed on the endpoint all the way down to zero, so you can free up hardware for other models to be loaded. You can also dedicate specific hardware resources to different models: you can say model A gets two GPUs and model B gets one GPU on the instance, and SageMaker will use this information to efficiently pack your models on the instances for high utilization and availability.

We've also launched a new smart routing algorithm that routes requests to instances and copies of the model that are more readily available to serve that traffic. This leads to 20% lower latency on average. You can use these features with any SageMaker-compatible container. This could be a container that AWS publishes or one that you've built yourself; you just need to make sure it responds to two SageMaker APIs, one for health checks and one for responding to the inference request.

Let's take a look at how the SageMaker entities have evolved with this new release. First we have the endpoint. This is just an abstraction of the infrastructure and the models that you have set up; you can make HTTP requests to this endpoint to get inferences back. Behind the endpoint is a set of ML instances that SageMaker manages on your behalf, to do things like patching, health checks, and instance replacement in case there are any issues.

To enable these features, we added a new SageMaker object called an inference component. You can think of an inference component as an abstraction for a model that's deployed and ready to serve traffic. There are five main parts to an inference component. The first is the location of your container; this container has your model server and your inference logic, and it knows what to do with your model weights. Then we need the location of your model weights, in an S3 bucket that you own, of course. You can also specify how many CPU cores, GPUs, or Neuron devices (if you're using our Inferentia or Trainium instances) and how much CPU memory this model needs to run. And you can specify how many copies of this model you want to start with, to do things like the model pinning that Dhawal mentioned earlier.

An inference component is also a unit of scaling, so you can set up auto scaling policies for each inference component, and they can be very different from each other. You can have one or more inference components on an endpoint, and SageMaker will handle placement of those inference components on the instances.

I'll show a deep dive of what that looks like in the next couple of slides. Deploying an inference component is very easy, three simple steps. First, you pick the instance you want to use by creating an endpoint config. Second, you create the endpoint, which gives you access to that instance. Third, you create the inference components, which effectively deploys the models onto the endpoint so you can serve traffic for them.
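Here is a sketch of those three steps with boto3. The field names follow the SageMaker API as I recall it, and the names, role ARN, image URI, and artifact locations are placeholders, so double-check against the current documentation.

```python
import boto3

sm = boto3.client("sagemaker")
role = "arn:aws:iam::123456789012:role/SageMakerRole"     # placeholder

# 1) Endpoint config: choose the instance fleet (note: no model is attached here).
sm.create_endpoint_config(
    EndpointConfigName="ic-demo-config",
    ExecutionRoleArn=role,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "InstanceType": "ml.p4d.24xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 2) Endpoint: provisions the instances behind it.
sm.create_endpoint(EndpointName="ic-demo-endpoint",
                   EndpointConfigName="ic-demo-config")

# 3) Inference component: deploys a model onto the endpoint with its own resources.
sm.create_inference_component(
    InferenceComponentName="llama2-13b-ic",
    EndpointName="ic-demo-endpoint",
    VariantName="AllTraffic",
    Specification={
        "Container": {
            "Image": "<lmi-container-image-uri>",
            "ArtifactUrl": "s3://my-bucket/llama-2-13b/model.tar.gz",
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,      # e.g. two GPUs for the sharded 13B model
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},                       # pre-load ("pin") one copy before traffic
)
```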

Now, we've also released a full set of lifecycle management APIs for you to manage which models are deployed: you can add new models, remove existing models, and update them if you have a new version, for example; a rough sketch of those calls follows.
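Continuing the example above, this is roughly what those lifecycle calls look like; the exact required fields may differ, so treat this as an outline rather than a recipe.

```python
import boto3

sm = boto3.client("sagemaker")

# Add more copies of an already-deployed model (or fewer, down to zero).
sm.update_inference_component_runtime_config(
    InferenceComponentName="llama2-13b-ic",
    DesiredRuntimeConfig={"CopyCount": 2},
)

# Roll out a new model version by pointing the component at new artifacts.
sm.update_inference_component(
    InferenceComponentName="llama2-13b-ic",
    Specification={
        "Container": {
            "Image": "<lmi-container-image-uri>",
            "ArtifactUrl": "s3://my-bucket/llama-2-13b-v2/model.tar.gz",
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "MinMemoryRequiredInMb": 1024,
        },
    },
)

# Remove a model you no longer need without touching the endpoint itself.
sm.delete_inference_component(InferenceComponentName="llama2-13b-ic")
```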

Now let's take a deeper dive into what the auto scaling and placement logic looks like and how we maximize utilization to help you save costs. Let's say we have a comic book ML application: it generates the images for your comic book, and it also helps you write the story and the dialogue. For this application, we'll use two models. First, a Stable Diffusion model; this one uses one A10G GPU, the GPU on a G5 instance, and we're going to configure it to scale from one copy to four copies depending on the number of requests, the traffic, going to that model. We're going to deploy this on a single endpoint using an instance with four GPUs, for example a g5.12xlarge instance. And because we've set the minimum copy count to one, SageMaker has put one copy on one of the GPUs there.

Next, we also have a Llama 2 13B model. This one is quantized so it can fit on smaller GPUs, but it still needs two A10G GPUs; it is otherwise configured exactly the same. In this case, you can see SageMaker has placed the Llama 2 13B model on the same instance as the Stable Diffusion model.

I have also pulled up an auto scaling monitor so we can see what happens when the traffic changes and the scaling behavior changes. Let's say a lot of my customers using the app start writing their story and generating the narrative first, so we see an increase in traffic for the Llama 2 model. In this case, SageMaker sees that we only have one GPU free on this instance, so it will actually spin up a second instance to get access to more GPUs, and then SageMaker will place enough copies on that second instance to meet the traffic.

And then let's say we're done writing, and more and more people start generating the panels, graphics, and art for their comic book. Now we're seeing more traffic for Stable Diffusion, so SageMaker will spin up another copy of Stable Diffusion, and it will do that using the remaining free slot to maximize utilization.

So SageMaker will scale up instances when it needs to, to meet the inference components' scaling policy needs, and it will also try to fill up free slots whenever possible.

SageMaker will do the exact same thing even when you have a model that scales down and new hardware slots are freed up. For example, let's say people stop writing their stories and Llama 2 traffic comes down, so SageMaker will scale down the Llama 2 model. This frees up two GPU slots on an existing instance. And then let's say more and more people continue generating graphics for their comic book, and SageMaker scales up to take those remaining slots as well. So we'll reuse slots before creating new instances to help you optimize for cost.

Now there's actually another optimization we could do, and this is a sneak peek for something we're working on that is coming soon. Let's say it's midnight, maybe 3 a.m., and no one is generating any graphics for their comic book. Because there's no traffic, we can actually scale the Stable Diffusion model all the way down to zero. And now you can see we have two instances that are basically 50% utilized. SageMaker will detect this and combine them into one to optimize for cost, so instead of paying for two 50%-utilized instances, you're just paying for one.

By default, SageMaker routes requests randomly.

This has worked really well for classical ML models, where the model latency is short and you get relatively consistent latency between requests. However, with foundation models, the inference latency can vary from seconds to minutes depending on the input going into that request, and even between requests we see a very large variation.

So what happens with random routing is that, just by chance, you can get a short-running request stuck behind two longer-running requests. This leads to a longer end-to-end latency for your clients.

So we introduced a new routing algorithm called least outstanding requests. With this one, we route requests to the instances and copies of the model that are most ready to serve traffic. This leads to a much more even distribution of the requests: as you can see, the short-running requests are processed much faster, which gives you a much better end-to-end latency.
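If you create endpoints through the API rather than the console, the routing strategy is a property of the endpoint config's variant. A hedged sketch, with field names as I recall them and placeholder names:

```python
import boto3

sm = boto3.client("sagemaker")

# Opt into least-outstanding-requests routing when creating the endpoint config.
sm.create_endpoint_config(
    EndpointConfigName="lor-routing-config",              # placeholder
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",                          # placeholder
        "InstanceType": "ml.g5.12xlarge",
        "InitialInstanceCount": 1,
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }],
)
```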

Earlier this year, we also launched a new feature for streaming back responses, so you can create real-time applications that send responses back incrementally, such as tokens for chatbots. And if you're building, let's say, a video generation application, you can stream back the video frames in real time as well.

This feature is also compatible with the new launches that I just talked about, so you can stream back responses from each model that's on the endpoint.
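A small sketch of consuming that stream from the runtime API; the endpoint, component name, and payload shape are placeholders, and the exact response format depends on the container you use.

```python
import boto3

smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint_with_response_stream(
    EndpointName="ic-demo-endpoint",                      # placeholder
    InferenceComponentName="llama2-13b-ic",               # target a specific model on the endpoint
    ContentType="application/json",
    Body=b'{"inputs": "Tell me a story", "parameters": {"max_new_tokens": 256}}',
)

# Print tokens (or frames) as the payload parts arrive.
for event in response["Body"]:
    part = event.get("PayloadPart", {})
    if "Bytes" in part:
        print(part["Bytes"].decode("utf-8"), end="", flush=True)
```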

And we've launched new granular CloudWatch metrics and CloudWatch logs to help you debug and monitor more effectively.

For CloudWatch metrics, we've added new hardware utilization metrics that you can monitor at the per-model, or inference component, level. We've added reservation efficiency metrics, so you can see how much of the reserved hardware is being used. And we've also broken down the invocation metrics, so you can see how many requests are going to each of your models and how many of those are failing or succeeding.
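As an example of pulling those per-model numbers programmatically, here is a hedged sketch using CloudWatch. The metric and dimension names are my assumptions based on the console view described here, so verify them against the published SageMaker metric list.

```python
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch")

# Per-model invocation counts over the last hour (assumed metric/dimension names).
stats = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[{"Name": "InferenceComponentName", "Value": "llama2-13b-ic"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```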

Similarly, we've broken out the CloudWatch logs as well, so you can get logs specific to your inference components and models. In this case, you can see the Llama 2 inference component couldn't start because it couldn't find a library, so you can easily go and debug that and then just update the inference component afterwards.

All of the features I talked about are supported in our SageMaker Python SDK and our SageMaker Studio UI, which I will demo shortly, and are also available through our standard AWS SDKs and AWS CLI tooling as well.

And of course, to support your infrastructure as code, all of this capability is available in AWS CloudFormation as well.

So let's take a look at a demo of these new features through our interactive UI.

What you're seeing here is a screen of the SageMaker Studio UI. This is the new UI, so it may look a little bit different if you haven't seen it.

Let's say we have a new ML application that generates code using a CodeGen model, and also a chatbot that teaches you how to use that product and troubleshoot. We want to combine both of these models onto a single endpoint to help us save costs.

Because I've already deployed these endpoints and my application is working, we can take a look at the models that are powering this application. To do that, we go to the Models section here, which lists all the deployed models.

Oh, there we go, just a little late. So you can see we're going to the Models section, and at the top there are two tabs, Registered models and Deployable models. Deployable models are models that have previously been deployed, so if I click on it, I get a list of models that have been deployed. And because this is an existing application, I can just search for the models that I've used before.

In this case, I created this app a few days ago, so I can search for that date, and here you can see a list of models. For this demo, we'll use the first two models because they're just newer versions, so we'll multi-select the two models, the CodeGen model and the Llama 2 model.

I could create new models here if I wanted, but in this case I'll just deploy these, so I'll click Deploy. It has pre-generated an endpoint name, but I'll edit it so it's more specific and I can more easily find it later.

So I just type in a demo endpoint name here, and I also select a p4d instance for high performance; it's also the instance I'm already using today. I will leave the instance count as one and have SageMaker scale it up as needed.

Both of these models fit on a single GPU, so I will leave the number of GPUs at one. I will also leave the number of copies at one and have SageMaker scale it up as needed.

Both of these models are also GPU bound, so I'll just leave the CPU memory alone as well. If I wanted to add a new model I could, but in this case I'll just stick with these two and click Deploy.

Now we can see some successful notifications of the endpoint being created and the models being deployed. Let's skip forward in time, because this will take a few minutes.

Now we come back, and we can see that the endpoint is in service, and the two models that we deployed are also in service.

We can also test these models out directly from this UI to make sure they are at least functioning as expected.

Here we'll click on Test inference and scroll down a little bit just to see more of this section. There's a model dropdown here that lets us switch between the two models; we'll try the Llama 2 model first.

We'll paste in a sample input here and hit Send request at the bottom, which is a little bit cut off, unfortunately. We can see the request came back, and it looks good.

I'll try the same thing with the CodeGen model. The CodeGen model generates code, so I'm going to paste in a hello world function for it to generate and complete. We hit Send request again, and we can see a hello world function came back.

Both of these look good, so let's go ahead and set up an auto scaling policy for both of these models. We'll hit refresh just to get the models loaded.

We'll set up the Llama 2 model first, so we click on Edit auto scaling on the right here. It's asking me for two properties at the top, the scale-in and scale-out cooldowns. This is just to say how long we want to wait before triggering another scaling event, so I'll just leave these as they are; the defaults look reasonable.

The copy count range is how many copies of this model I want to scale between, so I'll just leave this as well for this demo. And the target metric is what metric I'm going to scale on; this is the number of requests going to each copy of this model. We'll put a value of one just so we can easily trigger it in this demo.

And we just need to select a role that has permissions to auto scale. We'll do a very similar thing for the CodeGen 2.5 model: we hit Edit auto scaling again and just leave all the default values.

We will change the invocations target to two just so we can differentiate it a bit, so you can see that they are actually different scaling policies.

You can see now we have two auto scaling policies. Behind the scenes, this is actually implemented using Application Auto Scaling, which also creates an alarm on a CloudWatch metric that you can use to monitor when the auto scaling happens, and also to monitor the number of requests going to those models.
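The programmatic equivalent of what the UI just did looks roughly like this with Application Auto Scaling; the scalable dimension and predefined metric names are written as I recall them from the inference component documentation, and the component name and limits are placeholders.

```python
import boto3

aas = boto3.client("application-autoscaling")
ic_name = "llama2-13b-ic"                                 # placeholder

# Scale this model's copies between 1 and 5.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"inference-component/{ic_name}",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=5,
)

# Target ~1 concurrent request per copy, mirroring the value used in the demo.
aas.put_scaling_policy(
    PolicyName=f"{ic_name}-copy-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=f"inference-component/{ic_name}",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)
```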

In the background here, I'm also going to send some fake requests, and we can monitor the CloudWatch metrics using the AWS console.

We'll switch to the AWS console very quickly, and here we'll click CloudWatch. On the left here, we'll click All alarms just to see all the alarms we have. You can see we've actually created two alarms per inference component: one for scaling up and one for scaling down.

Let's fast forward to a time when these alarms are triggered and it's actually going to scale up. We'll take a look at the Llama 2 model in this case, just to see what that looks like.

You can see now these models are in alarm. We're going to look at the Llama 2 model: because it's in alarm, it actually means that the auto scaling policy has kicked in and we're scaling up.

You can see this is because, in the last five minutes or so, we've had more than one request per copy going in, and this CloudWatch alarm is saying, hey, it's in alarm.

So we can switch back to SageMaker Studio, go to the Models tab, and confirm that it has scaled up. You can see the Llama 2 model is now at five copies; it went from one to five based on the traffic we were sending it.

Let's take a look at what scaling down looks like as well. We can go to the second alarm here, which is for scaling down, and similarly we can click on it to get more details. It's going to scale down when the metric is less than one, because that's what we set.

You can see that all the traffic has already petered out, so let's hit refresh and fast forward to a time a few hours later when it definitely should have scaled down.

We can see now this alarm is also in alarm, which means the model should have scaled down. We can confirm that again by going into SageMaker Studio, where we can see the number of copies is now back to one.

So we saw it go from one copy to five copies and back down to one. With this, I conclude my demo, and I'll hand it over to Bhavesh to talk about how Salesforce is using SageMaker to effectively host foundation models with high performance and low cost.

Thank you, Alan, so much for the introduction.

Hi, I'm Bhavesh. I'd like to share the Salesforce Einstein One platform and the innovations we've made with the AWS team in generative AI, specifically for scaling up inference for foundation models. But first, I'd like to thank you all for attending this session.

I'd also like to thank the AWS SageMaker team for partnering with us in this journey of providing a trusted AI platform for our customers. And I'd like to thank my teams at Salesforce in the Bay Area, Seattle, Israel, and India.

It is their hard work that I am privileged to showcase here.

So with that, let's get into it.

I'm assuming all of you are familiar with Salesforce or are Salesforce customers. Yeah, great.

Um so Salesforce's mission since 1999 has been to help our customers connect with their customers in a whole new way. And that's the premise of the customer relationship management software suites. And as you all know, Salesforce is the number one CRM by market share by far.

So for us, it all starts with Customer 360. Salesforce helps deliver a single source of truth with a 360-degree view of all the customer's interactions. This is the big power that Salesforce brings into play.

So as companies turn to AI to improve their productivity or develop or deliver some awesome customer experiences, they face a key challenge. Most companies can't leverage their data for such experiences because the data is spread across disconnected islands. And as we all know that when data isn't connected correctly, no amount of AI or automation or personalization will work.

So that's where we created the Einstein One platform. It is one platform that connects all of your customer data across all of your customer experiences and enables building AI applications and experiences.

Let's dive deep a little bit more into the platform.

So we built this Einstein One platform to deliver an integrated, intelligent, automated, easy-to-customize, and open platform for trusted AI. All of our core CRM applications like Sales, Service, Marketing, and Commerce, as well as Data Cloud, are integrated and native to this platform. And with a new metadata framework, instead of having islands of disconnected data, data is accessible from one platform, even data outside of our CRM.

Now, all of your applications will speak the same language and it gives you a comprehensive single source of truth for each of your customers. It brings native AI with Einstein built not just for predictive but also for generative use cases. It's deeply integrated with Data Cloud. It gives you a comprehensive data lake with real time data access. Uh very useful for the generative AI use cases.

You can access it from your favorite applications like Slack, Tableau, and Heroku, or from any other popular applications or productivity suites. It's easy to customize with the low-code and no-code options, so that you get the benefit of the full platform and can build your own applications that are similarly integrated, automated, and intelligent, thanks to our open ecosystem, or bring in any integration you want with MuleSoft.

So that's the Einstein One platform. With the platform, we are also bringing generative AI use cases to every cloud in the flow of work. This addresses some key operational pain points and opportunities.

For example, Sales GPT automates prospecting emails so that sellers can spend more time with their customers. Service GPT automates service responses and summarizations of cases and knowledge articles. Marketing GPT focuses on segment generation and campaign creation. Commerce GPT focuses on dynamic product descriptions, or a concierge service for personalized product discovery.

Similarly, Developer GPT uses natural language to create code, with chat-based assistance and auto-completion for coding. All of these Einstein GPT applications use generative AI in every cloud, in the flow of work. So we are bringing AI where it drives the most impact to your business, focused on unlocking value.

So let's look at one example of this. This is one of the service cases, and on the right side of the screen you see the Copilot experience, where you can ask something in natural language, in this case to summarize the case, and it gives you a complete summary of the case that helps you figure out what the next action should be. Ok?

So now let's look at the trust layer and the architecture of the Einstein One platform.

It's delivered on our Hyperforce public cloud infrastructure, which enables us to securely deploy and scale all of the data and AI services while meeting compliance requirements. We have an open model ecosystem: we provide secure access to foundation models hosted outside of Salesforce via APIs, or you can bring in your own foundation models so that you can reuse your generative AI investments and run those models in your infrastructure.

We also provide the ability to host foundation models for you. If you want a private model, you can fine-tune some of these foundation models and run them on a per-tenant basis in Hyperforce instead of in your own VPC. This is where we have partnered a lot with the SageMaker team on making sure we can scale up inference for hundreds of foundation models.

And because trust is the number one priority for Salesforce, we've created the Einstein Trust Layer, which provides you with security guardrails that enable you to ground these AI models using data in your own CRM, but do so with controls like zero data retention that protect data security and privacy. We also use data masking so that these models will never see private PII data, and we ensure better outputs through toxicity detection and audit trails.

So AI at Salesforce is customizable through low-code builders such as our new Einstein Copilot Studio, which is the configuration layer for the new Einstein Copilot that I just demoed. It also enables you not just to ask questions and get answers, but to take actions, saving time.

So what have we been working on with the SageMaker team?

Our relationship with SageMaker goes beyond the generative AI use cases. We've been partnering with them for many predictive and generative use cases, and also for multi-tenant inference and training.

We've worked with SageMaker on making the data in Data Cloud available in SageMaker with zero data copy.

We've been working on optimizing model inference in the Large Model Inference container, with innovations such as FasterTransformer, dynamic batching, and many other interesting things.

These are some of the foundation models that we have developed at Salesforce or that we are using via SageMaker JumpStart right now. For this example, I'm just going to focus on two of them.

The CodeGen 16 billion model, which requires about 75 GB of GPU memory and currently has high incoming traffic, and the TextGen 13 billion model, a relatively newer model that requires 55 GB of GPU memory; we're currently seeing low traffic on it, but we expect that to scale up.

So what does our deployment strategy look like today?

We use SageMaker single-model endpoints. Both of these models require two GPUs. So I have deployed two copies of my CodeGen model, since it's getting more traffic, and one copy of the TextGen model. Both of these are deployed on p4d, and since it's a single-model endpoint, I have to have two p4d instances. Is this kind of deployment familiar to you guys? Have you been using this?

Yeah. So as you see here, we have to use two p4d instances, super expensive ones, which are significantly underutilized right now. Think about scaling this up, not only for more traffic but for more models: it's an operational nightmare, it's very hard to do, and you are still paying a lot because you are underutilizing the resources for these models.

We looked at the multi-model endpoints that Dhawal talked about, but those are good for smaller models which are homogeneous. For these larger foundation models, since you have to load them into memory, SageMaker inference component endpoints were the right solution for us.

So how did we do that? On the same instance, with a SageMaker inference component endpoint, we were able to deploy both models. In pink, you see copy one and copy two of model A, CodeGen, and in blue you see the same for the TextGen model.

Now this completely, or at least much better, utilizes the p4d instances that we have, and it does this without any performance penalty. We did side-by-side comparisons of how our traffic was going through the single-model endpoints and now through the inference component endpoint, and with the smart routing functionality we didn't see any performance penalty.

So as our models scale up, how does this look? Initially, with the single-model endpoints, as we scale up we would need to add more instances. If my traffic scaled up, I would need to spin up another instance for the CodeGen model and another instance for the TextGen model, so now I need four of them.

But if I look at it from the inference component endpoint side, as my traffic scales up I'm able to better utilize the instances, and I would need only three p4d instances to get the same throughput, the same requests going through. Again, you see better utilization of the resources and fewer instances, hence less cost to all of us. Ok.

So what are some of our learnings that we saw?

Scaling using inference component endpoints requires only a single endpoint to manage, which is super useful, especially if you scale up to hundreds of models in practice.

It gives much more optimal utilization of the hardware resources.

The key thing was you can independently scale each model.

And the auto scaling functionality was very useful for managing these resources. Scale to zero, which Alan mentioned, was another useful functionality, especially in the non-production environments, in our lower test and performance environments, because what we could do is scale up for a period of time to run our tests and then bring it back down as soon as they're done.

This is not the case with single-model endpoints or any of the other solutions. And there are cost savings, as I showed you, as we scale up.

Uh and performance was also at parity.

So what are some of our key takeaways?

The first step you would want to take is to identify the foundation models which use the same hardware and host them on a single endpoint. Use inference components; they take away a lot of the heavy lifting.

Um it reduces your compute footprint without any performance impact.

The auto scaling functionality worked really well for us; we were able to scale the models' usage up and down based on our incoming traffic.

Uh the smart routing functionality that Alan mentioned was very useful for getting performance parity.

And for us, for the CodeGen models especially, the time to first byte was important, and that's true for most use cases, so use response streaming. And the last one: the Large Model Inference container was very useful compared to some other containers out there for doing inference for foundation models.

Those are some of our key takeaways.

So with that, I'll hand it back to Dhawal. Thank you very much, Bhavesh, for the awesome story.

Thank you very much. I'm going to just dive deep into some of the things we introduced recently. Two days ago, we launched new features in the Large Model Inference container that I want to cover quickly.

In terms of hosting foundation models on SageMaker, we have the Large Model Inference container, which already supported DeepSpeed. It also supports other model parallel approaches like Hugging Face Accelerate, and it now has new support for TensorRT-LLM.

We also introduced additional optimization techniques like an optimized all-reduce, which enables optimal communication between the NVIDIA GPUs internally for model parallel distributed inference. So using all these features and functionalities, you can now get 20% lower latency on average compared to the previous versions of the Large Model Inference container.

So do use the Large Model Inference container on SageMaker for any type of foundation model, be it a text large language model, a Stable Diffusion type of model, or any kind of text-to-image or image-to-text model.

We also have support for the Stable Diffusion backend, and we have other optimized engines like FasterTransformer that you can use to host FLAN-T5 or T5 family foundation models as well. You can use this single inference container to host this variety of foundation models at scale.

You don't have to keep using different containers for different types of models or for encoder versus decoder architectures. This is a unified, single Large Model Inference container that you can standardize on and leverage for cost efficiency and high performance.

With that, here are some references that you can use to boost inference performance for foundation models, covering some of the new features that we launched two days ago in the Large Model Inference container and the new SageMaker inference components feature. You can find more information here.

You can scan the QR code, and it will redirect you to the resources directly.

And with that, I will be extremely happy to get your feedback on the session. So do go to your app, select AIM327, and please provide us feedback.

Thank you very much. Thank you, Alan, and thank you all for attending this session. We'll hang out here; if you have any questions, we will be happy to answer them. Thank you very much.
