Deploy gen AI apps efficiently at scale with serverless containers

Merla Grande: Hello, everyone. Welcome to K304. My name is Merla Grande. I'm a senior leader with the Worldwide Specialist Organization at Amazon Web Services, and joining me is Vav.

Vav: Hey, everyone. Glad to meet you and thanks for joining us today. My name is Vav. I'm a senior product manager with Amazon ECS. I've been with ECS for over three years now and have spent my time looking at all aspects of the deployment and application life cycle on ECS. I'm super excited to share what we've seen over the past year as customers start to build and deploy generative AI applications on ECS.

Merla Grande: Thank you. Our talk today will mainly focus on important themes for developing and deploying generative AI applications at scale using AWS serverless containers. Hopefully, a takeaway from this session is that you'll move forward and pace through your gen AI journey more confidently, with enough resources to do so.

Generative AI represents a paradigm shift in artificial intelligence, empowering computers to go beyond computation and actively contribute to the process of creation. At its core, generative AI is about machines that can imagine, innovate, and produce entirely new content. By incorporating sophisticated algorithms and deep learning, generative AI systems can learn from vast data sets to understand structures and identify patterns.

Generative AI unlocks new possibilities and creates significant business value in four different areas. First, creating new customer experiences: generative AI analyzes customer data and personalizes products, services, and marketing messages. This enhances the overall experience and improves customer satisfaction.

Generative AI can enable builders by bypassing time-consuming coding tasks. For example, by using Amazon CodeWhisperer, builders can complete their coding tasks 57% faster, thereby boosting employee productivity.

Next is creating content. We recently launched Amazon PartyRock to create applications, and you don't even need to know how to write a single line of code to do so.

Generative AI has the ability to analyze extensive data sets and create insights, so you can make predictions faster and take informed decisions. When it comes to generative AI transforming various industries, let's look at some of them and their use cases.

To start with, the financial services industry. Algorithmic trading is one of the commonly used strategies in financial services to speed up trading and make it more efficient. It's typically a complex process that combines domain expertise on market factors and financial segments with the technical knowledge to build the automation and incorporate the logic. With gen AI, you can analyze market data to automate trading strategies and predict market sentiment quickly.

Its ability to generate alerts for potentially fraudulent activities is also one of the predominant use cases we've seen, helping customers in the financial services industry handle risk management proactively.

Now, within healthcare, generative AI has been transformative in terms of medical imaging diagnosis, as well as drug discovery.

Now, in the automotive and manufacturing industries, generative AI models have made it possible to optimize supply chain logistics, including inventory management and warehouse operations, as well as detect defects to ensure product quality.

Now, in the educational sector, with generative AI, our customers have been able to create digital learning platforms with intelligent tutoring systems that can provide adaptive assessment on the fly based on each individual student's pace and progress.

Now, with all of this, one would wonder: what does a gen AI tech stack look like, and what does it entail? A tech stack for a gen AI application typically has three macro layers. These layers work with and talk to each other, enabling the pipeline of development and deployment of the applications.

To start with, we have the data layer. In this layer, data is collected from multiple sources and stored in data warehouses, then preprocessed and cleaned to ensure its quality, with noise removed through normalization techniques. Then you annotate and label the data, making it ready for training purposes.

The second layer is the modeling layer. In this layer, you choose the specific machine learning or deep learning algorithm depending on the nature of the problem being solved. The selected model is then trained against the preprocessed data. During training, the model learns to make predictions; the model is then validated against new data, and you iterate from there.

In the deployment and application layer, the model is deployed to a production environment and integrated into the application to perform specific tasks such as making predictions or generating content.

Now, coming to the roles in the generative AI ecosystem, there are three main roles that create a pipeline going from development and tuning to the deployment and utilization of the generative models through APIs. Let's take a look at each of the roles and their skill sets.

To start with, we have the model provider. The model provider is typically responsible for creating or offering the generative models used in the system. This role involves designing, training, and potentially fine-tuning the generative models. Model providers can be individuals or organizations that build foundation models from scratch.

This role involves a diverse skill set, including machine learning, deep learning, programming, and domain-specific knowledge of data science for preprocessing and augmenting the data.

The second is the tuner. The tuner role involves optimizing and fine-tuning the generative models to enhance their performance. Tuning may include adjusting hyperparameters, refining training strategies, or adapting models to specific use cases.

Tuners work closely with the initial models provided by the model provider and iterate on top of them to improve their capabilities.

The typical skill set required for tuners is a deep understanding of machine learning algorithms, along with an understanding of hyperparameter tuning, training strategies, and prompt engineering.

The last role is the consumer. This is where consumers take the models and integrate them into applications, typically via APIs. The skill set required for consumers is to be able to work with APIs, SDKs, and frameworks such as TensorFlow and PyTorch.

Keep in mind that in an open-source community, individuals can perform multiple roles. For example, researchers can be providers as well as tuners, and similarly, developers can be consumers as well as tuners.

Building a foundation model involves several considerations. To start with, training foundation models often requires significant computational resources. Access to these resources can be a barrier if it's not planned properly.

Second, building foundation models can be a time-consuming process, and the training time depends on various factors such as model size and complexity, and data size and diversity.

Third, hiring the right talent is always a challenge, especially when you need a blend of domain expertise in AI, ML, and deep learning with technical skills in algorithms and building automation using various frameworks.

Lastly, optimizing the use of resources, both in terms of computational power and memory, is crucial. Developing models that are efficient in terms of energy consumption and computational cost requires ongoing effort, and it takes time.

At AWS, customers have asked us how they can quickly take advantage of what is out there today, use foundation models and generative AI, and create generative AI applications within their businesses and organizations to drive new levels of productivity and transform their offerings.

We identified four key considerations for you to be able to quickly build and deploy generative AI applications at scale. We will be double clicking on each of these key aspects for the rest of the talk today.

The first is having a thorough understanding of foundation models. Foundation models have been playing a significant role in advancing the state of the art in natural language processing and machine learning. A foundation model refers to a large pre-trained artificial intelligence model that serves as the starting point for a wide range of downstream tasks.

A few commonly known foundation models are BERT, GPT, and T5. These models are typically trained on extensive and diverse unstructured data to learn general patterns and representations across various domains.

Foundation models are often massive, with parameters on the order of hundreds of billions, and typically require large-scale computational infrastructure to run their training.

When it comes to the types of foundation models, there are three. The first is the text-to-text large language model. Many organizations have had a chance to experiment with generative large language models that can do things like summarize text, respond to questions, extract information, and create content. These models take the user's input text as a prompt and extend it with newly generated text.

Second, there is the text-to-embeddings LLM, which can compare pieces of text, like what your user types into a search bar, against indexed data and connect the dots between the two. For example, Amazon OpenSearch can use this type of model to compare a user's query against a catalog of data and present the user with more accurate and relevant results.

Also emerging are multimodal foundation models that can understand text as well as images. An example is the Stable Diffusion foundation model, which can generate images based on a user's natural language prompts.
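To make the text-to-embeddings pattern a little more concrete, here is a minimal sketch using boto3 and an Amazon Titan embeddings model on Bedrock. The model ID, region, and catalog entries are assumptions for illustration; a real search system would pre-compute embeddings and store them in an index rather than a Python list.

```python
import json
import math

import boto3

# Assumption: Bedrock is enabled in this account/region and the Titan
# embeddings model ID below is available to you.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed(text: str) -> list[float]:
    """Convert a piece of text into an embedding vector with Amazon Titan."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical catalog entries; in practice these would live in a vector index.
catalog = ["running shoes for trail hiking", "stainless steel water bottle"]
catalog_embeddings = [embed(item) for item in catalog]

query_embedding = embed("footwear for hiking on rough terrain")
for item, item_embedding in zip(catalog, catalog_embeddings):
    print(item, cosine_similarity(query_embedding, item_embedding))
```

The higher-scoring catalog entry is the more relevant result, which is the "connect the dots" behavior described above.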

Now let's take a look at how a foundation model is built. Normally, you start with your machine learning process because you have a business problem and you've identified that machine learning is the way to go. Then a data scientist collects data from multiple sources, integrates it, cleans it, and analyzes it.

Next, the data scientist starts the process of feature engineering and training, making sure the models are tuned and their performance is evaluated. Based on the results, the data scientist might go back to collect more data or perform additional data cleaning steps.

Assuming the models are fundamentally performing well, they would go ahead and deploy the model so it can be used to generate predictions.

The next step, an important one, is to monitor the model in production. Keep in mind that a machine learning model starts to go out of date as soon as you've trained it; the world is constantly changing and evolving, so the older your model gets, the worse it gets at making predictions.

So monitoring the quality of your model and making sure it's retrained with newly collected data, to adapt to changing requirements or changing patterns, is important.

The final and very crucial step is to build your model responsibly; making sure responsible AI is incorporated across every step of the workflow is going to be super important for your business.

Let's take a look at how AWS has been building our generative AI offerings responsibly. With Amazon CodeWhisperer, an AI coding companion, you can build applications faster and more securely. It has built-in security scanning for finding and suggesting remediations for hard-to-detect vulnerabilities.

Similarly, we have the Amazon Titan foundation models, which consist of two new LLMs for text summarization, text generation, classification, and information extraction.

They are also built to detect and remove harmful content in the data that customers provide for customization, reject inappropriate content in the user input, and filter the model's outputs that contain inappropriate content.

The second key strategy, or consideration, to accelerate your gen AI journey at scale is to use pre-trained models wherever applicable. A pre-trained model is a machine learning model that has been trained on a large data set for a specific task and is saved or distributed for reuse on new, related tasks.

Some pre-trained models have gained widespread popularity due to their effectiveness across various tasks. The models you're looking at on the screen right now are a broad selection of open-source models that offer many capabilities, including natural language processing, image generation, and text tasks such as summarization, sentiment analysis, and information extraction.

With SageMaker JumpStart, ML builders can choose from a broad selection of open-source foundation models. JumpStart gives ML practitioners who want deep model customization and evaluation capabilities access to the foundation models through environments they already use, like SageMaker Studio, the SageMaker SDK, and the Amazon SageMaker console. New models are added on a weekly basis.
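As a rough illustration of how little code this can take, here's a hedged sketch using the JumpStart classes in the SageMaker Python SDK. The model ID and instance type are just examples from the catalog, and you'd need an appropriate SageMaker execution role and instance quota in your account.

```python
# Sketch: deploy a JumpStart foundation model to a real-time endpoint.
# Assumes the SageMaker Python SDK (v2) is installed and your role can create
# SageMaker endpoints; the model ID and instance type are illustrative.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

# Invoke the endpoint with a simple text-generation payload.
response = predictor.predict(
    {"inputs": "Summarize: serverless containers let you deploy without managing servers."}
)
print(response)

# Clean up the endpoint when you're done experimenting.
predictor.delete_endpoint()
```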

Next, in model development there's also an approach called fine-tuning, which takes a pre-existing model and tunes it further for domain-specific tasks. This involves taking a pre-trained model and adapting it to a specific downstream task or domain.

Fine-tuning involves further training the pre-trained model on a specific data set, which is typically smaller than the original data set and focused on a specific domain or task. Conventional fine-tuning, or full fine-tuning, involves updating all the model weights based on the new data set. While full fine-tuning of all layers provides optimal results, it is resource intensive and can be prone to forgetting original information.

To address that, we have parameter-efficient fine-tuning, or PEFT. A flavor of that is LoRA, low-rank adaptation, which trains a pair of small low-rank matrices per layer instead of all the model weights, delivering comparable results with much better efficiency.
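To make LoRA a little more tangible, here is a minimal sketch using the Hugging Face transformers and peft libraries. The base model name, target modules, and hyperparameters are illustrative assumptions; you would still plug this into your own training loop or Trainer with your domain data.

```python
# Sketch: wrap a pre-trained causal LM with LoRA adapters so that only the
# small low-rank matrices are trained. Assumes `transformers` and `peft` are
# installed; the base model and hyperparameters are examples only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "gpt2"  # stand-in for whichever base model you selected
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layers in GPT-2
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
# Only a small fraction of parameters is now trainable.
model.print_trainable_parameters()
# From here you would fine-tune on your domain data, e.g. via transformers.Trainer.
```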

Now, one would wonder: when would I choose pre-training versus fine-tuning? These two approaches are really two stages in the life cycle of using a pre-trained model for a specific task.

Pre-training involves training on a large and diverse data set for a general task, while fine-tuning involves further training on a smaller, task-specific data set.

Pre-training can be computationally expensive and time consuming due to the large data set, whereas fine-tuning typically requires less time and fewer resources, since the model starts with pre-trained knowledge.

In real-world scenarios, the combination of pre-training and fine-tuning is a powerful approach, especially in deep learning, where pre-trained models are often used as a starting point for tasks with smaller data sets. This strategy captures general knowledge from large data sets while adapting the model's knowledge to a specific task.

The other evaluation criterion you'd want to consider is when to build a foundation model versus using a pre-trained model. Choosing between a pre-trained model and building a foundation model depends on various factors, including the specific requirements of your task, the available resources, and the nature of your data.

If your task has specific requirements that differ significantly from the tasks pre-trained models were built for, then building a foundation model from scratch may be more suitable. Additionally, if you have a large task-relevant data set, training a model from scratch allows it to learn features directly related to your specific task, potentially leading to better performance.

Building a foundation model provides full control over the model architecture, which is beneficial if you need to design a model with specific architectural considerations tailored to your problem statement.

Now, to sum it up, here are a few best practices for building generative AI responsibly. These best practices will continue to evolve, but here are some key points we'd like to share that we've learned so far.

First, focus on your use case. How you define your specific use case directly shapes how you develop training algorithms that enforce that definition of the problem statement.

Second, prioritize education and diversity in your workforce. For example, for human-annotated training data, promoting diversity in the actual annotations is key.

Third, make sure you assess the risk, because each use case has its own unique set of risks. For this, test, test, and test. Each use case will have its own performance evaluation, and oftentimes it takes as much time to come up with a testing system as it does to create the model.

Next, continuously iterate across the AI life cycle, and make sure you are introducing governance policies with accountability and measurement at each step of the workflow.

The third consideration is that utilizing cloud services is highly beneficial for generative AI applications, given the specific demands of gen AI development, training, and deployment: the computational power for training complex generative models, scalability so your compute resources scale on demand, storing huge data sets with efficient retrieval, and many other benefits like security, monitoring, and cost efficiency.

It gives you a good playground for experimentation, and leveraging cloud services for gen AI applications provides access to a powerful ecosystem of resources.

Here is how our generative AI offerings come together. At the first layer we have Amazon Bedrock, the easiest way to build and scale generative AI applications with foundation models from providers like Anthropic and Stability AI, and even the Amazon Titan models. Amazon EC2 instances powered by AWS Inferentia and AWS Trainium give you the best price-performance infrastructure for training and inference in the cloud, and AWS Trainium and AWS Inferentia combined with EC2 UltraClusters and high-speed networking give you the bandwidth and computational power required to build and train your generative models.

We have services with built-in generative AI, such as Amazon CodeWhisperer, an AI coding companion that helps you build applications faster and more securely, and it's free for individual use.

Next, with the deepest and broadest set of offerings, AWS provides a platform that combines speed and power to give you the agility to move quickly and accelerate your time to market when building generative AI applications.

Speed comes in the form of serverless computing, which allows you to execute code without having to manage the underlying infrastructure. Power comes from Amazon's generative AI offerings: with SageMaker JumpStart, ML practitioners can choose from a broad selection of open-source foundation models, with new models added on a weekly basis.

Now let's double-click a little on the serverless operating model. The serverless operating model enhances the development and deployment of generative AI applications. It provides a flexible, scalable, and cost-efficient environment, and it aligns well with the event-driven, modular, and scalable nature of generative AI tasks, allowing developers to focus on creating generative AI applications without the burden of managing infrastructure.

Serverless applications are event driven, responding to events such as data uploads, user requests, or scheduled tasks. This aligns well with generative AI applications, where events like data availability or user interactions can trigger training or inference tasks.

The abstraction that serverless provides is beneficial for builders of generative AI applications, allowing developers to focus on model development and application logic.

We will spend a bunch of time talking about how you would actually go about developing and building an AI-powered application and taking it into production.

A good place to start is by actually asking yourself: do you really need to use generative AI for this application? Don't get me wrong, generative AI is a game-changing technology, and we see customers start to use it in all sorts of creative ways.

But given the level of hype around generative AI at this point in time, it is easy to fall into the trap of force-fitting your problem onto generative AI, even if using AI wouldn't materially help customers for that particular use case.

A good way to evaluate whether your application aligns with use cases supported by generative AI is to define the use case you're looking to solve and see if it really needs any of these capabilities.

If your application needs text-to-text capabilities, text-to-image, text-to-audio, or a mix of any of those, it makes sense to go ahead and use generative AI. If not, it's probably a good idea to look elsewhere for building your application.

Now, once you know that you want to build a gen AI powered application and you have validated the use case, the next thing you'll probably want to do is figure out what model to use for your application.

Even before you pick the model, a good question to answer is: how do I actually measure success? What is a successful model versus an unsuccessful one? Because there are so many foundation models available right now from so many different vendors.

Here is a good heuristic way to reason about this. A good first question to ask is: does labeled data even exist for the use case I'm trying to evaluate? If it does, does the use case produce discrete outputs and outcomes? If so, great. You can have a high-precision accuracy metric that you can measure to evaluate different models.

Unfortunately, not a lot of generative AI use cases have high-precision metrics you can use. An alternative would be similarity metrics: you compare the response the generative AI model gives against what an ideal response would be, and use that to see whether the model is performing well or not.

Now, if labeled data simply does not exist, the question to ask is: do you even want to try to automate the process, or just evaluate it manually? If you realize there's no way to automate it, you're probably best off eyeballing it, comparing different models and having a human evaluate all the options.

A possible tool to look at here is SageMaker Ground Truth. If, on the other hand, you do want to automate the validation process, an alternative could be to compare the model output to that of an LLM or model that you trust.

This might be a little overwhelming at first sight, so let's go through an example, which might make things a little more tangible for you.

First, define the set of prompts that you want to test your gen AI model against. We've done that here, and we know what type of output we want to expect. For some of these, we know the output we expect and have predefined it. For a lot of the others, like the coding task, you might not have the output already defined.

Next, you pick a few popular models that you want to evaluate. Models you could look at include Titan, Claude, or Llama, depending on your application and use case. Then you evaluate all of these on the accuracy of the model.

I will say this: if this still feels overwhelming and you don't have someone in your organization or on your team who can help, I'd recommend using the playgrounds in Bedrock to test out some sample prompts across these models. Just eyeball it, see which model's results align best with your expectations, and go with that.
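If you'd rather script that side-by-side comparison than click through the console, here is a rough sketch using the Bedrock runtime Converse API via boto3. The model IDs, prompts, and inference settings are assumptions; swap in whichever models you actually have access to, and note that the Converse API requires a reasonably recent boto3 version.

```python
# Sketch: run the same test prompts against a few Bedrock models and eyeball
# the responses side by side. Model IDs are examples; availability and access
# depend on your account and region.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

model_ids = [
    "amazon.titan-text-express-v1",
    "anthropic.claude-3-haiku-20240307-v1:0",
]

test_prompts = [
    "Summarize the benefits of serverless containers in two sentences.",
    "Label the sentence by sentiment: 'think before you act'",
]

for prompt in test_prompts:
    print(f"\nPROMPT: {prompt}")
    for model_id in model_ids:
        result = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 200, "temperature": 0.2},
        )
        answer = result["output"]["message"]["content"][0]["text"]
        print(f"--- {model_id} ---\n{answer}")
```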

Now, once you have a clear understanding of how to measure things, you also want to evaluate across a few different dimensions. Deploying and running foundation models is expensive.

So you want to look at cost: what does it cost to run these? Invocations can take time, so what's the speed? How latency sensitive is your use case? How latency sensitive are your customers? Depending on the use case, that can change quite a bit. And also precision, which we just went through: what matters most for your use case?

Rate the models that you're evaluating across all three of these axes. Even if a model has really high precision, if your use case does not need that level of precision, you might be better off choosing a different model that is slightly less precise but much lower cost.

In this case, for instance, you'll probably end up choosing model two. Now, once you've decided what model you want to use, you may want to fine-tune the selected model, as Merla already spoke through. There are two core ways you can go about it, and both involve training the model to some extent over a small data set.

You can fully fine-tune it, or you can use parameter-efficient fine-tuning, which is much less complicated, uses much less data, and is much cheaper. Again, if you're someone who's still relatively new to this space and you don't have someone on your team who can walk through all of this with you,

my recommendation would be to look at the prebuilt capabilities in Bedrock and SageMaker JumpStart, which allow you to fine-tune your model without needing to know how fine-tuning actually works.

In case the model you selected isn't supported there, I'd actually suggest skipping the fine-tuning step, because we'll be talking about other strategies you could use instead to make sure your model gives the right domain-specific output.

All right. Now that we've spent a bunch of time talking about what foundation models are, how you can use them, and how you can select them, let's come to the fun stuff: how you actually use a model to create customer value, customer delight, and business value.

A good mental model for thinking about your generative AI powered application is to think of the model as somewhat decoupled from your application itself. Think of it as a dependency for your application, something like a database, a queue, or a stream: something your application uses, but doesn't necessarily have to be tied together with.

What this decoupling leads to is a number of benefits similar to what you'd see with microservice architectures. You decouple the technology choices between what you're running your model on and what you're building your application on. Your scaling changes dramatically: you can scale your application and your model at different rates depending on the use case you're looking to serve. And finally, because you can have two different teams or sets of people focused on these two things, you're more agile; you have smaller teams that can innovate independently and deliver value. A side benefit of this decoupling is that you can really lean in on how you're already running applications.

So for instance, if you're already running your applications on serverless, it becomes that much easier to deploy your gen AI applications with serverless compute as well, once you've decoupled the model from the application itself, both physically and in your mind.

To make things more tangible, let's look at a sample architecture. I've kept this fairly simple and straightforward. We have a load balancer where your end user requests come in; it is connected to a load-balanced ECS service where your application is running. That service speaks to a model endpoint. The model endpoint could be anywhere: on SageMaker, on Bedrock, or you could self-deploy it on ECS.

Now, let's dive into this a little bit. One particular area I really want to talk about is how your application actually integrates with the AI model. In my mind, this is one of the most interesting aspects, especially for customers coming from a serverless world who are starting to use generative AI models, because there are some nuances to how you invoke a gen AI model.

Let's go into each of these one by one. The first question is: how do you actually invoke the model? You could naively just call the API endpoint where your model is running, but is that the best or most efficient way to do it? Probably not. There are a number of frameworks out there. One in particular that's super popular is LangChain, which your application can rely on to communicate with the model you've already selected and deployed somewhere.

A side benefit of doing this is that, obviously, you don't have to create undifferentiated integration code. You can also switch across models as the team that's deploying models moves to a different model that might deliver better outcomes. Another benefit is that the frameworks usually come with a number of getting-started templates, which further help accelerate how you can deploy your gen AI application, again without having to write a ton of undifferentiated integration code.
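As an example of what that framework-level decoupling can look like, here is a small sketch using the langchain-aws integration with Bedrock. The package, model ID, and prompt are assumptions; the point is that swapping models later becomes a one-line configuration change while the application code stays the same.

```python
# Sketch: invoke a Bedrock-hosted model through LangChain instead of
# hand-rolled integration code. Assumes the `langchain-aws` package is
# installed and you have Bedrock access; the model ID is an example.
from langchain_aws import ChatBedrock

llm = ChatBedrock(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    model_kwargs={"temperature": 0.2, "max_tokens": 300},
)

# The application code below doesn't change if the team later switches the
# model_id above to a different foundation model.
response = llm.invoke("Draft a friendly two-sentence product update for our users.")
print(response.content)
```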

The next mechanical question to ask is: do you call your model synchronously or asynchronously? The general rule of thumb to follow here is to invoke a model synchronously for the most latency-sensitive workloads. Think about a chatbot, for instance: your end user is out there waiting for a response, and you probably don't want to keep them waiting a long time. There's a flip side to it: when you're synchronously invoking your model from your application, your model needs to scale fairly elastically to support the new load as it comes in, and that can get expensive, because deploying gen AI models is generally expensive.

So stay cautious, and where you can, use an asynchronous invoke pattern: introduce a queue in the middle, for instance, which can handle back pressure. Think of a use case like generating avatars or images, which isn't necessarily super latency sensitive. You still want to deliver the best possible customer experience and keep latency to a minimum, but for workloads where there is more tolerance, it's a good idea to invoke your model asynchronously.

This allows you to set your model scaling on a much more gradual pattern, which can reduce cost quite significantly depending on what the scaling characteristics of your application look like.
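Here is a hedged sketch of that asynchronous pattern: the application drops requests onto an SQS queue, and a worker (for example, an ECS service that scales on queue depth) pulls them off and invokes the model at its own pace. The queue URL and the `generate_image` callable are placeholders for illustration.

```python
# Sketch: decouple the application from the model with a queue so the model
# fleet can scale gradually. The queue URL is a placeholder and
# `generate_image` stands in for however you invoke your model.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/avatar-requests"  # placeholder


def submit_request(user_id: str, prompt: str) -> None:
    """Called by the front-end application: enqueue the job and return immediately."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user_id": user_id, "prompt": prompt}),
    )


def worker_loop(generate_image) -> None:
    """Runs in the model-serving service: drain the queue at its own pace."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=5, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            generate_image(job["prompt"])  # invoke the model; store or notify as needed
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```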

Now that we've gone through the mechanics of invoking your model, we know how we want to invoke it and whether we want to invoke it synchronously or not. Is it fair to expect great responses to customer queries from the gen AI model out of the box at this point? Unfortunately, the answer is probably not, and there are two key reasons why.

One, as Merla alluded to earlier, your model is kind of out of date the day it comes out. There are many models out there that were trained on data from two years ago, or that might not have access to information your customers expect. In those cases, your model is basically dead in the water, because it doesn't know and can't answer the questions your customers are likely to ask.

The other reason this might not work out great is that the model might also need some feedback, tuning, and guidance on style, so that you can make sure the model is speaking in the voice you want to deliver to your customers. It's easier to walk through this with an example.

A prompt is basically another way of referring to a request made to a gen AI model. The whole discipline of prompt engineering is about how you annotate your customer's query with the right level of context, so that you're delivering the right experience for customers and getting the right answers in the voice you want to have with your customers.

This is best explained with an example. I asked an AI model to label the sentence "think before you act" by sentiment, and the answer that came out was "positive". Arguably okay, but maybe not what I'm really looking for. On a second try, I used a different prompt: label the sentence and give me two descriptive adjectives, instead of leaving it super open-ended. This time, the response was "cautious and thoughtful", which aligns a lot more with the intent behind asking the question. The moral of the story, so to say, is: don't just pass the query you're getting from customers directly to your model. Use templates to generate prompts that you have tested out and that you know deliver better outcomes.

I'm going to plug frameworks again here. If you use a framework, there are prebuilt templating options, which can reduce the overhead and the amount of code you might have to write.
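Here is a tiny sketch of that templating idea with LangChain's prompt templates (plain string formatting would work just as well). The template wording is an illustrative version of the sentiment prompt discussed above.

```python
# Sketch: wrap the raw user query in a tested prompt template instead of
# passing it to the model verbatim. Template wording is illustrative.
from langchain_core.prompts import PromptTemplate

sentiment_prompt = PromptTemplate.from_template(
    "Label the following sentence by sentiment and give two descriptive "
    "adjectives that explain your label.\n\nSentence: {user_input}"
)

# The raw customer query gets annotated with the context we know works well.
rendered = sentiment_prompt.format(user_input="think before you act")
print(rendered)
# `rendered` is what actually gets sent to the model, e.g. llm.invoke(rendered).
```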

Next up, the other problem we still haven't tackled is that models do get out of date with information, and that can be super problematic if I want to use a model that's super capable but doesn't have current data, or doesn't have my specific data set. A technique that's becoming increasingly popular to address this use case is retrieval augmented generation. At a very high level, the easiest way I can explain it is that at run time, you fetch the right information that your model would need to answer the question, because you know your model might not be able to answer it on its own.

Let me run through the mechanics first, and then an example will help. To use retrieval augmented generation, the first thing you need to do is take the data set you want to provide, which could be your documentation guide or some recent events, pass it through a text-to-embeddings model like Titan, for instance, convert it into embeddings, and store those in a vector database. A vector database is a recent innovation that allows this kind of high-dimensional embedding data to be stored and searched much more efficiently.

AWS offers managed services for this: vector search for Amazon OpenSearch as well as pgvector for RDS with PostgreSQL. Once you have this set up, when your customer request comes in, pass the request first to the vector database, get the relevant embeddings, and include the corresponding content as part of the prompt you send to your gen AI model. By adding that information, your model has a lot more knowledge it can lean on to answer the question your customers asked.

For instance, I want to ask my model which FMs are supported by Amazon Bedrock. Probably because this model was trained before Bedrock existed, it doesn't know this. So I'm going to pass the documentation guide for Bedrock as the input here, and when I ask the question again, the model is able to answer it because it has access to all the information it needs.
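Putting those mechanics together, here is a simplified sketch of the retrieval-augmented flow. To stay self-contained it uses an in-memory list with cosine similarity as a stand-in for a real vector database such as OpenSearch vector search or RDS with pgvector; the model IDs and document snippets are assumptions, and the Converse API call requires a recent boto3.

```python
# Sketch of retrieval augmented generation: embed documents, retrieve the most
# relevant one for a query, and prepend it to the prompt. An in-memory list
# stands in for a vector database; model IDs and documents are illustrative.
import json
import math

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Stand-in "knowledge base": snippets of your documentation, pre-embedded.
documents = [
    "Amazon Bedrock provides access to foundation models from several providers.",
    "Amazon ECS is a fully managed container orchestration service.",
]
index = [(doc, embed(doc)) for doc in documents]

question = "Which foundation models are supported by Amazon Bedrock?"
q_emb = embed(question)
context = max(index, key=lambda pair: cosine(q_emb, pair[1]))[0]  # top-1 retrieval

prompt = f"Use the following context to answer.\nContext: {context}\nQuestion: {question}"
answer = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(answer["output"]["message"]["content"][0]["text"])
```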

Keep in mind that there are some nuances here. Neither prompt engineering nor retrieval augmented generation actually changes your model, so it's only within the context of that particular prompt that the model can lean on the context you provided. This is unlike fine-tuning, where you're actually changing the weights of the model, so that every time you invoke the model after fine-tuning it, it has that additional domain data. With RAG or prompt engineering, this is a process that needs to happen on every invocation. The other big thing to keep in mind is that the additional data or context you add is limited by the context length your model can accept. There's a finite limit to how much input your model will accept as part of any invocation, so you have to be fairly mindful of how much data can actually be passed in.

Finally, RAG is also limited by the accuracy of the embedding model itself, the foundation model you use to create embeddings from text, and by the performance of the database retrieving at run time. This is where I think something like vector search for OpenSearch helps, because that is proven to perform very efficiently.

Now that we have an idea of how we might want to build the application and integrate it with a gen AI model, let's talk about taking it to production. I'll break this across four different categories: first, hosting your application itself; second, hosting the model; and then monitoring and security, which are P0 for any use case.

Let's start by talking about options for hosting your gen AI application on serverless. As Merla already mentioned, serverless helps accelerate deployment, and that's why customers are increasingly choosing serverless containers to build and deploy their applications. Two prominent options that come up here are Amazon ECS and AWS Lambda. A key thing to keep in mind, and I'm sure you already know this, is that there's a 15-minute limit on the execution of a Lambda function.

With ECS, on the other hand, execution time is basically unlimited. Storage is also a key thing to keep in mind: even if you're just running your application and not the model, storage constraints might pop up with Lambda, because Lambda only offers up to 10 gigabytes of storage. So keep that in mind as you choose where to run the application.

As a rule of thumb, if you have infrequent invocations, it's usually a good idea to go with Lambda, because it helps you get started super quickly. But if you expect to see fairly steady traffic, I'd lean toward Amazon ECS, just because it serves continuous traffic better.

As your application starts to see more scale, you'll probably want to set up auto scaling. If you choose ECS to deploy your application, you can auto scale based on the right metric for your application: default metrics like CPU or memory utilization, the request count your application is seeing, or any other custom metric you choose.
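For reference, here is a hedged sketch of wiring that up with Application Auto Scaling through boto3, using target tracking on average CPU. The cluster and service names, capacities, and thresholds are placeholders.

```python
# Sketch: target-tracking auto scaling for an ECS service on average CPU.
# Cluster/service names and thresholds are placeholders for illustration.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/genai-cluster/genai-app-service"  # placeholder names

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

autoscaling.put_scaling_policy(
    PolicyName="genai-app-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```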

Now that we've decided where and how to deploy your application, a good question to ask is how you're going to deploy your model server. At a very high level, there are three ways you can do it. You can go the fully managed route with Amazon Bedrock, API as a service: super fast to get started, and you don't need to do much to have an AI model running that your application can call. With SageMaker, there's a fully managed platform you can use to deploy AI models, but there's still some level of effort involved in getting up and running. Or you can go all the way and self-deploy your model, either on EC2 directly or using containers with ECS or EKS. Which one you choose boils down to a number of things.

The biggest thing I would lean on here is what level of expertise your organization has. If your organization has some expertise in deploying models, it's probably a good idea to lean toward self-hosting, because that might turn out to be cheaper compared to something like Bedrock or SageMaker. But for most customers who don't have the right skill set, it's probably a good idea to start out with something like Bedrock, which allows you to get started immediately, pay per use, spin things up and down, and test things out easily, or SageMaker, which gives you somewhat more configurability than Bedrock and leaves you less constrained on what models you can actually deploy. So it's really up to you what the right deployment layer is.

For ECS customers who are already running a bunch of their applications on ECS and have some expertise in deploying and running AI applications, or have the right talent, it's probably a good idea to deploy your AI model on ECS. It gives you consistent tooling across your stack: you can deploy all your applications, including your AI model, on ECS. You get full scaling and concurrency control, and you get the rich feature set of Amazon ECS basically for free, including security, scaling, and even serverless with Fargate.

Once you've made the choice to deploy on ECS, an immediate question that comes up is: what compute does your model need? I know for most people, AI has become synonymous with GPUs, and for good reason; there are a lot of reasons why GPUs are good for building and serving AI models, and I won't go into those details. But you don't necessarily need to use GPUs to deploy your AI model.

As a general rule of thumb, I'd say if your model is less than, say, 30 billion parameters, it might be worth investigating CPU first for serving the model. Again, there's more to think about here: what's the acceptable latency for your customers, and is CPU-based inference meeting that benchmark? If it isn't, obviously that's a moot point, and you probably want to use an accelerator. The accelerator could be a GPU or even a Neuron device, depending on the use case.

Another key consideration when you're deploying your AI model with ECS is how you actually load the model into your application. AI models can vary immensely in size: some of the larger ones can be tens or hundreds of gigabytes, and even the smaller ones can be several gigabytes. So you probably want to get them loaded onto the host as fast as you can, so that your application is up and running and your model is responding to requests as soon as possible.

There are three key ways you could do that. One, you could use EFS, a fully managed, scalable file system: host your model on EFS and serve it to your application deployed on ECS. Two, you could host your model on S3 and use the S3 API in your model code or in your container to pull the model at run time. Or three, you could bundle the model into the container image itself and make a massive 5, 10, or 20 gigabyte deployable.

My recommendation would be to use Amazon S3, but I'd suggest not doing a simple GetObject call. There are ways to make S3 more performant, and there's a lot of literature about how to pull objects from S3 more efficiently; I'd recommend going through those if you're deploying an AI model on ECS. I already spoke about scaling out your AI application, but you'll also probably want to auto scale your model server deployed on ECS. It's the same set of capabilities: you can scale on any metric of your choice, whether a custom metric or a predefined ECS metric; it doesn't matter.
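One simple improvement over a plain GetObject is to let the S3 transfer manager download the model artifact with many parallel ranged requests at container start. Here is a hedged sketch; the bucket, key, file path, and tuning values are placeholders you would adjust for your artifact size and host.

```python
# Sketch: pull a large model artifact from S3 at container start using
# multipart, concurrent downloads instead of a single GetObject call.
# Bucket, key, and tuning values are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB ranged requests
    max_concurrency=16,                    # parallel download threads
)

s3.download_file(
    Bucket="my-model-artifacts",           # placeholder bucket
    Key="llm/model-weights.tar.gz",        # placeholder key
    Filename="/models/model-weights.tar.gz",
    Config=config,
)
```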

One additional dimension: if you do choose to deploy with accelerated compute on EC2, you also need to configure auto scaling for your underlying infrastructure. If you do that, I'd recommend looking into warm pools, which can really help accelerate start-up time so that your scale-out is more efficient, especially when you're starting to scale and trying to get the most juice out of your GPU-accelerated machines.

Finally, in terms of monitoring, ECS offers a range of monitoring capabilities. You get Container Insights out of the box, which gives you a ton of metrics about your ECS workloads. You get FireLens, which you can configure in your task definition and use to send logs to CloudWatch Logs, to any partner destination, or basically anywhere you want. One key call-out I would have here is to deploy an X-Ray sidecar for your AI application. It's usually a good idea to trace user requests end to end, because it can help you debug and identify bottlenecks across your application. This isn't a native integration, so you would need to deploy the X-Ray daemon as a sidecar, but there are good patterns for how to do that.
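As an illustration of that sidecar pattern, here is a hedged sketch of registering a Fargate task definition that runs the X-Ray daemon alongside the application container. The family name, image URIs, ports, and execution role ARN are placeholders.

```python
# Sketch: a Fargate task definition with the application container plus the
# X-Ray daemon as a sidecar listening on UDP 2000. Names, image URIs, and the
# execution role ARN are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="genai-app",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/genai-app:latest",  # placeholder
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        },
        {
            "name": "xray-daemon",
            "image": "amazon/aws-xray-daemon",
            "essential": False,
            "portMappings": [{"containerPort": 2000, "protocol": "udp"}],
        },
    ],
)
```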

Finally, in terms of security: security is one of the places where I think ECS, particularly ECS with Fargate, really shines. Your applications on ECS Fargate are isolated by default; each task you deploy on Fargate runs on its own VM, so there's no chance of a breakout in your container causing security issues for other applications you have deployed. A new capability we recently launched is an integration with GuardDuty, which gives you runtime threat detection for your containers running on ECS, whether on EC2 or Fargate. That's a one-click integration. Quick plug: if you are a security-sensitive customer, do give that a spin; it might be useful.

And with that, I'll hand over to Merla. Thank you.

All right, quickly wrapping up. Let's take a look at how some customers are using Amazon ECS and Fargate to run their AI applications.

The first is Scenario, a game development company that wanted to reduce time to market for game studios by using generative artificial intelligence to create style-consistent assets. They used Amazon ECS, a fully managed container orchestration service, to build their generative AI offering. An interesting fact: they wrote the first line of code in October 2022 and built the beta in two months with just three engineers. Scenario was able to generate over 1 million images in the first two weeks and is continuing to rapidly scale and grow on AWS with Amazon ECS.

The second is also a startup. Rad AI is a SaaS company that aims to increase the quality of healthcare by streamlining radiology workflows using natural language processing. When they came to AWS, they were deploying ML applications in on-premises data centers and were looking for higher performance and faster inference speeds so they could serve more customers. In addition to migrating their ML models to Amazon EC2 P4d instances powered by NVIDIA A100 GPUs, Rad AI was able to streamline their experimentation by deploying the ML models on Amazon ECS and setting up continuous deployment. They were able to experiment, develop, train, and deploy their models much more quickly using PyTorch frameworks, and strengthened customer satisfaction by incorporating feedback quickly and increasing inference speeds by 50%.

Actuate is a cloud-based computer vision platform that ingests video feeds from remotely accessible cameras and detects security threats in real time. Their application decodes video streams into JPEG images and sends those images to AI models for instant analysis. To be efficient, Actuate was looking for dynamic capacity scaling to meet peak resource and performance requirements, and to do all of this in a cost-effective way. They chose Fargate as the operating model for their ECS clusters, so that management of EC2 instances is completely abstracted away, and containerized subsets of cameras with internal scheduling logic that launches tasks at specific times on a daily basis. This enabled them to improve CI/CD and streamline monitoring and troubleshooting for continuous improvement.

Quick takeaways from the talk:

  1. Generative AI foundation models can be used for a variety of applications. Keep in mind that when you start with a foundation model, it can be reused later for a range of downstream tasks.

  2. Generative AI applications aren't inherently unique; the key nuance is how the application integrates with the model. So prompt engineering and retrieval augmented generation, as Vav pointed out, would be the way to go for best results. Use Amazon Bedrock for model serving, and use Amazon ECS with AWS Fargate as your application server to get started quickly.

  3. Amazon ECS offers both accelerated compute in the EC2 form and serverless compute using the Fargate launch type, along with a rich feature set for self-hosting.

I'll do a quick plug here. The biggest takeaway for me, personally, would be if this talk helps demystify AI applications for you somewhat and encourages those of you who haven't tried already to try building AI apps on AWS. It's not rocket science for the most part; a lot of the hard part is already handled for you, and really the world is your oyster, so to say. It's a good time to start testing out new use cases if you haven't already. Thank you.

In addition to the takeaways, here are some references you may want to refer to, in terms of blog posts, collateral, and turnkey solutions for how to get started.

I hope this talk was useful for you and that you're leaving the room confident about making your journey more successful and paving your AI path forward in a more accelerated way. Thank you so much.
