SaaS meets AI/ML & generative AI: Multi-tenant patterns & strategies

Thanks for turning out for the session. Hope re:Invent's been good for you so far. Appreciate you making the trek over here to Mandalay Bay to hear us talk about SaaS and generative AI.

It's kind of interesting every year when we sit down and think about what we're going to do at re:Invent. We're always asking ourselves, what new emerging technology or what new patterns are out there that we can talk about?

Obviously generative AI very much came up to the top of that list for us, right? We see lots of customers and lots of partners and other folks who are very interested in understanding, not just how generative AI can fit into their application, but specifically, what does it mean to bring generative AI into a multi-tenant environment?

What are the nuances and things in our architecture that we would need to do to support the workflows and architectural principles of SaaS specifically? How do we offer tenants specific customizations and refinements that can sit on top of the whole gen AI experience but still deliver unique experiences to our customers?

That was the motive behind this talk - we wanted to spell out the landscape of possibilities looking at all this through the lens of what multi-tenancy does to the whole generative AI story. How does it affect data partitioning, isolation and pricing?

Then how do we take all these cool tools like RAG and fine-tuning and marry them to the gen AI experience? I'll say this is early days for a lot of this thinking. We're all scrambling to figure out the gen AI story across so many experiences. SaaS providers aren't alone.

Our hope is that you leave with some sense of the possibilities. I suspect if you hear this talk a year from now, there will be a whole new range of possibilities. But for now, we can equip you with a mental model for connecting multi-tenancy to this generative AI story.

In terms of who I am, I'm Todd Golding. I've been working in the SaaS space for 8 years with different customers and partners. I'm joined by James Jory, an old friend and coworker here at AWS. James, want to introduce yourself?

Thanks Todd. I'm a specialist solutions architect in the ML group at AWS. I'm also a retail specialist and I've worked with partners on building integrations of AI services into their platform. Happy to be here with you today.

James knows the gen AI story much deeper than I do. He's my security blanket! He's helped me pull together this content and present it to you.

Let's get into the tech. This is a 307 level session, but we won't be popping up the IDE or writing code. We'll get into architecture details though.

We have to set the table on the fit between generative AI and SaaS. Where are the common points of intersection, the areas where it makes sense to apply SaaS principles to generative AI?

SaaS has momentum as a standard delivery model. Applications are being built in multi-tenant models faster than we can keep up. Gen AI will inevitably connect to SaaS: it will enrich SaaS applications and introduce new ways to deliver the experience.

We'll concentrate on delivering targeted tenant experiences in gen AI environments. We have LLMs and mechanisms to ask questions of foundation models. But what does it mean to offer a targeted experience to a tenant in a specific domain?

In one SaaS environment with many tenants in many different contexts, how do we offer unique experiences? That's where we'll dive deep: customizing for tenants with fine-tuning and other techniques. I think that's where the huge value is in gen AI and multi-tenancy.

Pricing and packaging is also key. Implementing multi-tenant gen AI will influence your packaging and pricing options; these are no longer purely business decisions. How you measure and attribute consumption, and translate that to price, looks different in gen AI.

SaaS emphasizes efficiency - analytics and insights into resource usage and load. It seems natural gen AI will also enrich SaaS operations and analytics. We won't cover it today, but think about it.

People already use AI in tenant onboarding and management to optimize the tenant journey, and there are good examples of applying gen AI here too.

Finally, the magic - what gen AI-enabled feature can I add to differentiate my SaaS? New features, extended reach? The sweet spots are still evolving.

Let's outline the big moving parts of this experience. Generative AI has massive capabilities in AWS. But for SaaS, focus on:

Foundation models - the heart and brain. SageMaker and Bedrock - stateless services and APIs on top of models. Your front door.

Optional layers like fine-tuning customize models for tenant context. RAG and state/memory outside LLMs refine requests.

Our multi-tenant SaaS app sits on top. We'll see how to mix and match patterns like fine-tuning and RAG for tenant needs.

Let's start simple - an app using Bedrock. Customers pose requests, prompting Bedrock, which returns a response. With the same request, customers get the same response.

In multi-tenant SaaS, we want distinct experiences - targeted responses based on tenant domain. Tenant context flows in, and we augment the Bedrock request with that context before sending it.

Now the tenant-augmented prompt yields a tailored response. We'll see how to layer this in.
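As a rough sketch in Python, that augmentation step can be as simple as folding tenant attributes into the prompt before it goes off to Bedrock. The context keys here are hypothetical; use whatever your tenant model actually carries.

```python
def augment_prompt(user_request: str, tenant_context: dict) -> str:
    """Prepend tenant context to a customer's request before sending it
    to a foundation model (e.g., via Bedrock's InvokeModel API)."""
    context_lines = "\n".join(
        f"- {key}: {value}" for key, value in tenant_context.items()
    )
    return (
        "You are an assistant for a multi-tenant SaaS application.\n"
        f"Tenant context:\n{context_lines}\n\n"
        f"Customer request: {user_request}"
    )

prompt = augment_prompt(
    "Which driver should I buy?",
    {"tenant_id": "tenant-1", "domain": "golf retail", "tier": "premium"},
)
```

The same customer request now yields a different prompt, and therefore a different response, per tenant.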

And now when the response comes back, I get a tenant-specific response. For me, this is the highest-level view of what multi-tenancy could look like in a generative AI experience.

The reality is that how the prompt gets augmented is wildly different from one solution to the next. I want to make this point a little clearer by mapping it to a specific example.

Let's say I have an ecommerce solution. People make stores and shops on my ecommerce platform. It's one solution; everybody consumes it, and they all consume it equally. But the customers who build stores on my platform aren't all building stores from the same domains.

Some are coming from the golf domain, where they're selling drivers and wedges, and their customers have very specific expectations about what it means to buy golf clubs. Others are selling tools like hammers and saws, and they want a unique experience relative to that.

And then of course, there's clothing. So now I have these domains, but they're all still running against my one platform when they come in and say, I want to go find a product, I want to search for something.

But they're saying, oh, I'm a left-handed golfer and I have a slice that's really bad (there are probably lots of things bad about my golf game that would show up here), so which club might be a good fit for me for this particular set of conditions?

I want to have that domain context injected somehow into the request that I make to the back end of this experience, so that I get back a specific response that is contextual to the golf domain.

And when the tools domain comes in and makes a request, I want the same sort of experience. To me, this is valuable in terms of shoppers getting a great experience, but it's also a great way to highlight that not everybody has the same expectation of what they're going to get back from gen AI.

And it's our job, especially in a SaaS environment, to figure out how to tailor that experience. At this point I'm going to hand off to James, who is going to take this concept and push further into what the specifics look like.

Thanks Todd. So if we take this concept and drill a little deeper into using RAG, retrieval augmented generation, patterns with AI in a SaaS or multi-tenant application, what does that look like?

Strictly speaking, RAG is a pattern or technique by which you improve the quality and predictability of the output from LLMs by grounding the LLM, that is, providing additional data that's retrieved from a source external to the LLM.

You provide that data in your prompt to the LLM to keep it focused on the task at hand. The typical data source for that retrieval layer is mostly going to be some sort of vector database: OpenSearch with the k-NN plugin, or Pinecone, which is a really popular SaaS-based vector database. Redis Enterprise added support for vector-based lookups, and MongoDB did as well.

This is a rapidly changing space: the database community is adding support for vector-based similarity searching. The reason vector databases work so well with RAG use cases is that they're able to take a natural language query, a semantic query, and find documents or objects that are semantically similar or related to that query.

However, it's not a requirement that you use a vector database with RAG. We're going to show an example here of how you can use either DynamoDB or a relational database to retrieve the facts or data that can then be used in a prompt.

So we have a scenario here of a tenant entering a shared SaaS application, tenant number five in this case, and we're looking to build that augmented pattern that Todd was describing.

We go off and collect the products for tenant number five; we're using a dedicated table for this tenant. Maybe our use case is that we want to take the currently promoted products for this ecommerce store and ask the model to wrap those products in some marketing copy that makes them look more compelling to the user.

So we retrieve the product information, we build a prompt based on that product information, and then through the prompt we instruct the LLM what we want it to produce.

Now another tenant comes into the same solution. Tenant number four comes in and we do the same thing: we make a request to tenant number four's dedicated table to retrieve its products, then go through the same process of generating a prompt with tenant four's products and return the response.

Finally, we have a third tenant, tenant number two, whose data is in a shared table, a pooled configuration. You can see here we're getting the partition key; this is a DynamoDB table, and the partition key is tenant number two's tenant ID. That's how we're able to pull just that tenant's products from that shared table.
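A minimal sketch of that pooled-table retrieval and prompt-building flow, assuming a hypothetical `products-pooled` table whose partition key is the tenant ID (the parameter dict matches the shape of DynamoDB's Query API, e.g. boto3's `client.query(**params)`):

```python
def tenant_products_query(tenant_id: str, limit: int = 10) -> dict:
    # Keyword arguments for DynamoDB's Query API; the table and
    # attribute names are hypothetical stand-ins for the pooled table.
    return {
        "TableName": "products-pooled",
        "KeyConditionExpression": "tenant_id = :tid",
        "ExpressionAttributeValues": {":tid": {"S": tenant_id}},
        "Limit": limit,
    }

def build_marketing_prompt(products: list) -> str:
    # Fold the retrieved items into the prompt that instructs the LLM.
    lines = "\n".join(f"- {p['name']}: {p['description']}" for p in products)
    return (
        "Write short, compelling marketing copy for each of these "
        f"currently promoted products:\n{lines}"
    )
```

Because the key condition is always scoped to one tenant ID, the retrieval step naturally returns only that tenant's products.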

We set this up on purpose to illustrate that a lot of the same patterns you use for building SaaS applications apply here in a gen AI use case. A lot of those same learnings can be applied to this class of new use cases and experiences you can build.

When it comes to selecting the data source you're going to use with these RAG experiences, there are lots of choices. If you're going to do a vector-based lookup, a similarity or semantic-based lookup, there's OpenSearch, which I mentioned; you can also now do it with Postgres using the pgvector extension.

Kendra is another option: a fully managed service that can index documents you have in S3 and other applications you use. And a different one to consider, since we're teeing up this ecommerce experience: maybe instead of promoted products I want to pull recommended products for this user based on past purchase behavior, and then build a prompt around those recommended products.

The idea of showing just a few of these sources is that you want to match the data store to the use case you're looking to build; you're not limited to vector databases alone. Some of the decision making you need to go through is similar to any transactional data source choice you make.

In a multi-tenant SaaS application, you're looking at the ability to partition your tenant data and at tenant isolation. You're also looking at operational characteristics, like noisy neighbor situations. These are all important considerations that apply in this gen AI space as well.

Looking a little closer at vector databases, since they're so common with RAG and gen AI: how would I populate these vector databases in a SaaS-based application?

It starts with the tenant data. We're going to index that data in a vector data source. If you're not familiar with it, a vector data source is a type of database where you index and store unstructured objects that are represented as vector embeddings.

The idea, as I mentioned, is to be able to do similarity-based, semantic-based searching. When you query one of these indexes, you take a natural language query, a semantic query, and generate an embedding from that query as well.

Then you search the index for objects that are semantically similar, using vector-based algorithms such as cosine similarity. There are other approaches, but you're basically finding vectors of other objects that are close, nearest neighbors, to that query vector.
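As a toy illustration of that search, here's hand-rolled cosine similarity over a tiny in-memory index; a real vector database would use approximate nearest-neighbor search at scale, but the math being computed is the same:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors: 1.0 means
    # identical direction, 0.0 means unrelated (orthogonal).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_neighbor(query_embedding, index):
    # index is a list of (doc_id, embedding) pairs; return the id of
    # the document whose embedding is most similar to the query's.
    return max(index, key=lambda item: cosine_similarity(query_embedding, item[1]))[0]
```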

So we have our tenant data here, staged maybe in a data lake. Before we can generate embeddings on our data, we likely need to do some preparation. With text-based data this is typically called chunking, where you take a long piece of text and break it into chunks.

You're going to generate embeddings for each chunk of data. AWS Glue or EMR are a couple of AWS services that would be a good fit here. Then, with your data prepared, you're ready to generate those embeddings.
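A minimal sketch of that chunking step; the sizes here are arbitrary, and real pipelines often split on sentence or token boundaries instead of fixed character counts:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50):
    # Fixed-size chunking with overlap, so a sentence that straddles a
    # chunk boundary still appears intact in at least one chunk.
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Each chunk would then be sent to an embedding model and indexed alongside its text.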

We're showing a couple of options here. In SageMaker, you can host a model, say from Hugging Face, that will generate embeddings based on the input you provide. Bedrock also provides fully managed models that can generate embeddings. So these are two options available to you.

Once your embeddings are generated, it's time to index them into a vector data source. We're picking OpenSearch here, and you can see we have two different OpenSearch domains for two different tenants. This shows how you can do a silo-based approach with OpenSearch.

And so what's in our index is both the embeddings and the text, so we can do those vector based searches.

You've probably heard a lot about Amazon Bedrock this week: it's a fully managed service designed to provide API access to foundation models. One of Bedrock's higher-level capabilities is something called Knowledge Bases, which went GA this week. It takes on some of the do-it-yourself steps we saw on the prior slide.

I want to walk through that same process we just saw, but this time with a Knowledge Base in Bedrock.

We start with our tenant-specific data, and we need to create our OpenSearch index beforehand because we're going to tell Bedrock to index our data into that index. OpenSearch Serverless supports vector search.

So you don't have to manage any instances or nodes; you can let the service handle all that for you. OpenSearch Serverless is a great match here, and we're creating one for tenant one.

Then we ask Bedrock to populate and manage that index. It takes the data, prepares it, and calls the Titan Embeddings foundation model to generate the embeddings. It does this on our behalf; we don't have to make that call now.

It will also index the embeddings, as well as the metadata, directly in the OpenSearch index we provided when we configured the Knowledge Base. What's also nice is that a Knowledge Base provides an inference endpoint, a query endpoint, where you can provide a natural language query and the Knowledge Base will generate an embedding from that query for you automatically.

It will submit that query to the OpenSearch index and return the results. So it helps you both with building out and populating the index and with querying it.

I'd like to pivot a little here and talk about another capability of Bedrock that's common with gen AI related use cases, and how we can use it in a SaaS application.

Agents for Bedrock, which also went GA this week, allows you to tap into the reasoning capabilities of large language models: an LLM is able to take a complex task, break it down into steps, and execute those steps to perform that task. Some examples might be creating an agent that can process insurance claims, an ecommerce experience where you want to guide a consumer through a purchasing experience and actually place the order, or helping consumers with travel reservations. The possibilities are really endless.

These agents have the ability to prompt the user for the information needed to complete the task: the agent breaks down the steps, decides it needs more information from the consumer, prompts the user, gets that information, and continues the flow. It can also call Lambda functions in your AWS environment to perform each of those steps. And finally, if it needs data to complete the task, it can look up data, say in a Knowledge Base, pull that data in, and compose a prompt to the LLM to continue through the task.

So let's walk through an example. It starts with bringing a task in. From our application, we'll have some tenant context that identifies the tenant, so we're still keeping this in the SaaS world. That task is submitted to an agent. When we create an agent, we provide instructions to it. In this case we'll use the example of helping Todd with his slice situation in golf: "You are a retail assistant responsible for helping customers select and purchase golf clubs." That's the role, the instructions, for this particular agent.

With the task provided to the agent, it's going to compose a prompt to the LLM that includes all of the steps it can perform, the context of its role as a retail assistant for a golf ecommerce storefront, and the actions it can take. That prompt is submitted to an LLM to perform what's called chain of thought, a reasoning-and-action flow. It's a prompt engineering technique that allows the agent to go through these steps, and it may go back and ask the customer for more information (are you left-handed or right-handed, those kinds of questions). It goes through the sequence of steps by making API calls that you configure with your agent. These Lambda functions can do things like look up inventory for a set of golf clubs, place the order, or get the shipping tracking number. Really anything can be done behind these Lambda functions: they could include your business logic, or they could just be shims that call microservices you already have in your SaaS architecture. You can also look up data from data sources. And as you see, the tenant context is passed through; it's weaved through this process.

These Lambda functions can then know what the tenant is; session identification, who the user is, can all get passed through as session attributes through this agent process. The agent can also consult a Knowledge Base to get information. Maybe we have a knowledge base of how to match golf clubs to players based on their skill level, things like that; we can tap into that and use it to build a prompt that guides this entire process. We get down to the end of the steps, and the results are returned to the agent. The agent takes the task and the results and crafts a prompt to create a final result, which is returned to the user. Something like: OK, we've ordered your clubs, it's these Ping irons, they're on their way, they should be there by Thursday. That would be the end result of the task.
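A sketch of what one of those action-group Lambda functions might look like, reading the tenant ID out of the agent's session attributes so the step stays tenant-scoped. The event and response shapes here are simplified stand-ins, not the exact Agents for Bedrock contract, and the inventory lookup is an in-memory stub:

```python
# Stand-in for an inventory microservice the function would call.
_INVENTORY = {("tenant-1", "driver"): 7, ("tenant-1", "wedge"): 12}

def lookup_inventory(tenant_id, club_type):
    return _INVENTORY.get((tenant_id, club_type), 0)

def lambda_handler(event, context):
    # Tenant identity rides along as a session attribute, so every
    # step in the agent's plan stays scoped to the calling tenant.
    tenant_id = event["sessionAttributes"]["tenant_id"]
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    count = lookup_inventory(tenant_id, params.get("club_type", "driver"))
    return {"tenant_id": tenant_id, "inventory_count": count}
```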

So we've talked about RAG, we've talked about different techniques there, and about using prompts and prompt augmentation. A question that invariably comes up is: what if I reach the limits of what I can do with prompt engineering? What if my use case is so specialized or so complicated that the existing foundation models just aren't adequate? You could certainly build your own foundation model from the ground up, but that's typically cost prohibitive and beyond the skills of most organizations; it's a very big endeavor. What if you could instead take an existing foundation model, provide some additional training data that's specific to your use case or your domain, and then integrate that specialized model into your applications? That's exactly what the process of fine-tuning is, and fortunately Bedrock provides it as a fully managed experience. In fact, you don't even have to write any code.

There are multiple dimensions to think about with fine-tuning. At the tenant level, we could actually fine-tune models for each tenant if we wanted to go that granular. That involves creating tenant-specific training data and validation data that we provide to Bedrock when we create a custom model, along with what are called hyperparameters: parameters that control the training process, such as the number of epochs, the batch size, and the learning rate. The output of creating a custom model is a fine-tuned model; we're showing two different models here, for tenant one and tenant two. These models are managed by Bedrock and are only accessible from your account, so they're still private to your environment. We'll learn more about how you can control that through IAM later in the talk.
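A sketch of what a per-tenant customization request might look like, shaped after Bedrock's CreateModelCustomizationJob API; the role ARN, bucket names, and base model ID are hypothetical, and note that Bedrock takes hyperparameter values as strings:

```python
def customization_job_request(tenant_id: str) -> dict:
    # Request body for a per-tenant fine-tuning job (e.g., passed to
    # boto3's bedrock client); names and ARNs here are illustrative.
    return {
        "jobName": f"fine-tune-{tenant_id}",
        "customModelName": f"custom-model-{tenant_id}",
        "roleArn": "arn:aws:iam::123456789012:role/BedrockFineTuningRole",
        "baseModelIdentifier": "amazon.titan-text-express-v1",
        "trainingDataConfig": {"s3Uri": f"s3://tenant-training-data/{tenant_id}/train.jsonl"},
        "outputDataConfig": {"s3Uri": f"s3://tenant-training-output/{tenant_id}/"},
        "hyperParameters": {
            "epochCount": "2",       # number of passes over the data
            "batchSize": "1",
            "learningRate": "0.00001",
        },
    }
```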

Another dimension to think about: what if I don't need tenant-specific fine-tuning? What if I want to fine-tune at an industry or domain level? Maybe I have customers in the healthcare and financial services areas, and I have data I can use to fine-tune those FMs to create a better experience there. You can do the same process at an industry level and create fine-tuned models that you share across all the tenants in each of those domains.

Looking at this from a procedural or lifecycle perspective, real quickly: you start with Bedrock, of course, and you invoke a model customization job for each of the two tenants here, tenant one and tenant two. We bring in the training data from each of those tenants, and the output is two fine-tuned models. Bedrock will associate a model ID with each of those models, and we'll see how we can weave those into both the onboarding process and the request flow.

Moving forward into thinking through the onboarding process with RAG or fine-tuning or both, let's take a closer look. We start with a tenant that's going through some sort of onboarding process. This could be an onboarding process that's managed internally in your application, or maybe the tenant is driving it. We'll use the classic SaaS approaches and design patterns here: a control plane with an onboarding service that's the orchestrator for this onboarding process. It works with a tenant provisioning service to perform all the provisioning tasks, and maybe that service reaches out and provisions some of the other, non-gen-AI moving parts of this architecture.

Where the gen AI comes in: we can initiate that same fine-tuning flow, if tenant-level fine-tuning is part of our solution, as well as populating the vector database. If we're going to use something like an OpenSearch vector database, we can begin that process too. Both of those processes are asynchronous in nature; they're typically something you kick off that finishes at some point later, and eventually everything will be ready for your tenant to start consuming those specialized resources. We also need to store the tenant configuration somewhere, so we'll interface with a tenant management service that stores the model ID from our fine-tuned model, as well as maybe where our training data is located in S3 and where our RAG OpenSearch index is. We associate all of that with the tenant in the tenant management service.
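A sketch of that last step, recording the tenant's gen AI configuration; an in-memory dict stands in here for the tenant management service's store, which in practice would be a database behind that service:

```python
# Stand-in for the tenant management service's configuration store.
TENANT_CONFIG = {}

def record_tenant_config(tenant_id, model_id, vector_index, training_data_uri):
    # Persist everything the request path will need later: the
    # fine-tuned model ID, the tenant's vector index, and where the
    # tenant's training data lives in S3.
    TENANT_CONFIG[tenant_id] = {
        "model_id": model_id,
        "vector_index": vector_index,
        "training_data": training_data_uri,
    }
```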

Now it comes to routing requests into the application. Again, we'll start with tenant number one. The request comes in with tenant context that we use to associate this user with a tenant. We have an application plane with a service that's processing these requests, and the first thing it does is resolve the information it needs for this particular tenant from the same tenant management service we saw on the prior slide. With the RAG information we have here, we can reach out to that RAG data source, whatever it is: a database call like we saw with those DynamoDB tables, or maybe a query to an OpenSearch index that brings back related documents or items. To invoke the model, we need the model ID for the fine-tuned model, so we also get that from the tenant management service. Now we're ready to build our prompt, invoke the model with that prompt, and receive the response that we pass back to the user.
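That request path can be sketched like this; the RAG retrieval and model invocation are injected as functions so any data source (DynamoDB, OpenSearch) or model API could sit behind them, and the config store is whatever the tenant management service returns:

```python
def handle_request(tenant_id, question, config_store, retrieve, invoke_model):
    # 1. Resolve this tenant's configuration (fine-tuned model ID,
    #    vector index) from the tenant management service.
    config = config_store[tenant_id]
    # 2. RAG retrieval against the tenant's data source.
    documents = retrieve(config["vector_index"], question)
    # 3. Build the augmented prompt from the retrieved documents.
    prompt = "Context:\n" + "\n".join(documents) + f"\n\nQuestion: {question}"
    # 4. Invoke the tenant's fine-tuned model with that prompt.
    return invoke_model(config["model_id"], prompt)
```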

When you think about fine-tuning and RAG, there are a lot of decisions here, and there are no absolute rules. It really depends on working backwards from the user experience and use cases you're looking to deliver with gen AI. You may be able to use a single fine-tuned model across all tenants. Maybe you fine-tune one FM and use another FM off the shelf from Bedrock for a different use case. Maybe you use Knowledge Bases or Agents in Bedrock across all tenants; or, going the other way, maybe you want tenant-specific fine-tuning and tenant-specific RAG. There's no exact guidance here; you can mix and match these in different ways. That's really the hard part of incorporating these technologies into gen AI: combining the data you have available at the domain level, the tenant level, and the tenant context level with the tools and mechanisms we've talked about: fine-tuning, RAG, Bedrock Knowledge Bases, and Agents. It's working backwards from the use case and asking: what data do I have? Do I have the data I need for this use case with generative AI? If not, how can I get that data? What do I need to do to prepare it? And how do I stand up these tools in my architecture?

So I'm going to hand it off to Todd to carry the story forward.

Thanks James, and I'll just reemphasize James's point here: when we put all this together, these tools are all awesome and great, but somebody in your organization is still going to have to figure out where all this is coming from. Where's my fine-tuning data coming from? Which RAG formula, which tools? I wish we could tell you to just go do X or Y and that's the packaged way to do it. It's not; you have to go figure out what's going to work best for you.

Now, in addition to all these tenant refinements and new constructs: whenever we introduce new constructs into any SaaS environment, we always have to ask ourselves, what are we doing to make sure one tenant can't access the resources of another tenant? That's just foundational to SaaS, and now we have all these new gen AI constructs introduced into our universe. If we look at the pieces James talked about, we have these fine-tuned models that are now available to us, each associated with individual tenants, and we have whatever these RAG sources are: vector databases, DynamoDB tables, or whatever it is we chose to use down in those environments.

And even if they're deployed for individual tenants, they're not isolated, right? There's nothing on the surface that will prevent a microservice from crossing a boundary from Tenant One's resources to Tenant Two's. We have to introduce something there.

So we need something that will make sure Tenant One can't see Tenant Two's RAG data, for example. And we also need something that makes sure that, across these model endpoints, one tenant can't invoke another tenant's model. It's just fundamental tenant isolation, but an easy thing to overlook when you start layering in generative AI here.

So what does that look like? Well, the good news is, if you've seen a lot of our talks on tenant isolation, where we talk about a token vending machine and how to assume roles with IAM, we can still mostly use IAM here to secure these mechanisms and make sure one tenant can't see another tenant's resources.

In the fine-tuning case it's really straightforward: our tenants come in with their JWT that has the tenant context in it. Somewhere within our SaaS solution, some library (hopefully) or something of that nature goes out and says, hey, there's a policy out there that constrains access to these different generative AI models.

I'm going to assume a role for that particular tenant, get back the credentials from that role, and use those to access and invoke the model I'm interested in invoking. If I try to invoke a model that doesn't belong to me, I get denied. Super straightforward, but something not to overlook.
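A sketch of the session policy that assumed role might carry, scoping `bedrock:InvokeModel` to a single tenant's custom model. The account ID, model name, and exact ARN shape here are illustrative; check the Bedrock documentation for the precise custom-model ARN format in your account:

```python
import json

def tenant_model_policy(region, account_id, model_name):
    # Scoped session policy for the tenant's assumed role: the only
    # thing it permits is invoking this one tenant's custom model, so
    # cross-tenant invocations are denied by default.
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": f"arn:aws:bedrock:{region}:{account_id}:custom-model/{model_name}",
        }],
    })
```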

Now, when you look at it with RAG, it's a little more interesting, because with RAG we don't just have a model and a model ID that we have to lock down. Now we have any number of different technologies and resources to think about.

We could have, like James talked about, OpenSearch here, Pinecone here, RDS in this world. And this really comes back to the fundamental ways we think about applying isolation.

Generally, one of your approaches, if your resources are siloed for your RAG data, is resource-level isolation, where you just use IAM to say, hey, you can access that database. Generally there's IAM or some other mechanism at the resource level that lets you constrain where it gets accessed.

It gets a little more tricky, like the example James had before, when you get to item-level isolation, where tenant data or tenant items are commingled together inside the same resource. Now you have to figure out how to apply isolation at that level, and that's where your mileage might vary a little bit.

In some cases, IAM is your friend there and you can do exactly what you want. In other cases, you'll have to be creative about how you do isolation.

And if we look at isolation just for raw data using IAM, it's the same old formula. This is the diagram James had earlier showing all the different DynamoDB tables, but now with isolation layered onto it. A request comes in, our solution assumes a role based on the tenant context, and now it's going to go out and hit one of these tables.

In this case, it's hitting the pooled table, which is where we had tenant data commingled together. In that model, I have to have some notion of item-level isolation, which is why it's handy that I picked DynamoDB here, because it supports item-level isolation.

So I can say you can only see Tenant Two's items in this table, and that works great. If this were RDS MySQL, we'd be having another discussion; with Postgres, it would be row-level security. There are other options you have to think about.
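For the DynamoDB case, item-level isolation is typically expressed through the `dynamodb:LeadingKeys` IAM condition key. Here's a sketch of what that policy statement might look like, assuming (as an illustration, not from the talk) that the pooled table uses the tenant ID as its partition key:

```python
# Sketch of an IAM policy statement for item-level isolation in a pooled
# DynamoDB table. The table ARN and the convention of using tenant_id as
# the partition key are assumptions for illustration.

def pooled_table_policy(tenant_id: str, table_arn: str) -> dict:
    """Policy allowing reads of only the items whose partition key is tenant_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                "Resource": table_arn,
                "Condition": {
                    # dynamodb:LeadingKeys constrains access to items whose
                    # partition key value matches the tenant's identifier.
                    "ForAllValues:StringEquals": {
                        "dynamodb:LeadingKeys": [tenant_id]
                    }
                },
            }
        ],
    }
```

Attached as a session policy during role assumption, this gives the "you can only see Tenant Two's items" behavior without any filtering logic in application code.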

And then we had the two siloed tables that James talked about, the ones with our golf and our tools bits in them. These can use resource-level isolation. So nothing exotic, nothing wildly crazy here. But we really can't talk about multi-tenant, throw all these resources into this environment, and not ask ourselves: how are we going to protect them?

To me, the more interesting angle here, and the one that comes up over and over with customers, is: what is tiering? What does it really mean to tier a solution in a generative AI experience, where I've used generative AI under the hood of my solution somewhere?

Is it the same old thing I do for tiering already, and generative AI is just one more tool in the bag that I have to think about? Or does generative AI somehow change the way we think about how tenants consume our experience, how we control how they consume it, and, as we'll see, how we measure how they consume it?

If we look at this at the simplest level, there's an operational view into this and there's a product view into this. At an operational level, I have different tenants consuming Bedrock somewhere in the back end of my SaaS solution.

Operationally, I'm concerned about whether one tenant is imposing excess load on my system. That's just a core SaaS consideration. I'm also going to be worried that, even if I can support how much they consume, what they're consuming may be affecting the experience of other tenants.

So just to keep the lights on, I'm going to want some way to control that. But then, at the product level of this discussion, I'm also going to say I may not want to offer every single customer the same entry point and price point for my system.

I may want to segment my offering into different experiences. So I may have basic, advanced, and premium tier tenants here, and basically say, hey, my basic tier tenants, your SLAs are going to look like x or y, and I'm going to constrain the degree to which you can consume Bedrock.

And this is partly for the health of the system and partly for noisy neighbor, but it's also a business decision, which is: I don't want the basic tier tenant who's paying me 50 bucks a month, or whatever they're paying me, to be consuming 80% of my Bedrock bill because they're just going crazy on the system.

Meanwhile, my premium tier tenant is both being impacted by that tenant and maybe consuming even less. So I get this imbalance between the different tiers in my system and the level of load they're putting on my environment.

So you need tooling to do this. But the question is: how do you do it? Is it just the same old thing with Bedrock, or is there something more you need to think about? It turns out there is something more. So if we take basic, advanced, and premium tier tenants coming in, we need to put something between those tenants and their consumption of Bedrock,

because we need to be able to intercept those requests and validate whether or not we're going to throttle them, or decide what we're going to do with them. In this case, I've chosen Amazon API Gateway, and I'll admit we're still figuring this out; there could be any number of different tools in that slot.

But Gateway is one I'm familiar with, and it's an easy one to describe this problem with. Here with Gateway, I'm going to process each of these requests, and just like I would with classic throttling, I'll use a Lambda authorizer. In reality, I can look at the frequency of your calls here as well.

So I might use classic throttling, which is just: you're calling too many times. That's going to be part of my throttling story. But also in my throttling story is the new twist, the new wrinkle, which is: what's the complexity of the request you're about to send to Bedrock?

Because if you look at these gen AI solutions, the size and complexity of the tokens coming in, and the complexity of what's coming out: those are the things that correlate to load for the service you're consuming.

So it's not just the number of calls you're making; it's the complexity of those calls. Now we have to put complexity into the front door of this story: evaluate the complexity of the call that's coming in, create our authorizer policy, and then decide, are we going to let you flow through to the back end or not?

So now we have a whole new set of things to think about. I could have a set of frequency policies and a separate set of complexity policies, or I could have some combination of the two.

How do we then find the boundaries, now that we're measuring complexity? Where would the boundary of a policy be? How do you decide how big or complex something is before you say: this is basic tier complexity, this is advanced tier complexity? Or how frequently are you submitting requests of a certain complexity size?

Does that get you to an SLA? There are all kinds of bits you have to think about here. But my big point is: no matter what, you have to do something. You're not just going to leave Bedrock wide open in your app and say to somebody, consume however much you want, all good. That's just not going to be a good idea.
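As a sketch of what such a check might look like inside a Lambda authorizer: combine the classic frequency throttle with a complexity throttle. The rough characters-per-token heuristic and the per-tier limits below are made-up illustrative numbers, not guidance from the talk.

```python
# Illustrative tier limits; real values would come from your tiering strategy.
TIER_LIMITS = {
    "basic":    {"max_prompt_tokens": 500,  "max_requests_per_min": 10},
    "advanced": {"max_prompt_tokens": 2000, "max_requests_per_min": 60},
    "premium":  {"max_prompt_tokens": 8000, "max_requests_per_min": 300},
}

def estimate_tokens(prompt: str) -> int:
    """Crude ~4 chars/token estimate; a real tokenizer would be more accurate."""
    return max(1, len(prompt) // 4)

def authorize(tier: str, prompt: str, requests_this_minute: int) -> bool:
    """Decide whether a request flows through to the model back end."""
    limits = TIER_LIMITS[tier]
    if requests_this_minute >= limits["max_requests_per_min"]:
        return False  # classic frequency throttle: calling too many times
    if estimate_tokens(prompt) > limits["max_prompt_tokens"]:
        return False  # the new wrinkle: prompt too complex for this tier
    return True
```

The same function could just as easily emit an IAM policy document (allow/deny) rather than a boolean, which is the shape a Lambda authorizer actually returns.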

Now, a whole other dimension of tiering, one that's an entirely different perspective on it: say I've got basic and premium tier tenants on the left-hand side. And on the right-hand side, I've got all these different LLMs that are available to me, which all have their own strengths and specialties, things they're good at and things they might not be good at.

Could I, as part of my strategy, choose to say the LLM you get to use as part of my experience is part of your tier strategy? So basic tier tenants get LLM one and premium tier tenants get LLM two. The assumption here is that there's something about these LLMs that would make them more appealing at a higher tier than at a lower tier.

But that's where the hard part comes in, because what's appealing about a model, and why you might want to offer it at a higher tier, is an "it depends" kind of answer. Your domain and the nature of what you want to do with that LLM are going to have a lot to do with how you decide what's basic and what's premium here.

But I still think it's a valid model. It just depends on whether it will apply to you, based on what you're actually doing with your solution.
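Mechanically, tier-based model routing can be as simple as a lookup at request time. The specific Bedrock model IDs below are illustrative placeholders, not a recommendation from the talk:

```python
# Sketch of tier-based model routing: which LLM a tenant's tier entitles
# them to. Model IDs are examples only.

TIER_MODEL_MAP = {
    "basic":   "amazon.titan-text-lite-v1",
    "premium": "anthropic.claude-v2",
}

def model_for_tier(tier: str) -> str:
    """Resolve the model ID for a tenant's tier, defaulting to the basic model."""
    return TIER_MODEL_MAP.get(tier, TIER_MODEL_MAP["basic"])
```

Combined with the isolation policies shown earlier, the same tenant context that scopes credentials can also drive which model the request is routed to.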

Now, in terms of pricing: people are coming at me left and right asking, what should the pricing model be? What's your guidance on pricing for SaaS providers?

And I still think, again, this is an emerging space for which there's still lots to be learned. I would think the generative AI services themselves will evolve a ton in terms of how they approach price. But right now, if I were to say, what are the options?

I think one way to look at pricing is based on experience alone. In this whole model James described with raw and fine-tuned models, I could say: my basic tier tenants, you all share one model, you all get the same experience, that's it.

But my premium tier tenants, I'll offer you tenant customization. Or I could do something purely SLA- or throughput-based, if I want, and say I'm going to throttle basic tier tenants at a certain level, but premium tier tenants get to consume more. By the way, nobody gets wide open.

There's going to be throttling here no matter what, but premium tier maybe gets to consume a lot more than basic tier does. And then there's the last model, which is perfectly reasonable: imagine an environment where the nature of what's being asked of the gen AI service under the hood is wide open and changing all the time,

and I have no idea what the complexity of the input or output of these requests is going to be. Well, that means the cost could be swinging wildly all over the place. I might do cost-per-inference here and just say, I'm going to correlate this very closely to whatever the cost was on the back end for me to perform this task.

I'm not a fan of passing the cost of the infrastructure straight through in my pricing model, so I'd like to try to abstract that away a little bit, but you may still have to take this approach. And the real answer is, it's probably some combination of these things, right?

There's no one-size-fits-all here. You're not just going to be in one of these buckets; you might need a mix of these things. And again, caveats are important here. I still think next year there'll be a lot more to say about this. But this is, I think, a reasonable starting point.
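A minimal sketch of the cost-per-inference idea: charge based on the tokens a request actually consumed, with a margin layered on so you're not passing raw infrastructure cost straight through. All prices here are made-up numbers for illustration.

```python
# Illustrative token prices and markup; not real Bedrock pricing.
INPUT_PRICE_PER_1K = 0.0008   # $ per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.0016  # $ per 1K output tokens
MARGIN = 1.30                 # 30% markup to abstract away the raw cost

def price_inference(input_tokens: int, output_tokens: int) -> float:
    """Tenant-facing price for one inference, correlated to back-end cost."""
    raw_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K \
             + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return round(raw_cost * MARGIN, 6)
```

This is the model you'd reach for when complexity is unpredictable; for the experience-based or SLA-based models, pricing would instead hang off the tier definitions.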

The last bit on pricing models: if we take classic silo and pool models, we've always talked about how you could have a fully pooled environment where all tenants are sharing all things, and you could have siloed environments where tenants get dedicated resources, but they're all owned and operated through one collective experience.

What would that look like in a generative AI environment? Well, I could have this fully shared environment where everybody's sharing the same model, and in that model I might do on-demand pricing, right?

We have different pricing strategies that I can use.

It'll all have to do with how frequently this pooled environment is invoking the service, because I could choose something other than on-demand here. But also, in here there's no training cost for the model, because we're not doing anything specific to these tenants. And the general idea is, on this side of the world, I'm trying to keep the experience pretty lightweight and maybe limit my costs on the generative AI side.

And then the silo side is where you get all this rich model customization. In fact, if you want to do model customization, you're going to fall into another bucket of price points from the generative AI service. So no matter what you choose, silo, pool, or whatever, if you're going to do model customizations, you're going to incur additional costs and expectations from the gen AI service.

Now, what does measuring this consumption look like? Here I've got an app plane with requests coming in, and I'm about to make a request to the gen AI service. I'm going to intercept it, evaluate the complexity of the request, remember that, and then send the prompt off to Bedrock. And when it comes back, I also have to evaluate it on the way out. So now I have the complexity of the request coming in and the complexity going out.

I'm going to record and capture that. If you've seen our examples, we've talked about needing a metrics and analytics service running in the control plane of your SaaS environment. You want to aggregate all that data there and choose how you're going to use it. Is it just for internal analysis? Is it for billing? That's all up to you.

You could evaluate and record it on the way in and then evaluate on the way out separately if you want; those could be two entirely separate events. But the key thing is: you need this data and you need to be evaluating the complexity. This probably overlaps with the throttling evaluation of complexity, so there may be some bits of code you could share there.
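A sketch of what that metering might look like, with an in-memory dict standing in for the metrics/analytics service in the control plane (the structure and field names here are assumptions for illustration):

```python
from collections import defaultdict

# In-memory stand-in for the control plane's metrics/analytics service.
metrics = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "calls": 0})

def record_inference(tenant_id: str, input_tokens: int, output_tokens: int):
    """Capture a request's complexity on the way in and the way out."""
    m = metrics[tenant_id]
    m["input_tokens"] += input_tokens
    m["output_tokens"] += output_tokens
    m["calls"] += 1

# In the app plane, this would wrap the actual model invocation:
#   in_tok = estimate_tokens(prompt)       # evaluate on the way in
#   response = bedrock.invoke_model(...)   # send the prompt to Bedrock
#   out_tok = estimate_tokens(response)    # evaluate on the way out
#   record_inference(tenant_id, in_tok, out_tok)
```

Whether those aggregates feed internal analysis, billing, or both is the choice the talk leaves up to you.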

OK, James, I'm going to hand over to you real quick to talk about tooling.

All right, thanks Todd. So the reason we put this section on tooling in here is that there are a lot of productivity gains you can realize by using the work that's being done on open source frameworks and various tools, so you're not building things from the ground up.

There's a lot of activity going on in this space; these are just a few of the many projects out there being actively worked on. One of the earliest and most popular frameworks is LangChain. It's an open source project, in Python and JavaScript, that creates abstractions, implementations, and wrappers around the most common gen AI development patterns that you've seen.

So we've talked about using prompt engineering with RAG. There's the chain of thought we talked about with agents; LangChain has its own agents capability built into it that you can use. Chains are a way to chain multiple calls to LLMs together to get some desired outcome. It also helps with prompt engineering and prompt templates.

So it's really a productivity tool that lets you develop gen AI applications more quickly. LangChain also has implementations around several AWS services, including Bedrock, so you can use LangChain to construct a gen AI application around Bedrock. Other AWS services too: Kendra is in there as a RAG source.

Amazon Personalize is in there as a RAG source, as well as other foundational or infrastructure services, such as ElastiCache for retrieving memory for a chatbot experience. So LangChain is a great library; there's a lot of activity going on there, and I expect it to continue to evolve.

Hugging Face is a community and platform that helps developers build, train, and deploy machine learning models. It also provides a lot of libraries, and it has thousands of open source models and foundation models that you can bring into SageMaker through SageMaker JumpStart.

And so that's a way to quickly spin up some of these open source models in SageMaker. Of course, that comes with more responsibility: you then have to configure, maintain, and operate it, and the pricing model is a little different there versus the more consumption-based, token-based pricing we talked about with Bedrock. But if you really want more control over your gen AI experience and you have machine learning expertise, using Hugging Face with SageMaker JumpStart is a great option to consider.

And then finally, we've talked a lot about Bedrock today and some of the tooling on top of Bedrock: not only being able to access these foundation models, but also being able to build agents and knowledge bases. Guardrails is an exciting announcement we had this week for doing content moderation around input and output prompts. So there's a lot going on with Bedrock.

A really quick look at what LangChain looks like in action. This is a somewhat contrived example. Here we have some code, in Python, where we're bringing in Bedrock through LangChain. We create the inference parameters, the arguments we're going to pass to the foundation model. We create a client, using Anthropic Claude Instant here. And then, with a prompt template, we can build a prompt and invoke the model to get back our response. So that's how simple it is in LangChain to interface with a Bedrock foundation model.

And an embeddings-based example: BedrockEmbeddings is another class in LangChain that you can use, here with the Titan embeddings model. You can generate the embeddings and then use them to index documents, say in the OpenSearch index that we saw earlier. Back to you, Todd.

Cool. Thanks James.

OK, just quickly here, a few takeaways, some final points we'd like to leave you with before we wrap up.

I hope you can see that this whole notion of fine-tuning and RAG, these are super powerful constructs that can have a lot of multi-tenant implications for you. They're good in general, but they're especially valuable in this area, where we're trying to figure out how to offer tenants different experiences.

And then, back to this point, which I've probably drilled into too much already: figuring out how and where to customize your system, what the right points are to introduce RAG and fine-tuning, and where that data is going to come from, is a big part of this problem. You definitely want to start there and work backwards.

And then these foundational SaaS concepts, silo, pool, isolation, all these things we always talk about: you should see that they overlay, or at least intersect with, some of the concepts here. So don't forget that you still have to be thinking about those fundamental principles as part of biting some of this off.

Noisy neighbor, that whole notion of throttling, should be a big area for you to look at as part of this. And as I said, throttling and pricing, basically all of this, is influenced by complexity. This whole notion of complexity, what's the complexity of my prompt, what's the complexity of my output, has a cascading impact across how you throttle, how you bill, how you price. It has all kinds of influences you want to think about here.

And then you have to figure out which of the deployment strategies fits: silo or pool, and do you want to be provisioned or on demand? You have to sit down and figure out which combination of those things matches your workload. Are we light in the way we make requests, so we could be on demand, or do we need provisioned throughput? Do we want to do all this tenant customization? If we do, we have to figure out which flavor of Bedrock is going to be best for us, and so on.

And then that little snippet of LangChain, even though it wasn't super multi-tenant SaaS specific, is a reminder that the tools here are your friend. These tools can make a whole lot of this easier, so I highly encourage you to lean into them and figure out what works best.

There were breakout sessions that have already occurred for the most part; I don't know that there are any left, but if you want to go find them on video after the conference, here's a list of them. I'm going to skip through these because you don't need them, and that's it.

So thanks for turning out, really appreciate you being here. Thanks, James, for your help; it was great presenting this content for the first time. We'll do questions over here if you have questions.
