Building cost-optimized multi-tenant SaaS architectures

Well, good morning, everybody. Welcome to re:Invent. When I saw 9 a.m. on my schedule on Monday morning, I was sure this was going to be a lightly populated room, because not everybody wants to get up at 9 a.m. on the first day. But welcome. Hopefully we can get your re:Invent off to a good start with a good session.

As the slide says, I'm Todd Golding. I'm a solutions architect, and I've been working at AWS in the SaaS domain for about eight years now, working with various customers, domains, and use cases: people migrating, people building greenfield. So I've had a chance to see an awful lot of solutions and organizations in various stages of moving to SaaS.

And almost every discussion we have about SaaS, no matter which one of those profiles you come from, has some theme centered around this idea of cost. In general, no matter where you're at and no matter what the economic situation is, cost is often one of the big motivators for people to move to a SaaS delivery model. People really want the economies of scale of SaaS. They're often coming from environments where cost has become a huge part of the problem for them as an organization, and they're looking at SaaS as a way to solve those problems.

Now, over the years my team has built a number of targeted solutions to help people go after some of these cost problems. I've done talks on cost-per-tenant strategies: how do you calculate cost per tenant and figure out which parts of your infrastructure your tenants are consuming? We've talked about tiering and pricing and other strategies, all about driving a better cost experience. But those have been targeted pieces here and there. So for this year at re:Invent, I said, let's take a step back from this cost story and look at the big picture: what are all the moving parts? If I were in your chair today, what are the big architectural patterns, themes, and strategies I ought to be thinking about for cost optimization? And I suspect, as we go through this, you'll see that my scope of what I consider cost optimization might be bigger than what you expect.

And I would say our goal here isn't just to give you a specific "go do X or go do Y"; it's to give you a mental model for how to think about cost optimization. That doesn't mean we're not going to get into architecture. We're absolutely going to get into architecture detail. We're going to look at the areas of architecture you ought to be focusing on and how you design the architecture of your solution to drive a better cost experience. But I'm going to connect that to more concepts and ideas that might broaden your view of how you think about cost optimization.

Now, this is a 300-level session, so we're not going to crack open the IDE, and I'm not going to write a whole bunch of code or do a bunch of demos. This is intentionally a 300-level session: we're going to look at architecture diagrams, patterns, and strategies. Absolutely technical, but it is 300 level. So, just to level set, hopefully that matches the abstract and why you're here. And if it doesn't and you want to go do some other session, that's absolutely fine with me. I definitely want you to be in a session that's a good fit for you.

Now, when we talk about cost optimization, there's a very classic view that almost everybody has. Almost every team I go to says, hey, we have some environment, or we're about to build some environment, where customers are running their own stack of services and resources. It might be a managed-service kind of environment where some customers have their own copy of the solution and some don't. And they see the pain that comes with having this environment. Whenever you have anything one-off for any customer, it isn't a great cost story, right? We end up with idle resources, we end up with over-provisioned bits of our environment. Supporting these one-off environments is not a good cost story for any organization, and it's behind a lot of why people have adopted SaaS and PaaS as a strategy: to overcome the challenges that come with this.

So with that mindset, people tend to think cost optimization for SaaS is all about taking these customers and pouring them into some multi-tenant environment where we share everything: we share compute, we share storage, we do everything we can to take advantage of the scale and elasticity of the cloud, and we just horizontally scale out based on the needs of our tenants. And that is absolutely a great cost story, and absolutely 100% a good reason to go to SaaS, right?

However, I think the SaaS cost optimization story isn't just about sharing infrastructure. If it was, we could basically stop on that last slide and say thanks, everybody, great show, all good for today. No, there are a lot more moving parts to this story, even when we're sharing infrastructure.

So what I really want you to do is think as an architect, not as a business person. This will sound like business, but I guarantee it's meant to be technical: I want you, as an architect, to think beyond your infrastructure bill. I don't want you thinking about cost as just, hey, we got our AWS bill and it's 10% less than last month because we did something, so we're done with cost optimization. That's not the end of the story. If you think about cost in a multi-tenant environment, one of the areas I really want you to focus on is how you support growth. The whole idea of going to a SaaS delivery model is that we want to reach new markets and new segments, we want to grow our business faster, and we want to grow it in an efficient way. So part of what we're trying to do as part of our cost story is have the business grow rapidly without absorbing all the extra cost we traditionally absorbed when we had per-tenant resources. And we'll absolutely focus on how we achieve that efficient growth. That's part of our story.

The other piece that might surprise you, a huge part of this that often gets left off the spectrum entirely, is operational efficiency. If you look at really big SaaS companies, especially B2C SaaS companies, at what they've built and the number of tenants they have running in their environments, they take great pride in how few people are on their operations teams. They've built great tools and great constructs to scale their operations teams as their business grows. In fact, this connects to that first item, efficient growth: I want my operations team to be able to match that efficiency model as well. And I feel like this somehow gets put on the other side of the accounting bracket, as in, yeah, there's an ops team, they're in the budget, somebody pays for them, but they're not part of my cost optimization story. No, they're a huge part of it. In fact, in a SaaS environment, the line between ops and the development teams and everybody else is usually blurred significantly. So when you think of your cost story, think about how big that ops team is, how much effort they have to spend to support your environment, how much effort it takes for them to find problems and troubleshoot issues. That's all part of this cost story.

And then the other one that I think gets lost in the shuffle is this idea of understanding the workloads in your environment. You are not supporting every tenant equally. A lot of people will say, well, we'll bill based on consumption of some parameter, some dimension of the system, and we're done: now it's cost efficient. If they consume more, they pay more. But that usually doesn't account for the realities of what's actually happening under the hood. Because, as we'll see in some of the examples, I can have tenants who are not paying very much but are putting huge workloads on my environment and making my infrastructure bill go way up, and other tenants who are paying me a premium but barely consuming the system and barely contributing to my infrastructure bill. There has to be good alignment between the workloads in your environment, the personas you're supporting, and how they impose costs on your environment. You have to deliberately go after that. You have to think: if this tenant is really over-consuming my system and not paying that much, what should I be doing in my architecture? What strategy should I have in place to be sure they're not over-consuming my system?

So that's a huge part of it. For me, this is all an efficiency game. You're probably going to be tired of me using the word efficiency by the end of this talk, because cost optimization isn't just "how big is my infrastructure bill?" Cost optimization is: how do costs grow efficiently as my business grows? Costs are still going to go up, but are they going up proportionate to the growth of the business in a way that's good for the business? I have a simple graph here just to drive this point home; nothing magical about it at all. If you look at revenue, this top line, we'd say, hey, we invested in all this automation, we got all this great multi-tenancy, we're now able to go after new markets and new segments, and the hockey stick of growth has taken off for the business.

The real measure of whether I've done the right things with my cost strategy is this bottom line. Did I build the right operational mechanisms? Did I build the right cost efficiencies into the way my architecture scales? Did I build all the mechanisms into my environment to have it scale efficiently as the business grows? Essentially, you want to get to the mode where, if I said tomorrow you add 100 tenants, the cost doesn't go up by 100 proportional increments, whatever that amount would be. That is our goal, and that is the theme of efficiency.

So don't think of cost optimization as a bottom-line number; think of it as a pattern or a theme. The other bit of this, and it makes cost optimization an interesting story, is this whole notion of variation: workloads that change for tenants all the time. If you think about a multi-tenant environment, you have new tenants showing up all the time. You have tenants consuming your system in different patterns, different workloads. Some are consuming one part of your system more than others. Some are pushing compute more than others are pushing storage. Some are hammering the API but not doing much else. They're all over the spectrum in terms of how they're consuming your system. And that's just how they're consuming it today, so you'll see I have "profile today" up there.

Now, if I come back tomorrow and look at this, that distribution might be entirely different. So for me, a huge part of the cost optimization challenge is: how do we allow for this variation? It's natural to the business; the business wants to support this variation. But how do we do that and still build good scaling models? How do we build good resilience models? How do we build all the underlying architecture that will support this dynamic model without eroding the cost of the environment? Because you could easily say, if we don't care about cost, yeah, I can support all of that: I'll just provision for the worst-case scenario, we're all good, everybody can run well. That's obviously not a great cost optimization story.

So for me, the number one thing people struggle with in this whole story, and it will be a struggle, is how to deal with all these variations in workloads. The first area we're going to go after is the most obvious one: how do we align the activity of tenants with consumption? That is the number one goal when you start this path, but it isn't the only goal. Hopefully you see that; we'll get into operations, we'll get into tiering and these other strategies. But certainly job one here is to be sure that, if we look at how tenants are consuming our environment, their consumption pretty much aligns to activity.

And this is a graph I've shown a bunch of times in other talks, but it's my fantasy version of the architecture, if I were to build the ultimate SaaS architecture. The blue line here represents tenant consumption, and obviously in a multi-tenant environment with those varying workloads I just mentioned, it could be all over the map; that graph could be changing all day, every day. And in my fantasy world, if I build a really good, cost-optimized architecture, the red line, which is what it's actually costing me, how much infrastructure is being provisioned, how many resources are needed to support this experience, would track that blue line very closely. The gap between the blue and the red would be very small.

That means we're barely over-provisioning resources. We're bringing just enough resource to give tenants what they need, and if consumption goes down to zero, my bill goes down to almost zero as part of that experience. That is the dream of every SaaS architect. It isn't the reality of every SaaS architect. Even though I show this diagram and everybody says, great, we'll do this, and there are areas, like I show in a serverless talk, where with serverless compute you can actually get pretty close to this, it's not the reality, because you have to run all kinds of resources that don't dynamically scale so gracefully. Not every resource can scale exactly to what you want.

So you're going to have bits of your system that are over-provisioned, areas where that gap is going to be a little bigger than you expect. And that's OK; that's part of the reality of this. The question is: are you trying to make that gap as small as you can, and where it's bigger than it needs to be, is it because it has to be, because you've gone through the math and figured out that's just what you have to do for your environment? And are you continually looking for another way? In fact, I go into lots of organizations that have some version of this graph, and they're asking me, can you fix it? And then we're figuring out: what can we move you to? What technology, what strategy can we change, to get closer to this experience?

Now, to me, I break this infrastructure optimization, and that's the bucket we're in right now, into a few categories. And this is where cost optimization gets a little interesting: we could have multiple models of how resources are deployed in our environment. One model, shown here, is what I call pooled. If you've seen a lot of our content, pooled just means that the infrastructure is shared by many tenants. And if I have a pooled environment, there's a whole cost optimization strategy for figuring out how to scale those shared resources. It's more the classic dynamic scaling, classic elasticity, the model most people know. But we could also have a scenario where some of our tenants in the same solution are running in a silo, meaning dedicated environments.

In this model particularly, we call these full-stack silos, meaning a tenant gets their own whole stack of resources. Why would you do that in a SaaS environment? By the way, people do it all the time. Lots of solutions I run into have a mix of silo and pool, because some tenant comes along and says, we are willing to write a big enough check if you will let us run our own system. So now we're going to allow that. It's still going to be an instance of our pooled environment with one tenant in it, but it has a different cost optimization profile: optimizing for that workload versus an environment where multiple tenants are running looks very different, the strategies you implement, the approach you take to scale. With this siloed model, we have a more predictable consumption pattern. We have to think more about idle resources and worry less about spiky loads, almost the inverse of the things we have to worry about in the pooled model.

The reality, though, is you're not usually just siloed or pooled; even within a siloed or pooled environment, you'll have some mix of silo and pool. So in this example, just to drive that point home, my web app tier is pooled, my microservices are siloed, and down at my storage level I've got a mix: some things are siloed, some things are pooled. I point this out to make it clear that your approach to cost optimization, just for infrastructure, means you're going to have to look at policies and strategies that span all of these experiences. The landscape of what you have to do, and the complexity of it, will go up. But this is part of being a SaaS architect and saying yes to the business. It's about saying yes, we can support siloed models; yes, we can have just some resources or just some microservices siloed and some pooled, and we'll figure out how to build a good cost optimization story.

My point is, I don't care which one of these you're using, or which combination; you should still be thinking about cost efficiency across all of it. Now, if we focus on compute, this is probably the area you all know the most about, so we'll go through it pretty quickly: how am I going to get good cost optimization out of compute?

Well, the simplest one is classic horizontal scaling, so it's the one I'm going to spend the least time on. Let's say I have some order microservice running on EC2, and I'm just going to scale out instances based on however much load is being put on that order microservice. And I'm going to try to do that efficiently enough that I don't end up with a bunch of over-provisioned infrastructure. Just classic, out of the box, and very little multi-tenant about it at all.

I will say the unit of scale here is the entire order microservice. It's every operation in that microservice, independent of where the load falls within it. If it starts to get too much, we're going to have to spin up more EC2 instances. Also, with EC2, spinning up new instances in response to spiky loads can be a little slower, so you have to think about more over-provisioning here. You probably have to be a little more prepared for a spike to come than you might have to be with other technologies. But it's a perfectly good way to look at this.
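To make that concrete, here's a minimal sketch of a target-tracking scaling policy for an order-service Auto Scaling group using boto3. The group name, target value, and warm-up time are all assumptions for illustration, not values from the talk.

```python
# Hypothetical sketch: target-tracking scaling for an "order service"
# Auto Scaling group. The group name and numbers are illustrative.
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU near 50%. The whole microservice is the unit of scale,
# so every operation in the service scales together.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="order-service-asg",   # assumed group name
    PolicyName="order-service-cpu-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        # A lower target leaves more headroom for spiky multi-tenant load,
        # at the cost of more over-provisioning.
        "TargetValue": 50.0,
    },
    # EC2 instances take time to boot, so give new capacity time to settle
    # before the policy reacts again.
    EstimatedInstanceWarmup=180,
)
```

The trade-off described above shows up directly in the target value: the smaller the target, the bigger the gap between the blue consumption line and the red cost line.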

In contrast, if we look at Lambda and take that same order microservice, this is where the serverless world aligns perfectly with the SaaS universe. I still have my same order microservice, I still have these get-order and update-order functions, but the unit of scale in my environment is the individual function. So I don't really care what the load is right now. If the load tomorrow is all get-order and nothing on update-order, and the next day it's inverted, I don't care, because I'm still only going to pay for what I actually consume. This is the beauty of serverless.

And the reason we always say serverless for SaaS is that you don't chase any scaling policies here; you just rely on the managed service to scale for you. You aren't even worrying about what kind of compute instance you're picking; none of that is in your world. So for me, this model takes so much off my plate, both from an operational perspective and from a design perspective. I don't have to worry as much about how my microservices are decomposed, or whether I've got exactly the right microservices, because my unit of scale is the individual functions within my microservice. So it's a much better fit, with no worry about idle compute. Now, is Lambda for every workload? No, it's not. You can't put every single workload into Lambda.

So that doesn't mean one hammer, just go solve everything with Lambda and you're done. No, but it's a really good fit, and we have a great reference architecture on Lambda out there that shows off a lot of these capabilities. And then of course, if we're going to talk about compute, we can't not talk about containers. EKS is wildly popular among SaaS providers, and to me it represents a really good compromise between Lambda and EC2, because containers spin up really fast. I can handle spiky loads and changing workloads in a multi-tenant environment and still get good cost efficiency.

Obviously, here it's more about how we're scaling the underlying nodes in the cluster, but in general I get really good mechanisms. I can still get rapid horizontal scale, I don't usually end up with a ton of over-provisioning in this space, and I get access to all the tools in the container ecosystem that let me bring additional cost optimizations to my environment, tools like Kubecost, for example, that give me visibility into cost.

The other thing here is I get multiple deployment strategies: I can use namespaces per tenant and all these different mechanisms to implement my cost savings model. One piece you'll notice here is Fargate. I can actually bring that serverless goodness with me over into the container world if I want, and say I don't care what nodes are underneath my EKS cluster; I'm going to run on Fargate, and now I don't have to worry about how to scale the underlying cluster, and I don't have to go write policies. I'm basically going to let the compute in my environment rely on a serverless model.

So it's kind of the best of both worlds: I get containers, if that's where you want to be, but I still get some of the goodness of the Lambda experience. I still don't think it's one for one, because the function level of granularity is a little more powerful for some of the scaling, but it's still a really good fit.
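As a rough illustration of that Fargate option, here's a hedged sketch of creating a Fargate profile for an EKS cluster with boto3, so pods matching a selector run on fully managed capacity. The cluster name, role ARN, subnets, namespace, and label are all assumptions.

```python
# Hypothetical sketch: run a subset of pods on Fargate so there are no
# nodes to right-size and no cluster scaling policies to write.
import boto3

eks = boto3.client("eks")

eks.create_fargate_profile(
    clusterName="saas-cluster",                                   # assumed cluster
    fargateProfileName="pooled-order-service",
    podExecutionRoleArn="arn:aws:iam::123456789012:role/eks-fargate-pods",  # assumed role
    subnets=["subnet-aaa111", "subnet-bbb222"],                   # assumed private subnets
    selectors=[
        # Any pod in this namespace carrying this label gets scheduled onto
        # Fargate; everything else keeps using the cluster's node groups.
        {"namespace": "order-service", "labels": {"compute": "fargate"}}
    ],
)
```

The design choice here mirrors the talk: you trade a bit of per-function granularity for container packaging, while still paying only for the pods you actually run.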

So that stuff you probably already came in knowing, but we had to hit on it to get here. I do want to touch on one more dimension of EKS, because I was talking to a member of my team about cost optimization strategies, and one of the things we started discussing was: can I actually optimize in EKS all the way down to the type of instance I'm running inside my cluster?

So if, for example, I have an EKS cluster with pods running on nodes, and those nodes are running on m5 instances because that's the right instance type for that workload, could I say I want another node here, and I want that node to run a different instance type? Basically to say, hey, this particular microservice is really compute intensive, or it would be great to run it with a GPU instead of an m5. The whole point is that, at another level of granularity, at run time, I can match workloads with the appropriate instance type, and now I get better cost optimization out of that as well.

I think this is still a debatable point to some degree. You really have to get good alignment between the instance type and the workload to prove that you saved money. I'm not just going to throw GPUs at problems, obviously, because that's going to be expensive. But if the GPU is the right thing, and the alternative is scaling up a bunch of c5s to try to do the equivalent of that GPU, there's probably a point at which I get good cost savings out of it. So it's a little bit of a stretch, but it's something I think you ought to be thinking about, even in container environments: all the way down to the type of instance you're running on.
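One way this workload-to-instance matching is commonly done in Kubernetes is with a node selector on the standard instance-type label. Below is a minimal, hypothetical sketch using the Python Kubernetes client; the service name, image, and instance type are invented for illustration.

```python
# Hypothetical sketch: pin a compute-heavy microservice to GPU-backed nodes
# using the well-known instance-type node label, while other services keep
# running on cheaper general-purpose nodes.
from kubernetes import client, config

config.load_kube_config()          # assumes local kubeconfig access to the cluster
apps = client.AppsV1Api()

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "inference-service"},          # assumed service
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "inference-service"}},
        "template": {
            "metadata": {"labels": {"app": "inference-service"}},
            "spec": {
                # Only schedule onto nodes of this instance type.
                "nodeSelector": {"node.kubernetes.io/instance-type": "g4dn.xlarge"},
                "containers": [
                    {"name": "inference", "image": "example.com/inference:latest"}
                ],
            },
        },
    },
}

apps.create_namespaced_deployment(namespace="default", body=deployment)
```

Whether this actually saves money depends on the point made above: the GPU node has to replace enough general-purpose capacity to justify its price.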

The other thing you should be looking at here is how you can selectively silo bits of your system. Sometimes when I go in to talk to organizations, they say, hey, we've got some customers who want silo, so it's full-stack silo for them. You're either in the pool or you're in a full-stack silo; that's all they offer. Because as soon as they hear silo, they assume the customer needs isolation, has some compliance need, and has to be in a silo; that's the only way to do it. And of course, the minute we put the full stack into a silo, we start to erode some of the cost optimization story of our system.

Instead, what I'd rather ask is: what parts of the system need to be siloed? What you find is that, in the landscape of everything your system does, customers are mostly concerned about some subset of it and its compliance, regulatory, or isolation dimension. So can I break out just parts of my system, offer those as siloed, and have a mix of some resources pooled and some siloed? Now only certain services are siloed, and to me that gives a better cost optimization story, because I'm siloed where I need to be siloed instead of holistically siloing all the microservices in the system.

I think this is one of the most underused design patterns, and one people should be using. People should be thinking more and more about how they decompose their system into services and where they can support the varying needs of the business with a more granular approach to how they decompose the workloads.

Now, another approach, which takes an entirely different angle, is this notion of pods. Some people may know the term cellular architecture; I don't know that this is exactly cellular architecture, maybe it is, but we've been referring to it as pod-based optimization. Here, instead of thinking about how I get the compute to scale exactly the way I want, or how I get the storage to scale as cost-efficiently as I can, I'm going to take groups of tenants and put them into a pod.

So on day one, I say: here are these ten tenants, here's a profile for them, and I'm going to put them into this pod collectively. My unit of scale is the pod. Everything those tenants run is in the pod, and now I have to scale that pod for those tenants as effectively as I can. But my concern is no longer whether every little knob and dial inside the pod is dialed in, as long as the pod itself scales effectively. By the way, this also reduces the blast radius to some degree, because those tenants can only affect one another.

Now, when I decide, hey, I've reached the limits of that pod, I don't want to expand it anymore, I've got it dialed in pretty nicely, I spin up another pod and put the next round of tenants into it. Pods become the unit of scale, and we scale horizontally based on pods. My effort and energy to figure out the right cost-optimized way to scale is spent at the pod level, instead of across the entire architecture for all tenants at the same time.

In fact, I could have separate scaling strategies for pod one and pod two, and I can also move tenants between pods. I could say, oh, there's a tenant in pod two that's suddenly saturating everything, they suddenly have a different profile, so I'll migrate them into another pod and still keep the pod as my unit of scale. There are pros and cons to this; see the sketch below. It sounds awesome because it seems like it simplifies everything, but it makes operations a little harder, because now we've got these distributed pods and we have to figure out how to operate them successfully, and deployment is a little trickier. There are definitely things to think about, but it's worth considering.
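Here's a tiny, purely illustrative sketch of the placement logic this pod model implies: pods have a capacity budget, new tenants are packed into a pod with headroom, and a new pod is created when everything is full. The class names and capacity numbers are invented, not from the talk.

```python
# Hypothetical sketch of pod-based placement. A "pod" here is a full copy of
# the stack serving a group of tenants; pods, not individual resources, are
# the unit of scale. All names and numbers are illustrative.
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    capacity: int                          # max tenants this pod is dialed in for
    tenants: list = field(default_factory=list)

    def has_headroom(self) -> bool:
        return len(self.tenants) < self.capacity

pods = [Pod("pod-1", capacity=10), Pod("pod-2", capacity=10)]

def assign_tenant(tenant_id: str) -> str:
    """Place a tenant in the first pod with headroom, or spin up a new pod."""
    for pod in pods:
        if pod.has_headroom():
            pod.tenants.append(tenant_id)
            return pod.name
    new_pod = Pod(f"pod-{len(pods) + 1}", capacity=10)
    new_pod.tenants.append(tenant_id)
    pods.append(new_pod)
    return new_pod.name

print(assign_tenant("tenant-21"))   # e.g. "pod-1" if it still has headroom
```

In a real system the "capacity" would be derived from the consumption profiling discussed later, and tenant migration between pods would be its own operational workflow.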

I'd also say, if you're thinking about multi-region or a bigger global footprint, this can be a natural jumping-off point, because now I can take pods and deploy them into individual regions. If I already have a pod as a concept, going to new regions is less of an effort. Now, the hardest one, and the one I least like to talk about, is right-sizing storage, because everybody is dealing with this, whether they're using RDS or some other mechanism.

So they've got tenants, these tenants have multiple workflows, and let's say in this case they have some pooled microservice and some RDS instance out there. When you put that RDS instance out there, you have no choice: you've got to pick a compute size for it. What's the right size to choose? Well, I think the consumption is going to be X or Y, but no matter what, you're going to end up with some over-provisioned bit; you're going to have to pick some size. And this is just for one tenant. You'll see in this example it's a silo, so I probably have some chance of approximating consumption here.

Then I have another tenant in another silo, and I have to pick a size for them. And then I have these tenants running in my pooled model, and now I have no idea what to do for them. On day one, I'm just going to oversize that instance no matter what I do. And obviously, for organizations trying to cost optimize, RDS is just one example I'm picking on; there are lots of services that have you pick a compute instance. For any service like that, whenever I'm binding a specific compute instance size to it, I have to figure out what to do as the profile changes, specifically around cost optimization.

So for example, if the workload goes up for tenant one, I potentially have to migrate it to a larger instance size. And if for tenant two it's gone down, am I going to migrate down there? Well, now I'm building DevOps tools. I worked with an organization that had a whole team that did nothing but migrate RDS instances between instance sizes just to try to dial in the cost for their system. And if you multiply this by tons of tenants, you can imagine how this becomes a problem.

And then imagine the cycle of this over time. Well, now the workload for tenant one is down, it's been down for a few months, let's take them down a size. And then what do I do over time, as my pooled environment keeps changing? Now, the reason I said I don't really like this section is there's supposed to be a big "so now go do X and you've solved the whole problem." No, this is just a real problem you have to think about. There are strategies, and we'll talk about them, but I don't want to ignore the reality that you're going to have to figure this out.

Now, if you're doing free trials or something of that nature, there are things you can do to constrain this, because it's a free trial: just limit the amount of load people can put on a free trial environment. But this is a hard area to chase costs in, and if you are going to go down this path, this is where I do point people at serverless as the best way to go.

So if you look at the different storage services out there, you'll see Aurora. You could potentially go to Aurora Serverless and get that. Now, you have to look at all the fine print to see: does Aurora MySQL do exactly what you were doing with RDS for MySQL?

You have to do the mapping to be sure it's all there. But the good news is that these storage services are more and more embracing the notion of serverless, because they see the value and the importance of it: I really don't want to think about what instance size to choose for the storage in my SaaS solution. I just want to pay for what I consume, and I don't want my operations team continually chasing what size instance we should be giving to this RDS instance today, right?
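For illustration, here's a hedged sketch of what that serverless storage option can look like with boto3 and Aurora Serverless v2 scaling, so database capacity follows tenant activity instead of a fixed instance size. The identifiers and capacity bounds are assumptions, and you would still need to verify feature parity with your existing RDS for MySQL setup.

```python
# Hypothetical sketch: an Aurora MySQL cluster with Serverless v2 scaling.
# Identifiers, credentials handling, and capacity bounds are illustrative.
import boto3

rds = boto3.client("rds")

rds.create_db_cluster(
    DBClusterIdentifier="pooled-tenant-db",      # assumed cluster name
    Engine="aurora-mysql",
    MasterUsername="admin",
    ManageMasterUserPassword=True,               # let RDS manage the secret
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,    # ACUs when the pooled environment is quiet
        "MaxCapacity": 16.0,   # ceiling for spiky multi-tenant load
    },
)

# Instances in a Serverless v2 cluster use the special serverless class
# instead of a fixed db.r5/db.m5 size.
rds.create_db_instance(
    DBInstanceIdentifier="pooled-tenant-db-1",
    DBClusterIdentifier="pooled-tenant-db",
    Engine="aurora-mysql",
    DBInstanceClass="db.serverless",
)
```

The point of the sketch is the shape of the decision: you set a floor and a ceiling once, instead of having an ops team repeatedly migrating instances between sizes.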

The other good news is this model also extends beyond storage. If you look at messaging, which I just grabbed as an example here, there are lots of good serverless bits in that experience as well. So when I sit down with customers who are struggling with this problem, the first question I ask is: is there a chance to get you into a serverless model?

And if you're doing something self-managed, this problem is even harder, because now you're not just dealing with the instance type, you're dealing with all the other operational overhead of managing it yourself. The last slide in this section on scaling is where I try to drive home the major themes: when you're thinking about aligning consumption and cost, how do you get there? How do you get to that graph? Obviously, it's a mix of things; it's not just one strategy. Certainly we want efficient resource consumption, and to me that's getting just enough resources, just in time, where you can achieve that; practically, it doesn't work for everything. I also think you have to look at how you're breaking your system down and deploying it. What are the deployment patterns? What's siloed, what's pooled? Where can I decompose the system differently to arrive at a more cost-optimized experience? And then the last one is: how much effort and energy is going into writing scaling policies across this experience? Are we investing a ton of time figuring out what the right scale is for today? The worst-case scenario is an operations team that's constantly adjusting on a daily basis: well, today we're struggling with X, so now we're going to rewrite the scaling policy for this particular part of the system. You can't scale as an organization very effectively if it feels like you're constantly chasing that. So pick a strategy that gives you something stable that's going to grow with you.

Now, I mentioned operations at the beginning as an area to focus on, and this is the part where everybody says, yeah, yeah, that's good, but we don't care about it. I've been in so many organizations where I'll tell them about the importance of operational efficiency, and they'll say yes, we need metrics, we need analytics, we need insights, we need all these things to make our teams more efficient. I leave, I come back six months later, and they've done nothing to build any of it. And I know why: features and functions win, basic scaling of the environment wins. But to me, if you're having the cost discussion with your business, you ought to be talking about this. It might be your job as the architect, maybe teamed with the operations team, to make them realize just how important it is to invest here.

In fact, let's take the graph we had earlier: I had revenue in gold, and now I've broken out costs, with infrastructure cost at the bottom. Suppose I make infrastructure costs great: the bill is down, it's growing at exactly the rate it should, we're getting great cost efficiency out of the infrastructure, but we hardly spent any energy on operations. I look at the cost and expense of operations, and I find that my operations overhead is growing at almost the same rate that revenue is growing; they're pretty close together. Well, now I haven't really achieved cost efficiency for the business, because each new tenant is adding new overhead. Maybe it's the complexity of our deployments, maybe the complexity of our automation, maybe multi-region is part of your world, or maybe you just don't have great tools to even know what's happening when something goes wrong in a multi-tenant environment. Whatever it is, you have to bring attention to the fact that we have to do something here to get this under control.

So what do I mean when I say scaling operations? It can be a really vague thing, because there's no single "go do X or go do Y" to get great operations. But to me, there are questions I go ask people: how hard is it for your operations teams to do certain things? How much visibility do they have? And guess where their visibility comes from: it comes from the stuff you bake into the architecture. That operations team only has as much visibility into scale, into isolation, into deployment automation as you give them. It's your job to bake in the information they need to create a really rich operations experience.

So: do I have alerts and alarms that proactively tell me when things are going on? How hard is it to troubleshoot an issue? If something's wrong with the system and a tenant calls in and says this part of the system isn't working for me right now, how much effort is it for the operations team to sift through the logs and go through all the mechanisms to figure out, oh yes, at three o'clock this happened? If the answer is that after three days of looking at logs they eventually figure out what went wrong, that's not good enough. I want to see the logs with tenant context. I want to see how certain services are being consumed with a tenant or a tier context. In general, I have all these questions I want to ask that tell me whether I have a really good operations experience and whether ops is running efficiently. It's a subjective thing, but it can still be a goal.

So to me, part of this is, as I said, you have to go do this work. This means instrumenting your environment. And this is where it all runs into a wall: somebody will say we've got X number of features, we're competing for X amount of new business, while a product owner or somebody on the dev team or in operations is saying, we don't have good visibility into how tenants are consuming the environment. I can't even tell you who's consuming this new feature you want to release, or how they're consuming it, or if they're consuming it at all. The only way that changes is somebody puts it in your backlog: you've got to go invest in this stuff. We've got to make these metrics and analytics and visibility as important as features and functions. Long term, that pays off; you just have to figure out what the metrics are and the ways of getting them. Obviously, the other piece here is that our logs have to be laced with tenant context. We need tenant context in the logs so that when I go look at something and say, what happened at one o'clock for tenant two, I can see just what happened for tenant two. And then I can say, oh, they were having trouble with this microservice; go show me how microservice X was scaling at two o'clock. Oh, look what's happening: they're imposing this specific workload I never expected. If I can do that quickly as a team, suddenly I'm proactively solving things for the tenant, but I'm also way more efficient, I need fewer people, because I have better tools to support this.
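As a small illustration of "logs laced with tenant context," here's a minimal Python sketch that attaches a tenant ID and tier to every log record and emits JSON lines. The logger name, tenant values, and field names are assumptions; in practice these would come from the request's identity token or tenant context.

```python
# A minimal sketch of tenant-scoped structured logging. Tenant and tier values
# are assumed; normally they'd be pulled from the caller's JWT or request context.
import json
import logging

class TenantContextFilter(logging.Filter):
    """Attach tenant context to every record so logs can be filtered per tenant."""
    def __init__(self, tenant_id: str, tier: str):
        super().__init__()
        self.tenant_id = tenant_id
        self.tier = tier

    def filter(self, record: logging.LogRecord) -> bool:
        record.tenant_id = self.tenant_id
        record.tier = self.tier
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    json.dumps({"time": "%(asctime)s", "level": "%(levelname)s",
                "tenant_id": "%(tenant_id)s", "tier": "%(tier)s",
                "message": "%(message)s"})
))

logger = logging.getLogger("order-service")                 # assumed service name
logger.addHandler(handler)
logger.addFilter(TenantContextFilter("tenant-42", "basic"))  # assumed values
logger.setLevel(logging.INFO)

logger.info("order submitted")   # emits one JSON line with tenant_id and tier attached
```

With logs shaped like this, "what happened at one o'clock for tenant two" becomes a simple query on the tenant_id field instead of days of sifting.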

And so for me, this is all about having dashboards, and it's an awful drawing, but having multi-tenant-specific operations dashboards. Yes, AppDynamics, New Relic, Datadog, all these great tools are out there; continue to use them. But even if you use them, you still have to ask: what views do we need into our system? What are our multi-tenant views into how resources are being consumed and how load is being imposed on our environment? Where are my multi-tenant alerts and alarms? I don't just want to see that compute is generally being consumed this way and memory that way; I want to see what tenants are doing. So either in those tools, or via Grafana or QuickSight or whatever you want to use, go build your own dashboards, whatever combination of tools gives you the views you need. You want your ops teams to have this at their fingertips. Imagine doing that job without these tools: it's a much harder job, and that erodes the cost efficiency of your business.

So how might you measure this? I don't want to leave it nebulous, so I wanted to give some examples. Certainly, as part of operations, one of the things I want to know is how quickly tenants are able to onboard onto my system. That's a huge measure that nobody thinks about. Usually teams scale the whole application infrastructure and get all that working really well, but then I ask them: what if 100 tenants showed up tomorrow and went through your onboarding process? Would the onboarding process scale effectively? Would the ops team have a view into how it's scaling? Does the business have a view into how easily tenants are onboarding? If I have good automation and the right tools around that, that's a really efficient, cost-efficient experience.

I also want to look at deployment. Deployment is really tricky across silo, pool, and all these different deployment models. What does it mean to roll out a new version of our solution, and how effectively did those rollouts go? What's the mean time between failures? How often are defects escaping? All things I want to know about. And then, last, tenant lifecycle metrics: how do we have visibility into which tenants are nearing renewal? This ties into the customer success part of SaaS, where operations sometimes touches customer success. How do we see tenants moving between tiers of our system, upgrading from basic to premium, and what kind of tooling do we have for that? By the way, moving between tiers is one of the hardest parts of SaaS; it's a really tough problem. Do we have good tools and mechanisms for it, or does it take four people from the ops team and a week of effort to migrate a bunch of tenants? That's not very cost efficient. These aren't necessarily the exact metrics you have to go after; I mostly put them here because I think that if you attach metrics to the operations story, you bring more visibility to the importance and the cost implications of having great tools and great mechanisms here.

Now, the other thing here: I talked about looking at tenant workloads and tenant profiles and how they're consuming your system, and this is another area I think is under-invested. A lot of people say, we know who our tenants are on day one; they just assume those tenants will consume the system a certain way, and that will be the story. And I find, over and over again, that people who didn't invest in asking themselves hard questions about tenant personas end up with challenges here.

As an example, here's an e-commerce solution, one I actually worked on before I was at AWS. It was a very successful e-commerce platform with a huge pool of tenants and a lot of activity; it seemed like a very successful system. But the cost of the system was eroding the bottom line of the business. So we dug in. We didn't have the metrics we needed to really tell us how tenants were consuming our environment, so we profiled consumption in three different ways. We looked at how tenants in the different tiers of our system were imposing infrastructure costs, how much revenue they were generating for the business, and, since it was e-commerce, whether there was some correlation between how many products they were selling, the size of their catalog, and the way they were imposing costs on our system.

Basic tier tenants, I don't know, maybe they were paying us 50 bucks, and an advanced tier tenant might have been paying us 5,000. And they paid us basically based on the number of transactions, the number of sales they had on their site, so our revenue was tied to their revenue to some degree. Now, when we got this data and looked at the basic tier tenants, what we saw was that these tenants had massive catalogs, sometimes selling millions of products, and they were imposing huge loads on the infrastructure of our system, and specifically on the cost of our infrastructure, because they were uploading a million products every day. They didn't know how to use the upload APIs of our environment efficiently, and we just let them do whatever they needed to do to keep their store running.

But they weren't actually selling hardly any products. In fact, the size of their catalog usually meant they were just trying to blast a whole bunch of products out there and hope people would buy them. So the revenue for us was low, but their infrastructure cost was very high.

Meanwhile, on the advanced tier side, we had tenants with really small, focused catalogs. They sold some little product they were really good at selling, they drove it like mad, and they generated tons of revenue from it, but, ironically, they didn't generate a lot of infrastructure overhead for us. So when I look at this and say I'm going to do cost optimization, if I don't do something about it, I have a situation where the cost load somebody puts on my business is inverse to the revenue they generate for my business.

So you have to look at what you're going to do to create a tiered structure that strikes a better balance between the load tenants put on your environment and the experience they're getting. You'll never get a perfect balance here, but obviously one of the ways to go after this is with tier-based throttling.

So what we recommend to people is to put real, hard boundaries on how these tenants can consume your environment. Here I've got basic, standard, and platinum tier tenants. They're all coming into my environment, all hitting the API Gateway with some number of requests, and what I'm going to do, and this is straight out of our Lambda reference architecture, is associate usage plans with an API key for each one of these tiers. The usage plan will basically say: at basic tier you can consume at this level, at platinum tier you can consume at this level, and so on. And now those tiers will have different experiences.

So the person who wants to upload a million products to their catalog every day is probably going to get throttled in the basic tier, because they're imposing excess load. And if they're unhappy about getting throttled and they call into customer support, or whoever they call into, and say, hey, I can't upload my main catalog, I'm going to say, well, you need to be over in the platinum tier if you want to be able to do that. It's an overt way to manage how they're consuming your environment, and you shouldn't feel guilty about that being a range of experiences. That's the intent. That is the whole point of tiering: to create this balance of cost.
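Here's a hedged sketch of what those tier-based usage plans can look like with boto3 and API Gateway, in the spirit of the serverless SaaS reference architecture the talk mentions. The API ID, stage name, rate limits, and quotas are all assumptions for illustration.

```python
# Hypothetical sketch: tier-based throttling with API Gateway usage plans.
# IDs, limits, and quotas are illustrative only.
import boto3

apigw = boto3.client("apigateway")

tiers = {
    "basic":    {"rateLimit": 10.0,  "burstLimit": 20,  "quota": 10_000},
    "platinum": {"rateLimit": 100.0, "burstLimit": 200, "quota": 1_000_000},
}

for tier, limits in tiers.items():
    plan = apigw.create_usage_plan(
        name=f"{tier}-tier",
        apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],   # assumed API/stage
        throttle={"rateLimit": limits["rateLimit"],
                  "burstLimit": limits["burstLimit"]},
        quota={"limit": limits["quota"], "period": "DAY"},
    )
    # Each tier (or tenant) gets an API key bound to the matching plan, so a
    # basic-tier tenant hammering the catalog API gets throttled instead of
    # driving up cost for the whole pool.
    key = apigw.create_api_key(name=f"{tier}-tier-key", enabled=True)
    apigw.create_usage_plan_key(
        usagePlanId=plan["id"], keyId=key["id"], keyType="API_KEY"
    )
```

The key design choice is that the boundary lives in the gateway, so the pooled compute and storage behind it never see the excess load in the first place.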

Another approach, just to give you another model, and apologies that this is a little Lambda heavy, is using reserved concurrency. Instead of using the API Gateway here, we focus on the concurrency of the consumption of our Lambda functions.

So here I can basically say my basic tier gets 100 concurrent requests that can be handled across the Lambda functions deployed there, advanced gets 300, and premium gets everything else. It's still all about throttling, just a different way to apply it. And I only include this because I want you to realize that every stack, every tool chain you're using, has different ways to do this, but almost all of them have some way to control the experience. I also included an EKS example here: I can create tenant namespaces and associate different resource quotas with them, and those resource quotas essentially define how much each tenant gets; tenant one might be a basic tier tenant and tenant two an advanced tier tenant, or something of that nature, and they get different experiences.
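For the Lambda variant, here's a minimal sketch of applying reserved concurrency to tier-specific copies of a function with boto3. The function names and limits are assumptions; the idea is simply that the basic tier can never grab more than its share of account concurrency.

```python
# Hypothetical sketch: cap each tier's share of Lambda concurrency using
# reserved concurrency on tier-specific function deployments.
import boto3

lambda_client = boto3.client("lambda")

# Basic tier can never exceed 100 concurrent executions, advanced 300;
# a premium deployment left without a reservation draws on what remains.
tier_concurrency = {
    "get-order-basic": 100,      # assumed function names
    "get-order-advanced": 300,
}

for function_name, reserved in tier_concurrency.items():
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
```

The EKS equivalent mentioned above is the same idea expressed as per-tenant namespaces with ResourceQuota objects limiting CPU and memory per tier.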

So one of the things I tell people to do to figure this out is to go model their tenant personas. I'll ask tech and business to sit down and say: OK, what is your baseline view of consumption? How many basic, advanced, and platinum tier tenants do you expect to have in the system in the first year? What do you expect the consumption activity of those tenants to be, generally? How do you expect them to push the system? What would the load be, and what would the anticipated margins and costs be for that model?

Again, there is no real notion of TCO in a SaaS environment. People will say, come tell me the TCO, the total cost of ownership, of my SaaS environment before I build it. I have no idea, because it's dynamic. It's all based on however many tenants are in the system; the cost one day is different from the cost the next. So the best we can do is model it and then see if the model matches what you want.

So then we start to project that out another year and say: what if it grows by X? How much will basic tier grow? How much will advanced tier grow? How much will platinum grow? How much will that change the consumption profiles, and how will that impact margins? Do we like it or not? You might find out that our tiers are not the right tiers based on the way tenants are consuming the system; we've drawn the boundaries in the wrong places. So then I tell people to take this idea and model it out, creating multiple multi-year paths.

OK: if the consumption pattern were X, it would look like this and affect our costs this way; if the consumption pattern follows this other trend, it will look like that. This is a way to go back to the business and project cost in an environment where there are no absolutes about cost. But it's also a way to ask: do we have the right personas and profiles? Do we know the workloads we expect? Because the worst case is you plan it all out, you set the tiers, and then you find out tenants consume the system in ways you never expected, which will happen anyway. But at least you've thought it through as best you can, you've tried to anticipate what those workloads might be and where the boundaries ought to be. Having no boundaries at all is the harder version of this.
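To make the modeling exercise concrete, here's a tiny sketch of the kind of multi-year projection described above: tenant counts, per-tenant revenue, and per-tenant infrastructure cost per tier, rolled forward with a growth assumption. Every number is made up purely for illustration.

```python
# Hypothetical sketch of persona/tier modeling: project revenue, infrastructure
# cost, and margin per year under assumed growth rates. All numbers are invented.
tiers = {
    # year-1 tenants, yearly growth factor, revenue per tenant, infra cost per tenant
    "basic":    {"tenants": 200, "growth": 1.5, "revenue": 600,    "infra_cost": 300},
    "advanced": {"tenants": 50,  "growth": 1.3, "revenue": 6_000,  "infra_cost": 1_200},
    "platinum": {"tenants": 10,  "growth": 1.2, "revenue": 60_000, "infra_cost": 9_000},
}

for year in range(1, 4):
    total_revenue = total_cost = 0.0
    for name, t in tiers.items():
        count = t["tenants"] * t["growth"] ** (year - 1)
        total_revenue += count * t["revenue"]
        total_cost += count * t["infra_cost"]
    margin = (total_revenue - total_cost) / total_revenue
    print(f"year {year}: revenue={total_revenue:,.0f} "
          f"infra={total_cost:,.0f} margin={margin:.0%}")
```

Running variations of this with different growth and consumption assumptions is exactly the "multiple multi-year paths" exercise: it won't predict the bill, but it exposes whether the tier boundaries are drawn in the wrong place before tenants prove it for you.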

Now, the last bit: if you're going to do any of these things, you still have to measure what you're doing. I talked about measuring the cost efficiency of the operations side through some metrics, but more generally, for the business, we really want to know how we can measure tenant-level consumption. Almost every business I run into these days is asking: how do we get to some cost-per-tenant metric? How can we get the data we need to know what cost per tenant is?

And again, the answer is: it's hard, especially if you're running shared resources, because when a resource is shared by tenants, it gets tricky. If we look at cost through the lens of classic infrastructure, you can ask how much you're paying for ECS, or how much you're paying for RDS and the storage inside it, and yes, I can tell you how much those individual resources are costing. But in a multi-tenant environment, I have multiple tenants consuming that ECS environment, multiple tenants consuming that RDS environment, and they're sharing data inside the storage. The cost units straight from your AWS bill aren't going to tell you what cost per tenant is.

Instead, you have to get some approximation of how tenants are consuming your environment. You have to instrument your environment with the tooling and mechanisms to capture and publish metrics: hey, a tenant is consuming ECS right now, they're running this operation. Publish that as a metric, aggregate all of it, and then come up with some allocation. It doesn't have to be super precise, and it doesn't have to cover your whole system. But if you don't have this, think of the lines I drew on that earlier chart, the line of infrastructure consumption: how do you know what that line is? Or the graph I just showed you with the e-commerce solution: how do you know what those numbers are? You don't want to just get your bill every month from AWS and go, seems to be going OK. You could still find out that some tenants are over-consuming the system in ways that impose more cost than they should, and some tenants are under-consuming it in ways you didn't expect. I can't know that if I don't have the data.
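As one illustration of that instrumentation, here's a hedged sketch that publishes tenant-scoped consumption metrics to CloudWatch and then apportions a shared bill by relative usage. The namespace, metric name, and dimensions are assumptions, not an established schema, and the allocation is deliberately a rough approximation.

```python
# Hypothetical sketch: emit tenant-scoped consumption metrics, then allocate a
# shared resource's cost in proportion to each tenant's share of that usage.
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_tenant_usage(tenant_id: str, tier: str, service: str, units: float):
    """Emit one consumption data point (e.g. request count, GB scanned)."""
    cloudwatch.put_metric_data(
        Namespace="SaaS/TenantConsumption",        # assumed custom namespace
        MetricData=[{
            "MetricName": "ConsumptionUnits",
            "Dimensions": [
                {"Name": "TenantId", "Value": tenant_id},
                {"Name": "Tier", "Value": tier},
                {"Name": "Service", "Value": service},
            ],
            "Value": units,
        }],
    )

def allocate_cost(shared_bill: float, usage_by_tenant: dict) -> dict:
    """Split a shared bill by each tenant's share of aggregated consumption."""
    total = sum(usage_by_tenant.values()) or 1.0
    return {t: shared_bill * u / total for t, u in usage_by_tenant.items()}

# Example: a $1,200 shared ECS bill split across three tenants' recorded usage.
print(allocate_cost(1200.0, {"tenant-1": 500, "tenant-2": 300, "tenant-3": 200}))
```

It doesn't need to be precise to be useful: even a rough allocation like this is what makes the e-commerce style analysis, revenue versus imposed cost per tenant, possible at all.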

So again, this is another area where I'm probably on my soapbox a little bit: you've got to get the business to see the importance of this. It should have value all the way into how they choose to price the solution. Obviously, the second part of this is that getting the consumption data is great, but then I still have to correlate it to the bill and figure out how it maps. There are a lot of moving parts to that; I've done whole talks just on it. But I thought it would be wrong to talk about cost optimization without highlighting the point that cost efficiency means measuring as well; you have to measure the infrastructure bits.

I encourage you to go look at some of the other solutions out there; there are partner solutions starting to show up here. I mentioned Kubecost earlier, but CloudZero also has some interesting ways of tracking costs, and there are probably a bunch of tools that have come along since I last spoke about this that can help you along the way. But it's still an investment in getting the data that's at the core of this.

OK, a few takeaways, and then we'll wrap things up. I hope you see the difference between cost optimization and cost efficiency at this point. To me, if I'm building an actual SaaS business, and by the way, as an architect, I feel like it's my job to make the SaaS business a success, not just build good architecture, then I have to figure out how to make that architecture scale efficiently, support the different workloads, support the different deployment models, and handle all those variations in a way that gives the business all the tools it needs to more effectively reach more segments and more markets without compromising on cost. In fact, hopefully we still get the economies of scale everywhere we go. It should also be clear there's no single thing I can give you that will solve all your scale problems. Compute stacks vary, deployment models vary, your environments vary. What's your compliance model? What's your regulatory model? Are you migrating? Are you greenfield? All those things will affect how you get to cost optimization.

And then, hopefully, you can see that infrastructure is just a piece of the puzzle. Infrastructure costs are absolutely where you're going to spend a lot of time, but you should also be focused on the operations dimension of this. You should be focusing on how the overall cost profile of the business reacts and responds based on the architecture you've built. I think I've already highlighted that point, but don't underestimate its importance; it could be a bigger component of your cost than you might think.

And then there's the whole notion of tiering, and using tiering as part of your cost optimization strategy. Tiering isn't just a way to charge somebody a different amount for the product; tiering is a strategy used to drive cost optimization and align the behavior and activity of tenants with the amount of resources they're consuming. Anytime those things are misaligned, my whole cost optimization story could be compromised.

And then the last bit, like I said: measure, measure, measure. If you do any of this and you don't measure, you have no idea whether you're succeeding or failing; you just hope it looks good, everybody's happy, and that's enough. I would rather have the hard data. The hard data can also help you make the point to the rest of your team; it will show you trends, for example, that help you articulate some of the challenges, maybe the need for better investment in operations, or a reprioritization of what you're working on. All those bits are key to this.

OK, one last bit. Today is the first part of re:Invent, and there are quite a few SaaS sessions going on; I'm doing some others. If you're interested, here's the list of the breakout sessions. By the way, there's also a cost optimization chalk talk if you want to get into more details, plus workshops and a builder session, and I'll put these back up at the end if somebody wants to look at them. And that's it. I hope that was a valuable session, I hope it matched what you're here for, and I hope you have a great rest of your re:Invent. Thanks so much.
