Deploying multi-tenant SaaS applications on Amazon ECS and AWS Fargate

Nathan: So we centered this talk around the idea that building a SaaS equals building customer trust. This was an idea I learned when I got to work with two of the greatest SaaS builders I've ever met. They went on to build several $100 million businesses, and what they really instilled in me as I worked with them was that building a SaaS equals building customer trust. This was a game changer for me in my early twenties. I started out thinking about tech as just being about the tech, and then this idea that building the business implied building trust really shifted my mindset about how I thought about infrastructure and the architecture of that infrastructure.

So you'll see four categories of customer trust up there that we'll talk about, and we're going to start with the first one, which is: I trust your software to be available when I need it.

So the first part is: I trust your software to be available when I need it. We know that trust is part of all four of these categories when it comes to building software. One of the reasons is that you need your customers to trust you, but more importantly, you need your customers to trust that the application is actually available.

Now, this might seem like a silly question, but what exactly does availability mean? Availability, as it turns out, is not binary. It's a spectrum, a range. Availability can be that everything is completely down and nothing is accessible: the application isn't working, back-end services aren't working, infrastructure is down, disaster has struck, lightning has struck, nothing's working. Or it can be all the way on the other end of the range, where everything is functioning perfectly.

Now, we don't live in a world that is binary. It's not zeros and ones, it's not on or off, it's not all the way on one side or all the way on the other. Are there times when things fall into those extremes? Absolutely. But a lot of the time, when we see availability issues, or impact where availability has degraded, it tends to fall somewhere in the middle of that spectrum. Maybe it's so slow it's barely usable, or maybe only one feature is broken, or maybe it's broken only for one group of customers, but for your other tenants or customers there are aspects of your application and your infrastructure that are still up, still available.

So with this in mind, as we talk about trusting you as the SaaS provider to make sure that your application is available, I want you to keep this spectrum in mind. Let's also talk about the problem here: how do we mitigate this? How do we reduce the chances of availability ending up all the way on the left side of the spectrum, completely down, and how do we push it toward that right side where everything is green, customers trust us, and customers are happy?

Now, this is that tendency I mentioned earlier, where maybe we have a group of tenants: tenant one, tenant two, tenant three. They might have medium workloads, or they might be on a free tier, or they might have high workloads and be in a more premium tier where they're spending more money. Multiple tiers and multiple personas also affect how we address that availability and that impact. Are they sharing resources? If so, they have greater sensitivity to the scope of that impact.

Nathan and I have talked about it, and we'll continue to talk about it later on in the session, but even the noisy neighbor problem is a big thing: unpredictable activity patterns, challenging scaling policies. All of these are problems that we want to consider when we're looking at how we can minimize downtime, minimize the impact, and ensure availability stays ideally on that green side of the spectrum.

And one of the ways I like to explain it is that it's a little counterintuitive at first. But if you think about it, if your free trial users experience downtime or impact, that's not nearly as impactful to your business as if your premium tier customer, who's paying you maybe 10% of your revenue for your small SaaS business, goes down. That's a very high-impact outage.

Jessica: Absolutely. Yeah. So then impact also becomes a spectrum, right? When we're talking about that trust level, if the application isn't available for your free tier, that's impactful and maybe you've broken some trust there, but it's not as impactful from a revenue perspective as it is for that premium tier customer.

Now, one of the ways that we can reduce impact is through isolation, right? And this is one of the reasons why getting away from monolithic architecture and moving to distributed systems, abstracted APIs, and decoupled applications became so popular. Traditionally speaking, when you have a monolith and everything in the monolith goes down, you've just dropped trust for everyone. It doesn't matter if they're free, it doesn't matter if they're premium.

However, when you start to abstract out and decouple your applications and your infrastructure, you start to get isolation. Now, if only one part of it goes down, maybe trust drops for the one group of customers, or the one tenant or silo, that's experiencing it, but for everyone else, trust is still up. This one, I think, is big because it also shows one of the reasons containers and the concepts around them became so popular: they empower us to really focus on keeping that trust level much higher.

Now, another way that you can look at isolation is not just through abstracted and decoupled systems, but also through your code. And this happens in the way that you handle your deployments, whether that's with Amazon ECS or really anything where you're rolling out code to your applications.

One of the benefits when you're using Amazon ECS is that clusters are free, and you can have up to 10,000 per region per account, so effectively you can have as many clusters as you want. This means you can handle isolation for your new code and your new tasks very easily. You can have different environments depending on whether you're doing blue/green or canary deployments; you can have your staging environments and your production environments where you can test these tasks in an isolated manner without affecting your customers or breaking that trust.

Another strategy is rolling deployments with zero downtime, which is one of the things that's built right into Amazon ECS. When you call the UpdateService API, and you've defined as part of your infrastructure that you want multiple versions of your task running side by side, you can say: I have this group of three blue tasks, but I have new features and changes that are going to be part of the green tasks, and they still need to be able to communicate back and forth. I can plan how I want that to roll out, and it's only going to proceed as long as those green tasks are healthy.

Now, what happens in the event that one of those green tasks isn't healthy? What happens if there's a problem or a bug in the code? If that were to actually hit production, you would start dropping traffic. That would be a problem, right?

Luckily, Amazon ECS has something built in called the deployment circuit breaker to catch those bad deployments. If one of those tasks starts failing its health checks, ECS will automatically cancel that deployment, roll back, and fail over to the blue version that is known to work. This is really powerful because not only are you isolating your code, you're preventing it from ever getting into production, so you're again making sure that customer trust stays high and you're reducing the risk of negatively impacting your customers.
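To make that concrete, here is a minimal sketch of what cluster isolation, the zero-downtime rolling window, and the circuit breaker look like in infrastructure code, assuming the AWS CDK (v2) in TypeScript inside a stack; the names and the nginx image are illustrative:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Clusters are free, so staging and production can be fully separate.
const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
const staging = new ecs.Cluster(this, 'StagingCluster', { vpc });
const production = new ecs.Cluster(this, 'ProductionCluster', { vpc });
// (staging would host its own isolated copy of the service)

const taskDef = new ecs.FargateTaskDefinition(this, 'TaskDef');
taskDef.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('public.ecr.aws/nginx/nginx:latest'),
  portMappings: [{ containerPort: 80 }],
});

new ecs.FargateService(this, 'ProdService', {
  cluster: production,
  taskDefinition: taskDef,
  desiredCount: 3,
  // Zero-downtime rolling deploy: never dip below the full three "blue"
  // tasks; run up to 200% capacity during the blue/green overlap.
  minHealthyPercent: 100,
  maxHealthyPercent: 200,
  // Deployment circuit breaker: cancel a deployment whose tasks keep
  // failing health checks and roll back to the last working version.
  circuitBreaker: { rollback: true },
});
```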

I know you've used circuit breaker a lot and you've seen it be super effective.

Jessica: Yeah. Yeah. So circuit breaker is amazing because it'll catch your bad deployment and roll back, in some cases before you even realize. We've had some customers reach out and say, "I made an UpdateService call to ECS and I'm not sure what happened; it doesn't look like the update went through." And I was like, well, actually, the circuit breaker caught that you tried to roll out a really bad container that was crashing, and it rolled you back to the previous version. That's why it looks like nothing happened when you made your UpdateService call.

One of the nice things, whether it's circuit breaker or how you handle your deployments, one of the nice things about thinking through how you're going to handle that isolation and maintain that customer trust, is putting systems in place that can do some of the hard work for you. It reduces the amount of work you have to do, so you can focus on the next iteration, focus on your customers, focus on continuing to earn that trust, rather than on "oops, that made it out to production."

Now I'm the one who has to go and roll it back, and there's latency, and I've broken some trust. But by putting systems in place that catch those things and do some of the hard work for you, you leverage what's available.

Another thing to consider is when you're moving from a shared pool to a single-tenant silo, and this again depends on strategy. This goes back to where we were talking about tenants: you have your free tier with maybe low workloads, you have your more premium tier with higher workloads, or you have a customer somewhere in between. Depending on how you set up your silos: if you're sharing all your resources and those shared resources go down, you've affected all tenants. Versus maybe your premium tier is using only one or two shared resources, and a majority of their services are siloed out, again reducing the chances of that impact.

And I know, Nathan, you have a lot of thoughts on this too.

Nathan: Yeah. So it's not either/or. In many cases, if you're running a significantly sized SaaS business, you're going to have some pooled resources, because that is the most cost-effective way to build your business, especially for those free tier users. They're not paying you any money, so you don't want to give them dedicated hardware for their free tier usage; you would probably have a pool for those users. But then your premium tier customer who's paying top dollar for your service, they're definitely going to want their own hardware, in some cases their own hardware deployed into their own AWS account as well. So you're going to see a range of different deployment strategies, and we're going to talk a little now about the pool strategy and how you can optimize and reduce impact even when you're running that shared pool.

And one of the ways you can do that is using something called cellular architecture. This is really nice because you have your customers in these different silos and pools. For example, on this particular slide you have production one, production two, and production three: a total of three cells. If one of those cells goes down, only 33% of your tenants are actually impacted. So again, you've lowered the chance of that impact and the number of customers you could potentially break trust with.

And I know you've been around AWS a long time; cellular architecture, I think, really became popular with AWS. Fun fact, just by show of hands: who knows what cellular architecture actually stems from? It stems from cargo ships. Think of the Titanic: when the Titanic hit the iceberg, the water came through and flooded the whole ship. If the hull is actually broken up into sealed cells, then when one of those cells floods, it doesn't affect the rest of the ship. That was the same concept behind cellular architecture.

Now, of course, the Titanic might not be the greatest example because it did go down, though I believe shipbuilders improved that cellular design after the Titanic. If you've ever seen an AWS regional outage, in some cases you'll see in the message that we've seen some impact to API calls, and the experience can be very different. Some people will say AWS was working fine for them during the outage, while other people weren't able to make any requests at all; every request was failing. That's because of cellular architecture: in some cases a particular cell may be failing while other people, who are in their own isolated cell, aren't experiencing any impact from that particular outage.

One of the next problems that you want to consider, again so you can reduce impact and continue to maintain and ultimately build customer trust, is how you address the noisy neighbor problem. This can happen where, let's say, in one of the cells in production three, there's a high request volume, or one particular customer with a lot of requests, and that can end up affecting the other customers running in that particular pool. That's a spiky tenant: when they have that large request queue, they can degrade performance for everyone, versus a tenant with a normal request volume and only an occasional spike.

Now, one of the ways that you can address the noisy neighbor problem is through throttling. And again, I know that Nathan has a lot of thoughts on this.

Nathan: Yeah, so there are multiple different types of throttling here: throttling by activity, throttling by compute consumption, and throttling at the underlying resources. We're going to talk about different levels of that throttling, in particular for ECS and Fargate. The two that I think are most applicable are throttling at the compute level and throttling by activity.

So when you throttle at the front door, you're handling it starting at the API gateway, starting at your network level. If all your customers are going through that same API gateway, maybe you have your basic tier, your standard tier, and your platinum customers all authenticating through it, you can pair Amazon ECS with serverless, something like Lambda, where you have a custom authorizer that creates a policy: depending on which tier that customer is in, they're only given a certain rate limit, a certain burst, a certain quota of how many requests they can make, and throttling policies are applied accordingly.

Once they get through that front door, they go over to, say, the product or the order side, but you're throttling right at the ingress level. You'll see this on every single AWS service: they'll have an API rate limit quota. The reason it exists is because we're using cellular architecture behind the scenes, and we don't want any particular tenant inside a cell to be able to DDoS that cell out of service and impact the performance for other tenants who happen to be served out of that same production cell.
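The talk describes enforcing the tiers through a Lambda custom authorizer; API Gateway usage plans are a built-in way to express the same per-tier rate, burst, and quota idea. A minimal sketch in CDK, assuming it runs inside a stack, with illustrative tier names and numbers:

```typescript
import * as apigateway from 'aws-cdk-lib/aws-apigateway';

const api = new apigateway.RestApi(this, 'TenantApi');
api.root.addMethod('GET', new apigateway.MockIntegration(), {
  apiKeyRequired: true, // callers must present a tenant API key
});

// One usage plan per tier: rate limit, burst, and monthly quota.
const basicTier = api.addUsagePlan('BasicTier', {
  throttle: { rateLimit: 10, burstLimit: 20 }, // requests per second
  quota: { limit: 100_000, period: apigateway.Period.MONTH },
});
const platinumTier = api.addUsagePlan('PlatinumTier', {
  throttle: { rateLimit: 500, burstLimit: 1_000 },
  quota: { limit: 10_000_000, period: apigateway.Period.MONTH },
});
basicTier.addApiStage({ stage: api.deploymentStage });
platinumTier.addApiStage({ stage: api.deploymentStage });

// A hypothetical basic-tier tenant gets a key bound to that plan.
const acmeKey = api.addApiKey('AcmeCorpKey');
basicTier.addApiKey(acmeKey);
```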

Another way that you can do throttling, and I know Nathan said there are two predominant ways, is container throttling, where you set limits on the containers themselves. This is done through the resource definition in your task definition. So maybe for the resource quota for tenant one, they're only given a limit of a quarter vCPU and one gigabyte of memory; this might be for your free or your advanced customers, who aren't given as many resources. You're throttling what they even have access to for their applications, their APIs, their binaries. Versus for your more platinum or premium customers, you've created a different task definition where you give them access to more CPU and more memory, say one vCPU and two gigabytes. So when you leverage these together, not only are you throttling at the network level on API requests and quota, but once requests get through, you're also preventing any one particular task or service from consuming too much of a pooled or shared resource. You're setting that quota, and the task can't exceed it.
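A sketch of those two per-tier task definitions in CDK, using the sizes from the talk (Fargate CPU is expressed in CPU units, where 1024 equals one vCPU):

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Free/advanced tier: a quarter vCPU and 1 GB caps how much of a shared
// pool any one tenant's task can consume.
const basicTaskDef = new ecs.FargateTaskDefinition(this, 'BasicTaskDef', {
  cpu: 256,             // 1/4 vCPU
  memoryLimitMiB: 1024, // 1 GB
});

// Platinum/premium tier: one vCPU and 2 GB.
const platinumTaskDef = new ecs.FargateTaskDefinition(this, 'PlatinumTaskDef', {
  cpu: 1024,            // 1 vCPU
  memoryLimitMiB: 2048, // 2 GB
});
```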

Now, another thing to consider is that AWS Fargate launches each task in its own isolated microVM. And this is really cool too, because when you consider that cellular architecture, an entire cell experiencing a performance impact on one particular cluster doesn't affect the other cells. And within a cell, if one particular task or service is impacted, because tasks are isolated in those microVMs, as on the right side of the diagram, that one task doesn't affect the other tasks running in that cell. That's due to the way AWS Fargate launches and handles those tasks.

Now, Nathan, what would be another way that they can maybe consider addressing this if they were using something like EC2?

Nathan: So EC2 also has the ability to specify container resource limits, and it's fairly effective, but there are some things that aren't actually throttled, for example network. You could have one particular container consuming too much of the network of the underlying host. So I think in general you'll find that AWS Fargate isolation is a little more effective than EC2 isolation at making sure there's no cross-impact between one tenant, or one cell of your service, compared to hosting multiple tenants or multiple cells as adjacent containers on the same EC2 host.

Jessica: Awesome. All right. Moving on, to continue this theme of trust and the ways we can keep earning trust from our customers, Nathan's going to lead this next section: I trust your software to keep my data secure.

Nathan: So this is a huge one. Security is absolutely critical for a software-as-a-service business. If you have a security breach, it can completely kill your business, particularly if sensitive customer data gets stolen.

So one of the key ways to stop that from happening is actually microservice architecture. And I know that right now there's a trend back toward the monolith, and I think that works for some very small startups and people who are just starting out building a brand-new SaaS business. But in general, microservices are going to be much better for security isolation as you grow and scale and reach a point where you actually need a high-security system. That's because of a couple of different reasons, and we'll talk about the service integrations for ECS that enable this for your microservices.

So one of the key ideas behind a microservice is that you need a firewall to protect it from bad traffic. This bad traffic could originate from outside, from the public internet, or it could originate from inside your network if part of your system was compromised. ECS and AWS Fargate integrate seamlessly with Amazon VPC to give each of your tasks its own network interface inside your VPC. This allows you to do some really interesting things, such as building security groups that have granular, minimal ingress rules between each of the different components in your system.

So we talked about how the microVMs are isolated from each other; the network is isolated as well. Think about a traditional EC2 host. If you're running multiple application containers on that host, it becomes a bit of a challenge, because now you have to open up a bunch of ports on that EC2 host, potentially a gigantic port range if you're using dynamic ports, and allow inbound traffic across an entire section of potentially thousands of ports on that host.

AWS Fargate and the awsvpc networking mode simplify that tremendously. Now each of your tasks has its own network interface and its own private IP address inside your VPC, and you can granularly allow or deny traffic between any of the entities inside your VPC.

So here you see AWS Fargate running two different tasks. They each have their own elastic network interface. There's an Application Load Balancer that's accepting inbound traffic from the public internet and sending it to one of the tasks' elastic network interfaces, but it's denied the ability to talk to the other task. Likewise, the one task is allowed to talk to the other task, initiating a connection to it, but the reverse is not allowed. So there's directionality: you can say that this task is allowed to open a connection to that task, but not in the reverse direction.

The thing that I really like about the way security group rules work when it comes to ECS integrating with VPC is that you have to be very intentional about those rules, because by default nothing can talk to anything, right? You are intentionally defining and shaping how you want your traffic to flow to your tasks. That way there's no communication happening behind the scenes that you weren't aware of, because you've intentionally created those security groups and target groups and explicitly allowed that communication. The default is deny, and that's very powerful.
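In CDK terms, that intentionality might look like the fragment below, assuming an `alb` (Application Load Balancer), a `webService`, and an `apiService` were defined elsewhere in the stack; each allowed path is declared explicitly and has a direction:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Default is deny-all. The ALB may open connections to the web task, and
// the web task may open connections to the API task, but not the reverse.
webService.connections.allowFrom(alb, ec2.Port.tcp(80));           // ALB -> web
apiService.connections.allowFrom(webService, ec2.Port.tcp(8080));  // web -> API
// No rule lets apiService initiate connections back to webService,
// and the ALB cannot reach apiService at all.
```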

So you set up your networking, you set up your networking rules, everything is granular, and traffic is allowed to flow to your application. Now it's time for your application to actually do something.

And in order to do that, it probably needs an IAM role. So the IAM role is the concept that we have inside AWS that allows a piece of code to make use of the AWS API and talk to other AWS resources. And for microservices, in particular, it's important to have granular IAM roles.

So you'll see here that maybe a web task is allowed to communicate with an S3 bucket, and the API task is allowed to talk to an Amazon DynamoDB table, but you don't want them talking to each other's resources. Traditionally, when you were running on just a bare EC2 instance with no containers and no ECS, you had no option but to give the instance role wide-ranging permissions to talk to all the resources that every component running on the EC2 host needed. The result is that if one component on that host was compromised through a remote code execution vulnerability or something along those lines, the attacker actually has access to everything that that EC2 host is allowed to communicate with.

With microservices and with AWS Fargate, the picture is a lot different, because you're running smaller, more granular components, and each of them has its own IAM role and its own level of access. If someone hacks into one of those microservice containers due to some software flaw, they're only able to access the resources that that particular microservice is authorized to communicate with.
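A sketch of those granular task roles in CDK, with hypothetical resource names; each task definition gets its own task role, so each microservice holds only the permissions it needs:

```typescript
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as s3 from 'aws-cdk-lib/aws-s3';

const assetsBucket = new s3.Bucket(this, 'AssetsBucket');
const ordersTable = new dynamodb.Table(this, 'OrdersTable', {
  partitionKey: { name: 'tenantId', type: dynamodb.AttributeType.STRING },
  sortKey: { name: 'userId', type: dynamodb.AttributeType.STRING },
});

const webTaskDef = new ecs.FargateTaskDefinition(this, 'WebTaskDef');
const apiTaskDef = new ecs.FargateTaskDefinition(this, 'ApiTaskDef');

// The web task can read S3 but not DynamoDB; the API task can use the
// table but not S3. Compromising one container exposes only its slice.
assetsBucket.grantRead(webTaskDef.taskRole);
ordersTable.grantReadWriteData(apiTaskDef.taskRole);
```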

But we can actually get even more granular than that. That was at the service level, but let's talk about the tenant model. There are different ways to separate out your tenant data inside your data model as well. So think about how you would store that data, for example in a DynamoDB table. One way might be to have one gigantic table, such as an orders table: a shared pool table that all of your tenants' data is going to be stuffed into. And this works great, especially for those free tier users.

Another way would be to have multiple cellular tables, like orders cell one, orders cell two. Each of these is a pool, but it's a pool for a smaller subset of your customers. That limits the impact and how much data is accessible within that cell, and keeps those cells isolated from each other.

And then finally, your most premium tier customers may want their own custom deployment. You might have a specific table for that particular tenant with just their data.

Now, with those cellular architectures and pool tables, it's important to think about how you will keep data secure, because, you know, I'm a developer, just a developer; we're all fallible, we make mistakes. I'm laughing because we are the development team for one of our products, and we've both been like, "sorry, I just made a mistake." That goes back to what we talked about: as developers we make mistakes because we're not perfect; we're not computers that can do the same thing every single time. I mentioned earlier the importance of putting systems in place that can do some of the hard work for you, systems that ensure you fall into success. And one of the ways you can do that is how you structure your database and your schema.

So check out this table here. This is the orders table, a pool table. We have a user ID, and we have orders for these particular users; obviously, the primary key is there for looking up data based on the user ID. Now, this code here has a terrible vulnerability in it. Someone completely forgot to check that the person making the call actually is the person they're fetching the data for.

So imagine evil user Eve can just make a request for /orders/Alice and start fetching all the orders for the username Alice. The reason this is a big problem is that the user Alice could be part of an entirely different tenant organization. For that pool of data, you've now leaked data across tenants: one evil user in one tenant is able to fetch information from one of your other tenants, and big customer trust has been lost.
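The slide's code isn't in the transcript, but a sketch of the kind of vulnerable handler being described might look like this (TypeScript with Express and the AWS SDK v3; the table name and route are assumptions):

```typescript
import express from 'express';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const app = express();
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// VULNERABLE: the handler trusts the path parameter and never checks that
// the authenticated caller actually is that user, or even shares a tenant
// with them. Eve can GET /orders/Alice and read Alice's orders.
app.get('/orders/:userId', async (req, res) => {
  const result = await ddb.send(new QueryCommand({
    TableName: 'orders',
    KeyConditionExpression: 'userId = :u',
    ExpressionAttributeValues: { ':u': req.params.userId },
  }));
  res.json(result.Items);
});

app.listen(3000);
```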

So how can we prevent this? This is something that AWS does internally in all of our services as well. You'll notice that we have IAM roles and service-linked IAM roles inside pretty much all of our services, and there's a reason for that: it's for multi-tenant isolation.

So the end-to-end flow looks something like this. I know this looks incredibly complicated; the reality is that it's a little simpler than it looks. We have an authorization flow, which essentially establishes an identity for the user who's making the request, and that's done using an Amazon Cognito user pool and a JWT token.

Once we have an identity for that user, that identity can be passed through to the rest of the system through Amazon API Gateway. And the heart of how this functions is that, based on the identity of the user, we can actually down-scope the IAM role even further.

So we talked about how that microservice has a role, right? The microservice role as a whole may have a particular level of access to a resource such as the orders table. You see on top that the Amazon ECS task role for the orders microservice has access to query the entire orders table. But we can also have a tenant-scoped role.

So this particular user, because they're part of the Acme org, their IAM role only allows them to fetch from the orders table if the leading key in the DynamoDB query matches Acme. This is a very powerful way to create an intersection between the microservice task role and the tenant role, to down-scope the level of access on a per-API-request basis, even though your compute is operating out of a pool.

And here's how that works. We've adjusted the table a little bit, so there's a new primary key, which is the tenant ID, and we still have a user ID there for the order lookup. You'll notice that the same API endpoint is actually still vulnerable: there's no check anywhere in this code to verify that if Eve makes a request, she's not making a request for Alice.

However, the IAM role is going to catch and deny any attempt to fetch data outside of your particular tenant ID, and the way it does that is the leading key. Because the tenant ID is part of the query, if Eve attempts to fetch orders for a user out of AnyCompany or Example Corp, she's actually going to be denied by that IAM policy, and your service will probably end up returning an error back to Eve.

So as a result, Eve is only able to violate the privacy of other people inside her own tenant organization; she's not able to fetch information across tenant boundaries.
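A sketch of what that tenant-scoped statement can look like in CDK, assuming the `ordersTable` construct from earlier and that each request's tenant ID arrives as a session tag named TenantId (for example, set when the service assumes the tenant role using claims from the JWT):

```typescript
import * as iam from 'aws-cdk-lib/aws-iam';

// Whatever the application code asks for, DynamoDB only honors queries
// whose leading (partition) key equals the TenantId tag on the calling
// session; IAM resolves the placeholder at request time.
const tenantScopedStatement = new iam.PolicyStatement({
  effect: iam.Effect.ALLOW,
  actions: ['dynamodb:Query', 'dynamodb:GetItem'],
  resources: [ordersTable.tableArn],
  conditions: {
    'ForAllValues:StringEquals': {
      'dynamodb:LeadingKeys': ['${aws:PrincipalTag/TenantId}'],
    },
  },
});
```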

Now, let's say the worst-case scenario: you have a problem in the code, you didn't have proper protection, you didn't have proper IAM. How do you know what the actual impact was? Let's say Eve made a bunch of requests and fetched a bunch of data for a bunch of different tenants inside your business. You want to have the confidence to tell your other tenants, "I'm really sorry, you were impacted," or "we're very confident that you were not impacted at all." And the way to do that is you have to have audit logs.

Amazon ECS FireLens is a tool built into Amazon ECS to give you this level of detail. The way it works is that it's a sidecar that lives alongside your task, and it enhances all of the logs that come out of your application with extra metadata. That metadata can be ECS-related metadata, and it can also be custom metadata that you attach.

So, for example, the tenant ID, the cluster, the task ARN, even the version of code, via the ECS task definition, that was running for that particular request. Now I'm able to make granular lookups and see everything that Eve did: which particular tenants she might have touched, which particular tasks were impacted. Say there was a remote code execution vulnerability or something along those lines; you're going to be able to query that data using CloudWatch Logs Insights.

And the nice thing about this, if you're already using CloudWatch with your application: we all know how challenging it is to go to CloudWatch and try to search through and figure out what's actually going on in the logs, right? Because it's unstructured. FireLens takes that and turns it into structured data, so now I don't have to do a lot of those manual things; I can actually run queries, get responses back, and organize that data in a way that allows me to take action.
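A minimal FireLens sketch in CDK, assuming an existing `taskDef` and a CloudWatch destination; the log group and field names are illustrative, and ECS enriches each record with metadata such as the cluster, task ARN, and task definition version:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Fluent Bit sidecar that routes and enriches the application's logs.
taskDef.addFirelensLogRouter('LogRouter', {
  image: ecs.obtainDefaultFluentBitECRImage(taskDef),
  firelensConfig: { type: ecs.FirelensLogRouterType.FLUENTBIT },
});

taskDef.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('public.ecr.aws/nginx/nginx:latest'),
  logging: ecs.LogDrivers.firelens({
    options: {
      Name: 'cloudwatch', // Fluent Bit CloudWatch output plugin
      region: 'us-east-1',
      log_group_name: 'orders-service',
      log_stream_prefix: 'task-',
    },
  }),
});

// Once records land in CloudWatch as structured JSON, Logs Insights can
// answer audit questions (assuming the app emits tenant_id and user_name):
//   fields @timestamp, tenant_id, @message
//   | filter user_name = "eve"
//   | sort @timestamp desc
```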

This again goes back to putting systems in place that allow you to fall into success and make some of those manual tasks a little easier. It protects you going forward, and it's a really powerful and easy way to handle your logs and your data in an actionable manner.

So those are all the tips I have for the section on increasing customer trust at the security level. I'm going to turn it over to Jess to talk about: I trust your software to get the new features that I need.

Jessica: Yeah. So again, obviously the theme is trust, and the next step is: I want new features. I'm going to trust you to be able to deliver new features.

Now, one of the things about developers overall, or just anyone involved in tech, is that a lot of the time they get focused, I think, on: I know my customers want features, I want to be agile, I want to push out new features, but I also need to spend some time learning, and I need to make sure I carve that out. And unfortunately, I'm spending all my time doing maintenance, and now I don't have as much time as I would have liked to implement features.

And actually, when Nathan and I were designing this talk, we spent a lot of time on this particular slide, because from a developer's workflow, I don't want to spend as much time doing maintenance, right? I want to have those systems in place; I want to implement features. But the way I approach problems can change what my maintenance looks like, versus what my features look like, versus how much time I spend learning, and whether I'm actually carving out time to learn as well. There are a lot of pitfalls that can happen along the way with this particular aspect.

Some software-as-a-service businesses end up pushing their developers so hard to keep building features that the developers have to dial back on the learning side of things. Then they're unhappy, they're not growing in their role, and they're not able to keep up with current tech. That's how you end up with an outdated system that becomes legacy very quickly. Or they skimp on the maintenance,

and the next thing you know, you have a system that's destabilizing over time. Or they stop implementing new features: you see the business grow to a certain level, and then they say, you know, we're unable to keep up with the market, and they end up being replaced by a different SaaS that is able to build features in a more agile fashion.

Absolutely. And one of the things to also consider when it comes to maintenance is how you're rolling out these changes, and by that I mean your deployments. What systems are you using? Are you using AWS to handle your deployments? Are you using something like GitHub Actions, or Jenkins? I'm talking specifically about your DevOps and your CI/CD. If you don't have workflows or pipelines in place that simplify the amount of maintenance needed to get those changes and new features out, it's actually more effective to spend time on that maintenance part first: implement new features in your release process, in your CI/CD, in your workflow. That will free up time to implement features in your application.

So it's not just about focusing on features that your customers want, but focusing on how you can enhance and add features that will ultimately empower and enable you to deliver that new functionality to your customers as well.

One of the slides that we'll have at the end talks about how you can learn patterns for some of these things and see samples and code; that's a website called containersonaws.com. Nathan and I built it, and when we were building it, we ran into a lot of these issues. The thing is, we want to give all of you, everyone in this room, tools and patterns, new features, new functionality, so you can learn how to go do the thing. But to do that, we had to pause and take the time to maintain how we were going to release and implement those new features, and how we were going to plan our deployments. Now we can go back with freed-up time and focus just on patterns, just on code, just on tips, examples, information, and features for you.

So when we talk about features, it's not just the application; it's also the release process and the method of getting there. Now, new features come with maintenance weight. What do I mean by that? When you implement some of these new features, you're making code changes, right? You're making things that you're now going to have to continue to maintain and address, and some of those can be very heavy; some can introduce new problems, new issues, new burdens that you're going to have to figure out how to handle. Once you're past the learning curve, there may be less infrastructure work in the long run, but the initial period after you've introduced those new features comes with a learning curve. So there's going to be some maintenance weight that you're dealing with.

A lot of it comes down to the tech choices you make. If you make the wrong tech choice, you can end up more like the diagram on the left, where every time you add a new feature, your maintenance burden increases exponentially. If you make the right tech choice, you might have that initial learning curve, but each new feature actually comes with a little less maintenance burden, and it starts to level out over time.

Another way you can reduce your operational overhead is by using AWS Fargate. For example, on this particular slide, if you're running your ECS tasks on EC2 instances, you're having to manage that EC2 host, you're having to manage that node. That may be maintenance weight you're OK with; you've weighed the pros and cons, and EC2 is what you need to deliver value to your customers and earn that trust. But with that comes additional overhead: you're doing patches, you're paying attention to Docker, you're paying attention to stricter networking rules, and you're managing those EC2 instances. That's a lot more work than with AWS Fargate, where all of that is handled behind the scenes for you, and all you have to worry about is patching your container and your application itself.

This next slide is the patterns that I mentioned earlier. One of the ways you can deploy prebuilt patterns is using something like AWS Copilot, which is an easy CLI tool that helps you generate a manifest file. It helps you define whether you want a load-balanced service, an event-driven service, or a back-end API; it actually walks you through and helps you make strategic decisions for your application, depending on your microservice architecture. It then generates manifest files that ultimately give you CloudFormation templates to deploy your prebuilt infrastructure, your strategy for how you're handling your microservices.

So that can even be cron jobs, and it'll also build a CI/CD pipeline for you within AWS.

Nathan: Yeah. One of the goals when we built and designed Copilot was specifically around things like the cron jobs. There are many times when you might have a business requirement to send out a reporting email once a week, or something like that, and a lot of people were spending a lot of time trying to figure out how to do highly available cron and scheduling to solve some of these problems at scale. We said, OK, we can take some of these prebuilt patterns and package them up into a tool that lets you just think about the feature you want to build, and then deploy architecture that's been pre-designed and vetted.

It's really the same way we design some of our own internal services. All these components here are basically the components we build our own services out of and operate at scale.

Jessica: Now, AWS Copilot... I should say, by the way, I would identify as a developer now, but my background came a little bit more from operations. So Copilot feels very natural to me, because it's very interactive: I'm able to just run my CLI commands live in my terminal and get back the queries and data that I would expect.

However, if you're more developer-minded, working with a CLI and living in your terminal to define your infrastructure might feel a little foreign and complicated. So for that, and to make infrastructure easier for developers, there's also something called the Cloud Development Kit, or AWS CDK, which enables you to define and declare your infrastructure the same way you would your application. It's a lot more native to programmers in the way you access it, and it's applicable for any type of developer.

So whether you're a Node developer, a Python developer, or a Go developer, there are SDKs you can use depending on what language or framework you're already comfortable with. And I know you use CDK a lot.

Nathan: Yeah. Yeah. So actually, if you go to the next slide, this is a passion project that I built for CDK. It's a toolkit built using CDK, designed to be an extendable interface for interacting with CDK to deploy different features for an ECS service. The way it works is you can import different extensions, such as a FireLens extension, an X-Ray extension, or an Application Load Balancer extension, and just add those to your ECS service with an API that's as simple as calling `add()` on the service description with the new extension. It's also customizable: you can build your own extensions. If you have a particular extension that you want to apply to multiple microservices, or that you want to distribute across your organization, you can package it up as code and distribute it. What all of this adds up to is reducing the burden of setting up and maintaining infrastructure for people who just want to build features for your SaaS.
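A sketch based on that toolkit's documented interface (the `@aws-cdk-containers/ecs-service-extensions` package; the image and names here are illustrative): you compose a service out of extensions instead of wiring each integration by hand.

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import {
  Container, Environment, FireLensExtension, HttpLoadBalancerExtension,
  Service, ServiceDescription, XRayExtension,
} from '@aws-cdk-containers/ecs-service-extensions';

// An Environment bundles the VPC and cluster the services deploy into.
const environment = new Environment(this, 'production');

const description = new ServiceDescription();
description.add(new Container({
  cpu: 1024,
  memoryMiB: 2048,
  trafficPort: 80,
  image: ecs.ContainerImage.fromRegistry('nathanpeck/name'),
}));
description.add(new FireLensExtension());         // structured logging
description.add(new XRayExtension());             // tracing
description.add(new HttpLoadBalancerExtension()); // public ALB

new Service(this, 'name-service', {
  environment,
  serviceDescription: description,
});
```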

So it's going to translate to you building faster and better, using an abstraction layer that takes away some of that infrastructure heavy lifting. And one of the biggest pieces of infrastructure heavy lifting, I think, coming to ECS is dealing with networking, right? Defining your VPCs, defining your security groups, your target groups, all of that: CDK makes it really easy to attach your load balancers and create all of that for you. And then CDK, by default, translates all of that into CloudFormation and deploys it as a stack. From an infrastructure development perspective, it's definitely a lot easier; I think it reduces that maintenance and overhead so you can focus more on your application and delivering application features.
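As one hedged illustration of how much wiring CDK can take off your plate, the `ApplicationLoadBalancedFargateService` pattern from `aws-cdk-lib/aws-ecs-patterns` stands up the VPC, cluster, ALB, target group, and security groups from a handful of lines (values here are illustrative):

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

// One construct: VPC wiring, load balancer, target group, security
// groups, and the Fargate service, all synthesized to CloudFormation.
new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Web', {
  cpu: 512,
  memoryLimitMiB: 1024,
  desiredCount: 2,
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/nginx/nginx:latest'),
  },
  publicLoadBalancer: true,
});
```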

Another thing, and this is what I mentioned earlier, is deploying prebuilt patterns that were designed by AWS. These are patterns that members of the TFC, members of our team, members of AWS, have built and shared that you can use and leverage as well. And the nice thing is, on the left-hand side when you go to the website, you'll see a little filter box, and you can filter by which infrastructure tool you want. So if you want Copilot, or CloudFormation, or CDK, or even SAM, the Serverless Application Model, we have samples that will help you get started with all of those different tools. Or you can just use the search box and search for what you're trying to accomplish. These are prebuilt, best-practice patterns.

Even if you don't use AWS infrastructure tools, maybe you're more comfortable with Terraform, or maybe you're multi-cloud, that's fine: we have Terraform samples as well.

Now, the last section of trust that we want to talk about is: I trust your business to be here for the long term. And I'm going to let Nathan take point on this one.

Nathan: So this is a good one. The first SaaS I ever worked at, unfortunately, ran out of money and went out of business. I was very excited about that startup early on; I was one of the first members of it, but we just never made the business model work. And it's very hard to optimize a SaaS. The dream of building a SaaS is that as tenant consumption grows, your cost, your scale, and your performance stay in sync with that consumption, such that you always have a nice, healthy profit margin there and your business is sustainable.

The reality, though, especially if you have free tier users, is that you end up with something more like this. If you have basic tier, free tier users, they end up carrying a lot of infrastructure cost: there are a lot of them, and there's very little revenue, obviously. Then you have your standard and advanced tier users; maybe there are fewer of them, but there's also less infrastructure cost, and there's more revenue from those customers.

So the lesson to learn from this is that you have to look at margin for your SaaS not just at the top level, but with a more granular level of information about where your infrastructure costs are going and what the utilization actually is.

So we want to look at the margin not just at the top level, but per service and per tenant. You have that microservice deployment, you have tenant cellular architecture, you have premium tier users who are on their own silo. We want to be able to look on a fine-grained basis and see exactly what's going on with each of these, what their utilization is, and how we can optimize that.

So how do we do that? Well, one of the native features built into Amazon ECS is Container Insights. The idea behind it is that it gives you granular telemetry data, which comes out of AWS Fargate through Amazon ECS, and you have different options for where you want to pipe all of this data. Application logs, obviously, go to CloudWatch Logs. Container Insights takes the telemetry of the actual resource utilization of each container and puts that into Amazon CloudWatch. You also have Amazon EventBridge events that Amazon ECS produces, which you can also put into CloudWatch. The goal of all of this is that when you have all of your data in one place, you can start to query it, and CloudWatch Logs Insights gives you an interface and a query language that let you pull actionable information out of all of this telemetry you're getting from your system.
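Enabling Container Insights is a single cluster property in CDK; a minimal sketch (the cluster creates its own VPC here if you don't pass one):

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Per-container CPU and memory telemetry flows into CloudWatch
// automatically once Container Insights is enabled on the cluster.
const cluster = new ecs.Cluster(this, 'Cluster', {
  containerInsights: true,
});
```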

So we actually have a pattern for that, once again, on containersonaws.com: the Fargate right-sizing dashboard. What it does is take some of that telemetry, such as the large JSON blob you see over here on the left, crunch the numbers, and pull out the statistics to find opportunities for optimizing your infrastructure on a per-container basis. For example, this container is a little over-provisioned and isn't using its resources efficiently, whereas this other container is running hot, very close to its maximum resource utilization, and maybe you need to actually upsize it.

So this is a great pattern to look at. It was built by one of our principal engineers inside the container organization, and we've used it internally as well as with many customers who are saying, you know, where is my utilization going? I see the bill at the end of the month, and I'm able to look at my top-level margin, but how do I track that back to actual tenant usage on a per-service basis?

It's also important to understand the EC2 pricing model compared to AWS Fargate, because it is a little bit different. Earlier, I saw roughly 50% of you in the room were using EC2 and 50% were using AWS Fargate, and many of you are probably using both, so you may already have some understanding of this. But the key thing to understand is that with EC2, you're paying a flat price for the instance, whether or not any containers are running. Whereas with AWS Fargate, you're paying for the container based on its size, and you stop paying whenever a container stops.

So this means you can do really granular micro-optimization on how you scale up and down, and it'll be more efficient than running on EC2. On EC2, if you scale down a number of containers, it may not actually reduce your cost at all, because you're still running an EC2 instance that charges the same amount as it would if it were full of containers. So ironically, some of the micro-optimizations you might do don't yield results as efficiently as you would think. Whereas with AWS Fargate, because each container is isolated and charged independently, when you shut down a container, it stops charging entirely.

So the key lesson here is that it's important to design your system to be event-driven and take advantage of asynchronous work as much as possible. What do I mean by that? If you have a container that's running at all times on an EC2 instance, that kind of makes sense, because you're going to be paying for the EC2 instance whether or not it's running a container. But if I'm running on AWS Fargate, I may be able to start up and shut down that container, and save money in between because there's no cost.

So doing this with an event-driven architecture that does work in the background allows you to take advantage of the AWS Fargate pricing model, to save you money and make that multi-tenant model a little more efficient. And we actually have a real live demo of this over in the expo. If you go over to the Modern Applications and Open Source booth and look for Serverlesspresso and ServerlessVideo, those demos leverage event-driven architecture, Lambda, EventBridge, and Step Functions, together with AWS Fargate.

And one of the new applications we demoed for this event in particular is ServerlessVideo, on-demand video streaming built entirely on AWS. Here's the way it works: if I have a short video, I can have a script process it in Lambda, but if it's over 15 minutes, it will fail because of Lambda's execution time limit, and it would also be very expensive to keep going. Instead, I can leverage Fargate to launch that video processing on demand, run it when I need it for the streaming service, and then when the processing is done, that task stops, so I'm no longer paying for it.
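One way to wire up that on-demand processing, sketched in CDK and assuming a `cluster` and a `videoTaskDef` Fargate task definition exist: a Step Functions state launches one Fargate task per video and waits for it to finish, and the task, and its billing, stop when it's done.

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// RUN_JOB (.sync) means the state machine waits until the task exits.
const processVideo = new tasks.EcsRunTask(this, 'ProcessVideo', {
  integrationPattern: sfn.IntegrationPattern.RUN_JOB,
  cluster,
  taskDefinition: videoTaskDef,
  launchTarget: new tasks.EcsFargateLaunchTarget({
    platformVersion: ecs.FargatePlatformVersion.LATEST,
  }),
});

new sfn.StateMachine(this, 'VideoPipeline', {
  definition: processVideo,
});
```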

There's different ways I can look at cost optimization to leverage the tools that are available to me depending on what my goals are for my application and my needs. But if you want to talk a little bit more about how to leverage event driven architecture with ECS and Fargate, I definitely recommend going over to the Modern Apps and Open Source booth and talking to a lot of our colleagues there as well.

So the main takeaway that I want you to remember is don't think about your AWS Fargate container as necessarily having to be a web server that's up and running 100% of the time and sitting there, even when it doesn't have any work to do. There are ways to design your architecture so that these containers come and go and you're being charged on a more micro level than a flat constant charge.

But you probably do have some things that are, for example, a web server or an API where you do have no option but to keep that thing up and running at all times, so it can answer your customers. And there's still ways to optimize the price and the margin on that when you're using AWS Fargate.

The two ways are Graviton and Spot capacity. I actually did a lot of calculation here to make sure these bars accurately show the relative price of the different options. So the base x86, Intel-based vCPU cost for AWS Fargate is that top bar that you see. And then if you decide to move that onto a Graviton, ARM-based processor, you're going to see a 22% savings on the vCPU price.

But there's actually a little more to it, which is the fact that these Graviton processors are actually faster and more powerful than the Intel processors you would be getting if you're running an Intel-based Fargate task. So effectively, the price performance can add up to about 40% cheaper, because you're able to squeeze more juice out of your task: more requests per second and more concurrent requests than you would get running on Intel.

So we see that that other bar goes down a little more. By far the absolute cheapest way to run, if you want to save the most money, is Spot tasks. If you're familiar with Spot, this is essentially us selling spare capacity that's currently unused, and we can take that capacity back with a two-minute notice if it's needed for other purposes. But you'll see that it's up to 68% cheaper.

So this can be ideal for some of your event-driven architectures, where there's asynchronous work that happens in the background, maybe off a queue or something along those lines. When there is Spot capacity available, because utilization isn't that high, maybe at night, you can do that work at a much cheaper rate, up to 68% cheaper than you would at the on-demand price.

So essentially, you're moving some of that workload later in the day, potentially overnight, rather than trying to do it during the busiest business hours. This can make some of your workload a little more delayed, but extraordinarily cheap compared to doing it on demand.

And even if you do have something that needs to be up and running during the day, where you can't rely on delaying work to the evening or the off hours, you can choose to use multiple capacity providers to optimize your cost as well.

The interesting thing about this is that if you look at the price comparison, you'll find there's almost a 1-to-3 ratio: roughly three Spot tasks for the price of one on-demand task. So a strategy many people use is to launch their baseline capacity, based on the traffic level they expect, onto Fargate on-demand tasks, and then, to increase the quality of their service, launch up to three additional Spot tasks for the same price they would have paid for one on-demand task. They're able to decrease the latency, increase the number of concurrent requests, and potentially deal with spikes a little more gracefully, using that Spot capacity as overflow or burst capacity.
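A sketch of both levers together in CDK, assuming `cluster.enableFargateCapacityProviders()` has been called on the cluster: Graviton pricing on the task definition, plus a Spot-heavy capacity mix on the service.

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

const taskDef = new ecs.FargateTaskDefinition(this, 'ArmTaskDef', {
  cpu: 1024,
  memoryLimitMiB: 2048,
  runtimePlatform: {
    operatingSystemFamily: ecs.OperatingSystemFamily.LINUX,
    cpuArchitecture: ecs.CpuArchitecture.ARM64, // Graviton: ~22% lower vCPU price
  },
});
taskDef.addContainer('web', {
  // nginx's public image is multi-arch, so it runs on ARM64 as-is.
  image: ecs.ContainerImage.fromRegistry('public.ecr.aws/nginx/nginx:latest'),
  portMappings: [{ containerPort: 80 }],
});

new ecs.FargateService(this, 'WebService', {
  cluster,
  taskDefinition: taskDef,
  desiredCount: 8,
  capacityProviderStrategies: [
    // The first two tasks always land on on-demand Fargate (the baseline);
    // beyond that, three Spot tasks launch for every one on-demand task.
    { capacityProvider: 'FARGATE', base: 2, weight: 1 },
    { capacityProvider: 'FARGATE_SPOT', weight: 3 },
  ],
});
```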

So this is the way that you can dramatically increase the quality of your service without dramatically increasing the price. So all of this is gonna add up to you being able to build a business which doesn't destroy itself with infrastructure costs and go out of business because you're unable to achieve a profit margin.

And interestingly, I didn't put a slide in here for this, but some of what Jess was talking about with building new features is also applicable to keeping your business sustainable for the long term. If you think about it, if you have to hire additional developers to maintain your infrastructure, that's additional cost, that's additional paychecks. So using those fully managed services, using serverless options, can actually decrease the overall burden on the business, and the burn rate as well.

Again, I keep going back to it, but you're leveraging the resources, the tools, the products, and the systems that are available to you to help you and your customers fall into success, rather than trying to force success by hiring out and spending more. You can reduce your overhead while still increasing your success rate and your throughput, and again, reduce the overall financial and maintenance burden.

So that's it for today. We have about 33 minutes left, so to summarize everything: this will be a good slide to take a picture of if you want a quick memory of everything we talked about today. The four dimensions start with availability and security.

Obviously, we started out right at the very beginning with: I have to trust you, I have to trust that your application is going to be available. And we know that availability is not an absolute; it's not binary, there's a range to it. So there are different ways you can help make sure availability falls on the more satisfactory side of that range, ideally as green as possible.

And we know that the answer to that comes through isolation, whether that's how we're handling our deployments, how we're isolating our code, how we're isolating from a security perspective when it comes to multi-tenancy - all of these pieces kind of end up tying in and going hand in hand together.

And I feel like I have to say it, because it's security and it's AWS: we know that security is a day zero job, and this is zero trust, right? We want to make sure we focus on security right from the very beginning, and having that security also helps ensure availability and isolation.

And you'll see on these slides the list of features built into ECS and AWS Fargate that give you that isolation, such as the microVM boundaries and the fact that we keep old and new tasks isolated during rolling deploys to make sure you don't drop traffic. For security, the way we give each task its own network interface, the way we give each task, each container, its own IAM credentials, and how we enable you to build that tenant scoping to protect tenants from each other.

And then the last two dimensions were agility and pricing. Obviously, AWS Fargate is going to allow you to build things with a little more agility. You're going to spend a little less time on operational overhead; you're not going to have to worry about version upgrades or patches to your infrastructure. And that adds up to mean you can build more features, your devs can spend more time learning, they're going to be happier, and your customers will be happier when they get more features as well.

And then pricing, of course, is a big one as well. I know we talked about pricing just a few slides ago, but to recap: AWS Fargate helps you avoid wasted resources. Rather than having an EC2 cluster just sitting there, maybe with no tasks running on it, you can leverage Fargate for on-demand tasks and scaling, or for those tasks that don't need to be running continuously.

Though as Nathan talked about, there are also ways that you can have your long running web service applications running on Fargate and still minimize the cost.

Finally, thank you very much for hanging out with us and spending your afternoon with us. Again, my name is Jessica, this is Nathan, and if you want to talk to us more: we seriously love this stuff, we talk about it all the time, we love containers, we love open source, we love Amazon. Feel free to reach out and connect on Twitter; we'd love to talk to you. Thank you!
