A consistent approach to resilience analysis for critical workloads

John Formento: We're here for ARC313. My name is John Formento. I'm a Principal Product Manager on the Resilience Infrastructure and Solutions team. We build Application Recovery Controller as one of our services, so if you've heard of that, awesome; if you haven't, check it out. It's a recovery tool to help you with resilience.

I'm fortunate enough to have Michael Haken and Mike Golonick presenting with me. They'll introduce themselves as they come up.

A couple of housekeeping things. If you haven't seen it, we recently released the Resilience Lifecycle Framework. If you grab the QR code, you'll get access to the white paper. The genesis of this is that over the last year or so, a lot of customers asked us where they should start in their resilience journey, and this white paper outlines some steps and a framework for you to use. Our session today will focus on the Design, Implement, and Operate stages of that lifecycle.

And to quickly call out, AWS has a bunch of resilience-oriented offerings. I'm not going to go through each of them, so maybe grab a picture and look up the ones you're interested in. But I will give another shameless shout-out to Application Recovery Controller; definitely check that out.

To kick off our presentation, has anyone ever heard of the missing bullet hole story, or knows what this is? All right, a few of you. This was a new story to me; my friend Mike told me about it only a couple of months ago, so I'm very late to the party, but I appreciate the folks who raised their hands to humor me.

This is a P-38 Lightning long-range fighter plane. In World War II it was advantageous for the US government to keep these in the air, have them return home safely, and do their duty in combat. One of the questions they needed to answer was: how do we make these planes more resilient? You're probably wondering why I'm talking about this during a talk about distributed systems and resilience; there is definitely some correlation here that we'll touch on. Just give me a second.

The goal was: we need to keep these planes in the air. And there's an inherent trade-off between armoring an aircraft like this and its agility, its fuel consumption, and so on. The government had a team, the Statistical Research Group, and a guy by the name of Abraham Wald. He was the typical smartest guy in the room, and the folks came to him with this problem: we need to keep these planes in the air, we need to add armor to them, but we need to understand the trade-off of where to place the armor, because of all those factors again: added weight, maintaining agility, maintaining proper fuel consumption.

You need to weigh that against making sure the planes are operable and getting our pilots home safely. So the group came to Abraham with data that showed the distribution of bullet holes on these aircraft when they returned to base. The initial thought was: let's put the armor where the bullet holes are. Abraham tested this assumption; it just didn't seem right to him. Being a mathematician and a statistician, he basically said, let's look at this as an equation: if a plane is hit in the engine, for example, set the probability of it returning home to zero, and assume it's not coming back.

So if we set that, where would the distribution of bullet holes end up? It ended up looking something like this. What he determined was that the planes hit in the engine, in the pilot compartment, and places like that weren't returning home; they were shot down. And that really focused the attention on the critical parts of the system.

So again, we really had to test the assumption and, like the Amazon leadership principle, dive deep to understand where we need to be thinking critically about the resilience of our system and make the right trade-offs. They ended up adding more armor to those places. With our systems, we need to do the same thing: peek around corners, understand where the vulnerabilities are, and make the right trade-offs across a variety of factors. I'll show some trade-offs in a second, but the goal is to make sure we're getting the most value out of the resilience we're building into our systems.

And that leads us to the Resilience Analysis Framework. This was a joint effort by a few folks over the last year, the culmination of customer and internal conversations about how we can make our systems more resilient. We defined a model that looks at common failure modes, or categories of failure, that we've seen. I'm going to talk about the theory of the framework and what you need to do to analyze your systems. Then Michael Haken will talk about some real-world customer examples, scenarios we were involved in and how we used the framework to implement corrective or preventative mitigations to increase resilience. And then Mike Golonick will talk about how our team, Application Recovery Controller, has operationalized a framework like this.

So why should you care? First off, there's a lot in the white paper the QR code will take you to that we won't cover. But the reason you should care is that it provides a repeatable process you can use now to assess and increase the resilience of your workloads. You can use it across your organization or across various workloads, and because you're using the same framework and the same process, it becomes repeatable, something you can operationalize and run weekly, monthly, quarterly, or whatever cadence makes sense for you. Mike will share a little more about how we've made it work.

It also guides you toward recovery-oriented architectures, where you think about a particular failure category and how a mitigation can address a whole slew of failure modes within that category, and it helps guide you toward the right mitigations and the right observability strategies to align with your resilience goals.

Excuse me. So to start off, we modeled the desired resilience properties we would want for our workload. The first is fault isolation: we want to intentionally build fault isolation boundaries that allow us to scope, or limit, the blast radius of a potential failure. We want sufficient capacity: the workload should be able to handle the demand being thrown at it, whether that's CPU cycles, memory, you name it. We want timely output, which means that when a client makes a request, they get the result within their expected time frame; that can be the difference between a client timeout being exceeded and the application being perceived as unavailable, versus getting the response within the appropriate timeout. And we want correct output.

Essentially, we want the expected outcome of the system; sometimes incorrect or incomplete results are worse than returning nothing at all, so we need to make sure our systems are providing the output as expected. And then redundancy: two is sometimes better than one. If a component fails, we want redundant components, or spares, to take over in the event of a failure.

Taking those desired resilience properties, we created the SEEMS model. It stands for Shared fate, Excessive load, Excessive latency, Misconfiguration and bugs, and Single point of failure. Each one of those failure categories violates one of the resilience properties.

Shared fate is a failure that goes beyond, or spreads across, your fault isolation boundaries, which is not what we want: if you have a fault container that you expect to use for blast radius mitigation, and a failure spans multiple fault containers or fault isolation boundaries, that defeats the purpose. Excessive load violates sufficient capacity, where all of a sudden we get a spike in traffic we didn't anticipate; how can we handle those types of failures? Excessive latency can manifest in a number of ways, whether it's a network device failure, slow hardware, or whatever it may be; how do we make sure we're providing timely output and not exceeding client timeouts? Misconfiguration and bugs, I'll call out, is probably the most popular failure category we've come across: a deployment failure, someone changing a security group by accident, or patching an EC2 host can cause the output to not be as expected. And the last one is single points of failure: we want to minimize any single points of failure in our systems, components whose failure could make the whole system unavailable.

Now that you understand the failure categories and the desired resilience properties, the first step in actually using the framework is understanding the workload. This is just an example of an in-app purchase validation solution that someone built on AWS. We typically start with an architecture diagram like this, with numbers that show the data flow, the life of a packet or a transaction if you will. Then we really want to break it down into atomic user stories, the different functions the system provides. This is a very simple system, a few services and two flows, but you can imagine that in an e-commerce website or a retail banking website the user stories are much more vast than this. Here we have a user story for in-app purchasing and a user story for refunds. What this allows you to do is prioritize which user stories are most valuable to your business and focus your resilience analysis work on those. You're not trying to boil the ocean; you can make meaningful progress with your analysis.

So that's the first step: understand the workload and figure out where you're going to focus. Within those user stories and diagrams, you really want to make sure four components are called out. The first is your code and config; think of that as your application code or any kind of configuration your system uses. The second is your infrastructure: any of the AWS services or on-premises components, really anything that's driving your system. Then your data stores: databases, EBS volumes, anywhere data is stored. And finally any external dependencies. These could be a team within your organization whose API you consume, or a third party, such as an ISV API; maybe you need weather data for your system, and that would be categorized as an external dependency. The main point isn't simply to check that the database components are labeled; the point is that as you're thinking through, diagramming, and understanding your system, you understand where these components sit and how they interact with one another. You'll use that as part of the analysis.

So now you understand the workload, you have it modeled and diagrammed, and the group you're working with has a common understanding. Next you need to think about how to use that framework, the SEEMS model, on the workload to mitigate potential failures. The first part, after you've started asking questions about the failure modes we talked about, is figuring out the trade-offs you need to make.

And this is the same as for Abraham with the aircraft. For our systems, the first is usually a cost and effort component: we talked about redundancy, and adding replicas or spares increases cost, and there's the effort of implementing some of the advanced resilience patterns Mike will talk about in a few minutes. There's also complexity involved, where you may need more advanced CI/CD systems that deploy in certain ways to help reduce misconfiguration and bugs, or more enhanced observability to understand different failure modes and detect them, ideally with leading indicators rather than lagging indicators. There's increased operational burden, where you need more sophisticated runbooks and on-call processes, and you need to make sure teams know what to do during an event. All of this is a trade-off you're going to need to make on your critical systems. And the last is consistency and latency: these systems often start thinking about a multi-region approach or a DR strategy, and there's usually a trade-off of consistency versus availability once you start thinking about the CAP theorem and asynchronous replication, a trade-off you're going to need to reason about.

So now you've seen SEEMS, the failure categories, you've looked at the workload, and you're starting to understand how those failure modes can manifest.

We've looked at some of the trade-offs of implementing a mitigation. Now, for a particular failure mode you've reasoned about, you need to think about the likelihood that it could happen. Should we invest our time in this particular failure mode? Likelihood is all about the possibility that it could occur.

For example, if you have an EC2 instance, over the lifetime of that instance you're probably going to patch it and perform maintenance on it, so there's a decent likelihood that at some point something goes wrong and the instance is unavailable. On the flip side, if you're using RDS and you have multiple replicas, there's probably not a high likelihood that it's going to be unavailable, because if something goes wrong you have a replica to take over and you're good to go.

The other piece is impact, where you want to understand the harm, the reputational or commercial damage, if that failure manifests. Thinking about how likely a failure is to happen versus the impact if it does happen, you can make a judgment about which failure modes to focus on and continue developing mitigations for.
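To make that likelihood-versus-impact judgment concrete, here is a minimal, purely illustrative sketch in Python; the failure modes, the 1-to-5 scales, and the scores are hypothetical (not from the talk), and the point is only to force an explicit, comparable ranking:

```python
# Hypothetical example: rank failure modes by likelihood x impact.
# The failure modes and 1-5 scores below are illustrative only.
failure_modes = [
    {"name": "EC2 instance patched and rebooted", "likelihood": 4, "impact": 2},
    {"name": "Metadata DB overwhelmed by traffic spike", "likelihood": 3, "impact": 5},
    {"name": "Bad deployment hits all AZs at once", "likelihood": 2, "impact": 5},
    {"name": "RDS Multi-AZ primary fails over", "likelihood": 2, "impact": 1},
]

# Simple risk score: likelihood * impact; higher scores get attention first.
for fm in sorted(failure_modes, key=lambda f: f["likelihood"] * f["impact"], reverse=True):
    print(f'{fm["likelihood"] * fm["impact"]:>2}  {fm["name"]}')
```

Whatever scale you use, the value is in making the comparison explicit rather than investing wherever the most recent incident happened to be.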

And just to show here, there's typically a point of diminishing returns: you can only eke out so much more resilience from additional time and effort given those trade-offs. The top of that curve is usually reserved for your more critical systems. If you have something like an internal directory for looking up colleagues in your organization, you probably don't want to spend a ton of time and effort making it extremely resilient. But if you have a system that's critical to your business and driving a lot of revenue, you probably want to focus more of your resilience effort there, and specifically on the critical user stories that drive your business.

The next part, now that you've figured out the failure modes and where you want to focus, is to think about how you would observe these things. You really want both leading indicators and lagging indicators if you can. A leading indicator means you're aware that something is changing in the system and you're approaching a threshold that could cause an unavailability scenario, but it hasn't happened yet.

For example, this could be monitoring connection counts to a database and noticing they're approaching a certain threshold, at which point you could implement throttling or terminate the least-used connections to stop a potential issue from happening. A lagging indicator would be an alarm on connections failing to the database, as an example. Again, you want both kinds of metrics available if you can.
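As a concrete illustration, here is a minimal sketch of those two kinds of alarms using boto3 and CloudWatch. The instance identifier, thresholds, SNS topic, and the custom error metric are assumptions for illustration; the leading alarm watches connection count approaching a limit, while the lagging alarm only fires after connection errors are already occurring:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
SNS_TOPIC = "arn:aws:sns:us-east-1:123456789012:ops-alerts"  # hypothetical topic

# Leading indicator: database connections approaching the instance's limit,
# so we can throttle or shed connections before clients start failing.
cloudwatch.put_metric_alarm(
    AlarmName="db-connections-approaching-limit",
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "orders-db"}],  # hypothetical instance
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=800,  # assumed ~80% of a 1000-connection limit
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SNS_TOPIC],
)

# Lagging indicator: connection failures reported by the application
# (a hypothetical custom metric) -- this only fires after impact has started.
cloudwatch.put_metric_alarm(
    AlarmName="db-connection-errors",
    Namespace="MyApp/Database",   # hypothetical custom namespace
    MetricName="ConnectionErrors",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[SNS_TOPIC],
)
```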

Now that you understand how you would observe them, we need to think about how we would mitigate them. There are two common approaches: preventative and corrective. Preventative means putting a mitigation in place that stops an outage or failure from occurring. For example, if you have excessive load on your infrastructure, you might use throttling or load shedding to protect the system, so you never reach the point where it's completely unavailable. The other approach is a corrective mitigation.

As an example, you might start to see traffic rising and understand that you're going to hit excessive load over the next 24 hours, so you automatically scale to start adding capacity. That would be an example of a corrective mitigation.

And so with that, I'm going to pass it over to Mike, who's going to share some customer examples. Thank you.

Michael Haken: Thanks, John. We got a little confused with two microphones up here, because I'm Mike too. Here we go. All right, you all get to watch me wave my hands around with a mic and a clicker.

All right, hopefully this is better. So we have a sharded database, and we have a metadata database that maintains a map of the shards. This is how the nodes in this fleet find which database they need to talk to. The containers themselves, the software, would lazy load the mapping data from the metadata database: when they got their first request, they would reach out to the metadata database, get the mapping, and then connect to the database they needed.

So we wanted to ask the question about excessive load. In a normal scenario, the user says, hey, I want to find Smith; we go to the worker fleet, the fleet goes to the metadata database, it returns the database mapping, we find that Smith, for the letter S, is on DB4, and we connect to that database. But what about when this happens: a whole lot of people show up at once, they all try to do a lookup, all of the worker nodes try to connect to that single metadata database, and it crashes. They had a really plausible excessive load scenario that would take down the entire system and make it unrecoverable.

So what we suggested was looking at a pattern we call constant work. Instead of looking up the metadata when they receive a request, we could have all of the worker fleet nodes pull that mapping from something like an S3 bucket every 60 seconds. It looks like this: they just continually do the same thing. That removes any unanticipated load, because they're always doing the same thing over and over. Regardless of how many customer requests there are, we don't change the amount of load on the metadata service.

Constant work has a couple of benefits. It gives us consistent, predictable performance, and it gives us a property we call antifragility, which means that when something goes wrong, when something breaks, we actually end up doing less work. When things go wrong in the system, we reduce the amount of work being done instead of increasing it. So constant work could help them solve this excessive load problem with the metadata database.
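Here is a minimal sketch of that constant-work loop, assuming the shard map is published to a hypothetical S3 bucket and key; the worker refreshes the map on a fixed cadence and serves lookups from its in-memory copy, so request volume never changes the load on the metadata store:

```python
import json
import threading
import time

import boto3

BUCKET = "shard-map-bucket"   # hypothetical bucket the metadata service publishes to
KEY = "shard-map.json"        # hypothetical object, e.g. {"s": "db4", "t": "db2", ...}
REFRESH_SECONDS = 60

s3 = boto3.client("s3")
_shard_map = {}

def refresh_loop():
    """Pull the full shard map every 60 seconds, whether or not anything changed."""
    global _shard_map
    while True:
        try:
            body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
            _shard_map = json.loads(body)
        except Exception:
            # Keep serving from the last known-good map if a refresh fails.
            pass
        time.sleep(REFRESH_SECONDS)

def lookup_shard(name: str) -> str:
    """Serve lookups from memory; no call to the metadata store per request.
    A real implementation would block or fail fast until the first load succeeds."""
    return _shard_map[name[0].lower()]

threading.Thread(target=refresh_loop, daemon=True).start()
```

Because the refresh runs on a timer rather than per request, a spike in customer traffic changes nothing about the load placed on the shard map store.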

But like John mentioned, we have to make trade-offs when we want to improve resilience. One of the trade-offs with constant work is potentially wasted work: we do the same work even when nothing has changed. We have to accommodate that and understand what it does to cost, because all of those requests have some cost associated with them, and the data transfer has some cost associated with it. So we need to consider the trade-off: is the resilience improvement of implementing constant work to solve this problem worth it?

So that was our first example. Our next one is a customer that wanted to set up a multi-region synchronous replication system. They wanted to meet an RPO of zero, a zero recovery point objective, which means no data loss across regions. They were going to use synchronous replication across three different regions. John mentioned CAP earlier; has anybody heard of the CAP theorem? A couple of people.

What the CAP theorem says is that there are three properties, consistency, availability, and partition tolerance, and you can only have two of the three. In most distributed systems we have to accept partition tolerance, because we have multiple servers and multiple components, which means we have to choose between consistency and availability.

Well, this customer wanted to try to beat the CAP theorem. They said, during a failure, when one of these regions isn't available, we're going to update the configuration to remove the impaired region and restore availability, because in a synchronous replication scheme, if any one of the nodes fails, the entire system fails. So they wanted to make a runtime update to this configuration during a failure. Then, once the impaired region recovered, they wanted to sync data back to it, meaning they would have some period of inconsistency anyway.

They wanted to achieve all of these properties at the same time, and when we looked at it, what we saw was a really complex system that was going to be fragile. Ultimately we wanted to test for cascading failure: do we have shared fate? So we looked at how this replication affected fault isolation.

Here we can see our three regions, A, B, and C, using synchronous replication. If region B is impaired, having a bad day for whatever reason, maybe that database node failed, and a user goes to do a write in region A, they can't; the operation fails. What that means is we have a cascading failure from region B to region A. So what they would do is take the impaired node and remove it from the replication configuration.

That database would then be inconsistent until it was able to catch up after it recovered. So instead we said, if you really do want fault isolation, if we don't want cascading failure, we could use asynchronous replication. Now, when region B is having a bad day, we can still perform writes in the other regions: we can still take a write in region A and asynchronously replicate it to region C.

And when region B recovers, we'll have some catch-up replication to do, but it will eventually get back to a caught-up state, with some minimal lag in that asynchronous replication. So to achieve fault isolation across those regions and prevent cascading failure, we planned to use async replication, and that also helps reduce latency, because we don't have to wait for every write to be replicated to the other regions.
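As one way to picture the asynchronous option on AWS, here is a minimal sketch that creates a cross-Region RDS read replica, which replicates asynchronously from its source. The instance identifiers, Regions, and instance class are placeholders, and this is just one of several ways to get async cross-Region replication, not the specific design from this customer engagement:

```python
import boto3

# Create the replica in the secondary Region (us-west-2 here, as an example),
# pointing at the primary instance in the source Region. RDS cross-Region
# read replicas replicate asynchronously, so writes in the source Region
# do not wait on the replica.
rds_secondary = boto3.client("rds", region_name="us-west-2")

rds_secondary.create_db_instance_read_replica(
    DBInstanceIdentifier="orders-db-replica-west",            # hypothetical replica name
    SourceDBInstanceIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:db:orders-db"     # hypothetical primary ARN
    ),
    DBInstanceClass="db.r6g.large",
    SourceRegion="us-east-1",  # lets boto3 handle the cross-Region copy setup
)
```

Watching the replica lag metric then tells you roughly how much data you would stand to lose if you had to fail over at any given moment.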

Asynchronous replication gives us faster performance and improved availability: we can still take writes and reads in the remaining regions even when one of them is impaired. But our trade-off here was consistency. We accepted that we weren't going to achieve an RPO of zero in this type of scenario; we'd have to deal with some inconsistency.

And we decided to accept that, because even the original design already accounted for an inconsistent state: when one of those nodes failed, it would still have to catch up. So we made a decision to accept that potential amount of data inconsistency. It's not really data loss; the data still persists on a database in the primary region where it was written. But if we had to fail over, we might have to deal with a period of time where we didn't have all of the data available.

That was one of the trade-offs we had to make. All right, let's look at example three, a distributed storage system. This service has data stored across multiple storage nodes, and the data is replicated across those nodes. Read requests issued by users of the system can land on any one of the replicas, and the requests are generally routed to a random storage node by front-end request routers.

So what does this look like? A user performs a read and lands on one of the storage nodes. They may issue a second read and go to a different node. But what if one of these nodes is slow? We wanted to ask what happens when we have excessive latency in the system.

It's possible this storage node on the bottom is experiencing a hard drive error, or has a faulty CPU or faulty memory; for whatever reason, it's responding a lot more slowly than the other storage nodes. The result is that some percentage of our reads are measurably slower than others. So our tail latency, whether that's p90, p99, or p99.9, is going to be significantly higher compared to the other nodes.

We suggested using a pattern called hedging to reduce tail latency. Hedging is a way of sending requests in parallel and racing them against each other. So now we've got that slow storage node again, but we send two read requests in parallel, and you can see the one on the top ends up being a lot faster than the one on the bottom. With hedging, we just take the fastest response.

So now, when that slow storage node responds at that higher tail latency, we don't care, because we already got a faster response from our other request. Hedging gives us the ability to reduce tail latency. It also masks transient failures, because that storage node being slow and being completely unavailable look exactly the same to the client: they don't know whether it was slow or simply failed to respond, because they were only looking for the first response.

They just take the fastest one. The challenge with hedging, our trade-off, is that we're doing additional work. Instead of one read request, we're making two or three requests, depending on how much you want to hedge. That's additional work, and potentially additional cost.

The other thing you have to be really careful with when hedging is that those requests have to be idempotent. Idempotent means that no matter how many times you issue the request, the same result occurs. Read requests are idempotent by nature, so you can do this for reads. Write requests, or mutating changes, may or may not be idempotent, so you have to be careful with how you implement hedging in a system.
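Here is a minimal, generic sketch of hedged reads using Python's asyncio; the replica names and fetch function are stand-ins for whatever client your storage system provides, and the simulated latencies exist only to make the example runnable. The key ideas are: hedge only idempotent requests, race the copies, take the first response, and cancel the rest:

```python
import asyncio
import random

async def fetch_from_replica(replica: str, key: str) -> str:
    """Stand-in for a real storage client call; latency is simulated for the demo."""
    await asyncio.sleep(random.uniform(0.01, 0.5))  # one replica may be slow
    return f"value-of-{key}-from-{replica}"

async def hedged_read(key: str, replicas: list[str], hedge: int = 2) -> str:
    """Send the same idempotent read to `hedge` replicas and take the fastest answer."""
    tasks = [asyncio.create_task(fetch_from_replica(r, key)) for r in replicas[:hedge]]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:            # the slow (or failed) request no longer matters
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_read("smith", ["storage-node-1", "storage-node-2"])))
```

In practice you would often delay the hedge request slightly, for example only sending it if the first request hasn't returned within roughly its p95 latency, which keeps the extra load to a small fraction of total traffic.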

All right, let's look at our fourth example: multi-AZ deployments. We have a system architected to use three Availability Zones. For this customer, going into the design, deployment speed was a priority, and the deployments target instances in every AZ at the same time. They had built the system relying on the success of the pre-production environment to validate the deployment. But what we've actually found with many customers is that their pre-production environments don't exactly mirror production.

Does anybody here have a pre-production environment that exactly mirrors production? I see three, maybe four hands. For most customers there's some disparity between those environments. So what happened was that when they made a deployment and it went bad, maybe it was a bad change, maybe it was something in the environment they didn't catch, the change would get deployed, hit all three Availability Zones at the same time, and we'd have a large-scale failure.

We wanted to ask: are the changes we're making, our deployment units, at least as small as our intended fault isolation boundaries? Here, an Availability Zone provides a fault isolation boundary. We don't expect failures in one Availability Zone to cascade to other Availability Zones; instances failing in one AZ don't affect instances in another AZ. So instead we looked at whether we could align our deployment units with Availability Zones. Now when we push a change and it goes to deploy, it goes to one Availability Zone, and if that change is bad and we have a failure there, we have a number of things we can do to correct it.

If we've built an Availability Zone independent architecture, which means we keep all of our traffic inside an AZ, we can use something like a zonal shift to move traffic away from that AZ and recover really quickly, before we've even been able to roll back the deployment. If we don't have an architecture built that way, we can certainly initiate an automated rollback based on some metric or alarm. But the key here is that our remaining two Availability Zones haven't been changed and aren't failed. And this doesn't have to be a complete failure: the deployment doesn't necessarily touch every single instance at once, but it might touch an instance in each AZ at the same time, which expands the blast radius of a failure.

So now when we deploy and see success, we can go one AZ at a time and make that a dependency, a gating check, before we move on to the next Availability Zone. Using fault-isolated deployments gives us a predictable deployment failure blast radius: we know it will only ever be a single Availability Zone in size. And it helps us preserve the customer experience; being able to quickly roll back or shift traffic away from a failed deployment unit means we can keep serving the experience we were trying to deliver without a large-scale failure.

But there are trade-offs: our deployments are potentially going to be a little slower. We have to deploy to an entire AZ at a time, watch it, inspect it, and make sure everything is healthy before moving on to the next one. It's also a little more complex to orchestrate: you have to set up your CI/CD system to be aware of which AZ it's deploying to. You can do this with something like CodeDeploy and tags. If you tag all of your instances with an Availability Zone ID, you can set up individual deployments for each one of those, but it can be a challenge, a complex thing to orchestrate within a CI/CD system.
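Here is a rough sketch of what that orchestration might look like with boto3 and CodeDeploy, assuming you have already created one deployment group per Availability Zone (for example, by filtering on an AZ-ID tag); the application name, deployment group names, revision location, and bake time are all placeholders:

```python
import time

import boto3

codedeploy = boto3.client("codedeploy")

APP = "storefront"                                           # hypothetical CodeDeploy application
AZ_GROUPS = ["dg-use1-az1", "dg-use1-az2", "dg-use1-az3"]    # one group per AZ-ID tag
REVISION = {                                                 # hypothetical revision in S3
    "revisionType": "S3",
    "s3Location": {"bucket": "my-artifacts", "key": "storefront-1.2.3.zip", "bundleType": "zip"},
}

def wait_for(deployment_id: str) -> str:
    """Poll until the deployment reaches a terminal state."""
    while True:
        status = codedeploy.get_deployment(deploymentId=deployment_id)["deploymentInfo"]["status"]
        if status in ("Succeeded", "Failed", "Stopped"):
            return status
        time.sleep(30)

for group in AZ_GROUPS:
    deployment_id = codedeploy.create_deployment(
        applicationName=APP, deploymentGroupName=group, revision=REVISION
    )["deploymentId"]
    if wait_for(deployment_id) != "Succeeded":
        # Stop here: the blast radius is one AZ. Roll back or shift traffic away
        # from this AZ (for example with a zonal shift) before touching the rest.
        raise SystemExit(f"Deployment to {group} failed; halting the wave.")
    time.sleep(600)  # bake time: watch alarms in this AZ before moving on
```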

The last example is an e-commerce site with multiple dependencies. The site uses multiple backend services to deliver content on the web page, and not every microservice is critical to the customer experience. The web service, though, currently requires every component to load in order to render the whole page.

So here's our web page; it's very fancy. We have product pictures, which call one microservice. We have our add-to-cart functionality; that's another microservice. We have suggestions for items frequently bought together, a section for items you might be interested in, and finally new promotions. Now, if the new promotions microservice fails, the whole page fails to load. That's not a great customer experience; it's not what we want.

So instead, even if not all of these microservices load successfully, we can still serve the critical user story: the most important thing is that the customer can see what they want to buy, add it to a cart, and check out. If new promotions isn't available, we just don't render it, instead of failing to load the whole page. This is the idea of graceful degradation: we degrade gracefully when components of the system are slow or unavailable. Graceful degradation helps us prevent cascading failures; like we saw earlier, we don't want a failure in that new promotions service to impact checkout, the product pictures, and so on. And it gives us the ability to provide a partial customer experience.
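Here is a minimal sketch of that idea, using hypothetical section-fetching functions in place of real microservice clients; the critical sections fail the whole page if they fail, while the optional ones are fetched with a short timeout and simply dropped from the response when they are slow or unavailable:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for calls to the individual microservices (all names are hypothetical).
def fetch_product_pictures():   return {"pictures": ["p1.jpg", "p2.jpg"]}
def fetch_cart():               return {"cart": {"items": 2}}
def fetch_frequently_bought():  return {"frequently_bought": ["case", "charger"]}
def fetch_recommendations():    return {"recommendations": ["headphones"]}
def fetch_new_promotions():     raise RuntimeError("promotions service is down")

CRITICAL = [fetch_product_pictures, fetch_cart]   # the page is broken without these
OPTIONAL = [fetch_frequently_bought, fetch_recommendations, fetch_new_promotions]

def render_page(timeout_s: float = 0.5) -> dict:
    page = {}
    for section in CRITICAL:
        page.update(section())        # let failures here propagate: we can't serve the story
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(section) for section in OPTIONAL]
        for future in futures:
            try:
                page.update(future.result(timeout=timeout_s))
            except Exception:         # slow or unavailable: drop the section, keep the page
                pass
    return page

print(render_page())   # renders without the promotions section instead of failing
```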

But again, we have trade-offs. Implementing graceful degradation means we really have to understand our dependency interactions. Understanding the dependency map across an entire service and orchestrating it to degrade gracefully can be challenging; it takes a really deep inspection of how all of your components work together.

All right. Those are examples we've seen in real life with customers, where we used the analysis framework to find potential resilience gaps and then suggested mitigations to solve them. Now I'm going to hand it over to Mike, and Mike is going to talk about doing resilience analysis at AWS.

Mike Golonick: Thank you, Mike. Let's see if my microphone works. It does, great. My name is Mike Golonick. I'm a senior software engineer at AWS, and I work in the Resilience Infrastructure and Solutions organization. I'll tell you how my team and I implemented RAF in my organization, I'll share some lessons learned, and most importantly, I'm going to help you answer the question of whether you should implement RAF in your organization.

One service where we implemented RAF is Application Recovery Controller routing control. Customers use routing control to recover applications from regional-level infrastructure or application failures. They use routing control to initiate a failover and update DNS records quickly and reliably. This diagram just shows one of the customer architectures. Routing control offers a 100% SLA; it is designed to operate even in the most extreme scenarios, like two regions being impaired, or one region being partitioned away from another.

Another service where we implemented RAF is zonal shift, which, like routing control, is part of Application Recovery Controller. It is designed to help customers like you recover your applications from impairments within a single Availability Zone. Zonal shift is free, and it is integrated with AWS Elastic Load Balancing, which makes it very easy for you to shift traffic away from an impaired Availability Zone without taking any control plane dependencies.

On this diagram, a typical multi-AZ application is seeing elevated error rates in AZ 3, and the customer uses the zonal shift API to shift traffic away from the EC2 instances in AZ 3 to the other Availability Zones. Behind the scenes, zonal shift flips DNS records in Route 53 without taking any control plane dependencies, and in a constant-work manner, which means it doesn't matter how many customer requests there are or how large the applications are; it will still work.
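As an example of that recovery path, here is a minimal sketch of starting a zonal shift with boto3; the load balancer ARN, the AZ ID, and the expiry are placeholders, and the resource has to be one that is registered with zonal shift (for example, a supported load balancer):

```python
import boto3

client = boto3.client("arc-zonal-shift")

# Shift traffic away from the impaired AZ for a bounded period.
response = client.start_zonal_shift(
    resourceIdentifier=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "loadbalancer/app/my-alb/1234567890abcdef"       # hypothetical ALB ARN
    ),
    awayFrom="use1-az3",    # the AZ ID seeing elevated error rates
    expiresIn="3h",         # shifts expire automatically; extend or cancel as needed
    comment="Elevated 5xx rates in use1-az3",
)
print(response["zonalShiftId"], response["status"])
```

When the Availability Zone recovers, you can cancel the shift, or simply let it expire, and traffic returns to all zones.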

Zonal shift runs at a large scale, millions of load balancers, millions of resources, and it creates value for you precisely when there is bad weather in an Availability Zone, all while running in the region where that Availability Zone might be impaired. It is obviously challenging to operate, and to keep building features in, such a service, and it is no surprise that we chose to apply RAF to think about the resilience of this service proactively.

As John mentioned, resilience comes in all shapes and sizes, from airplanes to applications. In the case of zonal shift, it must work at the time when there is an Availability Zone impairment; that is what it is for. Our customers must be able to rely on it at the time when the planes of other AWS services might be impaired.

For many applications, the biggest resilience concern is availability, which is often measured as a failure ratio. Not so much for zonal shift: while we really care about how many 500s you see, we prioritize your ability to start a zonal shift. Perhaps if the very first request to start a zonal shift fails, that is okay, as long as the second request, the first retry, succeeds. It is important to understand what is truly important to your application, and it is never simple: durability, consistency, availability, upstream availability, downstream availability. You should clearly define your customer workloads first, and then work backwards from what is important to those workloads resilience-wise.

The process we institutionalized in my organization is that we meet every other week. We have a defined agenda, but we also have a facilitator who is responsible for adjusting that agenda to the particular meeting and the particular conversation. We have product folks in the meeting, but the process is driven and owned by engineering.

As we go through the process, we end up with a number of documents. At Amazon, we have a culture of writing that we believe is important, and by taking the time to write the failure mode documents and review them with the team, we gain a deeper understanding of the problem than if we had just talked through it.

So you're listening to me, and I can hear you thinking: well, this RAF thing is heavy, process, documents. And you are right: RAF requires a substantial investment of engineering effort. But you operate a critical application that must run reliably at all times. Your application is complex, it has miscellaneous dependencies, and it deals with constantly changing customer traffic.

So let me ask you one question: in your software development lifecycle, do you dedicate time to thinking about resilience proactively after the application launches? I have no doubt that all of us do retrospectives after events. One of the reasons we institutionalized RAF in my teams is that we wanted a more proactive way of thinking about resilience for our critical services.

We remind ourselves that resilience is not a destination, it is a journey. We cannot just design for resilience once.

Now, what did we learn from this RAF implementation? It is no surprise that, just like any other initiative, RAF requires executive-level support. It is also not a surprise that RAF needs engineers who are willing and able to dive deep into hypothetical scenarios and reason about them. That comes down to having the right people on the team, people with a passion for resilience.

We learned that RAF is a process for engineers just as much as it is for stakeholders; it helps engineers learn about the behavior of their systems and develop a resilience mindset. But RAF becomes fully fruitful when it is integrated with your operations and feature development processes, and you have to commit resources not only to executing the program but also to improving your systems in the areas your team identifies as most important.

Let me share an example from my team. One day our on-call engineer mitigated a deployment issue in one of our application component environments. While mitigating that issue, they realized that the root cause was intermittent failures of a noncritical dependency. The engineer called this out during our regular RAF meeting, we reviewed the failure mode documents with the team, had a conversation, and realized that the probability of this noncritical dependency slowing down our ability to scale the application up quickly was higher than we originally thought, which meant that some of the specific failure modes carried higher risk than we had originally assessed.

As a result, we reprioritized work to remove this noncritical dependency from the application startup critical path. If not for RAF, that risk would have stayed with us longer.

So let me summarize. You are operating a mission-critical application. You care greatly about running this application reliably at all times. Your architecture follows best practices, you have well-designed SOPs, standard operating procedures, and you have stellar operators. But your application is complex: it deals with constantly shifting traffic patterns, because customers do different things, and it has miscellaneous dependencies.

If you see value in thinking about resilience proactively, and if you subscribe to the idea that resilience is a journey, not a destination, then you should consider implementing RAF. Once you do, trust that your team will identify opportunities to improve your application, and your team will expect you to take advantage of those opportunities.

Thank you for your time, from the three of us.
