Detecting and mitigating gray failures

Mike Hagen: Detecting and mitigating gray failures. My name is Mike Hagen. I'm a senior principal solutions architect at AWS and I work with lots of customers on a lot of different resilience topics, but one of my favorites to talk about is gray failures. Just a quick show of hands. Has anybody here heard of gray failures before? Ok. A couple. Anybody experienced a gray failure? Ok. One or two. Awesome.

So today we're going to get into what gray failures are, how we can find and detect them, and what we can do about them. To start off, I've got a representation here of an availability dashboard from one of our services, and you can see that service availability dropped a little bit below 98%. Can anybody guess how many servers were impacted to cause this event? I'll give you the answer: one single host caused a 2% drop in service availability. And we experience these in lots of different ways: a single host with faulty memory, a single host with a faulty CPU. We even experience these types of things in software. I've had a single host with a Java classpath conflict; Java loads classes non-deterministically, so we can find random ways that things fail on single hosts. These are all gray failures.

So let's understand a little bit more what I mean when I say gray failure. Gray failures are defined by this idea of differential observability. What differential observability means is that health is observed differently from different perspectives. The underlying system that you use may not see the impact at all, or the impact may not cross a threshold, but you as the user of that system may be disproportionately impacted. And so what this means is that we can't rely on that underlying system to detect and mitigate the failure. We're going to have to take action ourselves.

Let's look at an abstract model to try to make this a little bit more concrete. We have a system and it has some core business logic, right? The thing that it does for its customers. It also has an observer, which is going to be our failure detector. It may have metrics reported to it, or it may pull metrics from the system. And then we have a reactor, and the reactor is the thing that's going to fix the problem when the observer detects it. So the system is going to make observations about its health, but it also has users, in this case app one, app two, and app three, and they're going to make their own observations about the health of the system. In this case, we're going to look at average latency.

And so app one is seeing an average latency of 50 milliseconds, app two 53, and app three 56. And so our system sees an average overall latency of 53 milliseconds. Now, let's say that the system has a threshold: if average latency for the entire system goes above 60 milliseconds, that reactor is going to kick in and fix the problem. Now, let's say that app one starts seeing an average latency of 70 milliseconds. This brings the system's average latency up to 59.66, so we don't cross that threshold, the reactor doesn't kick in, and the system is not going to do anything. But app one is seeing a 40% increase in latency, and that could be really significant, right? That could exceed its timeout. It could have cascading effects on other dependency calls. A 40% increase can be a big deal. And so this is a gray failure.
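
To make that arithmetic concrete, here's a minimal sketch using just the numbers from this example (the app names, baselines, and the 60 millisecond threshold are the ones on the slide):

```python
# Minimal sketch of the example above: a per-app latency regression that the
# system-wide average hides. Numbers and the 60 ms threshold come from the talk.
latencies_ms = {"app1": 70, "app2": 53, "app3": 56}   # app1 jumped from 50 to 70
baseline_ms  = {"app1": 50, "app2": 53, "app3": 56}

system_average = sum(latencies_ms.values()) / len(latencies_ms)
print(f"system average: {system_average:.2f} ms")     # about 59.67 ms, under the threshold

SYSTEM_THRESHOLD_MS = 60
print("reactor fires:", system_average > SYSTEM_THRESHOLD_MS)   # False

# But from app1's own perspective this is a 40% regression.
for app, now in latencies_ms.items():
    increase = (now - baseline_ms[app]) / baseline_ms[app] * 100
    print(f"{app}: {increase:.0f}% latency increase")
```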

And what we're really talking about is this top right quadrant. The system's perspective is that it's healthy, nothing's wrong. But from the app's perspective, the customer perspective, it's unhealthy. And failures can move across these quadrants. It might start as a gray failure and become a detected failure and then go to all good, or it could start as a detected failure and move to being a gray failure. The idea here is that even though systems may eventually notice that something's wrong, they may not react fast enough. And so if you want to be resilient to these types of failures, you're going to want to be able to detect them quicker and react faster than the underlying systems that you depend on.

Gray failures are typically going to happen along some kind of fault isolation boundary, and we have lots of these in AWS. We have regions, right? Regions contain failures so that failures don't cascade to a different region. We have Availability Zones; Availability Zones contain failures. Instances, containers, right? We wouldn't expect a fault on a single EC2 instance or a single EBS volume to affect other instances. And finally you have software modules or system components; we wouldn't expect a bug in your list products function to affect your checkout function, right? These are all things that contain failures.

Today, I'm going to focus on Availability Zones and instances as the fault containers where we may see gray failures, because this is where we typically see gray failures occur the most.

So let's talk about single host gray failures. Here I have a fleet of EC2 instances sitting behind a load balancer. Let's say a fault happens inside one instance and it can no longer respond to customer requests; it starts returning 500 errors to customers, but it still responds to health checks. And so the load balancer thinks the instance is healthy because it can reply to its health check, but when it gets customer requests, it produces errors.

Here's another example. We have a fleet of EC2 instances and they're using DynamoDB as a distributed lock table. They reach out to the table to try to establish a lock on a certain piece of data, right? So they write their entry in there with a heartbeat timestamp and the piece of data that they're trying to lock on. Now, let's say this instance fails and it can't actually process the work that it was trying to do, so it can make no progress on the data that it's locked. But it's still at least healthy enough that it can continue to heartbeat to this table. So it continues to update the timestamp in the table, which means that nobody else can get access to the data that it's locked on, but the instance actually can't complete the work that it was trying to do. Two examples of single host gray failures.
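
As a rough illustration of that lock-table pattern, here's a minimal sketch using a hypothetical DynamoDB table called `work-locks`, a conditional write, and a heartbeat timestamp. The table and attribute names are assumptions, not the exact schema from the talk:

```python
# Sketch of a lock-with-heartbeat pattern on DynamoDB (hypothetical schema).
import time
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")
LEASE_SECONDS = 300  # a lock is considered abandoned after 5 minutes without a heartbeat

def try_acquire(lock_key: str, owner: str) -> bool:
    now = int(time.time())
    try:
        ddb.put_item(
            TableName="work-locks",
            Item={
                "lock_key": {"S": lock_key},
                "owner": {"S": owner},
                "heartbeat_epoch": {"N": str(now)},
            },
            # Succeed only if nobody holds the lock, or the holder's heartbeat is
            # stale. A gray-failed host that keeps heartbeating never lets this
            # condition pass -- which is exactly the problem described above.
            ConditionExpression="attribute_not_exists(lock_key) OR heartbeat_epoch < :stale",
            ExpressionAttributeValues={":stale": {"N": str(now - LEASE_SECONDS)}},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

def heartbeat(lock_key: str, owner: str) -> None:
    # Refresh the heartbeat, but only if we still own the lock.
    ddb.update_item(
        TableName="work-locks",
        Key={"lock_key": {"S": lock_key}},
        UpdateExpression="SET heartbeat_epoch = :now",
        ConditionExpression="#o = :owner",
        ExpressionAttributeNames={"#o": "owner"},
        ExpressionAttributeValues={
            ":now": {"N": str(int(time.time()))},
            ":owner": {"S": owner},
        },
    )
```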

So if these types of things happen, we noticed that the shallow health check from the load balancer, or the shallow check of heartbeating to the table, wasn't helping us find these types of errors. So what do we do? Deeper health checks, right? If we test more deeply, maybe we can find these.

So let's talk about the difference between shallow and deep health checks. Shallow health checks check the server and the application; we may mock dependencies, and it's a truer check of whether the server itself is the source of the problem. We're only checking things that we would call on-box, right? We're not reaching out to anything external to this instance. On the other hand, a deep health check is going to check everything that the shallow health check did, but it's also going to check dependencies, off-box dependencies. And so we might make a query to a database, we might reach out and do a HeadObject call in S3, right? Whatever your dependencies are, we're going to test those to make sure that the full network path, the full user story, is being tested.
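
As a sketch of the difference, here's what a shallow versus deep health check endpoint might look like. Flask is used just for illustration, and the paths, bucket, key, and table names are hypothetical:

```python
# Sketch: shallow check stays on-box; deep check also exercises off-box dependencies.
import boto3
from flask import Flask, jsonify

app = Flask(__name__)
s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")

@app.route("/health/shallow")
def shallow():
    # On-box only: the process is up and can serve a trivial request.
    # You might also verify critical local processes, disk, config, etc.
    return jsonify(status="ok"), 200

@app.route("/health/deep")
def deep():
    try:
        # Everything the shallow check covers, plus off-box dependencies.
        s3.head_object(Bucket="my-config-bucket", Key="healthcheck.txt")
        ddb.get_item(TableName="orders", Key={"pk": {"S": "healthcheck"}})
        return jsonify(status="ok"), 200
    except Exception:
        # Any dependency failure makes this node report unhealthy.
        return jsonify(status="dependency-failure"), 503
```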

So let's take a look at a couple of examples of how we can move from shallow health checks to deep health checks. In this case, I'm going to use an Auto Scaling group with instances talking to a dependency, all sitting behind a load balancer. With Auto Scaling, you have two options for determining the health of your instances in that Auto Scaling group. The first option is using the EC2 status checks, right? Those are going to be things like responding to ping and ARP, really just a liveness check of the EC2 instance. But you can also add ELB health checks to your Auto Scaling group, so it will also use the health checks that the load balancer performs to determine if instances are healthy.

So in our example here, we're going to perform shallow health checks, and we're going to reach out to the /health web path for these servers. And when we have a failure inside one of these instances, let's say a critical process isn't running, we're going to route around that problem because the load balancer is going to see the instance as unhealthy, so it won't send any more traffic, and Auto Scaling is going to replace it. So that's good.

But now let's say we have a connectivity gray failure. Let's say this instance can't connect to the database, or maybe 10% of its queries are timing out or getting dropped or something like that. Well, with shallow health checks, we won't route around this and customers are going to get errors, right? That load balancer is going to continue to send traffic to that instance even though it's having problems, and that's similar to what we saw with the previous examples.

So we want deeper health checks. Ok, let's use a deep health check to try to route around gray failures. In this case, we're only going to use the EC2 status checks in our Auto Scaling group, and we'll see why on the next slide. So when we have a connectivity gray failure, hey, we're going to route around it, right? Because our deep health check is checking the path connecting to this dependency, and because this instance can't connect, the health check fails and the load balancer doesn't send any traffic.

Now let's see what happens when we have a failure inside this instance. Well, it's going to get routed around, right? The load balancer is going to see it as unhealthy, but it's not going to get replaced. And that's because our Auto Scaling group is only using the EC2 status checks; it's not using the ELB health checks. Ok, so this is a problem we don't want.

So let's add in the ELB health checks. All right. So now we have a failure inside the instance. Great. This is what we want: we want to route around it and we want Auto Scaling to replace this instance. But here's the problem. Let's say we have a transient error in this dependency.

It's not the network path, it's not the instance that's the problem. It's the dependency, and it affects all of these instances. Well, now all these instances are going to appear unhealthy, right? They're going to fail all of their ELB health checks, and suddenly my entire fleet appears to be unhealthy, and Auto Scaling is going to start replacing all these unhealthy instances.

So in batches of up to 10%, we're going to start losing instances and they're going to start getting replaced. A 10% drop in your capacity could be massive in terms of impact for your customers. The last time I checked, launching an EC2 instance with bootstrapping is probably a 5 to 15 minute process depending on what you're installing and the type of instance that you're running. And so you're going to have a huge amount of downtime while these instances are being replaced.

So we don't want this, and unfortunately, it can get worse. We could have the case where we have this transient database error, but actually not all of our instances are impacted. Let's say we have one remaining instance that actually is just totally fine. Well, the ELB is not going to fail open, right? If it has at least one healthy instance, it's only going to send traffic to that one healthy instance. And so what happens here is we end up sending all of the traffic to that single host while all the other instances are being replaced, which is going to end up overwhelming it and bringing it down.

And now I have no healthy instances again. Even in an overwhelmed state, that instance could continue to pass health checks, right? It may not be able to process customer traffic, but it still replies to a health check request, and so we continue to send all of that traffic to an unhealthy instance. So this is probably our worst case scenario.

So it looks like our health checks got a little too deep, right? So what now, shallower checks? We're not allowed to use memes that we find on the internet, but you can probably figure out which one I've tried to imitate here. And so we're left in this situation: which do I choose, shallow health checks or deep health checks? What's the right answer? We just saw that there were problems with all of the things we just proposed.

So let's look at our tradeoffs. If you want to use deep health checks, don't integrate them with your Auto Scaling group, because we want to prevent unnecessary instance termination. Deep health checks allow us to route around gray failures without terminating instances, but they're going to require an additional mechanism to give us a holistic picture of that local instance health.

One of the patterns that we use to do this is called the heartbeat table pattern, and we'll talk about that next, because we want to be able to determine for Auto Scaling whether an instance locally is unhealthy and should be replaced, and the deep health check doesn't necessarily tell us that. If you use shallow health checks, you should always integrate them with the Auto Scaling group; there are no downsides to doing that. Shallow health checks help ensure that unhealthy instances are routed around and are also replaced by Auto Scaling. They can help prevent terminating instances due to transient dependency problems, right, because we're not looking off-box. But they'll require us to build an additional mechanism to deal with gray failures.

And so neither deep nor shallow health checks are right or wrong, and I'm not going to tell you which one you should or shouldn't pick, but you need to understand the tradeoffs. There are right and wrong ways to implement them, and regardless of which one you pick, you will likely need another mechanism to either deal with local instance health or with finding and mitigating gray failures.

So if you want to use deep health checks, you can use something like this to determine local instance health. Every time one of these instances in this Auto Scaling group receives a health check from the load balancer, it writes into a DynamoDB table: its instance ID, the time at which that heartbeat is being made, whether it considers itself healthy (are all my critical processes running? Is my CloudWatch agent running? Is my CodeDeploy agent running? Is httpd running?), and maybe its AZ ID, maybe some other information.

And so you can see that the instance here on the bottom, i-543210, last reported at 11:55, and our current time is 12:01. Every minute this Lambda function is going to run, query the table, and look for stale entries. And when it finds this entry that's more than five minutes old, it's going to call the Auto Scaling API to set instance health and terminate that instance. And so this is a way that we can use deep health checks and also still have a good picture of how the instance is performing locally without its dependencies.
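
Here's a minimal sketch of that heartbeat table pattern with boto3, assuming a hypothetical table named `instance-heartbeats` and the five-minute staleness threshold from the example; the attribute names are my own:

```python
# Sketch: instances record a heartbeat on every ELB health check; a scheduled
# Lambda marks instances with stale (or self-reported unhealthy) heartbeats as
# Unhealthy so Auto Scaling replaces them.
import time
import boto3

ddb = boto3.client("dynamodb")
autoscaling = boto3.client("autoscaling")

TABLE = "instance-heartbeats"
STALE_AFTER_SECONDS = 300   # more than five minutes old

def record_heartbeat(instance_id: str, az_id: str, healthy: bool) -> None:
    """Called from the instance's /health handler."""
    ddb.put_item(
        TableName=TABLE,
        Item={
            "instance_id": {"S": instance_id},
            "heartbeat_epoch": {"N": str(int(time.time()))},
            "healthy": {"BOOL": healthy},
            "az_id": {"S": az_id},
        },
    )

def reaper_handler(event, context):
    """Scheduled (for example, every minute) Lambda that replaces stale instances."""
    cutoff = int(time.time()) - STALE_AFTER_SECONDS
    for page in ddb.get_paginator("scan").paginate(TableName=TABLE):
        for item in page["Items"]:
            last_seen = int(item["heartbeat_epoch"]["N"])
            if last_seen < cutoff or not item["healthy"]["BOOL"]:
                # Auto Scaling will terminate and replace the instance.
                autoscaling.set_instance_health(
                    InstanceId=item["instance_id"]["S"],
                    HealthStatus="Unhealthy",
                    ShouldRespectGracePeriod=True,
                )
```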

The challenge that we run into, though, is that health checks don't always tell the whole story. So here I've got some instances behind a load balancer, and let's say I have two dependencies. These dependencies could be a third party system, could be S3, could be DynamoDB. And these dependencies have lots of potential APIs that you can interface with. Let's say that my system has three paths, three functions: home, products, and checkout. Home relies on dependency one, products relies on dependency one, and checkout requires dependencies one and two.

And we're going to reach out to the /health path for our ELB health checks. But when that health check is triggered, we may only test a subset of our dependencies. If you use 50 APIs in S3 in your normal operation, it would be kind of unreasonable to test all 50 of them in a health check, right? And so maybe you can't actually test every single API in your dependency chain. So if there's a failure in dependency two, it might get missed in the health check. If you mainly call, say, S3 GetObject as one of your dependencies, and in a rare case you call PutObject and you don't actually test that in your deep health check, you might miss an impairment.

Here's the other challenge. Let's say this particular instance is having a problem communicating with dependency one and it's producing a 4% error rate. This is going to result in a 98% service availability, right? I've got two instances, one's got a 4% error rate, so I drop down to 98%. In order for the health checks to fail from the load balancer, let's say I have it configured so that I need three failed health checks in a row to consider this instance unhealthy. If only 4% of my requests are failing, each health check has a 4% chance of failing, and to get three in a row means I have a 0.0064% chance of actually catching this problem with a health check. So it's really unlikely that my ELB health checks would catch this type of problem.
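
The arithmetic behind that number is just the probability of three independent 4% events in a row:

```python
# Three consecutive failed health checks, each with a 4% chance of hitting a
# failing request, assuming failures are independent.
p_single = 0.04
p_three_in_a_row = p_single ** 3
print(f"{p_three_in_a_row:.6f}")          # 0.000064
print(f"{p_three_in_a_row * 100:.4f}%")   # 0.0064%
```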

So that might lead us to really want to use shallow health checks and build detection and mitigation tools for gray failures, because our health checks from an ELB or our EC2 instance status checks are really unlikely to catch this type of problem.

Ok. So how can we detect single instance gray failures? The first thing we need is instrumentation. We need observability of our system, and observability and instrumentation require writing explicit code. You need to write code around your business logic; you need more than what comes out of the box. AWS provides you a lot of great metrics natively for EC2 and all of our other services, things like CPU, memory, network in/out, but you really need to know what's going on inside your hosts. What are they doing? Who are they talking to? How long did it take? How much data were they processing? And we want to get both client side and server side perspectives. We want to enrich our metrics with context about what's happening.

And I mentioned that we have different fault isolation boundaries in AWS like regions, AZs, and hosts. We want to put dimensions on our metrics that are aligned to our fault isolation boundaries. So on every metric, I want to look at latency with an Availability Zone ID, I want to look at availability of my system with an Availability Zone ID, and that way I can distinguish whether there's a problem in one Availability Zone or another. Without enriching my metrics with an Availability Zone ID or with a host ID, I have no idea what the actual source of the problem is.

One of the capabilities that we provide in CloudWatch is called the Embedded Metric Format. I really like it personally because it gives you a single pane of glass to put your metrics and logs together. As developers and engineers, you don't have to write two sets of code, one to write everything to your log files and another to write custom metrics to the CloudWatch API. You can put all of your metrics directly into a log file and we automatically extract them and publish them as CloudWatch metrics. And since I mentioned the format, this is what an Embedded Metric Format log looks like. It's just structured JSON. I've got the metrics that I'm looking at here: 2xx, 3xx, and so on responses, and my success latency. You can see that I have dimensions: the name of the service that's being called, my region, my Availability Zone ID, my instance ID, and I have several sets of dimensions that I can include in here. And then I have the actual data. So this instance is in AZ 1, it responded with a 2xx, and it had a 20 millisecond success latency. It's a really powerful tool: I can now look at this log file and query all the metric data as well as look at it in dashboards.
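
Here's a rough sketch of emitting an Embedded Metric Format record from application code. The metric and dimension names mirror the example on the slide, but the exact names and namespace here are illustrative:

```python
# Sketch: write one EMF-formatted JSON line per request; CloudWatch extracts
# the declared metrics and publishes them with the listed dimension sets.
import json
import time

def emit_emf(service, region, az_id, instance_id, status_family, latency_ms):
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyService",
                "Dimensions": [
                    ["Service", "Region", "AZ-ID", "InstanceId"],
                    ["Service", "Region", "AZ-ID"],
                    ["Service", "Region"],
                ],
                "Metrics": [
                    {"Name": status_family, "Unit": "Count"},            # e.g. "2xx"
                    {"Name": "SuccessLatency", "Unit": "Milliseconds"},
                ],
            }],
        },
        "Service": service,
        "Region": region,
        "AZ-ID": az_id,             # e.g. "use1-az1"
        "InstanceId": instance_id,
        status_family: 1,
        "SuccessLatency": latency_ms,
    }
    # Writing this line to a log shipped to CloudWatch Logs (or stdout on
    # Lambda) is enough for CloudWatch to extract the metrics automatically.
    print(json.dumps(record))

emit_emf("front-end", "us-east-1", "use1-az1", "i-0123456789abcdef0", "2xx", 20)
```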

But to find a gray failure, we need to be able to compare data among EC2 instances, right? It's likely that at some point, all of your instances are going to have some level of error rate; it's generally not going to be zero across the board. But what we really want to know is: is one host having a much worse day than everybody else? And the way that we can do this and compare among instances is by using a service called Contributor Insights.

And so this is a feature of CloudWatch that allows us to write rules against log files. In this case, we're looking at HTTP status codes between 499 and 600, so we're looking for 500-level responses, and we're keying on an instance ID. This is an entire Contributor Insights rule, and it will give you a dashboard, a graph that potentially looks something like this. So I've got a bunch of instances here that all have roughly the same error rates, and in this case, if I was looking at this Contributor Insights dashboard, I would say, ok, probably nothing wrong. But if I saw this one, I'd probably be inclined to say, hey, there's something wrong with that host. It has way more errors than the others. It's potentially experiencing a gray failure. But humans can be slow, right?
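
As a sketch, a rule like the one described might be created with boto3 roughly like this; the log group name and the JSON field names (`$.httpStatus`, `$.instanceId`) are assumptions about how your logs are structured:

```python
# Sketch: a Contributor Insights rule keyed on instance ID, counting log events
# whose HTTP status is in the 5xx range.
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

rule_definition = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "LogGroupNames": ["/myservice/application"],
    "LogFormat": "JSON",
    "Contribution": {
        "Keys": ["$.instanceId"],                 # one contributor per instance
        "Filters": [
            {"Match": "$.httpStatus", "GreaterThan": 499},
            {"Match": "$.httpStatus", "LessThan": 600},
        ],
    },
    "AggregateOn": "Count",
}

cloudwatch.put_insight_rule(
    RuleName="per-instance-5xx-errors",
    RuleState="ENABLED",
    RuleDefinition=json.dumps(rule_definition),
)
```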

I don't want a human having to look at this dashboard all day to try to find out if we're experiencing a gray failure. So what we can do is create an alarm based on our Contributor Insights rules. In CloudWatch, we natively support creating alarms from Contributor Insights data using the INSIGHT_RULE_METRIC metric math function. And so here, I'm looking at the max contributor value divided by the sum of everything being greater than 0.75. What this means is I'm looking for one host that's responsible for more than 75% of the errors in the system. And if I have one host that has 75% of the errors, now I can do something about it. I can create automation based on this alarm, I can fire a Lambda function, send a message to SNS. Or we can just have a Lambda function that periodically looks at the Contributor Insights data directly, say once a minute on a scheduled event, and asks: hey, is somebody responsible for 75% of my errors?
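
A sketch of that alarm, assuming the rule name from the earlier example plus placeholder periods and SNS topic; double-check the INSIGHT_RULE_METRIC expression syntax against the CloudWatch metric math documentation for your setup:

```python
# Sketch: alarm when a single instance is responsible for more than 75% of errors,
# built from Contributor Insights data via metric math.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="single-instance-error-outlier",
    EvaluationPeriods=3,
    DatapointsToAlarm=2,
    Threshold=0.75,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:gray-failure-alerts"],
    Metrics=[
        {
            "Id": "max_contrib",
            "Expression": 'INSIGHT_RULE_METRIC("per-instance-5xx-errors", "MaxContributorValue")',
            "Period": 60,
            "ReturnData": False,
        },
        {
            "Id": "total",
            "Expression": 'INSIGHT_RULE_METRIC("per-instance-5xx-errors", "Sum")',
            "Period": 60,
            "ReturnData": False,
        },
        {
            # Share of all errors produced by the single worst contributor.
            "Id": "share_of_errors",
            "Expression": "max_contrib / total",
            "ReturnData": True,
        },
    ],
)
```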

And so what we're really doing here is outlier detection. We're trying to find an outlier in our system based on error rates. It could also be based on latency, or whatever statistic, whatever KPI is important to your system for identifying that something is going wrong. Like I said, we have a periodic or triggered Lambda consuming CloudWatch metric data or Contributor Insights data directly, and we want to determine whether the skew we're seeing is improbable. There are lots of different outlier detection algorithms you can use: chi-squared, k-means clustering, z-score (which looks at standard deviations), or you could just use a static number like 75% of the errors. Lots of different options to determine if something is out of balance.
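
And here's what that periodic Lambda might look like as a sketch, pulling contributor data from the rule above and applying a simple z-score check. The rule name, window, and threshold are assumptions, and note that with a small fleet the population z-score of a single extreme host is bounded near sqrt(n - 1), so keep the threshold modest or pick a different algorithm:

```python
# Sketch: scheduled Lambda that reads per-instance error counts from the
# Contributor Insights rule and flags z-score outliers.
import datetime
import statistics
import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(minutes=5)
    report = cloudwatch.get_insight_rule_report(
        RuleName="per-instance-5xx-errors",
        StartTime=start,
        EndTime=end,
        Period=300,
        MaxContributorCount=100,
        OrderBy="Sum",
    )
    errors = {c["Keys"][0]: c["ApproximateAggregateValue"] for c in report["Contributors"]}
    if len(errors) < 3:
        return []                     # not enough hosts to compare meaningfully
    mean = statistics.mean(errors.values())
    stdev = statistics.pstdev(errors.values())
    if stdev == 0:
        return []                     # everybody looks the same; no outlier
    outliers = [i for i, v in errors.items() if (v - mean) / stdev > 2.5]
    # These instance IDs are candidates for replacement, subject to the
    # velocity controls discussed later in the talk.
    return outliers
```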

So just two considerations when we think about using outlier detection to detect gray failures:

First, it's possible that this could hide real failures inside of your system. When bad hosts are frequently terminated, it may actually be a sign that I've got a bug in my code; maybe this actually isn't a gray failure, maybe this really is a failure. But because I'm taking some automated action to detect and respond to it, I'm potentially hiding that bug.

So we want to be careful about using automation, which means that we want to periodically review the actions that our automation takes. If we see this happening all of the time, it's unlikely that we have a gray failure that happens, you know, every hour or every day with that kind of frequency.

So if you're seeing problems like this occur regularly, it may be a signal that you need to check something that's underlying in the system.

The other risk is false positives. These outlier detection algorithms aren't perfect, and so it may be possible that they detect something that really wasn't a gray failure.

Now, for a single host, that's probably not terrible, right? If we have a false positive and terminate a host, it's not the end of the world, but it's something to be aware of.

So how do we mitigate single instance gray failures? I'm going to give you the shortest, most simple answer to this problem. It's going to take me like half a slide.

So we have a gray failure. Replace the instance. That's it. The simplest way to mitigate this problem is to launch a new instance. It's unlikely that the new instance is going to have any of the problems that the one we terminated had, whether it was faulty memory, a faulty CPU, a network problem, a race condition, or a memory leak. Replacing the instance is most likely going to fix it.

The thing to consider here is, if you're building this with automation, ensure that you apply some kind of velocity control; internally at AWS, we call these bullet counters. What we want to make sure is that we don't terminate too many instances too frequently, right?

If we see, hey, I'm about to attempt my 10th termination in 10 minutes, we stop and engage a human so they can go and check what's going on. You don't want to have runaway automation that starts terminating everything accidentally.
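
A minimal sketch of such a velocity control, assuming a hypothetical DynamoDB table that records recent remediation actions (the table, attribute names, and limits are illustrative):

```python
# Sketch: refuse to terminate more than N instances in a sliding window and
# hand off to a human instead.
import time
import boto3

ddb = boto3.client("dynamodb")
TABLE = "remediation-actions"
MAX_TERMINATIONS = 10
WINDOW_SECONDS = 600

def may_terminate() -> bool:
    cutoff = int(time.time()) - WINDOW_SECONDS
    # Count recent terminations (a single-page scan; fine for a small table).
    recent = ddb.scan(
        TableName=TABLE,
        FilterExpression="action_epoch >= :cutoff",
        ExpressionAttributeValues={":cutoff": {"N": str(cutoff)}},
    )["Count"]
    return recent < MAX_TERMINATIONS

def remediate(instance_id: str) -> None:
    if not may_terminate():
        # Stop and page an operator instead of letting automation run away.
        raise RuntimeError("Velocity limit reached; engage a human")
    ddb.put_item(
        TableName=TABLE,
        Item={
            "instance_id": {"S": instance_id},
            "action_epoch": {"N": str(int(time.time()))},
        },
    )
    boto3.client("autoscaling").set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=False,
    )
```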

Ok. So we've talked about hosts and how we can detect and mitigate gray failures on single EC2 hosts. We can apply the same logic to finding single AZ gray failures as well.

So when we have a gray failure that impacts an entire Availability Zone, there are three different ways that this failure is likely to manifest.

The first way is that we have multiple AZs that show impact, but one is much more impacted than the others. So this means that I have errors happening across all the Availability Zones that I'm in, but I have one AZ that has way more errors than the others. This is our outlier detection; this is what we just talked about for finding that type of situation.

The second possibility is that I have a really clear signal that one AZ is impacted and the others aren't. Again, we could find this using outlier detection, and we could also find this using CloudWatch composite alarms. I'll go through an example of how you can build composite alarms to find this type of situation.

And then finally, the third way that a single AZ gray failure can manifest is that all of the fault containers, in this case our Availability Zones, are impacted because of a single shared resource that's experiencing a gray failure, something like a database. If a database is in an Availability Zone that has a gray failure, the instances in all of the AZs could show impact because of that. And again, we will be able to find that kind of problem with composite alarms as well.

So let's talk about creating availability alarms for your system. In this service that I have here, I have two functions, my home function and my list products function, and these are alarm definitions. So I've got alarms for AZ 1: a three out of three and a three out of five, and I'm looking for my availability to be lower than 99.9%. I have four separate alarms here, but I don't want four separate alarms that an operator has to deal with and look at, so I can create composite alarms. Here I have an AZ 1 home availability composite alarm that's looking for my consecutive three out of three or my three out of five, my m out of n, and I have one for my list products. And I do this for all of the Availability Zones that I'm in. So if I'm in three AZs, I'll have alarms for each function in each AZ, and then I can bind this into a final composite alarm that tells me about my AZ 1 availability. If either my home availability or my list products availability goes into alarm, I know that I'm having an availability problem in AZ 1. And I'll have that for AZ 1, AZ 2, and AZ 3. You can also do this for latency or any other metric that you think is important for determining the health of your system.

So how do we use that to find isolated AZ impact? Well, I've got my availability alarm for AZ 1 and I've got my latency alarm for AZ 1, and this is probably the most complex part. What this composite alarm is saying is that I have an alarm in AZ 1 for availability or latency, but not in AZ 2 or AZ 3. So this composite alarm says AZ 1 is seeing latency or availability impact, but AZ 2 and AZ 3 aren't.

The other thing we need to know, though, is that we have more than one instance experiencing a problem. It's entirely possible to have a single bad host in a single AZ that makes that entire AZ look bad. So we can go back to our Contributor Insights rule. If I'm using Contributor Insights and I've got a rule defined looking at 5xx errors, I want to see that I have two or more unique contributors. This tells me at least two instances, if not more depending on the size of your fleet, are actually experiencing problems. That's how we rule out that this is a single host producing errors in that AZ, and that gives us the final alarm that we can actually attach to an SNS topic to notify an operator or start automation.

And so we want to say that I'm seeing impact in just AZ 1 and it's not being produced by a single instance.
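
Putting those pieces together, a sketch of the composite alarm chain with boto3 might look like this; all of the child alarm names are placeholders for the per-AZ, per-function alarms described above:

```python
# Sketch: roll per-function alarms up per AZ, then detect impact that is
# isolated to AZ 1 and caused by more than one instance.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Per-AZ roll-up: either function is impaired in AZ 1.
cloudwatch.put_composite_alarm(
    AlarmName="az1-availability",
    AlarmRule='ALARM("az1-home-availability") OR ALARM("az1-list-products-availability")',
)

# Isolated impact: AZ 1 is alarming for availability or latency, AZ 2 and AZ 3
# are not, and Contributor Insights says at least two instances are contributing.
cloudwatch.put_composite_alarm(
    AlarmName="az1-isolated-impact",
    AlarmRule=(
        '(ALARM("az1-availability") OR ALARM("az1-latency")) '
        'AND NOT (ALARM("az2-availability") OR ALARM("az2-latency") '
        'OR ALARM("az3-availability") OR ALARM("az3-latency")) '
        'AND ALARM("az1-multiple-instances-impacted")'
    ),
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:az-evacuation-alerts"],
)
```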

And while I know you're probably looking at this and thinking, man, that's really complicated, that would be a lot to set up, all of these alarms that you see here, once you set them up the first time, are static. So there's an upfront investment in creating these, but every time you add a product or a feature, you're only editing those very top level alarms and adding them to that first composite alarm. So after you build this once, this set of composite alarms remains static.

So let's talk about the second way that those errors could manifest, when we have that single shared resource like a database, right? Aurora, RDS, something like that.

I'm going to be a little less specific about what metrics you should use. Let's say I'm measuring failed transactions being greater than 1% as an example, and I'm also going to measure my p99 transaction latency, say these alarm if it exceeds 75 milliseconds. You can see I've got an alarm in each AZ, so AZ 1, AZ 2, and AZ 3, and I can create composite alarms with these as well.

And so what this is telling me is whether I have a problem with my primary database in two or more AZs. This is just a combination of all of the possibilities from my AZs: AZ 1 and AZ 2, or AZ 1 and AZ 3, or AZ 2 and AZ 3. So that's the combination of all of my AZs that tells me I'm seeing impact from my interactions with the database in at least two AZs. I can do the same thing for latency, and I create a final composite alarm that, again, we can attach to an SNS topic and notify an operator.
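
As a sketch, that "two or more AZs" rule could be expressed as a composite alarm like this, with placeholder child alarm names:

```python
# Sketch: alarm when at least two AZs see database errors at the same time,
# which suggests a shared-resource (not per-AZ) problem.
import boto3

boto3.client("cloudwatch").put_composite_alarm(
    AlarmName="database-impact-multiple-azs",
    AlarmRule=(
        '(ALARM("az1-db-errors") AND ALARM("az2-db-errors")) '
        'OR (ALARM("az1-db-errors") AND ALARM("az3-db-errors")) '
        'OR (ALARM("az2-db-errors") AND ALARM("az3-db-errors"))'
    ),
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:database-impact-alerts"],
)
```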

So now we've got a couple of different ways that we can detect gray failures in a single AZ: we can use outlier detection, and we can build composite alarms. Now that we know something's going wrong, which is the hardest part of gray failures, we can take some action to mitigate it.

So when a single AZ gray failure happens, we really have three options. The first option is you can wait it out, right? Wait for somebody else to notice, wait for it to go away. And I'll be honest, a lot of customers choose option one, and if you do, that's ok. But just realize you're going to see some impact to your customer experience while that gray failure is happening.

Our second option is, if the impairment affects just a single Availability Zone, we can evacuate that AZ, right? We can leave that AZ and stop doing work there to make the failure go away. And the third option is, if you have a multi-region disaster recovery plan, you could fail over to another region, and we have a number of customers that also choose option three.

The challenge with option three is that it's going to produce extended recovery times and the possibility for data loss or data inconsistency. If you're using any native AWS replication service like S3 Cross-Region Replication, DynamoDB global tables, Aurora Global Database, and so on, they're all going to use asynchronous replication. What that means is that the data in your secondary region has the potential to be behind the data in your primary region. So if you fail over to that secondary region, you could potentially be operating on inconsistent data, and you're going to have some nonzero recovery point objective, or RPO.

The other challenge is that failing over to another region can take a lot of time. By the time you detect that there's a problem, you engage someone to say, hey, we should fail over, and then you actually execute the failover to another region. Maybe you're using a pilot light strategy; now you have to bring up resources in that secondary region, and that can take a lot of time. Even if you're using an active-active or an active-passive setup like a hot standby, your DNS record is going to have a TTL on it, so for some period of time your customers will still be accessing your primary region until they refresh that DNS record after the TTL expires.

So there are lots of considerations in multi-region: we're going to have some nonzero RPO and a potentially longer recovery time objective compared to staying in the same region and just evacuating an AZ. Within a region, we can have strong consistency, right? We have strong consistency with S3, we have strongly consistent reads in DynamoDB, you have strong consistency in Multi-AZ RDS, you have crash-consistent snapshots of EBS volumes. So we can achieve an RPO of zero inside a single region.

And because you're probably already deployed across multiple Availability Zones, it doesn't take you any time to spin up new resources. We talk a little bit about this pattern that we call static stability. Static stability means that you don't have to make any changes or updates in order to recover from a failure. In the case of an EC2 fleet, being statically stable to an Availability Zone failure means that I would be pre-provisioned in my other two AZs to handle the loss of a single Availability Zone. And if you're configured that way, if you have your architecture set up that way, you can recover with a near-zero RTO. I don't think you'll ever be able to achieve a zero RTO, because you still have to detect that the failure happened, but you'll recover a whole lot faster if you don't have to make any changes to your environment, except for just removing traffic from a single AZ.

And so that's what we're going to talk about: how do we evacuate a single Availability Zone so that we can recover faster and with no data loss?

So we have two goals for evacuating an AZ. The first goal is to stop processing work in that Availability Zone. We want to prevent HTTP or gRPC requests, whatever kind of work your system does, from going to that AZ. If it's a batch processing type of workload, we want to stop batch processing in that AZ. Any type of work, we don't want it to happen where the impairment is.

A secondary goal is that we might want to stop deploying new capacity into that AZ if this is an extended duration event. So for example, if we're using Auto Scaling, we may not want Auto Scaling to deploy new instances into the AZ where we've stopped processing work, because the capacity that Auto Scaling launched, because we needed more, isn't actually going to take any of the load.

This is going to be the same if it's ECS or EKS, right? Pods, containers: you may not want to keep deploying them into the AZ that we've just left. And so as a secondary goal, we may want to stop deploying new capacity there.

There are two major considerations that we need to talk about for AZ evacuation. The first one is an architectural concept that we call Availability Zone independence. To effectively and completely shift traffic away from an AZ, the architecture has to be AZ independent. What that means is that we have to keep the traffic that's delivered to that Availability Zone inside that Availability Zone. And so when you think about load balancers, with ELB you have the option for cross-zone load balancing to be enabled or disabled on both Network Load Balancers and Application Load Balancers.

To have an AZ-independent architecture, you have to disable cross-zone load balancing. That way, the load balancer node that receives the traffic in an AZ will only send it to the instances or targets behind it in the same Availability Zone.

And this is true for all the rest of your interactions across your architecture. For example, VPC endpoints: if you're using VPC endpoints, we provide a top level DNS name that matches the service name, but we also provide Availability Zone-specific DNS names.

And the whole reason we want to do this is because we don't want a failure in one AZ to cascade to other Availability Zones. So if AZ 1 is having a bad day, say the VPC endpoint there is dropping traffic, we don't want an instance in AZ 3 using the VPC endpoint in AZ 1; we want it using the endpoint in its own Availability Zone. That way we can fail that fault container, that Availability Zone, completely, without having it cascade and impact other Availability Zones.

The thing that we have to be most conscious about when building Availability Zone independent architectures is database read replicas. With an RDS primary instance, you have the ability to manually fail over, right? So if AZ 1 is having a bad day and my RDS primary is in AZ 1, I can fail it over to another AZ, say AZ 2 or AZ 3. So you have control over that.

But for read replicas, you don't have that control; there's no failing over a read replica from AZ 1 to AZ 2. So if you want to keep that traffic within the same AZ, you need to set up a read replica per AZ, if you're using them. If you're not using them, then you don't have to do anything. But if you are using read replicas, you need them to be in each Availability Zone and you need to have some kind of service discovery mechanism.

That way your instances know the endpoint they should connect to for the read replica in their own AZ. You could put that in SSM Parameter Store, you could use DNS, you could use Cloud Map and do service discovery. There are a lot of different options for how you can actually store and access that data, but it's important that we keep everything isolated within an AZ to be able to effectively evacuate.
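
For example, here's a minimal service discovery sketch using SSM Parameter Store; the parameter naming convention and the hard-coded AZ ID are assumptions, and DNS or Cloud Map would work just as well:

```python
# Sketch: look up the read replica endpoint for this instance's own AZ from a
# hypothetical Parameter Store convention (/myservice/read-replica/<az-id>).
import boto3

def read_replica_endpoint_for_my_az() -> str:
    # In practice the instance would discover its own AZ ID from instance
    # metadata (IMDSv2); hard-coded here to keep the sketch short.
    my_az_id = "use1-az1"
    ssm = boto3.client("ssm")
    param = ssm.get_parameter(Name=f"/myservice/read-replica/{my_az_id}")
    return param["Parameter"]["Value"]   # e.g. the replica's endpoint hostname

print(read_replica_endpoint_for_my_az())
```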

The second consideration we have is control planes and data planes. Control planes in AWS services are the bits of software that are responsible for creating, updating, deleting and modifying resources - most of your API actions. Data planes, on the other hand, provide the day to day business of a resource.

And so for example, in S3, creating a bucket is a control plane action; getting an object or putting an object is a data plane action. When you launch an EC2 instance, right, EC2 RunInstances, that's a control plane action, and a lot happens in the background: we have to find a physical host with capacity, we've got to plumb up an EBS volume, we have to set up your IAM credentials, we have to update and assign the security groups. So a lot happens in the background.

And so control planes tend to be more complex systems with more dependencies, and just because of this, as compared to data planes, they're statistically more likely to fail. So we think that a dependency on a data plane is more reliable than a dependency on a control plane. At a higher level, we think that things that don't need to be changed are just going to be more reliable than things that do have to be changed in response to a failure.

So ultimately, in our recovery path, we want to try to rely on data planes as much as possible - it won't always be possible. But if we can use a data plane action in our recovery path, we think that's going to be more reliable than a control plane.

So let's look at how we can use data planes to evacuate an Availability Zone. Last year, we launched a feature of Application Recovery Controller called zonal shift. What zonal shift allows you to do is, for any ALB or NLB with cross-zone load balancing disabled, you can specify an Availability Zone ID and a resource ID, and we will stop sending traffic to that Availability Zone. Under the hood, we're just manipulating DNS, right?

So we're going to stop returning the IP address for the load balancer node in the AZ that you specified. So I execute my zonal shift and I take that load balancer node (this is a single NLB with an endpoint in each AZ), and we stop returning its address in DNS responses for www.example.com.

Zonal shift will remove that IP address, and because I have an AZ-independent architecture here, no more traffic goes to AZ 3 and I only process customer work in AZ 1 and AZ 2. So that's the simplest, lowest cost option for evacuating an Availability Zone.
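
Starting a zonal shift from automation is a single data plane call; here's a sketch with boto3 where the load balancer ARN, AZ ID, and expiry are placeholders, and the load balancer is assumed to have cross-zone load balancing disabled:

```python
# Sketch: shift traffic away from one AZ using the ARC Zonal Shift API.
import boto3

zonal_shift = boto3.client("arc-zonal-shift")

response = zonal_shift.start_zonal_shift(
    resourceIdentifier=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "loadbalancer/net/my-nlb/1234567890abcdef"
    ),
    awayFrom="use1-az3",        # the AZ to stop sending traffic to
    expiresIn="3h",             # shifts expire automatically as a safety net
    comment="Suspected gray failure in use1-az3",
)
print(response["zonalShiftId"])
```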

But you might want to have a centralized view of health. I've talked to a number of customers who want a platform team to be able to tell all of their independent application teams, "Hey, we think that AZ 1 is having a bad day, and we want to tell everybody that you should probably evacuate that Availability Zone."

And so instead of trying to execute a Zonal Shift for maybe tens, hundreds or thousands of accounts in your business, you could create a centralized API and you can do this really simply with something like API Gateway and DynamoDB.

And so in this case, I have an integration between API Gateway and DynamoDB. When I call the API, it's going to retrieve an item from my DynamoDB table, and in that table I've just got a key value of an AZ ID and whether or not I consider it to be healthy.

Then I point Route 53 health checks at it. Route 53 health checks are part of Route 53's data plane, which doesn't rely on the control plane. So I'm pointing health checks at my API Gateway endpoint, one health check for each of these load balancer nodes, using their zonal DNS names.

So each NLB and ALB has a zonal DNS name, something like use1a.myloadbalancer.elb, etc. And so I can create a DNS record for each one of those zonal endpoints and associate a health check that points to my API Gateway endpoint.

And so when I want to evacuate an Availability Zone, because I think it's having a gray failure, I set the "healthy" value in my DynamoDB table to false, which causes the health check to fail in Route 53, and I stop returning that DNS record.

So effectively, you're achieving the same thing that zonal shift is doing for you, but now you can have lots of teams rely on this and access it. And it doesn't have to be used only with Route 53 health checks; you could have some other type of system that's checking this value and responding to it. It doesn't necessarily have to be a health check.
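
The flip itself is just a data plane write to that table; here's a sketch, assuming a hypothetical table named `az-health` with the key and attribute names shown:

```python
# Sketch: mark an AZ unhealthy in the table that API Gateway reads, which fails
# the Route 53 health checks for that AZ's zonal records.
import boto3

ddb = boto3.client("dynamodb")

def set_az_health(az_id: str, healthy: bool) -> None:
    ddb.put_item(
        TableName="az-health",
        Item={
            "az_id": {"S": az_id},
            "healthy": {"BOOL": healthy},
        },
    )

# Evacuate use1-az1 ...
set_az_health("use1-az1", False)
# ... and later, once the gray failure is resolved, let traffic return.
set_az_health("use1-az1", True)
```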

So those are two different patterns using only data plane actions, right? Invoking an API Gateway endpoint, getting an item from DynamoDB - all parts of their data plane. And so both are really reliable ways that we can invoke an AZ evacuation.

But in some instances, we may need to use the control plane. So when might this be required? Let's say you have an architecture that looks like this: I don't have a load balancer at all. I've got an Auto Scaling group that's scaling based on an SQS queue, and the instances are polling the SQS queue and processing work based on the messages in the queue.

So how could I evacuate an AZ in that type of architecture? This isn't going to be a very prescriptive pattern; it's going to be more of a high level approach that you can write and invoke yourself.

And so in this case, I've got an operator that's going to execute a local script. The reason that I'm executing a local script is to remove as many dependencies as possible from my recovery path; the fewer things that I have to depend on for recovery, the higher the probability that that recovery is going to work, right?

If I was relying on some other service or a third party system, and that system was having a bad day, that could prevent you from invoking your recovery. So I've got an operator here that's going to run the script, and what the script is generally going to do is list all the resources of the type that I've specified - in this case, Auto Scaling groups.

I'm going to determine which subnets need to be removed from each resource's network configuration by comparing the subnet's Availability Zone ID to the AZ I provided in the script. I'm going to record the Availability Zone ID, the resource name or its ARN, and which subnets I removed in a DynamoDB table - it could be a DynamoDB table, could be something else, right? Could be a local CSV file, whatever.

And the reason that I'm doing that is so that I can undo this later, right? It's easy to know which subnets to remove but once they're removed, we don't really have a record of how to add them back. And so we're going to record that so we know how to undo it.

And then finally, I'm going to update the network configuration of the resource; I'm actually going to call the API to remove those subnets from its network configuration. And for Auto Scaling, once I remove a subnet from its network configuration, it'll launch new instances in the remaining AZs and terminate the ones in the AZ that we left.

And so that's a control plane action, updating the configuration of my Auto Scaling group. But it allows us to remove that capacity and to stop processing work in the impacted availability zone.

When I want to undo this and recover, say the outage is gone, I can call the same script, but this time with "recover" instead of "evacuate". We'll get all the subnets that we removed, we'll look at all the work that we just did, and we'll describe each of the resources that we'd recorded in our table.

We'll combine their current network configuration with the subnets that were removed, we'll update that network configuration and actually change the configuration of that resource, and then finally we'll remove the record after the update.

This is an idempotent action, which means that no matter how many times you run it, you'll always end up with the same result. So even if your script fails halfway through and you forget to delete the records from the table, if you run it again, it'll only ever add back at most the subnets that were configured before, right?

And if the subnet already exists in its network configuration, nothing to change.
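
Here's a rough sketch of what such an evacuate/recover script might look like for Auto Scaling groups; the state table name, its key schema, and the attribute names are assumptions, and error handling and pagination details are trimmed for brevity:

```python
# Sketch: remove an AZ's subnets from every Auto Scaling group, record what was
# removed, and add them back later. Run as: python script.py evacuate use1-az1
import sys
import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")
ddb = boto3.client("dynamodb")
STATE_TABLE = "az-evacuation-state"   # hypothetical table keyed on resource_name

def subnet_az_ids(subnet_ids):
    resp = ec2.describe_subnets(SubnetIds=subnet_ids)
    return {s["SubnetId"]: s["AvailabilityZoneId"] for s in resp["Subnets"]}

def evacuate(az_id: str) -> None:
    for page in autoscaling.get_paginator("describe_auto_scaling_groups").paginate():
        for asg in page["AutoScalingGroups"]:
            subnets = [s for s in asg["VPCZoneIdentifier"].split(",") if s]
            az_map = subnet_az_ids(subnets)
            removed = [s for s in subnets if az_map[s] == az_id]
            if not removed:
                continue
            # Record what we removed so the change can be undone later.
            ddb.put_item(
                TableName=STATE_TABLE,
                Item={
                    "resource_name": {"S": asg["AutoScalingGroupName"]},
                    "az_id": {"S": az_id},
                    "removed_subnets": {"SS": removed},
                },
            )
            remaining = [s for s in subnets if s not in removed]
            autoscaling.update_auto_scaling_group(
                AutoScalingGroupName=asg["AutoScalingGroupName"],
                VPCZoneIdentifier=",".join(remaining),
            )

def recover(az_id: str) -> None:
    # Find every resource we touched for this AZ and add its subnets back.
    resp = ddb.scan(
        TableName=STATE_TABLE,
        FilterExpression="az_id = :az",
        ExpressionAttributeValues={":az": {"S": az_id}},
    )
    for item in resp["Items"]:
        name = item["resource_name"]["S"]
        removed = item["removed_subnets"]["SS"]
        asg = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[name]
        )["AutoScalingGroups"][0]
        current = [s for s in asg["VPCZoneIdentifier"].split(",") if s]
        merged = sorted(set(current) | set(removed))   # idempotent: no duplicates
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=name, VPCZoneIdentifier=",".join(merged)
        )
        ddb.delete_item(TableName=STATE_TABLE, Key={"resource_name": {"S": name}})

if __name__ == "__main__":
    action, az = sys.argv[1], sys.argv[2]   # "evacuate use1-az1" or "recover use1-az1"
    evacuate(az) if action == "evacuate" else recover(az)
```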

Ok. So summary:

We started talking about gray failures and what they are - they're defined by the idea of differential observability, which means that health is observed differently from different perspectives. You as a customer may see something that's impacting you or unhealthy while the underlying system doesn't.

What we want to do is turn gray failures into detected failures, right? And this may be true for your system and your customer experience as well as the dependencies you use - we don't want them to be gray, we want them to be detected.

In order to do that, you have to take action yourself, right? We can't rely on the underlying service. If you want to be resilient to gray failures, you have to build the observability and mitigation tools, right, to detect and fix them.

To do that, it's especially important that your metrics are enriched with dimensions aligned to your fault isolation boundaries. It's gonna be really tough for you to know that AZ one is impacted if you don't have metrics specific to AZ one, AZ two, AZ three, right? If all you have is a collection in aggregate at the region level, right, you'll never be able to find the source that's causing a drop in availability or increased latency.

So that means we need more than just the load balancer and Amazon EC2 health checks, right? We have to build additional, more advanced systems to do this detection.

We want to use a combination of outlier detection and composite alarms. And then finally to actually perform the evacuation, we want to prefer data planes over control planes whenever possible.

So I've got a couple of resources up here:

  • There's a blog about outlier detection using the right types of health checks that goes into more detail about the health checks.
  • There's a Well-Architected lab on health checks.
  • There's a whitepaper called "Advanced Multi-AZ Resilience Patterns" that goes into a lot more detail on everything we just talked about.
  • There's also a workshop - so if you want to get hands on with creating and experiencing and detecting and mitigating gray failures, there's a workshop and if it's not full, there's one actually being presented today at 5pm where you can get hands on with this.

So a lot of different resources, right, to help you understand detection and mitigation techniques.

Finally, I want to mention that we launched the Resilience Lifecycle framework. And so this is a standard way to help customers understand resilience. When we think about detecting and mitigating gray failures within this framework, we have five different areas.

And so in the Design and Implement and Operate phases is where we would think about creating the detection and mitigation techniques for single host and single availability zone gray failures.

We also have a lot of purpose built offerings around resilience that can help with this - Resilience Hub, Fault Injection Simulator to simulate single host or single AZ failures, Elastic Disaster Recovery, Backup, and then one of the main features we use to mitigate and evacuate an Availability Zone, Application Recovery Controller zonal shift.

So a lot of different capabilities to help you along this journey.

Alright, with that, thank you all so much for coming!
