Improve application resilience with AWS Fault Injection Simulator

Good afternoon, everyone. How are you? Hopefully you'll still be doing well after the session. First of all, thank you very much for making it here. I know the MGM is quite far from the rest of the venue, so 10 points to every one of you for making it. At least we are not alone, which is great.

My name is Adrian Hornsby. I'm a principal engineer with the AWS reliability team. We own the Fault Injection Service. It used to be called Fault Injection Simulator, but on the 15th of November we renamed it to Fault Injection Service to make it a little easier to understand. I'll be joined by Iris, who is a senior product manager on our team, and she'll talk about today's announcements. If you haven't seen them, we've launched a set of scenarios to test multi-AZ and multi-Region applications, and she'll be diving into those.

Before diving into this, I'd love to talk a little bit about the why of resilience, and why we are launching those scenarios today. I'll start with a simple question: what is the price of downtime? Does anybody have any idea how much it costs, roughly? Right, it's expensive. Very expensive. Millions of dollars, actually.

Recent surveys say that for about 91% of enterprises, the cost of one hour of downtime is roughly $300,000; for 44% of those enterprises it goes up to a million, and for 18% it exceeds $5 million. If you've never thought about the cost of downtime, a good average is typically $10,000 per minute of downtime. It's a lot more expensive than you'd think. And when you are down, there are a lot of other implications, and we'll talk a little bit about that.

Another very interesting number: if you experience downtime regularly, your costs are likely to be about 16 times higher than those of a company that doesn't have as much downtime. So it's pretty scary, and it's easy to understand why downtime is something people worry about in a company. And downtime has a lot of implications. Of course your customers don't like it, but it's also loss of revenue. It can have an impact on compliance, which can be quite terrible, it can have an effect on the stock price and on hiring, and you simply miss out on business as well.

But there's something a little trickier here: downtime also affects the brand. There's a famous quote that it takes a lifetime to build a good reputation, but you can lose it very fast. That's especially true nowadays, when there are so many applications and competing businesses fighting for customers' attention. If you have a lot of downtime, your customers will go look somewhere else, and it's very difficult to get them back to your platform. So downtime really matters, and it's challenging to avoid. Actually, I would even say it's pretty much impossible to avoid, because we're dealing with complex systems.

People tend to think that complex systems means very large, wide distributed systems, but it's not really like that. In fact, I'd argue that even a very simple client-server application like this one is a complex system if you think about it, because there are a lot of possible permutations of failure. For example, my client has to create a message and send it over the network; the network needs to deliver the message to the server; the server typically needs to validate the message, maybe update state in a data store, create an answer, and put it on the network; the network delivers that back to the client; and the client validates it and maybe also updates its local state.

So there are tons of things that can go wrong here. Your network can become slow; you can have latency or packet loss. Your validation mechanism might throw an error. Your data store might be congested, or there might be too many open connections, and so on and so on. The number of failure permutations in even this very small system is actually quite staggering.
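Even this simple request/response path gives you several distinct places where things can break. As a minimal illustration (my own, not from the talk), here is a hypothetical Python client using the `requests` library against a made-up endpoint, with comments marking where each of those failure modes would surface:

```python
import time
import requests

API_URL = "https://api.example.com/messages"  # hypothetical endpoint


def send_message(payload: dict, retries: int = 3) -> dict:
    """Send a message to the server, retrying transient failures."""
    for attempt in range(1, retries + 1):
        try:
            # Network slowness, latency, or packet loss surfaces here as a
            # timeout or connection error.
            response = requests.post(API_URL, json=payload, timeout=2)

            # Server-side validation errors or a congested data store
            # typically surface as 4xx/5xx responses.
            response.raise_for_status()

            # The client validates the answer and updates its local state.
            return response.json()
        except (requests.Timeout, requests.ConnectionError):
            if attempt == retries:
                raise
            # Back off before retrying so we don't add to the congestion.
            time.sleep(2 ** attempt)


if __name__ == "__main__":
    print(send_message({"hello": "world"}))
```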

So if you think about this, when you build a really wide distributed system in the cloud, or hybrid, or whatever you're doing, it truly is complex and there are a lot of possibilities for failure. So the question is: how do you avoid failure? Well, you can't. If there's one summary for this presentation, it's that you can't avoid failure. Failures are just going to happen, and everything fails over time, as our CTO Werner Vogels often says.

So you might ask: how do we do this, how do we handle failure, or how do we mitigate the impact of failures? There's a very interesting book called Resilience Engineering in Practice, and its authors break resilience down into four pillars: anticipating, monitoring, responding, and learning.

Anticipating is trying to come up with all the failure scenarios and thinking about the designs that make a system resilient. Then you have monitoring, which is understanding what happens during a failure scenario. I'm pretty sure many of you have experienced an outage where you didn't have the right alarms or the right monitoring, and then, in hindsight, you say, OK, maybe we could have done something.

So monitoring is very important, and then you have responding. Responding is understanding what you have to do to recover from the failure, and this is not trivial. Sometimes you have very complex systems where you need a sequence of events: bring back your tier-zero applications, then tier one, two, and three in order. Sometimes you realize that one system depends on another, and vice versa, so you end up with a circular dependency, which is quite problematic. So these are things you have to think about in your systems as well.

And then you have learning. Learning is sitting down after the outage, trying to understand what happened, making sense of it, deriving best practices, and hopefully sharing them with the rest of the organization so everybody can learn from it. This is a very good way to think about resilient systems, because a lot of the time we tend to think about resilience only as the technology itself, but it really spans the whole spectrum of culture, mechanisms, and tools.

So we took these four cornerstones of resilience, thought about how we've done resilience at Amazon and AWS and how our customers are doing it, and came up with something called the resilience lifecycle framework. The lifecycle framework is basically a continuous resilience process: you go through these steps and try to understand how to improve your system's resilience.

It starts with setting objectives. That sounds pretty trivial, but a lot of the time we don't really understand why we want resilient systems or what the business objectives are. Is it four nines of availability? Is it a live event? Is it a new feature? Why do we need resilience? Then we go and design and implement, then evaluate and test, observe, respond and learn, and then it loops back around: it's a lifecycle.

What's interesting is that if you look inside a company, teams might be at different stages of this lifecycle: some might be very deep into it, some might be just starting, and you might enter anywhere. It's not always the case that you start with objectives. Sometimes you test because you've had a problem and you need to make sure it doesn't happen again, and then the discussion around fixing it typically gets you into the conversation about setting objectives. What I'm trying to say is that you can enter this lifecycle at any point. By the way, we've published this framework in our documentation.

Today, this presentation will focus a bit more on design and implement and on evaluate and test. Arguably the most important part of designing a resilient system is thinking about fault isolation boundaries. Remember what I said: failures are going to happen. So you have two strategies to recover from failures. The first is being able to recover very fast, so practice recovery. The second is to minimize the impact on your workload and limit the impact on customers. This is where fault isolation boundaries kick in. A fault isolation boundary is a little bit like the bulkheads in a boat.

Boats typically have bulkheads so that if one part of the hull gets a hole, the water doesn't fill the entire boat, it only fills that one compartment. The idea here is similar: you try to think about your system in terms of blast-radius domains. This has very interesting properties, because you get isolation of your workload and containment of failures. More importantly, because a boundary has a finite size, you can understand that particular boundary, test it, understand its limits, and then replicate it horizontally where possible.

Testing here is very interesting. The most common isolation boundaries we use, and I'm sure this is nothing new to you, are Regions and Availability Zones. There are 32 Regions globally and 102 Availability Zones. A Region is typically made up of three or more Availability Zones, and an Availability Zone is a collection of data centers with isolated power and networking.

Typically, AZs within a Region are separated by a meaningful distance, up to roughly 100 kilometers, but still close enough that we keep single-digit millisecond latency between them. We do that because we want synchronous replication in some of our regional services. AZs are also very isolated from one another: they have different power systems, different networking, different cooling, and they can withstand issues like a tornado, an earthquake, or a power outage. So these are the most commonly used fault isolation boundaries that we have, and the most straightforward way of using them is the following.

I would say scenario one is building a multi-AZ application. That's really where everyone should start their journey into building resilient applications. It typically starts with having instances, or compute in general, distributed across multiple AZs, and a database that also has replicas or can fail over into another AZ, in case an AZ experiences some issues.

So if you have, for example, a deployment that impacts one AZ, or a problem in some service, some gray failures, and you really want to isolate traffic from this AZ, typically you would move the traffic from that AZ to the others; we call that AZ evacuation, or AZ isolation. And typically in this scenario the primary database instance would fail over, and the standby would become the new primary.
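As a concrete illustration of that last point, here is a minimal boto3 sketch (my own, not from the talk) that forces a Multi-AZ RDS instance to fail over to its standby, the kind of operation you would practice as part of an AZ evacuation; the instance identifier is a placeholder:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Rebooting a Multi-AZ DB instance with ForceFailover=True promotes the
# standby in another AZ to primary, mirroring what happens when you
# evacuate the database tier out of an impaired AZ.
rds.reboot_db_instance(
    DBInstanceIdentifier="my-multi-az-db",  # placeholder instance name
    ForceFailover=True,
)

# Wait until the instance reports "available" again before relying on it.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="my-multi-az-db")
```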

Another very interesting fault isolation boundary that we use a lot at Amazon and AWS is what we call a cell. A cell is another way we contain the blast radius of failures. What's interesting with cells is that you can, for example, take a multi-AZ application, the same one I've shown before, and that becomes our cell. So a cell is really one complete instantiation of a service.

"And we put that in the cell including the data layer and then we typically take the cell and then we replicate those cells horizontally behind a thin routing layer. Typically, that's gonna be Route 53 with a uh a domain name that you know, can direct traffic to different cells. And typically we assign uh customers to one or more cells depending on, on the level of resilience we want to achieve.

But here, if something were to happen in one cell, for example if cell-one.example.com had a deployment failure, and deployments are a very common source of issues, we don't want that issue to spread to the rest of the cells. We do the deployment in cell one, we watch how the cell behaves, and if something bad happens we can take the cell out of the routing layer and move customers to other cells, depending on how we route the traffic.

Typically customers will be assigned to multiple cells. This is a very interesting design, and it's also very useful for chaos engineering, because you can create a cell that only receives synthetic users and synthetic traffic but is exactly the same size as the other cells that take production traffic and real users. So we can conduct chaos engineering in that one cell, understand the failure modes, improve its resilience, and then replicate that to the rest of the cells. That's a very good way to actually do fault injection in a production environment, and Prime Video does something very similar to that.
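To make the "thin routing layer" idea a bit more concrete, here is a minimal, hypothetical sketch of assigning customers to cells deterministically by hashing their customer ID. The cell names and count are made up, and a real implementation would usually add an explicit mapping table so individual customers can be moved between cells:

```python
import hashlib

# Hypothetical cells sitting behind a thin routing layer such as Route 53.
CELLS = ["cell-1.example.com", "cell-2.example.com", "cell-3.example.com"]


def cell_for_customer(customer_id: str, cells: list[str] = CELLS) -> str:
    """Deterministically map a customer to a cell.

    The same customer always lands in the same cell, which keeps the blast
    radius of a bad deployment limited to the customers of that one cell.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return cells[int(digest, 16) % len(cells)]


if __name__ == "__main__":
    for cid in ["alice", "bob", "carol"]:
        print(cid, "->", cell_for_customer(cid))
```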

Another very important fault isolation boundary is what we call control plane and data plane separation. The control plane is typically the set of operations by which you set up resources in an account, and the data plane is basically using those resources. If you think about EC2 instances, for example, launching an instance is a control plane operation; the instance then being available for you to use is the data plane. We do this because control planes are typically a bit more complicated: they have workflows, they rely on databases, they have more business logic. And in an outage, what we want is to avoid using the control plane.

So we make the data plane as simple as possible, so we can keep using the resources that are already deployed without too much complexity, because in an outage the control plane will statistically have more issues.

This is a very common design that we also use internally, and it leads me to the concept of static stability. How many of you have used or heard of static stability? A few of you, that's great. It's something we use very, very often.

We define a system as statically stable when, if there is an outage, it keeps working without any control plane operations. I'll give you a very simple example so that, hopefully, what I'm saying makes sense. Take a plane with four engines. If the plane loses two of those engines and can continue flying without impacting its flight, the system is statically stable.

If you think about this when designing systems, it gives you very interesting properties. For example, say you are designing a multi-AZ application with three AZs, and one AZ experiences some issues, or there is a deployment and your application in that AZ starts to fail. If the other AZs cannot handle the surge of traffic from the AZ that has the problem, then your system is not statically stable.

If, without starting any new instances, the other AZs can absorb the traffic from the AZ that has issues, then your system is statically stable. We think about this a lot when we design our systems because, as I said, during an outage it is very common that control plane operations won't work.
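A quick way to reason about this is simple capacity math: to be statically stable against the loss of one AZ, each AZ has to be pre-provisioned so that the surviving AZs can carry the full peak load without any scaling. A small sketch of that calculation, with illustrative numbers:

```python
import math


def per_az_capacity(peak_load: int, azs: int, az_failures_tolerated: int = 1) -> int:
    """Capacity each AZ must hold so that the surviving AZs absorb peak load
    without any control plane (scaling) actions."""
    surviving = azs - az_failures_tolerated
    if surviving < 1:
        raise ValueError("must keep at least one surviving AZ")
    return math.ceil(peak_load / surviving)


# Example: a peak of 300 requests/s spread over 3 AZs.
# Statically stable sizing means each AZ is provisioned for 150 req/s, so
# losing one AZ leaves 2 x 150 = 300 req/s, enough for the full peak.
print(per_az_capacity(peak_load=300, azs=3))  # -> 150
```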

The last thing you want to do in an outage is try to start new instances or deploy a new cluster. Ideally you need to do nothing, and the system handles the failure by itself; in that sense, your system is statically stable. This is a very important concept when we talk about resilience. We wrote a very good article about how Lambda uses a multi-AZ architecture to build its resilience, building static stability on top of AZs. I highly suggest you read it, not because I wrote it with Marcia, but because it explains these concepts with multiple AZs, which is very important.

That leads me to the second part, which is evaluating and testing. Fault isolation boundaries are great, but how do you evaluate and test them? Traditionally we've talked a lot about fault isolation boundaries, but it's been quite difficult for customers to actually test them; you were left trying to do it on your own, which is not easy.

A few years ago, two years ago I think, we launched a product to do resilience testing, which we'll talk a little bit about. Resilience testing is really injecting stress into your system, observing the response, and then making improvements. We do this to improve the reliability, the performance, and the resilience of the system. And it has very interesting properties: when you start injecting faults into the system, you really start to uncover hidden issues.

For example, missing alarms, missing logs, missing observability. It's better to discover that in a test environment rather than in production. What's also very interesting with resilience testing is that it allows us to practice our operational muscle. I'm not sure how often you have outages, but systems are becoming more and more resilient, and to become very fast at recovering you need really good muscle memory for operational practices: you know your command line, you know your runbooks, they are up to date, and everything goes fast. The last thing you want during an outage is to realize that your operational procedures are outdated and don't work.

So injecting faults into the system is a really good way to practice your operational muscle. And typically when you inject faults, you use something similar to the scientific method we were taught at school, where you start with a hypothesis: what happens if I remove the database, or what happens if I remove the authentication layer of my system? Then you run an experiment to try to validate that hypothesis. If your hypothesis is "if I remove my database layer, my system will fall into read-only mode, use the cache, and keep working", then you go and test it. Until you test it, it is an assumption, and I can guarantee you that until you've turned an assumption into a verified truth, your assumption is wrong. And an assumption that holds on Monday can be wrong by Friday, after ten deployments: configuration drift, behavior drift.
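One lightweight way to turn such a hypothesis into something you can check automatically is to express the expected steady state as a CloudWatch alarm and verify that it stays in the OK state while the experiment runs. A minimal sketch, with a made-up alarm name:

```python
import time

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def steady_state_holds(alarm_name: str) -> bool:
    """Return True if the steady-state alarm is not firing."""
    resp = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    return all(a["StateValue"] == "OK" for a in resp["MetricAlarms"])


# Hypothesis: even with the database layer degraded, availability stays
# within SLO, i.e. the hypothetical "availability-slo" alarm never fires.
for _ in range(10):  # poll for ~5 minutes during the experiment window
    assert steady_state_holds("availability-slo"), "hypothesis disproved"
    time.sleep(30)
print("Hypothesis held for the whole experiment window")
```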

So it's important to exercise these assumptions regularly. To test this kind of behavior, two years ago we launched Fault Injection Simulator, which we have now rebranded as Fault Injection Service.

It allows you to run controlled fault injection experiments against your system. To talk a little more in depth about the service, and especially to dive into the new announcements we've made today, I want to invite Iris on stage. Thank you very much.

In this scenario, FIS enables customers to test the implementation of their multi AZ architectures. It combines multiple FIS actions to replicate the symptoms of a power interruption in one AZ.

The basic idea here is that when an AZ experiences a power interruption event, your application should be able to behave and perform as expected in the remaining availability zones. We include faults in this scenario that you would expect in a power interruption event, no matter how unlikely that is to occur.

For example, EC2 instances and container pods and tasks are expected to fail, while load balancers and Auto Scaling groups should continue to operate as expected. Region-based services such as Lambda should also operate as expected.

This is the current list of faults that we support at launch (a sketch of how they map onto an experiment template follows the list):

  • It will stop EC2 instances, along with the pods and tasks running on those instances.

  • We also have a new action that injects insufficient capacity errors into your Auto Scaling groups and IAM roles to prevent any instance launch requests into the target Availability Zone.

  • We will also cause ElastiCache clusters and RDS databases to fail over.

  • Finally, once the AZ comes back online, any persistent EBS volumes may be unresponsive.
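To give a feel for what such a scenario looks like under the hood, here is a deliberately simplified boto3 sketch of an experiment template that only covers the EC2 portion: it stops tagged instances in one AZ for ten minutes. The real AZ power interruption scenario in the Scenario Library adds the capacity-error, ElastiCache/RDS failover, and EBS actions listed above, and the tag, role ARN, and identifiers below are placeholders:

```python
import uuid

import boto3

fis = boto3.client("fis", region_name="us-east-1")

template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Simplified AZ power interruption: stop EC2 in us-east-1a",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    stopConditions=[{"source": "none"}],  # or aws:cloudwatch:alarm + alarm ARN
    targets={
        "instances-in-az": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"azi-scenario": "true"},  # placeholder tag
            "filters": [
                {"path": "Placement.AvailabilityZone", "values": ["us-east-1a"]},
                {"path": "State.Name", "values": ["running"]},
            ],
            "selectionMode": "ALL",
        }
    },
    actions={
        "stop-instances": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "instances-in-az"},
            # Instances are started again after the simulated outage window.
            "parameters": {"startInstancesAfterDuration": "PT10M"},
        }
    },
)
print(template["experimentTemplate"]["id"])
```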

An end-to-end power interruption scenario for Availability Zone A would look like this:

You would start your experiment targeting resources in Availability Zone A. This will cause compute to fail, and if you have any primary ElastiCache nodes or writer RDS instances, those will fail over.

Now if you have the appropriate alarms and monitoring in place, and this scenario is a good way to test that, you should trigger your alarms and health checks. At this point, you should shift traffic out of AZ A, either with your standard operating procedures or with Route 53 Application Recovery Controller.

We particularly like the zonal shift capability of ALBs and NLBs because it’s a good response to this type of scenario. And once you shift traffic, you can observe key metrics such as availability and latency to see how your application performs in the other availability zones.
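For reference, here is a minimal sketch of triggering such a zonal shift programmatically with the Route 53 ARC zonal shift API; the load balancer ARN is a placeholder and the resource must have zonal shift enabled:

```python
import boto3

arc = boto3.client("arc-zonal-shift", region_name="us-east-1")

# Shift traffic away from the impaired AZ for this load balancer.
shift = arc.start_zonal_shift(
    resourceIdentifier=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "loadbalancer/net/my-nlb/1234567890abcdef"  # placeholder ARN
    ),
    awayFrom="use1-az1",  # zone ID of the AZ to evacuate
    expiresIn="30m",      # the shift expires automatically
    comment="FIS AZ power interruption experiment",
)
print(shift["zonalShiftId"], shift["status"])

# When the experiment ends, cancel the shift to fail traffic back:
# arc.cancel_zonal_shift(zonalShiftId=shift["zonalShiftId"])
```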

And once the experiment ends, you can fail back and see how your application recovers and verify that you can shift traffic back.

Now, I'm going to show a pre-recorded demo targeting us-east-1a on an application similar to this one, but with replica hosts behind an NLB in three Availability Zones. I've also already created synthetic monitoring canaries on each of the zonal endpoints, as well as a regional canary that lets you check the customer experience during such a scenario.
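If you build a similar setup, one way to wire such canaries into alarms (which can also double as FIS stop conditions) is to alarm on the CloudWatch Synthetics SuccessPercent metric. A hedged sketch, with a made-up canary name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when the zonal canary's success rate drops below 90% for one minute.
cloudwatch.put_metric_alarm(
    AlarmName="zonal-canary-us-east-1a-failing",  # hypothetical alarm name
    Namespace="CloudWatchSynthetics",
    MetricName="SuccessPercent",
    Dimensions=[{"Name": "CanaryName", "Value": "zonal-canary-us-east-1a"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=90,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # a canary that stops reporting counts as failing
)
```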

So to start, we'll go to the FIS Management Console which is under Resilience Hub. You can find the Scenario Library on the left hand side. Next, you can see that we have a range of scenarios in the library. I'm going to select the AZ Power Interruption scenario.

Here you can see information about the scenario including the exact actions and targets. You can also see the actual JSON of the experiment template and copy and modify it to your own needs. And finally, you can check out the details and descriptions and additional documentation on the scenario.

So I'm going to create a template from this scenario. If you have resources in multiple accounts, you can select that option. But I've set up the app stack to be in just one account, so I'll confirm now.

As you can see, the actions and targets for this experiment are already defined. To set up this template, I just need to input a few key parameters.

I'll click on Edit Shared Parameters. Now this is where you would specify the affected AZ. We want to target us-east-1a resources, so I'll select that parameter.

Next, most resource types in this scenario are targeted with resource tags, but IAM roles can't be targeted with tags. So we need to select the IAM role into which we'll inject the insufficient capacity error to prevent instance launch requests. I've created a demo role that I'll select.

As I mentioned, we support targeting with tags, so this is where you would specify your tag key and value for each resource type. I've already tagged my resources, but you could use your own tagging convention. For instance, if you have a consistent tag for an application, you could specify that.

And finally, you also need to specify the duration of the experiment. The default is a 30 minute outage with a 30 minute recovery period, but I shortened it a bit for this demo.

So now I don't need to do anything else when it comes to setting up the actions and targets. If I wanted, I could click on a specific target and action just to see how the parameters have populated, which I've done here just to show you it's targeting us-east-1a as expected.

Next, we have this experiment option that allows you to skip actions if you don't have a specific resource type in your stack - for instance if you don't have any ElastiCache clusters, the experiment will simply skip that action. The default is to skip actions, but if I wanted the template to validate I have all the resources specified, I could select "Fail".

Finally, this is where you specify the IAM role that allows FIS to take actions. I've already created and set up the proper permissions, so I just select that here.

I'm not going to specify a stop condition for this demo because I want the experiment to run all the way through, but this is where you would define one if you had it.
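If you did want one, a stop condition is just a CloudWatch alarm attached to the template. As a hedged example, adding one to an existing template through boto3 might look like this; the template ID and alarm ARN are placeholders:

```python
import boto3

fis = boto3.client("fis", region_name="us-east-1")

# Abort the experiment automatically if the availability alarm fires.
fis.update_experiment_template(
    id="EXT123456789abcdef",  # placeholder experiment template ID
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:availability-slo",
        }
    ],
)
```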

And finally, I think it's best practice to enable logs, so I'm going to specify a destination S3 bucket to send my experiment logs to.

Now I can create a template from this scenario and run it at any time. For this demo I've gone ahead and started the experiment. Once I click Start, it goes into an initializing state.
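Outside the console, starting the experiment from a saved template is a single API call. A minimal sketch, with a placeholder template ID:

```python
import uuid

import boto3

fis = boto3.client("fis", region_name="us-east-1")

experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId="EXT123456789abcdef",  # placeholder template ID
)
print(experiment["experiment"]["id"],
      experiment["experiment"]["state"]["status"])
```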

While it's initializing, I'll show the rest of the stack I've set up. As I mentioned, I have canaries on each zonal endpoint as well as a regional customer canary. I have 6 healthy hosts behind my load balancer, 2 in each AZ. My primary ElastiCache node is in us-east-1a, and my RDS writer instance is there as well.

Going back to the experiment, we can now see the actions are in a running or pending state. The RDS failover action has already completed since that happens instantly.

So let's check the impact of this experiment on the resources. In my Auto Scaling Group I can see instances are now being terminated in us-east-1a. Checking the activity log I can see the ASG attempted to launch instances into us-east-1a but received an insufficient capacity error injected by FIS, which prevents provisioning instances into that AZ.

I can also see it's deployed additional instances into the remaining AZs. Looking at my load balancer, I see a similar picture - I only have 4 healthy hosts now, with 2 initializing.

Checking my RDS database, I see it's currently modifying. I'll also look at my ElastiCache cluster and see it's modifying as well, with the primary node already failed over to us-east-1c.

Now I can check the status of my canaries. My alarm shows my canary is failing in us-east-1a, but I also see the regional canary representing customer experience is being impacted, with increased response times.

This shows I probably want to improve my application's ability to do a zonal shift and shift traffic away from the AZ to improve resilience and recovery time.

Going back to FIS, I see my actions have completed. I didn't have all resource types so those actions were skipped.

Hopefully that gives you a sense of how this power interruption scenario works! I'll dive into details on the next scenario, which is cross-region connectivity...

[Transcript continues]
