Building confidence through chaos engineering on AWS

Welcome everyone. We're here today to talk about building confidence through chaos engineering. In this session, you will learn:

  • What chaos engineering is and what it isn't
  • What the value is of chaos engineering
  • How you can get started with chaos engineering within your own firms.

But more importantly, I will show you how you can combine the power of chaos engineering and continuous resilience and build a process that lets you scale chaos engineering across your organization in a controlled and secure way, to help your developers and engineers build secure, reliable and robust workloads that ultimately lead to a great customer experience.

My name is Laurent Domb and so let's get started.

First, I will introduce you to chaos engineering and we're gonna see what it is and what it isn't. I will also go through the various aspects when we're thinking about prerequisites for chaos engineering and what you need to get started.

We will then dive into continuous resilience and why continuous resilience is so important when we're thinking about resilient applications on AWS. Combining chaos engineering and continuous resilience, I will walk you through our chaos engineering and continuous resilience program that we use to help our customers build chaos engineering practices and programs that they can scale across their organizations.

And last, I will show you some great workshops that we have in AWS where you can get started with chaos engineering on your own.

So when you're thinking about chaos engineering, chaos engineering is not new. Chaos engineering has been around for over 10 years, and there are many companies that have already adopted it, taken its mechanisms, and tried to find the known unknowns, which are things that we are aware of but don't fully understand in our systems, and to chase the unknown unknowns, which are things that we are neither aware of nor fully understand.

And through chaos engineering, these various companies were able to find deficiencies within their environments and prevent large scale events and therefore ultimately have a better experience for their customers.

And yet when you're thinking about chaos engineering in many ways, it's not how we see chaos engineering. There is still a perception that chaos engineering is that thing which blows up production or where we randomly just shut down things within an environment. And that is exactly not what chaos engineering is about.

When we're thinking about chaos engineering, we should look at it from a much different perspective. Many of you have probably seen the shared responsibility model for resilience. When you're thinking about the shared responsibility model for resilience, there are two sections, the blue and the orange: resilience in the cloud and resilience of the cloud.

We as AWS are responsible for the resilience of the facilities, the network, the storage, and the services that you consume. But you as a customer are responsible for how and which services you use and where you place your workloads. Think about zonal services like EC2: where you place your data and how you fail over if something happens within your environment.

But think about the challenges that come with the shared responsibility model: how can you make sure that if a service you are consuming in the orange fails, your workload is resilient? How do you know that if something fails, your workload can fail over? And this is where chaos engineering comes into play.

When you're thinking about the workloads that you are running in the blue, what you can influence is the primary dependencies that you're consuming in AWS. If you're using EC2, Lambda, SQS, or ElastiCache, these are the services that you can impact with chaos engineering in a safe and controlled way.

And you can figure out mechanisms for how the components within your application can gracefully fail over to another service. So when you're thinking about chaos engineering, what it provides you is improved operational readiness, because your teams will get trained on what to do if a certain service fails, and you will have mechanisms in place to be able to fail over automatically.

You will have great observability in place, because you will realize what is missing within your observability that you haven't seen before when you run these experiments in a controlled way. And ultimately, you will learn to build more resilient workloads on AWS.

And when you are thinking about all this together, what does it lead to? Of course, happy customers. And that's what chaos engineering is about. It's all about us building great workloads that ultimately lead to a great customer experience.

And so when you think about chaos engineering, it's all about building controlled experiments. If we know that an experiment will fail, we're not going to run the experiment. If we know that the fault we inject will hit a bug that brings down our system, we're not going to run the experiment; we already know what happens. What we want to make sure of is that any experiment we run should, by definition, be tolerated by the system and should be fail-safe, because what we want to understand is: is our system resilient to component failures?

Many of you might have an architecture similar to the one you see here on this slide. But when you're thinking about it, let's say you're using Redis on EC2 or ElastiCache. What's your confidence level if Redis fails? Do you have mechanisms in place to make sure that your database does not get fully overrun with requests if your cache suddenly fails?
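To make that concrete, here is a minimal sketch (not from the talk) of one such guardrail: a read-through cache that, when Redis is unreachable, falls back to the database behind a bounded semaphore so the database can't get overrun. The endpoint, key names, and the query_database helper are hypothetical.

```python
import threading

import redis  # pip install redis; endpoint and key names below are illustrative

cache = redis.Redis(host="my-elasticache-endpoint", port=6379, socket_timeout=0.05)
db_gate = threading.BoundedSemaphore(20)  # cap concurrent database fallbacks


def query_database(product_id: str) -> bytes:
    # Stand-in for the real database read.
    return f"product-{product_id}".encode()


def get_product(product_id: str) -> bytes:
    """Read-through cache with a bounded fallback when the cache is down."""
    key = f"product:{product_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except redis.exceptions.RedisError:
        pass  # cache is down or slow; fall through to the bounded database path

    # Only a limited number of callers may hit the database at once, so a cache
    # outage degrades latency instead of overrunning the database.
    if not db_gate.acquire(timeout=0.2):
        raise RuntimeError("shedding load: database fallback is saturated")
    try:
        value = query_database(product_id)
    finally:
        db_gate.release()

    try:
        cache.set(key, value, ex=300)  # best-effort cache repopulation
    except redis.exceptions.RedisError:
        pass
    return value
```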

Or what if you think about latency that suddenly gets injected between two microservices and you create a retry storm? Do you have mechanisms to mitigate that, with exponential backoff and jitter? And what if, on top of that, you have cascading failures and an entire AZ gets taken out of commission? Are you confident that you can fail over from one Availability Zone to another?
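On the backoff-and-jitter point, this is roughly the shape of it, as a small hedged sketch; the wrapped operation is a placeholder for whatever downstream call your service makes.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter.

    The random sleep spreads retries out so that many clients retrying the same
    failing dependency at once don't pile up into a retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter


# Example with a placeholder downstream call:
# call_with_backoff(lambda: payments_client.get_order("1234"))
```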

And think about the impacts that you might have on a regional service. What is your confidence level if an entire region, or a service in a region that you rely on, is impacted, and because of your SLAs you have to fail over into a secondary region? What's your confidence level that your runbooks and failover playbooks are all up to date and you can say, yes, I can run through them?

And so when you're thinking about chaos engineering, and we're thinking about the services that we build on a daily basis, they're all based on trade-offs that we make every single day.

Now, when you're thinking about everyone here in the room, we all want to build awesome workloads: resilient workloads, robust workloads. But the reality is we're all under pressure. There is a certain budget that I can use, a certain time by which I need to deliver, and certain features. But in a distributed system, there is no way that every single person understands the thousands or hundreds of microservices that communicate with each other.

And ultimately, what happens if I think that I'm depending on a soft dependency and someone suddenly changes code so that it becomes a hard dependency? We suddenly have an event. And when you're thinking about these events, usually they happen when you're somewhere in a restaurant or on vacation: you get called at two in the morning and everybody runs and tries to fix things and bring the system back up.

And the challenge with this is that once the system is back up, you just go back to business as usual until the same challenge happens again. And it's not because we don't want to fix it; it's because good intentions don't work. This is where mechanisms like chaos engineering and continuous resilience come into play.

Now, I mentioned in the beginning that there are many companies that have already adopted chaos engineering, and these are just some of the verticals of companies that have adopted chaos engineering; some of them started 5 to 6 years ago already.

But I want to give you a few examples from a very regulated industry, the financial services industry, where you have very large companies that have adopted chaos engineering. One of these companies also spoke here at re:Invent this year: Capital One. Capital One wrote many great blog posts, which you can find under the chaos engineering stories link that you have here, explaining how they were thinking about building their chaos engineering story and processes.

They explained what they were looking at in regards to resilience and readiness of their applications. But over five years they have also built a Cloud Doctor that uses various services, giving their developers data on the lineage of their services, fault injections, and reports when they execute chaos experiments, to help them build better workloads on AWS.

There are others like the National Australia Bank that looked at observability and defined it as being key to chaos engineering, looking at aspects like errors, traffic, tracing, metrics and logging, as well as saturation, all of which have to be part of chaos engineering, because if they can't see what's happening, they define that as chaos. And then you have others like Intuit that shared a great story about how they were thinking about migrating from on premises to the cloud, how they were thinking about resilience, and how that was different from doing an FMEA analysis after the fact versus chaos engineering. They tried to understand whether one made the other obsolete, but realized that they still need both, and they also built a process to help their developers automate chaos experiments from start to end.

And there are many more stories like that that I could talk about today; unfortunately, we don't have enough time. But if you're flying back today or tomorrow, look at the URL, and there are many of these stories that can help you get started with chaos engineering.

So there are many more customers that will adopt chaos engineering next year. There is a great study by Gartner, done for the infrastructure and operations leaders guide, that said that 40% of companies will adopt chaos engineering next year, and they're doing that because they think that they can increase customer experience by 20%. Think about how many more happy customers you're gonna have with such a number.

So let's get to the prerequisites and how you can get started with chaos engineering.

So first you need basic monitoring, and if you have observability, that's great. Then we need to have organizational awareness. We need to think about the real-world events or faults that we're injecting. And then of course, if we find a deficiency within our environment, whether it's resilience or security focused, we need to commit to going and fixing it.

So let's dive a little bit more into this. When you're thinking about metrics, many of us have really great metrics. In chaos engineering, we call metrics known knowns: these are things that we are aware of and fully understand. And when you're thinking about metrics, right, it's CPU percentage, it's memory, it's disk I/O, and it's all great.

But in a distributed system, you're going to look at many, many different dashboards and metrics to figure out what's going on within your environment. And so when we're starting with chaos engineering, many times when we're running the first experiment, even if we're trying to make sure that we're seeing everything, we realize we can't see it.

And this is what leads us to observability. Observability helps us find the needle in the haystack. We can start at the highest level with our baseline, look at the graph, and even if we have absolutely no idea what's going on, we're going to understand where we are, and we can drill down all the way to tracing, like AWS X-Ray, and understand it.

But there are also other stacks from an open source perspective, and if you use them, that's perfectly fine. So when you're thinking about observability, and this is the key, observability is based on three pillars: you have metrics, you have logging and you have tracing.

Now, why is that important? Because you want to make sure that you embed, for example, metrics within your logs, so that if you're looking at the high-level steady state that you might have and you want to drill in, as soon as you go from a graph to a log you see what was going on and can correlate, and at any point in time you understand where your application is. Let me give you an example. When you're looking at this graph, every single one of you, even if you have absolutely no idea what that workload is, sees that there are a few issues.

You look at the spikes and you're going to say, hm, something happened there. And if we drilled down, we would have seen that we had a process which ran out of control and suddenly CPU spiked. Every one of you is able to look at that graph down here and say, wait a minute, why did this drop? And if we drilled into it, we would realize that I had an issue with my Kubernetes cluster and the pods suddenly started restarting.

And every one of you sees that we suddenly had a huge impact somewhere which caused 500 errors, which was caused by node failures within my cluster. That's what observability is about. We can look at the graph and we see right away where we have to go and drill in.
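On the point of embedding metrics within your logs so you can correlate a spike with the requests behind it, here is a minimal sketch using the CloudWatch embedded metric format; the namespace, dimension, and metric names are made up for the example.

```python
import json
import time


def log_order_metrics(latency_ms: float, status_code: int) -> None:
    """Emit one log line in CloudWatch embedded metric format (EMF).

    The same line carries the metric for the steady-state graph and the log
    context for drilling down, so a spike on the dashboard can be correlated
    with the exact requests behind it.
    """
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "Payments",            # illustrative namespace
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "OrderLatency", "Unit": "Milliseconds"}],
            }],
        },
        "Service": "checkout",
        "OrderLatency": latency_ms,
        "StatusCode": status_code,  # extra fields stay queryable in Logs Insights
    }))


log_order_metrics(42.0, 200)
```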

Now, these dashboards are a very operations-centric view. And of course, we want developers to have a similar view, and this is where the tracing aspects come into play.

You want to provide the developers with an understanding of the interactions between the microservices. And especially when you're thinking about chaos engineering and experiments, you want them to understand what the impact of the experiment is. And what we shouldn't forget is the user experience and what the user sees when we're running these experiments. Because if you're thinking about that baseline, and we're running an experiment and the baseline doesn't move, that means the customer is still happy, and it also means that we are resilient to such a failure.

So now that we understand the observability aspects, I'd like to move on to organizational awareness. What we have found is that it works fairly well when you start with a small team, enable that small team on chaos engineering, have them build common faults that can be injected across the organization, and then enable the decentralized development teams on chaos engineering. Now, why is that? If you're thinking about many of you who sit in the room, you have hundreds, if not thousands, of development teams. There is no way that a central team will understand every single workload that is around you. There is also no way that that central team will get the power to basically inject failures everywhere. But those development teams already have IAM permissions to access their environments and do things in their environments. And so it's much easier to help them run experiments than having a central team that runs it all.

And that also helps with building customized experiments for those various development teams that they eventually can share with others, along with the learnings that came out of them. And key to all this, of course, is having an executive sponsor that helps you make resilience part of the software development life cycle journey and also shifts the responsibility for resilience to those development teams.

And then we need to think about real-world events and examples. Now, what we see most when we're looking at all the failures that our customers have is code and configuration errors. So think about the faults that you can inject when you're thinking about deployments, or think about the experiments that you can do and ask, do we even realize that we have a faulty deployment, and do we see it within our observability? And when you're thinking about infrastructure, what if you have an EC2 instance that fails, or suddenly an EKS cluster where a load balancer doesn't pass traffic? Are you able to mitigate such events? What about data and state? This is not just about cache drift, but what if suddenly your database runs out of disk space? Do you have mechanisms to, one, detect that, and two, mitigate it as well?

And then of course, my favorite, which is dependencies. Do you understand all the dependencies that you have within your system, including third-party dependencies? What if you're dependent on, let's say, a third-party IdP? Do you have mechanisms in place to fall back, and how do you prove that you can? And then last, of course, natural disasters, when we're thinking about human error, for example, or events like Hurricane Sandy and others, and how you can fail over and how you can simulate that, again in a controlled way, through chaos engineering experiments.

And then the last prerequisite truly is about making sure that if we find a deficiency within our systems that is related to security or resilience, we go and remediate it, because it's worth nothing if we build new features but our service is not available. So we need to have that executive sponsorship, and we need to be able to prioritize these fixes.

And so this brings us to continuous resilience. When you're thinking about resilience, resilience is not a one-time thing. Resilience should be part of our everyday life when we're thinking about building resilient workloads, from the bottom all the way up to the application itself. And so continuous resilience is a life cycle that helps us think about our workload from a steady-state point of view and work towards mitigating events like the ones we just went through, from code and configuration all the way to the very unlikely events and disaster recovery, to safe experimentation within our pipelines and outside of our pipelines, because errors happen all the time, not just when we provision new code. It's about making sure that we learn from the faults that surfaced during the experiments, that we learn to anticipate what to do, that we are able to mitigate these various faults, and that we then also share those learnings throughout the organization so that others can learn from these experiments.

And so when you take continuous resilience and chaos engineering and you put them together, that's what leads us to the chaos engineering and continuous resilience program. That's a program that we have built over the last two years at AWS and have helped many customers run through, which enabled them to build a chaos engineering program within their own firm and scale it across various organizations and development teams so that they can build controlled experiments within their environments.

And so usually when we're starting on this journey, it's with a game day that we are preparing for. Not a game day as you might think, where we're just running for two hours and checking if something was fine or not; especially when we're starting out with chaos engineering, it's important to truly plan what we want to execute on. So setting expectations is a big part of it, and key to that, because you're going to need quite a few people that you want to involve, is project planning. Usually the first time we do this, it might take between one and three weeks, where we're planning the game day and the various people that we want in the game day, like the chaos champion who will advocate for the game day throughout the company, the development teams, the SREs if there are any, observability, and incident response.

And then once we have all the roles and responsibilities for the game day, we're going to think about what it is that we want to run chaos experiments on. And when you're thinking about chaos engineering, it's not just about resilience; it can be about security as well. So part of this is a list of what's important to you: that can be resilience, availability, security, durability; that's something which you define. And then of course, we want to make sure that there is a clear outcome on what we want to achieve with the chaos experiment. In our case, when we are starting out, what we want to prove to the organization and the sponsors is that we can run experiments in a safe and controlled way without impacting our customers, and that we can take those learnings, whether we found something or not, and share them, so that the business units understand how to mitigate future failures if we found something, or have the confidence that we are resilient to the faults that we injected.

So then we define the workload. For this presentation, I chose a workload: a payments workload, running on EKS, with some databases and message buses with Kafka. What's important there, too, is that when you're choosing a workload, make sure that when you're starting out, you don't choose the most critical workload that you have and then impact it and everyone is unhappy. Choose a workload where you know that, even if it's degraded and has some customer impact, it is still fine. And usually we have metrics that allow for that when you're thinking about SLOs for your service.

So once you've chosen a workload, we're going to make sure that the chaos experiments we want to run are safe, and we do that through a discovery phase for the workload. That discovery phase will involve quite a bit of architecture work; we're going to dive into it. All of you know the Well-Architected review, and when we're thinking about the Well-Architected review, it's not just about clicking the buttons in the tool; we're taking a deep dive into the various designs of the architecture. We want to understand how the architecture, the workloads, and the components within your workloads speak to each other. What mechanisms do they have in place, like retries? What mechanisms do they have in regards to circuit breakers, and have you implemented them? Do you have runbooks and playbooks in place in case we have to roll back? And we want to make sure that you have observability in place, and health checks as well, so that when we execute something your system can automatically recover.

And if we have all that information and we see that there is a deficiency that might impact internal and external customers, that's where we start: if we have known issues, we're going to have to go and fix those first before we move on within the process. Now, if everything is fine, we're gonna say, OK, let's move on to the definition of the experiments. And that's a very exciting part.

So when you're thinking about the system that we just saw, we can now think about what can go wrong within our environment, and whether we already have mechanisms in place or not. For example, if I have a third-party identity provider, do I have a break-glass account in place where I can prove that I can log in if something happens? What about my EKS cluster? If I have a node that fails, do I know the cold boot time for the node itself? Do I know the cold boot time for the pods themselves? And how long is it going to take for these to be live again so that they can take traffic and my customers are not going to get disconnected or have a bad experience? Or think about someone misconfiguring an Auto Scaling group health check, which suddenly marks most of the instances as unhealthy. Do you have mechanisms to detect that? And what does that mean, again, for your customers and the teams that operate the environment?

And then think about a scenario where someone pushed a configuration change and suddenly your cluster cannot pull from your container registry anymore. That means that you cannot launch any containers. Do you have mechanisms to mitigate that? And there are a few more scenarios that we can think about, like events with Kafka: are you going to lose messages if the broker suddenly reboots or you lose a partition? Do you have mechanisms in place to mitigate that? Or the Aurora database failing over: do your applications know that they need to go to the other endpoint?

And so these are all infrastructure faults, and some of the developers who might be sitting here would say, yeah, you know, that's mostly easy to fix. But think about latency and jitter when you inject them, and the cascading effects that can happen within your system. These are all things we want to understand, and with fault injection and controlled experiments, we are able to do that.

And then, lastly, think about challenges that your clients might have connecting to your environment. So for our experiment, what we wanted to achieve is to prove that we can execute and understand a brownout scenario. What a brownout scenario is, is that a client that connects to us expects a response within a certain number of milliseconds, and if we do not provide that, the client is just going to go and back off. But the challenge with a brownout is that your server is still trying to compute whatever it needs to compute to return to the client, and those are wasted cycles. That inflection point is called a brownout.
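One pattern that helps here, shown as a small sketch and not something prescribed in the talk, is to track the client's deadline server-side and shed work once it has passed, so the server stops burning cycles on responses nobody will read. The budget value and the handler shape are hypothetical.

```python
import time

REQUEST_BUDGET_MS = 100  # hypothetical client-side timeout for this API


def handle_request(received_at: float, do_work) -> dict:
    """Drop work whose client-side deadline has already passed.

    During a brownout the client gives up after its timeout; without a check
    like this, the server keeps computing responses that will never be read.
    """
    deadline = received_at + REQUEST_BUDGET_MS / 1000.0
    if time.monotonic() > deadline:
        return {"status": 503, "body": "deadline exceeded, shedding load"}
    result = do_work()
    if time.monotonic() > deadline:
        # Finished too late to matter; surface it so the brownout is visible.
        return {"status": 504, "body": "completed after client deadline"}
    return {"status": 200, "body": result}


# Example with a stand-in unit of work:
print(handle_request(time.monotonic(), lambda: "ok"))
```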

Now, before we can think about an experiment to simulate the brownout within our environment, we need to understand the steady state: what it is and what it isn't. When you're defining a steady state for your workload, think of it as the high-level top metric for your service. For example, for a payment system, that's transactions per second. For retail, that's orders per second. For streaming, it's stream starts per second, or playback events started for media. And when you're looking at that line, you see very quickly, if you have an order drop or a transaction drop, that something you injected into the environment probably caused that drop.

Now, once we understand what that steady state is, we are going to think about the hypothesis. The hypothesis is key when you're thinking about the experiment, because the hypothesis will define at the end whether your experiment turned out as you expected, or whether you learned something new that you didn't expect. And so the important part here is, as you see, we are saying we are expecting a transaction rate of 300 transactions per second, and we think that even if 40% of our nodes fail within our environment, 99% of all requests to our APIs should still be successful and return a response within 100 milliseconds at the 99th percentile. What we also want to define, because we know our systems, is that based on our experience the nodes should come back within five minutes, pods should get scheduled within eight minutes and be available, traffic will flow again to those pods, and alerts will fire after three minutes.
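To make the hypothesis enforceable, you can express the steady-state tolerance as an alarm that your tooling can later use as a stop condition. A minimal sketch with boto3, assuming a hypothetical custom metric for API latency:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if p99 API latency breaches 100 ms for three consecutive minutes.
# Namespace, metric, and dimension names are illustrative; use whatever your
# service actually emits.
cloudwatch.put_metric_alarm(
    AlarmName="payments-p99-latency",
    Namespace="Payments",
    MetricName="ApiLatency",
    Dimensions=[{"Name": "Service", "Value": "checkout"}],
    ExtendedStatistic="p99",          # percentile statistic
    Period=60,
    EvaluationPeriods=3,
    Threshold=100.0,                  # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",     # no data during a brownout is bad news
)
```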

And once we all agree on that hypothesis, then we're going to go and fill out the experiment template. When you're thinking about the experiment itself and the template, we're going to make sure that we very clearly define what we want to run. We're going to have the definition of the workload itself; the experiment and action we want to run, which in our case is going to be terminating 40% of nodes, because if you're thinking about the steady-state load you have on the system and you remove nodes, that's the brownout scenario which you want to simulate; the environment that we're going to run in, and in our case, we always start the process in a lower environment and not in production; and the duration that we want to run. And think about this: in this case, we're gonna run for 30 minutes, but you might run experiments where you say, I'm gonna run for 30 minutes in five-minute intervals, staggering the experiments that you're running, so that you can look at the graphs and understand the impact of the experiment.

And then, of course, because we want to do this in a controlled way, we need to be very clear about what the fault isolation boundary of our experiment is, and we're going to clearly define that as well, along with the alarms that are in place that would trigger the experiment to roll back if it gets out of bounds. And that's key, because we want to make sure that we are practicing safe chaos engineering experiments.
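As an illustration of how those fields can map onto tooling, here is a minimal sketch of an AWS FIS experiment template for the brownout scenario, created with boto3: the action terminates 40% of an EKS node group, and a CloudWatch alarm acts as the stop condition. The role, node group, and alarm ARNs are placeholders, and the details should be checked against the current FIS documentation.

```python
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    description="Brownout game day: terminate 40% of payment-service EKS nodes",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    stopConditions=[{
        # Stop and roll back if the steady-state alarm fires.
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:payments-p99-latency",
    }],
    targets={
        "payments-nodegroup": {
            "resourceType": "aws:eks:nodegroup",
            "resourceArns": ["arn:aws:eks:us-east-1:123456789012:nodegroup/payments/ng-1/abc"],
            "selectionMode": "ALL",
        }
    },
    actions={
        "terminate-40-percent": {
            "actionId": "aws:eks:terminate-nodegroup-instances",
            "parameters": {"instanceTerminationPercentage": "40"},
            "targets": {"Nodegroups": "payments-nodegroup"},  # verify target key in the FIS docs
        }
    },
    tags={"team": "payments", "environment": "staging"},
)
print(template["experimentTemplate"]["id"])
```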

We also want to make sure that we understand where the observability is and what we are looking at when we are running the experiment. And then you would add the hypothesis to the template as well. We also see two empty lines there, which are the findings and the correction of error. When we're thinking about the experiment itself, good or bad, we're always going to have a final report, where we might celebrate that our system is resilient, or we might celebrate that we found something that we didn't know and just helped our organization mitigate a large-scale event.

So once we have the experiment ready, we're going to think about priming the environment for our experiment. But before we go there, I want to walk you through the entire cycle of how we execute an experiment. First, we have to check if the system is still healthy, because, if you remember, in the beginning we said that if we know the system will fail or the experiment will fail, we are not going to run the experiment. Once we see that the system is healthy, we're going to check whether the experiment that we wanted to run is still valid, because it might be that the developer already fixed the bug that we thought might surface if we ran the experiment.

And if we see that it is, then comes something very important: we're going to create a control and an experimental group, and we're going to make sure that they're defined. I'm going to go into that in a few seconds. And if we see that the control and experimental group are there, defined, and up and running, then we start generating load against the control and the experimental group in our environment.

And we're checking again: is the steady state within the tolerance that we think it should be, or not? If it is within tolerance, then we can finally go and run the experiment against the target. And then again, we check: is it within tolerance based on what we expect? And if it isn't, then the stop condition is gonna kick in and it's gonna roll back.

And if it is, that experiment turns into a regression test. Why? Because now we understand what that experiment does to our system; we know we can mitigate it and it's predictable. So I mentioned the aspects of a control and an experimental group. When you're thinking about chaos engineering and running experiments, the goal is always, one, that it's controlled, and two, that you have minimal to no impact to your customers when you're running it.

The way you can do that is what we call having not just synthetic load that you generate, but also synthetic resources. For example, you spin up a new, synthetic EKS cluster: one where you inject the fault, and another one which stays healthy, in the same environment that you are in. So you're not impacting existing resources that might have customer traffic, but new resources with exactly the same code base as the existing ones, where you understand what happens in a certain failure scenario.

So once we prime the experiment, and we see that the control and the experimental group are healthy and we see the steady state, we can move on and think about running the experiment itself. Now, running a chaos engineering experiment requires great tools that let you run the experiment safely. There are various tools out there that you can use and consume, and in AWS we released Fault Injection Simulator last year in March.

And when you're thinking about one of the first slides, with the shared responsibility model for resilience, Fault Injection Simulator helps you quite a bit with that, because the faults that you can inject and the actions that you can run go against the AWS API directly, and you can inject faults against your primary dependencies to make sure that you can create mechanisms so you can survive a component failure within your system.

Now, there are two fault sets and actions that I want to highlight that we just recently released. There is now integration with LitmusChaos and Chaos Mesh, and the great thing about this is that it provides you with a widened scope of faults that you can inject, for example into your Kubernetes cluster, through Fault Injection Simulator via a single pane of glass.

And then we also released the network connectivity disruption, which allows you to simulate, for example, Availability Zone issues and events, to understand how your workloads react if you suddenly have a disruption within an Availability Zone. We're also working on various other capabilities where you can flip a switch and have impact on an AWS service during the experiment, and once the experiment is over, your service will just work fine.
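For a sense of what that looks like, here is a rough sketch of the targets and actions sections of an FIS experiment template using the network connectivity disruption action; the tag filter and durations are placeholders, and the exact parameter names should be checked against the FIS documentation.

```python
# Sketch of the targets/actions portion of an AWS FIS experiment template that
# disrupts connectivity for the subnets of one Availability Zone.
az_disruption = {
    "targets": {
        "az-subnets": {
            "resourceType": "aws:ec2:subnet",
            "resourceTags": {"chaos-target": "true"},  # hypothetical tag on the AZ's subnets
            "selectionMode": "ALL",
        }
    },
    "actions": {
        "disrupt-az": {
            "actionId": "aws:network:disrupt-connectivity",
            "parameters": {"scope": "availability-zone", "duration": "PT10M"},
            "targets": {"Subnets": "az-subnets"},
        }
    },
}
```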

And so that will help you again to validate and verify, through the entire process, that your systems are able to survive component failures. Now, if you want to run actions against, let's say, EC2 instances, you also have the capability to run these through SSM, AWS Systems Manager. Now, think about where these come into play: when we're thinking about running experiments, there are various ways in which you can create disruptions within the system.

Let's say you have various microservices that run and consume a database. Now, you might say, well, how can I create a fault within the database without having an impact on all those microservices? The answer is that you can inject faults within the microservice itself, for example packet loss, which results in exactly the same thing as the application not being able to write to the database, because the traffic is not getting there.

And so it's important there to widen the scope, think about the experiments that you can run, and see, with the actions that you have, how you can simulate those various experiments. In our case, because we want to do that brownout that I showed before, we would use the EKS action that can terminate a certain amount of nodes, a percentage of nodes within our cluster, and we would run that.

Now, I mentioned that we want to have a tool that we can trust, where we can make sure that if something goes wrong, an alarm automatically kicks in and helps us roll back the experiment, and Fault Injection Simulator has these mechanisms. So when you build an experiment, you can define which alarms have to kick in based on the experiment, and if something goes wrong, the experiment stops and you can roll back the experiment.
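Putting it together, here is a minimal sketch of starting an experiment from a template like the one defined earlier and watching its state; the template ID is a placeholder, and FIS moves the experiment to stopped on its own if a stop-condition alarm fires.

```python
import time

import boto3

fis = boto3.client("fis")

# Start from an existing experiment template (placeholder ID) and poll its state.
experiment = fis.start_experiment(experimentTemplateId="EXT1234567890abcdef")
experiment_id = experiment["experiment"]["id"]

while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(state["status"], state.get("reason", ""))
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)  # check every 30 seconds while the experiment runs
```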

And so in our case, everything was fine, and we said, OK, now we have confidence, based on the observability that we have for this experiment, to move up to the next environment. Now, here it's key: as soon as you get into production, you have to think about the guardrails that are important in your production environment when you're running a chaos engineering experiment in production. Especially when you're thinking about running them for the first time, please don't run them during peak hours; it's probably not the best idea.

Also make sure, because in many ways, when you're running those experiments in lower-level environments, your permissions might be much more permissive than in a production environment, that you have the permissions to execute the various experiments, and that you have the observability in place to be able to see what's going on within that environment as well.

It's also key to understand that the fault isolation boundary changes because we are in production now. So make sure you understand that as well, and be able to understand what the risk is if I'm running this experiment within the production environment, because again, we want to make sure that we're not impacting our customers.

And the last one, which we often see not being up to date, is that runbooks and playbooks are not current with the latest changes that were made to the workload. So make sure you have all of these, and if we see that we do, we're finally at the stage where we can think about moving up to production.

So here again, we're going to think about priming the environment, for production this time. You've seen this picture before in the lower-level environment. But if we're in a production environment and we don't have a mirrored environment, which some of our customers do, where they split the traffic and have a chaos engineering environment in production alongside another environment, we can also use a canary: we take a percentage of real user traffic and start bringing that real user traffic to the control and the experimental group.
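One way to shift that small percentage of real user traffic, sketched here under the assumption that an Application Load Balancer fronts the workload, is weighted target groups on the listener; all ARNs are placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Keep 90% of traffic on the existing fleet and send 5% each to the control
# and experimental groups.
elbv2.modify_listener(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/payments/abc/def",
    DefaultActions=[{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/prod/111", "Weight": 90},
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/control/222", "Weight": 5},
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/experiment/333", "Weight": 5},
            ]
        },
    }],
)
```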

Now, keep in mind, at this point in time nothing should go wrong. We have the control and experimental group there; we haven't injected the fault. We should be able to see, from an observability perspective, that it's all thumbs up. And once we see that that has happened, that's where your heart starts pounding.

At least most of the time when you're running it for the first time. Now we get to actually running the experiment in production. But when you're thinking about running the experiment in production, that's very different from an event that happens out of nowhere, where there's no one in the room, you have to page everyone, and people are running around. Here, we have already run through all these stages, so we have the confidence that our workload should be perfectly fine, based on what we've seen.

And if anything were to happen, you have an entire team of experts in that room looking at that dashboard, and if they see a spike that they didn't expect, that experiment is done and they're going to fix that issue right away when they see it. So the experience of running experiments in production, especially when you're running them in a game day style, is very different from a real-world event.

And in many ways it helps you find deficiencies within production environments as well, ones that might not occur in lower-level environments. So now, whether something happened or not, after a chaos engineering game day, and also after experiments that you run automatically like Capital One does, we're always going to go into a post-mortem or correction of error.

Now, key here, when we're thinking about chaos engineering, is that we are very transparent and blameless in regards to what we have found within our system, because only this way are we going to learn and be able to tell others what we've seen within the environment. But there are certain questions which we need to ask when we are looking at the experiment.

For example, how did we communicate with each other during the experiment? Was there someone that we had to bring in to make sense of what we saw on the screen, because there is this one person who just knows almost everything, but he wasn't there? And what are, for example, some of the mechanisms that we want to use to share the learnings, good or bad, that we found during the experiments?

Now, think about it: you want to make sure that your business units and the various developers see those findings, because you want to share these with them so that they don't make the same mistakes. It's advertisement for you: you can say, hey, I found these various issues, and because of that I was able to mitigate X, Y and Z. There is another customer that is also here at re:Invent, Riot Games.

They wrote a great story about how they did a chaos engineering game day on Redis and were able to find various issues within it, and they were also able to help another team with chaos engineering and help them have a seamless launch. But prior to that, they found faults within their environments in regards to configuration errors with load balancers and circuit breakers that weren't implemented, and they shared this with the development team.

So they see, wow, I really get a lot of value out of this. And so it's important to promote that as well. Now, we're not at the end yet. What we have done so far is create a great level of confidence that we know we can run this experiment without impact to our environments. And this is where the continuous resilience aspect comes into play.

I said in the beginning that resilience is not an after-the-fact thing; it should be with you all the time, and this is where we're thinking about the automation of those experiments. Now, chaos experiments: you can think of them as individual experiments. If you're thinking about the organizational awareness that we had in the beginning, you will have various teams that have specific experiments they want to run, and that's what you would run individually to prove a certain point.

We would also have the experiments that we run in the pipeline. But as mentioned, we need to make sure that experiments also run outside of the pipeline, because faults happen all the time. They don't just happen when I push my code; they happen day in, day out, morning and night, whenever. And then use the game days to bring the teams together and make sure that you understand not just the aspects of how to recover the applications, but also look at it from a continuity-of-operations perspective: how do your processes work, and when you're running experiments, are people alerted in a way that they see what they need to do, or not?

So to make it easier for our customers, we have of course built various templates and handbooks that we share when we're going through the experiments with them, like the chaos engineering handbook that shows the business value of chaos engineering and how it helps with resilience, the chaos engineering experiment templates, the correction-of-error template, and also the various reports that we share with customers when we are running through the program.

Now, next, I just want to show you some resources that you can start with when you're thinking about chaos engineering on your own time. We have great chaos engineering workshops that go from resilience to security. But when I run these workshops, what I usually do is start with the observability one, and the reason I do that is because that workshop creates an entire system that provides me with everything in the observability stack.

And I have to do absolutely nothing to get it, outside of pressing a button. And once I have that, and I have the observability from the top down to tracing and logging, I go to the chaos engineering workshop and look at the experiments that I have there; it starts with some database fault injection, containers, and EC2, and shows you how you can do that in a pipeline.

You take those experiments and run them against the pet shop within the observability workshop, and that gives you a great view of what's going on within your system. If you inject those faults, you will see them right away within those dashboards, with no effort on the observability side. And that's important because, again, as the National Australia Bank said, you need to see what's happening within your system; otherwise, even your best architecture is worth nothing. Another workshop, and there is a white paper that was released as well about fault isolation boundaries in AWS, is the one on testing resilience using Availability Zone failures.

Think about running chaos experiments that would trigger Availability Zone failures. This workshop also shows you how to use the embedded metric format within CloudWatch Logs to add observability to the pet shop, so you can see how it fails over. To make it easier for you to find all of these, I have provided some QR codes, and these slides will be available as well, where you can find those various workshops, blogs, and white papers that are interesting to read.

So in this session, you've learned about what chaos engineering is and what it isn't, but more importantly, what the power of chaos engineering is, what you can achieve with it, and how you can build your own chaos engineering program and scale it to build more resilient and robust workloads on AWS that provide your customers with a better user experience.
