Deep dive into Amazon ECS resilience and availability

Good morning, everybody. Welcome to day four of re:Invent. I honestly don't know how many days we've been here. Anyway, thank you for honoring us with your presence and not going to the keynote or somewhere else; we're really, really honored. My name is Maish Saidel-Keesing. I'm a developer advocate with the ECS service team. We're going to be talking a little bit today about Amazon ECS, and I'd like to introduce my co-speaker.

Hi, everyone. My name is Malcolm Featonby. I'm one of the senior principal engineers from the container services org. This is CON401, Deep dive into Amazon ECS resilience and availability. The session is being recorded and the slides will be available after the event, so if you want to take pictures you're more than welcome, and if you don't, you can get the actual screenshots, because there's going to be a decent amount of diagrams. It's going to be a deep dive session, so we're assuming that you understand what the AWS cloud is, you understand what Docker is, you understand what a container is, and hopefully you're not new to ECS. So let's go over our agenda for today.

There are probably, maybe, hopefully not very many of you who don't know what Amazon ECS is, so we'll give a very, very small introduction: one slide on what Amazon ECS is and a few things about our service. Afterwards I'm going to hand it over to Malcolm to talk about how we build our services for availability and resilience, what that actually means to us as an Amazon service team, some concepts about architectural availability and resilience, and how we continuously improve — how we make our service what it is, and what allows you to deploy your applications in a safe, simple and reliable way.

Amazon ECS is a native container service which runs on the AWS cloud. If you pretty much know how to use AWS and Amazon EC2, you probably know how to use ECS in the same way: the same way you provision an Amazon EC2 instance, you can also provision a container on Amazon ECS. You have the option of running these containers in a number of locations: in the AWS regions, or on AWS-supported hardware located outside of the region, for example on AWS Outposts, Wavelength or Local Zones. You have the option of running not only on EC2 but also on Fargate. Fargate is a managed, serverless container compute engine which allows you to deploy the container and not have to worry about what instances are running underneath. And the last option, of course, is ECS Anywhere, which allows you to provision the same workloads in the same way on premises in your data center, if you have those requirements, on supported hardware. We don't really care about the hardware; it doesn't make a difference if it's running on one vendor's hardware or another's. As long as a supported operating system is running, you can run your container in the same way, connect it to the region, and manage all your workloads as you do today.
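To make those capacity options concrete, here's a minimal, hypothetical boto3 sketch (not from the talk) of launching the same task definition on each capacity type; the cluster name, task definition, and subnet IDs are placeholders.

```python
# Hypothetical sketch: the same task definition launched on the three ECS
# capacity types described above. All names and IDs are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Fargate: no instances to manage, just supply networking for the task.
ecs.run_task(
    cluster="demo-cluster",
    taskDefinition="demo-app:1",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "assignPublicIp": "DISABLED",
        }
    },
)

# EC2: the task lands on container instances you own and manage.
ecs.run_task(cluster="demo-cluster", taskDefinition="demo-app:1", launchType="EC2")

# ECS Anywhere: the task lands on your own on-premises hardware registered
# with the cluster as external instances.
ecs.run_task(cluster="demo-cluster", taskDefinition="demo-app:1", launchType="EXTERNAL")
```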

Amazon ECS is a service which is, I would say, pretty integral to how we operate inside Amazon and AWS itself. We launch over 2.25 billion tasks every single week across all of our regions. That's a huge amount of traffic and API calls: we handle tens of thousands of API requests per second, 24 hours a day, across all of the regions. We are deployed as a service in 32 different regions on six different continents. And the interesting thing is that 65% of all new customers to AWS will be using ECS, because of its simplicity, because it works the way they are currently used to working, and because they don't have to worry about managing the underlying infrastructure in any kind of way.

Our customers love ECS, across all different segments, whatever kind of vertical you may be in: small companies, large companies, Fortune-some-number companies, whatever it may be, everybody uses ECS in some kind of way. But it's not only external customers. Amazon ECS is also an integral part of how we build services internally in our company. A lot of the services that you might recognize on the screen — SageMaker, Lex, Polly, AWS Batch — are utilizing ECS under the hood to build the service which they provide to you. Any of these services here on screen, and others as well, use Amazon ECS for the same reasons I mentioned before: the simplicity, the reliability and the security of ECS.

So I'm going to hand it over to Malcolm, and he's going to go into more detail about the architectural resiliency and availability.

Hey, everyone. I just wanted to add my thanks to you all for coming out; I really appreciate it. So before we dive into talking specifically about how we build ECS to be highly available and highly resilient, I thought it might be useful for us to talk a little bit about some of the tenets, the design principles, that we use in order to make sure that we're engineering our services to achieve those outcomes. And at the same time, I wanted to introduce you to some of the foundational building blocks that AWS delivers, that we build on top of and that you can also build on top of, which form some of the foundations that we use in order to make sure that our services are highly available.

So to start off with, our CTO, Dr. Werner Vogels, has often been quoted as saying that in order to build systems that are resilient, we need to embrace failure as a natural occurrence. There's another person out there who is equally infamous: Mr. Murphy. Murphy's law says that anything that can go wrong will go wrong, and certainly our experience has been that this generally tends to be true. In fact, he adds: at the most inopportune time. So really the key takeaways are that in order for us to build services that are highly resilient and highly available, we have to embrace the notion of failure as part of the design of the architecture for the service itself. And in addition to that, we also want to make sure that we're normalizing the notion of failure as part of the operation of the service.

So we want to design for the notion of failure, and we want to normalize the operation of the service such that when failure does occur, it's just part of our normal operating procedure. I'm going to talk a lot about availability and a lot about resilience, and I thought it might be useful for us to quickly start by defining what we mean by those terms in the context of this conversation. Availability, the way we're talking about it, is a probability; it's a probability metric. In effect, if you have a web service that is serving 100 requests, and it was able to usefully serve all 100 of those requests, then it is 100% available. So availability is the ability of the service to do useful work when needed. If, in this particular case, the service only served 99 of those 100 requests usefully, then the service is 99% available. And our goal is obviously to reach that nirvana of 100% availability; we want to make sure that our service is always available to do useful work.
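To make that definition concrete, here is a trivial sketch of the availability arithmetic as described: the fraction of requests usefully served.

```python
def availability_pct(successful: int, total: int) -> float:
    """Availability as defined above: useful work done divided by work requested."""
    return 100.0 * successful / total

print(availability_pct(100, 100))  # 100.0 -> fully available
print(availability_pct(99, 100))   # 99.0  -> one failed request out of 100
```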

Resilience is a key attribute that feeds into availability. A service's resilience is its ability to absorb failure and to recover quickly from that failure. You can see that a service that isn't resilient, that can't absorb failure, is clearly not going to be able to meet its availability requirements. And equally, if the service is able to absorb the failure but then takes a while to recover, that's going to drive down the availability. So we really want to make sure that we're optimizing for high resilience — quick recovery and the ability to absorb failure — in order to try and reach that nirvana of 100% availability.

Let's jump a little bit into some of the foundational building blocks that AWS delivers in order for us to build our services and make them highly resilient, and in order for you to do the same. The first is the AWS region. A region is a local geographic presence of AWS. We have 32 regions spread across the world, and we're continually adding to that list. The thing with a region is that it uses the concept of a shared-nothing architecture: effectively, regions are entirely discrete, isolated installations of AWS. They are unaware of other regions. And this is really useful, because it means that they are discrete and atomic, and they fail completely independently of each other. The other benefit that you get from an AWS region is that it drives up your availability and resilience because it allows you to put your workload as close to where you need it.

So you can either put it close to your customers or close to where you're provisioning your workload, whatever works best for you. But it means that you can get that workload as close to where it's needed in order to make sure that you're reducing things like network hops, et cetera, which reduces the probability of network connectivity failure.

Now, within the context of an AWS region, there are availability zones. An availability zone is a logical abstraction over a physical infrastructure separation. What I mean by that is that every AWS region has at least three availability zones which logically partition the region into sets of physically discrete data centers. Any one availability zone comprises multiple data centers that are located geographically away from the data centers that comprise another availability zone, such that in the event of a natural disaster of some sort, only one of those availability zones would be impacted — but close enough geographically that we can get very high throughput and low latency between them, so that it feels very local.

So the benefit there is that you have this logical abstraction that you can build on top of, which is supported by independent infrastructure, and that independent infrastructure has independent power, independent networking, independent cooling. And so we can be assured that availability zones will, in fact, fail independently of each other.

So now that we've gone through some of those basics, let's dive a little bit into the ECS architecture itself. But before I do that, I'm going to take a drink of water. It's a bit dry out here in Vegas.

So let's dive a little bit into the ECS architecture and talk about how we use some of these concepts in ECS in order to deliver the availability that we need.

First off, ECS is installed in every AWS region and it uses the shared nothing architecture. Effectively ECS in one region is completely unaware of ECS in any other region. It is operated independently, functions independently.

ECS is also present in at least three availability zones in every region that it's installed in. Now here, I'm talking about ECS in terms of the ECS control plane. Let me just describe a little bit what I mean by the ECS control plane.

The ECS control plane is that portion of the ECS managed service which does the provisioning of your workloads. It's the thing that drives the scheduler; it's doing placement, it's doing deployments for you. It's the orchestration and management layer that we deliver to you.

That ECS control plane is, as I said, delivered in at least three availability zones. One of the ways that we could provision that control plane is we could work out how much capacity we needed in order to serve the TPS rates that Maish was talking about earlier.

We could provision that capacity across these three availability zones. But let's say, for example, we're provisioning at the p100 of the capacity that we need. What might happen is that if one of those availability zones were to fail, because we're deployed over the three, we end up in a situation where we have only 66% of the capacity that we would need in order to serve the traffic that is required.

As a result of that, we effectively end up in a situation where we need to scale up in response to this event. When you take that approach, you introduce a bimodality into your service. A bimodal behavior in a service is one where the service's standard operating procedure deviates to something different in the face of failure — it's actually doing something different in the face of failure.

We wanna really try and avoid that. And so we try very hard to index on leveraging a principle that we call static stability. Static stability means that you want to be in a position where your service is able to handle a failure of a dependency and doesn't need mutation of that service in order to continue to operate.

You want to be able to effectively run in a steady state and remove this need for a bimodal behavior, this need to take corrective action. The way that we do that for ECS is that we prescale our ECS service to 150% of our p100 peak plus a percentage.

That effectively means that at any point in time, ECS is over-provisioned in every region, in order to make sure that in the event that an availability zone were to fail, ECS continues to operate at 100% of capacity. That delivers the static stability, which is an incredibly important concept for us, and we leverage it extensively in terms of how we do operations. We're going to get into that more.

Really key to this is this notion of over-provisioning, prescaling, in order to make sure that we are able to handle the event of an availability zone failure.

You're actually able to take advantage of this in your implementations running on top of ECS. I'm going to introduce two ECS concepts to you; you're probably familiar with these.

The first is the notion of an ECS service. An ECS service describes an application workload. It has an associated desired state, the desired count, and that desired count describes the number of ECS tasks, the number of workers, that you need in order to deliver that service.

In this particular example, we have an ECS service which has a single ECS task, so a desired count of one. ECS will actively make sure that it drives towards your desired count. It will continually provision capacity to make sure that you are in a position where you have the desired count that you need.

Now, obviously, in this particular implementation, we're sort of tempting fate. We're picking a fight with Mr. Murphy, as it were, because in the event that this task were to fail, we would have a 100% outage. And so we want to avoid that.

In a perfect world, what we want to do is prescale to make sure that we have presence in at least three availability zones, because that way we have the ability to absorb a single failure and have our workloads continue to operate in the other two.

In this particular case, in this example, our ECS service is provisioned with three tasks, and those three tasks are spread across our three availability zones.

Now, in this example, we're using ECS Fargate. Fargate is our serverless compute platform. And the thing with Fargate is that when you run an ECS service on top of Fargate, Fargate will automatically spread the placement of the tasks that you ask it to provision.

So it will, it will use the desired count and it will use the availability zones that you tell it to provision into and it will make sure that it continually spreads the load or the workloads across those availability zones.

That means that by default, ECS is giving you the ability to provision capacity and to leverage these availability zones in the same way that we do, in order to deliver the availability posture that we want.
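Here's a minimal sketch of that pattern, assuming a Fargate service with a desired count of three and one subnet per availability zone so the scheduler can spread tasks across all three AZs; every name and ID is a placeholder.

```python
# Hypothetical sketch of a multi-AZ Fargate service: desiredCount=3 with one
# subnet in each of three availability zones, so the tasks get spread across AZs.
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="demo-cluster",
    serviceName="web",
    taskDefinition="web-app:1",
    desiredCount=3,                      # one task per AZ in this example
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            # One subnet per availability zone.
            "subnets": ["subnet-az1", "subnet-az2", "subnet-az3"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```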

Now, you can do the same thing with ECS on EC2. As Maish called out, ECS runs on multiple capacity types, so you can do the same thing with ECS on EC2, although when you do that, you need to be a little bit more conscious of the fact that underlying the capacity is an EC2 instance that you own and manage.

And so typically, the way you would use that is with an auto scaling group, more often than not with capacity providers and a managed auto scaling group. What you want to make sure of is that the auto scaling group that you're using is provisioning EC2 instances into the availability zones that you're going to be leveraging for your task placement.

So if you have at least three availability zones provisioned for your auto scaling group, your EC2 instances will also be spread across those three availability zones. You then do the same configuration for your ECS service, and the end result is that you will get the same spread placement.
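Here is a hedged sketch of that EC2 path: an Auto Scaling group whose subnets span three availability zones, wrapped in an ECS capacity provider with managed scaling. Names, ARNs, and subnet IDs are hypothetical, and the launch template is assumed to already exist.

```python
# Hypothetical sketch: ASG spread across three AZs + ECS capacity provider.
import boto3

autoscaling = boto3.client("autoscaling")
ecs = boto3.client("ecs")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="ecs-demo-asg",
    MinSize=3,
    MaxSize=12,
    LaunchTemplate={"LaunchTemplateName": "ecs-demo-lt", "Version": "$Latest"},
    # One subnet per AZ, so instances are spread across three availability zones.
    VPCZoneIdentifier="subnet-az1,subnet-az2,subnet-az3",
)

ecs.create_capacity_provider(
    name="demo-cp",
    autoScalingGroupProvider={
        # Placeholder ARN for the ASG created above.
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:111122223333:autoScalingGroup:example:autoScalingGroupName/ecs-demo-asg",
        "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
    },
)

ecs.put_cluster_capacity_providers(
    cluster="demo-cluster",
    capacityProviders=["demo-cp"],
    defaultCapacityProviderStrategy=[{"capacityProvider": "demo-cp", "weight": 1}],
)
```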

One of the things that I talked about earlier is the notion of prescaling: we prescale our service to 150% of peak. You might be asking yourself, well, how do I work out what that number needs to be? What do I set my desired count to in order to find that number?

The way that we do this is we start with our base desired count. In this particular case, as I said, the metric that we use is our peak, so our p100 of throughput plus some percentage above that.

And then we work out how many availability zones we'll be using. In the example that we've been using so far, that's three availability zones. We want to use these two numbers in order to work out our target desired count — in other words, the target capacity that we want to provision.

And so this formula will give you that outcome: take your base desired count and scale it by the number of availability zones divided by the number of availability zones minus one. So for example, take a case where we had a service that needed six base tasks active at all times in order to meet the throughput of the demand, and we were using three availability zones.

Well, you can see here what we would need is a desired count of nine tasks. That gives us a 50% over provisioning.
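A small sketch of that sizing arithmetic, assuming the rule is simply to scale the base count by N/(N−1) so that losing any single availability zone still leaves 100% of the base capacity:

```python
import math

def target_desired_count(base_count: int, az_count: int) -> int:
    """Over-provision so that losing any one AZ still leaves base_count tasks."""
    return math.ceil(base_count * az_count / (az_count - 1))

# The example from the talk: 6 base tasks across 3 AZs -> 9 tasks (50% extra).
print(target_desired_count(6, 3))   # 9
```

With three availability zones that works out to the 50% over-provisioning above; with more availability zones, the overhead shrinks, which is the saving discussed next.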

Now, if we were in a position where our service was growing — let's say, for example, we've been very successful and we've gone from needing six tasks to needing 600 — well, there are benefits in actually incorporating more availability zones. Many of the AWS regions have more than three availability zones, and in those regions, leveraging more availability zones actually gets you the same availability profile, the same availability posture, but at a cheaper cost.

You can see here, in this particular case, we scaled up; we're 100 times more successful, which is fantastic, so we're doing very well. And as a result of that, we've also decided that we're going to incorporate six availability zones into our provisioning.

And so here we actually end up with a 30% saving in terms of the utilization, but with the same availability posture, keeping in mind that availability zones fail independently of each other. At any point in time, we will always have five availability zones to serve the traffic that we need.

So with that, I'm going to move on to a slightly different idea. We talked a lot about static stability and how we can use availability zones and static stability to be prescaled in order to make sure that we can handle failure.

Static stability is incredibly useful with availability zones because it gives us infrastructure isolation. But what about things like software deployments, and scale considerations? How do we get software isolation?

The way that we do this in ECS is that in every AWS region, ECS is actually delivered as partitions. A partition is a full copy of the ECS control plane, and we run multiple copies of it. And what we do is we run a routing proxy, a very thin routing layer, over the top of those partitions, which allows us to route traffic into any one partition.

And the reason that the partitions are so powerful is because they give us an opportunity to isolate work, such that in the event that there were a software failure or a scaling failure of some sort in a particular partition, the blast radius, the impact, is mitigated by virtue of only impacting those workloads running in that particular partition.

This is also incredibly useful for us in terms of scaling up and scaling out. You can imagine that in a service where we're provisioning 2.25 billion tasks a week and serving tens of thousands of transactions a second, it can be very difficult to do scale testing in a non-production environment. It becomes very expensive.

And so one of the ways that we do this is by partitioning the service and creating these full copies, full partitions, of our service. We can scale test to breaking, in a non-production environment, a portion of the functionality of the control plane, or in fact a full copy of the control plane.

And once we know how much capacity one copy of the control plane can take, that gives us an opportunity to work out how much capacity we need in any given region.

So for example, if we know that a given partition can handle 100 requests and we need 1,000 requests in a region, well, it means we know that we need at least 10 partitions in that region.
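That partition-count math is just a ceiling division; a tiny sketch with the numbers from the example:

```python
import math

def partitions_needed(regional_demand: int, partition_capacity: int) -> int:
    """How many full control-plane copies a region needs for a given load."""
    return math.ceil(regional_demand / partition_capacity)

print(partitions_needed(1000, 100))  # 10 partitions, as in the example above
```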

The big benefit that we get from this is that you can, in fact, leverage this underlying partitioning through the ECS cluster. The way that we do sharding — effectively, the way that we do routing of the workloads onto these partitions — is using a combination of your AWS account ID and your cluster ID.

That effectively means that when you create a workload and you create it inside a cluster, you're getting allocated your partition of the ECS control plane.

And then were you to create another cluster, you would be allocated a separate partition of the control plane, et cetera. And so the benefit you get here is that you can work out how you want to isolate your workloads and use this partitioning mechanism — use clusters — to create your own partitioning within your workloads.
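The actual routing logic isn't public, but conceptually the thin routing layer maps an (account ID, cluster) pair onto a partition. A purely illustrative toy sketch — not ECS's real hashing scheme:

```python
import hashlib

def partition_for(account_id: str, cluster_name: str, num_partitions: int) -> int:
    """Toy stand-in for the routing proxy: deterministically map an
    (account, cluster) pair onto one of the control-plane partitions."""
    key = f"{account_id}/{cluster_name}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % num_partitions

# Two clusters in the same account can land on different partitions,
# which is what gives you workload isolation between them.
print(partition_for("111122223333", "payments", 10))
print(partition_for("111122223333", "batch-jobs", 10))
```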

Now, in ECS, the cluster defaults are very high; you can have thousands of clusters by default, and there is no charge for ECS clusters. And as long as the clusters are provisioned onto the same underlying VPC, the same underlying network, there is no reason why your applications running in each cluster can't see each other and talk to each other, et cetera. So the cluster really is a namespace, but it does give you this opportunity to isolate workloads across your clusters in order to benefit from the workload isolation that we have built into ECS under the hood.
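So, if you want coarse-grained isolation between workloads, one hypothetical way to lean on that behavior is simply to give each workload family its own cluster; a short sketch, with made-up cluster names:

```python
# Sketch: clusters as isolation boundaries. Clusters are free, so splitting
# workloads this way costs nothing extra; names are hypothetical.
import boto3

ecs = boto3.client("ecs")

for name in ["payments-prod", "batch-jobs-prod", "internal-tools"]:
    ecs.create_cluster(clusterName=name)
```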

So moving on from that, let's talk a little bit about how we operate the service for resilience. We've covered two ideas, two design patterns, that we use in order to think about how to design for failure, how to architect for failure. Now we're going to cover a little bit about how we think about operating for resilience.

My clicker is not clicking — yes, it is, sorry. All right. So one of the things that you might see if you look back at the diagrams we looked at is that we're creating sort of a mesh. We've got these verticals, which are our infrastructure isolation through static stability for our availability zones, and we've got these partitions, these cells as we refer to them internally — these full copies of ECS in a given region. Each of those cells delivers against that static stability promise; they are all prescaled to 150%. And so as a result, you end up with this mesh, which means that we can deploy a change into one portion of that mesh and isolate that change quite usefully, so that in the event that the change were impactful, it impacts a very small subset of workload.

And the way that we do this is we use a strategy that we call rolling deployments. For a rolling deployment, what we will do is deploy a change set to a single availability zone and a single partition. The availability zone and partition — the cell — always comprises multiple workers. ECS comprises many, many services; each of those services leverages the same partitioning structure, the same architecture, and in any one of those cells, there will be multiple workers serving traffic for that particular partition in that particular availability zone.

In this example we have here, we have 18 of those workers serving traffic. The way that we do rolling deployments is that we will make sure first, obviously, that we've run regression testing, integration testing and scale testing for the software. We want to be very confident in the change, but we always view any change rolling out with some degree of trepidation and caution. We encourage our engineers to view change with caution and make sure that when we roll it out, we build confidence over time.

And so the way that we do this with rolling deployments is that we'll deploy the change to a very small subset of the workers in a cell, in one availability zone. In this particular case, we're rolling out to three of those workers first. As we build confidence, we will continue to roll out in steps of three until we've completely rolled out to one cell in one partition. Assuming that goes well and we are comfortable, we then move on to the next availability zone, the next cell, and the next partition, and so on, until we've eventually deployed that particular change set to all of the availability zones for that particular partition.
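To illustrate that ordering, here's a toy sketch of the wave plan: steps of three workers within one cell (one partition in one availability zone), finishing a cell before moving to the next. The cell names, worker counts, and step size are illustrative, not ECS's actual deployment tooling.

```python
from itertools import islice

def rolling_waves(cells, workers_per_cell=18, step=3):
    """Yield (cell, worker_batch) pairs: finish one cell in steps of `step`
    workers before moving on to the next cell."""
    for cell in cells:
        workers = [f"{cell}-worker-{i}" for i in range(workers_per_cell)]
        it = iter(workers)
        while batch := list(islice(it, step)):
            yield cell, batch

cells = ["partition-1/az-a", "partition-1/az-b", "partition-1/az-c"]
for cell, batch in rolling_waves(cells):
    print(cell, batch)   # bake time + health checks would gate each step here
```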

Now, the benefit we have with this approach, going back to the static stability we were talking about earlier, is that in the event that there were a failure, we're in a position where we can immediately weigh traffic away — we can move traffic out of that availability zone, because we assume that availability zones will fail. We've designed for that as part of our architecture and as part of our normalization of the operation of failure, and we've designed it into the way that we do deployments as well.

The way that we do this is we have automated monitors continually monitoring any one partition and any one availability zone as we do those deployments. And the reason that we want automation here is because we want to make sure that we respond to failure as fast as we possibly can. That goes back to our resilience goal: remember that part of the idea of resilience is the ability of the service to survive failure, but also its ability to recover quickly.

So in order to do this, we introduce this automatic alarm, and then we introduce this idea of bake time. Now, bake time is not actually a time as much as it is a measure of confidence. What we will do is have a set of metrics continually monitoring the health of the service at the granularity that we're deploying, and then we will have another set of metrics and an associated alarm monitoring the health of the service, to make sure that we continue to meet our customers' needs in terms of that availability.

And what we'll do is decide a threshold after which we've built sufficient confidence that we feel the deployment can move on, and we call this bake time. Now, we're fortunate in that, because we're serving tens of thousands of transactions per second, we can set a bake time of: I need to see 100,000 successful operations — and that will take on the order of 10 to 15 seconds to complete. So we can get very high confidence in the change that we've shipped in a very short period of time, just by virtue of measuring the success of that call volume.

So in the event that this alarm identified that there was in fact a failure, it will automatically trigger a rollback. In fact, it will do two things: it will trigger a rollback, and at the same time we also have those alarms monitoring the availability zone health for that particular partition, and it will automatically weigh traffic away. The benefit we get from this is that it allows us to continuously deploy our service, which is incredibly useful for us.
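The monitors and rollback automation are internal to ECS, but the bake-time idea can be sketched: gate promotion on a count of successful operations and bail out the moment the health alarm fires. The function names and threshold below are hypothetical.

```python
def bake(observe_operation, alarm_in_alarm, required_successes=100_000):
    """Return True when enough successful operations have been observed to
    promote the change; return False (triggering rollback and weighing the
    AZ away) as soon as the health alarm fires."""
    successes = 0
    while successes < required_successes:
        if alarm_in_alarm():          # automated monitor detected impact
            return False              # -> roll back + shift traffic away
        if observe_operation():       # one successful customer operation observed
            successes += 1
    return True                       # confidence built, deployment may proceed
```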

We're very privileged in that ECS is growing and popular, and we have a lot of features that we are shipping — enhancements in terms of availability, improvements, bug fixes, et cetera. And so we're deploying ECS multiple times a day. Like I said earlier, ECS comprises multiple services, and each of those services is continuously deployed, so ECS will be rolling out many, many times a day. We want to be in a position where, in the event of failure, we automatically roll back and weigh traffic away from that failure without impacting our customers, and we can do that safely and with confidence.

Now, all of the conversation that we've had so far has talked about ECS delivery in the context of a single region, but like we said earlier, ECS is currently present in 32 regions, and as AWS grows, it will be present in all of them. So one of the things that we have to think about is how we do this globally — how do we roll out globally?

So we use a concept called pipelines. Our pipelines always start by assuming that the change set we're about to roll out is unhealthy; we always take a pessimistic view. And so the pipeline will make sure that it rolls out, like I said previously, to a single region, a single partition, and a single availability zone.

And as we build confidence with that change, we would then roll it out to the next set of partitions in that region and potentially to the next set of regions. But importantly, we never apply a change to a region in more than one availability zone at a time. And that's to make sure that we go back to that static stability premise: we have to be in a position where, at any point in time, we can assume that an availability zone will fail or is failing, so that we can weigh that traffic away.

Now, as we build confidence, we will accelerate that deployment, but again, we never violate that principle, this idea of static stability — the idea that we never make a change to more than one availability zone in a region at a time for the same change set. We build that confidence over time, and as you can see, we start to roll out to more and more partitions and more and more regions concurrently in order to accelerate the deployment. This is ultimately what allows us to deliver change relatively quickly through those pipelines.

Now, importantly, we are continuously deploying and we are shipping change multiple times a day. So at any point in time, through those pipelines, we may well have multiple change sets in flight. But again, just to double down: we never make more than one change in an availability zone at a time, and we never apply that change to more than one availability zone in the same region at a time.
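A toy wave planner can illustrate that constraint: each region walks through its availability zones one at a time, regions are staggered so early waves are small, and later waves run many regions concurrently. Region and AZ names are placeholders; this is not the actual pipeline tooling.

```python
def pipeline_waves(regions, azs=("az-a", "az-b", "az-c"), stagger=1):
    """Toy wave planner: never more than one AZ per region per wave, with
    regions staggered so early waves are small and later waves fan out."""
    waves = []
    total_waves = (len(regions) - 1) * stagger + len(azs)
    for w in range(total_waves):
        wave = []
        for i, region in enumerate(regions):
            az_index = w - i * stagger          # region i starts at wave i*stagger
            if 0 <= az_index < len(azs):
                wave.append((region, azs[az_index]))
        waves.append(wave)
    return waves

for n, wave in enumerate(pipeline_waves(["us-east-1", "eu-west-1", "ap-southeast-2"])):
    print(f"wave {n}: {wave}")
```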

So given that, I'm going to hand back to Maish to talk about how we continuously improve our service. Thank you very much. — Thanks, Malcolm.

So we've been talking a lot about processes and tools, and I actually forgot to do something at the beginning of the session, so I'll do it now. There's a lot of things besides the processes and tools; it's also the people. So I want to tell you a very short story about how this actual session came about.

Every week on our team, with product management, the principal engineers, and the developer advocate team as well, we have a meeting to go over our business metrics, to go over open bugs, how we are doing as a service, and also the feedback we get from customers and how we can act on it.

During one of these meetings a couple of months ago, there was feedback about a request from one of our customers who asked for a specific deep dive into how ECS works. I asked during that meeting whether this is a request that we get on a regular basis, and the answer was yes — and that up until now we haven't had real, public information about how this actually works in the background, under the covers.

So we took it upon ourselves to actually create the content for this event. And there's also going to be a slightly more detailed blog post, based on what you've heard today, coming out next week, so watch the AWS Containers blog for that article.

And this is the content we are sharing here today. Within Amazon itself, we use mechanisms — mechanisms are tools or processes that let us scale the work we do beyond one specific service team, one specific service, or one organization; we use them to scale globally across the whole company. So now that story is done.

Let's go into some of the operational aspects and processes that we use internally.

On the screen is what we call a flywheel. A flywheel is a concept which I'm sure you might have seen before in some of our documentation. It's a circular motion which we continuously use, and every team, every organization, every service, everything within Amazon has its own version of the flywheel.

This one here is the uptime flywheel, which we usually use for chaos engineering experiments and continuous improvement. So I'll just walk you through the flywheel and how it starts.

On the top right-hand side, if you're looking at the screen, the first stage is the prepare stage. When you run some kind of chaos engineering experiment or run game days in order to test your service, the most important thing you have to do is plan and prepare properly before you start running anything: what you're trying to do, what you expect to happen, who you need in the room, what could go wrong, and how you're going to solve the problem. Unleashing these kinds of experiments and breaking things on purpose without having a solution you're going to use, and without knowing how to solve the problem, is the most dangerous thing you can ever do. So please, never ever do that without running a prepare and planning stage first.

Based on that plan and preparation, after the experiment starts running, you go into the detect phase. Did you have all the metrics that you expected to see? Were the logs flowing correctly? Did I get the right alerts that something was wrong with my system based on the change in the experiment that I ran? Do I have enough people in the room? Which leads into the respond phase.

Once you have enough information to understand what is going on, you have to have the right people, and all the right tools and processes, in order to fix it. So, were we able to respond in the amount of time we thought we would? Did we have the right people on the call? Did we have the right processes, the right scripts, the right automated actions or operations in order to resolve the problem — for example, scaling up an instance or scaling up a cluster? All of those things are the respond phase we hoped we had, and that's what we also have to test as part of our experiment.

The last phase is the learn phase. We take all the actions and everything that we went through and we continuously try to improve. Did everything we expected to happen actually happen, or were we surprised by something we never thought about? Are there things we can take away from the experiment that we can improve the next time — in another experiment, or in a different service, at a different scale, in a different region — so that we improve continuously without running into these problems time after time after time? The whole concept of this flywheel is essentially to lead to shorter incidents and fewer incidents.

Because of course, as Malcolm says — and Werner Vogels and Murphy — something is going to break no matter how much you plan for it; it will be something you never thought of or never encountered before. But the more you practice these operations and these processes, the easier it will be for you to recover, the quicker it will be for you to recover, and the less downtime you will have for your customers.

I want to dive into that learn phase a tiny bit, and into one of the tools that we use internally in Amazon, which I think will be very beneficial: the Correction of Errors, or COE. A COE is a process which we use as part of that learn phase, of how to improve. It's a mechanism — as I said before, things which we use continuously to scale within our company — to identify problems which we encountered, to assign ownership to fix those problems where needed, and, most importantly, to share that knowledge with other teams and organizations so that it doesn't stay departmental and tribal knowledge — so that people can learn from what we've encountered and experienced and what they might experience as well in their own services and products. And eventually, of course, the most important thing is to reduce the number of incidents that we have.

The other most important thing is that it's not a blame game. In other words, we never use this process to blame a specific team, process, or organization. This is something we use as a positive learning process, and it's not a punishment for anybody, because mistakes can happen. The question is: how do we prevent them from happening again, or worse, down the road, and what processes and tools do we put into place so that those things do not occur again?

When do we use the COE? When we have customer-impacting events. For example, if a service goes down, if a customer can't press the order button on amazon.com, or they can't launch an ECS task because an API was down — if there's a customer-impacting event, we will run a COE on that event. If it was a procedural miss — for example, in our planning phase we expected something to happen and something completely different went haywire, we didn't understand our system properly, or we didn't have enough knowledge — that means we have to run a COE to understand what we missed and how we're going to fix it. And of course, other incidents which reveal some kind of opportunity for improvement for our team or other teams — these are things for which we will use the COE document.

So what does the COE look like? A COE is divided into two parts with some sub-bullet points. The first part is the supporting information. The supporting information starts with the impact: what was the impact of the actual outage or incident that we had? How many customers were affected? What exactly could they not do? We make a very, very detailed timeline of what happened, when the incident started, when we detected that there was a problem — the shorter the time for that, the better it is for everybody. We have supporting metrics, logs, screenshots of the actual monitoring graphs, because a picture is better than a thousand words, and you can actually see how your API success rate dropped from whatever it usually is at p100 down to something like p20 because of the outage. And we go through incident questions: what actually happened? How did we discover this incident? How could we have shortened the time to discover it? What were the actions we took in order to diagnose the problem, repair it, and eventually fix it, so that our customers were finally back to where they were before it started?

The second part of the COE is the corrections part, and here we start with the five whys. The five whys is a mechanism which we use internally to actually get down to the root cause of what the problem was. Usually it's a configuration change, but that's not the root cause, because everything usually ends with a configuration change unless it's an act of god and something just goes wrong. Diving deep into those questions, using the process of the five whys — what actually was the reasoning behind that configuration change, why wasn't it checked, why wasn't it validated, why wasn't it deployed on a small subset of the fleet first — we continuously dive further into these kinds of questions in order to find what the root cause of the problem actually was, so that we can identify it and fix it.

The second thing is the lessons learned: what we learned from this incident. Maybe we learned that we shouldn't push configuration changes in the middle of chaos experiments, or any number of other things — there are a lot of different lessons that can be learned from these kinds of events. They are usually divided into two parts: the short term and the long term. The short term is to make sure that we immediately fix the most important and critical problems so that they don't happen again. Then there are the long-term items, where we have to improve our processes to make sure this doesn't happen again further down the road, either to us or to another team. These things usually go into the service team's backlog and the roadmap. This could be, for example, decoupling services, because we found out that there was some kind of weak point, a critical part of the infrastructure which wasn't able to handle the capacity, so we have to split the service into two different parts — and that becomes a roadmap item. These items are constantly managed as action items with an owner from the service teams, to make sure that they are done and tracked on a regular basis.

On the top of every single one of these documents you'll see an opening line, and that is a summary, because that's usually what you write last, based on all the information below. It's a small paragraph, an executive summary of what the problem was, what we did, how we solved it, and what we learned, in a very short form, so that people can understand and get a good view of what's coming when they go through the details of the document.

Let's recap what we did today. We talked a little bit about how we build services for availability and resilience. We talked about the difference between availability — how we measure it, how we look at it inside our organization — and resilience, and how we take steps with, as we say — what was the word you used, Malcolm? Cautious skepticism? Thank you very much — cautious skepticism: how we really, really make sure that we deploy in a very measured and careful way, so that we don't have any impact on our customers and the service that they're using. And we covered some of the continuous improvement processes which we use inside our organization, within Amazon as well.

Most of what you've heard today has already been published as white papers. We have something called the Amazon Builders' Library, which is a set of very detailed white papers from our senior principal engineers and principal engineers — and Malcolm, of course, has also written a number of articles there — on how we design systems and look at systems at scale within Amazon. It doesn't give you all the details of how we build every single thing, but it gives you the concepts and the design principles which we use, like we did here in this talk. A lot of the concepts we used are available in these six different articles here, which you can read. If you're bored over the weekend, or looking for something to read on the plane, they will be a good thing to read.

If you would like more, this is not the only ECS or Fargate session we have here at the event. Some of them might have already occurred, so I apologize that maybe you won't be able to catch all of them, but we have a number of different ECS and Fargate sessions available at the event, and you can catch the recordings which are available after the event: talks, workshop sessions, builder sessions. These are things I would advise you, if you have the time and passion for this technology, to definitely go and have a look at as well.

The last thing I would like to point you to is the Modern Applications and Open Source zone, which is located on the expo floor. There's a map — when you're coming in from the expo entrance, at the bottom is where the registration was; behind the registration you walk all the way to the end, to the right, and you'll see a big Modern Applications and Open Source zone where we have engineers, product managers, and people like me, developer advocates, who are there to speak to you, answer your questions, and listen to your feedback — because that's the most important thing to us, the feedback that you give us. Based on that, we develop our services, improve our products, and make them what they are. Before you get up, please —

I would request, if you haven't already, that you open the event application — I hope you have it installed. Please take it out and don't forget to fill in the survey; we would really, really appreciate your feedback on the session today, and any other thoughts and comments you would like to let us know. Our emails are up here on the screen, as well as our Twitter accounts. If you want to send either of us a message, you are more than welcome.

And with that, if there are any questions, there are microphones, I think, on either side — if I can see properly; yes, there are. Please feel free to come up to the mic and ask a question. There are also ECS stickers here if you would like some. And thank you very much for attending today.
