Practice like you play: How Amazon scales resilience to new heights

$300,000 per hour. That's the average cost of downtime across industries, and almost half (46%) of all companies that experience downtime aren't able to serve their customers as a result. Wouldn't it be nice if you could apply the principles, strategies, and tactics of the most successful sports teams, like the ones in the NFL, in the form of a playbook tailored to you: a resilience playbook you can use to train your teams to become resilient to the unpredictable? What do you think, Olga?

I think it's a brilliant idea, Lauren, and I think this is exactly why we're here today. Prime Video has been streaming Thursday Night Football exclusively two years in a row, and for a few years before that we also streamed it, sharing coverage with other partners. This year our audience is up 26%, which basically means more than 12 million customers are joining and watching football each Thursday. This is amazing. We've done this for a while, and we've also learned and recognized that preparing engineering teams for peak workloads and peak streaming is very similar to how teams prepare for getting to the Super Bowl.

So we took all of those lessons, practical insights, and stories, and we created a resilience playbook that we'll be sharing with all of you today in this session, so you can take it home to your own teams.

My name is Olga Hall. I'm a Director of Reliability and Resilience Engineering at Prime Video Sports. And with me today is Lauren Don, who is a chief technologist focusing on the financial and federal sectors, and who also has responsibility for chaos engineering worldwide at Amazon Web Services.

Thank you for joining us. Thank you, Olga, I'm excited to be here.

So today we're gonna go into building that winning mindset you need as a team to be successful when you're thinking about resilience, and you start doing that by thinking about a common goal. Once you have that goal, we're gonna walk you through how to think about training your teams the same way they do it at Prime Video, looking at the predictable. We'll then go into experiments and thinking about what can go wrong when you're looking at resilience: the unpredictable. Then you need to think about how to analyze the various things you learned; we're gonna show you that, and go into how you should memorize those learnings, so that next time something happens you'll have built muscle memory and you'll be ready for it.

Ok, let's start with a little bit of a reminder. Most of you know Prime Video because you have your favorite shows. Maybe it's Jack Ryan, maybe it's my forever favorite, The Marvelous Mrs. Maisel, maybe it's Citadel. There are millions of titles in our rich catalog. Prime Video also streams sports: tennis, baseball, volleyball, New Zealand cricket, you name it. We are super happy to be an entertainment hub and delight our customers with sports. We had about 50,000 events streamed worldwide just this year.

When we prepare for a season, early in the year we start thinking: okay, what is our main mental model? How do we succeed at preparing for global events with distributed teams? Inevitably, year over year, we come back to the same mental model: think global, act local. As we go through the presentation today, I will point out examples where this is actually happening and why we did it the way we did. But just to start us off: why start there?

At the global level, we agree on goals, we agree on programs, we agree to allocate certain resources, we agree on certain work streams, and then we provide guidance to our teams, who take that guidance and bring it home so they can execute independently for those globally streamed events.

So you're thinking about these 200 countries, and you've been streaming Thursday Night Football for quite a while. And I see Black Friday football here. When I think about Black Friday, there's excitement, you have people running to the stores. How do you prepare for these unpredictable conditions?

Well, Black Friday has been absolutely special for us; we hope we've established a new NFL tradition. But if you think about it, it's not that unpredictable, because on Black Friday, what do we have? We have family and friends enjoying time together, and hopefully enjoying time together watching football. So we can make certain assumptions about the audience and about customer behavior.

The things that are unpredictable are a little bit tougher. You cannot really predict the weather, or what will happen in the match.

A little-known fact I'm gonna share with you: the very first time Prime Video streamed TNF, it rained, totally rained out. The game was delayed, and then we had to prolong the stream for a significant amount of time. Those of you thinking, well, what's the big deal about this? Well, the peak workload that comes with that had to run much longer than anticipated, and we have to think about those scenarios. What if that peak workload window gets extended? Do we need to test for it? Do we need to prepare for it? Other things that can impact you: team makeup, what happens with player health. That's real time and can also impact how the audience shows up. Absolutely unpredictable.

And the last thing is, we're all living in this cultural moment where a certain celebrity that a lot of us love happens to be a fan of the Kansas City Chiefs. And with her there is new energy and new attention coming in, enjoyed by NFL fans and the NFL. How do you predict that in advance, right at the beginning of the year?

Most of you are probably thinking: well, how do I find those unpredictable things? Well, they very often show up in the newspaper, or maybe your news feed, however you get your news. Very often there are also those thoughts in the back of my mind: what if this happens, what do I do? And my question to you is: what is your playbook for those scenarios? If you don't have one, that's okay; take some of the insights from today, bring them home, and create one with your teams.

And when you think about your playbook, you need to build that winning mindset. When you look at the winning mindset of successful sports teams, there is one thing all of them do: they practice like they play, and they play how they practice. They create muscle memory, they visualize what they're going to do in their games, and therefore they understand how to perform at peak. Olga, talk to us about how Prime Video practices and gets to that peak.

I like what you said: create muscle memory. So the way we create muscle memory: Prime Video has a program where we run automated load tests three times a week in each region where Prime Video runs. We calculate what that peak will be, sometimes on a weekly cadence, sometimes on a monthly one. We find that peak and we run the load test. We call them game days, to make sure we have certainty: testing our auto scaling rules, testing our auto scaling, and testing the response of our teams.

And for those of you thinking, whoa, this could be a little bit intense: well, this weekly testing for peak trains that operational muscle memory for our teams, keeps the code paths fresh, helps us identify bugs that might only show up at scale, and keeps our teams sharp.

When you're thinking about that winning mindset, you need to have a goal. Can you share with the audience how you were able to gather the troops and create a goal that everyone was behind?

Absolutely, happy to share. So as I said, when we start at the beginning of the year and think about the new season coming up, what do we do? We agree on the fact that availability is feature number one. We take a specific goal on how we want to raise our availability bar for certain services, or maybe across the product. And then we have a tenet that basically says availability is feature number one. This is not just a saying. Imagine yourself sitting in the room with the product leaders and the engineering leaders, where you have a ledger of new features you want to build, and right at the top of that ledger there are availability work streams and availability projects that are funded, have resources, and have the support of leadership.

Before I go any further, let me share with you the definition of availability as we use it at Amazon Prime Video. We've tested it over the years, we've practiced it over the years, and we kind of like it. For us, availability is an emergent quality of distributed systems where all critical components cooperate, collaborate, and protect from risks.

What is interesting in this definition? It's not just software, right? There are leaders and there are teams included in this definition as well. And speaking of leaders, I don't have to go too far. Prime Video's vice president and global head of sports, Jay Marine, talks about availability at every single meeting with his engineering and product teams, and he can very often be heard saying something like this:

"Customers are passionate about sports and we're committed to providing a seamless, high-quality viewing experience." He talks about this at all-hands, and he discusses it during our feature reviews. When you have this leadership support, it's easy to create a structure for programs.

So let me show you how the programs are structured. In the first work stream, we have projects related to operational excellence. We developed, and will share with you today, an approach and a mechanism that we call the operational readiness score. We basically agree with the teams on what goal, what number, they want to take for their operational readiness score this year, and we'll share that algorithm with you in just a moment. The next set of projects is focused on service resiliency.

What does that actually mean? Again, at the beginning of the year, or as the year progresses, teams are thinking: what contingencies do I need to have? What failbacks do I need to have? What levers do I need to have? How do I dial features up and down? How do we test that? All of that gets aggregated into a very specific program we call service resiliency, and we're gonna dig a little deeper into what this program looks like.

The first two work streams are not as successful as they can be without observability and reporting. To keep moving forward year over year and improving your availability posture, you need to invest in observability and reporting. I'm happy to share today some examples of what we've done for observability of services and the connections between endpoints, and I'm gonna highlight some of the reporting we also do for availability.

So when you're thinking about an operational readiness score: Amazon Prime Video did a lot of research into what is critical to build into one. You see four pillars here: deployment safety, code coverage, operational readiness review completion, and correction of error actions.

Now let me show you how this looks at Prime Video. They've built a resilience portal that provides transparency into the health of every service you can imagine that makes up Prime Video. There is a catalog that automatically ingests new services that developers create. You can analyze those services, and, as Olga mentioned with those game days, you get reports on load testing and on the health of those services, and then the operational readiness score, which is very important for the developers but also for the leaders.

We're also gonna look into scenarios and how these are applied at Prime Video, and then how the insights aspect works in regards to observability. Now, the operational readiness score can have 100 points, and these 100 points are comprised of 45 points for deployment safety, 15 points each for code coverage and correction of error actions, and another 25 points for the operational readiness review. There is a reason for the heavy weight on deployment safety: when we look across the entire code base, most errors happen during deployments.

The other aspect is that you want to have a safe environment for your developers to innovate, and through the score and the various mechanisms that we provide, we're enabling developers to do so.
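As a rough illustration of how such a score could be assembled, here is a minimal sketch in Python. The pillar weights follow the 45/15/15/25 split described above, but the individual checks, field names, and thresholds are illustrative assumptions, not Prime Video's actual algorithm.

```python
# Hypothetical sketch of an operational readiness score (ORS).
# Weights follow the 45/15/15/25 split described in the talk; the
# individual checks and thresholds are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class ServiceSignals:
    has_auto_rollback: bool        # deployment safety signals
    staged_az_rollout: bool
    min_healthy_percent: int       # e.g. 66
    code_coverage_pct: float       # from CI
    open_coe_actions: int          # unresolved correction-of-error actions
    orr_items_complete: int        # operational readiness review checklist
    orr_items_total: int


def operational_readiness_score(s: ServiceSignals) -> float:
    # Deployment safety: 45 points, split across three illustrative checks.
    deploy = 15 * s.has_auto_rollback + 15 * s.staged_az_rollout \
        + 15 * (s.min_healthy_percent >= 66)
    # Code coverage: 15 points, scaled linearly against an assumed 80% target.
    coverage = 15 * min(s.code_coverage_pct / 80.0, 1.0)
    # Correction of error: 15 points, with points deducted for lingering actions.
    coe = max(0, 15 - 3 * s.open_coe_actions)
    # Operational readiness review: 25 points, proportional to checklist completion.
    orr = 25 * (s.orr_items_complete / s.orr_items_total if s.orr_items_total else 0)
    return deploy + coverage + coe + orr


if __name__ == "__main__":
    svc = ServiceSignals(True, True, 66, 72.0, 1, 18, 20)
    print(f"ORS: {operational_readiness_score(svc):.1f} / 100")
```

The useful property of a formula like this is the one described above: it is a living temperature check, so unresolved COE actions pull the score back down until the team addresses them.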

When you are thinking about deployment safety, we wanted to share with you a few items that you can use as well in your own environments.

For example, implement automatic rollbacks that allow you to get back very quickly to a state where you can operate. And, especially if you are operating in AWS, start deployments in a single availability zone, roll out to the other availability zones, then to the rest of the region, and then do the same in another region.

And then additionally, you want to make sure you always have at least 66% healthy nodes when you're provisioning. That, again, is to ensure you can always serve your customers.
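To make those ideas concrete, here is a minimal, hypothetical sketch of a wave-based rollout: deploy to one availability zone, bake, expand, and roll back automatically if healthy capacity drops below 66%. The deploy, health-check, and rollback calls are placeholders for whatever deployment tooling you actually use, not any specific Prime Video or AWS mechanism.

```python
# Hypothetical sketch of a wave-based rollout with automatic rollback.
# deploy_to(), healthy_node_fraction() and rollback() stand in for your
# real deployment tooling (CodeDeploy, ECS, in-house pipelines, etc.).
import time

MIN_HEALTHY_FRACTION = 0.66          # keep at least 66% healthy nodes
WAVES = [
    ["us-east-1a"],                   # one AZ first
    ["us-east-1b", "us-east-1c"],     # then the rest of the region
    ["eu-west-1a", "eu-west-1b"],     # then repeat in another region
]


def deploy_to(az: str, version: str) -> None:
    print(f"deploying {version} to {az}")          # placeholder


def healthy_node_fraction(az: str) -> float:
    return 1.0                                      # placeholder health check


def rollback(azs: list[str], version: str) -> None:
    print(f"rolling back {version} in {azs}")       # placeholder


def rollout(version: str, bake_seconds: int = 300) -> bool:
    completed: list[str] = []
    for wave in WAVES:
        for az in wave:
            deploy_to(az, version)
            completed.append(az)
        time.sleep(bake_seconds)                    # let alarms and metrics settle
        if any(healthy_node_fraction(az) < MIN_HEALTHY_FRACTION for az in completed):
            rollback(completed, version)            # automatic rollback, no human in the loop
            return False
    return True


if __name__ == "__main__":
    rollout("v2024.01", bake_seconds=0)
```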

Then you can think about code coverage and code reviews. And this is not just about understanding what percentage of your code is covered by tests; Prime Video also looks at failure handling in pipelines and the faults that happen there, at reusability of design patterns that you know work and are resilient, and at dependencies, because they're key when you're thinking about a distributed system.

We then go into the operational readiness reviews. An operational readiness review is like a sports team huddling in the room before a game and creating a plan: this is the goal, and these are the best practices for what we want to do in the game.

Here you're looking at security best practices, the deployment safety we just talked about, how the service predicts and handles scale and load, and then especially event management and incident response, so you're ready if something goes wrong. To be able to do that, of course, they need alarms and metrics to be in place and tuned so the plan can be executed.

And then lastly, for the unpredictable: having availability reports and impact analysis, and chaos engineering, to ensure they are ready for faults and failures that may happen in the system.

Now, all these points, as you've probably seen, are additive. So Olga, when we're thinking about the operational readiness score, how do you ensure that actions coming out of, for example, faults are actually implemented?

Absolutely. So we have a process that we call correction of error. Many of you know it by its industry-standard name, root cause analysis; those of you following the AWS blog during incidents have read about it. This is your traditional RCA. It's very structured: it has customer impact, a timeline of what happened, the five whys, the lessons we've learned, and, fundamentally, actions.

And very often, when I work with the teams, I advise them to focus really on the actions; actions are the key to prevention. In the world of sports, in your world, it doesn't matter which industry you're in: maybe it's news, maybe it's healthcare, maybe it's finance.

You know, there is plenty of work, and there are all these conversations about trade-offs: do we hurry up and finish some new features we want to release, or do we need to focus and strengthen some of the learnings we have from past COEs?

So the mechanism, the COE review and correction of error, helps us basically say: hey, if you have these lingering actions from COEs that are not getting addressed, we're gonna take points away. Which means the operational readiness score is not a once-and-done thing; it's a living, breathing thing, almost like a temperature check.

At any given point in time, you can take a look at the operational readiness score for your services and make reasonable assertions and assumptions about what's going on. Are we trending up or trending down, and what are the reasons?

So it becomes a mechanism for leaders to keep a check on the resilience and availability of their services.

So we talked about defining the goal, rallying the troops, and then how you can ensure your teams are performing as expected through an operational readiness score. But you may be thinking: is the service I have really worth the sweat? The way you can decide that, when you're thinking about your services, is: is the component you're working on customer impacting?

Does it serve many upstream or downstream dependencies? If that's the case, you know that the investment you make in operational readiness scores is well worth it.

So to make it easier for you to think about the entire resilience journey, we created a mechanism for you called TEAM, because resilience is a team sport. And the reason we did it, Lauren and I, is because, look, there will be tons of conversations and tons of information coming your way, right?

How do you create a structure that's easy to connect with? So in our discussions, as we were preparing for this session, we basically said: the world is actually really simple if you abstract a little bit, right? You can have two slices. One slice is things that are predictable: forecasts, maybe customer behavior, what customers will do on a certain day. You can predict this.

Yeah, there will be variance, maybe a degree of risk, but it's predictable. And then there are unpredictable events. For predictable events, predictable inputs as we call them at Amazon, we train; we train our teams for predictable inputs. For unpredictable inputs, we experiment. It's kind of hard to train for unpredictability, but you can experiment with it and you can learn from it.

We collect the data, we aggregate the data for both cases, we analyze it, we review it, we take the learnings, we put them into our operational memory, and we repeat the cycle again. So let's start with the training. I want to start there because it's area number one. So what do I mean? Right, you can ask me.

So let's talk about what I mean when I say forecast. For those of you in streaming media, the key metrics that define what's going on for peak workload are actually pretty simple. The first is concurrent streams: how many customers we're gonna have at peak, right? The second metric is stream starts. The saying that everybody shows up at the last minute is so true.

Five minutes before the game, we've noticed the customer behavior is pretty much the same; it doesn't really matter, EPL, English Premier League, or TNF: we always see this spike in subscriptions happening, right? So the stream starts metric helps us make assertions about how fast customers will be joining.

The other thing you can predict is: hey, maybe it's a tie, maybe it's Messi playing, maybe it's the English Premier League, right? And there is some moment when everybody is gonna come in. What is the peak TPS that may happen on the services for stream starts? And the third metric is gonna depend on your business.

This is your entry point, where the customers are coming in full force in one place. For us, it's detail pages, where customers are landing and are ready to watch a certain show. I'm sure that all of you, if you think and brainstorm a little bit, can find and identify those key metrics that define your business, right?

What we also learned over the years is that all the services and endpoints we have scale with one of these metrics, which basically means we can develop algorithms based on ML and AI that help us predict these three main business drivers, and apply similar algorithms to the service-level metrics.

So we have forecasting at all the individual endpoints for the services and the customer journeys involved in supporting sports events and supporting our product. Last but not least, because it's streaming media, it's not just services: there is obviously throughput we need to think through, and that goes to CDNs and ISPs.

Ok. So the question is: now that we have all of these inputs, what do we do with them? And this is where the fun begins. If we know the TPS for your endpoint on a weekly basis, and if we know what we forecasted at peak for the event, you can see that you can easily create load tests for each individual endpoint.
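As a rough illustration of that arithmetic, here is a minimal sketch that scales each endpoint's observed weekly-peak TPS by the forecast growth factor and produces a ramp schedule for the load test. The endpoint names and numbers are illustrative assumptions, not Prime Video's actual forecasts.

```python
# Hypothetical sketch: derive per-endpoint load test targets from a
# business-level peak forecast, by scaling each endpoint's observed
# weekly-peak TPS by the forecast growth factor. Numbers are illustrative.
FORECAST_PEAK_CONCURRENT_STREAMS = 12_000_000
OBSERVED_WEEKLY_PEAK_STREAMS = 8_000_000

observed_weekly_peak_tps = {          # per-endpoint peaks from normal traffic
    "GetDetailPage": 40_000,
    "StartStream": 25_000,
    "CreateSubscription": 3_000,
}


def load_test_targets() -> dict[str, int]:
    growth = FORECAST_PEAK_CONCURRENT_STREAMS / OBSERVED_WEEKLY_PEAK_STREAMS
    return {ep: int(tps * growth) for ep, tps in observed_weekly_peak_tps.items()}


def ramp_schedule(target_tps: int, steps: int = 5) -> list[int]:
    # Ramp in even steps so auto scaling and alarms can be observed at each level.
    return [int(target_tps * (i + 1) / steps) for i in range(steps)]


if __name__ == "__main__":
    for endpoint, target in load_test_targets().items():
        print(endpoint, ramp_schedule(target))
```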

What we do at Prime Video: we created a game day tool, we lovingly call it WA. What it allows us to do is assemble multiple configurations, so that we have one game day that comes pretty close to what we see with production traffic and the peak workload footprint.

This is where we can model load for subscriptions, for playback services, maybe for advertising, whatever scenarios we can think of that will have different ramp-ups during the game. All of them run in parallel with this approach.

And the reason I want to stop on this, and it's super important, is that this is where global versus local comes into play. Our teams, our two-pizza teams, own the individual configurations and load generation for their endpoints. A central team comes in and creates the load profile that matches the game or the sport we will be running and preparing for.

So I'm a little bit confused. In the beginning you had this slide where we showed that your teams run a game day three times a week. When we run a game day at AWS, we have a room full of people and we sketch out everything. How are you able to reconcile that amount of work with your teams?

Gone are the days when we have engineers in the room. And yes, I was a technical program manager sitting with engineers in the room in the middle of the night running those tests. We've automated it. Right now, with this configuration tool, it's just a matter of a few clicks and one person who schedules the game day and fundamentally watches over the run.

And it is true, for those of you who are following chaos engineering: yes, Prime Video experiments and runs these tests in production, which basically means we have the technology to stop a game day or stop an experiment at any point where we see a sign of trouble. A lot of it is automated, but there is also a game day leader who can do the same thing when it's needed.

So it's the click of a button and it's one person. When you're thinking about it and you're running this automatically, how are you confident that nothing is gonna go wrong?

Ok. So this is where we said the world is kind of simple, right? We can predict what the TPS per individual endpoint is, and we can watch the monitors and alarms and have a high degree of confidence that things are going well. And if they are not, we stop, we pause, we learn, and we do the experiment again.

So in essence, let's take a look at the playbook for the game day, specifically at Prime Video. When we say load testing, this is our game day. We start with a forecast. Next, we develop different scenarios; all sports are different. Thursday Night Football is mostly enjoyed in the living room.

However, if you take Japan boxing, especially with the timing window it fell into, it's mobile, because customers were traveling at that time. Or if you take New Zealand cricket, for example, again, it's mobile first. So you create the scenarios that are specific to the launch you are preparing for. Teams update their load generators; they maintain them at the local level.

We make sure we've instrumented observability and we keep eyes on the health of the services. And the last piece doesn't get enough attention, so I want to stress it a little more: when the game day is run, there are reports, there are findings, there are tickets issued for the teams. That's why we run those game days: you can take those fresh findings, the teams put them in the technical debt tracker, and they go into your sprint right away.

So as the team does sprint planning, remember when I said availability is feature number one: they look at all of the issues in their sprint, and the availability items are always on top. For them it becomes a little bit easier to prioritize and focus on fixing the findings from the game days.

So we looked at starting the team off with a goal and an operational readiness score, and at the training: we now know what the predictable is. Let's look into what Prime Video is doing about the unpredictable, and experiments.

When you think about putting resilience to the test, there are various ways you can do that. You can run ad hoc experiments, as you usually see them in game days, but you should also run them in the pipeline, because if you have them in the pipeline, you can block the pipeline when an injected fault uncovers a problem, so you can quickly roll back instead of provisioning into production and impacting your customers.

You can also run them in a scheduled way, the same way Olga described at Prime Video. Now, when you think about the unpredictable and putting resilience to the test, it's not just about the tech; it's about the people, and it's about the processes.

Think about the last time your core database admin, for example, went on vacation, and the system went down and nobody could log in. Probably every one of us has seen that before.

Now, when you think about chaos engineering experiments, they're always associated with risk. And you need to understand which ones are low risk, which ones are high risk, which ones are medium risk, especially if you're starting out.

When it comes to chaos engineering experiments, you can start with, for example, a CPU or memory hog. You can run some pause-I/O experiments to understand how your database, for example, reacts to paused I/O. Or you can do simple things like restarting EC2 instances that are behind an auto scaling group, because you know they should come back up. The same thing with Kubernetes or with ECS: when you kill a pod or a task, it should come back up.
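A minimal sketch of one such low-risk experiment, using AWS Fault Injection Service via boto3: stop one instance behind an Auto Scaling group (selected here by its ASG tag) and let it restart after five minutes, with a CloudWatch alarm as a guardrail. The role ARN, alarm ARN, and ASG name are placeholders you would replace; this is not Prime Video's tooling.

```python
# Hypothetical sketch: a low-risk AWS FIS experiment that stops one EC2
# instance behind an Auto Scaling group and restarts it after 5 minutes.
# The role ARN, alarm ARN and tag values are placeholders, not real resources.
import uuid
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Low risk: stop one instance behind the ASG; it should recover",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",   # placeholder
    stopConditions=[{
        # Guardrail: halt the experiment if the service availability alarm fires.
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:AvailabilityBelowSLO",
    }],
    targets={
        "OneInstance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"aws:autoscaling:groupName": "playback-service-asg"},  # assumed ASG
            "selectionMode": "COUNT(1)",
        }
    },
    actions={
        "StopOne": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "OneInstance"},
        }
    },
)

# Kick it off; FIS stops automatically if the alarm-based stop condition triggers.
experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=template["experimentTemplate"]["id"],
)
print(experiment["experiment"]["id"], experiment["experiment"]["state"]["status"])
```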

You can then move on to medium-risk experiments, and they are very interesting, because many times we say, yeah, infrastructure failure, that's easy. But if you start injecting latency, you're going to learn a lot about your dependencies and how they react to it. So inject latency into downstream services, for example, increase error rates, and see what happens.

Now, one thing we thought is interesting to share is how Amazon Prime Video looks at the medium-risk items, because they are not just running them ad hoc but also in the pipeline.

Absolutely. And I must admit this is one of my favorite slides, because it's super actionable. So if you're taking a picture of a slide, that's probably the one to snap. Why? Because for the low-risk items, engineering teams at the local level can get started at any given point in time. You don't have to have a central program or central planning. They can get into the routine of experimenting and testing independently.

What we also do at Prime Video: on some of the game days we run with load testing, we will run those low-risk experiments in production as well, for the sake of learning.

Okay, the middle bucket, medium risk: I came to appreciate it with time and I absolutely love it. I highly recommend working with your teams to put these in your pipelines. Why? Because very often, when you walk the hallways or listen to the Slack chatter, how many times do you see a conversation like: "Hey, I just committed this function and it adds 50 milliseconds. Do we need to renegotiate the SLA?" And sometimes you see that back and forth. It doesn't really matter which company, be it storage, be it media: we live in a distributed world, right? Our services are integrated and dependent on each other. If you have those tests in the pipeline, the negotiations are just done. You run the tests, starting with 50, 75, 500 milliseconds, whatever that is, and you have confidence that if something changed in the contract of your dependencies upstream or downstream, it didn't break the chain. That's why I love that middle part.
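Here is a minimal sketch of that kind of pipeline test: inject increasing delay into a downstream dependency call and assert that the caller still meets its latency budget. The function names, the 800 ms budget, and the simulated dependency are illustrative assumptions standing in for your own service and SLA.

```python
# Hypothetical sketch of a pipeline latency test: inject increasing delay
# into a downstream dependency call and assert the caller still meets its
# latency budget. call_detail_page() and the budget are illustrative
# stand-ins for your own service and SLA.
import time

LATENCY_BUDGET_MS = 800             # assumed end-to-end SLA for the endpoint
INJECTED_DELAYS_MS = [50, 75, 500]  # the "renegotiation" cases, pre-decided in the pipeline


def call_downstream(delay_ms: int) -> dict:
    time.sleep(delay_ms / 1000)     # simulated dependency latency
    return {"ok": True}


def call_detail_page(injected_delay_ms: int) -> float:
    start = time.perf_counter()
    call_downstream(injected_delay_ms)   # dependency under test
    # ... the rest of the endpoint's work would happen here ...
    return (time.perf_counter() - start) * 1000


def test_latency_budget_under_dependency_slowdown():
    for delay in INJECTED_DELAYS_MS:
        observed = call_detail_page(delay)
        # Fail the pipeline if a contract change upstream or downstream breaks the chain.
        assert observed <= LATENCY_BUDGET_MS, (
            f"{delay}ms injected latency pushed the endpoint to {observed:.0f}ms"
        )


if __name__ == "__main__":
    test_latency_budget_under_dependency_slowdown()
    print("latency budget holds for", INJECTED_DELAYS_MS)
```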

And then high risk. This is an example where the global work comes together. Yes, this is when we will be in the same room with technical program managers and engineers, and we run experiments on how quickly we can shift workload between regions, for example. That is a high-risk experiment we plan at the global, central-team level.

When you're thinking about enabling our customers in AWS to understand how workloads react to faults in the cloud, you have the Fault Injection Service. And for those high-risk experiments you just saw, we just today released new faults for availability zone power interruption and also Region isolation, which you can use if you have multi-Region services and you want to test disaster recovery, isolating a Region, and how those workloads function. Fantastic news, can't wait to start using it!

So additionally, when you're looking at Fault Injection Service: over the last two and a half years, and this year, we released quite a few new faults. If you're using EKS, for example, you can now inject faults like latency, CPU, and memory directly into the pods. With ECS, you have very similar faults. We enable you to inject faults into EBS, thinking about I/O, and we've added a lot of network faults that you can go and run.

Keep in mind, these are not just simulations, these are real world faults and you need to understand how your workloads or applications react to those faults. So you have the power to go and execute them.

Let's look into how Prime Video runs chaos engineering experiments. You saw the Resilience Portal at the very beginning. When a team wants to run chaos engineering experiments, they first go into the tool Olga talked about for the forecasts. When you think about an experiment, you want to have a steady state, so we start increasing the load based on the forecast the team has. We then define the experiments. You see two specific tools there: Fault Injection Service and SSM Chaos Runner. SSM Chaos Runner predates Fault Injection Service and also enables Prime Video to kick experiments off from a pipeline, automatically, with a library.

Now, once that experiment is executed, we of course want to do it in a safe way. There are already metrics and alarms defined in CloudWatch, so that if the experiment were to go out of bounds, it would automatically stop. And all the logs and traces you have for observability flow into the Resilience Portal, into the insights, so teams can act and understand how to improve the system.

So let me show you how Prime Video does that in the pipeline. When you're thinking about chaos engineering, we see with many customers that they build an interface across their chaos engineering tooling so they can expand the set of faults they can create. In this case, you see a library for SSM Chaos Runner that can run in the pipeline and call Fault Injection Service. You see the action we're gonna execute here, an I/O stress action. You have the tags you can define to control your scope of impact, as well as the duration and the stop condition, to keep experiments safe.
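The SSM Chaos Runner library itself is internal, so here is a minimal sketch of the pipeline-gating idea using boto3 against FIS directly. It assumes an experiment template already exists (for example one with an I/O stress or pause-volume-IO action, scoped by resource tags, with a duration and an alarm-based stop condition); the template ID is a placeholder.

```python
# Hypothetical sketch of gating a deployment pipeline on a chaos experiment.
# Uses boto3 against FIS directly and assumes an experiment template
# (e.g. an I/O stress action scoped by tags, with a duration and an
# alarm-based stop condition) already exists.
import time
import uuid
import boto3

fis = boto3.client("fis")

TEMPLATE_ID = "EXT1a2b3c4d5e6f7"        # placeholder experiment template id


def run_experiment_and_gate(template_id: str, poll_seconds: int = 15) -> None:
    exp = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )["experiment"]

    while True:
        state = fis.get_experiment(id=exp["id"])["experiment"]["state"]
        status = state["status"]
        if status in ("completed", "stopped", "failed", "cancelled"):
            break
        time.sleep(poll_seconds)

    # "stopped" means a stop condition (e.g. a CloudWatch alarm guardrail) fired;
    # treat anything other than a clean completion as a blocking pipeline failure.
    if status != "completed":
        raise RuntimeError(f"Chaos experiment {exp['id']} ended as {status}: {state.get('reason')}")
    print(f"experiment {exp['id']} completed; pipeline may proceed")


if __name__ == "__main__":
    run_experiment_and_gate(TEMPLATE_ID)
```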

And that brings me to the dress rehearsals at Prime Video, which are the game days and chaos engineering experiments we execute. This is another place where we have a lot of fun. As you can imagine, there are preseason games that the NFL runs, sort of like warming up the players for the season. Prime Video gets access to these preseason streams, and this is where we test our features. Usually it's a beta room, and we call it the beta program. It's done in prod, and yes, teams could be virtual or sitting in the same room, but they're highly distributed, so it's a combination of both.

What happens during this beta, or dress rehearsals as we call them: we don't tell the teams what we're gonna run. They don't know; they just know there are some experiments we're gonna do. And I'm gonna share with you my favorite one.

So not all of it has to be technical. Some of it, you can imagine: teams are testing the features and somebody does an error injection, and you have to figure out whether it's real or not. But one of my favorites this season was the bad QR code. We declared that we had a bad QR code that needed to come down. We started the work to figure out where it was, find it, and pull it off. We measured our mean time to detection, we measured mean time to resolution. Our engineering teams were pulling levers to take that QR code out of the system right there, on the fly. Our press release and marketing teams absolutely loved it, because they were not aware we were gonna do this, so they had to figure out what they would do and what their playbook would be, the same as the production team, who were a little bit caught by surprise and had to figure out on the fly what process they needed to follow. We ran it as an incident, and then we took all of the findings and did a retrospective on it. In many other teams and businesses this is called a tabletop exercise, a simulation. Super powerful. Every single time we run it, we get really great feedback from both our production teams and our engineering teams.

So let me dig a little deeper into what else we do, just to show you an example. The thing we've decided at Prime Video is that every feature for sports is gonna have a feature flag, which basically means we can pull it off super quickly. Why would we want that? Well, we have a feature like Rapid Recap, for example, one of the favorite features our customers love. If we're not happy with the highlights of the game that have been selected for Rapid Recap, we can take it off, adjust, and bring it back again. That's one of the examples; you can do it basically in real time.
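Here is a minimal sketch of that kill-switch idea: every feature is checked against a flag that an operator can flip centrally. The in-memory dict stands in for whatever configuration service you use (AWS AppConfig, Parameter Store, or an internal system); the feature names are hypothetical.

```python
# Hypothetical sketch of per-feature kill switches. The in-memory dict stands
# in for a real configuration service (AWS AppConfig, Parameter Store, etc.)
# so a feature such as a recap module can be pulled and restored in real time.
FEATURE_FLAGS = {
    "rapid_recap": True,
    "key_plays_overlay": True,
}


def is_enabled(feature: str) -> bool:
    # In production this would be a cached read from the config service,
    # refreshed every few seconds so operators' changes take effect quickly.
    return FEATURE_FLAGS.get(feature, False)


def render_detail_page() -> list[str]:
    sections = ["hero", "watch_button"]
    if is_enabled("rapid_recap"):
        sections.append("rapid_recap")   # degrade gracefully if the flag is off
    return sections


if __name__ == "__main__":
    print(render_detail_page())
    FEATURE_FLAGS["rapid_recap"] = False  # operator pulls the feature
    print(render_detail_page())
```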

So during the dress rehearsals, as we prepare, we've been testing all the new features created for this season: how long it takes us from the moment we declare something is off to fixing it and putting it back up. I also love the fact that when you do these unannounced experiments, our alarms get refreshed. I see somebody smiling; you've probably experienced something like that, and it's absolutely true.

Our engineering teams start to look at their alarms and say, "Hey, we need to make sure we're tuned, that we have precision, and that we can separate the noise from the real thing."

Retrospectives: we run retrospectives for these experiments, and this is also good because the debates very often are, "Hey, do we need to automate this? Do we go with what we have? What else can we do?" This is real time in sports; there is no room to lose any single moment of the game. That's the reason we practice these things.

I'll talk about one more thing. Sometimes weather comes into those discussions; it's super powerful but not obvious. We have tests where we also test Chime going down, Slack going down. What if there is a power outage? We have hybrid teams sitting in different locations, and, well, it is not predictable. What if people in London cannot get to the Tube? What happens if there is a power outage in Seattle? So you work through those scenarios, as we do, have redundancy built in, and practice it as well.

So, from the start, we looked at the definition of goals and how you can track them with the operational readiness score. We looked at the predictable, in regards to training, and at the unpredictable. Then you take the learnings from these and you analyze them. Let's go into how you think about the analysis of the various learnings. What's the process for that?

So the mechanism we have, like I said, is correction of errors, and it's super structured. One thing I sometimes hear asked is: is it just engineers who do it, or do managers need to sit down with us as well? It's a huddle; it is the engineering teams involved in the incident or the near incident. Sometimes we will do a COE when we didn't impact customers, but it came very, very close, and there is something significant to learn from it, right? And the managers and senior leaders are also in the room, reading the document and discussing it. Why? Because, fundamentally, as many of you have probably experienced, when it comes to the launch, this is game time for decisions. What do you prioritize? Do you hold certain features and work more on operational excellence? We have done that over the years, where we said: actually, here's what we've learned, we need to strengthen X, Y, and Z, and we're gonna release certain features mid-season instead.

I'm gonna show you, as promised, what we've done for observability, because this is super powerful as well. We've created a service graph that shows us the connections between all of the different services, and the cool thing about it: a link gets highlighted in red if we determine there is an anomaly in the log files connecting those two services, or if we have sev-2 alarms going off. So it's real time. It's super powerful.

I know that many customers, and many of you, are probably doing this already. The thing I want to highlight, which I thought was super clever of our engineering teams: the service graph can be super unruly and become this big glob, so we created views by customer journey. We can have a view on subscriptions, a view on playback, a view specific to detail pages. Being able to navigate through those views in real time, and having real-time visibility when things are in trouble and they light up, is super powerful.

And another thing we've done, which we debated a while back whether to do or not, and over the years we found it was a great investment: we have a real-time availability dashboard. We selected a few key services within each customer journey, and it shows everybody, in near real time, what the availability for those endpoints is. There are no questions, there is no guessing; it's just exposed and everybody has their eyes on it. Very often I hear a question, when I speak with customers and with many of you, even with our own engineers: what is the ROI? Prove to me that chaos engineering works.

So I'm sharing with you this graph of incident severity. It's a real graph from last year; we learned a lot last year. What it shows is how we started the season: we had some incidents we needed to work on, we applied the processes I've been describing, the game days, the drills, making sure we do those chaos injections, and we did much better by the end of the season. When this happens, you can see that mean time between failures improves and becomes much larger, while mean time to detection and mean time to restoration shrink. That's exactly what you want.
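To make those three metrics concrete, here is a small sketch that derives mean time to detect, mean time to restore, and mean time between failures from incident records. The field names and sample timestamps are illustrative, not data from the graph.

```python
# Hypothetical sketch: compute mean time to detect (MTTD), mean time to
# restore (MTTR) and mean time between failures (MTBF) from incident records.
# Field names and sample timestamps are illustrative only.
from datetime import datetime
from statistics import mean

incidents = [
    {"start": datetime(2023, 9, 14, 17, 2), "detected": datetime(2023, 9, 14, 17, 10),
     "restored": datetime(2023, 9, 14, 17, 41)},
    {"start": datetime(2023, 11, 24, 13, 30), "detected": datetime(2023, 11, 24, 13, 33),
     "restored": datetime(2023, 11, 24, 13, 45)},
]

mttd = mean((i["detected"] - i["start"]).total_seconds() / 60 for i in incidents)
mttr = mean((i["restored"] - i["start"]).total_seconds() / 60 for i in incidents)
# MTBF: average gap between consecutive incident starts (needs at least two incidents).
gaps = [(b["start"] - a["start"]).total_seconds() / 3600
        for a, b in zip(incidents, incidents[1:])]
mtbf = mean(gaps)

print(f"MTTD {mttd:.1f} min, MTTR {mttr:.1f} min, MTBF {mtbf:.0f} h")
```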

So we have conviction that availability is feature number one, and we invest in it on an ongoing basis because it gets us results. Exactly. And when you're thinking about availability and resilience, it's not possible without observability.

Next, we're going to go into the memorize phase. And when you're thinking about the memorize phase, a lot of it goes back to the roots we covered in the beginning: you need that objective for your team. What is it you want to improve? Once you know that, you can start creating a plan and visualize where you want to go. You can then define a strategy around the predictable patterns you usually see in your customers' traffic, and get more comfortable running these various game days, as Prime Video does, in an automated way: pre-warming your systems, understanding how they behave under load, injecting predictable faults where you know they should be able to recover, based on auto scaling groups, for example. Then move forward to experiments, where you begin to challenge the teams a little bit, define a clear hypothesis about how the system should behave, and verify that your hypothesis actually holds.

Game days, or dress rehearsals as we have them at Prime Video. And then, very importantly, when you're thinking about memorization, it's understanding your graphs. When you think about observability and understanding what happens, you always start at your steady state: that's the transactions per second, that's the stream starts per second you need to see. If you see impact, something is going down, you can drill in, understand, and analyze what you can do better. That's the fine-tuning piece. And if you do that over and over again, you build the operational muscle memory to execute at that peak.

Additionally, not to forget, are the runbooks and playbooks. You want to ensure that when you run and memorize these various steps, your team members know what their role is and how to execute. If there is an incident, you don't want them to go search the wiki for the page they need; they all need to know in their sleep how to execute it. And here you see various buttons, for the call leader, for example, for the subject matter experts and other roles, so they know right away what they need to do. And with that, Olga, go get that touchdown, get us there.

So, I've practiced this a lot. Let me try it: this is how you get to the Super Bowl. All right.

We've used the word memorization a few times. Why do we keep insisting on it? Because at Prime Video, and especially in Prime Video sports, we're in a business where each second matters; you cannot lose a single second, right? Fundamentally, this means you don't have time to think and debate. You have to pre-cache decisions. All of the logic of your operational response needs to be in your muscle memory, as Lauren, a former tennis player, keeps saying.

So it is part of the muscle memory. And in summary, how do you get to that muscle memory, for those of you who might be in finance or healthcare or other industries where real-time response is super important? Here's the playbook: think about what's predictable, think about what's not. Train for the inputs that are predictable; experiment with, and prove out, the things that are not. Develop your scenarios.

For predictable things, create your own mechanism. We shared an operational readiness score with you; it doesn't have to be 100 points, it needs to work for your workload and your organization. You can use the logic here to develop that mechanism in the way you see fit; we've been happy with it. Develop your game days, be it workload testing, be it chaos testing, maybe a combination of the two. Remember that slide with the three buckets? Super powerful, so develop those systems. The dress rehearsals don't have to be just tech; it's all the things in the back of your mind that are nagging you: what if this happened? Play it out. What are you gonna do? Bring the insights back, update your runbooks and playbooks, and all of that is gonna get you to proactive reliability.

So with that: go build your playbook. We showed you in this session how to assemble your teams with that winning mindset, and how to look at it by defining a goal for your teams so you can rally the troops. We've seen how you can adapt with game days, for example, running forecasts and load tests against your systems multiple times a week to understand how they behave, and how you should think about experiments and run them continuously within your environments. Go and define how you will approach blameless learning and the analysis you're gonna do, so your teams can learn in a comfortable way and grow. And then memorize: train them, help them understand how the systems work, and give them environments where they have the freedom to do so.

And all of these steps can be summarized again in one sentence. Absolutely. All you have to remember is just one thing: practice like you play, and play how you practice. What a wonderful thing we got from sports, and it's highly applicable to our engineering teams.

Thank you for joining us today and we will be happy to take your questions.
