Good afternoon. Thanks everybody for coming out. I want to start off with a quick question:
How many of you in the room are in some way responsible for developing, managing, or maintaining an application for your organization where, if that app goes down, it's going to impact your organization and cause problems?
(Pause for show of hands)
Alright, cool. Now, all of you who put your hands up, how many of you feel that you've really got all the actions you need for your resilience on lockdown? You know, where they fit in your life cycle and what you need to work on.
(Pause for show of hands)
Ok, I see very few hands. That's good. That means you're in the right talk.
So I'm Clark Richey. I'm a Principal Technologist here at AWS and I'm going to be joined today by Stacy Brown and Yoni from Vanguard. And we're going to talk about the AWS Resilience Life Cycle, which hopefully is going to help you with that problem that we just talked about.
Resilience equals revenue. That's a really bold statement that Gartner made, but it's true. You might have the coolest application ever deployed, but if it goes down, it doesn't matter; that's going to cost the company money.
IDC estimates that the Fortune 1000 companies lose between $1.5 and $2.5 billion, with a B, in revenue every year to unplanned downtime. That's a chunk of change.
And beyond that, there are other costs that are harder to measure. They're intangible costs: impacts to the reputation of the business. Once that application has gone down and you've broken trust with your customers, that can be very, very difficult to repair, and the effects can last a long time.
So if you've ever talked to AWS before, you've probably seen a picture that looks a lot like this, we're big into shared responsibility at AWS. And so of course, we have a shared responsibility model for resilience and what that essentially boils down to are two fundamental things:
We at AWS are responsible for the resilience of the cloud. What does that mean? Well, it means that we're responsible for making sure that all of the services and infrastructure that you need to build your applications in the cloud are resilient. So those are things like our AWS Regions and Availability Zones, even Outposts and edge locations, all running on our global infrastructure: networking, servers, storage, even the AWS services that we provide that you'll use to build your applications. The resiliency of those things is our responsibility.
Now, you as the customer are responsible for resilience in the cloud. What does that mean? Well, it's up to you to make key choices about how you design your architectures, how you manage your networking and your service quotas. How do you do things like deploy your code and manage failures and other events that can happen in production? How do you back up your data in a secure manner? These are all your responsibilities, to make sure that you build in a resilient manner in the cloud.
And of course, today we're going to go through all kinds of different activities that you can do, based on best practices that we've built up over many, many years at AWS, to ensure that you can make the right decisions for your business to build that resilient architecture in the cloud. And even better, Stacy and Yoni are going to come up and tell you how they're doing it at Vanguard, which I know is what you really want to hear.
So moving your applications to the cloud certainly comes with a lot of advantages, right. That's why we do it. You're able to remove a lot of the undifferentiated heavy lifting by leveraging existing AWS services, allowing you to be more agile and develop and innovate more quickly. You're also able to leverage the economies of scale present in the cloud so that you can lower your costs compared to running your own infrastructure. And those are all great things.
But like with everything else in life, there are downsides and challenges. Distributed systems are complicated and challenging. In this day and age, of course, we have standards for everything, but we all know the reality of the situation: there are always variations in the implementations of those standards.
When you've got computers all over the cloud, many of them, maybe thousands of them, ensuring that you have observability, meaning that you're looking at the proper state and can see what's happening in all the critical aspects of your application, can be very, very difficult.
And of course, you're now responsible for multiple components sitting on potentially many, many different machines and you have to manage both the downstream and the upstream impacts of any event that occurs in the system. That's a lot.
This level of complexity makes failure prediction on a distributed system far more difficult. This is an incredibly oversimplified example of a distributed system. It's oversimplified, it's almost not worth talking about. And yet as we look at this, we can still see there are many different ways and areas in which it could fail.
This is just one client sending one message to one server and getting a response. And yet, if the client's not able to run the code properly, or the wrong code is downloaded by the client, that could cause an impairment. The network could fail at any point along the way that we see here. The server could suffer a catastrophic failure, or maybe we just made a configuration change that was incorrect, or we push a new version of our code to that server and it's not operating correctly, and it's not going to process that message and return a response the way we want.
There are so many ways in which this could fail, even with just one message. Imagine now 10,000 messages at once in a much more complicated system that looks like something you'd have in actual production. This is really hard.
There's an excellent paper on this that we've written as well; the QR code that is present here on the screen will get you a direct link to it if you're interested in reading up in much more depth about some of the challenges in distributed systems and why they're so complex.
(Pause)
OK. So this is a question that we at AWS receive from our customers all the time: how do we improve our resilience? And it's hard, we understand that. We're operating at a massive scale, we know this is hard, and that's why we spent time working on developing the AWS Resilience Lifecycle white paper that we're talking about today.
The important thing to understand is that resilience isn't a destination. It's not a one-time thing you just do and then say, great, now I'm resilient. It's a journey, not a destination, so let's talk about this journey.
And before I get into the phases of the life cycle, there are two main concepts I want to talk about briefly: foundational resilience and continuous resilience.
Foundational resilience refers to the key services and guidelines upon which you're going to build your resilient architectures. Those AWS services are things like Aurora Multi-AZ clusters that you can use to build a highly resilient database for your application or perhaps even AWS Resilience Hub which we'll talk about in detail a little bit later on.
There are also fundamental guidelines, most notably in the form of the AWS Well-Architected Framework, which has a Reliability pillar. It takes a whole lot of guidelines that we've developed over the years at AWS on best practices for building resilient systems and how to use our services in conjunction with those, and it makes those available to you.
Now, continuous resilience, on the other hand, is all about your organization's people, processes, and systems that you're going to use to ensure that you have those resilient systems; it's all about your resilience in the cloud. It's how you monitor, manage, observe, and react to your systems as you're operating them to keep them up and running in a resilient way.
There are five phases to the life cycle, as you can see here: Setting Objectives, Designing and Implementing, Respond and Learn, Evaluate and Test, and Operate.
We're going to go through each one of these and talk about them in some detail, what the activities are there, and then we're going to look at what AWS foundational services are available to support you in each phase.
I'm going to touch on a few of the key points in each, and for all of them I highly encourage you to look at the white paper to get more information; we'll have the QR code for you at the end of the presentation.
Now, hopefully this picture looks kind of familiar to you. Maybe not the words in the box. You've probably seen a picture like this. You might even have something like this operating in your organization. It should look like a fairly typical software development life cycle picture and that's very intentional.
We want it to be familiar to you. We need the resilience life cycle to be something that is understandable, familiar and easy for you to integrate into your existing development practices.
So we always want to start this life cycle in the Set Objectives phase. It seems very intuitive and easy: if you're starting with a brand new application, greenfield development, of course you would start by setting objectives. But even if you have existing applications that you're running today and you decide you want to apply these lifecycle concepts, I'm going to strongly encourage you to start in Set Objectives, and we'll talk about why in great detail in just a moment.
But as we go through this life cycle, as we iterate through it continually, we're going to take the findings from the output of each phase and feed them back into Set Objectives so we can continue to refine and evaluate our objectives and make sure we're doing the right things for the business at the moment.
As you can see, there's a whole bunch of activities in each of these lifecycle phases. As I promised, we will get into each one in detail, but all of them are going to provide key outputs during the phase, and those are going to help you identify priorities in terms of resilience for your organization and for your individual applications.
Hopefully you get better operational practices out of it, and you learn about both the state of your existing mitigation capabilities and your ability to learn from mistakes, and hopefully how to get better and improve those things, because it's always a journey of improvement.
A couple of key points: it is an iterative process; we're going to keep going around that life cycle again and again and again. Depending on the size of your teams, you might have different members of the team that are actually in different stages at the same time. Some people might be designing and implementing for the next release while others are operating the existing release, and that's completely normal.
It's also normal that you're not going to have or necessarily even need the same maturity in each phase. That's something you're going to have to evaluate based on your own organizational needs and then grow that over time as necessary.
In fact, you probably won't implement all of the practices that we talk about in the white paper for every application; it's probably just not something you're going to need. So it's about understanding which ones you're going to need, how they fit in with your engineering practices, and most importantly, how they help you meet your business goals for resilience.
Alright, I promised practices at each step, so let's get into that, starting with Set Objectives. To me, this is really the most important phase. This is all about understanding what the business needs in terms of the resilience of your application. If you can't figure that out, there's really no place else to go.
And I get this question from customers all the time. I have conversations that go kind of like this. They say, hey Clark, we need help becoming more resilient. I say great, we can do that. What are your objectives? They say, more. I'm like, well, great, OK, but more how? How much more? What do you need to get to?
OK, fair enough. That's good. We all have to start somewhere. But that's kind of like asking me to come on a road trip with you: alright Clark, I'm gonna drive, I wanna go somewhere, give me directions. And I say, great, where are we going? "Not here."
Well, I could give you directions, but none of us would have any way of knowing if those are the right directions or the wrong directions. We have to know where we're going. It's the same thing with resilience: we have to have objectives, and we have to have concrete objectives.
The metrics that we typically talk about are recovery point objective, RPO, and recovery time objective, RTO.
Recovery point objective refers to how much data you could potentially lose if you have some sort of impairment. Could you lose five minutes' worth of data? Ten minutes? An hour? That's something you've got to determine.
Recovery time objective is essentially how long you could be down if something catastrophic happens: how long can you be down before you start to suffer significant implications for your business, whether it's financial impact, brand impact, whatever that might be?
"Those are the two key metrics. And you're going to hear me talk about those over and over again today.
Now, what a lot of our customers do, particularly large customers that have hundreds or thousands of applications, where it becomes too unwieldy to try to set RPO and RTO for each and every application, is create tiers. Maybe you call them gold, silver, and platinum, or 1, 2, 3; it doesn't really matter. But the idea is often setting several different tiers with well-defined metrics - RPO, RTO, maybe even service level objectives or SLAs (service level agreements) for your customers if that's appropriate - and then bucketing your applications into one of those three or four tiers. That just makes it easier to manage your organization's resilience instead of having potentially different metrics for every single application.
So if you have a lot of applications, consider that kind of prioritization and tiering approach internally.
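To make that tiering idea concrete, here's a minimal sketch in Python. The tier names, RPO/RTO numbers, and application names are all hypothetical, just to show how a tier catalog can replace per-application metrics:

```python
# Hypothetical resilience tiers; the names and targets are illustrative only.
TIERS = {
    "platinum": {"rpo_seconds": 0,    "rto_seconds": 60,    "slo_availability": 0.9999},
    "gold":     {"rpo_seconds": 300,  "rto_seconds": 900,   "slo_availability": 0.999},
    "silver":   {"rpo_seconds": 3600, "rto_seconds": 14400, "slo_availability": 0.99},
}

# Bucket each application into a tier instead of tracking bespoke metrics per app.
APPLICATIONS = {
    "payments-api": "platinum",
    "reporting-batch": "silver",
}

def objectives_for(app_name: str) -> dict:
    """Return the RPO/RTO/SLO targets an application inherits from its tier."""
    return TIERS[APPLICATIONS[app_name]]

print(objectives_for("payments-api"))  # {'rpo_seconds': 0, 'rto_seconds': 60, ...}
```

The point of the exercise isn't the code, it's that every application gets a target it inherits from a small, well-understood set of tiers.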
Now you're going to see this slide or basically a variation on it a couple of times and this is where I'm going to kind of point out those foundational services that are applicable and available to you for each stage.
Conspicuously, there are none in Set Objectives. And that's because Set Objectives, again, is a business operation. That's about really understanding and going through the process of analyzing what the impact to the business is if this application fails in a certain way. What about on the data side? How much time can it be down? What does that impact? That's an internal business exercise; there's no magic service to help you do that. It's all about your internal workings. Now, on to Design and Implement, where there are lots of activities.
Right. So we've got objectives, and now we're going to design and implement our system in order to try to meet those resilience objectives, as well as, of course, all the other objectives that the application is supposed to fulfill. There are lots of key decisions to make here in terms of your architectures and fault isolation boundaries. Do you need to deploy your application using AWS zonal services like EC2, for example? And if so, are you resilient enough in a single Availability Zone, or do you need more high availability and you want to deploy across potentially multiple Availability Zones? Do you really, really need crazy high availability, and maybe you need to be in more than one Region at a time? Those are decisions that you have to make; there are tradeoffs to all of these things.
What foundational services and guidelines are you going to use in building this? Are you going to use that Aurora Multi-AZ cluster, for example, or are you going to use something different? What are the impacts going to be in terms of cost and resilience and engineering? How are you going to be deploying your code and running your integrations? What choices do you want to make in logging? Are you using CloudWatch or something else? How are you formatting those logs so that you can read them easily later on, in case you have an event? All of these things have to be well thought out in terms of those objectives.
And of course, we're going to come back to them; we're going to learn, we're going to come back, and we may make changes. There are lots of services now in Design and Implement, which makes up for the empty column next to it. I just want to call out a couple of them here.

AWS Resilience Hub, so nice I put it up there twice. This is a great service that will allow you to take your infrastructure, a CloudFormation template or Terraform, or you can even work with existing deployed resources using tagging, for example, and create a resilience policy in the tool where we define, of course, RPO and RTO. Then you'll assign that policy to an application, and Resilience Hub will run an assessment for you: it will look at your infrastructure and determine whether you can meet those resilience goals with your current configuration. And if you can't, it's going to give you tips and remediations to fix that. It'll tell you why you can't get there from here. It will also provide you helpful things like code snippets to build CloudWatch alarms, Fault Injection Simulator templates to build chaos engineering tests, and so on. So a really, really nice tool that can even start helping you early on to figure out: is this infrastructure good enough to meet my goals?
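As a rough, hedged sketch of what that flow can look like from code, here's a boto3 example. It assumes the resiliencehub client's create_resiliency_policy and start_app_assessment calls, an application already registered in Resilience Hub, and placeholder ARNs and tier/policy values; check the current API reference before relying on the exact shapes:

```python
import boto3

# Assumes credentials and region are configured; ARNs below are placeholders.
resiliencehub = boto3.client("resiliencehub")

# Define RPO/RTO targets (in seconds) per disruption type for the policy.
policy = resiliencehub.create_resiliency_policy(
    policyName="tier1-mission-critical",
    tier="MissionCritical",  # one of the tiers Resilience Hub defines; verify the exact value
    policy={
        "AZ":       {"rpoInSecs": 0,   "rtoInSecs": 300},
        "Hardware": {"rpoInSecs": 0,   "rtoInSecs": 300},
        "Software": {"rpoInSecs": 60,  "rtoInSecs": 900},
        "Region":   {"rpoInSecs": 300, "rtoInSecs": 3600},
    },
)
print(policy)

# Kick off an assessment of an application already registered in Resilience Hub.
assessment = resiliencehub.start_app_assessment(
    appArn="arn:aws:resiliencehub:us-east-1:123456789012:app/EXAMPLE",  # placeholder
    appVersion="release",
    assessmentName="pre-deploy-check",
)
print(assessment)
```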
Of course, the AWS Well-Architected Framework and its Reliability pillar are going to provide those foundational guidelines that we talked about earlier to help you implement best practices. And we have things like AWS X-Ray: if you decide that you want to have tracing inside your application in order to monitor key events and key data movement in your application, for example, you can implement X-Ray tracing during your design and implementation phase.
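If you go the X-Ray route, instrumenting code can be as small as this sketch using the aws_xray_sdk Python package. In a web framework or Lambda the segment handling usually comes from middleware; here it's done by hand, the segment and subsegment names are made up, and it assumes an X-Ray daemon is available to receive the trace data:

```python
from aws_xray_sdk.core import xray_recorder

# The capture decorator records this function as a subsegment of the active segment.
@xray_recorder.capture("load_customer_profile")
def load_customer_profile(customer_id: str) -> dict:
    # Real data access would go here; returning a stub keeps the sketch runnable.
    return {"customer_id": customer_id, "tier": "gold"}

# Outside of Lambda or web middleware, open and close the segment manually.
# Traces are emitted to a local X-Ray daemon (assumed to be running).
xray_recorder.begin_segment("checkout-service")
try:
    profile = load_customer_profile("c-123")
finally:
    xray_recorder.end_segment()
```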
The last one I want to call out here is AWS Trusted Advisor, an excellent service available to anyone who has Enterprise Support. What Trusted Advisor does is continually run the thousands of best-practice checks that are in there, evaluating your systems and finding areas where perhaps you're not following guidelines in terms of best practices for security or resilience and so forth. It can then raise those and surface them to your attention, as well as provide remediation tips, so you can fix them, or you can make the conscious decision to accept that risk, take the trade-off, and move on from there.
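You can also pull those checks programmatically rather than through the console. Here's a hedged boto3 sketch: it assumes a support plan that includes the Support API (which lives in us-east-1), and the category string and result fields follow my reading of the API, so verify them before depending on this:

```python
import boto3

# The AWS Support API requires a qualifying support plan and is served from us-east-1.
support = boto3.client("support", region_name="us-east-1")

checks = support.describe_trusted_advisor_checks(language="en")["checks"]
fault_tolerance_checks = [c for c in checks if c["category"] == "fault_tolerance"]

for check in fault_tolerance_checks:
    result = support.describe_trusted_advisor_check_result(
        checkId=check["id"], language="en"
    )["result"]
    # Surface only the checks that are not green so someone can triage them.
    if result["status"] != "ok":
        print(check["name"], result["status"], len(result.get("flaggedResources", [])))
```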
Evaluate and Test. Evaluate and test really happens at two different stages for your application: pre-deployment, before you actually take that code and get it out into production, and then it continues post-deployment.
So pre-deployment, hopefully everyone's doing unit testing and some integration testing; I feel like in 2023 we can say that pretty safely, usually. But there are other forms of testing too, right? Is the application sensitive to performance? Are your customers expecting responses within certain times? If so, you might need to do performance benchmarking to ensure that the combination of the infrastructure you've chosen, your architectural decisions, and your actual code responds well within those time frames. And does it do that under load? We can always do the classic load testing, which is put it out in production and see what happens when it breaks and whether the customers like it. It's not recommended, though. It's not recommended. So you might want to do load testing up front.

Fault injection and game days are chaos engineering concepts. Chaos engineering is the idea that we're going to really test the resilience of our system, and when I say system, I'm talking not just about the technical pieces but the people and the processes, right, that continuous resilience: the people and the processes that we use to respond to events and to manage our system.
So we're going to create a hypothesis about what would happen to our system in the case of a particular type of impairment: maybe a network outage, maybe a server goes down, data corruption, whatever it might be. We're going to anticipate how the system would respond to that, how we would detect the incident, how we would respond, what the system behavior would be like, how we recover from that, and then we run a test. Now, those tests can be tabletop exercises, where maybe you just walk through it; those can be really useful pre-production, while the code is really still just on the whiteboard and hasn't even gotten to the keyboard. Or you can use things like Fault Injection Simulator, which will allow you to actually simulate those outages in your environment and see what actually happens.
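As one hedged example of turning a hypothesis into an experiment, here's a boto3 sketch of a Fault Injection Simulator template that stops one tagged EC2 instance and aborts if a CloudWatch alarm fires. The role ARN, alarm ARN, and tag are placeholders, and the parameter shapes should be checked against the current FIS API before use:

```python
import boto3

fis = boto3.client("fis")

# Hypothesis: if one instance in the fleet stops, the service stays healthy.
template = fis.create_experiment_template(
    clientToken="stop-one-instance-demo",
    description="Stop a single tagged EC2 instance and observe recovery",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    stopConditions=[{
        # Abort the experiment if this alarm goes into ALARM state.
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-errors",
    }],
    targets={
        "web-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"service": "checkout"},  # placeholder tag
            "selectionMode": "COUNT(1)",  # pick exactly one matching instance
        }
    },
    actions={
        "stop-instance": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "web-instances"},
        }
    },
)
print(template["experimentTemplate"]["id"])

# Later, run it with: fis.start_experiment(experimentTemplateId=...)
```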
And then of course, we're going to take what happens from that and we're going to learn from it. What went well? What didn't go well?

And then there's tracing. You may have noticed tracing has now appeared in Evaluate and Test as well as in Design and Implement; that's intentional. There are a number of activities, you'll see some here today and more in the white paper, that appear in multiple phases, and there's good reason for that. When we were in Design and Implement, we talked about making those choices. One of them was: do we need tracing? Do I think I'm going to need that level of detail later on in my application, once an event occurs and I'm managing and monitoring my system? If so, I had to make that choice early on and implement it in my code. If I've done that, now in Evaluate and Test I can use that information to evaluate the existing and current health of my system and make choices and decisions based upon it: is my tracing working well? Is it too much? Is it not enough?

There are a number of services present here again. Trusted Advisor continues to be there. Resilience Hub is really nice here too, because not only will it do all the things I told you about a moment ago, it will do them on an automated basis for you. You can schedule those assessments to run on a recurring basis, you can tie them into your code deployment cycle so they happen whenever you deploy code, and it now enables drift detection.
So what that means is that if something occurs in your environment, whether it's a change to your infrastructure that's part of your normal infrastructure evolution process, through a CD process for example, or someone just goes in and makes a change, when it sees that change it'll rerun an assessment. And if that change has negatively impacted your application's ability to meet the RPO and RTO defined by the resilience policy of that application, it'll start notifying people right away. So a really great feature there. And of course, we see X-Ray appearing here again as well, and we just talked about why that is a moment ago. Fault Injection Simulator, that's the tool for running those chaos engineering tests; it's going to simulate things like network impairments, loss of an EC2 server, things like that. Next up is Operate.
So now we're in production, we're doing stuff. Well, there are certainly lots of things to do while the system is up and operating. Synthetic traffic is a really, really great tool for allowing us to proactively start to detect when we might be headed into a bad place. The idea behind synthetic traffic is that we're essentially creating tests that run on a regular basis in our production environment and simulate user behavior in a safe way. Of course, if you're running a financial institution, we're not going to be taking money out of people's accounts, right? We are going to be hitting key APIs to make sure that, from the end user's point of view, all of those key functionalities, all the way from the front end through to our servers and our data, are reacting and responding in the time and way that our customers expect. And this can very often provide us with the earliest warning when something is starting to go wrong, before the real warning, which is when our customers start calling us.
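On AWS, CloudWatch Synthetics canaries are the managed way to do this, but as a bare-bones stand-in to show the idea, here's an illustrative script. The endpoint, metric namespace, and field check are all made up; it exercises a read-only API the way a user would and publishes the outcome as a custom metric you can alarm on:

```python
import time
import boto3
import requests  # third-party; pip install requests

ENDPOINT = "https://api.example.com/v1/quotes/VTI"  # hypothetical read-only API
cloudwatch = boto3.client("cloudwatch")

def run_synthetic_check() -> None:
    start = time.time()
    try:
        response = requests.get(ENDPOINT, timeout=5)
        ok = response.status_code == 200 and "price" in response.json()
    except requests.RequestException:
        ok = False
    latency_ms = (time.time() - start) * 1000

    # Publish success (1/0) and latency so alarms can catch degradation early.
    cloudwatch.put_metric_data(
        Namespace="Synthetics/Checkout",  # hypothetical namespace
        MetricData=[
            {"MetricName": "CheckSuccess", "Value": 1.0 if ok else 0.0, "Unit": "Count"},
            {"MetricName": "CheckLatency", "Value": latency_ms, "Unit": "Milliseconds"},
        ],
    )

if __name__ == "__main__":
    run_synthetic_check()  # run this on a schedule, e.g. every minute
```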
Speaking of alarming, we need to make sure that everyone understands their role. What happens when an alarm goes off? Did the alarm even go off before someone started calling us and telling us things were down? That's not always the case, unfortunately. Do we have too many alarms? Sometimes we have so many alarms we have almost no ability to act, because there's a million of them, and maybe some of them are set too sensitively and they're going off on what turns out to be normal fluctuations in system behavior, and so we get false alarms. So we need to constantly be evaluating what alarms we have and how we respond to them.
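To make that concrete, here's a hedged boto3 sketch of an alarm on the synthetic-check metric from the sketch above. The thresholds, evaluation periods, and SNS topic ARN are placeholders, and tuning them is exactly the ongoing evaluation being described:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if the synthetic check fails for 3 consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-synthetic-check-failing",
    Namespace="Synthetics/Checkout",   # matches the custom metric above
    MetricName="CheckSuccess",
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",      # no data at all is also a problem
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],  # placeholder
)
```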
Again, load testing can be important here as well, and certainly operational reviews to ensure you're continuing to do the right things in your operational practices. The AWS Health Dashboard is a nice tool that will allow you to see the status of all AWS services in every Region and any events that are reported on them. If you log into your account first and then go to the dashboard, you actually get a nice focused view of only those services that you're actually leveraging, so you can really focus in on those. Additionally, you can use the Health API to programmatically tie into that, if that's the way you want to go, or you could even use EventBridge to get near-real-time notifications when any relevant health events occur.
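Here's what that EventBridge route can look like as a boto3 sketch. It assumes you already have an SNS topic (placeholder ARN) whose access policy allows EventBridge to publish, and it simply forwards every AWS Health event delivered to the account:

```python
import json
import boto3

events = boto3.client("events")

# Match every AWS Health event delivered to this account and region.
events.put_rule(
    Name="aws-health-notifications",
    EventPattern=json.dumps({
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
    }),
    State="ENABLED",
)

# Send matching events to an SNS topic (placeholder ARN) that pages or posts to chat.
# The topic's access policy must allow events.amazonaws.com to publish to it.
events.put_targets(
    Rule="aws-health-notifications",
    Targets=[{"Id": "oncall-sns", "Arn": "arn:aws:sns:us-east-1:123456789012:oncall-page"}],
)
```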
If you're using Amazon CloudWatch for your logs, you can of course just open them up and look for things when things occur; that's a strategy. You can use other tools to read those logs, or you can use CloudWatch Logs Insights for a little more fine-grained searching through those logs, or even send those logs to Amazon OpenSearch Service for a much more robust search experience, if that's what you need.
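If you go the Logs Insights route, the same queries can be run programmatically too. This boto3 sketch, in which the log group name and the error filter are hypothetical, pulls the most recent error lines from the last hour:

```python
import time
import boto3

logs = boto3.client("logs")

# Start an asynchronous Logs Insights query over the last hour of a hypothetical log group.
query = logs.start_query(
    logGroupName="/ecs/checkout-service",  # placeholder log group
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="""
        fields @timestamp, @message
        | filter @message like /ERROR/
        | sort @timestamp desc
        | limit 20
    """,
)

# Poll until the query finishes, then print the matching log lines.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results.get("results", []):
    print({field["field"]: field["value"] for field in row})
```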
Responding and learning. Things are going to happen to our system; there are going to be events. Resilience is not about preventing events from happening, it's about ensuring that we can operate through them. Can we have a system that is not significantly impaired, perhaps not impaired at all, when an event occurs? And how do we respond to those events? We're big fans of automation when it comes to that.

So if you can set up auto-remediation, such that when a particular type of impairment occurs you can identify it and then essentially click a button and automate the steps to repair it, that's a great best practice. It reduces the potential time to actually perform the remediation and removes the potential for a lot of human error within that remediation. There are tradeoffs: it can be hard to implement, and sometimes it might be a little expensive, so we do have to make those tradeoffs and evaluate that.

Event management and escalation: when those alarms go off, when things start to break, do we know who's responding? At what point do we have to recognize that this is maybe beyond tier-one support and we need to escalate? Does everyone know who they're escalating to and what happens then? How do you manage the event so that everyone can communicate clearly, but you don't have 9,000 people in a chat room just kind of yelling at each other or coming up with their own ideas of what might be happening? That can really slow down your responses. So making sure your event management communication is clear is key.
And I want to take a moment to talk about correction of errors. This is something that we're really, really big on here at AWS; we do this internally all of the time. This is about, when an event happens, getting together the stakeholders, engineers, lines of business, whomever is a part of this, and looking at what happened in a blameless way. A blameless way, that's the key, and here's why. Something happened, we've gotten through it, we fixed it, maybe we did a good job, maybe we didn't, but we're kind of past that now. Of course, the key thing we want is to not have that happen again, so we need to have a very honest evaluation of what happened.

Now, a lot of the time things go down because somebody made a mistake. We all remember, I think, when that big Facebook outage occurred a year or so ago now; it was like two keystrokes somebody made by mistake, and it was down for days. It happens, it happens to everyone; we see it happen in AWS. The key thing here is we don't focus on "well, why did you do that? That was the wrong thing." That's not helpful. Even if everyone knows who did it, and whether that person should even be participating in the correction of errors, that's not getting us anywhere. What we have to understand instead is: where did our system fail? How did our process let them down? Why were they able to run that script that did that thing? We shouldn't have enabled that to happen. How can we fix our process so that that error can't happen again and no one can make that mistake again? That's how we get healthy. So that's a really, really key process that I strongly urge people to start looking into and running: correction of errors.
And actually, AWS has several other programs that really go deep into this kind of process. And then of course, we have services to help you here as well: AWS Config and AWS Systems Manager will help you automate some of those remediations that we talked about.
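As one hedged example of the push-button remediation being described, Systems Manager Automation can run a managed runbook such as AWS-RestartEC2Instance. This boto3 sketch assumes that document's InstanceId parameter name (verify the parameters for whichever runbook you actually use) and a placeholder instance ID:

```python
import boto3

ssm = boto3.client("ssm")

# Kick off a managed Automation runbook to restart an impaired instance.
# Verify the document name and its parameter names for your environment.
execution = ssm.start_automation_execution(
    DocumentName="AWS-RestartEC2Instance",
    Parameters={"InstanceId": ["i-0123456789abcdef0"]},  # placeholder instance ID
)

execution_id = execution["AutomationExecutionId"]
status = ssm.get_automation_execution(AutomationExecutionId=execution_id)
print(status["AutomationExecution"]["AutomationExecutionStatus"])
```

The same pattern scales from "click a button" to fully automatic when you trigger it from an alarm or an AWS Config rule instead of by hand.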
And Route 53 Application Recovery Controller, or ARC as we like to call it, is a really nice tool that I encourage you to investigate if you haven't already. This is going to essentially provide you with very reliable, data-plane-based operations for failing traffic out of a fault isolation boundary, such as an Availability Zone or a Region, when things start to go bad.
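A hedged sketch of what flipping a routing control looks like with boto3: ARC's routing control changes are made through the cluster's own data-plane endpoints rather than the normal regional endpoint, and the endpoint URL and routing control ARN below are placeholders:

```python
import boto3

# Route 53 ARC routing control state changes go through the cluster's data-plane
# endpoints (placeholder below), which is part of what makes them highly reliable.
arc = boto3.client(
    "route53-recovery-cluster",
    endpoint_url="https://host-aaaaaa.us-east-1.example.amazonaws.com/v1",  # placeholder cluster endpoint
    region_name="us-east-1",
)

# Turning a routing control "Off" shifts traffic away from the cell it gates,
# for example an Availability Zone or Region behind a Route 53 health check.
arc.update_routing_control_state(
    RoutingControlArn="arn:aws:route53-recovery-control::123456789012:controlpanel/EXAMPLE/routingcontrol/EXAMPLE",
    RoutingControlState="Off",
)
```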
Alright, I've bored you enough. Now I'm going to bring Stacy and Yoni up to give you the real information on what Vanguard's done to take these concepts and improve their resilience. Stacy, Yoni.
Thank you, Clark. So, good afternoon. Yoni and I are thrilled to be here today to talk about resiliency, something near and dear to our hearts. We currently lead the Technology Resiliency organization at Vanguard and have for the last few years. But before we start, let me tell you a little bit about Vanguard and provide a little context on why resiliency is so important.
Vanguard is a global investment management company with over 50 million investors worldwide. We have over seven trillion dollars in assets under management that our investors entrust us with. We have 20,000 crew members globally (crew is how we refer to our internal staff), and we have 18 global locations.
We have no physical branches and are a digital company. At Vanguard, our core mission is to take a stand for all investors, treat them fairly, and give them the best chance for investment success. It's very important to have zero downtime for our clients, and it's critically important for us to do that both for our clients, who are entrusting us with their assets, and for our engineers, who need to focus on future development for those clients and not on responding to the firefights of the past.
We have a big, complicated IT system. As Stacy mentioned, we've been doing this for a few decades now. We are running in multiple cloud environments, and in each cloud environment we are deployed in multiple Regions for various reasons, some of them geo-proximity, some of them resiliency. We still operate multiple data centers on prem, and we have a global presence in multiple locations because we are a global company and have a presence in every geography.
In 2015, we adopted a public cloud strategy and have been on it ever since. A lot of our mission-critical applications, if not the majority, are running in the public cloud today. We have adopted a very strong modernization agenda and are executing successfully on it, and I think we have a few years to go until we are fully modernized. This has created a very complicated environment for us: we have legacy services, very modern services, and everything in between, and that has resulted in a complicated environment that caused instability and a degraded experience for our customers.
So what were we going to do? I started by saying downtime is not tolerable to both internal and external clients. We have always been focused on resiliency, so resiliency improvements were happening, but they were happening in silos: at a product level, at a team level, scattered across the organization. They were very reactive. We were focusing on what happened and correcting that individual problem within those silos.
What we needed to do, during this transformational time, during the change, was really double down. We needed to create an enduring, enterprise-wide organization by bringing all of the teams together throughout that life cycle to ensure that we had a feedback loop between the processes. We were focusing on the people, the processes, and the technologies and systems needed to enable an organization of our size to do this and learn from our errors.
How we did this was we started by pulling together our architecture office, our engineering office, our production assurance, and our operations teams. This enabled us to really look at the end-to-end life cycle from an organizational view, coming together as a cross-functional team to share learnings and to continue to iterate throughout the life cycle.
You'll see that what we say here aligns very closely to what Clark mentioned in the AWS white paper. Within the resiliency organization, we started a little bit before that, but as you see, the life cycles align; it's very similar in the parts where you need to focus, starting with defining and setting your standards.
So step one was to create the organization, but we also needed to focus on the operating model. How were we going to operate? How were we going to continue to ensure that the feedback loop worked through respond-and-learn and throughout the life cycle? We needed to shift from reactive firefighting mode, which we all got really good at, and I'm sure everyone here has been there. We needed to move into a proactive state so that we could focus on getting better, and we needed to eventually move to ingrained.
This is hard, because you're doing this while you're transforming, right? So as you're trying to move to proactive, you're actually still responding to issues that are happening throughout the environment, so it takes time and focus. We focused on really making sure that we had the right resilient standards, patterns, and goals set. We needed to change some things as we looked at that; we needed to make sure that we were setting the right direction and that our architecture designs and patterns matched those standards. We did make some big decisions and changes along the way to ensure that this was working.
Once we did that, we needed to enable our engineers and our developers with the tools and the processes through the deployment pipeline to be able to elevate that while taking advantage of the patterns and the standards available. And then we needed to ensure that we were validating that we were operating as we expected; we needed to ensure that we were ready to shift, to go back to the standards and the patterns if needed, and then respond and learn.
As Clark mentioned, you're not removing any failures from the system. What you're doing is preparing for those system failures, learning, and ingraining those learnings throughout the process to ensure you're ready to recover, reducing that MTTR along the way.
One of the more important parts of understanding your resilience is performance testing. We've been doing performance testing for a while at Vanguard, and we'd been using a vendor tool until we maxed out the capability of the tool relative to our demand; we simply needed more, and we couldn't get it from the vendor in a cost-effective manner.
So we developed a tool in house called PTAS, Performance Testing as a Service. It's based on Locust, with essentially a control plane that we've developed, but we were, and still are, able to scale it virtually to the limit of how much infrastructure we can provision from AWS. We've pushed some serious loads with this tool, allowing us to understand how our systems respond to significant load.
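PTAS itself is internal to Vanguard, but since it's built on Locust, here is what a minimal Locust test file looks like in general. The host and endpoints are hypothetical, and this is plain open-source Locust, not Vanguard's control plane:

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://api.example.com
from locust import HttpUser, task, between

class AccountHolder(HttpUser):
    # Each simulated user waits 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task(3)
    def view_balances(self):
        # Weighted 3x: the most common read path.
        self.client.get("/v1/accounts/balances")

    @task(1)
    def view_transactions(self):
        self.client.get("/v1/accounts/transactions?limit=25")
```

Scaling comes from running many Locust workers behind a coordinator, which is essentially the job a control plane like PTAS automates.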
In addition to that, we looked at a variety of tools available out there for chaos engineering, and some of those tools were not suitable for our environment. So again, we developed an in-house tool that is specific to the context of our business processes and allows us to experience the failures that we know can occur in our systems.
There are about 20 experiments in that tool suite today, and all of them are named after some natural disaster phenomenon, as you would expect. We have also enhanced our observability stack. Previously, we relied on logging and alerting. When we introduced distributed tracing, it opened our eyes; it allowed us to really understand where failures occur in the system in real time, significantly reducing mean time to detection and therefore mean time to repair.
Stacy had mentioned the policy engine. That is a tool that we've developed using open technology and open standards like Rego and OPA (Open Policy Agent), which allows us to assess business value as it moves through our pipeline and check its compliance with the resiliency standards that we've established. It allows us to fail things early on in the pipeline, in the process, and make sure that the things passing through it are really resilient the way we've defined it.
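Vanguard's engine is in-house, but to illustrate the general OPA pattern: a pipeline step can post the deployment's metadata to an OPA server and fail the build if the decision comes back negative. Everything here, the OPA URL, the policy package path, and the input fields, is hypothetical:

```python
import sys
import requests  # third-party; pip install requests

# Hypothetical OPA server and policy package; a Rego rule named "allow" is assumed
# to exist at resilience/deployment and evaluate the input document.
OPA_URL = "http://localhost:8181/v1/data/resilience/deployment/allow"

deployment_input = {
    "input": {
        "app": "checkout-service",
        "tier": "platinum",
        "multi_az": True,
        "has_chaos_test_results": False,  # example of a gap the policy might catch
    }
}

decision = requests.post(OPA_URL, json=deployment_input, timeout=5).json()

if not decision.get("result", False):
    print("Deployment blocked: resiliency policy not satisfied")
    sys.exit(1)  # fail the pipeline stage early

print("Resiliency policy checks passed")
```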
And finally, for our stakeholders, we wanted to give them the ability to see, in real time, the health of our critical business processes, and with that we created an enterprise health dashboard. Having the tools, however, is good, but having a culture where everybody is empowered and has the desire to use those tools is better.
So we've changed our change and release management and focused on the culture of resiliency. As Stacy mentioned, we had a checklist before; that's a good start, but it's not scalable. We wanted to move faster and deliver value faster.
So we've automated release management entirely; the rules engine is what made it possible for us. For resiliency testing, we have combined performance testing and chaos testing, which allows our builders to experience failure in a safe manner. We wanted to force the issue. Failures will happen when multiple factors collide: you're under load and something fails, what happens? We want to force this issue. We want to understand how quickly our applications recover from failure, or whether they even can. Sometimes, under load, and we've experienced this too, the load overwhelms a service and it's just not able to recover by itself until you intervene. That was very important for us as we matured in this journey and acquired more and more experience.
Our people became subject matter experts. It was important for us to bring them together and allow them to teach others and learn from each other, so we created a champion network, a resiliency champion network. We wanted to train all of our builders in SRE practices, and we came up with a three-tier training curriculum that is available to every crew member at Vanguard, as well as a special track for our executives, so our executives are also versed in SRE practices.
And finally, we've instituted a very prestigious award, which we've issued so far to a few teams. This award identifies the teams that go above and beyond in this ownership mentality, identifying how they can make their services better than they ever were and truly resilient. It comes with a monetary award but, more importantly, bragging rights.
So where are we now? We've shifted. We talked a little bit about moving from the reactive to the proactive to the ingrained. It doesn't happen overnight, and it's not something that probably ever ends; as you continue to mature, you need to continue to look at your resiliency journey. Yes, we went from a checklist to an ingrained policy engine where we can automatically check our policies along the way and improve the experience, but that's not over. We're going to continue to enhance that.
Let's talk a little bit about where we are today and some of our wins, and then we'll get into what we're focusing on next. We have seen great improvements in resiliency over the last few years. We have enhanced reliability and continuity, but more importantly, we have increased our clients' and our crew's confidence in our systems. We are focusing more on future development and less on responding to errors.
We've actually increased our deliveries by 5x while reducing major incidents by 30%. So gone are the days where you implement a change freeze in order to increase stability. Now we're increasing the number of deployments we do while seeing a decrease in incidents, and that's all due to the integrated controls that are in place to allow our engineers and developers to develop and deploy faster without manual checklists and speed bumps along the way.
We focused on observability and really doubled down on that. What's important is to be prepared for a failure, know where the failure is occurring in your systems, and be prepared to respond. We're excited to report that we reduced our MTTR by 60% from the focus in that space.
We've increased availability, reliability, and recoverability along the way, and, as I mentioned, increased client experience and satisfaction. But I'd say another great win is that our developers actually like to be involved in the process now; we're no longer putting production stops and controls along the way, and they're excited to take part in the testing and the tools that we have to actually enable them to move faster.
But it's not done; we need to continue to focus here. What we're focusing on next year is continuing the work on ease of adoption. We want this to be something that's ingrained throughout the system, not something that we have to stop and think about. We want to make it easy to adopt for our engineers and our developers. We're doing that by integrating a lot of the consumption of libraries and creating more programmatic consumption of our tools, so there's less need to train on the tools and more ability to use them and focus on delivery. Observability continues to be a big focus.
We are using some of the best, state-of-the-art tools, but they're in various places on various platforms, so our teams need to look across those platforms. We're looking at continuing to integrate that into a single platform for ease, to see your end-to-end journey, and also to help identify breaks in the system sooner.
We're enhancing our policy engine; we're continuing to look to codify standards and give more information early in the development life cycle to our engineers so they can respond and adjust along the way. And then we're looking at our practices: today we're focused a lot on our applications, and we want to continue to look at the end-to-end client journey throughout the resiliency life cycle.
Available for everyone, if you scan the QR code here, is the AWS white paper; we'll leave that up. It will take you through the life cycle that Clark mentioned and a lot of the parts that we use at Vanguard in our journey.
So I just want to thank everyone for their time this afternoon. We appreciated sharing our resiliency story, which is not done yet. We are available for a few questions, and feel free to reach out to any of us individually. Thank you.