Best practices for creating multi-Region architectures on AWS

All right, welcome. We're gonna go ahead and get started.

So you are here because you either have, or are considering, building or extending your workload across multiple AWS regions. Whether that is to improve performance for a globally distributed user base, increase availability for some of your most critical workloads, or come into compliance with data residency laws and regulations, you've likely realized that this can get quite complex.

So whether you're just getting started, or you're deep in the weeds trying to figure out the best path forward, there are a lot of decisions to make. Throughout the session, we're gonna look through some real-world scenarios and extract best practices that you can take home today to help you in your multi-region journey.

So, welcome to this 300 level session on best practices for creating multi region architectures on AWS. My name is Joe Chapman. I'm a principal solutions architect and I'm joined by my colleague, Neeraj Kumar, who is a principal technologist.

Over the past few years, Neeraj and I have been in a lot of customer conversations, helping customers to evolve and build their workloads across multiple AWS regions. So we built this session based on these real world learnings that we extracted from the field helping customers just like you.

Now, as we briefly look at the agenda here, our overall goal is to provide you with a clearer picture of how you can build and evolve your workload across multiple AWS regions by utilizing known AWS best practices.

We're going to dive into two fictitious customer scenarios which are built based on real world use cases. And then as we walk through these customer journeys, we're going to extract the best practices from each of those.

We're also going to look into the trade-offs, the considerations, the pros and cons of the paths that they took, because there really is no one-size-fits-all solution when it comes to a multi-region architecture. That way you can understand the differences in the trade-offs and adapt them to your specific workload.

Now, if you're still in the decision phase of whether or not you need a multi-region workload, it's important to remember how AWS regions are built: AWS regions are built to be resilient by design.

Customers benefit from three or more availability zones in every single public AWS region, and each availability zone consists of one or more discrete data centers.

Each availability zone is separated from the others by a meaningful distance, taking into account things such as separate power substations and separate flood plains.

So if there were a disaster, such as a flood affecting one data center, it wouldn't affect data centers in a different AZ within that region.

Each of these AZs is also connected through high-throughput, low-latency links, which allows you to build resilient applications with very low latency between the different availability zones. You can even build synchronously replicating applications designed within that regional boundary.

The core thing I want to emphasize here is that AWS regions are built to be resilient. For some of you who have been building on top of AWS for some time, this might be common knowledge.

But as you're doing design reviews, as you're doing Well-Architected Framework reviews with others within your organization, take that special look at the workload through this lens and really determine whether it is taking advantage of the resilient design and resilient nature of regions.

Because if you're not well architected in a single region, going multi-region might not help. It could actually make things worse, for example by extending problems, operations, and dependencies across multiple regions.

Many of you do have strong reasons to go to a multi-region architecture, and as you get started on this journey, the first thing I encourage you to do is take a step back and really understand the current architecture: what it is made up of, what its requirements are, and what new requirements the architecture and the workload need to be able to accommodate.

For example, if you need to improve performance for a growing global audience: how is performance measured, what are the current performance metrics, and what are the future performance targets that need to be understood and accomplished?

Or if your requirements say that you need to increase availability for the workload: what are the current availability requirements, what is the workload currently achieving, how is that being measured, and what does it need to achieve in the future?

Or if you have data residency laws and regulations that you need to come into compliance with: what exactly are the requirements set within those laws and regulations?

For example, where can data be stored, where can data be transferred, and who can have access to the data? And most importantly, what guardrails need to be put in place to ensure that data doesn't move where it's not supposed to?

So get aligned with both business and technology stakeholders, understand these requirements, and then work backwards from those requirements to figure out what the best solution is for you.

Ask: can a single region meet those requirements, or do these requirements dictate that we need to extend beyond a single region into a multi-region workload?

And having both business and technology stakeholders in the room, in agreement here, is going to help level set across the organization on the cost, the complexity, and the trade-offs. It's really going to set the foundation for your journey into a multi-region workload.

So understanding the requirements is the first fundamental in creating a multi region workload.

Next, I'm going to hand it over to Neeraj and we're gonna walk through our first customer scenario.

Thank you, Joe, and hello everyone. Welcome to re:Invent and welcome to this session. Thank you for coming this late in the day.

So just to build on what Joe mentioned earlier, we have a couple of stories, a couple of scenarios, to share with you. The characterization of these customers is fictitious, but the lessons and best practices we are bringing to you through this session are very much real, drawn from working across a multitude of customers in different industry verticals.

So we have really tried to distill those lessons that we've learned from the field into these couple of stories.

So the first story we wanna talk about is a fintech retail bank. It's a fast-growing company going through hypergrowth, and currently they are operating out of a single region. Going back to Joe's earlier point, they have already taken steps to improve their resilience posture within that single region.

And now they're at a point where this hypergrowth is leading them to think really seriously about their disaster recovery and business continuity strategy, because at this stage of growth, resilience is becoming a boardroom-level conversation. The cost of downtime is too significant for them not to think about disaster recovery and business continuity.

So the goal at this stage in their business life cycle is to identify the right DR and operational continuity strategy, and also to invest in testing those strategies. Whatever detection, recovery, and mitigation controls they build, they want regular testing of those controls so they work exactly how they were intended to.

Now, one of the first steps they took on that DR journey was to define what the business goals look like. Some of you may be familiar with the two key metrics you really need to define before you can identify the right strategy: RTO and RPO.

For those of you who haven't heard these before, a very quick summary. RPO stands for recovery point objective. Think of this as the data currency your business needs to operate, or in other words, the lag in the data, the data you can afford to lose in case of any large-scale impairment of your data sources.

RTO, on the other hand, is all about the speed of recovery. It stands for recovery time objective: how quickly you want to get back to the business operating and working again, typically defined in terms of minutes or hours.

For their purposes, this was defined as a 15-minute RPO and a one-hour RTO. Of course, this can vary based on industry, use case, and the criticality of the mission-critical system we are talking about.
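To make those two objectives concrete, here's a minimal sketch of checking measured replication lag and drill recovery times against targets like theirs. The helper function is purely illustrative, not an AWS API:

```python
from datetime import timedelta

# Business targets from the scenario: 15-minute RPO, one-hour RTO.
RPO = timedelta(minutes=15)
RTO = timedelta(hours=1)

def meets_objectives(replication_lag: timedelta, recovery_time: timedelta) -> bool:
    """True when measured replication lag and drill recovery time fit the targets."""
    return replication_lag <= RPO and recovery_time <= RTO

# A drill that recovered in 42 minutes with 30 seconds of replication lag passes:
print(meets_objectives(timedelta(seconds=30), timedelta(minutes=42)))  # True
```

The point is that RPO and RTO only become useful once you regularly measure real lag and real drill times against them.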

Based on those, the next step was to look at the different strategies they could adopt.

Now, this chart on your screen is from our Well-Architected Framework, which broadly classifies DR into four broad strategies, starting with simple backup and restore. You could argue that any critical system should have this as a default strategy, because it's not just technology that fails; there are many different reasons why you may have, say, data corruption, or want to go back to a previously known good state.

So this is the entry-level, basic strategy every critical system should think about. The next strategy is called pilot light. In pilot light, you take a replica of your application and create it in a different site, or in this context, a different region.

In pilot light, you're focused on a couple of things. One is your data replication strategy, because you want to continuously sync data, so that if you have to recover in the event of a failure, you have all the current data you need to effectively operate your business systems.

And second, while you have all your infrastructure-as-code templates ready to go, you're not actually running any compute capacity just yet. The idea is that you need all the data available in the second region, plus the scripts you need to bring up and scale up your infrastructure.

The third strategy is warm standby. Warm standby is very similar to pilot light; the distinction is that rather than running no compute at all, you have some baseline compute available, some warm resources.

So when you compare pilot light and warm standby, you're effectively getting a quicker RTO, because a scaled-down version of your resources is already there. From there, when you fail over, you can ramp up and scale up your resources.

At the other end of the spectrum is active-active. By definition, active-active implies that you're using more than one region simultaneously, either for both read and write transactions, or with some kind of regional sharding. Again, there's no one way of doing active-active; the idea is that you're splitting your workload, splitting your transactions, across regions.

Of course, as we go from left to right on this chart, the complexity, the cost, and the operational overhead also increase.

This particular customer was comfortable choosing warm standby; they thought it could meet the goals of their operational continuity needs.

So the next step in this journey was to start executing the warm standby strategy.

So what does it look like? As we said earlier, in warm standby you take a replica from one region and make sure those resources, including the data, code, and configuration you need, are readily available in your secondary region.

There are a couple of foundational capabilities you need to think about here. One is your data replication strategy; again, it will depend on what RPO goals you're trying to achieve. In this instance, they are using Amazon Aurora. Aurora has a feature called Global Database, so you can deploy Aurora as a cross-region cluster and replicate your data from the primary region to the secondary region.

So you have nodes running in the secondary region; you're not actively using that region, but it's there as a passive standby. Similarly, you need to think about how you're going to deploy your code, because you want your code and configuration to also be in sync between the two regions.
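As a rough sketch of what that cross-region Aurora setup can look like in CloudFormation (resource names and the engine choice here are illustrative, not from the talk), the primary region's template wraps the existing cluster in a global cluster, and the secondary region's template declares a cluster that joins it:

```yaml
# Primary region (illustrative names): wrap the existing cluster in a global cluster.
PaymentsGlobalCluster:
  Type: AWS::RDS::GlobalCluster
  Properties:
    GlobalClusterIdentifier: payments-global
    SourceDBClusterIdentifier: !Ref PaymentsPrimaryCluster

# Secondary region: a cluster that joins the global cluster and receives
# the replicated data, acting as the passive standby.
PaymentsSecondaryCluster:
  Type: AWS::RDS::DBCluster
  Properties:
    Engine: aurora-mysql
    DBClusterIdentifier: payments-secondary
    GlobalClusterIdentifier: payments-global
```

During failover, promoting the secondary cluster (detaching it from the global cluster so it accepts writes) would be one of the runbook steps invoked from the region you're recovering into.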

Now, a couple of callouts here. When you're extending your code pipeline to deploy code and configuration across both regions, you should do this deployment one region at a time. The reason is that, from experience and statistically speaking, one of the biggest reasons systems fail is when new code or configuration shows up in production, like a bad code or configuration change, and you have to think about rolling back from there.

And that can be a very involved process. Now, let's say a bad configuration causes some kind of impairment in region one. Guess what: if you deployed at the same time to another region, it will have the same problem in region two, and that will hamper your ability to recover from the event in region one in time. So you really want to avoid doing that.
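That one-region-at-a-time rollout with a health gate can be sketched like this. The `deploy` and `is_healthy` callables stand in for your real pipeline and bake-time checks; this is an illustration, not an AWS API:

```python
def staggered_rollout(regions, deploy, is_healthy):
    """Deploy to one region at a time; halt on the first failed bake check
    so a bad release never spreads to the remaining regions."""
    completed = []
    for region in regions:
        deploy(region)
        if not is_healthy(region):      # bake-time health gate
            return completed, region    # stop: this region needs a rollback
        completed.append(region)
    return completed, None              # all regions took the release

# Example: the second region fails its bake check, so the third is never touched.
done, failed = staggered_rollout(
    ["us-east-1", "eu-west-1", "ap-southeast-1"],
    deploy=lambda r: None,
    is_healthy=lambda r: r != "eu-west-1",
)
print(done, failed)  # ['us-east-1'] eu-west-1
```

The key property is that a bad release impairs at most one region, leaving the others healthy to recover into.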

Another key callout: whatever your recovery procedures and mechanisms are, your DR runbooks, you need those runbooks to be available in both regions. And ideally, you want to invoke those runbooks from the region you're recovering into, your secondary region.

Again, the reason is very similar: if there's any cascading impact, or some common dependency those runbooks rely on, then whatever caused the impairment in region one's services could also hamper your ability to effectively run those runbooks. So it's always a good idea to invoke them from the secondary region.

So whatever your failover activities are, promoting readers to writers in the secondary region, cutting over DNS at some point, all those procedures and steps you should think about executing from the region you're recovering into.

So at this point, the best practices this customer adopted in their journey: one is that your DR and operational continuity strategy should be informed by your business impact analysis. A BIA is a really good mechanism to pinpoint where the critical parts of your business are, what the critical path looks like, and which systems support that critical path. That's how you can effectively derive metrics like RTO and RPO, because if you think about it, it's not just a technology decision.

Joe mentioned this earlier: this is a business decision, and we have to weigh it against the business case, the cost and operational overhead that come with these architectures, and the cost of downtime.

The second best practice is around code and configuration. We've talked about it; the term we use within AWS is staggered deployments: do it one region at a time, or even, if you have smaller fault isolation units, one fault isolation unit at a time within a region, say one AZ at a time for an independent workload, just to build confidence that the new release is working as effectively as you wish. And choose the right data replication strategy.

Of course, in this example we've shown how you can do this with Amazon Aurora, but whatever your system of record or choice of database is, you need to think about whether you can meet your RPO goals, whether that's with a simple backup and restore strategy all the way to near-real-time replication. Choose the replication strategy that meets those RPO goals.

And again, ensure recovery runbooks can be invoked from the secondary region, just to avoid any impairment in your primary region also impacting your recovery procedures.

So, carrying on with this journey: after implementing warm standby, this customer wanted to build more confidence in their ability to effectively fail over. One of the things we often talk about with customers around DR strategies, or resilience in general, is that you practice your ability to recover before any event happens. You build that muscle constantly and regularly, not when the event happens.

So choose the right frequency to test these mechanisms. What you're really testing is, first, your detection controls: can you detect a problem? Then, once you get those signals, are your recovery controls and procedures working effectively? And finally, any mitigation controls you want to build as part of that process.

So they adopted something we also prescribe a lot: running company-wide game days. Game days are a great mechanism to test not just that the technology works, but that people, process, and technology come together and work as a well-oiled machine.

This is really testing whether our incident response is working as intended, whether all the runbooks are working as intended. So game days are a really good mechanism; choose the frequency that works best for your business, based on your release cycles and so on.

In this case, they also used AWS Fault Injection Service. Let me spend a quick minute on how Fault Injection Service can help you automate and simulate some of these scenarios.

Fault Injection Service is a fully managed service that helps you inject deliberate failures to see how your system withstands them. Another term you may have heard us use is chaos experimentation: you deliberately introduce chaos into the system to test your recovery controls and your detection controls. Is your observability working as intended? Are you getting the signals in time? And are your recovery procedures acting on whatever that failure mode is?

So FIS enables you to run these experiments as a managed service. It gives you events, or actions as we call them in the service: things like "let's terminate some EC2 instances," or "let's introduce some network interruptions or EBS volume interruptions." Whatever failure mode you're concerned with, it can help you simulate it.

And you want to run these experiments with safety rails around them, in a fail-safe mode, because you don't want to interrupt any active production services. So the service also provides guardrails within which to run these experiments.
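As an illustration, an FIS experiment template for the "terminate some EC2 instances" scenario might look roughly like this (the role ARN, tag values, and alarm ARN are placeholders). The stop condition is the guardrail: if the referenced CloudWatch alarm fires, the experiment halts rather than continuing to hurt customers:

```json
{
  "description": "Terminate 30% of tagged app instances; abort if the customer-error alarm fires",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "targets": {
    "app-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "app": "payments" },
      "selectionMode": "PERCENT(30)"
    }
  },
  "actions": {
    "terminate-instances": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "app-instances" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:customer-error-rate"
    }
  ]
}
```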

So again, what was the best practice here? Regularly test your detection and recovery controls. We cannot emphasize this enough: you must test these controls regularly, so that if a real event happens, you are well prepared and have a very high degree of confidence that these procedures and runbooks will work as intended.

OK. So carrying on in this journey, and the incremental, step-by-step improvements they are making to their overall resilience posture: the next thing they wanted to do was validate their failover readiness.

What this means is: you may have defined all your recovery and detection controls, but what if, when you fail over, your target region, the secondary region, has mismatches in the base-level features or configuration you need for an effective failover?

For example, do you have the same limits as in the primary region? Have you checked all the service-level limits? Because if you have to apply for limit increases during the event, guess what: you're compromising your recovery time objective. Things like provisioned capacity: if you're using services that support provisioned capacity, do you have any mismatch between the regions?

So making all those readiness checks is again super critical, beyond the recovery and detection controls we've been talking about. To do that, they adopted Route 53 Application Recovery Controller. If you're not familiar with it, I'll give you a very quick intro.

Application Recovery Controller, or ARC, is part of the Route 53 service. At a macro level, there are three key benefits of adopting it. One is the readiness checks we've been talking about: it will constantly, at a one-minute interval, audit the parameters and configuration of the secondary region, things like the examples I was giving earlier, limits, or any mismatch in provisioned capacity, plus a whole bunch of other checks that come out of the box.

The second is the routing control itself. Now, when you fail over, especially in the kind of architecture we're talking about with warm standby, which is really an active-passive architecture: while we recommend you automate all the steps, all your runbooks should be automated, the decision to actually pull the proverbial lever to shift traffic from one region to another should be manual.

And the reason is that it's a very involved decision. You need to be absolutely sure, when you make that final decision to shift the traffic, that you're ready to take it. Readiness checks play into that, because you want to be absolutely sure that when you open the traffic gates to the second region, you have all the capacity and data you need, and all the readiness checks are showing green, before you invoke any kind of switchover.

And the third is that, because it's such an involved and important process, you want some safety rules, some guardrails against invoking that lever unintentionally. You can put checks in place, and that's another feature of ARC: it gives you safety rules so that only when those rules are met will pulling the lever actually be actioned, will the routing control change actually take effect.
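As a sketch, an ARC assertion-type safety rule that keeps you from accidentally turning off traffic to both regions at once might look like this (the ARNs and names are placeholders). With `Type: ATLEAST` and `Threshold: 1`, a routing-control change that would leave zero regions enabled is rejected:

```json
{
  "AssertionRule": {
    "Name": "at-least-one-region-active",
    "ControlPanelArn": "arn:aws:route53-recovery-control::123456789012:controlpanel/EXAMPLE",
    "AssertedControls": [
      "arn:aws:route53-recovery-control::123456789012:controlpanel/EXAMPLE/routingcontrol/primary",
      "arn:aws:route53-recovery-control::123456789012:controlpanel/EXAMPLE/routingcontrol/secondary"
    ],
    "WaitPeriodMs": 5000,
    "RuleConfig": { "Type": "ATLEAST", "Threshold": 1, "Inverted": false }
  }
}
```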

So at this point, the best practices to call out: build failover readiness checks into your recovery strategy. Beyond your recovery, mitigation, and detection controls, you want to be absolutely sure you're ready to take the traffic in your secondary region. And use ARC where you can, because it will simplify and automate a lot of things you would otherwise have to build yourself.

And by the way, this is not just for the failover. When you fail back, when you shift the traffic from the secondary back to your primary region, you need to go through a similar level of thinking as you do at failover. So ARC can help you in both scenarios.

And again, I would emphasize: automate as much as possible, because more automation will get you to your goals faster. But the final decision, in these architectures specifically, should be manual.

So let's carry on. Again, we are making incremental improvements.

This customer is on a path of continuous improvement in their resilience journey. The next thing they looked at was the observability stack. In order to trigger the failover or failback mechanisms, you really need to rely on signals, and those signals are provided by your observability stack. So observability, which is where your detection controls come in, is a very foundational capability you need in place. And you need the same, if not a greater, level of resilience in the observability stack itself as in the rest of the workload, because if you're not getting those signals, or not getting them in time, it will hamper your ability to recover in time.

So in this case, the realization was: what if your observability, your detection control itself, is impaired? This is very similar to the earlier point about recovery controls, but now applied to detection controls. If there's some common dependency or some common underlying cause of impairment, whatever instrumentation you have in your observability stack can also be impaired.

So what you really want to do is what they did here: extend the observability stack across regions. Even though in this situation they're not actively using the second region, you need to observe both regions, at least at a shallow level: health checks, even some kind of heartbeat metric. That way you can observe a region from outside, see what's going on in the region from outside of it, and correlate what you can see within the region with what you actually observe from outside. That's really important, because if something is impaired within your observability stack, having this outside perspective on your region lets you make a more deterministic decision on whether something is wrong, and then make your failover decision based on those signals.

So again, the best practice we want to call out here: observe a region's health not just from within, but also from outside that region, to make a deterministic failover decision.
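The "correlate the inside and outside views" idea can be sketched as a small decision function. The three-state result values are our own illustration, not an AWS API:

```python
def assess_region(internal_healthy, external_healthy):
    """Combine in-region signals with out-of-region heartbeat checks.

    internal_healthy: True/False, or None when no in-region signals arrive in time
    external_healthy: result of health checks run from outside the region
    """
    if external_healthy:
        if internal_healthy:
            return "healthy"
        # Outside view is fine but inside is silent or alarming:
        # suspect the observability stack itself before failing over.
        return "investigate-observability"
    # The outside perspective confirms a real problem: a deterministic
    # basis for invoking the failover runbooks.
    return "impaired"

# In-region telemetry has gone silent AND the external heartbeat fails:
print(assess_region(None, False))  # impaired
```

The value of the outside perspective is exactly the middle case: without it, silence from in-region telemetry is ambiguous between "region down" and "monitoring down."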

So after all these incremental improvements, this is what the architecture currently looks like: its current view, or as we like to say, its evolving view, because you're never done with resilience. Resilience is a continuous-improvement journey. One thing I want to call out is that in taking all these incremental steps, there were also trade-offs to make. Of course, it made absolute sense in the context of their business and the criticality of the systems they were running, but the trade-off here was operational overhead: think about all the cross-region observability, watching your data replication, watching how your deployments are going. There is additional operational overhead, which comes with additional complexity, more steps to think about as you fail over and fail back in these architectures. And of course, this also leads to some additional cost.

So again, going back to the previous point I mentioned in the beginning that these architecture decisions or going down this path is not purely a technology decision, it has to be led by business, it has to be led by a business case.

So our next customer scenario is an online authentication provider, providing authentication as a service. This is also a fast-growing company, and as they evolved, they started to run into two primary problems.

The first problem was that as they took on larger and larger customers, which is a good problem to have, they noticed that these customers wanted contractually set SLAs, and they didn't have any set. They knew they needed to increase the uptime SLAs that were going to be written into these contracts.

Also, as they had more customers worldwide and more users logging in from more locations, they noticed from support cases and online threads, as well as their own internal metrics, that users in different geographies were getting vastly different performance from the same exact workload.

So they sat down and defined these requirements. They wanted to improve the uptime SLA to four nines of availability, and they wanted to increase overall application performance by reducing latency by 30% at a global level, specifically for their most important API: the authentication API.

You can see the brief architecture diagram here. This is currently a single-region application. Users come in and first hit an API Gateway endpoint, and then, depending on the request, the user is directed to either a Lambda function or an Application Load Balancer that fronts both ECS containers and EC2 instances. Then there's a persistent storage layer at the bottom consisting of both DynamoDB and S3.

Now, as this customer began this journey, they first wanted to build up the environment in their lower environments, and they quickly came across their first problem. They realized that about 20% of the workload was either not defined in their infrastructure-as-code templates, or had significant drift between what was in the templates and what was actually running in production.

So they needed to take a step back and do a full audit of the workload from start to finish, ensuring that what was in the infrastructure-as-code templates matched what was running in production, because they knew that doing manual releases and manual changes, especially across multiple regions, was going to be troublesome. It was going to be error-prone, and it was going to take more time.

So, to ensure consistent deployments across all the regions they were deploying into, they adopted CloudFormation, and within CloudFormation they broke the application down into separate stacks: for example, one for the front end, one for the back end, one for shared services, and another for the data tier. From these templates they then created CloudFormation StackSets, which allowed them to deploy the same stack to multiple regions.

However, they knew that they didn't want to deploy to all the regions at the same time. They wanted to allow for sufficient bake time after each deployment to ensure that the deployment was healthy before extending it to other regions. To accomplish that, they used a mixture of parameters and wait conditions within CloudFormation to ensure they were only deploying to one region at a time, then waiting a sufficient amount of time before deploying to the next region.
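
As a rough sketch of that staged-rollout logic (the region order, bake time, and health check here are illustrative stand-ins, not actual CloudFormation or StackSets APIs):

```python
import time

# Hypothetical region order and bake time; in practice the bake time would be
# much longer and the health check would read real alarms, not return True.
REGIONS = ["us-east-1", "eu-west-1"]
BAKE_SECONDS = 0  # e.g. 3600 in production; 0 here so the sketch runs instantly

def deploy_stack(region):
    """Stand-in for a CloudFormation stack set deployment to one region."""
    return f"deployed to {region}"

def is_healthy(region):
    """Stand-in for post-deployment health checks (alarms, canaries)."""
    return True

def staged_rollout(regions):
    results = []
    for region in regions:
        results.append(deploy_stack(region))
        time.sleep(BAKE_SECONDS)      # bake time before touching the next region
        if not is_healthy(region):    # halt the rollout on any regression
            raise RuntimeError(f"rollout halted: {region} unhealthy after bake")
    return results

print(staged_rollout(REGIONS))
```

The point of the structure is simply that the second region is never touched until the first has both deployed and survived its bake window.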

The next problem they needed to overcome was understanding how they can have a globally distributed intelligent and reliable routing layer. So they wanted to ensure that users when they came into the application were automatically routed to the location that was most relevant to them, which in this case had the lowest latency for them.

So to accomplish this, they utilized Route 53 latency-based record sets. Essentially, a user makes a request to the Route 53 record, Route 53 determines which regional deployment would have the lowest latency for this particular request, and it returns the IP for that deployment. The user is then directed to the relevant region with the lowest latency to them.

They also wanted an easy mechanism to shift traffic away from a region to evacuate it, for example for maintenance, but also if there were some unforeseen issues within one of the regional deployments. For that, they also utilized Route 53 Application Recovery Controller. What this does is attach health checks to that latency-based record set, and it utilizes the Route 53 highly available data plane to flip the switch on those health checks, dictating which regions can be routed to, essentially which regions will be returned from the Route 53 records.
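
As a client-side model of what those two pieces accomplish together (the latency numbers and routing toggles below are hypothetical; the real decision happens inside Route 53's data plane, not in your code):

```python
# Illustrative model: return the lowest-latency region whose routing control
# is switched on, mimicking latency-based records gated by ARC health checks.
def resolve_region(latencies_ms, routing_enabled):
    candidates = {r: ms for r, ms in latencies_ms.items() if routing_enabled.get(r)}
    if not candidates:
        raise RuntimeError("no healthy regional endpoint to route to")
    return min(candidates, key=candidates.get)

latencies = {"us-east-1": 18, "eu-west-1": 92}   # hypothetical measurements
toggles = {"us-east-1": True, "eu-west-1": True}
print(resolve_region(latencies, toggles))  # the closest region wins
toggles["us-east-1"] = False               # evacuate a region for maintenance
print(resolve_region(latencies, toggles))  # traffic shifts to the other region
```

Flipping one toggle is the whole evacuation mechanism, which is why having that switch on a highly available data plane matters.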

Being an online authentication provider, you know, security was always top of mind for them. So they also wanted to have additional protection for things like SQL injection and cross site scripting attacks and other layer 7 attacks. So for that, they deployed an AWS Web Application Firewall in front of the API Gateway endpoint.

Let's take a quick second to catch our breath and look at some of the best practices from these first slides for this scenario.

First, as you go to a multi-region architecture, deployment and CI/CD best practices become ever more important. Utilizing infrastructure as code helps ensure you have the same consistent deployments across multiple regions. Another best practice is to use regional parameters and wait conditions to deploy to only one region at a time, allowing for sufficient time after each deployment to uncover any unforeseen problems.

Also, this customer spent quite a bit of time figuring out the best solution to create that reliable and intelligent routing layer. For their particular use case, they utilized Route 53 latency-based records as well as Route 53 Application Recovery Controller, both to route users and to shift traffic for maintenance and for possible impairments.

Now, as you can probably imagine, this is a very read-heavy workload, right? Users make a lot more requests to authenticate against the authentication API than they do to make changes to user permissions. However, when a user permission needs to be changed, for example revoking those permissions, customers want that to happen as quickly and as reliably as possible.

So let's take a look at how they accomplished this. First, they looked at both synchronous replication and asynchronous replication. However, with a two-region deployment here, they decided against synchronous replication for two primary reasons.

The first reason was because a single write going into the application would need to be fully consistent and acknowledged from both regional deployments. This meant that a problem with one of the workload deployments would take down the entire workload. Thus, because the dependency was spread across regions, this could reduce their availability.

The second reason was because again, any writes going into one region would then need to be acknowledged from the secondary region. This is going to increase the latency and increase the time for these writes to happen.

So synchronous replication went against the two requirements they had at the beginning: improve availability and increase performance. For those reasons, they chose asynchronous replication, and they did that with DynamoDB Global Tables.

So DynamoDB Global Tables offered them a multi-region, multi-writer database solution with typical replication lag between the regional deployments of one second or less. There are also built-in features such as conflict resolution with DynamoDB Global Tables: if the same item is written in both regions, the last writer wins.
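
A tiny sketch of those last-writer-wins semantics, the behavior Global Tables apply when the same item is written in two regions within the replication window (the timestamps and item shape here are illustrative, not DynamoDB's internal format):

```python
# Each conflicting write carries the wall-clock timestamp of when it happened;
# the later write wins and both regions converge to it.
def last_writer_wins(item_a, item_b):
    return item_a if item_a["timestamp"] >= item_b["timestamp"] else item_b

write_region1 = {"user": "alice", "role": "admin",  "timestamp": 1700000001}
write_region2 = {"user": "alice", "role": "viewer", "timestamp": 1700000002}
print(last_writer_wins(write_region1, write_region2)["role"])  # viewer
```

The practical consequence: if your application cannot tolerate silently losing the earlier of two near-simultaneous writes, you need to design around it (idempotent full-state writes, as this customer did, or avoid concurrent writes to the same item from both regions).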

Now for the use case where a user comes in and they need to change the user's permissions...

For example, say we need to revoke a user's permission because they no longer need that access. For that, they utilized a concept called hedging, as well as making the transactions idempotent.

Let's take a look at what I mean by that. First, consider a scenario where a user comes in with a request to revoke User A's permissions. That request is sent to an intelligent SDK; the intelligence within this SDK is primarily that the SDK is aware of all the healthy regional deployments of the application.

Now, once the request is in the SDK, the SDK realizes that this needs to be sent everywhere because it's a highly important request. So it sends that request to both regional endpoints at the exact same time.

Now, due to things such as the speed of light and latency over long distances, it's not going to reach both of those regional endpoints at the same time, but it is going to reach both of those faster than if it was only going to go to one regional endpoint and replicate it to the other. Thus, we've essentially reduced our p99 latency in this case. We've also protected against any potential network transmission issues.

For example, maybe a latent network problem where a request doesn't make it to one of the regional endpoints. That request is still going to make it to the other regional endpoint and then asynchronously replicated via DynamoDB to the other region.
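The hedging pattern described above can be sketched in a few lines; this is a minimal illustration assuming a hypothetical `send_to_region` call, not the customer's actual SDK:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def send_to_region(region, request):
    # Stand-in for an HTTPS call to one regional API endpoint.
    return f"{region} acknowledged: {request}"

def hedged_write(request, healthy_regions):
    """Send the same (idempotent) write to every healthy region at once and
    acknowledge as soon as the first region responds. The slower region still
    applies the write, and the replicas converge asynchronously."""
    with ThreadPoolExecutor(max_workers=len(healthy_regions)) as pool:
        futures = [pool.submit(send_to_region, r, request) for r in healthy_regions]
        return next(as_completed(futures)).result()

# Either region may answer first, so don't assume which one acknowledges.
print(hedged_write("revoke User A", ["us-east-1", "eu-west-1"]))
```

Note this only works safely because the request is idempotent; hedging a non-idempotent write (for example, "remove one permission from the current set") could apply the change twice with different outcomes.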

One very important note here is that all of these transactions, all of these requests were idempotent. And what I mean by that is that if the same request is made multiple times, the end result is not going to change.

How they did this in this scenario: the request carries the user's entire new permission set, rather than a delta, so the resulting permission set is not determined based on the previous set.
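
Here's a minimal sketch of why sending the full permission set makes the write idempotent (the store and permission names are hypothetical): applying the same request once or three times leaves the same state, whereas a delta like "remove s3:Read" could misbehave on replays or reordering.

```python
def apply_permissions(store, user, full_permission_set):
    # Replace the whole set; the result never depends on the previous state.
    store[user] = set(full_permission_set)
    return store

db = {"alice": {"s3:Read", "s3:Write", "dynamodb:Query"}}
request = {"s3:Read"}            # the complete new set, not a delta
for _ in range(3):               # replayed or hedged duplicates are harmless
    apply_permissions(db, "alice", request)
print(db["alice"])
```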

Now, when it comes to multiregional architectures and data replication, many times the biggest thing to overcome is just the speed of light and how fast we can transfer data.

The speed of light is very, very fast, right? However, when we put it into fiber, which is how most data is transferred over long distances, and we add routing protocols and retransmissions on top of that, transferring data over long distances is much slower than the speed of light.

But really what this boils down to is the longer the distance the data needs to travel the more latency that is going to be put into the workload.

And to help out with this and to understand what options you have, there's a core theorem called the CAP theorem. The CAP theorem was created specifically for distributed systems, and it states that for a distributed system, which a multi-region architecture is, you can only guarantee two of three properties: availability, consistency, and partition tolerance.

Now, because we have a multi-region application spread across multiple locations, we always need partition tolerance here. This leaves us with really two options.

One is consistency and partition tolerance. This is useful if you have an application that has really strong consistency requirements. However, the trade-off here is that there could be decreased performance and decreased availability, because now a write needs to be fully persisted across both locations.

If you do have an application with strong consistency requirements in a scenario such as this, you might want to think about adding a third region and using a quorum-based model, so two out of three regional writes are required to be acknowledged before that write is considered committed.
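
A sketch of that quorum check over three regions (the acknowledgement values are illustrative): the write is durable once any two of the three regional acknowledgements arrive, so one slow or impaired region doesn't block writes.

```python
def quorum_ack(acks, required=2):
    """True once enough regions have acknowledged the write."""
    return sum(1 for ok in acks.values() if ok) >= required

acks = {"us-east-1": True, "eu-west-1": True, "ap-southeast-1": False}
print(quorum_ack(acks))   # 2 of 3 confirmed: the write is committed
acks["eu-west-1"] = False
print(quorum_ack(acks))   # only 1 of 3: not yet committed
```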

The second option though, and the one that I would say probably the majority of customers utilize, is availability and partition tolerance. So with this, customers gain the advantage of increased availability for the workload. However, the trade off is that the workload needs to be okay with eventual consistency.

And if there's a problem with one of the regional workloads, then any in-flight transactions could potentially be lost. However, the increased availability is a big winning point for that.

Now, if you do choose the second option there, availability and partition tolerance, one core thing you're going to want to make sure you're monitoring is data replication.

So in this scenario, using both DynamoDB Global Tables and S3 Cross-Region Replication, make sure you're monitoring the replication lag between the two locations and have an appropriate alert set up, such that if the replication lag increases beyond a threshold, you alert your on-call engineers and take appropriate action.
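
A sketch of what such a lag alarm decision could look like. In practice the samples would come from metrics like DynamoDB's ReplicationLatency in CloudWatch; the threshold and sample values below are hypothetical.

```python
def should_alert(lag_samples_ms, threshold_ms=2000):
    """Alert on sustained lag, not a single spike: every recent sample high."""
    return len(lag_samples_ms) > 0 and min(lag_samples_ms) > threshold_ms

print(should_alert([400, 650, 500]))      # healthy: well under threshold
print(should_alert([2500, 3100, 2900]))   # sustained lag: page the on-call
```

Requiring all recent samples to breach the threshold is one simple way to avoid paging on a transient spike; a real alarm would likely use CloudWatch's evaluation periods for the same effect.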

It was also very important for this customer since they're trying to increase availability across the world to understand availability metrics and measure that appropriately. So they did a few things here:

  1. They approached this from three different angles. The first angle was simply looking at the server logs: what are the error rates, and what is the processing time for a request?

  2. The second was that they ran synthetic checks, not from the region being monitored, but from a different region, either region two or even a third region. These synthetic checks went through common user workflows, monitored the active duration for each workflow, and alerted if the duration exceeded what was expected.

  3. The third viewpoint was from the user's side. They utilized CloudWatch Real User Monitoring to get a good understanding of duration and error rates from the end user's perspective, and then they fed that into the monitoring tools.

We call this differential observability: taking three different viewpoints to gain a more holistic, truer picture of how healthy the application is.
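
As a minimal sketch of combining the three viewpoints into one verdict (the thresholds are illustrative; the idea is just that any one unhealthy signal is enough to flag the region, even when the others look fine):

```python
def region_health(server_error_rate, synthetic_duration_ms, rum_error_rate):
    checks = {
        "server_logs": server_error_rate < 0.01,    # errors from our own logs
        "synthetics": synthetic_duration_ms < 1500, # canary run from another region
        "rum": rum_error_rate < 0.02,               # what real users experience
    }
    return all(checks.values()), checks

healthy, detail = region_health(0.002, 900, 0.005)
print(healthy)                         # all three viewpoints agree: healthy
healthy, detail = region_health(0.002, 4000, 0.005)
print(healthy, detail["synthetics"])   # synthetics caught what the logs missed
```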

This allows them to not only understand how healthy the application is holistically, but also allows them to understand and dig into problems that might not have been seen by only taking a single viewpoint of the workload.

So again, let's take a break to look at some of the best practices from these:

  • Understand the consistency characteristics of your workload and then the replication options that are available to meet those consistency requirements. Many customers use asynchronous replication simply for improving availability of the workload. But we did talk about using a quorum-based model if your workload does have strong consistency requirements.

  • If you do have a multi-region multi-writer workload going on, then you're going to need to make sure you have appropriate checks in place to ensure for consistency - making sure your requests are idempotent, making sure you have some kind of conflict resolution, and then detecting any inconsistencies that might happen because you're writing to multiple locations at the same time.

  • And then we talked at the end about differential observability: one viewpoint of your application is good, but taking more viewpoints from different perspectives will give you a more holistic, truer view of your application's health.

The next thing this customer needed to ensure was that each region could function 100% independently of the other regions. They didn't want any problem in Region 1 to affect Region 2.

So this is what we call regional independence, and they achieved it by enforcing very strict fault isolation boundaries at the regional level. What this essentially boils down to is that there are no cross-region calls and no cross-region dependencies between the application in Region 1 and Region 2.

And then they tested to find the unknowns. During a planned event, they shifted all traffic away from Region 1, so all traffic was going to Region 2. When they did that, they noticed that many resources within Region 1 were still actively processing data. This allowed them to uncover those unknowns.

It turned out to be a legacy system making some backend calls that were just not expected. Testing helped them uncover that and then put in place those stricter fault isolation boundaries at the regional level.

Now, as this customer went on this multi-region journey, they did a pretty good job of estimating cost as it pertained to their increased IT spend, their increased infrastructure cost. However, they admitted they did a pretty poor job of estimating the operational cost for the workload.

So they needed to take a step back and really understand from an operational standpoint where the operations costs were going and what was the expectations going forward.

They needed to look at policies and procedures as they relate to not a single-region application anymore, but a multi-region application. For example, runbooks needed to be updated, even for simple things such as: when an alert comes in, the first thing to do is determine which region, which regional deployment, it's coming from.

They also needed to create completely new runbooks for things like evacuating a region or shifting traffic if they needed to for maintenance. These runbooks also needed to take into account different running states.

For example, if the application was fully healthy and they needed to evacuate a region, they're probably going to take different steps than if the workload was partially unhealthy, where not everything was fully available.

They wanted to ensure that all actions were going from the healthy region.

The last thing here was that many service quotas are set at the regional level. As the workload grows, you're likely going to make service quota increase requests, and just know that the vast majority of these are not at the account level.

So make sure that you're tracking those service quotas. And when you go to a multi-region architecture, as you increase a quota in one region, make sure you're also increasing it in the other regions.

And you can use the AWS Service Quotas service for this. It can help you check and understand the different service quotas as defined at the regional level, and you can also request quota increases from that service as well.
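
One way to keep regions from drifting apart is a periodic comparison of quota values across regions. A sketch, assuming the per-region values have already been fetched (in practice via the Service Quotas API; the quota names and numbers below are hypothetical):

```python
def quota_drift(quotas_by_region):
    """Return quotas whose values differ between regions."""
    drift = {}
    for quota_name in next(iter(quotas_by_region.values())):
        values = {r: q[quota_name] for r, q in quotas_by_region.items()}
        if len(set(values.values())) > 1:
            drift[quota_name] = values
    return drift

quotas = {
    "us-east-1": {"lambda_concurrency": 5000, "apigw_rps": 20000},
    "eu-west-1": {"lambda_concurrency": 1000, "apigw_rps": 20000},
}
# Flags the Lambda concurrency mismatch: an increase was applied in only one region.
print(quota_drift(quotas))
```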

So more best practices that we uncovered here:

  • Understanding your dependencies and architecting for regional independence is going to improve the overall resilience of your application, so that each regional workload can function 100% independently of the other regional workloads.

  • Also, as you're planning and budgeting for your expansion, don't forget about the increased operational overhead and what that's going to mean for your specific application. Service quotas are mainly set at the regional level, so if you're increasing a service quota in one region, make sure you have runbooks and checks in place to ensure you're doing that for all relevant locations.

  • And as you're updating your runbooks, make sure you also take into account healthy states and unhealthy states, and know which procedures to run in which states for your multi-region workload.

All right, if you remember, at the beginning of this customer scenario they needed to shift their mindset to have everything done with infrastructure as code, which is great, and the company and the engineers were doing that. However, it's important to also put additional checks in place to ensure, especially in this scenario, that Region 1 is the same as Region 2, because they didn't want a user to go to different regions and have a different user experience.

So that was very important. To help with the detection, they again utilized Route 53 Application Recovery Controller to run readiness checks. These readiness checks continually poll the resources in both regional locations, looking for things like service limits and quotas, throttling, and any other version differences that may happen, and then alert when inconsistencies are found so that somebody can go and remediate them.

Next: this customer was successful in deploying the application across multiple regions. But one of their core requirements, again, was to increase overall performance for their end users, and they knew that as they increased performance at a global scale, they were going to be able to open up new markets for their product.

So as they were looking at which region to go into next, they needed to create a framework for that, and the beginning of the framework is right here. First, they looked at where their users were coming from, specifically which locations had the highest latency, and then weighted that analysis by request volume.

Next, they narrowed down the region they wanted to go into. Today we have 32 regions around the world, and they narrowed it down based on where they were seeing the highest amount of latency.

Next, they looked at cost. Different regions can have different pricing for different AWS services, and this is because it costs us different amounts to provide services due to things like taxes, labor, power, and materials, which can all be vastly different across different geographies in the world. The best place to start here is the AWS public pricing pages as well as the AWS Pricing Calculator, which will help you determine the cost differences between regions.

The third one was service availability. Not every AWS service is available in every AWS region. We do have initiatives to try to make everything available everywhere; however, we admittedly have a pretty long way to go there. So if a service you need is not available in a region you need for your workload, go ahead and let your AWS account team know: 90 to 95% of our roadmap is directly influenced by that feedback from you.

So what this customer did was take a full inventory of their workload, listing all of the AWS services and features being used, and then match that against the AWS public pages for the regional availability of each service.
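
That inventory match is essentially a set difference; here's a small sketch (the service names and the candidate region's availability listing are illustrative, not real regional availability data):

```python
def missing_services(workload_services, region_services):
    """Services the workload needs that the candidate region doesn't offer."""
    return sorted(set(workload_services) - set(region_services))

inventory = {"lambda", "dynamodb", "ecs", "s3", "waf"}   # from the workload audit
candidate_region = {"lambda", "dynamodb", "s3"}          # hypothetical listing
# Gaps to raise with your account team before committing to the region.
print(missing_services(inventory, candidate_region))
```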

All right, so the final best practices learned from this customer. The first is to have some kind of tool in place to detect inconsistencies between the different regional workloads. Especially in an active-active setup like this, you don't want users to have a different user experience depending on the location they go to.

Also, as you're selecting the region to go into, there's a lot to consider. Make sure you're figuring out what works best for your workload, your business, and your users; looking at things like user location, latency to them, cost, and service availability are some very common starting points.

Here's our current evolving architecture, and it's always evolving. We didn't talk about everything due to the limited amount of time; we only hit the top ten items. You'll notice we missed a few things, such as the ECS containers and their replication, how the Lambda functions are going to be deployed, and the EC2 instances as well.

So it's important to take your time, first understand your requirements, and then go step by step, piece by piece, to evolve your architecture into a multi-region one.

Now I'm gonna hand it back over to Neeraj to wrap us up. Thank you, Joe.

All right. So based on these two stories, these two scenarios, let's go over some concluding thoughts, the key takeaways we want you to take away from this presentation.

So the first point is that AWS regions, and Joe covered this in the beginning, are built to be highly resilient, with all the infrastructure-level redundancy through Availability Zones and so on that you get. So most applications do not need to be multi-region. Have that reasoning for why you're going multi-region, driven by your business needs, driven by your business case, to be very specific.

And some of the nuances that you saw, specifically in the scenario that Joe walked through, are that data and dependencies can be complex. So we have to be very thoughtful, before we embark on the multi-region journey, about what kind of dependencies we are taking on and what kind of data access patterns and data consistency needs we have, because depending on that, it will steer you in a specific direction, especially in the case of active-active, which can be quite complex. You have to think about idempotency and all the other things that come with those architectures. And observability, again, I cannot emphasize enough.

This is the foundational rock for you to build either single-region resilient applications or, in this case, multi-region resilient applications. So observability, how you're deploying the code, how you're testing, and, going back to the detection and mitigation controls, it is fairly important for you to think these through ahead of time before you embark on this journey.

And of course, there are additional costs you need to think about when you're deploying these. And again, we're really just scratching the surface in this presentation. There's no single way of doing multi-region; even for, let's say, the active-active example, there are several different ways you can implement an active-active or active-passive multi-region architecture.

So let it be driven by your business case, and of course, we are here to help: work with your account team, work with your AWS representatives, to help steer you in the right direction.

So a couple of resources I want to quickly call out. As I said earlier, this session is not super deep, but if you want to understand more about this topic, we did publish a white paper on multi-region fundamentals. One of our colleagues wrote this a few months ago; it goes into a lot more depth than we have covered, based on the same best practices that we've been discussing in this session.

Another white paper we recently published is the Resilience Lifecycle Framework. You can take a picture of the QR code there if you want. It shares the strategies and the mechanisms that you should be thinking through at different stages; it's really based on the SDLC model.

So, as an example: setting objectives and understanding your requirements, or, when you are at the design and implement stage, what kind of resilience thinking you need to bring. So it's not just about multi-region; these are very broad resilience best practices in a white paper aligned to the lifecycle notion that we have in this framework.

And besides this thought leadership, based on your feedback, the feedback we get from our customers, we are also investing in different resilience offerings. A couple of them we did touch on in this presentation, such as the Fault Injection Service, which can help you manage your chaos experimentation.

We also talked about Route 53 Application Recovery Controller for having those routing controls and readiness checks. But there are other services we have not touched on that you may want to take a look at. One is AWS Resilience Hub; another is AWS Backup, specifically if you're thinking about that entry-level, base strategy around backup and restore.

There's also AWS Elastic Disaster Recovery, another service you can think about when you're considering those multi-region DR strategies. Besides that, we also publish a bunch of AWS Solutions; we have a landing page, just search for AWS Solutions, where you can find some of these multi-region and many other resilience-related solutions that we have published over time.

So with that said, I want to thank you all for coming in. And again, it's late in the day, so we really appreciate your time. We would request that you leave feedback; it's going to really help us continuously improve these sessions. So thank you very much, and enjoy re:Invent.
