Building global event-driven applications

Hello, and welcome to this session, Building global event-driven applications. My name is Marcia Villalba. I'm part of the developer relations team for the serverless team at AWS, and I've been doing serverless in production since its early beginning, so early 2016. I'm the host of a YouTube channel. If you are into serverless, you might already know my face. The channel is FooBar Serverless, where I have 300 tutorials or more on how to get started with serverless. A lot of what we are going to talk about today is going to be there in the upcoming weeks.

So let's talk about multi-region. But I want to start this conversation by asking you a question: do you really need to be multi-region? Because nowadays with AWS it is so easy to start working with services that are multi-region. But it's not only about the services. There are a lot of other things that you need to think about that sometimes you're not aware of.

So let's address the elephant in the room first, and then let's talk about what multi-region means for serverless: how you architect serverless multi-region applications and event-driven applications. And then we'll talk, from the perspective of developers, about what it means to operate this type of application.

So you want to go multi-region. But first, let's talk about why you don't want to go multi-region. I hope nobody leaves the room after I say this, because I really like that the room is full. So please don't leave. But maybe this helps you to consider whether you are going down the right path. Multi-region has a lot of challenges, but I will address two.

The first one is replication lag. This is the laws of physics. We have regions all around the world. The world is big, and data moving from Virginia to Ireland needs to go through physical space, and that adds some latency to your application. Maybe there are 9,000 kilometers between one region and another, and that might add 50 to 80 milliseconds of replication lag. You might think, nah, that's not important. Believe me, in the architecting part we will talk about how that can really affect what you are building and what challenges it brings into your application. So replication lag is a big challenge when you go multi-region, and if you don't architect your application for it, you will basically be shooting yourself in the foot.

Then testing. We always talk about testing. I know a lot of customers tell us that testing serverless applications is challenging. Testing distributed systems is challenging. Now add one more layer of complexity: testing multi-region, distributed, serverless, global endpoints, blah blah blah. That's very challenging. So you really need to commit to testing. If you're not committed to testing, then don't even try it, because you are going to add complexity and cost. And when I talk about cost, I mean human resources and also economic cost, because basically you're using more resources in your application without seeing any benefit. If you don't really commit to testing, then don't even try.
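To put rough numbers on the replication lag described above, here is a back-of-the-envelope sketch. It assumes light travels through optical fiber at about two thirds of its speed in vacuum and that the cable follows a straight 9,000 km path. Both are simplifications, so treat the result as a lower bound, not a measurement.

```python
# Rough lower bound on cross-region latency caused by distance alone.
SPEED_OF_LIGHT_KM_S = 299_792  # speed of light in vacuum, km/s
FIBER_FACTOR = 2 / 3           # assumption: light in fiber travels at roughly 2/3 c


def one_way_latency_ms(distance_km: float) -> float:
    """Time for a signal to cover distance_km over fiber, in milliseconds."""
    return distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR) * 1000


def round_trip_latency_ms(distance_km: float) -> float:
    """Out and back, for example a write plus its acknowledgement."""
    return 2 * one_way_latency_ms(distance_km)


if __name__ == "__main__":
    distance = 9_000  # km, the figure used in the talk
    print(f"one way:    {one_way_latency_ms(distance):.1f} ms")
    print(f"round trip: {round_trip_latency_ms(distance):.1f} ms")
```

Distance alone accounts for roughly 45 ms one way. Real cable routes are longer than a straight line and network equipment adds its own delay, which is consistent with the 50 to 80 milliseconds quoted in the talk.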

When we talk to developers, a lot of people tell us that they want to go multi-region because they want the application to be up, they want to be resilient to everything. But when we look at the most common reasons why our customers have failures and why things are broken, we see that code and bad configuration are the reason most of the time. And that's not addressed with multi-region. That's addressed with a good deployment strategy, using multiple deployments, and having good testing. Multi-region will only add more mess to your broken state.

So maybe that's not the most reasonable way to go. Another reason customers give for going multi-region is: what if something fails in our core infrastructure? But you're here in a serverless talk. When you are using serverless components, we promise you that all our services have high availability built in, meaning that we, AWS, take care of that availability. In the case of Lambda, for example, you are using a regional service, and we will move the traffic according to the health of our Availability Zones, so you don't need to do that. So if you're in a serverless world, this is not a big deal for you. If you are running instances, then instead of a multi-region architecture, what you need is a multi-Availability Zone architecture, where you think about availability within your region.

Another reason people give is: what if our data gets corrupted? What if we lose data? Well, that's not addressed with multi-region. That's addressed with a good backup strategy, a snapshot strategy. And if you use S3 in your own region, the one that you're already using, to keep the backups and snapshots, you already get a huge number of nines of availability. So you don't need to go multi-region for that.

And then we end up at this really improbable scenario: the zombie apocalypse, the natural disasters. Well, then yes, you need multi-region. But please don't leave. No, no, no, don't leave, sit down, because there are still many reasons why customers want to go multi-region. There are three, mostly three.

The first one is the zombie apocalypse situation. For some organizations, the zombie apocalypse is a real consideration. They cannot have any downtime. These are the scenarios where you have a service that is so crucial that if something fails, you need to fail over your traffic to another region. No matter if it's a bad deployment or a bad part of your infrastructure, it doesn't matter, you just need to fail over. And that's OK, that's valid. We understand that customers have reasons to have this in mind. But you, as developers and architects, need to think about whether this is a real problem in your organization. Do you have some of those services?

The second reason why customers want to go multi-region is that you have a global application, with customers distributed all around the world, and you want to serve them in the best way possible. For that, you use the region that is closest to them. Because again, the laws of physics: just as latency affects your servers, the latency to connect one backend to another backend in a different region, it also affects your end users. When they are expecting the response from a website or a service, the closer that service is to them, the faster it will respond. And we know that end users don't have a lot of patience. They get very sensitive, very fast, and they go to another company that responds faster. So we want to provide the best service.

Another reason is that you need to meet legal and data regulatory compliance. This is the case for a lot of verticals. For example, in Germany, a lot of verticals require that the data of German citizens is stored in Germany, so they need to use the Frankfurt region. And that happens in many, many countries. So you need a multi-region strategy to address the situation.

But the question you need to ask when you're looking at your architecture is: does it need to be multi-region? And when I say "it", I'm referring to your application, your service, your component, whatever you're trying to make multi-region. Let's look at this example. You are running a global ecommerce company. I don't know which one, but imagine you run an ecommerce. You have your end-to-end order processing: your end user sees some recommendations of products, then makes the purchase, the fulfillment happens, the payment happens. All these activities happen asynchronously, but they are really critical for the ordering process.

Now, let's put ourselves in the disaster recovery case. You are an ecommerce that is really worried about the zombie apocalypse, so you create a multi-region strategy where you are working on the failover. What you need to think about is which components of your application are critical for the success of that order. You always want your customers to be able to order a product. So maybe your order flow is multi-region, and the backend systems that support the ordering process need to be multi-region, because if something fails, the fulfillment of the order will not happen. But your recommendation service? We can live without it. So thinking about which services need to be multi-region is critical.

And when we think about the same ecommerce, but now they want to serve their users with the best latency possible, the architectural decisions are very different. Now you will maybe make your front-end application and your ordering system very fast. That maybe includes the recommendation service as well, because that's visible to the user. But all the asynchronous processes you might keep in one region. You don't need multi-region for those because they don't affect the latency. The user will still receive the order in a couple of days, so it makes no difference. So thinking about which components you take to a multi-region strategy is critical, because not everything needs to be multi-region.

Then you need to ask yourself how you handle your storage, how you handle your events (because we are here in serverless, event-driven architectures), and how you manage your infrastructure and configuration. Those are the questions that we are going to look at today during the talk.

So remember, deciding on multi-region is not black and white. I heard a story from a solutions architect, who told me that a customer approached them and said: hey, we have gone multi-region, but this is painful. We don't know what we have done. Please, can you look at what we have done and help us? So this SA went in and looked at the architecture, and they had made everything multi-region. And I mean everything. Even the menu of the cafeteria was multi-region. Does it need to be multi-region? No, it doesn't. It doesn't matter. So think about which components of your application need to go multi-region, aligned with the needs of your multi-region strategy.

So let's address something that you may have seen. If you have been attending multi-region talks at AWS, maybe you have seen the business continuity slide. This is a classic in our disaster recovery talks. It is something you need to think about when you are in that zombie apocalypse moment. These are the two important questions that will drive the strategies you choose to go multi-region: how much data can you afford to lose, and how quickly must you recover?

Sure, you might be thinking: I don't want to lose any data, and I don't want to have any downtime. If you go that way, it will be very expensive, it's very hard, it adds a lot of complexity, and you need to test. So what you want is to find the right balance between mitigating your risk and having an achievable architecture.

And when we show the business continuity slide, we usually present it with this other slide, the thermometer of strategies for business continuity. Here you can see backup and restore, which is our cold strategy, meaning that there is no infrastructure provisioned in the other region; you just back up everything into another region. Then we have pilot light, warm standby, and active-active. What these mean is that you have the infrastructure there, and you add more and more capacity until you have the same capacity in both regions in active-active. But we are in the serverless world, and we have this magic automatic scalability, so traffic drives the amount of capacity that gets provisioned. So in the serverless world, pilot light, warm standby, and active-active are the same monster. During this talk we are going to address backup and restore and active-active, and then it's up to your traffic how much capacity gets provisioned in the active-active scenario.

These are summary slides; we will go into the depth of all the services during the talk. With backup and restore, the idea is just to back up your data, cold, into another region. Here you can use different services, like S3 and DynamoDB. Most of our databases allow you to do that, and EventBridge allows you to do that. So there are many, many services where you basically tick a box, say in which region you want the replication, and it happens automatically for you. Active-active is a little bit more complex: you have two full-stack applications running in both regions, and when things happen, you just divert the traffic.

So let's get into the beef of the session: the architecting part, where I will show you how to do these two strategies in more depth. When I joined AWS, I heard about this theorem like a million times. This is one of AWS's favorites: the consistency, availability, and network partition theorem, or CAP for friends. What it states is that you can only have two of these three things. You cannot have it all, because life is hard and challenging and cruel. We are in a multi-region strategy, which means we already have network partition: we have two or more regions. So we can only have consistency or availability. You will see that challenge spread across the architectural patterns that we are going to cover through the session. We will see it when we talk about databases, we will see it when we talk about events, and we will see it when we talk about data and secrets. And in case you don't see the pattern: everything is about data. Data is the challenge in multi-region, how you replicate the data and how you query it. The infrastructure is serverless; you just deploy it and wait for somebody to call it. But if the data is not there, then the other region will be useless.

So let's talk about databases first. I will show you three patterns for architecting your data. These are software engineering patterns that are not attached to any service. Then we will look at some of the services that can help you.

The first pattern is read local, write global. Here you have a primary database and read replicas all around the world. You always write to the primary, and you can read from anywhere, because the data is replicated. So we have users in San Francisco, and they read from the closest region, which is Oregon. And we have users in Taipei who read from Southeast Asia. Everything is good. Then the users in San Francisco write to Oregon, and the users in Taipei also write to Oregon. This is simple. This is a great pattern. The thing is, if the users in Taipei read too fast, they might not get the latest version that was written. So here we have a consistency issue. The good thing is that this applies to workloads that don't have that many writes. Authentication is a good example: you create an account once, and you only write to it when you want to change the password or edit something like that. So you have a small percentage of writes and a lot of reads, and this is a great pattern for that. Or content management: you have a blog or a video platform, one person uploads a new video, and then millions read it. That's a great fit. You don't have that many collisions.
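The stale read in this pattern can be sketched with a small in-memory simulation. This is only an illustration, not any AWS API: the class name, the region names, and the explicit `replicate()` step that stands in for replication lag are all assumptions made for the example.

```python
class ReadLocalWriteGlobal:
    """Toy model: one primary that takes all writes, plus read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.stores = {region: {} for region in [primary, *replicas]}
        self.pending = []  # writes accepted by the primary but not yet replicated

    def write(self, key, value):
        # Every write goes to the primary, wherever the user is.
        self.stores[self.primary][key] = value
        self.pending.append((key, value))

    def read(self, region, key):
        # Reads are served by the local store, which may be behind.
        return self.stores[region].get(key)

    def replicate(self):
        # Stands in for replication lag: replicas only catch up when this runs.
        for key, value in self.pending:
            for region, store in self.stores.items():
                if region != self.primary:
                    store[key] = value
        self.pending.clear()


if __name__ == "__main__":
    db = ReadLocalWriteGlobal("oregon", ["southeast-asia"])
    db.write("profile:42", "new-password-hash")
    print(db.read("southeast-asia", "profile:42"))  # read too fast: not there yet
    db.replicate()
    print(db.read("southeast-asia", "profile:42"))  # replica has caught up
```

A reader in the replica region sees the old state until replication completes, which is why the pattern suits read-heavy workloads with few writes.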

Another pattern is read local, write partitioned. Here we use a partitioned database, meaning that we decide how the users get partitioned. In this case, for example, the users from Asia write to the Southeast Asia partition, and that gets replicated all around the world. The users in the US are partitioned to some database in the US, but I will not show you that. So now the user in Taipei reads and writes locally, and that's good. Best of both worlds.

The problem comes when the user from Taipei flies to LA. Now the user will read from the closest replica there, but will still write to the partition that they were assigned. And this is not a big deal. Users don't travel that much.

We have a customer that implemented this pattern when they were building a fantasy league game. When users created a league, they got assigned a partition: if a user creates a cricket league, we partition it to some Asia region, because most probably those users are from India, Australia, or Pakistan, and it will be closer to them. And when a user created an American football league, it was assigned to some US region. That way everybody was partitioned and got the best of both worlds. So that's an interesting pattern as well.
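The fantasy league example boils down to a routing decision: writes follow the partition assigned when the league was created, while reads go to wherever the user is right now. A minimal sketch follows. The sport-to-region mapping and the region names are hypothetical; the talk only says "some Asia region" and "some US region".

```python
# Hypothetical assignment of leagues to home partitions, decided at creation time.
PARTITION_BY_SPORT = {
    "cricket": "ap-south-1",
    "american_football": "us-east-1",
}


def write_region(sport: str) -> str:
    """Writes always go to the partition the league was assigned to."""
    return PARTITION_BY_SPORT[sport]


def read_region(user_current_region: str) -> str:
    """Reads are served by the replica closest to the user right now."""
    return user_current_region


if __name__ == "__main__":
    # A cricket fan at home reads and writes locally.
    print(read_region("ap-south-1"), write_region("cricket"))
    # The same fan travelling in the US reads locally, but writes still go home.
    print(read_region("us-west-2"), write_region("cricket"))
```

The travelling user pays the cross-region latency only on writes, which matches the observation in the talk that this is rarely a big deal.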

The third pattern I want to show you is read local, write local. For that, you need a special type of database that is multi-primary, meaning that you can write everywhere. Here we have the users in San Francisco and the users in Taipei. They both read from the closest region, and they both write to the closest region.

Now imagine that you have a rare card collectibles site and you decide to apply this pattern. The user in San Francisco sees this rare Pokémon card and, boom, buys it. At the same time, the user in Taipei sees this rare Pokémon card, of which I only have one, and buys it. Who should win the card? I don't know. You need to decide, because now the replication has a collision. You have a race condition, and that's something your application needs to resolve. So you need to be careful when you pick this pattern. It is really cool, but if you have these collisions, your users might not be happy.

So these are three patterns that will help you when you're addressing replication lag and the consistency of your data. Now let's look at some of the services that might help you. I mentioned Amazon S3 Cross-Region Replication. This is basically a checkbox where you decide where you want to replicate your data. You can replicate objects, buckets, and prefixes, and you can even change the storage tier, so you can go for a very cold backup in Glacier, which will be super cheap. S3 also supports many other global features, like Multi-Region Access Points, so definitely check the documentation if you want to learn more. There might be some sessions on that as well.

Then Amazon DynamoDB global tables. This is a different type of table in DynamoDB that gives you that multi-primary database: you can write to any of the replicas, and everything syncs to each other. If we used this for the rare Pokémon card example, because the rule is that the last writer wins, the last person that bought the card would win. I don't know if that's fair. That's up to you to decide. But this is how DynamoDB does it and how you work with it.
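The last-writer-wins rule applied to the rare card example can be sketched in a few lines. This is a simplified model of the idea, not DynamoDB's internal implementation: the `Write` type, the timestamps, and the tie-breaking choice are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    buyer: str
    timestamp_ms: int  # when the write was accepted in its own region


def last_writer_wins(a: Write, b: Write) -> Write:
    """Resolve two conflicting writes to the same item: the later one survives.

    On an exact tie this sketch arbitrarily keeps b; a real system needs a
    deterministic tie-breaker.
    """
    return a if a.timestamp_ms > b.timestamp_ms else b


if __name__ == "__main__":
    san_francisco = Write("us-west-2", "buyer-in-san-francisco", 1_000)
    taipei = Write("ap-southeast-1", "buyer-in-taipei", 1_005)
    winner = last_writer_wins(san_francisco, taipei)
    print(winner.buyer)  # the later write overwrites the earlier one
```

Note that the buyer who clicked first loses here. If that is not acceptable for the application, the conflict has to be resolved at the application level, as the talk points out.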

I will show you the examples with CDK today, because I don't like the console that much. So this is how you do it with CDK. You can also do it with CloudFormation, or from the console. Basically, it's as simple as defining a table, and the only thing you need to do is add the regions where you want to replicate. Here you can put one, two, or as many as you need. And this is how it looks in the console: you have a table, you find the Global tables tab, and there you see the list of regions where this table got replicated.
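The slide with the CDK code is not part of the transcript. As a stand-in, here is a sketch of the CloudFormation route mentioned above, written as a Python function that builds an `AWS::DynamoDB::GlobalTable` resource as a plain dict. The table name, key name, and regions are hypothetical, and the on-demand billing mode is an assumption chosen to keep the example short.

```python
import json


def global_table_resource(table_name, key_name, regions):
    """Build a CloudFormation resource for a DynamoDB global table.

    One replica entry is added per region the table should be replicated to.
    """
    return {
        "Type": "AWS::DynamoDB::GlobalTable",
        "Properties": {
            "TableName": table_name,
            "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
            "AttributeDefinitions": [
                {"AttributeName": key_name, "AttributeType": "S"}
            ],
            "KeySchema": [{"AttributeName": key_name, "KeyType": "HASH"}],
            # Streams carry the changes that get replicated between regions.
            "StreamSpecification": {"StreamViewType": "NEW_AND_OLD_IMAGES"},
            "Replicas": [{"Region": region} for region in regions],
        },
    }


if __name__ == "__main__":
    resource = global_table_resource("demo-table", "id", ["us-east-1", "eu-west-1"])
    print(json.dumps(resource, indent=2))
```

Adding a region to the list is the only change needed to add a replica, which mirrors the "add the regions where you want to replicate" step described in the talk.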

Let's go one step further. Now you put an item. For the demo, when I write the item, I also write which region it was written from. In this case we are in Ireland, so I write from Ireland, and we have an item that says: hey, I'm in Ireland. Then I write to another replica of this table, from Virginia. And now, sorry, I go to the Virginia replica and I can see the Ireland item. So this is an interesting thing. We will talk more about DynamoDB in a second.

But now let's look at events and the challenge of events. We are in an event-driven architecture, and if you are already building event-driven architectures, maybe you're using EventBridge. EventBridge is our serverless offering for events. It's a great solution for connecting microservices together and creating an event-driven architecture.

If you are using EventBridge, you need to check out EventBridge global endpoints. This is a feature that we launched in the spring. It creates a global endpoint at the DNS level, meaning an endpoint that doesn't depend on any single region, and it helps you fail over if something goes wrong. So if you need to fail over from region A to region B, you can do that with a global endpoint. It also lets you automate the replication of events between the different regions.

So that's totally handled by global endpoints. And when we look at global endpoints, there are two modes. The first is active-archive, and this is the backup and restore mode that I showed you at the beginning, when I was talking about the business continuity strategies. We put an event into the primary, and in the primary region we have our full stack deployed, doing whatever it needs to do. But in the secondary region we just collect all the replicated events in an archive, and nothing gets triggered afterwards. We are just collecting them. So if something goes wrong, we always have all the events backed up in a different region.

The second mode is active-active. Here we have the same primary and secondary regions again. In the primary region we have a full stack, and in the secondary region we have another full stack. Now the event gets replicated, and we can decide whether we want to do something with it in the secondary region or not.

So how do we define a global endpoint using CDK? This is very similar to CloudFormation, because here we are using a level 1 construct. You need to have the two event buses already configured, and you need the Amazon Resource Name, the ARN, of each. The thing here is that both buses need to have exactly the same name. If not, this will not work. Then you need to configure your routing, and you need to configure a health check. This works at the DNS level and decides whether your primary region is working or not. If something is broken, it triggers the secondary region that is defined there. And finally, you need to enable replication.
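Because a level 1 construct maps directly onto CloudFormation, the pieces listed above (two event buses, routing, health check, replication) can be sketched as the properties of an `AWS::Events::Endpoint` resource. The helper below builds that dict and enforces the same-name rule for the buses. All ARNs and names in the usage example are placeholders, and the `role_arn` argument (the IAM role used for replication) is an addition that the talk does not mention.

```python
def bus_name(event_bus_arn: str) -> str:
    """Extract the bus name from an ARN like arn:aws:events:REGION:ACCOUNT:event-bus/NAME."""
    return event_bus_arn.split("/", 1)[1]


def endpoint_properties(name, primary_bus_arn, secondary_bus_arn,
                        health_check_arn, secondary_region, role_arn):
    """Build the Properties of an AWS::Events::Endpoint resource."""
    if bus_name(primary_bus_arn) != bus_name(secondary_bus_arn):
        raise ValueError("both event buses must have exactly the same name")
    return {
        "Name": name,
        "EventBuses": [
            {"EventBusArn": primary_bus_arn},
            {"EventBusArn": secondary_bus_arn},
        ],
        "RoutingConfig": {
            "FailoverConfig": {
                # Route 53 health check that decides whether the primary is healthy.
                "Primary": {"HealthCheck": health_check_arn},
                # Region that receives the traffic when the primary is unhealthy.
                "Secondary": {"Route": secondary_region},
            }
        },
        "ReplicationConfig": {"State": "ENABLED"},
        "RoleArn": role_arn,  # assumption: role that allows cross-region replication
    }


if __name__ == "__main__":
    props = endpoint_properties(
        name="orders-endpoint",
        primary_bus_arn="arn:aws:events:us-east-1:111122223333:event-bus/orders",
        secondary_bus_arn="arn:aws:events:eu-west-1:111122223333:event-bus/orders",
        health_check_arn="arn:aws:route53:::healthcheck/EXAMPLE-ID",
        secondary_region="eu-west-1",
        role_arn="arn:aws:iam::111122223333:role/example-replication-role",
    )
    print(props["RoutingConfig"])
```

Checking the bus names before deployment turns the "if not, this will not work" warning into an early, readable error.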

When you have all this set up, you can see it appear in your console. You will see the two event buses defined there, you will see the health check, you will see the replication, and you will get back a very important thing: the endpoint ID. That is what you need in order to send messages to the endpoint from your application.

So let's build something, so you can see this in action. We are going to build a very simple application. A Lambda function puts an event into a global endpoint. If everything is good, the global endpoint sends it to the primary region and puts it on a custom event bus, which puts it in a CloudWatch log. The event also gets replicated, and it is then put in another CloudWatch log, because that's how the rules are defined.

So do you want to see what the function that puts the events into the endpoint looks like? It looks very, very similar to a function that puts events on a custom bus. We define the same parameters. We put the event bus name there, which is why it's very important that the bus names are the same. Then we have the endpoint ID, and then we just use the normal AWS SDK command to put events onto the bus.
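The function shown on the slide is not in the transcript, so here is a sketch of the request it would build, written so that it runs without AWS credentials. The parameter names follow the EventBridge `PutEvents` API; the endpoint ID, bus name, source, and detail values are made up for the example.

```python
import json


def build_put_events_kwargs(endpoint_id, bus_name, source, detail_type, detail):
    """Build the arguments for an EventBridge PutEvents call via a global endpoint.

    Compared with a call to a regional custom bus, the only extra parameter
    is EndpointId.
    """
    return {
        "EndpointId": endpoint_id,
        "Entries": [
            {
                "EventBusName": bus_name,  # same bus name in both regions
                "Source": source,
                "DetailType": detail_type,
                "Detail": json.dumps(detail),  # Detail must be a JSON string
            }
        ],
    }


if __name__ == "__main__":
    kwargs = build_put_events_kwargs(
        endpoint_id="abcde.veo",  # hypothetical endpoint ID
        bus_name="orders",
        source="demo.orders",
        detail_type="CreateOrder",
        detail={"orderId": "1234"},
    )
    print(json.dumps(kwargs, indent=2))
```

Inside a Lambda function, these arguments would be passed to the SDK, for example `boto3.client("events").put_events(**kwargs)`. As far as I know, boto3 needs the optional AWS CRT dependency installed to sign requests to global endpoints, so check the SDK documentation before relying on this.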

The next step is to configure the health check for the global endpoint, so that's what we are going to do next. For that, we use Route 53 health checks, which allow us to check whether the region, or whatever resource, is healthy. There are many ways to define whether a resource is healthy. The way we are going to use is CloudWatch alarms: if an alarm got triggered, the resource is not healthy and we need to fail over.

The first thing we need to do is define a metric in order to build the alarm. For that, we use a metric that global endpoints provide, IngestionToInvocationStartLatency. If the latency is too high, the alarm triggers and, boom, the global endpoint fails over.

After we have the metric, we create the alarm, and then we create the health check that uses that alarm. One thing to notice, because I will show it in the demo: the health check will fail for the first minute, while it doesn't have enough data to work with. So we will see it failing.
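The chain from metric to alarm to health check, including the initial failing period, can be sketched as two small functions. This is a simplified model, not the CloudWatch or Route 53 evaluation logic: the averaging, the threshold value, and the choice to treat missing data as unhealthy are assumptions that match the behavior described for the demo.

```python
def alarm_state(latencies_ms, threshold_ms, min_datapoints=1):
    """Evaluate a latency alarm over the datapoints collected so far."""
    if len(latencies_ms) < min_datapoints:
        return "INSUFFICIENT_DATA"
    average = sum(latencies_ms) / len(latencies_ms)
    return "ALARM" if average > threshold_ms else "OK"


def health_check_healthy(state, insufficient_data_is_healthy=False):
    """Map the alarm state to a health check result.

    With the default used here, a freshly created check reports unhealthy
    until the metric has data, so traffic starts on the secondary region.
    """
    if state == "INSUFFICIENT_DATA":
        return insufficient_data_is_healthy
    return state == "OK"


if __name__ == "__main__":
    threshold = 500  # hypothetical latency threshold in milliseconds
    for datapoints in ([], [120, 180], [900, 1100]):
        state = alarm_state(datapoints, threshold)
        print(datapoints, state, health_check_healthy(state))
```

Route 53 health checks based on CloudWatch alarms let you choose how insufficient data is treated (healthy, unhealthy, or last known status), so the initial failing minute is a configuration choice rather than a fixed behavior.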

So let's put everything together into an architectural diagram, and then I'll show you how this looks.

First, the Lambda function puts an event in the global endpoint, as we saw, with the AWS SDK. Nothing is failing. Good. The event bus puts the event in the CloudWatch log group. Replication happens: the event goes to the secondary region and is put in the other CloudWatch log group.

Now the alarm got triggered. Something exploded, and we are failing over. The Lambda function again puts the event to the global endpoint, but because we are failing over, it goes to the secondary region. The event is replicated and put in both CloudWatch logs.
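The two flows in the diagram, normal operation and failover, can be summarized in a small routing simulation. It is a toy model of the behavior described above, not EventBridge code, and the `ingested_in` field is a made-up marker for the region that first received the event.

```python
def route_event(event, primary, secondary, primary_healthy, replicate=True):
    """Return the events delivered to each region's bus for one PutEvents call."""
    ingest_region = primary if primary_healthy else secondary
    other_region = secondary if primary_healthy else primary
    # Stamp the event with the region that first received it (hypothetical field).
    stamped = {**event, "ingested_in": ingest_region}
    deliveries = {ingest_region: stamped}
    if replicate:
        # With replication enabled, the other region gets a copy as well.
        deliveries[other_region] = dict(stamped)
    return deliveries


if __name__ == "__main__":
    order = {"detail-type": "CreateOrder", "orderId": "1234"}
    print(route_event(order, "us-east-1", "eu-west-1", primary_healthy=True))
    print(route_event(order, "us-east-1", "eu-west-1", primary_healthy=False))
```

In both cases each region ends up with a copy of the event; only the region that ingested it changes.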

So let's see this in action. I will show you the two CloudWatch logs so we can check what is going on. What I did is log which region the events were created from. It looks something like this: on one side you have the primary region, which is Virginia, and on the other side, Ireland. Right now we are in that failing state, that initializing state, so you can see that the events are being written from the secondary region. So that's how it works. Both logs are receiving the same events, and things are getting replicated.

So we can see there that the health check is failing. Now we are going to see what happens when the health check is healthy, because now we have enough data and our health check becomes active. Everything is good. We can see that our alarm is not triggered anymore, so things are looking good.

If we go back to our CloudWatch log groups, we can start seeing that the events are coming from the primary region. That's good, and it happens automatically. I don't need to touch anything; it just follows how the health check is behaving. And if we check both log groups, they are both getting exactly the same events, at different times because of replication lag. So that's good.

Now, what happens if something fails? To simulate a failure when you're working with health checks, the only thing you need to do is edit the health check and invert it. That's the easiest way to test, so I'm going to do that, and it makes my health check unhealthy. And if I start running the query on the logs, I start to see things coming from the secondary region. That's great.

But what now? All my events are replicated, and if I have a Lambda function, things are getting triggered twice. Imagine I have an ecommerce with this multi-region architecture. Basically, this means that my order is being processed twice, in two different regions. Will my customers be happy if they have to pay two times for the same item? I don't know.

So let's see this in action. I have this order backend ready here, and it receives an order from the global endpoint. We put that create-order event into the global endpoint, and I'm showing you what is going on in this event store table. All of this, and how to build it, will be available on my YouTube channel in the coming weeks, so you can go and check it. But basically, now you can see all the events happening twice. The order is handled in Ireland and it is handled in Virginia. We don't want that. The customer doesn't want that.

So how do we solve this? Very simple. We add a check to our rule so that we only process the events that come from the region where the event was created. That way, the events only trigger in the right region. If we are in the primary region and the event comes from the primary region, boom, we trigger the application in that region. And if it's coming from the failover, it goes that way.
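The region check in the rule can be illustrated with a tiny pattern matcher. This sketch assumes that a replicated event keeps the `region` field of the region where it was first ingested, so that a rule in each region can filter on its own region name. The matcher only handles flat, top-level fields; real EventBridge event patterns are richer than this.

```python
def rule_pattern(local_region, detail_type="CreateOrder"):
    """Event pattern for the rule deployed in local_region (simplified)."""
    return {"detail-type": [detail_type], "region": [local_region]}


def matches(pattern, event):
    """True when every field in the pattern has one of the allowed values."""
    return all(event.get(field) in allowed for field, allowed in pattern.items())


if __name__ == "__main__":
    # Assumption: the event was ingested in us-east-1 and then replicated unchanged.
    event = {"detail-type": "CreateOrder", "region": "us-east-1",
             "detail": {"orderId": "1234"}}
    for region in ("us-east-1", "eu-west-1"):
        print(region, matches(rule_pattern(region), event))
```

With this filter in both regions, the order is processed only in the region that ingested the event, whether that is the primary or, during failover, the secondary.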

So let's see it in action. Exactly the same application, but now I added that rule. We put the event, we trigger the ordering process, and now we can see that the events are only happening once. And now you will say: Marcia, Marcia, Marcia, there is one event that appears two times, I see it. And that's true. That's our replicated event. It's OK that it's there twice, because we want that replicated event, just in case. So it's good.

I know you might be thinking: well, this looks good. My event goes here, it gets replicated, and it triggers my Lambda function. But what happens when something fails? Will this rule let me keep my application working? Let's see. We have our health check, and we will invert it and see what happens. Let's see if the failover fails. Same application as before. I have not touched anything, I'm just breaking the application.

Now the health check is unhealthy, and we put an event into our backend, the same event into the global endpoint. You can see that the events are coming from the second region, once, except for one event, the order created, which is the replicated event. And that's it. Everything is good. We're happy. Things exploded, the event went to the second region, it got replicated, and it triggered a Lambda function as I wanted.

So this is a really simple way of building these event driven applications with multi region.

Let's talk about secret data. By secret data I mean application secrets, parameters, configurations, and user data. This is very important if you are in zombie apocalypse mode, because if you don't take care of these things, your application may not be ready when you need to fail over.

When you look at AWS, Secrets Manager and KMS support multi-region. All these links are available at the end, so don't worry. But it's super, super fundamental that you check whether you are using these two services in your application if you are in this mode where you want to make sure that everything works no matter what. Because if you don't replicate your secrets and one region is down, the application will not work.

And here is where testing becomes very important. Then user data. When I started building this deck, I asked my community: hey, what are your challenges in going multi-region? And everybody said Cognito doesn't support multi-region. I don't work for Cognito, but I know. There are ways to work around it, though.

The first thing is that you can replicate everything in Cognito except the passwords. So that's fine. In the case of a failover, when something explodes, your users will need to go in and create a new password. So it's not a big deal. It would be nice if it were multi-region, but it's not.

So how do you do it? You use triggers. When a new user is created, a trigger can invoke a Lambda function that saves the user data to DynamoDB, which replicates automatically with global tables, and from there you can populate a new user pool in another region. In that way your data is synchronized in another region, and if you ever need to fail over, you will have the data there, except the user passwords: you will need to ask your users to create a new password.
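The trigger step above could look roughly like the following Python sketch of a post-confirmation Lambda handler. The item layout (pk, source_pool, and so on) is an assumption made for illustration, and the write is injected as a plain callable so the sketch stays self-contained; in a real function it might be something like a boto3 DynamoDB Table's put_item(Item=...). This is not the code from the demo.

```python
from typing import Any, Callable, Dict

Item = Dict[str, str]


def build_user_item(event: Dict[str, Any]) -> Item:
    """Map a Cognito post-confirmation event to a replicable record.

    Passwords are never present in the event, which matches the talk:
    everything except the password can be copied to another region.
    """
    attrs = event["request"]["userAttributes"]
    return {
        "pk": attrs["sub"],                 # hypothetical partition key name
        "username": event["userName"],
        "email": attrs.get("email", ""),
        "source_pool": event["userPoolId"],
        "source_region": event["region"],
    }


def make_handler(put_item: Callable[[Item], Any]):
    """Build a Lambda-style handler around an injected write function."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        put_item(build_user_item(event))
        # Cognito triggers expect the event object to be returned.
        return event

    return handler
```

A second process in the other region would then read these items from the replicated table and create the users in the standby user pool.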

So it's a problem, but it's not that big of a deal. And if you want to use a third-party service for authentication, you can do that. For example, Auth0 has multi-region support, so you can check how they do it. There is material on that as well at the end.

And here is something very important I want to say. If you look at third-party solutions and you are in this mode where you want zero downtime, everything perfect, and you cannot take any risk, look at what your dependencies are offering. A multi-region strategy is hard, and a lot of dependencies might not provide you that.

So when you're looking for services that you will depend on, like third-party services, make sure that they are tested and resilient for multi-region. You might need to test them yourself as well.

So let's look at routing because routing is something developers are sometimes not too familiar with. And Route 53 is a great service that will help you a lot. It will do a lot of the heavy lifting for you.

So Route 53 is our DNS offering. It has this crazy SLA of 100% uptime and it's region independent. We have already used it for our health checks, but Route 53 provides a lot of nice ways to route users. I will show you three use cases that are aligned to the three main reasons why people tend to do multi-region: disaster recovery, like we saw with global endpoints; latency-based routing; and geolocation routing.

So if you have a user in the US and you want to offer them the best latency, you just configure Route 53 to do it. I will show you a demo in a moment. Route 53 will decide which region is closest for the user, meaning which has the shortest latency, and it will do that automatically.

If you want to do geolocation routing in order to meet your legal and data requirements, you can also configure Route 53 to say: hey, all the German customers need to go to the German region. And Route 53 will make it happen.

And then, as we saw, we have DNS failover. The example I showed you with the global endpoints is something you can apply to other resources as well: use health checks to decide where and when to route your customers' traffic. So if something goes wrong, Route 53 will route them to another region based on a health check.

So let's see an example with latency, because I already showed you the health check example. We have this application that has two regions. There is no primary or secondary region: both are active-active and equally important, because we are using latency-based routing.

Our user, depending on where they are, will write either to the DynamoDB table in Virginia or to the one in Ireland. And we are using global tables to replicate the data.

So how do I do that? I have this domain that has two A records in DNS, in Route 53. I create the first record pointing to API Gateway in one region, and I select latency-based routing.

Then for the second A record I do exactly the same. Now when a user from Germany sends a request to that domain, it writes into DynamoDB, and we can see the region from which the data was written to the table: it writes from Ireland, because that's the closest region to Germany.

And if we do exactly the same from the US, we will see that the user is written in the US replica. So in that way I have a simple application. All the code is available for you at the end; everything is built as infrastructure as code, so you can check it out.

The routing is all done by Route 53. So this is a super simple way to create a latency-based application.
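To make the two latency-based A records more concrete, here is a hedged Python sketch that only builds the change batch structure Route 53 expects for alias records with latency routing. The domain, DNS names, and hosted zone IDs are placeholders, not values from the demo, and the demo itself is built with infrastructure as code rather than with this kind of script.

```python
from typing import Any, Dict, List


def latency_alias_record(
    domain: str,
    region: str,
    target_dns_name: str,
    target_hosted_zone_id: str,
) -> Dict[str, Any]:
    """One UPSERT for an A alias record using latency-based routing."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": domain,
            "Type": "A",
            # Records sharing a name need distinct set identifiers.
            "SetIdentifier": f"api-{region}",
            # Setting Region is what selects latency-based routing.
            "Region": region,
            "AliasTarget": {
                "HostedZoneId": target_hosted_zone_id,
                "DNSName": target_dns_name,
                "EvaluateTargetHealth": True,
            },
        },
    }


def build_change_batch(
    domain: str, targets: List[Dict[str, str]]
) -> Dict[str, Any]:
    """Build one change batch with a latency record per regional target."""
    return {
        "Changes": [
            latency_alias_record(
                domain, t["region"], t["dns_name"], t["hosted_zone_id"]
            )
            for t in targets
        ]
    }


# Placeholder targets: Virginia (us-east-1) and Ireland (eu-west-1).
EXAMPLE_TARGETS = [
    {"region": "us-east-1", "dns_name": "d-us.example.com", "hosted_zone_id": "ZEXAMPLEUS"},
    {"region": "eu-west-1", "dns_name": "d-eu.example.com", "hosted_zone_id": "ZEXAMPLEEU"},
]
```

The resulting dictionary has the shape that the Route 53 ChangeResourceRecordSets operation takes as its ChangeBatch, for example through the boto3 Route 53 client's change_resource_record_sets call.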

So how do I manage an active-active environment as a developer? This is very important, because if you're a developer, a lot of responsibility falls on your shoulders. When we talk about managing your application, we have the operations team. But in the serverless world we tend to rely a lot on the DevOps mentality, where developers are empowered to take responsibility for a lot of their applications.

So the first thing that developers need to look at is deployments. Deployments are fundamental. If you are a developer working with a multi-region strategy, you need to address your deployments and you need to do infrastructure as code. That's why all the demos I show you today, and all the demos you will find, are infrastructure as code, because that's the only way to do multi-region. Having one region is already complex; having two regions without infrastructure as code is hell.

So please use it. And when you build those infrastructure as code templates, make them agnostic. What I mean by agnostic is: don't hard-code account numbers, don't hard-code stages, don't hard-code regions, don't hard-code anything. Use configuration for that.

Then you can grab exactly the same template, replicate it in your dev account, in your staging account, and in production, then take it to another region and replicate it there. It will make your life way easier. And automate everything. I know that sometimes it's very tempting: when there is an error in one region, it's so easy to go in manually, tick that box, and everything gets solved.

But then that error will replicate in your other regions, and then you will forget, and then things will explode in your face. So when you see an error, go to your infrastructure as code template, make the changes, push them to your repo, and let the CI/CD take over and deploy in an automated manner, propagating the change in a controlled way.
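As a small illustration of the "agnostic template" idea, here is a Python sketch where account, region, and stage come only from configuration and resource names are derived from them. The setting names (ACCOUNT_ID, REGION, STAGE) and the naming scheme are assumptions for this sketch; the demos use AWS CDK, and this plain-Python version only shows the principle.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DeployTarget:
    """Where a template is being deployed; nothing here is hard-coded."""

    account_id: str
    region: str
    stage: str

    def resource_name(self, base: str) -> str:
        # Derive names from configuration so one template fits every target.
        return f"{base}-{self.stage}-{self.region}"


def target_from_env(env: Mapping[str, str]) -> DeployTarget:
    """Read the target from configuration and fail loudly if it is incomplete."""
    required = ("ACCOUNT_ID", "REGION", "STAGE")
    missing = [key for key in required if key not in env]
    if missing:
        raise KeyError(f"missing deployment settings: {', '.join(missing)}")
    return DeployTarget(env["ACCOUNT_ID"], env["REGION"], env["STAGE"])
```

The same code then serves dev in one region and production in another, with only the configuration changing between runs, for example by passing os.environ from the pipeline.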

Another thing about deployments, and this is super critical for those in zombie apocalypse mode: pick a region for doing your deployments. If you're really worried about always being able to fail over, your deployments themselves might be what fails. You want to make sure you have a strategy for how to deploy even when something explodes.

So this is very important. Pick your primary region. Think about whether you are doing multi-region for your deployments. Do you need that? Maybe it's important for you, maybe it's not, but have that conversation in your teams. And then please deploy one region at a time. Don't deploy everything at the same time; we already said at the beginning that deployments are one of the main reasons things fail. So do it one region at a time: deploy, test, and then move to the next region.

The deployment window is very important if your goal is to provide the best experience for your users. Understand when the users are using your system, and deploy at the right time.

So if you have an e-commerce site that is very popular during the day, in the US you want to deploy during the US night, and in Asia you want to deploy during the Asian night. That is something you need to live with to provide the best experience. But again, be aligned with your multi-region strategy. If your strategy is about experience, think about when you're doing your deployments, because you might affect the experience of the users.
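The "deploy one region, test, then move on" advice above can be sketched as a small Python loop. The deploy and smoke_test callables are hypothetical hooks standing in for whatever your pipeline actually runs; the point is only that the rollout stops at the first region that fails its test, so the remaining regions keep the previous version.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    smoke_test: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy region by region and stop at the first failed smoke test.

    Returns the regions that were deployed and verified, plus the region
    that failed (or None if every region passed).
    """
    verified: List[str] = []
    for region in regions:
        deploy(region)
        if not smoke_test(region):
            # Stop here: later regions are left untouched.
            return verified, region
        verified.append(region)
    return verified, None
```

Ordering the regions list by each region's quiet hours is one simple way to combine this with the deployment-window advice.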

And then there is something that maybe is not specific to multi-region, but I think is important for any kind of deployment: immutable deployments. What I mean by immutable deployments is that when you deploy, you don't overwrite your existing environment. You create a new one and point your traffic there. After you test and make sure everything is good, you can destroy the old one. But if something goes bad, you can always roll back super fast to your previous environment and then deploy again. In that way, you make the impact of a deployment failure way smaller.

In my demos I use AWS CDK, and CDK is great when you're working with multi-account, multi-region, multi-environment setups. But if you're in the AWS CloudFormation world, you might want to check out StackSets. StackSets provides the features for multi-account and multi-region, so you can deploy the same stacks across different regions and accounts easily. So that's for the CloudFormation fans out there.

We talked about configuration. Deployments and configuration are the biggest reasons why things fail. And when we talk about a multi-region strategy, we increase the complexity, and we increase the number of reasons why things fail. So keeping track of configuration is fundamental. You can use something like AWS Config rules to help you do that.

So in a multi-account, multi-region strategy, you can use AWS Config rules to define the configurations that you want to have. If something drifts away from the configuration that you defined, you will receive a notification, or you can even configure it to roll back to whatever you want.
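To show the idea behind tracking configuration across regions, here is a plain-Python sketch that compares a desired configuration against what each region reports and lists the differences. This is only an illustration of drift detection, not the AWS Config API; the setting names in the usage are made up.

```python
from typing import Any, Dict, Mapping, Tuple

Drift = Dict[str, Tuple[Any, Any]]


def find_drift(desired: Mapping[str, Any], actual: Mapping[str, Any]) -> Drift:
    """Return {setting: (desired_value, actual_value)} for every mismatch."""
    drift: Drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            drift[key] = (want, have)
    return drift


def drift_by_region(
    desired: Mapping[str, Any],
    actual_by_region: Mapping[str, Mapping[str, Any]],
) -> Dict[str, Drift]:
    """Check every region against the same desired configuration."""
    report: Dict[str, Drift] = {}
    for region, actual in actual_by_region.items():
        drift = find_drift(desired, actual)
        if drift:
            report[region] = drift
    return report
```

A non-empty report is the signal to notify someone or to trigger a redeploy from the infrastructure as code template, rather than fixing the region by hand.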

So those are the two main reasons why things fail. Have strategies to prevent that.

Another thing developers need to be responsible for is observability. So make sure that you have tools to observe your applications, instrument your applications, and build some kind of tooling to figure out what is going on.

Until this re:Invent, we didn't have a way to centralize our CloudWatch logs in one region. Now we have that feature. I built slides earlier showing a mechanism for how you can do it, but now it's built into CloudWatch. But if you don't use CloudWatch, if you use some third party or whatever, make sure that you have a good way to observe all your regions and understand what is going on if you have an active-active scenario.

So this is very, very important. Whether you use the new CloudWatch features or a third party, keep that in mind: observe your applications.

And then, I cannot emphasize anything more than testing. I said at the beginning that testing is one of the big challenges. So please test regularly, test automatically, and build all those basic tests into your deployment strategy and into your everyday work.

Then you can create interesting tests and test creatively. You need to test your health checks, you need to test your networking, you need to test your dependencies. You need to test for data corruption. And even more so if you are in zombie apocalypse mode: you need to make sure that if something breaks, you can fail over.

So you need to test and test and test, and you need to test in the most creative ways possible, because sometimes things go bad and we cannot prevent that or be prepared for it. So it's good to test. And if you want to be prepared for when things go bad, one thing you can do is run game days.

Game days are a great way to test a difficult situation or an example error in your application. You simulate it, and then you can see how your teammates react. You can also test your processes, you can test your runbooks, and you can test whether you have the right credentials. Sometimes there is a big error and the developers don't have access to the systems, so how can they troubleshoot?

So game days are a great way to make sure that you can solve problems when things are on fire. You know, firefighters have drills now and then where they test their skills. Well, game days are a way to be firefighters in software development.

And finally, Fault Injection Simulator. If you are using Availability Zones, instances, and things like that, you can inject failures into your network and into your instances, simulate what those errors do to your application, and take action before those things happen for real.

Wrapping up: multi-region is complex. So do it carefully, pick the right services, and think about why you want to do multi-region. Doing everything is impossible, so have a very clear goal. Are you in it for the zombie apocalypse? For the latency? For the data regulations? With that in mind, define your architecture and tackle the problems for that goal. Sometimes you need less than full multi-region, and basically a backup is good enough. So go through that and try to figure out your deployment problems, your infrastructure problems, and your other problems by improving the application itself.

So think about the ways that you use your databases and what patterns you apply. It is not the same pattern for all problems. Look at the problems that you have and think about the patterns that will help you out and what kind of tools you can use. You can use S3 for objects, and that's very useful: there is S3 cross-region replication, and there are things like access points for S3. So you can get a lot done with S3 for multi-region. For databases, there is not only DynamoDB; Aurora also has a lot of different tools and mechanisms to support multi-region.

And there is, I think, a whole talk on Aurora globally distributed applications that you can check out during re:Invent. Then for events, global endpoints will help you a lot. It works automatically, at the DNS level, and it will do the failover for you. You can design your applications for those events, but keep in mind that sometimes those events will arrive replicated, so have a strategy for that. Use Route 53 as much as possible, because it will do a lot of the heavy lifting for you. It will take care of automating the routing and of doing the health checks for you. But please test your health checks, and don't only invert them: make sure those health checks can actually fire, and know how you will address those situations. Sometimes it's so easy to build health checks, but when the errors come, they don't get triggered because you used the wrong metrics.

So make sure that you pick the right metrics, that you create the right alarms, and that everything is automated. Think about your deployments; deployments are fundamental when we talk about multi-region. Then use tools like CDK, CloudFormation, observability components, and AWS Config rules to help you build this. And test. If you don't test, nothing works.

As I promised, all the demos, links, and everything else are in this blog post. This blog post is a living post, so a lot of things will come in the next weeks: more deep dives and how I built all the demos for this presentation. The code is there, and all the links I mentioned are there. If something is missing, just reach out to me in the comments of the blog post or on Twitter and I will make it happen.

If you don't know it, Serverless Land is our website where you can find everything about serverless. We have a category on event-driven architectures. All the serverless launches are there, and we have two sections that are really cool: one is Patterns and the other is Workflows, where you can get code that we built and deploy it from infrastructure as code into your own account. You can also contribute your own patterns and workflows that were successful for you.

So you can see how to do global endpoints, and you can see how to do API Gateway failovers. A lot of the things I talked about are available as patterns on this site.

And with that said, I have to thank you all for being a wonderful audience today. I hope you enjoyed this session. If you liked it, give me a shout-out on Twitter or on LinkedIn. And if you took some nice pictures, please share them with me. It's always nice to see myself from the other side. So yes, thank you very much.
