Goldman Sachs: The journey to zero downtime

Hello, everyone. Welcome to FSI 310. We look forward to sharing Goldman Sachs' journey to zero downtime with you. We have a lot of content to cover in this session, so we kindly ask that you engage with us with your questions after the session.

All good. Perfect.

Before we begin, let's do a quick show of hands. How many of you are finding it hard to implement architectures for zero downtime? Ok, excellent. That's a lot of you. It is indeed complex to build architectures for zero downtime, especially in highly distributed systems and event driven applications. By the end of the session, you will walk away with practical insights and techniques showing how Goldman Sachs navigated this challenge, so you can apply them in your own specific context.

My name is Manjula Nagineni. I'm a Senior Solutions Architect at AWS. As a Solutions Architect, I provide strategic recommendations for financial services customers to build highly available and resilient applications using AWS cloud services. I am part of the AWS team that is dedicated to Goldman Sachs. This is my second time speaking at re:Invent. This year, however, is really special, as I get to present with the Goldman Sachs Transaction Banking team: Rob Carson, Vice President and Technology Fellow, and Akarshit Bachu, Vice President. I've been working with this team for the last 2.5 years, and on their zero downtime journey for the last 18 months.

Without further ado, let's take a look at the agenda. We have a great agenda today. We're gonna start by establishing the problem statement: What is zero downtime? Why is zero downtime crucial for our customers? And what are some of the common factors we see driving our customers toward zero downtime?

Next, we will segue into what Transaction Banking is and their availability requirements. Following that, we will dive deep into the architectures and strategies that the team implemented using AWS cloud services to mitigate downtime effectively. Next, we will share some key wins, and of course we're going to share some of the insights we learned throughout this journey.

Finally, we'll wrap up with future enhancements that the team is planning to implement. I want you all to take a moment and think about what zero downtime means to your organization. For the purpose of this session, zero downtime means your mission critical applications continue to function without any disruptions, even during maintenance, new release launches, feature upgrades, or any other changes to the application.

There are four common factors we see in our customers, especially enterprise customers, driving the need for zero downtime. First and foremost is resiliency. System failures are very common; in fact, our CTO Werner Vogels says everything fails, all the time. So you need to prepare for system failures. You need to design and build solutions to address system failures as well as unexpected events such as natural disasters or even region-wide outages.

Next, client trust. Clients expect services to be available 24/7, especially in the banking industry. Any downtime can impact the business significantly. The impact could be dissatisfied clients, loss of trust from the clients, or even loss of revenue and potential business. To provide uninterrupted service to clients, your applications and the underlying infrastructure should be highly available and remain resilient and operational at all times.

Lastly, customers are ultimately accountable for meeting their SLAs, or service level agreements. To meet their SLAs, most organizations are investing heavily in building highly available architectures and resilient systems, even across regions, along with failover mechanisms.

I will now hand over to Rob to take you through their journey and how they addressed these requirements.

Thank you, Manjula. I'm Rob Carson, and this is also my second time presenting at re:Invent. It's special to me because last time was 2020, and it was virtual. So I'm really happy to be here with all of you.

I work for Goldman Sachs in Transaction Banking, or TXB, in case you're not familiar with us. Goldman Sachs is an investment bank, and Transaction Banking is a service that we started offering in 2020: banking services for corporate clients. So just like you would go into a bank, have an account, deposit money, and make payments, we do the same thing, and our customers are large corporations. We do the same thing that another bank would do, but on a larger scale, and we offer analytics, advanced payments, and other services that corporations need.

Before we go into the technology, let's talk about our plan a little bit. We started building Transaction Banking in 2018, and we decided to build it 100% on AWS. There are a few exceptions here and there, for instance for components that interact with on-prem components that already existed. So in some cases we have to interact, but for the most part, we're 100% on Amazon.

We have hundreds of microservices and hundreds of engineers. Our recovery point objective is zero, because we can't really lose a payment, and our RTO, or recovery time objective, if you lose a database, is less than two minutes. And very importantly, we have thousands of deployments annually. I call that out because we find that, in addition to infrastructure-related issues, one of the frequent causes of downtime is when you make a code change. So as the platform team, we made a conscious effort to ensure that the deployment process is really smooth and that it can adhere to our availability requirements.

And just some metrics: we have billions of dollars in transactions processed per day. I want to introduce you to a component called the payment rail. A payment rail is the way that we send and receive payments, and I wanna call out one of them, RTP, or real-time payments. We have an obligation to be available 99.5% of the time; that's a maximum of 3.6 hours of downtime per month. Additionally, we have a deployment window from that payment rail of 2 to 6 a.m. on Sundays.

Now, I don't know about any of you, but I really strongly prefer not to be online at 2 a.m. Eastern time on Sunday. So that further fuels our desire to follow these processes and have high availability.

So this is a really high level view of the components that we used to build the bank, and I want you to realize that we built them in layers. The top layer is what we call interfaces, or some of us call it channels. The idea is that clients can interact with us either via an API, which you can find out about on developers.gs.com, via a website that clients can log into, or they can send us files of payments.

Once we get the instructions from them through any one of these channels, we go into what's called the critical payment flow, which is what we've highlighted in orange. So we get the payment: who's the payment going to? What sort of validation do we have to do? Do we have to do a sanctions scan to make sure we're not paying someone who is sanctioned? We run a fraud check, where we check the behavior of the accounts and make sure they're not out of the ordinary.

We also want to make sure that the client either has enough money in their account to make a payment or that they have a credit line from us. And then finally we go to the payment rail. So the interfaces and the payment orchestration flow are really our critical processes: if these are down, we're not able to meet our objectives. We've been very intentional in layering the system this way, to juxtapose those components with something like the ledger or reporting and statements.

In some of these cases, like the ledger, it's a vendor product and we don't have as much control. So we've been very careful. I mentioned the credit check and the balance check; those can actually run in the critical payment flow even if the ledger is down or behaving slowly. We were very intentional in layering our application, making sure that the components that were critical for our flow are highly available. Components like the ledger or reporting are important, they have to be up, but if they're down, it doesn't impact our business as much.

Ok. So our approach to high availability is twofold. First, I'm gonna take you through our architecture, and then Akarshit is going to take you through the deployment strategies.

So, high availability architecture. I'm here representing a team called the platform team, as I mentioned before, and there are a lot more developers than there are of us. So in order to build an architecture that is highly resilient, we came up with a bunch of rules that we follow. Mainly, we want each account to be separated and each team to have their services in their own account. And we also wanted everything to look the same.

So our goal is: while applications might be running different services and different business cases, we want, as much as possible, for the infrastructure to look the same, so that we can give them the high availability that they need. The way we summarize that is: uniformity equals efficiency.

So the first rule is that we very strongly prefer Amazon ECS on Fargate. We like it because it's immutable. It's not like an EC2 node, where I could bring up an instance, somebody could log in and make a change, and then when we decide we have to recycle it, it comes up different. With containers, we have a really high level of confidence that if we take a container down and bring up a new one from the same image, it's gonna be the same container and behave exactly the same way.

Common IaC modules. IaC is infrastructure as code, and our infrastructure code is Terraform. As the platform team, one of our goals is to let the application teams not have to think about availability too much while still giving them those capabilities, and part of that was to build a couple of custom modules.

A Terraform module is a predefined set of Terraform resources that sets up infrastructure for you. We have a module called an app module. The goal is that our application teams give us a couple of key pieces of information: the location of their image, the ports they want us to open up, and potentially whether they want to be able to connect to other applications. We then turn around and create an ECS service for them that's distributed across availability zones. We'll handle certificates for them, we'll handle logging, we handle all of the things that need to be handled, including best practices in terms of security. And at the same time, we have a good feeling that they're highly available, because they're distributed across availability zones. Likewise, we have modules for Amazon Aurora to create a database and modules for Amazon DynamoDB to create a global table.

We have an S3 module, and so on and so forth. The goal is that our application teams use these modules because, first off, it's the easiest thing for them to do, and second, it gives them our best practices for availability and security.

Next, micro accounts and VPC endpoint services. I mentioned earlier that each team has their own account. One of the concerns that we had when building out Transaction Banking was cascading failures. What happens if a payments application in the critical payments flow and a ledger are in the same account, and the ledger does a deployment and makes a mistake that somehow breaks the payments app? That would clearly not be desirable.

So what we elected to do was use the Amazon account barrier as a really strong revetment to keep our applications separated. A micro account is a small account that has just one set of functionally aligned microservices. It has a VPC, and, us being a bank, there's not gonna be any internet gateway. You build your services in there, and you can be very confident that nobody's gonna destroy your secrets or break you, because they're in their own micro account.

And of course, we don't live in a bubble, so we have to communicate with other micro accounts, and we do that using VPC endpoint services. If you're not familiar with PrivateLink or VPC endpoint services, they are a point-to-point, unidirectional connection that you configure on your accounts. We like that because we actually maintain a directory of all the connectivity. So in addition to having isolation of the services, we also get a really good feeling for which services are connected to which other services.

And also, in terms of availability, these endpoints are in multiple availability zones, and they're available just like the ECS tasks.

Next, independent zero downtime deployments.

We have hundreds of microservices. It is not scalable for us to say, ok, let's stop the world, we're gonna migrate, and then we're gonna do system one and then system two and so on. So our goal is to enable everybody to deploy independently. Now, we all understand that there's always gonna be a scenario where some service is deploying a new feature that I need, and of course I'm gonna have to work with them to do that. But in general, if there's a service that has nothing to do with me, or a service being upgraded that doesn't have a new feature I rely on, I can migrate completely independently.

Finally, we follow a DevOps cultural philosophy, and what I take that to mean is we automate everything. We have a platform team; we automate technical processes as much as possible. Our goal is that as soon as you commit code, the build runs, and we can move that all the way into our lower lanes.

And we also automate functional processes. So for our operations team, our goal is to do as much as possible with as few people as possible.

So now let me take you quickly through a single micro account. You start out with an empty account, just like any other account; there's nothing in it. And now the platform team is gonna set you up with a bunch of things.

First, we're gonna set you up with an application VPC; notice there's no internet gateway. And we're gonna give you a Fargate cluster. That's not to say we won't support people who can't run on Fargate, but we're very opinionated. So if you want to use something other than Fargate, you have to let us know.

And then we have a bunch of other capabilities that are required, for example CloudWatch. We make sure that CloudWatch is set up for you. We set up appropriate encryption keys. We actually set up an observability pipeline so that your logs will go to a central observability account, where we can collect logs and do analytics on them.

Also, a lot of our systems are event driven, and our messages are typically Kafka; we also have some MQ traffic. Because we have this micro account infrastructure, we put Amazon MSK and Amazon MQ in their own account. If you need access to them, you let us know when we set up your account, and we will create an endpoint using PrivateLink into the messaging account.

Ok. Also Lambdas; I think this is actually a really important point. We have a number of Lambdas in the account that are used by application teams, whether they realize it or not. For example, we have a Lambda that rotates database passwords, and it makes sure that your password is rotated at the correct rate and that it follows our best practices.

We have Lambdas that enable certificate signing. If you look, there's a Lambda that connects to AWS Private CA; all of our connections are mutual TLS, and the goal is that an application team is able to say, I need a certificate, and they get one. At this point, there are also blue/green Lambdas that Akarshit is gonna take you through.

So now we have a basic account, and we might have to connect to other accounts. We also have ways to come in and out of Goldman. For example, we have an egress proxy controlled by our core team that enables us to safely communicate with external SaaS providers. We also have ingress paths; that's a separate account, not part of your micro account. And again, we use endpoint services to make those connections, and those are tightly governed, because we have information that we want to protect.

So at this point, you're an application team, and we know that every single application team has the same setup, and now they can go ahead and deploy their app. The first thing is we've put some tasks in the Fargate cluster. Again, the application team gave us the details of what they wanted, they deployed their tasks, we put an NLB in front of it, and we distributed them across availability zones. We know that they're gonna be highly available.

Same thing with Aurora: we will generate your replicas and make sure that your encryption is correct. And it goes on for the other stores and services as well: Amazon S3, DynamoDB.

Each application, in their own Terraform project, is run using these modules. They're using them how they want to, and it's interacting with our Lambdas and with our micro account in the way that we expect it to.

So with all these resources, application teams now have a working application.

Ok. The only thing that you have to do now is that you might have to connect to other accounts. So we have a module that supports connectivity. As I think we mentioned before, for VPC endpoints we have a directory. So if you want to connect to another account, you use the directory and we use a module to make the connection, and you also make a declaration of who you want to be able to connect to you.

And now, at scale, we have our plant.

Yeah. The last thing I'll talk about is our cross-region blueprint. This is the place where it becomes pretty hard for the platform team to make decisions that everybody has to use, because a lot of this stuff is very application specific.

So at a high level, we have our API, we can use Global Accelerator and Shield, and we can make a decision about whether traffic that comes in should go to us-east or us-west.

The problem is you inherently have data associated with your application; you have state. And all of the different resources that you might use to store your state have different global replication strategies. DynamoDB, S3, and Aurora, which is not on here, are eventually consistent; they're not immediate. So if I have traffic that goes to east and then back to west and then back to east, I might lose data.

And so we allow the application teams to make those decisions on how they want to handle that. A lot of our applications right now, when they fail over to west, go into read-only mode. So they might be a couple of seconds behind, but we know that they're not gonna corrupt the data.

We also have some applications now that are looking to say: maybe one ECS task is up in east and others are up in west. And so they want to use VPC peering to enable connectivity between east and west. But it's really an application decision, and as a platform team, we work with them to figure out what's best for their application.

And now I'm going to hand it over to Akarshit to talk about our deployment strategy.

Thanks, Rob. For the next 20 minutes, we'll talk about deployment strategies surrounding zero downtime deployments. My name is Akarshit Bachu, and this is my first time at re:Invent. I work on resiliency engineering for Transaction Banking at Goldman Sachs.

Before we dive deep into the deployment strategies, let's try to understand how zero downtime deployments are relevant here and how they're useful to us. I'm sure everyone in this room at some point could relate to waking up to 2 a.m. release incidents or having to perform releases at unfriendly hours. Even at Goldman Sachs and Transaction Banking, we are not immune to that.

In Transaction Banking, we have 300-plus engineers spread across different geographical locations, working in different time zones, and it's a hassle to coordinate between the teams during release incidents. The answer to that is zero downtime deployments. With the help of the solution that we built, we were able to minimize the need for developers to be present during release windows, and we are able to make more confident and continuous deployments by validating the updated functional changes even before we route traffic onto them.

When we started implementing the zero downtime deployment solutions in Transaction Banking, three categories broadly defined our strategy. Let's double click on the stateless services first. When it comes to stateless services, most of our workload runs on Amazon ECS on Fargate, and the ECS deployment controller by default follows the rolling deployment strategy. As you know, in a rolling deployment, a percentage of the workload is replaced with the updated version in rolling fashion, and only once the deployment is successful can you validate your updated functional changes.

Whereas in the blue/green deployment strategy, you can validate your updated functional changes first and then route the traffic onto them. We vetted multiple deployment strategies before we landed on blue/green.

So how did we implement the blue/green deployment strategy with ECS on Fargate?

We employ the blue/green deployment strategy using AWS CodeDeploy. As the next step in the lifecycle, we validate using Amazon CloudWatch Synthetics, and depending on the results of the CloudWatch Synthetics canaries, we perform a safe flip to route the traffic from the current version to the updated version.

As you might agree, deployment is a technical process, whereas release is a business process. With the help of the blue/green deployment strategy, we were able to decouple the two.

Let's try to understand how we implemented the blue/green deployment strategy by zooming into a single ECS service on the screen.

You can see an ECS service which is serving the live production traffic on port 8443 of the Network Load Balancer, which we can call the production version, or the blue version. Using AWS CodeDeploy, we create a new ECS task set within the same service, which we can call the green version, or the release candidate.

AWS CodeDeploy allows you up to a maximum of 48 hours to decide whether you want to move forward with your deployment or roll back. This 48-hour window is crucial for us: we run canaries using Amazon CloudWatch Synthetics against our green version to perform smoke testing, so that we are more confident in our updated functional changes.

Assuming in this scenario that our CloudWatch Synthetics results look good, when we want to flip the traffic, we use the AWS SAM CLI to trigger the AWS CodeDeploy flip. And here's where the magic happens.

If you follow the animation, the blue version, which we call the production version, is detached from port 8443, and the green version, the release candidate, is detached from port 9443. The green version is then attached to port 8443, thereby routing the entire production traffic onto our green version.

From here on, we call the release candidate the new blue version, or new production version, and the old blue version, the old production version, gets decommissioned after the configured graceful shutdown period.
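
For illustration only: the talk describes triggering the flip from the CLI, but the same CodeDeploy step can also be driven programmatically. A minimal sketch with the AWS SDK for Java v2, where the deployment ID and class name are hypothetical, might look like this:

```java
// Sketch: trigger the CodeDeploy traffic flip once the green task set
// has been validated. The deployment ID comes from the release pipeline.
import software.amazon.awssdk.services.codedeploy.CodeDeployClient;
import software.amazon.awssdk.services.codedeploy.model.ContinueDeploymentRequest;
import software.amazon.awssdk.services.codedeploy.model.DeploymentWaitType;

public class BlueGreenFlip {
    public static void main(String[] args) {
        String deploymentId = args[0]; // e.g. "d-XXXXXXXXX", supplied by the pipeline

        try (CodeDeployClient codeDeploy = CodeDeployClient.create()) {
            // READY_WAIT tells CodeDeploy to stop waiting and reroute production
            // traffic from the blue task set to the validated green task set.
            codeDeploy.continueDeployment(ContinueDeploymentRequest.builder()
                    .deploymentId(deploymentId)
                    .deploymentWaitType(DeploymentWaitType.READY_WAIT)
                    .build());
        }
    }
}
```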

And this is the story of how we implemented the blue/green deployment strategy with a single ECS service. But when it comes to products, it's a combination of hundreds or thousands of microservices communicating with each other.

In Transaction Banking, we have hundreds of microservices communicating with each other to make sure that payments go through, that our deposits are recorded, and that fraud checks are validated. It's important.

So what if the deployment of a single ECS service creates a cascading breakdown effect on the dependent upstream and downstream services?

To address that problem, we came up with an intelligent homegrown solution called the Blue Green Library Manager. Let's delve into its internals.

On the screen, you can see that when you trigger an AWS CodeDeploy flip, a Lambda which is hooked into the CodeDeploy deployment gets invoked. It is responsible for adding the relevant metadata on the ECS task sets, so that a process can identify whether a task set is the green version or the blue version.
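
As a rough sketch of what such a lifecycle-hook Lambda could look like (this is not TXB's actual code; the tag key, the way the green task set ARN is resolved, and the handler shape are all assumptions):

```java
// Sketch: a CodeDeploy lifecycle hook that tags the new green task set
// and then reports success back to CodeDeploy so the deployment proceeds.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.util.Map;
import software.amazon.awssdk.services.codedeploy.CodeDeployClient;
import software.amazon.awssdk.services.codedeploy.model.LifecycleEventStatus;
import software.amazon.awssdk.services.codedeploy.model.PutLifecycleEventHookExecutionStatusRequest;
import software.amazon.awssdk.services.ecs.EcsClient;
import software.amazon.awssdk.services.ecs.model.Tag;
import software.amazon.awssdk.services.ecs.model.TagResourceRequest;

public class TagTaskSetHook implements RequestHandler<Map<String, String>, Void> {
    private final EcsClient ecs = EcsClient.create();
    private final CodeDeployClient codeDeploy = CodeDeployClient.create();

    @Override
    public Void handleRequest(Map<String, String> event, Context context) {
        // CodeDeploy passes these two fields to every lifecycle hook.
        String deploymentId = event.get("DeploymentId");
        String hookExecutionId = event.get("LifecycleEventHookExecutionId");

        // Tag the green task set so the Blue Green Library Manager can tell
        // at bootstrap whether it is running as blue or green.
        ecs.tagResource(TagResourceRequest.builder()
                .resourceArn(resolveGreenTaskSetArn(deploymentId))
                .tags(Tag.builder().key("deployment-color").value("GREEN").build())
                .build());

        // Report success so the CodeDeploy deployment can continue.
        codeDeploy.putLifecycleEventHookExecutionStatus(
                PutLifecycleEventHookExecutionStatusRequest.builder()
                        .deploymentId(deploymentId)
                        .lifecycleEventHookExecutionId(hookExecutionId)
                        .status(LifecycleEventStatus.SUCCEEDED)
                        .build());
        return null;
    }

    private String resolveGreenTaskSetArn(String deploymentId) {
        // Elided for brevity: a real implementation would call
        // ecs:DescribeServices and pick the ACTIVE (green) task set.
        return System.getenv("GREEN_TASK_SET_ARN"); // hypothetical shortcut
    }
}
```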

For the sake of brevity, we categorize our applications as RESTful services and event driven services. When you bootstrap an application, the Blue Green Library Manager in the application pulls the tag of the ECS task set and determines whether the running task set is the blue version or the green version.

And the developers, alongside the application logic, can pass in configuration defining whether the green version should be talking to the upstream green version or the upstream blue version, and likewise whether the blue version of the task should be talking to the upstream blue version or the upstream green version.

During the bootstrap of your service, you also pass in a configuration file which encapsulates the upstream blue and green service details, so that the Blue Green Library Manager takes care of routing each REST call to the relevant upstream service, as in the sketch below.
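
A minimal sketch of that routing idea, assuming a Spring client-side interceptor; the class, field names, and host values are illustrative, not the actual Blue Green Library Manager:

```java
// Sketch: rewrite the upstream host of each REST call based on this task
// set's color and the developer-supplied routing preference.
import java.io.IOException;
import java.net.URI;
import org.springframework.http.HttpRequest;
import org.springframework.http.client.ClientHttpRequestExecution;
import org.springframework.http.client.ClientHttpRequestInterceptor;
import org.springframework.http.client.ClientHttpResponse;
import org.springframework.http.client.support.HttpRequestWrapper;
import org.springframework.web.util.UriComponentsBuilder;

public class BlueGreenRoutingInterceptor implements ClientHttpRequestInterceptor {
    private final String myColor;           // "BLUE" or "GREEN", from the ECS task set tag
    private final String upstreamBlueHost;  // from the upstream config file
    private final String upstreamGreenHost;
    private final boolean greenTalksToGreen; // developer-supplied routing preference

    public BlueGreenRoutingInterceptor(String myColor, String blueHost,
                                       String greenHost, boolean greenTalksToGreen) {
        this.myColor = myColor;
        this.upstreamBlueHost = blueHost;
        this.upstreamGreenHost = greenHost;
        this.greenTalksToGreen = greenTalksToGreen;
    }

    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body,
                                        ClientHttpRequestExecution execution) throws IOException {
        // Green tasks talk to the green upstream only if configured to;
        // otherwise everyone uses the stable blue upstream.
        String targetHost = ("GREEN".equals(myColor) && greenTalksToGreen)
                ? upstreamGreenHost : upstreamBlueHost;
        URI rerouted = UriComponentsBuilder.fromUri(request.getURI())
                .host(targetHost).build(true).toUri();
        return execution.execute(new RewrittenRequest(request, rerouted), body);
    }

    /** Wraps the original request with the rewritten target URI. */
    private static final class RewrittenRequest extends HttpRequestWrapper {
        private final URI uri;
        RewrittenRequest(HttpRequest original, URI uri) { super(original); this.uri = uri; }
        @Override public URI getURI() { return uri; }
    }
}
```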

Similarly, when it comes to the event driven services, we use MSK as our communication channel. Similar to the RESTful services, we pass in a configuration file which encapsulates the blue and green Kafka topic details, so the Blue Green Library Manager makes sure that produced messages are sent to the right Kafka topic.

But what if your service consumes messages from Kafka as well? To address that, we built a Blue Green Kafka Consumer, a component which takes care of that problem.
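
On the producer side, the topic-selection idea might look something like this sketch (topic names, the color source, and the broker endpoint are assumptions, not the actual TXB component):

```java
// Sketch: pick the blue or green Kafka topic at produce time based on
// the task set's color resolved at bootstrap.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BlueGreenProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "msk-broker.internal:9094"); // hypothetical MSK endpoint
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Color resolved at bootstrap from the ECS task set tag.
        String myColor = System.getenv().getOrDefault("DEPLOYMENT_COLOR", "BLUE");
        // Blue and green topic names would come from the configuration file
        // described above; hardcoded here for illustration.
        String topic = "GREEN".equals(myColor) ? "payments.green" : "payments.blue";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>(topic, "payment-123", "{...}")).get();
        }
    }
}
```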

Each and every application has its own unique use cases, but all of this functionality that we spoke about is just simple configuration for developers. Using Spring Boot interceptors, they need not write any additional code for this.

But for the unique application use cases, we built a generic component called the Custom Blue Green Component, which developers can import and extend by writing their own logic to address their own unique use cases.

So we talked about baking the cake. Let's talk about the icing: the best practices that we follow as part of our journey to achieve this high availability.

Talking about the best practices, one of them is health checks. When it comes to ECS on Fargate, everyone configures the load balancer health checks as well as the container health checks, but we try to take it a step further with a concept called Deep Health Checks.

Deep Health Checks are something that we implemented using Spring Boot Actuator. Using Deep Health Checks, you can configure the service to check the application's functionality as well as its access to AWS services, thereby giving you a better understanding of whether your application is ready to serve traffic or not.

On the screen, you can see that we configured Deep Health Checks to check service endpoints, and we are also validating that a particular DynamoDB table exists before we mark the service as healthy.

In practice, our Deep Health Checks are much broader and encompass more use cases and scenarios. Deep Health Checks are something that we strongly recommend within TXB, irrespective of the deployment strategy being followed.
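
As an illustration of the pattern, a deep health check along these lines could be written with Spring Boot Actuator and the AWS SDK for Java v2; the table name is a hypothetical placeholder:

```java
// Sketch: only report healthy if a dependency the application actually
// needs at runtime (here, a DynamoDB table) is reachable.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.DescribeTableRequest;

@Component
public class DeepHealthCheck implements HealthIndicator {
    private final DynamoDbClient dynamoDb = DynamoDbClient.create();

    @Override
    public Health health() {
        try {
            dynamoDb.describeTable(DescribeTableRequest.builder()
                    .tableName("payments") // hypothetical table name
                    .build());
            return Health.up().build();
        } catch (Exception e) {
            // Surface the failure so the load balancer takes us out of rotation.
            return Health.down(e).build();
        }
    }
}
```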

So let's move on to the deployment strategies for stateful resources. These are the top three services that serve the bulk of our implementation.

Talking about Amazon RDS, we built a homegrown solution using Amazon Aurora fast cloning to achieve high availability for reads and writes, and we will cover the details in depth in the next couple of slides.

Moving on to Amazon MSK, we have something called Schema History to ensure backward and forward compatibility.

And talking about DynamoDB: though DynamoDB is schema-less, we are still careful not to delete any attributes that other applications use, and in case we do, we coordinate that externally.

Let's look into a scenario where we deploy Amazon Aurora RDS.

On the screen, you can see the production applications accessing the live production database on port 5432 of the Network Load Balancer.

For a moment, let's say it's 2 a.m. on Saturday, and we temporarily disable the writes on our applications. Using Amazon Aurora fast cloning, we create a cloned version of our production database, which we will call the green version.

Let's say it's available for us to use by 2:05 a.m. That's when we route our load balancer to point to the green version of the database, so the applications can still continue to make reads against the database.

And this is the moment when we can start the deployment of the blue version of the database. Assuming that the deployment takes roughly 20 minutes, let's say at 2:25 a.m. we have an updated version of our database, Version 2.

This is when we route the load balancer back onto the blue version of our database. We decommission the cloned, green version of the database, and we re-enable the writes on our application.

And so if you look at the entire process, we were able to continuously make reads against the database, but we had downtime for writes for the duration of the database deployment, roughly 25 minutes.
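
For reference, the green clone step could be driven like this with the AWS SDK for Java v2; a fast clone is a copy-on-write point-in-time restore, and the cluster identifiers here are hypothetical:

```java
// Sketch: create the green clone of the production Aurora cluster.
import software.amazon.awssdk.services.rds.RdsClient;
import software.amazon.awssdk.services.rds.model.RestoreDbClusterToPointInTimeRequest;

public class CreateGreenClone {
    public static void main(String[] args) {
        try (RdsClient rds = RdsClient.create()) {
            rds.restoreDBClusterToPointInTime(RestoreDbClusterToPointInTimeRequest.builder()
                    .sourceDBClusterIdentifier("payments-prod")   // the blue cluster
                    .dbClusterIdentifier("payments-prod-green")   // the clone that serves reads
                    .restoreType("copy-on-write")                 // fast clone, not a full copy
                    .useLatestRestorableTime(true)
                    .build());
            // Elided: wait for the clone to become available, then repoint
            // the Network Load Balancer target at the clone's endpoint.
        }
    }
}
```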

So we came up with a further version of the solution, where we were able to achieve high availability for writes as well. Let's talk about that.

On the screen, you can see the applications pointing to Amazon ElastiCache instead of the database. We employed a strategy called write-behind caching, where the applications read and write data through the cache, while a separate process hydrates the data from the cache into the database.

Everyone knows that a cache is volatile in nature. Given the concerns over the durability of the data, we brought MSK into the architecture: when your application writes data, it writes both into the cache and into MSK, and only when it gets an acknowledgment from both services does it consider the write successful.
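
A simplified sketch of that dual-write idea, assuming Jedis for the cache and the plain Kafka client for MSK (names and endpoints are placeholders, and real code would also handle partial failures between the two writes):

```java
// Sketch: a write only counts as successful once both the cache and
// the Kafka topic have acknowledged it.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import redis.clients.jedis.Jedis;

public class DualWriter {
    private final Jedis cache;
    private final KafkaProducer<String, String> producer;

    public DualWriter(Jedis cache, KafkaProducer<String, String> producer) {
        this.cache = cache;
        this.producer = producer;
    }

    /** Writes the record to cache and MSK; throws if either side fails. */
    public void write(String key, String value) throws Exception {
        cache.set(key, value); // throws on failure, which is our "no ack"
        // Block until the broker acknowledges the record, since a
        // fire-and-forget send could silently lose the write.
        producer.send(new ProducerRecord<>("payments.write-behind", key, value)).get();
        // Elided: a separate hydration job drains this data into Aurora
        // and is paused during database deployments.
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "msk-broker.internal:9094"); // hypothetical
        props.put("acks", "all"); // require full broker acknowledgment
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (Jedis cache = new Jedis("cache.internal", 6379);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            new DualWriter(cache, producer).write("payment-123", "{...}");
        }
    }
}
```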

So for a moment, let's say it's 3 p.m. on Tuesday, and we temporarily disable the job which hydrates the data from the cache into the database. Using Amazon Aurora fast cloning, we create a cloned version of the database, the green version, which is available to us by 3:05 p.m.

This is when, again, we point the load balancer to the green version of the database. Let's say that we start the deployment of our database now, and assuming that it takes roughly 20 minutes, by 3:25 p.m. you have an updated version of your blue database.

That's when you point the load balancer back to the blue version of the database, decommission the green, cloned version of the database, and re-enable the job which hydrates the data from the cache into the database.

Again, if you follow the entire process, we were able to continuously make both reads and writes, irrespective of what's happening at the database layer and irrespective of the deployments happening there.

So this is how we achieved high availability for both reads and writes when it comes to Amazon Aurora RDS. Let me take a moment here to give a shout-out to Sangita Karna, Vice President, Transaction Banking, who contributed to this design and implemented the solution on our data platforms.

Please do bear in mind that this solution comes with additional cost and complexity and you want to use it only when it's absolutely warranted.

Moving on to the release procedures: in Transaction Banking, all the functional changes that we make are committed and go through a maker-checker process, where they get thoroughly reviewed, approved, and tagged.

The updated functional changes that are tagged are deployed using Infrastructure as Code; as Rob mentioned earlier, we make all our deployments using Infrastructure as Code.

But given that we are deploying the green version of our changes during the staging window, which could be outside of the release windows, what if a developer accidentally updates the version of the database, changes the name of the database, or deletes a secret or an S3 bucket? That would create an interruption to the existing business.

To that end, we predefined a few preventative rules using Security as Code, which act as guardrails that warn and block developers from accidentally making any changes to the existing business, and make sure that only the green version of the code gets deployed as part of our deployment.

And once the green version of your changes is staged, as we discussed in the previous slides, we have up to 48 hours to decide whether we want to flip the traffic or roll back.

And we take this 48 hours to our advantage by using Amazon CloudWatch Synthetics canaries to run smoke testing on our green version and be more confident about our deployments.

Once the canaries are done, we monitor the results of the CloudWatch Synthetics and selectively decide whether we want to flip the traffic to the updated version or not.

When we want to flip the traffic, we use the AWS SAM CLI to trigger the AWS CodeDeploy flip, where the traffic is routed from the blue version, the production version, to the green version, the release candidate, and the old blue version, the old production version, is decommissioned after the graceful shutdown period.

So far, we have covered deployment strategies and the best practices that we follow in Transaction Banking. But it's important that every now and then we validate the resiliency and high availability of our product as a whole.

In Transaction Banking, we came up with a ritual called game days, which we conduct twice every year. We have all hands on deck performing multi-regional deployments and failing our business over from one region to another, taking the opportunity to benchmark the RTO and RPO aspects of our product. We also take the feedback from this exercise to update our runbooks and automate the process as much as we can, to improve efficiency by the next exercise.

On the bottom of the slide, you can find the link to the Goldman Sachs blog post that John Gore, Vice President, Transaction Banking, and I authored on a multi-regional strategy for Amazon Aurora RDS.

With this, I'll hand it back to Rob to cover the key wins and lessons learned.

Thanks, Akarshit.

So we had a lot of key wins from this journey:

Prod release validation time - When we think about a release that goes bad in prod, typically it's a configuration issue, because you've normally tested your application in non-prod, so the difference is typically configuration based. When you have an issue with a prod release and you're doing a rolling deployment, in the best case scenario your app doesn't come up and you're just running at reduced capacity, with one or two tasks in your service cycling. In the worst case scenario, your error is such that your apps come up, but the health check either takes too long to fail or never fails, and the application just runs incorrectly. Our ability to have that prod release validation time of 48 hours has been very beneficial to us, and it has really improved our confidence in our releases.

Dev team release hours - As I said earlier, nobody wants to be online on Saturday doing releases. So the ability to do them midweek, or to stage them on Thursday or Friday and release on Saturday, has been a huge win for our developers.

Aurora downtime per deployment - In the past, if you were doing an Aurora deployment and you had downtime, imagine, going back to Akarshit's scenario, that after 20 minutes you realize there's something wrong with your Version 2. If you were taking downtime, you'd have to wait to roll that back as well. There had been issues in the past where we had this kind of scenario, and not having that validation time was really costly. So the ability to have zero downtime Aurora deployments is powerful. In addition, especially when you have reads and writes, you avoid the cascading effect, where because you took down your database, your services were down, then other services that rely on you were down, and so on. So basically you have everything up and running.

Finally, release-related incidents. At Goldman, especially in Transaction Banking and throughout all of Goldman, we're very meticulous about our release process. We have release approvals; we have to list out all the features that are going in, just like you probably would in most places. So we never had a huge number of release-related incidents, but we have found that teams that take advantage of blue/green have significantly reduced their incidents. For example, a team that I worked on until a few weeks ago embraced this, and we've had one release incident since we did, and it was actually because we forgot to do the flip. So the only incident was that we didn't do the release.

So let's just take a few minutes to talk about lessons learned.

I'm a huge fan of deep health checks. So often, when you see people writing the configuration for their ECS services, their NLB health check will be a ping on a given port and their ECS health check will be a ps. What we found is that often a ping isn't good enough, and an ECS health check definitely isn't good enough, because we often use Spring Boot, and it can take 30 seconds for the process to start, so you have Amazon mistakenly thinking that things are up. But also, if you have a really good deep health check, that health check can take the place of canaries. I know I just talked about CloudWatch Synthetics canaries, but we've actually found that if you have a sufficiently thorough health check, where your application will only come up if it's healthy, then that will take its place. And you have to be careful: if you have an NLB in a number of availability zones and ECS, you're gonna find that each of your tasks gets pinged from each availability zone every 30 seconds, or whatever your health check period is. So you're gonna have to add in some caching. What we did is build deep health checks that call our services but have a one- to two-minute pause: if you were successful the last time you were called and you're called again within the next two minutes, we just say you're successful; if you're not successful, we check each time. So that gives us the ability to actually fully test our code.
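
A minimal sketch of that caching behavior, assuming Spring Boot Actuator (the class and TTL here are illustrative, not TXB's actual implementation):

```java
// Sketch: reuse a successful deep-check result for two minutes, but
// always re-check after a failure.
import java.time.Duration;
import java.time.Instant;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.boot.actuate.health.Status;

public class CachedDeepHealthCheck implements HealthIndicator {
    private static final Duration CACHE_TTL = Duration.ofMinutes(2);

    private final HealthIndicator delegate; // the expensive deep check
    private volatile Instant lastSuccess = Instant.MIN;

    public CachedDeepHealthCheck(HealthIndicator delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized Health health() {
        // If the deep check passed recently, skip it: the NLB probes every
        // task from every availability zone, so uncached checks multiply.
        if (Instant.now().isBefore(lastSuccess.plus(CACHE_TTL))) {
            return Health.up().withDetail("cached", true).build();
        }
        Health result = delegate.health();
        if (Status.UP.equals(result.getStatus())) {
            lastSuccess = Instant.now();
        }
        // Failures are never cached, so an unhealthy task is re-checked
        // on every probe until it recovers.
        return result;
    }
}
```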

Another huge thing that we found people had confusion about, and that took us a while to understand, is the way a flip works with a Network Load Balancer versus an Application Load Balancer. An Application Load Balancer is request by request: if I flip an Application Load Balancer, I can be sure that the next request that comes in is gonna go to my release candidate that just became my prod version. We actually use Network Load Balancers in Transaction Banking, because our entire plant is mutual TLS and we terminate TLS on the server. Now, as we all may have heard in the last few days, mutual TLS for Application Load Balancers was just announced, so take this with a grain of salt. But with a Network Load Balancer, the routing of traffic is done by connections, not by requests. And what we found happening is that most modern web clients use connection pooling, so we would do the flip and find that clients were still connected to the version-one service. So there are two things that you have to recognize about this.

The first is that you have to understand how long it takes for your service to shut down. The second is that you have to make sure you set up your server correctly so that you don't disconnect clients.

For the first thing, when you have a service running Spring Boot, in our case, or any service, but I'll explain how we do it in Spring Boot: we actually use graceful shutdown, which is a setting in Spring Boot that tells the server to gracefully drain all of the HTTP connections on the way to shutting down. Now what you have to do is ask: how long is the longest service invocation that I have? Let's imagine it's three minutes. Then you want to make sure you set your deregistration delay, which is a setting on your service that controls how long after your Network Load Balancer stops routing traffic to the service until it shuts down, and then your stop timeout, which is how long that shutdown hook that drains connections has to run. If you do that, your clients won't have any disconnects.
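
A minimal sketch of the Spring Boot side, using the three-minute example (the ECS stop timeout and the target group deregistration delay are set separately in the task definition and load balancer configuration; the values here are illustrative):

```java
// Sketch: enable graceful shutdown and size the drain window to the
// longest expected service invocation.
import java.util.Map;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class PaymentsApp {
    public static void main(String[] args) {
        SpringApplication app = new SpringApplication(PaymentsApp.class);
        app.setDefaultProperties(Map.of(
                // Drain in-flight HTTP requests instead of dropping them on SIGTERM.
                "server.shutdown", "graceful",
                // Allow up to the longest expected invocation (3 minutes) to finish.
                "spring.lifecycle.timeout-per-shutdown-phase", "3m"));
        app.run(args);
    }
}
```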

Finally, we have a lot of applications, which Akarshit actually spoke about, that are event driven. There's no Network Load Balancer, there's no web service. So how do we use those with CodeDeploy? CodeDeploy actually requires a load balancer. So what we did is use a stub NLB: we took another load balancer in the same account, created a dummy hello-world service, and we run the blue on port 10443 and the green on port 11443. And we rely very heavily on the Blue Green Library Manager to ensure that whether a task set is blue or green, it interacts with the correct upstream and downstream services.

Finally, one other really brief point: if you're looking to build something like this and you're querying your ECS tags, just be really careful. The ECS API often will tell you that your service is down when it's not completely down yet. We found that we were getting a bunch of errors trying to query services that supposedly weren't there, but we were querying ECS from within the service itself, so it was clearly still running.

Now I'd like to hand it back over to Manjula.

Ok. All right. Thank you, Rob.

Looking forward, Transaction Banking is implementing some new features to continue to enhance TXB's robustness.

The first feature is blue/green deployments for Aurora PostgreSQL. Transaction Banking heavily leverages PostgreSQL for relational database use cases. We recently launched a new feature in Amazon RDS, blue/green deployments, to support simpler and faster updates to PostgreSQL. The way it works is that the blue/green deployment creates a staging environment; you can then deploy and verify your changes in that staging environment. This is a completely managed staging environment. Once you verify your changes, you can promote the staging environment to be the new production database system. All of this can be done as fast as within a minute. There is no data loss, and there are absolutely no application code changes needed to switch the database endpoint. This is going to be a very powerful feature that enables a lot of our customers to do major version database engine upgrades, schema changes, and also maintenance updates.
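
As an illustration, the managed feature can be driven from the AWS SDK for Java v2 roughly like this (ARNs, versions, and names are hypothetical placeholders):

```java
// Sketch: create the managed green (staging) environment from the blue
// source, then promote it after verification.
import software.amazon.awssdk.services.rds.RdsClient;
import software.amazon.awssdk.services.rds.model.CreateBlueGreenDeploymentRequest;
import software.amazon.awssdk.services.rds.model.SwitchoverBlueGreenDeploymentRequest;

public class ManagedBlueGreenUpgrade {
    public static void main(String[] args) {
        try (RdsClient rds = RdsClient.create()) {
            String bgId = rds.createBlueGreenDeployment(CreateBlueGreenDeploymentRequest.builder()
                    .blueGreenDeploymentName("payments-pg-upgrade")
                    .source("arn:aws:rds:us-east-1:111122223333:cluster:payments-prod") // hypothetical
                    .targetEngineVersion("15.4") // e.g. a major version upgrade target
                    .build())
                    .blueGreenDeployment()
                    .blueGreenDeploymentIdentifier();

            // Elided: verify the green environment, then switch over.
            rds.switchoverBlueGreenDeployment(SwitchoverBlueGreenDeploymentRequest.builder()
                    .blueGreenDeploymentIdentifier(bgId)
                    .switchoverTimeout(300) // seconds allowed for the switchover
                    .build());
        }
    }
}
```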

The next feature is chaos testing. It is crucial to conduct and practice chaos testing in order to ensure your applications' resiliency and reliability. The Transaction Banking team is in the process of adding and simulating fault injection failures in their critical payment flows and assessing the TXB applications' response using AWS Fault Injection Simulator. With FIS, you can create experiments and induce failures, and those failures could be network delays, Lambda execution failures, and so on. Once you configure the different parameters in these experiments, you execute them and analyze the results. The goal is to iterate on these experiments and, of course, continue to enhance TXB's robustness.
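
As a small illustration, kicking off a predefined FIS experiment from the AWS SDK for Java v2 might look like this (the template ID is a hypothetical placeholder; the experiment's targets and fault actions are defined in the template, not here):

```java
// Sketch: start a fault-injection experiment and read back its state.
import software.amazon.awssdk.services.fis.FisClient;
import software.amazon.awssdk.services.fis.model.StartExperimentRequest;
import software.amazon.awssdk.services.fis.model.StartExperimentResponse;

public class RunChaosExperiment {
    public static void main(String[] args) {
        try (FisClient fis = FisClient.create()) {
            StartExperimentResponse response = fis.startExperiment(StartExperimentRequest.builder()
                    .experimentTemplateId("EXT1a2b3c4d5e6f7") // hypothetical template
                    .build());
            // The experiment runs asynchronously; poll getExperiment(...)
            // to analyze results once the injected faults have played out.
            System.out.println("Experiment state: " + response.experiment().state().status());
        }
    }
}
```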

The last feature is zero touch release. Some of the steps that Akarshit walked us through still involve manual intervention in the current release pipeline. The Transaction Banking team is continuing to enhance the release cycles in order to have faster and more reliable releases.

With that, we come to the conclusion of this session. I would like to take this opportunity to appreciate the Goldman Sachs team for sharing their journey to zero downtime. Our primary goal was to equip you with techniques, practical insights, and strategies, so you can take these insights forward, apply them in your own specific context, and embark on a journey to create resilient applications using AWS cloud services.

Please do take a moment to fill out the session survey; your feedback is highly valuable to us. We would like to provide you with innovative content that caters to your needs. Happy exploring the rest of re:Invent.

Thank you so much for your time.
