Improve resilience of SAP workloads with AWS Support

Are you running critical workloads on AWS? Are you looking to improve the resiliency and operations of your critical workloads, like SAP? If the answer is yes, you are in the right place.

Welcome to this session. We'll learn about how 3M worked with AWS Support to improve the resiliency of their SAP workload.

With me, we have Kim Otto from 3M's platform engineering team. She's going to talk about the journey her team took working with AWS Support to make their systems more resilient.

We also have Vijay Sitaram, who is the Technical Account Manager for 3M, and myself, Manik Chopra, a Principal Technical Account Manager working with 3M.

We'll talk about the various AWS Support engagements, like tabletop exercises, fault testing, and runbook reviews, which help you walk through the DR exercises for your workloads. You will hear firsthand from 3M how they modernized their resiliency management on AWS using AWS Resilience Hub, Fault Injection Service for testing, and CloudWatch Application Insights for improving SAP availability monitoring.

You might have seen this slide multiple times before, and we use it internally all the time: everything fails all the time. This is from Werner Vogels, our VP and CTO. With this principle, we try to improve the resiliency of our systems and work backwards from it.

Technology is the backbone of all businesses today, big, small, and anything in between. Any complex system is made up of multiple subsystems, each of which has a likelihood of failure over its lifespan. Some failures are small and avoidable, while others are rare, complex, and catastrophic.

And this came from Gartner earlier this year: resiliency equals revenue. Resiliency means revenue, so any degradation in resiliency means revenue loss, brand impact, loss of customer trust, or all of the above. Failures of mission-critical distributed systems can cause costly damage, and some of that brand damage is hard to win back, especially customer loyalty.

When we talk about resiliency, what is resiliency? Think about the mental model you see on the screen, from left to right. On the left is high availability: you design your systems to be available and resistant to common failures, like when you are patching your system, running your CI/CD pipelines, or handling a hardware failure in your tier or other infrastructure. A highly available system can easily handle those kinds of failures. Going right is your continuity of operations: when a rare disaster does happen, how quickly and how gracefully can you recover, and can parts of the system continue to run gracefully? That could be DR scenarios like backup and restore, or something more severe, like failing over entirely to a different region.

And at the bottom is your continuous resilience. This is where you are continuously working: testing your resiliency exercises, testing your DR procedures, making improvements to your observability, and improving your CI/CD code and deployment pipeline. So that's the mental model we work with towards resiliency. In short, resiliency is how well your system responds to and recovers from failure.

So what is application resilience? In industry, when we talk about resilience, we use two metrics to measure the resiliency of any system and map it back to our business requirements: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). RTO is the time from the start of downtime to recovery: how much downtime can your business tolerate? You work backwards from that and design your recovery exercises accordingly. RPO is how much data loss your systems can tolerate for the business to continue; for example, an RPO of 15 minutes means you can afford to lose at most the last 15 minutes of data. Beyond that, there's severe impact to your business. Those are the two metrics we design our resiliency exercises against.

So what are the different categories of failure? When we do our scenario mapping and threat modeling, asking which resiliency exercises and scenarios we should design against, these are some of the scenarios we talk about.

First, code deployment and configuration. These happen most often, especially in an agile DevOps environment where we are applying frequent code changes and making frequent configuration changes. You should have a quick way to recover from these, either by rolling back to a previous version of the code or by having a blue/green deployment where a known-good stack is always running and available.

The next set is core infrastructure. This is where you have hardware failing in your data centers, a host failing, or you're losing a backup, and things like that. How will you recover from those?

The next one is data and state corruption. This is where your backup was corrupted, or your data is getting corrupted because you've deployed bad code that is making changes to that data. How will you recover from those kinds of scenarios, and what DR scenarios will you design around them?

Then there are the dependencies. In complex systems, your system talks to many other external systems, whether third party or your own systems within your company. What are your SLAs and SLOs with those other systems and their APIs? If they fail, how does your application respond: does it gracefully continue to function with a good message for your end user, or does it completely fail with error messages? And then there are the highly unlikely ones, where the whole region goes down or you lose a whole AWS Availability Zone, say a climate event makes one AZ completely unavailable. What will you do? How will you recover? Those are rare occurrences, but we do want to plan for them, test for them, and make you ready for them, especially for your critical systems.

These are some of the resiliency services 3M used; they will walk you through their journey.

AWS Health tells you about the various events affecting the health of the cloud at any point in time. Trusted Advisor provides aggregated checks around various pillars, and resiliency and fault tolerance is one of them.

AWS Resilience Hub. This is where you register your application; it goes through the infrastructure layer of your application and tells you how you stand compared to your application's RPO and RTO as you plan for resiliency.

Amazon CloudWatch. This is for your observability. You can see insights into your application, monitor the infrastructure, and get alerts when things start to fail.

Fault Injection Service. This service helps you test these scenarios. When you are trying to simulate a hardware failure, how will you do that? You're not going to go to a data center and pull the plug; we can help you simulate that. That's the service, and it comes with multiple scenarios: you're losing an EBS volume, your zone is going down, your network is running slow, and so on. We're always adding new scenarios to that service.
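
To make this concrete, here is a minimal boto3 sketch of what an FIS experiment looks like in practice: it defines and starts an experiment that pauses I/O on one EBS volume. This is a sketch under assumptions, not 3M's actual template; the role ARN, volume ARN, and names are placeholders, and a real template should add a CloudWatch alarm stop condition.

```python
import boto3

fis = boto3.client("fis")

# Define a minimal experiment template: pause I/O on one EBS volume for 5 minutes.
# The role ARN and volume ARN below are placeholders for illustration only.
template = fis.create_experiment_template(
    clientToken="sap-ebs-pause-demo-1",
    description="Simulate an EBS volume failure by pausing I/O",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{"source": "none"}],  # in real use, add a CloudWatch alarm stop condition
    targets={
        "SapDataVolume": {
            "resourceType": "aws:ec2:ebs-volume",
            "resourceArns": ["arn:aws:ec2:us-east-1:123456789012:volume/vol-0123456789abcdef0"],
            "selectionMode": "ALL",
        }
    },
    actions={
        "PauseIo": {
            "actionId": "aws:ebs:pause-volume-io",
            "parameters": {"duration": "PT5M"},  # ISO 8601 duration
            "targets": {"Volumes": "SapDataVolume"},
        }
    },
)

# Start the experiment from the template.
fis.start_experiment(
    clientToken="sap-ebs-pause-run-1",
    experimentTemplateId=template["experimentTemplate"]["id"],
)
```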

And CloudWatch. I already touched upon that.

So these are some of the Support engagements we have, which we used with 3M to work through making their SAP more resilient.

The first one is the Well-Architected review with the SAP Lens. If you're familiar with the AWS Well-Architected Framework, it covers best practices around operations and architecture, and it has specific lenses for certain scenarios and kinds of workloads. SAP is one of them, going through application-specific operations and resilience. So we ran through the reliability and operational excellence pillars to see how to make things better.

Then, assessing incident response readiness. This is where we ran the tabletop exercises, working with the 3M team to see how different teams would interact: what the communication structure would look like and how AWS would fit in. You may need people from the network team, the security team, the infrastructure team, and so on, and we looked at what that coordination would look like on the AWS Support side.

Next, Build and Review Runbooks. This is critical: as our customers move from on-premises to AWS, they bring their operations to the cloud. Are your runbooks ready for the cloud? They were written for the on-premises world. We help you revise them for the cloud; you may still have some flavor of on-premises, and some teams will still get involved, but the runbooks need to reflect how things function in the cloud.

Manage and Operate Resilience. This is where you get infrastructure resilience assessments and resiliency recommendations. We go through your runbooks, your infrastructure, and your architecture to provide recommendations you can adopt, whether that's deploying Multi-AZ, changing the IOPS on your EBS volumes, enabling fast backups or snapshot restores, and things of that sort.
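
For example, one such recommendation, enabling fast snapshot restore so a volume restored from a snapshot delivers full performance immediately, can be applied with a couple of boto3 calls. The snapshot ID and AZs here are placeholders, not values from 3M's environment:

```python
import boto3

ec2 = boto3.client("ec2")

# Enable fast snapshot restore for a backup snapshot in the AZs where
# an SAP node might be restored. IDs below are placeholders.
response = ec2.enable_fast_snapshot_restores(
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    SourceSnapshotIds=["snap-0123456789abcdef0"],
)

for success in response["Successful"]:
    print(success["SnapshotId"], success["AvailabilityZone"], success["State"])
```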

Then you have Test and Validate Recovery. This is where Fault Injection Service comes in handy. It comes with multiple scenarios you can test. You can start with one, build confidence with your operations team, and then iterate through different failure scenarios. Depending on how frequently you want to run those exercises, every quarter or every six months, you can pick different scenarios to simulate.

And then Monitor and Observe Availability. This is where you look into what kinds of metrics you have available for monitoring your infrastructure and your applications.

So we went through all these exercises; that's the journey we took with 3M, and we're going to talk about this in more depth soon.

And this is the flywheel I just talked about. In the middle, at the center, is the Well-Architected Framework. We took this journey starting with a single scenario, and then we iterated through multiple scenarios working with 3M.

And with that, I would like to invite Vijay to talk about SAP.

Thanks Manik.

Good evening, everyone. Manik talked about the design principles, the mental model, and ways to protect your mission-critical workloads. What I'm going to do is walk you through the journey we took with 3M in activating some of the Support best practices: ways to assess the application, test the application, validate the configuration, and tie it all together with observability.

I'm going to go through the details of what that entails, and in the SAP architecture I want to call out a couple of things.

One of them is that 3M has two regions. The first region is dedicated to production workloads, and the second region is dedicated to non-production workloads. For the purposes of this session, where we improved the resiliency, we will refer to the production workload, which runs across three Availability Zones.

The first and second AZs are paired together with a Pacemaker cluster for both the application, which is the ASCS cluster, and the database. As you see here, the database also has an auxiliary standby in the third Availability Zone using time-delayed log replication, and synchronous log replication is established between the primary and the principal standby.

In the third Availability Zone, 3M has spun up additional application servers for capacity purposes. The emphasis here is that when we look at this architectural pattern, we focus on the single points of failure for SAP. Those are typically the ASCS, which hosts the message server and enqueue services; the database; the AZ itself, because what if the AZ goes down?; and, not to mention, the EFS where we have the SAP file shares, user homes, and mounts. And last but not least...

...what Manik mentioned: a regional failure. It's unlikely, but when it happens, we have to be ready for it. So how do we go about addressing those risks? By improving the design, validating the design, testing the design, and also being prepared to recover from those events.

I'll start with some best practices, and these are low-hanging fruit you can get out of the box, fully automated, with Trusted Advisor Priority. Trusted Advisor out of the box provides over 250 recommendations across five categories, curated from thousands of customers, and it's fully automated, which 3M took advantage of.

I want to focus on fault tolerance, which is one of the categories that goes deep into assessing the risks associated with resilience, such as single-AZ deployments, missing snapshots, missing backups, and so forth. So it covers both compute failures and data recovery failures.

As a team, we worked with 3M to review the recommendations from Trusted Advisor, prioritized what was important based on 3M's workload, and addressed those recommendations.

Now, what you see here is a combination of both SAP and non-SAP workloads. You all know that there are also other boundary systems that interact with SAP. When we looked at the Trusted Advisor recommendations, we looked at the SAP landscape as a whole: the combination of systems, how the systems interact with each other, and the overall resiliency posture.

Having looked at the best practices and how easy they are to implement because they're fully automated in the backend, let's take a look at how to simplify the assessment of RPO and RTO targets. As you're aware, SAP systems have aggressive RPO and RTO targets.

In this case, working with 3M and using Resilience Hub, the assessment was done based on the core SAP components we talked about, because they are the single points of failure. So we looked at the ASCS/ERS cluster, the database cluster with the primary and standby database nodes, the associated storage volumes, the application servers, and the global mounts, the EFS mounts.

Based on 3M's DR policies, we created the RPO and RTO targets in Resilience Hub and associated them with the onboarding process, where resource groups are created to pull in all of the application-related components.

So we create a resource group with Resilience Hub. It pulls in all of your SAP application servers, the EBS volumes, and the EFS file shares, and onboards them as one application. When Resilience Hub assesses an application, it treats it as a unit, not as the individual resources associated with it, because we want to protect it as a whole.
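
A rough boto3 sketch of that onboarding flow might look like the following. The policy values, names, and resource group ARN are illustrative assumptions (they are not 3M's actual targets), and this assumes Resilience Hub can import the resource group ARN directly:

```python
import boto3

arh = boto3.client("resiliencehub")

# 1. Define RPO/RTO targets per disruption type (values are illustrative).
policy = arh.create_resiliency_policy(
    policyName="sap-prod-policy",
    tier="MissionCritical",
    policy={
        "Software": {"rpoInSecs": 900, "rtoInSecs": 3600},
        "Hardware": {"rpoInSecs": 900, "rtoInSecs": 3600},
        "AZ":       {"rpoInSecs": 900, "rtoInSecs": 3600},
        "Region":   {"rpoInSecs": 3600, "rtoInSecs": 14400},
    },
)

# 2. Register the SAP system as one application bound to that policy.
app = arh.create_app(
    name="sap-prod-erp",
    policyArn=policy["policy"]["policyArn"],
)

# 3. Import the resource group that holds the app servers, EBS volumes,
#    and EFS shares (ARN is a placeholder), then publish and assess.
arh.import_resources_to_draft_app_version(
    appArn=app["app"]["appArn"],
    sourceArns=["arn:aws:resource-groups:us-east-1:123456789012:group/sap-prod-erp"],
)
arh.publish_app_version(appArn=app["app"]["appArn"])
arh.start_app_assessment(
    appArn=app["app"]["appArn"],
    appVersion="release",
    assessmentName="sap-prod-erp-initial",
)
```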

SAP runs on two AZs for compute, on EBS and EFS. What's not shown here is that there are also options SAP offers where you can use Amazon FSx for NetApp ONTAP for storage.

Now, as part of the assessment, what we learned is that 3M had already built a very robust, multi-layered data recovery strategy. 3M had already established AWS Backup for backing up the database servers, the application servers, and the EFS SAP file shares.

3M had already established a backup vault for protection, compliance, and guarding against accidental deletion. And 3M worked with us to influence some product advancements, adding support for native EBS snapshots as well as AMI snapshots backed by EBS volumes, which are used for both the application servers and the database servers.
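
As a minimal sketch of what an AWS Backup setup like that involves (the vault name, schedule, retention, role, and selection tag are all assumptions for illustration):

```python
import boto3

backup = boto3.client("backup")

# Create a vault to hold SAP recovery points.
backup.create_backup_vault(BackupVaultName="sap-prod-vault")

# Daily backup plan with 35-day retention (values are illustrative).
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "sap-prod-daily",
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": "sap-prod-vault",
                "ScheduleExpression": "cron(0 3 * * ? *)",
                "Lifecycle": {"DeleteAfterDays": 35},
            }
        ],
    }
)

# Select resources by tag -- here, everything tagged sap-backup=true.
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "sap-prod-resources",
        "IamRoleArn": "arn:aws:iam::123456789012:role/aws-backup-default-role",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "sap-backup",
                "ConditionValue": "true",
            }
        ],
    },
)
```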

Resilience Hub is now integrated with Trusted Advisor. In Trusted Advisor, there are two key checks that are integrated. One is the age of the last assessment. With Resilience Hub, 3M has enabled regular assessments, which run on a daily basis. If assessments are not run, you would see it flagged, as here, where one of the applications shows a last assessment run of 172 days ago.

So from Trusted Advisor, one can see how frequently assessments are being run on the application.

Number two is the score itself. Resilience Hub scans the application and looks at data protection and compute protection. Based on that, a score is calculated and published in Trusted Advisor as well. This is good guidance for determining what's missing based on the recommendations and taking appropriate action to improve the resilience score for the SAP application.

Now, having seen the best practices with Trusted Advisor Priority, and having seen how easy it is to set the RPO and RTO targets and assess a given application, in this case an SAP workload, let's take a look at how we test and validate this workload. Does it fail over as designed? Can we recover as expected, and with what types of experiments?

3M tested extensively, and the tests are repeatable across multiple SAP systems. Initially, 3M was excited to test out the pause-volume-IO action in FIS for EBS, as well as to test the Pacemaker cluster associated with the ASCS and ERS.

3M also tested the database cluster by failing over the nodes as well as the database log and data volumes. And since AZ-level failures can happen, we also introduced a scenario where we disrupt the network connectivity between the two AZs.

Given that the SAP file shares are critical, sapmnt, user homes, trans, there was also a test created to suspend traffic to the NFS mounts.

And last but not least, 3M also worked with us to simulate a control plane outage. We're going to walk through each one of these scenarios and share the results from 3M's testing.

In this first test, we targeted the ASCS and ERS nodes to test the cluster using an FIS action. The Fault Injection Service action was to stop the ASCS node. In this test, we observed that the cluster failed over successfully, as expected and as designed; there was no data loss, and alerts fired as the cluster was configured to do.

We noticed internally that the IP addresses changed based on the overlay IP agent, and proper fence actions were taken, which enabled the failover process. Overall it was a successful test. So this was the first test we delivered together with 3M, demonstrating that the SAP ASCS/ERS cluster was working as designed, using the FIS action.
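
A hedged sketch of how such a test could be expressed as an FIS template, targeting the ASCS node by a hypothetical tag (the role ARN and tag key are assumptions, not 3M's actual configuration):

```python
import boto3

fis = boto3.client("fis")

# Stop the ASCS node (selected by a hypothetical tag) and let FIS restart
# it after 10 minutes; the cluster should fail over to the ERS node meanwhile.
fis.create_experiment_template(
    clientToken="sap-ascs-stop-demo-1",
    description="Stop the ASCS node to validate Pacemaker failover",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{"source": "none"}],
    targets={
        "AscsNode": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"sap-role": "ascs"},  # hypothetical tag
            "selectionMode": "COUNT(1)",
        }
    },
    actions={
        "StopAscs": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},
            "targets": {"Instances": "AscsNode"},
        }
    },
)
```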

The second test involved testing the database, and we had to test the database in four different ways. I'm going to walk through each of them and the results with 3M.

The first set of tests involved stopping database nodes: the first test stopped the primary node, and the second stopped the secondary node. After that, the third test involved pausing I/O to the data volumes on the primary node, and the fourth test paused traffic to the log volumes on the primary node.

Comparing tests one and two: the first test triggers a cluster failover, because when we stop the primary node, the cluster moves activity over to the secondary database node. For the data and log volumes, we used the EBS pause-volume-IO action.

The results differed based on the duration of the test. If you run the test for five minutes, the system hangs and may not trigger a failover. However, running the same test for 15 to 20 minutes may crash the database, trigger a failover, and move the traffic to the secondary database node in the secondary AZ.

The third test we conducted with 3M had to do with the SAP file shares, the trans and sapmnt mounts. For this, we used an FIS action with an SSM document to block traffic on port 2049 for a duration of 300 seconds, which is five minutes.

What we noticed was that the 3M system worked as designed again. The cluster worked as designed, the failover happened as expected, monitoring was in place, there was no data loss, and overall it was a successful test of this network disruption scenario.
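
A sketch of the core idea behind that port-2049 test, assuming an SSM command document that drops NFS traffic with iptables and restores it after the window. The document content and names are illustrative, not 3M's actual runbook; real FIS/SSM fault documents also handle cancellation and rollback:

```python
import json
import boto3

ssm = boto3.client("ssm")

# A simple command document: block outbound NFS (port 2049), wait,
# then remove the rule. This is only the core idea of the fault.
content = {
    "schemaVersion": "2.2",
    "description": "Temporarily block NFS traffic to simulate EFS mount loss",
    "parameters": {
        "durationSeconds": {"type": "String", "default": "300"}
    },
    "mainSteps": [
        {
            "action": "aws:runShellScript",
            "name": "blockNfs",
            "inputs": {
                "runCommand": [
                    "iptables -A OUTPUT -p tcp --dport 2049 -j DROP",
                    "sleep {{ durationSeconds }}",
                    "iptables -D OUTPUT -p tcp --dport 2049 -j DROP",
                ]
            },
        }
    ],
}

ssm.create_document(
    Name="BlockNfsTraffic",
    DocumentType="Command",
    Content=json.dumps(content),
)
```

In the actual experiment, a document like this would be invoked against the target instance through the FIS aws:ssm:send-command action.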

3M has a unique design where a single subnet contains multiple SIDs, multiple SAP systems. So when disrupting the flow of network traffic between AZs, we had to scope the experiment to just one SAP system, and for simplicity, we scoped it to just the ASCS node.

We achieved this by using a prefix list containing only the IP address of the ASCS node, targeting the experiment at just that one node. What this does is stop all traffic in and out of the ASCS node. That triggers a fence action: the cluster kicks in and the fence agent takes action.

After that, the overlay IP agent updates the route table entries, and the traffic moves over to the secondary node in the secondary AZ, as designed. So with 3M, we were able to once again successfully test this scenario, simulating an AZ-level outage.

Now you can take the same scenario and expand the prefix list up to 10 IP addresses, or run experiments at the subnet level for multiple SIDs.
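
A rough sketch of that scoping, assuming the FIS aws:network:disrupt-connectivity action supports a prefix-list scope as shown; the exact parameter set, the CIDR, the subnet ARN, and the role ARN are all assumptions for illustration:

```python
import boto3

ec2 = boto3.client("ec2")
fis = boto3.client("fis")

# Prefix list containing only the ASCS node's IP, so the disruption
# is scoped to one node instead of every SID in the subnet.
pl = ec2.create_managed_prefix_list(
    PrefixListName="sap-ascs-only",
    AddressFamily="IPv4",
    MaxEntries=10,  # leaves room to widen the blast radius later
    Entries=[{"Cidr": "10.0.1.23/32", "Description": "ASCS node"}],
)

fis.create_experiment_template(
    clientToken="sap-az-disrupt-demo-1",
    description="Disrupt network connectivity for the ASCS node only",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{"source": "none"}],
    targets={
        "SapSubnet": {
            "resourceType": "aws:ec2:subnet",
            "resourceArns": ["arn:aws:ec2:us-east-1:123456789012:subnet/subnet-0abc1234"],
            "selectionMode": "ALL",
        }
    },
    actions={
        "DisruptAscs": {
            "actionId": "aws:network:disrupt-connectivity",
            "parameters": {
                "duration": "PT5M",
                "scope": "prefix-list",
                "prefixListIdentifier": pl["PrefixList"]["PrefixListId"],
            },
            "targets": {"Subnets": "SapSubnet"},
        }
    },
)
```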

Let's talk about API throttling. This is a very interesting case where, working with 3M, we had to design an experiment that simulates the cluster losing communication with the AWS control plane APIs.

The cluster communication relies on four key API calls to the control plane, and the experiment was designed to throttle those API calls 100%. This means that while the experiment is running, the cluster can no longer communicate with the control plane.

We targeted the ASCS node, and that breaks the cluster. In this scenario, the nodes cannot fail over, because the overlay IP agent cannot update the route table entries; all of its API calls are throttled at 100% as well.

And the fence agent is unable to take the fencing action, because its API calls are also throttled 100%. So what does this mean? When we're simulating a regional event like Manik described, it is important to know that after recovery from the regional event, manual steps are required to check the status of the cluster, clean up the resources, and make sure the IP addresses are updated in all the route tables.

There are nine route tables at 3M, so we had to validate those as well and confirm that the resources were running on the respective nodes and that the cluster was healthy. 3M worked with us to develop manual steps and procedures to have ready in case there is a regional event and the cluster is unable to fail over as designed.
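
A sketch of what such an experiment could look like using the FIS aws:fis:inject-api-throttle-error action against the cluster instances' IAM role. The role ARNs are placeholders, and the specific EC2 operations listed are assumptions standing in for the four calls the cluster agents actually make:

```python
import boto3

fis = boto3.client("fis")

# Throttle 100% of the EC2 API calls the cluster agents depend on
# (operations listed here are illustrative) for the instance role
# used by the ASCS node, simulating loss of the control plane.
fis.create_experiment_template(
    clientToken="sap-api-throttle-demo-1",
    description="Throttle control-plane API calls used by the Pacemaker agents",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{"source": "none"}],
    targets={
        "ClusterRole": {
            "resourceType": "aws:iam:role",
            "resourceArns": ["arn:aws:iam::123456789012:role/sap-cluster-instance-role"],
            "selectionMode": "ALL",
        }
    },
    actions={
        "ThrottleEc2Apis": {
            "actionId": "aws:fis:inject-api-throttle-error",
            "parameters": {
                "service": "ec2",
                "operations": "DescribeInstances,DescribeRouteTables,ReplaceRoute,ModifyInstanceAttribute",
                "percentage": "100",
                "duration": "PT5M",
            },
            "targets": {"Roles": "ClusterRole"},
        }
    },
)
```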

To tie it all together: when we worked with 3M, we said, ok, we have done the testing. Now what's next? Can we see whether the cluster is healthy? How do we monitor this? That's where Amazon CloudWatch Application Insights for SAP comes into play.

With CloudWatch, there are several metrics; we're not going to go through all of them in this session. I want to focus on two key areas: the HA cluster metrics and the NetWeaver availability metrics. There are five cluster metrics and four availability metrics.

I want to focus on the cluster metrics first.

The cluster metrics cover whether the fence agent is enabled, the Pacemaker node statuses, whether there is a ring error, and whether there are any Pacemaker fail counts. On the NetWeaver availability side, CloudWatch also provides the status of the HA configuration, the status of the HA cluster, the SAP start service process, and alerts on the availability of the SAP processes.

This is something 3M has implemented for a couple of the systems, with plans to roll it out to additional SIDs.
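
As a rough sketch of wiring an alert to one of those cluster metrics, here is a CloudWatch alarm on a Pacemaker fail-count style metric. The namespace, metric name, dimensions, and topic ARN below are placeholders standing in for whatever the Application Insights SAP configuration actually publishes:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a Pacemaker fail-count style metric. Names here are
# placeholders, not the actual Application Insights metric names.
cloudwatch.put_metric_alarm(
    AlarmName="sap-ha-pacemaker-failcount",
    Namespace="SAP/HACluster",                 # placeholder namespace
    MetricName="pacemaker_fail_count",         # placeholder metric name
    Dimensions=[{"Name": "SID", "Value": "PRD"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",              # a silent agent is itself a problem
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:sap-ha-alerts"],
)
```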

And with that, I would like to invite our customer, Kim Otto from the platform engineering team, to walk us through the resilience journey.

Thanks, Vijay.

Hi, everyone. Thanks for joining us today. As these guys have mentioned, I'm here from 3M, and we worked with AWS Enterprise Support on this journey. You may recognize the name 3M from some of our consumer products, maybe Post-it Notes, Scotch tape, Command hooks, or N95 masks. But actually, we're a global manufacturing company delivering innovations across over 31 technology platforms. And along with global manufacturing comes a global single instance of SAP, and that's why this resilience is so important to us.

So I'm going to take you through just a few of the highlights of the journey we had with AWS Enterprise Support. Many of the technical pieces have already been covered by Manik and Vijay, but I'll give you a little bit of our perspective.

When they came to us earlier this year to start on this journey, we asked ourselves a few questions. Are we really as resilient as we thought? What does our architecture look like: is it resilient, and what opportunities do we have? We also asked, is there a way to reduce the complexity and the cost we incur every time we run a test? And finally, how could we make our processes more robust?

So we decided to enter the journey, and we took a guiding north star with us: to focus on improving the resilience of our mission-critical workload, which in this case is SAP, to look at how we could lower our MTTR, and to modernize the operations that go along with it.

So we set a few objectives: review best practices and architecture against something like Trusted Advisor, build consistent and repeatable mechanisms such as what we found with the testing, and, again, look at the operational processes that go along with it.

Our outcomes were pretty simple; they answered the questions we had asked. We wanted to come out with modern, lightweight DR testing scenarios with reduced complexity and cost. We also wanted to look at how we could enable a culture of testing across not only infrastructure but application teams, and finally, to drive conversation among the teams and the people who would be involved in recovering from an incident, looking at continuity and architecture.

So here's a quote that I shared with Vijay and Manik. The underlying thought here is that Resilience Hub and these services are new. One of the underlying reasons 3M moved from on-premises hosting to the cloud is to take advantage of new services like these. We want to be nimble, we want to be agile, and we want to bring our processes up to date right along with it. So of course we're going to join in and take on this new adventure, because that's what using the cloud is all about.

So here's our journey. As we jumped in, it started with a set of collaborative sessions, bringing together resources from our platform engineering team, SAP application team, and operations to look at our current abilities to detect, notify, and manage incidents.

Then they talked through how ready we are when we need to respond to a failure and how we recover. They did some deep dives looking at our tools and our processes. They talked about communication plans, response plans, many different pieces that come into this.

Then they engaged in some tabletop exercises, those what-if scenarios. What if we have a failure of an entire Availability Zone? What if we lose a cluster, a host, a mount point? How do we prepare for that, and how would we respond today?

The overall focus of these discussions and deep dives was really that single need: to find ways to reduce the MTTR. The outcome of this section of the program came down to just a few simple points.

We wanted to enhance our use of the AWS Health Dashboard, which we had already started to do, but we wanted to bring that into a single centralized view and also start using AWS push notifications to help us be more aware when there is a health event. Then we decided it was time to jump in and really look at the continuity plans specific to SAP, since that was our focus here. We already had a robust plan, and the team had practiced many, many details around recovery and response many times.
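
On the push-notification idea mentioned above: a minimal sketch of one way to route AWS Health events to an SNS topic with EventBridge. The rule name and topic ARN are placeholders, not 3M's actual configuration:

```python
import json
import boto3

events = boto3.client("events")

# Route all AWS Health events in this region to an SNS topic so the
# operations team is notified as soon as a health event is posted.
events.put_rule(
    Name="aws-health-to-sns",
    EventPattern=json.dumps({
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="aws-health-to-sns",
    Targets=[{
        "Id": "sns-notify",
        "Arn": "arn:aws:sns:us-east-1:123456789012:aws-health-alerts",
    }],
)
```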

But it was important that the teams collaboratively took both an infrastructure and an application view as they looked at these different runbooks. This phase actually showed that 3M had a lot of strengths on both the application and the infrastructure side.

The shining star was really the ramp-up and ramp-down procedures that our teams have practiced so many times during monthly maintenance or patching activities. Those are still solid gold, and they still stand. But the really exciting thing that came out of this is that the team started to think: wow, we should look beyond just the SAP core. There are many other surrounding systems, critical services, and functions we need if we're going to have end-to-end resiliency. That discussion then turned into: all right, let's look at a full Availability Zone failure. What is going to remain healthy, and what am I going to lose if I lose the Availability Zone? What is not going to be available, and what manual steps do I need to take?

So our takeaways from this section, other than that great collaboration, are that we do have some points of cross-team communication we could improve in some of these complex scenarios. There are also areas for increased automation, and for continuing to version our operational documentation so that we are ready for a recovery and a response. And finally, to keep doing that expanded end-to-end thinking, for not only SAP but our other critical workloads as well.

The next portion of the journey takes us directly into Resilience Hub, when we onboarded our resources into the tool. We selected several SIDs from our non-production environment, and working with Enterprise Support, the team began to load these through the console.

Well, quickly the team realized that our scale, the sheer size of our SAP deployment, was too much for the young service AWS had introduced. But no worries: the product team came to the table very quickly and provided us with scripting and the ability to quickly add large quantities of resources in order to build that application in Resilience Hub.

So then we ran our assessment, and we were ready for our score. Unfortunately, the results were surprising, and maybe not in the best way. Our score was a little lower than we expected. As we took a look at the results, the team quickly realized that Resilience Hub was not giving us credit for our data protection and backup strategy, which is quite robust, as Vijay mentioned.

But again, the AWS product team for Resilience Hub stepped up. They were eager to work with us and understand our complexities, and they turned around and built that recognition for the EBS snapshots and the EBS-backed AMI snapshots right into the tool. Our score went up, and it was much more realistic to our actual resiliency.

After that, it was time to discuss the recommendations coming out of the tool. Yes, there were definitely things we could improve on, but we were feeling confident with what we had architected and what they had given us.

Looking forward, the team wants to keep looking for more resilience at a lower cost, and to think about those possible hidden areas where we need to improve our resilience; we definitely want to find those before they create an impact in production. And finally, we're thinking about how we can test frequently and build this into automation, how we can build this into what we are doing with our CI/CD, to make sure that our resilience testing and our assessments are up to date and happening on a frequent basis.

The next part of the journey was to perform the testing, the simulations, the experiments using FIS. This was the fun part. We used it to test many points of our high-availability architecture, those complex fault scenarios. As Vijay mentioned, we had done many of these tests before on our own, manually or through some automation. But now, running them through the tool, we were easily creating templates and building something that could be put into the hands of the application teams. We were ready to expose any blind spots we might find through that simulation.

The real brilliance of working with FIS is that those templated tests and true fault simulations are something that can be handed off to our application team, reducing the dependence on our infrastructure teams. This will save us days: our typical testing would take days, and the same set of tests can now be executed directly by our application team in just hours.

And the final piece of our journey is around monitoring and observation. 3M was already leveraging the CloudWatch agent for infrastructure performance and availability metrics, but we were now introduced to application-specific metrics for SAP. CloudWatch metrics for SAP now provide additional insights directly about the application.

And by using another new feature, the CloudWatch Application Insights for SAP dashboard, we can see CloudWatch metrics for infrastructure performance and availability side by side with SAP-specific application metrics. All of this on a single pane of glass is going to be so helpful. We can put this right in the hands of the application team; they can use it to more quickly identify and troubleshoot an issue, and again, it reduces their reliance on other teams to support them.

So to wrap it up and sum it up, it's been a meaningful journey. As we've worked with Resilience Hub and with the Enterprise Support team, we found new and modern ways to test and validate our resiliency, and we found faster ways to recover. We've gained new insights, we've thought more deeply and broadly about our end-to-end resilience, we've driven a lot of cross-team communication, and we've created consistent templates that can now be used by our application teams, such as SAP. But we see this going much further: we've already started to onboard many other critical applications into the tool, following the same path. Overall, we hope this drives a culture of testing and accountability for resilience at our company.

So to wrap it up, I will hand it back. Thank you, Kim.

Yes. So, to bring this all together: part of the journey involved assessing the teams' readiness for events. Like we opened the session: things break all the time. So how ready are we? How do teams communicate, and how can teams remediate and restore operations? That doesn't happen quickly without runbooks, so we looked at the runbooks as well. Part of that was Enterprise Support activating some of the engagements and also bringing in the AWS services.

We talked about Resilience Hub, Fault Injection Service, and CloudWatch Application Insights, to have a holistic, end-to-end engagement. I would encourage you to reach out to your account team and ask about the end-to-end program we built for 3M, and definitely reach out to them so you can learn more about the final outcomes of such a program and how you can improve resilience for your SAP workloads.

Here are some quick references. I want to highlight the first one: we have published a recent update in our SAP on AWS documentation related to configuring RHEL clusters. There is also an update on how to automate incident response management and integrate it with CloudWatch.

And use CloudWatch Application Insights to monitor the health of your SAP system side by side with the infrastructure metrics. Last but not least, the Well-Architected Framework and the SAP Lens 2.0.

I just want to say thank you to everyone for coming here today.
