Optimizing with AWS Trusted Advisor and AWS Well-Architected Framework

Steven Salem: A quick introduction. My name is Steven Salem. I'm a Senior Solutions Architect on the AWS Well-Architected team. Today I have with me Arun Rajan, who is the Principal Product Manager for Trusted Advisor. And with us we also have Carlos Wiley. Carlos is an Enterprise Architect from Georgia Pacific. If you haven't heard of Georgia Pacific before, chances are you're actually using their products on a daily basis. They are a large manufacturer of pulp, tissue, and packaging products. Today, Carlos is going to share with us a little bit about their cloud optimization journey in one of their workloads.

This is going to be a level 300 session, so we're going to focus predominantly on the technical aspects of cloud optimization, specifically how Trusted Advisor and Well-Architected can help us with that.

Here is a quick agenda. I'll first level set with everyone on what cloud optimization is, at least in the context of today's presentation. Then we'll cover some of the ways you can conduct optimization discovery in AWS, looking for optimization opportunities and prioritizing them. We'll close with a story from Georgia Pacific about their optimization journey in one of their workloads, leveraging Trusted Advisor and Well-Architected to improve their resiliency and governance as well.

Alright, let's get started. I thought I'd begin by level setting on the context of cloud optimization, because depending on which media you read or who you ask, this term somehow has different meanings. Some people think it's just about cost reduction; for others it's performance improvement. So which is it, exactly?

Well, the term cloud optimization actually scopes beyond just these two items. It is anything that you do in your architecture, whether in building it or running it, to make sure it has the right set of capabilities to deliver the business outcome at its maximum potential. And the domains go well beyond cost and performance: security, operations, and many others.

What I'm going to do next is give you a better picture of that using an example, and in this example I'll take the context of the reliability of a workload. Let's say you are hosting an application, a very popular SaaS or ecommerce website, and it comes with a very high availability requirement. To achieve that requirement, you've deployed the application as three independent replicas, one in each Availability Zone. In front of them sits a Network Load Balancer, and Route 53 handles DNS resolution.

If you're not familiar with this architectural pattern, it's what we commonly call the AZ-independent architectural pattern. It's actually common when you have an availability requirement so high that you can't really meet it with the conventional multi-AZ deployment model. In any case, that's just for context.

The idea here is that if one replica goes down, for whatever reason, your customers running on the other two replicas can carry on. It's a way of containing, of managing, the blast radius.

So this application is working fine and you're serving your customers happily. But as Werner Vogels once said, everything fails all the time, and sometimes the failure is human. One night you sleep too late, and the next day you deploy some code. The code causes issues; it's bad code. Your application starts seeing performance degradation, customers start escalating, things are going slow.

Because this is a very high-visibility, high-impact application, your first priority is to recover the customer experience. The quickest way to do that, by design, is to cut off the replica where you've deployed the broken code, so that all of your traffic goes to the healthy replicas.

It's a very simple objective, but the logistics of achieving it are not quite so simple. There are obviously multiple ways people approach this, but typically what they've done is deploy zonal endpoint records for the load balancer in Route 53.

If you're not familiar with zonal endpoints: a zonal endpoint is essentially a DNS record that points directly to the load balancer node serving a specific Availability Zone, and you can see which Availability Zone it serves from the prefix of that DNS name.

Now, the idea is that if you want to recover, you want to stop traffic from going to the Availability Zone that has the broken replica, you call the Route 53 API to delete that record, effectively stopping traffic from going there. That's the recovery plan.
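As a rough illustration, that control-plane recovery step might look something like the following boto3 sketch; the hosted zone ID, record name, and zonal DNS value are placeholders, and the record attributes must match your actual zonal record exactly for the DELETE to succeed.

```python
# A minimal sketch of the control-plane recovery described above; all
# identifiers are hypothetical placeholders.
import boto3

route53 = boto3.client("route53")

def delete_zonal_record(hosted_zone_id: str, record_name: str, zonal_dns: str) -> None:
    """Delete the zonal endpoint record so traffic stops flowing to that AZ."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Emergency: stop routing to the impaired AZ",
            "Changes": [{
                "Action": "DELETE",
                "ResourceRecordSet": {
                    "Name": record_name,               # e.g. "store.example.com"
                    "Type": "CNAME",
                    "SetIdentifier": "use1-az1",       # which zonal record this is
                    "Weight": 1,
                    "TTL": 60,
                    # The zonal NLB endpoint for the impaired AZ
                    "ResourceRecords": [{"Value": zonal_dns}],
                },
            }],
        },
    )
```

The point to notice is that this recovery path depends entirely on the Route 53 control plane answering that API call.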

So what's wrong with this? It looks like it's working well. But if you think about it further, by implementing this architecture with three independent zonal records, you're essentially introducing additional complexity into your architecture. And as your estate expands into thousands of workloads to manage, that complexity compounds, and with it the operational overhead.

But the most important issue with this approach is that when you call Route 53 to delete that record during recovery, you're relying on the Route 53 control plane. That runs contrary to AWS best practices, because it exposes you to additional risk.

So even though this way of working works for you, it can be achieved this way, it's not a very optimized way to architect.

So how can we do this better? Well, Route 53 has a service called Application Recovery Controller, and since May this year there is a capability that has gone GA called zonal shift. With zonal shift, you can get Route 53 to stop sending traffic to a certain Availability Zone without relying on the control plane; all of these activities are orchestrated directly at the data plane. So you're complying with the best practices and not exposed to that risk.
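For comparison, here is a minimal sketch of starting a zonal shift through the ARC zonal shift API; the load balancer ARN and Availability Zone name below are placeholders.

```python
# A minimal sketch of shifting traffic away from an impaired AZ with
# Route 53 ARC zonal shift; the resource ARN and AZ name are placeholders.
import boto3

arc = boto3.client("arc-zonal-shift")

response = arc.start_zonal_shift(
    resourceIdentifier=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "loadbalancer/net/my-nlb/0123456789abcdef"
    ),
    awayFrom="use1-az1",     # the Availability Zone to move traffic away from
    expiresIn="2h",          # zonal shifts are temporary and expire on their own
    comment="Bad deployment in this AZ; shifting while we investigate",
)
print(response["status"])   # e.g. "ACTIVE" while the shift is in effect
```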

On top of that, doing it this way leaves the instances in that Availability Zone intact for further investigation later, if you need them.

Implementing your architecture in this model is much more optimized: it introduces less complexity and less operational overhead. Most importantly, this capability is offered at no additional cost, so it doesn't affect your TCO. Better bang for your buck.

So this is essentially what we mean by cloud optimization: anything that you do to build and run your workload to bring the most value possible to your business objective, because that's what your workload is for. This is just to set the context and to give you an example of one of the many optimizations that can happen in a workload, here in the context of reliability. In reality, this can happen in the areas of operations, security, cost, and performance as well.

Now that I've level set on what cloud optimization is, let's dive deeper into the nature of this optimized state of the workload. For that, can I get a quick hands up: have you heard about the EC2 capability called instance store? Awesome, some people know about it.

Well, if you know about instance store, chances are you've been using AWS for quite some time, because back when EC2 was first released in 2006, instance store was the only block-level storage available for EC2.

To give you some context on what instance store is: it's a temporary storage attached to the EC2 instance, backed by the physical drives attached to the actual machine where the EC2 instance is running. Because of that, it has a special nature: the instance store has a limited size depending on the instance type you choose to run, and it's ephemeral, which means that if you stop the instance and start it again, your data is gone.

As you can imagine, if you run an application on that, you need to make the application fully stateless. For some applications that may not be a problem, but for others, especially if you don't have access to the way they're coded, it can be challenging.

Despite what I've just mentioned, for quite some time this was the only way to access block-level storage natively from an EC2 instance, and in a way it was the only optimized way available to you. That was, of course, until 2008, when we released the EBS volume. As you know, EBS is a service for persistent block storage, and it behaves more like a normal disk.

You can store your data on the volume, detach it, and attach it to another EC2 instance, and you don't have to worry about losing the data if the instance goes down. I mean, if you stop your EC2 instance you should still take snapshots, but the data is persistent.

Since then, as you know, EBS has been the de facto choice for every EC2 instance that you launch. So if you have a workload that can't run fully stateless and needs local storage, an EBS volume is the more optimized way to run that workload.

This is just one of many examples of optimization opportunities, of course. But why am I telling you this? In case you haven't noticed, this so-called optimized state of your workload actually moves along with the new capabilities that become available to you. This is just one example: EC2 instance store was once the only way, and then EBS volumes became the new norm for running block-level storage.

But we've got other capabilities on AWS as well, including Application Recovery Controller. Essentially, as time goes on, new capabilities are made available to you, opening up new possibilities to run your workload in a more optimized way. So the target is continuously moving over time.

And this is really driven by three things, as you can imagine. The first is industry trends; that's the continuously moving nature of it. With that come growing business requirements. And along with those, AWS continuously releases new capabilities, as you know, we're at re:Invent at the moment, to make sure you're able to build and run your workload in a more optimized way.

The next question that typically comes after I explain this is: how do I know what good looks like, given there are so many moving parts? How do I know that the workload I'm shaping will meet my business objective? You can build a bridge out of wooden planks or out of concrete. Both are technically bridges, but you know one of them can't deliver the same quality of business outcome. The quality really matters.

The other aspect is: what guidance can point you toward the areas you can optimize? Well, this is what the AWS Well-Architected Framework is for. If you're not familiar with it, in a nutshell it is a collection of high-level best practices and guidance to help you build and run your workload in the most optimal way on AWS.

There are different domains that the framework covers. Currently there are six areas the framework deems important, which we commonly call the Well-Architected pillars: operational excellence, security, reliability, cost optimization, performance efficiency, and sustainability.

Using the Well-Architected pillar structure, along with the structure of the framework itself, you can use it as a cornerstone to guide you toward the areas you can optimize.

To give you an example...

Let's bring back the first workload we mentioned at the beginning, your ecommerce website. Given its high availability requirement, you'll be asking yourself: what can I do to prepare this workload for failure?

For that, you could use the Reliability pillar of Well-Architected, in the focus area of failure management. Underneath that you will find a collection of best practices that help you design your workload to withstand failure, and within that, itemized best practices that guide you, in this example, toward leveraging the data plane of services rather than the control plane. This speaks to the Route 53 scenario I just described.

Within this best practice, the Well-Architected Framework also provides deeper prescriptive guidance toward services you could use, in this case Application Recovery Controller. So this is just to paint a picture of one of the many ways you can use the Well-Architected Framework to guide you toward areas of improvement in your workload.

And of course, this is just one example of the best practices in practice. AWS Well-Architected has about 300 best practices, and the list is continuously growing as things keep moving. That's part of what we do in Well-Architected.

All right. So that's a level set on how you can use the Well-Architected Framework to guide you toward optimization. The next question is how we should approach optimization. Again, there are many moving parts, and we know that optimization requires investment from your organization: every time you look for opportunities to optimize, you spend time and resources. Because of that, cloud optimization needs to be approached as a continuous, iterative journey; you essentially need to include it as part of your continuous improvement process.

And most importantly, every cycle of this optimization needs to be justified by a clear business outcome, because ultimately an investment needs a return. There are three typical motions you go through to achieve this.

The first is to learn about the best practices themselves. The second is to measure and compare your workload against those best practices, how aligned, or not, your workload is. And the last is to improve, gradually making improvements.

When you're doing this measurement, what it essentially means is conducting an exploration of the workload, and there are typical areas you need to look at. The first is the system configuration, the technology element of the workload: which services you use, with which configurations, and how that aligns with best practices and with your use case.

But looking at the technology aspect alone is typically not enough, because for a workload to be fully optimized it also requires the right people and processes around it. That's how you look at a workload holistically. For this part of the discovery, you look at how you structure your organization and what processes you have in place around the workload, to make sure it's ready to be operated and run in the most optimal way.

There are two typical methods of conducting this discovery. The first is automated discovery, which essentially leverages AWS tools to understand your system configuration, run a scan, and find out what the delta is. The other is a sit-down exercise, meeting with your organization or stakeholders to understand the people and processes around the workload where you need to make improvements. This is what we commonly call the Well-Architected Framework review, if you've heard of it.

It is essentially an exercise you do together with the stakeholders of that workload to understand where and how it aligns with AWS best practices. It doesn't really matter in which order you do these; the most important thing is that you actually generate the improvements. One key message I'd like to emphasize here is that a discovery activity is only as good as the improvements it actually generates. You want to make sure that the investment you make creates a clear business outcome.

Now, to dive deeper into the composition of discovery, let's look at the tools you can use to help with it. You could use the AWS Well-Architected Tool, a console-based tool that you can access directly in your AWS console, for free, in every AWS region. The idea of this tool is to give you better guidance for the conversations with your stakeholders, because every best practice and every question has a different target persona in your organization. Leveraging the curated questions and best practices, you can have better-targeted conversations. Along with that, the tool provides capabilities to help you prioritize the question base according to your business context; this feature, released this year, is called profiles.

With profiles, you can essentially create a prioritized list of the questions you need to ask your stakeholders, depending on the business context you put in. The other feature released this year is review templates, a way to prefill some of these questions. The idea is that if you want to scale a discovery practice in your organization, some things may already be implemented and some may be repetitive, and with review templates you can address that repetitiveness at scale by prefilling those question sets. So that's the Well-Architected Tool.
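The tool is also scriptable. As a hedged sketch, assuming you already have a workload defined in the Well-Architected Tool, you could pull its high-risk answers like this; the workload ID below is a placeholder.

```python
# A minimal sketch of querying a Well-Architected review programmatically;
# the WorkloadId is a placeholder for an existing workload in the tool.
import boto3

wa = boto3.client("wellarchitected")

response = wa.list_answers(
    WorkloadId="0123456789abcdef0123456789abcdef",
    LensAlias="wellarchitected",     # the core framework lens
    PillarId="reliability",          # narrow to one pillar, e.g. reliability
)
for answer in response["AnswerSummaries"]:
    if answer.get("Risk") == "HIGH":
        print(answer["QuestionId"], "-", answer["QuestionTitle"])
```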

Again, that's really a tool that guides you through the conversational discovery model. Now, what about automated discovery? What can we use in AWS to run a scan and understand the delta between our service configurations and AWS best practices? This is where AWS Trusted Advisor comes in: a fully managed service that is available to you on AWS if you have Business Support or above.

What it does is run a continuous scan of your environment, your AWS account, against 400-plus best practice checks, now spanning 47 AWS services. What Trusted Advisor essentially gives you is tangible insight into your system configurations that you can pursue further toward optimization. Beyond that, Trusted Advisor also provides programmatically extendable capabilities, so you can implement best practices in an automated fashion. And for that, I'll hand it over to Arun to talk a little bit more about Trusted Advisor checks.

Arun Rajan: Thank you, Steven. Hello everyone, it's a pleasure to be here. Thank you for having us and for feeding your curiosity and learning around this topic, which is near and dear to some of us: cloud optimization. In the next few minutes, I'm going to cover in a bit more depth how Trusted Advisor, working in concert with Well-Architected, can help you in your cloud optimization journey. But first, a quick baseline, maybe one to two minutes, so everyone in the room is on the same page about Trusted Advisor, and then we'll start digging deeper. Trusted Advisor, as Steven pointed out, is essentially a service that scans your resource configurations, resource usage, and workload architecture, compares them against specific best practices that are foundational to deploying on AWS, and surfaces the deviations it detects. It then informs you about actions you can take to improve your workload configuration and your cloud environment to optimize further.

Today, as Steven said, we have 400-plus, the last count was 467, Trusted Advisor best practice checks spanning 47 services, and that continues to grow as we discover additional best practices that need deviation detection. All these capabilities are available in your account if you have AWS Business Support or a higher tier of support.

Connecting back to what Steven said, the cloud optimization cycle has three main phases: learn, measure, and improve. By virtue of how Trusted Advisor works, it already covers the learn and measure phases. It provides you with the context of a best practice as it applies to a service, and it provides the measure, which is the detection of the deviation from the best practice, along with recommendations on how to improve. But in addition, you can customize and leverage Trusted Advisor for the third phase, the improve phase, which is not so obvious. To take advantage of that, you can leverage Trusted Advisor's integrations with other services; we'll get to that.

Let's talk about ways in which you can access Trusted Advisor, and some of the data that Trusted Advisor presents to you. The first is visual inspection through the AWS Management Console. The second is programmatic access to Trusted Advisor check results through the API. The third, and this is where we'll dig deeper and it gets more interesting, is the integration with Amazon EventBridge, which further allows you to extend actions through other AWS services to take cloud optimization improvement actions. But let's also talk about the data that Trusted Advisor presents.

There are three key pieces of information I want to point out here. One is the resource being scanned in your account. The second is the status of that resource compared to a specific best practice. And the third is the recommendation: how to investigate, what actions you could take and in what order, and some helpful resources for taking those actions.
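To make the programmatic access concrete, here is a minimal sketch using the AWS Support API, which backs Trusted Advisor; it requires Business Support or above, and the Support API endpoint lives in us-east-1.

```python
# A minimal sketch of reading Trusted Advisor results programmatically via
# the AWS Support API (Business Support or above; endpoint is us-east-1).
import boto3

support = boto3.client("support", region_name="us-east-1")

checks = support.describe_trusted_advisor_checks(language="en")["checks"]
for check in checks:
    result = support.describe_trusted_advisor_check_result(
        checkId=check["id"], language="en"
    )["result"]
    if result["status"] != "ok":                 # surface only deviations
        print(f'{check["category"]}: {check["name"]} -> {result["status"]}')
        for res in result.get("flaggedResources", []):
            print("   ", res.get("status"), res.get("resourceId"))
```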

Now that we've baselined on this, let's dig into how you can instrument your improve phase using Trusted Advisor, with the Amazon EventBridge integration. I'm going to use a specific check as an example, but this can be extended and replicated across other checks as well. The check I'm focusing on is the Exposed Access Keys check, which falls under the security category.

This check alerts you if it detects that an IAM access key tied to your account is exposed in any public repository we can find. As you can imagine, exposing your IAM access key in public repositories is never a good thing. But failures happen, whether by humans or by machines. So what do you do? How do you respond to such an incident in an automated fashion? That's what we'll look at.

Using the Amazon EventBridge integration, you can invoke AWS Step Functions whenever EventBridge receives an event for this particular check. Step Functions, in a nutshell, allows you to orchestrate activities that together achieve a particular outcome. In this example, I'll talk about three activities you can orchestrate for this incident response; a sketch follows after the third.

The first is that you can have Step Functions trigger a Lambda that calls IAM, the Identity and Access Management service, to delete the key that's been exposed. This immediately prevents malicious actors from using it, even those who have already copied it, because it's removed from your account. The second is that you can define a look-back period, whether that's one year or the lifetime of the exposed access key, query AWS CloudTrail for logs of the actions performed in your account using that access key, and compile a digest of them.

And lastly, you can take that digest, along with the information Trusted Advisor presented to you, when the incident was discovered, when action was taken, and what actions were performed with that exposed access key during the look-back period, and send it as a notification to your security team through Amazon SNS, the Simple Notification Service, so they can decide what further investigation is needed.
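Collapsed into a single hypothetical Lambda handler for brevity (the actual solution splits these into separate Step Functions states), the three activities might look like this sketch; the SNS topic ARN, the look-back window, and the exact event field names are assumptions.

```python
# A hedged sketch of the three response activities; in the open source
# solution these are separate Step Functions states. The SNS topic ARN,
# look-back window, and event field names below are assumptions.
import json
from datetime import datetime, timedelta

import boto3

iam = boto3.client("iam")
cloudtrail = boto3.client("cloudtrail")
sns = boto3.client("sns")

SECURITY_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:security-alerts"  # placeholder

def handler(event, context):
    # EventBridge delivers the flagged resource metadata in the event detail;
    # the key names mirror the check's metadata columns and may vary.
    detail = event["detail"]["check-item-detail"]
    user_name = detail["User Name (IAM or Root)"]
    access_key_id = detail["Access Key ID"]

    # 1) Delete the exposed key so nobody can use it anymore.
    iam.delete_access_key(UserName=user_name, AccessKeyId=access_key_id)

    # 2) Compile a digest of what was done with that key over a look-back period.
    events = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "AccessKeyId", "AttributeValue": access_key_id}
        ],
        StartTime=datetime.utcnow() - timedelta(days=365),
        EndTime=datetime.utcnow(),
    )["Events"]
    digest = sorted({e["EventName"] for e in events})

    # 3) Notify the security team for further investigation.
    sns.publish(
        TopicArn=SECURITY_TOPIC_ARN,
        Subject=f"Exposed access key {access_key_id} deleted",
        Message=json.dumps({"user": user_name, "actions": digest}, default=str),
    )
```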

To help you get started with solutions like this, we've added this particular solution to the Trusted Advisor Tools open source repository. You can get it from that QR code, or just search for Trusted Advisor Tools and you should be able to find it easily.

Digging deeper into best practices and connecting back to what Steven talked about with Well-Architected: Trusted Advisor checks fall under specific categories that are purposefully built to help you optimize in specific areas. The ones that have been there for a while include Cost, Performance, Security, Fault Tolerance, and Service Limits. These are very well aligned with the AWS Well-Architected pillars; even if the same terms aren't used, you can map an AWS Trusted Advisor check to a Well-Architected pillar and to the specific best practice in AWS Well-Architected.

Last month, to further entrench this alignment, we launched a new category in AWS Trusted Advisor for Operational Excellence. This category provides best practice checks, today I believe we have 28 of them, and increasing, to cover your operational readiness for areas such as observability.

In addition to launching a new category, we've also launched a new data source. A little primer here on Trusted Advisor: it has best practice checks from individual services like EC2, S3, and Lambda, but it also has best practices based on signals we find in other specialized hubs in AWS. That list today includes services like Resilience Hub, Security Hub, and Compute Optimizer.

And last month we launched an integration with AWS Config. What this enables us to do is use Config managed rules installed in your account as signals on resource configurations that could impact your best practice posture, which allows us to layer in optimization recommendations from that data source. Let's dig deeper into how this works.

Here's an example. The one I've chosen is Amazon API Gateway not logging execution logs. Interestingly, this example falls in both the new Operational Excellence category and the AWS Config integration. More about the specific check: it looks at your API Gateway configurations and alerts you if you're not storing execution logs in CloudWatch, which is a configuration setting that API Gateway provides to you.

Why is this important? Because from an operational readiness standpoint, having these logs accessible after executions have happened provides a wealth of information for triaging issues, monitoring health, and tracking the performance of your APIs: all the API calls being made, whether they errored or went through. All of that information becomes available in CloudWatch if you just turn on those logs.

The way it works is that if you have the corresponding AWS Config managed rule installed in your account, you don't need to do anything else. Trusted Advisor automatically pulls it in as a signal and layers the best practice context around it: the context, the status, and the recommendation on how to enable the setting, including helpful resources.
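For reference, here is a minimal sketch of installing that managed rule with boto3; API_GW_EXECUTION_LOGGING_ENABLED is the managed rule identifier, and this assumes an AWS Config recorder is already running in the account.

```python
# A minimal sketch of installing the AWS Config managed rule that feeds
# this Trusted Advisor check; assumes a Config recorder is already enabled.
import boto3

config = boto3.client("config")

config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "api-gw-execution-logging-enabled",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "API_GW_EXECUTION_LOGGING_ENABLED",
        },
    }
)
```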

So that's some more on the AWS Config integration. We've talked about how EventBridge automation can help you with the improve phase, covering the three phases of the continuous cycle, and about some of the new data sources and categories Trusted Advisor has launched.

Now let's shift gears into the second topic Steven talked about: prioritization. With 400-plus best practice checks from Trusted Advisor that map to 300 best practices at the Well-Architected Framework level, you can imagine there might be a lot of opportunities presented to you. How do you decide where to start? Connecting this back to the optimization cycle: at every point in the cycle where you invest effort and energy, you need to justify it based on the return on investment. There are many ways to do this, but I want to give you a few ways to get started, which you can then customize. Two signals you can get from AWS Trusted Advisor and Well-Architected are urgency and business impact.

Urgency is about how critical it is to act immediately versus whether it can wait. Business impact refers to how the underlying business improves if you take that action and become more optimized. Now, taking those dimensions one by one.

AWS Trusted Advisor provides the urgency dimension through the status of the Trusted Advisor check, which is a signal for severity. There are three statuses today in Trusted Advisor: green, or OK, which means no deviation detected and no action needed; yellow, or warning, which implies investigation is needed, and you get to decide what action, if any, you want to take; and red, or error, which says immediate action or investigation is required.

That gives you a sense of the urgency. For the business impact, look no further than the Well-Architected Framework, which details the level of risk each best practice carries if you haven't optimized toward it. These risk levels come in three tiers: low, medium, and high, and they can be found in the documentation that comprises the framework as well as in the Well-Architected Tool.

Now that you have these two dimensions and know how to access them, how do you layer a prioritization strategy on top? Again, you can customize this to the needs of your business, but one example is a variation of the Eisenhower matrix, which quite simply creates four quadrants, and you choose how to prioritize each quadrant. Here I've shown a simulation of how you can map your opportunities across the four quadrants; you might decide to start with high business impact and high urgency, then layer in the other quadrants over time. A sketch of this mapping follows.
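As a minimal sketch of that mapping, treating the Trusted Advisor status as the urgency axis and the Well-Architected risk level as the impact axis:

```python
# A minimal sketch of the quadrant mapping described above; the Trusted
# Advisor status supplies urgency, the Well-Architected risk supplies impact.
def quadrant(ta_status: str, wa_risk: str) -> str:
    urgent = ta_status == "error"          # red = act now; yellow can wait
    impactful = wa_risk == "HIGH"          # per the Well-Architected risk tiers
    if urgent and impactful:
        return "1: do first"
    if impactful:
        return "2: schedule"
    if urgent:
        return "3: quick win / delegate"
    return "4: revisit later"

print(quadrant("error", "HIGH"))    # -> 1: do first
print(quadrant("warning", "LOW"))   # -> 4: revisit later
```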

You can make it a little more customized: if specific best practices or specific checks are more important to you, you can take a nonlinear approach to prioritization. By the way, if your account has AWS Enterprise Support, you already have a dedicated team of Technical Account Managers who collaborate with you.

Last year, we launched Trusted Advisor Priority. If you don't want to take a DIY, do-it-yourself, approach to prioritization, you can rely on your dedicated Technical Account Manager team to help you. Your account teams are entrenched in your business; they're in your meetings, they understand your priorities, and they know how to bubble up the most impactful recommendations. That's exactly what they do with Trusted Advisor Priority. If you're a business unit leader who owns multiple applications, or a C-suite team that cuts across multiple departments in your org, you can look at these results at an organizational level, across the accounts in your organization.

But if you're at the doer level, say a DevOps engineer, a cloud builder, or an application architect, you can also see it at a member account or individual account level, and then collaborate. This allows you to collaborate across personas in your organization in a tracking dashboard to burn down the most impactful opportunities, take action, and get more optimized.

On the other hand, if you want a custom, do-it-yourself approach, then connecting back to the two dimensions we talked about: there's another project we launched in the Trusted Advisor Tools open source project, called Optimization Starter, that gets you a running start on capturing the data for those two dimensions in a single place.

The way it works is that this solution uses the AWS Systems Manager Automation capability to execute calls to Trusted Advisor and Well-Architected, as well as parsing the Well-Architected Framework, to put all the dimensions and the data in one place. Let me illustrate what I mean by that.

Here's a sample row from the HTML report that Optimization Starter creates. It's hard to read at this size, but let me point out what each element refers to. Each row represents a best practice check tied to a specific resource, and it captures the Trusted Advisor information side by side with the AWS Well-Architected best practice information. In addition, it presents the two dimensions we talked about.

So you have all this information in a single place, and you can choose how to layer your prioritization strategy, or your approach to what to act on next, on top of it.

Now, speaking of cloud optimization journeys, while all of this helps you, it's my pleasure to have Carlos Wiley here from Georgia Pacific, who will talk a little bit more about Georgia Pacific's own cloud optimization journey. Over to you, Carlos.

Carlos Wiley: Thank you. Good afternoon, everyone. My name is Carlos Wiley. I'm one of the enterprise architects at Georgia Pacific, with my main focus being cloud modernization as well as disaster recovery. For those who don't know, Georgia Pacific is one of the largest manufacturing companies when it comes to paper, wood, and pulp products. Over the next couple of slides, I'm going to take you through the cloud optimization journey we went on with one of our applications:

how we utilized the Well-Architected Framework as well as some of those tools to help with governance. To get started, I really want to show you the importance of the application first, and then work through the solution that way.

A lot of our applications run on ERP systems that handle our core business functionality, including things like purchasing, shipment, and quoting, and the way we implemented this one was with SAP HANA.

You can see on the slide that we have a database server as well as a handful of application servers in this solution. This is a tier one application, and it's very important to have it up at any given time.

From a business standpoint, there are certain recovery time objectives we need to adhere to, which basically say that if anything goes down, we can bring our applications back up within a set amount of time. And also, as you'll see as we go further along, the SAP HANA database handles a lot of our business data, so we have to make sure we don't lose too much of it; we have to hit our recovery point objectives for those business use cases as well.

A visual of this, just to show its importance: as different products come through the production line, things like Angel Soft or our Dixie plates, they reach the end of it and we put them in storage. While they're in storage, our ERP system is taking orders and getting them ready for shipment.

From there, we put the product on a truck and send it out to our customers. Now, if we can't hit our RTO and RPO within that set amount of time, the whole process breaks down. We're no longer accurately storing products,

we can't fulfill orders, and we can't get orders to our customers. So we wanted to make sure that this was, one, very resilient to failure, and two, that we had a very strong recovery strategy when we were building it.

The first thing we wanted to do was make sure it was highly available. You can do this a couple of different ways: put it in multiple locations, either multi-AZ or multi-region. From a cost perspective, we decided that with our risk profile, multi-AZ was good enough to meet that requirement.

What you can see here is that we have our primary database as well as our secondary database running in two Availability Zones, and we split our application servers across those Availability Zones too. This avoids putting all your eggs in one basket.

So whenever an AZ goes down, we're still able to run in a degraded state while we initiate our DR plan to get all of our applications back up and running. Now, that was just the first part. The second thing we wanted was a strong recovery strategy, and this is where we used replication tools.

The first one is CloudEndure. On all of our application servers, we have CloudEndure replicating the OS as well as the block storage. This allows us to automatically fail over if one of the application servers goes down. But like I mentioned earlier, the brain of the operation is that HANA database.

For that one, to make sure that when we come back up we can still fulfill the orders that were in our system, we use Pacemaker-managed replication to keep both databases synchronized with each other. If one goes down, we can bring the other one up and fulfill the orders there.

Looking at both of these, being highly available but also having a really strong DR strategy with replication: there have been multiple times when one or two of our application servers went down, and instead of the business knocking on the door saying, hey, we cannot fulfill orders, CloudEndure and Pacemaker were both able to fail over automatically. From a business standpoint, they saw a hiccup, but there was no impact to their bottom line.

And because this is spread across multiple Availability Zones, in the cases where we had failures in one Availability Zone, we were still able to run and maintain operations while we initiated our DR plan.

The next thing we wanted was a really strong monitoring solution. Observability is really valuable because it allows us to be more proactive than reactive; at the end of the day, we really don't want the business calling to tell us there's an issue. The way we solutioned that here was with CloudWatch metrics, as well as other tools like LogicMonitor, to baseline what a good state looks like. Once a metric leaves that good state, it raises an alarm and sends it through our ticketing queue. This allows our support and application teams to resolve issues before they become larger ones and bring down our production servers.

Bringing back what Steven brought up about continuous discovery: we didn't want to just set up monitoring and leave it, so there are cases where we continue to add more monitoring as we find other issues. A good example was our EBS volumes getting throttled, which took our application servers down. It took us a while to figure out the root cause, but once we found it, we added more monitoring; the root cause was AWS quota limits, which we were hitting too frequently. So we improved our systems and our monitoring, and now we have visibility any time we're about to hit our quotas, so we can put a request in to AWS to increase them.
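As a hedged sketch of that kind of quota monitoring, assuming the quota in question publishes usage metrics to the AWS/Usage namespace (dimensions vary by service and quota; the EC2 vCPU quota is used here purely as an example), an alarm at 80% of the applied quota might look like this:

```python
# A hedged sketch of alarming before a service quota is hit, using the
# SERVICE_QUOTA metric-math function; the dimensions below are an example
# (EC2 On-Demand vCPUs) and vary by service and quota.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="approaching-service-quota",
    Metrics=[
        {
            "Id": "usage",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Usage",
                    "MetricName": "ResourceCount",
                    "Dimensions": [
                        {"Name": "Service", "Value": "EC2"},
                        {"Name": "Resource", "Value": "vCPU"},
                        {"Name": "Type", "Value": "Resource"},
                        {"Name": "Class", "Value": "Standard/OnDemand"},
                    ],
                },
                "Period": 300,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        },
        {
            "Id": "pct",
            "Expression": "usage / SERVICE_QUOTA(usage) * 100",  # % of quota used
            "ReturnData": True,
        },
    ],
    Threshold=80,
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=1,
    AlarmDescription="Usage above 80% of the applied quota",
)
```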

The next one: once again, when this application goes down for longer than, say, an hour, there is definite financial impact. We wanted reassurance that at any given time we could bring it back up, even during a widespread issue when other companies are trying to come back up too. So we decided to start using Capacity Reservations for this implementation. That really gives you reassurance that at any given time you can bring the system back up with those reserved resources.

Now, typically these reservations sit idle, so we went from paying for dev, QA, and prod to paying for dev, QA, prod, and the idle reserved capacity. That multiplied the cost one more time, but it does give us that reassurance.

Cost efficiency is a really big thing for us, so the question became: how can we be more cost efficient without lowering our DR tier? Working with our SAP solutions architects, they came up with a great idea: repurpose those Capacity Reservations for our QA systems. Our QA systems now run on top of them, so we're back to paying for just dev, QA, and prod, with the reassurance that we can always bring up production.

In a DR event, we would simply terminate our QA systems, spin up our production servers, and once we're ready to go back to our ideal state, re-provision the QA systems. This allowed us to bring cost optimization into the mix of our resiliency while keeping that additional insurance for a widespread issue, not just one isolated to this application.
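For illustration, creating such a reservation might look like the following sketch; the instance type, platform, AZ, and count are placeholders for whatever the production tier requires.

```python
# A minimal sketch of creating an On-Demand Capacity Reservation; all
# values are placeholders sized to the production tier in question.
import boto3

ec2 = boto3.client("ec2")

reservation = ec2.create_capacity_reservation(
    InstanceType="r5.8xlarge",            # e.g. sized for the HANA tier
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-1a",
    InstanceCount=2,
    InstanceMatchCriteria="targeted",     # only instances that target it consume it
)
print(reservation["CapacityReservation"]["CapacityReservationId"])
```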

So that's just looking at one application, and a lot of times we have multiple applications in one of our accounts. So how do you actually govern that at scale?

We started to leverage AWS Resilience Hub to help with that. With it, we were able to put all of our RTOs and RPOs into all of our accounts. This has the benefit of reducing the need for our developers to know what those RTOs and RPOs are; they can just select their tier and get validated against it. We also have automated assessments, and this is really important because we're trying to get to a point where our application teams and developers focus on the application itself, removing a lot of the manual steps.

We went from, on average, I want to say one to two hours of manual assessments across multiple people, down to about five minutes through the automated assessments. And finally, with those assessments, we also get the recommended remediation paths.
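As a rough sketch, one of those automated assessments can be kicked off programmatically; the application ARN below is a placeholder.

```python
# A minimal sketch of starting a Resilience Hub assessment; the app ARN
# is a placeholder for an application already onboarded to Resilience Hub.
import boto3

resiliencehub = boto3.client("resiliencehub")

assessment = resiliencehub.start_app_assessment(
    appArn="arn:aws:resiliencehub:us-east-1:123456789012:app/example-app-id",
    appVersion="release",                     # the published application version
    assessmentName="nightly-dr-validation",
)
print(assessment["assessment"]["assessmentStatus"])  # e.g. "InProgress"
```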

One thing we noticed concerned SQS: a dead-letter queue is what you need to have configured to have a strong DR posture there, and that's not something our developers knew intuitively. Using Resilience Hub, we were able to see that trend; they resolved it, and now it's a standard that whenever you're using SQS in a tier one application, you attach a dead-letter queue, as in the sketch below.
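A minimal sketch of that standard, attaching a dead-letter queue to an SQS queue through its redrive policy; the queue URL and ARN are placeholders.

```python
# A minimal sketch of attaching a dead-letter queue via the redrive policy;
# queue URL and DLQ ARN are placeholders.
import json
import boto3

sqs = boto3.client("sqs")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
            "maxReceiveCount": "5",   # move a message after 5 failed receives
        })
    },
)
```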

So that's how we look at it for multiple applications within one account. Now take that one step further: when you have multiple accounts and you're trying to get governance at scale, how does that work?

We started to leverage Trusted Advisor reporting. This gives us the benefit of seeing all of our Resilience Hub assessments in one view. We also get the benefit that for any application that's not part of Resilience Hub, we can still see its resilience score. So we get those assessments all in one place, and for anything that wasn't part of Resilience Hub, we can still validate from a governance standpoint that it has a strong DR process.

Now, as Arun mentioned earlier, the checks give you different recommendations: green, which is good; yellow, a warning you might want to look at; and red, where you need to take action.

We do have that conversational discovery with our business as well as our application teams to figure out whether we want to take action on everything we see. There are some business cases, whether for cost or other optimization trade-offs, where we're OK carrying the risk of certain yellows in our environments; we just want to keep having that conversation with the business so they understand it and can make informed decisions.

With those two, that's how we get governance at scale: at the account level with Resilience Hub, and across multiple accounts with Trusted Advisor.

What I want to end with is what we learned as we went through this optimization journey.

Once again, this was one particular application, but this is what we do for all of our tier one applications to make sure we can reduce the business impact of failure. We have an Architectural Review Board, which is very similar to the review process that the Well-Architected Framework has.

One thing we noticed was that our questions were very generic: hey, do you have a DR plan? What tier is it? We started to take some of the questions and standards from the Well-Architected Framework and add them, so we get more targeted: what is the business outcome you're trying to achieve? What is your replication strategy? What happens if it goes down for longer than x hours? This has really helped us find areas where someone had over-tiered an application, where something could be a tier two but they made it a tier one, and we were able to save money by going back to a tier two.

The next one is utilizing automated assessments where they make sense. Across people, process, and technology, a lot of things can be missed when these assessments are done manually. As we go further along our DR cycle, we get to a tabletop exercise, and a lot of times we'd have to reschedule it because we missed things and needed to go back to the drawing board. When you get everyone in a room, that's a lot of time to reallocate. Leveraging these automated assessments has been a great checkpoint to know you're good to go: sign up for that tabletop and move forward with your DR plan. So this has definitely helped with consistency in our DR and our DR planning.

And finally, we started to use more governance at scale. One thing we lacked before was a really good reporting strategy that said: this is our resiliency posture, these are all the applications that are tier one, and this is how they score against each other. By using Resilience Hub and Trusted Advisor, we've gotten better governance in that way, and we've also been able to see trends that help us make better business decisions as we look at other applications.

So, all in all, that's how we went through our optimization journey: making sure we can still hit those tier one requirements, bringing some cost efficiency to it, and having a really strong governance model around it.

So with that, I will hand it back over to Steven. Thank you.
