Improve operational efficiency and resilience with AWS Support

Shanan Kesha Raju, Principal Product Manager, AWS, Trusted Advisor: If you are here to learn how to win by operating efficiently in the cloud, you are absolutely in the right room. Welcome to today's session, SUP 310, Improving Operational Efficiency and Resilience with AWS Support. I'm Shanan Kesha Raju, Principal Product Manager, AWS Trusted Advisor. Today I am joined by Neil Sender, Principal Technical Account Manager, AWS Support, and Kara Mostly, Cloud Platform Manager, Koch Industries. Let's talk about today's agenda.

We will begin by explaining some of the operational efficiency and resilience challenges you face in your own organizations and the impact these inefficiencies can have on your business. Later, we will talk about the different AWS Support offerings that can help you overcome these challenges: proactive self-service tools like AWS Trusted Advisor, and human-guided programs that our account teams conduct with our Enterprise Support customers as workshops or tabletop exercises. Neil is going to cover some of these.

Finally, Tara is going to come up and talk about Koch Industries' journey and how they are integrating these tools and programs into their day-to-day operations to optimize for resilience, cost, and security.

"Everything fails all the time." This is a famous quote from Dr. Werner Vogels, CTO of Amazon.com. Let that sink in for a minute. We all know that when we operate software applications, issues do happen. They can happen because of manual errors, where somebody has made a code change, or because there was an unexpected influx of traffic into your production workloads. There could be hardware failures, like rack failures in a data center, or adverse scenarios like natural disasters, such as flooding, which can cause an entire Region outage. All of these can have a varying impact on your business, depending on the criticality of the workloads that you are running in the AWS Cloud.

So what he means by this is that when we design applications, we have to design them with failure in mind, and we do that by introducing redundancy and resiliency into our application architectures. That way, when issues do occur, we are able to recover quickly and operate within our recovery point objective and recovery time objective.

Now, let's talk about some of the teams that you might be part of in your organizations, and some of the priorities you might have. You might be part of a finance team, and your responsibility is to ensure all your teams are operating efficiently with respect to cost and within the budgets that you have set. For that, you need visibility into the inefficiencies across your teams, and you also need to track how these teams are progressing toward resolving them.

You might be part of a security team, and your challenges are similar: you want to understand how all your teams are performing with respect to security best practices. Are they making sure that all your access keys are protected? Whenever an inefficiency does occur, you need that visibility. Or you might be part of a site reliability team, wanting to ensure all your teams are implementing resilient architectures. Or you could be part of a DevOps team, and your responsibility could include making sure all teams are implementing observability best practices, so that when issues do occur, you are able to quickly identify why they occurred and then take action.

A lot of our customers, especially at enterprise scale, have told us that implementing these best practices across teams is a hard challenge, especially if you have decentralized teams plus a centralized team, like a cloud center of excellence, that is responsible for making sure everyone is following these best practices. They have told us that they lack visibility, prioritized recommendations, and tracking tools that would allow them to understand where these inefficiencies are. In the absence of these tools, inefficiencies creep into your systems, or suboptimal architectures make their way to production. Then, when there is an outage, it can be really costly to your business.

How costly can this be? An IDC white paper from June 2022 reported that enterprises say they experience, on average, 29 unplanned outages a year. Whenever an outage does occur, the mean time to resolution is around 4.4 hours, costing companies an average of $13.5 million a year. $13.5 million is a lot of money.
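As a rough back-of-the-envelope reading of those figures (a derivation for illustration, not something stated in the IDC paper), 29 outages at about 4.4 hours each comes to roughly 127.6 outage hours a year, and $13.5 million spread evenly over 29 outages is roughly $466,000 per outage. A minimal Python sketch of that arithmetic, assuming cost is spread evenly across outages:

```python
# Figures quoted in the talk (IDC white paper, June 2022).
OUTAGES_PER_YEAR = 29
MEAN_TIME_TO_RESOLUTION_HOURS = 4.4
ANNUAL_COST_USD = 13_500_000

# Derived values: simple averages, assuming cost is spread evenly.
outage_hours_per_year = OUTAGES_PER_YEAR * MEAN_TIME_TO_RESOLUTION_HOURS
cost_per_outage = ANNUAL_COST_USD / OUTAGES_PER_YEAR
cost_per_outage_hour = ANNUAL_COST_USD / outage_hours_per_year

print(f"Outage hours per year: {outage_hours_per_year:.1f}")
print(f"Average cost per outage: ${cost_per_outage:,.0f}")
print(f"Average cost per outage hour: ${cost_per_outage_hour:,.0f}")
```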

So how can AWS Support help you stay ahead of the curve? We'll begin by talking about Trusted Advisor. AWS Trusted Advisor is the home of AWS best practices. In AWS Trusted Advisor, we have a construct called a check. A check is a codified best practice that runs across all Regions within your account, identifies resources that deviate from this best practice, and reports them back for your consumption.

Today, we have more than 300 checks in Trusted Advisor spanning 45 different AWS services, including Amazon EC2 and AWS Lambda. We also have integrations with domain-specialized recommendation services like Security Hub, Resilience Hub, Compute Optimizer, and a bunch of other services as well.

The checks are categorized into six groups: Cost Optimization, Security, Fault Tolerance, Performance, Operational Excellence, and Service Quotas. Every Trusted Advisor recommendation can be in one of three states: green, meaning there is no action needed by you; yellow, meaning AWS recommends an investigation, and if action is needed, you should act on it; or red, where we recommend action so that you can stay optimal in the cloud.

There are three ways to consume these recommendations, depending on how you are set up. The straightforward way is to go to the AWS console and bring up Trusted Advisor, where you are presented with a dashboard of all these findings. Or you can integrate AWS Trusted Advisor recommendations with your own internal systems using our APIs. Or you might want to subscribe to certain types of checks that are essential for your business applications. For that, we provide an EventBridge integration where you can subscribe to those particular checks, and as soon as we detect any deviation from a particular check, we report it to you through a push notification.
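A minimal sketch of the EventBridge option, in Python. The event pattern below uses the `aws.trustedadvisor` source and the `Trusted Advisor Check Item Refresh Notification` detail type; the `check-name` and `status` values are illustrative assumptions and should be checked against the Trusted Advisor event documentation. The `matches` helper is hypothetical: it only mimics simple exact-value matching so the pattern can be tried locally, and it is not the EventBridge matching engine.

```python
import json

# Event pattern for an EventBridge rule. The check name and status
# values are placeholders chosen for illustration.
PATTERN = {
    "source": ["aws.trustedadvisor"],
    "detail-type": ["Trusted Advisor Check Item Refresh Notification"],
    "detail": {
        "status": ["ERROR", "WARN"],
        "check-name": ["Exposed Access Keys"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local matcher: every pattern key must be present in the
    event, and the event value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical sample event, trimmed to the fields the pattern inspects.
sample_event = {
    "source": "aws.trustedadvisor",
    "detail-type": "Trusted Advisor Check Item Refresh Notification",
    "detail": {"status": "ERROR", "check-name": "Exposed Access Keys"},
}

print(json.dumps(PATTERN, indent=2))  # JSON that could go into a rule definition
print(matches(PATTERN, sample_event))
```

Subscribing to a narrow pattern like this keeps the push notifications limited to the checks a team actually cares about, rather than every check refresh in the account.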

So let's bring back the slide where we discussed the various teams and show some examples of Trusted Advisor checks that are available to each of your teams. If you are a finance team, AWS Trusted Advisor can tell you about resources that are idle, for example idle RDS instances or underutilized EBS volumes, and it also gives you recommendations on how to resolve those inefficiencies so that you can save money. With respect to security, there are checks like exposed access keys or EBS public snapshots, which leave you at risk of a potential security threat.

For site reliability teams, we have checks like RDS Multi-AZ, which tells you about RDS instances that have not been set up for Multi-AZ, leaving you at risk of availability issues when outages occur. For DevOps teams, there are checks like Amazon VPC without flow logs. In the absence of a flow log, it would be very hard for you to audit what is happening on your network. So Trusted Advisor helps various teams with different types of checks that are purpose-built for the roles and responsibilities they serve.

So who has access to Trusted Advisor? Trusted Advisor is available to all AWS customers. Depending on your Support subscription plan, you either get a subset of these checks or you get all 300-plus checks. If you are on a Basic or Developer plan, you get all the service quota checks and a subset of the security checks. If you are on Business, Enterprise On-Ramp, or Enterprise, you get all 300-plus checks.

While all of this is great, our Enterprise Support customers wanted more. Think back to that central team wanting to know the entire cloud posture of their organization, and the lack of visibility and tracking tools. Every time somebody signs up for an Enterprise subscription, they get a Technical Account Manager (TAM) assigned to them, like Neil. Depending on the scale of the customer, there might be multiple TAMs working with them. Customers share a lot of context with these TAMs so that the TAMs can help them strategize on how to achieve their business outcomes.

So Enterprise customers wanted us to provide prioritized recommendations, knowing that we share their context, so that they can focus on the recommendations that really matter to them. For that reason, in 2022 we launched AWS Trusted Advisor Priority.

So why Trusted Advisor Priority? As explained, when you aggregate recommendations across your AWS organization, imagine the number of recommendations that you are working with. Depending on the size of your operation, it could be in the hundreds of thousands. In AWS Trusted Advisor Priority, account teams hand-curate the recommendations that are relevant to you and share them with you through the management account of your AWS organization. AWS Trusted Advisor Priority recommendations also come with closed-loop tracking.

What do I mean by that? Every action that you take on these recommendations, whether an acknowledgment or a dismissal, is fed back to our systems to help improve the prioritization logic that we use to share these recommendations with you. We also provide a historical view of all actions taken on these recommendations. This addresses the tracking problem that we discussed: as a management account or a cloud center of excellence team, you can see the recommendation and the actions that your individual members are taking on their resources with respect to that recommendation, and intervene when required to accelerate the resolution.

Trusted Advisor Priority also has a second type of recommendation. It's not just the Trusted Advisor automated checks that we discussed. As I was explaining, account teams run tabletop exercises and workshops with our Enterprise Support customers. During those workshops, some recommendations may be generated that are not automated yet. So Trusted Advisor Priority gives account teams the ability to enter manual recommendations relevant to your business outcomes.

For example, you might be launching an online Christmas sale on your website, and you want to ensure that your application is architected to take the load you are expecting during that event. Knowing that, an account team can go in and add a recommendation to take up an Infrastructure Event Management workshop, which is available to you as part of your Enterprise Support subscription. So that's another advantage of how account teams share recommendations using Trusted Advisor Priority.

So how does it work? You might ask: Shanan, you said it's a very hard problem for customers to pick a recommendation from hundreds of thousands, so how can account teams do that? You're just shifting the problem from the customer to the AWS account teams. So let's see how it works under the hood.

Trusted Advisor Priority has a few main steps. In step one, we aggregate recommendations across all accounts within an AWS organization. In step two, we pass these aggregated recommendations through an ML model that has been trained on the customer context that you have shared with our account teams. This includes your production workloads, your critical applications, and your business outcomes. We also look at other data, like the cost or usage associated with the accounts, to determine which accounts might be critical to your business.

Once the aggregated recommendations go through that prioritization, the account teams are presented with, for example, the top 10 or top 20 recommendations that we believe the customer should focus on. The account teams then use their expertise to pick the recommendations they believe are important for the customer and share them in Trusted Advisor Priority.
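The aggregate-then-rank flow described above can be sketched in Python. This is an illustration only: the talk does not describe the model's features or scoring, so the `Recommendation` fields, the weights, and the `score` function below are hypothetical stand-ins for the ML model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    account_id: str
    check_name: str
    flagged_resources: int
    is_production: bool        # hypothetical "customer context" signal
    monthly_spend_usd: float   # hypothetical cost/usage signal


def aggregate(per_account: dict[str, list[Recommendation]]) -> list[Recommendation]:
    """Step 1: flatten recommendations from every account in the organization."""
    return [rec for recs in per_account.values() for rec in recs]


def score(rec: Recommendation) -> float:
    """Step 2 stand-in: a hand-written score in place of the ML model."""
    context_weight = 2.0 if rec.is_production else 1.0
    return context_weight * rec.flagged_resources + rec.monthly_spend_usd / 10_000


def top_n(recs: list[Recommendation], n: int) -> list[Recommendation]:
    """Return the n highest-scoring recommendations for account-team review."""
    return sorted(recs, key=score, reverse=True)[:n]


# Made-up account IDs and findings, for illustration only.
org = {
    "111111111111": [Recommendation("111111111111", "RDS Multi-AZ", 4, True, 50_000)],
    "222222222222": [
        Recommendation("222222222222", "Underutilized EBS volumes", 6, False, 8_000),
        Recommendation("222222222222", "RDS Multi-AZ", 1, False, 8_000),
    ],
}
shortlist = top_n(aggregate(org), n=2)
for rec in shortlist:
    print(rec.account_id, rec.check_name, round(score(rec), 2))
```

In the process described in the talk, the shortlist is only an input: the account team still picks by hand which of the ranked recommendations to share.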

In this picture, at the top is the management account. Under that, you see different member accounts represented with different colors. Below that, you see different shapes, and each shape represents a Trusted Advisor check. In this context, we can assume the circles represent underutilized EBS volumes and the squares represent RDS Multi-AZ checks.

The aggregation step puts all these shapes together into one unit and then passes that package through the AI model. Out comes a prioritized list. If you look at the bottom right, the squares are grouped together, the circles are grouped together, and they are also stack ranked. The account team here has picked the underutilized EBS volumes check and the RDS Multi-AZ check as the two checks from the entire list. They might have done so because they understand that in 2024 the customer is probably focused on building resilient architectures while also wanting to optimize for cost.

That is the context an account team might use to hand-pick these recommendations. So what happens when a recommendation gets shared with the customer? Two things happen. When an account team shares those two recommendations, they are shared in the same format with the management account of the AWS organization, and the initial state of each recommendation is Pending Response. At the same time, we also share the recommendation with every member account whose resources were identified as part of that overall recommendation, but each member account receives the recommendation with only its own resources in it.

Both of them can act on these recommendations. For example, a management account can acknowledge the recommendation, in which case it is acting on behalf of every member account, and that directly puts the recommendation into an In Progress state. Or it can dismiss the recommendation if it feels it is not relevant. A member account can do the same on its part of the recommendation, and that is reflected accordingly in the overall recommendation in the management account.

In this example, every member account has acted on its underutilized EBS volumes recommendation, which is represented with a green check on the circle. The same status is reflected back in the management account's overall recommendation, which also has a green check, indicating to the management team that this recommendation is fully resolved.

For the RDS Multi-AZ recommendation, if you observe, there are two members who are part of that recommendation: the blue member and the yellow member. The blue member hasn't acted on it at all, and the yellow member has acknowledged it, so it is in progress. The overall recommendation in the management account is still in a no-response state because there is at least one member who has not acted on it. This tells the management account that it can intervene when necessary to understand the inaction and facilitate action.
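The roll-up behavior in this example can be sketched as a small Python function. The state names and the precedence rules below are assumptions inferred from the two cases described in the talk (every member resolved, and one member yet to respond); the actual service may define more states and different rules, and the member names are made up.

```python
from enum import Enum


class State(Enum):
    PENDING_RESPONSE = "pending_response"
    IN_PROGRESS = "in_progress"
    DISMISSED = "dismissed"
    RESOLVED = "resolved"


def roll_up(member_states: dict[str, State]) -> State:
    """Derive the management-account view from per-member states."""
    if not member_states:
        raise ValueError("a recommendation needs at least one member account")
    states = set(member_states.values())
    # Any member that has not acted keeps the whole recommendation pending.
    if State.PENDING_RESPONSE in states:
        return State.PENDING_RESPONSE
    # Everyone has responded, but work is still under way somewhere.
    if State.IN_PROGRESS in states:
        return State.IN_PROGRESS
    # Assumption: a mix of resolved and dismissed counts as resolved.
    if State.RESOLVED in states:
        return State.RESOLVED
    return State.DISMISSED


def members_to_chase(member_states: dict[str, State]) -> list[str]:
    """Members the management account may want to follow up with."""
    return sorted(m for m, s in member_states.items() if s is State.PENDING_RESPONSE)


ebs = {"blue": State.RESOLVED, "yellow": State.RESOLVED, "red": State.RESOLVED}
multi_az = {"blue": State.PENDING_RESPONSE, "yellow": State.IN_PROGRESS}

print(roll_up(ebs).value)
print(roll_up(multi_az).value, members_to_chase(multi_az))
```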

The number one product feature request that we received last year was for Trusted Advisor Priority APIs. I'm happy to announce that a couple of weeks ago we launched an entire set of new Trusted Advisor APIs, which includes the Trusted Advisor Priority recommendations. Here is a reference architecture showing how you can integrate these APIs to power your internal use cases.

You could have a scheduled job that runs at a fixed interval and makes two API calls: one to the list recommendations API and the other to the list recommendation resources API. Once you make these two API calls, you get the recommendations along with their resources. You can then funnel all this data into an S3 bucket using Kinesis Firehose. Once it's in the S3 bucket, you can power multiple use cases.
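A minimal Python sketch of the polling job described above. It assumes the boto3 `trustedadvisor` client and its `list_recommendations` and `list_recommendation_resources` operations, along with the response keys and the `nextToken` pagination shown; verify these names against the current boto3 documentation before relying on them. The client is passed in as a parameter, so the function makes no AWS calls on its own and can be exercised with a stand-in object. Delivery to Firehose and S3 is left out.

```python
import json
from typing import Any, Iterator


def paginate(call, result_key: str, **kwargs: Any) -> Iterator[dict]:
    """Follow nextToken until the API stops returning one."""
    token = None
    while True:
        if token:
            kwargs["nextToken"] = token
        page = call(**kwargs)
        yield from page.get(result_key, [])
        token = page.get("nextToken")
        if not token:
            return


def collect(client) -> list[str]:
    """Return one JSON line per recommendation, with its resources attached."""
    lines = []
    for rec in paginate(client.list_recommendations, "recommendationSummaries"):
        # Assumption: the summary's ARN is accepted as the identifier.
        resources = list(
            paginate(
                client.list_recommendation_resources,
                "recommendationResourceSummaries",
                recommendationIdentifier=rec["arn"],
            )
        )
        lines.append(json.dumps({"recommendation": rec, "resources": resources}, default=str))
    return lines


# In a scheduled job (for example a Lambda on an EventBridge schedule):
#   import boto3
#   lines = collect(boto3.client("trustedadvisor"))
#   ...then hand `lines` to the delivery stream of your choice.
```

Writing one JSON object per line keeps the output easy to query later with tools that read newline-delimited JSON from S3.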

Let's say you have a QuickSight dashboard that you use for your own internal reporting. You can run Athena queries on top of S3 to power your QuickSight dashboard. That's one way you could do it. Or you might be interested in a particular type of Trusted Advisor check that is very critical for your business, and as soon as you see it, you want to open an incident management ticket within your internal systems. You could then use the EventBridge integration with S3 and two Lambda listeners. One Lambda could integrate with Jira to open a ticket, and the other could notify the particular member whose resources were identified in that recommendation from the AWS account team that there is a priority recommendation in their account waiting for action.

So that is one way. We also highly recommend that if you are using these APIs, you also use the update recommendation lifecycle API to tell us what actions you are taking on a particular recommendation, so that we can learn from your actions and get better at our prioritization logic.

With that, I will hand it over to Neil Senders, who is going to go over some of the account team workshops and tabletop exercises they conduct with our Enterprise Support customers.

Thank you, Shanan. Good afternoon, everyone. My name is Neil Senders. I'm a Principal Technical Account Manager at AWS.

You heard from Shanan how Trusted Advisor can inspect your environment and, where applicable, provide recommendations to reduce cost, improve system availability and performance, or close security gaps. Besides Trusted Advisor, there are several other programs that can help you improve your operational efficiency. In the next couple of slides, I'll show you how you could use these programs along with Trusted Advisor and Trusted Advisor Priority to improve your operational efficiency in the cloud.

Let's start with resiliency. As customers migrate and modernize their applications in the cloud, one of the questions we get is: how do I make sure that my applications in the cloud are resilient? These applications are built out of multiple AWS services, and these services have their own resiliency, durability, and redundancy targets. So how do I make sure that my overall application, using all these AWS services, can meet or exceed my resiliency requirements?

Resiliency is your application's ability to recover from an outage without any major impact to your revenue. It's measured by two industry-standard metrics: RTO, recovery time objective, and RPO, recovery point objective. Customers often need help defining a resiliency strategy. And even if they have one, they sometimes need help executing that strategy or testing it to make sure that it works.

That's where an AWS Support program can help. It's called Driving Resiliency Planning Execution and Testing, or DR PET for short. DR PET is an engagement led by AWS subject matter experts (SMEs) that reviews your environment's security posture and then provides recommendations to improve your resiliency readiness.

There are three stages in this engagement: discovery, deep dive, and live-fire testing. In the discovery phase, the Technical Account Managers and the resiliency SMEs meet your business continuity teams and catalog the mission-vital and mission-critical applications in your environment, and then correlate that with the data we receive from Trusted Advisor to document risks within those applications.

We then run deep-dive architecture reviews through immersion days and build runbooks for you. Finally, we run simulations and tabletop exercises to make sure that these runbooks work and can meet your RTO and RPO targets.

Now, DR PET is a very comprehensive review of your mission-vital and mission-critical applications. It takes place over a period of three to four months. You'll work with your Technical Account Managers, the resiliency SMEs, and your business continuity teams to build that runbook and that resiliency strategy. It does need some commitment from your team to sit and work with us, so you can reserve DR PET for your most mission-vital and mission-critical applications, such as custom applications with logic that needs human intervention to understand and work on.

If you want to implement this best practice at scale for hundreds of applications in your organization, there is an AWS service called Resilience Hub that automates some of the actions that DR PET performs. Resilience Hub implements some of the best practices for identifying resiliency gaps, such as single points of failure, within your environment.

The way it does this is that you provide something called a resiliency policy. The resiliency policy is a set of metadata that defines your resiliency targets and resiliency configurations, for instance your RTO, your RPO, and your backup strategies. Resilience Hub then runs tests against that metadata to make sure that you can meet those resiliency objectives.

It will provide resiliency recommendations if it finds that you will not be able to meet those resiliency objectives. It also runs cost estimates to show how much additional cost you would incur for the additional resiliency that you add to your environment.
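To make the idea of checking a workload against a resiliency policy concrete, here is a small Python sketch. It is not the Resilience Hub API: the `ResiliencyPolicy` and `Estimate` types, the field names, and the numbers are hypothetical, and the comparison is the simplest possible one (estimated recovery figures must not exceed the policy targets).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResiliencyPolicy:
    rto_seconds: int  # recovery time objective: maximum tolerated downtime
    rpo_seconds: int  # recovery point objective: maximum tolerated data-loss window


@dataclass(frozen=True)
class Estimate:
    component: str
    estimated_rto_seconds: int
    estimated_rpo_seconds: int


def policy_breaches(policy: ResiliencyPolicy, estimates: list[Estimate]) -> list[str]:
    """List every component whose estimated recovery exceeds a policy target."""
    breaches = []
    for est in estimates:
        if est.estimated_rto_seconds > policy.rto_seconds:
            breaches.append(
                f"{est.component}: RTO {est.estimated_rto_seconds}s > {policy.rto_seconds}s"
            )
        if est.estimated_rpo_seconds > policy.rpo_seconds:
            breaches.append(
                f"{est.component}: RPO {est.estimated_rpo_seconds}s > {policy.rpo_seconds}s"
            )
    return breaches


# Made-up targets and estimates, for illustration only.
policy = ResiliencyPolicy(rto_seconds=3600, rpo_seconds=900)
estimates = [
    Estimate("database", estimated_rto_seconds=7200, estimated_rpo_seconds=300),
    Estimate("web-tier", estimated_rto_seconds=600, estimated_rpo_seconds=0),
]
for line in policy_breaches(policy, estimates):
    print(line)
```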

It will also provide operational recommendations around your infrastructure, such as whether you have the right alarms in place. It will create standard operating procedures, which are nothing but runbooks, so instead of creating them from scratch, Resilience Hub provides them for you. You can then run Fault Injection Simulator, which is our chaos testing tool, to simulate failures and make sure that these runbooks work.

So that's resiliency, and some of the services and programs that can help you improve your overall resiliency posture.

Let's look at another area: security. Enterprises have thousands of AWS accounts, and the security maturity of these accounts varies, so you cannot apply a single consistent security policy across all of them. For instance, some of these accounts might host a monolithic application with code that has been there for ages. Some accounts might have modernized, born-in-the-cloud applications that were built from scratch using the best cloud security advice and practices. Or you could have accounts that host third-party applications that you are self-hosting and self-managing.

So you cannot implement a consistent policy across all of them. Customers need prescriptive guidance for all of these different accounts and their security postures. Not only that, they also need a continuous process to review those postures and to measure, monitor, and improve the baseline over time.

That's where an AWS Support program can help. It's called the Security Improvement Program. The Security Improvement Program reviews the maturity of your security posture and compares it against 250-plus industry best practices. It provides a weighted security score for all five security pillars of the Well-Architected Framework, which I'll cover in the next slide. Finally, it provides you with recommendations.
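As an illustration of what a weighted score across the five pillars could look like, here is a Python sketch. The talk does not say how the Security Improvement Program computes its score, so the weights, the per-pillar counts, and the formula (a weighted average of per-pillar pass rates) are all hypothetical.

```python
# Hypothetical weights per pillar; they must sum to 1.0.
WEIGHTS = {
    "Identity and Access Management": 0.30,
    "Detection, Auditing, and Logging": 0.20,
    "Infrastructure Protection": 0.20,
    "Data Protection": 0.20,
    "Incident Response": 0.10,
}

# Hypothetical review results: (best practices met, best practices reviewed).
RESULTS = {
    "Identity and Access Management": (45, 60),
    "Detection, Auditing, and Logging": (40, 50),
    "Infrastructure Protection": (30, 50),
    "Data Protection": (45, 50),
    "Incident Response": (20, 40),
}


def weighted_score(weights: dict[str, float], results: dict[str, tuple[int, int]]) -> float:
    """Weighted average of per-pillar pass rates, on a 0-100 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    total = 0.0
    for pillar, weight in weights.items():
        met, reviewed = results[pillar]
        total += weight * (met / reviewed)
    return 100 * total


print(f"{weighted_score(WEIGHTS, RESULTS):.1f}")
```

A weighting like this lets a low pass rate in a heavily weighted pillar pull the overall score down more than the same pass rate in a lightly weighted one.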

It provides these action plans in the form of documents, artifacts, and scripts that you can run, and all of these can be tracked through Trusted Advisor Priority. That gives you a feedback loop to make sure there is an owner for every recommendation and that actions are taken on them.

These are the five pillars I mentioned in the previous slide, the five security pillars of the Well-Architected Framework: Identity and Access Management; Detection, Auditing, and Logging; Infrastructure Protection; Data Protection; and Incident Response.

In the Identity and Access Management focus area, we look at your access controls, your IAM policies, and your multi-account strategies. For Detection, Auditing, and Logging, we look at your monitoring strategies. For Infrastructure Protection, we make sure your infrastructure is protected through threat detection or intrusion detection. For Data Protection, we make sure your data is protected in motion and at rest. Do you have encryption in place? Is your EBS encrypted? Are your RDS snapshots encrypted? And finally, for Incident Response, we look at your observability practices.

Do you have a response plan? Do you have an escalation plan for a security event? Who are your escalation points of contact? Those are just a few of the checks we look at. Like I mentioned, there are 250-plus industry best practices that we review to make sure you follow them.

These recommendations are service agnostic. It's not a deep dive on any particular service per se. So if you use a third-party service to monitor these best practices, these recommendations will work for those third-party applications as well.

All right. When you implement these security best practices across thousands of accounts and get these recommendations and findings, you need a process to manage them at scale. That's where an AWS service called Security Hub helps. It aggregates and highlights all these findings in a single place.

For instance, you may have findings coming from services like Amazon Macie, Amazon GuardDuty, and AWS Shield. All of these are aggregated in Security Hub.

Security Hub not only aggregates findings, it also normalizes the format of the output so that your CISOs can have a single pane of glass to look at all of them. Along with aggregation, it also provides a mechanism to take actions on those findings. On the right side, as you can see, there are integrations with CloudWatch, and you could use AWS Lambda to take actions on the alarms that are triggered. You can also have integrations with incident response tools like PagerDuty and Slack. Security Hub also has native integration with Trusted Advisor, so you can send some of those checks to Trusted Advisor as well.

All right, so that's security. Let's look at another important area: incident response. Shanan shared some stats from IDC about how organizations spend over $13.5 million per year on operational inefficiencies in the cloud. These operational inefficiencies can lead to costly disruptions that can have a cascading effect across your organization. These effects can lead not only to revenue impact but also to reputational impact, for instance with your partners. They could also lead to workforce productivity issues or to stress in your environment.

So the ability to react to incidents or outages like this and recover fast is of utmost importance. In fact, in some regulated industries, it's a compliance requirement to meet certain recovery objectives. That's where a new AWS service can help. It's called Incident Detection and Response. Incident Detection and Response, or IDR, is a support service that monitors your critical environments 24 by 7. The monitoring is done by the incident management team, the same team that monitors our own infrastructure within AWS.

Some of you here are Enterprise Support customers, so you're familiar with the business-critical case, the workflow around it, and how to engage your support engineers and Technical Account Managers when a critical incident occurs. In the next slide, I'll show you how the support response differs between a customer who's on IDR and a customer who's not, and how IDR improves your support response.

Let's start with a customer, or a set of accounts, that is not on IDR. When an incident occurs, your alarms go off and your site reliability engineers detect the problem. They start investigating, and they isolate whether or not it is an AWS problem. Let's say it's an AWS problem. Then they engage our support engineers by opening a case. Our support engineers get involved, your Technical Account Managers get involved, we bring in the right service teams, we investigate the problem, and eventually we resolve it.

As you know, there is about a 15-minute first-response time, within which the support engineer will get on and acknowledge the case. But note that it could be several hours before you isolate the problem and open a case. A significant amount of time is spent isolating the problem and working out which team to call. Only then do we get involved: our Technical Account Managers start investigating, we get the relevant service teams on the call, and finally we resolve the problem.

Now let's look at a set of accounts that are on IDR when an incident goes off. As I mentioned, they are monitored 24 by 7 by the incident management team at AWS, so we get involved immediately when incidents are detected. The incident management team opens a bridge, opens a case, and engages your site reliability engineers. We start investigating the problem, we bring in the service teams, and we resolve the problem.

On the right, as you can see, there is a 15-minute window for us to engage once an incident is detected. We are reducing this to five minutes, and there'll be an announcement around it this year at re:Invent. As you can see, this significantly reduces the mean time to engagement. For some mission-vital applications, this can give you a competitive advantage: you can start your resiliency runbook and fail over before the incident impacts your customers further.

So that's IDR, and the difference between a customer on IDR and one not on IDR. Now let's look at how it works behind the scenes. On the right, you have your AWS environment, where you have alarms set for your workloads. When alarms go off, your observability metrics are triggered. You could have CloudWatch, or you could have third-party APM tools like New Relic or Dynatrace. All of these alerts are sent to our IDR team through EventBridge. The IDR team then triggers its runbook, which was co-created with your incident response teams and your site reliability teams.

And then we correlate the data with some of the AWS Health data that we have access to on our side. We investigate the root cause, we bring in the service teams, and finally we resolve the problem. Then we do post-incident reporting, where we try to improve the runbook to make sure this problem doesn't happen again in the future.

These are some of the benefits that our customers have seen on IDR. First, improved observability. Since these alarms are set up by the incident management team working closely with your site reliability engineers, we have better observability in place. We set up the right leading indicators for your environment, both for your infrastructure and for your application layers.

Then, for the reasons I mentioned earlier, you have faster resolution because we get engaged faster: the mean time to engagement goes down. We also work on identifying the right service teams to involve, which helps you resolve the problem and reduces the mean time to resolution as well. That's where the 4.4 hours of mean time to resolution that Shanan talked about could come down as well. Next is incident management for AWS events.

So whenever there is an AWS event, we provide proactive notifications to you, and you can take action on failing over your infrastructure before it even becomes a major event. That gives you an added advantage as well. Finally, reducing the potential for failures. That goes back to continuous improvement: as we work closely with you to build those runbooks and review them after every event, it helps you improve the stability of your environment over time and make sure that you don't have these problems in the future.

So that's incident response for you. Let's look at another area: cost management. As customers move further along in their cloud adoption journey, they consume more and more AWS services, and the cloud operating cost increases accordingly. So there's a need to build a robust cloud cost management strategy to continuously review unhealthy practices such as over-provisioning or under-utilizing resources. And doing that at scale for thousands of accounts can be quite overwhelming.

That's where an AWS Support program can help. It's called the Cost Optimization Workshop. The Cost Optimization Workshop is a TAM engagement. TAMs review the cost optimization opportunities in your environment and provide recommendations in areas such as compute, database, storage, network transfers, Reserved Instances, Savings Plans, and many others.

We review your Cost and Usage Reports and use some of the tools that we have access to in order to provide these recommendations over a period of 60 to 90 days. Depending on the level of engagement, we work with your FinOps teams to provide these recommendations. What we have seen is that, on average, these engagements lead to about a 15% reduction in your cloud operating cost, and they come at no additional cost to you. It's part of your TAM engagement, and technical account managers can do that for you.

Now, one question we get from customers is this: as we go through the different stages of the engagement (discovery and analysis, recommendation, educate, monitor, reporting), what is the level of engagement from our end, and what is the level of engagement from the customer's FinOps teams? So here's a timeline of how the entire engagement works.

The first three weeks are all about discovery. We work with the FinOps teams to identify the parts of your organization where you want us to focus. It could be a particular business unit that has issues with operating costs, or it could be a set of services that you are having issues with. We work with your team to identify those areas, and then for the next two weeks we run our reports to do the deep-dive analysis and produce the recommendations.

We share those recommendations with your FinOps teams through our workshop. The remaining six to seven weeks, which are optional, are spent working with your FinOps teams on a regular cadence, where we observe as you implement those best practices and review progress to see whether any course corrections are needed. We might have to adjust a few things, and we provide guidance along the way. Finally, we provide a strategic business review at the end. This process can be repeated quarterly, twice a year, or once a year, depending on your availability and your need.

So those are some of the AWS Support programs and AWS services that can help you improve your operational efficiency. Now I'd like to hand it over to Kara to talk about how Koch Industries has implemented some of these programs to improve their cloud operations.

Awesome. Thank you very much, Neil. So my name is Kara Mosley, and I am really excited to be talking to you today about Koch Industries' cloud journey. I was lucky enough to join Koch back in 2015, so I have been with our company throughout our entire cloud journey. But I started as an accountant and was not involved in the discovery or the initial phases. When I saw the value it was creating for our business, I wanted to be a part of it. And so I bolstered my AWS knowledge and ultimately made my move into IT, where I currently have the opportunity to lead our cloud platform team, which is similar to a CCoE. We're responsible for provisioning accounts, for the security of our platform, and really for enabling each of our Koch businesses to get the most value out of AWS and focus on what their business needs.

And so if you are not familiar with Koch Industries, we are one of the largest privately owned businesses in the United States headquartered in Wichita, Kansas.

I like to think about us as an enterprise of large enterprises, and it makes it a little bit fun and a little bit complex because we are in a variety of different industries. We have several businesses, like Georgia-Pacific, which manufactures paper; Guardian, which manufactures glass; and Molex, which manufactures electronic components; as well as businesses that support logistics, commodity trading, and even market investments. And so our AWS environment, as you can imagine, is extremely complex to manage, but we have had a very exciting cloud journey.

We started back in 2016 with experimentation and figuring out what we needed to build and how fast we could move. By 2017, we had several businesses moving their production workloads into AWS through a variety of strategies: we had some lift and shift and some modernization. By 2020, we had the majority of our workloads migrated to AWS and really started stepping up our refactoring efforts. By 2021, we had a lot of challenges with latency and with meeting the needs of our global operating teams, so we focused on which regions we needed to enable for our businesses to expand and deliver the value that they needed to. And through this last year, we really focused on reducing the technical debt in our platform. But as you know, no cloud migration goes smoothly. It's always a little bit more difficult than we imagine.

And we ran into three key challenges during our journey:

  1. The first one was speed. Each of our businesses moved at a different speed and had different migration strategies, which led to different needs for which services had to be enabled, differences in the way that we implemented certain controls, and varying cloud maturity across our enterprise.

  2. We also had challenges with visibility. We struggled to aggregate views of opportunities for our stakeholders so that they can make good decisions for all of their accounts.

  3. And we found we had technical debt. When we set up our platform, we really focused on production, development, and QA accounts in our organization, and we found it actually made visibility even more challenging. And because we had a very heavy lift and shift strategy to move our manufacturing applications into the cloud, we are unraveling some of the technical debt associated with those applications.

And so we knew we had to overcome our challenges, and we focused on four key areas: resiliency, security, monitoring and governance, and cost management. But we knew that we couldn't do it alone. So we worked closely with our Enterprise Support team, particularly Neil, to figure out how we could go and solve these challenges. And we were connected to three key programs and tools: Trusted Advisor, the Security Improvement Programs, and the Cost Optimization Workshops.

And so our first focus was resiliency. As we were starting to modernize some of our applications, we went back to the basics and focused on the Well-Architected Framework. We did reviews of key applications that we were uncomfortable with, or ones that we wanted to modernize, to identify what changes we needed to make to our designs and our configurations before we put a lot of effort in. And then we relied on the Trusted Advisor resiliency checks so that we could see whether we were making progress and identify new opportunities that needed to be addressed.

And as we looked at our global footprint, we realized that some of our applications truly needed to be multi-region, but they also needed to be close to the end users. And so we developed points of view on which regions were going to be our preferred regions in every area of the world, so that we could have a primary and a secondary region in North America, Europe, and Asia. That made it easy for application teams to know where they could deploy their resources, especially when they needed to be connected to our network.

And then the last two areas I want to talk about together because they were so closely related in our journey. One of our subsidiaries, Georgia Pacific really focused on resiliency and went through the DR Pet program. And early on in that program, they had to align on what the business RPOs and RTOs were and connect with some of their key stakeholders to understand those expectations so that they could build them into AWS Resilience Hub and their application teams could start to configure and receive those automated alerts to know when their applications weren't going to meet the needs of the business.
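The idea of agreeing on business RPOs and RTOs and then alerting when an application will not meet them can be illustrated with a small check. This is a hypothetical sketch, not AWS Resilience Hub's API: the `ResiliencyTarget` type, the `policy_breaches` function, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResiliencyTarget:
    """Business-agreed targets, in minutes (hypothetical structure)."""
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable window of data loss


def policy_breaches(target: ResiliencyTarget,
                    estimated_rto: int,
                    estimated_rpo: int) -> List[str]:
    """Compare estimated recovery figures with the targets and list any breaches."""
    breaches = []
    if estimated_rto > target.rto_minutes:
        breaches.append(
            f"RTO: estimated {estimated_rto} min exceeds target {target.rto_minutes} min"
        )
    if estimated_rpo > target.rpo_minutes:
        breaches.append(
            f"RPO: estimated {estimated_rpo} min exceeds target {target.rpo_minutes} min"
        )
    return breaches
```

An empty list means the application is estimated to meet both targets; any entries are the kind of finding an application team would want to be alerted on.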

And through the live-fire testing, we were able to identify single points of failure that needed to be addressed to move forward. Across each of these areas, we identified over 330 resiliency opportunities that we took action on. And our TAMs actually tracked them through Trusted Advisor Priority to make sure that we were making progress on those initiatives.

But even though we were more resilient, we wanted to be more secure. And so over the past two years, each of our individual Koch businesses participated in Security Improvement Programs. We opted to do this business by business so that we knew we would have the right resources and the right stakeholder support to go and take action once these engagements were concluded.

And this resulted in a lot of changes across our business. We learned that there were some configuration changes that we needed to make in CloudFront for some of our networking opportunities. And we learned that our developers were really good at encrypting EBS snapshots. But as we started adopting new services like RDS, there was an opportunity to go and encrypt those snapshots as well, so that any backups we made were being covered the way that they should be.

And then through each of those individual business Security Improvement Programs, we took a step back and asked: what's non-negotiable? What has to be implemented into the platform for everyone's security? This resulted in our teams developing the tools and the processes to ensure that CloudTrail and VPC logging were enforced in every single account that we had. And upon creation of new AWS accounts, GuardDuty was deployed so that we could enhance our visibility.

But through that, we still felt like we needed to get better at our operational effectiveness. Our application teams were asking for alerts in real time when their applications could be impacted. And so we developed a process with AWS Health Aware to deliver real-time alerts directly to application teams.

What we did is we used the AWS Health API and polled for results every single minute through a Lambda function and stored them in DynamoDB. Then we allowed each of our Koch businesses to filter those results through EventBridge and decide how they wanted to deliver them to their teams, based on what their individual business processes were. Some teams wanted them through SNS topics, others wanted them to end up in emails, and others wanted to integrate those results directly into Microsoft Teams channels or into ServiceNow tickets so that they could review them and act on them.
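The poll, store, filter, and deliver flow described here can be sketched without any AWS dependencies. This is a minimal illustration, assuming a simplified event shape with `arn` and `service` fields; the in-memory `seen_events` dict stands in for the DynamoDB table, and the `ROUTES` table, business names, and channel names are hypothetical rather than Koch's actual configuration.

```python
from typing import Dict, List, Tuple

# Stand-in for the DynamoDB table: events already stored, keyed by ARN.
seen_events: Dict[str, dict] = {}


def record_new_events(polled: List[dict]) -> List[dict]:
    """Store events that have not been seen before and return only those."""
    new_events = []
    for event in polled:
        if event["arn"] not in seen_events:
            seen_events[event["arn"]] = event
            new_events.append(event)
    return new_events


# Hypothetical per-business filters: which services each business cares
# about and which delivery channel it prefers.
ROUTES = [
    {"business": "business-a", "services": {"EC2", "RDS"}, "channel": "sns"},
    {"business": "business-b", "services": {"RDS"}, "channel": "teams"},
]


def route(event: dict) -> List[Tuple[str, str]]:
    """Return (business, channel) pairs whose filter matches the event's service."""
    return [
        (r["business"], r["channel"])
        for r in ROUTES
        if event["service"] in r["services"]
    ]
```

Because each poll may return events that were already stored, deduplicating on the event ARN before routing keeps teams from receiving the same alert every minute.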

And with the new release of the Trusted Advisor Priority APIs, we believe that this design that we have is reusable. We can bring in more results and information and deliver them to the end users in a way that they're already receiving meaningful alerts today.

And the last area that we really wanted to focus on for operational efficiency was cost management. As we went through our resiliency and our security journeys, we were spending a lot of time reviewing the Trusted Advisor checks, and there were a lot of cost recommendations. As a former accountant, I was getting really uncomfortable with those.

And so we went to our AWS team and said, can you help us with this? Because we know that there's more we need to do but don't know where to start to get those quick wins. And our teams were able to go and pull information that we didn't even have access to in our third party FinOps tooling. And they showed us, hey, remember when you started your journey back in 2016? Well, you created a lot of snapshots since then and you've never cleaned them up. Same with the ones in 2018 and 2020. And we found our average ages of our backups were over a year and those weren't creating business value.

And so, working with our TAMs and key stakeholders, we were able to identify that we needed to implement some governance around how long backups should be stored. By defining a new retention policy, we were able to automate the cleanup of those snapshots and capture over a million dollars of annual savings just by deleting data that we didn't need and that posed no risk to application performance.
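A retention policy of this kind comes down to a cutoff date. The sketch below is an illustration only: the talk does not state the actual retention period, so `RETENTION_DAYS`, the snapshot record shape, and the optional `keep` flag are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Hypothetical value: the actual retention period is not stated in the talk.
RETENTION_DAYS = 90


def expired_snapshots(snapshots: List[dict],
                      now: datetime,
                      retention_days: int = RETENTION_DAYS) -> List[str]:
    """Return IDs of snapshots older than the retention window.

    Snapshots flagged with keep=True (for example, a compliance hold)
    are never selected for deletion.
    """
    cutoff = now - timedelta(days=retention_days)
    return [
        snap["id"]
        for snap in snapshots
        if snap["created"] < cutoff and not snap.get("keep", False)
    ]
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes the policy easy to test and to dry-run before any snapshot is actually deleted.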

And so those were a lot of changes in the last couple of years, but we're not done yet. We still think that there's opportunity to better act when our applications could be impacted to prevent downtime and really keep the applications that are key at our manufacturing sites running effectively.

And so over the next year, we're really going to be focusing on how we can implement IDR, and we're thinking about it in two different ways. The first one is our high-value tier-one applications. If an application is going to have a material impact on our business, IDR is one of the preferred options that we want to explore so that our orders and our forklifts and our trucks can keep moving the way that they need to. But we're also thinking about it where we have poor consumer experiences today. There are some AWS services that we use without deep internal knowledge of how they work. So when there is an issue, it takes us a very long time to get the application teams connected with the right resources to go and resolve it. Using IDR for some of these specific services can help us act on those issues faster and remove a lot of the frustration for those development teams.

And we also plan on utilizing the knowledge of our TAMs significantly better. I mentioned at the beginning that we had a lot of technical debt in the way that we structured AWS Organizations. Over the last year, we've reorganized every single one of our accounts so that we can group them by business unit. This gives our TAM team the information that they need to know which accounts are owned by each one of our businesses. It's very clear, and now they can curate recommendations. So if Georgia-Pacific is focusing on resiliency, we can prioritize those recommendations for those accounts. And if Molex is focusing on cost, we can prioritize a different set of recommendations for Molex accounts as well.
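Grouping accounts by business unit is what makes this kind of curation mechanical. The following is a hypothetical sketch: the account IDs, the `ACCOUNT_OWNER` and `FOCUS` mappings, and the recommendation record shape are invented for illustration, using the two focus areas given as examples in the talk.

```python
from typing import Dict, List

# Hypothetical mapping of account IDs to the business unit that owns them.
ACCOUNT_OWNER: Dict[str, str] = {
    "111111111111": "georgia-pacific",
    "222222222222": "molex",
}

# Hypothetical focus area per business unit, following the talk's examples.
FOCUS: Dict[str, str] = {
    "georgia-pacific": "resiliency",
    "molex": "cost",
}


def curate(recommendations: List[dict]) -> List[dict]:
    """Keep only recommendations whose pillar matches the owning business's focus."""
    curated = []
    for rec in recommendations:
        owner = ACCOUNT_OWNER.get(rec["account_id"])
        if owner is not None and FOCUS.get(owner) == rec["pillar"]:
            # Tag the recommendation with the owning business for follow-up.
            curated.append({**rec, "business": owner})
    return curated
```

Recommendations for accounts with no known owner are dropped here; in practice those would more likely be flagged for review, which is exactly the gap that the reorganization by business unit closes.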

And through this, because we can aggregate the results at our payer account level, we can delegate admin access to our FinOps teams and our Security teams to see how across the organization we are acting on these recommendations and following up with the right stakeholders to close the gaps that are needed.

And so I am going to pass it over to Shanan to close this out.

I'm back. So thank you, Kara, for sharing that amazing journey. To wrap it up, we discussed the pains that you have with respect to visibility and tracking inefficiencies in your organizations. We made claims that AWS Support can help you operate efficiently, and we have shown the gains that customers like Koch Industries are seeing by implementing these programs within their day-to-day operations.

Not everything that happens in Vegas needs to stay in Vegas. All this good learning that you have from here, please take it back to your teams and share the best practices with them so that your teams can operate efficiently and you know, optimize for resiliency, security and costs.

Here are some good resources that share information about how you can get started with Trusted Advisor as well as IDR, and these are some related sessions. I think the first session is actually already done, but you can look at these other sessions, which also talk about cost optimization as well as how to achieve resiliency.

So finally, this is the time where you have to take your phones and please do fill out that survey and give your feedback because that is very important for Neil and me to get better as we do more of these sessions.

Thank you very much for spending your hour with us. And may the luck be with you.
