McDonald’s path to secure operational excellence on AWS

Article link: https://blog.csdn.net/just2gooo/article/details/134790718

Good afternoon, everybody. Welcome. Are you looking for insights on how to lay the foundation for operational success on AWS or are you looking to strengthen your existing operational posture to help you grow and scale your business on AWS? If yes, this is the session for you because today we're going to share operational best practices and approaches you can take to set you up for long term success.

My name is Paul and a little later, you'll hear from Betty and Swati and this is our session on securing operational excellence on AWS.

Let's take a look at what we're going to cover today. I'm going to kick us off with some operational best practices and lessons learned, followed by approaches that you can take to build or strengthen your operational posture. Then I'm gonna hand you over to Betty and Swati to talk about how McDonald's have approached cloud operations in a way that's enabled them to scale globally. They'll also look at some specific use cases that McDonald's have been able to solve for by partnering with AWS.

As a solutions architect that specializes in cloud operations, something that my customers often ask me is whether there are any common pitfalls or preventable challenges when it comes to operating on AWS. On reflection, there is one recurring issue that I see come up time and time again, and that is that customers often think about how they're going to operate their workloads far too late in their migration or their modernization journey.

Now, look, I get it. Whilst I love talking about cloud operations all day long, I do get that for some people, speaking about the latest developments in generative AI or serverless computing can be far more exciting than diving deep on cloud operating models or process maps. However, the customers that I have seen be the most successful in AWS, and those that have derived the highest return on their cloud investments, are those customers that think about how they're going to operate their workloads both well before and well after they go live on AWS. To set the scene for today's talk, I'm going to share some lessons learned on how creating a solid operational foundation before you deploy your workloads on AWS can set you up for success.

Now, I'd imagine there are plenty of people in the room here that already have workloads running on AWS, and that's great. It is never too late to adopt some of the best practices that we share here today, because operational excellence really is more of a journey than a final destination, and you should be continually evaluating and refining your operational processes and procedures over time. To help keep it simple, I'm gonna break down these lessons learned into the three domains of people, process and tools.

When it comes to people, you need to make sure that you've got the right people with the right skills and sufficient bandwidth to be able to operate your workloads on AWS. A common stall factor that I observe as customers try to accelerate their cloud adoption is that if you're not expanding your cloud operations team as you're moving more workloads to AWS, then your cloud project team can inadvertently become your cloud operations team. What I mean by that is, if you've got resources with specialized skills like cloud architecture or application deployment, then as you move more workloads to AWS, operational issues inevitably start to arise, and these specialized resources can get distracted and start working as incident managers or resolvers. To mitigate this, it's really important that you think about your holistic operating model and who is going to do what, not only after you go live in AWS, but before, during and after the whole go-live process.

When it comes to process, there is significant scope to automate your operational processes on AWS, and you should capitalize on this from day one. By focusing on automation from the beginning, you can reduce manual effort and human error. Of course, AWS have a number of services that can help here, for example AWS Backup, which allows you to automate your backup management process. You can use AWS Backup to automate creating your backups. You can also use automation to copy your backups to multiple different AWS Regions, which can help with your continuity management. And if you're using lifecycle policies, you can move your backups from warm storage to more cost-effective cold storage over time. These are just a couple of highlights of the automation available in one single AWS service. But as we're talking about process, one lesson that I'd really like to share with you is that it's really important to have adequate security and operational processes in place from the moment that you deploy your workloads on AWS. Many people don't think about this until go-live, but I've seen firsthand that it's really important to have this in place from the moment you deploy your app on AWS. The reason is that whilst well-intended people are moving really quickly with your cloud program, people do sometimes make mistakes. So something like accidentally leaving an Amazon S3 bucket open to the public, or creating an overly permissive security group rule, can occur. By having adequate security and operational processes in place during your build phase, you can quickly capture these types of misconfigurations and remediate them before they become a bigger issue.
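As a rough illustration of the AWS Backup automation described above, here is a minimal sketch that builds a backup plan document with a daily schedule, a lifecycle policy that moves backups to cold storage, and a cross-Region copy. The dictionary follows the shape that boto3's `create_backup_plan` accepts for its `BackupPlan` argument; the plan name, vault names, account ID, schedule and retention values are all assumptions chosen for the example, not values from the talk.

```python
def build_backup_plan(name, vault, copy_vault_arn,
                      cold_after_days=30, delete_after_days=365):
    """Build a backup plan document (illustrative values, nothing is sent to AWS)."""
    # AWS Backup expects backups moved to cold storage to stay there
    # for at least 90 days before they are deleted.
    if delete_after_days < cold_after_days + 90:
        raise ValueError("delete_after_days must be at least cold_after_days + 90")

    lifecycle = {
        "MoveToColdStorageAfterDays": cold_after_days,
        "DeleteAfterDays": delete_after_days,
    }
    return {
        "BackupPlanName": name,
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": vault,
                # Assumed schedule: every day at 05:00 UTC.
                "ScheduleExpression": "cron(0 5 ? * * *)",
                "Lifecycle": lifecycle,
                # Copy each backup to a vault in another Region.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": copy_vault_arn,
                        "Lifecycle": dict(lifecycle),
                    }
                ],
            }
        ],
    }


# Hypothetical names and account ID, for illustration only.
plan = build_backup_plan(
    "daily-plan",
    "primary-vault",
    "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
)
print(plan["Rules"][0]["Lifecycle"])

# With credentials configured, the document could then be submitted with:
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Keeping the plan as data makes it easy to review and version before anything is applied to an account.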

When it comes to tools, it may be possible to use some of your legacy on-prem tools on AWS. However, you really need to evaluate whether your legacy tools will sufficiently integrate into the AWS ecosystem and give you the best operational outcomes when compared to cloud-native tools. Let me give you an example. Amazon DevOps Guru is an AWS service that's powered by machine learning, and it integrates with and consumes streams of operational data from other AWS services such as Amazon CloudWatch, AWS Config and AWS X-Ray. This enables DevOps Guru to produce machine learning-powered insights into things like the duration, severity and impact of an incident, as well as recommendations on how you can resolve it. But it doesn't end there. DevOps Guru further integrates with other AWS services such as AWS Systems Manager, where you can automatically create an OpsItem to help you remediate that incident. So you see, it's really the deep integration between these AWS services that's super valuable, as it can help you improve the availability and performance of your applications as well as decrease expensive downtime.

So we've looked at some operational best practices across the domains of people, process and tools. Now let's take a look at some of the approaches that you can take, whether you're building or strengthening your operational foundation.

First of all, you can choose to do it yourself, but that doesn't mean you need to do it alone. AWS has a variety of resources, people and services that can help you on this journey. If you decide that you want to handle cloud operations in-house, then a great starting place is the AWS Well-Architected Framework. This is a framework comprised of six different pillars, with one of these pillars being dedicated to operational excellence. This is a really useful guide that can provide you with insights on how you can effectively run your workloads on AWS, as well as tips on how you should be continually evaluating and refining your operational processes over time. Additionally, this has a number of key design principles that I'd like to surface here, as well as pointing out that we added two new design principles just two months ago. I'll be sure to share a link to this at the end of our talk today.

If you're an AWS Enterprise Support customer, then there are a number of services included in your support plan that can help you with evaluating and reviewing the operational health of your AWS environment. These include a number of workshops that your Technical Account Manager can run with you, such as the Operational Readiness Review workshop, which can help you evaluate the operational readiness of your workloads prior to their go-live. There is a Building a Monitoring Strategy workshop that can help you to identify the key metrics that matter the most to delivering successful business outcomes, as well as the Incident Management workshop, which is a tabletop exercise in which your teams test their incident response procedures against a hypothetical incident. Just to note, these are only a sample of the 26 different reviews, deep dives and workshops that your Technical Account Manager can run with you. I'll share a link to the full list at the end.

Now, while some of our customers are comfortable self-operating their workloads, others prefer to outsource to a partner. If you believe that a partner is the right approach to help you operate your workloads, then the good news is that we've got a rich partner ecosystem, with over 160 partners in the AWS Managed Service Provider program that can help you with operating your workloads. These partners undergo a rigorous evaluation process that validates both their business and technical expertise to make sure that they can deliver results for you on AWS.

Something that we heard from our customers was that they wanted AWS to help them with operating their workloads. So we listened, and we created a service called AWS Managed Services, or AMS for short. You can use AMS to help you operate your workloads by using AWS people, process and tools. AMS can monitor your workloads, and in the event that an incident is detected, AMS will proactively work to remediate that incident on your behalf by using our incident management processes, which are backed by service level agreements and service credits. AMS can also help you raise the bar on security by deploying over 150 guardrails backed by runbooks to help remediate incidents that may occur from a security perspective in your environments. AMS can also help you with the important but frankly undifferentiated operational activities such as backup management or patch management, which can free up your resources to focus on what matters the most to your business. And if cost optimization is something that matters to you, then AMS can not only provide you with cost saving recommendations but also take action to implement those changes so that you can realize that saving. These are just some of the capabilities that AMS brings to the table. But there is one more thing that I want to point out, and that is that regardless of whether you're brand new to AWS or you're a long-term AWS customer, AMS can meet you where you are, deploy its tooling into your AWS accounts and help you operate your workloads.

Let's dive a little bit deeper into how AMS can help you with operating those workloads. From a people perspective, AMS gives you the ability to reach into AWS and access operations engineers 24/7 to work on any incidents that arise within your accounts. AMS also provides you with two designated resources to help drive security and operational excellence in your AWS accounts. The first of these is an AMS Cloud Architect. The second is an AMS Service Delivery Manager. In a couple of minutes, you're gonna hear from Swati. Swati is actually the AMS Service Delivery Manager for McDonald's, and she's gonna share with us some stories on how, by partnering with McDonald's, she's been able to help them achieve their operational goals on AWS.

From a process perspective, when you onboard your accounts, AMS brings years of cloud operations expertise in the form of runbooks, which our operations engineers follow to rapidly remediate any incidents in your account. Additionally, if you were to have a security incident, then an AMS Security Engineer will follow a NIST-aligned security incident response procedure, and they will help you to detect, analyze, contain, eradicate and recover from that security incident.

From a tooling perspective, AMS uses all native AWS services under the hood to deliver our operational outcomes. You retain full access to these AWS services as well as full access to your AWS environments. So you can still continue to innovate at a rapid pace whilst also benefiting from having AMS by your side to operate your workloads.

So now that we've had a look at some approaches you can take to build or to strengthen your operational foundation, I'm really excited to hand you over to Betty to talk about how McDonald's have taken some of these approaches on their journey to operational excellence.

Hi, everybody. My name is Betty Gomez and I've been with McDonald's for nine years. It's a lot of french fries. Alright, before we get started here, who in the room has never been to McDonald's before?

That's all right. Let me tell you a fun story, or a fun fact. If you're in the United States, you're never more than 115 miles away from a McDonald's. Don't worry, this session does not include us walking. But if you are interested, there is one right outside of the Venetian.

McDonald's is the world's largest restaurant company. We have 40,000 locations worldwide in 118 countries. When it comes to our people, we have 2.2 million people working for McDonald's and franchisees. And when it comes to our delicious food, we serve 70 million people a day. That is 1% of the world's population fed daily, or 75 burgers made every single second. I'm sorry, guys.

When it comes to our digital mobile application, we transact 69 million transactions every single day. All right. What I'm trying to do here is show you the breadth, the depth and the scale of operating McDonald's globally. As you can see, we don't tend to do operations at a small scale.

Now, let's double click on cloud operations specifically at McDonald's. Our AWS platform is the most developed one. Today, we have 900 applications hosted, 2,700 federated users, 6,400 EC2 instances and 100 AWS services leveraged. That's actually closer to 102. I just checked last week.

What does all of this mean to you? When you walk into a McDonald's, you stand in front of the kiosk, you look through the menu items and you decide what you want. You transact via our point of sale system, and you step to the side and look at the order ready board. All of that is powered by cloud and technology today. It's just like when you pull out your phone at home and you're again browsing through the menu and ordering delivery to your home. All of that is powered by cloud and technology.

Look, operations is not really a new concept for McDonald's. Our restaurants have rigor. They've been experts at standardization, security, safety and automation for the crew and for the customers. For us in corporate, we wanted to ensure that we were applying those same lessons learned to our cloud operations. For this, we turned to our partner AMS and quickly got going deploying the solution in our ecosystem.

What I'm gonna do now is take you a little bit through our McDonald's cloud operations journey. There was a pivotal moment for us about two to three years ago where we decided to leverage AMS and deploy it in our ecosystem. We started with what we call our organize phase in around 2021. We defined the scope, the tooling and the process we needed to leverage the new ways of working in our environment. We also started focusing on hiring cloud and tech focused talent. As Paul said, we wanted to avoid letting our project team become our operations team.

Moving on to 2021, to what we call our prepare phase. In this phase, we defined the security, compliance and operational standards for our new ways of working in our new ecosystem. We also defined the roles to build accountability and responsibility within our growing team. And lastly, there were gaps, and we built a plan to remediate those gaps.

After our organize phase and our prepare phase, we moved on to our operate phase, which is the current year, 2023. Here is where we turned to AMS and asked them to deploy the solution to govern, secure and operate 100% of our AWS workloads, leveraging their expertise.

We decided to transform our processes and our tooling. Ops transparency was a big one for us. This year, we assembled a small but mighty team to provide cost optimization, cost transparency, budget management and everything else that we could offer to our organization. Then there's tackling technical debt. 2023 was a year of tackling our technical debt.

I'll give you a little bit of a story. Six or seven years ago, we deployed our first AWS account into our ecosystem at McDonald's. Fast forward to now, and there are 105 applications hosted in that single account. We decided to decouple that account and move all of those workloads into their designated sub-accounts in a new landing zone wrapped with AWS Control Tower. We wanted to deploy AWS Control Tower so that we had a secure, monitored and governed environment. For this, we leveraged AMS, our AWS ProServe team and the enterprise account team at McDonald's to quickly get going on decoupling this massive account.

Don't forget to upskill. Technology and cloud move so fast. In 2023, we wanted to make sure that our team members and our growing team were always up to date with the latest trends in technology, knowing how to apply them in the workforce. Moving forward to our evolve phase, which is 2024 and beyond, McDonald's wants to continue to focus on restaurant operations powered by tech and powered by cloud. We also want to focus on business resiliency and continuity to get closer to those goals that we've set for ourselves. Incident detection and response in partnership with AMS is a big one for us. Although the solution is already deployed in the ecosystem, there's a lot more coming in this space that we're focused on. And lastly, we're focusing on observability and AIOps.

See that green arrow? That's where we are. That's the McDonald's journey thus far. However, this is not your journey, and I think you can take a lot of our lessons learned. But for now, I'd like to welcome Swati to the stage to walk us through some use cases and success stories at McDonald's led by AMS.

Thank you. Thank you, Betty. Like Betty said, your journey may or may not look like this, and that's OK. In the next section, we're going to go through three use cases to give you some tangible outcomes that you can take back and work on within your organizations.

I'm Swati and I have been working with Betty and McDonald's over the past two years. So let's get started.

Our first use case is cost optimization. By a show of hands, how many of you are doing some sort of cost optimization in the cloud? Oh, very impressive. I think about 80% of you. So, while significantly reducing the operating cost of cloud is difficult and takes a long time, there are multiple low-hanging fruit that we can get started on to chip away at our cost optimization goals. We'll show you how we did that with McDonald's.

The first step was the analyze phase, which really was the understand phase. You need to understand your cost before you can optimize your cost. Now, you can do that yourself by using AWS Cost Explorer, or, in the case of McDonald's, our TAMs, who are our Technical Account Managers, run monthly cost optimization reviews where we go through these opportunities in partnership and analyze and prioritize them, which gets me to my next phase.

When you do look at Cost Explorer, you will see a lot of cost optimization opportunities, and they can be quite overwhelming. So we want to make sure that we prioritize. We want to prioritize not just the cost optimization opportunities, but also look at items that will give you operational excellence, better security and better performance.

When it came to McDonald's, we decided to go with low-hanging fruit: simple, small changes with low risk. If I'm honest, when we started, our first opportunity was bringing us a $600 saving per month. It almost seemed too small, but we wanted to get started, and we'll show you what we did with it shortly.

Once you have prioritized, and you know what it is that you want to attack and what your priority is based on your business, the next step is plan, which is very simply what needs to be done, by when and by whom. Then comes the execute step. What we often see is our customers get stuck at that step. They go through one, two and three pretty seamlessly and come to the recommendations phase. But it's the execute that becomes really hard. It's either due to prioritization, resource constraints or other priorities.

AMS is the hands-on-keyboard partner for McDonald's that executes these actions. So far on our journey, we started with deleting any unattached EBS volumes, which are Elastic Block Store volumes. We went on to disabling any duplicate AWS CloudTrail logs. We've analyzed underutilized EC2 instances and looked at streamlining those. So there's a lot that can be done based on your priorities.
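To make the first of those actions concrete, here is a minimal sketch of how unattached EBS volumes could be identified. It assumes records shaped like the `Volumes` list returned by the EC2 `DescribeVolumes` API, where an unattached volume reports `State` as `available` and has no attachments. The sample volume IDs and sizes are made up for illustration, and this is not a description of how AMS performed the cleanup at McDonald's.

```python
def unattached_volumes(volumes):
    """Return volumes that are not attached to any instance."""
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


# Made-up sample shaped like the "Volumes" list from DescribeVolumes.
sample = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 200,
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 100, "Attachments": []},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 50, "Attachments": []},
]

candidates = unattached_volumes(sample)
total_gib = sum(v["Size"] for v in candidates)
print([v["VolumeId"] for v in candidates], total_gib)
```

A review step, for example taking a final snapshot or checking tags for an owner, would normally sit between identifying candidates and deleting anything.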

So what can you take away from this slide? Understanding your cost and prioritizing it. But the big one is really who is the hands-on-keyboard person, team or role that can actually execute, because until the value is realized, the first three steps are almost moot. Betty, what benefits did you see as a result of this exercise?

Our direct outcome was recurring monthly savings of $60,000, which is a great achievement in itself. But it also taught us how to be more cost effective and cost optimized, and how to manage our dollars better to invest in more innovative products in the future.

Moving to use case two, driving business continuity. If a disaster strikes a business and your important data is lost, how easy would it be for you to recover your business? Let me make it simple. If you were to lose your phone here today at re:Invent, how easy would it be for you to recover your photos or your contacts? Everybody can check their phone, but I hope nobody loses their phone here today. But the question is important: do our backup strategies help our business continuity plans?

Business continuity is an organization's ability to continue operations during and after an incident. There's multiple aspects to business continuity. One of them is backup management and leveraging AWS Backup.

Our first phase in this use case was the observe phase. A core part of my role as a service delivery manager is to understand the current state of patching, operational metrics, cost and backup. When we did our initial metrics for McDonald's, we observed that the customer had chosen EBS snapshots as their backup strategy, and that had been working well. What that simply means is, if there was an impacted instance which needed to be brought up due to any incident, they would infer the configuration of that instance: what it is made of, what the VPC and subnets are, and so on. They would create that new instance, restore their snapshots from backup and convert them to volumes, and then attach those volumes to that instance. All together, that makes your instance available again. That works and has been working well.

But we wanted to make sure we continue to raise the bar on operational excellence. So we started looking at alternative solutions. We looked at the possibility of doing EC2-level backups, which really means you take a snapshot of your EC2 instance, which takes a backup of your entire EC2 metadata. So if you've lost that instance, you restore it from backup. Now, while that is a simpler step, it does have more cost associated with it.

We presented both these options to McDonald's and asked them to choose what works for them from a risk, faster recovery and cost perspective. McDonald's decided to transition to the EC2-level backup, despite the cost, for faster availability and business continuity for the critical applications, but chose to stay with their EBS-level snapshots for the non-prod environment. That mix has been working well for them so far.

Then comes the execute phase, which is really making sure that the right backup plans are associated with the right resources. AMS was quickly able to execute based on our agreements and bring it to life. We were also able to go through any duplicate backup plans and clean those up as part of operational hygiene.

So what can you take away from here, other than don't lose your phone? Make sure that you have a backup plan that is audited. And don't just take the snapshot and take the backup, but also restore from backup often, measure the time it takes to back up, and continue to chip away to make that faster for your business.

Betty, can you tell us about our exercise here?

Thanks. Absolutely. Our EC2 backup compliance increased from 2% to 91% for all of our production environment, helping us recover faster during high-priority incidents. We were also able to deploy new standards, runbooks and standard operating procedures for a new way of doing backups.

Our third and last use case is security. Security is job zero for us all. It is a top priority. There are multiple aspects to security. One of them is vulnerability management and vulnerability assessment. Vulnerability assessment is the process of identifying, classifying, analyzing and remediating the vulnerabilities in your environment. The reason we want to do that is we want to proactively manage our vulnerabilities so we can reduce the probability of cyber incidents. Vulnerability assessment and management are no longer a nice-to-have and are very much a foundation of security for operational excellence. Let's dive deep into this use case.

Our first step was identification. We worked with the McDonald's security team and the AMS security team and got a full list of all open vulnerabilities that were in the environment. We did that by using Qualys, which is a vulnerability scanning tool leveraged by McDonald's, and AWS Security Hub. Once we had a whole list that we could go analyze, we worked with AWS ProServe to create Amazon QuickSight dashboards for us to be able to consume that data. We then moved to the assess phase.

Not all vulnerabilities have the same risk. Some vulnerabilities are higher risk and are exploitable, while some may have lower risk. So getting an understanding of which vulnerability poses what risk, and assessing that, is critical to remediating them. We worked with the team and defined four categories in the assessment phase.

We had the critical vulnerabilities that needed to get addressed within 15 days of identification. We then had our highs that needed to be addressed within 30 days, our mediums within 60 days and our lows within 90 days. Now, the reality is we want to address our vulnerabilities as soon as possible and not wait for that time period. But the reason we have those windows is that they help us prioritize. They help us know and assess the risk a vulnerability poses in the environment.
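The four remediation windows above can be expressed as a small lookup. The sketch below computes a due date from a severity and an identification date; the function names and the sample dates are hypothetical, and only the 15, 30, 60 and 90 day windows come from the talk.

```python
from datetime import date, timedelta

# Remediation windows from the talk, in days from identification.
SLA_DAYS = {"critical": 15, "high": 30, "medium": 60, "low": 90}


def remediation_due(severity, identified_on):
    """Return the date by which a vulnerability should be addressed."""
    return identified_on + timedelta(days=SLA_DAYS[severity.lower()])


def is_overdue(severity, identified_on, today):
    """True if the remediation window has already passed."""
    return today > remediation_due(severity, identified_on)


found = date(2023, 11, 1)  # sample identification date
for severity in SLA_DAYS:
    print(severity, remediation_due(severity, found))
```

Sorting open findings by due date rather than by severity alone gives a simple work queue that respects these windows.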

Then comes the address phase. Vulnerabilities can be addressed through effective patch management, and by effective, I mean patching across your entire suite and the scope of your environment, and checking on patch failures and patch successes as well. AMS performs patch management for McDonald's. We worked through our vendor-released patches and applied those patches in the application owner's chosen maintenance window. You define the maintenance window, we apply the patches. And then lastly, iterate.

While we have an established monthly patching process, we have to continue to raise the bar. We have to continue to get ahead of the curve. So we've been working with the security team to define what more we can do. We're focusing on security policies and procedures. A simple procedure we have implemented recently is making sure that all instances get patched and rebooted, and there are no exceptions. You can move forward or you can go back, but you have to patch. We're moving to the AWS Control Tower that Betty referred to earlier, and then lastly, a strong security response process that helps us in the event we do need it.

So what can you take away from here? Getting to zero vulnerabilities is quite challenging because they catch up. The focus should be patch compliance: what you're patching, is it successful or is it not? Patch coverage: how much are you patching? And then lastly patch latency, which is the time between when the patch was released and when you applied it, and you want that as low as possible.
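One way to turn those three measures into numbers is sketched below. The exact formulas are an interpretation rather than definitions given in the talk: coverage is taken as the share of the fleet where patching was attempted, compliance as the share of attempts that succeeded, and latency as the mean number of days between release and successful application. The record fields and sample data are hypothetical.

```python
from datetime import date
from statistics import mean


def patch_metrics(fleet_size, records):
    """Compute coverage, compliance and mean latency from patch records.

    Each record is one instance's patch attempt with the keys
    'released_on', 'applied_on' and 'succeeded' (assumed field names).
    """
    attempted = len(records)
    succeeded = [r for r in records if r["succeeded"]]
    latencies = [(r["applied_on"] - r["released_on"]).days for r in succeeded]
    return {
        "coverage": attempted / fleet_size if fleet_size else 0.0,
        "compliance": len(succeeded) / attempted if attempted else 0.0,
        "mean_latency_days": mean(latencies) if latencies else None,
    }


# Made-up fleet of four instances, three of which were patched this cycle.
released = date(2023, 11, 1)
records = [
    {"released_on": released, "applied_on": date(2023, 11, 11), "succeeded": True},
    {"released_on": released, "applied_on": date(2023, 11, 21), "succeeded": True},
    {"released_on": released, "applied_on": date(2023, 11, 21), "succeeded": False},
]
print(patch_metrics(4, records))
```

Tracking these per patch cycle shows whether the process is improving even when the raw vulnerability count keeps moving.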

Betty, I know there's a lot of work that's happening in security. Can you talk to us about some of the outcomes and results you've seen?

Absolutely. We've seen a 40% reduction in vulnerabilities, enhancing our overall security posture. Our cloud security team now also has a way to report, display and communicate security to the greater organization and the different levels of leadership, and more.

So we went through three use cases today. There's a lot that has been done and there's a lot more that's coming. These examples and use cases gave you a lens into our partnership so far, as we take on more work and initiatives in partnership together.

And as you take this forward, we want you to focus on these three core domains: people, process and tools. For AMS, we want to use as much cloud-native tooling as possible to allow our customers to effectively operate, scale and innovate at the pace that AWS is innovating.

An example here is the AMS Resource Scheduler. The AMS Resource Scheduler is built on top of the EC2 resource scheduler. What it does is help you schedule turning your resources off or on at a set schedule, so you can hibernate your resources as you like. As a developer, you can hibernate from 6 p.m. to 6 a.m. or over the weekend, or you can turn off your development instances or testing environments en masse for the periods when they are not being used.
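The schedule described above, hibernating from 6 p.m. to 6 a.m. and over the weekend, boils down to a simple time check. The sketch below only illustrates that logic in plain Python; it is not how AMS Resource Scheduler is configured, and the function name and hour constants are assumptions for the example.

```python
from datetime import datetime

RUN_FROM_HOUR = 6    # instances start at 6 a.m.
RUN_UNTIL_HOUR = 18  # instances hibernate at 6 p.m.


def should_be_running(now):
    """True on weekdays between 6 a.m. (inclusive) and 6 p.m. (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    in_hours = RUN_FROM_HOUR <= now.hour < RUN_UNTIL_HOUR
    return is_weekday and in_hours


for moment in (
    datetime(2023, 11, 27, 9, 0),   # Monday morning
    datetime(2023, 11, 27, 19, 0),  # Monday evening
    datetime(2023, 12, 2, 12, 0),   # Saturday midday
):
    print(moment, should_be_running(moment))
```

With this schedule an instance runs 60 of the 168 hours in a week, which is where the cost benefit of hibernating development and test environments comes from.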

An additional benefit of this is patching. If your instance is turned off, Resource Scheduler will turn it on, patch it and turn it back off, all automatically, with no interruption or manual effort needed. There are other benefits, such as cost benefits, to using AMS Resource Scheduler as well.

Betty, can you talk to us about some of the benefits you've seen?

Absolutely. We were leveraging a third-party tool to do our hibernation. By moving to Resource Scheduler, we were able to save $119,000 in licensing costs.

Everything you've heard up until this point ties back to our McDonald's goals, which are continuing to increase reliability and stability for our technology. We want to continue to establish global platforms and services for our crew and for our customers. We want to continue to improve cloud operational efficiency and keep going up that journey. We will continue to innovate for the future and obviously grow our McDonald's technical community.

Now, I'd like to wrap it up by giving it back to Paul.

Awesome. Thank you so much, Betty and Swati, for sharing McDonald's journey and giving us a glimpse at the scale at which you operate. It's fascinating.

So before we wrap up today, I'd like to leave you with a couple of calls to action for you to ponder when you get home from re:Invent.

Firstly, thinking back to Swati's first use case, it isn't realistic to overhaul your entire operational landscape in just one day. Instead, you really need to think about what your key operational pain points are and then prioritize changes around these, because these are the ones that are most likely to give you the greatest business value.

Secondly, I really recommend that you download the latest copy of the AWS Well-Architected Framework and take a look at that to get insights on how you can raise the bar from an operations perspective. Maybe not reading for tonight, but perhaps reading on the plane home. And finally, partner with your AWS account team, whether it's your account manager, your solutions architect or your technical account manager. We're all here to help you and guide you on this journey, and we can connect you with the right resources, people and partners to help you be successful.

If you'd like to see more cloud operations content, I've got a couple of next steps for you. Firstly, the mandatory one: if you could please fill out the session survey, Betty, Swati and I would really appreciate your feedback. This data helps us understand if content like this is helpful for you, and it shapes the type of content that we'll create for summits next year as well as re:Invent next year.

Also, we've got an ask-us-anything booth over in the expo hall in the Venetian, where you can come along and speak to some cloud operations experts like myself, Swati or many other people. You can share your use cases, get some insights on how you could do things in alternate ways and raise the bar on operational excellence.

And of course, as promised, I've got a number of resources here that you can check out when you get home. So feel free to take a snap of these if there's anything that you'd like to review afterwards. Betty, Swati and I will be available in the room or just outside afterwards if you've got any questions or thoughts, or want to learn more about McDonald's journey.

Other than that, I'd like to thank you all so much for joining us today, and I hope you enjoy the rest of the week here at re:Invent. Thank you.