Sustainable compute: Reducing costs and carbon emissions with AWS

My name is Jackson Wagstaff. I'm a compute specialist here at AWS. Today we're going to be talking about sustainable compute and how you're going to reduce your costs and carbon emissions with AWS.

I'm going to kick things off and then I'm going to hand it over to Sri Dayala from Adobe, who's going to talk about their experience in adopting Graviton. Then lastly, Robert McConn, who's a solutions architect here at AWS, is going to talk about elasticity and resource optimization.

Before I dive into the agenda, I'd like to bring up the topic of rock climbing, which is probably not something you're expecting to hear about. But I think it makes a good analogy for sustainable compute. It's a sport that I did quite a bit when I was younger and it gets you out into some really beautiful settings, which I think is why we're all interested in sustainability.

You can see this climber, they're very high up on a cliff and it took a lot of energy to get there. When you're rock climbing, you're really trying to use as little energy as possible as you make your way up a rock climbing route so that you can sustain yourself all the way to the top. That's the same thing with sustainable compute. You're trying to use the fewest resources required to accomplish your goals and do it at the lowest cost and with the lowest carbon footprint possible.

Is anybody here afraid of heights? Any fear of heights in the audience? Ok, yeah, I'm afraid of heights; just because you rock climb doesn't mean it goes away. I think it's a healthy fear to have. You can numb it a little bit, but it never really goes away.

If you've never rock climbed before, you may see this and think it's something that's hard to do or you'll never do it, but it's actually really easy to get started. They have climbing gyms. You can go down, rent a pair of shoes and try out rock climbing on what's called a bouldering wall where you don't have to use ropes. You go more sideways, there's padded floors. You can take selfies of yourself to show your friends what you did.

The same thing is true with sustainable compute: just by using AWS, you're already starting in a very good place. We're a co-founder of the Climate Pledge, and we're on a path to powering our operations with 100% renewable energy by 2025, which would be five years ahead of schedule. And a survey by 451 Research showed that AWS was 3.6 times more energy efficient than the median of the data centers they surveyed. So just using AWS means you're already starting in a very good place and being much more efficient than you likely could be in your own data center.

But just because you're using AWS doesn't mean you're using it as efficiently as you could. The same goes for rock climbing: going to a climbing gym doesn't mean you have all the skills and tools to be really efficient.

If you go to a gym and you see an experienced climber, you'll notice they move very fluidly across the climbing wall. And that's because one of the skills they've worked on is footwork. A lot of beginners don't know this, but you want to hold the majority of your weight on your feet when you're rock climbing because that's the most efficient way to do it. Your feet can hold your weight much more easily than your hands do. A beginner will go to a climbing wall and use their hands and not really think about their feet and wear themselves out very quickly.

It's the same with AWS: just using AWS isn't enough. You need to work on the right skills and develop the tools and experience to be able to use the things we're going to walk through today and become as efficient as possible.

With rock climbing, footwork is one thing, but there are many other skills you can develop, like how you grip certain holds and how you move across different parts of a route. When you put it all together, what you can accomplish is actually pretty impressive.

Here's a climber. She's obviously a very experienced climber and notice how she's not bending her arms unless she absolutely needs to. That's because she's trying to conserve energy. It's much easier to hold your arm straight than it is to actually flex your muscle. She's also clipping the rope through these carabiners so that if she falls, it catches her, but she does it quickly and she makes it look easy, but it's actually fairly difficult to clip a rope through a carabiner with one hand if you've never tried it. She does that quickly so she can get her hand back onto the rock face and save her energy as she makes her way up that rock climbing route.

Here are some of the skills and tools we're going to talk about and services that you can use to become really efficient with your workloads on AWS:

  • I'm going to talk about Spot Instances and AWS Graviton processors.
  • Then I'm going to hand it over to Sri Dayala from Adobe, who's going to talk about their experience adopting Graviton.
  • Lastly, Robert's going to finish up with resource optimization and continuous improvement because we're never done improving as new technologies come about.

EC2 Spot Instances are AWS's spare capacity that we offer to customers. They're just like regular on-demand EC2 instances or EC2 instances covered by Savings Plans, but they come with the caveat that, since it's our spare capacity, AWS can reclaim those instances with a two-minute warning. Anything you run on Spot instances can go away with a two-minute warning.

You may think, why would I want to use EC2 instances that can be reclaimed? There are two primary reasons:

  1. It's up to 90% off the on-demand price. These prices fluctuate based on the instance type and availability zone, but generally provide very compelling savings relative to on-demand. And you don't have to make any form of commitment, either.

  2. They're also sustainable. When you use Spot instances, you're helping AWS get greater utilization out of our data centers, and ultimately operate more efficiently and provide a more efficient offering to our customers at the end of the day.

So those are the two primary reasons to consider Spot instances.

Now, we talk a lot about interruptions, but they happen a lot less than you might expect. Since 2022, 95% of Spot instances have been self-terminated by customers before AWS needed to reclaim them. We have to talk a lot about interruptions because any workload you run on Spot has to consider them, but it's not uncommon for a Spot instance to run for days at a time.

Also, we have this concept of Spot instance pools. While we have lots of spare capacity at any given moment, our customers are constantly growing and scaling, so we don't necessarily have spare capacity in every single instance type, size, and availability zone.

Spot instance pools are defined as a specific instance type and size in a specific availability zone. That's how the prices are set. You can see in this example that the prices for the exact same instance can vary by availability zone. If you're getting interrupted in one Spot instance pool, with a particular instance size in a particular availability zone, it doesn't necessarily mean you'd be interrupted in a different availability zone. It's an important concept to keep in mind when you're thinking about best practices for using Spot instances.
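To make pool-level pricing concrete, here's a minimal sketch using boto3 (the AWS SDK for Python) that pulls recent Spot prices for one instance type across availability zones; the region and instance type are just placeholders:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Placeholder region and instance type.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_spot_price_history(
    InstanceTypes=["m5.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
)

# Each (instance type, size, AZ) combination is its own Spot instance
# pool with its own price.
for price in resp["SpotPriceHistory"]:
    print(price["AvailabilityZone"], price["InstanceType"], price["SpotPrice"])
```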

So flexibility is really a key component of running workloads on Spot. The first dimension of flexibility we think about is instance flexibility. The more instance types and sizes you can use, the greater access you give yourself to Spot capacity. It's important to think about price performance and value instead of just the perfect instance type for your workload, because even if an instance doesn't fit your requirements perfectly, it may be discounted such that it's a really good value for what you get, even if you have extra CPU or memory that you're not using.

The second aspect is time flexibility. If you don't care whether a job gets done in two hours or ten, that gives you the comfort that the job will get done while letting you be a little more picky about the instance types you use, and perhaps set a certain price that you're willing to pay.

Lastly, we have location flexibility. The more availability zones and regions you can use, the greater access you give yourself to Spot capacity. So adding another region may more than double the Spot instance pools that you're able to leverage.
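Putting those flexibility dimensions into practice, here's a hedged sketch of an EC2 CreateFleet request diversified across several instance types. The launch template name is hypothetical, and in a real request the subnets referenced by the launch template (or its overrides) determine which availability zones are in play:

```python
import boto3

ec2 = boto3.client("ec2")

# One launch template, many instance-type overrides: the wider the list,
# the more Spot instance pools the request can draw from.
resp = ec2.create_fleet(
    Type="instant",  # synchronous, one-time request
    TargetCapacitySpecification={
        "TotalTargetCapacity": 10,
        "DefaultTargetCapacityType": "spot",
    },
    SpotOptions={"AllocationStrategy": "price-capacity-optimized"},
    LaunchTemplateConfigs=[{
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "my-flexible-workload",  # hypothetical
            "Version": "$Latest",
        },
        "Overrides": [
            {"InstanceType": "m5.xlarge"},
            {"InstanceType": "m5a.xlarge"},
            {"InstanceType": "m6i.xlarge"},
            {"InstanceType": "r5.xlarge"},  # extra memory, but maybe a good value
        ],
    }],
)
print([i["InstanceIds"] for i in resp["Instances"]])
```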

The common Spot workloads we see share a lot of these characteristics: they're flexible, fault tolerant, stateless, or loosely coupled. Containerized workloads are a common use case; those tend to be more modern applications that can handle interruptions. We have web services that scale up and down, which often inherently have interruption tolerance. HPC, batch processing, big data: anywhere you have a lot of compute crunching away, it's not only a good opportunity to lower your costs significantly and get more efficient, but those workloads often have retry capabilities that make it very easy to use Spot.

We have customers also running CI/CD pipelines and ML workloads on Spot as well.

Spot's been around for a very long time, so it's tightly integrated with many AWS services. If you're using Auto Scaling groups or EC2 Fleet, our container services, EMR, or Batch, it's very easy to take advantage of Spot because those integrations are there and easy to adopt. This extends to our third-party partners: they know their customers often want to use Spot instances, so they bake native support into their applications.

So if you depend on an ISV, it's worth checking whether they've done that. We also have our Spot Ready partners, who've been validated to help enable Spot best practices.

We launched Spot in 2009 and have been continuously improving it and making it easier to use ever since. In 2017, we removed the bidding process to make Spot instances easier to use, and we've continued to iterate. So if you looked at Spot instances a while ago, there's a good chance many things have changed and gotten easier.

For example, Rebalance Recommendations are a best-effort early warning we provide, saying that the Spot instance pool you're in is getting shallow and you might want to move off this instance because you're at a higher likelihood of being interrupted. We've integrated that with Auto Scaling groups to make it easy to take advantage of those Rebalance Recommendations.
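Both the rebalance recommendation and the two-minute interruption notice surface through the EC2 instance metadata service. Here's a minimal polling sketch (using IMDSv2) that a worker process could run on the instance to start draining early; the actions on each signal are up to your workload:

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token():
    # IMDSv2: fetch a session token first.
    req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

def imds_get(path, token):
    req = urllib.request.Request(
        IMDS + path, headers={"X-aws-ec2-metadata-token": token}
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError:
        return None  # 404 means no notice has been issued yet

token = imds_token()
while True:
    if imds_get("/meta-data/events/recommendations/rebalance", token):
        print("Rebalance recommendation: start draining work gracefully")
    if imds_get("/meta-data/spot/instance-action", token):
        print("Two-minute interruption notice: checkpoint and shut down")
        break
    time.sleep(5)
```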

That's continued with some of the more recent things we've done, which Robert will talk a little bit about, like attribute-based instance type selection and our EC2 Flexibility Score dashboard.

Spot Placement Scores are one of my favorite features we've launched. This is a tool that allows you to get real-time insight into Spot capacity, given your capacity and instance requirements. You can see in this example from the AWS console: you put in your target capacity of how many instances you need, then the instances you can use, and it gives you a score from 1 to 9 indicating the likelihood of getting that capacity at that point in time, by availability zone or by region.

You can see that when I ran this, Ireland was a nine and Oregon was a four. It's also a great way to see the benefit of implementing Spot best practices. Say you have a multi-architecture workload that can run on both x86 and Arm instances, and you add in the three equivalent Graviton variants: when I ran this, Ireland stayed at a nine and Oregon jumped from a four to an eight. The more instance types you add, the higher the Spot Placement Scores get.
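Here's roughly what that looks like programmatically; a sketch with boto3's get_spot_placement_scores, where the instance types, target capacity, and regions are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.get_spot_placement_scores(
    InstanceTypes=["m5.xlarge", "m5a.xlarge", "m6i.xlarge"],  # placeholders
    TargetCapacity=500,
    TargetCapacityUnitType="units",   # could also be "vcpu" or "memory-mib"
    RegionNames=["eu-west-1", "us-west-2"],
    SingleAvailabilityZone=False,     # score whole regions, not single AZs
)

# Scores range from 1 (unlikely to get the capacity) to 9 (very likely).
for score in resp["SpotPlacementScores"]:
    print(score["Region"], score["Score"])
```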

Now, we had customers who said they wanted to see what this looks like over time. It's nice to get a snapshot, but how does it look over days or weeks? So earlier this year, we launched the EC2 Spot Placement Score Tracker. It's available on GitHub as a CDK project you can deploy that will log Spot Placement Scores for you, so you can see how your configuration scores over the course of days, weeks, or months.

So just to wrap up: Spot is very easy to use; it works just like your regular EC2 instances. I kind of think of it like bouldering: you don't have to know how to use ropes, and you can take advantage of it very easily if you already use EC2. Since 2013, it's helped customers save over $10 billion. So it's a great place to start, and worth a look if you have workloads that might be good candidates.

So, moving on to Graviton. AWS offers a broad choice of processors. Historically, we've offered Intel processors, and now AMD processors as well. On the Arm64 side, we offer Apple M1 and M2 Pro processors, and then our homegrown AWS Graviton processors.

Now, AWS has been innovating around silicon for a long time, and this started with our Nitro System. The Nitro System initially focused on offloading the hypervisor from software to purpose-built hardware. That allowed AWS to not only become more efficient but also provide a more performant offering as well, and it has extended to things like EBS, VPC, and security. We also have our Trainium and Inferentia machine learning accelerators, and then our Graviton processors, which we're discussing today, which are the general-purpose processor.

Graviton will generally be up to 20% less expensive than comparable x86 instances, and it provides up to 40% better price performance: not only is it less expensive, you'll also often get a performance boost along with that. So you can do things like shrink the size of your fleet to lower your cost and your carbon footprint. It's also up to 60% more energy efficient for the same performance than comparable instances.

Which makes sense when you consider that the Arm architecture was originally developed for mobile devices, where power consumption is a big concern. So it's a great opportunity to help lower your carbon footprint as well.

We initially launched Graviton in 2018 and have continued to launch new generations since then. We launched Graviton2 in 2019, then Graviton3 in 2021, which had 25% better compute performance than Graviton2. And just this week, we launched Graviton4, which has up to 30% better compute performance than Graviton3. So it's exciting, and by now we have a mature offering with Graviton instances for pretty much any use case you can think of. If you have a particular workload that might be a good candidate, there's probably a Graviton instance that matches it.

Now, in adopting Graviton, there are two aspects that are important to think about. The first is the impact it can have: how big is that workload, and how much of your cost and carbon footprint does it make up? The other aspect is the ease of adoption.

When it comes to adopting Graviton, the easiest place to start is typically with managed services. With things like RDS, Aurora, OpenSearch, or ElastiCache, you can often just move right over to Graviton and it'll run with no issues. You may have to pay attention to the version you're on, but other than that, it's very seamless.
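As an illustration of that managed-service path, here's a sketch of moving an RDS database to a Graviton instance class with boto3. The identifier and instance class are placeholders, and you'd first confirm that your engine version supports Graviton instance classes:

```python
import boto3

rds = boto3.client("rds")

# Move an existing database to a Graviton-based instance class.
rds.modify_db_instance(
    DBInstanceIdentifier="my-postgres-db",  # hypothetical identifier
    DBInstanceClass="db.m6g.large",         # Graviton counterpart of db.m5.large
    ApplyImmediately=False,                 # apply in the next maintenance window
)
```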

The same goes for EMR, and ISVs that support Graviton make it very easy to adopt as well. Lambda is also quite easy. Often, where customers find adopting Graviton much easier than they expected is with modern application languages like Java or PHP. Arm has been around long enough now that those languages have built in native support, so you can take something like a Java application, move it to a Graviton instance, and it'll run just fine. You do have to pay attention to things like dependencies, but this is often where there's a lot of impact opportunity and where it's much easier than you might expect.
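For the Lambda path, a minimal sketch of creating a function that runs on Graviton simply by requesting the arm64 architecture; the function name, role ARN, and deployment package here are hypothetical:

```python
import boto3

lam = boto3.client("lambda")

with open("app.zip", "rb") as f:  # hypothetical deployment package
    package = f.read()

lam.create_function(
    FunctionName="report-generator",  # hypothetical
    Runtime="python3.12",
    Architectures=["arm64"],          # run on Graviton instead of x86_64
    Handler="app.handler",
    Role="arn:aws:iam::123456789012:role/lambda-exec",  # hypothetical role
    Code={"ZipFile": package},
)
```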

Now, with compiled languages, or languages that depend on compiled libraries or dependencies, things can be a little more involved because you have to recompile your application. And lastly, .NET Core running on Linux can run on Graviton as well.

So just to wrap up there: Graviton is kind of like climbing with ropes. Now you're learning how to be very efficient and move very quickly, without much cost or carbon footprint. We have over 50,000 customers using Graviton today.

And with that, I'm going to hand it over to Sri, who's going to talk about Adobe's experience as one of those 50,000 customers taking advantage. Over to you, Sri.

Thank you, Jackson. Who's ready to go rock climbing now? He's almost convinced me that I'm capable of doing it. I'm really excited to be here today. My name is Sri Dayala, and I'm the global director of our FinOps program at Adobe. As many of you know, at Adobe we're really focused on products that delight our customers and enable creativity for all. But we're also focused on making sure we develop these products in a sustainable manner that's environmentally responsible.

So how do we do this? We align our goals with our values, and we've put forward some stringent goals for our developers and engineers to hit by 2025 and 2050. The big one, aligning with many companies and the net-zero goals of AWS, is to get to net-zero carbon emissions by 2050. We also want to move to 100% renewable energy for all of our facilities, offices, colocations, and data centers across the board by 2025, and to reduce per-FTE water consumption globally by 25% by 2025 as well.

We've made significant progress in this space. We've reduced scope 1 and 2 emissions by 35% over the past year, and about 80% of our buildings by square footage are LEED certified; we're continuing to progress toward 100% by 2025.

So when we look at driving a positive impact at Adobe, it's very much intertwined with how we approach product development. You may be asking yourself: why is the FinOps guy up here talking about sustainability? Why isn't an ESG person up here to talk about our environmental strategy? I want to dive deep into our sustainability journey, especially around scope 3 emissions, which deal with suppliers such as your public cloud providers like AWS.

To tackle that, we realized it's very much intertwined with consumption and efficiency around how we use the cloud. My organization is pretty much dedicated to enabling, enriching, and educating our engineers and product teams on how to maximize value and use the cloud as efficiently as possible, through tools, assets, dashboards, and even engineering hours that we donate to make sure we can run our applications and workloads as efficiently as possible.

We've even created an internal professional services team called the Cloud Efficiency Consulting Group: engineers who work on efficiency projects for free and are "paid" in the efficiency and savings we generate for the business and our infrastructure community. We've also noticed, as we've done more and more of these projects, that we are the stewards of billing data, utilization and consumption data, and CO2 emissions data. That lets us create a more complete picture so we can prioritize and standardize the workloads we want to optimize for cost reduction, carbon emission reduction, and productivity increases.

And with this unique position, we've piloted and kick-started many programs across the board focused on efficiency and carbon emission reduction. I'm going to go into one of those POCs in a little more detail today.

I think it's pretty simple. It's the same thing your parents told you when you were a kid: to save energy, you turn the lights off. The less you use, the less power you draw from your resources, and the more sustainable your workloads become. It's really about reducing that activation energy Jackson was talking about and just getting started.

So how did we get started? I want to talk about a POC we recently kicked off that was highly successful with our Advertising Cloud division. They have a product known as the Real-Time Bidding platform, which is essentially a bidding platform that places ads for our enterprise customers. It's a containerized environment underpinned by x86 architecture.

We partnered with that product team to start collecting not only performance metrics but business metrics, productivity metrics, and emissions metrics while we worked through migrating to Graviton. As we worked through the POC, we accelerated and helped in two major areas of the Graviton migration.

One was binary compatibility and making sure we worked through those issues. The second was accelerating what we saw as one of the rate-limiting steps to adoption: getting through testing, validation, and deployment.

What we did was deploy engineers to augment the product teams and help work through these issues in a timely fashion, rather than putting it through the standard DevOps process. And we were very selective with the workloads we picked. Just like Jackson mentioned, we picked a Java application that was very easy to port over. There were very few library dependencies across the board, and it was easy to rebuild everything as native Arm code.

So we built the container and pushed it through the testing and validation phase, supplying resources to accelerate that. It turned out to be very easy to do. A lot of folks ask me: what were some of the limitations or issues you ran into in your first migration? And it turns out they're not the types of issues you'd think.

We ran into very few architecture or compatibility issues at all. If anything didn't work, it tended to be logging or configuration issues that were easily solvable. So my advice: if you do run into some hurdles porting over very compatible workloads, chances are it's a pretty quick fix. Often, when we ran into these issues, we could work with our AWS partners to solve these challenges within an hour.

By circumventing the process, going quickly, going to the root cause, and fixing these types of issues fast, we were able to instill confidence and get the program and project off the ground.

Another area we helped with was validation of the savings. They weren't doing this just to move over to Arm for reliability and resiliency, but for the roughly 20% SKU price reduction on average, instance for instance. We also helped them validate and estimate the carbon impact of migrating to Arm.

And it turned out to be highly successful. We were able to migrate in a matter of months, have almost the entire fleet over within the POC, and then close out by moving the rest over after the POC.

So what are some of the results we saw from this POC and this project? We migrated 73% of the entire fleet over to Arm in one shot. It was easy, with surprisingly few issues or hitches when we moved over, and we saw substantial improvements that we didn't even foresee or predict.

A 23% reduction in actual EC2 consumption: not only were we moving over like-for-like and seeing a reduction in costs, we were seeing performance increases that allowed us to right-size those instances and even eliminate some instances altogether.

We also saw the roughly 20% average per-instance savings, but the total savings were much higher than that, because we were using less and running fewer hours overall.

So the savings were well over 30% by the time we migrated over. What about the business metrics we collected? The price performance Jackson mentioned came through clearly in the business metrics for Advertising Cloud: we were able to push through 17% more auctions per CPU than on the analogous x86 architecture. And when we looked at the overall estimated carbon impact, it was almost a 41% reduction in CO2 emissions across the board. Staggering numbers, well beyond our initial estimates, and the level of effort was substantially low compared to other opportunities and techniques we've pursued.

This served as a bastion for our organization to go out, share the lessons learned, promote how easy it is to move to Arm, provide and offer those support services, and build momentum to start expanding this program across Adobe. So I want to talk a little bit about the lessons learned, and they're kind of straightforward, to be honest.

Have some business sense: focus on ROI in terms of not only the return on investment in price but also carbon emissions, and balance that against risk and effort. You probably don't want to start out with a C++ application as your first workload for a POC. There's a ton of applications, and plenty of good criteria and information out there to figure out which workloads to target, so set yourself up for success. Target the 80%: most of the workloads we see across our customers and product teams run in containers and are Java and Python applications, and those are great use cases to move over. Start with them and create a heat map of where you think you'll have the greatest success.

The other thing I want to focus on: most people worry about the technical issues that come up, and I've highlighted the fact that we didn't run into many. But one area that probably slows down most migrations is that these product teams have to focus on creating features that support customers and generate revenue, and continue to work on those products and services. We're asking them to do something above and beyond their roadmap. So make it easy on them. Often, it's program and project management functions that smooth the wheels and keep things going. That was one area of support we realized we needed to buff up across the board, and it was very helpful in augmenting these teams' roadmaps and DevOps processes to make sure we weren't intruding on the other work they needed to do. That's probably the most impactful thing you can do to enable your organization as you think about starting POCs yourself and getting Arm migrations off the ground.

And the last thing that's really important to highlight: celebrate the success. Oftentimes I'll talk to engineers and say, "I can save you 55% if you go work on this workload," and I'll get blank stares; they have other priorities. But when I talk about carbon emission reduction, it's amazing how many folks jump on board and come forward with ideas of their own. Engineers are passionate about this. We're all stewards of this planet, and folks want to work on highly impactful migrations that can really help reduce our carbon emissions. Before, I was targeting workloads and going out trying to get volunteers; now I have a laundry list of product teams who are ready to go today, who want the support and the lessons learned, and who want to move over to Arm to see all the benefits we've talked about.

So celebrate that success. I feel like it's a triple win: you save money, you improve performance, and you reduce carbon emissions. Who wouldn't want to do that? So definitely focus on celebrating success as you gain momentum and scale out the program.

So what did we do once we ran this POC? We want to aim high and scale this. Our goal is to move 24% of our CPU hours onto Arm-based chips and Graviton with AWS. We've developed a systematic approach to migrating to Arm, and we've accelerated the process: what took a month a couple of months ago now takes us a day. We migrated 1% of our fleet to Arm yesterday, in one day. So be prepared for success. Understand that you're going to see a hockey stick in demand, and prepare yourself in terms of resourcing, capacity, and availability to move over to Arm and handle that growth. Our migration volume grew 10x from one quarter to the next.

So I really want you to understand that this will take off once you prove it out. The last thing I want to focus on: Arm is just one element in your toolkit for carbon reduction. The reason I'm up here talking about carbon reduction from a FinOps perspective is that there are far more tools in your efficiency toolkit, whether it's switching to managed or native services, right-sizing or downsizing instances, or using Lambda functions to get off of heavy compute instances. Make that information readily available to your engineers so they can not only balance costs when creating their roadmaps and plans but also have the carbon data at hand. That's something we're working toward: making our engineers informed consumers so they can design efficiently from a sustainability and environmental perspective as well as a cost perspective.

I would love to talk with anyone who has questions about this afterwards. We're very passionate about this. We think Arm is a no-brainer from a sustainability perspective: high impact, high return across the board. But beyond that, there are other efficiencies around your workloads you can look at to help increase sustainability, and with that, I'm going to hand it off to Robert, who's going to talk about them. Thank you.

Thanks, Sri. It's awesome seeing Adobe's incredible commitment to sustainability; I really love the 24% target by 2024. So I'm going to wrap up our session talking about elasticity and resource optimization, and then a little bit about continuous improvement. Unlike Jackson, I am not a climber. But I did try climbing once in a climbing gym, and there were a few surprises. My first surprise was just how rapidly I went from being vertical, climbing a wall, to being horizontal, laid out on the mat, completely exhausted, unable to move. I was not optimizing my climbing resources, and I wasn't exactly as comfortable as that guy either. Once I was finally able to crawl myself off the mat and get up, I checked with an actual climber, and they gave me a few tips and tricks on how to climb a little more sustainably. Some of those were things Jackson mentioned earlier, like focusing on your legs for the climbing rather than your arms.

My second surprise was just how effective a few tiny tips and tricks were at getting me to climb in a sustainable way. So it's in that spirit that I want to talk about a few tips and tricks you can use to optimize your workloads for sustainability and savings. We're going to call these optimization levers, and I'm going to talk about three of them: scaling your workloads, maintaining capacity, and updating your infrastructure.

So let's start with scaling your workloads. Normally when we think of scaling, we tend to think about matching our customer demand, perhaps hitting certain metrics for our workload. And that's what scaling out is: matching that demand. But elasticity goes both ways. Scaling in is where we look to reduce our waste and our costs, and it's often overlooked. There's some nuance there: if you scale in too fast or too aggressively, you can impact your customers; if you scale in too slowly, you're going to impact your pocketbook, and somebody from FinOps might come and want to have a little chat with you.

Now, Jackson talked earlier about optimizing instances: we talked about Spot Instances, and he talked about leveraging Graviton. But here we want to step back a little bit and look at your workload. And when we're talking about workloads, I want you to keep this maxim in the back of your head. It's simple, it's obvious, but we often don't think about it: the most sustainable and cost-optimized instance is the one that you don't run.

We're going to talk about workloads and how we actually provision that capacity on EC2. RunInstances is a very basic API; it does most of the work of actually launching your instances at the end of the day. Usually, though, you'll want to use something a little more high-level. The EC2 Fleet API is a great choice if you're building or orchestrating something more complicated and specific. For the most part, you'll probably want to leverage a fully managed service such as Auto Scaling groups. Auto Scaling groups are a great choice for maintaining and managing your instances. You can use mixed instance types in the same Auto Scaling group, including mixed architectures, so you can run your x86 and AWS Graviton instances in the same group. You can mix purchase types, so you can use your on-demand and Spot instances together. Auto Scaling groups will replace your interrupted or unhealthy instances. And there are a lot of other features that are outside the scope of this talk, but right now we're focusing on the scaling part of Auto Scaling groups.

If you're going to take advantage of the elasticity of the cloud, you've got to actually scale, and to do that you need scaling policies. A dynamic scaling policy you might start with is a simple or step scaling policy: you define a metric, and your Auto Scaling group will add and remove instances when that metric crosses a threshold. If you want to tune the scale-in side when using a simple scaling policy, look at adjusting the cooldown period. Simple and step scaling are great, but if you're going to use a dynamic scaling policy, we generally recommend target tracking.

Target tracking is sort of a thermostat approach. You define a metric, such as aggregate vCPU utilization across your workload, and the Auto Scaling group will automatically launch and remove instances as needed to keep you on that target. If you want to really fine-tune the scale-in side of your target tracking, adjust the frequency with which your EC2 instances report their metrics to CloudWatch.
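A sketch of what a target tracking policy looks like through boto3; the group name and the 50% CPU target are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Thermostat-style policy: the group adds or removes instances on its own
# to hold average CPU near the target.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-fleet",  # hypothetical group
    PolicyName="hold-cpu-at-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
        # "DisableScaleIn": True would make this scale-out only;
        # leaving it off keeps the scale-in side active.
    },
)
```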

In the spirit of "the most sustainable and cost-effective instance is the one that you don't run": if you have instances that can be shut down overnight or maybe on the weekend, such as developer workstations, you can look to scheduled scaling to set that up. And if you find yourself with a workload that needs to respond a little faster, perhaps proactively scaling out ahead of demand, predictive scaling is a great place to look.
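For the scheduled case, a sketch of a pair of scheduled actions that park developer workstations overnight; the group name, sizes, and cron expressions are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Park the workstations at 7 pm and bring them back at 7 am, weekdays only.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-workstations",  # hypothetical group
    ScheduledActionName="park-overnight",
    Recurrence="0 19 * * 1-5",  # cron syntax, evaluated in UTC by default
    MinSize=0, MaxSize=0, DesiredCapacity=0,
)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-workstations",
    ScheduledActionName="wake-for-the-day",
    Recurrence="0 7 * * 1-5",
    MinSize=0, MaxSize=10, DesiredCapacity=10,
)
```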

Now, some of you have probably worked a lot with scaling already. If you have, you've probably run into workloads where customer demand is really high and really spiky, and scaling isn't quite fast enough to keep up, so your customers have a poor experience when a surge of demand comes in. Or perhaps you work with instances that take a long time to reach the in-service state, say 5 or 10 minutes; maybe they're slow to boot, or maybe they require a lot of configuration or initialization. If you find yourself with those workloads, what's the usual solution? Just over-provision your scaling a little bit, right? That's a patch. It works, but it's costly.

It's also a bit wasteful. So if you find yourself in that kind of situation, I'd recommend taking a look at warm pools. Warm pools are a set of pre-initialized instances that sit alongside your Auto Scaling group. They're already started, booted up, and configured, but you can put them in a stopped or hibernated state, so you're not paying for the underlying EC2 instance, just perhaps an EBS volume and an ENI. When your workload needs to scale out, it pulls from the warm pool first, so your workload can potentially scale up in seconds instead of minutes. That gives you faster scaling, saves you money, and reduces waste.
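A minimal sketch of attaching a warm pool of stopped instances to an existing group; the group name and size are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep five pre-initialized instances stopped alongside the group; you pay
# for EBS volumes and ENIs, not the instances, and scale-out pulls from
# the pool first.
autoscaling.put_warm_pool(
    AutoScalingGroupName="slow-boot-fleet",  # hypothetical group
    PoolState="Stopped",  # "Hibernated" is another option
    MinSize=5,
)
```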

Second optimization lever: maintaining capacity. Say you've got an unhealthy EC2 instance, or you've got a workload on Spot instances and one's been terminated. Your Auto Scaling group is going to go ahead and replace that instance. But what does it replace it with? How does it decide what instance to launch, and how can it do that in a cost-effective and sustainable way? The solution here is to focus on allocation strategies.

For your on-demand workloads, lowest price is a great place to start; that launches the lowest-priced instance available. But we have a lot of instances that are price performant: they might not be the lowest-priced instance, but their performance is so much greater for your workload that you can use fewer of them to achieve the same goal, saving you money in the end. Lowest price won't pick that up. If you have workloads that fit that criterion, look at using a prioritized allocation strategy.

For Spot instances, we have a variety of allocation strategies for very specific use cases, but in general, the price-capacity-optimized allocation strategy is the recommended one. Why not lowest price? Lowest price does indeed launch you into the lowest-priced Spot capacity pool, but it doesn't take into consideration the depth or shallowness of that specific pool. It could be extremely shallow, and shallow pools tend to have very high interruption rates. High interruption rates can cause churn in your workload and can actually end up costing you more money.

At the other end of the spectrum, capacity-optimized launches you into the deepest capacity pools, which lowers your interruption rate and gives you a better experience, but it doesn't take the price of the instance into consideration. Price-capacity-optimized is sort of the best of both worlds: it attempts to balance lower interruption rates with a lower price.
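Putting the purchase-option mix and allocation strategies together, here's a hedged sketch of creating an Auto Scaling group; every name, subnet, and number here is a placeholder:

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="mixed-fleet",               # hypothetical group
    MinSize=2,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222",  # placeholder subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-flexible-workload",  # hypothetical
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": "m5.xlarge"},
                {"InstanceType": "m5a.xlarge"},
                {"InstanceType": "m6i.xlarge"},
            ],
        },
        "InstancesDistribution": {
            # A small always-on-demand base, then 25% on-demand above it.
            "OnDemandBaseCapacity": 2,
            "OnDemandPercentageAboveBaseCapacity": 25,
            "OnDemandAllocationStrategy": "lowest-price",
            # The recommended default for Spot, as discussed above.
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    },
)
```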

Now, AWS is constantly innovating, rolling out new instance generations and new instance types. I'm sure you've seen the announcements all week; we've had a handful of them come out, and I believe we're over 750 instance types now. Many of those are, again, more price performant. So how can your workload take advantage of those newer instances as soon as they roll out, rather than having to manually specify a set of instances, keep watch, and go back and spend those engineering resources? You can leverage attribute-based instance type selection. With attribute-based instance type selection, you specify the attributes your workload needs, for example a minimum vCPU count and a minimum amount of memory, and then all the instance types that fit those attributes become available to your workload. This future-proofs your workload: when we roll out new instances that happen to fit those criteria, you don't even have to know about them; they're automatically available to your workload. That could be newer instances like the AWS Graviton ones we're rolling out, or, somewhat recently, the M7i-flex instances.
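A sketch of previewing which instance types a set of attributes would match, using the EC2 API that backs attribute-based selection; the requirements shown are placeholders, and in an Auto Scaling group you'd put the same InstanceRequirements block in a launch template override instead of listing types by name:

```python
import boto3

ec2 = boto3.client("ec2")

# Preview the instance types matching a set of attributes.
resp = ec2.get_instance_types_from_instance_requirements(
    ArchitectureTypes=["x86_64", "arm64"],
    VirtualizationTypes=["hvm"],
    InstanceRequirements={
        "VCpuCount": {"Min": 4, "Max": 8},  # placeholder requirements
        "MemoryMiB": {"Min": 16384},
    },
)
print([t["InstanceType"] for t in resp["InstanceTypes"]])
```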

The third optimization lever is updating our infrastructure. We all know our workloads are not static. We're constantly rolling out updates: security updates, feature releases, a new Amazon Machine Image, a new launch template version, a user data script, or just an update to our code. A common pattern is a blue/green deployment. This is where we launch a whole separate set of instances, doubling up our infrastructure, deploy our new configuration to that new infrastructure, and monitor it for a while; if there's any problem, we can safely fall back to the previous configuration. This works, but it requires significant resources and isn't very cost efficient. A rolling deployment is generally much faster and more resource optimized, but it has its own challenges: since you're replacing instances as you go along, how do you monitor the deployment, and if you catch an issue, how do you safely fall back to your previous configuration? Auto Scaling groups' instance refresh is a way to programmatically automate your rolling deployments, and you can configure your instance refresh to monitor that deployment and automatically roll back if problems are detected.
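A sketch of kicking off an instance refresh with automatic rollback; the group name and alarm are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Roll the group onto its latest launch template configuration, keeping
# 90% of capacity healthy, and roll back automatically if the alarm fires.
autoscaling.start_instance_refresh(
    AutoScalingGroupName="web-fleet",  # hypothetical group
    Preferences={
        "MinHealthyPercentage": 90,
        "InstanceWarmup": 120,   # seconds before a new instance counts as ready
        "AutoRollback": True,
        "AlarmSpecification": {"Alarms": ["high-5xx-rate"]},  # hypothetical alarm
    },
)
```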

What if you're running containerized or analytics workloads? A lot of these optimization levers and best practices map onto those workloads as well. For instance, if you're running containers on Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS) with Cluster Autoscaler, you're using Auto Scaling groups under the hood, so you can use instance refresh for updating your infrastructure. If you're using Amazon ECS, you can use attribute-based instance type selection for future-proofing. If you're using Amazon EKS with Cluster Autoscaler, you can still do something similar by leveraging the EC2 instance selector for your node groups. If you use Amazon EKS with Karpenter, you're using the EC2 Fleet API under the hood, and you can do something similar by opening up your NodePool (what used to be called a Provisioner) as broadly as possible. Karpenter also has extra cost optimization benefits, such as consolidation, where Karpenter actively and continually monitors your cluster and shifts pods around to save you money.

If you're running long-lived clusters on Amazon EMR (Elastic MapReduce), you should look to leverage managed scaling so you can scale down your resources when they're not needed.
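A sketch of attaching a managed scaling policy to an existing EMR cluster; the cluster ID and capacity limits are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Let EMR add and, importantly, remove capacity as the cluster load changes.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,   # how far to scale in when idle
            "MaximumCapacityUnits": 50,
        }
    },
)
```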

Resource optimization is also a continuous process, and that comes with its own challenges. So I want to talk about three different tools that can help you with that.

The first is AWS Compute Optimizer. We all know the benefits of right sizing: if you right-size your instances for your workloads, you save money, you're not over-provisioning, and you're not wasting resources. The problem is that it can require a significant amount of engineering resources just to spend the time actually right-sizing your instances, and your workloads aren't static. We're constantly changing our applications, and AWS is constantly rolling out new instances, some of which will be more price performant for your particular workload. So yesterday's right-sized instance might not be today's. AWS Compute Optimizer uses machine learning to make continuous right-sizing recommendations on your behalf. It can provide recommendations for compute, memory, networking, and disk I/O; for memory, you'll need to use the agent, but it's absolutely worth it.

If your organization is serious about sustainability, one way to make sure you're actually making progress toward your goals is to measure where you're at, which again can be difficult. The AWS Customer Carbon Footprint Tool lets you see the carbon emissions generated by your AWS workloads. You can see your historical footprint as well as a forecast into the future.

And finally, if you want insight into how well you're using some of these best practices in your organization, we've got the EC2 Flexibility Score dashboard. This is an open-source tool you can deploy in your account that gives you score-based metrics on various attributes, such as your instance flexibility (how instance-diverse your various workloads are) and your scaling score (roughly, how much you're using the elasticity of the cloud). So you can see where you stand, and you can go nudge the teams that need a little extra incentive to adopt some of these best practices.
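To make the first of those tools concrete, a sketch of pulling Compute Optimizer's right-sizing recommendations with boto3, assuming Compute Optimizer is already opted in for the account:

```python
import boto3

co = boto3.client("compute-optimizer")

# List right-sizing findings for EC2 instances in the account.
resp = co.get_ec2_instance_recommendations()
for rec in resp["instanceRecommendations"]:
    current = rec["currentInstanceType"]
    top = rec["recommendationOptions"][0]["instanceType"]
    print(f"{rec['instanceArn']}: {rec['finding']}, {current} -> {top}")
```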

All right, we've covered a lot. We've come a long way in our journey to climb that sustainable compute mountain, and I'm dangerously flirting with belaboring that metaphor, but I want to put it all together. Jackson started by explaining how easy it is to get started: simply by shifting your workloads onto AWS, you immediately gain benefits as we progress toward our goal of powering our operations with 100% renewable energy by 2025. He talked about leveraging EC2 Spot Instances for significant savings on your fault-tolerant and flexible workloads, and about how you can migrate your workloads from x86 to AWS Graviton for savings and sustainability benefits. Sri demonstrated Adobe's real-world impact and savings in performing that migration in the cloud. And I finished up by talking about how you can leverage the elasticity of the cloud through three different optimization levers for your workloads, and then a little bit about continuous improvement. All of that put together is what's going to help get you to the top of that sustainable compute mountain.

I can feel the energy; I can see you're dying to run out and get started. So here are your next steps. Check out the EC2 Spot workshops at ec2spotworkshops.com and get familiar with Spot Instances. Maybe you're a little timid about Spot, worried about putting production workloads on it; the EC2 Spot workshops will help you get more comfortable and help you decide what kinds of workloads would be a good fit for running on Spot. Then check out the AWS Graviton Getting Started guide and the Porting Advisor, which will help you learn the best place to start migrating workloads and where the biggest impact will be. And finally, you can deploy that EC2 Flexibility Score dashboard into your account, see where you stand with flexible compute best practices, and see where you might be able to make some progress.

So that's it. Thank you very much. I hope you enjoyed this talk, and a quick reminder to take the survey in the app.
