Smart savings: Amazon EC2 cost-optimization strategies

Alright, I think that's my cue. Thank you all for joining me this morning. I'm sure we'll have a few more trickle in, for the people who got to the buses late coming over from the keynote.

In case you're wondering if you're in the right location, hopefully you are: you're here to learn about smart savings, Amazon EC2 cost-optimization strategies. This is a 200-level session, so it's meant to be pretty accessible. We're not going to go deep into architectures or code samples or anything like that. But there is some assumed knowledge, right? We assume you know things like EC2 as a general concept, and we're going to go beyond that into how you can apply some best practices to achieve smart savings.

Oops, I meant to introduce myself before I go on. My name's Boyd McGee. I'm the director of go-to-market for our networking and compute services. I've worked at AWS now for about 12 years in total, and I've been working in the Spot business specifically for about the last eight or nine years, helping customers maximize savings.

For those of you who aren't aware, Spot Instances, which we will talk about, have helped customers save over $10 billion versus the cost of On-Demand since 2013. So, very significant savings. Hopefully some of you are already taking advantage of that; for those of you who aren't, hopefully you'll know the way to do it by the end of this presentation to get yourself started.

A few things: I was told not to just talk about work, but to introduce myself and tell you something about me. I actually performed in the Olympics. And don't look like that, I wasn't competitive. I was in the opening ceremony of the Sydney Olympics way back in 2000. So that's a fun tidbit if you want to come and ask me about it, maybe a way to get to know me a little bit better.

Alright, so let's jump into the agenda for today. First we'll talk about what a smart saving actually is, which is pretty important for making sure that we're thinking about optimizing in a way that sets you up for long-term success. Then we'll talk about the AWS services and features. We'll start with some of the simpler aspects of EC2 cost optimization, then move into some of the more advanced aspects where we're seeing customers continually reap additional savings and efficiency improvements throughout their compute platform, and to some level their broader AWS use overall. How do you think about making sure that you are driving smart savings? We'll talk about that as well. And then of course we'll share a little bit about how customers are really maximizing savings in 2023, and where I expect, or really believe with a high level of confidence, they're going to continue to reap those benefits and see improved efficiency through 2024 as well.

Alright, so let's jump into it. What do I mean by smart savings? For those of you who don't know the reference, think of the inspectors who were always telling Sherlock Holmes he was wrong. I was told not everyone knows this reference, so here's the point: not all savings are good savings. So what are some examples of bad savings? I saw a lot of these from the beginning of the pandemic through to the worsening economic outlook at the end of 2022. We did see customers make some decisions that I would call suboptimal for long-term success, even though they do drive short-term savings, right?

So a bad saving is one that limits your ability to innovate in the future on behalf of your customers, restricts your business's ability to respond to changing customer demand, and, of course, can impact your end customers negatively. So what do I mean by that? Well, first and foremost, limiting your ability to innovate in the future. One of the worst examples we saw, really early on at the end of 2022 when customers were looking to optimize savings very quickly: some customers went from completely non-optimized environments and said, you know what, let's just delete a bunch of data in S3, let's just delete a bunch of extra data in our data warehouse, let's just get rid of it. And that does drive immediate savings in the cloud; if you delete that storage, you stop paying and reduce your costs. But then they started thinking about, well, what am I going to do with my generative AI strategy? Was some of that data actually useful to my generative AI strategy, or my ML/AI strategy in general, my data lake strategy, right?

So maybe not the smartest way to save; some better ways are to tier storage or take advantage of some new offerings that have come out more recently. Next, restricting the business's ability to respond to customer demand. The other thing I saw really hurt, because we spent a lot of time helping customers move from on premises into the cloud and then modernize. But the reality is, let's be honest, in that process of modernization you're often double-paying for at least a little while, right? You're building the new, modern, maybe microservices-based, containerized platform that's going to enable your business, and your developers, to move even quicker. But it does cost you money in the immediate term, because you still have to run your old system, right?

So we actually saw some customers choose to pause or even end modernization strategies. And then again, how do you respond to changing customer demands, changing environments, and things like generative AI, which you're probably already sick of hearing about after the two and a half hours this morning? It's obviously something a lot of businesses are thinking about, and pausing modernization restricted their ability to pursue it. And the final one: some customers stopped a migration into the cloud. That's the same idea: while you're in that migration period, you've got to continue operating your data centers, because you haven't gotten rid of them yet, and as you move into the cloud you're still building out your application, so you're paying some amount for your cloud services during that process. You can stop doing that, and we saw some customers stop, and stop buying hardware on premises as well. And what does that end up meaning? Well, in most scenarios, it means that when your customers really need your services, when the demand comes in, you're going to provide a worse experience. So your customers maybe don't end up purchasing, or if they do purchase, maybe it was such a bad experience that they're not coming back. These are what I mean by bad savings: they do save you money, but they don't really set you up for long-term success.

So we want to avoid these, and the strategies we talk about today for the most part completely avoid these pitfalls. Good savings are obviously the inverse: they enable agility through flexibility, and they enable rapid innovation on a lean cost structure. This is something I talk about with customers all the time when it comes to Spot. If you can run your development environment on Spot, all of a sudden not only have you reduced the cost of your existing development environment, you've reduced the cost to innovate and do more things on behalf of your customers, to develop new features, functionality, whatever it may be. And of course, good savings provide customers with equal or better performance while you're optimizing.

I did ask a few people at the beginning, and sadly only one person said Graviton was one of the things they were most excited about in the keynote announcements. But that's a great example, right? We saw customers able to migrate to Graviton, improve their price performance by up to 40%, and actually get lower latency and better customer experiences out of their platform. So that's what I mean by smart savings.

Alright, now that we've gotten through that, I just want to quickly talk about the overall agenda. These are the three simple steps we'll cover at the beginning, before we get into the more advanced things: we'll start with scaling, then we'll talk about right-sizing, and then we'll talk about optimizing those purchase options. And what you're actually seeing here is anonymized, but it's a real-world customer's experience as they went through the optimization strategy I'm going to lay out for you: their cost to serve users. That's why that function goes down. Their user base went up, but their cost per user went down dramatically, as you see through these optimizations.

So I have to start with the first thing. Now, I imagine a lot of you are AWS customers today, but I assume there's a few in the audience, maybe a show of hands, that still have some things operating in a data center, on premises, that type of thing. OK, a significant portion still. So I have to do the first thing first: my first recommendation for optimizing is moving that into the cloud. OK?

Most of you are going to be well aware of these concepts, but just to make sure we're all starting from the same point: when you're in a traditional environment, you have to purchase hardware, and you have to try to predict, ahead of your customer demand, how much hardware you're going to need. And nobody is perfect at doing that.

So this is an example where you're purchasing maybe every three months in block purchases, maybe expanding your data centers every six months, and you're trying to forecast your actual usage and keep that data center footprint as small as possible while staying ahead of it. But again, when you're doing purchases every three, six, or 12 months, there's just no way your forecasts are going to be as accurate as you might like them to be.

And so the first thing to point out: when you're over-provisioned, you're wasting money. When you're under-provisioned, your customers are getting that bad experience we talked about earlier. Maybe the website's not loading, or it's running really slowly; it's taking me 30 seconds to make a purchase, which nobody puts up with anymore. Unless you're in Australia. If you're Australian, you might put up with it, because the internet still sucks. I don't know if I should say that; might have to edit that out.

Alright, so now, obviously, the whole thing you're going to see when you move to AWS, the main thing, is the Elastic Compute Cloud: you're meant to match that capacity with exactly what your customers need at any point in time. Because when you're leveraging the AWS cloud and EC2, our scale is so significant that for many customers it appears effectively infinite, right? We have that elasticity available for you to scale up, and to do it just in time.

OK, so first things first, we would really, really suggest moving your infrastructure into the cloud. This is a study from back in February 2022 that we did with The Hackett Group, and I just wanted to point out the 47% reduction in IT costs. That's great; there are a lot of things that go into that. But the stat that really jumped out at me, and the reason I wanted to present this one here, is the 40% reduction in over-provisioned infrastructure. OK, remember that when we start talking about scaling in just a moment.

I also wanted to mention a customer who's gone through this process, who I've worked with directly for many, many years now. I've been very privileged to work closely with Skyscanner. I have been a Skyscanner customer since way back in, I think, at least 2010, when I first started booking international vacations from Australia. It's very expensive to travel from Australia, so anything you can do to get cheaper tickets, you absolutely do, and Skyscanner was my way to do that. So I was thrilled, years later when I started working for AWS, that I got to help them migrate out of their data centers and into AWS. And as you see from this quote, if you've been able to read it already (don't bother if you haven't): it used to take them six to seven weeks to launch a service, and on AWS it can take them 15 minutes. That means Skyscanner was able to not only reduce their operating costs by up to 70%, they were able to increase their agility and scalability. And now there are more features, right? More features that I, and any of you who are Skyscanner customers, get to take advantage of.

And again, even the performance you get out of the platform comes with that scalability: the results you get when you do a search on Skyscanner come through really, really quickly, because they're using all of that elasticity. And they're doing a lot of this on Spot Instances; they're very public about their work with Spot. So they're really able to totally minimize their costs, which, when I'm going to a platform to try to book cheap tickets, is pretty useful to know. They're also doing everything they can to reduce their costs so that they can pass those savings on to me as much as possible.

Alright, now let's get into the stuff for the workloads that are already running on AWS today. OK, so I already mentioned this, but it's called the Elastic Compute Cloud. We call it EC2, and I'm sure not many of you say Elastic Compute Cloud day to day, but I feel like it's really easy to forget Elastic. It is one of the core premises of EC2; it's why it's in the name. If you are not using EC2 in an elastic way, you are not using it to the fullest, in the spirit of what it's designed for, of the way it delivers the efficiency and performance you expect. OK, this is a similar type of graph to the one we saw before: we want to avoid waste, and we want to meet peak business demand with great performance at the lowest possible cost. So we want to match those instances right up to the demand curve in the middle.

Now, I'm sure a lot of you have looked at these concepts before and thought about how you can make those applications actually scale. So I wanted to highlight a few of the tools that customers use, and some of the newer ones that maybe you're less aware of if you've been using AWS for many, many years, when it comes to Auto Scaling groups. Because, if you're not aware of it, we actually have multiple different scaling policy options for you. If you're a longtime EC2 customer, a longtime AWS user, you are almost certainly aware of simple or step scaling. As the name suggests, it is pretty simple: you set metrics. You might say, hey, my application starts performing really poorly at 60% utilization, so I'm going to set my scaling metric to 45%; when CPU gets to 45, add a couple of extra nodes, because I never want to get to 60 and give my customers that bad experience. It's a super popular, very commonly used tool for scaling. But it is the original; if this is what you're using and you haven't explored any of the other three, that's something you should check out. We'll talk about predictive scaling in particular.
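To make the step scaling idea concrete, here's a minimal sketch of what such a policy might look like. The ASG name, thresholds, and adjustment sizes are all hypothetical; the dict mirrors the parameters you would hand to the EC2 Auto Scaling `put_scaling_policy` API (for example via boto3), where the step intervals are offsets from the CloudWatch alarm threshold.

```python
def step_scaling_policy(asg_name, add_small=2, add_large=4):
    """Step scaling sketch: a CloudWatch alarm fires at 45% CPU
    (set below the 60% point where the app degrades), and each
    step adds capacity relative to that alarm threshold."""
    return {
        "AutoScalingGroupName": asg_name,   # hypothetical ASG name
        "PolicyName": "cpu-step-scale-out",
        "PolicyType": "StepScaling",
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [
            # 45-55% CPU: add a couple of nodes
            {"MetricIntervalLowerBound": 0.0,
             "MetricIntervalUpperBound": 10.0,
             "ScalingAdjustment": add_small},
            # 55%+ CPU: add more, so we never reach 60%
            {"MetricIntervalLowerBound": 10.0,
             "ScalingAdjustment": add_large},
        ],
    }

policy = step_scaling_policy("web-asg")
print(policy["PolicyType"])  # StepScaling
```

In practice you would pass this dict to `boto3.client("autoscaling").put_scaling_policy(**policy)` and attach the CloudWatch alarm to the returned policy ARN.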

So, target tracking: think thermostat-like. Some customers found it difficult to set the right scale-up and scale-down policies, because you obviously want to manage the gap between the two metrics so that you don't end up scaling up, then scaling down, then scaling up again. Target tracking was an easier way for customers to just set a thermostat-like target. Say we want to be at around about 50% utilization; a little bit above or a little bit below is fine, but we don't want it to massively deviate from that. That's what target tracking does: you just set a metric, and we'll automatically add and remove nodes to keep your metric as close to that target as possible. And I'm talking about CPU here, but you obviously don't need to use CPU. In fact, for most applications we don't recommend using CPU. You can scale on any custom metric you want for these services; you just need to feed that custom metric into CloudWatch, and Auto Scaling can pick it up and start scaling on it on your behalf. So you might scale based on user count, connection count, whatever it may be; CPU is just an example, so don't get too fixated on it if that's not really what matters to your application. How many of you are running workloads on AWS today and haven't started scaling at all? Wow. OK. Some honesty, good, good, thank you. I was going to say I've never seen an audience say zero, none at all. OK, maybe this next part is just for the two of you, but everyone else, listen in as well.
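As a sketch of the contrast with step scaling, a target tracking policy needs only one number. Everything below is illustrative: the ASG name and the custom metric name and namespace are made up, and the fields follow the `put_scaling_policy` API's `TargetTrackingConfiguration` shape.

```python
def target_tracking_policy(asg_name, target=50.0):
    """Thermostat-style sketch: keep a custom CloudWatch metric
    hovering around `target`; Auto Scaling adds and removes nodes
    to stay near it. Metric name/namespace here are hypothetical."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "connections-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            # Any metric you publish to CloudWatch works, not just CPU.
            "CustomizedMetricSpecification": {
                "MetricName": "ActiveConnectionsPerInstance",
                "Namespace": "MyApp/Frontend",
                "Statistic": "Average",
            },
            "TargetValue": target,
        },
    }

print(target_tracking_policy("web-asg")["PolicyType"])  # TargetTrackingScaling
```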

So, scheduled scaling. This is a really, really powerful tool to get people over the hump of what it means to scale, because, again, as the name suggests, you can just set a schedule. So I really encourage anyone who hasn't started scaling, or who is mainly scaling a few applications, maybe new ones they've just freshly developed, while their core compute is still costing them a fortune sitting there not scaling at all: use scheduled scaling. Get your development environment, your staging environment, whatever it may be, turning off when your developers go home, and turning back on when your developers come to work. If you turn servers off at night and on the weekends, you're saving over 60 to 65%, because that's just how the numbers work: 9 to 5, five days a week, is not most of the week. So, simple things like just scaling. And there's an added benefit. We see a lot of people are reluctant to move scaling into production because they haven't tested it at all. Just by enforcing this in your development environment and making it turn off at night, you will find issues. It's not going to work perfectly the first day; your developers are going to come in, and some of them are going to be annoyed: my environment's not working, I'm missing some stuff I expected in my memory or wherever it may be. But that's going to train them. In fact, years ago, Yelp presented at re:Invent about how they used exactly that as a reinforcement mechanism. By moving to things like scaling, and things like Spot, in their development environment, it forces their developers to start thinking about what happens when things fail. Because this isn't just about cost optimizing to reduce your AWS bill, right? If you're being efficient, if you're doing smart savings, you also want to ensure that if a server happens to fail during peak load, your customers aren't impacted.
You want those customers to come back. Cost savings, of course, is in the name, it's about the cost, but you can't lose track of the top line as well. If you save money on your costs and it causes your customers to not come back, you are not really saving anything, OK?
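The "over 60 to 65%" figure above is just arithmetic on the hours in a week. Here's a quick sketch; the 12-hour weekday window is an assumed schedule with some buffer around 9-to-5.

```python
HOURS_PER_WEEK = 24 * 7  # 168

def weekly_savings(on_hours_per_day, days_per_week):
    """Fraction of the compute bill saved by switching an
    always-on environment to a weekday-only schedule."""
    running = on_hours_per_day * days_per_week
    return 1 - running / HOURS_PER_WEEK

# Strict 9-to-5, five days a week: 40 of 168 hours running
print(round(weekly_savings(8, 5), 2))   # 0.76
# A 12-hour weekday window (buffer around the workday)
print(round(weekly_savings(12, 5), 2))  # 0.64
```

The scheduled scaling action itself is just a cron-style recurrence on the Auto Scaling group: one action that sets desired capacity to zero in the evening, and another that brings it back up before the workday starts.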

So scheduled scaling: simple, and a great way to push and challenge your team. Particularly for those of you who might be in a central platform, infrastructure, or cloud team, with teams everywhere saying, oh no, we can't scale, I can't do that; you know how it is, developers don't want to have to put in this work. Dev environments: nights and weekends, it's off. Very simple.

Alright, now we'll spend a little time on predictive scaling. Predictive scaling, again, as its name suggests (these names are actually pretty clear for us): I love it, because it does what the spirit of Auto Scaling always was to me. You put your application inside an Auto Scaling group, we learn from the traffic patterns and the metrics inside that Auto Scaling group on those instances, and then, where traffic is predictable, we will predict it and scale up on your behalf ahead of time. Now, this has a couple of really awesome benefits. Number one, we talked earlier about step scaling: you have to set your trigger at 45% to make sure you don't go over 60%. With predictive scaling, we can scale just in time. Based on your historical loads, we scale just in time so that you don't have hardware sitting there idle for 10 or 15 minutes waiting for that traffic to come in. Equally, with step scaling, you might trigger at 45%, it goes to 46%, then it goes back down to 40%. Well, you just added a server for no reason; you weren't going to hit 60. Predictive scaling uses those predictions, so it scales intelligently. Of course, this can be paired with other scaling options, because predictive scaling doesn't scale you down; you have to make your own scale-down decision. So a lot of people just set up simple step scaling policies to scale down.

Also, as the name sort of suggests, if traffic is unpredictable, if you've got a big event that we've never seen before, you need to manage your own scaling policies for that, because we're not prescient. We can't know things that we don't know; we can only use that historical data. And the final extra point that I really like to make here: it does work within your boundaries. Predictive scaling can be a scary concept. What if it just scales to infinity, my bill goes to infinity, and my customers don't come? You can set boundaries. Just like every Auto Scaling group, you can set caps, limits, maxes. You don't need to let the thing scale to infinity; you can set those reasonable caps and control it yourself. OK?
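As a sketch, a predictive scaling policy with those boundary controls might look like the following. The ASG name and numbers are hypothetical; the structure follows the `put_scaling_policy` API's `PredictiveScalingConfiguration`.

```python
def predictive_scaling_policy(asg_name, target=50.0, buffer_pct=10):
    """Predictive scaling sketch: forecast load from history and
    scale out just in time, while staying inside the group's max
    capacity (plus a small buffer), so nothing runs off to infinity."""
    return {
        "AutoScalingGroupName": asg_name,   # hypothetical name
        "PolicyName": "cpu-predictive",
        "PolicyType": "PredictiveScaling",
        "PredictiveScalingConfiguration": {
            "MetricSpecifications": [{
                "TargetValue": target,
                "PredefinedMetricPairSpecification": {
                    "PredefinedMetricType": "ASGCPUUtilization"
                },
            }],
            "Mode": "ForecastAndScale",   # "ForecastOnly" to trial it first
            "SchedulingBufferTime": 300,  # launch 5 min ahead of the forecast
            # Boundary controls: how far past max capacity forecasts may go
            "MaxCapacityBreachBehavior": "IncreaseMaxCapacity",
            "MaxCapacityBuffer": buffer_pct,
        },
    }

print(predictive_scaling_policy("web-asg")["PolicyType"])  # PredictiveScaling
```

A common pattern, per the talk, is to pair this with a simple step scaling policy that handles the scale-down side.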

One final thing, and I don't want to bully the developers up here for too long, but again, we hear from some people: hey, it takes too long for an application to get started. We see this a huge amount with traditional enterprise applications. Things running on a Windows operating system often have significant, very long bootstrapping processes that can take tens or twenties of minutes; we've even seen customers take over an hour from when we've started an instance and given it to them to the point where it can actually take traffic and deliver some value on your behalf. So if you have that challenge today, and you are not scaling because it takes too long to instantiate and get an instance up and running, or if you're managing your own warm pools, effectively scaling at 10% so that you can be ready at 60%, massively over-scaling, go check out Warm Pools. What we do is instantiate the instance and then put it into a stopped state on your behalf. The Auto Scaling group is aware of that: it knows it has these instances in a stopped state that are going to be much faster to get going when we take them out of that state. And when an instance is in a stopped state, you do not pay for the instance; you're just paying for the EBS storage that's got that instantiation sitting on it. So it is much more cost-effective than running your own warm pools, and almost certainly much more cost-effective than just leaving instances sitting there running 24/7. This slide gives you, again, a real-world example from a customer. You can see the time it took them to get an instance up and running to handle traffic before warm pools; I don't know if you can actually squint and see it, but in the top graph, the gap between when the orange line peaks and when the red line peaks is 10 minutes. Once they've turned on warm pools, you see it takes them less than a minute to get that server up and running in production for their application. So, immediately, you're saving nine minutes of CPU; when you pay by the second, that's nine minutes of savings. That's awesome. But not only that: imagine if this spike was unpredictable. Now, all of a sudden, in one minute you can catch up and deliver the performance your customers expect, instead of waiting 10 minutes and, again, ending up with frustrated customers who aren't very happy.

So we talked about the common use cases: spiky workloads and long instantiation times. And it's a great alternative to manual scaling, which a lot of customers fall back on today when they have these really long, slow instantiation times.
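As a sketch of how a warm pool is configured: the ASG name and sizes below are hypothetical, and the field names follow the EC2 Auto Scaling `put_warm_pool` API.

```python
def warm_pool_request(asg_name, prepared=10):
    """Warm pool sketch: pre-initialize instances and park them
    Stopped, so scale-out skips the slow bootstrap. While Stopped,
    you pay only for the EBS volumes, not the instances."""
    return {
        "AutoScalingGroupName": asg_name,     # hypothetical name
        "PoolState": "Stopped",               # the cost-saving state
        "MaxGroupPreparedCapacity": prepared, # total warm + running cap
        "MinSize": 2,                         # always keep two warm and ready
        # Put scaled-in instances back in the pool instead of terminating
        "InstanceReusePolicy": {"ReuseOnScaleIn": True},
    }

print(warm_pool_request("web-asg")["PoolState"])  # Stopped
```

You would pass this dict to `boto3.client("autoscaling").put_warm_pool(**request)` against an existing Auto Scaling group.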

Alright, as we move through this: we've gone through auto scaling, now let's talk a little bit about right-sizing. You could put either of these in either order, but this is the order I recommend exploring the options in: think about scaling and right-sizing before you even look at purchase options, for the most part. That's why purchase options are coming third here. So, what do we mean by sizing it right? Well, I was just told, and I updated these slides: we now have over 750 different instance types available. That's a lot. It can be overwhelming, and again, developers might say, I don't want to learn all that. I don't want to spend all that time learning 750-plus different instance types just to work out the best instance for me. Good news:

We actually have a system for that now. We have this thing called attribute-based instance selection. It was launched a year or two ago, and it does what the name suggests: you put in specs, CPU, memory, network, instance generation, different ratios that might matter for your application, and then you say: now go get me the cheapest instance available.

And what I really love about this, obviously, is that it gives you that flexibility. Your developers, particularly those of you in the containers world, are already used to defining the specs a container needs: how much CPU, how much memory. So applying that model with attribute-based instance selection is pretty natural. They don't need to learn those 750 different instance types; they can just let the system select them.
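A sketch of what an attribute-based request might contain: the numbers are hypothetical, and the field names follow the `InstanceRequirements` structure used in launch template overrides and EC2 Fleet requests, typically paired with a lowest-price allocation strategy.

```python
def instance_requirements(min_vcpu=4, max_vcpu=8, min_mem_gib=16):
    """Attribute-based instance selection sketch: describe what the
    workload needs (much like a container spec) instead of naming one
    of 750+ instance types; EC2 then picks the cheapest match."""
    return {
        "VCpuCount": {"Min": min_vcpu, "Max": max_vcpu},
        "MemoryMiB": {"Min": min_mem_gib * 1024},  # API takes MiB
        "InstanceGenerations": ["current"],        # latest generations only
    }

reqs = instance_requirements()
print(reqs["MemoryMiB"]["Min"])  # 16384
```

This is how a newer, cheaper type that meets the spec can start being launched automatically, as in the M7i-flex story that follows.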

But the thing I really like: we launched the M7i-flex earlier this year. I can't remember exactly when now, but we launched it earlier this year, and we quickly saw a very large user of EC2 become an enormous user of M7i-flex. We thought, wow, that's fantastic, you've already migrated. M7i-flex provides 19% better price performance over M6i, so almost a 20% price performance improvement from this migration. And it turned out the customer didn't even know they'd made the migration.

Now, that might sound scary, but they had their system set up so that it wasn't scary for them. They could handle it; they were ready if anything went wrong. The system just automatically noticed: hey, your specs actually match this new instance type, it meets or exceeds your requirements, and it's cheaper than the one we were launching yesterday, so I'm just going to start launching it.

So it's a really easy way to automatically keep on top of the absolute best instances AWS is offering for your application, without having to keep track of 750-plus different instance types.

The other thing I want to point out: using attribute-based instance selection can be scary initially. I would also say that even though developers might know what they put in their containers, they might not really know what their CPU and memory needs are. Generally, when you ask a developer what size server they need, they'll say the biggest one.

So after you've first done that launch, after you've started making these moves, don't worry about having to make the initial specs perfect, because we offer AWS Compute Optimizer for free. Since the launch of this thing, we've sent out over 20 billion recommendations. And by the way, we don't send out recommendations if there's nothing to recommend. So that's 20 billion-plus times we have seen an opportunity for an AWS EC2 customer to select an instance that is cheaper and delivers the specs the application actually needs. With AWS Compute Optimizer, we're monitoring your real-world workloads; we're looking at what's running on the instance. By default we're just looking at some low-level metrics like CPU, but you can opt in and get higher-level metrics like memory on top of this, so we can have really deep insights and make very intelligent recommendations on the instances for you to use.

And the reason I bring this up here, in the right-sizing section: with EC2 you're paying by the second, and with AWS Compute Optimizer you're getting recommendations on a regular basis, assuming you're not perfectly optimized. And none of you are, because there's not a single account inside of AWS that's perfectly optimized. So with AWS Compute Optimizer, we're going to send you those recommendations, and you can respond to them whenever it makes sense. Do you change the specs in attribute-based instance selection, or do you just change the instance that's running? Whatever it is, you can make that call and manage it yourself.
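To show the shape of acting on those recommendations, here's a sketch that filters a Compute Optimizer response down to suggested moves. The `sample` data is entirely made up; in practice you'd get the response from the `get_ec2_instance_recommendations` API (for example via boto3), and the finding strings are compared case-insensitively as a precaution.

```python
def top_downsizing_moves(recommendations):
    """Pull the rank-1 option for each instance Compute Optimizer
    flags as over-provisioned. The response shape mirrors the
    get_ec2_instance_recommendations API; data below is hypothetical."""
    moves = []
    for rec in recommendations.get("instanceRecommendations", []):
        if rec["finding"].lower() != "overprovisioned":
            continue  # already optimized, or under-provisioned
        best = min(rec["recommendationOptions"], key=lambda o: o["rank"])
        moves.append((rec["currentInstanceType"], best["instanceType"]))
    return moves

# Hypothetical sample of what the API might return
sample = {"instanceRecommendations": [
    {"currentInstanceType": "m5.4xlarge", "finding": "Overprovisioned",
     "recommendationOptions": [
         {"instanceType": "m5.2xlarge", "rank": 1},
         {"instanceType": "m6i.2xlarge", "rank": 2}]},
    {"currentInstanceType": "c5.large", "finding": "Optimized",
     "recommendationOptions": [{"instanceType": "c5.large", "rank": 1}]},
]}
print(top_downsizing_moves(sample))  # [('m5.4xlarge', 'm5.2xlarge')]
```

Running something like this on a monthly cadence fits the review rhythm suggested next.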

But I encourage you to come back to this at least on a monthly basis, because if you're not, you're almost certainly missing out on additional opportunities. This isn't just about misfiring the specs initially. Hopefully you're still developing and deploying new applications, new features, new functionality, new performance optimizations, whatever it may be, and the best instances for you are going to change.

So: AWS Compute Optimizer. Check it out on a semi-regular basis; it's just going to automate the process of right-sizing. It's an important step, but the reason I put it second is that you can do this naturally over time with these tools. You don't need to front-load a lot of this work to think about it the right way.

Alright, the third basic, common thing that everyone should be doing is the compute purchase options. I imagine all of you have seen this slide, so I'm not going to spend much time on it. We have On-Demand, we have Savings Plans, we have Spot.

On-Demand: a great way to get started. Pay by the second, and when you don't need it anymore, give it back to us. Then there are Savings Plans, which I imagine a lot of you are aware of; they're a super popular option these days. With Savings Plans, there are a few things to make sure you're aware of, because I still see even very big Savings Plans customers living in a Reserved Instance mindset if they've been longtime users.

Compute Savings Plans is the most popular type of Savings Plan, and it's the most popular because we really built it based on customer feedback from the Reserved Instance offering: customers wanted greater flexibility, and they wanted greater automation in applying their savings globally. With Compute Savings Plans, we offer discounts of up to 66% off the On-Demand price. But the really cool thing, the flexibility I'm talking about, is that we will automatically apply it across any instance in any Region. I've got to read the list, actually, it's so long: instance family, instance size, tenancy, operating system, or compute service option.

So effectively, with Compute Savings Plans you should think about it as committing to spend a certain amount on compute services anywhere in the world, on any instance type, any instance family, for the next one to three years, whatever it may be that you're signing up for. And on top of that — and this is the thing I really wanna push — if you're not ready to scale, if you're not ready to right-size, go buy some Compute Savings Plans immediately. It is great for 24/7 workloads, and if you're not scaling, you currently have a 24/7 workload. If your team comes and tells you right now that in six months we'll be scaling, I guarantee you it'll be 18 months. Wait, I can't say guarantee, I take that back. I don't guarantee it'll be 18 months. But, you know, it's not actually gonna be six months.

So time and time again, I see customers make the decision to defer these commitment products — where you can commit and save a fortune — because they think, we're just about to do this, we're just about to do that. Well, if you're not ready, go buy them. And if you are just about to move to serverless, to Fargate or Lambda, this actually still applies over there as well, so don't let that be an excuse. Right? Compute Savings Plans: if you migrate your application from EC2 to Fargate or to Lambda, it's gonna automatically go with you. We're gonna, again, just search and try to find the best way to maximize your savings.

If you are very confident you're gonna keep running a specific type of server for one to three years, however, you can save even more money with EC2 Instance Savings Plans. Really good for things like databases, right? If you're self-managing a database on EC2, the Instance Savings Plan is gonna save you up to 72%. So you get additional savings, and it's still flexible, but with a little bit less flexibility: now you're actually committing to a specific instance family in a specific Region for the period of the commitment — 12 months to three years, depending again on which one you select.

But it provides you the absolute deepest possible discounts, right, with Instance Savings Plans. And I'll also just briefly mention the SageMaker Savings Plans as well. The Compute Savings Plans do not apply to SageMaker; SageMaker has its own Savings Plans offering, which provides up to a 64% discount. And again, it's pretty flexible across all of the SageMaker portfolio in how you apply that.

It covers different usage types, instance families and sizes, Regions, that type of thing. So if you haven't, check out Savings Plans. If you're still thinking about Savings Plans in a Reserved Instance mindset — if you're thinking, well, we really need to know the instance family, we really need to know the Region, we might even need to know the size — you're wasting a lot of time. Just go buy some Compute Savings Plans. Work out what your sort of normal hourly spend is, and that's about what the right commitment should be for you.
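As a back-of-the-envelope sketch of that sizing logic — the function name, the 80% coverage default, and the dollar figures are all illustrative assumptions, not AWS pricing; the only number taken from the talk is the up-to-66% discount:

```python
def suggested_hourly_commitment(hourly_on_demand_spend, discount=0.66, coverage=0.8):
    """Rough sizing for a Compute Savings Plans commitment.

    Commitments are expressed in discounted dollars/hour, so the on-demand
    spend you want covered is multiplied by (1 - discount). 'coverage'
    leaves headroom for spend that dips below the steady-state baseline.
    All parameters here are illustrative assumptions.
    """
    covered_spend = hourly_on_demand_spend * coverage
    return round(covered_spend * (1 - discount), 2)

# e.g. a steady $100/hour of on-demand compute, covering 80% of it:
commitment = suggested_hourly_commitment(100.0)
```

The real break-even depends on your actual Savings Plans rates per instance family, which the Cost Explorer recommendations compute for you; this is just the shape of the arithmetic.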

And again, if you're gonna move an application from Singapore to Kuala Lumpur or to Sydney or to somewhere else in the world, it doesn't matter. We'll just chase it and apply the savings automatically with Compute Savings Plans. Hopefully you're convinced; you don't need to spend a lot of time on this next slide. Savings Plans is super duper popular, really flexible — leverage them. If you're still on Reserved Instances and you have a reason for that, I would love to meet you afterwards and hear what that reason is. It's fine to use them; I just don't really know why people do anymore.

Alright, now let's talk about Spot instances. As I mentioned at the beginning, I've been working in the Spot business for about the last — ooh, sorry, I need to say this about Savings Plans first. You know, I mentioned Spot has saved customers $10 billion since 2013. I love Spot. So it sort of hurts me to say this next one: Savings Plans only came out in 2019, and it's already saved customers $30 billion over the price of On-Demand. That's how popular it is. It's really, really popular.

And it is saving customers a ton of money versus the On-Demand price. So let's talk about Spot though, cos that is near and dear to my heart and a great way to maximize savings for elastic workloads. So first: Spot is fundamentally 100% the same infrastructure. Before it is launched, it's like Schrödinger's cat — it is both On-Demand and Spot.

It is just spare EC2 capacity. So it's got the same performance, same security while it's running.

Ok. The thing about Spot is we reserve the right to reclaim that capacity from you with a two minute warning. That's because Spot is truly spare capacity.
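On the instance itself, that two-minute warning surfaces as a JSON document at the `spot/instance-action` path of the instance metadata service. A minimal sketch of the handling logic — the parsing is separated from the HTTP fetch so it can be shown without a live instance, and the `drain_node` shutdown hook is a placeholder I've made up:

```python
import json
from datetime import datetime, timezone

def parse_instance_action(body):
    """Parse the spot/instance-action metadata document, e.g.
    {"action": "terminate", "time": "2025-12-01T08:22:00Z"}.
    Returns (action, deadline) or None when no notice is pending
    (the endpoint returns 404 in that case, modelled here as an
    empty body)."""
    if not body:
        return None
    doc = json.loads(body)
    deadline = datetime.strptime(
        doc["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return doc["action"], deadline

def drain_node():
    # Placeholder: deregister from the load balancer, checkpoint
    # work, hand tasks off to a replacement instance, etc.
    pass

notice = parse_instance_action(
    '{"action": "terminate", "time": "2025-12-01T08:22:00Z"}'
)
if notice:
    drain_node()
```

In a real daemon you'd poll the metadata endpoint every few seconds (with an IMDSv2 token) and call the drain logic as soon as the document appears.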

And so when it comes to running applications on Spot, there's a couple of things that I really just want you to know. If you're gonna take away one thing and try to run on Spot, it'll be the next slide: workloads must be fault tolerant and flexible.

So we'll talk about what flexible really means here when it comes to Spot. The applications that tend to be fault tolerant are those that are either loosely coupled or stateless, or have sort of a retry capability — that's something we see a lot in HPC, big data, and ML applications, where you're gonna run a lot of that on Spot. And if a server happens to go, you just restart it, and you're saving so much money it's worthwhile.

So what do we mean by flexibility? Cos most people don't try Spot if they're not fault tolerant. Most people really understand that if you can't handle a server going away, then Spot is not for you. But what some people sort of misunderstand is the importance of flexibility.

Because when we talk about flexibility, we say there's sort of three or four vectors of flexibility: there's instance flexibility, there's availability zone flexibility, there's region flexibility, and then there's time flexibility.

Now, I don't really consider time flexibility to be a true thing. Generally, if you have an application that is so time flexible that you do not care when it runs, the cheapest way for you to run that workload is just to not run it, cos you don't care when you get the results back.

So I really don't meet many people out there that are truly time flexible, unless you happen to be working at a university and you're willing to wait three to six months for your results because you've got a lot of stuff going on. I really never meet anyone in a business that is truly time flexible.

Similarly, region flexibility is not necessarily the easiest thing to do. For some applications that might need to actually be close to a database, or close to a storage environment — you know, high-performance storage — region flexibility just isn't necessarily possible.

So I'm going to ignore those two; I don't want to talk about them. I don't even really want to talk about availability zone flexibility, because for those of you who are in here, hopefully you're all aware of why you are or are not multi-AZ. Most people know multi-AZ is gonna drive great availability for your applications, but for some workloads it can cost you money, because it does cost money to transfer across availability zones, right?

So you know why you operate a certain way, and so I'm gonna ignore that one as well. If you wanna use Spot, you must be instance flexible and fault tolerant. Ok. Why do you have to be instance flexible?

So I sort of tell this joke — it always lands flat, but I'm gonna try it again. Imagine you were sitting there with your partner in the car next to you, and I say to my wife Claire, hey Claire, can you call the hotel? We need to get a reservation. If she called the hotel and said, hey, is room 304 available? Oh, ok, no worries — and hung up, and turned and told me, hey, room 304's not available — I'd be like, what? What about 404, or 305? Why do we care what room we're in? It's just a hotel. And that's sort of the same thing with EC2, right?

When it comes to Spot, we have so many different instance types — those 750 different instance types that I was talking about. Almost certainly there are multiple different instance types that work really, really well for your application. And while at any point in time, in a specific availability zone in a specific region somewhere in the world, we might not have your absolute number one option when it comes to Spot instances.

It's really, really unlikely we won't have your second-best option. Super unlikely we won't have your third-best option, right? And so that's how customers are able to run consistently, and run a lot of serious workloads — non-time-flexible workloads, workloads customers call production workloads — on Spot instances: by saying, hey, when I'm interrupted, I'm just gonna quickly and automatically replace that Spot instance with another one, right?

And so, if you can see this, we've got three different instance families here: the m6g, the c6i, the c6g. Within each of those families there's like 6 to 10 different sizes, and every single combination of family and size is an independent capacity pool to Spot customers in the short term. Ok.

So that means we might not have the c6i.12xlarge, but we probably have the 16xlarge or the 8xlarge, right? And if you can just run the right number of servers to service your application, it's gonna work really well. Or maybe we have the c6i — c7i, sorry — or the c5, right? So there's lots of different ways to think about this flexibility option.

But I really just want to challenge you: if you're going to use Spot, this isn't optional. It's no more optional than being fault tolerant, in my experience and opinion. You might get lucky for a month, two months, three months running on a single instance type on Spot. But then you have an experience where that one's not available for a week, and you're gonna think, hey, Spot's broken. But it's not — if you'd just checked the hotel room next door, it probably would have been open, right?

And I don't just want you to trust me on that, so I'm going through these slides now. Attribute-based instance selection is gonna automate this process for you. I mentioned that earlier: it's a great way to just optimize general EC2 usage and automatically select the instance type that best meets your specs.

Well, for Spot it can do that, and on top of that, if your number one best instance isn't available, it can automatically fail over to the next one on your behalf as well. So it can just handle that process of finding the Spot capacity on your behalf.
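The idea behind attribute-based instance selection is that you describe the shape you need and EC2 resolves it to concrete instance types. A sketch of what that request looks like — field names follow the EC2 `InstanceRequirements` structure, but the values and helper function are illustrative:

```python
def instance_requirements(min_vcpus, min_memory_gib):
    """Build an InstanceRequirements-style spec: instead of naming
    instance types, state the minimum vCPU and memory shape and let
    EC2 match every current (and future) type that fits."""
    return {
        "VCpuCount": {"Min": min_vcpus},
        "MemoryMiB": {"Min": min_memory_gib * 1024},  # the API takes MiB
    }

spec = instance_requirements(min_vcpus=32, min_memory_gib=64)
```

You'd pass a structure like this in an EC2 Fleet or Auto Scaling group launch configuration in place of an explicit instance-type list.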

And if you want to, you can actually check this out in a tool called Spot placement score, where you can go and test yourself: am I flexible enough to run this application on Spot, yes or no?

We've seen some really fantastic workflows from customers that use the Spot placement score — it comes at no additional cost — where they'll just automate the workflow and say, hey, if we're not getting a score above six or seven, add additional instance types; challenge the developer or the infrastructure manager to add additional instance types to my specs. And these are the results that you'll see. On the left-hand side, we have a customer who's requesting 1,000 vCPUs, with a 32-vCPU minimum per box and 64 gigs of memory.

And we can see, time and time again, in every region that's displayed here, we actually get 9 out of 10 for a Spot placement score. That's pretty good, right? It's pretty much as good as you're ever gonna get with Spot.

However, if you only put in a single instance type and again requested just 1,000 vCPUs — on the left-hand side, you see, sorry, your right-hand side, my left — you see not a single region has a score above three. And that's what Spot placement score is trying to communicate to you: yeah, I know you're gonna be fault tolerant if you're using Spot, but are you flexible enough? And if you're getting ones, twos, and threes, you're not flexible enough to run on Spot. Or you're gonna have to be time flexible — and you know what I said about time flexibility, it doesn't really exist, right? You're gonna get yelled at, right?
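For reference, the example on the slide — 1,000 vCPUs with a 32-vCPU, 64 GiB minimum shape — maps onto the `GetSpotPlacementScores` API roughly like this. This is a sketch: the parameter names follow that API as I understand it, but check the EC2 reference before relying on the exact shape.

```python
def placement_score_request(target_vcpus, min_vcpus, min_memory_gib):
    """Parameters for a GetSpotPlacementScores call: 'can I get this
    much Spot capacity with this instance shape?' The response scores
    each region from 1 to 10."""
    return {
        "TargetCapacity": target_vcpus,
        "TargetCapacityUnitType": "vcpu",
        "InstanceRequirementsWithMetadata": {
            "InstanceRequirements": {
                "VCpuCount": {"Min": min_vcpus},
                "MemoryMiB": {"Min": min_memory_gib * 1024},
            }
        },
    }

params = placement_score_request(1000, 32, 64)
```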

So let's quickly wrap on Spot instances here before we move on to some of the slightly more advanced — or, I guess, continuous — optimization opportunities for customers. With Spot instances, I've talked about this flexibility concept. What a lot of customers say is, hey, which pool has the most capacity? Can you just tell us which one has the most capacity and is really cheap at the moment?

We can't tell you how much capacity we have. We're not gonna tell you that — doesn't matter how many times you ask, we're not going to answer that question. But we built an allocation strategy which we think does what you want us to do on your behalf: if you give us 10 different instance types, we will automatically search across those 10 instance types, find the capacity pools that are deep enough that you're not gonna have a terrible experience, and then pick the cheapest of those, right?

That's why it's called the price-capacity-optimized strategy. If you're getting started with Spot, or if you're already using Spot and you're not using this strategy, I highly recommend jumping on it today. It's just the one that maximizes savings while minimizing interruptions, which is what people are generally trying to do when it comes to Spot.
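Concretely, opting in is a one-line setting on the fleet request. A sketch of an EC2 Fleet-style configuration with several acceptable instance types — the instance types and overall shape are illustrative, and a real request also needs a launch template reference and target capacity:

```python
def spot_fleet_config(instance_types):
    """EC2 Fleet-style config sketch: list every instance type you can
    tolerate, and let the price-capacity-optimized strategy pick the
    cheapest of the pools deep enough to avoid frequent interruptions."""
    return {
        "SpotOptions": {"AllocationStrategy": "price-capacity-optimized"},
        "LaunchTemplateConfigs": [{
            "Overrides": [{"InstanceType": t} for t in instance_types],
        }],
    }

cfg = spot_fleet_config(["c6i.8xlarge", "c7i.8xlarge", "c6g.8xlarge", "m6g.8xlarge"])
```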

Now, the high-level message here is: think about how the balance across these options works within your workload, right? Video streaming is a really obvious one — don't run video streaming on Spot. Terrible idea. I don't want my Zoom call interrupted, right? Nobody does. So don't run things like video conferencing, or even game servers that really require high levels of state and that instantaneous, live, human-to-human communication. Terrible idea. Over on the analytics side, though, maybe you should run most of it on Spot. So have a think about the nature of your applications and those workloads.

And again, I think we have up here — yeah, databases, over on the far left-hand side for you. Well, obviously, databases should just be 100% Savings Plans. Don't even think about Spot or On-Demand, right? As soon as you've gone through that right-sizing process, purchase that.

Ok. So these are the relatively simple things that, if you haven't done them, I really encourage you to go apply immediately. After that — if you have applied all of these and you've kindly stayed with us for the last 30 minutes as we went through them anyway — then I would encourage you to start looking at our more advanced optimization strategies.

And so we know customers, when they're moving to the cloud, are often at different stages of the journey, and so a single tool to sort of service everything has been a very difficult thing to build. And that's why the team went and built the Cloud Intelligence Dashboards. This is a solution that isn't free, because it is built on top of AWS serverless technologies, right?

But it runs inside your own account, it's pretty easy to use, it's secure, it's in-depth, and it's cost efficient — again, because it runs on serverless. You're mainly just paying for some QuickSight to sort of view and visualize the data, as you see over there on the right-hand side.

But what the Cloud Intelligence Dashboards were really designed to do was give a suite of insights that would go from the sort of exec level through to the developers — the hands on keyboards doing work — and be able to give insights to those executives, track and manage things like KPIs down through to the developers, and really create an efficiency-aware organization, right?

Because when it comes to cloud, it's no single person's job inside your organization to be cost efficient. And so, as I mentioned, no single tool can rule them all — when we look at the Cloud Intelligence Dashboards, we have the trends dashboard, the cost intelligence dashboard, the CUDOS dashboard, so on and so forth.

But you see here, we've sort of built them, they all live off the same data, right? So you're not working off different data where you're gonna end up in a situation where, hey, whose data is right? Whose data is wrong? It's really presenting that data in different ways based on the audience for what's gonna be the most impactful and valuable for that audience member, right?

So of course, the executives are likely going to see sort of the trends dashboards. The business owners and product owners are more likely gonna be spending time inside the KPI dashboards. And then the FinOps practitioners — the people that are more likely to be closer to the hands on keyboards — are gonna be spending more time in CUDOS.

And then the Compute Optimizer dashboard and the Trusted Advisor dashboard — because that's another thing I wanted to mention with these dashboards: we actually bring multiple AWS data sources together so that you, again, can have that single-pane-of-glass view across your leadership team through to your developers, so that you can track and make progress on your efficiency and optimization goals.

Um if anyone looked at these dashboards two years ago when they sort of started, it was relatively cumbersome to get going. Um these days, we've now got this to a point where it takes about 30 minutes or less to get all of this set up. Ok? Um it is really pretty quick. Uh all of the data stays inside your accounts, right? So you don't have to share it with any third party.

It's pulling AWS data already in your account and making it easy to view. Another really nice thing that customers like about this — if you're in this pain point, you'll know the pain — is that these dashboards can operate across multiple payer accounts. Ok?

So if you happen to have an organizational structure where you have multiple payers — in addition, obviously, to multiple child accounts under each of those payers — again, this can provide an organization-wide, single-pane-of-glass view into that.

And again, it's highly customizable, because all of this is open source. Obviously the Compute Optimizer dashboard pulls in Compute Optimizer data, but all of the code to deploy, run, and manage these is open source.

So if you wanna customize it, if you wanna make it specific for your organization, you can take that and run with it and customize it to your heart's content. And it is built on our serverless technologies. We don't need to go into this in this session, but in case you're wondering about the tools and components — how it all stays in your account, how it all comes together so quickly and easily —

these are the different serverless components that come together to deliver the Cloud Intelligence Dashboards framework on your behalf.

Alright, on to our last three here. So Graviton. Maybe it's not that new and exciting anymore, but Graviton4 was just announced. I bring this one up at this point because it's generally something we see customers get to when they've already done those simple steps. Of course, there's no reason you can't move straight to Graviton — in fact, there are many applications that you can just go run on Graviton immediately today. But I bring it up in this section because it's one of those ones that's gonna continue to pay benefits to you in years to come.

We just heard the announcement — or hopefully most of you that attended just heard the announcement — of Graviton4, the next-generation Graviton instance type, which has 30% better CPU performance. I think it has 50% more CPUs and 75% more memory. I'm gonna focus specifically on that 30% better CPU performance, though, because this is something that I think is not obvious if you're not in the Graviton world: Graviton is continuing to make significant performance leaps in each generation. We saw a double-digit performance improvement from Graviton2 to Graviton3, and we just showed a 30% performance improvement from Graviton3 to Graviton4. And so if you're going onto Graviton — if you've already migrated, or if you're considering migrating — don't just look at what you'd expect the immediate savings to be. Those will be compelling, don't get me wrong. But if you moved to Graviton on Graviton2, you would have gotten about 40% price-performance improvement over comparable x86 instance types a couple of years ago. And then we've seen double-digit performance improvements twice since then, while x86 is doing single-digit percentage improvements and moving at a slower clip, right?

So with Graviton, you're getting significantly better price performance, and as we add new generations, your applications are getting those huge leaps forward where you can run even fewer servers — or maybe just service more customers with the same number of servers, whatever it may be.
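To see how those generational gains compound, here's a back-of-the-envelope calculation. The 40% Graviton2 price-performance figure and the 30% Graviton4 figure come from the talk; the Graviton2-to-Graviton3 "double-digit" gain is taken here as an illustrative 25%, which is an assumption, not a quoted number:

```python
def compounded_gain(*gains):
    """Multiply successive fractional improvements into one overall factor."""
    factor = 1.0
    for g in gains:
        factor *= 1 + g
    return factor

# 40% vs comparable x86 on Graviton2, an assumed 25% for Graviton3,
# and the announced 30% for Graviton4:
overall = compounded_gain(0.40, 0.25, 0.30)
```

The point is simply that successive generational improvements multiply rather than add, which is why migrating once keeps paying off.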

But that's why I sort of consider this one of those ones that are gonna continue to pay significant benefits for customers who are adopting. And I mentioned this to somebody beforehand: what I think is so powerful about this Graviton platform is that we can actually go and look at what those bottlenecks are. Of course, we're not looking at any of your specific data, but we see the workloads running in the cloud, we know what the performance bottlenecks are, we know what's driving costs in those compute environments. And that's why not only are we still developing this overall environment so rapidly, but we're actually making the right leaps forward — which is how we're able to get these double-digit performance improvements: we have the data to make intelligent decisions as we're designing the next generation of chipsets. I'm gonna skip this next slide, because I wasn't allowed to put the Graviton4 slide in, in case Adam didn't announce it.

So the Graviton3 slide now seems a little bit redundant, but you know the story for Graviton. And I guess, in case you're not aware of it: we got Graviton — the original A1 instance — then Graviton2, Graviton3, and now Graviton4. But a lot of these components are actually embedded in our overall environment, right? The intelligence we've gotten from designing and building our own silicon is not just in Graviton, but it does pay benefits to Graviton. The Nitro security chip — that's also our own custom silicon. The Nitro card — we've designed that from the ground up as well. And then finally, of course, the Nitro hypervisor, which is what provides sort of the base level of incredible performance you'll generally see on EC2, where you get that fantastic performance out of these instances that you really won't see in hypervisored environments almost anywhere else, right?

And so these components are exclusive, purpose-built, modular building blocks, and that's why we're able to continue to make these pretty significant leaps forward in this overall Graviton system that we're operating today.

If you're thinking about Graviton for EC2, good on you. If you've already adopted it for EC2, ten points — even better. However, if you haven't adopted Graviton at all yet and you're wondering where to get started, my answer is actually not EC2. It's fine if you want to get started with EC2, but we have a whole bunch of managed services where you click a button and you get the benefits we just talked about, cos we've already done all the qualification of the software — we operate that ourselves on your behalf, so you don't have to. And so if you haven't adopted Graviton: are you using RDS? Go click the button. ElastiCache? Go click the button. Or even things like EMR, where it actually is your own EC2 instances — you go click a button.

They're just really quick ways that you can start reaping huge benefits. And what we see is customers that do that — particularly if any of you in the audience are from, like, a central infrastructure, cloud, or platform team, and I do hear this from a lot of folks that tend to attend these conferences — they might be struggling to convince their developers, because the developers don't know if it's gonna work and don't even want to explore it. We've seen some people just boldly go and move them to Graviton — and it works. Sometimes it's a bold move. But even better: just go move your AWS managed services to Graviton, and they will see that it is real, that they do get these price-performance benefits, that they do get better overall latency and performance for many of their applications. And that is a great way to then convince them: hey, we saw this, we proved it out with our managed services, we saw it for ourselves in the real world. Now it's worth spending the time to qualify and confirm that our applications running on EC2 can make this leap as well, and start reaping the same benefits.

Alright, second to last here, folks. Thank you for staying with me. So, serverless. We're obviously mainly focused on compute in this session, but I always like to mention serverless because, as Werner Vogels says, there's no server that's easier to manage than no server. So if you do have applications right now where you've got servers running — particularly if you have applications that have a lot of idle time — moving to a serverless architecture is a fantastic way to stop paying during that idle time, because with these serverless systems you're generally paying per request, when your customers are actually using your platform, internal or external or whatever it may be. And so for us, serverless is all about that: no infrastructure provisioning, automated scaling, you pay for what you use — all of that type of thing. But again, just like with Graviton, if you haven't started your serverless journey today, there might be paths for you to explore that actually sit outside the EC2 compute portfolio that you haven't already checked out, right?

So we have serverless offerings across compute, of course — Fargate and Lambda, two of the most popular — but also things like our data store services. So if you're using Amazon Aurora, there is a serverless option. And of course, we have all of our integration services, like API Gateway and Step Functions — these are serverless offerings as well that can replace a lot of the sort of glue that you might be running on EC2 instances. And that glue is very often idle, so it's a great way to go and reduce costs, and maybe improve uptime and performance as well for your applications.

Then finally, the final piece. For me, this might feel a little bit out of the blue, a little bit random. For those of you who know Karpenter — a quick show of hands, who does know Karpenter? Ok, very, very unknown. Ok. So who knows Kubernetes? Ok, good, we're in the right world then. So Karpenter is the AWS-built provisioning system that you can replace the Cluster Autoscaler with, right? Karpenter was actually built by AWS — it is open source, but it was designed by AWS to really optimize cloud compute, EC2 compute in particular, as much as possible.

And in addition to optimizing it, it does things like help improve upgrades and management, and particularly help run platforms that have multiple different types of applications — so it's a high-performance scheduler. The reason I bring it up, though, is if you are either in the process of migrating to Kubernetes, or if you're already running on Kubernetes, and you're thinking about the six or seven things that I've talked about today and thinking, jeez, that's a pretty long list, I don't know if I can be bothered getting started even with the first one on that list — I understand. Karpenter actually does almost the whole list for you, right?

With Karpenter — I'll show you quickly how it works. This is how the old system used to work: the Cluster Autoscaler, which would then call an Auto Scaling group, and an Auto Scaling group could only be a single instance type, right? And so if you wanted to run multiple different applications on this cluster — maybe one of them needed high memory, one of them needed GPUs, one of them needed high CPU — what are you gonna do? Are you just gonna run super underutilized with a single instance type?

So we actually said, well, we can do better — we think we can replace those components with Karpenter and EC2 Fleet. And here's what it's actually doing; let's talk about the steps. For scaling: number one, Karpenter automatically looks at the containers that are to be scheduled, and then it's automatically gonna add the correct number of nodes on your behalf. So it's automatically scaling. In addition to that, it actually looks at the pods — at the configurations in the pods — cos you don't have to provide instance types to Karpenter. It looks at the specs and says, hey, these next containers that we're about to deploy take about this shape; this is the optimal shape for the next set of containers. And so it'll actually automate the process of finding, out of those 750-plus instance types, which are the ones that match the shape for your application. And then with EC2 Fleet, it'll automatically select the cheapest — or with Spot, automatically select the one that's price-capacity optimized, right?

So it does that out of the box on your behalf. That's automated scaling and automated right-sizing. And if you sort of missed that point: the idea that it automatically selects the instance type means it also does the Spot stuff we were talking about. Remember how I said you have to be instance flexible? Well, it's gonna automate that process, because if the very first Spot option it tries to select isn't available, it'll just move on to the next one, right? It's gonna automate that process on your behalf.

So it's really putting a lot of these pieces together in an automated way. You do still have to purchase your own Savings Plans — I should have mentioned that. If there's a base level in your Karpenter environment, you do have to go purchase your own Savings Plans; it's not gonna automate that process for you. But the rest of it, it's doing a really good job of. It is native to Kubernetes, right? So if you are using things like taints, affinities — what else is here? — node selectors, topology spread, it's gonna listen to those, it's gonna honor them. It is a native Kubernetes service that's designed to work with native Kubernetes capabilities and functionalities.

And I already mentioned this: it's using that sort of attribute-based instance selection concept. It doesn't actually use attribute-based instance selection, but it does the same thing on your behalf under the covers. As I mentioned, it's going to automatically select the cheapest On-Demand instance, or the price-capacity-optimized Spot instance, on your behalf.

And then the final thing that I really love about Karpenter — which I think, to me, is sort of the spirit of what containers was always meant to be, and time and time again I see customers that haven't achieved this; we haven't yet achieved it universally, until I meet customers that are running with Karpenter — with Karpenter, it's going to do automated node consolidation as well, right? So if you're running a container environment — hey, we should go from 100 down to 50 nodes now, because the containers have spread across all these nodes and you've sort of ended up fragmented — it's gonna automate the process of defragmenting your environment to minimize your compute costs on your behalf.
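For reference, here's roughly what a Karpenter NodePool that enables all of this looks like, written as the Python-dict form of the YAML manifest. The field names follow the Karpenter v1 NodePool schema as best I can reconstruct it — treat this as a sketch and check the Karpenter docs before deploying:

```python
def karpenter_nodepool():
    """Sketch of a Karpenter NodePool: constrain nodes by attributes
    (capacity type, architecture) rather than a fixed instance type,
    and turn on consolidation so underutilized nodes get packed and
    removed automatically."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "spec": {
            "template": {"spec": {"requirements": [
                {"key": "karpenter.sh/capacity-type",
                 "operator": "In", "values": ["spot", "on-demand"]},
                {"key": "kubernetes.io/arch",
                 "operator": "In", "values": ["arm64", "amd64"]},
            ]}},
            # Consolidation is the "defragment the cluster" behavior
            # described above.
            "disruption": {"consolidationPolicy": "WhenEmptyOrUnderutilized"},
        },
    }

pool = karpenter_nodepool()
```

Because no instance types are named, Karpenter is free to pick the shape that fits the pending pods, fall back when a Spot pool is unavailable, and consolidate nodes later.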

So if you take away only one thing from this presentation, this would be it: if you're using Kubernetes, and you want to apply almost everything we talked about today, and you want your infrastructure platform team to be excited about it — maybe even your developers to be excited about it, because it really does give them a lot of freedom — just jump straight to Karpenter. This is a great way to deliver that agility for your business while getting all of the optimizations we talked about throughout today's session.

And with that, I'll thank you very much. If you enjoyed the session, I'll encourage you to fill in the survey. If you didn't, feel free to just come up and yell at me at the end personally, if you'd like. But yeah, thank you very much for your time. Feel free to run to your next session as quickly as you can.
