Harness the power of Karpenter to scale, optimize & upgrade Kubernetes

Good to go. Cool. Well, thanks for coming to this talk. In this talk, we are going to show how you can harness the power of the latest Kubernetes node autoscaler, Karpenter, to scale, optimize, and upgrade Kubernetes.

Before we get started, just wanted to give a shout-out: if you have been in the Venetian or Caesars Forum, I know it's a hike to come here. Both of us made that trek and it took us over 30 minutes.

My name is Rajdeep. I'm a Principal Solutions Architect working at AWS. I have been at AWS for around five years, and I help my customers onboard onto Kubernetes as well as serverless. I'm also a published author and YouTuber.

I'm gonna let my co-presenter introduce himself.

Thanks. I'm Ellis Tarn, a Principal Engineer at EKS. I founded the Karpenter project three years ago. Yeah, excellent. So we have the founder of Karpenter with us.

All right. Thank you, Ellis.

Yep. Let's get started.

All right. So what are we trying to do?

Basically, the job of a cluster autoscaler is this: your application is running on one EC2 instance, and as it scales, we want to go to multiple EC2 instances.

Zooming in a little bit: as traffic increases for your application, your pods scale out using the Horizontal Pod Autoscaler, and those pods get scheduled onto the running EC2 instances. But as the number of pods grows, the existing EC2 instances reach capacity. At that point, as traffic keeps increasing, some pods go into a pending state.

Previously, with Cluster Autoscaler, we interacted with Auto Scaling groups: you have to configure different node groups and tie them to Auto Scaling groups, and those create the EC2 instances.

Karpenter removes all of that. Karpenter skips the Cluster Autoscaler and Auto Scaling groups entirely: pending pods are seen by the scheduler, Karpenter picks them up, interacts with EC2 directly, and provisions the EC2 instances for you. Because it is talking directly to the EC2 APIs, it is faster, more flexible, and Kubernetes-native, and you get a lot of controls to say what kind of EC2 instances you want to provision, which AMIs, et cetera.

And you do that using these two YAML files: NodePool, formerly known as Provisioner, and EC2NodeClass, formerly known as AWSNodeTemplate.
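
To make this concrete, here is a minimal sketch of what that pair of YAML files can look like, using the v1beta1 API names mentioned above. The cluster name, IAM role, and discovery tags are placeholders for illustration, not values from the talk.

```yaml
# Minimal NodePool that delegates AWS-specific settings to an EC2NodeClass.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default        # points at the EC2NodeClass below
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
# EC2NodeClass holds the AWS-specific configuration (AMI family, subnets, security groups).
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-my-cluster"          # placeholder IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster      # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```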

OK. Scaling is of course the bread and butter of Karpenter, but the beauty of Karpenter is that it does more than scaling. Karpenter can help you optimize cost, it supports diverse workloads, including machine learning and generative AI, and it also helps you upgrade and patch.

And throughout this talk, we are going to show you how it does that. In summary, it gives you a complete data plane solution.

Karpenter is also open source and is now a CNCF project: just a couple of weeks back, Karpenter was accepted into SIG Autoscaling.

So let's say at this point you're thinking, all right, I want to adopt it. What are the steps? Generally, first you do some evaluation and see what features Karpenter provides, then you implement it, and after you implement it, you have to think about day-two operations.

So this will be the flow of our talk as well. We're gonna start with the evaluation first.

So how does a container scale? What does it need to scale? CPU, memory, storage, network, sometimes GPU. And containers run on EC2 instances. We don't make it easy: we have over 100 different EC2 instance types, and depending on the instance type, you could have more CPU, a higher CPU-to-memory ratio, et cetera.

So while this is a great advantage, it also complicates instance selection and orchestration, and Karpenter helps you with all of that.

Remember, I talked about the NodePool before. Think of it as a YAML file where you define what kind of EC2 instances you want to provision. Any time a pod goes into a pending state and Karpenter gets involved, it consults this YAML file.

So let's go through this. In this YAML file, you can say what kind of instances you want Karpenter to provision. For example, instance family In c5, m5, r5; instance size NotIn nano, micro, small. And these are just a couple of examples; you can also use greater-than, less-than, and all the other operators.

Also, you can skip this entirely, and this is one of the superpowers of Karpenter: it will automatically pick the best EC2 instance type based on your pod definitions.

You can also limit how much capacity this NodePool can provision. As you can see at the bottom, this NodePool will keep provisioning EC2 instances until the number of CPU cores reaches 100.

Similarly, you can specify which Availability Zones should be used, or you can skip this as well and it will automatically pick the best Availability Zone.

You can specify whether you want x86 or Graviton, or purchase options such as Spot and On-Demand. This is a really good feature: if you want to mix Spot and On-Demand, you can simply say spot, on-demand, and Karpenter will always try to prioritize Spot. Not only that, it is intelligent enough to know if Spot is more expensive than On-Demand, and in that case it will provision On-Demand instead.

And as always, you can skip all of this; we wanted to make sure it works out of the box. If you skip the spot/on-demand part, it will provision On-Demand by default.
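
Putting those knobs together, a NodePool along these lines expresses the examples from the last few slides: instance family and size requirements, a CPU limit of 100 cores, zone, architecture, and capacity type. This is a sketch against the v1beta1 schema; the zone names are placeholders.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: constrained
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c5", "m5", "r5"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small"]
        - key: topology.kubernetes.io/zone          # optional; omit to let Karpenter choose
          operator: In
          values: ["us-west-2a", "us-west-2b"]      # placeholder zones
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]                # x86 and Graviton
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]             # Spot is prioritized when it is cheaper
  limits:
    cpu: "100"    # stop provisioning once this NodePool owns 100 vCPUs
```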

Remember, on the first slide I said Karpenter is Kubernetes-native. What does that mean? Sometimes your Kubernetes workloads may be required to run in a certain Availability Zone, on a certain instance type, on Spot or On-Demand, et cetera. How do you do that? You use mechanisms like node selectors, node affinity, taints and tolerations, topology spread, et cetera.

Karpenter works with your pods' scheduling constraints. Let's try to understand that with an example, because I bet you will use this part a lot. On the left, we have a NodePool, and this NodePool says that any EC2 instance it provisions has to be either Spot or On-Demand, and either x86 or Graviton.

And on any nodes that this NodePool provisions, these labels will be added. So if this NodePool provisions a Spot, amd64 instance, it will add the labels karpenter.sh/capacity-type: spot and kubernetes.io/arch: amd64 to those nodes.

What that means is that in your pod definition file, you can use those labels to schedule your pods. So in the pod definition on the right, if you say, hey, this pod needs to be scheduled on a Spot instance, Karpenter will provision a Spot instance and schedule this pod.
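
For example, a pod that must land on Spot capacity only needs a nodeSelector on that Karpenter-managed label. This is a hypothetical pod spec for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                     # hypothetical workload
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot     # Karpenter provisions a Spot node if none is available
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
```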

In addition, Karpenter supports a bunch of well-known labels that are added to the nodes automatically, so you can use all of those parameters to schedule your pods. This gives you unparalleled flexibility in scheduling your workloads.

How about user-defined taints, annotations, and labels? Karpenter supports those as well. On the left is an example of a NodePool where, for every node that comes up, I'm attaching the label team: team-a. On the right, you can use that label in your pod spec to schedule pods onto the nodes provisioned by this NodePool.

So this is a good way to think about it: if I have three teams, you can label the NodePools team-1, team-2, team-3, and in the pod definitions you schedule onto them respectively. Next up: Spot interruption.
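
A sketch of that per-team pattern, assuming a hypothetical team-a: the NodePool attaches a label (and, optionally, a taint) to every node it creates, and the pod selects and tolerates it.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: team-a
spec:
  template:
    metadata:
      labels:
        team: team-a                 # added to every node this NodePool creates
    spec:
      nodeClassRef:
        name: default
      taints:                        # optional: keep other teams' pods off these nodes
        - key: team
          value: team-a
          effect: NoSchedule
---
apiVersion: v1
kind: Pod
metadata:
  name: team-a-app                   # hypothetical workload
spec:
  nodeSelector:
    team: team-a
  tolerations:
    - key: team
      value: team-a
      effect: NoSchedule
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sleep", "3600"]
```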

NodePools can be configured for a mix of On-Demand and Spot, but here is the kicker: Karpenter has a built-in Spot interruption handler. You do not need to trap Spot interruptions yourself or run a bunch of additional software.

Karpenter tracks the interruption notice via an Amazon EventBridge event. When you install Karpenter, you give it the name of the interruption queue and Karpenter does the rest: it traps the two-minute interruption warning and spins up replacement capacity. You are not required to run a separate node termination handler. Also, going back to what I said before:

if Spot becomes more expensive than On-Demand, Karpenter will intelligently spin up an On-Demand instance instead of Spot.
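
In practice the queue name is passed in at install time. A sketch of Helm values, assuming the v0.32+ chart layout and placeholder names; older charts expose this as settings.aws.interruptionQueueName.

```yaml
# values.yaml for the Karpenter Helm chart (names are placeholders)
settings:
  clusterName: my-cluster
  interruptionQueue: Karpenter-my-cluster   # SQS queue fed by the EventBridge interruption rules
```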

So as we understand now: we have a pending pod, the pending pod has some scheduling requirements, we have a NodePool, the NodePool also has some constraints, and they interact with each other. Then your EC2 node gets provisioned and the pod gets scheduled.

There are different strategies for defining NodePools. You can have a single NodePool throughout your organization: you put in a mix of Graviton, x86, and a bunch of other stuff, and every application uses that single NodePool.

The middle one is pretty popular for multi-tenant environments, where you can say team A gets this NodePool and team B gets that NodePool. The reason could be that team A needs GPUs, or there is some security isolation needed, a different AMI, or tenant isolation due to a noisy neighbor.

Remember, we talked about how in Karpenter you can specify limits on CPU and memory. If some project needs a lot of hardware but you want to cap it at a certain amount of CPU and memory, you can use this strategy.

And the final one is pretty cool; I like this one a lot. This is a weighted NodePool strategy. When your pods' scheduling requirements overlap with multiple NodePools, you can assign priorities to the NodePools, and the higher-priority NodePool will be considered first.

Let's understand this with an example. You have two NodePools here. On the left, let's say you have signed up for a Compute Savings Plan or Reserved Instances, and you have instance category c and r, instance CPU of 16 or 32 cores, and hypervisor nitro. See the weight at the bottom.

The weight of 60 will always be prioritized over anything less than 60. So as soon as a pod goes into a pending state and a new EC2 instance needs to be provisioned, the NodePool on the left will always be considered first.

It will keep provisioning EC2 instances until it reaches either 100,000 CPU cores or 1,000 GiB of memory. After you have exhausted your Compute Savings Plan or Reserved Instances, the other NodePool kicks in and the other instance types come up.

So this is an awesome way to implement a Spot/On-Demand ratio, Reserved Instances coverage, et cetera.
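
A sketch of that weighted pair of NodePools, following the numbers on the slide; the higher weight wins whenever both NodePools could satisfy a pending pod.

```yaml
# Preferred NodePool: matches the Savings Plan / RI footprint, weight 60.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: reserved-first
spec:
  weight: 60                      # considered before lower-weight NodePools
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["16", "32"]
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]
  limits:
    cpu: "100000"                 # stop once the reserved footprint is used up
    memory: 1000Gi
---
# Fallback NodePool: no weight, broader instance selection.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```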

Now let's talk about cost optimization. This is one of my favorite features. Over time, your cluster may look like this: you have four instances running; the first one is really awesome, nicely bin-packed and tight; the last three, not so much, a lot of underutilized nodes.

With Karpenter, all you need to do is put this consolidation policy, WhenUnderutilized, in the NodePool, and Karpenter will automatically bin-pack the pods and get rid of the other instances. Not only that, it is also intelligent enough to right-size.

Let's say in this case the last two instances are m5.xlarge. Even if we consolidate those pods onto one m5.xlarge, there will still be waste. So Karpenter will get rid of those two m5.xlarge instances, provision an m5.large, and bin-pack your pods.

The result: better selection of worker nodes and reduced cost.
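
That "flag" is the consolidation policy on the NodePool's disruption block. A minimal sketch using the v1beta1 field names; newer releases rename WhenUnderutilized to WhenEmptyOrUnderutilized.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized   # bin-pack pods and remove or replace underutilized nodes
```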

If I combine all the features we talked about: Karpenter folds in the capabilities of Cluster Autoscaler, node groups, node termination handlers, and the descheduler. You no longer need to maintain all these different stacks; it is a single cohesive piece of software, so it reduces your overhead.

So how does all this work under the hood? How does the magic happen? To talk about that, I'm going to invite Ellis.

Thanks, Raj. So we covered what Karpenter does, and now we're going to talk a little bit about how it does it. We're going to go deep into a node launch, which is one of the most common things Karpenter does: bringing new nodes to join the cluster.

There are four major elements of this workflow: scheduling, batching, bin packing, and finally the launch decision.

Karpenter works in tandem with the kube-scheduler. You can think of Karpenter as the back half, where the kube-scheduler is the front half. The kube-scheduler is responsible for allocating existing capacity, and Karpenter launches new capacity when that existing capacity has no space left.

In scheduling, there's a lot going on. Kubernetes has a very rich scheduling language for how you run your containers, how you run your processes on the underlying compute.

The kube-scheduler looks at pods and nodes, and it looks at a set of configuration: the resource requests, the tolerations, the node selectors, and the topology spreads. These all have to be considered by both the kube-scheduler and the Karpenter algorithm.

Karpenter has to maintain all of these concepts in memory to make sure that these constraints are being met when it's making its launch decision. The kube-scheduler is doing a very similar simulation.

So the kube-scheduler looks at the requests: how many resources does this pod need, how much CPU, how much memory? It looks at tolerations: which nodes are available? Maybe I can tolerate running on GPU, or I can tolerate running on Spot. Which architectures do I need? Do I need to run on this team's compute or that team's compute? Can I run only on ARM, or only on amd64?

When it identifies that no existing capacity can meet these constraints, it marks the pod as unschedulable, which we see as pending. Now we have a list of pending pods. But when do we decide to trigger a launch? If we just take a single pod and immediately launch, we'll get one pod per node. This could be really inefficient.

We use what's called an expanding window algorithm to try to balance between launching capacity as quickly as possible and making sure that we can bin pack effectively. We've chosen two values, one second and ten seconds, to make this tradeoff.

So what we do is a one-second idle period, with a maximum of 10 seconds for batching. And I'll walk through a couple of examples here so you can understand what's going on.
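
Those two values are the controller defaults, and they are tunable. A sketch of Helm chart settings, assuming the chart exposes them under settings as below (they map to the BATCH_IDLE_DURATION and BATCH_MAX_DURATION environment variables); key names may differ by chart version.

```yaml
# values.yaml for the Karpenter Helm chart (a sketch, not production values)
settings:
  batchIdleDuration: 1s    # flush the batch after one idle second with no new pending pods
  batchMaxDuration: 10s    # never wait longer than ten seconds before acting
```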

In this example, we have a pod that appears in the cluster. The kube-scheduler marks it pending. At t0, we wait until t1 - a window of time when we don't see another pod - and we call that a batch. Then we flush that batch and run the rest of the launch algorithm.

But what about in a more complex case? We get a pod and then we get another set of pods without an idle second. And here it's about half a second. When we do finally get that idle period, we take that batch and we flush it. And this is gonna be a much more efficient bin pack.

So depending on when the pods come into the cluster, we're able to make more or less efficient decisions. And this is just a fundamental tradeoff of what we call online and offline bin packing. This is a classical area of computer science you can look into and something that we think about a lot in terms of the heuristics and for how Carpenter works.

After that batch, another pod can come in and then we don't see anything else. So we flush another batch. And finally four new pods come in later and we flush another batch.

But what about a case where there's a continuous stream of pods coming in? We never get an idle second. If you have a busy enough cluster - maybe you're running thousands of nodes, hundreds of thousands of pods - Karpenter might just never act.

We have to decide at some point we're going to stop, we're gonna do our best and then we're going to keep looking at the remainder of the pods. In this case, we take an expanding window, we wait for 10 seconds and we grab that batch and then immediately after that batch, we take another with the remaining pods.

Now that we have our pods, how do we decide which instances we want to launch? The first thing we do is discover instance types from EC2. We talk to the EC2 DescribeInstanceTypes API and ask: what is the entire universe? How much CPU and memory, what is available that can even be launched on EC2?

Then we sort the instance types by cost because that's what we're trying to optimize for. We want to launch whatever instances we can while minimizing for cost.

Finally, we intersect the requirements. We talked about all those scheduling constraints; we use a lot of set theory, set intersection. We model the pod constraints, and we have to compare the pods against the other pods. So pod A and pod B may have different constraints, and we have to factor in whether or not the set intersection of those is compatible.

And we also have to compare that to your NodePool definition, which defines its own constraints. So we do a big set intersection across all of the different dimensions - architecture, operating system, as well as size, the amount of CPU and memory available on the instance.

At the top here are scheduling and bin packing. They're kind of two parts of the same thing: scheduling is about the constraints, and bin packing is about optimizing how tightly we can pack within the constraints of the scheduling algorithm.

And finally, we follow a reuse, grow, or create strategy. We have these pods, and we create a virtual node - a node that we think we're going to launch - given that set of instance types. We actually maintain the full list of available instance types for that node, and we say any of these instance types, in this sorted order, could back this virtual node.

We take the first pod and we add it. As we add more pods, that changes the constraints, it changes the available instance types. Maybe those pods have instance type constraints, so suddenly some instance types are no longer possible for this node. Maybe the CPU and memory requirements are now large enough that we have to remove instance types from the list of available instance types for that node.

So we go through all of the pending pods and try to build nodes that are as large as possible while keeping cost as low as possible as we go. We can choose to reuse a virtual node by adding a pod to it; grow a virtual node, which is actually removing the smaller instance types from the available set; or create a new virtual node if there's a scheduling constraint that is invalid - say two pods have anti-affinity and simply cannot run together.

There's a whole bunch of other complicated things that go into this. We have to consider things like volume topology. Maybe your pod requires an EBS volume and that EBS volume is only available in one zone. Suddenly, as you do a reuse for that pod, you need to tighten the constraints with those volume topology constraints.

We also need to think about the available volumes on the instances - how many volumes can even be mounted to this instance type? Host ports? How many ENIs are available? What do the daemon sets look like? How big are the daemon sets? It gets really complicated really fast.

The last thing to remember as we're doing this is that we're not actually optimizing for utilization. And this is something that people think about or ask us about when they observe Karpenter in action, especially when using Spot.

I mean, obviously utilization is a very valuable thing, and reservation is a really good output metric to know that Karpenter is doing its job. But what we're really doing at the end of the day is looking at the raw cost. And sometimes that means running at lower utilization, because what we care about at the end of the day is: what is the cost of this compute?

Just to double-click on that point of how hard bin packing is: it's NP-complete. We cannot possibly hope to get it perfectly right. We use heuristics, and they're getting more complicated all the time. If I were to describe to you exactly what we do, it would probably be different in a couple of weeks. We're making some major changes to this right now.

There are dozens of dimensions to consider, so that just grows the space even more. I mentioned that we're doing a similar simulation to the kube-scheduler. The kube-scheduler is doing a search problem, which is very similar: it's searching for which nodes of the known set it can use.

We're not actually doing search, because our space is so large: we could create new nodes out of nothing, we could reuse, we could grow. So we're really doing a generation problem rather than a search problem, but it's under the same constraints.

There are some open questions where we've had to make judgment calls, like on preferences. If you have a preference to run in this Availability Zone, we try to respect it; we try to launch with it. But what happens if there's no capacity available in that zone? Maybe we could just stop provisioning, or we could ignore the preference.

Maybe we want to respect the preference and try to launch with it, but if it's not possible, then we just launch without it. There are a bunch of different types of preferences, and customers like yourselves have engaged in a very lively discussion on how we should treat them. So this is evolving all the time, and that kind of discussion also adds to the complexity of packing.

The last thing to note is Kubernetes is incredibly dynamic. If we were allowed to snapshot the cluster at a point in time and make a decision, by the time we've executed that decision and launched the instances, the cluster might have changed. Maybe these five pods at the top, one of them was deleted. Maybe there's a bunch more pods, maybe a node was interrupted by spot.

We can never hope to be perfect, both because of this change problem and because of the NP-complete problem. So we try to make rapid, heuristic-based decisions and make them as quickly and as continuously as we can, so that over time we keep optimizing, we keep solving the problem. Things are never perfect, but they're always getting better.

Finally, the launch decision. I've mentioned that we have a set of available instance types for each virtual node. We try to shoot for around 20 instance types of flexibility, especially when using Spot. This is a magic touch for how we avoid Spot unavailability, or an ICE (insufficient capacity error), especially for exotic instance types like GPUs or in smaller regions.

Sometimes Spot capacity is just not available. Karpenter's ability to maintain that highly dimensional space of instance types allows us to talk to the EC2 Fleet API and say: here are all of our options, here's an allocation strategy to pick one. For On-Demand we just use lowest-price, and then Fleet chooses which capacity is available in EC2 that meets these constraints.

There's a little wrinkle here for Spot allocation. There's a tradeoff for Spot where, if you just choose lowest price, you can sometimes get a very, very small capacity pool - a capacity pool that's almost exhausted and at very high risk of interruption.

So we use the price-capacity-optimized strategy, which is an EC2 Fleet API feature that allows us to say: give us the cheapest instance type, but don't give us one if it's about to be interrupted. So there's a little bit of a balance in there, and we rely on the Fleet API to make that tradeoff.

Now, I'm gonna hand you back to Raj to cover day-two operations and onboarding for Karpenter. Thank you, Ellis!

When we release a new AMI version, AWS goes and updates the Parameter Store. So there is a new latest AMI which no longer matches your running AMI. That's drift. Karpenter will pick up that latest AMI and update the worker nodes in a rolling fashion.

So it will create new nodes with the updated AMI, cordon and drain the older nodes, and migrate the pods over.

What about Kubernetes version upgrades? If you update the EKS control plane, let's say from 1.26 to 1.27, Karpenter is intelligent enough to know: it goes and checks the Parameter Store, finds the latest AMI for version 1.27, and upgrades the worker nodes.

So as you can see, this is zero touch and secure. You do not need to go run some pipeline, no need to run recycling commands, etc. It's always using the latest EKS optimized AMI with all the latest security patches.

Now some of you may be thinking, oh, this is great, but what about custom AMIs? Karpenter supports custom AMIs as well. You can select an AMI using the AMI selector field in the EC2NodeClass. You can select by tags, names, owner account IDs, or simply by the AMI ID.

If multiple AMIs satisfy your criteria, the latest AMI will be used. If no AMI matches, no nodes will be provisioned. You can simply run a "kubectl get" on your EC2NodeClass and it will tell you which AMIs are discovered, that is, which ones satisfy the criteria you defined. It's a good way to test it out before you roll it out to production.
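
A sketch of those selector terms on an EC2NodeClass; the tag values, names, account ID, and AMI ID are placeholders. Terms are ORed together, and the fields within a single term are ANDed.

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: custom-ami
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-my-cluster"       # placeholder
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  amiSelectorTerms:                          # terms are ORed; fields within a term are ANDed
    - tags:
        team: team-a                         # select by tag
    - name: my-golden-ami-*                  # or by name pattern
      owner: "123456789012"                  # placeholder account ID
    - id: ami-0123456789abcdef0              # or simply by AMI ID (placeholder)
```

Running "kubectl get" (or describe) on this EC2NodeClass then shows the resolved AMIs in its status, which is the discovery check mentioned above.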

So let's take a look at the previous scenario, but now with custom AMIs. In your EC2NodeClass, we are using AMI ID 123. So all the EC2 instances that this Karpenter NodePool and node class provision will use AMI 123.

Now you have a new AMI and you add that new AMI ID, AMI 456. Any new EC2 instance that is provisioned will use the new AMI, AMI 456. But old nodes will not drift. Remember, drift is a mismatch between the AMIs selected in the node class and the AMI on the node. AMI 123 is still in the selector, so technically those nodes have not drifted.

Now what if you want to migrate all those EC2 nodes onto the new AMI 456? You simply remove AMI 123. At that point there is drift, and Karpenter will automatically upgrade those older nodes to the new AMI.

Karpenter also respects registered AMIs.

We talked about how drift can make upgrading your data plane easy and zero-touch. And we talked about consolidation before, how Karpenter bin-packs and saves you money.

One of our large enterprise customers, SentinelOne, has thousands of nodes in one cluster. They adopted Karpenter drift for automatic upgrade of the nodes and they enabled consolidation. Remember, it's just a flag. By doing that, their containers started being packed off of underutilized instances as Karpenter provisioned new right-sized instances.

Because of that automation, literally a flag, they were able to save around 50% in costs without any management overhead.

So now, how does all this work under the hood? Let's do a deep dive. I want to invite Ellis back on stage.

Thanks. We talked about node launch earlier, which is probably about half of Karpenter's responsibilities. The other half is what we call node disruption, or node termination.

Disruption is any event in the cluster that would cause us to remove a node. This is obviously a very scary action. You wanted those nodes because you wanted your pods to be available. And when we take away those nodes, it can cause risk to the cluster.

So how do we protect against that? We rely on Pod Disruption Budgets. And this is essential: you have to have these configured when you're using Karpenter or any other cluster autoscaler. Otherwise your applications are not protected and may be disrupted faster than you can tolerate.

Pod Disruption Budgets enable you to specify how many pods can be disrupted at a time, and they act as a gate on the speed of disruption. Also, for very sensitive pods, Karpenter supports an annotation called "do-not-disrupt". If you set that, we will not attempt to disrupt a node for any pod that has it.

You can also put that annotation on the node itself if you want to set a node aside and make sure Karpenter doesn't touch it.
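
A sketch of both guardrails for a hypothetical web app: a PodDisruptionBudget that keeps at least two replicas up, and the do-not-disrupt annotation on an especially sensitive pod.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2            # never evict below two ready replicas of this app
  selector:
    matchLabels:
      app: web               # hypothetical label
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-critical       # hypothetical single-run job
  annotations:
    karpenter.sh/do-not-disrupt: "true"   # Karpenter will not voluntarily disrupt the node running this pod
spec:
  containers:
    - name: job
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sleep", "3600"]
```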

There are multiple types of disruption in Karpenter, and they're growing as we get more use cases. But we have a standard algorithm for reasoning about them. These are all types of node disruption:

  • Drift
  • Consolidation
  • Spot interruption
  • EC2 health events
  • A feature we have called expiration, where you can say that a node should only live for a certain amount of time (a configuration sketch follows this list)
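
For reference, expiration is configured on the NodePool's disruption block in the v1beta1 API (newer releases move it under the node template). A minimal sketch:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
  disruption:
    expireAfter: 720h   # nodes are disrupted and replaced after 30 days
```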

We classify these into two types of disruption - voluntary and involuntary.

For voluntary disruption, this is something that we'd like to take action on, but it's not burning, it's not immediate, it's something that we have a choice about.

For involuntary disruption, like a spot interruption or an EC2 health event, there's a slightly different algorithm - we try to move faster. We know that the node is going away, so we do whatever we can to make sure those pods are migrated off as quickly as possible. But those Pod Disruption Budgets you've defined may be violated in the involuntary case.

In general, we try to minimize disruption. There's all sorts of side effects of disruption - every time a new node comes online, every time new pods come online, you have to rehydrate any caches, you make a lot of DNS calls. So we just generally try to minimize disruption in the cluster. But we take action when we can based off of these voluntary and involuntary cases.

There's a standard disruption workflow:

  1. First we identify candidates. We'll talk about some of those methods later.

  2. We launch replacements. So we want to minimize disruption - we want to make sure replacement capacity is available before we terminate instances.

  3. Once those replacement instances are launched, we drain the candidates that we identified earlier by moving the pods off, respecting the Pod Disruption Budgets, using the Eviction API.

  4. And then finally, when those candidates have been drained, we terminate them.

How do we identify which nodes are candidates for drift? There's two types:

One way is pretty easy - we can take a hash of the spec. So things like taints and labels, we do it on a per field basis and we figure out given your specification, what is a hash of what you asked for and what is the hash on that node? And we can simply do a string compare and say the specification has now changed from what the node launched with, and we can now nominate that node as drifted.

There's another kind of drift that can't be hashed - things that Raj mentioned earlier, like discovering an AMI. That's something that's defined outside of the Kubernetes configuration, and we have to discover it: discovering the Kubernetes version, discovering AMI selectors. There are a bunch of different discovery mechanisms in Karpenter. So we periodically poll and compare those sets to detect discovered drift.

We also have expiration - that one's super easy to compute. You can see there's a creation timestamp on the node. We simply compare that with the duration you've specified - if the node is too old, it's a candidate.

Interruption - we subscribe to notifications, EC2 sends us notifications about spot interruption events as well as EC2 instance health events. And so we can trigger an involuntary disruption, nominating a candidate when those happen.

And finally consolidation - this one's kind of hard and we'll really go deep into it.

So consolidation is a voluntary action. We want to lower the cost, but we're trading off availability. How do we make sure that we don't disrupt pods while compressing the cluster as much as possible? This is, again, another one of our NP-hard problems. It's pretty complicated. I'll do my best to walk you through some examples here so you can get a sense of all the decisions Karpenter is making.

Again, we're optimizing the same problem, given the pod constraints and the NodePool constraints. And in the code, we actually reuse the whole simulation algorithm for consolidation, with a couple of tweaks.

Let's take this example here. This node can be consolidated, right? It's empty. Each box is a node and the pods are inside of it. This one's empty, so this is a totally easy consolidation decision.

But is this the best consolidation or maybe a better way to think about it - is there a better consolidation decision? Is there something else that we could cheaply discover that could be better?

What about these two? These seem well packed, right? But in the context of the whole cluster, maybe they could be consolidated as well because there's capacity on the left, right?

What about this upper left? Turns out the upper left could also be consolidated. You could move one of those pods to the right and another one of those pods to the node below.

Obviously this is a gross oversimplification - there's a lot of pod constraints going on here. Again, volume topology, host ports - all those things are part of that simulation algorithm.

In this case, we were able to consolidate those three nodes on the right and move those pods over to the existing capacity on the left.

The decision is a fairly complicated heuristic. There's a bunch of concerns we're trying to address:

  • We want to minimize pod disruption - of course disrupting pods, even though you have your disruption budgets, it's really not great to be running at the minimum of those disruption budgets. So we try to minimize the number of pods that get disrupted with each action.

  • We also try to prefer older nodes - if we were to terminate very young nodes, then you can get what we call flapping, which is where a node is terminated, a new node is launched, and very soon after, that new node is terminated again. The effect is that certain applications are disrupted more than others, and you can have an application that gets unlucky and gets disrupted over and over and over as the cluster state evolves. So we use node age to try to fight back against that problem.

This is obviously extremely expensive to compute at scale. These are all heuristics and this is also evolving all the time.

Let's take another case, this is a much less allocated cluster. In this case, we were able to terminate a huge amount of capacity and move all the pods onto an existing node.

We call that an N-to-zero consolidation. So N-to-zero means we're taking N nodes and we don't actually have to launch any replacements. So back to that standard algorithm I mentioned earlier for the standard disruption algorithm, the candidates is the replacements is just an empty list. So we don't need to launch any replacements, we can just start evicting immediately.

Let's take this case, we have 5 nodes and 5 pods, but if we were to terminate any of the nodes, it wouldn't actually fit in the existing capacity. If you terminate that top node, you can schedule 4 of the pods and you'll have 1 pod as a remainder. An N-to-zero consolidation doesn't work here.

But we can do what we call an N-to-1 consolidation. So in this case, it was just 1-to-1 and we were able to identify by looking at the cost of the EC2 instance type that this instance type would be cheaper if we shrunk it down. And we were able to also simultaneously reuse the other available capacity in the cluster as part of the same move.

Here's another one. This one is an N-to-1 consolidation again. But we're taking 2 nodes and we identify that both of these nodes could be replaced again. You can imagine the complexity of this decision when you're reasoning about thousands of nodes and hundreds of thousands of pods.

As you can tell, we're pretty excited about Karpenter. We've done a lot of development. But one of the things I love about the project is just how quickly it keeps evolving - the level of community engagement, the community contributions, the issues that are opened on GitHub.

I love that Karpenter has an active online community, and I'd invite you to join us, whether it's on Slack, at our working group every other week, or just by opening a GitHub issue telling us exactly how we're not meeting your needs, or sending a fix to our documentation.

And that's our talk. Thank you so much. Reminder, please complete the survey, it really helps us out.
