Inner workings of Amazon EKS

Good morning, everyone. So the cue for us was to get on stage when the music faded, and we got on stage and the next song started, so we were a little early. But good morning, it's great to have you all here. I hope you're having a great re:Invent and a lot of useful learnings. Thank you for joining us today for this session on the inner workings of Amazon EKS.

My name is Vipin Mohan. I'm a principal product manager on the Amazon EKS team, and I'm joined by my colleague, Vipul Sabaya. He's a senior software development manager on Amazon EKS as well, so we work very closely with each other.

So Kubernetes is a great way to run containers at scale. It's amazing how Kubernetes works, but there is a lot that goes into managing Kubernetes, and all of that time, energy and effort that you spend managing Kubernetes is time you're taking away from delivering value and delighting your end customers.

So in this session, we will give you a peek under the hood into how we operate Amazon EKS as a managed service for you. We'll give you an overview of how we deliver high availability and reliability, give you a secure environment for running your business-critical applications, and give you the resiliency and the confidence to run your business applications while offloading the undifferentiated heavy lifting to us.

In terms of our agenda for today, we will start off with a quick introduction to what Amazon EKS is, and we'll talk about why you should be using Amazon EKS for your Kubernetes applications. I'd also like to go over some of the use cases that our customers have told us they use EKS for.

We'll also spend some time on Kubernetes versions and upgrades. I can see some heads nodding over here; upgrades in Kubernetes are complex, so we'll talk about how EKS helps you with upgrades. Then we'll start peeling the onion and dive deeper into the EKS architecture. I'll hand it off to Vipul, and we'll discuss how EKS scales to respond to the needs of your business applications, scaling up and scaling down, as well as how we deliver resiliency: how do we respond to data center incidents?

We'll give you some best practices and some pointers on where you could follow up in future, and then close out with the wrap-up and summary.

So let's do a quick poll. I believe there's a simulcast room as well, but we'll do a show of hands here in this room to get a sense. I'd like to know at what stage you are in your Kubernetes journey. We have three options; I'll go over them one by one and we'll do a show of hands.

If you are somebody who's just getting started, or your organization is interested in Kubernetes and you're just evaluating it, let's do a show of hands. That's about, I think, 10 to 20% of the room.

If you are somebody who is getting knee-deep into Kubernetes, you're starting to run test workloads, maybe even have a handful of production workloads, let's do a show of hands. OK, that's a slightly bigger number, about 30 to 40%.

And lastly, if you have multiple production workloads on EKS and you're an expert in Kubernetes, let's do that. Oh yeah, that's the majority, like 50% of the room. Great, awesome.

So I'm hoping all of you have something to take away from the session today. Now, a quick introduction to Amazon EKS. What is EKS? I'll start with a quick overview of Kubernetes.

Kubernetes is container orchestration software. It's open source, and it enables you to deploy your containerized applications and run your workloads at any scale.

EKS runs vanilla Kubernetes under the hood. What that means is we do not fork the upstream code base; EKS is upstream-conformant and certified. EKS passes all of the upstream conformance tests, and we apply and backport all of the security fixes, which gives you the peace of mind of running a secure Kubernetes environment with EKS. At any given point in time, EKS supports up to six versions of Kubernetes. We actually increased that in the past few months; it used to be four versions, and now we support up to six. We'll talk more about that.
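That conformance claim is verifiable on your own clusters: the CNCF conformance suite can be run against any Kubernetes cluster, for example with the Sonobuoy tool. A rough sketch, assuming a configured kubeconfig:

```shell
# Run the CNCF certified-conformance suite against the current cluster
# (this takes a while on a real cluster).
sonobuoy run --mode certified-conformance --wait
results=$(sonobuoy retrieve)   # downloads the results tarball
sonobuoy results "$results"    # summarizes pass/fail
sonobuoy delete --wait         # clean up the test namespace
```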

EKS gives you a managed experience for running a performant, reliable and secure Kubernetes environment. Our goal and mission with EKS is to make Kubernetes operations and management simple and boring, so that you can focus on delivering value to your end customers.

So why Amazon EKS? Kubernetes simplifies the deployment and management of containerized applications, but managing Kubernetes at scale is hard. You need expertise, and you need to staff teams to actually understand the overall Kubernetes architecture, how you would use Kubernetes in your environments, and how to deploy and manage Kubernetes applications at a scale that fits your needs. All of this is time spent away from delivering value to your end customers.

Now, what are some of the challenges with running Kubernetes at scale? First, there's the scalability aspect. Scaling a Kubernetes cluster efficiently while maintaining high availability and performance is a nontrivial task. You need to ensure that your cluster can scale seamlessly to meet the varying needs of your business-critical applications.

Then there's the aspect of security risks. Kubernetes is open source software, which means there can be security vulnerabilities in Kubernetes, and we engage with the community to patch those security issues and vulnerabilities. But you need to consume those fixes and apply them to your Kubernetes components.

If you are self-managing your Kubernetes cluster, there's also the angle of complexity and the learning curve; I touched upon this briefly. When you want to manage your own Kubernetes control plane, you need to set up teams with expertise in Kubernetes, not just to manage the control plane but also to handle operations. You need to set up monitoring, you need to set up alerts, and you need to decide how to respond to scaling events. All of this demands a steep learning curve, which can be time-consuming for your business.

Then there's the operational overhead. Self-managing Kubernetes can pose significant operational overhead for your teams, and you need to dedicate resources and time for the continuous operations and management of Kubernetes. When you are managing your Kubernetes clusters, you also need to take into account how you deliver high availability and have disaster recovery systems in place.

You need to have a disaster recovery plan, and you need to think through how high availability would be achieved in your environment. All of that is a pretty involved effort, and again, it takes time away from your business-critical applications.

Our core tenets tie into some of the topics we talked about in the previous slide. Security is job zero at AWS, and that applies to EKS as well: our primary goal is to ensure that you are running your applications in a secure environment.

EKS is built for production. What that means is we are delivering a managed service that you can use to run production and mission-critical workloads. We have customers across healthcare, manufacturing and financial services who are running mission-critical workloads on EKS. Then there's speed, scale and reliability.

There are a few ways to think about this. One is, how does the EKS infrastructure itself respond to your needs? If you have a sudden burst in demand for your application, how does the control plane scale to actually meet those needs? How do we deliver an experience for your end customers with as little latency as possible?

It also speaks to how quickly we respond to the community. We are engaged with the upstream community: for example, we are part of the security committee, and we have leadership roles in different working groups as well. Any time there's a security vulnerability, we get engaged, and a lot of the time we see those fixes even before they're public knowledge. EKS is upstream-conformant.

And we are committed to delivering the latest Kubernetes versions with EKS. We also want to help you optimize compute. At the end of the day, your applications are running on compute instances, but you need to figure out the right resources on your compute instance, in terms of CPU and memory, for running your applications. You could over-provision, but that would mean you are paying more without actually taking advantage of all the resources on that particular instance.

So we are coming up with ways by which you can optimize compute. Now, what are customers building on EKS? We have customers using EKS across a variety of industries and verticals, all the way from manufacturing to healthcare to financial services; even three-letter government agencies are using EKS for various use cases.

One of them is legacy app modernization. These are customers who are moving applications from on-premises over to the cloud. These could be monolithic applications that they are modernizing and breaking down into microservices, and as they do that, they're choosing EKS to run their containerized workloads.

Then there's AI/ML. I'm guessing most folks in this room have heard something about AI/ML this week. AI/ML has been a major focus for EKS and our customers in the past 12 to 15 months. Some use cases are autonomous vehicles: customers use EKS to run object tracking, computer vision and robotics, for example. And if you think about any AI/ML use case, there's going to be the aspect of modeling, training and inference.

You first build a model, you train the model with some data, and then you derive inferences from that model. We have customers using EKS across all of these applications and use cases.

Data processing is another common use case that we've seen customers use EKS for. Think about streaming use cases or heavy analytics use cases; it could be financial services or healthcare, where there's a lot of data that needs to be processed. And there are a lot of back ends.

Most of you have smartphones, and the likelihood is that at least one or more of the applications running on your smartphone is using EKS as a back end. It also applies to IoT (Internet of Things) use cases. And web applications have been a very common use case right from the start of EKS, both for simple static websites as well as for complex dynamic websites.

When we launched EKS a little more than five years ago, EKS was a managed service in the cloud, and that still continues to be the case. But many of our customers told us that they have use cases where they cannot use the cloud for running their applications.

Some customers told us, "Hey, I have government regulations that determine where I can run my workload, within certain geographies." And then there are data sovereignty needs, for which customers would want to run EKS closer to their geography.

So we announced EKS on Local Zones a few years ago. Local Zones are basically AWS infrastructure available closer to your geography. Wavelength is a similar use case, but it's more focused on 5G and telco use cases; these are use cases where the application does not leave the network of the communication service provider.

Then there's AWS Outposts. For those of you who are not aware, Outposts is AWS-managed hardware: you can have a set of server racks in your office or in your data center. These are compute racks, the hardware is managed by AWS, and you can run EKS on your Outposts instances.

And lastly, there were customers who told us that they have long-term data center leases that are not going to expire anytime soon, but they still want to modernize their applications where they are. EKS Anywhere is the open source solution that we have for running Kubernetes on premises. The common theme here is consistency.

No matter where you are running EKS, you get the same Kubernetes bits and the same tooling for running your Kubernetes clusters. And when you are ready, you are always free to move across different environments.

AWS has a global reach across different geographies, and EKS is available across 30 commercial regions. We added support for five commercial regions this year, in Zurich, Hyderabad, Melbourne, Tel Aviv and Spain. We have more regions coming; we're constantly building new regions, and we want to make EKS available in all of those regions.

Besides that, you can see in that list that you can run EKS in geographies across the Americas, Europe and the Middle East, Asia Pacific, Africa and China. And a lot of our customers, especially the government agencies, run EKS in GovCloud regions.

How many of you have had issues with upgrades, in EKS or in Kubernetes in general? Let's do a show of hands. Yeah, that's a common topic we've heard a lot about from our customers. A few common themes we heard were that you wanted to see consistency in the versions that EKS supports, and we've delivered that this year.

I'm super excited to say that as of now, we are caught up on all upstream Kubernetes versions up to 1.28; Kubernetes 1.28 has been available on EKS. 1.29 is planned for launch by the community next week, December 5th, I believe, and we've been plugged into some of those 1.29 efforts for weeks, if not months, in advance. With upgrades, you told us that there's a lot of complexity around upgrading your Kubernetes clusters.

You need to dedicate resources for upgrades on an ongoing basis, there are always conflicting priorities, and you need to choose between upgrades and your business priorities. We feel that should be a slam-dunk choice: business priorities should always win. And then there's release cadence.

The upstream community releases a new version every four months, and that's a pretty fast clip. So the way we've been responding to you, especially with some of the things we've announced this year, is that we've been consistent with version launches. We've launched quite a few versions this year in EKS; 1.28 is the latest.

We've also announced something called extended support; I'll spend some time talking about that. And we're announcing a feature very soon, pre-flight checks, that will give you an idea of which APIs have been deprecated in a future Kubernetes version even before you click the upgrade button.

The idea is that you get the confidence of performing the upgrade before you actually go through the entire process and figure out something is going to break. And we have more on the roadmap; we'll talk about that later.
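Even today, the API server itself exposes a signal you can check for deprecated API usage. This assumes you have RBAC access to the raw /metrics endpoint:

```shell
# Non-zero samples here mean some client is still calling an API
# that is deprecated (and may be removed in a future version).
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
```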

And with the upstream launches, there has been a discussion around long-term support (LTS). We are engaged in those working groups; we are a co-chair, and we have both product and engineering presence in that working group.

To quickly summarize, EKS supports versions from 1.23 to 1.28 today. 1.24 to 1.28 are in standard support, and extended support is something we announced recently: 1.23 is currently in extended support, and every single version going forward will get extended support.

We have ended support for Kubernetes versions up to 1.22. Extended support for Amazon EKS is something we announced in October this year. The upstream community supports a Kubernetes version for around 14 months: 12 months of full support and then two months of patch support.

What we've done is extend that to 26 months for all of our EKS customers. You get the 14 months of standard support plus 12 months of security patching for the control plane components, for the default add-ons and for all of the AMIs. It's available in free preview right now for 1.23, and it will be available for every future version of Kubernetes.

We're planning a GA of this feature in early 2024. There will be a pricing component associated with extended support, and those details will be available when it goes GA.

To set the stage for the EKS architecture: EKS is a regional service, so there's a regional endpoint by which you access your Kubernetes cluster, and the orange box that you see is the EKS interfaces, the EKS API.

You access the orange box using the EKS API, and under the hood we run vanilla Kubernetes. Once your cluster is up and running, all of the kubectl commands are available to you; you can use the Kubernetes APIs in your cluster, and we're running the same API server, kube-controller-manager and kube-scheduler that you're familiar with from upstream.
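As a sketch, connecting to a cluster and then using ordinary Kubernetes tooling might look like this (the cluster name and region are placeholders):

```shell
# Fetch credentials for the cluster's regional endpoint into kubeconfig.
aws eks update-kubeconfig --name my-cluster --region us-west-2

# From here on it's plain upstream Kubernetes.
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces
```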

And once you have your Kubernetes cluster running, you can integrate with a ton of AWS services, whether that's DynamoDB or EMR, or you can take advantage of any open source software that's available; Prometheus and Grafana are something we've seen very commonly used.

So any open source software that's compatible with Kubernetes will run seamlessly with EKS because EKS is using vanilla Kubernetes under the hood.

So I'll hand it over to Vipul to dive deeper into the EKS architecture.

All right, thanks Vipin. So yeah, we're going to go a little bit deeper now and talk about how EKS works, how we operate etcd for you, and hopefully give you some insights into various aspects of the service that you may not be aware of.

So let's start off with the control plane architecture. The EKS control plane is fully managed by AWS. It's zonally redundant, which means that we have API servers in a minimum of two zones and etcd in three zones.

Each one of these instances is in a private VPC, so you actually get VPC-level isolation. We have an NLB which is used for the public endpoint to the cluster, and the traffic within the NLB is actually isolated to the zone, so there is no cross-zone impact or anything like that when traffic enters a zone.

On top of this, we have a 99.95% SLA and 24/7 support. And I think the most important thing is the fact that EKS manages etcd for you; this is probably the most complicated thing to operate if you're doing it on your own.
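For example, the managed control plane's endpoint, version and health can be inspected through the EKS API ("my-cluster" is a placeholder):

```shell
# Show the public endpoint, Kubernetes version and status of the
# managed control plane.
aws eks describe-cluster --name my-cluster \
  --query 'cluster.{endpoint:endpoint,version:version,status:status}'
```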

The data plane is where your applications run; it lives in the customer account and the customer VPC. Initially we started off with self-managed nodes. This is where you spin up a worker node, that worker node joins the cluster, and then Kubernetes can schedule applications onto that node.

We provide EKS-optimized AMIs; when you spin up an instance with this AMI, that instance will join the cluster. This is still supported today, but what it means is that you manage the complexity of doing the upgrade: cordoning and draining the nodes, evicting pods, things like that.
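A rough sketch of that manual replacement flow with self-managed nodes (the node name is hypothetical):

```shell
# Stop new pods from landing on the old node.
kubectl cordon ip-10-0-1-23.ec2.internal

# Evict running pods, respecting PodDisruptionBudgets.
kubectl drain ip-10-0-1-23.ec2.internal \
  --ignore-daemonsets --delete-emptydir-data

# Then terminate the EC2 instance and bring up a replacement
# from the newer EKS-optimized AMI.
```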

We then launched EKS managed node groups. This is still where capacity exists in your account, in your VPC, and it belongs to an ASG, but EKS takes care of some of the heavy lifting of upgrading: gracefully terminating the pods on a node and then moving them to another node, things like that.
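With managed node groups, that rolling replacement collapses to roughly one API call; a sketch (names and version are placeholders):

```shell
# EKS drives the rolling replacement: new nodes come up on the
# target version, old nodes are cordoned, drained and terminated.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --kubernetes-version 1.28
```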

We also added a serverless option with Fargate. Fargate is AWS-managed compute; it's fully managed by us. All you have to do is annotate a pod and say that you want this pod to land on Fargate. Today the model with Fargate is a single pod per node, so whenever you schedule a pod to Fargate, we actually spin up a node behind the scenes and schedule that pod on that node.
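One common way to target Fargate on EKS is a Fargate profile that selects pods by namespace and labels; an eksctl-style config sketch (all names are placeholders):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
fargateProfiles:
  - name: serverless
    selectors:
      # Pods in this namespace carrying this label are scheduled
      # onto Fargate, one pod per Fargate-managed node.
      - namespace: app
        labels:
          compute: fargate
```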

And finally we built Karpenter. This is our version of a next-gen autoscaler. Karpenter is open source; we recently donated it to the CNCF. It gives you the flexibility of choosing the right instance types for your workload, whether that's an on-demand instance, a Spot instance or something else.

It also helps you reduce costs. What it's actually doing is rebalancing your containers: if there are holes in your worker nodes, it'll try to bin-pack as much as it can and give you the ability to remove those nodes, so you're not paying for over-provisioned capacity.

Karpenter also gives you an option to automatically replace your nodes. So if you have a requirement, which most of us probably do, where you need to do patching of the control plane or of the worker nodes, Karpenter will automatically recycle them so that the OS patching is taken care of for you.
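These behaviors map onto Karpenter's NodePool API; a minimal sketch, assuming the v1 API (a real NodePool also needs a nodeClassRef, omitted here):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Let Karpenter pick Spot or On-Demand capacity.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      # Recycle nodes after 30 days so they pick up patched AMIs.
      expireAfter: 720h
  disruption:
    # Bin-pack by removing empty or underutilized nodes.
    consolidationPolicy: WhenEmptyOrUnderutilized
```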

OK, so now let's talk about how this architecture has led to better resiliency for us. We spoke about zonal redundancy: we have API servers and etcd across multiple AZs.

We all know that AZs do go down, and our goal is that a single zone going down should have zero impact on the cluster. So what are we doing about this? How do we achieve that? I want to walk you through that a little bit.

Over several years of operating EKS, we've come to the conclusion that zonal failures are complicated. In practice, they're weird, because they vary based on how they impact the capacity in a given region.

An AZ may be fully down, if it's on fire or something, or it might be partially down if there's a network disruption between some racks within the AZ. This could manifest as increased latencies or reduced throughput, things that don't necessarily mean it's a complete failure.

So there are a lot of these different failure modes that may or may not cause impact to a cluster, and we have to reason about this. We've gathered a ton of experience over the past decade: every time there's a failure, we dive deep, we dissect it, we analyze what exactly happened and what we could do to minimize the failure. And we've started to use those learnings to give you a more reliable Kubernetes experience.

What we've built and implemented is deterministic AZ resiliency. What does this mean? No matter what type of failure mode we have with an AZ, we want to give you a deterministic outcome. By that, we mean a seamless API experience where it's not just about the uptime: even performance, things like latency that you would care about, should continue to behave the same way as if there was no failure.

We also want no disruption or delays to any of the controller workflows in Kubernetes. So what do we do in practice? The first thing we do is pause all updates; our goal is to retain existing capacity. We move all the traffic to the API servers from the unhealthy AZ to the healthy AZs, and this is not just client traffic coming from the outside, it's also long-running connections.

Watches, exec, proxy, all these things that are supported by the Kubernetes API are no longer going to the unhealthy AZ. We move the leader for all the back-end controllers to a healthy AZ, because we don't know if a leader in a bad AZ is going to be able to perform the operations that we need.

We also move the etcd leader from the bad AZ to a good AZ; we just want to make sure that writes are not impacted if the leader was in the bad AZ.

We're working on bringing some of these capabilities to your applications as well, the things that are running on your worker nodes. We're going to be introducing some APIs where you can say, "I want to evacuate my applications from a zone"; that's going to be available sometime next year.
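In the meantime, you can make your own workloads more zone-tolerant today; for example, a pod-spec snippet that spreads replicas across AZs (the app label is a placeholder):

```yaml
# Goes under a Deployment's pod template spec: keep replicas evenly
# spread across zones so losing one AZ loses at most its share.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app
```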

OK. So what I want to demonstrate here is the kinds of things we've implemented, and how they've shown us that we're doing it correctly, at least.

Our biggest priority during a failure is what we call static stability. In the event that there's an outage, our goal is to make sure existing clusters are not impacted; by that, we mean they're statically stable.

This requires us to look at pretty much all of our dependencies and architect around their failure modes. We don't wait for an AZ outage to happen: we simulate them in production. We inject different types of failure modes, such as network outages.

We prevent calls to dependency services, and we inject health check failures, packet drops, various modes of failure, to ensure that we are actually redundant and able to fail away from a zone.

More recently, as I think most of us probably know, we had an outage in us-east-1 on 9/28. This was a zonal event. It was the classic gray failure, where not the entire zone was impacted, only some racks within the zone, and even that was a partial rack failure within the zone.

During this event, we did the root cause analysis, we dissected our performance in that event, and we mitigated impact in our largest region. Customers didn't see impact, and this is for hundreds of thousands of clusters in our largest region.

OK, so let's move on to the scalability of the control plane. EKS automatically scales your cluster to provide you a right-sized control plane. We're consistently looking at signals such as memory, CPU and node count, and we're looking at etcd signals. Once we get an aggregate of all these, we determine what the maximum value should be and we scale your cluster to that max.

It takes approximately 10 minutes to scale the cluster, depending on the exact type of operation that's happening. If etcd needs to be scaled, or if it's just the API server that needs to be scaled, we do them independently, so you don't have to think about which one is getting scaled.

We scale it vertically, and we're adding support for horizontal scaling as well. As part of the scaling operation, one of the things we also do is tune various parameters of the control plane, things like max requests in flight, so you get more throughput as you need it.

As a cluster scales, we also tune the QPS. So if your cluster needs to do more work, the QPS for things like the scheduler and the controller manager goes up as the cluster scales.
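The parameters being tuned here are standard upstream flags; the values below are illustrative only (EKS picks the actual values for you):

```shell
# API server: caps on concurrent read-only and mutating requests.
kube-apiserver --max-requests-inflight=800 --max-mutating-requests-inflight=400

# Controller manager: client-side rate limit toward the API server.
kube-controller-manager --kube-api-qps=100 --kube-api-burst=150
```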

Also, due to the nature of long-lived connections within Kubernetes, we've implemented what we call connection draining as part of Kubernetes. Effectively, what happens over time is that these long-running connections cause traffic imbalance across instances, so what we do is randomly kill these connections.

What the clients do is attempt to reconnect, and they will likely end up on a different API server.

At EKS, we make sure that we're scale testing all the time, so that we're meeting or exceeding customer requirements. We leverage the upstream kube-test framework to run our scale tests, and we also make improvements to the kube-test tests based on things that we learn from our customers.

For example, we recently added support for CRDs, because there's actually a much larger penalty for listing CRDs versus listing things like pods.

We also have the upstream 5,000-node tests running on AWS, and today these are release-informing: they indicate to the upstream release team whether there are any performance regressions.

We test with real nodes, as in real EC2 instances, and real customer workloads, instead of relying just on synthetic testing, which may hide some issues.

Finally, we're part of SIG Scalability, where we're working on improving a lot of the SLOs that upstream has defined, for things that you care about, things like pod launch latency and network connection latency.

OK, let's move on to etcd. In a Kubernetes cluster, etcd is used to store all the config data for the cluster. This includes all the objects, things like pods and namespaces, and any other resource that you create within the cluster. It's a distributed key-value store.

It's a version-based system, so it's not just storing the API object; it's storing every update as a revision to that object in the etcd cluster.

etcd uses RAFT for consensus, which requires three nodes to form a cluster and a minimum of two nodes to maintain quorum. RAFT is kind of the secret behind etcd, providing fault tolerance and strong consistency.

RAFT works by electing a leader and making sure that all writes go to the leader; none of the followers can take a write.

It's designed to be highly available, and it's recommended for small data sets, as opposed to other databases that can grow much larger.
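On a self-managed etcd cluster, you can see this topology directly with etcdctl (EKS runs all of this for you behind the scenes):

```shell
# List the three members of the cluster.
etcdctl member list --write-out=table

# Show per-endpoint DB size, current revision, and which member
# is the RAFT leader (the IS LEADER column).
etcdctl endpoint status --cluster --write-out=table
```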

So why is etcd important to Kubernetes?

It's the essential piece behind Kubernetes for ensuring high availability and consistency of the cluster. It's crucial for maintaining the desired state and monitoring the changes and events that are happening, and it can also be used to help recover from a failure.

It's probably the most important and most complicated piece within Kubernetes to operate. Within EKS, we run etcd as a three-node cluster.

EKS uses static IPs and static volumes: static IPs for the networking, static volumes for the data storage. What we mean by that is the same IP and volume are used irrespective of whether an instance fails.

This is useful for durability as well as for disaster recovery kinds of scenarios.

etcd is fully managed within EKS, and not just the availability aspects but all the maintenance aspects as well, things like the compaction you have to do over time and defragmenting the cluster. All these things are done for you.

We do them behind the scenes. We also right-size etcd, so the same scaling logic we talked about earlier applies to etcd as well.

We also take periodic backups, and these are stored encrypted in S3, so in the event that you need to restore from a backup, or something really bad happens, we're able to do that.

OK, so in 2023 we moved etcd to static volumes. There's a single volume per AZ per cluster, and the lifetime of that volume is the lifetime of the cluster.

So, for example, if an instance in AZ 1 were to fail and get replaced by another instance, the volume is moved to the new instance, and that instance joins the cluster as a follower.

In the event that we lose two out of three instances, or we lose quorum, we don't have to restore from a backup: we are able to take the same volumes, rejoin the cluster and catch up.

This architecture also helps with the time to recovery. When a node joins a cluster, it has to do a snapshot transfer from the leader to the new node, and this can take several minutes if it's a multi-gigabyte database.

With static volumes, one of the things we get is that we don't have to catch up from the beginning from the leader, so in the event of failures, recovery becomes much faster.
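To make that recovery-time difference concrete, here's a toy model (illustrative numbers only, not EKS internals): a brand-new member has to copy a full snapshot from the leader, while a member that inherits the old volume only replays the log entries written since its last applied index.

```python
# Toy model: recovery cost for a replacement etcd member, with and without
# the reattached static volume. Numbers are made up for illustration.
LEADER_INDEX = 1_000_000          # leader's latest applied raft log index
SNAPSHOT_ENTRIES = 1_000_000      # entries represented by a full snapshot

def entries_to_transfer(has_old_volume, volume_index=999_500):
    """How much state a replacement member must pull from the leader."""
    if has_old_volume:
        # the volume already holds state up to volume_index: replay only the tail
        return LEADER_INDEX - volume_index
    # fresh member: full snapshot transfer, potentially multi-gigabyte
    return SNAPSHOT_ENTRIES

print(entries_to_transfer(False), entries_to_transfer(True))  # → 1000000 500
```

The hypothetical `volume_index` stands in for how far behind the surviving volume is; in practice it's usually only a short tail of recent writes, which is why recovery is so much faster.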

OK. So etcd, like I said earlier, is designed for small data sets. By small I mean 10 gigs, 12 gigs, that kind of thing; it's not super small. Because of that, there is a database size limit implemented in etcd.

One of the things we talked about earlier is that there's a separate revision for each version of an object stored in etcd. Each time there's an update to the object, such as a status field changing, etcd actually creates a copy of that entire object and stores it in its database.

So it's not just new things like a pod getting created; it's everything that's happening. If there's a CrashLoopBackOff or something that lasts for an hour, every status change is going to result in an additional revision of that object.

So that can also contribute to the database size itself. When we hit the limit, etcd emits a no-space alarm and becomes read-only, and at that point all requests to mutate objects are rejected.
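Here's a minimal sketch of that MVCC behavior (simplified, not real etcd code): every write stores a full copy of the object at a new revision, so a crash-looping pod inflates the database, and compaction discards old revisions.

```python
# Simplified MVCC store: each put() appends a full copy at a new revision,
# mimicking how etcd grows; compact() discards old history.
class MvccStore:
    def __init__(self):
        self.revisions = []   # (revision, key, full object copy)
        self.rev = 0

    def put(self, key, obj):
        self.rev += 1
        self.revisions.append((self.rev, key, dict(obj)))  # whole copy, not a diff

    def db_size(self):
        # crude stand-in for on-disk size
        return sum(len(str(o)) for _, _, o in self.revisions)

    def compact(self, rev):
        # discard history at or below `rev` (real etcd keeps the latest value per key)
        self.revisions = [r for r in self.revisions if r[0] > rev]

store = MvccStore()
pod = {"name": "web-0", "status": "Running"}
store.put("pods/web-0", pod)
for i in range(1000):                     # an hour of crash-looping
    pod["status"] = "CrashLoopBackOff" if i % 2 == 0 else "Running"
    store.put("pods/web-0", pod)          # each status flip is a new revision

before = store.db_size()
store.compact(store.rev - 1)              # compact away the old revisions
after = store.db_size()
print(len(store.revisions), before > after)  # → 1 True
```

Notice that one pod produced a thousand revisions from status churn alone; compaction reclaims that space only because it was history, not live objects.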

So the only thing you can do at that point is compact and defrag, and that's only going to help if it's the number of revisions that is causing the disk space to go up, right?

If it's actual objects, defrag or compaction cannot help. So this year we added automatic recovery. We built a new flow where, every so often, when we get close to that maximum database size, we accelerate the compaction and defrag that we do, in the event that we can automatically recover the cluster.

So if your cluster has a no-space alarm, it can take up to 15 minutes, but if the recovery workflow can resolve it, it'll be out of alarm within those 15 minutes.

If it can't resolve it, it probably means you have to go delete objects to free up the space.

Ok. Oops.

All right. So I want to go a little bit behind the scenes of how we release software.

Things like platform versions that you see changing underneath. So we'll start with talking about how we minimize impact on releases.

So we're in every region today. Each region is split into cells; these cells are nothing but AWS accounts, and we distribute clusters across these cells.

This isn't a novel idea at all; any large-scale service probably follows a similar pattern.

The important thing is that we limit the number of clusters in a cell to effectively contain our blast radius. So when bad things happen to a cell, we know that only the clusters in that cell will be impacted.

So we have 1,000 clusters per cell, and we do progressive deployments, which we're going to talk about in a minute, to make sure that we're doing this all safely.
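The cell idea can be sketched in a few lines (assumed logic for illustration, not EKS's actual placement code): clusters fill fixed-size cells, so a bad deployment to one cell can only touch the clusters inside it.

```python
# Sketch: distributing clusters across capped cells to bound blast radius.
CELL_CAPACITY = 1000  # per the talk: roughly 1,000 clusters per cell

def place_cluster(cells, cluster_id):
    """Put the cluster in the first cell with room, opening a new cell if needed."""
    for cell in cells:
        if len(cell) < CELL_CAPACITY:
            cell.add(cluster_id)
            return cell
    new_cell = {cluster_id}
    cells.append(new_cell)
    return new_cell

cells = []
for i in range(2500):
    place_cluster(cells, f"cluster-{i}")

# blast radius of a bad release to any one cell is bounded by the cell size
print(len(cells), max(len(c) for c in cells))  # → 3 1000
```

First-fit placement is just the simplest policy to illustrate the cap; the point is the invariant that no cell, and so no single deployment target, ever exceeds the capacity.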

So the way we release platform versions looks something like this pipeline workflow.

We do testing at various layers of the application. Individual components are tested, we have integration tests when they're all packaged together, and then finally we do an end-to-end cluster test.

We also have a broad spectrum of tests that we run: conformance, version updates, platform version updates, and scalability and performance tests.

And finally, there's testing that happens at every stage of the pipeline, so at beta and gamma; it happens in every region, and it happens prior to us promoting to the next stage.

The scale of our fleet is pretty massive; we're talking about several hundreds of thousands of clusters.

And the instances that are running these clusters need software updates, such as OS patching, new versions of containers, and new patch releases of Kubernetes.

So our challenge is balancing safety with velocity, given the number of regions and the number of cells that we're operating.

So to hit the right balance, we've used deployment strategies that AWS has evolved over time.

So we have bake times across each one of these waves, and that bake time gradually decreases as we progress further in the deployment.

And then we also exponentially grow the waves, and at any point in time we can do a rollback.

So this is the visual representation of how your clusters receive new platform versions without impact to any of your workloads.
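A wave schedule like the one just described can be sketched as follows (the numbers are illustrative, not EKS's real configuration): wave sizes grow roughly exponentially while the bake time between waves shrinks as confidence builds.

```python
# Sketch of a progressive deployment schedule: exponentially growing waves,
# gradually decreasing bake times, with a floor on the bake time.
def wave_schedule(total_cells, first_wave=1, growth=2,
                  first_bake_hours=24, min_bake_hours=2):
    waves, remaining = [], total_cells
    size, bake = first_wave, first_bake_hours
    while remaining > 0:
        n = min(size, remaining)
        waves.append((n, bake))          # (cells in this wave, hours to bake)
        remaining -= n
        size *= growth                   # next wave is bigger
        bake = max(min_bake_hours, bake // 2)  # ...and bakes for less time
    return waves

schedule = wave_schedule(total_cells=100)
print(schedule)
# any wave can trigger a rollback before the next one starts
```

The early waves are tiny and bake for a long time, so a bad release is caught while its blast radius is one or two cells; by the time the big waves run, the release has already soaked.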

OK. So after operating EKS for five years, we've seen pretty much everything, the good and the bad, and we've converted a lot of these things into best practices.

So I'm going to talk through some of these.

First, we'll talk about how things can go wrong.

Using Kubernetes comes with a shared responsibility to do certain things correctly.

It's really easy to misconfigure clusters, and sometimes this is due to the vast ecosystem of software that you can install.

Sometimes it's due to not knowing the best practices.

So we've seen issues where customers have been negatively impacted by API priority and fairness, where critical workloads don't get through, and it causes production impact.

Another classic problem is webhooks, which are in the critical path of every request.

So we see customers that don't properly configure webhooks: whether they're failing closed, they don't have enough replicas, or they're not zonally redundant.

I'll give you some tips in the next slide. But we also have bad clients that have impacted the cluster itself.

The core Kubernetes API has different ways that you can design good clients, to leverage things like the cache and to not cause unnecessary load on the API server.

So we have a best practices guide that we've created. It helps you think about and resolve some of the things we talked about.

So today you can monitor the API server with Prometheus, but this isn't enough; you should also be looking at your audit logs.

This is how we do debugging: we start with the audit logs. We have queries that tell us what the most expensive calls are.

We figure out who the top talkers are by identifying the user agents, the clients that are making tons of calls. This is what helps us narrow things down whenever there's a problem, and you can do all of this yourself if you have audit logging enabled.
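A top-talkers aggregation is just counting over audit events. Here's a sketch over a handful of hand-made sample events; the field names (`userAgent`, `verb`, `objectRef`) follow the upstream Kubernetes audit event schema, and in EKS you would run an equivalent aggregation over the audit log stream in CloudWatch Logs.

```python
# Sketch: finding "top talkers" and expensive LIST callers from audit events.
from collections import Counter

sample_events = [
    {"userAgent": "my-operator/v1", "verb": "list", "objectRef": {"resource": "pods"}},
    {"userAgent": "my-operator/v1", "verb": "list", "objectRef": {"resource": "pods"}},
    {"userAgent": "kubectl/v1.28", "verb": "get", "objectRef": {"resource": "nodes"}},
    {"userAgent": "my-operator/v1", "verb": "watch", "objectRef": {"resource": "pods"}},
]

# who calls the API server the most, regardless of verb
top_talkers = Counter(e["userAgent"] for e in sample_events)

# who issues the costly LIST calls, and against which resource
expensive = Counter(
    (e["userAgent"], e["objectRef"]["resource"])
    for e in sample_events if e["verb"] == "list"
)

print(top_talkers.most_common(1))  # → [('my-operator/v1', 3)]
print(expensive.most_common(1))
```

The client names here (`my-operator/v1`) are made up; the takeaway is that once audit logging is on, this kind of grouping by user agent is all it takes to find your noisiest client.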

Preventing 429s: as of 1.25, a lot of the behavior in API priority and fairness has changed.

So we recommend figuring out what your call patterns are and tuning the flow schemas so that they're well suited to your applications.

And then, building well-behaved clients, so optimizing your API usage.

Using things like watch, using paginated requests, eliminating things like cluster-wide list calls, querying by namespace, querying by field selectors.

These are all ways that you're going to reduce the load on the cluster.
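The pagination pattern above can be sketched without a cluster. Here, `fake_list` is a stub standing in for the real Kubernetes LIST verb, which exposes `limit` and `continue` parameters the same way; the point is that the server only has to materialize one page at a time instead of the whole collection.

```python
# Sketch of a well-behaved client: paginated LIST calls via a continue token,
# instead of one cluster-wide list. fake_list is a stub for the real API.
ALL_PODS = [f"pod-{i}" for i in range(250)]

def fake_list(limit, _continue=None):
    """Stub API: return one page plus a continue token, like the LIST verb."""
    start = int(_continue or 0)
    page = ALL_PODS[start:start + limit]
    next_token = str(start + limit) if start + limit < len(ALL_PODS) else None
    return page, next_token

def list_all_paged(limit=100):
    items, token = [], None
    while True:
        page, token = fake_list(limit, token)
        items.extend(page)   # server only materialized `limit` objects per call
        if token is None:
            return items

pods = list_all_paged()
print(len(pods))  # → 250
```

The real client libraries wrap this loop for you; the win is the same either way: bounded memory on the API server per request, instead of one giant response.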

Next, understanding your limits. Limits are everywhere in Kubernetes and in AWS, for good reasons. We recommend designing your applications to honor these limits, things like limiting the number of objects in a single namespace.

Or the number of load balancers in a single cluster, because eventually you're going to hit some limit that will impact you.

And finally, we recommend doing garbage collection.

Things like cleaning up cron jobs, by simple things like using TTLs, and deleting CRDs.

We've seen applications that use CRDs to maintain a lot of their events and event histories, and there's no good reason to do that. Go clean that up.

You can find the things I'm talking about, as well as many more, in the best practices guide that we have here.

OK. So finally, I want to talk a little bit about the community aspects.

All the work we do in EKS is in partnership with the community.

This year we migrated the upstream registry to registry.k8s.io, which is running on ECR on AWS.

We're part of the LTS working group, as Vipin mentioned; we're working to figure out what extended support and long-term support mean.

We have multiple members on the security committee.

We've also moved a lot of the release-informing tests and prow jobs to AWS, and this goes along with our commitment to supporting the community with AWS credits.

There's a lot more we're also doing, like KEPs and bug fixes, where our goal is to leverage the operational muscle we've built managing these clusters to influence the upstream project.

So with that, I'll hand it back over to Vipin. Thanks.

So to wrap up, I'd like to go over a few things that have been keeping us busy.

We are investing in solving the operational pain points of our customers. I hope you appreciate all of the undifferentiated heavy lifting that we are taking away with EKS, and we are continuing to do that. One of the areas we're looking at is taking on more add-on operational software that we can manage.

We also heard from you that you want better visibility into cluster telemetry and observability; Vipul mentioned that a couple of slides earlier.

You've told us that you want to know what is going on inside the control plane, like better access to metrics and logs. Next, opinionated templates.

We have folks at different stages of their Kubernetes journey in this room. Some of you have told us that you'd like to have the flexibility of tweaking the different knobs, but some of you have told us that you'd like us to be opinionated and give you some templates with which you can launch a cluster.

So we're trying to figure out the best way to deliver that to our customers. And we're constantly investing in improving the efficiency and the fleet-level operational work that we do for our clusters, so that we can deliver a great experience to you, and you can in turn deliver a great experience to your end customers.

So we have a public roadmap; it's on GitHub. Everything we do at AWS, including on our team at EKS, is based on what you tell us; we work backwards from what you tell us.

So please let us know through the GitHub roadmap what you'd like to see in EKS and how you'd like to see the future of EKS shape up.

Thank you again. I hope you appreciated the session on how EKS works behind the scenes and all of the operational overhead that you would otherwise have to take on by self-managing your Kubernetes clusters.

We have our LinkedIn profiles up there for myself and Vipul; feel free to reach out to us on LinkedIn. And also, please remember to give us some feedback on this session.

Thank you again for joining us. Have a great rest of your conference. Thank you.
