The future of Amazon EKS

All right, get some energy here. My name is Nathan Tabor. I lead the product team for Kubernetes at AWS. Really excited to be talking to you today about the future of Amazon EKS.

I have two amazing co-presenters with me, Alex and Nova, and they're going to be talking about some really exciting stuff we have. We have some really interesting agenda going on here.

Overall, this is about the future of EKS, but it's really hard to predict the future, right? I don't even know exactly what we're gonna be doing. So my hope through this session is to be giving you a sort of a look into our minds at AWS - how do we think about Kubernetes? How are we thinking about EKS? How are we gonna evolve everything that we're doing?

So, we're gonna talk about why Kubernetes matters. We're gonna talk about what's new and what's coming soon and we're gonna give you a vision for what's next with Kubernetes at AWS.

And along the way, we have our two special guests, Alex and Nova. Alex is from Slack. He's gonna be talking about how they use Kubernetes to scale applications. And Nova is from Anthropic. She's gonna be talking about how they use Kubernetes to scale AI safely at Anthropic.

So, we have a lot to get through. Let's get started.

Great. So, the thing that's kind of driving all of this, the thing that's underpinning the need for us to invest in Kubernetes, is the fact that Kubernetes is being so highly adopted by our customers and by organizations all over the world.

The Cloud Native Computing Foundation, the CNCF, the organization that governs Kubernetes, did a survey last year and they found that 64% of organizations are using Kubernetes in production and another 25% are piloting or evaluating Kubernetes. So that's a really big number and effectively Kubernetes is being adopted at massive scale and as a normal part of organizations' IT systems.

And so why, why are people even using Kubernetes? Why does this thing even matter? When we talk to our customers, we find that there are four things they tend to want to do with Kubernetes. And you may say, yeah, this is absolutely a reason at our organization. There may be some on here that you haven't thought of, or that are less important to you.

But there's four things they want to do - they want to standardize operations in order to move faster. So Kubernetes is being used to accelerate delivery and accelerate change within the organization. They want to reduce fixed expenses, so moving from pre-provisioned static resources to dynamic resources that let you scale up and down quickly and allow you to orchestrate applications at the process level across a whole fleet of hosts.

They want to enable an entire organization, so development teams that are running multiple different kinds of software, and one of the great things about Kubernetes is how portable it is. So maybe these teams are all running on the AWS cloud, maybe they're running in different regions, maybe they're also running on-premises, maybe they're even running on other clouds. I know, I said that - it happens right? Like companies acquire other companies, they inherit different teams, those teams have spun up in a totally different place.

Kubernetes has a standard API layer that allows them to enforce governance, security and other best practices across the entire organization no matter where they're running.

Then they also want to plan for the future and reduce risk. So Kubernetes helps companies think about how do I have something that we think is gonna be very long term, how do I reduce the risk of my investments?

Now risk is not just something that CIOs think about or CEOs think about, all those finance people think about, maybe product managers too, right? But risk is something that actually engineers, we think about as well. As engineers we tend to call it tech debt, right? Like tech debt is risk. If you think about it - I have all this stuff I have to go and fix in the future.

So effectively people are looking at Kubernetes as a way to reduce tech debt.

And EKS - we celebrated the fifth anniversary of EKS being GA in June. And what's crazy is that we actually announced it six years ago here at re:Invent in 2017. And we think that EKS encompasses the best of AWS when it comes to performance, scale, reliability and availability.

And what we found is that more customers run Kubernetes workloads on AWS than anywhere else. The CNCF found in their survey that more organizations were running Kubernetes on AWS than on any other platform.

And I think that's pretty incredible - when we look at our stats, we see that billions of EC2 instance hours have been run by customers every single week to support their Kubernetes workloads. And that's insane because six years ago when we were here at Re:Invent, and I remember being in the room when we announced EKS and people were very, very excited and it was like, hey this is gonna be an awesome big service. And then to go five-six years later and there's literally billions of computing hours every single week and we're underpinning all sorts of applications.

So applications from legacy apps that customers have moved to the cloud, people doing all sorts of amazing AI training, like Generative AI is the latest AI thing, but before that we've had a ton of autonomous vehicle training on EKS, which I think is really freaking cool - people are doing self-driving cars and creating whole simulated environments.

There's a lot of data processing with tools like Spark and Flink. And people are building mobile back-ends and web back-ends and web applications and business software - and the list goes on and on and on of the transformation that we're seeing customers achieve using Kubernetes and using EKS.

Okay, so those are the use cases, those are how people are using it - it's really abstracted though. So I wanna again, I told you I'd kinda give you a picture into my head and how we're thinking about Kubernetes and how we're thinking about how people build with Kubernetes.

And so we look at the use cases, obviously the use cases are very important, but also how they're using Kubernetes really matters - like where is Kubernetes sitting in the stack? And what can we do to take away the undifferentiated heavy lifting that customers are doing?

And so what we're seeing is that to run containerized applications in production doesn't just require Kubernetes and it doesn't just require the AWS cloud - you have this entire stack. I know this is what people tend to call the architecture, but it was a nice diagram I made.

And so we see Kubernetes sitting and abstracting the infrastructure on the AWS cloud, helping you get really consistent reliable access to all that infrastructure, effectively knitting together cloud services into a cohesive API.

And then on top of that, customers are running tools for deployment, observability, governance, traffic management, security, and a whole host of other things. And those things don't have to run in the cluster - a lot of them do, I went and checked, there's 599 open source projects that are part of the CNCF landscape that are being cataloged by the CNCF today, and most of those run inside or with Kubernetes.

There are another, I think, 173 projects that are actually administered by the CNCF - so these are open source, open governance projects. And people are taking all those projects, right? You're taking that set of projects and running them in the cluster, and you can also connect to things that are outside of the cluster.

And so we'll talk about a few of those things today that customers are bringing from AWS and they're connecting to help them do this kind of platform layer.

And then on top of that, people are putting in IDPs - integrated developer portals and platforms, they're doing job workflows, ML training workflows, tooling that basically helps orchestrate the application on top of this entire platform stack that's sitting down there.

And then the applications code and data - those are being packaged as containers and then running through this whole stack.

So customers are doing this whole thing. And so the hint to get inside our head is how do we help customers do this more efficiently? How do we help them set this up more efficiently? How do we bring more of these tools together into a single place so that they don't have to go and run around and qualify and test and validate all the different tools - while continuing to give them full access to that 599 and growing number of projects and tools that they can use?

So three goals that we have for Kubernetes at AWS:

Simplify - like I said - to take away that undifferentiated heavy lifting, to make it easier to get the value out of your stack.

To give you complete access to supercharge your access to the cloud - to make it so you have the most performant options and the most cost effective options available with AWS.

And to standardize - to have compatibility with the Kubernetes innovations and be able to bring all that stuff from the open source community or build stuff yourself and bring it here. And the goal is that you should be able to focus on your expertise - so you shouldn't have to become an expert in something that you're not at your organization in order to get value out of Kubernetes.

We wanna help you get that really fast so that you can focus on what you do best - whether that's building a machine learning application, training a self-driving car, doing a hybrid deployment, running a telecommunications network - whatever you're doing should be the thing that you're able to focus on.

So we're supporting Kubernetes in six ways:

  • We're building distributions of Kubernetes
  • We're helping customers manage it
  • We're building integrations down to the cloud
  • We're developing upstream, helping Kubernetes develop upstream
  • We're participating in Kubernetes security
  • And we're helping with Kubernetes release and validation

And we're also sponsoring - we're sponsoring open source communities like the CNCF - helping them do things like the registry.k8s.io migration which allowed customers to speed up their image pulls safely and securely from CNCF projects.

So EKS is the primary way that we're doing Kubernetes, but it's not the only way that AWS is participating in Kubernetes. EKS is just that little management part - and there's all these other things. And of course EKS we've expanded in six years.

And so we now have EKS to run almost everywhere you can imagine - all the way down from EKS Distro, which is the distribution of Kubernetes that we ship with EKS up in the cloud; you can take that software and run it anywhere. You can also use the EKS Anywhere toolchain to run Kubernetes on-premises, and you can run it on Snow, on Outposts, on Wavelength, on Local Zones.

And of course in the AWS cloud - I don't even think I have time to go through all the things that we've done in the last five years since going GA.

And as the little launch counter shows in the bottom there - in the last five years we've launched over 222 new features and services. And I'm really excited to kinda talk about the things that we've done in a minute and kinda the biggest highlights from 2022.

But before we get there, I want to introduce Alex from Slack and he's going to talk about how they're using EKS to scale in production.

Alex:

Thank you. Hi, thanks everybody for coming this early. My name is Alex Dmitri. I'm a Director of Engineering at Slack and I have the pleasure of leading teams that are responsible for this awesome stuff that I'm gonna walk you through.

Our engineers are obsessed with trying to make our customers' lives better and creating the most scalable and resilient compute infrastructure that we can possibly deliver. We do that with our AWS partners.

So let me walk you through and start our journey - with what our challenges and problems were in the beginning. When we started, we were mostly an EC2 VM shop, and our customers were going through a lot of costly OS upgrades, generating a lot of toil.

Operating system upgrades were very difficult, and customers had to go through that instead of spending time doing what they needed to do - writing code, writing product.

These are the internal customers that we service as an infrastructure team. So configuration management was very tricky. The infrastructure had difficulty scaling. And on top of that, we introduced GovSlack, which we're proud of, and we wanted to enable our internal customers to develop on it. But there are obviously clear compliance requirements that we need to be ready for.

So we asked ourselves: how do we increase developer productivity? How do we make sure we're resilient and scaling, and do so in a way that we don't interrupt our customers?

So Slack was having, as I said, a lot of issues with the configuration management. It was in a monolithic fashion, it was hard to test and sometimes users would just essentially push changes forward in order to deliver and meet the deadlines without having that confidence that a good testing environment could give them.

So with that, we asked ourselves: what is the best platform we can use to move forward? And what is the tool that we can adopt? And so we built an abstraction that we call Bedrock, and we built that abstraction using Amazon EKS.

So what's the quick elevator pitch for Bedrock and the system our engineers put together? It's a platform as a service that leverages EKS and essentially handles your build, your deployment, and your runtime environment. Our team wrote this amazing Bedrock API, a CLI, a runtime environment, and also a Golang library - you can see the features that this abstraction provides throughout the whole life cycle of an application.

And what our promise is, and what we want users to have, is essentially a single YAML file - a user interface that is easy to use and easy to kick off to get you going. Essentially our goal is: how do we get you from your idea to production in the quickest possible time, with the confidence and the scale that we can give you with Amazon?

So, as you can see quickly from this user interface, we define images and Dockerfiles. We also define the cluster where you run - my teams also run the clusters - and we run our own EC2 nodes, specifically because we use a software-defined firewall at Slack called Nebula. We want to maintain our zero-trust network with both our internal and external customers to make sure that there's security at scale. And so we abstract all the network complexity, we abstract all of the shared VPC necessities that you have in order to communicate with other applications. And we do it in that simple YAML file - just opening some ports and then defining your Consul entry for service discoverability.

So again, the magic behind the scenes: you kick it off - this is a simple example, obviously, but a 17-line YAML file - and it's gonna build your Docker images, publish them to the registry, call Jenkins to run a pipeline, and then deploy your container to a specified cluster, register itself with service discovery, FQDN, all the good stuff, and then you're up and running and good to go. That's the promise, essentially. That's what we wanted to do. We want to give you the least amount of overhead internally - if you're a platform engineer writing features for Slack for the world to use, we want you to get there as soon as possible.
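Slack's Bedrock manifest schema is internal and not shown in full here, so the following is a purely hypothetical sketch of what a single-file app manifest along these lines might look like - every field name below is invented for illustration:

```yaml
# Hypothetical Bedrock-style manifest - field names are invented for illustration
name: checkout-service
cluster: bedrock-prod-us-east-1     # target EKS cluster (assumed field)
image:
  dockerfile: ./Dockerfile          # built and pushed to the registry by the platform
build:
  pipeline: jenkins                 # CI pipeline the platform kicks off
deploy:
  replicas: 3
network:
  ports:
    - 8080                          # ports opened through the firewall layer
  consul:
    service: checkout               # service-discovery registration (FQDN derived)
```

The point of such a file is the one Alex makes above: the developer declares intent in one place, and the platform handles build, registry, pipeline, deployment, and service discovery behind the scenes.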

So what does it look like? Essentially we have a bunch of other teams and services - like Consul for service discovery and our proxies; these are partner teams that we work with - and our Nebula software-defined firewall. We provide visibility with Grafana and Prometheus. And inside our offering specifically - I'm not gonna go through all of the different components - what I wanna call out is the API that our team wrote and the CLI that gets used to interact with Bedrock itself and EKS behind the scenes, including builds, deploys, dashboards, and also what we call kube bastions, basically SSH VMs that allow us to maintain, again, that zero-trust security.

So if you have to operate, if you're a power user and you wanna interact closely with Kubernetes, you have the ability to do that. One thing I wanna call out is Karpenter, which we're gonna talk about - a tool that we introduced recently and that we really love. We onboarded onto Karpenter probably six or seven months ago and we adopted it at full speed. It's been great for several reasons.

The stack that you see on screen there is essentially what we do when we run a new EC2 node that comes up to speed for a cluster. Chef is our configuration management, and we use Poptart, an internal tool, to essentially provision the nodes - nodes get generated, they go to the right cluster. We love the fact that Karpenter allows us to be more dynamic and flexible. We currently operate on the notion that all of our nodes are immutable and that we don't leave them around for more than 14 days. We do this with annotations on our nodes, and then we also apply some chaos engineering in the background when we test to make sure that everything is OK.

So what did we find out that was good about using Karpenter, and why does it matter for us? For instance, dynamic instance selection is really important for us - not only to scale, but also to be right-sized and efficient. Right now, more than at any other time, I'm sure all of you are concerned about FinOps and cost savings on your cloud. That allows us to be lean and effective in what we wanna do, and to generate some cost savings.

At the same time, we also avoid some awkward incidents that could be generated by manual modifications to auto scaling groups, or some Terraform snafus, or something like that. Having Karpenter let us scale, and do so dynamically, has been super, super awesome.

So again: faster node drains, and also quicker recovery in case of incidents. If somebody nukes a cluster - which can happen - it's a lot easier to come back online with almost no disruption. And lower wait times: Karpenter plugs right into the EC2 APIs, so interactions are a lot faster. And in general, as long as AWS gives us the capacity, we just scale and we're good to go.

So again, a quick note on cluster upgrades. This is why, like I said before, we wanna make sure that all of the scaling and the things we do are completely transparent to our customers. We have this process that we've built to push our Kubernetes upgrades. Nathan was saying earlier, you've seen all the EKS features that get released constantly. The Kubernetes versions that come out are very frequent, and so it's critical for us to be up to date as soon as a release comes out.

So we're on the good stuff as soon as possible. There's a process for upgrades with eksctl and kubectl, and then basically we go through a soaking period in our clusters to make sure that there are no issues. We push to 10, 25, 50, then 100%, following our environments of sandbox, canary, dev, and prod. And this also gives us a lot of flexibility to roll back if something were to happen.

So this is what it looks like - essentially this abstraction that we're so proud we built. We have our API and our CLI deployments, and all the management that we do very dynamically with Karpenter for our EC2 infrastructure.

The good note here is that some of our engineers are in the room, so I wanna give them a shout-out. This was a year-and-a-half process where we did a migration, and essentially now 80% of Slack's applications in the world run on Amazon EKS. You can see the numbers - we couldn't even count the memory, it was too much. So, a note on the win.

So why did we love it? Why do our customers love it? Specifically, your infrastructure as code is now in your repo - in your product, in your code. You don't have the monolithic approach where it's difficult to test; you can create your development environment by quickly launching some containers. It's easier for us to do balancing, which is also at the core of our strategy around disaster recovery - with Kubernetes and EKS, and with Karpenter, we can easily do that, and the redistribution for zonal resilience.

A note on Gov: as I said earlier, it's great that we're offering this, but the overhead for an internal customer was pretty large, specifically when you think about running on EC2 and having to profile your application and your server using AppArmor. We do that for you now - we manage your EC2 nodes and containers and profile them for you. You don't have to worry about the OS upgrades; you don't have to worry about AppArmor at all.

And then a last note: as I said, FinOps is really important right now. We adopted Kubecost, which is an awesome tool, and we adopted a chargeback model for our clusters. So if you live on one of our clusters on EKS, you know exactly how much you spend. And Nathan, thank you.

Nathan:

Thank you, Alex.

So, Alex was talking about zonal resiliency and all the things that they've done to scale Slack on EKS. And one of the biggest things that we've done on EKS in the last few years is expand our global availability. And so today, EKS is now available in every single AWS geographic region.

It's available in 102 Availability Zones, 33 Local Zones, and 29 Wavelength Zones, and a good amount of those were launched last year. We introduced four new regions last year, plus all the different Local Zones and things like that. And there are five announced regions coming soon, and EKS is committed to being in all of those regions either at launch or very shortly after launch - we're gonna be right there. So as AWS expands, EKS is expanding with AWS; we want to be everywhere that customers want to run.

So I'm gonna spend the next little bit here talking about some of the things that we've done really recently, and also one or two things that are coming soon.

So, in addition to all the regional availability, we've also added a lot of support for Amazon EKS Anywhere over the last year. EKS Anywhere now ships with Snow devices, so you can actually order a Snow device and do edge computing with Kubernetes, with EKS Anywhere preloaded on that Snow device.

We also allowed customers to start provisioning bare metal nodes with Tinkerbell - using bare metal racks instead of VMware or another type of hypervisor solution.

And then we also announced self-service subscriptions in the EKS console. Previously, subscribing to EKS Anywhere was admittedly a manual process and you had to talk to AWS. Now, if you want to use EKS Anywhere in production, you can subscribe in the AWS console.

It works really similarly to other AWS services, and that's available right now. In addition to regions, versions are really important for Kubernetes.

So the upstream Kubernetes project ships a new version three times a year, and admittedly, if you look at this chart, EKS was pretty far behind. As we scaled and built infrastructure to run more features in more regions, we weren't necessarily shipping new versions very quickly.

And we made a big investment in fixing that in 2023. We reduced our upstream launch delay from 243 days to 42 days - and we really like 42 days; I don't know if we're ever going to go much faster. And we built a bunch of automation that allows us to qualify new versions internally in days instead of weeks or months.

And in 2023 we launched five versions in order to catch up with upstream Kubernetes. So this was a huge launch year for us - we launched all these new versions, we caught up with the project, and that created another problem.

By the way, my team wanted me to make sure you know that a graph going down and to the right is a good thing here - not a bad thing. We like that.

And as we launched all these versions, customers started talking to us and saying: oh my gosh, you're launching versions really fast. This is a problem for us, because we have to upgrade to the next version.

And so we're really excited to announce this year extended version support for Amazon EKS. Starting with Kubernetes 1.23, you can run a Kubernetes version for up to 26 months - that's an additional 12 months beyond the standard upstream support cycle.

You can create clusters on these extended versions and upgrade them at any time, and you get AWS security patching and full AWS support. This is available today in preview for all 1.23 clusters, and we'll be making it GA for 1.23 and higher starting in January next year.

We also announced an integration with AWS Managed Services Accelerate for EKS. AMS provides a suite of hands-on SRE support. This is what we saw: some customers told us, we just don't have the people or the expertise to do upgrades across all these clusters.

So if you're struggling with an upgrade, AWS now has a team of SRE experts who will come in and help you upgrade your clusters. They have a whole monitoring and automation suite that they bring in. So it's not necessarily a product feature, but it's a really important part of the AWS offering, and we're pretty excited about being able to give this to customers.

OK, so that's all the upgrades and the versions. And yesterday we just announced a really cool feature for Kubernetes security that we call EKS Pod Identity. This is an enhancement over the previous IAM roles for service accounts, where AWS will actually vend the endpoint that does token authentication inside of the cluster, allowing you to more easily map and leverage IAM roles for pods inside your Kubernetes clusters.

It also supports session tags, which allow you to reduce the number of roles and policies that you write and dynamically scope access to different resources based on tags in AWS. So, really excited about Pod Identity - we just launched it yesterday; it's one of the first launches at re:Invent.

We have a great blog on this online - I encourage folks to go check it out. Coming soon, we have another IAM feature called Cluster Access Management. And this solves a huge pain point, which is the aws-auth ConfigMap inside of the cluster.

Today, when you want to configure IAM access to the cluster, you have to go and configure and set up that aws-auth ConfigMap and keep it up to date. Cluster Access Management will be shipping soon, and it allows you to configure all the IAM access to the cluster using AWS APIs.

So when you provision the cluster, you can set up everything in Terraform or CloudFormation - all the different roles that you want to have access to the cluster. And when the cluster comes up and is online, you don't have to take that second step of setting up the aws-auth ConfigMap.
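For context, this is roughly what the aws-auth ConfigMap being replaced looks like today. The account ID and role name below are placeholders:

```yaml
# The aws-auth ConfigMap that maps IAM principals to Kubernetes identities today.
# Cluster Access Management moves this mapping out to AWS APIs instead.
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/NodeInstanceRole   # placeholder account/role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```

A typo in this file, or deleting it outright, is exactly the kind of mistake that can lock administrators out of a cluster.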

The other added benefit is that it prevents lockout. No more accidentally fat-fingering a role in the aws-auth ConfigMap - and we've had customers who deleted the aws-auth ConfigMap and locked themselves out of the cluster.

That's all easily editable through the AWS APIs at any time.

There we go. Also in 2023, we launched network policy support in the VPC CNI. It provides Kubernetes traffic isolation for applications. Previously, customers had to run a separate network policy agent inside the cluster - Calico or another type of agent - alongside the CNI.

Now we've introduced a new eBPF data plane that supports the CNI and allows you to implement native Kubernetes network policies automatically. You just write your policies as standard NetworkPolicy objects, and the CNI will enforce the policies you define without you having to run anything extra.
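These are the standard upstream NetworkPolicy objects; a minimal sketch (the namespace, labels, and port below are placeholders) that only lets frontend pods reach api pods on TCP 8080 might look like:

```yaml
# Standard Kubernetes NetworkPolicy - enforced by the VPC CNI's eBPF data plane
# once network policy support is enabled. All names/labels are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: api            # policy applies to the api pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

Because these are plain upstream objects, the same manifests work whether the enforcement engine is the VPC CNI, Calico, or another implementation.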

Additionally, we're really excited about the eBPF policy engine and the data plane that we built into the CNI, and you can expect to see us do a lot more with that eBPF data plane layer now that it's running in every single cluster.

In 2023 we also expanded the AWS Controllers for Kubernetes (ACK). These controllers allow you to use Kubernetes to define the AWS resources that your applications need, directly within the cluster. Today we have 21 services in GA and 11 more in preview, and we added EventBridge, SQS, and SNS really recently in 2023.
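As a sketch of what that looks like in practice - assuming the ACK SQS controller is installed in the cluster, and checking the controller's own reference for exact fields - an SQS queue can be declared as a Kubernetes object:

```yaml
# Sketch of an ACK custom resource: the SQS controller reconciles this object
# into a real SQS queue in your AWS account. Names are placeholders.
apiVersion: sqs.services.k8s.aws/v1alpha1
kind: Queue
metadata:
  name: orders-queue
spec:
  queueName: orders-queue   # the SQS queue to create/manage
```

Applying this with kubectl lets the application and its AWS dependencies live in the same manifests and GitOps flow.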

And then also we have Amazon EKS add-ons. EKS add-ons allow us to bring all the capabilities that you need into the cluster. Our vision for this product is that you have everything you'd want to run in the cluster - from open source, from AWS, and from third-party vendors - brought together with a simple API when you provision Kubernetes.

So in 2023 we made it easier to configure add-ons before they launch, and also to subscribe to Marketplace add-ons directly from EKS. Those Marketplace add-ons were previously available, but they needed a really complicated, out-of-band provisioning flow where you had to go into Marketplace and check different boxes and buttons.

And so now, when you provision those add-ons from Marketplace, you can subscribe directly as part of EKS cluster creation. We also expanded our catalog: we added the EFS CSI driver, we added the GuardDuty agent, and we doubled the number of Marketplace add-ons that are supported.

We have a lot of vendors who are building, and actually, I saw this morning that we added yet another - Weave GitOps Enterprise was added just this morning to the add-ons catalog. So this catalog is getting bigger every single day, and we're excited to see it grow and grow.
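One declarative way to request add-ons at cluster creation is an eksctl ClusterConfig - a sketch, assuming the eksctl addons section and these add-on names (the cluster name and region are placeholders):

```yaml
# eksctl ClusterConfig sketch: add-ons are installed as part of cluster creation
# instead of as a separate manual step. Cluster name/region are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster
  region: us-west-2
addons:
  - name: vpc-cni              # networking
  - name: aws-efs-csi-driver   # EFS storage, added to the catalog in 2023
  - name: aws-guardduty-agent  # GuardDuty runtime monitoring agent
```

The same add-ons can also be managed after the fact through the EKS console or APIs.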

We've also, in the last few weeks, launched two amazing observability solutions for Amazon EKS. A few weeks ago, we launched enhanced CloudWatch observability, which gives you automatic full-cluster dashboards and integrated metrics and logging from Kubernetes clusters, available in CloudWatch.

And this week at re:Invent, we're announcing agentless Prometheus metric collection. Amazon Managed Service for Prometheus and EKS collaborated, and you can now have agentless Prometheus metric collection that pushes directly to the managed service. The collection runs automatically - there's no agent for you to install, manage, or scale. It will natively scrape Prometheus metrics inside the cluster and ship them out to Amazon Managed Service for Prometheus, without you having to worry about how they're collected or about keeping those processes up to date.

And then, Alex was mentioning Karpenter. We're really excited about Karpenter - we've seen this project grow significantly in the last year. Karpenter provides automatic node provisioning and cluster consolidation for Kubernetes.

It looks at the applications that you're running and automatically provisions nodes based on the needs of your applications. And in 2023 we added a number of stability improvements: we added drift reconciliation, we brought the APIs to v1beta1, and we're moving Karpenter to the CNCF as part of SIG Autoscaling - making it part of the Kubernetes project, where it's going to sit alongside Cluster Autoscaler.
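In the v1beta1 API mentioned above, provisioning constraints live in a NodePool resource - a minimal sketch (the referenced EC2NodeClass named "default" is assumed to exist separately):

```yaml
# Karpenter v1beta1 NodePool sketch: Karpenter picks instance types that satisfy
# these requirements and consolidates underutilized nodes automatically.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]      # let Karpenter use Spot when it fits
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
      nodeClassRef:
        name: default                        # assumed EC2NodeClass defined elsewhere
  disruption:
    consolidationPolicy: WhenUnderutilized   # enables cluster consolidation
```

This is the "dynamic instance selection" Alex described: instead of managing fixed auto scaling groups, you state constraints and let the controller choose.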

And this may seem really weird, but we're super excited - at least I am super excited - that Azure actually launched a cloud provider for Karpenter this year as well. We're excited to see that community grow and to have more cloud providers in more places, because that's one of those things that reduces risk and helps people use Karpenter in more places.

So we're really excited to see this project join the CNCF, and excited to see more customers like Slack continue to adopt Karpenter at scale.
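To show what "specify needs, not nodes" looks like in practice, here's a sketch of a Karpenter v1beta1 NodePool, expressed as a Python dict standing in for the YAML manifest you'd actually apply. The field names follow the v1beta1 API mentioned above; the values are illustrative:

```python
# Sketch of a Karpenter v1beta1 NodePool as a Python dict (normally YAML).
nodepool = {
    "apiVersion": "karpenter.sh/v1beta1",
    "kind": "NodePool",
    "metadata": {"name": "default"},
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    # Let Karpenter choose spot or on-demand capacity...
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In", "values": ["spot", "on-demand"]},
                    # ...and either CPU architecture.
                    {"key": "kubernetes.io/arch",
                     "operator": "In", "values": ["amd64", "arm64"]},
                ],
            },
        },
        # Consolidation repacks pods onto fewer/cheaper nodes over time.
        "disruption": {"consolidationPolicy": "WhenUnderutilized"},
    },
}

def capacity_types(np: dict) -> list:
    """Pull the allowed capacity types out of a NodePool dict."""
    for req in np["spec"]["template"]["spec"]["requirements"]:
        if req["key"] == "karpenter.sh/capacity-type":
            return req["values"]
    return []

print(capacity_types(nodepool))  # ['spot', 'on-demand']
```

Note there's no instance type pinned anywhere: Karpenter picks instances that satisfy the requirements and the pending pods' resource requests.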

And then finally, we talked a little bit earlier about how EKS is used really heavily for machine learning. So in the last year, we've invested in machine learning on EKS as well: we've added the latest NVIDIA and Neuron drivers, and we've added support for the P5 instances and Elastic Fabric Adapter, or EFA.

There's also a new EC2 feature called Capacity Block Reservations that works with EKS. And this week, we're announcing the Mountpoint for Amazon S3 CSI driver. Mountpoint for Amazon S3 is a high-performance file system interface on top of S3, and we've written a CSI driver for EKS that you can run inside of your Kubernetes clusters to get access to that performant S3 storage.
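For reference, mounting a bucket through the Mountpoint CSI driver is typically done with a static PersistentVolume. Here's a sketch of one, as a Python dict standing in for YAML; the bucket name is a placeholder, and you should check the driver's docs for the current attribute keys:

```python
# Sketch of a static PersistentVolume backed by the Mountpoint for
# Amazon S3 CSI driver. "my-training-data" is an illustrative bucket name.
pv = {
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {"name": "s3-pv"},
    "spec": {
        # Capacity is required by Kubernetes but not enforced by S3.
        "capacity": {"storage": "1200Gi"},
        "accessModes": ["ReadWriteMany"],
        "csi": {
            "driver": "s3.csi.aws.com",
            "volumeHandle": "s3-csi-volume",
            "volumeAttributes": {"bucketName": "my-training-data"},
        },
    },
}

print(pv["spec"]["csi"]["driver"])  # s3.csi.aws.com
```

A pod then claims this PV through an ordinary PersistentVolumeClaim and reads the bucket as files, while S3 remains the durable source of truth.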

And in the latest accelerated AMIs, we're launching with NVMe mounted by default. So we're really excited to see more customers work with machine learning. We have a whole set of resources for customers doing data and machine learning on EKS, called Data on EKS.

There's a QR code here on the slide, and to talk a little bit more about that, I have Nova from Anthropic.

Yeah, thanks, Nathan.

Cool. So, just a quick second about Anthropic. Some of you have probably heard about us. We do AI safety and research, and we're building reliable, interpretable, and steerable systems. We usually go with helpful, harmless, and honest.

We develop systems like Claude and try to bring those frontier models to the world through responsible deployment and by regularly sharing our safety insights.

And we use a lot of EKS. 100% of what we work on today is on EKS. We train our LLMs, scaling to tens of thousands of pods in individual jobs, and we utilize things like Karpenter for flexible instance types and cost optimization, as well as integrating with S3, to use the best of AWS where we can.

And training LLMs is probably what you're here to hear about. So here's a quick slide that I got Claude to throw together for us.

We just released function calling, so I put all of our Terraform into it for this slide.

The way that we use EKS and S3 together today is we have this flow where our raw data is tokenized using spot EC2 instances that are provisioned on demand using Karpenter. That helps us cut costs, as well as make sure that the capacity we need for our petabyte-scale tokenization jobs is there when we need it.

Those tokens flow back into S3, and then come back into those accelerated EC2 instances. You can see these two are working pretty well hand in hand.

Today we use P4d's as well as P5's and Trainium. And then those model checkpoints flow back into S3 again. So you can see that these two are working hand in hand to make that training happen.
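The flow described here (raw data tokenized on spot capacity, tokens back to S3, accelerated instances training and checkpointing to S3) can be sketched as a toy loop. An in-memory dict stands in for the S3 bucket, and all key names and step logic are illustrative, not Anthropic's actual pipeline:

```python
# Toy sketch of the S3 <-> EKS training loop: tokenize on spot capacity,
# write tokens to "S3", train on accelerated capacity, checkpoint to "S3".
bucket = {"raw/shard-0": "the quick brown fox"}  # stand-in for an S3 bucket

def tokenize(key: str) -> str:
    """Spot-friendly step: turn raw text into tokens, write back to S3."""
    tokens = bucket[key].split()
    out_key = key.replace("raw/", "tokens/")
    bucket[out_key] = tokens
    return out_key

def train(token_key: str, step: int) -> str:
    """Accelerated step: consume tokens, emit a checkpoint to S3."""
    ckpt_key = f"checkpoints/step-{step}"
    bucket[ckpt_key] = {"tokens_seen": len(bucket[token_key])}
    return ckpt_key

ckpt = train(tokenize("raw/shard-0"), step=1)
print(bucket[ckpt])  # {'tokens_seen': 4}
```

The point of the shape is that each stage only talks to durable storage, so either stage can be scaled, preempted, or restarted independently.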

Can I get the next slide? Yeah.

So, to go into a little bit more detail here: we actually don't use any CRDs today, because we want to be able to use Kubernetes wherever we can.

All of our training runs on one giant StatefulSet, across thousands and thousands of pods, and EKS is able to scale with that, which is really exciting for us, especially as the training runs scale. The scaling never stops.

In terms of flexible instance types: we don't want to think about hardware unless it's GPUs. So for data processing, we're able to use multiple instance classes without having to think about that.

We specify our needs in Karpenter instead of instance types. Our job is not to buy a particular EC2 class; our job is to buy enough CPU and memory for that Spark job to run.

And we can use spot instances there to reduce costs, which I'm sure our finance folks are pretty happy about, as well as on-demand and reserved instances when we need that reliability to be there, and Karpenter can do all of those.
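The "buy CPU and memory, not an EC2 class" idea shows up directly in the pod spec: the workload states its resource needs and a capacity preference, and Karpenter picks the instance. A sketch as a Python dict standing in for pod-spec YAML, with an illustrative container name and image:

```python
# Sketch: the pod declares needs, not instance types. Karpenter matches
# these requests against its NodePool requirements and picks hardware.
pod_spec = {
    "nodeSelector": {
        # Prefer cheap, interruptible capacity for the data-processing job.
        "karpenter.sh/capacity-type": "spot",
    },
    "containers": [{
        "name": "spark-worker",   # illustrative name
        "image": "spark:3.5",     # illustrative image
        "resources": {
            # What the job actually needs; no EC2 class is mentioned.
            "requests": {"cpu": "8", "memory": "32Gi"},
        },
    }],
}

print(pod_spec["containers"][0]["resources"]["requests"])
```

Any instance family with 8 vCPUs and 32 GiB to spare satisfies this, which is what gives Karpenter the freedom to shop across spot pools.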

Moving forward, here's the big number: we've got a 40% year-over-year cost reduction compared to using on-demand instances, with no Cluster Autoscaler tweaking required.

I'm sure many of you have experienced the pain of having many cluster Auto Scaling groups in a single cluster, and having Cluster Autoscaler decide to get stuck on a particular one of them. Karpenter doesn't have that problem, which is very exciting for us.

And so why would you use S3 and EKS together? It's nice to have a single source of truth. You can use PVs and PVCs and keep your volumes inside of EKS.

But it's nice to have a single source of truth, and we've managed to get high-speed, durable object storage out of S3 and have all of our workloads use that rather than PVs.

And honestly, I think management is happy with that, because PVs can get stuck, PVs can run out of space, and S3 does neither of those things.

So here are a couple of tips and tricks about how we currently use this. First: stay simple.

You get a lot of power out of Kubernetes. You mentioned, I think, 599 integrations that you can have on top of Kubernetes.

But you can get pretty far with just what Kubernetes gives you out of the box. We've trained Claude, Claude 1, Claude 2, Claude 2.1, all of these on StatefulSets. So you can get pretty far with that.

You should optimize for preemption. One of the powerful things about running on a cloud platform in general is that you have the ability to scale up infrastructure on demand, but that's not going to get you anywhere if your workload isn't able to resume from a preemption.

Also, as you scale, your instances will disappear. This is a fact of life. Optimizing for preemption lets you take advantage of larger amounts of infrastructure without problems.
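"Optimize for preemption" boils down to: work resumes from the last durable checkpoint instead of restarting, so losing a spot instance only costs the steps since the last save. A minimal sketch, with illustrative names and a dict standing in for a checkpoint stored in S3:

```python
# Sketch of preemption-tolerant work: advance from the checkpointed step,
# save periodically, and optionally "die" mid-run to simulate a spot reclaim.
def run_job(total_steps, checkpoint, preempt_at=None, save_every=10):
    step = checkpoint.get("step", 0)
    while step < total_steps:
        step += 1                       # one unit of (pretend) work
        if step % save_every == 0:
            checkpoint["step"] = step   # durable save (S3 in real life)
        if step == preempt_at:          # spot instance reclaimed here
            return checkpoint, False
    checkpoint["step"] = step
    return checkpoint, True

ckpt, done = run_job(25, {}, preempt_at=17)  # interrupted at step 17
print(ckpt, done)                            # {'step': 10} False
ckpt, done = run_job(25, ckpt)               # retry resumes from step 10
print(ckpt, done)                            # {'step': 25} True
```

The retry only redoes steps 11 through 25 rather than starting over, which is what makes large fleets of interruptible instances usable.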

State management is key. You should be thinking about your state separately from your application. That's a pretty specific tip, but when you're using something like Kubernetes, you want your application to be able to scale separately from your state.

Using S3 and EKS together means that you can have your state in S3, have it be durable, and not have to worry about scaling it, while you scale the number of cores or GPUs separately from that. And you should be thinking async.

If you tie your application together too closely, you're just going to have problems there. But specifically with S3, the most powerful thing you can do is scale horizontally: scale over multiple prefixes and things like that.

And if your application can take advantage of something like an asynchronous queue, then it doesn't matter what the latency is on S3, because you're in the throughput-bound regime.
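To illustrate "throughput-bound, not latency-bound": if each S3 GET has high latency, issuing many requests concurrently (for example across key prefixes) makes total runtime depend on aggregate throughput rather than per-request latency. A self-contained sketch, where `time.sleep` stands in for per-request S3 round-trip time:

```python
# Sketch: concurrent fetches hide per-object latency. With 20 workers,
# 20 simulated 50ms requests finish in roughly one latency, not twenty.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(key: str) -> str:
    time.sleep(0.05)              # pretend S3 round-trip latency
    return f"data-for-{key}"

# Keys spread across prefixes, as the horizontal-scaling advice suggests.
keys = [f"prefix-{i}/part-0" for i in range(20)]

start = time.time()
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch, keys))
elapsed = time.time() - start

# Serially this would take about 20 * 0.05s = 1s; concurrently ~0.05s.
print(len(results), elapsed < 0.5)
```

The same principle applies whether the concurrency comes from threads, asyncio, or simply many pods each owning a slice of the key space.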

And that's something that we take a lot of advantage of. And yeah, so I'm going to pass back to Nathan. Awesome. Thank you.

OK. We've heard a lot about how Anthropic and Slack are building their systems. I talked a little bit about what we've been doing over the last few years.

I want to talk a little bit about what's coming and how we're thinking about the future with EKS.

So, maybe I'd like to predict what we're doing next. What do the next five years look like for EKS? I wouldn't say this is a roadmap, but it's the things that we're thinking about, the things that we're talking about on the team every single day.

So the first thing we're spending a lot of time thinking about is: how do we manage more things? What are the things that we see customers doing over and over again in the same exact way that we could go ahead and do for them?

Kubernetes is a really obvious one. When we built EKS, we had all these customers that were starting to run Kubernetes on EC2, and they were using a number of different tools to do it. And so we said, OK, let's help them do that more efficiently. Let's just build a control plane and run that for customers.

So we're looking at where that same pattern repeats over and over again. What are those things that customers are all running in the same way? Are they policy things? Are they CI/CD, how we get configuration managed onto the cluster, for example?

And what we see a lot of our customers doing is building this thing that I like to call the cluster vending machine. It's a big machine that you walk up to, and you press the button, or maybe you make an API call, and out pops a cluster on the other side.

Some organizations are building this and letting development teams pull the lever and vend their own clusters. Sometimes it's fully abstracted from development teams, and you have a systems team that's vending clusters internally for themselves.

So that's one thing I spend a lot of time thinking about: what should that cluster vending machine look like? In some ways EKS is a cluster vending machine, but it doesn't necessarily have all the different parts, and customers still have to spend a lot of time doing things that are probably the same things every other customer has to do.

And one of those things is upgrades, right? Upgrade support and automation: how do we help customers upgrade more easily? How do we give them support to do those things? And so we were really excited to launch extended version support.

And the goal there is to give customers the ability to upgrade once per year. That's really the big thing we're pushing for: how can I upgrade once per year?

So in the future, you can expect to see a bunch of stuff around that. How do we give you insights to let you know that it's safe to push that button? Because if you're going to upgrade once per year, you're skipping versions. How do you know things aren't going to fail when you go two versions ahead?

How do you actually just push that button once and then move everything in the cluster together?

How do you start to true up a cluster? If you're like me, you give it a little bit of time between going and looking at your clusters and all the things that are running in them.

I open up my cluster maybe once a week, though I'm not running a production system on it. And I'm like, oh my gosh, everything's out of date, right? These node groups have fallen out of date, these add-ons are out of date, this software needs a patch, and there's all this stuff.

So how do we help you true that up more regularly? How do we help keep all those components up to date?

How do we give you validation? How do we give you verification? How do we give you confidence to push the button? And then how do we make all of that easier as you go?

And then obviously more managed compute and more managed components. How do we bring Karpenter to more clusters? How do we give customers the ability to just roll out applications and have things just run and just work, but do that in a Kubernetes-conformant, upstream-conformant way, where you can take any one of those hundreds of open-source or vendor projects and run them really easily when you want to?

So, more things managed. That's the first thing we're looking at. And the second thing is: how do we bring Kubernetes, in a more managed way, to all these places: AWS, the data center, and the edge?

So, building deeper bidirectional integrations with the different AWS cloud services. We've had a lot of great integrations that I didn't even get a chance to talk about today, things like GuardDuty, where GuardDuty does runtime detection across thousands of customer Kubernetes clusters and aggregates those findings. If there's a threat attack on one customer's cluster in one part of the world, and we detect that same attack on another customer's cluster in a different part of the world, we can actually say, hey, this is the same threat attack, and we can warn those customers and help them take action.

So how do we build these really deep bidirectional integrations, where if you want to enable higher-level insights into your clusters, you can leverage our experience at scale to run more safely, more securely, and at higher scale with AWS? And how do we enable more seamless migrations, helping customers move workloads onto Kubernetes, both in the data center and in the cloud?

So, moving virtual machine workloads to Kubernetes, but also things like Pivotal Cloud Foundry, things customers are running on premises, legacy Java apps: how do we help you better move things to microservices? Some of these are simple tools, and some will require machine learning, but these are the things that our teams are looking at.

And then also better hybrid node and cluster management. Today, EKS Anywhere is a CLI tool that you run on premises, and it helps you run Kubernetes clusters on prem. But we think we can do better than that. We think we can offer more solutions and make it easier for customers to run clusters in the data center, and help them move workloads between data centers and clouds more easily.

And then obviously, efficiency is a big deal, and we've seen a lot of customers adopting things like Kubecost and Karpenter to achieve efficiency at scale. I think there are two things here. One is: how do we take the insights we're getting from looking across all these customers about how to better tune applications, and help customers start to tune their Kubernetes clusters? Again, that's just something we're thinking about: how do we do that more effectively? And then also, bringing together centralized observability and troubleshooting.

I talked about how we have the Amazon Managed Service for Prometheus agentless collector, and also how CloudWatch has a full-stack solution for Kubernetes. How do we bring that together with EKS? How do we allow customers to choose either CloudWatch or Prometheus, depending on their preferences, bring that together across all their clusters and all their applications, and let them dive in and troubleshoot whenever there's a problem?

Doing things like SlackOps for troubleshooting, where you get alerted proactively to an issue, and helping customers set that up. We see that every customer needs observability and every customer is looking to improve efficiency, so those are the places where we're looking to invest.

My team also put together a really nice slide focusing on what we're doing for machine learning, just because we have so many customers working on machine learning. We talked a little bit about the NVMe and S3 work. We're also going to be rolling out Amazon Linux 2023 support in the coming year; that's an important investment for the team.

But we're also looking at how we use Karpenter for additional ML use cases. And it may not be immediately obvious, because so many people are using Karpenter for so many different things, but we originally built Karpenter to work really well for batch processing and machine learning workloads, because we saw customers really struggling with Cluster Autoscaler. And then it took off really, really well with stateless microservices.

But we do think there are a lot of improvements we can make for batch workloads and for ML training and inference workloads with Karpenter: connecting it with EFA, allowing instance-type prioritization, doing better GPU time slicing and partitioning, natively supporting Capacity Block Reservations as part of Karpenter, and helping customers orchestrate those reservations as well.

And scheduling around savings plans and on-demand capacity reservations. One of the biggest things we've actually done for EKS is its autoscaling, the control plane autoscaling. So continuing to tune the cluster for performance and resiliency is really important, allowing us to support larger and more resilient workloads.

Tuning the control plane parameters is an ongoing project the team is working on. As we see different types of clusters and we see them scaling, we're continuously tuning the different parameters on the back end to help customers get better performance out of their clusters.

But then also adding in-cluster performance options like container image lazy loading, which allows you to preload and cache images onto the cluster so that containers start up a lot faster.

And then, obviously, we talked about how we bring all those solutions together. There's a whole library of machine learning tooling that we want to help customers bring together here, all the way from the NVIDIA and Neuron drivers (Neuron is the SDK for AWS Trainium and Inferentia), bringing those into EKS when they're available, and we just launched an update to those really recently.

But also expanding the add-ons catalog, bringing in accelerated AMI build scripts, and helping customers get the core components they need for machine learning more quickly. So that's a lot of stuff.

So how do you predict the future? Those are just some of the things we're thinking about, things I happened to pick out and put on the slide. One of the things I think is most important, if you want to think about where we're going, is to ask: what are you doing that you think everybody else is also doing?

Those are the things we're looking to help customers with. Our thing at AWS is to be customer obsessed, and we remain customer obsessed. Our roadmap is driven by the demands and the needs of our customers. So we're constantly talking to people, looking at what they're building, how they're building it, and what we should be doing to help them build it more efficiently and faster.

What do they really need to run at scale? Is it more important to have a simple console interface, or is it more important to have a cluster that scales up and can support workloads of thousands of nodes without hiccuping? Those are the kinds of trade-offs we're looking at. And so the best way to predict the future is to think about what you need, and what you're doing that's completely undifferentiated, where you're like, everybody is doing this, why am I doing this? AWS should be doing this. If you're thinking of those things, we'd like to hear it.

We have a public roadmap, and I'm really excited about this public roadmap. We've had it out for a number of years now. I went and pulled some stats, because we launched it in 2018: we've shipped over 400 EKS items on the roadmap. We've actually checked those items off and closed them out since we launched it.

I think that's incredible. We've actually shipped over 800, almost 850, total items across all the AWS container services, ECS, EKS, and ECR, that are also on the roadmap with us.

And this is a great place to engage. Myself and our team are on there really regularly, engaging with customers. If you open something, you can expect, hopefully, to hear from us. There have been a number of times where I've had conversations on GitHub, on the roadmap, about exactly how we should build something.

So if there's something you're doing and you're like, hey, how do I get in touch with AWS, how do I tell them about this? You can talk to your account manager. You can also put it on the roadmap, and we encourage customers to do that and engage there.

And when we go to ship things, when we decide what we're going to build in a cycle on the EKS team, we're always pulling up the roadmap and saying, well, how many people are asking for that? And what are they saying about it?

So this is a great place to ask us questions, give us feedback, and propose ideas. We also update it when we ship features. So if you want to get notified when something comes out, you can subscribe to that issue in GitHub and get a notification when we ship that feature.

There are also a number of resources that the team has put together. These are incredible resources to help you go deeper with Kubernetes, go deeper with EKS, or just get started.

The EKS Best Practices Guide is an amazing deep-dive resource for EKS. We have almost 100 different solution architects, experts, and engineers behind it; we actually have engineers who have written whole sections of it this year. They go deep into exactly what the best practices are for running EKS at scale, and they cover some of the corner cases and the things customers hit as they go to really big scale and run complicated architectures on EKS, so you can avoid those things and learn from hard lessons.

We also have the EKS Workshop, which is free and open training for using EKS. It goes all the way from the 200 to the 400 level. And I'm really excited that, in the last few weeks, we just added a developer workshop section to the EKS Workshop. Most of the EKS Workshop focuses on the system administrator side, and now we have a developer side. So if you have developers who are using EKS, they should find this really, really useful.

And then we also have EKS Blueprints, which is a set of reference architectures and examples for how to deploy complete application clusters. You can use it with Terraform and with the AWS CDK.

There are also a number of new resources the team is putting together for continuing EKS learning. There are learning plans with Skill Builder and ramp-up guides, and when you complete these courses, you can now earn digital badges from AWS for your EKS learning. So I encourage everybody to go check out these new learning courses.

And then if you want even more EKS, we have a number of sessions throughout the week, so I encourage folks to check out the other EKS sessions. Really cool talks: there's a talk today on platform engineering, and another talk that goes deeper on AI. There's actually even a workshop on how to build generative AI with data on Amazon EKS.

We have a talk on Wednesday about Karpenter, and then one on Thursday with Pinterest about how you scale data processing.

So we're right at the end of time here. Thank you all so much, and we'll be outside if you have any questions or just want to come chat.
