Platform engineering with Amazon EKS

Uh now listen, I know this has been an awesome and long first day of re:Invent, and we are likely the last session between you all and your first re:Invent happy hour. So we appreciate you sticking with us. We're really happy that you're here today and we'll do our best to make this one fun and, more importantly, make it fast. So y'all can get out of here, enjoy that beverage and maybe have a nice dinner.

Um, alrighty, quick round of introductions just so you know who we are. My name is Kevin Coleman. I lead the worldwide go-to-market team for Kubernetes at AWS. I'm joined today by my colleague Roland Barcia, who is the director of specialist solution architecture for our containers and serverless services at AWS. And we're also super lucky to have a customer speaker with us today, Ahmed Bears from the New York Times, who's gonna be talking about the awesome platform that he and his team are building on top of EKS.

Alright, the quick agenda for today. To start off, I'm gonna share some of our perspective on how we view platform engineering uh in the EKS world and some of the benefits we see customers realize with their platform engineering investments. Uh next, we're gonna touch on a few specific customer examples, and I'll share two different examples of customers who are kind of at opposite ends of the maturity spectrum with the internal platform that they're building. After that, I'm gonna hand the mic over to Roland. He's gonna come up here and talk about some platform implementation patterns that we see among EKS customers. Then he's gonna jump into best practices, and for the folks in this room who are currently building internal platforms on platform engineering teams, this is where you're likely gonna want to pay attention. In our roles, we get to work with a wide range of customers across the globe who are building uh platforms on top of EKS, and there have been some key learnings that have emerged from all those engagements. So Roland's gonna share some of the salient learnings, and hopefully those will help you with your platform engineering endeavors. Finally, Ahmed is gonna come up here, and again, he's gonna talk about the awesome platform he and his team at the New York Times are building on top of EKS.

Alright, that's a packed agenda. Let's get into it. So what is platform engineering? Um, before we jump into this, I want to see a show of hands. How many folks in this room are platform engineers today? Alright, great. I expected a good amount of hands. How many folks are DevOps/SRE types? Ok. Wonderful, good mix. And how many folks are leading platform engineering organizations or otherwise responsible for enabling cloud adoption at their companies? Alright. Wonderful.

Uh so if you are leading platform or cloud adoption at your organization, you are responsible for enabling uh your organization to run applications in the cloud. This can include migrating, it can include modernizing, it can include building net new, but you will have uh a wide range of internal customers who are software development teams. These internal customers will likely be responsible for building or maintaining a wide range of applications. Maybe one, maybe many, maybe a portfolio of applications, and those applications uh will need to run on a wide range of infrastructure services in the cloud. This of course will include compute services like EKS or EC2, but it could also include things like storage or databases or networking and monitoring services. Additionally, not every application is going to be the same. They have different requirements, they are heterogeneous. Some might require more compute than others, some might require more memory, some might require GPUs. But regardless, all of these applications will need to run in the cloud, and you as the leader of the platform organization or the cloud adoption organization are responsible for figuring out how to do this, hopefully in a somewhat cost-effective manner.

Um, you all have a choice to make: who is going to be responsible for managing the infrastructure that is going to run those applications? And I like to think about this choice along what I like to call the infrastructure management spectrum. So on one end of the spectrum, you can choose to decentralize infrastructure management responsibilities. In this model, the application teams that build the applications will also be responsible for deploying and maintaining the infrastructure that is needed for those applications. The implication here is that each individual application team will be responsible for deploying, maintaining and, perhaps most importantly, troubleshooting production-grade infrastructure that is needed to run their applications. On the other end of the spectrum, you could choose to centralize infrastructure management. In this model, uh there are central teams that compose infrastructure services into platform abstractions, which are then offered as a service to application teams, allowing them to offload their infrastructure management responsibilities.

Platform engineering conceptually skews towards the left side of this diagram. It is all about building abstractions that allow application teams to offload infrastructure management responsibilities so that they can focus on building their applications and delighting their customers. However, in the EKS world, the abstractions that we see customers building actually vary significantly and fall at different points along this spectrum. And the reason for this is that application teams and organizations have different requirements. Some application teams actually want to have some infrastructure control; other organizations might not allow them to have that control because they have uh various compliance requirements.

So to zoom in and be a bit more precise, we view platform engineering as the practice of identifying and building compute abstractions that meet the unique requirements of your internal customers and your organization, to enable the efficient, cost-effective, at-scale adoption of the cloud and ultimately accelerate software delivery.

Alright. So let's talk about abstractions for a moment and why they're so valuable in the cloud context. And I think this is a really great visual to illustrate the point. On the right, you have a bunch of parts for an engine, which you can think of as representing cloud infrastructure and infrastructure primitives. And on the left, you have a complete engine which is sitting in a car, and you can think of this as platform abstractions.

So of course, there are a number of people in the world who are capable of taking the parts of an engine, assembling them, building a complete engine and installing it in the car, but it is labor intensive. It takes time. It's complex. And there are a much larger number of people in the world who just want an engine installed in their car, so they can get in the car, turn the key and drive away. In the platform engineering context, of course, there are plenty of people who can compose infrastructure into the complete abstractions that are needed to run applications. Many of you all are sitting in this room and you have the skills to do that. But there's a much larger number of application and software developers in the world who don't necessarily want to, nor should they need to, worry about composing infrastructure into the complete wholes that are needed to run their application in the cloud.

So they just want to get back to, again, building their application and focusing on their end customer. Platform engineering is really the practice of building abstractions that meet your internal customers where they are and allow them to offload infrastructure management responsibilities so that they can focus on their end customer.

Alright. So we've talked about building abstractions and platform engineering. Now let's talk about internal platforms as a whole, uh because it's more than just infrastructure abstractions. As I was preparing for this talk, uh I came across this definition of an internal platform and I really liked it, so I wanted to share it with you all today. An internal platform is a foundation of self-service APIs, tools, services, knowledge and support, which are arranged as a compelling internal product.

There are two parts to this definition that I really like. First, this definition doesn't just touch on the things that platform engineering teams build: the APIs, the services, the tools or the abstractions. It also touches on how they drive adoption of the things they build through knowledge and support. The most successful platform teams that we work with in the EKS world uh do a really good job focusing on things like documentation and providing support and education for their internal customers, which serves to drive adoption of the abstractions in the platform that they build.

The second part that I really like is the last three words: compelling internal product. When we build internal platforms for our internal customers, we really do build a product. It's the same thing we would do when building a product for external customers. So the teams that we see in the EKS world who have the most success with their internal platform endeavors really take a product-centric mindset to building their internal platform, work backwards from their customers and build abstractions that meet their specific needs.

Alright. So building a platform is an investment. Uh, there is an upfront cost that will hopefully pay dividends over time. So let's talk about some of the benefits that we see customers in the EKS world achieve when they have success building internal platforms.

So the first thing we see is velocity, and this really comes down to speed. Um, customers, via their internal platforms, are able to reduce the time it takes to get innovative new ideas, uh and code, literally code, from a developer's laptop out into production and in front of customers. By removing the need for each individual team to provision their own infrastructure, uh and by providing self-service deployment capabilities, companies can accelerate, and do accelerate, time to market for new ideas.

The second is governance, and in this context, governance is sort of a catch-all for things like security, reliability, scalability, et cetera. By building abstractions, companies are able to enforce the requirements for these types of concerns in an automated way for all applications that run on the platform. So every workload benefits from sensible defaults built into the platform, and application teams don't have to figure out how to configure these on their own out of band.
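To make "sensible defaults" concrete, here is a minimal Python sketch in the spirit of a mutating admission webhook: the platform fills in resource settings only where a team left them unset. The default values and function name are hypothetical illustrations, not anything prescribed by EKS.

```python
# Hypothetical sketch of platform-enforced sensible defaults, in the spirit
# of a mutating admission webhook. The values below are assumptions.
DEFAULTS = {
    "requests": {"cpu": "250m", "memory": "256Mi"},
    "limits": {"cpu": "1", "memory": "1Gi"},
}

def apply_default_resources(container: dict) -> dict:
    """Fill in a resources section only where the team left it unset."""
    resources = container.setdefault("resources", {})
    for section, values in DEFAULTS.items():
        resources.setdefault(section, dict(values))
    return container

# A container spec with no resources gets the platform defaults;
# anything the team set explicitly is left untouched.
container = {"name": "api", "image": "example/api:1.0"}
apply_default_resources(container)
```

The point is the shape of the mechanism: teams that say nothing get safe defaults, teams that know what they need keep full control.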

And finally, there's efficiency, and this really comes down to cost savings. And there's a number of ways that platforms can save customers cost, uh but I'll cite a few specific examples that we see quite frequently. Um, in a multi-tenant, specifically a containerized, setting, by building a platform you can run multiple workloads from different teams on the same underlying host instances, therefore allowing you to be more cost efficient with your workloads.

Uh and the second one is from a human capital cost perspective: by removing the need for every individual application team uh to have infrastructure expertise on the team, and centralizing that expertise, you're able to save on human capital cost as well. One of the things we hear quite frequently from customers is that platforms empower companies to achieve economies of scale in the cloud. As I mentioned before, there's an upfront investment to building a platform. But once it is in place, the marginal cost of deploying or onboarding each subsequent application becomes quite small, thus enabling companies to realize economies of scale.

Alright. So who's doing this? Um, I wanted to share two specific examples of EKS customers who've had success with their internal platform initiatives. And I wanted to share customers on opposite ends of the maturity spectrum. One is quite mature and has a platform that runs at significant scale today, and the other is earlier on in their platform journey but has had some early success, and frankly it is a very cool use case that's built on top of EKS.

Alright. So the first customer I want to talk about today is Salesforce. Um, Salesforce's internal platform, Hyperforce, is built on top of EKS. Hyperforce is a reimagining of Salesforce's platform architecture built for the public cloud. And in 2023, importantly, it's how Salesforce is delivering trusted AI to their customers.

Um, the compute foundation of Hyperforce is called the Hyperforce Kubernetes Platform, and it's built on EKS. With HKP (Hyperforce Kubernetes Platform for short), Salesforce engineering teams don't need to provision EKS clusters themselves; rather, they're empowered to consume Kubernetes as a service.

Um, HKP streamlines the developer experience and offers varying levels of abstraction, including vending clusters as a service and vending namespaces as a service for workloads where that makes sense in a multi-tenant setting.
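To make "namespaces as a service" concrete, here is a hypothetical Python sketch of the kind of manifests a vending workflow might render for a new tenant. The label key, the team-env naming convention, and the quota sizes are assumptions for illustration, not details of Salesforce's HKP.

```python
# Hypothetical sketch of "namespaces as a service": the manifests a vending
# workflow might render for a new tenant. Label keys, naming convention,
# and quota sizes are made up for the example.
def vend_namespace(team: str, env: str, cpu_quota: str = "20") -> list:
    name = f"{team}-{env}"
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": {"platform/team": team}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "team-quota", "namespace": name},
        "spec": {"hard": {"requests.cpu": cpu_quota, "pods": "50"}},
    }
    return [namespace, quota]

# One call per onboarding request; the platform applies both manifests
# to a shared cluster, so tenants get isolation plus enforced quotas.
manifests = vend_namespace("payments", "prod")
```

Pairing the namespace with a ResourceQuota is what keeps multi-tenancy safe: each vended tenant arrives with guardrails already attached.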

Um, today Salesforce runs HKP across the Hyperforce fleet, which comprises millions of pods and over 1,000 EKS clusters. So again, a customer that is running an internal platform at some really awesome scale and having a lot of success and seeing results because of it.

The second customer I wanted to mention is NASA. Uh, and this is the customer earlier on in their platform journey. NASA chose AWS to serve as a foundation for their new data platform.

Um, their platform is built with tools like JupyterHub, Dask, Crossplane and Flux CD. And the platform allows scientists to provision complete, batteries-included JupyterHub environments in minutes. Uh, the new platform aims to enable scientists from around the world to collaborate by sharing data and processes and generate reproducible results. The ultimate goal of the platform is to provide scientific models as a service, a way for researchers from around the world to execute common scientific models from a familiar interface.

I think this platform is super cool because it supports the White House uh Office of Science and Technology Policy's Year of Open Science initiative, which is a movement that aims to make scientific research more equitable, uh transparent, accessible and collaborative. Um, and we're super thrilled that the NASA folks chose to build this platform on EKS.

All right, with that, I'm going to hand it over to Roland. Who's going to talk about platform implementation patterns?

Thank you, Kevin, how's everyone doing? All right, almost beer time. Who's going to have a beer right after this in the expo? All right.

All right. So Kevin asked how many platform engineers and DevOps teams there were, and how many developers and app developers there are that are building apps on a platform. See, that's a problem, right? Where is your customer? Right? So one of the banes of our team's existence is this notion of what developers and data scientists want. They want freedom, they want autonomy, they want to use JupyterHub, they wanna use Ray, Kubeflow, whatever framework they want. They want to build code, they want to deploy microservices and they want the freedom to do that.

On the other side, you have platform builders, and platform builders are trying to do the things that Kevin talked about, right? Standards, governance. If you go too far in the standardization world, you build a platform that nobody uses. Developers are like rivers: they're crooked. They take the path of least resistance to get the job done, because they're trying to meet a deliverable, a deadline. Sometimes you're even outsourcing development to different teams and they've got to meet that deliverable, right?

You go to the other side, developers, and you give them too much freedom, and all of a sudden you expose some URL out to the internet, or they start using up all the GPUs that you allocated for those new large language models. Or they do all sorts of different things, like ramp up the cost, and you see your AWS bill and you're like, oh man, I didn't anticipate these billion events coming into the system to scale up this way. Right.

So how do we solve this? So over time, when you're starting to modernize, we usually start autonomous. You've heard Amazon talk about two-pizza teams building applications. You have these independent squads. Now we have things like micro frontends talking to microservices talking to data meshes with owned data products, and people have this autonomy. You start building more and more of these applications.

We spent the beginning of this year really focused on cost optimization, right? When the economy hit, we saw a big uptick in these platform teams saying help us optimize cost, help us use things like Spot in a Kubernetes context with something like Karpenter. And then gen AI hit, the hype, the wave. People started creating data platforms, and those same platforms needed to quickly spin up capacity, storage, GPUs. And so you had both sides of it.

And so let's talk about some of the areas where we see our customers have challenges. So first, ownership. I was at a customer meeting uh over in Europe. And in that customer meeting, we had a platform team, about 25 developers dedicated to the platform team, supporting about 200 to 300 application developers across different projects. And they had a pretty good platform team. They built automations across all their different capabilities, but they still had some issues across different teams.

They had one team that was in the room with us. The developers were there saying, just take my code and deploy it and run it. We had another team that's like, I want my own cluster. I don't want you, I want my own isolation. I want to manage it myself. They couldn't make everyone happy. So we have this notion of who owns the platform. And sometimes people don't realize the platform itself is a product. If you're gonna do platform engineering, you have to treat it like a product, and you have customers. Those customers are the developers and the data scientists. So when we build products, we build them for them.

So are they included in there? The next area we see is level of abstraction. So we have everything from "I want an AWS account," "I want a cluster," to "just give me a DevOps pipeline that I can push my code to, and that's it." We have a lot of variability there, and customers are struggling with where the line is for their organization. I wish I could tell you there is a silver bullet, to say this level as a service is the correct level and all your platform issues will be solved.

The other one: have you built it? Right, Field of Dreams. The adoption problem. I've had customers come and say, hey, I built this platform, I've invested in it. It's really good. You should see the code we wrote, super automation with all sorts of GitOps, you know, you touch my Git repo and nothing gets through unless my pull request gets accepted. And developers are not using it. And we have the other end of the spectrum. They get there and say, hey, I built this platform, and with this platform you don't have to know anything about Kubernetes. You're a React developer? Just go and check in your code. Trust me, I'm gonna deploy your pod. I'm gonna run it. Oh, your app is broken. It's your fault, here, just use kubectl and tail the pod logs. And the developer goes, wait, you just told me I didn't have to know Kubernetes. What are you talking about?
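That leaky abstraction is often papered over with a thin, developer-facing wrapper that hides kubectl. The sketch below is a hypothetical Python illustration of the idea: the command shape ("platform logs <app> <env>") and the app-env namespace convention are assumptions, not anything described in the talk.

```python
# Hypothetical sketch of a thin platform CLI that hides kubectl from
# developers who were promised they wouldn't need to learn Kubernetes.
def platform_logs_command(app: str, env: str, tail: int = 100) -> list:
    """Translate a developer-facing request into a kubectl invocation."""
    namespace = f"{app}-{env}"  # assumed platform naming convention
    return [
        "kubectl", "logs", f"deploy/{app}",
        "-n", namespace, f"--tail={tail}",
    ]

# The developer types something like `platform logs checkout prod`;
# the wrapper would run this on their behalf (e.g. via subprocess).
cmd = platform_logs_command("checkout", "prod")
```

Even a wrapper this small keeps the promise honest: the developer asks about their app, and the platform owns the mapping to clusters, namespaces and pods.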

Observability is often an afterthought and not part of the platform design, because we're wired to create. We want to create, provision, automate; we kind of center there. And then we realize that you probably create something once, but you iterate and debug all the time. So where is the center of gravity for your developers and your users? Is it in building apps? Or is it in maintaining your code, versioning your code, troubleshooting your code, debugging your code? Where is the center of gravity? It's the second part.

So you might get people to initially onboard to your platform and then all of a sudden they're looking for different places to go, because they can't troubleshoot. They have trouble accessing their logs, they don't know what's going on. It worked on their laptop but not in the cloud, or in the cluster, or in the pod, et cetera.

Everything from, I moved this application, and this was in the same customer meeting, from a VM, an EC2 instance, into a pod. Go over to the next slide actually, right? And they were like, ok, I sized it the same way and all of a sudden my app didn't run, and it's because other pods from other applications were scheduled on the same nodes. So they're like, oh, this is broken. I need my own cluster.

So the reason the developers were like, I need my own cluster, was they need isolation. What do you mean by isolation? There's 20 different definitions of isolation. Everything from I need my own AWS account with my own permissions. We have regions, we have reasons that we want to do multi-region for resiliency, availability zones. If you work in regulated industries, like in the public sector, you have things like network segmentation, and you need isolation from that perspective. I want my own cluster, I want my own compute. There's all these different reasons. In this case, it was really a troubleshooting issue that led them to the conclusion that that team needed their own cluster. What they really needed were their own nodes, their own ways to do scheduling, et cetera. But these are the things you need to tease out.
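Dedicated nodes for a team, rather than a dedicated cluster, is something Kubernetes can express with a taint on the team's node group plus a matching toleration and nodeSelector on the workload. The sketch below is a hypothetical Python rendering of those stanzas; the label/taint key is made up for the example.

```python
# Sketch of "dedicated nodes instead of a dedicated cluster": taint the
# team's node group, then give the team's pods a matching toleration and
# nodeSelector. The label/taint key below is an illustrative assumption.
TEAM_KEY = "platform.example.com/team"

def node_taint(team: str) -> dict:
    """Taint to put on a team's node group so other teams' pods stay off."""
    return {"key": TEAM_KEY, "value": team, "effect": "NoSchedule"}

def pod_scheduling(team: str) -> dict:
    """Scheduling stanza that pins the team's pods to its own nodes."""
    return {
        "nodeSelector": {TEAM_KEY: team},
        "tolerations": [{
            "key": TEAM_KEY, "operator": "Equal",
            "value": team, "effect": "NoSchedule",
        }],
    }
```

The taint keeps other workloads off the team's nodes, and the nodeSelector keeps the team's workloads on them, which is often the isolation they were actually asking for.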

And so there's a lot of different design choices that we go through with our customers. Everything from account as a service to template as a service, cluster as a service, namespace as a service and platform as a service. We'll talk about that.

How many folks here are doing one of these on the screen? Ok. Platform as a service? Namespace as a service? Cluster as a service? Template as a service? Ok, so you see, you don't all agree. So we don't have the magic platform. All of these have tradeoffs and downsides, and they're optimized for different things. And it really depends on the workloads that you're supporting.

All of these have a downside. For example, I just had a meeting this morning with a customer in an EBC, and they have account sprawl. So all of these have sprawl issues, right? I can have account sprawl. I can have tons of old templates, unused templates that are sitting in a Git repo and never used. I can have cluster sprawl. I can have lots of unused namespaces, et cetera. And I can have lots of pipelines. So all of these have that issue. So you have to really think about what you are designing for.

So the punch line is, no one size fits all, right? It's very difficult. So what do we do about that? So one of the things that I think is important is this product mindset. If you're going to do platform engineering, it's one of your most important products in the organization. It needs to support all of your organization's workloads: ML, pipelines, data. And you need to work backwards from the business domain. At the end of the day, people are writing applications that serve a business function. You need to work backwards from that.

So how do you think about things architecturally? I like to think of big buckets. So architectures: modular, with two-way doors. You want to be able to have modules that you can build out and reuse. Developer interfaces: oftentimes we wish developers would just use GUIs, but at the end of the day there's a lot of automations that they want to support. What does your developer life cycle look like? Data and dependencies: this is where there's often a lot of variability. Shared databases, your own database, who owns the database, et cetera. And then you have libraries: reusable code, code libraries, reusable images. Software delivery and automation, we're gonna go deeper on that. And then simplifying the developer experience.

And so as we go more to the right, to the developer and the data scientist, we break these down. The developer talks about containers; they have code, they have APIs, they have events. Event-driven architecture uh is something that people are building inside of containers, maybe integrating with Lambda functions. Then you have your code and your configuration. Configuration hell is a real place. People have been there; they can get out if they behave. But configuration. Then you have all sorts of dependencies. You want to think about this when you're building platforms, working backwards: how do I handle databases and machine learning tools, libraries, brokers? You might be doing something with RabbitMQ, Kafka.

How many folks are running Kafka? How many people are running Kafka on EKS? Ok. A few. All right, automated software delivery. These are what we call

I like the ILS. We debated this word, Kevin and I, the other day on the phone. He does not like that word, he liked another one, but I won't say it. But we think about DevOps pipelines. Some organizations, and I had a discussion this morning with one, have a separate uh organization for DevOps pipelines, the CI/CD pipelines for the developers, and their infrastructure as code.

So that might be an additional customer. It might be part of the platform. That's the abstraction. Testing is often overlooked as not part of the platform. Security, uh security is a day-one thing for us, and so that needs to be part of the platform, and observability. And then lastly, simplifying the developer experience. And uh, it's really, how do I get the self-service? How do I give the developer autonomy? Templates, or maybe a mechanism to allow developers to contribute templates that work. And everyone's favorite development activity is documentation. How many people love documenting their code and their scripts? All right, self-documentation. That's good. Automating it. Yeah. Awesome.

And so, big leap here. Uh, but when you're building a platform, you have a lot of tools, a lot of choices. So we have clusters, shared services and all the tools behind them, and you can build platforms with a lot of tools. Since this is EKS, a lot of the customers that we see using Kubernetes don't just think about Kubernetes as compute but actually as an API, an API that you use to provision, scale, secure, et cetera. And in this case, this is an example that we've done with quite a few customers, where we have a management cluster built with something like Argo CD, policy controllers for your policies, Crossplane or something like AWS controllers that enable you to provision and manage services like a managed RDS database, an SQS queue, an S3 bucket. We have all sorts of different control plane aspects, and even the ability to use something like Crossplane to create, provision and upgrade clusters.
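As a small illustration of Kubernetes-as-a-provisioning-API, here is a hedged Python sketch that renders the kind of custom resource such a controller reconciles. The shape follows the ACK S3 controller's Bucket resource as I understand it; the names and labels are made up, and in a real setup the manifest would be applied to the management cluster rather than built in a script.

```python
# Hedged sketch of Kubernetes as a provisioning API: a management cluster
# applies custom resources and a controller (ACK, Crossplane) reconciles
# the actual AWS resource. Resource shape follows the ACK S3 controller's
# Bucket CRD; names and labels here are illustrative assumptions.
def s3_bucket_resource(name: str, team: str) -> dict:
    return {
        "apiVersion": "s3.services.k8s.aws/v1alpha1",
        "kind": "Bucket",
        "metadata": {"name": name, "labels": {"team": team}},
        "spec": {"name": name},
    }

# In practice this dict would be serialized to YAML and applied to the
# management cluster, e.g. through a GitOps tool like Argo CD or Flux,
# and the controller would create the bucket in the team's AWS account.
bucket = s3_bucket_resource("analytics-raw-data", "data-eng")
```

The appeal of this pattern is uniformity: clusters, queues, databases and buckets all become declarative objects managed through the same Kubernetes API and GitOps workflow.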

So this is going all in on Kubernetes. Now, we do have a session, a chalk talk later this week, CON408. It's called Infrastructure as Code versus GitOps versus CI/CD, because there are different choices. You can use tools like Terraform, you can use Pulumi, we have partners, et cetera, to do the cluster part, and maybe GitOps to the right. I've had infrastructure teams come to me and say, I don't even know how to use Git. That's something developers use. You're not going to be successful at GitOps if you don't know how to use Git, security with Git, because it leverages all these things.

So there's all the tooling choices. But again, you really want to go in thinking about all these things and what supports these big buckets, working backwards. So in summary, one of the biggest mistakes that we see with building platforms is teams build something without any sponsors, any workloads. We always see the most successful platform teams have sponsored projects. Customers that do things well internally pick two or three workloads and build the platform together with their customer. Make the developers and the engineers part of that journey and build with them.

So that's probably the number one success factor. Don't boil the ocean, but build with those use cases. Pick compelling use cases and work with your developers. Maybe a stateless workload, something with some capacity, something with some requests, and build it out. Provide escape hatches. So maybe 80% of your organization will use the easy button, code as a service. But then think about, hey, if you want a little bit more control, we'll give you control with these different sign-offs, read this documentation, documentation as a service, and we'll give you escape hatches, cluster as a service.

So don't pick one; pick two or three and have a priority based on use cases, and learn to change where that wall is, because what is default today might not be default tomorrow. Again, one size doesn't fit all workloads. Lastly, documentation and education are actually a key part of um the platform journey. Uh, we run something inside of AWS called Technical Field Communities, TFCs. It actually was encouraging: I met with a customer this morning that's actually emulating that as part of their platform. So they've actually built enablement and education as part of that platform journey.

So you want people to use the platform: come, we'll educate you. We'll make you Kubernetes developers. We'll help you get your certification. Think about draw cards to draw them in, and think about what they want to do. So that's where we're at. And with that, oh, one last thing, shameless plug, I forgot we got this in here. We have a new EKS badge. Talking about education as a service: go get some learnings from some of the folks on our team from the field that work with customers, and from our service team.

So if you want to earn this badge, we just launched it this week, here is the QR code. I'll give you a second to take a picture, because we love getting educated. So we're putting our money where our mouth is: get educated with EKS, earn a badge. And with that, Ahmed, you're up. Thank you.

Thank you, Roland. Hello, everyone. I know I'm the bottleneck between you and the reception, so I will try to be quick. Roland and Kevin shared great insights about how to build a platform on top of EKS. So let me start with how we built an internal developer platform at the New York Times.

So as you know, the New York Times' most recognizable product is news and journalism, but we also have other products like Games, Wirecutter, Audio, The Athletic, Cooking and others. So how do we take all of these products and innovate and just deliver them? So let's discuss our motivations for building an internal developer platform.

First, we think about how to deliver a consistent experience to our engineers. So a new engineer joining a team needs to get a consistent experience, while other engineers on different teams get the same experience every time they try to build a new service or an application. We don't want to be a bottleneck in the software development life cycle. We don't wanna have manual intervention every time an engineer tries to deploy a new service or build a new service.

So we wanna offer this via a self-service approach that allows teams to be autonomous and build using the capabilities that they need. Also, we want to provide extensible tooling. What I mean by that is we don't want to specifically focus on the tools; we want to build an abstraction layer that provides teams with autonomy to build on top of it. So if we didn't deliver a specific tool for a use case, they can take that layer and build on top of it. At the end, delivery is important.

So we want to reduce the time from ideation to production and allow teams to reduce all of the management overhead and infrastructure as well, across the organization, and get their applications bootstrapped quickly. So I would love to talk about this quote from Daniel Cassidy. He's uh the senior product manager for the platform. I'll read it here.

"Engineers deserve a seamless experience to solve problems and deliver value. Imagine a centralized developer platform that helps engineers at every stage of the development life cycle while reducing the burden of infrastructure sprawl." What he means is simple: when you go to deploy an app, you shouldn't have to focus on all of the infrastructure necessary to deliver it. You should go through a consistent experience. You shouldn't have to ask anyone how to deploy it or what the necessary steps are; it should be documented well enough that you can build on top of it.

So what makes an internal developer platform indispensable for an organization? To address this, we identified a few core pillars for the platform. First, standardization: we need to deliver standards across the organization that teams can build on top of. Then efficiency: a seamless experience that we deliver once and reuse, allowing teams to move faster when they need to. Then integration: the platform has to deliver a cohesive experience and integrate with many tools, and all of those tools together have to add value to the overall development experience.

Scalability. I don't mean just runtime scalability; I mean scalability of the product itself. It's done in iterations: we build for day zero, day one, day two, and keep going, adding more features. And finally, visibility: with a centralized internal developer platform, we get a full, cohesive view of our entire infrastructure and the portfolio of services delivered on the platform.

We've been talking about the internal developer platform, but let's talk about the customers themselves. Who are we building this for? We're building it for developers, for engineers. Developers have a journey, like a customer journey. Think about when you buy something: you search for it, check out, get it delivered, and then put it somewhere or use it. Developers have a similar journey. When they get a requirement, they start thinking about how to solve the problem; they design it, plan it, and then go through the process from coding to delivery.

What I did here is focus on a few elements of that journey that we can abstract into the platform to make things easier and more seamless for engineers. It's not necessarily that we want to abstract everything away, but we want to deliver a cohesive experience and transitions that make it easier for them to develop their applications.

Before we transition into the actual workflows we built, where we took the developer journey and translated it into workflows for how engineers use the platform, I want to reiterate the goal. Our ultimate aim is to assist engineers seamlessly in developing and delivering their applications, ensuring a smooth transition at every step.

First, the creation process. As an engineer, you have an app you want to deliver to your customers. So you create the app and onboard it: you get all of the templating you need, from a GitHub repo (or any other source control), to a CI/CD build, test, and deploy pipeline, and then the runtime and the cloud resources that are necessary.

You get all of that just by requesting it. You specifically say, hey, I need this and this and this, you click yes, and you get all of it out of the box. For connecting it all at the end, we have our routing layer, a shared ingress model we built in the company that serves all of your traffic to customers. But there's one missing piece here: observability. It's a critical piece.

As you can see, the observability layer spans all the steps. We need to ensure that every workflow is observed, and we have a ton of telemetry data: traces, logs, and metrics, all of which is necessary to understand the entire workflow and life cycle.

So how did we build it? We experimented with a few approaches for the platform itself, and we landed on a multi-account architecture. What does that mean? Instead of layering all of the resources we need into a single AWS account, we started thinking about how to deliver this experience through multiple accounts.

Why would we do that? Basically, we group workloads into specific accounts: as a security measure, to ensure development and production are not in the same account, and from a cost-management perspective, to understand which workloads sit in which account.

Here we have the platform accounts, which hold all of the centralized workloads I talked about: the governance tooling, the clusters, the ingress model, anything that's shared across multiple services. Then we get to team accounts.

Each team gets a set of accounts where they can create the resources available to their runtime: RDS instances, S3 buckets, SQS queues, ElastiCache instances, and anything else they need to run their application.

How do we get all of this connected? Basically by adding a transit gateway layer between the accounts. Managing this might look complex, but AWS provides a set of services that make the experience better. We're using AWS Control Tower with AWS Organizations to manage the entire provisioning experience. We also have CloudTrail enabled in all accounts so we understand what API calls are made in them. And everything here is done through single sign-on on AWS. So that's our multi-account architecture.

Now let's go to another dilemma that we ran into and many of you will run into. Let me ask a question: of those of you running EKS clusters, who here runs multi-tenant clusters versus single-tenant clusters? Who is running multi-tenant clusters? Oh wow, a good number. Who is running single-tenant clusters? OK.

So I'll get through it. There's a dilemma when you think about how to build your platform: do you go with single-tenant clusters or multi-tenant clusters? There's no one-size-fits-all; it's a problem that can be solved many ways, and each approach has its own pros and cons. With single-tenant clusters you get hard tenancy: you don't have to think about RBAC between tenants, because each tenant gets their own clusters. But there's more work, more overhead in management and upgrades, and less cost optimization: if a team needs to deploy just a couple of services, you still have to build a cluster for them. With multi-tenant clusters, you get a single cluster or a few clusters shared between tenants. They're highly cost-optimized, but you have to think carefully about how you ensure security and network isolation, and all of that has to be planned up front.

In our use case, we had some design requirements for the platform, and we ended up designing and choosing multi-tenant clusters. That said, if we decide to move from a multi-tenant cluster to a single-tenant cluster for a specific use case, we can do that over the course of developing the platform.

So how did we deliver multi-tenant clusters built on EKS? The diagram here shows a single environment, where by environment I mean dev, staging, production, or any other environment you could name. We run our clusters in multiple regions, at least two, across multiple availability zones, to ensure maximum availability for all of our workloads. That's where our EKS clusters run, and tenants get access across all environments, on all clusters, to specific namespaces they can use for their workloads.

As you can see, we're using different open source technologies and tools: Cilium for network policies, leveraging eBPF for routing; Karpenter for our scaling, which gives us more flexibility in how we deliver spot instances versus on-demand, and which means that if we decide to go a custom route and deliver node pools, we can do that with Karpenter. We're using Istio for our service mesh in combination with Cilium cluster mesh; we combine both to achieve native routing and availability between clusters. And finally, we use Open Policy Agent to ensure policies, mutations, and validations are applied across the board.
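As a sketch of what per-tenant isolation in a shared cluster can look like, here is a minimal, hypothetical generator for the kind of NetworkPolicy an onboarding flow might stamp out per tenant namespace. The tenant name, label, and policy shape are illustrative assumptions, not the Times' actual policy:

```python
def tenant_network_policy(tenant: str) -> dict:
    """Build a NetworkPolicy manifest that only admits ingress traffic
    from pods in the tenant's own namespace (illustrative sketch)."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{tenant}-isolate", "namespace": tenant},
        "spec": {
            "podSelector": {},  # empty selector: applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            "ingress": [
                {"from": [{"namespaceSelector": {"matchLabels": {
                    # the well-known label Kubernetes sets on every namespace
                    "kubernetes.io/metadata.name": tenant}}}]}
            ],
        },
    }
```

In practice a CNI like Cilium enforces standard NetworkPolicies (or its own richer CiliumNetworkPolicy resources) via eBPF, which is what makes this kind of namespace-level isolation cheap in a shared cluster.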

Now let me walk you through what the onboarding experience looks like from an account perspective. Imagine a team that doesn't have anything on the platform yet. They're brand new; they're our beta users. How would this work?

We're using Backstage as our internal developer platform portal. The team goes to Backstage and requests an account: they fill in some information in a form, and then they go through the process. I'll go in depth on what this onboarding experience looks like, but basically the request goes through Control Tower, where the account gets provisioned, and then the team gets onboarded to the shared Kubernetes clusters. So let's dive into each step.
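To make the form step concrete, here is a minimal sketch of validating a hypothetical account-request payload before handing it off to Control Tower. The field names and allowed environments are assumptions for illustration, not the actual Backstage template:

```python
def validate_onboarding_request(form: dict) -> list:
    """Return a list of validation errors for a hypothetical
    account-request form; an empty list means the request is valid."""
    errors = []
    # required fields an account request would plausibly need
    for field in ("team", "environment", "cost_center"):
        if not form.get(field):
            errors.append(f"missing {field}")
    # only flag environment values that are present but unrecognized
    if form.get("environment") not in (None, "dev", "staging", "prod"):
        errors.append("environment must be dev, staging, or prod")
    return errors
```

Validating up front keeps bad requests from ever reaching the (slow, asynchronous) account-provisioning pipeline.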

Once the request is submitted, it goes to Control Tower, which creates and provisions the account. Once it's provisioned, since we use infrastructure as code across the board, we start creating the VPCs, the required AWS Backup settings, Config rules, everything necessary. Then, remember I talked about the transit gateway earlier: another infrastructure-as-code process connects that specific account to the shared services. We have to ensure there's no network leakage between accounts unless it's necessary, so we connect them only to the shared services and make sure we have enough network isolation, which is achieved by Cilium at this point.
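One detail this step implies: for every account's VPC to attach to a shared transit gateway, the VPC CIDRs must not overlap. A simple sketch of carving non-overlapping /16 ranges out of one supernet per account index (the base range and prefix size are assumptions):

```python
import ipaddress

def allocate_vpc_cidr(account_index: int, base: str = "10.0.0.0/8") -> str:
    """Carve the Nth non-overlapping /16 out of a shared supernet,
    so each account's VPC can attach to the transit gateway
    without CIDR clashes."""
    supernet = ipaddress.ip_network(base)
    subnets = list(supernet.subnets(new_prefix=16))
    return str(subnets[account_index])
```

A real platform would persist these allocations (e.g. in its IaC state) rather than recompute them, but the invariant is the same: one routable, non-overlapping range per attached VPC.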

As a last step, we connect all of that with AWS Single Sign-On, so teams get access to the accounts through the console, the API, or the CLI. This looks familiar, like the previous slide, but the addition here is that now we give teams infrastructure as code for their accounts, and we also give them access to the management console. So that's the first step.

Now the accounts are created, and this is where EKS comes in. After the account is provisioned, we use an event-driven architecture to onboard these accounts to the EKS clusters. Basically, we get a message saying the account is created. That's different from provisioned: provisioned means the account is just available with no resources in it, while created means all the resources have been added, networking and the other things we need. That message goes through a Step Function, which triggers logic to ensure we have all of the resources we need for that specific account, retrying until we do. That pushes to a repo a specific CRD for a controller we created internally, which we call the tenant operator. What it does, basically, is that once we say this account needs to be onboarded to the clusters, it creates the tenant namespace, network policies, RBAC, and certificates: all the steps necessary so the tenant can just start deploying services and applications to the platform.
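As a rough sketch of that handoff, the event-to-operator step could be modeled as translating the "account created" event into a Tenant custom resource for the operator to reconcile. The API group, kind, and field names here are invented for illustration; the real CRD is internal to the Times:

```python
def tenant_cr_from_event(event: dict) -> dict:
    """Translate an 'account created' event into a hypothetical
    Tenant custom resource that a tenant operator would reconcile
    into namespaces, network policies, RBAC, and certificates."""
    return {
        "apiVersion": "platform.example.com/v1alpha1",  # illustrative group/version
        "kind": "Tenant",
        "metadata": {"name": event["team"]},
        "spec": {
            "awsAccountId": event["account_id"],
            "namespaces": [event["team"]],   # one namespace per tenant, per the talk
            "networkPolicy": "isolated",     # isolation enforced by the CNI
        },
    }
```

Pushing this manifest to a Git repo (rather than applying it directly) keeps the onboarding flow declarative and auditable, with the operator reconciling whatever is in the repo.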

But the most important thing here: we use another operator for IAM, which creates the tenant role. That's how we grant access to the clusters. We already know the team's account information, so we create an IAM role in the account, map it to their configuration, and ensure they can get access to the cluster.
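On EKS, one common way to map an IAM role to cluster access is an entry in the aws-auth ConfigMap's mapRoles list. A sketch of generating such an entry for a tenant; the role and group naming convention is an assumption, not the Times' actual scheme:

```python
def map_roles_entry(account_id: str, tenant: str) -> dict:
    """Build an aws-auth mapRoles entry granting a tenant's IAM role
    access under a tenant-specific RBAC group (names are illustrative)."""
    return {
        "rolearn": f"arn:aws:iam::{account_id}:role/{tenant}-eks-access",
        "username": f"{tenant}-user",
        # an RBAC RoleBinding in the tenant namespace would target this group
        "groups": [f"{tenant}-edit"],
    }
```

This is exactly the kind of ConfigMap bookkeeping Ahmed mentions later wanting to replace with an API, which EKS access entries now offer.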

On top of all of this, we added another thing: a platform CLI. It's a CLI we built internally that helps teams go through all of the steps necessary to get access to the cluster. Basically, it's one command that gets you access, from the AWS account to the cluster, setting your context and doing everything necessary. And this isn't just a one-time experience when you create a new account: we have similar experiences when you build other apps, and we can use the platform CLI for other things, like debugging and access, as you build on top of the platform.
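Under the hood, a "one command" experience like this can compose the standard AWS CLI call that writes a kubeconfig context for an EKS cluster. A sketch, assuming a hypothetical tenant role-naming convention:

```python
def login_command(account_id: str, region: str, cluster: str, role: str) -> str:
    """Compose the single AWS CLI command a platform CLI might run to
    assume the tenant role and write a kubeconfig context (sketch)."""
    return (
        f"aws eks update-kubeconfig --name {cluster} --region {region} "
        f"--role-arn arn:aws:iam::{account_id}:role/{role} "  # assume tenant role
        f"--alias {cluster}-{account_id}"  # readable context name in kubeconfig
    )
```

Wrapping this in a platform CLI lets the team type one short command while the platform owns the account IDs, regions, and role conventions.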

So this is what we have today: our internal developer platform offering. It follows the same workflows I mentioned earlier. Onboarding goes through a template engine that templates all the resources needed for a service: you get a GitHub repo, your CI, and then your CD goes through Argo CD, deploys to EKS, gets configured against your tenant's account, and then gets connected to your ingress, with Backstage and the observability layer looking cohesively across the entire platform.

OK, so what's next for us? We're looking for a few things. More managed add-ons, or managed services from AWS: hopefully one day we'll see a managed Istio instead of us managing Istio, and a managed Karpenter. How we manage the Karpenter deployment and its config has been a challenge, so I'd love to see that one day; there's an issue on the containers roadmap about how Karpenter could be deployed as part of the control plane. Currently we only support x86 workloads, but we'd love to add ARM one day.

We're also in a proof of concept for a better IAM provisioning process on these same clusters. Instead of dealing with a ConfigMap and all the processes needed just to onboard new tenants to the clusters, we'd love to have an API around this. And longer term, we'd like to bring on more workloads. Right now we're looking at stateless workloads, like APIs, applications, and jobs, but in the future we're looking to add more data-intensive applications and machine learning to the platform.

With that, I'd love to share a couple of quotes from my peers. Bella, a senior software engineer on the Games team, talks here about launching our newest game, Connections, on the platform, and about how quickly they were able to go from a prototype to something in production.

And Jorge, a staff software engineer on the Publishing team, talks about how we deliver many features out of the box, like pull request deployments, that help them focus on building their application without thinking about managing the infrastructure, while still being able to manage the infrastructure when needed.

Finally, having told you about the journey today, here's what we learned through this process over the previous years. User documentation: it's important. If you deliver a platform without user documentation, you'll get a lot of support questions about how to do simple things. So it's critical and pivotal to deliver good user documentation and a good experience to your users.

Adoption and partnership: a platform without adoption is a waste of time. If you build a platform, as Roland mentioned earlier, and no one is using it, that's waste. So we have to partner with teams and get early adopters to use it and tell us exactly what they need built into the platform. Platform as a product: one of the critical things I've heard talking to other engineers is that platform engineering is not a project. It's not a one-time thing you go through; it's done at multiple levels and in multiple iterations. You do one iteration, things might go wrong, and you iterate and get through it. And finally, customer feedback, which is critical: you have to keep the door open, be open-minded, and look for change. You're not building a platform to deliver cutting-edge technology; you're building a platform to help your customers, the engineering teams, deliver services. With that, I'd like to hand it back over to Kevin.

All right, excellent. I don't know if my mic is on now. That's all we have for you today, folks. First of all, thanks to Ahmed and Roland for doing this, and thank you all for joining us. We'll get you out of here a couple of minutes early so you can go to that re:Invent happy hour. We're going to stick around on the stage over here to answer any questions. If you have some time, we'd love to hear about the platforms you're building, and obviously we'd be happy to answer any questions. But with that, thank you all. Have a great re:Invent.
