Data processing at massive scale on Amazon EKS

Start with a quick show of hands. How many of you today are using Amazon EKS? Ok. Good deal. Good deal.

Another quick show of hands. How many of you are using Apache Spark today? Ok. Also a good deal. Seems like a little bit less.

Uh and then last question, how many of you are running Apache Spark on Amazon EKS today? Ok. Well, awesome. Thank you so much for that.

And with that, we are going to get started. My name is Alex Lyons. I'm a senior container specialist here at AWS working on EKS. I'm joined by my colleague Vara, as well as Soam from Pinterest, and I'll let them introduce themselves a bit more when they take over.

So today I'm going to talk about Kubernetes for data processing and then use that as a segue into what we've built at AWS that we call our Data on EKS project. Then I'm going to hand over to Vara, who's going to talk briefly about open source data platforms on EKS. Then Soam is going to share with us the modernization journey he's been a part of at Pinterest as it relates to their data processing, and finally Vara is going to come back and talk about some best practices for running Apache Spark at scale.

So, Kubernetes for data: what do financial simulations, autonomous vehicles, genomics research and machine learning recommendation engines have in common? Well, for one, they're all powered by data. Data is a key input that enables each of these processes to deliver the value they deliver.

So what makes them go? Another thing is that all of these are running on Kubernetes today, and specifically on Amazon EKS, whether it be Bridgewater building the financial simulations that help them understand global financial markets, or Mobileye building the autonomous vehicle technology that's in an estimated 150 million vehicles worldwide, or companies like Roche Pharma doing biomedical research, or companies like Pinterest building machine learning recommendation engines. All of these run not only on Kubernetes but on Amazon EKS today, and they do that at a massive scale.

But why Kubernetes for data? Well, we see four key reasons for this. The first is scalability. Kubernetes allows you to independently scale the compute and memory for your application so that your infrastructure can meet your application's needs. It has capabilities like the Horizontal Pod Autoscaler and Cluster Autoscaler, and there are a number of open source tools, like AWS's open source Karpenter, that can help you manage this autoscaling.

Another key reason is orchestration. When it comes to data workloads, Kubernetes orchestration can manage things like automatically resubmitting a job when it fails. You can also specify CPU and memory resources at the job level, so you have very fine-grained control over the resources you're allocating to your applications.

Another key reason is portability. If you are running on Kubernetes, that means you're running containers, and with a container, your application code is packaged alongside all of the dependencies it needs to run, like environment variables and libraries. Because of that, you'll see fewer things like job failures when promoting from one environment to the next, because all of those environment variables are packaged alongside the code. So you can have a faster time to market, or time to production, for these applications.

Another thing that this portability gives you is the ability to run multi-version clusters. You can actually run multiple versions of Spark or other frameworks next to each other in the same cluster. Whereas if you are on a VM-based architecture, you would have to either spin up a separate VM, which incurs additional cost, to run those other Spark versions, or upgrade all of the jobs to the next version of Spark, which is going to be a very cumbersome process. So that portability gives you some flexibility there as well.

The last key reason is standardization. At AWS, we see a number of customers adopting Kubernetes at scale, and they typically do this by what we call standardizing on Kubernetes. Standardizing on Kubernetes means they're making it their default compute option for orchestrating their containerized applications. What this looks like in practice is typically building a Kubernetes platform that's managed by a central group and provides developers with the things they need to get to production: CI/CD, observability, governance, generally abstracting away those infrastructure details from the developers.

Now, when a customer starts to see the benefits from this, like a faster time to market and decreased cost, they want to extend these benefits to other applications. They can then leverage the investments they've made both in people, in terms of the central group, as well as in the tools they've built, and take those investments further. So that's another reason why we see the growth in adoption of Kubernetes for data workloads.

Now at AWS, in working with customers to help them build these data workloads on EKS, we have noticed a common set of challenges that customers run into. These are very compute intensive workloads and many have very volatile scaling patterns. If you think about batch processing, sometimes they need to scale up to over 1,000 nodes and back down, and do so relatively quickly. Your clusters need to be configured to support that.

Another challenge is building a highly available application. These are stateful applications, and your disaster recovery is going to have to account for that; it's going to need to look different. It's also about setting up observability for what could be thousands of nodes and potentially millions of microservices.

Alongside all of these, there were a number of decisions we were helping customers make around the right networking configurations, compute and storage options, and also batch scheduling options, so that they could build workloads that were ready to scale on EKS. Helping them solve these problems is why we decided to build and launch our Data on EKS project.

So Data on EKS is a collection of what we call blueprints to help customers build modern data platforms on Amazon EKS. With the project, we release these blueprints, and a blueprint includes a reference architecture so you can visualize what it is we're helping you deploy in your environment. It also has infrastructure as code, or IaC, templates that help you actually spin up the AWS resources and install the open source tools and any necessary prerequisites to run these applications in your environment. And this is done following AWS best practices, so those best practices are built right into the IaC templates. And then finally, we give you sample code to test with. In the case of Spark, we give you what are known as TPC-DS benchmarks, a very widely used benchmarking framework, so that you can not only install these resources and tools in your cluster, but then also have sample code to test against whatever solution you have today to make sure it does meet your requirements.

So Data on EKS is a public open source project. We publish all of these blueprints to our website, which you can see a screenshot of here. And our goal is really to share this knowledge as broadly as possible with the community, recognizing that Kubernetes is open source and there's a vibrant open source community around not only Kubernetes but also data tools and how the two are evolving together. And we want to continue to support that.

We build blueprints for the use cases that we see as the most prevalent amongst our customers, so that we can be as beneficial as possible and add as much value as possible. Primarily that has been around machine learning, which could be end to end from training to serving to machine learning operations, as well as data processing, primarily with Spark, which is what we're here to talk about today.

And with that, I'm going to hand over to Vara, who's going to talk a little bit about open source data platforms on EKS.

Thank you, Alex.

Hi everyone. My name is Vara Bonthu. I'm a principal solutions architect working mainly with open source technologies, and I specialize in data analytics and Kubernetes.

So one of the key things I wanted to show you today is how you can actually run open source data platforms on Kubernetes. As you can see on this slide, these are the various stages of a data platform within Kubernetes, starting with data ingestion, where Kafka is the most popular choice for streaming data ingestion workloads.

But I would like to bring your attention specifically to data processing, which is what this talk is about. Within data processing, we have a number of tools like Spark, Flink, Pandas, Dask and Beam, and all these tools are used for data processing, both batch processing and stream processing.

However, the majority of our customers favor Apache Spark for batch processing, and for some stream processing as well.

Now let's get to a little bit of an introduction to Apache Spark, and I guess most of you said you are using it. Apache Spark is a distributed processing engine mainly used to process terabytes and even petabytes of data, distributed across multiple instances.

It comes with a set of libraries like Spark SQL, Spark Streaming and GraphX, which are used to process various types of data. You can run Spark standalone on a machine such as your Mac or Windows laptop, but mainly to write your scripts and process small sets of data.

However, if you want to process terabytes and terabytes of data, you need to use a resource manager such as YARN or Mesos. In 2018, the Spark community added support for Kubernetes as a resource manager, which means you can run your Spark jobs on Kubernetes in a distributed manner, with Kubernetes acting as the resource manager.

Now, let me show you end to end how Spark on Kubernetes works. As you see on the slide, we have the control plane and the data plane. When a user runs spark-submit against the control plane, the scheduler schedules a pod with the Spark driver, and a headless service for the driver gets created in the data plane.

As step two, the driver asks the API server for a number of executors to be created, the scheduler creates those executors in the data plane, and those executors connect back to the driver pod using the headless service that you see. That's how the end-to-end communication for Spark on Kubernetes works.
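For reference, here is a minimal sketch of how that submission is often expressed declaratively using the Spark Operator's SparkApplication CRD, which comes up later in the talk. The image, namespace, service account and example job below are assumptions for illustration, not something from the talk itself:

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi                  # hypothetical example job
  namespace: spark-jobs           # hypothetical namespace
spec:
  type: Scala
  mode: cluster
  image: "apache/spark:3.5.0"     # assumed image; substitute your own Spark image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar"
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark         # assumed service account with permission to create executor pods
  executor:
    instances: 4
    cores: 2
    memory: "4g"
```

Under the hood the operator still runs spark-submit for you, so the driver pod, headless service and executor pods described above are created the same way.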

Now I'm going to hand over to Soam, who's going to talk about Pinterest's modernization journey.

Thank you.

Hi folks. First, thank you, Alex. Thank you, Vara, for giving us an opportunity to talk about what we've been doing with EKS and our modernization journey thus far.

First, a couple of words on what Pinterest is and what it does. It's a visual search and discovery engine with some social media flavorings. I think Wall Street sometimes considers us social media, but we don't necessarily feel that way; we're more on the visual search and discovery side of things.

At the last quarterly report, our monthly active users, or MAUs, were at 482 million, and our motto is that people use Pinterest to find inspiration for creating a life they love. We work in the metaphor of pins and boards. You find assets, websites, pictures you like, you pin them, and you organize these pins via boards. So that's a top-level summary of what Pinterest does.

In terms of how that actually translates into data processing, here is a highly opinionated overview of what the data flow within our infrastructure looks like.

So users are manipulating pins and boards, and they're doing so via apps and via the web browser. There is also a source of third party data as well, which can be, for instance, merchant catalogs and things like that. Those flow into our front end API services. Downstream from the front end API services, we have large Kafka clusters for getting log information in a streaming, real-time way, and that can go into our streaming infrastructure. There are also databases that, say, keep track of users' board and pin memberships, and there are data services that take care of that. Dumps from those databases and the output of Apache Kafka also flow into our massive S3 data lake. Downstream of that, we have our big data platform, where a lot of batch and streaming processing occurs, and in conjunction with the machine learning platforms that are also running there, we output signals, models, recommendations and a huge set of data warehouses. So that's a highly opinionated view of data flowing and being processed to produce useful data products within Pinterest.

Zooming in further on our batch processing platform: currently we are using Hadoop, or at least we're using Hadoop in a fairly massive way. Some quick numbers: our Hadoop cluster nodes at this time are around 14,000-plus instances in the cloud, and our biggest cluster is around 2,500-plus nodes. We run about 70,000 Hadoop jobs per day, we process hundreds of petabytes of data daily, and we have more than half an exabyte of data in S3 under management. So with that in mind, we have a very big Hadoop base.

Why would we consider moving? There were a couple of reasons for us. Hadoop is an established service, but we felt it was getting old. We wanted to move to something a bit more cloud native, more agile, and there was already a lot of momentum towards Spark internally at Pinterest.

We had been moving our Hive jobs to Spark SQL, for example, and we had been moving a bunch of our MapReduce jobs to Spark to take advantage of a lot of the benefits that Spark offers; I think I'm preaching to the converted here. We've also been investing in Spark upgrades and modifications to the central Spark code base.

So that part was a no-brainer. And then moving to a cloud native solution like EKS that would allow us to try things to drive greater cost efficiency, that was very tempting. All taken together, it really reduces the dev iteration cycle and improves productivity.

So those really were our main reasons for building our Spark on EKS platform, which we call Moka.

So this is a complicated diagram. I don't have much time to spend on it, but bear with me. One of the reasons I wanted to throw this up is to point out that just getting Spark on EKS by itself is not sufficient. You have to build a bunch of supporting services around it if you really want to process data at scale.

So on here, starting from the left hand side is how users interact with the system. We have a service called Spinner; that's our Airflow-based tool for writing or composing job DAGs, so that's what produces workflows. Then we have a service called Archer, which is a job submission service we wrote specifically for taking these DAG definitions from Spinner and submitting them into a set of EKS clusters.

And then within that, we have multiple EKS clusters running. We had to outfit those with the Spark operator that Vara and Alex have been talking about, or will talk about more. And then for scheduling, we are using YuniKorn, because we found that YuniKorn, for us at least, was the closest analog to YARN on Hadoop.

Beyond that, when jobs run, they will need images to pull from, and that's where ECR comes in. And even though we have gotten rid of Hive and are running Spark SQL, we still need the Hive Metastore, because that's where the table definitions are.

You'll notice we have some other boxes in slightly different colors; these are services we're working towards that will provide things like Jupyter and Spark SQL UI submission, and also finer-grained access control and more security.

On the top right hand side, there is also the remote shuffle service. As you know, Spark itself has an external shuffle service, but we felt the need to move that to a separate service to open up possibilities like autoscaling within the EKS cluster. So that's an ongoing project we are continuing to invest in.

Once jobs run, they produce metrics and they produce logging, so there are services to consume those. Users will want to connect to the live Spark UI while the job is running, so we have an ingress service that allows for that. And then once jobs have finished running, they'll want to see logs, and that's where the Spark history server comes in.

A quick note on the migration timeline. We spent last year evaluating a variety of Spark runtime solutions before settling on EKS. In the first half of this year, we hunkered down and started productionizing our Spark on EKS setup. We focused on how to operationalize it, putting in logging and metrics, and then really shifting our first set of jobs away from Hadoop into Moka. So that's around 10%.

And then this half, we are now starting to look at some of these scaling opportunities and looking to migrate around 20% more jobs. In H1, we migrated Spark Scala jobs exclusively; this half, we are focusing on a mixture of Scala as well as Spark SQL and migrating those. Next year, our focus will be doing more work on integrating into the rest of Pinterest, specifically, for instance, the Pinterest service mesh, which is a big one. We also want to look into PySpark migration, finer-grained access control, and then start open sourcing some of these components, because we feel there would be a need in the community. I'm sure we are solving problems that others will also encounter, so if you are interested in learning more about some of these, I'm happy to share.

So in summary, some of the wins and learnings we have had from our project thus far. A big win for us was being able to move to Graviton and the cost savings that come from that. Thus far, our Hadoop clusters had been running on Intel and AMD based instances, but now our EKS clusters are Graviton. Getting access to standard Kubernetes APIs at a deep level means we can mix and match various open source components and get them running without fear of running into compatibility issues and things like that. And we don't have to worry about managing the control plane for the Kubernetes cluster; that's taken care of.

And then open source components that actually have an AWS flavoring, like EMR and Fluent Bit, have been very useful for us as well, simply because we've been able to work with AWS and really understand how best to deploy those within the Pinterest environment. Pricing is always great, meaning the EKS overhead is minimal; most of the cost actually comes from the EC2 instances. We are always looking for opportunities to save, so this was icing on the cake.

And then, as we've mentioned throughout the talk, Data on EKS and EKS Blueprints are things we have leveraged in our Terraform deployment infrastructure to make it easier to deploy clusters. And of course, underlying all of this, the support and work with the various AWS folks here has been tremendous, and we're hoping to do more of that going forward.

In terms of actual sizing, we have a couple of clusters totaling around 800 nodes, and we want to get to around 1,000 by end of year; next year, we have even more ambitious plans for cluster sizing. Some of the learnings for us: deploying EKS comes with its own set of challenges, specifically around networking and load on the EC2 APIs.

So we have had to work carefully with the AWS team to get around some of those things. Control plane logs are another: we'd like to find a better way of exporting those into the Pinterest infrastructure. And then, because we have run our own Hadoop clusters forever and ever, we are accustomed to mixing and matching various Hadoop patches and deploying those into the platform.

Obviously, because we don't own EKS, we don't have that luxury. We have to adhere to EKS's Kubernetes release schedule. The regular EKS upgrades are another thing: you're essentially on the clock, and you have some time, but you still have to migrate EKS versions. So that's also a consideration.

On the Spark side, I could probably talk a fair amount on just that alone, but I'll keep it brief. One of our goals for migration was to keep all the details hidden, or transparent, to the user. From our internal customers' standpoint, they're simply writing Airflow DAGs, and whether a job goes to Hadoop or to Moka should be immaterial.

So what that meant is we had to be careful in terms of capturing job configurations and making sure those got translated into the Spark on EKS environment. And note that, because we're changing instance types and architectures from AMD to Graviton, we have to be careful with native libraries and make sure we can migrate those to Graviton as well.

A big one is that we're shifting away from Hadoop, and Hadoop, in addition to running MapReduce and Spark, provides a bunch of other things we kind of take for granted. It provides the YARN scheduler, it provides log aggregation, it exports metrics, it has its own file system, it has a UI. In moving away from it, we had to find alternatives or write some custom pieces from scratch. So that's another consideration going forward.

And then, because of the scale at which we operate, some of the alternatives we found are kind of rough around the edges, which means we have to put that much more work into them to make sure they operate at the scale we are accustomed to.

So overall, and I don't use this term lightly, I would say that moving to EKS is transformative for Pinterest data engineering, and I would say for infrastructure as well.

In terms of going forward, we have been heartened by our success with Spark on EKS, and we're looking to do the same for Flink, which is what we use for our streaming platform, as well as TiDB, which we're using more on the storage, caching and key-value store side. And then eventually, other workloads will also move to EKS.

So thank you for listening and I'll hand you back to Vara. Thank you so much. All right, thank you.

So, many of you said you are running Spark on EKS. One thing I wanted to ask: how many of you have hit the IP exhaustion issue when running Spark at scale? I can't quite see with the lights. Right, so we're going to discuss some of the best practices which help you scale your Spark workloads on EKS.

You can avoid some of the common issues by following these best practices. The workloads vary from customer to customer based on what you run, and we at AWS will help you solve some of those problems to scale your workloads to the size you are looking for.

On this slide you see a VPC with a secondary CIDR. One thing we noticed is that a lot of data teams and data science teams just go and ask the platform team to create an EKS cluster so they can run Spark jobs, and the platform team has no idea how many IPs they need to allocate. They normally create smaller subnets, and then the data team hits the issue saying, hey, we can't run the jobs anymore because we ran out of IPs.

So that's a common issue, and we highly recommend customers like you to consider the networking configuration. The configuration is what I'm showing on the slide: within the VPC, you can use two routable subnets with a small CIDR range, which are mainly used for load balancers, any public interface, or talking to your on-prem data center and so on.

But when it comes to running your actual Spark workloads, and even the nodes, you can leverage the secondary CIDR range with the non-routable IPs. As you see here, I have two subnets across two AZs, and each private subnet on the secondary CIDR range has 65,000 IPs, which gives you 130,000-plus IPs for the pods you can run on Kubernetes.
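As a rough sketch, with the Amazon VPC CNI this pattern is typically wired up through custom networking: you enable AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG on the aws-node daemonset and create one ENIConfig per availability zone pointing at the secondary-CIDR subnet. The subnet and security group IDs below are placeholders:

```yaml
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-west-2a                 # one ENIConfig per AZ, named after the zone when ENI_CONFIG_LABEL_DEF is topology.kubernetes.io/zone
spec:
  subnet: subnet-0123456789abcdef0 # placeholder: secondary-CIDR subnet used for pod IPs
  securityGroups:
    - sg-0123456789abcdef0         # placeholder: security group applied to pod ENIs
```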

Now, the other thing we tend to see from customers running at scale is UnknownHostException errors from CoreDNS. This is a very common thing. One way to avoid this issue is to horizontally scale CoreDNS with the cluster-proportional-autoscaler.

What normally happens is that when CoreDNS is deployed, by default it comes with two pods for your EKS cluster. But when it comes to running thousands of nodes, those two pods are not enough to provide DNS service to the rest of the pods.

To avoid that situation, with the cluster-proportional-autoscaler you can horizontally scale the CoreDNS pods based on the number of worker nodes you are spinning up. Another best practice is NodeLocal DNSCache, which you can deploy with a simple kubectl command on every single node. That improves performance by caching DNS locally rather than going back to CoreDNS for every resolution.
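A minimal sketch of the autoscaler's sizing policy, assuming the upstream cluster-proportional-autoscaler is deployed to watch the CoreDNS deployment; the ratios here are illustrative defaults, not a recommendation from the talk:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler             # the ConfigMap name the autoscaler is configured to watch
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2,
      "preventSinglePointFailure": true,
      "includeUnschedulableNodes": true
    }
```

With this linear policy, the autoscaler adds one CoreDNS replica per 16 nodes or per 256 cores, whichever yields more replicas, and never drops below two.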

Now, moving on to storage. Storage is really crucial when it comes to running Spark on Kubernetes. You might have run these workloads before on Hadoop or on Hadoop-based flavors like EMR, but with Spark on EKS, storage is a completely different ball game. As you see on this slide, when you want high performance and low latency, we highly recommend using NVMe SSD based instances with Spark.

As you see on the slide, you have two worker nodes: one is a c5d.xlarge and the other is a c5d.12xlarge. The c5d.xlarge has one NVMe SSD attached, which is 100 GB, while the c5d.12xlarge has two NVMe SSDs attached. When it comes to using these volumes for your Spark jobs' shuffle data, you can point Spark's local directory, for example /var/local, at that storage using a hostPath volume; in this case, as you see on the slide, there is a volume that defines the hostPath. But in the case where you have multiple SSDs, as on the second instance, how does the data engineer know whether the instance has two disks or one?

To avoid that situation, we recommend using RAID0. As you see on worker node two, when you spin up these nodes you can configure RAID0, which gives you one mount path for all the disks available in each instance.
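For illustration, here is a minimal sketch of the Spark configuration that mounts such a host path into every executor, shown as the sparkConf block of a SparkApplication (with plain spark-submit these would be --conf flags). The /var/local/spark container path and the /mnt/k8s-disks/0 host path are assumptions standing in for whatever your RAID0 bootstrap mounts; the spark-local-dir- name prefix is what tells Spark to use the volume as local/shuffle storage:

```yaml
sparkConf:
  spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path: "/var/local/spark"
  spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly: "false"
  spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path: "/mnt/k8s-disks/0"   # assumed RAID0 mount point on the host
```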

Now, option two: you can also use Amazon EBS for shuffle storage with the EBS CSI driver, which creates PersistentVolumeClaims. That means if you ask for 10 executors and every executor needs 40 GB, it will create 10 EBS volumes with 40 GB each. As you see on the slide, every executor has a dedicated EBS volume. This option also lets you choose from all the instances that AWS offers today.

Compared to the previous option, NVMe SSD, where you only have a limited set of d-type instances, when you want to grow to a larger cluster and raw performance is not the main concern, option two, using the EBS CSI driver with PVCs, is the way to run the workloads.

One of the key features this offers is reusing PersistentVolumeClaims. As you see here, worker node one has two EBS volumes attached to two executors. If worker node one gets interrupted by a Spot interruption and the node terminates, you can still reuse those two volumes when the new node comes up. With this feature, the new node attaches to the existing EBS volumes rather than creating new ones, which means the job can resume where it stopped rather than recomputing from scratch.
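A minimal sketch of the Spark settings behind this pattern, again shown as a sparkConf block; the gp3 storage class, the 40Gi size and the mount path are assumptions. The OnDemand claim name tells Spark to create one PVC per executor, and the last two settings let the driver own and reuse those PVCs when executors are restarted:

```yaml
sparkConf:
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName: "OnDemand"
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass: "gp3"   # assumed EBS CSI storage class
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit: "40Gi"
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path: "/data/spark"     # assumed mount path
  spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly: "false"
  spark.kubernetes.driver.ownPersistentVolumeClaims: "true"
  spark.kubernetes.driver.reusePersistentVolumeClaims: "true"
```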

Further to this, you can also leverage dynamic resource allocation, which Spark offers, and which helps you scale executors dynamically based on the task load.

Now, moving on to compute. I think this is one of the most common asks from customers who want to run Spark on EKS. One thing to remember: the Spark driver is a single point of failure. If the driver gets killed, the entire job dies, so you have to restart the job and redo the computation.

To avoid that situation, we highly recommend you run the Spark driver on On-Demand Instances only. If that is not cost effective for you, we recommend looking at Reserved Instances, which give you up to 72% off On-Demand pricing, specifically for the Spark driver.

For Spark executors, you can use Spot Instances, and with the help of Karpenter or other autoscalers you can scale up and down. Even if executors get deleted or their nodes are interrupted by Spot, the driver can still spin up those executors on a new node and continue the process.
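One way to express that split, sketched here as driver and executor pod templates using the capacity-type label that Karpenter applies to its nodes; the file names are assumptions:

```yaml
# driver-pod-template.yaml: pin the single point of failure to On-Demand capacity
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
---
# executor-pod-template.yaml: let the fault-tolerant executors ride on Spot capacity
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
```

These templates would typically be referenced through spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile.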

And when it comes to compute, Soam already explained the Pinterest use case: they started the journey on Intel but eventually moved to Graviton, and they have seen better performance and reduced cost. So we highly recommend you look at Graviton3, which we have tested and which works really well for Spark running on EKS.

Now, how many of you have hit an out of memory error when running Spark on EKS? Right, I see a lot of hands here, and I tend to get this question quite a lot, because most of you are moving your workloads from Hadoop to EKS. When you make that move, one thing you need to consider is the memory overhead factor.

By default, Apache Spark uses 0.1, which is 10% of memory, as the memory overhead factor. That is fine for running Scala or Java jobs, but when you want to run PySpark jobs on Kubernetes, one thing to consider is that you need to give at least 40% of memory as the memory overhead factor.
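As a sketch, the relevant setting looks like this; the 8g executor memory is just an example, and newer Spark versions also expose per-role spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor settings:

```yaml
sparkConf:
  spark.executor.memory: "8g"                      # example executor heap size
  spark.kubernetes.memoryOverheadFactor: "0.4"     # reserve ~40% extra for Python workers and other off-heap use
```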

Now, moving on to the autoscaling bit. I see customers using a mix of both options: static clusters, like Pinterest does, where you can use Reserved Instances to run your Spark on EKS workloads, and at the same time, a lot of customers who want to optimize for cost want to leverage the Cluster Autoscaler or Karpenter.

I highly recommend using Karpenter, which is more powerful. It's an operator-based solution that does not require managed node groups or the Cluster Autoscaler; all you need is the operator installed on the EKS cluster. Once that is installed, you use the template you see here, a NodePool template. These used to be called Provisioners in previous versions; as of v0.32 it's called a NodePool. You just define one NodePool as a simple YAML manifest, and within the NodePool you can define the instances, Spot and On-Demand.
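A minimal NodePool sketch along those lines, assuming the Karpenter v1beta1 API (v0.32+); the name, instance families, CPU limit and the EC2NodeClass it references are placeholders:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spark-compute                      # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: spark-nodes                  # placeholder EC2NodeClass (see the sketch further below)
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]    # let pod templates pick a capacity type via nodeSelector
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c5d", "r5d"]           # placeholder NVMe-backed families
  limits:
    cpu: "2000"                            # cap the total vCPUs this pool can launch
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
```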

Then, when you run your Spark jobs, you have a concept called pod templates for both the drivers and executors. Within the pod templates, you can define that you want to run the driver on an On-Demand instance and the executors on Spot instances, using Karpenter-managed labels. Karpenter will spin up those nodes for you, and once the job is finished, Karpenter will scale down those nodes automatically.

As for speed, Karpenter at the moment can spin up a node in under a minute, compared to the Cluster Autoscaler, which can take up to 2 to 3 minutes.

Now, there is a question customers ask: hey, with the Cluster Autoscaler we use a lot of launch templates and we do a lot of configuration with the user data and so on; can we do the same thing with Karpenter? Yes, you can, with the help of the EC2NodeClass, which is another template. You can use the EC2NodeClass to define your user data on top of the EKS optimized AMI and configure it the way you want; it is going to create a launch template for you behind the scenes and spin up a node. You can even use a custom AMI.

And as you see here, we also mentioned configuring RAID0. When you want to leverage NVMe SSDs with Karpenter, you can choose a range of instances, c5.xlarge to c5.24xlarge, and the configuration you see on the slide identifies whether there are any NVMe disks available on the EC2 instance. If there are, it formats and mounts those disks, which makes them available to data engineers as a single path for shuffle data storage.
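A rough sketch of that idea as an EC2NodeClass, assuming the karpenter.k8s.aws/v1beta1 API; the role, discovery tags and the simplified RAID0 script are placeholders, and the NVMe detection here is deliberately naive (a real bootstrap script should distinguish instance-store disks from EBS volumes and persist the mount):

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: spark-nodes                              # matches the NodePool's nodeClassRef above
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-my-cluster"           # placeholder node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster       # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  userData: |
    #!/bin/bash
    # Naive sketch: stripe any non-root NVMe disks into a single RAID0 array and
    # mount it where the Spark pod templates expect their hostPath scratch space.
    DISKS=$(ls /dev/nvme*n1 2>/dev/null | grep -v nvme0n1 || true)
    COUNT=$(echo "$DISKS" | wc -w)
    if [ "$COUNT" -gt 0 ]; then
      mdadm --create --force /dev/md0 --level=0 --raid-devices="$COUNT" $DISKS
      mkfs.xfs /dev/md0
      mkdir -p /mnt/k8s-disks/0
      mount /dev/md0 /mnt/k8s-disks/0
    fi
```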

Now, moving on to Kubernetes batch schedulers. Kubernetes comes with its default scheduler, which is a single queue: when you run hundreds of jobs, they go into a FIFO queue, and it takes one pod at a time and places those workloads. But customers who come from Hadoop workloads and are used to YARN want multiple queues, so that multiple queues can be used when they run their Spark jobs on EKS as well.

To fill that gap, batch schedulers have evolved. One of them is Apache YuniKorn, which Pinterest mentioned they are using, and another tool is Volcano. One of the key features these offer is multi-tenancy and resource isolation. Say, for example, you have multiple teams within your organization and you want to control how much CPU and memory they can use. With these batch schedulers, you can create multiple queues and set quotas for what each queue can use. That way you avoid the noisy neighbor situation, and each team can only access what they have been given in the first place.

They also provide various workload queuing options; you can use fair scheduling or FIFO queues. And most importantly, gang scheduling, which is a key feature these batch schedulers offer. Like I talked about before, when you run a Spark job it's a two step process: first the driver gets created by the scheduler, then the driver requests the executors, and then the scheduler schedules those executors. To avoid that two step process, gang scheduling, when you submit a job, creates temporary placeholder pods for the driver and the executors and schedules them as one application rather than pod by pod, which is a really cool feature and something you should look into.
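For a feel of what that looks like with Apache YuniKorn, here is a hedged sketch of driver pod annotations declaring task groups so placeholders reserve capacity for the whole application up front; the queue name, member counts and resource sizes are placeholders:

```yaml
# excerpt of a driver pod template for YuniKorn gang scheduling
metadata:
  labels:
    queue: root.data-team                            # placeholder YuniKorn queue
  annotations:
    yunikorn.apache.org/task-group-name: "spark-driver"
    yunikorn.apache.org/task-groups: |-
      [{
        "name": "spark-driver",
        "minMember": 1,
        "minResource": {"cpu": "1", "memory": "2Gi"}
      },
      {
        "name": "spark-executor",
        "minMember": 10,
        "minResource": {"cpu": "2", "memory": "4Gi"}
      }]
```

The executor pod template would then carry yunikorn.apache.org/task-group-name: "spark-executor" so its placeholders are claimed by the real executors.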

In addition to that, you have application sorting, job priority and preemption, and the various other features these batch schedulers offer. I highly recommend using these batch schedulers when you are running Spark on EKS.

Now, moving on to metrics. So we are running Spark on EKS; how do we actually monitor the Spark jobs? That's a big question. With the open source CNCF ecosystem that EKS supports, you have a number of tools you can use, including vendor solutions as well. In the solution you see here, we are using Prometheus to collect and store the metrics, writing those Prometheus metrics using remote write to Amazon Managed Service for Prometheus, and then using Grafana to visualize them.

There are some open source Grafana dashboards out there for Apache Spark which you can use to visualize the metrics, not only the node resource utilization but also the JVM metrics for the Spark job, so you can look at a granular level at how your shuffle data is performing and how each task is performing.

One thing to note here: Apache Spark added a Prometheus sink, a built-in feature that comes with Apache Spark. Within the Spark configuration, when you're running a job, you can say you want to expose these metrics to Prometheus. It exposes the driver and executor metrics, Prometheus scrapes them, and then you can visualize them using Grafana.
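A minimal sketch of what enabling that built-in sink typically looks like in the Spark configuration (Spark 3.0+); the annotation-based scrape hints at the end are an assumption about how your Prometheus discovers pods:

```yaml
sparkConf:
  spark.ui.prometheus.enabled: "true"                 # executor metrics served at /metrics/executors/prometheus on the driver UI
  "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
  "spark.metrics.conf.*.sink.prometheusServlet.path": "/metrics/prometheus"   # driver metrics endpoint
  spark.kubernetes.driver.annotation.prometheus.io/scrape: "true"             # assumed annotation-based Prometheus discovery
  spark.kubernetes.driver.annotation.prometheus.io/port: "4040"
```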

In addition to that, you can also use CloudWatch Insights and CloudWatch dashboards. Most importantly, there is the Spark History Server, which is a very important tool for all the data engineers who want to tune their data processing jobs and see how well the code is written. In some situations you can see that certain stages take a long time; the way to identify that is to look at the Spark History Server, visualize which stage is taking the time, and then go back and optimize your code to make it perform better.

All the data engineers also want to visualize the job while it is running. Tools like the Spark Operator provide a built-in feature where you can create an ingress object to view the live Spark UI while the job is running. With the options we have within our blueprints, you can enable the event logs to write to an S3 bucket, deploy the Spark History Server pointing to that S3 bucket, and visualize both the historical metrics and the live metrics using an ingress and a load balancer.

Now, Spark logging. We talked about the metrics; now coming back to logging. With Hadoop, YARN does all the log aggregation, so how does logging work on EKS? For this, you need daemonsets like Fluent Bit to extract the logs from all the Spark pods, the drivers and the executors. In the example you see, I have two drivers running and a Fluent Bit pod running on the same node. Fluent Bit is a daemonset; it extracts all the logs from every single pod, both drivers and executors, and you can configure it to write to an S3 bucket or to other destinations that Fluent Bit offers.

It is more cost optimized when you write to an S3 bucket, and you can use Athena or various other tools to query these logs and see what's going on with the job. And the most important best practice when it comes to the scalability of running these workloads on EKS: a lot of these daemonsets sometimes make too many calls to the API server.

One thing to consider when you bring any open source tool into EKS is to identify how that operator or daemonset behaves in terms of API calls: is it making too many? In cases like Fluent Bit, we recommend using the extra filter options. If you notice in there, I set Use_Kubelet to true, which means that if Fluent Bit requires any metadata, it can get it from the kubelet on the node itself rather than making a number of calls to the API server. That is highly recommended when you are deploying Fluent Bit; it reduces the number of calls you make and improves the scalability of your jobs.
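As a sketch, that setting lives in the Fluent Bit kubernetes filter; here it is as an excerpt of a ConfigMap, with the namespace and surrounding values as placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging                    # placeholder namespace
data:
  filters.conf: |
    # Fetch pod metadata from the node's kubelet instead of the Kubernetes API server
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Use_Kubelet         true
        Kubelet_Port        10250
        Kube_Meta_Cache_TTL 300s
```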

With that, I'm going to hand over to Alex. Thank you.
