Netflix Maestro: Orchestrating scaled data & ML workflows in the cloud

Good afternoon, everybody. My name is May, and I'm part of the A team that supports Netflix. Now, I've only been on the Netflix team for just over a year, but I just found out that I have been a Netflix subscriber for 17 years.

Now, can I see a show of hands? Any Netflix subscribers out there? Anybody longer than 17 years? I see somebody. Alright, awesome.

So today we will talk about Netflix Maestro. Now, one of the reasons I've been a Netflix subscriber for over 17 years is that Netflix recommends a lot of shows and movies that I enjoy. They have a powerful recommendation engine, and it is powered by Netflix Maestro.

We have two Netflix engineers here, Jun He and Akash Dwivedi, who will be talking about Netflix Maestro. And with that, I will hand it off to Jun, who will kick us off. Thank you.

My name is Jun He, and I'm a staff engineer on the data platform team at Netflix. It's our pleasure today to give this presentation about the Netflix workflow orchestrator, Maestro. So let's get started by looking at the landscape of data and ML workloads at Netflix.

And thanks, we just did a quick survey. Glad to see that we have some subscribers here, so thank you, and I hope you enjoy the shows. As you, and also May, might have experienced when watching Netflix, the recommendations you see are pretty good, and they are the result of multiple data pipelines and machine learning workflows.

Netflix is also a data-driven company; many business decisions are driven by data insights. For example, everything from the colors on the landing page, to investing in content acquisition, to training algorithms, even to detecting things like security issues.

So data engineers, data scientists, other software engineers, and even content producers all run their workflows or data pipelines to get the insights they need. We have a wide variety of use cases for data: users want to run ETL pipelines to process data, train their ML models on that data, run A/B tests, and so on.

As the company scales globally and relies on data to power its business in all phases, the scalability and usability of the workflow orchestrator have become very important.

Over the past several years, we have built a scalable and user-friendly workflow orchestrator for Netflix users to give them a consistent way to manage and automate their ETL pipelines. This is Maestro. Maestro is a fully managed service; it provides workflow-as-a-service to thousands of internal Netflix users.

Users use it to develop their ETL pipelines and do their daily jobs. It is reliable and scalable and has become a robust foundation: users just build their business logic, and Maestro is responsible for managing the lifecycle of its execution.

Maestro consists of multiple components: a workflow engine, an alerting service, and error classification services. Maestro also provides the UI and abstraction layers, such as templates and domain-specific languages, for our users to define workflows easily.

This workflow system has worked quite well for data engineers and even ML engineers. Instead of being a very simple workflow scheduler, we added lots of features to make it very user friendly and powerful. For example, we provide patterns like foreach, subworkflow, and conditional branching.

Users can use those patterns to define very complicated workflows in a very simple way. We also provide template support; for example, users can run their Spark jobs or Trino jobs very easily by just providing a few parameters, and the whole complexity of setting up the engine is handled by Maestro.

Our users can just focus on their business logic without worrying about those kinds of internal details. Triggering is also important: Maestro supports time-based triggers, such as cron, and we support signal triggers as well.

Then users can run their ETL pipelines or workflows once, for example, an upstream table partition is ready, which is very efficient. Security is another important feature, to protect our users and also the data warehouse.

You may wonder why we built our own workflow orchestrator instead of using an existing solution. We started the Maestro project about three years ago, and we were unable to find a viable solution for the Netflix scale. That was one of the main reasons we decided to build our own.

These are the challenges we are facing at Netflix. When we talk about scalability, it's not only the number of workflows the system has to manage; it's also about the size of a single workflow. A very common request from our end users is that they would like to train their machine learning models or run backfill pipelines hundreds of thousands of times, or even a million times as we have seen.

That's very convenient for the users, but it puts a lot of pressure on the workflow engine, and we could not find a viable approach for that kind of scale. This example shows the kind of pipeline our users can build. It uses the foreach pattern, which lets them easily build these very large-scale training or backfill pipelines.

That's convenient, but we have to deal with that scale. Additionally, as I mentioned, we have thousands of users using this workflow engine. They build their daily or hourly ETL pipelines or train their machine learning models, and they work with it every day. So usability is super important because of that large user base.

Users have different preferences and different backgrounds. Some may like to write SQL, some may like to write Python, Scala, or Java. After checking the existing solutions, we could not find a good one to solve these problems.

So we decided to build our own workflow orchestrator to support all our use cases. Here I show an example of how our users use YAML to define a workflow. This is a daily workflow running a Spark job. You can see that the user just needs to provide some parameters, like my_query, which holds the SQL query the user wants to run.

They can also provide a link to the Spark job JAR. Then Maestro will take care of the whole lifecycle of the execution. The complexity is completely hidden from end users, and they can just focus on developing their business logic.
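Since the slide itself isn't reproduced in this transcript, here is a minimal illustrative sketch of the kind of definition being described, written as a plain Python dict and rendered to YAML. The key names (`jobs`, `type`, `params`, `my_query`, `app_jar`) and the S3 path are assumptions made for illustration, not Maestro's exact schema.

```python
# Illustrative sketch only: key names and paths are assumptions,
# not Maestro's exact YAML schema.
import yaml  # requires PyYAML

daily_spark_workflow = {
    "id": "daily_aggregation",
    "jobs": [
        {
            "id": "run_spark_query",
            "type": "spark",  # templated step type; Maestro handles the engine setup
            "params": {
                # roughly the only things the user has to supply
                "my_query": "SELECT title, play_count FROM plays WHERE ds = '2024-01-01'",
                "app_jar": "s3://example-bucket/jobs/daily-aggregation.jar",
            },
        }
    ],
}

print(yaml.safe_dump(daily_spark_workflow, sort_keys=False))
```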

This graph shows the scale of Maestro. You can see the growth rate is high, with a big increase every year. Maestro runs tens of thousands of workflows and a million jobs per day. The load is also uneven and very spiky, especially at midnight or at the start of an hour.

Because Maestro solves both the scalability and the usability issues, it works quite well to support a wide variety of use cases. Netflix users use it to schedule their ETL pipelines, which can be cron-based or signal-based, and they can manually run backfill jobs spanning multiple years.

We also have many specialized pipelines built on top of Maestro. For example, we have an open source machine learning framework called Metaflow. Metaflow is integrated with Maestro, and all of its production workflows run on top of Maestro.

Other teams also build their pipelines on Maestro for A/B testing, payment auditing, or simple workflows that just move data between data stores. So Maestro has become a paved path for all users at Netflix to access the data warehouse and run their batch jobs.

This diagram shows our high-level architecture. Users use the UI, clients, or the CLI to interact with Maestro over the API, and the API gateway is the abstraction layer that provides a public interface for end users, so they don't need to know the details of the workflow engine.

The workflow engine is the core, which manages the state of workflow executions and also provides versioning. It also has triggering support, either time-based or signal-based. All executions run in Docker containers.

There are also lots of downstream services; Maestro integrates with them in a loosely coupled way by emitting state change events to an event stream.

This is a link to our blog, which has all the technical details. Next, I'm going to pass the presentation to my colleague, Akash.

Hello everyone. My name is Akash Dwivedi, and I'm a senior software engineer at Netflix. Thanks for attending our talk. Thanks, Jun.

I'm pretty sure many of you have faced the challenge of creating a suitable abstraction for a platform that powers thousands of users, and we faced the same. So in the next couple of slides, I'm going to cover some critical considerations we made while building Maestro.

The first is interfaces. Maestro offers domain-specific languages in YAML, Python, and Java, and we have APIs for REST and GraphQL. Can you guys hear me?

Next, we have the REST and GraphQL interfaces for programmatic access. For example, other platforms like Metaflow create workflows at scale using the programmatic interfaces. And then we have the Maestro UI, which users can use to interact with their workflows, look at previous runs, and debug any issues.

Next is integrations. Maestro is built for the users; we welcome all use cases, and if you like working with shell scripts, notebooks, Java, Docker, or SQL, Maestro has got you covered.

The next important consideration is the dynamic execution of workflows. Static workflows are pretty much useless, and purely dynamic ones are very hard to understand. So Maestro's approach to dynamic execution is parameterized workflows, where you create parameters in the workflow.

Parameters look very similar to variables in any programming language: you can use them later, reference them, pass their references, or use them as function parameters. Workflows also come with distinctive features such as conditional branching and subworkflows.

They are quite powerful, and we will cover some examples of parameterized workflows in later slides. Maestro also offers dynamic code injection in Java directly in the workflow definition. These are mostly used as functions, and Maestro executes them in a sandboxed environment, ensuring they run in isolation.

Next is extensibility. Maestro is extensible: users can create new step types by simply writing notebooks and associating some parameters with those notebooks. Users can also bring their own compute to power their time-critical workloads.

Maestro also offers async executions using the signals API, which I won't cover in detail, but it's quite useful. Signals look very similar to the wait and notify APIs of a programming language, where an upstream workflow wants to notify a set of downstream workflows when a task is completed.

Next, let's explore the DSLs offered by Maestro. Maestro offers DSLs in YAML, Python, and Java, and here you can see an example of a very simple workflow represented in all three formats. This is a very simple workflow that runs daily at midnight in the US Pacific time zone; that is described by the trigger block here.

This workflow has an ID and two jobs. The first job is of the Spark type, running a very simple query, and the second job is of the notebook job type, pointing to a notebook in some S3 location. The same workflow is represented using the Python and Java DSLs.
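Again, the slide isn't shown in the transcript, so the following is a rough mock-up of the workflow shape being described: a cron trigger for daily midnight runs in the US Pacific time zone, one Spark job, and one notebook job pointing at S3. The field names, cron key, and paths are illustrative assumptions rather than Maestro's exact DSL syntax.

```python
# Rough mock-up of the workflow described on the slide; field names and values
# are assumptions, not exact Maestro DSL syntax.
import yaml  # requires PyYAML

workflow = {
    "id": "daily_example_workflow",
    "trigger": {
        "cron": "0 0 * * *",       # daily at midnight
        "timezone": "US/Pacific",
    },
    "jobs": [
        {"id": "job1", "type": "spark",
         "params": {"query": "SELECT COUNT(1) FROM plays"}},
        {"id": "job2", "type": "notebook",
         "params": {"notebook": "s3://example-bucket/notebooks/daily_report.ipynb"}},
    ],
}

print(yaml.safe_dump(workflow, sort_keys=False))
```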

We crafted the Maestro DSLs to ensure that workflow definitions are readable, so users can understand what's going on and refer back to them later; reproducible, so users can create new workflows by following the patterns used in other workflows; and debuggable, so users can see what parameters were passed at execution time and refer to them.

Now let's go over a practical example of a backfill workflow. Here we will demonstrate the power of parameterization in Maestro execution. This is a very simple workflow, but we will go over some of the parameterization. Think about a user who wants to run a backfill.

The actual backfill logic is all captured in this notebook, which is stored in S3. We create two parameters, from_date and to_date, very similar to how we create variables in programming languages, and we pass some values here. Then we have the first step, which computes the backfill date ranges between from_date and to_date.

We are using a SEL function here that returns the dates in between. This one is offered by Maestro, but you can write your own function, similar to injecting a Java function here. We pass the from_date and to_date references, and at execution time it returns the array of dates between from_date and to_date.

So there are going to be 365 dates here, which are later substituted into the dates parameter of the first step. Next, we have the foreach step. The foreach step runs the backfill job; here there is only one job, but there could be a DAG of jobs within the foreach.

These all run in parallel. The dates parameter of step one is later referenced by this foreach step to create those 365 iterations, and those iterations are then passed down to the actual backfill job.

The date parameter of each iteration is referenced by this backfill job as arg1, and arg1 is passed down into the notebook, where the user's business logic can refer to the arg1 parameter and do the backfill.
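To make the parameter flow concrete, here is a small plain-Python sketch of the same logic: compute every date between from_date and to_date, then fan out one backfill iteration per date, passing each date down as arg1. This is not Maestro code, only an illustration of what the foreach expansion does; the helper names are hypothetical.

```python
# Plain-Python illustration of the parameter flow described above; this is NOT
# Maestro code, just a sketch of what the foreach expansion does.
from datetime import date, timedelta

def dates_between(from_date: date, to_date: date) -> list[str]:
    """Roughly what the date-range function returns: every date in [from_date, to_date]."""
    n_days = (to_date - from_date).days
    return [(from_date + timedelta(days=i)).isoformat() for i in range(n_days + 1)]

def run_backfill_notebook(arg1: str) -> None:
    """Stand-in for the notebook job of one iteration; arg1 is that iteration's date."""
    print(f"backfilling partition {arg1}")

# Step 1: compute the date range (365 dates for a one-year backfill).
dates = dates_between(date(2023, 1, 1), date(2023, 12, 31))

# Foreach step: one iteration per date, each passing its date down as arg1.
# (Maestro runs these iterations in parallel; a plain loop here keeps it simple.)
for d in dates:
    run_backfill_notebook(arg1=d)
```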

Next, let's explore the execution abstractions offered by Maestro. Maestro offers over 30 predefined step types with user-set parameters. For example, Spark is good for large-scale data processing, and Trino is ideal for running queries on large data sets.

The notebook step type is ideal for model training, data cleaning, or any kind of analysis. These step types cater to a wide range of workflow needs, but users can also bring their custom logic: they can bring their custom notebooks or a custom Docker image, and Maestro will orchestrate them.

Users can also create new step types just by defining parameterized notebooks and declaring the required and optional parameters for the new step type. Once they are done, they register the template with Maestro, and Maestro checks the types and values both at push time and at run time.
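As a rough illustration of what such a template might capture (the registration format isn't shown in the talk, so every field name below is an assumption), a notebook-backed step type is essentially a parameterized notebook plus a declared parameter schema that Maestro can validate at push time and run time:

```python
# Hypothetical sketch of what a notebook-backed step type declares;
# field names are assumptions, not Maestro's actual template format.
custom_step_type = {
    "name": "dedupe_table",                               # new step type name
    "notebook": "s3://example-bucket/templates/dedupe.ipynb",
    "required_params": {
        "source_table": "STRING",
        "target_table": "STRING",
    },
    "optional_params": {
        "partition_column": ("STRING", "ds"),             # (type, default value)
    },
}
# Once registered, Maestro type-checks the supplied values against this schema
# both when the workflow is pushed and again at run time, as described above.
```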

The beauty of Maestro is that it takes care of the scheduling and execution of the workflow. This allows users to focus on the business logic, and Maestro ensures the workflows are executed smoothly and efficiently.

Now let's explore how execution works in Maestro. Maestro execution is designed to be efficient, reliable, and user friendly. At Netflix, data engineers and data scientists mainly use notebooks: they create their work in a notebook and later reference those notebooks in their workflows. In Maestro, each job is run in isolation via a custom Docker image.

By default, we use the big data image, which comes preloaded with a set of common libraries, OS packages, and runtimes, saving users from installing those during the execution of the workflow. Maestro uses Papermill to execute the notebook.

Papermill is a command-line tool to execute notebooks without a notebook server. Maestro passes all the workflow and step parameters into the notebook as the first cell, so the user's business logic can refer to them. For example, here you can see the script parameter is passed into the notebook's first cell, along with some important parameters that Maestro decorates, which can later be referenced by the user's business logic. After execution, the notebook outputs are exported to S3.
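For reference, this is roughly what a Papermill invocation looks like in Python. Papermill's execute_notebook call is real, but the notebook paths and the parameter names below are made up for illustration; the exact parameters Maestro injects aren't listed in the talk.

```python
# Minimal Papermill sketch: parameters are injected as the notebook's first
# (parameters) cell, and the executed notebook is written out as a new file.
# Paths and parameter names are illustrative; S3 paths assume the optional
# S3 I/O dependencies (boto3) are installed.
import papermill as pm

pm.execute_notebook(
    "s3://example-bucket/notebooks/backfill.ipynb",           # input notebook
    "s3://example-bucket/outputs/backfill_2023-01-01.ipynb",   # executed output
    parameters={"arg1": "2023-01-01", "script": "backfill.sql"},
)
```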

We have an S3 location where users can see the notebook outputs. This kind of execution model is quite useful for debugging: if anything goes wrong, users can go and inspect what parameters were passed and see what's going on.

So that's all from our end. Thank you all for attending our talk and feel free to reach out to us after this talk if you have any questions. Thank you.
