Netflix’s journey to an Apache Iceberg–only data lake

Thanks for coming to this talk. I hope everyone has had a great re:Invent so far and enjoyed the sessions. We assure you, we've saved the best for last.

So just to get started, my name is Vy. I've been with AWS coming up on five years, and for the last two years I have had the pleasure of leading the team that supports Netflix's data platforms. Today, it is my honor and pleasure to welcome our next two speakers, Ashwin Kayyoor and Rakesh Veeramani.

They're both software engineers within Netflix's big data compute org. Together, they will describe the journey Netflix took to modernize their data lake using Apache Iceberg. As I'm sure every one of you knows, Apache Iceberg is one of the more popular open table formats. It was invented by and at Netflix to help folks efficiently manage data lakes with transactional compliance, so your queries can travel through time while your schemas and your business evolve.

I'm really looking forward to a few things. Obviously, I want to hear about the problems that necessitated the move to Apache Iceberg. I want to hear about the scaling challenges Netflix faced, and about some of the features and tools they invented along the way to deal with those challenges.

And last but not least, I'm definitely looking to hear about the impact of this work, both within and beyond Netflix. Speaking of which, a lot of the work Netflix is doing is informing AWS's own roadmap in this space. And since all of this work is out in the open source domain, I'm sure many of you will be able to take some valuable recipes home and implement them within your own companies.

So without further ado, Ashwin, it's all yours.

Thanks a lot. First of all, a big shout-out to the AWS accounts team for all the support and partnership they provided on this journey. I'm Ashwin Kayyoor, a senior software engineer at Netflix. I'm part of the data platform team, and at Netflix I spend my time architecting solutions to manage and process large amounts of data using big data frameworks like Spark, Trino, and Apache Iceberg. I also actively contribute to these systems internally. Currently, I'm leading an effort to migrate from a Hive-based data lake to an Iceberg-only data lake, and that's what we'll be talking about.

Before that, I'll let Rakesh introduce himself.

Thanks, Ashwin. My name is Rakesh. I've been at Netflix for about five years now, and I've had the privilege of working across the stack of our big data compute platform, from the metadata services layer to the storage layer to our compute engines. Most recently, I've been working with Ashwin on leading the effort to migrate all of our Hive data, and we're excited to share that with you folks today.

Here's the agenda for this talk. We will first describe Netflix's overall business motivation, and then provide a high-level overview of the foundational data platform that powers all the customer-facing features on our streaming platform. Then we will shine a light on why we invented Iceberg and decided to make it a paved path within Netflix, and describe the ecosystem we built around Iceberg that helped us improve our underlying data infrastructure.

Until recently, we were predominantly a Hive-based data lake. In the next section, I will describe the infrastructure we developed to transition to an Iceberg-only data lake. In conclusion, we'll provide key takeaways and share our thoughts and key learnings with the broader community.

As you all know, Netflix stands as a premier entity in the realm of entertainment streaming. At Netflix, we dedicate a significant amount of resources to understanding our users' preferences. This includes understanding our users' favorite movies and shows, as well as predicting the content they might be interested in viewing in the future.

All the numerous features on our platform that you engage with and experience are driven by data. For instance, the top 10 movies or shows that you see are derived by crunching region-specific viewing statistics, and the thumbnails you see are intimately personalized to your interests.

In a similar way, our machine learning algorithms process vast amounts of user data to tailor the experience to each individual user. So all of these features are closely personalized to user interests.

All the teams within Netflix that work on these intriguing data-driven algorithms depend on our foundational data platform. The data platform at Netflix offers a range of services: for instance, online data stores, plus compute and storage services that rely on AWS S3 and EC2 primitives. The platform also provides auxiliary services, such as a scheduling service that helps orchestrate large numbers of workloads. Above this foundational layer, the data platform implements various high-level abstractions and systems that help us orchestrate large numbers of jobs and process data in a multitude of ways.

Next, Rakesh will double-click on this and provide more in-depth details about our big data platform architecture.

Thanks, Ashwin. This slide gives you a high-level overview of all the abstractions we have in our big data platform. Over the years, we've built out a cloud-native, extensible architecture centered around principles like storage-compute separation and data platform composability. We mainly had to lay these down because, as Ashwin was saying, data is central to everything we do at Netflix, and the big data platform powers Netflix through all phases of the business, all the way from internal analytics dashboards to A/B testing to even product-facing features like the top 10 titles you see on the website.

As such, we collect tons of data from the online data stores you see on the left there. Cassandra is used quite widely for all of our member-facing websites; we also have RDS and CockroachDB. We have CDC streams set up from all of these online data stores, and they are processed by our stream processing platform, which we call Data Mesh.

Data Mesh is composed of open source services like Flink and Kafka. It processes trillions of events per day from microservices consuming data from these online data stores, and lands tens of petabytes of data each day into our warehouse in S3.

Casa is another homegrown solution of ours. It is responsible for taking snapshots and backups of the Cassandra tables backing our online websites and dumping them into Iceberg tables in S3.

Looking at the bottommost layer of those abstractions, all of our data sits completely in S3. We have hundreds of petabytes there, growing at a very fast clip. For our compute needs, we use Titus, our homegrown container platform, which in turn runs on EC2 instances.

Moving up, we have the metadata services layer. The metadata services layer abstracts all of the storage details from our users behind a table format. All of our tables are Iceberg-formatted at the moment, but we used to be partially in Hive and partially in Iceberg, and as part of this effort we have moved to mostly Iceberg.

We also have Metacat here in our metadata services layer. Metacat provides a unified API for accessing metadata about our datasets. It does so by abstracting away all of the data stores we support at Netflix: users just come in with a three-part qualified name when querying a table, and they're agnostic to the underlying details like the table type and the storage type.

So, for example, they use a table name like prod/foo/bar to query everything they need to know about that table.

In the metadata services layer, we also have services like the Netflix Data Catalog and the policy engine. These two services form a key pillar of our governance platform. They essentially index all of the datasets we have within Netflix, not just the big data datasets but all of them.

The policy engine enforces policies like: every dataset should have an owner, and if a dataset has classified information, it has to be annotated as such.
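The two policy checks mentioned here can be sketched in a few lines of Python. This is purely illustrative (the field names are hypothetical, not the policy engine's actual schema):

```python
def violations(dataset):
    """Return the list of policy violations for a dataset record.

    Encodes the two rules from the talk: every dataset needs an owner,
    and classified data must be annotated as such.
    """
    problems = []
    if not dataset.get("owner"):
        problems.append("missing owner")
    if dataset.get("classified") and not dataset.get("classification_annotation"):
        problems.append("classified data not annotated")
    return problems

assert violations({"owner": "team-a"}) == []
assert violations({"classified": True}) == ["missing owner",
                                            "classified data not annotated"]
```

A real policy engine would evaluate many such rules against an indexed catalog, but each rule reduces to a predicate like the ones above.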

On top of the metadata services layer, we have our compute engine layer. As you can see, we use a myriad of engines. Spark is mostly used for ETL batch processing; we have thousands of jobs running on our platform every day. Trino is mostly used for interactive analytics, and Druid is used quite heavily for our real-time aggregated dashboards.

This is just a subset of the engines, but the most used ones. On top of our compute layer, we have our client-facing tools. Again, we provide a suite of tools to fit all the personas of users on our platform.

All of the users of our platform are internal to Netflix, but they come from different areas: analytics engineers, data engineers, data scientists, and so on. In our tooling, we have a data platform UI where a user can just come in and start running SQL queries against the engine of their choice.

We also have Tableau, which is used quite widely for our dashboarding.

And we have Jupyter, which is used quite a bit by our research engineers and data scientists. This is, again, just a subset of the client-side tooling we support. The control plane and data plane abstractions we have between the engines allow users to bring in their own tools as they see fit and connect seamlessly to the engines.

It also abstracts them away from the specifics of the engines, like versions and deployments. So, with that, a little bit of history.

We started by moving from an on-prem architecture to the cloud around 2010 or so, following the principles I mentioned earlier, and we were mostly a Hive warehouse, primarily Hive-formatted. At our peak, we had about 600,000 Hive tables and 250 million partitions. But we soon started running into limitations with Hive.

As perhaps most of you are aware, there are no ACID semantics out of the box, so we had to build our own patterns to get around that. Additionally, some of the tables in our warehouse started growing to petabytes in size, and every time a data engineer had to change the partitioning scheme or schema of a table, they had to pretty much drop and rewrite all of the data in that table.

It was getting to a point where running a platform completely on Hive was not tenable. That's when Ryan Blue, Dan Weeks, and team designed Iceberg: an open table format built with the community in mind, with well-defined interfaces and contracts, that works across a variety of languages.

Iceberg not only overcame some of the drawbacks we had with Hive, but also introduced an additional set of features. To begin with, ACID transactions were available out of the box. And one of the unique features of Iceberg is its rich metadata layer.

You can get the full footprint of a particular table just by looking at the metadata layer. This is quite powerful because you can not only speed up your queries, but it also eases any data management and governance operations you want to do on your table.

Another feature is the separation of storage configuration from jobs: it enabled us, in the platform, to let users configure tables, not jobs. For example, users can configure a table to use a certain compression codec and split size, and not have to go around configuring all of the jobs that write to that particular table. That was fairly useful. Similarly, we have time travel, where you can query a table as of a particular snapshot. This is used quite a bit by our data engineers.
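The time-travel resolution described here boils down to picking the latest snapshot committed at or before the requested point in time. A minimal Python sketch of that lookup (the snapshot records are hypothetical stand-ins for Iceberg's snapshot log):

```python
def snapshot_as_of(snapshots, ts):
    """Resolve a time-travel query: return the id of the latest snapshot
    committed at or before timestamp ts, or None if the table didn't
    exist yet."""
    eligible = [s for s in snapshots if s["committed_at"] <= ts]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s["committed_at"])["id"]

snaps = [{"id": 1, "committed_at": 100},
         {"id": 2, "committed_at": 200}]
assert snapshot_as_of(snaps, 150) == 1   # between commits: older snapshot wins
assert snapshot_as_of(snaps, 250) == 2   # after latest commit
assert snapshot_as_of(snaps, 50) is None # before the table had any snapshot
```

Engines expose this as `FOR TIMESTAMP AS OF` / `VERSION AS OF` style syntax; the underlying resolution against the snapshot log is exactly this kind of lookup.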

Similarly, partition evolution is supported out of the box. And Iceberg was built with a good set of pluggable, extensible interfaces; for example, the TableOperations API made it easier to start integrating Iceberg with the engines we had in the platform at that time.

So we integrated it with Spark and Trino, and while we still had our Hive warehouse up and running in production, we started building out our Iceberg ecosystem in parallel. We were able to do this, and a shout-out to our architect at that time, Charles Smith, because we had put in place a good set of high-level abstractions that enabled us to build this parallel Iceberg ecosystem at feature parity with Hive without any disruption to our users; for example, a high-level metastore layer where users were abstracted away from the underlying table details.

We started having our users create Iceberg tables by default, for example. In terms of our journey, once we had Iceberg integrated with our engines, a few teams started using it quite a bit and were completely on board. Then we started building out the rest of the service ecosystem we needed to bring Iceberg to maturity.

I'll go into the details of how we actually built each of these services, and why, because our hope is that this will help you reason about the abstractions you might want to build out in your own organization. These services are built entirely on the open source Iceberg APIs that are available today.

Starting with the metadata services: as I was saying, Metacat is our primary federated metadata store. It provides not only a unified API but also a unified type system when interacting with the data stores we support. It covers not just the Hive metastore, where we used to store our Hive table metadata, but also third-party stores like Snowflake. Having an abstraction like this unlocks use cases where a user wants to write a transport job between Hive and Snowflake.

They then don't need to be aware of the intricacies of the schema differences between Hive and Snowflake. They just deal with the canonical types that Metacat provides and use those to write the transport jobs they need. Under the covers, Metacat takes care of translating between the specific data stores' types.

With this abstraction in place, we started off by storing the Iceberg metadata directly in the Hive metastore. We wrote some custom APIs that did a check-and-put operation to keep track of the most up-to-date state of an Iceberg table.

But we quickly started running into scalability issues, especially on the Hive side. As I was saying, we had about 250 million partitions at that time in the MySQL RDS instances backing our HMS. With that, we had noisy neighbor issues, where a user doing a bulk update of a table with 20-30 million partitions would lock up a bunch of tables.

Iceberg has an optimistic concurrency commit protocol, so you would have Iceberg tables constantly trying to commit, whereas Hive tables take a more pessimistic locking approach. These competing commit patterns ended up putting too much pressure on our RDS instances.

This led to unavailability of Iceberg tables on our platform. We tried different things like scaling out and sharding our Hive metastore instances, but we kept hitting resource limits on MySQL. That's when we decided to build Polaris, a custom metastore designed specifically for Iceberg tables.

Iceberg tables need only a very simple data model. All you have to do is keep track of the most recent root metadata JSON file for a particular table when you're committing to it. Every time you commit, you advance that, and you store just a pointer to it in your database.

We decided to go with CockroachDB in this case, because our operations were mostly point updates, and CockroachDB provides horizontal scalability out of the box.
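The check-and-put (compare-and-swap) idea behind this catalog data model can be sketched in a few lines of Python. This is a toy in-memory model of the pointer table, not Polaris's actual code:

```python
import threading

class PointerStore:
    """Minimal catalog sketch: per table, store only a pointer to the
    latest root metadata JSON file, committed via compare-and-swap so
    concurrent writers cannot clobber each other."""

    def __init__(self):
        self._pointers = {}            # table name -> metadata file location
        self._lock = threading.Lock()  # stands in for a transactional DB write

    def commit(self, table, expected, new_location):
        # Swap the pointer only if it still matches what this writer read
        # when it started its commit attempt.
        with self._lock:
            if self._pointers.get(table) != expected:
                return False           # lost the race; caller must retry
            self._pointers[table] = new_location
            return True

    def current(self, table):
        return self._pointers.get(table)

store = PointerStore()
assert store.commit("prod/foo/bar", None, "metadata/v1.json")
# A writer that read v1 can commit v2...
assert store.commit("prod/foo/bar", "metadata/v1.json", "metadata/v2.json")
# ...but a stale writer still expecting v1 is rejected and must retry.
assert not store.commit("prod/foo/bar", "metadata/v1.json", "metadata/v3.json")
```

In the real system the compare-and-swap is a conditional point update in CockroachDB, which is why horizontal scalability of point updates was the deciding factor.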

One more thing we added to Polaris was support for the Iceberg REST catalog spec. The Iceberg REST catalog was introduced relatively recently. It standardizes how Iceberg table metadata operations should be managed.

It provides consistency around how the Iceberg commit protocol should work, and around how Iceberg tables should be created in general. The power of such a standardized interface is that it allows you to implement it in house on top of your own metastore, and that is what we did.

We added a layer of REST catalog support within Polaris itself and started serving the REST catalog endpoint. With an endpoint like this, any third-party engine that speaks the standard can connect to your warehouse directly. This is something we did not have in the Hive world: you now have a single endpoint with which you can connect to your warehouse and start running queries against it.

This is quite powerful in that it avoids vendor lock-in to a much larger degree, because you control not only all of your data but also the metadata and the metastore operations. Another nice thing the Iceberg REST catalog brings is that once you have a centralized catalog server running, you're in a position to start making optimizations to Iceberg's commit protocol itself.

For example, the Iceberg REST catalog helps with deconflicting commits. The protocol is built with a change-based API, so when you have commits that do not actually conflict with each other, you can detect that instead of failing one commit and having users retry.

This matters, for example, when you have tables being written to by tens of writers in parallel. Today, even though these commits do not conflict with each other, they're not aware of each other, and writers end up retrying over and over, which is pretty inefficient when you're operating Iceberg at scale.

So this definitely helps: if you're doing a schema update, you need not reject writes that are only touching data files, for example. That wraps up the REST catalog advantages, and it's something you should definitely look into. If you already have a metastore, you can simply add an implementation of the spec on top of it.

Next, we have the table management services we built out. Once you start landing data into your warehouse, it's time to start thinking about the life cycle of that data. To that end, we have processes called janitors that clean up expired data.

We run them in three flavors. We have TTL janitors that clean up data past a certain expiration date; this date is configured by our users, although we have some guidelines in place. We have snapshot janitors that clean up Iceberg snapshots based on a specific TTL. And we have orphaned-file janitors that clean up dangling files that land in your bucket but don't belong to any Iceberg table.

These are basically files from failed commits and so on. All of these janitors are, again, built on open source Iceberg APIs. The snapshot janitor, for example, uses the expireSnapshots API, and the orphaned-file janitor uses the Spark actions API that's available. So you can design services that fit your needs.
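The snapshot janitor's decision is a simple cutoff over the snapshot log. A minimal sketch, with plain dicts standing in for Iceberg snapshot metadata (the real janitor delegates the actual cleanup to Iceberg's expireSnapshots API):

```python
from datetime import datetime, timedelta

def snapshots_to_expire(snapshots, ttl_days, now):
    """Return ids of snapshots older than the configured TTL; these are
    the ones handed to expireSnapshots for removal."""
    cutoff = now - timedelta(days=ttl_days)
    return [s["id"] for s in snapshots if s["committed_at"] < cutoff]

now = datetime(2023, 11, 28)
snaps = [
    {"id": 1, "committed_at": datetime(2023, 11, 1)},   # stale
    {"id": 2, "committed_at": datetime(2023, 11, 27)},  # within TTL
]
assert snapshots_to_expire(snaps, ttl_days=7, now=now) == [1]
```

A useful property of doing this through the snapshot log is that the janitor never has to list S3: the metadata layer already knows every file reachable from every snapshot.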

This is a high-level overview of the architecture of how we run the janitors.

On the far left, a user goes through our portal or one of our engines and configures the TTL for the janitors. Taking the snapshot janitor as an example, these janitors run as a pool of periodically scheduled Spark processes. You can think of it as a service maintaining a pool of these Spark processes.

The snapshot janitor expires all of the irrelevant files belonging to a particular table and sends them to an SQS queue. Then we have a deletion service running that makes sure we're not over-deleting any of those files. The deletion service pulls all of the files set up for deletion from the SQS queues, sends them to an audit log, which is basically a streaming table of ours, and then goes to S3 and soft-deletes the data.

This audit log helps us restore the data whenever we need it. It also covers cases where a user might have accidentally dropped a table and wants to restore and recover it. The soft-deletion period is usually 5 to 7 days in our case.
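The audit-then-soft-delete flow can be sketched as follows. This is a toy model (dicts standing in for S3 object state, a list standing in for the streaming audit table), illustrating the ordering guarantee: every file is logged before it is touched, and "deletion" is reversible within the soft-delete window:

```python
class DeletionService:
    """Sketch of the deletion pipeline: audit first, then soft-delete,
    so anything deleted inside the 5-7 day window can be restored."""

    def __init__(self, store):
        self.store = store      # object path -> {"deleted": bool}
        self.audit_log = []     # stands in for the streaming audit table

    def process(self, paths):
        for path in paths:
            self.audit_log.append(path)        # log before any action
            if path in self.store:             # guard against over-deleting
                self.store[path]["deleted"] = True

    def restore(self, path):
        self.store[path]["deleted"] = False    # undo within the window

svc = DeletionService({"s3://bucket/a.parquet": {"deleted": False}})
svc.process(["s3://bucket/a.parquet", "s3://bucket/missing.parquet"])
assert svc.store["s3://bucket/a.parquet"]["deleted"]
assert len(svc.audit_log) == 2                 # unknown paths still audited

svc.restore("s3://bucket/a.parquet")
assert not svc.store["s3://bucket/a.parquet"]["deleted"]
```

In S3 terms, the soft delete would typically be a delete marker or a lifecycle-managed tombstone rather than a boolean flag.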

Similarly, as users write to their tables using a variety of engines, they might not be writing files in the most optimal layout. For example, some Spark jobs write too many small files to Iceberg tables. This is especially a problem for remote storage systems like S3, because you now incur the cost of network latency as you switch between IO handles.

So too many small files are a problem for compute engines, and in general there is an optimization problem here: you can optimize the layout of a table's files in S3 in a way that is efficient for querying through the engines.

That's where we implemented AutoTune. AutoTune runs in the background, completely abstracted away from our users, and optimizes the data files belonging to users' tables. Some of the techniques we use are merging too many small files and rewriting them into files of a reasonable size, sorting, and row-level delete file compaction.

Again, all of these use Iceberg OSS APIs, so it's something you can build on top of readily today as well. The architecture for AutoTune is also similar to that of the janitors. We leverage Spark here primarily as a distributed computation framework, more so than for data processing.

When a user writes to an Iceberg table, the metastore is made aware of it and sends an event to SQS saying this table has committed these snapshots. AutoTune listens to those events. Let's say in this case it is running in compaction mode: it has a compaction config stored with all the heuristics on to what degree files should be compacted, and so on.

It then spins off a Spark job to rewrite the data files belonging to this table in accordance with the config, and commits that back to the Iceberg table.
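The planning step of a compaction like this can be sketched as a greedy bin-packing pass over file sizes. This is a toy heuristic, not AutoTune's actual config model; the real rewrite runs as a distributed Spark job using Iceberg's rewrite actions:

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedy bin-packing: group small files into rewrite groups of
    roughly target_bytes each. Files already at or above the target are
    left alone; a single leftover small file isn't worth rewriting."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if size >= target_bytes:
            continue                     # already big enough
        current.append(size)
        current_size += size
        if current_size >= target_bytes:
            groups.append(current)
            current, current_size = [], 0
    if len(current) > 1:
        groups.append(current)
    return groups

# Six 40 MB files with a 128 MB target become two rewrite groups.
mb = 1024 * 1024
groups = plan_compaction([40 * mb] * 6, target_bytes=128 * mb)
assert [len(g) for g in groups] == [4, 2]
```

Each group would then become one rewrite task that reads its small files and writes a single consolidated file, with the swap committed atomically back to the table.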

Similarly, AutoLift is another service we built specifically at Netflix. The motivation here is that most of our engines run only in the us-east-1 region; however, we do have streaming engines using Flink that write to remote regions. This leads to pretty costly cross-region bandwidth charges.

To avoid that, we built AutoLift, which localizes files from these remote regions to us-east-1. It localizes about 2 to 3 petabytes of data each day. The way it works is that it scans a stream of incoming snapshots, figures out all of the files that need to be localized, and then uses Iceberg's replace commit API to remove the remote files and replace them with local ones atomically.
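The localize-then-swap step can be sketched like this. The bucket layout and path scheme here are made up for illustration; the real service copies the objects and then issues a single Iceberg replace-files commit so readers never see a half-localized table:

```python
def localize(snapshot_files, copy_fn, local_prefix="s3://us-east-1/"):
    """For each file outside the local region, copy it locally, then
    return the file list with remote paths swapped for local ones.
    The returned list represents one atomic replace commit."""
    replacements = {}
    for path in snapshot_files:
        if not path.startswith(local_prefix):
            # s3://eu-west-1/wh/t/f.parquet -> s3://us-east-1/wh/t/f.parquet
            suffix = path.split("://", 1)[1].split("/", 1)[1]
            local = local_prefix + suffix
            copy_fn(path, local)
            replacements[path] = local
    return [replacements.get(p, p) for p in snapshot_files]

copied = []
files = ["s3://eu-west-1/wh/t/part-0.parquet",
         "s3://us-east-1/wh/t/part-1.parquet"]
new_files = localize(files, lambda src, dst: copied.append((src, dst)))
assert new_files == ["s3://us-east-1/wh/t/part-0.parquet",
                     "s3://us-east-1/wh/t/part-1.parquet"]
assert len(copied) == 1   # already-local files are not re-copied
```

The key design point is atomicity: because the swap is one Iceberg commit, a query either reads all remote paths or all local ones, never a mix with dangling references.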

So again, this is an abstraction we built on top of Iceberg OSS APIs. Another new feature we added on top of Iceberg is secure Iceberg tables.

In the Hive world, our warehouse was pretty much open to all. Teams that wanted security would set up specific S3 IAM roles that specific folks on their team had access to. But this went against our ethos, where we want users to interact with high-level abstractions in our platform and not have to deal with things at the S3 layer.

That's when we decided to leverage Iceberg and its metadata layer to build security across our warehouse and have a secure data warehouse by default.

I'm going to give a quick overview of how we build security into our Iceberg tables, but the actual implementation is a bit involved, so in the interest of time, if you folks are interested, I'm happy to chat after the talk and give you more details.

At a high level, we've modeled each table as a resource that needs to be secured. So we have table-level security, and each table is mapped to a specific prefix in S3.

For example, all of the data and metadata for table foo lives under an S3 prefix like s3://prod/foo. And the ACLs, meaning the list of users who have access to this particular table, are also stored in the Iceberg table's metadata.

The end-to-end flow looks like this: a user, Alice, comes to query table foo. The Spark driver then talks to an external signer service. The signer service is an abstraction we built that acts as the policy enforcement point for our secure tables.

Whenever Spark asks it whether user Alice has access to table foo, the signer service loads the table from our metastore, checks whether the user is in the ACL list, and hands out an STS token. It basically acts as an interceptor that's external to the engines, and the nice thing about that is we were able to extend the signer service to our other engines, like Trino, as well.

Once the Iceberg catalog has the STS token, it passes it down to the Spark executors for reads. The overall access model is pretty simple: users have read, write, or admin permissions on a table. It's not a full-blown RBAC framework with role hierarchies and so on.

We also do not support row- and column-level security yet, although we are seeing requests from some of our users for those use cases. On the right, you can see what the data model for ACLs looks like: it's the user and their permissions, and all of this is stored in the Iceberg root metadata JSON file itself as a property.
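The signer-service check described above can be sketched in a few lines. This is an illustrative model only: the table dict, property names, and the returned token are all stand-ins, not the actual signer API (a real implementation mints a scoped STS credential):

```python
def sign_request(table_metadata, user, action):
    """Policy enforcement point sketch: the ACL lives in the table's root
    metadata as a property, so enforcement is loading the table, checking
    the caller, and handing back a scoped credential."""
    acls = table_metadata["properties"].get("acls", {})
    granted = acls.get(user, set())
    if action not in granted and "admin" not in granted:
        raise PermissionError(f"{user} may not {action} this table")
    # Placeholder for a real, time-limited STS token scoped to the
    # table's S3 prefix.
    return {"token": f"sts-token-for-{user}", "scope": action}

table = {"properties": {"acls": {"alice": {"read"}, "bob": {"admin"}}}}
assert sign_request(table, "alice", "read")["scope"] == "read"
assert sign_request(table, "bob", "write")["token"] == "sts-token-for-bob"
try:
    sign_request(table, "alice", "write")
    raise AssertionError("expected PermissionError")
except PermissionError:
    pass
```

Because the check sits outside the engines and the ACL travels with the table metadata, the same enforcement point works unchanged for Spark, Trino, or any other engine that fetches credentials through the signer.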

To wrap up: as you can see, we built out these secure Iceberg tables and started leveraging some of the performance and efficiency gains we get from Iceberg, thanks to the high-level abstractions we had in place, where our users didn't have to be aware of whether they actually had a Hive table or an Iceberg table under the covers.

However, at the same time, we still had hundreds of petabytes of data in the Hive format, and our users simply did not have enough bandwidth to migrate all of that data to Iceberg themselves. That's when we decided to start a managed effort: we would build migration tooling that did the heavy lifting on behalf of our users and migrate all of these tables to Iceberg transparently, with minimal downtime.

Ashwin will now talk about the migration tooling we built to make that happen.

Thanks, Rakesh. Given the immense size of our data, where we have very high multiples of hundreds of petabytes, coupled with the large number of tables we have, about 1.5 million, the task of migrating from a predominantly Hive-based data lake to an Iceberg-only data lake is a colossal undertaking.

To deal with this challenge, we designed our Hive-to-Iceberg migration tooling with certain key objectives in mind. Firstly, because of the immense size of the data, we wanted to minimize overall data movement during the migration. This reduces the overall migration time and keeps migration costs under control.

The second key objective is to minimize user friction: we wanted to understand and anticipate beforehand the different ways users would experience issues during or after the migration, and minimize them.

In a similar way, we wanted to ensure business continuity: we again wanted to anticipate any serious incidents that could happen during or after the migration, and we mitigated them by developing certain features in the tooling. The first key challenge to address is minimizing overall data movement.

We designed and implemented a Spark SQL procedure. When you run this procedure on a Hive table, it creates Iceberg metadata on top of the data files pointed to by the Hive table, instead of copying the data from Hive to Iceberg.

The procedure begins by extracting the data files corresponding to the Hive table, and then builds the layers of metadata. First, we create a set of manifest files, where each manifest file points to a set of data files specific to a partition. On top of these, we create manifest list files, where each manifest list stores entries for its member manifest files, including the partitions each manifest covers.

Each manifest list corresponds to a snapshot, and the snapshot is pointed to by a metadata file. When a user accesses the Iceberg table, the request goes through the Iceberg catalog, which points to the latest metadata file.
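The metadata hierarchy the procedure builds can be sketched with plain dicts standing in for the real Avro manifest files and JSON metadata file. This is a structural illustration only; the actual procedure writes Iceberg's file formats and registers the result in the catalog:

```python
def build_metadata(hive_partitions):
    """Wrap existing Hive data files in an Iceberg-style metadata tree
    (manifests -> manifest list -> root metadata) without copying or
    moving any data file."""
    manifests = [
        {"partition": part, "data_files": files}       # one manifest per partition
        for part, files in sorted(hive_partitions.items())
    ]
    manifest_list = {"manifests": manifests}           # one snapshot's manifest list
    return {"current-snapshot": {"manifest-list": manifest_list}}

meta = build_metadata({
    "date=2023-11-27": ["s3://wh/t/d1/f1.parquet"],
    "date=2023-11-28": ["s3://wh/t/d2/f2.parquet", "s3://wh/t/d2/f3.parquet"],
})
manifests = meta["current-snapshot"]["manifest-list"]["manifests"]
assert len(manifests) == 2
assert manifests[1]["data_files"][-1].endswith("f3.parquet")
```

Because only metadata is written, migrating even a petabyte-scale table costs a listing pass plus some small metadata files, which is what made the 95% in-place migration figure possible.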

We have migrated about 95% of our non-temporary Hive tables using this technique, which creates Iceberg metadata instead of copying the data. Only a very small minority of tables had to be migrated using a copy operation.

In the upcoming slides, I will talk about the scenarios where we had to perform a copy operation to complete the migration.

To achieve the second and third objectives, minimizing user friction and ensuring business continuity, we developed our Hive-to-Iceberg migration tooling.

This tooling provides a managed migration for our users. We devised five components, each bearing a unique responsibility. The components interact with each other through shared state, yet they all work asynchronously and are not blocked on one another.

Each component is driven by a workflow that wakes up, performs a certain job, and then goes back to sleep. For example, the preprocessor wakes up, looks at all the tables scheduled for migration, extracts table metadata such as the list of table owners and the list of downstream users, and prepares those tables for migration.

Next, the communicator component looks at these tables and sends emails to the table owners and downstream users, notifying them about the upcoming scheduled migration, the current state of the migration, and what to expect next.

The migrator component performs the actual migration of the scheduled tables from Hive to Iceberg.

If the migration is successful, we finalize it. If a user experiences an issue, the reverter steps in and reverts the affected tables back to the original Hive table format, guaranteeing an instantaneous revert operation to unblock the user as soon as possible.

To make that instantaneous revert possible, we keep the original Hive table in sync with the newly created Iceberg table for some period after the migration. This is the responsibility of the shadower component, which performs delta copying using the metadata layer provided by Iceberg.
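A minimal sketch of this shadowing loop, with tables modeled as plain partition-to-rows dictionaries; the function names and the snapshot representation are hypothetical, but the shape of the algorithm (changed partitions since a watermark, delete-then-insert, advance watermark) follows the description above:

```python
# Hypothetical sketch of the shadower's delta copy: for each partition changed
# since the last synced snapshot, overwrite that partition in the Hive table,
# then advance the watermark to the latest snapshot id.
def changed_partitions(snapshots, watermark):
    """Return the partitions touched by snapshots newer than the watermark."""
    changed = set()
    for snap_id, parts in snapshots:
        if snap_id > watermark:
            changed.update(parts)
    return changed

def sync(hive, iceberg, snapshots, watermark):
    for part in changed_partitions(snapshots, watermark):
        # delete-then-insert covers both deletes and inserts in one overwrite
        hive[part] = list(iceberg.get(part, []))
    return snapshots[-1][0]   # new watermark = latest snapshot id

hive    = {"ds=01": [1, 2], "ds=02": [3]}
iceberg = {"ds=01": [1, 2], "ds=02": [3, 4]}   # row 4 landed after migration
snaps   = [(100, {"ds=01", "ds=02"}), (101, {"ds=02"})]
wm = sync(hive, iceberg, snaps, watermark=100)
```

Because only changed partitions are rewritten, the shadow copy stays cheap even on very large tables.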

I also want to mention that, to increase the velocity of the migration, we spin up multiple instances of each component and distribute the list of tables across those instances using consistent hashing, so the processing happens in parallel.
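The fan-out just described can be sketched with a small consistent-hash ring; this is a hypothetical illustration of the technique, not the actual tooling's code:

```python
# Minimal consistent-hash ring for spreading tables across component instances.
# Each instance processes only the tables it owns; assignments are stable, so
# two runs (or two workflows) agree on who handles which table.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, instances, vnodes=64):
        # virtual nodes smooth out the distribution across instances
        self._ring = sorted((_hash(f"{inst}#{v}"), inst)
                            for inst in instances for v in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def owner(self, table: str) -> str:
        idx = bisect.bisect(self._keys, _hash(table)) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["migrator-0", "migrator-1", "migrator-2"])
```

Consistent hashing also means that adding or removing an instance only reassigns a fraction of the tables, rather than reshuffling everything.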

Our migration tooling relies on several external auxiliary services. For example, it relies on a lineage logging service, a centralized store of the data flow between all the different data systems within Netflix. The preprocessor component uses this service to extract lineage metadata such as the list of table owners, the list of downstream users, and so on.

We also use another service that Rakesh spoke about, Metacat, our federated metastore that stores metadata about all the different data sources within Netflix. The migrator uses it to perform various operations on the metadata layer, such as blocking and unblocking tables during migration, renaming tables, and copying certain important metadata from the original Hive table to the newly created Iceberg table.

There is another, similar service we call the data catalog service, which stores other auxiliary pieces of metadata. The migrator uses it to copy additional metadata from the original Hive table to the newly created table, such as table owners, data classification, table TTLs, et cetera.

The essential idea is to hide the fact that the table was migrated and make sure it looks exactly as it did before. We designed the migrator as a state machine: when you migrate a table, it goes through a set of states in which we prepare the table for migration, block writes on it, migrate it to Iceberg, make the Iceberg table primary by swapping the names, unblock the primary table, and archive the original Hive table. If everything goes well, we finalize the migration. If something goes wrong, the reverter sends the table through a different set of states in which we eventually revert it back to the original Hive table and archive the newly created Iceberg table.
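A toy version of such a state machine might look like the following; the state names are invented for illustration and are not the ones used in the actual tooling:

```python
# Hypothetical sketch of the migration state machine: happy-path states move
# the table to Iceberg, while a failure routes it through the revert states.
FORWARD = ["PREPARED", "WRITES_BLOCKED", "MIGRATED",
           "ICEBERG_PRIMARY", "WRITES_UNBLOCKED", "HIVE_ARCHIVED", "FINALIZED"]
REVERT  = ["REVERTED_TO_HIVE", "ICEBERG_ARCHIVED"]

class Migration:
    def __init__(self, table: str):
        self.table, self.state = table, None

    def advance(self) -> str:
        """Move to the next forward state (each step is made idempotent)."""
        nxt = FORWARD[0] if self.state is None \
            else FORWARD[FORWARD.index(self.state) + 1]
        self.state = nxt
        return nxt

    def revert(self) -> str:
        # eventually: table is back on Hive, the Iceberg copy is archived
        self.state = REVERT[-1]
        return self.state

m = Migration("warehouse.playback_events")
while m.state != "FINALIZED":
    m.advance()
```

Modeling each step as an explicit state is what makes the properties discussed next (transparency, resumability, idempotency, extensibility) easy to enforce.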

Designing the migration tooling as a state machine provides several benefits. First, it gives our users access to the current state of the migration, so the tool is completely transparent: users always know where the migration stands at any moment and what to expect next.

Second, whenever we hit an exception at a particular state, we can either go back to the previous good state, or move forward to the next state once we have debugged and resolved the exception. This gives us greater debuggability as well as operational flexibility, because we can always roll back to a previous good state.

This design also helps us ensure that each state satisfies certain properties. For example, we make sure each state is idempotent with respect to a particular table, which makes the tool retryable: retrying does not change the outcome for that table. We are also easily able to add new states to the state machine when we need to handle new corner cases or new use cases, which gives us immense extensibility. As I mentioned before, to guarantee an instantaneous revert operation,

we keep the original Hive table in sync with the newly created Iceberg table. We achieve this using the changed-partition metadata stored in the Iceberg metadata layer: for each changed partition in the new table, we delete the data from the corresponding partition in the original Hive table and insert the data from the changed partition. This way we are able to sync both the deletes and the inserts

that happened after the migration to the Iceberg table. We then set a watermark to the latest synced snapshot on the original Hive table, so that we can keep repeating this operation for some time after the migration.

We encountered certain user friction points during the migration, and we handled them in a few different ways.

For example, Iceberg doesn't support certain legacy features, such as data stored in CSV or text format. What this meant is that we could not build Iceberg metadata on top of data in those legacy formats, so in this case we had to copy the data from Hive into the newly created Iceberg table. But these were a very small minority of the tables, so the approach was still a win overall.

The second issue is legacy primitives. For example, a Hive table may have a data column of a legacy type. Timestamp itself is not a legacy type, but in the Hive world a timestamp is translated underneath to int96, which is considered a legacy data type in the Iceberg world. In this case too, we had to copy the data from Hive to Iceberg. And there are further nuances.

When a data column has NOT NULL attached to it in the Hive world and we translate it into Iceberg, we get an exception, so we had to make some fixes in the Iceberg library to deal with this.

Then we came across a very interesting set of tables that coordinate reads and writes among themselves. This meant we could not migrate these tables independently of each other: we had to find all the jobs associated with these tables, stop them, migrate all the tables at once, and then unblock the jobs. The third issue arose because Iceberg is the paved path within Netflix:

we had implemented our security features on top of Iceberg, so we anticipated that once we migrated a table, users would run into friction with respect to access. To solve this, we enabled our tooling to automatically set ACLs based on the list of table owners and downstream users retrieved in the preprocessor stage.

There are also certain incompatibilities between Hive and Iceberg. One of them is that the underlying data files created by a Hive table associate data columns with the column names, whereas in the Iceberg world columns are associated with an ordinal, or ID. This creates certain incompatibilities.

There are also certain nuances between required and optional fields. To deal with all of this, we created a compatibility mode in our compute engines, and we also set certain configurations automatically as soon as a table is migrated, to bring parity between Hive and Iceberg.
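A simplified sketch of the name-versus-ID problem: migrated Hive data files carry no Iceberg field IDs, so a reader needs a mapping that assigns each file column the ID of the matching table column, and treats absent columns as null. Open-source Iceberg supports this via a name mapping on the table; the function below is a hypothetical stand-in, not its API.

```python
# Sketch of name-based fallback resolution for migrated (ID-less) data files.
def resolve_columns(table_schema, file_columns):
    """table_schema maps column name -> Iceberg field ID; file_columns are the
    names actually present in the data file.  Returns field ID -> file column
    (or None when the file predates the column, i.e. read it as nulls)."""
    return {fid: (name if name in file_columns else None)
            for name, fid in table_schema.items()}

schema = {"id": 1, "title": 2, "views": 3}
# An older file written before the "views" column was added:
cols = resolve_columns(schema, file_columns=["id", "title"])
```

This is the kind of translation the compatibility mode has to do automatically, so that freshly migrated tables behave like native Iceberg tables.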

Another very interesting thing we noticed is that some teams within Netflix had implemented their own custom libraries that relied on certain Hive-specific features, for example a Hive metastore API that deals with partitions, which works only in the Hive case. In the Iceberg case, the Iceberg metadata layer itself stores the partition information, so the Hive metastore is not going to help.

To deal with this, we had to work closely with those teams, propose changes that worked with Iceberg, and then migrate their tables. Also, each team had its own migration priorities, for several reasons. To handle this, we created a feature within our tooling that lets a schedule be set per table, meaning teams can set their own schedule and priority within the migration tooling.

In conclusion, in this talk we have shared our journey of creating Iceberg and the comprehensive ecosystem we built to enable its adoption within Netflix. All the strategies we built around Iceberg have yielded significant benefits.

For example, we achieved a 25% cost reduction by virtue of Zstandard compression, efficient data cleanup using janitors, and the data compaction provided by our AutoTune technology. Performance-wise, Iceberg natively outperforms Hive: we observed queries running orders of magnitude faster on Iceberg versus Hive. The supporting systems we built, such as the AutoLift technology, improve performance further by virtue of data co-location. The rich metadata layer that Iceberg provides helps us build fine-grained security controls on top of Iceberg, as well as very useful abstractions such as the Iceberg Rust catalog. These features and abstractions help us make rapid improvements in our tech stack, where we are able to experiment rapidly with compute engines.

We plan to replace the legacy compute engines with newer, modern compute engines over time. This will lead to a reduction in tech debt, improved developer productivity, and a modernized overall platform.

We are also developing other Iceberg services, such as an Iceberg stats service, which provides a centralized store that collects interesting statistics about access to Iceberg tables and keeps the history of those stats. Without this service, the Iceberg snapshots would eventually expire and that history would be lost. With it, we are able to keep the history of these stats, which will improve our other data systems within Netflix.
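One way to picture such a service, as a purely hypothetical sketch: snapshot-level stats are copied into a durable store keyed by table and snapshot, so they survive snapshot expiration on the table itself.

```python
# Hypothetical sketch of a stats service: per-snapshot statistics are recorded
# in a separate, durable history so they outlive snapshot expiration.
class StatsStore:
    def __init__(self):
        self._hist = {}            # table name -> [(snapshot_id, stats), ...]

    def record(self, table, snapshot_id, stats):
        self._hist.setdefault(table, []).append((snapshot_id, stats))

    def history(self, table):
        return list(self._hist.get(table, []))

store = StatsStore()
store.record("db.t", 100, {"row_count": 10})
store.record("db.t", 101, {"row_count": 12})
# Even after snapshot 100 is expired from the table, its stats remain queryable.
```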

For example, we are able to implement an incremental data copying service that relies on this Iceberg stats service. In closing, we foresee a future where Iceberg and its ecosystem become the industry standard. We expect every compute engine vendor to support Iceberg and the ecosystem around it, and this will ensure fast, secure, and cost-effective data management practices across the industry.

Finally, we are open sourcing our Hive-to-Iceberg migration tooling, both to provide a reference for the broader community and to showcase how such migration tooling can be built and used.

We would like to thank Ryan Blue and Daniel Weeks for inventing Iceberg and providing us with a really strong foundation to build upon, and to thank all current and former Netflix colleagues, our stunning colleagues, who have contributed to this journey in some shape or form.

I would also like to thank every one of you who took the time to attend our talk. Last but not least, I would like to thank AWS for providing this terrific platform where we were able to share this whole journey with you. Thank you.
