Safely migrate databases that serve millions of requests per second

That was a really nice intro. Alright, who's here to talk about databases? Some folks here to talk about databases? Anybody? Alright. Come on, wake up. Let's go. I know it's not the morning. I know, I know, I appreciate you. I appreciate you for being here.

Alright. My name is Joey. I'm here with Ayushi, and we're going to talk about migrating databases safely at millions of requests per second. Thanks to Joe's intro, I think we can skip this: I'm Joey, that's Ayushi. Today we're going to walk you through a story of why we have to do data store migrations, and then, when we have to do them, what our techniques are and how we can design our systems to be migrated in a zero-downtime, completely available way. And then finally, we'll close with a couple of case studies that show you the power of this general system and how we can apply the same techniques to different migrations and have them all be safe.

Alright. So let's start with the problem. The problem at Netflix is that we have a lot of state and it's all over the world. Every red country on that map is a country that uses Netflix, and the purple dots are where we store copies of our data. At Netflix we use four Amazon regions with three availability zones each, which means that we have 12 copies of data spread across the world for almost all of our user data. Keeping all of this consistent is tough.

Why do we do it? Well, we do it because of the scale: we have massive volumes of traffic to these data stores and massive amounts of data to store in them, and our developers expect single-digit-millisecond latency, or in the case of some of our caching layers, less than a millisecond, at all times. We just wanted to show you the reason why safe data migrations, being able to change these data stores, is so important to us: we simply have a lot of them, and if we don't make it safe, we're going to make mistakes, and nobody wants Netflix to not work. That would be a bummer. But it's more than just scale.

It's also that the solutions you can use to solve problems overlap. I like to talk about this as a universe of database requirements. At Netflix, we have lots of microservices, those microservices have some requirements, and those requirements fit into this kind of picture. You'll notice that the different solutions, like a key-value store, a relational store, or a blob store, overlap: some use cases can be satisfied by multiple options. At least at Netflix, we have use cases that span all over the place. Some of them can be handled by multiple stores; some of them have no solution, and we're very sad, and that's where the data platform team we work on comes in, because sometimes we can build a solution. But sometimes there isn't a good solution, and sometimes there is more than one solution to the same problem. It gets worse, though, because it turns out that use cases aren't static; they don't stay the same.

For example, here we have a use case that started off where we could have picked a key-value store, a relational store, or a blob store to satisfy that requirement. But then over time, as that application evolved, it became only satisfiable with a blob store. We had a two-thirds chance of picking wrong, and we would have had to migrate them. You might ask, well, what causes that shift? At least at Netflix, it has to do with different features or products shifting their demands.

So for example, over time maybe a service gets more popular and gets more traffic, or maybe it's just accumulating data and the storage goes up over time. As these things scale, that's what causes those use cases to move around the universe of database requirements. Who here has had a case where you picked a data store, and then two years later you were like, oh man, if only we'd picked the right data store? Anybody? Yeah, we've got a couple of folks. OK, a lot of folks. That's good. This talk hopefully will help you.

It's not just the use cases changing, though; the offerings change too. Amazon is constantly releasing new capabilities. For example, I've picked three here: DynamoDB global tables, Aurora global databases, and S3 strong read-after-write consistency. These were three announcements from Amazon which really enlarged the span of problems that those offerings could solve at Netflix, which meant that even if your use case stayed exactly the same, it's possible that with a new change to an offering you might have to migrate as well.

And then finally, there's cost, because it turns out that pricing is another dynamic here. In particular, as you scale, the cost tradeoffs between managed services and self-managed services change. On the x-axis here we have a kind of numerical measure of scale; there are a lot of factors that go into scale. On the y-axis we have how much that costs. You'll notice that the fully managed service starts off at zero cost. That's fantastic, right? You don't have to deploy any infrastructure, you don't have to pay any engineers to set stuff up. Contrast that with a self-managed option, where maybe you have to actually write some controllers or build some expertise, and in that world there's a high initial cost. As you scale, though, it's often the case that the fully managed service's cost grows at a higher rate, which means that even if you pick the right thing at the initial level of scale, as that use case scales out, the correct choice might change. And this is dependent on pricing.

The good news, at least, is that the question mark percentage there, which is asking how many use cases fit under this, is actually quite high at Netflix. And if you have the ability to move between these options, then you can use the optimal engine for price-performance for everyone. You know, we like to joke that this was the year of efficiency. How many people had an efficiency effort this year? Yeah, a lot of people. We did too, and a lot of it was understanding these curves: understanding where we could choose a better data store to serve a particular use case and save dollars.

Alright. So we've just done a lot of motivating of why you might have to migrate. I do want to caution that you really should avoid doing database migrations if you can. For example, if you're just doing software updates or kernel updates, you really shouldn't move state around. Instead, I highly encourage you to check out the talk that our colleagues gave yesterday, NFX305, on how to iterate faster on stateful services. They showed a lot of really useful techniques for how to avoid data migrations if all you're doing is touching software or hardware. But sometimes, sometimes you must migrate.

So when you must, understand how to do one, automate that migration, and automate safety into every migration. And then we're going to execute those migrations in parallel, and that's going to be the key. Alright. So let's dive in and see some techniques that we can use to migrate state when we have to.

Alright, we're going to do a three-step process here. First, we're going to abstract the underlying storage API, because it's hard to migrate data stores if your applications are strongly coupled to those data store APIs. So we're going to figure out how to decouple the applications from those underlying storage engine APIs. Then we're going to shadow traffic idempotently, and we'll go into what that means in a second, across both writes and reads. And then finally, we'll merge those in-flight writes with backfilled data in an idempotent fashion and verify that the new data store has the properties that we expect. Let's take this and break it down into pictures.

We're going to follow this. We start by setting up a client that sends traffic to some type of proxy; that proxy forwards the traffic to an implementation, which stores data in our database. The first step of setting up that shadow is setting up a second implementation, which writes to some other database, and then having our proxy layer duplicate that traffic. The reason that the lines are dashed there is to indicate that you always want that second side to be a shadow, or dark launched, because typically, at least in my experience, sometimes the shadow breaks because there's something that you didn't think about. And if the shadow breaks, you don't want it to bring down your real production workload.

Third, we're going to initiate a backfill, where we copy all of the data out of database one into database two, and we have to copy it in a way that's idempotent and merges with the writes that are coming in from that second implementation. Otherwise, we're going to end up with database corruption, and nobody likes database corruption. Once our monitoring says that this looks good, we can promote and switch traffic from primarily being served by implementation one to implementation two. And you'll notice that we didn't remove implementation one, because even after all the verification that Ayushi is going to talk about between those two stages, sometimes something still goes wrong and you want to be able to quickly roll back to implementation one. And then finally, if everything looks good, we can decommission. So this is the full picture, and this talk is going to walk through each one of these steps and show you how to do them.
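As a rough illustration of that backfill step, here is a minimal sketch. It assumes a target store that resolves duplicate writes by write timestamp, which is the idempotency mechanism described in the next section; the SourceReader, TargetWriter, and Backfiller names are hypothetical, not Netflix's actual tooling.

```java
import java.util.Iterator;

// Sketch of an idempotent backfill: copy every row from the source store into
// the target, preserving the source row's original write timestamp so that
// replays and collisions with live shadow writes resolve deterministically.
final class Backfiller {

    interface SourceReader {
        Iterator<SourceRow> scanAllRows();
    }

    interface TargetWriter {
        // writeTimestampMicros doubles as the idempotency token: writing the same
        // row twice, or racing with a newer shadow write, converges to one value.
        void upsert(String hashKey, byte[] itemKey, byte[] value, long writeTimestampMicros);
    }

    record SourceRow(String hashKey, byte[] itemKey, byte[] value, long writeTimestampMicros) {}

    static void run(SourceReader source, TargetWriter target) {
        Iterator<SourceRow> rows = source.scanAllRows();
        while (rows.hasNext()) {
            SourceRow row = rows.next();
            target.upsert(row.hashKey(), row.itemKey(), row.value(), row.writeTimestampMicros());
        }
    }
}
```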

Alright, let's get started. I'm excited. You guys excited? Yeah. Alright. So how are we going to set this up? What do we actually mean by a client talking to a proxy and an implementation? Well, at Netflix, we like to use this thing called a data abstraction, or a data gateway. What that is, is an API the data platform team owns. We define that GetItems and PutItems API, we operate that data gateway service, and we implement a translation layer that takes those GetItems and PutItems calls and translates them into the database-specific APIs.

So for example, here we have Apache Cassandra: we're taking GetItems and translating that into SELECT queries, we're taking PutItems and translating that into INSERT queries. But the really cool thing about this is that we can change everything on the right. We can change the data store, and the client, the application, doesn't know; the application is still calling get and put. For them, the data store hasn't changed. This is an example with DynamoDB, where we're translating puts into, for example, BatchWriteItem.

Alright. But what does that API really look like? It sounds nice in theory; what does it look like? Well, this is the key-value API at Netflix, and the data model for the key-value service is a hash map of strings to sorted maps of bytes to bytes. We'll get into why that sorted map is so important in a bit. We essentially just have CRUD here, right? We have put, we have delete, we can do get, and then we can scan: get returns all the items within the inner map, scan returns all the ones in the outer map. If you've used DynamoDB, this hopefully looks kind of similar; it's kind of the same idea. But it has a couple of key concepts embedded, and one of them is idempotency tokens. All mutative operations in our data APIs take these idempotency tokens. What is an idempotency token? Well, it's a mechanism for us to be able to replay writes and know that those writes deduplicate.
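To make that shape concrete, here is a minimal sketch of what such a key-value abstraction could look like. The names here (KeyValueClient, Item, Page, PageToken, IdempotencyToken) are illustrative assumptions, not Netflix's actual API.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Optional;

// Illustrative sketch of a key-value data abstraction: the logical data model is
// HashMap<String, SortedMap<bytes, bytes>> -- an outer hash key mapping to a
// sorted inner map of item keys to values.
interface KeyValueClient {

    // All mutative operations carry an idempotency token so they can be
    // replayed/duplicated to a shadow store and deduplicated there.
    void putItems(String hashKey, List<Item> items, IdempotencyToken token);

    void deleteItems(String hashKey, List<ByteBuffer> itemKeys, IdempotencyToken token);

    // Returns one bounded page (~2 MiB) of items within a single hash key's inner
    // sorted map, plus a token to resume from if more data remains.
    Page<Item> getItems(String hashKey, Optional<PageToken> resumeFrom);

    // Scans across hash keys (the outer map), also one bounded page at a time.
    Page<String> scanHashKeys(Optional<PageToken> resumeFrom);
}

record Item(ByteBuffer itemKey, ByteBuffer value) {}

record Page<T>(List<T> contents, Optional<PageToken> nextPageToken) {}

record PageToken(byte[] opaqueState) {}

record IdempotencyToken(long timestampMicros, long nonce) {}
```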

So in this example, we're writing some values into this x register: write x equals one at token one, write x equals two at token seven, and so on and so forth. And then our data store is resolving conflicts by always using the latest token. Different data stores resolve conflicts in different ways; you might have to actually model it into your data model for this to work. But, for example, Apache Cassandra resolves writes using last-write-wins.

So what that means is that the last write there, put x equals one at t equals one, doesn't change the value of x, because when it deduplicates against the write before it at t equals twelve, t equals one is less than t equals twelve, so the database does not update the value.

So broadly speaking, idempotency tokens follow this kind of pseudocode: we get some token that depends on the data store, we send a write with that token, and then when we're duplicating it to another data store, we make sure we send the same token so that we can merge those streams.

So what does that look like for a last-write-wins token? It's pretty simple: you just get the timestamp and add a random nonce. We like to mix the random nonce into the last digits of the microsecond timestamp, just to increase the probability that things are different across machines.
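A minimal sketch of that token generation, assuming Java and a wall clock that is effectively millisecond-precise, so the sub-millisecond digits are free to carry the nonce; this is illustrative, not Netflix's exact code.

```java
import java.time.Clock;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

final class LastWriteWinsTokens {

    // Generate a microsecond-resolution timestamp token with a small random
    // nonce mixed into the lowest digits, reducing the chance of collisions
    // across machines that write within the same millisecond.
    static long generate(Clock clock) {
        Instant now = clock.instant();
        long micros = now.getEpochSecond() * 1_000_000L + now.getNano() / 1_000L;
        long nonce = ThreadLocalRandom.current().nextLong(1_000L); // 0..999
        // Keep the millisecond part of the timestamp, fill the microsecond digits
        // with the nonce.
        return (micros / 1_000L) * 1_000L + nonce;
    }
}
```

A token generated this way is then passed unchanged to both the primary and the shadow store, so last-write-wins resolution gives the same answer on both sides.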

And you might say, but Joey, you've violated a pretty important rule of distributed systems: you used a clock. Everyone knows clocks don't work, right? In our experience, they actually work really well, especially since AWS Nitro.

Nitro has brought clock delay down, as long as you're using the most recent kernels and the proper NTP daemons. This is real data from the Netflix fleet of over 20,000 EC2 instances across 10 different instance families, all using Nitro. This is clock delay, essentially a measurement of clock accuracy. That's phenomenal, right? Less than a millisecond of clock delay; clocks are pretty accurate.

The bigger problem is actually precision: you want a microsecond-level timestamp, but you can only get a millisecond one, because Java is amazing. But at Netflix, what we found is that a lot of the arguments against clocks aren't really made with modern cloud environments in mind. Amazon has atomic clocks and GPS hardware that you don't have on your own. Amazon has that; it's really useful. Use it, it makes your job easier.

Alright. So we've talked about writes; we've talked about how to make those tokens. But what about the reads? When you're designing APIs to be shadowable, you have to think about the reads as well. We don't want an API that returns giant amounts of data, where comparing the results requires you to load gigabytes of data into memory or something.

Instead, we want to figure out how to break down those read APIs and paginate through them, yielding items. So here's some pseudocode of the KV GetItems API: it returns a page of items, usually about two megabytes, and then the service returns a page token saying, hey, there's more data, you have to keep reading.

Now, why is this important? Why is this important for shadowability? It's important because most data stores return pages of results using some kind of database-specific cursor. For example, if you're using a SQL database, there's usually some particular cursor implementation for that particular result set.

But when we're shadowing across data stores that may have different mechanisms of resumability, we have to build resumption into the API. So how does that work for key value? Well, it takes advantage of that sorted map, because sorted maps are resumable: all you have to do is record the last key that you returned into the token. Then no matter what the implementation is, whether it's SQL, Cassandra, or Redis, you can equally resume a read query from that position in the sorted map.
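Here is a small sketch of that idea, reusing the hypothetical PageToken type from the earlier sketch: the token just records the last item key returned, so any backing store that can seek in sorted key order can continue the read.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.Optional;

// Sketch: an implementation-agnostic page token for reads over a sorted map.
// Whatever the backing store (CQL, Thrift, SQL, Redis), resuming just means
// "continue from the first item key strictly greater than this one".
final class SortedMapPageToken {

    static PageToken fromLastReturnedKey(ByteBuffer lastItemKey) {
        byte[] copy = new byte[lastItemKey.remaining()];
        lastItemKey.duplicate().get(copy);
        return new PageToken(Base64.getEncoder().encode(copy));
    }

    static Optional<ByteBuffer> exclusiveStartKey(Optional<PageToken> token) {
        return token.map(t -> ByteBuffer.wrap(Base64.getDecoder().decode(t.opaqueState())));
    }
}
```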

And this is actually something that I think is not super intuitive, right? You think, well, obviously I can just resume the query. But in practice, you have to think about these things before you get into the migration; otherwise you're going to be trying to compare massive result sets and you just won't be able to build the confidence that the two sides of the shadow are functioning the same.

Alright. So we've seen how to make our APIs ready for migration. We've seen on the write side that we've made every mutation in the system idempotent with idempotency tokens. We've talked about breaking down work on the read side into fixed-size work, not counts, with resumable pages.

The same principle actually applies to writes as well. For example, if you have a very large value, you want to break it down; a lot of APIs do this through some form of chunking or partitioning. For example, S3 has part numbers, right? Those are chunks: they're just breaking down larger values into smaller ones that make them individually retrievable.

Now that we've made our data APIs ready for migration, let's dive into actually doing the migration. And for that, we need some shadow traffic.

Alright. What's shadow traffic? That's step two in our diagram, and shadow traffic is where we start capturing all of the GetItems, PutItems, and Scan calls and sending them to the second implementation in a shadow mode, which means that the primary, implementation one, is responsible for actually answering the client, but we're also sending the traffic to implementation two and getting that data into its data store.

And again, at Netflix, we like to do that using the gateways. So for example, here we see a PutItems call hitting the data gateway, and then the data gateway is duplicating that traffic between an Apache Cassandra CQL implementation on the top and a Thrift Cassandra 2.0 implementation on the bottom.
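Roughly, the duplication inside the gateway can look like the sketch below, reusing the hypothetical KeyValueClient types from earlier. The key points are that the primary answers the caller, the shadow write reuses the same idempotency token, and shadow failures are only counted, never propagated, so a broken shadow cannot take down production.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of dark-launched write shadowing in a data gateway.
final class ShadowingPutHandler {
    private final KeyValueClient primary;   // e.g. CQL-backed implementation
    private final KeyValueClient shadow;    // e.g. Thrift- or DynamoDB-backed implementation
    private final Executor shadowExecutor;  // isolated pool so shadow slowness can't block prod
    private final AtomicLong shadowErrors = new AtomicLong();

    ShadowingPutHandler(KeyValueClient primary, KeyValueClient shadow, Executor shadowExecutor) {
        this.primary = primary;
        this.shadow = shadow;
        this.shadowExecutor = shadowExecutor;
    }

    void putItems(String hashKey, List<Item> items, IdempotencyToken token) {
        // The primary path is synchronous and is what the client's response depends on.
        primary.putItems(hashKey, items, token);

        // The shadow path reuses the SAME idempotency token so the two streams
        // can later be merged with the backfill; errors are swallowed and metered.
        CompletableFuture
            .runAsync(() -> shadow.putItems(hashKey, items, token), shadowExecutor)
            .exceptionally(t -> { shadowErrors.incrementAndGet(); return null; });
    }
}
```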

I'll forgive you if you don't know about Apache Cassandra Thrift; it's a really old version of Cassandra. We'll talk later about how we finally got off of it after a decade. But you'll notice that we're translating the same PutItems call into these different APIs, into batch_mutate or INSERT INTO. We can also do that for other data stores.

So for example, if we're doing the migration to DynamoDB, we can use BatchWriteItem. Sometimes we can't use the service, because that general-purpose API we've designed is not general purpose enough for a particular service. We would say that Netflix is about a 90/10 split: 90% of users can migrate using our well-defined APIs, and 10% have these really weird data requirements.

So for those, we take the same technique, but we bring it into a library in their actual code base. Here you see that kind of bridge implementation, which is then duplicating to the key-value data access object and the Dynomite data access object. Dynomite is just a Netflix Redis-like system; it's a Redis-compatible proxy.

And you can see here that we can use the same technique: even though we don't have that separate server, we can still have that bridge interface, we can still have those two implementations, and we can still translate what the client is seeing into those back ends.
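In library form, the same pattern is just a bridge implementation of whatever interface the application already codes against. A hedged sketch, with hypothetical DAO names, might look like this:

```java
// Sketch of a library-based "bridge" data access object: the application keeps
// programming against its existing interface, and the bridge fans out to the
// old and new implementations, serving responses only from the primary.
interface ProfileDao {
    void save(String profileId, byte[] payload);
    byte[] load(String profileId);
}

final class BridgeProfileDao implements ProfileDao {
    private final ProfileDao primary;  // e.g. Dynomite-backed DAO
    private final ProfileDao shadow;   // e.g. key-value-backed DAO

    BridgeProfileDao(ProfileDao primary, ProfileDao shadow) {
        this.primary = primary;
        this.shadow = shadow;
    }

    @Override
    public void save(String profileId, byte[] payload) {
        primary.save(profileId, payload);
        try {
            shadow.save(profileId, payload);   // dark launched: failures must not escape
        } catch (RuntimeException ignored) {
            // meter this in real code; a shadow bug must not break the working path
        }
    }

    @Override
    public byte[] load(String profileId) {
        byte[] result = primary.load(profileId);
        try {
            shadow.load(profileId);            // read shadow purely for comparison/latency
        } catch (RuntimeException ignored) {
        }
        return result;
    }
}
```

Note that this only softens the risks discussed below; the shadow still shares a JVM with production code.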

We can do the same thing for writes with a library; here is an example with Cassandra and Thrift again. And then, of course, we can also shadow reads using the gateway technique that we talked about earlier.

So in this case, we've got a GetItems call, and again we're translating it between CQL SELECTs and Thrift get_slice calls; and we can shadow reads in the library as well. So you can see how this shadow mode, plus two or more implementations, can be generalized both within a server and to libraries that implement arbitrary interfaces.

Sometimes you have to do a library-based migration for the reasons I talked about earlier. And it really does generalize: a library-based migration can support any bespoke interface, any interface that you can define in your programming language.

However, the reason that Netflix generally tries to avoid library-based migrations and instead use those gateways is that when you have the library implementation, there's a lot of bad stuff that can happen.

For example, the shadow can impact prod. What do I mean by that? Well, because your code is running in the same JVM or other execution environment, the two sides can impact each other. So, for example, if you have a bug in your new code that's talking to the new data store, that bug can actually bring down your good code that works. Who's ever deployed a bug while doing a migration? No? Oh, come on. Me, I've done it. It happens.

And when you do that, the problem with the library-based technique is that those bugs can bring down prod. So we try to avoid this if we can, and when we have to, we try to be very diligent about the implementations being safe.

We also have to add bespoke insight, because this isn't a service layer: we don't have the out-of-the-box inter-process communication metrics that we're very used to at Netflix, which, as we'll talk about later, are very useful for verifying that the two sides are equally performant.

It's also hard to quickly iterate, and it's hard to promote the primary in a quick way. For example, you have to build some kind of control plane into the library so that when you're ready to promote, the window of promotion is as narrow as possible.

So for all of those reasons, we try to avoid this, but it is always there as a kind of backup option if you can't use the gateway approach.

Alright. And with that, I'll hand it over to Ayushi, who will talk to you about backfilling.

If we do a scan at the target database, the order of the data might be different than at the source, and hence it can result in an inefficient comparison. So for offline comparisons, we make use of the backups.

What we do is this: backups of the source and the target database are in S3. We read the snapshots taken at the same time for both databases and load them into Hive using a Spark job. It is the responsibility of the Spark job to do the data checks and populate the differences, so that we can look at the specific rows where the data is not in alignment. The backups can also be used for functionality comparison.
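The core of that data check, stripped of the Spark machinery, is a merge-comparison of two key-ordered snapshots. A simplified sketch, with hypothetical row and reporter types, might look like this:

```java
import java.util.Arrays;
import java.util.Iterator;

// Sketch of the offline data check: walk two snapshots in the same key order
// and report rows that are missing from one side or whose values differ.
final class SnapshotComparator {

    record Row(String key, byte[] value) {}

    interface DiffReporter {
        void onlyInSource(Row row);
        void onlyInTarget(Row row);
        void valueMismatch(Row source, Row target);
    }

    // Both iterators must yield rows sorted by key.
    static void compare(Iterator<Row> source, Iterator<Row> target, DiffReporter reporter) {
        Row s = next(source);
        Row t = next(target);
        while (s != null || t != null) {
            int cmp = (s == null) ? 1 : (t == null) ? -1 : s.key().compareTo(t.key());
            if (cmp < 0) { reporter.onlyInSource(s); s = next(source); }
            else if (cmp > 0) { reporter.onlyInTarget(t); t = next(target); }
            else {
                if (!Arrays.equals(s.value(), t.value())) reporter.valueMismatch(s, t);
                s = next(source);
                t = next(target);
            }
        }
    }

    private static Row next(Iterator<Row> it) {
        return it.hasNext() ? it.next() : null;
    }
}
```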

Suppose we want to release a new version of a Spark job and we want to make sure that the new version works as expected. What we'll do is pass the same backup snapshot to the old and the new job and verify the differences. If the outcomes of both jobs align, we are good to release the new version; but if there is any discrepancy in the data, we might have to go back and fix the bugs.

This is just an example of one of the Spark jobs we have at Netflix: we have a stable version which is used for benchmarking, we have a canary version which is basically the system under test, and there's a comparison setup where we populate the differences between the results of the two systems.

So now let's highlight some key differences between online and offline comparison. Online comparison is very efficient for a smaller data set; if we try to perform it on a larger data set, it might place a huge burden on the system and degrade its performance. It is also closer to what the client observes, because it points at the real-time data in the system, unlike snapshots or backups, which may or may not be up to date.

Whereas if you want to do a comparison on larger data sets, offline comparison is the way to go. It helps us deal with a massive amount of data in the background by doing the data loading and restoration independently and concurrently, without affecting the performance of the live cluster, because we don't interact with the live cluster at all. So we don't have to worry about performance degradation in that case.

So what is the best way to do correctness verification? I suggest combining both. We should always verify a small sample of data using online comparison, and once we have enough confidence, we can compare 100% of our data using offline comparison.

Now let's talk about verification of performance. Again, there are two ways to do it: either online, using the read and write latency metrics, or offline, where we have used canaries and profiling reports.

So in order to do the online comparison, we dark launch not only our writes but our reads as well. What does that mean? When the application sends a query, the data gateway takes care of sending it to both databases, but it uses the result from database one to serve the application's traffic, and the result from database two is disregarded. But then why do we need to perform reads and writes on database two in the first place? The reason is to gauge performance: it helps us understand the latency differences between the two systems and how we can make them better.
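On the read path, the dark launch looks roughly like the sketch below, again reusing the hypothetical KeyValueClient types from earlier: database one serves the caller, and the only thing kept from database two is its latency, never its result.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.BiConsumer;

// Sketch of a dark-launched read: database one serves the caller, database two
// is queried only so we can record its latency for the online comparison.
final class ShadowingGetHandler {
    private final KeyValueClient primary;
    private final KeyValueClient shadow;
    private final Executor shadowExecutor;
    private final BiConsumer<String, Long> latencyRecorder; // (metricName, nanos)

    ShadowingGetHandler(KeyValueClient primary, KeyValueClient shadow,
                        Executor shadowExecutor, BiConsumer<String, Long> latencyRecorder) {
        this.primary = primary;
        this.shadow = shadow;
        this.shadowExecutor = shadowExecutor;
        this.latencyRecorder = latencyRecorder;
    }

    Page<Item> getItems(String hashKey, Optional<PageToken> resumeFrom) {
        long start = System.nanoTime();
        Page<Item> result = primary.getItems(hashKey, resumeFrom);
        latencyRecorder.accept("read.latency.primary", System.nanoTime() - start);

        CompletableFuture.runAsync(() -> {
            long shadowStart = System.nanoTime();
            shadow.getItems(hashKey, resumeFrom);   // result intentionally discarded
            latencyRecorder.accept("read.latency.shadow", System.nanoTime() - shadowStart);
        }, shadowExecutor).exceptionally(t -> null); // shadow failures never reach the caller
        return result;
    }
}
```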

This is an example of read latency, and we have also computed the same thing for write latency. If we see that there's a huge difference, we can make some adjustments and check in real time whether those adjustments are reflected in the latencies at all.

But can you imagine doing this for hundreds of clusters manually? If that were the case, we would have submitted this talk for re:Invent 2024. That's why we automated the entire process using canaries, which use the old system as the stable benchmark and the new system as the system under test, and gauge performance before we do the full-scale rollout.

This is an example of the canary we have used at Netflix, where we can specify the different metrics it should monitor for the performance evaluation. We can also specify what the outcome should be if the performance is not as expected: whether it should alert us or completely fail the canary.

This is just an example of a canary which used these customized parameters and failed the performance evaluation for our system. We have also used CPU profiles to find bugs; these are a few of the issues we figured out.

We realized that there were a few queries which were not optimally compiled and were using a huge amount of system resources, leading to inefficient processing. We figured out a few serialization and deserialization issues. We realized that we were over-allocating resources, which was leading to frequent garbage collection cycles.

We have also used CPU profiles to compare the processes side by side. As you can see in this example, we have two processes running: one is the CQL process and the other one is the Thrift process, and we used the CPU profile to examine their performance. After the evaluation, we realized the CQL path was taking 30% of the time on the CPU, whereas the Thrift path was taking 13% of the CPU time.

So what is the entire cycle to get to the promotion? Basically, we start with a shadow launch: we shadow our reads and writes, then we backfill the data, and then we monitor its performance. If there are issues, we try to fix them, and we keep fixing and monitoring until we see no issues.

Once we are confident, we promote our secondary to serve the read and write traffic. At this point, we still have our old database; we have not decommissioned it. Then we watch the performance of the new cluster.

If we see any more issues with the new cluster, then, because we still have our old database, we just roll the reads and writes back to the older database. In the meantime, we try to find and fix the bugs in the new implementation, and we continue this entire cycle until we see zero regressions and we are confident that we are not seeing any more issues. At that point, we are ready to decommission the old cluster.

And that brings us to our last step, where we have decommissioned the old cluster and the new cluster is solely serving the read and write traffic. So these are the steps involved in doing a database migration. But this is just for one migration.

Now imagine the scale at which Netflix operates: we have huge amounts of data which we have to move from one system to another, and not just for one system; there are thousands of systems involved.

So that's why I want to share with you what we have accomplished so far in 2023 as part of our database migration initiative: we have migrated more than 250 databases, impacting more than 300 applications, dealing with thousands of terabytes of data, across 12 different migration paths. And all of this is possible because of the migration steps put together by the data platform team.

Now that we have done this huge migration effort, I want to walk you through a few of the case studies that were part of these migrations.

The first one is the Cassandra version upgrade. We recently migrated our entire Cassandra fleet from version 3 to 4.1, and the reason was that Cassandra 3.x was reaching its end of life, and there are so many good features in 4.1 which our streaming platform could benefit from.

What is the major concern in this migration? As we all know, when there are new upgrades, a few properties get deprecated and a few get added, and the new system could also change the way it handles a certain workload. So it was very important for us to do configuration evaluation and performance evaluation.

What is the strategy we followed? We used the shadow read and write process, and once we were confident in the performance and the data correctness, we did an in-place upgrade of the system.

Basically, we shadowed our writes using the data gateway, which took care of writing to both Cassandra 3 and Cassandra 4. We did that for our test load and looked at its performance, and we saw that the new system had higher latencies and was not performing better than our old system. We thought, OK, it's not too bad; probably we can test it with the production load and see how the performance is.

When we ran the test with our production workload, we saw that our write latencies doubled and the read latencies were even more than double for the new system, and our canaries also refused to let us promote the new system.

Basically, we were shocked: what just happened? The new system is supposed to make our life easier, but what we were seeing was worse performance. Then we thought, OK, let's try to figure out what is different in the execution: what is the CPU difference, what is the profile difference between the two systems?

When we did the CPU profile analysis, we realized that the nodes running Cassandra 4 were misconfigured to use the CMS garbage collector in place of G1 GC. They were also misconfigured to use the Sun security provider, which is a little slower for the Netflix environment.

And when we checked the Cassandra 3 nodes, we realized that they were configured to use the Amazon Corretto Crypto Provider, which is significantly faster. So these are a few of the differences which we found from the CPU profile.

We also realized that the heap percentage usage on the Cassandra 4 nodes was higher, and then we figured out that there were a few heap setting differences in the configuration. There were also a lot of hints piling up, because in Cassandra 4 we were throttling the rate at which hints can be transferred to the other nodes.

After doing all this debugging, we figured out that Cassandra is not the problem; it's the Netflix configuration. So we fixed the Netflix configuration for Cassandra 4 and monitored its performance once the configuration had been fixed. These were the results: the red line denotes the Cassandra 4 performance and the green is Cassandra 3.

There was a significant improvement in the latencies: the reads in Cassandra 4 were performing way better than Cassandra 3, and the write latencies were on par with what we were observing in Cassandra 3. Then we waited for our canaries to give their result, we passed the performance evaluation with flying colors, and we finally promoted, that is, we did the in-place upgrade of the cluster in this case.

Now, Joey will talk about the Thrift to CQL migration. Thank you, Ayushi. I just want to point out one of the things I'm so proud about: those kinds of issues, nobody at Netflix knew about them. The data platform team was the only one who had to find and fix those issues; the application teams were never impacted, because of the shadowing. This capability to not impact the app teams became especially important when we undertook our Thrift to CQL migration, which was where we tried to upgrade our Apache Cassandra 2 clusters to 3.0 so that they could then be upgraded to 4.x.

I think we might be seeing a pattern here. Why? Well, because Thrift was deprecated in 2016. Last I checked, it's 2023; we really should get on this, we're kind of behind. But why hadn't we done it? We hadn't done it because it was really scary: between Cassandra 2 and Cassandra 3, the entire storage engine got rewritten. As we tried to upgrade a couple of test clusters from 2 to 3, we started seeing data corruption issues because of our use of Thrift, and the 3.0 storage engine wasn't particularly well tested with Thrift. So we were very concerned about this upgrade, and we kept putting it off as much as we could.

In 2023, we decided to tackle this head on. We were going to shadow traffic to over 200 databases.

We were going to backfill using our strategies, and we were going to use these massively parallelized and reproducible parity-check systems. How'd it go?

Well, again, we defined a couple of migration paths, and we started with what at Netflix we call paving the path, which means that we try to make something really easy for people to use. In this case, we paved the path of migrating via the data gateway. This was our recommended option for most users. In this world, folks would actually call into the data gateway, so they would do an API migration, but they wouldn't do a data migration, and then the API server, the data gateway, would translate between Thrift and CQL. We'll see later how this kind of decoupling is really helpful; for us, it was really important because it allowed the application teams to make their code changes without migrating all of those thousands of terabytes of state. At the time we started this project, we didn't have the offline comparison systems that Ayushi was talking about, so we didn't know how we were going to verify petabytes of state. This allowed our app teams to move quickly and finish their API migrations, and then the data platform team could later go back and finish the data migration at our own pace, once we had the tooling in place. There was a secondary path, though, and that secondary path was using a library. Here we see that a client service couldn't use the gateway; maybe they were using some complex Thrift feature that we didn't want to support in the gateway. For those use cases, we do what we call white-gloving, which is where we help those users one on one. White-gloving doesn't really scale very well, but if you use it as a backup path, it can help you avoid building more complexity into your abstractions.

Alright. So what does it look like? Well, this was a large migration campaign, over 200 applications, and this is the burn-down over roughly two years. You can see that we were able to achieve a large number of migrations very quickly, and that was because of this separation of the API migration from the data migration: those teams were able to finish their migration, and then we on the data platform were able to later move the data for them. All of those ones over there that are red, all of those are the bespoke library-based migrations, right? Because you have to actually get the app team to build those implementations, because they have something special.

So one thing that we try really hard to do on the data platform is get as many folks onto that paved path as possible, so that we can take the pain of the data migration for the app teams. But along the way, what bugs did we find? I always love these, because I think it shows the value of having that kind of shadow system and the comparison system. What are some bugs? Well, we found a pretty serious regression in multi-hash-key scanning. In the Thrift implementation of Cassandra, multi-key reads happened in parallel, so if somebody requested 100 hash keys, they'd get back all 100 at once in one lookup. In CQL, the upstream had changed it so that reads are batched by coordinator and then executed sequentially. So if you have a cluster with 100 nodes, this can make your query up to 100 times slower. This was pretty unacceptable; the user was like, hey, I can't do this migration, the latency would go up by a factor of eight.

Luckily, because we control that gateway, we didn't have to go convince the Apache Cassandra community to change the implementation in the coordinator; we could just fix it in the gateway. We implemented parallel reads on our own and were able to bring that latency back down. And importantly, this happened without app teams knowing: we detected this in our canaries. Our canaries failed; an 800% regression is an unacceptable regression. We were then able to identify the issue and resolve it. This is one of the really powerful things about having that abstraction layer: you don't have to get your data store to agree with you, you can just fix it.

Alright. So we found a multi-partition read regression. What else? The next issue that we ran into was significant differences in how Thrift and CQL implemented pagination. In particular, Thrift, as we had it configured for this particular use case, didn't implement pagination. So if you asked it for a row that had 100 megabytes of data, it would return you 100 megabytes of data. This is a problem because key value, as we talked about earlier, returns data in 1 to 2 megabyte pages, order-of-a-megabyte pages. So this meant that for each page, the key value server was retrieving 100 megabytes out of the backend data store: 100 megabytes of work, times 100 pages. So, you know, accidentally quadratic; kind of a bummer.

This also failed our canaries. What was the solution? Well, again, we control that gateway, so we control this behavior. We can detect that we have these large partitions, and then we can dynamically scale down the amount of data that we're asking for in subsequent pages. We call it auto-tuning the page size up and down based on the first page, and we're doing that to try to limit the rows fetched to about one megabyte. So yes, maybe we do that 100 megabytes of work on the first page, but we don't do it 99 more times. This fix was enough to bring the latencies back in check and unblock this particular application from migrating.
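A hedged sketch of that auto-tuning idea: adjust the per-page row limit based on how many bytes the previous page actually produced, aiming for roughly one megabyte per page. The constants and method names here are illustrative.

```java
// Sketch of auto-tuning the page size: start with a default row limit, observe
// how many bytes the previous page produced, and scale the next page's row
// limit up or down so each page lands near the ~1 MiB target.
final class AdaptivePageSizer {
    private static final long TARGET_BYTES = 1 << 20;  // ~1 MiB per page
    private static final int MIN_ROWS = 1;
    private static final int MAX_ROWS = 10_000;

    private int rowLimit = 1_000;  // initial guess for the first page

    int currentRowLimit() {
        return rowLimit;
    }

    // Call after each page with (rows returned, total bytes in those rows).
    void observePage(int rowsReturned, long bytesReturned) {
        if (rowsReturned == 0 || bytesReturned == 0) {
            return;
        }
        long avgBytesPerRow = Math.max(1, bytesReturned / rowsReturned);
        long ideal = TARGET_BYTES / avgBytesPerRow;
        rowLimit = (int) Math.min(MAX_ROWS, Math.max(MIN_ROWS, ideal));
    }
}
```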

Alright. So, multi-partition read regressions, pagination inefficiencies. What about scan? Scan was full of so many problems. Scan across promotion was very frustrating, because the two data stores don't provide the same sort order across hash keys. That's because Thrift used something called the random partitioner, which essentially just takes an MD5 hash of the key, and that's how it locates a 128-bit token on a ring. CQL, on the other hand, had quite intelligently moved to the Murmur3 hash, which is faster and produces a nice 64-bit hash token. But this is problematic when you're trying to shadow scans across these two data sets. In particular, if you had a primary that was Thrift and you were shadowing the same scan to CQL, the CQL implementation would decode the page token, see a 128-bit hash token, and say, well, that's a completely invalid ring position. So the scan would go to a nonsense token. Even if you normalized the tokens to 64 bits, they still went to the wrong place; at least they went somewhere that wasn't an error. More problematic was that clients' scans at Netflix run across, you know, terabyte data sets; those can take hours, right? Scans aren't a fast thing. So a client could initiate a scan against the Thrift implementation, and then, halfway through, if we promoted, they would all of a sudden start scanning somewhere totally different in the CQL implementation. That would be really bad, because the user would be getting incorrect results back halfway through. Also, can you imagine debugging that? Someone comes into your help channel and says, well, halfway through my scan, I just started getting totally random stuff that I'd already gotten before. It's a really tricky bug to debug.

And we didn't want to have to. So we needed a solution to this scan-across-promotion problem.

Alright. So how did we solve that? Well, first we normalized tokens. I talked earlier about how your API has to bring resumability into it; in this case, we did that by normalizing the tokens within the scan itself to something that could be translated with high fidelity to both implementations. And then, in order to make the scan-across-promotion problem go away, we informed the client when they started a scan: you need to continue your scan against this implementation. You can see here that ScanItems is sending back something like "use the data abstraction layer Thrift implementation" along with the token, and that tells the gateway: regardless of who you think is primary, you need to keep sending this particular query to this implementation. So with that, that kind of wraps up some of the bugs. We talked about all of these complicated performance and correctness problems, and all of these were different use cases, right? When you have hundreds of applications, you're going to run into some sample of issues. But again, really key to this is that our application developers didn't know, because the data platform team had all of that insight into the shadow, and we knew that those performance problems and those correctness problems were there before we did any promotions.
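Concretely, the scan page token can carry both the normalized position and the name of the implementation the scan started against, so the gateway keeps routing that scan there even after a promotion. A sketch with hypothetical field names:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of a scan token that pins an in-flight scan to the implementation it
// started on, surviving promotion of a different implementation to primary.
record ScanToken(String implementation,     // e.g. "thrift" or "cql"
                 String normalizedPosition  // last hash key / normalized ring position
) {
    String encode() {
        String raw = implementation + "|" + normalizedPosition;
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    static ScanToken decode(String encoded) {
        String raw = new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
        int sep = raw.indexOf('|');
        return new ScanToken(raw.substring(0, sep), raw.substring(sep + 1));
    }
}

// In the gateway, routing for a resumed scan then ignores the current primary:
//   ScanToken token = ScanToken.decode(request.pageToken());
//   KeyValueClient impl = token.implementation().equals("thrift") ? thriftImpl : cqlImpl;
//   ... continue the scan from token.normalizedPosition() against impl ...
```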

Alright, with that, I'll hand it back to Ayushi to close this out.

Thank you. One more migration we did at Netflix is Dynomite to key-value. Dynomite is a Netflix implementation of a distributed Redis data store. Why was this migration required? Because when we evaluated the performance of Cassandra combined with a cache, we saw that the performance was way better than what Dynomite was giving us. So we carried out this migration. The major concern with this migration was that the user was very sensitive to low latency, and during this entire migration cycle we could not afford to degrade performance. We used shadow reads and writes to carry out the migration. Also, all of the data had a TTL, so we didn't have to copy the data from the old database to the new one as part of a backfill; we just decommissioned the cluster once the data had reached its TTL. This migration was carried out with a library, where the library takes care of writing to key-value and Dynomite at the same time.

So what bugs did we find in this migration? None. Sometimes migrations are not painful, and this one really worked out for us. I think the major reason it worked is that most of our use cases were already supported by key value, and the data was set with a TTL, so we didn't have to move data, which reduced the correctness bugs in this case.

During all of those migrations, we had a few key takeaways, and we want to share those with you. The first is that migrations can be complex and time consuming, but with proper automation you can really simplify them. Automation can greatly speed up the process, reduce human error, and give us our valuable time back so we can focus on other important tasks.

Second, new versions can have correctness and performance bugs, but it's up to us to do the proper evaluation. We should do performance evaluation, unit testing, and integration testing to be sure that the new system is working as expected before we promote it to the production environment.

Third, we should avoid doing API and data migrations together. Isolating them helps us debug issues properly. If you are doing an API and data migration together and there's some issue, it's very difficult to find the source of the problem.

And lastly, we should prefer homogeneous migrations over heterogeneous migrations, because data mapping is easier in a homogeneous migration, as the data structure and system are similar on both sides, and there are fewer compatibility issues to work through.

Thank you for taking the time to listen to us.
