How Netflix uses AWS for multi-Region cache replication

Welcome everyone. Thank you for dialing in. Are we still in COVID? No, we're in person. So thank you for joining us in person.

It's been a great re:Invent so far. I hope you're having an enriching and inspiring experience. I'm Pretti Sherma, a Principal Solutions Architect with AWS. I've had the excellent opportunity to work closely with Netflix for the last four years, collaborating with various teams at Netflix on some really groundbreaking projects.

Like many of you, I'm also an avid Netflix user. I love the content they give me, and whenever I log in (there are over 240 million customers that Netflix has), I get a very personalized home screen. Think about the enormity of the challenge here: 240 million or more customers who, when they log in, instantly get their data on the screen. Not only that, I also marvel at Netflix's resiliency. It has never gone down for me. Whenever I open it on my phone or my TV, it's always there, ready to entertain us. Think about the challenges behind the scenes of delivering that kind of experience. It's a testament to the robustness of AWS foundations and the strong innovation at Netflix that has built so many resilient architectures.

Now, when you open the home screen, there are hundreds of microservices that get called, and all of these microservices need data to quickly display the content that you are so eager to watch. The enormity of the scale here is just mind-boggling. Not only that, they also have to factor in how to fail over. If something goes down and you have to fail over to another region, you need to make sure the data is also present in that other region. And at the scale of Netflix, if each developer team were given the responsibility to maintain this level of resiliency and rapid data access, it would be an overwhelming challenge for all of those teams. This is where Prithviraj, Shriram and the team come in. They have dedicated many years to building a system called EVCache that enables this rapid data access and multi-region replication seamlessly for developers, so they don't have to worry about any of these challenges. They just need to focus on their business use case and push out the application code.

So we are in luck today: Shriram and Prithviraj are going to dive deep into EVCache and walk us through their journey. So welcome, please. Thank you.

So how many of you use Netflix here? OK, I still see some hands down. So I'm going to give you 30 seconds to hop onto the internet and subscribe to the service. It's awesome. Please, please use Netflix.

So today, the topic is going to be how Netflix does global replication of its caching engines. Netflix is really powered by hundreds and hundreds of microservices, so a one-hour talk will never do justice to a deep dive on all of them. The focus of this talk is going to be on caching. I was here with another colleague in 2021, and we did a deep dive on the caching layer. But today, we are not going to talk so much about caching itself; we are going to focus on how we take the caching engine, or caching layer, that is present in one region and replicate its data to multiple regions, and how we have done it.

This is Marie Kondo. She's a very famous Japanese author and also a TV show host, and the philosophy she shares with her audience is to hold on to things that really spark joy and discard everything else in your personal space. Netflix has a similar mission, which is to entertain the world and really spark joy through content for its members. What you see there is primarily the shows and movies. But imagine a world where you had a tiring day at work, you came back home, you're relaxing on your couch, you click your favorite Netflix show and it takes a minute to load. Frustrating, right? Luckily that's not the world we live in. We have a far better experience where, once you click a title, the content renders fairly instantaneously for you regardless of where you're streaming it from. Thanks to the unsung heroes of Netflix engineering, and definitely to AWS, for making that happen.

Let's get into the agenda itself. We are going to spend a quick few minutes introducing the caching engine we'll be talking about, and why we need to replicate cache data in the first place. It feels like an anti-pattern: replicating a database seems fine, but why would you want to replicate cache data? We'll get into the nuts and bolts of why we even need to do that. Then we'll talk about the design and architecture of the system, and we'll give you a window into how we manage a system like this. It's a pretty high-scale system that does close to 30 million requests per second, and we are going to touch upon efficiency improvements.

So 2023 is the year of efficiency, as Mark Zuckerberg has alluded to. I'm pretty sure many of you have invested in efficiency efforts in 2023 to reduce cloud spend. Great, that's Netflix as well. We are going to talk you through what kind of investments went into making this a lot more cost effective. Finally, we are going to give you a sneak peek into what happens in our lives as engineers, which might be interesting for a lot of you, so we'll spend some time on that as well, and then we'll leave you with some of the future items the team is looking into.

This must be a very familiar sight to you. Once you hop onto Netflix and click on your profile, you get a personalized view of your own home page. And let's say you are sharing your account and you wanted to peek at what the other person is looking into. You would probably have noticed that their home page looks completely different from yours.

Part of the reason is that there is a ton of personalization and ML algorithms involved to make sure that whatever you see is something you're going to click and watch. So it is highly curated, highly personalized for a particular profile of a particular member.

Now, what you're seeing here is a fancy call graph. Essentially this call graph corresponds to all the microservices that get invoked. I wish I could zoom in, but let me elaborate on what these are: the leaf nodes you see are either database or caching calls. There are roughly about 100 calls in there, and 70% of them are actually served from cache.

So today, the focus is going to be on the cache, which is a microservice at Netflix. What does EVCache stand for? It stands for Ephemeral Volatile Cache. Volatile is a bit of a misnomer (I know the name was coined several years ago), because the cache is not really volatile: it is indeed backed by SSD as well.

As I mentioned, there is a link towards the end of the talk which does a deep dive on EVCache, but at a high level, it's a distributed, sharded key-value store. It is based off of memcached. It has auto-discovery and global replication baked into the client itself, and we are going to spend a whole lot of time on global replication. It is built to withstand failures; when I refer to failures, I'm talking about instances going down, availability zones going down, or your region itself going down. Also, the client is topology aware. When we say EVCache, I want you to think of it not just as a single microservice, but as a combination of at least five different microservices, each purpose built for a specific operation. For example, the global replication service is purpose built for replicating the data.
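To make the key-value semantics concrete, here is a minimal, hypothetical sketch of the kind of interface such a cache client exposes. The names below are illustrative only and are not the actual EVCache API.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical, simplified view of a sharded, TTL-based cache client.
// Names are illustrative; the real EVCache client API differs.
public interface CacheClient {

    // Write a value with a time-to-live; in-region and cross-region
    // replication are handled behind the scenes.
    void set(String key, byte[] value, Duration ttl);

    // Read a value; an empty Optional represents a cache miss.
    Optional<byte[]> get(String key);

    // Extend the TTL of an existing entry without rewriting the value.
    void touch(String key, Duration ttl);

    // Invalidate an entry.
    void delete(String key);
}
```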

The EVCache client library is at the heart of all of this. We use the client library in a modular fashion, so you can build five or six microservices on top of the same underlying library. The server side of the cluster is also linearly scalable. When I say scalable, what I'm really referring to is that at Netflix we also see sudden spikes in traffic, especially since Christmas is just a few weeks away.

Christmas and New Year are very popular times when families get together. They want to watch their favorite Christmas movie, and we do expect a surge in traffic, so we have to scale out our backend stateful services on almost a day's notice. Scaling preemptively for a long period of time would cost us a lot of money, so we need to be able to scale up with just enough headroom that we are not spending too much on cloud cost, while at the same time not running the risk of taking the service down.

This is a typical deployment of the caching solution itself. We maintain three copies per region, and we do that for availability reasons. We also have applications deployed in three different availability zones within a region. Reads always go to the local availability zone first, and if there is a cache miss, they get rerouted to the other zones.
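As a rough illustration of that zone-aware read path, here is a sketch that reuses the hypothetical CacheClient interface from above; the per-zone client map and fallback order are assumptions, and the real client is considerably more sophisticated.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of a zone-aware read: try the copy in the local availability zone
// first, then fall back to the copies in the other zones on a cache miss.
public class ZoneAwareReader {
    private final String localZone;
    private final Map<String, CacheClient> clientsByZone; // one client per AZ copy

    public ZoneAwareReader(String localZone, Map<String, CacheClient> clientsByZone) {
        this.localZone = localZone;
        this.clientsByZone = clientsByZone;
    }

    public Optional<byte[]> get(String key) {
        Optional<byte[]> local = clientsByZone.get(localZone).get(key);
        if (local.isPresent()) {
            return local; // served from the local zone copy
        }
        // Cache miss locally: try the replicas in the other zones.
        for (Map.Entry<String, CacheClient> entry : clientsByZone.entrySet()) {
            if (entry.getKey().equals(localZone)) {
                continue;
            }
            Optional<byte[]> remote = entry.getValue().get(key);
            if (remote.isPresent()) {
                return remote;
            }
        }
        return Optional.empty(); // a miss in all three copies
    }
}
```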

And this is the scale of the service we are talking about. It is deployed in four AWS regions, spread across 200 clusters. When I refer to clusters, think of them as a set of instances supporting a set of use cases or applications within Netflix's ecosystem. We maintain that level of isolation so that we don't run into noisy-neighbor issues.

And these 200-odd clusters are hosted on about 22,000 instances. We do close to 30 million replication events per second and 400 million operations per second within the region. We store close to two trillion items in cache, with 14.3 petabytes of data.

With that, I'm going to let Shriram talk about global replication.

Thanks for the intro. I'm Shriram Rangarajan. To be honest, this is my first conference, so I'm super excited to be here, and thank you so much to the AWS team for giving us this opportunity.

Let's get started. In the previous slide, we saw the EVCache deployment in one region. Now let's expand that picture a bit to the EVCache deployment in four different regions: us-east-1, us-east-2, us-west-1 and us-west-2.

If you take a closer look at the slide, replication actually happens in two places: within the region and across regions. Within the region is a fairly simple problem, so it's handled by the EVCache client itself. Across regions, it is handled by a stateless service called the Replication Service. For the next 15 minutes, we'll be talking about this Replication Service. But let's take a step back before getting into its detailed architecture and design.

Many of you are probably already wondering, "Hey, it's cache data. Why do we even need to replicate it? Why not just fetch the data from the DB and fill in the cache?" If you're wondering that, those are perfectly normal, valid questions. But hold on to your thoughts; we'll talk about it.

This is exactly the same slide we saw previously: the personalized homepage rendered by different machine learning algorithms, with 100 different microservices being called. And from these 100 microservices, almost 70% of the data actually comes from cache.

Let's break that down into two important things: one, the personalized home page, and the other, the microservices. Let's talk about the microservices first.

A very happy user comes to Netflix and logs in, and say 100 different microservices are called. Then a region failover happens. What is a region failover? The traffic from one region is evacuated to another region. At Netflix, region failovers are a very common practice.

For simplicity and for the sake of discussion, say the traffic from us-east-1 is rerouted to us-east-2, and say replication is not happening. Since this user's data is not available in us-east-2, it is going to result in a cache miss.

So what is the application team going to do? They're going to fetch the data from the DB and fill in the cache. Perfectly normal. A cache is going to have misses; that's a feature of a cache.

But consider the scale here. Currently, we have roughly around 200 million users. Say 50 million users get served per region. If those 50 million users all hit the DB, it's going to put too much pressure on the DB due to the thundering herd problem.

How can we solve this problem? Super simple: throw money at the problem. Just go pre-scale the database so we have enough headroom. But what's the problem here? For every region failover exercise, we would need to go scale the databases. Just consider the amount of coordination required and the cost incurred by pre-scaling like that.

So what is the end result? It's going to frustrate the user. As we mentioned before, the goal of Netflix is to spread joy, and here the user is unhappy.

Now let's consider the happy scenario: cache replication has worked its magic, so the data is already available in the cache. From the user's perspective there is not going to be any interruption, because the data is readily available and business runs as usual.

So latency is an extremely critical factor in building this replication service. Now to the personalized homepage. The personalized homepage is powered by ML, and it's a very, very common practice for some of the machine learning teams to use caching as a backend store.

Yes, you heard that right: caching as the backend store, for two reasons. One, it's totally fine to have a few cache misses, because they can go and recompute the data for that user from what's stored in S3 or Cassandra. The other reason is that if they wanted to store the same data in a database, it would be extremely expensive for them.

So for some machine learning algorithms and teams, caching is the persistence store.

Now let's consider the same scenario: a happy user comes to Netflix, a region failover happens, and there's a cache miss. What is the team going to do? They're going to go and recompute the data and fill in the cache. Again, a cache is going to have hits and misses; that's totally fine.

But consider that at the scale of 50 million users, they have to go and recompute the data. These algorithms are extremely CPU intensive, so it's going to cost an extremely high amount of money. Who's going to be unhappy? Netflix.

So cost is another factor in why we wanted this replication service to be built. Those are the two factors from the technical side: cost and latency.

Now let's consider what replication is. If you take two steps back, replication is nothing but a data movement problem.

We have roughly around 200 EVCache clusters, which corresponds to roughly 200 use cases. As a platform team, we felt that if we own the data movement problem and the application teams focus on the business logic, it becomes a shared responsibility to make the end-user experience better.

So that is another reason for building this replication service.

What are the design goals of the replication service? Latency: as I mentioned previously, latency is a very critical factor for this service. But we don't want to make false promises. We cannot provide sub-millisecond latency, because every replication involves a cross-region call. At the same time, we don't want to provide an SLA in minutes: it's a cache, and every entry has a TTL associated with it, so there would be no point in even replicating the data by then.

So currently we provide an SLA of two seconds, and it's tunable. As mentioned previously, there are roughly around 200 EVCache clusters; let's assume these correspond to 200 use cases. We don't want to deploy a different replication service for each use case, because then it becomes extremely hard to maintain the system.

So we wanted one replication service that can behave differently just by changing configuration parameters. We have hundreds of configuration parameters in place, but one example is consistency. Say one app team says, "Hey, I want the data to be replicated across the regions," whereas another app team says, "Hey, I don't want the data to be replicated, but I want the data to be invalidated across the regions," just for consistency purposes.

We have a configuration property in place for that, called invalidate-for-set. If invalidate-for-set is set to true, we replicate sets as deletes, which makes sure that if the data is written in one region, we delete it in the other regions, just for consistency purposes. Whereas if it's set to false, we make sure that sets are replicated as sets.
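As a rough sketch of what applying such a flag in the replication path could look like (the property name, event shape, and types here are illustrative assumptions, not the actual implementation):

```java
// Illustrative only: converts a SET mutation into a DELETE for the remote
// regions when the (hypothetical) invalidate-for-set flag is enabled for a cluster.
public class MutationTransformer {

    enum Operation { SET, DELETE, TOUCH }

    record ReplicationEvent(String key, Operation operation, long ttlSeconds) { }

    private final boolean invalidateForSet; // per-cluster configuration property

    public MutationTransformer(boolean invalidateForSet) {
        this.invalidateForSet = invalidateForSet;
    }

    public ReplicationEvent transform(ReplicationEvent event) {
        if (invalidateForSet && event.operation() == Operation.SET) {
            // Replicate the SET as a DELETE so the other regions drop the entry
            // instead of receiving a possibly out-of-order value.
            return new ReplicationEvent(event.key(), Operation.DELETE, event.ttlSeconds());
        }
        return event; // otherwise replicate the mutation as-is
    }
}
```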

Extensibility: again, replication is a data movement problem. If we keep the same design and architecture, we can achieve different use cases. In a later section, we are going to talk about how, keeping the same design and architecture (with slight changes, for sure) and changing some of the Kafka configuration, we can achieve different use cases.

Caching is one example, where we replicate these cache shards. Other examples are the write-ahead log and delayed queue use cases, but this is just a preview; more to come soon.

What are the design requirements? As mentioned previously, we want the data to be available in the other regions, so high availability is always the P0 for us. And as many of you are already aware, all of these replications happen in parallel, so we don't have any notion of last write wins. We cannot offer strong consistency; currently, we offer best-effort consistency.

What is best-effort consistency here? As discussed previously, if the app team says, "Hey, I want some sort of consistency, just give me something," we can set the invalidate-for-set property, which makes sure that if the data is written in one region, we invalidate it in the other regions.

So high availability is P0 and consistency is P1. Currently, we offer best-effort consistency.

What are the design non-requirements? As mentioned previously, we don't offer strong consistency. What that means is that we wanted to simplify the system design: we don't consider global locking, partial rollbacks or transactional updates.

As many of you are already aware, all three of these are related to strong consistency.

Now, let's start understanding the workflow. Let's see how the data flows from one region to the other, in this case from us-east-1 to us-east-2.

What are the important components in this slide? Let's start with the EVCache client. The EVCache client is a client library embedded in the app, and it's responsible for two important functions: it sends the mutation to the local EVCache service, and it initiates the replication by sending an event to Kafka.

The Replication Service is broken into two services: the reader service and the writer service. What does the reader service do? The reader pulls the messages from Kafka, does some business logic or transformations, and sends a cross-region call to the writer.

The writer then goes and writes to the destination EVCache service.

Let's walk through the workflow. The EVCache client sends a mutation to the EVCache service. What is the mutation here? It can be an add, a set, a touch, an increment or a delete.

In parallel, it also sends metadata to Kafka. What is the metadata here? As we mentioned previously, the cache is a key-value store, and everything other than the value is metadata: the key, the TTL, the compression flags, all the important related information.

Why don't we send the value to Kafka? There are two reasons. One, we don't want to put too much pressure on Kafka. We have seen use cases where the data can be 4 KB, 4 MB or even 40 MB, and sending 40 MB payloads to Kafka is going to be too much pressure, so that is something we wanted to avoid.

Two, if two mutations happen within milliseconds, say T1 and T2, and T1's data has already been replaced by T2's data, then we don't even want to replicate the T1 data. That is another reason why we don't send the value.
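Here is a sketch of what publishing such a metadata-only event could look like with a standard Kafka producer. The topic naming convention, JSON shape, and field names are illustrative assumptions, not the actual event format.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: publish a metadata-only replication event (no value) to Kafka.
public class ReplicationEventPublisher {
    private final KafkaProducer<String, String> producer;

    public ReplicationEventPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "1"); // availability over durability for cache data
        this.producer = new KafkaProducer<>(props);
    }

    public void publish(String cacheCluster, String key, String operation, long ttlSeconds) {
        // One topic per backing cache cluster (assumed naming); keying by the cache
        // key keeps per-key ordering within a partition. Only metadata is sent,
        // never the value itself.
        String metadataJson = String.format(
            "{\"key\":\"%s\",\"op\":\"%s\",\"ttl\":%d,\"ts\":%d}",
            key, operation, ttlSeconds, System.currentTimeMillis());
        producer.send(new ProducerRecord<>("replication_" + cacheCluster, key, metadataJson));
    }
}
```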

Now, the replication reader polls the messages from Kafka and reads the most recent value from the local EVCache service. It then does some transformation and sends a cross-region call to the writer.

The writer has an EVCache client embedded in it, and it sends the write to the destination EVCache service. Now, when the app team goes and reads that key, the data is readily available and gets served.
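Putting the reader's side of that flow together, here is a simplified sketch: poll the metadata events, look up the latest value in the local cache, and forward the full payload to the writer in the destination region. The HTTP endpoint, headers, and the CacheClient type (sketched earlier) are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch of the reader loop: poll metadata events, look up the latest value in
// the local cache, and forward the full payload to the remote-region writer.
public class ReplicationReader {
    private final KafkaConsumer<String, String> consumer;
    private final CacheClient localCache;   // hypothetical client, sketched earlier
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI remoteWriter;         // writer endpoint in the destination region

    // kafkaProps is expected to carry bootstrap servers, group.id and string deserializers.
    public ReplicationReader(Properties kafkaProps, CacheClient localCache, URI remoteWriter) {
        this.consumer = new KafkaConsumer<>(kafkaProps);
        this.localCache = localCache;
        this.remoteWriter = remoteWriter;
    }

    public void run(String topic) throws Exception {
        consumer.subscribe(List.of(topic));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                String key = record.key();
                // Read the most recent value from the local region; this also skips
                // work if the entry has already expired or been replaced.
                Optional<byte[]> value = localCache.get(key);
                if (value.isEmpty()) {
                    continue; // nothing to replicate
                }
                HttpRequest request = HttpRequest.newBuilder(remoteWriter)
                        .header("x-cache-key", key)
                        .POST(HttpRequest.BodyPublishers.ofByteArray(value.get()))
                        .build();
                // On failure, the event would be parked in SQS for retries (see below).
                http.send(request, HttpResponse.BodyHandlers.discarding());
            }
        }
    }
}
```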

We spoke about the EVCache client, the servers, Kafka, and the readers and writers, but we haven't said anything about SQS yet. Yes, we use SQS as a retry queue.

Failures can happen in the read path, the cross-region call path or the write path. If any of these failures happen, the replication reader sends the event to SQS, and we have a similar reader and writer pair for SQS, which pulls the messages from SQS, sends them to the SQS writer and then on to the destination.
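A minimal sketch of parking a failed event in an SQS retry queue with the AWS SDK for Java v2; the queue URL, retry limit, and backoff values are illustrative assumptions.

```java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

// Sketch: park a failed replication event in an SQS retry queue.
public class RetryQueue {
    private final SqsClient sqs = SqsClient.create();
    private final String queueUrl;

    public RetryQueue(String queueUrl) {
        this.queueUrl = queueUrl;
    }

    public void parkForRetry(String metadataJson, int attempt) {
        // Give up after a finite number of retries; a lost cache mutation is tolerable.
        if (attempt >= 3) {
            return;
        }
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl)
                .messageBody(metadataJson)
                .delaySeconds(30 * (attempt + 1)) // simple backoff between attempts
                .build());
    }
}
```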

Now let's take a closer look at the detailed design of the readers and writers.

We started off with a picture of the deployment within a single region, then we saw the deployment across four regions with the replication service as just a black box. Now we are zooming in on the interactions and the data flow between two regions.

Now let's try to understand the interactions between these components at a much closer look, right?

On the left-hand side, what you're seeing is a Kafka cluster. If you recall from the earlier slides, we said we have 200-odd clusters, and to enable global replication we could have gone the route of creating a dedicated Kafka cluster for each of these caching clusters, or we could have a single Kafka cluster that handles the traffic from all of them.

We went with the approach of managing just a single Kafka cluster; otherwise it would be a management overhead for the engineers who look after this service.

The way we designed this is that each topic corresponds to a backend cluster, and the number of partitions is a factor of the RPS that each of these clusters receives.

So imagine a cluster receiving just 100 RPS: you don't need a lot of partitions for that topic. Whereas for a cluster receiving a million operations per second, you definitely need a lot more concurrency and parallelism so that your Kafka consumers can keep up.

The way we designed this, each reader cluster is, in Kafka terminology, nothing but a consumer group, so they are able to read the same set of mutations from the Kafka cluster.

The reader cluster is purpose built to consume this data from the Kafka cluster. It looks at the payload and applies any needed transformations. For example, you got a mutation but you want to invalidate the data in all the other regions so that the data stays consistent.

All such transformations happen within the reader cluster, and the reader cluster also does a read of the data from the local region itself.

A sharp reader might look at this diagram and ask: why do you need to read the data again? Aren't you adding read load on your EVCache service? That's a fair point.

But in practice, what we have noticed is that reads are extremely cheap operations and they do not add a significant load on the system.

So even though we are not sending the payload via the Kafka cluster, we have found the trade-off to be a very good one in practice.

And finally, we touched upon this: if any of these steps fail, we park all those failed mutations in SQS, and we do a finite number of retries.

We did not implement a dead-letter-queue kind of design, because since it's cache data, we can tolerate some amount of lost mutations. We'll touch upon that in a second as well.

We have seen the interactions between Kafka and the reader; now let's look closely at the path between the reader and the writer.

On the right-hand side, in us-west-2, you see the writer cluster. The writer cluster is currently a REST server.

We are in the process of rewriting it as a gRPC server. All we do is send a REST call with the full payload. I want to emphasize that it is the full payload, and it's drawn in bright red on the slide because it is an expensive call.

Once the writer cluster receives this message, it unwraps it, sees what kind of mutation we are talking about, and then issues the call to the destination cluster through the client library embedded within the writer cluster.

We have made sure that a lot of common code is abstracted into libraries, so that the services we are looking at stay focused on what they are supposed to do and rely significantly on those common libraries. Now, extensibility.

If you look at the overall design, a vast majority of the components are reusable. Think of RocksDB: RocksDB doesn't do any global replication, it's just local storage. But you can apply the same design elements and make your RocksDB globally replicated.

Likewise, for folks who use other off-the-shelf key-value stores: the enterprise editions may support it, but the stock open-source versions often don't have any notion of global replication. So you can definitely use the same architectural paradigm to implement a global replication service. Next, the write-ahead log.

A write-ahead log is something that all databases have baked in.

It's there to ensure the data is consistent, so that you can replay failed mutations. But imagine a situation where you have corrupted the data. Your DR strategy would typically be to take a snapshot and restore to a point in time where you know the data is good.

But what happens to all the mutations that happened after that snapshot? You need to park those mutations somewhere.

We can use a similar architectural paradigm to build a write-ahead log as a service outside of your database. And finally, delayed queues.

The solution and architecture we have described here does three retries before giving up. But for database workloads, you can never lose your mutations; you eventually want those requests to succeed.

So how do you handle that? You need to implement delayed queues. What if your downstream, say your database, is down for an extended period of time?

That is OK, but you still need to park that mutation somewhere and attempt it at a later time. Similarly, a dead-letter queue falls into the same bucket.

We are investing in these based on some of the constructs we have explained, and hopefully, if all of these things work out great, we would love to be back and share the results of these three services, which are going to follow a similar architectural paradigm.

In terms of extensibility, the core way we are able to achieve all of this is through Kafka.

I just want to share a few configurations. There are a few more, but the important ones are these.

In a caching use case, we really care about availability, and for availability we really don't need a quorum of acknowledgements from the Kafka brokers. When you are sending the mutations, it is OK if the leader is in an unclean state or if the replication factor is not three.

We don't need to have a quorum of replicas; we can leave it at just one. But for the write-ahead log, we really care about the durability of our mutations.

So we want the Kafka configurations to be a lot more stringent.

That would definitely add additional latency, but you have to compromise on something to achieve high durability.
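To illustrate that contrast, here is a sketch of two Kafka producer configurations, one tuned for cache replication and one for a durability-focused write-ahead-log use case. The specific values are illustrative, not Netflix's actual settings.

```java
import java.util.Properties;

// Sketch of how the same pipeline can be tuned per use case purely through
// Kafka configuration. Values are illustrative examples.
public class KafkaTuning {

    // Cache replication: favor availability and throughput over durability.
    static Properties cacheReplicationProducer() {
        Properties p = new Properties();
        p.put("acks", "1");                 // don't wait for all replicas
        p.put("linger.ms", "5");            // small batching window
        p.put("compression.type", "zstd");  // cheaper cross-broker traffic
        return p;
        // Topic-side: replication factor can stay low; unclean leader election
        // is tolerable because a lost cache mutation is not catastrophic.
    }

    // Write-ahead-log use case: favor durability, accept extra latency.
    static Properties writeAheadLogProducer() {
        Properties p = new Properties();
        p.put("acks", "all");                // wait for the in-sync replicas
        p.put("enable.idempotence", "true"); // avoid duplicate appends on retry
        p.put("retries", Integer.toString(Integer.MAX_VALUE));
        return p;
        // Topic-side: replication factor 3, min.insync.replicas=2,
        // unclean.leader.election.enable=false.
    }
}
```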

Again, these are just some of the extensibility items the team is looking into, and we hope to get to all of them in the coming year. What are some of the design advantages?

In the way we have modeled this, every component has a specific purpose. If Kafka is bottlenecked on something, we can scale up the Kafka cluster independently; if the readers are bottlenecked on something, we can scale them up, and similarly with the writers.

All of these are very much purpose built, and they can be scaled up or down independently.

We also have predictable end-to-end latency, because we benchmark and track the amount of time spent in each hop: from the time the mutation is sent from the client to Kafka, from Kafka to the reader, from the reader to the writer, and from the writer to the destination cluster.

So there are several measurable portions to it.

Tunable: I think we already covered that, so I won't spend a lot of time on it. Cost efficient: since we are using a single Kafka cluster for all these mutations, we are not exploding the number of Kafka clusters.

So it is not a management overhead. There is a section dedicated to the cost efficiency wins we had, and I'll do a deeper dive on that in later slides. Back pressure: because we have so many moving parts, something can fail at any time.

We don't want one system or one component going down to bring down the entire system. So we have back pressure and throttling mechanisms baked in, so that we degrade performance gracefully but don't run the risk of taking the system down completely.

What are some of the design pitfalls? Because the system we are looking at is a stateless service, we can rely significantly on auto scaling policies.

But auto scaling policies can be a double-edged sword. For folks who have maintained Kafka clusters: if there is constant churn in your consumer groups, you will see the Kafka cluster always in a state of rebalancing partitions across the consumer groups, and that contributes to increased or elevated latencies.

So if you're going with this design, I would definitely watch out for setting the scaling policies on the right parameters.

You don't want to be in a constant churn of scaling up and scaling down; you want to aim for a smooth curve. Next, message duplication.

When we do global replication, you can rely on either a client-driven architecture or a server-driven architecture. The architecture we went with is client-driven.

Part of the reason is that the client is also doing the in-region replication, which means every mutation already goes to three different copies within a region, and otherwise we would need to pick a primary to replicate the data to the other regions.

We didn't want to introduce that additional level of complexity on the server side, so we decided to just have the client library do the additional job of sending the mutation to Kafka.

But the design pitfall I wanted to call out is that whenever there are library changes, you sometimes need to hand-hold your applications.

We had an issue when we were migrating all of our Java apps (Netflix is a Java house) from Guice to Spring Boot.

Due to the library changes, we noticed a pattern where we were replicating the same mutation multiple times. We have safety guards in place now, but I just want to call out that with a client-driven architecture, that is a risk you can run into.

Cascading failures are again a side effect of not applying throttling or back pressure.

I just wanted to make sure I call out observability. Observability is the eyes and ears into what is happening within a service.

As an engineer, I'm very passionate about observability, and Shriram is going to talk us through that.

Thank you. Observability and alerts: I think this is one of the favorite topics of all the on-call engineers here. Without observability and alerts, just consider the life of an engineer.

We wouldn't know what the root cause of a problem is, or how we could even quickly figure out the root cause of a problem.

So we have alerts and metrics at different places: alerts at the SQS level, alerts on the reader, alerts on the writer.

But due to the shortage of time (we have about 24 minutes left), we want to emphasize two important alerts: the event failure alert and the latency alert.

We also want to emphasize how auto scaling is extremely crucial for maintaining a distributed service like this.

And, as we saw, double replication is one of the design pitfalls, so we'll also get into the details of how we handle event duplication.

Let's start with a very, very happy path.

The graph you see on the left is events processed. What are events processed? The number of events read from Kafka.

We do roughly around 31 million replication events per second on average, and at peak we do roughly around 40 million replication events per second.

On the right is the events failed graph.

This is a very, very happy path: we see a maximum of about 40 failures, and this is the data for the last two weeks. As we saw before, all the failed events are retried via SQS, so here we retry all of those failures and end up with 0% failure. This is the best time to be on call; I want to be on call during times like this. I think I was on call during this period as well, so I was lucky.

But we are dealing with distributed systems here, so issues can happen at any time: it can be due to a bug, it can be due to a system failure. So let's look at more interesting things.

Let's start with the scenario where an on-call engineer gets alerted at 12:30 or 1:00 a.m. with a cross-region replication event failure. How have we configured this alert? For example, if in the last 10 minutes, five of those minutes each have more than 1,000 failures per minute, then we immediately alert the on-call. Why do we need to alert the on-call? Because high availability is the P0 for us, and we want to be very sure the data is going to be available in the other regions. So it's 12:31 a.m.
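As a small sketch of that alert rule (the thresholds mirror the numbers in the talk, while the implementation itself is illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the alert rule described above: look at the last 10 one-minute
// buckets and page if at least 5 of them exceed 1,000 failures per minute.
public class EventFailureAlert {
    private static final int WINDOW_MINUTES = 10;
    private static final int BREACHED_MINUTES_TO_PAGE = 5;
    private static final long FAILURES_PER_MINUTE_THRESHOLD = 1_000;

    private final Deque<Long> perMinuteFailures = new ArrayDeque<>();

    // Called once per minute with that minute's failure count.
    public boolean recordMinute(long failures) {
        perMinuteFailures.addLast(failures);
        if (perMinuteFailures.size() > WINDOW_MINUTES) {
            perMinuteFailures.removeFirst();
        }
        long breachedMinutes = perMinuteFailures.stream()
                .filter(count -> count > FAILURES_PER_MINUTE_THRESHOLD)
                .count();
        return breachedMinutes >= BREACHED_MINUTES_TO_PAGE; // true => page the on-call
    }
}
```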

What does a normal on-call engineer do? Watch Netflix? What we actually do first, when we see a cross-region replication event failure, is immediately look into the retry path. If retries are happening, then at least it's not a P0, because we know the data is going to be available in the other region. The latency will be on the higher end because the events are being retried, so it won't be within two seconds, but at least the data will be available in the other region within seconds.

What is a bigger failure? If even the retries fail, that means the data is not going to be available in the other region. That is a P0, and the on-call immediately digs into what the actual issue is. This can happen for two reasons. One, it can be due to a bug. A bug is a fairly simple problem: we roll back the build to a more stable state, which at least resolves the problem, and during normal business hours we go and understand the root cause. Two, system failures: it can be a memory alert, it can be a CPU alert. In that case, we throw more hardware at the problem to at least get the errors down to zero, and then try to understand what the actual problem is.

The next alert is the latency alert. Under normal operation, this is the p99 latency: on average we see roughly around 2.0 seconds, and at max around 2.14 seconds; these are the SET latencies. Now consider a latency alert firing. The on-call engineer immediately looks at the latency, and the interesting thing is that it's only for one topic. The replication service is common across all the EVCache clusters, but for this one topic the latency is roughly around 500 seconds. This is one of the more interesting alerts, because it can be due to a bug, and it can happen on the reader, the writer, Kafka, SQS or the network, in all sorts of combinations and permutations. It can also be a cascading failure.

We have a separate section where we walk through how we debug a latency issue, how we resolve it, and how we deploy the fix to production; this is just a preview of it.

Auto scaling. As mentioned before, the replication service does roughly around 30 million replication events per second on average. As the EVCache platform team, we don't want to have to initiate a conversation with every individual app team for every special occasion just so we know to pre-scale our service. What can those special occasions be? Live events: as many of you already know, Netflix has started live events, and if the number of concurrent viewers goes up, the number of replication events also goes up. The holiday season, when peak viewership goes up: for example, when Squid Game was released, peak viewership went up and the number of replication events went up with it. A new service: it's extremely hard to predict the exact traffic of a new service that wants replication; we can get a ballpark number, but we cannot pre-scale the EVCache service on a ballpark number alone. Migrations and testing are a few more such occasions. The main point here is that we don't want to hand-hold the replication service; we want the scaling to happen automatically.

Here's an example: on October 31st, a new topic got added, and on the right you see the number of instances also increased. If you look at the pattern, the curve of the number of events and the curve of the number of instances match exactly, which means the scaling was working perfectly. But for auto scaling to work perfectly, we need to have the right scaling policies as well.

So what are the scaling policies we have for the replication service?

CPU: we want to be sure we get maximum utilization out of the resources, so currently we target a max of 70%, with all our CPUs working at 70%. If the average goes beyond 70%, we go and scale out the system.

Network: this is a very important metric, because as a platform team we don't own the actual data. The actual data, as I mentioned before, can be 4 KB, 4 MB or 40 MB, and if a service suddenly says, "Hey, I've introduced a new feature, let me increase the payload size from 4 KB to 40 MB," then the scaling should happen automatically. So currently, if the network goes beyond 200 MB, we auto scale the system.

Queue size: this is another important metric. The first two (CPU and network) are system metrics, whereas the queue size is an internal metric. As we saw in the design, the readers are responsible for reading the different partitions, and the processing of each partition happens in a separate queue; this is the size of that queue. So where does the 700 number come from? It's a magic number, to be honest; it's based on experience. When we tried a very low number, auto scaling happened at a very rapid pace, which meant we were throwing more money at the problem without effectively utilizing the resources. Whereas with a higher number, some of the nodes went into a very unhealthy state because of memory issues. So 700 is based on experience and on the current production traffic pattern.
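For the queue-size policy in particular, one way to let a scaling policy act on an internal metric is to publish it as a custom metric, roughly like the following sketch; the namespace, metric name, and dimension are assumptions, not the actual setup.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

// Sketch: publish the internal per-partition queue size as a custom metric so
// that a scaling policy (e.g. "scale out above 700") can act on it.
public class QueueSizeMetricPublisher {
    private final CloudWatchClient cloudWatch = CloudWatchClient.create();

    public void publish(String readerCluster, double currentQueueSize) {
        MetricDatum datum = MetricDatum.builder()
                .metricName("ReplicationQueueSize")
                .unit(StandardUnit.COUNT)
                .value(currentQueueSize)
                .dimensions(Dimension.builder()
                        .name("ReaderCluster")
                        .value(readerCluster)
                        .build())
                .build();
        cloudWatch.putMetricData(PutMetricDataRequest.builder()
                .namespace("EVCacheReplication")
                .metricData(datum)
                .build());
    }
}
```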

Auto scaling works well as long as resources are being utilized as fully as possible; as I mentioned, we target a max CPU utilization of 70%. Currently, we have roughly around 2,000 instances per region, which includes both readers and writers. So consider this: when two nodes go down, do we need to alert the on-call? No. Otherwise the on-call would have to be alert almost 24/7 for the entire one-week rotation. So when an alert fires because one or two nodes went down, we have auto-remediation in place: we go and terminate the node. If it's a very small percentage, it's OK to just terminate the node and see if that resolves the issue.

Now comes the disk. Let's consider another scenario: currently we use a max of 40% disk, but an alert comes in and a whole group of nodes goes down. In that case we don't want to go and terminate the nodes; instead, we alert the on-call. So the general theme is: if a couple of nodes or a few nodes go down, we auto-terminate those instances, but if a group of nodes goes down, we alert the on-call.

Double replication. This is one of the design pitfalls of the current design, since the client initiates the entire replication. In this graph, we see around 1.7 million duplicate events from the same application. As Prithviraj mentioned, this can come from a migration effort: around two years ago, we started the initiative of migrating most of our services from Guice to Spring. The Guice apps use one Kafka library and Spring uses the Spring Kafka library, and during this phase of the migration, if the client forgets to stop the old replication path, double replication happens. What is the problem with double replication? The data is still going to be available, so there's no availability problem at all. But every replication event results in a cross-region call, and that increases cost. So this is more of a cost optimization effort than a P0 or a data availability problem. What can be the reasons behind the duplication? It can be a bug, a configuration, or a migration.

How do we even detect double replication? We use a very fancy way: the manual way. The problem is that, for example, those 1.7 million events might be there because the application is actually doing a 50% deployment, in which case these are not duplicate events but actual events. So it's very, very hard to automate this process. We are trying to automate it, but we don't want to generate a lot of false positives either. Currently, we have jobs in place, and when a job flags something, the on-call engineer goes and works out what the issue is, looks at the metrics and the deployment model, and if required initiates a conversation with the client team.

Let's talk about efficiency improvements

Shriram: In the last section, on double replication, we saw cost efficiency outside the replication service. But as we mentioned, this is the year of efficiency, and we made a lot of improvements within the replication service as well. I'll let Prithviraj handle that.

Prithviraj: Thank you, Shriram. So at Netflix, we actually had a hack day where we invited all the engineers to come up with some crazy ideas. We have done a lot of efficiency-related projects, but this section is going to focus just on what we did within the service we're talking about.

The first one is batch compression, and the second one is removing network load balancers. Wait, we did not talk about any network load balancers, right? So we'll see where they come into play.

We've been saying from the beginning that reader-to-writer is an expensive call because it's a cross-region call. And for whoever has looked at their AWS billing reports, the network costs come out as a bit of a sticker shock. That happened with us as well when we started auditing.

One idea we had was this: for all the data going from region A to region B for a particular use case, if you look at the data type (say they're using Protobuf or JSON), a lot of fields are common across that use case. So we thought, how about we batch these messages and then enable compression on top of that?
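A minimal sketch of the batch-then-compress idea, assuming the zstd-jni library; the framing format (a 4-byte length prefix per message) is an illustrative choice, not the actual wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import com.github.luben.zstd.Zstd;

// Sketch of batching payloads and compressing the whole batch once.
public class BatchCompressor {

    // Concatenate the batched payloads with length prefixes, then compress once.
    public static byte[] compressBatch(List<byte[]> payloads, int level) throws IOException {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] payload : payloads) {
            batch.write(intToBytes(payload.length));
            batch.write(payload);
        }
        // Compressing the whole batch lets zstd exploit fields that repeat
        // across messages of the same use case (keys, schema names, etc.).
        return Zstd.compress(batch.toByteArray(), level);
    }

    private static byte[] intToBytes(int value) {
        return new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16),
            (byte) (value >>> 8), (byte) value
        };
    }
}
```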

And the results were quite impressive. We tried out Zstandard, Snappy and gzip (even though we knew we weren't going to use gzip, just as a reference point), and eventually we settled on Zstandard at the standard compression level. The red line you're seeing here is the average size of the uncompressed data, roughly hitting peaks of 100 KB, and the compressed data size was close to 70 KB. So we saw almost a 35% reduction in network bandwidth, which definitely resulted in a lower cloud bill for us.

This is a breakdown of the previous slide at a per-use-case level. The top line you're seeing is one of our highest-RPS clusters, which is called precompute ranks, and the payload size is about 1.2 MB. As I've highlighted at the bottom, after compression it came down to roughly 400 kilobytes. So there are almost 65% compression savings for some of our largest clusters.

Let's talk about network load balancers. Between the reader and the writer, we were using network load balancers; even now, we still have them deployed in our architecture. But when you look at the cost of a cross-region call, you're billed for the EC2 data transfer out and also for the network load balancers. We had network load balancers for two primary reasons.

One of them is that the clusters are deployed in AWS accounts that are not peered, and we had to rely on network load balancers to achieve the cross-region call. The second one was to have predictable latencies and, obviously, load distribution, which is what NLBs are built for.

But once we looked at the cost of this entire setup, we found it to be extremely expensive. So we thought, how about we evaluate a client-side load balancing algorithm and see how things look, and how the latency looks?

So we eliminated the network load balancers and tried going EC2-to-EC2 directly, from the reader to the writer instances. On April 13th of this year, we migrated pretty much all of the workload, roughly 50 gigabytes per second, that was being handled by the NLBs, and the NLB traffic went down to pretty much less than 500 MB.

We still use NLBs as a fallback path, just in case our client-side load balancing fails on us. And for client-side load balancing, we use our internal DNS solution.
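As a rough illustration of client-side load balancing over a DNS-discovered writer fleet (the DNS name and the round-robin strategy here are assumptions, not the actual internal solution):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of simple client-side load balancing: resolve the writer fleet via
// DNS and round-robin across the returned instance IPs, so cross-region calls
// go EC2-to-EC2 instead of through an NLB.
public class WriterSelector {
    private final String writerDnsName; // e.g. an internal DNS record for the writer fleet
    private final AtomicInteger counter = new AtomicInteger();

    public WriterSelector(String writerDnsName) {
        this.writerDnsName = writerDnsName;
    }

    public String nextWriterAddress() throws UnknownHostException {
        InetAddress[] writers = InetAddress.getAllByName(writerDnsName);
        // Round-robin keeps the request distribution across writer nodes roughly even.
        int index = Math.floorMod(counter.getAndIncrement(), writers.length);
        return writers[index].getHostAddress();
    }
}
```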

Now, all of this is great. This resulted in almost a 50% reduction in the network transfer cost. Although you see a 90% drop on the NLB side, which is what we were able to eliminate from the cost perspective, we are still billed for the EC2 data-out costs.

Did it really work? The litmus test for us is to ensure that the end-to-end latencies remain under the two-second SLA we promised to our service teams, and that the load balancing really does its job.

The graph in the top right corner is the distribution of RPS that each of the writer nodes is receiving, and the lines are pretty much hugging each other, which is a fantastic win for us. In the interest of time, I'm going to skim through the rest very fast.

So, the life of an EVCache engineer. To give you a window into what we do, let me talk about a latency page. This one came in at 10:20 a.m., which is not too bad; it's not like we're waking up in the middle of the night.

The usual suspects for a latency page: it could be due to Kafka, and on the Kafka side the usual offenders are that the number of partitions per topic is not high enough, there is significant consumer lag, or the join rate of the nodes in your consumer group (the churn) is very high.

On the reader side, you could be bottlenecked by your thread pools; that was the queue size metric we showed you earlier. And on the writer side, it could be due to a slower backend itself; it could be that your system has become slow because it's been up for so long.

So these are the three places where we look for this issue. Once we got the page, we clicked on it and saw there was a massive spike on October 14th at 10:20 a.m. Why did that happen? Let's skim through some metrics on the Kafka side.

Oh, there is a significant spike in join latency. Why would that happen? Is there significant churn? That seems to be true, because around the same time you see a huge spike in the number of reader and writer nodes being added by the auto scaling policies.

So once you peel the onion layer by layer, you can actually get to the root cause of the issue. In this case, it happened to be one of those situations where one of the availability zones on AWS had a momentary networking glitch, and we could have done better by retrying on a different TCP connection.

But at the time this incident happened, we were not doing that. That led to a bunch of TCP connections getting stuck in the CLOSE_WAIT state as opposed to the ESTABLISHED state, and that in turn led to a spike in TCP errors on both the reader side and the writer side.

Once we went ahead and did a redeploy, things settled down. This is again just a window into one of the incidents. Shriram is going to talk us through one other incident.

Shriram: Thanks, Prithviraj. So what is the life of an engineer without issues? It's super boring. Interesting issues make an engineer's life interesting, and they make us better.

So let's look into our next issue: instances going out of discovery. This is Spinnaker, which we currently use for deployment. The green ones are the healthy nodes, and the blue ones are the unhealthy nodes, meaning they failed the health check.

So they go out of discovery. The gray ones are the ones being auto-scaled, either tearing down or new instances coming up. Currently, you can see that almost 50% of the instances are in an unhealthy state.

What does the on-call engineer do? They immediately SSH into one of those nodes, and you would suspect a heap issue. Looking at the memory utilization of the node, we see that around 12:20 the memory utilization went from about 2 GB to 10 GB. Why could this happen? Two things.

It can be an increase in traffic, or it can be an increase in payload size. By looking closer and taking a heap dump, we saw that, as I mentioned previously, every reader is responsible for reading multiple partitions, and each partition is maintained in its own queue.

These queues were unbounded, and there were roughly around 700K items in a queue. Each item was only a few KBs, but the entire queue was roughly around 1.3 GB, and since there are multiple queues, the instance went out of memory. So what was the impact?

What actually resulted from this issue? Since a reader is responsible for multiple partitions, some topics which didn't even have any change in traffic were getting affected, so you can see a latency spike for multiple topics here.

What is the solution for this problem? Apply back pressure. The problem was pretty complex, but the fix here was to change the data structure from an unbounded queue to a bounded queue.
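A small sketch of that kind of fix: a bounded per-partition queue whose producer side signals back pressure instead of growing without limit. The capacity and timeout values are illustrative assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: replace an unbounded per-partition queue with a bounded one so that
// a slow consumer exerts back pressure instead of exhausting the heap.
public class PartitionWorkQueue {
    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(10_000);

    // Producer side: block briefly, then report back pressure to the caller
    // (who can pause the Kafka consumer) instead of growing without bound.
    public boolean offerWithBackPressure(byte[] event) throws InterruptedException {
        return queue.offer(event, 100, TimeUnit.MILLISECONDS);
    }

    // Consumer side: worker threads drain the queue.
    public byte[] take() throws InterruptedException {
        return queue.take();
    }
}
```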

Great: we saw the problem, we saw the debugging process, and we saw the solution. But how do we make sure the solution works? Earlier, I had the opportunity to work at a startup where we had an amazing, fancy way of testing: testing in production.

What you do there is: if you have an issue, you take the problem, fix the code, deploy it to prod, and if there's still an issue, you go re-triage it; if the fix works, amazing, you move on to the next issue. But we are at Netflix, this is a tier-zero service, and at the scale we are working at, we don't have the luxury of testing in prod.

The current design, however, allows us to do the entire testing with zero code changes. How? Just add a new consumer group. We know from the issue which topics got affected, so we route those topics to the new consumer group, and with all the tunables in place, we reroute that traffic to a test writer and then to a test EVCache cluster.

That way we can test with live data in the live environment without actually affecting production traffic. What does this give us? Confidence. Memory issues, for example, are extremely complex problems: we need the real traffic pattern, but we also need the confidence that our fix works.

So this system design helps us do that. Now, the future. We have our work cut out for the next few years. What are the things we are looking at? Currently, the entire replication service is on IPv4, and we want to migrate it to IPv6.

Both readers and writers are currently running on EC2 instances, and we want to do a container migration. A generic service: as we mentioned before, this is a data movement problem, and we want to address different use cases with this service, so we are working on the write-ahead log and delayed queues. If it all works out, we'll see you again in 2024.

Thank you so much, and thank you for your time. Have a great rest of the day.
