Netflix's success: Combining collaboration, hardware monitoring & AI (sponsored by Intel)

Harshad: Hello, good morning, everyone. Welcome to re:Invent and welcome to our session, AI315s, on Netflix's success stories on Intel. My name is Harshad Sane. I'm a principal engineer at Intel, and you could describe my job as trying to squeeze the maximum performance out of Intel instances on AWS with customers like Netflix. And we have Phil here from Netflix.

Phil: Thanks, Harshad. Good morning, everybody. My name is Phil. I'm a performance engineer at Netflix, and my job is very similar to Harshad's: to make sure that Netflix runs smoothly, reliably, and uses as few resources as possible. Thank you.

Harshad: So, in this talk we are going to cover two parts, hardware monitoring and then AI usages, and we'll go really deep on some of this material. But more importantly, remember that this is a collaboration story, and you'll see that as the story unfolds.

So, speaking about collaboration, this slide gives a timeline of how Intel has been working with Netflix on AWS. There's a lot going on between the companies, but some of the external blogs are something you can look up. We have been working on optimizing performance for several of the services, as you can see. And along that timeline, the last blog there is "Seeing Through Hardware Counters." That is the blog we are going to expand on and get into in the first half of our session.

I think everyone knows Intel as a microprocessor company, but maybe some of you know, and some don't, that it also has at least 15,000 software engineers in the background providing software optimizations and contributing to a lot of software, all the way from base firmware and virtualization layers up to runtimes and applications. In fact, Intel is the largest contributor to the Linux kernel, to OpenJDK, and so forth.

So Phil, what has your experience been with Intel processors and your stack?

Phil: Yeah. So Harshad, first of all, the slide you are showing almost entirely matches the stack we run in AWS. We use Intel Xeon processors, we obviously deploy on Linux, we use OpenJDK, Cassandra, TensorFlow, pretty much all the runtime languages. And I think ultimately it comes down not so much to the number of commits or the lines of code that Intel contributes to those open source projects, but more to the expertise Intel has, not only at the hardware level, but with the whole stack that runs on this hardware.

Harshad: Awesome. So this is going to be a fun question: how many people here have Netflix subscriptions? Show of hands. OK, I know that was obvious. We know Netflix as a media streaming company, we've watched all the content, but let's understand what goes on behind the scenes. How does Netflix operate?

Phil: Yeah. So Harshad, there's a lot of debate, and I think it's not settled yet, whether Netflix is a media company or a technology company. One thing is for sure, regardless of the answer: we use a lot of compute resources, whether it's cloud gaming, machine learning applications, or visualization. We also operate a back end that allows us to provide recommendations, render your home pages, mint licenses, and manage accounts. We need a lot of compute, and we deploy in AWS on Intel Xeon hardware and use all of these offerings to power Netflix.

Harshad: Awesome. So now that we know a few of those things, we are going to go deep into one of the use cases, and for that we are going to look at observability at Netflix and then observability at Intel as well. To start, I'd like to learn about performance and reliability engineering at Netflix and how that operates.

Phil: So, as a performance and reliability engineer, I see observability as the foundation for everything we do. Without observability we basically cannot do our job; we're blind. I see it a little bit as an iceberg: the tip of the iceberg is the changes that we make, and the bottom of the iceberg, the 90% under the water, is all the work we do to make sure we understand the system. Ultimately, strong observability allows us to gain insight into the system, and once we understand its behavior, we can leverage that insight to change the characteristics of the system to our liking. A lot of the time it's a trade-off; sometimes we can get everything with our change.

We roughly define three levels of observability. Infrastructure-level observability gives us insight into overall traffic flow in our ecosystem, overall utilization, cost apportionment, things like that. Netflix's whole stack is built on a microservice architecture; it's not a monolithic app, it's a collection of microservices that communicate with each other. So the service, or tactical, view is very important for us, because it gives us insight within each individual microservice. As you can imagine, we collect a lot of pretty standard telemetry: throughput, saturation, latency percentiles and breakdowns, errors, things like that. There is also a lot of telemetry in the service view that is pertinent to our internal frameworks and that emits business metrics, and we use all of that to monitor the behavior of those services.

Harshad: And with those three levels of observability, you sometimes get the call that says, hey, something's wrong. How do you apply these three levels? Is there a use case you could describe?

Phil: Right. Well, the last level of observability is instance-level observability, where we look at somewhat deeper system metrics, or run ad hoc profiling like perf-based flame graphs, JFR, or async-profiler to gain insight into what is actually going on within that instance. However, we found there is a class of problems that requires an even deeper level of observability: introspection at the CPU level. And here is one of those problems that Harshad and I worked on together.

It started off as a routine migration. We wanted to migrate one of our microservices from M5.4xlarge to M5.12xlarge. I call it routine because we do it every so often to make sure we use the capacity and the reservations we have to the fullest extent; if we have spare capacity, we might want to consolidate onto a different instance type. So this seemed like a straightforward migration: we expected about a 3x improvement in throughput at about the same latency and CPU utilization. When we actually deployed it, we found we were far below our target; we got maybe a 20% improvement in throughput at the same CPU utilization. You can see it on the graph on the left. But that's not all: we also saw a really weird pattern, a bimodal CPU distribution in our fleet, which you can see on the graph on the right. There is a clear band of good nodes at the bottom with lower CPU utilization and lower latency, and at the same throughput a higher band of nodes with higher CPU utilization and higher latency. They have exactly the same workload. That was our starting point for the investigation.

The first pattern, not being able to squeeze out 3x the throughput, was puzzling. But the second, the bimodal distribution of CPU, was even more puzzling to us.

Harshad: So in the bimodal distribution there, it looks like there's a specific pattern. Could it be because you were running in different availability zones, or maybe something like that?

Phil: Ah, yes, good point. No, we checked zone affinity, and as a matter of fact, the graph shown here on the right was captured in a single zone. There is, however, one property of this distribution that we noticed but couldn't explain: no matter how we auto-scale, and you can clearly see the auto-scaling pattern of instances going up and down, we always see between 12 and 13% of the nodes in the bottom, good band. That's interesting. So I just want you to remember this 12% magic number, because it's going to come up later. It's a fun hint.

Harshad: OK, so Phil brought a great problem to the table. Just as Netflix showed their observability stack and has a methodology for finding issues, Intel does the same. You can see the stack of the methodology that Intel follows, and at each level of this methodology you will find tools that have been open sourced or that you can find on intel.com, to use as you go step by step diagnosing things. The diagnosis may apply to something as deep as what was found here, or it could just be: hey, I'm running my service, is there some performance I could extract from it, is there an issue I'm leaving on the table? You can still follow the same methodology.

The first part of the methodology uses health checks, which means checking the health of your system from the hardware point of view and from the software point of view. We have a tool for that called Intel System Health Inspector. It's a simple tool that runs in three to five minutes, whether you're on-prem or in the cloud, and it gives a snapshot of what your ideal system should look like versus what your system actually looks like. Those differences help you diagnose what's happening.

Second is hardware characterization. This is where I'm going to talk about the PMU, or performance monitoring unit, which is a hardware unit inside the CPU that allows us to see how resources are being utilized as applications run on it. We'll give some examples of hardware characterization, and the tools Netflix used for it, right after this.

Once we know where the resources are being constrained, we can take those particular constraints and do software profiling on them, and then find which part of the code, or which parts, are actually affected by the problem. And then we've got other streams like right-sizing instances and optimizing, whether statically or dynamically through Granulate, which you might have heard about in other breakout sessions as well.

Given that, let me give you a quick snapshot of what System Health Inspector looks like. It has a lot more than this; we are trying to fit it on a small page. Essentially, on the left-hand side you'll see the hardware configuration, down to the DIMM population: the system details, everything you can find out about the hardware side of the system. It also optionally does micro-benchmarking, where you can see the latency and memory bandwidth curves generated on the system, what you should expect versus what you're actually seeing. Additionally, it will also show your software versions, what your NICs are doing, what the disks are doing, what processes are running, and based on that it can even generate recommendations, saying, for example: I see you're running a lot of OpenSSL; had you been on an M6i instance, you could make use of our new ISAs and gain 20-30% performance. So it can give a whole recommendation tab of what you might be missing and leaving on the table.

Following that, let's get to hardware characterization. M7i instances: how many people have used M7i instances, or at least tried them? Not surprised, they are really, really new. I'd like to share some information: besides the fact that this is the seventh generation of Intel instances on AWS, they also expose core performance counters.

Remember the PMU I was talking about earlier: it gives us deeper visibility into what the CPU has been doing. And now you can do that on any M7i instance, irrespective of its size. Previously, we could only do this on metal instances, but now you've got core performance counters on all M7i instances, and we are going to showcase how we used them to diagnose this particular issue.

We used PerfSpect, which is an open source tool that we develop; it essentially wraps around what the PMU does and presents that data. So I handed PerfSpect over to Phil and said, hey Phil, can you capture this data on the bad instances and the good instances, and let us see what that looks like.

So Phil, how was your experience with it?

Phil: Yeah, well, first of all, it was pretty easy to run PerfSpect, and this is the summarization of the results. We highlighted the relevant lines here. We see that CPU utilization on a bad node is about 150% over what we see on a good node, which clearly matches our observations and the telemetry we had already captured.

The second highlighted line, CPI, is pretty interesting. CPI is cycles per instruction. Knowing that the workload on these instances is about the same, we see that the CPUs spend about 175% more cycles per instruction to execute the same instructions; we spend more cycles on a bad node. That kind of matches the CPU utilization, but it still doesn't give us a clue where the problem is.
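To make the metric concrete, here is a minimal sketch of how CPI is derived from raw cycle and instruction counts. The counter values below are invented purely to reproduce the roughly 175% gap described above; they are not from the actual capture.

```java
// Minimal sketch: computing CPI (cycles per instruction) from raw counter
// values, the way a PMU-based report presents it. Numbers are made up for
// illustration only.
public class CpiExample {
    public static void main(String[] args) {
        long goodCycles = 1_000_000_000L, goodInstructions = 1_250_000_000L;
        long badCycles  = 2_750_000_000L, badInstructions  = 1_250_000_000L;

        double goodCpi = (double) goodCycles / goodInstructions; // 0.8
        double badCpi  = (double) badCycles  / badInstructions;  // 2.2

        System.out.printf("good CPI = %.2f, bad CPI = %.2f (%.0f%% more cycles per instruction)%n",
                goodCpi, badCpi, (badCpi / goodCpi - 1) * 100);
    }
}
```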

So the third highlighted line is L1 cache bound...

That was pretty puzzling to us, because we couldn't understand why the same workload would give different L1-cache-bound results.

And lastly, the machine clears. I think, Harshad, you had better speak about how to interpret machine clears, because for me it was the first time I encountered that counter.

Harshad: Sure, sure. Before we get to machine clears: like Phil said, it looks like there's a magic number of 3x everywhere, right?

CPI, which is an indicator of performance, has gone up by nearly 3x. Phil is broadly seeing 3x the latency and CPU utilization levels.

L1 bound is puzzling, because it's the exact same CPU; all you have done is move to 3x the number of cores. And L1 basically says: this is my first level of cache, this is the closest in latency I can get to data, but hey, I'm spending 3x the time fetching data from there. Why would that be?

It also shows machine clears going high, and machine clears are generally indicative of memory disambiguation, pipeline stalls or flushes, or even self-modifying code. And there's going to be a curious case about that.

So here is where we are now: we know where the resource is being spent, but we still don't know where in the code this might be happening. So we take these particular counters, L1 bound and machine clears, and feed them into our next tool for software profiling, which was our third step.

In our third step, we use Intel VTune Profiler. It lets us use the same PMU counters we showed on the previous slide, but now we do sampling as opposed to counting. Sampling, as you would know from perf, allows you to capture instruction pointers, which means you can now associate those counters that were showing very high values with the code.

You can see a snapshot where, on the left-hand side, you see the functions that are hot, and on the right-hand side you see CPI, and they have been sorted by machine clears. At the top you see functions with enormously high CPIs of eight and seven.

So what's the next step? Now we know these are some of the top hot functions. VTune lets you double-click on those functions, and now we get even deeper: it shows, in the middle block there, the assembly code, the machine instructions the CPU is going to be executing. We are going to diagnose some of this on the next slide.

So now we are getting to the root cause. If you look at the bottom block there, the block of instructions where the machine was spending the most time L1 bound, which is what we saw, I have highlighted two lines specifically in yellow.

There you can see that a particular register, RDI I believe, is being accessed by one thread at an offset of 20; at the bottom, the same register is being written to, but at an offset of 18.

Now, why is that important? Because the CPU operates on units of data called cache lines, which are 64 bytes. You may be aware of memory being accessed at a page level; similarly, CPUs access data 64 bytes at a time.

Now, if one thread is independently accessing data in one section of those 64 bytes, and another thread is accessing a different piece of data but on that same cache line, the CPUs have to maintain cache coherence, which means: I want to make sure I'm consistent and there are no functional faults.

So it's going to be bouncing that cache line around from core to core, even though the pieces of data being accessed are completely independent. And the reason that happens is natural: you declare a variable, you declare a second variable, and you start computing, right? You don't think that the variables should be set 64 bytes apart. But that is exactly the case that happened here in the JDK, and this is known as false sharing.
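The effect is easy to reproduce outside the JDK. Below is a minimal, self-contained Java sketch, not the Netflix or JDK code, that times two threads hammering two counters that either share a cache line or are padded apart; the 17-slot array and the iteration count are arbitrary choices for illustration, and a serious measurement would use a harness like JMH.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Two threads each update their own counter. When the counters sit 8 bytes
// apart they almost certainly share a 64-byte cache line (false sharing);
// when they sit 128 bytes apart they cannot share one.
public class FalseSharingDemo {
    static final long ITERATIONS = 50_000_000L;

    static long runMillis(int indexA, int indexB) throws InterruptedException {
        AtomicLongArray counters = new AtomicLongArray(17); // 17 longs = 136 bytes, contiguous
        Runnable workA = () -> { for (long i = 0; i < ITERATIONS; i++) counters.incrementAndGet(indexA); };
        Runnable workB = () -> { for (long i = 0; i < ITERATIONS; i++) counters.incrementAndGet(indexB); };
        Thread a = new Thread(workA), b = new Thread(workB);
        long start = System.nanoTime();
        a.start(); b.start(); a.join(); b.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // Slots 0 and 1 are 8 bytes apart: same cache line, so the line ping-pongs between cores.
        System.out.println("adjacent slots (false sharing):  " + runMillis(0, 1) + " ms");
        // Slots 0 and 16 are 128 bytes apart: different cache lines, no ping-pong.
        System.out.println("padded slots (no false sharing): " + runMillis(0, 16) + " ms");
    }
}
```

On a multi-core machine the padded run typically finishes several times faster; the same mechanism, scaled up to 48 cores, is what the fleet was paying for.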

Now, the reason it shows up so prominently here is that the false sharing was always present, even when this was running on the smaller 4xlarge machine. It's just that instead of a couple of CPUs bouncing those cache lines, imagine 48 CPUs bouncing them around, and imagine the latency perturbations you're going to see, even if you're running out of your L1.

So now that we know this was false sharing: Phil, can you explain how this particular piece of code, which is not even in your application but in the JDK, was affecting the application?

Phil: Yeah, Harshad. First of all, I have to admit that as a performance engineer I knew about false sharing; you read about it in textbooks and it's well described. However, I didn't expect to see it in the JDK, or in code that runs on the JVM. So that was a pretty fascinating find.

Another thing is that VTune did a really great job of highlighting this problematic piece of code. The ability to break down the counters by line of code and then drill down into assembly is very, very powerful, and none of the other profilers we use gives you the same ability.

So then, back to the actual cause. Just to explain a little of what's going on here, for those of you who are familiar with Java: the JVM has something called the secondary superclass cache.

A secondary superclass is essentially an interface. Java doesn't support multiple inheritance, only single inheritance, but you can implement multiple interfaces, and those become your secondary superclasses.

Every time you do an explicit cast, an instanceof, or even call a method of an interface, the JVM does a type check. Generally we assume those type checks are very, very fast, and that is generally true, because the JVM has machinery, which we usually don't see since it lives inside the JVM itself, to cache the secondary superclasses.

Now, a certain class hierarchy combined with a certain usage pattern can expose the slow code path within that secondary superclass lookup.

So what was happening? Think about our hierarchy: let's say we have a video object, and this video could be an episode, a season, a trailer, or a movie. There are multiple things we can operate with.

Now, those are all interfaces that our video object implements. By repeatedly type-checking it against multiple different interfaces, we're essentially blowing up the cache: the JVM keeps writing the last found secondary superclass into it. That's where this false sharing comes from.
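For illustration, here is a hedged sketch of that access pattern in plain Java. The interface and class names are invented, and it is meant to show the shape of the workload, not to serve as a benchmark of the JVM's secondary-super cache.

```java
// A class implementing several interfaces ("secondary super types"), plus hot
// code that keeps type-checking a mix of such objects against *different*
// interfaces from many threads. Each check that misses the single-element
// secondary-super cache rewrites it, so the cached slot is written constantly
// while other threads are reading nearby data on the same cache line.
public class TypePollutionSketch {
    interface Watchable {}
    interface Downloadable {}
    interface Searchable {}

    static final class Movie   implements Watchable, Downloadable, Searchable {}
    static final class Episode implements Watchable, Downloadable, Searchable {}
    static final class Trailer implements Watchable, Downloadable, Searchable {}

    public static void main(String[] args) throws InterruptedException {
        Object[] assets = { new Movie(), new Episode(), new Trailer() };
        Runnable hotLoop = () -> {
            long hits = 0;
            for (int i = 0; i < 100_000_000; i++) {
                Object v = assets[i % assets.length];
                // Alternating interface checks keep evicting each other from the
                // one-element cache instead of hitting it every time.
                if (v instanceof Watchable) hits++;
                if (v instanceof Downloadable) hits++;
                if (v instanceof Searchable) hits++;
            }
            System.out.println(Thread.currentThread().getName() + ": " + hits);
        };
        Thread[] threads = new Thread[Runtime.getRuntime().availableProcessors()];
        for (int i = 0; i < threads.length; i++) threads[i] = new Thread(hotLoop);
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
    }
}
```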

And it just so happened that in the JVM, the single-element cache, the secondary super cache, sits right next to the list of all secondary superclasses; the fields are adjacent. So it essentially becomes a classic example of false sharing, where one thread writes to the cache line and other threads keep reading from the same cache line.

And this cache line keeps getting flushed. Now, here is the interesting part. Remember how we said we only ever see 12-13% of the nodes in the bottom band? Think about it: the cache line is 64 bytes, a pointer is 8 bytes, so that's 8 pointers per cache line. We have a seven-in-eight chance of these two pointers landing on the same cache line, and a one-in-eight chance of them landing on different cache lines. And one in eight is exactly 12.5%.

So that's exactly what we saw in our telemetry. This was an absolutely fascinating find.

Harshad: That is really fascinating, and it lines up exactly with your observations. But like Phil said, this was a really deep dive; yet, like the iceberg story you saw earlier, the solution is probably quite simple, right, Phil?

Phil: Absolutely, Harshad.

And there couldn't be a better example of the iceberg analogy. The fix was to simply insert 64 bytes of padding between these two pointers. There were zero code changes; the only thing we changed is the data layout. That's it.

So we recompiled the JDK, we redeployed our service, and you can see the results here. It's pretty dramatic: the higher band of nodes simply disappeared. It's gone.

The highlighted area is when we did our deployment, and you can see that now we have a much, much tighter distribution of both CPU and latency.

Harshad: Awesome, that is really cool. And you know, Netflix could just patch the JDK themselves and did exactly that. But for OpenJDK users, Intel has already upstreamed this patch into OpenJDK, and you can learn more about it from the blog that is linked in our presentation.

Phil: That's great.

Harshad: What about further results from this finding?

Phil: Yeah, we didn't stop there; that was only the first step in our journey. We also encountered true sharing, and we addressed it as well.

Our time here is limited and we want to talk about other use cases as well, so we refer you to our blog post; the link is provided on this slide. You can read more there about our second step of addressing the true sharing, and also learn more about the secondary superclass issue, it's called type pollution in Java, and other work the community did around it. Maybe you are affected as well.

At the end of this exercise we achieved a 3.5x improvement in throughput, 3.5 times more throughput. Remember, we went from 4xlarge to 12xlarge expecting 3x, or around that, maybe a little less than 3x because scaling isn't linear, we understand that. We got 3.5x the throughput at the same CPU utilization and the same or lower latency.

Harshad: So that was a great story and a great set of optimizations that we did together. Thank you, Phil. I'm excited to even talk about that story again.

Now let's move on to our second use case, and how could we go without talking about AI? In this particular use case, we are going to talk about a few usages of AI within Netflix. I'm sure there are many, as everyone would suspect, but could you walk us through how Netflix uses AI?

Phil: Yeah, well, in a day and age where AI writes poems and drives cars and creates art, you bet we use AI. There are multiple usages of it; ours are a little more practical, and we'll be talking about one of those.

We use it as a part of our encoding pipeline. But before we get to that use case, just a little bit about how Netflix does encoding and how we deal with our video assets.

I think it's my turn to do a little trivia. Who knows which show this screenshot is from? I see raised hands. Yes, awesome. It's from Money Heist. Great show.

OK, now a harder question: who can tell the difference between these two pictures? At least I can't. OK, I need to talk to you after the show. But these images are designed to look the same, so I can reveal that the difference is that one of them is encoded at 680 kilobits per second and the other at 252 kilobits per second. Perceptually, they're very, very similar.

It's hard for humans to see the differences in those images. We use our encoding optimizations to reduce bitrates without loss in perceived quality.

Harshad: I see. But 680 kbps versus 252, in this modern era of gigabit internet, does it really matter?

Phil: Yeah, you might say you have 50 or 100 megabits, or a gigabit, of internet at home, but not everybody is privileged to have such high speeds. There are parts of the world, maybe Central Asia, parts of Latin America, or let's say India, where the most people can get is a 4G cellular network, or sometimes even 3G. For them, a lower bitrate is the difference between being able to stream Netflix and not being able to stream it. There is another way we as Netflix benefit from it as well: we use our content delivery network, Open Connect, to host these resources, and lower bitrates mean we can host more of these movies and deliver them with higher quality.

And lastly, I think everybody can relate to this: lower bitrates mean that when you download content onto your devices, say for a long-haul flight, you can download more. Everybody has had this problem: you're traveling, you have a 10-hour flight, you need to download movies and something for your kids to watch, and sometimes you run out of space. So these kinds of optimizations also optimize for space. It has benefits all around.

Harshad: Interesting.
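As a rough back-of-the-envelope, using the two bitrates quoted earlier and assuming a two-hour title (the runtime is an assumption, not a figure from the talk), the savings look like this:

```java
// Back-of-the-envelope: download size of one title at the two bitrates quoted
// above. The 2-hour runtime is an assumed example value.
public class BitrateSavings {
    public static void main(String[] args) {
        double seconds = 2 * 60 * 60;              // assumed 2-hour title
        double highKbps = 680, lowKbps = 252;      // kilobits per second
        double highMB = highKbps * seconds / 8 / 1000; // kilobits -> megabytes
        double lowMB  = lowKbps  * seconds / 8 / 1000;
        System.out.printf("680 kbps: %.0f MB, 252 kbps: %.0f MB (about %.1fx more titles fit)%n",
                highMB, lowMB, highMB / lowMB);
    }
}
```

That works out to roughly 612 MB versus about 227 MB per title, so close to 2.7 times as many downloads fit in the same storage.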

Phil: Yeah. So moving on to the encoding pipeline at Netflix: the 10,000-foot view is that we start from the source we get from the studios, which is a high-quality, high-resolution source encoded in a format that mostly isn't supported by consumer devices. First we put it through a downsampler, and we use a neural network to downsample; then we encode it; and then on the device we decode it and upsample it. We need to downsample because we cannot possibly support all resolutions; there is such a variety of devices we run on that it's impossible for us to know every resolution. We encode in a handful of resolutions, and the rest is downsampling and upsampling.

Harshad: Got it. And is this a single pipeline, or how much compute and optimization does this require?

Phil: Right. So, talking about the pipeline: everybody is familiar with the concept of encoding, and some of you might even have done it at home. Here's a chunk of video, you put it through encoding or some kind of processing, and maybe two or three hours later it comes out. For us, it would be prohibitively expensive to do it that way. So we have a parallelized pipeline: we split the video into smaller chunks and run encoding, validations, and checks on those chunks in parallel. That has certain benefits. First, it allows us to prioritize content: if there is content we need to encode urgently, we can simply put more compute on it and encode it quickly. The flip side is that if there is something we want to re-encode at lower priority in the background, we can reuse some of the spare capacity we already have that is not currently serving regular Netflix traffic; we can repurpose that capacity for encoding. And that is why we don't have specialized hardware for encoding; it runs on the same regular instances, M, C, R, whatever we have. So it's extremely important for us that it runs with as high performance as possible on regular CPUs.
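As a conceptual sketch of the chunked, parallel approach described here, the toy Java program below splits a title into chunks and encodes them on a worker pool. The class and method names are hypothetical; priority handling and harvested spare capacity are not modeled.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Split a title into chunks, encode chunks in parallel on a bounded pool,
// and collect the results. Everything here is a stand-in for illustration.
public class ChunkedEncodeSketch {
    record Chunk(int index, byte[] frames) {}
    record EncodedChunk(int index, byte[] bitstream) {}

    static EncodedChunk encode(Chunk chunk) {
        // Stand-in for invoking the real encoder on one chunk.
        return new EncodedChunk(chunk.index(), chunk.frames().clone());
    }

    public static void main(String[] args) throws Exception {
        List<Chunk> chunks = IntStream.range(0, 32)
                .mapToObj(i -> new Chunk(i, new byte[1024]))
                .collect(Collectors.toList());

        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<EncodedChunk>> futures = chunks.stream()
                .map(c -> pool.submit(() -> encode(c)))
                .collect(Collectors.toList());

        for (Future<EncodedChunk> f : futures) {
            EncodedChunk e = f.get();   // chunks finish independently
            System.out.println("chunk " + e.index() + " encoded");
        }
        pool.shutdown();
    }
}
```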

Going a little further, we do per-title optimizations on our videos. If you think about, say, a cartoon versus an action movie, they are very different, in the sense that the cartoon is simpler; we intuitively understand that. There are large chunks of similar color and simpler motion, so we can compress it a little further and get a lower bitrate at the same perceived quality. For an action movie we might apply a very different recipe, encoding at a higher bitrate to preserve all that detail and movement.

We also apply per-chunk optimizations, shown on the next slide. And lastly, our dynamic optimizer introduces something we call per-shot optimization. We define a shot as a group of frames with similar pixels. For example, think of a shot of an actor's face against some backdrop, shown for a few seconds; that could be a shot, with very little changing within that time frame. So we can find those shots, encode each one as a single unit, and apply optimizations. What that allows us to do is create a variable-rate bitstream: with different quantization parameters for different shots, we can walk this matrix of encodes and shots and find either the highest average quality at a given bitrate, or the lowest average bitrate at a given quality.
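One half of that trade-off, the lowest average bitrate at a given quality floor, can be sketched very simply, because each shot can then be decided independently. The candidate encodes and scores below are made-up numbers purely for illustration.

```java
import java.util.List;

// Per-shot selection sketch: for each shot we have candidate encodes at
// different quantization parameters (QPs), each with a measured quality
// (a VMAF-like score) and a bitrate. For a quality floor, each shot just
// takes its cheapest candidate that still meets the floor.
public class PerShotSelection {
    record Candidate(int qp, double quality, double kbps) {}

    static Candidate cheapestMeeting(List<Candidate> candidates, double qualityFloor) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (c.quality() >= qualityFloor && (best == null || c.kbps() < best.kbps())) best = c;
        }
        if (best == null) { // fall back to the highest-quality candidate
            for (Candidate c : candidates) {
                if (best == null || c.quality() > best.quality()) best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<Candidate>> shots = List.of(
                List.of(new Candidate(22, 96.0, 900), new Candidate(30, 91.5, 420), new Candidate(38, 83.0, 210)),
                List.of(new Candidate(22, 97.5, 500), new Candidate(30, 94.0, 260), new Candidate(38, 88.0, 150)));

        double qualityFloor = 90.0, totalKbps = 0;
        for (List<Candidate> shot : shots) {
            Candidate chosen = cheapestMeeting(shot, qualityFloor);
            totalKbps += chosen.kbps();
            System.out.println("picked QP " + chosen.qp() + " at " + chosen.kbps() + " kbps");
        }
        System.out.printf("average bitrate: %.0f kbps%n", totalKbps / shots.size());
    }
}
```

The other direction, highest average quality under a bitrate cap, couples the shots together and needs a small search over the same matrix, which is omitted here.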

Harshad: That's fascinating. It's fun to learn what goes on behind the scenes when I'm watching Netflix on my phone or TV. But getting back to AI, could you go a little deeper into how the downsampler actually utilizes AI?

Phil: Right. So, focusing on the downsampler use case: downsampling and upsampling are unavoidable; there has to be some amount of it, simply because we don't know the resolution of the device. So we want to optimize it as much as possible. What we did is apply machine learning to the question: given an input image, what is the best downsampled representation of that image, the one that results in the highest upsampled quality? We know the upsampler, we know that part; so give us the best representation of the downsampled image. We trained our neural network with that premise, and this is the conceptual diagram. It's actually a little similar to how noise canceling used to work: we generate a mask, that little tiny image in the middle, which we get by applying convolutional layers. We have several convolutional layers, and at the end we come up with a mask. Then we do a simple bicubic downsampling, apply the mask to it, and it yields a better result when we upsample the resulting image.
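A toy sketch of that final combination step might look like the following. Plain 2x averaging stands in for the real bicubic filter, and treating the network's mask as an additive correction is an assumption made only for illustration; the actual pipeline is not specified at this level in the talk.

```java
// Toy sketch: a conventional 2x downsample (simple averaging as a stand-in for
// bicubic) combined with a correction "mask" that would come from the
// convolutional layers. The additive combination is an illustrative assumption.
public class DownsampleWithMask {
    static float[][] downsample2x(float[][] img) {
        int h = img.length / 2, w = img[0].length / 2;
        float[][] out = new float[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                out[y][x] = (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
                           + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4f;
        return out;
    }

    static float[][] applyMask(float[][] downsampled, float[][] mask) {
        int h = downsampled.length, w = downsampled[0].length;
        float[][] out = new float[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                out[y][x] = Math.min(1f, Math.max(0f, downsampled[y][x] + mask[y][x]));
        return out;
    }

    public static void main(String[] args) {
        float[][] frame = new float[4][4]; // tiny stand-in frame
        float[][] mask  = new float[2][2]; // would be produced by the network
        float[][] result = applyMask(downsample2x(frame), mask);
        System.out.println("output is " + result.length + "x" + result[0].length);
    }
}
```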

Harshad: I see. So when you downsample, quality is of course very important to you. What's the scale of it? Does even 1% really matter to you?

Phil: Yeah, Harshad, all of it matters to us. We use the VMAF score to assess video quality, and even a few points of VMAF are perceptible to humans. What is VMAF? VMAF, Video Multimethod Assessment Fusion, is a metric Netflix developed; I think it's pretty much the industry standard for assessing video quality. The VMAF score tries to replicate how humans perceive video: a higher score means better quality. So we got a few points higher on the VMAF scale with the downsampler, but that's not all. We were also able to improve our quality-of-experience metrics with downsampled videos: we have fewer rebuffers, and play-start delays are reduced as well. So it's a win-win. The only thing, of course, is that it's resource intensive.

Harshad: Got it. So, being resource intensive, how have you been running the downsampler and the encoding pipeline on Intel instances in AWS, and how has this benefited you so far?

Phil: Yeah. As I mentioned earlier, we don't have specialized hardware for this pipeline; it's easier for us, and we get more benefit, when we run it on general hardware and harvest our unused capacity. So it's imperative that the downsampler and the whole encoding pipeline squeeze the maximum performance out of regular Intel Xeon processors. With the improvements that Intel helped us implement, we were able to achieve anywhere between 15% and 2x improvement in frames per second in our overall encoding pipeline. We were using oneDNN, and of course it runs on Intel hardware.

Harshad: That's great, 15% to 2x. So let me explain a little about how the 15% to 2x comes about. We are again going to dive a little deeper, and this is a view of what goes on inside the Intel processor when you do vectorization, which is what is used at the lowest level for many neural network operations. Intel launched AVX-512 many years ago, some seven or eight years back now. With its fused multiply-add units, you can complete an int8 multiply-accumulate in three instructions on 512-bit-wide vectors; we've had this for a long time, and it can do about 85 such int8 operations per cycle. M6i, the sixth generation of Intel instances on AWS, code-named Ice Lake, came with VNNI, or Vector Neural Network Instructions. VNNI essentially shrinks this into a single instruction: you do the same fused multiply-accumulate, as an extension to AVX-512, but in one instruction. That by itself is great, but now let's move on to M7i, the recently launched Sapphire Rapids instances on AWS. Here things go on steroids, because what it introduces is AMX, which extends AVX-512 even further. It adds two things: a two-dimensional register file, so your matrices in memory can be loaded into registers, and at the same time a new accelerator in silicon, the tile matrix multiply unit. We call these register files tiles, as you can see in the view; that one block is a tile, and the unit performs a tile matrix multiply operation. So now you can do almost 8x what the previous generation did, with 2048 int8 operations per cycle. With AMX, the Advanced Matrix Extensions, you basically get a lot out of the box from this deep learning accelerator.
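A quick back-of-the-envelope ties these per-cycle numbers together. It assumes two 512-bit vector ports per core and counts a multiply-accumulate as two operations; those assumptions are mine, and the AMX figure is simply the one quoted above rather than derived.

```java
// Rough arithmetic behind the int8 ops-per-cycle figures quoted in the talk.
// Assumptions (not from the talk): two 512-bit vector ports per core, and a
// multiply-accumulate counted as two operations.
public class Int8OpsPerCycle {
    public static void main(String[] args) {
        int lanes = 512 / 8;      // int8 lanes in a 512-bit vector
        int opsPerMac = 2;        // one multiply + one add
        int vectorPorts = 2;      // assumed ports per core

        int perIdealCycle = lanes * opsPerMac * vectorPorts;  // 256
        double avx512 = perIdealCycle / 3.0;  // ~85: plain AVX-512 needs 3 instructions per int8 MAC
        int vnni = perIdealCycle;             // 256: VNNI fuses the MAC into 1 instruction
        int amx = 2048;                       // figure quoted for the AMX tile unit

        System.out.printf("AVX-512: ~%.0f, VNNI: %d, AMX: %d int8 ops/cycle (AMX ~%dx VNNI)%n",
                avx512, vnni, amx, amx / vnni);
    }
}
```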

So that is all what's going on inside the processor, but how do we really use it? How do regular developers make use of it? Intel makes this easy with oneDNN, the deep neural network library. It provides highly optimized implementations of the low-level operations neural networks need, the SIMD kernels, softmax, regression analysis, and in addition it handles things like prefetching, making sure memory layouts are correct, and memory bandwidth optimizations, on top of the SIMD and AMX work. With these highly optimized operations using the tiles, you can get multiple-x performance at the CPU level, which is how Phil saw 15% to 2x in his application. But remember, he was running on the sixth generation, M6i processors, so there is still much more to gain from the seventh generation.

So with that, I'd like to summarize: with the new seventh generation of Intel processors, you get new hardware monitoring capabilities, which give you deeper observability and can help you improve your TCO. And with the new accelerator technologies, including AMX, which is present on every instance in the M7i series, you can get much better performance on your machine learning workloads.

Phil: Yeah, the results we presented speak for themselves. At Netflix we had great results on Cascade Lake hardware and on Ice Lake, and we're looking forward to further collaboration on Sapphire Rapids. But I think the real differentiator is Intel's investment in software and Intel's expertise in optimizing for these platforms. Thank you for making Netflix better.

Harshad: Thank you, thank you. So we'll take Q&A now, but do come back for our final session, which will be right here, on AI acceleration at the edge. Also visit our booth and look at what our partners are doing with Intel technology, and you have our emails up there if you wish to contact us offline. Thank you for coming to the session.

Thank you.

余额充值