Hello and welcome to Silicon Innovation at AWS. My name is Ali Saidi. I'm a senior principal engineer and the lead engineer for our Graviton instances. Today, I'm going to talk to you about some of the silicon we're building and why we're building it. Joining me today is Ron, my colleague on the machine learning side, and he's going to take care of the latter half of the talk even though his name isn't on here.
So far, we've built chips that span multiple areas, including data center and IO infrastructure, core compute, and machine learning. With the Nitro System, we were consuming more and more of the resources in the host CPU for hypervisor functionality. The hypervisor is the piece of software that takes a big server and cuts it into smaller virtual machines. And so we started moving that functionality away from a traditional hypervisor onto special purpose chips. We started that journey back in 2013, working with a startup named Annapurna Labs, which AWS would later acquire, and we moved more and more of the functionality out of the hypervisor onto special purpose silicon. This freed up the host CPU for more of your workloads and let us raise the bar on security for our servers.
Then with Graviton, we built a host CPU that provides powerful and efficient compute for a range of workloads in EC2, as well as the best price performance in their instance families. And lastly, with Inferentia and Trainium, we have purpose-built machine learning accelerators that we built from the ground up: one for inference with Inferentia and, you can guess, the other for training with Trainium. I'm going to talk about each of these in turn, but first, I'm going to spend a little bit of time talking about why we're going through all the work of building silicon here. And the answer is really simple: we're building silicon where we have a way to deliver more customer value when we do it. And there are a couple of ways that we do this.
The first is through specialization. By building our own chips, we get to specialize hardware for the use cases we have in AWS. That lets us really tailor the designs to operate in our environment and focus on the features that we need to deliver value for you. And you might say, well, of course you do that. But when a third party is building a chip, they want to satisfy as many different customers as they can, and so they need features for them all. And that means they need to add the complexity, the cost, and the power for all of those. When we're building chips for our use cases, the use cases where we can deliver value to you, we get to focus on the ones that we actually need. So I'll show you today how we've tailored the Nitro System to AWS infrastructure and to improve the performance and security of our servers.
The second is speed. By building our own silicon, we get better speed of execution, and there are a couple of different dimensions here. The first is what I just mentioned, specialization: by focusing on the features that we really need, we can focus on that one market and don't have to burden the design with other complexity that would take longer to deliver. The other is owning the end-to-end development process. We define the product and the silicon we need when we start a project, and we can have the teams working on the hardware in parallel with the teams working on the software, so that when the hardware is ready, the software is ready as well and we can deliver both of them to you very quickly. That means there are fewer handoffs, and the software and hardware teams run in parallel. Also, building chips requires an immense amount of simulation: simulation of the logic, simulation of the physical design. And by actually using EC2 and the elasticity that EC2 provides, we can burst into an amazing amount of compute when we're reaching milestones to make sure that we do all the simulations that we need in a timely fashion.
The third is really innovation. By building our own chips, we get to innovate more and create more value, and the reason for this is we get to cut across traditional silos. Traditionally, you have a chip vendor, a server vendor, and a software vendor, and each one of them is making good decisions, but they're making them at the boundaries of what they're delivering. By being able to look across the entire design, you might decide to spend more on a chip so you can spend less on a server, and the traditional silos might not allow you to actually do that. And lastly, security. Building our own silicon provides us a mechanism to enhance the security of our servers, offering a hardware root of trust, verification of the firmware running on them, and limiting the interactions with the host to a narrow set of APIs.
Let me start with the Nitro System. Nitro was really a fundamental rethink of how virtualization should be done, and it started with us saying: if we took our learnings from running EC2 for almost a decade, what could we improve? And there were a lot of things that we thought we could improve. We could improve throughput to reduce the cost of running your workloads. We could simplify the hypervisor, remove complexity, and get bare metal-like performance. We could reduce instance latency and jitter, which would allow you to bring more workloads to the cloud. And we could even build a system that would allow us to offer not only virtual machines, but the same experience of attaching and detaching volumes and starting and stopping instances dynamically from bare metal servers as well.
On the security side, we could use it to offer things like transparent encryption of data in a VPC, the hardware root of trust I mentioned, removing operator access from the system, and having a narrow set of auditable APIs. And Nitro gave us a path to implement all of those. Nitro is a combination of purpose-built hardware, built exclusively for and by AWS, and the software that runs on it. We've been doing this for a while now; we're on the fifth generation of our Nitro chips. Peter introduced the fifth one back on Monday, and we introduced the first one back in 2017 when we talked about the C5 instances. It's been five years since we did that. But really this journey started in 2013, when we launched enhanced networking and started moving some functionality out of the hypervisor, the piece of software that takes a big server and cuts it into virtual machines, into hardware. That let us offer higher performance, better latency, and improved CPU utilization. Over time, we expanded this to other types of IO: to EBS and to local storage. And eventually we stepped back, rethought what our hypervisor should be, and moved a bunch of functionality off of it. Since 2017, every instance we've launched has been based on the Nitro System. To understand the transformation here, it makes sense to start with a little bit of what this looked like before Nitro.
This is what one of our servers looked like before Nitro. Customers are still using some of these instance types today; people are still running M1s and similar. And here we have a hypervisor called Xen. Xen is great, but it did a lot. It did memory management and CPU scheduling, it did device emulation, it did rate limiting, it did security group enforcement, and it did quite a bit more than that as well. It even had a full-blown Linux user space in this privileged Dom0 you see here, and all this functionality used resources on the host CPU that customers couldn't use.
So we started moving some of the functionality off of the host, and we did this with our Nitro cards. We've got cards that offload VPC, EBS, NVMe local storage, and various system control functionality. The VPC card is the first one we did, so let me talk about that one a bit. The VPC data plane offload handles your ENI attachment, security groups, packet encapsulation and decapsulation; we do flow logs, routing decisions, DHCP, DNS, and more. All those functions used to run on that hypervisor, and we moved them off. Our newer versions of Nitro cards also support VPC encryption: they transparently, and without a performance penalty, encrypt your VPC traffic with AES. The VPC card presents the ENA device that you might have seen in your instances. That device started by offering up to 10 gigabits of bandwidth when we first announced it, and over time we've gone up to 200 gigabits with that same device model, that same driver. That's a 20x increase in performance without having to tell you to go and change your drivers, build new AMIs, or anything like that. And that's pretty unique; I can't think of another place where we've seen a 20x increase in performance without having to change your drivers or change the device model.
This week, we also announced ENA Express. ENA Express lets you take TCP, which is an end-to-end protocol between two of your instances, and instead of just following one path through the network in our data centers, send that traffic down many paths. This lets you use more of our network, find lower-latency paths, and reduce P99 latency between two instances by as much as 40%, as well as increase the bandwidth of a single traffic flow between two instances by 5x. We also have the Elastic Fabric Adapter. This is a network interface that's focused on machine learning and HPC workloads. These workloads are a little bit different, in that usually a workload is either low latency or high bandwidth, but with HPC and machine learning you typically have a workload that needs both high bandwidth and low latency. And so we built EFA, which allows applications to bypass the kernel, talk straight to the device, and get that low latency and high performance. And just like ENA Express, it can use multiple paths through our network to get high-bandwidth, low-latency traffic between your instances.
The other Nitro card to talk about is the Nitro SSD. Before I talk about that, though, I need to tell you a little bit about how an SSD works. There are two main components in a traditional SSD that are interesting. The first is the NAND; this is where bits are stored. NAND is a little peculiar in that you can only write it after you've erased it. That kind of makes sense, but you have to erase it in chunks of megabytes. So even if you want to just change a word in a file, you really have to copy that data to a new location and then clean up the old data later. The other interesting thing about NAND is that it wears out: the more you write to it, it eventually gets slower and then it's unable to store data anymore. So you don't want to write to just one location over and over again; you want to move those writes around all the NAND. Now, that's complex to do, and so there's this other chip called a flash translation layer, an FTL, and it takes care of this complexity. The operating system has a logical address where it stored something, and the FTL knows where that actually is. It deals with the wear leveling, it deals with the garbage collection I just mentioned, and really it ends up looking a lot like a write-ahead-logging database. And like most databases, as you go to different vendors you get a different implementation. They all do a pretty good job in the average case, but when you push them, they behave a little differently, sometimes at the worst time: maybe you've got a high-bandwidth database and garbage collection kicks in just when you have a traffic spike.
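To make that FTL idea a bit more concrete, here is a minimal, hypothetical Python sketch of the bookkeeping a flash translation layer does. The class, constants, and sizes are invented for illustration and are nothing like real SSD or Nitro SSD firmware, but they show the logical-to-physical map, the stale-page tracking, and the garbage collection that can kick in at inconvenient times.

```python
# Toy flash translation layer (FTL) sketch: illustrative only, not how any
# real SSD is implemented. Logical writes always go to fresh pages, the old
# copy is marked stale, and garbage collection later relocates live pages so
# a whole block can be erased at once.

PAGES_PER_BLOCK = 4      # real NAND blocks hold hundreds of pages (megabytes)

class ToyFTL:
    def __init__(self, num_blocks):
        self.l2p = {}                                    # logical page -> (block, page)
        self.blocks = [[] for _ in range(num_blocks)]    # pages written per block
        self.stale = [set() for _ in range(num_blocks)]  # stale page indexes per block
        self.active = 0                                  # block currently being filled

    def write(self, logical_page, data):
        # Invalidate the old copy instead of overwriting it in place.
        if logical_page in self.l2p:
            blk, page = self.l2p[logical_page]
            self.stale[blk].add(page)
        if len(self.blocks[self.active]) == PAGES_PER_BLOCK:
            self._advance_block()
        blk = self.active
        self.blocks[blk].append((logical_page, data))
        self.l2p[logical_page] = (blk, len(self.blocks[blk]) - 1)

    def _advance_block(self):
        # Prefer an empty block; otherwise garbage-collect the block with the
        # most stale pages by rewriting its live pages elsewhere and erasing it.
        for i, blk in enumerate(self.blocks):
            if not blk:
                self.active = i
                return
        victim = max(range(len(self.blocks)), key=lambda i: len(self.stale[i]))
        live = [(lp, d) for p, (lp, d) in enumerate(self.blocks[victim])
                if p not in self.stale[victim]]
        for lp, _ in live:
            del self.l2p[lp]                 # their old location is about to vanish
        self.blocks[victim] = []             # "erase" the whole block at once
        self.stale[victim] = set()
        self.active = victim
        for lp, d in live:                   # write amplification: live data moves
            self.write(lp, d)

ftl = ToyFTL(num_blocks=3)
for i in range(20):
    ftl.write(i % 5, f"version-{i}")         # keep rewriting the same 5 logical pages
print(ftl.l2p)                               # logical pages end up scattered across blocks
```

Running it, you can see logical pages migrating between blocks as old copies go stale and whole blocks get erased, and that background relocation work is exactly what shows up as tail latency when it collides with a traffic spike.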
So we looked at this and said, well, how can we get consistent performance from different vendors and devices? You're probably guessing where I'm going here. This is a rendering of a Graviton2 storage server, and you can see the NAND there with the heat sink on top. So we took that NAND, we added a Nitro card, and the FTL runs on that Nitro card. The results here are pretty spectacular: the I4i, Im4gn, and Is4gen instances lowered average latency by as much as 60% and lowered tail latency by 75%. This improvement allows you to run more intense workloads with stricter SLAs, but it has also allowed customers to run at higher load than they could previously and get better cost savings versus our I3 and I3en instances. One such customer is Splunk. Splunk is a leading data platform provider designed to investigate, monitor, and act on data at any scale. When they evaluated Im4gn and the Nitro SSD, they saw up to a 50% decrease in search runtime compared to their I3 and I3en instances.
So we took that IO from the traditional picture and moved it off onto Nitro cards. You can see that Dom0 there, that privileged domain in the Xen world, is significantly offloaded; many of its functions are now on Nitro cards. And so we could step back and ask, what do we really need the hypervisor to do? We need it to do two things: memory and CPU allocation. For the rest, it can just get out of the way. And that's what we ended up building: a hypervisor that provides bare metal-like performance through its reduced size and complexity.
Lastly, we added the Nitro Security Chip. These are integrated onto the motherboard and provide a simple hardware-based root of trust that lets us measure the contents of the flash devices on the system, prevent unauthorized modification, and make sure that we're running the firmware that we think we are.
So with all of this, we could now remove Dom0, replace Xen with the Nitro Hypervisor, and enhance the security posture of our servers. We designed the Nitro System to provide workload confidentiality, not just between customers but also between customers and AWS. There's no mechanism for any system or person to log into these Nitro instances and access customer data. There's no root user, there's no interactive access; there's only a narrow set of authorized and authenticated APIs that don't have access to customer data. We recently published a white paper on this, a security overview of the Nitro System, and you can find that online.
Now, we've been building silicon, as I mentioned, for quite some time, and a couple of nights ago we announced the fifth generation of our Nitro card. This one has 22 billion transistors, nearly double that of our fourth generation. It supports DDR5 and PCIe Gen 5. We've gone from supporting encryption on the network device to adding encryption for DRAM and adding encryption support for PCIe. With our fifth-generation Nitro card, we've increased the packet rate by up to 60%, lowered latency by up to 30%, and did all that while reducing power.
Now, I've talked about performance and I've talked about security, but there's one more aspect of the Nitro System that I want to share with you, and that's that it's really a modular design. It's almost like a Lego block of components. We've taken the same Nitro card that we started with in an Intel server, we've moved it to an AMD server, we've put it in a Graviton server; we even connected it to a Mac mini and offered the Mac1 instance type, all without changing the cards. It took us 11 years to go from one instance type in 2006 to 70 instance types in 2017, and since then we've gone from 70 to around 600 instance types. And the reason we're able to do that is what I just mentioned: we've taken our Nitro cards and put them in different servers with different storage, different architectures, different CPUs, and different accelerators, which provides you a large amount of flexibility to choose the optimal instance for your workloads.
OK, so we've talked about Nitro; let me talk a bit about Graviton. Graviton is a line of Arm-based server processors, available exclusively in AWS, that provide the best performance and price performance in their instance families. We started this journey back in 2018 when we announced the first Graviton processor; we didn't even call it Graviton 1, it was just Graviton. And a lot of people looked at us a little strangely then and said, what are you doing here? But Graviton proved that you could have another architecture in EC2: that you could run different AMIs, you could start and stop instances, you could see the same elasticity you saw on x86 instances, and you had wide OS support. It wasn't really until Graviton2 in 2019, though, that people said, oh, this chip has six times the transistors of the first Graviton, four times the cores, and each core is about twice as fast. And we went from kind of a curiosity, "it's interesting, I wonder why they did that," to some of the best price performance in EC2.
We haven't stopped there. With Graviton3 last year, we almost doubled the transistors again, a huge step up in performance. I'm going to talk about Graviton2 a little bit and then talk about Graviton3. As of today, we've got 12 different instance types across our different families powered by Graviton2. The M6g and M6gd support general purpose workloads, one with local storage and one without. We have a burstable T4g instance, and there's still a free trial on t4g.small; if you haven't tried one, you can. We've got the compute- and memory-intensive types, and also a C6gn for network-intensive workloads like load balancers and firewalls, the Im4gn and Is4gen for storage workloads with the Nitro SSDs I mentioned earlier, and even one with GPUs that we've seen people use for rendering, Android emulation, and machine learning. These are all available from 1 to 64 vCPUs, now across 28 different regions worldwide.
One of the simplest ways to consume Graviton is through our managed services, because they expose the same API whether they're running on x86 or on Graviton. So typically you can just change the underlying instance type and not have to change your code. We've got six different databases now, DocumentDB, Aurora, RDS, ElastiCache, MemoryDB, and Neptune, some of which offer Graviton by default. We have a few analytics offerings on Graviton as well, including OpenSearch and EMR; compute with Lambda, Fargate, and Elastic Beanstalk; and just a couple of months ago, SageMaker also announced support for Graviton. And we've seen customers from large enterprises to startups, across different verticals, segments, and geographies, all adopting Graviton-based instances in Amazon EC2 or those managed services, tens of thousands of them.
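As a rough illustration of that "just change the underlying instance type" point, here's a hedged boto3 sketch that moves an RDS database onto a Graviton-based instance class. The database identifier is a placeholder, and you'd want to confirm that your engine version supports Graviton classes and pick an appropriate maintenance window before applying it.

```python
# Sketch: moving an RDS database to a Graviton-based instance class with boto3.
# The identifier below is a placeholder; verify engine-version compatibility
# with Graviton classes first, and prefer applying in a maintenance window.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="my-database",   # placeholder name
    DBInstanceClass="db.r6g.large",       # Graviton2-based class
    ApplyImmediately=False,               # apply at the next maintenance window
)
```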
Customers typically just start with one service: they find something that they run a lot of and try to move it. And most customers told me it was actually easier to move than they thought. The hard part wasn't recompiling; it was more in the pipelines they have to build artifacts and deploy them. That was the thing they had to work out, and once they worked that out, the code compiled, it ran, and they were happy. We've had people like Discovery use Graviton-based EC2 instances to reach 175 million viewers during the Tokyo Olympics; DIRECTV, who did what I just described and started with a single Go service and now run hundreds of Go services on Graviton; people like Epic Games, who are running Fortnite and the Unreal Engine on Graviton; and Intuit, who moved Kafka workloads to Graviton. Graviton also offers energy-efficient compute. NEC and DOCOMO have recently been collaborating on a 5G core workload, bridging DOCOMO's on-premises infrastructure with NEC's 5G core running in our regions. Not only did they find great performance, but they also said they were delighted to announce a significant reduction in power consumption of the 5G core, thanks to NEC's advanced cloud-native 5G software and AWS's innovative and highly efficient Graviton2.
With Graviton3, we've continued that journey. It's obviously our third-generation chip. It looks a little different in that we've got seven different chiplets here. The transistor count has almost doubled to 55 billion transistors, and it's the first system in our data centers with DDR5 memory. Now, Graviton3 runs at a slightly faster clock, but it gets most of its performance by extracting more IPC, or instructions per cycle. So where did those 55 billion transistors go?
Well, I can tell you where a lot of them went. We made the cores bigger, of course. They have a 2x wider front end, a much bigger branch predictor that predicts branches better for large workloads, 2x wider dispatch, a 2x larger instruction window, twice the SIMD bandwidth, including things like bfloat16, which is applicable to machine learning workloads, twice the memory operations per cycle, and twice the number of outstanding transactions.
All of this gives you 2x performance on things like TLS negotiation. There are also some security features, like pointer authentication, which can help prevent return-oriented programming attacks, and an RNG instruction that lets you get random numbers from user space.
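Since bfloat16 comes up a few times in this talk, here's a quick, generic PyTorch sketch (not AWS-specific code) of the trade-off it makes: half the bytes per value with FP32's exponent range, at the cost of mantissa precision, which is why ML frameworks can exploit it on hardware with wide bfloat16 SIMD support.

```python
# Generic PyTorch sketch (not AWS-specific): bfloat16 halves the bytes moved
# per value and keeps FP32's exponent range, trading away mantissa precision.
import torch

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

out_fp32 = a @ b                                                     # FP32 baseline
out_bf16 = (a.to(torch.bfloat16) @ b.to(torch.bfloat16)).to(torch.float32)

# Relative error is roughly at bfloat16's ~2-3 decimal digits of precision.
print((out_fp32 - out_bf16).abs().max() / out_fp32.abs().max())
```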
Now, Graviton vCPUs are a little different than the vCPUs on x86. x86 systems support simultaneous multithreading, or SMT. What this means is that two vCPUs end up sharing physical resources in the core and the caches. If both vCPUs aren't busy, that's great: it's higher efficiency, you get to effectively timeshare that core. But as the systems get busy, now you're just time-slicing them, and you see contention in the cores and the caches.
This is why some people see that as they push CPU utilization above 50%, their P99 latencies start increasing rapidly. With Graviton, every vCPU is its own core. They don't interfere with each other: not in the core, not in the L1 caches, not in the L2 caches. That provides a stronger security boundary, and it also typically lets customers push their load on Graviton higher without sacrificing latency, and thus reduce costs.
This is what a Graviton3 CPU looks like. We started with a big die and put 64 cores on it. We connected them together with a mesh that runs at over two gigahertz and provides more than two terabytes per second of bisection bandwidth. Then we distributed 32 megabytes of last-level cache around that mesh, which gives you super high bandwidth to that cache. With that cache and the cores' private caches, there's over 100 megabytes of cache in the processor.
We have our DDR5 controllers: eight channels of DDR5, 300 gigabytes a second of theoretical bandwidth, and it's really easily accessible bandwidth. With just simple loops, you end up getting close to that theoretical bandwidth. And just like Graviton2, we encrypt all the DDR interfaces.
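If you want to sanity-check that "simple loops get you the bandwidth" claim yourself, a rough sketch like the one below works. It uses NumPy rather than a tuned benchmark, so interpreter overhead and the read-plus-write traffic of a copy mean it will understate the hardware number, and the result obviously depends on the instance you run it on.

```python
# Rough memory-bandwidth probe (illustrative, not a calibrated benchmark).
# A streaming copy touches each byte once for the read and once for the write.
import time
import numpy as np

N = 256 * 1024 * 1024 // 8          # ~256 MB of float64, well past the caches
src = np.random.rand(N)
dst = np.empty_like(src)

best = float("inf")
for _ in range(5):
    t0 = time.perf_counter()
    np.copyto(dst, src)             # a simple streaming loop under the hood
    best = min(best, time.perf_counter() - t0)

bytes_moved = 2 * src.nbytes        # read src + write dst
print(f"~{bytes_moved / best / 1e9:.1f} GB/s effective copy bandwidth")
```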
And lastly, we have PCIe, and we've done other innovations there too. Traditionally, when a device interrupts a VM, the interrupt goes through the hypervisor first. This is OK, but that trip through the hypervisor increases latency. So with Graviton3, and then later actually with Graviton2 as well, we bypass the hypervisor and interrupt directly into the VM. That improves latency, and since we're never going through the hypervisor, it also improves throughput.
The C7g instance, the first instance based on Graviton3, provides up to 25% higher performance than the C6g instance, the one based on Graviton2. And it has those things I mentioned: twice the floating-point performance, twice the cryptographic performance, and it's up to 60% more energy efficient versus comparable EC2 instances. We've seen lots of customers run data processing at scale on Graviton2.
So we wanted to see what that looks like on Graviton3. We ran Spark SQL with Spark 3.3 and Corretto 17 on an eight-node cluster with a one-terabyte data set, and compared to Graviton2 on a C6g, Spark is 28% faster. So in a single generation, we've gone up 28%. We also looked at applications like Node.js. Here, we fixed a load generator and we fixed a database.
We then put either a C6g or a C7g behind them running Node, with a sample application called Acme Air that does flight scheduling, and going from Graviton2 to Graviton3 you see a 37% performance increase. We also have a lot of customers doing video encoding, and so AWS and the open source community have been spending time improving video encoding on Graviton.
Over the past year we've made a huge improvement: going from an older version of the encoder to the latest, we've seen a 60% increase in performance. Then on top of that, going from Graviton2 to Graviton3, we see a 50% increase in performance. So, you know, well over 2x from where we were a year ago. The same is also true for machine learning. There's a lot of code where people have lovingly hand-optimized the machine learning kernels, and AWS, along with Arm and others, spent a lot of time working on TensorFlow, PyTorch, oneDNN, and the Arm Compute Library to improve the performance of machine learning on Graviton.
You can see some of the numbers going from Graviton2 to Graviton3, where we added bfloat16 and wider SIMD units: 2x, 3x type numbers, really big. So we started with the C7g instance, and this week we've announced two more. The C7gn is for network-intensive workloads, with up to 50% higher packets per second and 2x higher bandwidth than our other network-optimized instance types.
It features the fifth-generation Nitro card that I talked about earlier in the talk. We also announced HPC7g, an instance focused on high performance computing workloads. This one has our Graviton3E processor, a variant of Graviton3 that does more SIMD operations per cycle. When we built C7g, we were really focusing on mainstream workloads and optimizing for power consumption. But HPC workloads are a bit different: they can have a lot more vector math, a lot more linear algebra.
So when we looked at it and wanted to offer HPC7g, we said, what can we do here? And we found a way to increase the vector performance over where we were with Graviton3 and the C7g. With this, we've seen 10% gains on molecular dynamics workloads like GROMACS, 30% on HPL, and somewhere in the middle on some option-pricing workloads.
Now, our innovations didn't stop with the chip; we have end-to-end ownership of these systems. Typically in the data center, you've got racks that are about 42U tall and full of two-socket servers, each one 1U, and traditionally you run out of power in a rack position before you run out of space. But Graviton2 had the opposite problem: we'd run out of space in the rack before we would actually run out of power.
So what could we do? Well, when we sat down to build Graviton3, we designed the chip, the package, and the motherboards to have not two but three sockets in a single 1U server. This increased the socket density in the rack by as much as 50% and came closer to the power requirements of some other solutions. And because we also own the Nitro card, we developed it so it could manage three sockets at once, not just two.
So with that, this is what a Graviton3 server looks like. And we've had some great customer feedback on this. NextRoll, a marketing technology company, saw 15% higher performance when they moved from C6g to C7g. But that's not what excited them. What excited them was that, at the same time, they saw a 40% reduction in latency, and that was really important for their workload.
We have Sprinklr, who provides a unified customer experience management platform; when they moved from Graviton2 to Graviton3, they saw 27% better performance. Ansys does industry-leading physics-based simulation; they have a product called LS-DYNA that does crash simulation, and when they moved from Graviton2 to Graviton3, they saw 35% higher performance. And an advertising company saw 45% higher performance when they moved one of their compute-intensive workloads from Graviton2 to Graviton3. OK, thank you for tolerating my voice. I'm going to hand over to Ron now, who's going to talk about machine learning.
Hey, folks. Thanks for being with us today. So our third product line, and it's actually two sibling product lines, is AWS Inferentia and Trainium. As Ali mentioned, Inferentia provides you with best-in-class performance and price performance for inference workloads, and Trainium does the same for training workloads.
We built Inferentia and Trainium seeing the massive explosion of AI workloads in recent years. And it's not only that AI is becoming more popular, but models are becoming larger, and they are becoming larger because larger models are more accurate. That means training becomes more expensive and inference becomes slower.
So we figured that we had to do something to help you get more out of AI and reap the benefits of the AI progress of recent years. We built AWS Inferentia, again for inference workloads, and Trainium for training workloads. But we wanted to make the migration in and out of Inferentia and Trainium as easy as possible.
And that's why we built the Neuron SDK. The Neuron SDK integrates Inferentia and Trainium into popular machine learning frameworks like PyTorch and TensorFlow, making it super easy to compile and run your favorite machine learning models on these targets. It's literally three or four lines of code, and it compiles and runs seamlessly on the hardware.
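As a sketch of what those few lines look like, here's roughly the shape of compiling a PyTorch model for Inferentia with the torch-neuron package. The exact package and function names vary by Neuron SDK release (for example, torch-neuronx is used for Trainium and Inf2), and the model here is a placeholder, so treat this as illustrative rather than copy-paste.

```python
# Sketch of compiling a PyTorch model for Inferentia with the Neuron SDK.
# Package/API names vary by Neuron release (e.g., torch-neuronx on Trainium),
# so check the current Neuron documentation for the exact calls.
import torch
import torch.neuron  # provided by the torch-neuron package on Inf1

# Placeholder model standing in for your own network.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
example = torch.zeros(1, 128)          # example input used for tracing

neuron_model = torch.neuron.trace(model, example_inputs=[example])
neuron_model.save("model_neuron.pt")   # the saved artifact runs on the Inferentia devices
```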
And we make it a priority, our goal here, to provide you with high performance out of the box without needing to change your models or do any performance-tuning work. We announced our first instance back in 2019; it's called the Inf1 instance, and the Inf1 instances are based on our first ML silicon, called Inferentia.
We efficiently pack 16 Inferentia devices in a single server, and that's possible because, again, we own the end-to-end design and we drive the power of Inferentia very low. With 16 Inferentia devices in a single server, Inf1 delivers one petaflop of BF16 or FP16 compute and two petaops of INT8 compute.
Now, it didn't make it into our slides today, but this week we also announced Inf2, based on Inferentia2, which delivers 2.5x more compute performance and 15x more memory bandwidth. So a massive increase between generation one and two. Let's look at end-to-end performance with Inf1 compared to our next-in-line inference-optimized instance, which is G5, based on GPUs.
Inf1 delivers 25% higher end-to-end performance and up to 70% lower cost per inference. That's a massive improvement that you get just by switching your workloads from GPU-based instances to the Inferentia-based instances. And we're seeing thousands of customers using Inf1 today. We have enterprise-level customers like Airbnb and Snap.
Airbnb reported that when they migrated their chatbot from a GPU-based instance to Inf1, they saw a 2x performance improvement out of the box, again without needing to change their models. We also have other enterprise customers and several fast-moving startups that are all migrating from other instances to the Inf1 instance based on Inferentia.
In addition to that, Amazon teams are also migrating their AI services to Inferentia. Most recently, Amazon.com moved their search from GPU-based instances to Inferentia, and they saw 85% lower cost while also reducing their latency at the same time. And this is pretty significant: typically you get only one of those, not both together.
All right, we've talked about Inferentia; let's move on to talk about training with Trainium. We did the same thing that we did for Inferentia, but for training workloads. Now, training workloads are different than inference workloads: they require a massive amount of computation, typically running for a very long time, and very often running on clusters as well.
That's why you see that the spec is quite different here. We again pack 16 Trainium devices, our second-generation machine learning chip, inside a single Trn1 server. And here we deliver 3.4 petaflops of BF16 compute, the same for FP16 compute, and also 840 teraflops of FP32 compute. I'll touch on that later; it's an interesting point.
They also come with 13.1 terabytes per second of memory bandwidth. That's an extraordinary number, and again, this is critical when you're dealing with training workloads because they tend to be very memory intensive. And the last point here: Trn1 instances come with 800 gigabits per second of network bandwidth based on EFA v2.
This is especially important for scale-out training, where we train a single model on multiple servers. And to double down on that, coming soon is Trn1n, where we attach 1.6 terabits per second of network bandwidth to every single instance. As I mentioned before, models are becoming larger very quickly. They've actually been increasing in size by about 10x per year in recent years, especially when talking about the large language models.
And that means that when training state-of-the-art models, you actually need to train on hundreds, sometimes thousands, and in the state-of-the-art cases tens of thousands of devices in a single training task. This is quite challenging. So to handle that, we built the Trn1 and Trn1n UltraClusters. These UltraClusters pack more than 30,000 Trainium devices in a single data center with a nonblocking, petabit-scale EFA interconnect between them.
Just think about the scale: 30,000 devices in a single data center with a nonblocking interconnect. With that capability, we deliver 6 exaflops of highly optimized machine learning compute, and you can scale it up and down and get elasticity at a supercomputer level. So that's quite a significant offering.
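To give a feel for what "training a single model on many devices" means in code, here's a deliberately generic PyTorch data-parallel sketch. It is not Trainium-specific (on Trn1 you would go through the Neuron SDK and its PyTorch integration), but the idea of sharding batches across ranks and all-reducing gradients is the same one the UltraCluster interconnect is built to feed.

```python
# Generic PyTorch data-parallel sketch: each rank trains on its own shard of
# data and gradients are all-reduced so every replica stays in sync. This is
# framework-level illustration only, not Trainium/Neuron code.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    dist.init_process_group(backend="gloo")      # launcher provides rank/world size
    model = DDP(torch.nn.Linear(1024, 1024))     # wraps the model for gradient all-reduce
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):
        x = torch.randn(32, 1024)                # each rank sees its own batch shard
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()                          # gradient all-reduce happens here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    train()   # e.g. launched with: torchrun --nproc_per_node=2 this_script.py
```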
Stepping back and looking at end-to-end performance, Trn1 delivers 1.5x the performance at half the cost to train compared to our next-in-line ML training instance, the P4d. Let's dive a little deeper into Trainium and the features that make all of this possible, and let's start with data types. Machine learning workloads require a massive amount of floating-point calculations.
These calculations vary in range and precision, starting from FP32, through the popular FP16 and bfloat16, all the way to a new, emerging configurable FP8 data type. Now, different neural networks require different data types to reach their optimal accuracy. Moreover, during the earlier research and discovery phases, ML scientists typically prefer the FP32 data type for its ease of use and accuracy.
But then, as they move to train at scale, they'll typically go with a more compact data type like BF16 in order to optimize for price performance, and this is especially true if it's one of the giant models that are costly to train. Trainium comes with a very rich set of data types, just like you see on the screen, to allow you to select the most optimal data type for your specific use case.
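Here's a small, generic PyTorch illustration of the range-versus-precision trade-off between those data types; the values are just chosen to make the rounding visible, and none of it is Trainium-specific.

```python
# Quick look at the precision/range trade-off between FP32, FP16, and BF16
# (generic PyTorch, nothing Trainium-specific).
import torch

x = torch.tensor(1.0009765625)   # 1 + 2**-10, exactly representable in FP32

print(x.to(torch.float32))   # keeps the value: 23-bit mantissa
print(x.to(torch.float16))   # also keeps it: 10-bit mantissa, but small range (max ~65504)
print(x.to(torch.bfloat16))  # rounds to 1.0: only a 7-bit mantissa, but FP32's exponent range

big = torch.tensor(1e20)
print(big.to(torch.float16))   # inf: overflows FP16's range
print(big.to(torch.bfloat16))  # ~1e20: bfloat16 keeps the range
```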
In terms of compute performance, Trn1 delivers 40% higher BF16 performance and 5x higher FP32 performance compared to our next-in-line ML training instance. And I just mentioned why FP32 is important, so getting 5x higher there is quite significant.
Now, let's go into an interesting feature that we baked into Trainium. One of the most significant contributors to accuracy loss during training is rounding operations. Most compute platforms today implement a form of rounding called round-to-nearest-even, and that's very similar to what all of us learned in school.
Numbers are always rounded to the nearest representable value. So if you take an example with integers, 0.9 will always be rounded to one and 0.2 will always be rounded to zero. At first glance this seems OK, but here's where it can hurt accuracy. Think about taking a weight value, a parameter value, of one, and a gradient value, a parameter update, of 0.2.
When we add them together, we get 1.2, and after rounding, we get one. But even if we add ten gradients of value 0.2 to a weight value of one and perform rounding after each step, the result is still going to be one. And here you can intuitively see that we lost some precision.
We would expect that if we add ten gradients of value 0.2 to a weight value of one, we'll get a result that is closer to three, not to one. With stochastic rounding, this problem is resolved: on every rounding step, we round either up or down in a probabilistic manner, according to the relative distance from the two nearest representable values.
So in our previous example, 1.2 will have an 80% chance of being rounded to one and a 20% chance of being rounded to two. Over large floating-point calculations, just like we have in machine learning, this tends to average out and provide more accurate results. Trainium fully supports, in hardware, both round-to-nearest-even and stochastic rounding, and it's your choice which rounding mode to use.
We've seen that if you select stochastic rounding, you can speed up some models by up to 20%. And this is super easy to use, by the way: you just set an environment variable with your preferred rounding mode, and that's it, Trainium takes care of the rest.
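Here's a tiny Python simulation of the example above, rounding to integer granularity as a stand-in for a low-precision accumulator. It isn't Trainium code (the actual rounding mode is selected through a Neuron SDK environment variable, whose name depends on the release), but it shows why the two modes diverge.

```python
# Simulation of the rounding example above: add a 0.2 update to a weight of 1.0
# ten times, rounding the accumulator to integer granularity after every step
# (a stand-in for low-precision storage). Illustrative only.
import random

def round_nearest_even(x):
    return round(x)                  # Python's round() is round-half-to-even

def round_stochastic(x):
    lo = int(x // 1)                 # lower of the two nearest representable values
    frac = x - lo
    return lo + (1 if random.random() < frac else 0)

for name, rnd in [("nearest-even", round_nearest_even),
                  ("stochastic", round_stochastic)]:
    w = 1.0
    for _ in range(10):
        w = rnd(w + 0.2)             # round after every accumulation step
    print(f"{name:>12}: {w}")
```

Round-to-nearest-even gets stuck at one, while stochastic rounding lands around three on average, which is the behavior the loss-convergence comparison on the next slides reflects.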
Now, let's look at a real-life example of the benefits of stochastic rounding. Here, we have a loss convergence graph on the left and a throughput graph on the right, and our goal is to minimize the loss value during training while at the same time maximizing throughput. We start with FP32 in blue, where loss convergence is great but throughput is naturally suboptimal; it's slower to perform computations in FP32.
Then we move to BF16 with round-to-nearest-even, in red. As you can see, throughput improved a lot, but loss convergence is not as good as it was with FP32. Finally, we go to BF16 with stochastic rounding, which we just discussed. Throughput is just as high as BF16 with round-to-nearest-even, again because Trainium natively supports stochastic rounding in hardware.
So you don't need any extra compute to perform that form of rounding. But the interesting part is that the loss convergence now looks almost indistinguishable from FP32. So you truly get the best of both worlds with BF16 and stochastic rounding.
We launched Trn1 just a few months ago, and we're seeing strong customer adoption, along with partners that are working with us to build the infrastructure for these new hardware systems for training at scale. The PyTorch team specifically is working very closely with us, and there are other sessions where we go into details on that.
All right, so let's wrap up. First, thanks for spending the time with us today to discuss the custom chips that we're building at AWS. This was a pretty brief overview, and we're very early in our journey. We keep building more chips to provide you with more value over time and across services.
Our main message for you today is that the reason we're building chips is to provide you with more value, and we work diligently to improve performance, better our cost structure, and increase security across AWS services. We're already working on our next set of chips, so we're looking forward to coming back next year to talk about our progress and share the new offerings that we have.
So thanks, folks. Don't forget to fill out the survey, and we'll stick around if you have any questions; we'd love to chat with you.