Behind-the-scenes look at generative AI infrastructure at Amazon

Ok, welcome everybody. Good to see so many people. I really appreciate you coming to the behind-the-scenes look at generative AI infrastructure here at Amazon. I'm Gadi Hutt, director of product and business development at Annapurna Labs.

Joining me today are our customers, Mr. Naveen Rao, VP of Generative AI at Databricks, and Mr. Pete Werner, Head of AI at Leonardo.Ai. Together we'll tell the story of how gen AI infrastructure works at AWS.

We have a packed agenda in order to tell the story. I thought we'd start with some history of how we got started at Annapurna Labs and the key steps and decisions we took along the way. Then we'll go over the hardware progression, the software progression, and the latest benchmarks and numbers. Then we'll hear from our customers, and we'll do a quick summary at the end with a glimpse of the future. So let's get started.

Annapurna Labs is the team within AWS that builds all the purpose-built chips. You're probably familiar with the Nitro product line and with Graviton; Inferentia and Trainium will of course be the focus today.

Since the beginning, we knew that even though we are building silicon products and hardware-centric products, our customers would all be software developers. That's a key distinction, because our silicon architects not only need to know the latest silicon technology, they actually need an intimate understanding of the software layers and ecosystem these chips have to work with.

This means we take very specific decisions to ensure the hardware fits what our customers need on one hand and, on the other, fits how AWS deploys at scale in the data centers. Building the wrong piece of silicon is a very, very expensive mistake, not only in capital expenditure that has to be restarted, but also in time. If there is a bug in silicon that cannot be worked around, it can take up to nine months to fix. So it's super important to make the right decisions.

Let's see how we do this. Since we founded Annapurna back in 2011, every product we've built has been based on three key tenets:

Portability and reusability is a very clear concept in the software world. When you write code, you want to be able to reuse it and deploy it in different locations. We took a similar concept into the silicon space. We only hard-code pieces of silicon where there is a well-defined standard, think Ethernet, compression, or encryption, and of course anything that is data crunching, which is where the acceleration comes from. Everything else is software configurable, so we have a lot of flexibility to address the future needs of customers.

All the products have to be easy to use. If a new piece of technology is hard to use, it's very simple: not many folks will want to make the journey.

And the last one is kind of obvious - cost and price performance. If the products don't provide value to customers, then again, the adoption will be negligible.

So let's go back to 2017. Back then, we started to see key businesses in Amazon like Amazon Go, Prime Air, and Alexa, as well as enabling services like Rekognition and Lex, starting to use deep learning in a meaningful way. At the same time, we started hearing from customers that the cost of running these models, and the performance, had to improve. Today that's way more obvious than it was back in 2017, but even then we noticed it, and we took a bet that the models and their compute intensity would continue to grow.

So we wrote a business case in Amazon, what's called a six-pager, and we basically outlined why we thought 2017 was the right time to start building machine learning chips. We landed on a few main opportunities to help customers:

The first one is optimization. We knew from previous product lines like Nitro that we can get high utilization if we build the chips purposely for machine learning, and lower cost if we design and manufacture in-house with full control of the hardware and software stack. That would enable us to increase utilization and lower the cost to customers.

Performance - we knew we had to innovate on the software stack, and of course in hardware, to make sure we could meet our customers' performance needs.

And integration - we were at a unique point in time when Nitro was already a mature product line with multiple generations deployed, and we had a lot of Lego pieces we could reuse in order to build the machine learning servers.

On the model landscape: back in 2017, computer vision was very common - this is ResNet - and then there were other models like LSTMs and RNNs for NLP and natural language understanding, plus some vector-based representations like Word2Vec.

What you can appreciate is that all these model architectures are fairly different, but they have commonalities. We noticed that all of them are neural network based, which means massive linear algebra computation, so it's a good opportunity to accelerate. And 95% of these operations are matrix operations.

In mid-2017 came the famous paper "Attention Is All You Need". But the first popular model based on that paper came out in 2019 - that's BERT. By 2019 we already had the chips at hand, so it was super important that the chip we designed would be able to address all of these different architectures.

So when we design chips, we can use two methods, right? We can use the crystal ball method, where we try to predict what will happen two years down the line, because it takes two years to design, manufacture, and test a piece of silicon and then deploy it in the data center. Or we can use inversion.

Inversion basically encourages us to inspect the problem not head on, but from the opposite direction. Our problem was that we didn't know which models would be popular in two years, but we could guess what was not going to change.

So that took us back to our tenets: customers will always want more performance, a better cost structure, and a service that is easy to use. And for us as the design team, we wanted hardware we could reuse generation over generation and kind of grow on the capital that we have.

So, let's see how we did this. It's kind of hard to show in slides how we design chips, because basically you'll see pictures of folks coding on computers. But I can tell you that we had to build a new team, and we did that organically. It was basically a startup within Annapurna Labs, which is a startup within AWS, which is a conglomerate of startups.

Back then, we had fewer than 10 chip engineers, no compiler or application engineers, and very little previous ML experience. Don't get alarmed: this picture is not a data center in AWS. It was generated by Stable Diffusion on Inferentia.

So let's go and talk about hardware. This is Ilan. He heads our hardware team, and here he is carrying the first Inferentia chip. After we did all the work of design, simulation, and manufacturing, everything comes back in a box with a few chips in it, and this is what Ilan is carrying back into the office.

And the folks started working on the servers. You can see it's really startup mode. These are the first Inferentia servers, and Jake and Rohit from the hardware and software teams are hand-assembling them in the lab. Well, you don't see a lab here; again, it's startup mode. They had to do it on the floor because we didn't have tables yet. If you look at the back, you'll see the lab being built alongside the servers. So really, startup mode.

But then we get to scale. In order to build these servers, you need huge manufacturing facilities that build the racks, do the cabling in the racks, run the testing, and of course assemble the servers. On the right-hand side, you see the Inferentia server. It's a 4U server. What you see here is the power delivery, the fans for cooling, and of course the cards that hold the chips.

We were able to pack 16 chips inside the 4U server, and you can see it's kind of an origami game to lay out the cards. But the result was pretty amazing, because after we finished manufacturing, we actually had the first Inferentia server running inside the data center.

That's the picture you see here. The server looks pretty clean, but you know it's complex on the inside, because you've just seen the insides of it. More importantly, we had to deploy these things at scale.

So today, we have Inferentia servers in 23 global regions in AWS. Each AWS region is made up of multiple Availability Zones (AZs), each with its own data centers, and we have thousands of these racks deployed. In a word, I think we can summarize it as massive scale.

But this was only the beginning. Looking at these Inferentia racks, like I said, you can see they're pretty clean. There's a top-of-rack switch that sits in the middle of the rack - not accidentally, it's easier to do the cabling if it comes through the middle. Now let's look at the Trainium 1 servers, our training machines.

I hope you can appreciate that there is a lot more going on here. Don't worry, let's break it down.

So this is Trainium 1, just a bunch of Trainium boxes. In the back end, like Inferentia, you see power delivery and fans; the middle is where the Trainium chips reside. I have one of them here with me, and I hope you can appreciate the size - we have eight of those inside the box. It's pretty heavy, so I'll put it down.

And this is the main section. At the front, of course, we have NeuronLink. This is the technology that allows us to interconnect all the chips in a 3D hypercube (torus) configuration. Basically, this ensures a maximum of two hops between any two chips, which gives minimum latency and a high-bandwidth interface.

Going back to the racks, a single Trainium 1 server is actually three separate boxes. There is a head node at the top, and then there are two of the boxes we just reviewed, giving a total capacity of 16 Trainiums that folks use to run training or any other workloads. And of course, each rack has multiple servers in it.

Let's look at Inferentia 2. After we launched Inferentia 1, we launched Inferentia 2. The Inf2 servers provide 3x the performance compared to Inf1 and enable LLM and Stable Diffusion gen AI workloads.

Let's look at the server itself. The back is very similar - power delivery, fans - and in the middle we have 12 Inf2 cards. Those are here.

This is how it looks without the cover. You can appreciate, at least from the size, that there is quite a difference. The reason is that for inference workloads we can lower the compute, because LLM inference is mostly memory bound, not compute bound. So the more memory bandwidth you have, the better the performance will be. And Inf2 is the only inference-optimized server in the cloud that enables interconnectivity between all of those 12 chips, to allow for LLM deployments at scale.
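To make the memory-bound point concrete, here is a rough back-of-envelope sketch, not an official sizing tool: at small batch sizes, generating each token requires streaming the model weights from device memory once, so decode throughput is capped by memory bandwidth divided by model size. The model size, precision, and bandwidth figures below are illustrative assumptions, not Inf2 specifications.

```python
# Rough, illustrative estimate of decode throughput for a memory-bound LLM.
# Assumption: each generated token streams all model weights from device
# memory once, so tokens/sec <= bandwidth / model_bytes.

def max_decode_tokens_per_sec(num_params: float, bytes_per_param: float,
                              mem_bandwidth_bytes_per_sec: float) -> float:
    model_bytes = num_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / model_bytes

# Example: a 13B-parameter model in bf16 (2 bytes/param) on an accelerator
# with ~1 TB/s of effective memory bandwidth (hypothetical figure).
print(max_decode_tokens_per_sec(13e9, 2, 1e12))  # ~38 tokens/sec upper bound
```

Adding compute barely moves that ceiling; adding memory bandwidth does, which is why the inference chip trades compute for bandwidth.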

This is a single server. At the front is the x86 compute, and you can see the two Nitro cards, one for storage connectivity and one for networking connectivity.

So we talked about the servers. Let's take a quick look at the chips themselves. On the right-hand side is Inferentia, on the left side is Trainium. Compared to Inferentia, Trainium has 5x the number of transistors, so it's a much more complex chip. We also added HBM, high-bandwidth memory, which allows us to unleash the performance of the compute.

The chip has, like I said, 3x the compute, and you can also appreciate that it uses a more advanced packaging technology to fit all of those pieces of silicon into one package.

Let's dive into the chip itself. I don't know if you are ready for this complexity, but don't worry, I'll guide you through it.

The main engines - remember, I mentioned linear algebra - are the NeuronCores, and we have two of them in each Trainium. The main component is the tensor engine; that's where all the compute happens. There's also a large SRAM on-chip memory that minimizes memory movement so we can maximize the compute. There are other engines for the remaining 5% of the operations I mentioned earlier, things like scalar operations and vector operations. And there are also 16 general-purpose SIMD cores. Those are important because that's the programmability part I was talking about at the beginning: if there is something that cannot be expressed, or cannot run optimally, on the compute engines we designed, you can write C++ code and just run it on the SIMD cores, with very high bandwidth to the memory.

On the side is the HBM memory we just saw on the package. To enable connectivity across multiple chips, we have four NeuronLinks on each chip, a set of DMAs, and, importantly, a separate collective compute engine. Collective compute operations are what synchronize multiple chips while you do training or inference, and we decided to put in a dedicated engine for that because it allows us to overlap compute and communication, and by that accelerate the training or inference workload.

So now that you are experts in the chip technology and how the servers look, I hope you can appreciate how we built the server. We have 16 of those Trainium chips inside this server, giving a high-bandwidth memory capacity of 512 gigabytes and a peak memory bandwidth of 13.1 terabytes per second. Just to pause for a second: an SSD is roughly two terabytes, so 13 terabytes per second is like 6.5 SSDs being read every second - just to give a sense of how much memory bandwidth exists in this box. Similarly, with Inferentia 2, we put in 12 of the Inferentia 2 chips.
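A quick arithmetic sanity check of those numbers; the per-chip HBM figure is simply what 512 GB across 16 chips implies, not a separately stated spec.

```python
# Quick arithmetic behind the Trn1 server numbers quoted above.
chips_per_server = 16
hbm_per_chip_gb = 512 / chips_per_server          # 32 GB of HBM per chip (implied)
peak_bandwidth_tb_per_s = 13.1
ssd_capacity_tb = 2.0                              # "an SSD is roughly two terabytes"

print(hbm_per_chip_gb)                             # 32.0
print(peak_bandwidth_tb_per_s / ssd_capacity_tb)   # ~6.55 SSDs' worth of data per second
```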

In terms of the chip architecture, you can see it's fairly similar. The main difference on the chip itself is fewer NeuronLinks, because we don't need as much connectivity for inference, so we were able to cost-reduce the chip a bit and pass that value on to customers in terms of cost to train and cost to infer. Here's how it looks in terms of numbers.

This is Llama 2 training done on Trainium 1 compared to the latest-generation comparable EC2 instances. In one sentence: you pay roughly 50% less to train on Trainium versus the alternative solutions on EC2, and this is true for the 7 billion, 13 billion, and 70 billion parameter Llama 2 models. The way we measure it, by the way, is effective flops per dollar, which is the same as saying cost to train. It's a similar picture on the inference side. This is comparing Inf2 to the latest alternative inference instance in EC2, and you can see almost 5x lower cost to deploy Llama 2 at 7 billion, 13 billion, and 70 billion parameters with Inferentia 2. So there are huge benefits for customers who decide to use those accelerators.
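Since effective flops per dollar is just a ratio, here is a small sketch of how such a comparison can be computed. All the throughput and price inputs are hypothetical placeholders, not the benchmark data behind the slide.

```python
# Effective-FLOPs-per-dollar is the metric used above for "cost to train":
# the useful training FLOPs you get per dollar of instance time.
# All inputs below are illustrative placeholders, not published benchmark values.

def effective_flops_per_dollar(achieved_tflops_per_s: float,
                               instance_price_per_hour: float) -> float:
    flops_per_hour = achieved_tflops_per_s * 1e12 * 3600
    return flops_per_hour / instance_price_per_hour

trn1 = effective_flops_per_dollar(achieved_tflops_per_s=100.0,   # hypothetical
                                  instance_price_per_hour=20.0)   # hypothetical
gpu = effective_flops_per_dollar(achieved_tflops_per_s=120.0,    # hypothetical
                                 instance_price_per_hour=45.0)    # hypothetical
print(f"Trainium delivers {trn1 / gpu:.2f}x the effective FLOPs per dollar")
```

In this made-up example, about 1.9x the effective flops per dollar corresponds to roughly 45-50% lower cost to train the same model, which is how a "pay 50% less" claim maps onto the metric.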

And we have quite a few of those customers. I'm hoping everybody heard Dario in the keynote from Anthropic, who are developing their next-generation models on top of Trainium and Inferentia. Adobe is adopting Inf2 for their Firefly generative AI image service. Stability AI is seeing 30% lower cost to train their LLM models in multiple languages and 60% lower cost per image when they use SDXL. Worldwide, we have customers from Japan like Stockmark, who trained a 13 billion parameter Llama model to speak Japanese - so Llama can now say arigato - and they did that in less than a month on top of Trainium, and of course with significant savings like you've seen. Ricoh is also deploying Llama-like models in Japanese and training them on Trainium. By the way, the Stockmark model is available on Hugging Face, which is pretty cool: the customer ecosystem is starting to contribute back models that were trained on Trainium. And of course, we have Databricks and Leonardo, who you will hear from today.

Let's talk a bit about the software. Our software SDK is called Neuron. We'll show you a bit of how to use Neuron, but we actually want Neuron to be as thin a layer as possible and to leverage open source code and frameworks as much as possible, because that's what all of you are used to with alternative infrastructure.

So we always take the plug-in approach: we have a plug-in for PyTorch, a plug-in for TensorFlow, and we are working on JAX support that is coming soon, as well as easy-to-use frameworks like Hugging Face and Ray. Of course, SageMaker JumpStart too - all of those are available for Trainium and Inferentia. And it literally takes a few lines of code. We'll take a look at that.
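As an illustration of the "few lines of code" point, here is a minimal hedged sketch of the PyTorch plug-in pattern, in which the Neuron device is exposed through the XLA device abstraction. It assumes the torch-neuronx/torch-xla packages on a Trainium or Inferentia instance; treat it as a sketch of the pattern rather than a copy-paste recipe, since package and API details vary by release.

```python
# Minimal sketch: running an ordinary PyTorch model through the Neuron
# plug-in via the XLA device. Assumes torch-neuronx / torch-xla are installed
# on a Trn1/Inf2 instance; details may differ by SDK version.
import torch
import torch_xla.core.xla_model as xm   # provided by the XLA stack the Neuron plug-in builds on

device = xm.xla_device()                # the Neuron device is exposed as an XLA device

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024).to(device)
y = model(x)                            # lazily traced, then compiled by the Neuron compiler
xm.mark_step()                          # flush the graph so it actually executes on-device
print(y.shape)
```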

Next, let's focus on Hugging Face. Hugging Face is the most popular model repository in the world - millions of models, millions of downloads. They have been working on a package called Optimum Neuron. Optimum Neuron essentially implements the transformer technology so that it fits optimally and performantly onto Inferentia and Trainium, and what that allows is a very large breadth of models that work on Trainium and Inferentia. I'm happy to say that the latest count is 93 of the top 100 models in the Hugging Face repo running performantly on Trainium and Inferentia - an amazing partnership with Hugging Face, and this will just continue to improve from here.
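To show roughly what the Optimum Neuron integration looks like in practice, here is a hedged sketch of exporting and running a Hugging Face causal LM on Inferentia 2. The class name and export arguments follow the optimum-neuron documentation as I understand it, but they may differ across versions, and the model id and shapes are just example choices.

```python
# Illustrative sketch: compiling and running a Hugging Face causal LM on Inf2
# with optimum-neuron. Exact class/argument names may vary by optimum-neuron version.
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # example model; any supported causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True triggers ahead-of-time compilation for the NeuronCores;
# the static shapes and core count below are illustrative choices.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
)

inputs = tokenizer("Trainium and Inferentia are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```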

We'll get back to Hugging Face in a second. But before that, I want to talk about our compiler. The compiler is the main piece of technology our team has developed, and it takes Python models and lowers them down, performantly, to machine language that the chip can run.

Let's do a quick crash course on compilers. There are four stages. The first one is actually hardware agnostic: these are generic graph optimizations that we do to minimize the amount of compute as much as possible. Those are operations like operator fusion or common subexpression elimination - meaning if there are pieces of code that generate a fixed result, we can eliminate those pieces of code and compute less. That's the first step. After that, we take each operator, represent it in loop format, and iterate to optimize that piece of compute with various techniques like tiling, vectorization, and pipelining.
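As a toy illustration of those first-stage graph optimizations, operator fusion and common subexpression elimination, here is some conceptual Python. This is not the Neuron compiler's actual IR or passes, just the idea written out by hand.

```python
# Conceptual illustration of two graph-level optimizations mentioned above.
# Toy code, not the Neuron compiler's real intermediate representation.
import numpy as np

x = np.random.rand(1024, 1024).astype(np.float32)
w = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

# Before: (x @ w) is computed twice, and add/relu run as separate passes
# that materialize an intermediate tensor in memory.
t1 = x @ w
y1 = np.maximum(t1 + b, 0)        # matmul -> add -> relu as separate graph nodes
t2 = x @ w                        # common subexpression: same result as t1
y2 = t2 * 2.0

# After common-subexpression elimination: compute x @ w once and reuse it.
# After operator fusion: add + relu are applied in a single pass over the data,
# avoiding an extra round trip through memory for the intermediate tensor.
shared = x @ w
y1_opt = np.maximum(shared + b, 0)
y2_opt = shared * 2.0
assert np.allclose(y1, y1_opt) and np.allclose(y2, y2_opt)
```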

Essentially, what's happening there is that we arrange the computation in a way that keeps the accuracy and the intent of the model developer, but makes it more performant and run more efficiently on the hardware. At the end of this phase, we take the inner loops and move them to hardware-intrinsic mapping, which basically lowers them to a language that Trainium and Inferentia can run. And then the last step is scheduling and allocation, which takes all the pieces of compute and orders them in a way that parallelizes the compute as much as possible, again to increase efficiency.
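And a toy sketch of the loop-level stage: the same matrix multiply written as tiled loops so that each block of work fits in fast on-chip memory and gets reused. Again, this is purely conceptual Python, not what the compiler actually emits.

```python
# Conceptual loop tiling: process the matrices in TILE x TILE blocks so the
# working set fits in on-chip SRAM, improving data reuse. Toy example only.
import numpy as np

N, TILE = 256, 64
a = np.random.rand(N, N).astype(np.float32)
b = np.random.rand(N, N).astype(np.float32)
c = np.zeros((N, N), dtype=np.float32)

for i0 in range(0, N, TILE):             # iterate over output tiles
    for j0 in range(0, N, TILE):
        for k0 in range(0, N, TILE):     # accumulate partial products tile by tile
            c[i0:i0+TILE, j0:j0+TILE] += (
                a[i0:i0+TILE, k0:k0+TILE] @ b[k0:k0+TILE, j0:j0+TILE]
            )

assert np.allclose(c, a @ b, rtol=1e-4, atol=1e-4)
```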

So now that you are all compiler experts, I'm very happy to announce a new feature that is coming soon. We call it NKI, the Neuron Kernel Interface. NKI allows developers to write their own high-performance kernels on top of Trainium. It's basically a bare-metal interface that lets you develop whatever innovation you want and run it performantly on Trainium. You can see it essentially bypasses almost all of the steps of the compiler and lets you run natively, bare metal, on top of Trainium. The APIs are in the same style as the APIs OpenAI invented with Triton, so folks who are familiar with Triton from OpenAI will find this environment familiar. You can see an example of softmax, which is part of a very popular operation in language models. We actually used NKI to improve softmax performance on top of Trainium, so we already use it internally, and it's coming soon to customers along with detailed documentation of the hardware. Super exciting - this will allow customers to invent new models on top of Trainium performantly.
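The softmax example on the slide is written against NKI itself. As a stand-in, here is a row-wise softmax kernel in OpenAI Triton, whose API style NKI is said to resemble; it's a sketch of the programming model, not actual NKI code, and the launch is shown only as a comment.

```python
# Row-wise softmax written with OpenAI Triton, as a proxy for the Triton-like
# NKI programming model described above. Not actual NKI code.
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                        # one program instance per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                     # subtract the max for numerical stability
    num = tl.exp(x)
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + cols, num / denom, mask=mask)

# Launch sketch: one program per row; BLOCK_SIZE must be a power of two >= n_cols.
# softmax_kernel[(n_rows,)](y, x, n_cols, BLOCK_SIZE=1024)
```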

Let's go back to Hugging Face. When you use Hugging Face with Trainium - this is a training example - you use Optimum Neuron. Optimum Neuron is a package, basically a wrapper technology that we supply, and it automatically helps shard models. It can be used for training and for inference; today we'll talk about training, but similar principles apply for inference as well.

Tensor parallelism is a way to shard the model layers across multiple accelerators; you can see it graphically at the top. In order to achieve that with Neuron Distributed, all you need to do is import the needed bits and then define a tensor parallel (tp) size of four. That will shard the model into four different subsections that can run concurrently on different accelerators.

Similarly, if you want to do pipelining, which is another sharding technique, you usually pipeline across servers, so you do that over the network and cut the model in the middle. You define a pipeline parallel (pp) size of two and load the model, and it will run. This is again the same Hugging Face Llama 7B model.

So now that you know how to do this, you can see that in order to run the full training, the tp size is defined as eight. Then there are some additional hyperparameters you can control and define, and you simply create the Hugging Face trainer instance and start training. A sketch of how that comes together in code follows below.
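Here is a hedged sketch of what that training setup can look like with Optimum Neuron. The trainer classes and the tensor-parallel argument follow the pattern described above, but exact signatures may differ by optimum-neuron version, and the dataset, model id, and hyperparameters are placeholders.

```python
# Sketch of fine-tuning a Llama-style model on Trainium with Optimum Neuron,
# sharded with tensor parallelism across 8 NeuronCores. Argument names,
# dataset, and hyperparameters are illustrative; check your version's docs.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model_id = "meta-llama/Llama-2-7b-hf"                  # example model from the talk
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token              # Llama tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")  # placeholder dataset
def tokenize(example):
    return tokenizer(example["instruction"], truncation=True, max_length=1024)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

training_args = NeuronTrainingArguments(
    output_dir="llama2-trn1",
    tensor_parallel_size=8,          # the tp size of eight from the slide
    per_device_train_batch_size=1,   # illustrative hyperparameters
    learning_rate=5e-5,
    num_train_epochs=1,
    bf16=True,
)

trainer = NeuronTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```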

Once you start training, you get a set of utilities like neuron-top, which is like Linux top: it gives you a visualization of the utilization of the NeuronCores, their memory utilization, and of course the host. You also get all of that logged, so you can trace it over time; it's a very powerful tool for understanding how effectively you are utilizing the hardware. Similarly, there's native integration with things like TensorBoard, so you can track the curves of your training and make sure everything is progressing according to plan.

The last update I want to give you is very exciting as well. When we launched Trainium, we were building UltraClusters. An UltraCluster is the non-blocking EFA networking fabric within AWS that allows these massive clusters of chips to work together. When we launched, our maximum cluster size was 30,000 chips. I'm now happy to announce that we are already building 60,000-chip Trainium clusters in the data center. That will allow running even larger models and, of course, finishing training faster.

With that, let's hear how Databricks is using Trainium. Naveen?

Thank you. As a former hardware engineer myself, it was cool to see the early days, but a little cringeworthy when you saw the servers on the carpet. That's a big no-no, right?

Anyway, my name is Naveen Rao. I was previously the CEO and co-founder of a company called MosaicML, which was recently acquired by Databricks, essentially to run and build the generative AI strategy at Databricks.

If you're not familiar with Databricks, it's a data platform for organizing data across many different tools and building pipelines for things like ETL and business analytics. Generative AI has now come on the scene as, I think, the way to express the value of that data, so it was a very natural partnership for us. In fact, we had multiple customers that we were working with together, using Databricks and Mosaic to build models.

And it just became really clear that it made sense to join forces. So what do we care about? We are building models ourselves, and we're using those as a base for our customers to then go and customize. So cost, performance, scale - all the things that Gadi was talking about - actually become quite important to us.

We have about 10,000 customers - can you hear me OK? It's kind of going in and out - about 10,000 customers, 6,000 employees, and $1.5 billion in revenue, and we're growing very fast. I think this whole idea of organizing enterprise companies' data and then providing tools on top of it is really starting to gain a lot of maturity. I think this is going to be something that's as big as cloud in the next 10 years or so.

When we start talking about things at this scale, you inevitably have to start talking about cost, and you have to start talking about optimizing that through software and great engineering. That's why we're pretty excited to be here with the Trainium team and to be part of that journey with them, partnering early. It's a small team, and I think they're hungry to make something that's really great, so I'm hoping we can achieve that together.

As I mentioned, the data lakehouse was a pattern that Databricks established as a company to unify all of an organization's systems of data, with some visualization on top of it. Adding generative AI on top of this as a natural extension equals what we call the data intelligence platform. Really, generative AI is a way to take data and make it interactive: I can now ask questions of it, I can reason over things that are relevant to that data, and I can build better tools for things like entity extraction and removing PII. These simple things actually work really well. They sound somewhat boring, but in an enterprise context this is a very hard problem to do well, and it can have massive operational savings for a lot of companies out there. That's what we do as a company.

Our platform roughly looks like this. We have what we call a Delta Lake, which is where you essentially dump all your data. Things can be in S3 buckets; they can be structured or unstructured, all kinds of different forms of data. And we apply a governance framework called Unity Catalog. This allows us to trace the lineage of data - where it came from, when it was imported, who imported it, maybe some permissions around it - and essentially apply access control to that data, so certain users can see certain types of data based on those attributes. When we start talking about generative AI models, again, a model is an expression of data, so we want to apply that governance model to the models as well. Let's say I take an open source model like Llama 70B and I fine-tune it with some sensitive data for a sensitive application. I actually want that fine-tuned model to inherit the attributes of the data, because the model can now potentially regurgitate some of those facts. That might be a feature, it might be something I want, but I don't want all my users to be able to see it, so the model needs to inherit those attributes and that access control. And that's what our platform does: it creates all of these tracing and auditability mechanisms.

So if you're in a regulated industry, say, you need to understand that this user hit this API and got this output at this time on this model that was built on that data. All of that audit log needs to be tracked somewhere in an organized fashion, and that's a lot of what we do. The intelligence engine is actually a new thing we're bringing out - we just demoed it at, we'll say, a competitor's cloud recently - where you can start typing queries in English. What we're trying to do now is expand this whole platform with generative AI into the data intelligence platform. We have this set of tools, like SQL engines, where you can build very structured queries to find specific kinds of information or specific kinds of dashboards on your data, but we also want to extend that to users who are not experts and can't code in SQL. So being able to query things in English - think "what were my sales in EMEA for the last 10 years for this product?" - that might be something an analyst would ask. Right now, that person has to go and get someone who can code that up into SQL, and that SQL person needs to understand how the databases are structured and what joins have to be done. We want to collapse all of that into just making an English query. I think that's really where we're going over the next several years.

So it's a very exciting time. For the last 20 years in tech, we've been talking about data and logging it, but we couldn't really do anything with it. Now we're getting to the point where we have the tools to do that, and I think gen AI is a big unlock there. Because of this unlock, and the tools we're bringing to the market, we actually see it changing the future of work. We're seeing humans become more efficient. I know there's a lot of narrative out there about doom and this and that, but I really see it as a way to make each human more efficient, not as a replacement. One human who does a certain task - an accountant or a lawyer, say - can do that task more efficiently and have essentially higher throughput, which reduces cost and frees those humans to do other things that might be a higher priority. So I really see this assistant pattern over and over again.

There are numbers here from some surveys we've conducted: 75% of CEOs are now looking at generative AI as a competitive advantage. Most of them are thinking about the operational side of it, but it's going to start hitting their products soon. I think that's pretty exciting - we can start interacting with computers in naturalistic ways, which is something tech geeks like me have dreamt about my whole life, so it's really great to see. We're seeing this across different verticals, and I'm sure everyone here is too, but financial services and healthcare are really the areas where we're seeing a lot of uptake first - beyond the tech world, of course, which has embraced the technology the fastest.

We're now starting to see media creating new images and video from large-scale AI models, which has only been possible for maybe the last year or even just the last several months. That stuff is getting really good - you can do very rich video - and I've been seeing new companies launch all the time with this. We'll hear from one in just a moment. I think the future of media is being shifted; the ability to rapidly prototype is changing, so we're seeing some uptake there, and some of our customers are now starting to build these sorts of things. Of course, manufacturing and government will be a long tail over the next 10 years.

So why do companies come to us to build models? There are lots of use cases for closed models like OpenAI's or Anthropic's, but they don't cover every use case. There are many times when you need control over the model, meaning you need to understand what data went into it, you need to understand where it's being served, and you need to control its outputs very precisely. Especially if you're in a regulated industry, you can't have something where you don't understand what data went into it. We see these little prompt injection attacks happen all the time now. That's fine for consumer stuff, but if I'm running something that's trusted with legal documents or tax or finance, you really can't do that. So that amount of control is actually quite important.

Getting things to be very good in-domain requires domain-specific tuning, and that's a lot of what we do: can we make an existing model better based on the specific data and observations our customers have? And then, of course, lower cost. Everything at the end of the day comes down to cost; it just always does. As these solutions become more ubiquitous, the scale goes up, and cost starts becoming the most important factor. So it's really important to think about that scale up front. If you understand where a solution is going to go in a year, that gives you a lot of hints on how you should attack the problem. Maybe I don't want to invest in just an API; maybe I actually do want to start thinking about using tools like Databricks Mosaic on AWS to customize models, because then you can really control the costs.

And actually, this is something we did from the very beginning. Mosaic started in 2021, which seems like ancient times at this point, but we wanted to apply algorithmic methods to reduce costs. In essence, every computing flop - like Gadi was saying, these matrix multiplications are just floating point operations - every one of them is precious. Each one costs money, no matter how many billions of them you have. There's a physical piece of silicon running it that uses energy and takes up space, and all of that costs money. Can I make each one of those flops more effective in learning from data? That was the question we asked, and we have actually done a lot on this and reduced the cost dramatically on certain things.

This is actually training Stable Diffusion - we have our own diffusion model. It was something on the order of $600k at the end of last year, and we brought that under $50k; I think it's actually close to $30k to train a model from scratch. Something at $30k is quite accessible, and we're seeing similar drops for LLMs. For GPT-3-sized models, we published a blog last year showing it was about $450,000 to train on A100 GPUs, and with things like Trainium we're going to bring that cost dramatically down - this is going to be something on the order of below $100,000. We have a whole new set of innovations on this, and by also using lower precision we've already brought it internally much, much below $100,000. That makes these things super accessible: many enterprises can build and customize these models for their use cases.

This is actually a fun case study we had. The Texas Rangers, the baseball team, is one of our customers, and they use Databricks and the ML capabilities to do analytics on their players - actual swing motions of batters and pitchers and things like this. It's a funny story because it's almost hard to believe, but they literally went from a middling team, they used Databricks, and now they've won the World Series. I don't know if we can claim credit for it or not, but it is kind of a cool story. And what's funny is that a lot of other Major League Baseball teams are now signing up. So I think we're seeing data analytics, ML, and gen AI coming together to solve all kinds of interesting problems. This is just a fun one that we can all sink our teeth into a little bit.

OK, so getting to the technical stuff a little bit. What we built at Mosaic is a platform that's actually multi-cloud.

"Um, I know if I can say that in an AWS conference but uh, multi cloud and multi hardware and really the idea is that we can bring the price down by giving more flexibility to the user across different hardware and even different clouds and make the performance higher. So can we deliver more faster and cheaper? And that's why we're really keen to work with the Trainum team because they're focused on that specifically.

And I think this is a key tenet of AWS: deliver more for less, constantly. That's exactly what we're trying to do for users. So, you know, Gadi went through the software stack - that's still hard to use, right? And it's not a bad thing; it's just a deeply technical software stack. It's not something a general user of PyTorch could probably go and optimize on their own.

And so that's where we come in: we package these things, we make them work at scale, and we deal with node failures and all these kinds of things. We essentially abstract away the details of running the infrastructure and make it super easy for our customers to move between GPUs and Trainium or whatever.

We have some critical components here like Composer and LLM Foundry. These are open source tools - you can go play with them yourself - and they're the basis of our stack. There are a lot of other parts of the managed stack, of course, but we support inference, fine-tuning, and pre-training - there are many different types of fine-tuning as well - and now we are adding Trainium support.

And I think Gadi touched on this as well: interesting forms of parallelism. As I mentioned, I have a hardware background, so I love this stuff. New kinds of parallelism are very interesting to any hardware geek, and actually any linear algebra geek. But really what it comes down to in the end is that I can build bigger models, make them go faster, and utilize that hardware more effectively.

These kinds of novel interconnects actually allow me to use more flops per chip. Just a quick primer on the economics here: the chip is a fixed thing - it's a physical device - and I want to extract everything out of it. The only way I can do that when I'm distributing the workload is to make sure I have enough communication bandwidth, in the right sort of configuration, to extract all of those flops.

And so really these 3D parallelisms make a big difference here. The team is now adding support for the MPT models - these are our foundation models, Mosaic Pretrained Transformers - and we'll have a new set of models coming out very soon that we hope to partner more deeply with the Trainium team on. And Llama 2, of course, we support as well, as does Mistral.

So, looking ahead, we're working on bigger and bigger cluster sizes along with the Trainium team, and we're hoping to see that great performance that we can pass on in terms of cost and speed to our customers. Thank you for your attention.
