Train and tune state-of-the-art ML models on Amazon SageMaker

Good afternoon, everyone. My name is Gal Oshry. I'm a product manager at AWS working on SageMaker. I'm here with Emily Webber and Thomas Kollar to talk to you about training and tuning state of the art machine learning models on Amazon SageMaker.

Before we start, how many of you are already training machine learning models today? Awesome. How many of you are training models with more than 10 GPUs or accelerators? All right. Anyone with more than 100? All right. 1000? No. All right, cool. Well, today we'll learn a bit about that.

So we'll talk about the challenges for training large scale machine learning models and how SageMaker can help you train those models. Emily will then talk to you about fine tuning and pre-training large language models and show you a demo training Llama 2 7B on SageMaker. And then we'll hear from Tom about Toyota Research Institute and their machine learning use cases.

So machine learning has already proven itself useful across a wide range of applications, from recommendations to credit risk prediction and from autonomous driving to document analysis. But recently there's been an explosion of interest in deep learning models for computer vision and natural language processing. Just a few years ago, this is the type of image you would get if you tried to generate a very clean living room with a machine learning model. You can tell that it's fake; it's not really coherent. You can't even tell that it's a living room. And just a few years later, we can now generate images like this, where you'd have to look really closely to tell that it's not a real image. I showed something similar a year ago at re:Invent, and at that time this was kind of shocking, right? Seeing this type of image get generated from a model. But a year later, I think many people in the audience have already seen these types of images get generated and the quality that you can get with machine learning models.

So how did this happen? Well, first, there were notable algorithmic improvements over the last few years, specifically the transformer architecture, which is used in many of the large scale models that you hear about today. However, there's also been an increase in the dataset sizes, the model sizes, and the amount of compute used to train these models. And a lot of the research shows that we can continue increasing these dimensions to get better and better results. So to be competitive, you really have to think about how you leverage these advancements to provide the best experiences for your customers with machine learning.

OK. So training large scale models is awesome. Let's just train the biggest one immediately and be done with it, right? But it's a bit more complicated than that. There are some challenges. The first is that you want to use the latest hardware. Every few years there are innovations in hardware that lead to 2 to 9x improvements in training efficiency, but it's not enough to get access to the latest hardware. You have to think about how well it works. Is it fault resistant enough to let you continue your training with minimal interruptions to the machine learning team? You have to think about orchestration and how to most effectively use the resources you have available, especially if you have a large team of data scientists who want to train many models in parallel.

We talked about how you want to have larger datasets, and being able to store, load, and process these large datasets can require a lot of work, and there are a lot of pitfalls in doing that. You want to think about scaling up both the infrastructure, to get more compute for training the model, as well as the algorithms that you use. The models that we train today for these use cases do not fit on a single accelerator, so you have to think about the algorithms you need to scale up. And finally, we have to think about cost. Training these models can cost hundreds of thousands or millions of dollars, so you need to think about efficiency when you're training them, especially at the beginning, when you're doing sporadic experimentation, trying out different ideas, and not using the hardware all the time. So you want to think about how you use that efficiently. And it's not just the financial cost but the team's time. A lot of customers tell us that making sure their ML engineers are not spending time dealing with infrastructure is one of their top priorities. But not all hope is lost.

Amazon SageMaker can help with many of these challenges. I'll give a high level overview of how SageMaker works, but you'll see it in a lot more detail during Emily's demo.

We start by calling the CreateTrainingJob API. This API captures information about your dataset, your compute resource configuration, and the training algorithm that you want to use. SageMaker will then set up the cluster for training the model, with the right VPC and networking configurations by default to save you a lot of time, but you can also configure all of it yourself and add the flexibility that you need as part of spinning up the cluster.

SageMaker will also run health checks on the hardware to make sure everything is working effectively before the job even begins and before the billing starts. This saves you time and money and makes sure that the training can continue efficiently.

SageMaker will then load data from S3, EFS, or FSx for Lustre, and you have options to either copy or stream the data. Depending on your dataset size, one of those might be more applicable. But again, and you'll hear this theme again and again, while SageMaker provides great options for getting started and moving quickly, you also have the flexibility to do what you want and load data from other sources.
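As a rough sketch, the copy-versus-stream choice comes down to the input mode on the data channel; the bucket and prefix below are placeholders, not from the talk:

```python
from sagemaker.inputs import TrainingInput

# "File" mode copies the whole dataset to the instance before training starts;
# "FastFile" streams objects from S3 on demand, which suits very large datasets.
train_input = TrainingInput(
    s3_data="s3://my-bucket/datasets/train/",  # placeholder location
    input_mode="FastFile",
)
```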

You can then download the training image from ECR. You have options of built-in algorithms in SageMaker, you can use one of the SageMaker Deep Learning Containers to quickly use PyTorch, TensorFlow, or Hugging Face, or you can bring your own training image with your own algorithms entirely.

SageMaker also offers distributed training libraries that can help accelerate both data and model parallel training, and you'll hear more about that later.

SageMaker then starts the training and streams the logs to CloudWatch. Throughout training, it stores the metadata and hyperparameters so you can view them later, and you again have options for using TensorBoard and other tools to visualize your experiments. It will synchronize your checkpoints throughout training to your storage, which is critical if you want to be fault resistant: in case something fails during training, you don't want to lose your progress up to that point. At the end of the training, SageMaker will save the model and other output data so you can revisit it later. It also spins down all the compute, so if the job fails at 3 a.m., no one has to wake up to turn anything off, and you're not paying for all those instances running without being used to train a model. And with the same paradigm, we can scale up our training to many more instances really easily to get those large scale models.
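A minimal sketch of wiring up that checkpoint synchronization, assuming a PyTorch training script; the script name, paths, and instance settings are placeholders:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                           # hypothetical training script
    role="<execution-role-arn>",
    instance_type="ml.p4d.24xlarge",
    instance_count=4,
    framework_version="2.0.1",
    py_version="py310",
    checkpoint_local_path="/opt/ml/checkpoints",      # where the script writes checkpoints
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # SageMaker syncs them here during training
)
```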

One really awesome feature that we launched this year is cluster repair. If one of the instances fails during training, we look at what happened to that instance and decide whether to reboot it or replace it with a different instance, and then restart the training within a few minutes. So there are all these resiliency capabilities to ensure the training continues as quickly as possible and without manual intervention. In case any of that sounds intimidating, the good news is it's actually really easy to get started.

The most important code for converting model training to a SageMaker training job is the Estimator API. You'll see more of this in the demo later, but at a high level, the API takes a Python file as an entry point, in this case cifar10.py, which is very similar to how I would do the model training on my laptop. I also provide the instance type I want to use, how many of them, and hyperparameters, which I can easily change later to try additional training jobs. I also add metric definitions so that I can view those metrics in CloudWatch during the training. Finally, I provide a path to my data and call Estimator.fit.
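A hedged reconstruction of that Estimator call, with illustrative values standing in for the ones shown on the slide:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="cifar10.py",             # the same script I would run on my laptop
    role="<execution-role-arn>",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    framework_version="2.0.1",
    py_version="py310",
    hyperparameters={"epochs": 10, "batch-size": 256, "lr": 0.01},
    metric_definitions=[
        {"Name": "train:loss", "Regex": "loss: ([0-9\\.]+)"},  # surfaced in CloudWatch
    ],
)

estimator.fit({"training": "s3://my-bucket/cifar10/"})  # path to my data
```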

Now, the even better news is that we recently made it even easier to get started. You can take your existing Python code and add the @remote decorator to it to immediately serialize the runtime, the packages, functions, and everything else so that it runs as a SageMaker training job, without even having to learn about the Estimator API.
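A minimal sketch of that decorator pattern; the instance type and function body are illustrative:

```python
from sagemaker.remote_function import remote

@remote(instance_type="ml.g5.xlarge")
def train(learning_rate: float) -> float:
    # Existing local training code goes here; the runtime, packages, and this
    # function are serialized and executed as a SageMaker training job.
    ...
    return 0.0  # e.g. final validation loss

# Calling the function launches the remote job and returns its result.
result = train(3e-4)
```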

Once the training job begins, you can easily view the metadata and reproduce it later or clone that training job. Tracking experiments is extremely important; you want to learn from what you've done before. We often see people who keep the training results in a spreadsheet or even in a document and pass it around the team, but that makes it more difficult to collaborate and learn from past experiments. By automatically keeping all of this in one place, it becomes much easier to learn from your mistakes and build better models.

Now let's move on from tracking the training to improving the performance, specifically the training speed, which impacts how much time you end up needing the instances to train the model, the overall project completion time, and the cost.

Now, the SageMaker Profiler is an ML observability tool that enables you to understand hardware utilization and root cause performance issues to maximize the efficiency of your model training. On the dashboard shown here, we can see some overall metrics around GPU usage, which you want to be as high as possible, as well as the GPU usage throughout the training job across each individual node within your cluster. So you can see that even if your utilization is high overall, within some intervals there might be low utilization that you want to check out a bit more, lower down on the dashboard.

There are other metrics, for example, the total time spent on each GPU kernel. So that gives you additional hints about what you want to optimize next to further improve your training. There's another page in the profiler showing a more detailed timeline view that allows you to get data from your host and devices at all the different levels. So you can dig deeper to understand what is happening at each point.
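As a sketch of turning the profiler on for a training job; I'm assuming the ProfilerConfig/Profiler imports shown below, so check the current SDK documentation for the exact names:

```python
from sagemaker import ProfilerConfig, Profiler
from sagemaker.pytorch import PyTorch

profiler_config = ProfilerConfig(
    profile_params=Profiler(cpu_profiling_duration=3600)  # seconds of detailed profiling
)

estimator = PyTorch(
    entry_point="train.py",                # hypothetical training script
    role="<execution-role-arn>",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    framework_version="2.0.1",
    py_version="py310",
    profiler_config=profiler_config,       # dashboards then show GPU utilization and kernel time
)
```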

Now, I'm excited to announce a preview of a new capability in SageMaker: smart sifting of data. Smart sifting is an online data refinement technique that can reduce your deep learning training time and cost by up to 35%. When you train your models, and we talked about wanting to use larger datasets, the quality of the datasets also matters. Some samples in your data might be less informative to your model training, or you might have seen those samples already; there might be duplicate or similar data. And it's often difficult to preprocess that data and remove the data that you don't want in the training anymore.

So Smart Sifting helps because it analyzes your data during the training job and filters out the low loss samples which are less informative to the model. By training on the subset of your data, you can reduce the time and cost of the training by up to 35%. And because it only filters out the low loss samples, it has minimal or no impact to the final training accuracy and it's easy to get started because it does not require you to make changes to your data or training pipeline.

Here's a simple example that uses Smart Sifting. We use the SageMaker Deep Learning container, we load the Sifting data loader and then we wrap whatever existing data loader we use with the Sifting data loader and provide a bit more configuration to start using it. I don't need to change the rest of my model or data pipeline and we already have customers who are seeing great results with this capability.
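A sketch of that wrapping step, assuming the smart_sifting module and class names below (they are my reading of the documentation, so verify them); the original data loader, model, and loss come from the existing training script:

```python
from smart_sifting.dataloader.sift_dataloader import SiftingDataloader
from smart_sifting.sift_config.sift_configs import (
    RelativeProbabilisticSiftConfig,
    LossConfig,
    SiftingBaseConfig,
)

sift_config = RelativeProbabilisticSiftConfig(
    beta_value=3,
    loss_history_length=500,
    loss_based_sift_config=LossConfig(sift_config=SiftingBaseConfig(sift_delay=10)),
)

# Wrap the DataLoader the script already uses; low-loss samples are filtered online.
train_dataloader = SiftingDataloader(
    sift_config=sift_config,
    orig_dataloader=train_dataloader,   # existing PyTorch DataLoader
    loss_impl=sifting_loss,             # hypothetical Loss implementation used to score samples
    model=model,                        # the model being trained
)
# The rest of the training loop is unchanged.
```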

For example, LG AI Research used it to get a meaningful increase in training performance without any change to model accuracy.

We also announced Amazon SageMaker HyperPod. This enables customers who are training large scale models to get all the managed benefits of SageMaker that we've been discussing today, with a UX they may be more familiar with: being able to access the instances directly, use Slurm, and so on.

It has similar resiliency capabilities to what we discussed for training, in terms of replacing faulty instances and enabling the training to resume more quickly, saving up to 20% in time. It also benefits from the optimized distributed training libraries that improve performance for both model and data parallel training.

As I mentioned, it provides more granular control over the cluster and what you're doing: being able to access the instances directly, install additional software, and make any changes that you want to the cluster to fine tune your training a bit more.

Now, we've talked about training really large scale models, but sometimes you don't need to do that. Sometimes you just want to fine tune an existing model, and that's beneficial if you have an existing foundation model and you want to bring in your own data to adapt it to a particular use case.

So by bringing in your own data, you're making the model better than if you were just using an off the shelf foundation model. But it saves you a lot of time and money because you don't have to train that whole model from scratch.

Now, the challenge is that some models are not open sourced, right? You can't download the model weights and fine tune them yourself in an existing SageMaker training job. But this has changed with enhancements we've made to SageMaker algorithms and model packages.

You can now easily and securely customize third party models by fine tuning them on your private data. This provides end to end security. The model provider can offer their model without revealing their model weights, and you as a customer can fine tune that model by bringing in your own data without exposing that data to the model provider. The final model weights after your fine tuning are also only available to you.

Now, this can be done with a variety of models and algorithms, for example Cohere models. All of this is easy to use, done through the SageMaker Python SDK that we were discussing earlier, and integrates with other SageMaker capabilities like SageMaker Experiments and Pipelines.
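A hedged sketch of what that looks like with a marketplace algorithm through the SDK; the ARN, instance type, and hyperparameters are placeholders:

```python
from sagemaker.algorithm import AlgorithmEstimator

estimator = AlgorithmEstimator(
    algorithm_arn="arn:aws:sagemaker:us-east-1:<account>:algorithm/<provider-algorithm>",
    role="<execution-role-arn>",
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    hyperparameters={"epochs": "3"},     # whatever the provider's algorithm exposes
)

# My private data stays in my account; the provider's weights are never exposed to me.
estimator.fit({"training": "s3://my-bucket/private-fine-tuning-data/"})
```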

And of course, with SageMaker Inference, you can deploy the models at the end to use them for inference and production scenarios in a secure way.

I'll now hand it over to Emily to talk about fine tuning and pre-training LLMs on SageMaker.

Alright, thanks Gal. Great! So I hope you're as excited as I am about a lot of these new launches and new features. I should introduce myself. My name is Emily Webber. I lead our generative AI foundations technical field community here at AWS.

In particular, many of those launches came directly from conversations with you. We were listening to customers and chatting with you to understand key capabilities that you wanted to see in our training stack, and this led to a number of the features that you've just learned about.

In any case, there are many ways to customize a large language model here. I'm presenting them on two axes, right? So at the bottom, you have roughly complexity and then cost. Obviously you want to be closer to the left, you want your LLM customization techniques to be roughly easy because then of course, that's faster for you to get started and then that's less expensive.

However, there is a progression of these techniques. Most customers will start with prompt engineering, which is a nice way to easily improve and then customize your large language model. However, it's not as accurate as some of these extra techniques that you can use.

Most customers will move from prompt engineering into what we call a retrieval augmented generation stack, where you have some set of data, you're converting that data into embeddings, that dense representation, and then retrieving those documents to interact with your consumers. This then can transform, if you will, into a fine tuning stack; actually, there's a bit of an overlap there. But in any case, you can, as Gal mentioned, take custom data and fine tune your model to add that extra knowledge.

All of these techniques however pale in comparison to the holy grail, which is pre-training, which is creating a net new foundation model. And so all of these techniques are available on SageMaker and well supported by the stack. So we're going to learn how to move from fine tuning into pre-training during our session here today.

Now, fine tuning small models is really impactful. Here are a couple of reasons why you would consider fine tuning a small model. The first is, of course, that it's less expensive. You're going to use a smaller dataset, possibly a smaller model, and still improve accuracy because you're fine tuning this model, but you're keeping your costs down when you're working with a smaller model, something in the 7 billion parameter range.

This is inherently faster because the model itself is just physically smaller than some of those larger ones, so the training time is faster and the inference time is faster, which means you can train more models and do more inference. And again, with that smaller object, because the object is smaller, it's easier for you to manage.

And so again, the storage requirements are smaller, so it's easier for you to copy the model. It's easier for you to put the model into your applications and your packages and your CI/CD pipelines and your repositories.

Many customers inherently prefer the ownership that comes with creating new models, particularly through fine tuning and then, again, pre-training. This allows you to increase the IP of your firm. And then, of course, you have more deployment options. When you're fine tuning, again, that small model, the deployment options include serverless.

Actually, I have customers who create and then fine tune these small 7 billion parameter models, compile them, and then host them on Lambda and run them on serverless inference. So absolutely, when you're working with these tiny models that are knowledgeable in small domains, you have a lot of flexibility.

Pre-training is really best for extremely large datasets. So when you have hundreds of GBs or multiple terabytes of custom language data that just really is not online: if the language data that you have is not in Wikipedia, if it's not on Reddit, if it's the core language that you're using; if, when you take a sentence and try to put that sentence into Wikipedia, for example, Wikipedia doesn't understand what you're trying to say, you may want to seriously consider customizing a language model and then possibly creating a new one from scratch.

Now, why is this the case? Why is pre-training so powerful? Part of this is because the pre-training loss function is more generalizable. When you're creating that new foundation model from scratch, the learning is slightly different; it's more general, and it's actually deeper in the neural network. Also, when you're creating a new foundation model, you can do this without supervised data.

So you don't need to go label millions of records in pre-training, you can just capture and tokenize a terabyte of your own language data and then throw that into the network. There's no need to add additional supervision on top of that, which makes it very attractive.

Also, I love to see the efficiency gains of pre-training. We all have small teams, we all have only a few resources for data science and modeling. And so when we take our small teams and focus them on one project, create this one massive, powerful foundation model, and then use that foundation model in many, many applications, I find it's more efficient than optimizing and then maintaining our tiny MLOps workloads, which is what many of us were doing prior to transformers.

So what does it take to pre-train a new foundation model? It sounds scary. It sounds like only the best can do this. But in fact, in large part due to a very sophisticated and very mature training infrastructure that you're here to learn about, it's actually pretty accessible.

So how are we going to do this? Here are three example models that were pre-trained and created from scratch on Amazon SageMaker, specifically on our training infrastructure.

Stable Diffusion clocking in at 5 billion images and 240 terabytes of image data. And so of course, that's a lot. Image models tend to take a lot of data, but the models themselves are a bit smaller. And so you can use smaller cluster sizes.

The Falcon model, of course, from the Technology Innovation Institute, is a very large language model, the largest open source language model at the time: one trillion tokens, just under three terabytes of language data, 40 billion parameters, and 48 p4d.24xlarge instances. So a sizable cluster, and about two months to train this model.

And then we have another finance large language model trained on SageMaker with just under two terabytes of language data. So all of these requirements are surprisingly accessible. Actually, I think there are quite a few companies with that volume of language data. And the capabilities that we provide on SageMaker make the training experience, again, very accessible to a wide variety of companies.

So how do we do this? If we know we meet the requirements, how are we going to go about creating and pre-training these foundation models on AWS?

The first step is just gathering and accessing that data. And again, we want at least, I'd say, one terabyte of your own language data. This is documents, digitized PDFs, conversations, language streams: rich, robust language data. So you want to gather about one terabyte of this language data.

Many firms will then pair that with open source data, actually, so that your model understands both the nuances of your company's acronyms, history, phrasing, and domain expertise, but also knows what time the sun rises in Honolulu, because of course we want that mix of general, open source knowledge along with what's specific to your company.

So that's gathering and storing the information. After that, you'll preprocess your data. SageMaker also has a really nice capability for preprocessing datasets. Actually, one of our builders, Jenny over here, helped me run many preprocessing and data transformation jobs on SageMaker. You can use the training job API, including that remote function we just learned about, to run jobs in parallel which are tokenizing and preprocessing.

So this core sort of training job construct is applicable both for creating new models from scratch and also for general data transformation and general processing. So you'll preprocess your data sets and then you'll optimize those datasets using your preferred data storage.

We see a lot of customers using FSx for Lustre. This is because you can store your data in one place and then easily attach this volume to a training job run. So as you're iterating through different model sizes, different infrastructure, and experimental choices, you can use and store your data in the same place.
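A sketch of attaching that Lustre volume to a training job; the file system ID and mount path are placeholders:

```python
from sagemaker.inputs import FileSystemInput

train_fs = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",          # placeholder FSx for Lustre ID
    file_system_type="FSx",
    directory_path="/<mount-name>/pretraining-corpus",
    file_system_access_mode="ro",
)

# estimator is constructed as in the earlier sketches; reuse the same volume across experiments.
estimator.fit({"training": train_fs})
```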

After this, customers will then need to develop and iterate over their training scripts. And the elasticity that you get with the infrastructure on SageMaker is beautiful. You can use and run tiny instances, so the t3.medium and the t2 that Werner shared with us this morning.

So the t3.medium is a great choice for notebook instances, very cost effective, very small machine. And then you can scale that up with a click of a couple buttons to a small GPU, for example, the g4 or the g5 series which your teams can then develop on and get the nuances working in their training loop and then ultimately scale out in the same platform in the same service to hundreds and thousands of GPUs.

And so that's the move from step 4 to step 5, where you're developing and testing on increasingly larger instances and then ultimately scaling up and using the massive training infrastructure that SageMaker provides.

And then of course, you'll evaluate the model artifacts step by step. And the way that SageMaker holds on to the metadata, holds on to your scripts, holds on to your hyperparameters, and stores all of your artifacts in S3 makes it so easy to just look up your previous work.

So I know if you're trying to capture an experiment that you ran six months ago, or even three years ago, as long as it was in AWS, then you can easily go look up the results of that job, capture some of the artifacts and then run a new experiment. And so at a high level, that's how you can pre-train foundation models on AWS.

And again, all of this is possible because of the distributed training libraries that we provide on Amazon SageMaker. These are capabilities that we've been building for many years, including data parallel and model parallel distributed training libraries that give you efficiency enhancements.

So model parallel is a way to distribute a neural network over multiple accelerators and GPUs providing optimal performance.

"And then our data parallel package will let you actually make copies of your model across a large cluster. And then we're delivering custom communication collectives actually that are optimized for the AWS network topology to save you up to 40% in the overall training time.

And so this is after many years of innovation at this layer in the stack, and again, all of this is available through SageMaker. And customers agree with us: as you heard from Swami's keynote yesterday, Aravind Srinivas, CEO of Perplexity AI, is happily using SageMaker and in particular the data and model parallel training libraries, again to get that optimized performance, in particular with HyperPod.

Another feature of SageMaker that I find really handy is warm pools. The training job API, again, is creating infrastructure when you train a model. So when you call model.fit, or when you run that Python training script, we actually turn on the instances at the same time. The call to create the cluster and the execution of the script happen together.

Now again, this is really useful for cost efficiency, so that when the job fails because I forgot to point to the right Lustre volume, that instance isn't sitting up there charging me money, right? It turns off. So it's extremely compute efficient. However, as a dev that can be challenging, because I don't want to wait eight minutes just to ship a new line of code.

And so last year we launched our warm pools feature, which lets you run new jobs using the same image in seconds. As a developer, it's extremely handy because you can make just one, two, or three line edits in your training script and then run the job in seconds. So the warm pools feature is incredibly useful for developing with the SageMaker training API.
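A sketch of keeping the warm pool alive between jobs; the keep-alive period and data path here are illustrative:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                  # hypothetical training script
    role="<execution-role-arn>",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    framework_version="2.0.1",
    py_version="py310",
    keep_alive_period_in_seconds=1800,       # retain the provisioned instances for 30 minutes
)

estimator.fit({"training": "s3://my-bucket/dev-data/"})
# A follow-up job with a small code edit reuses the warm pool and starts in seconds.
```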

Another core feature of SageMaker is the ability to use many different types of instances and have a lot of flexibility with the underlying infrastructure where you're trying to run your scripts. One of these, of course is custom accelerators from AWS. And so the Trainium and Inferentia capabilities are both available on SageMaker.

And you're seeing a lot of cost performance relative to comparable GPU-based Amazon EC2 instances, up to 46% with Trainium 1. And you'll see even better performance with Trainium 2, which was just recently announced. So in the demo today, we're actually going to take a look at Trainium on SageMaker.

So what is this demo? We're going to be pre-training Llama 2. Of course, I got a little visual for you: a cartoon llama with sunglasses on the Las Vegas Strip, from Bedrock. We're going to pre-train a 7 billion parameter Llama on SageMaker.

Now, why are we going to do this? Why is this a useful exercise? Again, this is assuming I have at least a few hundred gigabytes of custom data, so really a sizable dataset of my own language data. And again, this is knowledge that's not generally available online; it's my own proprietary dataset, knowledge that wouldn't generally be found in, say, a Wikipedia archive. This then drives in-domain accuracy.

And so this small model will be very surprisingly accurate within that domain. Again, it won't know everything under the sun, but it will have a surprising amount of accuracy again in that data set and in that domain where you're training it. This of course, then drives, as I mentioned earlier, ownership. It drives flexibility and then that lets you again use serverless hosting. And then ultimately cost reduction opportunities.

And again, how are we going to do this? I have some example notebooks. We're going to walk through different instances, again that t3.medium and the Trn1, optimized large scale data stored on FSx for Lustre, some warm pools, and then again that distributed training infrastructure. So let's check this out.

When I ran this, I can view the outputs. I can step through the logs and see every piece of information that I need for my model, which of course I can then download and build an entire app on top of. And so with that, I'm going to hand over the stage to Tom, and he's going to share some information with you about Toyota.

Great. Thank you, Emily. So today I'm going to tell you about how we're using SageMaker to accelerate machine learning at TRI. And first, maybe I should tell you a little bit about what TRI actually is. I'll give you a couple of examples of projects that we have ongoing right now. The first is a project around autonomous drifting with a Toyota Supra, and I'll just let the video play here.

So that's one example, and of course AI here helps lay the foundation for all of this work. The second is that we work on a lot of challenge problems. There's a big robotics group that focuses on challenge problems that we start in the lab but also take out to the real world environment to evaluate our systems in. This is an example where we have a robotic system that we built in house from the ground up, and we're able to retrieve from and stock grocery store shelves.

This has evolved more into the factory setting as well more recently. We're 250 people across a few different locations, so there's a team in Los Altos and a team in Cambridge, Massachusetts, and there are also teams in human centered AI and material science as well. And most recently, one of the things about generative AI that we found, anyway, in the context of robotics, is that it can now be applied to robotics to do a wide variety of tasks that we never thought were possible.

This is a technique called diffusion policy, which is now able to learn, from a few examples from a human, how to perform very complicated tasks. Building on this, the machine learning team at TRI tries to build a foundation across language, vision, and action: language in the sense of both common sense knowledge and also a wider variety of applications.

Language has applications across Toyota more generally in the context of enterprise applications, but also in terms of code generation; vision feeds into language to give robots eyes, for example; and then action to perform a wide variety of tasks across a number of different platforms. But this talk is more about SageMaker.

So I want to tell you about how we're using SageMaker at TRI to really accelerate our progress. The first is general experimentation, where we use about 1 to 8 instances to scale up our training jobs. The second is how we can take some of these ideas and really scale up very, very quickly, to not just a few GPUs but to hundreds of GPUs at a time.

And finally, we're also able to use SageMaker for even broader applications such as simply serving models, as these are hard to serve locally, on device. So let me tell you a little bit about the experimentation that we do on SageMaker at TRI. First, at a high level, we have a wide variety of applications and models that we're training on SageMaker.

The first is large language models. The second is a mono-depth model, so taking RGB images and inferring depth, for example. A third is Stable Diffusion style language-to-image generation for better feature representations. And a fourth is 3D representations, such as language-to-3D structures, which is useful for robotics and a number of other applications. And across all of these, we've found SageMaker to be very useful for a number of reasons.

Some of the challenges that come up for us include a few things. First, we want to be able to reuse existing training infrastructure and clusters that we create. The warm pools that you heard about earlier are one way to do that, and we take advantage of them on a daily basis to pull back those resources and continue iterating on our training jobs.

The second is scaling: we need to be able to go from one to many instances very quickly, and also to change instance types very quickly as well. We also need high performance systems, and SageMaker is very well optimized on the back end. And finally, we need flexibility, and we need to run a number of different jobs across all of the science group.

And so, I'll echo how easy it is to scale these things up; this is the code you saw earlier. You can start with one instance and iterate on your training job with that instance, for example. And if you need to scale, it's a very simple change to enable that. In this case, you can change the instance count from 1 to 8, for example, and then start scaling your runs very quickly.

The second is that, as new hardware comes out, as Gal mentioned, we're able to quickly change the hardware types as well. In this case, we can change, say, from a P4 instance to a P5, which will give twice the throughput for our training jobs and reduce the training times for us.
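A sketch of the two one-line changes described here, scaling out and moving to newer hardware; the script and values are illustrative:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                  # hypothetical training script
    role="<execution-role-arn>",
    instance_count=8,                        # was 1 while iterating on a single instance
    instance_type="ml.p5.48xlarge",          # was "ml.p4d.24xlarge" before P5 became available
    framework_version="2.0.1",
    py_version="py310",
)
```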

And just to give some evidence of how performant these systems are: if you look at scaling across a number of instances, it's almost linear in terms of the scalability here. So SageMaker has been very performant for us in terms of scaling up our training jobs, as Emily mentioned earlier.

The datasets we use are huge too. As we started, we were using datasets of a few terabytes, and it's nice to be able to quickly start up with FSx for Lustre. However, we've scaled our training jobs, and the amount of data that we need for these has grown from a few terabytes to half a petabyte or more.

The flexibility in SageMaker to pull in other resources like WebDataset has been really great and has really accelerated the training runs that we have as well. So just to reiterate, the group is running jobs from one instance to eight instances, and these are a few of the applications. But going beyond this, we're also able to scale our training runs to a much larger scale as well.
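As a rough illustration of the WebDataset pattern just mentioned, streaming tar shards instead of staging everything on a file system; the shard URL pattern is a placeholder:

```python
import webdataset as wds

# Stream sharded tar files directly from S3 via a pipe command.
shards = "pipe:aws s3 cp s3://my-bucket/shards/train-{000000..009999}.tar -"
dataset = wds.WebDataset(shards).decode().to_tuple("txt")   # text samples stored as .txt entries
loader = wds.WebLoader(dataset, batch_size=64, num_workers=8)
```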

So, just to highlight one of the ways in which we're doing that at TRI: we're building state of the art models. The question is, can you build a state of the art LLM on SageMaker? And we at TRI have been doing this.

We've been reproducing some of the Llama 2 models initially to validate all of our systems. For this, we need scalability and performance across all of these instances, and what SageMaker really provided for us is that scalability.

So this is the newest hardware, the H100s, and roughly linear scaling as the number of nodes increases. This ends up being around 256 H100s. And if you run this out, a training job like this for pre-training a Llama 2 model can take about a week when you start scaling out to 30 instances.

And this is just to say that with more than a trillion tokens, we can reproduce the state of the art models here. And we're scaling not only the 7 billion parameter models but up to 13, 34, and 70 billion as well on SageMaker right now. One of the nice features of SageMaker, so that you don't lose any time, has also been some of the repair work.

I think Gal may have mentioned this earlier. As you scale these jobs, it's often the case that hardware will fail, and when hardware fails you have downtime, and downtime costs you money because you're not training your models. One of the great parts of SageMaker is that it has this option for cluster repair.

And for us, this happened in about 10 minutes: one of the machines failed, the cluster came right back up, and we were able to continue our training run very quickly. So that's pre-training. The other thing is more on the side of up-training, where you have a large dataset, but not quite the size you need for pre-training, and you want to focus on a particular domain.

At TRI, because we're a Toyota-centric entity, Japanese was one of the areas that we were very interested in. You can take some of the state of the art models, which aren't actually trained for this; they do have a little bit of, say, Japanese training data, but they don't have that much.

If you go out there and acquire all of, say, the open source data available, you get to 10 to 100 billion tokens, which is enough to up-train the model for a domain such as Japanese. And what we found is that taking Llama 2 with 13 billion parameters and up-training it, you gain some performance.

This is a win rate metric against some of the best closed source models. But the next step is where you actually instruction fine-tune the model. This is how you get large language models to follow instructions, to be chatty: you fine-tune them using instruction fine-tuning with data of this type, where the instruction would be in the first part and the second part would be the sort of response you would expect.
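Purely as an illustration of that format (these records are made up, not from TRI's data), an instruction-tuning pair looks something like this:

```python
# Each record pairs an instruction with the response the model should learn to produce.
examples = [
    {
        "instruction": "Summarize the warranty terms below in two sentences.",
        "response": "The warranty covers defects in materials for three years. It does not cover normal wear or accident damage.",
    },
]
```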

And if you do that, with the additional pre-training and the additional instruction fine-tuning in Japanese on some of the more performant models out there, you can get state of the art performance in Japanese. And this is a much smaller model compared to, say, a Llama 70B, yet still more performant, for example.

And so SageMaker has really enabled us to do a lot of this experimentation very rapidly at TRI. The final thing I want to mention, which isn't covered as much in this talk, is that there's also the ability to run other sorts of workloads, such as serving models. We've been leveraging SageMaker endpoints to serve both open source models and the models that we have in house, internally across TRI, and maybe eventually externally as well.

So with that, I just wanted to say there are three primary areas where we're focused on using SageMaker: small scale experiments, such as 1 to 8 nodes; large scale training, up to 32, 64, or more instances; as well as serving. SageMaker has been very critical and important for our training of this variety of models and for experimentation generally.

And I just wanted to close by saying it's been great working with SageMaker for training all of these models. Next time, hopefully when we come back to AWS re:Invent, maybe we will have a foundation model that can be trained once and do many different robotics tasks in response to language and other things as well. So with that I'll end, and I'll hand it back to Gal.

Thank you, Tom. We just wanted to end by showing you a couple of links and QR codes to learn more about SageMaker and how to use it. And thank you all for your time. We'll all stand around here for a little bit longer if you have any questions. I actually think some members of the smart sifting team are also here if you have questions about that and want to learn more.
