AI parallelism explained: How Amazon Search scales deep-learning training

Hello, everyone. Thank you for joining us for this session, and we trust you're having a great re:Invent. My name is James Park. I'm a solutions architect with Amazon Web Services.

Hey, everybody. I'm RJ. I'm an engineer with Amazon Search, and I'm going to tell you how we took large language models to production and the ROI that we measured with it. So for the next hour or so, we have a lot of content to cover. Our goal is to give you not only the marketing details and the services that we're using, but also an idea of the problems we faced, as well as the solutions that let Search accomplish what they do.

So for the next hour, I'm going to start by setting the stage: giving you an introduction to what's available today on AWS, as well as some of the trends we're seeing in the market regarding deep learning. We'll then talk about some of the challenges and ways to address them, and that's when the real fun will begin. RJ is going to come up and discuss how Amazon Search, or the M5 team in particular, handles these challenges and what they've built within the engineering team over the past year and a half.

I have probably the coolest job in AWS: I get to be the solutions architect for amazon.com. That means making sure all the processes that you and our customers use to search for products deliver a great experience and ultimately get packages delivered. I usually get a hand in working with these teams, one of which is Amazon Search.

Now, as you might imagine, Amazon is huge. The number of teams we have operating and the number of people on those teams is gigantic. Teams like Search could often be an entire company in and of themselves. The great thing about AWS is that we have a wide range of services to cater to different classes of customers.

This slide is something that you've probably seen before. But if you take a step back and look at the ML capabilities on AWS, we again cater to three different classes of customers and all of these classes exist within Amazon.com.

The first class is those individuals or teams that want to embed AI capabilities into their applications, but don't want to get into the business of AI, meaning training models, doing the data science, and so forth. When these teams want to embed AI capabilities without getting into ML, they can turn to higher-level services like Amazon Rekognition or Amazon Translate, and with a simple API call they can leverage everything that AWS is doing behind the scenes to embed these capabilities into their applications. For example, Amazon Rekognition is used by Amazon Ads: it meets their needs and is an easy way for them to gain the benefits of ML without investing a huge amount of resources into building a dedicated ML team.

The second class of customer is where our Amazon SageMaker services really come into play. Amazon SageMaker is an end-to-end managed set of services that allows individuals, developers, scientists, and engineers to do everything from data engineering with services like Data Wrangler, to feature engineering and feature storage with Feature Store, to training their models with SageMaker Training, all the way to hosting with SageMaker Endpoints and the other various inference options.

The third class of customer works at the frameworks and infrastructure layer, and that's where we'll mainly be focusing today.

Now, if I double-click on that third layer of frameworks and infrastructure, it looks like this: basically the entire AWS cloud and its suite of services is available for you to leverage however you want.

Now, as you might imagine, this can be pretty complex. All of the EC2 instances are available to you, and depending on the needs of your business and your workloads, you can pick and choose which services to use and how to stitch them together to ultimately meet your goals. This is where RJ will come up shortly and discuss how they've leveraged some of these AWS services in order to do what they do at the scale that they do it.

Now, to set the stage: deep learning and neural nets have seen explosive growth within the past decade or so. Transformers, along with GPU processing power, have largely brought us into a realm where what we're doing in areas like NLP is really remarkable; it's almost impossible to distinguish an ML model responding to you from an actual human being. Models are also gaining size year after year. Back in 2018 we were talking about models with 100 million parameters, and year after year we see roughly a tenfold increase in the size of those models. Just last year, we collaborated with Meta and actually trained a one-trillion-parameter model.

Now, that's all fine and good, and it's great for deep learning. What we've seen generally is that with classical models you reach a point where adding more training data doesn't reap the same rewards in model performance, but it's different with deep learning: the more data we have and the larger the models, the better the performance we generally get. And as we grow our models, we need to be able to train, host, and operate at that scale. Frankly, though, the hardware we're using today hasn't scaled as fast as model sizes have. So what do we do to address this?

There are some concepts around distributed training that you should be aware of, and we'll go over them anyway. The first way to do distributed training is called data parallelism.

Now, data parallelism works great for use cases where you have an ML model that can fit entirely on a single accelerator; that might be a GPU, our Trainium chips, or various other accelerators. First, we take your model, which again fits on a single accelerator, and copy it to all the workers within your training cluster. Then we have the training data, and usually the more data you have, the better your model performs, so it's not uncommon for us to have jobs training models that take days, if not weeks, sometimes even months, especially at the scale of Amazon. But you can effectively split your data up and direct the shards to each of the workers in your cluster, thereby reducing your training time.
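For illustration, here is a minimal sketch of this pattern using PyTorch DistributedDataParallel. The model, dataset, and hyperparameters are placeholders, not anything from the talk; the point is just that the model is replicated to every worker while the data is sharded.

```python
# Minimal data-parallel training sketch with PyTorch DDP (illustrative only).
# Assumes it is launched with `torchrun --nproc_per_node=<num_gpus> train_ddp.py`
# and that the model fits on a single accelerator.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 2).cuda()           # placeholder model
    model = DDP(model, device_ids=[local_rank])      # replicate the model to every worker

    dataset = TensorDataset(torch.randn(10_000, 512), torch.randint(0, 2, (10_000,)))
    sampler = DistributedSampler(dataset)            # each worker sees its own shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optim.zero_grad()
            loss_fn(model(x), y).backward()          # gradients are all-reduced across workers by DDP
            optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```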

But what do you do when your model is big and doesn't fit on a single accelerator? Then we look at model parallelism. There are two forms of model parallelism today. The first one is pipeline parallelism, also called inter-layer model parallelism. On the screen you see a neural network and the various layers that make it up. Basically, we need to split up that neural network so it fits on these accelerators: we divide up the layers of the model (you can see those dotted lines), and then, similar to before, we have a cluster of workers, and based on how we segmented the model, pieces of that neural net land on those various workers.

Now, in this situation the data actually has to be processed in a pipelined manner, so ultimately every worker processes all the data in your training set. When this is all done and all the data has been used for training, the model has (hopefully) converged, and again you have a model that you can then use for inference.
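To make the idea concrete, here is a toy sketch of pipeline (inter-layer) parallelism: two stages placed on two GPUs, fed with micro-batches. The layer sizes are made up, and a real pipeline engine (DeepSpeed's pipeline module, for example) overlaps the stages so both GPUs stay busy; this naive loop only shows the data flow.

```python
# Toy sketch of pipeline (inter-layer) parallelism on two GPUs (illustrative only).
# The layers are split across devices and micro-batches flow stage to stage.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")   # first half of the network
stage1 = nn.Sequential(nn.Linear(1024, 2)).to("cuda:1")                # second half of the network

batch = torch.randn(256, 512)
micro_batches = batch.chunk(8)            # split the mini-batch into micro-batches

outputs = []
for mb in micro_batches:
    act = stage0(mb.to("cuda:0"))         # stage 0 produces activations on GPU 0
    out = stage1(act.to("cuda:1"))        # activations move to GPU 1 for stage 1
    outputs.append(out)
logits = torch.cat(outputs)               # results gathered once the pipeline drains
print(logits.shape)
```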

The second type of model parallelism is called tensor parallelism, also referred to as intra-layer model parallelism. You can see a similar neural network here, but instead of separating it vertically, we now separate the layers horizontally. Because we've separated it in this fashion, we can take the same number of workers and put the proper pieces of the neural net on those workers. Instead of piping the data through all the workers, each worker gets a copy of the full set of data, does its processing, and after performing all the communication it needs between the nodes, you eventually have a trained model.
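As a small illustration of what "splitting a layer horizontally" means, here is a toy sketch that shards one linear layer's weight matrix column-wise across two GPUs; the sizes are arbitrary.

```python
# Toy sketch of tensor (intra-layer) parallelism: one linear layer's weight matrix
# is split across two GPUs; every worker sees the full input batch, computes its
# shard of the output, and the shards are concatenated (illustrative only).
import torch

torch.manual_seed(0)
weight = torch.randn(1024, 512)                       # full layer: 512 -> 1024
w_shard0 = weight[:512].to("cuda:0")                  # first half of the output features
w_shard1 = weight[512:].to("cuda:1")                  # second half of the output features

x = torch.randn(64, 512)                              # the same batch goes to both shards
y0 = x.to("cuda:0") @ w_shard0.T                      # partial output on GPU 0
y1 = x.to("cuda:1") @ w_shard1.T                      # partial output on GPU 1
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)            # gather the output shards

assert torch.allclose(y, x @ weight.T, atol=1e-5)     # matches the unsharded layer
```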

So as you can see, these types of strategies are pretty instrumental in being able to operate at this scale. You don't want to wait months to train a model only to realize, after a month has passed, that it's become irrelevant. Think of use cases like fraud detection, where people are constantly attacking websites or businesses like Amazon: we need to be able to train our models as fast as possible so we can react to threats as fast as possible.

Regarding ML, we're seeing a few trends as far as deep learning is concerned. The first is the parallelism constructs we just went over. The second is the number of accelerators that are available in the marketplace. You saw first with NVIDIA GPUs how we can really accelerate model training and other aspects of ML. Amazon also now provides its own capabilities with the Trainium and Inferentia chips, and there's even Intel, whose Habana Gaudi accelerators sit behind our DL1 EC2 instances.

But as our clusters grow, it becomes even more important that we look at communication, not only among the accelerators within a single node, but also between the nodes participating in the cluster.

Another trend we're seeing is that containers are being used to scale these workloads. Some popular options, and you'll hear this from RJ shortly, include AWS Batch. Now, SageMaker Training is a great option, and it provides a managed way for you to train your models. But if you really want to be close to your hardware and gain some of the capabilities that Batch affords today, it's a great option.
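For a feel of what queuing a containerized training job on AWS Batch looks like, here is a hedged sketch using boto3. The queue name, job definition, command, and GPU count are placeholders for whatever your own Batch setup defines.

```python
# Sketch of queuing a containerized training job on AWS Batch with boto3
# (illustrative only; names and values are placeholders).
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="pretrain-exp-001",                        # hypothetical experiment name
    jobQueue="gpu-training-queue",                     # hypothetical job queue
    jobDefinition="pytorch-trainer:3",                 # hypothetical job definition
    containerOverrides={
        "command": ["python", "train.py", "--config", "s3://my-bucket/configs/exp-001.yaml"],
        "resourceRequirements": [{"type": "GPU", "value": "8"}],
    },
)
print(response["jobId"])   # track this ID alongside the experiment metadata
```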

In addition, other customers are using options such as ParallelCluster, which happens to use Slurm behind the covers. Another thing to be very conscious of is the cost of training. Accelerators today are quite a bit more expensive than their CPU counterparts. At small scale it might not seem like a big deal, but when you're talking about hundreds of nodes in a cluster, if not thousands, a minute of downtime in your cluster ends up being a huge cost to your business. So you want to be looking at ways to manage cost and avoid some anti-patterns. If you're going to spin up a 100-node cluster full of p4d instances, something we see quite often is customers doing a file download on it, for example.

Really, you should use the GPU for what the GPU is good at and do the file download with something more cost-effective, like a Lambda function. In addition, to help manage costs, there are various best practices to take into account, such as checkpointing. If you're going through a training cycle for a large model that takes, say, a week, you don't want to get three days into that week, have something go wrong, and then have to start over. If you build checkpointing into your training, you're able to resume close to where the failure occurred.
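A minimal checkpoint/resume sketch looks something like the following. The checkpoint path and interval are placeholders; the idea is simply to persist the model, optimizer, and step counter so a multi-day job resumes near the failure point instead of starting over.

```python
# Minimal checkpoint/resume sketch (illustrative only).
import os
import torch

CKPT_PATH = "/checkpoints/latest.pt"   # hypothetical durable location (e.g. a shared file system)

def save_checkpoint(model, optimizer, step):
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0                                      # fresh start
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1                           # resume from the next step

# In the training loop:
# start_step = load_checkpoint(model, optimizer)
# for step in range(start_step, total_steps):
#     ...train...
#     if step % 1000 == 0:
#         save_checkpoint(model, optimizer, step)
```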

In addition, as far as the market is concerned, we're seeing with the pandemic that accelerator capacity itself can be a risk factor. So what do you do when, for example, you don't have enough GPUs to meet the needs of your business? Can you turn to other alternatives? We'll talk a little bit about that shortly.

And lastly, we're not alone in this. With toolsets such as DeepSpeed or PyTorch Fully Sharded Data Parallel (FSDP), we're often given higher-level constructs and tools to do this effectively, and to do so within your organization with little pain.

So knowing these key trends, we can also derive some strategies to help us on this journey to train large models. We already talked about parallelism strategies, but let's talk about some other things that you might not be thinking about.

First of all, when you think about a cluster, you might only be thinking about the chips you're using for acceleration. But you also need to take into account that everything in that cluster matters. The communication between accelerators matters: if the communication between nodes is bottlenecked, your training is only going to go as fast as that bottleneck allows. To give you an example, when working with one of our customers in one such training scenario, we were able to leverage PowerSGD gradient compression to optimize communication and get a tenfold increase in training performance.

In addition, other things you might want to leverage if you're using DeepSpeed are 1-bit Adam and 1-bit LAMB. As far as the network is concerned, part of it is what's available from hardware: EFA is a great choice if you need the bandwidth between the nodes in a cluster. But in addition to, or in lieu of, just throwing hardware at the problem, you can use strategies to limit the amount of communication bandwidth you need by leveraging things such as gradient compression, 1-bit Adam, and 1-bit LAMB.
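As a hedged sketch of what turning on gradient compression can look like, PyTorch exposes PowerSGD as a DDP communication hook. Here, `model` is assumed to already be wrapped in DistributedDataParallel, and the rank and warm-up values are placeholders to tune for your own workload.

```python
# Sketch of enabling PowerSGD gradient compression on a DDP-wrapped model (illustrative only).
# PowerSGD communicates low-rank approximations of the gradients instead of the full
# tensors, which can cut inter-node traffic when bandwidth is the bottleneck.
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(
    process_group=None,            # default process group
    matrix_approximation_rank=2,   # lower rank = more compression, more approximation error
    start_powerSGD_iter=1_000,     # warm up with vanilla all-reduce before compressing
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)   # `model` is a DDP-wrapped module
```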

In addition, you also want to make sure that you're leveraging all the resources available to you. That includes leveraging things like ZeRO-Offload to ensure that you're getting the most from your memory as well as the on-host CPU. Also, with the advent of all these different accelerators in the marketplace, we're seeing from our end that it's often in a customer's best interest, if they can, to abstract themselves from being tied to a particular accelerator.

So we've actually proven out, for example, doing some training on one accelerator, checkpointing it, and continuing that training on a different accelerator, which gives us a great deal of flexibility. Costs may change or availability may be a concern, but if you protect yourself from putting all your eggs in one basket, you're really setting yourself up for success.
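Picking up the ZeRO-Offload point from above, here is a hedged sketch of what a DeepSpeed configuration fragment enabling optimizer offload to host memory can look like; the batch size and ZeRO stage are placeholders, not a recommendation.

```python
# Sketch of a DeepSpeed configuration that enables ZeRO with optimizer offload
# to host CPU memory (illustrative only; values are placeholders).
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # partition optimizer states and gradients
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload: keep optimizer states in host RAM
    },
}
```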

Another thing to take into account is profiling, and I should probably add debugging to this as well. With deep learning, ultimately we want to make sure that we're utilizing all the resources the hardware provides to us. But more than that, we need to be able to really introspect what's going on and make intelligent decisions, so that as training occurs we're doing the best we can in terms of return on investment, cost savings, and so forth.

Lastly, data powers everything, and you should definitely not forget to think about how the data is getting moved into your training platform. For example, is data the bottleneck in your cluster? If the cluster isn't getting as much data bandwidth as it needs, you're actually starving your training cluster. Furthermore, we need to think about things like reproducibility; especially with large training datasets, that becomes a very key factor.

A few things to take into account as best practices: if you can decouple your training and inference, that's a great thing to look into. We all know that training isn't the last step, and the way we provide our models to downstream teams, whether it's the model format or other aspects, should be a consideration; that's something I think RJ will also talk about. Now, talking about inference:

You really need to take into account the various requirements you have for inference. Two categories you can think about are, first, sparse inference, where you need medium latency and low throughput. In order to be cost-effective, a popular option is to leverage our c6i instances, which use Intel CPUs behind the scenes, because they are very cost-effective. In addition, with the support for newer instruction sets, you're actually able to get pretty good performance out of these CPUs. For, let's say, a million requests against a BERT Large model with a 128-token sequence length, you can expect about 100 milliseconds of latency in that configuration using 66 c6i instances.

The second category is dense inference, where you really need low latency and high throughput. There you can look at various accelerators to help, whether that's the P-series instances, the G-series, or even our DL1 instances. We found that DL1 allows people to infer BERT Large models with 256-token sequence lengths at about 15 milliseconds of latency, so pretty darn fast.

Lastly, with inference, you need to take into account what the actual needs of that inference are. Do you need real-time inference, where you know what your workload is going to look like and a provisioned strategy is best, versus bursty inference? SageMaker, for example, provides inference options similar to Lambda, where you can have serverless inference. Those options may take a little longer for the model and the container to spin up behind the scenes, but ultimately they will spin up and hopefully be very cost-effective for your business needs.

Also take into account whether you need real-time inference, or whether batch or asynchronous inference fits; that can be a huge driver in the decisions you make.

One of the inference technologies that Amazon uses is the Triton Inference Server. If you're familiar with it, we actually offer it in a couple of different ways. SageMaker provides Triton Inference Server as an option behind real-time endpoints, and it's awesome in that it's fully managed for you and you get the full benefits that SageMaker provides, such as the integration with CloudWatch, A/B deployment testing, and a host of other things. I strongly encourage you to look into that. That said, it might not meet your needs exactly, and you still have the option to deploy Triton on other services such as ECS. Triton allows us to host models and get really awesome performance.

It supports various backends for various model types. Very recently they released support for the FIL backend, which you can use for XGBoost models, for example. And behind the Triton Inference Server, you're able to leverage the accelerators, whether they're NVIDIA GPUs such as the A100, CPUs, or AWS Inferentia; all of these are options behind the Inference Server. We actually use this day to day for our global spell-check models. When you're on the website searching for something, we need really, really good performance; we can't settle for anything like 50 milliseconds or above, and Triton allows us to operate globally at scale with the amount of throughput and performance that we require.
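To make the SageMaker-hosted option concrete, here is a hedged sketch of deploying a model behind the SageMaker-managed Triton container on a real-time endpoint. The container image URI, model archive location, IAM role, model name, and instance type are all placeholders; look up the Triton image URI for your Region and version in the AWS documentation.

```python
# Sketch of hosting a model behind Triton Inference Server on a SageMaker
# real-time endpoint (illustrative only; all names and URIs are placeholders).
from sagemaker.model import Model

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"                          # hypothetical role
triton_image = "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>"  # placeholder image URI
model_data = "s3://my-bucket/triton-model-repository/model.tar.gz"                      # placeholder model archive

model = Model(
    image_uri=triton_image,
    model_data=model_data,        # tarball containing the Triton model repository
    role=role,
    env={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "spellcheck"},   # hypothetical model name
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
```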

In addition to that, there are always ways to optimize. We leverage TensorRT together with Triton Inference Server, and the effect that can have on your performance can be pretty staggering.

Now, you don't have to take my word for it. Amazon Search powers a lot of teams within Amazon, and they are constantly pushing the frontiers, training the largest models and getting the best performance. At the end of the day, it's pretty staggering to see what our M5 team does for Amazon and, in turn, for Amazon.com customers.

So with that, I'll turn it over to RJ to dive deep and give you a much deeper look. Thanks, James, for the nice introduction to the distributed training, the model vending, and the inference that we do to launch these models in production. As you can see, AWS gives everybody the capabilities to train these large language models, but at the end of the day we have to take these models to production and deliver them to our customers so that Amazon gets a good return on investment on the hardware we are using.

So what does Amazon Search, especially M5, do? Before going there, I'd like to say that what we do is bring the impact of large language models to teams across Amazon. We are a sub-team within Amazon Search, and we've been able to get good results: we can train 100-billion-parameter models and also serve them at less than 10 milliseconds of latency using different hardware types.

Before we go into details, I'd like to give an introduction to M5. What does M5 stand for? It's five Ms, actually. What do we mean by five Ms? One: multimodal, because we believe the input to our model should not just be text; it can be images, structured and unstructured text, tables, and why not videos. The second M stands for multi-locale: Amazon sells products in multiple places, in several countries, so we need to make sure the model we are learning considers the different localities we operate in.

Third is multilingual. Multilingual is important for Amazon because we work with different primary and secondary languages across the world, so we need to make sure that the foundation models we train consider multilingual capabilities.

Fourth is multi-entity. What does multi-entity mean? In the Amazon world, when we talk about these models, we have different kinds of entities, like customers, queries, and sellers. We have to make sure the foundation model we are building actually captures the relationships across these entities, so that we can hand a good model to our partners and deliver it in the production system.

"So we need to make sure that we are learning the multi relationship fifth, the multitask, one of the innovation of a large foundation model was actually you can learn your representation across task. So we train our model across different kinds of tasks like duplicate detection, a semantic search matching, product type identification and lot of workloads in the search. So that's what m stands for and what our vision when we started this program based on the literature out there about the impact of the large man models was actually build the foundational and the universal semantic representation of amazon entities like products, queries and shopping sessions.

We believe that if we can build a universal M5 model, it will help us learn the relationships between entities and let new entities that come up benefit from what was learned on previous entities. And how did we do it? We believe it should be powered by a large-scale deep learning system, so we optimized our system to support large-scale training. And third, it's not just the training that is important: actually taking these models to production with our partners is key for our success and for Amazon's success. So we have a third pillar, which is to deploy to teams across Amazon. That was my vision and M5's vision.

So how did we achieve it? If you go through the life cycle of an M5 model, or a large language model, this is the life cycle that happens in M5. We pre-train a large language model with 100 billion parameters, taking datasets from across the world: Wikipedia data, mC4, and product descriptions. We train our core model with all this information. Once the model is trained and converged, we do pre-fine-tuning on the tasks that we target, and then we measure the accuracy of the model on those tasks. That model is good, but taking such a large 100-billion-parameter model to production would be very hard for multiple teams across Amazon. And it's not only hard; the cost of deploying these models in production would be significantly high.

Look at the throughput at which Amazon has to operate across the world: there are millions of customers out there using Amazon daily. So we want to make sure the models are small enough that we can deploy them to production and meet the latency requirements, and also the IMR requirements for operating the service. So we have a distillation step, where we distill the model from 100 billion parameters down to a much smaller size, so that we don't lose the accuracy of the large model; we keep the accuracy much, much closer to the model that we pre-trained while also meeting the production constraints.

Once the distilled model is there, we vend it to our partners so they can fine-tune it on their tasks with limited resources. They don't need large GPU clusters or any specialized hardware cluster to fine-tune on top of the distilled model; normally the distilled model just requires one GPU or one machine. So this is the natural life cycle of an M5 model as it goes through the system.

But as you go through this, it's not like only one model is getting trained, or one model is getting distilled, or one model is getting multitask pre-fine-tuned. There are lots of experiments happening throughout. So when we decided on this life cycle, that was the first thing we did. The second thing we had to decide was which frameworks to select, and we chose PyTorch as our framework of choice, because we found that most of the large language models in the literature were being trained with PyTorch.

In addition to PyTorch, we found a framework called DeepSpeed, which has support for ZeRO optimization, so we chose DeepSpeed as the framework for training large language models using ZeRO. ZeRO optimization helps you use the host CPUs alongside the GPU hardware so that you can train larger models. The advantage for us was that it is built on distributed PyTorch, so our expertise in PyTorch and all the tools built on PyTorch could be adopted and migrated. That's why we chose PyTorch and DeepSpeed.
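For orientation, here is a minimal sketch of wrapping a PyTorch model with the DeepSpeed engine. The model and configuration values are placeholders, not M5's actual setup; the sketch assumes you launch with the `deepspeed` launcher so distributed setup is handled for you.

```python
# Minimal sketch of wrapping a PyTorch model with DeepSpeed (illustrative only).
import deepspeed
import torch.nn as nn

class MyModel(nn.Module):                      # placeholder model
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    def forward(self, x):
        return self.net(x)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},         # partition params, grads, and optimizer states
}

model = MyModel()
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# In the training loop: loss = engine(batch); engine.backward(loss); engine.step()
```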

Once we chose the framework, the next and most important question was how to choose the hardware, because the hardware budget for these large language models can be significantly high, and the duration of training is also very important: you cannot have an idea and then wait a month to see whether it worked or not. So choosing the hardware is very important. At the time we started, AWS offered p3dn, p4d, and g4dn instances.

What we did first was benchmark our model on p3dn. When p4d came out, with better NVIDIA GPUDirect support with OS bypass, we took p4d and started training our large language models on it. Not only did we take a dependency on p4d, we also worked with multiple collaborators and tuned our jobs to get a 4.37x performance improvement over p3dn. What does that give us? A job that was running for five days can get done in less than a day, with maybe a 30 to 40 percent reduction in IMR. So we were able to go at a faster pace by choosing the right hardware.

On that hardware, you also have to choose the right libraries, ones that are optimized for AWS. So we took a dependency on the AWS-optimized NCCL; NCCL is NVIDIA's collective communication library for distributed training. We also took a dependency on EFA, the networking layer that AWS is developing, and I think there was an announcement for EFA 2.0, which gives better performance. We were able to fine-tune EFA for our workloads, and that's how we reached the 4.37x performance.
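As a hedged sketch, these are environment settings commonly shown in AWS examples for running NCCL over EFA on GPU instances; the exact values depend on your AMI, driver, and aws-ofi-nccl plugin versions, so treat them as a starting point rather than M5's actual configuration.

```python
# Sketch of environment settings commonly used to run NCCL over EFA on AWS
# GPU instances (illustrative only).
import os
import torch.distributed as dist

os.environ.setdefault("FI_PROVIDER", "efa")             # use the EFA libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")    # GPUDirect RDMA on p4d-class instances
os.environ.setdefault("NCCL_DEBUG", "INFO")             # confirm in the logs that the EFA/OFI path is used

dist.init_process_group(backend="nccl")                 # NCCL picks these settings up at init time
```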

So now our team had decided on the hardware and the framework. What next? What we found is that choosing the framework and the hardware is not sufficient: you still need to profile your jobs and find out where the hardware is not being utilized so that you can move even faster. This was another piece of work our team did with the AWS deep engine science team: we built capabilities within AWS, using the EFA layer and the capabilities of the p4d network architecture, and now we have the capability to train a trillion-parameter model on AWS. The good thing is that this feature is also available in the SageMaker platform, so you can get the benefit of training a large language model of up to a trillion parameters at lower cost using SageMaker.

Once we did the framework optimizations, our team was small, our models were training, we were getting good results, and we had small clusters of GPUs. Seeing that success, we wanted to innovate more. And when you want to innovate more, you have to share the cluster with other developers, do the bookkeeping of your experiments, track the experiments, and checkpoint and restart to see whether a new idea works. On top of that, because of the GPU shortage due to the pandemic, we were not getting capacity in a single region.

So what did we build with those primitives? We used AWS Batch to queue our workloads, and S3 and FSx to store the data so that we can easily stream it across. Now we were able to get capacity across six Regions, and we thought our developers would be able to move faster, because we thought GPU scarcity was the only problem we needed to face. That was not the case. When we landed there, we had to run a lot of experiments, and our experiment velocity actually went down.

So why did the experiment velocity go down? I want to give you a picture of what our experiments look like. An experiment is not just the data or the model parameters; it's also code, plus a lot of hyperparameters and a hypothesis. In our team, anything you want to try we call a hypothesis, an idea. And the idea can only be confirmed if you can train the model, converge it, and validate it with results.

As you can see, we call this an M5 workspace: hyperparameters, code, config, datasets, different kinds of model architectures, and different kinds of trainers. You need different kinds of trainers depending on what machine you're using and so on. So you need to make sure that the workspace you work in can be checkpointed, so that you can reproduce the results.

The three pillars we believed in for experimentation were reproducibility, reliability, and debuggability. For reproducibility, I can give you a few simple examples of where things get lost when you're in multiple AWS Regions. People are thinking, "My research was in this Region, but I cannot see it in the other Region," because they were tracking it in documentation or somewhere else. The model artifacts, which are the most expensive things, sit on your network file system in one Region and don't get replicated to the other Regions. And because we were using the initial network file system, there was no versioning on the storage.

So what happens? Somebody will delete a file, or move a file to a different location, without letting others know. This is an actual conversation we had in our developers' Slack channel.

"Did somebody remove this tokenizer file?" The tokenizer file is key for training, because that's what you tokenize your input data with. If somebody moves it, nobody can track it, and if nobody can track it, the experiment becomes non-reproducible.

The other thing is that you create multiple copies of these files and data artifacts. With multiple copies, people sometimes change one copy and sometimes don't, so the cost of reproducibility increases even though we got a lot of capacity.

And the last thing we always say: if we don't have reproducibility of our results, which are more or less converged, why does that matter so much for M5? One, our partners and the developers who work with us will not trust our results, because they cannot validate the M5 model or verify that the input we streamed in to train the model was the right one.

Second, the configuration is required for them to further fine-tune the models. So we want to make sure that all the artifacts we store are actually reproducible.

Also, with these large language models, 50 billion or 100 billion parameters, converging a model is not something you do in a single shot. You train, you get a checkpoint, you evaluate the model; if the model looks good, you continue further, otherwise you change the configuration, adjust the hyperparameters, and continue training.

So when you continue the training, any change in the artifacts can affect the large language model. Make sure that your experimentation is tracked. That's why we believed reproducibility was important.

Another reason is the cost of reproducing experiments. Sometimes the cost is hundreds of thousands of dollars, or millions, to reproduce a large language model. So you want to make sure that everything is tracked. And what did we do to fix reproducibility?

What we built is global access to the data and the artifacts. We keep the data and all the artifacts in only one AWS Region, us-east-1, and we stream this data across multiple Regions so that we have unique artifact tracking.

Second, we moved data processing into the training systems themselves: we do on-the-fly data processing on the CPU cores of these big machines. As you all know, these machines are huge, and there are lots of unused resources, like CPUs sitting idle while you train the model.

So we started doing a lot of data processing on those CPU cores. What does that enable? If the trainer code and the data processing code are on the same machine, that means you can reproduce it.

When you send out the code review, you know what data processing is being done and what computation is being done. When you decouple these, you sometimes lose track of what data processing was done, so it becomes harder.

Third, we use MLflow for tracking all our experiments. MLflow helps us track everything, including comparing experiments: how does your loss function behave when you change this parameter? How does it look when you have three different model configurations and you want to learn the key differences? We have used MLflow for all our experiment tracking from day one.

So even now we can go back and look at our research from a year ago and see which parameters we chose and why we chose them. These are all experiments, and sometimes your hypothesis is true or false, and sometimes you get lucky. But we always want to be able to explain all the experiments to our teams so that we can keep moving forward.
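For a sense of what this kind of tracking looks like in MLflow, here is a hedged sketch; the tracking URI, experiment name, and logged values are placeholders rather than M5's actual metadata.

```python
# Sketch of experiment tracking with MLflow (illustrative only).
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal.example:5000")   # hypothetical tracking server
mlflow.set_experiment("pretraining-experiments")

with mlflow.start_run(run_name="exp-001"):
    # Log the hyperparameters and environment details that make the run reproducible.
    mlflow.log_params({
        "model_size": "100B",
        "learning_rate": 6e-5,
        "git_commit": "abc1234",          # hypothetical commit hash
        "pytorch_version": "1.12.1",
        "nccl_version": "2.12",
    })
    for step, loss in enumerate([2.31, 2.05, 1.88]):   # placeholder loss curve
        mlflow.log_metric("train_loss", loss, step=step)
    mlflow.log_artifact("config.yaml")    # snapshot the config alongside the run
```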

The fourth one is workspace snapshots. The workspace snapshot is very important because people think it's only the checkpoint that matters, but it's not: your code, your data, and your config are very important for reproducing these experiments.

So those are simple engineering techniques, developed over the last 10 or 20 years, that we used to build a reproducible solution.

I also want to step into the data storage side of reproducibility a little bit. When we started training large models on text and then started streaming in images, the images took huge space in the storage system. But as you can see, we don't process the images in a separate cluster; we started processing the images on the CPUs of the same training nodes.

As you can see, we put this preprocessing logic in the shared memory of the instance. With that we can, for example, take a larger image, reduce it to 224 pixels, do the batching on the system, and stream it to our training workflows at higher speed.

What this enabled, in addition to text streaming, is the capability to stream large image datasets; preprocessing the image datasets on the fly enabled us to feed two different modalities into our model.
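A minimal sketch of this pattern, assuming a local image folder and standard torchvision transforms, is shown below: the decode/resize work runs in DataLoader worker processes on otherwise idle CPU cores, and resized, batched tensors stream to the GPU.

```python
# Sketch of image preprocessing on the training host's CPU cores (illustrative only;
# the dataset path, sizes, and worker count are placeholders).
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),          # e.g. reduce larger images to 224x224
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("/data/images", transform=preprocess)   # placeholder path
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=16,        # decode and resize on otherwise idle CPU cores
    pin_memory=True,       # faster host-to-GPU copies
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)
    # ... feed the batch to the training step ...
    break
```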

So the storage and reproducibility story is very important when you're training these large language models.

The second pillar is reliability. Reliability of your jobs is key because these jobs are long-running, sometimes one week, sometimes one month. And because of the scarcity, these clusters are also long-lived; you cannot release the capacity and expect to get it back.

So we have to work with long-running clusters and long-running jobs. For a long-running job, you have to make sure that the environment in which you're running the job doesn't change underneath you when you need to update your infrastructure or update the code.

You have to make sure of that, because reliability becomes important: what happens if, after two weeks, a job fails, but when you restart it, it gives random answers? You cannot have that.

What we found when we started this journey is that stuck training jobs can happen anywhere across the stack. You can sometimes have a bad GPU or a stuck GPU, so you need to continuously monitor your training jobs to check that the model is progressing and, if not, terminate the job so that people can use the resources for other large model training.

The second thing we found about reliability is the noisy neighbor problem. We used to have the same network volume shared across developers. So when a large model is training for multiple weeks and suddenly the storage is full, what do you do? All your jobs start failing.

What we found is that when you invest in such large, costly jobs, you should isolate your resources completely. Here is another example from Slack: while a model was training, somebody filled up the disk, and now our developer is spending time cleaning up the disk to make sure the large, costly job doesn't get affected.

So make sure that when these jobs are running, they don't get affected by noisy neighbors.

And when we wanted to build up the reliability of this infrastructure on AWS, what did we do? We went back to the old days when high-performance computing systems were built and deployed, took the learnings from those systems, and built four different pillars.

The first is called stuck job detectors. What does a stuck job detector do? The job sends a heartbeat back to the node saying, "I'm making progress." We are now able to identify which code snippets were actually making a job get stuck, and developers were quickly able to go in and see why their job was stuck and whether they should terminate it.

The other thing is that we automatically terminate stuck jobs after 30 minutes, so that the precious clusters we are using can be used by other developers as well.
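The basic shape of such a detector can be sketched as follows: the training loop touches a heartbeat file each step, and a watchdog terminates the process if no progress is recorded for 30 minutes. The paths and thresholds are placeholders, and a production version would report back to the scheduler rather than just exiting.

```python
# Sketch of a simple stuck-job detector (illustrative only).
import os
import sys
import threading
import time

HEARTBEAT_FILE = "/tmp/train_heartbeat"      # hypothetical per-job heartbeat location
STUCK_AFTER_SECONDS = 30 * 60                # terminate after 30 minutes without progress

def record_progress(step: int) -> None:
    """Call this from the training loop each time a step completes."""
    with open(HEARTBEAT_FILE, "w") as f:
        f.write(str(step))

def watchdog() -> None:
    while True:
        time.sleep(60)
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
        except FileNotFoundError:
            continue                          # job hasn't produced a heartbeat yet
        if age > STUCK_AFTER_SECONDS:
            print("No progress for 30 minutes; terminating stuck job", file=sys.stderr)
            os._exit(1)                       # free the cluster for other developers

threading.Thread(target=watchdog, daemon=True).start()
```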

Second, canary jobs. As with any system, you need to have canary jobs. But we were wondering: why would we need canary jobs for a training system? The problem is that, because of the velocity at which these tools are being developed in the deep learning space, failures sometimes happen in the edge cases.

So we need to identify where the failures are happening ourselves, without waiting for developers to tell us something failed. What we did is run a small distributed canary job every hour, and whenever it failed, it gave a good signal that there was a problem with the system. That helped us identify problems early and then fix them.

For example, when you're training a large model that needs a large multi-node cluster, an edge case can kill one node, your job is lost, and you lose the money. That's why we want to catch these issues with canary jobs.

And as I showed earlier, we started to have per-user, per-job isolation for both storage and compute.

Fourth, we take frequent checkpoints for all our models. It's very important for us to have frequent checkpoints to ensure that we can resume training once there is a failure.

Once the reliability was built, we had built so many tools and so many layers for developers that it brought another problem: developers now had to learn many tools even to debug, and that made things harder.

That again reduced our experiment velocity, so we started investing in the third pillar, which is the debuggability of our system. As mentioned earlier, every job artifact is tracked in MLflow, including your AMI, your NCCL version, your PyTorch version, your git commit, your job definition version, and the EFA version that you're using. Everything is tracked.

This enables us to compare runs when there is a failure or when we update something. We also enabled automatic error classification for all jobs, including Slack notifications, because we are running distributed jobs across 16 or maybe 100-plus nodes. It's very hard for a new person to come in and quickly identify the reason for a failure, and there will be one expert sitting there who knows the failure reasons; that person becomes the bottleneck for innovation, because they have to help everybody dig deeper while also developing their own tools.

So what we did is build on top of CloudWatch Logs: we run error classification and send a Slack notification with suggestions pointing to the likely error. That way, when new developers onboard to our tools, they are able to identify issues and fix them independently. Fixing them independently is what enabled our team to increase experiment velocity further, without the experts being the bottleneck.
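One simple way to realize this kind of rule-based classification is sketched below: scan the job's log lines for known failure signatures and post a hint to Slack. The patterns, webhook URL, and suggestions are placeholders, not M5's actual rules.

```python
# Sketch of rule-based error classification with a Slack notification (illustrative only).
import json
import re
import urllib.request

ERROR_RULES = [
    (re.compile(r"CUDA out of memory"), "OOM: reduce micro-batch size or enable offloading"),
    (re.compile(r"NCCL (timeout|error)"), "Communication failure: check EFA/NCCL health on the listed node"),
    (re.compile(r"No space left on device"), "Disk full: clean up the job's scratch volume"),
]

def classify(log_lines):
    for line in log_lines:
        for pattern, suggestion in ERROR_RULES:
            if pattern.search(line):
                return suggestion
    return "Unclassified failure: see full logs"

def notify_slack(job_id, suggestion):
    payload = {"text": f"Job {job_id} failed. Hint: {suggestion}"}
    req = urllib.request.Request(
        "https://hooks.slack.com/services/XXX/YYY/ZZZ",      # placeholder webhook URL
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example usage:
# notify_slack("pretrain-exp-001", classify(open("job.log").readlines()))
```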

The third thing I'll say clearly about debuggability: take standard software practices and do continuous integration and continuous deployment of your code. Please check your code into mainline and have pre-flight CR checks on your code. Otherwise, one person checks in code, another person syncs it and starts a training job, and maybe after one week you find a bug; that's not good, and you lose a lot of velocity in the team and a lot of IMR. So whatever code you're writing, run it through CI/CD frequently, even if it's for experiments; bring the code to mainline and test it very well. We use pre-flight CI checks from CodeBuild, and those checks use GPU hardware, or any specialized hardware we need, to run the tests. So even if a developer doesn't have a GPU box, all their code goes through pre-flight CI checks on a GPU box available through CodeBuild.

Fourth, for debugging, we enabled a lot more performance debugging tools within our team. These tools let us identify jobs that are not utilizing the hardware effectively and give developers visibility that they can do much more with the available hardware.

This performance debugging also shows us the holes in our system and where we should invest to make our systems better.

Now we had learned how to train these models and how to do it effectively with higher experiment velocity. But our journey was not done yet. As I said, we have to work with our partners to bring the impact of these large models to teams across Amazon.

One key thing in getting these trained models to our partners is model vending. Model vending is the process by which you share your models with your partners so that they can take the next steps in their processing pipeline.

What happens when you just share your code and your experiment? That means they have to take a code and infrastructure dependency on your training system, which creates a lot of friction and reduces velocity.

So we use SageMaker model vending for the discoverability of all our models and for the model cards. That enables all our partners to look at the model cards, find out what a model was trained on and all the details, and then use standard tools to take that model, integrate it with SageMaker Training, fine-tune it, and move it into their production system. To decouple the training and inference code, we built on the PyTorch model serialization framework and used it to break the dependency.

When you decouple the dependency, our training loop and experiment code, which are tuned for the training hardware, can move in faster cycles, and our customers don't always have to sync with our training infrastructure. They are able to independently fine-tune and use the fine-tuned model for inference or further fine-tuning. It enables both teams to move at higher velocity.
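To illustrate the decoupling idea, here is a hedged sketch that uses TorchScript as one possible PyTorch serialization format: the producer saves a self-contained archive, and the consumer loads and runs it without importing the producer's training code. `MyEncoder` is a placeholder model, not an M5 artifact.

```python
# Sketch of decoupling training and inference code by vending a serialized model
# artifact (illustrative only; TorchScript is one possible serialization choice).
import torch
import torch.nn as nn

class MyEncoder(nn.Module):                      # placeholder for a distilled encoder
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(768, 256)
    def forward(self, x):
        return self.proj(x)

# Producer (training side): trace and save a self-contained artifact.
model = MyEncoder().eval()
example = torch.randn(1, 768)
torch.jit.trace(model, example).save("encoder_v1.pt")

# Consumer (partner team): load the artifact with no dependency on MyEncoder's code.
vended = torch.jit.load("encoder_v1.pt")
print(vended(torch.randn(4, 768)).shape)
```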

Once we did the vending, we still weren't done, because our customers also didn't have the expertise in how to serve these models at low latency. That's why we always work with our developers within Amazon on how we can reduce the cost of adoption.

That's the third key thing we invested in, and where we got good results: inference optimization.

So what is inference optimization? When these models are ready to take to production, you need to make sure they can meet the latency requirements of the system out there. And because of the high throughput, the IMR cost matters: you cannot have a large model whose IMR is significantly high.

Third, you may not get the GPU resources you need because of scarcity, so we need to make the best of the hardware that's there. We have three layers in our inference optimization stack. The first is algorithms and data structures.

On the algorithmic side, you can think of distillation as an algorithm. We do precision control with FP32 and FP16. And then, if you look at the picture, you can do pruning: with inference, the computation graph never changes, so we look at the inference computation graph and prune it to get the best performance from the algorithmic side.

Next, you also have to invest in the SDK and model format. The SDK and model format let you take this optimized model and compile it for the required hardware, and when you do this compilation, the compiler can identify which paths can be optimized further.

For example, AWS Neuron is the compiler we use for Inferentia, we use TensorRT from NVIDIA for GPUs, and we use ONNX; all of this is hosted on a Triton Inference Server.
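As a small illustration of the SDK/model-format step, here is a hedged sketch that exports a placeholder PyTorch model to ONNX so a downstream compiler or runtime (TensorRT behind Triton, for example) can optimize it for the target hardware; the model and shapes are assumptions for the example.

```python
# Sketch of exporting a model to ONNX as an intermediate format (illustrative only).
import torch
import torch.nn as nn

class MyEncoder(nn.Module):                      # placeholder model
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(768, 256)
    def forward(self, x):
        return self.proj(x)

model = MyEncoder().eval()
dummy = torch.randn(1, 128, 768)                 # placeholder (batch, tokens, hidden) input

torch.onnx.export(
    model,
    dummy,
    "encoder.onnx",
    input_names=["hidden_states"],
    output_names=["embeddings"],
    dynamic_axes={"hidden_states": {0: "batch"}, "embeddings": {0: "batch"}},
    opset_version=13,
)
```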

The last place where we have to make the best decision is the hardware stack. You have to choose the hardware that actually meets your latency constraints and your IMR budget.

What we have learned is that different models have different characteristics, so the cost may differ across hardware types. Make sure that you spend time evaluating across three or four different hardware types, including CPUs, and measure the IMR cost before you release.

So I've explained the journey we have been on for the last 1.5 years. But what are the results we achieved on the M5 side? We were able not only to train 100-billion-parameter encoder models, but to train and converge them. Converging is very hard because of the infrastructure and algorithmic dependencies, but we were able to converge them and use them in production. And not only that, we trained multiple 100-billion-parameter models to try different hypotheses, all on AWS hardware, and the experimentation capability it gave us was pretty high.

The second thing we enabled in our team was close to 10,000 experiments, or jobs, per month. This let our developers try new ideas and new hypotheses and then decide whether it's a fact or just an idea. That's the velocity we needed to move our team. We trained our models on Trainium, p4d, p3dn, and DL1 and measured which one gives us the best price-performance ratio, and we worked with our collaborators to make sure we are getting the best performance from the underlying hardware.

Lastly, we were able to meet 10-millisecond latency for 1-billion-parameter encoder models, on both GPU instances and Inferentia hardware, at less than 10 milliseconds. What's the benefit of being under 10 milliseconds for a 1-billion-parameter model? It means you don't have to distill your model and lose accuracy; you can deploy these models and get the full benefit of the large models in production while delivering customer value.

For example, if you want a model to be called for every click happening on Amazon, you need significantly low latency. We were able to prove that out and deploy these models in production. And that's where the journey ends.

I also want to say that none of this could have been done without the great collaborators we had. We had a great collaboration with AWS: the AWS deep engine science team, who taught us, enabled us to go and experiment, and let us use the capabilities those teams are building; the AWS Batch team, who took care of all our infrastructure management, including managing the cluster jobs, at almost zero cost to us; the NVIDIA folks, who continuously worked with us to optimize our models on NVIDIA hardware; the Meta PyTorch team, whom we go to when we want to know what features are coming, whether an API is going to change, and whether we are making the right choices for serialization; and the Amazon FSx team, who gave us a lot of insights into the data storage layer for the network file system used for large model training and gave us the most cost-effective way to manage it. These are the collaborations that enabled us to be successful.

Before we wrap up, I also want to go over the key takeaways we learned on this journey, other than the science behind the large models. As your team grows, you should focus on reducing the cost of and friction in your experimentation velocity. What we did is focus on reproducibility, reliability, and debuggability.

Second, what we learned, and are still learning, is that training large models to convergence remains a challenge. We lean on debugging and checkpoints, and we do a lot of experimentation restarting from checkpointed files to get where we want.

Third, profile your model, observe your model's performance, and also look at the hardware statistics to understand where you can do better. Once you can reduce the IMR and increase the performance, your team will be able to run more ideas, more experiments, and more tasks that can be used in production.

Those are the key takeaways from this presentation. With that, I'd like to thank you for listening to the journey we had in Search, taking these large models to production and having a significant production impact. James and I will be available down front if you have any follow-up questions.
