Train and host foundation models with PyTorch on AWS

All right. Thank you all so much for coming out. I know it's 7 p.m. on a Wednesday, and I'm sure you've had long days and might be hungry. So what I can promise is we'll show you lots of pictures, and by the end of tonight you'll be really excited to go and try generative AI.

So hi, everyone. My name is Aia Bindle. I'm with the product team at AWS AI, and I have my co-presenters here with me. Yong Jun, would you like to introduce yourself? "Hi, my name is Yong Jun. Nice to meet you." "And my name is Ku Mu. Nice to meet you too."

Awesome. So today we're going to talk about foundation models. We'll start by explaining what foundation models are, how you should think about them, and why they are important. Then we'll go into some detailed examples of what it takes to train, fine-tune, and deploy these models. And finally, we'll hear from our friends at LG about some of the amazing work they've been doing with these foundation models. So let's get started.

So you might have seen on Twitter recently this pretty amazing, insane video that two guys in Los Angeles made. They took Stable Diffusion, which is a text-to-image model: you give it a text prompt and it generates images. And they made this entire video out of it. All they did was fine-tune it on a handful of samples, use text prompts, and they got these incredible images and videos with a bare minimum of post-production and traditional visual effects. The result was something you might expect from a professional studio that spends months of effort with top-tier teams and millions of dollars, and they maybe spent a couple of thousand bucks using Stable Diffusion and made something really amazing.

So the question is, how did we get here? How did we get to a point where we can generate things like this from nothing more than text?

Now, one of my favorite XKCDs from about eight years ago is the joke where someone asks an engineer, "Can you make me an app that tells you if a photo was taken inside a national park?" And the developer goes, "Yeah, sure, no problem. That's easy." Then: "Can you also tell me if the photo contains a bird?" And the punchline of the comic is, "I'll need a research team and five years." In 2014, the idea that you could just detect a bird inside a photo was a punchline. It was something people just laughed about. And since then, we've come such a long way; we've made so much progress in beating what we call human baselines.

So there are different benchmarks for language, image, video, natural language understanding: different types of tasks that AI has become really good at. And what we like to see is how we've gotten closer and closer to the human baseline. Each of these lines represents a different type of benchmark used by researchers to measure the quality of an AI model. For example, MNIST is a benchmark that tests whether AI can identify human handwriting: characters, numbers, and so on. With the MNIST dataset, you can see that we made progress, but it took almost a decade until we became as good as the average human. Then you see some benchmarks like ImageNet starting around 2008, 2009, and we got really good, really fast. About a year and a half after this comic came out, ImageNet models were doing it. So what was the punchline of a joke in 2014 is probably a homework assignment in a CS class today; there's probably a workshop somewhere where you're being asked to do this in real time in like 30 minutes.

And if you look at the most recent benchmarks, the gray and the slightly orange lines at the end, they're almost vertical. The rate at which we're seeing these improvements, the pace at which these models are getting better and reaching human baselines, is accelerating.

Now, how did this happen? One of the main things at work here is scaling laws. The underlying AI models that were powering all of this increase in accuracy were becoming bigger, training with more data, and using more compute.

On this graph, from a really interesting paper that I encourage you to read, the authors looked at computing trends in AI since the 1950s, and what they found is that you can see three different segments. There's the initial segment until about 2008, at which point deep learning starts to really take off. Then you see another breakpoint right around 2015, 2016, when the era of really large models begins. The y-axis shows the number of floating-point operations, so you can think of that as a proxy for how much compute these models need.

Now, what's interesting is that the compute used by new breakthrough models is increasing by about 10x every 10 months. So every 10 months, you get a model that's 10 times bigger, that's breaking some new barrier and doing significantly more impressive things. And if you look at some of the recent domains, like text-to-image with Stable Diffusion that we saw at the beginning, that rate has actually increased even more; it's now about every 2 to 3 months. So the scaling laws are what's powering these increases in accuracy.

Now, when you make these models really big, that's when you get this concept of foundation models. Researchers at Stanford coined the term "foundation models," and it really took off. The reason I personally like it as a phrase for describing these kinds of models is that they represent very large corpora of knowledge. You're compressing lots of language data, lots of image data, lots of video, and you're able to create one very large model that you can then reuse across different domains and industries. You don't need to keep training smaller, separate models for every single thing.

Look at examples like GPT-3, which some of you may have used, or DALL-E, Jurassic from AI21, OPT from Meta, Stable Diffusion and Stable Diffusion 2 from Stability AI, and BLOOM from Hugging Face. These are all foundation models with tens or hundreds of billions of parameters, and they allow you to leverage what's already been trained on a huge amount of data.

There are three ways in which you can interact with them. The first is to do inference with them. You take the model as is, give it a prompt, say, "Tell me what the capital of Spain is," and it gives you a response back. You're not giving any new data to the model, and you're not retraining or fine-tuning; you're just using the model as is. And sometimes, if you want to point the model in a particular direction, you can give it a few examples during inference, what's called zero-shot, one-shot, or few-shot learning. You say, "This is my question or my prompt, and here's an example of the direction I want you to think in." You're not actually changing the model's parameters; you're just giving a few examples.
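To make the few-shot idea concrete, here's a minimal sketch using the Hugging Face transformers library. The model choice and prompt are illustrative, not what any particular provider uses:

```python
# Minimal sketch of few-shot prompting: the pre-trained model is used as-is,
# and the in-context examples live entirely in the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompt = (
    "Q: What is the capital of France?\nA: Paris\n"   # in-context example 1
    "Q: What is the capital of Japan?\nA: Tokyo\n"    # in-context example 2
    "Q: What is the capital of Spain?\nA:"            # the actual query
)
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```

No parameters change here; deleting the two example lines turns the same call into zero-shot inference.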

The second is that you can fine-tune and then deploy a customized model. You can say, "Actually, my domain is unique, I have some custom vocabulary, and I'm going to fine-tune this model." So you take the big pre-trained model, fine-tune it, and then deploy it for inference, and it's more accurate for you.

One benefit of this approach is that you can actually distill the model, making it smaller. You can say, "I don't need the network with 100 billion parameters. I know exactly what use case I have, so I can shrink and distill the model down and fine-tune it with some data." Now that model becomes a lot cheaper to deploy, and the inference latency gets better because it's a smaller model.
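As a rough illustration of the distillation idea (a generic sketch, not a specific AWS recipe), the student is trained against a blend of the teacher's softened outputs and the ordinary labels:

```python
# Sketch of knowledge distillation: a small student learns to match a large
# teacher's softened output distribution while still fitting the labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the fine-tuning labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```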

The downside of this approach is that it requires you to have the expertise to fine-tune, and it requires you to have the data to fine-tune with. It also requires you to manage some of the compute infrastructure, or use a service like SageMaker that lets you trigger these fine-tuning jobs. And then finally, there's model drift. Let's say you use a BLOOM model, or a Jurassic model from AI21, and you fine-tune it. Now AI21 releases version two of the model, but when you fine-tune on top of version two, something stops working, and you need to repeat the entire process. So fine-tuning has a lot of potential, but the bar goes up a lot to make it work for you.

Now, the third one, which is novel and interesting, is reinforcement learning agents. It's in a way combining the best of both worlds. Instead of bringing a lot of labeled data to a model and saying, "Let me train this model with my data," you're actually bringing the model to the data. You're saying that this knowledge already exists inside the model, in its latent space; it understands something about context, language, whatever specific domain you're looking at. So let me try to generate some content from it, and then let me give it some feedback.

So you can have a human in the loop, which is just a way of saying someone who's giving thumbs up or thumbs down as the model is making its predictions. Or you could have your end users give feedback. Imagine you're engaging with a chatbot, or you have some kind of generative image module within your app. The developer makes that available to the end user, the end user tries it and then gives some feedback in real time in the app, and that feeds back and makes the model better.

What's nice about this is that you can scale the reinforcement learning aspect across your entire user base, across different segments of users, and keep making the model better. The downside is that it's relatively new, so it's more research-y at this point and less mature. But I think this is going to be a big area of development, and there's a signal that most of the large research organizations are now heading in that direction. And finally, you need more UI and product integration: if you want to really leverage the reinforcement learning approach, you need something in your app that lets you incorporate that feedback from users, as in the sketch below.
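As a conceptual sketch only (the names below are invented, and real RLHF pipelines are far more involved), the product-integration piece can start as simply as logging (prompt, response, reward) tuples from thumbs-up/down clicks for later reward-model or RL training:

```python
# Conceptual sketch of human-in-the-loop feedback collection. Generate a
# response, then record a thumbs-up/down signal alongside it; the log is
# later consumed by a reward model or RL fine-tuning job. Names are invented.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
feedback_log = []  # stand-in for a real feedback store

def generate(prompt):
    return generator(prompt, max_new_tokens=30)[0]["generated_text"]

def record_feedback(prompt, response, thumbs_up):
    # 1.0 for thumbs-up, 0.0 for thumbs-down.
    feedback_log.append(
        {"prompt": prompt, "response": response, "reward": float(thumbs_up)}
    )
```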

Now, going back to the rate at which compute requirements are increasing: one of the things we've noticed is that hardware is getting better. If you look at the generations of GPUs, hardware performance increased by about 4x when you compare NVIDIA's V100 to the A100, and I think there's a similar 4 to 6x with the new H100s. The problem is that the GPU compute requirements for state-of-the-art models increased by 1600x, and that leaves a big gap. Hardware alone is not going to keep up; you need software to optimize the performance of these models. And that's exactly where we've been investing a lot at AWS: developing the tools and machine learning engines to train and serve these models efficiently.

Now, to give you some examples of what this looks like: we do a lot of optimization to make all of these workloads run well on a single machine with GPUs and on multiple machines, so you can train on a distributed cluster. Our customers find this incredibly valuable, because when you can utilize the underlying GPUs and network infrastructure better, you're able to train faster, which means your engineering velocity goes up, and you're able to train at lower cost, which means your overall bill goes down. In a very recent benchmark that we actually talked about today: Stability AI, the company behind Stable Diffusion, now uses AWS to train all of its foundation models. And what we found was that if we go and tune the algorithm and make the framework utilize our infrastructure and network better, we can improve the training speed of the model by 58%. In terms of how that impacts velocity and the number of workloads you can run in parallel, it's really an order-of-magnitude shift for them, because now they can run two different projects instead of one on the same cluster, finish those quickly, and then start two more, and so on. The innovation flywheel just accelerates.

In addition to Stability AI, we also have a really exciting example that we'll talk about with LG. The last thing I'll mention before we hand it over to Yong Jun is that, in addition to training, there are a lot of optimizations you have to do for inference. Once you have these big models, you need to deploy them. When that happens, you look at things like compilation, you look at model servers, and you look at ways to distill and quantize the model so it fits on a GPU and runs more efficiently. We're also investing in custom silicon like Trainium and Inferentia, so we can reduce cost while giving you better latency and higher throughput.
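As one small example from the inference-side toolbox, here's a hedged sketch of post-training dynamic quantization with stock PyTorch. The model is a stand-in, and this is just one of several quantization approaches:

```python
# Sketch of post-training dynamic quantization: Linear layers are converted
# to int8, shrinking the model and often improving CPU inference latency.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```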

So with that, let me hand it over to Yong Jun to talk more about some detailed examples of how you can train these models. Thank you.

Thank you. Hi, good evening. My name is Yong Jun. I'm a specialist SA on the Korea team at AWS. I'm here to talk about how to train and host foundation models on Amazon SageMaker. Let's get straight into it.

To utilize the training environment, you first upload the training script for the foundation model to SageMaker, for example from Amazon CodeCommit or a Git repository. Then prepare a training container image with the packages needed to run the training script. If only the installation of Python packages is required for training, we can use a SageMaker built-in container image. In this case, the packages are installed automatically when a training job is launched, by adding the package names to a requirements.txt file in the same directory as the training script.
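A minimal sketch of what that looks like with the SageMaker Python SDK; the role ARN, framework versions, and instance settings are placeholders:

```python
# Sketch: launch a training job on a SageMaker built-in PyTorch image.
# Packages listed in source_dir/requirements.txt are installed automatically
# when the job starts.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # the training script
    source_dir="./src",              # contains train.py and requirements.txt
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
)
estimator.fit()
```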

In addition, to create a custom container image, a SageMaker built-in container image can be used as the base image. We then simply write commands to install the Python or Linux packages we need on top of it. We are able to use the custom container image after uploading it to Amazon ECR, which is a fully managed container registry service.
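Once the custom image (built from a SageMaker deep learning container base image) is pushed to ECR, pointing a training job at it is a single parameter. The image URI and role below are placeholders:

```python
# Sketch: use a custom image from Amazon ECR for a SageMaker training job.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-train-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
)
estimator.fit()
```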

Lastly, a dataset and pre-trained artifacts such as pre-trained weights and pre-trained tokenizers are required. These datasets and pre-trained artifacts, ranging from hundreds of gigabytes to terabytes, need to be easily and quickly readable during foundation model training, and Amazon FSx for Lustre can address that as a high-performance file system.
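Attaching the FSx for Lustre file system to a training job might look like this sketch; the file system ID and paths are placeholders, and the job's VPC configuration must be able to reach the file system:

```python
# Sketch: mount an FSx for Lustre file system as a training input channel so
# large datasets and checkpoints are read at file-system speed.
from sagemaker.inputs import FileSystemInput

fsx_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder FSx ID
    file_system_type="FSxLustre",
    directory_path="/fsx/datasets/corpus",   # placeholder path on the mount
    file_system_access_mode="ro",
)
# estimator.fit({"train": fsx_input})
```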

Continuing on: leveraging these services to train foundation models works well even if we don't have the infrastructure capabilities to run hyperscale GPU clusters ourselves. Simply set the number and type of instances in the cluster using Python code, and we get the cluster we want and can start training right away. In other words, we don't have to set up the network or copy the training script to each instance in the cluster. This is because the Amazon SageMaker training service provides GPU instances as well as fast network connectivity between the instances, with 400 Gbps and GPUDirect RDMA support over EFA.

So as data scientists, we only need to focus on preparing the dataset and the training script to train the foundation model.

Now I'll introduce the BLOOM model. BLOOM is one of the largest language models, with up to 176 billion parameters. It is based on the Transformer architecture and improves model performance by stacking multiple BLOOM blocks, each of which has an attention mechanism and a multi-layer perceptron. The BLOOM model is commonly used for language generation, language translation, question answering, and more. We can use Hugging Face to train the BLOOM model. Hugging Face makes it easy to leverage, in the training script, the many datasets and pre-trained artifacts for training BLOOM models: it downloads them automatically as soon as the training job launches, after you define the dataset name and BLOOM model name to use.
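For illustration, here's a sketch with a small BLOOM variant; the dataset is an arbitrary example, and the full 176B model needs the checkpoint handling discussed next:

```python
# Sketch: with Hugging Face, the tokenizer, weights, and dataset are pulled
# automatically by name when the training job starts.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")  # illustrative dataset
```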

However, the largest BLOOM model ships as a sharded checkpoint divided into 72 different files, and the total download size is quite large, around 700 gigabytes. That's why it is recommended to download them on a separate CPU instance, using a SageMaker notebook or SageMaker Processing, in a preprocessing step; that way we can save training time and cost. It's better to save the result of the preprocessing step on FSx for Lustre and reuse the saved files through cache management or the load_from_disk method in Hugging Face.
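A sketch of that preprocessing step, assuming the FSx for Lustre mount path shown is where you want the cache to live:

```python
# Sketch: fetch the sharded checkpoint files once on a cheap CPU instance and
# park them on FSx for Lustre, so GPU training jobs never re-download ~700 GB.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bigscience/bloom",
    cache_dir="/fsx/hf-cache",   # FSx for Lustre mount path (placeholder)
)
```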

For example, when setting up the SageMaker training job, we can add the tokenized dataset and model weight paths stored on FSx for Lustre, which will be used as inputs to the BLOOM model. They can also be used in the training script by passing the dataset and model weight paths inside the training container image as hyperparameters.

When you want to train the foundation model with the SageMaker model parallel library using Hugging Face's Trainer class, we only need to add the distribution parameter when setting up the SageMaker training job. That's because SageMaker and Hugging Face are well integrated: we can easily train and host Hugging Face models on SageMaker with model parallel libraries such as DeepSpeed, the SageMaker model parallel library, and so on. The parameters for pipeline parallelism and microbatches are required, and we can add tensor parallelism, distributed data parallelism, and mixed precision with sharded optimizer state if we want to use them.
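A sketch of what that distribution parameter can look like on a Hugging Face estimator; the parallelism degrees, versions, and role are illustrative:

```python
# Sketch: enable the SageMaker model parallel library via the distribution
# parameter of a Hugging Face estimator.
from sagemaker.huggingface import HuggingFace

smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 2,   # required: pipeline parallelism
        "microbatches": 4,               # required: microbatching
        "tensor_parallel_degree": 4,     # optional: tensor parallelism
        "ddp": True,                     # optional: data parallelism on top
        "shard_optimizer_state": True,   # optional: sharded optimizer state
        "fp16": True,                    # optional: mixed precision
    },
}
estimator = HuggingFace(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
```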

Hugging Face also makes it easy to host the BLOOM model on SageMaker using a trained model stored in Amazon S3. We can use the Deep Java Library (DJL) container images provided by SageMaker to host the foundation model, and we can also write an inference script if we want to customize how the trained model is used for prediction. Compared to using DeepSpeed, in a benchmark test we achieved higher throughput using the SageMaker model parallel library, as you can see in the table here; we tested a 30-billion-parameter GPT-2 model with sequence lengths of 512 and 2048.
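A hedged sketch of the hosting side; the container version, S3 path, role, and instance type are all placeholders:

```python
# Sketch: host a large model with a DJL serving container on SageMaker by
# pointing a Model at the DJL image and the trained artifacts in S3.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
model = Model(
    image_uri=sagemaker.image_uris.retrieve(
        framework="djl-deepspeed",
        region=session.boto_region_name,
        version="0.19.0",                # placeholder container version
    ),
    model_data="s3://my-bucket/bloom/model.tar.gz",       # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
)
```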

Next, I'll introduce other foundation models, such as Stable Diffusion and VQGAN-based models, which generate images based on input text. In most cases, distributed data parallelism accelerates the process of training on text and image data. In addition, the training scripts of these models are written in various frameworks such as TensorFlow, PyTorch, and so on. As PyTorch Lightning is currently one of the most widely used frameworks, I will explain more about using the SageMaker distributed data parallel library with PyTorch Lightning.

Of course, we can use SageMaker to leverage various distributed data parallel libraries in PyTorch Lightning. Using the SageMaker distributed data parallel library simply requires defining the distribution parameter when setting up the SageMaker training job, so that the DDP strategy of PyTorch Lightning in the training script is converted to SageMaker distributed data parallel.
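Inside the training script, the Lightning-side change can be as small as this sketch; the model and data module are assumed to be defined elsewhere, and the estimator side enables the library via its distribution parameter:

```python
# Sketch: switch PyTorch Lightning's DDP strategy to the SageMaker data
# parallel backend inside the training script. On the estimator side, the
# job is launched with distribution={"smdistributed": {"dataparallel": {"enabled": True}}}.
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy
import smdistributed.dataparallel.torch.torch_smddp  # registers "smddp" backend

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy=DDPStrategy(process_group_backend="smddp"),  # instead of "nccl"
    max_epochs=1,
)
# trainer.fit(model, datamodule=dm)  # model/dm defined elsewhere
```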

When training foundation models, the CPU bottleneck is the most commonly cited problem. Multiple cases have revealed immense demands on computing resources due to misconfigurations and/or inefficiently written training scripts. When you look at a resource bottleneck, you'll often see that the training speed is not consistent from step to step, and CPU or system memory utilization is close to 100%. To address this resource problem, we can modify inefficient preprocessing code in the data loader, or move specific computations out of Python packages such as OpenCV and the Python Imaging Library.

It is also effective to use NVIDIA DALI in the data loader. DALI directly reduces CPU usage by processing part of the data loading on the GPU instead of the CPU. DALI requires a training pipeline to read and augment the data, and then we have to connect the pipeline to feed the data through a DALI iterator, such as DALIClassificationIterator. We found that replacing the PyTorch data loader with the DALI data iterator mitigated the CPU bottleneck and increased GPU utilization with the SageMaker distributed data parallel library.
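A minimal DALI sketch of that pattern; the data path and preprocessing parameters are illustrative:

```python
# Sketch: a DALI pipeline that decodes and augments JPEGs on the GPU, fed to
# the model through DALIClassificationIterator instead of a PyTorch DataLoader.
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIClassificationIterator

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def train_pipe(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")      # decode on GPU
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images, dtype=types.FLOAT, output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels

pipe = train_pipe("/fsx/images")  # placeholder data path
pipe.build()
loader = DALIClassificationIterator(pipe, reader_name="Reader")
```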

Now we'll hear more about the customer story for foundation models from a research scientist at LG AI Research. Thank you.

Thank you, Yong Jun. Hello, I'm Guangong, working as a research scientist at LG AI Research. Today, I would like to present a case study from our company. We have been using AWS services for about a year and a half, and Amazon SageMaker in particular has been very helpful in training foundation models with large-scale data. In this presentation, I would like to tell you about the features of the AWS services we have used so far and the foundation model we trained using those features.

First, let me give you a brief overview of our company. The official name of our company is LG AI Research. I'm sure some of you have already heard of LG. In the LG group, there are several affiliates in areas such as energy and chemicals, LG CNS, or LG Electronics, which is famous for its home appliances. The goal of LG AI Research is to solve problems and strengthen our affiliates' competitiveness through AI technology. We were founded about two years ago with the vision of "Advancing AI for a better life."

Now, let's take a closer look at what we do. Our job is to utilize the latest AI technology to help solve the difficult problems our affiliates face. For example, we use AI for battery life-cycle forecasting, new drug development, or vision inspection. This has significantly reduced the cost of manufacturing or services and also allows us to work more efficiently.

In addition to supporting our affiliates, we also conduct our own research. We present various research papers at the top-tier AI conferences every year, and our current area of focus is the foundation model. A foundation model refers to a model that is pre-trained with a huge amount of data, composed of various types of data, so that it can be used for various tasks with a little bit of fine-tuning. The foundation model is not just a research topic but a practical model that is expected to help our affiliates solve their various problems.

The name of the model we developed is EXAONE, which stands for "expert AI for everyone." The idea behind EXAONE is to train a huge model using a variety of multimodal data and then apply it to multiple downstream tasks in the form of applications. There are several applications we have in mind, but currently we are using text and image data for language processing, text-to-image generation, and image captioning. We trained the EXAONE model with 600 billion text tokens and 250 million images.

However, we encountered some challenges while training a foundation model. First, because we use both text and image data, the size of the dataset grows to hundreds of terabytes, so we had to be able to manage such data effectively, and data input and output had to be done very quickly. In addition, in order to capture such a large amount of data in the model, the number of parameters of the model must also be very large. Therefore, it required a lot of resources, especially GPU resources that could be used with flexibility, and computation speed was also important for faster training.

And lastly, we needed a system to utilize and deploy the trained model to various downstream tasks for our business affiliates. In order to solve these problems, we initially thought about building internal infrastructure. However, it was estimated that an infrastructure maintenance team of about 20 people would be needed to purchase, set up, and maintain the GPUs, so a significant initial cost would be required. In addition, since new GPUs are released approximately every year, we determined that there would be a problem with replacing them every time, and we also needed the elasticity to use more GPUs during certain periods of time.

Then we started looking into AWS for help and found that we would be able to solve these problems using their solutions. For example, we used Amazon FSx for Lustre to manage data and achieve faster data input and output during model training. SageMaker also enabled us to run distributed training at very large scale; we were able to adjust the number of instances and experiment with just the right number of GPUs as we needed. The computational speed was also significantly improved by using the SageMaker data parallel feature.

And the EC2 instances made it easy to run inference and fine-tune our model. In particular, being able to access these solutions without much difficulty was the biggest advantage for us. The schematic diagram looks like this. SageMaker operates in the same way as Yong Jun from AWS explained earlier, and what the user needs to do is very simple.

First, put the data to be trained into an S3 bucket and then sync it with FSx for Lustre. The linked FSx for Lustre file system is connected to the instances and provides the training data very quickly. After slightly modifying the code used in the local environment to fit the SageMaker environment, you can proceed with SageMaker training from a notebook. The generated model is saved in the S3 bucket and can easily be retrieved on an EC2 instance for fine-tuning and inference.

As I just mentioned, the training code needs to be slightly modified to fit the SageMaker environment. The same goes for SageMaker data parallel. This is code that applies SageMaker data parallel instead of the existing PyTorch distributed data parallel: just add a few simple lines, and you will see the job done (a sketch follows below). The applied SageMaker data parallel is much faster compared to the existing PyTorch distributed data parallel. Of course, the amount of performance improvement varies depending on the training environment or the number of GPUs, but in our case, when using SageMaker data parallel, we were able to train about 10,000 iterations per day. This is about 60% faster than the 6,300 iterations we recorded when using PyTorch distributed data parallel. The time required to train one epoch was also greatly reduced, resulting in a cost saving of about $70,000.
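In plain PyTorch code, the swap can be as small as this sketch (our actual training script is not shown; this is a generic illustration):

```python
# Sketch of the "few simple lines" change: initialize the process group with
# the SageMaker data parallel backend instead of NCCL and keep the rest of
# the PyTorch DDP training code untouched.
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import smdistributed.dataparallel.torch.torch_smddp  # registers "smddp"

dist.init_process_group(backend="smddp")  # was: backend="nccl"
# model = DDP(model.cuda())               # unchanged from the original code
```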

This is how SageMaker enabled us to train the model more simply and quickly. From here, I will show you the experimental results of our trained model. The first is the image captioning task, which automatically creates a corresponding caption from an input image. Whereas conventional models generate a short, single-sentence caption for each image, our model describes the image in great detail. You can also tweak this a bit to generate related keywords rather than sentences: if you provide an image, it will generate several appropriate keywords, such as in hashtag form.

Next is the image generation model. This is a field that has been receiving a lot of attention since OpenAI's DALL-E research. When the user enters the desired sentence, the corresponding image is generated. The model does not simply search for images that correspond to the text but understands and creates them. For example, in the case of "a pumpkin-shaped head," the corresponding object is not in the data, but the model understands and combines the properties of each object to generate a new one. Fine-tuning this model enables various tasks.

First, you can upscale a small image to a high-resolution image by applying it to the image super-resolution task. You can also modify the image using the image editing task. Let's say you request to create a new object in a masked area: a flower is created next to the bed, and a painting of a flower is created above the bed in a style that goes well with the surroundings. In addition, you can change just the color of an object without changing its shape.

It is also possible to give prior information about the image regions. If you give the model segmentation information as input to generate an image corresponding to the sentence "a surfer standing on the beach," the model creates an image that fits both the regions and the text.

OK. So we were amazed at how easily we could use AWS features for diverse purposes; they enabled us to train our model effectively. We were particularly impressed with the scalable use of multiple GPUs with SageMaker and the significant increase in training speed using the SageMaker data parallel feature. The AWS team also helped a lot with troubleshooting several issues. We hope to continue this kind of collaboration, and we plan to use SageMaker's model parallel feature to further increase the size of our model and train it more efficiently. We are also willing to leverage SageMaker's inference features for more effective model management and deployment.

Ok. Thank you for listening.
