Simplifying the adoption of generative AI for enterprises (sponsored by NVIDIA)

Hello, everybody. Thank you for coming to our talk on simplifying the adoption of generative AI for enterprises. Just a little introduction to ourselves.

My name is Peter Diki. I'm a solutions architect at NVIDIA, focused mostly on multimodal learning and large-scale training. And this is my co-presenter, Zhong, our expert in inference, who'll be covering that in the second half of the talk.

Just to start with a quick agenda for today: we're going to begin with some of the work our team does with AWS and the generative AI collaborations we've done, then NVIDIA AI on AWS, including some of the software and hardware we offer, and then we'll go into some of our frameworks.

We'll start out with at-scale training on AWS: how do we scale out, and how do we make everything efficient? Then Zhong will take it from there with deployment using Triton and acceleration with TensorRT. And at the end we'll have some calls to action and conclusions to take away from the talk.

So the first thing I want to start with is that NVIDIA isn't just a hardware company; we're a full-stack AI company as well. We have optimized containers and optimized software for all different types of use cases across the board, and it all stacks up. At a low level, you can think of the frameworks we've optimized: PyTorch, TensorFlow, JAX.

Then we also have frameworks on top of this. The one we'll be going into today is the NeMo framework for generative AI, but we also have others like Merlin for recommender systems and BioNeMo for bio use cases. Across the board, we have tons of different frameworks that can be used at higher levels for these different use cases and industries as well.

So we're not just a hardware company; we're also trying to optimize across the board on the software side.

So, to start with our collaboration between NVIDIA and AWS.

We have a bunch of different GPUs on offer, all the way from the cloud to the edge. At the high end, the most performant option is the P5 instance, which is the H100 instance that's offered right now. This is where you can do all of your large-scale training and HPC workloads that need the highest-end GPUs with the most performance and the most memory.

Then we also have other instances, like the G5 instance, which uses the NVIDIA A10G GPU and is used a lot for ML inference. It's a little smaller, but it's great for inference and for cost-performance.

And then, for graphics workloads, we have the G5g instance with the NVIDIA T4G GPU.

You can also use the Triton Inference Server, which we'll go over later, for deploying your models and managing and batching all your requests. It also integrates with SageMaker, which is something our team and Zhong have worked on as well, to make it seamless to deploy models on SageMaker.

And you can go all the way to the edge with our Jetson platform. These are embedded devices, so you can train on one of the higher-end GPUs and then deploy at the edge on a Jetson as well.

And all of this optimized software is offered on the AWS marketplace as well.

So one of the use cases we've actually worked on is with the Amazon Catalog team, where our team helped them optimize for inference. Their service automatically generates product listings for sellers, and with TensorRT and Triton Inference Server the Catalog team was able to decrease their latency by 3x and get 2 times more throughput, just by using our frameworks to optimize inference.

Another team we've worked with is Amazon Music. This is their spell-correction model: when users are searching, it automatically spell-corrects the input. With TensorRT and Triton, they saw latency reductions of 63% and 73% versus CPUs.

They also got to a 25-millisecond end-to-end latency. And our team helped them use Tensor Cores and other parts of the H100 to get a 12x reduction in training time as well, along with a major boost in GPU utilization.

The last one is Amazon Search. This is real-time spell check for product search, so it takes in tons of queries all the time and has heavy throughput and latency requirements. They were able to get to sub-50-millisecond latency and deliver 5 times the throughput using Triton and TensorRT as well.

And using the Triton Model Analyzer, something that might have taken weeks, finding the correct configuration, can be done in hours.

So, a bit more about NVIDIA AI on AWS.

So NGC. This is, I think, one of the most important parts of the talk, because it's where you can actually get all of our NVIDIA software. NGC is where we publish all of our containers.

You can imagine that getting all those PyTorch dependencies working efficiently on GPUs is difficult. We have PyTorch containers, TensorFlow containers, containers for all of our frameworks. And we also publish our models there, so you can take any of the models our internal research teams have trained and use them.

It has all of our frameworks, such as BioNeMo, NeMo, and Riva, and these are all updated regularly. They're transparent, so you can check that they've gone through security checks, and then you can pull them as well.

We also do all of our early access through NGC. So if you sign up for early access, say for our multimodal containers, you can get that through NGC as well, and it makes for much simpler deployment and use.

OK. So I just wanted to go through what I was talking about when I said that we have frameworks at every single level. These are just a couple of examples.

You can think of RAPIDS, with its Spark acceleration and cuDF, as drop-in replacements for GPU-accelerated data science and analytics. Or cuML, which you can think of as a drop-in replacement for something like scikit-learn. And then for recommender systems, we have Merlin.

This is for all of those huge embedding tables: being able to manage all those DLRMs is what Merlin can help you with, and it's all GPU-optimized so you can use it for super large workloads as well.
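
To make the drop-in replacement point concrete, here's a small sketch of what a pandas-style workflow looks like on GPU with RAPIDS cuDF. The file name and column names are placeholders, not something from the talk.

```python
# Hypothetical example: a pandas-like aggregation running on the GPU with RAPIDS cuDF.
# The file name and columns ("transactions.csv", "amount", "customer_id") are made up.
import cudf

df = cudf.read_csv("transactions.csv")        # data loads straight into GPU memory
top_customers = (
    df[df["amount"] > 0]                      # same filtering syntax as pandas
      .groupby("customer_id")["amount"]
      .sum()
      .sort_values(ascending=False)
)
print(top_customers.head())
```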

For speech and vision, we have TAO, Riva, and DeepStream. DeepStream lets you optimize video input and all of the preprocessing that goes on there. With TAO, you can automatically get optimized training and inference models for your computer vision workloads, and Riva is our speech AI framework.

And then for inference, which Zhong will be talking about later, we have the Triton Inference Server. This is an inference server that lets you easily deploy your models, whether it's with a TensorRT backend, a Python backend, or any other backend, and manage those requests and optimize for them.

And the last one we'll go into a little bit, for NLP and more generally generative AI, is our NeMo framework. This is the framework that lets you train, fine-tune, and also deploy all the super large models that are coming out these days.

So, just quickly, we have the NVIDIA H100. This is the P5 instance on AWS. It's one of our newest chips and is now generally available on AWS.

It has second-generation MIG (Multi-Instance GPU), which lets you take a single GPU and split it into up to seven fully isolated instances, so workloads don't run into each other at all.

It has the Transformer Engine, one of our newest features for training some of these really large models. It lets you do FP8, which I'll show you a little bit of in a second. That's a lower precision, which means you can go a lot faster and use a lot less memory as well.

It also has fourth-generation NVLink for much faster networking between the GPUs, plus DPX instructions and confidential computing as well.

So the H100 is great for a lot of these super large generative models and LLMs. You can see that just from the A100 to the H100 we can get a 4x reduction in training time. For some of these super large models, the tuning times go down drastically as well, by about 5x.

And there's a drastic difference in inference throughput between an A100 and an H100 as well. This is because we now have faster memory, faster networking, the GPUs themselves are faster, and with FP8 you can squeeze out even more inference performance.

And for training, a little bit more on the Transformer Engine. This is one of the biggest optimizations that came with the H100, besides just having faster chips: it allows for up to 6x faster training and inference because it dynamically adapts between 16-bit and 8-bit math within your network.

A lot of the GEMM operations in your network can actually be done in a lower precision and accumulated in a higher one, without a loss of accuracy. And when you're running at these lower precisions, you get a lot better performance. It's a little bit confusing, but the Transformer Engine is actually part of the chip.

But we also have a library called Transformer Engine that's all open source on GitHub, and it gives you layers that can be drop-in replacements in regular PyTorch code. So if you have a PyTorch linear layer or a transformer block, instead of importing from PyTorch you can import from Transformer Engine, and it will run on A100 and H100.

That same code can automatically be used with FP8 and do all the optimizations for you. And all these different blocks are fully configurable: you can take a single layer from Transformer Engine, or you can take a whole transformer block.

So a whole GPT model, or something like that, can be built from the Transformer Engine blocks on GitHub as well.
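
To make that concrete, here's a minimal sketch of the drop-in pattern using the open-source Transformer Engine library for PyTorch. The layer sizes are arbitrary, and exact arguments can vary between releases, so treat this as an illustration rather than the canonical recipe.

```python
# Minimal sketch: Transformer Engine layers standing in for torch.nn layers.
import torch
import transformer_engine.pytorch as te

# te.Linear is a drop-in replacement for torch.nn.Linear.
model = torch.nn.Sequential(
    te.Linear(1024, 4096, bias=True),
    torch.nn.GELU(),
    te.Linear(4096, 1024, bias=True),
).cuda()

x = torch.randn(8, 1024, device="cuda")

# On Hopper (H100), this context runs the supported GEMMs in FP8;
# on Ampere (A100) you would run the same layers with enabled=False.
with te.fp8_autocast(enabled=True):
    y = model(x)
```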

I just want to touch on the A100, which is also generally available. This is our previous-generation GPU that's used for a lot of large-scale training, and for a lot of the inference on some of these larger models as well.

This would be your P4d or P4de instance, the difference being that the P4d has 40 GB per GPU and the P4de has 80 GB per GPU. It has all the Tensor Cores; it doesn't have FP8, but it can do all the BF16 and FP16 mixed-precision training that you know from some of these larger models. And this is just a broad overview of the major GPUs that are on Amazon EC2.

So you have your T4s. This is the one a lot of people know as the ML inference GPU; it's what a lot of people run their smaller networks on because it's super cheap, so it's a good cost-per-performance GPU. That's the G4dn instance.

And then the G5g instance, with the T4G, is mostly for your graphics and gaming workloads. It's not necessarily for ML workloads, but you can run your graphics workloads there as well.

The G5 instance is the A10G, which we already mentioned. It gives some of the best performance for graphics, HPC, and also cost-effective inference. It's got a little more memory than the T4, and it's Ampere, so it has faster performance. You can do all of your inference there, and that's the one we see a lot of people deploying on right now, just because it's a good price-performance GPU for inference.

There's the A100, as we already mentioned, in the P4d and P4de: ML training and HPC, with the ability to scale up with good networking.

And then the newest one is the H100, which is your P5 instance. And obviously today we announced that the H200 and the Grace Hopper GH200 are also coming to AWS. Those will be for scaling out to even larger models, with better networking, better performance, and larger memory.

OK. So now I'm going to go through training generative models. This is where we get into the frameworks.

The main framework is the NeMo framework.

"So this is an end to end framework for generative AI that our team has been building. Um, so it has everything from the data curation and we'll go into that a little bit later to distribute training. So how do I take these huge models that won't fit into a single GPU and be able to distribute them across thousands and thousands of GPUs without losing the efficiency with all the networking overhead and all of that that comes with that all.

Then model customization. Obviously, right now in research there are a lot of different techniques for model customization, like p-tuning. We have algorithms within NeMo so you can take your pretrained models, or models you trained within NeMo, and do all the SFT and fine-tuning on them that you'd want to do.

Then, with the models you've built within NeMo, you can take our NeMo inference containers and automatically deploy them with TensorRT-LLM and Triton. So you don't have to go through the long process of manually converting these models into inference-friendly formats to deploy them. Our team has already done the work to manage a lot of this, and it keeps getting better, for an easy transition from training to inference.

And then at the end there's Guardrails, which is similar to what was announced at AWS today. NeMo Guardrails is our open-source repository that lets you steer your model so it can do inference without hallucinating, or leaking information you don't want it to when it goes to retrieve something. This is generally available with NeMo enterprise. We also have a multimodal container, which we'll go into again, that's in early access; if you apply, you can get access to it. That covers all of your text-to-image and image-to-text models, which can all be trained with NeMo as well.

As I just mentioned, we support model architectures across different modalities. You have your basic language models: your GPTs, your T5s, your BERTs; encoder, encoder-decoder, and decoder-only models are all supported. Text-to-image models too, like Stable Diffusion, Imagen, CLIP, and different vision transformers, and then image-to-image with DreamBooth and InstructPix2Pix. Another one I want to mention is that we also have support for LLaVA, which is about how you combine image and text, like what you'd see with something like ChatGPT.

Building generative AI foundation models is a very difficult process. You have mountains of training data that's very difficult to curate: how do you make sure the model isn't training on things you don't want it to train on, and how do you manage all that training data in the first place? You also have massive infrastructure to manage. When you're scaling out, tons of problems happen: how do I organize things so I'm getting the best performance, and not losing any of that performance because of how I've configured my software? The other part that's really difficult is that you need people with the technical skills to do this. A lot of people might be good at deep learning, but when you go to scale, everything gets a lot harder, and it's a different skill set than what a regular deep learning practitioner might have. The hope is that NeMo will ease you into this and make it a lot easier to run these larger and larger models. And then, how do you manage these complex algorithms? Building an algorithm that works across all of these different nodes is really difficult, and debugging parallel programs across them can be hard. NeMo provides that out of the box through our different tools, and we'll go into those a little bit as well.

So, at a high level, these are the different pain points we're hoping to solve with NeMo. For large-scale data processing, we have data curation and data preprocessing tools. We have multilingual data support, and we can go to longer sequence lengths through RoPE interpolation as well. For finding the optimal hyperparameters, doing those large sweeps to find the configuration that runs best, we have the hyperparameter tool. For convergence of models: it's a lot harder to find a recipe that converges for a large language model, and you don't want to be rerunning that tons of times, so our team has internally verified recipes for the learning rates and schedules for all these GPT and T5 models, and you don't have to go do that yourself. It obviously works on the cloud as well, so on AWS we automatically have tools to scale up to your node count, whatever number of GPUs, whatever model you want. We can also deploy inference at scale automatically through TensorRT-LLM and the Triton Inference Server, which Zhong will talk about a little later. We also have tools for benchmarking your models automatically, so you can see how they're doing on things like MMLU, and we'll actually be open-sourcing a new tool soon, I believe, for taking a new model and automatically testing it against all these open-source benchmarks that everybody loves to compete on. And again, we have FP8 support automatically: all the recipes in there will train in FP8 right on Hopper. So if you're on A100 right now and you want to move to Hopper, you can do it automatically and get FP8.

The first thing I want to talk about is the data curation tool. This is the tool for building those large pretraining datasets; these models are often trained on trillions and trillions of tokens. So how do I go through the internet, pull all of that down, and then go through the curation process? We have a GPU-accelerated tool, the data curation tool, that will download, extract, deduplicate, and filter the documents, and do it at scale in a performant way. It can download HTML, LaTeX files, and PDFs, and then do language detection, so you can say, hey, I only want English or French, or these specific languages, and filter the rest out. Then we have GPU-accelerated document-level deduplication, both fuzzy and exact, and document-level quality filtering. You only want the highest-quality documents inside your training data, because what you train on is what you get out: if you have bad data in there, you're going to get a model that reproduces those problems. So we have these different filtering techniques, all GPU-accelerated, and also downstream task deduplication as well.

At the core, where all the power of NeMo comes from, is Megatron Core. If you're familiar with Megatron, it was the original paper with all these different parallelism techniques that allowed us to train that 530B model, I think about three years ago now, and we now have it in Megatron Core. This is a PyTorch-based library for scaling any of your transformer models. It has all the core blocks you need to build your LLMs, or multimodal models as well. Inside, we have tensor parallelism, pipeline parallelism, and a distributed optimizer, which is similar to what you'd think of as FSDP or ZeRO. We recently released expert parallelism for mixture-of-experts models: if you have tons of different experts and they don't fit on one GPU, how do you distribute those? There's distributed checkpointing, and all of these are modular transformer layers, so you can specify, I want this activation, I want this layer, and build them. Within it we also have already-configured GPT, BERT, and T5 models, plus different datasets, so you can take your data, put it into a performant format, and train on it.

One thing I want to go through quickly is what these different parallelisms are. I won't go into the details, just a very high level, but these are what we usually refer to as 3D parallelism. If you're familiar with data parallelism, you take the model, duplicate it completely across different GPUs, and each copy gets different data. Model parallelism is what we commonly use to get better performance when scaling these models, because the model no longer fits on a single GPU, so you have to split up the model itself. Tensor parallelism is intra-layer parallelism: you horizontally split your model's layers. You can imagine that in this model, layers 0 to 5, each layer is actually split across different GPUs; if your tensor-parallel size is two, each layer is split across two GPUs, and they communicate to accumulate the results so you get the same output as you would running on a single GPU. It reduces the amount of computation on each worker, which is great, and it lets you scale up to a much larger model that no longer fits on one GPU. It's actually pretty easy to implement for most models, and it allows for good strong scaling. Most of the time we use tensor parallelism within a single node, because within a node you have NVLink, where you get really good networking, so it scales really well there.
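
As a toy illustration of the idea (not Megatron Core's actual implementation, which also handles gradients, initialization, and communication overlap), here is what splitting a single linear layer column-wise across GPUs looks like, assuming torch.distributed has already been initialized:

```python
# Toy, forward-only sketch of tensor (intra-layer) parallelism with PyTorch.
import torch
import torch.distributed as dist

class ToyColumnParallelLinear(torch.nn.Module):
    """Each rank holds a vertical slice of the weight and computes its part of the output."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.weight = torch.nn.Parameter(torch.empty(in_features, out_features // world_size))
        torch.nn.init.xavier_uniform_(self.weight)

    def forward(self, x):
        y_local = x @ self.weight                         # this GPU's slice of the output
        pieces = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
        # NOTE: dist.all_gather is not autograd-aware; real implementations use a
        # differentiable gather so gradients flow back during training.
        dist.all_gather(pieces, y_local)
        return torch.cat(pieces, dim=-1)                  # same result as the unsplit layer
```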

The third part of 3D parallelism is pipeline parallelism. This is where you're vertically splitting your layers: GPU zero might get layers zero and one, the second GPU might get two and three, and so on. This one is a little harder to implement for most generic models, but it lets you scale even further than tensor parallelism alone. And there are techniques we've implemented within Megatron Core, such as interleaved scheduling, to reduce the pipeline bubbles. All of these parallelisms are already implemented for the common GPT models you might want to use.

The last parallelism I'll talk about in Megatron Core is sequence parallelism. This is where we split the tensors along the sequence dimension, which lets you reduce the memory consumption from the activation tensors. In the same paper (these are all in the Megatron papers) we have selective activation recomputation. This is a memory-compute trade-off: we can selectively say, these are the operations that take up a lot of memory but not that much compute, so we drop those activations and recompute them in the backward pass. It's a really efficient way to reduce activation memory. And coming up, we have an extension of this called context parallelism, which is in Megatron Core right now and will be released in NeMo, that lets you scale up to super large sequence lengths. For all these models you might be seeing with 100,000-token context lengths, the problem becomes: where do I put the memory when the activations blow up because my sequence length is so long? We have a new, I guess you'd call it 4D, parallelism that lets us go up to, say, 100,000 or 125,000 context lengths and train those models as well.

So, the impact of the high cost of training LLMs. There's the economic impact: it costs a lot of money to do this. There's the environmental impact: running through all of that energy is costly as well. The barrier to entry can be high, because you need engineers who can do it and you need all that money. And there's also resource contention: a lot of people have a cluster that's shared by a bunch of researchers who are all trying to use the compute for their own research. So being efficient with these LLMs becomes important, because otherwise you're taking resources away from other researchers.

And what's actually driving this? It's not just dataset size, model size, and training volume; there are also inefficiencies in how we approach these, where we redo things that NeMo hopes to help with. To get a stable hyperparameter configuration, it might take multiple runs. To get better scaling efficiency, you might have to configure your tensor and pipeline parallelism differently, and run different inference tests and other tests along the way to find the right configuration to train your model. The hope is that NeMo has tools and preconfigured settings you can just take off the shelf and run, so you're not spending all that money on tests that have already been done for you.

And so one tool I want to highlight in NeMo is the auto-configurator tool.

It's got two different steps. You can put your training constraints or your inference constraints in, and it'll give you back a model size and a configuration. We'll go into this a little bit, but it would be like: hey, I've got this much compute and this much data, what should my model size be?

But a lot of people come in already knowing the model size they want to run with, and that's fine. You can take your own model size and put it into the second stage of this pipeline, which is the efficient hyperparameter search. It takes your model configuration, for example, I want to train a 30B GPT-style model, and automatically does an optimized grid search over the training parameters to find a configuration that gives you the fastest performance.

We've already done a lot of testing on what good TP sizes and distributed-optimizer settings look like. There are a thousand different ways you can configure these things across thousands of GPUs, and we do an efficient grid search so you don't spend all that time finding it manually. I've seen a lot of people spend a lot of time just manually hunting for the TP size or PP size that gets them the lowest iteration time. We have a tool that does that for you automatically.

An example of the input might be the model size, the number of GPUs, the vocab size, and the token count, and it'll run the grid search over those parameters to find the best performance. You can also give it training constraints, as I mentioned. So you could say, hey, I have a GPT-3-style model and I want to run on 60 nodes for six days, and it might say, hey, the 5B-parameter model is the best way to go about this. There are different scaling laws it can use to find that configuration for you.

We also work with a bunch of different foundation models. Our team produces our own NeMo foundation models in different sizes: for example GPT-8B if you need faster responses and don't necessarily need the highest accuracy, GPT-22B for a different balance of accuracy and latency on complex tasks, and for larger models you can use GPT-43B, which is much larger.

We also support different community models, and we keep adding new models as they come out and the state of the art moves. Falcon, Llama, MPT, and StarCoder are already supported out of the box with NeMo, so you can use them as well.

We also have a suite of model customization tools. I know a lot of people don't necessarily do pretraining, because that costs a lot of money, but you can do different kinds of fine-tuning at different levels of complexity depending on what you need. We have prompt engineering, where you configure your prompt to get the best performance out of your model. This is the simplest level; you don't necessarily need to modify or train the model itself.

There's also p-tuning and prompt learning, where you actually learn the prompt tokens. So you might have a small dataset, and you tune the prompts, or do p-tuning, so that one of these foundation models is better suited to your task.

The next stage is parameter-efficient fine-tuning. These are things like LoRA or different adapters, where you train a small subset of weights that are then added back into the model. LoRA is probably the most famous one, and it's also supported out of the box, so you can fine-tune on your dataset and use it.
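
As a rough sketch of the LoRA idea itself (this is generic PyTorch, not NeMo's implementation): freeze the pretrained weight and learn a small low-rank update that gets added to the layer's output.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank path (B @ A)."""
    def __init__(self, base: torch.nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # pretrained weights stay frozen
        self.lora_a = torch.nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus the small trainable update.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

# Only the LoRA matrices (a tiny fraction of the weights) receive gradient updates.
layer = LoRALinear(torch.nn.Linear(1024, 1024))
```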

And then at the highest level you have instruction fine-tuning. This is one of the more complex ways to do it: supervised fine-tuning and RLHF, which I'll talk about in a bit, for when you really need to fine-tune for your task.

So if you're thinking about model customization and you have an enterprise LLM, whether you pretrained it or took one of these community models, you can bring the foundation model into NeMo, do prompt learning, supervised fine-tuning, reinforcement learning, or a combination of them (these can all be used together), and then deploy your model for inference with something like information retrieval over your own datasets, so that it knows your data as well.

Just a quick intro to the RLHF support we have. RLHF is one of the more impressive things within the NeMo framework, because it's very hard to do efficiently: you actually have multiple models all training at once, they have to be distributed efficiently, and it's very complex.

You have supervised fine-tuning, the first stage, where you have prompts and responses that you fine-tune your model on. So if you want a question-and-answer LLM, you have questions with answers and it fine-tunes on those first. Then, with human feedback, where you have a collected dataset of what humans think the correct responses to different questions are, you train a reward model on that dataset.

The third step is the actual RLHF, where you use PPO, a reinforcement learning algorithm, to optimize your foundation model, usually starting from the SFT model, until it's well trained to do the task you want. This is a very complex process that requires a lot of skill, but we provide recipes and very optimized code to do it efficiently within NeMo.

We also do RAG, or retrieval-augmented generation, through our retrieval pipeline in NeMo. This lets you take an LLM plus another model, your retriever, and go grab the context needed for your task from a database or other data you have, then pass it to your model so it's able to answer. So we can retrieve the different context the model needs to actually answer the question, without necessarily doing fine-tuning.

And one of the last things I want to highlight is NeMo Guardrails. One of the biggest problems is that these super large foundation models are very powerful, but they can go off and do things you don't want them to do. So we have rails built within Guardrails that you can use to guide your models where you want them to go.

We have topical guardrails: if you have a medical bot, you really only want it talking about medical things; if you have a financial LLM, you only want it talking about financial topics. You can have these guardrails built on top of it to make sure it doesn't go off topic.

There are also safety guardrails: you don't want your model producing certain things, or outputting content that might be hurtful, and you can filter that out with these rails. And the last one is security guardrails: if there's internal information you don't want the model producing, it can catch that as well, so your LLM isn't producing things you don't want leaked.
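
As a small sketch of how this looks in code with the open-source NeMo Guardrails package (the config directory holding the rail definitions and model settings is assumed to exist and isn't shown here):

```python
# Hedged sketch of wiring up NeMo Guardrails around an LLM application.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")     # Colang rails + model settings live here
rails = LLMRails(config)

# The runtime checks the conversation against the topical/safety/security rails
# before and after calling the underlying model.
response = rails.generate(messages=[
    {"role": "user", "content": "Can you give me medical advice about my prescription?"}
])
print(response["content"])
```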

Here's just a quick slide on getting started with the NeMo framework. Feel free to come up to us after the talk, and I can point you to all the different places where you can get the NeMo framework, the multimodal container, or the other pieces. We also have some GTC sessions that go a lot more in depth on the NeMo framework itself than we've had time for here.

OK. Thank you, Peter. For the next part, I'm going to talk about the inference side. Before I get started, I want to do a quick poll: have any of you heard of the Triton Inference Server? That's pretty cool. How about TensorRT? OK. How about FasterTransformer? OK, that's pretty good.

In the next part, I'm going to cover all of those. First, contrary to what people usually think, inference is actually also very challenging. It's a process that involves multiple groups of people, and we usually do it in different stages.

First of all, we have ML engineers, who are specialists in these models and understand how to optimize the model itself. After it's optimized, we have MLOps or DevOps engineers who deploy the model into actual production, so that we have an API, or some other way of exposing it to our customers and users, to make it an actual service. This is the part where you're actually earning money back from your investment.

So let's talk about inference serving. For the first part, there are six pillars of challenges we observe from our customers. First, we have different kinds of models for different use cases. For example, if you're more into computer vision, you might use traditional convolutional neural networks; we also have recurrent neural networks, graph neural networks, and even traditional ML models like decision trees. And typically different people are familiar with different frameworks and might choose different ones for these use cases. So having a unified solution is very important; otherwise you'll end up with a very messy architecture internally.

Also, based on your use case, you might have different query types. For example, if you're focused on real-time use cases, you might want to send batch-size-one requests to make sure the latency is as low as possible. And if you care more about TCO, or cost, you might want to send batched requests to reduce it.

And with LLMs becoming more and more popular, streaming is also very important. Otherwise, you need to wait until the whole answer has been generated before showing it, but with streaming you can see it generated token by token, which gives users a much better experience.

These kinds of large language models also require very heavy infrastructure investments, so you might want a hybrid setup with on-prem systems as well as cloud. To address all these challenges, NVIDIA introduced the Triton Inference Server.

Triton Inference Server is open-source software specially designed for a fast, scalable, and simplified inference experience. Triton supports any framework, any query type, and any platform, and it's DevOps- and MLOps-ready. That means it's production-ready, and we want our customers to have the confidence to deploy it directly into production.

And last but not least, since NVIDIA is a GPU company, we also designed Triton to be highly optimized, so you can utilize your hardware to its fullest and be very cost-effective.

So what does Triton look like? This is the architecture diagram of Triton. On the left side, you can see we have different kinds of clients; you can use whatever programming language you're familiar with. We support Python, C++, and even Java, and through the HTTP or gRPC protocol you can directly send a request to the Triton server.

After the payload is received, we have a dynamic batcher which, depending on your constraints, for example latency and throughput, can intelligently batch requests together and send them to a priority queue. Here, depending on the importance of different query types, you can assign different priorities so that your high-priority tasks are always executed first.

Following that, we have different backends. These could be TensorFlow, PyTorch, or ONNX, and of course, for the best performance on GPU, we highly recommend TensorRT, which we're going to introduce later.
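
To show what the client side of that architecture looks like, here's a small sketch using the tritonclient Python package. The model name and tensor names are placeholders for whatever your model repository actually defines.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Triton queues, batches, and routes this request to the model's backend.
result = client.infer(
    model_name="my_model",                    # placeholder model in the repository
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
print(result.as_numpy("OUTPUT0"))
```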

Alongside that, we have model orchestration, so you can dynamically load and unload models based on your business needs. We also support very detailed monitoring metrics, including hardware utilization, throughput, latency, and so on. This gives you confidence that your service is always in good health and notifies you quickly if anything goes wrong.

Here is just a quick summary of the backends that are currently supported. We talked about TensorRT, PyTorch, and ONNX; TensorRT is our most optimized inference solution, and it has integrations with all these frameworks as well.

And in today's talk, we're also going to focus on TensorRT-LLM. This is a new package we built on top of TensorRT to give you the best solution for large language models.

Next, we're going to talk about the general optimizations Triton introduces to give you better utilization of your hardware.

First, we have concurrent model execution. This feature means you can easily deploy multiple copies of the same model on one single GPU, or even multiple different models if you want to host them for different use cases.

Dynamic batching is also a very popular feature that we see customers enabling a lot. As we all know, GPUs are good at parallel computing, but oftentimes customers sending individual requests don't have that in mind, so it's very easy to underutilize the GPU.

With dynamic batching, we can batch all these small requests on the server side to form a larger batch and use the GPU to its fullest.

A very common question is: of course this optimization is helpful, but how do I know what configuration to set? For example, how many copies of the model should I host on this specific GPU, and what batch size or wait time should I configure for the dynamic batcher?

So, not to keep you waiting: that's where the Model Analyzer comes in. Basically, after you specify your latency and throughput constraints, you can directly run the Model Analyzer to do the benchmarking for you. With one single command it generates a very detailed report that tells you how the model does under different circumstances, which gives you much better confidence in what configuration you should be using.

And you can expect to get similar performance in production.

Moving on: as the logic of models becomes more and more complicated, usually one model by itself isn't enough to handle your task. So we're introducing Business Logic Scripting, which lets you add complicated logic like conditionals, loops, and even custom control flow.

On the left side, you can see a typical NLP use case: depending on the input language, you can feed it into different models. That's a typical example of how a conditional works.

And on the right side, we can also run a model in a loop. With recent large language models, decoder-only is actually one of the most popular architectures used today, and instead of running the model just once, we want to run it in an autoregressive way.

These features let you streamline the whole pipeline and make it seamless.

Next is decoupled mode. This is a feature that lets you break the one-to-one mapping constraint between inputs and outputs.

As I explained earlier, when you have a super long context, when you ask the model very long questions, you don't want to wait until the whole response has been generated. You may want to see the results as they're being generated, so you can easily interrupt or stop early, or give the model a more detailed prompt to make sure it's generating the results you want.
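
As a rough sketch of how a client consumes decoupled (streaming) responses with the tritonclient gRPC API; the model and tensor names are placeholders, and the model itself must be configured for decoupled mode on the server side.

```python
import queue
import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()

def on_response(result, error):
    # Called once per streamed response; a decoupled model may send many per request.
    results.put(error if error else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=on_response)

prompt = np.array([["Tell me a long story"]], dtype=object)
inp = grpcclient.InferInput("text_input", list(prompt.shape), "BYTES")
inp.set_data_from_numpy(prompt)

client.async_stream_infer(model_name="my_llm", inputs=[inp])
# ... drain `results` as partial outputs arrive, then:
client.stop_stream()
```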

And for any users who are new to Triton, we also provide an integration called PyTriton. This gives you a much closer-to-Flask experience in Python, so customers can have a very seamless migration while benefiting from all the optimizations I just talked about.
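
Here's a hedged sketch of what that Flask-like experience looks like with the nvidia-pytriton package; the function, model name, and tensor names below are placeholders.

```python
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(text_embedding):
    # Toy "model": in practice you would call your PyTorch/TensorFlow model here.
    return {"score": (text_embedding * 2.0).astype(np.float32)}

with Triton() as triton:
    triton.bind(
        model_name="toy_model",
        infer_func=infer_fn,
        inputs=[Tensor(name="text_embedding", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="score", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=64),   # lets Triton batch requests server-side
    )
    triton.serve()   # exposes the function behind Triton's HTTP/gRPC endpoints
```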

Cool. Currently, aside from all the success stories we talked about at the beginning in Peter's part, we have Amazon Search, Amazon Ads, Amazon Catalog, and many more.

Here is just a short list of customers who are currently leveraging Triton to make sure they're getting the best out of their GPUs. And if you're interested, here are some useful links you can check out to get started.

The standard GitHub link includes more detailed release notes and will give you a more detailed description of all these features and how they work.

Next: Triton is also integrated into SageMaker. I know a lot of you might already be super familiar with SageMaker, but if you're wondering what else you can do to boost your performance even further, the Triton Inference Server integration is definitely something worth checking out.

We provide a Triton Inference Server container hosted directly on ECR, so anyone can pull it and use it for their own use cases, and all of this can be driven directly from the SageMaker Python API.

We also provide very detailed code examples in both the NVIDIA and AWS GitHub repos. With both tools combined, you can expect your application to run in a more cost-effective, flexible, and also secure way.
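
As a hedged sketch of that SageMaker path: deploy a packaged Triton model repository with the SageMaker Python SDK. The image URI, role, bucket, and model name below are all placeholders; the real Triton container URI depends on your region and the version you pick.

```python
from sagemaker.model import Model

role = "arn:aws:iam::123456789012:role/MySageMakerRole"                                 # placeholder
triton_image = "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>"  # placeholder

model = Model(
    image_uri=triton_image,
    model_data="s3://my-bucket/triton-model-repo.tar.gz",     # packed Triton model repository
    role=role,
    env={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "my_model"},  # model to serve from the repo
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",   # A10G-backed instance family mentioned earlier in the talk
)
```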

Here are some useful links you can check out. We're also collaborating closely with the SageMaker engineers, and we've published tons of blogs for all kinds of different use cases.

So if you're interested, feel free to check them out; you might find your use case is already covered, and simply following the steps will give you a very seamless experience.

Cool. After doing all this, if the performance still isn't meeting your expectations, you might wonder: what else could you do? Is there any other technique, focused on the model itself, that you can leverage to boost performance to another level?

That's where TensorRT comes in. It's very GPU-specific and built by NVIDIA for our own GPUs, which is why we're very confident it should give you the best performance possible.

This is the same graph I was showing before, but basically, with TensorRT you can also optimize for different kinds of constraints, and we make it support all the GPUs NVIDIA is producing, including all the data center GPUs and also the embedded devices and so on.

A quick question is: why would you want to go to this extra effort to optimize your model? Here's just a quick summary of how much speedup you can get by introducing this framework.

As you can see, it provides acceleration for all kinds of use cases: for computer vision, that's 36x; for speech recognition, that's almost 600x, which is astonishing; and similarly for NLP, transformers, text-to-speech, and even recommenders. It's basically providing acceleration all across the board.

So what's happening underneath? First of all, we provide different kinds of quantization techniques. Compared to the training process, where you may want to use higher precision to make sure you end up with a model that performs well, during inference we don't actually need that high precision, and there are different quantization techniques that let you go even lower, to INT8 or FP8.

We also provide layer and tensor fusion. Surprisingly to a lot of people, the performance you get on a GPU can easily be bottlenecked by the CPU, because you keep launching small kernels, and all those small kernels have a lot of CPU overhead.

By running different kinds of fusion techniques, you can directly increase GPU utilization by forming much larger kernels. We also have all kinds of other techniques like dynamic tensor memory, multi-stream execution, tensor fusion, and so on.

And here is a bunch of resources you can leverage to try out TensorRT easily. But of course, you're here today not only for general TensorRT; you want something specific to large language models, to GPTs.

And that's where we bring in TensorRT-LLM. TensorRT has been providing state-of-the-art performance on a single GPU, but as these models grow larger and larger, a single GPU isn't enough to handle them.

That's why we found it very important to add new support for these kinds of large models, so you can run multi-GPU or even multi-node.

Previously, NVIDIA actually had a project called FasterTransformer. It provided very good performance, but it also had some challenges, because this area is developing so fast and new model architectures keep coming out.

And since FasterTransformer was written in C++, it was actually very hard to extend, especially for our customers. To resolve that, we brought it into TensorRT so we could provide a more modular API that's easily extensible while still offering state-of-the-art performance.

And here is a very quick summary of what the current performance looks like. As you can see, it provides state-of-the-art performance, and it's also tightly integrated with Triton, so everything works together to streamline your deployment.

We also provide different kinds of parallelism techniques, so you can split the model using either tensor parallelism or pipeline parallelism.

Here is the architecture of what TensorRT-LLM looks like underneath. As we were saying, it's based on the TensorRT runtime, but we bring in all the optimized kernels we developed in FasterTransformer, and we also introduce NCCL communication.

So we can run it multi-GPU and multi-node. Currently we already have a bunch of prebuilt models you can easily leverage; for example, if you're using GPT-style models, those are already supported, and you can directly follow the examples to run them easily.

And in order to support new models, we also expose all the underlying layers. For example, we have different kinds of operations and activation layers; all of those are exposed so you can do your own customizations on top as well.

And here is a very quick workflow of what the experience of using TensorRT-LLM looks like. Basically, you start with your pretrained model in its original framework, which could be PyTorch, TensorFlow, or NeMo.

As we discussed earlier in the session, you instantiate the model in its original framework and then run an engine-building command; that's just one line with TensorRT-LLM.

We're also putting a ton of effort into reducing the compilation time; today it only takes a couple of minutes even for a super giant model. And once it's been compiled, you can directly use Python or C++ to execute it.
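
As a hedged sketch of what that looks like with the high-level TensorRT-LLM Python API (available in more recent releases; older releases use per-model build scripts plus a lower-level runtime, and argument names can differ by version):

```python
from tensorrt_llm import LLM, SamplingParams

# Building/loading the TensorRT engine happens under the hood; the model id is a placeholder.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(max_tokens=64, temperature=0.8)

outputs = llm.generate(["What is generative AI?"], params)
for out in outputs:
    print(out.outputs[0].text)
```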

You also have the option to simply serve it with Triton. Another unique offering when you use it with Triton is in-flight batching. This is a super popular technique, also known as continuous batching, which came out of the ORCA paper.

On the top is what traditional batching looks like. As you can see, since your inputs are likely to have different sequence lengths, you need to pad them with meaningless tokens to form a matrix that fits into the model; that's what the black blocks represent.

But with in-flight batching, instead of doing that, you can feed in arbitrary sequence lengths and get much better utilization of the GPU. This means essentially no idle time at all.

With all these features combined, here's the kind of performance you can expect. With GPT-J 6B, you can see we can easily achieve a multiple-x improvement, and then there's Llama 2, and this is actually the 70B variant.

You can see it's actually not possible to fit that on one single GPU, so this is leveraging tensor parallelism, and you get about a 5x speedup there. And of course, we have more model support and more performance benchmarks coming out; if you're interested, feel free to go to our GitHub and check out the blogs we published recently.

Here are all kinds of resources you can easily leverage. We have the standard GitHub for the TensorRT-LLM package and also the Triton TensorRT-LLM integration, plus the launch blog and all kinds of API documentation you can refer to.

OK, with that, that concludes the inference side. So, some calls to action and conclusions to take away.

NVIDIA GPUs power the most compute-intensive workloads, from computer vision to speech and many more. And for training, it's always good to leverage the NeMo framework.

We also have Triton, which is an open-source model serving solution you can use for any type of model and even any type of hardware, and TensorRT for the absolute best performance on GPU.

You'll also want to check out TensorRT and TensorRT-LLM if you're using large language models. And of course, if you need any help, NVIDIA offers AI Enterprise support,

which gives you direct access to the experts inside NVIDIA and more confidence in adopting all these technologies.

Sounds cool. This is actually our last slide; feel free to stick around, and if you have any questions, we're happy to take them.
