Cutting-edge AI with AWS and NVIDIA

Recent breakthroughs in artificial intelligence have captured our imaginations. AI has become so rooted in our discourse over the past couple of months, and yet our journey with this technology has only just begun.

Today we'll talk about how cutting-edge AI is powered on AWS. We'll also talk about our partnership with NVIDIA, with whom we've been collaborating for over 13 years to build the most powerful machine learning instances in the cloud.

My name is BB and I'm a senior product manager at Amazon EC2.

Samantha: I'm a product manager at EC2.

Soorena: I'm a senior engineer at Pinterest.

We have a lot to cover today, so let's jump right in. First, we'll talk about three trends as we currently see them shaping the evolution of the industry. We'll then talk about our Amazon EC2 instances and our other infrastructure capabilities that are purpose built for machine learning.

Then we'll talk a bit about how to choose the right EC2 instances and set up for different machine learning use cases. And finally, we'll hear from our customers and we're really excited to have Soorena here today to tell us about all the exciting AI work happening at Pinterest.

So trends in machine learning, there are three trends we want to talk about today.

Starting off with the growth in large language models. This is easily the fastest growing use case that we've seen on GPUs this past year. And there are a few aspects of this growth that we want to focus on today.

First, the size of these models. A couple of years ago, the biggest models we saw in production were tens or maybe hundreds of billions of parameters. Today, the biggest models are over one trillion parameters.

To put that number in perspective, each parameter can take 1 to 2 bytes to store in memory. That means it takes over a terabyte to store just the model itself. At this scale, you can no longer fit the model on one GPU and you actually need to split it across multiple GPUs during training and inference. We'll talk more about this later on.
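As a rough back-of-envelope check of that claim (a sketch only; actual memory use also depends on precision, optimizer state, and activations):

```python
# Back-of-envelope estimate for storing the model weights alone.
# Assumes 2 bytes per parameter (FP16/BF16); training adds gradients
# and optimizer state on top of this.
params = 1_000_000_000_000          # 1 trillion parameters
bytes_per_param = 2                 # FP16/BF16
weight_bytes = params * bytes_per_param
print(f"{weight_bytes / 1e12:.1f} TB just for the weights")  # ~2.0 TB
```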

The second aspect of this growth is growth in the size of the data sets. One of the interesting trends we've seen this past year is that NLP models are increasingly starting to incorporate vision and other multimodal features. Vision data sets can be an order of magnitude bigger than language data sets and can easily push into the petabyte scale. So that's the second aspect - growing data sets.

The third aspect is growth in the size of these training jobs. A few years ago, the biggest training jobs in Amazon were hundreds or maybe thousands of GPUs. Today, the biggest training jobs can use over 10,000 GPUs in one job. I mean, that's fascinating to think about 10,000 GPUs working concurrently to train one model.

This has pushed us to think deeply about resiliency at scale to deliver the job uptime that our customers expect. As you can imagine, training on thousands of GPUs for several months can be extremely expensive.

To address this, we launched Amazon Bedrock earlier this year which features off-the-shelf pre-trained foundational models from Anthropic, Cohere, AI21 and others. We're really excited about this because it means that if you're a researcher at university or a startup or a nonprofit, you can access the most powerful models in the industry without having to train them from scratch.

Bedrock features a number of open models as well, like Llama 2 and Stable Diffusion. But one of the most exciting things about working in AI is that the open source community is actually super active across every layer of the stack, right Samantha?

Samantha: That's right. We're seeing AI companies build open source frameworks and libraries in addition to these foundational models. For example, Ray is a distributed execution framework that enables companies to train and run inference at distributed scale, going from a single laptop to thousands of GPUs.

Similarly, MosaicML is a full stack managed platform which helps you manage the orchestration of your instances as well as optimize your development time so that you can deliver solutions to your customers even faster and at a lower cost.

And at a lower level, we're seeing companies build dev tools like optimization libraries to help with your distributed training workloads such as Meta's FSDP or Nvidia's Megatron.

All of these frameworks, tools, and foundational models being open sourced are enabling customers to pick up new AI trends like generative AI even faster, unlocking new use cases such as more intelligent chatbots, code generators and coding assistants, as well as image, video, or music generation.

BB: Ok, so now that we've talked about these trends in AI/ML, let's talk about how they're impacting the way customers make their decisions as they're starting new ML journeys or adding to their existing ML workloads.

Samantha: Yeah, let's talk about that more. We see that there are a few considerations that are top of mind for customers when they're thinking through decisions around their ML infrastructure.

And typically the first consideration is performance. On the training side, performance determines how quickly your models converge, the key metric there being time to train, and you typically want this to be as short as possible because it means you can iterate on your models more, get to production quicker, and get to market quicker.

On the inference side, performance determines your inference latency or how quickly your models can respond to end users in production. If you want to deliver a real-time engaging experience to your end users, you need to make sure that you have the performance you need to enable that.

On the other hand, as customers push their performance bar even higher, they're starting to need more powerful GPUs in higher quantities, which is having a significant impact on their costs. Customers are now having to think more carefully about how to balance cost and performance to meet their customer expectations as well as their cost targets.

To help with that, AWS has the broadest portfolio of infrastructure to support various ML workloads so that customers can pick the best price performance for their individual workloads.

Oftentimes you may find that you need to scale from a handful of instances out to hundreds of instances. Maybe there's increased demand that you need to meet with your inference pipeline or you need to accelerate time to train to meet an upcoming milestone. But how do you make sure that your performance scales linearly with the number of instances?

This is a question we obsessively think about at Amazon, and one of our core offerings here is our EC2 UltraClusters, where we deploy co-located clusters of GPUs on dedicated networking infrastructure. We'll talk more about this later on, but this basically enables customers to scale their workloads out seamlessly to hundreds or thousands of instances.

And as you can imagine, as customers are scaling to thousands of GPUs in a single cluster, they can be harder to manage. So AWS offers managed services such as Elastic Kubernetes Service, ParallelCluster, SageMaker, and Auto Scaling to help you manage your fleet to make sure that you are getting your peak performance at optimized costs.

Additionally, we support popular ML frameworks such as PyTorch, JAX and TensorFlow so that customers can easily get started using our instances with whatever tools/frameworks that they are most comfortable with.

One of the most exciting trends we've seen this past year is that our customers are starting to tell us that energy efficiency and sustainability are top of mind for them, equally important as any of the other considerations.

We continue to innovate relentlessly at the instance level and at the infrastructure level to deliver increased performance per watt over time. AWS is fully committed to helping Amazon meet its net zero emissions target by 2040.

Ok, so now that we've talked about key customer considerations and needs, what different products and services do we have to offer our customers at AWS?

We offer a broad set of different services to help customers wherever they are in their AI journey. For example, for customers that are newer to ML and aren't ready to develop their own models yet, at the top of the stack, we deliver a set of AI services where customers are able to get API access to pre-trained models that can then be integrated into their existing customer applications and websites.

For customers that are more experienced with ML development and want to build their own models, in the middle of the stack, we have Amazon SageMaker and Amazon Bedrock, which enable you to build your own ML models or fine-tune pre-trained models without having to manage the infrastructure yourself.

And at the bottom of the stack, for customers that want to develop their ML models and also manage their own infrastructure, we have the infrastructure layer.

You'll see later in the talk from Soorena that it is important to be able to co-optimize your ML code and your infrastructure to drive the most optimal price performance for your workload.

So now that we've talked about that, we're going to dive deeper into the bottom layer of the stack to talk about the different services that we offer at the infrastructure layer.

So what do we have in the infrastructure layer? At the foundation of this layer is our EC2 instance portfolio. Our philosophy here is to provide as much choice as possible to our customers so that they can pick the right tool for the job.

Our accelerated instance portfolio features GPUs from Nvidia and AMD, custom accelerators from Intel and Qualcomm, and our own custom AI accelerators for training and inference. We seek to deliver a range of performance and price performance benefits across different ML cases through this portfolio.

Above this layer of compute, we have storage and networking. If you're training on petabytes of data, you need access to highly performant, cost efficient storage options like Amazon S3 and FSx for Lustre.

We'll also talk a lot more about Elastic Fabric Adapter or EFA in following slides to tell you more about how you can scale these distributed workloads across the network.

At the orchestration layer, we find that customers gravitate towards services like EKS and ParallelCluster to manage their GPU clusters so that they can spend more time focusing on innovating at the model architecture level rather than having to manage the complexities of their fleet.

And finally, we maintain a number of deep engagements in the open source community with PyTorch, Hugging Face and others so that customers can access the power of our scale right out of the box.

It's pretty incredible to think that you can take a container off the shelf, deploy it to an EKS cluster and within minutes run some of the most sophisticated algorithms in the industry. Some of these containers are actually optimized and accelerated for GPUs.

So let's talk more about our GPU instances and specifically about our Nvidia GPU instances. On the left, we have our P family instances which generally feature Nvidia's most powerful GPUs.

We launched P5 this summer, which features Nvidia's H100 GPUs built on their Hopper architecture. These GPUs are incredibly powerful. Each GPU packs 1000 teraflops of FP16 performance, and we build eight of these into a server for a total of 8000 teraflops of performance. This is a 3x improvement over our predecessor instance.

Nvidia is continuing to push their memory specifications as well. Each H100 GPU has 80 gigabytes of HBM3 memory at 3 terabytes per second of memory bandwidth. We're starting to see that certain use cases like LLM inference are actually a lot more sensitive to memory bandwidth than they are to GPU compute, and so Nvidia is continuing to open up new use cases in this realm.
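To see why decoding tends to be bandwidth-bound, here is a rough illustration with hypothetical numbers (it ignores batching, the KV cache, and kernel efficiency, so treat it as an upper-bound sketch only):

```python
# For autoregressive decoding at batch size 1, every generated token
# requires streaming roughly all the model weights through the GPU.
model_params = 70e9            # hypothetical 70B-parameter model
bytes_per_param = 2            # FP16 weights
hbm_bandwidth = 3e12           # ~3 TB/s of HBM3 bandwidth per H100

bytes_per_token = model_params * bytes_per_param
tokens_per_sec = hbm_bandwidth / bytes_per_token
print(f"~{tokens_per_sec:.0f} tokens/s upper bound per GPU")  # ~21
```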

In fact, today we just announced that we will be launching Nvidia's H200 GPUs that are further pushing the memory specifications for GPU.

For training, we provision 3200 gigabits per second of EFA networking for each instance, so that workloads can scale efficiently to multiple instances. And finally, we provision 30 terabytes of local storage for each instance, so that customers can cache their training data locally and read it directly to the GPUs - further simplifying training pipelines.

Our P4 instances continue to be incredibly popular as well. We have one of the biggest A100 fleets in the industry and many of the best known NLP models out there either were trained or actually continue to be trained on P4 instances today.

However, P4 and P5 are not always the right tool for the job.

Samantha: That's right. And that's why we built the G instances on the right here. These are built to be more versatile to support deploying your ML models as well as training smaller scale models such as computer vision models.

The G5 instances are our most cost-efficient GPU instances for deploying ML models and training at a small scale. They feature Nvidia A10G GPUs, which deliver up to 3x higher performance than the Nvidia T4 GPUs that we feature in G4dn.

For customers that aren't able to fully utilize these more powerful GPUs, the G4dn instances are our lowest-cost GPU instances for deploying ML models: they feature Nvidia T4 GPUs and start as low as 53 cents per hour on demand. This enables customers to easily get started with ML at low cost.

Both the G5 and G4dn instances are built with up to eight different instance sizes so that you can find the right amount of GPU, vCPU, memory, networking, and storage that you need for your individual workload. So you can make sure to cost optimize these instances for your use case.

Additionally, just today, we announced that we're launching G6 and G6e instances next year, which feature NVIDIA L4 GPUs and NVIDIA L40S GPUs.

So it can be a bit overwhelming to look at a bunch of specs and figure out which instance is right for you. One way to approach this is to consider the main design differences between our P family and G family instances.

So starting off on the left with our P family instances, there are two aspects we want to point out here:

  1. The NVSwitch layer at the bottom there. This connects all GPUs on an instance in a full mesh topology. On P5, for instance, any two GPUs can talk to each other at 900 gigabytes per second. So if you have an expensive collective comms operation in your algorithm, this becomes incredibly useful.

  2. The second aspect to point out is the PCIe layer in between. As you can see, GPUs and EFA devices actually share the same PCIe switches in our design. What this means is that GPUs can talk directly to EFA devices and then over the network without having to go through the CPU, which significantly improves latency and network performance in general.

So if you need the most performant GPUs, a strong interconnect, and the ability to scale out, then the P family is a good choice.

On the G instances, like I mentioned before, they're built for smaller training workloads or deploying your workloads. This means that these workloads typically fit within a single node or even in a single GPU. For that reason, customers don't need as fast GPU-to-GPU communication or GPU-to-other component communication, which enables us to simplify the design and remove that PCIe layer between the CPU and the GPU. This enables us to better cost optimize the infrastructure to meet your specific needs for different workloads.

Then you can pick between the G and the P instances depending on your needs and make sure that you strike that right balance between cost and performance that we've talked about before.

So we've mentioned the EFA a couple of times, but let's dive deeper here - talk more about what it is and why it's so crucial for machine learning.

First, a little bit of history. A few years ago, a group of engineers and entrepreneurs took a step back to design a new network transport protocol from scratch. This protocol is called SRD. There's a great paper out there that goes into depth on how it works, but there's one feature about this in particular that I want to point out today that's especially useful for machine learning training.

And that feature is GPUDirect RDMA over EFA. With this feature, a GPU on one instance can talk to a GPU on another instance by talking directly to the EFA device over the network and to the GPU on the other side, removing the CPU from the loop. This enables an order of magnitude lower latency for GPUs to talk to each other.

But how is this relevant to machine learning? Well, any distributed training workload will have what's called a collective communications operation so that GPUs can share information amongst themselves. You typically want this collective comms operation to be as quick as possible so that your GPUs can get back to work.
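To make the collective comms idea concrete, here is a minimal PyTorch sketch of the kind of all-reduce that runs between gradient steps. Running this over EFA typically relies on the NCCL backend plus AWS's OFI NCCL plugin, which we treat as environment setup outside this snippet; the model and launch details are placeholders, not a prescription.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce each gradient across ranks so every GPU holds the same
    averaged gradient before the optimizer step."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

# Typical setup, one process per GPU (e.g. launched via torchrun):
# dist.init_process_group(backend="nccl")
# torch.cuda.set_device(local_rank)
# ... backward pass ...
# average_gradients(model)
```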

As you can see on the graph, we improved completion times for our collective comms by up to 5x on P5 compared to P4. This means that certain network-bound workloads like FSDP have also seen a 5x improvement moving from P4 to P5. That's pretty incredible that we can deliver that kind of performance by focusing just on the network alone.

But how does all of this work behind the scenes? This is where our EC2 UltraClusters come into the picture. We actually completely redesigned our UltraClusters this year with the launch of P5. And there are two aspects of the new design that I want to focus on:

  1. Scale - we can now fit over 20,000 GPUs on one EC2 UltraCluster. And next year, we're pushing to even scale this higher.

  2. Latency - we flattened our network designs so that the maximum number of hops between any two instances on a spine has been reduced as well, driving a 15% latency improvement across the spine. This means that workloads can scale even more efficiently out to even more GPUs than ever before. And it's something we're super excited about.

But a question that we often get from customers is - do I even need an UltraCluster for my workload? So let's talk more about this question - about which workloads benefit from UltraClusters and more generally about how we think about the different requirements for different machine learning use cases.

And a good place to start here is with large language models, because they represent a range of requirements across compute, networking, memory, and storage. There are typically three stages in an LLM pipeline - pre-training, fine-tuning, and inference.

In pre-training, you're training your model from scratch on a large corpus of unstructured data. In fine-tuning, you take your pre-trained weights and you optimize them towards a certain data set or use case. And finally, when you've hit the accuracy and latency metrics that you need, you deploy these models to inference.

So let's talk about each of these three stages in a bit more detail:

In pre-training, these are the biggest workloads that we see on GPUs today. And as we mentioned earlier, they can leverage thousands of GPUs. There are a range of interesting parallelism techniques that have evolved over the past couple of years to leverage this scale.

We'd like to focus on one framework called 3D parallelism. So how does this work? Well, if you have a trillion-parameter model, for instance, there are two ways you can split that model across multiple GPUs:

  1. Pipeline parallel - put different layers of the model onto different GPUs

  2. Tensor parallel - take individual layers and split them across multiple GPUs

These are two of the three dimensions of 3D parallelism. The third is data parallelism, where you take your dataset and split it into different chunks that are fed to different GPUs.
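A toy sketch of the three axes, assuming two visible GPUs (illustrative only; real systems such as Megatron-LM or DeepSpeed manage this sharding automatically):

```python
import torch
import torch.nn as nn

# Pipeline parallelism: different layers of the model live on different GPUs.
stage_0 = nn.Linear(4096, 4096).to("cuda:0")   # first pipeline stage
stage_1 = nn.Linear(4096, 4096).to("cuda:1")   # second pipeline stage

# Tensor parallelism: a single layer's weight matrix is split across GPUs,
# each computing a slice of the output that is gathered afterwards.
full_weight = torch.randn(4096, 4096)
shard_a, shard_b = full_weight.chunk(2, dim=0)  # half the rows per GPU

# Data parallelism: each GPU sees a different slice of the global batch.
global_batch = torch.randn(64, 4096)
micro_batches = global_batch.chunk(2, dim=0)    # one per data-parallel rank
```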

As you can imagine, if different GPUs are working on different parts of the model and different parts of the data, they continually need to share information between them through those collective comms operations that we talked about so that their parameters and their gradients are all in sync at a regular cadence.

For this reason, these workloads tend to be extremely network intensive. And we recommend using our P4 and P5 instances deployed on EC2 UltraClusters to get the right combination of compute and networking for these workloads.

After fine-tuning, you can take one of two paths. Either you can distill your model down to a smaller version and deploy that to production, the benefit being that inferring a smaller model is more cost efficient and can give you better latency.

Or you can actually deploy a massive 100-billion-parameter model to production. If you are inferring a smaller model, we recommend using our P4 or G5 instances. If you're inferring a bigger model, then the top consideration for you is memory: you need to make sure that your model can fit onto one instance.

And as we discussed earlier, these models can actually be more sensitive to memory bandwidth rather than GPU compute. And so you generally look for the instances that have the highest memory specs like our P4d or P5 instances.

One last thing to point out here - when you're pre-training, you generally want your GPU deployment to be as co-located as possible to minimize latency between GPUs. But for inference, you actually want your GPUs to be as close to the end user as possible to minimize round-trip network latency. This creates a more real-time and engaging experience for your user because the model is that much more responsive.

So that's LLMs in a nutshell. But what about vision, multimodal, and other models?

Right, so vision and multimodal models and recommenders are the other two common types of models that we see customers training and inferring on today. Both of these tend to be smaller than LLMs, but they have more preprocessing requirements because they have larger image or video datasets or large embedding tables.

On the vision side, we typically see that customers workloads can fit in one of two groups:

  1. Discriminative - this means that you're classifying existing data points via supervised learning.

  2. Generative - this means you're trying to understand a dataset structure so that you can generate similar examples. And this is typically done with unsupervised or semi-supervised learning.

When you're doing generative training with semi-supervised or unsupervised learning, you don't need your data to be labeled. This enables you to train on even larger datasets, which means you need more resources and a more distributed infrastructure for your training.

For that reason, we often recommend that customers use our UltraClusters with P5 or P4de instances. When you're training a smaller discriminative model that doesn't need to be as distributed across multiple GPUs, you can scale down to the P4d or the G5 instances for better cost optimization on the instance.

On the inference side, we typically see that across both of these types of workloads, customers are able to fit their models on G4 or G5 instances for better cost optimization.

In some cases though, when customers are inferencing generative workloads where they're trying to push the bounds on the fidelity and size of the output, they're starting to see that they're surpassing the GPU memory available on our G5 and G4 instances.

So for those customers, we recommend using our P4d or P4de instances.

On the recommender side, customers typically have extremely large embedding tables that are continuing to grow. These tables are growing so much that they can no longer fit simultaneously in your GPU memory with your ML model.

Customers that are training typically need these embedding tables to be in their GPU memory. And so they have to start splitting across multiple GPUs. So similarly, for training those workloads, we recommend our UltraClusters with P4 instances.

As you start to do incremental retraining on these models, you can start to scale down to the G5 instances to cost optimize.

The difference between training and inference is that once you're ready to run inference, you no longer need to store those embedding tables in your GPU memory, as the lookups are simple mapping operations that can be handled by your CPU.
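A minimal sketch of why the tables don't fit in GPU memory and how a CPU-side lookup can feed the GPU model at inference time (sizes and the lookup pattern are hypothetical illustrations, not any specific customer's setup):

```python
import numpy as np
import torch

# Rough size of a hypothetical embedding table: 500M ids x 128-dim float32.
rows, dim = 500_000_000, 128
print(f"~{rows * dim * 4 / 1e9:.0f} GB")   # ~256 GB, well beyond one GPU's HBM

# At inference the lookup is a simple gather that host memory can serve;
# only the small batch of looked-up vectors is sent to the GPU model.
host_table = np.random.rand(10_000, dim).astype(np.float32)  # tiny stand-in table
ids = np.array([12, 42, 7])
dense_features = torch.from_numpy(host_table[ids]).to("cuda")
# dense_features now feeds the GPU-resident ranking model.
```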

So when it gets time to start inferring these workloads, you can start to use our G5 and our G4 instances to again cost optimize for that specific kind of workload.

Now, as we've talked about before, when it comes to inference, it's important to have these instances as close to your end users as possible.

So AWS offers the broadest set of geographic support of any cloud provider. We support 32 geographic regions and each region is built with multiple Availability Zones. These Availability Zones are discrete data centers that have redundant power, networking, and connectivity which enable you to build more resilient solutions.

In addition to our AZs, we also support Local Zones which are extensions to AWS regions that are built closer to large populations and industry centers so that customers can drive single digit millisecond latency to their end users.

Our GPU instances have been incredibly popular over the last year and in certain cases, demand has outpaced industry-wide supply leading to constraints.

To address this, a couple of weeks ago we launched EC2 Capacity Blocks for ML, where customers can reserve a block of P5 capacity on an UltraCluster at some point in the future. This gives customers peace of mind that they'll have the GPUs they need when they need them for an upcoming pre-training or fine-tuning workload, without having to hold on to an expensive reservation to guarantee capacity.

It's something we're really excited about because we're helping put GPUs in customers' hands without breaking the bank.

So now that we've talked about a few recent product launches and our two instance families in general, let's learn more about how customers put these capabilities to use.

The first story we want to talk about is with Adobe, with whom we've been working closely for several years. It's been really exciting to see Adobe embed generative AI into their core user experiences.

Earlier this year, Adobe launched Adobe Firefly, which allows users to input a prompt and generate an image. Since its launch, users have generated over 4 billion images through Firefly. This is something we're really excited about because the models that power Adobe Firefly were trained on the very same P4, P5, and G5 instances that we talked about earlier.

A second example we want to share is what we've been doing over the past few years with Aurora. They're an autonomous transportation company that delivers the benefits of self-driving technology safely, quickly, and broadly through autonomous trucks and passenger vehicles. They built a safety case framework with a principle of continuous improvement to help them do that. They are all in on AWS for all of their ML training and cloud-based simulation, leveraging our P4d, G5, and G4dn instances for all of their ML training and to run millions of simulations daily.

Now that we've shared those two quick examples, we want to hand it off to Soorena to talk more in depth about how Pinterest uses AWS to deliver ML workloads for their customers. Thanks.

Soorena: Thanks. Hello, everyone. I'm Soorena, the tech lead for the machine learning platform at Pinterest. In the next section of slides, I'll give an overview of how Pinterest has built its machine learning platform stack on top of NVIDIA GPUs and AWS.

To begin with, the agenda has four sections. We'll start with a 30,000-foot view of machine learning at Pinterest, then I'll talk a little bit more about model training and model serving, and we'll end with some key takeaways that we learned throughout our journey. In each of these sections, I'll try to highlight one of the key challenges that we saw and a key technology that we are introducing, and describe the journey of how we adopted GPUs at scale.

Before talking about machine learning itself, I just wanted to quickly point out that Pinterest's mission is to bring everyone the inspiration to create a life they love and we do this through multiple product surfaces.

The first one that you see is home feed where we recommend content that is curated for you based on your interests. So if you are interested in home decor or fashion, we recommend the content that is related to that.

The second one is related pins where we recommend content that is similar to the content that you are currently engaging with. So if you are looking for coffee tables, we recommend you similar coffee tables.

Shopping is another important surface where you can click on different products that you see in the images and you can shop for similar products on our app.

And the last one is a traditional text based search product where you enter in some keywords. And we recommend you content related to those keywords from our corpus.

The reason why I'm talking about this is because all of this, in the back end, is powered by hundreds of machine learning applications that are running on thousands of GPUs over web-scale data in real time.

For example, the home feed is actually a very powerful recommender system built on the transformer as its foundational building block. Related pins, the second surface, uses graph neural networks, where we learn from our taste graph of pins, boards, and users and recommend similar content using that.

We also heavily use representational learning where we learn embedding representations of the pins, the boards and the interactions that they have.

The third surface is the visual search experience. As you can imagine, we use all kinds of vision-based models there: segmentation, object detection, and vision transformers. The text-based surface uses traditional text retrieval engines, and we also use certain large-scale models here.

To give you a sense of the scale at which we operate, we have about 482 million monthly active users and about 475 billion pins that have been saved. So you can imagine the scale of the machine learning datasets that we operate on.

I've already described the product surfaces that these machine learning applications power, and it's actually a wide variety of ML models in our portfolio powering all of these surfaces.

Before I move on to model training, I just wanted to highlight three key themes that we learned throughout our journey of adopting GPUs. The first one is that GPUs drive innovation. We all know that GPUs provide a lot of raw compute power, but the key thing here is that they actually unlock innovation, because they allow us to ship certain state-of-the-art models which would otherwise be impossible to serve using just CPUs.

The second one is that one size doesn't fit all. What I mean by this is that using just one specific type of GPU, one specific instance type, or just one variant of a GPU is not enough anymore. Just one type of training setup or one type of serving setup is also not enough anymore. We have to tailor the setup for every single machine learning workload that we have. Even though we strive to unify as much as possible, it's not enough to have just one type of setup anymore.

And the third point here is that we need to co-optimize the model and the infrastructure together. You'll hear a lot more about this in the model serving section.

To give you a flavor of model training at Pinterest: all the machine learning training jobs at Pinterest are orchestrated through our unified machine learning API, which has very tight integrations with the rest of the Pinterest ecosystem.

Our training jobs overall cover on the order of tens of petabytes of data and thousands of features, so these are actually very large machine learning datasets.

We run all kinds of jobs, from single-node, single-GPU jobs to multi-node, multi-GPU jobs, and they span both distributed data parallel and fully sharded data parallel. On average, we run thousands of daily training jobs, and we have seen an unprecedented 100% year-over-year growth in the number of training jobs.
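As an illustration of those two distributed flavors, here is a minimal PyTorch sketch of wrapping a model in DDP or FSDP (process-group setup assumed, e.g. via torchrun, with one process per GPU; this is a generic sketch, not Pinterest's actual training code):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# One process per GPU, typically launched with torchrun.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()

# Option 1 -- distributed data parallel: every GPU keeps a full replica
# and gradients are all-reduced after each backward pass.
wrapped = DDP(model, device_ids=[local_rank])

# Option 2 -- fully sharded data parallel: parameters, gradients, and
# optimizer state are sharded across GPUs so much larger models fit.
# wrapped = FSDP(model)
```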

You can see on the chart below that for the last two years we have seen a doubling of the number of training jobs every year, the growth continues, and there is a corresponding growth in the number of GPUs that we consume.

To give a little bit of a deeper dive: the default training instance type that we use is P4d, which features the NVIDIA A100 family, and we are actively evaluating P5. The majority of our training runs on these P4d instance types, but we also have certain large-scale model training workloads which are extremely compute and memory bound. For these, we use P4de instance types, which feature the 80 GB memory variant of the A100.

The other traditional machine learning workload in the offline setting is batch inference. As Samantha said, the G5 instances are ideal for inference, and so the majority of our batch inference workload actually runs on G5.

And then there's a special case of incremental retraining, where our models train from scratch using P4d and then hourly retraining happens using G5. So you can see a single ML workload using a mix of P4d and G5.

The reason why I'm mentioning all of this is you can already see that just for one company's training workloads, we are using all four different types of instances here and this is on purpose because they are tailor made for those specific kinds of workloads.

There are two types of training jobs that we typically observe: ones that last for hours to days, which we typically run on our own internal Kubernetes orchestration platform, and certain long-running jobs that can last for weeks to months. You can think of these as the larger-scale pre-training jobs that you have heard of in the industry. For those specifically, we use AWS Batch and UltraClusters.

If I can draw your attention to the diagram, you can see that all of these machine learning training workloads are orchestrated through our unified API gateway. This is a unified entry point, and it actually allows us to innovate underneath, where you can see three different boxes.

The first one is Pinterest's internal Kubernetes infrastructure; within that you can see we have both G5 and P4d instance types. The second box is AWS Batch, in which we run P4d instances on their own or as an UltraCluster of P4d, and I'll talk a little bit more about this later. We are also excited about evaluating Amazon EKS because of its open-source friendliness, and we are exploring using it as well.

So what are these workloads that we run? As I said, Pinterest is a very real-time app where we recommend you content based on your interests, so recommenders are a very common workload that we see. But our recommenders are not the traditional recommenders that most are familiar with; they are actually state-of-the-art models. They can range anywhere from 100 million all the way up to 5 to 10 billion parameters in size if you consider the large embedding tables.

And the key part that I want to highlight here is that, compared to our CPU baseline just two years ago, the models that we currently serve are almost 100 times larger in the number of floating-point operations that they use. So these are very expensive models. They typically train on 7 to 30 days of training data, and they train on 8 to 16 GPUs using distributed data parallel.

A key part here is that training such models also requires heavy data preprocessing. And I'll highlight a key challenge that we see with this specific problem in the next set of slides.

The reason I have added the architecture diagram of our model is to highlight the most computationally intensive parts, shown in orange. You can see that it uses a transformer encoder for the activity sequences, a transformer mixer for feature crossing, and fully connected layers. These transformer blocks are the same foundational building blocks you see in the industry today for the GPT-style models.

The second class of models that we see is large-scale model training. You can imagine these are the most cutting-edge image and text based models that you see in the industry. They can be 7 billion parameters or more, and for these specific types of models we leverage the EC2 UltraClusters that were mentioned earlier.

And the reason why we use these is because training such models at scale requires you to actually shard the model and achieve both pipeline and tensor parallelism. And so they not only run on multiple GPUs, they actually need hundreds of GPUs that are scattered across multiple nodes. And so retaining a very low GPU to GPU communication latency is extremely essential to have a reasonable training time for these jobs.

For this reason, we use UltraClusters on AWS Batch. In addition to that, we use these UltraClusters with AWS EFA, which was also mentioned earlier, as a means to achieve very low latency GPU-to-GPU communication.

Some of the early wins that we have seen with this: these large-scale training jobs can speed up almost 2x in their job run time. In the diagram, you can see that these are still orchestrated through our internal infrastructure, where we call into the friendly APIs of AWS Batch and they orchestrate the UltraCluster for us.

You can see that the UltraCluster is a densely packed GPU cluster of P4d instance types interconnected using EFA. Even within these workloads, you can see a lot more diversification. I already highlighted the compute-intensive jobs, which do need UltraClusters and very low-latency GPU-to-GPU communication.

But as you can imagine, Pinterest is an ecosystem where you have pins, boards, and users, and you can imagine it like a graph. So graph neural networks are a very useful technique that we apply everywhere in our product. But training these kinds of models actually needs a heterogeneous mix of different instance types: you don't just need GPU instances, you also need memory-heavy CPU instances and storage instances.

These workloads use a mix of R5, P4d, and I4i instances, and we orchestrate them on AWS Batch, but you can see that they are not orchestrated on an UltraCluster. The thing I want to call out here is that, based on the needs of the job, you need to tailor the infrastructure that you use to train these models.

And the last one is again vision transformers, but they run on our internal infrastructure simply because we have GPUs available there. So again, we use different flavors of training setups just within the ecosystem at Pinterest.

Next, I want to highlight a specific key challenge that we see for machine learning at Pinterest, which is most applicable to recommender-style models. We all know that one of the key bottlenecks in machine learning iteration is velocity. Better velocity yields better models; it leads to more models, which eventually leads to more product engagement and a better product. But velocity is usually hampered by framework fragmentation: machine learning engineers need to know pandas, PyTorch, workflow systems, and many other tools to orchestrate these jobs, and so the overall dev velocity is slow.

The other part that we frequently see is with bulk execution frameworks. We have all adopted ETL-pipeline-style jobs where we use Spark for data processing, which is great. But for model-specific training, you want to preprocess the data right before it enters the model, and using bulk execution frameworks can slow you down because your feedback loop requires processing the entire dataset at a time, which can be slow.

The other issue that we see is with data preprocessing: before the batch is fed into your GPUs, you need to perform a lot of non-trivial CPU work, to the extent that your CPUs cannot keep up with the speed of the GPUs. You are not able to effectively saturate the GPUs because the CPUs cannot keep up, and this slows down the training throughput.

And what makes it worse at times is the fixed CPU-to-GPU ratio that you have on the instance types. For example, on a P4d instance there is a fixed ratio of 12 vCPUs to one GPU, so you basically cannot have more than 12 vCPUs per GPU. You can see this effect at play in the graph below - you can ignore the exact numbers. The blue line is where we have no preprocessing on the training data, the orange line is where we add one layer of preprocessing, and the green line is where we add a second layer of preprocessing. Every time we add expensive preprocessing, our training throughput takes a hit.

And what that means for us is that our training jobs are going to run longer and consume more GPUs, so it's going to cost us more. To fix this specific bottleneck, we are introducing Ray into our ecosystem, specifically as a last-mile data preprocessing engine. What I mean by last mile is that the ETL pipelines still happen on Spark - we do not change that. But before the mini-batch enters the model, we preprocess the data; for example, we might need to run some kind of filtering or sampling. All of those steps are now executed on Ray, and it gives you a very fast feedback loop.

The powerful part of Ray is that it executes in a streaming paradigm, where one mini-batch might be performing inference on the GPU while the next mini-batch is performing CPU preprocessing and another batch is performing a file download. It allows you to break down your execution into multiple pipeline stages and parallelize it across all of your resources. This is especially powerful if your training uses a mix of teacher-model inference and student-model training.
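A minimal Ray Data sketch of this last-mile pattern (the S3 path and the preprocessing function are hypothetical placeholders; Spark still produces the upstream files, and this is a sketch of the pattern rather than Pinterest's actual pipeline):

```python
import ray

def preprocess(batch: dict) -> dict:
    # Hypothetical last-mile step: the filtering / sampling / feature munging
    # that used to starve the GPUs when it ran on the trainer's own CPUs.
    return batch

ds = ray.data.read_parquet("s3://my-bucket/training-shards/")  # hypothetical path
ds = ds.map_batches(preprocess)        # runs on cheap CPU nodes in the cluster

# The trainer consumes mini-batches as a stream: download, CPU preprocessing,
# and GPU compute for different batches overlap in a pipeline.
for batch in ds.iter_torch_batches(batch_size=1024):
    pass  # forward/backward on the GPU goes here
```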

Ray also is very powerful because it allows you to orchestrate a heterogeneous cluster of resources. What that means is you can have your traditional GPU based training instances but you can add cheaper CPU nodes into the mix. So we typically see a mix of r5 and p4 instances being used together.

The wins that we see: for offline batch inference, jobs can speed up by up to 4x in some of our representative workloads. This is achieved by using a mix of R5 and G5 instances, and the reason we see the speedup is that it allows us to pipeline the execution, where the file write, the GPU execution, and the preprocessing each happen separately.

The feedback loop is a lot faster because instead of processing the entire dataset before you learn anything, you can just process a mini-batch right before it enters the GPUs. And the reason this is powerful is that we can continue adding heavy CPU preprocessing while maintaining the job run time by adding cheaper CPU nodes into the mix. So you can achieve the infrastructure setup that you need while improving, or at least retaining, the cost of a job.

On the right, I've highlighted one specific setup: the orange line represents the PyTorch-based data loader running on just one P4d instance, and the blue line represents Ray, which uses the same P4d instance along with five memory-optimized R5 instances. You can see the blue line is a lot flatter, which means the training job run time stays much more stable even as we add CPU preprocessing over time. This is achieved because of the heterogeneous clusters provided by Ray.

Next, I'll talk about model serving a little bit. At Pinterest we perform online inference, because the majority of recommendations are served online. We run on the order of hundreds of millions of inferences per second, going all the way up to half a billion inferences every second. The majority of our use cases are now running on GPUs, so we run thousands of GPUs in production in real time.

Most of our inference workload is orchestrated through our model server's C++ runtime, which is built in house on top of the open source frameworks. So the recommenders and vision transformers are all running on the C++ runtime. But you can also see Ray in the mix, where we use a Python runtime to serve some of the largest-scale models seen in the industry today.

The diagram shows that we are basically an AWS shop. At the bottom, you can see we have a variety of different instance types - again, one size doesn't fit all, so we use all kinds of GPUs there: P4d, G4dn, and G5. We also use the AWS Deep Learning AMI because it allows us to easily install the NVIDIA drivers and the CUDA stack that we need to power our applications, and it has allowed us to quickly upgrade our stacks - we are already on PyTorch 2.0.

There are two stacks there. One is the C++ stack, which, as I said, is built in house, and we use CUDA Graphs there - I'll talk a lot more about this in the next slides. The other is Python, where we use Ray and Hugging Face; Ray allows us to express data preprocessing so we can run some parts on CPUs and then perform the more expensive model executions on the GPUs.

Next, I want to take you on a journey of how we adopted GPUs at scale for Pinterest and this is actually an interesting part and I'll use the recommender as a workload to convey my point.

So just to recap, our recommender models use transformers and they are expensive. As you can see, the three most expensive parts are the transformer encoder, the transformer mixer, and the fully connected layers. These models can range anywhere from 100 million to 5 to 10 billion parameters, and they have grown almost 100x in the number of floating-point operations that they need.

The way this panned out for us: our modeling engineers went out, used the P4d instances, and came up with a new, much more complex model architecture. They came to the model serving team saying, hey, we want to serve this model now. The obvious thing we did was convert that model from GPUs to CPUs and try to serve it. It obviously didn't work - the latency was almost three times larger than what we needed it to be, so it was not useful for our product. But we did want to ship this state-of-the-art model, and we knew the next thing we had to do was adopt GPUs.

So we took that model, converted the devices to GPUs, and started serving it, but it didn't just work out of the box as we expected. You can see at the bottom right that even though the latency is a lot better than on CPUs, it's still almost three times larger than what we need for it to be worthwhile for our product. We knew we had to optimize this, and the journey of how we optimized is in three parts. That's how I want to convey this message of why we need to co-optimize the model and the infra.

The first place we looked at is the inference loop. What I mean by that is: when you need to execute a model, you put the features into the model, execute the model itself, and get back the response - we had to optimize that inference loop itself. Interestingly enough, this is actually an infra problem. The goal is to maximize GPU utilization while staying under the latency SLA. Unlike the training workloads, the "while staying under the latency SLA" part is the key part: we cannot just arbitrarily increase GPU utilization; we have to do it while staying under the latency SLA, otherwise the response is of no use to your product.

And while we were benchmarking, we saw two specific bottlenecks. The first one is CUDA kernel launch and the second one is CUDA memory copies. And I'll talk about both of them in the next slides.

The first one is CUDA kernel launch. I'll try to explain it very quickly. Imagine you have a model with many math operations - say, four operations, A, B, C, and D, as shown. When the model is executing on GPUs, your CPU actually submits each of these operations to be executed on the GPU. So the operators A, B, C, and D are sequentially launched onto the GPU, and each of these launches incurs some latency - that is called the launch latency.

The reason this launch latency is a problem is that a complex model can have anywhere from hundreds to thousands of operators, so this latency can quickly add up. The way we fix this is by using CUDA Graphs, which, by the way, are part of the PyTorch framework: we perform a static graph capture of this launch sequence and then replay the entire model graph as a single operation. If I draw your attention to the top of the diagram, we perform a stream capture of all four operators, and at the bottom you can see that they are bundled into a single graph. We then launch that entire graph as a single operation, and the operators within it - A, B, C, and D - are launched internally on the GPU by that graph.

So now we have come down from thousands of CUDA kernel launch overheads to just one kernel launch for the entire graph. This was the first optimization.
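A minimal sketch of the capture-and-replay pattern using PyTorch's CUDA Graphs API (the model, shapes, and warm-up counts are placeholders, not the production C++ runtime):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda().eval()

# CUDA Graphs need fixed shapes and fixed memory addresses.
static_input = torch.zeros(64, 1024, device="cuda")

# Warm up on a side stream (as the PyTorch docs recommend), then capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)   # every kernel launch recorded here

# Serving: copy new features into the static buffer and replay --
# one graph launch instead of one launch per operator.
static_input.copy_(torch.randn(64, 1024, device="cuda"))
g.replay()
result = static_output.clone()
```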

The second optimization has to do with the CPU-to-GPU memory copy. When things need to execute on the GPU, you have to copy the features from the CPU to the GPU and then execute the model. Large-scale models, especially recommender models, can have anywhere from hundreds to thousands of features, which translates to thousands of memory copy calls - that is too many and it's expensive.

The way we address this issue is by using pre-allocated memory chunks. We have a pinned host memory chunk and a complementary memory chunk on the GPU. On the CPU side, we copy all of our tensors into the pinned host memory - pinned host memory is a lot faster to copy to the GPU, and this is very often used with the PyTorch data loaders. The pinned host memory buffer is then copied in a single pass onto the GPU, and then we perform inference. What this translates to is that thousands of CUDA memory copy calls have now dropped down to one, and the thousands of CUDA kernel launch overheads have also dropped down to one.
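A simplified sketch of that pinned-buffer pattern (the real runtime is C++; the feature shapes here are hypothetical, and the point is just packing many tensors into one pinned host buffer and doing a single host-to-device copy):

```python
import torch

# Hypothetical feature tensors that would otherwise each trigger their own copy.
features = [torch.randn(64, 128) for _ in range(1000)]
sizes = [f.numel() for f in features]

# One pinned (page-locked) host buffer; pinned memory copies to the GPU
# faster and allows asynchronous transfers.
pinned = torch.empty(sum(sizes), pin_memory=True)
offset = 0
for f, n in zip(features, sizes):
    pinned[offset:offset + n].copy_(f.reshape(-1))
    offset += n

# A single host-to-device copy instead of a thousand small ones.
device_buffer = pinned.to("cuda", non_blocking=True)

# Views into the device buffer recover the per-feature tensors for the model.
offset, device_features = 0, []
for f, n in zip(features, sizes):
    device_features.append(device_buffer[offset:offset + n].view(f.shape))
    offset += n
```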

The reason this was important is that it reduced our latency and improved throughput dramatically. By adopting CUDA Graphs and the optimized memory copy, we saw almost a 30% reduction in latency, and this 30% reduction is in comparison to our CPU baseline. What that means is that at the same cost, we shipped an almost 77 times more expensive and complex model, which is state of the art, into our online systems at better latencies and higher throughput.

So this was key for us, and this was about the optimization of the inference loop. But of course, as any innovation goes, things don't stop there. Over time, as months pass, the modeling engineers innovate more: there are larger MLP layers, more cross layers, and they play with longer activity sequences, which are quadratic for transformers. The model keeps growing to the point where it uses almost 100 times more flops.

So it was time to optimize again. But unlike the first time, where we optimized the infra, this time we optimized the model. For example, the modeling engineers used half-precision inference: instead of performing inference in full float32, they performed inference in float16 or bfloat16, applied to the most expensive layers. They also played with custom CUDA kernels for certain specific operations that are expensive for our models.
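A minimal sketch of what half-precision inference on the expensive layers can look like in PyTorch (autocast is shown here as one common way to do it, with a placeholder model; not necessarily the exact mechanism used in production):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(1024 * 4, 1024)
).cuda().eval()

features = torch.randn(256, 1024, device="cuda")

# Run the expensive matmul-heavy layers in bfloat16 while the parameters
# stay stored in float32; autocast chooses per-op precision.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    scores = model(features)

print(scores.dtype)  # torch.bfloat16 for the matmul outputs
```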

You can see that the results reflect what they expected. Compared to the then baseline, which was also GPUs, they again saw a dramatic reduction in latency and an improvement in throughput. But the key part here is that the GPU usage dropped by 30%, which is an expected effect: because you're performing inference at half precision instead of full precision, even though you perform the exact same number of flops, you have a lot more spare GPU cycles to use.

The reason I'm bringing this up is that you can see this flywheel in effect: we first optimize the infra, which allows us to ship a bigger model; over time the model grows bigger, and then we innovate on the modeling side to improve efficiency. But now the GPU usage has dropped again, so we can pack even larger batches and drive more efficiency. You can see this dance happening where we optimize the infra, then we optimize the model, and now the ball goes back into the court of optimizing the infra.

So that's what we did next. We knew there were some obvious bottlenecks throughout our ecosystem, because we had just retrofitted GPU serving onto our CPU infrastructure. The original CPU infrastructure we had is a traditional root-leaf, data-sharded architecture where, as you receive a request, you scatter it into smaller parts and each leaf performs feature fetching and inference. Each of the leaves had simply turned from CPUs to GPUs. But this setup is very counterintuitive to how GPUs work, because GPUs are not latency-oriented, they're throughput-oriented devices.

They prefer fewer work items, where each item is large, and this setup was scattering a large workload into smaller pieces. The second issue was with inference parallelism: in the CPU world we used to easily serve 30-plus models on the same host, but GPUs, or hardware accelerators in general, don't like this degree of parallelism. They typically work best with four or five models on any given host.

The third challenge is GPU memory limits. Unlike CPU nodes, which have hundreds of gigabytes or even up to a terabyte of memory, our GPU nodes typically have only about 24 GB of memory. So while the models are getting more expensive and larger, we have a lot less space to accommodate them. This was a problem - we couldn't even fit all the experimental models for a use case into any single GPU that we had.

So we optimized the infra, and it's actually a very simple but very effective change. We refer to this as remote inference and model farms. The difference here is that another layer of network is added: we basically pulled the inference engines out of the ranking clusters, and now they live in a dedicated inference server. The reason this is powerful is that your ranking clusters can continue being CPU-only nodes.

You can pick the cheapest instances available for your use. They can be memory-heavy instances if you want, and you can choose whether or not to use data sharding based on your needs. But the really nice property here is that you get a clean decoupling of your ranking and retrieval systems from the inference servers themselves.

The other nice property is that your ranking cluster forwards requests to a smaller subset of GPU nodes. You get this nice fork-and-join effect where many different instances forward items for the exact same model to a smaller subset of hosts, which allows you to better perform dynamic batching and pack larger batches.
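A toy illustration of that fork-and-join dynamic batching on the inference server (a pure-Python sketch with hypothetical names and fixed-shape feature tensors; the production runtime is C++):

```python
import queue
import time
import torch

# Filled by the RPC handlers receiving items from many ranking nodes.
request_queue: queue.Queue = queue.Queue()

def next_batch(max_batch: int = 256, max_wait_ms: float = 5.0) -> torch.Tensor:
    """Collect items from many upstream callers into one large batch,
    bounded by a size cap and a small wait budget to protect latency."""
    items = [request_queue.get()]                 # block for the first item
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(items) < max_batch and time.monotonic() < deadline:
        try:
            items.append(request_queue.get_nowait())
        except queue.Empty:
            time.sleep(0.0005)
    return torch.stack(items)

# Serving loop: one big, GPU-friendly batch per model execution.
# batch = next_batch()
# with torch.no_grad():
#     scores = model(batch.to("cuda"))
```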

The effect this had: while we decoupled the architecture, we of course had more hosts, but because of the cost ratios of CPU and GPU nodes, this was actually better for our costs, and we saw a 10 to 20% increase in GPU utilization over the baseline. It also created a good decoupling of the architecture, with CPU-based ranking and retrieval systems and GPUs running only on the inference servers.

This also gave us heterogeneous instance support: as you can see at the bottom, the inference engines themselves can be CPUs or GPUs, and the customers don't need to care - they only talk to the ranking layer and we orchestrate the traffic to the inference proxies. It also allowed us to bypass the limitations of smaller GPU memory, because now you can horizontally scale out the model capacity with as many model groups underneath as you want.

We typically see that customers logically group the models - for example, all the models of an experiment belong to a single model group. And the last thing is that the inference servers are now stateless, so you can very quickly auto scale them if you have an infrastructure like Kubernetes.

Beyond the details I mentioned here, the main point I'm trying to convey is that when we are targeting ML efficiency, especially for GPUs, it's not enough to solve it in isolation for just the infra or just the model; you have to keep playing this back and forth of optimizing the infra, then the model, then the infra, and so on.

So that's the third key takeaway that I wanted to reference. This is the last model serving slide, so I'll just end with the key takeaways. We saw that GPUs drive innovation: we were able to serve state-of-the-art models using GPUs which would otherwise have been infeasible. One size fits all doesn't exist - we saw throughout the slides that just one specific type of GPU, one specific GPU variant, or one specific training or serving setup doesn't work anymore. We have to specifically select the hardware and the setup that works for our workloads.

Throughout the model serving slides, we saw that we need to co-optimize the model and the infra together to drive ML efficiency. The fourth point is a key one: maximize the use of open source. We greatly benefited from being on top of the PyTorch stack - we use the PyTorch profiler, distributed data parallel, fully sharded data parallel, and CUDA Graphs. None of these would have been possible without open source innovation, and it is a great driver of ML velocity.

So maximizing the use of open source was very beneficial to us. The last point I want to make is profiling and right-sizing. We did a great job of this in the CPU world, but we need to get used to profiling the GPUs themselves as well, using tools like DCGM and NVIDIA Nsight to drive efficiency for GPUs.

That's the last slide. These were the five key takeaways. I hope this was a value add to all of you, and I'm happy to take questions if any of you have any. Thank you.
