Introducing AWS Inferentia2-based Amazon EC2 Inf2 instances

Good afternoon, everyone. I hope everyone is enjoying re:Invent. Before we jump in, how many of you are familiar with our Inferentia product line? Great. And anyone catch Adam's keynote this morning? Ok, good. A few of you.

So there are a lot of exciting announcements, including the one I'm most excited about - Amazon EC2 Inf2 instances featuring our latest ML accelerator, Inferentia2.

My name is Joe Sirosh. I'm a product manager on our Amazon EC2 team. Here with me today, I have two customers:

  • Samir, a senior machine learning engineer from Qualtrics
  • Srini, a senior software development manager from Amazon CodeWhisperer

They're both here to talk about their experiences with the Inferentia product line and how Inf1 and Inf2 are helping them improve performance and reduce the cost of their applications.

Today we'll take some time to:

  • Talk a little bit about the AI and ML innovations that we see in industry and at AWS
  • Talk about how we're democratizing inference with AWS Inferentia - a goal of AWS as a whole to bring more machine learning to folks
  • Then we'll go into high performance natural language processing inference with Qualtrics, when Samir will come on stage
  • Then we'll jump back to talk more about Inf2 innovation in silicon and the distributed inference that Inf2 will support
  • And then we'll pass it off to Srini who's working on some large language models that are dependent on this distributed inference

So let's go ahead and jump in...

[Joe discusses AI/ML innovations]

...and now that we've heard a little bit about why we're excited for the Inferentia product line, we thought it would be best to hear directly from a customer using Inf1.

I'll now invite Samir, a senior machine learning engineer from Qualtrics on stage to talk a little bit about their work on Inferentia.

[Samir discusses Qualtrics' experience with Inferentia]

Thank you, Samir. It's great to hear our customers are using and benefiting from the Inferentia product line.

Now we'll transition to talk about the innovation underlying the Inf2 instances...

[Joe discusses Inf2 innovation]

As an example, CPUs are programmable and highly flexible, but unfortunately they're not designed specifically for workloads like deep learning. Hardware accelerators, by contrast, are extremely optimized for a given workload, but they're not as programmable.

So to achieve the best of both worlds, we coupled a control CPU with a highly optimized purpose built data path. The control CPU handles conditionals and loops, while the purpose built data path does the number crunching at wire speed. This architecture allows the hardware to support dynamic execution, increasing flexibility without compromising on performance.

And in order to improve flexibility further and future proof the design, we also deeply embedded vector processors inside the Inferentia2 NeuronCore compute units. When I say deeply embedded here, I mean that the processors have direct access to on-chip SRAM memories, which enables a blazing fast memory interface.

To give you some context here, most CPUs access memory through a bus that is 256 or 512 bits wide per cycle. But these vector processors have an order of magnitude more memory bandwidth, at 4,000 bits per cycle. This allows customers to execute custom operators directly on the NeuronCore compute units without needing to move data back and forth between a CPU and a coprocessor, and without even needing to move data between device memories, providing high flexibility at high performance.

And one of the other major innovations in Inferentia2 is the data types: it will support six unique data types. Machine learning workloads require a massive amount of floating point calculations, and these calculations can be done at different levels of precision, ranging from float32, through the common 16 bit formats float16 and bfloat16, to the emerging float8. Different networks require different data types to achieve their optimal accuracy.

During early research and discovery phases, researchers typically prefer the ease of use of float32. But then as they ready a new ML model to run inference in production, they often optimize for price performance via more compact data types like bfloat16 or even float8, especially if it's a really large model that needs higher throughput or lower latency.
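To make that research-versus-production precision choice concrete, here is a minimal PyTorch sketch. The tiny model and the comparison threshold are illustrative only and are not tied to Inferentia; the point is simply how a float32 reference run can be checked against a lower-precision bfloat16 run before committing to the more compact data type.

```python
import torch

# Illustrative stand-in for a real network; not a model from the talk.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()

x = torch.randn(8, 1024)

# Research/debug path: run everything in float32 for maximum numerical headroom.
with torch.no_grad():
    ref = model(x)

# Production path: cast weights and inputs to bfloat16, trading a little
# precision for higher throughput and a smaller memory footprint.
bf16_model = model.to(torch.bfloat16)
with torch.no_grad():
    out = bf16_model(x.to(torch.bfloat16))

# Compare against the float32 reference to confirm accuracy is still acceptable.
print("max abs error vs float32:", (out.float() - ref).abs().max().item())
```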

By supporting six data types, we increase flexibility so customers can determine the performance and accuracy that meets their application needs. And so in terms of performance by data type, we've plotted here the theoretical values for Inf2 versus our competitor and versus a GPU based instance, which shows Inf2 has 100% and 10% more bfloat16 compute performance compared to those instances, and 4.3 times more float32 compute performance.

As you can see, selecting a higher precision data type such as float32 reduces the petaflops, but at 4 bytes per value it provides the highest accuracy. The key element here is that we're enabling our customers to pick the most optimal data type for their use case, where the tradeoffs are throughput, latency and accuracy.

So we discussed innovation at the silicon level, but we're also defining new features at the server level. Inf2 instances are the first inference optimized instances on Amazon EC2 to support distributed inference, with direct ultra high speed connectivity between accelerators, which also enables 10 terabytes per second of accelerator memory bandwidth.

So this high speed inter-chip communication is critical for the collective communications required to support very large models using distributed inference. We talked about this slide a little bit before - recall that growth in model size has been explosive over the last few years. Well, we built Trainium to offer the best scalability and performance for training these very large models. But let's talk about how Inf2 optimizes inference for such large models, especially when they simply will not fit on a single chip.

So you may all be familiar with the distributed training techniques that have become popular for billion or even trillion parameter models. But let's cover why these techniques are also important for inference. As you can see from the simple illustration, as model parameter size grows larger, in most cases, this means the model was designed with deeper, denser layers during training.

The number of parameters is one of the most important factors driving the amount of memory needed to store the model. Now there are other important factors, such as the data type, but for this example we'll hold all of those equal. Under these assumptions, there is roughly a 500x difference in memory required for these two models.
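To make that scaling concrete, here is a back-of-the-envelope calculation. The two model sizes used below (350 million and 175 billion parameters) are hypothetical stand-ins chosen to give roughly a 500x parameter ratio; they are not figures from the slide.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Model sizes below are hypothetical examples, not figures from the talk.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float8": 1}

def weight_memory_gb(num_params: int, dtype: str) -> float:
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

small, large = 350_000_000, 175_000_000_000  # ~500x more parameters

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>9}: {weight_memory_gb(small, dtype):8.1f} GB "
          f"vs {weight_memory_gb(large, dtype):10.1f} GB")

# Holding the data type fixed, memory scales directly with parameter count,
# so the larger model needs roughly 500x more memory for its weights alone.
```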

And so let's take a look at how we would deploy a 300 million parameter model. There are several cloud based hardware accelerators available that can fit a 300 million parameter model in a single worker. While these were considered large models three years ago, today we see them frequently in production; they run on existing hardware accelerated instances such as Inf1.

But when we're looking at models with billions of parameters, they require more memory than a single worker can provide. So the only option is to use a technique known as model parallelism, or more specifically here, tensor parallelism, to split the model across several workers.

To understand why this is possible, let's take a high level look at the math for neural networks. So we'll come back to our basic fully connected artificial neural network. While this is a nice graphical representation of a neural network, it doesn't help explain how to split layers across workers.

Instead, we could think of the fully connected layers of a neural network as a set of dense matrix multiplication calculations, simplified here for explanation. And using this matrix format, we use the properties of matrix multiplication to split the computation even further. We can see that both of these are mathematically equivalent. But now different workers can complete different portions of the computation.

However, when we divide these computations, we still need to sync the values across each of the workers. The arrows here indicate partial results computed by independent workers, which must now be summed across all of the accelerators to complete the inference request.
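Here is a small NumPy sketch of that idea: split the weight matrix of a fully connected layer across two hypothetical workers, let each compute its partial product, then sum the partial results (the all-reduce step that the high speed interconnect accelerates). This is a toy illustration of tensor parallelism, not Neuron's actual partitioning scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 1024))     # activations entering a fully connected layer
w = rng.standard_normal((1024, 1024))  # layer weights

# Single-worker reference: the full matrix multiplication.
reference = x @ w

# Tensor parallelism (toy version): split the weights row-wise across two workers,
# along with the matching slice of the activations, so each worker holds only
# part of the layer.
x0, x1 = x[:, :512], x[:, 512:]
w0, w1 = w[:512, :], w[512:, :]

# Each worker computes its partial product independently...
partial0 = x0 @ w0
partial1 = x1 @ w1

# ...and the partial results are summed across workers (the all-reduce that the
# high speed ring between accelerators is designed to make fast).
combined = partial0 + partial1

print("mathematically equivalent:", np.allclose(reference, combined))
```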

So we designed a ring topology with direct ultra high speed connectivity between accelerators. With this ring connectivity across an Inf2 instance, we effectively built a 384 gigabyte accelerator memory pool with 10 terabytes per second of bandwidth.

So Inf2 provides high performance, low latency inference for today's most demanding 100 billion parameter models. And bringing all these innovations together, we have the best price performance of any inference optimized platform available on AWS.

Let's take a quick look at how Inf2 performance stacks up against a GPU optimized for inference and readily available on AWS today. For OPT-30B, that's a 30 billion parameter model, Inf2 delivers 573 tokens per second compared to 181 tokens per second on the GPU - that's over 3 times the throughput for this example of a large language model.

Now let's take a look at even larger models such as OPT-66B. As you can guess, that's a 66 billion parameter model. Inf2 delivers 248 tokens per second. But here something else happens - the GPU only has 192 gigabytes of total memory, compared to the 384 gigabytes we see on Inf2. So we see the GPU run into out of memory issues as models reach 66 billion parameters, whereas Inf2 is still delivering high performance. That's because Inf2 was purpose built to serve these types of large language models, delivering high performance, low latency outputs at the lowest cost on Amazon EC2.

And so we're really excited because our internal customers have already seen great performance results. For example, Amazon Search tested their billion parameter model and is seeing 2x the throughput compared to the GPU. According to Amazon Search, these larger pre-trained encoders have a higher capacity to encode the rich information in the data leading to higher accuracy.

However, serving larger models can be prohibitive in real time production systems with strict latency requirements, such as search, ads or recommendations. Reducing the inference latency of a 1 billion parameter encoder to single digit milliseconds opens up the possibility for these and additional applications to leverage the benefits of large models for better customer engagement and reduced deployment costs.

So how can customers take advantage of all the innovations in Inf2? Well, accessing all of this performance is easy with the AWS Neuron SDK, which is the software development kit we talked about before. Neuron integrates Inf2 into popular machine learning frameworks like PyTorch and TensorFlow. This enables customers to bring their models as is and seamlessly compile them for Inf2 with only a few lines of code.

The Neuron SDK includes a compiler, runtime, and profiling tools. Our goal here is to provide our customers with good performance out of the box with no model changes or performance tuning on their part. And we constantly work to improve Neuron. Over the last year, we've added support for distributed training, eager debug mode, collective communication optimizations and more.

And by using Neuron, you also keep getting new speedups over time as we optimize the stack further. We often hear from customers using Inferentia1 and Trainium1 that, after updating to the latest Neuron drivers, they see a noticeable out of the box performance improvement.

So Neuron makes deploying a model on Inf2 as simple as possible, and we put a lot of effort into that. For example, you can see that we download a Hugging Face Vision Transformer model and, with just a few extra lines of code highlighted in the red boxes, we compile it for Inf2 and deploy it at high performance.
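The exact code on the slide isn't reproduced here, but a sketch along these lines shows the general shape of those "few extra lines" using the torch_neuronx tracing API. The checkpoint name and file path are illustrative choices, not the ones from the slide.

```python
import torch
import torch_neuronx  # AWS Neuron SDK integration for PyTorch
from transformers import ViTForImageClassification

# Illustrative checkpoint; any traceable PyTorch model follows the same pattern.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", torchscript=True
).eval()

example_input = torch.rand(1, 3, 224, 224)  # dummy image batch used for tracing

# The Neuron-specific step: trace and compile the model for the NeuronCores.
neuron_model = torch_neuronx.trace(model, example_input)

# Save the compiled artifact, then load it on an Inf2 instance for serving.
torch.jit.save(neuron_model, "vit_neuron.pt")
compiled = torch.jit.load("vit_neuron.pt")
outputs = compiled(example_input)
```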

We don't require model tweaking or data type conversion; just bring your model as is and Neuron will run it. And we also know how important monitoring and debugging tools are when deploying models. Inf2 has an extensive toolset for monitoring and debugging.

This includes neuron-ls for device discovery and topology information, as well as neuron-top for real-time visualization of NeuronCore and vCPU utilization, host and device memory usage, and a breakdown of memory allocation. This information is also available in JSON format, which can be used with popular visualization and deployment monitoring tools like CloudWatch or Grafana.
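As a rough sketch of how that JSON output might be wired into monitoring, one could stream reports into CloudWatch custom metrics with boto3. Note that the exact neuron-monitor invocation and the field name used below are assumptions for illustration; consult the Neuron documentation for the tool's actual report schema.

```python
import json
import subprocess

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumption: the monitoring tool streams one JSON report per line to stdout.
# The field name below is a placeholder, not the tool's actual schema.
proc = subprocess.Popen(["neuron-monitor"], stdout=subprocess.PIPE, text=True)

for line in proc.stdout:
    report = json.loads(line)
    utilization = report.get("neuroncore_utilization", 0.0)  # hypothetical field

    # Publish as a custom CloudWatch metric so it can be graphed or alarmed on,
    # for example from a Grafana dashboard backed by CloudWatch.
    cloudwatch.put_metric_data(
        Namespace="Inf2/Inference",
        MetricData=[{
            "MetricName": "NeuronCoreUtilization",
            "Value": float(utilization),
            "Unit": "Percent",
        }],
    )
```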

So now that you've seen the performance and how easy it is to just get started, I'm gonna hand the mic over to Srini from Amazon CodeWhisperer to talk about the innovative work they're doing with large language models and how Inf2 will help address their challenges.
