A deep dive on AWS infrastructure powering the generative AI boom

I'm part of the EC2 product team at AWS, and I've been with AWS for a little over seven years now. It's been a fascinating ride. I have managed this hardware-accelerated computing business for the last six or seven years, and the amount of growth and excitement we have seen in this space is just phenomenal. So I'm super excited to talk to you about some of the nuts and bolts of the infrastructure we are building and deploying to power the latest gen AI applications.

I'm also super excited to have my co-presenters here. We have Alexandru Costin from Adobe, who is the Vice President for their generative AI platform. He's going to talk about how Adobe is leveraging gen AI in their applications and how they are leveraging AWS to accomplish their goals.

We also have Belinda Zeng. She's a Director of Engineering with our retail team. As you can imagine, Amazon.com is a really big user of deep learning and gen AI technologies, so she's going to show a little bit behind the scenes of how we are powering some of our customer experiences on Amazon.com.

She'll cover the core technologies we are building and, again, how she and her team are leveraging AWS.

So, quick show of hands: how many of us have heard the term "gen AI"? I'm just kidding. I know this session, this keynote, this whole week has been all about gen AI. In fact, this entire year has been all about gen AI, and it's all thanks to ChatGPT. The interesting part is that these kinds of models have been around for a good three to four years; ChatGPT was the inflection point where masses of consumers were able to harness their power to enhance their own productivity and truly take advantage of what large language models and gen AI have to offer.

Now, the phase we are in right now is a super interesting one, where lots of large enterprises like yourselves are getting serious about taking a look at this technology and understanding how they can leverage it. In fact, research from Goldman Sachs earlier this year predicted that gen AI could drive roughly $7 trillion of worldwide GDP growth over the coming years. So it's a fascinating growth opportunity for all of us in the room and, obviously, for the entire tech industry.

Gen AI is transformational across a whole range of industries. You might already be familiar with this, so I'm not going to spend too much time on the slide, but across healthcare, life sciences, financial services, retail, and media and entertainment, there are lots and lots of use cases where not only large tech companies like ourselves but also customers like you are envisioning how to really upgrade the customer experience.

Simple things like a digital scribe accompanying doctors and healthcare practitioners, all the way to drug discovery, operational efficiencies, and even call center engagements. There's just a lot of value to be unlocked by leveraging this technology.

On the AWS side, you might have seen a stack chart from us that talks about all the different services we're building. I'm not going to reference that slide; I'm just going to share our thought process and our general high-level strategy for supporting gen AI applications.

First and foremost, and we have said this consistently all year, we don't think a single model is going to be sufficient for all the different types of use cases. In fact, it's just not going to be cost effective: the larger a model gets, the more complicated and expensive it gets, and the less specific it becomes for certain use cases.

So from the AWS standpoint, we want to offer a portfolio of capabilities, including a collection of foundation models you can leverage directly. We understand really well that enterprises are going to differentiate based on their data. You are supporting your customer base, internal or external, and have a fairly good understanding of how they use your products and services and how you can leverage data to optimize their experiences or deliver efficiencies.

So from our standpoint, we have a very sharp focus on making sure you're able to leverage this technology in a secure, reliable manner and really tune it to your organizational needs and the data sets you have in your systems.

We also understand that there are going to be top-level applications that are useful out of the box; it's not just about tooling, it's also about end-to-end applications. And lastly, we have a very strong focus from an infrastructure perspective on providing the most performant, lowest-cost infrastructure possible to support these gen AI applications. That's the segment I'm going to focus on very deeply.

Then Alexandru and Belinda are going to jump in and talk about how they're leveraging all of AWS to build their services.

All right, how many of us have seen a slide like this, one that talks about model parameters and how they have grown over the last few years? I see a whole bunch of head nods. Over the last four years there has been a dramatic increase in the size of models; in this case, as we are showing, it has gone up 3,000x. It's just crazy.

Now, a lot of times I get the question: why is this happening? Why are researchers pushing the boundaries in this particular dimension? If you look through the research papers, the answer is relatively straightforward.

If you look at this particular chart, it's from the famous Llama paper that Meta published a few months ago. The y-axis is the training loss, so the lower the number, the better the model. The x-axis is the number of tokens, the amount of data, used to train the model. And the different traces are different model sizes.

If you look closely, the red line is 7 billion parameters and the pink line at the bottom is 70 billion parameters. So a larger model trained with the same amount of data yields better output. That is the key thing propelling researchers to build larger and larger models: ultimately, whether you're interacting with a chatbot, an agent that generates code for you, or a recommendation engine making suggestions to your customers, you want to provide better and better results that map directly to positive business outcomes.

Besides that, there's also the common idea of scaling laws. We talked about how, as a model gets larger, it tends to perform better. There's also a multidimensional correlation with the amount of data used to train it, and ultimately with the amount of compute needed to carry out that training.

So you see these charts: on the second chart, the left axis is the amount of compute needed to train and the right axis is the amount of data. Both of these axes are logarithmic, so there's an exponential increase in the amount of data you need to train a model and in the associated compute.
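To make the scaling-law idea concrete, here is one common way it is written in the literature (this is the Chinchilla-style formulation, not an equation from the slide itself; the constants are fit empirically per model family):

```latex
% Approximate training compute as a function of model size N (parameters)
% and dataset size D (tokens):
C \approx 6\,N\,D \quad \text{(total training FLOPs)}

% Empirical scaling-law form for the training loss:
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because both terms fall off as power laws, pushing the loss meaningfully lower requires multiplying both the parameter count and the token count, which is exactly the exponential growth in compute that the chart shows.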

So let's bring that down to a specific example of what it really means. If you are training, say, a 520-billion-parameter model and you use 8,000 H100 GPUs, it's going to take you about 20 weeks to train. That model will cost you around $80 million and consume about 43 gigawatt-hours of power. This, my friends, is the main reason why there aren't enough GPUs to go around and why Nvidia's market cap has exceeded a trillion dollars: there's just a lot of compute needed to train these models.
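As a sanity check, here is a rough back-of-the-envelope version of that slide's math in Python. The $/GPU-hour and watts-per-GPU figures are my own assumptions for illustration, not AWS pricing:

```python
# Rough check of the slide's numbers: ~8,000 H100 GPUs running for ~20 weeks.
gpus = 8_000
weeks = 20
hours = weeks * 7 * 24                      # 3,360 hours
gpu_hours = gpus * hours                    # ~26.9M GPU-hours

price_per_gpu_hour = 3.0                    # assumed blended $/GPU-hour
cost = gpu_hours * price_per_gpu_hour       # ~$80M, in line with the slide

watts_per_gpu = 1_600                       # assumed, incl. host/cooling overhead
energy_gwh = gpus * watts_per_gpu * hours / 1e9   # ~43 GWh

print(f"{gpu_hours / 1e6:.1f}M GPU-hours, ~${cost / 1e6:.0f}M, ~{energy_gwh:.0f} GWh")
```

With those assumptions, the numbers land right around the figures on the slide, which is the point: at this scale, even small changes in efficiency or price per GPU-hour move the total by millions of dollars.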

So let's take a deeper look. What type of compute are we talking about? For the most part, if you look under the covers, and many of you in the audience might already be aware of this, training and ultimately inferencing a deep learning model requires a whole lot of matrix multiplications. If you have a CPU with tens or maybe a hundred processing cores, sure, it can parallelize across those cores, but it's not going to be as effective as a GPU that has thousands of processing cores.

Some of the latest GPUs from Nvidia have around 7,000 processing cores, so you can imagine parallelizing all those math operations across thousands of cores and getting the job done very quickly. In fact, the entire deep learning industry was supercharged about 10 years ago by a PhD researcher who took a single model and trained it in a matter of days using two GPUs. That took the industry by storm. The model was AlexNet, which came out about 10 years ago, and the paper that researcher presented unlocked the entire wave of innovation in deep learning.
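Here is a tiny illustration of that point, assuming PyTorch and a CUDA-capable GPU are available. The exact timings vary wildly by hardware, but the gap between tens of CPU cores and thousands of GPU cores shows up immediately:

```python
import time
import torch

# One large matrix multiplication, the core operation behind deep learning.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.time()
_ = a @ b                      # parallelized across the CPU's tens of cores
cpu_s = time.time() - t0

if torch.cuda.is_available():
    a_g, b_g = a.cuda(), b.cuda()
    torch.cuda.synchronize()   # ensure we only time the matmul itself
    t0 = time.time()
    _ = a_g @ b_g              # fanned out across thousands of GPU cores
    torch.cuda.synchronize()
    gpu_s = time.time() - t0
    print(f"CPU: {cpu_s:.3f}s   GPU: {gpu_s:.4f}s")
```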

Now, the slide on the right is about custom silicon, which is also super important in this space. Once you get to a point where you have a lot of volume for a particular type of workload, you can cost-optimize and performance-optimize by building your own custom chip targeted at that specific use case.

With custom silicon you have a clean, blank slate, and you can design that part to deliver exactly the functionality and capability you're looking for. This is where we have been investing across the spectrum on the AWS side, specifically on the compute side.

If you look at the bottom portion of the stack, on EC2 we have instances that feature Nvidia GPUs: we were first to market with V100-based instances and A100-based instances, and the first major cloud provider to launch H100-based instances. We also have silicon from other partners. And over the last five years, we have been building our own silicon to accelerate deep learning applications.

Specifically, Trainium and Inferentia. And it's not just about compute; it's about storage, networking, and managed services. Ultimately, it's about ease of use for the end user, by integrating into common ML frameworks like PyTorch, TensorFlow, Hugging Face, and OpenXLA. This is the entire pool of capabilities we are optimizing month in, month out to deliver the capabilities you need to build your own gen AI applications.
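As a rough sketch of what that framework integration looks like from a user's point of view, here is a minimal PyTorch example targeting a NeuronCore through the Neuron SDK's PyTorch/XLA path. The package and device APIs shown (torch_xla, xm.xla_device, xm.mark_step) are the publicly documented ones, but treat this as illustrative rather than a complete recipe:

```python
import torch
import torch_xla.core.xla_model as xm  # shipped with the AWS Neuron build of PyTorch/XLA

# On a trn1/inf2 instance this resolves to a NeuronCore instead of a GPU.
device = xm.xla_device()

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
y = model(x)

xm.mark_step()   # flush the lazily built XLA graph so it compiles and runs on the device
print(y.shape)
```

The idea behind this kind of integration is that you keep your existing framework code and swap the accelerator underneath, rather than rewriting for each chip.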

A quick flyby on Nitro. Nitro is a core, fundamental technology that enables us to build and launch these instances at a really quick pace. It consists of multiple components: a Nitro chip on the motherboard, Nitro cards in the server, and our own hypervisor. There's a ton of innovation we have brought to the hypervisor and virtualization stack to get to market quickly and provide you with the most performant platforms.

So here's a quick overview of Nitro. This is the pre-Nitro configuration, circa the 2014-2015 time frame, where the whole system is a single server. On that same server you have the customer instances that you launch and utilize, managed by the hypervisor layer, and we had dom0 consuming resources on the server to run a whole bunch of management tasks, including things like EBS traffic, launch control of the instance life cycle, local storage, and so on.

With Nitro, we took all that management overhead and offloaded it to a dedicated card in the server itself. So now, on the host server, almost the entirety of the compute and memory resources is available for us to provision to customer instances.

As you can see, decoupling the architecture this way does two things. First, it allows us to move faster in bringing new products to market. Second, and more importantly, it allows us to maximize the utilization of these compute resources and pass those cost savings on to you.

Over time... here's a quick look at what a typical server looks like. This is an image of a C5n server that we shipped a couple of years ago. It's not the entirety of the server: you can see in the back there are empty spaces we haven't populated, but you can see the two CPUs, the motherboard, and I want to call out these two particular cards.

These are Nitro cards. There isn't just a single Nitro card; there are actually two Nitro cards in this server: one primary card that manages the control plane, EBS volumes, and VPC encryption, and another that we use for networking.

On the Nitro side, we have been innovating for the last 10 years; we are on our fifth generation of Nitro chips. If you take a quick look at the image in the middle, the Nitro v3 chip, and then go back and look at that red card in the server, you'll see the same chip is there. This is, again, the C5n that we shipped a couple of years ago with the Nitro v3 chip.

So we have been building chips for the last 10 years, and we have built a lot of expertise around how to build these systems and how to scale them out across the millions of servers we have deployed over the years.

A quick overview of P5. By the way, how many of you in the room are actually training your own models versus consuming higher-level services? Show of hands... I see some. Great. So P5 is a really great platform for training large-scale models, and it's also turning out to be a great platform for fine-tuning.

It features the latest H100 GPUs from Nvidia, with 2.5 to 4x higher performance than our prior-generation P4d instance. It's got eight GPUs in the platform and the industry's highest networking capability at 3,200 gigabits per second. When I joined AWS seven years ago, the maximum networking bandwidth we had was 50 gigabits per second; now we've got 3,200. That's a 64x increase, close to two orders of magnitude more networking bandwidth than we used to be able to bring to the table.

So that's a quick look at P5. I mentioned custom silicon: P5 uses GPUs from Nvidia and is a really high-performance platform. I also mentioned earlier how we are leveraging our expertise in building silicon with Nitro, and we talked about Graviton earlier this week, with Graviton4 being our next general-purpose CPU. In parallel, slowly and steadily over the last five years, we have been executing on building our own dedicated chips for accelerating deep learning applications.

We have Inferentia1, which shipped back in 2020. Inferentia2 and Trainium1 were made generally available over the last 12 to 18 months. All of these are alternatives to the other compute options we have in our portfolio.

If our customers are interested in Nvidia, we want to make sure we've got the latest and greatest Nvidia products in the portfolio. But if you're open to looking at a different architecture that could potentially provide you with 30, 40, maybe 50% cost savings, then we want to make sure we have those options in our portfolio as well.

Let's dive deep specifically into Trainium1. It's custom designed for deep learning.

Let's look a little bit under the covers. Trainium1 has two massive NeuronCores. Unlike GPUs, where you have thousands of processing cores, we have two large processing cores. Within each core is a tensor engine, which is basically a large matrix-multiply engine; the cores also support scalar and vector math. There's a collective-communications engine that helps offload chip-to-chip communication, and then high-bandwidth memory (HBM) on the package to provide the memory bandwidth you need.

Trainium1 is packaged up in the Trn1 instance, where we pack 16 of these chips into a single server. We end up with a really high-performance server with class-leading performance that, at the same time, provides a lot of value. Let's dive deeper into a Trainium1 server.

So, like I mentioned, Trainium1 is the chip, and Trn1 is the server and instance that house those chips. It's a dual-socket system with two CPUs, and each CPU has a bank of PCI Express switches that connect to a set of Trainium chips. A single Trn1 server consists of 16 Trainium1 chips plus a whole host of Nitro cards: those are the blue blocks on the screen, and they are what we use to deliver the high-bandwidth EFA networking (up to 1,600 gigabits per second on Trn1n), with all the Trainium chips interconnected using a hyper-mesh fabric.

From a server standpoint, to help you visualize it: one half of the server, housing eight Trainium chips, looks like the image on the left. As you can tell, the amount of complexity in this subsystem compared to the C5n itself is visually obvious. Then there's another assembly containing the other set of eight chips, plus a 2U server head node that ties it all together. So when you look at a Trn1 server, it's not a 1U or 2U node; it's actually a 10U product. It's about this big, roughly 48 inches deep, and it's massive.

The amount of engineering complexity we deal with, from the chip level to the server level, is just incredible. Beyond the compute side, like I mentioned earlier, it is super important for us to have a high-performance network that lets customers spin up a large cluster to accelerate their training jobs. For a model of the size we talked about, 520 billion parameters, you can't train it on a single machine; it would take forever and it wouldn't be efficient.

So customers are scaling out to hundreds of these machines to run their training jobs in parallel and accelerate their time to market, and for that, a high-performance network is absolutely essential. This is where EFA comes in. EFA leverages our Nitro card and uses a custom protocol we call SRD that is optimized for low-latency, high-bandwidth communication. It's something that is unique to AWS.

It's something we have been investing in over the last 7 to 8 years. It's highly scalable, and it's been used by some of the leading foundation model builders to train their models and get to market faster.
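To make that concrete, here is a hypothetical sketch of what enabling EFA looks like from inside a distributed PyTorch training job. The libfabric environment variables shown (FI_PROVIDER, FI_EFA_USE_DEVICE_RDMA) are commonly documented ones, but the exact settings depend on your AMI, NCCL version, and cluster setup:

```python
import os
import torch.distributed as dist

# EFA is exposed through libfabric; point the fabric provider at EFA and enable
# the device-RDMA path (typical values shown, adjust for your environment).
os.environ.setdefault("FI_PROVIDER", "efa")
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")

# NCCL (via the aws-ofi-nccl plugin) then carries all-reduce traffic over EFA/SRD.
# Assumes the job was launched with torchrun, so rank/world-size env vars are set.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up over EFA")
```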

Lastly, to close out this piece: I talked about clusters, and this slide visualizes how we think about them. Specifically, it shows the design of UltraClusters, where customers can consume tens of thousands of these accelerators as part of a single entity. Each of those purple boxes at the bottom represents a single rack of compute; it could be P5s, it could be Trn1s. We deploy these racks in a single data center, interconnected using dedicated networking, and that's what we call an UltraCluster.

This entire thing is purpose-built to support large-scale training. It has petabit-scale, non-blocking networking, which means each rack can talk to every other rack at full bandwidth. And we have been innovating here too: we are on the second generation of this network, which has allowed us to scale it out significantly and lower latency by 15%.

Besides the compute, the networking, and these clusters, we want to make sure you're able to use this compute as effectively as possible. So we also focus on the storage aspects and the managed-services aspects.

We have EBS, local instance storage, S3, and FSx for Lustre as key offerings on the storage side, and then SageMaker, EKS, and ECS on the managed-services side to make it easier for customers to consume this capacity and build their models.
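As an illustration of that storage pattern, here is a minimal, hypothetical sketch: the training corpus lives in S3, is linked to an FSx for Lustre file system, and every training node sees it as a local mount. The /fsx path and shard naming are made up for the example:

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset, DataLoader


class ShardDataset(Dataset):
    """Reads pre-tokenized training shards from the FSx for Lustre mount."""

    def __init__(self, root: str = "/fsx/datasets/my-corpus"):  # hypothetical mount point
        self.files = sorted(Path(root).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int):
        # With an S3 data repository association, FSx for Lustre lazy-loads the
        # backing objects on first access and serves them at file-system speed after.
        return torch.load(self.files[idx])


loader = DataLoader(ShardDataset(), batch_size=1, num_workers=8)
```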

And lastly, all of this has enabled us to support a wide set of customers on AWS. Over the last 10 years we have been focused on supporting machine learning, and we have over 100,000 customers building ML applications. As Swami mentioned this morning, we have over 10,000 customers building gen AI applications on AWS.

With that, I would like to welcome Alexandru to talk about Adobe, how they are leveraging gen AI, and specifically how they are using AWS. Thank you.

Thanks, Chetan. I'm happy to be here. I run the generative AI team and our training platform at Adobe. I've been with Adobe for 23 years, which is a lot, and I've been doing AI heavily for the last three to four years. I'll tell you our story: how we came to build the Firefly family of models and how we partnered with AWS to succeed. Let's start with Adobe's mission: we are helping the world create content and build personalized digital experiences.

We have three business units: the Creative Cloud business unit for professionals who want to create content, the Document Cloud business unit with Acrobat and PDFs, and the Experience Cloud business unit for marketers. We've been trusted by many companies with their data; they host their data with us, and we help them do marketing better, manage documents better, and create content. And we're a four-decade-old company.

Throughout our history, we've helped our customer base navigate many technological disruptions. We actually created the first one, the digital publishing revolution. Then digital photography happened, then the internet and mobile era, then the social era; many technological disruptions changed how our customers create content, how they do marketing, and how they consume and edit documents.

But last year, in 2022, we realized that a new era was starting: the AI era. In 2022 this wasn't clear to a lot of people. I actually remember the good times when Chetan and the AWS team were talking to us about getting GPUs and there was plenty of supply; there was none of the crazy silicon scarcity you find today. We decided in 2022 that Adobe needed to embrace this technology and lead, because our customers need to become gen AI enabled to be successful in this new world.

So we started our journey, which I'm going to present to you today along with some of our learnings from the last couple of years. When we created our gen AI strategy, we decided that at Adobe, because we have all this software for all these customer segments, we need to train our own foundation models for creativity. And we decided that our customers need those models integrated into the applications they love and use, like Photoshop, Acrobat, and the marketing applications.

We decided we would train and invest in our own models; we have a long history of doing research, and we want to be in control of exactly the quality and capabilities those models need to have. We also decided to train these models in a special way: not on data from the internet, but on data we had access to, because we have a large catalog of content in our marketplace, Adobe Stock. That gave us the opportunity to train these models differently and make them more commercially useful.

We also decided not to build our own infrastructure. It's very easy, when you look at the short term, to say, "at Adobe's scale I'm going to have my own data center and invest in it." But in reality, the partnership with AWS has enabled us to focus on what actually differentiates us as a company and have our partners help us with the lower levels of the stack.

And we've managed to launch a lot of capabilities. I'm going to let that video play out, showing what we've launched in the last nine months. We've launched three foundation models: one for images, one for vectors, and one for designs, and we've been massively successful. Our customers love it; they see themselves helped by gen AI, enabled to be more creative and more productive.

They've embraced those capabilities. In Photoshop, the Generative Fill capability, powered by the Firefly image model, is now the most used feature in Photoshop, which is a very high bar: Photoshop is so entrenched in our customers' workflows, yet this capability is so good, a copilot helping you achieve your generative edits, that they use it left and right.

As I said, deep integration is what got us this usage and let us help our customers. We also decided to train on Adobe Stock data, and that data doesn't contain others' intellectual property, trademarked content, or recognizable characters. So when you generate something with Firefly, we know you cannot generate Hello Kitty or Tom Cruise, because our model never saw them during training.

Four billion images have been generated so far. Now, if you look at how we do it, and maybe this is something you might want to replicate in your company: I took over this job literally 11 months ago, and when I came in, my goal was to create an AI superhighway to accelerate how we bring these gen AI capabilities into our products. The way we organized to make that happen: first, we made sure our training data is cataloged, high quality, filtered, and available for fast distributed training.

I'm also responsible for our ML training platform. We don't use SageMaker today; we use a lot of capabilities from AWS, but because we started building our own internal platforms before SageMaker was mature enough, we ended up having to build a lot of the capabilities ourselves. Now, based on what I saw this morning, there's a lot of new innovation coming from AWS that I think you should explore, and I don't think it's worth reinventing the wheel anymore, but at the time, we were too early.

So we needed to build our own framework. We have a training platform that operates on top of EC2 and a data platform that stores data in Amazon S3 and Amazon FSx for Lustre, but we built our own code to help us manage these training jobs.

Adobe has a large research organization. We have hundreds of researchers, and every summer we welcome hundreds of interns who come and do research with us. They are organized in labs focused on specific modalities: an imaging lab, a video lab, an audio lab, a 3D lab. Each of those labs builds models and trains them on top of our platform running on AWS.

We also have an applied research organization that is in charge of taking these models and making them production ready, doing the final training. I call it the last 20% of the job that actually takes 80% of the work: a lot of fine-tuning, production readiness, quality testing, and making sure the content is safe for work. A huge amount of work goes into this applied research and model productization organization.

Then we take those models and expose them as services, both internally, for the Photoshops of the world and all the products in Adobe's catalog, and also for enterprises and third parties. This is what we call the AI superhighway, and we're organizing ourselves in a way that promotes the fast creation of models and the fast conversion of those models into services, so they can be consumed by our applications and end up in the hands of our customers.

We've been through a lot of launches this year, as I'm showing here. We launched the first version of our Firefly image model in March, actually here in Vegas. Then we integrated it into Photoshop in May, into Illustrator in June, and into Adobe Express, our design suite for collaborators, in August. Then we kept adding capabilities up to our MAX conference in October, where we launched the second version of our model.

So it's a huge number of model versions and model types that we launched and integrated in a lot of places. It may seem like we've moved very fast, but the reality is there's a secret behind it, and the secret is that we've been working on this for a long time. Adobe Research has existed for...
