Optimizing performance for machine learning training on Amazon S3

Hello and welcome. How's everyone doing? Good? All right, before we get started, I just want to ask a few quick questions just to set the context a little bit.

How many of us here in the room have used Amazon S3? Yes, that's everyone. Now, how many of us have used Amazon S3 for machine learning use cases?

Alrighty, that's a good number. And one last question before we move on: how many of us have tried to optimize storage performance for machine learning use cases, not specifically in the context of S3, but storage performance in general?

All right. So the good news is that with the launches we're going to talk about today, you don't have to work on optimizing storage access for ML use cases; these features will do that work for you. Isn't that exciting?

All right. So on that note, let me just get started. My name is Dave Kumar. I'm a product manager on the Amazon S3 team. And today my colleagues James and Alex and I are going to talk to you about how you can use the new features that we announced recently to optimize performance for machine learning training on Amazon S3.

We are first going to talk about why you should run machine learning training on Amazon S3. Then we are going to talk about how you can run training on Amazon S3, regardless of whether you are training in a fully managed setting using our purpose-built services like Amazon SageMaker, or you are running self-managed infrastructure and services and are looking for general-purpose solutions. We will cover all of the best practices and recommendations in the talk today.

And finally, we will summarize the key takeaways and conclude our session. So the first question, I guess, that comes to mind is why run training on Amazon S3. Well, machine learning, as we know, begins with data, and Amazon S3 is the best place to store your data because of its unlimited scalability, its highly elastic throughput, and its high durability.

Customers of all sizes and industries use Amazon S3 to store and protect petabytes of data for use cases such as data lakes, media and entertainment, internet of things, genomics, healthcare, life sciences, and so on. When you have all of this data already in Amazon S3, you would think it's the natural place to start your machine learning journey from. Well, that's absolutely true.

There are some very sound and specific reasons why you should use Amazon S3 as the foundational storage for your machine learning artifacts and training data.

First, Amazon S3 offers a range of storage classes you can choose from based on the data access patterns, the cost requirements, and the data resiliency requirements of your workload. For example, you have S3 Intelligent-Tiering, which automatically optimizes storage costs for you. It is one of the most popular storage classes that we have in Amazon S3, and customers use it for use cases like machine learning and data lakes. That's because the access patterns are often unknown, and Intelligent-Tiering, for a small monitoring and automation charge, moves objects that have not been accessed to lower-cost tiers, delivering cost savings to you without any impact on performance and without any operational overhead.

Additionally, Amazon S3 provides features to optimize cost throughout the life cycle of your data. For example, once you set lifecycle policies on your bucket, the data moves to other storage classes, delivering cost savings to you without requiring any change in your application.
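
As a concrete illustration, here is a minimal sketch of setting such a lifecycle policy with Boto3; the bucket name, prefix, and 30-day threshold are placeholders, not values from the talk.

```python
import boto3

s3 = boto3.client("s3")

# Example lifecycle rule: transition objects under a prefix to
# S3 Intelligent-Tiering 30 days after creation. Bucket name,
# prefix, and day count below are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-training-data-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "datasets/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```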

So to summarize this point: with our range of storage classes and cost optimization features, you can store terabytes to petabytes of data on Amazon S3 and get automatic cost savings.

Second, Amazon S3 delivers very high aggregate throughput, which is extremely important for machine learning training, because you want to get your data from Amazon S3 to your compute instances as soon as possible. Over the last few days and months this year, we have announced a number of optimizations on the client side, including a brand new file client for Amazon S3 called Mountpoint for Amazon S3, integrations of the AWS Common Runtime library, which does automatic performance optimizations for you in AWS SDKs like Boto3, the Python SDK, and also a new Amazon S3 Connector for PyTorch.

With these new client integrations, you get high data transfer rates between your Amazon S3 bucket and your machine learning job. And the key message here is that these clients do things like request parallelization and manage retries and timeouts, so that you don't have to write any code to optimize storage access for ML use cases.

Last but not least, and it's actually a very significant point: besides these client optimizations, we also announced a brand new storage class yesterday called Amazon S3 Express One Zone. S3 Express One Zone is a new storage class that is purpose-built to deliver low-latency access for performance-heavy workloads. It's specifically useful for latency-sensitive workloads, for example training jobs where a lot of random data access is happening.

What we have heard from customers is that in use cases such as training vision models, there is a lot of random data access, which my colleagues will dive deeper into, and these workloads happen to be latency bound. With Amazon S3 Express One Zone and its integration with the AWS SDKs as well as Mountpoint for Amazon S3, you can achieve single-digit millisecond first-byte latency, so that your latency-bound workloads finish fast and you save on compute costs.

Finally, Amazon S3 integrates with a number of our purpose-built services that help you run different stages of the ML life cycle. Amazon S3 integrates with Amazon SageMaker so that you can build, train, and deploy your machine learning models in a fully managed setting. It also integrates with Amazon Bedrock so that you can take popular ML models available from top companies and fine-tune and customize them using your data in S3 buckets.

If you're running in a self-managed setting, you can also use Amazon EMR to run data preparation jobs, for example image augmentation at scale. And if your applications need a fully featured file system interface, for example to facilitate collaborative editing of data, then you can use Amazon FSx for Lustre, which integrates seamlessly with Amazon S3 and gives you a fully featured file system to work with for file-based applications.

So to summarize: with Amazon S3's range of storage classes and cost optimization features, the new high-performance S3 Express One Zone storage class, the new client integrations that we launched over the last few weeks, and its seamless integration with purpose-built services, you can use Amazon S3 as the foundational storage for all your machine learning needs and build your workflow on top of it.

For this reason, customers are running ML jobs of all sizes on Amazon S3 and benefiting from it, particularly for training very large models with Amazon SageMaker. A good example in this respect is the Technology Innovation Institute, or TII, which built its Falcon series of models using Amazon S3 and Amazon SageMaker. Falcon is one of the most popular models on the Hugging Face leaderboard today. It was built with 180 billion parameters, trained on a 3.5 trillion token dataset, on a cluster of hundreds of ml.p4d.24xlarge instances, benefiting from S3's elastic throughput through the S3 integration with Boto3, the Python SDK.

So this is a good example of a customer using the elastic throughput of S3 to build a very large model, which happens to be one of the most popular models on the leaderboard, as I mentioned. While Amazon S3 is really useful for large-scale models, the performance optimizations we talked about, which my colleagues will dig deeper into, are relevant for working with ML use cases at any scale, and they are applicable regardless of the dataset size and the size of the ML model you are working with.

So to dive deeper on that and take you through the full ML life cycle, I'll invite my colleague, Alex.

Thank you. Good morning. My name is Alexander Arzhanov and I'm a specialist solutions architect here at AWS, helping our customers deploy and optimize machine learning workloads in the cloud. In the past three years that I've been at AWS, I've talked to different customers with different use cases, but every single machine learning project usually follows the same cyclical pattern.

Everything starts with data ingestion, data exploration, and data preparation. Then, in the building phase, the data science team will usually test different approaches and different machine learning algorithms. Once they are satisfied with an approach, they will train those models end to end, on the entire dataset that was previously preprocessed, until convergence.

And of course, they will monitor the metrics that make sense for the business use case they are solving. Once they are satisfied with the performance, those models will be deployed to production, and there the models themselves will be continuously monitored, so if the performance degrades, the machine learning life cycle repeats itself.

However, it is important to understand that throughout those different stages of the machine learning life cycle, your data lives and continuously evolves on Amazon S3. For example, take the case that Dave talked about previously, TII and the Falcon models, and the pre-training of large language models in general.

What you might start with in this case is some kind of scrape of the entire internet: petabytes of unstructured data, HTML files, and different media file formats stored on Amazon S3. Your data science team will then work around the clock to deduplicate this data, filter it, and clean it, so that what you end up with is a more structured, high-quality dataset that you store back to S3. It can be Parquet files or other machine-learning-specific formats.

Then those terabytes of data will be continuously streamed into the compute cluster, and the training phase itself may go on for days, weeks, or even months. Of course, throughout the training, snapshots of the model weights will be taken periodically. In the case of large language models, they can again be hundreds of terabytes in size, and they are put back on Amazon S3.

To summarize here: if you are doing machine learning on AWS, it is really important to understand where and in which data format to store your data in order to optimize for performance and cost. Regardless of the machine learning frameworks and tools that you might be using, it is really important to understand the implications of proper data I/O performance.

For example, if we take the training phase, we can see why this is important. If your data I/O performance is sub-optimal for the training phase, you end up with compute resources that are under-utilized throughout the training. That is something you don't want, because the most expensive resources, the GPUs and the other accelerated compute instances, will end up idling for a significant amount of time during the training.

Somewhat connected to this, if you have a bottleneck in your data I/O, your machine learning training runs will take longer to finish, and this will of course slow you and your teammates down in iterating through the machine learning life cycle.

OK. So performance is really important, but which main considerations should you keep in mind when we talk about this topic? First of all, naturally, there is the total amount of data that you have: the total dataset size and the number of files that your dataset has. If your dataset is large, it will take longer to download it, or in some cases even to list the files.

Somewhat connected to this is the question of how your dataset is composed: is it comprised of a large number of small files, or have you previously serialized your dataset so that you have just a few larger chunks?

Another question that you might be asking yourself is whether you need to download this data upfront to your local instance storage, or whether you would rather have the dataset streamed on demand as your training script requests the data.

And finally, one of the important questions that I usually challenge my customers with is whether you truly need fully random access to the training samples during training, or whether you can still design your training algorithm and your data loader to consume training samples one by one, sequentially.

So let's double down on this last topic, because it is really important: sequential reads versus random reads. I usually start this conversation with my customers by drawing an analogy to the good old hard disk drives that almost all of you are certainly familiar with. If you remember, hard disk drives have a mechanical actuator arm that moves around the spinning platter and reads different memory blocks.

In this analogy, we have sequential reads if our memory blocks are read contiguously, one after another, so that the mechanical arm doesn't have to move around much. We have a random access pattern if we start requesting data at random, which means that the mechanical arm has to move all around the disk to read those memory blocks, and this incurs delays when reading subsequent blocks.

The situation in the context of machine learning workloads is somewhat similar. By analogy to the HDDs, we say that we have a sequential read pattern in our machine learning workflow if our dataset is comprised of a small number of large files, and each file contains many, many training samples, so that you can quickly read each training sample contiguously, one after another, by firing one single GET request.

We say that we have a random access pattern in our machine learning workload if our dataset is comprised of a large number of small files, and each file represents just a single training sample. In this case, we end up firing a GET request for every single training sample.

Alternatively, we can still have a situation where we have serialized our dataset into a smaller number of larger files,

"But our data loader requests data from different parts uh within this file. For example, by firing a byte range, get request to s3. In either of those cases, what we end up doing is we end up firing many, many, many s3 uh get requests in order to read the same amount of data.

So let's illustrate the performance implications of these two read patterns with two fictional calculations.

First, let's say that we have a computer vision task at hand, and our dataset is comprised of JPEG files. In the sequential read case, we have first serialized this dataset into larger file chunks, maybe using protocol buffers, and placed those as objects on Amazon S3. Let's say that every object is about 150 megabytes in size. Then we start our data loader with a single thread.

So we start getting data from S3 by firing a GET request. But until we get the first byte back, it takes some time, and this time is usually called the time-to-first-byte latency. After we have gotten this first byte back, we start downloading our data samples in bulk, and let's say it takes only one second to do so.

If we assume that our time-to-first-byte latency is approximately 100 milliseconds, we arrive at an effective throughput in this case of about 140 megabytes per second. And of course, I must add that this is a fictional scenario where you have only a single client thread; Amazon S3 on the server side can naturally scale to much larger numbers.

Continuing with our fictional example of a single client thread, now we have the random access pattern, where our dataset is stored as raw JPEG files on Amazon S3 and each JPEG file represents a single training sample.

So we again start firing GET requests, and every time we fire a GET request, we have to wait the time-to-first-byte amount of time. Let's say in this case that the download time for those tiny files is negligibly small compared to the time-to-first-byte latency and can be neglected altogether.

With this quick calculation, we quickly arrive at the conclusion that our throughput will be about 140 times lower than in the sequential case. Of course, it is quite unlikely that you would build a data loader and read data with just a single thread.
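
For the curious, here is a back-of-the-envelope version of those two fictional scenarios. The roughly 100 KB per-JPEG size is an assumption added here to make the arithmetic concrete; the talk only gives the object size, the time to first byte, and the download time.

```python
# Single-threaded, fictional numbers from the talk, plus one assumption:
ttfb_s = 0.100      # time to first byte, ~100 ms
object_mb = 150     # serialized shard size in the sequential case
download_s = 1.0    # time to stream the 150 MB once bytes start flowing
sample_mb = 0.1     # assumed ~100 KB per raw JPEG in the random case

sequential_mbps = object_mb / (ttfb_s + download_s)  # ~136 MB/s, i.e. "about 140"
random_mbps = sample_mb / ttfb_s                     # ~1 MB/s, download time neglected

print(f"sequential: {sequential_mbps:.0f} MB/s")
print(f"random:     {random_mbps:.1f} MB/s")
print(f"ratio:      ~{sequential_mbps / random_mbps:.0f}x lower throughput for random reads")
```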

Nonetheless, it is very illustrative of the drastic difference between these two read patterns. As I have already mentioned, on the server side Amazon S3 can scale to much, much larger throughput numbers if you parallelize your calls properly.

However, when you start parallelizing your calls on the client side, you also increase the strain on CPU resources, so if you are trying to optimize for throughput, it is still recommended to follow the sequential read pattern.

Despite this, there might still be situations where you don't want to serialize your datasets, or you just can't do so because of the peculiarities of your machine learning algorithm, but you would still like to stream data from Amazon S3 with high throughput.

So, as Dave mentioned previously, yesterday we introduced our new Amazon S3 storage class, S3 Express One Zone, which is purpose-built to deliver consistently single-digit millisecond latencies. It is also the first Amazon S3 storage class that allows you to select an Availability Zone, and by doing this you can co-locate your compute and storage resources in order to optimize for throughput and cost.

In one of the benchmarks that we have performed, we have seen that Amazon S3 Express One Zone can deliver time-to-first-byte latencies that are 10 times lower than in the case of S3 Standard.

OK. So now let's see how we can actually access our dataset stored on Amazon S3 from machine learning workloads, and let's do this with Amazon SageMaker first.

Taking another look at the slide that I showed previously, we can see that there are a lot of different stages and a lot of different moving parts involved in the life cycle. There might be situations where you or your team doesn't want to build everything from scratch or manage all of it, or it might even be that you don't have the skills to do so.

For these cases, we have Amazon SageMaker, our fully managed machine learning service that helps you throughout every stage of the machine learning life cycle. It also includes fully managed interfaces to data on Amazon S3, which I will talk about shortly.

Let's first quickly recap what a machine learning training job looks like on Amazon SageMaker. When you make an API call to start a training job with SageMaker, the following sequence of events will happen.

First, the machine learning compute instances that you have specified in your API call will be automatically provisioned and, if you have requested that, put behind a virtual private cloud. Then a container, which can be your own container or one of the prebuilt containers with different frameworks that we have on Elastic Container Registry, will be pulled down and spun up automatically on the compute instances.

Later, your custom training script will be injected into the container and executed. And as I mentioned previously, datasets on Amazon S3 might be made available for you to access directly from the container, if you have requested that. Once the training job finishes, all the provisioned resources will be torn down automatically, and all the machine learning artifacts that you have produced during your training job will be stored back to Amazon S3.

So you end up paying only for what you have actually used throughout the training. Now let's talk about the SageMaker managed data loading options from S3 that we have here.

First of all, we have file mode. In this mode, you give SageMaker, for example, an S3 prefix or a list of different S3 objects in a so-called manifest file, and all this data will be automatically downloaded upfront to the local instance storage. Then your training script will be able to read this data locally.

Of course, in this case it provides very familiar file-system-based data access with a POSIX interface and really fast random I/O. However, the downside is that your dataset will be downloaded upfront, which can in some cases incur some startup delay, especially if your dataset is large or is comprised of a large number of small files.

Another consideration to keep in mind is that your data must fit into the local storage. Also, if you read data locally, there might be cases where your throughput depends on the volume size of EBS-backed instances, or on the instance type itself if you are using NVMe-backed instances. And now file mode supports S3 Express One Zone, which allows you to significantly reduce the download phase if you choose this input mode.

However, there is another option to access data from Amazon S3 with SageMaker, and that is fast file mode. In this mode, the data won't be downloaded upfront but rather streamed directly from the S3 bucket as your training script demands it. Of course, here there will be no startup delay, so your jobs will kick off near-instantaneously.

And since you are streaming data directly from S3, you can enjoy the highly elastic throughput that it offers. Basically, what that means is that you can consume your machine learning training samples as fast as your CPU can read and preprocess data. Also, your datasets no longer have to fit into the local instance storage.

And most importantly, there is absolutely no code change required in the training script. Basically, for the training script it appears as if the data has already been downloaded locally. However, as we discussed previously, if you are doing random reads, you might suffer from somewhat low throughput. But again, with the support of S3 Express One Zone in fast file mode as well, you will be able to alleviate this last point substantially.

To summarize, SageMaker has two different options to access data from Amazon S3: file mode and fast file mode. Use file mode to automatically copy data upfront and then read it locally from the local instance storage. And how do you kick off a job with this input mode? If you are using the SageMaker SDK, that is really easy.

First you instantiate an object called an estimator, which is nothing more than a collection of different configurations so that SageMaker knows how to execute the training job. For example, you specify the image URI of your container from the Elastic Container Registry, you point the estimator object to the custom training script, and then you specify the input mode, in this case file mode. You specify the inputs, in this case an S3 prefix, and then you kick off the training job by calling the fit function.

All of that chain of events that I showed previously will be automatically executed. And use fast file mode if you would rather stream your data directly from S3 on demand as your training script reads it, without any code change. And what I mean by without code change is that you literally only have to add four characters to the specification of the input mode in order to do so.
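
As a rough illustration of what that looks like with the SageMaker Python SDK, here is a minimal sketch; the image URI, role, script, instance type, and bucket are placeholders, and the exact estimator arguments may differ in your setup.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# All names below (image URI, role, script, bucket) are placeholders.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    entry_point="train.py",            # custom training script injected into the container
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    input_mode="File",                 # the four-character change: "FastFile" streams from S3 on demand
    sagemaker_session=session,
)

# The training channel is just an S3 prefix; SageMaker downloads it upfront
# (file mode) or streams it on demand (fast file mode) for you.
estimator.fit({"train": "s3://my-training-data-bucket/datasets/train/"})
```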

So you can quickly jump from one read pattern to another to see what works best in your case. Of course, SageMaker is just one of the many options that we have on AWS to do machine learning training, and here I will hand over to my colleague James to walk you through some of the other options that we have.

Awesome. Thank you. Good morning. So, how many people are here at re:Invent for the first time? Awesome, me too. I hope you're having a great conference. Congratulations on making it to Wednesday; you're halfway there. I'm going to tell you a little bit about some of the work we've been doing on making it easier to use S3 to train ML models using your own compute, or if you're not on SageMaker.

Now, one of the things that I've found really interesting about working in the ML space recently is that things are changing really fast, right? And so rather than trying to tell you how to optimize for any particular model or algorithm or something like that, I thought what I would do with our time is tell you some general-purpose answers, right? Things we think will work well in any scenario. And then gradually we'll work towards a couple of more specialized solutions that will work in cases that you're likely to run into but are specialized to particular use cases.

One of our kind of guiding tenets for S3 is that it should be simple to use. In fact, it's actually in the name. We don't say the name out loud very often, but S3 is the Simple Storage Service. We want it to be simple to use. And what we think about when we talk about clients, then, is that it should be easy to get data in and out of your application from S3. We don't want you to have to think about the difficulties of getting data out of storage, how to go fast, how to optimize it, all that kind of stuff. We want to do that work for you. The team really spends a pretty remarkable amount of time thinking about simplicity, thinking about how to make S3 easier and faster for you for free.

And so I want to tell you a little bit about some of the work we've been doing in that space. One of the first things we did earlier this year was the launch of a new product we call Mountpoint for Amazon S3. Mountpoint is what we call a file client. What that means is that it presents your S3 bucket, your objects, as if they were files in a local file system.

And we built Mountpoint because we heard from customers that they really loved S3's low cost and its elasticity, but they had applications that either didn't understand S3's APIs, didn't speak object APIs, or didn't speak object APIs as fast or with as high performance as they would want.

And so when we were looking at this feedback, what we realized was that there's actually already a pretty ubiquitous interface for getting data in and out of applications, right? It's the file interface. Every application for the last 50, 60, 70 years understands how to speak file, right? It's probably not something you think of as a feature in an application; files are just the thing that it does.

And so with Mountpoint, what we do is build a way for you to get to your S3 objects through the file system with the same high performance, the same elasticity, the same low cost you expect from S3's object APIs. What that means for ML training is that Mountpoint is a really great default option for getting data in and out of your applications.

Because just like every application understands files, every ML framework that you can think of has interfaces for getting data out of files, for loading training samples, and for saving data back into files for model checkpoints and things like that. Mountpoint is super easy to get started with. If you install it, it's a one-line command to start up Mountpoint: you tell it what bucket you want to use and where you want to present it on the file system, and then you'll be able to get your objects through that interface. And all the code you can find on the internet for any ML framework will just work through that interface.

So for example, if you're training a model like a vision model using PyTorch, you can use the built-in ImageFolder dataset abstraction to just load data out of an entire folder in your S3 bucket as if it were local. Same thing if you're working with TensorFlow, right? If you're using a TensorFlow training pipeline, you can use its built-in features for listing files and directories and opening files to get data into your TensorFlow training from S3 through this file interface. And again, the really important thing here is that Mountpoint is going to give you really high performance training against S3.
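
Here is a minimal PyTorch sketch of that idea; the mount path, bucket name, and dataset layout are placeholders, and it assumes the bucket has already been mounted with Mountpoint.

```python
# Assumes the bucket has already been mounted with Mountpoint, e.g.:
#   mount-s3 my-training-data-bucket /mnt/s3
# After that, the bucket contents look like ordinary local files.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder reads class-labelled images straight through the file
# interface that Mountpoint exposes over the S3 bucket.
train_dataset = datasets.ImageFolder("/mnt/s3/datasets/train", transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, num_workers=8)

for images, labels in train_loader:
    pass  # training step goes here
```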

So let me tell you a little bit about what that looks like. One of the things that makes Mountpoint awesome is that you get this performance out of the box. And so what I have here are some results comparing a Mountpoint-based training pipeline to one that uses fsspec. Now, you might not have heard of fsspec. It's a pretty common library for this kind of thing. If you're using an ML training framework and you're already getting data out of S3, you're probably using fsspec under the hood, because most frameworks build it in.

And so that's what we're going to compare to here. Now, as Alex was telling you earlier, one of the most important decisions you can make when designing an ML training pipeline on S3 is how you arrange and access your data, right? There's a big difference between having one training sample per object, sort of random access to files in your bucket, versus being able to combine many different samples into larger objects and accessing them sequentially, reading them front to back.

And so I'm showing you both of those results here. On the x axis here is training throughput, so higher is better; we want to see more images per second. We're training a pretty small vision model here, and what I'm showing you is results for both random and sequential access. So the first thing you notice is exactly what Alex was telling you: being able to combine your samples into larger objects dramatically increases training throughput, by up to 10x in this case.

So sequential is better if you can get there. The other thing I wanted to show you here is that Mountpoint is actually up to 60% faster than fsspec, especially in this large-object case. Mountpoint has a bunch of optimizations built in, which I'll tell you more about later, that are really geared towards streaming large objects out of S3 at really high throughput.

The other thing you might notice on the slide, by the way, if you're paying close attention, is that in the random access case Mountpoint is actually a little bit worse than fsspec. Don't worry, we're going to fix that in a couple of slides. But this is the big point you should take away from this: Mountpoint is a great way to get good performance in an ML training pipeline, hopefully with very little work on your part. And we're not done.

We've actually launched two recent new features for Mountpoint that are going to make this even better for some of your training use cases. The first thing we launched last week is that Mountpoint can now cache training data locally, on instance storage or on an EBS volume. What that means is that we can further speed up your training jobs, especially jobs that are repeatedly accessing the same data.

So if you're training for multiple epochs on the same dataset, this caching is going to work great for you. Like everything else with Mountpoint, we want it to be really plug and play, and so it's just a one-line thing to turn on the cache: you just tell us where you want us to cache your data. That can be a file system on your local NVMe storage, it can be an EBS volume that you've mounted, or it can even be local memory if you set up a temporary file system.
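
As a rough sketch of what turning this on looks like, here is a Python wrapper around the mount command; the bucket name, mount path, cache directory, and the exact flag name are as I recall them from the Mountpoint documentation, so treat them as assumptions and check against the current docs.

```python
import subprocess

# Mount the bucket with Mountpoint's local cache enabled. The --cache flag
# points at local NVMe, an EBS volume, or a tmpfs path; all names below
# are placeholders.
subprocess.run(
    [
        "mount-s3",
        "my-training-data-bucket",
        "/mnt/s3",
        "--cache", "/mnt/nvme/mountpoint-cache",
    ],
    check=True,
)

# Repeated reads of the same objects (for example, multiple training epochs)
# are now served from the local cache instead of going back to S3 each time.
```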

Now, of course, you could get the same performance benefit by just doing this yourself, right? You could download all the objects onto your local storage and then just read them directly; you don't have to have Mountpoint or S3 in the loop at all. And that would totally work. But the difference here is that Mountpoint is going to do a lot of the heavy lifting for you.

So if you don't want to read all the objects out of your S3 bucket, maybe you want only a subset of them, Mountpoint will load them dynamically for you, so you don't have to think about which subset of your objects you need. Mountpoint is going to manage the cache for you, so if the data doesn't fit in cache, we're going to do cache eviction and cache admission for you for free.

So I promised you this would be faster, and in simple benchmarks it is: up to 2.5 times faster when you read out of the cache, which makes sense, because the storage is local. But let me tell you a little bit about how this works specifically for machine learning training.

So this is the same benchmark I was showing you a few seconds ago, a pretty simple, small vision model, except this time I'm going to show you multiple training epochs. So the x axis here is going to be the epoch and the y axis is going to be throughput.

Now, as you saw earlier, without this feature, when we train on individual image files, so sort of random access to S3, throughput's not great, right? Because you're sort of bound by S3's latency in this case. But when we turn on the Mountpoint caching, we see a couple of interesting things.

The first one is that even in the first epoch of this training, Mountpoint is faster with caching turned on. And that's a little bit weird, right? Because you're still downloading the objects from S3 in that first epoch; they're not on local storage yet. What's going on is that Mountpoint is also able to cache metadata.

So this first epoch is faster not because we already have the data, we're still downloading the data, but because the metadata is cached the first time around. But it's after the first epoch that things really kick into gear. As we get into these later epochs, where we're serving all of your requests entirely from cache, not speaking to S3 at all, training is up to two times faster using this caching feature than going directly to S3 every time.

So I'm really excited for you to try out this new caching feature of Mountpoint. It's a pretty plug-and-play way to get better performance from multi-epoch training on S3. The second new thing that we launched recently is something that Dave has already mentioned to you a couple of times: the new storage class for S3, Amazon S3 Express One Zone. S3 Express One Zone is our new high-performance storage class.

It's built for single-digit millisecond latencies for your objects. Unlike most of our storage classes, S3 Express One Zone objects are stored in a single Availability Zone to minimize access latency, and that means you can co-locate your compute closer to your storage to get even better performance.

Amazon S3 Express One Zone also has a new kind of bucket, the first time we've ever launched a new bucket type for S3, called the directory bucket. The directory bucket is optimized specifically for high request rates, high parallel throughput, high TPS.
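
For reference, here is a rough sketch of creating such a directory bucket with Boto3; the bucket name and Availability Zone ID are placeholders, and the exact CreateBucketConfiguration shape is from the S3 Express One Zone documentation as I recall it, so verify it against the current API reference.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Directory bucket names embed the Availability Zone ID and end in "--x-s3".
# Bucket name and AZ ID below are placeholders.
s3.create_bucket(
    Bucket="my-training-data--use1-az4--x-s3",
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": "use1-az4"},
        "Bucket": {"Type": "Directory", "DataRedundancy": "SingleAvailabilityZone"},
    },
)
```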

And so hopefully, given the kind of build-up I've given you, it won't surprise you to learn that this week we also launched Mountpoint support for S3 Express One Zone. What that means is that you can just use Mountpoint like you were doing before, except you can point it at a directory bucket instead of a regular bucket, and you'll get access to this low-latency storage class through a file interface.

And what we've seen here is up to 30% higher throughput training against S3 Express One Zone with individual files stored there, versus stored in S3 Standard. What I find really exciting about this is that it starts solving the question of having to make decisions about storage classes and storage layouts and that kind of stuff.

You can just put individual image files in S3 Express One Zone and you'll get pretty good training throughput out of it. So one of the goals here is to make it so that you have to make fewer decisions in your training pipeline.

And so that's Mountpoint. Again, we think Mountpoint is a really great general-purpose answer for getting data in and out of S3 in any training pipeline, anything that understands files, which is pretty much everything.

Now, if you wanted to, you could leave right now, right? This is kind of the most important thing I wanted to say: Mountpoint is awesome, you should try Mountpoint. I hope you'll stay, though, because I have a couple more things to tell you about that I think are really cool, and that specialize for a couple of the most common scenarios you're going to run into in any ML training pipeline: getting data in and getting data out.

And the S3 team has been working on some really cool stuff here specifically for ML training. First, though, I hope you will geek out with me for a couple of slides about how clients like Mountpoint are able to get this awesome performance out of S3. As you probably know, S3 has been the storage service of choice for data lake workloads for over a decade now.

And along the way, we've learned a ton about how to get really high performance, really high elastic performance, out of S3 at scale. A few years ago, we realized that there was really a need to bring some of our performance best practices, the things that we tell customers to do with S3, to as many of our clients as we could.

And so we built this really cool software stack that we call the AWS Common Runtime, the CRT. The Common Runtime is a collection of libraries that offer all the things you need to interact with S3: asynchronous I/O, an HTTP client, encryption, authentication, all that kind of stuff. Most importantly, it includes an S3 client, and that S3 client does a bunch of really smart things, like automatic request parallelization to do multipart GETs, load balancing across different DNS answers, retries and timeouts, that sort of thing: all the things you need to build a high-performance, high-reliability S3 client.

Let me give you an example of the kinds of things the CRT can do. One of the great things about S3 is its simplicity, right? It's just an HTTP service. You can upload to S3 or download from S3 from anything that speaks HTTP. When you do that, you're going to get about 100 megabytes a second speaking to a single S3 front-end host.

But as you probably know, S3 is not just one server, right? S3 is actually a pretty massive collection of front-end servers. And so one of the things the Common Runtime knows how to do is fan out your requests across multiple S3 front-end servers to get even higher throughput, to aggregate throughput across all of these servers.

This is also a way to take advantage of some of our recent launches, things like multi-valued answer DNS, to give you back more responses, more servers you can access more quickly. And so the S3 team has really been relentlessly looking for opportunities to use this Common Runtime, to use the CRT, to transparently accelerate your applications.

And it really comes back to the guiding tenet of simplicity that I was talking about. We really want S3 to be something that you don't have to think about, something that just works for your training workflows. And so we've been looking for opportunities to use the CRT for more things.

One of the areas where we identified an opportunity like this is in training workloads that use Kubernetes to orchestrate their training jobs. Often these are distributed training jobs across multiple nodes. Now, there are a few different ways that customers orchestrate training jobs like this today, but what we've heard pretty often is that customers want to use the Kubernetes primitives that they're familiar with from their other production workloads to orchestrate ML training as well.

Often they also want to use EKS, our managed Kubernetes offering, to do the configuration and automation for them. Now, what makes Kubernetes really cool is that it's a declarative way to manage training infrastructure: you launch your training jobs, you declare what resources they access, and you can scale them up and down as you need without manually managing individual nodes or the connections between them.

And so if you're using Kubernetes or EKS to manage your training jobs, this week we launched the new Mountpoint CSI driver for Kubernetes. The Mountpoint CSI driver makes it really easy to attach S3 storage to your Kubernetes pods, and it's just Mountpoint under the hood, as the name suggests.

So you declare that you want to mount a particular S3 bucket to your Kubernetes pod, and it shows up just like the Mountpoint example I was showing you before: as a directory in your pod that you can access through a file interface. And you get the same great performance that we were talking about from Mountpoint before, the high-throughput access to S3.

If you're using EKS, the Mountpoint CSI driver is also available as an EKS add-on. What that means is that EKS manages deploying this add-on to your cluster automatically, so you don't have to do the work of deploying it yourself. We're really excited about this for customers who are using EKS, particularly to manage their training workflows.

Again, we think it's a really great default option to get S3 data in and out of your containerized applications without any heavy lifting on your part.

Let's talk a little now about frameworks, the framework that you use for ML training. There's really a plethora of options here; there are tons of options for ML training, and probably some grad student is writing a new framework as we speak. But mostly what we hear from customers is that they're focusing on PyTorch. PyTorch is increasingly the framework of choice for training ML models, and this makes a bunch of sense. PyTorch has all the primitives that you need to support all kinds of models and all kinds of training algorithms. It has support for many different training accelerators, things like GPUs and Trainium. It's also the language of choice for a lot of research these days in ML.

And so even though we think Mountpoint is a really great default option for getting ML data in and out of any training workflow, we think that in the case of PyTorch there are some opportunities where we can do even better, specifically for PyTorch access. So I want to tell you a little bit about that.

Last week, we launched the new Amazon S3 Connector for PyTorch. The connector for PyTorch closely integrates S3 with your PyTorch training pipelines. Now, unlike Mountpoint, using the S3 Connector for PyTorch requires a little bit of code change on your part: you have to use it in your training code. But in return, it's going to help you solve two of the most important problems in any ML training pipeline: how to get training data into the pipeline, and how to get model data, model checkpoints, out of the pipeline. I'm going to talk about each of those in turn, but the common theme here is that by using the AWS Common Runtime, the S3 Connector for PyTorch makes it really easy to get that elastic performance that we were talking about before, to reduce your training time and costs.

So let's start by talking about data loading. What we heard from customers using S3 for training data is that there's a lot of boilerplate work they have to do to get data in and out of their training workflows. For example, if you want to stream data out of S3 rather than loading it all up front, today you're on the hook for managing all of that: you have to go and list your bucket, and you have to manage GETs, often parallel GETs to multiple different objects, and interleave those if you want to get really good streaming performance. There's a lot of work that you have to do to make this work.

On the other hand, if you want random access to your training set, if you want to be able to randomly access individual samples, again you're on the hook for managing this: you have to go and list your bucket, figure out how to map indexes to objects, and maintain that listing consistently, often across many different training nodes.

And so to make these problems simpler, the S3 Connector for PyTorch has two abstractions for loading data directly from S3. In the streaming case, it has what we call an S3 iterable dataset, an iterable-style dataset for PyTorch. This is going to stream objects from an entire prefix of your S3 bucket, and again, it's a one-line piece of code that is going to do all the work that I just talked about. It's going to manage listing your bucket for you, it's going to interleave those listings with GETs so you can get high throughput out of S3, and it's going to do all the work of streaming objects out of your S3 bucket.

On the other hand, if you want random access to your S3 bucket, we also provide a map-style dataset. This is something where you can look up objects in your bucket by random access. And again, it's going to do all the work for you: you tell us what prefix your training data is saved in, and then it will do all the work of listing your bucket, figuring out how to map indexes back to objects in your bucket, and again doing those GETs for you with high performance.

The cool thing about both of these datasets is that they integrate tightly with PyTorch's existing data loader infrastructure, so they will just work if you're already using PyTorch for data loading. And comparing to fsspec again, with the iterable-style dataset in particular, what we've seen is up to 50% higher training throughput using these datasets rather than using the built-in fsspec-based loaders.
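
Here is a minimal sketch of the two dataset abstractions, based on the open source s3torchconnector package as I understand it; the region, prefix, and decoding logic are placeholders, so check the connector's README for the exact current API.

```python
from s3torchconnector import S3IterableDataset, S3MapDataset

REGION = "us-east-1"                                      # placeholder
PREFIX = "s3://my-training-data-bucket/datasets/train/"   # placeholder

# Streaming access: objects under the prefix are listed and fetched for
# you, with listing interleaved with GETs for throughput.
iterable_dataset = S3IterableDataset.from_prefix(PREFIX, region=REGION)
for obj in iterable_dataset:
    payload = obj.read()   # obj streams the bytes of one S3 object
    # ... decode payload into a training sample here

# Random (map-style) access: the connector keeps the index-to-object
# mapping, so a sampler can look up any sample by position.
map_dataset = S3MapDataset.from_prefix(PREFIX, region=REGION)
first_obj = map_dataset[0]
```

Both flavors plug into torch.utils.data.DataLoader in the usual way, typically with a transform that turns each object's bytes into tensors.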

So that's the data loading side. Let's talk a little bit about checkpointing. For large-scale training, and especially for distributed training across hundreds of nodes, it's really important for your training jobs to periodically save checkpoints of the model state, so that if one of those nodes fails, you don't have to go back to the beginning and start again.

Now, checkpointing in most ML training algorithms these days is a stop-the-world operation: all the nodes need to stop, pull their state off the GPU, and save it to storage, and no one can start training again until all of those saves have happened. What that means is that checkpoint performance directly affects the training time and cost of your training jobs. It also means that checkpointing is a really bursty and really demanding storage operation. You might only save a checkpoint once every hour, but when you're saving that checkpoint, you need the storage to be as fast as possible, because every second that you spend saving that checkpoint is a second you don't spend doing useful work in the training pipeline.

This is a really great fit for S3, right. S3's elastic performance means you can burst up to potentially hundreds of gigabytes per second of throughput when you need it to save a checkpoint and then scale back down to zero immediately and only pay for what you used. And so we think S3 is a really great target for saving checkpoints.

And so part of the S3 Connector for PyTorch is an interface for saving and loading checkpoints directly to and from S3. Again, it's a couple of lines of code that plug into PyTorch's native checkpoint saving and loading framework, where you can just say: I want to save this checkpoint to this path in my S3 bucket, and we'll take care of streaming the object for you as quickly as possible using the CRT.
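
A minimal sketch of what that can look like, again based on the s3torchconnector package as I understand it; the model, region, and checkpoint URI are placeholders.

```python
import torch
import torch.nn as nn
from s3torchconnector import S3Checkpoint

model = nn.Linear(128, 10)                      # stand-in model
checkpoint = S3Checkpoint(region="us-east-1")   # placeholder region

# Save the model state directly to S3; the connector streams the upload
# rather than staging the whole checkpoint on local disk first.
with checkpoint.writer("s3://my-checkpoints-bucket/run-42/epoch-3.ckpt") as writer:
    torch.save(model.state_dict(), writer)

# Load it back the same way when resuming after a failure.
with checkpoint.reader("s3://my-checkpoints-bucket/run-42/epoch-3.ckpt") as reader:
    model.load_state_dict(torch.load(reader))
```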

What's awesome about this is that it's really high performance. It's certainly much faster than other S3 clients we've tested, but the really surprising thing here is that we also found it's up to 40% faster to save a checkpoint this way than to save a checkpoint to local NVMe instance storage.

Now, if you're like me, you hear that number and you say: that's ridiculous, that can't be true, how is that possible? NVMe is really, really fast. And the answer is: yeah, NVMe is really, really fast, that's definitely true. But remember this picture that I was showing you before. One of the benefits of the Common Runtime, on which we built the connector for PyTorch, is that it makes it really easy to burst your object uploads across the entire S3 fleet.

So when you're going to save one of these checkpoints, rather than saving it to a single NVMe disk on your instance, you actually get access to the entire scale of S3, right, potentially millions of storage devices to save your checkpoint to at once when you need it and then scale back down to zero once you're done.

So we think this is a really great way to handle model checkpointing that will reduce your training costs, increase your training throughput, and increase your flexibility to retain your model checkpoints in S3 and use our storage management features, things like Intelligent-Tiering, to manage your checkpoints for you.

Now, like everything I've mentioned today so far, the Amazon S3 Connector for PyTorch is an open source product, so we're really excited to hear your feedback on it and to take contributions, particularly if you're interested in doing that. The team is really just getting started here, but this is our first step in more tightly integrating PyTorch and S3 together, to make it a really great training experience when your training data is stored in S3.

The last thing I want to tell you about is to admit that even though we're really excited about Mountpoint, the Mountpoint CSI driver, and the S3 Connector for PyTorch, we also recognize that they're not going to solve every problem for you. Sometimes you're just going to have bespoke data loading needs: maybe your bucket is laid out in a certain way, or you want to build your own prefetching logic because you know better, or there's some other reason that you just can't use these off-the-shelf tools. In these situations today, you probably use Boto3, the AWS SDK for Python, or maybe you use the CLI, to get your S3 objects onto local storage or, in the Boto3 case, into your application as fast as you can.

And when you use Boto3 or the CLI, we want you to get the same great S3 performance that we've been showing you so far through Mountpoint and the S3 connector. And so this week we also announced that the AWS Common Runtime is now enabled by default for S3 transfers in Boto3 and the CLI when running on ML training instances, that's P4, P5, and Trn1. It's also available as an opt-in on any other instance type.
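
To be clear about what that means in practice, here is an ordinary Boto3 transfer; per the talk, on the ML training instance types the CRT-based transfer path now applies to calls like these by default, with no code change, and the bucket, key, and local paths below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# A plain Boto3 download; on P4, P5, and Trn1 instances the CRT-based
# transfer path is used by default, so this same call gets the
# accelerated throughput without any code change.
s3.download_file(
    "my-checkpoints-bucket",
    "run-42/epoch-3.ckpt",
    "/tmp/epoch-3.ckpt",
)

# Uploads go through the same accelerated path.
s3.upload_file(
    "/tmp/epoch-4.ckpt",
    "my-checkpoints-bucket",
    "run-42/epoch-4.ckpt",
)
```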

What that means is that you can now get access to the same CRT-based throughput that I've been showing you from your applications directly, even if you're not using one of our higher-level tools, for example when you're uploading or downloading ML checkpoints or ML training data. We've seen up to 3x faster downloads using the CLI with this new CRT integration than we saw without it, and up to 5x faster uploads.

So if you're not able to use one of our higher level tools starting this week, Boto3 and the CLI can now do a lot of this heavy lifting for you and hopefully it will be much faster. And so I encourage you to try those out as well.

And so that's pretty much all I wanted to share with you. Again, there are three key takeaways from this session. The first and the most important one by far is what Alex was telling you about data access patterns. It's the most important decision you can make when training on S3. Think about sequential versus random: can you combine multiple samples together into larger objects, or do you have to do random access? If you have to do random access, think about whether you can use S3 Express One Zone, our new low-latency storage class, to reduce the cost of those random accesses.

The second thing I think is really important is to be on SageMaker if you can. SageMaker is awesome and it does all this work for you: other than making that choice between file mode and fast file mode, you don't have to think about how to optimize all this training throughput. We'll do all that work for you there.

If you're not on SageMaker, then our general-purpose advice is: use Mountpoint. That should be the first thing you try. It will work really well in many cases, it works with pretty much any training framework you can think of, and it gives great performance.

The S3 Connector for PyTorch is also a good option if you're on PyTorch specifically. If you're training large models or saving large model checkpoints, the S3 Connector for PyTorch is a great place to be. And there are some really great storage training materials available if you're interested in learning more about how to use S3 or any of our storage services.

And with that, I just want to say thank you very much for being here. The three of us will be outside those doors if you have questions or want to talk about S3 or anything like that, and we're also happy to talk over email if you have other questions. Other than that, thanks for being here this morning and have a great re:Invent.
