Optimizing storage price and performance with Amazon S3

Hello and good afternoon, everyone. Welcome to today's session, "Optimizing Storage Price and Performance with Amazon S3." My name is Andrew Coetzee. I'm a product manager with S3. I'm here with Carl Summers who's a principal engineer with S3.

We're absolutely delighted to be joined on stage today by Nova De Sarma, who's a systems lead with Anthropic.

Now in our session today, we're gonna focus on the steps that you can take to optimize how you store and access your S3 data from the initial stages of measuring and monitoring your storage to tuning the way in which you read data out of Amazon S3 to scale to millions of requests per second.

Now, really, my goal is that you can walk away from this session with a series of concrete and immediate steps to take to optimize how you use S3. We'll talk about how customers are monitoring their storage growth, how customers with different data access patterns are using S3 to optimize their storage costs and how they're able to achieve really awesome performance with minimal effort.

We'll then hand the mic over to Nova who's gonna talk to us about how Anthropic uses S3 to manage hundreds of petabytes of data to train their foundation models.

Now, before we get started, I wanted to say thank you, thanks to those in the room and thanks to the many millions of customers who use Amazon S3 to help us innovate.

Now, the thing that I find particularly fascinating about this slide is really the breadth and the depth of storage use cases and how these use cases have evolved over time.

When S3 launched in 2006, it was storage for the internet. I'm sure that many of us would largely agree that top use cases back then were things like website hosting and backups. But today we see customers of all sizes across every industry running every imaginable workload on Amazon S3.

They span from running data lakes, to storing massive volumes of data to train generative AI models, to running analytics applications. And the thing that so many customers like about S3 is its price performance and elasticity, which is really, really important because data is growing so rapidly.

In fact, last year IDC said that there is gonna be 100 zettabytes of data created and replicated in that year. Now, I like to put things into perspective. If we were to store that data on 10 terabyte hard drives, you could stack those hard drives all the way to the moon, back down to the earth, and one more time to the moon. It's a massive amount of data.

And with this growth of data, customers tell us that there's so much value in storing, processing and learning from their data sets. It's their differentiator and they want to do it in a way that is both cost and performance optimized.

So let's start by looking at cost. Now, when we see customers managing really large amounts of data, they typically organize around a few pillars, and these pillars help them think about how to optimize, how to learn, and how to adjust over time.

Now, as a first step, you want to define your workload requirements. It seems a little simple, it seems a little obvious, but it's really important to understand your use case and the requirements for that use case.

Now, once you define your requirements, you wanna understand your storage and develop insights. Because if you don't understand your storage, you can't take action. And lastly, you want to be able to optimize your storage and measure results.

Now, when you start to think about your requirements, the bottom line here is that no one knows your application as well as you do. So you are the best person to define these requirements. You wanna deeply understand the answers to questions like: how long is my data going to live for? What performance do I need? What resiliency do I want? Having the answers to these questions helps you use our service in the most optimized way.

Now, once you've defined these requirements, let me ask you this, why is it so important to develop insights into your storage? Because you need to understand how your storage is used to be able to take action.

Now, when many customers just get started with AWS, they're always going to have a pretty small storage footprint, they'll have a few S3 buckets and a handful of objects that span across those buckets. But as they scale to millions or billions of objects, things become a little bit more complicated.

They'll suddenly have hundreds of accounts and thousands of S3 buckets that span numerous AWS regions. And customers with these sorts of storage estates want to be able to understand how their storage is used across their entire organization.

Now, very specifically customers want the visibility into their object storage usage to avoid having to go bucket by bucket or account by account to gather details. They want the ability to analyze their data at the right granularity for root cause analysis and they want the ability to get meaningful insights that allow them to take action.

And so that's why we launched S3 Storage Lens which gives you organization wide visibility into your object storage usage to improve both data protection and cost efficiency.

S3 Storage Lens provides a single view of your storage usage across hundreds of accounts in your organization and it lets you drill down all the way to the prefix level. It's a visual and highly interactive dashboard that's built right into the S3 console. And perhaps most importantly and best of all, you can get a handful of S3 Storage Lens metrics today for free.

Now, over the last couple of years, almost all of our large S3 customers have been using S3 Storage Lens to make decisions about the growth of their data and get deeper visibility into their storage.

Now, who here has asked the question: how much has my storage grown in the last 90 days? Unsurprisingly, there's a few hands. Now, that's a question that you can answer with S3 Storage Lens. But there's so much more. Before we dive into a few examples, I want to highlight two recent launches.

We announced that you can aggregate your S3 Storage Lens metrics using custom filters based on information like tags, object age, object size, and more. In other words, you can create your own groupings or aggregations of your S3 data and look at storage metrics specifically for those groupings.

For example, let's say you have 100 prefixes in a bucket, but those prefixes are owned by different departments in your organization. You can now use S3 Storage Lens to group those prefixes into separate units and view metrics specifically for them. Or you can see how much of the data in your data lake is using a specific file format like Parquet.

We also just announced that you can view all of your activity metrics, which are things like the number of PUT and GET requests and the volume of data uploaded and downloaded, all the way to the prefix level. Similarly, you can look at your status code metrics, like the number of 503 Slow Down errors, at the prefix level. So it makes S3 Storage Lens a great place to go if you want to troubleshoot performance issues.

Ok. So let's start by looking at your overall storage usage metrics. This is great to see, but it's a starting point because you're not likely going to be able to take action based on this information.

But how about how your storage is actually used? How is it accessed? Let's look at the retrieval rate. Now, the retrieval rate is the volume of data that is retrieved over the volume of data stored. So if you look at this image, you'll see that there's a bucket in the bottom right that has a large amount of storage and a pretty low retrieval rate. This makes it a pretty good candidate for further analysis and possible cost optimization.

Now, there's a very important thing to talk about when it comes to the retrieval rate of your data. Let me give you an example. Let's say you have 10 gigabytes of data and 10 gigabytes is accessed; that's going to be a retrieval rate of 100%. But let's say you have 10 objects in your bucket, and each object is one gigabyte. What if one of those objects is accessed 10 times? That is still a monthly retrieval rate of 100%, but now you have nine objects that are not being touched at all.

Now, this is actually a very important cost optimization opportunity and it speaks to the unknown access patterns of a lot of data. We do have a storage class that's really purpose-built for this type of access, and we'll talk about it in just a moment.

Now, there's a couple of things that you can do in S3 Storage Lens to get what I call quick cost optimization wins. The first is identifying incomplete multipart uploads.

So for a little bit of context, multipart uploads accelerate the uploading of your data to S3 by letting you split your objects into logical parts that can be uploaded in parallel. There's a really awesome performance benefit with multipart uploads. We'll talk about that in a moment.

Now, those parts are stored in S3 even if the upload is never completed. And what that means for you is that you end up paying for the parts that are stored. And so here you can see that the red line is the object count and the blue line is the number of incomplete multipart uploads that are older than seven days.

As a best practice, you should set an S3 lifecycle policy to abort multipart uploads that don't complete within a specified time. And so when a multipart upload is not completed within this time, S3 will abort the multipart upload and delete the parts that are associated with it. This is so easy to do in the console today. You can go into the console, navigate to a bucket, click on the Management tab and create a lifecycle policy. This gives you a really quick cost optimization win.
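
If you'd rather set that same rule programmatically, here's a minimal sketch with boto3. The bucket name is a placeholder, and the seven-day window matches the best practice above:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; the 7-day window matches the best practice above.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-mpu",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```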

Now, a very similar scenario is when you are storing many noncurrent versions of data. Again, for a little bit of context, S3 versioning retains multiple versions of your data that can be used to recover data if an object is accidentally deleted or is overwritten.

Now, S3 versioning can have storage cost implications if a large number of noncurrent versions have accumulated. And so what that means for you is that it's really important to consider the storage cost of this purposely redundant data.

The red line here shows you the object count and the blue line is the percentage of noncurrent version storage in bytes. These are just a few of the things that you can do in S3 Storage Lens today to find opportunities to optimize your cost.
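
If you decide old noncurrent versions can be cleaned up, a lifecycle rule can expire them after a retention window. A minimal boto3 sketch, where the 30-day window and the five retained versions are illustrative values you'd tune to your own recovery needs:

```python
import boto3

s3 = boto3.client("s3")

# Note: this call replaces the bucket's entire lifecycle configuration,
# so in practice include all of your rules in a single call.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": 30,          # illustrative retention window
                    "NewerNoncurrentVersions": 5,  # keep the 5 most recent noncurrent versions
                },
            }
        ]
    },
)
```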

Now, in addition to using S3 Storage Lens in the console, you can publish your S3 Storage Lens metrics to CloudWatch. Now, this route is really helpful if you already use CloudWatch today and you want to create a view to monitor your Storage Lens metrics alongside other application metrics using CloudWatch dashboards.

Now, there's a really cool thing that you can do when you're in CloudWatch today, which is set up notifications using CloudWatch alarms. For example, you can set up an alarm and be notified if there's an increase in the number of incomplete multipart uploads in a given account.
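
As a hedged sketch of that kind of alarm with boto3: the namespace below is the one Storage Lens publishes to, but the metric name, dimensions, threshold, and SNS topic are assumptions you should check against what your own Storage Lens configuration actually emits to CloudWatch.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumed metric and dimension names -- confirm against the metrics your
# Storage Lens configuration publishes before relying on this alarm.
cloudwatch.put_metric_alarm(
    AlarmName="incomplete-mpu-bytes-growing",
    Namespace="AWS/S3/Storage-Lens",
    MetricName="IncompleteMultipartUploadStorageBytes",  # assumption
    Dimensions=[
        {"Name": "configuration_id", "Value": "my-storage-lens-config"},  # placeholder
    ],
    Statistic="Average",
    Period=86400,              # daily datapoints
    EvaluationPeriods=1,
    Threshold=100 * 1024**3,   # illustrative: alarm above ~100 GB of incomplete parts
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:storage-alerts"],  # placeholder SNS topic
)
```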

So we talked a little bit about S3 Storage Lens, which is great if you want to go down to the prefix level. But what if you want to go to the object level? Well, if you want to go to the object level, you can use S3 Inventory which lets you report on your storage and list all of the objects for an entire bucket or prefix.

Now S3 Inventory provides a list of your objects and the associated metadata of those objects, which includes things like the object name, the size, replication status, encryption status, whether an object is current or not and more. One thing that many of our customers are doing is using S3 Inventory with Amazon Athena to derive insights using SQL queries. So Athena uses SQL expressions to analyze your data and is really commonly used for ad hoc data discovery.

For example, a customer may use Athena to further filter their inventory report and get a list of objects that are greater than a specified size and are noncurrent versions.
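
Here's a minimal sketch of that kind of query with boto3, assuming you've already created an external table (here called `my_bucket_inventory`) over your inventory reports; the table name, database, results bucket, and size threshold are assumptions about your setup, while the column names follow the standard inventory fields.

```python
import boto3

athena = boto3.client("athena")

# Find noncurrent versions larger than 100 MB (names and threshold are illustrative).
query = """
    SELECT key, version_id, size
    FROM my_bucket_inventory
    WHERE is_latest = false
      AND size > 100 * 1024 * 1024
    ORDER BY size DESC
    LIMIT 100
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "s3_inventory_db"},              # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder results bucket
)
print("Query execution id:", response["QueryExecutionId"])
```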

Let's talk about storage classes which are optimized for different data access patterns. Storage classes are how you store your data in Amazon S3. Now, before you choose a storage class, you do want to identify whether you have data with known or predictable access patterns or data with unknown changing or unpredictable access patterns.

Let's talk about data that has a known or a predictable access pattern, such as data that becomes infrequently accessed after a definitive period of time. Let's take, for example, user generated content like videos and photos that we share with our friends and family. Now, that content is going to be frequently accessed right after we upload it, but it becomes infrequently accessed after a few weeks or perhaps even after a few days. For use cases like these, our customers typically know when their data becomes infrequently accessed, and they can pinpoint the right time to move their data to a lower cost storage class that's designed for that specific type of access.

Now, as you can see here, the retrieval rate of a particular data set starts moderately high, but it drastically cools off over time. Now, this is actually a very real use case. And here specifically, we're looking at an online content sharing application that stores videos and photos. Now, this is a screenshot from S3 Storage Lens. The blue line shows you the total volume of data stored and the red line shows you the retrieval rate.

Now, often customers that have predictable access patterns observe that the percentage of data that is accessed each month is consistently low after their data becomes rarely accessed. Now, based on this insight, you would set an S3 lifecycle policy to move your data to a lower cost storage class.

Let me give you a very specific example of what I mean by a use case with a predictable access pattern. One common use case is medical imaging, where you have hospitals and imaging centers that retain petabytes of data for decades to meet regulatory requirements. Typically, healthcare providers expect that when an X-ray or a CT study is generated, it's going to be frequently accessed for a short period of time by the radiologist, the practicing physician and the technicians. But after a month or two, that X-ray or that CT study is rarely ever accessed again, perhaps for a patient visit or in the event that a radiologist needs to review a patient's prior history.

Now, if you have a use case with a known access pattern, as the next step, you want to choose a storage class that best aligns with your access pattern. Now, S3 has a number of storage classes to choose from. And in the following, we're going to talk about how you can match your access pattern to the right S3 storage class.

First, let's say that you're streaming content and you want to use S3 as the origin. That data is going to be frequently accessed by users all around the globe. In this case, you should use S3 Standard, which is really ideal for active data. It's the best choice if you access your data more than once a month.

Now, let's say you store and analyze images as part of your data lake. Those images may become infrequently accessed. In this case, you should use S3 Standard-Infrequent Access to save on storage costs compared to S3 Standard.

Now, S3 Standard-Infrequent Access and the storage classes that follow, so everything to the right here, are designed for less frequently accessed workloads where your cost of storing data decreases but the cost to access your data moderately increases. S3 Standard-Infrequent Access is the ideal storage class for data that is accessed about once every month or two.

Now, let's say you have medical records where that data must be immediately accessible but is really only accessed a few times a year. In this case, you should use S3 Glacier Instant Retrieval, which is the ideal storage class for data that is accessed once per quarter.

But what if you have archive data that does not require immediate access and is kept around for really long periods of time, like logs or compliance data? Well, you can use S3 Glacier Flexible Retrieval, which delivers additional cost savings over S3 Glacier Instant Retrieval, with retrieval times in minutes using expedited retrievals, or free bulk retrievals in 5 to 12 hours, making it the ideal storage class for backup or disaster recovery use cases.

Lastly, for long term archives, you can use S3 Glacier Deep Archive, which has the lowest cost storage in the cloud at roughly a dollar per terabyte per month and retrieval times of 12 to 48 hours.

Now, we've talked about the storage classes to use when you have a predictable access pattern. But I want to double click into our archival capabilities because they present an opportunity for you to optimize your cost.

Now, every industry has an archiving use case, from storing sports highlights, to news broadcasting content, to medical records and genomic sequence data. If you take a step back and you think about your business, it'd be no surprise that you too have an archival use case, whether it's to meet compliance requirements, store historical data for future analytics, or to store backups as part of your data protection strategy. We continue to make archival storage better for our customers, whether it's launching new storage classes or making improvements to performance.

Now, I know we're not at the performance section yet, but I do want to highlight that earlier this year, we made a big, big change to restore performance for the S3 Glacier Flexible Retrieval storage class, improving restore times by up to 85%. Now, this performance improvement automatically applies to all standard retrievals that are made through S3 Batch Operations, from restoring really large backups to pulling down a ton of data to train a machine learning model.

Now, let's look at this capability in action. Here, we're going to restore 250 objects that total roughly 25 gigs. Now, the blue line shows you the previous restore performance, where you can see that it took typically 3 to 5 hours to complete. With this improvement that we've made, shown in the orange line here, the same restore typically started in minutes. And in this particular case, the restore job completed in less than 30 minutes. It's a big improvement in restore performance. And we've made a ton of performance improvements to Glacier over the years to help customers like Ancestry pull back terabytes of data to unlock new business value.

Ancestry is doing a lot of really awesome machine learning work for handwriting recognition, training their models on handwritten documents, many of which are stored in the S3 Glacier storage classes. Now, Ancestry restored hundreds of terabytes of images from S3 Glacier Flexible Retrieval within hours instead of days to train their models. By being able to restore this amount of data so quickly, they're able to train against massive data sets in a very cost effective way.

So to recap, in choosing when to use which S3 storage class, there are really three factors that you want to consider. The first is the frequency of access: how often are you going to access your data? The second is the duration of storage: how long is your data going to live for? The third is your retrieval requirements. For example, S3 Glacier Instant Retrieval is really the ideal storage class for data that is accessed once per quarter. But why is that? It's because although the storage price is lower than S3 Standard-Infrequent Access, the cost to access your data is moderately higher, which means that there's a break even point where, if you're accessing data too frequently, it would make sense to keep that data either in S3 Standard or S3 Standard-Infrequent Access. Additionally, S3 Glacier Instant Retrieval has a minimum storage duration of 90 days. So let's say that you expect to upload a file and delete it within 90 days. We'd recommend that you keep that either in S3 Standard or S3 Standard-Infrequent Access.

Now, there's a very important angle to consider when it comes to the duration of storage and the average size of your objects. Not only is it important because storage classes have minimum storage durations, but because they have lifecycle transition costs, where you're paying a per request fee to move data from one storage class to another. And so if we look at this particular example, we're comparing the cost savings of moving from S3 Standard to S3 Glacier Deep Archive. And this shows you that if you move data from S3 Standard to S3 Glacier Deep Archive, you should really consider whether you will store your data for at least 20 months if your objects are smaller than about 128 kilobytes. Now, as your object size goes up, your break even period shrinks. So at one megabyte, you're going to start saving within a couple of months.
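
To make that break-even math concrete, here's a back-of-the-envelope calculation in Python. The storage prices and the per-request lifecycle transition charge are illustrative, rounded figures (roughly in line with us-east-1 list pricing), so treat the output as a sanity check rather than a quote:

```python
# Rough break-even estimate for transitioning one object from
# S3 Standard to S3 Glacier Deep Archive. Prices are illustrative,
# rounded figures -- check current pricing for real planning.
STANDARD_PER_GB_MONTH = 0.023
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099
TRANSITION_COST_PER_REQUEST = 0.05 / 1000  # lifecycle transition charge per object

def break_even_months(object_size_bytes: float) -> float:
    size_gb = object_size_bytes / 1024**3
    monthly_savings = size_gb * (STANDARD_PER_GB_MONTH - DEEP_ARCHIVE_PER_GB_MONTH)
    return TRANSITION_COST_PER_REQUEST / monthly_savings

for size in (128 * 1024, 1024**2, 16 * 1024**2):
    print(f"{size / 1024:.0f} KB object: ~{break_even_months(size):.1f} months to break even")
```

With these illustrative prices, a 128 KB object lands at roughly 19 months and a 1 MB object at roughly 2 months, which matches the shape of the chart described above.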

Ok. So let's say that you've identified that the vast majority of your data has a known or predictable access pattern, and you go down this route of managing the lifecycle of your data. To do this, you would use S3 Lifecycle policies, which are rules that you can set up to move data from one storage class to another after a given number of days. Now, these rules are based on the creation date of the object and can be filtered to apply to a bucket, a prefix, or a set of tagged objects.

Now, let's go back to that medical imaging example I gave earlier. When an image is generated, it's going to be frequently accessed for a short period of time by various individuals, but then it's rarely ever accessed again. In this scenario, you can lifecycle your data from S3 Standard, again, a storage class designed for active and frequently accessed data, to S3 Glacier Instant Retrieval after 90 days, and then further optimize your storage costs by moving that data down to S3 Glacier Deep Archive.
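
A minimal boto3 sketch of that kind of tiered lifecycle for the imaging example. The bucket name, prefix, and the 365-day Deep Archive cutoff are placeholders you'd adapt to your own access pattern:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="medical-imaging-archive",          # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-imaging-studies",
                "Status": "Enabled",
                "Filter": {"Prefix": "studies/"},  # placeholder prefix
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER_IR"},     # S3 Glacier Instant Retrieval
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # S3 Glacier Deep Archive (illustrative cutoff)
                ],
            }
        ]
    },
)
```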

Now, we've talked about which storage class to use when you have a known or predictable access pattern. But we know that's not the case for many workloads. In fact, the vast majority of data today has an unknown, changing or unpredictable access pattern. From the tables in your data lake to the massive volumes of data that you suddenly decide to pull down to fine tune a machine learning model.

Now, it's particularly challenging to predict when each individual object is going to be accessed. Let's go back to that example I gave earlier. Let's say you have 10 gigabytes of data and 10 gigabytes is accessed. That's a monthly retrieval rate of 100%. But what if there's 10 objects in that bucket? It's very, very unlikely that each of those objects is going to be accessed exactly once. What's likely to be the case is that you have a pocket of objects that are going to be accessed a few times, which means that you have another pile of data that's not being accessed at all. And when you have this kind of data access pattern, it's really challenging to know which object you should keep in which S3 storage class, because to drive down your overall storage cost, you want to optimize the way in which you store every object individually.

And so that's why we launched the S3 Intelligent-Tiering storage class, which gives you a solution to automatically optimize your storage cost when your access patterns change, at the very granular object level.

And for a small monthly monitoring and automation charge, S3 Intelligent-Tiering monitors the access patterns of your data and moves objects that have not been accessed to lower cost access tiers.

So when you put an object into S3 Intelligent-Tiering, it's going to start in the Frequent Access tier. After 30 consecutive days of no access, S3 Intelligent-Tiering will move the object from the Frequent Access tier to the Infrequent Access tier. And after an additional 60 days of no access, move that object into the Archive Instant Access tier.

Now objects that are stored in these three access tiers have similar performance to S3 Standard, which means that they're immediately accessible when needed.

Now, you can optionally activate our archive access tiers, which are really designed for asynchronous access. So for example, if you opt into the Deep Archive Access tier, S3 Intelligent-Tiering will automatically move objects that have not been accessed for 180 days into this tier, and you're storing your data at roughly a dollar per terabyte per month.
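
Opting into those archive tiers is a bucket-level Intelligent-Tiering configuration. Here's a sketch with boto3, where the bucket name, configuration id, and prefix filter are placeholders; 90 and 180 days are the minimum thresholds for the two archive tiers.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_intelligent_tiering_configuration(
    Bucket="my-example-bucket",                # placeholder
    Id="archive-cold-data",
    IntelligentTieringConfiguration={
        "Id": "archive-cold-data",
        "Status": "Enabled",
        "Filter": {"Prefix": "experiments/"},  # placeholder prefix
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```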

Now, any time an object in these asynchronous access tiers is accessed, the object is going to move back into the Frequent Access tier in as little as a few hours. And there's a really cool thing that you can do if you're opting into these asynchronous access tiers, using S3 Event Notifications. For example, you can set up and configure S3 Event Notifications to track when objects move into the Deep Archive Access tier, which means that your applications know when data may not be immediately accessible and needs to be restored, which can take several hours.
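
As a hedged sketch of that notification setup with boto3: the SNS topic ARN and bucket name are placeholders, and I'm assuming the `s3:IntelligentTiering` event type is the one you want to subscribe to; note that this call replaces the bucket's existing notification configuration, so include any other notifications you rely on.

```python
import boto3

s3 = boto3.client("s3")

# Note: this replaces the bucket's whole notification configuration.
s3.put_bucket_notification_configuration(
    Bucket="my-example-bucket",  # placeholder
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "Id": "notify-on-archive-tiering",
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:tiering-events",  # placeholder
                "Events": ["s3:IntelligentTiering"],  # assumed event type for archive-tier moves
            }
        ]
    },
)
```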

Now, there's a lot of customers who are using S3 Intelligent-Tiering today to automatically optimize their storage. One very specific example is Illumina. Illumina is making rapid advancements in life science research to improve human health.

Now, one of the challenges that Illumina faced was storing petabytes of genetic data at a very low cost. They moved roughly 50 petabytes of data into S3 Intelligent-Tiering and saw a 60% reduction in data storage costs without impacting their performance.

Most importantly, they could scale elastically and stay really focused on their core business.

Now, we've talked about the savings of a single customer using S3 Intelligent-Tiering. But I want to take a step back and talk about the impact that S3 Intelligent-Tiering has had on all S3 customers.

Since the launch of S3 Intelligent-Tiering, customers have saved $2 billion by adopting S3 Intelligent-Tiering compared to S3 Standard.

Now we get really, really excited about this number because customers tell us that these savings let them focus on innovation rather than being hands on managers of storage infrastructure.

And in just a little bit, you're gonna hear from Anthropic about how they use S3 Intelligent-Tiering.

Now let's switch gears and talk a little bit about performance. Now, optimizing for cost is clearly important but equally important is optimizing for your performance.

If you ask me what my favorite feature of S3 would be, it has to be its elasticity. With Amazon S3, you can scale to tens of millions of requests per second. And it's this kind of capability that allows customers like Netflix to deliver billions of hours of content or for FINRA to be able to analyze 37 billion trading records every day.

So we're going to take a look at how customers like these think about designing for scale and performance with S3.

S3 is a large distributed system and your customer code, your code, is a critical part of it. And so we want you to think about three different aspects when designing for performance with S3.

The first is making sure that your key space is arranged so that we can give you the maximum request rate.

Next, we want to work with you to minimize your request latency, to ensure that your applications get that first useful byte as quickly as possible.

And finally, we want you to achieve the maximum throughput getting as many bytes into and out of S3 as you can as quickly as your application needs it.

Now, like I said, Amazon S3 automatically scales to high request rates such that every prefix in S3 can achieve 3500 TPS of puts or 5500 TPS of gets.

Now, I appreciate that it's a bit cliché but to give this a frame of reference at approximately 5000 gets per second, you could download each of the 31 million books in the Library of Congress 14 times every day.

And this means that the majority of S3 customers don't need to do anything special in this regard at all. But as we talked about earlier, not all workloads are predictable and they grow over time and you may need to grow or burst above these numbers.

So let's talk a little bit about how you can help S3 scale to meet those needs.

As an example, suppose we have a large fleet of autonomous vehicles and every evening, those vehicles want to upload the data about the drives that they had that day into S3.

A reasonable first cut at the key structure for this application looks like this one. We have a parent folder for containing all of the daily uploads. And because we're uploading on a daily cadence, we have our date followed by the car's identifier. So we can easily identify data from a specific car. And finally, we have one or more keys representing the drive data itself. This is things like telemetry from the sensors or imagery from the cameras.

Now 6:30 in the evening rolls around and all of the folks driving our cars are arriving home and those cars want to start uploading their drive data.

And as we sell more and more of these cars, we're going to see more and more TPS against this shared prefix, which in today's case will be the date.

Now, as the fleet grows, these vehicles may start to see some 503 Slow Down errors come back from S3. But S3 will learn the traffic patterns and notice that there's a place to split the incoming traffic again; in our case, it will be here on the cars' identifiers.

And by the end of the evening, each car or group of cars will have dedicated allotments of that 3500 TPS of puts.

Now, fortunately for us, but maybe not so much for our system, tomorrow arrives. And S3 has to learn this pattern all over again. But there's a relatively simple mechanism to avoid this.

If we simply swap the date and the car's identifier, then S3 will learn this pattern once, and eventually each group of cars or each car will have its own top level prefix, able to take advantage of that 3500 TPS of puts for uploading its data as quickly as possible.
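
Here's a tiny Python sketch of that key-naming change; the folder names and formats are made up for the example, but the idea is simply to put the high-cardinality identifier (the car) ahead of the shared date component.

```python
import datetime

def drive_key_v1(car_id: str, filename: str, day: datetime.date) -> str:
    # Original layout: every car shares the same date prefix each evening,
    # so all of the traffic concentrates on one prefix.
    return f"daily-uploads/{day.isoformat()}/{car_id}/{filename}"

def drive_key_v2(car_id: str, filename: str, day: datetime.date) -> str:
    # Swapped layout: each car (or group of cars) gets its own top-level prefix,
    # so S3 only has to learn the partitioning once.
    return f"daily-uploads/{car_id}/{day.isoformat()}/{filename}"

today = datetime.date(2023, 11, 27)
print(drive_key_v1("car-0421", "telemetry.bin", today))
print(drive_key_v2("car-0421", "telemetry.bin", today))
```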

But as we said before, workloads change and they grow and they can be unpredictable. And when your applications exceed the numbers that we talked about, S3 will respond with those 503 slowdown errors.

And this indicates that there may be areas of your key space that need your attention. We recommend that you use S3 request metrics in CloudWatch to monitor and alarm on the number of 503 responses that you receive.

And when you see these, you can use the advanced metrics in S3 Storage Lens to identify very specifically which prefixes or areas of your key space are generating those 503s.

And finally, it can often be difficult to identify exactly which application or user has had this change in traffic. And there you can use S3 server access logs to identify the specific application or user, right?
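
A sketch of that kind of alarm on S3 request metrics with boto3, assuming you've already enabled request metrics on the bucket with a metrics filter; the bucket name, filter id, threshold, and SNS topic are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="s3-503-slowdowns",
    Namespace="AWS/S3",
    MetricName="5xxErrors",        # requires request metrics enabled on the bucket
    Dimensions=[
        {"Name": "BucketName", "Value": "my-example-bucket"},  # placeholder
        {"Name": "FilterId", "Value": "EntireBucket"},         # placeholder metrics filter
    ],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=100,                 # illustrative: sustained 5xx responses per minute
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:storage-alerts"],  # placeholder
)
```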

Let's talk a little bit about minimizing your request latencies.

Now, S3's customers use us from a wide variety of locations anywhere from an EC2 machine, potentially meters away from the data in S3 to mobile devices, accessing the internet over 3G connections to literal spacecraft, beaming their telemetry and imagery from Mars.

And to accommodate that wide variety of use cases, S3's SDKs ship with relatively conservative settings for how much time they are willing to wait to establish a connection or for a response to a request for data.

And the most impactful thing that you can do to influence or reduce your time to first useful byte is to tune those client timeouts to match the environments that your applications are running in.

You can use the same tools that we just talked about, like request metrics and Storage Lens, to identify the expected latency for your application in its production environment. And you use these expectations, like the 95th percentile, to set your client timeouts so that your slowest requests are canceled and immediately retried.

That retry is very likely to take a different path through the network and even within S3 and succeed with much better latencies and throughput.
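
Here's what that tuning can look like in Python with boto3; the timeout and retry values are illustrative and should really come from your own observed latency percentiles, and the bucket and key are placeholders.

```python
import boto3
from botocore.config import Config

# Illustrative values: derive real timeouts from your observed p95 latencies.
tuned = Config(
    connect_timeout=1,    # seconds to establish a connection
    read_timeout=3,       # seconds to wait for response data
    retries={
        "max_attempts": 5,
        "mode": "adaptive",   # retries with backoff; "standard" also works
    },
)

s3 = boto3.client("s3", config=tuned)
obj = s3.get_object(Bucket="my-example-bucket", Key="data/part-0001")  # placeholders
payload = obj["Body"].read()
```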

The key to maximizing your throughput with S3 is parallelization. We would like to see your applications opening many, many connections to S3 and making multiple parallel requests for objects across those connections.

There's one element of parallelization that I want to dig into a little bit in detail.

Now it's somewhat obvious. But as the size of an object grows, the time required to download that object in a single request grows right along with it.

So to achieve the best aggregate throughput, we recommend instead making multiple requests even within the same object.

So that means when retrieving data from S3, use byte range fetches and when you're uploading objects to S3, use the multipart upload feature.
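
Here's a minimal sketch of both techniques in Python with boto3: parallel ranged GETs for downloads, and the transfer manager's multipart settings for uploads. The bucket, keys, file name, and sizing values are placeholders.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
BUCKET, KEY = "my-example-bucket", "datasets/shard-0001.bin"  # placeholders
PART = 8 * 1024 * 1024  # 8 MiB ranges (illustrative)

def fetch_range(start: int, end: int) -> bytes:
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# Download: issue ranged GETs in parallel and reassemble them in order.
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(i, min(i + PART, size) - 1) for i in range(0, size, PART)]
with ThreadPoolExecutor(max_workers=16) as pool:
    data = b"".join(pool.map(lambda r: fetch_range(*r), ranges))

# Upload: let the transfer manager split large files into parallel parts.
config = TransferConfig(multipart_threshold=PART, multipart_chunksize=PART, max_concurrency=16)
s3.upload_file("local-shard.bin", BUCKET, "uploads/shard-0001.bin", Config=config)
```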

So let's recap real quickly. The considerations for optimizing the best latency and throughput from S3:

The first is to ensure that you are parallelizing your requests for objects, including within individual objects.

And when you're doing this, you want to ensure that you're using multiple connections and reusing them in order to amortize the cost of their setup.

Importantly, you want to make no more than about 5 to 10 connections to any one of the IP addresses you receive from S3. And this means you want a wide variety of S3 IP addresses and we recently launched support for multi-valued DNS to make it easier to get multiple IP addresses in a single DNS resolution.

Now, once you've established those connections, you want to monitor the performance in terms of latency and throughput that you're seeing on them and tear them down if they're not performing well.

Then tune your client timeouts to fit your application's network environments, canceling those requests that are falling in your 90th or 95th percentile and retrying them.

And of course, when retrying, ensure that you're using exponential back off and introducing jitter to improve the likelihood of their success.

And finally, it was a bit out of scope for this talk, but when designing for maximum throughput in S3, we see that things like cross-NUMA-domain memory accesses can significantly impact your throughput.

And in that case, you'll want to ensure that your threads are pinned or bound to specific NUMA domains for the entire lifetime of the request and connection.

But frankly, I would rather spend my time building my application. And thankfully, we've built all of the best practices that we've just talked about and more into the AWS Common Runtime or CRT.

The CRT is a set of native libraries that optionally underpin many of the AWS SDKs, client connectors, and tools, and it provides all of these throughput and latency improvements out of the box.

Let's take a moment to look at the impact that the CRT can have. In this example, we're using the AWS CLI to download 101 gigabyte files to an EC2 instance in the same region. Using the default CLI with no tuning, it takes about 8.5 minutes to complete.

Next, we're going to update the CLI's configuration and tell it to use the CRT as the preferred transfer client. Now, running that command a second time, we see that the CLI is able to complete that same data operation in just under four minutes. That's more than a 2x performance improvement for a single line of configuration.

In fact, we're so pleased with the performance and the stability of the CRT that I'm happy to announce it is now the default transfer client in the CLI and the SDKs and libraries on Amazon EC2 Trainium, P4d and P5 instance types.

So in addition to the SDK and the CLI, where else can we find the CRT being used?

Well, a common theme that we've heard from customers is that they are using software or applications that were built to expect a file-like API and they can't easily make changes to these applications to extend it to make use of S3's REST APIs. And these customers wanted a way to give those applications the same level of stability, performance and technical support that they've come to expect from S3 and the AWS SDKs.

So earlier this year, we delivered Mount Point for Amazon S3, which is a new open source file client, making it easy for Linux-based applications to connect directly to Amazon S3 buckets and access objects using file-like APIs.

So Mount Point presents your S3 objects as files in the local file system. And it translates file system calls like fopen into an appropriate and efficient use of S3's REST APIs.
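
The point of the file-like API is that, once the bucket is mounted (say at a hypothetical /mnt/my-bucket), your application just uses ordinary file calls and Mount Point turns them into S3 requests behind the scenes. A tiny illustrative Python sketch; the mount path and directory names are made up:

```python
# Assumes the bucket has already been mounted at /mnt/my-bucket with Mount Point;
# the path and directory names are hypothetical.
import os

mount = "/mnt/my-bucket"

# Ordinary directory listing and file reads -- no S3 SDK calls in the application code.
for name in sorted(os.listdir(os.path.join(mount, "training-data")))[:5]:
    path = os.path.join(mount, "training-data", name)
    with open(path, "rb") as f:
        chunk = f.read(1024 * 1024)  # read the first 1 MiB of each object
        print(name, len(chunk), "bytes read")
```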

With Mount Point for Amazon S3, you can achieve high single instance throughput to finish your jobs faster and you can reliably scale up and down over thousands of instances.

Just last week, we also announced that you can now cache your S3 objects on instance storage, instance memory or an EBS volume to reduce the cost of repeated data access and further accelerate your workloads.

This feature is really perfect for workloads that repeatedly access the same data in S3. And like everything else with Mount Point, we really wanted it to be plug and play: the only thing you need to do to take advantage of it is, when mounting your bucket, specify a directory for us to save objects into, and Mount Point takes care of the rest.

So like I said, this feature is really meant for applications that have repeated data access to the same sets of data, much like we see in ML workloads like image recognition where you are going through multiple epochs of training where each epoch is a pass through the entire data set.

In this example, we can see that having to reread individual files multiple times limits our training throughput. But after enabling the caching feature, we see that we're not downloading data from S3 after that initial access and instead loading data directly from local storage leading to a 2.5x performance improvement in training throughput.

Mount Point for S3 is entirely open source and it's available for production use today. You can scan this QR code to be taken to the GitHub repo where you can learn more about Mount Point for S3 including finding instructions for installing it and using it from your own EC2 instances.

There are many customers who have implemented S3's cost and performance best practices to scale to millions of requests and terabits of throughput per second and to do so at the lowest cost.

One company that knows how to do this very well is Anthropic. Anthropic stores over 200 petabytes of data in S3 and regularly sustains data transfer rates in excess of 800 gigabytes per second during their training jobs.

Now 800 gigabytes per second sounds fast and it is - that's enough bandwidth to support streaming HD video content to a population four times the size of Las Vegas.

At the same time, here to talk about how Anthropic uses S3 at the best possible performance and price is Nova De Sarma.

Hi, I'm Nova. I'm the systems lead at Anthropic. You may have heard of us. We're an AI safety and research company out of San Francisco, California. We were founded in 2021 and we build things like Claude, which you may have heard of.

Our mission is to ensure that AI helps people and society flourish, building these frontier systems and working to responsibly deploy them and regularly sharing our safety insights with the community for the benefit of everyone.

So, talking a little bit about our approach here and what makes us different: we take an empirical approach to AI safety. We run continuous experiments and focus our research on interpretability, training our models to align with human values, and the societal impacts of deploying that kind of technology, discovering things like constitutional AI, which allows us to reduce the impact on human workers who might be in that pipeline as well as speed up iteration for developing new models.

And we use a whole bunch of S3 in this process. As I mentioned before, we have stored over 200 petabytes of data and artifacts for our training and research inside of S3 today across several different buckets.

We use S3 in a bunch of different ways. Here are a couple of things that we like about S3:

It's the elasticity - so no more storage warehouses and things like that. We use S3 to scale elastically with our needs, with no prior storage optimization required.

Intelligent-Tiering allows us to balance our performance and cost efficiency. So when we have experiments that we don't know whether they're going to be valuable or not, we can drop them into S3. And if it turns out that we don't need them for a while, they'll sort of just go away and no longer contribute to our bill, but they'll be there if we need them.

And the Deep Archive Access tier in Intelligent-Tiering means that we don't worry about that kind of decision.

And obviously, S3 is the best way to help us scale. And to dial in on that storage elasticity, we talked about that 800 gigabytes per second as a sustained read speed. We rely on that kind of speed to be able to load data quickly as well as load checkpoints quickly.

For checkpoint load speed, we see spikes up to two terabytes per second of data out of S3 into our accelerated workloads. And what that translates to in terms of efficiency is that we can utilize our accelerators more of the time, because we're spending less time loading that data onto the accelerators. And that translates to huge amounts of cost savings and speedups in terms of how long that machine learning workload is going to sit on those accelerators.

We've used a lot of S3 in the last year. These numbers are directly from our metrics. Our storage needs have grown over 6x and S3 has grown with us. As you can see from the graph here, a large amount of our data is inside of Intelligent-Tiering, almost all of it, and much of it is inside of these archive access tiers because a lot of it isn't going to be needed again.

It's hard to predict science and it's hard to predict what we're gonna need next. That brown line there is showing the growth of our Deep Archive access tier. And you can see that something like 40% of our data is in that at any given time, which translates to a fairly large cost savings.

As Eli, one of the folks on my team likes to say, this brings our storage bill down to something that we don't need to think about anymore. And that's pretty important to us because it's a pretty small team building some pretty big models.

And I'm of course sure people are interested in how we do it. I ran out of time to build these slides, so I threw our Terraform into Claude and this is what we came up with: data flows from S3 and back again several times in the course of training a large language model.

We take this raw data and process it on Spot EC2 instances that are provisioned by Karpenter. Those Spot EC2 instances then feed right back into S3 in the form of tokens, taking text and images and other kinds of data and converting them into numbers that the large language models are gonna understand.

That goes into that S3 bucket across multiple objects that are spread over multiple prefixes to optimize throughput. Then it flows right back again into accelerated EC2 instances like Trainium or P4ds or P5s for training the actual model.

When we do that, we use asynchronous queues and try and buffer as much of that data as possible so we can stay in a throughput bound regime, which is where I think S3 really shines compared to alternative solutions.

And then we dump that data, the language model itself, back off the accelerators once we have it. One thing you might not know about these runs is that they crash fairly often. We see crashes multiple times a day. And anytime we crash, we need to restore from some kind of a checkpoint, some kind of partial progress on the model. And we do that using S3.

So we store our partial progress into S3 in these checkpoints and then flow back into those EC2 instances in a restore operation anytime we need it.

And here are a couple of tips and tricks that we have about how we use S3. Some of these might be familiar to you from earlier slides:

  • We like to optimize our object size to get better throughput.
  • We use ranged requests to be able to get partial segments of objects for tokens. This means that you don't have to shuffle your data around between multiple objects. You can just sort of shuffle your index there.
  • We also try and scale across multiple prefixes. We talked about the per-prefix limit for GETs, and that's pretty high for a lot of things, but to get to that 800 gigabytes per second of sustained performance, we need to utilize multiple prefixes, and we make sure that we use UUIDs early in our path to be able to shard across those multiple prefixes on the back end (see the sketch after this list).
  • The CRT is your friend. It's definitely gonna help you influence a lot of these best practices without having to think about it very much.
  • And yeah, think async - make sure that, if you can, you architect your application so your state and your application can scale separately. If you have a queue for that data, then it doesn't matter whether there's a latency miss on any particular part of it, because you've got stuff buffered to keep your application flowing as fast as you can.
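
Here's a small sketch of those two data-layout ideas, UUID-style prefix sharding and ranged reads of token shards. It's not Anthropic's actual code, just an illustration in Python with boto3, and the bucket, key format, and sizes are all made up.

```python
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "training-data-bucket"  # placeholder

def shard_key(dataset: str, shard_index: int) -> str:
    # Put a random component early in the key so writes and reads
    # spread across many prefixes instead of piling onto one.
    # In practice you'd record the generated keys in an index.
    return f"{uuid.uuid4().hex[:8]}/{dataset}/shard-{shard_index:06d}.tokens"

def read_token_range(key: str, offset: int, length: int) -> bytes:
    # Ranged GET: pull just the slice of tokens a training step needs,
    # so shuffling can happen in the index instead of rewriting objects.
    resp = s3.get_object(
        Bucket=BUCKET, Key=key, Range=f"bytes={offset}-{offset + length - 1}"
    )
    return resp["Body"].read()

key = shard_key("webtext", 42)
print("example key:", key)
# tokens = read_token_range(key, offset=0, length=4 * 1024 * 1024)
```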

And I'm gonna hand it back off to close us out. Thank you.

Thank you, Nova.

Now, as we wrap up today, let's put it all together with a few of the immediate steps that you should take to optimize how you store and how you access your S3 data:

  • Start to explore S3 Storage Lens because you get a number of S3 Storage Lens metrics today for free and find opportunities to optimize your costs immediately.
  • I talked about incomplete multipart uploads and a little bit about S3 versioning earlier; set lifecycle policies to clean those up. And use S3 Intelligent-Tiering to automatically optimize your storage cost when you have unknown, changing or unpredictable access patterns, and do it at the granular object level.
  • Optimize your storage performance by putting in place the right key naming strategy and turn on the CRT for automatic request parallelization.
  • And once you've done all of this, put your data to work with services like Amazon Athena, Redshift, SageMaker and more.

Thank you.
