Maximize the value of cold data with Amazon S3 Glacier

All right. Hello everybody. Welcome. We're so glad you're here. We're here today to share with you our perspective on how customers across industries are changing their archive strategy, how they're doing more with their cold data, and how they're deriving more value from it.

My name is Ruhi Sooth and I'm a product manager at Amazon S3 Glacier. I'm joined by my colleague.

N: Hi. My name is Nish Pande. I'm a product manager on the Amazon S3 and Glacier team, and I'll be back on the stage in a while to talk about how to maximize the value of your cold data. So see you shortly. Thanks.

All right. So for our presentation today, we have divided our presentation into three main pieces. First, we'll talk about why it is more important than ever before to have a cold data management strategy. Next, we'll share some strategies and tips on how you can optimize your storage costs using Amazon S3 Glacier. And finally, we'll talk to you about how you can do more with your cold data, how you can put it to work. So let's get started.

First, let's discuss why we care so much about this cold data. Why are we here? Why do we care? Maybe a quick show of hands for me here: how many of us here are dealing with data coming from all of these sources? Wow, that's a lot of hands, some nods, some smiles. Thank you. That's a pretty spectacular reminder of how most of us are dealing with more storage coming from more sources, and have to think about how we optimize all of that storage management.

And if you're already dealing with storage, this is not going to come as a surprise to you. A lot of this data gets cold very, very quickly. In fact, IDC suggests that 60 to 80% of the world's data is cold, meaning it's infrequently accessed and stored over long periods of time. And most customers don't store this data for its immediate value, they store it for its future potential. And what does that mean, when I say customers are storing cold data for its future potential?

Let me walk you through some use cases. The most common use case is preservation of raw data. For example, in the media and entertainment industry, customers store high quality, high resolution original footage from, let's say, a movie shoot. So weeks, months, or even years later, if the creative staff decides that they want to create promotional material, a teaser, a trailer, they can go back to that original content.

Another popular use case is backups. The good old backups, you have a server, a database, you're backing it up. So in case the disk goes offline, the data center goes offline, you have access to the critical information you need. And finally, some industries like health care and finance require you to store patient data or end user data for multiple years. In some cases, it could be as long as 7 to 10 years. And this is why customers would sometimes store data in archive for compliance reasons.

One area where we are seeing a change happen today is preservation in the age of generative AI, machine learning, and advanced analytics, where data is the differentiator. Customers are doing more with their cold data and they are storing more cold data. Specifically, customers are storing original, high quality, sometimes even proprietary data for their future machine learning use cases.

For example, in the autonomous vehicle space, you could be getting hundreds of images every day from each car. You are not going to use every single image immediately for machine learning. Oftentimes customers will score an image, identify what's relevant for immediate machine learning, keep that in hot storage, and move all the rest of the data to archive storage. Now, if a use case comes up, like in this case, if you're trying to retrain your car on how to take a left turn, you can go back, identify the relevant information, and retrain your model.

We're seeing several other use cases where customers are using cold historical data to identify seasonal shopping trends, create more personalized products and in some cases even to identify fraudulent activities. And this is why we recommend that it's important more than ever before to store your raw original proprietary data rather than delete it.

And the first question that comes up when you're dealing with petabyte scale data is cost. Customers are looking for storage that allows them to minimize their business costs. And at Amazon S3, we offer a variety of storage classes that allow you to meet your use case and cost needs. So let's dive a little bit deeper into how Amazon S3 storage classes can help you with your storage optimization.

So what you see on the screen here is some of our most popular storage classes. You could keep your most frequently accessed data in a hot storage class like Amazon S3 Standard, or you could tier it all the way down to a cold storage class, or the coldest storage class here, Amazon S3 Glacier Deep Archive. You could use Intelligent Tiering to tier data automatically between tiers. Over the next few slides, I'm going to dive deeper into the Glacier storage classes, which are highlighted on the screen here.

First, we'll talk about Amazon S3 Glacier Instant Retrieval. This storage class is purposely designed for millisecond access. If you have a performance sensitive workload that needs immediate access, this is the right storage class. The things that you want to focus on here are millisecond retrievals and accessing or retrieving a small percentage of your data throughout the year.

Can you think of an example where you have petabyte-scale, rarely accessed data, but if you ever need access to it, you need immediate access? I can share a couple here. Let's say you are a news broadcaster. Now, when you're broadcasting news, there is a lot of advantage to being first on the air. You want more likes, you want more clicks, being first is going to help.

Now you just learned that Tyler Lockett of the Seattle Seahawks has a hamstring injury and he's questionable for his next Seahawks game. You want to get the story on air, but you want some really cool sporting images of Tyler, maybe of his best receptions, maybe his best touchdowns. Glacier Instant Retrieval will be your friend. You can get quick access to those images, create your story, and hit post.

Another example where we are seeing a good use case for Glacier Instant Retrieval is the medical industry. Let's say you are a medical or health care professional, and you store patient data, X-rays, MRIs, all sorts of patient data in an archive. You're not going to be accessing this data every day. You're not going to go back and look at all of your patient data every day. But if a patient walks in and you want to make a critical medical decision, you need immediate access to this data. Again, this is where Glacier Instant Retrieval will give you instant millisecond access so you can make that critical medical decision.

The next storage class that we're going to discuss is Amazon S3 Glacier Flexible Retrieval. Now, this storage class is designed for workloads which need the flexibility to retrieve large data sets multiple times a year at low cost. It gives you the flexibility to retrieve data within minutes to hours. The things that you want to focus on for this storage class are retrieval within hours and retrieving large amounts of data at low cost.

One example that I can share with you is, let's say you're a financial analyst, and every quarter you generate a report, and you need data from the entire quarter to produce that report. So you could schedule a retrieval in Amazon S3 Glacier Flexible Retrieval which says give me all the data for the entire quarter, and create that report. You could use Glacier Instant Retrieval in this case, but remember, as your number of retrievals goes up, your retrieval costs will go up too.

So if you're retrieving large amounts of data throughout the year, your retrieval costs will go up too. Whereas with Amazon S3 Glacier Flexible Retrieval, you can retrieve large amounts of data at really low cost, in fact, even for free, and we'll touch upon that in some time.

Another popular use case here is trend analysis. So you're doing a one-time analysis or you need large amounts of data back. You could use Amazon S3 Glacier Flexible Retrieval and retrieve that data at really low cost.

And finally, we have Amazon S3 Glacier Deep Archive. This is our lowest cost storage. In fact, this is the lowest cost storage in the cloud. It is designed for rarely accessed data where you are OK with the data becoming available in 12 to 48 hours. A common example is the compliance use case we discussed: you're storing large amounts of data, you're only storing it for compliance reasons, and you can store it at low cost in Amazon S3 Glacier Deep Archive.

The same goes for raw, proprietary data. If you want to store your raw data for machine learning use cases, or maybe even organizational generative AI use cases, you can store it here at really low cost.

Now that we have discussed the three storage classes and I've given you some examples, you might ask me: but what is the right storage class for me? What is the right storage class for my workload? I'm going to leave you with three factors that you can use to decide on the right storage class for your workload.

The first thing that you want to think about is the retrieval speed. The second thing, very intuitive, is the storage cost. And the third thing that you want to think about is the data retention. Let's dive deeper into retrieval speed first.

Now, the choice of an archive storage class depends first and foremost on the retrieval sensitivity of your application. So if you have a performance sensitive use case, like the health care use case or the news media broadcasting use case we discussed, and you want millisecond access, then Glacier Instant Retrieval is the right storage class for your workload.

But if your application is designed to handle minutes to hours of retrieval time, then you can choose between Amazon S3 Glacier Flexible Retrieval and Amazon S3 Glacier Deep Archive, depending on whether you want same-day access or next-day access. With Amazon S3 Glacier Flexible Retrieval, you can get your data within minutes to 12 hours. And with Amazon S3 Glacier Deep Archive, you can get your data within 12 to 48 hours.

So that's our first factor, retrieval speed, the retrieval sensitivity of your application. The next thing that you want to consider is storage cost. We offer storage costs ranging from $4 per terabyte-month with Amazon S3 Glacier Instant Retrieval to as low as $1 per terabyte-month with Amazon S3 Glacier Deep Archive.

And the final factor, the third factor that you want to consider, is how long you are going to retain your data for, the data retention. So if you have a compliance use case where you're going to store data for more than 180 days, Glacier Deep Archive is perfect. Similarly, for Glacier Instant Retrieval and Glacier Flexible Retrieval, you need to store your data for at least 90 days; anything less than that and you will be charged an early delete fee.

So that is why it's important to consider your data retention periods as you're choosing your storage class. So we've discussed the three factors, you choose your storage class based on retrieval sensitivity, storage costs and the data retention periods.

So now we've learned about the three Glacier storage classes, and we've learned about the factors you can use to determine which storage class is right for you. But how do you get your data there? How do you move your data into these storage classes?

If you know from the get-go that your data is cold, you can just use our APIs: apply the three factors that we discussed and move your data directly into the right storage class. But most data does not start out cold. It starts out hot, it's going to get accessed a bunch of times, and over a period of time it will cool down. For this kind of data, we offer a number of capabilities that you can use to tier your data into the archive storage classes.
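
For the "known cold from day one" case, a single PUT with the right storage class is enough. Here is a minimal sketch in Python with boto3; the bucket name, key, and file are hypothetical, and DEEP_ARCHIVE could just as well be GLACIER_IR (Instant Retrieval) or GLACIER (Flexible Retrieval) depending on the three factors above.

```python
import boto3

s3 = boto3.client("s3")

# Write known-cold data straight into an archive storage class.
with open("shoot-2023-10-01.mxf", "rb") as body:           # hypothetical file
    s3.put_object(
        Bucket="my-archive-bucket",                         # hypothetical bucket
        Key="raw-footage/shoot-2023-10-01.mxf",             # hypothetical key
        Body=body,
        StorageClass="DEEP_ARCHIVE",  # or "GLACIER_IR" / "GLACIER"
    )
```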

The ideal way to identify cold data is using Amazon S3 Intelligent Tiering. Customers have told us that sometimes their data access patterns are unknown, just flat out unknown, and this is why we built Amazon S3 Intelligent Tiering.

Amazon S3 Intelligent Tiering automatically tiers data between different storage tiers based on access patterns. With a small monitoring fee, you can tier your data without any performance impact, without any transition costs, and without any retrieval fee. It offers three synchronous tiers and two asynchronous tiers. Let's see how this works.

So let's say you put your data, an object, in Amazon S3 Intelligent Tiering. It's going to go into the Frequent Access tier. After 30 days, if the data has not been accessed, if your application has not touched that object, it is going to tier down into the Infrequent Access tier. Now your storage costs are beginning to drop because you're paying for the Infrequent Access tier.

After 90 days, if your object has still not been accessed, it will tier down into the Archive Instant Access tier. So now again you are optimizing your storage costs: the object has not been touched, and Intelligent Tiering has moved it into the Archive Instant Access tier.

Let's say your application comes back and retrieves the object at this point. It accesses the object in the Archive Instant Access tier. The application can access the object with no performance impact, but the object moves back into the Frequent Access tier, it becomes hot again, and the whole process will repeat itself.

You can also enable the two asynchronous archive tiers, Archive Access tier and the Deep Archive Access tier. The minimum duration for the Archive Access tier is 90 days and the minimum duration for the Deep Archive Access tier is 180 days. And you can change these minimum durations to a value more than the minimum.
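
As a rough illustration, here is how opting in to the two asynchronous archive tiers could look with boto3; the bucket name and configuration ID are placeholders, and the 90/180-day values are the minimums mentioned above (you can raise them, not lower them).

```python
import boto3

s3 = boto3.client("s3")

# Opt one bucket in to the optional asynchronous archive tiers.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="my-archive-bucket",   # hypothetical bucket
    Id="archive-tiers",           # hypothetical configuration name
    IntelligentTieringConfiguration={
        "Id": "archive-tiers",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90,  "AccessTier": "ARCHIVE_ACCESS"},       # 90-day minimum
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},  # 180-day minimum
        ],
    },
)
```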

Now, this is how Amazon S3 Intelligent Tiering can help you optimize your costs by tiering the data between these different storage tiers. And in fact, our customers have saved more than $1 billion in storage cost savings since Intelligent Tiering was launched.

Now, Intelligent Tiering is a great option if you do not know your access patterns. But what if you do know your access patterns? You know your data is going to get cold after 90 days, or it's rarely going to get accessed after a certain period of time. In that case, Amazon S3 Lifecycle is a great option, and you can move your data into the colder tiers depending on the age of the object.

Let's see how this works. So you create an object, you put it in Amazon S3 Standard. You could add a lifecycle policy here which says after 90 days, move my object into Amazon S3 Glacier Instant Retrieval. So you start saving on your storage costs at this point, after 90 days from the date of creation, the lifecycle policy will move your object into Amazon S3 Glacier Instant Retrieval.

If you want to further deepen your savings, you could add a rule to the lifecycle policy which says move my object after 180 days to Amazon S3 Glacier Deep Archive. Remember the 180 days is from the date of creation here. So now you've moved your object into Amazon S3 Glacier Deep Archive.

Let's go back to our compliance use case. You need the object for seven years. Don't need it any longer. You could add an expiration rule that says delete this object in seven years. So you do not have to go back and clean up your data.

And this is how Amazon S3 Lifecycle can help you optimize your storage costs. You could choose to move all of the objects in a bucket. You could choose to move selected objects based on prefixes, based on tags, based on object sizes or versions.
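
To make that concrete, here is a minimal boto3 sketch of the lifecycle policy described above; the bucket name and prefix are hypothetical, and the seven-year expiration is approximated as 7 × 365 days.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",          # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},   # could also filter on tags,
                                                 # object size, or versions
                "Transitions": [
                    {"Days": 90,  "StorageClass": "GLACIER_IR"},    # Instant Retrieval
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},  # Deep Archive
                ],
                "Expiration": {"Days": 7 * 365},  # the seven-year compliance example
            }
        ]
    },
)
```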

And while we're talking about lifecycle and object sizes, I want to leave you with a quick tip. Always consider using the object size filter as you're designing your lifecycle policies. And here is why:

Let's say you have a 1 terabyte dataset and you store it in Amazon S3 Standard. It's going to cost you about $24 a month to store it in Amazon S3 Standard. The same dataset is going to cost you about $1 a month to store in Amazon S3 Glacier Deep Archive. So $24 in S3 Standard, $1 in Deep Archive - that's $23 of storage savings per month right there.

But before you decide to move this object, you want to look at the transition cost. So let's say you have 1000 objects in this dataset, it's going to cost you 5 cents approximately to move this data. So 5 cents to move it, $23 of cost savings. You can start realizing your cost savings almost immediately.

But what if you had really small objects? In this case, take a look - with 128 KB objects, it's going to take you 20 months before you can start realizing those same cost savings. As the object size gets bigger, you can start realizing those cost savings sooner.

At about 2 megabytes, you can start realizing those cost savings in about a month and a half. After 2.5 megabytes, you can start realizing those cost savings in under 30 days. We're talking about S3 Standard and Deep Archive in this particular example. But the concept is - as your object sizes get smaller, you want to take a look at the number of requests and the request cost.
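
As a rough sanity check, you can work out the break-even point yourself. The sketch below uses the approximate prices quoted in this example (about $24 per TB-month in S3 Standard, about $1 per TB-month in Deep Archive, and roughly $0.05 per 1,000 lifecycle transition requests); these are assumptions, so check current pricing for your region before relying on the numbers.

```python
# Back-of-the-envelope break-even estimate: how many months of storage savings
# it takes to pay back the one-time lifecycle transition requests.

def breakeven_months(object_size_bytes,
                     dataset_bytes=1024 ** 4,          # 1 TB dataset
                     standard_per_tb_month=24.0,       # assumed S3 Standard price
                     deep_archive_per_tb_month=1.0,    # assumed Deep Archive price
                     transition_cost_per_1000=0.05):   # assumed transition price
    object_count = dataset_bytes / object_size_bytes
    one_time_transition_cost = object_count / 1000 * transition_cost_per_1000
    monthly_savings = standard_per_tb_month - deep_archive_per_tb_month
    return one_time_transition_cost / monthly_savings

print(breakeven_months(128 * 1024))       # ~18 months for 128 KB objects
print(breakeven_months(2 * 1024 ** 2))    # ~1.1 months for 2 MB objects
print(breakeven_months(2.5 * 1024 ** 2))  # under a month for 2.5 MB objects
```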

So this is why we recommend that you use the object size filter as you're designing your lifecycle policies. And we offer a variety of tools that can help you with analyzing your object sizes. We just launched S3 Storage Lens groups - you could create a group based on a known prefix, for example, look at the object size distribution and object age distribution, and decide which objects are ready to be archived.

Now that we have talked about the storage classes and how to get your data in there, I want to talk about a customer example where a customer took advantage of S3's high performance and Amazon S3 Glacier's low-cost storage.

Snapchat has more than 363 million daily active users, probably many more than that at this point, growing every day. These users upload images, videos, and media content to Snap's platform. Previously, all of this content was ephemeral - it would go away after a point in time. Those of us who use Snapchat will remember that.

Snap launched something called Snap Memories, which allowed users to store their data over a long period of time. Now, with this feature, Snap's storage needs increased and they needed to think about optimizing their costs. They decided to move 2 exabytes of data to Amazon S3 Glacier Instant Retrieval, and this helped them save tens of millions of dollars.

This is an excellent example of how you can take advantage of Amazon S3 Glacier's low costs while still reinventing and expanding your business.

That concludes the discussion on the Amazon S3 Glacier storage classes and how you get your data into those storage classes with Lifecycle and Intelligent Tiering. And I'm going to invite Nish back on stage so he can tell us a little bit more about how to put all of this data to work.

N: It will take around three hours for you to submit the requests. For example, if you are using standard restore from Glacier Flexible Retrieval, it will take Glacier almost 3 to 5 hours to complete the restores.

So you can complete your entire workload in five hours, or under six hours. And to ensure that you get the maximum, the highest restore performance, you should use S3 Batch Operations, because it automatically optimizes for the TPS limit and also accounts for retries and errors.
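
For reference, here is roughly what such a Batch Operations restore job could look like with boto3 and the S3 Control API; the account ID, role ARN, bucket ARNs, and manifest ETag are all placeholders.

```python
import boto3

s3control = boto3.client("s3control")

# Initiate restores for every object listed in a CSV manifest.
s3control.create_job(
    AccountId="111122223333",                                   # placeholder account
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/batch-ops-restore",  # placeholder role
    Operation={
        "S3InitiateRestoreObject": {
            "ExpirationInDays": 7,         # how long to keep the restored copies
            "GlacierJobTier": "STANDARD",  # or "BULK" for the lowest restore cost
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::my-manifests/restore-manifest.csv",
            "ETag": "REPLACE-WITH-MANIFEST-ETAG",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::my-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "ReportScope": "FailedTasksOnly",
        "Prefix": "restore-job-reports",
    },
)
```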

We launched this improvement during re:Invent last year, and since then the weekly customer restores have increased significantly. We have seen workloads that would have taken weeks or even months complete within days.

And here is a great example of how Ancestry benefited from the higher request rates. Ancestry believes every family has a story and it helps its customers understand their genealogies and learn more about their family trees. One of the ways Ancestry makes it possible is by extracting the insights from handwritten historical documents.

These are critical records, documenting incredible journeys that people have undertaken. At the same time, these documents are rarely accessed, so Ancestry stored them in Glacier Flexible Retrieval to help its customers in their personal discovery journey.

Ancestry trains AI and machine learning models to recognize handwriting in these historical documents. With the improvement in the restore request rate, Ancestry was able to restore hundreds of terabytes of these documents in hours instead of days. And by being able to restore this much data so quickly, Ancestry was able to train their machine learning models against massive data sets in a very cost-effective way.

And this improved the quality and accuracy of the machine learning models. As a result, Ancestry was able to use the cold archive data stored in Glacier to help its customers find warm family ties.

Now coming to the third thing to consider when you're submitting the request, which is to specify the duration or number of days you want to keep the restored object. When you restore an object from Glacier Flexible Retrieval or Deep Archive, Glacier creates a temporary copy in your bucket.

You can specify the number of days you want to keep this copy, which can range from, let's say, one day to a very long period. But what if you have fat fingers like me and you entered 1,000 days when you wanted to put 10 days? How do you update the restoration period for a restored object?

It's simple. All you have to do is submit a restore request for the same object with 10 days as the restoration period. And Glacier will override the number of days associated with that restored object.
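
In boto3, that second request could look like the sketch below; the bucket and key are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Re-issue the restore for the same object with the number of days you actually
# wanted; Glacier updates the expiry of the already-restored copy.
s3.restore_object(
    Bucket="my-archive-bucket",                 # hypothetical bucket
    Key="backups/db-dump-2016-01.tar.gz",       # hypothetical key
    RestoreRequest={
        "Days": 10,
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)
```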

To summarize, there are three things to consider when you're submitting restore requests to Glacier:

  1. The retrieval option you choose, which determines how quickly your data comes back and what the restore costs

  2. The restore time, the time it will take to initiate and complete the restores

  3. The number of days you want to keep the restored copy

Now, while we are waiting for the restores to complete, you may want to check the status of your restores. Amazon S3 provides you with multiple options to check the status of your restore request.

You can do it using the Amazon S3 console, or you can use the head-object command in the CLI, or use the HeadObject API. We recently added the restore status to the S3 List API, so that's an option too.
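
As a quick illustration, here is what checking the Restore header with HeadObject could look like in boto3 (hypothetical bucket and key):

```python
import boto3

s3 = boto3.client("s3")

resp = s3.head_object(
    Bucket="my-archive-bucket",              # hypothetical bucket
    Key="backups/db-dump-2016-01.tar.gz",    # hypothetical key
)
restore = resp.get("Restore")

if restore is None:
    print("No restore has been requested for this object.")
elif 'ongoing-request="true"' in restore:
    print("Restore is still in progress.")
else:
    # e.g. 'ongoing-request="false", expiry-date="Fri, 15 Dec 2023 00:00:00 GMT"'
    print("Restored; temporary copy details:", restore)
```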

In addition, you can use event notification options in AWS and that's what we recommend. Using a notification service will help you easily integrate the restored data into your workflow or even automate the entire process.

Amazon S3 will create events for restore initiation and restore completion. And you can publish these events to an SNS topic, SQS queue or even trigger a Lambda function.

You can also configure restore completion events with EventBridge and send them to an SNS topic that your application can subscribe to. This will allow your application to automatically proceed with the next steps, such as a GET or COPY, as soon as the objects are restored.
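
Here is a rough boto3 sketch of that wiring; the bucket name and SNS topic ARN are placeholders, and the topic's access policy must additionally allow EventBridge to publish to it.

```python
import json
import boto3

s3 = boto3.client("s3")
events = boto3.client("events")

bucket = "my-archive-bucket"                                    # placeholder
topic_arn = "arn:aws:sns:us-east-1:111122223333:restores-done"  # placeholder

# 1. Turn on EventBridge delivery for the bucket's S3 events.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)

# 2. Route "Object Restore Completed" events for this bucket to the SNS topic.
events.put_rule(
    Name="glacier-restore-completed",
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Restore Completed"],
        "detail": {"bucket": {"name": [bucket]}},
    }),
)
events.put_targets(
    Rule="glacier-restore-completed",
    Targets=[{"Id": "sns-restores-done", "Arn": topic_arn}],
)
```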

And now comes the last part with respect to the restore process where you access or process the restored data. Once the object is restored from Glacier, your application can access it as any other object in S3.

Remember, the restored data is a temporary copy of the object. So if you want to keep it accessible for longer, most of the time you can start by doing an in-place copy or copying it to a new bucket.
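
For example, a copy into a standard storage class could look like this in boto3 (bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Copy the restored (temporary) object into a storage class where it remains
# readable after the restore window expires.
s3.copy_object(
    Bucket="my-working-bucket",                                    # hypothetical destination
    Key="images/img-0001.png",
    CopySource={"Bucket": "my-archive-bucket", "Key": "images/img-0001.png"},
    StorageClass="STANDARD",
)
```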

Now that you are an expert in restoring from Glacier, let's use this recently acquired knowledge to solve a problem. To help you out, I'll reiterate the three points we discussed:

  1. Things to consider when creating the request

  2. Options to check the restore status

  3. Processing the restored data

Let's get started and solve this problem:

Let's assume you have a large, curated collection of high-definition images that you have stored in Glacier Flexible Retrieval, and you want to generate thumbnails of these high-quality images to host them on a website.

The low resolution images will help improve the latency of your website and also help your end users browse and identify the content easily and quickly. So how do we solve that? Let's get started.

The first step is to initiate the restores. In this step, you can define or specify the restore options, so let's use standard restore from Glacier Flexible Retrieval. Then you start submitting the requests.

Glacier will accept the requests at a rate of 1000 transactions per second.

What should be the next step? We can check the status of the requests. For this, we will use EventBridge and an SNS topic to trigger the next step of the process or workflow. And that is to create the thumbnails, to trigger the Lambda function to create the thumbnail images.

Once the restore is complete, we will be notified about that. The Lambda function will create a thumbnail of each image in the destination bucket. That's it! We have solved the problem. Good job everyone!
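
To sketch what that Lambda function might look like: the handler below assumes it is invoked directly by the EventBridge rule for "Object Restore Completed" events (if you fan out through SNS as described above, parse the EventBridge payload out of the SNS message first), that a Pillow layer is attached to the function, and that the destination bucket name is hypothetical.

```python
import io

import boto3
from PIL import Image  # assumes a Pillow layer is attached to the function

s3 = boto3.client("s3")
THUMBNAIL_BUCKET = "my-thumbnail-bucket"  # hypothetical destination bucket


def handler(event, context):
    # For an EventBridge "Object Restore Completed" event, the bucket and key
    # live under event["detail"].
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    # Read the restored high-resolution image.
    original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Generate a small thumbnail in memory.
    image = Image.open(io.BytesIO(original))
    image.thumbnail((256, 256))
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    buffer.seek(0)

    # Write the thumbnail to the destination bucket for the website to serve.
    s3.put_object(Bucket=THUMBNAIL_BUCKET, Key=f"thumbnails/{key}", Body=buffer)
```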

Now comes the most exciting part - the recent enhancements we have made to the restore process. We are continuously making improvements to Glacier's restore process so that you have all the more reasons to make use of your archive data.

This year, we increased the performance of standard restores from Glacier Flexible Retrieval and added support for Amazon Athena. This enhancement will make it even easier for you to incorporate archive data into your workflows.

Let's look into each one of them and see how you can benefit:

  1. We improved data restore performance from Glacier Flexible Retrieval using standard retrieval by up to 85%. Restores from Flexible Retrieval now begin to return objects within minutes, so you can process your archive data faster.

This performance improvement automatically applies to standard retrieval when you are using S3 Batch Operations with the standard restore option from Glacier Flexible Retrieval. And the best part - it is available at no additional cost!

Whether you are transcoding media, restoring backups, training machine learning models or analyzing historical data - you can easily speed up your data restores from Glacier Flexible Retrieval.

To understand the impact of this improvement, let's take an example:

We submitted a restore job containing 250 objects, approximately 25 GB in size, using S3 Batch Operations and the standard retrieval option from Glacier Flexible Retrieval. This is a typical workload that we see from our customers.

Previously, the restore job would typically take 3 to 5 hours to complete, and then you could access the restored data all at once.

Now, with this improvement, the same restore will typically start in minutes. And in this case, the restore job finished in less than 30 minutes! This is a big deal because now you can start processing the objects as they are getting restored from Glacier, rather than waiting for all of them to complete.

Let's see how one of our partners, Cohesity, is using this feature. Cohesity is a data protection company and helps its customers secure, defend and recover their workloads.

Cohesity's customers use the Glacier storage classes to significantly reduce their storage costs and to meet compliance needs. At the same time, they use these critical backup copies for long term preservation and for generating business insights.

Cohesity wanted to restore these backups faster in order to improve their recovery time objectives. Now with restores starting in minutes, Cohesity was able to restore certain backups within an hour compared to 3 to 5 hours earlier from Glacier Flexible Retrieval.

This enhancement enabled Cohesity to reduce the time and cost associated with restoring their archive backups. Now Cohesity's customers can keep more of their data in Glacier at low cost while retaining the ability to recall them faster when they need it.

Now, let's see how we can use this new capability to further improve the performance of the thumbnail solution that we just built.

In the previous approach, it would take 3 hours to restore the data and then 2 hours to create the thumbnails. So approximately 5 to 7 hours to complete the entire workflow.

Now what we will do is use standard restore with S3 Batch Operations to initiate the requests from Glacier Flexible Retrieval. And now we'll start getting objects in 20 to 30 minutes.

As the objects are getting restored, the event notification mechanism can trigger the Lambda function to create the thumbnail of each image. This is great because you can process your data in parallel as they are getting restored from Glacier Flexible Retrieval.

It can significantly reduce the time for the whole workload. For example, with this approach you can complete your entire thumbnail creation process in less than 3 hours versus 5 to 7 hours earlier.

We believe this can be a game changer for applications that are restoring data from Glacier Flexible Retrieval. As we already discussed, you should use S3 Batch Operations to automatically optimize for restore performance.

And I'm excited to give you one more reason to use S3 Batch Operations! Last week, we launched a new feature that allows you to create the list of objects on-the-fly and eliminates the need to submit a manifest.

So what does that mean for you? Let's go back to our previous example. Let's assume all the high-resolution images were stored under a prefix and have a .png extension.

Now, with this capability, you can submit the job, specify the prefix, and put .png in the suffix filter. Behind the scenes, S3 Batch Operations will generate the manifest, the list of objects, and start initiating the requests.

This capability to automatically generate the manifest applies to all operations in S3 Batch Operations, not just restores. So you can use it for copying objects, replicating objects, adding/replacing/deleting tags, or updating object lock modes.
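
To illustrate, here is roughly what that manifest-free job submission could look like with boto3; the account ID, role ARN, bucket ARNs, prefix, and report settings are all placeholders, and the manifest-generator field names are based on the S3 Control CreateJob API.

```python
import boto3

s3control = boto3.client("s3control")

# Restore only the .png objects under the images/ prefix, with no pre-built manifest.
s3control.create_job(
    AccountId="111122223333",                                    # placeholder account
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/batch-ops-restore",  # placeholder role
    Operation={
        "S3InitiateRestoreObject": {
            "ExpirationInDays": 7,
            "GlacierJobTier": "STANDARD",
        }
    },
    ManifestGenerator={
        "S3JobManifestGenerator": {
            "SourceBucket": "arn:aws:s3:::my-archive-bucket",    # placeholder bucket
            "EnableManifestOutput": False,
            "Filter": {
                "KeyNameConstraint": {
                    "MatchAnyPrefix": ["images/"],
                    "MatchAnySuffix": [".png"],
                },
            },
        }
    },
    Report={
        "Bucket": "arn:aws:s3:::my-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "ReportScope": "FailedTasksOnly",
        "Prefix": "restore-job-reports",
    },
)
```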

To summarize - you can perform restores or other operations at scale with a single step.

Now comes the second improvement and the final topic that we want to cover today - Glacier support for Amazon Athena.

If you don't know about Amazon Athena, it is a serverless interactive analytics service that makes it simple to analyze petabyte-scale data stored in your S3 data lake or across 30+ other data sources.

We recently added the restore status to the S3 List API, and Athena uses this restore status, available in the List API response, to identify whether an object is restored or not.

If the object is restored, Athena can use this data directly for further analysis. This means you can query restored data from Glacier Flexible Retrieval and Glacier Deep Archive without copying the data into standard storage.

This can be incredible for use cases like log analysis or long term analysis. We're excited that with this capability, you will get the opportunity to incorporate your archive or cold data directly into your analytical workloads using Amazon Athena.
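
As a rough sketch, a query over an archive-backed table could be kicked off like this with boto3; the database, table, and output location are hypothetical, and depending on your Athena engine version the table may also need to be configured to read restored Glacier objects (check the Athena documentation for the relevant table property).

```python
import boto3

athena = boto3.client("athena")

# Query a table whose underlying objects were restored from Glacier.
athena.start_query_execution(
    QueryString="""
        SELECT request_date, status_code, COUNT(*) AS requests
        FROM access_logs_archive              -- hypothetical table
        WHERE request_date BETWEEN DATE '2018-01-01' AND DATE '2018-12-31'
        GROUP BY request_date, status_code
    """,
    QueryExecutionContext={"Database": "archive_analytics"},   # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```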

And with that, I would like to conclude this session and share the key takeaways we have covered:

  1. Advancements in AI, especially generative AI, are changing every industry and data is becoming the biggest differentiator. You should consider preserving your assets at low cost in Amazon Glacier for future AI use cases instead of deleting them.

  2. Glacier offers multiple options to archive your data. Pick the storage class that best suits your retrieval needs and operational requirements.

  3. Use bulk retrieval to significantly lower your restore costs from Glacier Flexible Retrieval and Glacier Deep Archive. Remember, bulk restores from Glacier Flexible Retrieval are free.

  4. Use Batch Operations to optimize your restore process. S3 Batch Operations uses the maximum TPS for initiating requests and automatically accounts for retries and errors.

To learn more about how you can maximize the value of your cold data with Amazon Glacier, you can scan the QR code to access our free ebook. You can also go to the AWS Training & Certification storage section to further expand your storage knowledge.

We hope this session showed you why cold data is more important than ever before and how you can derive more value from it. Please fill out the survey, we really value your feedback. Enjoy the rest of your re:Invent sessions. Thank you!
