Get started with checksums in Amazon S3 for data integrity checking

All right, welcome everyone. Our session today is "Getting Started with Checksums in Amazon S3 for Integrity Checking." My name is Aritra Gupta, and I'm a senior product manager on Amazon S3.

Now, before we get started, let me see a quick show of hands. How many checksum validations do you think Amazon S3 performs each second? Any guesses?

Yeah. So let me give you the answer: S3 performs more than 4 billion checksum validations each second. And that's because the integrity of your data is a top priority for us.

So let's look at today's agenda. We'll start off with a brief introduction to checksums, followed by data validation in Amazon S3. Then I'll talk about some of the advanced checksum capabilities that you can use to accelerate your integrity checks. And finally, we'll look at a demo. So let's get started.

What is a checksum? A checksum, or what some of you might commonly call a hash, is a short alphanumeric representation of the contents of an object. And how do we calculate a checksum? Well, we apply a checksum algorithm, in this case CRC-32, to the contents of an object to get the checksum value. So as you can see here, we have three distinct values for the three different objects.
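
(This snippet isn't from the session; it's a minimal sketch using Python's standard library to show the idea.) Applying CRC-32 to different contents gives different values, and the same contents always give the same value:

```python
import zlib

# CRC-32 of three different object payloads: each distinct content yields a
# distinct checksum, and re-hashing the same content reproduces the same value.
for content in [b"red bike", b"blue bike", b"green bike"]:
    print(content, format(zlib.crc32(content), "08x"))
```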

Now, where do we really use checksums? There are two key use cases which I wanted to highlight here. The first one is data in transit. So imagine you're moving bikes from a source to a destination, and you want to make sure that the same bikes that were at the source landed at the destination. To validate this, you look at the checksum on both sides and just do a simple comparison.

The second is data at rest. When you're storing data for longer periods of time, you want to make sure that the data was not altered since the last time you accessed it. Here, you can compare the current checksum value to the previously known checksum value to make that determination.

I want to highlight here that Amazon S3 uses a combination of CRC, SHA-256, and MD5 algorithms for all data in transit and at rest, and all the capabilities that I'm going to talk about today are additional tooling on top of the data validation that we already perform.

So let's look at some of the common applications for checksum validation. The first one is data migration. Let's say that you are moving a large amount of data from on-premises or other appliances to S3, and you want to make sure that the same bytes actually landed in S3. To achieve this, you can compare the previously existing checksum values against those that S3 calculates for you and determine whether the right bytes landed or not.

The second one is digital preservation, and this is quite common for customers in the media and entertainment industry, as well as government agencies, where you have digital archives that need to be stored for many years. Checksums are very helpful for maintaining a chain of custody across all of your archives.

And finally, I also wanted to talk about deduplication, which is a common use case. Imagine you have three buckets with millions of objects and multiple users creating copies of your data. If you want to cost optimize, you can use checksums to identify duplicates of your objects and potentially delete those altogether.
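
As a hedged sketch of the idea (not shown in the session), assuming your objects were uploaded with CRC-32 checksums enabled, you could group keys by their stored checksum with boto3 and flag the duplicates; the bucket name here is hypothetical:

```python
import boto3
from collections import defaultdict

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical bucket

groups = defaultdict(list)
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        attrs = s3.get_object_attributes(
            Bucket=bucket, Key=obj["Key"], ObjectAttributes=["Checksum"]
        )
        # Checksum is only returned for objects uploaded with a checksum algorithm.
        value = attrs.get("Checksum", {}).get("ChecksumCRC32")
        if value:
            groups[value].append(obj["Key"])

# CRC-32 can collide, so confirm candidates byte-for-byte before deleting anything.
duplicates = {value: keys for value, keys in groups.items() if len(keys) > 1}
print(duplicates)
```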

So now I want to talk about data validation in Amazon S3. S3 is the first storage provider in the cloud to support a variety of checksum algorithms, and we give you the flexibility to pick and choose these algorithms based on your use case.

Second, I want to talk about the advanced integrity checking capabilities, which are our trailing checksums and our parallel checksum operations. These are particularly helpful when you're dealing with streaming workloads or working with large objects.

Now, a common question that I get asked is: which algorithm should I even be choosing? And I always have the default answer: well, it really depends on what your use case is. As I mentioned previously, we give you the flexibility to choose among these algorithms.

A common example: let's say you're streaming bytes into S3 and you want a fast, performant, lightweight algorithm as a durability best practice. In this case, I would recommend something like CRC-32C.

On the other hand, let's say you're storing genomics data and the FDA mandates that you use SHA-256 as a compliance requirement. In that case, you can go ahead and use the SHA-256 algorithm as well. It all depends on your use case.

[Responding to an audience question] I'll get to that after the session. I'm not able to answer that right now, but I'll get back to you.

All right, moving on. I wanted to talk a little bit about our advanced integrity checking capabilities. The first one is a trailing checksum, where instead of adding a checksum as a header to an object, we append the checksum as a trailer while you're streaming bytes to S3. I'll come to the performance gains that we get through this mechanism.

And the second is the concept of a parallel checksum, where instead of calculating a whole-object checksum for a large object, we calculate the checksum for each of the different parts and farm that operation out onto multiple cores. In that way, you can get parallel performance optimizations as well.

Now let's start with the first one. Traditionally, if you were to upload an object into S3, it would take two steps: first, you would compute the checksum, and second, you would upload the object into S3. Both of these operations, as you can see here, have their own wall-clock time as well as processing cost. What that means is more cost for you and more time to complete the operation.

But with the trailing checksums concept, we can calculate the checksum for you as you're streaming your bytes into S3, and then we append it as a trailer to the request. What this means is we effectively reduce the two operations into just one, thereby saving you cost as well as processing time.
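
With boto3, for example, you don't precompute anything yourself: specifying a checksum algorithm lets the SDK calculate the checksum while the body is being streamed and send it with the upload (recent SDK versions transmit it as a trailing checksum). A minimal sketch with placeholder names:

```python
import boto3

s3 = boto3.client("s3")

# The SDK computes the CRC-32C value as it streams the body, so there is no
# separate "hash the whole file first" pass in your code.
with open("rainier.jpg", "rb") as f:
    response = s3.put_object(
        Bucket="my-example-bucket",
        Key="rainier.jpg",
        Body=f,
        ChecksumAlgorithm="CRC32C",
    )

print(response["ChecksumCRC32C"])  # the checksum S3 calculated and stored
```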

Next, I want to talk about multipart uploads. Let's imagine you are uploading large objects into S3, and these objects could be gigabytes or terabytes in size. When you upload these large objects, you typically use our multipart upload API.

So for these kinds of objects, what we compute is what we call a checksum of checksums. And how do we compute this value? Well, we compute the checksum for each of the parts that you see here, and then we take a checksum of all those checksum values. This is what we call a checksum of checksums.

Now, if you were to do this data validation on your end, how would you construct a checksum of checksums? You would need to know all the parts, and you would need to know all the part-level checksums. We purpose-built an API just to help you with this process.

So this is my favorite, the GetObjectAttributes API. It works well for small objects too, but it is really built for large objects, because this API gives you the checksum algorithm, the checksum of checksums value, the number of parts that are present in the object, the part boundaries, as well as the part-level checksum values.

So whenever you're uploading large objects into S3 and you want to look at the attributes of that object, we recommend using this API.
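
For reference, a boto3 call might look like the following (bucket and key are placeholders); requesting the ObjectParts attribute is what exposes the part-level details:

```python
import boto3

s3 = boto3.client("s3")

attrs = s3.get_object_attributes(
    Bucket="my-example-bucket",
    Key="large-object.bin",
    ObjectAttributes=["Checksum", "ObjectParts", "ObjectSize", "ETag"],
)

print(attrs["Checksum"])                        # algorithm and checksum (of checksums)
print(attrs["ObjectParts"]["TotalPartsCount"])  # how many parts the object has
for part in attrs["ObjectParts"]["Parts"]:
    print(part["PartNumber"], part["Size"], part.get("ChecksumCRC32"))
```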

So let's talk about performance a little bit. Now imagine you're calculating a serialized checksum for a large object. In this case, a one terabyte object. And you want to use the SHA-256 algorithm.

So what we found is that it typically took us about 86 minutes to calculate the full-object checksum, whereas calculating the parallel checksum of checksums as I described earlier took us just seven minutes on the same EC2 instance. This has been a game changer for customers who want to perform integrity checks at scale on large objects, whether as a durability practice or as a compliance requirement.
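
To make the parallelism concrete, here is a rough local sketch (not the benchmark setup from the session, and not exactly what S3 does internally): split a file into fixed-size parts, hash each part on its own core, then hash the concatenated digests. Note that the result is a composite value, not the same as a single full-object SHA-256.

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

PART_SIZE = 8 * 1024 * 1024  # 8 MiB parts; a hypothetical part size


def hash_part(args):
    path, offset, length = args
    h = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(length))
    return h.digest()


def checksum_of_checksums(path):
    size = os.path.getsize(path)
    parts = [(path, off, min(PART_SIZE, size - off)) for off in range(0, size, PART_SIZE)]
    with ProcessPoolExecutor() as pool:  # each part is hashed on a separate core
        digests = list(pool.map(hash_part, parts))
    # Hash the concatenated part digests: the composite "checksum of checksums".
    return hashlib.sha256(b"".join(digests)).hexdigest()


if __name__ == "__main__":
    print(checksum_of_checksums("large-object.bin"))
```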

Now, let me move on to the demo.

All right. So for this demo, I'm going to show you how we work with checksums. I'm using the Python client here with the Boto3 SDK, but please feel free to choose any SDK or programming language of your choice. Checksums are supported across all of our SDKs.

So we'll just go ahead and quickly initialize our code. I'm from Seattle, so Mount Rainier is my favorite place to visit over the weekends, and I'll try to upload this image of it to S3. Before doing that, I want to calculate the CRC-32 of this image locally on my system as well. Then let's see how the checksum validations really work.

So in this step, I'll calculate the checksum value locally. And let's print that out. So as you can see, we get our local checksum value here.
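
The exact demo code isn't reproduced in this transcript, but a local CRC-32 calculation along these lines produces the value we'll hand to S3 in a moment; S3 expects it base64-encoded (big-endian), and the filename is a placeholder:

```python
import base64
import zlib

with open("rainier.jpg", "rb") as f:
    data = f.read()

# S3 represents a CRC-32 checksum as the base64 encoding of the 4-byte big-endian value.
local_checksum = base64.b64encode(zlib.crc32(data).to_bytes(4, "big")).decode()
print(local_checksum)
```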

Now, let's imagine you have thousands of objects on-premises and you're moving them to S3. If you want to ensure that you are uploading the right object to S3, and assuming that you already have the checksum values for these objects, you can construct a request that tells us the checksum algorithm and also the value of the checksum that you expect this object to have.

So in this case, I'm sending the checksum value that I calculated locally. This request should only succeed if that checksum is equal to the checksum that S3 will calculate. Because I passed the same checksum that we calculated locally, we expect this to succeed. So let's go ahead and see the value.
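
In boto3, that request looks roughly like this (a sketch with placeholder names, using the local_checksum computed above); supplying ChecksumCRC32 asks S3 to verify the upload against the value we provide:

```python
import boto3

s3 = boto3.client("s3")

with open("rainier.jpg", "rb") as f:
    response = s3.put_object(
        Bucket="my-example-bucket",
        Key="rainier.jpg",
        Body=f,
        ChecksumAlgorithm="CRC32",
        ChecksumCRC32=local_checksum,  # the value we computed locally
    )

# S3 only accepts the upload if its own CRC-32 matches the one we sent.
print(response["ResponseMetadata"]["HTTPStatusCode"])  # 200 on success
print(response["ChecksumCRC32"])                       # checksum calculated by S3
```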

So we see a 200 status code here, and the checksum that we see in this response is what S3 sends back to us. This is calculated by S3, and we've validated that the checksum we calculated locally is the same as the one S3 calculated.

Now let's flip this around a little bit. Let's assume that someone on your team changed the image, which correspondingly would have a different checksum value. So as you can see, I've slightly modified the checksum that we got originally into a different value here.

Now, as I mentioned, if we attach this to our request, the operation should fail. So let's run it. It'll do a few retries, so we'll have to wait, and we're at the mercy of the wi-fi as well. Fingers crossed, but let's see what response comes back.

And as I mentioned, the checksum value here is different from the checksum that we had.

All right. So now, as you can see, we've received an error here; let me quickly go to the exception. As you can see, it says that the CRC-32 you specified did not match the calculated checksum. What that means is that if you specify an incorrect checksum for an object and send it to S3, S3 will reject the request and that object will not be uploaded.
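
If you're scripting this, the mismatch surfaces as a client error that you can catch. A hedged sketch (the error code shown is illustrative, so inspect the actual response in your environment; wrong_checksum stands in for the altered value from the demo):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
wrong_checksum = "AAAAAA=="  # deliberately incorrect CRC-32 value

try:
    with open("rainier.jpg", "rb") as f:
        s3.put_object(
            Bucket="my-example-bucket",
            Key="rainier.jpg",
            Body=f,
            ChecksumCRC32=wrong_checksum,
        )
except ClientError as err:
    # S3 rejects the upload when its calculated CRC-32 differs from the supplied value.
    print(err.response["Error"]["Code"])     # e.g. "BadDigest"
    print(err.response["Error"]["Message"])
```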

Now, you might ask me: what if I don't have the checksum at all? What happens in that case?

So let me show you this request, where we specify the checksum algorithm but I do not specify any checksum value. In this case, it will successfully upload the object to S3, but it will also calculate the checksum for you and return it in the response.

So in case you do not have the checksum today, you can just upload the object to S3 and specify what algorithm you want us to use, and we'll do that for you.
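
That request is along these lines (again a sketch with placeholder names): only the algorithm is supplied, and S3 calculates and returns the checksum value in the response.

```python
import boto3

s3 = boto3.client("s3")

with open("rainier.jpg", "rb") as f:
    response = s3.put_object(
        Bucket="my-example-bucket",
        Key="rainier.jpg",
        Body=f,
        ChecksumAlgorithm="CRC32",  # no checksum value supplied
    )

# S3 computed the CRC-32 during the upload and returns it to us.
print(response["ChecksumCRC32"])
```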

Now, let's use the GetObjectAttributes API to retrieve the same value. As you can see from GetObjectAttributes, we got the checksum value here. And because this is not a multipart upload object but a single object, you just get a simple checksum value with the algorithm specified as well.

Now for the next demo, what I've already done is create two five-megabyte files and upload them together to S3 as a single object using our multipart upload steps, because this is a large file. I did this in advance before the demo. But as you can see here, we have the CreateMultipartUpload step, then we upload both of the parts, and then we complete the upload. You can use any of our SDKs, the Transfer Manager, or any other tools at your disposal to automate all of these parts and have a single step to upload these large objects.
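
For reference, the multipart sequence with checksums enabled looks roughly like this in boto3 (two part files and placeholder names; each UploadPart returns a part-level checksum that you pass back when completing the upload):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "large-object.bin"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key, ChecksumAlgorithm="CRC32")

parts = []
for part_number, filename in enumerate(["part1.bin", "part2.bin"], start=1):
    with open(filename, "rb") as f:
        resp = s3.upload_part(
            Bucket=bucket,
            Key=key,
            UploadId=mpu["UploadId"],
            PartNumber=part_number,
            Body=f,
            ChecksumAlgorithm="CRC32",
        )
    parts.append({
        "PartNumber": part_number,
        "ETag": resp["ETag"],
        "ChecksumCRC32": resp["ChecksumCRC32"],  # part-level checksum from S3
    })

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```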

Now, let's say this object was already uploaded to S3, and let's talk about the GetObjectAttributes API. If I call GetObjectAttributes in this case, you can see the number of parts this object contains, and for each part you can see its checksum. You can also look at the part number boundaries, the part sizes, and the total number of parts.

So imagine you have a large object and you want to maintain data integrity for each of those parts. You can validate those at the part level as well.
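
One hedged way to do that part-level validation: recompute the CRC-32 of each local part file and compare it with what S3 reports through GetObjectAttributes (file and object names are placeholders, and the parts are assumed to be in upload order):

```python
import base64
import zlib

import boto3

s3 = boto3.client("s3")
attrs = s3.get_object_attributes(
    Bucket="my-example-bucket",
    Key="large-object.bin",
    ObjectAttributes=["ObjectParts"],
)

local_parts = ["part1.bin", "part2.bin"]  # the same part files we uploaded
for part in attrs["ObjectParts"]["Parts"]:
    with open(local_parts[part["PartNumber"] - 1], "rb") as f:
        crc = zlib.crc32(f.read())
    local = base64.b64encode(crc.to_bytes(4, "big")).decode()
    assert local == part["ChecksumCRC32"], f"part {part['PartNumber']} mismatch"
    print(f"part {part['PartNumber']} OK")
```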

[Responding to an audience question] Yes, this is because I uploaded the same object twice.

And then finally in the last row, you get the checksum value, which is the checksum of checksums that we calculated.

Now let me show you how we actually calculate the checksum of checksums value. You do not need to do this yourself, but since this is a demo, let's go into the details of how it works.

So I'll go through this code line by line. First, I take both of these checksum values, decode them, and create a list of all the checksums. Then I concatenate these decoded checksum values, so you basically have the first checksum and the second checksum together as a single string of bytes. Finally, we checksum that concatenated value, and that gives us the checksum of checksums.

So let's see if this value matches the checksum of checksums. As you can see, the local value that we computed in this step matches the checksum of checksums from S3.
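
The calculation described above comes out to roughly the following (a sketch; the part checksums come from GetObjectAttributes, and note that for composite checksums S3 may report the object-level value with a "-<parts>" suffix indicating the part count):

```python
import base64
import zlib

import boto3

s3 = boto3.client("s3")
attrs = s3.get_object_attributes(
    Bucket="my-example-bucket",
    Key="large-object.bin",
    ObjectAttributes=["Checksum", "ObjectParts"],
)

# Decode each base64 part checksum to raw bytes and concatenate them in part order.
part_digests = [base64.b64decode(p["ChecksumCRC32"]) for p in attrs["ObjectParts"]["Parts"]]
concatenated = b"".join(part_digests)

# Checksum the concatenated digests: this is the checksum of checksums.
local_value = base64.b64encode(zlib.crc32(concatenated).to_bytes(4, "big")).decode()
print(local_value)
print(attrs["Checksum"])  # compare against the object-level value S3 reports
```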

So let's go back to our slides.

All right. There are three main things that I want you to take away from my session today. First, we support a variety of checksum algorithms, and it really depends on your use case which one you want to use. We've seen customers use SHA-256 for a particular workload and CRC for something else. It really depends on what your use case is.

Second, I wanted to call out trailing checksums, which definitely improve your performance. And third, if you're working with large objects and you want to verify their data integrity, the checksum of checksums operation is very performant and helps you achieve that as well.

So that was my session. Thank you for attending. And I'll be around for a few questions. I think we have three.
