Dive deep on Amazon S3

Amy Therrien, Director of Engineering for Amazon S3: Hello, I'm Amy Therrien, Director of Engineering for Amazon S3. And this is my colleague, Seth Markle, who's a Senior Principal Engineer with S3.

I want to thank you so much for coming here today. You're here because you have a curious mind. You want to know a little bit more about S3 than maybe you can read in the documentation or see in a video. And frankly, the S3 team is full of people like that, people who are curious. They're so curious about how things work inside of S3 that they actually join the team as builders. So I'm hoping that today you walk out of this room with just a little bit more insight into S3.

Seth and I have been with S3 for over a decade and our curiosity is still not satisfied. So first, I'm gonna take you through a little bit of our history.

You know, with S3 we've been serving up object storage for customers like you for more than 17 years. And over that time, we've been learning and improving our software systems to make them more resilient.

If you're at all interested in machine learning, or in analogies to machine learning: during that time we've been training the foundational model of how to build and operate an object storage service for the cloud. And I'm super proud of the model that we've come up with.

There's really no compression algorithm for experience. So if we look back in time a little bit for S3, we went through a couple of different stages. In the early days, a lot of things surprised us. We learned lessons the hard way; there were a lot of late nights and sleepless nights reacting to things. We were very much in a reactive mode.

The initial projection of usage for S3 was in the gigabytes of storage and very quickly, more than anyone anticipated, we needed to support exabytes of storage. And so pretty soon we wanted to kind of get ahead of problems for you, our customers and for our builders.

And so we started looking at the concept of threat modeling. Threat modeling is looking ahead at everything that could possibly go wrong and trying to put in place a proactive mitigation for it. Over time, as we learned how to do threat modeling, we instilled it as a culture within the whole S3 team: constantly being proactive, looking for what those threats are and looking for proactive mitigations.

So: fewer nights awake, and more being proactive and building solutions into our service. How does this threat modeling that we do actually work? You may have heard of Amazon's doc-writing culture. It is a real thing. We do write docs, and we write a lot of docs, because human memory is somewhat fallible. We misremember things all the time. Our minds play tricks on us.

So we take what's in our heads, our worries about threats, and we write them down on paper. Then we have an opportunity in the future to reflect on it accurately, look back and ask: did we miss something? Is there something that's not quite right in our thinking? And we continue our learning process that way.

Very specifically, when we do threat modeling, we create an actual document. If we're delivering a feature, that could be one document; for a very large feature it could be 50 documents. And we have an interactive meeting where we review that document with our most experienced engineers, like Seth here. It's a great learning opportunity for the folks on our team who are newer to building resilient services.

So what's in a threat model? A threat model consists of threats, and threats are things that are unexpected or could happen in the future, and they're not necessarily bad things. The word threat sounds negative, but they could be good. For example, having greater than expected customer usage is kind of a good threat to have; greater than expected scale is definitely something that we love and appreciate.

We try to come up with an exhaustive list of threats. We look at the likelihood of each threat happening, and that's important when we start to develop mitigations: thinking about how good a mitigation needs to be means thinking about how often that threat is going to happen to us in the future.

Another aspect that we look at for threats is the area of impact. How impactful is that threat going to be if it does occur? Is it impacting one version of one object? Is it impacting all of the objects in a customer bucket? Is it impacting all the objects for all customers in a region?

Looking at that gives us a sense of priority, of how important it is, and inspires our team to come up with really great mitigations for it.

Now, while you're going through this threat assessment, it's really important not to get too bogged down with mitigations until you've come up with that exhaustive list. You really want to do creative brainstorming: what are all the threats that are out there?

Once you've got the threats, likelihood, and area of impact, then you think about the mitigations, and this is where things can get a little tricky. Sometimes, when we're thinking about the proactive mitigation we want to put in place for a threat, we introduce new threats. So you want to keep iterating on that process until you've come up with mitigations that are sufficient and that don't introduce new threats.

So, what we're gonna talk about today are some of the ways that S3 is mitigating threats inside of the S3 service. But we're also gonna talk about how you, as customers of S3, can use S3 as a mitigation for your own threats, threats such as greater than expected customer usage or scale. We're gonna talk through a few of those.

Now, before we get too deep into S3, we're gonna give a brief overview just to give you a sense of what's going on within S3. We've got over 350 microservices in each one of our regions. That's a lot of complexity, a lot to think about, more than we can cover in this session, so we're gonna keep what we focus on today really simple.

We're focusing on the write path: what it means to write data into S3 through a PUT request. So how does that work? It starts at the client. At the client there's some data, some data that you want to write into S3, and our customers use the REST APIs. They might use an SDK, the CLI, or an application that can speak to those REST APIs.

They start with that piece of data, and it comes in as a PUT REST API request to our front end. Our front end is composed of DNS routers, web servers, and the other components in the network. A web server eventually interprets that request, understands that it's a PUT, and decides to put the key part of that PUT request into our indexing system, while the data part goes into our storage system.

Now, this is where we start to talk about some of those threats. One of the things you can use S3 for is anticipating, or leveraging, greater scale and greater than expected customer usage for your application. To do that, you can look to our front end as the first step. Our front end is big: it is serving greater than a petabyte of data per second. That's huge. And you can leverage that scale to help anticipate and grow your applications.

There are a couple of mitigations that you can use. The first is multipart upload, the second is range GETs, and the third is multi-value DNS. We're going to talk a little bit about each one of these and how they can help you with the threat of greater than expected customer usage.

So the first one is multipart upload. Here we have a client that has 100 megabytes that they would like to upload as an object into S3, and that can be done over a single request. That single request has some threats associated with it. Things can go wrong, like something on the internet failing in the middle of the request, and if that happens, the client has to restart the entire process over again, which slows things down.

It's also hitting a single server, so you're limited to the capabilities of that one physical web server. Alternatively, we can take that 100 megabytes and divide it up into chunks. In this particular case, we're dividing it up into 20 megabyte chunks, so five different requests, and sending those requests in parallel to multiple servers within the S3 front end.

Now, the cool thing about this is that if you're dealing with one of those internet threats, or a client problem, or something else goes wrong with a request, and it only goes wrong for one of those requests, then the application only needs to re-upload that one part.

In addition, if you send them all in parallel, you get a 5x throughput improvement. That parallelism helps things go faster: each one of those requests has less work to do, and they can all run at the same time.

And this is what it looks like in practice. Here's an example of the APIs: CreateMultipartUpload, then UploadPart for each of the parts, which can be done in parallel if you want the throughput benefit, and then, when you're finished with all the parts, CompleteMultipartUpload.
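In code, that flow looks roughly like the following sketch with the Python SDK (boto3); the bucket, key, and file names are hypothetical, and retries and error handling are omitted.

```python
# Minimal sketch of the multipart upload flow with boto3 (the Python SDK).
# Bucket, key, and file names are hypothetical; retries and error handling omitted.
import boto3

s3 = boto3.client("s3")
bucket, key = "example-bucket", "example-object"
part_size = 20 * 1024 * 1024  # 20 MiB parts, as in the example above

with open("large-file.bin", "rb") as f:
    data = f.read()

# 1. Start the multipart upload and remember its UploadId.
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

# 2. Upload each part (in practice, issue these calls in parallel).
parts = []
for i in range(0, len(data), part_size):
    part_number = i // part_size + 1
    resp = s3.upload_part(
        Bucket=bucket, Key=key, UploadId=upload_id,
        PartNumber=part_number, Body=data[i:i + part_size],
    )
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

# 3. Tell S3 the object is complete once every part has been uploaded.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```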

So that's a way to handle the write path. And the same is true in reverse: we can parallelize in reverse using range GETs, or by getting the individual parts. You get the same benefits, in that you may only need to re-download just the one part, or you can get that parallelization going so that you get increased throughput.

And here's what it looks like in practice with our CLI. GetObjectAttributes will list all the part numbers and the sizes of each of the parts that you want to retrieve for that object. Then you can call GetObject to fetch each one of those individual parts and stitch the object back together, so you have one whole object on the client.
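The same flow with the Python SDK (boto3) looks roughly like this; it assumes the object was uploaded via multipart upload, and the bucket and key names are hypothetical.

```python
# Minimal boto3 sketch of reading an object back part by part; it assumes the
# object was uploaded via multipart upload. Names are hypothetical.
import boto3

s3 = boto3.client("s3")
bucket, key = "example-bucket", "example-object"

# Ask S3 how many parts the object has.
attrs = s3.get_object_attributes(Bucket=bucket, Key=key, ObjectAttributes=["ObjectParts"])
total_parts = attrs["ObjectParts"]["TotalPartsCount"]

# Fetch each part; in practice, issue these GETs in parallel.
chunks = []
for part_number in range(1, total_parts + 1):
    resp = s3.get_object(Bucket=bucket, Key=key, PartNumber=part_number)
    chunks.append(resp["Body"].read())

whole_object = b"".join(chunks)  # stitch the object back together on the client
```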

Now, if you did all this parallelism that we're talking about but sent it all to a single server in our web server fleet, you would be limited to the capabilities of that single server. It's a real physical server: one web server has a certain amount of CPU and a certain amount of network capability, and that's limiting. Instead, you can use multi-value DNS, which will spread those requests across our entire large front-end fleet, the fleet that's doing over a petabyte per second of bandwidth.

And by doing that, you get to leverage all of those capabilities: you can get great throughput, and you can scale your application to whatever your customers need in terms of growth, even unexpected growth.

This one's actually super important. So I was working with a customer once who was having some latency issues when they were interacting with S3. When we looked at their metrics, the traffic looked pretty smooth at a five-minute granularity. But when we actually dove into their logs, what we noticed was that they were sending the entire five minutes' worth of traffic over an eight-second period, all hitting a single IP address.

And so when we spread their traffic across multiple IP addresses, I don't even think we had to spread it across time, just across IP addresses, their issues went away. So this is a very real piece of advice here. That's super helpful, and a perfect example. Thanks, Seth.
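To see what multi-value DNS gives you, you can resolve the S3 endpoint yourself; the SDKs and the CRT handle this for you, but here's a quick illustrative sketch (the hostname is hypothetical).

```python
# The S3 endpoint resolves to many IP addresses (multi-value DNS). Re-resolving
# and spreading connections across those addresses avoids pinning all traffic
# to a single front-end host. The hostname below is illustrative.
import socket

host = "example-bucket.s3.us-east-1.amazonaws.com"
infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
ip_addresses = sorted({info[4][0] for info in infos})
print(ip_addresses)  # typically several addresses; rotate connections across them
```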

All right. So you remember a while ago when I was talking about threats, and how sometimes a mitigation you introduce for a threat introduces new threats? Well, the thing about doing all this parallelism is that the software is a bit more complex when you're doing things concurrently, multipart uploads concurrently or range GETs concurrently. And when it's more complex, there's a risk of more software bugs being introduced.

And so the mitigation that we have for that is something we call the Common Runtime, or CRT. The Common Runtime is an open source library that has all the best practices for the things that I just talked about. So it's got the multipart capabilities...

It does the range GETs, it uses multi-value DNS, and it has the best checksums and the best retry algorithms, all condensed into this Common Runtime. It's included with our SDKs and our CLI, so you can use it there to get all this great throughput that I've been talking about, and we're starting to include it in other applications as well.

For example, we've included it in the Mountpoint for Amazon S3 family. Earlier this year, we launched Mountpoint for Amazon S3, which is a FUSE file connector for S3, and this month at re:Invent we've added some features onto that. We've added a caching capability. We've added support for containers in the form of a CSI (container storage interface) driver, so you can use it in Kubernetes or EKS environments. And we've added support for the new high-performance storage class, S3 Express One Zone, as well. This FUSE file system connector has all those best practices for performance from the CRT built in.

If you're interested in seeing Mountpoint for Amazon S3 in action, there is a session from one of our customers who used it in high performance computing. Arm and AstraZeneca have done some really cool machine learning on a very large cluster using this FUSE connector, and they were getting great performance using Mountpoint. So if you're looking for a real-life example, there's another session you can take a look at.
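If you're using the SDKs directly rather than Mountpoint, you usually don't need to hand-roll the multipart and parallelism logic either; the SDKs' managed transfer helpers take care of it. A minimal boto3 sketch follows (this shows boto3's built-in transfer manager rather than the CRT itself; names and thresholds are illustrative).

```python
# boto3's managed transfer helper splits large uploads into parallel parts for
# you; the thresholds and concurrency here are illustrative, not recommendations.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # switch to multipart above 8 MiB
    multipart_chunksize=20 * 1024 * 1024,  # 20 MiB parts
    max_concurrency=10,                    # parts uploaded in parallel
)
s3.upload_file("large-file.bin", "example-bucket", "example-object", Config=config)
```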

So those are some ways that you can leverage the front-end part of S3 to get really great scale and really great growth for your application. Next, I'm gonna talk about some of the things you can leverage within the index, and also how the index itself mitigates some threats.

So the index is big, again, and you can use that to your advantage. We're storing over 350 trillion objects in our index, and we serve over 100 million requests per second. It's doing a lot of work and storing a lot of stuff, and you can put that to use.

We're constantly scaling the index so that we can keep up with all the growth from all of our customers. We want to make sure that we always have sufficient capacity, because it's not great when there isn't enough capacity and you can't put things in there, so we're constantly keeping ahead of that for you. That's one of the threats we think about: how do we keep adding and scaling out so that we can handle more and more objects and more and more transactions per second? So this is how we handle all that internally; we're always thinking about how we keep growing and how we keep scaling.

What we do is partition all those keys over many servers, and we do that lexicographically, using the letters and numbers in the key names. This particular illustration shows what it could look like, at a high level, to divide them up by key name if you looked at just the first letter of the key names.

One of the challenges we might see in this space: assume, for example, that all our key names are English words, and we're looking at that first letter to figure out how to spread things out. It turns out that in English the first letters of words are not evenly distributed. 'A' doesn't have the same frequency as 'Z'; vowels are super popular as starting letters, and there are a few consonants, R, S, T, N, L, that are a bit more popular at the beginning of English words.

So we might see a hotspot. This N-to-S area of our partition space has two of those hot letters in it, N and S, so our system will look for these hotspots and spread them out over more servers. In this case, we're splitting it into N-P and Q-S to spread that heat out. That's how our system works.

And you can use this to your advantage as you are scaling your application as you're anticipating your customer scale. You can use kind of a bit of understanding here to do that. Now, I'm gonna explain that in a little bit of detail on how that works. But in order to do that, you have to understand what a prefix is.

So a prefix in S3 is any string of characters that comes after the bucket: anything that comes right after the bucket name and is part of the key name. Here's an example of what prefixes might look like for a re:Invent bucket. You can see that 'p' is the first letter after the bucket name; all of these keys contain the word "prefix", some say prefix1, some say prefix2, and prefix1 has "data" and other segments that are part of the prefix.

And we use that to spread the load across all those servers. Each prefix supports 3,500 PUT requests per second and 5,500 GET requests per second, and we set it up that way so it's easier for you to predict how to develop the right prefix strategy and make sure you have enough scale for your customers.

What happens if you need more than that? Well, if you have just a single prefix and you exceed its capability, you might get a 503. A 503 is a response that says: slow down, that's too much. So what do we do if we want more than what I just described for a single prefix?

Well, here's an example. This starts off with 5,500 GETs per second on the bucket itself. Then we split that into two prefixes, prefix1 and prefix2, and each one of those can take 5,500 GET requests per second. Then we split further with more prefixes: under prefix1 and prefix2 we add /a and /b, and that gives us a total of 22,000 requests per second. Pretty great.

So you can use your prefix structure in order to get greater scale and greater throughput. Now, what are some things we need to think about in order to make this work well? There are a couple of mitigation strategies that help you avoid getting 503s.

The first mitigation strategy is keeping cardinality to the left in key names. What that means is having as much diversity as possible in the characters that immediately follow the bucket name. Using all the variety of characters you can possibly think of there will help improve the amount of splitting and partitioning that we can do.

The second is to keep dates to the right. Dates tend to be numerical, and if you think about numbers, there are only 10 different digits, zero through nine, versus the full alphabet of 26 letters, or all the letters and numbers mixed together. So keeping dates to the right can help increase your ability to scale and plan for greater than expected customer usage.

Now, let's look at that in practice. This first example is what not to do, so don't do this. We start by putting the day, or date, to the left, instead of creating diversity there and putting the date to the right. We've put that day on the left, we split it up, and we get a similar sort of TPS to what we saw before. So far things look fine; I get 22,000 TPS from that.

What happens when day two comes along? Well, when day two comes along, you start moving all of your writes to day two. Day one was yesterday; now I'm putting all my new data into day two. It does not get to leverage any of the partitioning and spreading of load that happened on day one, so you might see some throttles for a while as we spread out everything under day two all over again.

So a better way to do this is to move the days and dates far to the right. In this case, when we move from writing to day one to writing to day two, we get to leverage all of the load spreading that we did on day one for day two as well, and we minimize any throttling.
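As a concrete sketch of that guidance, here is one hypothetical key-naming scheme in Python, with high-cardinality characters on the left and the date pushed to the right; the layout and names are purely illustrative, not a prescription.

```python
# Hypothetical key-naming helper: a few high-cardinality characters on the left
# so S3 can partition the keyspace, and the date pushed to the right.
import datetime
import hashlib

def make_key(record_id: str) -> str:
    spread = hashlib.md5(record_id.encode()).hexdigest()[:4]  # 4 hex chars derived from the id
    today = datetime.date.today().isoformat()
    return f"{spread}/{record_id}/{today}.json"

print(make_key("order-12345"))  # "<4 hex chars>/order-12345/<YYYY-MM-DD>.json"
```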

All right. So with that, I am going to turn it over to my colleague Seth, who's gonna go into some threats in the storage space. Thanks, Amy.

Hey, everyone. So, once again, my name is Seth Markle. I'm a Senior Principal Engineer. I've been in S3 for just about 14 years; it'll be 14 years this February. And I'm going to be talking about how we keep your data durable and available.

So, a couple of stats: we have millions of hard drives that store exabytes of data. Just to give a sense of how many hard drives that is, and I actually had to research this to make sure it was an honest statement, if you stack them flat, one on top of the other, you can reach the International Space Station several times over. That's how many drives millions of hard drives are. It's a huge number.

Most people who have heard of S3 have heard of this number. If you don't want to count, there are eleven nines there; that's our eleven nines durability design. We're going to talk about this number in a bit and go over specifically how it's calculated. But I want to start by highlighting that durability is much more than a single number, right?

Internally, we have a goal that we are absolutely lossless. Attendees at re:Invent in 2019 might remember a talk that I did with a colleague, Andy Warfield, on S3's durability culture. We talked about several factors that influence durability. The obvious factor is that we store data on drives, and drives can fail. We also store drives in buildings, and buildings can have issues, right? There can be fire, there can be water. There are also issues like data getting corrupted, both on the wire and at rest.

We have operators running the system; there could be deployment issues, there could be bugs, there could be operator error. So if you zoom out, durability is actually influenced by three high-level things: people, drives, and zones. Drives is the obvious one: drives are the hardware the data is stored on. Zones, or buildings, or facilities, as you'll hear them called in different contexts, are where we put those drives, the physical buildings. People might be less obvious; we've actually been talking about people this whole talk.

People here refers to our builders and how our team approaches software development. That includes things like operator errors, deployment problems, even software bugs; those are all people-related issues. And that's where this threat modeling mechanism fits in.

Ok. So with that background, let's go into this eleven nines number and talk about how we mitigate drive failure. As usual, we start with a statement of the threat: we must protect data stored on drives, which can fail or corrupt bits at rest.

So what are our mitigations? Well, we have three mitigations that tie together to give us this eleven nines design goal. The first is that we start with end-to-end checksumming; this is how we make sure that what we store is actually what you gave us. When we store data, we store it redundantly across several drives. And we have background auditors that are constantly monitoring the system, looking for faults so that we can react to them.

So, Seth, before you move on, I want to talk a little bit about those monitors.

One of the things I'm really proud of is that these monitors exist. I mentioned how stressful it can be to carry the gravity of taking care of all these exabytes of data, and it's those durability monitors, those independent monitors, that make it so that I can sleep at night and know that my customers' data is safe. They're a really important part of our success in making sure that things are the right quality.

Yeah, totally. So let's talk about the first mitigation, which is end-to-end checksumming of requests. When a request comes into S3, we send it through a series of services internally on our request plane. We launched a feature, I think it was about a year ago, that publicly is called additional checksums; internally we call it flexible checksums, or flex checks. It allows customers to specify any one of a number of checksum algorithms, so that we can get checksum data from customers on the insertion path.

We've had that capability all along, but it used to just be Content-MD5, and we wanted something more performant than MD5. So we've added CRC32, CRC32C, and SHA-256, which isn't necessarily more performant, but it's there. And that allows us to validate checksums all the way down our request plane.
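On the client side, asking for one of these additional checksums is a small change. A minimal boto3 sketch (the bucket, key, and payload are hypothetical):

```python
# Ask S3 to verify an upload with one of the additional checksum algorithms
# (CRC32C here); the SDK computes it client-side and S3 validates it on ingest.
# Bucket, key, and payload are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-bucket",
    Key="example-object",
    Body=b"some important bytes",
    ChecksumAlgorithm="CRC32C",
)
```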

Now, we're transforming data as we move down this request plane, we might be framing the data as we go, and ultimately we want to store that data redundantly across multiple drives. To do that, we use a technique called erasure coding. Erasure coding, at a super high level, divides your data into chunks; those chunks are called shards. It creates several additional chunks and then spreads those chunks, or shards, across multiple storage devices, multiple disks. When we do that, we assign checksums to every one of those chunks, or rather, those shards.
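As a rough mental model of what erasure coding buys you, here is a toy single-parity example in Python. S3's real codes are far more sophisticated; treat this purely as an illustration of "split into shards, add redundancy, rebuild a lost shard."

```python
# Toy single-parity erasure code (NOT S3's real scheme): split data into k
# shards, add one XOR parity shard, and rebuild any single lost shard.
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int = 4) -> list[bytes]:
    shard_len = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(k * shard_len, b"\0")      # pad so shards are equal length
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]    # last shard is the parity

def recover(shards: list[bytes], lost_index: int) -> bytes:
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    return reduce(xor_bytes, survivors)            # XOR of survivors rebuilds the loss

encoded = encode(b"hello, durable world!")
assert recover(encoded, lost_index=2) == encoded[2]
```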

So in the process of doing this, we've taken data, possibly framed it, and then split it across multiple devices. We want to make sure that we did this correctly before we go and tell you, the customer, that we've stored your data. So what we do is actually reverse that whole pipeline of transformations to make sure that the data we're storing is actually the data we got from you.

So we don't just calculate checksums as we go forward; we go backwards and make sure that we can actually re-derive the data from what we have just done. Only then do we respond with a 200. So by the time you get an acknowledgement back from S3, we have not only checksummed and stored your data, but we've also reversed all of those operations to make sure that we're storing the right thing.

There are a lot of checksums here. So another one of these stats I like to share: we calculate 4 billion checksums per second across the fleet. 4 billion every second. One second, two seconds... it's a tremendous number of checksum calculations. I'm still kind of in awe of that number.

So as I mentioned, we erasure-code your data across multiple redundant storage devices. Here's an example: I have five pieces of data, A, B, C, D, and E, and each of these vertical rectangles is a storage device. Here I've spread each object across three devices. The fact that there are five objects and I chose three devices is just an example for illustration.

So what happens when a particular drive fails? Our background auditors will detect the failure; we're constantly monitoring the system to look for drives that have failed. When they detect a failure, we're able to re-replicate that data from the other drives in the system, right? So here, of A, B, C, D, and E, the pieces of A, B, and D that were on the failed drive are able to be recovered from the other drives.

And it might not be obvious in this picture, but while this is happening, those objects are still accessible from the drives that are online. At scale, it looks a little bit like this illustration. We have a lot of spare capacity on hand, and the way we manage spare capacity is not just as a fleet of empty drives waiting for their turn to join the game.

We keep slack space across our entire fleet. The reason we do that is that if we lose a drive in the system, by having several other drives with enough free space to help in recovery, we're able to parallelize that recovery, similar to how Amy talked about parallel multipart uploads. And what this gives us is a very high throughput of recovery.

So when a drive fails and we detect it, we can re-replicate that data using hundreds or even thousands of drives participating in that re-replication. Drives don't simply just fail, though; drives also come with a certain bit error rate. This is the rate at which drives will just spontaneously flip bits at rest. It's a property of the storage medium.

And so this is where both of the techniques we've talked about so far come in. We have constant monitoring and we have checksumming; remember, I'm checksumming all of the shards that go to disk. So now, on a per-disk basis, I can scan the shards and look for anything where the checksum, which matched at the time I wrote it, now doesn't match. And if I find an issue like that, I can enlist the re-replication system I just talked about to re-replicate your data.
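Conceptually, that background scrub is just "re-read, re-hash, compare." A toy sketch of the idea (illustrative only, not S3's implementation):

```python
# Toy sketch of a background scrub: re-read each stored shard, recompute its
# checksum, and flag anything that no longer matches what was recorded at write
# time. Purely illustrative, not S3's implementation.
import zlib

def scrub(shards):
    """shards: iterable of (shard_id, data, recorded_crc) tuples."""
    for shard_id, data, recorded_crc in shards:
        if zlib.crc32(data) != recorded_crc:
            yield shard_id  # hand off to the re-replication system

stored = [("shard-0", b"abc", zlib.crc32(b"abc")),
          ("shard-1", b"def", zlib.crc32(b"xyz"))]  # simulated bit rot on shard-1
print(list(scrub(stored)))  # ['shard-1']
```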

So let's piece this all together. We have data replicated across storage devices; those devices fail at a particular rate, and at the scale of millions of drives that rate converges and is measurable. We can also re-replicate at a particular speed, right? We talked about how we use the width of the system to get high throughput on re-replication.

So there are well-defined mathematical models where those observed inputs, real-world observations that we make, get fed into the model, and we come up with a durability rate. That's where our eleven nines durability rate comes from. The eleven nines is the goal, and our observations and the modeling ensure that we are exceeding that goal.
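To make that shape concrete, here is a toy version of that kind of model in Python. The failure rate, repair window, and coding parameters below are made up for illustration and are not S3's real numbers or real model.

```python
# Toy durability model (illustrative numbers only, not S3's real parameters):
# given a per-drive annual failure rate, a repair window, and a code that
# tolerates m shard losses out of n, estimate the annual chance that an object
# loses more shards than the code can tolerate before repair completes.
from math import comb

def annual_loss_probability(n=9, m=3, afr=0.02, repair_hours=1.0):
    windows_per_year = 365 * 24 / repair_hours
    p = afr / windows_per_year  # per-shard failure probability within one repair window
    p_window = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1, n + 1))
    return windows_per_year * p_window  # small-probability approximation

print(annual_loss_probability())  # a tiny number; modeled durability is 1 minus this
```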

Ok. So earlier I referenced that the three coarse-grained threats for durability are people, drives, and zones, and we just talked about how our eleven nines modeling covers us for drive failure. So what about Availability Zones?

So let's read this threat: we must protect the data we store against the unexpected total or partial loss of a zone. Any single facility may fail at any time, and we must protect the durability of the data stored within it. Now, the AZ complexity of thinking about threats is something I got to learn a lot about working with S3. I'm kind of astounded by how non-intuitive it can be and how much planning goes into planning for those threats. It's super that we're able to take on all that burden of thinking about what can go wrong with an AZ, so that our customers, you guys, don't have to worry about it.

Absolutely. For most storage systems, the loss of a data center is a threat that you have to think about, but our default storage classes take care of that for you. Here's how that works.

Ok. So remember this picture from before? This is a more expanded view of that example of five drives I had earlier. Like I said, that was just for illustration; we don't literally do five, and we don't literally do however many this is, twelve, either. But this might be what a subset of our fleet looks like with blobs spread across it.

And our eleven nines, again, comes from monitoring these for failure and re-replicating them. What we do is take those replicas and spread them across zones, right? And we do it in such a way that we could lose any subset, any chunk of those replicas in a zone, and still be able to recreate your data if that zone is lost.

And this actually happens on a request-by-request basis. When a request comes in, it goes through that same checksum bracketing algorithm that I described earlier, only the storage devices it uses are spread across zones, very intentionally. So by the time you get a 200 success response, your data is already able to tolerate the loss of an entire zone with no impact to the integrity of your data, right?

And I mentioned earlier that by the time you get a 200, the data has already gone through all of those reverse transforms and checksumming. By that point it has also been spread across multiple zones.

So, show of hands, who heard the announcement for S3 Express yesterday? Quite a few. Amy heard about it. So we have a new storage class, S3 Express One Zone, and it's aimed at high-performance applications. One of the ways we achieve high performance is by localizing your data in a single Availability Zone. So this is a trade-off that we made.

So Adam talked yesterday about how our Availability Zones achieve isolation by actually being spread really far apart. I think the numbers he used yesterday, and I always get these reversed, were 60 miles, 100 kilometers. That's pretty far, and that's great for isolation, right? But the speed of light is a real thing, and the further apart you put these buildings, the higher your latency is.

And so the choice we made for S3 Express One Zone was to localize data in an Availability Zone to minimize latencies. But that comes with a risk that customers need to be aware of. So here the threat is: full or partial loss of an Availability Zone may actually lose my data in S3 Express One Zone.

So again, here's that picture from before. In regional S3, all of our standard storage classes, data is spread across the Availability Zones. In a one-zone storage class, those disks are in a single zone. It's a similar replication scheme, we still go through bracketing and all that, but everything is collected in the zone.

So this is how we're able to offer eleven nines of durability in a single zone. We still go through the end-to-end checksumming, we have very similar erasure coding schemes, and we audit the data at rest using the same systems that we use for regional S3. The difference with a one-zone storage class is that it is not resilient to loss of, or damage to, all or part of an AZ.

That includes things like fire and water damage that could harm the AZ. So you, as a customer using a one-zone storage class, have to build your applications to be aware of that risk. That typically means you'll want to put recreatable data into a storage class like S3 Express One Zone, or have a backup solution in place so that you can restore your data in case of a disaster event.

The fact that it's eleven nines, though, means that in steady state you shouldn't expect to see objects just vanishing, right? But in the unlikely case that there's a full or partial issue in the Availability Zone, that's when you might have to enlist a disaster recovery mechanism.

Ok. So we've talked about drives and we've talked about zones. Now let's talk about people, but from the perspective of one of your builders. This is a fictional example, but let's say your builders are building a system to clean up some of your buckets. I talk to customers all the time who do these one-time cleanup jobs against their bucket.

So let's write out the threat here. And again, the reason we write these out is because writing solidifies your thinking. One of my favorite quotes, and I use it all the time with colleagues, is from the cartoonist Dick Guindon: writing is nature's way of letting you know how sloppy your thinking is. I find this to be true time and time again. When you write things out, it really helps you form clarity of thought.

So in this particular case, in this fictional example: we never want accidental deletion of data in Amazon S3 buckets, especially due to bulk deletion operations. We have this threat written out, but threats need mitigations, so let's talk about what you as a customer can do if you find yourself in this scenario.

S3 offers a variety of features that can help you as a customer protect against and mitigate this risk.

The first one, and one of the more common ones that people use, is S3 Versioning. Versioning allows you to preserve an object's history as you overwrite or delete objects. It actually creates a stack of object versions for you, which on the surface sounds great: I can protect myself when I'm doing a cleanup job.

However, if you're trying to clean up, what good is it if you're just stacking the objects up? You haven't actually cleaned up anything. So what a lot of customers will do is pair versioning with a lifecycle policy.

S3 has a lifecycle feature that does noncurrent version expiration; it's always a mouthful for me. So you could do something like: I'll go do my cleanup job, and I'll set a policy that ages out the old object versions after they've been noncurrent for seven days. In the meantime, if you try to access those objects, you'll get a 404. Hopefully that gives your applications enough time to break, your systems realize that you've done something wrong, you stop the lifecycle policy, and you go figure out how to clean things up.
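Put together, that pairing looks roughly like this in boto3; the bucket name, rule ID, and seven-day window are illustrative.

```python
# Hedged sketch: enable versioning, then add a lifecycle rule that expires
# noncurrent versions after 7 days. Bucket name and rule ID are illustrative.
import boto3

s3 = boto3.client("s3")
bucket = "example-bucket"

s3.put_bucket_versioning(
    Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
)
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
        }]
    },
)
```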

Another feature that we offer is replication. S3 Replication copies your data between buckets; those buckets can be cross-account, and they can be cross-Region. By default, replication takes seconds to minutes from the point where you put the object. Customers with stricter time requirements can use our Replication Time Control feature, which has an SLA for 99.99% of objects being copied within 15 minutes.
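A hedged sketch of what a replication rule with Replication Time Control might look like via boto3; the role ARN, bucket names, and rule details are placeholders, and the source bucket needs versioning enabled.

```python
# Hedged sketch of a replication rule with Replication Time Control. The role
# ARN, bucket names, and rule details are placeholders; the source bucket must
# already have versioning enabled.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="example-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/example-replication-role",
        "Rules": [{
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::example-destination-bucket",
                "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
            },
        }]
    },
)
```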

Another feature is Object Lock. Object Lock is a feature where you can effectively lock an object and set a time before which the object cannot be deleted, no matter what, right? And it really is no matter what. So that will protect your data set against accidental deletion if you never want a piece of data to be deleted.
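For illustration, a minimal boto3 sketch of writing an object under Object Lock; it assumes a bucket that was created with Object Lock enabled, and the names and retention date are hypothetical.

```python
# Hedged sketch of writing an object under Object Lock; assumes a bucket that
# was created with Object Lock enabled. Names and the retention date are
# hypothetical.
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-locked-bucket",
    Key="example-object",
    Body=b"important data",
    ObjectLockMode="COMPLIANCE",  # nobody can delete it before the date below
    ObjectLockRetainUntilDate=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```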

Now, this is often used by customers with regulatory requirements, but it's also for customers who just need an extra layer of protection. And then there are backups, which are a general concept. AWS offers backup solutions, and we have several backup solutions from our partners.

Backup systems are great, and they offer two key features that I think are super valuable. One is a backup catalog: a good backup system will let you peruse your backup history and select the point in time you want to restore from. The other thing that backup engines offer is a restore engine. Backup without restore is not really backup; the way I've heard it described is that backup is really just an implementation detail of restore. If you can't restore, why have you backed up at all, right? So restore is the most important thing, and you need to be testing your restore procedures. The worst time to test restore is when you're in a disaster recovery scenario and actually need to try it for the first time.

Ok. So let's pivot to availability. We're operating in all AWS Regions. A threat that one might consider, which seems pretty easy to mitigate today, is that the availability of one AWS Region can never affect the availability of another Region. But how many people here knew that S3 started out with the intent to become a single global service? Got a couple. Ok, S3 taught AWS a lot of lessons; that's what you get to do as one of the older services.

When S3 first launched, we envisioned a single global network of storage, so we started out with a presence on the east coast and the west coast of the United States. I don't actually know if I pointed east and west just now; I don't know what direction I'm facing. Our indexing system that Amy talked about would replicate traffic between the east and the west coast, best effort. It would literally be a UDP packet that we sent across the country, and storage would be stored wherever the request was received. So if I put my data into Washington, the object data would be stored there, and I'd best-effort replicate the indexing entry over to Virginia.

This ended up being a problem for customers for several reasons. The first is eventual consistency. If I'm best-effort replicating my index and metadata across the country and something happens to that datagram, then it's going to take a while before you can actually see that object. If I put it in Washington, maybe I can't see it in Virginia for a while. We did have background processes that would perform reconciliation between the two sides, but those could take hours or sometimes even a day, and so you would often see high eventual consistency because of this.

Also, latency is uncertain. If I put the object in Washington but all my readers are reading it in Virginia, even though the metadata is in Virginia, I still have to haul the object bytes across the country every single time I get it. And that might be hard to predict if I don't know where my putters were.

And the other issue is that the availability of an architecture like this is lower than if everything were localized in the Region, because any issue with networking across the country could cause some of my data to be inaccessible. So we don't operate this way anymore. It says 2010 here; I think it might have actually been 2009. But S3 is a regional service.

So let's revisit zone failure again, but now in the context of availability. We must remain available through the unexpected loss of an entire zone, and one of the key fundamental design goals that we have is to operate during an AZ failure without customers even realizing it.

Ok. So remember this picture from earlier, when we talked about how we replicate data for durability across Availability Zones? Well, this actually also has availability benefits. Let's say we have the temporary loss of an Availability Zone, for example due to a power event. We're able to route around the failure and serve your data without you even knowing that there's an ongoing issue. And the background systems that re-replicate data in the event of a disk failure are also responsible for making sure that this property holds at all times.

So we're not simply looking for free space to replicate the data into; it has to be free space in a zone such that we maintain this property that any zone can be lost, temporarily or permanently, and the data is still accessible. This is, of course, not the case for the one-zone storage class, where the data might just be in AZ 3.

So as Amy showed earlier, we advertise our front-end servers in DNS, but when an AZ fails, we want to stop advertising IP addresses in the failed AZ. So we employ automation similar to ELB's zonal shift feature. I don't know if folks have heard of that, but it allows you to remove IP addresses from DNS based on some sort of signal, for example an AZ failure. And so this automatically removes endpoints in the affected AZ, just like that. But this is why it's important, as a customer, to honor our DNS TTLs, right? You want your application to refresh from DNS frequently; the SDK does this for free for you. But you want to make sure you have that IP address diversity and that you are actually honoring the TTLs, because hosts come in and out.

We also run our operations with AZ boundaries in mind. What does this mean? When we deploy new software, we deploy to a single zone at a time. When we adopt new hardware, we adopt it a zone at a time. Configuration changes occur an AZ at a time. This minimizes the chance of an unintended bug impacting the entire Region. Why does it do that? Because we've already designed our system to be able to lose an AZ, so we use that to our advantage during typical operations and treat an issue in an AZ like an AZ loss.

So we've gone through several examples of how we build our software to be correct by looking at threats and then finding appropriate mitigations for those threats. But it's also important to then go and assume incorrectness, right?

So I'm building for correctness, and we have really robust processes to make sure it's correct. But then we pessimistically want to assume it's incorrect and build our defense in depth. This is what we call guardrails, and I want to talk about two guardrails just to illustrate some of the things we do internally in our own operations.

The first is that, for our larger code changes, we will run the new and the old code paths together on the same box, with the new code running in what we call shadow mode. Now, obviously we can't do this for everything, like features that have externally visible side effects. But this technique allows us to run against billions or even sometimes trillions of requests and compare results against the previous code before enablement. And again, this is not in lieu of building correct software; by the time we get here, we're already certain that our software is correct. Yet we still do this, pessimistically, because we assume incorrectness.
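The shape of that pattern, stripped way down, looks something like this (a toy sketch, not S3's actual framework):

```python
# Toy sketch of "shadow mode": serve traffic from the proven implementation
# while running the new one alongside it and logging any divergence.
# Not S3's actual framework.
def handle(request, old_impl, new_impl, log_mismatch):
    old_result = old_impl(request)
    try:
        if new_impl(request) != old_result:
            log_mismatch(request, "results differ")
    except Exception as exc:  # a crash in the new path is also a mismatch
        log_mismatch(request, exc)
    return old_result  # only the old, proven path ever serves the response
```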

Another example is that we have control plane limits. I mentioned how we operate our system a zone at a time, for example putting new hardware in one zone before we put it in other zones. In addition to that, our re-replication systems will make sure that, even though it's only one zone and we can lose a zone, no object is disproportionately placed on that new hardware. So even though we have these controls at the zone level, we also have our control plane, which goes further on a per-object basis.

Ok. So this was a pretty dense talk today. When we say there's no compression algorithm for experience, it's really about this focus on fundamentals that we've built from decades of experience growing and operating services like S3. The easiest thing we do is build for the happy case; building to run in the presence of these real-world threats is much harder, and most of our code by volume is actually for these exceptional cases. But beyond just building for the exceptional cases, what's harder still is building the organization that's designed, at its core, to think about these threats and mitigations, so that the organization is building robust and correct software. And if you take anything away from this talk, we hope it's that these fundamentals take intentional effort.

Well, I just wanna say thank you again for taking the time to join us today. Also, thank you to Paul Meighan, our colleague who helped us put these slides together. I have a huge ask of everyone here: please fill out the survey for this session, and in the suggestion text area, please let us know if there are other parts of S3 that you'd like to learn about. We're probably gonna do another deep dive at summits and re:Invent next year, and we'd love to hear what you'd like to hear about.
