Building and optimizing a data lake on Amazon S3

Hello, everybody. Good afternoon. Thank you for joining us today. This is "Building and Optimizing a Data Lake on Amazon S3". My name is [name]. I'm an engineer with Amazon S3. I'm joined by my colleague Huy, who's a product manager on Amazon S3. And we're both very excited about this, as I'm sure you are as well.

Later, we're gonna be joined by Ryan Blue, who is a co-creator of Apache Iceberg, an open table format which has been gaining an incredibly rapid uptake in the data lake ecosystem. So stick around. We have a lot of interesting content for you today.

Now, when we talk about data lakes, different images come up in different people's minds. You may be thinking about big data analytics. You may be thinking about ML and generative AI. Perhaps pictures of cats - all of these things can be stored in your data lake. So the general space of data lakes is incredibly broad.

And if we try to just cover the things that AWS has implemented and supports in the data lake space, we'll be here a very long time. So we're not gonna do that. We're not going to try to boil the ocean, or the lake - that's the last dad joke today.

Instead, we're gonna look at what underpins every single data lake - its storage layer. Every data lake needs one. Without a storage layer, there is no data. Without data, there is no workload - well, there's no point.

And so today, we're gonna dig deeper into Amazon S3 and why we believe it's a great foundation for building and running and optimizing data lakes. And there is one word you will find us using a lot today and that is "scale".

As some of you know, and we will reinforce, scale plays an incredibly important role in the data lake ecosystem and actually a huge role in the decisions we made in the Amazon S3 architecture.

So before we go digging deep into Amazon S3, I'd like us to take a very quick trip down memory lane into data lake history. Don't worry, we're not gonna spend a lot of time there, just enough to capture one moment in data lake evolution and see a curious pattern that keeps repeating.

Now, the term "data lake" was coined around 2010 or 2011. But the very thing it describes - a repository of structured and unstructured data - has existed for much, much longer than that. Pinpointing when data lakes first appeared is very hard, but we can go as far back as the nineties, perhaps, when client-server architecture was the revolutionary development.

And so in a typical business, in a typical IT department, you would have a system that looks something like this: a relational database, your email server, a file share. And all of this would be placed on the intranet, which was the height of technological sophistication, and it worked really great.

There are two things that are interesting here. First of all, this kind of begins to resemble a data lake - you have your structured and unstructured data placed together. But there's another aspect: the data isn't placed exactly together, just close to each other. There are actually islands of data that are disparate from one another. This is something we sometimes describe as "data silos".

Now in the nineties, this particular configuration probably made very little difference. But as things evolve, it actually makes a lot of difference.

Now, if your business is growing, it's presumably growing across all dimensions - so your data set is growing, your user base is growing, and the kinds of things you do, your scenarios, are exploding as well. And so you need to scale. And to scale, you need to add machines.

And in the nineties, this is literally what you did. So you may have started somewhere here, but over time you may have ended up with a system that looks something like this. The sheer scale of it, and the sheer awkwardness of trying to run a report across all of it - joining data from your databases with the invoices on your file server - makes that incredibly difficult.

Imagine trying to secure this with some kind of coherent governance policy - it's really, really hard. And the primary reason for this is those silos again. These are very disparate systems, and they're really beginning to bite.

So what did the industry do at the time? It invented what it called "data marts" which actually pushed all of the data together again into what some of the systems at the time called a "multidimensional cube". And so this automatically solved some of the problems we just outlined.

Through innovation.

Now, the story doesn't end here, of course, but we're not gonna go into the rest of it. Much more happens down the road. But let's look at what we just saw - we saw a cycle of breakdown or cycle of innovation, depending on how you look at it. I prefer to look at it as a cycle of innovation.

You start with a certain system that fits your needs, you move your data to it, and as you ingest more and more data into it, and as your scenarios evolve at the same time, the system starts stretching, stretching, until it eventually breaks down. And in order for you to move forward, you have to innovate.

And the two vectors that move this cycle are:

  1. The business needs that keep evolving over time
  2. The scale - the physical growth of data, personas, and scenarios

Of course, this is all still happening today. 10-20 years ago we didn't worry about data locality, for example; now it seems like it's all we worry about. And this is not going to stop - it's going to keep going faster and faster and faster.

And of course, we can all continue working in this space - we're all technical workers and we can rebuild our systems. But if that's all we do, then we don't get to focus on our business. Which is why you want something like this.

And of course, this is an Amazon conference, so I'm going to talk to you about Amazon S3. But I'm not gonna go and recite all of the features of Amazon S3 and all the scenarios it supports - of which there are many.

I hope that you visit other sessions that will go really deep into the governance, the cost saving model, the best practices, and the conversations around scale in the Amazon S3 architecture.

But one of the things that I would really like to double down on right now is to talk to you about scale. And this is particularly pertinent to our conversation.

Amazon S3 is not only built for scale - it in fact somehow paradoxically thrives at scale. It relies on scale to support certain scenarios. And we'll get into specific examples of what that looks like and why you should care.

So let's look at just some of the mechanisms of how we achieve this. And let's start pretty simple, because as we'll realize very quickly, it's gonna get really complicated really quickly.

Let's look at a PUT - one customer puts an object into Amazon S3. The first thing we do is shard it - we basically split it into multiple chunks. We also create additional parity chunks. This process is called erasure coding. Then we store it on hard drives, but we don't just dump it all together - we spread it across a variety of hard drives across multiple racks, multiple AZs, and a variety of facilities.

On GET, we reconstruct the object back from these chunks.
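To make the idea concrete, here is a toy sketch of "data chunks plus parity" using a single XOR parity shard. This is purely illustrative - S3's real erasure coding and placement are far more sophisticated and not something we describe in detail - but it shows why losing any one shard doesn't lose the object:

```python
# Toy illustration of the shard-plus-parity idea behind erasure coding.
# NOT S3's actual scheme; it only shows why one lost shard is recoverable.
import functools


def shard_with_parity(data: bytes, num_data_shards: int = 4) -> list[bytes]:
    """Split data into num_data_shards chunks plus one XOR parity chunk."""
    chunk_len = -(-len(data) // num_data_shards)  # ceiling division
    padded = data.ljust(chunk_len * num_data_shards, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(num_data_shards)]
    parity = bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))
    return chunks + [parity]


def recover_missing(shards: list[bytes], missing_index: int) -> bytes:
    """Rebuild one lost chunk by XOR-ing all surviving chunks (including parity)."""
    survivors = [s for i, s in enumerate(shards) if i != missing_index]
    return bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))


# Any single shard can be rebuilt from the rest:
shards = shard_with_parity(b"hello, data lake!")
assert recover_missing(shards, 2) == shards[2]
```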

And so this particular design has two very important implications:

  1. Individual customer data occupies only a tiny fraction of any individual drive. So if something happens to a drive - and as we know, hardware fails - you haven't lost the entire object. In fact, you have enough shards to restore it. So it's great for durability.

  2. The flip side of this benefit is that the customer data is spread across multiple facilities and disks, which allows you to actually exercise huge parallelism of different resources in order to serve your requests. This is incredibly important for performance.

Let's see why - consider two data lake workloads, maybe a little bit contrived, but they should drive the point home:

Workload 1:

  • Operates on a relatively small data set - 4 petabytes. Relatively large, but not that huge.
  • But what is interesting is that the workload being driven is incredibly spiky - we're looking at millions of TPS. Maybe it's a quarterly or yearly report, or a very large ETL job that's running.

Workload 2:

  • Much bigger dataset - several times bigger
  • But the TPS is actually much lower - we're talking thousands of TPS vs millions of TPS

If you compare these workloads and try to design a storage system that accommodates both - with performance, durability, and availability characteristics that scale - it's very hard. They're very, very different.

But it turns out that scale actually helps. Scale is a challenge but it's also an opportunity - the opportunity to aggregate hundreds of thousands of workloads into one aggregate demand.

What you're seeing on the left is a set of workloads with very different shapes and TPS/IOPS demands. On the right is the aggregate demand - even though each workload is a snowflake, the aggregate is actually fairly flat. And that kind of system is much easier to build.

So this is a great example of how Amazon S3 actually relies on scale to support performance and availability. Scale is at the core of everything we do.

To give you an idea of our scale, I'm not going to recite all of this, just two stats that still boggle my mind:

  • Worldwide, Amazon stores a third of a quadrillion objects. I don't think my brain is designed yet to hold a number that big.
  • At peak, we serve 1 petabyte a second of traffic - also a crazy number.

But the number that impressed me the most - and I just learned about it recently - is that some customers have buckets that span millions of drives or more. This is the storage disaggregation we've been talking about.

OK. So why does this matter? It turns out there are some very tangible implications, and performance is where you can really experience them. So let's start with the basics.

Each operation against Amazon S3 is a request followed by a response, and a workload is made up of a series of these requests. Sometimes you have options around issuing them serially or issuing them in parallel. Data lakes typically drive workloads that issue a lot of requests in parallel - there are multiple workers, and parallelism is the primary scaling mechanism. It's one of the reasons data lake workloads can process so much data.

But consider what kind of capability a storage system needs in order to handle this - it's actually very specific. If you drive a million IOPS against your local hard drive, it will melt down; you will not increase your performance. In fact, you will arrive at congestive collapse, and some storage systems do behave this way. But S3 is not like this, because of the disaggregation we were just talking about. You can and should send as many requests to S3 in parallel as you can. Buckets in S3 general purpose are not actually scaling units, and your TPS - we'll talk a little bit later about the namespace TPS and some constraints it introduces - can generally scale up to really, really high numbers.
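As a concrete sketch of "send as many requests in parallel as you can", here is what a fan-out of GETs might look like with the Python SDK. The bucket and key names are hypothetical, and the SDK's built-in retries absorb any 503 pushback while S3 scales behind the scenes:

```python
# Minimal sketch: fanning out many GETs in parallel with boto3.
# Bucket and key names are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    # Let the SDK retry 503 "Slow Down" responses while S3 scales up behind the scenes.
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}, max_pool_connections=64),
)


def fetch(key: str) -> bytes:
    return s3.get_object(Bucket="my-data-lake-bucket", Key=key)["Body"].read()


keys = [f"raw/2023/12/01/part-{i:05d}.parquet" for i in range(1000)]

# Many requests in flight at once: parallelism, not per-request speed,
# is the primary scaling lever against S3.
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(fetch, keys))
```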

So what does that mean for data lakes? Do use parallelism, do use many workers. But there are more scaling dimensions. Again, let's go back to basics and consider a PUT request, but in slow motion. When a request is made against S3, what really happens on the client? Well, a lot of things happen, but first of all the client needs to find a host to talk to. It takes the endpoint's DNS name, which is registered in the domain name service, queries it, gets back an IP address, and sends the request to that IP address - typically a load balancer. So here you have a customer interacting with a single load balancer, and all of the requests from this customer are going to keep going to that load balancer - whatever IP address the DNS name resolved to. Load balancers scale very well, but it is still a singular thing. So there is actually an opportunity to scale across multiple endpoints, and DNS is the way we can enable this. It's not a trivial thought when it first comes to mind, but it's actually a very simple idea.

So starting this year, S3 relies on what is called multi-value answer routing. The principle is really simple: for a single DNS name, multiple IP addresses are given in return. One of them is considered primary, but there are secondary addresses you can also send your requests to. So now you get a list of load balancers, and if you spread your requests across many of them, you can parallelize across endpoints.
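A quick way to see this from a client: resolving the regional endpoint can return multiple addresses, and a client that spreads connections across them parallelizes across endpoints (the CRT-based clients we'll mention later do this for you). A minimal sketch, with the endpoint name as an example:

```python
# Sketch: a regional S3 endpoint can resolve to multiple IP addresses
# (multi-value answers). Spreading connections across them spreads load
# over multiple load balancers instead of pinning to a single one.
import socket

host = "s3.us-east-1.amazonaws.com"  # regional endpoint, as an example
answers = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
ip_addresses = sorted({addr[4][0] for addr in answers})

print(ip_addresses)  # typically several addresses, not just one
```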

OK. So we looked at parallelizing operations and parallelizing across endpoints. All of this again speaks to scale, and all of it is enabled by the way Amazon S3 is built.

Now, we talked about two parallelization dimensions. What about individual operations? Data lakes typically work with pretty chunky objects - megabytes, tens to hundreds of megabytes, gigabytes. If you try to upload or download such an object in a single request, it's going to take some time, obviously, so performance-wise it's subpar. Also, if any error happens along the way, you have to retry the whole thing. So can you parallelize within a single operation? The answer is yes. Amazon S3 supports what is called multipart upload, which allows you to upload objects in chunks. And because storage is disaggregated and parallelized across all dimensions, the performance of an individual part upload is generally independent of the performance of other uploads. So the more parts you upload in parallel, the more throughput and the better performance you get. The flip side exists on downloads as well, through what we call ranged GETs: you don't have to fetch an entire object from Amazon S3, you can fetch a range of bytes that you specify. Exactly the same principle applies - if you chop up your object into chunks and download them in parallel, most of the time you can multiply your throughput and thus dramatically decrease the time it takes for your workload to run.
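A minimal sketch of both techniques with the Python SDK's transfer manager, which does multipart uploads and parallel ranged GETs for you. The thresholds and names here are illustrative, not tuned recommendations:

```python
# Minimal sketch: boto3's transfer manager performs multipart uploads and
# parallel ranged GETs once objects cross a size threshold.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=16 * 1024 * 1024,  # switch to multipart above 16 MiB
    multipart_chunksize=16 * 1024 * 1024,  # upload/download in 16 MiB parts
    max_concurrency=32,                    # parts in flight in parallel
)

# Upload: split into parts and sent in parallel (multipart upload).
s3.upload_file("local/big-file.parquet", "my-data-lake-bucket",
               "raw/big-file.parquet", Config=config)

# Download: fetched as parallel ranged GETs and reassembled locally.
s3.download_file("my-data-lake-bucket", "raw/big-file.parquet",
                 "local/big-file-copy.parquet", Config=config)

# Or an explicit ranged GET for just the bytes you need (e.g. a Parquet footer):
tail = s3.get_object(Bucket="my-data-lake-bucket",
                     Key="raw/big-file.parquet",
                     Range="bytes=-8192")["Body"].read()
```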

OK. So these are scaling dimensions that pertain to bytes and requests. There is another interesting scaling dimension in S3 which interacts with them in interesting ways, and that is the S3 namespace. Now, those of you who are familiar with HDFS may think of S3 as more of a file system. It isn't - it was designed as an object store. Sure, keys can have forward slashes, but for all intents and purposes S3 ignores those, and so it doesn't scale based on directories. It scales on what we call prefixes. Prefixes are similar to directories, but they are simply the beginning of your key. The entire key space of a region is chopped up into individual scaling blocks that we call prefixes, and each prefix comes with these dimensions: 5,500 TPS on reads, 3,500 TPS on writes. Obviously, this is not enough, because traffic constantly fluctuates across the region. So Amazon S3 has scaling machinery that runs in the background, consistently observing this TPS and scaling the back end based on it. New prefixes get created, which increases the overall TPS.

Let's consider an example to see what sort of behavior this can introduce, and maybe ways in which we can optimize. Those of you who have come to re:Invent before have probably heard this example - it's just so good we keep repeating it. Consider autonomous vehicles. You have a garage with a bunch of autonomous vehicles sitting in it. Every morning they drive out into the world, and in the evening they come back and upload all of the data they have collected. So you have a bucket, and some pattern of keys is used to upload the data. The organization here is fairly sensible, particularly for humans: you have some kind of root directory, you have a date, you have a car ID, and finally you have your data.

What you see is that a lot of these keys - highlighted in white - have exactly the same characters at the beginning, which means they're most likely going to end up in the same S3 namespace scaling prefix. And so your goodput, which is the TPS of requests that succeed, may look something like this: as your cars upload their data, the TPS grows, and then there's a plateau. That's probably because we've just hit the scaling limit of an individual S3 namespace prefix. As time goes by, we receive some pushback from S3 in the form of 503 errors, which basically tell you to retry - and most of the time you just do that. You have this plateau, and then it spikes back up again because S3 has scaled everything on the back end. This is what you would sometimes see, and sometimes you may be completely fine with it - for a lot of workloads it's completely reasonable.

However, there is sometimes a way to do better, particularly since data lakes commonly drive very spiky workloads where TPS shoots up like this. One of the options we recommend is to introduce what we call entropy. Entropy is basically a salting or hashing mechanism by which you insert seemingly random characters close to the start of the key - typically a hash of the rest of the key, or of something else, computed in a way that is deterministic. What this results in is that the common prefix of these keys shrinks to a relatively small number of characters, which means the keys are likely to end up in different scaling prefixes in S3. Instead of hitting one prefix with all of the TPS, you're hitting many at the same time, which means you can support much higher TPS. So this is another example of the very different scaling mechanics S3 has selected for its namespace.
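A minimal sketch of what that salting might look like in code, using the hypothetical vehicle key layout from the example:

```python
# Sketch of adding entropy ("salting") near the start of a key: a short,
# deterministic hash of the rest of the key spreads uploads across many
# namespace scaling prefixes. The key layout is the hypothetical AV example.
import hashlib


def salted_key(date: str, car_id: str, filename: str, width: int = 4) -> str:
    natural = f"{date}/{car_id}/{filename}"
    salt = hashlib.md5(natural.encode()).hexdigest()[:width]  # deterministic salt
    return f"telemetry/{salt}/{natural}"


print(salted_key("2023-11-27", "car-0042", "lidar-000123.bin"))
# -> telemetry/<4 hex chars>/2023-11-27/car-0042/lidar-000123.bin
```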

And let's discuss how we can actually apply these things when it comes to data lake object sizes. This is an interesting conversation, because there is a healthy tension between the scaling of the namespace and the scaling of byte uploads. We want to parallelize uploads as much as possible, which means higher TPS on your key space; or you can reduce the TPS on your key space, but then you have larger objects and less parallelism. So we recommend finding a number that works well for your specific workloads. The place to start is medium-chunky objects of around 10 megabytes, which typically results in a more optimal cost and performance profile. But you should definitely experiment based on your specific workload.

Object formats are very important for several reasons. We used to use a lot of formats like CSV and TSV, but columnar formats like ORC and, to a great extent, Parquet have become very, very popular. There is a very good reason for that: they are incredibly analytics-friendly, they are supported by a wide range of analytics products, and because of that, a lot of analytics products are actually optimized for them specifically. So when considering the format of your objects, selecting a well-adopted columnar format such as Parquet is probably the better choice.

Table format is something we haven't spoken about before. A table format is a concept that describes how your objects are organized and composed into tables. Traditionally, for the longest period of time, this came from HDFS itself: Hive used a hierarchy based on directories, which created a set of problems we're going to talk about a little later, and Huy and Ryan will cover how open table formats address this. So the selection of a table format is also very important for the performance of your data lake.

Now, the previous slide is where we would typically finish when we talk about performance.

But this year is special, and we're going to talk about a new offering: Amazon S3 Express One Zone. It's a new flavor of storage, if you will - a new storage class - and it focuses on request-intensive workloads. A lot of what we discussed before about general purpose S3 buckets assumes that throughput is typically what you care about. But there are workloads where a lot of requests go back and forth between the client and the server, and latency - time to first byte - matters tremendously.

And so S3 has gone through yet another cycle of breakdown and innovation, and we came up with this particular offering. It scales to millions of requests a minute, it has a different scaling model for the namespace we just discussed - which I'm gonna get to in a second - and it has a very stable performance profile.

Let me talk about the scaling model a little bit. Classic general purpose buckets, as we discussed, don't come with a predefined scaling TPS - in fact, a bucket is not a scaling unit in S3 general purpose. But directory buckets, which are the type of bucket we introduced with this new offering, actually come pre-scaled. That makes sense, because we focus on workloads with low latency and a lot of requests back and forth, and to achieve very low latency, we have to pre-scale these buckets.

So buckets in this system are actually the new scaling unit, and they come pre-scaled at around 100K TPS. This is another example of S3 innovating in a different dimension. It's not magic - we could get into the architecture of this, but that's not what this talk is about. The point I want to bring home is that, yet again, it really relies on this incredible scale to innovate.

In this particular example, not only did we create an entirely new namespace to support high TPS and low latency, and not only did we put it into a single AZ to reduce network hops, but we focused on everything we know how to focus on. For example, we even changed the authentication model to be session-based, because we know that authentication actually adds substantial latency to your requests.
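As a sketch of what using a directory bucket looks like from the Python SDK - the bucket name, AZ ID, and region are placeholders, and recent SDK versions handle the session-based authentication for directory buckets under the hood:

```python
# Sketch: creating and using an S3 Express One Zone directory bucket.
# Names and the AZ ID are placeholders; check the docs for the exact
# CreateBucketConfiguration shape supported by your SDK version.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

az_id = "use1-az4"                             # pick an AZ close to your compute
bucket = f"my-express-demo--{az_id}--x-s3"     # directory bucket naming convention

s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": az_id},
        "Bucket": {"Type": "Directory", "DataRedundancy": "SingleAvailabilityZone"},
    },
)

# Request-intensive access then looks like regular S3 calls, just lower latency.
s3.put_object(Bucket=bucket, Key="scratch/shuffle-0001", Body=b"intermediate data")
obj = s3.get_object(Bucket=bucket, Key="scratch/shuffle-0001")["Body"].read()
```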

And to give you an idea of the kind of benefit we see with S3 Express One Zone for analytics workloads, which are request-heavy - these are some really, really interesting numbers. There are a lot of sessions on this particular offering, it was mentioned in keynotes, there's a lot more coming, and if you're interested, please talk to us about it.

Now, this was an example of how we innovate inside Amazon S3, on the back end. We also outlined a lot of best practices - parallelize everything, use entropy in your key names, and so on. But we've been thinking about what to do to actually make it easier for customers to put these things into practice. Another area of innovation we've been engaged in is innovation on the client. This is not a new product, but it is a product that truly reached maturity this year.

It is the AWS Common Runtime, the CRT. You can think of it as the best practices, in code. Parts of it are written in native code for high performance, and it takes advantage of specific performance characteristics of the individual operating systems. It is now integrated into the AWS SDKs, and it's currently in the process of being integrated into a wide variety of data lake connectors such as S3A.
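As a sketch of what this looks like from Python - exact enablement depends on your SDK version, so treat this as an illustration rather than the definitive setup:

```python
# Sketch: CRT-accelerated transfers from the Python SDK. The CRT ships as an
# optional extra:
#
#   pip install "boto3[crt]"
#
# Depending on your boto3/s3transfer version, the transfer manager can hand
# uploads and downloads to the CRT automatically once awscrt is installed;
# check your SDK's documentation for how it is enabled in your version.
import boto3

s3 = boto3.client("s3")

# These high-level transfer calls are where the CRT's native-code multipart
# and parallel ranged-GET logic applies.
s3.upload_file("local/big-file.parquet", "my-data-lake-bucket", "raw/big-file.parquet")
s3.download_file("my-data-lake-bucket", "raw/big-file.parquet", "/tmp/big-file.parquet")
```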

We have covered a lot of ground talking about S3's scaling dimensions and how they apply to performance. If there is one lesson to take away, it's that our architecture thrives at scale. But performance is not the only aspect for which this is true. Continuing with this theme, Huy will talk more about other S3 capabilities as they pertain to cost and governance. Thank you.

Hi, everyone. I'm Huy from the S3 product team. And let's now talk about scale. Coming back to our theme of building and operating a data lake at scale: as your data lake scales, you want to put some cost optimization techniques, practices, and features in place, so that your AWS bill doesn't scale linearly with your data lake. In this section, we're going to talk about some of those features and practices that you can put into place.

The first thing is storage classes. This is the bread and butter of S3, and over the years we've built a large number of storage classes. The idea behind them is that there are different kinds of workloads - analytics, backups, media and entertainment - and each of those workloads has its own specific access pattern. The idea of these storage classes is: how can we build a storage tier that, on one hand, optimizes cost for you, but on the other hand doesn't really sacrifice performance?

With that in mind, today there is a large number of storage classes you can choose from, depending on the specific access pattern and the specific workloads you have. On one end of the spectrum - and this is very typical for a lot of analytics workloads where the data is frequently accessed - you can use S3 Standard. On the other end of the spectrum, for long-term archival data, you have S3 Glacier Deep Archive, and there's everything in between.

And as Oleg mentioned, we just launched S3 Express One Zone as well, for those even lower-latency, higher-TPS workloads. But one thing to know about access patterns is that they're typically not static. This is especially true for analytics and data lake workloads, where fresher, newly ingested data is usually more frequently accessed, but as time goes on, that data tends to get a little bit colder.

What you want to put in place is some kind of feature that helps you transition data from a hotter tier like S3 Standard to a colder tier like Glacier as the data becomes colder. And for that, we built S3 Lifecycle.

Just out of curiosity, how many of you have heard of S3 Lifecycle or are currently using it? OK, great. So I don't have to spend too much time on this.

Lifecycle is an S3 feature that helps you transition storage classes at scale. The way it works is that you configure rules specifying the criteria that need to be fulfilled in order for the transition to happen - time-based is one - and then S3 will manage that transition for you.

And as with everything, Lifecycle is built with scale in mind. A lot of customers use Lifecycle to delete data in large chunks, and you can use Lifecycle to transition or delete data for billions of objects in a petabyte-scale data lake.
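A minimal sketch of a Lifecycle rule - the bucket name, prefix, and day counts are placeholders:

```python
# Minimal sketch of an S3 Lifecycle rule: transition objects under a prefix to
# a colder tier after 90 days and expire them after 3 years.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-telemetry",
                "Filter": {"Prefix": "raw/telemetry/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1095},
            }
        ]
    },
)
```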

Now, with storage classes and Lifecycle, what that also means for you is that you need to be pretty familiar with your access pattern, and also pretty familiar with the S3 offerings, to figure out what your access pattern is and what the best S3 storage class is for it. That can be a lot of work on your end - doing all the research and all the access pattern monitoring.

You might just want sort of a magic wand - an S3 storage class or feature that takes care of the storage class transitions and access pattern monitoring for you. And for that, we have S3 Intelligent-Tiering. Also curious: how many of you have heard of Intelligent-Tiering or are currently using it? OK, a little less than Lifecycle.

Intelligent-Tiering is a - well, not new - storage class that we launched a while ago. It basically monitors the access pattern for you and transitions the data between storage tiers without any intervention or work from you.

What that means is that it is really just an S3- and AWS-managed way to get automated cost savings. And we're very excited to say that, since launch, we've saved customers over $2 billion with Intelligent-Tiering. Once again, Intelligent-Tiering is a managed, automated way to save cost without really a lot of work from you.
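Two sketches of getting data into Intelligent-Tiering - either write new objects straight into it, or transition existing data with a Lifecycle rule (bucket and key names are placeholders):

```python
# Sketch: two ways to get data into S3 Intelligent-Tiering.
import boto3

s3 = boto3.client("s3")

# 1) Write new objects directly into Intelligent-Tiering.
s3.upload_file(
    "part-00000.parquet",
    "my-data-lake-bucket",
    "curated/events/part-00000.parquet",
    ExtraArgs={"StorageClass": "INTELLIGENT_TIERING"},
)

# 2) Or move existing data over with a Lifecycle transition rule.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-everything-to-intelligent-tiering",
                "Filter": {},  # empty filter = whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)
```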

Now, another dimension, setting aside storage classes and Lifecycle, is visibility. We see this a lot with our customers: if you don't have visibility into how your data is being used, it's very hard to come up with the cost optimization actions you need to take.

We're all talking about data lakes here, and one typical path we see a lot of customers take is that when you first start building a data lake, it's fairly small scale - maybe a couple dozen data sets, a couple of buckets, a few dozen users. But then you can quickly scale to, let's say, thousands of data sets, hundreds of users, dozens of AWS accounts.

When you get to that scale, you can quickly lose sight of how your data is being used. So how can you get that visibility and figure out how your data is being used? The answer is Storage Lens. I'm going to ask a lot of questions, because I think we all need a little stretching - there's been too much sitting this week. How many of you have heard of Storage Lens? OK, perfect.

Storage Lens is a feature we launched back in 2020, and the idea is to give you metrics and visibility into how your data is being used.

One nice thing about Storage Lens is that there's a free tier, so you can go back and start turning it on - a lot of the metrics are free. What you can use Storage Lens for is to ask the kinds of questions that help you cost optimize.

Questions like: which of my buckets or prefixes are being frequently used? Or: am I using the right storage class for this particular data, with this particular access pattern? With these kinds of metrics, what Storage Lens does for you is start a flywheel of observe, analyze, and optimize.
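A minimal sketch of enabling an account-level Storage Lens configuration through the s3control API (S3 also gives you a default dashboard out of the box); the account ID and configuration ID are placeholders:

```python
# Minimal sketch: an account-level Storage Lens configuration with per-bucket
# metrics. Account ID and config ID are placeholders.
import boto3

s3control = boto3.client("s3control")

s3control.put_storage_lens_configuration(
    ConfigId="data-lake-visibility",
    AccountId="111122223333",
    StorageLensConfiguration={
        "Id": "data-lake-visibility",
        "AccountLevel": {"BucketLevel": {}},  # account-wide, per-bucket metrics
        "IsEnabled": True,
    },
)
```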

So let's say you observe that there's a prefix or bucket that's not really frequently used. You start analyzing why that's the case and what kind of workload is causing it. At that point you might determine: OK, this data is for long-term archival storage, or I might not need it anymore. Then you can start optimizing using some of the features we talked about.

For example, Lifecycle - to either transition a large chunk of data into a colder tier for archival reasons, or to just delete the data altogether. So, a couple of takeaways from this section on cost. One is Intelligent-Tiering for automated savings at scale.

And I just want to emphasize that, for many of you, this is probably the most important takeaway: for most workloads, if you just put your data into Intelligent-Tiering, you just start saving. It's really not a whole lot of work for you, and it can save a lot of money.

The second thing is Lifecycle: if you want to decide which storage class and storage tier your data goes to, and you want to have control over that, use Lifecycle to do those transitions. And last but not least, use Storage Lens - especially since there's a free tier - to start monitoring how your data is being used, and use those insights to drive cost optimization actions.

So let's now talk about security and governance, which is becoming an increasingly hot topic: making sure the right people have the right access to the right data. And it comes back to our theme of scale: what happens as your data lake scales?

We've seen many customers run into this: on one hand, you have a proliferating number of data sets you want to control access to, and on the other hand, a large and growing number of users, applications, and teams that need access to the data.

And that means that data governance - making sure the right people have the right access - can become a little bit hairy. Specifically, when we talk to customers about the kinds of S3 data governance challenges they run into, they cite a few problems.

The first, as mentioned, is a proliferating number of data sets and users that need access to data. What customers want is a scalable way to enforce granular access. You don't want to say this user or application has access to my entire bucket; you want to say this user or application can only access these specific data sets - and you want to do that at scale.

The second is granting permissions directly to identities in an external corporate directory. What does that mean? Traditionally, S3 was used for a lot of media and entertainment, backup, and those kinds of workloads, which means the consumers pulling data from S3 were, a lot of the time, applications. But increasingly, for analytics and data lake workloads, it's not just applications anymore - it's humans. Data scientists and data engineers might want to pull data from S3 into a notebook environment, say, run some analysis, and start training a machine learning model. For that, customers want a way to grant access directly to those humans, so to speak.

And last but not least is auditing. This is especially important for customers in highly regulated industries like government, financial services, and healthcare, where you want a very detailed view of which user accessed what data, at what time, using what application - for a high volume of access.

Now, with all those challenges in mind: the typical way customers manage access to S3 today is using IAM - IAM and bucket policies. And going through those three bullet points, IAM can be a little bit limiting when it comes to what customers want for data lake access specifically.

First, for that proliferating number of users and applications, you want some way to enforce access at scale. Bucket policies have a 20-kilobyte size limit, so it's very hard to scale access enforcement beyond that. Second, for human identities, IAM by nature only understands IAM principals - IAM users and roles. That means customers have to maintain some kind of mapping between their directory users and groups - an AD group, an Okta user or group - and IAM roles, and then grant those roles access to S3. That's just more work to provision the roles and maintain the mapping.

And the last part is auditing. In a very common pattern, Jane and David are in the same AD group and share a role to access S3. You can't really use CloudTrail to audit which end user accessed what data - whether it was Jane or David; you can only see that this role accessed the data. That means customers from highly regulated industries, like financial services and government, have to build custom solutions for end-to-end auditing to piece together end user access.

So with those data lake access challenges in mind, and the places where IAM has gaps, we built and recently launched S3 Access Grants to help customers scale S3 data governance. Very curious: how many of you have heard of Access Grants? OK, sweet. We just launched this feature on Sunday, so it's very exciting. What it allows you to do is grant S3 access directly to users and groups in your corporate directory - Azure AD, now renamed Entra ID, Okta, Ping, OneLogin - in addition to IAM principals. And you can define access in a very intuitive grant style that's very common in analytics - similar to how you define permissions in a relational database - and in a very highly scalable way.

The grants look like the examples on the left. In the first example, you're granting different access levels on different S3 prefixes to an external AD group. In the last example, you're granting different access levels on different prefixes to an IAM role. In the world of analytics and data lakes, you can imagine the first example as granting access to a data scientist group, where the data scientists are tasked with pulling data from S3 and training the next generation of machine learning models for you. And the last example you can imagine as a Spark job that has read access to a staging prefix to ingest raw logs, and write access to a clean prefix to write out the transformed tables.

What Access Grants does behind the scenes - and we'll cover this in a bit - is vend just-in-time, least-privilege temporary credentials based on the grants. That further enhances your security posture, because all the credentials are short-term by nature; you won't have someone with a long-term secret key sitting on their laptop that, as a security person, you worry is going to get compromised. And last but not least, we talked about auditing: Access Grants integrates with CloudTrail, so you can use CloudTrail to audit end user access - whether it was Jane or David who accessed the data, even though they're in the same AD group, at what time, and using what application.

Let's walk through a concrete use case of how you can put Access Grants into action, and walk through the customer experience. Here is a super-high-level overview of a data lake scenario: you have a data lake that stores a bunch of objects - could be PDF, JSON, MP4 - and you have users that want access. They could be technical users, data scientists and data engineers, or business users who just want an intuitive interface to interact with.

How do you use Access Grants for fine-grained access? You have your bucket and you have your users. The first step is to create an Access Grants instance. This instance is completely managed by AWS; conceptually, it's just a logical grouping of all the permissions and grants you register. The second step is to register grants, specifying who will have access to what resources. Then, at time of access - let's say these are data scientists who want to pull something from S3 into a notebook - they request access via the GetDataAccess API, calling the Access Grants instance endpoint. We look up all the grants and figure out whether the request should be authorized, based on the user's identity and the grants you have. If the request is authorized, we vend back temporary credentials - an STS token, like what you'd get back today from calling AssumeRole. Using those temporary credentials, the client or user can access S3 and pull the data into, for example, a notebook environment.
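Here is a sketch of that flow using the s3control API. The account ID, bucket, role ARN, and directory group ID are hypothetical placeholders, and you should check the service documentation for the exact parameter shapes in your SDK version:

```python
# Sketch of the Access Grants flow: create an instance, register a grant,
# then exchange identity for short-lived credentials at access time.
import boto3

account_id = "111122223333"
s3control = boto3.client("s3control")

# 1) Create the account-level Access Grants instance.
s3control.create_access_grants_instance(AccountId=account_id)

# 2) Register a location (the role is what Access Grants uses to vend
#    credentials), then register a grant: this directory group may READ
#    under the given prefix.
location = s3control.create_access_grants_location(
    AccountId=account_id,
    LocationScope="s3://my-data-lake-bucket/curated/*",
    IAMRoleArn="arn:aws:iam::111122223333:role/access-grants-location-role",
)
s3control.create_access_grant(
    AccountId=account_id,
    AccessGrantsLocationId=location["AccessGrantsLocationId"],
    Grantee={"GranteeType": "DIRECTORY_GROUP", "GranteeIdentifier": "data-scientists-group-id"},
    Permission="READ",
)

# 3) At access time, the caller gets just-in-time temporary credentials.
resp = s3control.get_data_access(
    AccountId=account_id,
    Target="s3://my-data-lake-bucket/curated/*",
    Permission="READ",
)
creds = resp["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```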

Now, one thing we want to note is how Access Grants and Lake Formation can work together. I'm going to ask one more question: how many of you have heard of Lake Formation or are using Lake Formation? OK, sweet. Lake Formation is another AWS service that helps you govern access to structured, cataloged data - think Glue tables, for example. Access Grants can work in conjunction with Lake Formation, and with the two features going hand in hand, you can have a comprehensive data governance solution.

Specifically, you use Lake Formation to control access to structured, cataloged, tabular data - and you can control access at the row, column, and cell level. And you use Access Grants to grant access to unstructured, uncataloged, non-tabular data - for example, images and audio for machine learning models, or raw, newly ingested logs that haven't been transformed into structured data yet. Those two data governance products together help you govern all kinds of data, whether it's structured, cataloged data or unstructured, uncataloged data.

The last thing I want to mention is that Access Grants is part of a bigger AWS story. We talked a lot about directory identities - Active Directory, Okta. There's a big investment on the AWS side to make multiple AWS services, including S3, able to authenticate and authorize against identities in your external corporate directory without going through IAM. The picture here involves AWS IAM Identity Center, formerly known as AWS SSO. It's going to be the single entry point for onboarding your workforce identities: you onboard your Active Directory, Okta, or Ping to Identity Center once, and thereafter you can define S3 permissions, Redshift permissions, Lake Formation permissions against those external directory identities.

So to recap this section: IAM works for most S3 use cases - a lot of the time, static application access patterns. But for data lake access patterns, if you're running into some of the challenges we talked about, whether it's IAM scalability or granting access to human identities, look into Access Grants as a complement to IAM for those specific data lake patterns. And last but not least, you can leverage Access Grants together with Lake Formation to govern both structured and unstructured data.

At this point, we've talked a lot about the fundamental pillars of how you want to think about building a data lake: performance, cost, security, and data governance. Increasingly, on top of those fundamental pillars, we're seeing the rise of transactional data lakes. What does that mean? Going back to the scaling story, data continues to explode, and the number of concurrent users and applications continues to increase. We're seeing demand from customers who want to combine traditional warehouse-like capabilities - high-performance ACID transactions - with the cost effectiveness and elasticity of S3-centric data lakes. With that customer demand in mind, we're seeing the rise of open table formats like Iceberg and Delta Lake - and Ryan will dive into Iceberg very shortly - which are basically becoming the basis of the new lakehouse architecture. These table formats bring a lot of exciting capabilities that are traditionally associated with warehouses - isolation, high-performance analytics, time travel - to S3-centric data lakes. Without further ado, I'll pass it to Ryan, who can do a much better job than me talking about Iceberg. Thank you.

Hi, everyone. It worked - OK, good. I'm Ryan Blue. I am the co-creator, along with my Tabular co-founder Dan Weeks, of the Apache Iceberg project. Today I'd like to talk a little bit more from the database or usage perspective than what these guys have covered so far - the S3 and object store perspective of data lakes in the cloud, built on top of S3.

So the story of Apache Iceberg actually starts about six years ago, which is crazy - I can't believe we started the project in 2017, it was so long ago. It started at Netflix, where Dan and I worked at the time. We were trying to build a cohesive data platform, and one of our central tenets was the separation of compute and storage, specifically built on S3. What we were using at the time was a Hadoop-based infrastructure, so we were running a pretty gigantic YARN cluster, and S3 was the source of truth for all of the data.

We had a ton of different projects, a lot of different infrastructure that our team needed to pull together. We had EMR, we had Spark in the mix, Presto - which was later renamed Trino - Amazon Redshift; at the time we were exploring tools like Snowflake; we had streaming through Flink; and we had our own custom pieces, like a custom Hive metastore that allowed us to scale beyond the limits of a normal Hive metastore. And we had a lot of problems with this. It was like a puzzle that we just could not figure out how to fit together - nothing really worked well.

And we had a ton of problems. After we sat down and looked at them, they fell into three different buckets.

Number one was cost and performance at Netflix scale. We were trying to deliver all of this data and get value out of it, but we had, I think at the time, tens to hundreds of petabytes. How do we manage that and effectively and efficiently access all of that data? It's awesome that we can store it, but if you can't filter through it and select just what you need for your use case, it's effectively dead and you're just paying for nothing.

The second thing was correctness. Hadoop systems were built with this very simple idea: whatever happens to be in a directory listing, that's what's in your table. The problem with that is if you need to list multiple directories, or make changes to multiple directories at the same time, you can't do that atomically. There is no transactional mechanism in a normal file system, let alone S3. And it's even worse if you're trying to view S3 as a file system, because it's not built that way - that's not how it scales.

And then the third bucket was, I think, made worse by the cost and performance and correctness issues: the ability of our workforce to actually take advantage of data, to be productive as data engineers, to use the data effectively. That was very, very hard. If you're waiting way too long for data to arrive, or if you have correctness problems and don't trust that while someone else was writing to a table you were actually able to read it correctly and trust your answer, those of course are productivity issues. But I'm also talking about things like adding a column and realizing it duplicates a column that existed two years ago. Hey, it turns out we had profile_id in this table two years ago, we just resurrected data from two years ago, it filtered into our analysis, and that's why we had 2 million profiles even though it was a brand new feature released five minutes ago. That sort of thing is really disruptive.

What we realized was that at the center of all of these was the Hive table format, and how simplistic it was to try to keep Parquet files in a directory structure and call that a table. And so we realized we needed to basically steal a whole bunch of ideas from the database world and bring them to the Hadoop world. That's how we got to Iceberg. We asked ourselves a very simple question: what if we just stole a whole bunch of great ideas from the people who have been doing this for 30 years? Turns out that was a very powerful idea.

So what we did was we created Iceberg. Now, it's impossible to think about Iceberg right now from the lens of where we were six years ago, so I'm going to give you the perspective from today, looking backwards, on what we built. I think there are two things that really stand out about Iceberg.

Number one, it's an open standard for tables. It is something that we wanted to be built into Redshift - and I believe Redshift just announced that their support for Iceberg tables is ready to go; that announcement, I believe, came out today. The support across the AWS ecosystem is really amazing. We wanted to build a standard and a format that could do that, and to do that, we needed to open source it; we needed to make sure it was something that a lot of different companies could standardize on and really invest their time in. So that was one of the core things we did.

The second thing we did was, like I said, steal a whole bunch of great ideas from database people. We said everything is going to have SQL behavior. SQL is a declarative language, which allows the infrastructure itself to make decisions and do the right thing: you tell it how you want something to be, and it figures out the best way to do that - and it can improve on that next week, it can figure out the best way given the current state of a table, or we can release a new version of the database and it gets even better and even faster. So we wanted SQL behavior, and not just because it's declarative and can get faster, but because we needed the same guarantees for people to be productive. No unpleasant surprises like zombie data columns, no problems when someone else writes to a table while you're reading it - we needed to fix those issues. We built it for data engineers to be productive, and added things like MERGE, transactional data lake queries, time travel, and hidden partitioning - so you don't have to retrain analysts to query the table in a particular way in order to take advantage of the work you've done to prepare the data for them.
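As an illustration of those behaviors, here is a short Spark SQL sketch against a hypothetical Iceberg catalog named "lake" (it assumes a Spark session already configured with the Iceberg runtime and SQL extensions, and all table names are placeholders):

```python
# Sketch of hidden partitioning, MERGE, and time travel on an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# Hidden partitioning: analysts filter on event_time; Iceberg handles the layout.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        event_id   BIGINT,
        event_time TIMESTAMP,
        payload    STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")

# Transactional MERGE for data engineers (upserts without hand-rewriting files).
spark.sql("""
    MERGE INTO lake.db.events t
    USING lake.db.events_updates s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the table as of an earlier point in time.
spark.sql(
    "SELECT count(*) FROM lake.db.events TIMESTAMP AS OF '2023-11-01 00:00:00'"
).show()
```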

Another thing I want to highlight, since we're all S3 nerds in this room, is that we built a high-performance design specifically for S3. Our data lake was on S3, and we wanted to take advantage of the best of S3 and not do things that just don't fit an object store. Apache Iceberg is actually the only open table format that was specifically designed to live in an object store. It was never designed for HDFS, although HDFS and the Hadoop ecosystem benefit pretty substantially by reducing the load on the NameNode.

So first of all, we didn't want to use any list operations that hit the index and tax it unnecessarily. It's really nice to be able to list and see what's there, but it's not an operation you actually need in order to access data correctly. We store everything as a tree structure with hard references, so we never need to query the state of S3 to find out the state of a table. Number two, we wanted to remove the rename or move semantics that were needed for commits. In an object store, a rename is actually a backend copy and a delete, so you're really wasting a lot of cycles and time by not just using unique names - by reusing the same names, or having strategies that rename from one place to another in order to commit.

We also knew that S3 is super smart and does things like remembering if you've asked whether an object exists and the answer was no. Well, that's not great: if you rely on objects appearing in a listing and you ask ahead of time whether something exists, you might get the cached answer back. So we took all of these things into account and designed this tree structure that never needs to use those features of S3 that make it very difficult to achieve high-performance workloads.

We also did something which I think is really nice - we didn't realize it until later. By having this tree structure, we added the ability to add entropy at the start of our prefixes, at the start of a table. The old model was to keep everything together for listing purposes. Well, if you don't need to list, then you can put entropy way further up in the key space, shard effectively across those back-end prefixes, and really get excellent performance, like Oleg was talking about earlier.
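In Iceberg this layout is exposed as table properties; a short sketch, continuing the hypothetical table from earlier (property names per the Iceberg documentation):

```python
# Sketch: enabling Iceberg's object-store layout so data file keys get a hash
# component near the front, spreading them across S3 scaling prefixes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

spark.sql("""
    ALTER TABLE lake.db.events SET TBLPROPERTIES (
        'write.object-storage.enabled' = 'true',
        'write.data.path' = 's3://my-data-lake-bucket/lake-data'
    )
""")
```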

The last thing I want to mention is that we introduced a new abstraction, FileIO, which is just a more seamless abstraction. Rather than trying to treat S3 as a file system - doing extra checks like "does anything under this prefix exist, because this may actually be a directory" - we got rid of all of that and just treat it as an object store.

Now, what ended up happening with Iceberg was actually a bit of a happy accident. I mentioned that we wanted to fix ACID transactions and have database semantics on top of tables in our object store. What we actually ended up doing was not just making it so that two Spark processes can use the same table at the same time - we ended up accidentally creating what is now thought of as more universal analytic storage.

Remember, I was talking about having EMR, Trino, Presto, Flink, and other things in our architecture at the time, and nothing really fit together, because if Flink was changing a table, Spark was going to get incorrect results. Now that we've fixed that problem, we're tying all of those different systems together. You can effectively have this broad data platform with all of the tools you want for different use cases and needs - Spark, Trino - and all of these are available through AWS products: EMR, Athena, MSK, Redshift, Kinesis.

You can actually use these in a unified data architecture. It's really, really powerful, and that's what I'm very excited about.

We had some amazing use cases at Netflix that, unfortunately due to time, I'm going to skip through really quickly. The long story short is that we had to go find use cases in order to build trust in a new storage system - we could not have done it otherwise. Anyone using a Hive table and fairly happy with it said: why do I need to be the first mover, the early adopter? So we went to things like our telemetry data. These are second-level measurements of every metric you can imagine across the entire Netflix platform. This was a massive data set, and it used to take us more than a day just to try and run a query, and then it would inevitably fail.

This case actually took 10 minutes just doing the directory listing to find out what files we needed to read for a query, and then it would go into this 24-hour period before failing. In Iceberg, we got that down to 13 minutes of run time and 10 seconds of planning time. Then we added additional filtering and additional metadata in the format to skip through files more effectively and do file-level pruning, and we got the whole thing down to 42 seconds. This is looking at more than a month of one metric stored in a very, very vast telemetry table.

We also had an opportunity to replace an Elasticsearch cluster where we were always querying by the same key. We used Iceberg metadata to provide an index and stored all of that in S3 - no need for online or runtime systems - and we were able to use a dynamically scaling Presto cluster and basically reduce the cost by several orders of magnitude. It was really, really cool.

Now, in my last minute, I want to be a little more forward-looking. We've talked about how Iceberg has unlocked this universal analytic storage. It's actually just the beginning of what you can do with S3 as this ubiquitous layer for storage. Iceberg provides a table-level interface to it that is really powerful. What you're seeing is the emergence of, essentially, S3 with a table-level interface on top of it provided by Apache Iceberg - and with the ubiquitous availability of engines that connect to Iceberg and can use that data, we're actually building a different type of data architecture: one where your data is all in one place, all in S3. It is a table space as large as you want to make it, and everything goes to S3 - one central location for that data.

So we're done with the days of loading your data into a database and then paying that database to get access to your data. Now you grant that database access through S3, and we're building a more modular architecture where all of those engines play in the exact same central data store.

Now, all of this has been built on at least one open standard - Iceberg tables - but it's going to keep changing. I think that to continue on this modular path, we're going to see more governance and security features, things like OAuth for REST catalogs, so you can interact with metadata as well. We need a lot more open standards, but it's a very exciting world where all of our databases in the analytic space are becoming more modular and interchangeable. You can use Spark to produce data that you consume in Redshift - those sorts of use cases never quite worked right in the past - and it's really exciting to see where we're heading.

So, thank you. That is Apache Iceberg, and I appreciate everyone coming and listening to us.
